1. Simplifying Hamiltonian and Lagrangian Neural Networks via Explicit Constraints
Marc Finzi, Ke Alexander Wang, Andrew Gordon Wilson
- retweets: 1482, favorites: 235 (10/28/2020 10:49:15)
- links: abs | pdf
- cs.LG | math.DS | physics.comp-ph | physics.data-an | stat.ML
Reasoning about the physical world requires models that are endowed with the right inductive biases to learn the underlying dynamics. Recent works improve generalization for predicting trajectories by learning the Hamiltonian or Lagrangian of a system rather than the differential equations directly. While these methods encode the constraints of the systems using generalized coordinates, we show that embedding the system into Cartesian coordinates and enforcing the constraints explicitly with Lagrange multipliers dramatically simplifies the learning problem. We introduce a series of challenging chaotic and extended-body systems, including systems with N-pendulums, spring coupling, magnetic fields, rigid rotors, and gyroscopes, to push the limits of current approaches. Our experiments show that Cartesian coordinates with explicit constraints lead to a 100x improvement in accuracy and data efficiency.
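To make "enforcing the constraints explicitly with Lagrange multipliers" concrete, here is a minimal worked sketch for a single pendulum in Cartesian coordinates. It uses standard constrained mechanics. The symbols (m, \ell, \phi, \lambda, V) are notation assumed for illustration and are not taken from the paper.

```latex
% Illustrative sketch only: a single pendulum of mass m and length \ell,
% Cartesian position x \in \mathbb{R}^2, potential V(x). Notation is assumed.
\begin{align}
  \phi(x) &= \tfrac{1}{2}\left(x^\top x - \ell^2\right) = 0
    && \text{(holonomic constraint)} \\
  m\,\ddot{x} &= -\nabla V(x) + \lambda\,\nabla\phi(x) = -\nabla V(x) + \lambda\,x
    && \text{(constrained equations of motion)} \\
  0 &= \ddot{\phi} = x^\top \ddot{x} + \dot{x}^\top \dot{x}
    && \text{(constraint differentiated twice)} \\
  \lambda &= \frac{x^\top \nabla V(x) - m\,\dot{x}^\top \dot{x}}{\ell^2}
    && \text{(multiplier obtained by substitution)}
\end{align}
```

In this form only the scalar potential V(x) would need to be learned, while the multiplier term supplies the constraint force (here, the rod tension).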
We can greatly simplify Hamiltonian and Lagrangian neural nets by working in Cartesian coordinates with explicit constraints, leading to dramatic performance improvements! Our #NeurIPS2020 paper: https://t.co/G3geBSxlSU
— Andrew Gordon Wilson (@andrewgwils) October 27, 2020
with @m_finzi, @KAlexanderWang. 1/5 pic.twitter.com/5VG6XX9wUo
2. Attention is All You Need in Speech Separation
Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, Jianyuan Zhong
- retweets: 1042, favorites: 162 (10/28/2020 10:49:16)
- links: abs | pdf
- eess.AS | cs.LG | cs.SD | eess.SP
Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose the 'SepFormer', a novel RNN-free Transformer-based neural network for speech separation. The SepFormer learns short- and long-term dependencies with a multi-scale approach that employs transformers. The proposed model matches or overtakes the state-of-the-art (SOTA) performance on the standard WSJ0-2/3mix datasets. Specifically, it achieves an SI-SNRi of 20.2 dB on WSJ0-2mix, matching the SOTA, and an SI-SNRi of 17.6 dB on WSJ0-3mix, a SOTA result. The SepFormer inherits the parallelization advantages of Transformers and achieves competitive performance even when downsampling the encoded representation by a factor of 8. It is thus significantly faster and less memory-demanding than the latest RNN-based systems.
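The SI-SNRi figures quoted above are improvements in scale-invariant signal-to-noise ratio. The sketch below shows how that metric is commonly computed, using plain Python lists. The function names `si_snr` and `si_snr_improvement` are hypothetical helpers written for this illustration, not part of the SepFormer code.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals (lists of floats)."""
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    # Remove the mean of each signal, as is conventional for this metric.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [v - mean_e for v in estimate]
    t = [v - mean_t for v in target]
    # Project the estimate onto the target to obtain the scaled target component.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t) + eps
    scale = dot / target_energy
    s_target = [scale * b for b in t]
    # Whatever is left after the projection counts as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(n * n for n in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant.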
Happy to announce our SepFormer! It is a novel RNN-free #Transformers for #speech separation. It matches or overtakes the SOTA on the WSJ0-2/3mix. It is faster and less memory-demanding than #RNN systems. https://t.co/QXXXaJenMC #DeepLearning #SpeechBrain #AI @MILAMontreal pic.twitter.com/JOHXbanLhb
— Mirco Ravanelli (@mirco_ravanelli) October 27, 2020
Attention is All You Need in Speech Separation
— AK (@ak92501) October 27, 2020
pdf: https://t.co/IduWOIECdp
abs: https://t.co/9USbAXn3US pic.twitter.com/WLIM8ghg1X
3. Exemplary Natural Images Explain CNN Activations Better than Feature Visualizations
Judy Borowski, Roland S. Zimmermann, Judith Schepers, Robert Geirhos, Thomas S. A. Wallis, Matthias Bethge, Wieland Brendel
Feature visualizations such as synthetic maximally activating images are a widely used explanation method to better understand the information processing of convolutional neural networks (CNNs). At the same time, there are concerns that these visualizations might not accurately represent CNNs’ inner workings. Here, we measure how much extremely activating images help humans to predict CNN activations. Using a well-controlled psychophysical paradigm, we compare the informativeness of synthetic images (Olah et al., 2017) with a simple baseline visualization, namely exemplary natural images that also strongly activate a specific feature map. Given either synthetic or natural reference images, human participants choose which of two query images leads to strong positive activation. The experiment is designed to maximize participants’ performance, and is the first to probe intermediate instead of final layer representations. We find that synthetic images indeed provide helpful information about feature map activations (82% accuracy; chance would be 50%). However, natural images, originally intended to be a baseline, outperform synthetic images by a wide margin (92% accuracy). Additionally, participants are faster and more confident for natural images, whereas subjective impressions about the interpretability of feature visualization are mixed. The higher informativeness of natural images holds across most layers, for both expert and lay participants as well as for hand- and randomly-picked feature visualizations. Even if only a single reference image is given, synthetic images provide less information than natural images (65% vs. 73%). In summary, popular synthetic images from feature visualizations are significantly less informative for assessing CNN activations than natural images. We argue that future visualization methods should improve over this simple baseline.
We tested whether feature visualizations (from @ch402 et al.) really help humans understand CNNs. Our surprising finding: while they do help, they are outperformed by a very simple baseline - natural reference images from the dataset!
— Bethge Lab (@bethgelab) October 27, 2020
Paper @ https://t.co/xxFQtNXltX (1/N) pic.twitter.com/GMQEVMBK2O
4. OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning
Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, Ofir Nachum
Reinforcement learning (RL) has achieved impressive performance in a variety of online settings in which an agent’s ability to query the environment for transitions and rewards is effectively unlimited. However, in many practical applications, the situation is reversed: an agent may have access to large amounts of undirected offline experience data, while access to the online environment is severely limited. In this work, we focus on this offline setting. Our main insight is that, when presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. Primitives extracted in this way serve two purposes: they delineate the behaviors that are supported by the data from those that are not, making them useful for avoiding distributional shift in offline RL; and they provide a degree of temporal abstraction, which reduces the effective horizon yielding better learning in theory, and improved offline RL in practice. In addition to benefiting offline policy optimization, we show that performing offline primitive learning in this way can also be leveraged for improving few-shot imitation learning as well as exploration and transfer in online RL on a variety of benchmark domains. Visualizations are available at https://sites.google.com/view/opal-iclr
Hierarchy + offline RL are a natural combination. With OPAL, we extract primitives from offline data, then use them as actions for conservative Q-learning, leading to substantially better results!
— Sergey Levine (@svlevine) October 27, 2020
w/ Anurag Ajay, Aviral Kumar, @pulkitology, @ofirnachum https://t.co/ip39RyLfz7 pic.twitter.com/7CK890OGJk
The future of offline RL is unsupervised learning on large datasets. In this work (https://t.co/lQBG9qEsyB) we show that using a simple autoencoding objective on undirected experience dataset can dramatically improve perf for offline/online RL, imitation/transfer learning etc. 1/ pic.twitter.com/OTIvqioKVV
— Ofir Nachum (@ofirnachum) October 27, 2020
5. Lightning-Fast Gravitational Wave Parameter Inference through Neural Amortization
Arnaud Delaunoy, Antoine Wehenkel, Tanja Hinderer, Samaya Nissanke, Christoph Weniger, Andrew R. Williamson, Gilles Louppe
- retweets: 289, favorites: 83 (10/28/2020 10:49:17)
- links: abs | pdf
- astro-ph.IM | cs.LG | gr-qc
Gravitational waves from compact binaries measured by the LIGO and Virgo detectors are routinely analyzed using Markov Chain Monte Carlo sampling algorithms. Because the evaluation of the likelihood function requires evaluating millions of waveform models that link between signal shapes and the source parameters, running Markov chains until convergence is typically expensive and requires days of computation. In this extended abstract, we provide a proof of concept that demonstrates how the latest advances in neural simulation-based inference can speed up the inference time by up to three orders of magnitude — from days to minutes — without impairing the performance. Our approach is based on a convolutional neural network modeling the likelihood-to-evidence ratio and entirely amortizes the computation of the posterior. We find that our model correctly estimates credible intervals for the parameters of simulated gravitational waves.
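The phrase "modeling the likelihood-to-evidence ratio" can be unpacked with the standard ratio-estimation identities below. This is a sketch of the general technique, not a statement of the authors' exact formulation. The symbols r, d^* and \hat{r} are notation assumed for illustration.

```latex
% Illustrative sketch: standard likelihood-to-evidence ratio estimation.
% x: observed strain data, \theta: source parameters. Notation is assumed.
\begin{align}
  r(x \mid \theta) &= \frac{p(x \mid \theta)}{p(x)} = \frac{p(\theta \mid x)}{p(\theta)} \\
  d^*(x, \theta) &= \frac{p(x, \theta)}{p(x, \theta) + p(x)\,p(\theta)},
  \qquad r(x \mid \theta) = \frac{d^*(x, \theta)}{1 - d^*(x, \theta)} \\
  p(\theta \mid x) &\approx \hat{r}(x \mid \theta)\, p(\theta)
\end{align}
```

Here d^* is the optimal classifier separating dependent pairs (x, \theta) drawn from the joint from independent pairs drawn from the product of marginals. Once a network approximating it is trained on simulations, the posterior for any new observation follows from a forward pass and the prior, which is what makes the inference amortized.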
Happy to announce our latest work led by @ArnaudDelaunoy for fast posterior inference on gravitational wave data using likelihood-to-evidence ratio estimation (AALR / SNRE), reducing inference time from days to minutes or less! https://t.co/SIbqvSjPOL 🔭🤖 pic.twitter.com/IymZ6WWYGB
— Gilles Louppe (@glouppe) October 27, 2020
6. TTS-by-TTS: TTS-driven Data Augmentation for Fast and High-Quality Speech Synthesis
Min-Jae Hwang, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim
In this paper, we propose a text-to-speech (TTS)-driven data augmentation method for improving the quality of a non-autoregressive (non-AR) TTS system. Recently proposed non-AR models, such as FastSpeech 2, have successfully achieved fast speech synthesis. However, their quality is not satisfactory, especially when the amount of training data is insufficient. To address this problem, we propose an effective data augmentation method using a well-designed AR TTS system. In this method, large-scale synthetic corpora including text-waveform pairs with phoneme duration are generated by the AR TTS system and then used to train the target non-AR model. Perceptual listening test results showed that the proposed method significantly improved the quality of the non-AR TTS system. In particular, we augmented five hours of a training database to 179 hours of a synthetic one. Using these databases, our TTS system consisting of a FastSpeech 2 acoustic model with a Parallel WaveGAN vocoder achieved a mean opinion score of 3.74, which is 40% higher than that achieved by the conventional method.
TTS-by-TTS: TTS-driven Data Augmentation for Fast and High-Quality Speech Synthesis
— AK (@ak92501) October 27, 2020
pdf: https://t.co/weZDvpEFdI
abs: https://t.co/wQNQdUqvIF
project page: https://t.co/fvgMspndNQ pic.twitter.com/7ARHu8bFBs
7. S2cGAN: Semi-Supervised Training of Conditional GANs with Fewer Labels
Arunava Chakraborty, Rahul Ragesh, Mahir Shah, Nipun Kwatra
Generative adversarial networks (GANs) have been remarkably successful in learning complex high-dimensional real-world distributions and generating realistic samples. However, they provide limited control over the generation process. Conditional GANs (cGANs) provide a mechanism to control the generation process by conditioning the output on a user-defined input. Although training GANs requires only unsupervised data, training cGANs requires labelled data, which can be very expensive to obtain. We propose a framework for semi-supervised training of cGANs which utilizes sparse labels to learn the conditional mapping, and at the same time leverages a large amount of unsupervised data to learn the unconditional distribution. We demonstrate the effectiveness of our method on multiple datasets and different conditional tasks.
S2cGAN: Semi-Supervised Training of Conditional GANs with Fewer Labels
— AK (@ak92501) October 27, 2020
pdf: https://t.co/fwcAavRKZV
abs: https://t.co/XBXazSVGZw pic.twitter.com/MypkhMKWYM
8. The LMU Munich System for the WMT 2020 Unsupervised Machine Translation Shared Task
Alexandra Chronopoulou, Dario Stojanovski, Viktor Hangya, Alexander Fraser
This paper describes the submission of LMU Munich to the WMT 2020 unsupervised shared task, in two language directions, German<->Upper Sorbian. Our core unsupervised neural machine translation (UNMT) system follows the strategy of Chronopoulou et al. (2020), using a monolingual pretrained language generation model (on German) and fine-tuning it on both German and Upper Sorbian, before initializing a UNMT model, which is trained with online backtranslation. Pseudo-parallel data obtained from an unsupervised statistical machine translation (USMT) system is used to fine-tune the UNMT model. We also apply BPE-Dropout to the low resource (Upper Sorbian) data to obtain a more robust system. We additionally experiment with residual adapters and find them useful in the Upper Sorbian->German direction. We explore sampling during backtranslation and curriculum learning to use SMT translations in a more principled way. Finally, we ensemble our best-performing systems and reach a BLEU score of 32.4 on German->Upper Sorbian and 35.2 on Upper Sorbian->German.
We achieved 1st place in the #WMT2020 Unsupervised MT task! Core elements: (twisted) MASS pretraining, online BT on monolingual data 📚+ translation loss on pseudo-parallel data📕📗from unsup. SMT(+BPE drop)
— Alexandra Chronopoulou (@alexandraxron) October 27, 2020
Joint work w. @StDario1, V.Hangya, A.Fraser
📄https://t.co/zKVNmO3RDk
9. High Acceleration Reinforcement Learning for Real-World Juggling with Binary Rewards
Kai Ploeger, Michael Lutter, Jan Peters
Robots that can learn in the physical world will be important to enable robots to escape their stiff and pre-programmed movements. For dynamic high-acceleration tasks, such as juggling, learning in the real world is particularly challenging as one must push the limits of the robot and its actuation without harming the system, amplifying the necessity of sample efficiency and safety for robot learning algorithms. In contrast to prior work, which mainly focuses on the learning algorithm, we propose a learning system that directly incorporates these requirements in the design of the policy representation, initialization, and optimization. We demonstrate that this system enables the high-speed Barrett WAM manipulator to learn juggling two balls from 56 minutes of experience with a binary reward signal. The final policy juggles continuously for up to 33 minutes or about 4500 repeated catches. The videos documenting the learning process and the evaluation can be found at https://sites.google.com/view/jugglingbot
High Acceleration Reinforcement Learning for Real-World Juggling with Binary Rewards https://t.co/O5o5YJ66KJ https://t.co/iXgV3gC7Uh pic.twitter.com/Z0diCzVsHR
— sim2real (@sim2realAIorg) October 27, 2020
10. Trajectory-wise Multiple Choice Learning for Dynamics Generalization in Reinforcement Learning
Younggyo Seo, Kimin Lee, Ignasi Clavera, Thanard Kurutach, Jinwoo Shin, Pieter Abbeel
Model-based reinforcement learning (RL) has shown great potential in various control tasks in terms of both sample-efficiency and final performance. However, learning a generalizable dynamics model robust to changes in dynamics remains a challenge since the target transition dynamics follow a multi-modal distribution. In this paper, we present a new model-based RL algorithm, coined trajectory-wise multiple choice learning, that learns a multi-headed dynamics model for dynamics generalization. The main idea is updating the most accurate prediction head to specialize each head in certain environments with similar dynamics, i.e., clustering environments. Moreover, we incorporate context learning, which encodes dynamics-specific information from past experiences into the context latent vector, enabling the model to perform online adaptation to unseen environments. Finally, to utilize the specialized prediction heads more effectively, we propose an adaptive planning method, which selects the most accurate prediction head over a recent experience. Our method exhibits superior zero-shot generalization performance across a variety of control tasks, compared to state-of-the-art RL methods. Source code and videos are available at https://sites.google.com/view/trajectory-mcl.
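The core selection step described above, picking the prediction head that is most accurate over a trajectory segment, can be sketched as follows. This is a toy illustration on plain Python lists. The function `select_best_head` and its data layout are assumptions made for this sketch and are not taken from the released source code.

```python
def select_best_head(per_head_predictions, true_next_states):
    """Return (best_index, losses) for a multi-headed dynamics model.

    per_head_predictions[h][t] is head h's predicted next state at step t,
    true_next_states[t] is the observed next state; states are lists of floats.
    The loss of a head is its squared error summed over the whole segment.
    """
    losses = []
    for predictions in per_head_predictions:
        loss = 0.0
        for predicted, observed in zip(predictions, true_next_states):
            loss += sum((p - o) ** 2 for p, o in zip(predicted, observed))
        losses.append(loss)
    # The head with the lowest trajectory-wise loss is the one that would be
    # updated during training, or used for planning at test time.
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return best_index, losses


# Toy usage: two heads, a three-step segment of a 2-D state.
observed = [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
head_predictions = [
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],   # head 0: poor fit
    [[0.0, 1.0], [0.1, 1.0], [0.2, 0.9]],   # head 1: close fit
]
best, head_losses = select_best_head(head_predictions, observed)
```

Scoring whole segments rather than single transitions is what makes the choice "trajectory-wise": a head is judged on how well it explains a stretch of experience from one environment.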
Excited to share our #NeurIPS2020 paper that introduces a new model-based RL method to learn the multi-modal transition distribution in an unsupervised manner
— Kimin (@kimin_le2) October 27, 2020
🎓https://t.co/nhKgJJJ9JW
💻https://t.co/fRg25TSOdR
w/@younggyoseo @clavera_i @KurutachThanard @jinwoos0417 @pabbeel
1/N
11. FastFormers: Highly Efficient Transformer Models for Natural Language Understanding
Young Jin Kim, Hany Hassan Awadalla
Transformer-based models are the state-of-the-art for Natural Language Understanding (NLU) applications. Models are getting bigger and better on various tasks. However, Transformer models remain computationally challenging since they are not efficient at inference-time compared to traditional approaches. In this paper, we present FastFormers, a set of recipes to achieve efficient inference-time performance for Transformer-based models on various NLU tasks. We show how carefully utilizing knowledge distillation, structured pruning and numerical optimization can lead to drastic improvements on inference efficiency. We provide effective recipes that can guide practitioners to choose the best settings for various NLU tasks and pretrained models. Applying the proposed recipes to the SuperGLUE benchmark, we achieve from 9.8x up to 233.9x speed-up compared to out-of-the-box models on CPU. On GPU, we also achieve up to 12.4x speed-up with the presented methods. We show that FastFormers can drastically reduce cost of serving 100 million requests from 4,223 USD to just 18 USD on an Azure F16s_v2 instance. This translates to a sustainable runtime by reducing energy consumption 6.9x - 125.8x according to the metrics used in the SustaiNLP 2020 shared task.
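As a quick sanity check on the serving-cost claim, the arithmetic below uses only the figures quoted in the abstract (4,223 USD and 18 USD per 100 million requests). The variable names are chosen for this illustration.

```python
# Figures quoted in the abstract for serving 100 million requests.
REQUESTS = 100_000_000
baseline_cost_usd = 4223.0
optimized_cost_usd = 18.0

# Overall cost reduction factor implied by the two totals.
cost_reduction = baseline_cost_usd / optimized_cost_usd

# Cost per one million requests before and after optimization.
millions_of_requests = REQUESTS / 1_000_000
baseline_per_million = baseline_cost_usd / millions_of_requests
optimized_per_million = optimized_cost_usd / millions_of_requests

print(f"cost reduction: {cost_reduction:.1f}x")
print(f"per million requests: {baseline_per_million:.2f} USD vs {optimized_per_million:.2f} USD")
```

The implied cost reduction of roughly 235x is in line with the upper end of the reported CPU speed-up (233.9x).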
FastFormers: Highly Efficient Transformer Models for Natural Language Understanding
— AK (@ak92501) October 27, 2020
pdf: https://t.co/HRCHSDOG9i
abs: https://t.co/NI6w8qHEnQ
github: https://t.co/1vnyr3IBMQ pic.twitter.com/hXnynyolNO
12. Graph Information Bottleneck
Tailin Wu, Hongyu Ren, Pan Li, Jure Leskovec
Representation learning of graph-structured data is challenging because both graph structure and node features carry important information. Graph Neural Networks (GNNs) provide an expressive way to fuse information from network structure and node features. However, GNNs are prone to adversarial attacks. Here we introduce Graph Information Bottleneck (GIB), an information-theoretic principle that optimally balances expressiveness and robustness of the learned representation of graph-structured data. Inheriting from the general Information Bottleneck (IB), GIB aims to learn the minimal sufficient representation for a given task by maximizing the mutual information between the representation and the target, and simultaneously constraining the mutual information between the representation and the input data. Different from the general IB, GIB regularizes the structural as well as the feature information. We design two sampling algorithms for structural regularization and instantiate the GIB principle with two new models: GIB-Cat and GIB-Bern, and demonstrate the benefits by evaluating the resilience to adversarial attacks. We show that our proposed models are more robust than state-of-the-art graph defense models. GIB-based models empirically achieve up to 31% improvement with adversarial perturbation of the graph structure as well as node features.
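The trade-off described above can be written as a single objective. The following is a sketch in the usual Information Bottleneck form. The symbols D = (A, X) for graph structure and node features, Z for the representation, Y for the target, and \beta for the trade-off weight are notation assumed for illustration.

```latex
% Illustrative sketch of the bottleneck objective described in the abstract.
% D = (A, X): input graph (structure A, node features X); Z: representation; Y: target.
\begin{equation}
  \min_{P(Z \mid D)} \; -\,I(Y; Z) \;+\; \beta\, I(D; Z)
\end{equation}
```

The first term rewards representations that are predictive of the target, and the second penalizes information retained about the input, with \beta > 0 setting the balance between expressiveness and robustness.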
New #NeurIPS2020 paper on Graph Neural Nets #GNN, Representation Learning, Robustness!
— Hongyu Ren (@ren_hongyu) October 27, 2020
"Graph Information Bottleneck" w/ @tailintalent, Pan Li, @jure @StanfordAILab @PurdueCS
Website: https://t.co/sTYXObicuw
Paper: https://t.co/1IWqYXEHWe
Code: https://t.co/I0P05x8jz5
(1/n) pic.twitter.com/N22ZDwj1Qv
13. XLVIN: eXecuted Latent Value Iteration Nets
Andreea Deac, Petar Veličković, Ognjen Milinković, Pierre-Luc Bacon, Jian Tang, Mladen Nikolić
Value Iteration Networks (VINs) have emerged as a popular method to incorporate planning algorithms within deep reinforcement learning, enabling performance improvements on tasks requiring long-range reasoning and understanding of environment dynamics. This came with several limitations, however: the model is not incentivised in any way to perform meaningful planning computations, the underlying state space is assumed to be discrete, and the Markov decision process (MDP) is assumed fixed and known. We propose eXecuted Latent Value Iteration Networks (XLVINs), which combine recent developments across contrastive self-supervised learning, graph representation learning and neural algorithmic reasoning to alleviate all of the above limitations, successfully deploying VIN-style models on generic environments. XLVINs match the performance of VIN-like models when the underlying MDP is discrete, fixed and known, and provide significant improvements over model-free baselines across three general MDP setups.
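For readers unfamiliar with the planning algorithm that VINs and XLVINs build on, here is a minimal sketch of classical tabular value iteration on a known, discrete MDP. It is not the XLVIN model itself. The function name, the data layout, and the three-state toy chain are assumptions made for illustration.

```python
def value_iteration(transitions, rewards, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Classical tabular value iteration.

    transitions[s][a] is a list of (probability, next_state) pairs and
    rewards[s][a] is the immediate reward for taking action a in state s.
    Returns the list of optimal state values.
    """
    values = [0.0] * len(transitions)
    for _ in range(max_iters):
        new_values = []
        for s, actions in enumerate(transitions):
            # Bellman optimality backup: best expected return over actions.
            q_values = [
                rewards[s][a] + gamma * sum(p * values[s2] for p, s2 in outcomes)
                for a, outcomes in enumerate(actions)
            ]
            new_values.append(max(q_values))
        delta = max(abs(n - v) for n, v in zip(new_values, values))
        values = new_values
        if delta < tol:
            break
    return values


# Toy chain: action 0 stays put, action 1 moves right; state 2 is absorbing.
toy_transitions = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 1)], [(1.0, 2)]],
    [[(1.0, 2)], [(1.0, 2)]],
]
toy_rewards = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
toy_values = value_iteration(toy_transitions, toy_rewards)
```

This version needs the full transition table up front, which is exactly the "fixed and known MDP" assumption the abstract lists as a limitation of VIN-style models.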
Excited to share the XLVIN agent!
— Andreea Deac (@andreeadeac22) October 27, 2020
Using GNNs for implicit planning and algorithmically aligning to Value Iteration, we extend Value Iteration Nets to pixel-based/continuous MDPs (eg Atari-Freeway, CartPole, Acrobot), outperforming model-free baselines. https://t.co/Kkv5xXPvZr pic.twitter.com/4KVVenVrXr
14. Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modelling
Akash Srivastava, Yamini Bansal, Yukun Ding, Cole Hurwitz, Kai Xu, Bernhard Egger, Prasanna Sattigeri, Josh Tenenbaum, David D. Cox, Dan Gutfreund
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors. This approach introduces a trade-off between disentangled representation learning and reconstruction quality since the model does not have enough capacity to learn correlated latent variables that capture detail information present in most image data. To overcome this trade-off, we present a novel multi-stage modelling approach where the disentangled factors are first learned using a preexisting disentangled representation learning method (such as β-TCVAE); then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables, adding detail information while maintaining conditioning on the previously learned disentangled factors. Taken together, our multi-stage modelling approach results in a single, coherent probabilistic model that is theoretically justified by the principle of D-separation and can be realized with a variety of model classes including likelihood-based models such as variational autoencoders, implicit models such as generative adversarial networks, and tractable models like normalizing flows or mixtures of Gaussians. We demonstrate that our multi-stage model has much higher reconstruction quality than current state-of-the-art methods with equivalent disentanglement performance across multiple standard benchmarks.
Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modelling
— AK (@ak92501) October 27, 2020
pdf: https://t.co/dubu2XiYLf
abs: https://t.co/xtqmnXa4OI pic.twitter.com/rpcI1d2xk1
15. Contextualized Word Embeddings Encode Aspects of Human-Like Word Sense Knowledge
Sathvik Nair, Mahesh Srinivasan, Stephan Meylan
Understanding context-dependent variation in word meanings is a key aspect of human language comprehension supported by the lexicon. Lexicographic resources (e.g., WordNet) capture only some of this context-dependent variation; for example, they often do not encode how closely senses, or discretized word meanings, are related to one another. Our work investigates whether recent advances in NLP, specifically contextualized word embeddings, capture human-like distinctions between English word senses, such as polysemy and homonymy. We collect data from a behavioral, web-based experiment, in which participants provide judgments of the relatedness of multiple WordNet senses of a word in a two-dimensional spatial arrangement task. We find that participants’ judgments of the relatedness between senses are correlated with distances between senses in the BERT embedding space. Homonymous senses (e.g., bat as mammal vs. bat as sports equipment) are reliably more distant from one another in the embedding space than polysemous ones (e.g., chicken as animal vs. chicken as meat). Our findings point towards the potential utility of continuous-space representations of sense meanings.
Excited to officially announce my paper, Contextualized Word Embeddings Encode Aspects of Human-Like Word Sense Knowledge, for the Cognitive Aspects of the Lexicon Workshop at @coling2020. https://t.co/Ilu7lDSSLS
— Sathvik (@sathvikn4) October 27, 2020
16. Performance Analysis of Scientific Computing Workloads on Trusted Execution Environments
Ayaz Akram, Anna Giannakou, Venkatesh Akella, Jason Lowe-Power, Sean Peisert
Scientific computing sometimes involves computation on sensitive data. Depending on the data and the execution environment, the HPC (high-performance computing) user or data provider may require confidentiality and/or integrity guarantees. To study the applicability of hardware-based trusted execution environments (TEEs) to enable secure scientific computing, we deeply analyze the performance impact of AMD SEV and Intel SGX for diverse HPC benchmarks including traditional scientific computing, machine learning, graph analytics, and emerging scientific computing workloads. We observe three main findings: 1) SEV requires careful memory placement on large scale NUMA machines (1x-3.4x slowdown without and 1x-1.15x slowdown with NUMA aware placement), 2) virtualization, a prerequisite for SEV, results in performance degradation for workloads with irregular memory accesses and large working sets (1x-4x slowdown compared to native execution for graph applications) and 3) SGX is inappropriate for HPC given its limited secure memory size and inflexible programming model (1.2x-126x slowdown over unsecure execution). Finally, we discuss forthcoming new TEE designs and their potential impact on scientific computing.