All Articles

Hot Papers 2020-10-23

1. Castle in the Sky: Dynamic Sky Replacement and Harmonization in Videos

Zhengxia Zou

  • retweets: 10176, favorites: 0 (10/24/2020 10:54:39)
  • links: abs | pdf
  • cs.CV

This paper proposes a vision-based method for video sky replacement and harmonization, which can automatically generate realistic and dramatic sky backgrounds in videos with controllable styles. Different from previous sky editing methods that either focus on static photos or require inertial measurement units integrated into smartphones when shooting videos, our method is purely vision-based, with no requirements on the capturing devices, and applies to both online and offline processing scenarios. Our method runs in real-time and is free of user interactions. We decompose this artistic creation process into a couple of proxy tasks including sky matting, motion estimation, and image blending. Experiments are conducted on videos diversely captured in the wild by handheld smartphones and dash cameras, and show high fidelity and good generalization of our method in both visual quality and lighting/motion dynamics. Our code and animated results are available at https://jiupinjia.github.io/skyar/.
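
The decomposition into sky matting, motion estimation, and blending can be pictured with a minimal per-frame sketch. The following is not the authors' implementation: it assumes a pretrained `sky_matting_net` that returns a soft sky mask, approximates the sky's apparent motion with a feature-tracking homography, and alpha-blends a warped sky template into each frame.

```python
import cv2
import numpy as np

def replace_sky(frames, new_sky, sky_matting_net):
    """Hypothetical per-frame pipeline: sky matting -> motion estimation -> blending."""
    prev_gray, accum_h = None, np.eye(3, dtype=np.float32)
    for frame in frames:
        # 1) Sky matting: soft alpha mask in [0, 1], 1 = sky (assumed model output).
        alpha = sky_matting_net(frame)[..., None]

        # 2) Motion estimation: track sparse features and accumulate a homography
        #    so the inserted sky moves consistently with the camera.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            pts = cv2.goodFeaturesToTrack(prev_gray, 500, 0.01, 8)
            nxt, ok, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
            h, _ = cv2.findHomography(pts[ok.ravel() == 1], nxt[ok.ravel() == 1], cv2.RANSAC)
            accum_h = h @ accum_h
        prev_gray = gray

        # 3) Blending: warp the sky template into frame coordinates and alpha-blend.
        warped_sky = cv2.warpPerspective(new_sky, accum_h, frame.shape[1::-1])
        yield (alpha * warped_sky + (1 - alpha) * frame).astype(np.uint8)
```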

2. mT5: A massively multilingual pre-trained text-to-text transformer

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel

  • retweets: 8034, favorites: 103 (10/24/2020 10:54:39)
  • links: abs | pdf
  • cs.CL

The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We describe the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. All of the code and model checkpoints used in this work are publicly available.

3. Learning Invariances in Neural Networks

Gregory Benton, Marc Finzi, Pavel Izmailov, Andrew Gordon Wilson

Invariances to translations have imbued convolutional neural networks with powerful generalization properties. However, we often do not know a priori what invariances are present in the data, or to what extent a model should be invariant to a given symmetry group. We show how to learn invariances and equivariances by parameterizing a distribution over augmentations and optimizing the training loss simultaneously with respect to the network parameters and augmentation parameters. With this simple procedure we can recover the correct set and extent of invariances on image classification, regression, segmentation, and molecular property prediction from a large space of augmentations, on training data alone.
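
A minimal sketch of the idea for one augmentation family (rotation), assuming a PyTorch classifier `net`: the width of a uniform distribution over rotation angles is itself a learnable parameter, predictions are averaged over sampled rotations, and the rotation is implemented differentiably so the training loss back-propagates into the augmentation parameter as well as the network weights. This omits any regularization the paper may add.

```python
import torch
import torch.nn.functional as F

class LearnedRotationRange(torch.nn.Module):
    """Sketch: learn the width of a uniform rotation-augmentation distribution
    jointly with the classifier, averaging predictions over sampled rotations."""
    def __init__(self, net, n_samples=4):
        super().__init__()
        self.net = net
        self.log_width = torch.nn.Parameter(torch.tensor(0.0))  # log of max angle (radians)
        self.n_samples = n_samples

    def rotate(self, x, theta):
        # Differentiable rotation so the loss gradient reaches the augmentation parameter.
        cos, sin, zero = torch.cos(theta), torch.sin(theta), torch.zeros_like(theta)
        mats = torch.stack([torch.stack([cos, -sin, zero], dim=-1),
                            torch.stack([sin,  cos, zero], dim=-1)], dim=-2)  # (B, 2, 3)
        grid = F.affine_grid(mats, x.shape, align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

    def forward(self, x):
        width = self.log_width.exp()
        logits = 0.0
        for _ in range(self.n_samples):
            u = torch.rand(x.shape[0], device=x.device) * 2 - 1   # theta = width * u
            logits = logits + self.net(self.rotate(x, width * u))
        return logits / self.n_samples
```

Training this wrapper with an ordinary classification loss updates both the network weights and the augmentation width, which is the joint optimization the abstract describes.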

4. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
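
The core architectural move is treating an image as a sequence of patch tokens. Below is a compact sketch of that front end, with illustrative sizes rather than the authors' exact configuration: 16x16 patches are linearly projected, a class token and position embeddings are added, and a standard transformer encoder processes the sequence.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Sketch of the ViT front end: patchify -> project -> [class] token +
    position embeddings -> transformer encoder -> classify from the class token."""
    def __init__(self, image_size=224, patch=16, dim=768, depth=12, heads=12, n_classes=1000):
        super().__init__()
        n_patches = (image_size // patch) ** 2
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify + project
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                                        # x: (B, 3, H, W)
        tokens = self.to_patches(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])
```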

5. Batch Exploration with Examples for Scalable Robotic Reinforcement Learning

Annie S. Chen, HyunJi Nam, Suraj Nair, Chelsea Finn

Learning from diverse offline datasets is a promising path towards learning general purpose robotic agents. However, a core challenge in this paradigm lies in collecting large amounts of meaningful data, while not depending on a human in the loop for data collection. One way to address this challenge is through task-agnostic exploration, where an agent attempts to explore without a task-specific reward function, and collect data that can be useful for any downstream task. While these approaches have shown some promise in simple domains, they often struggle to explore the relevant regions of the state space in more challenging settings, such as vision-based robotic manipulation. This challenge stems from an objective that encourages exploring everything in a potentially vast state space. To mitigate this challenge, we propose to focus exploration on the important parts of the state space using weak human supervision. Concretely, we propose an exploration technique, Batch Exploration with Examples (BEE), that explores relevant regions of the state space, guided by a modest number of human-provided images of important states. These human-provided images only need to be collected once at the beginning of data collection and can be collected in a matter of minutes, allowing us to scalably collect diverse datasets, which can then be combined with any batch RL algorithm. We find that BEE is able to tackle challenging vision-based manipulation tasks both in simulation and on a real Franka robot, and observe that compared to task-agnostic and weakly-supervised exploration techniques, it (1) interacts more than twice as often with relevant objects, and (2) improves downstream task performance when used in conjunction with offline RL.
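
One way to picture the weak supervision: score visited states by how close they are, in a learned embedding space, to the handful of human-provided images of relevant states, and use that score as an exploration bonus. The sketch below is a simplification with hypothetical names (`encoder`, `relevant_images`); the paper's actual relevance model is more involved.

```python
import torch
import torch.nn.functional as F

def relevance_reward(encoder, relevant_images, observations):
    """Hypothetical exploration bonus: cosine similarity between each observation's
    embedding and the nearest human-provided example of a relevant state."""
    with torch.no_grad():
        goals = F.normalize(encoder(relevant_images), dim=-1)   # (K, d)
        obs = F.normalize(encoder(observations), dim=-1)        # (B, d)
    # Reward each observation by its similarity to the closest example image.
    return (obs @ goals.T).max(dim=-1).values
```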

6. CoinDICE: Off-Policy Confidence Interval Estimation

Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvári, Dale Schuurmans

We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy’s value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the Q-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
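
Schematically, the object being estimated is an interval around the (normalized) policy value with a specified coverage level; the paper's contribution is how to compute such an interval from the static dataset alone. A high-level statement of the goal, using one common normalization and not the paper's derivation:

```latex
% Normalized value of the target policy \pi (one common normalization)
\rho(\pi) \;=\; (1-\gamma)\,\mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t=0}^{\infty}\gamma^{t} r_{t}\Big]
% Goal: from the static dataset D alone, produce an interval with coverage 1 - \alpha
\Pr\!\big(\rho(\pi)\in[\,l_{\alpha}(D),\,u_{\alpha}(D)\,]\big)\;\ge\;1-\alpha
```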

7. In Search of Robust Measures of Generalization

Gintare Karolina Dziugaite, Alexandre Drouin, Brady Neal, Nitarshan Rajkumar, Ethan Caballero, Linbo Wang, Ioannis Mitliagkas, Daniel M. Roy

One of the principal scientific challenges in deep learning is explaining generalization, i.e., why the particular way the community now trains networks to achieve small training error also leads to small error on held-out data from the same population. It is widely appreciated that some worst-case theories — such as those based on the VC dimension of the class of predictors induced by modern neural network architectures — are unable to explain empirical performance. A large volume of work aims to close this gap, primarily by developing bounds on generalization error, optimization error, and excess risk. When evaluated empirically, however, most of these bounds are numerically vacuous. Focusing on generalization bounds, this work addresses the question of how to evaluate such bounds empirically. Jiang et al. (2020) recently described a large-scale empirical study aimed at uncovering potential causal relationships between bounds/measures and generalization. Building on their study, we highlight where their proposed methods can obscure failures and successes of generalization measures in explaining generalization. We argue that generalization measures should instead be evaluated within the framework of distributional robustness.

8. Parallel Tacotron: Non-Autoregressive and Controllable TTS

Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron Weiss, Yonghui Wu

Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvements to their efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem and improves naturalness. To further improve the naturalness, we use lightweight convolutions, which can efficiently capture local contexts, and introduce an iterative spectrogram loss inspired by iterative refinement. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective evaluations with significantly decreased inference time.
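
A rough sketch of the residual-encoder piece, assuming mel-spectrogram targets: at training time the target spectrogram is compressed into a small latent (with a KL term toward a standard normal prior), and at inference the latent is sampled from the prior and combined with the text encoding. Names and sizes are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ResidualVAEEncoder(nn.Module):
    """Illustrative VAE-based residual encoder for non-autoregressive TTS."""
    def __init__(self, n_mels=80, hidden=256, latent_dim=16):
        super().__init__()
        self.latent_dim = latent_dim
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU())
        self.to_mu, self.to_logvar = nn.Linear(hidden, latent_dim), nn.Linear(hidden, latent_dim)

    def forward(self, mel=None, batch_size=1):
        if mel is None:                                # inference: sample from the prior
            return torch.randn(batch_size, self.latent_dim), None
        h = self.conv(mel).mean(dim=-1)                # (B, hidden), averaged over time
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()             # reparameterization
        kl = 0.5 * (logvar.exp() + mu.pow(2) - 1 - logvar).sum(-1).mean()
        return z, kl                                   # z is combined with the text encoding
```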

9. Identifying Learning Rules From Neural Network Observables

Aran Nayebi, Sanjana Srivastava, Surya Ganguli, Daniel L.K. Yamins

The brain modifies its synaptic strengths during learning in order to better adapt to its environment. However, the underlying plasticity rules that govern learning are unknown. Many proposals have been suggested, including Hebbian mechanisms, explicit error backpropagation, and a variety of alternatives. It is an open question as to what specific experimental measurements would need to be made to determine whether any given learning rule is operative in a real biological system. In this work, we take a “virtual experimental” approach to this problem. Simulating idealized neuroscience experiments with artificial neural networks, we generate a large-scale dataset of learning trajectories of aggregate statistics measured in a variety of neural network architectures, loss functions, learning rule hyperparameters, and parameter initializations. We then take a discriminative approach, training linear and simple non-linear classifiers to identify learning rules from features based on these observables. We show that different classes of learning rules can be separated solely on the basis of aggregate statistics of the weights, activations, or instantaneous layer-wise activity changes, and that these results generalize to limited access to the trajectory and held-out architectures and learning curricula. We identify the statistics of each observable that are most relevant for rule identification, finding that statistics from network activities across training are more robust to unit undersampling and measurement noise than those obtained from the synaptic strengths. Our results suggest that activation patterns, available from electrophysiological recordings of post-synaptic activities on the order of several hundred units, frequently measured at wider intervals over the course of learning, may provide a good basis on which to identify learning rules.
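
The discriminative step can be pictured with a toy example: summarize each simulated training trajectory by aggregate statistics of one observable and fit a standard classifier over learning-rule labels. The feature choices and the synthetic data below are placeholders, not the paper's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def trajectory_features(observable_over_time):
    """Aggregate statistics (a hypothetical choice) of one observable -- e.g.
    a layer's activations or weights -- sampled at several points in training."""
    feats = []
    for snapshot in observable_over_time:              # snapshot: 1-D array of values
        feats += [snapshot.mean(), snapshot.std(),
                  np.abs(snapshot).mean(), np.linalg.norm(snapshot)]
    return np.array(feats)

# Toy stand-in: y records which of 4 hypothetical learning rules generated a trajectory.
rng = np.random.default_rng(0)
y = rng.integers(0, 4, size=200)
trajectories = [[rng.normal(scale=1 + 0.2 * label, size=256) for _ in range(10)]
                for label in y]

X = np.stack([trajectory_features(t) for t in trajectories])
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())
```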

10. NU-GAN: High resolution neural upsampling with GAN

Rithesh Kumar, Kundan Kumar, Vicki Anand, Yoshua Bengio, Aaron Courville

In this paper, we propose NU-GAN, a new method for resampling audio from lower to higher sampling rates (upsampling). Audio upsampling is an important problem since productionizing generative speech technology requires operating at high sampling rates. Such applications use audio at a resolution of 44.1 kHz or 48 kHz, whereas current speech synthesis methods are equipped to handle a maximum of 24 kHz resolution. NU-GAN takes a leap towards solving audio upsampling as a separate component in the text-to-speech (TTS) pipeline by leveraging techniques for audio generation using GANs. ABX preference tests indicate that our NU-GAN resampler can upsample 22 kHz audio to 44.1 kHz such that it is distinguishable from the original audio only 7.4% more often than random chance on a single-speaker dataset, and 10.8% more often than chance on a multi-speaker dataset.

11. Learning Loss for Test-Time Augmentation

Ildoo Kim, Younghoon Kim, Sungwoong Kim

Data augmentation has been actively studied for robust neural networks. Most of the recent data augmentation methods focus on augmenting datasets during the training phase. At the testing phase, simple transformations are still widely used for test-time augmentation. This paper proposes a novel instance-level test-time augmentation that efficiently selects suitable transformations for a test input. Our proposed method involves an auxiliary module to predict the loss of each possible transformation given the input. Then, the transformations having lower predicted losses are applied to the input. The network obtains the results by averaging the prediction results of augmented inputs. Experimental results on several image classification benchmarks show that the proposed instance-aware test-time augmentation improves the model’s robustness against various corruptions.
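
A small sketch of the selection step, assuming a trained auxiliary module `loss_predictor(x, t_id)` that scores each candidate transformation for a given input (names are illustrative, not the authors' API): the k transformations with the lowest predicted loss are applied and their predictions averaged.

```python
import torch

def instance_aware_tta(model, loss_predictor, transforms, x, k=3):
    """Apply only the k transformations predicted to yield the lowest loss for this
    input, then average the model's predictions over those augmented copies."""
    predicted = torch.stack([loss_predictor(x, t_id) for t_id in range(len(transforms))])
    chosen = torch.topk(-predicted, k).indices.tolist()        # lowest predicted loss
    probs = torch.stack([model(transforms[i](x)).softmax(dim=-1) for i in chosen])
    return probs.mean(dim=0)
```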

12. ConVEx: Data-Efficient and Few-Shot Slot Labeling

Matthew Henderson, Ivan Vulić

  • retweets: 42, favorites: 80 (10/24/2020 10:54:41)
  • links: abs | pdf
  • cs.CL

We propose ConVEx (Conversational Value Extractor), an efficient pretraining and fine-tuning neural approach for slot-labeling dialog tasks. Instead of relying on more general pretraining objectives from prior work (e.g., language modeling, response selection), ConVEx’s pretraining objective, a novel pairwise cloze task using Reddit data, is well aligned with its intended usage on sequence labeling tasks. This enables learning domain-specific slot labelers by simply fine-tuning decoding layers of the pretrained general-purpose sequence labeling model, while the majority of the pretrained model’s parameters are kept frozen. We report state-of-the-art performance of ConVEx across a range of diverse domains and data sets for dialog slot-labeling, with the largest gains in the most challenging, few-shot setups. We believe that ConVEx’s reduced pretraining times (i.e., only 18 hours on 12 GPUs) and cost, along with its efficient fine-tuning and strong performance, promise wider portability and scalability for data-efficient sequence-labeling tasks in general.
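
The fine-tuning recipe described, freezing the bulk of the pretrained sequence labeler and training only its decoding layers, looks roughly like the following PyTorch sketch; the attribute name and optimizer settings are assumptions, not ConVEx's actual code.

```python
import torch

def prepare_for_slot_finetuning(model, decoder_attr="decoder", lr=1e-4):
    """Freeze the pretrained sequence-labeling backbone and leave only the
    lightweight decoding layers trainable for the few-shot slot-labeling task."""
    for p in model.parameters():
        p.requires_grad = False
    for p in getattr(model, decoder_attr).parameters():        # assumed attribute name
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)
```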

13. STAR: A Schema-Guided Dialog Dataset for Transfer Learning

Johannes E. M. Mosig, Shikib Mehri, Thomas Kober

  • retweets: 76, favorites: 45 (10/24/2020 10:54:42)
  • links: abs | pdf
  • cs.CL

We present STAR, a schema-guided task-oriented dialog dataset consisting of 127,833 utterances and knowledge base queries across 5,820 task-oriented dialogs in 13 domains, designed specifically to facilitate task and domain transfer learning in task-oriented dialog. Furthermore, we propose a scalable crowd-sourcing paradigm to collect arbitrarily large datasets of the same quality as STAR. Moreover, we introduce novel schema-guided dialog models that use an explicit description of the task(s) to generalize from known to unknown tasks. We demonstrate the effectiveness of these models, particularly for zero-shot generalization across tasks and domains.

14. BlendTorch: A Real-Time, Adaptive Domain Randomization Library

Christoph Heindl, Lukas Brunner, Sebastian Zambal, Josef Scharinger

  • retweets: 20, favorites: 52 (10/24/2020 10:54:42)
  • links: abs | pdf
  • cs.CV | cs.LG

Solving complex computer vision tasks by deep learning techniques relies on large amounts of (supervised) image data, typically unavailable in industrial environments. The lack of training data starts to impede the successful transfer of state-of-the-art methods in computer vision to industrial applications. We introduce BlendTorch, an adaptive Domain Randomization (DR) library, to help create infinite streams of synthetic training data. BlendTorch generates data by massively randomizing low-fidelity simulations and takes care of distributing artificial training data for model learning in real time. We show that models trained with BlendTorch consistently perform better in an industrial object detection task than those trained on real or photo-realistic datasets.
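
BlendTorch's own API is not shown here; the sketch below only illustrates the general domain-randomization pattern it packages: sample scene parameters from broad distributions, render low-fidelity images, and stream the results to the learner indefinitely.

```python
import numpy as np

def randomized_scene_params(rng):
    """Generic domain-randomization sampler (illustrative only, not BlendTorch's API):
    every synthetic image is rendered under freshly randomized conditions."""
    return {
        "light_intensity": rng.uniform(0.2, 3.0),
        "camera_distance": rng.uniform(0.5, 2.0),
        "object_pose_euler": rng.uniform(-np.pi, np.pi, size=3),
        "texture_id": int(rng.integers(0, 500)),
        "background_id": int(rng.integers(0, 200)),
    }

def synthetic_stream(render_fn, seed=0):
    """Infinite stream of (image, annotations) pairs; render_fn is whatever renderer
    (e.g. Blender driven by a library such as BlendTorch) produces the samples."""
    rng = np.random.default_rng(seed)
    while True:
        yield render_fn(randomized_scene_params(rng))
```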