1. Big Bird: Transformers for Longer Sequences
Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed
Transformer-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having global tokens (such as CLS) that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences up to 8x longer than what was previously possible using similar hardware. As a consequence of this capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
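As a rough illustration (not the authors' code), the three attention patterns the abstract describes — a few random positions, a local sliding window, and a handful of global tokens — can be combined into a block-sparse attention mask. Block size, window width, and the random/global counts below are arbitrary choices for the sketch:

```python
import numpy as np

def bigbird_mask(seq_len, block=4, window=1, n_global=2, n_random=2, seed=0):
    """Build a BigBird-style sparse attention mask over blocks.

    Combines a sliding window of neighbouring blocks, a few global
    blocks that attend everywhere (and are attended to by everyone),
    and a few random blocks per row. Because each row keeps a constant
    number of blocks, total attended pairs grow linearly in seq_len.
    """
    rng = np.random.default_rng(seed)
    n_blocks = seq_len // block
    m = np.zeros((n_blocks, n_blocks), dtype=bool)
    for i in range(n_blocks):
        # local sliding window around block i
        lo, hi = max(0, i - window), min(n_blocks, i + window + 1)
        m[i, lo:hi] = True
        # a few random blocks per row
        m[i, rng.choice(n_blocks, size=n_random, replace=False)] = True
    # global blocks: full rows and full columns
    m[:n_global, :] = True
    m[:, :n_global] = True
    # expand the block mask to token resolution
    return np.kron(m, np.ones((block, block), dtype=bool))

mask = bigbird_mask(64)
print(mask.shape)  # (64, 64)
```

In a real model this boolean mask would gate the attention logits, so only the `True` pairs are ever computed or stored.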
Big Bird: Transformers for Longer Sequences 🐦
— AK (@ak92501) July 29, 2020
pdf: https://t.co/1ZH5oC2T2e
abs: https://t.co/DLt59rpbps pic.twitter.com/XHuvaqPahM
That didn’t take long at all! As predicted in the recent @pagestlabs issue, long-span attention cost for transformer models like GPT-3 and T5 came down from O(N√N) to O(N) in BigBird. Looking forward to these models becoming viable for everyone to build. https://t.co/ukS1TAWsX3 https://t.co/4NVZ0j5Vss
— Delip Rao (@deliprao) July 29, 2020
BigBird is a Transformer whose attention targets are a combination of (1) sparse random positions, (2) a local neighborhood, and (3) a constant number of global positions. Its compute is linear in sequence length, it has the same expressive power as the original Transformer, and it is Turing complete. It can handle long-range dependencies, sets new SOTA on NLP tasks, and predicts DNA promoter regions almost perfectly. https://t.co/iO5FEpUMvp
— Daisuke Okanohara (@hillbig) July 29, 2020
Big Bird is a transformer-based model that more effectively supports NLP tasks requiring longer contexts.
— elvis (@omarsar0) July 29, 2020
It satisfies the theoretical properties of the full model while reducing the attention mechanism complexity to linear in # of tokens.https://t.co/zMJ5ZmSBUc pic.twitter.com/HiMEMSlezY
Big Bird: Transformers for Longer Sequences (Google) https://t.co/wQWVWFyfI9 Handling long sequences via sparse attention in Transformers. BigBird = Random + Local (Window) + Global. Experiments go up to length 4096. Evaluated on MLM, QA, summarization, etc. The block size (Fig. 4, Table 12) and the number of global tokens strike me as fairly large. pic.twitter.com/jVkpY8smSb
— Kyosuke Nishida (@kyoun) July 29, 2020
Link for the lazy: https://t.co/vGEdi2tWAf
— Madison May (@pragmaticml) July 29, 2020
Very nice work from colleagues in my team and in sibling teams, https://t.co/MGQZHi0AjW https://t.co/pCpIyuqzlR
— D. Sivakumar (@dsivakumar) July 29, 2020
2. Noise-Induced Barren Plateaus in Variational Quantum Algorithms
Samson Wang, Enrico Fontana, M. Cerezo, Kunal Sharma, Akira Sone, Lukasz Cincio, Patrick J. Coles
Variational Quantum Algorithms (VQAs) may be a path to quantum advantage on Noisy Intermediate-Scale Quantum (NISQ) computers. A natural question is whether the noise on NISQ devices places any fundamental limitations on the performance of VQAs. In this work, we rigorously prove a serious limitation for noisy VQAs, in that the noise causes the training landscape to have a barren plateau (i.e., vanishing gradient). Specifically, for the local Pauli noise considered, we prove that the gradient vanishes exponentially in the number of layers L. This implies exponential decay in the number of qubits n when L scales as poly(n), for sufficiently large coefficients in the polynomial. These noise-induced barren plateaus (NIBPs) are conceptually different from noise-free barren plateaus, which are linked to random parameter initialization. Our result is formulated for an abstract ansatz that includes as special cases the Quantum Alternating Operator Ansatz (QAOA) and the Unitary Coupled Cluster Ansatz, among others. In the case of the QAOA, we implement numerical heuristics that confirm the NIBP phenomenon for a realistic hardware noise model.
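In symbols, the scaling the abstract describes has roughly the following shape (our schematic restatement; the symbols q and G(n) are our notation, not the paper's exact constants):

```latex
% Schematic form of a noise-induced barren plateau bound: local Pauli
% noise contracts the state toward a fixed point by a factor q < 1 per
% layer, so each partial derivative of the cost C is suppressed
% exponentially in the depth L:
\[
  \left| \frac{\partial C}{\partial \theta_k} \right|
    \;\le\; G(n)\, q^{L}, \qquad 0 < q < 1 .
\]
% Taking L = \mathrm{poly}(n) with a large enough leading coefficient
% makes the right-hand side decay exponentially in the qubit count n,
% which is the trainability barrier the paper proves.
```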
Congrats to our students @samson_wang,@EnricoFontana19 for discovering a new phenomenon, in the first paper of our 2020 school. We prove that local Pauli noise if strong enough will cause a barren plateau in cost landscape. Deep ansatzes are untrainable. https://t.co/hHrP6HJJaZ pic.twitter.com/4YrPDZ3mOC
— Plateaus Coles (@ColesQuantum) July 29, 2020
Excited to have a preprint out, the first I have been involved in!
— Samson Wang (@samson_wang) July 29, 2020
Our message for variational algorithms: keep depth linear or lower in number of qubits for hope of avoiding barren plateaus. Go deeper, and they're inevitable (asymptotically)https://t.co/YaeqSomPhC pic.twitter.com/KgTjW2OFJG
Check out our new results: https://t.co/q7tKSOGhyv
— Kunal Sharma (@kunal_phy) July 29, 2020
with @samson_wang, @EnricoFontana19, @MvsCerezo, @SoneAkira, @LCincio, @ColesQuantum.
How does the noise on NISQ devices place limitations on the trainability of variational quantum algorithms (VQAs)? pic.twitter.com/snDe6YaLrM
New work from our LANL Quantum Computing Summer School by 👉@samson_wang & @EnricoFontana19 👈 and my collaborators @kunal_phy, @SoneAkira, @LCincio, @ColesQuantum. https://t.co/syHxyN5V40
— Marco Cerezo (@MvsCerezo) July 29, 2020
Below I explain the results... 🔥VIA MEMES!🔥 pic.twitter.com/XbSGxzhO2E
Map from our latest work on Noise Induced Barren Plateaus.
— Marco Cerezo (@MvsCerezo) July 29, 2020
🗺️We have recently returned from exploring the untamed land where noise roams free and gate infidelity lurks behind every door. What we have found is a bleak land, barren of any features. https://t.co/syHxyN5V40 pic.twitter.com/JEonJRo9zv
3. Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling
Yoshiki Masuyama, Yoshiaki Bando, Kohei Yatabe, Yoko Sasaki, Masaki Onishi, Yasuhiro Oikawa
Detecting sound source objects within visual observations is important for autonomous robots to comprehend their surrounding environments. Since sounding objects vary widely and have different appearances in our living environments, labeling all sounding objects is impossible in practice. This calls for self-supervised learning, which does not require manual labeling. Most conventional self-supervised learning methods use monaural audio signals and images, and cannot distinguish sound source objects with similar appearances due to the poor spatial information in monaural audio. To solve this problem, this paper presents a self-supervised training method using 360° images and multichannel audio signals. By incorporating the spatial information in multichannel audio signals, our method trains deep neural networks (DNNs) to distinguish multiple sound source objects. Our system for localizing sound source objects in the image is composed of audio and visual DNNs. The visual DNN is trained to localize sound source candidates within an input image. The audio DNN verifies whether each candidate actually produces sound or not. These DNNs are jointly trained in a self-supervised manner based on a probabilistic spatial audio model. Experimental results with simulated data showed that the DNNs trained by our method localized multiple speakers. We also demonstrate that the visual DNN detected objects, including talking visitors and specific exhibits, from real data recorded in a science museum.
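The probabilistic spatial audio model mentioned in the abstract (a complex Gaussian mixture model, cGMM, per the authors' tweet) scores how well each candidate direction explains a multichannel frame. A toy sketch of the per-component log-likelihoods (variable names and simplifications ours, not the paper's implementation):

```python
import numpy as np

def cgmm_loglik(x, Rs, eps=1e-6):
    """Per-component log-likelihoods of one multichannel STFT frame.

    x  : complex vector of shape (M,), one frame across M microphones.
    Rs : list of K spatial covariance matrices, each (M, M), one per
         candidate sound source component.

    Each component is a zero-mean complex Gaussian; its log-likelihood
    is -log det(R) - x^H R^{-1} x (constants dropped). Normalizing these
    scores across components gives the posteriors that can supervise the
    audio/visual DNNs without manual labels.
    """
    out = []
    for R in Rs:
        R = R + eps * np.eye(R.shape[0])       # regularize for stability
        _, logdet = np.linalg.slogdet(R)
        quad = np.real(np.conj(x) @ np.linalg.solve(R, x))
        out.append(-logdet - quad)
    return np.array(out)

# A frame matching an identity covariance scores higher under it
# than under an inflated covariance:
x = np.ones(2, dtype=complex)
ll = cgmm_loglik(x, [np.eye(2, dtype=complex), 2 * np.eye(2, dtype=complex)])
print(ll[0] > ll[1])  # True
```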
Our #IROS2020 paper titled "Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling" is now available online! Our self-supervised learning is based on probabilistic inference of a multichannel audio model (cGMM).https://t.co/zi986Lz9sU pic.twitter.com/5RU2kb65Zl
— まっすー (@ymas0315) July 29, 2020
4. Toward Zero-Shot Unsupervised Image-to-Image Translation
Yuanqi Chen, Xiaoming Yu, Shan Liu, Ge Li
Recent studies have shown remarkable success in unsupervised image-to-image translation. However, when there is no access to enough images in the target classes, learning a mapping from source classes to target classes always suffers from mode collapse, which limits the application of existing methods. In this work, we propose a zero-shot unsupervised image-to-image translation framework to address this limitation, by associating categories with their side information, such as attributes. To generalize the translator to previously unseen classes, we introduce two strategies for exploiting the space spanned by the semantic attributes. Specifically, we propose to preserve semantic relations to the visual space and to expand the attribute space by utilizing attribute vectors of unseen classes, thus encouraging the translator to explore the modes of unseen classes. Quantitative and qualitative results on different datasets demonstrate the effectiveness of our proposed approach. Moreover, we demonstrate that our framework can be applied to many tasks, such as zero-shot classification and fashion design.
Toward Zero-Shot Unsupervised Image-to-Image Translation
— AK (@ak92501) July 29, 2020
pdf: https://t.co/saEvcoPpCx
abs: https://t.co/5dy2hz3TfZ pic.twitter.com/UcsJ7NxJuI
5. BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration
Augustus Odena, Kensen Shi, David Bieber, Rishabh Singh, Charles Sutton
Program synthesis is challenging largely because of the difficulty of search in a large space of programs. Human programmers routinely tackle the task of writing complex programs by writing sub-programs and then analysing their intermediate results to compose them in appropriate ways. Motivated by this intuition, we present a new synthesis approach that leverages learning to guide a bottom-up search over programs. In particular, we train a model to prioritize compositions of intermediate values during search conditioned on a given set of input-output examples. This is a powerful combination because of several emergent properties: First, in bottom-up search, intermediate programs can be executed, providing semantic information to the neural network. Second, given the concrete values from those executions, we can exploit rich features based on recent work on property signatures. Finally, bottom-up search allows the system substantial flexibility in what order to generate the solution, allowing the synthesizer to build up a program from multiple smaller sub-programs. Overall, our empirical evaluation finds that the combination of learning and bottom-up search is remarkably effective, even with simple supervised learning approaches. We demonstrate the effectiveness of our technique on a new data set for synthesis of string transformation programs.
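The search the abstract describes — enumerate programs bottom-up, execute intermediate values, and combine them until the input-output examples are satisfied — can be sketched as below. The one-operator DSL is illustrative only, and the learned model that reranks intermediate values in BUSTLE is omitted here; this shows only the bare bottom-up loop:

```python
import itertools

def bottom_up_synthesize(inputs, outputs, max_size=3):
    """Tiny bottom-up enumerative synthesizer over string expressions.

    values[size] maps a program (as source text) to the tuple of its
    concrete outputs on all examples. Executing every intermediate
    program is what gives a learned model (not included here) the
    semantic signal BUSTLE uses to prioritize the search.
    """
    # size-1 programs: the input variable and a few string literals
    values = {1: {"x": tuple(inputs)}}
    for lit in ("a", "b", "_"):
        values[1][repr(lit)] = tuple(lit for _ in inputs)

    for size in range(2, max_size + 1):
        values[size] = {}
        for s1 in range(1, size):
            s2 = size - s1
            for (p1, v1), (p2, v2) in itertools.product(
                    values[s1].items(), values.get(s2, {}).items()):
                # single op: concatenation of two smaller programs
                prog = f"({p1} + {p2})"
                vals = tuple(a + b for a, b in zip(v1, v2))
                values[size].setdefault(prog, vals)
                if vals == tuple(outputs):
                    return prog
    return None

# find a program mapping x -> x + "_" + x from two examples
print(bottom_up_synthesize(["ab", "cd"], ["ab_ab", "cd_cd"]))
```

Because smaller programs are built first and deduplicated by their concrete values, the synthesizer can assemble the final program from multiple sub-programs in whatever order the search discovers them.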
What's the fuss(le) about BUSTLE?
— augustus odena (@gstsdn) July 29, 2020
It's our new paper on program synthesis! (https://t.co/aJMmFAZRG6)
We perform bottom-up search over programs, with machine learning in the inner loop.
A thread: (1/8) pic.twitter.com/78UD7FOPAX