Hot Papers 2020-09-16

1. Efficient Transformers: A Survey

Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler

Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing, for example, Transformers have become an indispensable staple of the modern deep learning stack. Recently, a dizzying number of “X-former” models have been proposed - Reformer, Linformer, Performer, Longformer, to name a few - that improve upon the original Transformer architecture, many of them targeting computational and memory efficiency. With the aim of helping the avid researcher navigate this flurry, this paper characterizes a large and thoughtful selection of recent efficiency-flavored “X-former” models, providing an organized and comprehensive overview of existing work and models across multiple domains.
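
Many of these “X-formers” attack the quadratic cost of self-attention in sequence length. As a hedged illustration of that general idea (not a summary of any single model in the survey), the sketch below contrasts standard softmax attention with a linear-attention variant; the elu+1 feature map follows Katharopoulos et al.'s linear transformers and is an assumption of this example.

```python
# Illustrative comparison of O(n^2) softmax attention with an O(n) linear-attention
# approximation. The elu+1 feature map is one published choice (Katharopoulos et al.);
# other "X-formers" use hashing, low-rank projections, random features, etc.
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Quadratic in sequence length n: builds an n x n score matrix.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Linear in n: associate phi(K)^T V first, which is d x d and independent of n.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = k.transpose(-2, -1) @ v
    normalizer = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)
    return (q @ kv) / (normalizer + eps)

n, d = 1024, 64
q, k, v = (torch.randn(1, n, d) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```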

2. It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

Timo Schick, Hinrich Schütze

When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance on challenging natural language understanding benchmarks. In this work, we show that performance similar to GPT-3 can be obtained with language models whose parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain some form of task description, combined with gradient-based optimization; additionally exploiting unlabeled data gives further improvements. Based on our findings, we identify several key factors required for successful natural language understanding with small language models.
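
The cloze reformulation described here can be sketched concretely. In the toy example below, a sentiment input is rewritten as a cloze question and a masked language model scores a small set of “verbalizer” tokens; the pattern, the verbalizers, the sentiment task and the bert-base-uncased checkpoint are illustrative assumptions, and the gradient-based fine-tuning and unlabeled-data steps from the paper are omitted.

```python
# Hedged sketch of cloze-style classification with a masked language model.
# The pattern and verbalizers are made up for illustration; the paper's method
# also fine-tunes the model on the reformulated examples.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def classify(review, verbalizers=("great", "terrible")):
    # Wrap the input in a cloze pattern that carries the task description.
    text = f"{review} All in all, it was {tokenizer.mask_token}."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Score only the verbalizer tokens at the mask position.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    ids = [tokenizer.convert_tokens_to_ids(v) for v in verbalizers]
    return verbalizers[int(logits[0, mask_pos, ids].argmax())]

print(classify("The plot was gripping and the acting superb."))  # expected: "great"
```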

3. Old Photo Restoration via Deep Latent Space Translation

Ziyu Wan, Bo Zhang, Dongdong Chen, Pan Zhang, Dong Chen, Jing Liao, Fang Wen

  • retweets: 41, favorites: 194 (09/17/2020 09:55:43)
  • links: abs | pdf
  • cs.CV | cs.GR

We propose to restore old photos that suffer from severe degradation through a deep learning approach. Unlike conventional restoration tasks that can be solved through supervised learning, the degradation in real photos is complex, and the domain gap between synthetic images and real old photos makes networks trained only on synthetic data fail to generalize. Therefore, we propose a novel triplet domain translation network by leveraging real photos along with massive synthetic image pairs. Specifically, we train two variational autoencoders (VAEs) to respectively transform old photos and clean photos into two latent spaces, and the translation between these two latent spaces is learned with synthetic paired data. This translation generalizes well to real photos because the domain gap is closed in the compact latent space. Moreover, to address the multiple degradations mixed in one old photo, we design a global branch with a partial nonlocal block targeting structured defects, such as scratches and dust spots, and a local branch targeting unstructured defects, such as noise and blurriness. The two branches are fused in the latent space, leading to improved capability to restore old photos from multiple defects. Furthermore, we apply another face refinement network to recover fine details of faces in the old photos, ultimately generating photos with enhanced perceptual quality. With comprehensive experiments, the proposed pipeline demonstrates superior performance over state-of-the-art methods as well as existing commercial tools in terms of visual quality for old photo restoration.
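
A minimal sketch of the latent-translation idea, assuming toy network sizes: two VAEs embed degraded and clean photos into compact latent spaces, and a small mapping network (trained on synthetic pairs in the paper) translates between them before decoding. The layer shapes, latent dimensionality and training details below are illustrative assumptions, not the authors' implementation, and the partial nonlocal block and face refinement stages are omitted.

```python
# Toy PyTorch sketch of "restore by translating between two VAE latent spaces".
# All sizes are illustrative; the real system adds global/local branches and a
# face refinement network.
import torch
import torch.nn as nn

class SmallVAE(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.to_mu = nn.Conv2d(64, latent_dim, 1)
        self.to_logvar = nn.Conv2d(64, latent_dim, 1)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization

    def decode(self, z):
        return self.dec(z)

vae_old, vae_clean = SmallVAE(), SmallVAE()          # one VAE per domain
mapping = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 64, 3, padding=1))  # latent-to-latent translation

def restore(old_photo):
    z_old = vae_old.encode(old_photo)
    z_clean = mapping(z_old)        # translation happens in the compact latent space
    return vae_clean.decode(z_clean)

print(restore(torch.rand(1, 3, 64, 64)).shape)  # torch.Size([1, 3, 64, 64])
```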

4. Current Limitations of Language Models: What You Need is Retrieval

Aran Komatsuzaki

  • retweets: 36, favorites: 177 (09/17/2020 09:55:43)
  • links: abs | pdf
  • cs.CL | cs.LG

We classify and re-examine some of the current approaches to improve the performance-compute trade-off of language models, including (1) non-causal models (such as masked language models), (2) extension of batch length with efficient attention, (3) recurrence, (4) conditional computation and (5) retrieval. We identify some limitations that (1)-(4) suffer from. For example, (1) currently struggles with open-ended text generation where the output is only loosely constrained by the input, as well as with performing general textual tasks in the manner of GPT-2/3, due to its need for a specific fine-tuning dataset. (2) and (3) do not improve the prediction of the first ~10^3 tokens. Scaling up model size (e.g. efficiently with (4)) still results in poor performance scaling for some tasks. We argue that (5) would resolve many of these limitations, since it can (a) reduce the amount of supervision and (b) efficiently extend the context over the entire training dataset and the entire past of the current sample. We speculate how to modify MARGE to perform unsupervised causal modeling that achieves (b) with the retriever jointly trained.
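
Point (5) can be illustrated with a deliberately tiny sketch: retrieve the most similar passages from a large corpus and prepend them to the prompt, so the effective context extends over the whole dataset rather than a fixed window. The bag-of-words retriever and toy corpus below are stand-ins; MARGE-style systems learn the retriever jointly with the language model.

```python
# Toy retrieval-augmented prompting: a bag-of-words cosine retriever selects
# context passages to prepend before the language model sees the prompt.
from collections import Counter
import math

corpus = [
    "The Reformer reduces attention cost with locality-sensitive hashing.",
    "Retrieval-augmented models condition on passages fetched from a datastore.",
    "BLEU measures n-gram overlap between a candidate translation and references.",
]

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm + 1e-9)

def retrieve(query, k=1):
    q = bow(query)
    return sorted(corpus, key=lambda doc: cosine(q, bow(doc)), reverse=True)[:k]

prompt = "How do retrieval-augmented language models work?"
context = " ".join(retrieve(prompt)) + "\n" + prompt  # fed to the LM instead of the bare prompt
print(context)
```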

5. Iterative Refinement in the Continuous Space for Non-Autoregressive Neural Machine Translation

Jason Lee, Raphael Shu, Kyunghyun Cho

  • retweets: 25, favorites: 126 (09/17/2020 09:55:43)
  • links: abs | pdf
  • cs.CL

We propose an efficient inference procedure for non-autoregressive machine translation that iteratively refines translations purely in the continuous space. Given a continuous latent variable model for machine translation (Shu et al., 2020), we train an inference network to approximate the gradient of the marginal log probability of the target sentence, using only the latent variable as input. This allows us to use gradient-based optimization to find, at inference time, the target sentence that approximately maximizes its marginal probability. As each refinement step only involves computation in a latent space of low dimensionality (we use 8 in our experiments), we avoid the computational overhead incurred by existing non-autoregressive inference procedures that often refine in token space. We compare our approach to a recently proposed EM-like inference procedure (Shu et al., 2020) that optimizes in a hybrid space consisting of both discrete and continuous variables. We evaluate our approach on WMT’14 En-De, WMT’16 Ro-En and IWSLT’16 De-En, and observe two advantages over the EM-like inference: (1) it is computationally efficient, i.e. each refinement step is twice as fast, and (2) it is more effective, resulting in higher marginal probabilities and BLEU scores with the same number of refinement steps. On WMT’14 En-De, for instance, our approach is able to decode 6.2 times faster than the autoregressive model with minimal degradation in translation quality (0.9 BLEU).
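
The inference loop itself is simple to sketch: start from an initial latent for the target, repeatedly add a predicted update (the paper trains an inference network to approximate the gradient of the marginal log-probability), and decode non-autoregressively at the end. In the toy code below the update and decoder networks are untrained stand-ins and the dimensions are illustrative; only the control flow mirrors the described procedure.

```python
# Skeleton of gradient-style iterative refinement in a low-dimensional latent space.
# delta_net stands in for the trained inference network; decoder stands in for the
# non-autoregressive decoder.
import torch
import torch.nn as nn

latent_dim, seq_len, vocab = 8, 6, 1000   # the paper uses an 8-dimensional latent per position
delta_net = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, latent_dim))
decoder = nn.Linear(latent_dim, vocab)

def refine(z, steps=4, step_size=1.0):
    for _ in range(steps):
        z = z + step_size * delta_net(z)   # approximate gradient ascent on log p(y)
    return decoder(z).argmax(dim=-1)       # decode all target tokens in parallel

z0 = torch.randn(1, seq_len, latent_dim)   # e.g. sampled given the source sentence
print(refine(z0).shape)                    # torch.Size([1, 6])
```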

6. Sustained Online Amplification of COVID-19 Elites in the United States

Ryan J. Gallagher, Larissa Doroshenko, Sarah Shugars, David Lazer, Brooke Foucault Welles

The ongoing, fluid nature of the COVID-19 pandemic requires individuals to regularly seek information about best health practices, local community spreading, and public health guidelines. In the absence of a unified response to the pandemic in the United States and clear, consistent directives from federal and local officials, people have used social media to collectively crowdsource COVID-19 elites, a small set of trusted COVID-19 information sources. We take a census of COVID-19 crowdsourced elites in the United States who have received sustained attention on Twitter during the pandemic. Using a mixed methods approach with a panel of Twitter users linked to public U.S. voter registration records, we find that journalists, media outlets, and political accounts have been consistently amplified around COVID-19, while epidemiologists, public health officials, and medical professionals make up only a small portion of all COVID-19 elites on Twitter. We show that COVID-19 elites vary considerably across demographic groups, and that there are notable racial, geographic, and political similarities and disparities between various groups and the demographics of their elites. With this variation in mind, we discuss the potential for using the disproportionate online voice of crowdsourced COVID-19 elites to equitably promote timely public health information and mitigate rampant misinformation.

7. The Cost of Software-Based Memory Management Without Virtual Memory

Drew Zagieboylo, G. Edward Suh, Andrew C. Myers

  • retweets: 15, favorites: 69 (09/17/2020 09:55:44)
  • links: abs | pdf
  • cs.AR | cs.PL

Virtual memory has been a standard hardware feature for more than three decades. At the price of increased hardware complexity, it has simplified software and promised strong isolation among colocated processes. In modern computing systems, however, the costs of virtual memory have increased significantly. With large memory workloads, virtualized environments, data center computing, and chips with multiple DMA devices, virtual memory can degrade performance and increase power usage. We therefore explore the implications of building applications and operating systems without relying on hardware support for address translation. Primarily, we investigate the implications of removing the abstraction of large contiguous memory segments. Our experiments show that the overhead to remove this reliance is surprisingly small for real programs. We expect this small overhead to be worth the benefit of reducing the complexity and energy usage of address translation. In fact, in some cases, performance can even improve when address translation is avoided.

8. The Importance of Pessimism in Fixed-Dataset Policy Optimization

Jacob Buckman, Carles Gelada, Marc G. Bellemare

  • retweets: 9, favorites: 45 (09/17/2020 09:55:44)
  • links: abs | pdf
  • cs.AI | cs.LG

We study worst-case guarantees on the expected return of fixed-dataset policy optimization algorithms. Our core contribution is a unified conceptual and mathematical framework for the study of algorithms in this regime. This analysis reveals that for naive approaches, the possibility of erroneous value overestimation leads to a difficult-to-satisfy requirement: in order to guarantee that we select a policy which is near-optimal, we may need the dataset to be informative of the value of every policy. To avoid this, algorithms can follow the pessimism principle, which states that we should choose the policy which acts optimally in the worst possible world. We show why pessimistic algorithms can achieve good performance even when the dataset is not informative of every policy, and derive families of algorithms which follow this principle. These theoretical findings are validated by experiments on a tabular gridworld, and deep learning experiments on four MinAtar environments.
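
The pessimism principle lends itself to a small worked example. In the hedged bandit-style sketch below, a value estimate built from a fixed dataset subtracts a penalty that grows as coverage shrinks, so an action seen once with a lucky reward no longer wins; the 1/sqrt(count) penalty and the toy data are illustrative choices, not the paper's specific algorithms.

```python
# Naive vs. pessimistic value estimation from a fixed dataset of (action, reward) pairs.
# Action "a" is well covered; action "b" appears once with a lucky reward.
import math

dataset = [("a", 0.5)] * 50 + [("a", 0.6)] * 50 + [("b", 1.3)]

def value_estimates(data, pessimism=0.0):
    estimates = {}
    for action in {a for a, _ in data}:
        rewards = [r for a, r in data if a == action]
        mean = sum(rewards) / len(rewards)
        # Penalize poorly-covered actions: act as if the worst plausible world holds.
        estimates[action] = mean - pessimism / math.sqrt(len(rewards))
    return estimates

naive = value_estimates(dataset, pessimism=0.0)
pessimistic = value_estimates(dataset, pessimism=1.0)
print(max(naive, key=naive.get))              # "b": fooled by a single optimistic sample
print(max(pessimistic, key=pessimistic.get))  # "a": the well-covered action is preferred
```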