1. An Attention Free Transformer
Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, Josh Susskind
We introduce Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for dot product self attention. In an AFT layer, the key and value are first combined with a set of learned position biases, the result of which is multiplied with the query in an element-wise fashion. This new operation has a memory complexity linear w.r.t. both the context size and the dimension of features, making it compatible with both large input and model sizes. We also introduce AFT-local and AFT-conv, two model variants that take advantage of the ideas of locality and spatial weight sharing while maintaining global connectivity. We conduct extensive experiments on two autoregressive modeling tasks (CIFAR10 and Enwik8) as well as an image recognition task (ImageNet-1K classification). We show that AFT demonstrates competitive performance on all the benchmarks, while providing excellent efficiency at the same time.
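As a rough sketch of the operation (not the authors' code), the AFT-full layer can be written in a few lines of PyTorch. Here `pos_bias` is the learned T x T position-bias matrix w, and the loop-free form below materializes a T x T x d tensor for clarity; the paper computes the same quantity with memory linear in T and d.

```python
import torch

def aft_full(q, k, v, pos_bias):
    """Naive reference sketch of AFT-full. q, k, v: (T, d); pos_bias: (T, T)."""
    # Combine keys with the learned position biases and exponentiate:
    # weights[t, t', :] = exp(k[t'] + w[t, t'])
    weights = torch.exp(k.unsqueeze(0) + pos_bias.unsqueeze(-1))   # (T, T, d)
    # Position-weighted average of values, gated element-wise by the query.
    num = (weights * v.unsqueeze(0)).sum(dim=1)                    # (T, d)
    den = weights.sum(dim=1)                                       # (T, d)
    return torch.sigmoid(q) * num / den
```

A real implementation would subtract a running max before exponentiating for numerical stability; that detail is omitted here.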
An Attention Free Transformer
— AK (@ak92501) June 1, 2021
pdf: https://t.co/iOURQMubTR
abs: https://t.co/6TSsVXmjww
an efficient variant of Transformers that eliminates the need for dot product self attention pic.twitter.com/ZfaIbdmvnL
2. Ten Quick Tips for Deep Learning in Biology
Benjamin D. Lee, Anthony Gitter, Casey S. Greene, Sebastian Raschka, Finlay Maguire, Alexander J. Titus, Michael D. Kessler, Alexandra J. Lee, Marc G. Chevrette, Paul Allen Stewart, Thiago Britto-Borges, Evan M. Cofer, Kun-Hsing Yu, Juan Jose Carmona, Elana J. Fertig, Alexandr A. Kalinin, Beth Signal, Benjamin J. Lengerich, Timothy J. Triche Jr, Simina M. Boca
Machine learning is a modern approach to problem-solving and task automation. In particular, machine learning is concerned with the development and applications of algorithms that can recognize patterns in data and use them for predictive modeling. Artificial neural networks are a particular class of machine learning algorithms and models that evolved into what is now described as deep learning. Given the computational advances made in the last decade, deep learning can now be applied to massive data sets and in innumerable contexts. Therefore, deep learning has become its own subfield of machine learning. In the context of biological research, it has been increasingly used to derive novel insights from high-dimensional biological data. To make the biological applications of deep learning more accessible to scientists who have some experience with machine learning, we solicited input from a community of researchers with varied biological and deep learning interests. These individuals collaboratively contributed to this manuscript’s writing using the GitHub version control platform and the Manubot manuscript generation toolset. The goal was to articulate a practical, accessible, and concise set of guidelines and suggestions to follow when using deep learning. In the course of our discussions, several themes became clear: the importance of understanding and applying machine learning fundamentals as a baseline for utilizing deep learning, the necessity for extensive model comparisons with careful evaluation, and the need for critical thought in interpreting results generated by deep learning, among others.
I was happy to contribute to the collaborative manuscript Ten Quick Tips for Deep Learning in Biology https://t.co/8XQNw3YZk6 1/ pic.twitter.com/79SD2qQX7J
— Anthony Gitter (@anthonygitter) June 1, 2021
3. Towards mental time travel: a hierarchical memory for reinforcement learning agents
Andrew Kyle Lampinen, Stephanie C.Y. Chan, Andrea Banino, Felix Hill
Reinforcement learning agents often forget details of the past, especially after delays or distractor tasks. Agents with common memory architectures struggle to recall and integrate across multiple timesteps of a past event, or even to recall the details of a single timestep that is followed by distractor tasks. To address these limitations, we propose a Hierarchical Transformer Memory (HTM), which helps agents to remember the past in detail. HTM stores memories by dividing the past into chunks, and recalls by first performing high-level attention over coarse summaries of the chunks, and then performing detailed attention within only the most relevant chunks. An agent with HTM can therefore “mentally time-travel” — remember past events in detail without attending to all intervening events. We show that agents with HTM substantially outperform agents with other memory architectures at tasks requiring long-term recall, retention, or reasoning over memory. These include recalling where an object is hidden in a 3D environment, rapidly learning to navigate efficiently in a new neighborhood, and rapidly learning and retaining new object names. Agents with HTM can extrapolate to task sequences an order of magnitude longer than they were trained on, and can even generalize zero-shot from a meta-learning setting to maintaining knowledge across episodes. HTM improves agent sample efficiency, generalization, and generality (by solving tasks that previously required specialized architectures). Our work is a step towards agents that can learn, interact, and adapt in complex and temporally-extended environments.
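A minimal single-query sketch of the two-level read described above (shapes and names are illustrative; the real model uses learned projections and multi-head attention):

```python
import torch

def htm_read(query, summaries, chunks, top_k=2):
    """query: (d,); summaries: (C, d) coarse chunk summaries;
    chunks: (C, L, d) detailed memories, L timesteps per chunk."""
    coarse = summaries @ query                          # high-level relevance per chunk
    top = coarse.topk(top_k).indices                    # keep only the best chunks
    detail = chunks[top].reshape(-1, chunks.size(-1))   # (top_k * L, d)
    attn = torch.softmax(detail @ query, dim=0)         # detailed attention within them
    return attn @ detail                                # (d,) memory readout
```

The key point is that attention cost scales with the number of chunks plus top_k * L, not with the full episode length.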
How can RL agents recall the past in detail, in order to behave appropriately in the present? In our new preprint "Towards mental time travel: A hierarchical memory for RL agents" (https://t.co/7jk8DoOSnB) we propose a memory architecture that steps in this direction.
— Andrew Lampinen (@AndrewLampinen) June 1, 2021
4. Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length
Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, Gao Huang
Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens would lead to higher prediction accuracy, while it also results in drastically increased computational cost. To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16. In this paper, we argue that every image has its own characteristics, and ideally the token number should be conditioned on each individual input. In fact, we have observed that there exist a considerable number of "easy" images which can be accurately predicted with a mere 4x4 tokens, while only a small fraction of "hard" ones need a finer representation. Inspired by this phenomenon, we propose a Dynamic Transformer to automatically configure a proper number of tokens for each input image. This is achieved by cascading multiple Transformers with increasing numbers of tokens, which are sequentially activated in an adaptive fashion at test time, i.e., the inference is terminated once a sufficiently confident prediction is produced. We further design efficient feature reuse and relationship reuse mechanisms across different components of the Dynamic Transformer to reduce redundant computations. Extensive empirical results on ImageNet, CIFAR-10, and CIFAR-100 demonstrate that our method significantly outperforms the competitive baselines in terms of both theoretical computational efficiency and practical inference speed.
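At test time the cascade reduces to an early-exit loop; a hypothetical sketch (the actual model also reuses features and relations between stages, which is omitted here):

```python
import torch

def dynamic_infer(image, cascade, threshold=0.9):
    """cascade: classifiers ordered coarse-to-fine, e.g. 4x4 -> 14x14 token grids."""
    probs = None
    for model in cascade:
        probs = torch.softmax(model(image), dim=-1)
        if probs.max() >= threshold:        # confident enough: exit early,
            break                           # so "easy" images stop at coarse tokens
    return probs.argmax()
```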
Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length
— Aran Komatsuzaki (@arankomatsuzaki) June 1, 2021
Observes that there exist many "easy" images which can be predicted with 4x4 tokens, while only a small fraction of "hard" ones need a finer representation. https://t.co/r4pZeYQVjb pic.twitter.com/vtiZ0fjtYP
Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length
— AK (@ak92501) June 1, 2021
pdf: https://t.co/IUYCHGPg1t
abs: https://t.co/4laCgMDY6v pic.twitter.com/aSsYdUZP2H
5. Consistency Regularization for Variational Auto-Encoders
Samarth Sinha, Adji B. Dieng
Variational auto-encoders (VAEs) are a powerful approach to unsupervised learning. They enable scalable approximate posterior inference in latent-variable models using variational inference (VI). A VAE posits a variational family parameterized by a deep neural network called an encoder that takes data as input. This encoder is shared across all the observations, which amortizes the cost of inference. However, the encoder of a VAE has the undesirable property that it maps a given observation and a semantics-preserving transformation of it to different latent representations. This "inconsistency" of the encoder lowers the quality of the learned representations, especially for downstream tasks, and also negatively affects generalization. In this paper, we propose a regularization method to enforce consistency in VAEs. The idea is to minimize the Kullback-Leibler (KL) divergence between the variational distribution when conditioning on the observation and the variational distribution when conditioning on a random semantics-preserving transformation of this observation. This regularization is applicable to any VAE. In our experiments we apply it to four different VAE variants on several benchmark datasets and find that it not only improves the quality of the learned representations but also leads to better generalization. In particular, when applied to the Nouveau Variational Auto-Encoder (NVAE), our regularization method yields state-of-the-art performance on MNIST and CIFAR-10. We also applied our method to 3D data and found it learns representations of superior quality as measured by accuracy on a downstream classification task.
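For Gaussian encoders the regularizer is a closed-form KL between the two variational posteriors; a minimal sketch, assuming a hypothetical `encoder` that returns mean and log-variance and an `augment` callable implementing any semantics-preserving transform:

```python
import torch
import torch.distributions as D

def consistency_loss(encoder, x, augment):
    """KL( q(z|x) || q(z|augment(x)) ), added to the usual ELBO objective."""
    mu1, logvar1 = encoder(x)
    mu2, logvar2 = encoder(augment(x))
    q1 = D.Normal(mu1, (0.5 * logvar1).exp())
    q2 = D.Normal(mu2, (0.5 * logvar2).exp())
    return D.kl_divergence(q1, q2).sum(dim=-1).mean()
```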
We propose a very simple consistency regularization method for VAEs that achieves state-of-the-art on CIFAR-10 & MNIST when used w/ NVAE.
— Adji Bousso Dieng (@adjiboussodieng) June 1, 2021
We applied it to 3D data where it also helps! Start applying it to your favorite VAE. Great work by @_sam_sinha_ https://t.co/lkgmoQWS3N pic.twitter.com/Nvu3aZd6lS
6. Gotta Go Fast When Generating Data with Score-Based Models
Alexia Jolicoeur-Martineau, Ke Li, Rémi Piché-Taillefer, Tal Kachman, Ioannis Mitliagkas
- retweets: 742, favorites: 166 (06/02/2021 09:38:12)
- links: abs | pdf
- cs.LG | cs.CV | math.OC | stat.ML
Score-based (denoising diffusion) generative models have recently gained a lot of success in generating realistic and diverse data. These approaches define a forward diffusion process for transforming data to noise and generate data by reversing it (thereby going from noise to data). Unfortunately, current score-based models generate data very slowly due to the sheer number of score network evaluations required by numerical SDE solvers. In this work, we aim to accelerate this process by devising a more efficient SDE solver. Existing approaches rely on the Euler-Maruyama (EM) solver, which uses a fixed step size. We found that naively replacing it with other SDE solvers fares poorly - they either result in low-quality samples or become slower than EM. To get around this issue, we carefully devise an SDE solver with adaptive step sizes tailored to score-based generative models piece by piece. Our solver requires only two score function evaluations, rarely rejects samples, and leads to high-quality samples. Our approach generates data 2 to 10 times faster than EM while achieving better or equal sample quality. For high-resolution images, our method leads to significantly higher quality samples than all other methods tested. Our SDE solver has the benefit of requiring no step size tuning.
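The core mechanism is classical accept/reject step-size control. The sketch below is generic (the paper tailors the error estimate, norm, and safety factors to score-based SDEs, with only two score evaluations per step):

```python
import numpy as np

def adaptive_step(x, t, h, step_low, step_high, tol=1e-2):
    """One adaptive solver step: compare a cheap low-order proposal with a
    higher-order one, accept if the local error estimate is small enough."""
    x_lo, x_hi = step_low(x, t, h), step_high(x, t, h)
    # Mixed absolute/relative local error estimate.
    err = np.max(np.abs(x_hi - x_lo) / (tol + tol * np.abs(x_hi)))
    err = max(float(err), 1e-12)                     # guard against err == 0
    if err <= 1.0:                                   # accept, maybe grow h
        return x_hi, t + h, h * min(2.0, 0.9 * err ** -0.5)
    return x, t, h * max(0.1, 0.9 * err ** -0.5)     # reject, shrink h
```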
New paper is out! 😻 We show how to generate high-quality data as fast as possible with score-based (diffusion) models! 🏃🏻💨💨
— Alexia Jolicoeur-Martineau (@jm_alexia) June 1, 2021
Blog: https://t.co/24G39KidOJ
Paper: https://t.co/e7G4ygh2Ho
Code: https://t.co/JQd2rCkAoO
Work with @KL_Div @Remi29048827 @TalKachman @bouzoukipunks
7. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo
We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes which leads to decreased performance when the testing resolution differs from training. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, thus combining both local and global attention to render powerful representations. We show that this simple and lightweight design is the key to efficient segmentation with Transformers. We scale our approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, reaching significantly better performance and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5x smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C. Code will be released at: github.com/NVlabs/SegFormer.
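A sketch of the all-MLP decoder idea, with assumed B0-like stage widths and the 1x1 convolutions standing in for per-pixel linear layers (a schematic, not the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoderSketch(nn.Module):
    """Project each encoder stage to a common width, upsample everything to
    the finest stage's resolution, concatenate, fuse, and predict classes."""
    def __init__(self, in_dims=(32, 64, 160, 256), dim=256, num_classes=150):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(d, dim, 1) for d in in_dims)
        self.fuse = nn.Conv2d(len(in_dims) * dim, dim, 1)
        self.head = nn.Conv2d(dim, num_classes, 1)

    def forward(self, feats):            # feats: list of (B, C_i, H_i, W_i), fine to coarse
        size = feats[0].shape[2:]        # 1/4-resolution target
        ups = [F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
               for f, p in zip(feats, self.proj)]
        return self.head(self.fuse(torch.cat(ups, dim=1)))
```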
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
— Aran Komatsuzaki (@arankomatsuzaki) June 1, 2021
Proposes SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with MLP decoders.
abs: https://t.co/29QILx5ZCq
code: https://t.co/hEgDYzmrBs pic.twitter.com/ooawlbis8Y
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
— AK (@ak92501) June 1, 2021
pdf: https://t.co/sTRSj4gieF
abs: https://t.co/fKTPFHwhMQ
efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders pic.twitter.com/plF2Geic4j
8. Less is More: Pay Less Attention in Vision Transformers
Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu, Jianfei Cai
Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works can be prohibitively expensive due to the quadratic complexity of self-attention over a long sequence of representations, especially for high-resolution dense prediction tasks. To this end, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that convolutions, fully-connected (FC) layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences. Specifically, we propose a hierarchical Transformer where we use pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages while applying self-attention modules to capture longer dependencies in deeper layers. Moreover, we further propose a learned deformable token merging module to adaptively fuse informative patches in a non-uniform manner. The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation, serving as a strong backbone for many vision tasks.
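A schematic of the stage recipe (pure MLP blocks in early, high-resolution stages; attention blocks later), using stock PyTorch layers as placeholders rather than the paper's exact blocks:

```python
import torch.nn as nn

class LITStageSketch(nn.Module):
    """Early stages: cheap MLP blocks on many tokens. Later stages:
    self-attention blocks on fewer, merged tokens."""
    def __init__(self, dim, depth, use_attention):
        super().__init__()
        self.use_attention = use_attention
        if use_attention:
            self.blocks = nn.ModuleList(
                nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
                for _ in range(depth))
        else:
            self.blocks = nn.ModuleList(
                nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                              nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(depth))

    def forward(self, x):                # x: (B, N, dim) patch tokens
        for blk in self.blocks:
            x = blk(x) if self.use_attention else x + blk(x)  # MLP blocks add a residual
        return x
```

The learned deformable token merging module between stages is omitted here.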
Less is More: Pay Less Attention in Vision Transformers
— AK (@ak92501) June 1, 2021
pdf: https://t.co/ydo2bFvxsH
abs: https://t.co/baTSDrBpEd
hierarchical vision transformer pays less attention in early stages to ease huge computational cost of self-attention modules over high-resolution representations pic.twitter.com/6J5xdAO0mc
9. StyTr^2: Unbiased Image Style Transfer with Transformers
Yingying Deng, Fan Tang, Xingjia Pan, Weiming Dong, Chongyang Ma, Changsheng Xu
The goal of image style transfer is to render an image with artistic features guided by a style reference while maintaining the original content. Due to the locality and spatial invariance in CNNs, it is difficult to extract and maintain the global information of input images. Therefore, traditional neural style transfer methods are usually biased, and content leak can be observed by running the style transfer process several times with the same reference style image. To address this critical issue, we take long-range dependencies of input images into account for unbiased style transfer by proposing a transformer-based approach, namely StyTr^2. In contrast with visual transformers for other vision tasks, our StyTr^2 contains two different transformer encoders to generate domain-specific sequences for content and style, respectively. Following the encoders, a multi-layer transformer decoder is adopted to stylize the content sequence according to the style sequence. In addition, we analyze the deficiency of existing positional encoding methods and propose the content-aware positional encoding (CAPE), which is scale-invariant and more suitable for the image style transfer task. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed StyTr^2 compared to state-of-the-art CNN-based and flow-based approaches.
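A schematic of the encoder/decoder wiring using stock PyTorch layers (dimensions and depths are placeholders; the paper's CAPE positional encoding is omitted):

```python
import torch.nn as nn

class StyTr2Sketch(nn.Module):
    """Two domain-specific encoders; the decoder stylizes the content
    sequence by cross-attending to the style sequence."""
    def __init__(self, dim=256, heads=8, depth=3):
        super().__init__()
        enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        self.content_enc, self.style_enc = enc(), enc()
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), depth)

    def forward(self, content_tokens, style_tokens):  # (B, N, dim) each
        c = self.content_enc(content_tokens)
        s = self.style_enc(style_tokens)
        return self.decoder(tgt=c, memory=s)          # content attends to style
```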
StyTr^2: Unbiased Image Style Transfer with Transformers
— AK (@ak92501) June 1, 2021
pdf: https://t.co/H3OraPsolh
abs: https://t.co/PM8dZiCuct pic.twitter.com/8ld0m4SDyN
10. How Attentive are Graph Attention Networks?
Shaked Brody, Uri Alon, Eran Yahav
Graph Attention Networks (GATs) are one of the most popular GNN architectures and are considered the state-of-the-art architecture for representation learning with graphs. In GAT, every node attends to its neighbors given its own representation as the query. However, in this paper we show that GATs can only compute a restricted kind of attention where the ranking of attended nodes is unconditioned on the query node. We formally define this restricted kind of attention as static attention and distinguish it from a strictly more expressive dynamic attention. Because GATs use a static attention mechanism, there are simple graph problems that GAT cannot express: in a controlled problem, we show that static attention hinders GAT from even fitting the training data. To remove this limitation, we introduce a simple fix by modifying the order of operations and propose GATv2: a dynamic graph attention variant that is strictly more expressive than GAT, while matching its parametric costs. We perform an extensive evaluation and show that GATv2 outperforms GAT across 11 OGB and other benchmarks. Our code is available at https://github.com/tech-srl/how_attentive_are_gats .
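The "order of operations" fix is easiest to see in the two scoring functions; a single-head sketch without softmax normalization, where `a` is the learned attention vector and `W` the learned weight matrix:

```python
import torch
import torch.nn.functional as F

def gat_score(a, W, h_i, h_j):
    # GAT (static): a is applied to the concatenated, transformed features
    # before the nonlinearity resolves, so the ranking over neighbors j
    # ends up being the same for every query node i.
    return F.leaky_relu(a @ torch.cat([W @ h_i, W @ h_j]))

def gatv2_score(a, W, h_i, h_j):
    # GATv2 (dynamic): applying a after the nonlinearity lets the
    # attention ranking change with the query node.
    return a @ F.leaky_relu(W @ torch.cat([h_i, h_j]))
```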
Important read of the day: GATv2 (Brody, @urialon1, @yahave): https://t.co/p8SXyQF7Go
— Petar Veličković (@PetarV_93) June 1, 2021
The exact attention mechanism I used in the GAT paper was intentionally 'weakened' to make it work on the easy-to-overfit datasets of the time. It was never meant to be a 'silver bullet'... 1/2 pic.twitter.com/TUg13PCkBq
11. On the Bias Against Inductive Biases
George Cazenavette, Simon Lucey
Borrowing from the transformer models that revolutionized the field of natural language processing, self-supervised feature learning for visual tasks has also seen state-of-the-art success using these extremely deep, isotropic networks. However, the typical AI researcher does not have the resources to evaluate, let alone train, a model with several billion parameters and quadratic self-attention activations. To facilitate further research, it is necessary to understand the features of these huge transformer models that can be adequately studied by the typical researcher. One interesting characteristic of these transformer models is that they remove most of the inductive biases present in classical convolutional networks. In this work, we analyze the effect of these and more inductive biases on small to moderately-sized isotropic networks used for unsupervised visual feature learning and show that their removal is not always ideal.
On the Bias Against Inductive Biases
— AK (@ak92501) June 1, 2021
pdf: https://t.co/3Ry5kdgEAR
abs: https://t.co/OjYOuQAcFk
a convolutional architecture with a continuous input outperforms using a transformer with one-hot input on small-scale networks, in contrast to large-scale networks like image-GPT pic.twitter.com/1a7gTlpTgm
12. Exploring Sparse Expert Models and Beyond
An Yang, Junyang Lin, Rui Men, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Jiamang Wang, Yong Li, Di Zhang, Wei Lin, Lin Qu, Jingren Zhou, Hongxia Yang
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost, which has made them a trend in model scaling. Still, it is a mystery how MoE layers bring quality gains by leveraging the parameters with sparse activation. In this work, we investigate several key factors in sparse expert models. We observe that load imbalance may not be a significant problem affecting model quality, contrary to the perspectives of recent studies, while the number of sparsely activated experts and the expert capacity in top-k routing can significantly make a difference in this context. Furthermore, we take a step forward to propose a simple method called expert prototyping that splits experts into different prototypes and applies top-1 routing within each prototype. This strategy improves the model quality but maintains constant computational costs, and our further exploration on extremely large-scale models reflects that it is more effective in training larger models. We push the model scale to over one trillion parameters and implement it solely on 480 NVIDIA V100-32GB GPUs, in comparison with the recent SOTA Switch Transformer on TPUs. The proposed giant model achieves substantial speedup in convergence over the same-size baseline.
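A hypothetical sketch of expert prototyping as described above: split the experts into groups ("prototypes") and route top-1 within each group. The `gates` matrix, `num_prototypes` value, and tensor shapes are illustrative, not the authors' implementation:

```python
import torch

def expert_prototype_route(x, gates, num_prototypes=4):
    """x: (tokens, d); gates: (d, E) routing weights for E experts.
    Returns one (expert index, gate weight) pair per prototype, per token."""
    logits = x @ gates                             # (tokens, E)
    groups = logits.chunk(num_prototypes, dim=-1)  # E/num_prototypes experts each
    routes = []
    for g, group_logits in enumerate(groups):
        probs = torch.softmax(group_logits, dim=-1)
        weight, idx = probs.max(dim=-1)            # top-1 expert in this prototype
        routes.append((idx + g * group_logits.size(-1), weight))
    return routes
```

Each token is thus processed by num_prototypes experts in total, keeping the activated compute constant as the total expert count grows.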
Exploring Sparse Expert Models and Beyond
— Aran Komatsuzaki (@arankomatsuzaki) June 1, 2021
Proposes some modifications to MoE, which improves the perf-compute trade-off. The proposed giant model (evaluated up to 1T params) achieves substantial speedup in convergence over the same-size baseline. https://t.co/IGNuWgs5DB pic.twitter.com/PI96PrM11t
Exploring Sparse Expert Models and Beyond
— AK (@ak92501) June 1, 2021
pdf: https://t.co/sDGT8yDZms
abs: https://t.co/WJihYIf3RN
1T model on 480 V100-32GB, compared with Switch Transformer implemented on 2048 TPUs, can improve the performance of 1T sparse expert models effectively and speedup convergence pic.twitter.com/sX2OJTpil2
13. The Dark Machines Anomaly Score Challenge: Benchmark Data and Model Independent Event Classification for the Large Hadron Collider
T. Aarrestad, M. van Beekveld, M. Bona, A. Boveia, S. Caron, J. Davies, A. De Simone, C. Doglioni, J.M. Duarte, A. Farbin, H. Gupta, L. Hendriks, L. Heinrich, J. Howarth, P. Jawahar, A. Jueid, J. Lastow, A. Leinweber, J. Mamuzic, E. Merényi, A. Morandini, P. Moskvitina, C. Nellist, J. Ngadiuba, B. Ostdiek, M. Pierini, B. Ravina, R. Ruiz de Austri, S. Sekmen, M. Touranakou, M. Vaškevičiūte, R. Vilalta, J.R. Vlimant, R. Verheyen, M. White, E. Wulff, E. Wallin, K.A. Wozniak, Z. Zhang
- retweets: 155, favorites: 63 (06/02/2021 09:38:14)
- links: abs | pdf
- hep-ph | hep-ex | physics.data-an | stat.ML
We describe the outcome of a data challenge conducted as part of the Dark Machines Initiative and the Les Houches 2019 workshop on Physics at TeV colliders. The challenge aims at detecting signals of new physics at the LHC using unsupervised machine learning algorithms. First, we propose how an anomaly score could be implemented to define model-independent signal regions in LHC searches. We define and describe a large benchmark dataset, consisting of >1 billion simulated LHC events of proton-proton collisions at a center-of-mass energy of 13 TeV. We then review a wide range of anomaly detection and density estimation algorithms, developed in the context of the data challenge, and we measure their performance in a set of realistic analysis environments. We draw a number of useful conclusions that will aid the development of unsupervised new physics searches during the third run of the LHC, and provide our benchmark dataset for future studies at https://www.phenoMLdata.org. Code to reproduce the analysis is provided at https://github.com/bostdiek/DarkMachines-UnsupervisedChallenge.
Today on arxiv:
— Dark Machines (@dark_machines) June 1, 2021
The Dark Machines Anomaly Score Challenge: Benchmark Data and Model Independent Event Classification for the Large Hadron Collider https://t.co/j7MyxDLH9T
Paper documenting the @dark_machines unsupervised searches data challenge at particle collider is out. It was great to work on this with @SaschaCaron @CatDogLund @MCvBeekveld @BryanOstdiek @vlimant @BonaMarcella and many others. https://t.co/OWRZGJxvMt pic.twitter.com/qpbQCiY8Hy
— Maurizio Pierini (@xmpierinix) June 1, 2021
14. RaspberryPI for mosquito neutralization by power laser
R. Ildar
In this article, for the first time, comprehensive studies of mosquito neutralization using machine vision and a 1 W power laser are presented. We developed a laser installation with a Raspberry Pi that changes the direction of the laser with a galvanometer, together with a program for mosquito tracking in real time. The possibility of using deep neural networks, Haar cascades, and machine learning for mosquito recognition was considered. We consider in detail the classification problems of mosquitoes in images. A recommendation is given for implementing this device on a microcontroller for subsequent use as part of an unmanned aerial vehicle. Any harmful insects in the fields can be used as objects for control.
"RaspberryPI for mosquito neutralization by power laser"https://t.co/cecBcopJpf
— Dmytro Mishkin (@ducha_aiki) June 1, 2021
"The system can neutralize 2 mosquitos per sec and this result can be easily improved. " pic.twitter.com/3eE7WLB43h
15. MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models
Zhewei Yao, Linjian Ma, Sheng Shen, Kurt Keutzer, Michael W. Mahoney
Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models. However, current approaches either only explore head pruning, which has a limited pruning ratio, or only focus on unstructured pruning, which has negligible effects on the real inference time and/or power consumption. To address these challenges, we develop a novel MultiLevel structured Pruning (MLPruning) framework, which uses three different levels of structured pruning: head pruning, row pruning, and block-wise sparse pruning. We propose using a learnable Top-k threshold, which employs an adaptive regularization to adjust the regularization magnitude adaptively, to select appropriate pruning ratios for different weight matrices. We also propose a two-step pipeline to combine block-wise pruning with head/row pruning to achieve high structured pruning ratios with minimum accuracy degradation. Our empirical results show that for BERT-base, with approximately 20% of remaining weights, MLPruning can achieve an accuracy comparable to the full model on QQP/MNLI/SQuAD, with up to approximately 3.69x speedup. Our framework has been open sourced.
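A much-simplified sketch of one pruning level (row pruning) with a fixed keep ratio; the paper instead learns the Top-k threshold with adaptive regularization and combines head, row, and block-wise pruning:

```python
import torch

def prune_rows(weight, keep_ratio=0.2):
    """Zero out all but the highest-magnitude rows of a weight matrix."""
    scores = weight.abs().sum(dim=1)               # importance of each row
    k = max(1, int(keep_ratio * scores.numel()))
    mask = torch.zeros_like(scores)
    mask[scores.topk(k).indices] = 1.0             # keep the top-k rows
    return weight * mask.unsqueeze(1)              # structured (row-wise) zeros
```

Because whole rows (or heads, or blocks) are removed, the sparsity translates into real speedups on standard hardware, unlike unstructured pruning.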
MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models
— AK (@ak92501) June 1, 2021
pdf: https://t.co/keOb2iMBGP
abs: https://t.co/pIAUeSScDO pic.twitter.com/Ebgf7gSMzp
16. Towards More Equitable Question Answering Systems: How Much More Data Do You Need?
Arnab Debnath, Navid Rajabi, Fardina Fathmiul Alam, Antonios Anastasopoulos
Question answering (QA) in English has been widely explored, but multilingual datasets are relatively new, with several methods attempting to bridge the gap between high- and low-resourced languages using data augmentation through translation and cross-lingual transfer. In this project, we take a step back and study which approaches allow us to take the most advantage of existing resources in order to produce QA systems in many languages. Specifically, we perform extensive analysis to measure the efficacy of few-shot approaches augmented with automatic translations and permutations of context-question-answer pairs. In addition, we make suggestions for future dataset development efforts that make better use of a fixed annotation budget, with a goal of increasing the language coverage of QA datasets and systems. Code and data for reproducing our experiments are available here: https://github.com/NavidRajabi/EMQA.
How much data do you really need to train a question answering system? In our #ACL2021NLP paper, we find that you can do just as well with only a few training examples, and we make suggestions for future dataset creators.
— Antonis Anastasopoulos (@anas_ant) June 1, 2021
Camera-Ready: https://t.co/OWMPKfdOQ2
👇Short thread👇 pic.twitter.com/g3N4BM6H0d
17. UCPhrase: Unsupervised Context-aware Quality Phrase Tagging
Xiaotao Gu, Zihan Wang, Zhenyu Bi, Yu Meng, Liyuan Liu, Jiawei Han, Jingbo Shang
Identifying and understanding quality phrases from context is a fundamental task in text mining. The most challenging part of this task arguably lies in uncommon, emerging, and domain-specific phrases. The infrequent nature of these phrases significantly hurts the performance of phrase mining methods that rely on sufficient phrase occurrences in the input corpus. Context-aware tagging models, though not restricted by frequency, heavily rely on domain experts for either massive sentence-level gold labels or handcrafted gazetteers. In this work, we propose UCPhrase, a novel unsupervised context-aware quality phrase tagger. Specifically, we induce high-quality phrase spans as silver labels from consistently co-occurring word sequences within each document. Compared with typical context-agnostic distant supervision based on existing knowledge bases (KBs), our silver labels root deeply in the input domain and context, thus having unique advantages in preserving contextual completeness and capturing emerging, out-of-KB phrases. Training a conventional neural tagger based on silver labels usually faces the risk of overfitting phrase surface names. Alternatively, we observe that the contextualized attention maps generated from a transformer-based neural language model effectively reveal the connections between words in a surface-agnostic way. Therefore, we pair such attention maps with the silver labels to train a lightweight span prediction model, which can be applied to new input to recognize (unseen) quality phrases regardless of their surface names or frequency. Thorough experiments on various tasks and datasets, including corpus-level phrase ranking, document-level keyphrase extraction, and sentence-level phrase tagging, demonstrate the superiority of our design over state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
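A sketch of the silver-label idea only: word sequences that consistently co-occur within a single document become candidate quality phrases. The function name, thresholds, and tokenization here are hypothetical simplifications of the paper's miner, and the attention-map span predictor trained on these labels is not shown:

```python
from collections import Counter

def silver_phrase_labels(doc_sentences, max_n=4, min_count=3):
    """Mine n-grams that repeat within one document as silver phrase labels."""
    counts = Counter()
    for sent in doc_sentences:
        toks = sent.lower().split()
        for n in range(2, max_n + 1):          # consider 2- to max_n-grams
            for i in range(len(toks) - n + 1):
                counts[tuple(toks[i:i + n])] += 1
    return [" ".join(ng) for ng, c in counts.items() if c >= min_count]
```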
UCPhrase: Unsupervised Context-aware Quality Phrase Tagging
— Philip Vollet (@philipvollet) June 1, 2021
A novel unsupervised, context-aware quality phrase tagger. https://t.co/PZ2Fv3vBYE https://t.co/Ds1lLuzT6R pic.twitter.com/jfPx8A7aJt