Hot Papers 2021-03-08

1. Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas

retweets: 10132, favorites: 35 (03/09/2021 08:38:00)
links: abs | pdf
cs.LG

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards “token uniformity”. Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.
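The rank collapse described in the abstract can be observed in a toy setting. The sketch below is an illustration, not the paper's experiment: it assumes a single attention head with identity query, key, and value weights, no skip connections, and no MLP, and it measures the distance from a matrix with identical rows using a mean-centered Frobenius norm as a simple proxy for the residual studied in the paper. The names `TOKENS`, `attention_layer`, and `residuals_by_depth` are hypothetical.

```python
import math

# Hypothetical toy input: 4 tokens with 3 features each.
TOKENS = [
    [0.5, -0.2, 0.1],
    [-0.3, 0.4, 0.2],
    [0.1, 0.1, -0.5],
    [-0.2, -0.4, 0.3],
]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_layer(X):
    """Pure self-attention with identity weights (an assumption):
    X -> softmax(X X^T / sqrt(d)) X, with no skip connection or MLP."""
    n, d = len(X), len(X[0])
    scores = [[sum(a * b for a, b in zip(xi, xj)) / math.sqrt(d) for xj in X]
              for xi in X]
    P = [softmax(r) for r in scores]
    return [[sum(P[i][j] * X[j][k] for j in range(n)) for k in range(d)]
            for i in range(n)]

def residual_norm(X):
    """Frobenius distance of X from the matrix whose rows all equal the mean row."""
    n, d = len(X), len(X[0])
    means = [sum(row[k] for row in X) / n for k in range(d)]
    return math.sqrt(sum((row[k] - means[k]) ** 2 for row in X for k in range(d)))

def residuals_by_depth(X, depth):
    """Residual norm of the input and of the output after each stacked layer."""
    out = [residual_norm(X)]
    for _ in range(depth):
        X = attention_layer(X)
        out.append(residual_norm(X))
    return out

if __name__ == "__main__":
    for layer, r in enumerate(residuals_by_depth(TOKENS, 5)):
        print(layer, r)
```

In this toy setup the token rows become nearly identical within a few layers, which is the "token uniformity" the abstract refers to. The sketch says nothing about how skip connections or MLPs counteract the effect.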

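With all kinds of Transformer papers being mass-produced and everything starting to look like "All You Need", an "Attention is not all you need" paper has unexpectedly dropped, ushering in an age of chaos...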
"Attention is not all you need: pure attention loses rank doubly exponentially with depth"https://t.co/4lByFg3EhM pic.twitter.com/cviD5DIDCd
— えるエル (@ImAI_Eruel) March 8, 2021

2. Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt

retweets: 4250, favorites: 501 (03/09/2021 08:38:00)
links: abs | pdf
cs.LG | cs.AI | cs.CL

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

To find the limits of Transformers, we collected 12,500 math problems. While a three-time IMO gold medalist got 90%, GPT-3 models got ~5%, with accuracy increasing slowly.

If trends continue, ML models are far from achieving mathematical reasoning. https://t.co/X7dzRlut01 pic.twitter.com/coKAtgo09R
— Dan Hendrycks (@DanHendrycks) March 8, 2021

Measuring Mathematical Problem Solving With the MATH Dataset

Introduces MATH, a new dataset of 12,500 challenging competition mathematics problems.

Observed that scaling is not currently solving MATH despite being helpful for most other datasets. https://t.co/LJDVTKOtEr pic.twitter.com/R0jbPSwYW8
— Aran Komatsuzaki (@arankomatsuzaki) March 8, 2021

3. Generating Images with Sparse Representations

Charlie Nash, Jacob Menick, Sander Dieleman, Peter W. Battaglia

retweets: 2396, favorites: 438 (03/09/2021 08:38:01)
links: abs | pdf
cs.CV | stat.ML

The high dimensionality of images presents architecture and sampling-efficiency challenges for likelihood-based generative models. Previous approaches such as VQ-VAE use deep autoencoders to obtain compact representations, which are more practical as inputs for likelihood-based models. We present an alternative approach, inspired by common image compression methods like JPEG, and convert images to quantized discrete cosine transform (DCT) blocks, which are represented sparsely as a sequence of DCT channel, spatial location, and DCT coefficient triples. We propose a Transformer-based autoregressive architecture, which is trained to sequentially predict the conditional distribution of the next element in such sequences, and which scales effectively to high resolution images. On a range of image datasets, we demonstrate that our approach can generate high quality, diverse images, with sample metric scores competitive with state of the art methods. We additionally show that simple modifications to our method yield effective image colorization and super-resolution models.
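To make the sparse representation concrete, here is a minimal sketch of how a grayscale image could be turned into (DCT channel, spatial location, coefficient) triples. It is not the authors' pipeline: it assumes 8x8 blocks, an orthonormal DCT-II, a single uniform quantization step (JPEG itself uses a per-frequency quantization table), and image dimensions that are multiples of 8. The names `to_sparse_triples` and `step` are hypothetical.

```python
import math

BLOCK = 8  # assumed block size, as in JPEG

def dct_1d(v):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(block):
    """Separable 2-D DCT: transform rows first, then columns.
    Returns coeffs[u][v] with u the vertical and v the horizontal frequency."""
    rows = [dct_1d(r) for r in block]
    size = len(block)
    cols = [dct_1d([rows[i][j] for i in range(size)]) for j in range(size)]
    return [[cols[j][u] for j in range(size)] for u in range(size)]

def to_sparse_triples(image, step=16.0):
    """Quantize each block's DCT coefficients and keep only the nonzero ones
    as (channel, position, value) triples."""
    h, w = len(image), len(image[0])
    blocks_per_row = w // BLOCK
    triples = []
    for by in range(0, h, BLOCK):
        for bx in range(0, w, BLOCK):
            block = [row[bx:bx + BLOCK] for row in image[by:by + BLOCK]]
            coeffs = dct_2d(block)
            position = (by // BLOCK) * blocks_per_row + bx // BLOCK
            for u in range(BLOCK):
                for v in range(BLOCK):
                    q = round(coeffs[u][v] / step)
                    if q != 0:
                        triples.append((u * BLOCK + v, position, q))
    return triples

if __name__ == "__main__":
    # Left block is flat gray, right block is black: one triple survives.
    image = [[32.0] * 8 + [0.0] * 8 for _ in range(8)]
    print(to_sparse_triples(image))
```

Smooth regions collapse to a handful of low-frequency triples, which is what makes the sequence short enough to model autoregressively.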

Excited to release our new paper 'Generating Images with Sparse Representations' (https://t.co/ErJXmaOE0C, @jacobmenick @sedielem @PeterWBattaglia)

Our model picks where to place content in an image, and what content to place there (see vid).

Thread for more info: pic.twitter.com/ihtnvM8Gzj
— Charlie Nash (@charlietcnash) March 8, 2021

Generating images with sparse representations

Proposes a Transformer-based autoregressive model inspired by DCT/JPEG, which scales effectively to high resolution images.

Demonstrate that it can generate high quality, diverse images, with SotA quality. https://t.co/3zJtIPqdUD pic.twitter.com/8t6BwESRIQ
— Aran Komatsuzaki (@arankomatsuzaki) March 8, 2021

Generating Images with Sparse Representations
pdf: https://t.co/ze6TkvMiYR
abs: https://t.co/QtJg9zf80M pic.twitter.com/dZjTSFBzqP
— AK (@ak92501) March 8, 2021

4. Lord of the Ring(s): Side Channel Attacks on the CPU On-Chip Ring Interconnect Are Practical

Riccardo Paccagnella, Licheng Luo, Christopher W. Fletcher

retweets: 925, favorites: 132 (03/09/2021 08:38:01)
links: abs | pdf
cs.CR | cs.AR

We introduce the first microarchitectural side channel attacks that leverage contention on the CPU ring interconnect. There are two challenges that make it uniquely difficult to exploit this channel. First, little is known about the ring interconnect’s functioning and architecture. Second, information that can be learned by an attacker through ring contention is noisy by nature and has coarse spatial granularity. To address the first challenge, we perform a thorough reverse engineering of the sophisticated protocols that handle communication on the ring interconnect. With this knowledge, we build a cross-core covert channel over the ring interconnect with a capacity of over 4 Mbps from a single thread, the largest to date for a cross-core channel not relying on shared memory. To address the second challenge, we leverage the fine-grained temporal patterns of ring contention to infer a victim program’s secrets. We demonstrate our attack by extracting key bits from vulnerable EdDSA and RSA implementations, as well as inferring the precise timing of keystrokes typed by a victim user.

Another day, another CPU security bug.
Lord of the Ring(s): Side Channel Attacks on the CPU On-Chip Ring Interconnect. Works on Intel and may work on AMD and other cpus too. https://t.co/sAoFuVuH51 Apply patches when released.
— nixCraft (@nixcraft) March 8, 2021

Lord of the Ring(s): Side Channel Attacks on the (Intel) CPU On-Chip Ring Interconnect

“In this paper, we introduced side channel attacks on the ring interconnect. … extracting key bits from vulnerable EdDSA and RSA implementations …” https://t.co/2ejltqWFel pic.twitter.com/8sB5ajU56S
— Andreas Schilling (@aschilling) March 8, 2021

Our work on ring interconnect side channel attacks was accepted at @USENIXSecurity 2021 (#usesec21)! Full paper and source code are now available at: https://t.co/bLXXhWmQZG
— Riccardo Paccagnella (@ricpacca) March 8, 2021

5. Rissanen Data Analysis: Examining Dataset Characteristics via Description Length

Ethan Perez, Douwe Kiela, Kyunghyun Cho

retweets: 552, favorites: 95 (03/09/2021 08:38:02)
links: abs | pdf
cs.LG | cs.AI | cs.CL | stat.ML

We introduce a method to determine if a certain capability helps to achieve an accurate model of given data. We view labels as being generated from the inputs by a program composed of subroutines with different capabilities, and we posit that a subroutine is useful if and only if the minimal program that invokes it is shorter than the one that does not. Since minimum program length is uncomputable, we instead estimate the labels’ minimum description length (MDL) as a proxy, giving us a theoretically-grounded method for analyzing dataset characteristics. We call the method Rissanen Data Analysis (RDA) after the father of MDL, and we showcase its applicability on a wide variety of settings in NLP, ranging from evaluating the utility of generating subquestions before answering a question, to analyzing the value of rationales and explanations, to investigating the importance of different parts of speech, and uncovering dataset gender bias.
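The idea of comparing description lengths with and without a capability can be sketched with prequential (online) coding: each label is encoded using a model fitted only to the labels seen so far, and the code lengths are summed. The example below is a toy under stated assumptions, not the authors' implementation: it uses a Laplace-smoothed count model in place of a trained network, codes one example at a time, and treats "the capability" as access to an extra input feature. The function name `prequential_codelength` is hypothetical.

```python
import math
from collections import defaultdict

def prequential_codelength(inputs, labels, num_classes):
    """Bits needed to transmit the labels given the inputs, coding each label
    with a count-based model fitted to the earlier examples only."""
    # Start every input context with one pseudo-count per class (Laplace prior).
    counts = defaultdict(lambda: [1] * num_classes)
    bits = 0.0
    for x, y in zip(inputs, labels):
        c = counts[x]
        bits += -math.log2(c[y] / sum(c))
        c[y] += 1  # update the model only after paying for this label
    return bits

if __name__ == "__main__":
    feature = [i % 2 for i in range(100)]   # hypothetical helpful feature
    labels = list(feature)                  # label is fully determined by it
    with_feature = prequential_codelength(feature, labels, 2)
    without_feature = prequential_codelength([None] * 100, labels, 2)
    print(with_feature, without_feature)
```

A much shorter code length with the feature than without it is the kind of evidence the method treats as "this capability helps model the data".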

There's a lot of work on probing models, but models are reflections of the training data. Can we probe datasets for what capabilities they require? @kchonyc @douwekiela & I introduce Rissanen Data Analysis to do just that: https://t.co/f16EJF75qm
Code: https://t.co/wG0dg0VhxD
1/N
— Ethan Perez (@EthanJPerez) March 8, 2021

6. Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos

Yasamin Jafarian, Hyun Soo Park

retweets: 388, favorites: 203 (03/09/2021 08:38:02)
links: abs | pdf
cs.CV

A key challenge of learning the geometry of dressed humans lies in the limited availability of the ground truth data (e.g., 3D scanned models), which results in the performance degradation of 3D human reconstruction when applying to real-world imagery. We address this challenge by leveraging a new data resource: a number of social media dance videos that span diverse appearance, clothing styles, performances, and identities. Each video depicts dynamic movements of the body and clothes of a single person while lacking the 3D ground truth geometry. To utilize these videos, we present a new method to use the local transformation that warps the predicted local geometry of the person from an image to that of another image at a different time instant. This allows self-supervision as enforcing a temporal coherence over the predictions. In addition, we jointly learn the depth along with the surface normals that are highly responsive to local texture, wrinkle, and shade by maximizing their geometric consistency. Our method is end-to-end trainable, resulting in high fidelity depth estimation that predicts fine geometry faithful to the input real image. We demonstrate that our method outperforms the state-of-the-art human depth estimation and human shape recovery approaches on both real and rendered images.

Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos
pdf: https://t.co/kFHjax0T99
abs: https://t.co/th4oERTtrY pic.twitter.com/EREdkGVY8E
— AK (@ak92501) March 8, 2021

Finally, something useful has come out of TikTok. https://t.co/c6rzzghKqy pic.twitter.com/IHsevFE7NY
— Yuri Krupenin (@turbojedi) March 8, 2021

7. Social contagion on higher-order structures

Alain Barrat, Guilherme Ferraz de Arruda, Iacopo Iacopini, Yamir Moreno

retweets: 438, favorites: 79 (03/09/2021 08:38:02)
links: abs | pdf
physics.soc-ph | cs.SI

In this Chapter, we discuss the effects of higher-order structures on SIS-like processes of social contagion. After a brief motivational introduction where we illustrate the standard SIS process on networks and the difference between simple and complex contagions, we introduce spreading processes on higher-order structures starting from the most general formulation on hypergraphs and then moving to several mean-field and heterogeneous mean-field approaches. The results highlight the rich phenomenology brought by taking into account higher-order contagion effects: both continuous and discontinuous transitions are observed, and critical mass effects emerge. We conclude with a short discussion on the theoretical results regarding the nature of the epidemic transition and the general need for data to validate these models.
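The discontinuous transition and critical-mass effect mentioned in the abstract can be illustrated with a simple mean-field equation. The form used below is an assumption taken from the simplicial-contagion literature rather than from this abstract: with time rescaled by the recovery rate, the infected density rho is assumed to follow d(rho)/dt = -rho + lam * rho * (1 - rho) + lam_delta * rho^2 * (1 - rho), where `lam` is the rescaled pairwise infectivity and `lam_delta` the rescaled three-body (triangle) infectivity. All function and parameter names are hypothetical.

```python
import math

def drho_dt(rho, lam, lam_delta):
    """Assumed mean-field rate: recovery, pairwise infection, group infection."""
    return -rho + lam * rho * (1 - rho) + lam_delta * rho ** 2 * (1 - rho)

def integrate(rho0, lam, lam_delta, dt=0.01, steps=20000):
    """Forward-Euler integration of the density from rho0."""
    rho = rho0
    for _ in range(steps):
        rho += dt * drho_dt(rho, lam, lam_delta)
    return rho

def nonzero_fixed_points(lam, lam_delta):
    """Roots of lam_delta*rho^2 - (lam_delta - lam)*rho + (1 - lam) = 0,
    returned as [unstable threshold, stable endemic state] when they exist.
    Requires lam_delta > 0."""
    disc = (lam_delta - lam) ** 2 - 4 * lam_delta * (1 - lam)
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return [((lam_delta - lam) - root) / (2 * lam_delta),
            ((lam_delta - lam) + root) / (2 * lam_delta)]

if __name__ == "__main__":
    lam, lam_delta = 0.8, 2.5  # below the pairwise threshold lam = 1
    print(nonzero_fixed_points(lam, lam_delta))
    print(integrate(0.05, lam, lam_delta))  # small seed dies out
    print(integrate(0.30, lam, lam_delta))  # large seed reaches the endemic state
```

With these illustrative parameters the outcome depends on the initial density: a seed below the lower fixed point decays to zero, while a seed above it settles at the upper one, which is the bistability behind the critical-mass effect.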

Two preprints out today on dynamics of higher-order interactions:
1- "Evolutionary games on simplicial complexes" (https://t.co/oBeapACgck)

2- "Social contagion on higher-order structures" (https://t.co/bmUyZ4Blfd) pic.twitter.com/ZilIVxZp7t
— Yamir Moreno (@cosnet_bifi) March 8, 2021

In "Social contagion on higher-order structures" (https://t.co/bmUyZ4Blfd), we revise what we know about social contagion in higher-order structures. Work led by @GuiFdeArruda @iacopoiacopini & @alainbarrat pic.twitter.com/xdjadA9WOt
— Yamir Moreno (@cosnet_bifi) March 8, 2021

8. An Effective Loss Function for Generating 3D Models from Single 2D Image without Rendering

Nikola Zubić, Pietro Liò

retweets: 308, favorites: 121 (03/09/2021 08:38:02)
links: abs | pdf
cs.CV | cs.AI

Differentiable rendering is a very successful technique for single-view 3D reconstruction. Current renderers use pixel-based losses between a rendered image of a reconstructed 3D object and ground-truth images from matched viewpoints to optimise the parameters of the 3D shape. These models require a rendering step, along with visibility handling and evaluation of the shading model. The main goal of this paper is to demonstrate that we can avoid these steps and still obtain reconstruction results that are equal to or even better than those of existing state-of-the-art category-specific reconstruction methods. First, we use the same CNN architecture as Insafutdinov & Dosovitskiy to predict the point cloud shape and pose. Secondly, we propose a novel, effective loss function that evaluates how well the projections of reconstructed 3D point clouds cover the ground truth object’s silhouette. Then we use Poisson Surface Reconstruction to transform the reconstructed point cloud into a 3D mesh. Finally, we perform a GAN-based texture mapping on a particular 3D mesh and produce a textured 3D mesh from a single 2D image. We evaluate our method on different datasets (including ShapeNet, CUB-200-2011, and Pascal3D+) and achieve state-of-the-art results, outperforming all the other supervised and unsupervised methods and 3D representations in terms of performance, accuracy, and training time.

An Effective Loss Function for Generating 3D Models from Single 2D Image without Rendering
pdf: https://t.co/faTvBgq3fP
abs: https://t.co/Hooi2VlcMp pic.twitter.com/P4ijNIKASY
— AK (@ak92501) March 8, 2021

9. Causal Attention for Vision-Language Tasks

Xu Yang, Hanwang Zhang, Guojun Qi, Jianfei Cai

retweets: 210, favorites: 73 (03/09/2021 08:38:02)
links: abs | pdf
cs.CV

We present a novel attention mechanism: Causal Attention (CATT), to remove the ever-elusive confounding effect in existing attention-based vision-language models. This effect causes harmful bias that misleads the attention module to focus on the spurious correlations in training data, damaging the model generalization. As the confounder is unobserved in general, we use the front-door adjustment to realize the causal intervention, which does not require any knowledge on the confounder. Specifically, CATT is implemented as a combination of 1) In-Sample Attention (IS-ATT) and 2) Cross-Sample Attention (CS-ATT), where the latter forcibly brings other samples into every IS-ATT, mimicking the causal intervention. CATT abides by the Q-K-V convention and hence can replace any attention module such as top-down attention and self-attention in Transformers. CATT improves various popular attention-based vision-language models by considerable margins. In particular, we show that CATT has great potential in large-scale pre-training, e.g., it can promote the lighter LXMERT, which uses fewer data and less computational power, comparable to the heavier UNITER. Code is published at https://github.com/yangxuntu/catt.
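The front-door adjustment that the abstract relies on has a compact closed form for discrete variables: P(y | do(x)) = sum over z of P(z | x) times the sum over x' of P(y | x', z) P(x'). The sketch below evaluates that formula on small hand-made probability tables. It is only an illustration of the adjustment itself, not of how CATT approximates it with attention, and the tables and function names are hypothetical.

```python
def front_door(x, p_x, p_z_given_x, p_y_given_xz):
    """P(y=1 | do(X=x)) via the front-door formula:
    sum_z P(z|x) * sum_x' P(y=1|x',z) * P(x')."""
    total = 0.0
    for z, pz in p_z_given_x[x].items():
        inner = sum(p_y_given_xz[(x_prime, z)] * px
                    for x_prime, px in p_x.items())
        total += pz * inner
    return total

def observational(x, p_z_given_x, p_y_given_xz):
    """Plain conditional P(y=1 | X=x) = sum_z P(z|x) * P(y=1|x,z), for contrast."""
    return sum(pz * p_y_given_xz[(x, z)] for z, pz in p_z_given_x[x].items())

# Hypothetical tables for binary X (input), Z (mediator), Y (outcome).
P_X = {0: 0.5, 1: 0.5}
P_Z_GIVEN_X = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.9}

if __name__ == "__main__":
    print(front_door(1, P_X, P_Z_GIVEN_X, P_Y1_GIVEN_XZ))    # interventional
    print(observational(1, P_Z_GIVEN_X, P_Y1_GIVEN_XZ))      # confounded
```

The gap between the two printed numbers is the part of the association that the adjustment attributes to confounding rather than to the effect of X on Y through Z.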

Causal Attention for Vision-Language Tasks
pdf: https://t.co/bnxcdWKahC
abs: https://t.co/qHpxPNklUT pic.twitter.com/SOGkR5ngTY
— AK (@ak92501) March 8, 2021

10. Addressing Research Software Sustainability via Institutes

Daniel S. Katz, Jeffrey C. Carver, Neil P. Chue Hong, Sandra Gesing, Simon Hettrick, Tom Honeyman, Karthik Ram, Nicholas Weber

retweets: 240, favorites: 28 (03/09/2021 08:38:02)
links: abs | pdf
cs.SE

Research software is essential to modern research, but it requires ongoing human effort to sustain: to continually adapt to changes in dependencies, to fix bugs, and to add new features. Software sustainability institutes, amongst others, develop, maintain, and disseminate best practices for research software sustainability, and build community around them. These practices can both reduce the amount of effort that is needed and create an environment where the effort is appreciated and rewarded. The UK SSI is such an institute, and the US URSSI and the Australian AuSSI are planning to become institutes, and this extended abstract discusses them and the strengths and weaknesses of this approach.

Addressing Research Software Sustainability via Institutes

by @danielskatz @JeffCarver32 @npch @sandragesing @sjh5000 @TomHoneyman3 @_inundata @nniiicc

an #icse2021 #bokss2021 workshop paper

cc @SoftwareSaved @si2urssi #AuSSI @ICSEconf https://t.co/I8O0lWAYEq pic.twitter.com/xHiYoPRVAg
— Daniel S. Katz (@danielskatz) March 8, 2021

11. There Once Was a Really Bad Poet, It Was Automated but You Didn’t Know It

Jianyou Wang, Xiaoxuan Zhang, Yuren Zhou, Christopher Suh, Cynthia Rudin

retweets: 156, favorites: 54 (03/09/2021 08:38:03)
links: abs | pdf
cs.CL

Limerick generation exemplifies some of the most difficult challenges faced in poetry generation, as the poems must tell a story in only five lines, with constraints on rhyme, stress, and meter. To address these challenges, we introduce LimGen, a novel and fully automated system for limerick generation that outperforms state-of-the-art neural network-based poetry models, as well as prior rule-based poetry models. LimGen consists of three important pieces: the Adaptive Multi-Templated Constraint algorithm that constrains our search to the space of realistic poems, the Multi-Templated Beam Search algorithm which searches efficiently through the space, and the probabilistic Storyline algorithm that provides coherent storylines related to a user-provided prompt word. The resulting limericks satisfy poetic constraints and have thematically coherent storylines, which are sometimes even funny (when we are lucky).

There Once Was a Really Bad Poet, It Was Automated but You Didn’t Know It
pdf: https://t.co/jcDJXppKo3
abs: https://t.co/ZeEnAv21jY pic.twitter.com/lTbsiRpEKj
— AK (@ak92501) March 8, 2021

12. Compositional Explanations for Image Classifiers

Hana Chockler, Daniel Kroening, Youcheng Sun

retweets: 72, favorites: 21 (03/09/2021 08:38:03)
links: abs | pdf
cs.LG

Existing algorithms for explaining the output of image classifiers perform poorly on inputs where the object of interest is partially occluded. We present a novel, black-box algorithm for computing explanations that uses a principled approach based on causal theory. We implement the method in the tool CET (Compositional Explanation Tool). Owing to the compositionality in its algorithm, CET computes explanations that are much more accurate than those generated by the existing explanation tools on images with occlusions and delivers a level of performance comparable to the state of the art when explaining images without occlusions.

Compositional Explanations for Image Classifiers
pdf: https://t.co/dmmlckvsv1
abs: https://t.co/cdX3tqr61Z
project page: https://t.co/qgXG67DAXg pic.twitter.com/ficAJ5Bo1X
— AK (@ak92501) March 8, 2021

13. IOT: Instance-wise Layer Reordering for Transformer Structures

Jinhua Zhu, Lijun Wu, Yingce Xia, Shufang Xie, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu

retweets: 28, favorites: 42 (03/09/2021 08:38:03)
links: abs | pdf
cs.CL | cs.AI

With sequentially stacked self-attention, (optional) encoder-decoder attention, and feed-forward layers, Transformer achieves big success in natural language processing (NLP), and many variants have been proposed. Currently, almost all these models assume that the layer order is fixed and kept the same across data samples. We observe that different data samples actually favor different orders of the layers. Based on this observation, in this work, we break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure. Our Instance-wise Ordered Transformer (IOT) can model variant functions by reordered layers, which enables each sample to select the better one to improve the model performance under the constraint of almost the same number of parameters. To achieve this, we introduce a light predictor with negligible parameter and inference cost to decide the most capable and favorable layer order for any input sequence. Experiments on 3 tasks (neural machine translation, abstractive summarization, and code generation) and 9 datasets demonstrate consistent improvements of our method. We further show that our method can also be applied to other architectures beyond Transformer. Our code is released at Github.

IOT: Instance-wise Layer Reordering for Transformer Structures
pdf: https://t.co/ND2idxVDUC
abs: https://t.co/F6DEXrTrl1
github: https://t.co/QUShQ8nmmX pic.twitter.com/OYp72REwZO
— AK (@ak92501) March 8, 2021

14. Golem: An algorithm for robust experiment and process optimization

Matteo Aldeghi, Florian Häse, Riley J. Hickman, Isaac Tamblyn, Alán Aspuru-Guzik

retweets: 22, favorites: 34 (03/09/2021 08:38:03)
links: abs | pdf
math.OC | cs.LG | physics.chem-ph

Numerous challenges in science and engineering can be framed as optimization tasks, including the maximization of reaction yields, the optimization of molecular and materials properties, and the fine-tuning of automated hardware protocols. Design of experiment and optimization algorithms are often adopted to solve these tasks efficiently. Increasingly, these experiment planning strategies are coupled with automated hardware to enable autonomous experimental platforms. The vast majority of the strategies used, however, do not consider robustness against the variability of experiment and process conditions. In fact, it is generally assumed that these parameters are exact and reproducible. Yet some experiments may have considerable noise associated with some of their conditions, and process parameters optimized under precise control may be applied in the future under variable operating conditions. In either scenario, the optimal solutions found might not be robust against input variability, affecting the reproducibility of results and returning suboptimal performance in practice. Here, we introduce Golem, an algorithm that is agnostic to the choice of experiment planning strategy and that enables robust experiment and process optimization. Golem identifies optimal solutions that are robust to input uncertainty, thus ensuring the reproducible performance of optimized experimental protocols and processes. It can be used to analyze the robustness of past experiments, or to guide experiment planning algorithms toward robust solutions on the fly. We assess the performance and domain of applicability of Golem through extensive benchmark studies and demonstrate its practical relevance by optimizing an analytical chemistry protocol under the presence of significant noise in its experimental conditions.
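The distinction between a nominal optimum and a robust one can be shown with a one-dimensional toy. This sketch does not reproduce Golem's algorithm: it assumes a known objective with one narrow and one broad peak, uniform input noise of a fixed half-width, and a plain numerical average over that noise, followed by a grid search. The names `objective`, `robust_objective`, and `argmax_on_grid` are hypothetical.

```python
import math

def objective(x):
    """Toy objective: a tall narrow peak at x=1 and a lower broad peak at x=3."""
    narrow = math.exp(-((x - 1.0) / 0.05) ** 2)
    broad = 0.8 * math.exp(-((x - 3.0) / 1.0) ** 2)
    return narrow + broad

def robust_objective(f, x, half_width=0.5, samples=101):
    """Average of f over inputs perturbed uniformly within +/- half_width,
    approximated on an evenly spaced grid of perturbations."""
    total = 0.0
    for i in range(samples):
        delta = -half_width + 2.0 * half_width * i / (samples - 1)
        total += f(x + delta)
    return total / samples

def argmax_on_grid(score, lo=0.0, hi=4.0, step=0.01):
    """Return the grid point in [lo, hi] with the highest score."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=score)

if __name__ == "__main__":
    nominal = argmax_on_grid(objective)
    robust = argmax_on_grid(lambda x: robust_objective(objective, x))
    print("nominal optimum:", nominal)
    print("robust optimum:", robust)
```

The nominal search picks the narrow peak, whose value collapses under small input errors, while the noise-averaged search prefers the broad peak. That trade-off is the motivation the abstract gives for optimizing with input uncertainty in mind.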

ROBUST #optimization in #chemistry is an important requirement for e.g. scale up. Work with @matteo_aldeghi @florian_hase @riley_hickman @itamblyn on GOLEM the new #matterlab algorithm https://t.co/lC3PrU6qfh @UofT @VectorInst @chemuoft @UofTCompSci check it out!
— Alan Aspuru-Guzik (@A_Aspuru_Guzik) March 8, 2021

Published 9 Mar 2021

Tatsuya Shirakawa, ML Lead at Beatrust (https://beatrust.com)