Hot Papers 2021-06-15

1. GANs N’ Roses: Stable, Controllable, Diverse Image to Image Translation (works for videos too!)

Min Jin Chong, David Forsyth

We show how to learn a map that takes a content code, derived from a face image, and a randomly chosen style code to an anime image. We derive an adversarial loss from our simple and effective definitions of style and content. This adversarial loss guarantees the map is diverse — a very wide range of anime can be produced from a single content code. Under plausible assumptions, the map is not just diverse, but also correctly represents the probability of an anime, conditioned on an input face. In contrast, current multimodal generation procedures cannot capture the complex styles that appear in anime. Extensive quantitative experiments support the idea that the map is correct. Extensive qualitative results show that the method can generate a much more diverse range of styles than SOTA comparisons. Finally, we show that our formalization of content and style allows us to perform video-to-video translation without ever training on videos.

2. Scalars are universal: Gauge-equivariant machine learning, structured like classical physics

Soledad Villar, David W. Hogg, Kate Storey-Fisher, Weichi Yao, Ben Blum-Smith

There has been enormous progress in the last few years in designing conceivable (though not always practical) neural networks that respect the gauge symmetries — or coordinate freedom — of physical law. Some of these frameworks make use of irreducible representations, some make use of higher order tensor objects, and some apply symmetry-enforcing constraints. Different physical laws obey different combinations of fundamental symmetries, but a large fraction (possibly all) of classical physics is equivariant to translation, rotation, reflection (parity), boost (relativity), and permutations. Here we show that it is simple to parameterize universally approximating polynomial functions that are equivariant under these symmetries, or under the Euclidean, Lorentz, and Poincaré groups, at any dimensionality d. The key observation is that nonlinear O(d)-equivariant (and related-group-equivariant) functions can be expressed in terms of a lightweight collection of scalars — scalar products and scalar contractions of the scalar, vector, and tensor inputs. These results demonstrate theoretically that gauge-invariant deep learning models for classical physics with good scaling for large problems are feasible right now.
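
The core construction is compact enough to sketch. Below is a minimal NumPy illustration of the key observation for O(d): the invariant features are the pairwise scalar products of the input vectors, and an equivariant vector-valued function can be written as a sum of the inputs weighted by scalar functions of those invariants. The `mlp` argument is a placeholder for any learnable map from invariants to weights, not an API from the paper.

```python
import numpy as np

def invariant_features(V):
    """O(d)-invariant building blocks: all pairwise scalar products
    of the n input vectors V (shape (n, d))."""
    i, j = np.triu_indices(V.shape[0])
    return (V @ V.T)[i, j]

def equivariant_vector_fn(V, mlp):
    """O(d)-equivariant output: a weighted sum of the input vectors,
    with weights that are scalar functions of the invariants only.
    Rotating every row of V by R rotates the output by the same R."""
    weights = mlp(invariant_features(V))   # shape (n,)
    return weights @ V                     # shape (d,)

# toy check: equivariance under a random orthogonal transform
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal matrix
mlp = lambda feats: np.tanh(feats[:3])          # stand-in scalar network
assert np.allclose(equivariant_vector_fn(V @ Q, mlp),
                   equivariant_vector_fn(V, mlp) @ Q)
```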

3. Thinking Like Transformers

Gail Weiss, Yoav Goldberg, Eran Yahav

  • retweets: 1008, favorites: 186 (06/16/2021 09:34:36)
  • links: abs | pdf
  • cs.LG | cs.CL

What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no such familiar parallel. In this paper we aim to change that, proposing a computational model for the transformer-encoder in the form of a programming language. We map the basic components of a transformer-encoder — attention and feed-forward computation — into simple primitives, around which we form a programming language: the Restricted Access Sequence Processing Language (RASP). We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer, and how a Transformer can be trained to mimic a RASP solution. In particular, we provide RASP programs for histograms, sorting, and Dyck-languages. We further use our model to relate their difficulty in terms of the number of required layers and attention heads: analyzing a RASP program implies a maximum number of heads and layers necessary to encode a task in a transformer. Finally, we see how insights gained from our abstraction might be used to explain phenomena seen in recent works.
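
As a flavor of what RASP programs look like, here is a minimal Python mimic of two of its primitives, select (which builds a boolean attention pattern) and selector_width (which counts selected positions), used to write the histogram task mentioned above. The Python encoding is ours; the primitive names follow the paper.

```python
def select(keys, queries, predicate):
    # RASP-style selector: S[q][k] is True where attention is allowed
    return [[predicate(k, q) for k in keys] for q in queries]

def selector_width(selector):
    # number of selected positions per query (RASP's selector_width)
    return [sum(row) for row in selector]

tokens = list("hello")
# histogram: each position attends to equal tokens and counts them
hist = selector_width(select(tokens, tokens, lambda k, q: k == q))
print(hist)  # [1, 1, 2, 2, 1]
```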

4. An Empirical Survey of Data Augmentation for Limited Data Learning in NLP

Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, Diyi Yang

  • retweets: 600, favorites: 130 (06/16/2021 09:34:36)
  • links: abs | pdf
  • cs.CL | cs.AI

NLP has achieved great progress in the past decade through the use of neural models and large labeled datasets. The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks where significant time, money, or expertise is required to label massive amounts of textual data. Recently, data augmentation methods have been explored as a means of improving data efficiency in NLP. To date, there has been no systematic empirical overview of data augmentation for NLP in the limited labeled data setting, making it difficult to understand which methods work in which settings. In this paper, we provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting, summarizing the landscape of methods (including token-level augmentations, sentence-level augmentations, adversarial augmentations, and hidden-space augmentations) and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks. Based on the results, we draw several conclusions to help practitioners choose appropriate augmentations in different settings and discuss the current challenges and future directions for limited data learning in NLP.
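
For concreteness, here is a hedged sketch of two token-level augmentations of the kind the survey covers (random deletion and adjacent swaps); the exact operators and hyperparameters vary across the methods surveyed.

```python
import random

def token_dropout(tokens, p=0.1, rng=random):
    """Randomly delete tokens with probability p (keep at least one)."""
    kept = [t for t in tokens if rng.random() > p]
    return kept or [rng.choice(tokens)]

def token_swap(tokens, n_swaps=1, rng=random):
    """Swap n_swaps random adjacent token pairs."""
    tokens = list(tokens)
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i = rng.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

print(token_swap(token_dropout("a model for text".split())))
```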

5. Pre-Trained Models: Past, Present and Future

Han Xu, Zhang Zhengyan, Ding Ning, Gu Yuxian, Liu Xiao, Huo Yuqi, Qiu Jiezhong, Zhang Liang, Han Wentao, Huang Minlie, Jin Qin, Lan Yanyan, Liu Yang, Liu Zhiyuan, Lu Zhiwu, Qiu Xipeng, Song Ruihua, Tang Jie, Wen Ji-Rong, Yuan Jinhui, Zhao Wayne Xin, Zhu Jun

  • retweets: 624, favorites: 105 (06/16/2021 09:34:37)
  • links: abs | pdf
  • cs.AI | cs.CL

Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success and become a milestone in the field of artificial intelligence (AI). Owing to sophisticated pre-training objectives and huge model parameters, large-scale PTMs can effectively capture knowledge from massive labeled and unlabeled data. By storing knowledge in huge parameters and fine-tuning on specific tasks, the rich knowledge implicitly encoded in those parameters can benefit a variety of downstream tasks, as has been extensively demonstrated via experimental verification and empirical analysis. It is now the consensus of the AI community to adopt PTMs as the backbone for downstream tasks rather than learning models from scratch. In this paper, we take a deep look into the history of pre-training, especially its special relation to transfer learning and self-supervised learning, to reveal the crucial position of PTMs in the AI development spectrum. Further, we comprehensively review the latest breakthroughs of PTMs. These breakthroughs are driven by the surge of computational power and the increasing availability of data, and advance four important directions: designing effective architectures, utilizing rich contexts, improving computational efficiency, and conducting interpretation and theoretical analysis. Finally, we discuss a series of open problems and research directions for PTMs, and hope our view can inspire and advance the future study of PTMs.

6. Unsupervised Learning of Visual 3D Keypoints for Control

Boyuan Chen, Pieter Abbeel, Deepak Pathak

Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations. Prior works show that structured latent spaces such as visual keypoints often outperform unstructured representations for robotic control. However, most of these representations, whether structured or unstructured, are learned in a 2D space even though the control tasks are usually performed in a 3D environment. In this work, we propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner. The input images are embedded into latent 3D keypoints via a differentiable encoder which is trained to optimize both a multi-view consistency loss and the downstream task objective. These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a manner consistent across both time and 3D space. The proposed approach outperforms prior state-of-the-art methods across a variety of reinforcement learning benchmarks. Code and videos at https://buoyancy99.github.io/unsup-3d-keypoints/
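
One plausible reading of the multi-view consistency term, sketched below under our own assumptions (not the authors' code): each camera view predicts the same K keypoints in a shared 3D frame, and the loss penalizes disagreement between views.

```python
import numpy as np

def multiview_consistency_loss(per_view_keypoints):
    """per_view_keypoints: (V, K, 3) array, the K 3D keypoints predicted
    from each of V camera views in a common world frame. Penalizes the
    spread of each keypoint's predictions around their mean."""
    P = np.asarray(per_view_keypoints)
    mean = P.mean(axis=0, keepdims=True)   # consensus keypoints
    return float(np.mean(np.sum((P - mean) ** 2, axis=-1)))
```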

7. S²-MLP: Spatial-Shift MLP Architecture for Vision

Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li

Recently, the vision Transformer (ViT) and its follow-up works abandon convolution and exploit the self-attention operation, attaining comparable or even higher accuracy than CNNs. More recently, MLP-Mixer abandons both convolution and self-attention, proposing an architecture containing only MLP layers. To achieve cross-patch communication, it devises an additional token-mixing MLP besides the channel-mixing MLP. MLP-Mixer achieves promising results when trained on an extremely large-scale dataset, but it cannot match its CNN and ViT counterparts when trained on medium-scale datasets such as ImageNet-1K and ImageNet-21K. This performance drop motivates us to rethink the token-mixing MLP. We discover that the token-mixing operation in MLP-Mixer is a variant of depthwise convolution with a global receptive field and a spatial-specific configuration, and that these two properties make the token-mixing MLP prone to overfitting. In this paper, we propose a novel pure-MLP architecture, spatial-shift MLP (S²-MLP). Different from MLP-Mixer, our S²-MLP contains only channel-mixing MLPs; we devise a spatial-shift operation to achieve communication between patches. The operation has a local receptive field and is spatially agnostic; meanwhile, it is parameter-free and computationally efficient. The proposed S²-MLP attains higher recognition accuracy than MLP-Mixer when trained on the ImageNet-1K dataset. Meanwhile, S²-MLP matches the excellent performance of ViT on ImageNet-1K with a considerably simpler architecture and fewer FLOPs and parameters.
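
The spatial-shift operation itself is only a few lines. A minimal NumPy sketch, assuming the paper's four-group scheme (each quarter of the channels is shifted one position along one of the four spatial directions); padding details may differ from the authors' implementation:

```python
import numpy as np

def spatial_shift(x):
    """S²-MLP-style spatial shift on an (H, W, C) feature map: the four
    channel groups are shifted right, left, down, and up by one position.
    Parameter-free and spatially agnostic; a sketch of the idea only."""
    h, w, c = x.shape
    g = c // 4
    out = x.copy()
    out[:, 1:, :g]       = x[:, :-1, :g]        # shift right
    out[:, :-1, g:2*g]   = x[:, 1:,  g:2*g]     # shift left
    out[1:, :, 2*g:3*g]  = x[:-1, :, 2*g:3*g]   # shift down
    out[:-1, :, 3*g:4*g] = x[1:,  :, 3*g:4*g]   # shift up
    return out
```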

8. Styleformer: Transformer based Generative Adversarial Networks with Style Vector

Jeeseung Park, Younggeun Kim

We propose Styleformer, a style-based generator for GAN architectures that is convolution-free and transformer-based. In our paper, we explain how a transformer can generate high-quality images, overcoming the difficulty convolution operations have in capturing global features of an image. Furthermore, we change the demodulation of StyleGAN2 and modify the existing transformer structure (e.g., residual connection, layer normalization) to create a strong style-based generator with a convolution-free structure. We also make Styleformer lighter by applying Linformer, enabling it to generate higher-resolution images with improvements in speed and memory. We experiment with low-resolution image datasets such as CIFAR-10, as well as high-resolution datasets like LSUN-church. Styleformer records FID 2.82 and IS 9.94 on CIFAR-10, which is comparable to the current state of the art and outperforms all GAN-based generative models, including StyleGAN2-ADA, with fewer parameters in the unconditional setting. We also achieve new state-of-the-art results on STL-10 (FID 20.11, IS 10.16) and CelebA (FID 3.66). We release our code at https://github.com/Jeeseung-Park/Styleformer.

9. Machine Learning Implicit Solvation for Molecular Dynamics

Yaoyi Chen, Andreas Krämer, Nicholas E. Charron, Brooke E. Husic, Cecilia Clementi, Frank Noé

Accurate modeling of the solvent environment for biological molecules is crucial for computational biology and drug design. A popular approach to achieving long simulation time scales for large system sizes is to incorporate the effect of the solvent in a mean-field fashion with implicit solvent models. However, a challenge with existing implicit solvent models is that they often lack accuracy or certain physical properties compared to explicit solvent models, as the many-body effects of the neglected solvent molecules are difficult to model as a mean field. Here, we leverage machine learning (ML) and multi-scale coarse graining (CG) in order to learn implicit solvent models that can approximate the energetic and thermodynamic properties of a given explicit solvent model with arbitrary accuracy, given enough training data. Following the previous ML-CG models CGnet and CGSchnet, we introduce ISSNet, a graph neural network, to model the implicit solvent potential of mean force. ISSNet can learn from explicit solvent simulation data and be readily applied to MD simulations. We compare the solute conformational distributions under different solvation treatments for two peptide systems. The results indicate that ISSNet models can outperform widely used generalized Born and surface area models in reproducing the thermodynamics of small protein systems with respect to explicit solvent. The success of this novel method demonstrates the potential benefit of applying machine learning to the accurate modeling of solvent effects for in silico research and biomedical applications.

10. Break-It-Fix-It: Unsupervised Learning for Program Repair

Michihiro Yasunaga, Percy Liang

We consider repair tasks: given a critic (e.g., compiler) that assesses the quality of an input, the goal is to train a fixer that converts a bad example (e.g., code with syntax errors) into a good one (e.g., code with no errors). Existing works create training data consisting of (bad, good) pairs by corrupting good examples using heuristics (e.g., dropping tokens). However, fixers trained on this synthetically-generated data do not extrapolate well to the real distribution of bad inputs. To bridge this gap, we propose a new training approach, Break-It-Fix-It (BIFI), which has two key ideas: (i) we use the critic to check a fixer’s output on real bad inputs and add good (fixed) outputs to the training data, and (ii) we train a breaker to generate realistic bad code from good code. Based on these ideas, we iteratively update the breaker and the fixer while using them in conjunction to generate more paired data. We evaluate BIFI on two code repair datasets: GitHub-Python, a new dataset we introduce where the goal is to repair Python code with AST parse errors; and DeepFix, where the goal is to repair C code with compiler errors. BIFI outperforms existing methods, obtaining 90.5% repair accuracy on GitHub-Python (+28.5%) and 71.7% on DeepFix (+5.6%). Notably, BIFI does not require any labeled data; we hope it will be a strong starting point for unsupervised learning of various repair tasks.
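
A hedged sketch of one BIFI round, with illustrative names (`fixer.generate`, `breaker.train`, `critic`) that stand in for the paper's actual components rather than any published API:

```python
def bifi_round(fixer, breaker, critic, real_bad, real_good):
    """One Break-It-Fix-It iteration, per the abstract's two ideas.
    `fixer`/`breaker` are seq2seq models with .generate/.train;
    `critic` returns True for good code. All names are illustrative."""
    # (i) run the fixer on real bad inputs; keep outputs the critic accepts
    fixer_pairs = [(bad, fixed) for bad in real_bad
                   for fixed in [fixer.generate(bad)] if critic(fixed)]
    # (ii) train the breaker on the verified pairs, inverted (good -> bad),
    # so it learns to produce realistic bad code
    breaker.train([(good, bad) for bad, good in fixer_pairs])
    # corrupt real good code with the breaker; keep genuinely-bad outputs
    breaker_pairs = [(broken, good) for good in real_good
                     for broken in [breaker.generate(good)] if not critic(broken)]
    # retrain the fixer on both sources of paired data
    fixer.train(fixer_pairs + breaker_pairs)
```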

11. Improved Transformer for High-Resolution GANs

Long Zhao, Zizhao Zhang, Ting Chen, Dimitris N. Metaxas, Han Zhang

  • retweets: 110, favorites: 57 (06/16/2021 09:34:38)
  • links: abs | pdf
  • cs.CV

Attention-based models, exemplified by the Transformer, can effectively model long-range dependencies but suffer from the quadratic complexity of the self-attention operation, making them difficult to adopt for high-resolution image generation based on Generative Adversarial Networks (GANs). In this paper, we introduce two key ingredients into the Transformer to address this challenge. First, in low-resolution stages of the generative process, standard global self-attention is replaced with the proposed multi-axis blocked self-attention, which allows efficient mixing of local and global attention. Second, in high-resolution stages, we drop self-attention while keeping only multi-layer perceptrons reminiscent of implicit neural functions. To further improve performance, we introduce an additional self-modulation component based on cross-attention. The resulting model, denoted HiT, has linear computational complexity with respect to the image size and thus directly scales to synthesizing high-definition images. We show in the experiments that the proposed HiT achieves state-of-the-art FID scores of 31.87 and 2.95 on unconditional ImageNet 128×128 and FFHQ 256×256, respectively, with reasonable throughput. We believe the proposed HiT is an important milestone for GAN generators that are completely free of convolutions.
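
To make "multi-axis blocked self-attention" concrete, here is a simplified single-head NumPy sketch of the two attention axes on a flattened feature map: one attends within non-overlapping blocks (local), the other attends across blocks at the same intra-block offset (global). Projections, multiple heads, and masking are omitted, so this illustrates only the axis split, not the paper's exact layer.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def blocked_attention(x, block):
    """Local axis: attention within non-overlapping blocks.
    x: (N, C) flattened feature map, N divisible by block."""
    n, c = x.shape
    xb = x.reshape(n // block, block, c)                  # (nb, block, C)
    att = softmax(xb @ xb.transpose(0, 2, 1) / np.sqrt(c))
    return (att @ xb).reshape(n, c)

def dilated_attention(x, block):
    """Global axis: attention across blocks at the same intra-block
    offset, obtained by swapping the block and batch-of-blocks dims."""
    n, c = x.shape
    xb = x.reshape(n // block, block, c).transpose(1, 0, 2)   # (block, nb, C)
    att = softmax(xb @ xb.transpose(0, 2, 1) / np.sqrt(c))
    return (att @ xb).transpose(1, 0, 2).reshape(n, c)
```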

12. Invariance Principle Meets Information Bottleneck for Out-of-Distribution Generalization

Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Yoshua Bengio, Ioannis Mitliagkas, Irina Rish

The invariance principle from causality is at the heart of notable approaches such as invariant risk minimization (IRM) that seek to address out-of-distribution (OOD) generalization failures. Despite the promising theory, invariance principle-based approaches fail in common classification tasks, where invariant (causal) features capture all the information about the label. Are these failures due to the methods failing to capture the invariance? Or is the invariance principle itself insufficient? To answer these questions, we revisit the fundamental assumptions in linear regression tasks, where invariance-based approaches were shown to provably generalize OOD. In contrast to the linear regression tasks, we show that for linear classification tasks we need much stronger restrictions on the distribution shifts, or otherwise OOD generalization is impossible. Furthermore, even with appropriate restrictions on distribution shifts in place, we show that the invariance principle alone is insufficient. We prove that a form of the information bottleneck constraint along with invariance helps address key failures when invariant features capture all the information about the label and also retains the existing success when they do not. We propose an approach that incorporates both of these principles and demonstrate its effectiveness in several experiments.
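
Schematically, the combined criterion can be written as a risk term plus an invariance penalty plus an information bottleneck term on the representation; this is a paraphrase for intuition, not the paper's exact objective:

$$\min_{\Phi,\,w}\ \sum_{e \in \mathcal{E}_{tr}} \Big[\, R^e(w \circ \Phi) + \lambda\, \big\| \nabla_w R^e(w \circ \Phi) \big\|^2 \,\Big] \;+\; \beta\, h\big(\Phi(X)\big),$$

where $R^e$ is the risk in training environment $e$, the gradient-norm penalty (as in IRM) enforces invariance of the classifier $w$ across environments, and $h(\Phi(X))$ is an entropy or information measure on the representation that enforces the bottleneck.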

13. CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis

Simon Rouard, Gaëtan Hadjeres

In this paper, we propose a novel score-based generative model for unconditional raw audio synthesis. Our proposal builds upon the latest developments in diffusion process modeling with stochastic differential equations, which have already demonstrated promising results on image generation. We motivate novel heuristics for choosing diffusion processes better suited to audio generation, and consider the use of a conditional U-Net to approximate the score function. While previous diffusion models for audio were mainly designed as medium-resolution speech vocoders, our method, termed CRASH (Controllable Raw Audio Synthesis with High-resolution), allows us to generate short percussive sounds at 44.1 kHz in a controllable way. Through extensive experiments on a drum sound generation task, we showcase the numerous sampling schemes offered by our method (unconditional generation, deterministic generation, inpainting, interpolation, variations, class-conditional sampling) and propose class-mixing sampling, a novel way to generate “hybrid” sounds. Our proposed method closes the gap with GAN-based methods on raw audio while offering more flexible generation capabilities with lighter, easier-to-train models.

14. Non Gaussian Denoising Diffusion Models

Eliya Nachmani, Robin San Roman, Lior Wolf

Generative diffusion processes are an emerging and effective tool for image and speech generation. In existing methods, the underlying noise distribution of the diffusion process is Gaussian. However, fitting distributions with more degrees of freedom could improve the performance of such generative models. In this work, we investigate other types of noise distribution for the diffusion process. Specifically, we show that noise from a Gamma distribution provides improved results for image and speech generation. Moreover, we show that using a mixture of Gaussian noise variables in the diffusion process improves performance over a diffusion process based on a single distribution. Our approach preserves the ability to efficiently sample states in the training diffusion process while using Gamma noise and a mixture of noise.
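
As a sketch of how a non-Gaussian noise source slots into the usual forward process, the NumPy fragment below draws zero-mean, unit-variance noise from a Gamma distribution and uses it in a DDPM-style step; the paper's actual parameterization of the Gamma process may differ.

```python
import numpy as np

def gamma_noise(shape, k=1.0, theta=1.0, rng=np.random.default_rng()):
    """Zero-mean, unit-variance noise from a Gamma(k, theta) draw.
    Gamma(k, theta) has mean k*theta and variance k*theta**2."""
    g = rng.gamma(k, theta, size=shape)
    return (g - k * theta) / (theta * np.sqrt(k))

def forward_step(x_prev, beta_t, noise_fn=gamma_noise):
    """One diffusion forward step with a swappable noise source; a sketch
    under the usual DDPM parameterization, not the paper's exact scheme."""
    eps = noise_fn(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps
```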

15. SinIR: Efficient General Image Manipulation with Single Image Reconstruction

Jihyeong Yoo, Qifeng Chen

  • retweets: 64, favorites: 35 (06/16/2021 09:34:39)
  • links: abs | pdf
  • cs.CV

We propose SinIR, an efficient reconstruction-based framework trained on a single natural image for general image manipulation, including super-resolution, editing, harmonization, paint-to-image, photo-realistic style transfer, and artistic style transfer. We train our model on a single image with cascaded multi-scale learning, where each network at each scale is responsible for image reconstruction. This reconstruction objective greatly reduces the complexity and running time of training compared to the GAN objective. However, the reconstruction objective also degrades output quality. To solve this problem, we further utilize simple random pixel shuffling, inspired by the Denoising Autoencoder, which also gives control over manipulation. With quantitative evaluation, we show that SinIR has competitive performance on various image manipulation tasks. Moreover, with a much simpler training objective (i.e., reconstruction), SinIR trains 33.5 times faster than SinGAN (for 500 × 500 images), which solves similar tasks. Our code is publicly available at github.com/YooJiHyeong/SinIR.
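
The random pixel shuffling is simple enough to sketch; the fraction of pixels shuffled here is our own illustrative choice, not the paper's setting.

```python
import numpy as np

def random_pixel_shuffle(img, frac=0.05, rng=np.random.default_rng()):
    """Randomly permute a fraction of pixel positions in an (H, W, C)
    image: a simple Denoising-Autoencoder-style corruption of the kind
    the abstract describes. A sketch, not the authors' code."""
    out = img.reshape(-1, img.shape[-1]).copy()
    idx = rng.choice(len(out), size=int(frac * len(out)), replace=False)
    out[idx] = out[rng.permutation(idx)]   # shuffle the selected pixels
    return out.reshape(img.shape)
```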

16. Video Super-Resolution Transformer

Jiezhang Cao, Yawei Li, Kai Zhang, Luc Van Gool

  • retweets: 56, favorites: 37 (06/16/2021 09:34:39)
  • links: abs | pdf
  • cs.CV

Video super-resolution (VSR), which aims to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem. Recently, the Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling. Thus, it seems straightforward to apply the vision Transformer to VSR. However, the typical Transformer block design, with a fully connected self-attention layer and a token-wise feed-forward layer, does not fit VSR well for two reasons. First, the fully connected self-attention layer fails to exploit data locality because it relies on linear layers to compute attention maps. Second, the token-wise feed-forward layer lacks the feature alignment that is important for VSR, since it processes each input token embedding independently, without any interaction among them. In this paper, we make the first attempt to adapt the Transformer for VSR. Specifically, to tackle the first issue, we present a spatial-temporal convolutional self-attention layer, with a theoretical grounding, that exploits locality information. For the second issue, we design a bidirectional optical flow-based feed-forward layer that discovers correlations across different video frames and aligns features. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our proposed method. The code will be available at https://github.com/caojiezhang/VSR-Transformer.

17. Training Graph Neural Networks with 1000 Layers

Guohao Li, Matthias Müller, Bernard Ghanem, Vladlen Koltun

Deep graph neural networks (GNNs) have achieved excellent results on various tasks on increasingly large graph datasets with millions of nodes and edges. However, memory complexity has become a major obstacle when training deep GNNs for practical applications due to the immense number of nodes, edges, and intermediate activations. To improve the scalability of GNNs, prior works propose smart graph sampling or partitioning strategies to train GNNs with a smaller set of nodes or sub-graphs. In this work, we study reversible connections, group convolutions, weight tying, and equilibrium models to advance the memory and parameter efficiency of GNNs. We find that reversible connections in combination with deep network architectures enable the training of overparameterized GNNs that significantly outperform existing methods on multiple datasets. Our models RevGNN-Deep (1001 layers with 80 channels each) and RevGNN-Wide (448 layers with 224 channels each) were both trained on a single commodity GPU and achieve an ROC-AUC of 87.74 ± 0.13 and 88.14 ± 0.15 on the ogbn-proteins dataset. To the best of our knowledge, RevGNN-Deep is the deepest GNN in the literature by one order of magnitude. Please visit our project website https://www.deepgcns.org/arch/gnn1000 for more information.
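
The memory saving comes from reversible residual connections, which let activations be recomputed during backpropagation instead of stored. A minimal sketch of the two-group reversible block (F and G stand for arbitrary GNN sub-blocks; the grouping and normalization details follow RevNet-style designs, not necessarily the authors' exact code):

```python
def reversible_forward(x1, x2, F, G):
    """Reversible two-group residual block: outputs determine inputs
    exactly, so intermediate activations need not be cached."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_inverse(y1, y2, F, G):
    """Exact reconstruction of the inputs from the outputs, used to
    recompute activations on the backward pass."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2
```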

18. Unified Interpretation of Softmax Cross-Entropy and Negative Sampling: With Case Study for Knowledge Graph Embedding

Hidetaka Kamigaito, Katsuhiko Hayashi

In knowledge graph embedding, the theoretical relationship between the softmax cross-entropy and negative sampling loss functions has not been investigated. This makes it difficult to fairly compare the results of the two different loss functions. We attempted to solve this problem by using the Bregman divergence to provide a unified interpretation of the softmax cross-entropy and negative sampling loss functions. Under this interpretation, we can derive theoretical findings for fair comparison. Experimental results on the FB15k-237 and WN18RR datasets show that the theoretical findings are valid in practical settings.
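
For reference, the Bregman divergence generated by a strictly convex function F, which serves as the common lens here, is

$$D_F(p, q) \;=\; F(p) - F(q) - \langle \nabla F(q),\, p - q \rangle .$$

Different choices of F yield different divergences; the paper's unified interpretation places both the softmax cross-entropy and negative sampling losses within this framework, which is what makes a fair comparison possible.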

19. Memory-efficient Transformers via Top-k Attention

Ankit Gupta, Guy Dar, Shaya Goodman, David Ciprut, Jonathan Berant

  • retweets: 35, favorites: 36 (06/16/2021 09:34:40)
  • links: abs | pdf
  • cs.CL | cs.LG

Following the success of dot-product attention in Transformers, numerous approximations have recently been proposed to address its quadratic complexity with respect to the input length. While these variants are memory- and compute-efficient, it is not possible to directly use them with popular pre-trained language models trained using vanilla attention, without an expensive corrective pre-training stage. In this work, we propose a simple yet highly accurate approximation for vanilla attention. We process the queries in chunks, and for each query, compute the top-k scores with respect to the keys. Our approach offers several advantages: (a) its memory usage is linear in the input size, similar to linear attention variants such as Performer and RFA, (b) it is a drop-in replacement for vanilla attention that does not require any corrective pre-training, and (c) it can also lead to significant memory savings in the feed-forward layers after casting them into the familiar query-key-value framework. We evaluate the quality of the top-k approximation for multi-head attention layers on the Long Range Arena benchmark, and for feed-forward layers of T5 and UnifiedQA on multiple QA datasets. We show our approach leads to accuracy that is nearly identical to vanilla attention in multiple setups, including training from scratch, fine-tuning, and zero-shot inference.
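
A minimal single-head NumPy sketch of the chunked top-k approximation described above (no projections or multi-head plumbing; illustrative only, not the authors' implementation):

```python
import numpy as np

def topk_attention(q, k, v, kk=64, chunk=1024):
    """For each chunk of queries, keep only the kk largest scores per
    query and softmax over them, ignoring all other keys."""
    outs = []
    scale = 1.0 / np.sqrt(k.shape[-1])
    for s in range(0, len(q), chunk):
        qc = q[s:s + chunk]                                   # (c, d)
        scores = qc @ k.T * scale                             # (c, n)
        idx = np.argpartition(scores, -kk, axis=-1)[:, -kk:]  # top-k keys
        val = np.take_along_axis(scores, idx, axis=-1)        # (c, kk)
        att = np.exp(val - val.max(-1, keepdims=True))
        att /= att.sum(-1, keepdims=True)                     # softmax over top-k only
        outs.append(np.einsum('ck,ckd->cd', att, v[idx]))
    return np.concatenate(outs, axis=0)
```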

20. GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, Zhiyong Yan

This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.

21. A Multi-Implicit Neural Representation for Fonts

Pradyumna Reddy, Zhifei Zhang, Matthew Fisher, Hailin Jin, Zhaowen Wang, Niloy J. Mitra

  • retweets: 36, favorites: 34 (06/16/2021 09:34:40)
  • links: abs | pdf
  • cs.CV | cs.GR

Fonts are ubiquitous across documents and come in a variety of styles. They are either represented in a native vector format or rasterized to produce fixed-resolution images. In the first case, the non-standard representation prevents benefiting from the latest network architectures for neural representations; in the latter case, the rasterized representation, when encoded via networks, loses data fidelity, as font-specific discontinuities like edges and corners are difficult to represent using neural networks. Based on the observation that complex fonts can be represented by a superposition of a set of simpler occupancy functions, we introduce multi-implicits to represent fonts as a permutation-invariant set of learned implicit functions, without losing features (e.g., edges and corners). However, while multi-implicits locally preserve font features, obtaining supervision in the form of ground-truth multi-channel signals is a problem in itself. Instead, we show how to train such a representation with only local supervision, while the proposed neural architecture directly finds globally consistent multi-implicits for font families. We extensively evaluate the proposed representation on various tasks, including reconstruction, interpolation, and synthesis, to demonstrate clear advantages over existing alternatives. Additionally, the representation naturally enables glyph completion, wherein a single characteristic font is used to synthesize a whole font family in the target style.

22. Variational Causal Networks: Approximate Bayesian Inference over Causal Structures

Yashas Annadani, Jonas Rothfuss, Alexandre Lacoste, Nino Scherrer, Anirudh Goyal, Yoshua Bengio, Stefan Bauer

Learning the causal structure that underlies data is a crucial step towards robust real-world decision making. The majority of existing work in causal inference focuses on determining a single directed acyclic graph (DAG) or a Markov equivalence class thereof. However, acting intelligently upon causal structure inferred from finite data demands reasoning about its uncertainty. For instance, planning interventions to find out more about the causal mechanisms that govern our data requires quantifying epistemic uncertainty over DAGs. While Bayesian causal inference allows us to do so, the posterior over DAGs becomes intractable even for a small number of variables. Aiming to overcome this issue, we propose a form of variational inference over the graphs of Structural Causal Models (SCMs). To this end, we introduce a parametric variational family modelled by an autoregressive distribution over the space of discrete DAGs. Its number of parameters does not grow exponentially with the number of variables, and it can be tractably learned by maximising an Evidence Lower Bound (ELBO). In our experiments, we demonstrate that the proposed variational posterior provides a good approximation of the true posterior.
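
The quantity being maximised is the standard ELBO, here taken over graphs: with $q_\phi(G)$ the autoregressive variational family over DAGs and $p(G)$ a prior,

$$\log p(\mathcal{D}) \;\ge\; \mathbb{E}_{q_\phi(G)}\big[\log p(\mathcal{D} \mid G)\big] \;-\; \mathrm{KL}\big(q_\phi(G)\,\|\,p(G)\big).$$

This is the generic form; the exact factorisation of $q_\phi$ and the treatment of the SCM parameters follow the paper.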

23. Evaluating Various Tokenizers for Arabic Text Classification

Zaid Alyafeai, Maged S. Al-shaibani, Mustafa Ghaleb, Irfan Ahmad

  • retweets: 30, favorites: 33 (06/16/2021 09:34:40)
  • links: abs | pdf
  • cs.CL | cs.LG

The first step in any NLP pipeline is learning word vector representations. However, given a large text corpus, representing all the words is not efficient. Many tokenization algorithms have emerged to tackle this problem by creating subwords, which in turn limit the vocabulary size of any text corpus. However, such algorithms are mostly language-agnostic, lack a proper way of capturing meaningful tokens, and are difficult to evaluate in practice. In this paper, we introduce three new tokenization algorithms for Arabic and compare them to three other baselines using unsupervised evaluations. In addition, we compare all six algorithms by evaluating them on three tasks: sentiment analysis, news classification, and poetry classification. Our experiments show that the performance of such tokenization algorithms depends on the size of the dataset, the type of the task, and the amount of morphology present in the dataset.

24. Large-Scale Unsupervised Object Discovery

Huy V. Vo, Elena Sizikova, Cordelia Schmid, Patrick Pérez, Jean Ponce

  • retweets: 36, favorites: 22 (06/16/2021 09:34:40)
  • links: abs | pdf
  • cs.CV

Existing approaches to unsupervised object discovery (UOD) do not scale up to large datasets without approximations that compromise their performance. We propose a novel formulation of UOD as a ranking problem, amenable to the arsenal of distributed methods available for eigenvalue problems and link analysis. Extensive experiments with COCO and OpenImages demonstrate that, in the single-object discovery setting where a single prominent object is sought in each image, the proposed LOD (Large-scale Object Discovery) approach is on par with, or better than, the state of the art for medium-scale datasets (up to 120K images), and over 37% better than the only other algorithms capable of scaling up to 1.7M images. In the multi-object discovery setting where multiple objects are sought in each image, the proposed LOD is over 14% better in average precision (AP) than all other methods for datasets ranging from 20K to 1.7M images.

25. Partial success in closing the gap between human and machine vision

Robert Geirhos, Kantharaju Narayanappa, Benjamin Mitzkus, Tizian Thieringer, Matthias Bethge, Felix A. Wichmann, Wieland Brendel

A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines “in the wild” and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, adding the “missing human baseline” by recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (e.g. vision transformers), and dataset size (ranging from 1M to 1B). Our findings are threefold. (1.) The longstanding robustness gap between humans and CNNs is closing, with the best models now matching or exceeding human performance on most OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree in their categorisation errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude. Our results give reason for cautious optimism: While there is still much room for improvement, the behavioural difference between human and machine vision is narrowing. In order to measure future progress, 17 OOD datasets with image-level human behavioural data are provided as a benchmark here: https://github.com/bethgelab/model-vs-human/
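
The "image-level consistency gap" in finding (2.) is measured at the level of individual trials. Below is a sketch of an error-consistency score of the kind used in this line of work: a kappa-style statistic comparing the observed agreement of right/wrong decisions to the agreement expected from the two accuracies alone. The paper's exact metric may differ in details.

```python
import numpy as np

def error_consistency(correct_a, correct_b):
    """correct_a, correct_b: boolean per-trial correctness of two
    observers (e.g., a human and a model) on the same images.
    Returns a kappa-like score: 0 = agreement expected by chance
    given the accuracies, 1 = perfect trial-level agreement."""
    a, b = np.asarray(correct_a), np.asarray(correct_b)
    c_obs = np.mean(a == b)                       # observed agreement
    pa, pb = a.mean(), b.mean()
    c_exp = pa * pb + (1 - pa) * (1 - pb)         # chance agreement
    return (c_obs - c_exp) / (1 - c_exp)
```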

26. Continuous Wavelet Vocoder-based Decomposition of Parametric Speech Waveform Synthesis

Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Csaba Zainkó, Géza Németh

To date, various speech technology systems have adopted the vocoder approach, a method for synthesizing speech waveforms that plays a major role in the performance of statistical parametric speech synthesis. WaveNet, one of the models whose output most closely resembles the human voice, has to generate waveforms in a time-consuming sequential manner with an extremely complex neural network structure.

27. D2C: Diffusion-Denoising Models for Few-shot Conditional Generation

Abhishek Sinha, Jiaming Song, Chenlin Meng, Stefano Ermon

Conditional generative models of high-dimensional images have many applications, but supervision signals from conditions to images can be expensive to acquire. This paper describes Diffusion-Decoding models with Contrastive representations (D2C), a paradigm for training unconditional variational autoencoders (VAEs) for few-shot conditional image generation. D2C uses a learned diffusion-based prior over the latent representations to improve generation, and contrastive self-supervised learning to improve representation quality. D2C can adapt to novel generation tasks conditioned on labels or manipulation constraints by learning from as few as 100 labeled examples. On conditional generation from new labels, D2C achieves superior performance over state-of-the-art VAEs and diffusion models. On conditional image manipulation, D2C generations are two orders of magnitude faster to produce than StyleGAN2 ones and are preferred by 50-60% of human evaluators in a double-blind study.

28. GitTables: A Large-Scale Corpus of Relational Tables

Madelon Hulsebos, Çağatay Demiralp, Paul Groth

  • retweets: 30, favorites: 20 (06/16/2021 09:34:41)
  • links: abs | pdf
  • cs.DB | cs.LG

The practical success of deep learning has sparked interest in improving relational table tasks, like data search, with models trained on large table corpora. Existing corpora primarily contain tables extracted from HTML pages, limiting their ability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need additional resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of currently 1.7M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 20M tables. We annotate table columns in GitTables with more than 2K different semantic types from Schema.org and DBpedia. Our column annotations consist of semantic types, hierarchical relations, range types, and descriptions. The corpus is available at https://gittables.github.io. Our analysis of GitTables shows that its structure, content, and topical coverage differ significantly from existing table corpora. We evaluate our annotation pipeline on hand-labeled tables from the T2Dv2 benchmark and find that our approach provides results on par with human annotations. We demonstrate a use case of GitTables by training a semantic type detection model on it, obtaining high prediction accuracy. We also show that the same model trained on tables from the Web generalizes poorly.