All Articles

Hot Papers 2021-04-09

1. SiT: Self-supervised vIsion Transformer

Sara Atito, Muhammad Awais, Josef Kittler

  • retweets: 5120, favorites: 356 (04/10/2021 09:00:11)
  • links: abs | pdf
  • cs.CV | cs.LG

Self-supervised learning methods are gaining increasing traction in computer vision due to their recent success in reducing the gap with supervised learning. In natural language processing (NLP), self-supervised learning and transformers are already the methods of choice. The recent literature suggests that transformers are becoming increasingly popular also in computer vision. So far, vision transformers have been shown to work well when pretrained either using large-scale supervised data or with some kind of co-supervision, e.g. in terms of a teacher network. These supervised pretrained vision transformers achieve very good results in downstream tasks with minimal changes. In this work we investigate the merits of self-supervised learning for pretraining image/vision transformers and then using them for downstream classification tasks. We propose Self-supervised vIsion Transformers (SiT) and discuss several self-supervised training mechanisms to obtain a pretext model. The architectural flexibility of SiT allows us to use it as an autoencoder and work with multiple self-supervised tasks seamlessly. We show that a pretrained SiT can be finetuned for a downstream classification task on small-scale datasets, consisting of a few thousand images rather than several millions. The proposed approach is evaluated on standard datasets using common protocols. The results demonstrate the strength of the transformers and their suitability for self-supervised learning. We outperform existing self-supervised learning methods by a large margin. We also observe that SiT is good for few-shot learning and show that it learns useful representations, by simply training a linear classifier on top of the learned features from SiT. Pretraining, finetuning, and evaluation code will be available at: https://github.com/Sara-Ahmed/SiT.
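
As a rough illustration of the kind of multi-task pretext objective described above, the sketch below combines a reconstruction head, a rotation-prediction head, and a contrastive projection head on top of a generic transformer encoder. The heads, loss weights, and corruption scheme are assumptions for illustration, not SiT's exact configuration.

```python
# Hypothetical multi-task self-supervised objective on a ViT-style encoder.
# Encoder, head sizes, and loss weights are placeholders, not SiT's actual setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSSLHead(nn.Module):
    def __init__(self, encoder, dim=768, img_dim=3 * 224 * 224, proj_dim=128):
        super().__init__()
        self.encoder = encoder                     # assumed to return (B, dim) features
        self.recon_head = nn.Linear(dim, img_dim)  # reconstruct the (flattened) clean image
        self.rot_head = nn.Linear(dim, 4)          # predict rotation in {0, 90, 180, 270}
        self.proj_head = nn.Linear(dim, proj_dim)  # projection for the contrastive task

    def forward(self, x):
        z = self.encoder(x)
        return self.recon_head(z), self.rot_head(z), F.normalize(self.proj_head(z), dim=-1)

def ssl_loss(model, corrupted, clean, rot_labels, proj_other_view, w=(1.0, 1.0, 1.0), tau=0.1):
    recon, rot_logits, proj = model(corrupted)
    l_recon = F.l1_loss(recon, clean.flatten(1))             # image reconstruction
    l_rot = F.cross_entropy(rot_logits, rot_labels)          # rotation prediction
    logits = proj @ proj_other_view.t() / tau                # contrast against a second view
    l_contrast = F.cross_entropy(logits, torch.arange(len(proj), device=proj.device))
    return w[0] * l_recon + w[1] * l_rot + w[2] * l_contrast
```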

2. Revisiting Simple Neural Probabilistic Language Models

Simeng Sun, Mohit Iyyer

  • retweets: 2707, favorites: 412 (04/10/2021 09:00:12)
  • links: abs | pdf
  • cs.CL

Recent progress in language modeling has been driven not only by advances in neural architectures, but also by hardware and optimization improvements. In this paper, we revisit the neural probabilistic language model (NPLM) of Bengio et al. (2003), which simply concatenates word embeddings within a fixed window and passes the result through a feed-forward network to predict the next word. When scaled up to modern hardware, this model (despite its many limitations) performs much better than expected on word-level language model benchmarks. Our analysis reveals that the NPLM achieves lower perplexity than a baseline Transformer with short input contexts but struggles to handle long-term dependencies. Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM's local concatenation layer, which results in small but consistent perplexity decreases across three word-level language modeling datasets.
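
For reference, the core of the NPLM described above fits in a few lines: embed the previous `window` tokens, concatenate, and predict the next token with a small feed-forward network. This is a minimal sketch of the classic architecture only; the paper's scaled-up variants and hyper-parameters differ.

```python
# Minimal sketch of the Bengio et al. (2003) NPLM core: concatenated context
# embeddings followed by a feed-forward next-word predictor.
import torch
import torch.nn as nn

class NPLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, window=5, hidden=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(window * emb_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, context_ids):        # context_ids: (batch, window)
        e = self.emb(context_ids)          # (batch, window, emb_dim)
        return self.ff(e.flatten(1))       # logits over the next token

# Example: logits = NPLM(vocab_size=10000)(torch.randint(0, 10000, (8, 5)))
```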

3. InfinityGAN: Towards Infinite-Resolution Image Synthesis

Chieh Hubert Lin, Hsin-Ying Lee, Yen-Chi Cheng, Sergey Tulyakov, Ming-Hsuan Yang

  • retweets: 1708, favorites: 356 (04/10/2021 09:00:12)
  • links: abs | pdf
  • cs.CV

We present InfinityGAN, a method to generate arbitrary-resolution images. The problem is associated with several key challenges. First, scaling existing models to a high resolution is resource-constrained, both in terms of computation and availability of high-resolution training data. InfinityGAN trains and infers patch-by-patch seamlessly with low computational resources. Second, large images should be locally and globally consistent, avoid repetitive patterns, and look realistic. To address these, InfinityGAN takes global appearance, local structure and texture into account. With this formulation, we can generate images with resolution and level of detail not attainable before. Experimental evaluation supports that InfinityGAN generates images with superior global structure compared to baselines, at the same time featuring parallelizable inference. Finally, we show several applications unlocked by our approach, such as fusing styles spatially, multi-modal outpainting, and image inbetweening at arbitrary input and output resolutions.
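
The patch-by-patch idea can be illustrated with a toy coordinate-conditioned generator: one global latent fixes appearance, while each patch is synthesized from a local latent plus its spatial coordinates, so arbitrarily large canvases can be assembled a patch at a time. This is a hypothetical sketch, not InfinityGAN's actual architecture.

```python
# Toy coordinate-conditioned patch generator (illustrative only; not InfinityGAN's design).
import torch
import torch.nn as nn

class PatchGenerator(nn.Module):
    def __init__(self, z_global=128, z_local=64):
        super().__init__()
        self.fc = nn.Linear(z_global + z_local + 2, 256 * 4 * 4)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z_global, z_local, coords):       # coords in [-1, 1]^2, one per patch
        h = self.fc(torch.cat([z_global, z_local, coords], dim=-1))
        return self.up(h.view(-1, 256, 4, 4))           # one 32x32 patch per row

# Four patches of a larger canvas, all sharing the same global latent:
zg = torch.randn(1, 128).expand(4, -1)
zl = torch.randn(4, 64)
coords = torch.tensor([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
patches = PatchGenerator()(zg, zl, coords)               # (4, 3, 32, 32)
```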

4. Does Your Dermatology Classifier Know What It Doesn’t Know? Detecting the Long-Tail of Unseen Conditions

Abhijit Guha Roy, Jie Ren, Shekoofeh Azizi, Aaron Loh, Vivek Natarajan, Basil Mustafa, Nick Pawlowski, Jan Freyberg, Yuan Liu, Zach Beaver, Nam Vo, Peggy Bui, Samantha Winter, Patricia MacWilliams, Greg S. Corrado, Umesh Telang, Yun Liu, Taylan Cemgil, Alan Karthikesalingam, Balaji Lakshminarayanan, Jim Winkens

  • retweets: 262, favorites: 188 (04/10/2021 09:00:12)
  • links: abs | pdf
  • cs.CV | cs.LG

We develop and rigorously evaluate a deep learning based system that can accurately classify skin conditions while detecting rare conditions for which there is not enough data available for training a confident classifier. We frame this task as an out-of-distribution (OOD) detection problem. Our novel approach, hierarchical outlier detection (HOD), assigns multiple abstention classes for each training outlier class and jointly performs a coarse classification of inliers vs. outliers, along with fine-grained classification of the individual classes. We demonstrate the effectiveness of the HOD loss in conjunction with modern representation learning approaches (BiT, SimCLR, MICLe) and explore different ensembling strategies for further improving the results. We perform an extensive subgroup analysis over conditions of varying risk levels and different skin types to investigate how the OOD detection performance changes over each subgroup and demonstrate the gains of our framework in comparison to baselines. Finally, we introduce a cost metric to approximate downstream clinical impact. We use this cost metric to compare the proposed method against a baseline system, thereby making a stronger case for the overall system effectiveness in a real-world deployment scenario.
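
A hedged sketch of a hierarchical loss in the spirit of HOD: the model predicts fine-grained logits over inlier plus abstention (outlier) classes, a coarse outlier probability is obtained by summing the abstention-class probabilities, and both levels are supervised jointly. The exact weighting and abstention structure used in the paper may differ.

```python
# Illustrative hierarchical outlier loss: joint fine-grained and coarse (inlier vs. outlier)
# supervision. Weighting and class structure are assumptions, not the paper's exact loss.
import torch
import torch.nn.functional as F

def hierarchical_outlier_loss(logits, fine_labels, is_outlier, n_inlier, alpha=0.5):
    # logits: (B, n_inlier + n_outlier); outlier classes occupy indices >= n_inlier
    fine_loss = F.cross_entropy(logits, fine_labels)
    probs = logits.softmax(dim=-1)
    p_outlier = probs[:, n_inlier:].sum(dim=-1).clamp(1e-6, 1 - 1e-6)
    coarse_loss = F.binary_cross_entropy(p_outlier, is_outlier.float())
    return alpha * fine_loss + (1 - alpha) * coarse_loss

# At test time, p_outlier (the summed abstention-class probability) can serve as the OOD score.
```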

5. De-rendering the World’s Revolutionary Artefacts

Shangzhe Wu, Ameesh Makadia, Jiajun Wu, Noah Snavely, Richard Tucker, Angjoo Kanazawa

  • retweets: 275, favorites: 59 (04/10/2021 09:00:13)
  • links: abs | pdf
  • cs.CV | cs.GR

Recent works have shown exciting results in unsupervised image de-rendering — learning to decompose 3D shape, appearance, and lighting from single-image collections without explicit supervision. However, many of these assume simplistic material and lighting models. We propose a method, termed RADAR, that can recover environment illumination and surface materials from real single-image collections, relying neither on explicit 3D supervision, nor on multi-view or multi-light images. Specifically, we focus on rotationally symmetric artefacts that exhibit challenging surface properties including specular reflections, such as vases. We introduce a novel self-supervised albedo discriminator, which allows the model to recover plausible albedo without requiring any ground-truth during training. In conjunction with a shape reconstruction module exploiting rotational symmetry, we present an end-to-end learning framework that is able to de-render the world’s revolutionary artefacts. We conduct experiments on a real vase dataset and demonstrate compelling decomposition results, allowing for applications including free-viewpoint rendering and relighting.

6. SOLD2: Self-supervised Occlusion-aware Line Description and Detection

Rémi Pautrat, Juan-Ting Lin, Viktor Larsson, Martin R. Oswald, Marc Pollefeys

  • retweets: 225, favorites: 47 (04/10/2021 09:00:13)
  • links: abs | pdf
  • cs.CV

Compared to feature point detection and description, detecting and matching line segments offer additional challenges. Yet, line features represent a promising complement to points for multi-view tasks. Lines are indeed well-defined by the image gradient, frequently appear even in poorly textured areas, and offer robust structural cues. We thus hereby introduce the first joint detection and description of line segments in a single deep network. Thanks to self-supervised training, our method does not require any annotated line labels and can therefore generalize to any dataset. Our detector offers repeatable and accurate localization of line segments in images, departing from the wireframe parsing approach. Leveraging recent progress in descriptor learning, our proposed line descriptor is highly discriminative, while remaining robust to viewpoint changes and occlusions. We evaluate our approach against previous line detection and description methods on several multi-view datasets created with homographic warps as well as real-world viewpoint changes. Our full pipeline yields higher repeatability, localization accuracy and matching metrics, and thus represents a first step to bridge the gap with learned feature point methods. Code and trained weights are available at https://github.com/cvg/SOLD2.

7. An Information-Theoretic Proof of a Finite de Finetti Theorem

Lampros Gavalakis, Ioannis Kontoyiannis

A finite form of de Finetti's representation theorem is established using elementary information-theoretic tools: The distribution of the first k random variables in an exchangeable binary vector of length n ≥ k is close to a mixture of product distributions. Closeness is measured in terms of the relative entropy and an explicit bound is provided.
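
Schematically, the statement has the following form (the explicit bound ε(n, k) and the choice of mixing measure are worked out in the paper):

```latex
% Schematic form of the finite de Finetti bound; \varepsilon(n,k) denotes the paper's explicit bound.
\[
  D\!\left( P_{X_1,\dots,X_k} \,\Big\|\, \int_{[0,1]} \mathrm{Bern}(\theta)^{\otimes k}\, d\mu(\theta) \right)
  \;\le\; \varepsilon(n,k), \qquad n \ge k.
\]
```

Here μ is a mixing measure on [0, 1], and the bound vanishes as n grows with k held fixed.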

8. A single gradient step finds adversarial examples on random two-layers neural networks

Sébastien Bubeck, Yeshwanth Cherapanamjeri, Gauthier Gidel, Rémi Tachet des Combes

Daniely and Schacham recently showed that gradient descent finds adversarial examples on random undercomplete two-layers ReLU neural networks. The term “undercomplete” refers to the fact that their proof only holds when the number of neurons is a vanishing fraction of the ambient dimension. We extend their result to the overcomplete case, where the number of neurons is larger than the dimension (yet also subexponential in the dimension). In fact we prove that a single step of gradient descent suffices. We also show this result for any subexponential width random neural network with smooth activation function.
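
The setting is easy to reproduce numerically: draw a random two-layer ReLU network, take one gradient step on the input against the sign of the output, and check whether the prediction flips. The dimensions, scalings, and step size below are illustrative choices, not the constants from the paper.

```python
# One gradient step on the input of a random two-layer ReLU network (illustrative
# scalings and step size; the paper's theorem specifies the precise regime).
import torch

d, m = 512, 2048                                 # input dimension, width (overcomplete: m > d)
W = torch.randn(m, d) / d ** 0.5                 # random first layer
a = torch.randn(m) / m ** 0.5                    # random second layer

def f(x):
    return a @ torch.relu(W @ x)

x = torch.randn(d, requires_grad=True)
y = f(x)
y.backward()

eps = 1.0                                        # small relative to ||x|| (about sqrt(d))
x_adv = x - eps * torch.sign(y) * x.grad / x.grad.norm()   # single step against the output sign
print(float(y), float(f(x_adv)))                 # the sign of the output frequently flips
```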

9. CoCoNets: Continuous Contrastive 3D Scene Representations

Shamit Lal, Mihir Prabhudesai, Ishita Mediratta, Adam W. Harley, Katerina Fragkiadaki

  • retweets: 144, favorites: 48 (04/10/2021 09:00:14)
  • links: abs | pdf
  • cs.CV

This paper explores self-supervised learning of amodal 3D feature representations from RGB and RGB-D posed images and videos, agnostic to object and scene semantic content, and evaluates the resulting scene representations in the downstream tasks of visual correspondence, object tracking, and object detection. The model infers a latent 3D representation of the scene in the form of 3D feature points, where each continuous world 3D point is mapped to its corresponding feature vector. The model is trained for contrastive view prediction by rendering 3D feature clouds in queried viewpoints and matching against the 3D feature point cloud predicted from the query view. Notably, the representation can be queried for any 3D location, even if it is not visible from the input view. Our model brings together three powerful ideas of recent exciting research work: 3D feature grids as a neural bottleneck for view prediction, implicit functions for handling resolution limitations of 3D grids, and contrastive learning for unsupervised training of feature representations. We show the resulting 3D visual feature representations effectively scale across objects and scenes, imagine information occluded or missing from the input viewpoints, track objects over time, align semantically related objects in 3D, and improve 3D object detection. We outperform many existing state-of-the-art methods for 3D feature learning and view prediction, which are either limited by 3D grid spatial resolution, do not attempt to build amodal 3D representations, or do not handle combinatorial scene variability due to their non-convolutional bottlenecks.
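
The contrastive view-prediction objective mentioned above can be illustrated with a standard point-level InfoNCE loss: features predicted for the same 3D world point from two different viewpoints should match, while features of different points should not. The 3D feature-cloud rendering and implicit-function querying of CoCoNets are not shown here.

```python
# Point-level InfoNCE between two views' features of the same 3D points
# (a generic contrastive objective, used here only to illustrate the idea).
import torch
import torch.nn.functional as F

def point_contrastive_loss(feat_view1, feat_view2, tau=0.07):
    # feat_view*: (N, D) features for the same N world points, one from each viewpoint
    f1 = F.normalize(feat_view1, dim=-1)
    f2 = F.normalize(feat_view2, dim=-1)
    logits = f1 @ f2.t() / tau                       # (N, N) similarities
    targets = torch.arange(len(f1), device=f1.device)
    return F.cross_entropy(logits, targets)          # positives lie on the diagonal
```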

10. How Transferable are Reasoning Patterns in VQA?

Corentin Kervadec, Theo Jaunet, Grigory Antipov, Moez Baccouche, Romain Vuillemot, Christian Wolf

  • retweets: 62, favorites: 50 (04/10/2021 09:00:14)
  • links: abs | pdf
  • cs.CV

Since its inception, Visual Question Answering (VQA) has been notoriously known as a task where models are prone to exploiting biases in datasets to find shortcuts instead of performing high-level reasoning. Classical methods address this by removing biases from training data, or adding branches to models to detect and remove biases. In this paper, we argue that uncertainty in vision is a dominating factor preventing the successful learning of reasoning in vision and language problems. We train a visual oracle and, in a large-scale study, provide experimental evidence that it is much less prone to exploiting spurious dataset biases compared to standard models. We propose to study the attention mechanisms at work in the visual oracle and compare them with a SOTA Transformer-based model. We provide an in-depth analysis and visualizations of reasoning patterns obtained with an online visualization tool which we make publicly available (https://reasoningpatterns.github.io). We exploit these insights by transferring reasoning patterns from the oracle to a SOTA Transformer-based VQA model taking standard noisy visual inputs via fine-tuning. In experiments we report higher overall accuracy, as well as accuracy on infrequent answers for each question type, which provides evidence for improved generalization and a decrease of the dependency on dataset biases.

11. Pushing the Limits of Non-Autoregressive Speech Recognition

Edwin G. Ng, Chung-Cheng Chiu, Yu Zhang, William Chan

We bring recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition. We push the limits of non-autoregressive state-of-the-art results for multiple datasets: LibriSpeech, Fisher+Switchboard and Wall Street Journal. Key to our recipe, we leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training. We achieve 1.8%/3.6% WER on LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% WER on the Wall Street Journal, all without a language model.
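
The non-autoregressive recipe boils down to a per-frame classifier trained with CTC and decoded greedily in a single pass. Below is a minimal PyTorch sketch with a stand-in encoder (the paper uses giant Conformers); shapes and vocabulary are placeholders.

```python
# CTC training and greedy single-pass decoding with a stand-in encoder
# (the paper uses large Conformer encoders with SpecAugment and wav2vec2 pretraining).
import torch
import torch.nn as nn

vocab, blank = 32, 0
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, vocab))  # placeholder

feats = torch.randn(4, 200, 80)                  # (batch, frames, mel features)
log_probs = encoder(feats).log_softmax(-1)       # per-frame token log-probabilities

targets = torch.randint(1, vocab, (4, 20))
ctc = nn.CTCLoss(blank=blank)
loss = ctc(log_probs.transpose(0, 1),            # CTCLoss expects (frames, batch, vocab)
           targets,
           torch.full((4,), 200),                # input lengths
           torch.full((4,), 20))                 # target lengths

# Greedy non-autoregressive decoding: argmax per frame, collapse repeats, drop blanks.
pred = log_probs.argmax(-1)[0]
decoded = [int(p) for i, p in enumerate(pred) if p != blank and (i == 0 or p != pred[i - 1])]
```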

12. 3D Shape Generation and Completion through Point-Voxel Diffusion

Linqi Zhou, Yilun Du, Jiajun Wu

  • retweets: 42, favorites: 54 (04/10/2021 09:00:14)
  • links: abs | pdf
  • cs.CV

We propose a novel approach for probabilistic generative modeling of 3D shapes. Unlike most existing models that learn to deterministically translate a latent vector to a shape, our model, Point-Voxel Diffusion (PVD), is a unified, probabilistic formulation for unconditional shape generation and conditional, multi-modal shape completion. PVD marries denoising diffusion models with the hybrid, point-voxel representation of 3D shapes. It can be viewed as a series of denoising steps, reversing the diffusion process from observed point cloud data to Gaussian noise, and is trained by optimizing a variational lower bound to the (conditional) likelihood function. Experiments demonstrate that PVD is capable of synthesizing high-fidelity shapes, completing partial point clouds, and generating multiple completion results from single-view depth scans of real objects.
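
As context for the denoising-diffusion formulation, here is a minimal sketch of the standard noise-prediction training step on a point cloud, with a toy per-point MLP standing in for PVD's point-voxel network; the conditional completion variant is not shown.

```python
# Standard DDPM-style training step on a point cloud, with a toy noise predictor
# standing in for the point-voxel network (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 + 1, 128), nn.ReLU(), nn.Linear(128, 3))

    def forward(self, x, t):                           # x: (B, N, 3) noisy points, t: (B,)
        t_feat = (t.float() / T)[:, None, None].expand(-1, x.shape[1], 1)
        return self.net(torch.cat([x, t_feat], dim=-1))

def diffusion_loss(model, x0):                         # x0: (B, N, 3) clean point cloud
    t = torch.randint(0, T, (x0.shape[0],))
    a = alphas_bar[t][:, None, None]
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps         # forward diffusion at step t
    return F.mse_loss(model(x_t, t), eps)              # train the model to predict the noise
```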

13. On Biasing Transformer Attention Towards Monotonicity

Annette Rios, Chantal Amrhein, Noëmi Aepli, Rico Sennrich

  • retweets: 58, favorites: 36 (04/10/2021 09:00:14)
  • links: abs | pdf
  • cs.CL

Many sequence-to-sequence tasks in natural language processing are roughly monotonic in the alignment between source and target sequence, and previous work has facilitated or enforced learning of monotonic attention behavior via specialized attention functions or pretraining. In this work, we introduce a monotonicity loss function that is compatible with standard attention mechanisms and test it on several sequence-to-sequence tasks: grapheme-to-phoneme conversion, morphological inflection, transliteration, and dialect normalization. Experiments show that we can achieve largely monotonic behavior. Performance is mixed, with larger gains on top of RNN baselines. General monotonicity does not benefit transformer multihead attention; however, we see isolated improvements when only a subset of heads is biased towards monotonic behavior.
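
One way such a monotonicity loss could look (the paper's exact formulation may differ): compute the expected source position attended to at each target step and penalize any decrease along the target axis.

```python
# Hypothetical monotonicity penalty on attention weights; added to the task loss
# with a weighting factor. Not necessarily the paper's exact loss.
import torch

def monotonicity_penalty(attn):
    # attn: (batch, tgt_len, src_len), each row a distribution over source positions
    src_pos = torch.arange(attn.shape[-1], dtype=attn.dtype, device=attn.device)
    expected = (attn * src_pos).sum(-1)                        # expected source index per target step
    backward_steps = (expected[:, :-1] - expected[:, 1:]).clamp(min=0)
    return backward_steps.mean()                               # zero iff attention never moves backwards

# total_loss = task_loss + lambda_mono * monotonicity_penalty(attn_weights)
```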

14. Neural Temporal Point Processes: A Review

Oleksandr Shchur, Ali Caner Türkmen, Tim Januschowski, Stephan Günnemann

  • retweets: 42, favorites: 48 (04/10/2021 09:00:14)
  • links: abs | pdf
  • cs.LG

Temporal point processes (TPPs) are probabilistic generative models for continuous-time event sequences. Neural TPPs combine the fundamental ideas from the point process literature with deep learning approaches, thus enabling construction of flexible and efficient models. The topic of neural TPPs has attracted significant attention in recent years, leading to the development of numerous new architectures and applications for this class of models. In this review paper we aim to consolidate the existing body of knowledge on neural TPPs. Specifically, we focus on important design choices and general principles for defining neural TPP models. Next, we provide an overview of application areas commonly considered in the literature. We conclude this survey with a list of open challenges and important directions for future work in the field of neural TPPs.
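
To make the model class concrete, here is a minimal sketch of one classic neural TPP design (an RNN history encoder with an exponential intensity whose compensator has a closed form); the review covers many richer parameterizations.

```python
# Minimal RNN-based temporal point process with intensity lambda(tau) = exp(c + w * tau),
# whose integral (compensator) is available in closed form. One of many designs in the review.
import torch
import torch.nn as nn

class SimpleNeuralTPP(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(1, hidden, batch_first=True)
        self.to_c = nn.Linear(hidden, 1)
        self.log_w = nn.Parameter(torch.zeros(1))

    def log_likelihood(self, inter_times):                     # inter_times: (B, L) gaps between events
        h, _ = self.rnn(inter_times.unsqueeze(-1))
        h = torch.cat([torch.zeros_like(h[:, :1]), h[:, :-1]], dim=1)  # history before each event
        c = self.to_c(h).squeeze(-1)
        w = self.log_w.exp()
        log_intensity = c + w * inter_times                    # log lambda at each event time
        compensator = (torch.exp(c + w * inter_times) - torch.exp(c)) / w  # integral over each gap
        return (log_intensity - compensator).sum(-1)           # per-sequence log-likelihood
```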

15. EXPATS: A Toolkit for Explainable Automated Text Scoring

Hitoshi Manabe, Masato Hagiwara

  • retweets: 36, favorites: 29 (04/10/2021 09:00:14)
  • links: abs | pdf
  • cs.CL

Automated text scoring (ATS) tasks, such as automated essay scoring and readability assessment, are important educational applications of natural language processing. Due to the interpretability of their models and predictions, traditional machine learning (ML) algorithms based on handcrafted features are still in wide use for ATS tasks. Practitioners often need to experiment with a variety of models (including deep and traditional ML ones), features, and training objectives (regression and classification), while modern deep learning frameworks such as PyTorch require deep ML expertise to utilize fully. In this paper, we present EXPATS, an open-source framework that allows its users to develop and experiment with different ATS models quickly by offering flexible components, an easy-to-use configuration system, and a command-line interface. The toolkit also provides seamless integration with the Language Interpretability Tool (LIT) so that one can interpret and visualize models and their predictions. We also describe two case studies where we build ATS models quickly with minimal engineering effort. The toolkit is available at https://github.com/octanove/expats.

16. On tuning consistent annealed sampling for denoising score matching

Joan Serrà, Santiago Pascual, Jordi Pons

Score-based generative models provide state-of-the-art quality for image and audio synthesis. Sampling from these models is performed iteratively, typically employing a discretized series of noise levels and a predefined scheme. In this note, we first overview three common sampling schemes for models trained with denoising score matching. Next, we focus on one of them, consistent annealed sampling, and study its hyper-parameter boundaries. We then highlight a possible formulation of such a hyper-parameter that explicitly considers those boundaries and facilitates tuning when using few or a variable number of steps. Finally, we highlight some connections of the formulation with other sampling schemes.
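
For orientation, here is a generic annealed Langevin sampling loop for a denoising score model; the consistent annealed scheme analyzed in the note modifies how noise is injected so the iterate variance tracks the schedule exactly, and its hyper-parameters are the subject of the paper. `score_fn` is an assumed pretrained model.

```python
# Generic annealed Langevin sampling for a score model trained with denoising score
# matching (not the consistent annealed scheme itself; shown only to illustrate the
# roles of the noise schedule and the step-size hyper-parameter).
import torch

def annealed_langevin_sample(score_fn, shape, sigmas, n_steps_each=5, step_lr=2e-5):
    x = torch.randn(shape) * sigmas[0]
    for sigma in sigmas:                                # schedule runs from high to low noise
        alpha = step_lr * (sigma / sigmas[-1]) ** 2     # per-level step size
        for _ in range(n_steps_each):
            noise = torch.randn_like(x)
            x = x + 0.5 * alpha * score_fn(x, sigma) + (alpha ** 0.5) * noise
    return x

# Example schedule: sigmas = torch.logspace(math.log10(50.0), math.log10(0.01), steps=200)
```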