Hot Papers 2021-06-21

1. World-GAN: a Generative Model for Minecraft Worlds

Maren Awiszus, Frederik Schubert, Bodo Rosenhahn

retweets: 9936, favorites: 423 (06/22/2021 07:43:12)
links: abs | pdf
cs.LG | cs.CV | cs.NE

This work introduces World-GAN, the first method to perform data-driven Procedural Content Generation via Machine Learning in Minecraft from a single example. Based on a 3D Generative Adversarial Network (GAN) architecture, we are able to create arbitrarily sized world snippets from a given sample. We evaluate our approach on creations from the community as well as structures generated with the Minecraft World Generator. Our method is motivated by the dense representations used in Natural Language Processing (NLP) introduced with word2vec [1]. The proposed block2vec representations make World-GAN independent from the number of different blocks, which can vary a lot in Minecraft, and enable the generation of larger levels. Finally, we demonstrate that changing this new representation space allows us to change the generated style of an already trained generator. World-GAN enables its users to generate Minecraft worlds based on parts of their creations.
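Since block2vec is motivated by word2vec, one way to picture its training data is as skip-gram style (target, context) pairs collected from neighbouring blocks. The sketch below is only an illustration of that idea: the toy world, the six-neighbour context and the name `skipgram_pairs` are assumptions and are not taken from the paper.

```python
from collections import Counter

# Hypothetical toy world: world[y][z][x] holds a block name.
world = [
    [["stone", "stone"], ["stone", "dirt"]],   # lower layer (y = 0)
    [["grass", "grass"], ["grass", "air"]],    # upper layer (y = 1)
]

# Assumed context window: the six face-adjacent neighbours of each block.
OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def skipgram_pairs(grid):
    """Return (target, context) block-name pairs for every in-bounds neighbour."""
    ny, nz, nx = len(grid), len(grid[0]), len(grid[0][0])
    pairs = []
    for y in range(ny):
        for z in range(nz):
            for x in range(nx):
                for dy, dz, dx in OFFSETS:
                    yy, zz, xx = y + dy, z + dz, x + dx
                    if 0 <= yy < ny and 0 <= zz < nz and 0 <= xx < nx:
                        pairs.append((grid[y][z][x], grid[yy][zz][xx]))
    return pairs


pairs = skipgram_pairs(world)
pair_counts = Counter(pairs)
```

Pairs like these could then feed a standard skip-gram embedding trainer, so that each block type gets a dense vector whose size does not depend on how many block types the world contains.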

World-GAN: a Generative Model for Minecraft Worlds
pdf: https://t.co/rSx646L1hp
abs: https://t.co/RJcHEOTUQT pic.twitter.com/n0sw8fEAWF
— AK (@ak92501) June 21, 2021

2. Distributed Deep Learning in Open Collaborations

Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitry Popov, Dmitry Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, Denis Mazur, Ilia Kobelev, Yacine Jernite, Thomas Wolf, Gennady Pekhimenko

retweets: 4852, favorites: 338 (06/22/2021 07:43:13)
links: abs | pdf
cs.LG | cs.DC

Modern deep learning applications require increasingly more compute to train state-of-the-art models. To address this demand, large corporations and institutions use dedicated High-Performance Computing clusters, whose construction and maintenance are both environmentally costly and well beyond the budget of most organizations. As a result, some research directions become the exclusive domain of a few large industrial and even fewer academic actors. To alleviate this disparity, smaller groups may pool their computational resources and run collaborative experiments that benefit all participants. This paradigm, known as grid- or volunteer computing, has seen successful applications in numerous scientific areas. However, using this approach for machine learning is difficult due to high latency, asymmetric bandwidth, and several challenges unique to volunteer computing. In this work, we carefully analyze these constraints and propose a novel algorithmic framework designed specifically for collaborative training. We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining in realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost. Finally, we provide a detailed report of successful collaborative language model pretraining with 40 participants.

What if we could mutualize everyone's compute? 🤯🤯🤯
Super proud to share the result of our first experiment of decentralized collaborative training with 40 volunteers!
The result is an Albert-like model for Bengali competitive with the current SOTA!
📰 https://t.co/kdiuPmpaPv pic.twitter.com/ymEmU6UtYU
— Hugging Face (@huggingface) June 21, 2021

In our latest work, we propose DeDLOC, a method for efficient collaborative training. This approach allowed us to pretrain sahajBERT (a Bengali-language ALBERT) together with the help of volunteers from the community! (1/10) https://t.co/qqGYpwJupp https://t.co/3oWGvNRImx pic.twitter.com/8hBkFBhMyY
— Max Ryabinin (@m_ryabinin) June 21, 2021

Distributed Deep Learning In Open Collaborations

Using volunteer compute pooled by many small groups, they successfully train SwAV and ALBERT and achieve performance comparable to traditional setups at a fraction of the cost. https://t.co/QFc5TOdRBy pic.twitter.com/4LyY4glP7h
— Aran Komatsuzaki (@arankomatsuzaki) June 21, 2021

3. Efficient Self-supervised Vision Transformers for Representation Learning

Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao

retweets: 2909, favorites: 379 (06/22/2021 07:43:13)
links: abs | pdf
cs.CV | cs.AI | cs.LG

This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task of region matching, which allows the model to capture fine-grained region dependencies and, as a result, significantly improves the quality of the learned vision representations. Our results show that, by combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. The code and models will be publicly available.
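The matching step behind a region-matching task can be sketched as follows, under the assumption that each region of one augmented view is paired with its most similar region in the other view by cosine similarity. The feature values and the helper names `cosine` and `match_regions` are hypothetical; the abstract does not specify the similarity measure or the loss applied to matched pairs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_regions(view_a, view_b):
    """For each region feature in view_a, return the index of the most similar region in view_b."""
    return [
        max(range(len(view_b)), key=lambda j: cosine(feat, view_b[j]))
        for feat in view_a
    ]


# Hypothetical 2-D region features for two augmented views of one image.
view_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view_b = [[0.1, 0.9], [0.9, 0.1], [0.7, 0.8]]

matches = match_regions(view_a, view_b)
print(matches)
```

A self-supervised loss would then be applied between each region and its match, which is what would let the model learn region-level rather than only image-level correspondences.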

Efficient Self-supervised Vision Transformers for Representation Learning
pdf: https://t.co/0YImgGI4Fg
abs: https://t.co/gIxv7H8Qsn

EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput pic.twitter.com/28P5rH7AW0
— AK (@ak92501) June 21, 2021

Efficient Self-supervised Vision Transformers for Representation Learning

EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput. https://t.co/At20ak6pMR pic.twitter.com/YGK7svZM3X
— Aran Komatsuzaki (@arankomatsuzaki) June 21, 2021

4. Riemannian Convex Potential Maps

Samuel Cohen, Brandon Amos, Yaron Lipman

retweets: 2146, favorites: 254 (06/22/2021 07:43:13)
links: abs | pdf
cs.LG | stat.ML

Modeling distributions on Riemannian manifolds is a crucial component in understanding non-Euclidean data that arises, e.g., in physics and geology. The budding approaches in this space are limited by representational and computational tradeoffs. We propose and study a class of flows that uses convex potentials from Riemannian optimal transport. These are universal and can model distributions on any compact Riemannian manifold without requiring domain knowledge of the manifold to be integrated into the architecture. We demonstrate that these flows can model standard distributions on spheres, and tori, on synthetic and geological data. Our source code is freely available online at http://github.com/facebookresearch/rcpm
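To make the idea of a potential-driven map on a manifold concrete, here is a minimal sketch on the unit sphere. It assumes the usual form of a Riemannian optimal transport map, x ↦ exp_x(−∇φ(x)), which the abstract does not spell out. The linear potential used here is a toy choice that has not been checked for the convexity conditions the paper relies on, and all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_to_tangent(x, g):
    """Remove the component of g along x, leaving a tangent vector at x on the unit sphere."""
    c = dot(x, g)
    return [gi - c * xi for gi, xi in zip(g, x)]


def sphere_exp(x, v):
    """Exponential map on the unit sphere: follow the geodesic from x in direction v."""
    norm_v = math.sqrt(dot(v, v))
    if norm_v < 1e-12:
        return list(x)
    return [math.cos(norm_v) * xi + math.sin(norm_v) * vi / norm_v
            for xi, vi in zip(x, v)]


def toy_flow(x, a, scale=0.5):
    """Map x to exp_x(-grad phi(x)) for the toy potential phi(x) = -scale * <a, x>."""
    euclid_grad = [-scale * ai for ai in a]
    riem_grad = project_to_tangent(x, euclid_grad)
    return sphere_exp(x, [-gi for gi in riem_grad])


x = [1.0, 0.0, 0.0]
a = [0.0, 0.0, 1.0]   # hypothetical direction that the toy potential pulls towards
y = toy_flow(x, a)
```

The output stays on the sphere by construction, which is the property that makes this family of maps attractive: the manifold constraint is handled by the exponential map rather than by the architecture.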

Stoked to release our milestone #ICML2021 paper on Riemannian Convex Potential Maps! With @CohenSamuel13 and @lipmanya

Paper: https://t.co/FIDJe2CCLk
JAX Code: https://t.co/0KwXeXb33e
Slides: https://t.co/zkNde7WzXZ

🧵 pic.twitter.com/JzznubhCZE
— Brandon Amos (@brandondamos) June 21, 2021

Neat idea
Samuel Cohen, @brandondamos and Yaron Lipman !!!

😍 Riemannian Convex Potential Maps 😍 https://t.co/DhyAfSTbYE
— Danilo J. Rezende (@DaniloJRezende) June 21, 2021

5. How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, Lucas Beyer

retweets: 1937, favorites: 391 (06/22/2021 07:43:14)
links: abs | pdf
cs.CV | cs.AI | cs.LG

Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer’s weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation (“AugReg” for short) when training on smaller training datasets. We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget. As one result of this study we find that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data: we train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.

How to train your ViT? https://t.co/3E7A7xMKxX

An investigation w/ Andreas Steiner @__kolesnikov__ @XiaohuaZhai @wightmanr @kyosu & yours truly.

Some "obvious" findings, some surprises, >50k models released, and ViT trained on *public* data matching JFT-trained ViT! (87%)
🧵👇 pic.twitter.com/6xqICSc8Az
— Lucas Beyer (@giffmana) June 21, 2021

How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

Trains ViT on ImageNet-21k with AugReg, which either matches or outperforms their counterparts trained on the larger, but not publicly available JFT-300M dataset. https://t.co/pnxWDqKwgw pic.twitter.com/pwcE2IWltH
— Aran Komatsuzaki (@arankomatsuzaki) June 21, 2021

How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
pdf: https://t.co/6isx98PS1M

train ViT models of various sizes on the ImageNet-21k dataset, either match or outperform their counterparts trained on larger, but not publicly available JFT-300M pic.twitter.com/Zoso6tUPPA
— AK (@ak92501) June 21, 2021

New paper https://t.co/zForQciqJA 🧵 : How to train your ViT? It is common to train vision transformers on ImageNet-1k (~1.3m images) for 300 epochs. We show that you are better off investing the same compute budget for training on ImageNet-21k (~13m images) for 30 epochs. pic.twitter.com/zaCJyhQy9S
— Alexander Kolesnikov (@__kolesnikov__) June 21, 2021
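The "same compute budget" claim in the tweet above can be checked with quick arithmetic, assuming that training cost for a fixed model and image resolution is roughly proportional to the number of images processed. The image counts are the approximate figures quoted in the tweet.

```python
# Image counts and epochs as quoted in the tweet (approximate figures).
imagenet_1k_images = 1_300_000
imagenet_21k_images = 13_000_000

samples_seen_1k = imagenet_1k_images * 300    # 300 epochs on ImageNet-1k
samples_seen_21k = imagenet_21k_images * 30   # 30 epochs on ImageNet-21k

print(samples_seen_1k, samples_seen_21k)
```

Both schedules process about 390 million images, so the comparison is between seeing a small dataset many times and seeing a ten times larger dataset a few times.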

6. DeepLab2: A TensorFlow Library for Deep Labeling

Mark Weber, Huiyu Wang, Siyuan Qiao, Jun Xie, Maxwell D. Collins, Yukun Zhu, Liangzhe Yuan, Dahun Kim, Qihang Yu, Daniel Cremers, Laura Leal-Taixe, Alan L. Yuille, Florian Schroff, Hartwig Adam, Liang-Chieh Chen

retweets: 1480, favorites: 146 (06/22/2021 07:43:15)
links: abs | pdf
cs.CV

DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a state-of-the-art and easy-to-use TensorFlow codebase for general dense pixel prediction problems in computer vision. DeepLab2 includes all our recently developed DeepLab model variants with pretrained checkpoints as well as model training and evaluation code, allowing the community to reproduce and further improve upon the state-of-the-art systems. To showcase the effectiveness of DeepLab2, our Panoptic-DeepLab employing Axial-SWideRNet as network backbone achieves 68.0% PQ or 83.5% mIoU on the Cityscapes validation set, with only single-scale inference and ImageNet-1K pretrained checkpoints. We hope that publicly sharing our library could facilitate future research on dense pixel labeling tasks and envision new applications of this technology. Code is made publicly available at https://github.com/google-research/deeplab2.

DeepLab2: A TensorFlow Library for Deep Labeling
pdf: https://t.co/9AoKx6SsAq
abs: https://t.co/HC1o1SNV7B
github (to be released): https://t.co/kGoQTP6Jpd pic.twitter.com/sIU32FAX58
— AK (@ak92501) June 21, 2021

7. Discovering Relationships between Object Categories via Universal Canonical Maps

Natalia Neverova, Artsiom Sanakoyeu, Patrick Labatut, David Novotny, Andrea Vedaldi

retweets: 589, favorites: 197 (06/22/2021 07:43:15)
links: abs | pdf
cs.CV

We tackle the problem of learning the geometry of multiple categories of deformable objects jointly. Recent work has shown that it is possible to learn a unified dense pose predictor for several categories of related objects. However, training such models requires initializing inter-category correspondences by hand. This is suboptimal and the resulting models fail to maintain correct correspondences as individual categories are learned. In this paper, we show that improved correspondences can be learned automatically as a natural byproduct of learning category-specific dense pose predictors. To do this, we express correspondences between different categories and between images and categories using a unified embedding. Then, we use the latter to enforce two constraints: symmetric inter-category cycle consistency and a new asymmetric image-to-category cycle consistency. Without any manual annotations for the inter-category correspondences, we obtain state-of-the-art alignment results, outperforming dedicated methods for matching 3D shapes. Moreover, the new model is also better at the task of dense pose prediction than prior work.

Check out our new #CVPR21 paper!
Discovering Relationships between Object Categories via Universal Canonical Maps

In collaboration with FAIR (@NataliaNeverova, P. Labatut, @davnov134 and A. Vedaldi)

🌐https://t.co/MPRxgSLIFw
▶️https://t.co/SpMdp6LGD2
📝https://t.co/xzniLtflQm pic.twitter.com/CVv3QFJ4DZ
— Artsiom Sanakoyeu (@artsiom_s) June 21, 2021

Discovering Relationships between Object Categories via Universal Canonical Maps
pdf: https://t.co/LjQS6y7vuL

Without manual annotations for the inter-category correspondences, sota alignment results, outperforming dedicated methods for matching 3D shapes pic.twitter.com/Yt7F6csc3Y
— AK (@ak92501) June 21, 2021

8. PixInWav: Residual Steganography for Hiding Pixels in Audio

Margarita Geleta, Cristina Punti, Kevin McGuinness, Jordi Pons, Cristian Canton, Xavier Giro-i-Nieto

retweets: 600, favorites: 85 (06/22/2021 07:43:15)
links: abs | pdf
cs.MM | cs.SD | eess.AS

Steganography comprises the mechanics of hiding data in a host medium that may be publicly available. While previous works focused on unimodal setups (e.g., hiding images in images, or hiding audio in audio), PixInWav targets the multimodal case of hiding images in audio. To this end, we propose a novel residual architecture operating on top of short-time discrete cosine transform (STDCT) audio spectrograms. Among our results, we find that the residual audio steganography setup we propose allows independent encoding of the hidden image from the host audio without compromising quality. Accordingly, while previous works require both host and hidden signals to hide a signal, PixInWav can encode images offline, and the encoded image can later be hidden, in a residual fashion, in any audio signal. Finally, we test our scheme in a lab setting to transmit images over airwaves from a loudspeaker to a microphone, verifying our theoretical insights and obtaining promising results.
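The practical consequence of the residual setup, encoding the image once and adding it to any host, can be illustrated with a deliberately simplified sketch. It uses plain lists in place of STDCT spectrograms and a fixed scaling in place of a learned encoder, so `encode_image`, `hide` and the `strength` parameter are hypothetical stand-ins and not the paper's architecture.

```python
def encode_image(pixels, strength=0.01):
    """Stand-in encoder: turn image pixels into a small residual, without seeing any audio."""
    return [strength * (p / 255.0) for p in pixels]


def hide(host, residual):
    """Residual hiding: the container is the host signal plus the precomputed residual."""
    return [h + r for h, r in zip(host, residual)]


image = [0, 128, 255, 64]                  # hypothetical flattened image
residual = encode_image(image)             # computed offline, once

host_speech = [0.20, -0.10, 0.05, 0.30]    # two unrelated hypothetical host signals
host_music = [-0.40, 0.25, 0.00, 0.10]

container_speech = hide(host_speech, residual)
container_music = hide(host_music, residual)
max_change = max(abs(c - h) for c, h in zip(container_speech, host_speech))
```

Because the encoder never takes the host as input, the same residual is reused for both hosts, and the distortion added to each host is bounded by the size of the residual alone.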

Our paper is out! Pioneering multimodal audio steganography with deep learning. In this paper we present a residual approach for hiding images into audio and show a real "over the air transmission" of images through waveforms: https://t.co/AVq3eHkuJa pic.twitter.com/UPi6gp9Lrm
— Rita Geleta (@ritageleta) June 21, 2021

9. The Principles of Deep Learning Theory

Daniel A. Roberts, Sho Yaida, Boris Hanin

retweets: 218, favorites: 79 (06/22/2021 07:43:15)
links: abs | pdf
cs.LG | cs.AI | hep-th | stat.ML

This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models’ predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.
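The effect of tuning initialization to criticality can be seen in a small forward-pass experiment. The sketch below assumes bias-free ReLU layers with weights drawn from N(0, c_w / width) and uses c_w = 2 as the critical value; that value is the well-known He initialization setting for ReLU and is an assumption here, since the abstract does not state it. The function name and default sizes are hypothetical.

```python
import math
import random


def relu_depth_norm(c_w, width=64, depth=30, seed=0):
    """Push a random vector through `depth` random ReLU layers; return output norm / input norm."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    norm_in = math.sqrt(sum(v * v for v in h))
    std = math.sqrt(c_w / width)          # weights drawn from N(0, c_w / width)
    for _ in range(depth):
        pre = [sum(rng.gauss(0.0, std) * v for v in h) for _ in range(width)]
        h = [max(0.0, p) for p in pre]
    return math.sqrt(sum(v * v for v in h)) / norm_in


ratio_critical = relu_depth_norm(c_w=2.0)   # assumed critical value for ReLU
ratio_small = relu_depth_norm(c_w=1.0)      # below it: the signal shrinks with depth
print(ratio_critical, ratio_small)
```

Because ReLU is positively homogeneous and both runs share a seed, the two ratios differ by exactly a factor of sqrt(2) per layer, so halving the weight variance shrinks the signal by about 2^15 over 30 layers. The same mechanism applied to gradients is what produces vanishing or exploding behaviour away from the critical setting.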

[Basic concepts] This book presents, in a very comprehensive way, the principles of Deep Learning Theory, with a strong focus on practical application. Highly recommended for anyone who wants a deeper understanding of the topic. #DeepLearning https://t.co/uxrmP3ZUrC pic.twitter.com/p8IrYopyLV
— Underfox (@Underfox3) June 21, 2021

The Principles of Deep Learning Theory. (arXiv:2106.10165v1 [cs.LG]) https://t.co/uoajcL1mGF
— Stat.ML Papers (@StatMLPapers) June 21, 2021

10. HifiFace: 3D Shape and Semantic Prior Guided High Fidelity Face Swapping

Yuhan Wang, Xu Chen, Junwei Zhu, Wenqing Chu, Ying Tai, Chengjie Wang, Jilin Li, Yongjian Wu, Feiyue Huang, Rongrong Ji

retweets: 144, favorites: 80 (06/22/2021 07:43:16)
links: abs | pdf
cs.CV

In this work, we propose a high fidelity face swapping method, called HifiFace, which preserves the face shape of the source face well and generates photo-realistic results. Unlike existing face swapping works that only use a face recognition model to keep the identity similarity, we propose a 3D shape-aware identity to control the face shape with geometric supervision from a 3DMM and a 3D face reconstruction method. Meanwhile, we introduce the Semantic Facial Fusion module to optimize the combination of encoder and decoder features and perform adaptive blending, which makes the results more photo-realistic. Extensive experiments on faces in the wild demonstrate that our method preserves identity better, especially the face shape, and generates more photo-realistic results than previous state-of-the-art methods.

HifiFace: 3D Shape and Semantic Prior Guided High Fidelity Face Swapping
pdf: https://t.co/mBtUdIZsyy
abs: https://t.co/HdM2nPqD8A
project page: https://t.co/fioZ9ewuHM pic.twitter.com/SsmjYlenvn
— AK (@ak92501) June 21, 2021

Published 22 Jun 2021

Tatsuya Shirakawa, ML Lead at Beatrust (https://beatrust.com)