1. World-GAN: a Generative Model for Minecraft Worlds
Maren Awiszus, Frederik Schubert, Bodo Rosenhahn
This work introduces World-GAN, the first method to perform data-driven Procedural Content Generation via Machine Learning in Minecraft from a single example. Based on a 3D Generative Adversarial Network (GAN) architecture, we are able to create arbitrarily sized world snippets from a given sample. We evaluate our approach on creations from the community as well as structures generated with the Minecraft World Generator. Our method is motivated by the dense representations used in Natural Language Processing (NLP) introduced with word2vec [1]. The proposed block2vec representations make World-GAN independent of the number of different block types, which can vary greatly in Minecraft, and enable the generation of larger levels. Finally, we demonstrate that changing this new representation space allows us to change the generated style of an already trained generator. World-GAN enables its users to generate Minecraft worlds based on parts of their creations.
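The abstract does not spell out how block2vec is trained, but the underlying idea follows word2vec: treat each block's spatial neighbours as its "context" and learn a dense embedding per block type. Below is a minimal sketch of that idea, assuming a toy 8x8x8 block volume, a six-neighbour context, and gensim's skip-gram implementation; none of these choices come from the paper.

```python
# Hypothetical block2vec sketch: treat each block's axis-aligned neighbours as
# its "context", then train a word2vec-style skip-gram over those contexts.
# The toy volume, window, and embedding size are illustrative choices.
import numpy as np
from gensim.models import Word2Vec

rng = np.random.default_rng(0)
blocks = rng.choice(["air", "stone", "dirt", "oak_log", "water"], size=(8, 8, 8))

sentences = []
for x in range(1, 7):
    for y in range(1, 7):
        for z in range(1, 7):
            # centre block followed by its six axis-aligned neighbours
            sentences.append([
                blocks[x, y, z],
                blocks[x - 1, y, z], blocks[x + 1, y, z],
                blocks[x, y - 1, z], blocks[x, y + 1, z],
                blocks[x, y, z - 1], blocks[x, y, z + 1],
            ])

# Skip-gram embeddings over block "contexts".
model = Word2Vec(sentences, vector_size=8, window=3, min_count=1, sg=1, epochs=20)
print(model.wv["stone"].shape)  # (8,)
```

The GAN can then consume and produce these dense vectors instead of one-hot block IDs, which is what makes the generator independent of the size of the block vocabulary.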
World-GAN: a Generative Model for Minecraft Worlds
— AK (@ak92501) June 21, 2021
pdf: https://t.co/rSx646L1hp
abs: https://t.co/RJcHEOTUQT pic.twitter.com/n0sw8fEAWF
2. Distributed Deep Learning in Open Collaborations
Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitry Popov, Dmitry Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, Denis Mazur, Ilia Kobelev, Yacine Jernite, Thomas Wolf, Gennady Pekhimenko
Modern deep learning applications require increasingly more compute to train state-of-the-art models. To address this demand, large corporations and institutions use dedicated High-Performance Computing clusters, whose construction and maintenance are both environmentally costly and well beyond the budget of most organizations. As a result, some research directions become the exclusive domain of a few large industrial and even fewer academic actors. To alleviate this disparity, smaller groups may pool their computational resources and run collaborative experiments that benefit all participants. This paradigm, known as grid- or volunteer computing, has seen successful applications in numerous scientific areas. However, using this approach for machine learning is difficult due to high latency, asymmetric bandwidth, and several challenges unique to volunteer computing. In this work, we carefully analyze these constraints and propose a novel algorithmic framework designed specifically for collaborative training. We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining in realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost. Finally, we provide a detailed report of successful collaborative language model pretraining with 40 participants.
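The abstract describes the goal (pooling heterogeneous volunteer hardware into one large-batch training run) rather than the mechanism, so the snippet below is only a self-contained toy simulation of the core idea: peers contribute micro-batch gradients of very different sizes, and an optimizer step is taken once the pooled contributions reach a target global batch size. The model, learning rate, peer batch sizes, and target batch are made-up illustrative values; the real system additionally handles peer discovery, fault tolerance, and communication, none of which is shown.

```python
# Toy simulation of volunteer gradient pooling: heterogeneous "peers" each
# contribute gradients from whatever micro-batch they managed to process;
# one optimizer step is applied once the pooled batch reaches a target size.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 1)
params = list(model.parameters())
opt = torch.optim.SGD(params, lr=0.1)
target_batch = 256                      # global batch size per optimizer step
peer_batches = [8, 32, 4, 64]           # unequal per-peer micro-batch sizes

pooled = [torch.zeros_like(p) for p in params]
pooled_samples = 0

for step in range(200):
    bs = peer_batches[step % len(peer_batches)]
    x, y = torch.randn(bs, 16), torch.randn(bs, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    grads = torch.autograd.grad(loss, params)
    # weight each contribution by the number of samples it was computed on
    for buf, g in zip(pooled, grads):
        buf += g * bs
    pooled_samples += bs
    if pooled_samples >= target_batch:
        for p, buf in zip(params, pooled):
            p.grad = buf / pooled_samples
        opt.step()
        opt.zero_grad()
        pooled = [torch.zeros_like(p) for p in params]
        pooled_samples = 0
```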
What if we could mutualize everyone's compute? 🤯🤯🤯
— Hugging Face (@huggingface) June 21, 2021
Super proud to share the result of our first experiment of decentralized collaborative training with 40 volunteers!
The result is an Albert-like model for Bengali competitive with the current SOTA!
📰 https://t.co/kdiuPmpaPv pic.twitter.com/ymEmU6UtYU
In our latest work, we propose DeDLOC — a method for efficient collaborative training. This approach allowed us to pretrain sahajBERT (a Bengali-language ALBERT) together with the help of volunteers from the community! (1/10) https://t.co/qqGYpwJupp https://t.co/3oWGvNRImx pic.twitter.com/8hBkFBhMyY
— Max Ryabinin (@m_ryabinin) June 21, 2021
Distributed Deep Learning In Open Collaborations
— Aran Komatsuzaki (@arankomatsuzaki) June 21, 2021
Using volunteer compute pooled by many small groups, they successfully train SwAV and ALBERT and achieve performance comparable to traditional setups at a fraction of the cost. https://t.co/QFc5TOdRBy pic.twitter.com/4LyY4glP7h
3. Efficient Self-supervised Vision Transformers for Representation Learning
Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao
This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies and, as a result, significantly improves the quality of the learned vision representations. Our results show that, combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. The code and models will be publicly available.
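A rough sketch of what a region-matching pre-training objective can look like is given below: every patch token from the student view is matched to its most similar patch token from the teacher view and pulled toward it. The plain cosine objective, the hard matching, and the tensor shapes are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative region-matching loss: for two augmented views encoded by a
# teacher and a student, match every student region (patch token) to its
# most similar teacher region and maximize their cosine similarity.
import torch
import torch.nn.functional as F

def region_matching_loss(student_regions, teacher_regions):
    # student_regions: (B, N, D) patch features from the student view
    # teacher_regions: (B, M, D) patch features from the teacher view
    s = F.normalize(student_regions, dim=-1)
    t = F.normalize(teacher_regions, dim=-1)
    sim = torch.einsum("bnd,bmd->bnm", s, t)   # pairwise cosine similarities
    best = sim.max(dim=-1).values              # best-matching teacher region
    return (1.0 - best).mean()                 # pull matched pairs together

B, N, M, D = 4, 49, 49, 256
loss = region_matching_loss(torch.randn(B, N, D), torch.randn(B, M, D).detach())
print(float(loss))
```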
Efficient Self-supervised Vision Transformers for Representation Learning
— AK (@ak92501) June 21, 2021
pdf: https://t.co/0YImgGI4Fg
abs: https://t.co/gIxv7H8Qsn
EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput pic.twitter.com/28P5rH7AW0
Efficient Self-supervised Vision Transformers for Representation Learning
— Aran Komatsuzaki (@arankomatsuzaki) June 21, 2021
EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput. https://t.co/At20ak6pMR pic.twitter.com/YGK7svZM3X
4. Riemannian Convex Potential Maps
Samuel Cohen, Brandon Amos, Yaron Lipman
Modeling distributions on Riemannian manifolds is a crucial component in understanding non-Euclidean data that arises, e.g., in physics and geology. The budding approaches in this space are limited by representational and computational tradeoffs. We propose and study a class of flows that uses convex potentials from Riemannian optimal transport. These are universal and can model distributions on any compact Riemannian manifold without requiring domain knowledge of the manifold to be integrated into the architecture. We demonstrate that these flows can model standard distributions on spheres and tori, on both synthetic and geological data. Our source code is freely available online at http://github.com/facebookresearch/rcpm
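The central construction, pushing points forward with the Riemannian exponential map of a potential's tangential gradient, can be sketched on the unit sphere as follows. The quadratic toy potential and the NumPy-only implementation are stand-ins for the paper's learned potentials; see the repository linked above for the authors' JAX implementation.

```python
# Sketch of a convex-potential map on the unit sphere S^2: move each point x
# along the Riemannian exponential map of the tangential gradient of a scalar
# potential phi. The toy quadratic potential is purely illustrative.
import numpy as np

A = np.array([0.0, 0.0, 1.0])

def phi_grad(x):
    return (x @ A) * A                  # Euclidean gradient of phi(x) = 0.5*(x.a)^2

def exp_map_sphere(x, v, eps=1e-12):
    # Riemannian exponential map on the unit sphere at x, applied to tangent v.
    n = np.linalg.norm(v)
    if n < eps:
        return x
    return np.cos(n) * x + np.sin(n) * v / n

def potential_map(x):
    g = phi_grad(x)
    g_tan = g - (g @ x) * x             # project gradient onto tangent space at x
    return exp_map_sphere(x, g_tan)

x = np.array([1.0, 0.0, 1.0])
x = x / np.linalg.norm(x)
y = potential_map(x)
print(y, np.linalg.norm(y))             # image stays on the sphere (norm ~ 1)
```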
Stoked to release our milestone #ICML2021 paper on Riemannian Convex Potential Maps! With @CohenSamuel13 and @lipmanya
— Brandon Amos (@brandondamos) June 21, 2021
Paper: https://t.co/FIDJe2CCLk
JAX Code: https://t.co/0KwXeXb33e
Slides: https://t.co/zkNde7WzXZ
🧵 pic.twitter.com/JzznubhCZE
Neat idea
— Danilo J. Rezende (@DaniloJRezende) June 21, 2021
Samuel Cohen, @brandondamos and Yaron Lipman !!!
😍 Riemannian Convex Potential Maps 😍 https://t.co/DhyAfSTbYE
5. How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, Lucas Beyer
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer’s weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation (“AugReg” for short) when training on smaller training datasets. We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget. As one result of this study, we find that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data: we train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
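For readers wanting a concrete starting point, here is a hedged sketch of what an AugReg-style setup looks like using the timm library: RandAugment plus Mixup/CutMix on the data side, and dropout plus stochastic depth on the model side. The specific magnitudes and rates below are illustrative defaults, not the sweep studied in the paper.

```python
# Hedged sketch of an "AugReg" training configuration with timm:
# RandAugment + Mixup/CutMix for augmentation, dropout + stochastic depth for
# regularization. The exact values are illustrative, not the paper's recipe.
import timm
from timm.data import create_transform, Mixup

model = timm.create_model(
    "vit_small_patch16_224",
    pretrained=False,
    num_classes=1000,
    drop_rate=0.1,        # dropout
    drop_path_rate=0.1,   # stochastic depth
)

train_transform = create_transform(
    input_size=224,
    is_training=True,
    auto_augment="rand-m9-mstd0.5",   # RandAugment policy string
)

mixup_fn = Mixup(
    mixup_alpha=0.2,
    cutmix_alpha=1.0,
    label_smoothing=0.1,
    num_classes=1000,
)
# In the training loop: images, targets = mixup_fn(images, targets)
```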
How to train your ViT? https://t.co/3E7A7xMKxX
— Lucas Beyer (@giffmana) June 21, 2021
An investigation w/ Andreas Steiner @__kolesnikov__ @XiaohuaZhai @wightmanr @kyosu & yours truly.
Some "obvious" findings, some surprises, >50k models released, and ViT trained on *public* data matching JFT-trained ViT! (87%)
🧵👇 pic.twitter.com/6xqICSc8Az
How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
— Aran Komatsuzaki (@arankomatsuzaki) June 21, 2021
Trains ViT on ImageNet-21k with AugReg, which either matches or outperforms their counterparts trained on the larger, but not publicly available JFT-300M dataset. https://t.co/pnxWDqKwgw pic.twitter.com/pwcE2IWltH
How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
— AK (@ak92501) June 21, 2021
pdf: https://t.co/6isx98PS1M
train ViT models of various sizes on the ImageNet-21k dataset, either match or outperform their counterparts trained on larger, but not publicly available JFT-300M pic.twitter.com/Zoso6tUPPA
New paper https://t.co/zForQciqJA 🧵 : How to train your ViT? It is common to train vision transformers on ImageNet-1k (~1.3m images) for 300 epochs. We show that you are better off investing the same compute budget for training on ImageNet-21k (~13m images) for 30 epochs. pic.twitter.com/zaCJyhQy9S
— Alexander Kolesnikov (@__kolesnikov__) June 21, 2021
6. DeepLab2: A TensorFlow Library for Deep Labeling
Mark Weber, Huiyu Wang, Siyuan Qiao, Jun Xie, Maxwell D. Collins, Yukun Zhu, Liangzhe Yuan, Dahun Kim, Qihang Yu, Daniel Cremers, Laura Leal-Taixe, Alan L. Yuille, Florian Schroff, Hartwig Adam, Liang-Chieh Chen
DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a state-of-the-art and easy-to-use TensorFlow codebase for general dense pixel prediction problems in computer vision. DeepLab2 includes all our recently developed DeepLab model variants with pretrained checkpoints as well as model training and evaluation code, allowing the community to reproduce and further improve upon the state-of-the-art systems. To showcase the effectiveness of DeepLab2, our Panoptic-DeepLab employing Axial-SWideRNet as network backbone achieves 68.0% PQ or 83.5% mIoU on the Cityscapes validation set, with only single-scale inference and ImageNet-1K pretrained checkpoints. We hope that publicly sharing our library could facilitate future research on dense pixel labeling tasks and envision new applications of this technology. Code is made publicly available at https://github.com/google-research/deeplab2.
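Since the library had not yet been released at the time of these tweets, no DeepLab2 usage example is shown here; instead, the snippet below is a generic sketch of the mIoU metric quoted above, computed from a confusion matrix, just to make the reported number concrete. The 19-class setting mirrors Cityscapes, but the random inputs are purely illustrative and this is not DeepLab2 code.

```python
# Generic mean-IoU computation (the metric quoted above), not DeepLab2 code:
# accumulate a confusion matrix over predictions and labels, then average
# per-class intersection-over-union.
import numpy as np

def mean_iou(pred, label, num_classes):
    # pred, label: integer class maps of identical shape
    mask = (label >= 0) & (label < num_classes)
    cm = np.bincount(
        num_classes * label[mask].astype(int) + pred[mask],
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    inter = np.diag(cm)
    union = cm.sum(0) + cm.sum(1) - inter
    iou = inter / np.maximum(union, 1)
    return iou[union > 0].mean()

pred = np.random.randint(0, 19, size=(512, 1024))
label = np.random.randint(0, 19, size=(512, 1024))
print(mean_iou(pred, label, num_classes=19))
```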
DeepLab2: A TensorFlow Library for Deep Labeling
— AK (@ak92501) June 21, 2021
pdf: https://t.co/9AoKx6SsAq
abs: https://t.co/HC1o1SNV7B
github (to be released): https://t.co/kGoQTP6Jpd pic.twitter.com/sIU32FAX58
7. Discovering Relationships between Object Categories via Universal Canonical Maps
Natalia Neverova, Artsiom Sanakoyeu, Patrick Labatut, David Novotny, Andrea Vedaldi
We tackle the problem of learning the geometry of multiple categories of deformable objects jointly. Recent work has shown that it is possible to learn a unified dense pose predictor for several categories of related objects. However, training such models requires initializing inter-category correspondences by hand. This is suboptimal and the resulting models fail to maintain correct correspondences as individual categories are learned. In this paper, we show that improved correspondences can be learned automatically as a natural byproduct of learning category-specific dense pose predictors. To do this, we express correspondences between different categories and between images and categories using a unified embedding. Then, we use the latter to enforce two constraints: symmetric inter-category cycle consistency and a new asymmetric image-to-category cycle consistency. Without any manual annotations for the inter-category correspondences, we obtain state-of-the-art alignment results, outperforming dedicated methods for matching 3D shapes. Moreover, the new model is also better at the task of dense pose prediction than prior work.
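In spirit, both constraints can be written as round trips in the shared embedding space: mapping category A to category B and back (the inter-category cycle), or mapping image pixels to a category and back, should return to the starting point. The sketch below implements only the symmetric inter-category case with a soft nearest-neighbour mapping; the temperature, embedding sizes, and squared-error penalty are illustrative assumptions, not the paper's exact losses.

```python
# Sketch of a symmetric inter-category cycle-consistency loss in a shared
# embedding space: map vertices of category A to B via soft nearest neighbours,
# map back, and penalize the distance to where we started.
import torch
import torch.nn.functional as F

def soft_nn(src, dst, tau=0.07):
    # src: (N, D), dst: (M, D) -> (N, D) soft nearest-neighbour reconstruction
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).T
    return F.softmax(sim / tau, dim=-1) @ dst

def cycle_consistency_loss(emb_a, emb_b):
    a_round_trip = soft_nn(soft_nn(emb_a, emb_b), emb_a)   # A -> B -> A
    b_round_trip = soft_nn(soft_nn(emb_b, emb_a), emb_b)   # B -> A -> B
    return (emb_a - a_round_trip).pow(2).mean() + (emb_b - b_round_trip).pow(2).mean()

emb_cat_a = torch.randn(500, 64)   # e.g. canonical vertex embeddings, category A
emb_cat_b = torch.randn(450, 64)   # e.g. canonical vertex embeddings, category B
print(float(cycle_consistency_loss(emb_cat_a, emb_cat_b)))
```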
Check out our new #CVPR21 paper!
— Artsiom Sanakoyeu (@artsiom_s) June 21, 2021
Discovering Relationships between Object Categories via Universal Canonical Maps
In collaboration with FAIR (@NataliaNeverova, P. Labatut, @davnov134 and A. Vedaldi)
🌐https://t.co/MPRxgSLIFw
▶️https://t.co/SpMdp6LGD2
📝https://t.co/xzniLtflQm pic.twitter.com/CVv3QFJ4DZ
Discovering Relationships between Object Categories via Universal Canonical Maps
— AK (@ak92501) June 21, 2021
pdf: https://t.co/LjQS6y7vuL
Without manual annotations for the inter-category correspondences, sota alignment results, outperforming dedicated methods for matching 3D shapes pic.twitter.com/Yt7F6csc3Y
8. PixInWav: Residual Steganography for Hiding Pixels in Audio
Margarita Geleta, Cristina Punti, Kevin McGuinness, Jordi Pons, Cristian Canton, Xavier Giro-i-Nieto
Steganography comprises the mechanics of hiding data in host media that may be publicly available. While previous works focused on unimodal setups (e.g., hiding images in images, or hiding audio in audio), PixInWav targets the multimodal case of hiding images in audio. To this end, we propose a novel residual architecture operating on top of short-time discrete cosine transform (STDCT) audio spectrograms. Among our results, we find that the residual audio steganography setup we propose allows the hidden image to be encoded independently of the host audio without compromising quality. Accordingly, while previous works require both host and hidden signals to hide a signal, PixInWav can encode images offline, to be later hidden, in a residual fashion, into any audio signal. Finally, we test our scheme in a lab setting, transmitting images over the air from a loudspeaker to a microphone, verifying our theoretical insights and obtaining promising results.
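The residual design can be sketched as follows: the hidden image is encoded into a residual that depends only on the image, the residual is added to the short-time DCT spectrogram of whatever host audio is chosen, and a decoder recovers the image from the stego spectrogram alone. The tiny single-layer conv nets, frame length, and scaling factor below are illustrative stand-ins for the paper's architecture.

```python
# Sketch of residual audio steganography in the spirit of PixInWav: encode the
# hidden image into a residual, add it to the host's short-time DCT
# spectrogram, and decode the image from the stego spectrogram alone.
import numpy as np
import torch
import torch.nn as nn
from scipy.fft import dct, idct

frame_len = 64
audio = np.random.randn(64 * frame_len).astype(np.float32)   # toy host audio
frames = audio.reshape(-1, frame_len)
stdct = dct(frames, norm="ortho", axis=-1)                    # (64, 64) spectrogram

image = torch.rand(1, 1, 64, 64)                              # toy hidden image
encoder = nn.Conv2d(1, 1, kernel_size=3, padding=1)           # image -> residual
decoder = nn.Conv2d(1, 1, kernel_size=3, padding=1)           # stego -> image

host = torch.from_numpy(stdct).float().view(1, 1, 64, 64)
residual = encoder(image)                   # computed from the image only
stego = host + 0.01 * residual              # residual addition to any host
recovered = decoder(stego)                  # decode image from stego STDCT

stego_audio = idct(stego.detach().squeeze().numpy(), norm="ortho", axis=-1).reshape(-1)
print(recovered.shape, stego_audio.shape)
```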
Our paper is out! Pioneering multimodal audio steganography with deep learning. In this paper we present a residual approach for hiding images into audio and show a real "over the air transmission" of images through waveforms: https://t.co/AVq3eHkuJa pic.twitter.com/UPi6gp9Lrm
— Rita Geleta (@ritageleta) June 21, 2021
9. The Principles of Deep Learning Theory
Daniel A. Roberts, Sho Yaida, Boris Hanin
- retweets: 218, favorites: 79 (06/22/2021 07:43:15)
- cs.LG | cs.AI | hep-th | stat.ML
This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models’ predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.
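One piece of this story that is easy to reproduce is criticality at initialization: for a deep ReLU network, choosing the weight variance 2/fan_in keeps the scale of activations roughly constant with depth, while a generic choice makes it shrink (or blow up) exponentially. The short NumPy experiment below illustrates exactly that; the width, depth, and the comparison variance are arbitrary illustrative choices.

```python
# Criticality at initialization, illustrated: propagate a random input through
# a deep ReLU MLP and track the final activation norm for a critical
# initialization (variance 2/fan_in) versus an off-critical one.
import numpy as np

rng = np.random.default_rng(0)
width, depth = 512, 50
x = rng.standard_normal(width)

for label, sigma2 in [("critical (2/n)", 2.0 / width), ("off-critical (1/n)", 1.0 / width)]:
    h = x.copy()
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(sigma2)
        h = np.maximum(W @ h, 0.0)                 # ReLU layer
    print(f"{label}: final activation norm ~ {np.linalg.norm(h):.3e}")
```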
[Basic concepts] This book presents, in a very comprehensive way, the principles of deep learning theory, with a strong focus on practical application. Highly recommended for anyone who wants a deeper understanding of the topic. #DeepLearning https://t.co/uxrmP3ZUrC pic.twitter.com/p8IrYopyLV
— Underfox (@Underfox3) June 21, 2021
The Principles of Deep Learning Theory. (arXiv:2106.10165v1 [cs.LG]) https://t.co/uoajcL1mGF
— Stat.ML Papers (@StatMLPapers) June 21, 2021
10. HifiFace: 3D Shape and Semantic Prior Guided High Fidelity Face Swapping
Yuhan Wang, Xu Chen, Junwei Zhu, Wenqing Chu, Ying Tai, Chengjie Wang, Jilin Li, Yongjian Wu, Feiyue Huang, Rongrong Ji
In this work, we propose a high fidelity face swapping method, called HifiFace, which can well preserve the face shape of the source face and generate photo-realistic results. Unlike other existing face swapping works that only use a face recognition model to keep identity similarity, we propose a 3D shape-aware identity to control the face shape with geometric supervision from a 3DMM and a 3D face reconstruction method. Meanwhile, we introduce the Semantic Facial Fusion module to optimize the combination of encoder and decoder features and perform adaptive blending, which makes the results more photo-realistic. Extensive experiments on faces in the wild demonstrate that our method can better preserve identity, especially the face shape, and can generate more photo-realistic results than previous state-of-the-art methods.
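The 3D shape-aware identity supervision can be pictured as follows: a 3D face reconstruction network regresses 3DMM identity and expression coefficients for the source, target, and swapped faces, and the swapped face is pushed toward a hybrid face built from the source's identity and the target's expression. The sketch below uses a toy linear 3DMM and random coefficients purely for illustration; the basis sizes, loss, and coefficient split are assumptions, not HifiFace's exact formulation.

```python
# Sketch of 3D shape-aware identity supervision: build a hybrid 3DMM face from
# the source's identity coefficients and the target's expression coefficients,
# then penalize the distance between its landmarks and the landmarks
# reconstructed from the swapped face. The linear toy 3DMM is an illustrative
# stand-in for a real morphable model.
import torch

n_lmk, n_id, n_exp = 68, 80, 64
mean_shape = torch.randn(n_lmk * 3)
basis_id = torch.randn(n_lmk * 3, n_id) * 0.01
basis_exp = torch.randn(n_lmk * 3, n_exp) * 0.01

def landmarks(id_coeff, exp_coeff):
    # Toy linear 3DMM: landmarks = mean + B_id @ alpha_id + B_exp @ alpha_exp
    return (mean_shape + basis_id @ id_coeff + basis_exp @ exp_coeff).view(n_lmk, 3)

# Coefficients that a 3D face reconstruction network would regress per image.
src_id, src_exp = torch.randn(n_id), torch.randn(n_exp)
tgt_id, tgt_exp = torch.randn(n_id), torch.randn(n_exp)
swap_id = torch.randn(n_id, requires_grad=True)
swap_exp = torch.randn(n_exp, requires_grad=True)

hybrid = landmarks(src_id, tgt_exp)        # source identity + target expression
swapped = landmarks(swap_id, swap_exp)     # reconstructed from the swapped face
shape_loss = torch.nn.functional.l1_loss(swapped, hybrid)
shape_loss.backward()
print(float(shape_loss))
```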
HifiFace: 3D Shape and Semantic Prior Guided High Fidelity Face Swapping
— AK (@ak92501) June 21, 2021
pdf: https://t.co/mBtUdIZsyy
abs: https://t.co/HdM2nPqD8A
project page: https://t.co/fioZ9ewuHM pic.twitter.com/SsmjYlenvn