All Articles

Hot Papers 2021-04-08

1. Scaling Scaling Laws with Board Games

Andrew L. Jones

  • retweets: 1194, favorites: 176 (04/09/2021 09:17:45)
  • links: abs | pdf
  • cs.LG | cs.MA

The largest experiments in machine learning now require resources far beyond the budget of all but a few institutions. Fortunately, it has recently been shown that the results of these huge experiments can often be extrapolated from the results of a sequence of far smaller, cheaper experiments. In this work, we show that not only can the extrapolation be done based on the size of the model, but on the size of the problem as well. By conducting a sequence of experiments using AlphaZero and Hex, we show that the performance achievable with a fixed amount of compute degrades predictably as the game gets larger and harder. Along with our main result, we further show that increasing the test-time compute available to an agent can substitute for reduced train-time compute, and vice versa.
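The core extrapolation idea can be illustrated with a tiny curve-fitting sketch: fit a performance-vs-compute trend from cheap runs and extrapolate it to a larger budget. The functional form and the data points below are hypothetical, not taken from the paper.

```python
# Minimal sketch: fit Elo vs. log-compute on small runs, then extrapolate.
import numpy as np
from scipy.optimize import curve_fit

def perf_curve(log_compute, slope, intercept):
    # assume performance grows roughly linearly in log-compute before saturation
    return slope * log_compute + intercept

# hypothetical (compute, Elo) measurements from small, cheap experiments
log_compute = np.log10([1e14, 1e15, 1e16, 1e17])
elo = np.array([-500.0, -150.0, 200.0, 550.0])

params, _ = curve_fit(perf_curve, log_compute, elo)
print("predicted Elo at 1e18 FLOPs:", perf_curve(18.0, *params))
```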

2. Modern Hopfield Networks for Few- and Zero-Shot Reaction Prediction

Philipp Seidl, Philipp Renz, Natalia Dyubankova, Paulo Neves, Jonas Verhoeven, Jörg K. Wegner, Sepp Hochreiter, Günter Klambauer

An essential step in the discovery of new drugs and materials is the synthesis of a molecule that exists so far only as an idea to test its biological and physical properties. While computer-aided design of virtual molecules has made large progress, computer-assisted synthesis planning (CASP) to realize physical molecules is still in its infancy and lacks a performance level that would enable large-scale molecule discovery. CASP supports the search for multi-step synthesis routes, which is very challenging due to high branching factors in each synthesis step and the hidden rules that govern the reactions. The central and repeatedly applied step in CASP is reaction prediction, for which machine learning methods yield the best performance. We propose a novel reaction prediction approach that uses a deep learning architecture with modern Hopfield networks (MHNs) that is optimized by contrastive learning. An MHN is an associative memory that can store and retrieve chemical reactions in each layer of a deep learning architecture. We show that our MHN contrastive learning approach enables few- and zero-shot learning for reaction prediction which, in contrast to previous methods, can deal with rare, single, or even no training example(s) for a reaction. On a well-established benchmark, our MHN approach pushes the state-of-the-art performance up by a large margin as it improves the predictive top-100 accuracy from 0.858 ± 0.004 to 0.959 ± 0.004. This advance might pave the way to large-scale molecule discovery.
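The associative-memory layer the abstract refers to is the modern Hopfield network of Ramsauer et al. (2020), whose retrieval step is a softmax-weighted recall over stored patterns. Below is a minimal sketch of that step; the shapes, temperature, and toy "reaction embedding" data are illustrative, not the paper's model.

```python
# One modern-Hopfield retrieval update: recall stored patterns via softmax attention.
import numpy as np

def hopfield_retrieve(query, stored, beta=8.0):
    """query: (d,), stored patterns: (n, d); returns the retrieved pattern."""
    scores = beta * stored @ query            # similarity to each stored pattern
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                        # softmax over stored patterns
    return attn @ stored                      # weighted recall

rng = np.random.default_rng(0)
memory = rng.standard_normal((1000, 64))      # e.g. embedded reaction templates (toy data)
noisy_query = memory[42] + 0.3 * rng.standard_normal(64)
recalled = hopfield_retrieve(noisy_query, memory)
print("cosine to true pattern:",
      recalled @ memory[42] / (np.linalg.norm(recalled) * np.linalg.norm(memory[42])))
```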

3. Regularizing Generative Adversarial Networks under Limited Data

Hung-Yu Tseng, Lu Jiang, Ce Liu, Ming-Hsuan Yang, Weilong Yang

  • retweets: 930, favorites: 131 (04/09/2021 09:17:46)
  • links: abs | pdf
  • cs.LG | cs.CV

Recent years have witnessed the rapid progress of generative adversarial networks (GANs). However, the success of the GAN models hinges on a large amount of training data. This work proposes a regularization approach for training robust GAN models on limited data. We theoretically show a connection between the regularized loss and an f-divergence called LeCam-divergence, which we find is more robust under limited training data. Extensive experiments on several benchmark datasets demonstrate that the proposed regularization scheme 1) improves the generalization performance and stabilizes the learning dynamics of GAN models under limited training data, and 2) complements the recent data augmentation methods. These properties facilitate training GAN models to achieve state-of-the-art performance when only limited training data of the ImageNet benchmark is available.
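As a hedged sketch of the kind of regularizer described above: the discriminator's predictions are pulled toward exponential moving averages ("anchors") of its past real/fake predictions, which the paper connects to the LeCam divergence. The decay value and exact form here are assumptions to be checked against the paper, not the reference implementation.

```python
# Illustrative LeCam-style regularizer with EMA anchors on discriminator outputs.
import torch

class LeCamRegularizer:
    def __init__(self, decay=0.99):
        self.decay = decay
        self.ema_real = torch.tensor(0.0)
        self.ema_fake = torch.tensor(0.0)

    def __call__(self, d_real, d_fake):
        # update anchors with the current mean predictions (no gradient through them)
        self.ema_real = self.decay * self.ema_real + (1 - self.decay) * d_real.mean().detach()
        self.ema_fake = self.decay * self.ema_fake + (1 - self.decay) * d_fake.mean().detach()
        # penalize real predictions drifting above the fake anchor and vice versa
        return ((d_real - self.ema_fake).clamp(min=0) ** 2).mean() + \
               ((self.ema_real - d_fake).clamp(min=0) ** 2).mean()
```

The regularizer is added to the usual discriminator loss with a small weight during training.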

4. GrammarTagger: A Multilingual, Minimally-Supervised Grammar Profiler for Language Education

Masato Hagiwara, Joshua Tanner, Keisuke Sakaguchi

  • retweets: 460, favorites: 135 (04/09/2021 09:17:46)
  • links: abs | pdf
  • cs.CL

We present GrammarTagger, an open-source grammar profiler which, given an input text, identifies grammatical features useful for language education. The model architecture enables it to learn from a small amount of text annotated with spans and their labels, which 1) enables easier and more intuitive annotation, 2) supports overlapping spans, and 3) is less prone to error propagation, compared to complex hand-crafted rules defined on constituency/dependency parses. We show that we can bootstrap a grammar profiler model with F1 ≈ 0.6 from only a couple hundred sentences in both English and Chinese, which can be further boosted via learning a multilingual model. With GrammarTagger, we also build Octanove Learn, a search engine of language learning materials indexed by their reading difficulty and grammatical features. The code and pretrained models are publicly available at https://github.com/octanove/grammartagger.
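A hypothetical sketch of a span-based profiler head of the kind the abstract describes: score every candidate span of an encoded sentence against a set of grammar-feature labels. The encoder, span-width limit, and label count are assumptions, not GrammarTagger's actual architecture.

```python
# Toy span labeler: enumerate spans and classify each with a linear scorer.
import torch
import torch.nn as nn

class SpanLabeler(nn.Module):
    def __init__(self, hidden=768, num_labels=50, max_width=8):
        super().__init__()
        self.max_width = max_width
        self.scorer = nn.Linear(2 * hidden, num_labels)   # [start; end] features

    def forward(self, token_states):                      # (seq_len, hidden)
        seq_len = token_states.size(0)
        spans, reprs = [], []
        for i in range(seq_len):
            for j in range(i, min(i + self.max_width, seq_len)):
                spans.append((i, j))
                reprs.append(torch.cat([token_states[i], token_states[j]]))
        logits = self.scorer(torch.stack(reprs))          # (num_spans, num_labels)
        return spans, logits
```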

5. Efficient transfer learning for NLP with ELECTRA

François Mercier

Clark et al. [2020] claim that the ELECTRA approach is highly compute-efficient for NLP relative to its training budget. This reproducibility study focuses on that claim, summarized by the following question: can we use ELECTRA to achieve close to state-of-the-art NLP performance in low-resource settings, in terms of compute cost?
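ELECTRA's efficiency comes from its replaced-token-detection objective: a small generator proposes plausible replacements for masked tokens and the main model classifies every token as original or replaced, so all positions provide signal. The sketch below illustrates that objective; the module interfaces and the mask-token id are placeholders, not the authors' code.

```python
# Schematic replaced-token-detection training step (placeholder modules).
import torch
import torch.nn.functional as F

MASK_ID = 103  # hypothetical id of the [MASK] token

def electra_step(generator, discriminator, tokens, mask):
    # generator predicts the masked positions; sample replacements from it
    gen_logits = generator(tokens.masked_fill(mask, MASK_ID))      # (B, T, vocab)
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask, sampled, tokens)

    # discriminator labels each token: 1 if replaced, 0 if original
    replaced = (corrupted != tokens).float()
    disc_logits = discriminator(corrupted)                          # (B, T)
    return F.binary_cross_entropy_with_logits(disc_logits, replaced)
```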

6. Neural Articulated Radiance Field

Atsuhiro Noguchi, Xiao Sun, Stephen Lin, Tatsuya Harada

  • retweets: 143, favorites: 90 (04/09/2021 09:17:46)
  • links: abs | pdf
  • cs.CV

We present Neural Articulated Radiance Field (NARF), a novel deformable 3D representation for articulated objects learned from images. While recent advances in 3D implicit representation have made it possible to learn models of complex objects, learning pose-controllable representations of articulated objects remains a challenge, as current methods require 3D shape supervision and are unable to render appearance. In formulating an implicit representation of 3D articulated objects, our method considers only the rigid transformation of the most relevant object part in solving for the radiance field at each 3D location. In this way, the proposed method represents pose-dependent changes without significantly increasing the computational complexity. NARF is fully differentiable and can be trained from images with pose annotations. Moreover, through the use of an autoencoder, it can learn appearance variations over multiple instances of an object class. Experiments show that the proposed method is efficient and can generalize well to novel poses. We make the code, model and demo available for research purposes at https://github.com/nogu-atsu/NARF
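A rough sketch of the core idea as stated in the abstract: transform a query point into every part's local frame, estimate which part is most relevant, and evaluate the radiance network only with that part's rigidly transformed coordinates. All module names and shapes below are illustrative assumptions, not the released NARF code.

```python
# Per-part rigid transformation of a query point, with a relevance-weighted part choice.
import torch

def narf_query(x, part_rotations, part_translations, relevance_net, radiance_net):
    # x: (3,) world-space point; part_rotations: (P, 3, 3); part_translations: (P, 3)
    local = torch.einsum('pij,pj->pi', part_rotations, x - part_translations)   # (P, 3)
    weights = torch.softmax(relevance_net(local).squeeze(-1), dim=0)            # (P,)
    dominant = weights.argmax()
    return radiance_net(local[dominant])   # density and color at x, from the dominant part
```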

7. Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu

  • retweets: 129, favorites: 91 (04/09/2021 09:17:46)
  • links: abs | pdf
  • cs.CV

We study joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT) which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages. In this paper, we propose SOHO to “See Out of tHe bOx” that takes a whole image as input, and learns vision-language representation in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task Masked Visual Modeling (MVM). We conduct experiments on four well-established vision-language tasks by following standard VLPT settings. Specifically, SOHO achieves absolute gains of 2.0% R@1 score on the MSCOCO text retrieval 5k test split, 1.5% accuracy on the NLVR2 test-P split, and 6.7% accuracy on the SNLI-VE test split, respectively.
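A hedged sketch of an on-the-fly visual dictionary: each grid feature from the CNN is assigned to its nearest dictionary entry, and entries are refreshed with a moving average of the features assigned to them. The size, momentum, and update rule are assumptions, not SOHO's exact implementation.

```python
# Toy visual dictionary with nearest-entry assignment and moving-average updates.
import torch

class VisualDictionary:
    def __init__(self, num_entries=2048, dim=768, momentum=0.1):
        self.entries = torch.randn(num_entries, dim)
        self.momentum = momentum

    def assign_and_update(self, features):             # features: (N, dim) grid features
        dists = torch.cdist(features, self.entries)    # (N, num_entries)
        idx = dists.argmin(dim=1)
        for k in idx.unique():
            mean_feat = features[idx == k].mean(dim=0)
            self.entries[k] = (1 - self.momentum) * self.entries[k] + self.momentum * mean_feat
        return idx                                      # dictionary index per grid feature
```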

8. On Self-Contact and Human Pose

Lea Müller, Ahmed A. A. Osman, Siyu Tang, Chun-Hao P. Huang, Michael J. Black

  • retweets: 96, favorites: 121 (04/09/2021 09:17:47)
  • links: abs | pdf
  • cs.CV

People touch their faces 23 times an hour; they cross their arms and legs, put their hands on their hips, and so on. While many images of people contain some form of self-contact, current 3D human pose and shape (HPS) regression methods typically fail to estimate this contact. To address this, we develop new datasets and methods that significantly improve human pose estimation with self-contact. First, we create a dataset of 3D Contact Poses (3DCP) containing SMPL-X bodies fit to 3D scans as well as poses from AMASS, which we refine to ensure good contact. Second, we leverage this to create the Mimic-The-Pose (MTP) dataset of images, collected via Amazon Mechanical Turk, containing people mimicking the 3DCP poses with self-contact. Third, we develop a novel HPS optimization method, SMPLify-XMC, that includes contact constraints and uses the known 3DCP body pose during fitting to create near ground-truth poses for MTP images. Fourth, for more image variety, we label a dataset of in-the-wild images with Discrete Self-Contact (DSC) information and use another new optimization method, SMPLify-DC, that exploits discrete contacts during pose optimization. Finally, we use our datasets during SPIN training to learn a new 3D human pose regressor, called TUCH (Towards Understanding Contact in Humans). We show that the new self-contact training data significantly improves 3D human pose estimates on withheld test data and existing datasets like 3DPW. Not only does our method improve results for self-contact poses, but it also improves accuracy for non-contact poses. The code and data are available for research purposes at https://tuch.is.tue.mpg.de.
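An illustrative sketch of the kind of contact term such an optimization might add to a body-fitting objective: pull vertex pairs that should be touching toward zero distance. The pairing and weighting below are hypothetical, not the SMPLify-XMC / SMPLify-DC formulation.

```python
# Toy contact penalty over vertex pairs marked as in contact.
import torch

def contact_loss(vertices, contact_pairs, weight=1.0):
    # vertices: (V, 3) posed SMPL-X vertices; contact_pairs: (C, 2) vertex indices
    a = vertices[contact_pairs[:, 0]]
    b = vertices[contact_pairs[:, 1]]
    return weight * ((a - b).norm(dim=-1) ** 2).mean()
```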

9. NeuMIP: Multi-Resolution Neural Materials

Alexandr Kuznetsov, Krishna Mullia, Zexiang Xu, Miloš Hašan, Ravi Ramamoorthi

We propose NeuMIP, a neural method for representing and rendering a variety of material appearances at different scales. Classical prefiltering (mipmapping) methods work well on simple material properties such as diffuse color, but fail to generalize to normals, self-shadowing, fibers or more complex microstructures and reflectances. In this work, we generalize traditional mipmap pyramids to pyramids of neural textures, combined with a fully connected network. We also introduce neural offsets, a novel method which allows rendering materials with intricate parallax effects without any tessellation. This generalizes classical parallax mapping, but is trained without supervision by any explicit heightfield. Neural materials within our system support a 7-dimensional query, including position, incoming and outgoing direction, and the desired filter kernel size. The materials have small storage (on the order of standard mipmapping except with more texture channels), and can be integrated within common Monte-Carlo path tracing systems. We demonstrate our method on a variety of materials, resulting in complex appearance across levels of detail, with accurate parallax, self-shadowing, and other effects.
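A minimal sketch of the query path described above: pick the pyramid level that matches the requested filter size, look up the neural texture there, and decode with a small network conditioned on the incoming/outgoing directions. Resolutions, channel counts, and the level-selection scheme are illustrative assumptions, not the paper's implementation.

```python
# Query a neural texture pyramid at a level chosen from the filter kernel size.
import torch
import torch.nn.functional as F

def query_neural_material(pyramid, decoder, uv, wi, wo, kernel_size):
    # pyramid: list of (1, C, H_l, W_l) neural textures, fine to coarse
    level = min(int(torch.log2(torch.tensor(float(kernel_size))).clamp(min=0)), len(pyramid) - 1)
    grid = uv.view(1, 1, 1, 2) * 2 - 1                        # map uv in [0,1] to [-1,1]
    feat = F.grid_sample(pyramid[level], grid, align_corners=True).view(-1)
    return decoder(torch.cat([feat, wi, wo]))                 # reflectance for this 7-D query
```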

10. Visual Vibration Tomography: Estimating Interior Material Properties from Monocular Video

Berthy Feng, Alexander C. Ogren, Chiara Daraio, Katherine L. Bouman

An object’s interior material properties, while invisible to the human eye, determine motion observed on its surface. We propose an approach that estimates heterogeneous material properties of an object directly from a monocular video of its surface vibrations. Specifically, we estimate Young’s modulus and density throughout a 3D object with known geometry. Knowledge of how these values change across the object is useful for characterizing defects and simulating how the object will interact with different environments. Traditional non-destructive testing approaches, which generally estimate homogenized material properties or the presence of defects, are expensive and use specialized instruments. We propose an approach that leverages monocular video to (1) measure an object’s sub-pixel motion and decompose this motion into image-space modes, and (2) directly infer spatially-varying Young’s modulus and density values from the observed image-space modes. On both simulated and real videos, we demonstrate that our approach is able to image material properties simply by analyzing surface motion. In particular, our method allows us to identify unseen defects on a 2D drum head from real, high-speed video.
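A rough sketch of one way step (1) could look: take per-pixel displacement over time, transform along the time axis, and keep the spatial patterns at the strongest frequencies as image-space modes. The sub-pixel motion extraction itself is not shown; this is an assumption-laden illustration, not the paper's pipeline.

```python
# Toy extraction of image-space modes from a per-pixel displacement video.
import numpy as np

def image_space_modes(displacement, num_modes=5):
    # displacement: (T, H, W) per-pixel motion over T frames
    spectrum = np.fft.rfft(displacement, axis=0)            # (T//2+1, H, W)
    power = np.abs(spectrum).mean(axis=(1, 2))              # average spectral power per bin
    peak_bins = np.argsort(power)[-num_modes:]              # strongest frequencies
    return [np.abs(spectrum[k]) for k in peak_bins]         # one spatial mode shape per peak
```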

11. MultiScene: A Large-scale Dataset and Benchmark for Multi-scene Recognition in Single Aerial Images

Yuansheng Hua, Lichao Mou, Pu Jin, Xiao Xiang Zhu

  • retweets: 72, favorites: 38 (04/09/2021 09:17:47)
  • links: abs | pdf
  • cs.CV

Aerial scene recognition is a fundamental research problem in interpreting high-resolution aerial imagery. Over the past few years, most studies have focused on classifying an image into a single scene category, whereas in real-world scenarios a single image often contains multiple scenes. Therefore, in this paper, we investigate a more practical yet underexplored task — multi-scene recognition in single images. To this end, we create a large-scale dataset, called MultiScene, composed of 100,000 unconstrained high-resolution aerial images. Considering that manually labeling such images is extremely arduous, we resort to low-cost annotations from crowdsourcing platforms, e.g., OpenStreetMap (OSM). However, OSM data might suffer from incompleteness and incorrectness, which introduce noise into image labels. To address this issue, we visually inspect 14,000 images and correct their scene labels, yielding a subset of cleanly-annotated images, named MultiScene-Clean. With it, we can develop and evaluate deep networks for multi-scene recognition using clean data. Moreover, we provide crowdsourced annotations of all images for the purpose of studying network learning with noisy labels. We conduct experiments with extensive baseline models on both MultiScene-Clean and MultiScene to offer benchmarks for multi-scene recognition in single images and learning from noisy labels for this task, respectively. To facilitate progress, we will make our dataset and pre-trained models available.

12. SCANimate: Weakly Supervised Learning of Skinned Clothed Avatar Networks

Shunsuke Saito, Jinlong Yang, Qianli Ma, Michael J. Black

  • retweets: 56, favorites: 54 (04/09/2021 09:17:47)
  • links: abs | pdf
  • cs.CV

We present SCANimate, an end-to-end trainable framework that takes raw 3D scans of a clothed human and turns them into an animatable avatar. These avatars are driven by pose parameters and have realistic clothing that moves and deforms naturally. SCANimate does not rely on a customized mesh template or surface mesh registration. We observe that fitting a parametric 3D body model, like SMPL, to a clothed human scan is tractable while surface registration of the body topology to the scan is often not, because clothing can deviate significantly from the body shape. We also observe that articulated transformations are invertible, resulting in geometric cycle consistency in the posed and unposed shapes. These observations lead us to a weakly supervised learning method that aligns scans into a canonical pose by disentangling articulated deformations without template-based surface registration. Furthermore, to complete missing regions in the aligned scans while modeling pose-dependent deformations, we introduce a locally pose-aware implicit function that learns to complete and model geometry with learned pose correctives. In contrast to commonly used global pose embeddings, our local pose conditioning significantly reduces long-range spurious correlations and improves generalization to unseen poses, especially when training data is limited. Our method can be applied to pose-aware appearance modeling to generate a fully textured avatar. We demonstrate our approach on various clothing types with different amounts of training data, outperforming existing solutions and other variants in terms of fidelity and generality in every setting. The code is available at https://scanimate.is.tue.mpg.de.
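A hedged sketch of the cycle-consistency idea mentioned above: warp a posed point to canonical space with predicted skinning weights and the inverse bone transforms, warp it back with the forward transforms, and penalize the discrepancy. The transform parameterization and weight prediction are placeholders, not SCANimate's implementation.

```python
# Toy cycle-consistency loss for articulated (linear-blend-skinned) deformations.
import torch

def cycle_loss(points_posed, bone_transforms, skin_net):
    # points_posed: (N, 3); bone_transforms: (B, 4, 4) per-bone rigid transforms
    homog = torch.cat([points_posed, torch.ones(len(points_posed), 1)], dim=-1)   # (N, 4)
    w = torch.softmax(skin_net(points_posed), dim=-1)                             # (N, B) skinning weights
    inv = torch.inverse(bone_transforms)                                          # (B, 4, 4)
    canonical = torch.einsum('nb,bij,nj->ni', w, inv, homog)                      # unpose
    reposed = torch.einsum('nb,bij,nj->ni', w, bone_transforms, canonical)        # repose
    return ((reposed - homog) ** 2).mean()
```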

13. Creativity and Machine Learning: a Survey

Giorgio Franceschelli, Mirco Musolesi

There is a growing interest in the area of machine learning and creativity. This survey presents an overview of the history and the state of the art of computational creativity theories, machine learning techniques, including generative deep learning, and corresponding automatic evaluation methods. After presenting a critical discussion of the key contributions in this area, we outline the current research challenges and emerging opportunities in this field.

14. The Power of Subsampling in Submodular Maximization

Christopher Harshaw, Ehsan Kazemi, Moran Feldman, Amin Karbasi

We propose subsampling as a unified algorithmic technique for submodular maximization in centralized and online settings. The idea is simple: independently sample elements from the ground set, and use simple combinatorial techniques (such as greedy or local search) on these sampled elements. We show that this approach leads to optimal/state-of-the-art results despite being much simpler than existing methods. In the usual offline setting, we present SampleGreedy, which obtains a (p + 2 + o(1))-approximation for maximizing a submodular function subject to a p-extendible system using O(n + nk/p) evaluation and feasibility queries, where k is the size of the largest feasible set. The approximation ratio improves to p + 1 and p for monotone submodular and linear objectives, respectively. In the streaming setting, we present SampleStreaming, which obtains a (4p + 2 - o(1))-approximation for maximizing a submodular function subject to a p-matchoid using O(k) memory and O(km/p) evaluation and feasibility queries per element, where m is the number of matroids defining the p-matchoid. The approximation ratio improves to 4p for monotone submodular objectives. We empirically demonstrate the effectiveness of our algorithms on video summarization, location summarization, and movie recommendation tasks.
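A minimal sketch of SampleGreedy as described above: sample each ground-set element independently, then run plain greedy on the sample subject to the feasibility oracle. The objective, oracle, and sampling probability below are placeholders (the paper specifies the rate precisely; 1/(p+1) here is just one natural choice for illustration).

```python
# Schematic SampleGreedy: subsample the ground set, then greedy on the sample.
import random

def sample_greedy(ground_set, f, is_feasible, p):
    # keep each element independently with some probability depending on p
    sample = [e for e in ground_set if random.random() < 1.0 / (p + 1)]
    solution = []
    while True:
        # greedily add the feasible element with the largest marginal gain
        best, best_gain = None, 0.0
        for e in sample:
            if e not in solution and is_feasible(solution + [e]):
                gain = f(solution + [e]) - f(solution)
                if gain > best_gain:
                    best, best_gain = e, gain
        if best is None:
            return solution
        solution.append(best)
```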

15. Self-supervised Learning of Depth Inference for Multi-view Stereo

Jiayu Yang, Jose M. Alvarez, Miaomiao Liu

  • retweets: 37, favorites: 37 (04/09/2021 09:17:48)
  • links: abs | pdf
  • cs.CV

Recent supervised multi-view depth estimation networks have achieved promising results. Similar to all supervised approaches, these networks require ground-truth data during training. However, collecting a large amount of multi-view depth data is very challenging. Here, we propose a self-supervised learning framework for multi-view stereo that exploits pseudo labels derived from the input data. We start by learning to estimate depth maps as initial pseudo labels under an unsupervised learning framework relying on image reconstruction loss as supervision. We then refine the initial pseudo labels using a carefully designed pipeline leveraging depth information inferred from higher resolution images and neighboring views. We use these high-quality pseudo labels as the supervision signal to train the network and improve, iteratively, its performance by self-training. Extensive experiments on the DTU dataset show that our proposed self-supervised learning framework outperforms existing unsupervised multi-view stereo networks by a large margin and performs on par with its supervised counterpart. Code is available at https://github.com/JiayuYANG/Self-supervised-CVP-MVSNet.
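A schematic sketch of the self-training loop described above: train on the current pseudo depth labels, refine them with the improved network, and repeat. The refinement step (multi-scale and cross-view checks) is abstracted into a single placeholder function; the interfaces are assumptions, not the released code.

```python
# Iterative self-training with pseudo depth labels (placeholder interfaces).
def self_train(network, unlabeled_views, pseudo_labels, refine_fn, rounds=3):
    for _ in range(rounds):
        network.fit(unlabeled_views, pseudo_labels)               # supervised on pseudo labels
        predictions = network.predict(unlabeled_views)
        pseudo_labels = refine_fn(predictions, unlabeled_views)   # e.g. cross-view consistency filtering
    return network, pseudo_labels
```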