Hot Papers 2021-06-22

1. GRAND: Graph Neural Diffusion

Benjamin Paul Chamberlain, James Rowbottom, Maria Gorinova, Stefan Webb, Emanuele Rossi, Michael M. Bronstein

We present Graph Neural Diffusion (GRAND), which approaches deep learning on graphs as a continuous diffusion process and treats Graph Neural Networks (GNNs) as discretisations of an underlying PDE. In our model, the layer structure and topology correspond to the discretisation choices of temporal and spatial operators. Our approach allows a principled development of a broad new class of GNNs that are able to address the common plights of graph learning models such as depth, oversmoothing, and bottlenecks. Key to the success of our models is stability with respect to perturbations in the data, which we address for both implicit and explicit discretisation schemes. We develop linear and nonlinear versions of GRAND, which achieve competitive results on many standard graph benchmarks.
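
As a concrete illustration of the PDE view, here is a minimal numpy sketch of the explicit (forward-Euler) discretisation, in which each integration step plays the role of a GNN layer; the uniform neighbour attention is a stand-in for the learned attention function the paper uses.

```python
import numpy as np

def diffusion_step(x, attn, tau=0.1):
    """One explicit-Euler step of graph diffusion:
    dx/dt = (A(x) - I) x, discretised as x <- x + tau * (A - I) x.
    `attn` is a row-stochastic attention/adjacency matrix."""
    return x + tau * (attn @ x - x)

# Toy example: 4 nodes, 2 features, uniform attention over neighbours.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
attn = adj / adj.sum(axis=1, keepdims=True)
x = np.random.randn(4, 2)
for _ in range(10):          # "depth" = number of integration steps
    x = diffusion_step(x, attn)
```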

2. Regularization is all you Need: Simple Neural Nets can Excel on Tabular Data

Arlind Kadra, Marius Lindauer, Frank Hutter, Josif Grabocka

  • retweets: 6604, favorites: 68 (06/23/2021 10:32:04)
  • links: abs | pdf
  • cs.LG

Tabular datasets are the last “unconquered castle” for deep learning, with traditional ML methods like Gradient-Boosted Decision Trees still performing strongly even against recent specialized neural architectures. In this paper, we hypothesize that the key to boosting the performance of neural networks lies in rethinking the joint and simultaneous application of a large set of modern regularization techniques. As a result, we propose regularizing plain Multilayer Perceptron (MLP) networks by searching for the optimal combination/cocktail of 13 regularization techniques for each dataset using a joint optimization over the decision on which regularizers to apply and their subsidiary hyperparameters. We empirically assess the impact of these regularization cocktails for MLPs on a large-scale empirical study comprising 40 tabular datasets and demonstrate that (i) well-regularized plain MLPs significantly outperform recent state-of-the-art specialized neural network architectures, and (ii) they even outperform strong traditional ML methods, such as XGBoost.
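
To make the idea concrete, here is a hypothetical sketch of such a search space in Python; the regularizer names, ranges, and the random-sampling stand-in for the paper's hyperparameter optimizer are all illustrative.

```python
import random

# Hypothetical "regularization cocktail" search space: each regularizer
# has an on/off decision plus subsidiary hyperparameters.
SEARCH_SPACE = {
    "weight_decay":    lambda: {"lam": 10 ** random.uniform(-6, -2)},
    "dropout":         lambda: {"p": random.uniform(0.0, 0.8)},
    "mixup":           lambda: {"alpha": random.uniform(0.1, 1.0)},
    "cutmix":          lambda: {"alpha": random.uniform(0.1, 1.0)},
    "label_smoothing": lambda: {"eps": random.uniform(0.0, 0.2)},
    "lookahead":       lambda: {"k": random.choice([5, 10])},
}

def sample_cocktail():
    """Sample one cocktail: which regularizers to apply and their settings."""
    return {name: cfg() for name, cfg in SEARCH_SPACE.items()
            if random.random() < 0.5}

cocktail = sample_cocktail()   # pass to the MLP training loop and evaluate
```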

3. Lossy Compression for Lossless Prediction

Yann Dubois, Benjamin Bloem-Reddy, Karen Ullrich, Chris J. Maddison

Most data is automatically collected and only ever “seen” by algorithms. Yet, data compressors preserve perceptual fidelity rather than just the information needed by algorithms performing downstream tasks. In this paper, we characterize the bit-rate required to ensure high performance on all predictive tasks that are invariant under a set of transformations, such as data augmentations. Based on our theory, we design unsupervised objectives for training neural compressors. Using these objectives, we train a generic image compressor that achieves substantial rate savings (more than 1000× on ImageNet) compared to JPEG on 8 datasets, without decreasing downstream classification performance.

4. Improving security for users of decentralized exchanges through multiparty computation

Robert Annessi, Ethan Fast

  • retweets: 2466, favorites: 299 (06/23/2021 10:32:05)
  • links: abs | pdf
  • cs.CR

Decentralized cryptocurrency exchanges offer compelling security benefits over centralized exchanges: users control their funds and avoid the risk of an exchange hack or malicious operator. However, because user assets are fully accessible by a secret key, decentralized exchanges pose significant internal security risks for trading firms and automated trading systems, where a compromised system can result in total loss of funds. Centralized exchanges mitigate this risk through API key based security policies that allow professional users to give individual traders or automated systems specific and customizable access rights such as trading or withdrawal limits. Such policies, however, are not compatible with decentralized exchanges, where all exchange operations require a signature generated by the owner’s secret key. This paper introduces a protocol based upon multiparty computation that allows for the creation of API keys and security policies that can be applied to any existing decentralized exchange. Our protocol works with both ECDSA and EdDSA signature schemes and prioritizes efficient computation and communication. We have deployed this protocol on Nash exchange, as well as around several Ethereum-based automated market maker smart contracts, where it secures the trading accounts and wallets of thousands of users.
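
A hypothetical sketch of the policy layer is shown below (field names and the Policy shape are invented); the actual multiparty signing, in which the key is secret-shared so that neither party can sign alone, is omitted.

```python
from dataclasses import dataclass

@dataclass
class Policy:
    """An API-key-style security policy enforced by the co-signing party."""
    max_trade: float
    allow_withdrawals: bool

def cosigner_approves(policy, op):
    """The co-signer contributes its key share toward the final signature
    only if the requested operation satisfies the policy, so a compromised
    trading system alone cannot drain the account."""
    if op["type"] == "trade":
        return op["amount"] <= policy.max_trade
    if op["type"] == "withdraw":
        return policy.allow_withdrawals
    return False

policy = Policy(max_trade=1_000.0, allow_withdrawals=False)
assert cosigner_approves(policy, {"type": "trade", "amount": 250.0})
assert not cosigner_approves(policy, {"type": "withdraw", "amount": 1.0})
```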

5. Calliar: An Online Handwritten Dataset for Arabic Calligraphy

Zaid Alyafeai, Maged S. Al-shaibani, Mustafa Ghaleb, Yousif Ahmed Al-Wajih

  • retweets: 1295, favorites: 157 (06/23/2021 10:32:05)
  • links: abs | pdf
  • cs.CL

Calligraphy is an essential part of the Arabic heritage and culture. It has been used in the past for the decoration of houses and mosques. Usually, such calligraphy is designed manually by experts with aesthetic insights. In the past few years, there has been a considerable effort to digitize this type of art, either by photographing decorated buildings or by drawing calligraphy on digital devices. The latter is considered an online form, where the drawing is captured by recording the movement of the apparatus (an electronic pen, for instance) on a screen. The literature offers many offline datasets covering a diversity of Arabic calligraphy styles; however, no online dataset for Arabic calligraphy has been available. In this paper, we illustrate our approach for the collection and annotation of an online dataset for Arabic calligraphy called Calliar, which consists of 2,500 sentences. Calliar is annotated for stroke-, character-, word- and sentence-level prediction.

6. VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

Hao Tan, Jie Lei, Thomas Wolf, Mohit Bansal

Video understanding relies on perceiving the global content and modeling its internal connections (e.g., causality, movement, and spatio-temporal correspondence). To learn these interactions, we apply a mask-then-predict pre-training task on discretized video tokens generated via VQ-VAE. Unlike language, where the text tokens are more independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar), and hence uniformly masking individual tokens will make the task too trivial to learn useful representations. To deal with this issue, we propose a block-wise masking strategy where we mask neighboring video tokens in both spatial and temporal domains. We also add an augmentation-free contrastive learning method to further capture the global content by predicting whether the video clips are sampled from the same video. We pre-train our model on uncurated videos and show that our pre-trained model can reach state-of-the-art results on several video understanding datasets (e.g., SSV2, Diving48). Lastly, we provide detailed analyses on model scalability and pre-training method design. Code is released at https://github.com/airsplay/vimpac.
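
A minimal sketch of block-wise masking over a (T, H, W) grid of discretized video tokens might look as follows; the block count and size caps are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def blockwise_mask(T, H, W, n_blocks=3, rng=np.random.default_rng(0)):
    """Mask contiguous spatio-temporal blocks of video tokens, so the model
    cannot trivially copy a masked token from its highly correlated
    neighbours. Returns a boolean (T, H, W) array where True = masked."""
    mask = np.zeros((T, H, W), dtype=bool)
    for _ in range(n_blocks):
        # random block size (capped at half of each extent), random corner
        t = rng.integers(1, max(2, T // 2 + 1))
        h = rng.integers(1, max(2, H // 2 + 1))
        w = rng.integers(1, max(2, W // 2 + 1))
        t0 = rng.integers(0, T - t + 1)
        h0 = rng.integers(0, H - h + 1)
        w0 = rng.integers(0, W - w + 1)
        mask[t0:t0 + t, h0:h0 + h, w0:w0 + w] = True
    return mask

mask = blockwise_mask(T=4, H=8, W=8)   # VQ-VAE token grid of one clip
```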

7. Nested Variational Inference

Heiko Zimmermann, Hao Wu, Babak Esmaeili, Jan-Willem van de Meent

We develop nested variational inference (NVI), a family of methods that learn proposals for nested importance samplers by minimizing a forward or reverse KL divergence at each level of nesting. NVI is applicable to many commonly used importance sampling strategies and provides a mechanism for learning intermediate densities, which can serve as heuristics to guide the sampler. Our experiments apply NVI to (a) sampling from a multimodal distribution using a learned annealing path, (b) learning heuristics that approximate the likelihood of future observations in a hidden Markov model, and (c) performing amortized inference in hierarchical deep generative models. We observe that optimizing nested objectives leads to improved sample quality in terms of log average weight and effective sample size.

8. Boundary Graph Neural Networks for 3D Simulations

Andreas Mayr, Sebastian Lehner, Arno Mayrhofer, Christoph Kloss, Sepp Hochreiter, Johannes Brandstetter

The abundance of data has given machine learning huge momentum in natural sciences and engineering. However, the modeling of simulated physical processes remains difficult. A key problem is the correct handling of geometric boundaries. While triangularized geometric boundaries are very common in engineering applications, they are notoriously difficult to model with machine learning approaches due to their heterogeneity with respect to size and orientation. In this work, we introduce Boundary Graph Neural Networks (BGNNs), which dynamically modify graph structures to address boundary conditions. Boundary graph structures are constructed by modifying edges, augmenting node features, and dynamically inserting virtual nodes. The new BGNNs are tested on complex 3D granular flow processes in hoppers and rotating drums, which are standard parts of industrial machinery. Using precise simulations obtained by an expensive and complex discrete element method, BGNNs are evaluated in terms of computational efficiency as well as prediction accuracy of particle flows and mixing entropies. Even when complex boundaries are present, BGNNs are able to accurately reproduce 3D granular flows within simulation uncertainties over hundreds of thousands of simulation timesteps, and most notably the particles stay completely within the geometric objects without any handcrafted conditions or restrictions.
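
As a toy illustration of the virtual-node mechanism, the sketch below assumes a flat boundary plane z = 0; the paper handles full triangularized geometries and additionally modifies edges and node features.

```python
import numpy as np

def add_virtual_boundary_nodes(pos, cutoff=0.5):
    """For every particle within `cutoff` of the boundary, insert a virtual
    node at its closest boundary point and an edge connecting the two, so
    the GNN can exchange messages with the boundary."""
    near = np.where(pos[:, 2] < cutoff)[0]
    virtual = pos[near].copy()
    virtual[:, 2] = 0.0                      # project onto the plane z = 0
    nodes = np.vstack([pos, virtual])
    edges = [(i, len(pos) + k) for k, i in enumerate(near)]
    return nodes, edges

pos = np.random.rand(100, 3) * 2.0           # particle positions
nodes, edges = add_virtual_boundary_nodes(pos)
```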

9. Say Their Names: Resurgence in the collective attention toward Black victims of fatal police violence following the death of George Floyd

Henry H. Wu, Ryan J. Gallagher, Thayer Alshaabi, Jane L. Adams, Joshua R. Minot, Michael V. Arnold, Brooke Foucault Welles, Randall Harp, Peter Sheridan Dodds, Christopher M. Danforth

The murder of George Floyd by police in May 2020 sparked international protests and renewed attention to the Black Lives Matter movement. Here, we characterize ways in which the online activity following George Floyd’s death was unparalleled in its volume and intensity: it set records for activity on Twitter, prompted the saddest day in the platform’s history, and caused George Floyd’s name to appear among the ten most frequently used phrases in a day, making him the only individual ever to receive that level of attention who was not known to the public earlier that same week. Further, we find this attention extended beyond George Floyd, and that more Black victims of fatal police violence received attention following his death than during other past moments in Black Lives Matter’s history. We place that attention within the context of prior online racial justice activism by showing how the names of Black victims of police violence have been lifted and memorialized over the last 12 years on Twitter. Our results suggest that the 2020 wave of attention to the Black Lives Matter movement centered past instances of police violence in an unprecedented way, demonstrating the impact of the movement’s rhetorical strategy to “say their names.”

10. Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling

Hongyu Gong, Yun Tang, Juan Pino, Xian Li

  • retweets: 196, favorites: 61 (06/23/2021 10:32:06)
  • links: abs | pdf
  • cs.CL | cs.AI

Multi-head attention allows each attention head to collect salient information from a different part of an input sequence, making it a powerful mechanism for sequence modeling. Multilingual and multi-domain learning are common scenarios for sequence modeling, where the key challenge is to maximize positive transfer and mitigate negative transfer across languages and domains. In this paper, we find that non-selective attention sharing is sub-optimal for achieving good generalization across all languages and domains. We further propose attention sharing strategies to facilitate parameter sharing and specialization in multilingual and multi-domain sequence modeling. Our approach automatically learns shared and specialized attention heads for different languages and domains to mitigate their interference. Evaluated on various tasks including speech recognition, text-to-text translation, and speech-to-text translation, the proposed attention sharing strategies consistently bring gains to sequence models built upon multi-head attention. For speech-to-text translation, our approach yields an average of +2.0 BLEU over 13 language directions in the multilingual setting and +2.0 BLEU over 3 domains in the multi-domain setting.
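
One simple way to realise head selection is a learnable per-language gate on each head's output, sketched below; this gating variant is an assumption for illustration and may differ from the paper's exact selection strategy.

```python
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    """Minimal sketch of head selection: each (language, head) pair gets a
    learnable gate that scales that head's output, so heads can specialise
    to a language or be shared across languages."""
    def __init__(self, n_langs, n_heads, d_head):
        super().__init__()
        self.gates = nn.Parameter(torch.ones(n_langs, n_heads))
        self.proj = nn.Linear(n_heads * d_head, n_heads * d_head)

    def forward(self, head_outputs, lang_id):
        # head_outputs: (batch, seq, n_heads, d_head)
        g = torch.sigmoid(self.gates[lang_id])          # (n_heads,)
        gated = head_outputs * g.view(1, 1, -1, 1)
        b, s, h, d = gated.shape
        return self.proj(gated.reshape(b, s, h * d))

out = GatedHeads(n_langs=13, n_heads=8, d_head=64)(
    torch.randn(2, 10, 8, 64), lang_id=3)
```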

11. Multiplying Matrices Without Multiplying

Davis Blalock, John Guttag

Multiplying matrices is among the most fundamental and compute-intensive operations in machine learning. Consequently, there has been significant work on efficiently approximating matrix multiplies. We introduce a learning-based algorithm for this task that greatly outperforms existing methods. Experiments using hundreds of matrices from diverse domains show that it often runs 100× faster than exact matrix products and 10× faster than current approximate methods. In the common case that one matrix is known ahead of time, our method also has the interesting property that it requires zero multiply-adds. These results suggest that a mixture of hashing, averaging, and byte shuffling (the core operations of our method) could be a more promising building block for machine learning than the sparsified, factorized, and/or scalar quantized matrix products that have recently been the focus of substantial research and hardware investment.
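
To convey the flavour of the approach, here is a product-quantisation stand-in (k-means in place of the paper's fast learned hash functions): prototypes' products with the known matrix are precomputed into lookup tables, and inference reduces to table lookups and sums.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for the paper's hashing

def train_pq(A_train, B, n_subspaces=4, k=16):
    """Split columns of A into subspaces, learn k prototypes per subspace,
    and precompute each prototype's dot product with B (known ahead of
    time) as a lookup table."""
    sub = np.array_split(np.arange(A_train.shape[1]), n_subspaces)
    codebooks, luts = [], []
    for idx in sub:
        km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(A_train[:, idx])
        codebooks.append((idx, km))
        luts.append(km.cluster_centers_ @ B[idx])   # (k, B.shape[1])
    return codebooks, luts

def approx_matmul(A, codebooks, luts):
    """Encode rows of A (one cluster id per subspace), then sum table
    lookups instead of performing multiply-adds at query time."""
    out = np.zeros((A.shape[0], luts[0].shape[1]))
    for (idx, km), lut in zip(codebooks, luts):
        out += lut[km.predict(A[:, idx])]
    return out

A = np.random.randn(256, 32); B = np.random.randn(32, 8)
cb, luts = train_pq(A, B)
err = np.linalg.norm(approx_matmul(A, cb, luts) - A @ B) / np.linalg.norm(A @ B)
```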

12. CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen

  • retweets: 143, favorites: 45 (06/23/2021 10:32:06)
  • links: abs | pdf
  • cs.CV

We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in the domain of video-and-language learning try to distill spatio-temporal video features and the multi-modal interaction between videos and language from a large-scale video-text dataset. In contrast, we leverage a pretrained image-language model and simplify the problem into a two-stage framework, co-learning of images and text followed by enhancing the temporal relations between video frames and between video and text, which makes it possible to train on comparatively small datasets. Specifically, based on the spatial semantics captured by the Contrastive Language-Image Pretraining (CLIP) model, our model involves a Temporal Difference Block to capture motion at fine temporal resolution, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies, and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.
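
A rough sketch of the temporal-difference idea, interleaving difference tokens between frame features so the transformer sees explicit motion cues, might look like this (the paper's exact block design differs):

```python
import torch

def temporal_difference_tokens(frame_feats):
    """Insert tokens encoding the difference between adjacent frame
    features, interleaved with the frame tokens themselves."""
    diffs = frame_feats[:, 1:] - frame_feats[:, :-1]   # (B, T-1, D)
    B, Tm1, D = diffs.shape
    out = torch.empty(B, 2 * Tm1 + 1, D)
    out[:, 0::2] = frame_feats                          # frame tokens
    out[:, 1::2] = diffs                                # motion tokens
    return out

feats = torch.randn(2, 8, 512)               # e.g. CLIP features, 8 frames
tokens = temporal_difference_tokens(feats)   # (2, 15, 512)
```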

13. One Million Scenes for Autonomous Driving: ONCE Dataset

Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, Jie Yu, Hang Xu, Chunjing Xu

  • retweets: 144, favorites: 43 (06/23/2021 10:32:07)
  • links: abs | pdf
  • cs.CV

Current perception models in autonomous driving rely heavily on large volumes of annotated data to cover unseen cases and address the long-tail problem. On the other hand, learning from large-scale unlabeled data and incrementally self-training powerful recognition models has received increasing attention and may underpin the next generation of industry-grade, robust perception models in autonomous driving. However, the research community has generally suffered from a shortage of such essential real-world scene data, which hampers future exploration of fully/semi/self-supervised methods for 3D perception. In this paper, we introduce the ONCE (One millioN sCenEs) dataset for 3D object detection in the autonomous driving scenario. The ONCE dataset consists of 1 million LiDAR scenes and 7 million corresponding camera images. The data is selected from 144 driving hours, 20x longer than the largest available 3D autonomous driving datasets (e.g. nuScenes and Waymo), and is collected across a range of areas, periods and weather conditions. To facilitate future research on exploiting unlabeled data for 3D detection, we additionally provide a benchmark in which we reproduce and evaluate a variety of self-supervised and semi-supervised methods on the ONCE dataset. We conduct extensive analyses of these methods and provide valuable observations on their performance relative to the scale of the data used. Data, code, and more information are available at https://once-for-auto-driving.github.io/index.html.

14. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction

Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, Wenping Wang

  • retweets: 90, favorites: 60 (06/23/2021 10:32:07)
  • links: abs | pdf
  • cs.CV | cs.GR

We present a novel neural surface reconstruction method, called NeuS, for reconstructing objects and scenes with high fidelity from 2D image inputs. Existing neural surface reconstruction approaches, such as DVR and IDR, require foreground masks as supervision, easily get trapped in local minima, and therefore struggle with the reconstruction of objects with severe self-occlusion or thin structures. Meanwhile, recent neural methods for novel view synthesis, such as NeRF and its variants, use volume rendering to produce a neural scene representation that optimizes robustly, even for highly complex objects. However, extracting high-quality surfaces from this learned implicit representation is difficult because the representation lacks sufficient surface constraints. In NeuS, we propose to represent a surface as the zero-level set of a signed distance function (SDF) and develop a new volume rendering method to train a neural SDF representation. We observe that the conventional volume rendering method causes inherent geometric errors (i.e. bias) for surface reconstruction, and therefore propose a new formulation that is free of bias to first order of approximation, leading to more accurate surface reconstruction even without mask supervision. Experiments on the DTU dataset and the BlendedMVS dataset show that NeuS outperforms the state of the art in high-quality surface reconstruction, especially for objects and scenes with complex structures and self-occlusion.
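
The core of the volume rendering construction can be sketched in a few lines: the SDF is mapped through a logistic CDF and discrete opacities are derived from its decrease between consecutive samples, so the rendering weights peak at the zero-level set. The inverse-scale parameter s below is a fixed stand-in for the learned one.

```python
import torch

def neus_weights(sdf, s=64.0):
    """Sketch of NeuS-style rendering weights along a ray: opacity is the
    relative decrease of the logistic CDF Phi_s of the SDF between
    consecutive samples, composited with transmittance as usual."""
    phi = torch.sigmoid(s * sdf)                       # Phi_s(sdf_i)
    alpha = ((phi[:-1] - phi[1:]) / (phi[:-1] + 1e-7)).clamp(min=0.0)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha]), dim=0)[:-1]
    return trans * alpha                               # rendering weights

sdf = torch.linspace(0.5, -0.5, 65)      # a ray crossing the surface once
w = neus_weights(sdf)                    # peaks near the zero crossing
```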

15. Understanding Object Dynamics for Interactive Image-to-Video Synthesis

Andreas Blattmann, Timo Milbich, Michael Dorkenwald, Björn Ommer

  • retweets: 98, favorites: 35 (06/23/2021 10:32:07)
  • links: abs | pdf
  • cs.CV

What would be the effect of locally poking a static scene? We present an approach that learns natural-looking global articulations caused by a local manipulation at the pixel level. Training requires only videos of moving objects, with no information about the underlying manipulation of the physical scene. Our generative model learns to infer natural object dynamics as a response to user interaction and learns about the interrelations between different object body regions. Given a static image of an object and a local poke at a pixel, the approach then predicts how the object would deform over time. In contrast to existing work on video prediction, we do not synthesize arbitrary realistic videos but enable local interactive control of the deformation. Our model is not restricted to particular object categories and can transfer dynamics onto novel unseen object instances. Extensive experiments on diverse objects demonstrate the effectiveness of our approach compared to common video prediction frameworks. Project page is available at https://bit.ly/3cxfA2L .

16. Scenic4RL: Programmatic Modeling and Generation of Reinforcement Learning Environments

Abdus Salam Azad, Edward Kim, Qiancheng Wu, Kimin Lee, Ion Stoica, Pieter Abbeel, Sanjit A. Seshia

  • retweets: 72, favorites: 34 (06/23/2021 10:32:07)
  • links: abs | pdf
  • cs.LG | cs.AI

The capability of a reinforcement learning (RL) agent directly depends on the diversity of learning scenarios the environment generates and how closely those scenarios capture real-world situations. However, existing environments/simulators lack support for systematically modeling distributions over initial states and transition dynamics. Furthermore, in complex domains such as soccer, the space of possible scenarios is infinite, which makes it impossible for one research group to provide a comprehensive set of scenarios to train, test, and benchmark RL algorithms. To address this issue, for the first time, we adopt an existing formal scenario specification language, SCENIC, to intuitively model and generate interactive scenarios. We interfaced SCENIC with the Google Research Soccer environment to create a platform called SCENIC4RL. Using this platform, we provide a dataset consisting of 36 scenario programs encoded in SCENIC and demonstration data generated from a subset of them. We share our experimental results to show the effectiveness of our dataset and the platform for training, testing, and benchmarking RL algorithms. More importantly, we open-source our platform to enable the RL community to collectively contribute to constructing a comprehensive set of scenarios.

17. DiGS : Divergence guided shape implicit neural representation for unoriented point clouds

Yizhak Ben-Shabat, Chamin Hewa Koneputugodage, Stephen Gould

Neural shape representations have recently been shown to be effective in shape analysis and reconstruction tasks. Existing neural network methods require point coordinates and corresponding normal vectors to learn the implicit level sets of the shape. Normal vectors are often not provided as raw data; therefore, approximation and reorientation are required as pre-processing stages, both of which can introduce noise. In this paper, we propose a divergence-guided shape representation learning approach that does not require normal vectors as input. We show that incorporating a soft constraint on the divergence of the distance function favours smooth solutions that reliably orient gradients to match the unknown normal at each point, in some cases even better than approaches that use ground truth normal vectors directly. Additionally, we introduce a novel geometric initialization method for sinusoidal shape representation networks that further improves convergence to the desired solution. We evaluate the effectiveness of our approach on the task of surface reconstruction and show state-of-the-art performance compared to other unoriented methods and on-par performance compared to oriented methods.
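
A minimal PyTorch sketch of such a divergence penalty, combined with the usual eikonal term, is given below; the weighting and sampling scheme are illustrative assumptions.

```python
import torch

def eikonal_and_divergence_loss(model, pts, lam_div=0.1):
    """The eikonal term keeps |grad f| = 1, while a soft constraint on
    div(grad f) (the Laplacian of f) favours smooth, consistently
    oriented gradients without needing input normals."""
    pts = pts.requires_grad_(True)
    f = model(pts)
    grad = torch.autograd.grad(f.sum(), pts, create_graph=True)[0]  # (N, 3)
    eikonal = ((grad.norm(dim=-1) - 1.0) ** 2).mean()
    div = 0.0
    for i in range(3):   # divergence = sum of second partial derivatives
        div = div + torch.autograd.grad(grad[:, i].sum(), pts,
                                        create_graph=True)[0][:, i]
    return eikonal + lam_div * div.abs().mean()

model = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Softplus(),
                            torch.nn.Linear(64, 1))
loss = eikonal_and_divergence_loss(model, torch.randn(128, 3))
```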

18. Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval

Xiao Wang, Craig Macdonald, Nicola Tonellotto, Iadh Ounis

  • retweets: 64, favorites: 19 (06/23/2021 10:32:07)
  • links: abs | pdf
  • cs.IR

Pseudo-relevance feedback mechanisms, from Rocchio to the relevance models, have shown the usefulness of expanding and reweighting the users’ initial queries using information occurring in an initial set of retrieved documents, known as the pseudo-relevant set. Recently, dense retrieval, through the use of neural contextual language models such as BERT for analysing the documents’ and queries’ contents and computing their relevance scores, has shown promising performance on several information retrieval tasks still relying on the traditional inverted index for identifying documents relevant to a query. Two different dense retrieval families have emerged: single embedded representations for each passage and query (e.g. using BERT’s [CLS] token), or multiple representations (e.g. an embedding for each token of the query and document). In this work, we conduct the first study into the potential for multiple representation dense retrieval to be enhanced using pseudo-relevance feedback. In particular, based on the pseudo-relevant set of documents identified by a first-pass dense retrieval, we extract representative feedback embeddings, while ensuring that these embeddings discriminate among passages, which are then added to the query representation. These additional feedback embeddings are shown to enhance the effectiveness of both a reranking and an additional dense retrieval operation. Indeed, experiments on the MSMARCO passage ranking dataset show that MAP can be improved by up to 26% on the TREC 2019 query set and 10% on the TREC 2020 query set by applying our proposed ColBERT-PRF method to a ColBERT dense retrieval approach.
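
In outline, the feedback embeddings can be obtained by clustering the token embeddings of the pseudo-relevant passages, as sketched below; the paper additionally filters the centroids by IDF to keep only discriminative ones, a step omitted in this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def prf_embeddings(doc_token_embs, n_feedback=8):
    """Cluster the token embeddings of the pseudo-relevant passages and
    keep representative centroids as extra query embeddings."""
    km = KMeans(n_clusters=n_feedback, n_init=4, random_state=0)
    km.fit(doc_token_embs)
    return km.cluster_centers_

doc_token_embs = np.random.randn(500, 128)   # tokens from top-k passages
query_embs = np.random.randn(32, 128)        # original query token embeddings
expanded_query = np.vstack([query_embs, prf_embeddings(doc_token_embs)])
```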

19. MeshRIR: A Dataset of Room Impulse Responses on Meshed Grid Points For Evaluating Sound Field Analysis and Synthesis Methods

Shoichi Koyama, Tomoya Nishida, Keisuke Kimura, Takumi Abe, Natsuki Ueno, Jesper Brunnström

A new impulse response (IR) dataset called “MeshRIR” is introduced. Currently available datasets usually include IRs captured at an array of microphones from several source positions under various room conditions, and are basically designed for evaluating speech enhancement and distant speech recognition methods. On the other hand, methods of estimating or controlling spatial sound fields have been extensively investigated in recent years; however, the current IR datasets are not applicable to validating and comparing these methods because of the low spatial resolution of their measurement points. MeshRIR consists of IRs measured at positions obtained by finely discretizing a spatial region. Two subdatasets are currently available: one consists of IRs in a three-dimensional cuboidal region from a single source, and the other consists of IRs in a two-dimensional square region from an array of 32 sources. MeshRIR is therefore suitable for evaluating sound field analysis and synthesis methods. This dataset is freely available at https://sh01k.github.io/MeshRIR/ together with code for sample applications.

20. Moving in a 360 World: Synthesizing Panoramic Parallaxes from a Single Panorama

Ching-Yu Hsu, Cheng Sun, Hwann-Tzong Chen

  • retweets: 42, favorites: 34 (06/23/2021 10:32:08)
  • links: abs | pdf
  • cs.CV

We present Omnidirectional Neural Radiance Fields (OmniNeRF), the first method for parallax-enabled novel panoramic view synthesis. Recent works on novel view synthesis focus on perspective images with limited field-of-view and require many pictures captured under specific conditions. Conversely, OmniNeRF can generate panorama images for unknown viewpoints given only a single equirectangular image as training data. To this end, we propose to augment the single RGB-D panorama by projecting back and forth between the 3D world and different 2D panoramic coordinates at different virtual camera positions. By doing so, we are able to optimize an Omnidirectional Neural Radiance Field with visible pixels collected from omnidirectional viewing angles at a fixed center, for the estimation of new viewing angles from varying camera positions. As a result, the proposed OmniNeRF achieves convincing renderings of novel panoramic views that exhibit the parallax effect. We showcase the effectiveness of each of our proposals on both synthetic and real-world datasets.
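
A minimal sketch of the panorama-augmentation geometry, lifting an RGB-D equirectangular image to a 3D point cloud that can be re-projected from shifted camera positions, assuming a simple spherical parameterization:

```python
import numpy as np

def equirect_to_points(depth, rgb):
    """Each pixel of an equirectangular panorama maps to a viewing
    direction on the unit sphere; scaling by depth yields a 3D point.
    Re-projecting these points into a panorama at a shifted virtual
    camera gives the augmented training views."""
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    theta = (u / W) * 2 * np.pi - np.pi          # longitude in [-pi, pi)
    phi = (v / H) * np.pi - np.pi / 2            # latitude in [-pi/2, pi/2)
    dirs = np.stack([np.cos(phi) * np.sin(theta),
                     np.sin(phi),
                     np.cos(phi) * np.cos(theta)], axis=-1)
    points = dirs * depth[..., None]
    return points.reshape(-1, 3), rgb.reshape(-1, 3)

depth = np.ones((64, 128)); rgb = np.random.rand(64, 128, 3)
pts, cols = equirect_to_points(depth, rgb)
```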

21. Towards Long-Form Video Understanding

Chao-Yuan Wu, Philipp Krähenbühl

  • retweets: 30, favorites: 26 (06/23/2021 10:32:08)
  • links: abs | pdf
  • cs.CV

Our world offers a never-ending stream of visual stimuli, yet today’s vision systems only accurately recognize patterns within a few seconds. These systems understand the present, but fail to contextualize it in past or future events. In this paper, we study long-form video understanding. We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets. We show that existing state-of-the-art short-term models are limited for long-form tasks. A novel object-centric transformer-based video recognition architecture performs significantly better on 7 diverse tasks. It also outperforms comparable state-of-the-art methods on the AVA dataset.

22. TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

  • retweets: 24, favorites: 28 (06/23/2021 10:32:08)
  • links: abs | pdf
  • cs.CV | cs.LG

In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches with attention, our approach learns to mine important tokens from visual data. The result is an efficient and effective way of finding a few important visual tokens, enabling modeling of pairwise attention between such tokens over a longer temporal horizon for videos, or over the spatial content of images. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks. Importantly, because our tokens are adaptive, we achieve competitive results at significantly reduced computational cost.
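
A minimal sketch of the idea, predicting a few spatial attention maps and pooling with each to produce adaptive tokens, assuming a sigmoid-normalised single-conv attention (the paper's module is a small conv/MLP stack):

```python
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """Predict S spatial attention maps from the feature map and spatially
    pool with each map, yielding S adaptive tokens instead of hundreds of
    fixed patch tokens."""
    def __init__(self, channels, n_tokens=8):
        super().__init__()
        self.attn = nn.Conv2d(channels, n_tokens, kernel_size=1)

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        maps = torch.sigmoid(self.attn(x)).flatten(2)      # (B, S, H*W)
        maps = maps / (maps.sum(dim=-1, keepdim=True) + 1e-6)
        return maps @ x.flatten(2).transpose(1, 2)         # (B, S, C)

tokens = TokenLearner(channels=256, n_tokens=8)(torch.randn(2, 256, 14, 14))
```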