All Articles

Hot Papers 2021-03-11

1. Involution: Inverting the Inherence of Convolution for Visual Recognition

Duo Li, Jie Hu, Changhu Wang, Xiangtai Li, Qi She, Lei Zhu, Tong Zhang, Qifeng Chen

  • retweets: 2855, favorites: 350 (03/12/2021 08:32:46)
  • links: abs | pdf
  • cs.CV

Convolution has been the core ingredient of modern neural networks, triggering the surge of deep learning in vision. In this work, we rethink the inherent principles of standard convolution for vision tasks, specifically spatial-agnostic and channel-specific. Instead, we present a novel atomic operation for deep neural networks by inverting the aforementioned design principles of convolution, coined involution. We additionally demystify the recent popular self-attention operator and subsume it into our involution family as an over-complicated instantiation. The proposed involution operator can serve as a fundamental building block for a new generation of neural networks for visual recognition, powering different deep learning models on several prevalent benchmarks, including ImageNet classification, COCO detection and segmentation, and Cityscapes segmentation. Our involution-based models improve the performance of ResNet-50 convolutional baselines by up to 1.6% top-1 accuracy, 2.5% and 2.4% bounding box AP, and 4.7% mean IoU absolutely, while compressing the computational cost to 66%, 65%, 72%, and 57% on the above benchmarks, respectively. Code and pre-trained models for all the tasks are available at https://github.com/d-li14/involution.
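
A minimal sketch of an involution-style layer in PyTorch, following the abstract's description of inverted design principles: kernels are generated per spatial location (spatial-specific) and shared within channel groups (channel-agnostic). The reduction ratio, group count, and lack of normalization layers are illustrative assumptions, not the authors' exact configuration; see the official repository for the real implementation.

```python
import torch
import torch.nn as nn

class Involution2d(nn.Module):
    def __init__(self, channels, kernel_size=7, groups=16, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        # Generate a K*K kernel per location and per channel group from the input itself
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.span = nn.Conv2d(channels // reduction, kernel_size * kernel_size * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # (B, G, 1, K*K, H, W): one kernel per location, shared within each group
        kernel = self.span(self.reduce(x)).view(b, self.g, 1, self.k * self.k, h, w)
        # (B, G, C/G, K*K, H, W): neighbourhood patches of the input
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k * self.k, h, w)
        # Multiply-accumulate over the kernel window
        return (kernel * patches).sum(dim=3).view(b, c, h, w)

# Usage: output keeps the input shape, like a padded 3x3 convolution would
y = Involution2d(64)(torch.randn(2, 64, 32, 32))   # y.shape == (2, 64, 32, 32)
```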

2. Variable-rate discrete representation learning

Sander Dieleman, Charlie Nash, Jesse Engel, Karen Simonyan

Semantically meaningful information content in perceptual signals is usually unevenly distributed. In speech signals, for example, there are often many silences, and the speed of pronunciation can vary considerably. In this work, we propose slow autoencoders (SlowAEs) for unsupervised learning of high-level variable-rate discrete representations of sequences, and apply them to speech. We show that the resulting event-based representations automatically grow or shrink depending on the density of salient information in the input signals, while still allowing for faithful signal reconstruction. We develop run-length Transformers (RLTs) for modelling these event-based representations and use them to construct language models in the speech domain, which are able to generate grammatical and semantically coherent utterances and continuations.
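
To make the "event-based" idea concrete, here is a small illustration of run-length encoding a slowly varying sequence of discrete codes into (token, duration) events, i.e. the kind of variable-rate representation a run-length Transformer could model. This only illustrates the representation format; the SlowAE and RLT architectures themselves are not reproduced here.

```python
from itertools import groupby

def run_length_encode(codes):
    """[3, 3, 3, 7, 7, 1] -> [(3, 3), (7, 2), (1, 1)]"""
    return [(tok, sum(1 for _ in grp)) for tok, grp in groupby(codes)]

def run_length_decode(events):
    return [tok for tok, length in events for _ in range(length)]

codes = [5, 5, 5, 5, 2, 2, 9, 9, 9]        # e.g. discrete codes from a slow encoder
events = run_length_encode(codes)          # [(5, 4), (2, 2), (9, 3)] -- fewer events where little changes
assert run_length_decode(events) == codes  # the code stream is reconstructed faithfully
```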

3. Linear Constraints

Jean-Philippe Bernardy, Richard Eisenberg, Csongor Kiss, Arnaud Spiwack, Nicolas Wu

  • retweets: 703, favorites: 73 (03/12/2021 08:32:46)
  • links: abs | pdf
  • cs.PL

A linear argument must be consumed exactly once in the body of its function. A linear type system can verify the correct usage of resources such as file handles and manually managed memory. But this verification requires bureaucracy. This paper presents linear constraints, a front-end feature for linear typing that decreases the bureaucracy of working with linear types. Linear constraints are implicit linear arguments that are to be filled in automatically by the compiler. Linear constraints are presented as a qualified type system, together with an inference algorithm which extends OutsideIn, GHC’s existing constraint solver algorithm. Soundness of linear constraints is ensured by the fact that they desugar into Linear Haskell.
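
As a loose runtime analogy only (Python has no linear types, and this digest keeps one language throughout): a wrapper that enforces "consumed exactly once" with assertions. In the paper this guarantee is established at compile time by GHC's type system, with the linear arguments inferred automatically as constraints; nothing below reproduces that inference, it only illustrates the resource discipline being enforced.

```python
class Linear:
    """Wrap a resource that must be used exactly once (runtime analogue only)."""
    def __init__(self, resource):
        self._resource, self._uses = resource, 0

    def consume(self):
        assert self._uses == 0, "linear resource consumed more than once"
        self._uses += 1
        return self._resource

    def assert_consumed(self):
        assert self._uses == 1, "linear resource was never consumed"

# Example: a file handle that must be claimed exactly once before use.
handle = Linear(open("example.txt", "w"))
f = handle.consume()      # first (and only allowed) use
f.write("hello")
f.close()
handle.assert_consumed()  # would fail if we had forgotten to consume the handle
```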

4. SMIL: Multimodal Learning with Severely Missing Modality

Mengmeng Ma, Jian Ren, Long Zhao, Sergey Tulyakov, Cathy Wu, Xi Peng

  • retweets: 256, favorites: 63 (03/12/2021 08:32:47)
  • links: abs | pdf
  • cs.CV

A common assumption in multimodal learning is the completeness of training data, i.e., full modalities are available in all training examples. Although there has been research on handling incomplete testing data, e.g., modalities partially missing from testing examples, few methods can handle incomplete training modalities. The problem becomes even more challenging when modalities are severely missing, e.g., 90% of training examples may have incomplete modalities. For the first time in the literature, this paper formally studies multimodal learning with missing modality in terms of flexibility (missing modalities in training, testing, or both) and efficiency (most training data have incomplete modality). Technically, we propose a new method named SMIL that leverages Bayesian meta-learning to achieve both objectives uniformly. To validate our idea, we conduct a series of experiments on three popular benchmarks: MM-IMDb, CMU-MOSI, and avMNIST. The results demonstrate the state-of-the-art performance of SMIL over existing methods and generative baselines, including autoencoders and generative adversarial networks. Our code is available at https://github.com/mengmenm/SMIL.
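
A minimal sketch of the problem setup described in the abstract: most training examples (e.g. 90%) are missing one modality, and the model must still learn from both complete and incomplete examples. The Bayesian meta-learning machinery of SMIL is not reproduced; `impute_missing` and the mean-prior fallback are hypothetical placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_img, d_txt = 1000, 64, 32
images = rng.normal(size=(n, d_img))
texts = rng.normal(size=(n, d_txt))
has_text = rng.random(n) > 0.9           # only ~10% of examples keep the text modality

def impute_missing(image_feat, text_prior):
    # Hypothetical: fall back to a learned prior (here, the mean of observed texts)
    return text_prior

text_prior = texts[has_text].mean(axis=0)
fused = []
for i in range(n):
    txt = texts[i] if has_text[i] else impute_missing(images[i], text_prior)
    fused.append(np.concatenate([images[i], txt]))
fused = np.stack(fused)                   # (1000, 96) features for a downstream model
```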

5. VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples

Tian Pan, Yibing Song, Tianyu Yang, Wenhao Jiang, Wei Liu

MoCo is effective for unsupervised image representation learning. In this paper, we propose VideoMoCo for unsupervised video representation learning. Given a video sequence as an input sample, we improve the temporal feature representations of MoCo from two perspectives. First, we introduce a generator that temporally drops out several frames from this sample. The discriminator is then trained to encode similar feature representations regardless of frame removals. By adaptively dropping out different frames across training iterations of adversarial learning, we augment the input sample to train a temporally robust encoder. Second, we use temporal decay to model key attenuation in the memory queue when computing the contrastive loss. Because the momentum encoder keeps updating after keys are enqueued, the representation ability of these keys degrades by the time the current input sample is used for contrastive learning. This degradation is reflected via temporal decay, which makes the input sample attend more to recent keys in the queue. As a result, we adapt MoCo to learn video representations without empirically designing pretext tasks. By empowering the temporal robustness of the encoder and modeling the temporal decay of the keys, our VideoMoCo improves MoCo temporally based on contrastive learning. Experiments on benchmark datasets including UCF101 and HMDB51 show that VideoMoCo stands as a state-of-the-art video representation learning method.
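
A hedged sketch of the temporal-decay idea on top of a MoCo-style InfoNCE loss: older keys in the memory queue are down-weighted, since they were produced by an earlier state of the momentum encoder. The decay schedule, temperature, and tensor shapes are illustrative assumptions, and the adversarial frame-dropping generator is omitted.

```python
import torch
import torch.nn.functional as F

def temporal_decay_infonce(q, k_pos, queue, queue_age, tau=0.07, decay=0.99):
    """q: (B, D) query features, k_pos: (B, D) positive keys,
    queue: (K, D) negative keys, queue_age: (K,) steps since enqueue (0 = newest)."""
    q, k_pos, queue = F.normalize(q, dim=1), F.normalize(k_pos, dim=1), F.normalize(queue, dim=1)
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)        # (B, 1) positive logits
    weights = decay ** queue_age                         # older keys are attenuated
    l_neg = (q @ queue.t()) * weights                    # (B, K) decayed negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)    # the positive is always index 0
    return F.cross_entropy(logits, labels)

loss = temporal_decay_infonce(torch.randn(8, 128), torch.randn(8, 128),
                              torch.randn(4096, 128), torch.arange(4096.0))
```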

6. Complex decision-making strategies in a stock market experiment explained as the combination of few simple strategies

Gael Poux-Medard, Sergio Cobo-Lopez, Jordi Duch, Roger Guimera, Marta Sales-Pardo

Many studies have shown that there are regularities in the way human beings make decisions. However, our ability to obtain models that capture such regularities and can accurately predict unobserved decisions is still limited. We tackle this problem in the context of individuals who are given information about the evolution of market prices and asked to guess the direction of the market. We use a network inference approach with stochastic block models (SBMs) to find the model and network representation that is most predictive of unobserved decisions. Our results suggest that players mostly use recent information (about the market and about their previous decisions) to guess. Furthermore, the analysis of SBM groups reveals a set of strategies used by players to process information and make decisions that is analogous to behaviors observed in other contexts. Our study provides an example of how to quantitatively explore human behavior strategies by representing decisions as networks and using rigorous inference and model-selection approaches.
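
An illustrative baseline only, built around the abstract's finding that players rely mostly on recent information (the latest market move and their own previous decision). The toy predictor below uses exactly those two features on simulated data; the paper's actual method, network inference with stochastic block models, is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
T = 500
market_up = rng.integers(0, 2, size=T)                  # 1 = price went up at step t
# Simulated player who tends to repeat the last observed market direction
guess = np.concatenate([[1], (market_up[:-1] ^ (rng.random(T - 1) < 0.2)).astype(int)])

# Features at step t: the last market move and the player's previous guess
X = np.column_stack([market_up[:-1], guess[:-1]])
y = guess[1:]
model = LogisticRegression().fit(X[:400], y[:400])
print("held-out accuracy:", model.score(X[400:], y[400:]))
```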

7. A critical reappraisal of predicting suicidal ideation using fMRI

Timothy Verstynen, Konrad Kording

For many psychiatric disorders, neuroimaging offers the potential to revolutionize diagnosis and treatment by providing access to preverbal mental processes. In their study “Machine learning of neural representations of suicide and emotion concepts identifies suicidal youth”, Just and colleagues report that a Naive Bayes classifier, trained on voxelwise fMRI responses in human participants during the presentation of words and concepts related to mortality, can predict whether an individual had reported having suicidal ideations with a classification accuracy of 91%. Here we report a reappraisal of the methods employed by the authors, including re-analysis of the same data set, that calls into question the accuracy of the authors' findings.
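
A generic illustration, not a re-analysis of the study in question: selecting the most discriminative features on the full dataset before cross-validation yields inflated accuracy even on pure noise. This is one of the well-known pitfalls that small-sample fMRI classification studies must guard against; the sample size and feature counts below are arbitrary.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 20000))           # 40 "participants", 20k pure-noise "voxels"
y = np.repeat([0, 1], 20)                  # two arbitrary groups

# Leaky protocol: feature selection sees the test folds, accuracy looks far above chance
X_sel = SelectKBest(f_classif, k=50).fit_transform(X, y)
print("leaky CV accuracy:", cross_val_score(GaussianNB(), X_sel, y, cv=5).mean())

# Correct protocol: selection happens inside each training fold, accuracy stays near chance
pipe = make_pipeline(SelectKBest(f_classif, k=50), GaussianNB())
print("nested CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```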

8. Spatially Consistent Representation Learning

Byungseok Roh, Wuhyun Shin, Ildoo Kim, Sungwoong Kim

  • retweets: 42, favorites: 32 (03/12/2021 08:32:47)
  • links: abs | pdf
  • cs.CV | cs.LG

Self-supervised learning has been widely used to obtain transferable representations from unlabeled images. In particular, recent contrastive learning methods have shown impressive performance on downstream image classification tasks. While these contrastive methods mainly focus on generating invariant global representations at the image level under semantic-preserving transformations, they are prone to overlooking the spatial consistency of local representations, and are therefore limited as pretraining for localization tasks such as object detection and instance segmentation. Moreover, the aggressively cropped views used in existing contrastive methods can minimize representation distances between semantically different regions of a single image. In this paper, we propose a spatially consistent representation learning algorithm (SCRL) for multi-object and location-specific tasks. In particular, we devise a novel self-supervised objective that tries to produce coherent spatial representations of a randomly cropped local region under geometric translation and zooming operations. On various downstream localization tasks with benchmark datasets, the proposed SCRL shows significant performance improvements over both image-level supervised pretraining and state-of-the-art self-supervised learning methods.
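
A hedged sketch of the spatial-consistency objective described in the abstract: pool the features of the same image region from two differently cropped or zoomed views and pull them together. The box coordinates, pooling size, and cosine loss are illustrative assumptions, not SCRL's exact recipe.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

# Backbone feature maps of two augmented views of the same image
feat_view1 = torch.randn(1, 256, 56, 56)
feat_view2 = torch.randn(1, 256, 56, 56)

# The same physical region, expressed in each view's own feature-map coordinates
box_in_view1 = torch.tensor([[0., 10., 10., 30., 30.]])   # (batch_idx, x1, y1, x2, y2)
box_in_view2 = torch.tensor([[0., 5., 20., 25., 40.]])

z1 = roi_align(feat_view1, box_in_view1, output_size=(1, 1)).flatten(1)   # (1, 256)
z2 = roi_align(feat_view2, box_in_view2, output_size=(1, 1)).flatten(1)

# Consistency loss: the pooled local representations should agree across views
loss = 1 - F.cosine_similarity(z1, z2).mean()
```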

9. FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding

Bo Sun, Banghuai Li, Shengcai Cai, Ye Yuan, Chi Zhang

  • retweets: 30, favorites: 34 (03/12/2021 08:32:47)
  • links: abs | pdf
  • cs.CV

There is emerging interest in recognizing previously unseen objects given very few training examples, known as few-shot object detection (FSOD). Recent research demonstrates that a good feature embedding is the key to reaching favorable few-shot learning performance. We observe that object proposals with different Intersection-over-Union (IoU) scores are analogous to the intra-image augmentation used in contrastive approaches, and we exploit this analogy by incorporating supervised contrastive learning to achieve more robust object representations in FSOD. We present Few-Shot object detection via Contrastive proposal Encoding (FSCE), a simple yet effective approach to learning contrastive-aware object proposal encodings that facilitate the classification of detected objects. We find that the degradation of average precision (AP) for rare objects mainly comes from misclassifying novel instances as confusable classes, and we ease this misclassification by promoting instance-level intra-class compactness and inter-class variance via our contrastive proposal encoding loss (CPE loss). Our design outperforms current state-of-the-art works in all shot settings and data splits, with up to +8.8% on the standard PASCAL VOC benchmark and +2.7% on the challenging COCO benchmark. Code is available at: https://github.com/bsun0802/FSCE.git
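
A hedged sketch of a supervised contrastive loss over proposal embeddings, in the spirit of the CPE loss: proposals of the same class are pulled together, others pushed apart, and each proposal's contribution is gated by its IoU with the ground truth. The IoU threshold, weighting scheme, and temperature are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_proposal_loss(embeddings, labels, ious, tau=0.2, iou_thresh=0.7):
    """embeddings: (N, D) proposal features, labels: (N,) class ids, ious: (N,) IoU with GT."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / tau                                   # (N, N) pairwise similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)       # positive-pair mask (same class)
    eye = torch.eye(len(labels), dtype=torch.bool)
    # Log-softmax over all other proposals (self excluded from the denominator)
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float('-inf')), dim=1, keepdim=True)
    pos = (same & ~eye).float()
    per_proposal = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    weight = (ious >= iou_thresh).float() * ious            # keep only confident proposals, weighted by IoU
    return (per_proposal * weight).sum() / weight.sum().clamp(min=1e-6)

loss = contrastive_proposal_loss(torch.randn(16, 128),
                                 torch.randint(0, 5, (16,)),
                                 torch.rand(16))
```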