Hot Papers 2021-04-28

1. Why AI is Harder Than We Think

Melanie Mitchell

  • retweets: 27880, favorites: 25 (04/29/2021 12:53:36)
  • links: abs | pdf
  • cs.AI

Since its beginning in the 1950s, the field of artificial intelligence has cycled several times between periods of optimistic predictions and massive investment (“AI spring”) and periods of disappointment, loss of confidence, and reduced funding (“AI winter”). Even with today’s seemingly fast pace of AI breakthroughs, the development of long-promised technologies such as self-driving cars, housekeeping robots, and conversational companions has turned out to be much harder than many people expected. One reason for these repeating cycles is our limited understanding of the nature and complexity of intelligence itself. In this paper I describe four fallacies in common assumptions made by AI researchers, which can lead to overconfident predictions about the field. I conclude by discussing the open questions spurred by these fallacies, including the age-old challenge of imbuing machines with humanlike common sense.

2. Multimodal Self-Supervised Learning of General Audio Representations

Luyu Wang, Pauline Luc, Adria Recasens, Jean-Baptiste Alayrac, Aaron van den Oord

We present a multimodal framework to learn general audio representations from videos. Existing contrastive audio representation learning methods mainly focus on using the audio modality alone during training. In this work, we show that additional information contained in video can be utilized to greatly improve the learned features. First, we demonstrate that our contrastive framework does not require high resolution images to learn good audio features. This allows us to scale up the training batch size, while keeping the computational load incurred by the additional video modality to a reasonable level. Second, we use augmentations that mix together different samples. We show that this is effective to make the proxy task harder, which leads to substantial performance improvements when increasing the batch size. As a result, our audio model achieves a state-of-the-art of 42.4 mAP on the AudioSet classification downstream task, closing the gap between supervised and self-supervised methods trained on the same dataset. Moreover, we show that our method is advantageous on a broad range of non-semantic audio tasks, including speaker identification, keyword spotting, language identification, and music instrument classification.

3. Multimodal Contrastive Training for Visual Representation Learning

Xin Yuan, Zhe Lin, Jason Kuen, Jianming Zhang, Yilin Wang, Michael Maire, Ajinkya Kale, Baldo Faieta

  • retweets: 361, favorites: 85 (04/29/2021 12:53:37)
  • links: abs | pdf
  • cs.CV

We develop an approach to learning visual representations that embraces multimodal data, driven by a combination of intra- and inter-modal similarity preservation objectives. Unlike existing visual pre-training methods, which solve a proxy prediction task in a single domain, our method exploits intrinsic data properties within each modality and semantic information from cross-modal correlation simultaneously, hence improving the quality of learned visual representations. By including multimodal training in a unified framework with different types of contrastive losses, our method can learn more powerful and generic visual features. We first train our model on COCO and evaluate the learned visual representations on various downstream tasks including image classification, object detection, and instance segmentation. For example, the visual representations pre-trained on COCO by our method achieve state-of-the-art top-1 validation accuracy of 55.3% on ImageNet classification, under the common transfer protocol. We also evaluate our method on the large-scale Stock images dataset and show its effectiveness on multi-label image tagging, and cross-modal retrieval tasks.

4. Emergence as the conversion of information: A unifying theory

Thomas Varley, Erik Hoel

  • retweets: 225, favorites: 53 (04/29/2021 12:53:37)
  • links: abs | pdf
  • cs.IT

Is reduction always a good scientific strategy? Does it always lead to a gain in information? The very existence of the special sciences above and beyond physics seems to hint no. Previous research has shown that dimension reduction (macroscales) can increase the dependency between elements of a system (a phenomenon called “causal emergence”). However, this has been shown only for specific measures like effective information or integrated information. Here, we provide an umbrella mathematical framework for emergence based on information conversion. Specifically, we show evidence that a macroscale can have more of a certain type of information than its underlying microscale. This is because macroscales can convert information from one type to another. In such cases, reduction to a microscale means the loss of this type of information. We demonstrate this using the well-understood mutual information measure applied to Boolean networks. By using the partial information decomposition, the mutual information can be decomposed into redundant, unique, and synergistic information atoms. Then by introducing a novel measure of the synergy bias of a given decomposition, we are able to show that the synergy component of a Boolean network’s mutual information can increase at macroscales. This can occur even when there is no difference in the total mutual information between a macroscale and its underlying microscale, proving information conversion. We relate this broad framework to previous work, compare it to other theories, and argue it complexifies any notion of universal reduction in the sciences, since such reduction would likely lead to a loss of synergistic information in scientific models.
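As a rough illustration of how the mutual information in a Boolean system can be entirely synergistic, the sketch below computes mutual information for a two-input XOR gate with uniformly distributed inputs. The gate, the uniform input distribution, and the helper name `mutual_information` are assumptions chosen for illustration; this is neither the authors' synergy-bias measure nor a full partial information decomposition.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """Mutual information I(A;B) in bits from equally likely (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in joint.items()
    )


# All four equally likely input states of a hypothetical XOR gate.
states = [(x1, x2, x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]

# Information the two inputs carry about the output, jointly and individually.
mi_joint = mutual_information([((x1, x2), y) for x1, x2, y in states])
mi_x1 = mutual_information([(x1, y) for x1, _, y in states])
mi_x2 = mutual_information([(x2, y) for _, x2, y in states])

print(mi_joint, mi_x1, mi_x2)
```

Each input alone carries no information about the output, while the pair carries one full bit, so in this toy case all of the mutual information is available only from the inputs taken together.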

5. A Chromium-based Memento-aware Web Browser

Abigail Mabe

  • retweets: 132, favorites: 11 (04/29/2021 12:53:37)
  • links: abs | pdf
  • cs.DL | cs.CR

Web browsers provide a user-friendly means of navigating the web. Users rely on their web browser to provide information about the websites they are visiting, such as the security state. Browsers also provide a user interface (UI) with visual cues about each open tab, including icons that indicate whether the tab is playing audio or requires authentication to view. However, current browsers do not differentiate between the live web and the past web. If a user loads an archived webpage, known as a memento, they have to rely on UI elements present within the page itself to inform them that the page they are viewing is not the live web. Additionally, memento-awareness extends beyond recognizing a page that has already been archived. The browser should give users the ability to archive live webpages, essentially creating mementos of webpages they found important as they surf the web. This report presents the process of creating a proof-of-concept browser that is aware of mementos. The browser is created by adding on to the implementation of the open source web browser by Google, Chromium. Creating this prototype for a Memento-aware Browser shows that the features implemented fit well into the current Chromium implementation. The user experience is enhanced by adding the memento-awareness, and the changes to the Chromium code base are minimal.

6. Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques

Grzegorz Chrupała

This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years. Such models are inspired by the observation that when children pick up a language, they rely on a wide range of indirect and noisy clues, crucially including signals from the visual modality co-occurring with spoken utterances. Several fields have made important contributions to this approach to modeling or mimicking the process of learning language: Machine Learning, Natural Language and Speech Processing, Computer Vision and Cognitive Science. The current paper brings together these contributions in order to provide a useful introduction and overview for practitioners in all these areas. We discuss the central research questions addressed, the timeline of developments, and the datasets which enabled much of this work. We then summarize the main modeling architectures and offer an exhaustive overview of the evaluation metrics and analysis techniques.

7. Extractive and Abstractive Explanations for Fact-Checking and Evaluation of News

Ashkan Kazemi, Zehua Li, Verónica Pérez-Rosas, Rada Mihalcea

  • retweets: 44, favorites: 44 (04/29/2021 12:53:37)
  • links: abs | pdf
  • cs.CL

In this paper, we explore the construction of natural language explanations for news claims, with the goal of assisting fact-checking and news evaluation applications. We experiment with two methods: (1) an extractive method based on Biased TextRank — a resource-effective unsupervised graph-based algorithm for content extraction; and (2) an abstractive method based on the GPT-2 language model. We perform comparative evaluations on two misinformation datasets in the political and health news domains, and find that the extractive method shows the most promise.
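The abstract does not describe how Biased TextRank works internally. The sketch below assumes it behaves like a personalized PageRank over a sentence-similarity graph, with restart probabilities weighted by each sentence's similarity to the claim being checked. Bag-of-words cosine similarity stands in for whatever sentence representation the authors use, and the function names, example sentences, and parameter values are all hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def biased_rank(sentences, query, damping=0.85, iterations=50):
    """Score sentences by graph centrality, with restarts biased toward the query."""
    bags = [Counter(s.lower().split()) for s in sentences]
    query_bag = Counter(query.lower().split())
    n = len(bags)
    # Edge weights: pairwise sentence similarity, without self-loops.
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out = [sum(row) for row in sim]
    # Restart distribution: similarity of each sentence to the query, normalised.
    bias = [cosine(bag, query_bag) for bag in bags]
    total = sum(bias)
    bias = [b / total for b in bias] if total else [1.0 / n] * n
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) * bias[i]
            + damping * sum(scores[j] * sim[j][i] / out[j] for j in range(n) if out[j])
            for i in range(n)
        ]
    return scores


# Hypothetical article sentences and claim.
sentences = [
    "the vaccine was tested in large clinical trials",
    "clinical trials found the vaccine safe",
    "the mayor opened a new bridge on friday",
]
scores = biased_rank(sentences, "is the vaccine safe")
best = sentences[scores.index(max(scores))]
print(best)
```

An extractive explanation would then be assembled from the highest-scoring sentences; here the off-topic sentence about the bridge receives the lowest score.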

8. One Billion Audio Sounds from GPU-enabled Modular Synthesis

Joseph Turian, Jordie Shier, George Tzanetakis, Kirk McNally, Max Henry

We release synth1B1, a multi-modal audio corpus consisting of 1 billion 4-second synthesized sounds, which is 100x larger than any audio dataset in the literature. Each sound is paired with the corresponding latent parameters used to generate it. synth1B1 samples are deterministically generated on-the-fly 16200x faster than real-time (714MHz) on a single GPU using torchsynth (https://github.com/torchsynth/torchsynth), an open-source modular synthesizer we release. Additionally, we release two new audio datasets: FM synth timbre (https://zenodo.org/record/4677102) and subtractive synth pitch (https://zenodo.org/record/4677097). Using these datasets, we demonstrate new rank-based synthesizer-motivated evaluation criteria for existing audio representations. Finally, we propose novel approaches to synthesizer hyperparameter optimization, and demonstrate how perceptually-correlated auditory distances could enable new applications in synthesizer design.
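The throughput figure can be sanity-checked with a little arithmetic. The sketch below assumes a 44.1 kHz sample rate, which the abstract does not state; under that assumption, 16200x real-time works out to roughly 714 million samples per second, consistent with the quoted 714MHz. The other constants come from the abstract.

```python
SAMPLE_RATE_HZ = 44_100       # assumption: not stated in the abstract
REALTIME_FACTOR = 16_200      # from the abstract
NUM_SOUNDS = 1_000_000_000    # from the abstract
SECONDS_PER_SOUND = 4         # from the abstract

# Samples generated per wall-clock second at the quoted speed-up.
samples_per_second = SAMPLE_RATE_HZ * REALTIME_FACTOR

# Wall-clock time to render the whole corpus on one GPU at that rate.
total_audio_seconds = NUM_SOUNDS * SECONDS_PER_SOUND
generation_seconds = total_audio_seconds / REALTIME_FACTOR

print(f"{samples_per_second / 1e6:.2f} MHz")
print(f"{generation_seconds / 86_400:.1f} days to render the corpus")
```

Under the same assumption, rendering the full corpus on a single GPU would take a few days, which is why generating samples on the fly is practical.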

9. Adapting ImageNet-scale models to complex distribution shifts with self-learning

Evgenia Rusak, Steffen Schneider, Peter Gehler, Oliver Bringmann, Wieland Brendel, Matthias Bethge

  • retweets: 30, favorites: 47 (04/29/2021 12:53:37)
  • links: abs | pdf
  • cs.CV | cs.LG

While self-learning methods are an important component in many recent domain adaptation techniques, they are not yet comprehensively evaluated on ImageNet-scale datasets common in robustness research. In extensive experiments on ResNet and EfficientNet models, we find that three components are crucial for increasing performance with self-learning: (i) using short update times between the teacher and the student network, (ii) fine-tuning only few affine parameters distributed across the network, and (iii) leveraging methods from robust classification to counteract the effect of label noise. We use these insights to obtain drastically improved state-of-the-art results on ImageNet-C (22.0% mCE), ImageNet-R (17.4% error) and ImageNet-A (14.8% error). Our techniques yield further improvements in combination with previously proposed robustification methods. Self-learning is able to reduce the top-1 error to a point where no substantial further progress can be expected. We therefore re-purpose the dataset from the Visual Domain Adaptation Challenge 2019 and use a subset of it as a new robustness benchmark (ImageNet-D) which proves to be a more challenging dataset for all current state-of-the-art models (58.2% error) to guide future research efforts at the intersection of robustness and domain adaptation on ImageNet scale.

10. Unsupervised 3D Shape Completion through GAN Inversion

Junzhe Zhang, Xinyi Chen, Zhongang Cai, Liang Pan, Haiyu Zhao, Shuai Yi, Chai Kiat Yeo, Bo Dai, Chen Change Loy

  • retweets: 43, favorites: 25 (04/29/2021 12:53:37)
  • links: abs | pdf
  • cs.CV

Most 3D shape completion approaches rely heavily on partial-complete shape pairs and learn in a fully supervised manner. Despite their impressive performances on in-domain data, when generalizing to partial shapes in other forms or real-world partial scans, they often obtain unsatisfactory results due to domain gaps. In contrast to previous fully supervised approaches, in this paper we present ShapeInversion, which introduces Generative Adversarial Network (GAN) inversion to shape completion for the first time. ShapeInversion uses a GAN pre-trained on complete shapes by searching for a latent code that gives a complete shape that best reconstructs the given partial input. In this way, ShapeInversion no longer needs paired training data, and is capable of incorporating the rich prior captured in a well-trained generative model. On the ShapeNet benchmark, the proposed ShapeInversion outperforms the SOTA unsupervised method, and is comparable with supervised methods that are learned using paired data. It also demonstrates remarkable generalization ability, giving robust results for real-world scans and partial inputs of various forms and incompleteness levels. Importantly, ShapeInversion naturally enables a series of additional abilities thanks to the involvement of a pre-trained GAN, such as producing multiple valid complete shapes for an ambiguous partial input, as well as shape manipulation and interpolation.

11. Contrasting social and non-social sources of predictability in human mobility

Zexun Chen, Sean Kelty, Brooke Foucault Welles, James P. Bagrow, Ronaldo Menezes, Gourab Ghoshal

Social structures influence a variety of human behaviors including mobility patterns, but the extent to which one individual’s movements can predict another’s remains an open question. Further, latent information about an individual’s mobility can be present in the mobility patterns of both social and non-social ties, a distinction that has not yet been addressed. Here we develop a “colocation” network to distinguish the mobility patterns of an ego’s social ties from those of non-social colocators, individuals not socially connected to the ego but who nevertheless arrive at a location at the same time as the ego. We apply entropy and predictability measures to analyse and bound the predictive information of an individual’s mobility pattern and the flow of that information from their top social ties and from their non-social colocators. While social ties generically provide more information than non-social colocators, we find that significant information is present in the aggregation of non-social colocators: 3-7 colocators can provide as much predictive information as the top social tie, and colocators can replace up to 85% of the predictive information about an ego, compared with social ties that can replace up to 94% of the ego’s predictability. The presence of predictive information among non-social colocators raises privacy concerns: given the increasing availability of real-time mobility traces from smartphones, individuals sharing data may be providing actionable information not just about their own movements but the movements of others whose data are absent, both known and unknown individuals.
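To make the idea of applying entropy to a mobility pattern concrete, here is a minimal sketch that computes the Shannon entropy of a sequence of visited locations. The location labels and the function name are hypothetical. This frequency-based entropy ignores the order of visits and is simpler than the measures used in the paper, which the abstract does not specify in detail.

```python
from collections import Counter
from math import log2


def shannon_entropy(visits):
    """Entropy in bits of the empirical distribution of visited locations."""
    counts = Counter(visits)
    n = len(visits)
    return -sum((c / n) * log2(c / n) for c in counts.values())


# Hypothetical traces: a strict routine versus a more varied week.
routine = ["home", "work", "home", "work", "home", "work", "home", "work"]
varied = ["home", "work", "gym", "cafe", "home", "park", "work", "shop"]

print(shannon_entropy(routine))
print(shannon_entropy(varied))
```

Lower entropy means fewer bits are needed to describe where a person is, which corresponds to higher predictability; the routine trace scores lower than the varied one.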

12. secml-malware: A Python Library for Adversarial Robustness Evaluation of Windows Malware Classifiers

Luca Demetrio, Battista Biggio

  • retweets: 46, favorites: 12 (04/29/2021 12:53:38)
  • links: abs | pdf
  • cs.CR

Machine learning has been increasingly used as a first line of defense for Windows malware detection. Recent work has however shown that learning-based malware detectors can be evaded by well-crafted, adversarial manipulations of input malware, highlighting the need for tools that can ease and automate the adversarial robustness evaluation of such detectors. To this end, we present secml-malware, the first Python library for computing adversarial attacks on Windows malware detectors. secml-malware implements state-of-the-art white-box and black-box attacks on Windows malware classifiers, by leveraging a set of functionality-preserving manipulations that can be applied to Windows programs without corrupting their functionality. The library can be used to assess the adversarial robustness of Windows malware detectors, and it can be easily extended to include novel attack strategies. It is available at https://github.com/zangobot/secml_malware.