All Articles

Hot Papers 2021-01-19

1. CheXtransfer: Performance and Parameter Efficiency of ImageNet Models for Chest X-Ray Interpretation

Alexander Ke, William Ellsworth, Oishi Banerjee, Andrew Y. Ng, Pranav Rajpurkar

Deep learning methods for chest X-ray interpretation typically rely on pretrained models developed for ImageNet. This paradigm assumes that better ImageNet architectures perform better on chest X-ray tasks and that ImageNet-pretrained weights provide a performance boost over random initialization. In this work, we compare the transfer performance and parameter efficiency of 16 popular convolutional architectures on a large chest X-ray dataset (CheXpert) to investigate these assumptions. First, we find no relationship between ImageNet performance and CheXpert performance for both models without pretraining and models with pretraining. Second, we find that, for models without pretraining, the choice of model family influences performance more than size within a family for medical imaging tasks. Third, we observe that ImageNet pretraining yields a statistically significant boost in performance across architectures, with a higher boost for smaller architectures. Fourth, we examine whether ImageNet architectures are unnecessarily large for CheXpert by truncating final blocks from pretrained models, and find that we can make models 3.25x more parameter-efficient on average without a statistically significant drop in performance. Our work contributes new experimental evidence about the relation of ImageNet to chest x-ray interpretation performance.

2. A simple geometric proof for the benefit of depth in ReLU networks

Asaf Amrami, Yoav Goldberg

  • retweets: 2110, favorites: 312 (01/20/2021 22:35:30)
  • links: abs | pdf
  • cs.LG

We present a simple proof for the benefit of depth in multi-layer feedforward network with rectified activation (“depth separation”). Specifically we present a sequence of classification problems indexed by mm such that (a) for any fixed depth rectified network there exist an mm above which classifying problem mm correctly requires exponential number of parameters (in mm); and (b) for any problem in the sequence, we present a concrete neural network with linear depth (in mm) and small constant width (4\leq 4) that classifies the problem with zero error. The constructive proof is based on geometric arguments and a space folding construction. While stronger bounds and results exist, our proof uses substantially simpler tools and techniques, and should be accessible to undergraduate students in computer science and people with similar backgrounds.

3. Can a Fruit Fly Learn Word Embeddings?

Yuchen Liang, Chaitanya K. Ryali, Benjamin Hoover, Leopold Grinberg, Saket Navlakha, Mohammed J. Zaki, Dmitry Krotov

  • retweets: 1300, favorites: 172 (01/20/2021 22:35:31)
  • links: abs | pdf
  • cs.CL

The mushroom body of the fruit fly brain is one of the best studied systems in neuroscience. At its core it consists of a population of Kenyon cells, which receive inputs from multiple sensory modalities. These cells are inhibited by the anterior paired lateral neuron, thus creating a sparse high dimensional representation of the inputs. In this work we study a mathematical formalization of this network motif and apply it to learning the correlational structure between words and their context in a corpus of unstructured text, a common natural language processing (NLP) task. We show that this network can learn semantic representations of words and can generate both static and context-dependent word embeddings. Unlike conventional methods (e.g., BERT, GloVe) that use dense representations for word embedding, our algorithm encodes semantic meaning of words and their context in the form of sparse binary hash codes. The quality of the learned representations is evaluated on word similarity analysis, word-sense disambiguation, and document classification. It is shown that not only can the fruit fly network motif achieve performance comparable to existing methods in NLP, but, additionally, it uses only a fraction of the computational resources (shorter training time and smaller memory footprint).

4. The Role of Working Memory in Program Tracing

Will Crichton, Maneesh Agrawala, Pat Hanrahan

  • retweets: 1190, favorites: 155 (01/20/2021 22:35:31)
  • links: abs | pdf
  • cs.HC

Program tracing, or mentally simulating a program on concrete inputs, is an important part of general program comprehension. Programs involve many kinds of virtual state that must be held in memory, such as variable/value pairs and a call stack. In this work, we examine the influence of short-term working memory (WM) on a person’s ability to remember program state during tracing. We first confirm that previous findings in cognitive psychology transfer to the programming domain: people can keep about 7 variable/value pairs in WM, and people will accidentally swap associations between variables due to WM load. We use a restricted focus viewing interface to further analyze the strategies people use to trace through programs, and the relationship of tracing strategy to WM. Given a straight-line program, we find half of our participants traced a program from the top-down line-by-line (linearly), and the other half start at the bottom and trace upward based on data dependencies (on-demand). Participants with an on-demand strategy made more WM errors while tracing straight-line code than with a linear strategy, but the two strategies contained an equal number of WM errors when tracing code with functions. We conclude with the implications of these findings for the design of programming tools: first, programs should be analyzed to identify and refactor human-memory-intensive sections of code. Second, programming environments should interactively visualize variable metadata to reduce WM load in accordance with a person’s tracing strategy. Third, tools for program comprehension should enable externalizing program state while tracing.

5. GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, Daniel S. Weld

  • retweets: 824, favorites: 135 (01/20/2021 22:35:31)
  • links: abs | pdf
  • cs.CL | cs.AI

Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, is so far limited to tasks that can be reliably evaluated in an automatic manner. This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms asking human annotators to evaluate them on various axes (e.g., correctness, conciseness, fluency) and compares their answers to various automatic metrics. We introduce several datasets in English to GENIE, representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. We provide formal granular evaluation metrics and identify areas for future research. We make GENIE publicly available and hope that it will spur progress in language generation models as well as their automatic and manual evaluation.

6. Dissecting the Meme Magic: Understanding Indicators of Virality in Image Memes

Chen Ling, Ihab AbuHilal, Jeremy Blackburn, Emiliano De Cristofaro, Savvas Zannettou, Gianluca Stringhini

Despite the increasingly important role played by image memes, we do not yet have a solid understanding of the elements that might make a meme go viral on social media. In this paper, we investigate what visual elements distinguish image memes that are highly viral on social media from those that do not get re-shared, across three dimensions: composition, subjects, and target audience. Drawing from research in art theory, psychology, marketing, and neuroscience, we develop a codebook to characterize image memes, and use it to annotate a set of 100 image memes collected from 4chan’s Politically Incorrect Board (/pol/). On the one hand, we find that highly viral memes are more likely to use a close-up scale, contain characters, and include positive or negative emotions. On the other hand, image memes that do not present a clear subject the viewer can focus attention on, or that include long text are not likely to be re-shared by users. We train machine learning models to distinguish between image memes that are likely to go viral and those that are unlikely to be re-shared, obtaining an AUC of 0.866 on our dataset. We also show that the indicators of virality identified by our model can help characterize the most viral memes posted on mainstream online social networks too, as our classifiers are able to predict 19 out of the 20 most popular image memes posted on Twitter and Reddit between 2016 and 2018. Overall, our analysis sheds light on what indicators characterize viral and non-viral visual content online, and set the basis for developing better techniques to create or moderate content that is more likely to catch the viewer’s attention.

7. Free Lunch for Few-shot Learning: Distribution Calibration

Shuo Yang, Lu Liu, Min Xu

  • retweets: 441, favorites: 75 (01/20/2021 22:35:32)
  • links: abs | pdf
  • cs.LG | cs.CV

Learning from a limited number of samples is challenging since the learned model can easily become overfitted based on the biased distribution formed by only a few training examples. In this paper, we calibrate the distribution of these few-sample classes by transferring statistics from the classes with sufficient examples, then an adequate number of examples can be sampled from the calibrated distribution to expand the inputs to the classifier. We assume every dimension in the feature representation follows a Gaussian distribution so that the mean and the variance of the distribution can borrow from that of similar classes whose statistics are better estimated with an adequate number of samples. Our method can be built on top of off-the-shelf pretrained feature extractors and classification models without extra parameters. We show that a simple logistic regression classifier trained using the features sampled from our calibrated distribution can outperform the state-of-the-art accuracy on two datasets (~5% improvement on miniImageNet compared to the next best). The visualization of these generated features demonstrates that our calibrated distribution is an accurate estimation.

8. Trav-SHACL: Efficiently Validating Networks of SHACL Constraints

Mónica Figuera, Philipp D. Rohde, Maria-Esther Vidal

  • retweets: 128, favorites: 81 (01/20/2021 22:35:32)
  • links: abs | pdf
  • cs.DB

Knowledge graphs have emerged as expressive data structures for Web data. Knowledge graph potential and the demand for ecosystems to facilitate their creation, curation, and understanding, is testified in diverse domains, e.g., biomedicine. The Shapes Constraint Language (SHACL) is the W3C recommendation language for integrity constraints over RDF knowledge graphs. Enabling quality assements of knowledge graphs, SHACL is rapidly gaining attention in real-world scenarios. SHACL models integrity constraints as a network of shapes, where a shape contains the constraints to be fullfiled by the same entities. The validation of a SHACL shape schema can face the issue of tractability during validation. To facilitate full adoption, efficient computational methods are required. We present Trav-SHACL, a SHACL engine capable of planning the traversal and execution of a shape schema in a way that invalid entities are detected early and needless validations are minimized. Trav-SHACL reorders the shapes in a shape schema for efficient validation and rewrites target and constraint queries for the fast detection of invalid entities. Trav-SHACL is empirically evaluated on 27 testbeds executed against knowledge graphs of up to 34M triples. Our experimental results suggest that Trav-SHACL exhibits high performance gradually and reduces validation time by a factor of up to 28.93 compared to the state of the art.

9. GeoSim: Photorealistic Image Simulation with Geometry-Aware Composition

Yun Chen, Frieda Rong, Shivam Duggal, Shenlong Wang, Xinchen Yan, Sivabalan Manivasagam, Shangjie Xue, Ersin Yumer, Raquel Urtasun

Scalable sensor simulation is an important yet challenging open problem for safety-critical domains such as self-driving. Current work in image simulation either fail to be photorealistic or do not model the 3D environment and the dynamic objects within, losing high-level control and physical realism. In this paper, we present GeoSim, a geometry-aware image composition process that synthesizes novel urban driving scenes by augmenting existing images with dynamic objects extracted from other scenes and rendered at novel poses. Towards this goal, we first build a diverse bank of 3D objects with both realistic geometry and appearance from sensor data. During simulation, we perform a novel geometry-aware simulation-by-composition procedure which 1) proposes plausible and realistic object placements into a given scene, 2) renders novel views of dynamic objects from the asset bank, and 3) composes and blends the rendered image segments. The resulting synthetic images are photorealistic, traffic-aware, and geometrically consistent, allowing image simulation to scale to complex use cases. We demonstrate two such important applications: long-range realistic video simulation across multiple camera sensors, and synthetic data generation for data augmentation on downstream segmentation tasks.

10. Catching Out-of-Context Misinformation with Self-supervised Learning

Shivangi Aneja, Christoph Bregler, Matthias Nießner

  • retweets: 90, favorites: 37 (01/20/2021 22:35:32)
  • links: abs | pdf
  • cs.CV | cs.AI

Despite the recent attention to DeepFakes and other forms of image manipulations, one of the most prevalent ways to mislead audiences is the use of unaltered images in a new but false context. To address these challenges and support fact-checkers, we propose a new method that automatically detects out-of-context image and text pairs. Our core idea is a self-supervised training strategy where we only need images with matching (and non-matching) captions from different sources. At train time, our method learns to selectively align individual objects in an image with textual claims, without explicit supervision. At test time, we check for a given text pair if both texts correspond to same object(s) in the image but semantically convey different descriptions, which allows us to make fairly accurate out-of-context predictions. Our method achieves 82% out-of-context detection accuracy. To facilitate training our method, we created a large-scale dataset of 203,570 images which we match with 456,305 textual captions from a variety of news websites, blogs, and social media posts; i.e., for each image, we obtained several captions.

11. PLUME: Efficient 3D Object Detection from Stereo Images

Yan Wang, Bin Yang, Rui Hu, Ming Liang, Raquel Urtasun

  • retweets: 92, favorites: 28 (01/20/2021 22:35:32)
  • links: abs | pdf
  • cs.CV

3D object detection plays a significant role in various robotic applications including self-driving. While many approaches rely on expensive 3D sensors like LiDAR to produce accurate 3D estimates, stereo-based methods have recently shown promising results at a lower cost. Existing methods tackle the problem in two steps: first depth estimation is performed, a pseudo LiDAR point cloud representation is computed from the depth estimates, and then object detection is performed in 3D space. However, because the two separate tasks are optimized in different metric spaces, the depth estimation is biased towards big objects and may cause sub-optimal performance of 3D detection. In this paper we propose a model that unifies these two tasks in the same metric space for the first time. Specifically, our model directly constructs a pseudo LiDAR feature volume (PLUME) in 3D space, which is used to solve both occupancy estimation and object detection tasks. PLUME achieves state-of-the-art performance on the challenging KITTI benchmark, with significantly reduced inference time compared with existing methods.

12. Hierarchical disentangled representation learning for singing voice conversion

Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

Conventional singing voice conversion (SVC) methods often suffer from operating in high-resolution audio owing to a high dimensionality of data. In this paper, we propose a hierarchical representation learning that enables the learning of disentangled representations with multiple resolutions independently. With the learned disentangled representations, the proposed method progressively performs SVC from low to high resolutions. Experimental results show that the proposed method outperforms baselines that operate with a single resolution in terms of mean opinion score (MOS), similarity score, and pitch accuracy.

13. Deterministic Decremental SSSP and Approximate Min-Cost Flow in Almost-Linear Time

Aaron Bernstein, Maximilian Probst Gutenberg, Thatchaphol Saranurak

  • retweets: 49, favorites: 66 (01/20/2021 22:35:33)
  • links: abs | pdf
  • cs.DS

In the decremental single-source shortest paths problem, the goal is to maintain distances from a fixed source ss to every vertex vv in an mm-edge graph undergoing edge deletions. In this paper, we conclude a long line of research on this problem by showing a near-optimal deterministic data structure that maintains (1+ϵ)(1+\epsilon)-approximate distance estimates and runs in m1+o(1)m^{1+o(1)} total update time. Our result, in particular, removes the oblivious adversary assumption required by the previous breakthrough result by Henzinger et al. [FOCS’14], which leads to our second result: the first almost-linear time algorithm for (1ϵ)(1-\epsilon)-approximate min-cost flow in undirected graphs where capacities and costs can be taken over edges and vertices. Previously, algorithms for max flow with vertex capacities, or min-cost flow with any capacities required super-linear time. Our result essentially completes the picture for approximate flow in undirected graphs. The key technique of the first result is a novel framework that allows us to treat low-diameter graphs like expanders. This allows us to harness expander properties while bypassing shortcomings of expander decomposition, which almost all previous expander-based algorithms needed to deal with. For the second result, we break the notorious flow-decomposition barrier from the multiplicative-weight-update framework using randomization.

14. What Makes Good In-Context Examples for GPT-33?

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, Weizhu Chen

  • retweets: 56, favorites: 29 (01/20/2021 22:35:33)
  • links: abs | pdf
  • cs.CL

GPT-33 has attracted lots of attention due to its superior performance across a wide range of NLP tasks, especially with its powerful and versatile in-context few-shot learning ability. Despite its success, we found that the empirical results of GPT-33 depend heavily on the choice of in-context examples. In this work, we investigate whether there are more effective strategies for judiciously selecting in-context examples (relative to random sampling) that better leverage GPT-33’s few-shot capabilities. Inspired by the recent success of leveraging a retrieval module to augment large-scale neural network models, we propose to retrieve examples that are semantically-similar to a test sample to formulate its corresponding prompt. Intuitively, the in-context examples selected with such a strategy may serve as more informative inputs to unleash GPT-33’s extensive knowledge. We evaluate the proposed approach on several natural language understanding and generation benchmarks, where the retrieval-based prompt selection approach consistently outperforms the random baseline. Moreover, it is observed that the sentence encoders fine-tuned on task-related datasets yield even more helpful retrieval results. Notably, significant gains are observed on tasks such as table-to-text generation (41.9% on the ToTTo dataset) and open-domain question answering (45.5% on the NQ dataset). We hope our investigation could help understand the behaviors of GPT-33 and large-scale pre-trained LMs in general and enhance their few-shot capabilities.

15. S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling

Ze Yang, Shenlong Wang, Sivabalan Manivasagam, Zeng Huang, Wei-Chiu Ma, Xinchen Yan, Ersin Yumer, Raquel Urtasun

  • retweets: 32, favorites: 44 (01/20/2021 22:35:33)
  • links: abs | pdf
  • cs.CV

Constructing and animating humans is an important component for building virtual worlds in a wide variety of applications such as virtual reality or robotics testing in simulation. As there are exponentially many variations of humans with different shape, pose and clothing, it is critical to develop methods that can automatically reconstruct and animate humans at scale from real world data. Towards this goal, we represent the pedestrian’s shape, pose and skinning weights as neural implicit functions that are directly learned from data. This representation enables us to handle a wide variety of different pedestrian shapes and poses without explicitly fitting a human parametric body model, allowing us to handle a wider range of human geometries and topologies. We demonstrate the effectiveness of our approach on various datasets and show that our reconstructions outperform existing state-of-the-art methods. Furthermore, our re-animation experiments show that we can generate 3D human animations at scale from a single RGB image (and/or an optional LiDAR sweep) as input.

16. Learning by Watching: Physical Imitation of Manipulation Skills from Human Videos

Haoyu Xiong, Quanzhou Li, Yun-Chun Chen, Homanga Bharadhwaj, Samarth Sinha, Animesh Garg

We present an approach for physical imitation from human videos for robot manipulation tasks. The key idea of our method lies in explicitly exploiting the kinematics and motion information embedded in the video to learn structured representations that endow the robot with the ability to imagine how to perform manipulation tasks in its own context. To achieve this, we design a perception module that learns to translate human videos to the robot domain followed by unsupervised keypoint detection. The resulting keypoint-based representations provide semantically meaningful information that can be directly used for reward computing and policy learning. We evaluate the effectiveness of our approach on five robot manipulation tasks, including reaching, pushing, sliding, coffee making, and drawer closing. Detailed experimental evaluations demonstrate that our method performs favorably against previous approaches.

17. Hierarchical Reinforcement Learning By Discovering Intrinsic Options

Jesse Zhang, Haonan Yu, Wei Xu

  • retweets: 32, favorites: 27 (01/20/2021 22:35:33)
  • links: abs | pdf
  • cs.LG

We propose a hierarchical reinforcement learning method, HIDIO, that can learn task-agnostic options in a self-supervised manner while jointly learning to utilize them to solve sparse-reward tasks. Unlike current hierarchical RL approaches that tend to formulate goal-reaching low-level tasks or pre-define ad hoc lower-level policies, HIDIO encourages lower-level option learning that is independent of the task at hand, requiring few assumptions or little knowledge about the task structure. These options are learned through an intrinsic entropy minimization objective conditioned on the option sub-trajectories. The learned options are diverse and task-agnostic. In experiments on sparse-reward robotic manipulation and navigation tasks, HIDIO achieves higher success rates with greater sample efficiency than regular RL baselines and two state-of-the-art hierarchical RL methods.

18. VideoClick: Video Object Segmentation with a Single Click

Namdar Homayounfar, Justin Liang, Wei-Chiu Ma, Raquel Urtasun

  • retweets: 30, favorites: 27 (01/20/2021 22:35:33)
  • links: abs | pdf
  • cs.CV

Annotating videos with object segmentation masks typically involves a two stage procedure of drawing polygons per object instance for all the frames and then linking them through time. While simple, this is a very tedious, time consuming and expensive process, making the creation of accurate annotations at scale only possible for well-funded labs. What if we were able to segment an object in the full video with only a single click? This will enable video segmentation at scale with a very low budget opening the door to many applications. Towards this goal, in this paper we propose a bottom up approach where given a single click for each object in a video, we obtain the segmentation masks of these objects in the full video. In particular, we construct a correlation volume that assigns each pixel in a target frame to either one of the objects in the reference frame or the background. We then refine this correlation volume via a recurrent attention module and decode the final segmentation. To evaluate the performance, we label the popular and challenging Cityscapes dataset with video object segmentations. Results on this new CityscapesVideo dataset show that our approach outperforms all the baselines in this challenging setting.