Hot Papers 2021-04-14

1. Semantic Segmentation with Generative Models: Semi-Supervised Learning and Strong Out-of-Domain Generalization

Daiqing Li, Junlin Yang, Karsten Kreis, Antonio Torralba, Sanja Fidler

Training deep networks with limited labeled data while achieving a strong generalization ability is key in the quest to reduce human annotation efforts. This is the goal of semi-supervised learning, which exploits more widely available unlabeled data to complement small labeled data sets. In this paper, we propose a novel framework for discriminative pixel-level tasks using a generative model of both images and labels. Concretely, we learn a generative adversarial network that captures the joint image-label distribution and is trained efficiently using a large set of unlabeled images supplemented with only a few labeled ones. We build our architecture on top of StyleGAN2, augmented with a label synthesis branch. Image labeling at test time is achieved by first embedding the target image into the joint latent space via an encoder network and test-time optimization, and then generating the label from the inferred embedding. We evaluate our approach in two important domains: medical image segmentation and part-based face segmentation. We demonstrate strong in-domain performance compared to several baselines, and are the first to showcase extreme out-of-domain generalization, such as transferring from CT to MRI in medical imaging, and from photographs of real faces to paintings, sculptures, and even cartoons and animal faces. Project page: https://nv-tlabs.github.io/semanticGAN/
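
The test-time labeling pipeline lends itself to a short sketch. Below is a minimal PyTorch version of the embed-then-optimize step described above, assuming hypothetical `generator` (latent to image and label) and `encoder` (image to latent) callables standing in for the StyleGAN2-based models; it illustrates the idea, not the authors' implementation.

```python
import torch

def label_image(generator, encoder, image, steps=200, lr=0.01):
    """Embed `image` into the joint latent space, refine by optimization,
    then read the label off the generator's label-synthesis branch.
    image: (1, 3, H, W) tensor; returns a per-pixel class map."""
    w = encoder(image).detach().requires_grad_(True)   # initial embedding
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):                             # test-time optimization
        recon, _ = generator(w)                        # joint (image, label) generator
        loss = torch.nn.functional.mse_loss(recon, image)  # a perceptual loss would be added in practice
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        _, label = generator(w)                        # label from the inferred latent
    return label.argmax(dim=1)
```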

2. BARF: Bundle-Adjusting Neural Radiance Fields

Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, Simon Lucey

Neural Radiance Fields (NeRF) have recently gained a surge of interest within the computer vision community for their power to synthesize photorealistic novel views of real-world scenes. One limitation of NeRF, however, is the requirement of accurate camera poses to learn the scene representations. In this paper, we propose Bundle-Adjusting Neural Radiance Fields (BARF) for training NeRF from imperfect (or even unknown) camera poses: the joint problem of learning neural 3D representations and registering camera frames. We establish a theoretical connection to classical image alignment and show that coarse-to-fine registration is also applicable to NeRF. Furthermore, we show that naïvely applying positional encoding in NeRF has a negative impact on registration with a synthesis-based objective. Experiments on synthetic and real-world data show that BARF can effectively optimize the neural scene representations and resolve large camera pose misalignment at the same time. This enables view synthesis and localization of video sequences from unknown camera poses, opening up new avenues for visual localization systems (e.g. SLAM) and potential applications for dense 3D mapping and reconstruction.
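
The coarse-to-fine registration boils down to annealing how much of the positional encoding the model sees. A small sketch of that schedule, assuming the commonly used smooth cosine window (the exact window in the paper may differ in detail):

```python
import math
import torch

def coarse_to_fine_encoding(x, num_freqs, alpha):
    """Positional encoding whose k-th frequency band is gated by a smooth
    window; `alpha` is annealed from 0 to `num_freqs` during training so
    registration starts from low frequencies. x: (..., D) coordinates."""
    feats = []
    for k in range(num_freqs):
        t = min(max(alpha - k, 0.0), 1.0)          # band k activation in [0, 1]
        w = 0.5 * (1.0 - math.cos(t * math.pi))    # smooth on-ramp for the band
        freq = (2.0 ** k) * math.pi
        feats += [w * torch.sin(freq * x), w * torch.cos(freq * x)]
    return torch.cat(feats, dim=-1)
```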

3. What’s in your Head? Emergent Behaviour in Multi-Task Transformer Models

Mor Geva, Uri Katz, Aviv Ben-Arie, Jonathan Berant

  • retweets: 749, favorites: 103 (04/15/2021 07:28:16)
  • links: abs | pdf
  • cs.CL

The primary paradigm for multi-task training in natural language processing is to represent the input with a shared pre-trained language model and add a small, thin network (head) per task. Given an input, the target head is the head selected to output the final prediction. In this work, we examine the behaviour of non-target heads, that is, the output of heads given input that belongs to a different task than the one they were trained for. We find that non-target heads exhibit emergent behaviour, which may either explain the target task or generalize beyond their original task. For example, in a numerical reasoning task, a span-extraction head extracts from the input the arguments of a computation whose result a target generative head produces. In addition, a summarization head trained with a target question-answering head outputs query-based summaries when given a question and a context from which the answer is to be extracted. This emergent behaviour suggests that multi-task training leads to non-trivial extrapolation of skills, which can be harnessed for interpretability and generalization.
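
The shared-encoder-plus-heads setup, and the probing of a non-target head, fit in a few lines. A toy PyTorch sketch with illustrative names (not the paper's code):

```python
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """One shared pre-trained encoder, one small linear head per task."""
    def __init__(self, encoder, hidden_dim, labels_per_task):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_dim, n) for task, n in labels_per_task.items()})

    def forward(self, inputs, task):
        h = self.encoder(inputs)      # shared representation
        return self.heads[task](h)    # any head, target or not, can be selected

# Probing emergent behaviour: route a question-answering input through the
# summarization head instead of its target head.
# summary_logits = model(qa_inputs, task="summarization")
```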

4. Self-supervised object detection from audio-visual correspondence

Triantafyllos Afouras, Yuki M. Asano, Francois Fagan, Andrea Vedaldi, Florian Metze

  • retweets: 702, favorites: 112 (04/15/2021 07:28:17)
  • links: abs | pdf
  • cs.CV

We tackle the problem of learning object detectors without supervision. Unlike weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to “teach” the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify the objects by type, enumerate each instance of the object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localise objects. Then, without using any supervision, we simply use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly-supervised detectors for the task of object detection and sound source localisation. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and show how our method can learn to detect generic objects that go beyond instruments, such as airplanes and cats.
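
The contrastive objective pairing each clip's visual features with its own audio is the core of the first stage. A generic symmetric InfoNCE sketch in that spirit (not the authors' exact formulation):

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Each video embedding should match its own audio track and not the
    other tracks in the batch. video_emb, audio_emb: (B, D)."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                  # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```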

5. Online and Offline Reinforcement Learning by Planning with a Learned Model

Julian Schrittwieser, Thomas Hubert, Amol Mandhane, Mohammadamin Barekatain, Ioannis Antonoglou, David Silver

  • retweets: 317, favorites: 113 (04/15/2021 07:28:17)
  • links: abs | pdf
  • cs.LG

Learning efficiently from small amounts of data has long been the focus of model-based reinforcement learning, both for the online case when interacting with the environment and the offline case when learning from a fixed dataset. However, to date no single unified algorithm has demonstrated state-of-the-art results in both settings. In this work, we describe the Reanalyse algorithm, which uses model-based policy and value improvement operators to compute new improved training targets on existing data points, allowing efficient learning for data budgets varying by several orders of magnitude. We further show that Reanalyse can also be used to learn entirely from demonstrations without any environment interactions, as in the case of offline reinforcement learning (offline RL). Combining Reanalyse with the MuZero algorithm, we introduce MuZero Unplugged, a single unified algorithm for any data budget, including offline RL. In contrast to previous work, our algorithm does not require any special adaptations for the off-policy or offline RL settings. MuZero Unplugged sets new state-of-the-art results in the RL Unplugged offline RL benchmark as well as in the online RL benchmark of Atari in the standard 200 million frame setting.
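
The core loop of Reanalyse is easy to sketch: keep re-running the learned model's improvement operators over old data to mint fresh targets. All names below are illustrative stand-ins, not the MuZero codebase:

```python
def reanalyse(replay_buffer, model, mcts, num_updates):
    """Refresh policy/value targets on stored trajectories using the
    current model, so a fixed dataset keeps yielding new training signal."""
    for _ in range(num_updates):
        trajectory = replay_buffer.sample()
        for step in trajectory:
            # model-based policy and value improvement on an *old* data point
            root = mcts.run(model, step.observation)
            step.policy_target = root.visit_count_distribution()
            step.value_target = root.value()
        replay_buffer.update(trajectory)
```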

6. Learning and Planning in Complex Action Spaces

Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Mohammadamin Barekatain, Simon Schmitt, David Silver

  • retweets: 226, favorites: 101 (04/15/2021 07:28:17)
  • links: abs | pdf
  • cs.LG

Many important real-world problems have action spaces that are high-dimensional, continuous or both, making full enumeration of all possible actions infeasible. Instead, only small subsets of actions can be sampled for the purpose of policy evaluation and improvement. In this paper, we propose a general framework to reason in a principled way about policy evaluation and improvement over such sampled action subsets. This sample-based policy iteration framework can in principle be applied to any reinforcement learning algorithm based upon policy iteration. Concretely, we propose Sampled MuZero, an extension of the MuZero algorithm that is able to learn in domains with arbitrarily complex action spaces by planning over sampled actions. We demonstrate this approach on the classical board game of Go and on two continuous control benchmark domains: DeepMind Control Suite and Real-World RL Suite.
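
A rough sketch of the sample-based policy improvement step, using NumPy and a softmax-over-Q improvement on the sampled subset for illustration (the paper's operator, with its sampling corrections, is more careful):

```python
import numpy as np

def sampled_improvement(prior_probs, q_values, num_samples=20):
    """Draw a small action subset from the current policy, evaluate only
    those actions, and return an improved policy supported on the sample.
    prior_probs: (A,) probabilities; q_values: (A,) action values."""
    sampled = np.unique(
        np.random.choice(len(prior_probs), size=num_samples, p=prior_probs))
    q = q_values[sampled]
    improved = np.exp(q - q.max())       # softmax over Q on the subset only
    improved /= improved.sum()
    return sampled, improved
```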

7. Podracer architectures for scalable Reinforcement Learning

Matteo Hessel, Manuel Kroiss, Aidan Clark, Iurii Kemaev, John Quan, Thomas Keck, Fabio Viola, Hado van Hasselt

  • retweets: 174, favorites: 147 (04/15/2021 07:28:18)
  • links: abs | pdf
  • cs.LG

Supporting state-of-the-art AI research requires balancing rapid prototyping, ease of use, and quick iteration, with the ability to deploy experiments at a scale traditionally associated with production systems. Deep learning frameworks such as TensorFlow, PyTorch and JAX allow users to transparently make use of accelerators, such as TPUs and GPUs, to offload the more computationally intensive parts of training and inference in modern deep learning systems. Popular training pipelines that use these frameworks for deep learning typically focus on (un-)supervised learning. How to best train reinforcement learning (RL) agents at scale is still an active research area. In this report we argue that TPUs are particularly well suited for training RL agents in a scalable, efficient and reproducible way. Specifically, we describe two architectures designed to make the best use of the resources available on a TPU Pod (a special configuration in a Google data center that features multiple TPU devices connected to each other by extremely low-latency communication channels).
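
As a very rough single-host analogue of the decoupled actor/learner data flow these architectures scale up, here is a toy Python sketch (hypothetical `agent` and `env_fn` objects; the real systems shard actors and learners across the cores of a TPU Pod):

```python
import queue
import threading

def actor_learner_loop(env_fn, agent, batch_size, learner_steps):
    """Actor thread fills a queue with transitions while the learner
    consumes batches; only the data flow is modeled here."""
    transitions = queue.Queue(maxsize=2 * batch_size)

    def actor():
        env = env_fn()
        obs = env.reset()
        while True:
            action = agent.act(obs)
            obs, reward, done, _ = env.step(action)
            transitions.put((obs, action, reward, done))
            if done:
                obs = env.reset()

    threading.Thread(target=actor, daemon=True).start()
    for _ in range(learner_steps):
        batch = [transitions.get() for _ in range(batch_size)]
        agent.update(batch)              # learner step on a fresh batch
```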

8. Muesli: Combining Improvements in Policy Optimization

Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, Hado van Hasselt

  • retweets: 175, favorites: 86 (04/15/2021 07:28:18)
  • links: abs | pdf
  • cs.LG | cs.AI

We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero’s state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.
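
A simplified sketch of a clipped, advantage-weighted regularized update in this spirit: the policy is pulled toward a target that reweights a prior policy by the exponentiated, clipped advantage. This is one reading of the general recipe, not the paper's exact loss (which also includes the model-learning auxiliary terms):

```python
import torch
import torch.nn.functional as F

def regularized_policy_loss(logits, prior_logits, advantages, clip=1.0):
    """logits, prior_logits: (B, A); advantages: (B, A) estimated advantages."""
    with torch.no_grad():
        # target ∝ prior * exp(clipped advantage); add in log space, then softmax
        target = F.softmax(prior_logits + torch.clamp(advantages, -clip, clip), dim=-1)
    log_pi = F.log_softmax(logits, dim=-1)
    return -(target * log_pi).sum(dim=-1).mean()   # cross-entropy to the target
```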

9. DropLoss for Long-Tail Instance Segmentation

Ting-I Hsieh, Esther Robb, Hwann-Tzong Chen, Jia-Bin Huang

  • retweets: 72, favorites: 77 (04/15/2021 07:28:18)
  • links: abs | pdf
  • cs.CV

Long-tailed class distributions are prevalent among the practical applications of object detection and instance segmentation. Prior work in long-tail instance segmentation addresses the imbalance of losses between rare and frequent categories by reducing the penalty for a model incorrectly predicting a rare class label. We demonstrate that the rare categories are heavily suppressed by correct background predictions, which reduces the probability for all foreground categories with equal weight. Due to the relative infrequency of rare categories, this leads to an imbalance that biases towards predicting more frequent categories. Based on this insight, we develop DropLoss — a novel adaptive loss to compensate for this imbalance without a trade-off between rare and frequent categories. With this loss, we show state-of-the-art mAP across rare, common, and frequent categories on the LVIS dataset.
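
A simplified sketch of the idea: with a per-class sigmoid classification loss, background proposals simply stop penalizing rare categories. The paper derives when to drop from per-category frequency; here a fixed rare-category mask stands in:

```python
import torch
import torch.nn.functional as F

def drop_loss(class_logits, targets, is_background, rare_mask):
    """class_logits, targets: (N, C); is_background: (N,) bool marking
    proposals matched to background; rare_mask: (C,) bool for rare classes."""
    loss = F.binary_cross_entropy_with_logits(class_logits, targets,
                                              reduction="none")
    keep = torch.ones_like(loss)
    keep[is_background] = (~rare_mask).float()   # background does not suppress rare classes
    return (loss * keep).sum() / keep.sum().clamp(min=1.0)
```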

10. Pointly-Supervised Instance Segmentation

Bowen Cheng, Omkar Parkhi, Alexander Kirillov

  • retweets: 43, favorites: 50 (04/15/2021 07:28:18)
  • links: abs | pdf
  • cs.CV

We propose point-based instance-level annotation, a new form of weak supervision for instance segmentation. It combines the standard bounding box annotation with labeled points that are uniformly sampled inside each bounding box. We show that existing instance segmentation models developed for full mask supervision, like Mask R-CNN, can be seamlessly trained with the point-based annotation without any major modifications. In our experiments, Mask R-CNN models trained on COCO, PASCAL VOC, Cityscapes, and LVIS with only 10 annotated points per object achieve 94%-98% of their fully-supervised performance. The new point-based annotation is approximately 5 times faster to collect than object masks, making high-quality instance segmentation more accessible for new data. Inspired by the new annotation form, we propose a modification to the PointRend instance segmentation module. For each object, the new architecture, called Implicit PointRend, generates parameters for a function that makes the final point-level mask prediction. Implicit PointRend is more straightforward and uses a single point-level mask loss. Our experiments show that the new module is more suitable for the proposed point-based supervision.
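
The annotation and the training signal are both simple to sketch: sample points uniformly inside each box, then supervise the mask head only at those points (illustrative PyTorch, not the authors' code):

```python
import torch

def sample_points_in_box(box, num_points=10):
    """box = (x0, y0, x1, y1); returns (num_points, 2) uniform samples."""
    x0, y0, x1, y1 = box
    xs = x0 + (x1 - x0) * torch.rand(num_points)
    ys = y0 + (y1 - y0) * torch.rand(num_points)
    return torch.stack([xs, ys], dim=-1)

def point_supervised_mask_loss(mask_logits_at_points, point_labels):
    """Binary loss at the labeled points only, instead of on a full mask."""
    return torch.nn.functional.binary_cross_entropy_with_logits(
        mask_logits_at_points, point_labels.float())
```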

11. A Replication Study of Dense Passage Retriever

Xueguang Ma, Kai Sun, Ronak Pradeep, Jimmy Lin

  • retweets: 30, favorites: 58 (04/15/2021 07:28:19)
  • links: abs | pdf
  • cs.CL | cs.IR

Text retrieval using learned dense representations has recently emerged as a promising alternative to “traditional” text retrieval using sparse bag-of-words representations. One recent work that has garnered much attention is the dense passage retriever (DPR) technique proposed by Karpukhin et al. (2020) for end-to-end open-domain question answering. We present a replication study of this work, starting with model checkpoints provided by the authors, but otherwise from an independent implementation in our group’s Pyserini IR toolkit and PyGaggle neural text ranking library. Although our experimental results largely verify the claims of the original paper, we arrived at two important additional findings that contribute to a better understanding of DPR: First, it appears that the original authors under-report the effectiveness of the BM25 baseline and hence also dense-sparse hybrid retrieval results. Second, by incorporating evidence from the retriever and an improved answer span scoring technique, we are able to improve end-to-end question answering effectiveness using exactly the same models as in the original work.
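
Dense-sparse hybrid retrieval of the kind studied here usually reduces to a weighted fusion of per-document scores. A generic sketch with an assumed additive weighting (the paper evaluates this combination much more carefully):

```python
def hybrid_scores(dense_scores, sparse_scores, alpha=1.0):
    """dense_scores, sparse_scores: dicts mapping doc id -> score
    (e.g. DPR inner product and BM25). Returns fused scores."""
    docs = set(dense_scores) | set(sparse_scores)
    return {d: dense_scores.get(d, 0.0) + alpha * sparse_scores.get(d, 0.0)
            for d in docs}
```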

12. RECON: Rapid Exploration for Open-World Navigation with Latent Goal Models

Dhruv Shah, Benjamin Eysenbach, Gregory Kahn, Nicholas Rhinehart, Sergey Levine

We describe a robotic learning system for autonomous navigation in diverse environments. At the core of our method are two components: (i) a non-parametric map that reflects the connectivity of the environment but does not require geometric reconstruction or localization, and (ii) a latent variable model of distances and actions that enables efficiently constructing and traversing this map. The model is trained on a large dataset of prior experience to predict the expected amount of time and next action needed to transit between the current image and a goal image. Training the model in this way enables it to develop a representation of goals robust to distracting information in the input images, which aids in deploying the system to quickly explore new environments. We demonstrate our method on a mobile ground robot in a range of outdoor navigation scenarios. Our method can learn to reach new goals, specified as images, in a radius of up to 80 meters in just 20 minutes, and reliably revisit these goals in changing environments. We also demonstrate our method’s robustness to previously unseen obstacles and variable weather conditions. We encourage the reader to visit the project website for videos of our experiments and demonstrations: https://sites.google.com/view/recon-robot
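
The latent goal model's interface, as described, can be sketched in a few lines of PyTorch; the architecture below is illustrative, not the authors' exact model:

```python
import torch.nn as nn

class GoalModel(nn.Module):
    """Embed (current image, goal image) into a latent, then predict
    time-to-goal and the next action from that latent."""
    def __init__(self, encoder, latent_dim, action_dim):
        super().__init__()
        self.encoder = encoder                       # image pair -> latent
        self.distance_head = nn.Linear(latent_dim, 1)
        self.action_head = nn.Linear(latent_dim, action_dim)

    def forward(self, current_image, goal_image):
        z = self.encoder(current_image, goal_image)
        return self.distance_head(z), self.action_head(z)
```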

13. Paragraph-level Simplification of Medical Texts

Ashwin Devaraj, Iain J. Marshall, Byron C. Wallace, Junyi Jessy Li

  • retweets: 32, favorites: 22 (04/15/2021 07:28:19)
  • links: abs | pdf
  • cs.CL

We consider the problem of learning to simplify medical texts. This is important because most reliable, up-to-date information in biomedicine is dense with jargon and thus practically inaccessible to the lay audience. Furthermore, manual simplification does not scale to the rapidly growing body of biomedical literature, motivating the need for automated approaches. Unfortunately, there are no large-scale resources available for this task. In this work we introduce a new corpus of parallel texts in English comprising technical and lay summaries of all published evidence pertaining to different clinical topics. We then propose a new metric based on likelihood scores from a masked language model pretrained on scientific texts. We show that this automated measure better differentiates between technical and lay summaries than existing heuristics. We introduce and evaluate baseline encoder-decoder Transformer models for simplification and propose a novel augmentation to these in which we explicitly penalize the decoder for producing “jargon” terms; we find that this yields improvements over baselines in terms of readability.
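
The proposed metric is built on masked-LM likelihoods, which suggests a pseudo-log-likelihood sketch: mask each token in turn and average the model's log-probability of the original token. The checkpoint below is one example of a science-pretrained MLM, not necessarily the one the paper uses:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def masked_lm_score(text, model_name="allenai/scibert_scivocab_uncased"):
    """Average per-token pseudo-log-likelihood of `text` under a masked LM;
    jargon-heavy text should score higher under a science-pretrained model
    than lay text does, and vice versa for a general-domain model."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt")["input_ids"][0]
    log_probs = []
    with torch.no_grad():
        for i in range(1, len(ids) - 1):             # skip [CLS]/[SEP]
            masked = ids.clone()
            masked[i] = tok.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs.append(torch.log_softmax(logits, dim=-1)[ids[i]].item())
    return sum(log_probs) / len(log_probs)
```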

14. Co-Scale Conv-Attentional Image Transformers

Weijian Xu, Yifan Xu, Tyler Chang, Zhuowen Tu

In this paper, we present Co-scale conv-attentional image Transformers (CoaT), a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers’ encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other; we design a series of serial and parallel blocks to realize the co-scale attention mechanism. Second, we devise a conv-attentional mechanism by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities. On ImageNet, relatively small CoaT models attain superior classification results compared with similar-sized convolutional neural networks and image/vision Transformers. The effectiveness of CoaT’s backbone is also illustrated on object detection and instance segmentation, demonstrating its applicability to downstream computer vision tasks.
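
The factorized attention that the conv-attentional module builds on avoids the full N×N attention map by aggregating values with a key softmax first. A minimal sketch (the convolutional relative-position term is omitted):

```python
import torch.nn.functional as F

def factorized_attention(q, k, v):
    """q, k, v: (B, N, D). Linear in N: softmax over tokens on the keys,
    aggregate values into a (D, D) context, then project with queries."""
    k = F.softmax(k, dim=1)                 # softmax along the token dimension
    context = k.transpose(1, 2) @ v         # (B, D, D) key-aggregated values
    return (q @ context) / (q.size(-1) ** 0.5)
```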