All Articles

Hot Papers 2021-04-27

1. Deep Probabilistic Graphical Modeling

Adji B. Dieng

As one of the most commonly ordered imaging tests, a computed tomography (CT) scan comes with inevitable radiation exposure that increases patients' cancer risk. However, CT image quality is directly related to radiation dose, so it is desirable to obtain high-quality CT images with as little dose as possible. CT image denoising aims to obtain high-dose-like, high-quality CT images (domain X) from low-dose, low-quality CT images (domain Y), which can be treated as an image-to-image translation task where the goal is to learn the transform between a source domain Y (noisy images) and a target domain X (clean images). In this paper, we propose a multi-cycle-consistent adversarial network (MCCAN) that builds intermediate domains and enforces both local and global cycle-consistency for edge denoising of CT images. The global cycle-consistency couples all generators together to model the whole denoising process, while the local cycle-consistency imposes effective supervision on the process between adjacent domains. Experiments show that both local and global cycle-consistency are important for the success of MCCAN, which outperforms CCADN in terms of denoising quality with slightly lower computational resource consumption.
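
To make the local and global cycle-consistency terms concrete, here is a minimal PyTorch sketch under assumed settings (toy generators, three domains, unit loss weights); it illustrates the general idea, not the MCCAN implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch (not the paper's code): a chain of small generators maps
# the noisy domain Y through intermediate domains to the clean domain X, with
# L1 cycle-consistency enforced between adjacent domains (local) and across
# the whole chain (global).

def small_generator():
    return nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 1, 3, padding=1))

n_domains = 3  # noisy -> intermediate -> clean; the number is an assumption
fwd = nn.ModuleList([small_generator() for _ in range(n_domains - 1)])  # Y -> X direction
bwd = nn.ModuleList([small_generator() for _ in range(n_domains - 1)])  # X -> Y direction
l1 = nn.L1Loss()

def cycle_losses(y_noisy):
    feats = [y_noisy]
    for g in fwd:                        # forward through intermediate domains
        feats.append(g(feats[-1]))
    recon = feats[-1]
    for g in reversed(bwd):              # map the cleanest estimate all the way back
        recon = g(recon)
    # Local cycle-consistency: each adjacent pair must reconstruct its input.
    local = sum(l1(bwd[i](fwd[i](feats[i])), feats[i]) for i in range(len(fwd)))
    # Global cycle-consistency: the full chain must return to the original input.
    global_ = l1(recon, y_noisy)
    return local, global_

local, global_ = cycle_losses(torch.randn(2, 1, 64, 64))
total_cycle_loss = local + global_       # combined with adversarial losses in practice
```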

2. EigenGAN: Layer-Wise Eigen-Learning for GANs

Zhenliang He, Meina Kan, Shiguang Shan

Recent studies on Generative Adversarial Networks (GANs) reveal that different layers of a generative CNN hold different semantics of the synthesized images. However, few GAN models have explicit dimensions to control the semantic attributes represented in a specific layer. This paper proposes EigenGAN, which is able to mine interpretable and controllable dimensions from different generator layers in an unsupervised manner. Specifically, EigenGAN embeds one linear subspace with an orthogonal basis into each generator layer. Via adversarial training to learn a target distribution, these layer-wise subspaces automatically discover a set of “eigen-dimensions” at each layer corresponding to a set of semantic attributes or interpretable variations. By traversing the coefficient of a specific eigen-dimension, the generator can produce samples with continuous changes corresponding to a specific semantic attribute. Taking the human face as an example, EigenGAN can discover controllable dimensions for high-level concepts such as pose and gender in the subspaces of deep layers, as well as low-level concepts such as hue and color in the subspaces of shallow layers. Moreover, in the linear case, we theoretically prove that our algorithm derives the principal components as PCA does. Code is available at https://github.com/LynnHo/EigenGAN-Tensorflow.
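
The layer-wise subspace can be pictured with a short sketch. The PyTorch snippet below is a hypothetical illustration of one such subspace layer (dimension sizes, initialization, and the orthogonality regularizer are assumptions, not the released EigenGAN code): the injection U diag(l) z + mu is added to a generator layer's feature map, and traversing one coordinate of z traverses one learned eigen-dimension.

```python
import torch
import torch.nn as nn

class SubspaceLayer(nn.Module):
    """Hypothetical sketch of a layer-wise linear subspace: given per-layer
    latent coefficients z, produce an injection U @ diag(l) @ z + mu, with the
    basis U pushed toward orthogonality by a regularizer."""

    def __init__(self, latent_dim: int, feat_dim: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(feat_dim, latent_dim) * 0.02)  # basis vectors
        self.l = nn.Parameter(torch.ones(latent_dim))                    # importance of each eigen-dimension
        self.mu = nn.Parameter(torch.zeros(feat_dim))                    # subspace origin

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, latent_dim); changing one coordinate of z moves along one
        # learned "eigen-dimension" of this layer.
        return z * self.l @ self.U.t() + self.mu

    def orthogonality_penalty(self) -> torch.Tensor:
        utu = self.U.t() @ self.U
        eye = torch.eye(utu.shape[0], device=utu.device)
        return ((utu - eye) ** 2).sum()

layer = SubspaceLayer(latent_dim=6, feat_dim=256)
z = torch.randn(4, 6)
injection = layer(z)                 # added to the layer's feature map in a generator
reg = layer.orthogonality_penalty()  # keeps the learned basis (near-)orthogonal
```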

3. MDETR — Modulated Detection for End-to-End Multi-Modal Understanding

Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, Nicolas Carion

Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We pre-train the network on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. We then fine-tune on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation, achieving state-of-the-art results on popular benchmarks. We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are available at https://github.com/ashkamath/mdetr.
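
The early-fusion idea can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (projection sizes and the shared encoder are assumptions, not MDETR's released architecture): image features and text token embeddings are projected to a common width, concatenated, and processed jointly by one transformer encoder.

```python
import torch
import torch.nn as nn

# Minimal sketch of early text-image fusion with a shared transformer encoder.
# Dimensions, projections, and the backbone are placeholders, not MDETR's code.

d_model = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)
img_proj = nn.Linear(2048, d_model)   # project CNN feature-map channels
txt_proj = nn.Linear(768, d_model)    # project text-encoder hidden states

def fuse(image_feats, text_feats):
    # image_feats: (B, H*W, 2048) flattened backbone features
    # text_feats:  (B, T, 768) token embeddings from a language model
    tokens = torch.cat([img_proj(image_feats), txt_proj(text_feats)], dim=1)
    return encoder(tokens)            # jointly contextualized image+text tokens

fused = fuse(torch.randn(2, 49, 2048), torch.randn(2, 12, 768))
```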

4. Visformer: The Vision-friendly Transformer

Zhengsu Chen, Lingxi Xie, Jianwei Niu, Xuefeng Liu, Longhui Wei, Qi Tian

  • retweets: 1194, favorites: 213 (04/28/2021 09:29:54)
  • links: abs | pdf
  • cs.CV

The past year has witnessed the rapid development of applying the Transformer module to vision problems. While some researchers have demonstrated that Transformer-based models enjoy a favorable ability to fit data, there is still a growing body of evidence showing that these models suffer from over-fitting, especially when the training data is limited. This paper offers an empirical study that performs step-by-step operations to gradually transition a Transformer-based model into a convolution-based model. The results we obtain during the transition process deliver useful insights for improving visual recognition. Based on these observations, we propose a new architecture named Visformer, abbreviated from "Vision-friendly Transformer". With the same computational complexity, Visformer outperforms both Transformer-based and convolution-based models in terms of ImageNet classification accuracy, and the advantage becomes more significant when the model complexity is lower or the training set is smaller. The code is available at https://github.com/danczs/Visformer.

5. Improve Vision Transformers Training by Suppressing Over-smoothing

Chengyue Gong, Dilin Wang, Meng Li, Vikas Chandra, Qiang Liu

  • retweets: 361, favorites: 71 (04/28/2021 09:29:55)
  • links: abs | pdf
  • cs.CV | cs.LG

Introducing the transformer structure into computer vision tasks holds the promise of yielding a better speed-accuracy trade-off than traditional convolution networks. However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results. As a result, recent works propose to modify transformer structures by incorporating convolutional layers to improve the performance on vision tasks. This work investigates how to stabilize the training of vision transformers without special structure modification. We observe that the instability of transformer training on vision tasks can be attributed to the over-smoothing problem: the self-attention layers tend to map different patches from the input image to similar latent representations, causing loss of information and degraded performance, especially when the number of layers is large. We then propose a number of techniques to alleviate this problem, including additional loss functions that encourage diversity, prevent loss of information, and discriminate between different patches via an additional patch classification loss for CutMix. We show that our proposed techniques stabilize training and allow us to train wider and deeper vision transformers, achieving 85.0% top-1 accuracy on the ImageNet validation set without introducing extra teachers or additional convolution layers. Our code will be made publicly available at https://github.com/ChengyueGongR/PatchVisionTransformer.
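
As an illustration of the diversity idea, the sketch below shows one plausible regularizer (not necessarily the paper's exact loss): penalizing the average pairwise cosine similarity between last-layer patch tokens discourages the collapse of all patches onto similar representations.

```python
import torch
import torch.nn.functional as F

def patch_diversity_loss(patch_tokens: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of a diversity regularizer: penalize the average
    pairwise cosine similarity between patch representations so that the
    self-attention layers do not collapse all patches onto similar vectors.
    patch_tokens: (batch, num_patches, dim), e.g. the last-layer tokens."""
    x = F.normalize(patch_tokens, dim=-1)
    sim = x @ x.transpose(1, 2)                       # (B, N, N) cosine similarities
    n = sim.shape[-1]
    off_diag = sim.sum(dim=(1, 2)) - sim.diagonal(dim1=1, dim2=2).sum(-1)
    return (off_diag / (n * (n - 1))).mean()          # average off-diagonal similarity

tokens = torch.randn(2, 196, 384)
loss = patch_diversity_loss(tokens)   # added to the training objective with a small weight
```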

6. dualFace:Two-Stage Drawing Guidance for Freehand Portrait Sketching

Zhengyu Huang, Yichen Peng, Tomohiro Hibino, Chunqi Zhao, Haoran Xie, Tsukasa Fukusato, Kazunori Miyata

  • retweets: 310, favorites: 54 (04/28/2021 09:29:55)
  • links: abs | pdf
  • cs.GR | cs.CV

In this paper, we propose dualFace, a portrait drawing interface that assists users with different levels of drawing skill in completing recognizable and authentic face sketches. dualFace consists of two-stage drawing assistance that provides global and local visual guidance: global guidance helps users draw contour lines of portraits (i.e., the geometric structure), and local guidance helps users draw details of facial parts (which conform to the user-drawn contour lines), inspired by traditional artists' workflows in portrait drawing. In the global guidance stage, the user draws several contour lines, and dualFace then retrieves several relevant images from an internal database and displays the suggested face contour lines over the background of the canvas. In the local guidance stage, we synthesize detailed portrait images with a deep generative model from the user-drawn contour lines and use the synthesized results as detailed drawing guidance. We conducted a user study to verify the effectiveness of dualFace, and we confirmed that dualFace significantly helps users achieve a detailed portrait sketch. See http://www.jaist.ac.jp/~xie/dualface.html for details.

7. Diverse Image Inpainting with Bidirectional and Autoregressive Transformers

Yingchen Yu, Fangneng Zhan, Rongliang Wu, Jianxiong Pan, Kaiwen Cui, Shijian Lu, Feiying Ma, Xuansong Xie, Chunyan Miao

  • retweets: 121, favorites: 47 (04/28/2021 09:29:55)
  • links: abs | pdf
  • cs.CV

Image inpainting is an underdetermined inverse problem, so it naturally allows diverse contents that fill the missing or corrupted regions both reasonably and realistically. Prevalent approaches using convolutional neural networks (CNNs) can synthesize visually pleasant contents, but CNNs suffer from limited receptive fields for capturing global features. With image-level attention, transformers can model long-range dependencies and generate diverse contents by autoregressively modeling pixel-sequence distributions. However, the unidirectional attention in transformers is suboptimal, as corrupted regions can have arbitrary shapes with contexts from arbitrary directions. We propose BAT-Fill, an image inpainting framework with a novel bidirectional autoregressive transformer (BAT) that models deep bidirectional contexts for autoregressive generation of diverse inpainting contents. BAT-Fill inherits the merits of transformers and CNNs in a two-stage manner, which allows it to generate high-resolution contents without being constrained by the quadratic complexity of attention in transformers. Specifically, it first generates pluralistic image structures at low resolution by adapting transformers and then synthesizes realistic high-resolution texture details with a CNN-based up-sampling network. Extensive experiments over multiple datasets show that BAT-Fill achieves superior diversity and fidelity in image inpainting, both qualitatively and quantitatively.
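
The contrast with unidirectional generation can be sketched as follows. This is a hypothetical, toy illustration (vocabulary, model sizes, and the sampling loop are assumptions, not BAT-Fill's code): masked tokens are generated one at a time in an arbitrary order, and each prediction attends to all currently known tokens rather than only left-context.

```python
import torch
import torch.nn as nn

# Toy sketch of bidirectional autoregressive filling over a token sequence.
vocab, seq_len, d_model, mask_id = 512, 64, 128, 511
embed = nn.Embedding(vocab, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
head = nn.Linear(d_model, vocab)

@torch.no_grad()
def fill(tokens, missing):
    # tokens: (1, seq_len) with mask_id at missing positions; missing: 1-D index tensor.
    tokens = tokens.clone()
    for pos in missing[torch.randperm(len(missing))]:    # arbitrary generation order
        logits = head(encoder(embed(tokens)))            # full bidirectional attention
        probs = logits[0, pos].softmax(-1)
        tokens[0, pos] = torch.multinomial(probs, 1).item()  # sample this token, repeat
    return tokens

seq = torch.randint(0, vocab - 1, (1, seq_len))
missing = torch.arange(20, 30)                           # "corrupted" region
seq[0, missing] = mask_id
completed = fill(seq, missing)
```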

8. Bridging observation, theory and numerical simulation of the ocean using Machine Learning

Maike Sonnewald, Redouane Lguensat, Daniel C. Jones, Peter D. Dueben, Julien Brajard, Venkatramani Balaji

We study high-dimensional Bayesian linear regression with product priors. Using the nascent theory of non-linear large deviations (Chatterjee and Dembo, 2016), we derive sufficient conditions for the leading-order correctness of the naive mean-field approximation to the log-normalizing constant of the posterior distribution. Subsequently, assuming a true linear model for the observed data, we derive a limiting infinite dimensional variational formula for the log normalizing constant of the posterior. Furthermore, we establish that under an additional “separation” condition, the variational problem has a unique optimizer, and this optimizer governs the probabilistic properties of the posterior distribution. We provide intuitive sufficient conditions for the validity of this “separation” condition. Finally, we illustrate our results on concrete examples with specific design matrices.
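
For reference, the central quantity discussed here, the log-normalizing constant of the posterior, and its naive mean-field approximation can be written as follows for Gaussian noise and a product prior (standard notation assumed, not taken verbatim from the paper):

```latex
% Posterior normalizing constant for Bayesian linear regression with a
% product prior \pi^{\otimes p} and Gaussian noise of variance \sigma^2;
% the naive mean-field approximation restricts the variational family to
% product measures q = \prod_i q_i.
\[
  Z_n \;=\; \int_{\mathbb{R}^p}
      \exp\!\Big(-\tfrac{1}{2\sigma^2}\,\lVert y - X\beta \rVert_2^2\Big)
      \prod_{i=1}^{p} \pi(\mathrm{d}\beta_i),
  \qquad
  \log Z_n \;\approx\; \sup_{q = \prod_i q_i}
      \Big\{ \mathbb{E}_q\big[-\tfrac{1}{2\sigma^2}\lVert y - X\beta\rVert_2^2\big]
             - \mathrm{KL}\big(q \,\Vert\, \pi^{\otimes p}\big) \Big\}.
\]
```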

9. baller2vec++: A Look-Ahead Multi-Entity Transformer For Modeling Coordinated Agents

Michael A. Alcorn, Anh Nguyen

  • retweets: 90, favorites: 59 (04/28/2021 09:29:55)
  • links: abs | pdf
  • cs.LG | cs.MA

Grid maps are widely established for the representation of static objects in robotics and automotive applications. However, incorporating velocity information is still an active area of investigation because of the increased complexity of dynamic grids, concerning both velocity measurement models for radar sensors and the representation of velocity in a grid framework. In this paper, both issues are addressed: sensor models and an efficient grid framework, which are required to ensure efficient and robust environment perception with radar. To this end, we introduce new inverse radar sensor models covering radar sensor artifacts such as measurement ambiguities to integrate automotive radar sensors for improved velocity estimation. Furthermore, we introduce UNIFY, a multiple-belief Bayesian grid map framework for static occupancy and velocity estimation with independent layers. The proposed UNIFY framework utilizes a grid-cell-based layer to provide occupancy information and a particle-based velocity layer for motion state estimation in an autonomous vehicle's environment. Each UNIFY layer allows individual execution as well as simultaneous execution of both layers for optimal adaptation to varying environments in autonomous driving applications. UNIFY was tested and evaluated in terms of plausibility and efficiency on a large real-world radar dataset in challenging traffic scenarios covering different densities in urban and rural settings.
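
The two-layer idea can be sketched with a toy data structure. The snippet below is a simplified, hypothetical illustration (grid size, log-odds increments, and the particle update are assumptions, not the UNIFY implementation): a cell-based occupancy layer and a particle-based velocity layer are kept side by side and updated independently.

```python
import numpy as np

class DualLayerGrid:
    """Toy grid map with an occupancy layer (per-cell log-odds) and a
    velocity layer (a particle set of (x, y, vx, vy) states)."""

    def __init__(self, size=100):
        self.log_odds = np.zeros((size, size))   # occupancy layer
        self.particles = np.empty((0, 4))        # velocity layer

    def update_occupancy(self, hits, l_occ=0.85, l_free=-0.4):
        # hits: boolean grid of cells observed as occupied in this scan.
        self.log_odds += np.where(hits, l_occ, l_free)
        self.log_odds = np.clip(self.log_odds, -10.0, 10.0)

    def update_velocity(self, detections, dt=0.05, noise=0.1):
        # detections: (N, 4) measured (x, y, vx, vy), e.g. from radar Doppler.
        if len(self.particles):
            self.particles[:, :2] += self.particles[:, 2:] * dt   # constant-velocity prediction
            self.particles[:, 2:] += np.random.randn(len(self.particles), 2) * noise
        self.particles = np.vstack([self.particles, detections])  # spawn new particles

    def occupancy(self):
        return 1.0 / (1.0 + np.exp(-self.log_odds))  # log-odds back to probabilities

grid = DualLayerGrid()
grid.update_occupancy(np.random.rand(100, 100) > 0.97)
grid.update_velocity(np.array([[12.0, 30.0, 1.5, 0.0]]))
```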

10. PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qi Guo, Yue Yu, Yan Zhang, Jin Wang, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fangqing Jiang, Han Zhang, Lingfeng Deng, Yehong Zhang, Zhe Lin, Chao Zhang, Shaojie Zhang, Mingyue Guo, Shanzhi Gu, Gaojun Fan, Yaowei Wang, Xuefeng Jin, Qun Liu, Yonghong Tian

  • retweets: 98, favorites: 49 (04/28/2021 09:29:55)
  • links: abs | pdf
  • cs.CL

Large-scale Pretrained Language Models (PLMs) have become the new paradigm for Natural Language Processing (NLP). PLMs with hundreds of billions of parameters, such as GPT-3, have demonstrated strong performance on natural language understanding and generation with few-shot in-context learning. In this work, we present our practice of training a large-scale autoregressive language model named PanGu-α, with up to 200 billion parameters. PanGu-α is developed under the MindSpore framework and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently: data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism, and rematerialization. To enhance the generalization ability of PanGu-α, we collect 1.1 TB of high-quality Chinese data from a wide range of domains to pretrain the model. We empirically test the generation ability of PanGu-α in various scenarios, including text summarization, question answering, dialogue generation, etc. Moreover, we investigate the effect of model scale on the few-shot performance across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-α in performing various tasks under few-shot or zero-shot settings.

11. Playing Lottery Tickets with Vision and Language

Zhe Gan, Yen-Chun Chen, Linjie Li, Tianlong Chen, Yu Cheng, Shuohang Wang, Jingjing Liu

A concept of drone-launched short-range rockets (DLSRR) is presented. A drone or an aircraft raises DLSRRs to a release altitude of up to 20 km. At the release altitude, the drone or aircraft is moving at a velocity of up to 700 m/s and a steep angle of up to 68° to the horizontal. After DLSRRs are released, their motors start firing. DLSRRs use slow-burning motors to gain altitude and velocity. At the apogee of their flight, DLSRRs release projectiles which fly to the target and strike it at high impact velocity. The projectiles reach targets at ranges of up to 442 km and impact velocities of up to 1.88 km/s. We show that a rocket launched at high altitude and high initial velocity does not need expensive thermal protection to survive ascent. Delivery of munitions to a target by DLSRRs should be much less expensive than delivery by a conventional rocket. Even though delivery of munitions by bomber aircraft is even less expensive, a bomber needs to fly close to the target, while a DLSRR carrier releases the rockets from a distance of at least 200 km from the target. All parameters of DLSRRs and their trajectories are calculated based on theoretical (mechanical and thermodynamical) analysis and on several MATLAB programs.

12. Unikraft: Fast, Specialized Unikernels the Easy Way

Simon Kuenzer, Vlad-Andrei Bădoiu, Hugo Lefeuvre, Sharan Santhanam, Alexander Jung, Gaulthier Gain, Cyril Soldani, Costin Lupu, Ştefan Teodorescu, Costi Răducanu, Cristian Banu, Laurent Mathy, Răzvan Deaconescu, Costin Raiciu, Felipe Huici

  • retweets: 81, favorites: 31 (04/28/2021 09:29:56)
  • links: abs | pdf
  • cs.OS

Unikernels are famous for providing excellent performance in terms of boot times, throughput and memory consumption, to name a few metrics. However, they are infamous for making it hard and extremely time consuming to extract such performance, and for needing significant engineering effort in order to port applications to them. We introduce Unikraft, a novel micro-library OS that (1) fully modularizes OS primitives so that it is easy to customize the unikernel and include only relevant components and (2) exposes a set of composable, performance-oriented APIs in order to make it easy for developers to obtain high performance. Our evaluation using off-the-shelf applications such as nginx, SQLite, and Redis shows that running them on Unikraft results in a 1.7x-2.7x performance improvement compared to Linux guests. In addition, Unikraft images for these apps are around 1MB, require less than 10MB of RAM to run, and boot in around 1ms on top of the VMM time (total boot time 3ms-40ms). Unikraft is a Linux Foundation open source project and can be found at www.unikraft.org.

13. M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers

Tianrui Guan, Jun Wang, Shiyi Lan, Rohan Chandra, Zuxuan Wu, Larry Davis, Dinesh Manocha

  • retweets: 81, favorites: 29 (04/28/2021 09:29:56)
  • links: abs | pdf
  • cs.CV

Recent theoretical works on over-parameterized neural nets have focused on two aspects: optimization and generalization. Many existing works that study optimization and generalization together are based on the neural tangent kernel and require a very large width. In this work, we are interested in the following question: for a binary classification problem with a two-layer, mildly over-parameterized ReLU network, can we find a point with small test error in polynomial time? We first show that the landscape of loss functions with explicit regularization has the following property: all local minima, and certain other points which are only stationary in certain directions, achieve small test error. We then prove that for convolutional neural nets, there is an algorithm which finds one of these points in polynomial time (in the input dimension and the number of data points). In addition, we prove that for a fully connected neural net, with an additional assumption on the data distribution, there is a polynomial-time algorithm.

14. Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets

Yuan-Hong Liao, Amlan Kar, Sanja Fidler

  • retweets: 49, favorites: 29 (04/28/2021 09:29:56)
  • links: abs | pdf
  • cs.CV

Data is the engine of modern computer vision, which necessitates collecting large-scale datasets. This is expensive, and guaranteeing the quality of the labels is a major challenge. In this paper, we investigate efficient annotation strategies for collecting multi-class classification labels for a large collection of images. While methods that exploit learnt models for labeling exist, a surprisingly prevalent approach is to query humans for a fixed number of labels per datum and aggregate them, which is expensive. Building on prior work on online joint probabilistic modeling of human annotations and machine-generated beliefs, we propose modifications and best practices aimed at minimizing human labeling effort. Specifically, we make use of advances in self-supervised learning, view annotation as a semi-supervised learning problem, identify and mitigate pitfalls, and ablate several key design choices to propose effective guidelines for labeling. Our analysis is done in a more realistic simulation that involves querying human labelers, which uncovers issues with evaluation using existing worker simulation methods. Simulated experiments on a 125k-image subset of ImageNet100 show that it can be annotated to 80% top-1 accuracy with 0.35 annotations per image on average, a 2.7x and 6.7x improvement over prior work and manual annotation, respectively. Project page: https://fidler-lab.github.io/efficient-annotation-cookbook
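
A toy version of the querying loop conveys the flavor of the approach. The sketch below is a simplified, hypothetical illustration (the symmetric worker-noise model, accuracy value, and stopping threshold are assumptions, not the paper's model): a machine-generated class prior is combined with worker votes, and additional labels are requested only while the posterior remains uncertain.

```python
import numpy as np

def posterior(machine_prior, worker_votes, worker_accuracy=0.8):
    """machine_prior: (K,) class probabilities from a model;
    worker_votes: list of class indices supplied by annotators."""
    k = len(machine_prior)
    log_post = np.log(machine_prior)
    for vote in worker_votes:
        lik = np.full(k, (1 - worker_accuracy) / (k - 1))  # probability of each wrong label
        lik[vote] = worker_accuracy                          # probability of the voted label
        log_post += np.log(lik)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

def annotate(machine_prior, oracle, confidence=0.95, max_queries=3):
    votes = []
    post = posterior(machine_prior, votes)
    while post.max() < confidence and len(votes) < max_queries:
        votes.append(oracle())                # ask one more human for a label
        post = posterior(machine_prior, votes)
    return int(post.argmax()), len(votes)     # predicted class, human labels spent

label, cost = annotate(np.array([0.6, 0.3, 0.1]), oracle=lambda: 0)
```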

15. Focused Attention Improves Document-Grounded Generation

Shrimai Prabhumoye, Kazuma Hashimoto, Yingbo Zhou, Alan W Black, Ruslan Salakhutdinov

  • retweets: 30, favorites: 46 (04/28/2021 09:29:56)
  • links: abs | pdf
  • cs.CL

Document-grounded generation is the task of using the information provided in a document to improve text generation. This work focuses on two different document-grounded generation tasks: the Wikipedia Update Generation task and dialogue response generation. Our work introduces two novel adaptations of large-scale pre-trained encoder-decoder models that focus on building context-driven representations of the document and enabling specific attention to the information in the document. Additionally, we provide a stronger BART baseline for these tasks. Our proposed techniques outperform existing methods on both automated metrics (an increase of at least 48% in BLEU-4 points) and human evaluation for closeness to reference and relevance to the document. Furthermore, we perform a comprehensive manual inspection of the generated output and categorize errors to provide insights into future directions for modeling these tasks.
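
For context, a plain BART baseline for the dialogue task can be set up as below with the Hugging Face transformers library. It simply concatenates the grounding document and the dialogue context into the encoder input and does not implement the paper's proposed context-driven attention; the checkpoint choice and separator tokens are assumptions.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Plain document-grounded generation baseline: concatenate document + dialogue
# context, then let BART generate the next response.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

document = "The Eiffel Tower is a wrought-iron lattice tower in Paris."
dialogue = "Have you ever been to Paris? </s> Yes, last summer."

inputs = tokenizer(document + " </s> " + dialogue,
                   return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(**inputs, num_beams=4, max_length=48)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```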