Hot Papers 2021-08-20

1. Do Vision Transformers See Like Convolutional Neural Networks?

Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy

Convolutional neural networks (CNNs) have so far been the de facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking differences between the two architectures, such as ViT having more uniform representations across all layers. We explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information, and ViT residual connections, which strongly propagate features from lower to higher layers. We study the ramifications for spatial localization, demonstrating ViTs successfully preserve input spatial information, with noticeable effects from different classification methods. Finally, we study the effect of (pretraining) dataset scale on intermediate features and transfer learning, and conclude with a discussion on connections to new architectures such as the MLP-Mixer.
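
As a concrete example of the kind of representation-similarity analysis involved, below is a minimal sketch of linear centered kernel alignment (CKA), a standard measure for comparing layer activations. The (examples x features) activation matrices and toy data are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: linear CKA between two layers' activation matrices.
# X, Y have shape (num_examples, num_features); names are illustrative.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    # Center each feature column so CKA is invariant to per-feature offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 normalized by ||X^T X||_F * ||Y^T Y||_F.
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

# Toy usage: identical features give CKA near 1, independent ones near 0.
rng = np.random.default_rng(0)
A = rng.normal(size=(512, 64))
print(linear_cka(A, A))                           # ~1.0
print(linear_cka(A, rng.normal(size=(512, 64))))  # near 0
```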

2. Image2Lego: Customized LEGO Set Generation from Images

Kyle Lennon, Katharina Fransen, Alexander O’Brien, Yumeng Cao, Matthew Beveridge, Yamin Arefeen, Nikhil Singh, Iddo Drori

  • retweets: 1558, favorites: 163 (08/21/2021 09:06:00)
  • links: abs | pdf
  • cs.CV | cs.LG

Although LEGO sets have entertained generations of children and adults, the challenge of designing customized builds matching the complexity of real-world or imagined scenes remains too great for the average enthusiast. In order to make this feat possible, we implement a system that generates a LEGO brick model from 2D images. We design a novel solution to this problem that uses an octree-structured autoencoder trained on 3D voxelized models to obtain a feasible latent representation for model reconstruction, and a separate network trained to predict this latent representation from 2D images. LEGO models are obtained by algorithmic conversion of the 3D voxelized model to bricks. We demonstrate first-of-its-kind conversion of photographs to 3D LEGO models. An octree architecture enables the flexibility to produce multiple resolutions to best fit a user’s creative vision or design needs. In order to demonstrate the broad applicability of our system, we generate step-by-step building instructions and animations for LEGO models of objects and human faces. Finally, we test these automatically generated LEGO sets by constructing physical builds using real LEGO bricks.
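
The final voxels-to-bricks step is algorithmic; as a rough illustration (not the paper's algorithm), the toy sketch below tiles a single binary voxel layer with 1xN bricks, greedily covering each filled run with the longest brick that fits. The available brick lengths and the greedy rule are assumptions.

```python
# Toy sketch: greedy conversion of one binary voxel layer into 1xN bricks.
import numpy as np

BRICK_LENGTHS = (8, 6, 4, 3, 2, 1)  # assumed available 1xN brick sizes

def layer_to_bricks(layer: np.ndarray):
    """Return (row, col, length) placements covering all filled voxels."""
    filled = layer.astype(bool)
    bricks = []
    for r in range(filled.shape[0]):
        c = 0
        while c < filled.shape[1]:
            if not filled[r, c]:
                c += 1
                continue
            # Length of the contiguous filled run starting at (r, c).
            run = 0
            while c + run < filled.shape[1] and filled[r, c + run]:
                run += 1
            # Greedily cover the run with the longest bricks that fit.
            while run > 0:
                n = next(b for b in BRICK_LENGTHS if b <= run)
                bricks.append((r, c, n))
                c += n
                run -= n
    return bricks

layer = np.array([[1, 1, 1, 1, 1, 0, 1],
                  [0, 1, 1, 0, 0, 1, 1]])
print(layer_to_bricks(layer))  # [(0, 0, 4), (0, 4, 1), (0, 6, 1), ...]
```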

3. ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis

Patrick Esser, Robin Rombach, Andreas Blattmann, Björn Ommer

  • retweets: 272, favorites: 81 (08/21/2021 09:06:00)
  • links: abs | pdf
  • cs.CV

Autoregressive models and their sequential factorization of the data likelihood have recently demonstrated great potential for image representation and synthesis. Nevertheless, they incorporate image context in a linear 1D order by attending only to previously synthesized image patches above or to the left. Not only is this unidirectional, sequential bias of attention unnatural for images, as it disregards large parts of a scene until synthesis is almost complete; it also processes the entire image on a single scale, thus ignoring more global contextual information up to the gist of the entire scene. As a remedy, we incorporate a coarse-to-fine hierarchy of context by combining the autoregressive formulation with a multinomial diffusion process: whereas a multistage diffusion process successively removes information to coarsen an image, we train a (short) Markov chain to invert this process. In each stage, the resulting autoregressive ImageBART model progressively incorporates context from previous stages in a coarse-to-fine manner. Experiments show greatly improved image modification capabilities over autoregressive models while also providing high-fidelity image generation, both of which are enabled through efficient training in a compressed latent space. Specifically, our approach can take unrestricted, user-provided masks into account to perform local image editing. Thus, in contrast to pure autoregressive models, it can solve free-form image inpainting and, in the case of conditional models, local, text-guided image modification without requiring mask-specific training.
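
A hedged sketch of the forward (coarsening) direction of a multinomial diffusion on a grid of discrete codebook indices: each stage independently resamples tokens with some probability. The uniform corruption kernel, the beta schedule, and the 16x16 grid over a 1024-entry codebook are illustrative assumptions; the learned Markov chain in the paper inverts such a chain stage by stage.

```python
# Sketch of a generic multinomial-diffusion noising step on discrete
# latent tokens (the compressed latent space the abstract mentions).
import numpy as np

def multinomial_noise_step(tokens: np.ndarray, beta: float,
                           vocab_size: int, rng) -> np.ndarray:
    """One coarsening step: corrupt each token with probability beta."""
    corrupt = rng.random(tokens.shape) < beta
    random_tokens = rng.integers(0, vocab_size, size=tokens.shape)
    return np.where(corrupt, random_tokens, tokens)

rng = np.random.default_rng(0)
tokens = rng.integers(0, 1024, size=(16, 16))  # e.g. a 16x16 latent grid
# A short chain of stages: each stage removes information; a learned
# model trained per stage can then invert the chain coarse-to-fine.
stages = [tokens]
for beta in (0.1, 0.3, 0.6, 0.9):
    stages.append(multinomial_noise_step(stages[-1], beta, 1024, rng))
print([int((s != tokens).mean() * 100) for s in stages])  # % tokens changed
```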

4. Neural-GIF: Neural Generalized Implicit Functions for Animating People in Clothing

Garvita Tiwari, Nikolaos Sarafianos, Tony Tung, Gerard Pons-Moll

  • retweets: 169, favorites: 69 (08/21/2021 09:06:01)
  • links: abs | pdf
  • cs.CV

We present Neural Generalized Implicit Functions (Neural-GIF) to animate people in clothing as a function of body pose. Given a sequence of scans of a subject in various poses, we learn to animate the character for new poses. Existing methods have relied on template-based representations of the human body (or clothing). However, such models usually have fixed and limited resolutions, require difficult data pre-processing steps, and cannot be used with complex clothing. We draw inspiration from template-based methods, which factorize motion into articulation and non-rigid deformation, but generalize this concept for implicit shape learning to obtain a more flexible model. We learn to map every point in space to a canonical space, where a learned deformation field is applied to model non-rigid effects, before evaluating the signed distance field. Our formulation allows the learning of complex and non-rigid deformations of clothing and soft tissue, without computing a template registration, as is common in current approaches. Neural-GIF can be trained on raw 3D scans and reconstructs detailed complex surface geometry and deformations. Moreover, the model can generalize to new poses. We evaluate our method on a variety of characters from different public datasets in diverse clothing styles and show significant improvements over baseline methods, quantitatively and qualitatively. We also extend our model to the multiple-shape setting. To stimulate further research, we will make the model, code and data publicly available at: https://virtualhumans.mpi-inf.mpg.de/neuralgif/
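
A minimal PyTorch sketch of the composition the abstract describes: a canonical-space mapping, a learned deformation field applied there, and a pose-free signed distance network evaluated last. Network widths, the Softplus activations, and the SMPL-style 72-dimensional pose vector are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def mlp(dims):
    layers = []
    for a, b in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(a, b), nn.Softplus()]
    return nn.Sequential(*layers[:-1])  # no activation on the output

class NeuralGIFSketch(nn.Module):
    def __init__(self, pose_dim: int = 72):
        super().__init__()
        self.canonical = mlp([3 + pose_dim, 256, 256, 3])  # x -> x_canonical
        self.deform = mlp([3 + pose_dim, 256, 256, 3])     # non-rigid offset
        self.sdf = mlp([3, 256, 256, 1])                   # pose-free SDF

    def forward(self, x: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        p = pose.expand(x.shape[0], -1)
        x_c = self.canonical(torch.cat([x, p], dim=-1))
        delta = self.deform(torch.cat([x_c, p], dim=-1))
        return self.sdf(x_c + delta)  # signed distance at the query points

model = NeuralGIFSketch()
points = torch.randn(1024, 3)     # query points in posed space
pose = torch.randn(1, 72)         # e.g. an SMPL-style pose vector
print(model(points, pose).shape)  # torch.Size([1024, 1])
```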

5. Successive cohorts of Twitter users show increasing activity and shrinking content horizons

Frederik Wolf, Philipp Lorenz-Spreen, Sune Lehmann

The global public sphere has changed dramatically over the past decades: a significant part of public discourse now takes place on algorithmically driven platforms owned by a handful of private companies. Despite its growing importance, there is scant large-scale academic research on the long-term evolution of user behaviour on these platforms, because the data are often proprietary to the platforms. Here, we evaluate the individual behaviour of 600,000 Twitter users between 2012 and 2019 and find empirical evidence for an acceleration of the way Twitter is used on an individual level. This manifests itself in the fact that cohorts of Twitter users behave differently depending on when they joined the platform. Behaviour within a cohort is relatively consistent over time and characterised by strong internal interactions, but over time behaviour from cohort to cohort shifts towards increased activity. Specifically, we measure this in terms of more tweets per user over time, denser interactions with others via retweets, and shorter content horizons, expressed as an individual’s decaying autocorrelation of topics over time. Our observations are explained by a growing proportion of active users who not only tweet more actively but also elicit more retweets. These behaviours suggest a collective contribution to an increased flow of information through each cohort’s news feed — an increase that potentially depletes available collective attention over time. Our findings complement recent, empirical work on social acceleration, which has been largely agnostic about individual user activity.
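
As a rough illustration of a "content horizon", the sketch below measures how quickly a user's topic mix decorrelates with lag. The weekly topic-count representation and the cosine similarity are assumptions, not the paper's exact estimator.

```python
# Sketch: lag-k autocorrelation of a user's topic usage over time.
import numpy as np

def topic_autocorrelation(topic_counts: np.ndarray, lag: int) -> float:
    """Mean cosine similarity between topic vectors `lag` steps apart."""
    a, b = topic_counts[:-lag], topic_counts[lag:]
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / np.maximum(den, 1e-12)))

rng = np.random.default_rng(1)
# 52 weeks x 20 topics of synthetic per-week topic counts for one user.
weeks = rng.poisson(2.0, size=(52, 20))
# A faster-decaying curve over lags means a shorter content horizon.
print([round(topic_autocorrelation(weeks, k), 3) for k in (1, 4, 12)])
```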

6. Gravity-Aware Monocular 3D Human-Object Reconstruction

Rishabh Dabral, Soshi Shimada, Arjun Jain, Christian Theobalt, Vladislav Golyanik

  • retweets: 110, favorites: 31 (08/21/2021 09:06:01)
  • links: abs | pdf
  • cs.CV

This paper proposes GraviCap, a new approach to joint markerless 3D human motion capture and object trajectory estimation from monocular RGB videos. We focus on scenes with objects partially observed during free flight. In contrast to existing monocular methods, we can recover scale, object trajectories, human bone lengths in meters, and the ground plane’s orientation, thanks to gravity constraining object motions. Our objective function is parametrised by the object’s initial velocity and position, gravity direction, and focal length, and is jointly optimised over one or several free-flight episodes. The proposed human-object interaction constraints ensure geometric consistency of the 3D reconstructions and improved physical plausibility of human poses compared to the unconstrained case. We evaluate GraviCap on a new dataset with ground-truth annotations for persons and different objects undergoing free flights. In the experiments, our approach achieves state-of-the-art accuracy in 3D human motion capture on various metrics. We urge the reader to watch our supplementary video. Both the source code and the dataset are released; see http://4dqv.mpi-inf.mpg.de/GraviCap/.
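
To make the gravity constraint concrete, here is a minimal least-squares sketch: a ballistic trajectory p(t) = p0 + v0*t + g*t^2/2 is projected through a pinhole camera and fit to 2D detections. Holding the focal length and gravity direction fixed is a simplification of this sketch; the paper optimises them jointly.

```python
import numpy as np
from scipy.optimize import least_squares

F = 800.0                        # focal length (pixels), fixed in this sketch
G = np.array([0.0, -9.81, 0.0])  # gravity, fixed here; jointly optimised in
                                 # the paper along with the focal length

def residuals(theta, times, uv_obs):
    """Reprojection error of a ballistic flight p(t) = p0 + v0*t + g*t^2/2."""
    p0, v0 = theta[:3], theta[3:]
    res = []
    for t, uv in zip(times, uv_obs):
        p = p0 + v0 * t + 0.5 * G * t ** 2       # 3D position at time t
        res += [F * p[0] / p[2] - uv[0],          # pinhole projection error
                F * p[1] / p[2] - uv[1]]
    return res

# Synthesize detections from a known flight, then recover p0 and v0 in
# meters (the known gravity magnitude fixes the metric scale).
times = np.linspace(0.0, 0.8, 9)
true = np.array([0.2, 1.5, 4.0, 1.0, 2.5, 0.5])   # p0 (m), v0 (m/s)
uv_obs = np.array(residuals(true, times, np.zeros((9, 2)))).reshape(9, 2)
fit = least_squares(residuals, x0=np.array([0, 0, 3.0, 0, 0, 0.0]),
                    args=(times, uv_obs))
print(np.round(fit.x, 3))  # should closely match `true` on this clean data
```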

7. Learning to Match Features with Seeded Graph Matching Network

Hongkai Chen, Zixin Luo, Jiahui Zhang, Lei Zhou, Xuyang Bai, Zeyu Hu, Chiew-Lan Tai, Long Quan

  • retweets: 49, favorites: 22 (08/21/2021 09:06:01)
  • links: abs | pdf
  • cs.CV

Matching local features across images is a fundamental problem in computer vision. Targeting high accuracy and efficiency, we propose the Seeded Graph Matching Network, a graph neural network with a sparse structure to reduce redundant connectivity and learn compact representations. The network consists of 1) a Seeding Module, which initializes the matching by generating a small set of reliable matches as seeds, and 2) a Seeded Graph Neural Network, which utilizes seed matches to pass messages within and across images and predicts assignment costs. Three novel operations are proposed as basic elements for message passing: 1) Attentional Pooling, which aggregates keypoint features within the image to seed matches; 2) Seed Filtering, which enhances seed features and exchanges messages across images; and 3) Attentional Unpooling, which propagates seed features back to the original keypoints. Experiments show that our method significantly reduces computational and memory complexity compared with typical attention-based networks while achieving competitive or higher performance.
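
As an illustration of seeding, a common way to generate a small set of reliable initial matches is mutual nearest neighbours with a ratio test; the sketch below is that generic recipe, not necessarily the paper's exact rule.

```python
import numpy as np

def seed_matches(desc_a, desc_b, ratio=0.9):
    """Return (i, j) index pairs that are mutual NNs passing a ratio test."""
    # Pairwise L2 distances between the two descriptor sets.
    d = np.linalg.norm(desc_a[:, None] - desc_b[None, :], axis=-1)
    nn_ab = d.argmin(axis=1)  # best match in B for each A keypoint
    nn_ba = d.argmin(axis=0)  # best match in A for each B keypoint
    seeds = []
    for i, j in enumerate(nn_ab):
        if nn_ba[j] != i:     # keep mutual nearest neighbours only
            continue
        two_best = np.partition(d[i], 1)[:2]
        if two_best[0] < ratio * two_best[1]:  # Lowe-style ratio test
            seeds.append((i, int(j)))
    return seeds

rng = np.random.default_rng(0)
desc_a = rng.normal(size=(200, 128))
desc_b = desc_a[:50] + 0.05 * rng.normal(size=(50, 128))  # 50 true matches
print(len(seed_matches(desc_a, desc_b)))  # ~50 reliable seeds
```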

8. PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers

Xumin Yu, Yongming Rao, Ziyi Wang, Zuyan Liu, Jiwen Lu, Jie Zhou

Point clouds captured in real-world applications are often incomplete due to limited sensor resolution, a single viewpoint, and occlusion. Recovering complete point clouds from partial ones is therefore an indispensable task in many practical applications. In this paper, we present a new method that reformulates point cloud completion as a set-to-set translation problem and design a new model, called PoinTr, that adopts a transformer encoder-decoder architecture for point cloud completion. By representing the point cloud as a set of unordered groups of points with position embeddings, we convert the point cloud to a sequence of point proxies and employ transformers for point cloud generation. To help the transformers better leverage the inductive bias about the 3D geometric structure of point clouds, we further devise a geometry-aware block that models local geometric relationships explicitly. Adopting transformers enables our model to better learn structural knowledge and preserve detailed information for point cloud completion. Furthermore, we propose two more challenging benchmarks with more diverse incomplete point clouds that better reflect real-world scenarios, to promote future research. Experimental results show that our method outperforms state-of-the-art methods by a large margin on both the new benchmarks and the existing ones. Code is available at https://github.com/yuxumin/PoinTr.
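
A rough sketch of the "point proxy" idea: sample group centers with farthest point sampling and attach a pooled local descriptor to each, yielding a short sequence a transformer can consume. The group count, group size, and mean-pooled "feature" are illustrative assumptions; the paper's grouping and embeddings are learned.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """Indices of k points that greedily maximize mutual distance."""
    idx = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        idx.append(int(dist.argmax()))
        dist = np.minimum(dist,
                          np.linalg.norm(points - points[idx[-1]], axis=1))
    return np.array(idx)

def point_proxies(points: np.ndarray, k: int = 32, group: int = 16):
    centers = points[farthest_point_sampling(points, k)]
    d = np.linalg.norm(points[None] - centers[:, None], axis=-1)  # (k, N)
    knn = d.argsort(axis=1)[:, :group]       # nearest neighbours per center
    local = points[knn].mean(axis=1)         # toy pooled "feature"
    return np.concatenate([centers, local], axis=-1)  # (k, 6) sequence

cloud = np.random.default_rng(0).normal(size=(2048, 3))  # partial cloud
print(point_proxies(cloud).shape)  # (32, 6): a sequence for the transformer
```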

9. Estimating distinguishability measures on quantum computers

Rochisha Agarwal, Soorya Rethinasamy, Kunal Sharma, Mark M. Wilde

The performance of a quantum information processing protocol is ultimately judged by distinguishability measures that quantify how distinguishable the actual result of the protocol is from the ideal case. The most prominent distinguishability measures are those based on the fidelity and trace distance, due to their physical interpretations. In this paper, we propose and review several algorithms for estimating distinguishability measures based on trace distance and fidelity, and we evaluate their performance using simulators of quantum computers. The algorithms can be used for distinguishing quantum states, channels, and strategies (the last also known in the literature as “quantum combs”). The fidelity-based algorithms offer novel physical interpretations of these distinguishability measures in terms of the maximum probability with which a single prover (or competing provers) can convince a verifier to accept the outcome of an associated computation. We simulate these algorithms by using a variational approach with parameterized quantum circuits and find that they converge well for the examples that we consider.
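
For small systems both quantities can be checked classically; the sketch below evaluates the trace distance T(rho, sigma) = ||rho - sigma||_1 / 2 and the fidelity F(rho, sigma) = (Tr sqrt(sqrt(rho) sigma sqrt(rho)))^2 with dense linear algebra, standing in for the paper's variational, circuit-based estimators.

```python
import numpy as np
from scipy.linalg import sqrtm

def trace_distance(rho, sigma):
    eigs = np.linalg.eigvalsh(rho - sigma)  # Hermitian difference
    return 0.5 * np.abs(eigs).sum()         # half the trace norm

def fidelity(rho, sigma):
    s = sqrtm(rho)
    return np.real(np.trace(sqrtm(s @ sigma @ s))) ** 2

def random_density_matrix(d, rng):
    a = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = a @ a.conj().T
    return rho / np.trace(rho)

rng = np.random.default_rng(0)
rho, sigma = (random_density_matrix(4, rng) for _ in range(2))
t, f = trace_distance(rho, sigma), fidelity(rho, sigma)
print(t, f)
# Fuchs-van de Graaf sanity check: 1 - sqrt(F) <= T <= sqrt(1 - F).
assert 1 - np.sqrt(f) <= t <= np.sqrt(1 - f) + 1e-9
```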

10. Towards Controllable and Photorealistic Region-wise Image Manipulation

Ansheng You, Chenglin Zhou, Qixuan Zhang, Lan Xu

  • retweets: 30, favorites: 26 (08/21/2021 09:06:02)
  • links: abs | pdf
  • cs.CV

Adaptive and flexible image editing is a desirable function of modern generative models. In this work, we present a generative model with an auto-encoder architecture for per-region style manipulation. We apply a code consistency loss to enforce an explicit disentanglement between content and style latent representations, making the content and style of generated samples consistent with their corresponding content and style references. The model is also constrained by a content alignment loss to ensure that foreground editing does not interfere with background content. As a result, given region-of-interest masks provided by users, our model supports foreground region-wise style transfer. Notably, our model requires no extra annotations such as semantic labels and relies only on self-supervision. Extensive experiments show the effectiveness of the proposed method and exhibit the flexibility of the proposed model for various applications, including region-wise style editing, latent space interpolation, and cross-domain style transfer.
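
As a sketch of what a content alignment loss could look like (the abstract does not give its exact form), the snippet below penalizes L1 changes outside the user's region mask so that foreground edits leave the background untouched. The L1 form and the normalization are assumptions.

```python
import torch

def content_alignment_loss(generated: torch.Tensor,
                           reference: torch.Tensor,
                           region_mask: torch.Tensor) -> torch.Tensor:
    """L1 distance restricted to the background (mask == 0)."""
    background = 1.0 - region_mask
    diff = (generated - reference).abs() * background
    return diff.sum() / background.sum().clamp(min=1.0)

gen = torch.rand(1, 3, 64, 64)    # generator output
ref = torch.rand(1, 3, 64, 64)    # content reference image
mask = torch.zeros(1, 1, 64, 64)  # user-selected edit region
mask[..., 16:48, 16:48] = 1.0
print(content_alignment_loss(gen, ref, mask))  # scalar training penalty
```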

11. Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval

Xinyu Zhang, Xueguang Ma, Peng Shi, Jimmy Lin

  • retweets: 25, favorites: 30 (08/21/2021 09:06:02)
  • links: abs | pdf
  • cs.CL | cs.IR

We present Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages, designed to evaluate ranking with learned dense representations. The goal of this resource is to spur research in dense retrieval techniques for non-English languages, motivated by recent observations that existing techniques for representation learning perform poorly when applied to out-of-distribution data. As a starting point, we provide zero-shot baselines for this new dataset based on a multi-lingual adaptation of DPR that we call “mDPR”. Experiments show that although the effectiveness of mDPR is much lower than that of BM25, dense representations nevertheless appear to provide valuable relevance signals, improving BM25 results in sparse-dense hybrids. In addition to analyses of our results, we also discuss future challenges and present a research agenda in multi-lingual dense retrieval. Mr. TyDi can be downloaded at https://github.com/castorini/mr.tydi.
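
A minimal sketch of sparse-dense hybrid scoring of the kind the abstract evaluates: min-max normalize each system's scores per query, then interpolate. The normalization scheme and the interpolation weight alpha are assumptions, not the paper's exact fusion method.

```python
import numpy as np

def hybrid_scores(bm25: dict, dense: dict, alpha: float = 0.5) -> dict:
    """Fuse two doc_id -> score dicts; missing scores default to zero."""
    def normalize(scores):
        vals = np.array(list(scores.values()))
        lo, hi = vals.min(), vals.max()
        return {d: (s - lo) / (hi - lo + 1e-9) for d, s in scores.items()}
    b, d = normalize(bm25), normalize(dense)
    docs = set(b) | set(d)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
            for doc in docs}

bm25 = {"doc1": 12.3, "doc2": 9.1, "doc3": 7.7}     # sparse retrieval
dense = {"doc2": 0.82, "doc3": 0.80, "doc4": 0.64}  # dense (e.g. mDPR)
ranked = sorted(hybrid_scores(bm25, dense).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked)  # doc2 rises to the top: supported by both signal types
```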