All Articles

Hot Papers 2021-03-31

1. Rethinking Spatial Dimensions of Vision Transformers

Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh

  • retweets: 2406, favorites: 357 (04/01/2021 09:22:57)
  • links: abs | pdf
  • cs.CV

Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks as an alternative architecture to existing convolutional neural networks (CNNs). Since transformer-based architectures are a recent innovation in computer vision modeling, design conventions for effective architectures remain underexplored. Drawing on the successful design principles of CNNs, we investigate the role of spatial dimension conversion and its effectiveness on transformer-based architectures. We particularly attend to the dimension reduction principle of CNNs: as depth increases, a conventional CNN increases the channel dimension and decreases the spatial dimensions. We empirically show that such spatial dimension reduction is beneficial to a transformer architecture as well, and propose a novel Pooling-based Vision Transformer (PiT) built upon the original ViT model. We show that PiT achieves improved model capability and generalization performance compared with ViT. Through extensive experiments, we further show that PiT outperforms the baseline on several tasks such as image classification, object detection and robustness evaluation. Source codes and ImageNet models are available at https://github.com/naver-ai/pit
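
To make the pooling idea concrete, here is a minimal PyTorch sketch of a token-pooling layer in the spirit of PiT: patch tokens are reshaped back to a 2D grid, downsampled with a strided depthwise convolution (halving each spatial side while doubling channels, mirroring the CNN convention the paper investigates), and the class token is mapped with a linear layer. Layer sizes and details here are illustrative assumptions, not the official implementation at the link above.

```python
import torch
import torch.nn as nn

class TokenPooling(nn.Module):
    """Halve the spatial resolution of ViT tokens while doubling channels.

    A minimal sketch of PiT's pooling idea; the exact pooling operator and
    hyperparameters may differ from the official code.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim * 2, kernel_size=3, stride=2,
                              padding=1, groups=dim)  # depthwise, stride 2
        self.cls_proj = nn.Linear(dim, dim * 2)

    def forward(self, tokens: torch.Tensor, grid_size: int):
        cls_tok, patch_tok = tokens[:, :1], tokens[:, 1:]
        b, n, c = patch_tok.shape
        x = patch_tok.transpose(1, 2).reshape(b, c, grid_size, grid_size)
        x = self.conv(x)                          # (B, 2C, H/2, W/2)
        patch_tok = x.flatten(2).transpose(1, 2)  # back to a token sequence
        cls_tok = self.cls_proj(cls_tok)          # class token handled linearly
        return torch.cat([cls_tok, patch_tok], dim=1)

# Example: a 14x14 grid of 64-dim tokens plus a class token -> 7x7 grid, 128-dim
pool = TokenPooling(dim=64)
out = pool(torch.randn(2, 1 + 14 * 14, 64), grid_size=14)
print(out.shape)  # torch.Size([2, 50, 128])
```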

2. 3D Human Pose and Shape Regression with Pyramidal Mesh Alignment Feedback Loop

Hongwen Zhang, Yating Tian, Xinchi Zhou, Wanli Ouyang, Yebin Liu, Limin Wang, Zhenan Sun

  • retweets: 2346, favorites: 234 (04/01/2021 09:22:57)
  • links: abs | pdf
  • cs.CV

Regression-based methods have recently shown promising results in reconstructing human meshes from monocular images. By directly mapping from raw pixels to model parameters, these methods can produce parametric models in a feed-forward manner via neural networks. However, minor deviations in parameters may lead to noticeable misalignment between the estimated meshes and the image evidence. To address this issue, we propose a Pyramidal Mesh Alignment Feedback (PyMAF) loop that leverages a feature pyramid and explicitly rectifies the predicted parameters based on the mesh-image alignment status in our deep regressor. In PyMAF, given the currently predicted parameters, mesh-aligned evidence is extracted from finer-resolution features accordingly and fed back for parameter rectification. To reduce noise and enhance the reliability of this evidence, an auxiliary pixel-wise supervision is imposed on the feature encoder, which provides mesh-image correspondence guidance for our network to preserve the most relevant information in spatial features. The efficacy of our approach is validated on several benchmarks, including Human3.6M, 3DPW, LSP, and COCO, where experimental results show that our approach consistently improves the mesh-image alignment of the reconstruction. Our code is publicly available at https://hongwenzhang.github.io/pymaf .
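
To illustrate what "mesh-aligned evidence" could look like mechanically, here is a hedged sketch: project the current mesh vertices with an assumed weak-perspective camera and bilinearly sample a feature map at those locations, yielding a vector the regressor can use to rectify its parameters. The camera parameterization and shapes are common conventions in mesh-regression pipelines, not PyMAF's exact interfaces; those live in the linked code.

```python
import torch
import torch.nn.functional as F

def mesh_aligned_features(feat_map, vertices, cam):
    """Sample image features at projected mesh vertices (illustrative sketch).

    feat_map: (B, C, H, W) one level of the feature pyramid
    vertices: (B, N, 3) current mesh estimate in camera space
    cam:      (B, 3) assumed weak-perspective camera (scale, tx, ty)
    Returns (B, N*C) mesh-aligned evidence for parameter rectification.
    """
    s, t = cam[:, :1], cam[:, 1:]                # scale, 2D translation
    xy = vertices[..., :2]                       # orthographic projection
    proj = s.unsqueeze(1) * xy + t.unsqueeze(1)  # in [-1, 1] image coords
    grid = proj.unsqueeze(2)                     # (B, N, 1, 2) for grid_sample
    sampled = F.grid_sample(feat_map, grid, align_corners=False)
    return sampled.squeeze(-1).transpose(1, 2).reshape(vertices.shape[0], -1)

# Toy usage: an 8x8 feature map and 431 downsampled mesh vertices
feats = mesh_aligned_features(torch.randn(2, 32, 8, 8),
                              torch.randn(2, 431, 3),
                              torch.tensor([[1.0, 0.0, 0.0]] * 2))
print(feats.shape)  # torch.Size([2, 13792])
```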

3. In-Place Scene Labelling and Understanding with Implicit Scene Representation

Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, Andrew J. Davison

  • retweets: 506, favorites: 139 (04/01/2021 09:22:57)
  • links: abs | pdf
  • cs.CV

Semantic labelling is highly correlated with geometry and radiance reconstruction, as scene entities with similar shape and appearance are more likely to come from similar classes. Recent implicit neural reconstruction techniques are appealing as they do not require prior training data, but the same fully self-supervised approach is not possible for semantics because labels are human-defined properties. We extend neural radiance fields (NeRF) to jointly encode semantics with appearance and geometry, so that complete and accurate 2D semantic labels can be achieved from a small number of in-place annotations specific to the scene. The intrinsic multi-view consistency and smoothness of NeRF benefit semantics by enabling sparse labels to propagate efficiently. We show the benefit of this approach when labels are either sparse or very noisy in room-scale scenes. We demonstrate its advantageous properties in various interesting applications such as an efficient scene labelling tool, novel semantic view synthesis, label denoising, super-resolution, label interpolation and multi-view semantic label fusion in visual semantic mapping systems.
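
A minimal sketch of the core architectural change: alongside density and view-dependent colour, the field predicts view-invariant semantic logits, which can then be composited along each ray with the same density-derived weights as colour and supervised with cross-entropy on the sparse annotations. Layer sizes and encoding dimensions below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SemanticNeRF(nn.Module):
    """NeRF MLP with an extra semantic head (illustrative sketch).

    Density and semantic logits depend only on position, while colour also
    takes the view direction, so labels stay multi-view consistent.
    """

    def __init__(self, pos_dim=63, dir_dim=27, hidden=256, num_classes=13):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(pos_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma = nn.Linear(hidden, 1)          # volume density
        self.sem = nn.Linear(hidden, num_classes)  # view-invariant logits
        self.rgb = nn.Sequential(nn.Linear(hidden + dir_dim, hidden // 2),
                                 nn.ReLU(), nn.Linear(hidden // 2, 3))

    def forward(self, pos_enc, dir_enc):
        h = self.trunk(pos_enc)
        return (self.sigma(h),
                self.sem(h),
                self.rgb(torch.cat([h, dir_enc], dim=-1)))

# Per-ray, the semantic logits would be alpha-composited with the same
# density-derived weights as colour, then trained with cross-entropy
# against the sparse in-place annotations.
sigma, sem_logits, rgb = SemanticNeRF()(torch.randn(1024, 63),
                                        torch.randn(1024, 27))
print(sigma.shape, sem_logits.shape, rgb.shape)
```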

4. High-fidelity Face Tracking for AR/VR via Deep Lighting Adaptation

Lele Chen, Chen Cao, Fernando De la Torre, Jason Saragih, Chenliang Xu, Yaser Sheikh

3D video avatars can empower virtual communications by providing compression, privacy, entertainment, and a sense of presence in AR/VR. The best 3D photo-realistic AR/VR avatars driven by video, those that minimize uncanny effects, rely on person-specific models. However, existing person-specific photo-realistic 3D models are not robust to lighting, so their results typically miss subtle facial behaviors and produce artifacts in the avatar. This is a major drawback for the scalability of these models in communication systems (e.g., Messenger, Skype, FaceTime) and AR/VR. This paper addresses these limitations by learning a deep lighting model that, in combination with a high-quality 3D face tracking algorithm, provides a method for subtle and robust facial motion transfer from a regular video to a 3D photo-realistic avatar. Extensive experimental validation and comparisons to other state-of-the-art methods demonstrate the effectiveness of the proposed framework in real-world scenarios with variability in pose, expression, and illumination. Please visit https://www.youtube.com/watch?v=dtz1LgZR8cc for more results. Our project page can be found at https://www.cs.rochester.edu/u/lchen63.

5. Unsupervised Learning of 3D Object Categories from Videos in the Wild

Philipp Henzler, Jeremy Reizenstein, Patrick Labatut, Roman Shapovalov, Tobias Ritschel, Andrea Vedaldi, David Novotny

  • retweets: 342, favorites: 97 (04/01/2021 09:22:58)
  • links: abs | pdf
  • cs.CV | cs.LG

Our goal is to learn a deep network that, given a small number of images of an object of a given category, reconstructs it in 3D. While several recent works have obtained analogous results using synthetic data or assuming the availability of 2D primitives such as keypoints, we are interested in working with challenging real data and with no manual annotations. We thus focus on learning a model from multiple views of a large collection of object instances. We contribute a new large dataset of object-centric videos suitable for training and benchmarking this class of models. We show that existing techniques leveraging meshes, voxels, or implicit surfaces, which work well for reconstructing isolated objects, fail on this challenging data. Finally, we propose a new neural network design, called warp-conditioned ray embedding (WCR), which significantly improves reconstruction while obtaining a detailed implicit representation of the object surface and texture, and also compensates for the noise in the initial SfM reconstruction that bootstrapped the learning process. Our evaluation demonstrates performance improvements over several deep monocular reconstruction baselines on existing benchmarks and on our novel dataset.

6. Broaden Your Views for Self-Supervised Video Learning

Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord, Andrew Zisserman

  • retweets: 283, favorites: 121 (04/01/2021 09:22:58)
  • links: abs | pdf
  • cs.CV

Most successful self-supervised learning methods are trained to align the representations of two independent views of the data. State-of-the-art methods in video are inspired by image techniques, where these two views are similarly extracted by cropping and augmenting the resulting crop. However, these methods miss a crucial element in the video domain: time. We introduce BraVe, a self-supervised learning framework for video. In BraVe, one of the views has access to a narrow temporal window of the video while the other view has broad access to the video content. Our models learn to generalise from the narrow view to the general content of the video. Furthermore, BraVe processes the views with different backbones, enabling the use of alternative augmentations or modalities for the broad view, such as optical flow, randomly convolved RGB frames, audio, or their combinations. We demonstrate that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks including UCF101, HMDB51, Kinetics, ESC-50 and AudioSet.
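
The central recipe, sketched below under stated assumptions: sample a short clip and a long clip from the same video, process them with separate backbones, and regress the narrow view's prediction onto the broad view's representation. The BYOL-style normalized regression loss here is a common stand-in for this family of objectives, not necessarily BraVe's exact loss.

```python
import torch
import torch.nn.functional as F

def sample_views(video, narrow_len=16, broad_len=64):
    """Sample a short and a long clip from the same video along time.

    video: (T, C, H, W). In the paper the broad view may also come from a
    different modality (optical flow, audio); plain RGB is used here for
    simplicity.
    """
    T = video.shape[0]
    n0 = torch.randint(0, T - narrow_len + 1, (1,)).item()
    b0 = torch.randint(0, T - broad_len + 1, (1,)).item()
    return video[n0:n0 + narrow_len], video[b0:b0 + broad_len]

def narrow_to_broad_loss(narrow_pred, broad_target):
    """BYOL-style normalized regression from the narrow view's prediction
    onto the broad view's representation (a stand-in objective)."""
    n = F.normalize(narrow_pred, dim=-1)
    b = F.normalize(broad_target.detach(), dim=-1)  # stop-gradient on target
    return (2 - 2 * (n * b).sum(dim=-1)).mean()

# Toy usage with random tensors standing in for the two backbones' outputs
video = torch.randn(100, 3, 112, 112)
narrow_clip, broad_clip = sample_views(video)
loss = narrow_to_broad_loss(torch.randn(8, 128), torch.randn(8, 128))
```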

7. Learning monocular 3D reconstruction of articulated categories from motion

Filippos Kokkinos, Iasonas Kokkinos

Monocular 3D reconstruction of articulated object categories is challenging due to the lack of training data and the inherent ill-posedness of the problem. In this work we use video self-supervision, forcing the consistency of consecutive 3D reconstructions through a motion-based cycle loss. This substantially improves both optimization-based and learning-based 3D mesh reconstruction. We further introduce an interpretable model of 3D template deformations that controls a 3D surface through the displacement of a small number of local, learnable handles. We formulate this operation as a structured layer relying on mesh-Laplacian regularization and show that it can be trained in an end-to-end manner. We finally introduce a per-sample numerical optimisation approach that jointly optimises over mesh displacements and cameras within a video, boosting accuracy both during training and as test-time post-processing. While relying exclusively on a small set of videos collected per category for supervision, we obtain state-of-the-art reconstructions with diverse shapes, viewpoints and textures for multiple articulated object categories.
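
One way to realize a motion-based consistency loss, assuming projected vertices and an optical flow field are available: the 2D motion induced by consecutive mesh reconstructions should match the observed flow sampled at the vertex locations. This is a hedged sketch of the idea; the paper's exact cycle formulation may differ.

```python
import torch
import torch.nn.functional as F

def sample_flow(flow, pts):
    """Bilinearly sample a flow field (B, 2, H, W) at points in [-1, 1]."""
    grid = pts.unsqueeze(2)                         # (B, N, 1, 2)
    out = F.grid_sample(flow, grid, align_corners=False)
    return out.squeeze(-1).transpose(1, 2)          # (B, N, 2)

def motion_cycle_loss(proj_t, proj_t1, flow):
    """Consistency between mesh-induced motion and observed optical flow.

    proj_t, proj_t1: (B, N, 2) projected mesh vertices at frames t and t+1,
    in normalized [-1, 1] coordinates. The 2D motion induced by consecutive
    reconstructions should agree with the flow sampled at the frame-t
    locations.
    """
    induced = proj_t1 - proj_t
    observed = sample_flow(flow, proj_t)
    return ((induced - observed) ** 2).mean()

# Toy usage
p_t, p_t1 = torch.rand(2, 500, 2) * 2 - 1, torch.rand(2, 500, 2) * 2 - 1
print(motion_cycle_loss(p_t, p_t1, torch.randn(2, 2, 64, 64)))
```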

8. Foveated Neural Radiance Fields for Real-Time and Egocentric Virtual Reality

Nianchen Deng, Zhenyi He, Jiannan Ye, Praneeth Chakravarthula, Xubo Yang, Qi Sun

  • retweets: 149, favorites: 71 (04/01/2021 09:22:58)
  • links: abs | pdf
  • cs.GR | cs.CV

Traditional high-quality 3D graphics requires large volumes of fine-detailed scene data for rendering. This demand compromises computational efficiency and local storage resources, and is especially concerning for future wearable and portable virtual and augmented reality (VR/AR) displays. Recent approaches to combat this problem include remote rendering/streaming and neural representations of 3D assets. These approaches have redefined the traditional local storage-rendering pipeline through distributed computing or compression of large data. However, these methods typically suffer from high latency or low quality for practical visualization of large immersive virtual scenes, notably given the extra-high resolution and refresh-rate requirements of VR applications such as gaming and design. Tailored for future portable, low-storage, and energy-efficient VR platforms, we present the first gaze-contingent 3D neural representation and view synthesis method. We incorporate the human psychophysics of visual and stereo acuity into an egocentric neural representation of 3D scenery. Furthermore, we jointly optimize latency/performance and visual quality, while mutually bridging human perception and neural scene synthesis, to achieve perceptually high-quality immersive interaction. Both objective analysis and a subjective study demonstrate the effectiveness of our approach in significantly reducing local storage volume and synthesis latency (up to 99% reduction in both data size and computational time), while simultaneously presenting high-fidelity rendering with perceptual quality identical to that of fully locally stored and rendered high-quality imagery.

9. Symbolic Music Generation with Diffusion Models

Gautam Mittal, Jesse Engel, Curtis Hawthorne, Ian Simon

Score-based generative models and diffusion probabilistic models have been successful at generating high-quality samples in continuous domains such as images and audio. However, due to their Langevin-inspired sampling mechanisms, their application to discrete and sequential data has been limited. In this work, we present a technique for training diffusion models on sequential data by parameterizing the discrete domain in the continuous latent space of a pre-trained variational autoencoder. Our method is non-autoregressive: it learns to generate sequences of latent embeddings through the reverse diffusion process and offers parallel generation with a constant number of iterative refinement steps. We apply this technique to modeling symbolic music and show strong unconditional generation and post-hoc conditional infilling results compared to autoregressive language models operating over the same continuous embeddings.
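
The training recipe can be sketched as a standard DDPM noise-prediction step applied in the frozen VAE's latent space. Everything below (the noise schedule, the `vae_encoder` and `denoiser` interfaces) is a generic assumption rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

# A standard DDPM noise schedule; hyperparameters are generic defaults.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def latent_diffusion_step(denoiser, vae_encoder, batch):
    """One noise-prediction training step on VAE latents (illustrative).

    `vae_encoder` is a frozen, pre-trained encoder mapping discrete event
    sequences to continuous embeddings of shape (B, L, D); `denoiser`
    predicts the added noise given noisy latents and timesteps. Both are
    assumed interfaces, not the paper's exact models.
    """
    with torch.no_grad():
        z0 = vae_encoder(batch)                       # (B, L, D) latents
    B = z0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,))   # random timesteps
    a = alphas_cumprod[t].view(B, 1, 1)
    eps = torch.randn_like(z0)
    zt = a.sqrt() * z0 + (1.0 - a).sqrt() * eps       # forward noising
    return F.mse_loss(denoiser(zt, t), eps)           # predict the noise
```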

10. Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers

Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

  • retweets: 156, favorites: 49 (04/01/2021 09:22:59)
  • links: abs | pdf
  • cs.CV

Our objective is language-based search of large-scale image and video datasets. For this task, the approach of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive because retrieval scales well and remains efficient for billions of images using approximate nearest-neighbour search. An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings, but is often inapplicable in practice for large-scale retrieval given the cost of the cross-attention mechanism required for each sample at test time. This work combines the best of both worlds. We make the following three contributions. First, we equip transformer-based models with a new fine-grained cross-attention architecture, providing significant improvements in retrieval accuracy whilst preserving scalability. Second, we introduce a generic approach for combining a Fast dual encoder model with our Slow but accurate transformer-based model via distillation and re-ranking. Finally, we validate our approach on the Flickr30K image dataset, where we show an increase in inference speed of several orders of magnitude while achieving results competitive with the state of the art. We also extend our method to the video domain, improving the state of the art on the VATEX dataset.
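
At inference time, the Fast/Slow combination reduces to shortlist-then-re-rank: the dual encoder scores the whole gallery with one matrix product, and only the top candidates are passed through the expensive cross-attention model. A hedged sketch, omitting the distillation component; the `slow_scorer` interface is an assumption for illustration.

```python
import torch

def retrieve_and_rerank(query_emb, gallery_embs, slow_scorer, k=50):
    """Fast dual-encoder retrieval followed by slow re-ranking (sketch).

    query_emb:    (D,) text embedding from the fast dual encoder
    gallery_embs: (N, D) precomputed image/video embeddings
    slow_scorer:  callable mapping a candidate index to a cross-attention
                  relevance score -- an assumed interface, not the paper's API
    """
    fast_scores = gallery_embs @ query_emb        # one matmul over N items
    shortlist = fast_scores.topk(k).indices       # cheap candidate set
    slow_scores = torch.tensor(
        [float(slow_scorer(int(i))) for i in shortlist])
    return shortlist[slow_scores.argsort(descending=True)]

# Toy usage: random embeddings and a dummy slow model
gallery = torch.nn.functional.normalize(torch.randn(10000, 256), dim=-1)
query = torch.nn.functional.normalize(torch.randn(256), dim=-1)
ranked = retrieve_and_rerank(query, gallery,
                             slow_scorer=lambda i: float(i % 7), k=20)
print(ranked[:5])
```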

11. Detecting informative higher-order interactions in statistically validated hypergraphs

Federico Musciotto, Federico Battiston, Rosario N. Mantegna

Recent empirical evidence has shown that in many real-world systems, successfully represented as networks, interactions are not limited to dyads, but often involve three or more agents at a time. Such data are better described by hypergraphs, where hyperlinks encode higher-order interactions among a group of nodes. In spite of the large number of works on networks, highlighting informative hyperlinks in hypergraphs obtained from real-world data is still an open problem. Here we propose an analytic approach to filter hypergraphs by identifying those hyperlinks that are over-expressed with respect to a random null hypothesis, and that represent the most relevant higher-order connections. We apply our method to a class of synthetic benchmarks and to several datasets. In all cases, the method highlights hyperlinks that are more informative than those extracted with pairwise approaches. Our method provides a first way to obtain statistically validated hypergraphs, separating informative connections from redundant and noisy ones.
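
The shape of the filtering pipeline can be illustrated with a deliberately simple null model: count how often each hyperlink occurs, compare against an independence baseline, and keep links whose counts are statistically over-expressed after multiple-testing correction. The null model and test below are toy stand-ins for the paper's statistical validation, chosen only to show the logic.

```python
from collections import Counter
from math import comb

def binom_sf(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

def validated_hyperlinks(observations, alpha=0.05):
    """Keep hyperlinks over-expressed with respect to a simple null model.

    observations: list of frozensets, one per observed group interaction.
    Null used here: each node enters any interaction independently at its
    empirical rate -- an assumed stand-in for the paper's null hypothesis.
    One-sided binomial test with Bonferroni correction.
    """
    n = len(observations)
    link_counts = Counter(observations)
    node_counts = Counter(v for obs in observations for v in obs)
    validated = []
    for link, k in link_counts.items():
        p_null = 1.0
        for v in link:                 # joint occurrence under independence
            p_null *= node_counts[v] / n
        p_value = binom_sf(k, n, p_null)
        if p_value < alpha / len(link_counts):     # Bonferroni correction
            validated.append((link, p_value))
    return validated

# Toy usage: a triangle repeated far more often than one-off pairs
obs = ([frozenset({"a", "b", "c"})] * 12
       + [frozenset({chr(100 + i), chr(105 + i)}) for i in range(5)])
print(validated_hyperlinks(obs))  # flags {a, b, c}; the pairs are not flagged
```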

12. Benchmarking Representation Learning for Natural World Image Collections

Grant Van Horn, Elijah Cole, Sara Beery, Kimberly Wilber, Serge Belongie, Oisin Mac Aodha

  • retweets: 121, favorites: 44 (04/01/2021 09:22:59)
  • links: abs | pdf
  • cs.CV

Recent progress in self-supervised learning has resulted in models that are capable of extracting rich representations from image collections without requiring any explicit label supervision. However, to date the vast majority of these approaches have restricted themselves to training on standard benchmark datasets such as ImageNet. We argue that fine-grained visual categorization problems, such as plant and animal species classification, provide an informative testbed for self-supervised learning. In order to facilitate progress in this area we present two new natural world visual classification datasets, iNat2021 and NeWT. The former consists of 2.7M images from 10k different species uploaded by users of the citizen science application iNaturalist. We designed the latter, NeWT, in collaboration with domain experts with the aim of benchmarking the performance of representation learning algorithms on a suite of challenging natural world binary classification tasks that go beyond standard species classification. These two new datasets allow us to explore questions related to large-scale representation and transfer learning in the context of fine-grained categories. We provide a comprehensive analysis of feature extractors trained with and without supervision on ImageNet and iNat2021, shedding light on the strengths and weaknesses of different learned features across a diverse set of tasks. We find that features produced by standard supervised methods still outperform those produced by self-supervised approaches such as SimCLR. However, improved self-supervised learning methods are constantly being released and the iNat2021 and NeWT datasets are a valuable resource for tracking their progress.
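
Comparisons of frozen feature extractors of this kind typically follow a linear-probe protocol; the sketch below shows the generic version of such an evaluation, not necessarily the paper's exact settings or hyperparameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_y, test_feats, test_y):
    """Fit a linear classifier on frozen features and report test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_y)
    return clf.score(test_feats, test_y)

# Toy usage: random features standing in for a frozen backbone's output
rng = np.random.default_rng(0)
x_tr, x_te = rng.normal(size=(200, 64)), rng.normal(size=(50, 64))
y_tr, y_te = rng.integers(0, 2, 200), rng.integers(0, 2, 50)
print(linear_probe_accuracy(x_tr, y_tr, x_te, y_te))
```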

13. ZX-Calculus and Extended Wolfram Model Systems II: Fast Diagrammatic Reasoning with an Application to Quantum Circuit Simplification

Jonathan Gorard, Manojna Namuduri, Xerxes D. Arsiwalla

  • retweets: 90, favorites: 56 (04/01/2021 09:22:59)
  • links: abs | pdf
  • cs.LO | cs.DM

This article presents a novel algorithmic methodology for performing automated diagrammatic deductions over combinatorial structures, using a combination of modified equational theorem-proving techniques and the extended Wolfram model hypergraph rewriting formalism developed by the authors in previous work. We focus especially upon the application of this new algorithm to the problem of automated circuit simplification in quantum information theory, using Wolfram model multiway operator systems combined with the ZX-calculus formalism for enacting fast diagrammatic reasoning over linear transformations between qubits. We show how to construct a generalization of the deductive inference rules for Knuth-Bendix completion in which equation matches are selected on the basis of causal edge density in the associated multiway system, and then demonstrate how to embed the higher-order logic of the ZX-calculus rules within this first-order equational framework. After showing explicitly how the (hyper)graph rewritings of both Wolfram model systems and the ZX-calculus can be effectively realized within this formalism, we exhibit comparisons of time complexity vs. proof complexity for this new algorithmic approach when simplifying randomly-generated Clifford circuits down to pseudo-normal form, as well as when reducing the number of T-gates in randomly-generated non-Clifford circuits, with circuit sizes ranging up to 3000 gates. The method performs favorably in comparison with existing circuit simplification frameworks, and employing the causal edge density optimization yields an approximately quadratic speedup. Finally, we present a worked example of an automated proof of correctness for a simple quantum teleportation protocol, to demonstrate more clearly the internal operations of the theorem-proving procedure.

14. pH-RL: A personalization architecture to bring reinforcement learning to health practice

Ali el Hassouni, Mark Hoogendoorn, Marketa Ciharova, Annet Kleiboer, Khadicha Amarti, Vesa Muhonen, Heleen Riper, A. E. Eiben

  • retweets: 108, favorites: 4 (04/01/2021 09:22:59)
  • links: abs | pdf
  • cs.AI | cs.CY

While reinforcement learning (RL) has proven to be the approach of choice for tackling many complex problems, it remains challenging to develop and deploy RL agents in real-life scenarios successfully. This paper presents pH-RL (personalization in e-Health with RL), a general RL architecture for personalization that brings RL to health practice. pH-RL supports various levels of personalization in health applications and allows for both online and batch learning. Furthermore, we provide a general-purpose implementation framework that can be integrated with various healthcare applications. We describe a step-by-step guideline for the successful deployment of RL policies in a mobile application. We implemented our open-source RL architecture and integrated it with the MoodBuster mobile application for mental health to provide messages that increase daily adherence to the online therapeutic modules. We then performed a comprehensive study with human participants over a sustained period. Our experimental results show that the developed policies learn to select appropriate actions consistently using only a few days’ worth of data. Furthermore, we empirically demonstrate the stability of the learned policies during the study.

15. SIMstack: A Generative Shape and Instance Model for Unordered Object Stacks

Zoe Landgraf, Raluca Scona, Tristan Laidlow, Stephen James, Stefan Leutenegger, Andrew J. Davison

  • retweets: 20, favorites: 70 (04/01/2021 09:22:59)
  • links: abs | pdf
  • cs.CV

By estimating 3D shape and instances from a single view, we can capture information about an environment quickly, without the need for comprehensive scanning and multi-view fusion. Solving this task for composite scenes (such as object stacks) is challenging: occluded areas are ambiguous not only in shape but also in instance segmentation, and multiple decompositions could be valid. We observe that physics constrains decomposition as well as shape in occluded regions, and hypothesise that a latent space learned from scenes built under physics simulation can serve as a prior for better predicting shape and instances in occluded regions. To this end we propose SIMstack, a depth-conditioned Variational Auto-Encoder (VAE) trained on a dataset of objects stacked under physics simulation. We formulate instance segmentation as a centre voting task, which allows for class-agnostic detection and does not require setting a maximum number of objects in the scene. At test time, our model can generate 3D shape and instance segmentation from a single depth view, probabilistically sampling proposals for the occluded region from the learned latent space. Our method has practical applications in giving robots some of the ability humans have to make rapid, intuitive inferences about partially observed scenes. We demonstrate an application to precise (non-disruptive) grasping of unknown objects from a single depth view.
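
Centre voting is what frees the model from a fixed maximum object count: every point votes for its instance centre, and votes are clustered into however many centres emerge. Below is a hedged, greedy-clustering sketch of that idea; the paper's actual voting and clustering details may differ.

```python
import numpy as np

def cluster_centre_votes(coords, offsets, radius=0.05):
    """Greedy clustering of centre votes into instances (illustrative sketch).

    coords:  (N, 3) occupied voxel/surface point coordinates
    offsets: (N, 3) predicted offsets from each point to its instance centre
    Each point votes for a centre; votes within `radius` of an established
    centre join that instance, so the instance count is not fixed in advance.
    """
    votes = coords + offsets
    centres, labels = [], np.full(len(votes), -1)
    for i, v in enumerate(votes):
        for j, c in enumerate(centres):
            if np.linalg.norm(v - c) < radius:
                labels[i] = j          # join an existing instance
                break
        else:
            centres.append(v.copy())   # found a new instance centre
            labels[i] = len(centres) - 1
    return np.asarray(centres), labels

# Toy usage: two well-separated clusters of votes
coords = np.concatenate([np.zeros((10, 3)), np.ones((10, 3))])
offsets = np.zeros((20, 3))
centres, labels = cluster_centre_votes(coords, offsets)
print(len(centres), labels)  # 2 instances discovered, no preset maximum
```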

16. Cognitive networks identify the content of English and Italian popular posts about COVID-19 vaccines: Anticipation, logistics, conspiracy and loss of trust

Massimo Stella, Michael S. Vitevitch, Federico Botta

Monitoring social discourse about COVID-19 vaccines is key to understanding how large populations perceive vaccination campaigns. We focus on 4765 unique popular tweets in English or Italian about COVID-19 vaccines between 12/2020 and 03/2021. One popular English tweet was liked up to 495,000 times, underscoring how popular tweets can cognitively affect massive populations. We investigate both text and multimedia in tweets, building a knowledge graph of syntactic/semantic associations in messages, including visual features, and indicating how online users framed social discourse mostly around the logistics of vaccine distribution. The English semantic frame of “vaccine” was highly polarised between trust/anticipation (towards the vaccine as a scientific asset saving lives) and anger/sadness (mentioning critical issues with dose administering). Semantic associations between “vaccine,” “hoax” and conspiratorial jargon indicated the persistence of conspiracy theories about vaccines in massively read English posts (absent in Italian messages). The image analysis found that popular tweets with images of people wearing face masks used language lacking the trust and joy found in tweets showing people with no masks, indicating a negative affect attributed to face covering in social discourse. A behavioural analysis revealed a tendency for users to share content eliciting joy, sadness and disgust, and to like sad messages less, highlighting an interplay between emotions and content diffusion beyond sentiment. With the AstraZeneca vaccine being suspended in mid-March 2021, “AstraZeneca” was associated with trustful language driven by experts, while popular Italian tweets framed “vaccine” by crucially replacing earlier levels of trust with deep sadness. Our results stress how cognitive networks and innovative multimedia processing open new ways to reconstruct online perceptions about vaccines and trust.

17. Tigris: a DSL and Framework for Monitoring Software Systems at Runtime

Jhonny Mertz, Ingrid Nunes

  • retweets: 35, favorites: 26 (04/01/2021 09:23:00)
  • links: abs | pdf
  • cs.SE

The understanding of the behavioral aspects of a software system is an essential enabler for many software engineering activities, such as adaptation. This involves collecting runtime data from the system so that the collected data can be analyzed to guide actions upon the system. Consequently, software monitoring poses practical challenges, because it is often done by intercepting the system execution and recording the gathered information; such monitoring may degrade performance and disrupt the system execution to unacceptable levels. In this paper, we introduce a two-phase monitoring approach to support the monitoring step in adaptive systems. The first phase collects lightweight, coarse-grained information and identifies relevant parts of the software that should be monitored in detail, based on a provided domain-specific language informed by a systematic literature review. The second phase collects the relevant, fine-grained information needed for deciding whether and how to adapt the managed system. Our approach is implemented as a framework, called Tigris, that can be seamlessly integrated into existing software systems to support monitoring-based activities. To validate our proposal, we instantiated Tigris to support an application-level caching approach, which adapts the caching decisions of a software system at runtime to improve its performance.
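
The two-phase idea maps naturally onto lightweight and heavyweight probes: count everything cheaply first, then attach detailed instrumentation only to the parts flagged as relevant. The Python sketch below is only an illustration of that concept; Tigris drives the phase-two selection from its DSL, whereas the hard-coded threshold here is a placeholder.

```python
import time
from collections import Counter
from functools import wraps

call_counts = Counter()   # phase 1: lightweight, coarse-grained data
latencies = {}            # phase 2: detailed data for selected methods only

def coarse_monitor(fn):
    """Phase-1 probe: count invocations with negligible overhead."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        call_counts[fn.__name__] += 1
        return fn(*args, **kwargs)
    return wrapper

def fine_monitor(fn):
    """Phase-2 probe: record per-call latency for a method deemed relevant."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        latencies.setdefault(fn.__name__, []).append(
            time.perf_counter() - start)
        return result
    return wrapper

def select_for_phase_two(threshold=1000):
    """Promote frequently called methods to detailed monitoring. Tigris
    expresses this criterion in its DSL; the threshold is illustrative."""
    return [name for name, count in call_counts.items() if count >= threshold]
```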

18. What Causes Optical Flow Networks to be Vulnerable to Physical Adversarial Attacks

Simon Schrodi, Tonmoy Saikia, Thomas Brox

  • retweets: 30, favorites: 28 (04/01/2021 09:23:00)
  • links: abs | pdf
  • cs.CV

Recent work demonstrated the lack of robustness of optical flow networks to physical, patch-based adversarial attacks. The possibility of physically attacking a basic component of automotive systems is a reason for serious concern. In this paper, we analyze the cause of the problem and show that the lack of robustness is rooted in the classical aperture problem of optical flow estimation, combined with bad choices in the details of the network architecture. We show how these mistakes can be rectified to make optical flow networks robust to physical, patch-based attacks.