1. RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition
Xiaohan Ding, Xiangyu Zhang, Jungong Han, Guiguang Ding
We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition, which is composed of a series of fully-connected (FC) layers. Compared to convolutional layers, FC layers are more efficient, better at modeling long-range dependencies and positional patterns, but worse at capturing local structures, and hence usually less favored for image recognition. We propose a structural re-parameterization technique that adds a local prior into an FC layer to make it powerful for image recognition. Specifically, we construct convolutional layers inside a RepMLP during training and merge them into the FC layer for inference. On CIFAR, a simple pure-MLP model shows performance very close to CNNs. By inserting RepMLP into traditional CNNs, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs. Our intriguing findings highlight that combining the global representational capacity and positional perception of FC layers with the local prior of convolution can improve the performance of neural networks at higher speed, both on tasks with translation invariance (e.g., semantic segmentation) and on those with aligned images and positional patterns (e.g., face recognition). The code and models are available at https://github.com/DingXiaoH/RepMLP.
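The core re-parameterization trick is that an FC layer acting on a flattened feature map is a strict superset of a convolution, so a trained conv kernel can be folded into an equivalent FC weight after training. Below is a minimal PyTorch sketch of that folding; the helper name and shapes are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

def conv_to_fc(conv_weight, in_channels, height, width, padding):
    """Convert a KxK conv kernel into an equivalent FC weight matrix.

    The FC acts on flattened (C*H*W) vectors. Its weight is obtained by
    pushing an identity matrix (one basis vector per input element)
    through the convolution: column j of the FC weight is the conv
    response to the j-th basis input.
    """
    num_in = in_channels * height * width
    identity = torch.eye(num_in).reshape(num_in, in_channels, height, width)
    out = F.conv2d(identity, conv_weight, padding=padding)   # (num_in, C_out, H, W)
    return out.reshape(num_in, -1).t()                       # (C_out*H*W, C*H*W)

# Check equivalence on a random input.
C, H, W, C_out, K = 4, 8, 8, 4, 3
conv_w = torch.randn(C_out, C, K, K)
x = torch.randn(1, C, H, W)

y_conv = F.conv2d(x, conv_w, padding=K // 2).reshape(1, -1)
fc_w = conv_to_fc(conv_w, C, H, W, padding=K // 2)
y_fc = F.linear(x.reshape(1, -1), fc_w)

print(torch.allclose(y_conv, y_fc, atol=1e-4))  # True: the FC reproduces the conv
```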
RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition
— AK (@ak92501) May 6, 2021
pdf: https://t.co/eYnF2KV1oB
abs: https://t.co/GodBbmtL93
multi-layer-perceptron-style neural network building block for image recognition, composed of a series of fully-connected (FC) layers pic.twitter.com/9iLxArWt6q
Day after MLP-Mixer we have RepMLP :)
— Dmytro Mishkin (@ducha_aiki) May 6, 2021
RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition https://t.co/F7HhtIORts pic.twitter.com/MBKt4SLiRw
The day after the release of MLP-Mixer, we have the same idea appearing independently (https://t.co/MVOuQtb8JM) and the slightly less similar RepMLP, an FC-only architecture (https://t.co/stYSewSrdt).
— Andrei Bursuc (@abursuc) May 6, 2021
Please consider reading and citing not just the MLP-Mixer. https://t.co/LgWvIwqwzN
2. Function4D: Real-time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors
Tao Yu, Zerong Zheng, Kaiwen Guo, Pengpeng Liu, Qionghai Dai, Yebin Liu
Human volumetric capture is a long-standing topic in computer vision and computer graphics. Although high-quality results can be achieved using sophisticated off-line systems, real-time human volumetric capture of complex scenarios, especially using lightweight setups, remains challenging. In this paper, we propose a human volumetric capture method that combines temporal volumetric fusion and deep implicit functions. To achieve high-quality and temporally continuous reconstruction, we propose dynamic sliding fusion to fuse neighboring depth observations together with topology consistency. Moreover, for detailed and complete surface generation, we propose detail-preserving deep implicit functions for RGBD input, which not only preserve the geometric details of the depth inputs but also generate more plausible texturing results. Results and experiments show that our method outperforms existing methods in terms of view sparsity, generalization capacity, reconstruction quality, and run-time efficiency.
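For intuition, a generic pixel-aligned implicit function conditioned on RGBD features might look like the sketch below. This is an assumption-laden toy: the paper's detail-preserving formulation, multi-view handling, and texture branch are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RGBDImplicitFunction(nn.Module):
    """Toy pixel-aligned implicit function for RGBD input (illustrative only).

    For each 3D query point we sample a 2D feature map at the point's image
    projection, append the difference between the point's depth and the
    observed depth at that pixel (a crude local geometry cue), and let an
    MLP predict occupancy.
    """
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, feat_map, depth_map, uv, point_depth):
        # feat_map: (B, C, H, W), depth_map: (B, 1, H, W)
        # uv: (B, N, 2) projections in [-1, 1], point_depth: (B, N)
        grid = uv.unsqueeze(2)                                    # (B, N, 1, 2)
        f = F.grid_sample(feat_map, grid, align_corners=True)     # (B, C, N, 1)
        f = f.squeeze(-1).transpose(1, 2)                         # (B, N, C)
        d_obs = F.grid_sample(depth_map, grid, align_corners=True)
        d_obs = d_obs.squeeze(-1).transpose(1, 2)                 # (B, N, 1)
        psdf = point_depth.unsqueeze(-1) - d_obs                  # offset to observed surface
        occ = torch.sigmoid(self.mlp(torch.cat([f, psdf], dim=-1)))
        return occ.squeeze(-1)                                    # (B, N) occupancy in [0, 1]
```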
Function4D: Real-time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors
— AK (@ak92501) May 6, 2021
pdf: https://t.co/h2HyGwuDm7
abs: https://t.co/Q2TXj6g9Da
project page: https://t.co/fMjut59OjY pic.twitter.com/zNr7BKlYvM
3. Foundations of Intelligence in Natural and Artificial Systems: A Workshop Report
Tyler Millhouse, Melanie Moses, Melanie Mitchell
In March of 2021, the Santa Fe Institute hosted a workshop as part of its Foundations of Intelligence in Natural and Artificial Systems project. This project seeks to advance the field of artificial intelligence by promoting interdisciplinary research on the nature of intelligence. During the workshop, speakers from diverse disciplines gathered to develop a taxonomy of intelligence, articulating their own understanding of intelligence and how their research has furthered that understanding. In this report, we summarize the insights offered by each speaker and identify the themes that emerged during the talks and subsequent discussions.
A report on our recent @sfiscience workshop, "Foundations of Intelligence in Natural and Artificial Systems". https://t.co/HofGBAq3SO
— Melanie Mitchell (@MelMitchell1) May 6, 2021
This was the first in a series of workshops we are organizing as part of an NSF-funded program at SFI. pic.twitter.com/1FAs473SSy
4. Cuboids Revisited: Learning Robust 3D Shape Fitting to Single RGB Images
Florian Kluger, Hanno Ackermann, Eric Brachmann, Michael Ying Yang, Bodo Rosenhahn
Humans perceive and construct the surrounding world as an arrangement of simple parametric models. In particular, man-made environments commonly consist of volumetric primitives such as cuboids or cylinders. Inferring these primitives is an important step to attain high-level, abstract scene descriptions. Previous approaches directly estimate shape parameters from a 2D or 3D input; they can reproduce simple objects but fail to accurately parse more complex 3D scenes. In contrast, we propose a robust estimator for primitive fitting, which can meaningfully abstract real-world environments using cuboids. A RANSAC estimator guided by a neural network fits these primitives to 3D features, such as a depth map. We condition the network on previously detected parts of the scene, thus parsing it one-by-one. To obtain 3D features from a single RGB image, we additionally optimise a feature extraction CNN in an end-to-end manner. However, naively minimising point-to-primitive distances leads to large or spurious cuboids occluding parts of the scene behind them. We thus propose an occlusion-aware distance metric correctly handling opaque scenes. The proposed algorithm does not require labour-intensive labels, such as cuboid annotations, for training. Results on the challenging NYU Depth v2 dataset demonstrate that the proposed algorithm successfully abstracts cluttered real-world 3D scene layouts.
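The estimator follows the neural-guided RANSAC pattern: a network scores points for sampling, minimal sets are drawn, cuboid hypotheses are fitted and then ranked by a (possibly occlusion-aware) distance. A schematic sketch is given below, with fit_cuboid and distance_fn as hypothetical placeholders rather than the paper's solvers.

```python
import torch

def neural_guided_ransac(points, sampling_logits, fit_cuboid, distance_fn,
                         iters=64, sample_size=6, inlier_thresh=0.05):
    """Generic neural-guided RANSAC loop (sketch; not the paper's exact estimator).

    points:          (N, 3) 3D features, e.g. back-projected depth
    sampling_logits: (N,) scores from a guidance network (higher = sample more often)
    fit_cuboid:      hypothetical solver mapping a minimal point set to cuboid params
    distance_fn:     hypothetical point-to-cuboid distance (e.g. occlusion-aware)
    """
    probs = torch.softmax(sampling_logits, dim=0)
    best_params, best_inliers = None, -1
    for _ in range(iters):
        idx = torch.multinomial(probs, sample_size, replacement=False)
        params = fit_cuboid(points[idx])          # hypothesis from a minimal sample
        if params is None:                        # degenerate sample
            continue
        dists = distance_fn(points, params)       # (N,) residuals
        inliers = int((dists < inlier_thresh).sum())
        if inliers > best_inliers:
            best_params, best_inliers = params, inliers
    return best_params, best_inliers
```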
RGB image in, set of 3D primitives out. A #CVPR2021 paper with @florian_kluger, H. Ackermann, M. Yang and B. Rosenhahn! #ComputerVision
— Eric Brachmann (@eric_brachmann) May 6, 2021
abs: https://t.co/mVoz8VGtc4
code: https://t.co/Vhi7PixIbm
We take RANSAC out of its comfort zone into scene understanding territory. 👇 pic.twitter.com/T7v0Ic7HVc
5. 4DComplete: Non-Rigid Motion Estimation Beyond the Observable Surface
Yang Li, Hikari Takehara, Takafumi Taketomi, Bo Zheng, Matthias Nießner
Tracking non-rigidly deforming scenes using range sensors has numerous applications including computer vision, AR/VR, and robotics. However, due to occlusions and physical limitations of range sensors, existing methods only handle the visible surface, thus causing discontinuities and incompleteness in the motion field. To this end, we introduce 4DComplete, a novel data-driven approach that estimates the non-rigid motion for the unobserved geometry. 4DComplete takes as input a partial shape and motion observation, extracts a 4D time-space embedding, and jointly infers the missing geometry and motion field using a sparse fully-convolutional network. For network training, we constructed a large-scale synthetic dataset called DeformingThings4D, which consists of 1972 animation sequences spanning 31 different animal or humanoid categories with dense 4D annotation. Experiments show that 4DComplete 1) reconstructs high-resolution volumetric shape and motion fields from a partial observation, 2) learns an entangled 4D feature representation that benefits both shape and motion estimation, 3) yields more accurate and natural deformation than classic non-rigid priors such as As-Rigid-As-Possible (ARAP) deformation, and 4) generalizes well to unseen objects in real-world sequences.
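A toy, dense stand-in for the joint shape-and-motion completion idea is sketched below. The actual model is a sparse fully-convolutional network over 4D features; this simplification only illustrates the shared trunk with two output heads.

```python
import torch
import torch.nn as nn

class JointCompletionNet(nn.Module):
    """Toy dense stand-in for a sparse fully-convolutional completion network.

    Input:  partial occupancy (1 channel) + partial motion field (3 channels)
    Output: completed occupancy logits (1 channel) + dense motion field (3 channels)
    """
    def __init__(self, ch=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(4, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.occ_head = nn.Conv3d(ch, 1, 3, padding=1)
        self.motion_head = nn.Conv3d(ch, 3, 3, padding=1)

    def forward(self, partial_occ, partial_motion):
        x = torch.cat([partial_occ, partial_motion], dim=1)   # (B, 4, D, H, W)
        feat = self.decoder(self.encoder(x))
        return self.occ_head(feat), self.motion_head(feat)
```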
4DComplete: Non-Rigid Motion Estimation Beyond the Observable Surface
— AK (@ak92501) May 6, 2021
pdf: https://t.co/I5Oa419DjR
abs: https://t.co/Rpin7QZzvw
a novel data-driven approach that estimates the non-rigid motion for the unobserved geometry pic.twitter.com/u3C0Tkm9as
6. Software Engineering for AI-Based Systems: A Survey
Silverio Martínez-Fernández, Justus Bogner, Xavier Franch, Marc Oriol, Julien Siebert, Adam Trendowicz, Anna Maria Vollmer, Stefan Wagner
AI-based systems are software systems with functionalities enabled by at least one AI component (e.g., for image and speech recognition or autonomous driving). AI-based systems are becoming pervasive in society due to advances in AI. However, there is limited synthesized knowledge on Software Engineering (SE) approaches for building, operating, and maintaining AI-based systems. To collect and analyze state-of-the-art knowledge about SE for AI-based systems, we conducted a systematic mapping study. We considered 248 studies published between January 2010 and March 2020. SE for AI-based systems is an emerging research area, where more than 2/3 of the studies have been published since 2018. The most studied properties of AI-based systems are dependability and safety. We identified multiple SE approaches for AI-based systems, which we classified according to the SWEBOK areas. Studies related to software testing and software quality are very prevalent, while areas like software maintenance seem neglected. Data-related issues are the most recurrent challenges. Our results are valuable for: researchers, to quickly understand the state of the art and learn which topics need more research; practitioners, to learn about the approaches and challenges that SE entails for AI-based systems; and educators, to bridge the gap between SE and AI in their curricula.
It has been a lot of work, but I am very proud of this literature survey of research on software engineering for AI-based systems. It is available now on arXiv: https://t.co/Z8FQRzX5gL
— Stefan Wagner (@prof_wagnerst) May 6, 2021
Feedback is welcome!
7. Self-Supervised Multi-Frame Monocular Scene Flow
Junhwa Hur, Stefan Roth
Estimating 3D scene flow from a sequence of monocular images has been gaining increased attention due to the simple, economical capture setup. Owing to the severe ill-posedness of the problem, the accuracy of current methods has been limited, especially that of efficient, real-time approaches. In this paper, we introduce a multi-frame monocular scene flow network based on self-supervised learning, improving the accuracy over previous networks while retaining real-time efficiency. Based on an advanced two-frame baseline with a split-decoder design, we propose (i) a multi-frame model using a triple frame input and convolutional LSTM connections, (ii) an occlusion-aware census loss for better accuracy, and (iii) a gradient detaching strategy to improve training stability. On the KITTI dataset, we observe state-of-the-art accuracy among monocular scene flow methods based on self-supervised learning.
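The occlusion-aware census loss can be approximated as follows: a soft census descriptor per pixel, a robust penalty on the descriptor difference between the reference and warped frames, and an occlusion mask that excludes pixels invisible in one of the frames. This is a simplified sketch, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def census_transform(img, patch=7):
    """Soft census descriptor: per-pixel differences to its patch neighbours."""
    B, C, H, W = img.shape
    gray = img.mean(dim=1, keepdim=True)                      # (B, 1, H, W)
    neigh = F.unfold(gray, patch, padding=patch // 2)         # (B, patch*patch, H*W)
    neigh = neigh.view(B, patch * patch, H, W)
    diff = neigh - gray
    return diff / torch.sqrt(0.81 + diff ** 2)                # soft, bounded in (-1, 1)

def occlusion_aware_census_loss(ref, warped, occ_mask, patch=7):
    """Census difference between reference and warped frames, ignoring occluded pixels.

    occ_mask: (B, 1, H, W), 1 where the pixel is visible in both frames.
    """
    d = census_transform(ref, patch) - census_transform(warped, patch)
    per_pixel = (d.abs() / (0.1 + d.abs())).mean(dim=1, keepdim=True)  # robust penalty
    return (per_pixel * occ_mask).sum() / occ_mask.sum().clamp(min=1.0)
```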
Self-Supervised Multi-Frame Monocular Scene Flow
— AK (@ak92501) May 6, 2021
pdf: https://t.co/zzD5DxRAWK
abs: https://t.co/BlaxcCDPOS
github: https://t.co/pk8cgWhFI9
a multi-frame monocular scene flow network based on self-supervised learning, improving the accuracy over previous networks pic.twitter.com/sMp6wPcEWq
8. Texture for Colors: Natural Representations of Colors Using Variable Bit-Depth Textures
Shumeet Baluja
Numerous methods have been proposed to transform color and grayscale images to their single bit-per-pixel binary counterparts. Commonly, the goal is to enhance specific attributes of the original image to make it more amenable for analysis. However, when the resulting binarized image is intended for human viewing, aesthetics must also be considered. Binarization techniques, such as half-toning, stippling, and hatching, have been widely used for modeling the original image's intensity profile. We present an automated method to transform an image to a set of binary textures that represent not only the intensities, but also the colors of the original. The foundation of our method is information preservation: creating a set of textures that allows for the reconstruction of the original image's colors solely from the binarized representation. We present techniques to ensure that the textures created are not visually distracting, preserve the intensity profile of the images, and are natural, in that perceptually similar sets of colors map to similar patterns. The approach uses deep neural networks and is entirely self-supervised; no examples of good vs. bad binarizations are required. The system yields aesthetically pleasing binary images when tested on a variety of image sources.
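The information-preservation idea can be pictured as an autoencoder whose bottleneck is a binary image: if a decoder can recover the colors from the binary pattern alone, the textures necessarily encode color. The toy sketch below uses a straight-through binarizer and is not the paper's network.

```python
import torch
import torch.nn as nn

class BinaryTextureAutoencoder(nn.Module):
    """Toy information-preserving binarization (sketch, not the paper's model)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, rgb):
        logits = self.encoder(rgb)
        soft = torch.sigmoid(logits)
        hard = (soft > 0.5).float()
        binary = hard + soft - soft.detach()   # straight-through estimator
        recon = self.decoder(binary)
        return binary, recon

# Self-supervised objective: reconstruct the input colors from the binary image,
# e.g. loss = F.mse_loss(recon, rgb); no labeled "good" binarizations are needed.
```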
Texture for Colors: Natural Representations of Colors Using Variable Bit-Depth Textures
— AK (@ak92501) May 6, 2021
pdf: https://t.co/pmmVdSnhfq
abs: https://t.co/sdSWVe3omT
method to transform an image to a set of binary textures that represent not only the intensities, but also the colors of the original pic.twitter.com/VbHUcEXJ57
9. Visual Composite Set Detection Using Part-and-Sum Transformers
Qi Dong, Zhuowen Tu, Haofu Liao, Yuting Zhang, Vijay Mahadevan, Stefano Soatto
Computer vision applications such as visual relationship detection and human-object interaction can be formulated as a composite (structured) set detection problem in which both the parts (subject, object, and predicate) and the sum (triplet as a whole) are to be detected in a hierarchical fashion. In this paper, we present a new approach, denoted Part-and-Sum detection Transformer (PST), to perform end-to-end composite set detection. Different from existing Transformers in which queries are at a single level, we simultaneously model the joint part and sum hypotheses/interactions with composite queries and attention modules. We explicitly incorporate sum queries to enable better modeling of the part-and-sum relations that are absent in the standard Transformers. Our approach also uses novel tensor-based part queries and vector-based sum queries, and models their joint interaction. We report experiments on two vision tasks, visual relationship detection, and human-object interaction, and demonstrate that PST achieves state-of-the-art results among single-stage models, while nearly matching the results of custom-designed two-stage models.
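The composite-query idea, reduced to a vanilla transformer decoder: each candidate triplet owns one sum query plus three part queries, all attending jointly to the image features, with separate outputs for the sum and the parts. This is only a sketch; the paper uses dedicated tensor/vector attention modules rather than nn.TransformerDecoder.

```python
import torch
import torch.nn as nn

class CompositeQueryDecoder(nn.Module):
    """Sketch of part-and-sum composite queries on a plain transformer decoder."""
    def __init__(self, d_model=256, num_triplets=100, num_parts=3):
        super().__init__()
        self.num_triplets, self.num_parts = num_triplets, num_parts
        self.sum_queries = nn.Parameter(torch.randn(num_triplets, d_model))
        self.part_queries = nn.Parameter(torch.randn(num_triplets, num_parts, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)

    def forward(self, memory):
        # memory: (B, HW, d_model) flattened image features
        B = memory.size(0)
        q = torch.cat([self.sum_queries.unsqueeze(1), self.part_queries], dim=1)  # (T, 1+P, d)
        q = q.reshape(1, -1, q.size(-1)).expand(B, -1, -1)                        # (B, T*(1+P), d)
        out = self.decoder(q, memory)
        out = out.view(B, self.num_triplets, 1 + self.num_parts, -1)
        # sum embeddings (B, T, d) and part embeddings (B, T, P, d)
        return out[:, :, 0], out[:, :, 1:]
```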
Visual Composite Set Detection Using Part-and-Sum Transformers
— AK (@ak92501) May 6, 2021
pdf: https://t.co/0P9AwU0QEf
abs: https://t.co/KuAJD1NrKD
for visual composite set detection, emphasizes the importance of maintaining separate representations for the sum and parts while enhancing their interactions pic.twitter.com/KR5I2NAliy
10. Self-Supervised Learning from Automatically Separated Sound Scenes
Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra
Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
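The contrastive part of the pipeline pairs each mixture with its automatically separated channels. A generic InfoNCE objective over such pairs could look like the sketch below; the paper additionally optimizes a coincidence-prediction objective, which is not shown.

```python
import torch
import torch.nn.functional as F

def mixture_vs_separated_nce(mix_emb, sep_emb, temperature=0.1):
    """InfoNCE between mixture embeddings and their separated-channel embeddings.

    mix_emb, sep_emb: (B, D) embeddings; row i of sep_emb comes from a source
    separated out of mixture i. Other rows in the batch serve as negatives.
    """
    mix = F.normalize(mix_emb, dim=-1)
    sep = F.normalize(sep_emb, dim=-1)
    logits = mix @ sep.t() / temperature            # (B, B) cosine similarities
    targets = torch.arange(mix.size(0), device=mix.device)
    # Symmetric cross-entropy: mixture -> separated and separated -> mixture.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```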
Self-Supervised Learning from Automatically Separated Sound Scenes
— AK (@ak92501) May 6, 2021
pdf: https://t.co/Imcc6qDICM
abs: https://t.co/2OFs4A61hu
a sound separation-based contrastive learning framework for unsupervised audio representation learning pic.twitter.com/9oQ9MSLFoM
11. Real-time Deep Dynamic Characters
Marc Habermann, Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Gerard Pons-Moll, Christian Theobalt
We propose a deep videorealistic 3D human character model displaying highly realistic shape, motion, and dynamic appearance learned in a new weakly supervised way from multi-view imagery. In contrast to previous work, our controllable 3D character displays dynamics, e.g., the swing of the skirt, dependent on skeletal body motion in an efficient data-driven way, without requiring complex physics simulation. Our character model also features a learned dynamic texture model that accounts for photo-realistic motion-dependent appearance details, as well as view-dependent lighting effects. During training, we do not need to resort to difficult dynamic 3D capture of the human; instead, we can train our model entirely from multi-view video in a weakly supervised manner. To this end, we propose a parametric and differentiable character representation which allows us to model coarse and fine dynamic deformations, e.g., garment wrinkles, as explicit space-time coherent mesh geometry that is augmented with high-quality dynamic textures dependent on motion and viewpoint. As input to the model, only an arbitrary 3D skeleton motion is required, making it directly compatible with the established 3D animation pipeline. We use a novel graph convolutional network architecture to enable motion-dependent deformation learning of body and clothing, including dynamics, and a neural generative dynamic texture model creates corresponding dynamic texture maps. We show that by merely providing new skeletal motions, our model creates motion-dependent surface deformations, physically plausible dynamic clothing deformations, and video-realistic surface textures at a much higher level of detail than previous state-of-the-art approaches, even in real time.
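Loosely, the motion-to-deformation mapping can be pictured as a graph convolution over the template mesh that turns a skeletal pose descriptor into per-vertex offsets. The toy sketch below only illustrates that idea; it is not the paper's embedded-graph, multi-resolution architecture or its dynamic texture network.

```python
import torch
import torch.nn as nn

class PoseToDeformationGCN(nn.Module):
    """Tiny graph-convolutional sketch: skeletal motion features -> per-vertex offsets.

    adj is a row-normalized (V, V) adjacency matrix of the template mesh.
    """
    def __init__(self, adj, pose_dim=72, hidden=64):
        super().__init__()
        self.register_buffer("adj", adj)              # (V, V)
        self.in_proj = nn.Linear(pose_dim, hidden)
        self.gc1 = nn.Linear(hidden, hidden)
        self.gc2 = nn.Linear(hidden, 3)                # per-vertex xyz offset

    def forward(self, pose):
        # pose: (B, pose_dim) skeletal motion descriptor, broadcast to all vertices
        V = self.adj.size(0)
        h = self.in_proj(pose).unsqueeze(1).expand(-1, V, -1)   # (B, V, hidden)
        h = torch.relu(self.adj @ self.gc1(h))                  # neighbourhood aggregation
        return self.adj @ self.gc2(h)                           # (B, V, 3) offsets
```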
Real-time Deep Dynamic Characters
— AK (@ak92501) May 6, 2021
pdf: https://t.co/azlqfJLoel
abs: https://t.co/AQo1dJx4bD
a real-time method that allows to animate the dynamic 3D surface deformation and texture of highly realistic 3D avatars in a user-controllable way pic.twitter.com/b87yPPgfka
12. AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss
Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Feng Ji, Ji Zhang, Alberto Del Bimbo
A number of studies point out that current Visual Question Answering (VQA) models are severely affected by the language prior problem, which refers to blindly making predictions based on the language shortcut. Some efforts have been devoted to overcoming this issue with delicate models. However, there is no research to address it from the angle of the answer feature space learning, despite of the fact that existing VQA methods all cast VQA as a classification task. Inspired by this, in this work, we attempt to tackle the language prior problem from the viewpoint of the feature space learning. To this end, an adapted margin cosine loss is designed to discriminate the frequent and the sparse answer feature space under each question type properly. As a result, the limited patterns within the language modality are largely reduced, thereby less language priors would be introduced by our method. We apply this loss function to several baseline models and evaluate its effectiveness on two VQA-CP benchmarks. Experimental results demonstrate that our adapted margin cosine loss can greatly enhance the baseline models with an absolute performance gain of 15% on average, strongly verifying the potential of tackling the language prior problem in VQA from the angle of the answer feature space learning.