All Articles

Hot Papers 2020-11-30

1. Navigating the GAN Parameter Space for Semantic Image Editing

Anton Cherepkov, Andrey Voynov, Artem Babenko

  • retweets: 1532, favorites: 229 (12/01/2020 08:52:19)
  • links: abs | pdf
  • cs.LG | cs.CV

Generative Adversarial Networks (GANs) are currently an indispensable tool for visual editing, being a standard component of image-to-image translation and image restoration pipelines. Furthermore, GANs are especially useful for controllable generation, since their latent spaces contain a wide range of interpretable directions well suited for semantic editing operations. By gradually changing latent codes along these directions, one can produce impressive visual effects unattainable without GANs. In this paper, we significantly expand the range of visual effects achievable with state-of-the-art models like StyleGAN2. In contrast to existing works, which mostly operate on latent codes, we discover interpretable directions in the space of the generator parameters. Through several simple methods, we explore this space and demonstrate that it also contains a plethora of interpretable directions, which are an excellent source of non-trivial semantic manipulations. The discovered manipulations cannot be achieved by transforming the latent codes, and can be used to edit both synthetic and real images. We release our code and models and hope they will serve as a handy tool for further efforts on GAN-based image editing.
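
In code, the shift from latent editing to parameter editing is small. Here is a minimal, hypothetical sketch of the mechanism (the toy generator and the direction below are placeholders, not the authors' released models or code):

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, theta):
    """Toy 'generator': one linear map standing in for StyleGAN2."""
    W = theta.reshape(8, 4)          # pretend these are conv weights
    return np.tanh(W @ z)            # 8-dim 'image' from a 4-dim latent

theta = rng.normal(size=32)          # generator parameters (flattened)
direction = rng.normal(size=32)      # an interpretable direction in parameter space
direction /= np.linalg.norm(direction)

z = rng.normal(size=4)
for alpha in (-2.0, 0.0, 2.0):       # edit strength
    edited = generator(z, theta + alpha * direction)
    print(alpha, edited[:3])         # same latent code, different generator weights
```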

2. Unsupervised part representation by Flow Capsules

Sara Sabour, Andrea Tagliasacchi, Soroosh Yazdani, Geoffrey E. Hinton, David J. Fleet

  • retweets: 409, favorites: 156 (12/01/2020 08:52:20)
  • links: abs | pdf
  • cs.CV | cs.LG

Capsule networks are designed to parse an image into a hierarchy of objects, parts, and relations. While promising, they remain limited by an inability to learn effective low-level part descriptions. To address this issue we propose a novel self-supervised method for learning part descriptors of an image. During training, we exploit motion as a powerful perceptual cue for part definition, using an expressive decoder for part generation and layered image formation with occlusion. Experiments demonstrate robust part discovery in the presence of multiple objects, cluttered backgrounds, and significant occlusion. The resulting part descriptors, a.k.a. part capsules, are decoded into shape masks that fill in occluded pixels, along with relative depth, from single images. We also report unsupervised object classification using our capsule parts in a stacked capsule autoencoder.
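
The "layered image formation with occlusion" part can be made concrete with a short compositing sketch: parts are rendered as (mask, color, depth) triples and blended front-to-back so nearer parts occlude farther ones. This illustrates only the compositing idea, not the paper's decoder:

```python
import numpy as np

H = W = 4
parts = [  # (soft mask, gray level, relative depth); smaller depth = nearer
    (np.pad(np.ones((2, 2)), ((0, 2), (0, 2))), 0.9, 0.3),
    (np.pad(np.ones((3, 3)), ((1, 0), (1, 0))), 0.4, 0.7),
]

image = np.zeros((H, W))
visibility = np.ones((H, W))                  # how much of each layer remains visible
for mask, color, _ in sorted(parts, key=lambda p: p[2]):  # front to back
    image += visibility * mask * color
    visibility *= (1.0 - mask)                # regions behind a part are occluded
print(image)
```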

3. Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes

Zhengqi Li, Simon Niklaus, Noah Snavely, Oliver Wang

  • retweets: 342, favorites: 141 (12/01/2020 08:52:20)
  • links: abs | pdf
  • cs.CV

We present a method to perform novel view and time synthesis of dynamic scenes, requiring only a monocular video with known camera poses as input. To do this, we introduce Neural Scene Flow Fields, a new representation that models the dynamic scene as a time-variant continuous function of appearance, geometry, and 3D scene motion. Our representation is optimized through a neural network to fit the observed input views. We show that our representation can be used for complex dynamic scenes, including thin structures, view-dependent effects, and natural degrees of motion. We conduct a number of experiments that demonstrate our approach significantly outperforms recent monocular view synthesis methods, and show qualitative results of space-time view synthesis on a variety of real-world videos.
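
Under the usual NeRF-style conventions, the "time-variant continuous function" can be summarized as a network mapping a 3D point and a time to appearance, geometry, and motion. The notation below is a hedged summary of that interface, not the paper's exact formulation (the full method adds auxiliary outputs such as occlusion weights):

$$
F_\Theta(\mathbf{x}, t) = \big(\mathbf{c},\ \sigma,\ \mathbf{f}_{t \to t+1},\ \mathbf{f}_{t \to t-1}\big),
$$

where $\mathbf{c}$ is color, $\sigma$ is volume density, and $\mathbf{f}$ denotes the forward and backward 3D scene flow at that point and time.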

4. A Grassmann Manifold Handbook: Basic Geometry and Computational Aspects

Thomas Bendokat, Ralf Zimmermann, P.-A. Absil

The Grassmann manifold of linear subspaces is important for the mathematical modelling of a multitude of applications, ranging from problems in machine learning, computer vision and image processing to low-rank matrix optimization problems, dynamic low-rank decompositions and model reduction. With this work, we aim to provide a collection of the essential facts and formulae on the geometry of the Grassmann manifold in a fashion that is fit for tackling the aforementioned problems with matrix-based algorithms. Moreover, we expose the Grassmann geometry both from the approach of representing subspaces with orthogonal projectors and when viewed as a quotient space of the orthogonal group, where subspaces are identified as equivalence classes of (orthogonal) bases. This bridges the associated research tracks and allows for an easy transition between these two approaches. Original contributions include a modified algorithm for computing the Riemannian logarithm map on the Grassmannian that is numerically advantageous and also allows for a more elementary, yet more complete, description of the cut locus and the conjugate points. We also derive a formula for parallel transport along geodesics in the orthogonal projector perspective, formulae for the derivative of the exponential map, as well as a formula for Jacobi fields vanishing at one point.
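
As a taste of the kind of formula the handbook collects, in the orthonormal-basis representation: for a point of $\mathrm{Gr}(n,p)$ represented by $Y \in \mathbb{R}^{n \times p}$ with $Y^{\mathsf{T}}Y = I_p$, and a tangent direction $\Delta$ with thin SVD $\Delta = U\Sigma V^{\mathsf{T}}$, the geodesic through $\operatorname{span}(Y)$ is the classical Edelman–Arias–Smith expression

$$
\gamma(t) = \big(YV\cos(\Sigma t) + U\sin(\Sigma t)\big)V^{\mathsf{T}},
$$

with $\cos$ and $\sin$ applied entrywise to the diagonal of $\Sigma t$. (Quoted here as a standard result the paper revisits, not verbatim from the text.)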

5. Learning to dance: A graph convolutional adversarial network to generate realistic dance motions from audio

João P. Ferreira, Thiago M. Coutinho, Thiago L. Gomes, José F. Neto, Rafael Azevedo, Renato Martins, Erickson R. Nascimento

Synthesizing human motion through learning techniques is becoming an increasingly popular approach to alleviating the requirement of new data capture to produce animations. Learning to move naturally from music, i.e., to dance, is one of the more complex motions humans often perform effortlessly. Each dance movement is unique, yet such movements maintain the core characteristics of the dance style. Most approaches addressing this problem with classical convolutional and recursive neural models suffer from training and variability issues due to the non-Euclidean geometry of the motion manifold structure. In this paper, we design a novel method based on graph convolutional networks to tackle the problem of automatic dance generation from audio information. Our method uses an adversarial learning scheme conditioned on the input music to create natural motions that preserve the key movements of different music styles. We evaluate our method with three quantitative metrics for generative methods and a user study. The results suggest that the proposed GCN model outperforms the state-of-the-art music-conditioned dance generation method in different experiments. Moreover, our graph-convolutional approach is simpler, easier to train, and capable of generating more realistic motion styles according to both qualitative and quantitative metrics, with visual movement quality comparable to real motion data.
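
The basic building block here is a graph convolution over the skeleton: per-joint features are mixed along bones via a normalized adjacency matrix. Below is a generic one-layer sketch of that operation (not the paper's architecture; sizes and weights are illustrative):

```python
import numpy as np

A = np.array([[0, 1, 0],     # toy 3-joint skeleton: bones 0-1 and 1-2
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_hat = A + np.eye(3)                        # add self-loops
D_inv_sqrt = np.diag(A_hat.sum(1) ** -0.5)   # symmetric degree normalization
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

X = np.random.default_rng(0).normal(size=(3, 8))   # per-joint features
W = np.random.default_rng(1).normal(size=(8, 8))   # learnable weights
H = np.maximum(A_norm @ X @ W, 0.0)                # one GCN layer with ReLU
print(H.shape)  # (3, 8): joint features now mix information along the skeleton
```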

6. Deep orthogonal linear networks are shallow

Pierre Ablin

We consider the problem of training a deep orthogonal linear network, which consists of a product of orthogonal matrices with no non-linearity in between. We show that training the weights with Riemannian gradient descent is equivalent to training the whole factorization by gradient descent. This means that there is no effect of overparametrization and implicit bias at all in this setting: training such a deep, overparametrized network is perfectly equivalent to training a one-layer shallow network.
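
For context, the Riemannian gradient on the orthogonal group (the object Riemannian gradient descent steps along) is the projection of the Euclidean gradient onto the tangent space at $W$:

$$
\operatorname{grad} f(W) = W\,\operatorname{skew}\!\big(W^{\mathsf{T}} \nabla f(W)\big), \qquad \operatorname{skew}(A) := \tfrac{1}{2}\big(A - A^{\mathsf{T}}\big).
$$

This is the standard tangent-space projection for the orthogonal group, stated here as background; the paper's result relates steps of this form on each factor to plain gradient steps on their product.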

7. 4D Human Body Capture from Egocentric Video via 3D Scene Grounding

Miao Liu, Dexin Yang, Yan Zhang, Zhaopeng Cui, James M. Rehg, Siyu Tang

  • retweets: 49, favorites: 35 (12/01/2020 08:52:21)
  • links: abs | pdf
  • cs.CV

To understand human daily social interaction from an egocentric perspective, we introduce a novel task of reconstructing a time series of second-person 3D human body meshes from monocular egocentric videos. The unique viewpoint and rapid embodied camera motion of egocentric videos raise additional technical barriers for human body capture. To address those challenges, we propose a novel optimization-based approach that leverages 2D observations of the entire video sequence and human-scene interaction constraints to estimate second-person human poses, shapes, and global motion grounded in the 3D environment captured from the egocentric view. We conduct detailed ablation studies to validate our design choices. Moreover, we compare our method with a previous state-of-the-art method for human motion capture from monocular video, and show that our method estimates more accurate human-body poses and shapes under the challenging egocentric setting. In addition, we demonstrate that our approach produces more realistic human-scene interaction. Our project page is available at: https://aptx4869lm.github.io/4DEgocentricBodyCapture/
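
Optimization-based capture of this kind typically minimizes an energy over body parameters $\theta$ across the whole sequence. The decomposition below is a hedged caricature of such an objective (the term names and weights are illustrative, not the paper's):

$$
\min_{\theta}\ E_{\text{2D}}(\theta) + \lambda_{\text{scene}}\,E_{\text{contact}}(\theta) + \lambda_{\text{temp}}\,E_{\text{smooth}}(\theta),
$$

where $E_{\text{2D}}$ penalizes reprojection error against 2D observations, $E_{\text{contact}}$ encourages plausible human-scene contact in the reconstructed 3D environment, and $E_{\text{smooth}}$ regularizes motion over time.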

8. 3DSNet: Unsupervised Shape-to-Shape 3D Style Transfer

Mattia Segu, Margarita Grinvald, Roland Siegwart, Federico Tombari

  • retweets: 25, favorites: 55 (12/01/2020 08:52:21)
  • links: abs | pdf
  • cs.CV | cs.LG

Transferring the style of one image onto another is a popular and widely studied task in computer vision. Yet, learning-based style transfer in the 3D setting remains a largely unexplored problem. To our knowledge, we propose the first learning-based generative approach for style transfer between 3D objects. Our method makes it possible to combine the content and style of a source and target 3D model to generate a novel shape that resembles the target in style while retaining the content of the source. The proposed framework can synthesize new 3D shapes both as point clouds and as meshes. Furthermore, we extend our technique to implicitly learn the underlying multimodal style distribution of the individual category domains. By sampling style codes from the learned distributions, we increase the variety of styles that our model can confer on a given reference object. Experimental results validate the effectiveness of the proposed 3D style transfer method on a number of benchmarks.
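
The generic mechanism behind shape-to-shape style transfer is content/style recombination: encode both shapes, then decode the source's content with the target's style code. The toy encoder below splits a point cloud into normalized coordinates ("content") and its statistics ("style"); it is a hedged stand-in, not 3DSNet's learned networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(pts):
    mu, sigma = pts.mean(0), pts.std()
    return (pts - mu) / sigma, (mu, sigma)   # (content, style) -- toy split

def decode(content, style):
    mu, sigma = style
    return content * sigma + mu

src = rng.normal(size=(100, 3))              # source point cloud
tgt = 5.0 + 2.0 * rng.normal(size=(100, 3))  # target with a different 'style'

content_src, _ = encode(src)
_, style_tgt = encode(tgt)
stylized = decode(content_src, style_tgt)    # source content, target style
print(stylized.mean(0).round(1), stylized.std().round(1))  # ~target statistics
```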

9. Deep Convolutional Neural Networks: A survey of the foundations, selected improvements, and some current applications

Lars Lien Ankile, Morgan Feet Heggland, Kjartan Krange

  • retweets: 49, favorites: 26 (12/01/2020 08:52:21)
  • links: abs | pdf
  • cs.CV | cs.LG

Within the world of machine learning there exists a wide range of methods, each with its own advantages and applications. This paper presents and discusses one such method, namely Convolutional Neural Networks (CNNs). CNNs are deep neural networks that use a special linear operation called convolution. This operation is a key and distinctive element of CNNs, and is therefore the focus of this method paper. The discussion starts with the theoretical foundations that underlie convolutions and CNNs, then proceeds to improvements and augmentations that adapt the method to estimate a wider set of function classes. The paper mainly investigates two ways of improving the method: locally connected layers, which make the network less invariant to translation, and tiled convolution, which allows for the learning of more complex invariances than standard convolution. Furthermore, the Fast Fourier Transform can improve the computational efficiency of convolution. The paper then discusses two applications of convolution that have proven very effective in practice. First, the YOLO architecture is a state-of-the-art neural network for object detection, which accurately predicts bounding boxes around objects in images. Second, tumor detection in mammography may be performed using CNNs, achieving 7.2% higher specificity than human doctors with only 0.3% lower sensitivity. Finally, the invention of technology that outperforms humans in different fields also raises certain ethical and regulatory questions, which are briefly discussed.
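
The FFT speedup mentioned above is easy to demonstrate in one dimension: circular convolution in the spatial domain equals pointwise multiplication in the frequency domain, dropping the cost from O(n²) to O(n log n) for long signals:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.normal(size=n)
k = rng.normal(size=n)

# Convolution theorem: FFT, multiply, inverse FFT.
via_fft = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

# Direct O(n^2) circular convolution for comparison.
naive = np.array([sum(x[j] * k[(i - j) % n] for j in range(n))
                  for i in range(n)])

print(np.allclose(via_fft, naive))  # True
```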

10. Image Generators with Conditionally-Independent Pixel Synthesis

Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb Sterkin, Victor Lempitsky, Denis Korzhenkov

Existing image generator networks rely heavily on spatial convolutions and, optionally, self-attention blocks in order to gradually synthesize images in a coarse-to-fine manner. Here, we present a new architecture for image generators, where the color value at each pixel is computed independently given the value of a random latent vector and the coordinate of that pixel. No spatial convolutions or similar operations that propagate information across pixels are involved during the synthesis. We analyze the modeling capabilities of such generators when trained in an adversarial fashion, and observe that the new generators achieve similar generation quality to state-of-the-art convolutional generators. We also investigate several interesting properties unique to the new architecture.
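
A minimal sketch of the idea: each pixel's color is an MLP of (pixel coordinate, shared latent), with no operation that mixes information across pixels. The actual paper uses richer ingredients such as Fourier features for the coordinates; sizes and weights below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2 + 16, 32))            # MLP weights (illustrative)
W2 = rng.normal(size=(32, 3))
z = rng.normal(size=16)                       # one latent vector for the whole image

def pixel_color(x, y):
    h = np.concatenate(([x, y], z))           # coordinate + latent, nothing else
    return np.tanh(np.maximum(h @ W1, 0.0) @ W2)

height = width = 8
image = np.array([[pixel_color(x / width, y / height) for x in range(width)]
                  for y in range(height)])
print(image.shape)  # (8, 8, 3) -- every pixel computed independently
```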

11. Generative Layout Modeling using Constraint Graphs

Wamiq Para, Paul Guerrero, Tom Kelly, Leonidas Guibas, Peter Wonka

  • retweets: 36, favorites: 27 (12/01/2020 08:52:21)
  • links: abs | pdf
  • cs.CV | cs.GR

We propose a new generative model for layout generation. We generate layouts in three steps. First, we generate the layout elements as nodes in a layout graph. Second, we compute constraints between layout elements as edges in the layout graph. Third, we solve for the final layout using constrained optimization. For the first two steps, we build on recent transformer architectures. The layout optimization implements the constraints efficiently. We show three practical contributions compared to the state of the art: our work requires no user input, produces higher quality layouts, and enables many novel capabilities for conditional layout generation.
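
The third step, constrained layout optimization, can be sketched in miniature: element positions are adjusted by gradient descent on squared violations of pairwise constraints taken from the layout graph. The 1-D constraint set below is invented for illustration and is not the paper's formulation:

```python
import numpy as np

x = np.array([0.0, 0.1, 0.2])          # 1-D positions of three layout elements
left_of = [(0, 1), (1, 2)]             # edges from the layout graph
gap = 1.0                              # required spacing between elements

for _ in range(500):
    grad = np.zeros_like(x)
    for i, j in left_of:               # penalize violations of x[j] >= x[i] + gap
        v = max(0.0, x[i] + gap - x[j])
        grad[i] += v
        grad[j] -= v
    x -= 0.1 * grad                    # gradient step on the squared violations
print(np.round(x, 2))                  # spacing constraints now (nearly) satisfied
```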

12. FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge

Bichen Wu, Qing He, Peizhao Zhang, Thilo Koehler, Kurt Keutzer, Peter Vajda

Nowadays more and more applications can benefit from edge-based text-to-speech (TTS). However, most existing TTS models are too computationally expensive and are not flexible enough to be deployed on the diverse variety of edge devices with their equally diverse computational capacities. To address this, we propose FBWave, a family of efficient and scalable neural vocoders that can achieve optimal performance-efficiency trade-offs for different edge devices. FBWave is a hybrid flow-based generative model that combines the advantages of autoregressive and non-autoregressive models. It produces high quality audio and supports streaming during inference while remaining highly computationally efficient. Our experiments show that FBWave can achieve similar audio quality to WaveRNN while reducing MACs by 40x. More efficient variants of FBWave can achieve up to 109x fewer MACs while still delivering acceptable audio quality. Audio demos are available at https://bichenwu09.github.io/vocoder_demos.
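
The standard invertible building block in flow-based vocoders of this family is the affine coupling step: half of the signal is transformed conditioned on the other half, and the exact inverse is available in closed form. The sketch below illustrates that generic block, not FBWave's specific architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))            # stand-in for a small conditioning network

def net(x1):                           # predicts (log-scale, shift) from x1
    h = x1 @ W
    return np.tanh(h[:4]), h[4:]

def forward(x1, x2):
    log_s, t = net(x1)
    return x1, x2 * np.exp(log_s) + t  # invertible given x1

def inverse(y1, y2):
    log_s, t = net(y1)
    return y1, (y2 - t) * np.exp(-log_s)

x1, x2 = rng.normal(size=4), rng.normal(size=4)
y1, y2 = forward(x1, x2)
print(np.allclose(inverse(y1, y2)[1], x2))  # True: exact invertibility
```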

13. True-data Testbed for 5G/B5G Intelligent Network

Yongming Huang, Shengheng Liu, Cheng Zhang, Xiaohu You, Hequan Wu

Future beyond-fifth-generation (B5G) and sixth-generation (6G) mobile communications will shift from facilitating interpersonal communications to supporting the Internet of Everything (IoE), where intelligent communications with full integration of big data and artificial intelligence (AI) will play an important role in improving network efficiency and providing high-quality service. As a rapidly evolving paradigm, AI-empowered mobile communications demand large amounts of data acquired from real network environments for systematic testing and verification. Hence, we build the world's first true-data testbed for 5G/B5G intelligent network (TTIN), which comprises 5G/B5G on-site experimental networks, data acquisition & data warehousing, and an AI engine & network optimization. In the TTIN, true network data acquisition, storage, standardization, and analysis are available, enabling system-level online verification of B5G/6G-oriented key technologies and supporting data-driven network optimization through a closed-loop control mechanism. This paper elaborates on the system architecture and module design of TTIN. Detailed technical specifications and some of the established use cases are also showcased.