1. Navigating the GAN Parameter Space for Semantic Image Editing
Anton Cherepkov, Andrey Voynov, Artem Babenko
Generative Adversarial Networks (GANs) are currently an indispensable tool for visual editing, being a standard component of image-to-image translation and image restoration pipelines. Furthermore, GANs are especially useful for controllable generation since their latent spaces contain a wide range of interpretable directions, well suited for semantic editing operations. By gradually changing latent codes along these directions, one can produce impressive visual effects, unattainable without GANs. In this paper, we significantly expand the range of visual effects achievable with state-of-the-art models, like StyleGAN2. In contrast to existing works, which mostly operate on latent codes, we discover interpretable directions in the space of the generator parameters. With several simple methods, we explore this space and demonstrate that it also contains a plethora of interpretable directions, which are an excellent source of non-trivial semantic manipulations. The discovered manipulations cannot be achieved by transforming the latent codes and can be used to edit both synthetic and real images. We release our code and models and hope they will serve as a handy tool for further efforts on GAN-based image editing.
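As a rough illustration of what editing in parameter space, rather than latent space, can look like, here is a minimal PyTorch sketch; the generator `G`, the per-layer `direction` dictionary, and the step size `alpha` are hypothetical placeholders, not the authors' code.

```python
# Minimal sketch (not the authors' implementation): applying a discovered
# direction in the generator's *parameter* space instead of its latent space.
# Assumes a PyTorch generator G and a dict `direction` whose entries match the
# shapes of G's parameters; both are hypothetical.
import copy
import torch

def edit_in_parameter_space(G, direction, alpha):
    """Return a copy of G shifted by alpha * direction in weight space."""
    G_edited = copy.deepcopy(G)
    with torch.no_grad():
        for name, param in G_edited.named_parameters():
            if name in direction:
                param.add_(alpha * direction[name])
    return G_edited

# The latent code z stays fixed; only the generator weights change, so the
# same z renders an edited version of the same image:
# img_original = G(z)
# img_edited   = edit_in_parameter_space(G, direction, alpha=3.0)(z)
```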
Navigating the GAN Parameter Space for Semantic Image Editing
— AK (@ak92501) November 30, 2020
pdf: https://t.co/89khxie4h7
abs: https://t.co/wD7PukF9tG
github: https://t.co/oOnCff89ha pic.twitter.com/mHQEwP9gLq
2. Unsupervised part representation by Flow Capsules
Sara Sabour, Andrea Tagliasacchi, Soroosh Yazdani, Geoffrey E. Hinton, David J. Fleet
Capsule networks are designed to parse an image into a hierarchy of objects, parts and relations. While promising, they remain limited by an inability to learn effective low level part descriptions. To address this issue we propose a novel self-supervised method for learning part descriptors of an image. During training, we exploit motion as a powerful perceptual cue for part definition, using an expressive decoder for part generation and layered image formation with occlusion. Experiments demonstrate robust part discovery in the presence of multiple objects, cluttered backgrounds, and significant occlusion. The resulting part descriptors, a.k.a. part capsules, are decoded into shape masks, filling in occluded pixels, along with relative depth on single images. We also report unsupervised object classification using our capsule parts in a stacked capsule autoencoder.
Our new paper "Unsupervised part representation by Flow Capsules" is out https://t.co/0t4XH4FH8H
— Andrea Tagliasacchi (@taiyasaki) November 30, 2020
TL;DR: are newborns exposed to 14 million (i.e. ImageNet) labeled images? No! They learn by observing motion... in an unsupervised fashion pic.twitter.com/K4dOAHPmkF
3. Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes
Zhengqi Li, Simon Niklaus, Noah Snavely, Oliver Wang
We present a method to perform novel view and time synthesis of dynamic scenes, requiring only a monocular video with known camera poses as input. To do this, we introduce Neural Scene Flow Fields, a new representation that models the dynamic scene as a time-variant continuous function of appearance, geometry, and 3D scene motion. Our representation is optimized through a neural network to fit the observed input views. We show that our representation can be used for complex dynamic scenes, including thin structures, view-dependent effects, and natural degrees of motion. We conduct a number of experiments that demonstrate our approach significantly outperforms recent monocular view synthesis methods, and show qualitative results of space-time view synthesis on a variety of real-world videos.
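One way to read "a time-variant continuous function of appearance, geometry, and 3D scene motion" is an MLP that maps a 3D point and a time value to color, density, and scene flow. The sketch below is that reading only, with placeholder layer sizes; it is not the authors' architecture.

```python
# Hypothetical sketch of a time-variant scene field: (x, y, z, t) -> color,
# density, and 3D scene flow. Layer sizes and heads are placeholder choices.
import torch
import torch.nn as nn

class SceneFlowField(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),   # input: (x, y, z, t)
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.rgb_sigma = nn.Linear(hidden, 4)  # color + volume density
        self.flow = nn.Linear(hidden, 6)       # 3D offsets toward t-1 and t+1

    def forward(self, xyz, t):
        # xyz: (N, 3) sample points, t: (N, 1) time values
        h = self.mlp(torch.cat([xyz, t], dim=-1))
        return self.rgb_sigma(h), self.flow(h)
```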
Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes
— AK (@ak92501) November 30, 2020
pdf: https://t.co/wXNNMRrK1n
abs: https://t.co/04GXItTvly
project page: https://t.co/SWy0dLeHn3 pic.twitter.com/Z3JqxFQiDM
4. A Grassmann Manifold Handbook: Basic Geometry and Computational Aspects
Thomas Bendokat, Ralf Zimmermann, P.-A. Absil
The Grassmann manifold of linear subspaces is important for the mathematical modelling of a multitude of applications, ranging from problems in machine learning, computer vision and image processing to low-rank matrix optimization problems, dynamic low-rank decompositions and model reduction. With this work, we aim to provide a collection of the essential facts and formulae on the geometry of the Grassmann manifold in a fashion that is fit for tackling the aforementioned problems with matrix-based algorithms. Moreover, we expose the Grassmann geometry both from the approach of representing subspaces with orthogonal projectors and when viewed as a quotient space of the orthogonal group, where subspaces are identified as equivalence classes of (orthogonal) bases. This bridges the associated research tracks and allows for an easy transition between these two approaches. Original contributions include a modified algorithm for computing the Riemannian logarithm map on the Grassmannian that is advantageous numerically but also allows for a more elementary, yet more complete description of the cut locus and the conjugate points. We also derive a formula for parallel transport along geodesics in the orthogonal projector perspective, formulae for the derivative of the exponential map, as well as a formula for Jacobi fields vanishing at one point.
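For readers who want to experiment alongside the handbook, the textbook exponential and logarithm maps on the Grassmannian (the baseline that the paper's modified log algorithm improves on) take only a few lines of NumPy for subspaces represented by n-by-p orthonormal bases:

```python
# Standard (textbook) Grassmann exp and log maps; this is the generic
# algorithm, not the modified one proposed in the paper.
import numpy as np

def grassmann_log(X, Y):
    """Tangent vector Delta at span(X) such that exp_X(Delta) spans span(Y)."""
    M = X.T @ Y
    # Component of Y orthogonal to span(X), mapped back through M^{-1}.
    B = (Y - X @ M) @ np.linalg.inv(M)
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.arctan(s)) @ Vt

def grassmann_exp(X, Delta):
    """Endpoint (at t = 1) of the geodesic from span(X) in direction Delta."""
    U, s, Vt = np.linalg.svd(Delta, full_matrices=False)
    return X @ Vt.T @ np.diag(np.cos(s)) @ Vt + U @ np.diag(np.sin(s)) @ Vt
```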
A GRASSMANN MANIFOLD HANDBOOK: BASIC GEOMETRY AND COMPUTATIONAL ASPECTS https://t.co/sON3ekaHhM
— Submersion (@Submersion13) November 30, 2020
Something I'd definitely like to look through whenever working with the Grassmann manifold.
5. Learning to dance: A graph convolutional adversarial network to generate realistic dance motions from audio
Joรฃo P. Ferreira, Thiago M. Coutinho, Thiago L. Gomes, Josรฉ F. Neto, Rafael Azevedo, Renato Martins, Erickson R. Nascimento
Synthesizing human motion through learning techniques is becoming an increasingly popular approach to alleviating the requirement of new data capture to produce animations. Moving naturally from music, i.e., dancing, is one of the more complex motions humans often perform effortlessly. Each dance movement is unique, yet such movements maintain the core characteristics of the dance style. Most approaches addressing this problem with classical convolutional and recursive neural models undergo training and variability issues due to the non-Euclidean geometry of the motion manifold structure. In this paper, we design a novel method based on graph convolutional networks to tackle the problem of automatic dance generation from audio information. Our method uses an adversarial learning scheme conditioned on the input music audio to create natural motions preserving the key movements of different music styles. We evaluate our method with three quantitative metrics of generative methods and a user study. The results suggest that the proposed GCN model outperforms the state-of-the-art dance generation method conditioned on music in different experiments. Moreover, our graph-convolutional approach is simpler, easier to train, and capable of generating more realistic motion styles with respect to qualitative and different quantitative metrics. It also presents visual movement perceptual quality comparable to real motion data.
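The graph-convolutional building block the abstract refers to can be pictured as a generic GCN layer over the skeleton graph; the sketch below is a textbook layer, not the paper's exact architecture.

```python
# Generic graph-convolution layer over a skeleton graph (textbook GCN layer,
# not the paper's exact model). `adjacency` is a (J, J) 0/1 float tensor over
# the J skeleton joints.
import torch
import torch.nn as nn

class SkeletonGCNLayer(nn.Module):
    def __init__(self, adjacency, in_dim, out_dim):
        super().__init__()
        A = adjacency + torch.eye(adjacency.shape[0])       # add self-loops
        self.register_buffer("A_hat", A / A.sum(dim=1, keepdim=True))  # row-normalize
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # x: (batch, joints, in_dim); mix features along skeleton edges, then project.
        return torch.relu(self.linear(self.A_hat @ x))
```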
Learning to dance: A graph convolutional adversarial network to generate realistic dance motions from audio
— AK (@ak92501) November 30, 2020
pdf: https://t.co/0cdEeWOEts
abs: https://t.co/HHKQRk3rPr
project page: https://t.co/IaZrTb8b4S
github: https://t.co/ykHGJpIrCu pic.twitter.com/Z64LfLzdwM
6. Deep orthogonal linear networks are shallow
Pierre Ablin
We consider the problem of training a deep orthogonal linear network, which consists of a product of orthogonal matrices, with no non-linearity in-between. We show that training the weights with Riemannian gradient descent is equivalent to training the whole factorization by gradient descent. This means that there is no effect of overparametrization and implicit bias at all in this setting: training such a deep, overparametrized, network is perfectly equivalent to training a one-layer shallow network.
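For concreteness, a single Riemannian gradient-descent step on the orthogonal group looks like the sketch below; the QR-based retraction is one common choice and not necessarily the one analyzed in the paper.

```python
# One Riemannian gradient-descent step for an orthogonal weight matrix W,
# given the Euclidean gradient of the loss at W. Retraction choice (QR) is mine.
import numpy as np

def riemannian_step(W, euclid_grad, lr):
    """Return the next iterate on O(n) after one Riemannian GD step."""
    # Project the Euclidean gradient onto the tangent space at W:
    # tangent vectors have the form W @ A with A skew-symmetric.
    A = W.T @ euclid_grad
    riem_grad = W @ (A - A.T) / 2
    # Retract the Euclidean update back onto the manifold via QR.
    Q, R = np.linalg.qr(W - lr * riem_grad)
    return Q * np.sign(np.diag(R))   # fix column signs so diag(R) > 0
```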
[Small paper] Deep orthogonal linear networks are shallow !
— Pierre Ablin (@PierreAblin) November 30, 2020
Training a deep linear network where the weights are orthogonal with Riemannian gradient descent is equivalent to training a shallow one-layer network.
https://t.co/24PBiqs7aw
7. 4D Human Body Capture from Egocentric Video via 3D Scene Grounding
Miao Liu, Dexin Yang, Yan Zhang, Zhaopeng Cui, James M. Rehg, Siyu Tang
To understand human daily social interaction from an egocentric perspective, we introduce a novel task of reconstructing a time series of second-person 3D human body meshes from monocular egocentric videos. The unique viewpoint and rapid embodied camera motion of egocentric videos raise additional technical barriers for human body capture. To address those challenges, we propose a novel optimization-based approach that leverages 2D observations of the entire video sequence and human-scene interaction constraints to estimate second-person human poses, shapes, and global motion grounded in the 3D environment captured from the egocentric view. We conduct detailed ablation studies to validate our design choices. Moreover, we compare our method with the previous state-of-the-art method for human motion capture from monocular video and show that our method estimates more accurate human-body poses and shapes under the challenging egocentric setting. In addition, we demonstrate that our approach produces more realistic human-scene interaction. Our project page is available at: https://aptx4869lm.github.io/4DEgocentricBodyCapture/
4D Human Body Capture from Egocentric Video
— AK (@ak92501) November 30, 2020
via 3D Scene Grounding
pdf: https://t.co/PgVoyLyAGv
abs: https://t.co/WKkdYl29M2
project page: https://t.co/m5MkIvavPk pic.twitter.com/N84XFHv5fY
8. 3DSNet: Unsupervised Shape-to-Shape 3D Style Transfer
Mattia Segu, Margarita Grinvald, Roland Siegwart, Federico Tombari
Transferring the style from one image onto another is a popular and widely studied task in computer vision. Yet, learning-based style transfer in the 3D setting remains a largely unexplored problem. To our knowledge, we propose the first learning-based generative approach for style transfer between 3D objects. Our method combines the content and style of a source and a target 3D model to generate a novel shape that resembles the target in style while retaining the source content. The proposed framework can synthesize new 3D shapes both in the form of point clouds and meshes. Furthermore, we extend our technique to implicitly learn the underlying multimodal style distribution of the individual category domains. By sampling style codes from the learned distributions, we increase the variety of styles that our model can confer to a given reference object. Experimental results validate the effectiveness of the proposed 3D style transfer method on a number of benchmarks.
3DSNet: Unsupervised Shape-to-Shape 3D Style Transfer
— AK (@ak92501) November 30, 2020
pdf: https://t.co/PLrf3ldwfw
abs: https://t.co/NvZmwmJem1 pic.twitter.com/DqKdamZq7b
9. Deep Convolutional Neural Networks: A survey of the foundations, selected improvements, and some current applications
Lars Lien Ankile, Morgan Feet Heggland, Kjartan Krange
Within the world of machine learning, there exists a wide range of methods with their respective advantages and applications. This paper seeks to present and discuss one such method, namely Convolutional Neural Networks (CNNs). CNNs are deep neural networks that use a special linear operation called convolution. This operation represents a key and distinctive element of CNNs and is therefore the focus of this method paper. The discussion starts with the theoretical foundations that underlie convolutions and CNNs. It then proceeds to improvements and augmentations that adapt the method to estimate a wider set of function classes. The paper mainly investigates two ways of improving the method: locally connected layers, which can make the network less invariant to translation, and tiled convolution, which allows for the learning of more complex invariances than standard convolution. Furthermore, the Fast Fourier Transform can improve the computational efficiency of convolution. Subsequently, the paper discusses two applications of convolution that have proven very effective in practice. First, the YOLO architecture is a state-of-the-art neural network for object detection, which accurately predicts bounding boxes around objects in images. Second, tumor detection in mammography may be performed using CNNs, achieving 7.2% higher specificity than human doctors with only 0.3% lower sensitivity. Finally, the invention of technology that outperforms humans in different fields also raises certain ethical and regulatory questions, which are briefly discussed.
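The Fast Fourier Transform remark is easy to check numerically: circular convolution in the signal domain equals pointwise multiplication in the frequency domain. A one-dimensional toy verification (not from the paper):

```python
# Verify that circular convolution computed directly matches the FFT route.
import numpy as np

n = 64
x, k = np.random.randn(n), np.random.randn(n)

# Direct circular convolution: (x * k)[i] = sum_j x[j] * k[(i - j) mod n].
direct = np.array([sum(x[j] * k[(i - j) % n] for j in range(n)) for i in range(n)])
# Same result via the FFT: pointwise product in the frequency domain.
via_fft = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

assert np.allclose(direct, via_fft)
```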
Deep Convolutional Neural Networks: A survey of the foundations, selected improvements, a... https://t.co/sjupLvZmyF pic.twitter.com/vOam63V9mV
— arxiv (@arxiv_org) November 30, 2020
10. Image Generators with Conditionally-Independent Pixel Synthesis
Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb Sterkin, Victor Lempitsky, Denis Korzhenkov
Existing image generator networks rely heavily on spatial convolutions and, optionally, self-attention blocks in order to gradually synthesize images in a coarse-to-fine manner. Here, we present a new architecture for image generators, where the color value at each pixel is computed independently given the value of a random latent vector and the coordinate of that pixel. No spatial convolutions or similar operations that propagate information across pixels are involved during the synthesis. We analyze the modeling capabilities of such generators when trained in an adversarial fashion, and observe the new generators to achieve similar generation quality to state-of-the-art convolutional generators. We also investigate several interesting properties unique to the new architecture.
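As a rough sketch of the idea (not the authors' exact architecture), each pixel's color can be produced by an MLP that sees only that pixel's coordinate embedding and a shared latent code, with no operation that propagates information across pixels; the layer sizes and the random Fourier-feature embedding below are placeholder choices.

```python
# Hypothetical pixel-wise generator: every pixel is synthesized independently
# from its coordinates and a shared latent vector (no convolutions).
import torch
import torch.nn as nn

class PixelwiseGenerator(nn.Module):
    def __init__(self, latent_dim=512, n_freq=32, hidden=256):
        super().__init__()
        self.register_buffer("freqs", torch.randn(2, n_freq) * 10.0)
        self.net = nn.Sequential(
            nn.Linear(2 * n_freq + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, coords, z):
        # coords: (N, 2) pixel positions in [-1, 1]; z: (latent_dim,) shared latent.
        emb = torch.cat([torch.sin(coords @ self.freqs),
                         torch.cos(coords @ self.freqs)], dim=-1)
        zrep = z.unsqueeze(0).expand(coords.shape[0], -1)
        return self.net(torch.cat([emb, zrep], dim=-1))   # (N, 3) RGB values
```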
Image Generators with Conditionally-Independent Pixel Synthesis
— AK (@ak92501) November 30, 2020
pdf: https://t.co/kh1z5nQgnH
abs: https://t.co/VGsKP6Dpwc pic.twitter.com/EItaH9oIK3
11. Generative Layout Modeling using Constraint Graphs
Wamiq Para, Paul Guerrero, Tom Kelly, Leonidas Guibas, Peter Wonka
We propose a new generative model for layout generation. We generate layouts in three steps. First, we generate the layout elements as nodes in a layout graph. Second, we compute constraints between layout elements as edges in the layout graph. Third, we solve for the final layout using constrained optimization. For the first two steps, we build on recent transformer architectures. The layout optimization implements the constraints efficiently. We show three practical contributions compared to the state of the art: our work requires no user input, produces higher quality layouts, and enables many novel capabilities for conditional layout generation.
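The third step in the abstract, solving for the final layout under the generated constraints, can be pictured with a toy example. The room names, widths, and adjacency constraints below are made up, and real layouts involve 2D boxes and inequality constraints rather than this one-dimensional least-squares solve.

```python
# Toy constrained layout solve: "b starts where a ends" becomes the linear
# equation x_b - x_a = width_a, solved in the least-squares sense.
import numpy as np

widths = {"hall": 3.0, "kitchen": 2.0, "bedroom": 4.0}       # hypothetical
adjacency = [("hall", "kitchen"), ("kitchen", "bedroom")]     # (left, right) pairs
names = list(widths)

A = np.zeros((len(adjacency) + 1, len(names)))
b = np.zeros(len(adjacency) + 1)
for row, (left, right) in enumerate(adjacency):
    A[row, names.index(right)] = 1.0
    A[row, names.index(left)] = -1.0
    b[row] = widths[left]
A[-1, 0] = 1.0                       # anchor the first element at x = 0
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(dict(zip(names, x)))           # approximately {'hall': 0, 'kitchen': 3, 'bedroom': 5}
```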
Generative Layout Modeling using Constraint Graphs
— AK (@ak92501) November 30, 2020
pdf: https://t.co/95bAFEjZGS
abs: https://t.co/q25AzMKpyY pic.twitter.com/i0Mw40QZfg
12. FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge
Bichen Wu, Qing He, Peizhao Zhang, Thilo Koehler, Kurt Keutzer, Peter Vajda
Nowadays more and more applications can benefit from edge-based text-to-speech (TTS). However, most existing TTS models are too computationally expensive and are not flexible enough to be deployed on the diverse variety of edge devices with their equally diverse computational capacities. To address this, we propose FBWave, a family of efficient and scalable neural vocoders that can achieve optimal performance-efficiency trade-offs for different edge devices. FBWave is a hybrid flow-based generative model that combines the advantages of autoregressive and non-autoregressive models. It produces high quality audio and supports streaming during inference while remaining highly computationally efficient. Our experiments show that FBWave can achieve similar audio quality to WaveRNN while reducing MACs by 40x. More efficient variants of FBWave can achieve up to 109x fewer MACs while still delivering acceptable audio quality. Audio demos are available at https://bichenwu09.github.io/vocoder_demos.
FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge
— AK (@ak92501) November 30, 2020
pdf: https://t.co/SmWSOhBPUM
abs: https://t.co/k7PGLD4XU8
project page: https://t.co/iFXkCmaP8F pic.twitter.com/7VpS0iIwVA
13. True-data Testbed for 5G/B5G Intelligent Network
Yongming Huang, Shengheng Liu, Cheng Zhang, Xiaohu You, Hequan Wu
Future beyond-fifth-generation (B5G) and sixth-generation (6G) mobile communications will shift from facilitating interpersonal communications to supporting the Internet of Everything (IoE), where intelligent communications with full integration of big data and artificial intelligence (AI) will play an important role in improving network efficiency and providing high-quality service. As a rapidly evolving paradigm, AI-empowered mobile communications demand large amounts of data acquired from real network environments for systematic test and verification. Hence, we build the world's first true-data testbed for 5G/B5G intelligent network (TTIN), which comprises 5G/B5G on-site experimental networks, data acquisition & data warehouse, and AI engine & network optimization. In the TTIN, true network data acquisition, storage, standardization, and analysis are available, enabling system-level online verification of B5G/6G-oriented key technologies and supporting data-driven network optimization through a closed-loop control mechanism. This paper elaborates on the system architecture and module design of TTIN. Detailed technical specifications and some of the established use cases are also showcased.