1. Towards General Purpose Vision Systems
Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, Derek Hoiem
A special purpose learning system assumes knowledge of admissible tasks at design time. Adapting such a system to unforeseen tasks requires architecture manipulation such as adding an output head for each new task or dataset. In this work, we propose a task-agnostic vision-language system that accepts an image and a natural language task description and outputs bounding boxes, confidences, and text. The system supports a wide range of vision tasks such as classification, localization, question answering, captioning, and more. We evaluate the system’s ability to learn multiple skills simultaneously, to perform tasks with novel skill-concept combinations, and to learn new skills efficiently and without forgetting.
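The abstract describes a single interface for every task: an image plus a natural-language task description go in, and bounding boxes, confidences, and text come out. The sketch below is only a minimal stand-in for such an interface (toy modules of my own, not the authors' GPV architecture), meant to make the shared input/output contract concrete.

```python
# Minimal sketch of a task-agnostic vision-language interface. All module choices
# (backbone, fusion, head sizes) are placeholder assumptions, not the paper's design.
import torch
import torch.nn as nn

class TaskAgnosticVisionLanguageModel(nn.Module):
    def __init__(self, d_model=256, vocab_size=1000, num_boxes=50, max_text_len=20):
        super().__init__()
        self.visual_encoder = nn.Sequential(            # stand-in for a real image backbone
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.text_encoder = nn.EmbeddingBag(vocab_size, d_model)  # encodes the task description
        self.fuse = nn.Linear(2 * d_model, d_model)
        # Shared output heads used by *all* tasks:
        self.box_head = nn.Linear(d_model, num_boxes * 4)               # bounding boxes
        self.conf_head = nn.Linear(d_model, num_boxes)                  # per-box confidences
        self.text_head = nn.Linear(d_model, max_text_len * vocab_size)  # output text logits

    def forward(self, image, task_tokens):
        v = self.visual_encoder(image)                  # (B, d_model)
        t = self.text_encoder(task_tokens)              # (B, d_model)
        h = torch.relu(self.fuse(torch.cat([v, t], dim=-1)))
        B = image.shape[0]
        boxes = self.box_head(h).view(B, -1, 4).sigmoid()
        confs = self.conf_head(h).sigmoid()
        text_logits = self.text_head(h).view(B, -1, self.text_encoder.num_embeddings)
        return boxes, confs, text_logits

# The same forward pass serves "What is this?", "Find the dog", or "Describe the image";
# only the tokenized task description changes, never the architecture.
model = TaskAgnosticVisionLanguageModel()
boxes, confs, text_logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 8)))
```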
Towards General Purpose Vision Systems
— AK (@ak92501) April 5, 2021
pdf: https://t.co/lYmA9BIa3n
abs: https://t.co/KjXW1aQGBB
project page: https://t.co/U37GTpAxeI pic.twitter.com/PhyPZsAniH
Towards General Purpose Vision Systems
— Aran Komatsuzaki (@arankomatsuzaki) April 5, 2021
Proposes the first task-agnostic vision-language model for classification, grounding, QA, captioning etc that involve image, text and pointing (via bounding boxes) modalities.
abs: https://t.co/eHhZ2r806e
site: https://t.co/0vnktLqz6r pic.twitter.com/miNjTzgnxX
2. LatentCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions
Oğuz Kaan Yüksel, Enis Simsar, Ezgi Gülperi Er, Pinar Yanardag
Recent research has shown great potential for finding interpretable directions in the latent spaces of pre-trained Generative Adversarial Networks (GANs). These directions provide controllable generation and support a wide range of semantic editing operations such as zoom or rotation. The discovery of such directions is often performed in a supervised or semi-supervised fashion and requires manual annotations, limiting their applications in practice. In contrast, unsupervised discovery enables finding subtle directions that are hard to identify a priori. In this work, we propose a contrastive-learning-based approach for discovering semantic directions in the latent space of pretrained GANs in a self-supervised manner. Our approach finds semantically meaningful dimensions on par with state-of-the-art methods.
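A hedged sketch of the core idea as I read the abstract (not the authors' exact objective): learn a set of latent directions so that the feature changes each direction induces are consistent across samples and distinct from those of the other directions, via an NT-Xent-style contrastive loss over 'edited minus original' features.

```python
# Contrastive loss over direction-induced feature changes -- an illustrative
# approximation of the abstract's idea, not the paper's published objective.
import torch
import torch.nn.functional as F

def latent_direction_contrastive_loss(feat_diffs, temperature=0.5):
    """feat_diffs: (K, B, D) feature changes caused by K candidate directions applied
    to B latent codes. Pairs from the same direction are treated as positives."""
    K, B, D = feat_diffs.shape
    z = F.normalize(feat_diffs.reshape(K * B, D), dim=-1)
    sim = z @ z.t() / temperature                        # (K*B, K*B) cosine similarities
    labels = torch.arange(K).repeat_interleave(B)        # direction id of each row
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos_mask.fill_diagonal_(False)                       # a sample is not its own positive
    logits = sim - torch.eye(K * B) * 1e9                # remove self-similarity from softmax
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(log_prob * pos_mask).sum(1).div(pos_mask.sum(1)).mean()

# Toy usage: in practice feat_diffs would come from an intermediate GAN layer evaluated
# before and after shifting each latent code along each learned direction.
feat_diffs = torch.randn(4, 8, 128, requires_grad=True)
latent_direction_contrastive_loss(feat_diffs).backward()
```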
LatentCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions
— AK (@ak92501) April 5, 2021
pdf: https://t.co/KpNAPuIfT7
abs: https://t.co/YGKw1xxv2l pic.twitter.com/jLkX1IUDw8
Happy to share our latest research where latent space manipulation meets contrastive learning: "LatentCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions" https://t.co/JhdgyuBeij pic.twitter.com/slRXNL23zj
— Pinar Yanardag (@PINguAR) April 5, 2021
3. Decomposing 3D Scenes into Objects via Unsupervised Volume Segmentation
Karl Stelzner, Kristian Kersting, Adam R. Kosiorek
We present ObSuRF, a method which turns a single image of a scene into a 3D model represented as a set of Neural Radiance Fields (NeRFs), with each NeRF corresponding to a different object. A single forward pass of an encoder network outputs a set of latent vectors describing the objects in the scene. These vectors are used independently to condition a NeRF decoder, defining the geometry and appearance of each object. We make learning more computationally efficient by deriving a novel loss which allows training NeRFs on RGB-D inputs without explicit ray marching. After confirming that the model performs on par with or better than the state of the art on three 2D image segmentation benchmarks, we apply it to two multi-object 3D datasets: a multiview version of CLEVR, and a novel dataset in which scenes are populated by ShapeNet models. We find that after training ObSuRF on RGB-D views of training scenes, it is capable not only of recovering the 3D geometry of a scene depicted in a single input image, but also of segmenting it into objects, despite receiving no supervision in that regard.
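The decomposition in the abstract (one latent vector per object, each conditioning a NeRF decoder) can be made concrete with a small sketch. The MLP, the density-weighted mixing rule, and all shapes below are assumptions of mine, not the released ObSuRF model.

```python
# Per-object conditional NeRFs: a shared MLP maps (3D point, object latent) to
# (density, colour); the scene is a composition of the per-object fields.
import torch
import torch.nn as nn

class ConditionalNeRF(nn.Module):
    def __init__(self, latent_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))                        # (density, r, g, b)

    def forward(self, xyz, z):
        # xyz: (N, 3) query points; z: (latent_dim,) latent code of one object
        out = self.mlp(torch.cat([xyz, z.expand(xyz.shape[0], -1)], dim=-1))
        return torch.relu(out[:, :1]), torch.sigmoid(out[:, 1:])   # density >= 0, rgb in [0,1]

decoder = ConditionalNeRF()
points = torch.rand(1024, 3)
object_latents = torch.randn(5, 64)              # one latent per object, from the encoder

# Compose the scene: densities add, colours are density-weighted (a common mixing rule;
# the paper's exact compositing may differ).
densities, colours = zip(*(decoder(points, z) for z in object_latents))
total_density = torch.stack(densities).sum(0)
scene_rgb = (torch.stack(densities) * torch.stack(colours)).sum(0) / (total_density + 1e-8)
```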
Decomposing 3D Scenes into Objects via Unsupervised Volume Segmentation
— AK (@ak92501) April 5, 2021
pdf: https://t.co/oDPUl6LWlG
abs: https://t.co/OmqOukzsNV
project page: https://t.co/4AzFnng7Yo pic.twitter.com/kBsnJtnPzQ
4. Multiple Heads are Better than One: Few-shot Font Generation with Multiple Localized Experts
Song Park, Sanghyuk Chun, Junbum Cha, Bado Lee, Hyunjung Shim
A few-shot font generation (FFG) method has to satisfy two objectives: the generated images should preserve the underlying global structure of the target character and reflect the diverse local styles of the references. Existing FFG methods aim to disentangle content and style either by extracting a universal style representation or by extracting multiple component-wise style representations. However, previous methods either fail to capture diverse local styles or cannot generalize to characters with unseen components, e.g., unseen language systems. To mitigate these issues, we propose a novel FFG method, named Multiple Localized Experts Few-shot Font Generation Network (MX-Font). MX-Font extracts multiple style features that are not explicitly conditioned on component labels; instead, multiple experts automatically learn to represent different local concepts, e.g., the left-side sub-glyph. Owing to the multiple experts, MX-Font can capture diverse local concepts and generalize to unseen languages. During training, we use component labels as weak supervision to guide each expert to specialize in different local concepts. We formulate the assignment of components to experts as a graph matching problem and solve it with the Hungarian algorithm. We also employ an independence loss and a content-style adversarial loss to enforce content-style disentanglement. In our experiments, MX-Font outperforms previous state-of-the-art FFG methods in Chinese generation and in cross-lingual generation, e.g., Chinese to Korean. Source code is available at https://github.com/clovaai/mxfont.
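The assignment step named in the abstract, matching components to experts with the Hungarian algorithm, is easy to illustrate. Only the matching call below reflects the described mechanism; the score matrix is a random placeholder for whatever affinity the experts' features produce.

```python
# Hungarian assignment of component labels to experts (scores are placeholders).
import numpy as np
from scipy.optimize import linear_sum_assignment

num_experts, num_components = 6, 6
scores = np.random.rand(num_experts, num_components)   # higher = expert fits component better
expert_idx, component_idx = linear_sum_assignment(-scores)  # negate: the solver minimizes cost

for e, c in zip(expert_idx, component_idx):
    print(f"expert {e} is weakly supervised with component label {c} (score {scores[e, c]:.2f})")
```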
Multiple Heads are Better than One: Few-shot Font Generation with Multiple Localized Experts
— AK (@ak92501) April 5, 2021
pdf: https://t.co/nPMMewh0z4
abs: https://t.co/zCzjP8Mync pic.twitter.com/4RZcn82aT8
5. Information-constrained optimization: can adaptive processing of gradients help?
Jayadev Acharya, Clément L. Canonne, Prathamesh Mayekar, Himanshu Tyagi
We revisit first-order optimization under local information constraints such as local privacy, gradient quantization, and computational constraints limiting access to a few coordinates of the gradient. In this setting, the optimization algorithm is not allowed to directly access the complete output of the gradient oracle, but only gets limited information about it subject to the local information constraints. We study the role of adaptivity in processing the gradient output to obtain this limited information from it. We consider optimization for both convex and strongly convex functions and obtain tight or nearly tight lower bounds on the convergence rate when adaptive gradient processing is allowed. Prior work was restricted to convex functions and allowed only nonadaptive processing of gradients. For both of these function classes and for the three information constraints mentioned above, our lower bounds imply that adaptive processing of gradients cannot outperform nonadaptive processing in most regimes of interest. We complement these results by exhibiting a natural optimization problem under information constraints for which adaptive processing of gradients strictly outperforms nonadaptive processing.
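To make the setting concrete, here is a toy illustration of an information-constrained first-order oracle: a nonadaptive per-coordinate quantizer releasing only a few bits of each gradient, echoing the "5 bits" example in the tweet below. It illustrates the constraint only, not any algorithm or bound from the paper.

```python
# Quantized-gradient SGD on a simple strongly convex quadratic f(x) = 0.5 * ||x - x*||^2.
# The optimizer never sees the true gradient, only its b-bit-per-coordinate quantization.
import numpy as np

def quantized_gradient_oracle(grad, bits=5, clip=1.0):
    """Nonadaptive quantizer: clip each coordinate to [-clip, clip] and round it to one
    of 2**bits levels. This is the only gradient information the optimizer receives."""
    levels = 2 ** bits - 1
    g = np.clip(grad, -clip, clip)
    return np.round((g + clip) / (2 * clip) * levels) / levels * (2 * clip) - clip

x_star = np.array([0.3, -0.7, 0.5])        # minimizer (unknown to the optimizer)
x = np.zeros(3)
for _ in range(200):
    true_grad = x - x_star                 # computed by the oracle, never revealed in full
    x -= 0.1 * quantized_gradient_oracle(true_grad, bits=5)
print("final iterate:", x, " target:", x_star)
```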
Very happy about this paper on "info-constrained optimisation," w/ @AcharyaJayadev, @prathameshM220, and @hstyagi: https://t.co/A9Okn2dvNH
— Clément Canonne (@ccanonne_) April 5, 2021
You want to minimise a function, but the first-order oracle can only provide you w/ some limited info about the gradient. E.g., 5 bits.
1/3 pic.twitter.com/qqQ0Dsk0d8
6. The Spatially-Correlative Loss for Various Image Translation Tasks
Chuanxia Zheng, Tat-Jen Cham, Jianfei Cai
We propose a novel spatially-correlative loss that is simple, efficient, and yet effective for preserving scene structure consistency while supporting large appearance changes during unpaired image-to-image (I2I) translation. Previous methods attempt this by using pixel-level cycle-consistency or feature-level matching losses, but the domain-specific nature of these losses hinders translation across large domain gaps. To address this, we exploit the spatial patterns of self-similarity as a means of defining scene structure. Our spatially-correlative loss is geared towards capturing only spatial relationships within an image rather than domain appearance. We also introduce a new self-supervised learning method to explicitly learn spatially-correlative maps for each specific translation task. We show distinct improvements over baseline models in all three modes of unpaired I2I translation: single-modal, multi-modal, and even single-image translation. This new loss can easily be integrated into existing network architectures and thus allows wide applicability.
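A minimal sketch of a spatially-correlative comparison in the spirit of the abstract (my paraphrase, not the authors' implementation): build a self-similarity map over feature locations for each image and penalize differences between the two maps, so that only spatial structure, and not domain appearance, is constrained.

```python
# Self-similarity maps as a structure-only comparison between a source image and its
# translation. Features here are random stand-ins for encoder activations.
import torch
import torch.nn.functional as F

def self_similarity_map(feat):
    """feat: (B, C, H, W) features. Returns (B, H*W, H*W) cosine self-similarity."""
    f = F.normalize(feat.flatten(2), dim=1)          # unit-norm feature at each location
    return torch.bmm(f.transpose(1, 2), f)           # pairwise similarities between locations

def spatially_correlative_loss(feat_src, feat_trans):
    return F.l1_loss(self_similarity_map(feat_src), self_similarity_map(feat_trans))

loss = spatially_correlative_loss(torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16))
```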
The Spatially-Correlative Loss for Various Image Translation Tasks
— AK (@ak92501) April 5, 2021
pdf: https://t.co/Xt0WpdAJyX
abs: https://t.co/ip3oPcXq4B pic.twitter.com/BgIqUstDl3
7. Language-based Video Editing via Multi-Modal Multi-Level Transformer
Tsu-Jui Fu, Xin Eric Wang, Scott T. Grafton, Miguel P. Eckstein, William Yang Wang
Video editing tools are widely used nowadays for digital design. Although the demand for these tools is high, the prior knowledge required makes it difficult for novices to get started. Systems that could follow natural language instructions to perform automatic editing would significantly improve accessibility. This paper introduces the language-based video editing (LBVE) task, which requires the model to edit a source video into a target video guided by a text instruction. LBVE has two features: 1) the scenario of the source video is preserved instead of generating a completely different video; 2) the semantics are presented differently in the target video, and all changes are controlled by the given instruction. We propose a Multi-Modal Multi-Level Transformer (ML-Transformer) to carry out LBVE. The ML-Transformer dynamically learns the correspondence between video perception and language semantics at different levels, which benefits both video understanding and video frame synthesis. We build three new datasets for evaluation, including two diagnostic datasets and one of natural videos with human-labeled text. Extensive experimental results show that the ML-Transformer is effective for video editing and that LBVE opens a new direction for vision-and-language research.
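As a rough illustration of "correspondence between video perception and language semantics at different levels", the sketch below lets language tokens cross-attend to video features at two spatial resolutions. It is a schematic of my own, not the authors' ML-Transformer.

```python
# Multi-level cross-attention: text queries attend to video features level by level,
# with residual fusion. Dimensions and the number of levels are arbitrary assumptions.
import torch
import torch.nn as nn

class MultiLevelCrossAttention(nn.Module):
    def __init__(self, d_model=128, num_levels=2, nheads=4):
        super().__init__()
        self.levels = nn.ModuleList(
            nn.MultiheadAttention(d_model, nheads, batch_first=True) for _ in range(num_levels))

    def forward(self, text_tokens, video_feature_levels):
        # text_tokens: (B, T, d); video_feature_levels: list of (B, N_l, d), one per level
        fused = text_tokens
        for attn, vid in zip(self.levels, video_feature_levels):
            out, _ = attn(fused, vid, vid)            # queries = text, keys/values = video
            fused = fused + out                       # residual fusion across levels
        return fused

model = MultiLevelCrossAttention()
text = torch.randn(2, 10, 128)
video_levels = [torch.randn(2, 8 * 14 * 14, 128), torch.randn(2, 8 * 7 * 7, 128)]
fused = model(text, video_levels)                     # (2, 10, 128)
```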
Language-based Video Editing via Multi-Modal Multi-Level Transformer
— AK (@ak92501) April 5, 2021
pdf: https://t.co/APK6dVUCyO
abs: https://t.co/IGCqPC2zWH pic.twitter.com/j9XOJQZCOb
8. LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference
Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We re-evaluate principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast-inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show that they are suitable for most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 3.3 times faster than EfficientNet on CPU.
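One way to read the "attention bias" is as a learned per-head bias, indexed by the relative offset between query and key positions, added directly to the attention logits instead of injecting positional embeddings into the tokens. The sketch below follows that reading; the resolution and indexing scheme are my assumptions and may differ from LeViT's implementation.

```python
# Self-attention over a flattened feature map with a learned relative-position bias
# added to the attention logits.
import torch
import torch.nn as nn

class AttentionWithBias(nn.Module):
    def __init__(self, dim=64, heads=4, resolution=7):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per head and per relative (dy, dx) offset.
        self.bias = nn.Parameter(torch.zeros(heads, (2 * resolution - 1) ** 2))
        coords = torch.stack(torch.meshgrid(
            torch.arange(resolution), torch.arange(resolution), indexing="ij")).flatten(1)
        rel = coords[:, :, None] - coords[:, None, :] + resolution - 1      # offsets >= 0
        self.register_buffer("bias_idx", rel[0] * (2 * resolution - 1) + rel[1])

    def forward(self, x):                              # x: (B, N, dim), N = resolution**2
        B, N, C = x.shape
        q, k, v = self.qkv(x).view(B, N, 3, self.heads, -1).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.bias[:, self.bias_idx]
        return self.proj((attn.softmax(-1) @ v).transpose(1, 2).reshape(B, N, C))

out = AttentionWithBias()(torch.randn(2, 49, 64))      # a 7x7 feature map as 49 tokens
```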
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference
— AK (@ak92501) April 5, 2021
pdf: https://t.co/dUNZwJBaw8
abs: https://t.co/7EfboUXLpa
"LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff" pic.twitter.com/9D2xiy6zfs
9. TFill: Image Completion via a Transformer-Based Architecture
Chuanxia Zheng, Tat-Jen Cham, Jianfei Cai
Bridging distant context interactions is important for high-quality image completion with large masks. Previous methods attempting this via deep or large-receptive-field (RF) convolutions cannot escape the dominance of nearby interactions, which can be suboptimal. In this paper, we propose treating image completion as a directionless sequence-to-sequence prediction task, and in a first phase deploy a transformer to directly capture long-range dependencies in the encoder. Crucially, we employ a restrictive CNN with small and non-overlapping RFs for token representation, which allows the transformer to explicitly model long-range context relations with equal importance in all layers, without implicitly confounding neighboring tokens as larger RFs would. In a second phase, to improve appearance consistency between visible and generated regions, a novel attention-aware layer (AAL) is introduced to better exploit distantly related features and to avoid the insular effect of standard attention. Overall, extensive experiments demonstrate superior performance compared to state-of-the-art methods on several datasets.
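The first-phase token representation described above can be sketched as a convolution whose kernel equals its stride, so each token sees exactly one small, non-overlapping patch, with a standard transformer encoder then supplying all long-range context. This is my reading of the abstract, not the released TFill code.

```python
# Restrictive tokenization (kernel == stride, so receptive fields do not overlap)
# followed by a transformer encoder for long-range context.
import torch
import torch.nn as nn

class RestrictiveTokenEncoder(nn.Module):
    def __init__(self, patch=8, d_model=256, layers=4):
        super().__init__()
        self.to_tokens = nn.Conv2d(4, d_model, kernel_size=patch, stride=patch)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), layers)

    def forward(self, image, mask):
        # image: (B, 3, H, W); mask: (B, 1, H, W) with 1 where pixels are missing
        x = self.to_tokens(torch.cat([image * (1 - mask), mask], dim=1))   # (B, d, H/p, W/p)
        tokens = x.flatten(2).transpose(1, 2)                              # (B, N, d)
        return self.transformer(tokens)        # every layer sees all tokens equally

enc = RestrictiveTokenEncoder()
out = enc(torch.randn(1, 3, 128, 128), torch.zeros(1, 1, 128, 128))        # (1, 256, 256)
```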
TFill: Image Completion via a Transformer-Based Architecture
— AK (@ak92501) April 5, 2021
pdf: https://t.co/vyuxTgKHnn
abs: https://t.co/yzrw0ancGd pic.twitter.com/Jd4SJ5XIr5
10. Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques
Kang-wook Kim, Seung-won Park, Myun-chul Joe
In this paper, we frame current state-of-the-art voice conversion (VC) systems as two-encoder-one-decoder models. After comparing these models, we combine their best features and propose Assem-VC, a new state-of-the-art any-to-many non-parallel VC system. This paper also introduces GTA finetuning to VC, which significantly improves the quality and the speaker similarity of the outputs. Assem-VC outperforms the previous state-of-the-art approaches in both naturalness and speaker similarity on the VCTK dataset. As an objective analysis, we also explore the degree of speaker disentanglement of features such as phonetic posteriorgrams (PPGs). Our investigation indicates that many-to-many VC results are no longer distinguishable from human speech and that similar quality can be achieved with any-to-many models. Audio samples are available at https://mindslab-ai.github.io/assem-vc/
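A schematic of the two-encoder-one-decoder view described in the abstract (toy modules of my own, not the Assem-VC code): one encoder extracts speaker-independent content, another extracts the target speaker identity, and a single decoder combines them into converted speech features.

```python
# Two encoders, one decoder: content from the source utterance, identity from the
# target speaker, combined frame by frame into converted mel features.
import torch
import torch.nn as nn

class TwoEncoderOneDecoderVC(nn.Module):
    def __init__(self, n_mels=80, content_dim=128, speaker_dim=64):
        super().__init__()
        self.content_encoder = nn.GRU(n_mels, content_dim, batch_first=True)
        self.speaker_encoder = nn.Sequential(                     # utterance-level embedding
            nn.Linear(n_mels, speaker_dim), nn.ReLU(), nn.Linear(speaker_dim, speaker_dim))
        self.decoder = nn.GRU(content_dim + speaker_dim, n_mels, batch_first=True)

    def forward(self, source_mel, target_speaker_mel):
        content, _ = self.content_encoder(source_mel)                # (B, T, content_dim)
        speaker = self.speaker_encoder(target_speaker_mel.mean(1))   # (B, speaker_dim)
        speaker = speaker.unsqueeze(1).expand(-1, content.shape[1], -1)
        converted, _ = self.decoder(torch.cat([content, speaker], dim=-1))
        return converted                                             # (B, T, n_mels)

vc = TwoEncoderOneDecoderVC()
out = vc(torch.randn(2, 100, 80), torch.randn(2, 120, 80))  # source mel + target-speaker mel
```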
Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques
— AK (@ak92501) April 5, 2021
pdf: https://t.co/HjZs7Mfe9F
abs: https://t.co/zM9BBRMtB5
project page: https://t.co/1NrQmOSvRg pic.twitter.com/NtczwN60uH
11. Towards High Fidelity Face Relighting with Realistic Shadows
Andrew Hou, Ze Zhang, Michel Sarkis, Ning Bi, Yiying Tong, Xiaoming Liu
Existing face relighting methods often struggle with two problems: maintaining the local facial details of the subject, and accurately removing and synthesizing shadows in the relit image, especially hard shadows. We propose a novel deep face relighting method that addresses both problems. Our method learns to predict the ratio (quotient) image between a source image and the target image under the desired lighting, allowing us to relight the image while maintaining the local facial details. During training, our model also learns to accurately modify shadows by using estimated shadow masks to emphasize the high-contrast shadow borders. Furthermore, we introduce a method that uses the shadow mask to estimate the ambient light intensity in an image, and are thus able to leverage multiple datasets with different global lighting intensities during training. With quantitative and qualitative evaluations on the Multi-PIE and FFHQ datasets, we demonstrate that our proposed method faithfully maintains the local facial details of the subject and accurately handles hard shadows while achieving state-of-the-art face relighting performance.
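The quotient-image idea reduces to simple arithmetic: if the network predicts the ratio between the target-lit image and the source image, relighting is an element-wise multiply, which leaves the local facial detail carried by the source untouched. The tensors below are toy stand-ins, not outputs of the authors' network.

```python
# Relighting via a predicted ratio (quotient) image.
import torch

source = torch.rand(1, 3, 256, 256)                # source face image in [0, 1]
predicted_ratio = torch.rand(1, 3, 256, 256) * 2   # stand-in for the network's prediction
relit = (source * predicted_ratio).clamp(0, 1)     # target-lit image = source * ratio
```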
Towards High Fidelity Face Relighting with Realistic Shadows
— AK (@ak92501) April 5, 2021
pdf: https://t.co/8JdhkFvfRh
abs: https://t.co/wj0fWszLJ5 pic.twitter.com/wpqMyfb9xx