1. Finetuning Pretrained Transformers into RNNs
Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen, Noah A. Smith
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation, but this comes with a significant computational overhead, as the attention mechanism scales quadratically with sequence length. Efficient transformer variants have received increasing interest in recent work. Among them, a linear-complexity recurrent variant has proven well suited for autoregressive generation. It approximates the softmax attention with randomized or heuristic feature maps, but can be difficult to train and may yield suboptimal accuracy. This work aims to convert a pretrained transformer into its efficient recurrent counterpart, improving efficiency while retaining accuracy. Specifically, we propose a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, we replace the softmax attention with its linear-complexity recurrent alternative and then finetune. With a learned feature map, our approach provides an improved tradeoff between efficiency and accuracy over the standard transformer and other recurrent variants. We also show that the finetuning process has a lower training cost than training these recurrent variants from scratch. As many recent models for natural language tasks are increasingly dependent on large-scale pretrained transformers, this work presents a viable approach to improving inference efficiency without repeating the expensive pretraining process.
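To make the idea concrete, here is a minimal PyTorch sketch of linear attention with a learned feature map, the kind of module that would replace softmax attention before finetuning. The module name, sizes, and the elu-plus-one feature map are illustrative assumptions, not the paper's exact architecture, and the sketch omits the causal masking needed for autoregressive decoding.

```python
# Linear-complexity attention with a learned feature map phi, so attention
# can be computed as  Attn(Q, K, V) ~= phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1)
# instead of materializing the full softmax attention matrix.
import torch
import torch.nn as nn

class LearnedFeatureMapLinearAttention(nn.Module):
    def __init__(self, dim, feat_dim=64):
        super().__init__()
        # Learned feature map: a linear projection followed by an
        # elementwise non-negativity (elu + 1), a common kernel choice.
        self.phi = nn.Sequential(nn.Linear(dim, feat_dim), nn.ELU())

    def feature_map(self, x):
        return self.phi(x) + 1.0  # keep features strictly positive

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, dim); non-causal for brevity.
        q, k = self.feature_map(q), self.feature_map(k)
        kv = torch.einsum("bsf,bsd->bfd", k, v)              # sum_s phi(k_s) v_s^T
        z = 1.0 / (torch.einsum("btf,bf->bt", q, k.sum(dim=1)) + 1e-6)
        return torch.einsum("btf,bfd,bt->btd", q, kv, z)

# Usage: swap this in for the softmax attention of a pretrained block,
# then finetune the whole model.
attn = LearnedFeatureMapLinearAttention(dim=512)
q = k = v = torch.randn(2, 128, 512)
out = attn(q, k, v)  # (2, 128, 512)
```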
Finetuning Pretrained Transformers into RNNs
— Aran Komatsuzaki (@arankomatsuzaki) March 25, 2021
Successfully converts a pretrained transformer into its efficient linear-complexity recurrent counterpart with a learned feature map to improve the efficiency while retaining the accuracy. https://t.co/yRfENT2ch2 pic.twitter.com/k4gwCljy7m
Finetuning Pretrained Transformers into RNNs
— AK (@ak92501) March 25, 2021
pdf: https://t.co/EcFo5w5JYV
abs: https://t.co/XIVw5xTYUG pic.twitter.com/KKN1Tz3m3O
2. FastMoE: A Fast Mixture-of-Expert Training System
Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, Jie Tang
Mixture-of-Experts (MoE) models show strong potential for scaling language models to trillions of parameters. However, training a trillion-scale MoE requires algorithm and system co-design for a well-tuned, high-performance distributed training system. Unfortunately, the only existing platform that meets these requirements depends strongly on Google's hardware (TPU) and software (Mesh TensorFlow) stack and is not open or available to the public, especially to the GPU and PyTorch communities. In this paper, we present FastMoE, a distributed MoE training system based on PyTorch with common accelerators. The system provides a hierarchical interface for both flexible model design and easy adaptation to different applications, such as Transformer-XL and Megatron-LM. In contrast to a direct implementation of MoE models in PyTorch, FastMoE is highly optimized for training speed using sophisticated high-performance acceleration techniques. The system supports placing different experts on multiple GPUs across multiple nodes, allowing the number of experts to scale linearly with the number of GPUs. The source code of FastMoE is available at https://github.com/laekov/fastmoe under the Apache-2 license.
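For readers unfamiliar with the underlying computation, below is a minimal single-device PyTorch sketch of a Mixture-of-Experts layer with top-1 gating, the operation that systems like FastMoE distribute across GPUs. The class name and sizes are made up for illustration; this is not FastMoE's API.

```python
# Toy MoE layer: a gate routes each token to its highest-scoring expert,
# the chosen expert processes it, and the results are scattered back.
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, d_model=256, d_hidden=512, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (tokens, d_model); in a transformer this replaces the FFN.
        scores = self.gate(x).softmax(dim=-1)        # (tokens, num_experts)
        weight, expert_idx = scores.max(dim=-1)      # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e                   # tokens routed to expert e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

moe = ToyMoE()
y = moe(torch.randn(10, 256))  # (10, 256)
```

In a distributed setting the experts live on different GPUs, so the per-expert dispatch above becomes an all-to-all exchange of tokens; that communication and its optimization are what a system like FastMoE provides.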
FastMoE: A Fast Mixture-of-Expert Training System
— Aran Komatsuzaki (@arankomatsuzaki) March 25, 2021
Presents FastMoE, a distributed MoE training system based on PyTorch that works with GPUs unlike the existing MoE.
abs: https://t.co/TPpwrT4QOt
code: https://t.co/4GNIpJssFo pic.twitter.com/KqEErUSZhQ
FastMoE: A Fast Mixture-of-Expert Training System
— AK (@ak92501) March 25, 2021
pdf: https://t.co/8x3dr431BS
abs: https://t.co/YqZ3orckPr
github: https://t.co/DcTY8k5xln pic.twitter.com/yfcnElbTpt
3. Can Vision Transformers Learn without Natural Images?
Kodai Nakashima, Hirokatsu Kataoka, Asato Matsumoto, Kenji Iwata, Nakamasa Inoue
Can we complete pre-training of Vision Transformers (ViT) without natural images and human-annotated labels? Although a pre-trained ViT seems to rely heavily on large-scale datasets and human-annotated labels, recent large-scale datasets raise several problems in terms of privacy violations, inadequate fairness protection, and labor-intensive annotation. In this paper, we pre-train ViT without any image collection or annotation labor. We experimentally verify that our proposed framework partially outperforms sophisticated Self-Supervised Learning (SSL) methods such as SimCLRv2 and MoCov2 without using any natural images in the pre-training phase. Moreover, although a ViT pre-trained without natural images produces visualizations that differ somewhat from those of an ImageNet pre-trained ViT, it can interpret natural image datasets to a large extent. For example, the performance rates on the CIFAR-10 dataset are as follows: our proposal 97.6 vs. SimCLRv2 97.4 vs. ImageNet 98.0.
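The "without natural images" setting in this line of work relies on formula-driven supervision such as fractal images. Below is a hedged sketch of rendering a synthetic image from a random iterated function system (IFS); the parameter ranges and rendering choices are illustrative assumptions, not the paper's exact FractalDB recipe.

```python
# Render a grayscale "fractal" image by iterating randomly chosen affine
# maps and accumulating the visited points into a pixel grid.
import numpy as np

def render_ifs_image(n_points=20000, n_maps=3, size=64, seed=0):
    rng = np.random.default_rng(seed)
    # Each map is a random 2x2 linear transform plus a translation.
    A = rng.uniform(-0.8, 0.8, size=(n_maps, 2, 2))
    b = rng.uniform(-0.5, 0.5, size=(n_maps, 2))
    x = np.zeros(2)
    img = np.zeros((size, size), dtype=np.float32)
    for _ in range(n_points):
        k = rng.integers(n_maps)
        x = A[k] @ x + b[k]                           # apply a random map
        # Map the point into pixel coordinates and accumulate.
        i, j = ((x + 2.0) / 4.0 * (size - 1)).astype(int)
        if 0 <= i < size and 0 <= j < size:
            img[i, j] += 1.0
    return img / max(img.max(), 1.0)

# One IFS (its parameters) defines one synthetic "class"; images rendered
# from it can serve as labeled data for pre-training a ViT.
image = render_ifs_image()
```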
Can Vision Transformers Learn without Natural Images?
— AK (@ak92501) March 25, 2021
pdf: https://t.co/aLjEYIjZhu
abs: https://t.co/7BSqg3sttI
project page: https://t.co/T14KziiPlM pic.twitter.com/DgUH45eqnM
Can Vision Transformers Learn without Natural Images?
— Aran Komatsuzaki (@arankomatsuzaki) March 25, 2021
Partially outperforms strong SSL baselines such as SimCLRv2 and MoCov2 w/o using any natural images in the pre-training phase.
abs: https://t.co/SYrWTjpwqk
project: https://t.co/2pUrWTk5w9 pic.twitter.com/KDgLHP8iQu
Nature is often said to have fractal structure, so if you make full use of fractal images, do you even need an image dataset anymore? My friend @HirokatuKataoka and colleagues explored exactly that and received an Honorable Mention at ACCV; now the Transformer version of that work is out. https://t.co/xkckHfSBiy
— Yoshitaka Ushiku (@losnuevetoros) March 25, 2021
We have posted our paper "Can Vision Transformers Learn without Natural Images?" on arXiv!
— cvpaper.challenge | AI/CV研究コミュニティ (@CVpaperChalleng) March 25, 2021
PDF: https://t.co/sfjEkzwd8p
Project page: https://t.co/3e4vpC2Eae pic.twitter.com/HdbdMXGOlU
4. AutoMix: Unveiling the Power of Mixup
Zicheng Liu, Siyuan Li, Di Wu, Zhiyuan Chen, Lirong Wu, Jianzhu Guo, Stan Z. Li
Mixup-based data augmentation has achieved great success as a regularizer for deep neural networks. However, existing mixup methods require explicitly designed mixing policies. In this paper, we present a flexible, general Automatic Mixup (AutoMix) framework that uses discriminative features to learn a sample mixing policy adaptively. We regard mixup as a pretext task and split it into two sub-problems: mixed-sample generation and mixup classification. To this end, we design a lightweight mix block that generates synthetic samples based on feature maps and mixed labels. Since the two sub-problems have the nature of Expectation-Maximization (EM), we also propose a momentum training pipeline that optimizes the mixed-sample generation and mixup classification processes alternately in an end-to-end fashion. Extensive experiments on six popular classification benchmarks show that AutoMix consistently outperforms other leading mixup methods and improves generalization to downstream tasks. We hope AutoMix will motivate the community to rethink the role of mixup in representation learning. The code will be released soon.
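For context, here is a minimal sketch of standard, hand-designed mixup, the fixed-policy baseline that AutoMix replaces with a learned mix block; the alpha value and one-hot label handling are just illustrative defaults.

```python
# Standard mixup: blend pairs of samples and their one-hot labels with a
# Beta-distributed weight, instead of learning the mixing policy.
import torch

def mixup(x, y_onehot, alpha=1.0):
    """x: (batch, ...) inputs, y_onehot: (batch, classes) labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))                 # random pairing
    mixed_x = lam * x + (1 - lam) * x[perm]
    mixed_y = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return mixed_x, mixed_y

x = torch.randn(8, 3, 32, 32)
y = torch.eye(10)[torch.randint(0, 10, (8,))]
mx, my = mixup(x, y)
```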
AutoMix: Unveiling the Power of Mixup https://t.co/rgXpFRzNjE pic.twitter.com/rkhMa10RhJ
— phalanx (@ZFPhalanx) March 25, 2021
5. Multi-view 3D Reconstruction with Transformer
Dan Wang, Xinrui Cui, Xun Chen, Zhengxia Zou, Tianyang Shi, Septimiu Salcudean, Z. Jane Wang, Rabab Ward
Deep CNN-based methods have so far achieved state-of-the-art results in multi-view 3D object reconstruction. Despite this considerable progress, the two core modules of these methods, multi-view feature extraction and fusion, are usually investigated separately, and the relations between objects in different views are rarely explored. In this paper, inspired by the recent success of self-attention-based Transformer models, we reformulate multi-view 3D reconstruction as a sequence-to-sequence prediction problem and propose a new framework named 3D Volume Transformer (VolT) for this task. Unlike previous CNN-based methods that use separate designs, we unify feature extraction and view fusion in a single Transformer network. A natural advantage of our design lies in the exploration of view-to-view relationships using self-attention among multiple unordered inputs. On ShapeNet, a large-scale 3D reconstruction benchmark, our method achieves new state-of-the-art accuracy in multi-view reconstruction with fewer parameters than other CNN-based methods. Experimental results also suggest the strong scaling capability of our method. Our code will be made publicly available.
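A rough sketch of the sequence-to-sequence framing, assuming per-view image features have already been extracted: the views form a token sequence, self-attention fuses them, and a head decodes a voxel occupancy grid. Layer sizes, pooling, and the decoder below are illustrative, not the actual VolT architecture.

```python
# Treat each view's feature vector as a token; self-attention lets every
# view attend to every other view before a simple voxel decoder.
import torch
import torch.nn as nn

class ToyViewFusionTransformer(nn.Module):
    def __init__(self, feat_dim=256, n_layers=2, voxel_res=32):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.decoder = nn.Linear(feat_dim, voxel_res ** 3)
        self.voxel_res = voxel_res

    def forward(self, view_feats):
        # view_feats: (batch, n_views, feat_dim), one token per input view.
        fused = self.encoder(view_feats).mean(dim=1)   # pool over views
        logits = self.decoder(fused)
        r = self.voxel_res
        return logits.view(-1, r, r, r)                # occupancy logits

model = ToyViewFusionTransformer()
voxels = model(torch.randn(2, 5, 256))  # (2, 32, 32, 32)
```

Because self-attention is permutation-invariant over the tokens, the views can arrive in any order, which matches the "multiple unordered inputs" point in the abstract.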
Multi-view 3D Reconstruction with Transformer
— AK (@ak92501) March 25, 2021
pdf: https://t.co/BsedtkzixG
abs: https://t.co/BgmFtsxyUH pic.twitter.com/Os8fUpKB5z
6. One-Shot GAN: Learning to Generate Samples from Single Images and Videos
Vadim Sushko, Juergen Gall, Anna Khoreva
Given a large number of training samples, GANs can achieve remarkable performance on the image synthesis task. However, training GANs in extremely low-data regimes remains a challenge, as overfitting often occurs, leading to memorization or training divergence. In this work, we introduce One-Shot GAN, an unconditional generative model that can learn to generate samples from a single training image or a single video clip. We propose a two-branch discriminator architecture, with content and layout branches designed to judge internal content and scene layout realism separately from each other. This allows the synthesis of visually plausible, novel compositions of a scene, with varying content and layout, while preserving the context of the original sample. Compared to previous single-image GAN models, One-Shot GAN generates more diverse, higher-quality images, and is not restricted to the single-image setting. We show that our model successfully handles other one-shot regimes and introduce a new task of learning generative models from a single video.
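A minimal sketch of the two-branch discriminator idea: a shared convolutional trunk followed by a content head that pools away spatial information and a layout head that keeps it. The architectures below are assumptions for illustration, not the paper's exact design.

```python
# Two-branch discriminator: one head judges "what" is in the image
# (content, spatially pooled), the other judges "where" it is (layout map).
import torch
import torch.nn as nn

class TwoBranchDiscriminator(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        # Content branch: global pooling discards layout.
        self.content_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch * 2, 1))
        # Layout branch: a 1x1 conv keeps the spatial map.
        self.layout_head = nn.Conv2d(ch * 2, 1, kernel_size=1)

    def forward(self, x):
        h = self.trunk(x)
        return self.content_head(h), self.layout_head(h)

disc = TwoBranchDiscriminator()
content_score, layout_map = disc(torch.randn(4, 3, 64, 64))
# content_score: (4, 1); layout_map: (4, 1, 16, 16)
```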
One-Shot GAN: Learning to Generate Samples from Single Images and Videos
— AK (@ak92501) March 25, 2021
pdf: https://t.co/GikSikQxPx
abs: https://t.co/FLTxC6PoAM pic.twitter.com/plefaU1a5n
7. Koo: The new King? Characterizing India’s Emerging Social Network
Asmit Kumar Singh, Chirag Jain, Jivitesh Jain, Rishi Raj Jain, Shradha Sehgal, Ponnurangam Kumaraguru
Social media has grown exponentially in a short period, coming to the forefront of communications and online interactions. Despite this rapid growth, social media platforms have been unable to scale to different languages globally and remain inaccessible to many. In this report, we characterize Koo, a multilingual micro-blogging site that rose in popularity in 2021 as an Indian alternative to Twitter. We collected a dataset of 4.07 million users, 163.12 million follower-following relationships, and their content and activity across 12 languages. The prominent presence of Indian languages in the discourse on Koo indicates the platform's success in promoting regional languages. We observe that Koo's follower-following network is much denser than Twitter's, comprising closely-knit linguistic communities. This initial characterization heralds a deeper study of the dynamics of the multilingual social network and its diverse Indian user base.
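As a small illustration of the kind of measurement reported (with toy edges, not the Koo data), the density of a directed follower-following graph can be computed with networkx:

```python
# Density of a directed follower graph: edges / (n * (n - 1)).
import networkx as nx

edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "b")]  # follower -> followee
g = nx.DiGraph(edges)
print(nx.density(g))  # 4 edges among 3 users -> ~0.67
```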
Is #kooapp, the new social network, India’s king?
— Ponnurangam Kumaraguru “PK” (@ponguru) March 25, 2021
With 4M users & 163M follower relations, we find out!
Linguistic communities; #Hindi #Bengaluru prominent;
Video: https://t.co/YxaTJUXdk1
Full report: https://t.co/XSFEQJjz2t
cc @aprameya @mayankbidawatka @rsprasad @kooindia
8. AcinoSet: A 3D Pose Estimation Dataset and Baseline Models for Cheetahs in the Wild
Daniel Joska, Liam Clark, Naoya Muramatsu, Ricardo Jericevich, Fred Nicolls, Alexander Mathis, Mackenzie W. Mathis, Amir Patel
Animals are capable of extreme agility, yet understanding their complex dynamics, which have ecological, biomechanical and evolutionary implications, remains challenging. Being able to study this incredible agility will be critical for the development of next-generation autonomous legged robots. In particular, the cheetah (Acinonyx jubatus) is supremely fast and maneuverable, yet quantifying its whole-body 3D kinematics during locomotion in the wild remains a challenge, even with new deep learning-based methods. In this work we present AcinoSet, an extensive dataset of free-running cheetahs in the wild that contains 119,490 frames of multi-view synchronized high-speed video footage, camera calibration files, and 7,588 human-annotated frames. We utilize markerless animal pose estimation to provide 2D keypoints. Then, we use three methods that serve as strong baselines for 3D pose estimation tool development: traditional sparse bundle adjustment, an Extended Kalman Filter, and a trajectory optimization-based method we call Full Trajectory Estimation. The resulting 3D trajectories, human-checked 3D ground truth, and an interactive tool to inspect the data are also provided. We believe this dataset will be useful for a diverse range of fields, such as ecology, neuroscience, robotics, biomechanics, and computer vision.
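One of the simplest multi-view baselines rests on triangulating each 2D keypoint across calibrated cameras. Below is a hedged sketch of direct linear transform (DLT) triangulation with made-up camera matrices; it is not the paper's bundle adjustment or trajectory-optimization code.

```python
# Triangulate one 3D point from its 2D detections in calibrated cameras.
import numpy as np

def triangulate_dlt(proj_mats, points_2d):
    """proj_mats: list of 3x4 camera projection matrices.
    points_2d: list of (u, v) observations, one per camera."""
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        rows.append(u * P[2] - P[0])      # u * (P3 . X) - (P1 . X) = 0
        rows.append(v * P[2] - P[1])      # v * (P3 . X) - (P2 . X) = 0
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]                            # null-space solution
    return X[:3] / X[3]                   # homogeneous -> Euclidean

# Toy example: two cameras looking at the point (0, 0, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.0, 0.0, 5.0, 1.0])
uv1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]
uv2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
print(triangulate_dlt([P1, P2], [uv1, uv2]))  # approximately [0, 0, 5]
```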
🐆 Care about 3D animal pose & bio-inspired robotics?
— Mackenzie Mathis (@TrackingActions) March 25, 2021
🔥 In collaboration w/@UnitAfrican:
𝘈𝘤𝘪𝘯𝘰𝘚𝘦𝘵: 𝘈 3𝘋 𝘗𝘰𝘴𝘦 𝘌𝘴𝘵𝘪𝘮𝘢𝘵𝘪𝘰𝘯 𝘋𝘢𝘵𝘢𝘴𝘦𝘵 & 𝘉𝘢𝘴𝘦𝘭𝘪𝘯𝘦 𝘔𝘰𝘥𝘦𝘭𝘴 𝘧𝘰𝘳 𝘊𝘩𝘦𝘦𝘵𝘢𝘩𝘴 𝘪𝘯 𝘵𝘩𝘦 𝘞𝘪𝘭𝘥
🥳 at #ICRA2021 https://t.co/actdU2tHLY pic.twitter.com/v9lHQbz9zJ
9. Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency
Qing Liu, Vignesh Ramanathan, Dhruv Mahajan, Alan Yuille, Zhenheng Yang
Weakly supervised instance segmentation reduces the cost of the annotations required to train models. However, existing approaches that rely only on image-level class labels predominantly suffer from errors due to (a) partial segmentation of objects and (b) missing object predictions. We show that these issues can be better addressed by training with weakly labeled videos instead of images. In videos, motion and the temporal consistency of predictions across frames provide complementary signals that can help segmentation. We are the first to explore the use of these video signals to tackle weakly supervised instance segmentation. We propose two ways to leverage this information in our model. First, we adapt the inter-pixel relation network (IRN) to effectively incorporate motion information during training. Second, we introduce a new MaskConsist module, which addresses the problem of missing object instances by transferring stable predictions between neighboring frames during training. We demonstrate that both approaches together improve the instance segmentation metric on video frames of two datasets: YouTube-VIS and Cityscapes.
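A consistency module of this kind needs to pair instance masks across neighboring frames. Below is a hedged illustration of that matching step using mask IoU and the Hungarian algorithm on toy arrays; it is not the paper's MaskConsist implementation.

```python
# Match predicted instance masks between frame t and frame t+1 by IoU.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def match_masks(masks_t, masks_t1):
    """masks_t, masks_t1: lists of boolean HxW arrays from frames t, t+1."""
    iou = np.array([[mask_iou(a, b) for b in masks_t1] for a in masks_t])
    rows, cols = linear_sum_assignment(-iou)   # maximize total IoU
    return list(zip(rows, cols)), iou

# Toy frames with two instances each, slightly shifted between frames.
m1 = np.zeros((8, 8), bool); m1[1:4, 1:4] = True
m2 = np.zeros((8, 8), bool); m2[5:8, 5:8] = True
n1 = np.zeros((8, 8), bool); n1[2:5, 1:4] = True
n2 = np.zeros((8, 8), bool); n2[5:8, 4:7] = True
print(match_masks([m1, m2], [n1, n2])[0])  # [(0, 0), (1, 1)]
```

Once masks are matched, a stable prediction in one frame can serve as a pseudo-label for its counterpart in the neighboring frame, which is the intuition behind transferring predictions across frames.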
Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency
— AK (@ak92501) March 25, 2021
pdf: https://t.co/05IA9xzsR9
abs: https://t.co/BaNhx6BRtR pic.twitter.com/ZUoXwrWrPm