1. Go Wider Instead of Deeper
Fuzhao Xue, Ziji Shi, Yuxuan Lou, Yong Liu, Yang You
The transformer has recently achieved impressive results on various tasks. To further improve its effectiveness and efficiency, existing works follow two lines of thought: (1) going wider by scaling to more trainable parameters; (2) going shallower by parameter sharing or model compression along the depth. However, larger models usually do not scale well when fewer training tokens are available, and advanced parallelism is required when the model is extremely large. Smaller models usually achieve inferior performance compared to the original transformer due to the loss of representation power. In this paper, to achieve better performance with fewer trainable parameters, we propose a framework to deploy trainable parameters efficiently by going wider instead of deeper. Specifically, we scale along the model width by replacing the feed-forward network (FFN) with a mixture-of-experts (MoE) layer. We then share the MoE layers across transformer blocks while using individual layer normalizations. Such deployment transforms various semantic representations, which makes the model more parameter-efficient and effective. To evaluate our framework, we design WideNet and evaluate it on ImageNet-1K. Our best model outperforms Vision Transformer (ViT) by 1.46% with 0.72x the trainable parameters, and with even fewer parameters WideNet still surpasses both ViT and ViT-MoE.
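The core architectural idea, sharing one MoE feed-forward layer across transformer blocks while keeping per-block layer normalization, is easy to sketch. Below is a minimal, hypothetical PyTorch sketch: the module names, the top-1 gating rule, and the number of experts are our assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy mixture-of-experts FFN: each token is routed to its top-1 expert."""
    def __init__(self, dim, num_experts=4, hidden_mult=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_mult * dim), nn.GELU(),
                          nn.Linear(hidden_mult * dim, dim))
            for _ in range(num_experts)])

    def forward(self, x):                        # x: (tokens, dim)
        expert_idx = self.gate(x).argmax(-1)     # top-1 routing; no load balancing in this toy
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out

class WideNetStyleEncoder(nn.Module):
    """One attention layer and one MoE FFN shared by all `depth` blocks,
    but every block keeps its own pair of LayerNorms."""
    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        self.shared_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.shared_moe = TinyMoE(dim)
        self.norms1 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])
        self.norms2 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])

    def forward(self, x):                        # x: (batch, tokens, dim)
        for ln1, ln2 in zip(self.norms1, self.norms2):
            h = ln1(x)
            x = x + self.shared_attn(h, h, h, need_weights=False)[0]
            h = ln2(x)
            x = x + self.shared_moe(h.reshape(-1, h.shape[-1])).reshape_as(h)
        return x
```

Sharing the attention and MoE weights keeps the parameter count close to a single block, while the per-block LayerNorms let each "depth step" transform the shared layers' inputs differently.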
Go Wider Instead of Deeper
— AK (@ak92501) July 27, 2021
pdf: https://t.co/2OywbopfJU
abs: https://t.co/cYt5hfoRaR
best model outperforms Vision Transformer (ViT) by 1.46% with 0.72× trainable parameters pic.twitter.com/yqro9GtDB3
Eventually, go wider, or go home! https://t.co/NWHMLxp3sY https://t.co/Q2iOHMJG5J https://t.co/ZYYvlgPaaD https://t.co/fykS3342A8
— Hamid (@heghbalz) July 27, 2021
2. Towards Generative Video Compression
Fabian Mentzer, Eirikur Agustsson, Johannes Ballé, David Minnen, Nick Johnston, George Toderici
We present a neural video compression method based on generative adversarial networks (GANs) that outperforms previous neural video compression methods and is comparable to HEVC in a user study. To mitigate temporal error accumulation caused by recursive frame compression, we propose a technique based on randomized shifting and un-shifting, motivated by a spectral analysis. We detail the network design choices and their relative importance, and elaborate on the challenges of evaluating video compression methods in user studies.
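The randomized shifting idea can be illustrated with a tiny sketch: apply a random spatial shift to each frame before it is compressed, then undo it after reconstruction, so artifacts from recursive compression do not stay aligned across frames. This is only a minimal illustration under our own assumptions; the function names and the circular `torch.roll` shift are ours, and the paper's exact shifting scheme may differ.

```python
import torch

def compress_with_random_shift(frame, codec, max_shift=8):
    """frame: (C, H, W) tensor; codec: any callable that compresses and
    reconstructs a frame. Randomly shift, code, then un-shift."""
    dy = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    dx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    shifted = torch.roll(frame, shifts=(dy, dx), dims=(1, 2))          # random shift
    reconstructed = codec(shifted)                                     # recursive coding step
    return torch.roll(reconstructed, shifts=(-dy, -dx), dims=(1, 2))   # undo the shift

# usage with an identity "codec" just to show the plumbing
frame = torch.rand(3, 64, 64)
out = compress_with_random_shift(frame, codec=lambda f: f)
assert torch.allclose(out, frame)
```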
📢📢📢 New paper: "Towards Generative Video Compression". We present a GAN-based neural video compression system that is comparable to HEVC visually, and outperforms previous work that does not use GANs. Check it out on arxiv: https://t.co/0yBWEmQ3J9 pic.twitter.com/CDXq2cLICI
— Fabian Mentzer (@mentzer_f) July 27, 2021
Towards Generative Video Compression
— AK (@ak92501) July 27, 2021
pdf: https://t.co/ZNguNuZ2QW
abs: https://t.co/uUqoOt5Vqf
a neural video compression method based on GANs that outperforms previous neural video compression methods and is comparable to HEVC in a user study pic.twitter.com/rYFKPbUn8J
3. Contextual Transformer Networks for Visual Recognition
Yehao Li, Ting Yao, Yingwei Pan, Tao Mei
The transformer with self-attention has revolutionized the field of natural language processing and has recently inspired Transformer-style architecture designs with competitive results in numerous computer vision tasks. Nevertheless, most existing designs directly employ self-attention over a 2D feature map to obtain the attention matrix from pairs of isolated queries and keys at each spatial location, leaving the rich context among neighboring keys under-exploited. In this work, we design a novel Transformer-style module, the Contextual Transformer (CoT) block, for visual recognition. This design fully capitalizes on the contextual information among input keys to guide the learning of a dynamic attention matrix and thus strengthens the capacity of the visual representation. Technically, the CoT block first contextually encodes the input keys via a 3x3 convolution, leading to a static contextual representation of the inputs. We then concatenate the encoded keys with the input queries to learn a dynamic multi-head attention matrix through two consecutive 1x1 convolutions. The learned attention matrix is multiplied by the input values to produce a dynamic contextual representation of the inputs. The fusion of the static and dynamic contextual representations is finally taken as the output. Our CoT block is appealing in that it can readily replace each 3x3 convolution in ResNet architectures, yielding a Transformer-style backbone named Contextual Transformer Networks (CoTNet). Through extensive experiments over a wide range of applications (e.g., image recognition, object detection and instance segmentation), we validate the superiority of CoTNet as a stronger backbone. Source code is available at https://github.com/JDAI-CV/CoTNet.
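Read as an architecture recipe, the block's data flow can be sketched in a few lines of PyTorch. The sketch below is a simplification under our assumptions: it keeps the 3x3 key encoding, the concatenation with queries, and the two 1x1 convolutions, but replaces the full local multi-head attention with simple element-wise gating, so it illustrates the data flow rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CoTSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # static context: 3x3 convolution over the keys (the input itself)
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=4, bias=False),
            nn.BatchNorm2d(dim), nn.ReLU(inplace=True))
        self.value_embed = nn.Conv2d(dim, dim, 1, bias=False)
        # two consecutive 1x1 convolutions on [encoded keys, queries]
        self.attn = nn.Sequential(
            nn.Conv2d(2 * dim, dim // 4, 1), nn.BatchNorm2d(dim // 4),
            nn.ReLU(inplace=True), nn.Conv2d(dim // 4, dim, 1))

    def forward(self, x):                      # x: (B, dim, H, W); queries = x
        k_static = self.key_embed(x)           # static contextual representation
        v = self.value_embed(x)
        attn = self.attn(torch.cat([k_static, x], dim=1))
        k_dynamic = attn.sigmoid() * v         # simplified gating in place of local multi-head attention
        return k_static + k_dynamic            # fuse static and dynamic context

x = torch.rand(2, 64, 14, 14)
print(CoTSketch(64)(x).shape)                  # torch.Size([2, 64, 14, 14])
```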
Contextual Transformer Networks for Visual Recognition
— AK (@ak92501) July 27, 2021
pdf: https://t.co/tH0VpuPWWw
abs: https://t.co/2x6X0vumBy
github: https://t.co/T9KGgm66J9
exploits the contextual information among input keys to guide self-attention learning pic.twitter.com/0UKgfx3syY
4. H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences
Zhenhai Zhu, Radu Soricut
We describe an efficient hierarchical method to compute attention in the Transformer architecture. The proposed attention mechanism exploits a matrix structure similar to the Hierarchical Matrix (H-Matrix) developed by the numerical analysis community, and has linear run time and memory complexity. We perform extensive experiments to show that the inductive bias embodied by our hierarchical attention is effective in capturing the hierarchical structure typical of natural language and vision sequences. Our method is superior to alternative sub-quadratic proposals by over +6 points on average on the Long Range Arena benchmark. It also sets a new SOTA test perplexity on the One-Billion Word dataset with 5x fewer model parameters than the previous-best Transformer-based models.
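The flavor of the approach (exact attention for nearby tokens, coarsened attention for distant ones) can be shown with a much-simplified two-level sketch. The code below is our own illustration of that idea; the block size, averaging-based pooling, and the way local and coarse scores are merged are all our assumptions, and the paper's H-Matrix construction is considerably more elaborate.

```python
import torch
import torch.nn.functional as F

def two_level_attention(q, k, v, block=64):
    """q, k, v: (seq, dim) with seq divisible by `block`.
    Exact attention inside each block, averaged-key attention across blocks."""
    seq, dim = q.shape
    nb = seq // block
    qb = q.view(nb, block, dim)
    kb = k.view(nb, block, dim)
    vb = v.view(nb, block, dim)

    # fine level: exact attention within each block, cost O(seq * block)
    local = torch.einsum('nqd,nkd->nqk', qb, kb) / dim ** 0.5          # (nb, block, block)
    # coarse level: each query also attends to one averaged key per block
    # (for simplicity this includes the query's own block a second time)
    k_coarse = kb.mean(dim=1)                                          # (nb, dim)
    v_coarse = vb.mean(dim=1)                                          # (nb, dim)
    coarse = torch.einsum('nqd,md->nqm', qb, k_coarse) / dim ** 0.5    # (nb, block, nb)

    scores = torch.cat([local, coarse], dim=-1)                        # (nb, block, block + nb)
    probs = F.softmax(scores, dim=-1)
    out = torch.einsum('nqk,nkd->nqd', probs[..., :block], vb) \
        + torch.einsum('nqm,md->nqd', probs[..., block:], v_coarse)
    return out.reshape(seq, dim)

x = torch.rand(512, 32)
print(two_level_attention(x, x, x).shape)   # torch.Size([512, 32])
```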
H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences
— Aran Komatsuzaki (@arankomatsuzaki) July 27, 2021
Gains +6 points on average on the Long Range Arena benchmark over the subquadratic alternatives. Sets a new SOTA ppl on the One-Billion Word dataset with 5x fewer model parameters. https://t.co/bP66VDZ8de pic.twitter.com/sxNxDmZ9bE
H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences
— AK (@ak92501) July 27, 2021
pdf: https://t.co/zAoEjISOph
abs: https://t.co/AwzWbH2Te4
SOTA test perplexity on One-Billion Word dataset with 5x fewer model parameters than that of the previous-best Transformer-based models pic.twitter.com/zgKYDp4qA4
5. A brief note on understanding neural networks as Gaussian processes
Mengwu Guo
As a generalization of the work in [Lee et al., 2017], this note briefly discusses when the prior of a neural network output follows a Gaussian process, and how a neural-network-induced Gaussian process is formulated. The posterior mean functions of such a Gaussian process regression lie in the reproducing kernel Hilbert space defined by the neural-network-induced kernel. In the case of two-layer neural networks, the induced Gaussian processes provide an interpretation of the reproducing kernel Hilbert spaces whose union forms a Barron space.
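A concrete way to see the Gaussian-process limit is to compare the empirical output covariance of many wide two-layer ReLU networks against the known arc-cosine kernel. The snippet below is a self-contained NumPy check under standard NNGP assumptions (i.i.d. standard Gaussian weights, no biases, 1/sqrt(width) output scaling); it illustrates the general statement rather than the note's specific construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, n_nets = 5, 4096, 2000

x1 = rng.standard_normal(d)
x2 = rng.standard_normal(d)

def relu(z):
    return np.maximum(z, 0.0)

# sample many independent wide two-layer networks f(x) = v . relu(W x) / sqrt(width)
f1, f2 = [], []
for _ in range(n_nets):
    W = rng.standard_normal((width, d))
    v = rng.standard_normal(width)
    f1.append(v @ relu(W @ x1) / np.sqrt(width))
    f2.append(v @ relu(W @ x2) / np.sqrt(width))
emp_cov = np.cov(f1, f2)[0, 1]

# analytic NNGP kernel for ReLU (first-order arc-cosine kernel)
n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
theta = np.arccos(np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0))
nngp = n1 * n2 * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

print(f"empirical cov {emp_cov:.3f} vs arc-cosine kernel {nngp:.3f}")
```

As the width grows, the empirical covariance across random initializations converges to the kernel value, which is the sense in which the network prior "is" a Gaussian process.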
A brief note on understanding neural networks as Gaussian processes. (arXiv:2107.11892v1 [cs.LG]) https://t.co/8s8SzRezmW
— Stat.ML Papers (@StatMLPapers) July 27, 2021
6. A Realistic Simulation Framework for Learning with Label Noise
Keren Gu, Xander Masotto, Vandana Bachani, Balaji Lakshminarayanan, Jack Nikodem, Dong Yin
We propose a simulation framework for generating realistic instance-dependent noisy labels via a pseudo-labeling paradigm. We show, through comparison with the CIFAR10-H dataset, that this framework generates synthetic noisy labels that exhibit important characteristics of label noise in practical settings. Equipped with controllable label noise, we study the negative impact of noisy labels across several realistic settings to understand when label noise is more problematic. We also benchmark several existing algorithms for learning with noisy labels and compare their behavior on our synthetic datasets and on datasets with independent random label noise. Additionally, with the annotator information available from our simulation framework, we propose a new technique, the Label Quality Model (LQM), which leverages annotator features to predict and correct noisy labels. We show that adding LQM as a label-correction step before applying existing noisy-label techniques further improves model performance.
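The pseudo-labeling idea for generating instance-dependent noise can be sketched briefly: an auxiliary "annotator" model produces a softmax over classes for each example, and the noisy label is sampled from a temperature-controlled version of that distribution, so confusable examples get noisier labels. The code below is our own minimal illustration; the auxiliary model, the temperature knob, and the mixing rule are assumptions, not the framework's exact procedure.

```python
import numpy as np

def pseudo_label_noise(probs, clean_labels, temperature=2.0, noise_rate=0.3, seed=0):
    """probs: (n, classes) softmax outputs of an auxiliary 'annotator' model.
    With probability `noise_rate`, replace the clean label by a sample from the
    temperature-adjusted annotator distribution; otherwise keep the clean label."""
    rng = np.random.default_rng(seed)
    logits = np.log(probs + 1e-12) / temperature
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)

    noisy = clean_labels.copy()
    flip = rng.random(len(clean_labels)) < noise_rate
    for i in np.where(flip)[0]:
        noisy[i] = rng.choice(len(p[i]), p=p[i])   # instance-dependent: follows the model's confusion
    return noisy

# toy usage: 4 examples, 3 classes
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.50, 0.10],
                  [0.20, 0.20, 0.60],
                  [0.34, 0.33, 0.33]])
clean = np.array([0, 0, 2, 1])
print(pseudo_label_noise(probs, clean))
```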
A Realistic Simulation Framework for Learning with Label Noise
— AK (@ak92501) July 27, 2021
pdf: https://t.co/fn4Pj58uSp
abs: https://t.co/0Q4TxBXrt2
github: https://t.co/mAxDa4BmjB
a simulation framework for generating realistic instance-dependent noisy labels via a pseudolabeling paradigm pic.twitter.com/0TVA2fCbJm
7. The Impact of Negative Sampling on Contrastive Structured World Models
Ondrej Biza, Elise van der Pol, Thomas Kipf
World models trained by contrastive learning are a compelling alternative to autoencoder-based world models, which learn by reconstructing pixel states. In this paper, we describe three cases where small changes in how we sample negative states in the contrastive loss lead to drastic changes in model performance. On previously studied Atari datasets, we show that leveraging time-step correlations can double the performance of the Contrastive Structured World Model. We also collect a full version of the datasets to study contrastive learning under a more diverse set of experiences.
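How negatives are drawn is easy to make concrete. The sketch below contrasts two strategies for a hinge-style contrastive loss of the kind used by C-SWM-like models: negatives sampled uniformly from the whole batch versus negatives sampled from the same episode, which exposes the loss to temporally correlated states. Everything here (state shapes, the margin, the episode layout) is our own toy setup, not the paper's experimental code.

```python
import torch

def hinge_contrastive_loss(pred_next, true_next, negatives, margin=1.0):
    """pred_next, true_next, negatives: (batch, dim). Pull predictions toward the
    true next state, push them at least `margin` away from the negative state."""
    pos = ((pred_next - true_next) ** 2).mean(dim=1)
    neg = ((pred_next - negatives) ** 2).mean(dim=1)
    return (pos + torch.relu(margin - neg)).mean()

def random_negatives(states):
    """Uniform negatives: shuffle states across the whole batch."""
    return states[torch.randperm(len(states))]

def same_episode_negatives(states, episode_ids):
    """Time-correlated negatives: for each sample, pick another state from the same episode."""
    neg = torch.empty_like(states)
    for i, ep in enumerate(episode_ids):
        candidates = (episode_ids == ep).nonzero(as_tuple=True)[0]
        candidates = candidates[candidates != i]
        j = candidates[torch.randint(len(candidates), (1,)).item()]
        neg[i] = states[j]
    return neg

# toy usage: 8 states from 2 episodes
states = torch.rand(8, 16)
pred = torch.rand(8, 16)
episode_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
print(hinge_contrastive_loss(pred, states, random_negatives(states)))
print(hinge_contrastive_loss(pred, states, same_episode_negatives(states, episode_ids)))
```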
The Impact of Negative Sampling on Contrastive Structured World Models
— AK (@ak92501) July 27, 2021
pdf: https://t.co/C5UnwBKYvQ
abs: https://t.co/SsJYD5zGy7
github: https://t.co/6rgbTT4PJ1 pic.twitter.com/KWKoNKG7AT
8. Transcript to Video: Efficient Clip Sequencing from Texts
Yu Xiong, Fabian Caba Heilbron, Dahua Lin
Among the numerous videos shared on the web, well-edited ones always attract more attention. However, it is difficult for inexperienced users to make well-edited videos because doing so requires professional expertise and immense manual labor. To meet the demands of non-experts, we present Transcript-to-Video, a weakly-supervised framework that uses texts as input to automatically create video sequences from an extensive collection of shots. Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to learn visual-language representations and to model shot sequencing styles, respectively. For fast inference, we introduce an efficient search strategy for real-time video clip sequencing. Quantitative results and user studies demonstrate empirically that the proposed learning framework can retrieve content-relevant shots while creating plausible video sequences in terms of style. In addition, the run-time performance analysis shows that our framework can support real-world applications.
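The sequencing step can be framed as a simple search problem: for each sentence of the transcript, pick the shot that maximizes a weighted sum of text-shot relevance and a temporal-coherence score with the previously chosen shot. The greedy sketch below is our own toy formulation; the embedding functions, score weights, and greedy strategy are assumptions, while the paper uses learned retrieval and coherence modules and a more efficient search.

```python
import numpy as np

def sequence_clips(sentence_embs, shot_embs, shot_style, relevance_w=1.0, coherence_w=0.5):
    """sentence_embs: (S, d) transcript sentence embeddings.
    shot_embs: (N, d) shot content embeddings; shot_style: (N, k) shot style features.
    Greedily pick one shot per sentence, trading off relevance and transition coherence."""
    chosen = []
    prev_style = None
    for s in sentence_embs:
        relevance = shot_embs @ s                                   # content match with the sentence
        coherence = np.zeros(len(shot_embs)) if prev_style is None \
            else shot_style @ prev_style                            # smooth transition from the previous shot
        score = relevance_w * relevance + coherence_w * coherence
        if chosen:
            score[chosen] = -np.inf                                 # do not reuse a shot
        best = int(np.argmax(score))
        chosen.append(best)
        prev_style = shot_style[best]
    return chosen

# toy usage: 3 sentences, 10 candidate shots
rng = np.random.default_rng(0)
print(sequence_clips(rng.standard_normal((3, 8)), rng.standard_normal((10, 8)),
                     rng.standard_normal((10, 4))))
```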
Transcript to Video: Efficient Clip Sequencing from Texts
— AK (@ak92501) July 27, 2021
pdf: https://t.co/sLEnh8zIZl
abs: https://t.co/pI1bWVelEg
project page: https://t.co/lhtepIOlnV pic.twitter.com/RNJ352eRPg
9. NeLF: Neural Light-transport Field for Portrait View Synthesis and Relighting
Tiancheng Sun, Kai-En Lin, Sai Bi, Zexiang Xu, Ravi Ramamoorthi
Human portraits exhibit various appearances when observed from different views under different lighting conditions. We can easily imagine what a face will look like in another setup, but computer algorithms still fail on this problem given limited observations. To this end, we present a system for portrait view synthesis and relighting: given multiple portraits, we use a neural network to predict the light-transport field in 3D space, and from the predicted Neural Light-transport Field (NeLF) we produce a portrait from a new camera view under new environmental lighting. Our system is trained on a large number of synthetic models and can generalize to different synthetic and real portraits under various lighting conditions. Our method achieves simultaneous view synthesis and relighting given multi-view portraits as input, and achieves state-of-the-art results.
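At a high level, a light-transport field is a function from a 3D location (plus viewing direction) to transport coefficients that, contracted with an environment-light representation, give the outgoing radiance. The MLP below is a minimal sketch of that interface under our own assumptions (positional encoding omitted, a small spherical-harmonic-sized lighting vector, hypothetical layer sizes); it only illustrates the predict-then-relight structure, not the paper's network.

```python
import torch
import torch.nn as nn

class TransportFieldSketch(nn.Module):
    """Maps (3D point, view direction) -> per-channel light-transport coefficients."""
    def __init__(self, n_light_coeffs=9, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 * n_light_coeffs))   # RGB x lighting coefficients
        self.n = n_light_coeffs

    def forward(self, points, view_dirs, env_light):
        # points, view_dirs: (N, 3); env_light: (n_light_coeffs,), e.g. SH coefficients
        transport = self.mlp(torch.cat([points, view_dirs], dim=-1))   # (N, 3 * n)
        transport = transport.view(-1, 3, self.n)
        return transport @ env_light                                   # relit RGB radiance, (N, 3)

model = TransportFieldSketch()
rgb = model(torch.rand(4, 3), torch.rand(4, 3), torch.rand(9))
print(rgb.shape)   # torch.Size([4, 3])
```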
NeLF: Neural Light-transport Field for Portrait View Synthesis and Relighting
— AK (@ak92501) July 27, 2021
pdf: https://t.co/hqHCdhOAM1
abs: https://t.co/1ety67LGu8
project page: https://t.co/5pF2u0tGDX pic.twitter.com/3QgEg7KUix