1. AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing
Katikapalli Subramanyam Kalyan, Ajit Rajasekharan, Sivanesan Sangeetha
Transformer-based pretrained language models (T-PTLMs) have achieved great success in almost every NLP task. The evolution of these models started with GPT and BERT. These models are built on top of transformers, self-supervised learning and transfer learning. T-PTLMs learn universal language representations from large volumes of text data using self-supervised learning and transfer this knowledge to downstream tasks. These models provide good background knowledge to downstream tasks, which avoids training downstream models from scratch. In this comprehensive survey paper, we initially give a brief overview of self-supervised learning. Next, we explain various core concepts like pretraining, pretraining methods, pretraining tasks, embeddings and downstream adaptation methods. Next, we present a new taxonomy of T-PTLMs and then give a brief overview of various benchmarks, both intrinsic and extrinsic. We present a summary of various useful libraries to work with T-PTLMs. Finally, we highlight some of the future research directions which will further improve these models. We strongly believe that this comprehensive survey paper will serve as a good reference to learn the core concepts as well as to stay updated with the recent happenings in T-PTLMs.
🔥 New survey paper summarizing Transformer-based pre-trained models in NLP.
— elvis (@omarsar0) August 13, 2021
Another excellent read to get familiar with state-of-the-art machine learning approaches for NLP.
abs: https://t.co/6gLvPTBnHY
pdf: https://t.co/WDJzhmdkdx pic.twitter.com/YQNL2qE5LZ
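The survey's core workflow (pretrain once, then adapt to a downstream task) is easy to see in code. Below is a minimal, hedged sketch of downstream adaptation using the Hugging Face transformers library; the checkpoint name, toy sentences, and hyperparameters are illustrative placeholders, not anything prescribed by the paper.

```python
# Minimal sketch of downstream adaptation of a pretrained transformer.
# The checkpoint name and toy dataset are illustrative placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # adds a fresh classification head
)

texts = ["the movie was great", "the movie was terrible"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# One fine-tuning step: the pretrained encoder supplies the "background
# knowledge"; only a small labeled dataset is needed downstream.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```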
2. Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision
Xiaoshi Wu, Hadar Averbuch-Elor, Jin Sun, Noah Snavely
The abundance and richness of Internet photos of landmarks and cities has led to significant progress in 3D vision over the past two decades, including automated 3D reconstructions of the world’s landmarks from tourist photos. However, a major source of information available for these 3D-augmented collections---namely language, e.g., from image captions---has been virtually untapped. In this work, we present WikiScenes, a new, large-scale dataset of landmark photo collections that contains descriptive text in the form of captions and hierarchical category names. WikiScenes forms a new testbed for multimodal reasoning involving images, text, and 3D geometry. We demonstrate the utility of WikiScenes for learning semantic concepts over images and 3D models. Our weakly-supervised framework connects images, 3D structure, and semantics---utilizing the strong constraints provided by 3D geometry---to associate semantic concepts to image pixels and 3D points.
Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision
— AK (@ak92501) August 13, 2021
pdf: https://t.co/h6Z30bGMDS
abs: https://t.co/n70eDomcPz
project page: https://t.co/eICW1QO6l3 pic.twitter.com/wChFKxNx0f
3. Mobile-Former: Bridging MobileNet and Transformer
Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, Zicheng Liu
We present Mobile-Former, a parallel design of MobileNet and Transformer with a two-way bridge in between. This structure leverages the advantages of MobileNet for local processing and of the transformer for global interaction, and the bridge enables bidirectional fusion of local and global features. Unlike recent works on vision transformers, the transformer in Mobile-Former contains very few tokens (e.g. fewer than 6 tokens) that are randomly initialized, resulting in low computational cost. Combined with the proposed lightweight cross-attention that models the bridge, Mobile-Former is not only computationally efficient but also has more representation power, outperforming MobileNetV3 in the low-FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, it achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 while saving 17% of computation. When transferring to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP.
Mobile-Former: Bridging MobileNet and Transformer
— AK (@ak92501) August 13, 2021
pdf: https://t.co/Ssr6oFOjy7
abs: https://t.co/lctrhRG2Oq
achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 but saving 17% of computations pic.twitter.com/ChNT9kJtSy
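The key idea in the abstract, a lightweight two-way bridge between a MobileNet-style feature map and a handful of global tokens, can be sketched roughly as below. This is a reading of the description above, not the authors' code: the token count, dimensions, and the use of standard multi-head attention are assumptions.

```python
# Rough sketch of a two-way bridge between a local feature map and a few
# randomly initialized global tokens, in the spirit of Mobile-Former.
import torch
import torch.nn as nn

class TwoWayBridge(nn.Module):
    def __init__(self, dim=192, num_tokens=6, heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(1, num_tokens, dim))  # global tokens
        self.mobile_to_former = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.former_to_mobile = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat):                         # feat: (B, C, H, W) local features
        b, c, h, w = feat.shape
        pixels = feat.flatten(2).transpose(1, 2)     # (B, H*W, C)
        tokens = self.tokens.expand(b, -1, -1)

        # Mobile -> Former: the few tokens query the local feature map
        tokens, _ = self.mobile_to_former(tokens, pixels, pixels)
        # Former -> Mobile: pixels query the (few) global tokens
        pixels, _ = self.former_to_mobile(pixels, tokens, tokens)
        return pixels.transpose(1, 2).reshape(b, c, h, w), tokens

out, tok = TwoWayBridge()(torch.randn(2, 192, 14, 14))
```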
4. PixelSynth: Generating a 3D-Consistent Experience from a Single Image
Chris Rockwell, David F. Fouhey, Justin Johnson
Recent advancements in differentiable rendering and 3D reasoning have driven exciting results in novel view synthesis from a single image. Despite realistic results, methods are limited to relatively small view changes. In order to synthesize immersive scenes, models must also be able to extrapolate. We present an approach that fuses 3D reasoning with autoregressive modeling to outpaint large view changes in a 3D-consistent manner, enabling scene synthesis. We demonstrate considerable improvement in single-image large-angle view synthesis compared to a variety of methods and possible variants across simulated and real datasets. In addition, we show increased 3D consistency compared to alternative accumulation methods. Project website: https://crockwell.github.io/pixelsynth/
PixelSynth: Generating a 3D-Consistent Experience from a Single Image
— AK (@ak92501) August 13, 2021
pdf: https://t.co/QDA8YfeQXJ
abs: https://t.co/6AcIDqkKN9
project page: https://t.co/66l3wtkXxp pic.twitter.com/iyBJGU0BrV
5. Correlate-and-Excite: Real-Time Stereo Matching via Guided Cost Volume Excitation
Antyanta Bangunharcana, Jae Won Cho, Seokju Lee, In So Kweon, Kyung-Soo Kim, Soohyun Kim
Volumetric deep learning approaches to stereo matching aggregate a cost volume computed from the input left and right images using 3D convolutions. Recent works showed that utilizing extracted image features and spatially varying cost volume aggregation complements 3D convolutions. However, existing methods with spatially varying operations are complex, cost considerable computation time, and increase memory consumption. In this work, we construct Guided Cost volume Excitation (GCE) and show that simple channel excitation of the cost volume, guided by the image, can improve performance considerably. Moreover, we propose a novel method of using top-k selection prior to soft-argmin disparity regression for computing the final disparity estimate. Combining our novel contributions, we present an end-to-end network that we call Correlate-and-Excite (CoEx). Extensive experiments on the SceneFlow, KITTI 2012, and KITTI 2015 datasets demonstrate the effectiveness and efficiency of our model and show that it outperforms other speed-based algorithms while also being competitive with other state-of-the-art algorithms. Code will be made available at https://github.com/antabangun/coex.
Correlate-and-Excite: Real-Time Stereo Matching via Guided Cost Volume Excitation
— AK (@ak92501) August 13, 2021
pdf: https://t.co/kiE3czmTgz
abs: https://t.co/deQJDhipFa
github: https://t.co/aYFbSbT4Xu pic.twitter.com/oo1ze5wNEc
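Two of the named components are simple enough to sketch: image-guided channel excitation of the cost volume and top-k selection before soft-argmin disparity regression. Tensor shapes, the sigmoid gating form, and treating the volume as matching scores (higher is better) are assumptions for illustration, not the released implementation.

```python
# Sketch of (1) image-guided channel excitation of a cost volume and
# (2) top-k soft-argmin disparity regression, as named in the abstract.
import torch
import torch.nn.functional as F

def guided_cost_excitation(cost, guide):
    # cost:  (B, C, D, H, W) aggregated cost volume
    # guide: (B, C, H, W) gating weights predicted from left-image features
    return cost * torch.sigmoid(guide).unsqueeze(2)    # broadcast over disparity dim

def topk_soft_argmin(scores, k=2):
    # scores: (B, D, H, W) matching scores per disparity candidate (higher = better)
    topk_val, topk_idx = scores.topk(k, dim=1)          # keep only the k best candidates
    prob = F.softmax(topk_val, dim=1)                   # normalize over those k candidates
    return (prob * topk_idx.float()).sum(dim=1)         # expected disparity over the top-k

scores = torch.randn(1, 48, 64, 128)                    # 48 disparity hypotheses
disparity = topk_soft_argmin(scores, k=2)               # (1, 64, 128) disparity map
```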
6. Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations
Josh Beal, Hao-Yu Wu, Dong Huk Park, Andrew Zhai, Dmitry Kislyuk
Large-scale pretraining of visual representations has led to state-of-the-art performance on a range of benchmark computer vision tasks, yet the benefits of these techniques at extreme scale in complex production systems have been relatively unexplored. We consider the case of a popular visual discovery product, where these representations are trained with multi-task learning, from use-case specific visual understanding (e.g. skin tone classification) to general representation learning for all visual content (e.g. embeddings for retrieval). In this work, we describe how we (1) generate a dataset with over a billion images via large-scale weakly-supervised pretraining to improve the performance of these visual representations, and (2) leverage Transformers to replace the traditional convolutional backbone, with insights into both system and performance improvements, especially at 1B+ image scale. To support this backbone model, we detail a systematic approach to deriving weakly-supervised image annotations from heterogeneous text signals, demonstrating the benefits of clustering techniques to handle the long-tail distribution of image labels. Through a comprehensive study of offline and online evaluation, we show that large-scale Transformer-based pretraining provides significant benefits to industry computer vision applications. The model is deployed in a production visual shopping system, with a 36% improvement in top-1 relevance and a 23% improvement in click-through volume. We conduct extensive experiments to better understand the empirical relationships between Transformer-based architectures, dataset scale, and the performance of production vision systems.
Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations
— AK (@ak92501) August 13, 2021
pdf: https://t.co/ZPTagL3LzO
abs: https://t.co/TfhdXimw4s
a scalable approach for pretraining with over a billion images in order to improve a production Unified Visual Embedding model pic.twitter.com/bFmlbpD01e
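The multi-task setup described above (one shared Transformer backbone feeding both use-case specific heads and a general embedding head) might look roughly like the sketch below. The torchvision ViT backbone, head names, and sizes are assumptions; the production Unified Visual Embedding model is certainly more involved.

```python
# Minimal sketch of a shared Transformer backbone with several task heads.
# Backbone choice and head dimensions are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

class MultiTaskVisualModel(nn.Module):
    def __init__(self, embed_dim=768, num_weak_labels=10000):
        super().__init__()
        backbone = vit_b_16()
        backbone.heads = nn.Identity()                 # strip the ImageNet head
        self.backbone = backbone
        # one head for weakly-supervised classification, one for retrieval embeddings
        self.weak_label_head = nn.Linear(embed_dim, num_weak_labels)
        self.embedding_head = nn.Linear(embed_dim, 256)

    def forward(self, images):
        feats = self.backbone(images)                  # (B, 768) pooled features
        return self.weak_label_head(feats), self.embedding_head(feats)

model = MultiTaskVisualModel()
logits, embeddings = model(torch.randn(2, 3, 224, 224))
```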
7. The paradox of the compositionality of natural language: a neural machine translation case study
Verna Dankers, Elia Bruni, Dieuwke Hupkes
Moving towards human-like linguistic performance is often argued to require compositional generalisation. Whether neural networks exhibit this ability is typically studied using artificial languages, for which the compositionality of input fragments can be guaranteed and their meanings algebraically composed. However, compositionality in natural language is vastly more complex than this rigid, arithmetics-like version of compositionality, and as such artificial compositionality tests do not allow us to draw conclusions about how neural models deal with compositionality in more realistic scenarios. In this work, we re-instantiate three compositionality tests from the literature and reformulate them for neural machine translation (NMT). The results highlight two main issues: the inconsistent behaviour of NMT models and their inability to (correctly) modulate between local and global processing. Aside from an empirical study, our work is a call to action: we should rethink the evaluation of compositionality in neural models of natural language, where composing meaning is not as straightforward as doing the math.
Compositionality in NNs is usually tested using artificial data. But in natural language, it is not that simple! How can we test for compositionality in the wild? In our paper, we evaluate NMT models trained on real, unfiltered natural language. (1/11) https://t.co/B1gyfuqTKL pic.twitter.com/HTQ5nzgsWQ
— Dieuwke Hupkes (@_dieuwke_) August 13, 2021
8. Logit Attenuating Weight Normalization
Aman Gupta, Rohan Ramanath, Jun Shi, Anika Ramachandran, Sirou Zhou, Mingzhou Zhou, S. Sathiya Keerthi
Over-parameterized deep networks trained using gradient-based optimizers are a popular choice for solving classification and ranking problems. Without appropriately tuned regularization or weight decay, such networks have a tendency to make the output scores (logits) and network weights large, causing the training loss to become too small and the network to lose its adaptivity (ability to move around) in the parameter space. Although regularization is typically understood from an overfitting perspective, we highlight its role in making the network more adaptive and enabling it to escape more easily from weights that generalize poorly. To provide such a capability, we propose a method called Logit Attenuating Weight Normalization (LAWN), which can be stacked onto any gradient-based optimizer. LAWN controls the logits by constraining the weight norms of layers in the final homogeneous sub-network. Empirically, we show that the resulting LAWN variant of an optimizer makes a deep network more adaptive, finding minima with superior generalization performance on large-scale image classification and recommender systems. While LAWN is particularly impressive in improving Adam, it greatly improves all optimizers when used with large batch sizes.
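Reading the abstract, the mechanism is a norm constraint on the weights of the final layers that can be stacked onto any optimizer. The sketch below captures that idea under stated assumptions (a single constrained layer, rescaling after every step); it is not the authors' exact algorithm.

```python
# Hedged sketch of a LAWN-style constraint: after each optimizer step,
# rescale the final layer's weights back to a recorded norm so that the
# logits cannot grow without bound. The choice of constrained layer and
# the always-on schedule are illustrative assumptions.
import torch
import torch.nn.functional as F

def constrain_weight_norms(params, target_norms):
    with torch.no_grad():
        for p, target in zip(params, target_norms):
            p.mul_(target / (p.norm() + 1e-12))   # project back to the target norm

model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
final_layer_params = list(model[-1].parameters())        # "final homogeneous sub-network"
target_norms = [p.detach().norm() for p in final_layer_params]

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
for step in range(100):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    constrain_weight_norms(final_layer_params, target_norms)  # LAWN-style norm constraint
```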
9. An Early Look at the Gettr Social Network
Pujan Paudel, Jeremy Blackburn, Emiliano De Cristofaro, Savvas Zannettou, Gianluca Stringhini
This paper presents the first data-driven analysis of Gettr, a new social network platform launched by former US President Donald Trump's team. Among other things, we find that users on the platform heavily discuss politics, with a focus on the Trump campaign in the US and Bolsonaro's in Brazil. Activity on the platform has been steadily decreasing since its launch, although a core of verified users and early adopters kept posting and became central to it. Finally, although toxicity has been increasing over time, the average level of toxicity is still lower than that recently observed on other fringe social networks like Gab and 4chan. Overall, we provide a first quantitative look at this new community, observing a lack of organic engagement and activity.
New preprint: An Early Look at the Gettr Social Network https://t.co/CMxSim5iA4 pic.twitter.com/jEktxIqLno
— Emiliano DC 💉💉🇪🇺 (@emilianoucl) August 13, 2021
10. Bridger: Toward Bursting Scientific Filter Bubbles and Boosting Innovation via Novel Author Discovery
Jason Portenoy, Marissa Radensky, Jevin West, Eric Horvitz, Daniel Weld, Tom Hope
Scientific silos can hinder innovation. These information “filter bubbles” and the growing challenge of information overload limit awareness across the literature, making it difficult to keep track of even narrow areas of interest, let alone discover new ones. Algorithmic curation and recommendation, which often prioritize relevance, can further reinforce these bubbles. In response, we describe Bridger, a system for facilitating discovery of scholars and their work, to explore design tradeoffs among relevant and novel recommendations. We construct a faceted representation of authors using information extracted from their papers and inferred personas. We explore approaches both for recommending new content and for displaying it in a manner that helps researchers to understand the work of authors with whom they are unfamiliar. In studies with computer science researchers, our approach substantially improves users’ abilities to do so. We develop an approach that locates commonalities and contrasts between scientists---retrieving partially similar authors, rather than aiming for strict similarity. We find this approach helps users discover authors useful for generating novel research ideas of relevance to their work, at a higher rate than a state-of-the-art neural model. Our analysis reveals that Bridger connects authors who have different citation profiles, publish in different venues, and are more distant in social co-authorship networks, raising the prospect of bridging diverse communities and facilitating discovery.
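The notion of "partially similar" authors (similar on one facet, contrasting on another) can be illustrated with a toy retrieval rule over faceted embeddings. The facet names, vectors, and scoring rule below are illustrative assumptions, not Bridger's actual pipeline.

```python
# Toy sketch of partial-similarity retrieval over faceted author vectors:
# prefer authors who match on one facet (e.g. methods) but differ on
# another (e.g. tasks). Facets and the scoring rule are assumptions.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def partially_similar(query, candidates, match_facet="methods", contrast_facet="tasks"):
    # Higher score = shares the matched facet while differing on the contrast facet.
    scores = [
        cosine(query[match_facet], c[match_facet]) - cosine(query[contrast_facet], c[contrast_facet])
        for c in candidates
    ]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
query = {"methods": rng.normal(size=64), "tasks": rng.normal(size=64)}
candidates = [{"methods": rng.normal(size=64), "tasks": rng.normal(size=64)} for _ in range(50)]
best = partially_similar(query, candidates)   # index of the recommended author
```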
11. PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based Memory Management
Jiarui Fang, Yang Yu, Shenggui Li, Yang You, Jie Zhou
The pre-trained model (PTM) is revolutionizing Artificial Intelligence (AI) technology. A PTM learns general language features from vast amounts of text and is then fine-tuned on a task-specific dataset. Unfortunately, PTM training requires prohibitively expensive computing devices, especially for fine-tuning, and is still a game for only a small proportion of people in the AI community. By enabling PTM training on low-end devices, PatrickStar makes PTMs accessible to everyone. PatrickStar reduces the memory requirements of the computing platform by using the CPU-GPU heterogeneous memory space to store model data, consisting of parameters, gradients, and optimizer states. We observe that the GPU memory available for model data changes regularly, in a tide-like pattern, decreasing and increasing iteratively. However, existing heterogeneous training works do not take advantage of this pattern. Instead, they statically partition the model data between CPU and GPU, leading to both memory waste and memory abuse. In contrast, PatrickStar manages model data in chunks, which are dynamically distributed across heterogeneous memory spaces. Chunks consist of stateful tensors which run as finite state machines during training. Guided by runtime memory statistics collected in a warm-up iteration, chunks are orchestrated efficiently in heterogeneous memory, generating lower CPU-GPU data transmission volume. In symbiosis with the Zero Redundancy Optimizer, PatrickStar scales to multiple GPUs using data parallelism, with the lowest communication bandwidth requirements and more efficient bandwidth utilization. Experimental results show that PatrickStar trains a 12-billion-parameter GPT model, 2x larger than the state-of-the-art, on a node with 8 V100 GPUs and 240GB of CPU memory, and is also more efficient at the same model size.
PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based Memory Management
— AK (@ak92501) August 13, 2021
pdf: https://t.co/MPrgUvEt2z
abs: https://t.co/YC9yCNi3Cu
trains a 12B parameter GPT model on an 8-V100 and 240GB CPU memory node, and is also more efficient on the same model size pic.twitter.com/mG1nZJT7wg
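A toy sketch of the chunk idea: model data is grouped into fixed-size chunks whose placement between CPU and GPU follows the currently available GPU memory, with each chunk tracked by a simple state label. The state names, chunk size, and placement rule are assumptions for illustration, not PatrickStar's implementation.

```python
# Toy sketch of chunk-based CPU-GPU placement of model data. Moving a chunk
# to GPU requires a CUDA device; state names and sizes are illustrative.
import torch

class Chunk:
    def __init__(self, numel, dtype=torch.float32):
        self.payload = torch.empty(numel, dtype=dtype, device="cpu")
        self.state = "HOLD"                          # e.g. HOLD / COMPUTE

    def to_gpu(self):
        self.payload = self.payload.cuda(non_blocking=True)
        self.state = "COMPUTE"

    def to_cpu(self):
        self.payload = self.payload.cpu()
        self.state = "HOLD"

class ChunkManager:
    def __init__(self, chunk_numel=64 * 1024 * 1024):
        self.chunk_numel = chunk_numel
        self.chunks = []

    def allocate(self, total_numel):
        # Partition model data (params, grads, optimizer states) into chunks.
        for _ in range(0, total_numel, self.chunk_numel):
            self.chunks.append(Chunk(self.chunk_numel))

    def prepare_for_compute(self, idx, gpu_budget_bytes):
        # Move the needed chunk to GPU only if it fits in the current budget,
        # mimicking the tide-like availability of GPU memory during training.
        chunk = self.chunks[idx]
        if chunk.payload.element_size() * chunk.payload.numel() <= gpu_budget_bytes:
            chunk.to_gpu()
        return chunk
```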