1. How to represent part-whole hierarchies in a neural network
Geoffrey Hinton
This paper does not describe a working system. Instead, it presents a single idea about representation which allows advances made by several different groups to be combined into an imaginary system called GLOM. The advances include transformers, neural fields, contrastive representation learning, distillation and capsules. GLOM answers the question: How can a neural network with a fixed architecture parse an image into a part-whole hierarchy which has a different structure for each image? The idea is simply to use islands of identical vectors to represent the nodes in the parse tree. If GLOM can be made to work, it should significantly improve the interpretability of the representations produced by transformer-like systems when applied to vision or language.
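The central representational idea, reading parse-tree nodes off as "islands" of locations whose embedding vectors have become nearly identical, can be illustrated with a toy sketch. The thresholded grouping below is only an illustration of how islands could be extracted from per-location vectors; GLOM is explicitly an imaginary system, so nothing here reflects an actual implementation.

```python
import numpy as np

def islands_of_agreement(vectors, tol=1e-3):
    """Group consecutive locations whose embedding vectors (rows) nearly
    coincide; each resulting island would stand for one parse-tree node."""
    islands, current = [], [0]
    for i in range(1, len(vectors)):
        if np.linalg.norm(vectors[i] - vectors[i - 1]) < tol:
            current.append(i)          # same vector, so same node
        else:
            islands.append(current)    # vector changed, a new node starts
            current = [i]
    islands.append(current)
    return islands

# Toy "image row": locations 0-2 share one part vector, 3-5 another.
part_a, part_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
row = np.stack([part_a, part_a, part_a, part_b, part_b, part_b])
print(islands_of_agreement(row))   # [[0, 1, 2], [3, 4, 5]]
```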
I have a new paper on how to represent part-whole hierarchies in neural networks. https://t.co/4hYb7MJ1SF
— Geoffrey Hinton (@geoffreyhinton) February 26, 2021
2. Modular Object-Oriented Games: A Task Framework for Reinforcement Learning, Psychology, and Neuroscience
Nicholas Watters, Joshua Tenenbaum, Mehrdad Jazayeri
In recent years, trends towards studying simulated games have gained momentum in the fields of artificial intelligence, cognitive science, psychology, and neuroscience. The intersections of these fields have also grown recently, as researchers increasingly study such games using both artificial agents and human or animal subjects. However, implementing games can be a time-consuming endeavor and may require a researcher to grapple with complex codebases that are not easily customized. Furthermore, interdisciplinary researchers studying some combination of artificial intelligence, human psychology, and animal neurophysiology face additional challenges, because existing platforms are designed for only one of these domains. Here we introduce Modular Object-Oriented Games, a Python task framework that is lightweight, flexible, customizable, and designed for use by machine learning, psychology, and neurophysiology researchers.
Working at the intersection of AI/cogsci/neurosci? We present to you Modular Object-Oriented Games (MOOG), a flexible Python-based framework for interactive games that you can use for psychophysics, physiology, and RL.
— Mehrdad Jazayeri (@mjaztwit) February 26, 2021
Check out https://t.co/ZyFxJ7HY3w and https://t.co/2Bk7z2MTI6
3. IBRNet: Learning Multi-View Image-Based Rendering
Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, Thomas Funkhouser
We present a method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views. The core of our method is a network architecture that includes a multilayer perceptron and a ray transformer that estimates radiance and volume density at continuous 5D locations (3D spatial locations and 2D viewing directions), drawing appearance information on the fly from multiple source views. By drawing on source views at render time, our method hearkens back to classic work on image-based rendering (IBR), and allows us to render high-resolution imagery. Unlike neural scene representation work that optimizes per-scene functions for rendering, we learn a generic view interpolation function that generalizes to novel scenes. We render images using classic volume rendering, which is fully differentiable and allows us to train using only multi-view posed images as supervision. Experiments show that our method outperforms recent novel view synthesis methods that also seek to generalize to novel scenes. Further, if fine-tuned on each scene, our method is competitive with state-of-the-art single-scene neural rendering methods.
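The rendering step the abstract relies on, classic differentiable volume rendering, is compact enough to sketch. This is the standard compositing quadrature, not IBRNet's MLP or ray transformer, and the sample values below are made up for illustration.

```python
import numpy as np

def composite_ray(colors, densities, deltas):
    """Standard volume-rendering quadrature: colors (N,3), densities (N,),
    deltas (N,) are per-sample values along one ray, ordered near to far."""
    alphas = 1.0 - np.exp(-densities * deltas)                       # opacity per sample
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # accumulated transmittance
    weights = alphas * trans                                         # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                   # rendered RGB

colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
densities = np.array([0.5, 5.0])
deltas = np.array([0.1, 0.1])
print(composite_ray(colors, densities, deltas))
```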
IBRNet: Learning Multi-View Image-Based Rendering
— AK (@ak92501) February 26, 2021
pdf: https://t.co/PlSlGxlKzZ
abs: https://t.co/InrmbU6TjP
project page: https://t.co/JBAbMO2s7i pic.twitter.com/vTD3c8XWgY
4. A Primer on Contrastive Pretraining in Language Processing: Methods, Lessons Learned and Perspectives
Nils Rethmeier, Isabelle Augenstein
Modern natural language processing (NLP) methods employ self-supervised pretraining objectives such as masked language modeling to boost the performance of various application tasks. These pretraining methods are frequently extended with recurrence, adversarial or linguistic property masking, and more recently with contrastive learning objectives. Contrastive self-supervised training objectives enabled recent successes in image representation pretraining by learning to contrast input-input pairs of augmented images as either similar or dissimilar. However, in NLP, automated creation of text input augmentations is still very challenging because a single token can invert the meaning of a sentence. For this reason, some contrastive NLP pretraining methods contrast over input-label pairs, rather than over input-input pairs, using methods from Metric Learning and Energy Based Models. In this survey, we summarize recent self-supervised and supervised contrastive NLP pretraining methods and describe where they are used to improve language modeling, few or zero-shot learning, pretraining data-efficiency and specific NLP end-tasks. We introduce key contrastive learning concepts with lessons learned from prior research and structure works by applications and cross-field relations. Finally, we point to open challenges and future directions for contrastive NLP to encourage bringing contrastive NLP pretraining closer to recent successes in image representation pretraining.
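For readers new to the topic, the generic input-input contrastive objective the survey starts from (InfoNCE with in-batch negatives) fits in a few lines. This is the image-style formulation, not any specific NLP method from the survey, and the random vectors stand in for encoder outputs.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE: row i of `anchors` should match row i of `positives`
    and mismatch every other row (the in-batch negatives)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # positives sit on the diagonal

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
print(info_nce(x, x + 0.01 * rng.normal(size=x.shape)))
```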
Want to quickly learn about NLP contrastive pretraining and its benefits?
— Nils Rethmeier (@Nils_Rethmeier) February 26, 2021
We wrote: "A Primer on Contrastive Pretraining in Language Processing: Methods, Lessons Learned and Perspectives" with @IAugenstein
It's meant to be a quick start guide https://t.co/wvYRs5g5ZU pic.twitter.com/pmoZ30tyjH
5. AniGAN: Style-Guided Generative Adversarial Networks for Unsupervised Anime Face Generation
Bing Li, Yuanlue Zhu, Yitong Wang, Chia-Wen Lin, Bernard Ghanem, Linlin Shen
In this paper, we propose a novel framework to translate a portrait photo-face into an anime appearance. Our aim is to synthesize anime-faces which are style-consistent with a given reference anime-face. However, unlike typical translation tasks, such anime-face translation is challenging due to complex variations of appearances among anime-faces. Existing methods often fail to transfer the styles of reference anime-faces, or introduce noticeable artifacts/distortions in the local shapes of their generated faces. We propose AniGAN, a novel GAN-based translator that synthesizes high-quality anime-faces. Specifically, a new generator architecture is proposed to simultaneously transfer color/texture styles and transform local facial shapes into anime-like counterparts based on the style of a reference anime-face, while preserving the global structure of the source photo-face. We propose a double-branch discriminator to learn both domain-specific distributions and domain-shared distributions, helping generate visually pleasing anime-faces and effectively mitigate artifacts. Extensive experiments qualitatively and quantitatively demonstrate the superiority of our method over state-of-the-art methods.
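Transferring "color/texture styles" from a reference face is commonly done with adaptive instance normalization; a minimal AdaIN sketch follows. AniGAN's generator uses its own normalization blocks, so this only illustrates the general style-injection idea, and the random feature maps are placeholders.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance norm: re-scale content features (C, H, W) so their
    per-channel mean/std match those of the reference style features."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

rng = np.random.default_rng(0)
photo_feats = rng.normal(0.0, 1.0, size=(8, 32, 32))
anime_feats = rng.normal(2.0, 0.5, size=(8, 32, 32))
stylized = adain(photo_feats, anime_feats)
print(round(stylized.mean(), 3), round(stylized.std(), 3))  # roughly the anime statistics
```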
AniGAN: Style-Guided Generative Adversarial Networks for Unsupervised Anime Face Generation
— AK (@ak92501) February 26, 2021
pdf: https://t.co/CXa3wwb2vG
abs: https://t.co/ZMMNPlBz2y pic.twitter.com/OcyCoCMFyh
AniGAN tends to generate bangs that conveniently part around or behind the eyes https://t.co/vGgbwIc7xx pic.twitter.com/VYvhFfYN7V
— Yosuke Shinya (@shinya7y) February 26, 2021
6. Investigating the Limitations of the Transformers with Simple Arithmetic Tasks
Rodrigo Nogueira, Zhiying Jiang, Jimmy Lin
The ability to perform arithmetic tasks is a remarkable trait of human intelligence and might form a critical component of more complex reasoning tasks. In this work, we investigate if the surface form of a number has any influence on how sequence-to-sequence language models learn simple arithmetic tasks such as addition and subtraction across a wide range of values. We find that how a number is represented in its surface form has a strong influence on the model’s accuracy. In particular, the model fails to learn addition of five-digit numbers when using subwords (e.g., “32”), and it struggles to learn with character-level representations (e.g., “3 2”). By introducing position tokens (e.g., “3 10e1 2”), the model learns to accurately add and subtract numbers up to 60 digits. We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as we use the proper surface representation. This result bolsters evidence that subword tokenizers and positional encodings are components in current transformer designs that might need improvement. Moreover, we show that regardless of the number of parameters and training examples, models cannot learn addition rules that are independent of the length of the numbers seen during training. Code to reproduce our experiments is available at https://github.com/castorini/transformers-arithmetic
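The position-token surface form from the abstract ("32" becomes "3 10e1 2") can be generated with a short helper. The exact token conventions are defined in the paper's repository; the function below simply mirrors the example quoted here, dropping the marker for the units digit.

```python
def position_tokens(n: int) -> str:
    """Render a number in the position-token surface form from the abstract,
    e.g. 32 -> "3 10e1 2" (conventions follow the quoted example; see
    github.com/castorini/transformers-arithmetic for the exact format)."""
    digits = str(n)
    tokens = []
    for i, d in enumerate(digits):
        tokens.append(d)
        power = len(digits) - 1 - i
        if power > 0:                      # the units digit gets no marker here
            tokens.append(f"10e{power}")
    return " ".join(tokens)

print(position_tokens(32))     # 3 10e1 2
print(position_tokens(60713))  # 6 10e4 0 10e3 7 10e2 1 10e1 3
```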
Investigating the Limitations of the Transformers with Simple Arithmetic Tasks
— Aran Komatsuzaki (@arankomatsuzaki) February 26, 2021
LM struggles with addition and subtraction when subword or char-level vocab, but it can perform remarkably accurately with better tokenization.https://t.co/XwB9fuY75b pic.twitter.com/DKj43zI4Xy
Investigating the Limitations of the Transformers with Simple Arithmetic Tasks
— AK (@ak92501) February 26, 2021
pdf: https://t.co/jGGlcZ3l8V
abs: https://t.co/GY8swbXHBu pic.twitter.com/KLpxmQRq6j
In other news, @rodrigfnogueira and @ZhiyingJ have been teaching transformers to add. With 1k examples, T5 can learn addition up to ~15 digits... if you give it the right representation... but not otherwise. https://t.co/Lyt5hyP9Nv
— Jimmy Lin (@lintool) February 26, 2021
7. Evolving Attention with Residual Convolutions
Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong
Transformer is a ubiquitous model for natural language processing and has attracted wide attention in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However, they are learned independently in each layer and sometimes fail to capture precise patterns. In this paper, we propose a novel and generic mechanism based on evolving attention to improve the performance of transformers. On one hand, the attention maps in different layers share common knowledge, thus the ones in preceding layers can instruct the attention in succeeding layers through residual connections. On the other hand, low-level and high-level attentions vary in the level of abstraction, so we adopt convolutional layers to model the evolutionary process of attention maps. The proposed evolving attention mechanism achieves significant performance improvement over various state-of-the-art models for multiple tasks, including image classification, natural language understanding and machine translation.
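A minimal sketch of the described mechanism: pass the previous layer's attention map through a small convolution and combine it residually with the current layer's attention logits before the softmax. This assumes a single head and a plain 3x3 kernel; the paper's actual block design (per-head channels, normalization, gating) is not reproduced.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive 'same'-padded 2D convolution over a single map, enough for a sketch."""
    kh, kw = kernel.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * kernel).sum()
    return out

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def evolve_attention(prev_map, current_logits, kernel):
    """Residually combine the convolved previous-layer attention map with the
    current layer's raw attention logits, then normalize."""
    return softmax(current_logits + conv2d_same(prev_map, kernel), axis=-1)

rng = np.random.default_rng(0)
T = 6
prev_map = softmax(rng.normal(size=(T, T)), axis=-1)
current_logits = rng.normal(size=(T, T))
kernel = rng.normal(scale=0.1, size=(3, 3))      # learned in the real model
print(evolve_attention(prev_map, current_logits, kernel).sum(axis=-1))  # rows sum to 1
```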
Evolving Attention with Residual Convolutions
— AK (@ak92501) February 26, 2021
pdf: https://t.co/pFkLZPQiqB
abs: https://t.co/UkPYwlJlRp pic.twitter.com/auGax70vr6
8. Algorithms and Complexity on Indexing Founder Graphs
Massimo Equi, Tuukka Norri, Jarno Alanko, Bastien Cazaux, Alexandru I. Tomescu, Veli Mäkinen
We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that then can be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs repeat-free founder graphs when constructed from a gapless MSA and repeat-free elastic founder graphs when constructed from a general MSA with gaps. We give a linear time algorithm and a parameterized near linear time algorithm to construct a repeat-free founder graph and a repeat-free elastic founder graph, respectively. We derive a tailored succinct index structure to support queries of arbitrary length in the paths of a repeat-free (elastic) founder graph. In addition, we show how to turn a repeat-free (elastic) founder graph into a Wheeler graph in polynomial time. Furthermore, we show that a property such as repeat-freeness is essential for indexability. In particular, we show that unless the Strong Exponential Time Hypothesis (SETH) fails, one cannot build an index on an elastic founder graph in polynomial time to support fast queries.
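How a segmentation of a gapless MSA induces a founder graph can be sketched in a few lines: the distinct strings within each block become nodes, and consecutive blocks observed in the same row contribute edges. This omits the segmentation optimization, the repeat-freeness check, and the indexing machinery that the paper is actually about; the toy MSA and block boundaries are made up.

```python
def founder_graph(msa, boundaries):
    """Build the (block, segment-string) nodes and the edges induced by a
    segmentation of a gapless MSA. `boundaries` are block end positions."""
    blocks = list(zip([0] + boundaries[:-1], boundaries))
    nodes, edges = set(), set()
    for row in msa:
        segments = [row[s:e] for s, e in blocks]
        for b, seg in enumerate(segments):
            nodes.add((b, seg))
        for b in range(len(segments) - 1):
            edges.add(((b, segments[b]), (b + 1, segments[b + 1])))
    return nodes, edges

msa = ["ACGTAC", "ACGGAC", "TCGTAC"]
nodes, edges = founder_graph(msa, boundaries=[3, 6])
print(sorted(nodes))
print(sorted(edges))
```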
https://t.co/ZZKN3dmHcE Our new preprint on pangenome indexing is out. The idea is to represent a pangenome by recombinations of a few founder sequences, and segment those into repeat-free blocks via dynamic programing. #bioinformatics #pangenomics
— Jarno N. Alanko (@jnalanko) February 26, 2021
9. A Survey of RDF Stores & SPARQL Engines for Querying Knowledge Graphs
Waqas Ali, Muhammad Saleem, Bin Yao, Aidan Hogan, Axel-Cyrille Ngonga Ngomo
Recent years have seen the growing adoption of non-relational data models for representing diverse, incomplete data. Among these, the RDF graph-based data model has seen ever-broadening adoption, particularly on the Web. This adoption has prompted the standardization of the SPARQL query language for RDF, as well as the development of a variety of local and distributed engines for processing queries over RDF graphs. These engines implement a diverse range of specialized techniques for storage, indexing, and query processing. A number of benchmarks, based on both synthetic and real-world data, have also emerged to allow for contrasting the performance of different query engines, often at large scale. This survey paper draws together these developments, providing a comprehensive review of the techniques, engines and benchmarks for querying RDF knowledge graphs.
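For readers unfamiliar with the stack the survey reviews, here is a minimal RDF + SPARQL example using the rdflib Python library, which is just one lightweight engine among the many specialized stores and distributed engines the paper compares. The example data is invented.

```python
from rdflib import Graph

# A tiny RDF graph in Turtle syntax: two triples about one resource.
turtle = """
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob .
ex:alice ex:name "Alice" .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

# A SPARQL query over the graph: who does Alice know?
query = """
PREFIX ex: <http://example.org/>
SELECT ?person WHERE { ex:alice ex:knows ?person . }
"""
for row in g.query(query):
    print(row.person)   # http://example.org/bob
```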
10. Are Anti-Feminist Communities Gateways to the Far Right? Evidence from Reddit and YouTube
Robin Mamié, Manoel Horta Ribeiro, Robert West
Researchers have suggested that “the Manosphere,” a conglomerate of men-centered online communities, may serve as a gateway to far right movements. In that context, this paper quantitatively studies the migratory patterns between a variety of groups within the Manosphere and the Alt-right, a loosely connected far right movement that has been particularly active in mainstream social networks. Our analysis leverages over 300 million comments spread through Reddit (in 115 subreddits) and YouTube (in 526 channels) to investigate whether the audiences of channels and subreddits associated with these communities have converged between 2006 and 2018. In addition to subreddits related to the communities of interest, we also collect data on counterparts: other groups of users which we use for comparison (e.g., for YouTube we use a set of media channels). Besides measuring the similarity in the commenting user bases of these communities, we perform a migration study, calculating to what extent users in the Manosphere gradually engage with Alt-right content. Our results suggest that there is a large overlap between the user bases of the Alt-right and of the Manosphere and that members of the Manosphere are more likely to engage with far right content than carefully chosen counterparts. However, our analysis also shows that migration and user base overlap vary substantially across different platforms and within the Manosphere. Members of some communities (e.g., Men’s Rights Activists) gradually engage with the Alt-right significantly more than counterparts on both Reddit and YouTube, whereas for other communities, this engagement happens mostly on Reddit (e.g., Pick Up Artists). Overall, our work paints a nuanced picture of the pipeline between the Manosphere and the Alt-right, which may inform platforms’ policies and moderation decisions regarding these communities.
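The user-base similarity part of the analysis reduces to set overlap between commenter populations; a minimal Jaccard sketch is below. The paper's actual pipeline operates on 300 million comments and additionally runs a migration study, neither of which this toy attempts; the user sets here are invented.

```python
def jaccard(users_a, users_b):
    """Overlap between two commenting user bases as |A ∩ B| / |A ∪ B|."""
    a, b = set(users_a), set(users_b)
    return len(a & b) / len(a | b) if a | b else 0.0

manosphere_commenters = {"u1", "u2", "u3", "u4"}
altright_commenters = {"u3", "u4", "u5"}
control_commenters = {"u6", "u7"}

print(jaccard(manosphere_commenters, altright_commenters))  # 0.4
print(jaccard(control_commenters, altright_commenters))     # 0.0
```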
Our newest pre-print (w/ @robin_mamie and @cervisiarius) tackles a very important question:
— Manoel (@manoelribeiro) February 26, 2021
Are Anti-Feminist Communities Gateways to the Far Right? https://t.co/eipeeKn6Kg
Check the thread for a summary with illustrative hand-drawn plots :D*! pic.twitter.com/GfLKZOlW3U
11. Restoring Uniqueness in MicroVM Snapshots
Marc Brooker, Adrian Costin Catangiu, Mike Danilov, Alexander Graf, Colm MacCarthaigh, Andrei Sandu
Code initialization — the step of loading code, executing static code, filling caches, and forming re-used connections — tends to dominate cold-start time in serverless compute systems such as AWS Lambda. Post-initialization memory snapshots, cloned and restored on start, have emerged as a viable solution to this problem, with incremental snapshot and fast restore support in VMMs like Firecracker. Saving memory introduces the challenge of managing high-value memory contents, such as cryptographic secrets. Cloning introduces the challenge of restoring the uniqueness of the VMs, to allow them to do unique things like generate UUIDs, secrets, and nonces. This paper examines solutions to these problems in the every-microsecond-counts context of serverless cold-start, and discusses the state-of-the-art of available solutions. We present two new interfaces aimed at solving this problem — MADV_WIPEONSUSPEND and SysGenId — and compare them to alternative solutions.
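The uniqueness problem can be made concrete with a userspace sketch: any cached randomness must be thrown away once a process discovers it is running inside a restored clone. The generation-id file below is purely hypothetical; the paper's actual proposals (MADV_WIPEONSUSPEND and SysGenId) are kernel interfaces with different semantics, and this snippet only illustrates the reseed-on-clone pattern.

```python
import secrets

GEN_ID_PATH = "/tmp/vm_generation_id"   # hypothetical stand-in for a real clone signal

_cached_seed = secrets.token_bytes(32)
_seen_generation = None

def read_generation_id():
    """Pretend source of a 'VM generation' value that changes on snapshot restore."""
    try:
        with open(GEN_ID_PATH, "rb") as f:
            return f.read()
    except FileNotFoundError:
        return b"boot"

def get_seed():
    """Return cached entropy, regenerating it if the VM generation changed, so
    two clones of the same snapshot never reuse the same secret material."""
    global _cached_seed, _seen_generation
    gen = read_generation_id()
    if gen != _seen_generation:
        _cached_seed = secrets.token_bytes(32)
        _seen_generation = gen
    return _cached_seed

print(get_seed().hex()[:16])
```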
Our new preprint "Restoring Uniqueness in MicroVM Snapshots" is out: https://t.co/0cHuE37ytS A small, but very important, part of making MicroVM snapshots a better tool for implementing low-latency serverless computing.
— Marc Brooker (@MarcJBrooker) February 26, 2021
12. Cognitive network science for understanding online social cognitions: A brief review
Massimo Stella
Social media are digitalising massive amounts of users’ cognitions in terms of timelines and emotional content. Such Big Data opens unprecedented opportunities for investigating cognitive phenomena like perception, personality and information diffusion but requires suitable interpretable frameworks. Since social media data come from users’ minds, worthy candidates for this challenge are cognitive networks, models of cognition giving structure to mental conceptual associations. This work outlines how cognitive network science can open new, quantitative ways for understanding cognition through online media, like: (i) reconstructing how users semantically and emotionally frame events with contextual knowledge unavailable to machine learning, (ii) investigating conceptual salience/prominence through knowledge structure in social discourse; (iii) studying users’ personality traits like openness-to-experience, curiosity, and creativity through language in posts; (iv) bridging cognitive/emotional content and social dynamics via multilayer networks comparing the mindsets of influencers and followers. These advancements combine cognitive-, network- and computer science to understand cognitive mechanisms in both digital and real-world settings but come with limitations concerning representativeness, individual variability and data integration. Such aspects are discussed along the ethical implications of manipulating socio-cognitive data. In the future, reading cognitions through networks and social media can expose cognitive biases amplified by online platforms and relevantly inform policy making, education and markets about massive, complex cognitive trends.
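The cognitive-network framing can be illustrated by building a small conceptual association network from words co-occurring in posts and reading off a crude salience measure, loosely corresponding to points (i) and (ii). This toy ignores the semantic/emotional layers and multilayer comparisons the review actually discusses, and the posts are invented.

```python
import itertools
import networkx as nx

posts = [
    "vaccines protect public health",
    "public health policy needs trust",
    "trust in science shapes policy",
]

# Link words that co-occur in the same post; edge weight counts co-occurrences.
G = nx.Graph()
for post in posts:
    words = set(post.split())
    for u, v in itertools.combinations(words, 2):
        w = G[u][v]["weight"] + 1 if G.has_edge(u, v) else 1
        G.add_edge(u, v, weight=w)

# Degree centrality as a crude proxy for conceptual salience in the discourse.
salience = nx.degree_centrality(G)
print(sorted(salience.items(), key=lambda kv: -kv[1])[:3])
```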
My latest pre-print (my 1st full project while at @exetercompsci) is a brief review about #cognitive #network science and key advancements for reading the structure of the human #mind and its semantic/emotional expression in online #social discourse:https://t.co/hVHzQKJ776
— Massimo Stella (@MassimoSt) February 26, 2021
13. Visualizing MuZero Models
Joery A. de Vries, Ken S. Voskuil, Thomas M. Moerland, Aske Plaat
MuZero, a model-based reinforcement learning algorithm that uses a value equivalent dynamics model, achieved state-of-the-art performance in Chess, Shogi and the game of Go. In contrast to standard forward dynamics models that predict a full next state, value equivalent models are trained to predict a future value, thereby emphasizing value relevant information in the representations. While value equivalent models have shown strong empirical success, there is no research yet that visualizes and investigates what types of representations these models actually learn. Therefore, in this paper we visualize the latent representation of MuZero agents. We find that action trajectories may diverge between observation embeddings and internal state transition dynamics, which could lead to instability during planning. Based on this insight, we propose two regularization techniques to stabilize MuZero’s performance. Additionally, we provide an open-source implementation of MuZero along with an interactive visualizer of learned representations, which may aid further investigation of value equivalent algorithms.
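The value-equivalent pipeline that the visualizations probe (an encoder mapping observations to a latent state, a dynamics function rolling the latent forward under actions, and a prediction head for value) can be sketched with stub linear functions. A real MuZero learns these networks and plans with MCTS on top; the random weights below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, LATENT, ACTIONS = 8, 4, 3

# Stand-ins for the three learned functions (here: fixed random linear maps).
W_repr = rng.normal(size=(LATENT, OBS))              # h: observation -> latent state
W_dyn = rng.normal(size=(LATENT, LATENT + ACTIONS))  # g: (latent, action) -> next latent
w_val = rng.normal(size=LATENT)                      # f: latent -> value

def represent(obs):
    return np.tanh(W_repr @ obs)

def dynamics(state, action):
    one_hot = np.eye(ACTIONS)[action]
    return np.tanh(W_dyn @ np.concatenate([state, one_hot]))

def value(state):
    return float(w_val @ state)

# Latent rollout: only values are predicted, never full future observations.
state = represent(rng.normal(size=OBS))
for action in [0, 2, 1]:
    state = dynamics(state, action)
    print(action, round(value(state), 3))
```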
Visualizing MuZero Models
— AK (@ak92501) February 26, 2021
pdf: https://t.co/40km1tCezC
abs: https://t.co/u1XIncKKT3 pic.twitter.com/iGzkqsZh7f
14. IPFS and Friends: A Qualitative Comparison of Next Generation Peer-to-Peer Data Networks
Erik Daniel, Florian Tschorsch
Decentralized, distributed storage offers a way to reduce the impact of data silos as often fostered by centralized cloud storage. While the intentions of this trend are not new, the topic gained traction due to technological advancements, most notably blockchain networks. As a consequence, we observe that a new generation of peer-to-peer data networks emerges. In this survey paper, we therefore provide a technical overview of the next generation data networks. We use select data networks to introduce general concepts and to emphasize new developments. We identify common building blocks and provide a qualitative comparison. From the overview, we derive future challenges and research goals concerning data networks.
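One building block shared by these data networks is content addressing, where a block is identified by the hash of its bytes rather than by its location, so any peer can verify what it receives. A minimal sketch follows; actual networks such as IPFS use multihash-encoded CIDs, chunking, and DHT-based routing, none of which appear here.

```python
import hashlib

class ContentStore:
    """Toy content-addressed store: the key of a block is the hash of its bytes."""
    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        cid = hashlib.sha256(data).hexdigest()
        self._blocks[cid] = data
        return cid

    def get(self, cid: str) -> bytes:
        data = self._blocks[cid]
        assert hashlib.sha256(data).hexdigest() == cid  # content verifies itself
        return data

store = ContentStore()
cid = store.put(b"hello p2p")
print(cid[:16], store.get(cid))
```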
Our paper “IPFS and Friends” is now available as preprint. We believe that due to rather recent advancements a new generation of P2P data networks emerges: https://t.co/5RY3i932eG pic.twitter.com/YeHIGZhFDZ
— Florian Tschorsch (@flotschorsch) February 26, 2021
15. Random Graphs with Prescribed K-Core Sequences: A New Null Model for Network Analysis
Katherine Van Koevering, Austin R. Benson, Jon Kleinberg
In the analysis of large-scale network data, a fundamental operation is the comparison of observed phenomena to the predictions provided by null models: when we find an interesting structure in a family of real networks, it is important to ask whether this structure is also likely to arise in random networks with similar characteristics to the real ones. A long-standing challenge in network analysis has been the relative scarcity of reasonable null models for networks; arguably the most common such model has been the configuration model, which starts with a graph G and produces a random graph with the same node degrees as G. This leads to a very weak form of null model, since fixing the node degrees does not preserve many of the crucial properties of the network, including the structure of its subgraphs. Guided by this challenge, we propose a new family of network null models that operate on the k-core decomposition. For a graph G, the k-core is its maximal subgraph of minimum degree k; and the core number of a node v in G is the largest k such that v belongs to the k-core of G. We provide the first efficient sampling algorithm to solve the following basic combinatorial problem: given a graph G, produce a random graph sampled nearly uniformly from among all graphs with the same sequence of core numbers as G. This opens the opportunity to compare observed networks G with random graphs that exhibit the same core numbers, a comparison that preserves aspects of the structure of G that are not captured by more local measures like the degree sequence. We illustrate the power of this core-based null model on some fundamental tasks in network analysis, including the enumeration of network motifs.
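The quantity the new null model preserves, the core number of every node, is easy to compute with networkx. The sketch below compares the core numbers of a graph with those of a configuration-model sample, illustrating why a degree-preserving null generally does not fix the core sequence; the paper's actual contribution, a near-uniform sampler over graphs with a prescribed core sequence, is not reproduced here.

```python
import networkx as nx

G = nx.karate_club_graph()
cores = nx.core_number(G)                       # core number of every node
print(sorted(set(cores.values())))              # distinct core values in the graph

# Degree-preserving null (configuration model): same degrees, but the core
# sequence is generally NOT preserved, which motivates the core-based null model.
degrees = [d for _, d in G.degree()]
H = nx.Graph(nx.configuration_model(degrees, seed=0))   # collapse parallel edges
H.remove_edges_from(nx.selfloop_edges(H))
print(sorted(nx.core_number(G).values()) == sorted(nx.core_number(H).values()))
```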
Spotted on the arxiv, signal boosting for the k-core fans: a random graph null model (and Markov chain sampler) which is uniform over graphs with a given core sequence: https://t.co/kEn6ygvrj8
— Johan Ugander (@jugander) February 26, 2021
16. Reducing Labelled Data Requirement for Pneumonia Segmentation using Image Augmentations
Jitesh Seth, Rohit Lokwani, Viraj Kulkarni, Aniruddha Pant, Amit Kharat
Deep learning semantic segmentation algorithms can localise abnormalities or opacities from chest radiographs. However, the task of collecting and annotating training data is expensive and requires expertise which remains a bottleneck for algorithm performance. We investigate the effect of image augmentations on reducing the requirement of labelled data in the semantic segmentation of chest X-rays for pneumonia detection. We train fully convolutional network models on subsets of different sizes from the total training data. We apply a different image augmentation while training each model and compare it to the baseline trained on the entire dataset without augmentations. We find that rotate and mixup are the best augmentations amongst rotate, mixup, translate, gamma and horizontal flip, wherein they reduce the labelled data requirement by 70% while performing comparably to the baseline in terms of AUC and mean IoU in our experiments.
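Mixup, one of the two augmentations found most effective, is only a few lines: blend random pairs of images with a Beta-distributed coefficient. The sketch assumes the same convex combination is applied to the segmentation masks, as is common when mixup is used for segmentation; the paper's exact setup may differ, and the arrays below are toy data.

```python
import numpy as np

def mixup(images, masks, alpha=0.4, rng=None):
    """Blend each image/mask with a randomly chosen partner from the batch."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(images))
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_masks = lam * masks + (1 - lam) * masks[perm]
    return mixed_images, mixed_masks

rng = np.random.default_rng(0)
xs = rng.random((4, 64, 64))                          # toy chest X-ray batch
ys = (rng.random((4, 64, 64)) > 0.9).astype(float)    # toy opacity masks
mx, my = mixup(xs, ys, rng=rng)
print(mx.shape, my.min(), my.max())
```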
Reducing Labelled Data Requirement for Pneumonia Segmentation using Image Augmentationshttps://t.co/u3z3BlpFuL
— Jitesh Seth (@SethJitesh) February 26, 2021
17. Task-Agnostic Morphology Evolution
Donald J. Hejna III, Pieter Abbeel, Lerrel Pinto
Deep reinforcement learning primarily focuses on learning behavior, usually overlooking the fact that an agent’s function is largely determined by form. So, how should one go about finding a morphology fit for solving tasks in a given environment? Current approaches that co-adapt morphology and behavior use a specific task’s reward as a signal for morphology optimization. However, this often requires expensive policy optimization and results in task-dependent morphologies that are not built to generalize. In this work, we propose a new approach, Task-Agnostic Morphology Evolution (TAME), to alleviate both of these issues. Without any task or reward specification, TAME evolves morphologies by only applying randomly sampled action primitives on a population of agents. This is accomplished using an information-theoretic objective that efficiently ranks agents by their ability to reach diverse states in the environment and the causality of their actions. Finally, we empirically demonstrate that across 2D, 3D, and manipulation environments TAME can evolve morphologies that match the multi-task performance of those learned with task supervised algorithms. Our code and videos can be found at https://sites.google.com/view/task-agnostic-evolution.
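The core of the objective, ranking morphologies by how diverse the states they reach under randomly sampled actions are, can be approximated by an entropy estimate over discretized visited states. TAME's actual information-theoretic objective and its estimator are more involved; the two trajectories below are synthetic.

```python
import numpy as np
from collections import Counter

def state_coverage_score(states, bins=10):
    """Entropy of a histogram over visited 2D states: higher = the morphology
    reaches a more diverse set of states under random action primitives."""
    discretized = [tuple(np.floor(np.asarray(s) * bins).astype(int)) for s in states]
    counts = np.array(list(Counter(discretized).values()), dtype=float)
    probs = counts / counts.sum()
    return float(-(probs * np.log(probs)).sum())

rng = np.random.default_rng(0)
wanderer = rng.random((500, 2))                          # covers the unit square
stuck = 0.5 + 0.01 * rng.standard_normal((500, 2))       # barely moves
print(state_coverage_score(wanderer) > state_coverage_score(stuck))  # True
```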
Task-Agnostic Morphology Evolution
— AK (@ak92501) February 26, 2021
pdf: https://t.co/fG08SsW2QK
abs: https://t.co/JbiDPPScpn
project page: https://t.co/Di7L1q5Hvl pic.twitter.com/62BhGeWAIG
18. Simple multi-dataset detection
Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühl
How do we build a general and broad object detection system? We use all labels of all concepts ever annotated. These labels span diverse datasets with potentially inconsistent taxonomies. In this paper, we present a simple method for training a unified detector on multiple large-scale datasets. We use dataset-specific training protocols and losses, but share a common detection architecture with dataset-specific outputs. We show how to automatically integrate these dataset-specific outputs into a common semantic taxonomy. In contrast to prior work, our approach does not require manual taxonomy reconciliation. Our multi-dataset detector performs as well as dataset-specific models on each training domain, but generalizes much better to new unseen domains. Entries based on the presented methodology ranked first in the object detection and instance segmentation tracks of the ECCV 2020 Robust Vision Challenge.
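The "shared architecture, dataset-specific outputs" design can be sketched as a shared backbone feeding one output head per dataset, with each image routed to the head of the dataset it came from. The automatic unification of the taxonomies is the paper's contribution and is not shown; the dimensions and random weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT = 16
DATASET_CLASSES = {"coco": 80, "objects365": 365, "openimages": 500}

# Shared "backbone" (a fixed random projection here) plus one head per dataset.
W_backbone = rng.normal(size=(FEAT, 32))
heads = {name: rng.normal(size=(n_cls, FEAT)) for name, n_cls in DATASET_CLASSES.items()}

def forward(image_feats, dataset):
    """Route a feature vector through the shared backbone, then through the
    output head belonging to the dataset the image came from."""
    shared = np.tanh(W_backbone @ image_feats)
    return heads[dataset] @ shared          # per-dataset class logits

x = rng.normal(size=32)
for name in DATASET_CLASSES:
    print(name, forward(x, name).shape)
```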
This problem setting looks interesting! Training and detecting on multiple object detection datasets (MSCOCO, OpenImages, Object365) in parallel.
— Hirokatsu Kataoka | 片岡裕雄 (@HirokatuKataoka) February 26, 2021
Simple multi-dataset detection
Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühlhttps://t.co/y0TB9ERK7d pic.twitter.com/4dkMiFLx4g
19. SparseBERT: Rethinking the Importance Analysis in Self-attention
Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li, James T. Kwok
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity. As the core component, the self-attention module has attracted widespread interest. Attention-map visualization of a pre-trained model is one direct way to understand the self-attention mechanism, and some common patterns have been observed in such visualizations. Based on these patterns, a series of efficient transformers has been proposed with corresponding sparse attention masks. Beyond these empirical results, the universal approximability of Transformer-based models has also been established from a theoretical perspective. However, the above understanding and analysis of self-attention is based on a pre-trained model. To rethink the importance analysis in self-attention, we delve into the dynamics of attention-matrix importance during pre-training. One surprising result is that the diagonal elements in the attention map are the most unimportant compared with other attention positions, and we provide a proof showing that these elements can be removed without damaging model performance. Furthermore, we propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT. Extensive experiments verify our interesting findings and illustrate the effect of the proposed algorithm.
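The differentiable-mask idea, learning a soft mask over attention positions jointly with the model, can be sketched by gating attention logits with sigmoid-activated mask parameters and, following the abstract's finding, zeroing out the diagonal. The real DAM algorithm adds sparsity regularization and a discretization step not shown here; the scores and mask logits below are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(scores, mask_logits, zero_diagonal=True):
    """Gate raw attention scores with a soft mask in [0, 1]; positions whose
    gate is zero are pushed toward -inf before the softmax."""
    gate = 1.0 / (1.0 + np.exp(-mask_logits))          # differentiable mask
    if zero_diagonal:                                   # diagonal found least important
        np.fill_diagonal(gate, 0.0)
    return softmax(np.where(gate > 0, scores + np.log(gate + 1e-9), -1e9), axis=-1)

rng = np.random.default_rng(0)
T = 5
scores = rng.normal(size=(T, T))
mask_logits = rng.normal(size=(T, T))                   # trained jointly in practice
attn = masked_attention(scores, mask_logits)
print(np.round(np.diag(attn), 6))                       # diagonal attention is ~0
```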
SparseBERT: Rethinking the Importance Analysis in Self-attention
— AK (@ak92501) February 26, 2021
pdf: https://t.co/yiNPU2Lchc
abs: https://t.co/6ye4o3ME5y pic.twitter.com/4QJJldvRGT