1. MasakhaNER: Named Entity Recognition for African Languages
David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D’souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David, Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde, Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode, Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya
We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. We release the data, code, and models in order to inspire future research on African NLP.
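To make the data format concrete, here is a minimal sketch of reading a two-column CoNLL-style NER file (token and tag per line, blank line between sentences). The file path and exact column layout are assumptions for illustration and may not match the released MasakhaNER splits exactly.

def read_conll(path):
    # Assumed layout: "token tag" per line, blank line between sentences.
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                     # sentence boundary
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            parts = line.split()
            tokens.append(parts[0])
            tags.append(parts[-1])
    if tokens:                               # trailing sentence without final blank line
        sentences.append((tokens, tags))
    return sentences

if __name__ == "__main__":
    from collections import Counter
    # Path is illustrative only.
    data = read_conll("masakhane-ner/data/yor/train.txt")
    tag_counts = Counter(tag for _, tags in data for tag in tags)
    print(len(data), "sentences", tag_counts.most_common(5))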
We're SO excited to present the first large publicly available high quality dataset for NER in 10 African languages, bringing together a variety of stakeholders: language speakers, dataset curators, NLP practitioners, and evaluation experts 💕🌍💪
— Masakhane (@MasakhaneNLP) March 23, 2021
(1/n)https://t.co/oa5ZbRrMyT pic.twitter.com/gJ3OlY97sx
2. Preliminary Analysis of Potential Harms in the Luca Tracing System
Theresa Stadler, Wouter Lueks, Katharina Kohls, Carmela Troncoso
In this document, we analyse the potential harms a large-scale deployment of the Luca system might cause to individuals, venues, and communities. The Luca system is a digital presence tracing system designed to provide health departments with the contact information necessary to alert individuals who have visited a location at the same time as a SARS-CoV-2-positive person. Multiple regional health departments in Germany have announced their plans to deploy the Luca system for the purpose of presence tracing. The system’s developers suggest its use across various types of venues: from bars and restaurants to public and private events, such as religious or political gatherings, weddings, and birthday parties. Recently, an extension to include schools and other educational facilities was discussed in public. Our analysis of the potential harms of the system is based on the publicly available Luca Security Concept which describes the system’s security architecture and its planned protection mechanisms. The Security Concept furthermore provides a set of claims about the system’s security and privacy properties. Besides an analysis of harms, our analysis includes a validation of these claims.
Over the past few days we have written an analysis of the possible drawbacks that a broad deployment of the #LucaApp could bring for individuals, groups, and venues https://t.co/Hlkg5Fn2S2 with @WouterLueks @blister_green @carmelatroncoso
— T (@thsStadler) March 23, 2021
A summary 👇
3. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Isaac Caswell, Julia Kreutzer, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Javier Ortiz Suárez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
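A toy sketch of the kind of audit described above: sample lines from a web-crawled corpus and tally how many pass a check. Here the check is a crude placeholder heuristic; in the paper the judgements were made by human annotators, and the file name below is purely illustrative.

import random

def sample_lines(path, n=100, seed=0):
    with open(path, encoding="utf-8") as f:
        lines = [l.strip() for l in f if l.strip()]
    random.seed(seed)
    return random.sample(lines, min(n, len(lines)))

def looks_acceptable(text, lang_code):
    # Placeholder heuristic only: non-trivial length and mostly alphabetic.
    # A real audit would rely on human judgement or a language-ID model.
    letters = sum(ch.isalpha() for ch in text)
    return len(text) > 20 and letters / max(len(text), 1) > 0.5

lines = sample_lines("ccaligned.yo.txt", n=100)   # file name is an assumption
ok = sum(looks_acceptable(l, "yo") for l in lines)
print(f"{ok}/{len(lines)} sampled sentences pass the crude quality check")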
Does the data used for multilingual modeling really contain content in the languages it says it does? Short answer: sometimes 🙁 https://t.co/05NWjobkwO 1/n
— Isaac R Caswell (@iseeaswell) March 23, 2021
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
— Aran Komatsuzaki (@arankomatsuzaki) March 23, 2021
By manually auditing the quality of 205 language-specific corpora, they find that lower-resource corpora have systematic issues in quality.https://t.co/xUiwTCoZFd pic.twitter.com/ag4kPsxwwP
4. Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges
Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, Chudi Zhong
Interpretability in machine learning (ML) is crucial for high stakes decisions and troubleshooting. In this work, we provide fundamental principles for interpretable ML, and dispel common misunderstandings that dilute the importance of this crucial topic. We also identify 10 technical challenge areas in interpretable machine learning and provide history and background on each problem. Some of these problems are classically important, and some are recent problems that have arisen in the last few years. These problems are: (1) Optimizing sparse logical models such as decision trees; (2) Optimization of scoring systems; (3) Placing constraints into generalized additive models to encourage sparsity and better interpretability; (4) Modern case-based reasoning, including neural networks and matching for causal inference; (5) Complete supervised disentanglement of neural networks; (6) Complete or even partial unsupervised disentanglement of neural networks; (7) Dimensionality reduction for data visualization; (8) Machine learning models that can incorporate physics and other generative or causal constraints; (9) Characterization of the “Rashomon set” of good models; and (10) Interpretable reinforcement learning. This survey is suitable as a starting point for statisticians and computer scientists interested in working in interpretable machine learning.
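As a toy illustration of challenge (1), sparse logical models: a depth- and leaf-limited decision tree is one way to obtain a small, readable model. Note that the challenge in the paper concerns provably optimal sparse trees, which require dedicated solvers rather than the greedy CART procedure used in this sketch.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Greedy (not optimal) sparse tree: few leaves, shallow depth.
tree = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8, random_state=0)
tree.fit(X_tr, y_tr)
print("test accuracy:", tree.score(X_te, y_te))
print(export_text(tree))   # the whole model fits on a screen, hence "interpretable"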
New review paper: "Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges" https://t.co/dIUwjxcISx
— Cynthia Rudin (@CynthiaRudin) March 23, 2021
5. Measuring and modeling the motor system with machine learning
Sébastien B. Hausmann, Alessandro Marin Vargas, Alexander Mathis, Mackenzie W. Mathis
The utility of machine learning in understanding the motor system is promising a revolution in how to collect, measure, and analyze data. The field of movement science already elegantly incorporates theory and engineering principles to guide experimental work, and in this review we discuss the growing use of machine learning: from pose estimation, kinematic analyses, dimensionality reduction, and closed-loop feedback, to its use in understanding neural correlates and untangling sensorimotor systems. We also give our perspective on new avenues where markerless motion capture combined with biomechanical modeling and neural networks could be a new platform for hypothesis-driven research.
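As a toy illustration of one analysis mentioned above, dimensionality reduction of pose trajectories: the keypoint array below is synthetic, standing in for output from a markerless pose estimator.

import numpy as np
from sklearn.decomposition import PCA

T, K = 500, 12                      # frames, keypoints
rng = np.random.default_rng(0)
t = np.linspace(0, 10, T)
# Two latent "movement" signals mixed into 2-D keypoint coordinates plus noise.
latent = np.stack([np.sin(t), np.cos(2 * t)], axis=1)
mixing = rng.normal(size=(2, K * 2))
poses = latent @ mixing + 0.1 * rng.normal(size=(T, K * 2))

pca = PCA(n_components=5).fit(poses)
print("variance explained by first 2 components:",
      pca.explained_variance_ratio_[:2].sum())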
🦾Want to tackle the motor system with machine learning?
— Dr. Mackenzie Mathis (@TrackingActions) March 23, 2021
🎓 New *review* on machine learning approaches for behavior & sensorimotor modeling!
🔖Written with fabulous co-1st @EPFL_en PhD students @SebHausmann & @a_marinvargas + @TrackingPlumes & me! https://t.co/pgVMov8xZa pic.twitter.com/ZftEvNDyMz
6. Multimodal Motion Prediction with Stacked Transformers
Yicheng Liu, Jinghuai Zhang, Liangji Fang, Qinhong Jiang, Bolei Zhou
Predicting multiple plausible future trajectories of the nearby vehicles is crucial for the safety of autonomous driving. Recent motion prediction approaches attempt to achieve such multimodal motion prediction by implicitly regularizing the feature or explicitly generating multiple candidate proposals. However, it remains challenging since the latent features may concentrate on the most frequent mode of the data while the proposal-based methods depend largely on the prior knowledge to generate and select the proposals. In this work, we propose a novel transformer framework for multimodal motion prediction, termed as mmTransformer. A novel network architecture based on stacked transformers is designed to model the multimodality at feature level with a set of fixed independent proposals. A region-based training strategy is then developed to induce the multimodality of the generated proposals. Experiments on Argoverse dataset show that the proposed model achieves the state-of-the-art performance on motion prediction, substantially improving the diversity and the accuracy of the predicted trajectories. Demo video and code are available at https://decisionforce.github.io/mmTransformer.
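A highly simplified sketch of proposal-based multimodal prediction: K learned proposal queries attend over an encoded scene context and each regresses one candidate trajectory. Training with a closest-proposal ("winner takes all") loss is used here as a common stand-in; this sketch does not reproduce mmTransformer's stacked architecture or its region-based training strategy.

import torch
import torch.nn as nn

class ProposalDecoder(nn.Module):
    def __init__(self, d_model=128, n_proposals=6, horizon=30):
        super().__init__()
        # Fixed set of learnable proposal embeddings, one per predicted mode.
        self.proposals = nn.Parameter(torch.randn(n_proposals, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, horizon * 2)   # (x, y) per future step
        self.horizon = horizon

    def forward(self, context):                        # context: (B, L, d_model)
        B = context.size(0)
        queries = self.proposals.unsqueeze(0).expand(B, -1, -1)
        out = self.decoder(queries, context)            # (B, K, d_model)
        return self.head(out).view(B, -1, self.horizon, 2)

def wta_loss(pred, gt):                                 # pred: (B,K,T,2), gt: (B,T,2)
    err = ((pred - gt.unsqueeze(1)) ** 2).mean(dim=(2, 3))   # (B, K)
    return err.min(dim=1).values.mean()                 # only the closest proposal is penalized

model = ProposalDecoder()
context = torch.randn(4, 20, 128)                       # dummy encoded agents/map features
gt = torch.randn(4, 30, 2)
loss = wta_loss(model(context), gt)
loss.backward()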
Multimodal Motion Prediction with Stacked Transformers
— AK (@ak92501) March 23, 2021
pdf: https://t.co/OPjoTObGBt
abs: https://t.co/G32niBfY6h
project page: https://t.co/xmF1AyniZj pic.twitter.com/c4EJbz0aDV
7. DeepViT: Towards Deeper Vision Transformer
Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Qibin Hou, Jiashi Feng
Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper. More specifically, we empirically observe that this scaling difficulty is caused by an attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and eventually nearly identical after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This observation indicates that in the deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning, which hinders the model from obtaining the expected performance gain. Based on the above observation, we propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modifications to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy on ImageNet can be improved by 1.6%. Code will be made publicly available.
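A minimal sketch of the Re-attention idea: mix the per-head attention maps with a learnable head-to-head matrix before applying them to the values. Details such as the exact normalization follow the abstract loosely and may differ from the released implementation.

import torch
import torch.nn as nn

class ReAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.h = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.mix = nn.Parameter(torch.eye(num_heads))    # learnable head-mixing matrix
        self.norm = nn.BatchNorm2d(num_heads)             # normalizes the mixed maps (a guess)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (B, N, dim)
        B, N, C = x.shape
        qkv = self.qkv(x).view(B, N, 3, self.h, C // self.h).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                    # each (B, h, N, d)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)   # (B, h, N, N)
        # Re-generate the attention maps by mixing information across heads.
        attn = torch.einsum("hg,bgnm->bhnm", self.mix, attn)
        attn = self.norm(attn)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

x = torch.randn(2, 197, 384)
print(ReAttention(384, num_heads=8)(x).shape)               # torch.Size([2, 197, 384])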
DeepViT: Towards Deeper Vision Transformer
— Aran Komatsuzaki (@arankomatsuzaki) March 23, 2021
Achieves up to 1.6%+ in top1 acc. on Imagenet by regenerating the attention maps to increase their diversity at different layers.https://t.co/NsvfB1VJu4 pic.twitter.com/xZiEHkeGTy
DeepViT: Towards Deeper Vision Transformer
— AK (@ak92501) March 23, 2021
pdf: https://t.co/V84QXEelAV
abs: https://t.co/PwIza18U7N pic.twitter.com/mS9WNcyZzU
8. Efficient Visual Pretraining with Contrastive Detection
Olivier J. Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron van den Oord, Oriol Vinyals, João Carreira
Self-supervised pretraining has been shown to yield powerful representations for transfer learning. These performance gains come at a large computational cost however, with state-of-the-art methods requiring an order of magnitude more computation than supervised pretraining. We tackle this computational bottleneck by introducing a new self-supervised objective, contrastive detection, which tasks representations with identifying object-level features across augmentations. This objective extracts a rich learning signal per image, leading to state-of-the-art transfer performance from ImageNet to COCO, while requiring up to 5x less pretraining. In particular, our strongest ImageNet-pretrained model performs on par with SEER, one of the largest self-supervised systems to date, which uses 1000x more pretraining data. Finally, our objective seamlessly handles pretraining on more complex images such as those in COCO, closing the gap with supervised transfer learning from COCO to PASCAL.
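A toy sketch of an object-level contrastive loss in the spirit of contrastive detection: features pooled inside the same mask under two augmentations are positives, features from different masks are negatives. Mask alignment and negative sampling are simplified, so this is not the paper's exact objective.

import torch
import torch.nn.functional as F

def contrastive_detection_loss(z1, z2, temperature=0.1):
    # z1, z2: (M, d) mask-pooled embeddings for the same M masks in two augmented views.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature           # (M, M) similarity matrix
    targets = torch.arange(z1.size(0))           # mask i in view 1 matches mask i in view 2
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

z1, z2 = torch.randn(16, 256), torch.randn(16, 256)   # dummy mask-pooled features
print(contrastive_detection_loss(z1, z2))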
Efficient Visual Pretraining with Contrastive Detection
— Aran Komatsuzaki (@arankomatsuzaki) March 23, 2021
With a new self-supervised objective, contrastive detection, DetCon performs on par with the SotA model (SEER) w/ 1000x less pretraining data on Imagenet. https://t.co/jnfkKEkMNd pic.twitter.com/GqhQpySe6I
Efficient Visual Pretraining with Contrastive Detection
— AK (@ak92501) March 23, 2021
pdf: https://t.co/8M9Go75XkU
abs: https://t.co/Z9r23oDPqZ pic.twitter.com/zvY2nZHaDh
9. Improving and Simplifying Pattern Exploiting Training
Derek Tam, Rakesh R Menon, Mohit Bansal, Shashank Srivastava, Colin Raffel
Recently, pre-trained language models (LMs) have achieved strong performance when fine-tuned on difficult benchmarks like SuperGLUE. However, performance can suffer when there are very few labeled examples available for fine-tuning. Pattern Exploiting Training (PET) is a recent approach that leverages patterns for few-shot learning. However, PET uses task-specific unlabeled data. In this paper, we focus on few shot learning without any unlabeled data and introduce ADAPET, which modifies PET’s objective to provide denser supervision during fine-tuning. As a result, ADAPET outperforms PET on SuperGLUE without any task-specific unlabeled data. Our code can be found at https://github.com/rrmenon10/ADAPET.
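A rough sketch of a "denser" label objective in the spirit of ADAPET's decoupled label loss: at the masked position, each candidate verbalizer token is treated as an independent binary decision rather than a single softmax pick. This simplifies the paper's objective, which also scores non-label vocabulary tokens and adds a label-conditioned MLM term; the verbalizer token ids below are illustrative.

import torch
import torch.nn.functional as F

def decoupled_label_loss(mask_logits, verbalizer_ids, gold_index):
    # mask_logits: (V,) LM logits at the [MASK] position.
    # verbalizer_ids: token ids of the candidate label words; gold_index: the correct one.
    probs = mask_logits.softmax(dim=-1)[verbalizer_ids]    # probability of each label word
    targets = torch.zeros_like(probs)
    targets[gold_index] = 1.0
    return F.binary_cross_entropy(probs, targets)          # push gold up, all others down

vocab_size = 30522
logits = torch.randn(vocab_size)
verbalizers = torch.tensor([2748, 2053])                   # e.g. ids for "yes" / "no" (illustrative)
print(decoupled_label_loss(logits, verbalizers, gold_index=0))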
New preprint! We introduce a simplified version of pattern-exploiting training called ADAPET. ADAPET outperforms PET and iPET on SuperGLUE without using task-specific unlabeled data or ensembling and beats few-shot GPT-3 with a much smaller model.https://t.co/ukPvsI340g pic.twitter.com/E3uTj2S9yx
— Colin Raffel (@colinraffel) March 23, 2021
Improving and Simplifying Pattern Exploiting Training
— Aran Komatsuzaki (@arankomatsuzaki) March 23, 2021
ADAPET outperforms PET (Pattern Exploiting Training) on SuperGLUE without any task-specific unlabeled data.
abs: https://t.co/6iLFZEonEK
code: https://t.co/NVIUNbNlyt pic.twitter.com/X1BQdmLvje
10. Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval
Gregor Geigle, Jonas Pfeiffer, Nils Reimers, Ivan Vulić, Iryna Gurevych
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image. While offering unmatched retrieval performance, such models: 1) are typically pretrained from scratch and thus less scalable, 2) suffer from huge retrieval latency and inefficiency issues, which makes them impractical in realistic applications. To address these crucial gaps towards both improved and efficient cross-modal retrieval, we propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient retrieval model. The framework is based on a cooperative retrieve-and-rerank approach which combines: 1) twin networks to separately encode all items of a corpus, enabling efficient initial retrieval, and 2) a cross-encoder component for a more nuanced (i.e., smarter) ranking of the retrieved small set of items. We also propose to jointly fine-tune the two components with shared weights, yielding a more parameter-efficient model. Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups, demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.
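A minimal sketch of the cooperative retrieve-and-rerank pattern: a twin (bi-)encoder pre-computes image embeddings for fast nearest-neighbour retrieval, and a slower cross-encoder rescores only the top-k candidates. The encoder functions here are placeholders, not the paper's models or its joint fine-tuning scheme.

import numpy as np

def embed_text(query):                                 # placeholder twin-network text encoder
    rng = np.random.default_rng(abs(hash(query)) % 2**32)
    return rng.normal(size=128)

def cross_encoder_score(query, image_id):              # placeholder joint text-image scorer
    return (hash((query, image_id)) % 1000) / 1000.0

image_ids = [f"img_{i}" for i in range(10_000)]
image_embs = np.random.default_rng(0).normal(size=(len(image_ids), 128))
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)

def search(query, k=32):
    q = embed_text(query)
    q /= np.linalg.norm(q)
    # Stage 1: cheap dot-product retrieval over the whole corpus.
    top_k = np.argsort(image_embs @ q)[::-1][:k]
    # Stage 2: expensive cross-encoder reranking of only k candidates.
    reranked = sorted(top_k,
                      key=lambda i: cross_encoder_score(query, image_ids[i]),
                      reverse=True)
    return [image_ids[i] for i in reranked[:5]]

print(search("a dog catching a frisbee"))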
Check out our paper
— Jonas Pfeiffer (@PfeiffJo) March 23, 2021
“Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval“
An efficient retrieve and rerank approach for image search!
w/ @GregorGeigle @Nils_Reimers @licwu
Paper: https://t.co/HQhp8y750K
Code: https://t.co/B43M1fkoFP pic.twitter.com/r9R9fdGgMT
Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval
— AK (@ak92501) March 23, 2021
pdf: https://t.co/QQhgFMXwU1
abs: https://t.co/aRzf5C13fH
github: https://t.co/Bb6m5c9A3t pic.twitter.com/9yQoLSoloo
11. Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning
Gongjie Zhang, Zhipeng Luo, Kaiwen Cui, Shijian Lu
Few-shot object detection aims at detecting novel objects with only a few annotated examples. Prior works have proved meta-learning a promising solution, and most of them essentially address detection by meta-learning over regions for their classification and location fine-tuning. However, these methods substantially rely on initially well-located region proposals, which are usually hard to obtain under the few-shot settings. This paper presents a novel meta-detector framework, namely Meta-DETR, which eliminates region-wise prediction and instead meta-learns object localization and classification at image level in a unified and complementary manner. Specifically, it first encodes both support and query images into category-specific features and then feeds them into a category-agnostic decoder to directly generate predictions for specific categories. To facilitate meta-learning with deep networks, we design a simple but effective Semantic Alignment Mechanism (SAM), which aligns high-level and low-level feature semantics to improve the generalization of meta-learned representations. Experiments over multiple few-shot object detection benchmarks show that Meta-DETR outperforms state-of-the-art methods by large margins.
Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning
— AK (@ak92501) March 23, 2021
pdf: https://t.co/iDoXts5AJV
abs: https://t.co/iPysNYg7wx pic.twitter.com/gQ498JTC4g
12. Incorporating Convolution Designs into Visual Transformers
Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, Wei Wu
Motivated by the success of Transformers in natural language processing (NLP) tasks, some attempts (e.g., ViT and DeiT) have emerged to apply Transformers to the vision domain. However, pure Transformer architectures often require a large amount of training data or extra supervision to obtain performance comparable with convolutional neural networks (CNNs). To overcome these limitations, we analyze the potential drawbacks when directly borrowing Transformer architectures from NLP. We then propose a new Convolution-enhanced image Transformer (CeiT) which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies. Three modifications are made to the original Transformer: 1) instead of the straightforward tokenization of raw input images, we design an Image-to-Tokens (I2T) module that extracts patches from generated low-level features; 2) the feed-forward network in each encoder block is replaced with a Locally-enhanced Feed-Forward (LeFF) layer that promotes the correlation among neighboring tokens in the spatial dimension; 3) a Layer-wise Class token Attention (LCA) is attached at the top of the Transformer to utilize the multi-level representations. Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data or extra CNN teachers. Besides, CeiT models also demonstrate better convergence with fewer training iterations, which can reduce the training cost significantly (code and models will be released upon acceptance).
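A sketch of the Locally-enhanced Feed-Forward (LeFF) idea from point 2): project the patch tokens, fold them back onto their 2-D grid, apply a depth-wise convolution so neighbouring tokens interact, then unfold and project down. Activation and normalization choices here are guesses, not the exact CeiT configuration.

import torch
import torch.nn as nn

class LeFF(nn.Module):
    def __init__(self, dim=192, hidden=768, grid=14):
        super().__init__()
        self.grid = grid
        self.up = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.down = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, x):                        # x: (B, 1 + N, dim), N = grid * grid
        cls, patches = x[:, :1], x[:, 1:]
        B, N, _ = patches.shape
        h = self.act(self.up(patches))
        h = h.transpose(1, 2).reshape(B, -1, self.grid, self.grid)   # restore 2-D layout
        h = self.act(self.dwconv(h))              # local mixing among neighboring tokens
        h = h.flatten(2).transpose(1, 2)
        patches = self.down(h)
        return torch.cat([cls, patches], dim=1)   # class token skips the spatial mixing

x = torch.randn(2, 1 + 14 * 14, 192)
print(LeFF()(x).shape)                            # torch.Size([2, 197, 192])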
Incorporating Convolution Designs into Visual Transformers
— Aran Komatsuzaki (@arankomatsuzaki) March 23, 2021
CeiT matches DeiT with 3x fewer iterations by adding three modifications to the architecture.https://t.co/sXDsuIfPKS pic.twitter.com/t1SJOyZcm9
Incorporating Convolution Designs into Visual Transformers
— AK (@ak92501) March 23, 2021
pdf: https://t.co/KJUo5L9mxj
abs: https://t.co/WNDbVcCFdV pic.twitter.com/X4vTrYwxn2
13. Conceptual similarity and communicative need shape colexification: an experimental study
Andres Karjus, Richard A. Blythe, Simon Kirby, Tianyu Wang, Kenny Smith
Colexification refers to the phenomenon of multiple meanings sharing one word in a language. Cross-linguistic lexification patterns have been shown to be largely predictable, as similar concepts are often colexified. We test a recent claim that, beyond this general tendency, communicative needs play an important role in shaping colexification patterns. We approach this question by means of a series of human experiments, using an artificial language communication game paradigm. Our results across four experiments match the previous cross-linguistic findings: all other things being equal, speakers do prefer to colexify similar concepts. However, we also find evidence supporting the communicative need hypothesis: when faced with a frequent need to distinguish similar pairs of meanings, speakers adjust their colexification preferences to maintain communicative efficiency, and avoid colexifying those similar meanings which need to be distinguished in communication. This research provides further evidence to support the argument that languages are shaped by the needs and preferences of their speakers.
Preprint "Conceptual similarity and communicative need shape colexification" w/ @DrAlgernon @SimonKirby Tianyu Wang @kennysmithed: https://t.co/ZpF4dCWMxo. We carry out 4 artificial language experiments (incl a self-repl) to test 2 hypotheses from a crosslinguistic study 1/5 pic.twitter.com/ejJc7y9eyL
— Andres Karjus (@AndresKarjus) March 23, 2021
14. Which contributions count? Analysis of attribution in open source
Jean-Gabriel Young, Amanda Casari, Katie McLaughlin, Milo Z. Trujillo, Laurent Hébert-Dufresne, James P. Bagrow
Open source software projects usually acknowledge contributions with text files, websites, and other idiosyncratic methods. These data sources are hard to mine, which is why contributorship is most frequently measured through changes to repositories, such as commits, pushes, or patches. Recently, some open source projects have taken to recording contributor actions with standardized systems; this opens up a unique opportunity to understand how community-generated notions of contributorship map onto codebases as the measure of contribution. Here, we characterize contributor acknowledgment models in open source by analyzing thousands of projects that use a model called All Contributors to acknowledge diverse contributions like outreach, finance, infrastructure, and community management. We analyze the life cycle of projects through this model’s lens and contrast its representation of contributorship with the picture given by other methods of acknowledgment, including GitHub’s top committers indicator and contributions derived from actions taken on the platform. We find that community-generated systems of contribution acknowledgment make work like idea generation or bug finding more visible, which generates a more extensive picture of collaboration. Further, we find that models requiring explicit attribution lead to more clearly defined boundaries around what is and what is not a contribution.
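A small sketch of the kind of tally behind this analysis: read a project's `.all-contributorsrc` file (the JSON config used by the All Contributors tooling) and count how often each contribution type is acknowledged. Field names follow the public All Contributors spec; a particular repository may deviate from it.

import json
from collections import Counter

with open(".all-contributorsrc", encoding="utf-8") as f:
    config = json.load(f)

type_counts = Counter()
for person in config.get("contributors", []):
    # Each entry lists contribution types such as "code", "doc", "ideas", "bug".
    type_counts.update(person.get("contributions", []))

print("acknowledged contributors:", len(config.get("contributors", [])))
print("most common contribution types:", type_counts.most_common(10))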
"Which contributions count? Analysis of attribution in open source”
— Vermont Complex Systems Center @ UVM (@uvmcomplexity) March 23, 2021
New preprint from faculty members @_jgyou @LHDnets @bagrow w/PhD student @illegaldaydream & Google Open sourcerers @amcasari @glasnt https://t.co/ZbPjbU2kct pic.twitter.com/7TX804DsA4
15. MoViNets: Mobile Video Networks for Efficient Video Recognition
Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, Boqing Gong
We present Mobile Video Networks (MoViNets), a family of computation and memory efficient video networks that can operate on streaming video for online inference. 3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets and do not support online inference, making them difficult to work on mobile devices. We propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs. First, we design a video network search space and employ neural architecture search to generate efficient and diverse 3D CNN architectures. Second, we introduce the Stream Buffer technique that decouples memory from video clip duration, allowing 3D CNNs to embed arbitrary-length streaming video sequences for both training and inference with a small constant memory footprint. Third, we propose a simple ensembling technique to improve accuracy further without sacrificing efficiency. These three progressive techniques allow MoViNets to achieve state-of-the-art accuracy and efficiency on the Kinetics, Moments in Time, and Charades video action recognition datasets. For instance, MoViNet-A5-Stream achieves the same accuracy as X3D-XL on Kinetics 600 while requiring 80% fewer FLOPs and 65% less memory. Code will be made available at https://github.com/tensorflow/models/tree/master/official/vision.
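A toy sketch of the stream-buffer idea: a causal temporal convolution keeps the last (kernel_size - 1) frames of its input as state, so arbitrarily long video can be processed clip by clip with constant memory. This ignores the spatial dimensions and the NAS-searched architecture of the actual MoViNets.

import torch
import torch.nn as nn

class StreamingTemporalConv(nn.Module):
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.conv = nn.Conv1d(channels, channels, kernel_size)   # no padding: strictly causal
        self.buffer = None                                        # cached boundary frames

    def forward(self, x):                       # x: (B, C, T) features for one clip
        if self.buffer is None:
            self.buffer = x.new_zeros(x.size(0), x.size(1), self.k - 1)
        x_ext = torch.cat([self.buffer, x], dim=2)
        self.buffer = x_ext[:, :, -(self.k - 1):].detach()        # carry state to the next clip
        return self.conv(x_ext)                                   # output length equals T

layer = StreamingTemporalConv()
clip1, clip2 = torch.randn(1, 64, 8), torch.randn(1, 64, 8)
print(layer(clip1).shape, layer(clip2).shape)   # each clip yields 8 output frames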
MoViNets: Mobile Video Networks for Efficient Video Recognition
— Dan Kondratyuk (@hyperparticle) March 23, 2021
Using simple techniques, we produce efficient video classifiers for mobile devices, reducing peak memory usage by 10x, and can operate on streaming video.
Paper: https://t.co/G3oZlRtNZb pic.twitter.com/RE7iReM0rh
MoViNets: Mobile Video Networks for Efficient Video Recognition
— AK (@ak92501) March 23, 2021
pdf: https://t.co/FQ6k72HaHx
abs: https://t.co/mCe5POAZdR
github: https://t.co/1NOFncyrEn pic.twitter.com/fMPu5bEqim
16. Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking
Ning Wang, Wengang Zhou, Jie Wang, Houqiang Li
In video object tracking, there exist rich temporal contexts among successive frames, which have been largely overlooked in existing trackers. In this work, we bridge the individual video frames and explore the temporal contexts across them via a transformer architecture for robust object tracking. Different from classic usage of the transformer in natural language processing tasks, we separate its encoder and decoder into two parallel branches and carefully design them within the Siamese-like tracking pipelines. The transformer encoder promotes the target templates via attention-based feature reinforcement, which benefits the high-quality tracking model generation. The transformer decoder propagates the tracking cues from previous templates to the current frame, which facilitates the object searching process. Our transformer-assisted tracking framework is neat and trained in an end-to-end manner. With the proposed transformer, a simple Siamese matching approach is able to outperform the current top-performing trackers. By combining our transformer with the recent discriminative tracking pipeline, our method sets several new state-of-the-art records on prevalent tracking benchmarks.
Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking
— AK (@ak92501) March 23, 2021
pdf: https://t.co/TJ33ds2qwf
abs: https://t.co/rKm5DtvOJ7
github: https://t.co/Vczrdcxv7O pic.twitter.com/ZmLhLkUp6c
17. Fairness Perceptions of Algorithmic Decision-Making: A Systematic Review of the Empirical Literature
Christopher Starke, Janine Baleis, Birte Keller, Frank Marcinkowski
Algorithmic decision-making (ADM) increasingly shapes people’s daily lives. Given that such autonomous systems can cause severe harm to individuals and social groups, fairness concerns have arisen. A human-centric approach demanded by scholars and policymakers requires taking people’s fairness perceptions into account when designing and implementing ADM. We provide a comprehensive, systematic literature review synthesizing the existing empirical insights on perceptions of algorithmic fairness from 39 empirical studies spanning multiple domains and scientific disciplines. Through thorough coding, we systemize the current empirical literature along four dimensions: (a) algorithmic predictors, (b) human predictors, (c) comparative effects (human decision-making vs. algorithmic decision-making), and (d) consequences of ADM. While we identify much heterogeneity around the theoretical concepts and empirical measurements of algorithmic fairness, the insights come almost exclusively from Western-democratic contexts. By advocating for more interdisciplinary research adopting a society-in-the-loop framework, we hope our work will contribute to fairer and more responsible ADM.
🚨New Pre-Print🚨
— Christopher Starke (@ch_starke) March 23, 2021
We conducted a systematic literature review of 39 empirical studies on people’s fairness perceptions of algorithmic decision-making.https://t.co/qcYtFyLSmc
Thanks to my co-authors @birtekeller @JanineBls F. Marcinkowski
Main insights 👇 pic.twitter.com/yTaLnGC4na
18. Higher-order Homophily is Combinatorially Impossible
Nate Veldt, Austin R. Benson, Jon Kleinberg
Homophily is the seemingly ubiquitous tendency for people to connect with similar others, which is fundamental to how society organizes. Even though many social interactions occur in groups, homophily has traditionally been measured from collections of pairwise interactions involving just two individuals. Here, we develop a framework using hypergraphs to quantify homophily from multiway, group interactions. This framework reveals that many homophilous group preferences are impossible; for instance, men and women cannot simultaneously exhibit preferences for groups where their gender is the majority. This is not a human behavior but rather a combinatorial impossibility of hypergraphs. At the same time, our framework reveals relaxed notions of group homophily that appear in numerous contexts. For example, in order for US members of congress to exhibit high preferences for co-sponsoring bills with their own political party, there must also exist a substantial number of individuals from each party that are willing to co-sponsor bills even when their party is in the minority. Our framework also reveals how gender distribution in group pictures varies with group size, a fact that is overlooked when applying graph-based measures.
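A toy illustration of measuring group homophily: for each class, compute the fraction of its members' group memberships in which that class holds a strict majority. This simple statistic is in the spirit of the paper's hypergraph framework but uses none of its exact definitions or impossibility results.

from collections import Counter

groups = [  # each group is a list of member labels
    ["m", "m", "w"], ["m", "w", "w"], ["m", "m", "m", "w"],
    ["w", "w", "w"], ["m", "w"], ["m", "m", "w", "w", "w"],
]

def majority_rate(label):
    hits = total = 0
    for g in groups:
        counts = Counter(g)
        for member in g:
            if member != label:
                continue
            total += 1
            hits += counts[label] > len(g) - counts[label]   # strict majority in this group
    return hits / total

for label in ("m", "w"):
    print(label, round(majority_rate(label), 3))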
(1/n) New preprint with @austinbenson and Jon Kleinberg! "Higher-order Homophily is Combinatorially Impossible" (https://t.co/5oFRQuGPgT)
— Nate Veldt (@n_veldt) March 23, 2021
19. AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis
Yudong Guo, Keyu Chen, Sen Liang, Yongjin Liu, Hujun Bao, Juyong Zhang
Generating high-fidelity talking-head video that matches an input audio sequence is a challenging problem that has received considerable attention recently. In this paper, we address this problem with the aid of neural scene representation networks. Our method is completely different from existing methods that rely on intermediate representations like 2D landmarks or 3D face models to bridge the gap between audio input and video output. Specifically, the features of the input audio signal are directly fed into a conditional implicit function to generate a dynamic neural radiance field, from which a high-fidelity talking-head video corresponding to the audio signal is synthesized using volume rendering. Another advantage of our framework is that not only the head (with hair) region is synthesized, as in previous methods, but the upper body is also generated, via two individual neural radiance fields. Experimental results demonstrate that our novel framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
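A minimal sketch of an audio-conditioned implicit function: an MLP that maps a 3-D position (with positional encoding) plus an audio feature vector to colour and density, which a volume renderer would then integrate along rays. Layer sizes and the positional-encoding depth are illustrative, not AD-NeRF's actual configuration.

import torch
import torch.nn as nn

def positional_encoding(x, n_freqs=6):             # x: (..., 3)
    feats = [x]
    for i in range(n_freqs):
        feats += [torch.sin(2 ** i * x), torch.cos(2 ** i * x)]
    return torch.cat(feats, dim=-1)                 # (..., 3 * (1 + 2 * n_freqs))

class AudioConditionedField(nn.Module):
    def __init__(self, audio_dim=64, hidden=128):
        super().__init__()
        in_dim = 3 * (1 + 2 * 6) + audio_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                   # (r, g, b, density)
        )

    def forward(self, xyz, audio_feat):              # xyz: (N, 3), audio_feat: (audio_dim,)
        a = audio_feat.expand(xyz.size(0), -1)
        return self.mlp(torch.cat([positional_encoding(xyz), a], dim=-1))

field = AudioConditionedField()
print(field(torch.rand(1024, 3), torch.randn(64)).shape)   # torch.Size([1024, 4])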
AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis
— AK (@ak92501) March 23, 2021
pdf: https://t.co/Yo6ceVTFGQ
abs: https://t.co/JOnW9Ywua1 pic.twitter.com/SN3TmhLFka
20. Open Domain Question Answering over Tables via Dense Retrieval
Jonathan Herzig, Thomas Müller, Syrine Krichene, Julian Martin Eisenschlos
Recent advances in open-domain QA have led to strong models based on dense retrieval, but only focused on retrieving textual passages. In this work, we tackle open-domain QA over tables for the first time, and show that retrieval can be improved by a retriever designed to handle tabular context. We present an effective pre-training procedure for our retriever and improve retrieval quality with mined hard negatives. As relevant datasets are missing, we extract a subset of Natural Questions (Kwiatkowski et al., 2019) into a Table QA dataset. We find that our retriever improves retrieval results from 72.0 to 81.1 recall@10 and end-to-end QA results from 33.8 to 37.7 exact match, over a BERT based retriever.
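A small sketch of the evaluation metric quoted above, recall@k: the fraction of questions for which a gold table appears among the k highest-scoring tables. Inputs are dummy score matrices, not the paper's retriever outputs.

import numpy as np

def recall_at_k(scores, gold, k=10):
    # scores: (num_questions, num_tables) retriever scores
    # gold: (num_questions,) index of the gold table for each question
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == gold[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 5000))
gold = rng.integers(0, 5000, size=200)
print("recall@10 on random scores:", recall_at_k(scores, gold, k=10))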
1/4 Much focus is given to dense retrieval of textual passages, but how should we design retrievers for tables in the context of open domain QA?
— Jonathan Herzig (@jonherzig) March 23, 2021
New #NAACL2021 short paper: https://t.co/q67xNSOwxq
With @muelletm, Syrine Krichene and @eisenjulian
21. Language Models have a Moral Dimension
Patrick Schramowski, Cigdem Turan, Nico Andersen, Constantin Rothkopf, Kristian Kersting
Artificial writing is permeating our lives due to recent advances in large-scale, transformer-based language models (LMs) such as BERT, its variants, GPT-2/3, and others. Using them as pretrained models and fine-tuning them for specific tasks, researchers have extended the state of the art for many NLP tasks and shown that they not only capture linguistic knowledge but also retain general knowledge implicitly present in the data. These and other successes are exciting. Unfortunately, LMs trained on unfiltered text corpora suffer from degenerate and biased behaviour. While this is well established, we show that recent improvements of LMs also store ethical and moral values of the society and actually bring a "moral dimension" to the surface: the values are captured geometrically by a direction in the embedding space, reflecting well the agreement of phrases with social norms implicitly expressed in the training texts. This provides a path for attenuating or even preventing toxic degeneration in LMs. Since one can now rate the (non-)normativity of arbitrary phrases without explicitly training the LM for this task, the moral dimension can be used as a "moral compass" guiding (even other) LMs towards producing normative text, as we will show.
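A rough sketch of scoring phrases along such a "moral dimension": embed clearly positive and clearly negative seed phrases, take the leading principal direction of the contrasts in embedding space, and project new phrases onto it. The `embed` function is a placeholder for a real sentence encoder, and the seed lists, sign convention, and normalization are illustrative only, not the paper's procedure.

import numpy as np

def embed(phrase, dim=64):                         # placeholder sentence encoder
    rng = np.random.default_rng(abs(hash(phrase)) % 2**32)
    return rng.normal(size=dim)

positive = ["help people", "be honest", "protect the environment"]
negative = ["harm people", "steal money", "lie to your friends"]

pos = np.stack([embed(p) for p in positive])
neg = np.stack([embed(p) for p in negative])
diffs = pos - neg.mean(axis=0)                     # contrast each positive against the negative centroid

# Leading principal direction of the contrasts serves as the "moral direction".
_, _, vt = np.linalg.svd(diffs - diffs.mean(axis=0))
direction = vt[0]

for phrase in ["donate to charity", "destroy evidence"]:
    score = float(embed(phrase) @ direction)
    print(phrase, round(score, 3))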
Language Models have a Moral Dimension
— AK (@ak92501) March 23, 2021
pdf: https://t.co/QxnEQs4206
abs: https://t.co/tCM0qvIIeC pic.twitter.com/taPdzacN88
22. Functional Pearl: Witness Me — Constructive Arguments Must Be Guided with Concrete Witness
Hiromi Ishii
The beloved Curry-Howard correspondence tells us that types are intuitionistic propositions, and that in constructive mathematics a proof of a proposition can be seen as a kind of construction, or witness, conveying the information of the proposition. We demonstrate how useful this point of view is as a guiding principle for developing dependently-typed programs.
[2103.11751] Functional Pearl: Witness Me -- Constructive Arguments Must Be Guided with Concrete Witness
— スマートコン (@mr_konn) March 23, 2021
I've uploaded to arXiv a paper, submitted elsewhere, that distills the essence of our company's ambitious work with dependently-typed Haskell. Please have a look. https://t.co/ATUZTLNwnn
23. Catastrophic Forgetting in Deep Graph Networks: an Introductory Benchmark for Graph Classification
Antonio Carta, Andrea Cossu, Federico Errica, Davide Bacciu
In this work, we study the phenomenon of catastrophic forgetting in the graph representation learning scenario. The primary objective of the analysis is to understand whether classical continual learning techniques for flat and sequential data have a tangible impact on performance when applied to graph data. To do so, we experiment with a structure-agnostic model and a deep graph network in a robust and controlled environment on three different datasets. The benchmark is complemented by an investigation of the effect of structure-preserving regularization techniques on catastrophic forgetting. We find that replay is the most effective strategy so far, and it also benefits the most from the use of regularization. Our findings suggest interesting future research at the intersection of the continual learning and graph representation learning fields. Finally, we provide researchers with a flexible software framework to reproduce our results and carry out further experiments.
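A minimal sketch of the replay strategy highlighted above: keep a small reservoir of examples from earlier tasks and mix them into each new task's batches. Graph-specific details (the DGN architecture, the datasets) are omitted; the buffer logic itself is generic continual learning, and the training call is only indicated in a comment.

import random

class ReplayBuffer:
    def __init__(self, capacity=200, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        # Reservoir sampling keeps a uniform subset of everything seen so far.
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))

buffer = ReplayBuffer(capacity=5)
for task in range(3):
    task_data = [(task, i) for i in range(10)]    # stand-in for the graphs of task `task`
    for graph in task_data:
        # train_step(model, [graph] + buffer.sample(4))  # mix replayed graphs into the batch
        buffer.add(graph)
print("buffer after 3 tasks:", buffer.items)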
Really excited for this work! We perform a preliminary study on catastrophic forgetting for DGNs (deep graph nets). See you at WWW '21 GLB Workshop!
— Federico Errica (@federico_errica) March 23, 2021
Joint work with @Cossu94 @acarta7 and Davide Bacciu. https://t.co/sjz325bkX9