1. Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little
Koustuv Sinha, Robin Jia, Dieuwke Hupkes, Joelle Pineau, Adina Williams, Douwe Kiela
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and show that these models still achieve high accuracy after fine-tuning on many downstream tasks — including on tasks specifically designed to be challenging for models that ignore word order. Our models perform surprisingly well according to some parametric syntactic probes, indicating possible deficiencies in how we test representations for syntactic information. Overall, our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
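As a rough illustration of the corpus perturbation described in the abstract, the sketch below shuffles the words of each sentence independently. It assumes whitespace tokenization and a fixed random seed; the helper name `shuffle_sentence` is hypothetical, and the paper's actual shuffling procedure may differ in detail.

```python
import random
from collections import Counter


def shuffle_sentence(sentence: str, rng: random.Random) -> str:
    """Return the sentence with its whitespace-separated words in random order."""
    words = sentence.split()
    rng.shuffle(words)  # in-place permutation of the word list
    return " ".join(words)


rng = random.Random(0)  # fixed seed so the run is repeatable
original = "the quick brown fox jumps over the lazy dog"
shuffled = shuffle_sentence(original, rng)

# Word order is randomized, but the bag of words (unigram counts) is unchanged.
print(shuffled)
print(Counter(shuffled.split()) == Counter(original.split()))
```

The point of the check at the end is that shuffling leaves the sentence-level word counts intact, so any co-occurrence statistics within a sentence are still available to the model.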
[1/7] Excited to announce “Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little”. BERT gets high task scores due to its distributional prior rather than its ability to “discover the NLP pipeline”. https://t.co/2s0gq8vGFz #NLProc pic.twitter.com/yF3rCFgCc9
— Koustuv Sinha (@koustuvsinha) April 15, 2021
2. DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort
Yuxuan Zhang, Huan Ling, Jun Gao, Kangxue Yin, Jean-Francois Lafleche, Adela Barriuso, Antonio Torralba, Sanja Fidler
We introduce DatasetGAN: an automatic procedure to generate massive datasets of high-quality semantically segmented images requiring minimal human effort. Current deep networks are extremely data-hungry, benefiting from training on large-scale datasets, which are time-consuming to annotate. Our method relies on the power of recent GANs to generate realistic images. We show how the GAN latent code can be decoded to produce a semantic segmentation of the image. Training the decoder only needs a few labeled examples to generalize to the rest of the latent space, resulting in an infinite annotated dataset generator! These generated datasets can then be used for training any computer vision architecture just as real datasets are. As only a few images need to be manually segmented, it becomes possible to annotate images in extreme detail and generate datasets with rich object and part segmentations. To showcase the power of our approach, we generated datasets for 7 image segmentation tasks which include pixel-level labels for 34 human face parts, and 32 car parts. Our approach outperforms all semi-supervised baselines significantly and is on par with fully supervised methods, which in some cases require as much as 100x more annotated data than our method.
DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort
— AK (@ak92501) April 15, 2021
pdf: https://t.co/uDMbrDU7V8
abs: https://t.co/obT7dz6GqO pic.twitter.com/gQmmMimM6Z
3. Aligning Latent and Image Spaces to Connect the Unconnectable
Ivan Skorokhodov, Grigorii Sotnikov, Mohamed Elhoseiny
In this work, we develop a method to generate infinite high-resolution images with diverse and complex content. It is based on a perfectly equivariant generator with synchronous interpolations in the image and latent spaces. Latent codes, when sampled, are positioned on the coordinate grid, and each pixel is computed from an interpolation of the nearby style codes. We modify the AdaIN mechanism to work in such a setup and train the generator in an adversarial setting to produce images positioned between any two latent vectors. At test time, this allows for generating complex and diverse infinite images and connecting any two unrelated scenes into a single arbitrarily large panorama. Apart from that, we introduce LHQ: a new dataset of high-resolution nature landscapes. We test the approach on LHQ, LSUN Tower and LSUN Bridge and outperform the baselines by at least 4 times in terms of quality and diversity of the produced infinite images. The project page is located at https://universome.github.io/alis.
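The idea of computing each pixel's style from nearby latent codes can be pictured with a one-dimensional sketch. It assumes anchor codes placed at evenly spaced horizontal positions and plain linear interpolation between the two neighbours; the function name `interpolate_code`, the spacing, and the toy codes are hypothetical, and the paper's modified AdaIN mechanism is not modelled here.

```python
from typing import List


def interpolate_code(x: float, anchors: List[List[float]], spacing: float) -> List[float]:
    """Linearly interpolate between the two anchor codes surrounding position x.

    anchors[i] is assumed to sit at horizontal position i * spacing.
    At least two anchors are required.
    """
    left = int(x // spacing)
    left = max(0, min(left, len(anchors) - 2))  # clamp to a valid segment
    t = (x - left * spacing) / spacing           # 0 at the left anchor, 1 at the right
    t = max(0.0, min(t, 1.0))
    return [(1.0 - t) * a + t * b for a, b in zip(anchors[left], anchors[left + 1])]


# Three toy 2-dimensional codes placed 10 units apart.
anchors = [[0.0, 0.0], [2.0, 4.0], [4.0, 0.0]]
midpoint = interpolate_code(5.0, anchors, spacing=10.0)
at_anchor = interpolate_code(10.0, anchors, spacing=10.0)
print(midpoint, at_anchor)
```

Because the code at any position depends only on its two neighbours, new anchors can be appended to the right indefinitely without changing what was already generated to the left.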
Aligning Latent and Image Spaces to Connect the Unconnectable
— AK (@ak92501) April 15, 2021
pdf: https://t.co/2KB14lzCFq
abs: https://t.co/QP9g0kg0fw
project page: https://t.co/7HyH8Ew8pS pic.twitter.com/sOPnN73ImW
4. Modeling Framing in Immigration Discourse on Social Media
Julia Mendelsohn, Ceren Budak, David Jurgens
The framing of political issues can influence policy and public opinion. Even though the public plays a key role in creating and spreading frames, little is known about how ordinary people on social media frame political issues. By creating a new dataset of immigration-related tweets labeled for multiple framing typologies from political communication theory, we develop supervised models to detect frames. We demonstrate how users’ ideology and region impact framing choices, and how a message’s framing influences audience responses. We find that the more commonly-used issue-generic frames obscure important ideological and regional patterns that are only revealed by immigration-specific frames. Furthermore, frames oriented towards human interests, culture, and politics are associated with higher user engagement. This large-scale analysis of a complex social and linguistic phenomenon contributes to both NLP and social science research.
We know how politicians talk about political issues. But how about ordinary people, and why do they matter? My #NAACL2021 paper w/ @david__jurgens & @cerenbudak address these questions by analyzing the framing of immigration on Twitter.
— Julia Mendelsohn (@jmendelsohn2) April 15, 2021
Link: https://t.co/tVDdg27THk
🧵 (1/11) pic.twitter.com/W2TQUAhrg3
5. Sparse Attention with Linear Units
Biao Zhang, Ivan Titov, Rico Sennrich
Recently, it has been argued that encoder-decoder models can be made more interpretable by replacing the softmax function in the attention with its sparse variants. In this work, we introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU, and show that sparsity naturally emerges from such a formulation. Training stability is achieved with layer normalization with either a specialized initialization or an additional gating function. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. We apply ReLA to the Transformer and conduct experiments on five machine translation tasks. ReLA achieves translation performance comparable to several strong baselines, with training and decoding speed similar to that of the vanilla attention. Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment than recent sparsified softmax-based models. Intriguingly, ReLA heads also learn to attend to nothing (i.e. ‘switch off’) for some queries, which is not possible with sparsified softmax alternatives.
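A minimal sketch of the core substitution, assuming single-head scaled dot-product attention written in plain Python: the softmax over scores is replaced by a ReLU. The function name `relu_attention` and the toy inputs are hypothetical, and the layer normalization with specialized initialization or gating that the abstract mentions for training stability is deliberately left out.

```python
import math
from typing import List

Matrix = List[List[float]]


def relu_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Attention where ReLU replaces softmax over scaled dot-product scores."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # ReLU zeroes out negative scores, so weights are sparse and unnormalized.
        weights = [max(0.0, s) for s in scores]
        # Weighted sum of value vectors; all-zero weights yield a zero vector.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
queries = [[1.0, -1.0], [-1.0, -1.0]]
out = relu_attention(queries, keys, values)
print(out)
```

The second query has negative scores against both keys, so every weight is zero and the head effectively switches off for it, which a softmax cannot do because its weights always sum to one.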
Sparse Attention with Linear Units
— AK (@ak92501) April 15, 2021
pdf: https://t.co/g0Qsx4Clmr
abs: https://t.co/HP7UkNJW2W pic.twitter.com/8rPmVB4LH3
6. Large-Scale Self- and Semi-Supervised Learning for Speech Translation
Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau
In this paper, we improve speech translation (ST) through effectively leveraging large quantities of unlabeled speech and text data in different and complementary ways. We explore both pretraining and self-training by using the large Libri-Light speech audio corpus and language modeling with CommonCrawl. Our experiments improve over the previous state of the art by 2.6 BLEU on average on all four considered CoVoST 2 language pairs via a simple recipe of combining wav2vec 2.0 pretraining, a single iteration of self-training and decoding with a language model. Different to existing work, our approach does not leverage any other supervision than ST data. Code and models will be publicly released.
Large-Scale Self- and Semi-Supervised Learning for Speech Translation
— Aran Komatsuzaki (@arankomatsuzaki) April 15, 2021
Improves over the previous SotA by 2.6 BLEU on average with wav2vec 2.0, pretraining, a single iteration of self-training and decoding with a LM. https://t.co/okuZJDlwsA pic.twitter.com/RL6d9FSk5O
7. Towards a framework for evaluating the safety, acceptability and efficacy of AI systems for health: an initial synthesis
Jessica Morley, Caroline Morton, Kassandra Karpathakis, Mariarosaria Taddeo, Luciano Floridi
The potential presented by Artificial Intelligence (AI) for healthcare has long been recognised by the technical community. More recently, this potential has been recognised by policymakers, resulting in considerable public and private investment in the development of AI for healthcare across the globe. Despite this, apart from a few success stories, real-world implementation of AI systems in front-line healthcare has been limited. There are numerous reasons for this, but a main contributory factor is the lack of internationally accepted, or formalised, regulatory standards to assess AI safety, impact and effectiveness. This is a well-recognised problem with numerous ongoing research and policy projects to overcome it. Our intention here is to contribute to this problem-solving effort by seeking to set out a minimally viable framework for evaluating the safety, acceptability and efficacy of AI systems for healthcare. We do this by conducting a systematic search across Scopus, PubMed and Google Scholar to identify all the relevant literature published between January 1970 and November 2020 related to the evaluation of: output performance; efficacy; and real-world use of AI systems, and synthesising the key themes according to the stages of evaluation: pre-clinical (theoretical phase); exploratory phase; definitive phase; and post-market surveillance phase (monitoring). The result is a framework to guide AI system developers, policymakers, and regulators through a sufficient evaluation of an AI system designed for use in healthcare.
It's #preprint time! Here is my new paper with @dr_c_morton, Kassandra Karpathakis, @RosariaTaddeo & @Floridi - an initial synthesis of requirements set out in lit 4 evaluating AI CDS - designed to provide the theory 4 the many policy convos in this space: https://t.co/VgkC2fEP7o
— Jess Morley (@jessRmorley) April 15, 2021
8. Few-shot Image Generation via Cross-domain Correspondence
Utkarsh Ojha, Yijun Li, Jingwan Lu, Alexei A. Efros, Yong Jae Lee, Eli Shechtman, Richard Zhang
Training generative models, such as GANs, on a target domain containing limited examples (e.g., 10) can easily result in overfitting. In this work, we seek to utilize a large source domain for pretraining and transfer the diversity information from source to target. We propose to preserve the relative similarities and differences between instances in the source via a novel cross-domain distance consistency loss. To further reduce overfitting, we present an anchor-based strategy to encourage different levels of realism over different regions in the latent space. With extensive results in both photorealistic and non-photorealistic domains, we demonstrate qualitatively and quantitatively that our few-shot model automatically discovers correspondences between source and target domains and generates more diverse and realistic images than previous methods.
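One way to read "preserving relative similarities" is sketched below: for each instance in a batch, turn its cosine similarities to the other instances into a probability distribution, once with source-model features and once with target-model features, and penalize the divergence between the two. The softmax, the KL direction, and all names (`similarity_distribution`, `distance_consistency_loss`) are assumptions made for illustration, not the authors' exact formulation.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_distribution(feats: List[Vector], i: int) -> List[float]:
    """Softmax over the similarities of instance i to every other instance."""
    sims = [cosine(feats[i], f) for j, f in enumerate(feats) if j != i]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def distance_consistency_loss(source_feats: List[Vector], target_feats: List[Vector]) -> float:
    """Mean KL divergence between target and source similarity distributions."""
    loss = 0.0
    for i in range(len(source_feats)):
        p = similarity_distribution(target_feats, i)
        q = similarity_distribution(source_feats, i)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return loss / len(source_feats)


source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.01], [1.0, 0.02]]  # nearly identical outputs
print(distance_consistency_loss(source, source))     # identical structure
print(distance_consistency_loss(source, collapsed))  # structure lost
```

A target model whose outputs collapse toward a single mode flattens these distributions, so the loss grows when the diversity present in the source is lost.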
Few-shot Image Generation via Cross-domain Correspondence
— AK (@ak92501) April 15, 2021
pdf: https://t.co/dxbGIIjIu1
abs: https://t.co/uyDRTtTKED pic.twitter.com/IjM2vuGyT1
9. ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration
Junyu Chen, Yufan He, Eric C. Frey, Ye Li, Yong Du
In the last decade, convolutional neural networks (ConvNets) have dominated and achieved state-of-the-art performance in a variety of medical imaging applications. However, the performance of ConvNets is still limited by a lack of understanding of long-range spatial relations in an image. The recently proposed Vision Transformer (ViT) for image classification uses a purely self-attention-based model that learns long-range spatial relations to focus on the relevant parts of an image. Nevertheless, ViT emphasizes low-resolution features because of the consecutive downsamplings, resulting in a lack of detailed localization information and making it unsuitable for image registration. Recently, several ViT-based image segmentation methods have been combined with ConvNets to improve the recovery of detailed localization information. Inspired by them, we present ViT-V-Net, which bridges ViT and ConvNet to provide volumetric medical image registration. The experimental results presented here demonstrate that the proposed architecture achieves superior performance to several top-performing registration methods.
ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration
— AK (@ak92501) April 15, 2021
pdf: https://t.co/0KZZtzXOID
abs: https://t.co/lWJt1YBtyN
github: https://t.co/GdcCoDDGFH pic.twitter.com/OvpvBDNq5I
10. Non-autoregressive sequence-to-sequence voice conversion
Tomoki Hayashi, Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda
This paper proposes a novel voice conversion (VC) method based on non-autoregressive sequence-to-sequence (NAR-S2S) models. Inspired by the great success of NAR-S2S models such as FastSpeech in text-to-speech (TTS), we extend the FastSpeech2 model for the VC problem. We introduce the convolution-augmented Transformer (Conformer) instead of the Transformer, making it possible to capture both local and global context information from the input sequence. Furthermore, we extend variance predictors to variance converters to explicitly convert the source speaker’s prosody components such as pitch and energy into the target speaker. The experimental evaluation with the Japanese speaker dataset, which consists of male and female speakers of 1,000 utterances, demonstrates that the proposed model enables us to perform more stable, faster, and better conversion than autoregressive S2S (AR-S2S) models such as Tacotron2 and Transformer.
Non-autoregressive sequence-to-sequence voice conversion
— AK (@ak92501) April 15, 2021
pdf: https://t.co/0vHhEcjpge
abs: https://t.co/H0r3u60dIU
project page: https://t.co/GEx57qswG5 pic.twitter.com/RehOOzgxJF
11. Stereo Radiance Fields (SRF): Learning View Synthesis for Sparse Views of Novel Scenes
Julian Chibane, Aayush Bansal, Verica Lazova, Gerard Pons-Moll
Recent neural view synthesis methods have achieved impressive quality and realism, surpassing classical pipelines which rely on multi-view reconstruction. State-of-the-art methods, such as NeRF, are designed to learn a single scene with a neural network and require dense multi-view inputs. Testing on a new scene requires re-training from scratch, which takes 2-3 days. In this work, we introduce Stereo Radiance Fields (SRF), a neural view synthesis approach that is trained end-to-end, generalizes to new scenes, and requires only sparse views at test time. The core idea is a neural architecture inspired by classical multi-view stereo methods, which estimates surface points by finding similar image regions in stereo images. In SRF, we predict color and density for each 3D point given an encoding of its stereo correspondence in the input images. The encoding is implicitly learned by an ensemble of pair-wise similarities, emulating classical stereo. Experiments show that SRF learns structure instead of overfitting on a scene. We train on multiple scenes of the DTU dataset and generalize to new ones without re-training, requiring only 10 sparse and spread-out views as input. We show that 10-15 minutes of fine-tuning further improves the results, achieving significantly sharper, more detailed results than scene-specific models. The code, model, and videos are available at https://virtualhumans.mpi-inf.mpg.de/srf/.
Stereo Radiance Fields (SRF): Learning View Synthesis for Sparse Views of Novel Scenes
— AK (@ak92501) April 15, 2021
pdf: https://t.co/X7OKXw3aOx
abs: https://t.co/uTPcFm0SqU pic.twitter.com/7tpSKc3MI0
12. Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction
Wonkwang Lee, Whie Jung, Han Zhang, Ting Chen, Jing Yu Koh, Thomas Huang, Hyungsuk Yoon, Honglak Lee, Seunghoon Hong
Learning to predict the long-term future of video frames is notoriously challenging due to inherent ambiguities in the distant future and dramatic amplifications of prediction error through time. Despite the recent advances in the literature, existing approaches are limited to moderately short-term prediction (less than a few seconds), while extrapolating it to a longer future quickly leads to destruction in structure and content. In this work, we revisit hierarchical models in video prediction. Our method predicts future frames by first estimating a sequence of semantic structures and subsequently translating the structures to pixels by video-to-video translation. Despite the simplicity, we show that modeling structures and their dynamics in the discrete semantic structure space with a stochastic recurrent estimator leads to surprisingly successful long-term prediction. We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon (i.e., thousands of frames), setting a new standard of video prediction with orders of magnitude longer prediction time than existing approaches. Full videos and code are available at https://1konny.github.io/HVP/.
Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction
— AK (@ak92501) April 15, 2021
pdf: https://t.co/nFkwCFpKMh
abs: https://t.co/0hdrfUAyvw
project page: https://t.co/yOV5Tmzbny
github: https://t.co/EXUfBGW9r1 pic.twitter.com/tPFn9Xhiqw
13. Pose Recognition with Cascade Transformers
Ke Li, Shijie Wang, Xiang Zhang, Yifan Xu, Weijian Xu, Zhuowen Tu
In this paper, we present a regression-based pose recognition method using cascade Transformers. One way to categorize the existing approaches in this domain is to separate them into (1) heatmap-based and (2) regression-based. In general, heatmap-based methods achieve higher accuracy but are subject to various heuristic designs (mostly not end-to-end), whereas regression-based approaches attain relatively lower accuracy but have fewer intermediate non-differentiable steps. Here we utilize the encoder-decoder structure in Transformers to perform regression-based person and keypoint detection that is general-purpose and requires less heuristic design compared with the existing approaches. We demonstrate the keypoint hypothesis (query) refinement process across different self-attention layers to reveal the recursive self-attention mechanism in Transformers. In the experiments, we report competitive results for pose recognition when compared with the competing regression-based methods.
Pose Recognition with Cascade Transformers
— AK (@ak92501) April 15, 2021
pdf: https://t.co/Qciy1xWzbA
abs: https://t.co/x4bwnKOx9W pic.twitter.com/lLNwY4GTfJ
14. MS2: Multi-Document Summarization of Medical Studies
Jay DeYoung, Iz Beltagy, Madeleine van Zuylen, Bailey Keuhl, Lucy Lu Wang
To assess the effectiveness of any medical intervention, researchers must conduct a time-intensive and highly manual literature review. NLP systems can help to automate or assist in parts of this expensive process. In support of this goal, we release MS^2 (Multi-Document Summarization of Medical Studies), a dataset of over 470k documents and 20k summaries derived from the scientific literature. This dataset facilitates the development of systems that can assess and aggregate contradictory evidence across multiple studies, and is the first large-scale, publicly available multi-document summarization dataset in the biomedical domain. We experiment with a summarization system based on BART, with promising early results. We formulate our summarization inputs and targets in both free text and structured forms and modify a recently proposed metric to assess the quality of our system’s generated summaries. Data and models are available at https://github.com/allenai/ms2
15. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling
Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, Allan Hanbury
A vital step towards the widespread adoption of neural retrieval models is their resource efficiency throughout the training, indexing and query workflows. The neural IR community has recently made great advances in training effective dual-encoder dense retrieval (DR) models. A dense text retrieval model uses a single vector representation per query and passage to score a match, which enables low-latency first stage retrieval with a nearest neighbor search. Increasingly, training approaches require enormous compute power, as they either conduct negative passage sampling out of a continuously refreshing index or require very large batch sizes for in-batch negative sampling. Instead of relying on more compute capability, we introduce an efficient topic-aware query and balanced margin sampling technique, called TAS-Balanced. We cluster queries once before training and sample queries out of a cluster per batch. We train our lightweight 6-layer DR model with a novel dual-teacher supervision that combines pairwise and in-batch negative teachers. Our method is trainable on a single consumer-grade GPU in under 48 hours (as opposed to a common configuration of 8x V100s). We show that our TAS-Balanced training method achieves state-of-the-art low-latency (64ms per query) results on two TREC Deep Learning Track query sets. Evaluated on NDCG@10, we outperform BM25 by 44%, a plainly trained DR by 19%, docT5query by 11%, and the previous best DR model by 5%. Additionally, TAS-Balanced produces the first dense retriever that outperforms every other method on recall at any cutoff on TREC-DL and allows more resource intensive re-ranking models to operate on fewer passages to improve results further.
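The topic-aware part of the sampling ("cluster queries once before training and sample queries out of a cluster per batch") can be sketched as follows. The cluster assignments are assumed to be precomputed by some clustering step that is not shown; the function names and toy query IDs are hypothetical, and the balanced margin sampling of passage pairs is omitted.

```python
import random
from collections import defaultdict
from typing import Dict, List


def group_by_cluster(cluster_of: Dict[str, int]) -> Dict[int, List[str]]:
    """Invert a query -> cluster mapping into cluster -> list of queries."""
    groups = defaultdict(list)
    for query_id, cluster_id in cluster_of.items():
        groups[cluster_id].append(query_id)
    return dict(groups)


def sample_topic_batch(groups: Dict[int, List[str]], batch_size: int,
                       rng: random.Random) -> List[str]:
    """Pick one cluster at random and draw the whole batch from it."""
    cluster_id = rng.choice(sorted(groups))
    members = groups[cluster_id]
    k = min(batch_size, len(members))  # small clusters give smaller batches
    return rng.sample(members, k)


rng = random.Random(0)
# Toy assignment: cluster IDs would come from a one-off clustering of queries.
cluster_of = {"q1": 0, "q2": 0, "q3": 0, "q4": 1, "q5": 1, "q6": 1}
groups = group_by_cluster(cluster_of)
batch = sample_topic_batch(groups, batch_size=2, rng=rng)
print(batch)
```

Drawing a batch from a single cluster means the in-batch negatives are topically related to each query, which is presumably what makes them more informative than uniformly sampled ones.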
Can we efficiently train a very effective dense passage retriever? Yes, with Balanced Topic Aware Sampling! Let me introduce our #sigir2021 full paper: We compose batches based on query clusters and ...🧵
— Sebastian Hofstätter (@s_hofstaetter) April 15, 2021
w/ @jacklin_64 @mattjustram @lintool @allanhanbury https://t.co/sH4ToSL2l0 pic.twitter.com/N7zYCSuxHy
16. Can Active Learning Preemptively Mitigate Fairness Issues?
Frédéric Branchaud-Charron, Parmida Atighehchian, Pau Rodríguez, Grace Abuhamad, Alexandre Lacoste
Dataset bias is one of the prevailing causes of unfairness in machine learning. Addressing fairness at the data collection and dataset preparation stages therefore becomes an essential part of training fairer algorithms. In particular, active learning (AL) algorithms show promise for the task by drawing attention to the most informative training samples. However, the interaction between existing AL algorithms and algorithmic fairness remains under-explored. In this paper, we study whether models trained with uncertainty-based AL heuristics such as BALD are fairer in their decisions with respect to a protected class than those trained with independent and identically distributed (i.i.d.) sampling. We found that BALD significantly improves predictive parity while also improving accuracy compared to i.i.d. sampling. We also explore the interaction of algorithmic fairness methods such as gradient reversal (GRAD) and BALD. We found that, while addressing different fairness issues, their interaction further improves the results on most benchmarks and metrics we explored.
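For readers unfamiliar with BALD, the acquisition score is the mutual information between the prediction and the model parameters: the entropy of the averaged prediction minus the average entropy of the individual predictions. The sketch below computes it from a handful of stochastic forward passes (for example, Monte Carlo dropout samples); the function names and toy probabilities are illustrative assumptions, not code from the paper.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def bald_score(mc_probs: List[List[float]]) -> float:
    """mc_probs[t][c]: probability of class c from stochastic forward pass t."""
    n = len(mc_probs)
    mean_probs = [sum(col) / n for col in zip(*mc_probs)]
    # Entropy of the mean prediction minus the mean of per-pass entropies.
    return entropy(mean_probs) - sum(entropy(p) for p in mc_probs) / n


# Passes agree on an uncertain prediction: no disagreement, score is zero.
agree = [[0.5, 0.5], [0.5, 0.5]]
# Passes are individually confident but contradict each other: high score.
disagree = [[1.0, 0.0], [0.0, 1.0]]
print(bald_score(agree), bald_score(disagree))
```

Samples with a high score are the ones on which the sampled models disagree, and these are the samples an uncertainty-based AL loop would send for labelling first.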