1. A Survey of the State of Explainable AI for Natural Language Processing
Marina Danilevsky, Kun Qian, Ranit Aharonov, Yannis Katsis, Ban Kawas, Prithviraj Sen
Recent years have seen important advances in the quality of state-of-the-art models, but this has come at the expense of models becoming less interpretable. This survey presents an overview of the current state of Explainable AI (XAI), considered within the domain of Natural Language Processing (NLP). We discuss the main categorization of explanations, as well as the various ways explanations can be arrived at and visualized. We detail the operations and explainability techniques currently available for generating explanations for NLP model predictions, to serve as a resource for model developers in the community. Finally, we point out the current gaps and encourage directions for future work in this important research area.
A survey paper of explainable AI for NLP.
— elvis (@omarsar0) October 5, 2020
"explanations can help users of NLP-based AI systems build trust in these systems’ predictions... and may also allow users to provide useful feedback, which in turn can help developers improve model quality"https://t.co/aaEHwjspcL pic.twitter.com/1DX5IlTgkY
A Survey of the State of Explainable AI for Natural Language Processing. https://t.co/28yxttrHzA pic.twitter.com/MRYKytUXvs
— arxiv (@arxiv_org) October 5, 2020
2. Which *BERT? A Survey Organizing Contextualized Encoders
Patrick Xia, Shijie Wu, Benjamin Van Durme
Pretrained contextualized text encoders are now a staple of the NLP community. We present a survey on language representation learning with the aim of consolidating a series of shared lessons learned across a variety of recent efforts. While significant advancements continue at a rapid pace, we find that enough has now been discovered, in different directions, that we can begin to organize advances according to common themes. Through this organization, we highlight important considerations when interpreting recent contributions and choosing which model to use.
Which *BERT should you use? We (with @EzraWu and @ben_vandurme) surveyed as many of them as we could in this #EMNLP2020 paper https://t.co/fgLraDcevz. Takeaways: 1) Encoders can be expensive to use and may not be worth it if they only get you +0.2% ... (1/3)
— Patrick Xia (@nlpaxia) October 5, 2020
3. How to Motivate Your Dragon: Teaching Goal-Driven Agents to Speak and Act in Fantasy Worlds
Prithviraj Ammanabrolu, Jack Urbanek, Margaret Li, Arthur Szlam, Tim Rocktäschel, Jason Weston
We seek to create agents that both act and communicate with other agents in pursuit of a goal. Towards this end, we extend LIGHT (Urbanek et al. 2019)---a large-scale crowd-sourced fantasy text-game---with a dataset of quests. These contain natural language motivations paired with in-game goals and human demonstrations; completing a quest might require dialogue or actions (or both). We introduce a reinforcement learning system that (1) incorporates large-scale language modeling-based and commonsense reasoning-based pre-training to imbue the agent with relevant priors; and (2) leverages a factorized action space of action commands and dialogue, balancing between the two. We conduct zero-shot evaluations using held-out human expert demonstrations, showing that our agents are able to act consistently and talk naturally with respect to their motivations.
🚨New Paper Alert🚨
— Prithviraj Ammanabrolu (@rajammanabrolu) October 5, 2020
Having trouble keeping your (AI) dragon motivated? Same here. So we figured out how to teach it, interactively w/ RL & lang pretraining, to act consistently + talk naturally wrt its motivations when questing in a fantasy text game.https://t.co/Fiucf5wClj
1/4 pic.twitter.com/VY3EN1ogo2
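The factorized action space mentioned in the abstract can be pictured as a two-step choice: decide whether to act or to speak, then pick within the chosen head. The PyTorch-style sketch below is only an illustration of that reading; the class name, shapes, and the softmax switch are assumptions for exposition, not the authors' architecture.

import torch
import torch.nn as nn

class FactorizedPolicyHead(nn.Module):
    """Schematic two-headed policy: one head scores game actions, the other scores
    dialogue utterances, and a switch balances between the two (illustrative only)."""
    def __init__(self, state_dim, n_actions, n_utterances):
        super().__init__()
        self.switch = nn.Linear(state_dim, 2)             # act vs. talk
        self.action_head = nn.Linear(state_dim, n_actions)
        self.dialogue_head = nn.Linear(state_dim, n_utterances)

    def forward(self, state):
        mode = torch.softmax(self.switch(state), dim=-1)
        act_probs = torch.softmax(self.action_head(state), dim=-1)
        say_probs = torch.softmax(self.dialogue_head(state), dim=-1)
        # Joint distribution over the factorized space: P(mode) * P(choice | mode).
        return torch.cat([mode[..., :1] * act_probs, mode[..., 1:] * say_probs], dim=-1)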
4. Proof Repair Across Type Equivalences
Talia Ringer, RanDair Porter, Nathaniel Yazdani, John Leo, Dan Grossman
We describe a new approach to automatically repairing broken proofs in response to changes in type definitions in the Coq proof assistant. Our approach combines a configurable proof term transformation with a proof term to tactic script decompiler. The proof term transformation implements transport across certain equivalences in a way that is suitable for repair and does not rely on axioms beyond those Coq assumes. We have implemented this approach in PUMPKIN Pi, an extension to the PUMPKIN PATCH Coq plugin suite for proof repair. We have used PUMPKIN Pi to support a benchmark from a user study, ease development with dependent types, port functions and proofs between unary and binary natural numbers, and support an industrial proof engineer to more easily interoperate between Coq and other verification tools.
Check out the paper I just posted to arXiv: "Proof Repair Across Type Equivalences"!https://t.co/FI9VyYgR8p
— Talia Ringer (@TaliaRinger) October 5, 2020
5. Hard Negative Mixing for Contrastive Learning
Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, Diane Larlus
Contrastive learning has become a key component of self-supervised learning approaches for computer vision. By learning to embed two augmented versions of the same image close to each other and to push the embeddings of different images apart, one can train highly transferable visual representations. As revealed by recent studies, heavy data augmentation and large sets of negatives are both crucial in learning such representations. At the same time, data mixing strategies either at the image or the feature level improve both supervised and semi-supervised learning by synthesizing novel examples, forcing networks to learn more robust features. In this paper, we argue that an important aspect of contrastive learning, i.e., the effect of hard negatives, has so far been neglected. To get more meaningful negative samples, current top contrastive self-supervised learning approaches either substantially increase the batch sizes, or keep very large memory banks; increasing the memory size, however, leads to diminishing returns in terms of performance. We therefore start by delving deeper into a top-performing framework and show evidence that harder negatives are needed to facilitate better and faster learning. Based on these observations, and motivated by the success of data mixing, we propose hard negative mixing strategies at the feature level, that can be computed on-the-fly with a minimal computational overhead. We exhaustively ablate our approach on linear classification, object detection and instance segmentation and show that employing our hard negative mixing procedure improves the quality of visual representations learned by a state-of-the-art self-supervised learning method.
Our #NeurIPS2020 paper on mixing hard negatives for contrastive self-supervised learning is now public: https://t.co/Vzgil0LIzA Work with Bulent Sariyildiz, @Poyonoz, @WeinzaepfelP and @dlarlus Pre-trained models will be out very soon. pic.twitter.com/LsVZA0mos9
— Yannis Kalantidis (@skamalas) October 5, 2020
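The feature-level mixing described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch rendering of the core idea (synthesizing extra negatives as convex combinations of the hardest negatives in the memory queue); the function name, the mixing coefficients, and the counts are illustrative assumptions, not the authors' code.

import torch
import torch.nn.functional as F

def mix_hard_negatives(query, queue, n_hard=64, n_synth=32):
    """Sketch: synthesize extra negatives by mixing the hardest ones.

    query: (d,) L2-normalized embedding of the anchor.
    queue: (K, d) L2-normalized negative embeddings (memory bank).
    Returns (n_synth, d) synthetic hard negatives (illustrative only).
    """
    sims = queue @ query                       # similarity of each negative to the anchor
    hard = queue[sims.topk(n_hard).indices]    # the hardest (most similar) negatives
    i = torch.randint(0, n_hard, (n_synth,))
    j = torch.randint(0, n_hard, (n_synth,))
    alpha = torch.rand(n_synth, 1)             # random convex combination weights
    mixed = alpha * hard[i] + (1 - alpha) * hard[j]
    return F.normalize(mixed, dim=1)           # project back onto the unit sphere

The synthetic negatives would simply be appended to the real negatives inside the contrastive loss; the paper also explores mixing the query itself into the negatives to obtain even harder samples, which is omitted here.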
6. On Statistical Discrimination as a Failure of Social Learning: A Multi-Armed Bandit Approach
Junpei Komiyama, Shunya Noda
We analyze statistical discrimination using a multi-armed bandit model where myopic firms face candidate workers arriving with heterogeneous observable characteristics. The association between the worker's skill and characteristics is unknown ex ante; thus, firms need to learn it. In such an environment, laissez-faire may result in a highly unfair and inefficient outcome---myopic firms are reluctant to hire minority workers because the lack of data about minority workers prevents accurate estimation of their performance. Consequently, minority groups could be perpetually underestimated---they are never hired, and therefore, data about them is never accumulated. We prove that this problem becomes more serious when the population ratio is imbalanced, as is the case in many extant discrimination problems. We consider two affirmative-action policies for solving this dilemma: one is a subsidy rule based on the popular upper confidence bound algorithm, and the other is the Rooney Rule, which requires firms to interview at least one minority worker for each hiring opportunity. Our results indicate that temporary affirmative actions are effective for statistical discrimination caused by data insufficiency.
Our joint work with Junpei Komiyama (@jkomiyama_) of NYU Stern, "On Statistical Discrimination as a Failure of Social Learning: A Multi-Armed Bandit Approach", is now on arXiv! We analyze how statistical discrimination arises and persists as a result of social learning. https://t.co/SJVpW6o6NS
— Shunya Noda (@himagegine) October 5, 2020
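The failure mode described above (myopic firms never hiring the under-sampled group, so its skill is never learned) is easy to reproduce in a toy two-group bandit, and a UCB-style rule, which the paper's subsidy policy builds on, mitigates it. The snippet below is purely an illustrative simulation; the reward model, priors, and parameters are assumptions, not the paper's setup.

import numpy as np

rng = np.random.default_rng(0)

def simulate(policy, rounds=5000, true_skill=(0.6, 0.6)):
    """Toy model: arm 0 = majority pool, arm 1 = minority pool, identical true skill.
    A myopic (greedy) firm can lock onto one group and never learn about the other."""
    counts = np.zeros(2)        # hires per group
    means = np.full(2, 0.5)     # estimated skill per group
    for t in range(1, rounds + 1):
        if policy == "greedy":
            arm = int(np.argmax(means))
        else:  # UCB-style exploration bonus, analogous in spirit to a hiring subsidy
            bonus = np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
            arm = int(np.argmax(means + bonus))
        reward = rng.binomial(1, true_skill[arm])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts

print("greedy hires per group:", simulate("greedy"))
print("ucb    hires per group:", simulate("ucb"))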
7. Contrastive Learning of Medical Visual Representations from Paired Images and Text
Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, Curtis P. Langlotz
Learning visual representations of medical images is core to medical image understanding but its progress has been held back by the small size of hand-labeled datasets. Existing work commonly relies on transferring weights from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. We propose an alternative unsupervised strategy to learn medical visual representations directly from the naturally occurring pairing of images and textual data. Our method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic, and requires no additional expert input. We test our method by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that our method leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10% as much labeled training data as an ImageNet initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency.
👋 Excited to share our latest work "Contrastive Learning of Medical Visual Representations from Paired Images and Text".
— Yuhao Zhang (@yuhaozhangx) October 5, 2020
We propose a contrastive framework for learning visual representations of medical images from paired textual data.
arXiv: https://t.co/HhJxaxSWe1
👇 (1/7) pic.twitter.com/tuyA12iwoY
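The "bidirectional contrastive objective between the two modalities" amounts to a symmetric InfoNCE loss over batches of paired image and report embeddings. Below is a minimal sketch of such a loss, assuming both encoders output fixed-size vectors aligned row by row; the temperature and the equal weighting of the two directions are illustrative choices, not necessarily the paper's.

import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(img_emb, txt_emb, temperature=0.1):
    """img_emb, txt_emb: (N, d) embeddings of N paired images and reports.
    Row i of each tensor corresponds to the same study (the positive pair)."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)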
8. MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models
Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Raul Puri, Pascale Fung, Anima Anandkumar, Bryan Catanzaro
Existing pre-trained large language models have shown unparalleled generative capabilities. However, they are not controllable. In this paper, we propose MEGATRON-CNTRL, a novel framework that uses large-scale language models and adds control to text generation by incorporating an external knowledge base. Our framework consists of a keyword predictor, a knowledge retriever, a contextual knowledge ranker, and a conditional text generator. As we do not have access to ground-truth supervision for the knowledge ranker, we make use of weak supervision from sentence embedding. The empirical results show that our model generates more fluent, consistent, and coherent stories with less repetition and higher diversity compared to prior work on the ROC story dataset. We showcase the controllability of our model by replacing the keywords used to generate stories and re-running the generation process. Human evaluation results show that 77.5% of these stories are successfully controlled by the new keywords. Furthermore, by scaling our model from 124 million to 8.3 billion parameters we demonstrate that larger models improve both the quality of generation (from 74.5% to 93.0% for consistency) and controllability (from 77.5% to 91.5%).
MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models
— AK (@ak92501) October 5, 2020
pdf: https://t.co/NRR1mwIcvH
abs: https://t.co/YBCNFMCtd8 pic.twitter.com/CJaeINyo9b
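The abstract describes a four-stage pipeline: keyword predictor, knowledge retriever, contextual knowledge ranker, and conditional generator. The skeleton below only makes that control flow explicit; every interface here is a hypothetical placeholder for what is, in the paper, a large pretrained model.

def generate_next_sentence(story_context, knowledge_base,
                           keyword_predictor, retriever, ranker, generator,
                           control_keywords=None):
    """Schematic of the generation loop described in the abstract.
    All callables are hypothetical interfaces, not the released components."""
    # 1. Predict keywords for the next sentence, or take user-supplied ones,
    #    which is how controllability is exercised in the paper.
    keywords = control_keywords or keyword_predictor(story_context)
    # 2. Retrieve knowledge sentences related to the keywords.
    candidates = retriever(keywords, knowledge_base)
    # 3. Rank retrieved knowledge by relevance to the current context
    #    (trained with weak supervision from sentence embeddings).
    ranked = ranker(story_context, candidates)
    # 4. Condition the language model on the context plus top-ranked knowledge.
    return generator(story_context, ranked[:1])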
9. A Deeper Look at Discounting Mismatch in Actor-Critic Algorithms
Shangtong Zhang, Romain Laroche, Harm van Seijen, Shimon Whiteson, Remi Tachet des Combes
We investigate the discounting mismatch in actor-critic algorithm implementations from a representation learning perspective. Theoretically, actor-critic algorithms usually have discounting for both actor and critic, i.e., there is a γ^t term in the actor update for the transition observed at time t in a trajectory, and the critic is a discounted value function. Practitioners, however, usually ignore the discounting (γ^t) for the actor while using a discounted critic. We investigate this mismatch in two scenarios. In the first scenario, we consider optimizing an undiscounted objective (γ = 1), where γ^t disappears naturally. We then propose to interpret the discounting in the critic in terms of a bias-variance-representation trade-off and provide supporting empirical results. In the second scenario, we consider optimizing a discounted objective (γ < 1) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective and provide supporting empirical results.
Theories of actor-critic algos usually have \gamma^t in actor updates, which are often ignored by practitioners. Do you ever wonder why? We empirically investigate its benefits from a representation learning perspective https://t.co/ysI2U4KwRm @whi_rl @MSFTResearch
— Shangtong Zhang (@ShangtongZhang) October 5, 2020
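The mismatch is easiest to see in a single policy-gradient update. The sketch below contrasts the textbook actor loss, which weights the transition at time t by γ^t, with the common implementation that drops this factor while still using a discounted critic; it is a didactic illustration, not the paper's code.

import torch

def actor_loss(log_probs, advantages, gamma, use_gamma_t=True):
    """log_probs, advantages: tensors of shape (T,) for one trajectory.
    The advantages come from a discounted critic in both variants."""
    T = log_probs.shape[0]
    if use_gamma_t:
        # Theoretical update: the transition at time t is weighted by gamma**t.
        weights = gamma ** torch.arange(T, dtype=torch.float32)
    else:
        # Common practice: the gamma**t factor is silently dropped.
        weights = torch.ones(T)
    return -(weights * log_probs * advantages.detach()).mean()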
10. MultiCQA: Zero-Shot Transfer of Self-Supervised Text Matching Models on a Massive Scale
Andreas Rücklé, Jonas Pfeiffer, Iryna Gurevych
We study the zero-shot transfer capabilities of text matching models on a massive scale, by self-supervised training on 140 source domains from community question answering forums in English. We investigate the model performances on nine benchmarks of answer selection and question similarity tasks, and show that all 140 models transfer surprisingly well, where the large majority of models substantially outperform common IR baselines. We also demonstrate that considering a broad selection of source domains is crucial for obtaining the best zero-shot transfer performances, which contrasts with the standard procedure that merely relies on the largest and most similar domains. In addition, we extensively study how to best combine multiple source domains. We propose to incorporate self-supervised with supervised multi-task learning on all available source domains. Our best zero-shot transfer model considerably outperforms in-domain BERT and the previous state of the art on six benchmarks. Fine-tuning of our model with in-domain data results in additional large gains and achieves the new state of the art on all nine benchmarks.
I'm excited to share “MultiCQA”, accepted at @EMNLP2020
— Andreas Rücklé (@arueckle) October 5, 2020
We train 140 models on different domains and surprisingly find that neither domain similarity nor data size are critical factors for the best zero-shot transferability.https://t.co/JdPm3Nzwvr
\w @PfeiffJo IGurevych pic.twitter.com/vfhFmdJQa0
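Zero-shot transfer here means scoring target-domain candidates with a matching model trained only on other domains, with no in-domain fine-tuning. The sketch below shows that evaluation step for a generic bi-encoder; the encoder and the cosine-similarity scoring are placeholder assumptions rather than the exact MultiCQA setup.

import torch
import torch.nn.functional as F

def rank_answers_zero_shot(encode, question, candidate_answers):
    """encode: a hypothetical text encoder trained on out-of-domain forums.
    Returns candidate answers ordered by cosine similarity to the question."""
    q = F.normalize(encode(question), dim=-1)
    a = F.normalize(torch.stack([encode(c) for c in candidate_answers]), dim=-1)
    scores = a @ q
    order = torch.argsort(scores, descending=True)
    return [candidate_answers[i] for i in order.tolist()]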
11. Evaluating a Generative Adversarial Framework for Information Retrieval
Ameet Deshpande, Mitesh M. Khapra
Recent advances in Generative Adversarial Networks (GANs) have resulted in their widespread application across multiple domains. A recent model, IRGAN, applies this framework to Information Retrieval (IR) and has gained significant attention over the last few years. In this focused work, we critically analyze multiple components of IRGAN, while providing experimental and theoretical evidence of some of its shortcomings. Specifically, we identify issues with the constant baseline term in the policy gradients optimization and show that the generator harms IRGAN's performance. Motivated by our findings, we propose two models influenced by self-contrastive estimation and co-training which outperform IRGAN on two out of the three tasks considered.
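For reference, the policy-gradient estimator that the baseline analysis concerns has the standard REINFORCE form, stated here in generic textbook notation rather than as the authors' exact objective:

\nabla_\theta J(\theta) = \mathbb{E}_{d \sim p_\theta}\!\left[ (R(d) - b)\, \nabla_\theta \log p_\theta(d) \right]

Subtracting a baseline b that does not depend on the sampled document d leaves the gradient unbiased but changes its variance, which is why the particular constant chosen matters in practice.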
12. Hyperharmonic analysis for the study of high-order information-theoretic signals
Anibal M. Medina-Mardones, Fernando E. Rosas, Sebastián E. Rodríguez, Rodrigo Cofré
Network representations often cannot fully account for the structural richness of complex systems spanning multiple levels of organisation. Recently proposed high-order information-theoretic signals are well-suited to capture synergistic phenomena that transcend pairwise interactions; however, the exponential growth of their cardinality severely hinders their applicability. In this work, we combine methods from harmonic analysis and combinatorial topology to construct efficient representations of high-order information-theoretic signals. The core of our method is the diagonalisation of a discrete version of the Laplace-de Rham operator, which geometrically encodes structural properties of the system. We capitalise on these ideas by developing a complete workflow for the construction of hyperharmonic representations of high-order signals, which is applicable to a wide range of scenarios.
Preprint time:
— Fernando Rosas (@_fernando_rosas) October 5, 2020
“Hyperharmonic analysis for the study of high-order information-theoretic signals”https://t.co/nd2v0nDHQI
If you are looking for ways of doing Fourier analysis on high-order interdependency measures represented as weighted hypergraphs, this might be of interest!
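For orientation, a standard discrete counterpart of the Laplace-de Rham operator on a simplicial complex is the combinatorial Hodge Laplacian acting on k-cochains (whether this exact form matches the paper's construction is an assumption on my part):

L_k = \partial_{k+1} \partial_{k+1}^{\top} + \partial_k^{\top} \partial_k

Here \partial_k denotes the k-th boundary matrix; L_0 reduces to the familiar graph Laplacian, and expanding a signal in the eigenbasis of L_k is what generalizes graph Fourier analysis to higher-order structures.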
13. Hypergraph regularity and higher arity VC-dimension
Artem Chernikov, Henry Towsner
We generalize the fact that graphs with small VC-dimension can be approximated by rectangles, showing that hypergraphs with small VC_k-dimension (equivalently, omitting a fixed finite (k+1)-partite (k+1)-uniform hypergraph) can be approximated by k-ary cylinder sets. In the language of hypergraph regularity, this shows that when H is a k'-uniform hypergraph with small VC_k-dimension for some k < k', the decomposition of H given by hypergraph regularity only needs the first k levels---one can approximate H using sets of vertices, sets of pairs, and so on up to sets of k-tuples---and that on most of the resulting k-ary cylinder sets, the density of H is either close to 0 or close to 1. We also show a suitable converse: k'-uniform hypergraphs with large VC_k-dimension cannot have such approximations uniformly under all measures on the vertices.
🗣️ Our preprint with Henry Towsner @htowsner is out!
— Artem Chernikov (@archernikov) October 5, 2020
"Hypergraph regularity and higher arity VC-dimension"
https://t.co/tasnUn1Q1M
Phew this one took a minute. More details coming soon! pic.twitter.com/xhdLWPRn7L
14. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto
Entity representations are useful in natural language tasks involving entities. In this paper, we propose new pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed model treats words and entities in a given text as independent tokens, and outputs contextualized representations of them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering). Our source code and pretrained representations are available at https://github.com/studio-ousia/luke.
Our @emnlp2020 paper “LUKE: Deep Contextualized Entity Representations with Entity-aware
— Ikuya Yamada (@ikuyamada) October 5, 2020
Self-attention” is now available on arXiv! We present new pretrained contextualized representations that achieve SOTA on five datasets including SQuAD and CoNLL-2003.https://t.co/FrENNmtYZf
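The entity-aware self-attention in the abstract conditions the attention score on whether the attending and attended tokens are words or entities. One simplified, single-head reading of that idea, using a separate query projection per (query type, key type) pair, is sketched below; the tensor shapes and class are illustrative, not the released implementation.

import torch
import torch.nn as nn

class EntityAwareSelfAttention(nn.Module):
    """Simplified single-head sketch: the query projection depends on whether
    the attending and attended tokens are words (type 0) or entities (type 1)."""

    def __init__(self, dim):
        super().__init__()
        # One query matrix per (query_type, key_type) pair: w2w, w2e, e2w, e2e.
        self.q = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(4)])
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.dim = dim

    def forward(self, x, token_type):
        # x: (N, dim); token_type: (N,) LongTensor, 0 for words and 1 for entities.
        k, v = self.k(x), self.v(x)
        pair = token_type[:, None] * 2 + token_type[None, :]     # (N, N) in {0,1,2,3}
        queries = torch.stack([proj(x) for proj in self.q])      # (4, N, dim)
        scores = torch.einsum("pid,jd->pij", queries, k) / self.dim ** 0.5
        scores = scores.gather(0, pair.unsqueeze(0)).squeeze(0)  # pick per-pair query score
        return torch.softmax(scores, dim=-1) @ v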
15. Nearest Neighbor Machine Translation
Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis
We introduce k-nearest-neighbor machine translation (kNN-MT), which predicts tokens with a nearest neighbor classifier over a large datastore of cached examples, using representations from a neural translation model for similarity search. This approach requires no additional training and scales to give the decoder direct access to billions of examples at test time, resulting in a highly expressive model that consistently improves performance across many settings. Simply adding nearest neighbor search improves a state-of-the-art German-English translation model by 1.5 BLEU. kNN-MT allows a single model to be adapted to diverse domains by using a domain-specific datastore, improving results by an average of 9.2 BLEU over zero-shot transfer, and achieving new state-of-the-art results---without training on these domains. A massively multilingual model can also be specialized for particular language pairs, with improvements of 3 BLEU for translating from English into German and Chinese. Qualitatively, kNN-MT is easily interpretable; it combines source and target context to retrieve highly relevant examples.
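At each decoding step, kNN-MT interpolates the base model's next-token distribution with a distribution induced by the k nearest cached (decoder representation, target token) pairs. The NumPy sketch below illustrates that interpolation step; the distance kernel, temperature, and interpolation weight are simplified assumptions.

import numpy as np

def knn_mt_next_token_probs(model_probs, query, datastore_keys, datastore_values,
                            vocab_size, k=8, temperature=10.0, lam=0.5):
    """model_probs: (V,) next-token distribution from the base NMT model.
    query: (d,) current decoder representation.
    datastore_keys: (N, d) cached decoder representations; datastore_values: (N,)
    target-token ids recorded when the keys were produced."""
    dists = np.sum((datastore_keys - query) ** 2, axis=1)     # squared L2 distances
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] / temperature)
    weights /= weights.sum()
    knn_probs = np.zeros(vocab_size)
    np.add.at(knn_probs, datastore_values[nearest], weights)  # aggregate mass per token id
    return lam * knn_probs + (1 - lam) * model_probs          # interpolate the two distributions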
16. Bounding the forward classical capacity of bipartite quantum channels
Dawei Ding, Sumeet Khatri, Yihui Quek, Peter W. Shor, Xin Wang, Mark M. Wilde
We introduce various measures of forward classical communication for bipartite quantum channels. Since a point-to-point channel is a special case of a bipartite channel, the measures reduce to measures of classical communication for point-to-point channels. As it turns out, these reduced measures have been reported in prior work of Wang et al. on bounding the classical capacity of a quantum channel. As applications, we show that the measures are upper bounds on the forward classical capacity of a bipartite channel. The reduced measures are upper bounds on the classical capacity of a point-to-point quantum channel assisted by a classical feedback channel. Some of the various measures can be computed by semi-definite programming.
"Bounding the forward classical capacity of bipartite quantum channels," in collaboration with David Ding, @SumeetKhatri6, @quekpottheories, @PeterShor1, and @wangxinfelix https://t.co/UGS07JSgT1
— Mark M. Wilde (@markwilde) October 5, 2020