1. Human-Level Performance in No-Press Diplomacy via Equilibrium Search
Jonathan Gray, Adam Lerer, Anton Bakhtin, Noam Brown
Prior AI breakthroughs in complex games have focused on either the purely adversarial or purely cooperative settings. In contrast, Diplomacy is a game of shifting alliances that involves both cooperation and competition. For this reason, Diplomacy has proven to be a formidable research challenge. In this paper we describe an agent for the no-press variant of Diplomacy that combines supervised learning on human data with one-step lookahead search via external regret minimization. External regret minimization techniques have been behind previous AI successes in adversarial games, most notably poker, but have not previously been shown to be successful in large-scale games involving cooperation. We show that our agent greatly exceeds the performance of past no-press Diplomacy bots, is unexploitable by expert humans, and achieves a rank of 23 out of 1,128 human players when playing anonymous games on a popular Diplomacy website.
@jgrayatwork @adamlerer @anton_bakhtin and I are thrilled to share our latest work: a human-level no-press Diplomacy bot! Unlike prior AI benchmarks, Diplomacy involves a complex mix of both cooperation and competition. Thanks @webdiplomacy for your help! https://t.co/QUzxODxEtn pic.twitter.com/htSoLCCqEo
— Noam Brown (@polynoamial) October 8, 2020
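The one-step lookahead search in this work hinges on external regret minimization over each player's candidate actions. As a loose, self-contained illustration of that primitive (regret matching on a toy two-player zero-sum matrix game, not the authors' Diplomacy agent, which operates on actions and values sampled from its policy and value networks), here is a minimal Python sketch:

```python
import numpy as np

def regret_matching(payoffs, iters=10_000):
    """Regret matching, a standard external-regret-minimization procedure.

    `payoffs[i, j]` is the row player's utility in a two-player zero-sum
    matrix game; the column player receives -payoffs[i, j]. The averaged
    strategies approach a Nash equilibrium in this zero-sum setting.
    """
    n_rows, n_cols = payoffs.shape
    regret_row, regret_col = np.zeros(n_rows), np.zeros(n_cols)
    avg_row, avg_col = np.zeros(n_rows), np.zeros(n_cols)

    def strategy(regret):
        pos = np.maximum(regret, 0.0)
        return pos / pos.sum() if pos.sum() > 0 else np.full(regret.size, 1.0 / regret.size)

    for _ in range(iters):
        p_row, p_col = strategy(regret_row), strategy(regret_col)
        u_row = payoffs @ p_col           # value of each row action vs. the column mix
        u_col = -(p_row @ payoffs)        # value of each column action vs. the row mix
        regret_row += u_row - p_row @ u_row
        regret_col += u_col - p_col @ u_col
        avg_row += p_row
        avg_col += p_col

    return avg_row / iters, avg_col / iters

# Toy check: rock-paper-scissors, where both average strategies approach uniform.
rps = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
print(regret_matching(rps))
```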
2. WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization
Faisal Ladhak, Esin Durmus, Claire Cardie, Kathleen McKeown
We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high-quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article. As a set of baselines for further studies, we evaluate the performance of existing cross-lingual abstractive summarization methods on our dataset. We further propose a method for direct cross-lingual summarization (i.e., without requiring translation at inference time) by leveraging synthetic data and Neural Machine Translation as a pre-training step. Our method significantly outperforms the baseline approaches, while being more cost-efficient during inference.
Our EMNLP Findings paper presenting WikiLingua — a new multilingual abstractive summarization dataset — is now on arXiv.
It contains 770K article/summary pairs in 17 languages, parallel with English.
Paper: https://t.co/gvYYZOaW5q
Dataset: https://t.co/HfwGwJheUr #NLProc
— Faisal Ladhak (@faisalladhak) October 8, 2020
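The gold-standard alignments come from matching the images attached to each how-to step across language editions. A minimal sketch of that matching step, assuming a hypothetical per-step record format rather than the released dataset schema:

```python
def align_steps(steps_en, steps_xx):
    """Pair up how-to steps across two languages via their shared step images.

    Each input is a list of dicts such as
    {"image": "step_3.jpg", "summary": "...", "paragraph": "..."}
    (an illustrative schema, not WikiLingua's actual file format).
    Returns (English step, other-language step) pairs with matching images.
    """
    by_image = {step["image"]: step for step in steps_xx}
    return [(step, by_image[step["image"]])
            for step in steps_en if step["image"] in by_image]
```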
3. CATBERT: Context-Aware Tiny BERT for Detecting Social Engineering Emails
Younghoo Lee, Joshua Saxe, Richard Harang
Targeted phishing emails are on the rise and facilitate the theft of billions of dollars a year from organizations. While malicious signals from attached files or malicious URLs in emails can be detected by conventional malware signatures or machine learning technologies, it is challenging to identify hand-crafted social engineering emails that contain no malicious code and share no word choices with known attacks. To tackle this problem, we fine-tune a pre-trained BERT model by replacing half of the Transformer blocks with simple adapters to efficiently learn sophisticated representations of the syntax and semantics of natural language. Our context-aware network also learns joint representations of the email’s content and context features extracted from email headers. Our CatBERT (Context-Aware Tiny BERT) achieves an 87% detection rate at a 1% false positive rate, compared to DistilBERT, LSTM, and logistic regression baselines, which achieve 83%, 79%, and 54% detection rates, respectively. Our model is also faster than competing transformer approaches and is resilient to adversarial attacks that deliberately replace keywords with typos or synonyms.
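As a rough sketch of the two ingredients named in the abstract, a bottleneck adapter that stands in for some of the fine-tuned Transformer blocks and a head that combines the text representation with header-derived context features, here is some illustrative PyTorch (layer sizes and module names are assumptions, not the paper's):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual.
    Lightweight modules like this let only a small number of parameters be
    fine-tuned; the hidden/bottleneck sizes here are illustrative."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class ContextAwareHead(nn.Module):
    """Concatenate the [CLS] text vector with numeric context features derived
    from email headers before the final phishing/benign score."""
    def __init__(self, hidden=768, n_context=16):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden + n_context, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, cls_vec, context_feats):
        return self.classifier(torch.cat([cls_vec, context_feats], dim=-1))
```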
4. VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics
Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo
In this paper, we propose a non-parallel any-to-many voice conversion (VC) method termed VoiceGrad. Inspired by WaveGrad, a recently introduced waveform generation method, VoiceGrad is based on the concepts of score matching and Langevin dynamics. It uses weighted denoising score matching to train a score approximator: a fully convolutional network with a U-Net structure designed to predict the gradient of the log density of the speech feature sequences of multiple speakers. VC is then performed with annealed Langevin dynamics, which iteratively updates an input feature sequence towards the nearest stationary point of the target distribution using the trained score approximator network. Thanks to this formulation, VoiceGrad enables any-to-many VC, a scenario in which the speaker of the input speech can be arbitrary, and allows for non-parallel training, which requires no parallel utterances or transcriptions.
``VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics. (arXiv:2010.02977v1 [https://t.co/mPAjnto8C8]),'' Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, https://t.co/orZds6cDjP
— arXiv Sound (@ArxivSound) October 8, 2020
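The conversion step is annealed Langevin dynamics driven by the trained score network. A hedged sketch of that sampling loop in PyTorch, assuming a score network with the made-up interface `score_net(x, speaker_id, sigma)` and placeholder step sizes:

```python
import torch

@torch.no_grad()
def annealed_langevin_vc(score_net, x, target_id, sigmas, steps_per_sigma=100, eps=2e-5):
    """Annealed Langevin dynamics for voice conversion (illustrative only).

    `x` is the source speaker's feature sequence used as the starting point,
    `sigmas` is a decreasing noise schedule, and `score_net` is assumed to
    return the estimated gradient of the log density of the target speaker's
    features. Interface and constants are placeholders, not VoiceGrad's.
    """
    for sigma in sigmas:
        step = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_sigma):
            noise = torch.randn_like(x)
            score = score_net(x, target_id, sigma)
            x = x + 0.5 * step * score + (step ** 0.5) * noise
    return x
```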
5. WER we are and WER we think we are
Piotr Szymański, Piotr Żelasko, Mikolaj Morzy, Adrian Szymczak, Marzena Żyła-Hoppe, Joanna Banaszczak, Lukasz Augustyniak, Jan Mizgajski, Yishay Carmiel
Natural language processing of conversational speech requires the availability of high-quality transcripts. In this paper, we express our skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets. We outline several problems with popular benchmarks and compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and on the public HUB’05 benchmark. We show that WERs are significantly higher than the best reported results. We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high-quality annotations for training and testing of robust ASR systems.
Our research, authored with @niedakh and the team from Avaya Poznań R&D, titled "WER we are and WER we think we are" has been accepted to Findings of EMNLP 2020. We discuss, and reject, the - somewhat common - misconception that ASR is a solved task.
🔗 https://t.co/PCUiC8Ib5s
— Piotr Żelasko (@PiotrZelasko) October 8, 2020
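For reference, the metric under discussion is simply the word-level edit distance normalized by the reference length; a small self-contained Python implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic program over edit operations.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```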
6. Less is more: Faster and better music version identification with embedding distillation
Furkan Yesiler, Joan Serrà, Emilia Gómez
Version identification systems aim to detect different renditions of the same underlying musical composition (loosely called cover songs). By learning to encode entire recordings into plain vector embeddings, recent systems have made significant progress in bridging the gap between accuracy and scalability, which has been a key challenge for nearly two decades. In this work, we propose to further narrow this gap by employing a set of data distillation techniques that reduce the embedding dimensionality of a pre-trained state-of-the-art model. We compare a wide range of techniques and propose new ones, from classical dimensionality reduction to more sophisticated distillation schemes. With those, we obtain 99% smaller embeddings that, moreover, yield up to a 3% accuracy increase. Such small embeddings can have an important impact on retrieval time, up to the point of making a real-world system practical on a standalone laptop.
Do your models perform better using larger embeddings? With embedding distillation, we show how to use models w/ large embeddings to train ones w/ smaller embeddings and keep the accuracy.
w/ @serrjoa and @emiliagogu
Paper: https://t.co/ukWTaSV1fY
Code: https://t.co/9tgXuhaHm8
— Furkan Yesiler (@furkanyesiler) October 8, 2020
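The abstract spans techniques from classical dimensionality reduction to learned distillation schemes. As a minimal illustration of the classical end of that range (a baseline, not the paper's best-performing scheme), here is a PCA-style reduction of pre-trained embeddings in NumPy:

```python
import numpy as np

def pca_reduce(embeddings, out_dim=16):
    """Reduce (n_tracks, d) pre-trained version embeddings to `out_dim`
    dimensions by projecting onto the top principal directions."""
    mean = embeddings.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    components = vt[:out_dim]
    return (embeddings - mean) @ components.T, components, mean

# Toy usage: random 2048-d vectors stand in for a real model's embeddings.
emb = np.random.randn(100, 2048)
small, components, mean = pca_reduce(emb, out_dim=16)
print(small.shape)  # (100, 16) -> much cheaper nearest-neighbour retrieval
```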
7. Plug and Play Autoencoders for Conditional Text Generation
Florian Mai, Nikolaos Pappas, Ivan Montero, Noah A. Smith, James Henderson
Text autoencoders are commonly used for conditional generation tasks such as style transfer. We propose plug-and-play methods in which any pretrained autoencoder can be used and which only require learning a mapping within the autoencoder’s embedding space, training embedding-to-embedding (Emb2Emb). This reduces the need for labeled training data for the task and makes the training procedure more efficient. Crucial to the success of this method are a loss term that keeps the mapped embedding on the manifold of the autoencoder and a mapping that is trained to navigate the manifold by learning offset vectors. Evaluations on style transfer tasks both with and without sequence-to-sequence supervision show that our method performs better than or comparably to strong baselines while being up to four times faster.
Why not pretrain text autoencoders (like we do language models) and then treat text-to-text tasks as regression in continuous "code" space? New EMNLP paper by @_florianmai @nik0spapp Ivan Montero @nlpnoah and "neural" NLP visionary @JamieBHenderson https://t.co/CKTTwGaYhF
— Noah A Smith (@nlpnoah) October 8, 2020
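A hedged sketch of the mapping idea in PyTorch: an offset-style mapping in the autoencoder's embedding space plus a combined objective with a stand-in manifold term (the actual method's adversarial manifold loss and training details differ; all sizes and names here are assumptions):

```python
import torch
import torch.nn as nn

class OffsetMapping(nn.Module):
    """Map embedding-to-embedding by predicting an offset and adding it to the
    input, so the mapping learns to move along the autoencoder's manifold
    rather than regress absolute positions (sizes are illustrative)."""
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, z):
        return z + self.mlp(z)

def emb2emb_loss(z_pred, z_target, manifold_critic, lam=0.5):
    """Task term (match a target embedding) plus a simplified manifold term:
    a critic scoring how plausible z_pred is as an autoencoder embedding."""
    task = nn.functional.mse_loss(z_pred, z_target)
    on_manifold = -manifold_critic(z_pred).mean()
    return task + lam * on_manifold
```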
8. Inductive Entity Representations from Text via Link Prediction
Daniel Daza, Michael Cochez, Paul Groth
We present a method for learning entity representations that uses a Transformer-based architecture as an entity encoder and link prediction training on a knowledge graph with textual entity descriptions. We demonstrate that our approach can be applied effectively for link prediction in different inductive settings involving entities not seen during training, outperforming related state-of-the-art methods (22% MRR improvement on average). We provide evidence that the learned representations transfer to other tasks without fine-tuning the entity encoder. In an entity classification task, we obtain an average improvement of 16% in accuracy compared with baselines that also employ pre-trained models. For an information retrieval task, we obtain significant improvements of up to 8.8% in NDCG@10 for natural language queries.
How far can we get with a language model fine-tuned for link prediction? We show that we can learn useful representations of unseen entities, that transfer to classification and retrieval tasks without fine-tuning.
Preprint https://t.co/YC59d3oIEl with @michaelcochez and @pgroth pic.twitter.com/7QKDZXVn6q
— Daniel Daza (@danieldazac) October 8, 2020
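A compact sketch of the two pieces described above, an entity encoder over textual descriptions and a relational scoring function for link prediction, assuming a HuggingFace-style encoder/tokenizer pair and using a TransE-style score purely for illustration:

```python
import torch

def encode_entities(encoder, tokenizer, descriptions):
    """Embed textual entity descriptions with a BERT-style encoder, taking the
    [CLS] vector as the entity representation (the paper's pooling/projection
    details may differ; the interface assumed here is HuggingFace-style)."""
    batch = tokenizer(descriptions, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]

def transe_score(head, relation, tail):
    """TransE-style link prediction score: -||h + r - t||_1, a common
    relational scoring function used here for illustration."""
    return -torch.norm(head + relation - tail, p=1, dim=-1)
```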
9. Empirical Frequentist Coverage of Deep Learning Uncertainty Quantification Procedures
Benjamin Kompa, Jasper Snoek, Andrew Beam
Uncertainty quantification for complex deep learning models is increasingly important as these techniques see growing use in high-stakes, real-world settings. Currently, the quality of a model’s uncertainty is evaluated using point-prediction metrics such as negative log-likelihood or the Brier score on held-out data. In this study, we provide the first large-scale evaluation of the empirical frequentist coverage properties of well-known uncertainty quantification techniques on a suite of regression and classification tasks. We find that, in general, some methods do achieve desirable coverage properties on in-distribution samples, but that coverage is not maintained on out-of-distribution data. Our results demonstrate the failings of current uncertainty quantification techniques as dataset shift increases and establish coverage as an important metric in developing models for real-world applications.
How often do intervals from popular uncertainty quantification (UQ) methods actually contain the observed value? @BenKompa, @latentjasper, and I investigated this (known in the stats lit as "coverage") for several popular methods.
Paper: https://t.co/YPHUA6mLON
Details 👇 pic.twitter.com/HxYNNMjDMA
— Andrew Beam (@AndrewLBeam) October 8, 2020
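For concreteness, the quantity being measured is just the fraction of held-out targets that land inside their predicted intervals, usually read together with interval width; a small NumPy sketch with hypothetical Gaussian predictions:

```python
import numpy as np

def empirical_coverage(lower, upper, y_true):
    """Fraction of held-out targets falling inside their predicted intervals."""
    lower, upper, y_true = map(np.asarray, (lower, upper, y_true))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

def mean_interval_width(lower, upper):
    """Average width: trivially wide intervals can hit nominal coverage, so
    coverage is best reported alongside width."""
    return float(np.mean(np.asarray(upper) - np.asarray(lower)))

# Toy usage with a hypothetical, well-calibrated Gaussian predictor.
mu, sigma = np.zeros(10_000), np.ones(10_000)
y = np.random.randn(10_000)
lo, hi = mu - 1.96 * sigma, mu + 1.96 * sigma      # nominal 95% intervals
print(empirical_coverage(lo, hi, y), mean_interval_width(lo, hi))  # ≈ 0.95, 3.92
```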
10. On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment
Zirui Wang, Zachary C. Lipton, Yulia Tsvetkov
Modern multilingual models are trained on concatenated text from multiple languages in hopes of conferring benefits to each (positive transfer), with the most pronounced benefits accruing to low-resource languages. However, recent work has shown that this approach can degrade performance on high-resource languages, a phenomenon known as negative interference. In this paper, we present the first systematic study of negative interference. We show that, contrary to previous belief, negative interference also impacts low-resource languages. While parameters are maximally shared to learn language-universal structures, we demonstrate that language-specific parameters do exist in multilingual models and that they are a potential cause of negative interference. Motivated by these observations, we also present a meta-learning algorithm that obtains better cross-lingual transferability and alleviates negative interference, by adding language-specific layers as meta-parameters and training them in a manner that explicitly improves the shared layers’ generalization on all languages. Overall, our results show that negative interference is more common than previously known, suggesting new directions for improving multilingual representations.
In multilingual models, negative interference between languages can degrade performance.
Our #EMNLP2020 paper studies its causes and proposes a meta-learning treatment: https://t.co/yw1DcXVeLz
Joint work with Zachary Lipton (@zacharylipton) and Yulia Tsvetkov! 1/4 pic.twitter.com/Gq3O8sFg7P
— Zirui Wang (@MrZiruiWang) October 8, 2020
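A first-order, heavily simplified sketch of the meta-learning idea in PyTorch: adapt the language-specific parameters on one language, then update the shared parameters so they still generalize after that adaptation. The function names, the parameter split, and the single inner step are all assumptions for illustration, not the paper's algorithm as published:

```python
import torch

def meta_step(shared, lang_specific, batches, loss_fn, inner_lr=1e-3, outer_lr=1e-4):
    """One first-order meta-update (illustrative only).

    `shared` and `lang_specific` are lists of tensors with requires_grad=True,
    `batches` maps at least two language codes to training batches, and
    `loss_fn(shared, lang_specific, batch)` returns a scalar loss.
    """
    langs = list(batches)
    inner_lang, outer_lang = langs[0], langs[1]

    # Inner step: adapt the language-specific layers on the first language.
    inner_loss = loss_fn(shared, lang_specific, batches[inner_lang])
    grads = torch.autograd.grad(inner_loss, lang_specific)
    adapted = [p - inner_lr * g for p, g in zip(lang_specific, grads)]

    # Outer step: update the shared layers for generalization to the other language.
    outer_loss = loss_fn(shared, adapted, batches[outer_lang])
    outer_grads = torch.autograd.grad(outer_loss, shared)
    with torch.no_grad():
        for p, g in zip(shared, outer_grads):
            p -= outer_lr * g
    return float(outer_loss)
```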