1. Human-Level Performance in No-Press Diplomacy via Equilibrium Search
Jonathan Gray, Adam Lerer, Anton Bakhtin, Noam Brown
Prior AI breakthroughs in complex games have focused on either the purely adversarial or purely cooperative settings. In contrast, Diplomacy is a game of shifting alliances that involves both cooperation and competition. For this reason, Diplomacy has proven to be a formidable research challenge. In this paper we describe an agent for the no-press variant of Diplomacy that combines supervised learning on human data with one-step lookahead search via external regret minimization. External regret minimization techniques have been behind previous AI successes in adversarial games, most notably poker, but have not previously been shown to be successful in large-scale games involving cooperation. We show that our agent greatly exceeds the performance of past no-press Diplomacy bots, is unexploitable by expert humans, and achieves a rank of 23 out of 1,128 human players when playing anonymous games on a popular Diplomacy website.
@jgrayatwork @adamlerer @anton_bakhtin and I are thrilled to share our latest work: a human-level no-press Diplomacy bot! Unlike prior AI benchmarks, Diplomacy involves a complex mix of both cooperation and competition. Thanks @webdiplomacy for your help! https://t.co/QUzxODxEtn pic.twitter.com/htSoLCCqEo
— Noam Brown (@polynoamial) October 8, 2020
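The one-step lookahead search in this work hinges on external regret minimization over each player's candidate actions. As a loose, self-contained illustration of that primitive (regret matching on a toy two-player zero-sum matrix game, not the authors' Diplomacy agent, which operates on actions and values sampled from its policy and value networks), here is a minimal Python sketch:

```python
import numpy as np

def regret_matching(payoffs, iters=10_000):
    """Regret matching, a standard external-regret-minimization procedure.

    `payoffs[i, j]` is the row player's utility in a two-player zero-sum
    matrix game; the column player receives -payoffs[i, j]. The averaged
    strategies approach a Nash equilibrium in this zero-sum setting.
    """
    n_rows, n_cols = payoffs.shape
    regret_row, regret_col = np.zeros(n_rows), np.zeros(n_cols)
    avg_row, avg_col = np.zeros(n_rows), np.zeros(n_cols)

    def strategy(regret):
        pos = np.maximum(regret, 0.0)
        return pos / pos.sum() if pos.sum() > 0 else np.full(regret.size, 1.0 / regret.size)

    for _ in range(iters):
        p_row, p_col = strategy(regret_row), strategy(regret_col)
        u_row = payoffs @ p_col           # value of each row action vs. the column mix
        u_col = -(p_row @ payoffs)        # value of each column action vs. the row mix
        regret_row += u_row - p_row @ u_row
        regret_col += u_col - p_col @ u_col
        avg_row += p_row
        avg_col += p_col

    return avg_row / iters, avg_col / iters

# Toy check: rock-paper-scissors, where both average strategies approach uniform.
rps = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
print(regret_matching(rps))
```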
2. WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization
Faisal Ladhak, Esin Durmus, Claire Cardie, Kathleen McKeown
We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high-quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article. As a set of baselines for further studies, we evaluate the performance of existing cross-lingual abstractive summarization methods on our dataset. We further propose a method for direct cross-lingual summarization (i.e., without requiring translation at inference time) by leveraging synthetic data and Neural Machine Translation as a pre-training step. Our method significantly outperforms the baseline approaches, while being more cost-efficient during inference.
Our EMNLP Findings paper presenting WikiLingua — a new multilingual abstractive summarization dataset — is now on arXiv.
It contains 770K article/summary pairs in 17 languages, parallel with English.
Paper: https://t.co/gvYYZOaW5q
Dataset: https://t.co/HfwGwJheUr #NLProc
— Faisal Ladhak (@faisalladhak) October 8, 2020
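The gold-standard alignments come from matching the images attached to each how-to step across language editions. A minimal sketch of that matching step, assuming a hypothetical per-step record format rather than the released dataset schema:

```python
def align_steps(steps_en, steps_xx):
    """Pair up how-to steps across two languages via their shared step images.

    Each input is a list of dicts such as
    {"image": "step_3.jpg", "summary": "...", "paragraph": "..."}
    (an illustrative schema, not WikiLingua's actual file format).
    Returns (English step, other-language step) pairs with matching images.
    """
    by_image = {step["image"]: step for step in steps_xx}
    return [(step, by_image[step["image"]])
            for step in steps_en if step["image"] in by_image]
```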
3. CATBERT: Context-Aware Tiny BERT for Detecting Social Engineering Emails
Younghoo Lee, Joshua Saxe, Richard Harang
Targeted phishing emails are on the rise and facilitate the theft of billions of dollars a year from organizations. While malicious signals from attached files or malicious URLs in emails can be detected by conventional malware signatures or machine learning technologies, it is challenging to identify hand-crafted social engineering emails that contain no malicious code and share no word choices with known attacks. To tackle this problem, we fine-tune a pre-trained BERT model by replacing half of the Transformer blocks with simple adapters to efficiently learn sophisticated representations of the syntax and semantics of natural language. Our context-aware network also learns joint representations of the email’s content and context features extracted from email headers. Our CatBERT (Context-Aware Tiny BERT) achieves an 87% detection rate at a 1% false positive rate, compared to DistilBERT, LSTM, and logistic regression baselines, which achieve 83%, 79%, and 54% detection rates, respectively. Our model is also faster than competing transformer approaches and is resilient to adversarial attacks that deliberately replace keywords with typos or synonyms.
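As a rough sketch of the two ingredients named in the abstract, a bottleneck adapter that stands in for some of the fine-tuned Transformer blocks and a head that combines the text representation with header-derived context features, here is some illustrative PyTorch (layer sizes and module names are assumptions, not the paper's):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual.
    Lightweight modules like this let only a small number of parameters be
    fine-tuned; the hidden/bottleneck sizes here are illustrative."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class ContextAwareHead(nn.Module):
    """Concatenate the [CLS] text vector with numeric context features derived
    from email headers before the final phishing/benign score."""
    def __init__(self, hidden=768, n_context=16):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden + n_context, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, cls_vec, context_feats):
        return self.classifier(torch.cat([cls_vec, context_feats], dim=-1))
```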
4. VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics
Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo
In this paper, we propose a non-parallel any-to-many voice conversion (VC) method termed VoiceGrad. Inspired by WaveGrad, a recently introduced waveform generation method, VoiceGrad is based on the concepts of score matching and Langevin dynamics. It uses weighted denoising score matching to train a score approximator: a fully convolutional network with a U-Net structure designed to predict the gradient of the log density of the speech feature sequences of multiple speakers. VC is then performed with annealed Langevin dynamics, which iteratively updates an input feature sequence towards the nearest stationary point of the target distribution using the trained score approximator network. Thanks to this formulation, VoiceGrad enables any-to-many VC, a scenario in which the speaker of the input speech can be arbitrary, and allows for non-parallel training, which requires no parallel utterances or transcriptions.
``VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics. (arXiv:2010.02977v1 [https://t.co/mPAjnto8C8]),'' Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, https://t.co/orZds6cDjP
— arXiv Sound (@ArxivSound) October 8, 2020
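The conversion step is annealed Langevin dynamics driven by the trained score network. A hedged sketch of that sampling loop in PyTorch, assuming a score network with the made-up interface `score_net(x, speaker_id, sigma)` and placeholder step sizes:

```python
import torch

@torch.no_grad()
def annealed_langevin_vc(score_net, x, target_id, sigmas, steps_per_sigma=100, eps=2e-5):
    """Annealed Langevin dynamics for voice conversion (illustrative only).

    `x` is the source speaker's feature sequence used as the starting point,
    `sigmas` is a decreasing noise schedule, and `score_net` is assumed to
    return the estimated gradient of the log density of the target speaker's
    features. Interface and constants are placeholders, not VoiceGrad's.
    """
    for sigma in sigmas:
        step = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_sigma):
            noise = torch.randn_like(x)
            score = score_net(x, target_id, sigma)
            x = x + 0.5 * step * score + (step ** 0.5) * noise
    return x
```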
5. WER we are and WER we think we are
Piotr Szymański, Piotr Żelasko, Mikolaj Morzy, Adrian Szymczak, Marzena Żyła-Hoppe, Joanna Banaszczak, Lukasz Augustyniak, Jan Mizgajski, Yishay Carmiel
Natural language processing of conversational speech requires the availability of high-quality transcripts. In this paper, we express our skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets. We outline several problems with popular benchmarks and compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and on the public HUB’05 benchmark. We show that WERs are significantly higher than the best reported results. We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high-quality annotations for training and testing of robust ASR systems.
Our research, authored with @niedakh and the team from Avaya Poznań R&D, titled "WER we are and WER we think we are" has been accepted to Findings of EMNLP 2020. We discuss, and reject, the - somewhat common - misconception that ASR is a solved task.
🔗 https://t.co/PCUiC8Ib5s
— Piotr Żelasko (@PiotrZelasko) October 8, 2020
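For reference, the metric under discussion is simply the word-level edit distance normalized by the reference length; a small self-contained Python implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic program over edit operations.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```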
6. Less is more: Faster and better music version identification with embedding distillation
Furkan Yesiler, Joan Serrà, Emilia Gómez
Version identification systems aim to detect different renditions of the same underlying musical composition (loosely called cover songs). By learning to encode entire recordings into plain vector embeddings, recent systems have made significant progress in bridging the gap between accuracy and scalability, which has been a key challenge for nearly two decades. In this work, we propose to further narrow this gap by employing a set of data distillation techniques that reduce the embedding dimensionality of a pre-trained state-of-the-art model. We compare a wide range of techniques and propose new ones, from classical dimensionality reduction to more sophisticated distillation schemes. With those, we obtain 99% smaller embeddings that, moreover, yield up to a 3% accuracy increase. Such small embeddings can have an important impact on retrieval time, up to the point of making a real-world system practical on a standalone laptop.
Do your models perform better using larger embeddings? With embedding distillation, we show how to use models w/ large embeddings to train ones w/ smaller embeddings and keep the accuracy.
w/ @serrjoa and @emiliagogu
Paper: https://t.co/ukWTaSV1fY
Code: https://t.co/9tgXuhaHm8
— Furkan Yesiler (@furkanyesiler) October 8, 2020
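The abstract spans techniques from classical dimensionality reduction to learned distillation schemes. As a minimal illustration of the classical end of that range (a baseline, not the paper's best-performing scheme), here is a PCA-style reduction of pre-trained embeddings in NumPy:

```python
import numpy as np

def pca_reduce(embeddings, out_dim=16):
    """Reduce (n_tracks, d) pre-trained version embeddings to `out_dim`
    dimensions by projecting onto the top principal directions."""
    mean = embeddings.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    components = vt[:out_dim]
    return (embeddings - mean) @ components.T, components, mean

# Toy usage: random 2048-d vectors stand in for a real model's embeddings.
emb = np.random.randn(100, 2048)
small, components, mean = pca_reduce(emb, out_dim=16)
print(small.shape)  # (100, 16) -> much cheaper nearest-neighbour retrieval
```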
7. Plug and Play Autoencoders for Conditional Text Generation
Florian Mai, Nikolaos Pappas, Ivan Montero, Noah A. Smith, James Henderson
Text autoencoders are commonly used for conditional generation tasks such as style transfer. We propose plug-and-play methods in which any pretrained autoencoder can be used and which only require learning a mapping within the autoencoder’s embedding space, training embedding-to-embedding (Emb2Emb). This reduces the need for labeled training data for the task and makes the training procedure more efficient. Crucial to the success of this method are a loss term that keeps the mapped embedding on the manifold of the autoencoder and a mapping that is trained to navigate the manifold by learning offset vectors. Evaluations on style transfer tasks both with and without sequence-to-sequence supervision show that our method performs better than or comparably to strong baselines while being up to four times faster.
Why not pretrain text autoencoders (like we do language models) and then treat text-to-text tasks as regression in continuous "code" space? New EMNLP paper by @_florianmai @nik0spapp Ivan Montero @nlpnoah and "neural" NLP visionary @JamieBHenderson https://t.co/CKTTwGaYhF
— Noah A Smith (@nlpnoah) October 8, 2020
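A hedged sketch of the mapping idea in PyTorch: an offset-style mapping in the autoencoder's embedding space plus a combined objective with a stand-in manifold term (the actual method's adversarial manifold loss and training details differ; all sizes and names here are assumptions):

```python
import torch
import torch.nn as nn

class OffsetMapping(nn.Module):
    """Map embedding-to-embedding by predicting an offset and adding it to the
    input, so the mapping learns to move along the autoencoder's manifold
    rather than regress absolute positions (sizes are illustrative)."""
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, z):
        return z + self.mlp(z)

def emb2emb_loss(z_pred, z_target, manifold_critic, lam=0.5):
    """Task term (match a target embedding) plus a simplified manifold term:
    a critic scoring how plausible z_pred is as an autoencoder embedding."""
    task = nn.functional.mse_loss(z_pred, z_target)
    on_manifold = -manifold_critic(z_pred).mean()
    return task + lam * on_manifold
```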
8. Inductive Entity Representations from Text via Link Prediction
Daniel Daza, Michael Cochez, Paul Groth
We present a method for learning entity representations that uses a Transformer-based architecture as an entity encoder and link prediction training on a knowledge graph with textual entity descriptions. We demonstrate that our approach can be applied effectively for link prediction in different inductive settings involving entities not seen during training, outperforming related state-of-the-art methods (22% MRR improvement on average). We provide evidence that the learned representations transfer to other tasks without fine-tuning the entity encoder. In an entity classification task, we obtain an average improvement of 16% in accuracy compared with baselines that also employ pre-trained models. For an information retrieval task, we obtain significant improvements of up to 8.8% in NDCG@10 for natural language queries.
How far can we get with a language model fine-tuned for link prediction? We show that we can learn useful representations of unseen entities, that transfer to classification and retrieval tasks without fine-tuning.
Preprint https://t.co/YC59d3oIEl with @michaelcochez and @pgroth pic.twitter.com/7QKDZXVn6q
— Daniel Daza (@danieldazac) October 8, 2020
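A compact sketch of the two pieces described above, an entity encoder over textual descriptions and a relational scoring function for link prediction, assuming a HuggingFace-style encoder/tokenizer pair and using a TransE-style score purely for illustration:

```python
import torch

def encode_entities(encoder, tokenizer, descriptions):
    """Embed textual entity descriptions with a BERT-style encoder, taking the
    [CLS] vector as the entity representation (the paper's pooling/projection
    details may differ; the interface assumed here is HuggingFace-style)."""
    batch = tokenizer(descriptions, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]

def transe_score(head, relation, tail):
    """TransE-style link prediction score: -||h + r - t||_1, a common
    relational scoring function used here for illustration."""
    return -torch.norm(head + relation - tail, p=1, dim=-1)
```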
9. Empirical Frequentist Coverage of Deep Learning Uncertainty Quantification Procedures
Benjamin Kompa, Jasper Snoek, Andrew Beam
Uncertainty quantification for complex deep learning models is increasingly important as these techniques see growing use in high-stakes, real-world settings. Currently, the quality of a model’s uncertainty is evaluated using point-prediction metrics such as negative log-likelihood or the Brier score on held-out data. In this study, we provide the first large-scale evaluation of the empirical frequentist coverage properties of well-known uncertainty quantification techniques on a suite of regression and classification tasks. We find that, in general, some methods do achieve desirable coverage properties on in-distribution samples, but that coverage is not maintained on out-of-distribution data. Our results demonstrate the failings of current uncertainty quantification techniques as dataset shift increases and establish coverage as an important metric in developing models for real-world applications.
How often do intervals from popular uncertainty quantification (UQ) methods actually contain the observed value? @BenKompa, @latentjasper, and I investigated this (known in the stats lit as "coverage") for several popular methods.
Paper: https://t.co/YPHUA6mLON
Details 👇 pic.twitter.com/HxYNNMjDMA
— Andrew Beam (@AndrewLBeam) October 8, 2020
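For concreteness, the quantity being measured is just the fraction of held-out targets that land inside their predicted intervals, usually read together with interval width; a small NumPy sketch with hypothetical Gaussian predictions:

```python
import numpy as np

def empirical_coverage(lower, upper, y_true):
    """Fraction of held-out targets falling inside their predicted intervals."""
    lower, upper, y_true = map(np.asarray, (lower, upper, y_true))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

def mean_interval_width(lower, upper):
    """Average width: trivially wide intervals can hit nominal coverage, so
    coverage is best reported alongside width."""
    return float(np.mean(np.asarray(upper) - np.asarray(lower)))

# Toy usage with a hypothetical, well-calibrated Gaussian predictor.
mu, sigma = np.zeros(10_000), np.ones(10_000)
y = np.random.randn(10_000)
lo, hi = mu - 1.96 * sigma, mu + 1.96 * sigma      # nominal 95% intervals
print(empirical_coverage(lo, hi, y), mean_interval_width(lo, hi))  # ≈ 0.95, 3.92
```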
10. On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment
Zirui Wang, Zachary C. Lipton, Yulia Tsvetkov
Modern multilingual models are trained on concatenated text from multiple languages in hopes of conferring benefits to each (positive transfer), with the most pronounced benefits accruing to low-resource languages. However, recent work has shown that this approach can degrade performance on high-resource languages, a phenomenon known as negative interference. In this paper, we present the first systematic study of negative interference. We show that, contrary to previous belief, negative interference also impacts low-resource languages. While parameters are maximally shared to learn language-universal structures, we demonstrate that language-specific parameters do exist in multilingual models and that they are a potential cause of negative interference. Motivated by these observations, we also present a meta-learning algorithm that obtains better cross-lingual transferability and alleviates negative interference, by adding language-specific layers as meta-parameters and training them in a manner that explicitly improves the shared layers’ generalization on all languages. Overall, our results show that negative interference is more common than previously known, suggesting new directions for improving multilingual representations.
In multilingual models, negative interference between languages can degrade performance.
Our #EMNLP2020 paper studies its causes and proposes a meta-learning treatment: https://t.co/yw1DcXVeLz
Joint work with Zachary Lipton (@zacharylipton) and Yulia Tsvetkov! 1/4 pic.twitter.com/Gq3O8sFg7P
— Zirui Wang (@MrZiruiWang) October 8, 2020
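A first-order, heavily simplified sketch of the meta-learning idea in PyTorch: adapt the language-specific parameters on one language, then update the shared parameters so they still generalize after that adaptation. The function names, the parameter split, and the single inner step are all assumptions for illustration, not the paper's algorithm as published:

```python
import torch

def meta_step(shared, lang_specific, batches, loss_fn, inner_lr=1e-3, outer_lr=1e-4):
    """One first-order meta-update (illustrative only).

    `shared` and `lang_specific` are lists of tensors with requires_grad=True,
    `batches` maps at least two language codes to training batches, and
    `loss_fn(shared, lang_specific, batch)` returns a scalar loss.
    """
    langs = list(batches)
    inner_lang, outer_lang = langs[0], langs[1]

    # Inner step: adapt the language-specific layers on the first language.
    inner_loss = loss_fn(shared, lang_specific, batches[inner_lang])
    grads = torch.autograd.grad(inner_loss, lang_specific)
    adapted = [p - inner_lr * g for p, g in zip(lang_specific, grads)]

    # Outer step: update the shared layers for generalization to the other language.
    outer_loss = loss_fn(shared, adapted, batches[outer_lang])
    outer_grads = torch.autograd.grad(outer_loss, shared)
    with torch.no_grad():
        for p, g in zip(shared, outer_grads):
            p -= outer_lr * g
    return float(outer_loss)
```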