1. Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves
Luke Metz, Niru Maheswaranathan, C. Daniel Freeman, Ben Poole, Jascha Sohl-Dickstein
Much as replacing hand-designed features with learned functions has revolutionized how we solve perceptual tasks, we believe learned algorithms will transform how we train models. In this work we focus on general-purpose learned optimizers capable of training a wide variety of problems with no user-specified hyperparameters. We introduce a new hierarchical optimizer, parameterized by a neural network, with access to additional features such as validation loss to enable automatic regularization. Most learned optimizers have been trained on only a single task, or a small number of tasks. We train our optimizers on thousands of tasks, making use of orders of magnitude more compute, resulting in optimizers that generalize better to unseen tasks. The learned optimizers not only perform well, but learn behaviors that are distinct from existing first-order optimizers. For instance, they generate update steps that have implicit regularization and adapt as the problem hyperparameters (e.g. batch size) or architecture (e.g. neural network width) change. Finally, these learned optimizers show evidence of being useful for out-of-distribution tasks such as training themselves from scratch.
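To make the mechanism concrete, here is a minimal, hypothetical sketch of a learned optimizer's inner loop: a small neural network, rather than a hand-designed rule, maps per-parameter features to updates. The two-feature input and tiny MLP below are illustrative assumptions, not the paper's hierarchical architecture (which also consumes features such as validation loss).

```python
import numpy as np

def mlp(features, W1, b1, W2, b2):
    """Tiny shared MLP: per-parameter features -> scalar update."""
    h = np.tanh(features @ W1 + b1)
    return h @ W2 + b2

def learned_optimizer_step(params, grads, momentum, opt_weights, beta=0.9):
    """One update from a learned optimizer whose weights `opt_weights` were
    meta-trained. Here each parameter's features are just its gradient and
    momentum; the real optimizer uses many more."""
    momentum = beta * momentum + (1 - beta) * grads
    feats = np.stack([grads.ravel(), momentum.ravel()], axis=-1)  # (n, 2)
    updates = mlp(feats, *opt_weights).reshape(params.shape)
    return params + updates, momentum
```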
We have a new paper on learned optimizers! We used thousands of tasks (and a lot of compute) to train general purpose learned optimizers that perform well on never-before-seen tasks, and can even train new versions of themselves. https://t.co/LQf6o3Fwq7
— Luke Metz (@Luke_Metz) September 24, 2020
1/8
2. Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, Yejin Choi
Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We introduce "Data Maps"---a model-based tool to characterize and diagnose datasets. We leverage a largely ignored source of information: the behavior of the model on individual instances during training (training dynamics) for building data maps. This yields two intuitive measures for each example---the model's confidence in the true class, and the variability of this confidence across epochs, in a single run of training. Experiments on four datasets show that these model-dependent measures reveal three distinct regions in the data map, each with pronounced characteristics. First, our data maps show the presence of "ambiguous" regions with respect to the model, which contribute the most towards out-of-distribution generalization. Second, the most populous regions in the data are "easy to learn" for the model, and play an important role in model optimization. Finally, data maps uncover a region with instances that the model finds "hard to learn"; these often correspond to labeling errors. Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
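The two coordinates of a data map are simple to compute once the gold-label probability is logged at each epoch; a minimal sketch (function name is ours):

```python
import numpy as np

def data_map_coordinates(probs_true_class):
    """Per-example Data Map measures from a single training run.

    probs_true_class: (num_epochs, num_examples) array of the model's
    probability assigned to the gold label at the end of each epoch.
    """
    confidence = probs_true_class.mean(axis=0)   # mean gold-label probability
    variability = probs_true_class.std(axis=0)   # its spread across epochs
    return confidence, variability

# Reading the map, per the abstract:
#   high confidence, low variability -> "easy to learn"
#   low confidence,  low variability -> "hard to learn" (often label errors)
#   high variability                 -> "ambiguous" (best for OOD generalization)
```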
Dataset cartography: a new way to look at your training dataset, derived from model training dynamics with respect to each instance. Forthcoming EMNLP paper by @swabhz @royschwartz02 @NickLourie @yizhongwyz @HannaHajishirzi @nlpnoah @YejinChoinka https://t.co/MNu6Kbha0J
— Noah A Smith (@nlpnoah) September 24, 2020
As datasets have grown larger, data exploration has become increasingly challenging. Our new work on Dataset Cartography, at @emnlp2020 with @royschwartz02, @NickLourie, @yizhongwyz, @HannaHajishirzi, @nlpnoah, @YejinChoinka offers a solution.
— Swabha Swayamdipta (@swabhz) September 24, 2020
Paper: https://t.co/9JsYrxeACa 1/n pic.twitter.com/1hItp5yOx2
3. Message Passing for Hyper-Relational Knowledge Graphs
Mikhail Galkin, Priyansh Trivedi, Gaurav Maheshwari, Ricardo Usbeck, Jens Lehmann
Hyper-relational knowledge graphs (KGs) (e.g., Wikidata) enable associating additional key-value pairs along with the main triple to disambiguate, or restrict the validity of, a fact. In this work, we propose a message-passing-based graph encoder, StarE, capable of modeling such hyper-relational KGs. Unlike existing approaches, StarE can encode an arbitrary amount of additional information (qualifiers) along with the main triple while keeping the semantic roles of qualifiers and triples intact. We also demonstrate that existing benchmarks for evaluating link prediction (LP) performance on hyper-relational KGs suffer from fundamental flaws, and thus develop a new Wikidata-based dataset, WD50K. Our experiments demonstrate that a StarE-based LP model outperforms existing approaches across multiple benchmarks. We also confirm that leveraging qualifiers is vital for link prediction, with gains of up to 25 MRR points compared to triple-based representations.
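One way to picture "keeping the semantic roles intact" is to fold qualifier pairs into the relation representation before standard message passing. The composition function and mixing weight below are our simplified assumptions, not StarE's exact formulation:

```python
import numpy as np

def qualifier_aware_relation(rel_emb, qual_pairs, alpha=0.8):
    """Hypothetical sketch: modulate the main relation embedding with an
    arbitrary number of qualifier (key, value) embedding pairs, keeping
    the main triple dominant via the mixing weight alpha."""
    if not qual_pairs:
        return rel_emb
    # Compose each key/value pair (element-wise product as a simple choice),
    # then average so the result is invariant to the number of qualifiers.
    q = np.mean([k * v for k, v in qual_pairs], axis=0)
    return alpha * rel_emb + (1 - alpha) * q
```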
We're releasing everything on StarE - a #GNN encoder for hyper-relational #KnowledgeGraph techniques like RDF* and LPG. Have fun!
— Michael Galkin (@michael_galkin) September 24, 2020
Blog: https://t.co/OV2FprqzJt
Paper: https://t.co/K49xGI6MI7
Code: https://t.co/mrwyvXoPMf
Report @weights_biases : https://t.co/KFsxZyiw31 https://t.co/NfAndJ5eSs
4. Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages
Xavier Garcia, Aditya Siddhant, Orhan Firat, Ankur P. Parikh
Unsupervised translation has reached impressive performance on resource-rich language pairs such as English-French and English-German. However, early studies have shown that in more realistic settings involving low-resource, rare languages, unsupervised translation performs poorly, achieving less than 3.0 BLEU. In this work, we show that multilinguality is critical to making unsupervised systems practical for low-resource settings. In particular, we present a single model for 5 low-resource languages (Gujarati, Kazakh, Nepali, Sinhala, and Turkish), translating to and from English, which leverages monolingual and auxiliary parallel data from other high-resource language pairs via a three-stage training scheme. We outperform all current state-of-the-art unsupervised baselines for these languages, achieving gains of up to 14.4 BLEU. Additionally, we outperform a large collection of supervised WMT submissions for various language pairs as well as match the performance of the current state-of-the-art supervised model for Nepali-English. We conduct a series of ablation studies to establish the robustness of our model under different degrees of data quality, as well as to analyze the factors which led to the superior performance of the proposed approach over traditional unsupervised models.
Check out our multilingual unsupervised translation work! Theory + SOTA results. Led by @xgarcia238 (1/4)
— Ankur Parikh (@ank_parikh) September 24, 2020
1. Multilingual View of Unsupervised MT - Findings of EMNLP 2020 (https://t.co/oibhq2FDZ4 )
2. Multilingual Unsupervised MT for Rare Languages (https://t.co/PkQAlH7lcq ) pic.twitter.com/FcfMCeRJ7y
5. The cost of coordination can exceed the benefit of collaboration in performing complex tasks
Vince J. Straub, Milena Tsvetkova, Taha Yasseri
Collective decision-making is ubiquitous when observing the behavior of intelligent agents, including humans. However, there are inconsistencies in our theoretical understanding of whether there is a collective advantage from interacting with group members of varying levels of competence in solving problems of varying complexity. Moreover, most existing experiments have relied on highly stylized tasks, reducing the generality of their results. The present study narrows the gap between experimental control and realistic settings, reporting the results from an analysis of collective problem-solving in the context of a real-world citizen science task environment in which individuals with manipulated differences in task-relevant training collaborated on the Wildcam Gorongosa task, hosted by The Zooniverse. We find that dyads gradually improve in performance but do not experience a collective benefit compared to individuals in most situations; rather, the cost of team coordination to efficiency and speed is consistently larger than the leverage of having a partner, even if they are expertly trained. It is only in terms of accuracy in the most complex tasks that having an additional expert significantly improves performance upon that of non-experts. Our findings have important theoretical and applied implications for collective problem-solving: to improve efficiency, one could prioritize providing task-relevant training and relying on trained experts working alone over interaction and to improve accuracy, one could target the expertise of selectively trained individuals.
Prefer to finish a task alone rather than w someone else when it's too complex? You should do that if you're good! After 3 years of work finally the science is in! https://t.co/O53mAUMgG1
— Taha Yasseri (@TahaYasseri) September 24, 2020
The cost of coordination can exceed the benefit of collaboration in performing complex tasks pic.twitter.com/JhfXFmkRuc
6. Scene Graph to Image Generation with Contextualized Object Layout Refinement
Maor Ivgi, Yaniv Benny, Avichai Ben-David, Jonathan Berant, Lior Wolf
Generating high-quality images from scene graphs, that is, graphs that describe multiple entities in complex relations, is a challenging task that has recently attracted substantial interest. Prior work trained such models using supervised learning, where the goal is to produce the exact target image layout for each scene graph, and relied on predicting object locations and shapes independently and in parallel. However, scene graphs are underspecified, and thus the same scene graph often occurs with many target images in the training data. This leads to generated images with high inter-object overlap, empty areas, blurry objects, and overall compromised quality. In this work, we propose a method that alleviates these issues by generating all object layouts together and reducing the reliance on such supervision. Our model predicts layouts directly from embeddings (without predicting intermediate boxes) by gradually upsampling, refining and contextualizing object layouts. It is trained with a novel adversarial loss that optimizes the interaction between object pairs. This improves coverage and removes overlaps, while maintaining sensible contours and respecting object relations. We empirically show on the COCO-STUFF dataset that our proposed approach substantially improves the quality of generated layouts as well as the overall image quality. Our evaluation shows that we improve layout coverage by almost 20 points, and drop object overlap to negligible amounts. This leads to better image generation, relation fulfillment and object quality.
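The coverage and overlap numbers cited above can be made precise with a toy diagnostic on soft per-object layout maps; this is our illustrative measurement, not the paper's loss:

```python
import numpy as np

def coverage_and_overlap(layouts):
    """layouts: (num_objects, H, W) soft layout maps in [0, 1].

    Coverage: fraction of the canvas claimed by at least one object.
    Overlap:  average probability mass where objects compete per pixel.
    """
    union = 1.0 - np.prod(1.0 - layouts, axis=0)       # soft union over objects
    coverage = union.mean()
    excess = np.clip(layouts.sum(axis=0) - 1.0, 0.0, None)
    return coverage, excess.mean()
```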
Scene Graph to Image Generation with Contextualized Object Layout Refinement
— AK (@ak92501) September 24, 2020
pdf: https://t.co/So4Ux1OXQA
abs: https://t.co/Xd02vZTGHH pic.twitter.com/VRv2kdRHwu
7. On Data Augmentation for Extreme Multi-label Classification
Danqing Zhang, Tao Li, Haiyang Zhang, Bing Yin
In this paper, we focus on data augmentation for the extreme multi-label classification (XMC) problem. One of the most challenging issues of XMC is the long-tail label distribution, where even strong models suffer from insufficient supervision. To mitigate such label bias, we propose a simple and effective augmentation framework and a new state-of-the-art classifier. Our augmentation framework takes advantage of the pre-trained GPT-2 model to generate label-invariant perturbations of the input texts to augment the existing training data. As a result, it presents substantial improvements over baseline models. Our contributions are twofold: (1) we introduce a new state-of-the-art classifier that uses label attention with RoBERTa and combine it with our augmentation framework for further improvement; (2) we present a broad study on how effective different augmentation methods are for the XMC task.
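As a rough illustration of the GPT-2-based augmentation step, the sketch below completes a truncated input to produce extra training texts whose label set is assumed unchanged; the prompting scheme is our assumption, not the paper's exact recipe:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def augment(text, num_variants=3, max_new_tokens=30):
    """Generate perturbed copies of `text` for tail labels by sampling
    continuations of its first half; labels are kept as-is."""
    words = text.split()
    prefix = " ".join(words[: max(1, len(words) // 2)])
    outs = generator(prefix, num_return_sequences=num_variants,
                     max_new_tokens=max_new_tokens, do_sample=True)
    return [o["generated_text"] for o in outs]
```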
Using language-model-based and rule-based data augmentation to deal with extremely unbalanced data scenarios.
— elvis (@omarsar0) September 24, 2020
This is common when dealing with multi-label text classification.https://t.co/HkHGJMpepx pic.twitter.com/U5CuZAHEfC
8. X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi
Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This begs the question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family - LXMERT - finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements including: discretizing visual representations, using uniform masking with a large range of masking ratios, and aligning the right pre-training datasets to the right objectives, which enables it to paint. X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT. Finally, we demonstrate the generality of these training refinements by adding image generation capabilities into UNITER to produce X-UNITER.
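Of the listed refinements, "uniform masking with a large range of masking ratios" is easy to sketch: sample the ratio itself each step, then mask that fraction of the discretized visual tokens. The range below is an illustrative assumption:

```python
import numpy as np

def uniform_ratio_mask(num_visual_tokens, low=0.1, high=1.0, rng=np.random):
    """Sample a masking ratio uniformly from [low, high], then mask that
    fraction of visual tokens (True = token replaced for the objective)."""
    ratio = rng.uniform(low, high)
    num_masked = int(round(ratio * num_visual_tokens))
    mask = np.zeros(num_visual_tokens, dtype=bool)
    mask[rng.choice(num_visual_tokens, size=num_masked, replace=False)] = True
    return mask
```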
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
— AK (@ak92501) September 24, 2020
pdf: https://t.co/nxyYF4LTDy
abs: https://t.co/CL4SDy0DjZ
project page: https://t.co/gOqLdAxAJK pic.twitter.com/8ycgxyPpkP
9. KoBE: Knowledge-Based Machine Translation Evaluation
Zorik Gekhman, Roee Aharoni, Genady Beryozkin, Markus Freitag, Wolfgang Macherey
We propose a simple and effective method for machine translation evaluation which does not require reference translations. Our approach is based on (1) grounding the entity mentions found in each source sentence and candidate translation against a large-scale multilingual knowledge base, and (2) measuring the recall of the grounded entities found in the candidate vs. those found in the source. Our approach achieves the highest correlation with human judgements on 9 out of the 18 language pairs from the WMT19 benchmark for evaluation without references, which is the largest number of wins for a single evaluation method on this task. On 4 language pairs, we also achieve higher correlation with human judgements than BLEU. To foster further research, we release a dataset containing 1.8 million grounded entity mentions across 18 language pairs from the WMT19 metrics track data.
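The score itself reduces to entity recall once mentions are grounded to KB identifiers; a minimal sketch (multiset matching is our assumption):

```python
from collections import Counter

def kobe_style_score(source_entities, candidate_entities):
    """Reference-free MT score: recall of grounded source entities in the
    candidate translation. Inputs are lists of KB entity IDs produced by
    a multilingual entity linker."""
    src, cand = Counter(source_entities), Counter(candidate_entities)
    if not src:
        return 0.0
    matched = sum(min(count, cand[e]) for e, count in src.items())
    return matched / sum(src.values())
```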
Turns out that using multilingual entity linking, we can automatically evaluate machine translation without any references! New paper with Zorik Gekhman, Genady Beryozkin, Markus Freitag and Wolfgang Macherey, to appear in Findings of EMNLP: https://t.co/87QTZWV0bS @GoogleAI pic.twitter.com/wiuKL9BXkg
— roeeaharoni (@roeeaharoni) September 24, 2020
10. Few-shot Font Generation with Localized Style Representations and Factorization
Song Park, Sanghyuk Chun, Junbum Cha, Bado Lee, Hyunjung Shim
Automatic few-shot font generation is in high demand because manual designs are expensive and sensitive to the expertise of designers. Existing few-shot font generation methods aim to learn to disentangle the style and content element from a few reference glyphs, and mainly focus on a universal style representation for each font style. However, such an approach limits the model in representing diverse local styles, and thus makes it unsuitable for the most complicated letter systems, e.g., Chinese, whose characters consist of a varying number of components (often called "radicals") with a highly complex structure. In this paper, we propose a novel font generation method by learning localized styles, namely component-wise style representations, instead of universal styles. The proposed style representations enable us to synthesize complex local details in text designs. However, learning component-wise styles solely from reference glyphs is infeasible in the few-shot font generation scenario, when a target script has a large number of components, e.g., over 200 for Chinese. To reduce the number of reference glyphs, we simplify component-wise styles by a product of component factor and style factor, inspired by low-rank matrix factorization. Thanks to the combination of strong representation and a compact factorization strategy, our method shows remarkably better few-shot font generation results (with only 8 reference glyph images) than other state-of-the-art methods, without utilizing strong locality supervision, e.g., location of each component, skeleton, or strokes. The source code is available at https://github.com/clovaai/lffont.
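The factorization trick can be sketched in a few lines: instead of learning one style vector per (font, component) pair, a component-wise style is the product of a shared component factor and a per-font style factor, as in low-rank matrix factorization. The shapes and rank below are illustrative:

```python
import numpy as np

num_components, num_fonts, rank, dim = 200, 50, 8, 64
component_factors = np.random.randn(num_components, rank)  # shared across fonts
style_factors = np.random.randn(num_fonts, rank, dim)      # one per font

def component_style(font_id, component_id):
    """Component-wise style vector assembled from far fewer parameters
    than a full (num_components x num_fonts x dim) style table."""
    return component_factors[component_id] @ style_factors[font_id]  # (dim,)
```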
Few-shot Font Generation with Localized Style Representations and Factorization
— AK (@ak92501) September 24, 2020
pdf: https://t.co/wVmkoK4Zgf
abs: https://t.co/FzWufs2TlA
github: https://t.co/KJ2BCzpwMX pic.twitter.com/fwHUilnISu
11. Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot
Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Liwei Wang, Jason D. Lee
Network pruning is a method for reducing test-time computational resource requirements with minimal performance degradation. Conventional wisdom about pruning algorithms suggests that: (1) Pruning methods exploit information from training data to find good subnetworks; (2) The architecture of the pruned network is crucial for good performance. In this paper, we conduct sanity checks for the above beliefs on several recent unstructured pruning methods and surprisingly find that: (1) A set of methods which aims to find good subnetworks of the randomly-initialized network (which we call "initial tickets") hardly exploits any information from the training data; (2) For the pruned networks obtained by these methods, randomly changing the preserved weights in each layer, while keeping the total number of preserved weights unchanged per layer, does not affect the final performance. These findings inspire us to choose a series of simple data-independent prune ratios for each layer, and randomly prune each layer accordingly to get a subnetwork (which we call "random tickets"). Experimental results show that our zero-shot random tickets outperform or attain similar performance compared to existing "initial tickets". In addition, we identify one existing pruning method that passes our sanity checks. We hybridize the ratios in our random tickets with this method and propose a new method called "hybrid tickets", which achieves further improvement.
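A "random ticket" needs no data at all: fix a simple keep ratio per layer, then select the surviving weights uniformly at random. The ratio schedule is left as an input below; the paper studies simple data-independent schedules:

```python
import numpy as np

def random_ticket(layer_shapes, keep_ratios, rng=np.random):
    """Zero-shot random ticket: one data-independent keep ratio per layer,
    surviving weights chosen uniformly at random within the layer."""
    masks = []
    for shape, keep in zip(layer_shapes, keep_ratios):
        n = int(np.prod(shape))
        k = int(round(keep * n))
        flat = np.zeros(n, dtype=bool)
        flat[rng.choice(n, size=k, replace=False)] = True
        masks.append(flat.reshape(shape))
    return masks  # apply as weight *= mask at init, then train as usual
```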
Do existing pruning methods really exploit the info from data? Do the architectures of the pruned networks really matter for the performance? We propose sanity checks on pruning methods and find that a great part of existing methods does not rely on these! https://t.co/6Wk9DparBL pic.twitter.com/Wp2DJznQD0
— Tianle Cai (@tianle_cai) September 24, 2020