1. Deduplicating Training Data Makes Language Models Better
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini
We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets — for example removing from C4 a single 61-word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.
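The near-duplicate side of this (the paper pairs a MinHash-based tool with exact-substring matching via suffix arrays) can be sketched in a few lines. This is a minimal illustration of the MinHash idea, with illustrative shingle size and hash count, not the released tool:

```python
import hashlib

def shingles(text, n=5):
    """Word n-grams (shingles) of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(doc_shingles, num_hashes=64):
    """MinHash signature: for each seeded hash function, keep the minimum
    hash value over all shingles of the document."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in doc_shingles)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = ("the quick brown fox jumps over the lazy dog "
         "while the sun slowly sets behind the hills")
doc_b = doc_a.replace("lazy", "sleepy")  # near-duplicate
doc_c = ("training large language models on deduplicated web scale "
         "corpora improves both quality and efficiency")

sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
sig_c = minhash_signature(shingles(doc_c))
```

Documents whose signatures agree on enough slots would be flagged as near-duplicates and collapsed to a single copy.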
Deduplicating Training Data Makes Language Models Better
— Aran Komatsuzaki (@arankomatsuzaki) July 15, 2021
Finds that deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. https://t.co/vWjKaGXFfx pic.twitter.com/sE37pw2T1Z
Data duplication is serious business!
— Katherine Lee (@katherine1ee) July 15, 2021
3% of documents in the large language dataset, C4, have near-duplicates.
Deduplication reduces model memorization while training faster and without reducing accuracy.
Paper: https://t.co/ENRVYgjnOw
Code: coming soon!
🧵⬇️ (1/9)
Deduplicating Training Data Makes Language Models Better
— AK (@ak92501) July 15, 2021
pdf: https://t.co/w8J8NZ5v7t
abs: https://t.co/4Woo78QjST
Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy pic.twitter.com/sMMud34rj7
2. Deep Neural Networks are Surprisingly Reversible: A Baseline for Zero-Shot Inversion
Xin Dong, Hongxu Yin, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov
Understanding the behavior and vulnerability of pre-trained deep neural networks (DNNs) can help to improve them. Analysis can be performed via reversing the network’s flow to generate inputs from internal representations. Most existing work relies on priors or data-intensive optimization to invert a model, yet struggles to scale to deep architectures and complex datasets. This paper presents a zero-shot direct model inversion framework that recovers the input to the trained model given only the internal representation. The crux of our method is to invert the DNN in a divide-and-conquer manner while re-syncing the inverted layers via cycle-consistency guidance with the help of synthesized data. As a result, we obtain a single feed-forward model capable of inversion with a single forward pass without seeing any real data of the original task. With the proposed approach, we scale zero-shot direct inversion to deep architectures and complex datasets. We empirically show that modern classification models on ImageNet can, surprisingly, be inverted, allowing an approximate recovery of the original 224x224px images from a representation after more than 20 layers. Moreover, inversion of generators in GANs unveils the latent code of a given synthesized face image at 128x128px, which can even, in turn, improve defective synthesized images from GANs.
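The paper trains feed-forward inverse modules with cycle-consistency on synthesized data. As a toy illustration of why divide-and-conquer, layer-by-layer inversion composes into full-model inversion, here is an analytic sketch with invertible activations and square weight matrices (not the authors' learned method):

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, a=0.2):
    return np.where(x > 0, x, a * x)

def leaky_relu_inv(y, a=0.2):
    # leaky ReLU is a monotone bijection, so it inverts exactly
    return np.where(y > 0, y, y / a)

# A toy 3-layer network with square weight matrices (invertible a.s.).
layers = [rng.standard_normal((8, 8)) for _ in range(3)]

def forward(x):
    for W in layers:
        x = leaky_relu(W @ x)
    return x

def invert(h):
    # Divide and conquer: invert one layer at a time, in reverse order.
    for W in reversed(layers):
        h = np.linalg.solve(W, leaky_relu_inv(h))
    return h

x = rng.standard_normal(8)
h = forward(x)       # deep internal representation
x_rec = invert(h)    # recovered input
```

Real networks use non-square, lossy layers, which is why the paper needs trained inverse modules and cycle-consistency rather than exact algebra; the toy only shows the compositional structure of the inversion.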
Deep Neural Networks are Surprisingly Reversible: A Baseline for Zero-Shot Inversion
— AK (@ak92501) July 15, 2021
pdf: https://t.co/rETZDP6rW2
abs: https://t.co/037Rx1nB5M
a zero-shot direct model inversion framework that recovers the input to the trained model given only the internal representation pic.twitter.com/FVUQahDlu5
3. Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals
Guillaume Cabanac, Cyril Labbé, Alexander Magazinov
Probabilistic text generators have been used to produce fake scientific papers for more than a decade. Such nonsensical papers are easily detected by both human and machine. Now more complex AI-powered generation techniques produce texts indistinguishable from those of humans, and the generation of scientific texts from a few keywords has been documented. Our study introduces the concept of tortured phrases: unexpected weird phrases in lieu of established ones, such as ‘counterfeit consciousness’ instead of ‘artificial intelligence.’ We combed the literature for tortured phrases and studied one reputable journal where these concentrated en masse. Hypothesising the use of advanced language models, we ran a detector on the abstracts of recent articles of this journal and on several control sets. The pairwise comparisons reveal a concentration of abstracts flagged as ‘synthetic’ in the journal. We also highlight irregularities in its operation, such as abrupt changes in editorial timelines. We substantiate our call for investigation by analysing several individual dubious articles, stressing questionable features: tortured writing style, citation of non-existent literature, and unacknowledged image reuse. Surprisingly, some websites offer to rewrite texts for free, generating gobbledegook full of tortured phrases. We believe some authors used rewritten texts to pad their manuscripts. We wish to raise awareness of publications containing such questionable AI-generated or rewritten texts that passed (poor) peer review. Deception with synthetic texts threatens the integrity of the scientific literature.
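At its simplest, combing the literature for tortured phrases is a fingerprint lookup: a curated map from tortured phrase to the established term it displaces. A minimal sketch (the phrase list is illustrative, not the authors' full fingerprint set):

```python
# Tortured phrases mapped to the established terms they displace.
# 'counterfeit consciousness' is the paper's own example; the others are
# illustrative entries of the same kind.
TORTURED_PHRASES = {
    "counterfeit consciousness": "artificial intelligence",
    "profound learning": "deep learning",
    "colossal information": "big data",
}

def flag_tortured_phrases(text):
    """Return (tortured phrase, established phrase) pairs found in the text."""
    lowered = text.lower()
    return [(t, e) for t, e in TORTURED_PHRASES.items() if t in lowered]

example = "Our framework applies profound learning to colossal information streams."
hits = flag_tortured_phrases(example)
```

Matches like these are only a screening signal; as the abstract notes, the authors follow up with detector runs and manual analysis of individual articles.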
Just out @arxiv: “Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals” w/ Labbé & Magazinov https://t.co/eh0YNO5T8v Probable systematic manipulation of the publication process of Microprocessors in Microsystems pic.twitter.com/vrVQUi3TNn
— Guillaume Cabanac (@gcabanac) July 15, 2021
4. Multi-Label Generalized Zero Shot Learning for the Classification of Disease in Chest Radiographs
Nasir Hayat, Hazem Lashen, Farah E. Shamout
Despite the success of deep neural networks in chest X-ray (CXR) diagnosis, supervised learning only allows the prediction of disease classes that were seen during training. At inference, these networks cannot predict an unseen disease class. Incorporating a new class requires the collection of labeled data, which is not a trivial task, especially for less frequently occurring diseases. As a result, it becomes inconceivable to build a model that can diagnose all possible disease classes. Here, we propose a multi-label generalized zero shot learning (CXR-ML-GZSL) network that can simultaneously predict multiple seen and unseen diseases in CXR images. Given an input image, CXR-ML-GZSL learns a visual representation guided by the input’s corresponding semantics extracted from a rich medical text corpus. Towards this ambitious goal, we propose to map both visual and semantic modalities to a latent feature space using a novel learning objective. The objective ensures that (i) the most relevant labels for the query image are ranked higher than irrelevant labels, (ii) the network learns a visual representation that is aligned with its semantics in the latent feature space, and (iii) the mapped semantics preserve their original inter-class representation. The network is end-to-end trainable and requires no independent pre-training for the offline feature extractor. Experiments on the NIH Chest X-ray dataset show that our network outperforms two strong baselines in terms of recall, precision, F1 score, and area under the receiver operating characteristic curve. Our code is publicly available at: https://github.com/nyuad-cai/CXR-ML-GZSL.git
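Property (i) of the objective, relevant labels ranked above irrelevant ones, is commonly expressed as a pairwise hinge (margin ranking) loss. A generic sketch of that term, not necessarily the paper's exact formulation:

```python
import numpy as np

def ranking_loss(scores, relevant, margin=1.0):
    """Pairwise hinge loss: every relevant label should outscore every
    irrelevant label by at least `margin`. `scores` holds one score per
    label; `relevant` lists the indices of the ground-truth labels."""
    rel = scores[relevant]
    irr = np.delete(scores, relevant)
    # Hinge over all (relevant, irrelevant) pairs; zero when well separated.
    return np.maximum(0.0, margin - (rel[:, None] - irr[None, :])).mean()
```

When every relevant label already beats every irrelevant one by the margin, the loss vanishes; otherwise each violating pair contributes its shortfall.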
Are deep neural networks able to predict diseases they haven’t been trained on? Our new work accepted at @mlforhc investigates this question via Multi-Label Generalized Zero Shot Learning (ML-GZSL) for chest X-rays (CXR): https://t.co/OGNSG8IuNW
— Farah Shamout (@farahshamout) July 15, 2021
Here’s a brief summary!
1/5
5. Nowcasting transmission and suppression of the Delta variant of SARS-CoV-2 in Australia
Sheryl L. Chang, Oliver M. Cliff, Mikhail Prokopenko
As of July 2021, there is a continuing outbreak of the B.1.617.2 (Delta) variant of SARS-CoV-2 in Sydney, Australia. The outbreak is of major concern as the Delta variant is estimated to have twice the reproductive number of previous variants that circulated in Australia in 2020, which is worsened by low levels of acquired immunity in the population. Using a re-calibrated agent-based model, we explored a feasible range of non-pharmaceutical interventions, in terms of both mitigation (case isolation, home quarantine) and suppression (school closures, social distancing). Our nowcasting modelling indicated that the level of social distancing currently attained in Sydney is inadequate for outbreak control. A counter-factual analysis suggested that if 80% of agents comply with social distancing, then at least a month is needed for the new daily cases to reduce from their peak to below ten. A small reduction in social distancing compliance to 70% lengthens this period to over two months.
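The qualitative shape of the compliance result, a modest drop in compliance disproportionately stretching the time to suppression, can be reproduced with a toy discrete-time SIR model. This is not the paper's calibrated agent-based model; every parameter below is illustrative:

```python
def days_to_suppress(compliance, beta=0.5, gamma=0.2,
                     pop=1_000_000, infected0=1000, threshold=10):
    """Toy discrete-time SIR. Social distancing by the compliant fraction
    scales down transmission (a simplifying assumption); returns the day on
    which new daily cases first fall below `threshold`."""
    s, i = pop - infected0, float(infected0)
    eff_beta = beta * (1 - compliance)  # compliant agents cut their contacts
    for day in range(1, 3651):
        new_cases = eff_beta * s * i / pop
        s -= new_cases
        i += new_cases - gamma * i
        if new_cases < threshold:
            return day
    return None  # never suppressed within the horizon
```

In this toy, dropping compliance from 0.8 to 0.7 roughly doubles the suppression time, echoing (but not reproducing) the month-versus-two-months contrast in the abstract.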
Model suggests we are not winning in Greater Sydney yet. Nowcasting transmission and suppression of the Delta variant of SARS-CoV-2 in Australia https://t.co/gPXl4Pxilr
— MJA Editor in Chief (@MJA_Editor) July 15, 2021
6. Differentiable Programming of Reaction-Diffusion Patterns
Alexander Mordvintsev, Ettore Randazzo, Eyvind Niklasson
Reaction-Diffusion (RD) systems provide a computational framework that governs many pattern formation processes in nature. Current RD system design practices boil down to trial-and-error parameter search. We propose a differentiable optimization method for learning the RD system parameters to perform example-based texture synthesis on a 2D plane. We do this by representing the RD system as a variant of Neural Cellular Automata and using task-specific differentiable loss functions. RD systems generated by our method exhibit robust, non-trivial ‘life-like’ behavior.
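The systems being optimized here are classic reaction-diffusion simulations; the paper's contribution is learning their parameters by gradient descent through a Neural-CA-style representation. For reference, a minimal explicit-Euler Gray-Scott simulation (parameter values are common textbook choices, not the paper's learned ones):

```python
import numpy as np

def laplacian(grid):
    """5-point finite-difference Laplacian with wraparound boundaries."""
    return (np.roll(grid, 1, 0) + np.roll(grid, -1, 0)
            + np.roll(grid, 1, 1) + np.roll(grid, -1, 1) - 4 * grid)

def gray_scott_step(u, v, du=0.16, dv=0.08, feed=0.035, kill=0.065, dt=1.0):
    """One explicit-Euler update of the Gray-Scott reaction-diffusion system:
    chemical v consumes u (the u*v^2 term), u is replenished at `feed`,
    and v is removed at `feed + kill`."""
    uvv = u * v * v
    u_new = u + dt * (du * laplacian(u) - uvv + feed * (1 - u))
    v_new = v + dt * (dv * laplacian(v) + uvv - (feed + kill) * v)
    return u_new, v_new

# Seed: u = 1 everywhere, with a small square of the catalyst v in the centre.
n = 64
u = np.ones((n, n))
v = np.zeros((n, n))
v[28:36, 28:36] = 0.5
for _ in range(100):
    u, v = gray_scott_step(u, v)
```

Because every operation above is differentiable, a framework like the paper's can backpropagate a texture loss through many such steps to fit `feed`, `kill`, and the diffusion rates, instead of searching them by trial and error.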
Differentiable Programming of Reaction-Diffusion Patterns
— AK (@ak92501) July 15, 2021
pdf: https://t.co/WNpD7bzWyd
project page: https://t.co/vaVaO9kuo0
a differentiable optimization method for learning the RD system parameters to perform example-based texture synthesis on a 2D plane pic.twitter.com/ksb4XF2duV
7. How Much Can CLIP Benefit Vision-and-Language Tasks?
Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world. However, it has been observed that large-scale pretraining usually can result in better generalization performance, e.g., CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown a strong zero-shot capability on various vision tasks. To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks. We release our code at https://github.com/clip-vil/CLIP-ViL.
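The zero-shot capability the paper builds on reduces to cosine similarity between an image embedding and one text embedding per candidate label. A sketch of that scoring step with random stand-in embeddings (in real use both would come from a CLIP encoder; the temperature value is illustrative):

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=100.0):
    """CLIP-style scoring: cosine similarity between an image embedding and
    one text embedding per candidate label, softmaxed into probabilities."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * txt @ img
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(1)
text_embs = rng.standard_normal((3, 8))               # stand-in text features
image_emb = text_embs[1] + 0.1 * rng.standard_normal(8)  # image near label 1
probs = zero_shot_scores(image_emb, text_embs)
```

The paper's two scenarios replace the usual region-feature encoders with CLIP's visual encoder and either fine-tune per task or continue V&L pre-training on top of it.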
How Much Can CLIP Benefit Vision-and-Language Tasks?
— AK (@ak92501) July 15, 2021
pdf: https://t.co/JvlKycMcBj
github: https://t.co/ilqUdlozPw
competitive or better results on diverse V&L tasks, while establishing new sota results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks pic.twitter.com/1FgZf0eQxu
8. A Generalized Lottery Ticket Hypothesis
Ibrahim Alabdulmohsin, Larisa Markeeva, Daniel Keysers, Ilya Tolstikhin
We introduce a generalization to the lottery ticket hypothesis in which the notion of “sparsity” is relaxed by choosing an arbitrary basis in the space of parameters. We present evidence that the original results reported for the canonical basis continue to hold in this broader setting. We describe how structured pruning methods, including pruning units or factorizing fully-connected layers into products of low-rank matrices, can be cast as particular instances of this “generalized” lottery ticket hypothesis. The investigations reported here are preliminary and are provided to encourage further research along this direction.
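The generalization can be stated operationally: ordinary magnitude pruning keeps the largest coefficients in the canonical basis, so relaxing the basis means rotating the parameters into an arbitrary orthonormal basis, pruning coefficients there, and rotating back. A toy sketch where the "ticket" is sparse in a random orthonormal basis but dense in the canonical one:

```python
import numpy as np

def prune_in_basis(w, basis, keep):
    """Express parameter vector w in the given orthonormal basis, keep only
    the `keep` largest-magnitude coefficients, and map back."""
    coeffs = basis.T @ w                      # coordinates in the new basis
    drop = np.argsort(np.abs(coeffs))[:-keep]  # all but the top-`keep`
    coeffs[drop] = 0.0                        # "sparsity" in this basis
    return basis @ coeffs

rng = np.random.default_rng(0)
d = 64
basis, _ = np.linalg.qr(rng.standard_normal((d, d)))  # arbitrary orthonormal basis
# Construct w that is exactly 8-sparse in `basis` (but dense canonically).
w = basis @ np.concatenate([rng.standard_normal(8) * 10, np.zeros(d - 8)])
w_pruned = prune_in_basis(w, basis, keep=8)
```

With `basis` set to the identity this reduces to standard magnitude pruning; the paper's point is that structured schemes such as low-rank factorization correspond to other basis choices within the same framework.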
A Generalized Lottery Ticket Hypothesis
— AK (@ak92501) July 15, 2021
pdf: https://t.co/UTEOQ9jm4Y
abs: https://t.co/zxvAGYcpWL pic.twitter.com/FOiph4VEQO
Short preprint where we claim that the Lottery Ticket Hypothesis holds for *any* notion of sparsity.
— Ilya Tolstikhin (@tolstikhini) July 15, 2021
Outcome: Actual speedups in inference (still based on the expensive Iterative Magnitude Pruning).
Work together with @ibomohsin @re_rayne @keysers https://t.co/Ui9kmlV9QL
9. GgViz: Accelerating Large-Scale Esports Game Analysis
Peter Xenopoulos, Joao Rulff, Claudio Silva
Game review is crucial for teams, players and media staff in sports. Despite its importance, game review is work-intensive and hard to scale. Recent advances in sports data collection have introduced systems that couple video with clustering techniques to allow users to query sports situations of interest through sketching. However, due to data limitations, as well as differences in the sport itself, esports has seen a dearth of such systems. In this paper, we leverage emerging data for Counter-Strike: Global Offensive (CSGO) to develop ggViz, a novel visual analytics system that allows users to query a large esports data set for similar plays by drawing situations of interest. Along with ggViz, we also present a performant retrieval algorithm that can easily scale to hundreds of millions of game situations. We demonstrate ggViz’s utility through detailed case studies and interviews with staff from professional esports teams.
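One way such sketch-based retrieval can scale is to discretize player positions into grid cells and keep an inverted index from cells to game states. This is an illustrative design in the spirit of the abstract, not ggViz's actual algorithm:

```python
from collections import defaultdict

def state_token(players, cell=100):
    """Discretize player (x, y) positions into grid cells; a game state is
    the set of occupied cells, so retrieval ignores player identity."""
    return frozenset((int(x // cell), int(y // cell)) for x, y in players)

class SituationIndex:
    """Inverted index from grid cells to game-state ids, so a sketched
    situation only touches the states sharing at least one occupied cell."""
    def __init__(self):
        self.by_cell = defaultdict(set)
        self.states = {}

    def add(self, state_id, players):
        token = state_token(players)
        self.states[state_id] = token
        for c in token:
            self.by_cell[c].add(state_id)

    def query(self, players, min_overlap=2):
        """Return ids of states sharing at least `min_overlap` occupied
        cells with the sketched situation, best matches first."""
        counts = defaultdict(int)
        for c in state_token(players):
            for sid in self.by_cell[c]:
                counts[sid] += 1
        hits = [(n, sid) for sid, n in counts.items() if n >= min_overlap]
        return [sid for n, sid in sorted(hits, reverse=True)]

index = SituationIndex()
index.add("rush_b", [(10, 10), (150, 150), (310, 320)])
index.add("save", [(900, 900)])
```

Because each query only inspects states indexed under its own cells, lookup cost stays roughly proportional to the number of candidate matches rather than the full corpus.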
For most sports, but especially esports, we analyze the game by watching film, which makes finding specific player setups time-consuming. In ggViz, we introduce a system to query a large CSGO dataset to find player setups. Link to paper: https://t.co/39OkWDynqk pic.twitter.com/xx2J1fDk0g
— Peter Xenopoulos (@peterxeno) July 15, 2021
10. Generative and reproducible benchmarks for comprehensive evaluation of machine learning classifiers
Patryk Orzechowski, Jason H. Moore
- retweets: 51, favorites: 7 (07/16/2021 07:00:59)
- cs.LG | cs.AI | cs.CV | cs.NE | stat.ML
Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial for determining their scope of application. Here, we introduce the DIverse and GENerative ML Benchmark (DIGEN) - a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of machine learning algorithms for classification of binary outcomes. The DIGEN resource consists of 40 mathematical functions which map continuous features to discrete endpoints for creating synthetic datasets. These 40 functions were discovered using a heuristic algorithm designed to maximize the diversity of performance among multiple popular machine learning algorithms, thus providing a useful test suite for evaluating and comparing new methods. Access to the generative functions facilitates understanding of why a method performs poorly compared to other algorithms, thus providing ideas for improvement. The resource with extensive documentation and analyses is open-source and available on GitHub.
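The generative recipe, a mathematical function mapping continuous features to a binary endpoint, can be sketched in a few lines. The example function and the median-threshold rule below are illustrative, not one of DIGEN's 40 discovered functions:

```python
import math
import random

def make_dataset(f, n_features, n_samples, seed=0):
    """DIGEN-style generator: draw continuous features, score each sample
    with the generative function f, and threshold at the median score so the
    binary classes come out roughly balanced."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    scores = [f(x) for x in X]
    cutoff = sorted(scores)[n_samples // 2]
    y = [1 if s >= cutoff else 0 for s in scores]
    return X, y

# An illustrative generative function: a nonlinear interaction of features.
X, y = make_dataset(lambda x: math.sin(x[0]) * x[1] + x[2] ** 2,
                    n_features=5, n_samples=200)
```

Having the function in hand is what makes failures interpretable: if a classifier scores poorly on a dataset, one can inspect the interaction it failed to capture.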
11. How to make qubits speak
Bob Coecke, Giovanni de Felice, Konstantinos Meichanetzidis, Alexis Toumi
This is a story about making quantum computers speak, and doing so in a quantum-native, compositional and meaning-aware manner. Recently we did question-answering with an actual quantum computer. We explain what we did, stress that this was all done in terms of pictures, and provide many pointers to the related literature. In fact, besides natural language, many other things can be implemented in a quantum-native, compositional and meaning-aware manner, and we provide the reader with some indications of that broader pictorial landscape, including our account on the notion of compositionality. We also provide some guidance for the actual execution, so that the reader can give it a go as well.
New paper on arXiv with @konstantinosmei @AlexisToumi @gio_defel with easy reading QNLP, and also some digressions on "compositionality". https://t.co/4tvNlMCqx1
— bOb cOeCke (@coecke) July 15, 2021
To appear in @bio_computer's book on quantum and the arts.
12. Learning Algebraic Recombination for Compositional Generalization
Chenyao Liu, Shengnan An, Zeqi Lin, Qian Liu, Bei Chen, Jian-Guang Lou, Lijie Wen, Nanning Zheng, Dongmei Zhang
Neural sequence models exhibit limited compositional generalization ability in semantic parsing tasks. Compositional generalization requires algebraic recombination, i.e., dynamically recombining structured expressions in a recursive manner. However, most previous studies mainly concentrate on recombining lexical units, which is an important but not sufficient part of algebraic recombination. In this paper, we propose LeAR, an end-to-end neural model to learn algebraic recombination for compositional generalization. The key insight is to model the semantic parsing task as a homomorphism between a latent syntactic algebra and a semantic algebra, thus encouraging algebraic recombination. Specifically, we learn two modules jointly: a Composer for producing latent syntax, and an Interpreter for assigning semantic operations. Experiments on two realistic and comprehensive compositional generalization benchmarks demonstrate the effectiveness of our model. The source code is publicly available at https://github.com/microsoft/ContextualSP.
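The homomorphism idea, assigning each latent syntactic operation a semantic operation so meanings recombine recursively, can be illustrated on a SCAN-style toy grammar. The labels and tree below are hypothetical, standing in for what LeAR's learned Composer (which produces the tree) and Interpreter (which assigns the operations) would compute:

```python
# Semantic algebra: one operation per syntactic node label.
SEMANTICS = {
    "twice": lambda prog: prog + prog,          # repeat a sub-program
    "and":   lambda left, right: left + right,  # sequence two sub-programs
    "jump":  lambda: ["JUMP"],
    "walk":  lambda: ["WALK"],
}

def interpret(tree):
    """Homomorphic interpretation: interpret the children first, then apply
    the semantic operation assigned to the node's label."""
    label, *children = tree
    return SEMANTICS[label](*[interpret(c) for c in children])

# "walk and jump twice" as a latent syntax tree (hypothetical Composer output):
tree = ("and", ("walk",), ("twice", ("jump",)))
```

Because the meaning of each node depends only on the meanings of its parts, novel recombinations of familiar pieces are interpreted for free, which is exactly the compositional generalization the benchmarks test.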