1. Explaining Neural Scaling Laws
Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, Utkarsh Sharma
- retweets: 3332, favorites: 351 (02/16/2021 09:39:16)
- links: abs | pdf
- cs.LG | cond-mat.dis-nn | stat.ML
The test loss of well-trained neural networks often follows precise power-law scaling relations with either the size of the training dataset or the number of parameters in the network. We propose a theory that explains and connects these scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold. In the large width limit, this can be equivalently obtained from the spectrum of certain kernels, and we present evidence that large width and large dataset resolution-limited scaling exponents are related by a duality. We exhibit all four scaling regimes in the controlled setting of large random feature and pretrained models and test the predictions empirically on a range of standard architectures and datasets. We also observe several empirical relationships between datasets and scaling exponents: super-classing image tasks does not change exponents, while changing input distribution (via changing datasets or adding noise) has a strong effect. We further explore the effect of architecture aspect ratio on scaling exponents.
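For readers who want to see what estimating one of these exponents looks like in practice, here is a minimal sketch (not code from the paper) that fits a resolution-limited power law L(D) = c·D^(−α) + L∞ to hypothetical test-loss measurements at several dataset sizes; all numbers below are placeholders.

```python
# Minimal sketch (not from the paper): fit a resolution-limited power law
# L(D) ~ c * D**(-alpha) + L_inf to hypothetical test-loss measurements.
import numpy as np
from scipy.optimize import curve_fit

def power_law(D, c, alpha, L_inf):
    return c * D ** (-alpha) + L_inf

# Hypothetical (dataset size, test loss) pairs for illustration only.
D = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 3e5])
L = np.array([2.10, 1.55, 1.18, 0.92, 0.74, 0.62])

params, _ = curve_fit(power_law, D, L, p0=[10.0, 0.3, 0.1])
c, alpha, L_inf = params
print(f"fitted exponent alpha ~ {alpha:.3f}, irreducible loss ~ {L_inf:.3f}")
```

The same fit, with parameter count in place of dataset size, applies to the model-size scaling regime.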
Explaining Neural Scaling Laws
— Aran Komatsuzaki (@arankomatsuzaki) February 15, 2021
Proposes a theory that explains and connects various scaling laws concerning the size of the dataset, the number of parameters, resolution and variance. https://t.co/rq96u1mkyL pic.twitter.com/GWNzWOidBz
2. A Too-Good-to-be-True Prior to Reduce Shortcut Reliance
Nikolay Dagaev, Brett D. Roads, Xiaoliang Luo, Daniel N. Barry, Kaustubh R. Patil, Bradley C. Love
Despite their impressive performance in object recognition and other tasks under standard testing conditions, deep convolutional neural networks (DCNNs) often fail to generalize to out-of-distribution (o.o.d.) samples. One cause for this shortcoming is that modern architectures tend to rely on “shortcuts” - superficial features that correlate with categories without capturing deeper invariants that hold across contexts. Real-world concepts often possess a complex structure that can vary superficially across contexts, so that the most intuitive and promising solutions in one context may fail to generalize to others. One potential way to improve o.o.d. generalization is to assume simple solutions are unlikely to be valid across contexts and downweight them, which we refer to as the too-good-to-be-true prior. We implement this inductive bias in a two-stage approach that uses predictions from a low-capacity network (LCN) to inform the training of a high-capacity network (HCN). Since the shallow architecture of the LCN can only learn surface relationships, which include shortcuts, we downweight training items for the HCN that the LCN can master, thereby encouraging the HCN to rely on deeper invariant features that should generalize broadly. Using a modified version of the CIFAR-10 dataset in which we introduced shortcuts, we found that the two-stage LCN-HCN approach reduced reliance on shortcuts and facilitated o.o.d. generalization.
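A minimal sketch of the two-stage idea, assuming one concrete choice of weighting function (the abstract does not specify it): items the trained LCN classifies confidently are downweighted in the HCN's loss.

```python
# Minimal sketch (weighting details assumed, not from the paper): downweight
# training items that a trained low-capacity network (LCN) already masters,
# then use those weights in the high-capacity network (HCN) loss.
import torch
import torch.nn.functional as F

def example_weights(lcn_logits, labels):
    # Confidence the LCN assigns to the true class; high confidence suggests
    # a "too good to be true" shortcut item, so it gets a low weight.
    p_true = F.softmax(lcn_logits, dim=1).gather(1, labels.view(-1, 1)).squeeze(1)
    return 1.0 - p_true.detach()

def weighted_hcn_loss(hcn_logits, labels, weights):
    per_example = F.cross_entropy(hcn_logits, labels, reduction="none")
    return (weights * per_example).mean()
```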
New preprint, "A Too-Good-to-be-True Prior to Reduce Shortcut Reliance". If it's too good to be true, it probably is and that holds for deep learning as well. To generalize broadly, models need to learn invariants but instead are fooled by shortcuts. https://t.co/ylrZcVxSGS (1/4) pic.twitter.com/gxPilisojd
— Bradley Love (@ProfData) February 15, 2021
3. A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes
Zachary Nado, Justin M. Gilmer, Christopher J. Shallue, Rohan Anil, George E. Dahl
Recently the LARS and LAMB optimizers have been proposed for training neural networks faster using large batch sizes. LARS and LAMB add layer-wise normalization to the update rules of Heavy-ball momentum and Adam, respectively, and have become popular in prominent benchmarks and deep learning libraries. However, without fair comparisons to standard optimizers, it remains an open question whether LARS and LAMB have any benefit over traditional, generic algorithms. In this work we demonstrate that standard optimization algorithms such as Nesterov momentum and Adam can match or exceed the results of LARS and LAMB at large batch sizes. Our results establish new, stronger baselines for future comparisons at these batch sizes and shed light on the difficulties of comparing optimizers for neural network training more generally.
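To make the comparison concrete, here is a simplified sketch of the layer-wise normalization ("trust ratio") that LARS adds on top of Heavy-ball momentum, next to a plain momentum update; momentum accumulation, epsilon handling, and other details of the real optimizers are omitted.

```python
# Simplified sketch of the layer-wise scaling idea behind LARS versus a
# plain Heavy-ball momentum step (illustrative, not a full optimizer).
import torch

def lars_scaled_update(param, grad, lr, weight_decay=1e-4, eps=1e-9):
    g = grad + weight_decay * param
    trust_ratio = param.norm() / (g.norm() + eps)   # layer-wise normalization
    return param - lr * trust_ratio * g

def heavy_ball_update(param, grad, velocity, lr, momentum=0.9):
    velocity = momentum * velocity + grad           # no per-layer scaling
    return param - lr * velocity, velocity
```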
A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes
— Aran Komatsuzaki (@arankomatsuzaki) February 15, 2021
In fact, Nesterov momentum and Adam match or exceed the results of LARS and LAMB at large batch sizes. https://t.co/yI3L7CiuKX pic.twitter.com/J8qJuiQA0D
4. Banana for scale: Gauging trends in academic interest by normalising publication rates to common and innocuous keywords
Edwin S. Dalmaijer, Joram Van Rheede, Edwin V. Sperr, Juliane Tkotz
Many academics use yearly publication numbers to quantify academic interest for their research topic. While such visualisations are ubiquitous in grant applications, manuscript introductions, and review articles, they fail to account for the rapid growth in scientific publications. As a result, any search term will likely show an increase in supposed “academic interest”. One proposed solution is to normalise yearly publication rates by field size, but this is arduous and difficult. Here, we propose a simpler index that normalises keywords of interest by a ubiquitous and innocuous keyword, such as “banana”. Alternatively, one could opt for field-specific keywords or hierarchical structures (e.g. PubMed’s Medical Subject Headings, MeSH) to compute “interest market share”. Using this approach, we uncovered plausible trends in academic interest in examples from the medical literature. In neuroimaging, we found that not the supplementary motor area (as was previously claimed), but the prefrontal cortex is the most interesting part of the brain. In cancer research, we found a contemporary preference for cancers with high prevalence and clinical severity, and notable declines in interest for more treatable or likely benign neoplasms. Finally, we found that interest in respiratory viral infections spiked when strains showed potential for pandemic involvement, with SARS-CoV-2 and the COVID-19 pandemic being the most extreme example. In sum, the time is ripe for a quick and easy method to quantify trends in academic interest for anecdotal purposes. We provide such a method, along with software for researchers looking to implement it in their own writing.
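A minimal sketch of the proposed index, with placeholder counts standing in for per-year hits from a literature database query:

```python
# Minimal sketch of the proposed normalisation: divide yearly hit counts for
# a keyword of interest by the hit counts for a common, innocuous keyword.
# The counts below are placeholders; in practice they would come from a
# per-year query against a literature database such as PubMed.
topic_hits  = {2018: 1200, 2019: 1500, 2020: 2100}   # hypothetical counts
banana_hits = {2018: 800,  2019: 900,  2020: 1100}   # hypothetical counts

normalised_interest = {
    year: topic_hits[year] / banana_hits[year] for year in topic_hits
}
print(normalised_interest)  # trend relative to the "banana" baseline
```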
Writing a grant or review? Highlighting changes in interest for the topic? That's hard when publication rates skyrocket everywhere! Here's a simple solution: just use a banana for scale! 🍌
— Edwin Dalmaijer (@esdalmaijer) February 15, 2021
Manuscript with @Joranium @edsperr @juli_tkotz here: https://t.co/uS0mAOk1re
Thread below! pic.twitter.com/3KN7wedEGm
5. Why Don’t Developers Detect Improper Input Validation?’; DROP TABLE Papers; --
Larissa Braz, Enrico Fregnan, Gül Çalikli, Alberto Bacchelli
Improper Input Validation (IIV) is a software vulnerability that occurs when a system does not safely handle input data. Even though IIV is easy to detect and fix, it still commonly happens in practice. In this paper, we study to what extent developers can detect IIV and investigate underlying reasons. This knowledge is essential to better understand how to support developers in creating secure software systems. We conduct an online experiment with 146 participants, of which 105 report at least three years of professional software development experience. Our results show that the existence of a visible attack scenario facilitates the detection of IIV vulnerabilities and that a significant portion of developers who did not find the vulnerability initially could identify it when warned about its existence. Yet, a total of 60 participants could not detect the vulnerability even after the warning. Other factors, such as the frequency with which the participants perform code reviews, influence the detection of IIV. Data and materials: https://doi.org/10.5281/zenodo.3996696
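As a concrete illustration of the vulnerability class (and of the SQL-injection joke in the title), here is a minimal, generic sketch of an improperly validated query next to a parameterized one; it is not one of the study's experimental tasks.

```python
# Generic illustration of Improper Input Validation (IIV) in the SQL setting
# the title alludes to, plus the parameterized fix.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT)")
conn.execute("INSERT INTO papers VALUES ('Some Paper')")

def find_papers_vulnerable(user_input):
    # Unvalidated input is concatenated into the query string: this permits
    # injection in general and already breaks on inputs containing a quote.
    return conn.execute(f"SELECT * FROM papers WHERE title = '{user_input}'").fetchall()

def find_papers_safe(user_input):
    # Parameterized query: the database driver handles escaping.
    return conn.execute("SELECT * FROM papers WHERE title = ?", (user_input,)).fetchall()
```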
Our "Why Don’t Developers Detect Improper Input Validation? '; DROP TABLE Papers; --" @ICSEconf 2021 paper pre-print is now available!
— Larissa Braz (@larissabrazb) February 15, 2021
check it out: https://t.co/fn0oBuZLRH @EnFregnan @GulCalikli @sback_
6. Optimizing Inference Performance of Transformers on CPUs
Dave Dice, Alex Kogan
- retweets: 143, favorites: 48 (02/16/2021 09:39:17)
- links: abs | pdf
- cs.CL | cs.AI | cs.DC | cs.LG | cs.MS
The Transformer architecture revolutionized the field of natural language processing (NLP). Transformer-based models (e.g., BERT) power many important Web services, such as search, translation, and question-answering. While enormous research attention is paid to the training of those models, relatively little effort is made to improve their inference performance. This paper addresses this gap by presenting an empirical analysis of the scalability and performance of inference with a Transformer-based model on CPUs. Focusing on the highly popular BERT model, we identify key components of the Transformer architecture where the bulk of the computation happens, and propose three optimizations to speed them up. The optimizations are evaluated using the inference benchmark from HuggingFace, and are shown to achieve a speedup of up to 2.36x. The considered optimizations require no changes to the implementation of the models and do not affect their accuracy.
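The abstract does not spell out the three optimizations, but a minimal sketch of the kind of CPU inference measurement involved, using HuggingFace Transformers and PyTorch with an explicitly chosen thread count, might look like this (model name and loop size are illustrative):

```python
# Minimal sketch (not the paper's benchmark): time BERT inference on CPU
# with HuggingFace Transformers and PyTorch, controlling the thread count.
import time
import torch
from transformers import AutoModel, AutoTokenizer

torch.set_num_threads(4)  # intra-op parallelism, one common CPU inference knob

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

inputs = tokenizer("a short example sentence", return_tensors="pt")
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(20):
        model(**inputs)
    print(f"avg latency: {(time.perf_counter() - start) / 20 * 1e3:.1f} ms")
```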
Optimizing Inference Performance of Transformers on CPUs
— AK (@ak92501) February 15, 2021
pdf: https://t.co/PViwYSdCVy
abs: https://t.co/L4DiVypnOH pic.twitter.com/Ave1s4XTXj
7. Improving Object Detection in Art Images Using Only Style Transfer
David Kadish, Sebastian Risi, Anders Sundnes Løvlie
Despite recent advances in object detection using deep learning neural networks, these neural networks still struggle to identify objects in art images such as paintings and drawings. This challenge is known as the cross depiction problem and it stems in part from the tendency of neural networks to prioritize identification of an object’s texture over its shape. In this paper we propose and evaluate a process for training neural networks to localize objects - specifically people - in art images. We generate a large dataset for training and validation by modifying the images in the COCO dataset using AdaIn style transfer. This dataset is used to fine-tune a Faster R-CNN object detection network, which is then tested on the existing People-Art testing dataset. The result is a significant improvement on the state of the art and a new way forward for creating datasets to train neural networks to process art images.
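A minimal sketch of the fine-tuning setup described above, using torchvision's Faster R-CNN with a two-class (background/person) head; the AdaIN-stylized COCO data pipeline is omitted and assumed to exist.

```python
# Minimal sketch: fine-tune a torchvision Faster R-CNN for person detection.
# The AdaIN-stylized COCO dataloader is assumed and not shown.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Replace the box predictor for a two-class problem: background + person.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
# The training loop would call model(images, targets) on stylized COCO
# batches and backpropagate the summed loss dictionary.
```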
Improving Object Detection in Art Images Using Only Style Transfer
— AK (@ak92501) February 15, 2021
pdf: https://t.co/kxMdMmurys
abs: https://t.co/IBkfF1l2Gz pic.twitter.com/frEYjvuEYU
8. How do climate change skeptics engage with opposing views? Understanding mechanisms of social identity and cognitive dissonance in an online forum
Lisa Oswald, Jonathan Bright
Does engagement with opposing views help break down ideological 'echo chambers', or does it backfire and reinforce them? This question remains critical as academics, policymakers and activists grapple with how to regulate political discussion on social media. In this study, we contribute to the debate by examining the impact of opposing views within a major climate change skeptic online community on Reddit. A large sample of posts (N = 3000) was manually coded as either dissonant or consonant, which allowed the automated classification of the full dataset of more than 50,000 posts, with codes inferred from linked websites. We find that ideologically dissonant submissions act as a stimulant to activity in the community: they received more attention (comments) than consonant submissions, even though they received lower scores through up-voting and down-voting. Users who engaged with dissonant submissions were also more likely to return to the forum. Consistent with identity theory, confrontation with opposing views triggered activity in the forum, particularly among users who are highly engaged with the community. In light of the findings, the theory of social identity and echo chambers is discussed and extended.
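A minimal sketch of one way the "codes inferred from linked websites" step could be automated; the domain lists and the rule are assumptions for illustration, not the authors' pipeline.

```python
# Minimal sketch (assumed rule, not the authors' pipeline): infer a
# consonant/dissonant label for a submission from the domain it links to.
from urllib.parse import urlparse

# Hypothetical, placeholder domain lists for illustration only.
CONSONANT_DOMAINS = {"skeptic-blog.example"}
DISSONANT_DOMAINS = {"climate-science.example"}

def label_submission(url):
    domain = urlparse(url).netloc.removeprefix("www.")
    if domain in CONSONANT_DOMAINS:
        return "consonant"
    if domain in DISSONANT_DOMAINS:
        return "dissonant"
    return "unlabelled"
```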
How do climate change skeptics engage with opposing views? Understanding mechanisms of social identity and cognitive dissonance in an online forum - Fresh draft with @jonmbright out on https://t.co/RRcHNGXnNd ! ✨ pic.twitter.com/R7V3zYbknB
— Lisa Oswald (@LisaFOswaldo) February 15, 2021
9. VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention
Peng Liu, Yuewen Cao, Songxiang Liu, Na Hu, Guangzhi Li, Chao Weng, Dan Su
This paper proposes VARA-TTS, a non-autoregressive (non-AR) text-to-speech (TTS) model using a very deep Variational Autoencoder (VDVAE) with a Residual Attention mechanism, which refines the textual-to-acoustic alignment layer by layer. Hierarchical latent variables with different temporal resolutions from the VDVAE are used as queries for the residual attention modules. By leveraging the coarse global alignment from the previous attention layer as an extra input, the following attention layer can produce a refined version of the alignment. This amortizes the burden of learning the textual-to-acoustic alignment across multiple attention layers and is more robust than using only a single attention layer. An utterance-level speaking speed factor is computed by a jointly-trained speaking speed predictor, which takes the mean-pooled latent variables of the coarsest layer as input, to determine the number of acoustic frames at inference. Experimental results show that VARA-TTS achieves speech quality slightly inferior to its AR counterpart, Tacotron 2, but an order-of-magnitude speed-up at inference, and outperforms an analogous non-AR model, BVAE-TTS, in terms of speech quality.
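One reading of the residual attention mechanism, sketched below as a toy PyTorch module (an interpretation of the abstract, not the authors' implementation): each attention layer receives the previous layer's alignment energies as an extra input and refines them additively before the softmax.

```python
# Toy sketch of layer-wise alignment refinement with residual attention
# (an interpretation of the abstract, not the released model).
import torch
import torch.nn as nn

class ResidualAttention(nn.Module):
    def __init__(self, query_dim, key_dim, hidden_dim=128):
        super().__init__()
        self.q = nn.Linear(query_dim, hidden_dim)
        self.k = nn.Linear(key_dim, hidden_dim)

    def forward(self, query, keys, prev_align_logits=None):
        # query: (B, T_dec, query_dim), keys: (B, T_enc, key_dim)
        energies = torch.bmm(self.q(query), self.k(keys).transpose(1, 2))
        if prev_align_logits is not None:
            energies = energies + prev_align_logits  # reuse the coarse alignment
        align = torch.softmax(energies, dim=-1)      # (B, T_dec, T_enc)
        context = torch.bmm(align, keys)             # attend over encoder keys
        return context, energies                     # energies feed the next layer
```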
VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention
— Aran Komatsuzaki (@arankomatsuzaki) February 15, 2021
Proposes a non-autoregressive end-to-end text-to-speech model that performs close to Tacotron 2 with substantially faster inference speed. https://t.co/wEh5yDXas7 pic.twitter.com/jlXlHMP1Mw
VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention
— AK (@ak92501) February 15, 2021
pdf: https://t.co/yiprp0IeDK
abs: https://t.co/mSg7OUyF1w
project page: https://t.co/KpQg9S9TOT pic.twitter.com/60DyWjs6NR
10. Efficient Conditional GAN Transfer with Knowledge Propagation across Classes
Mohamad Shahbazi, Zhiwu Huang, Danda Pani Paudel, Ajad Chhatkuli, Luc Van Gool
Generative adversarial networks (GANs) have shown impressive results in both unconditional and conditional image generation. Recent literature shows that a GAN pre-trained on a different dataset can be transferred to improve image generation from small target data. The same, however, has not been well-studied in the case of conditional GANs (cGANs), which provide new opportunities for knowledge transfer compared to the unconditional setup. In particular, the new classes may borrow knowledge from the related old classes, or share knowledge among themselves to improve the training. This motivates us to study the problem of efficient conditional GAN transfer with knowledge propagation across classes. To address this problem, we introduce a new GAN transfer method to explicitly propagate the knowledge from the old classes to the new classes. The key idea is to enforce the popularly used conditional batch normalization (BN) to learn the class-specific information of the new classes from that of the old classes, with implicit knowledge sharing among the new ones. This allows for an efficient knowledge propagation from the old classes to the new classes, with the BN parameters increasing linearly with the number of new classes. An extensive evaluation demonstrates the clear superiority of the proposed method over state-of-the-art competitors for efficient conditional GAN transfer tasks. The code will be available at: https://github.com/mshahbazi72/cGANTransfer
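A minimal sketch of how the key idea could look in code, under the assumption (ours, not confirmed by the abstract) that each new class's conditional BN parameters are a learned combination of the frozen old-class parameters, so the added parameters grow linearly with the number of new classes.

```python
# Minimal sketch (assumed formulation, not the released code): new-class
# conditional BN parameters as a learned combination of old-class parameters.
import torch
import torch.nn as nn

class TransferConditionalBN(nn.Module):
    def __init__(self, num_features, old_gamma, old_beta, num_new_classes):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.register_buffer("old_gamma", old_gamma)   # (num_old, C), frozen
        self.register_buffer("old_beta", old_beta)     # (num_old, C), frozen
        # One combination vector per new class over the old classes.
        self.combine = nn.Parameter(torch.zeros(num_new_classes, old_gamma.size(0)))

    def forward(self, x, new_class_idx):
        w = torch.softmax(self.combine[new_class_idx], dim=-1)  # (B, num_old)
        gamma = w @ self.old_gamma                               # (B, C)
        beta = w @ self.old_beta
        x = self.bn(x)
        return gamma[:, :, None, None] * x + beta[:, :, None, None]
```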
Efficient Conditional GAN Transfer with Knowledge Propagation across Classes
— AK (@ak92501) February 15, 2021
pdf: https://t.co/qKojLkD9AU
abs: https://t.co/qg5XUVxYBm pic.twitter.com/x0iadNTNRK
11. Same File, Different Changes: The Potential of Meta-Maintenance on GitHub
Hideaki Hata, Raula Gaikovina Kula, Takashi Ishio, Christoph Treude
Online collaboration platforms such as GitHub have provided software developers with the ability to easily reuse and share code between repositories. With clone-and-own and forking becoming prevalent, maintaining these shared files is important, especially for keeping the most up-to-date version of reused code. In contrast to related work, we propose the concept of meta-maintenance, i.e., tracking how the same files evolve in different repositories with the aim of providing useful maintenance opportunities for those files. We conduct an exploratory study by analyzing repositories from seven different programming languages to explore the potential of meta-maintenance. Our results indicate that a majority of active repositories on GitHub contain at least one file which is also present in another repository, and that a significant minority of these files are maintained differently in the different repositories which contain them. We manually analyzed a representative sample of shared files and their variants to understand which changes might be useful for meta-maintenance. Our findings support the potential of meta-maintenance and open up avenues for future work to capitalize on this potential.
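A minimal sketch of the underlying idea, reduced to two locally cloned repositories: hash file contents to find files present in both (matched by relative path here, as a simplification) and flag those whose contents have diverged.

```python
# Minimal sketch: find files shared between two local repository clones
# whose contents have diverged (candidates for meta-maintenance).
import hashlib
import os

def file_hashes(repo_path):
    hashes = {}
    for root, _, files in os.walk(repo_path):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                hashes[os.path.relpath(path, repo_path)] = hashlib.sha1(f.read()).hexdigest()
    return hashes

def shared_but_divergent(repo_a, repo_b):
    a, b = file_hashes(repo_a), file_hashes(repo_b)
    shared = set(a) & set(b)          # matched by relative path (simplification)
    return [p for p in shared if a[p] != b[p]]
```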
In our @ICSEconf 2021 paper we (@Augaiko, Takashi Ishio, @ctreude) propose 'meta-maintenance', a concept for maintaining the entire software ecosystem.
— Hideaki Hata (@hideakihata) February 15, 2021
#icsePromo
Preprint: https://t.co/Mo8Gn7mEyw
Data: https://t.co/tyq7YYF5MR
12. Multiversal views on language models
Laria Reynolds, Kyle McDonell
The virtuosity of language models like GPT-3 opens a new world of possibility for human-AI collaboration in writing. In this paper, we present a framework in which generative language models are conceptualized as multiverse generators. This framework also applies to human imagination and is core to how we read and write fiction. We call for exploration into this commonality through new forms of interfaces which allow humans to couple their imagination to AI to write, explore, and understand non-linear fiction. We discuss the early insights we have gained from actively pursuing this approach by developing and testing a novel multiversal GPT-3-assisted writing interface.
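A minimal sketch of the "multiverse generator" framing using an openly available model (GPT-2 standing in for GPT-3): sample several continuations of the same prompt, each one a branch of the story tree that a writing interface could let the user expand further.

```python
# Minimal sketch: sample multiple branching continuations of one prompt
# with GPT-2 (standing in for GPT-3 here).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The door at the end of the corridor opened, and"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    branches = model.generate(
        input_ids,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=30,
        num_return_sequences=3,
        pad_token_id=tokenizer.eos_token_id,
    )
for b in branches:   # each decoded continuation is one branch of the multiverse
    print(tokenizer.decode(b[input_ids.size(1):], skip_special_tokens=True))
```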
Multiversal views on language models
— AK (@ak92501) February 15, 2021
pdf: https://t.co/jSptVyWkP7
abs: https://t.co/M5pn2i9NUw pic.twitter.com/5VU7iDhSM6
13. End-to-end Audio-visual Speech Recognition with Conformers
Pingchuan Ma, Stavros Petridis, Maja Pantic
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer) that can be trained in an end-to-end manner. In particular, the audio and visual encoders learn to extract features directly from raw audio waveforms and pixels, respectively, which are then fed to conformers, and fusion takes place via a Multi-Layer Perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism. We show that end-to-end training, instead of using pre-computed visual features as is common in the literature, the use of a conformer instead of a recurrent network, and the use of a transformer-based language model significantly improve the performance of our model. We present results on the largest publicly available datasets for sentence-level speech recognition, Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3). The results show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
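A minimal sketch of a hybrid CTC/attention objective of the kind described (the weight and tensor shapes are illustrative; the actual model couples this with the ResNet-18/Conformer front-ends):

```python
# Illustrative hybrid CTC/attention loss; shapes and the mixing weight are
# placeholders, not the paper's exact configuration.
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
att_loss = nn.CrossEntropyLoss(ignore_index=-1)

def hybrid_loss(ctc_log_probs, ctc_targets, input_lens, target_lens,
                att_logits, att_targets, alpha=0.3):
    # ctc_log_probs: (T, B, C) log-probabilities; att_logits: (B, L, C)
    l_ctc = ctc_loss(ctc_log_probs, ctc_targets, input_lens, target_lens)
    l_att = att_loss(att_logits.transpose(1, 2), att_targets)  # (B, C, L) vs (B, L)
    return alpha * l_ctc + (1 - alpha) * l_att
```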
End-to-end Audio-visual Speech Recognition with Conformers
— AK (@ak92501) February 15, 2021
pdf: https://t.co/f3HDIZpWnK
abs: https://t.co/EYBm5zeiGX pic.twitter.com/YAOMTo167b
14. The Software Heritage Filesystem (SwhFS): Integrating Source Code Archival with Development
Thibault Allançon, Antoine Pietri, Stefano Zacchiroli
We introduce the Software Heritage filesystem (SwhFS), a user-space filesystem that integrates large-scale open source software archival with development workflows. SwhFS provides a POSIX filesystem view of Software Heritage, the largest public archive of software source code and version control system (VCS) development history. Using SwhFS, developers can quickly “checkout” any of the 2 billion commits archived by Software Heritage, even after they disappear from their previous known location and without incurring the performance cost of repository cloning. SwhFS works across unrelated repositories and different VCS technologies. Other source code artifacts archived by Software Heritage (individual source code files and trees, releases, and branches) can also be accessed using common programming tools and custom scripts, as if they were locally available. A screencast of SwhFS is available online at dx.doi.org/10.5281/zenodo.4531411.
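A minimal sketch of the "custom scripts" use case, assuming SwhFS is already mounted at ~/swhfs and exposes archived objects under an archive/<SWHID>/ directory (the path layout here is an assumption; see the SwhFS documentation and screencast for the actual interface). The SWHID is a placeholder.

```python
# Minimal sketch: browse an archived revision through an assumed SwhFS mount
# point as if it were a local checkout. Mount path, layout, and SWHID are
# placeholders/assumptions, not taken from the paper.
import os

MOUNT = os.path.expanduser("~/swhfs")
revision = "swh:1:rev:0000000000000000000000000000000000000000"  # placeholder SWHID

root = os.path.join(MOUNT, "archive", revision)
for entry in os.listdir(root):   # ordinary filesystem calls, no repository cloning
    print(entry)
```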