1. Grounded Compositional Outputs for Adaptive Language Modeling
Nikolaos Pappas, Phoebe Mulcaire, Noah A. Smith
Language models have emerged as a central component across NLP, and a great deal of progress depends on the ability to cheaply adapt them (e.g., through finetuning) to new domains and tasks. A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size and is part of what makes it resistant to such adaptation. Prior work has used compositional input embeddings based on surface forms to ameliorate this issue. In this work, we go one step further and propose a fully compositional output embedding layer for language models, which is further grounded in information from a structured lexicon (WordNet), namely semantically related words and free-text definitions. To our knowledge, the result is the first word-level language model whose size does not depend on the training vocabulary. We evaluate the model on conventional language modeling as well as challenging cross-domain settings with an open vocabulary, finding that it matches or outperforms previous state-of-the-art output embedding methods and adaptation approaches. Our analysis attributes the improvements to sample efficiency: our model is more accurate for low-frequency words.
GroC: a new (word-level) language modeling approach using compositional outputs, so it's ready to score new words not seen in training and it doesn't need to grow with the vocabulary size. Work by @nik0spapp @PhoebeNLP @nlpnoah to appear at EMNLP 2020. https://t.co/K2wekA7u4K
— Noah A Smith (@nlpnoah) September 25, 2020
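As a rough illustration of the compositional-output idea in the abstract above, here is a minimal sketch. The character-CNN encoder, layer sizes, and names are illustrative assumptions, not the authors' exact GroC architecture, which additionally grounds embeddings in WordNet relations and definitions.

```python
# Minimal sketch of a compositional output embedding layer (illustrative only).
import torch
import torch.nn as nn

class CompositionalOutputEmbedding(nn.Module):
    """Builds an output embedding for any surface form from its characters,
    so the softmax layer's size does not depend on the training vocabulary."""

    def __init__(self, n_chars=128, char_dim=32, hidden_dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, hidden_dim, kernel_size=3, padding=1)

    def embed_word(self, word):
        # Encode raw characters; max-pool over positions to a fixed-size vector.
        ids = torch.tensor([[min(ord(c), 127) for c in word]])
        x = self.char_emb(ids).transpose(1, 2)         # (1, char_dim, len)
        return torch.relu(self.conv(x)).max(dim=2).values.squeeze(0)

    def forward(self, hidden, candidate_words):
        # Score an LM hidden state against compositionally built embeddings,
        # including words never seen during training.
        E = torch.stack([self.embed_word(w) for w in candidate_words])
        return hidden @ E.T                            # (batch, n_candidates)

out_layer = CompositionalOutputEmbedding()
h = torch.randn(1, 256)                                # stand-in LM hidden state
logits = out_layer(h, ["cat", "dog", "zyzzyva"])       # OOV words score fine
```

The point of the construction is visible in the last line: an unseen word gets a score through its surface form alone, so no output row needs to be allocated per vocabulary item.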
2. A Unifying Review of Deep and Shallow Anomaly Detection
Lukas Ruff, Jacob R. Kauffmann, Robert A. Vandermeulen, Grégoire Montavon, Wojciech Samek, Marius Kloft, Thomas G. Dietterich, Klaus-Robert Müller
Deep learning approaches to anomaly detection have recently improved the state of the art in detection performance on complex datasets such as large collections of images or text. These results have sparked a renewed interest in the anomaly detection problem and led to the introduction of a great variety of new methods. With the emergence of numerous such methods, including approaches based on generative models, one-class classification, and reconstruction, there is a growing need to bring methods of this field into a systematic and unified perspective. In this review we aim to identify the common underlying principles as well as the assumptions that are often made implicitly by various methods. In particular, we draw connections between classic ‘shallow’ and novel deep approaches and show how this relation might cross-fertilize or extend both directions. We further provide an empirical assessment of major existing methods that is enriched by the use of recent explainability techniques, and present specific worked-through examples together with practical advice. Finally, we outline critical open challenges and identify specific paths for future research in anomaly detection.
New paper: https://t.co/BO6xu6C1uT @lukasruff led our team in this attempt to unify various perspectives on deep anomaly detection within a probabilistic framework. Jacob Kauffmann, @robvdm, Grégoire Montavon, @WojciechSamek, @KloftMarius, and Klaus-Robert Müller.
— Thomas G. Dietterich (@tdietterich) September 25, 2020
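One of the method families the review covers is reconstruction-based detection. A minimal sketch of that idea, with illustrative sizes and synthetic data rather than any specific method from the paper:

```python
# Reconstruction-based anomaly detection sketch: fit an autoencoder on
# (mostly) normal data; points it reconstructs poorly get high anomaly scores.
import torch
import torch.nn as nn

ae = nn.Sequential(                      # tiny autoencoder: 20 -> 4 -> 20
    nn.Linear(20, 4), nn.ReLU(), nn.Linear(4, 20)
)
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

normal = torch.randn(512, 20)            # stand-in for normal training data
for _ in range(200):                     # fit by minimizing reconstruction MSE
    opt.zero_grad()
    loss = ((ae(normal) - normal) ** 2).mean()
    loss.backward()
    opt.step()

def anomaly_score(x):
    # Per-sample reconstruction error: larger means more anomalous.
    with torch.no_grad():
        return ((ae(x) - x) ** 2).mean(dim=1)

print(anomaly_score(normal[:3]))               # low scores expected
print(anomaly_score(10 + torch.randn(3, 20)))  # shifted data: high scores
```

The implicit assumption, which the review makes explicit across method families, is that the model is fit to normal data only, so anything far from that distribution reconstructs poorly.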
3. A Unified Analysis of First-Order Methods for Smooth Games via Integral Quadratic Constraints
Guodong Zhang, Xuchao Bao, Laurent Lessard, Roger Grosse
The theory of integral quadratic constraints (IQCs) allows the certification of exponential convergence of interconnected systems containing nonlinear or uncertain elements. In this work, we adapt IQC theory to study first-order methods for smooth and strongly-monotone games and show how to design tailored quadratic constraints to obtain tight upper bounds on convergence rates. Using this framework, we recover the existing bound for the gradient method (GD), derive sharper bounds for the proximal point method (PPM) and optimistic gradient method (OG), and provide for the first time a global convergence rate for the negative momentum method (NM) with an iteration complexity O(κ^1.5), which matches its known lower bound. In addition, for time-varying systems, we prove that the gradient method with optimal step size achieves the fastest provable worst-case convergence rate with quadratic Lyapunov functions. Finally, we further extend our analysis to stochastic games and study the impact of multiplicative noise on different algorithms. We show that it is impossible for an algorithm with one step of memory to achieve acceleration if it only queries the gradient once per batch (in contrast with the stochastic strongly-convex optimization setting, where such acceleration has been demonstrated). However, we exhibit an algorithm which achieves acceleration with two gradient queries per batch.
New paper alert: https://t.co/qfmEYSrcPh
We provide a unified and automated method to analyze first-order methods for smooth & strongly-monotone games. The convergence rate for any first-order method can be obtained via a mechanical procedure of deriving and solving an SDP. pic.twitter.com/moo3Ebx7t8
— Guodong Zhang (@Guodzh) September 25, 2020
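The paper's analysis runs through IQCs and SDPs; as a much simpler numerical companion, here is a sketch of two of the analyzed methods (GD and OG) on a toy strongly-monotone min-max game. The objective, constants, step size, and iteration count below are illustrative assumptions, not the paper's setup.

```python
# GD vs. optimistic gradient (OG) on the strongly-monotone min-max game
#   f(x, y) = (mu/2) x^2 + L x y - (mu/2) y^2.
import numpy as np

mu, L = 1.0, 10.0                       # strong monotonicity / coupling strength

def F(z):
    # Game vector field (grad_x f, -grad_y f); strongly monotone with modulus mu.
    x, y = z
    return np.array([mu * x + L * y, mu * y - L * x])

eta = mu / (L ** 2)                     # conservative step size for GD
z_gd = np.array([1.0, 1.0])
z_og, z_prev = np.array([1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(2000):
    z_gd = z_gd - eta * F(z_gd)                                   # gradient method
    z_og, z_prev = z_og - eta * (2 * F(z_og) - F(z_prev)), z_og   # optimistic

print(np.linalg.norm(z_gd), np.linalg.norm(z_og))  # both shrink toward 0
```

The point of the example is only to show what "first-order method on a smooth game" means concretely; the paper's contribution is certifying the rates of such iterations mechanically via an SDP, which this sketch does not attempt.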
4. A Personal Perspective on Numerical Analysis and Optimization
Desmond J. Higham
I give a brief, non-technical, historical perspective on numerical analysis and optimization. I also touch on emerging trends and future challenges. This content is based on the short presentation that I made at the opening ceremony of The International Conference on Numerical Analysis and Optimization, which was held at Sultan Qaboos University, Muscat, Oman, on January 6-9, 2020. Of course, the material covered here is necessarily incomplete and biased towards my own interests and comfort zones. My aim is to give a feel for how the area has developed over the past few decades and how it may continue.
"A Personal Perspective on Numerical Analysis and Optimization" (by Desmond J. Higham): https://t.co/T1aSSQvBOp
— DynamicalSystemsSIAM (@DynamicsSIAM) September 25, 2020
5. Structure Aware Negative Sampling in Knowledge Graphs
Kian Ahrabian, Aarash Feizi, Yasmin Salehi, William L. Hamilton, Avishek Joey Bose
Learning low-dimensional representations for entities and relations in knowledge graphs using contrastive estimation represents a scalable and effective method for inferring connectivity patterns. A crucial aspect of contrastive learning approaches is the choice of corruption distribution that generates hard negative samples, which force the embedding model to learn discriminative representations and find critical characteristics of observed data. Earlier methods either employ overly simple corruption distributions (i.e., uniform), yielding easy, uninformative negatives, or use sophisticated adversarial distributions with challenging optimization schemes; neither explicitly incorporates known graph structure, resulting in suboptimal negatives. In this paper, we propose Structure Aware Negative Sampling (SANS), an inexpensive negative sampling strategy that utilizes the rich graph structure by selecting negative samples from a node's k-hop neighborhood. Empirically, we demonstrate that SANS finds high-quality negatives that are highly competitive with SOTA methods, while requiring no additional parameters and no difficult adversarial optimization.
Our #emnlp2020 short paper on "Structure Aware Negative Sampling on Knowledge Graphs" is now available: https://t.co/4ZrhWVol7S
— Joey Bose (@bose_joey) September 25, 2020
This work was led by impressive graduate students: @kahrabian, @aarashfeizi, @SalehiYasmin, and also with my amazing supervisor @williamleif.
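The core sampling idea from the abstract fits in a few lines. The toy graph, the choice of k, and the helper names below are illustrative assumptions, not the authors' implementation:

```python
# SANS-style negative sampling sketch: draw negative tails for a triple
# (h, r, t) from the head's k-hop neighborhood instead of uniformly.
import numpy as np

def k_hop_neighborhood(adj, node, k):
    # Nodes reachable from `node` within k hops, via repeated frontier expansion.
    reach = np.zeros(adj.shape[0], dtype=bool)
    frontier = np.zeros_like(reach)
    frontier[node] = True
    for _ in range(k):
        frontier = (adj.T @ frontier) > 0      # expand one hop
        reach |= frontier
    reach[node] = False                        # exclude the node itself
    return np.flatnonzero(reach)

def sans_negatives(adj, head, true_tail, k=2, n_neg=5, rng=np.random):
    # Hard negatives: structurally close nodes that are not the true tail.
    cand = k_hop_neighborhood(adj, head, k)
    cand = cand[cand != true_tail]
    if len(cand) == 0:                         # fall back to uniform sampling
        cand = np.arange(adj.shape[0])
    return rng.choice(cand, size=n_neg, replace=True)

# Toy undirected graph with 6 entities on a path 0-1-2-3-4-5.
adj = np.zeros((6, 6), dtype=int)
for a, b in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    adj[a, b] = adj[b, a] = 1
print(sans_negatives(adj, head=0, true_tail=1))  # negatives from 0's 2-hop ball
```

The design intuition matches the abstract: nodes near the head are plausible-but-wrong completions, so they make harder negatives than uniform draws from the whole entity set, at essentially the cost of a few sparse matrix-vector products.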
6. Investigating Applications on the A64FX
Adrian Jackson, Michèle Weiland, Nick Brown, Andrew Turner, Mark Parsons
The A64FX processor from Fujitsu, designed for computational simulation and machine learning applications, has the potential for unprecedented performance in HPC systems. In this paper, we evaluate the A64FX by benchmarking it against a range of production HPC platforms covering a number of processor technologies. We investigate the performance of complex scientific applications across multiple nodes, as well as single-node and mini-kernel benchmarks. We find that the performance of the A64FX across our chosen benchmarks often significantly exceeds that of the other platforms, even without application-specific optimisations for the processor's instruction set or hardware. However, this does not hold for all of the benchmarks we ran. Furthermore, the specific configuration of an application can have an impact on the runtime and performance experienced.
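For readers unfamiliar with the term, a "mini-kernel" benchmark of the kind mentioned above can be as small as a STREAM-style triad. A rough sketch, with illustrative array sizes, repeat counts, and bandwidth accounting (this is not one of the paper's benchmarks):

```python
# STREAM-style triad a = b + s * c, timed over repeated sweeps, as a crude
# probe of effective memory bandwidth on processors like the A64FX.
import time
import numpy as np

n = 10_000_000                        # elements per array (~80 MB each, float64)
b, c = np.random.rand(n), np.random.rand(n)
a = np.empty(n)
s = 3.0

reps = 20
t0 = time.perf_counter()
for _ in range(reps):
    np.multiply(c, s, out=a)          # a = s * c, written in place
    np.add(a, b, out=a)               # a = b + s * c
elapsed = time.perf_counter() - t0

# 3 arrays touched per sweep (read b, read c, write a), 8 bytes per element.
gbps = 3 * 8 * n * reps / elapsed / 1e9
print(f"triad: {gbps:.1f} GB/s effective bandwidth")
```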