1. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges along the way, such as computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
https://t.co/V3XPuaBKcG
— Dmitry (Dima) Lepikhin (@lepikhin) July 1, 2020
We scaled the Transformer model with Sparsely-Gated Mixture-of-Experts using GShard, and trained a 600B multilingual translation model in about 4 days (for 100 languages) achieving 13.5 BLEU gain compared to the baseline. pic.twitter.com/oOHRK7iiHm
“In weeks decades happen”
— Delip Rao (@deliprao) July 1, 2020
13+ BLEU point improvements in this new work. I have a feeling Google is quietly sitting on a GPT-100 implementation and doesn’t bother telling anyone. https://t.co/5trXwvKOz2 pic.twitter.com/qrmqQzVRux
Scaling Transformer models up to 600B parameters using sparsely-gated mixture-of-experts leads to big gains in BLEU score in multilingual machine translation https://t.co/BW8cujfUpI
— Alexis Conneau (@alex_conneau) July 1, 2020
Impressive work by Google folks pic.twitter.com/k4FZjHjkhh
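The workhorse layer being scaled here is the Sparsely-Gated Mixture-of-Experts (MoE), in which a learned gate routes each token to a small number of expert feed-forward networks, so the parameter count grows with the number of experts while per-token compute stays roughly constant. Below is a minimal NumPy sketch of top-2 MoE routing; the gate, experts, and dimensions are toy placeholders, and it deliberately omits GShard's actual contributions (sharding annotations, expert capacity limits, and the auxiliary load-balancing loss).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, gate_w, experts):
    """Route each token to its top-2 experts and mix their outputs.

    tokens:  (n_tokens, d_model)
    gate_w:  (d_model, n_experts) gating weights
    experts: list of (w_in, w_out) feed-forward expert parameters
    """
    gate_probs = softmax(tokens @ gate_w)            # (n_tokens, n_experts)
    top2 = np.argsort(gate_probs, axis=-1)[:, -2:]   # indices of the 2 best experts

    out = np.zeros_like(tokens)
    for t, token in enumerate(tokens):
        weights = gate_probs[t, top2[t]]
        weights = weights / weights.sum()            # renormalize over the chosen pair
        for w, e_idx in zip(weights, top2[t]):
            w_in, w_out = experts[e_idx]
            hidden = np.maximum(token @ w_in, 0.0)   # expert FFN with ReLU
            out[t] += w * (hidden @ w_out)
    return out

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 16, 32, 4
experts = [(rng.normal(size=(d_model, d_ff)) * 0.1,
            rng.normal(size=(d_ff, d_model)) * 0.1) for _ in range(n_experts)]
gate_w = rng.normal(size=(d_model, n_experts)) * 0.1
y = moe_layer(rng.normal(size=(8, d_model)), gate_w, experts)
print(y.shape)  # (8, 16)
```

In GShard itself, the experts are distributed across the 2048 TPU v3 accelerators, and the annotation APIs tell the XLA compiler how each tensor should be partitioned.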
2. Involutive MCMC: a Unifying Framework
Kirill Neklyudov, Max Welling, Evgenii Egorov, Dmitry Vetrov
- retweets: 75, favorites: 324 (07/02/2020 13:27:01)
- links: abs | pdf
- cs.LG | stat.CO | stat.ME | stat.ML
Markov Chain Monte Carlo (MCMC) is a computational approach to fundamental problems such as inference, integration, optimization, and simulation. The field has developed a broad spectrum of algorithms, varying in the way they are motivated, the way they are applied and how efficiently they sample. Despite all the differences, many of them share the same core principle, which we unify as the Involutive MCMC (iMCMC) framework. Building upon this, we describe a wide range of MCMC algorithms in terms of iMCMC, and formulate a number of “tricks” which one can use as design principles for developing new MCMC algorithms. Thus, iMCMC provides a unified view of many known MCMC algorithms, which facilitates the derivation of powerful extensions. We demonstrate the latter with two examples where we transform known reversible MCMC algorithms into more efficient irreversible ones.
Our #icml2020 paper "Involutive MCMC: a Unifying Framework" is now available on arxiv https://t.co/ViSnnfi1hr. It describes many MCMC algorithms from a single perspective.
— Kirill Neklyudov (@k_neklyudov) July 1, 2020
Work with @wellingmax, @eeevgen, Dmitry Vetrov pic.twitter.com/FltFuIavau
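Concretely, an iMCMC transition augments the state with an auxiliary variable, applies a deterministic involution f (so f(f(x, v)) = (x, v)), and accepts with a Metropolis-Hastings-style ratio that includes the Jacobian of f. Here is a hedged sketch of one such step, using the leapfrog-plus-momentum-flip involution familiar from HMC (volume-preserving, so the Jacobian term vanishes); `log_target` and `grad_logp` are user-supplied placeholders.

```python
import numpy as np

def leapfrog(x, v, grad_logp, step=0.1, n_steps=10):
    """Volume-preserving leapfrog integrator."""
    v = v + 0.5 * step * grad_logp(x)
    for _ in range(n_steps - 1):
        x = x + step * v
        v = v + step * grad_logp(x)
    x = x + step * v
    v = v + 0.5 * step * grad_logp(x)
    return x, v

def imcmc_step(x, log_target, grad_logp, rng):
    """One involutive MCMC step.

    The involution f(x, v) = flip(leapfrog(x, v)) satisfies f(f(x, v)) = (x, v)
    and has unit Jacobian, so acceptance only involves the joint density.
    """
    v = rng.normal(size=x.shape)                      # auxiliary variable
    x_new, v_new = leapfrog(x, v, grad_logp)
    v_new = -v_new                                    # momentum flip makes it an involution
    log_joint = lambda x_, v_: log_target(x_) - 0.5 * np.sum(v_ ** 2)
    if np.log(rng.uniform()) < log_joint(x_new, v_new) - log_joint(x, v):
        return x_new
    return x

# Example: sample a 2D standard normal.
rng = np.random.default_rng(0)
log_target = lambda x: -0.5 * np.sum(x ** 2)
grad_logp = lambda x: -x
x, samples = np.zeros(2), []
for _ in range(1000):
    x = imcmc_step(x, log_target, grad_logp, rng)
    samples.append(x)
print(np.mean(samples, axis=0), np.std(samples, axis=0))
```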
3. Model-based Reinforcement Learning: A Survey
Thomas M. Moerland, Joost Broekens, Catholijn M. Jonker
Sequential decision making, commonly formalized as Markov Decision Process (MDP) optimization, is a key challenge in artificial intelligence. Two key approaches to this problem are reinforcement learning (RL) and planning. This paper presents a survey of the integration of both fields, better known as model-based reinforcement learning. Model-based RL has two main steps. First, we systematically cover approaches to dynamics model learning, including challenges like dealing with stochasticity, uncertainty, partial observability, and temporal abstraction. Second, we present a systematic categorization of planning-learning integration, including aspects like: where to start planning, what budgets to allocate to planning and real data collection, how to plan, and how to integrate planning in the learning and acting loop. After these two key sections, we also discuss the potential benefits of model-based RL, like enhanced data efficiency, targeted exploration, and improved stability. Along the survey, we also draw connections to several related RL fields, like hierarchical RL and transfer, and other research disciplines, like behavioural psychology. Altogether, the survey presents a broad conceptual overview of planning-learning combinations for MDP optimization.
Model-based Reinforcement Learning: A Survey: https://t.co/5tvG5Npx4v
— Denny Britz (@dennybritz) July 1, 2020
“A systematic categorization of planning-learning integration, including where to start planning, what budgets to allocate, how to plan, and how to integrate planning in the learning and acting loop” pic.twitter.com/m1DEgdeiU8
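The planning-learning loop the survey categorizes can be made concrete with the classic Dyna recipe: update a value function from real transitions, fit a model to those same transitions, and spend extra "planning" updates on transitions replayed from the model. Below is a minimal tabular sketch on a toy chain environment (a generic illustration, not an algorithm proposed by the survey).

```python
import numpy as np

# Toy 5-state chain: action 0 moves left, action 1 moves right; reward 1 at the right end.
N_STATES, N_ACTIONS = 5, 2
def env_step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == N_STATES - 1)

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))
model = {}                       # learned dynamics model: (s, a) -> (s', r)
alpha, gamma, n_planning = 0.5, 0.95, 10

s = 0
for _ in range(2000):
    a = rng.integers(N_ACTIONS) if rng.uniform() < 0.1 else int(Q[s].argmax())
    s2, r = env_step(s, a)                                   # real experience
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])   # direct RL update
    model[(s, a)] = (s2, r)                                  # model learning
    for _ in range(n_planning):                              # planning with the model
        ps, pa = list(model)[rng.integers(len(model))]
        ps2, pr = model[(ps, pa)]
        Q[ps, pa] += alpha * (pr + gamma * Q[ps2].max() - Q[ps, pa])
    s = s2

print(Q.argmax(axis=1))   # greedy policy: action 1 (move right) in every state
```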
4. MDP Homomorphic Networks: Group Symmetries in Reinforcement Learning
Elise van der Pol, Daniel E. Worrall, Herke van Hoof, Frans A. Oliehoek, Max Welling
This paper introduces MDP homomorphic networks for deep reinforcement learning. MDP homomorphic networks are neural networks that are equivariant under symmetries in the joint state-action space of an MDP. Current approaches to deep reinforcement learning do not usually exploit knowledge about such structure. By building this prior knowledge into policy and value networks using an equivariance constraint, we can reduce the size of the solution space. We specifically focus on group-structured symmetries (invertible transformations). Additionally, we introduce an easy method for constructing equivariant network layers numerically, so the system designer need not solve the constraints by hand, as is typically done. We construct MDP homomorphic MLPs and CNNs that are equivariant under either a group of reflections or rotations. We show that such networks converge faster than unstructured baselines on CartPole, a grid world and Pong.
New preprint: “MDP Homomorphic Networks: Group Symmetries in Reinforcement Learning”, a new method for exploiting symmetries in deep RL using equivariance.
— Elise van der Pol (@ElisevanderPol) July 1, 2020
Arxiv: https://t.co/72sXC7GaIo
Work with @deworrall92, @herkevanhoof, @faoliehoek and @wellingmax pic.twitter.com/PIAX3sqWFs
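The equivariance constraint for a linear layer reads K_g W = W L_g for every group element g, where L_g acts on the layer's input and K_g on its output; the paper's point is that a basis of solutions can be found numerically rather than derived by hand. Here is a simplified NumPy sketch of that idea for a single linear layer under a reflection group (illustrative only; the paper's construction covers full policy/value MLPs and CNNs).

```python
import numpy as np

def equivariant_basis(in_reps, out_reps):
    """Numerically find a basis of weight matrices W with K_g @ W = W @ L_g for all g.

    in_reps, out_reps: matched lists of group representation matrices (L_g, K_g).
    Returns basis matrices spanning the equivariant weight subspace.
    """
    n_out, n_in = out_reps[0].shape[0], in_reps[0].shape[0]
    constraints = []
    for L, K in zip(in_reps, out_reps):
        M = np.zeros((n_out * n_in, n_out * n_in))
        for j in range(n_out * n_in):
            E = np.zeros((n_out, n_in))
            E.flat[j] = 1.0
            M[:, j] = (K @ E - E @ L).ravel()   # column = constraint applied to basis matrix
        constraints.append(M)
    _, s, vt = np.linalg.svd(np.vstack(constraints))
    return [v.reshape(n_out, n_in) for v in vt[s < 1e-10]]   # null space = equivariant weights

# Example: reflection group {identity, flip} acting by reversing a 4-dim state
# and by swapping the two actions of a policy head.
flip_state = np.eye(4)[::-1]
flip_action = np.array([[0.0, 1.0], [1.0, 0.0]])
basis = equivariant_basis([np.eye(4), flip_state], [np.eye(2), flip_action])
print(len(basis))   # dimension of the equivariant weight space (4 in this example)
```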
5. Deep Isometric Learning for Visual Recognition
Haozhi Qi, Chong You, Xiaolong Wang, Yi Ma, Jitendra Malik
Initialization, normalization, and skip connections are believed to be three indispensable techniques for training very deep convolutional neural networks and obtaining state-of-the-art performance. This paper shows that deep vanilla ConvNets without normalization or skip connections can also be trained to achieve surprisingly good performance on standard image recognition benchmarks. This is achieved by enforcing the convolution kernels to be near isometric during initialization and training, as well as by using a variant of ReLU that is shifted towards being isometric. Further experiments show that if combined with skip connections, such near isometric networks can achieve performance on par with (for ImageNet) and better than (for COCO) the standard ResNet, even without normalization at all. Our code is available at https://github.com/HaozhiQi/ISONet.
How to train very deep ConvNets without residual blocks? Our ICML paper on Deep Isometric Learning successfully trains 100-layer ConvNets without any shortcut connections nor normalization layers (BN/GN) on ImageNet.
— xiaolonw (@xiaolonw) July 1, 2020
Paper: https://t.co/mxMJO1Wub9
Code: https://t.co/8Omd8SYYS9 pic.twitter.com/nrePhbE0hf
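Below is a hedged sketch of the two ingredients the abstract names: a delta-orthogonal-style initialization (the centre tap of each convolution kernel carries an orthogonal channel map, all other taps are zero, so the layer starts out as an isometry) and a ReLU shifted toward the identity. Function and parameter names are illustrative, not the authors' code (which is at the GitHub link above).

```python
import numpy as np

def isometric_conv_init(out_ch, in_ch, k=3, rng=np.random.default_rng(0)):
    """Delta-orthogonal-style init: the centre spatial tap carries an orthogonal
    channel map, every other tap is zero, so the convolution starts as an isometry."""
    w = np.zeros((out_ch, in_ch, k, k))
    q, _ = np.linalg.qr(rng.normal(size=(max(out_ch, in_ch), min(out_ch, in_ch))))
    w[:, :, k // 2, k // 2] = q if out_ch >= in_ch else q.T
    return w

def shifted_relu(x, shift=-1.0):
    """ReLU shifted downwards, max(x, shift): closer to an isometry than plain ReLU."""
    return np.maximum(x, shift)

w = isometric_conv_init(8, 8)
center = w[:, :, 1, 1]
print(np.allclose(center.T @ center, np.eye(8)))   # True: the centre tap preserves norms
```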
6. Multi-Head Attention: Collaborate Instead of Concatenate
Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
Attention layers are widely used in natural language processing (NLP) and are beginning to influence computer vision architectures. However, they suffer from over-parameterization. For instance, it was shown that the majority of attention heads could be pruned without impacting accuracy. This work aims to enhance current understanding of how multiple heads interact. Motivated by the observation that trained attention heads share common key/query projections, we propose a collaborative multi-head attention layer that enables heads to learn shared projections. Our scheme reduces the computational cost and the number of parameters of an attention layer and can be used as a drop-in replacement in any transformer architecture. For instance, by allowing heads to collaborate on a neural machine translation task, we can reduce the key dimension by a factor of eight without any loss in performance. We also show that it is possible to re-parametrize a pre-trained multi-head attention layer into our collaborative attention layer. Even without retraining, collaborative multi-head attention manages to reduce the size of the key and query projections by half without sacrificing accuracy. Our code is public.
Happy to share our work on attention models. We show that heads learn redundant key/query projections; the attention layer can be made more efficient by sharing projections across heads. Collaborate Instead of Concatenate!😉1/5
— Jean-Baptiste Cordonnier (@jb_cordonnier) July 1, 2020
📄Paper: https://t.co/BFcuaRvO9l
🖥Code: https://t.co/Qci9FLixIl pic.twitter.com/B59VI9CvFL
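The collaborative layer shares one query projection and one key projection across all heads and gives each head a learned mixing vector that re-weights the shared dimensions, which is how the key/query width can shrink without hurting accuracy. Here is a hedged NumPy sketch of a single such layer (dimension names and the per-head value projections are illustrative; see the paper for the exact re-parametrization of pre-trained layers).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def collaborative_attention(X, Wq, Wk, mix, Wv_heads, Wo):
    """Multi-head attention with key/query projections shared across heads.

    X:        (n, d_model) token representations
    Wq, Wk:   (d_model, d_shared) projections shared by all heads
    mix:      (n_heads, d_shared) per-head mixing vectors over the shared dims
    Wv_heads: (n_heads, d_model, d_head) per-head value projections
    Wo:       (n_heads * d_head, d_model) output projection
    """
    Q, K = X @ Wq, X @ Wk                            # computed once, reused by every head
    heads = []
    for m, Wv in zip(mix, Wv_heads):
        scores = (Q * m) @ K.T / np.sqrt(len(m))     # head-specific reweighting of shared dims
        heads.append(softmax(scores) @ (X @ Wv))
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
n, d_model, d_shared, n_heads, d_head = 10, 64, 32, 4, 16
out = collaborative_attention(
    rng.normal(size=(n, d_model)),
    rng.normal(size=(d_model, d_shared)) * 0.1,
    rng.normal(size=(d_model, d_shared)) * 0.1,
    rng.normal(size=(n_heads, d_shared)),
    rng.normal(size=(n_heads, d_model, d_head)) * 0.1,
    rng.normal(size=(n_heads * d_head, d_model)) * 0.1,
)
print(out.shape)   # (10, 64)
```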
7. PriorGAN: Real Data Prior for Generative Adversarial Nets
Shuyang Gu, Jianmin Bao, Dong Chen, Fang Wen
Generative adversarial networks (GANs) have achieved rapid progress in learning rich data distributions. However, we identify two main issues in existing techniques: first, the low-quality problem, where the learned distribution contains many low-quality samples; second, the missing-modes problem, where the learned distribution misses certain regions of the real data distribution. To address these two issues, we propose a novel prior that captures the whole real data distribution for GANs, which we call PriorGANs. To be specific, we adopt a simple yet elegant Gaussian Mixture Model (GMM) to build an explicit probability distribution at the feature level for the whole real data. By maximizing the probability of generated data, we can push low-quality samples toward high quality. Meanwhile, equipped with the prior, we can estimate the missing modes in the learned distribution and design a sampling strategy on the real data to address them. The proposed real data prior generalizes to various GAN training settings, such as LSGAN, WGAN-GP, SNGAN, and even StyleGAN. Our experiments demonstrate that PriorGANs outperform the state of the art on the CIFAR-10, FFHQ, LSUN-cat, and LSUN-bird datasets by large margins.
PriorGAN: Real Data Prior for Generative Adversarial Nets
— roadrunner01 (@ak92501) July 1, 2020
pdf: https://t.co/47dgaFaCCy
abs: https://t.co/WL2IehrCq3 pic.twitter.com/cbHHtUL6pD
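The "prior" here is an explicit density fitted to features of real data, which the generator is then encouraged to stay inside. Below is a hedged scikit-learn sketch of that idea, with random vectors standing in for feature embeddings and the GAN training loop omitted; the missing-mode sampling strategy is described only in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit an explicit Gaussian Mixture Model on feature embeddings of real data
# (e.g. from a pretrained encoder); random vectors stand in for real features here.
rng = np.random.default_rng(0)
real_features = rng.normal(size=(5000, 64))
gmm = GaussianMixture(n_components=10, covariance_type="diag").fit(real_features)

def prior_loss(fake_features):
    """Auxiliary generator loss: maximize the GMM log-likelihood of generated
    features, pushing low-quality samples toward high-density regions."""
    return -gmm.score_samples(fake_features).mean()

fake_features = rng.normal(size=(128, 64)) + 1.0   # stand-in for generated features
print(prior_loss(fake_features))
```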
8. Learning Sparse Prototypes for Text Generation
Junxian He, Taylor Berg-Kirkpatrick, Graham Neubig
Prototype-driven text generation uses non-parametric models that first choose from a library of sentence “prototypes” and then modify the prototype to generate the output text. While effective, these methods are inefficient at test time as a result of needing to store and index the entire training corpus. Further, existing methods often require heuristics to identify which prototypes to reference at training time. In this paper, we propose a novel generative model that automatically learns a sparse prototype support set that, nonetheless, achieves strong language modeling performance. This is achieved by (1) imposing a sparsity-inducing prior on the prototype selection distribution, and (2) utilizing amortized variational inference to learn a prototype retrieval function. In experiments, our model outperforms previous prototype-driven language models while achieving up to a 1000x memory reduction, as well as a 1000x speed-up at test time. More interestingly, we show that the learned prototypes are able to capture semantics and syntax at different granularity as we vary the sparsity of prototype selection, and that certain sentence attributes can be controlled by specifying the prototype for generation.
Do we really need to store ALL the training examples for non-parametric (retrieval-based) text generation? Our new work achieves competitive language modeling performance with sparse non-parametric memories (up to 1000x memory savings): https://t.co/6DmYUdfr94 pic.twitter.com/n8cdXTT7KM
— Junxian He (@junxian_he) July 1, 2020
9. Firmware Insider: Bluetooth Randomness is Mostly Random
Jörn Tillmanns, Jiska Classen, Felix Rohrbach, Matthias Hollick
Bluetooth chips must include a Random Number Generator (RNG). This RNG is used internally within cryptographic primitives but also exposed to the operating system for chip-external applications. In general, it is a black box with security-critical authentication and encryption mechanisms depending on it. In this paper, we evaluate the quality of RNGs in various Broadcom and Cypress Bluetooth chips. We find that the RNG implementation has changed significantly over the last decade. Moreover, most devices implement an insecure Pseudo-Random Number Generator (PRNG) fallback. Multiple popular devices, such as the Samsung Galaxy S8 and its variants as well as an iPhone, rely on the weak fallback because they lack a Hardware Random Number Generator (HRNG). We statistically evaluate the output of various HRNGs in chips used by hundreds of millions of devices. While the Broadcom and Cypress HRNGs pass advanced tests, users cannot tell whether a Bluetooth chip implements a secure RNG without an extensive analysis like the one in this paper. We describe our measurement methods and publish our tools to enable further public testing.
Finally, the preprint of our Broadcom/Cypress Bluetooth RNG analysis: https://t.co/1ZGj9rp9U9
— Jiska 👻🌈 (@naehrdine) July 1, 2020
Joint work with @matedealer and @Fxrh. CVE-2020-6616 was fixed in iOS 13.5 and the Samsung May release. The paper got accepted at #WOOT20, which is co-located with @USENIXSecurity.
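The paper's statistical evaluation relies on advanced randomness test batteries; NIST SP 800-22-style suites are the usual choice for this kind of analysis. As a flavour of what such tests check, here is the simplest one, the monobit frequency test, applied to a byte dump (a generic illustration, not the authors' published tooling).

```python
import math
import os

def monobit_frequency_test(data: bytes) -> float:
    """NIST SP 800-22 frequency (monobit) test: returns a p-value; values well
    below 0.01 indicate the bit stream is unlikely to be uniformly random."""
    n = len(data) * 8
    ones = sum(bin(b).count("1") for b in data)
    s_obs = abs(2 * ones - n) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))

# Example: os.urandom as a stand-in for bytes read from a Bluetooth chip's RNG.
print(monobit_frequency_test(os.urandom(1 << 20)))
```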
10. Learning Patterns of Tourist Movement and Photography from Geotagged Photos at Archaeological Heritage Sites in Cuzco, Peru
Nicole D. Payntar, Wei-Lin Hsiao, R. Alan Covey, Kristen Grauman
The popularity of media sharing platforms in recent decades has provided an abundance of open source data that remains underutilized by heritage scholars. By pairing geotagged internet photographs with machine learning and computer vision algorithms, we build upon the current theoretical discourse of anthropology associated with visuality and heritage tourism to identify travel patterns across a known archaeological heritage circuit, and quantify visual culture and experiences in Cuzco, Peru. Leveraging large-scale in-the-wild tourist photos, our goals are to (1) understand how the intensification of tourism intersects with heritage regulations and social media, aiding in the articulation of travel patterns across Cuzco’s heritage landscape; and to (2) assess how aesthetic preferences and visuality become entangled with the rapidly evolving expectations of tourists, whose travel narratives are curated on social media and grounded in historic site representations.
We’re sharing the first analysis of photo sharing's impact on heritage tourism. We used #computervision to extract insights (e.g. travel patterns, photo themes) over 15 years. The learnings could inform preservation of historic sites or tourism management. https://t.co/9mvPocTTgG
— Facebook AI (@facebookai) July 1, 2020
11. Sparse Gaussian Processes with Spherical Harmonic Features
Vincent Dutordoir, Nicolas Durrande, James Hensman
We introduce a new class of inter-domain variational Gaussian processes (GP) where data is mapped onto the unit hypersphere in order to use spherical harmonic representations. Our inference scheme is comparable to variational Fourier features, but it does not suffer from the curse of dimensionality, and leads to diagonal covariance matrices between inducing variables. This enables a speed-up in inference, because it bypasses the need to invert large covariance matrices. Our experiments show that our model is able to fit a regression model for a dataset with 6 million entries two orders of magnitude faster than standard sparse GPs, while retaining state-of-the-art accuracy. We also demonstrate competitive performance on classification with non-conjugate likelihoods.
Worried about scaling that pesky matrix inversion in your GP approximation? Choose your inducing variables to be the spherical harmonics to get it done in linear time!
— Vincent Dutordoir (@vdutor) July 1, 2020
Presenting our #icml2020 paper: https://t.co/E2zZSzSzL7 with @NicolasDurrande and @jameshensman @PROWLER_IO
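The speed-up comes from the standard sparse-GP bottleneck: generic inducing variables give a dense M x M matrix K_uu that must be factorized at O(M^3) cost, whereas spherical-harmonic inducing features are orthogonal and make K_uu diagonal. For context, the quantities involved in standard SVGP notation (not the paper's exact derivation):

```latex
% Sparse variational GP (SVGP) predictive distribution with M inducing variables u,
% variational posterior q(u) = N(m, S), and K_uu = Cov(u, u):
\[
  q(f(x)) = \mathcal{N}\!\Bigl(
      \mathbf{k}_u(x)^{\top} K_{uu}^{-1} \mathbf{m},\;
      k(x,x) - \mathbf{k}_u(x)^{\top} K_{uu}^{-1}\bigl(K_{uu} - S\bigr)K_{uu}^{-1}\mathbf{k}_u(x)
  \Bigr)
\]
% Generic inducing points: K_uu is dense, so every step pays an O(M^3) factorization.
% Spherical-harmonic inducing features: K_uu is diagonal, so K_uu^{-1} is elementwise
% and this bottleneck disappears.
```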
12. Bitcoin Covenants: Three Ways to Control the Future
Jacob Swambo, Spencer Hommel, Bob McElrath, Bryan Bishop
A bitcoin covenant is a mechanism to enforce conditions on how the control of coins will be transferred in the future. This work introduces deleted-key covenants, which use pre-signed transactions with secure key deletion. With this, a general class of covenants is possible without introducing new security risks to bitcoin. There is a range of security models for the key deletion process, but this is subject to a security-convenience trade-off and requires interactivity in a multi-party context. On the other hand, this work makes a compelling case for what can be gained through a soft-fork upgrade to the signature hash system [Dec17] which enables recovered-key covenants through elliptic curve key recovery. This has similar properties to script-based covenant mechanisms proposed previously [Rub20]. Key factors are discussed and compared for the three covenant mechanisms, including: the enforcement process, methods for proving accessibility of funds and whether or not they are bound by a covenant, methods for dynamic fee allocation, the underlying cryptographic assumptions, and their feasibility in single-party, hierarchical and adversarial multi-party contexts. Despite the relative downsides of deleted-key covenants, they are a practical tool for custody protocol design. The comparison shows precisely how soft-fork proposals improve the practicality of bitcoin covenants, through non-interactive enforcement and tighter cryptographic assumptions, to enhance custody protocols and enable some adversarial applications such as payment protocols.
TL;DR -- if we get NOINPUT via bip118 (a new segwit script version) we can implement covenants using elliptic curve key recovery. (As opposed to ANYPREVOUT which is a tapscript) Comparisons within: @JakeSwambo @HommelSpencer @kanzure https://t.co/61wdzbIoFC
— Bob McElrath (@BobMcElrath) July 1, 2020
13. Quantum algorithm for Petz recovery channels and pretty good measurements
András Gilyén, Seth Lloyd, Iman Marvian, Yihui Quek, Mark M. Wilde
- retweets: 3, favorites: 51 (07/02/2020 13:27:03)
- links: abs | pdf
- quant-ph | cs.DS | hep-th | math-ph
The Petz recovery channel plays an important role in quantum information science as an operation that approximately reverses the effect of a quantum channel. The pretty good measurement is a special case of the Petz recovery channel, and it allows for near-optimal state discrimination. A hurdle to the experimental realization of these vaunted theoretical tools is the lack of a systematic and efficient method to implement them. This paper sets out to rectify this lack: using the recently developed tools of quantum singular value transformation and oblivious amplitude amplification, we provide a quantum algorithm to implement the Petz recovery channel when given the ability to perform the channel that one wishes to reverse. Moreover, we prove that our quantum algorithm’s usage of the channel implementation cannot be improved by more than a quadratic factor. Our quantum algorithm also provides a procedure to perform pretty good measurements when given multiple copies of the states that one is trying to distinguish.
Delighted to announce https://t.co/bMdLbmF0UV (with a great team: Andras, Seth, Iman, Mark @markwilde), where we provide a systematic algorithm for implementing the Petz map and Pretty Good Measurements. Limerick below!
— Yihui Quek (@quekpottheories) July 1, 2020
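For reference, both objects the algorithm implements have compact textbook definitions (standard forms, quoted here for context rather than taken from this paper's notation):

```latex
% Petz recovery channel for a channel N with reference state sigma:
\[
  \mathcal{P}_{\sigma,\mathcal{N}}(\omega)
    = \sigma^{1/2}\,
      \mathcal{N}^{\dagger}\!\bigl(\mathcal{N}(\sigma)^{-1/2}\,\omega\,\mathcal{N}(\sigma)^{-1/2}\bigr)\,
      \sigma^{1/2}
\]
% Pretty good measurement for an ensemble {p_i, rho_i} (a special case of the Petz
% recovery channel, as the abstract notes); its POVM elements are
\[
  M_i = \bar{\rho}^{-1/2}\, p_i \rho_i\, \bar{\rho}^{-1/2},
  \qquad \bar{\rho} = \sum_i p_i \rho_i
\]
```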
14. Universal linguistic inductive biases via meta-learning
R. Thomas McCoy, Erin Grant, Paul Smolensky, Thomas L. Griffiths, Tal Linzen
How do learners acquire languages from the limited data available to them? This process must involve some inductive biases - factors that affect how a learner generalizes - but it is unclear which inductive biases can explain observed patterns in language acquisition. To facilitate computational modeling aimed at addressing this question, we introduce a framework for giving particular linguistic inductive biases to a neural network model; such a model can then be used to empirically explore the effects of those inductive biases. This framework disentangles universal inductive biases, which are encoded in the initial values of a neural network’s parameters, from non-universal factors, which the neural network must learn from data in a given language. The initial state that encodes the inductive biases is found with meta-learning, a technique through which a model discovers how to acquire new languages more easily via exposure to many possible languages. By controlling the properties of the languages that are used during meta-learning, we can control the inductive biases that meta-learning imparts. We demonstrate this framework with a case study based on syllable structure. First, we specify the inductive biases that we intend to give our model, and then we translate those inductive biases into a space of languages from which a model can meta-learn. Finally, using existing analysis techniques, we verify that our approach has imparted the linguistic inductive biases that it was intended to impart.
Human language learning is fast & robust because of the inductive biases that guide it. Neural nets lack these biases, limiting their utility for cognitive modeling. We introduce an approach to address this w/ meta-learning. https://t.co/nrxoDMllnB
— Tom McCoy (@RTomMcCoy) July 1, 2020
Demo: https://t.co/d29dLTFbRi pic.twitter.com/36I72E04Ku
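The "initial state found with meta-learning" can be pictured with a MAML-style procedure: an inner loop adapts a copy of the parameters to one sampled language, and an outer loop nudges the shared initialization so that such adaptation works well, which is where the inductive bias ends up. Below is a hedged first-order sketch on a toy regression task standing in for language data (not the authors' setup).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A toy 'language': linear regression with a task-specific slope."""
    slope = rng.uniform(-2.0, 2.0)
    x = rng.uniform(-1.0, 1.0, size=(20, 1))
    return x, slope * x

def loss_and_grad(w, x, y):
    pred = x @ w
    return ((pred - y) ** 2).mean(), 2.0 * x.T @ (pred - y) / len(x)

w_meta = np.zeros((1, 1))          # the shared initialization = the inductive bias
inner_lr, outer_lr = 0.1, 0.01

for step in range(2000):
    x, y = sample_task()
    w_task = w_meta.copy()
    for _ in range(5):             # inner loop: adapt to this task starting from w_meta
        _, g = loss_and_grad(w_task, x[:10], y[:10])
        w_task -= inner_lr * g
    # Outer loop (first-order MAML): the gradient at the adapted parameters,
    # evaluated on held-out task data, updates the initialization itself.
    _, g_post = loss_and_grad(w_task, x[10:], y[10:])
    w_meta -= outer_lr * g_post

print(w_meta)   # hovers near zero, since task slopes are symmetric about zero
```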
15. From Simple Features to Moving Features and Beyond?
Anita Graser, Esteban Zimányi, Krishna Chaitanya Bommakanti
Mobility data science lacks common data structures and analytical functions. This position paper assesses the current status and open issues towards a universal API for mobility data science. In particular, we look at standardization efforts revolving around the OGC Moving Features standard which, so far, has not attracted much attention within the mobility data science community. We discuss the hurdles any universal API for movement data has to overcome and propose key steps of a roadmap that would provide the foundation for the development of this API.
🎉 Happy to share my first arXiv submission: https://t.co/VFqEVBvRMe
— Anita Graser (@underdarkGIS) July 1, 2020
We originally wanted to discuss #MovingFeatures at the #GIScience2020 Workshop on Advancing #MovementDataScience but that was postponed and we didn't want to wait! @smayadodge @udemsar @KatarzynaSila pic.twitter.com/KptXvAadNz