1. A Refined Laser Method and Faster Matrix Multiplication
Josh Alman, Virginia Vassilevska Williams
The complexity of matrix multiplication is measured in terms of ω, the smallest real number such that two n×n matrices can be multiplied using O(n^(ω+ε)) field operations for all ε > 0; the best bound until now is ω < 2.37287 [Le Gall’14]. All bounds on ω since 1986 have been obtained using the so-called laser method, a way to lower-bound the `value’ of a tensor in designing matrix multiplication algorithms. The main result of this paper is a refinement of the laser method that improves the resulting value bound for most sufficiently large tensors. Thus, even before computing any specific values, it is clear that we achieve an improved bound on ω, and we indeed obtain the best bound on ω to date: ω < 2.37286. The improvement is of the same magnitude as the improvement that [Le Gall’14] obtained over the previous bound ω < 2.37288 [Vassilevska W.‘12]. Our improvement to the laser method is quite general, and we believe it will have further applications in arithmetic complexity.
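As a quick illustration of where the exponent ω comes from (not from the paper, which works with far more sophisticated tensor constructions): Strassen's classical scheme multiplies 2×2 blocks with 7 instead of 8 recursive multiplications, already proving ω ≤ log2(7) ≈ 2.807. A minimal sketch:

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Strassen's algorithm: 7 recursive block multiplications instead of 8,
    giving an exponent of log2(7) ~ 2.807; the laser method is how all the
    modern improvements push the exponent further down."""
    n = A.shape[0]
    if n <= leaf:                      # fall back to naive multiply on small blocks
        return A @ B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty_like(A)
    C[:m, :m] = M1 + M4 - M5 + M7
    C[:m, m:] = M3 + M5
    C[m:, :m] = M2 + M4
    C[m:, m:] = M1 - M2 + M3 + M6
    return C
```

(The sketch assumes the matrix dimension is a power of two; padding handles the general case.)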
The complexity of matrix multiplication has been improved for the first time in six years. https://t.co/B7HYCjTFRZ
— のぶしみ (@knewknowl) October 13, 2020
2. Resolution Dependant GAN Interpolation for Controllable Image Synthesis Between Domains
Justin N. M. Pinkney, Doron Adler
GANs can generate photo-realistic images from the domain of their training data. However, those wanting to use them for creative purposes often want to generate imagery from a truly novel domain, a task which GANs are inherently unable to do. It is also desirable to have a level of control so that there is a degree of artistic direction rather than purely curation of random results. Here we present a method for interpolating between generative models of the StyleGAN architecture in a resolution-dependent manner. This allows us to generate images from an entirely novel domain and do this with a degree of control over the nature of the output.
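The core idea, interpolating two generators' weights with a blend factor that depends on the resolution each layer operates at, can be sketched roughly as follows (the layer names and resolution helper are illustrative, not the authors' code; the paper applies this to StyleGAN weights):

```python
import numpy as np

def blend_generators(params_a, params_b, resolution_of, swap_at=32):
    """Resolution-dependent interpolation between two generators (sketch).

    params_a / params_b: dicts mapping layer name -> weight array.
    resolution_of: hypothetical helper mapping a layer name to the
    feature-map resolution that layer operates at.
    Layers below `swap_at` keep model A's weights (coarse structure);
    layers at or above it take model B's weights (fine texture/style).
    """
    blended = {}
    for name, w_a in params_a.items():
        t = 0.0 if resolution_of(name) < swap_at else 1.0
        blended[name] = (1.0 - t) * w_a + t * params_b[name]
    return blended
```

A smooth ramp for `t` instead of a hard swap gives finer control over how much of each domain appears in the output.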
Resolution Dependant GAN Interpolation for Controllable Image Synthesis Between Domains
— AK (@ak92501) October 13, 2020
pdf: https://t.co/4Grjqsw45C
abs: https://t.co/i7o5WcFh6B pic.twitter.com/dgJUj9S5OP
3. High-Fidelity 3D Digital Human Creation from RGB-D Selfies
Xiangkai Lin, Yajing Chen, Linchao Bao, Haoxian Zhang, Sheng Wang, Xuefei Zhe, Xinwei Jiang, Jue Wang, Dong Yu, Zhengyou Zhang
We present a fully automatic system that can produce high-fidelity, photo-realistic 3D digital human characters with a consumer RGB-D selfie camera. The system only needs the user to take a short selfie RGB-D video while rotating his/her head, and can produce a high quality reconstruction in less than 30 seconds. Our main contribution is a new facial geometry modeling and reflectance synthesis procedure that significantly improves the state-of-the-art. Specifically, given the input video a two-stage frame selection algorithm is first employed to select a few high-quality frames for reconstruction. A novel 3D Morphable Model (3DMM) fitting method based on a differentiable renderer is then applied to recover facial geometries from multiview RGB-D data, which takes advantage of extensive data generation and perturbation. Our 3DMM has a much larger expressive capacity than conventional 3DMMs, allowing us to recover more accurate facial geometry using merely linear bases. For reflectance synthesis, we present a hybrid approach that combines parametric fitting and CNNs to synthesize high-resolution albedo/normal maps with realistic hair/pore/wrinkle details. Results show that our system can produce faithful 3D characters with extremely realistic details. Code and the constructed 3DMM are publicly available.
High-Fidelity 3D Digital Human Creation from RGB-D Selfies
— AK (@ak92501) October 13, 2020
pdf: https://t.co/RxEbcVVBvY
abs: https://t.co/gGqgUC8ajW pic.twitter.com/FJo3IW9dGP
4. Unsupervised Image-to-Image Translation via Pre-trained StyleGAN2 Network
Jialu Huang, Jing Liao, Sam Kwong
Image-to-Image (I2I) translation is a hot topic in academia, and it has also been applied in real-world industry for tasks like image synthesis, super-resolution, and colorization. However, traditional I2I translation methods are trained on data from two or more domains jointly. This requires a lot of computational resources. Moreover, the results are of lower quality, and they contain many more artifacts. The training process could be unstable when the data in different domains are not balanced, and mode collapse is more likely to happen. We propose a new I2I translation method that generates a new model in the target domain via a series of model transformations on a pre-trained StyleGAN2 model in the source domain. After that, we propose an inversion method to achieve the conversion between an image and its latent vector. By feeding the latent vector into the generated model, we can perform I2I translation between the source domain and target domain. Both qualitative and quantitative evaluations were conducted to prove that the proposed method can achieve outstanding performance in terms of image quality, diversity and semantic similarity to the input and reference images compared to state-of-the-art works.
Unsupervised Image-to-Image Translation via Pre-trained StyleGAN2 Network
— AK (@ak92501) October 13, 2020
pdf: https://t.co/jPcyWiAZR5
abs: https://t.co/LjfiwlLOJO pic.twitter.com/RE58Z3e4sJ
5. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
Jungil Kong, Jaehyeon Kim, Jaekyoung Bae
Several recent studies on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this study, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling the periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real time on CPU with comparable quality to an autoregressive counterpart.
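The trick behind modeling periodicity can be sketched simply: fold the 1D waveform into a 2D array so that samples spaced one period apart line up in columns, and let a 2D discriminator look at that view. This is a simplified take on HiFi-GAN's multi-period discriminator input (the actual model pads the signal and uses several prime periods, e.g. 2, 3, 5, 7, 11):

```python
import numpy as np

def fold_by_period(wave, period):
    """Reshape a 1D waveform into shape (T // period, period), so samples
    spaced `period` apart fall in the same column. A 2D discriminator over
    this view sees the periodic structure of the signal directly.
    (Simplified sketch: truncates the tail instead of padding.)"""
    T = len(wave) - len(wave) % period
    return wave[:T].reshape(-1, period)
```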
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
— AK (@ak92501) October 13, 2020
pdf: https://t.co/ARUIpTyJLT
abs: https://t.co/lIaeJ5rhur
samples: https://t.co/shLaeZCj02 pic.twitter.com/VI1chbLeaW
6. Cut-and-Paste Neural Rendering
Anand Bhattad, David A. Forsyth
Cut-and-paste methods take an object from one image and insert it into another. Doing so often results in unrealistic looking images because the inserted object’s shading is inconsistent with the target scene’s shading. Existing reshading methods require a geometric and physical model of the inserted object, which is then rendered using environment parameters. Accurately constructing such a model only from a single image is beyond the current understanding of computer vision. We describe an alternative procedure, cut-and-paste neural rendering, which renders the inserted fragment’s shading field consistently with the target scene. We use a Deep Image Prior (DIP) as a neural renderer trained to render an image with consistent image decomposition inferences. The resulting rendering from DIP should have an albedo consistent with composite albedo; it should have a shading field that, outside the inserted fragment, is the same as the target scene’s shading field; and composite surface normals are consistent with the final rendering’s shading field. The result is a simple procedure that produces convincing and realistic shading. Moreover, our procedure requires neither rendered images nor image decompositions of real images for training, nor labeled annotations. In fact, our only use of simulated ground truth is our use of a pre-trained normal estimator. Qualitative results are strong, supported by a user study comparing against the state-of-the-art image harmonization baseline.
Cut-and-Paste Neural Rendering https://t.co/dM9QFozI3a #computervision #Graphics pic.twitter.com/iC0lDTjgSo
— Tomasz Malisiewicz (@quantombone) October 13, 2020
7. What causes the test error? Going beyond bias-variance via ANOVA
Licong Lin, Edgar Dobriban
Modern machine learning methods are often overparametrized, allowing adaptation to the data at a fine level. This can seem puzzling; in the worst case, such models do not need to generalize. This puzzle inspired a great amount of work, arguing when overparametrization reduces test error, in a phenomenon called “double descent”. Recent work aimed to understand in greater depth why overparametrization is helpful for generalization. This leads to discovering the unimodality of variance as a function of the level of parametrization, and to decomposing the variance into that arising from label noise, initialization, and randomness in the training data to understand the sources of the error. In this work we develop a deeper understanding of this area. Specifically, we propose using the analysis of variance (ANOVA) to decompose the variance in the test error in a symmetric way, for studying the generalization performance of certain two-layer linear and non-linear networks. The advantage of the analysis of variance is that it reveals the effects of initialization, label noise, and training data more clearly than prior approaches. Moreover, we also study the monotonicity and unimodality of the variance components. While prior work studied the unimodality of the overall variance, we study the properties of each term in the variance decomposition. One key insight is that in typical settings, the interaction between training samples and initialization can dominate the variance; surprisingly, it can be larger than their marginal effects. Also, we characterize “phase transitions” where the variance changes from unimodal to monotone. On a technical level, we leverage advanced deterministic equivalent techniques for Haar random matrices, which, to our knowledge, have not yet been used in the area. We also verify our results in numerical simulations and on empirical data examples.
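The ANOVA idea itself is easy to demonstrate empirically (this is a generic two-way decomposition, not the paper's theoretical setup): record test errors over a grid of data seeds × initialization seeds, then split the total variance into two main effects plus an interaction term. The three parts sum exactly to the total variance because the decomposition is orthogonal:

```python
import numpy as np

def anova_decompose(errors):
    """Two-way ANOVA decomposition of test error over two random factors
    (rows: training-data seed, cols: initialization seed). Returns the
    variance attributed to each main effect and to their interaction;
    the parts sum exactly to the total variance across cells."""
    grand = errors.mean()
    row_eff = errors.mean(axis=1, keepdims=True) - grand   # data main effect
    col_eff = errors.mean(axis=0, keepdims=True) - grand   # init main effect
    resid = errors - grand - row_eff - col_eff             # interaction
    return {
        "data": (row_eff ** 2).mean(),
        "init": (col_eff ** 2).mean(),
        "interaction": (resid ** 2).mean(),
    }
```

A dominant "interaction" component is exactly the situation the paper highlights: neither factor alone explains the variance.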
Excited to share our latest work! Have you wondered what gives rise to the test error, in #deeplearning and beyond? For instance, how much of it is due to the randomness in the train data, label noise, and algorithms (e.g. SGD minibatches)? See https://t.co/es3PphwXPt & share! 1/ pic.twitter.com/6LdOuCrVul
— Edgar Dobriban (@EdgarDobriban) October 13, 2020
8. Nearly Minimax Optimal Reward-free Reinforcement Learning
Zihan Zhang, Simon S. Du, Xiangyang Ji
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions. This framework has two phases. In the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal. In the planning phase, the agent needs to return a near-optimal policy for arbitrary reward functions. We give a new efficient algorithm, Staged Sampling + Truncated Planning (SSTP), which interacts with the environment at most O((S²A/ε²) · polylog(SAH/ε)) episodes in the exploration phase, and guarantees to output a near-optimal policy for arbitrary reward functions in the planning phase. Here, S is the size of the state space, A is the size of the action space, H is the planning horizon, and ε is the target accuracy relative to the total reward. Notably, our sample complexity scales only logarithmically with H, in contrast to all existing results, which scale polynomially with H. Furthermore, this bound matches the minimax lower bound up to logarithmic factors. Our results rely on three new techniques: 1) A new sufficient condition for the dataset to plan for an ε-suboptimal policy; 2) A new way to plan efficiently under the proposed condition using soft-truncated planning; 3) Constructing an extended MDP to maximize the truncated accumulative rewards efficiently.
Want to explore the environment without any reward signal (Reward-free Reinforcement Learning)? We show how to do it near-optimally.
— Simon Shaolei Du (@SimonShaoleiDu) October 13, 2020
Surprise: the number of episodes needed is almost independent of the horizon!
Link: https://t.co/ij62ZmioG4
9. Information geometry and Frobenius algebra
Ruichao Jiang, Javad Tavakoli, Yiqiang Zhao
- retweets: 290, favorites: 69 (10/14/2020 09:37:08)
- links: abs | pdf
- math.DG | cond-mat.quant-gas | cs.IT
We show that a Frobenius structure is equivalent to a dually flat structure in information geometry. We define a multiplication structure on the tangent spaces of statistical manifolds, which we call the statistical product. We also define a scalar quantity, which we call the Yukawa term. By showing two examples from statistical mechanics, first the classical ideal gas, second the quantum bosonic ideal gas, we argue that the Yukawa term quantifies information generation, which resembles how mass is generated via the 3-point interaction of two fermions and a Higgs boson (Higgs mechanism). In the classical case, the Yukawa term is identically zero, whereas in the quantum case, the Yukawa term diverges as the fugacity goes to zero, which indicates Bose-Einstein condensation.
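For readers unfamiliar with the objects involved, here is a standard coordinate sketch of how a dually flat structure yields a product on tangent spaces (our notation, and not necessarily the paper's exact construction):

```latex
% In affine coordinates \theta^i with a convex potential \psi:
g_{ij} = \partial_i \partial_j \psi
  \quad \text{(dually flat / Hessian metric)}, \qquad
T_{ijk} = \partial_i \partial_j \partial_k \psi
  \quad \text{(Amari--Chentsov cubic tensor)}.
% The cubic tensor defines a commutative product \circ on each tangent
% space via g(X \circ Y, Z) = T(X, Y, Z), and total symmetry of T gives
% the Frobenius compatibility condition:
g(X \circ Y, Z) = g(X, Y \circ Z).
```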
Information geometry and Frobenius algebra https://t.co/MRen1SDIT2
— Submersion (@Submersion13) October 13, 2020
A study showing that the dually flat structure on statistical manifolds is equivalent to a Frobenius algebra.
With applications to quantum and statistical physics.
10. CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning
Ossama Ahmed, Frederik Träuble, Anirudh Goyal, Alexander Neitz, Manuel Wüthrich, Yoshua Bengio, Bernhard Schölkopf, Stefan Bauer
Despite recent successes of reinforcement learning (RL), it remains a challenge for agents to transfer learned skills to related environments. To facilitate research addressing this problem, we propose CausalWorld, a benchmark for causal structure and transfer learning in a robotic manipulation environment. The environment is a simulation of an open-source robotic platform, hence offering the possibility of sim-to-real transfer. Tasks consist of constructing 3D shapes from a given set of blocks - inspired by how children learn to build complex structures. The key strength of CausalWorld is that it provides a combinatorial family of such tasks with common causal structure and underlying factors (including, e.g., robot and object masses, colors, sizes). The user (or the agent) may intervene on all causal variables, which allows for fine-grained control over how similar different tasks (or task distributions) are. One can thus easily define training and evaluation distributions of a desired difficulty level, targeting a specific form of generalization (e.g., only changes in appearance or object mass). Further, this common parametrization facilitates defining curricula by interpolating between an initial and a target task. While users may define their own task distributions, we present eight meaningful distributions as concrete benchmarks, ranging from simple to very challenging, all of which require long-horizon planning as well as precise low-level motor control. Finally, we provide baseline results for a subset of these tasks on distinct training curricula and corresponding evaluation protocols, verifying the feasibility of the tasks in this benchmark.
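The intervention mechanism can be pictured as overriding entries in a dictionary of causal variables; new task distributions are then just distributions over such overrides. (The variable names below are illustrative only, not the benchmark's actual API.)

```python
def intervene(task, **changes):
    """Sketch of a CausalWorld-style intervention: a task is a dict of
    causal variables; an intervention returns a new task with some of
    them overridden, leaving the original task untouched. Sampling
    interventions from a restricted set is how training and evaluation
    distributions of controlled similarity are built."""
    unknown = set(changes) - set(task)
    if unknown:
        raise KeyError(f"not a causal variable: {unknown}")
    return {**task, **changes}
```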
CausalWorld: A Robotic Manipulation Benchmark For Causal Structure And Transfer Learning
https://t.co/3NbXFuXjY6 https://t.co/kCS41Pqtm3 pic.twitter.com/MUdWQlA1V5
— sim2real (@sim2realAIorg) October 12, 2020
11. Learning Adaptive Language Interfaces through Decomposition
Siddharth Karamcheti, Dorsa Sadigh, Percy Liang
Our goal is to create an interactive natural language interface that efficiently and reliably learns from users to complete tasks in simulated robotics settings. We introduce a neural semantic parsing system that learns new high-level abstractions through decomposition: users interactively teach the system by breaking down high-level utterances describing novel behavior into low-level steps that it can understand. Unfortunately, existing methods either rely on grammars which parse sentences with limited flexibility, or neural sequence-to-sequence models that do not learn efficiently or reliably from individual examples. Our approach bridges this gap, demonstrating the flexibility of modern neural systems, as well as the one-shot reliable generalization of grammar-based methods. Our crowdsourced interactive experiments suggest that over time, users complete complex tasks more efficiently while using our system by leveraging what they just taught. At the same time, getting users to trust the system enough to be incentivized to teach high-level utterances is still an ongoing challenge. We end with a discussion of some of the obstacles we need to overcome to fully realize the potential of the interactive paradigm.
How do we build adaptive language interfaces that learn through interaction with real human users?
— Siddharth Karamcheti (@siddkaramcheti) October 13, 2020
New work w/ my amazing advisors @DorsaSadigh and @percyliang, to be presented at the @intexsempar2020 workshop at #emnlp2020.
Link: https://t.co/2VqAPhtks3
A thread 🧵(1 / N). pic.twitter.com/174Ju39VQj
Beginning of many future exciting RoboNLP work. Learning adaptive language interfaces through interaction: https://t.co/c9CkE4MAsI
— Dorsa Sadigh (@DorsaSadigh) October 13, 2020
w/ @siddkaramcheti and Percy Liang
12. Large-Scale Methods for Distributionally Robust Optimization
Daniel Levy, Yair Carmon, John C. Duchi, Aaron Sidford
We propose and analyze algorithms for distributionally robust optimization of convex losses with conditional value at risk (CVaR) and χ² divergence uncertainty sets. We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters, making them suitable for large-scale applications. For χ² uncertainty sets these are the first such guarantees in the literature, and for CVaR our guarantees scale linearly in the uncertainty level rather than quadratically as in previous work. We also provide lower bounds proving the worst-case optimality of our algorithms for CVaR and a penalized version of the problem. Our primary technical contributions are novel bounds on the bias of batch robust risk estimation and the variance of a multilevel Monte Carlo gradient estimator due to [Blanchet & Glynn, 2015]. Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9-36 times more efficient than full-batch methods.
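The CVaR objective itself is easy to state: it is the average of the worst α-fraction of per-example losses. A minimal sketch (the paper's contribution is estimating this objective's gradient cheaply, not this direct full-batch computation):

```python
import numpy as np

def cvar(losses, alpha):
    """Conditional value at risk at level alpha: the average of the worst
    alpha-fraction of losses. Minimizing this instead of the plain mean
    focuses training on the hardest examples (a DRO objective)."""
    losses = np.sort(losses)[::-1]                 # descending
    k = max(1, int(np.ceil(alpha * len(losses))))  # size of the worst tail
    return losses[:k].mean()
```

At alpha = 1 this reduces to the ordinary mean loss (ERM); smaller alpha is more pessimistic.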
DRO: no-one knows what it does and it doesn't scale anyway... Or does it?
— Daniel Levy (@daniellevy__) October 13, 2020
In our #NeurIPS2020 paper, we propose optimization algorithms with running time independent of dimension and dataset size (think SGD for ERM) for CVaR and chi-square objectives. https://t.co/jZMkwuTLgy 1/4
13. ArXiving Before Submission Helps Everyone
Dmytro Mishkin, Amy Tabb, Jiri Matas
We claim, and present evidence, that allowing arXiv publication before a conference or journal submission benefits researchers, especially early career, as well as the whole scientific community. Specifically, arXiving helps professional identity building, protects against independent re-discovery, idea theft and gate-keeping; it facilitates open research result distribution and reduces inequality. The advantages dwarf the drawbacks (mainly the relative increase in acceptance rate of papers of well-known authors), which studies show to be marginal. Analyzing the pros and cons of arXiving papers, we conclude that requiring preprints be anonymous is nearly as detrimental as not allowing them. We see no reasons why anyone but the authors should decide whether to arXiv or not.
ArXiving before submission:
— Dmytro Mishkin (@ducha_aiki) October 13, 2020
- helps professional identity building;
- protects against idea re-discovery/theft/gate-keeping;
- facilitates open research distribution;
- reduces inequality.
our refined statement, with @amy_tabb & Jiri Matas, now on arXiv https://t.co/UVdQLyYcEv
14. Unsupervised Distillation of Syntactic Information from Contextualized Word Representations
Shauli Ravfogel, Yanai Elazar, Jacob Goldberger, Yoav Goldberg
Contextualized word representations, such as ELMo and BERT, were shown to perform well on various semantic and syntactic tasks. In this work, we tackle the task of unsupervised disentanglement between semantics and structure in neural language representations: we aim to learn a transformation of the contextualized vectors that discards the lexical semantics, but keeps the structural information. To this end, we automatically generate groups of sentences which are structurally similar but semantically different, and use a metric-learning approach to learn a transformation that emphasizes the structural component that is encoded in the vectors. We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics. Finally, we demonstrate the utility of our distilled representations by showing that they outperform the original contextualized representations in a few-shot parsing setting.
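A metric-learning objective of the kind described, pulling structurally similar sentence representations together while pushing structurally different ones apart, can be sketched with a standard triplet loss (illustrative; the paper's exact training setup may differ):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: penalize when the anchor is not at least
    `margin` closer to the positive (structurally similar sentence) than
    to the negative (structurally different sentence)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```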
Happy to share our paper "Unsupervised Distillation of Syntactic Information from Contextualized Word Representations", accepted to #BlackboxNLP - a joint work with @yanaiela*, Jacob Goldberger and @yoavgo. https://t.co/us1MbJNiL9 (1/4) pic.twitter.com/xqPHna0zNI
— Shauli Ravfogel (@ravfogel) October 13, 2020
15. MammoGANesis: Controlled Generation of High-Resolution Mammograms for Radiology Education
Cyril Zakka, Ghida Saheb, Elie Najem, Ghina Berjawi
During their formative years, radiology trainees are required to interpret hundreds of mammograms per month, with the objective of becoming apt at discerning the subtle patterns differentiating benign from malignant lesions. Unfortunately, medico-legal and technical hurdles make it difficult to access and query medical images for training. In this paper we train a generative adversarial network (GAN) to synthesize 512 x 512 high-resolution mammograms. The resulting model leads to the unsupervised separation of high-level features (e.g. the standard mammography views and the nature of the breast lesions), with stochastic variation in the generated images (e.g. breast adipose tissue, calcification), enabling user-controlled global and local attribute-editing of the synthesized images. We demonstrate the model’s ability to generate anatomically and medically relevant mammograms by achieving an average AUC of 0.54 in a double-blind study in which four expert mammography radiologists attempted to distinguish generated from real images, attesting to the high visual quality of the synthesized and edited mammograms, and to their potential use in advancing and facilitating medical education.
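Why is an AUC of 0.54 evidence of quality? The pairwise (Mann-Whitney) form of AUC makes it clear: 0.5 is chance level, so the radiologists barely beat a coin flip at telling real from generated images. A minimal computation (generic, not the study's analysis code):

```python
def auc(pos_scores, neg_scores):
    """Pairwise (Mann-Whitney) AUC: the probability that a randomly chosen
    positive outscores a randomly chosen negative, counting ties as half.
    0.5 is chance; near-0.5 on real-vs-generated means the images are
    close to indistinguishable."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```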
Preprint Alert!
— Cyril (@cyrilzakka) October 13, 2020
"MammoGANesis: Controlled Generation of High-Resolution Mammograms for Radiology Education" is now on arXiv!
📄 Paper: https://t.co/FGrOND5otF
✍🏼 Blog: https://t.co/Iht37KxGjw
And just in time for #BreastCancerAwareness month 🎀
Read on to learn more: 1/7 pic.twitter.com/kmpCYchwTk
16. SMYRF: Efficient Attention using Asymmetric Clustering
Giannis Daras, Nikita Kitaev, Augustus Odena, Alexandros G. Dimakis
We propose a novel type of balanced clustering algorithm to approximate attention. Attention complexity is reduced from O(N²) to O(N log N), where N is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new Asymmetric transformations and an adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF is that it can be used as a drop-in replacement for dense attention layers without any retraining. In contrast, prior fast attention methods impose constraints (e.g. queries and keys share the same vector representations) and require re-training from scratch. We apply our method to pre-trained state-of-the-art Natural Language Processing and Computer Vision models and we report significant memory and speed benefits. Notably, SMYRF-BERT outperforms (slightly) BERT on GLUE, while using less memory. We also show that SMYRF can be used interchangeably with dense attention before and after training. Finally, we use SMYRF to train GANs with attention in high resolutions. Using a single TPU, we were able to scale attention to 128x128=16k and 256x256=65k tokens on BigGAN on CelebA-HQ.
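The clustered-attention idea can be sketched in a few lines: group queries and keys by a shared random projection, cut the ordering into equal-size clusters, and run softmax attention only within each cluster. This is an illustrative balanced clustering, not the paper's asymmetric LSH transformations:

```python
import numpy as np

def clustered_attention(Q, K, V, n_clusters, seed=0):
    """Balanced-cluster approximate attention (sketch of the SMYRF idea).
    Queries and keys are sorted by a shared random projection and split
    into equal-size groups; attention runs only within each group, so
    cost drops from O(N^2) toward O(N^2 / n_clusters)."""
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    r = rng.standard_normal(d)               # shared hashing direction
    q_order = np.argsort(Q @ r)
    k_order = np.argsort(K @ r)
    out = np.empty_like(V)
    for q_idx, k_idx in zip(np.array_split(q_order, n_clusters),
                            np.array_split(k_order, n_clusters)):
        scores = Q[q_idx] @ K[k_idx].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)    # softmax within the cluster
        out[q_idx] = w @ V[k_idx]
    return out
```

With `n_clusters=1` this recovers dense attention exactly; larger values trade accuracy for memory and speed.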
Excited to announce our #NeurIPS2020 paper:
— Giannis Daras (@giannis_daras) October 13, 2020
SMYRF: Efficient Attention using Asymmetric Clustering.
Paper: https://t.co/ldvgwqR0cc
Code: https://t.co/tslmLyruxk
https://t.co/vKN0OmA3he
We propose a novel way to approximate *pre-trained* attention layers or train from scratch.
17. Explaining Neural Matrix Factorization with Gradient Rollback
Carolin Lawrence, Timo Sztyler, Mathias Niepert
Explaining the predictions of neural black-box models is an important problem, especially when such models are used in applications where user trust is crucial. Estimating the influence of training examples on a learned neural model’s behavior allows us to identify training examples most responsible for a given prediction and, therefore, to faithfully explain the output of a black-box model. The most generally applicable existing method is based on influence functions, which scale poorly for larger sample sizes and models. We propose gradient rollback, a general approach for influence estimation, applicable to neural models where each parameter update step during gradient descent touches only a small number of parameters, even if the overall number of parameters is large. Neural matrix factorization models trained with gradient descent are part of this model class. These models are popular and have found a wide range of applications in industry. Especially knowledge graph embedding methods, which belong to this class, are used extensively. We show that gradient rollback is highly efficient at both training and test time. Moreover, we show theoretically that the difference between gradient rollback’s influence approximation and the true influence on a model’s behavior is smaller than known bounds on the stability of stochastic gradient descent. This establishes that gradient rollback is robustly estimating example influence. We also conduct experiments which show that gradient rollback provides faithful explanations for knowledge base completion and recommender datasets. An implementation is available in the submission system.
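The mechanism can be illustrated on a toy matrix factorization model: during SGD, record the total update each training example contributed to the (few) embeddings it touches; to estimate an example's influence on a prediction, "roll back" (subtract) its recorded updates and re-score. This is our own minimal sketch of the idea, not the authors' implementation:

```python
import numpy as np

def train_mf_with_rollback(triples, n_ent, dim=8, lr=0.1, epochs=50, seed=0):
    """Toy trainer: score(i, j) = sigmoid(u_i . v_j), logistic loss,
    per-example SGD. Records, per training example, the cumulative update
    it contributed to the two embeddings it touches."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_ent, dim))
    V = 0.1 * rng.standard_normal((n_ent, dim))
    contrib = {e: {"u": np.zeros(dim), "v": np.zeros(dim)}
               for e in range(len(triples))}
    for _ in range(epochs):
        for e, (i, j, y) in enumerate(triples):
            p = 1.0 / (1.0 + np.exp(-U[i] @ V[j]))
            g = p - y                          # d loss / d score
            du, dv = -lr * g * V[j], -lr * g * U[i]
            U[i] += du
            V[j] += dv
            contrib[e]["u"] += du
            contrib[e]["v"] += dv
    return U, V, contrib

def influence(U, V, contrib, e, triples, query):
    """Change in the query's score when example e's updates are rolled back
    from any embeddings the query shares with it."""
    i, j, _ = triples[e]
    qi, qj = query
    Ui = U[qi] - (contrib[e]["u"] if qi == i else 0)
    Vj = V[qj] - (contrib[e]["v"] if qj == j else 0)
    return U[qi] @ V[qj] - Ui @ Vj
```

An example that shares no embeddings with the query has exactly zero estimated influence, which is what makes the approach cheap for sparse models.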
Want to make your NN more explainable?
— Carolin Lawrence (@caro__lawrence) October 13, 2020
We present Gradient Rollback (GR) which tracks how training examples influence the model & use this to explain predictions. We apply GR to knowledge base completion. #ExplainableAI #KnowledgeGraph #ML https://t.co/ToOmJX7LDJ
Overview below:
18. An Open Review of OpenReview: A Critical Analysis of the Machine Learning Conference Review Process
David Tran, Alex Valtchanov, Keshav Ganapathy, Raymond Feng, Eric Slud, Micah Goldblum, Tom Goldstein
Mainstream machine learning conferences have seen a dramatic increase in the number of participants, along with a growing range of perspectives, in recent years. Members of the machine learning community are likely to overhear allegations ranging from randomness of acceptance decisions to institutional bias. In this work, we critically analyze the review process through a comprehensive study of papers submitted to ICLR between 2017 and 2020. We quantify reproducibility/randomness in review scores and acceptance decisions, and examine whether scores correlate with paper impact. Our findings suggest strong institutional bias in accept/reject decisions, even after controlling for paper quality. Furthermore, we find evidence for a gender gap, with female authors receiving lower scores, lower acceptance rates, and fewer citations per paper than their male counterparts. We conclude our work with recommendations for future conference organizers.
A paper criticizing the peer-review system of international machine learning conferences.
— Tomo (@T45356) October 13, 2020
Looks interesting.
An Open Review of OpenReview: A Critical Analysis of the Machine Learning Conference Review Process https://t.co/tcvIB8dW2h
19. Probing Pretrained Language Models for Lexical Semantics
Ivan Vulić, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, Anna Korhonen
The success of large pretrained language models (LMs) such as BERT and RoBERTa has sparked interest in probing their representations, in order to unveil what types of knowledge they implicitly capture. While prior research focused on morphosyntactic, semantic, and world knowledge, it remains unclear to what extent LMs also derive lexical type-level knowledge from words in context. In this work, we present a systematic empirical analysis across six typologically diverse languages and five different lexical tasks, addressing the following questions: 1) How do different lexical knowledge extraction strategies (monolingual versus multilingual source LM, out-of-context versus in-context encoding, inclusion of special tokens, and layer-wise averaging) impact performance? How consistent are the observed effects across tasks and languages? 2) Is lexical knowledge stored in a few parameters, or is it scattered throughout the network? 3) How do these representations fare against traditional static word vectors in lexical tasks? 4) Does the lexical information emerging from independently trained monolingual LMs display latent similarities? Our main results indicate patterns and best practices that hold universally, but also point to prominent variations across languages and tasks. Moreover, we validate the claim that lower Transformer layers carry more type-level lexical knowledge, but also show that this knowledge is distributed across multiple layers.
Our paper "Probing Pretrained Language Models for Lexical Semantics" will appear soon @emnlp2020, but it has appeared online already! https://t.co/tit7X9omcZ @licwu @PontiEdoardo R. Litschko @annalkorhonen
— CambridgeLTL (@CambridgeLTL) October 13, 2020
Thread 👇 1/5
20. How well does surprisal explain N400 amplitude under different experimental conditions?
James A. Michaelov, Benjamin K. Bergen
- retweets: 42, favorites: 48 (10/14/2020 09:37:10)
- links: abs | pdf
- cs.CL | cs.AI | cs.IT | cs.LG | q-bio.NC
We investigate the extent to which word surprisal can be used to predict a neural measure of human language processing difficulty - the N400. To do this, we use recurrent neural networks to calculate the surprisal of stimuli from previously published neurolinguistic studies of the N400. We find that surprisal can predict N400 amplitude in a wide range of cases, and the cases where it cannot do so provide valuable insight into the neurocognitive processes underlying the response.
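Surprisal is just the negative log-probability of a word given its context, -log2 P(word | context). A toy stand-in for the recurrent language models used in the paper, here an add-alpha smoothed bigram model (illustrative only):

```python
import math
from collections import Counter

def bigram_surprisal(corpus, prev, word, alpha=1.0):
    """-log2 P(word | prev) under an add-alpha smoothed bigram model fit
    to `corpus` (a list of tokens). Higher surprisal corresponds, in the
    paper's setting, to a larger predicted N400 amplitude."""
    vocab = set(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus[:-1])
    p = (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(vocab))
    return -math.log2(p)
```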
How well does surprisal explain N400 amplitude under different experimental conditions? #conll2020 https://t.co/WgfKAA1j5O
— Tal Linzen (@tallinzen) October 13, 2020
21. The MECCANO Dataset: Understanding Human-Object Interactions from Egocentric Videos in an Industrial-like Domain
Francesco Ragusa, Antonino Furnari, Salvatore Livatino, Giovanni Maria Farinella
Wearable cameras make it possible to collect images and videos of humans interacting with the world. While human-object interactions have been thoroughly investigated in third person vision, the problem has been understudied in egocentric settings and in industrial scenarios. To fill this gap, we introduce MECCANO, the first dataset of egocentric videos to study human-object interactions in industrial-like settings. MECCANO has been acquired by 20 participants who were asked to build a motorbike model, for which they had to interact with tiny objects and tools. The dataset has been explicitly labeled for the task of recognizing human-object interactions from an egocentric perspective. Specifically, each interaction has been labeled both temporally (with action segments) and spatially (with active object bounding boxes). With the proposed dataset, we investigate four different tasks including 1) action recognition, 2) active object detection, 3) active object recognition and 4) egocentric human-object interaction detection, which is a revisited version of the standard human-object interaction detection task. Baseline results show that the MECCANO dataset is a challenging benchmark to study egocentric human-object interactions in industrial-like scenarios. We publicly release the dataset at https://iplab.dmi.unict.it/MECCANO.
The MECCANO Dataset: Understanding Human-Object Interactions from Egocentric Videos in an Industrial-like Domain https://t.co/jwQCkfAwV3 https://t.co/K48fXERFkQ pic.twitter.com/V72nxUwr1G
— sim2real (@sim2realAIorg) October 13, 2020
22. Capturing Dynamics of Time-Varying Data via Topology
Lu Xian, Henry Adams, Chad M. Topaz, Lori Ziegelmeier
- retweets: 64, favorites: 22 (10/14/2020 09:37:10)
- links: abs | pdf
- cs.LG | cs.CG | math.AT | math.ST | stat.ML
One approach to understanding complex data is to study its shape through the lens of algebraic topology. While the early development of topological data analysis focused primarily on static data, in recent years, theoretical and applied studies have turned to data that varies in time. A time-varying collection of metric spaces as formed, for example, by a moving school of fish or flock of birds, can contain a vast amount of information. There is often a need to simplify or summarize the dynamic behavior. We provide an introduction to topological summaries of time-varying metric spaces including vineyards [17], crocker plots [52], and multiparameter rank functions [34]. We then introduce a new tool to summarize time-varying metric spaces: a crocker stack. Crocker stacks are convenient for visualization, amenable to machine learning, and satisfy a desirable stability property which we prove. We demonstrate the utility of crocker stacks for a parameter identification task involving an influential model of biological aggregations [54]. Altogether, we aim to bring the broader applied mathematics community up to date on topological summaries of time-varying metric spaces.
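As a rough illustration of the crocker idea (a sketch, not the authors' code): a crocker plot records, for each time step and each scale, the number of connected components (Betti-0) of the graph linking points within that distance of each other. A minimal union-find version:

```python
import numpy as np

def betti0(points, eps):
    """Number of connected components of the graph joining points within eps."""
    n = len(points)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) <= eps:
                pi, pj = find(i), find(j)
                if pi != pj:
                    parent[pi] = pj
    return len({find(i) for i in range(n)})

def crocker(frames, scales):
    """Crocker plot: a (time x scale) grid of Betti-0 counts."""
    return np.array([[betti0(f, s) for s in scales] for f in frames])

# Two time steps: a tight cluster that later sheds one far-away point.
t0 = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
t1 = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
print(crocker([t0, t1], scales=[0.2, 10.0]))
```

A crocker stack, as the abstract describes it, extends this summary with an additional smoothing parameter, which is what yields the stability property.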
For #math #datascience #tda friends... some smart people let me write a paper with them! See below for today's mini-thread on this work, which is meant to be a primer to bring non-experts up to date on certain aspects of topological data analysis. https://t.co/tvPnSUgEtz
— Chad Topaz (@chadtopaz) October 13, 2020
23. Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data
William Huang, Haokun Liu, Samuel R. Bowman
A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks---datasets collected from crowdworkers to create an evaluation task---while still failing on out-of-domain examples for the same task. Recent work has explored the use of counterfactually-augmented data---data built by minimally editing a set of seed examples to yield counterfactual labels---to augment training data associated with these benchmarks and build more robust classifiers that generalize better. However, Khashabi et al. (2020) find that this type of augmentation yields little benefit on reading comprehension tasks when controlling for dataset size and cost of collection. We build upon this work by using English natural language inference data to test model generalization and robustness, and find that models trained on a counterfactually-augmented SNLI dataset do not generalize better than models trained on unaugmented datasets of similar size, and that counterfactual augmentation can hurt performance, yielding models that are less robust to challenge examples. Counterfactual augmentation of natural language understanding data through standard crowdsourcing techniques does not appear to be an effective way of collecting training data, and further innovation is required to make this general line of work viable.
New paper with @WillHuang93 and @liu_haokun at the Negative Results workshop at EMNLP (thread) https://t.co/wHDIXVMlPl
— Sam Bowman (disband the NYPD) (@sleepinyourhat) October 13, 2020
24. Convergence to the fixed-node limit in deep variational Monte Carlo
Zeno Schätzle, Jan Hermann, Frank Noé
- retweets: 49, favorites: 21 (10/14/2020 09:37:10)
- links: abs | pdf
- physics.comp-ph | cs.LG | physics.chem-ph | stat.ML
Variational quantum Monte Carlo (QMC) is an ab-initio method for solving the electronic Schrödinger equation that is exact in principle, but limited by the flexibility of the available ansatzes in practice. The recently introduced deep QMC approach, specifically two deep-neural-network ansatzes, PauliNet and FermiNet, allows variational QMC to reach the accuracy of diffusion QMC, but little is understood about the convergence behavior of such ansatzes. Here, we analyze how deep variational QMC approaches the fixed-node limit with increasing network size. First, we demonstrate that a deep neural network can overcome the limitations of a small basis set and reach the mean-field complete-basis-set limit. Moving to electron correlation, we then perform an extensive hyperparameter scan of a deep Jastrow factor for LiH and H4 and find that variational energies at the fixed-node limit can be obtained with a sufficiently large network. Finally, we benchmark mean-field and many-body ansatzes on H2O, increasing the fraction of recovered fixed-node correlation energy by half an order of magnitude compared to previous VMC results. This analysis helps in understanding the superb performance of deep variational ansatzes, and will guide future improvements of the neural network architectures in deep QMC.
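For readers new to variational QMC, here is a toy illustration of the variational principle the paper builds on: sample configurations from a trial wavefunction, average the local energy, and note that the estimate is lowest at the exact ground state. This uses a 1D harmonic oscillator with a Gaussian trial function, entirely unrelated to the deep ansatzes studied:

```python
import numpy as np

def local_energy(x, a):
    # For H = -1/2 d^2/dx^2 + x^2/2 and trial wavefunction psi_a(x) = exp(-a x^2):
    # E_L(x) = a + x^2 * (1/2 - 2 a^2); constant (i.e. exact) at a = 1/2.
    return a + x**2 * (0.5 - 2.0 * a**2)

def vmc_energy(a, n_steps=20000, step=1.0, seed=0):
    """Metropolis sampling of |psi_a|^2, averaging the local energy."""
    rng = np.random.default_rng(seed)
    x, energies = 0.0, []
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)
        # acceptance ratio |psi(x_new)|^2 / |psi(x)|^2 = exp(-2a (x_new^2 - x^2))
        if rng.random() < np.exp(-2.0 * a * (x_new**2 - x**2)):
            x = x_new
        energies.append(local_energy(x, a))
    return float(np.mean(energies[n_steps // 10:]))  # discard burn-in

# The exact ground state has a = 0.5 and energy 0.5; other values lie above.
for a in (0.3, 0.5, 0.7):
    print(f"a={a}: E ~ {vmc_energy(a):.3f}")
```

The fixed-node limit the paper discusses is the analogous floor for fermionic systems: the best energy reachable without changing where the wavefunction changes sign.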
Zeno Schätzle and @jhrmnn strike again and improve understanding of deep quantum Monte Carlo.
— Frank Noe (@FrankNoeBerlin) October 13, 2020
Can converge to large-basis-set limit by deep #MachineLearning a small basis set and the fixed-node / DMC limit by deep learning Jastrow. cc: @gppcarleo, @pfau https://t.co/p8c3J4pYYN
25. OCNLI: Original Chinese Natural Language Inference
Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra Kuebler, Lawrence S. Moss
Despite the tremendous recent progress on natural language inference (NLI), driven largely by large-scale investment in new datasets (e.g., SNLI, MNLI) and advances in modeling, most progress has been limited to English due to a lack of reliable datasets for most of the world’s languages. In this paper, we present the first large-scale NLI dataset (consisting of ~56,000 annotated sentence pairs) for Chinese called the Original Chinese Natural Language Inference dataset (OCNLI). Unlike recent attempts at extending NLI to other languages, our dataset does not rely on any automatic translation or non-expert annotation. Instead, we elicit annotations from native speakers specializing in linguistics. We follow closely the annotation protocol used for MNLI, but create new strategies for eliciting diverse hypotheses. We establish several baseline results on our dataset using state-of-the-art pre-trained models for Chinese, and find even the best performing models to be far outpaced by human performance (~12% absolute performance gap), making it a challenging new resource that we hope will help to accelerate progress in Chinese NLU. To the best of our knowledge, this is the first human-elicited MNLI-style corpus for a non-English language.
It looks like the best broad-coverage evaluation dataset for NLI might no longer be in English. (thread) https://t.co/6kQ6QlqAER pic.twitter.com/1FMhYvrtlW
— Sam Bowman (disband the NYPD) (@sleepinyourhat) October 13, 2020
26. Robust Optimal Transport with Applications in Generative Modeling and Domain Adaptation
Yogesh Balaji, Rama Chellappa, Soheil Feizi
Optimal Transport (OT) distances such as Wasserstein have been used in several areas such as GANs and domain adaptation. OT, however, is very sensitive to outliers (samples with large noise) in the data since in its objective function, every sample, including outliers, is weighed similarly due to the marginal constraints. To remedy this issue, robust formulations of OT with unbalanced marginal constraints have previously been proposed. However, employing these methods in deep learning problems such as GANs and domain adaptation is challenging due to the instability of their dual optimization solvers. In this paper, we resolve these issues by deriving a computationally-efficient dual form of the robust OT optimization that is amenable to modern deep learning applications. We demonstrate the effectiveness of our formulation in two applications of GANs and domain adaptation. Our approach can train state-of-the-art GAN models on noisy datasets corrupted with outlier distributions. In particular, our optimization computes weights for training samples reflecting how difficult it is for those samples to be generated in the model. In domain adaptation, our robust OT formulation leads to improved accuracy compared to the standard adversarial adaptation methods. Our code is available at https://github.com/yogeshbalaji/robustOT.
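To see why standard OT is outlier-sensitive, note the hard marginal constraints: every sample must carry its full share of mass. The sketch below is plain entropic OT solved with Sinkhorn iterations, not the paper's robust dual formulation, and is meant only to make those constraints concrete:

```python
import numpy as np

def sinkhorn(cost, a, b, reg=1.0, n_iters=500):
    """Entropic OT: alternate matrix scalings so the transport plan P
    satisfies the marginal constraints P @ 1 = a and P.T @ 1 = b."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Uniform marginals force every source sample, the outlier included,
# to ship exactly 1/4 of the mass; robust OT variants relax this.
x = np.array([0.0, 0.1, 0.2, 5.0])   # last source point is an outlier
y = np.array([0.0, 0.1, 0.2, 0.3])
cost = (x[:, None] - y[None, :]) ** 2
a = np.full(4, 0.25)
b = np.full(4, 0.25)
P = sinkhorn(cost, a, b)
print(P.sum(axis=1))  # row marginals equal a: the outlier still carries 0.25
```

The paper's contribution is a stable dual form of the robust (unbalanced-constraint) variant suitable for training GANs and adaptation models; its learned per-sample weights are exactly what this vanilla formulation cannot provide.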
27. Beyond Language: Learning Commonsense from Images for Reasoning
Wanqing Cui, Yanyan Lan, Liang Pang, Jiafeng Guo, Xueqi Cheng
This paper proposes a novel approach to learning commonsense from images, instead of from limited raw text or costly constructed knowledge bases, for the commonsense reasoning problem in NLP. Our motivation comes from the fact that an image is worth a thousand words: richer scene information can be leveraged to help distill commonsense knowledge, which is often hidden in language. Our approach, named Loire, consists of two stages. In the first stage, a bi-modal sequence-to-sequence approach is used to perform the scene layout generation task, based on a text representation model, ViBERT. In this way, the required visual scene knowledge, such as spatial relations, is encoded in ViBERT through supervised learning on bi-modal data such as COCO. ViBERT is then concatenated with a pre-trained language model to perform the downstream commonsense reasoning tasks. Experimental results on two commonsense reasoning problems, i.e. commonsense question answering and pronoun resolution, demonstrate that Loire outperforms traditional language-based methods. We also give case studies showing what knowledge is learned from images and explain how the generated scene layout helps the commonsense reasoning process.
Beyond Language: Learning Commonsense from Images for Reasoning
— AK (@ak92501) October 13, 2020
pdf: https://t.co/2LLIKLugDX
abs: https://t.co/Rlcsi1Twwl pic.twitter.com/3sO21jM3W4
28. Software Sustainability & High Energy Physics
Daniel S. Katz, Sudhir Malik, Mark S. Neubauer, Graeme A. Stewart, Kétévi A. Assamagan, Erin A. Becker, Neil P. Chue Hong, Ian A. Cosden, Samuel Meehan, Edward J. W. Moyse, Adrian M. Price-Whelan, Elizabeth Sexton-Kennedy, Meirin Oan Evans, Matthew Feickert, Clemens Lange, Kilian Lieret, Rob Quick, Arturo Sánchez Pineda, Christopher Tunnell
New facilities of the 2020s, such as the High Luminosity Large Hadron Collider (HL-LHC), will be relevant through at least the 2030s. This means that their software efforts and those that are used to analyze their data need to consider sustainability to enable their adaptability to new challenges, longevity, and efficiency, over at least this period. This will help ensure that this software will be easier to develop and maintain, that it remains available in the future on new platforms, that it meets new needs, and that it is as reusable as possible. This report discusses a virtual half-day workshop on “Software Sustainability and High Energy Physics” that aimed 1) to bring together experts from HEP as well as those from outside to share their experiences and practices, and 2) to articulate a vision that helps the Institute for Research and Innovation in Software for High Energy Physics (IRIS-HEP) to create a work plan to implement elements of software sustainability. Software sustainability practices could lead to new collaborations, including elements of HEP software being directly used outside the field, and, as has happened more frequently in recent years, to HEP developers contributing to software developed outside the field rather than reinventing it. A focus on and skills related to sustainable software will give HEP software developers an important skill that is essential to careers in the realm of software, inside or outside HEP. The report closes with recommendations to improve software sustainability in HEP, aimed at the HEP community via IRIS-HEP and the HEP Software Foundation (HSF).
29. Do Language Embeddings Capture Scales?
Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, Dan Roth
Pretrained Language Models (LMs) have been shown to possess significant linguistic, common sense, and factual knowledge. One form of knowledge that has not been studied yet in this context is information about the scalar magnitudes of objects. We show that pretrained language models capture a significant amount of this information but are short of the capability required for general common-sense reasoning. We identify contextual information in pre-training and numeracy as two key factors affecting their performance and show that a simple method of canonicalizing numbers can have a significant effect on the results.
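The abstract does not specify the canonicalization scheme, so the function below is a hypothetical illustration of one plausible form: rewriting each numeral into scientific notation so its order of magnitude becomes an explicit token the model can condition on.

```python
import re

def canonicalize_numbers(text: str) -> str:
    """Hypothetical canonicalization: rewrite numerals in scientific
    notation so the order of magnitude is explicit (e.g. 150,000 -> 1.5e5)."""
    def repl(m):
        mantissa, exp = f"{float(m.group().replace(',', '')):e}".split("e")
        return f"{float(mantissa):.1f}e{int(exp)}"
    return re.sub(r"\d+(?:,\d{3})*(?:\.\d+)?", repl, text)

print(canonicalize_numbers("A blue whale weighs about 150,000 kg."))
```

Any scheme along these lines factors the surface-form variation out of the numeracy problem, which is the kind of effect the paper reports as significant.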