All Articles

Hot Papers 2021-06-02

1. Machine-Learning Non-Conservative Dynamics for New-Physics Detection

Ziming Liu, Bohan Wang, Qi Meng, Wei Chen, Max Tegmark, Tie-Yan Liu

Energy conservation is a basic physics principle, the breakdown of which often implies new physics. This paper presents a method for data-driven “new physics” discovery. Specifically, given a trajectory governed by unknown forces, our Neural New-Physics Detector (NNPhD) aims to detect new physics by decomposing the force field into conservative and non-conservative components, which are represented by a Lagrangian Neural Network (LNN) and a universal approximator network (UAN), respectively, trained to minimize the force recovery error plus a constant λ\lambda times the magnitude of the predicted non-conservative force. We show that a phase transition occurs at λ\lambda=1, universally for arbitrary forces. We demonstrate that NNPhD successfully discovers new physics in toy numerical experiments, rediscovering friction (1493) from a damped double pendulum, Neptune from Uranus’ orbit (1846) and gravitational waves (2017) from an inspiraling orbit. We also show how NNPhD coupled with an integrator outperforms previous methods for predicting the future of a damped double pendulum.

2. Towards Real-time and Light-weight Line Segment Detection

Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, Minchul Shin

  • retweets: 2448, favorites: 217 (06/03/2021 09:56:29)
  • links: abs | pdf
  • cs.CV | cs.LG

Previous deep learning-based line segment detection (LSD) suffer from the immense model size and high computational cost for line prediction. This constrains them from real-time inference on computationally restricted environments. In this paper, we propose a real-time and light-weight line segment detector for resource-constrained environments named Mobile LSD (M-LSD). We design an extremely efficient LSD architecture by minimizing the backbone network and removing the typical multi-module process for line prediction in previous methods. To maintain competitive performance with such a light-weight network, we present novel training schemes: Segments of Line segment (SoL) augmentation and geometric learning scheme. SoL augmentation splits a line segment into multiple subparts, which are used to provide auxiliary line data during the training process. Moreover, the geometric learning scheme allows a model to capture additional geometry cues from matching loss, junction and line segmentation, length and degree regression. Compared with TP-LSD-Lite, previously the best real-time LSD method, our model (M-LSD-tiny) achieves competitive performance with 2.5% of model size and an increase of 130.5% in inference speed on GPU when evaluated with Wireframe and YorkUrban datasets. Furthermore, our model runs at 56.8 FPS and 48.6 FPS on Android and iPhone mobile devices, respectively. To the best of our knowledge, this is the first real-time deep LSD method available on mobile devices.

3. PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World

Rowan Zellers, Ari Holtzman, Matthew Peters, Roozbeh Mottaghi, Aniruddha Kembhavi, Ali Farhadi, Yejin Choi

  • retweets: 884, favorites: 141 (06/03/2021 09:56:29)
  • links: abs | pdf
  • cs.CL | cs.AI

We propose PIGLeT: a model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a physical dynamics model, and a separate language model. Our dynamics model learns not just what objects are but also what they do: glass cups break when thrown, plastic ones don’t. We then use it as the interface to our language model, giving us a unified model of linguistic form and grounded meaning. PIGLeT can read a sentence, simulate neurally what might happen next, and then communicate that result through a literal symbolic representation, or natural language. Experimental results show that our model effectively learns world dynamics, along with how to communicate them. It is able to correctly forecast “what happens next” given an English sentence over 80% of the time, outperforming a 100x larger, text-to-text approach by over 10%. Likewise, its natural language summaries of physical interactions are also judged by humans as more accurate than LM alternatives. We present comprehensive analysis showing room for future work.

4. You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection

Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu

Can Transformer perform 2D2\mathrm{D} object-level recognition from a pure sequence-to-sequence perspective with minimal knowledge about the 2D2\mathrm{D} spatial structure? To answer this question, we present You Only Look at One Sequence (YOLOS), a series of object detection models based on the na”ive Vision Transformer with the fewest possible modifications as well as inductive biases. We find that YOLOS pre-trained on the mid-sized ImageNet-1k1k dataset only can already achieve competitive object detection performance on COCO, \textit{e.g.}, YOLOS-Base directly adopted from BERT-Base can achieve 42.042.0 box AP. We also discuss the impacts as well as limitations of current pre-train schemes and model scaling strategies for Transformer in vision through object detection. Code and model weights are available at \url{https://github.com/hustvl/YOLOS}.

5. Language-Driven Image Style Transfer

Tsu-Jui Fu, Xin Eric Wang, William Yang Wang

  • retweets: 650, favorites: 117 (06/03/2021 09:56:30)
  • links: abs | pdf
  • cs.CV

Despite having promising results, style transfer, which requires preparing style images in advance, may result in lack of creativity and accessibility. Following human instruction, on the other hand, is the most natural way to perform artistic style transfer that can significantly improve controllability for visual effect applications. We introduce a new task — language-driven image style transfer (\texttt{LDIST}) — to manipulate the style of a content image, guided by a text. We propose contrastive language visual artist (CLVA) that learns to extract visual semantics from style instructions and accomplish \texttt{LDIST} by the patch-wise style discriminator. The discriminator considers the correlation between language and patches of style images or transferred results to jointly embed style instructions. CLVA further compares contrastive pairs of content image and style instruction to improve the mutual relativeness between transfer results. The transferred results from the same content image can preserve consistent content structures. Besides, they should present analogous style patterns from style instructions that contain similar visual semantics. The experiments show that our CLVA is effective and achieves superb transferred results on \texttt{LDIST}.

6. Omnizart: A General Toolbox for Automatic Music Transcription

Yu-Te Wu, Yin-Jyun Luo, Tsung-Ping Chen, I-Chieh Wei, Jui-Yang Hsu, Yi-Chin Chuang, Li Su

We present and release Omnizart, a new Python library that provides a streamlined solution to automatic music transcription (AMT). Omnizart encompasses modules that construct the life-cycle of deep learning-based AMT, and is designed for ease of use with a compact command-line interface. To the best of our knowledge, Omnizart is the first transcription toolkit which offers models covering a wide class of instruments ranging from solo, instrument ensembles, percussion instruments to vocal, as well as models for chord recognition and beat/downbeat tracking, two music information retrieval (MIR) tasks highly related to AMT.

7. Bootstrap Your Own Correspondences

Mohamed El Banani, Justin Johnson

  • retweets: 380, favorites: 64 (06/03/2021 09:56:30)
  • links: abs | pdf
  • cs.CV

Geometric feature extraction is a crucial component of point cloud registration pipelines. Recent work has demonstrated how supervised learning can be leveraged to learn better and more compact 3D features. However, those approaches’ reliance on ground-truth annotation limits their scalability. We propose BYOC: a self-supervised approach that learns visual and geometric features from RGB-D video without relying on ground-truth pose or correspondence. Our key observation is that randomly-initialized CNNs readily provide us with good correspondences; allowing us to bootstrap the learning of both visual and geometric features. Our approach combines classic ideas from point cloud registration with more recent representation learning approaches. We evaluate our approach on indoor scene datasets and find that our method outperforms traditional and learned descriptors, while being competitive with current state-of-the-art supervised approaches.

8. KVT: k-NN Attention for Boosting Vision Transformers

Pichao Wang, Xue Wang, Fan Wang, Ming Lin, Shuning Chang, Wen Xie, Hao Li, Rong Jin

  • retweets: 342, favorites: 51 (06/03/2021 09:56:30)
  • links: abs | pdf
  • cs.CV

Convolutional Neural Networks (CNNs) have dominated computer vision for years, due to its ability in capturing locality and translation invariance. Recently, many vision transformer architectures have been proposed and they show promising performance. A key component in vision transformers is the fully-connected self-attention which is more powerful than CNNs in modelling long range dependencies. However, since the current dense self-attention uses all image patches (tokens) to compute attention matrix, it may neglect locality of images patches and involve noisy tokens (e.g., clutter background and occlusion), leading to a slow training process and potentially degradation of performance. To address these problems, we propose a sparse attention scheme, dubbed k-NN attention, for boosting vision transformers. Specifically, instead of involving all the tokens for attention matrix calculation, we only select the top-k similar tokens from the keys for each query to compute the attention map. The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations, as nearby tokens tend to be more similar than others. In addition, the k-NN attention allows for the exploration of long range correlation and at the same time filter out irrelevant tokens by choosing the most similar tokens from the entire image. Despite its simplicity, we verify, both theoretically and empirically, that kk-NN attention is powerful in distilling noise from input tokens and in speeding up training. Extensive experiments are conducted by using ten different vision transformer architectures to verify that the proposed k-NN attention can work with any existing transformer architectures to improve its prediction performance.

9. Reward is enough for convex MDPs

Tom Zahavy, Brendan O’Donoghue, Guillaume Desjardins, Satinder Singh

Maximising a cumulative reward function that is Markov and stationary, i.e., defined over state-action pairs and independent of time, is sufficient to capture many kinds of goals in a Markov Decision Process (MDP) based on the Reinforcement Learning (RL) problem formulation. However, not all goals can be captured in this manner. Specifically, it is easy to see that Convex MDPs in which goals are expressed as convex functions of stationary distributions cannot, in general, be formulated in this manner. In this paper, we reformulate the convex MDP problem as a min-max game between the policy and cost (negative reward) players using Fenchel duality and propose a meta-algorithm for solving it. We show that the average of the policies produced by an RL agent that maximizes the non-stationary reward produced by the cost player converges to an optimal solution to the convex MDP. Finally, we show that the meta-algorithm unifies several disparate branches of reinforcement learning algorithms in the literature, such as apprenticeship learning, variational intrinsic control, constrained MDPs, and pure exploration into a single framework.

10. Language Model Evaluation Beyond Perplexity

Clara Meister, Ryan Cotterell

  • retweets: 132, favorites: 56 (06/03/2021 09:56:31)
  • links: abs | pdf
  • cs.CL

We propose an alternate approach to quantifying how well language models learn natural language: we ask how well they match the statistical tendencies of natural language. To answer this question, we analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained. We provide a framework—paired with significance tests—for evaluating the fit of language models to these trends. We find that neural language models appear to learn only a subset of the tendencies considered, but align much more closely with empirical trends than proposed theoretical distributions (when present). Further, the fit to different distributions is highly-dependent on both model architecture and generation strategy. As concrete examples, text generated under the nucleus sampling scheme adheres more closely to the type—token relationship of natural language than text produced using standard ancestral sampling; text from LSTMs reflects the natural language distributions over length, stopwords, and symbols surprisingly well.

11. Gaussian Processes with Differential Privacy

Antti Honkela

Gaussian processes (GPs) are non-parametric Bayesian models that are widely used for diverse prediction tasks. Previous work in adding strong privacy protection to GPs via differential privacy (DP) has been limited to protecting only the privacy of the prediction targets (model outputs) but not inputs. We break this limitation by introducing GPs with DP protection for both model inputs and outputs. We achieve this by using sparse GP methodology and publishing a private variational approximation on known inducing points. The approximation covariance is adjusted to approximately account for the added uncertainty from DP noise. The approximation can be used to compute arbitrary predictions using standard sparse GP techniques. We propose a method for hyperparameter learning using a private selection protocol applied to validation set log-likelihood. Our experiments demonstrate that given sufficient amount of data, the method can produce accurate models under strong privacy protection.

12. Multi-modal Point-of-Care Diagnostics for COVID-19 Based On Acoustics and Symptoms

Srikanth Raj Chetupalli, Prashant Krishnan, Neeraj Sharma, Ananya Muguli, Rohit Kumar, Viral Nanda, Lancelot Mark Pinto, Prasanta Kumar Ghosh, Sriram Ganapathy

The research direction of identifying acoustic bio-markers of respiratory diseases has received renewed interest following the onset of COVID-19 pandemic. In this paper, we design an approach to COVID-19 diagnostic using crowd-sourced multi-modal data. The data resource, consisting of acoustic signals like cough, breathing, and speech signals, along with the data of symptoms, are recorded using a web-application over a period of ten months. We investigate the use of statistical descriptors of simple time-frequency features for acoustic signals and binary features for the presence of symptoms. Unlike previous works, we primarily focus on the application of simple linear classifiers like logistic regression and support vector machines for acoustic data while decision tree models are employed on the symptoms data. We show that a multi-modal integration of acoustics and symptoms classifiers achieves an area-under-curve (AUC) of 92.40, a significant improvement over any individual modality. Several ablation experiments are also provided which highlight the acoustic and symptom dimensions that are important for the task of COVID-19 diagnostics.

13. What Matters for Adversarial Imitation Learning?

Manu Orsini, Anton Raichuk, Léonard Hussenot, Damien Vincent, Robert Dadashi, Sertan Girgin, Matthieu Geist, Olivier Bachem, Olivier Pietquin, Marcin Andrychowicz

Adversarial imitation learning has become a popular framework for imitation in continuous control. Over the years, several variations of its components were proposed to enhance the performance of the learned policies as well as the sample complexity of the algorithm. In practice, these choices are rarely tested all together in rigorous empirical studies. It is therefore difficult to discuss and understand what choices, among the high-level algorithmic options as well as low-level implementation details, matter. To tackle this issue, we implement more than 50 of these choices in a generic adversarial imitation learning framework and investigate their impacts in a large-scale study (>500k trained agents) with both synthetic and human-generated demonstrations. While many of our findings confirm common practices, some of them are surprising or even contradict prior work. In particular, our results suggest that artificial demonstrations are not a good proxy for human data and that the very common practice of evaluating imitation algorithms only with synthetic demonstrations may lead to algorithms which perform poorly in the more realistic scenarios with human demonstrations.

14. Privacy and Confidentiality in Process Mining — Threats and Research Challenges

Gamal Elkoumy, Stephan A. Fahrenkrog-Petersen, Mohammadreza Fani Sani, Agnes Koschmider, Felix Mannhardt, Saskia Nuñez von Voigt, Majid Rafiei, Leopold von Waldthausen

  • retweets: 42, favorites: 16 (06/03/2021 09:56:31)
  • links: abs | pdf
  • cs.CR | cs.DB

Privacy and confidentiality are very important prerequisites for applying process mining in order to comply with regulations and keep company secrets. This paper provides a foundation for future research on privacy-preserving and confidential process mining techniques. Main threats are identified and related to an motivation application scenario in a hospital context as well as to the current body of work on privacy and confidentiality in process mining. A newly developed conceptual model structures the discussion that existing techniques leave room for improvement. This results in a number of important research challenges that should be addressed by future process mining research.

15. Construction of Simplicial Complexes with Prescribed Degree-Size Sequences

Tzu-Chi Yen

We study the realizability of simplicial complexes with a given pair of integer sequences, representing the node degree distribution and facet size distribution, respectively. While the ss-uniform variant of the problem is NP\mathsf{NP}-complete when s3s \geq 3, we identify two populations of input sequences, most of which can be solved in polynomial time using a recursive algorithm that we contribute. Combining with a sampler for the simplicial configuration model [Young et al.\textit{et al.}, Phys. Rev. E 96\textbf{96}, 032312 (2017)], we facilitate efficient sampling of simplicial ensembles from arbitrary degree and size distributions. We find that, contrary to expectations based on dyadic networks, increasing nodes’ degrees reduces the number of loops in simplicial complexes. Our work unveils a fundamental constraint on the degree-size sequences and sheds light on further analysis of higher-order phenomena based on local structures.