Hot Papers 2021-06-11

1. Flow-based sampling for fermionic lattice field theories

Michael S. Albergo, Gurtej Kanwar, Sébastien Racanière, Danilo J. Rezende, Julian M. Urban, Denis Boyda, Kyle Cranmer, Daniel C. Hackett, Phiala E. Shanahan

retweets: 1508, favorites: 243 (06/12/2021 14:49:14)
links: abs | pdf
hep-lat | cond-mat.stat-mech | cs.LG

Algorithms based on normalizing flows are emerging as promising machine learning approaches to sampling complicated probability distributions in a way that can be made asymptotically exact. In the context of lattice field theory, proof-of-principle studies have demonstrated the effectiveness of this approach for scalar theories, gauge theories, and statistical systems. This work develops approaches that enable flow-based sampling of theories with dynamical fermions, which is necessary for the technique to be applied to lattice field theory studies of the Standard Model of particle physics and many condensed matter systems. As a practical demonstration, these methods are applied to the sampling of field configurations for a two-dimensional theory of massless staggered fermions coupled to a scalar field via a Yukawa interaction.

*New paper* with the @MIT crew, @DaniloJRezende @sracaniere @DeepMind , @KyleCranmer and Julian Urban! We construct normalizing flows that are compatible with sampling path integrals of quantum field theories involving fermions. Fermions make this tricky! https://t.co/GpospbH8Du pic.twitter.com/ZspdXTaPY4
— Michael Albergo (@msalbergo) June 11, 2021

The saga continues!
Announcing our latest work on "Flow-based sampling for fermionic lattice field theories" together with an awesome team from MIT @iaifi_news, @DeepMind, & @NYUPhysics @CILVRatNYU @NYUDataScience https://t.co/dEu4OsClPh @msalbergo @sracaniere @DaniloJRezende pic.twitter.com/GLB4HZSpfd
— Kyle Cranmer (@KyleCranmer) June 11, 2021

2. MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training

Mingliang Zeng, Xu Tan, Rui Wang, Zeqian Ju, Tao Qin, Tie-Yan Liu

retweets: 1560, favorites: 183 (06/12/2021 14:49:14)
links: abs | pdf
cs.SD | cs.CL | cs.IR | cs.MM | eess.AS

Symbolic music understanding, which refers to the understanding of music from the symbolic data (e.g., MIDI format, but not audio), covers many music applications such as genre classification, emotion classification, and music pieces matching. While good music representations are beneficial for these applications, the lack of training data hinders representation learning. Inspired by the success of pre-training models in natural language processing, in this paper, we develop MusicBERT, a large-scale pre-trained model for music understanding. To this end, we construct a large-scale symbolic music corpus that contains more than 1 million music songs. Since symbolic music contains more structural (e.g., bar, position) and diverse information (e.g., tempo, instrument, and pitch), simply adopting the pre-training techniques from NLP to symbolic music only brings marginal gains. Therefore, we design several mechanisms, including OctupleMIDI encoding and bar-level masking strategy, to enhance pre-training with symbolic music data. Experiments demonstrate the advantages of MusicBERT on four music understanding tasks, including melody completion, accompaniment suggestion, genre classification, and style classification. Ablation studies also verify the effectiveness of our designs of OctupleMIDI encoding and bar-level masking strategy in MusicBERT.

MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training
pdf: https://t.co/nvO4RUj540
abs: https://t.co/vjEeRlX9Ud

a largescale pre-trained model for symbolic music understanding, sota performance on four evaluated symbolic music understanding tasks pic.twitter.com/cF7I2vwjx0
— AK (@ak92501) June 11, 2021

3. Does Knowledge Distillation Really Work?

Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, Andrew Gordon Wilson

retweets: 1023, favorites: 144 (06/12/2021 14:49:15)
links: abs | pdf
cs.LG | stat.ML

Knowledge distillation is a popular technique for training a small student network to emulate a larger teacher model, such as an ensemble of networks. We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood: there often remains a surprisingly large discrepancy between the predictive distributions of the teacher and the student, even in cases when the student has the capacity to perfectly match the teacher. We identify difficulties in optimization as a key reason for why the student is unable to match the teacher. We also show how the details of the dataset used for distillation play a role in how closely the student matches the teacher — and that more closely matching the teacher paradoxically does not always lead to better student generalization.
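To make the "discrepancy between the predictive distributions" concrete, here is a minimal sketch of two ways one might measure how closely a student matches a teacher: the KL divergence between their softmax outputs, and the fraction of inputs on which their top-1 predictions agree. The function names, the temperature parameter, and the toy logits are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - log_norm for v in scaled]


def kl_teacher_student(teacher_logits: Sequence[float],
                       student_logits: Sequence[float],
                       temperature: float = 1.0) -> float:
    """KL(teacher || student) for one input; zero only if the distributions match."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


def top1_agreement(teacher_batch: Sequence[Sequence[float]],
                   student_batch: Sequence[Sequence[float]]) -> float:
    """Fraction of inputs where teacher and student pick the same class."""
    matches = sum(argmax(t) == argmax(s) for t, s in zip(teacher_batch, student_batch))
    return matches / len(teacher_batch)


# Hypothetical logits for two inputs and three classes.
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student = [[1.0, 1.2, -0.5], [0.0, 2.0, 0.1]]

mean_kl = sum(kl_teacher_student(t, s) for t, s in zip(teacher, student)) / len(teacher)
agreement = top1_agreement(teacher, student)
```

A student can score well on held-out accuracy while both of these numbers remain far from a perfect match, which is the gap the abstract describes.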

Does knowledge distillation really work?
While distillation can improve student generalization, we show it is extremely difficult to achieve good agreement between student and teacher. https://t.co/VpK6Xy2q3S
With @samscub, @Pavel_Izmailov, @polkirichenko, Alex Alemi. 1/10 pic.twitter.com/SuX1uuvukG
— Andrew Gordon Wilson (@andrewgwils) June 11, 2021

4. Plan2Scene: Converting Floorplans to 3D Scenes

Madhawa Vidanapathirana, Qirui Wu, Yasutaka Furukawa, Angel X. Chang, Manolis Savva

retweets: 901, favorites: 132 (06/12/2021 14:49:15)
links: abs | pdf
cs.CV | cs.GR

We address the task of converting a floorplan and a set of associated photos of a residence into a textured 3D mesh model, a task which we call Plan2Scene. Our system 1) lifts a floorplan image to a 3D mesh model; 2) synthesizes surface textures based on the input photos; and 3) infers textures for unobserved surfaces using a graph neural network architecture. To train and evaluate our system we create indoor surface texture datasets, and augment a dataset of floorplans and photos from prior work with rectified surface crops and additional annotations. Our approach handles the challenge of producing tileable textures for dominant surfaces such as floors, walls, and ceilings from a sparse set of unaligned photos that only partially cover the residence. Qualitative and quantitative evaluations show that our system produces realistic 3D interior models, outperforming baseline approaches on a suite of texture quality metrics and as measured by a holistic user study.

Plan2Scene: Converting Floorplans to 3D Scenes
pdf: https://t.co/YSLR6wQwny
abs: https://t.co/6y3YkAgajn
project page: https://t.co/2xgqThbDGv
github: https://t.co/47hsfDsbSu pic.twitter.com/88ISTODFab
— AK (@ak92501) June 11, 2021

5. The Medical Segmentation Decathlon

Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Annette Kopp-Schneider, Bennett A. Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M. Summers, Bram van Ginneken, Michel Bilello, Patrick Bilic, Patrick F. Christ, Richard K. G. Do, Marc J. Gollub, Stephan H. Heckers, Henkjan Huisman, William R. Jarnagin, Maureen K. McHugo, Sandy Napel, Jennifer S. Goli Pernicka, Kawal Rhode, Catalina Tobon-Gomez, Eugene Vorontsov, Henkjan Huisman, James A. Meakin, Sebastien Ourselin, Manuel Wiesenfarth, Pablo Arbelaez, Byeonguk Bae, Sihong Chen, Laura Daza, Jianjiang Feng, Baochun He, Fabian Isensee, Yuanfeng Ji, Fucang Jia, Namkug Kim, Ildoo Kim, Dorit Merhof, Akshay Pai, Beomhee Park, Mathias Perslev, Ramin Rezaiifar, Oliver Rippel, Ignacio Sarasua, Wei Shen, Jaemin Son, Christian Wachinger

retweets: 777, favorites: 113 (06/12/2021 14:49:15)
links: abs | pdf
eess.IV | cs.CV | cs.LG

International challenges have become the de facto standard for comparative assessment of image analysis algorithms given a specific task. Segmentation is so far the most widely investigated medical image processing task, but the various segmentation challenges have typically been organized in isolation, such that algorithm development was driven by the need to tackle a single specific clinical problem. We hypothesized that a method capable of performing well on multiple tasks will generalize well to a previously unseen task and potentially outperform a custom-designed solution. To investigate the hypothesis, we organized the Medical Segmentation Decathlon (MSD) - a biomedical image analysis challenge, in which algorithms compete in a multitude of both tasks and modalities. The underlying data set was designed to explore the axis of difficulties typically encountered when dealing with medical images, such as small data sets, unbalanced labels, multi-site data and small objects. The MSD challenge confirmed that algorithms with a consistent good performance on a set of tasks preserved their good average performance on a different set of previously unseen tasks. Moreover, by monitoring the MSD winner for two years, we found that this algorithm continued generalizing well to a wide range of other clinical problems, further confirming our hypothesis. Three main conclusions can be drawn from this study: (1) state-of-the-art image segmentation algorithms are mature, accurate, and generalize well when retrained on unseen tasks; (2) consistent algorithmic performance across multiple tasks is a strong surrogate of algorithmic generalizability; (3) the training of accurate AI segmentation models is now commoditized to non AI experts.

After two years of work, the preprint of largest ever medical image segmentation challenge - the Medical Segmentation Decathlon - is out: https://t.co/OvMrVxB1FV. It's incredible how far we've come in making image segmentation a commodity tool. More here https://t.co/K4wB93vKdO
— Jorge Cardoso (@mjorgecardoso) June 11, 2021

6. Learning to See by Looking at Noise

Manel Baradad, Jonas Wulff, Tongzhou Wang, Phillip Isola, Antonio Torralba

retweets: 704, favorites: 148 (06/12/2021 14:49:16)
links: abs | pdf
cs.CV | cs.AI

Current vision systems are trained on huge datasets, and these datasets come with costs: curation is expensive, they inherit human biases, and there are concerns over privacy and usage rights. To counter these costs, interest has surged in learning from cheaper data sources, such as unlabeled images. In this paper we go a step further and ask if we can do away with real image datasets entirely, instead learning from noise processes. We investigate a suite of image generation models that produce images from simple random processes. These are then used as training data for a visual representation learner with a contrastive loss. We study two types of noise processes, statistical image models and deep generative models under different random initializations. Our findings show that it is important for the noise to capture certain structural properties of real data but that good performance can be achieved even with processes that are far from realistic. We also find that diversity is a key property to learn good representations. Datasets, models, and code are available at https://mbaradad.github.io/learning_with_noise.

Learning to See by Looking at Noise
pdf: https://t.co/lzOFWW7BTF
abs: https://t.co/plGgeEy6Kl
project page: https://t.co/wpBQTh7nho

important for noise to capture certain structural properties of real data, good performance can be achieved with processes far from realistic pic.twitter.com/n9fRaoJAPl
— AK (@ak92501) June 11, 2021

I am simply thrilled that my work is cited in a paper written by Prof. Antonio Torralba, a computer vision researcher I have long admired, and his research group at MIT!

Proj.: https://t.co/aLstKyo5Wc
arXiv: https://t.co/3lD7kZKm9c pic.twitter.com/ideaVswDxV
— Hirokatsu Kataoka | 片岡裕雄 (@HirokatuKataoka) June 11, 2021

7. Input Augmentation Improves Constrained Beam Search for Neural Machine Translation: NTT at WAT 2021

Katsuki Chousa, Makoto Morishita

retweets: 406, favorites: 121 (06/12/2021 14:49:16)
links: abs | pdf
cs.CL

This paper describes our systems that were submitted to the restricted translation task at WAT 2021. In this task, the systems are required to output translated sentences that contain all given word constraints. Our system combined input augmentation and constrained beam search algorithms. Through experiments, we found that this combination significantly improves translation accuracy and can save inference time while containing all the constraints in the output. For both En->Ja and Ja->En, our systems obtained the best evaluation performances in automatic evaluation.

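We have published a new paper on arXiv!
It is about building a machine translation system whose output is guaranteed to contain specified words and phrases.
Until now, controlling the output has often come at the cost of translation accuracy, but this system improves translation accuracy while always satisfying the constraints on the output. https://t.co/ifJnBw6C8q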
— Makoto Morishita (@MorinoseiMorizo) June 11, 2021

8. Adaptive machine learning for protein engineering

Brian L. Hie, Kevin K. Yang

retweets: 324, favorites: 71 (06/12/2021 14:49:16)
links: abs | pdf
q-bio.QM | cs.LG | q-bio.BM

Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.

@BrianHie and I wrote a review on adaptive machine learning for protein engineering.

Basically: you've collected some data and trained a model. How do you decide what sequences you want to measure next? https://t.co/whC5yImrDL pic.twitter.com/WPSRKGHhvF
— Kevin Yang 楊凱筌 (@KevinKaichuang) June 11, 2021

9. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

Mandela Patrick, Dylan Campbell, Yuki M. Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, João F. Henriques

retweets: 300, favorites: 66 (06/12/2021 14:49:16)
links: abs | pdf
cs.CV

In video transformers, the time dimension is often treated in the same way as the two spatial dimensions. However, in a scene where objects or the camera may move, a physical point imaged at one location in frame $t$ may be entirely unrelated to what is found at that location in frame $t+k$. These temporal correspondences should be modeled to facilitate learning about dynamic scenes. To this end, we propose a new drop-in block for video transformers, trajectory attention, that aggregates information along implicitly determined motion paths. We additionally propose a new method to address the quadratic dependence of computation and memory on the input size, which is particularly important for high resolution or long videos. While these ideas are useful in a range of settings, we apply them to the specific task of video action recognition with a transformer model and obtain state-of-the-art results on the Kinetics, Something-Something V2, and Epic-Kitchens datasets. Code and models are available at: https://github.com/facebookresearch/Motionformer

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
pdf: https://t.co/Ft1MX7c9Xy
abs: https://t.co/uLCLgctWuN
project page: https://t.co/x0T8daHIav
github: https://t.co/WiAtrvazL8

drop-in block for video transformers pic.twitter.com/vjVmNhmrCe
— AK (@ak92501) June 11, 2021

10. Pivotal Tuning for Latent-based Editing of Real Images

Daniel Roich, Ron Mokady, Amit H. Bermano, Daniel Cohen-Or

retweets: 240, favorites: 110 (06/12/2021 14:49:16)
links: abs | pdf
cs.CV

Recently, a surge of advanced facial editing techniques have been proposed that leverage the generative power of a pre-trained StyleGAN. To successfully edit an image this way, one must first project (or invert) the image into the pre-trained generator’s domain. As it turns out, however, StyleGAN’s latent space induces an inherent tradeoff between distortion and editability, i.e. between maintaining the original appearance and convincingly altering some of its attributes. Practically, this means it is still challenging to apply ID-preserving facial latent-space editing to faces which are out of the generator’s domain. In this paper, we present an approach to bridge this gap. Our technique slightly alters the generator, so that an out-of-domain image is faithfully mapped into an in-domain latent code. The key idea is pivotal tuning - a brief training process that preserves the editing quality of an in-domain latent region, while changing its portrayed identity and appearance. In Pivotal Tuning Inversion (PTI), an initial inverted latent code serves as a pivot, around which the generator is fine-tuned. At the same time, a regularization term keeps nearby identities intact, to locally contain the effect. This surgical training process ends up altering appearance features that represent mostly identity, without affecting editing capabilities. We validate our technique through inversion and editing metrics, and show preferable scores to state-of-the-art methods. We further qualitatively demonstrate our technique by applying advanced edits (such as pose, age, or expression) to numerous images of well-known and recognizable identities. Finally, we demonstrate resilience to harder cases, including heavy make-up, elaborate hairstyles and/or headwear, which otherwise could not have been successfully inverted and edited by state-of-the-art methods.

Pivotal Tuning for Latent-based Editing of Real Images
pdf: https://t.co/QMEmtCHRYE
abs: https://t.co/4GV5NiDjqH

a brief training process that preserves the editing quality of an in-domain latent region, while changing its portrayed identity and appearance pic.twitter.com/hpBWqPRxZ3
— AK (@ak92501) June 11, 2021

11. Implicit-PDF: Non-Parametric Representation of Probability Distributions on the Rotation Manifold

Kieran Murphy, Carlos Esteves, Varun Jampani, Srikumar Ramalingam, Ameesh Makadia

retweets: 227, favorites: 79 (06/12/2021 14:49:17)
links: abs | pdf
cs.CV

Single image pose estimation is a fundamental problem in many vision and robotics tasks, and existing deep learning approaches suffer by not completely modeling and handling: i) uncertainty about the predictions, and ii) symmetric objects with multiple (sometimes infinite) correct poses. To this end, we introduce a method to estimate arbitrary, non-parametric distributions on SO(3). Our key idea is to represent the distributions implicitly, with a neural network that estimates the probability given the input image and a candidate pose. Grid sampling or gradient ascent can be used to find the most likely pose, but it is also possible to evaluate the probability at any pose, enabling reasoning about symmetries and uncertainty. This is the most general way of representing distributions on manifolds, and to showcase the rich expressive power, we introduce a dataset of challenging symmetric and nearly-symmetric objects. We require no supervision on pose uncertainty — the model trains only with a single pose per example. Nonetheless, our implicit model is highly expressive to handle complex distributions over 3D poses, while still obtaining accurate pose estimation on standard non-ambiguous environments, achieving state-of-the-art performance on Pascal3D+ and ModelNet10-SO(3) benchmarks.

Implicit-PDF: Non-Parametric Representation of Probability Distributions on the Rotation Manifold
pdf: https://t.co/UPtJx57Bsu
abs: https://t.co/bjCCq7TiGB
project page: https://t.co/cWRmWOXoZE
predict arbitrary, non-parametric probability distributions over the rotation manifold pic.twitter.com/MA3F80DWBY
— AK (@ak92501) June 11, 2021

12. Transformed CNNs: recasting pre-trained convolutional layers with self-attention

Stéphane d’Ascoli, Levent Sagun, Giulio Biroli, Ari Morcos

retweets: 224, favorites: 65 (06/12/2021 14:49:17)
links: abs | pdf
cs.LG

Vision Transformers (ViT) have recently emerged as a powerful alternative to convolutional networks (CNNs). Although hybrid models attempt to bridge the gap between these two architectures, the self-attention layers they rely on induce a strong computational bottleneck, especially at large spatial resolutions. In this work, we explore the idea of reducing the time spent training these layers by initializing them as convolutional layers. This enables us to transition smoothly from any pre-trained CNN to its functionally identical hybrid model, called Transformed CNN (T-CNN). With only 50 epochs of fine-tuning, the resulting T-CNNs demonstrate significant performance gains over the CNN (+2.2% top-1 on ImageNet-1k for a ResNet50-RS) as well as substantially improved robustness (+11% top-1 on ImageNet-C). We analyze the representations learnt by the T-CNN, providing deeper insights into the fruitful interplay between convolutions and self-attention. Finally, we experiment initializing the T-CNN from a partially trained CNN, and find that it reaches better performance than the corresponding hybrid model trained from scratch, while reducing training time.

Transformed CNNs: recasting pre-trained convolutional layers with self-attention
pdf: https://t.co/NNk9Ngp90o
abs: https://t.co/RM4nyVA3N6

+2.2% top-1 on ImageNet-1k for a ResNet50-RS as well as substantially improved robustness +11% top-1 on ImageNet-C pic.twitter.com/0z6GRrgjC7
— AK (@ak92501) June 11, 2021

13. MST: Masked Self-Supervised Transformer for Visual Representation

Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang

retweets: 160, favorites: 56 (06/12/2021 14:49:17)
links: abs | pdf
cs.CV

Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and achieved great success. However, it has not been fully explored in visual self-supervised learning. Meanwhile, previous methods only consider the high-level feature and learning representation from a global perspective, which may fail to transfer to the downstream dense prediction tasks focusing on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image while preserving the global semantic information. Specifically, inspired by the Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure for self-supervised learning. More importantly, the masked tokens together with the remaining tokens are further recovered by a global image decoder, which preserves the spatial information of the image and is more friendly to the downstream dense prediction tasks. The experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method. For instance, MST achieves Top-1 accuracy of 76.9% with DeiT-S only using 300-epoch pre-training by linear evaluation, which outperforms supervised methods with the same epoch by 0.4% and its comparable variant DINO by 1.0%. For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO object detection and 74.04% mIoU on Cityscapes segmentation only with 100-epoch pre-training.

MST: Masked Self-Supervised Transformer for Visual Representation
pdf: https://t.co/pW79tyCylN
abs: https://t.co/SyA4g48w8N

Top-1 accuracy of 76.9% with DeiT-S only using 300-epoch pre-training by linear evaluation pic.twitter.com/5oGuVRYMdo
— AK (@ak92501) June 11, 2021

14. Programming Puzzles

Tal Schuster, Ashwin Kalyan, Oleksandr Polozov, Adam Tauman Kalai

retweets: 88, favorites: 83 (06/12/2021 14:49:17)
links: abs | pdf
cs.LG | cs.AI | cs.CL | cs.PL | cs.SE

We introduce a new type of programming challenge called programming puzzles, as an objective and comprehensive evaluation of program synthesis, and release an open-source dataset of Python Programming Puzzles (P3). Each puzzle is defined by a short Python program $f$, and the goal is to find an input $x$ which makes $f$ output “True”. The puzzles are objective in that each one is specified entirely by the source code of its verifier $f$, so evaluating $f(x)$ is all that is needed to test a candidate solution $x$. They do not require an answer key or input/output examples, nor do they depend on natural language understanding. The dataset is comprehensive in that it spans problems of a range of difficulties and domains, ranging from trivial string manipulation problems that are immediately obvious to human programmers (but not necessarily to AI), to classic programming puzzles (e.g., Towers of Hanoi), to interview/competitive-programming problems (e.g., dynamic programming), to longstanding open problems in algorithms and mathematics (e.g., factoring). The objective nature of P3 readily supports self-supervised bootstrapping. We develop baseline enumerative program synthesis and GPT-3 solvers that are capable of solving easy puzzles, even without access to any reference solutions, by learning from their own past solutions. Based on a small user study, we find puzzle difficulty to correlate between human programmers and the baseline AI solvers.
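The puzzle format described in the abstract can be sketched in a few lines of Python. The two puzzles and the brute-force solver below are hypothetical illustrations written for this digest; they are not taken from the P3 dataset, and the solver is far simpler than the baselines the paper describes.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def puzzle_string(x: str) -> bool:
    """Hypothetical puzzle: find a string that completes the greeting."""
    return "Hello " + x == "Hello world"


def puzzle_square(x: int) -> bool:
    """Hypothetical puzzle: find an integer whose square is 1369."""
    return x * x == 1369


def solve_by_enumeration(f: Callable[[T], bool], candidates: Iterable[T]) -> Optional[T]:
    """Return the first candidate that the verifier f accepts, or None."""
    for x in candidates:
        if f(x):
            return x
    return None


# Checking a candidate needs only the verifier itself: no answer key,
# no input/output examples, no natural-language description.
string_ok = puzzle_string("world")
square_answer = solve_by_enumeration(puzzle_square, range(100))
```

Because the verifier is the whole specification, any solution a solver finds can be checked automatically and reused as training data, which is what makes the self-supervised bootstrapping mentioned in the abstract possible.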

Programming Puzzles

Develops baseline enumerative program synthesis and GPT-3 solvers that are capable of solving easy puzzles by learning from their own past solutions.

abs: https://t.co/HBkQeR2oqo
code: https://t.co/V1fzQhzH3B pic.twitter.com/vLSlWBue2W
— Aran Komatsuzaki (@arankomatsuzaki) June 11, 2021

Programming Puzzles
pdf: https://t.co/EKjQsGKmg0
abs: https://t.co/dowVUkbZ8Q
github: https://t.co/zWwR9T6zWG

dataset with puzzles described only in source code, develop baseline enumerative program synthesis and GPT-3 solvers that are capable of solving easy puzzles pic.twitter.com/ZRnVVem2D3
— AK (@ak92501) June 11, 2021

15. To The Point: Correspondence-driven monocular 3D category reconstruction

Filippos Kokkinos, Iasonas Kokkinos

retweets: 121, favorites: 30 (06/12/2021 14:49:17)
links: abs | pdf
cs.CV

We present To The Point (TTP), a method for reconstructing 3D objects from a single image using 2D to 3D correspondences learned from weak supervision. We recover a 3D shape from a 2D image by first regressing the 2D positions corresponding to the 3D template vertices and then jointly estimating a rigid camera transform and non-rigid template deformation that optimally explain the 2D positions through the 3D shape projection. By relying on 3D-2D correspondences we use a simple per-sample optimization problem to replace CNN-based regression of camera pose and non-rigid deformation and thereby obtain substantially more accurate 3D reconstructions. We treat this optimization as a differentiable layer and train the whole system in an end-to-end manner. We report systematic quantitative improvements on multiple categories and provide qualitative results comprising diverse shape, pose and texture prediction examples. Project website: https://fkokkinos.github.io/to_the_point/.

To The Point: Correspondence-driven monocular 3D category reconstruction
pdf: https://t.co/zkTDDfZxZa
abs: https://t.co/qLRnAmziJR
project page: https://t.co/tCmz6UynIq pic.twitter.com/JAanaf2vPl
— AK (@ak92501) June 11, 2021

16. Space-time Mixing Attention for Video Transformer

Adrian Bulat, Juan-Manuel Perez-Rua, Swathikiran Sudhakaran, Brais Martinez, Georgios Tzimiropoulos

retweets: 90, favorites: 53 (06/12/2021 14:49:17)
links: abs | pdf
cs.CV | cs.AI | cs.LG

This paper is on video recognition using Transformers. Very recent attempts in this area have demonstrated promising results in terms of recognition accuracy, yet they have been also shown to induce, in many cases, significant computational overheads due to the additional modelling of the temporal information. In this work, we propose a Video Transformer model the complexity of which scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model. To achieve this, our model makes two approximations to the full space-time attention used in Video Transformers: (a) It restricts time attention to a local temporal window and capitalizes on the Transformer’s depth to obtain full temporal coverage of the video sequence. (b) It uses efficient space-time mixing to attend jointly spatial and temporal locations without inducing any additional cost on top of a spatial-only attention model. We also show how to integrate 2 very lightweight mechanisms for global temporal-only attention which provide additional accuracy improvements at minimal computational cost. We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets while at the same time being significantly more efficient than other Video Transformer models. Code will be made available.
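Approximation (a) relies on depth to turn a local temporal window into full temporal coverage. The sketch below only tracks which frames can influence which after a given number of layers; it does not implement attention, and the window radius of one frame and the eight-frame clip are assumptions chosen for illustration rather than the paper's settings.

```python
from typing import List, Set


def temporal_reach(num_frames: int, radius: int, num_layers: int) -> List[Set[int]]:
    """For each frame t, return the set of input frames that can influence
    its representation after num_layers layers, when each layer lets frame t
    attend only to frames within [t - radius, t + radius]."""
    reach = [{t} for t in range(num_frames)]
    for _ in range(num_layers):
        updated = []
        for t in range(num_frames):
            lo = max(0, t - radius)
            hi = min(num_frames - 1, t + radius)
            merged: Set[int] = set()
            for u in range(lo, hi + 1):
                merged |= reach[u]  # inherit everything the neighbours already see
            updated.append(merged)
        reach = updated
    return reach


def has_full_coverage(num_frames: int, radius: int, num_layers: int) -> bool:
    """True if every frame can be influenced by every other frame."""
    return all(len(r) == num_frames
               for r in temporal_reach(num_frames, radius, num_layers))


# Hypothetical setting: an 8-frame clip, each layer looks one frame back and forward.
after_three_layers = temporal_reach(num_frames=8, radius=1, num_layers=3)[0]
covered_at_six = has_full_coverage(num_frames=8, radius=1, num_layers=6)
covered_at_seven = has_full_coverage(num_frames=8, radius=1, num_layers=7)
```

In this toy setting the first frame sees frames 0 to 3 after three layers, and every frame sees the whole clip once the depth times the radius reaches the clip length minus one, while the per-layer cost stays proportional to the window size rather than the number of frames.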

Space-time Mixing Attention for Video Transformer
pdf: https://t.co/GMc0Hq61AK
abs: https://t.co/NbN7dojB9R

a novel approximation to the full space-time attention that is amenable to an efficient implementation and applied it to video recognition pic.twitter.com/0xsehiwRdR
— AK (@ak92501) June 11, 2021

17. Temporal and Object Quantification Networks

Jiayuan Mao, Zhezheng Luo, Chuang Gan, Joshua B. Tenenbaum, Jiajun Wu, Leslie Pack Kaelbling, Tomer D. Ullman

retweets: 49, favorites: 37 (06/12/2021 14:49:18)
links: abs | pdf
cs.LG | cs.AI | stat.ML

We present Temporal and Object Quantification Networks (TOQ-Nets), a new class of neuro-symbolic networks with a structural bias that enables them to learn to recognize complex relational-temporal events. This is done by including reasoning layers that implement finite-domain quantification over objects and time. The structure allows them to generalize directly to input instances with varying numbers of objects in temporal sequences of varying lengths. We evaluate TOQ-Nets on input domains that require recognizing event-types in terms of complex temporal relational patterns. We demonstrate that TOQ-Nets can generalize from small amounts of data to scenarios containing more objects than were present during training and to temporal warpings of input sequences.

Temporal and Object Quantification Networks
pdf: https://t.co/toqX0Y8f7B
abs: https://t.co/wHS1PfLPtc
project page: https://t.co/FWTRXzbuBd

a new class of neuro-symbolic networks with a structural bias that enables them to learn to recognize complex relational-temporal events pic.twitter.com/6Th5b9Zrgg
— AK (@ak92501) June 11, 2021

18. FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information

Rami Aly, Zhijiang Guo, Michael Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, Arpit Mittal

retweets: 66, favorites: 18 (06/12/2021 14:49:18)
links: abs | pdf
cs.CL

Fact verification has attracted a lot of attention in the machine learning and natural language processing communities, as it is one of the key methods for detecting misinformation. Existing large-scale benchmarks for this task have focused mostly on textual sources, i.e. unstructured information, and thus ignored the wealth of information available in structured formats, such as tables. In this paper we introduce a novel dataset and benchmark, Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS), which consists of 87,026 verified claims. Each claim is annotated with evidence in the form of sentences and/or cells from tables in Wikipedia, as well as a label indicating whether this evidence supports, refutes, or does not provide enough information to reach a verdict. Furthermore, we detail our efforts to track and minimize the biases present in the dataset that could be exploited by models, e.g. being able to predict the label without using evidence. Finally, we develop a baseline for verifying claims against text and tables which predicts both the correct evidence and verdict for 18% of the claims.

The full dataset is now ready: https://t.co/eEbbJAXfa1 and here is the paper describing the annotation process and a baseline to get you started: https://t.co/Wc6NEukDIq Looking forward to seeing your submissions! @FEVERworkshop https://t.co/bSXkYqZnuG
— Andreas Vlachos (@vlachos_nlp) June 11, 2021

19. Fine-Grained System Identification of Nonlinear Neural Circuits

Dawna Bagherian, James Gornet, Jeremy Bernstein, Yu-Li Ni, Yisong Yue, Markus Meister

retweets: 42, favorites: 27 (06/12/2021 14:49:18)
links: abs | pdf
q-bio.QM | cs.LG

We study the problem of sparse nonlinear model recovery of high dimensional compositional functions. Our study is motivated by emerging opportunities in neuroscience to recover fine-grained models of biological neural circuits using collected measurement data. Guided by available domain knowledge in neuroscience, we explore conditions under which one can recover the underlying biological circuit that generated the training data. Our results suggest insights of both theoretical and practical interests. Most notably, we find that a sign constraint on the weights is a necessary condition for system recovery, which we establish both theoretically with an identifiability guarantee and empirically on simulated biological circuits. We conclude with a case study on retinal ganglion cell circuits using data collected from mouse retina, showcasing the practical potential of this approach.

There's now data available that, in principle, can enable fine-grained SysID in neuroscience, eg. identify the structure & weights of a biological circuit from inputs/outputs.

We show evidence that this is possible:
Paper: https://t.co/kKQQTGVGdS
Code: https://t.co/aRWnl5oya2 pic.twitter.com/HrxmZGQAKx
— Yisong Yue (@yisongyue) June 11, 2021

20. Score Matching Model for Unbounded Data Score

Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, Il-Chul Moon

retweets: 30, favorites: 29 (06/12/2021 14:49:18)
links: abs | pdf
cs.LG | cs.AI | stat.ML

Recent advance in score-based models incorporates the stochastic differential equation (SDE), which brings the state-of-the-art performance on image generation tasks. This paper improves such score-based models by analyzing the model at the zero perturbation noise. In real datasets, the score function diverges as the perturbation noise ($\sigma$) decreases to zero, and this observation leads to an argument that the score estimation fails at $\sigma=0$ with any neural network structure. Subsequently, we introduce Unbounded Noise Conditional Score Network (UNCSN) that resolves the score diverging problem with an easily applicable modification to any noise conditional score-based models. Additionally, we introduce a new type of SDE, so the exact log likelihood can be calculated from the newly suggested SDE. On top of that, the associated loss function mitigates the loss imbalance issue in a mini-batch, and we present a theoretic analysis on the proposed loss to uncover the underlying mechanism of the data distribution modeling by the score-based models.
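The divergence of the score as $\sigma \to 0$ can be seen in a toy case. The derivation below assumes the data distribution is a single point $x_0 \in \mathbb{R}^d$ perturbed by isotropic Gaussian noise; this is an illustrative assumption for this digest, not the argument given in the paper, which concerns real datasets.

```latex
% Toy assumption: the data distribution is a point mass at x_0 in R^d,
% perturbed by isotropic Gaussian noise of scale sigma.
\begin{align}
p_\sigma(x) &= \mathcal{N}\!\left(x;\, x_0,\, \sigma^2 I\right), \\
\nabla_x \log p_\sigma(x) &= -\frac{x - x_0}{\sigma^2}, \\
\mathbb{E}_{x \sim p_\sigma}\!\left\|\nabla_x \log p_\sigma(x)\right\|^2
  &= \frac{\mathbb{E}_{x \sim p_\sigma}\|x - x_0\|^2}{\sigma^4}
   = \frac{d\,\sigma^2}{\sigma^4}
   = \frac{d}{\sigma^2}
   \;\xrightarrow{\;\sigma \to 0\;}\; \infty .
\end{align}
```

In this toy case the typical magnitude of the score grows like $\sqrt{d}/\sigma$, so a network with bounded outputs cannot represent it at very small noise levels.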

Score Matching Model for Unbounded Data Score
pdf: https://t.co/BrtyEiWo6j
abs: https://t.co/xlmFdNkxPH
github: https://t.co/3nAOzIWbUG

sota performance among likelihood-based models on several benchmark datasets pic.twitter.com/oKzd2LPrtz
— AK (@ak92501) June 11, 2021

Published 12 Jun 2021

ML Lead at Beatrust (https://beatrust.com). Tatsuya Shirakawa on Twitter