1. GRAND: Graph Neural Diffusion
Benjamin Paul Chamberlain, James Rowbottom, Maria Gorinova, Stefan Webb, Emanuele Rossi, Michael M. Bronstein
We present Graph Neural Diffusion (GRAND), which approaches deep learning on graphs as a continuous diffusion process and treats Graph Neural Networks (GNNs) as discretisations of an underlying PDE. In our model, the layer structure and topology correspond to the discretisation choices of temporal and spatial operators. Our approach allows a principled development of a broad new class of GNNs that are able to address the common plights of graph learning models such as depth, oversmoothing, and bottlenecks. Key to the success of our models is stability with respect to perturbations in the data, and this is addressed for both implicit and explicit discretisation schemes. We develop linear and nonlinear versions of GRAND, which achieve competitive results on many standard graph benchmarks.
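To make the PDE view concrete, here is a minimal sketch (not the authors' code) of an explicit-Euler discretisation of graph diffusion, where each time step plays the role of a GNN layer; the fixed row-normalised adjacency stands in for the learned attention matrix used in the paper.

```python
# Sketch: explicit-Euler discretisation of the graph diffusion PDE dx/dt = (A - I) x.
# A is a row-normalised adjacency (a stand-in for learned attention); each step is a "layer".
import numpy as np

def explicit_euler_diffusion(x, adj, tau=0.1, steps=10):
    """x: (n, d) node features; adj: (n, n) non-negative edge weights."""
    A = adj / adj.sum(axis=1, keepdims=True)   # row-normalise
    for _ in range(steps):                     # depth = number of time steps
        x = x + tau * (A @ x - x)              # x_{t+1} = x_t + tau * (A - I) x_t
    return x

# toy example: a 3-node path graph with self-loops
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float) + np.eye(3)
x = np.random.randn(3, 4)
print(explicit_euler_diffusion(x, adj).shape)  # (3, 4)
```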
New paper from Twitter GraphML at #ICML2021 now on arxiv.
— Ben Chamberlain (@b_p_chamberlain) June 22, 2021
We develop links between Partial Differential Equations & GNNs -> new GNNs + new theory
Blog post: https://t.co/dNaibcliBR
Paper: https://t.co/wSyHNemQ9x
Code: https://t.co/xQ7wj92aAA
#MachineLearning https://t.co/cwYusZXaXs
#GNNs are related to PDEs governing information diffusion on graphs. In a new paper with @b_p_chamberlain James Rowbottom @migorinova @stefan_webb @emaros96 we study a new class of Neural Graph Diffusion PDEs
— Michael Bronstein (@mmbronstein) June 22, 2021
Blog post: https://t.co/sxVcS1pWmK
Paper: https://t.co/upMNI0EyW8 pic.twitter.com/SYNWeRjP4z
2. Regularization is all you Need: Simple Neural Nets can Excel on Tabular Data
Arlind Kadra, Marius Lindauer, Frank Hutter, Josif Grabocka
Tabular datasets are the last “unconquered castle” for deep learning, with traditional ML methods like Gradient-Boosted Decision Trees still performing strongly even against recent specialized neural architectures. In this paper, we hypothesize that the key to boosting the performance of neural networks lies in rethinking the joint and simultaneous application of a large set of modern regularization techniques. As a result, we propose regularizing plain Multilayer Perceptron (MLP) networks by searching for the optimal combination/cocktail of 13 regularization techniques for each dataset using a joint optimization over the decision on which regularizers to apply and their subsidiary hyperparameters. We empirically assess the impact of these regularization cocktails for MLPs on a large-scale empirical study comprising 40 tabular datasets and demonstrate that (i) well-regularized plain MLPs significantly outperform recent state-of-the-art specialized neural network architectures, and (ii) they even outperform strong traditional ML methods, such as XGBoost.
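A hedged sketch of the "cocktail" search idea follows: jointly sample which regularizers are switched on and their hyperparameters, then keep the configuration with the best validation score. The regularizer names and ranges are illustrative, not the paper's exact search space, and the paper uses a more sophisticated optimizer than the plain random search shown here.

```python
# Illustrative regularization-cocktail search: jointly decide which regularizers to
# apply and sample their hyperparameters, then evaluate each candidate configuration.
import random

COCKTAIL_SPACE = {                                   # names/ranges are illustrative
    "weight_decay":     lambda: 10 ** random.uniform(-6, -2),
    "dropout":          lambda: random.uniform(0.0, 0.5),
    "mixup_alpha":      lambda: random.uniform(0.1, 1.0),
    "label_smoothing":  lambda: random.uniform(0.0, 0.2),
    "stochastic_depth": lambda: random.uniform(0.0, 0.3),
}

def sample_cocktail():
    cfg = {}
    for name, sampler in COCKTAIL_SPACE.items():
        if random.random() < 0.5:        # decision: apply this regularizer at all?
            cfg[name] = sampler()        # conditional hyperparameter
    return cfg

def search(train_and_eval, trials=50):
    """train_and_eval(cfg) -> validation score; supplied by the user."""
    return max((sample_cocktail() for _ in range(trials)), key=train_and_eval)

# toy usage: a fake evaluation that rewards moderate dropout and small weight decay
score = lambda cfg: -abs(cfg.get("dropout", 0) - 0.2) - abs(cfg.get("weight_decay", 0))
print(search(score, trials=20))
```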
Regularization is all you Need: Simple Neural Nets can Excel on Tabular Data
— AK (@ak92501) June 22, 2021
pdf: https://t.co/rcRbf6H3tf
abs: https://t.co/voMqog8zct
well-regularized plain MLPs significantly outperform sota specialized neural network architectures pic.twitter.com/L3jcM0hbgc
3. Lossy Compression for Lossless Prediction
Yann Dubois, Benjamin Bloem-Reddy, Karen Ullrich, Chris J. Maddison
Most data is automatically collected and only ever “seen” by algorithms. Yet, data compressors preserve perceptual fidelity rather than just the information needed by algorithms performing downstream tasks. In this paper, we characterize the bit-rate required to ensure high performance on all predictive tasks that are invariant under a set of transformations, such as data augmentations. Based on our theory, we design unsupervised objectives for training neural compressors. Using these objectives, we train a generic image compressor that achieves substantial rate savings (more than 1000x on ImageNet) compared to JPEG on 8 datasets, without decreasing downstream classification performance.
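The sketch below is a heavily simplified reading of the idea, not the paper's actual objective: train the compressor so that augmented views of an image map to (nearly) the same code, and penalise the estimated bit-rate of that code. The `encoder`, `rate_model`, and `augment` callables are stand-ins supplied by the user.

```python
# Hedged sketch: rate + invariance objective for a neural compressor.
import torch
import torch.nn.functional as F

def invariant_compression_loss(encoder, rate_model, x, augment, beta=1.0):
    """encoder: image -> latent; rate_model: latent -> estimated bits (an entropy-model proxy)."""
    z1 = encoder(augment(x))           # two random augmentations of the same batch
    z2 = encoder(augment(x))
    distortion = F.mse_loss(z1, z2)    # task-relevant info must survive the augmentations
    rate = rate_model(z1).mean()       # estimated bits needed to store the code
    return rate + beta * distortion

# toy usage with stand-in modules
enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
rate = torch.nn.Sequential(torch.nn.Linear(64, 1), torch.nn.Softplus())  # crude rate proxy
noise = lambda x: x + 0.1 * torch.randn_like(x)                          # stand-in augmentation
print(invariant_compression_loss(enc, rate, torch.randn(4, 3, 32, 32), noise))
```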
Most data is processed by algorithms, but compressors (eg JPEG) are for human eyes.
— Yann Dubois (@yanndubs) June 22, 2021
🤓Our fix: formalize lossy compression that ensures perfect downstream predictions
🔥1000x gains vs JPEG on ImageNet🔥https://t.co/8r3OJyCILj
w. Ben Bloem-Reddy @karen_ullrich @cjmaddison
1/9 pic.twitter.com/dJoQA3zdOW
4. Improving security for users of decentralized exchanges through multiparty computation
Robert Annessi, Ethan Fast
Decentralized cryptocurrency exchanges offer compelling security benefits over centralized exchanges: users control their funds and avoid the risk of an exchange hack or malicious operator. However, because user assets are fully accessible by a secret key, decentralized exchanges pose significant internal security risks for trading firms and automated trading systems, where a compromised system can result in total loss of funds. Centralized exchanges mitigate this risk through API key based security policies that allow professional users to give individual traders or automated systems specific and customizable access rights such as trading or withdrawal limits. Such policies, however, are not compatible with decentralized exchanges, where all exchange operations require a signature generated by the owner’s secret key. This paper introduces a protocol based upon multiparty computation that allows for the creation of API keys and security policies that can be applied to any existing decentralized exchange. Our protocol works with both ECDSA and EdDSA signature schemes and prioritizes efficient computation and communication. We have deployed this protocol on Nash exchange, as well as around several Ethereum-based automated market maker smart contracts, where it secures the trading accounts and wallets of thousands of users.
Multi-party computation (MPC) offers major improvements to non-custodial wallet security and lets decentralized exchanges offer traders API keys with security policies. Read our research team’s technical paper on Nash’s MPC implementation here: https://t.co/RylGraR37e
— Nash (@nashsocial) June 22, 2021
Excited to have a paper out describing the MPC protocol we developed at Nash. The @nashsocial mobile app uses this protocol to secure all user blockchain interactions, including with integrated DeFi applications like @Uniswap and @1inchNetwork https://t.co/udC5bNkifC
— Ethan Fast (@unignorant) June 22, 2021
5. Calliar: An Online Handwritten Dataset for Arabic Calligraphy
Zaid Alyafeai, Maged S. Al-shaibani, Mustafa Ghaleb, Yousif Ahmed Al-Wajih
Calligraphy is an essential part of the Arabic heritage and culture. It has been used in the past for the decoration of houses and mosques. Usually, such calligraphy is designed manually by experts with aesthetic insights. In the past few years, there has been a considerable effort to digitize this type of art, either by taking photos of decorated buildings or by drawing the calligraphy using digital devices. The latter is considered an online form, where the drawing is tracked by recording the movement of the apparatus, an electronic pen for instance, on a screen. In the literature, there are many offline datasets collected with a diversity of Arabic styles for calligraphy. However, there is no available online dataset for Arabic calligraphy. In this paper, we illustrate our approach for the collection and annotation of an online dataset for Arabic calligraphy called Calliar that consists of 2,500 sentences. Calliar is annotated for stroke, character, word and sentence level prediction.

Pleased to announce Calliar, the first online dataset for Arabic Calligraphy. Joint work with @_MagedSaeed_ @alwaridi and Yousif Al-Wajih.
— Zaid زيد (@zaidalyafeai) June 22, 2021
Paper: https://t.co/q5mwcP6roa
Code & data: https://t.co/tDCPJRYUcl
Colab: https://t.co/46yD0V3wht pic.twitter.com/qbbb4tZJ6l
6. VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning
Hao Tan, Jie Lei, Thomas Wolf, Mohit Bansal
Video understanding relies on perceiving the global content and modeling its internal connections (e.g., causality, movement, and spatio-temporal correspondence). To learn these interactions, we apply a mask-then-predict pre-training task on discretized video tokens generated via VQ-VAE. Unlike language, where the text tokens are more independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar), and hence uniformly masking individual tokens will make the task too trivial to learn useful representations. To deal with this issue, we propose a block-wise masking strategy where we mask neighboring video tokens in both spatial and temporal domains. We also add an augmentation-free contrastive learning method to further capture the global content by predicting whether the video clips are sampled from the same video. We pre-train our model on uncurated videos and show that our pre-trained model can reach state-of-the-art results on several video understanding datasets (e.g., SSV2, Diving48). Lastly, we provide detailed analyses on model scalability and pre-training method design. Code is released at https://github.com/airsplay/vimpac.
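A minimal sketch of the block-wise masking idea follows: instead of masking video tokens i.i.d., mask one contiguous spatio-temporal block of the (T, H, W) token grid so the model cannot trivially copy from an adjacent frame or pixel. The grid and block sizes below are illustrative, not the paper's settings.

```python
# Sketch: block-wise masking over a (T, H, W) grid of discrete video tokens.
import numpy as np

def block_mask(T=8, H=16, W=16, block=(4, 6, 6), rng=np.random):
    mask = np.zeros((T, H, W), dtype=bool)
    t0 = rng.randint(0, T - block[0] + 1)
    h0 = rng.randint(0, H - block[1] + 1)
    w0 = rng.randint(0, W - block[2] + 1)
    mask[t0:t0 + block[0], h0:h0 + block[1], w0:w0 + block[2]] = True
    return mask  # True = token is masked and must be predicted

m = block_mask()
print(m.mean())  # fraction of masked tokens
```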
Excited to share “VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning”, combining language modeling and contrastive learning for video pre-training. https://t.co/UxgVjs5R5d
— Hao Tan (@HaoTan5) June 22, 2021
Work done w/ @jayleicn @thom_wolf @mohitban47
(@huggingface + @uncnlp)
1/4 pic.twitter.com/DUpoKHVajg
VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning
— AK (@ak92501) June 22, 2021
pdf: https://t.co/9RJH4ovBiP
github: https://t.co/inPZwIuR4q
pre-train model on uncurated videos and pre-trained model can reach sota results on several video understanding datasets pic.twitter.com/TamganjV9t
7. Nested Variational Inference
Heiko Zimmermann, Hao Wu, Babak Esmaeili, Jan-Willem van de Meent
We develop nested variational inference (NVI), a family of methods that learn proposals for nested importance samplers by minimizing a forward or reverse KL divergence at each level of nesting. NVI is applicable to many commonly-used importance sampling strategies and provides a mechanism for learning intermediate densities, which can serve as heuristics to guide the sampler. Our experiments apply NVI to (a) sample from a multimodal distribution using a learned annealing path, (b) learn heuristics that approximate the likelihood of future observations in a hidden Markov model, and (c) perform amortized inference in hierarchical deep generative models. We observe that optimizing nested objectives leads to improved sample quality in terms of log average weight and effective sample size.
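A hedged sketch of the per-level objective, as read from the abstract (the notation is illustrative, not the paper's):

```latex
% At nesting level k, with intermediate target \pi_k and proposal q_k, NVI minimises
% either a forward or a reverse KL divergence; the overall objective sums the levels.
\mathcal{L} \;=\; \sum_{k} \mathcal{L}_k, \qquad
\mathcal{L}_k \in \Big\{ \mathrm{KL}\!\left(\pi_k \,\middle\|\, q_k\right),\;
                         \mathrm{KL}\!\left(q_k \,\middle\|\, \pi_k\right) \Big\}.
```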
New on ArXiv: Nested Variational Inference https://t.co/wkAESSNfiu
— Jan-Willem van de Meent (@jwvdm) June 22, 2021
Work by Heiko Zimmermann (@zmheiko), Hao Wu (@Hao_Wu_), Babak Esmaeili (@bob_smiley_), and myself.
(this is an extended version of our work at AABI this year; https://t.co/hzLJ2IpIWm) [1/] pic.twitter.com/U4D5vvhQEd
8. Boundary Graph Neural Networks for 3D Simulations
Andreas Mayr, Sebastian Lehner, Arno Mayrhofer, Christoph Kloss, Sepp Hochreiter, Johannes Brandstetter
The abundance of data has given machine learning huge momentum in natural sciences and engineering. However, the modeling of simulated physical processes remains difficult. A key problem in doing so is the correct handling of geometric boundaries. While triangularized geometric boundaries are very common in engineering applications, they are notoriously difficult to model by machine learning approaches due to their heterogeneity with respect to size and orientation. In this work, we introduce Boundary Graph Neural Networks (BGNNs), which dynamically modify graph structures to address boundary conditions. Boundary graph structures are constructed via modifying edges, augmenting node features, and dynamically inserting virtual nodes. The new BGNNs are tested on complex 3D granular flow processes of hoppers and rotating drums which are standard parts of industrial machinery. Using precise simulations that are obtained by an expensive and complex discrete element method, BGNNs are evaluated in terms of computational efficiency as well as prediction accuracy of particle flows and mixing entropies. Even if complex boundaries are present, BGNNs are able to accurately reproduce 3D granular flows within simulation uncertainties over hundreds of thousands of simulation timesteps, and most notably particles completely stay within the geometric objects without using handcrafted conditions or restrictions.
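The sketch below is one simplified reading of the boundary-handling idea: for every particle close to a boundary triangle, insert a virtual node at the nearby boundary point and connect it to the particle, so the message passing "sees" the wall. The distance computation is reduced to point-to-plane; the paper's construction differs in detail.

```python
# Hedged sketch: dynamically insert virtual boundary nodes near particles.
import numpy as np

def add_virtual_boundary_nodes(particles, tri_centers, tri_normals, cutoff=0.1):
    """particles: (n,3); tri_centers/tri_normals: (m,3). Returns extra nodes and edges."""
    virtual_nodes, edges = [], []
    for i, p in enumerate(particles):
        d = np.einsum("md,md->m", tri_normals, p - tri_centers)  # signed point-to-plane distances
        j = np.argmin(np.abs(d))
        if abs(d[j]) < cutoff:
            foot = p - d[j] * tri_normals[j]   # projection onto the nearest boundary plane
            virtual_nodes.append(foot)
            edges.append((i, len(particles) + len(virtual_nodes) - 1))
    return np.array(virtual_nodes), edges

parts = np.random.rand(50, 3)
centers, normals = np.array([[0.5, 0.5, 0.0]]), np.array([[0.0, 0.0, 1.0]])
vn, e = add_virtual_boundary_nodes(parts, centers, normals, cutoff=0.1)
print(len(vn), len(e))
```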
Great work lead by @AndreasMayr11. Granular flows learn to move inside complex geometric objects without any handcrafted restrictions. Paper: https://t.co/qKIBDKZVJZ Blog post: https://t.co/eWL1A7Awuv pic.twitter.com/okLWDpAiD1
— Johannes Brandstetter (@jo_brandstetter) June 22, 2021
9. Say Their Names: Resurgence in the collective attention toward Black victims of fatal police violence following the death of George Floyd
Henry H. Wu, Ryan J. Gallagher, Thayer Alshaabi, Jane L. Adams, Joshua R. Minot, Michael V. Arnold, Brooke Foucault Welles, Randall Harp, Peter Sheridan Dodds, Christopher M. Danforth
The murder of George Floyd by police in May 2020 sparked international protests and renewed attention to the Black Lives Matter movement. Here, we characterize ways in which the online activity following George Floyd’s death was unparalleled in its volume and intensity, including setting records for activity on Twitter, prompting the saddest day in the platform’s history, and causing George Floyd’s name to appear among the ten most frequently used phrases in a day, where he is the only individual to have ever received that level of attention who was not known to the public earlier that same week. Further, we find this attention extended beyond George Floyd and that more Black victims of fatal police violence received attention following his death than during other past moments in Black Lives Matter’s history. We place that attention within the context of prior online racial justice activism by showing how the names of Black victims of police violence have been lifted and memorialized over the last 12 years on Twitter. Our results suggest that the 2020 wave of attention to the Black Lives Matter movement centered past instances of police violence in an unprecedented way, demonstrating the impact of the movement’s rhetorical strategy to “say their names.”
“Say Their Names: Resurgence in the collective attention toward Black victims of fatal police violence following the death of George Floyd”
— Computational Story Lab (@compstorylab) June 22, 2021
New study from our group in collaboration w/@reharp @foucaultwelles @ryanjgallager. https://t.co/ZGJ5lsry4G
1/15 pic.twitter.com/yQriHjbOCZ
10. Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling
Hongyu Gong, Yun Tang, Juan Pino, Xian Li
Multi-head attention has each of the attention heads collect salient information from different parts of an input sequence, making it a powerful mechanism for sequence modeling. Multilingual and multi-domain learning are common scenarios for sequence modeling, where the key challenge is to maximize positive transfer and mitigate negative transfer across languages and domains. In this paper, we find that non-selective attention sharing is sub-optimal for achieving good generalization across all languages and domains. We further propose attention sharing strategies to facilitate parameter sharing and specialization in multilingual and multi-domain sequence modeling. Our approach automatically learns shared and specialized attention heads for different languages and domains to mitigate their interference. Evaluated in various tasks including speech recognition, text-to-text and speech-to-text translation, the proposed attention sharing strategies consistently bring gains to sequence models built upon multi-head attention. For speech-to-text translation, our approach yields an average gain of +2.0 BLEU over 13 language directions in the multilingual setting and +2.0 BLEU over 3 domains in the multi-domain setting.
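Below is a hedged sketch of one way to realise per-language or per-domain head selection: each task owns a learnable gate over the attention heads, so some heads are shared and others specialise. This is a simplified gating variant; the paper's selection mechanism may differ, and all layer sizes are illustrative.

```python
# Sketch: multi-head self-attention with a learnable per-task gate over heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadSelectSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_tasks=4):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.gates = nn.Parameter(torch.zeros(n_tasks, n_heads))  # one gate vector per language/domain
        self.h, self.dh = n_heads, d_model // n_heads

    def forward(self, x, task_id):
        B, L, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, L, self.h, self.dh).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.dh ** 0.5, dim=-1)
        heads = attn @ v                              # (B, h, L, dh)
        g = torch.sigmoid(self.gates[task_id])        # soft selection of heads for this task
        heads = heads * g.view(1, -1, 1, 1)           # shared heads stay on, others are damped
        return self.out(heads.transpose(1, 2).reshape(B, L, -1))

x = torch.randn(2, 10, 512)
print(HeadSelectSelfAttention()(x, task_id=1).shape)  # torch.Size([2, 10, 512])
```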
Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling
— AK (@ak92501) June 22, 2021
pdf: https://t.co/TRKWh374g0
For speech-to-text translation, yields an average of +2.0 BLEU over 13 language directions in multilingual setting and +2.0 BLEU over 3 domains pic.twitter.com/Vg6FSpJ3ea
11. Multiplying Matrices Without Multiplying
Davis Blalock, John Guttag
Multiplying matrices is among the most fundamental and compute-intensive operations in machine learning. Consequently, there has been significant work on efficiently approximating matrix multiplies. We introduce a learning-based algorithm for this task that greatly outperforms existing methods. Experiments using hundreds of matrices from diverse domains show that it often runs faster than exact matrix products and faster than current approximate methods. In the common case that one matrix is known ahead of time, our method also has the interesting property that it requires zero multiply-adds. These results suggest that a mixture of hashing, averaging, and byte shuffling, the core operations of our method, could be a more promising building block for machine learning than the sparsified, factorized, and/or scalar quantized matrix products that have recently been the focus of substantial research and hardware investment.
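To illustrate the general flavour of lookup-table-based approximate matrix multiplication, here is a simplified product-quantisation variant, not the paper's learned hashing: split the known matrix B's input dimension into subspaces, pre-compute per-subspace lookup tables offline, then approximate A @ B by encoding each row of A to prototype ids and summing table entries.

```python
# Hedged sketch: PQ-style approximate matmul with precomputed lookup tables.
import numpy as np
from scipy.cluster.vq import kmeans2

def fit_tables(A_train, B, n_subspaces=4, n_codes=16):
    splits = np.array_split(np.arange(A_train.shape[1]), n_subspaces)
    prototypes, tables = [], []
    for idx in splits:
        cb, _ = kmeans2(A_train[:, idx], n_codes, minit="points")  # prototypes for this subspace
        prototypes.append((idx, cb))
        tables.append(cb @ B[idx, :])        # (n_codes, n_out), computed once offline
    return prototypes, tables

def approx_matmul(A, prototypes, tables):
    out = np.zeros((A.shape[0], tables[0].shape[1]))
    for (idx, cb), table in zip(prototypes, tables):
        # encode: nearest prototype per row (no multiplies against B at query time)
        codes = np.argmin(((A[:, idx, None] - cb.T[None]) ** 2).sum(1), axis=1)
        out += table[codes]                  # gather-and-add from the lookup table
    return out

A_train, A, B = np.random.randn(256, 32), np.random.randn(8, 32), np.random.randn(32, 5)
prot, tab = fit_tables(A_train, B)
print(np.abs(approx_matmul(A, prot, tab) - A @ B).mean())   # average approximation error
```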
Excited to announce our ICML paper, “Multiplying Matrices Without Multiplying” !
— Davis Blalock (@davisblalock) June 22, 2021
TL;DR: 10x better sparsity + quantization, no multiply-adds
Paper: https://t.co/nsPiiwVbC0
Code: https://t.co/xXEfRpAEZR [1/n] pic.twitter.com/vlsLPJPSCB
12. CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen
We present the CLIP2Video network to transfer an image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in the domain of video-and-language learning try to distill the spatio-temporal video features and multi-modal interaction between videos and languages from a large-scale video-text dataset. Different from them, we leverage a pretrained image-language model and simplify it into a two-stage framework that co-learns image-text matching and enhances temporal relations between video frames and video text, making it possible to train on comparatively small datasets. Specifically, based on the spatial semantics captured by the Contrastive Language-Image Pretraining (CLIP) model, our model involves a Temporal Difference Block to capture motions at fine temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies, and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
— AK (@ak92501) June 22, 2021
pdf: https://t.co/3EgkVlPLtE
abs: https://t.co/l9oIq1jqie
sota performance on text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSRVTT, MSVD and VATEX pic.twitter.com/kVyUXAeG1J
13. One Million Scenes for Autonomous Driving: ONCE Dataset
Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, Jie Yu, Hang Xu, Chunjing Xu
Current perception models in autonomous driving have become notorious for greatly relying on a mass of annotated data to cover unseen cases and address the long-tail problem. On the other hand, learning from unlabeled large-scale collected data and incrementally self-training powerful recognition models have received increasing attention and may become the solution for next-generation industry-level powerful and robust perception models in autonomous driving. However, the research community has generally suffered from a lack of these essential real-world scene data, which hampers the future exploration of fully/semi/self-supervised methods for 3D perception. In this paper, we introduce the ONCE (One millioN sCenEs) dataset for 3D object detection in the autonomous driving scenario. The ONCE dataset consists of 1 million LiDAR scenes and 7 million corresponding camera images. The data is selected from 144 driving hours, which is 20x longer than the largest 3D autonomous driving datasets available (e.g. nuScenes and Waymo), and it is collected across a range of different areas, periods and weather conditions. To facilitate future research on exploiting unlabeled data for 3D detection, we additionally provide a benchmark in which we reproduce and evaluate a variety of self-supervised and semi-supervised methods on the ONCE dataset. We conduct extensive analyses of these methods and provide valuable observations on their performance related to the scale of used data. Data, code, and more information are available at https://once-for-auto-driving.github.io/index.html.
One Million Scenes for Autonomous Driving: ONCE Dataset
— AK (@ak92501) June 22, 2021
pdf: https://t.co/4ITaZ5dRdS
abs: https://t.co/oV9Gl4l7u9
project page: https://t.co/tLcOYgMuBt
consists of 1 million LiDAR scenes and 7 million corresponding camera images pic.twitter.com/2WCmC8TAwF
14. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction
Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, Wenping Wang
We present a novel neural surface reconstruction method, called NeuS, for reconstructing objects and scenes with high fidelity from 2D image inputs. Existing neural surface reconstruction approaches, such as DVR and IDR, require foreground masks as supervision, easily get trapped in local minima, and therefore struggle with the reconstruction of objects with severe self-occlusion or thin structures. Meanwhile, recent neural methods for novel view synthesis, such as NeRF and its variants, use volume rendering to produce a neural scene representation with robust optimization, even for highly complex objects. However, extracting high-quality surfaces from this learned implicit representation is difficult because there are not sufficient surface constraints in the representation. In NeuS, we propose to represent a surface as the zero-level set of a signed distance function (SDF) and develop a new volume rendering method to train a neural SDF representation. We observe that the conventional volume rendering method causes inherent geometric errors (i.e. bias) for surface reconstruction, and therefore propose a new formulation that is free of bias in the first order of approximation, thus leading to more accurate surface reconstruction even without mask supervision. Experiments on the DTU dataset and the BlendedMVS dataset show that NeuS outperforms the state of the art in high-quality surface reconstruction, especially for objects and scenes with complex structures and self-occlusion.
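The sketch below shows the core rendering idea as read from the abstract: turn signed-distance values sampled along a ray into opacities via a steep logistic CDF, then composite them as in volume rendering so the weight mass concentrates near the surface crossing. The paper's exact (first-order unbiased) weight derivation is more involved; the steepness parameter `s` is illustrative.

```python
# Hedged sketch: convert per-ray SDF samples into volume-rendering weights.
import numpy as np

def sdf_to_weights(sdf_along_ray, s=50.0):
    """sdf_along_ray: (n,) SDF values at increasing depths along one ray."""
    phi = 1.0 / (1.0 + np.exp(-s * sdf_along_ray))            # logistic CDF of the SDF
    alpha = np.clip((phi[:-1] - phi[1:]) / (phi[:-1] + 1e-8), 0.0, 1.0)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]   # accumulated transmittance
    return trans * alpha                                       # per-sample rendering weights

# a ray that crosses the surface (SDF changes sign) gets its weight mass near the crossing
print(sdf_to_weights(np.linspace(0.5, -0.5, 11)).round(3))
```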
NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction
— AK (@ak92501) June 22, 2021
pdf: https://t.co/LPA78L4KFO
multiview surface reconstruction, represents 3D surfaces as neural SDF and developed a new volume rendering method for training the implicit SDF representation pic.twitter.com/k6hBlbV9mP
15. Understanding Object Dynamics for Interactive Image-to-Video Synthesis
Andreas Blattmann, Timo Milbich, Michael Dorkenwald, Björn Ommer
What would be the effect of locally poking a static scene? We present an approach that learns naturally-looking global articulations caused by a local manipulation at a pixel level. Training requires only videos of moving objects but no information of the underlying manipulation of the physical scene. Our generative model learns to infer natural object dynamics as a response to user interaction and learns about the interrelations between different object body regions. Given a static image of an object and a local poking of a pixel, the approach then predicts how the object would deform over time. In contrast to existing work on video prediction, we do not synthesize arbitrary realistic videos but enable local interactive control of the deformation. Our model is not restricted to particular object categories and can transfer dynamics onto novel unseen object instances. Extensive experiments on diverse objects demonstrate the effectiveness of our approach compared to common video prediction frameworks. Project page is available at https://bit.ly/3cxfA2L .
Understanding Object Dynamics for Interactive Image-to-Video Synthesis
— AK (@ak92501) June 22, 2021
pdf: https://t.co/kOfQrMw69K
abs: https://t.co/LvXna6pQuU
project page: https://t.co/gYrMQjgkYr
static image of an object and a local poking of a pixel, approach predicts how the object would deform over time pic.twitter.com/ITLDPKDCte
16. Scenic4RL: Programmatic Modeling and Generation of Reinforcement Learning Environments
Abdus Salam Azad, Edward Kim, Qiancheng Wu, Kimin Lee, Ion Stoica, Pieter Abbeel, Sanjit A. Seshia
The capability of a reinforcement learning (RL) agent directly depends on the diversity of learning scenarios the environment generates and how closely it captures real-world situations. However, existing environments/simulators lack the support to systematically model distributions over initial states and transition dynamics. Furthermore, in complex domains such as soccer, the space of possible scenarios is infinite, which makes it impossible for one research group to provide a comprehensive set of scenarios to train, test, and benchmark RL algorithms. To address this issue, for the first time, we adopt an existing formal scenario specification language, SCENIC, to intuitively model and generate interactive scenarios. We interfaced SCENIC with the Google Research Soccer environment to create a platform called SCENIC4RL. Using this platform, we provide a dataset consisting of 36 scenario programs encoded in SCENIC and demonstration data generated from a subset of them. We share our experimental results to show the effectiveness of our dataset and the platform to train, test, and benchmark RL algorithms. More importantly, we open-source our platform to enable the RL community to collectively contribute to constructing a comprehensive set of scenarios.
Scenic4RL: Programmatic Modeling and Generation of Reinforcement Learning Environments
— AK (@ak92501) June 22, 2021
pdf: https://t.co/7ZnzOIOSrj
platform for generation of diverse scenarios for Reinforcement Learning programmatically, using the SCENIC scenario specification language pic.twitter.com/TFAJjrvhFN
17. DiGS : Divergence guided shape implicit neural representation for unoriented point clouds
Yizhak Ben-Shabat, Chamin Hewa Koneputugodage, Stephen Gould
Neural shape representations have recently been shown to be effective in shape analysis and reconstruction tasks. Existing neural network methods require point coordinates and corresponding normal vectors to learn the implicit level sets of the shape. Normal vectors are often not provided as raw data; therefore, approximation and reorientation are required as pre-processing stages, both of which can introduce noise. In this paper, we propose a divergence guided shape representation learning approach that does not require normal vectors as input. We show that incorporating a soft constraint on the divergence of the distance function favours smooth solutions that reliably orient gradients to match the unknown normal at each point, in some cases even better than approaches that use ground truth normal vectors directly. Additionally, we introduce a novel geometric initialization method for sinusoidal shape representation networks that further improves convergence to the desired solution. We evaluate the effectiveness of our approach on the task of surface reconstruction and show state-of-the-art performance compared to other unoriented methods and on-par performance compared to oriented methods.
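A hedged sketch of the divergence-guided idea: alongside a standard eikonal term (gradient norm close to 1) and a surface term, softly penalise the divergence of the gradient field (the Laplacian of the implicit function) to favour smooth, consistently oriented solutions. The loss weights and the tiny network are illustrative, not the paper's configuration.

```python
# Sketch: implicit-surface loss with an added divergence (Laplacian) penalty.
import torch

def digs_style_loss(f, x, lambda_eik=0.1, lambda_div=0.1):
    """f: network mapping (n,3) points to (n,1) implicit values; x: sample points on/near the shape."""
    x = x.requires_grad_(True)
    y = f(x)
    grad = torch.autograd.grad(y.sum(), x, create_graph=True)[0]        # (n, 3) gradient field
    eikonal = ((grad.norm(dim=-1) - 1.0) ** 2).mean()                   # |grad f| should be 1
    div = sum(torch.autograd.grad(grad[:, i].sum(), x, create_graph=True)[0][:, i]
              for i in range(3))                                        # divergence = Laplacian of f
    surface = y.abs().mean()            # sampled surface points should lie on the zero level set
    return surface + lambda_eik * eikonal + lambda_div * (div ** 2).mean()

net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Softplus(), torch.nn.Linear(64, 1))
print(digs_style_loss(net, torch.randn(128, 3)))
```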
DiGS : Divergence guided shape implicit neural representation for unoriented point clouds
— AK (@ak92501) June 22, 2021
pdf: https://t.co/HfT8LN7dpj
abs: https://t.co/cxYS5WYvd1 pic.twitter.com/zzxHij5tnD
18. Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval
Xiao Wang, Craig Macdonald, Nicola Tonellotto, Iadh Ounis
Pseudo-relevance feedback mechanisms, from Rocchio to the relevance models, have shown the usefulness of expanding and reweighting the users’ initial queries using information occurring in an initial set of retrieved documents, known as the pseudo-relevant set. Recently, dense retrieval, through the use of neural contextual language models such as BERT for analysing the documents’ and queries’ contents and computing their relevance scores, has shown a promising performance on several information retrieval tasks still relying on the traditional inverted index for identifying documents relevant to a query. Two different dense retrieval families have emerged: the use of single embedded representations for each passage and query (e.g. using BERT’s [CLS] token), or via multiple representations (e.g. using an embedding for each token of the query and document). In this work, we conduct the first study into the potential for multiple representation dense retrieval to be enhanced using pseudo-relevance feedback. In particular, based on the pseudo-relevant set of documents identified using a first-pass dense retrieval, we extract representative feedback embeddings, while ensuring that these embeddings discriminate among passages, which are then added to the query representation. These additional feedback embeddings are shown to enhance the effectiveness of both a reranking and an additional dense retrieval operation. Indeed, experiments on the MSMARCO passage ranking dataset show that MAP can be improved by up to 26% on the TREC 2019 query set and 10% on the TREC 2020 query set by the application of our proposed ColBERT-PRF method on a ColBERT dense retrieval approach.
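The sketch below illustrates the feedback step as described in the abstract: cluster the token embeddings of the first-pass top documents, keep a few representative centroids, and append them to the query's multi-vector representation before re-scoring. The discriminativeness weighting and exact selection details in the paper are omitted here; sizes are illustrative.

```python
# Hedged sketch: expand a multi-vector query with representative feedback embeddings.
import numpy as np
from scipy.cluster.vq import kmeans2

def expand_query(query_emb, feedback_doc_embs, n_feedback=8):
    """query_emb: (q, d) query token embeddings; feedback_doc_embs: (t, d) token embeddings
    pooled from the pseudo-relevant documents."""
    centroids, _ = kmeans2(feedback_doc_embs, n_feedback, minit="points")
    return np.vstack([query_emb, centroids])    # expanded multi-vector query representation

q = np.random.randn(32, 128)
docs = np.random.randn(1000, 128)
print(expand_query(q, docs).shape)   # (40, 128)
```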
Happy to share the 𝗽𝗿𝗲𝗽𝗿𝗶𝗻𝘁 of our #ictir2021 paper “Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval” with @craig_macdonald @ntonellotto and @iadh.
— Xiao.W (@_XiaoWang_) June 22, 2021
📰Preprint: https://t.co/anjXcNZAKW.
🔗Github: https://t.co/vdntN5EIe0 pic.twitter.com/tuHfvhoNaS
19. MeshRIR: A Dataset of Room Impulse Responses on Meshed Grid Points For Evaluating Sound Field Analysis and Synthesis Methods
Shoichi Koyama, Tomoya Nishida, Keisuke Kimura, Takumi Abe, Natsuki Ueno, Jesper Brunnström
A new impulse response (IR) dataset called “MeshRIR” is introduced. Currently available datasets usually include IRs at an array of microphones from several source positions under various room conditions, which are basically designed for evaluating speech enhancement and distant speech recognition methods. On the other hand, methods of estimating or controlling spatial sound fields have been extensively investigated in recent years; however, the current IR datasets are not applicable to validating and comparing these methods because of the low spatial resolution of measurement points. MeshRIR consists of IRs measured at positions obtained by finely discretizing a spatial region. Two subdatasets are currently available: one consists of IRs in a three-dimensional cuboidal region from a single source, and the other consists of IRs in a two-dimensional square region from an array of 32 sources. Therefore, MeshRIR is suitable for evaluating sound field analysis and synthesis methods. This dataset is freely available at https://sh01k.github.io/MeshRIR/ together with sample application code.
Published MeshRIR - room impulse response dataset on meshed grid points💥 You can visualize wave motion🌊
— Shoichi Koyama (@sh01) June 22, 2021
- website: https://t.co/Dfd0JINQwD
- preprint: https://t.co/O3qxEH7WRH pic.twitter.com/dxVpBOmZ0h
20. Moving in a 360 World: Synthesizing Panoramic Parallaxes from a Single Panorama
Ching-Yu Hsu, Cheng Sun, Hwann-Tzong Chen
We present Omnidirectional Neural Radiance Fields (OmniNeRF), the first method for parallax-enabled novel panoramic view synthesis. Recent works for novel view synthesis focus on perspective images with limited field-of-view and require sufficient pictures captured in a specific condition. Conversely, OmniNeRF can generate panorama images for unknown viewpoints given a single equirectangular image as training data. To this end, we propose to augment the single RGB-D panorama by projecting back and forth between a 3D world and different 2D panoramic coordinates at different virtual camera positions. By doing so, we are able to optimize an Omnidirectional Neural Radiance Field with visible pixels collected from omnidirectional viewing angles at a fixed center for the estimation of new viewing angles from varying camera positions. As a result, the proposed OmniNeRF achieves convincing renderings of novel panoramic views that exhibit the parallax effect. We showcase the effectiveness of each of our proposals on both synthetic and real-world datasets.
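The re-projection used to augment the single panorama relies on standard equirectangular geometry: each pixel maps to a 3D viewing direction, is pushed out by its depth, and is re-projected into a panorama centred at a new virtual camera position. The sketch below shows only the pixel-to-direction part of that mapping; the function name and image sizes are illustrative.

```python
# Sketch: unit 3D view directions for every pixel of an equirectangular panorama.
import numpy as np

def equirect_to_directions(H, W):
    v, u = np.meshgrid(np.arange(H) + 0.5, np.arange(W) + 0.5, indexing="ij")
    lon = (u / W) * 2 * np.pi - np.pi           # longitude in [-pi, pi)
    lat = np.pi / 2 - (v / H) * np.pi           # latitude in [-pi/2, pi/2]
    return np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)   # (H, W, 3)

dirs = equirect_to_directions(256, 512)
print(np.allclose(np.linalg.norm(dirs, axis=-1), 1.0))   # True: unit-length rays
```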
Moving in a 360 World: Synthesizing Panoramic Parallaxes from a Single Panorama
— AK (@ak92501) June 22, 2021
pdf: https://t.co/4FyesN29ia
abs: https://t.co/BknCJii2Kr pic.twitter.com/Nv2AIQMdvr
21. Towards Long-Form Video Understanding
Chao-Yuan Wu, Philipp Krähenbühl
Our world offers a never-ending stream of visual stimuli, yet today’s vision systems only accurately recognize patterns within a few seconds. These systems understand the present, but fail to contextualize it in past or future events. In this paper, we study long-form video understanding. We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets. We show that existing state-of-the-art short-term models are limited for long-form tasks. A novel object-centric transformer-based video recognition architecture performs significantly better on 7 diverse tasks. It also outperforms comparable state-of-the-art on the AVA dataset.
Towards Long-Form Video Understanding
— AK (@ak92501) June 22, 2021
pdf: https://t.co/bHyRADq2g3
abs: https://t.co/dyP3cH7We3
object-centric transformer-based video recognition architecture performs significantly better on 7 diverse tasks pic.twitter.com/LYGO8653oR
22. TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova
In this paper, we introduce a novel visual representation learning approach which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data. This results in efficiently and effectively finding a few important visual tokens and enables modeling of pairwise attention between such tokens, over a longer temporal horizon for videos, or over the spatial content in images. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks. Importantly, due to our tokens being adaptive, we accomplish competitive results at a significantly reduced compute cost.
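A minimal sketch of the token-learning idea follows: predict a small number of spatial attention maps from the feature map and use them to pool the input into that many adaptive tokens (e.g. 8), instead of keeping every patch. The layer sizes are illustrative and this is a simplified module, not the authors' implementation.

```python
# Sketch: pool a feature map into a handful of adaptively learned tokens.
import torch
import torch.nn as nn

class TokenLearnerSketch(nn.Module):
    def __init__(self, channels=256, n_tokens=8):
        super().__init__()
        self.to_maps = nn.Conv2d(channels, n_tokens, kernel_size=1)   # one spatial map per token

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        maps = self.to_maps(x).flatten(2).softmax(dim=-1)   # (B, n_tokens, H*W) spatial attention
        feats = x.flatten(2).transpose(1, 2)                 # (B, H*W, C)
        return maps @ feats                                  # (B, n_tokens, C) adaptive tokens

x = torch.randn(2, 256, 14, 14)
print(TokenLearnerSketch()(x).shape)   # torch.Size([2, 8, 256])
```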
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
— AK (@ak92501) June 22, 2021
pdf: https://t.co/eDZTwj9vLk
abs: https://t.co/9rfY9Lkdce
visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks pic.twitter.com/Bzxq2aC2DV