
Hot Papers 2021-06-07

1. How Great is the Great Firewall? Measuring China’s DNS Censorship

Nguyen Phong Hoang, Arian Akhavan Niaki, Jakub Dalek, Jeffrey Knockel, Pellaeon Lin, Bill Marczak, Masashi Crete-Nishihata, Phillipa Gill, Michalis Polychronakis

The DNS filtering apparatus of China’s Great Firewall (GFW) has evolved considerably over the past two decades. However, most prior studies of China’s DNS filtering were performed over short time periods, leading to unnoticed changes in the GFW’s behavior. In this study, we introduce GFWatch, a large-scale, longitudinal measurement platform capable of testing hundreds of millions of domains daily, enabling continuous monitoring of the GFW’s DNS filtering behavior. We present the results of running GFWatch over a nine-month period, during which we tested an average of 411M domains per day and detected a total of 311K domains censored by the GFW’s DNS filter. To the best of our knowledge, this is the largest number of domains tested and censored domains discovered in the literature. We further reverse-engineer regular expressions used by the GFW and find 41K innocuous domains that match these filters, resulting in overblocking of their content. We also observe bogus IPv6 and globally routable IPv4 addresses injected by the GFW, including addresses owned by US companies, such as Facebook, Dropbox, and Twitter. Using data from GFWatch, we study the impact of GFW blocking on the global DNS system and find 77K censored domains with DNS resource records polluted in popular public DNS resolvers, such as Google and Cloudflare. Finally, we propose strategies to detect poisoned responses that can (1) sanitize poisoned DNS records from the cache of public DNS resolvers, and (2) assist in the development of circumvention tools to bypass the GFW’s DNS censorship.
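
As an illustration of how such detection strategies could be operationalized, the sketch below flags and drops DNS answers whose IPs fall in a pre-compiled set of forged addresses; the IP values and the set construction are hypothetical stand-ins, not GFWatch’s actual data.

```python
from ipaddress import ip_address

# Hypothetical set of forged answer IPs; a GFWatch-style platform would derive this
# set from its own measurements of injected responses (the paper reports injections
# using addresses owned by companies such as Facebook, Dropbox, and Twitter).
KNOWN_FORGED_IPS = {
    "31.13.94.10",    # hypothetical entry in Facebook-owned address space
    "199.16.156.11",  # hypothetical entry in Twitter-owned address space
}

def looks_poisoned(domain: str, answer_ips: list[str]) -> bool:
    """Flag a DNS response as likely injected if any answer IP matches the forged set.

    `answer_ips` are the A/AAAA records returned for `domain` by the resolver under test.
    """
    return any(str(ip_address(ip)) in KNOWN_FORGED_IPS for ip in answer_ips)

def sanitize(cache: dict[str, list[str]]) -> dict[str, list[str]]:
    """Drop flagged entries from a {domain: [answer IPs]} cache snapshot."""
    return {d: ips for d, ips in cache.items() if not looks_poisoned(d, ips)}
```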

2. Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning

Jannik Kossen, Neil Band, Clare Lyle, Aidan N. Gomez, Tom Rainforth, Yarin Gal

We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms. However, unlike conventional non-parametric models, we let the model learn end-to-end from the data how to make use of other datapoints for prediction. Empirically, our models solve cross-datapoint lookup and complex reasoning tasks unsolvable by traditional deep learning models. We show highly competitive results on tabular data, early results on CIFAR-10, and give insight into how the model makes use of the interactions between points.
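
To make the core idea concrete, here is a minimal PyTorch sketch of attention applied across the datapoint axis rather than within a single input; the layer sizes and composition are illustrative and not the authors’ architecture.

```python
import torch
import torch.nn as nn

class AttentionBetweenDatapoints(nn.Module):
    """Toy layer: each datapoint attends to every other datapoint in the dataset."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (1, n_datapoints, d_model) -- the whole dataset is one "sequence",
        # so attention mixes information across datapoints rather than across features.
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)

embeddings = torch.randn(1, 128, 64)            # 128 datapoints, 64-dim embeddings
mixed = AttentionBetweenDatapoints()(embeddings)
print(mixed.shape)                               # torch.Size([1, 128, 64])
```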

3. X-volution: On the unification of convolution and self-attention

Xuanhong Chen, Hang Wang, Bingbing Ni

  • retweets: 5955, favorites: 585 (06/08/2021 09:10:32)
  • links: abs | pdf
  • cs.CV

Convolution and self-attention are two fundamental building blocks of deep neural networks: the former extracts local image features in a linear way, while the latter non-locally encodes high-order contextual relationships. Although the two operations are essentially complementary (first- vs. high-order), state-of-the-art architectures such as CNNs and Transformers lack a principled way to apply both in a single computational module, owing to their heterogeneous computing patterns and the excessive cost of global dot-products in visual tasks. In this work, we theoretically derive a global self-attention approximation scheme, which approximates self-attention via convolution on transformed features. Based on this approximation, we establish a multi-branch elementary module composed of both convolution and self-attention operations, capable of unifying local and non-local feature interaction. Importantly, once trained, this multi-branch module can be conditionally converted into a single standard convolution via structural re-parameterization, yielding a pure convolution-style operator named X-volution that can be plugged into any modern network as an atomic operation. Extensive experiments demonstrate that the proposed X-volution achieves highly competitive improvements in visual understanding (+1.2% top-1 accuracy on ImageNet classification, +1.7 box AP and +1.5 mask AP on COCO detection and segmentation).
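
A rough sketch of the multi-branch idea (a convolution branch in parallel with an attention branch) is given below; the paper’s attention approximation and structural re-parameterization are not reproduced, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConvAttnBranch(nn.Module):
    """Toy two-branch block: a 3x3 convolution (local, first-order) in parallel with
    self-attention over spatial positions (non-local, high-order). Only a schematic:
    X-volution's convolution-based attention approximation and its collapse into a
    single convolution after training are not reproduced here."""
    def __init__(self, channels: int = 32, heads: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.conv(x)
        seq = x.flatten(2).transpose(1, 2)            # (b, h*w, c)
        non_local, _ = self.attn(seq, seq, seq)
        non_local = non_local.transpose(1, 2).reshape(b, c, h, w)
        return local + non_local                      # sum of the two branches

x = torch.randn(2, 32, 16, 16)
print(ConvAttnBranch()(x).shape)                      # torch.Size([2, 32, 16, 16])
```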

4. MERLOT: Multimodal Neural Script Knowledge Models

Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi

As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech — in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes. On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%, even those that make heavy use of auxiliary supervised data (like object bounding boxes). Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives that encourage full-stack multimodal reasoning, from the recognition to cognition level.

5. Ukiyo-e Analysis and Creativity with Attribute and Geometry Annotation

Yingtao Tian, Tarin Clanuwat, Chikahiko Suzuki, Asanobu Kitamoto

  • retweets: 1784, favorites: 231 (06/08/2021 09:10:33)
  • links: abs | pdf
  • cs.CV | cs.LG

The study of Ukiyo-e, an important genre of pre-modern Japanese art, focuses on objects and style, as does research on other artwork. Such study has benefited from renewed interest in culturally important topics within the machine learning community, leading to interdisciplinary work including image collections, quantitative approaches, and machine learning-based creative applications. These efforts, however, have several drawbacks, and it remains challenging to integrate them into a comprehensive view. To bridge this gap, we propose a holistic approach. We first present a large-scale Ukiyo-e dataset with coherent semantic labels and geometric annotations, then show its value in a quantitative study of the objects depicted in Ukiyo-e paintings using these labels and annotations. We further demonstrate that machine learning methods can support the study of style through soft color decomposition of Ukiyo-e, and finally provide joint insights into object and style by composing sketches and colors using colorization. The dataset is available at https://github.com/rois-codh/arc-ukiyoe-faces

6. Light Field Networks: Neural Scene Representations with Single-Evaluation Rendering

Vincent Sitzmann, Semon Rezchikov, William T. Freeman, Joshua B. Tenenbaum, Fredo Durand

Inferring representations of 3D scenes from 2D observations is a fundamental problem of computer graphics, computer vision, and artificial intelligence. Emerging 3D-structured neural scene representations are a promising approach to 3D scene understanding. In this work, we propose a novel neural scene representation, Light Field Networks or LFNs, which represent both geometry and appearance of the underlying 3D scene in a 360-degree, four-dimensional light field parameterized via a neural implicit representation. Rendering a ray from an LFN requires only a single network evaluation, as opposed to hundreds of evaluations per ray for ray-marching or volumetric renderers built on 3D-structured neural scene representations. In the setting of simple scenes, we leverage meta-learning to learn a prior over LFNs that enables multi-view consistent light field reconstruction from as little as a single image observation. This results in dramatic reductions in time and memory complexity, and enables real-time rendering. The cost of storing a 360-degree light field via an LFN is two orders of magnitude lower than conventional methods such as the Lumigraph. Utilizing the analytical differentiability of neural implicit representations and a novel parameterization of light space, we further demonstrate the extraction of sparse depth maps from LFNs.
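
The single-evaluation property can be illustrated with a toy MLP that maps a 6-D ray parameterization (Plücker coordinates, one common choice) directly to color in one forward pass per ray; this is a sketch of the concept, not the authors’ meta-learned model.

```python
import torch
import torch.nn as nn

def pluecker(origins: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """Map rays (origin o, direction d) to 6-D Plücker coordinates (d, o x d)."""
    d = dirs / dirs.norm(dim=-1, keepdim=True)
    return torch.cat([d, torch.cross(origins, d, dim=-1)], dim=-1)

class ToyLightFieldNetwork(nn.Module):
    """One network evaluation per ray: 6-D ray -> RGB, with no ray marching."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, rays: torch.Tensor) -> torch.Tensor:
        return self.mlp(rays)

origins = torch.zeros(1024, 3)
dirs = torch.randn(1024, 3)
colors = ToyLightFieldNetwork()(pluecker(origins, dirs))   # one forward pass per ray
print(colors.shape)                                        # torch.Size([1024, 3])
```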

7. Solving Schrödinger Bridges via Maximum Likelihood

Francisco Vargas, Pierre Thodoroff, Neil D. Lawrence, Austen Lamacraft

The Schrödinger bridge problem (SBP) finds the most likely stochastic evolution between two probability distributions given a prior stochastic evolution. As well as applications in the natural sciences, problems of this kind have important applications in machine learning, such as dataset alignment and hypothesis testing. Whilst the theory behind this problem is relatively mature, scalable numerical recipes to estimate the Schrödinger bridge remain an active area of research. We prove an equivalence between the SBP and maximum likelihood estimation, enabling direct application of successful machine learning techniques. We propose a numerical procedure to estimate SBPs using Gaussian processes and demonstrate the practical usage of our approach in numerical simulations and experiments.
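
For context, the SBP described above is commonly written as an entropic projection of a prior path measure onto the set of processes with the prescribed marginals (a standard textbook formulation, not specific to this paper’s estimator):

```latex
% Dynamic formulation of the Schrödinger bridge problem: among all path measures
% \mathbb{Q} whose time-0 and time-1 marginals equal the given distributions
% \pi_0 and \pi_1, pick the one closest (in KL divergence) to the prior process \mathbb{P}.
\mathbb{Q}^{\star} = \arg\min_{\mathbb{Q}\,:\,\mathbb{Q}_0 = \pi_0,\ \mathbb{Q}_1 = \pi_1} \mathrm{KL}\left(\mathbb{Q} \,\|\, \mathbb{P}\right)
```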

8. Few-Shot Segmentation via Cycle-Consistent Transformer

Gengwei Zhang, Guoliang Kang, Yunchao Wei, Yi Yang

  • retweets: 483, favorites: 85 (06/08/2021 09:10:34)
  • links: abs | pdf
  • cs.CV

Few-shot segmentation aims to train a segmentation model that can quickly adapt to novel classes with few exemplars. The conventional training paradigm is to learn to make predictions on query images conditioned on features from support images. Previous methods only utilized the semantic-level prototypes of support images as the conditional information; they cannot utilize all pixel-wise support information for the query predictions, which is, however, critical for the segmentation task. In this paper, we focus on utilizing pixel-wise relationships between support and target images to facilitate few-shot semantic segmentation. We design a novel Cycle-Consistent Transformer (CyCTR) module to aggregate pixel-wise support features into query ones. CyCTR performs cross-attention between features from different images, i.e. support and query images. We observe that there may exist unexpected irrelevant pixel-level support features; directly performing cross-attention may aggregate these features from support to query and bias the query features. Thus, we propose a novel cycle-consistent attention mechanism to filter out possibly harmful support features and encourage query features to attend to the most informative pixels in support images. Experiments on all few-shot segmentation benchmarks demonstrate that the proposed CyCTR leads to remarkable improvements over previous state-of-the-art methods. Specifically, on the Pascal-5^i and COCO-20^i datasets, we achieve 66.6% and 45.6% mIoU for 5-shot segmentation, outperforming the previous state-of-the-art by 4.6% and 7.1% respectively.
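
One way to read the cycle-consistency filter is sketched below: a support pixel is kept only if walking to its most similar query pixel and back lands on a support pixel of the same class. This is a simplified toy version with illustrative shapes, not the exact CyCTR mechanism.

```python
import torch

def cycle_consistent_keep_mask(query_feats, support_feats, support_labels):
    """Toy cycle-consistency filter over pixel features.

    query_feats:    (Nq, d) query-pixel features
    support_feats:  (Ns, d) support-pixel features
    support_labels: (Ns,)   binary foreground/background labels of support pixels

    For each support pixel j: walk to its most similar query pixel i*, then from i*
    back to the most similar support pixel j*. Keep j only if the labels of j and j*
    agree, i.e. the cycle stays within the same class.
    """
    affinity = query_feats @ support_feats.t()        # (Nq, Ns) similarity
    i_star = affinity.argmax(dim=0)                   # (Ns,) best query pixel per support pixel
    j_star = affinity.argmax(dim=1)[i_star]           # (Ns,) best support pixel for that query pixel
    return support_labels[j_star] == support_labels   # (Ns,) True = keep in cross-attention

q = torch.randn(100, 32)
s = torch.randn(50, 32)
labels = torch.randint(0, 2, (50,))
keep = cycle_consistent_keep_mask(q, s, labels)
print(keep.shape, keep.dtype)                         # torch.Size([50]) torch.bool
```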

9. Detecting and Adapting to Novelty in Games

Xiangyu Peng, Jonathan C. Balloch, Mark O. Riedl

  • retweets: 231, favorites: 77 (06/08/2021 09:10:34)
  • links: abs | pdf
  • cs.AI

Open-world novelty occurs when the rules of an environment can change abruptly, such as when a game player encounters “house rules”. To address open-world novelty, game playing agents must be able to detect when novelty is injected, and to quickly adapt to the new rules. We propose a model-based reinforcement learning approach where game state and rules are represented as knowledge graphs. The knowledge graph representation of the state and rules allows novelty to be detected as changes in the knowledge graph, assists with the training of deep reinforcement learners, and enables imagination-based re-training where the agent uses the knowledge graph to perform look-ahead.
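
A toy illustration of the detection step, with made-up triples for a generic card game: novelty is simply a set difference between the expected and observed rule graphs.

```python
# Knowledge-graph rules represented as (subject, relation, object) triples.
# The triples below are invented for illustration; the paper's graph construction
# and the downstream agent are more involved.
baseline_rules = {
    ("player", "draws", "1_card_per_turn"),
    ("ace", "has_value", "1"),
    ("deck", "contains", "52_cards"),
}

observed_rules = {
    ("player", "draws", "2_cards_per_turn"),   # a "house rule" appeared
    ("ace", "has_value", "1"),
    ("deck", "contains", "52_cards"),
}

added = observed_rules - baseline_rules
removed = baseline_rules - observed_rules
novelty_detected = bool(added or removed)

print(novelty_detected)   # True
print(added, removed)     # the specific graph edges that changed, usable to trigger re-training
```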

10. RL-DARTS: Differentiable Architecture Search for Reinforcement Learning

Yingjie Miao, Xingyou Song, Daiyi Peng, Summer Yue, Eugene Brevdo, Aleksandra Faust

We introduce RL-DARTS, one of the first applications of Differentiable Architecture Search (DARTS) in reinforcement learning (RL) to search for convolutional cells, applied to the Procgen benchmark. We outline the initial difficulties of applying neural architecture search techniques in RL, and demonstrate that by simply replacing the image encoder with a DARTS supernet, our search method is sample-efficient, requires minimal extra compute resources, and is also compatible with off-policy and on-policy RL algorithms, needing only minor changes in preexisting code. Surprisingly, we find that the supernet can be used as an actor for inference to generate replay data in standard RL training loops, and thus train end-to-end. Throughout this training process, we show that the supernet gradually learns better cells, leading to alternative architectures that can be highly competitive against manually designed policies; we also verify previous design choices for RL policies.
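
For readers unfamiliar with DARTS, the sketch below shows the core mixed-operation idea that a supernet is built from; the candidate operations and sizes are illustrative, not the search space used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Core DARTS idea: a cell edge is a softmax-weighted sum of candidate operations,
    with the weights (architecture parameters alpha) learned by gradient descent
    alongside the network weights."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity(),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = F.softmax(self.alpha, dim=0)
        return sum(w[i] * op(x) for i, op in enumerate(self.ops))

x = torch.randn(2, 16, 8, 8)
print(MixedOp()(x).shape)   # torch.Size([2, 16, 8, 8]); alpha receives gradients from the RL loss
```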

11. Glance-and-Gaze Vision Transformer

Qihang Yu, Yingda Xia, Yutong Bai, Yongyi Lu, Alan Yuille, Wei Shen

  • retweets: 156, favorites: 47 (06/08/2021 09:10:34)
  • links: abs | pdf
  • cs.CV

Recently, a series of vision Transformers has emerged that show superior performance with more compact model sizes than conventional convolutional neural networks, thanks to the strong ability of Transformers to model long-range dependencies. However, these advantages come at a price: self-attention, the core component of the Transformer, has quadratic complexity in the input sequence length. Computation and memory costs therefore grow dramatically with sequence length, making it difficult to apply Transformers to vision tasks that require dense predictions on high-resolution feature maps. In this paper, we propose a new vision Transformer, the Glance-and-Gaze Transformer (GG-Transformer), to address these issues. It is motivated by the glance-and-gaze behavior of humans when recognizing objects in natural scenes, and it efficiently models both long-range dependencies and local context. In GG-Transformer, this behavior is realized by two parallel branches: the Glance branch performs self-attention on adaptively dilated partitions of the input, which yields linear complexity while still enjoying a global receptive field; the Gaze branch is implemented by a simple depth-wise convolutional layer, which adds local image context to the features obtained by the Glance mechanism. We empirically demonstrate that our method achieves consistently superior performance over previous state-of-the-art Transformers on various vision tasks and benchmarks. Code and models will be made available at https://github.com/yucornetto/GG-Transformer.
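
A simplified sketch of the two branches follows; the adaptive dilation and other details of the paper are omitted, and the partitioning shown is just one fixed-dilation reading of the idea.

```python
import torch
import torch.nn as nn

class ToyGlanceGaze(nn.Module):
    """Schematic of the two branches: 'Glance' = self-attention within dilated partitions
    (each partition samples the whole image sparsely, so attention is global but cheap);
    'Gaze' = a depth-wise convolution that restores fine-grained local context."""
    def __init__(self, channels: int = 32, heads: int = 4, dilation: int = 4):
        super().__init__()
        self.r = dilation
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.dwconv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        r = self.r
        # Glance: group pixels that share the same (i % r, j % r) offset into one partition.
        parts = (x.reshape(b, c, h // r, r, w // r, r)
                  .permute(0, 3, 5, 2, 4, 1)            # (b, r, r, h/r, w/r, c)
                  .reshape(b * r * r, (h // r) * (w // r), c))
        glanced, _ = self.attn(parts, parts, parts)
        glanced = (glanced.reshape(b, r, r, h // r, w // r, c)
                          .permute(0, 5, 3, 1, 4, 2)      # (b, c, h/r, r, w/r, r)
                          .reshape(b, c, h, w))
        # Gaze: depth-wise convolution adds back local image context.
        return glanced + self.dwconv(x)

x = torch.randn(2, 32, 16, 16)
print(ToyGlanceGaze()(x).shape)   # torch.Size([2, 32, 16, 16])
```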

12. Semantic Correspondence with Transformers

Seokju Cho, Sunghwan Hong, Sangryul Jeon, Yunsung Lee, Kwanghoon Sohn, Seungryong Kim

  • retweets: 86, favorites: 58 (06/08/2021 09:10:34)
  • links: abs | pdf
  • cs.CV

We propose a novel cost aggregation network, called Cost Aggregation with Transformers (CATs), to find dense correspondences between semantically similar images under the additional challenges posed by large intra-class appearance and geometric variations. Compared to previous hand-crafted or CNN-based methods for the cost aggregation stage, which either lack robustness to severe deformations or inherit the limitation of CNNs whose limited receptive fields fail to discriminate incorrect matches, CATs explore global consensus among the initial correlation maps with architectural designs that exploit the full potential of the self-attention mechanism. Specifically, we include appearance affinity modelling to disambiguate the initial correlation maps and multi-level aggregation to benefit from hierarchical feature representations within the Transformer-based aggregator, and we combine these with swapping self-attention and residual connections not only to enforce consistent matching but also to ease the learning process. We conduct experiments to demonstrate the effectiveness of the proposed model over the latest methods and provide extensive ablation studies. Code and trained models will be made available at https://github.com/SunghwanHong/CATs.
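
A toy sketch of the general recipe (build a correlation map between two feature maps, then refine it with a Transformer) is shown below; the appearance-affinity and multi-level components of CATs are omitted and all dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def correlation_map(src_feat: torch.Tensor, trg_feat: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity correlation between two flattened feature maps.
    src_feat, trg_feat: (b, c, h, w) -> correlation: (b, h*w, h*w)."""
    src = F.normalize(src_feat.flatten(2), dim=1)      # (b, c, hw)
    trg = F.normalize(trg_feat.flatten(2), dim=1)
    return torch.einsum('bci,bcj->bij', src, trg)

# Toy cost aggregation: treat each source position's row of the correlation map as a
# token and let a Transformer encoder refine it, so matches can reach global consensus.
b, c, h, w = 2, 64, 8, 8
corr = correlation_map(torch.randn(b, c, h, w), torch.randn(b, c, h, w))   # (2, 64, 64)
aggregator = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=h * w, nhead=4, batch_first=True), num_layers=2)
refined = aggregator(corr)          # (2, 64, 64) refined matching costs
print(refined.shape)
```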

13. nmT5 — Is parallel data still relevant for pre-training massively multilingual language models?

Mihir Kale, Aditya Siddhant, Noah Constant, Melvin Johnson, Rami Al-Rfou, Linting Xue

  • retweets: 56, favorites: 38 (06/08/2021 09:10:35)
  • links: abs | pdf
  • cs.CL

Recently, mT5 - a massively multilingual version of T5 - leveraged a unified text-to-text format to attain state-of-the-art results on a wide variety of multilingual NLP tasks. In this paper, we investigate the impact of incorporating parallel data into mT5 pre-training. We find that multi-tasking language modeling with objectives such as machine translation during pre-training is a straightforward way to improve performance on downstream multilingual and cross-lingual tasks. However, the gains start to diminish as the model capacity increases, suggesting that parallel data might not be as essential for larger models. At the same time, even at larger model sizes, we find that pre-training with parallel data still provides benefits in the limited labelled data regime.

14. SOLQ: Segmenting Objects by Learning Queries

Bin Dong, Fangao Zeng, Tiancai Wang, Xiangyu Zhang, Yichen Wei

  • retweets: 58, favorites: 31 (06/08/2021 09:10:35)
  • links: abs | pdf
  • cs.CV

In this paper, we propose an end-to-end framework for instance segmentation. Based on the recently introduced DETR [1], our method, termed SOLQ, segments objects by learning unified queries. In SOLQ, each query represents one object and has multiple representations: class, location and mask. The learned object queries perform classification, box regression and mask encoding simultaneously in a unified vector form. During the training phase, the encoded mask vectors are supervised by the compression coding of raw spatial masks. At inference time, the produced mask vectors can be directly transformed into spatial masks by the inverse of the compression coding. Experimental results show that SOLQ can achieve state-of-the-art performance, surpassing most existing approaches. Moreover, the joint learning of a unified query representation can greatly improve the detection performance of the original DETR. We hope SOLQ can serve as a strong baseline for Transformer-based instance segmentation. Code is available at https://github.com/megvii-research/SOLQ.
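
A minimal sketch of one possible compression coding for mask vectors, using a 2-D DCT and its inverse; the exact transform, vector length, and thresholding used by SOLQ may differ.

```python
import numpy as np
from scipy.fft import dctn, idctn

def encode_mask(mask: np.ndarray, k: int = 16) -> np.ndarray:
    """Compress a binary spatial mask into a fixed-length vector by keeping the
    low-frequency k x k block of its 2-D DCT."""
    coeffs = dctn(mask.astype(np.float64), norm='ortho')
    return coeffs[:k, :k].flatten()

def decode_mask(vec: np.ndarray, shape: tuple, k: int = 16) -> np.ndarray:
    """Inverse process: scatter the vector back into the low-frequency block,
    apply the inverse DCT, and threshold to recover a spatial mask."""
    coeffs = np.zeros(shape)
    coeffs[:k, :k] = vec.reshape(k, k)
    return (idctn(coeffs, norm='ortho') > 0.5).astype(np.uint8)

mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 16:48] = 1                        # a square object
vec = encode_mask(mask)                       # one fixed-length mask vector per query
recon = decode_mask(vec, mask.shape)
print(vec.shape, (recon == mask).mean())      # (256,) and the fraction of correctly recovered pixels
```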

15. Do Syntactic Probes Probe Syntax? Experiments with Jabberwocky Probing

Rowan Hall Maudslay, Ryan Cotterell

  • retweets: 44, favorites: 38 (06/08/2021 09:10:35)
  • links: abs | pdf
  • cs.CL | cs.LG

Analysing whether neural language models encode linguistic information has become popular in NLP. One method of doing so, which is frequently cited to support the claim that models like BERT encode syntax, is called probing; probes are small supervised models trained to extract linguistic information from another model’s output. If a probe is able to predict a particular structure, it is argued that the model whose output it is trained on must have implicitly learnt to encode it. However, drawing a generalisation about a model’s linguistic knowledge of a specific phenomenon based on what a probe is able to learn may be problematic: in this work, we show that semantic cues in training data mean that syntactic probes do not properly isolate syntax. We generate a new corpus of semantically nonsensical but syntactically well-formed Jabberwocky sentences, which we use to evaluate two probes trained on normal data. We train the probes on several popular language models (BERT, GPT, and RoBERTa), and find that in all settings they perform worse when evaluated on these data, for one probe by an average of 15.4 UUAS points absolute. Although in most cases they still outperform the baselines, their lead is reduced substantially, e.g. by 53% in the case of BERT for one probe. This raises the question: what empirical scores constitute knowing syntax?

16. A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning

Mingde Zhao, Zhen Liu, Sitao Luan, Shuyuan Zhang, Doina Precup, Yoshua Bengio

  • retweets: 44, favorites: 26 (06/08/2021 09:10:35)
  • links: abs | pdf
  • cs.AI | cs.LG

We present an end-to-end, model-based deep reinforcement learning agent which dynamically attends to relevant parts of its state, in order to plan and to generalize better out-of-distribution. The agent’s architecture uses a set representation and a bottleneck mechanism, forcing the number of entities to which the agent attends at each planning step to be small. In experiments with customized MiniGrid environments with different dynamics, we observe that the design allows agents to learn to plan effectively, by attending to the relevant objects, leading to better out-of-distribution generalization.
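
The bottleneck can be illustrated with a toy top-k selection over a set of entity vectors; the scoring function and k below are illustrative, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class SetBottleneck(nn.Module):
    """Toy bottleneck over a set of entity vectors: score every entity and keep only
    the top-k, so the planner attends to a small number of entities per step."""
    def __init__(self, dim: int = 32, k: int = 4):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.k = k

    def forward(self, entities: torch.Tensor) -> torch.Tensor:
        # entities: (batch, n_entities, dim)
        scores = self.score(entities).squeeze(-1)                 # (batch, n_entities)
        idx = scores.topk(self.k, dim=-1).indices                 # indices of attended entities
        return torch.gather(entities, 1, idx.unsqueeze(-1).expand(-1, -1, entities.size(-1)))

state = torch.randn(2, 20, 32)        # 20 candidate entities per state
attended = SetBottleneck()(state)      # (2, 4, 32): the small set the agent plans over
print(attended.shape)
```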

17. Fundamental tradeoffs between memorization and robustness in random features and neural tangent regimes

Elvis Dohmatob

This work studies the (non)robustness of two-layer neural networks in various high-dimensional linearized regimes. We establish fundamental trade-offs between memorization and robustness, as measured by the Sobolev seminorm of the model w.r.t. the data distribution, i.e. the square root of the average squared $L_2$-norm of the gradients of the model w.r.t. its input. More precisely, if $n$ is the number of training examples, $d$ is the input dimension, and $k$ is the number of hidden neurons in a two-layer neural network, we prove for a large class of activation functions that, if the model memorizes even a fraction of the training data, then its Sobolev seminorm is lower-bounded by (i) $\sqrt{n}$ in the case of infinite-width random features (RF) or the neural tangent kernel (NTK) with $d \gtrsim n$; (ii) $\sqrt{n}$ in the case of finite-width RF with proportionate scaling of $d$ and $k$; and (iii) $\sqrt{n/k}$ in the case of finite-width NTK with proportionate scaling of $d$ and $k$. Moreover, all of these lower bounds are tight: they are attained by the min-norm / least-squares interpolator (when $n$, $d$, and $k$ are in the appropriate interpolating regime). All our results hold as soon as the data is log-concave isotropic and there is label noise, i.e. the target variable is not a deterministic function of the data / features. We empirically validate our theoretical results with experiments. Incidentally, these experiments also reveal for the first time (iv) a multiple-descent phenomenon in the robustness of the min-norm interpolator.
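
In symbols, the robustness measure described above (our transcription of the abstract’s verbal definition) is the Sobolev seminorm

```latex
% Sobolev seminorm used as the (non)robustness measure: the square root of the
% average (over the data distribution P) squared L2-norm of the model's input gradients.
\|f\|_{S} = \sqrt{\mathbb{E}_{x \sim P}\left[\left\|\nabla_{x} f(x)\right\|_{2}^{2}\right]}
```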

18. Eliciting Spoken Interruptions to Inform Proactive Speech Agent Design

Justin Edwards, Christian Janssen, Sandy Gould, Benjamin R Cowan

  • retweets: 42, favorites: 21 (06/08/2021 09:10:35)
  • links: abs | pdf
  • cs.HC

Current speech agent interactions are typically user-initiated, limiting the interactions they can deliver. Future functionality will require agents to be proactive, sometimes interrupting users. Little is known about how these spoken interruptions should be designed, especially in urgent interruption contexts. We look to inform design of proactive agent interruptions through investigating how people interrupt others engaged in complex tasks. We therefore developed a new technique to elicit human spoken interruptions of people engaged in other tasks. We found that people interrupted sooner when interruptions were urgent. Some participants used access rituals to forewarn interruptions, but most rarely used them. People balanced speed and accuracy in timing interruptions, often using cues from the task they interrupted. People also varied phrasing and delivery of interruptions to reflect urgency. We discuss how our findings can inform speech agent design and how our paradigm can help gain insight into human interruptions in new contexts.

19. The Image Local Autoregressive Transformer

Chenjie Cao, Yuxin Hong, Xiang Li, Chengrong Wang, Chengming Xu, XiangYang Xue, Yanwei Fu

Recently, AutoRegressive (AR) models for whole-image generation, empowered by transformers, have achieved performance comparable to or even better than Generative Adversarial Networks (GANs). Unfortunately, directly applying such AR models to edit or change local image regions may suffer from missing global information, slow inference, and information leakage from the local guidance. To address these limitations, we propose a novel model, the image Local Autoregressive Transformer (iLAT), to better facilitate locally guided image synthesis. iLAT learns novel local discrete representations via the newly proposed local autoregressive (LA) transformer with its attention-masking and convolution mechanisms. Thus iLAT can efficiently synthesize local image regions from key guidance information. iLAT is evaluated on various locally guided image synthesis tasks, such as pose-guided person image synthesis and face editing. Both quantitative and qualitative results show the efficacy of our model.
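
A toy version of a locally autoregressive attention mask is sketched below: context tokens stay fully visible, while tokens in the edited region are generated autoregressively among themselves. This is a simplified reading of the idea, not iLAT’s exact masking scheme.

```python
import numpy as np

def local_ar_mask(n_tokens: int, local_idx: list[int]) -> np.ndarray:
    """Toy attention mask for locally guided autoregressive generation.

    Tokens outside `local_idx` (the region being edited) act as global context and are
    always visible. Tokens inside the region may attend to all context tokens but only
    to earlier tokens of the region, so generation is autoregressive only locally.
    Returns an (n_tokens, n_tokens) matrix where mask[i, j] = 1 means token i may attend to j.
    """
    mask = np.ones((n_tokens, n_tokens), dtype=np.uint8)
    for pos, i in enumerate(local_idx):
        for j in local_idx[pos + 1:]:
            mask[i, j] = 0              # a local token cannot see later local tokens
    return mask

print(local_ar_mask(8, local_idx=[3, 4, 5]))
```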