1. How Great is the Great Firewall? Measuring China’s DNS Censorship
Nguyen Phong Hoang, Arian Akhavan Niaki, Jakub Dalek, Jeffrey Knockel, Pellaeon Lin, Bill Marczak, Masashi Crete-Nishihata, Phillipa Gill, Michalis Polychronakis
The DNS filtering apparatus of China’s Great Firewall (GFW) has evolved considerably over the past two decades. However, most prior studies of China’s DNS filtering were performed over short time periods, leading to unnoticed changes in the GFW’s behavior. In this study, we introduce GFWatch, a large-scale, longitudinal measurement platform capable of testing hundreds of millions of domains daily, enabling continuous monitoring of the GFW’s DNS filtering behavior. We present the results of running GFWatch over a nine-month period, during which we tested an average of 411M domains per day and detected a total of 311K domains censored by the GFW’s DNS filter. To the best of our knowledge, this is the largest number of domains tested and censored domains discovered in the literature. We further reverse engineer regular expressions used by the GFW and find 41K innocuous domains that match these filters, resulting in overblocking of their content. We also observe bogus IPv6 and globally routable IPv4 addresses injected by the GFW, including addresses owned by US companies such as Facebook, Dropbox, and Twitter. Using data from GFWatch, we study the impact of GFW blocking on the global DNS system, finding 77K censored domains with DNS resource records polluted in popular public DNS resolvers such as Google and Cloudflare. Finally, we propose strategies to detect poisoned responses that can (1) sanitize poisoned DNS records from the cache of public DNS resolvers, and (2) assist in the development of circumvention tools to bypass the GFW’s DNS censorship.
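As a minimal sketch of the core measurement idea (not the authors' GFWatch code), one can detect on-path DNS injection by sending a query toward an IP address inside China that runs no DNS resolver: any answer that comes back must have been injected along the path. The domain and target IP below are placeholders.

```python
# Minimal sketch of the on-path DNS injection test (not the authors' GFWatch code).
# Idea: query a test domain toward an IP inside China that runs no DNS resolver;
# any response received must have been injected by middleboxes such as the GFW.
# TEST_DOMAIN and PROBE_IP are placeholders; choose real values yourself.
import socket

import dns.message  # pip install dnspython

TEST_DOMAIN = "example-censored-domain.com"  # placeholder
PROBE_IP = "203.0.113.1"                     # placeholder (TEST-NET-3 address)

def probe(domain: str, target_ip: str, timeout: float = 3.0):
    query = dns.message.make_query(domain, "A")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(query.to_wire(), (target_ip, 53))
        wire, _ = sock.recvfrom(4096)
    except socket.timeout:
        return None  # no injection observed for this query
    finally:
        sock.close()
    return dns.message.from_wire(wire)  # any answer here implies on-path injection

if __name__ == "__main__":
    response = probe(TEST_DOMAIN, PROBE_IP)
    if response is not None:
        print("Injected answers:", [str(rr) for rr in response.answer])
    else:
        print("No response: domain likely not censored (for this query).")
```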
For the last several months, you might have noticed many tweets of mine reporting new domains censored by the Great Firewall. Today, I’m happy to share the research paper behind these tweets, which will be presented at the 30th @USENIXSecurity Symposium.https://t.co/iRGr4VcJSi pic.twitter.com/IB8xnnSmd7
— Phong (@NP_tokumei) June 7, 2021
2. Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning
Jannik Kossen, Neil Band, Clare Lyle, Aidan N. Gomez, Tom Rainforth, Yarin Gal
We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms. However, unlike conventional non-parametric models, we let the model learn end-to-end from the data how to make use of other datapoints for prediction. Empirically, our models solve cross-datapoint lookup and complex reasoning tasks unsolvable by traditional deep learning models. We show highly competitive results on tabular data, early results on CIFAR-10, and give insight into how the model makes use of the interactions between points.
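A minimal sketch of the central idea, attention across the datapoint (batch) axis rather than within a single input, using a standard multi-head attention layer. This is an illustration only, not the authors' full Non-Parametric Transformer, and the module name is mine.

```python
# Sketch: self-attention *between datapoints*, i.e. across the batch axis.
# Illustrative only; not the authors' full Non-Parametric Transformer.
import torch
import torch.nn as nn

class AttentionBetweenDatapoints(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_datapoints, d_model), i.e. the whole dataset (or a large batch) at once.
        h = x.unsqueeze(0)                    # (1, n_datapoints, d_model): datapoints play the "sequence" role
        out, _ = self.attn(h, h, h)           # every datapoint attends to every other datapoint
        return self.norm(x + out.squeeze(0))  # residual + norm, as in a standard Transformer block

# Usage: embed each row of a table into d_model features, then let rows exchange information.
dataset = torch.randn(128, 64)               # 128 datapoints, 64-dim embeddings
layer = AttentionBetweenDatapoints()
print(layer(dataset).shape)                  # torch.Size([128, 64])
```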
Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning
— Aran Komatsuzaki (@arankomatsuzaki) June 7, 2021
Introduces a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time.https://t.co/ARIsdy3nUg pic.twitter.com/UfSeJRSDwX
3. X-volution: On the unification of convolution and self-attention
Xuanhong Chen, Hang Wang, Bingbing Ni
Convolution and self-attention act as two fundamental building blocks in deep neural networks, where the former extracts local image features in a linear way while the latter non-locally encodes high-order contextual relationships. Though essentially complementary to each other (first- vs. high-order), state-of-the-art architectures, i.e., CNNs or Transformers, lack a principled way to simultaneously apply both operations in a single computational module, due to their heterogeneous computing patterns and the excessive burden of global dot-products for visual tasks. In this work, we theoretically derive a global self-attention approximation scheme, which approximates self-attention via convolution operations on transformed features. Based on this approximation, we establish a multi-branch elementary module composed of both convolution and self-attention operations, capable of unifying both local and non-local feature interaction. Importantly, once trained, this multi-branch module can be conditionally converted into a single standard convolution operation via structural re-parameterization, rendering a pure convolution-style operator named X-volution, ready to be plugged into any modern network as an atomic operation. Extensive experiments demonstrate that the proposed X-volution achieves highly competitive visual understanding improvements (+1.2% top-1 accuracy on ImageNet classification, +1.7 box AP and +1.5 mask AP on COCO detection and segmentation).
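The folding step mentioned in the abstract is structural re-parameterization. The sketch below shows that mechanism in its generic RepVGG-style form (fusing a parallel 3x3 and 1x1 convolution into one 3x3 convolution), not the actual X-volution operator.

```python
# Generic structural re-parameterization sketch (RepVGG-style), illustrating how a
# multi-branch module can be folded into one standard convolution after training.
# This is NOT the actual X-volution operator, only the folding mechanism it relies on.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiBranch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=True)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=True)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x)

    def fuse(self) -> nn.Conv2d:
        # Pad the 1x1 kernel to 3x3 and add it to the 3x3 kernel; sum the biases.
        fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels, 3, padding=1, bias=True)
        w1_as_3 = F.pad(self.conv1.weight, [1, 1, 1, 1])  # (O, I, 1, 1) -> (O, I, 3, 3)
        fused.weight.data = self.conv3.weight.data + w1_as_3
        fused.bias.data = self.conv3.bias.data + self.conv1.bias.data
        return fused

x = torch.randn(1, 8, 16, 16)
m = MultiBranch(8)
print(torch.allclose(m(x), m.fuse()(x), atol=1e-5))  # True: same function, single conv
```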
The architecture debate had descended into chaos: "Attention is all you need," "Maybe convolution is better?", "Back to basics, MLPs are the strongest." Now the idea has finally arrived: "Then let's just combine the best of Attention and Convolution."
— えるエル (@ImAI_Eruel) June 7, 2021
X-volution: On the unification of convolution and self-attentionhttps://t.co/D0pgRgciCB pic.twitter.com/Ry58IbPWyE
X-volution: On the Unification of Convolution and Self-attention
— AK (@ak92501) June 7, 2021
pdf: https://t.co/QhTwIybRvx
abs: https://t.co/NyFh07kpGE
+1.2% top-1 accuracy on ImageNet classification, +1.7 box AP and +1.5 mask AP on COCO detection and segmentation pic.twitter.com/wxiRb0KY0l
4. MERLOT: Multimodal Neural Script Knowledge Models
Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi
As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech — in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes. On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%, even those that make heavy use of auxiliary supervised data (like object bounding boxes). Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives that encourage full-stack multimodal reasoning, from the recognition to cognition level.
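The frame-level objective of matching images to temporally corresponding words can be illustrated with a generic InfoNCE-style contrastive loss over paired frame and transcript-segment embeddings. This is a sketch of the idea, not MERLOT's exact objective or architecture.

```python
# Generic contrastive frame-transcript matching loss (InfoNCE-style), sketching the
# "match images to temporally corresponding words" objective. Not MERLOT's exact loss.
import torch
import torch.nn.functional as F

def frame_text_contrastive_loss(frame_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    # frame_emb, text_emb: (N, D), where row i of each comes from the same video segment.
    f = F.normalize(frame_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = f @ t.T / temperature                 # (N, N) similarity of every frame to every caption
    targets = torch.arange(f.size(0))              # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

frames = torch.randn(32, 256)   # e.g. pooled visual features for 32 video frames
texts = torch.randn(32, 256)    # e.g. pooled transcript-segment embeddings
print(frame_text_contrastive_loss(frames, texts).item())
```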
Introducing MERLOT: a new model that learns about language, vision, & the world from 6M YouTube videos.
— Rowan Zellers (@rown) June 7, 2021
Out-of-the-box, MERLOT has intrinsic notions of multimodal temporal commonsense. When finetuned, we get SOTA performance on 12 video tasks + VCR.https://t.co/2H6ng2Yfxt pic.twitter.com/oMPPtOjLBm
🍷Super excited about our new preprint!🍷
— Jack Hessel (@jmhessel) June 7, 2021
𝓜𝓔𝓡𝓛𝓞𝓣: Multimodal Script Knowledge Models!https://t.co/qkUVY8Im5Bhttps://t.co/msAKGFzewv
TL;DR: By pretraining on 6M youtube videos, we transfer with SoTA performance on 10+ tasks (e.g. Video QA) that require temporal reasoning pic.twitter.com/sqbf7qo0hr
MERLOT: Multimodal Neural Script Knowledge Models
— AK (@ak92501) June 7, 2021
pdf: https://t.co/vzmHC42rI4
abs: https://t.co/3ADDscKw8i
project page: https://t.co/LhPfzluxqd
learns multimodal script knowledge, watching millions of YT videos with transcribed speech in entirely label-free, ss manner pic.twitter.com/W7B9OVOF9c
5. Ukiyo-e Analysis and Creativity with Attribute and Geometry Annotation
Yingtao Tian, Tarin Clanuwat, Chikahiko Suzuki, Asanobu Kitamoto
The study of Ukiyo-e, an important genre of pre-modern Japanese art, focuses on objects and styles, as does research on other artwork. Such study has benefited from renewed interest by the machine learning community in culturally important topics, leading to interdisciplinary work including collections of images, quantitative approaches, and machine learning-based creative applications. These efforts, however, have several drawbacks, and it remains challenging to integrate them into a comprehensive view. To bridge this gap, we propose a holistic approach. We first present a large-scale Ukiyo-e dataset with coherent semantic labels and geometric annotations, then show its value in a quantitative study of the objects depicted in Ukiyo-e paintings using these labels and annotations. We further demonstrate that machine learning methods can aid the study of style through soft color decomposition of Ukiyo-e, and finally provide joint insights into object and style by composing sketches and colors using colorization. Dataset available at https://github.com/rois-codh/arc-ukiyoe-faces
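As a generic illustration of soft color decomposition (not necessarily the decomposition method used in the paper), one can factor an image's pixels into a small palette with non-negative soft weights via NMF.

```python
# Generic soft color decomposition sketch via NMF: pixels ~ weights @ palette, with
# non-negative soft weights per pixel. Illustrative only; not necessarily the authors' method.
import numpy as np
from sklearn.decomposition import NMF

rgb = np.random.rand(64, 64, 3)                      # stand-in for a loaded Ukiyo-e image in [0, 1]
pixels = rgb.reshape(-1, 3)                          # (H*W, 3)

nmf = NMF(n_components=5, init="random", max_iter=500, random_state=0)
weights = nmf.fit_transform(pixels)                  # (H*W, 5) soft layer weights per pixel
palette = nmf.components_                            # (5, 3) base colors

reconstruction = (weights @ palette).reshape(rgb.shape)
print("palette colors:\n", palette)
print("reconstruction error:", np.abs(rgb - reconstruction).mean())
```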
Our work "Ukiyo-e Analysis and Creativity with Attribute and Geometry Annotation" has been accepted #iccc21!
— Yingtao Tian (@alanyttian) June 7, 2021
Ukiyo-e paintings with labelled attributes and automatically extracted face landmarks, allowing quantitative analysis and fun ML experiments. https://t.co/dDOL216N5Z pic.twitter.com/CCuiK44OOZ
6. Light Field Networks: Neural Scene Representations with Single-Evaluation Rendering
Vincent Sitzmann, Semon Rezchikov, William T. Freeman, Joshua B. Tenenbaum, Fredo Durand
- retweets: 1084, favorites: 230 (06/08/2021 09:10:33)
- links: abs | pdf
- cs.CV | cs.AI | cs.GR | cs.LG | cs.MM
Inferring representations of 3D scenes from 2D observations is a fundamental problem of computer graphics, computer vision, and artificial intelligence. Emerging 3D-structured neural scene representations are a promising approach to 3D scene understanding. In this work, we propose a novel neural scene representation, Light Field Networks or LFNs, which represent both geometry and appearance of the underlying 3D scene in a 360-degree, four-dimensional light field parameterized via a neural implicit representation. Rendering a ray from an LFN requires only a single network evaluation, as opposed to hundreds of evaluations per ray for ray-marching or volumetric based renderers in 3D-structured neural scene representations. In the setting of simple scenes, we leverage meta-learning to learn a prior over LFNs that enables multi-view consistent light field reconstruction from as little as a single image observation. This results in dramatic reductions in time and memory complexity, and enables real-time rendering. The cost of storing a 360-degree light field via an LFN is two orders of magnitude lower than conventional methods such as the Lumigraph. Utilizing the analytical differentiability of neural implicit representations and a novel parameterization of light space, we further demonstrate the extraction of sparse depth maps from LFNs.
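A minimal sketch of the single-evaluation idea, assuming Plücker coordinates for rays and a small MLP mapping each ray directly to a color; the real LFN parameterization and architecture differ in their details.

```python
# Minimal sketch: a light field network maps a ray (here in 6D Pluecker coordinates)
# directly to a color with ONE network evaluation, with no ray marching.
# Sketch under my own assumptions; the real LFN differs in its details.
import torch
import torch.nn as nn

def pluecker(origin: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    d = direction / direction.norm(dim=-1, keepdim=True)
    m = torch.cross(origin, d, dim=-1)          # moment of the ray
    return torch.cat([d, m], dim=-1)            # (N, 6)

class LightFieldNet(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),                # RGB
        )

    def forward(self, origins, directions):
        return self.mlp(pluecker(origins, directions))  # one evaluation per ray

rays_o = torch.randn(1024, 3)
rays_d = torch.randn(1024, 3)
print(LightFieldNet()(rays_o, rays_d).shape)     # torch.Size([1024, 3])
```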
Light Field Networks: Neural Scene Representations with Single-Evaluation Rendering https://t.co/inQAm1tbd5
— Tomasz Malisiewicz (@quantombone) June 7, 2021
New work based on neural 3D representations that shows how to perform really fast “single-evaluation” rendering. #computervision #deeplearning #3D pic.twitter.com/IQ6zQJzSmh
Light Field Networks: Neural Scene Representations with Single-Evaluation Rendering
— AK (@ak92501) June 7, 2021
pdf: https://t.co/WqXVBY51yA
abs: https://t.co/q45R1v06TH
neural scene representation directly parameterizes the full 360-degree, 4D light field of a 3D scene via a neural implicit representation pic.twitter.com/fENMzAlPWq
7. Solving Schrödinger Bridges via Maximum Likelihood
Francisco Vargas, Pierre Thodoroff, Neil D. Lawrence, Austen Lamacraft
The Schrödinger bridge problem (SBP) finds the most likely stochastic evolution between two probability distributions given a prior stochastic evolution. As well as applications in the natural sciences, problems of this kind have important applications in machine learning such as dataset alignment and hypothesis testing. Whilst the theory behind this problem is relatively mature, scalable numerical recipes to estimate the Schrödinger bridge remain an active area of research. We prove an equivalence between the SBP and maximum likelihood estimation, enabling direct application of successful machine learning techniques. We propose a numerical procedure to estimate SBPs using Gaussian processes and demonstrate the practical usage of our approach in numerical simulations and experiments.
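Per the authors' tweet below, the procedure is an approximate IPFP/Sinkhorn variant based on time reversal of diffusions, with drifts estimated by Gaussian process regression. The toy sketch here illustrates only the drift-estimation-by-regression ingredient (Euler-Maruyama simulation plus a GP fit of increments on states); it is not the full bridge algorithm, and all constants are made up.

```python
# Toy sketch of drift estimation by regression: simulate an SDE with Euler-Maruyama,
# then fit a GP mapping state -> mean increment / dt (an estimate of the drift).
# This is NOT the full Schroedinger-bridge / IPFP procedure from the paper.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
dt, n_steps, n_paths, sigma = 0.01, 50, 20, 0.3
true_drift = lambda x: -x  # Ornstein-Uhlenbeck drift as a stand-in

# Simulate forward paths.
x = rng.normal(1.0, 0.1, size=n_paths)
states, increments = [], []
for _ in range(n_steps):
    dx = true_drift(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
    states.append(x.copy())
    increments.append(dx / dt)
    x = x + dx

X = np.concatenate(states).reshape(-1, 1)
y = np.concatenate(increments)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=sigma**2 / dt)
gp.fit(X, y)
print("estimated drift at x=0.5:", gp.predict(np.array([[0.5]]))[0], "(true: -0.5)")
```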
Solving Schrödinger bridges via maximum likelihood arxiv: https://t.co/AhcnXhFOvL
— Neil Lawrence (@lawrennd) June 7, 2021
We propose an approximate IPFP/Sinkhorn variant based on the time reversal of diffusions with the goal of learning meaningful interpolating dynamics between two distributions. pic.twitter.com/47uekAkQLg
8. Few-Shot Segmentation via Cycle-Consistent Transformer
Gengwei Zhang, Guoliang Kang, Yunchao Wei, Yi Yang
Few-shot segmentation aims to train a segmentation model that can quickly adapt to novel classes with few exemplars. The conventional training paradigm is to learn to make predictions on query images conditioned on the features from support images. Previous methods only utilized the semantic-level prototypes of support images as the conditional information. These methods cannot utilize all pixel-wise support information for the query predictions, which is however critical for the segmentation task. In this paper, we focus on utilizing pixel-wise relationships between support and target images to facilitate the few-shot semantic segmentation task. We design a novel Cycle-Consistent Transformer (CyCTR) module to aggregate pixel-wise support features into query ones. CyCTR performs cross-attention between features from different images, i.e., support and query images. We observe that there may exist unexpected irrelevant pixel-level support features. Directly performing cross-attention may aggregate these features from support to query and bias the query features. Thus, we propose a novel cycle-consistent attention mechanism to filter out possibly harmful support features and encourage query features to attend to the most informative pixels from support images. Experiments on all few-shot segmentation benchmarks demonstrate that our proposed CyCTR leads to remarkable improvement compared to previous state-of-the-art methods. Specifically, on the Pascal-5i and COCO-20i datasets, we achieve 66.6% and 45.6% mIoU for 5-shot segmentation, outperforming the previous state of the art by 4.6% and 7.1%, respectively.
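One plausible reading of the cycle-consistency filter (not the authors' exact CyCTR module): keep a support pixel only if following its strongest query match and then that query's strongest support match lands back on a pixel with the same label.

```python
# Sketch of a cycle-consistent attention filter between query and support pixel features.
# One plausible reading of the abstract, not the authors' exact CyCTR module.
import torch

def cycle_consistent_mask(query_feat, support_feat, support_labels):
    # query_feat: (Nq, C), support_feat: (Ns, C), support_labels: (Ns,) in {0, 1}
    affinity = query_feat @ support_feat.T                 # (Nq, Ns)
    # Forward step: each support pixel picks the query pixel that matches it best.
    best_query = affinity.argmax(dim=0)                    # (Ns,)
    # Backward step: that query pixel picks its best-matching support pixel.
    best_support_of_that_query = affinity[best_query].argmax(dim=1)  # (Ns,)
    # Keep support pixel j only if the cycle lands on a pixel with the same label.
    consistent = support_labels[best_support_of_that_query] == support_labels
    return consistent                                      # (Ns,) bool mask on support pixels

q = torch.randn(100, 32)        # flattened query pixel features
s = torch.randn(80, 32)         # flattened support pixel features
labels = torch.randint(0, 2, (80,))
mask = cycle_consistent_mask(q, s, labels)
print("support pixels kept:", int(mask.sum()), "/", len(mask))
```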
Few-Shot Segmentation via Cycle-Consistent Transformer
— AK (@ak92501) June 7, 2021
pdf: https://t.co/2O4fUUFuSY
abs: https://t.co/f6lirYQG1d
on Pascal-5i and COCO-20i datasets, achieve 66.6% and 45.6% mIoU for 5-shot segmentation, outperforming previous sota by 4.6% and 7.1% respectively pic.twitter.com/b8zxCrlktm
9. Detecting and Adapting to Novelty in Games
Xiangyu Peng, Jonathan C. Balloch, Mark O. Riedl
Open-world novelty occurs when the rules of an environment can change abruptly, such as when a game player encounters “house rules”. To address open-world novelty, game playing agents must be able to detect when novelty is injected, and to quickly adapt to the new rules. We propose a model-based reinforcement learning approach where game state and rules are represented as knowledge graphs. The knowledge graph representation of the state and rules allows novelty to be detected as changes in the knowledge graph, assists with the training of deep reinforcement learners, and enables imagination-based re-training where the agent uses the knowledge graph to perform look-ahead.
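A tiny sketch of the representation idea: model state and rules as a knowledge graph of triples, and flag novelty whenever the triple set changes between observations. Illustrative only; the example facts are invented.

```python
# Sketch: represent game state/rules as a knowledge graph (a set of triples) and detect
# novelty as a change in that set. Illustrative only; not the authors' full agent.
def to_triples(facts):
    return set(facts)

before = to_triples([
    ("player", "has", "card_7"),
    ("rule", "draw_count", "1"),
    ("deck", "size", "52"),
])
after = to_triples([
    ("player", "has", "card_7"),
    ("rule", "draw_count", "2"),   # a "house rule" changed the draw count
    ("deck", "size", "52"),
])

added, removed = after - before, before - after
novelty_detected = bool(added or removed)
print("novelty:", novelty_detected, "| added:", added, "| removed:", removed)
```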
Detecting and Adapting to Novelty in Games
— AK (@ak92501) June 7, 2021
pdf: https://t.co/xqLavTbywA
abs: https://t.co/rAaHfc0H4W
model-based reinforcement learning approach where game state and rules are represented as knowledge graphs pic.twitter.com/92N1grQIzn
10. RL-DARTS: Differentiable Architecture Search for Reinforcement Learning
Yingjie Miao, Xingyou Song, Daiyi Peng, Summer Yue, Eugene Brevdo, Aleksandra Faust
We introduce RL-DARTS, one of the first applications of Differentiable Architecture Search (DARTS) in reinforcement learning (RL) to search for convolutional cells, applied to the Procgen benchmark. We outline the initial difficulties of applying neural architecture search techniques in RL, and demonstrate that by simply replacing the image encoder with a DARTS supernet, our search method is sample-efficient, requires minimal extra compute resources, and is compatible with off-policy and on-policy RL algorithms, needing only minor changes to preexisting code. Surprisingly, we find that the supernet can be used as an actor for inference to generate replay data in standard RL training loops, and thus train end-to-end. Throughout this training process, we show that the supernet gradually learns better cells, leading to alternative architectures that are highly competitive with manually designed policies, while also verifying previous design choices for RL policies.
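The DARTS ingredient the abstract relies on can be sketched as a mixed operation whose output is a softmax-weighted sum of candidate ops, making the encoder differentiable with respect to architecture weights. This is not the authors' full RL-DARTS setup.

```python
# Sketch of a DARTS-style "mixed op": a softmax-weighted sum of candidate ops, so the
# encoder is differentiable w.r.t. architecture weights. Not the full RL-DARTS setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity(),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture weights

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

x = torch.randn(2, 16, 32, 32)   # e.g. Procgen frames after a stem conv
print(MixedOp(16)(x).shape)       # torch.Size([2, 16, 32, 32])
```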
RL-DARTS: Differentiable Architecture Search for Reinforcement Learning
— AK (@ak92501) June 7, 2021
pdf: https://t.co/53XSc0o0lR
abs: https://t.co/hkb1YqpLka
one of the first applications of Differentiable Architecture Search in RL to search for convolutional cells, applied to the Procgen benchmark pic.twitter.com/OLvGXDSEX6
11. Glance-and-Gaze Vision Transformer
Qihang Yu, Yingda Xia, Yutong Bai, Yongyi Lu, Alan Yuille, Wei Shen
Recently, a series of vision Transformers has emerged, showing superior performance with a more compact model size than conventional convolutional neural networks, thanks to the strong ability of Transformers to model long-range dependencies. However, the advantages of vision Transformers come at a price: self-attention, the core part of the Transformer, has quadratic complexity in the input sequence length. This leads to a dramatic increase in computation and memory cost as the sequence length grows, introducing difficulties when applying Transformers to vision tasks that require dense predictions based on high-resolution feature maps. In this paper, we propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer), to address the aforementioned issues. It is motivated by the Glance-and-Gaze behavior of human beings when recognizing objects in natural scenes, and it can efficiently model both long-range dependencies and local context. In GG-Transformer, the Glance and Gaze behavior is realized by two parallel branches: the Glance branch performs self-attention on adaptively-dilated partitions of the input, which yields linear complexity while still enjoying a global receptive field; the Gaze branch is implemented by a simple depth-wise convolutional layer, which complements the features obtained by the Glance mechanism with local image context. We empirically demonstrate that our method achieves consistently superior performance over previous state-of-the-art Transformers on various vision tasks and benchmarks. The code and models will be made available at https://github.com/yucornetto/GG-Transformer.
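A simplified sketch of the two branches (not the exact GG-Transformer block): a Glance branch that attends within dilated partitions of the token sequence, and a Gaze branch implemented as a depth-wise convolution.

```python
# Simplified sketch of the two branches: "Glance" = self-attention within dilated
# partitions of the tokens; "Gaze" = depth-wise convolution. Not the exact GG block.
import torch
import torch.nn as nn

class GlanceGazeBlock(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4, dilation: int = 4):
        super().__init__()
        self.dilation = dilation
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gaze = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depth-wise conv

    def forward(self, x):                      # x: (B, C, H, W), with H*W divisible by dilation
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        d = self.dilation
        # Glance: group every d-th token into one partition and attend within each partition.
        parts = tokens.view(B, -1, d, C).transpose(1, 2).reshape(B * d, -1, C)
        glance, _ = self.attn(parts, parts, parts)
        glance = glance.reshape(B, d, -1, C).transpose(1, 2).reshape(B, H * W, C)
        glance = glance.transpose(1, 2).view(B, C, H, W)
        return glance + self.gaze(x)           # combine long-range and local context

x = torch.randn(2, 64, 16, 16)
print(GlanceGazeBlock()(x).shape)              # torch.Size([2, 64, 16, 16])
```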
Glance-and-Gaze Vision Transformer
— AK (@ak92501) June 7, 2021
pdf: https://t.co/GGcirv36Fz
abs: https://t.co/WsvVAJt6vS
parallel and complementary Glance branch and Gaze branch, which offer long-range relationship and short-range modeling pic.twitter.com/ewkYVocgIh
12. Semantic Correspondence with Transformers
Seokju Cho, Sunghwan Hong, Sangryul Jeon, Yunsung Lee, Kwanghoon Sohn, Seungryong Kim
We propose a novel cost aggregation network, called Cost Aggregation with Transformers (CATs), to find dense correspondences between semantically similar images under the additional challenges posed by large intra-class appearance and geometric variations. Compared to previous hand-crafted or CNN-based methods addressing the cost aggregation stage, which either lack robustness to severe deformations or inherit the limitation of CNNs that fail to discriminate incorrect matches due to limited receptive fields, CATs explores global consensus among the initial correlation maps with the help of architectural designs that allow us to exploit the full potential of the self-attention mechanism. Specifically, we include appearance affinity modelling to disambiguate the initial correlation maps and multi-level aggregation to benefit from hierarchical feature representations within the Transformer-based aggregator, and combine these with swapping self-attention and residual connections not only to enforce consistent matching but also to ease the learning process. We conduct experiments to demonstrate the effectiveness of the proposed model over the latest methods and provide extensive ablation studies. Code and trained models will be made available at https://github.com/SunghwanHong/CATs.
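A simplified sketch of cost aggregation with a Transformer: build a correlation volume between two feature maps and refine it with a Transformer encoder. The full CATs model additionally uses appearance affinity modelling, multi-level aggregation, and swapping self-attention.

```python
# Sketch of cost aggregation with a Transformer: correlation volume -> Transformer
# encoder refinement. Simplified; not the exact CATs architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

def correlation(feat_src, feat_trg):
    # feat_*: (B, C, H, W) -> cost volume (B, HW_src, HW_trg)
    src = F.normalize(feat_src.flatten(2), dim=1)          # (B, C, HW)
    trg = F.normalize(feat_trg.flatten(2), dim=1)
    return torch.einsum("bcs,bct->bst", src, trg)

class CostAggregator(nn.Module):
    def __init__(self, trg_tokens: int, d_model: int = 128, heads: int = 4):
        super().__init__()
        self.proj_in = nn.Linear(trg_tokens, d_model)       # embed each source pixel's cost row
        layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.proj_out = nn.Linear(d_model, trg_tokens)

    def forward(self, cost):                                # cost: (B, HW_src, HW_trg)
        return self.proj_out(self.encoder(self.proj_in(cost)))

fs, ft = torch.randn(1, 64, 16, 16), torch.randn(1, 64, 16, 16)
cost = correlation(fs, ft)                                  # (1, 256, 256)
refined = CostAggregator(trg_tokens=256)(cost)
print(refined.shape)                                        # torch.Size([1, 256, 256])
```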
Semantic Correspondence with Transformers
— AK (@ak92501) June 7, 2021
pdf: https://t.co/EMx0X1jxF0
abs: https://t.co/7gQUC3Cd7d
cost aggregation network, find dense correspondences between semantically similar images with additional challenges posed by large intra-class appearance and geometric variations pic.twitter.com/8b6sSPEpfX
13. nmT5 — Is parallel data still relevant for pre-training massively multilingual language models?
Mihir Kale, Aditya Siddhant, Noah Constant, Melvin Johnson, Rami Al-Rfou, Linting Xue
Recently, mT5 - a massively multilingual version of T5 - leveraged a unified text-to-text format to attain state-of-the-art results on a wide variety of multilingual NLP tasks. In this paper, we investigate the impact of incorporating parallel data into mT5 pre-training. We find that multi-tasking language modeling with objectives such as machine translation during pre-training is a straightforward way to improve performance on downstream multilingual and cross-lingual tasks. However, the gains start to diminish as the model capacity increases, suggesting that parallel data might not be as essential for larger models. At the same time, even at larger model sizes, we find that pre-training with parallel data still provides benefits in the limited labelled data regime.
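A toy sketch of the multi-task mixing described above: interleave span-corruption examples built from monolingual text with translation examples built from parallel text, all in a text-to-text format. The prompts and sentinel handling are illustrative placeholders, not the exact nmT5/mT5 preprocessing.

```python
# Toy sketch of mixing span-corruption (monolingual) and translation (parallel) examples
# in a text-to-text format. Formats are placeholders, not the exact nmT5 preprocessing.
import random

def span_corruption_example(text):
    words = text.split()
    i = random.randrange(max(1, len(words) - 2))
    inp = words[:i] + ["<extra_id_0>"] + words[i + 2:]
    tgt = ["<extra_id_0>"] + words[i:i + 2]
    return {"inputs": " ".join(inp), "targets": " ".join(tgt)}

def translation_example(src, tgt, tgt_lang):
    return {"inputs": f"translate to {tgt_lang}: {src}", "targets": tgt}

monolingual = ["the quick brown fox jumps over the lazy dog"]
parallel = [("the cat sits", "die Katze sitzt", "German")]

mixture = [span_corruption_example(t) for t in monolingual] + \
          [translation_example(s, t, l) for s, t, l in parallel]
for ex in mixture:
    print(ex)
```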
nmT5 - Is parallel data still relevant for pre-training massively multilingual language models?
— AK (@ak92501) June 7, 2021
pdf: https://t.co/KKoNuLTkPy
abs: https://t.co/VNawWupnse
larger model sizes, pre-training with parallel data still provides benefits in the limited labelled data regime pic.twitter.com/IHzTlm9UwY
14. SOLQ: Segmenting Objects by Learning Queries
Bin Dong, Fangao Zeng, Tiancai Wang, Xiangyu Zhang, Yichen Wei
In this paper, we propose an end-to-end framework for instance segmentation. Based on the recently introduced DETR [1], our method, termed SOLQ, segments objects by learning unified queries. In SOLQ, each query represents one object and has multiple representations: class, location, and mask. The learned object queries perform classification, box regression, and mask encoding simultaneously in a unified vector form. During the training phase, the encoded mask vectors are supervised by the compression coding of raw spatial masks. At inference time, the produced mask vectors can be directly transformed into spatial masks by the inverse of the compression coding. Experimental results show that SOLQ can achieve state-of-the-art performance, surpassing most existing approaches. Moreover, the joint learning of a unified query representation can greatly improve the detection performance of the original DETR. We hope SOLQ can serve as a strong baseline for Transformer-based instance segmentation. Code is available at https://github.com/megvii-research/SOLQ.
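The "compression coding" of masks can be sketched with a 2D DCT that keeps only the low-frequency coefficients as the mask vector; the exact coding used by SOLQ may differ in its details.

```python
# Sketch of "compression coding" of a spatial mask: encode a binary mask as its leading
# 2D-DCT coefficients (a compact vector a query could regress), then invert at inference.
# Illustrative; the exact coding used by SOLQ may differ.
import numpy as np
from scipy.fft import dctn, idctn

def encode_mask(mask: np.ndarray, k: int = 16) -> np.ndarray:
    coeffs = dctn(mask.astype(float), norm="ortho")
    return coeffs[:k, :k].flatten()              # keep the low-frequency k x k block

def decode_mask(vector: np.ndarray, k: int, shape) -> np.ndarray:
    coeffs = np.zeros(shape)
    coeffs[:k, :k] = vector.reshape(k, k)
    return (idctn(coeffs, norm="ortho") > 0.5).astype(np.uint8)

mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 20:44] = 1                            # a toy rectangular instance mask
vec = encode_mask(mask, k=16)                     # 256-dim mask vector
recon = decode_mask(vec, k=16, shape=mask.shape)
print("mask vector length:", vec.size, "| IoU:", (recon & mask).sum() / (recon | mask).sum())
```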
SOLQ: Segmenting Objects by Learning Queries
— AK (@ak92501) June 7, 2021
pdf: https://t.co/W6Y4sJvEeO
abs: https://t.co/4LYTbWBYi4
github: https://t.co/ROiTSNDcho pic.twitter.com/V1GMtpbKQn
15. Do Syntactic Probes Probe Syntax? Experiments with Jabberwocky Probing
Rowan Hall Maudslay, Ryan Cotterell
Analysing whether neural language models encode linguistic information has become popular in NLP. One method of doing so, which is frequently cited to support the claim that models like BERT encode syntax, is called probing; probes are small supervised models trained to extract linguistic information from another model’s output. If a probe is able to predict a particular structure, it is argued that the model whose output it is trained on must have implicitly learnt to encode it. However, drawing a generalisation about a model’s linguistic knowledge of a specific phenomenon based on what a probe is able to learn may be problematic: in this work, we show that semantic cues in training data mean that syntactic probes do not properly isolate syntax. We generate a new corpus of semantically nonsensical but syntactically well-formed Jabberwocky sentences, which we use to evaluate two probes trained on normal data. We train the probes on several popular language models (BERT, GPT, and RoBERTa), and find that in all settings they perform worse when evaluated on these data, for one probe by an average of 15.4 UUAS points absolute. Although in most cases they still outperform the baselines, their lead is reduced substantially, e.g. by 53% in the case of BERT for one probe. This begs the question: what empirical scores constitute knowing syntax?
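The Jabberwocky construction can be sketched as replacing content words with pseudowords while keeping function words and word order, so sentences stay syntactically well-formed but semantically nonsensical; the toy example below is not the authors' corpus-construction pipeline.

```python
# Sketch of the "Jabberwocky" idea: swap content words for pseudowords, keep function
# words and word order. Toy illustration only, not the authors' pipeline.
content_pos = {"NOUN", "VERB", "ADJ"}
pseudowords = {"NOUN": "tove", "VERB": "gimbles", "ADJ": "slithy"}

tagged = [("the", "DET"), ("agile", "ADJ"), ("cat", "NOUN"),
          ("chases", "VERB"), ("the", "DET"), ("mouse", "NOUN")]

jabberwocky = [pseudowords[pos] if pos in content_pos else word for word, pos in tagged]
print(" ".join(jabberwocky))   # "the slithy tove gimbles the tove"
```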
Some (@ryandcotterell) would say that posting an arXiv link <1hr before the conference started was leaving things late....BUT what the hell:
— Rowan Hall Maudslay (@rowhallmauds) June 7, 2021
Do Syntactic Probes Probe Syntax? Experiments with Jabberwocky Probinghttps://t.co/p2hMeycos5
Done at @CSatETH & @cambridgenlp [1/6] pic.twitter.com/uaUmyyM1RF
16. A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning
Mingde Zhao, Zhen Liu, Sitao Luan, Shuyuan Zhang, Doina Precup, Yoshua Bengio
We present an end-to-end, model-based deep reinforcement learning agent which dynamically attends to relevant parts of its state, in order to plan and to generalize better out-of-distribution. The agent’s architecture uses a set representation and a bottleneck mechanism, forcing the number of entities to which the agent attends at each planning step to be small. In experiments with customized MiniGrid environments with different dynamics, we observe that the design allows agents to learn to plan effectively, by attending to the relevant objects, leading to better out-of-distribution generalization.
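A simplified sketch of the set-plus-bottleneck idea: score a set of entity vectors and keep only the top-k most relevant ones at each planning step. This hard top-k is only one way to realize such a bottleneck; the authors' agent is more elaborate.

```python
# Sketch of the "set representation + bottleneck" idea: keep only the top-k scored
# entity vectors for planning. Simplified; not the authors' full architecture.
import torch
import torch.nn as nn

class EntityBottleneck(nn.Module):
    def __init__(self, dim: int = 32, k: int = 4):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)

    def forward(self, entities):                              # entities: (B, N, dim), a set of objects
        scores = self.score(entities).squeeze(-1)             # (B, N) relevance scores
        topk = scores.topk(self.k, dim=1).indices             # indices of the k attended entities
        idx = topk.unsqueeze(-1).expand(-1, -1, entities.size(-1))
        return entities.gather(1, idx)                        # (B, k, dim): the small attended set

state = torch.randn(2, 10, 32)                   # 10 candidate entities per environment
print(EntityBottleneck()(state).shape)           # torch.Size([2, 4, 32])
```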
A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning
— AK (@ak92501) June 7, 2021
pdf: https://t.co/aW21hIkEW7
abs: https://t.co/7OP9ctAHRs
end-to-end, model-based DRL agent which dynamically attends to relevant parts of its state, in order to plan and to generalize better ood pic.twitter.com/ZN7jHIS38Z
17. Fundamental tradeoffs between memorization and robustness in random features and neural tangent regimes
Elvis Dohmatob
This work studies the (non)robustness of two-layer neural networks in various high-dimensional linearized regimes. We establish fundamental trade-offs between memorization and robustness, as measured by the Sobolev seminorm of the model w.r.t. the data distribution, i.e. the square root of the average squared norm of the gradients of the model w.r.t. its input. More precisely, in terms of the number of training examples, the input dimension, and the number of hidden neurons of a two-layer neural network, we prove for a large class of activation functions that, if the model memorizes even a fraction of the training set, then its Sobolev seminorm is lower-bounded in (i) the case of infinite-width random features (RF) or the neural tangent kernel (NTK); (ii) the case of finite-width RF under proportionate scaling; and (iii) the case of finite-width NTK under proportionate scaling. Moreover, all of these lower bounds are tight: they are attained by the min-norm / least-squares interpolator (when the number of examples, the input dimension, and the width are in the appropriate interpolating regime). All our results hold as soon as the data is log-concave isotropic and there is label noise, i.e. the target variable is not a deterministic function of the data / features. We empirically validate our theoretical results with experiments. Incidentally, these experiments also reveal, for the first time, (iv) a multiple-descent phenomenon in the robustness of the min-norm interpolator.
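The Sobolev seminorm defined above can be estimated empirically with autograd: the square root of the average squared norm of the model's input gradients over a sample from the data distribution. A minimal sketch:

```python
# Sketch: empirically estimating the Sobolev seminorm from the abstract, i.e. the square
# root of the average squared norm of the model's input gradients, via autograd.
import torch
import torch.nn as nn

def sobolev_seminorm(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    x = x.clone().requires_grad_(True)
    out = model(x).sum()                       # scalar; its gradient gives df/dx per example
    (grads,) = torch.autograd.grad(out, x)
    return grads.pow(2).sum(dim=1).mean().sqrt()

model = nn.Sequential(nn.Linear(20, 100), nn.ReLU(), nn.Linear(100, 1))
data = torch.randn(256, 20)                    # samples standing in for the data distribution
print("empirical Sobolev seminorm:", sobolev_seminorm(model, data).item())
```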
1/ NEW preprint https://t.co/aIMxhTPoDf wherein we uncover a fundamental tradeoff between memorization and robustness for NNs in linearized regimes (RF, NTK, ...). Also, we accidentally observe, for the first time (it seems), a multiple-descent phenomenon in robustness pic.twitter.com/mpGVKoyi4e
— Elvis Dohmatob (@dohmatobelvis) June 7, 2021
18. Eliciting Spoken Interruptions to Inform Proactive Speech Agent Design
Justin Edwards, Christian Janssen, Sandy Gould, Benjamin R Cowan
Current speech agent interactions are typically user-initiated, limiting the interactions they can deliver. Future functionality will require agents to be proactive, sometimes interrupting users. Little is known about how these spoken interruptions should be designed, especially in urgent interruption contexts. We look to inform design of proactive agent interruptions through investigating how people interrupt others engaged in complex tasks. We therefore developed a new technique to elicit human spoken interruptions of people engaged in other tasks. We found that people interrupted sooner when interruptions were urgent. Some participants used access rituals to forewarn interruptions, but most rarely used them. People balanced speed and accuracy in timing interruptions, often using cues from the task they interrupted. People also varied phrasing and delivery of interruptions to reflect urgency. We discuss how our findings can inform speech agent design and how our paradigm can help gain insight into human interruptions in new contexts.
19. The Image Local Autoregressive Transformer
Chenjie Cao, Yuxin Hong, Xiang Li, Chengrong Wang, Chengming Xu, XiangYang Xue, Yanwei Fu
Recently, AutoRegressive (AR) models for whole-image generation, empowered by Transformers, have achieved comparable or even better performance than Generative Adversarial Networks (GANs). Unfortunately, directly applying such AR models to edit/change local image regions may suffer from the problems of missing global information, slow inference speed, and information leakage of local guidance. To address these limitations, we propose a novel model, the image Local Autoregressive Transformer (iLAT), to better facilitate locally guided image synthesis. Our iLAT learns novel local discrete representations via the newly proposed local autoregressive (LA) transformer, built on an attention mask and convolution mechanism. Thus, iLAT can efficiently synthesize local image regions from key guidance information. Our iLAT is evaluated on various locally guided image synthesis tasks, such as pose-guided person image synthesis and face editing. Both the quantitative and qualitative results show the efficacy of our model.
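One plausible reading of the local autoregressive attention mask (not necessarily the exact iLAT mask): tokens outside the edited region serve as fully visible context, while tokens inside the region are generated causally and cannot attend to later in-region tokens.

```python
# Sketch of a "local autoregressive" attention mask: out-of-region tokens are fully
# visible context; in-region tokens are causal within the region. One plausible reading
# of the abstract, not the exact iLAT mask.
import torch

def local_autoregressive_mask(n_tokens: int, region: slice) -> torch.Tensor:
    mask = torch.ones(n_tokens, n_tokens, dtype=torch.bool)   # True = may attend
    idx = torch.arange(n_tokens)
    in_region = torch.zeros(n_tokens, dtype=torch.bool)
    in_region[region] = True
    # In-region token i may not attend to a later in-region token j (j > i).
    later_in_region = in_region.unsqueeze(0) & (idx.unsqueeze(0) > idx.unsqueeze(1))
    mask[in_region] &= ~later_in_region[in_region]
    return mask

m = local_autoregressive_mask(8, region=slice(3, 6))
print(m.int())   # rows 3-5 show a causal pattern over columns 3-5; other rows attend everywhere
```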
The Image Local Autoregressive Transformer
— AK (@ak92501) June 7, 2021
pdf: https://t.co/Ldk53mswBh
abs: https://t.co/frzJ3ZaNgR
learns the novel local discrete representations, by the newly proposed local autoregressive transformer of the attention mask and convolution mechanism pic.twitter.com/uOCch4f6qF