1. Out of Order: How important is the sequential order of words in a sentence in Natural Language Understanding tasks?
Thang M. Pham, Trung Bui, Long Mai, Anh Nguyen
Do state-of-the-art natural language understanding models care about word order, one of the most important characteristics of a sequence? Not always! We found 75% to 90% of the correct predictions of BERT-based classifiers, trained on many GLUE tasks, remain constant after input words are randomly shuffled. Although BERT embeddings are famously contextual, the contribution of each individual word to downstream tasks is almost unchanged even after the word's context is shuffled. BERT-based models are able to exploit superficial cues (e.g. the sentiment of keywords in sentiment analysis, or the word-wise similarity between sequence-pair inputs in natural language inference) to make correct decisions when tokens are arranged in random orders. Encouraging classifiers to capture word order information improves the performance on most GLUE tasks, SQuAD 2.0, and out-of-sample data. Our work suggests that many GLUE tasks are not challenging machines to understand the meaning of a sentence.
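A minimal sketch of the shuffling probe, with a hypothetical bag-of-words sentiment scorer standing in for a trained BERT classifier; because the toy model ignores order by construction, it illustrates exactly the kind of superficial keyword cue the paper argues these models exploit:

```python
import random

def keyword_sentiment(sentence):
    """Toy bag-of-words sentiment classifier: order-agnostic by construction."""
    positive = {"great", "good", "wonderful", "love"}
    negative = {"terrible", "bad", "awful", "hate"}
    words = sentence.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return "positive" if score >= 0 else "negative"

def shuffle_words(sentence, seed=0):
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

sentence = "the acting was wonderful but the plot was terrible and bad"
print(keyword_sentiment(sentence))                 # negative
print(keyword_sentiment(shuffle_words(sentence)))  # negative: same label after shuffling
```

The paper's experiment is the same loop run over GLUE inputs with a fine-tuned BERT classifier in place of the keyword scorer, counting how often the predicted label survives the shuffle.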
Do SotA natural language understanding models care about word order?
— Anh Nguyen (@anh_ng8) January 2, 2021
Nope 🙃, 75% to 90% of the time, for BERT-based models, on many GLUE tasks (where they outperformed humans).
"marijuana cause cancer" == "cancer cause marijuana" Ouch...https://t.co/Kr3i3SBNXb 1/4 pic.twitter.com/IY6naG0onV
2. Reinforcement Learning for Control of Valves
Rajesh Siraskar
This paper compares reinforcement learning (RL) with the PID (proportional-integral-derivative) strategy for control of nonlinear valves using a unified framework. RL is an autonomous learning mechanism that learns by interacting with its environment. It is gaining increasing attention in the world of control systems as a means of building optimal controllers for challenging dynamic and nonlinear processes. Published RL research often uses open-source tools (Python and OpenAI Gym environments), which can be difficult for practicing industrial engineers to adapt and apply; we therefore used MathWorks tools. MATLAB's recently launched (R2019a) Reinforcement Learning Toolbox was used to develop the valve controller, trained with the DDPG (Deep Deterministic Policy-Gradient) algorithm, and Simulink was used to simulate the nonlinear valve and to set up the experimental test-bench for evaluating the RL and PID controllers. Results indicate that the RL controller is extremely good at tracking the signal with speed and produces a lower error with respect to the reference signals. The PID, however, is better at disturbance rejection and hence provides a longer life for the valves. Experiential learnings gained from this research are corroborated against published research. It is known that successful machine learning involves tuning many hyperparameters and a significant investment of time and effort. We introduce "Graded Learning" as a simplified, application-oriented adaptation of the more formal and algorithmic "Curriculum for Reinforcement Learning". Experiments show that it helps the learning task converge for complex nonlinear real-world systems.
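The PID baseline in such comparisons is compact enough to sketch in a few lines; the gains and the first-order valve model below are illustrative stand-ins, not the paper's MATLAB/Simulink setup:

```python
class PID:
    """Minimal discrete PID controller; gains here are illustrative, not tuned."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Track a step setpoint on a crude first-order valve model: x' = -x + u.
pid, x = PID(kp=2.0, ki=0.5, kd=0.05, dt=0.01), 0.0
for _ in range(2000):
    u = pid.step(setpoint=1.0, measurement=x)
    x += (-x + u) * 0.01
print(round(x, 3))  # approaches 1.0
```

An RL controller such as DDPG replaces the fixed control law with a learned policy mapping (setpoint, measurement) to the actuation signal, trained against a simulated plant like the one above.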
Reinforcement Learning for Control of Valves. #AI #MachineLearning #TensorFlow #100DaysOfCode #BigData #Analytics #DevCommunity #Programming #IoT #javascript #Linux #Cloud #Serverless #womenwhocode #Python #RStats #DataScience #DeepLearning #NeuralNetworks https://t.co/GFdRWmvK79 pic.twitter.com/f40RpdgRN8
— Marcus Borba (@marcusborba) January 2, 2021
3. Transformer Feed-Forward Layers Are Key-Value Memories
Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy
Feed-forward layers constitute two-thirds of a transformer model’s parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. Our experiments show that the learned patterns are human-interpretable, and that lower layers tend to capture shallow patterns, while upper layers learn more semantic ones. The values complement the keys’ input patterns by inducing output distributions that concentrate probability mass on tokens likely to appear immediately after each pattern, particularly in the upper layers. Finally, we demonstrate that the output of a feed-forward layer is a composition of its memories, which is subsequently refined throughout the model’s layers via residual connections to produce the final output distribution.
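In code, the paper's reading of a feed-forward layer looks roughly like the following numpy sketch (random weights for illustration; real models add biases and use GELU rather than ReLU):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, vocab = 16, 64, 100

x = rng.normal(size=d_model)          # residual-stream input at one position
K = rng.normal(size=(d_ff, d_model))  # "keys": first FF matrix, one row per memory
V = rng.normal(size=(d_ff, d_model))  # "values": second FF matrix
E = rng.normal(size=(vocab, d_model)) # output embedding, to read values as vocab distributions

coeffs = np.maximum(K @ x, 0.0)  # memory coefficients: how strongly each key's pattern fires
out = coeffs @ V                 # FF output = coefficient-weighted composition of values
top_memory = int(coeffs.argmax())
value_dist = np.exp(E @ V[top_memory]); value_dist /= value_dist.sum()
print(top_memory, value_dist.argmax())  # dominant memory and the token its value promotes
```

The paper's analysis is essentially this projection applied to trained weights: keys are matched against training-set prefixes, and values are projected through the embedding matrix to see which next tokens they promote.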
"Transformer Feed-Forward Layers Are Key-Value Memories"
— Mor Geva (@megamor2) January 1, 2021
Check out our new preprint where we analyze the role of FF layers in transformer models. https://t.co/MY44mMVxyV
With @RoeiSchuster @JonathanBerant @omerlevy_
1/3 pic.twitter.com/9JGWazHcpO
4. A Memory Efficient Baseline for Open Domain Question Answering
Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Sebastian Riedel, Edouard Grave
Recently, retrieval systems based on dense representations have led to important improvements in open-domain question answering, and related tasks. While very effective, this approach is also memory intensive, as the dense vectors for the whole knowledge source need to be kept in memory. In this paper, we study how the memory footprint of dense retriever-reader systems can be reduced. We consider three strategies to reduce the index size: dimension reduction, vector quantization and passage filtering. We evaluate our approach on two question answering benchmarks: TriviaQA and NaturalQuestions, showing that it is possible to get competitive systems using less than 6Gb of memory.
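A rough sketch of the three strategies on stand-in embeddings. The paper evaluates product quantization and a learned passage filter; here simple scalar quantization and a random `keep_mask` are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 768)).astype(np.float32)  # stand-in passage embeddings

# 1) Dimension reduction with PCA (keep 256 of 768 dims).
mean = vectors.mean(axis=0)
_, _, components = np.linalg.svd(vectors - mean, full_matrices=False)
reduced = (vectors - mean) @ components[:256].T

# 2) Quantization to int8 (the paper uses product quantization; this is cruder).
scale = np.abs(reduced).max() / 127.0
quantized = np.round(reduced / scale).astype(np.int8)

# 3) Passage filtering: drop passages judged unlikely to be useful.
#    `keep_mask` is a random stand-in for the paper's learned filter.
keep_mask = rng.random(len(quantized)) > 0.5
index = quantized[keep_mask]
print(vectors.nbytes / index.nbytes)  # compression factor vs. the float32 index
```

Each step trades a little retrieval accuracy for memory, which is why the paper reports the accuracy/footprint frontier rather than a single number.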
New paper on memory efficient open domain question answering. We show that combining dimension reduction, vector quantization and passage filtering greatly reduces the memory footprint of retrieval based systems, without hurting accuracy too much.
— Edouard Grave (@EXGRV) January 1, 2021
Paper: https://t.co/BVuvEMCKhe pic.twitter.com/YTy6HSmo66
5. Improving Zero-Shot Translation by Disentangling Positional Information
Danni Liu, Jan Niehues, James Cross, Francisco Guzmán, Xian Li
Multilingual neural machine translation has shown the capability of directly translating between language pairs unseen in training, i.e. zero-shot translation. Despite being conceptually attractive, it often suffers from low output quality. The difficulty of generalizing to new translation directions suggests the model representations are highly specific to those language pairs seen in training. We demonstrate that a main factor causing the language-specific representations is the positional correspondence to input tokens. We show that this can be easily alleviated by removing residual connections in an encoder layer. With this modification, we gain up to 18.5 BLEU points on zero-shot translation while retaining quality on supervised directions. The improvements are particularly prominent between related languages, where our proposed model outperforms pivot-based translation. Moreover, our approach allows easy integration of new languages, which substantially expands translation coverage. By thorough inspections of the hidden layer outputs, we show that our approach indeed leads to more language-independent representations.
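A PyTorch sketch of the modification as described: a standard encoder layer with the residual connection around self-attention removed, so token representations are no longer forced to stay positionally aligned with the input (the paper applies this in a single middle encoder layer):

```python
import torch
import torch.nn as nn

class NoResidualEncoderLayer(nn.Module):
    """Transformer encoder layer with the residual around self-attention dropped."""
    def __init__(self, d_model=512, nhead=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        h = self.norm1(attn_out)           # no `x +` here: residual connection removed
        return self.norm2(h + self.ff(h))  # feed-forward residual kept as usual

x = torch.randn(2, 10, 512)
print(NoResidualEncoderLayer()(x).shape)  # torch.Size([2, 10, 512])
```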
Improving Zero-Shot Translation by Disentangling Positional Information
— Aran Komatsuzaki (@arankomatsuzaki) January 1, 2021
Achieving up to 18.5 BLEU points gain on zero-shot translation by removing residual connections in an encoder layer. https://t.co/le4p1eFN90 pic.twitter.com/wc5qkGjX6D
6. Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade
Jiatao Gu, Xiang Kong
Fully non-autoregressive neural machine translation (NAT) simultaneously predicts all tokens in a single forward pass of the network, which significantly reduces inference latency at the expense of a quality drop compared to the Transformer baseline. In this work, we aim to close the performance gap while maintaining the latency advantage. We first inspect the fundamental issues of fully NAT models, and adopt dependency reduction in the learning space of output tokens as the basic guidance. Then, we revisit methods in four different aspects that have proven effective for improving NAT models, and carefully combine these techniques with necessary modifications. Our extensive experiments on three translation benchmarks show that the proposed system achieves new state-of-the-art results for fully NAT models, and obtains performance comparable to autoregressive and iterative NAT systems. For instance, one of the proposed models achieves 27.49 BLEU points on WMT14 En-De with approximately 16.5x speed-up at inference time.
Happy New Year!! I am super excited to share our new pre-print “Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade”, joint work with @XiangKong4 .
— Jiatao Gu (@thoma_gu) January 1, 2021
Please check out https://t.co/NSzUtZr7Fb
(1/2) pic.twitter.com/YUdcEocEZg
7. Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans
Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, Xiaowei Zhou
This paper addresses the challenge of novel view synthesis for a human performer from a very sparse set of camera views. Some recent works have shown that learning implicit neural representations of 3D scenes achieves remarkable view synthesis quality given dense input views. However, the representation learning will be ill-posed if the views are highly sparse. To solve this ill-posed problem, our key idea is to integrate observations over video frames. To this end, we propose Neural Body, a new human body representation which assumes that the learned neural representations at different frames share the same set of latent codes anchored to a deformable mesh, so that the observations across frames can be naturally integrated. The deformable mesh also provides geometric guidance for the network to learn 3D representations more efficiently. Experiments on a newly collected multi-view dataset show that our approach outperforms prior works by a large margin in terms of the view synthesis quality. We also demonstrate the capability of our approach to reconstruct a moving person from a monocular video on the People-Snapshot dataset. The code and dataset will be available at https://zju3dv.github.io/neuralbody/.
Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans
— AK (@ak92501) January 1, 2021
pdf: https://t.co/gIj48j9xYw
abs: https://t.co/qxv6vEBvNE
project page: https://t.co/fCw6cDz1Yb pic.twitter.com/1JyiVTMHb4
8. Is Pessimism Provably Efficient for Offline RL?
Ying Jin, Zhuoran Yang, Zhaoran Wang
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori. Due to the lack of further interactions with the environment, offline RL suffers from the insufficient coverage of the dataset, which eludes most existing theoretical analysis. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function. Such a penalty function simply flips the sign of the bonus function for promoting exploration in online RL, which makes it easily implementable and compatible with general function approximators. Without assuming the sufficient coverage of the dataset, we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the dataset, the learned policy serves as the "best effort" among all policies, as no other policy can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which emerges from the "irrelevant" trajectories that are less covered by the dataset and not informative for the optimal policy.
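A minimal tabular sketch of the idea, assuming a simple count-based uncertainty quantifier; the paper states PEVI for general function approximators and gives the quantifier a formal guarantee:

```python
import numpy as np

def pevi(P, r, counts, horizon, beta=1.0):
    """Pessimistic value iteration on a tabular MDP.
    P: (S, A, S) transitions, r: (S, A) rewards, counts: (S, A) dataset visit counts.
    The penalty is a crude count-based uncertainty quantifier, sign-flipped
    relative to an online exploration bonus."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(horizon):
        penalty = beta / np.sqrt(np.maximum(counts, 1))  # large where data is scarce
        Q = r + P @ V - penalty                          # pessimism: subtract, don't add
        Q = np.clip(Q, 0.0, horizon)                     # truncate to a valid value range
        V = Q.max(axis=1)
    return Q.argmax(axis=1)  # greedy (pessimistic) policy

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))        # 3 states, 2 actions
r = rng.random((3, 2))
counts = np.array([[100, 2], [50, 50], [1, 80]])  # action coverage in the offline dataset
print(pevi(P, r, counts, horizon=10))
```

The sign flip is the whole trick: poorly covered state-action pairs get their values pushed down, so the learned policy avoids regions where the dataset cannot certify value estimates.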
We know optimism is provably efficient for online RL. What about offline RL? It turns out simply flipping the sign of the bonus is minimax optimal! Given a dataset, pessimism is the best effort we can make. https://t.co/zO6cxKY7bb
— Zhaoran Wang (@zhaoran_wang) January 1, 2021
Just leave pessimism to 2020. Happy new year~! pic.twitter.com/wYcrQPGXCz
9. Studying Strategically: Learning to Mask for Closed-book QA
Qinyuan Ye, Belinda Z. Li, Sinong Wang, Benjamin Bolte, Hao Ma, Xiang Ren, Wen-tau Yih, Madian Khabsa
Closed-book question-answering (QA) is a challenging task that requires a model to answer questions directly, without access to external knowledge. It has been shown that directly fine-tuning pre-trained language models with (question, answer) examples yields surprisingly competitive performance, which is further improved by adding an intermediate pre-training stage between general pre-training and fine-tuning. Prior work used a heuristic during this intermediate stage, whereby named entities and dates are masked, and the model is trained to recover these tokens. In this paper, we aim to learn the optimal masking strategy for the intermediate pre-training stage. We first train our masking policy to extract spans that are likely to be tested, using supervision from the downstream task itself, then deploy the learned policy during intermediate pre-training. Thus, our policy packs task-relevant knowledge into the parameters of a language model. Our approach is particularly effective on TriviaQA, outperforming strong heuristics when used to pre-train BART.
Building upon "𝘩𝘰𝘸 𝘮𝘶𝘤𝘩 knowledge can you pack into the parameters of a language model?", have you wondered "𝘸𝘩𝘢𝘵 knowledge do you want to pack into the parameters of a language model?" Check out our new preprint (https://t.co/yDdixrPy2l) on this problem! 1/n pic.twitter.com/PAa5t8bFvJ
— Qinyuan Ye (@qinyuan_ye) January 2, 2021
10. NeuralMagicEye: Learning to See and Understand the Scene Behind an Autostereogram
Zhengxia Zou, Tianyang Shi, Yi Yuan, Zhenwei Shi
An autostereogram, a.k.a. magic eye image, is a single-image stereogram that can create visual illusions of 3D scenes from 2D textures. This paper studies whether a deep CNN can be trained to recover the depth behind an autostereogram and understand its content. The key to the autostereogram magic lies in stereopsis: to solve such a problem, a model has to learn to discover and estimate disparity from the quasi-periodic textures. We show that deep CNNs embedded with disparity convolution, a novel convolutional layer proposed in this paper that simulates stereopsis and encodes disparity, can nicely solve such a problem after being sufficiently trained on a large 3D object dataset in a self-supervised fashion. We refer to our method as "NeuralMagicEye". Experiments show that our method can accurately recover the depth behind autostereograms with rich details and gradient smoothness. Experiments also show the completely different working mechanisms for autostereogram perception between neural networks and human eyes. We hope this research can help people with visual impairments and those who have trouble viewing autostereograms. Our code is available at https://jiupinjia.github.io/neuralmagiceye/.
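The disparity convolution is a learned layer, but the stereopsis principle it encodes can be sketched with a classical (non-neural) autocorrelation decoder: at each pixel, find the horizontal shift at which the image best matches itself, and map that shift to depth:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def depth_from_autostereogram(img, min_shift=40, max_shift=80, window=9):
    """Classical baseline for the idea the disparity convolution learns:
    per pixel, the best horizontal self-match shift encodes depth."""
    best_score = np.full(img.shape, -np.inf)
    depth = np.zeros(img.shape, dtype=int)
    for d in range(min_shift, max_shift):
        prod = np.zeros_like(img)
        prod[:, d:] = img[:, d:] * img[:, :-d]     # pointwise match with d-shifted copy
        score = uniform_filter(prod, size=window)  # average the match over a local window
        better = score > best_score
        depth[better], best_score[better] = d, score[better]
    return depth

img = np.random.rand(64, 256)  # stand-in; a real autostereogram has quasi-periodic texture
print(depth_from_autostereogram(img).shape)  # (64, 256) depth map
```

Shift bounds and window size are illustrative; the paper's network learns the equivalent matching end to end and adds semantic understanding on top of the recovered depth.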
NeuralMagicEye: Learning to See and Understand the Scene Behind an Autostereogram
— hardmaru (@hardmaru) January 3, 2021
An autostereogram is a single-image stereogram, designed to create the visual illusion of a 3D scene from a 2D image. Cool project!
Paper https://t.co/tVeOQUYmfp
Other info https://t.co/hH3CmKXjo9 pic.twitter.com/rTR17kPH0x
11. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, Li Zhang
Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. In particular, we achieved the first position (44.42% mIoU) on the highly competitive ADE20K test server leaderboard.
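The image-to-sequence step at the heart of this design can be sketched in a few lines of PyTorch; patch size and dimensions below are the usual ViT defaults, assumed here rather than taken from the paper:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into 16x16 patches and linearly project each patch
    to a token embedding, turning segmentation input into a sequence."""
    def __init__(self, patch=16, in_ch=3, d_model=768):
        super().__init__()
        # A strided conv is the standard trick: one output position per patch.
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=patch, stride=patch)

    def forward(self, x):                         # x: (B, 3, H, W)
        tokens = self.proj(x)                     # (B, d_model, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)  # (B, num_patches, d_model)

x = torch.randn(1, 3, 224, 224)
print(PatchEmbed()(x).shape)  # torch.Size([1, 196, 768])
```

The transformer encoder then operates on this token sequence at constant resolution, and the decoder only has to reshape and upsample, which is what lets SETR drop the progressive downsampling of FCN encoders.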
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
— AK (@ak92501) January 1, 2021
pdf: https://t.co/d3pOGztGQP
abs: https://t.co/wEvouWEAdO
project page: https://t.co/5TBcHvnHk0 pic.twitter.com/bylJfBtubX
12. FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging
Han Guo, Nazneen Fatema Rajani, Peter Hase, Mohit Bansal, Caiming Xiong
Influence functions approximate the 'influences' of training data-points on test predictions and have a wide variety of applications. Despite their popularity, their computational cost does not scale well with model and training-data size. We present FastIF, a set of simple modifications to influence functions that significantly improves their run-time. We use k-Nearest Neighbors (kNN) to narrow the search space down to a subset of good candidate data points, identify the configurations that best balance the speed-quality trade-off in estimating the inverse Hessian-vector product, and introduce a fast parallel variant. Our proposed method achieves about 80x speedup while being highly correlated with the original influence values. With the availability of the fast influence functions, we demonstrate their usefulness in four applications. First, we examine whether influential data-points can 'explain' test time behavior using the framework of simulatability. Second, we visualize the influence interactions between training and test data-points. Third, we show that we can correct model errors by additional fine-tuning on certain influential data-points, improving the accuracy of a trained MNLI model by 2.6% on the HANS challenge set using a small number of gradient updates. Finally, we experiment with a data-augmentation setup where we use influence functions to search for new data-points unseen during training to improve model performance. Overall, our fast influence functions can be efficiently applied to large models and datasets, and our experiments demonstrate the potential of influence functions in model interpretation and correcting model errors. Code is available at https://github.com/salesforce/fast-influence-functions
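One ingredient, the Neumann-series estimate of the inverse-Hessian-vector product, can be sketched with an explicit Hessian so the result is checkable; in practice `hvp` would be computed by double backprop, and FastIF pairs this estimator with kNN candidate selection:

```python
import numpy as np

def inverse_hvp_neumann(hvp, v, steps=2000, scale=1.0):
    """Estimate H^{-1} v via the truncated Neumann series
    H^{-1} v = sum_k (I - H)^k v, with H pre-scaled so its spectrum lies in (0, 1).
    `hvp` is any function computing H @ x."""
    estimate, cur = v.copy(), v.copy()
    for _ in range(steps):
        cur = cur - hvp(cur) / scale  # cur <- (I - H/scale) cur
        estimate += cur
    return estimate / scale           # undo the pre-scaling

# Check against an explicit positive-definite "Hessian".
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 20))
H = A @ A.T + np.eye(20)
v = rng.normal(size=20)
approx = inverse_hvp_neumann(lambda x: H @ x, v,
                             scale=np.linalg.eigvalsh(H).max() * 1.1)
print(np.allclose(approx, np.linalg.solve(H, v), atol=1e-3))  # True
```

The speed-quality trade-off the paper tunes corresponds to choices like `steps`, the scaling, and how many kNN candidates survive the first stage.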
Glad to share our latest work "FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging"!
— Han Guo (@HanGuo97) January 2, 2021
Joint work with @nazneenrajani @peterbhase @mohitban47 @caimingxiong (@uncnlp @sfresearch).
Paper: https://t.co/l0ZOKTBSjR
Code: https://t.co/4soU3e1vpD
1/5 pic.twitter.com/ROdgkrfONI
Computing influence functions, which capture how the presence or perturbation of particular training examples affects a prediction, used to be computationally expensive. FastIF speeds this up by 1) using kNN to narrow the search to training examples whose feature vectors resemble the test example's, and 2) estimating the inverse Hessian via HVPs and a Neumann series, achieving an 80x speedup. https://t.co/Nw5cJSuuSZ
— Daisuke Okanohara (@hillbig) January 3, 2021
13. OSTeC: One-Shot Texture Completion
Baris Gecer, Jiankang Deng, Stefanos Zafeiriou
The last few years have witnessed the great success of non-linear generative models in synthesizing high-quality photorealistic face images. Many recent approaches to 3D facial texture reconstruction and pose manipulation from a single image still rely on large and clean face datasets to train image-to-image Generative Adversarial Networks (GANs). Yet collecting such a large-scale, high-resolution 3D texture dataset remains very costly, and maintaining age/ethnicity balance is difficult. Moreover, regression-based approaches suffer from poor generalization to in-the-wild conditions and are unable to fine-tune to a target image. In this work, we propose an unsupervised approach for one-shot 3D facial texture completion that does not require large-scale texture datasets, but rather harnesses the knowledge stored in 2D face generators. The proposed approach rotates an input image in 3D and fills in the unseen regions by reconstructing the rotated image in a 2D face generator, based on the visible parts. Finally, we stitch the most visible textures at different angles in the UV image-plane. Further, we frontalize the target image by projecting the completed texture into the generator. The qualitative and quantitative experiments demonstrate that the completed UV textures and frontalized images are of high quality, resemble the original identity, can be used to train a texture GAN model for 3DMM fitting, and improve pose-invariant face recognition.
OSTeC: One-Shot Texture Completion
— AK (@ak92501) January 1, 2021
pdf: https://t.co/9LnfxMEZeM
abs: https://t.co/2dTKqeXz4J pic.twitter.com/6pX8B53l49
14. TransTrack: Multiple-Object Tracking with Transformer
Peize Sun, Yi Jiang, Rufeng Zhang, Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Zehuan Yuan, Changhu Wang, Ping Luo
Multiple-object tracking (MOT) is mostly dominated by complex, multi-step tracking-by-detection algorithms, which perform object detection, feature extraction and temporal association separately. The query-key mechanism in single-object tracking (SOT), which tracks the object in the current frame using the object feature from the previous frame, has great potential to set up a simple joint-detection-and-tracking MOT paradigm. Nonetheless, the query-key method is seldom studied due to its inability to detect newly appearing objects. In this work, we propose TransTrack, a baseline for MOT with Transformer. It takes advantage of the query-key mechanism and introduces a set of learned object queries into the pipeline to enable detecting newly appearing objects. TransTrack has three main advantages: (1) It is an online joint-detection-and-tracking pipeline based on the query-key mechanism. Complex multi-step components of previous methods are simplified. (2) It is a brand-new architecture based on the Transformer. The learned object queries detect objects in the current frame. The object feature queries from the previous frame associate those current objects with the previous ones. (3) For the first time, we demonstrate that a simple and effective method based on the query-key mechanism and the Transformer architecture can achieve a competitive 65.8% MOTA on the MOT17 challenge dataset. We hope TransTrack can provide a new perspective for multiple-object tracking. The code is available at: https://github.com/PeizeSun/TransTrack.
TransTrack applies a Transformer to multiple-object tracking. Candidate keys are extracted from the current frame and matched against queries derived from the previous frame's detections to track objects; matching against an additional set of learnable queries detects newly appearing objects. Simpler than the complex conventional pipelines, yet high-performing. https://t.co/F1YzRgPha4
— Daisuke Okanohara (@hillbig) January 2, 2021
TransTrack: Multiple-Object Tracking with Transformer
— AK (@ak92501) January 1, 2021
pdf: https://t.co/II7aQ1p1BU
abs: https://t.co/SXe3WJaCdD
github: https://t.co/xYtfgZlIAp pic.twitter.com/Qtst1NmZl1
A study applying Transformers to object tracking. There are two decoders, one for object detection and one for locating objects from the previous frame, taking (learnable) detection queries and object features as input, respectively. The features of the two frames computed by the encoder serve as keys for cross-attention inside the decoders to estimate positions. https://t.co/loZ1n2eoS9
— piqcy (@icoxfog417) January 2, 2021
15. Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness
Sabrina J. Mielke, Arthur Szlam, Y-Lan Boureau, Emily Dinan
Open-domain dialogue agents have vastly improved, but still confidently hallucinate knowledge or express doubt when asked straightforward questions. In this work, we analyze whether state-of-the-art chit-chat models can express metacognition capabilities through their responses: does a verbalized expression of doubt (or confidence) match the likelihood that the model’s answer is incorrect (or correct)? We find that these models are poorly calibrated in this sense, yet we show that the representations within the models can be used to accurately predict likelihood of correctness. By incorporating these correctness predictions into the training of a controllable generation model, we obtain a dialogue agent with greatly improved linguistic calibration.
Excited to share the project that's been carrying me through much of 2020:
— Sabrina J. Mielke (@sjmielke) January 1, 2021
"Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness" 🤖📊https://t.co/90GGp2jPQv
w/ Arthur Szlam, Y-Lan Boureau, and Emily Dinan (@em_dinan)
[1/8] pic.twitter.com/Rb0r1uHeSf
16. Combinatorial Pure Exploration with Full-bandit Feedback and Beyond: Solving Combinatorial Optimization under Uncertainty with Limited Observation
Yuko Kuroki, Junya Honda, Masashi Sugiyama
Combinatorial optimization is one of the fundamental research fields that has been extensively studied in theoretical computer science and operations research. When developing an algorithm for combinatorial optimization, it is commonly assumed that parameters such as edge weights are exactly known as inputs. However, this assumption may not be fulfilled since input parameters are often uncertain or initially unknown in many applications such as recommender systems, crowdsourcing, communication networks, and online advertisement. To resolve such uncertainty, the problem of combinatorial pure exploration of multi-armed bandits (CPE) and its variants have received increasing attention. Earlier work on CPE has studied the semi-bandit feedback setting or assumed that the outcome from each individual edge is always accessible at all rounds. However, due to practical constraints such as a budget ceiling or privacy concerns, such strong feedback is not always available in recent applications. In this article, we review recently proposed techniques for combinatorial pure exploration problems with limited feedback.
Happy New Year! 🌅🎍
— Yuko Kuroki (@yuko_kuroki_cs) January 1, 2021
Together with Prof. Sugiyama and Prof. Honda, I wrote a short review article on combinatorial pure exploration of stochastic multi-armed bandits, a general framework for solving combinatorial optimization problems under uncertainty using only limited observations 🙂
(Preprint: https://t.co/Q7BFxyY0Y7)
Best wishes for the new year 🎍🌅 pic.twitter.com/LyZbZHcTny
17. CLEAR: Contrastive Learning for Sentence Representation
Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, Hao Ma
Pre-trained language models have proven their unique powers in capturing implicit language features. However, most pre-training approaches focus on the word-level training objective, while sentence-level objectives are rarely studied. In this paper, we propose Contrastive LEArning for sentence Representation (CLEAR), which employs multiple sentence-level augmentation strategies in order to learn a noise-invariant sentence representation. These augmentations include word and span deletion, reordering, and substitution. Furthermore, we investigate the key reasons that make contrastive learning effective through numerous experiments. We observe that different sentence augmentations during pre-training lead to different performance improvements on various downstream tasks. Our approach is shown to outperform multiple existing methods on both SentEval and GLUE benchmarks.
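Three of the four augmentations are easy to sketch (substitution needs a synonym source and is omitted); probabilities and span lengths here are illustrative, not the paper's settings:

```python
import random

def word_deletion(tokens, p=0.15, rng=random):
    kept = [t for t in tokens if rng.random() > p]
    return kept or tokens  # never return an empty view

def span_deletion(tokens, max_span=4, rng=random):
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + rng.randint(1, max_span):]

def reordering(tokens, rng=random):
    shuffled = tokens[:]
    rng.shuffle(shuffled)
    return shuffled

sent = "the quick brown fox jumps over the lazy dog".split()
# A contrastive pair: two augmented "views" of the same sentence that the
# sentence encoder is trained to map close together, and away from other sentences.
view_a, view_b = word_deletion(sent), reordering(sent)
print(view_a, view_b)
```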
CLEAR: Contrastive Learning for Sentence Representation
— AK (@ak92501) January 1, 2021
pdf: https://t.co/jTg81lBLp0
abs: https://t.co/yPTHlRAMtC pic.twitter.com/5vR3QG2Afk
18. BinaryBERT: Pushing the Limit of BERT Quantization
Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jing Jin, Xin Jiang, Qun Liu, Michael Lyu, Irwin King
The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution. In this paper, we propose BinaryBERT, which pushes BERT quantization to the limit with weight binarization. We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape. Therefore, we propose ternary weight splitting, which initializes the binary model by equivalent splitting from a half-sized ternary network. The binary model thus inherits the good performance of the ternary model, and can be further enhanced by fine-tuning the new architecture after splitting. Empirical results show that BinaryBERT has negligible performance drop compared to the full-precision BERT-base while being smaller, achieving state-of-the-art results on the GLUE and SQuAD benchmarks.
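One simple equivalent split illustrates the core property: a ternary matrix decomposes exactly into two binary matrices at half the scale. The paper's ternary weight splitting additionally constructs the split so the latent full-precision weights are preserved; this sketch only demonstrates exact reconstruction:

```python
import numpy as np

def ternary_weight_split(w_ternary, alpha):
    """Split a ternary matrix with values in {-alpha, 0, +alpha} into two binary
    matrices (values in {-alpha/2, +alpha/2}) whose sum reproduces it exactly:
    +-alpha -> (+-alpha/2, +-alpha/2) and 0 -> (+alpha/2, -alpha/2)."""
    b1 = np.where(w_ternary == 0, alpha / 2, w_ternary / 2)
    b2 = np.where(w_ternary == 0, -alpha / 2, w_ternary / 2)
    return b1, b2

alpha = 0.05
w = np.random.default_rng(0).choice([-alpha, 0.0, alpha], size=(4, 4))
b1, b2 = ternary_weight_split(w, alpha)
assert np.allclose(b1 + b2, w) and set(np.unique(np.abs(b1))) == {alpha / 2}
print("binary halves reconstruct the ternary weights")
```

Because the split is exact, the doubled-width binary network starts from the ternary network's loss value rather than from scratch, which is what makes the subsequent fine-tuning stable.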
BinaryBERT: Pushing the Limit of BERT Quantization
— AK (@ak92501) January 1, 2021
pdf: https://t.co/iovaacojgy
abs: https://t.co/jIqyFhrr39 pic.twitter.com/sSbpHYUFyx
19. Beyond Offline Mapping: Learning Cross Lingual Word Embeddings through Context Anchoring
Aitor Ormazabal, Mikel Artetxe, Aitor Soroa, Gorka Labaka, Eneko Agirre
Recent research on cross-lingual word embeddings has been dominated by unsupervised mapping approaches that align monolingual embeddings. Such methods critically rely on those embeddings having a similar structure, but it was recently shown that the separate training in different languages causes departures from this assumption. In this paper, we propose an alternative approach that does not have this limitation, while requiring a weak seed dictionary (e.g., a list of identical words) as the only form of supervision. Rather than aligning two fixed embedding spaces, our method works by fixing the target language embeddings, and learning a new set of embeddings for the source language that are aligned with them. To that end, we use an extension of skip-gram that leverages translated context words as anchor points, and incorporates self-learning and iterative restarts to reduce the dependency on the initial dictionary. Our approach outperforms conventional mapping methods on bilingual lexicon induction, and obtains competitive results in the downstream XNLI task.
We have a new paper on cross-lingual word embeddings! Instead of aligning fixed monolingual embeddings under the isometry assumption, our method fixes the target language embeddings, and learns aligned embeddings in the source language from scratch. https://t.co/lpHy4xiVjQ https://t.co/Y8r4deJ73i
— Mikel Artetxe (@artetxem) January 1, 2021
Check out our new paper "Beyond offline mapping: Learning Cross Lingual Word Embeddings through Context Anchoring". We propose a new method to learn word embeddings aligned in a target space without a mapping step, outperforming mapping methods in BLI. https://t.co/HmGLlvHdaU
— Aitor Ormazabal (@aormazabalo) January 1, 2021
20. Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection
Bertie Vidgen, Tristan Thrush, Zeerak Waseem, Douwe Kiela
We present a first-of-its-kind large synthetic training dataset for online hate classification, created from scratch with trained annotators over multiple rounds of dynamic data collection. We provide a 40,623-example dataset with annotations for fine-grained labels, including a large number of challenging contrastive perturbation examples. Unusually for an abusive content dataset, it comprises 54% hateful and 46% not hateful entries. We show that model performance and robustness can be greatly improved using the dynamic data collection paradigm. The model error rate decreased across rounds, from 72.1% in the first round to 35.8% in the last round, showing that models became increasingly harder to trick even though content became progressively more adversarial as annotators gained experience. Hate speech detection is an important and subtle problem that is still very challenging for existing AI methods. We hope that the models, dataset and dynamic system that we present here will help improve current approaches, having a positive social impact.
#onlinehate remains a challenge for machine learning - most classifiers aren't very accurate, robust or generalisable. We used @DynabenchAI to dynamically generate more challenging datasets and better models. Preprint now out! Feedback very welcome :) https://t.co/lt5VOVOLR3
— Bertie Vidgen (@bertievidgen) January 1, 2021
21. Audio-Visual Floorplan Reconstruction
Senthil Purushwalkam, Sebastian Vicenc Amengual Gari, Vamsi Krishna Ithapu, Carl Schissler, Philip Robinson, Abhinav Gupta, Kristen Grauman
Given only a few glimpses of an environment, how much can we infer about its entire floorplan? Existing methods can map only what is visible or immediately apparent from context, and thus require substantial movements through a space to fully map it. We explore how both audio and visual sensing together can provide rapid floorplan reconstruction from limited viewpoints. Audio not only helps sense geometry outside the camera’s field of view, but it also reveals the existence of distant freespace (e.g., a dog barking in another room) and suggests the presence of rooms not visible to the camera (e.g., a dishwasher humming in what must be the kitchen to the left). We introduce AV-Map, a novel multi-modal encoder-decoder framework that reasons jointly about audio and vision to reconstruct a floorplan from a short input video sequence. We train our model to predict both the interior structure of the environment and the associated rooms’ semantic labels. Our results on 85 large real-world environments show the impact: with just a few glimpses spanning 26% of an area, we can estimate the whole area with 66% accuracy — substantially better than the state of the art approach for extrapolating visual maps.
Audio-Visual Floorplan Reconstruction
— AK (@ak92501) January 1, 2021
pdf: https://t.co/GsuSH6KcOE
abs: https://t.co/mmRaznkor6
project page: https://t.co/WQoIuQOpco pic.twitter.com/0nxcne0xiN
22. kōan: A Corrected CBOW Implementation
Ozan İrsoy, Adrian Benton, Karl Stratos
It is a common belief in the NLP community that continuous bag-of-words (CBOW) word embeddings tend to underperform skip-gram (SG) embeddings. We find that this belief is founded less on theoretical differences in their training objectives than on faulty CBOW implementations in standard software libraries such as the official word2vec.c implementation and Gensim. We show that our correct implementation of CBOW yields word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks while being more than three times as fast to train. We release our implementation, kōan, at https://github.com/bloomberg/koan.
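The fix is visible in a single update step. Since the CBOW hidden state is the mean of the context vectors, the chain rule gives each context vector the gradient divided by the context size; the faulty code paths apply the full gradient to every context vector. A minimal numpy sketch of one negative-sampling step, written under that reading of the bug:

```python
import numpy as np

def cbow_step(ctx_vecs, out_vec, label, lr=0.05):
    """One negative-sampling SGD update for CBOW with the corrected gradient."""
    h = ctx_vecs.mean(axis=0)                # hidden state: mean of context vectors
    sigma = 1.0 / (1.0 + np.exp(-out_vec @ h))
    g = label - sigma                        # gradient of log-likelihood wrt the score
    grad_h = g * out_vec
    # Corrected: d h / d v_c = 1/|context|, so each context vector gets grad / |context|.
    # The buggy implementations omit this division.
    ctx_vecs += lr * grad_h / len(ctx_vecs)
    out_vec += lr * g * h
    return ctx_vecs, out_vec

rng = np.random.default_rng(0)
ctx = rng.normal(scale=0.1, size=(4, 16))  # 4 context word vectors
out = rng.normal(scale=0.1, size=16)       # output vector of the target word
cbow_step(ctx, out, label=1)               # positive (true target) example
```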
A claim that CBOW underperforms skip-gram in word-embedding training because the update rule in current implementations is wrong https://t.co/NPciw2FHlp In response, the author of one implementation suggests a different implementation change may be the cause. The truth is still unclear https://t.co/siyDdi4Bgv https://t.co/QUI2LNxB5j
— Daisuke Okanohara (@hillbig) January 3, 2021
23. Adaptive Extreme Edge Computing for Wearable Devices
Erika Covi, Elisa Donati, Hadi Heidari, David Kappel, Xiangpeng Liang, Melika Payvand, Wei Wang
Wearable devices are a fast-growing technology with impact on personal healthcare for both society and economy. Due to the widespread use of sensors in pervasive and distributed networks, power consumption, processing speed, and system adaptation are vital in future smart wearable devices. The visioning and forecasting of how to bring computation to the edge in smart sensors have already begun, with an aspiration to provide adaptive extreme edge computing. Here, we provide a holistic view of hardware and theoretical solutions towards smart wearable devices that can provide guidance to research in this pervasive computing era. We propose various solutions for biologically plausible models for continual learning in neuromorphic computing technologies for wearable sensors. To envision this concept, we provide a systematic outline of prospective low-power and low-latency scenarios for wearable sensors on neuromorphic platforms. We successively describe vital potential landscapes of neuromorphic processors exploiting complementary metal-oxide-semiconductor (CMOS) and emerging memory technologies (e.g. memristive devices). Furthermore, we evaluate the requirements for edge computing within wearable devices in terms of footprint, power consumption, latency, and data size. We additionally investigate the challenges beyond neuromorphic computing hardware, in algorithms and devices, that could impede enhancement of adaptive edge computing in smart wearable devices.
New collaborative #preprint out now on #arXiv:
— Hadi Heidari هـ (@hadihei) January 2, 2021
Adaptive Extreme #EdgeComputing for #Wearable Devices https://t.co/1ape44SuXb
We introduce low power and low latency scenarios of wearable sensors in #neuromorphic platforms and potential landscapes of neuromorphic processors! pic.twitter.com/FkuZtAkfPS
24. Reservoir Transformer
Sheng Shen, Alexei Baevski, Ari S. Morcos, Kurt Keutzer, Michael Auli, Douwe Kiela
We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear “reservoir” layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.
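A sketch of the construction, assuming standard PyTorch encoder layers; the placement frequency and the choice of a vanilla transformer layer as the reservoir are illustrative (the paper also explores other non-linear reservoir layers):

```python
import torch.nn as nn

def add_reservoir_layers(layers, every=3):
    """Intersperse frozen, randomly initialized encoder layers among regular
    (trainable) ones. Frozen layers never receive parameter updates, trading
    a little capacity for wall-clock training speed."""
    out = nn.ModuleList()
    for i, layer in enumerate(layers):
        out.append(layer)
        if (i + 1) % every == 0:
            reservoir = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
            for p in reservoir.parameters():
                p.requires_grad = False  # random init, never updated
            out.append(reservoir)
    return out

trainable = [nn.TransformerEncoderLayer(512, 8, batch_first=True) for _ in range(6)]
stack = add_reservoir_layers(trainable)
print(sum(not p.requires_grad for layer in stack for p in layer.parameters()))  # frozen tensors
```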
Random Layers can be helpful!
— Sheng (Arnold) Shen (@shengs1123) January 1, 2021
We show transformers obtain impressive performance even when some of the layers are randomly initialized and never updated.
Check it out here: https://t.co/G17sh8V2tr.
with @douwekiela, @MichaelAuli, @arimorcos, Alexei Baevski, and @KurtKeutzer
1/N
25. BANG: Bridging Autoregressive and Non-autoregressive Generation with Large Scale Pretraining
Weizhen Qi, Yeyun Gong, Jian Jiao, Yu Yan, Dayiheng Liu, Weizhu Chen, Kewen Tang, Houqiang Li, Jiusheng Chen, Ruofei Zhang, Ming Zhou, Nan Duan
In this paper, we propose BANG, a new pretraining model to Bridge the gap between Autoregressive (AR) and Non-autoregressive (NAR) Generation. AR and NAR generation can be uniformly regarded as differing in the extent to which previous tokens can be attended to, and BANG bridges AR and NAR generation by designing a novel model structure for large-scale pre-training. A pretrained BANG model can simultaneously support AR, NAR, and semi-NAR generation to meet different requirements. Experiments on question generation (SQuAD 1.1), summarization (XSum), and dialogue (PersonaChat) show that BANG improves NAR and semi-NAR performance significantly, as well as attaining comparable performance with strong AR pretrained models. Compared with the semi-NAR strong baselines, BANG achieves absolute improvements of 14.01 and 5.24 in overall scores of SQuAD and XSum, respectively. In addition, BANG achieves absolute improvements of 10.73, 6.39, and 5.90 in overall scores of SQuAD, XSum, and PersonaChat compared with the NAR strong baselines, respectively. Our code will be made publicly available in the near future at https://github.com/microsoft/BANG.
BANG: Bridging Autoregressive and Non-autoregressive Generation with Large Scale Pretraining
— AK (@ak92501) January 1, 2021
pdf: https://t.co/ofGpd3rx7U
abs: https://t.co/7HaSWipRat pic.twitter.com/AZhsoca83Q
26. HateCheck: Functional Tests for Hate Speech Detection Models
Paul Röttger, Bertram Vidgen, Dong Nguyen, Zeerak Waseem, Helen Margetts, Janet Pierrehumbert
Detecting online hate is a difficult task that even state-of-the-art models struggle with. In previous research, hate speech detection models are typically evaluated by measuring their performance on held-out test data using metrics such as accuracy and F1 score. However, this approach makes it difficult to identify specific model weak points. It also risks overestimating generalisable model quality due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, we introduce HateCheck, a first suite of functional tests for hate speech detection models. We specify 29 model functionalities, the selection of which we motivate by reviewing previous research and through a series of interviews with civil society stakeholders. We craft test cases for each functionality and validate data quality through a structured annotation process. To illustrate HateCheck’s utility, we test near-state-of-the-art transformer detection models as well as a popular commercial model, revealing critical model weaknesses.
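The evaluation style is easy to sketch; the functionalities and test cases below are invented illustrations, not drawn from the released HateCheck suite:

```python
# Each functionality gets contrastive cases with gold labels; a model must get
# the non-hateful counterparts right too, not just flag hate-adjacent wording.
FUNCTIONAL_TESTS = {
    "negation": [
        ("I really can't stand people like that", "hateful"),
        ("I really can't say I hate anyone", "non-hateful"),
    ],
    "counter-speech": [
        ("Saying 'they are all vermin' is disgusting and wrong", "non-hateful"),
    ],
}

def run_functional_tests(classify, tests):
    """`classify` is any callable mapping text -> label; report per-functionality accuracy."""
    for name, cases in tests.items():
        correct = sum(classify(text) == gold for text, gold in cases)
        print(f"{name}: {correct}/{len(cases)}")

run_functional_tests(lambda text: "hateful", FUNCTIONAL_TESTS)  # trivial always-hateful baseline
```

Per-functionality scores are the point: an aggregate F1 can hide that a model fails every counter-speech case while acing slur detection.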
Evaluating #hatespeech detection models is really difficult. It depends on the quality/coverage/variety of the dataset you've trained on.
— Bertie Vidgen (@bertievidgen) January 2, 2021
We've developed HateCheck to assess models using functional tests (w. a 4k dataset).
Even near-SoTA have big problems! https://t.co/Bc0azGrwqF
27. Conditional Generation of Temporally-ordered Event Sequences
Shih-Ting Lin, Nathanael Chambers, Greg Durrett
Models encapsulating narrative schema knowledge have proven to be useful for a range of event-related tasks, but these models typically do not engage with temporal relationships between events. We present a BART-based conditional generation model capable of capturing event co-occurrence as well as the temporality of event sequences. This single model can address both temporal ordering, sorting a given sequence of events into the order they occurred, and event infilling, predicting new events which fit into a temporally-ordered sequence of existing ones. Our model is trained as a denoising autoencoder: we take temporally-ordered event sequences, shuffle them, delete some events, and then attempt to recover the original event sequence. In this fashion, the model learns to make inferences given incomplete knowledge about the events in an underlying scenario. On the temporal ordering task, we show that our model is able to unscramble event sequences from existing datasets without access to explicitly labeled temporal training data, outperforming both a BERT-based pairwise model and a BERT-based pointer network. On event infilling, human evaluation shows that our model is able to generate events that fit better temporally into the input events when compared to GPT-2 story completion models.
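The noising side of the autoencoder is simple to sketch; the paper operates on linearized SRL event representations, for which plain strings stand in here:

```python
import random

def noise_event_sequence(events, delete_p=0.3, seed=0):
    """Build one denoising-autoencoder training pair: the model sees a shuffled,
    partially deleted event list and must generate the original ordered one."""
    rng = random.Random(seed)
    kept = [e for e in events if rng.random() > delete_p] or events
    noised = kept[:]
    rng.shuffle(noised)
    return noised, events  # (input, target)

events = ["wake up", "eat breakfast", "drive to work", "attend meeting", "drive home"]
noised, target = noise_event_sequence(events)
print("input: ", noised)
print("target:", target)
```

Deletion is what forces infilling ability: to reproduce the target, the model must both reorder the surviving events and generate the missing ones in temporally plausible slots.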
New preprint with Shih-Ting Lin and @NateChambers about modeling temporally-ordered event sequences: https://t.co/Ky6ynUkLzP
— Greg Durrett (@gregd_nlp) January 1, 2021
We train a BART-based denoising autoencoder over linearized SRL event representations to make several
kinds of temporal-related event inferences 1/2 pic.twitter.com/ncLJwHLOgm
28. Simulation and Control of Deformable Autonomous Airships in Turbulent Wind
Eric Price, Yu Tang Liu, Michael J. Black, Aamir Ahmad
Fixed-wing and multirotor UAVs are common in the field of robotics, and solutions for simulation and control of these vehicles are ubiquitous. This is not the case for airships, a simulation of which needs to address unique properties: i) dynamic deformation in response to aerodynamic and control forces, ii) high susceptibility to wind and turbulence at low airspeed, and iii) high variability in airship designs regarding placement, direction and vectoring of thrusters and control surfaces. We present a flexible framework for modeling, simulation and control of airships, based on the Robot Operating System (ROS) and the Gazebo simulation environment, both of which are open source, together with commercial off-the-shelf (COTS) electronics. Based on simulated wind and deformation, we predict substantial effects on controllability, verified in real-world flight experiments. All our code is shared as open source, for the benefit of the community and to facilitate lighter-than-air vehicle (LTAV) research. https://github.com/robot-perception-group/airship_simulation
Simulation and Control of Deformable Autonomous Airships in Turbulent Wind
— AK (@ak92501) January 1, 2021
pdf: https://t.co/1SNQQDyl26
abs: https://t.co/NcQdlagq5e
github: https://t.co/rdmpHWz2dO pic.twitter.com/PToVC0G6Dg
29. Intrinsic Bias Metrics Do Not Correlate with Application Bias
Seraphina Goldfarb-Tarrant, Rebecca Marchant, Ricardo Muñoz Sanchez, Mugdha Pandya, Adam Lopez
Natural Language Processing (NLP) systems learn harmful societal biases that cause them to extend and proliferate inequality widely, as they are deployed in more and more situations. To address and combat this, the NLP community has come to rely on a variety of metrics to identify and quantify bias in black-box models, which are used to monitor model behaviour and to guide efforts at debiasing. Some of these metrics are intrinsic, measured in word embedding spaces, and some are extrinsic, measuring the bias present downstream in the tasks that the word embeddings are plugged into. This research examines whether intrinsic metrics (which are easy to measure) correlate well with extrinsic metrics (which reflect real-world bias). We measure both intrinsic and extrinsic bias across hundreds of trained models covering different tasks and experimental conditions, and find that there is no reliable correlation between these metrics that holds in more than extremely specific settings. We advise that efforts to debias embedding spaces always be paired with measurement of downstream model bias, and suggest that the community direct more effort into making downstream measurement simpler and easier.
I have a new preprint out! There have been some callouts recently to the need to investigate how metrics of bias for NLP systems correlate to each other, and this is a tiny piece of that answer. (TL;DR poorly). Preprint here: https://t.co/cP4Ty5z3jO
— Seraphina Goldfarb-Tarrant (@seraphinagt) January 2, 2021
30. MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers
Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, Furu Wei
We generalize deep self-attention distillation in MiniLM (Wang et al., 2020) by using only self-attention relation distillation for task-agnostic compression of pretrained Transformers. In particular, we define multi-head self-attention relations as the scaled dot-products between pairs of query, key, and value vectors within each self-attention module. We then employ this relational knowledge to train the student model. Besides its simplicity and unified principle, more favorably, there is no restriction on the number of the student's attention heads, while most previous work has to guarantee the same head number between teacher and student. Moreover, the fine-grained self-attention relations tend to fully exploit the interaction knowledge learned by the Transformer. In addition, we thoroughly examine the layer selection strategy for teacher models, rather than just relying on the last layer as in MiniLM. Experimental results demonstrate that our models distilled from base-size and large-size teachers (BERT and RoBERTa) outperform the state of the art.
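A sketch of the relation-distillation loss for one vector type (queries), assuming pre-computed per-head projections. Because relations are seq-by-seq matrices, the student's per-head hidden size need not match the teacher's; the paper additionally re-splits concatenated vectors into a common number of relation heads, whereas this sketch simply gives both sides the same head count:

```python
import torch
import torch.nn.functional as F

def self_attention_relation(x):
    """Pairwise scaled dot-product relations for one vector type (Q, K, or V).
    x: (batch, heads, seq, d_head) -> (batch, heads, seq, seq), rows softmaxed."""
    d_head = x.shape[-1]
    return F.softmax(x @ x.transpose(-1, -2) / d_head ** 0.5, dim=-1)

def relation_distill_loss(teacher_x, student_x):
    """KL divergence between teacher and student relation matrices. The relation
    shape depends only on sequence length, so per-head dimensions may differ."""
    r_t = self_attention_relation(teacher_x)
    r_s = self_attention_relation(student_x)
    return F.kl_div(r_s.log(), r_t, reduction="batchmean")

teacher_q = torch.randn(2, 12, 10, 64)  # teacher queries: 12 heads, d_head 64
student_q = torch.randn(2, 12, 10, 16)  # student: same relation heads, smaller d_head
print(relation_distill_loss(teacher_q, student_q))
```

The full objective sums this loss over the query-query, key-key, and value-value relations of a chosen teacher layer.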
MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers
— AK (@ak92501) January 1, 2021
pdf: https://t.co/6owREO7pBB
abs: https://t.co/aZv8yY2jyZ pic.twitter.com/gmNAXyRt7e