1. CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting
Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy with soft inductive biases in place of hard token boundaries. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by >= 1 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
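To make the "downsample characters, then run a deep transformer" idea concrete, here is a minimal sketch of that shape of model. It is only an illustration of the abstract's description, not the official CANINE architecture; the embedding table, downsampling rate, and layer sizes are all assumptions.

```python
# Illustrative sketch of character-level encoding with downsampling before a
# deep transformer stack (assumed sizes; not the released CANINE model).
import torch
import torch.nn as nn

class CharDownsampleEncoder(nn.Module):
    def __init__(self, vocab_size=1024, dim=256, downsample=4, depth=6):
        super().__init__()
        # Characters are embedded directly (via a small hashed codepoint table),
        # so no subword vocabulary is needed.
        self.char_embed = nn.Embedding(vocab_size, dim)
        # A strided convolution shortens the sequence before the expensive stack.
        self.downsample = nn.Conv1d(dim, dim, kernel_size=downsample, stride=downsample)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, codepoints):                   # (batch, chars)
        x = self.char_embed(codepoints % 1024)       # hash codepoints into the table
        x = self.downsample(x.transpose(1, 2)).transpose(1, 2)  # (batch, chars/4, dim)
        return self.encoder(x)

enc = CharDownsampleEncoder()
chars = torch.tensor([[ord(c) for c in "no tokenizer needed"]])
print(enc(chars).shape)   # roughly (1, len/4, 256)
```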
A character-level method that beats BERT, without any word segmentation, has been announced
— 小川雄太郎 (@ISID_AI_team) March 12, 2021
CANINE (Character Architecture with No tokenization In Neural Encoders)
(論文)https://t.co/greWNpcPsN
I have my doubts about how far it can really go for Japanese, but it is explained very carefully here: https://t.co/GcVOQMAiVU
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
— Aran Komatsuzaki (@arankomatsuzaki) March 12, 2021
A pre-training strategy w/o hard token boundaries trains a character-level encoder that outperforms mBERT w/ fewer parameters.https://t.co/1mRWy12E48 pic.twitter.com/XWWlD2X80q
2. Holistic 3D Scene Understanding from a Single Image with Implicit Representation
Cheng Zhang, Zhaopeng Cui, Yinda Zhang, Bing Zeng, Marc Pollefeys, Shuaicheng Liu
We present a new pipeline for holistic 3D scene understanding from a single image, which predicts object shape, object pose, and scene layout. As it is a highly ill-posed problem, existing methods usually suffer from inaccurate estimation of both shapes and layout, especially for cluttered scenes, due to heavy occlusion between objects. We propose to utilize the latest deep implicit representation to solve this challenge. We not only propose an image-based local structured implicit network to improve object shape estimation, but also refine 3D object pose and scene layout via a novel implicit scene graph neural network that exploits the implicit local object features. A novel physical violation loss is also proposed to avoid incorrect context between objects. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods in terms of object shape, scene layout estimation, and 3D object detection.
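The abstract does not spell out the physical violation loss, but one plausible reading is a penalty on points that two objects' implicit functions both claim as "inside" (i.e., interpenetration). The sketch below shows that formulation as an assumption for intuition only; it is not necessarily the loss used in the paper.

```python
# Hedged sketch of a physical-violation-style penalty: sample points and
# penalize any point that two signed-distance-style implicit functions both
# mark as interior (negative distance = inside).
import torch

def physical_violation_loss(sdf_a, sdf_b, points):
    """sdf_* : callables mapping (N, 3) points to signed distances (negative inside);
    `points` are samples from the region where the two objects might overlap."""
    inside_a = torch.relu(-sdf_a(points))   # > 0 only inside object A
    inside_b = torch.relu(-sdf_b(points))   # > 0 only inside object B
    # A point contributes only if it lies inside both objects at once.
    return (inside_a * inside_b).mean()

# Toy usage with two unit spheres whose centers are 1.0 apart (so they overlap).
sphere = lambda c: (lambda p: (p - c).norm(dim=-1) - 1.0)
pts = torch.rand(4096, 3) * 4 - 2
print(physical_violation_loss(sphere(torch.tensor([0., 0., 0.])),
                              sphere(torch.tensor([1., 0., 0.])), pts))
```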
Holistic 3D Scene Understanding from a Single Image with Implicit Representation
— AK (@ak92501) March 12, 2021
pdf: https://t.co/XDcIxs9Ggi
abs: https://t.co/Hnws5q7WL9 pic.twitter.com/Ht3IenDYOB
3. Multi-Format Contrastive Learning of Audio Representations
Luyu Wang, Aaron van den Oord
Recent advances suggest the advantage of multi-modal training in comparison with single-modal methods. In contrast to this view, in our work we find that a similar gain can be obtained from training with different formats of a single modality. In particular, we investigate the use of the contrastive learning framework to learn audio representations by maximizing the agreement between the raw audio and its spectral representation. We find a significant gain using this multi-format strategy over the single-format counterparts. Moreover, on the downstream AudioSet and ESC-50 classification tasks, our audio-only approach achieves new state-of-the-art results with a mean average precision of 0.376 and an accuracy of 90.5%, respectively.
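The core idea is simple: embed the same clip once from its raw waveform and once from its spectrogram, then pull the two embeddings together against in-batch negatives. Below is a minimal InfoNCE-style sketch of that agreement term; the encoders, embedding size, and temperature are placeholder assumptions rather than the paper's setup.

```python
# Minimal sketch of the multi-format agreement objective (assumed hyperparameters).
import torch
import torch.nn.functional as F

def multi_format_contrastive_loss(wave_emb, spec_emb, temperature=0.1):
    """wave_emb, spec_emb: (batch, dim) embeddings of the same clips,
    one from raw audio and one from its spectrogram."""
    w = F.normalize(wave_emb, dim=-1)
    s = F.normalize(spec_emb, dim=-1)
    logits = w @ s.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(len(w))            # the matching clip is the positive pair
    # Symmetric cross-entropy: raw-to-spectrogram and spectrogram-to-raw.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = multi_format_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```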
Multi-Format Contrastive Learning of Audio Representations
— Aran Komatsuzaki (@arankomatsuzaki) March 12, 2021
Using the contrastive learning to maximize the agreement between the raw audio and its spectral representation leads to a significant gain and achieves a new SotA. https://t.co/R6JcpLaCQV pic.twitter.com/ZgQfInnyEK
Multi-Format Contrastive Learning of Audio Representations
— AK (@ak92501) March 12, 2021
pdf: https://t.co/l2YDNCPOJM
abs: https://t.co/w8i5u3nFlr pic.twitter.com/VU3znf2QJm
4. S4RL: Surprisingly Simple Self-Supervision for Offline Reinforcement Learning
Samarth Sinha, Animesh Garg
Offline reinforcement learning proposes to learn policies from large collected datasets without interaction. These algorithms have made it possible to learn useful skills from data that can then be transferred to the environment, making it feasible to deploy the trained policies in real-world settings where interactions may be costly or dangerous, such as self-driving. However, current algorithms overfit to the dataset they are trained on and generalize poorly out-of-distribution (OOD) when deployed in the environment. We propose a Surprisingly Simple Self-Supervision algorithm (S4RL), which utilizes data augmentations from states to learn value functions that are better at generalizing and extrapolating when deployed in the environment. We investigate different data augmentation techniques that help learn a value function that can extrapolate to OOD data, and how to combine data augmentations and offline RL algorithms to learn a policy. We experimentally show that using S4RL significantly improves the state-of-the-art on most benchmark offline reinforcement learning tasks on popular benchmark datasets from D4RL, despite being simple and easy to implement.
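One way to picture "data augmentations from states": perturb each state with a cheap augmentation and smooth the value estimate over the perturbed copies. The sketch below applies that idea to a Q-learning bootstrap target; the choice of Gaussian noise, the noise scale, and the averaging over copies are assumptions for illustration, not S4RL's exact recipe.

```python
# Sketch: average Q-targets over augmented (noise-perturbed) next states so the
# value function generalizes to nearby, possibly OOD states (assumed details).
import torch

def augmented_q_target(q_net, next_states, rewards, dones,
                       gamma=0.99, n_aug=4, sigma=1e-3):
    targets = []
    for _ in range(n_aug):
        noisy = next_states + sigma * torch.randn_like(next_states)
        targets.append(q_net(noisy).max(dim=-1).values)   # greedy bootstrap
    next_q = torch.stack(targets).mean(dim=0)              # average over augmented copies
    return rewards + gamma * (1 - dones) * next_q

# Toy usage with a random Q-network over 4-dim states and 2 actions.
q_net = torch.nn.Linear(4, 2)
tgt = augmented_q_target(q_net, torch.randn(32, 4), torch.zeros(32), torch.zeros(32))
```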
S4RL: Surprisingly Simple Self-Supervision for Offline Reinforcement Learning
— Aran Komatsuzaki (@arankomatsuzaki) March 12, 2021
Significantly improves the SotA on various offline RL tasks with a better data augmentation strategy.https://t.co/eEmbOYs41r pic.twitter.com/q4ewkW6tF2
Excited to share our recent work:
— Samarth Sinha (@_sam_sinha_) March 12, 2021
Surprisingly Simple Self-Supervision for Offline RL where we propose a Surprisingly Simple method to learn representations using data augmentations from offline data which achieves SOTA performance! https://t.co/jPX8tt8pc6
w/ @animesh_garg pic.twitter.com/CTwA31uLMq
5. SMPLicit: Topology-aware Generative Model for Clothed People
Enric Corona, Albert Pumarola, Guillem Alenyà, Gerard Pons-Moll, Francesc Moreno-Noguer
In this paper we introduce SMPLicit, a novel generative model to jointly represent body pose, shape and clothing geometry. In contrast to existing learning-based approaches that require training specific models for each type of garment, SMPLicit can represent in a unified manner different garment topologies (e.g., from sleeveless tops to hoodies to open jackets), while controlling other properties like the garment size or tightness/looseness. We show our model to be applicable to a large variety of garments including T-shirts, hoodies, jackets, shorts, pants, skirts, shoes and even hair. The representation flexibility of SMPLicit builds upon an implicit model conditioned on the SMPL human body parameters and a learnable latent space which is semantically interpretable and aligned with the clothing attributes. The proposed model is fully differentiable, allowing for its use in larger end-to-end trainable systems. In the experimental section, we demonstrate SMPLicit can be readily used for fitting 3D scans and for 3D reconstruction in images of dressed people. In both cases we are able to go beyond the state of the art, by retrieving complex garment geometries, handling situations with multiple clothing layers and providing a tool for easy outfit editing. To stimulate further research in this direction, we will make our code and model publicly available at http://www.iri.upc.edu/people/ecorona/smplicit/.
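For readers who want a picture of "an implicit model conditioned on SMPL parameters and a latent code", here is a bare-bones sketch of that conditioning pattern: a query point plus body parameters plus a garment latent code go into an MLP that outputs a distance-to-garment value. All dimensions and layer choices are assumptions; this is not the released SMPLicit network.

```python
# Hedged sketch of the conditioning scheme described in the abstract
# (assumed dimensions: 82 SMPL params, 18-dim garment latent).
import torch
import torch.nn as nn

class ImplicitGarment(nn.Module):
    def __init__(self, smpl_dim=82, latent_dim=18, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + smpl_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),            # distance to the garment surface
        )

    def forward(self, points, smpl_params, latent):
        n = points.shape[0]
        cond = torch.cat([smpl_params, latent], dim=-1).expand(n, -1)
        return self.net(torch.cat([points, cond], dim=-1)).squeeze(-1)

garment = ImplicitGarment()
d = garment(torch.randn(1000, 3), torch.zeros(1, 82), torch.randn(1, 18))
```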
SMPLicit: Topology-aware Generative Model for Clothed People
— AK (@ak92501) March 12, 2021
pdf: https://t.co/4x1cxKQFnk
abs: https://t.co/1ksZEdppsb
project page: https://t.co/A6lsQMXEvs pic.twitter.com/QPZM2USAOI
📢📢Check out our new work:
— Enric Corona (@enric_corona) March 12, 2021
SMPLicit: Topology-aware Generative Model for Clothed People!
Accepted at #CVPR2021 with @AlbertPumarola @_guillem_ @GerardPonsMoll1 @fmorenoguer
Arxiv: https://t.co/CnGiBrU0LF
project: https://t.co/ll7NyOIpw0
Code will be available soon, stay tuned! pic.twitter.com/xXJLzXVAx4
We have just released SMPLicit :), a fully differentiable generative model to jointly represent body pose, shape, and clothing geometry. #CVPR2021
— Albert Pumarola (@AlbertPumarola) March 12, 2021
🖥Project: https://t.co/64eJFJg3Nr
📄PDF: https://t.co/REhmGNj2ZD https://t.co/N6rzMwvmOt
SMPLicit: a differentiable body model with clothing of varied topology (neural implicit functions), we fit it to images too. #CVPR2021. Check it out:
— Gerard Pons-Moll (@GerardPonsMoll1) March 12, 2021
paper: https://t.co/vhXsEuluPq
project: https://t.co/03FQgO0PN9
with @enric_corona @AlbertPumarola @_guillem_ @fmorenoguer https://t.co/hfVhNWQs3K
6. BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation
Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino
Inspired by the recent progress in self-supervised learning for computer vision that generates supervision using data augmentations, we explore a new general-purpose audio representation learning approach. We propose learning general-purpose audio representation from a single audio segment without expecting relationships between different time segments of audio samples. To implement this principle, we introduce Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced “viola”), an audio self-supervised learning method based on BYOL for learning general-purpose audio representation. Unlike most previous audio self-supervised learning methods that rely on agreement between nearby audio segments or disagreement between distant ones, BYOL-A creates contrasts in an augmented audio segment pair derived from a single audio segment. With a combination of normalization and augmentation techniques, BYOL-A achieves state-of-the-art results in various downstream tasks. Extensive ablation studies also clarify the contribution of each component and their combinations.
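The distinctive piece is that both views come from the same segment and there are no negative pairs, in the BYOL style: an online network with a predictor chases a stop-gradient momentum (target) network. A minimal sketch of that objective is below; the encoders, augmentations, and dimensions are placeholders, not BYOL-A's specific normalization and augmentation blocks.

```python
# Minimal BYOL-style loss on two augmented views of the same audio segment
# (placeholder embeddings; no negatives are used).
import torch
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    """online_pred: predictor output for view 1; target_proj: target-network
    projection for view 2 (both (batch, dim)); gradient is stopped on the target."""
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)   # stop gradient on the target branch
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

# Symmetrized usage: swap the roles of the two views and average.
view1, view2 = torch.randn(16, 128), torch.randn(16, 128)
loss = (byol_loss(view1, view2) + byol_loss(view2, view1)) / 2
```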
BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation
— AK (@ak92501) March 12, 2021
pdf: https://t.co/zrT9CanBQc
abs: https://t.co/fjRbMEd36g pic.twitter.com/nl2ISCTZao
Our new paper is out!🚀 First professional research paper in my 20+yrs career. 😭
— daisukelab (@nizumical) March 12, 2021
This is for training better audio encoder, especially for a practical small audio DNN.
"BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation"https://t.co/DwnNOpWsEX
7. Hurdles to Progress in Long-form Question Answering
Kalpesh Krishna, Aurko Roy, Mohit Iyyer
The task of long-form question answering (LFQA) involves retrieving documents relevant to a given question and using them to generate a paragraph-length answer. While many models have recently been proposed for LFQA, we show in this paper that the task formulation raises fundamental challenges regarding evaluation and dataset creation that currently preclude meaningful modeling progress. To demonstrate these challenges, we first design a new system that relies on sparse attention and contrastive retriever learning to achieve state-of-the-art performance on the ELI5 LFQA dataset. While our system tops the public leaderboard, a detailed analysis reveals several troubling trends: (1) our system’s generated answers are not actually grounded in the documents that it retrieves; (2) ELI5 contains significant train / test overlap, as at least 81% of ELI5 validation questions occur in paraphrased form in the training set; (3) ROUGE-L is not an informative metric of generated answer quality and can be easily gamed; and (4) human evaluations used for other text generation tasks are unreliable for LFQA. We provide suggestions to mitigate each of these issues, which we hope will lead to more rigorous LFQA research and meaningful progress in the future.
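For readers who want to poke at the ROUGE-L criticism themselves, here is a minimal longest-common-subsequence ROUGE-L F1 between a candidate and a reference answer. It uses plain whitespace tokenization and is only a toy; the paper's evaluation pipeline differs in its details.

```python
# Minimal LCS-based ROUGE-L F1 (whitespace tokenization only; illustrative).
def rouge_l_f1(candidate: str, reference: str) -> float:
    c, r = candidate.split(), reference.split()
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

print(rouge_l_f1("the cat sat on the mat", "a cat sat on a mat"))
```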
Hurdles to Progress in Long-form Question Answering
— Aran Komatsuzaki (@arankomatsuzaki) March 12, 2021
- ELI5 contains significant train / test overlap
- ROUGE-L is uninformative for this task
- human evaluations used for other text gen tasks
are unreliable for long-form QAhttps://t.co/hNDUTd4YdJ pic.twitter.com/gSvdr24qow
8. Fast and Accurate Model Scaling
Piotr Dollár, Mannat Singh, Ross Girshick
In this work we analyze strategies for convolutional neural network scaling; that is, the process of scaling a base convolutional network to endow it with greater computational complexity and consequently representational power. Example scaling strategies may include increasing model width, depth, resolution, etc. While various scaling strategies exist, their tradeoffs are not fully understood. Existing analysis typically focuses on the interplay of accuracy and flops (floating point operations). Yet, as we demonstrate, various scaling strategies affect model parameters, activations, and consequently actual runtime quite differently. In our experiments we show the surprising result that numerous scaling strategies yield networks with similar accuracy but with widely varying properties. This leads us to propose a simple fast compound scaling strategy that encourages primarily scaling model width, while scaling depth and resolution to a lesser extent. Unlike currently popular scaling strategies, which result in about an O(s) increase in model activations when scaling flops by a factor of s, the proposed fast compound scaling results in close to an O(√s) increase in activations, while achieving excellent accuracy. This leads to comparable speedups on modern memory-limited hardware (e.g., GPU, TPU). More generally, we hope this work provides a framework for analyzing and selecting scaling strategies under various computational constraints.
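The flop bookkeeping behind this is easy to sanity-check: with flops roughly proportional to depth × width² × resolution², putting all of a flop budget s into width means width only grows by about √s, so activations (which scale with width, depth, and resolution²) grow slowly. The helper below illustrates that allocation; the exponents are an illustration of the accounting, not the paper's calibrated scaling rule.

```python
# Illustrative fast-scaling helper: assign most of a flop increase s to width.
def fast_scale(width, depth, resolution, s, width_share=1.0):
    """width_share in [0, 1]: fraction of the flop increase assigned to width;
    the remainder is split evenly between depth and resolution."""
    rest = (1.0 - width_share) / 2.0
    new_width = width * s ** (0.5 * width_share)      # flops ~ width^2
    new_depth = depth * s ** rest                      # flops ~ depth
    new_resolution = resolution * s ** (0.5 * rest)    # flops ~ resolution^2
    return round(new_width), round(new_depth), round(new_resolution)

print(fast_scale(64, 20, 224, s=4))                    # width-only: (128, 20, 224)
print(fast_scale(64, 20, 224, s=4, width_share=0.8))   # mostly width, a little depth/res
```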
Fast and Accurate Model Scaling
— Aran Komatsuzaki (@arankomatsuzaki) March 12, 2021
Investigates various scaling strategies and finds that scaling up the width by O(sqrt(s)) works the best in terms of performance-computes trade-off with GPUs.https://t.co/7AjlABlspY pic.twitter.com/XJJcq0BXNJ
9. WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, Zongzheng Xi, Yueqian Yang, Anwen Hu, Jinming Zhao, Ruichen Li, Yida Zhao, Liang Zhang, Yuqing Song, Xin Hong, Wanqing Cui, Danyang Hou, Yingyan Li, Junyi Li, Peiyu Liu, Zheng Gong, Chuhao Jin, Yuchong Sun, Shizhe Chen, Zhiwu Lu, Zhicheng Dou, Qin Jin, Yanyan Lan, Wayne Xin Zhao, Ruihua Song, Ji-Rong Wen
Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project 'WenLan' led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model within the cross-modal contrastive learning (CMCL) framework. Unlike OpenAI CLIP that adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By building a large queue-based dictionary, our CMCL can incorporate more negative samples with limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our CMCL model. Extensive experiments demonstrate that the pre-trained CMCL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
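The "large queue-based dictionary" is the MoCo trick applied across modalities: an image query is contrasted against its paired text key plus a long queue of stale text keys as negatives, so the effective negative set is much larger than the GPU batch. A minimal sketch of that loss is below; dimensions, queue size, and temperature are illustrative assumptions.

```python
# Sketch of a MoCo-style cross-modal contrastive loss with a negative queue.
import torch
import torch.nn.functional as F

def cross_modal_moco_loss(img_q, txt_k, txt_queue, temperature=0.07):
    """img_q: (B, D) image queries; txt_k: (B, D) paired text keys (momentum
    encoder, detached); txt_queue: (K, D) stored negative text keys."""
    img_q, txt_k = F.normalize(img_q, dim=-1), F.normalize(txt_k.detach(), dim=-1)
    queue = F.normalize(txt_queue, dim=-1)
    pos = (img_q * txt_k).sum(dim=-1, keepdim=True)      # (B, 1) positive logits
    neg = img_q @ queue.t()                               # (B, K) negatives from the queue
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(len(img_q), dtype=torch.long)    # the positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = cross_modal_moco_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(4096, 256))
# After each step, the current text keys are enqueued and the oldest are dequeued.
```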
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
— AK (@ak92501) March 12, 2021
pdf: https://t.co/LPcVgF6lFO
abs: https://t.co/UfzAjDZyYh pic.twitter.com/MQz6qZXMaT
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
— Aran Komatsuzaki (@arankomatsuzaki) March 12, 2021
Constructs a large Chinese multi-source image-text dataset for pre-training, which leads to their model outperforming both UNITER and CLIP on various downstream tasks.https://t.co/NWVA7aqt17 pic.twitter.com/foX43PaUn1
10. MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding
Tuhin Chakrabarty, Xurui Zhang, Smaranda Muresan, Nanyun Peng
Generating metaphors is a challenging task as it requires a proper understanding of abstract concepts, making connections between unrelated concepts, and deviating from the literal meaning. Based on a theoretically-grounded connection between metaphors and symbols, we propose a method to automatically construct a parallel corpus by transforming a large number of metaphorical sentences from the Gutenberg Poetry corpus (Jacobs, 2018) to their literal counterpart using recent advances in masked language modeling coupled with commonsense inference. For the generation task, we incorporate a metaphor discriminator to guide the decoding of a sequence to sequence model fine-tuned on our parallel data to generate high-quality metaphors. Human evaluation on an independent test set of literal statements shows that our best model generates metaphors better than three well-crafted baselines 66% of the time on average. A task-based evaluation shows that human-written poems enhanced with metaphors proposed by our model are preferred 68% of the time compared to poems without metaphors.
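The abstract says a metaphor discriminator guides the decoding of the fine-tuned seq2seq model. One simple, hedged way to realize discriminator-guided decoding is over-generate-and-rerank: sample several candidates and keep the one the discriminator scores as most metaphorical. The paper's actual decoding may interleave the discriminator more tightly; `generate_fn` and `metaphoricity_fn` below are placeholders for whatever generator and discriminator you have.

```python
# Hedged sketch of discriminator-guided decoding via over-generate-and-rerank.
def discriminative_decode(literal_sentence, generate_fn, metaphoricity_fn, n_candidates=10):
    """generate_fn(text, n) -> list of n candidate rewrites;
    metaphoricity_fn(text) -> float, higher = more metaphorical."""
    candidates = generate_fn(literal_sentence, n_candidates)
    return max(candidates, key=metaphoricity_fn)

# Toy usage with stand-in functions.
fake_generate = lambda s, n: [s, s.replace("ended", "crumbled"), s.replace("ended", "evaporated")][:n]
fake_score = lambda s: float("crumbled" in s or "evaporated" in s)
print(discriminative_decode("The relationship ended.", fake_generate, fake_score))
```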
🔥Metaphors are not to be trifled with🔥 Excited to share #NAACL2021 preprint titled “MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding”https://t.co/gcnvn995vR . Joint work with my figurative NLG constants @VioletNPeng and Smaranda Muresan. #NLProc pic.twitter.com/3mJqlLV6j2
— Tuhin Chakrabarty (@TuhinChakr) March 12, 2021
11. Fair Mixup: Fairness via Interpolation
Ching-Yao Chuang, Youssef Mroueh
Training classifiers under fairness constraints such as group fairness regularizes the disparities of predictions between the groups. Nevertheless, even though the constraints are satisfied during training, they might not generalize at evaluation time. To improve the generalizability of fair classifiers, we propose fair mixup, a new data augmentation strategy for imposing the fairness constraint. In particular, we show that fairness can be achieved by regularizing the models on paths of interpolated samples between the groups. We use mixup, a powerful data augmentation strategy, to generate these interpolates. We analyze fair mixup and empirically show that it ensures a better generalization for both accuracy and fairness measurement in tabular, vision, and language benchmarks.
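"Regularizing the models on paths of interpolated samples" can be pictured as: mix a batch from each group with a random lambda and penalize how quickly the mean prediction changes as lambda moves from one group to the other. The sketch below implements that demographic-parity-flavored idea with autograd on lambda; the paper derives specific variants for different fairness metrics, so treat this as an illustration rather than the exact regularizer.

```python
# Hedged sketch of a fair-mixup-style path penalty for a binary classifier.
import torch

def fair_mixup_penalty(model, x_group_a, x_group_b):
    lam = torch.rand(1, requires_grad=True)
    mixed = lam * x_group_a + (1 - lam) * x_group_b        # interpolated samples
    mean_pred = torch.sigmoid(model(mixed)).mean()
    # Sensitivity of the mean prediction to moving between the two groups.
    grad_lam, = torch.autograd.grad(mean_pred, lam, create_graph=True)
    return grad_lam.abs().squeeze()

model = torch.nn.Linear(10, 1)
penalty = fair_mixup_penalty(model, torch.randn(32, 10), torch.randn(32, 10))
# The training objective would be: task_loss + alpha * penalty
```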
"Fair Mixup: Fairness via Interpolation"
— Ching-Yao Chuang (@ChingYaoChuang) March 12, 2021
Happy to share that the work done during my internship at IBM Research AI was accepted at ICLR 2021!
Paper: https://t.co/uN28ntnwfV
Code: https://t.co/vaQigPhooR
with Youssef Mroueh pic.twitter.com/Ucc6Zsy7T2
12. 3D Head-Position Prediction in First-Person View by Considering Head Pose for Human-Robot Eye Contact
Yuki Tamaru, Yasunori Ozaki, Yuki Okafuji, Jun Baba, Junya Nakanishi, Yuichiro Yoshikawa
For a humanoid robot to make eye contact to initiate communication with a human, it is necessary to estimate the human’s head position. However, eye contact becomes difficult due to the mechanical delay of the robot while the subject with whom the robot is interacting is moving. Owing to these issues, it is important to perform head-position prediction to mitigate the effect of the delay in the robot’s motion. Based on the fact that humans turn their heads before changing direction while walking, we hypothesized that the accuracy of three-dimensional (3D) head-position prediction from the first-person view can be improved by taking the head pose into account. We compared our method with the conventional Kalman filter-based method and found our method to be more accurate. The experimental results show that considering the head pose helps improve the accuracy of 3D head-position prediction.
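A toy numerical illustration of the intuition: a constant-velocity extrapolation (the Kalman-style baseline the paper compares against) versus an extrapolation that steers the assumed motion direction toward where the head is already facing. This is only meant to convey "heads turn before the body changes direction"; it is not the paper's predictor, and all numbers are made up for the example.

```python
# Toy comparison: constant-velocity prediction vs. a heading-informed prediction.
import numpy as np

def predict_constant_velocity(pos, vel, dt):
    return pos + vel * dt

def predict_with_head_pose(pos, vel, head_dir, dt, blend=0.5):
    """head_dir: unit vector of where the head is facing; blend in [0, 1]
    shifts the assumed motion direction toward the head direction."""
    speed = np.linalg.norm(vel)
    direction = (1 - blend) * vel / max(speed, 1e-8) + blend * head_dir
    direction /= max(np.linalg.norm(direction), 1e-8)
    return pos + speed * direction * dt

pos, vel = np.array([0.0, 0.0, 1.6]), np.array([1.0, 0.0, 0.0])
head_dir = np.array([0.7, 0.7, 0.0])   # head already turned toward the upcoming turn
print(predict_constant_velocity(pos, vel, dt=0.5))
print(predict_with_head_pose(pos, vel, head_dir, dt=0.5))
```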
We have released on arXiv the research done with Tamaru-san of the University of Tokyo, who worked with us during an internship. The work is 3D head-position prediction from first-person video so that humans and robots can make eye contact. We have submitted it to the international robotics conference IROS, so with a bit of luck it will be accepted. https://t.co/Q2JGwQmSqH pic.twitter.com/6jkLVQfOym
— あるふ (@alfredplpl) March 12, 2021
13. A semi-agnostic ansatz with variable structure for quantum machine learning
M. Bilkis, M. Cerezo, Guillaume Verdon, Patrick J. Coles, Lukasz Cincio
Quantum machine learning (QML) offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics. Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest. However, challenges have recently emerged suggesting that deep ansatzes are difficult to train, due to flat training landscapes caused by randomness or by hardware noise. This motivates our work, where we present a variable structure approach to build ansatzes for QML. Our approach, called VAns (Variable Ansatz), applies a set of rules to both grow and (crucially) remove quantum gates in an informed manner during the optimization. Consequently, VAns is ideally suited to mitigate trainability and noise-related issues by keeping the ansatz shallow. We employ VAns in the variational quantum eigensolver for condensed matter and quantum chemistry applications and also in the quantum autoencoder for data compression, showing successful results in all cases.
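The grow-and-prune loop the abstract describes can be written schematically: repeatedly insert a small candidate block, re-optimize the circuit parameters, accept the insertion only if the cost improves, and then remove any gate whose deletion leaves the cost essentially unchanged. The sketch below is Python pseudocode with placeholder `cost`, `optimize_params`, and `propose_insertion` hooks; it is not the VAns implementation and is not tied to any quantum library.

```python
# Schematic grow-and-prune loop over a circuit represented as a list of gates.
def vans_loop(initial_circuit, cost, optimize_params, propose_insertion,
              n_iterations=50, tolerance=1e-4):
    circuit = optimize_params(initial_circuit)
    best = cost(circuit)
    for _ in range(n_iterations):
        # Grow: add a small identity-initialized block at a proposed position.
        candidate = optimize_params(propose_insertion(circuit))
        if cost(candidate) < best - tolerance:
            circuit, best = candidate, cost(candidate)
        # Prune: drop any gate whose removal leaves the cost (almost) unchanged.
        for gate in list(circuit):
            trimmed = [g for g in circuit if g is not gate]
            if trimmed and cost(trimmed) <= best + tolerance:
                circuit, best = trimmed, cost(trimmed)
    return circuit, best
```

The pruning step is what keeps the ansatz shallow, which is the point of the method for mitigating flat-landscape and noise issues.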
Happy to debut on Quantum-Twitter-World by announcing
— Mati Bilkis (@MatiasBilkis) March 12, 2021
A semi-agnostic ansatz with variable structure for quantum machine learning 🔥
Exciting collaboration with @MvsCerezo @quantumVerd @ColesQuantum @LCincio and unofficially sponsored by @vans_66 👟👟https://t.co/vxXueHLEjB
14. Robust High-speed Running for Quadruped Robots via Deep Reinforcement Learning
Guillaume Bellegarda, Quan Nguyen
Deep reinforcement learning has emerged as a popular and powerful way to develop locomotion controllers for quadruped robots. Common approaches have largely focused on learning actions directly in joint space, or learning to modify and offset foot positions produced by trajectory generators. Both approaches typically require careful reward shaping and training for millions of time steps, and with trajectory generators introduce human bias into the resulting control policies. In this paper, we instead explore learning foot positions in Cartesian space, which we track with impedance control, for a task of running as fast as possible subject to environmental disturbances. Compared with other action spaces, we observe less needed reward shaping, much improved sample efficiency, the emergence of natural gaits such as galloping and bounding, and ease of sim-to-sim transfer. Policies can be learned in only a few million time steps, even for challenging tasks of running over rough terrain with loads of over 100% of the nominal quadruped mass. Training occurs in PyBullet, and we perform a sim-to-sim transfer to Gazebo, where our quadruped is able to run at over 4 m/s without a load, and 3.5 m/s with a 10 kg load, which is over 83% of the nominal quadruped mass. Video results can be found at https://youtu.be/roE1vxpEWfw.
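The action interface is the interesting part: the policy outputs desired foot positions in Cartesian space, and a task-space impedance (PD) law mapped through the foot Jacobian converts them to joint torques, tau = J^T (Kp (p_des - p) - Kd v). That impedance law is standard; the gains and dimensions below are illustrative placeholders rather than the paper's tuned values.

```python
# Sketch of Cartesian foot-position tracking with impedance control for one leg.
import numpy as np

def foot_impedance_torques(p_des, p, v, jacobian, kp=500.0, kd=10.0):
    """p_des, p, v: (3,) desired foot position, current position, and velocity
    in the leg frame; jacobian: (3, n_joints) foot Jacobian."""
    force = kp * (p_des - p) - kd * v            # virtual spring-damper at the foot
    return jacobian.T @ force                     # map task-space force to joint torques

# Toy usage for a 3-joint leg (identity Jacobian for simplicity).
tau = foot_impedance_torques(p_des=np.array([0.0, 0.0, -0.30]),
                             p=np.array([0.0, 0.0, -0.28]),
                             v=np.zeros(3),
                             jacobian=np.eye(3))
print(tau)
```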
Robust High-speed Running for Quadruped Robots via Deep Reinforcement Learning
— AK (@ak92501) March 12, 2021
pdf: https://t.co/0r5Gcerjmw
abs: https://t.co/9Y0GBAHNBp pic.twitter.com/6f7X1XG07k
15. Hard Attention Control By Mutual Information Maximization
Himanshu Sahni, Charles Isbell
Biological agents have adopted the principle of attention to limit the rate of incoming information from the environment. One question that arises is if an artificial agent has access to only a limited view of its surroundings, how can it control its attention to effectively solve tasks? We propose an approach for learning how to control a hard attention window by maximizing the mutual information between the environment state and the attention location at each step. The agent employs an internal world model to make predictions about its state and focuses attention towards where the predictions may be wrong. Attention is trained jointly with a dynamic memory architecture that stores partial observations and keeps track of the unobserved state. We demonstrate that our approach is effective in predicting the full state from a sequence of partial observations. We also show that the agent’s internal representation of the surroundings, a live mental map, can be used for control in two partially observable reinforcement learning tasks. Videos of the trained agent can be found at https://sites.google.com/view/hard-attention-control.
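A rough way to picture "focus attention where the predictions may be wrong": score each candidate attention window by the world model's predictive uncertainty and glimpse at the argmax, as a greedy proxy for choosing high-information locations. The paper actually trains the attention controller by maximizing mutual information rather than using this rule, and `predict_window` below is a placeholder world model.

```python
# Hedged sketch: pick the attention window with the highest predictive entropy.
import torch

def choose_attention_location(predict_window, candidate_locations):
    """predict_window(loc) -> probabilities over pixel values for that window;
    returns the location with the highest predictive entropy."""
    def entropy(probs):
        return -(probs * probs.clamp_min(1e-8).log()).sum()
    scores = torch.stack([entropy(predict_window(loc)) for loc in candidate_locations])
    return candidate_locations[scores.argmax().item()]

# Toy usage: the model is least certain about location 2, so that window is chosen.
fake_model = lambda loc: torch.softmax(torch.arange(16.0) * (0.0 if loc == 2 else 1.0), dim=0)
print(choose_attention_location(fake_model, [0, 1, 2, 3]))
```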
Hard Attention Control By Mutual Information Maximization
— AK (@ak92501) March 12, 2021
pdf: https://t.co/Uiq9yoYz7y
abs: https://t.co/vVaEMj59Lp
project page: https://t.co/4vybnkLlK6 pic.twitter.com/4Vx20T8cT5
16. Research Software Sustainability and Citation
Stephan Druskat, Daniel S. Katz, Ilian T. Todorov
Software citation contributes to achieving software sustainability in two ways: It provides an impact metric to incentivize stakeholders to make software sustainable. It also provides references to software used in research, which can be reused and adapted to become sustainable. While software citation faces a host of technical and social challenges, community initiatives have defined the principles of software citation and are working on implementing solutions.