1. A topological solution to object segmentation and tracking
Thomas Tsao, Doris Y. Tsao
The world is composed of objects, the ground, and the sky. Visual perception of objects requires solving two fundamental challenges: segmenting visual input into discrete units, and tracking identities of these units despite appearance changes due to object deformation, changing perspective, and dynamic occlusion. Current computer vision methods for segmentation and tracking that approach human performance all require learning, raising the question: can objects be segmented and tracked without learning? Here, we show that the mathematical structure of light rays reflected from environment surfaces yields a natural representation of persistent surfaces, and that this surface representation provides a solution to both the segmentation and tracking problems. We describe how to generate this surface representation from continuous visual input, and demonstrate that our approach can segment and invariantly track objects in cluttered synthetic video despite severe appearance changes, without requiring learning.
How does perception of objects arise? Objects undergo huge changes in appearance due to deformation, perspective change, & dynamic occlusion. We prove from first principles that it’s possible, without learning, to perceive invariant objects despite this. https://t.co/oTWSUmuzbk
— Doris Tsao (@doristsao) July 6, 2021
2. DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling
Lanqing Xue, Kaitao Song, Duocai Wu, Xu Tan, Nevin L. Zhang, Tao Qin, Wei-Qiang Zhang, Tie-Yan Liu
- retweets: 1488, favorites: 153 (07/07/2021 09:36:55)
- links: abs | pdf
- cs.SD | cs.AI | cs.CL | cs.LG | eess.AS
Rap generation, which aims to produce lyrics and corresponding singing beats, needs to model both rhymes and rhythms. Previous works on rap generation focused on rhyming lyrics but ignored rhythmic beats, which are important for rap performance. In this paper, we develop DeepRapper, a Transformer-based rap generation system that can model both rhymes and rhythms. First, since no rap dataset with rhythmic beats is available, we develop a data mining pipeline to collect a large-scale rap dataset that includes a large number of rap songs with aligned lyrics and rhythmic beats. Second, we design a Transformer-based autoregressive language model that carefully models rhymes and rhythms. Specifically, we generate lyrics in reverse order with a rhyme representation and rhyme constraint to enhance rhyming, and insert a beat symbol into the lyrics for rhythm/beat modeling. To our knowledge, DeepRapper is the first system to generate rap with both rhymes and rhythms. Both objective and subjective evaluations demonstrate that DeepRapper generates creative and high-quality raps with rhymes and rhythms. Code will be released on GitHub.
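To make the two data tricks concrete, here is a minimal sketch of the preparation the abstract describes: inserting an explicit beat symbol into the lyrics, then reversing each line so the rhyme-carrying final word is generated first by a left-to-right language model. The `[BEAT]` token and function names are illustrative assumptions, not the paper's actual vocabulary.

```python
# Sketch of DeepRapper's data preparation as described in the abstract:
# (1) insert a beat symbol into the lyrics, and (2) reverse the token order
# of each line so the rhyming word comes first. Names are illustrative.

BEAT = "[BEAT]"

def prepare_line(tokens, beat_positions):
    """Insert beat markers, then reverse so the rhyme-carrying last word
    is generated first by a left-to-right language model."""
    with_beats = []
    for i, tok in enumerate(tokens):
        if i in beat_positions:
            with_beats.append(BEAT)
        with_beats.append(tok)
    return list(reversed(with_beats))

line = ["I", "keep", "the", "flow", "steady", "and", "tight"]
print(prepare_line(line, beat_positions={0, 3, 5}))
# ['tight', 'and', '[BEAT]', 'steady', 'flow', '[BEAT]', 'the', 'keep', 'I', '[BEAT]']
```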
DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling
— AK (@ak92501) July 6, 2021
pdf: https://t.co/IjrgGeDA4h
abs: https://t.co/7mANmHJBJk
project page: https://t.co/gIQiFiVSgK
Transformer-based rap generation system that can model both rhymes and rhythms pic.twitter.com/iFwmdZ6zjZ
3. Automating Generative Deep Learning for Artistic Purposes: Challenges and Opportunities
Sebastian Berns, Terence Broad, Christian Guckelsberger, Simon Colton
We present a framework for automating generative deep learning with a specific focus on artistic applications. The framework provides opportunities to hand over creative responsibilities to a generative system as targets for automation. For the definition of targets, we adopt core concepts from automated machine learning and an analysis of generative deep learning pipelines, both in standard and artistic settings. To motivate the framework, we argue that automation aligns well with the goal of increasing the creative responsibility of a generative system, a central theme in computational creativity research. We understand automation as the challenge of granting a generative system more creative autonomy, by framing the interaction between the user and the system as a co-creative process. The development of the framework is informed by our analysis of the relationship between automation and creative autonomy. An illustrative example shows how the framework can give inspiration and guidance in the process of handing over creative responsibility.
📘 Automating Generative Deep Learning for Artistic Purposes: Challenges and Opportunities
— elvis (@omarsar0) July 6, 2021
This work makes the case for automation and its role in allowing generative systems more creative autonomy.
Interesting lessons emerging from ML for creativity.https://t.co/KWAbzJnWBA pic.twitter.com/0SV129rKcd
Automating Generative Deep Learning. #BigData #Analytics #DataScience #AI #MachineLearning #IoT #IIoT #PyTorch #Python #RStats #TensorFlow #Java #JavaScript #ReactJS #CloudComputing #Serverless #DataScientist #Linux #Programming #Coding #100DaysofCode https://t.co/LldNaUah8v pic.twitter.com/QpEnWqY4Ke
— Dr. Ganapathi Pulipaka 🇺🇸 (@gp_pulipaka) July 6, 2021
4. Mava: a research framework for distributed multi-agent reinforcement learning
Arnu Pretorius, Kale-ab Tessera, Andries P. Smit, Claude Formanek, St John Grimbly, Kevin Eloff, Siphelele Danisa, Lawrence Francis, Jonathan Shock, Herman Kamper, Willie Brink, Herman Engelbrecht, Alexandre Laterre, Karim Beguir
Breakthrough advances in reinforcement learning (RL) research have led to a surge in the development and application of RL. To support the field and its rapid growth, several frameworks have emerged that aim to help the community more easily build effective and scalable agents. However, very few of these frameworks exclusively support multi-agent RL (MARL), an increasingly active field in itself, concerned with decentralised decision-making problems. In this work, we attempt to fill this gap by presenting Mava: a research framework specifically designed for building scalable MARL systems. Mava provides useful components, abstractions, utilities and tools for MARL and allows for simple scaling for multi-process system training and execution, while providing a high level of flexibility and composability. Mava is built on top of DeepMind’s Acme (Hoffman et al., 2020), and therefore integrates with, and greatly benefits from, a wide range of already existing single-agent RL components made available in Acme. Several MARL baseline systems have already been implemented in Mava. These implementations serve as examples showcasing Mava’s reusable features, such as interchangeable system architectures, communication and mixing modules. Furthermore, these implementations allow existing MARL algorithms to be easily reproduced and extended. We provide experimental results for these implementations on a wide range of multi-agent environments and highlight the benefits of distributed system training.
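For intuition, here is a generic sketch of the distributed executor/trainer pattern that systems like Mava (building on Acme) use for scalable training. This is not Mava's actual API; it only illustrates the split between parallel experience collection and centralised updates.

```python
# Generic executor/trainer split for distributed multi-agent RL: executors
# collect joint experience in parallel processes, a trainer consumes it.
# NOT Mava's actual API -- just the pattern it scales up.

import multiprocessing as mp
import random

def executor(agent_ids, queue, episodes=5):
    """Roll out a (dummy) environment and push joint transitions."""
    for _ in range(episodes):
        joint_obs = {a: random.random() for a in agent_ids}
        joint_act = {a: random.choice([0, 1]) for a in agent_ids}
        reward = sum(joint_act.values())          # stand-in team reward
        queue.put((joint_obs, joint_act, reward))
    queue.put(None)                               # signal completion

def trainer(queue, n_executors):
    done = 0
    while done < n_executors:
        item = queue.get()
        if item is None:
            done += 1
            continue
        obs, act, rew = item
        # A real trainer would update shared networks here.
        print(f"update from transition, team reward={rew}")

if __name__ == "__main__":
    q = mp.Queue()
    agents = ["agent_0", "agent_1"]
    procs = [mp.Process(target=executor, args=(agents, q)) for _ in range(2)]
    for p in procs:
        p.start()
    trainer(q, n_executors=2)
    for p in procs:
        p.join()
```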
🧵1/6 Super excited to launch Mava: a scalable, research framework for multi-agent reinforcement learning (MARL)! 🤖🥳
— Kale-ab Tessera (@KaliTessera) July 6, 2021
Humbled to be part of one of the first Deep RL/MARL frameworks built and led by an African team🌍
📜: https://t.co/CTb03zGENi
⌨️: https://t.co/UT5E8nHz7M pic.twitter.com/ReDYcEBnwZ
5. Solving Machine Learning Problems
Sunny Tran, Pranav Krishna, Ishan Pakuwal, Prabhakar Kafle, Nikhil Singh, Jayson Lynch, Iddo Drori
Can a machine learn Machine Learning? This work trains a machine learning model to solve machine learning problems from a University undergraduate level course. We generate a new training set of questions and answers consisting of course exercises, homework, and quiz questions from MIT’s 6.036 Introduction to Machine Learning course and train a machine learning model to answer these questions. Our system demonstrates an overall accuracy of 96% for open-response questions and 97% for multiple-choice questions, compared with MIT students’ average of 93%, achieving grade A performance in the course, all in real-time. Questions cover all 12 topics taught in the course, excluding coding questions or questions with images. Topics include: (i) basic machine learning principles; (ii) perceptrons; (iii) feature extraction and selection; (iv) logistic regression; (v) regression; (vi) neural networks; (vii) advanced neural networks; (viii) convolutional neural networks; (ix) recurrent neural networks; (x) state machines and MDPs; (xi) reinforcement learning; and (xii) decision trees. Our system uses Transformer models within an encoder-decoder architecture with graph and tree representations. An important aspect of our approach is a data-augmentation scheme for generating new example problems. We also train a machine learning model to generate problem hints. Thus, our system automatically generates new questions across topics, answers both open-response questions and multiple-choice questions, classifies problems, and generates problem hints, pushing the envelope of AI for STEM education.
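As an illustration of the kind of data-augmentation scheme the abstract mentions, the sketch below generates new example problems by resampling numeric values in a templated question. The template and value ranges are invented for illustration; the paper's actual scheme may differ.

```python
# Hypothetical data augmentation: generate new ML course problems (with
# ground-truth answers) by resampling the numbers in a question template.

import random

TEMPLATE = ("A perceptron has weights w = [{w0}, {w1}] and bias b = {b}. "
            "What is its output for x = [{x0}, {x1}]?")

def make_problem(rng):
    w0, w1, b = rng.randint(-3, 3), rng.randint(-3, 3), rng.randint(-2, 2)
    x0, x1 = rng.randint(-2, 2), rng.randint(-2, 2)
    question = TEMPLATE.format(w0=w0, w1=w1, b=b, x0=x0, x1=x1)
    answer = 1 if w0 * x0 + w1 * x1 + b > 0 else 0   # threshold activation
    return question, answer

rng = random.Random(0)
for _ in range(3):
    q, a = make_problem(rng)
    print(q, "->", a)
```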
Solving Machine Learning Problems
— AK (@ak92501) July 6, 2021
pdf: https://t.co/jf7EeapP4Z
abs: https://t.co/uv1JlDLw5M
accuracy of 96% for open-response questions and 97% for mcq, compared with MIT students’ average of 93%, achieving grade A performance in the course, all in real-time pic.twitter.com/NwF3SgMsTv
6. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, Weixin Liu, Zhihua Wu, Weibao Gong, Jianzhong Liang, Zhizhou Shang, Peng Sun, Wei Liu, Xuan Ouyang, Dianhai Yu, Hao Tian, Hua Wu, Haifeng Wang
Pre-trained models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. Recent works such as T5 and GPT-3 have shown that scaling up pre-trained language models can improve their generalization abilities. Particularly, the GPT-3 model with 175 billion parameters shows strong task-agnostic zero-shot/few-shot learning capabilities. Despite their success, these large-scale models are trained on plain texts without introducing knowledge such as linguistic knowledge and world knowledge. In addition, most large-scale models are trained in an auto-regressive way. As a result, such models perform relatively poorly when fine-tuned for downstream language understanding tasks. In order to solve the above problems, we propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models. It fuses auto-regressive network and auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks with zero-shot learning, few-shot learning or fine-tuning. We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph. Empirical results show that the model outperforms the state-of-the-art models on 54 Chinese NLP tasks, and its English version achieves the first place on the SuperGLUE benchmark (July 3, 2021), surpassing the human performance by +0.8% (90.6% vs. 89.8%).
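A minimal PyTorch sketch of the fusion idea, assuming a shared "universal" backbone feeding an auto-regressive branch (causal mask, for generation) and an auto-encoding branch (full context, for understanding). Module names and sizes are illustrative assumptions, not ERNIE 3.0's actual architecture.

```python
# Sketch: one shared backbone, two task branches (auto-regressive and
# auto-encoding), as a toy analogue of the fusion described in the abstract.

import torch
import torch.nn as nn

class SharedBackboneLM(nn.Module):
    def __init__(self, vocab=1000, d=64, layers=2, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        enc_layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.shared = nn.TransformerEncoder(enc_layer, layers)   # universal module
        self.ar_block = nn.TransformerEncoder(enc_layer, 1)      # generation branch
        self.ae_block = nn.TransformerEncoder(enc_layer, 1)      # understanding branch
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, tokens, causal=False):
        h = self.embed(tokens)
        mask = None
        if causal:  # auto-regressive branch sees only the past
            L = tokens.size(1)
            mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        h = self.shared(h, mask=mask)
        h = self.ar_block(h, mask=mask) if causal else self.ae_block(h)
        return self.lm_head(h)

model = SharedBackboneLM()
toks = torch.randint(0, 1000, (2, 16))
print(model(toks, causal=True).shape)   # generation logits: (2, 16, 1000)
print(model(toks, causal=False).shape)  # understanding logits: same shape
```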
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
— AK (@ak92501) July 6, 2021
pdf: https://t.co/txNBuGFtGa
abs: https://t.co/BQIs7uRq4o
trained the model with 10B parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph pic.twitter.com/ftFY8Y7fcr
Baidu's AI research (ERNIE 3.0)
— 小猫遊りょう(たかにゃし・りょう) (@jaguring1) July 6, 2021
Outperforms previous state-of-the-art models on 54 Chinese-language tasks. The English version took first place on SuperGLUE, a benchmark of eight language-understanding tasks, surpassing human performance by +0.8% (90.6% vs. 89.8%). A 10-billion-parameter model trained on a 4TB corpus. https://t.co/5UyvINjOum
7. Dealing with Adversarial Player Strategies in the Neural Network Game iNNk through Ensemble Learning
Mathias Löwe, Jennifer Villareale, Evan Freed, Aleksanteri Sladek, Jichen Zhu, Sebastian Risi
Applying neural network (NN) methods in games can lead to various new and exciting game dynamics not previously possible. However, they also lead to new challenges such as the lack of large, clean datasets, varying player skill levels, and changing gameplay strategies. In this paper, we focus on the adversarial player strategy aspect in the game iNNk, in which players try to communicate secret code words through drawings with the goal of not being deciphered by a NN. Some strategies exploit weaknesses in the NN that consistently trick it into making incorrect classifications, leading to unbalanced gameplay. We present a method that combines transfer learning and ensemble methods to obtain a data-efficient adaptation to these strategies. This combination significantly outperforms the baseline NN across all adversarial player strategies despite only being trained on a limited set of adversarial examples. We expect the methods developed in this paper to be useful for the rapidly growing field of NN-based games, which will require new approaches to deal with unforeseen player creativity.
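A sketch of the transfer-learning + ensemble recipe, under the assumption that each ensemble member is a fine-tuned copy of a pretrained sketch classifier and that predictions are averaged; the backbone and class count are placeholders, not the iNNk classifier.

```python
# Sketch: fine-tune several copies of a pretrained classifier on the small
# set of adversarial drawings, then average their softmax outputs.
# torchvision's ResNet is purely a stand-in backbone.

import torch
import torch.nn as nn
from torchvision.models import resnet18

def make_member(num_classes, seed):
    torch.manual_seed(seed)               # diversity via different init/order
    net = resnet18(weights=None)          # use pretrained weights in practice
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

def ensemble_predict(members, x):
    with torch.no_grad():
        probs = torch.stack([m.eval()(x).softmax(dim=-1) for m in members])
    return probs.mean(dim=0)              # average the softmax outputs

members = [make_member(num_classes=10, seed=s) for s in range(3)]
x = torch.randn(4, 3, 224, 224)           # a batch of player drawings
print(ensemble_predict(members, x).argmax(dim=-1))
```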
Can humans outsmart machines to communicate a secret codeword visually? In our recent FDG paper we investigate how to deal with adversarial player strategies in our neural network game iNNk.
— Sebastian Risi (@risi1979) July 6, 2021
The video shows one such strategy, drawing a rebus puzzle.
PDF: https://t.co/3pLrODdp3Y pic.twitter.com/7Qc5lfZkaf
8. Efficient Vision Transformers via Fine-Grained Manifold Distillation
Ding Jia, Kai Han, Yunhe Wang, Yehui Tang, Jianyuan Guo, Chao Zhang, Dacheng Tao
This paper studies the model compression problem of vision transformers. Benefiting from the self-attention module, transformer architectures have shown extraordinary performance on many computer vision tasks. Although network performance is boosted, transformers often require more computational resources, including memory usage and inference complexity. In contrast to existing knowledge distillation approaches, we propose to excavate useful information from the teacher transformer through the relationship between images and their divided patches. We then explore an efficient fine-grained manifold distillation approach that simultaneously calculates cross-image, cross-patch, and randomly-selected manifolds in the teacher and student models. Experimental results on several benchmarks demonstrate the superiority of the proposed algorithm for distilling portable transformer models with higher performance. For example, our approach achieves 75.06% Top-1 accuracy on the ImageNet-1k dataset when training a DeiT-Tiny model, outperforming other ViT distillation methods.
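A hedged sketch of the core of manifold distillation: rather than matching teacher and student logits, match the relational structure (pairwise patch similarities) of their feature spaces. The paper's fine-grained decomposition into cross-image, cross-patch, and randomly-selected manifolds is collapsed into a single similarity matrix here.

```python
# Relational distillation over patch embeddings: compare the pairwise
# similarity manifolds of teacher and student rather than their outputs.

import torch
import torch.nn.functional as F

def manifold_loss(feat_s, feat_t):
    """feat_*: (batch, patches, dim) patch embeddings from student/teacher."""
    B, P, _ = feat_s.shape
    s = F.normalize(feat_s.reshape(B * P, -1), dim=-1)
    t = F.normalize(feat_t.reshape(B * P, -1), dim=-1)
    rel_s = s @ s.T          # (B*P, B*P) similarity manifold, student
    rel_t = t @ t.T          # same for teacher; spans cross-image and cross-patch
    return F.mse_loss(rel_s, rel_t)

student_feats = torch.randn(2, 16, 128)   # e.g. DeiT-Tiny patch tokens
teacher_feats = torch.randn(2, 16, 384)   # wider teacher is fine: only the
loss = manifold_loss(student_feats, teacher_feats)  # relations are compared
print(loss.item())
```

Note the dimensions of the two feature spaces need not match, since only the similarity structures are compared.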
Efficient Vision Transformers via Fine-Grained Manifold Distillation
— AK (@ak92501) July 6, 2021
pdf: https://t.co/WwmHUd17dk
abs: https://t.co/eQ3hbcQfSL
approach achieves 75.06% Top-1 accuracy on the ImageNet-1k dataset for training a DeiT-Tiny model pic.twitter.com/kraLMm66X6
9. Test-Time Personalization with a Transformer for Human Pose Estimation
Miao Hao, Yizhuo Li, Zonglin Di, Nitesh B. Gundavarapu, Xiaolong Wang
We propose to personalize a human pose estimator given a set of test images of a person, without using any manual annotations. While there has been significant advancement in human pose estimation, it is still very challenging for a model to generalize to different unknown environments and unseen persons. Instead of using a fixed model for every test case, we adapt our pose estimator during test time to exploit person-specific information. We first train our model on diverse data jointly with both supervised and self-supervised pose estimation objectives. We use a Transformer model to build a transformation between the self-supervised keypoints and the supervised keypoints. During test time, we personalize and adapt our model by fine-tuning with the self-supervised objective. The pose is then improved by transforming the updated self-supervised keypoints. We experiment with multiple datasets and show significant improvements in pose estimation with our self-supervised personalization.
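A generic sketch of the test-time loop, assuming a placeholder model and self-supervised loss: fine-tune on the test person's unlabeled images, then predict. The actual paper uses a Transformer over keypoints; the stand-ins below only illustrate the adaptation mechanics.

```python
# Test-time personalization: update the model on the test person's images
# with the self-supervised objective only, then switch to inference.

import torch

def personalize(model, self_sup_loss, test_images, steps=10, lr=1e-4):
    """Fine-tune on unlabeled test images with the self-supervised loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        loss = self_sup_loss(model, test_images)   # no manual annotations
        opt.zero_grad()
        loss.backward()
        opt.step()
    model.eval()
    return model

# Toy stand-ins so the sketch runs end to end:
model = torch.nn.Linear(8, 2)                       # placeholder pose model
images = torch.randn(16, 8)
recon = lambda m, x: ((m(x) @ torch.ones(2, 8)) - x).pow(2).mean()
personalize(model, recon, images)
print(model(images[:1]))
```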
Test-Time Personalization with a Transformer for Human Pose Estimation
— AK (@ak92501) July 6, 2021
pdf: https://t.co/LR1CTzMSOF
abs: https://t.co/272J51jz43 pic.twitter.com/pPInkaiAmA
10. Data-driven mapping between functional connectomes using optimal transport
Javid Dadashkarimi, Amin Karbasi, Dustin Scheinost
Functional connectomes derived from functional magnetic resonance imaging have long been used to understand the functional organization of the brain. Nevertheless, a connectome is intrinsically linked to the atlas used to create it. In other words, a connectome generated from one atlas is different in scale and resolution compared to a connectome generated from another atlas. Being able to map connectomes and derived results between different atlases without additional pre-processing is a crucial step in improving interpretation and generalization between studies that use different atlases. Here, we use optimal transport, a powerful mathematical technique, to find an optimum mapping between two atlases. This mapping is then used to transform time series from one atlas to another in order to reconstruct a connectome. We validate our approach by comparing transformed connectomes against their “gold-standard” counterparts (i.e., connectomes generated directly from an atlas) and demonstrate the utility of transformed connectomes by applying these connectomes to predictive models based on a different atlas. We show that these transformed connectomes are significantly similar to their “gold-standard” counterparts and maintain individual differences in brain-behavior associations, demonstrating both the validity of our approach and its utility in downstream analyses. Overall, our approach is a promising avenue to increase the generalization of connectome-based results across different atlases.
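A sketch of the atlas-to-atlas idea using the POT library: solve for a transport plan between two parcellations, then push time series from one atlas into the other before recomputing the connectome. Using region centroids as the ground cost is an assumption made for illustration; the paper derives its mapping from data.

```python
# Optimal transport between two brain atlases, then time-series mapping.
# Requires: pip install pot

import numpy as np
import ot  # Python Optimal Transport

n_a, n_b, T = 100, 268, 200                 # regions per atlas, timepoints
rng = np.random.default_rng(0)
centroids_a = rng.normal(size=(n_a, 3))     # stand-in region centroids (mm)
centroids_b = rng.normal(size=(n_b, 3))
ts_a = rng.normal(size=(T, n_a))            # fMRI time series in atlas A

M = ot.dist(centroids_a, centroids_b)       # ground cost between regions
a = np.full(n_a, 1.0 / n_a)                 # uniform region masses
b = np.full(n_b, 1.0 / n_b)
plan = ot.emd(a, b, M)                      # (n_a, n_b) transport plan

ts_b = ts_a @ (plan / plan.sum(axis=0, keepdims=True))  # map time series
conn_b = np.corrcoef(ts_b.T)                # reconstructed connectome
print(conn_b.shape)                         # (268, 268)
```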
I am excited to share my latest work about optimal transport with @DScheinost and @aminkarbasi :
— javid.dadashkarimi (@JDadashkarimi) July 6, 2021
Title: Data-driven mapping between functional connectomes using optimal transport
preprint: https://t.co/0XCXi1FKpp
code: https://t.co/ntPjLXCJlK#MICCAI2021 @MICCAI_Society pic.twitter.com/Ysg8VegJt9
11. Do Different Tracking Tasks Require Different Appearance Models?
Zhongdao Wang, Hengshuang Zhao, Ya-Li Li, Shengjin Wang, Philip H.S. Torr, Luca Bertinetto
Tracking objects of interest in a video is one of the most popular and widely applicable problems in computer vision. However, over the years, a Cambrian explosion of use cases and benchmarks has fragmented the problem into a multitude of different experimental setups. As a consequence, the literature has fragmented too, and novel approaches proposed by the community are now usually specialised to fit only one specific setup. To understand to what extent this specialisation is actually necessary, in this work we present UniTrack, a unified tracking solution that addresses five different tasks within the same framework. UniTrack consists of a single, task-agnostic appearance model, which can be learned in a supervised or self-supervised fashion, and multiple “heads” that address individual tasks and require no training. We show how most tracking tasks can be solved within this framework, and that the same appearance model can be used to obtain performance that is competitive against specialised methods for all five tasks considered. The framework also allows us to analyse appearance models obtained with the most recent self-supervised methods, thus significantly extending their evaluation and comparison to a larger variety of important problems. Code available at https://github.com/Zhongdao/UniTrack.
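To illustrate the "frozen appearance model + training-free head" idea, the sketch below propagates segmentation labels from one frame to the next purely by feature affinity, with no task-specific training. The backbone features are placeholders; whether UniTrack's heads work exactly this way is an assumption based on the abstract.

```python
# Training-free propagation head: carry labels from frame t to frame t+1
# via nearest-neighbour affinity in a frozen feature space.

import torch
import torch.nn.functional as F

def propagate_labels(feat_prev, feat_next, labels_prev, topk=5):
    """feat_*: (C, H, W) features; labels_prev: (H, W) integer mask."""
    C, H, W = feat_prev.shape
    fp = F.normalize(feat_prev.reshape(C, -1), dim=0)   # (C, HW)
    fn = F.normalize(feat_next.reshape(C, -1), dim=0)
    affinity = fn.T @ fp                                # (HW_next, HW_prev)
    vals, idx = affinity.topk(topk, dim=1)              # soft nearest neighbours
    weights = vals.softmax(dim=1)
    onehot = F.one_hot(labels_prev.reshape(-1)).float() # (HW_prev, K)
    propagated = (weights.unsqueeze(-1) * onehot[idx]).sum(dim=1)
    return propagated.argmax(dim=1).reshape(H, W)

feats_t = torch.randn(64, 32, 32)      # frozen-backbone features, frame t
feats_t1 = torch.randn(64, 32, 32)     # frame t+1
mask_t = torch.randint(0, 3, (32, 32)) # 3 object labels
print(propagate_labels(feats_t, feats_t1, mask_t).shape)  # (32, 32)
```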
Do Different Tracking Tasks Require Different Appearance Models?
— AK (@ak92501) July 6, 2021
pdf: https://t.co/YM53bIicxc
abs: https://t.co/Lfo1cmK3kx
github: https://t.co/3GHtskmBTy pic.twitter.com/LZBp1ScfBJ
12. What Makes for Hierarchical Vision Transformer?
Yuxin Fang, Xinggang Wang, Rui Wu, Jianwei Niu, Wenyu Liu
Recent studies show that hierarchical Vision Transformers with interleaved non-overlapped intra-window self-attention & shifted-window self-attention are able to achieve state-of-the-art performance in various visual recognition tasks, challenging CNN’s dense sliding-window paradigm. Most follow-up works try to replace the shifted-window operation with other kinds of cross-window communication, while treating self-attention as the de facto standard for intra-window information aggregation. In this short preprint, we question whether self-attention is the only choice for hierarchical Vision Transformers to attain strong performance, and what makes for a hierarchical Vision Transformer. We replace the self-attention layers in Swin Transformer and Shuffle Transformer with a simple linear mapping and keep the other components unchanged. The resulting architecture with 25.4M parameters and 4.2G FLOPs achieves 80.5% Top-1 accuracy, compared to 81.3% for Swin Transformer with 28.3M parameters and 4.5G FLOPs. We also experiment with other alternatives to self-attention for context aggregation inside each non-overlapped window, which all give similar competitive results under the same architecture. Our study reveals that the macro architecture of the Swin model family (i.e., interleaved intra-window & cross-window communications), rather than the specific aggregation layers or the specific means of cross-window communication, may be more responsible for its strong performance and is the real challenger to CNN’s dense sliding-window paradigm.
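A sketch of the paper's probe, assuming a simplified layer: keep the non-overlapped window partition of the macro architecture but replace intra-window self-attention with a single learned linear mapping over the window's tokens. The shift mechanism and per-head details are omitted for brevity.

```python
# Replace intra-window self-attention with a learned linear token mixing
# over each non-overlapped window, keeping the window partition intact.

import torch
import torch.nn as nn

class WindowLinearMixing(nn.Module):
    """Mix the w*w tokens of each non-overlapped window with one linear map."""
    def __init__(self, window=7):
        super().__init__()
        self.window = window
        self.mix = nn.Linear(window * window, window * window)

    def forward(self, x):                       # x: (B, H, W, C)
        B, H, W, C = x.shape
        w = self.window
        x = x.reshape(B, H // w, w, W // w, w, C).permute(0, 1, 3, 5, 2, 4)
        x = x.reshape(B, H // w, W // w, C, w * w)
        x = self.mix(x)                         # token mixing, no attention
        x = x.reshape(B, H // w, W // w, C, w, w).permute(0, 1, 4, 2, 5, 3)
        return x.reshape(B, H, W, C)

layer = WindowLinearMixing(window=7)
feats = torch.randn(2, 28, 28, 96)              # Swin-like stage-1 feature map
print(layer(feats).shape)                       # torch.Size([2, 28, 28, 96])
```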
What Makes for Hierarchical Vision Transformer?
— AK (@ak92501) July 6, 2021
pdf: https://t.co/44NsyNZIa8
abs: https://t.co/aQCNVjVk3R
resulting architecture with 25.4M parameters and 4.2G FLOPs achieves 80.5% Top-1 accuracy, compared to 81.3% for Swin Transformer with 28.3M parameters and 4.5G FLOP pic.twitter.com/aiU32BpVvA
13. A Multilayer Network Model of the Coevolution of the Spread of a Disease and Competing Opinions
Kaiyan Peng, Zheng Lu, Vanessa Lin, Michael R. Lindstrom, Christian Parkinson, Chuntian Wang, Andrea L. Bertozzi, Mason A. Porter
- retweets: 42, favorites: 18 (07/07/2021 09:36:57)
- links: abs | pdf
- cs.SI | physics.soc-ph | q-bio.PE
During the COVID-19 pandemic, conflicting opinions on physical distancing swept across social media, affecting both human behavior and the spread of COVID-19. Inspired by such phenomena, we construct a two-layer multiplex network for the coupled spread of a disease and conflicting opinions. We model each process as a contagion. On one layer, we consider the concurrent evolution of two opinions — pro-physical-distancing and anti-physical-distancing — that compete with each other and have mutual immunity to each other. The disease evolves on the other layer, and individuals are less likely (respectively, more likely) to become infected when they adopt the pro-physical-distancing (respectively, anti-physical-distancing) opinion. We develop approximations of mean-field type by generalizing monolayer pair approximations to multilayer networks; these approximations agree well with Monte Carlo simulations for a broad range of parameters and several network structures. Through numerical simulations, we illustrate the influence of opinion dynamics on the spread of the disease from complex interactions both between the two conflicting opinions and between the opinions and the disease. We find that lengthening the duration that individuals hold an opinion may help suppress disease transmission, and we demonstrate that increasing the cross-layer correlations or intra-layer correlations of node degrees may lead to fewer individuals becoming infected with the disease.
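A toy Monte Carlo sketch of the coupled dynamics, assuming Erdos-Renyi-style layers and invented rates: opinions spread by contact on one layer, and each node's adopted opinion sets its infection probability on the disease layer. This is far simpler than the paper's model (no mutual immunity between opinions, no pair approximation), and only illustrates the two-layer coupling.

```python
# Toy two-layer contagion: opinion spread on one layer modulates SIR-style
# disease spread on the other. Rates and layer structure are illustrative.

import numpy as np

rng = np.random.default_rng(1)
N = 500
A_op = (rng.random((N, N)) < 0.02).astype(float)   # opinion-layer contacts
A_dis = (rng.random((N, N)) < 0.02).astype(float)  # disease-layer contacts
opinion = rng.choice([-1, 0, 1], size=N, p=[0.05, 0.9, 0.05])  # -1 anti, +1 pro
state = np.zeros(N, dtype=int)                     # 0=S, 1=I, 2=R
state[rng.choice(N, 5, replace=False)] = 1

beta = {1: 0.02, 0: 0.05, -1: 0.10}   # infection prob depends on held opinion
for t in range(100):
    # opinion contagion: undecided nodes adopt an opinion if exposed
    exposed = (A_op @ (opinion != 0)) > 0
    adopt = (opinion == 0) & exposed & (rng.random(N) < 0.05)
    opinion[adopt] = rng.choice([-1, 1], size=adopt.sum())
    # disease contagion on the other layer, modulated by opinion
    pressure = A_dis @ (state == 1)
    p_inf = np.array([beta[o] for o in opinion]) * pressure
    state[(state == 0) & (rng.random(N) < p_inf)] = 1
    state[(state == 1) & (rng.random(N) < 0.1)] = 2   # recovery

print(f"final infected+recovered: {(state > 0).sum()} / {N}")
```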
14. EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion
Daxin Tan, Liqun Deng, Yu Ting Yeung, Xin Jiang, Xiao Chen, Tan Lee
This paper presents the design, implementation and evaluation of a speech editing system, named EditSpeech, which allows a user to perform deletion, insertion and replacement of words in a given speech utterance without causing audible degradation in speech quality and naturalness. The EditSpeech system is built upon a neural text-to-speech (NTTS) synthesis framework. Partial inference and bidirectional fusion are proposed to effectively incorporate the contextual information related to the edited region and to achieve smooth transitions at both the left and right boundaries. Distortion introduced to the unmodified parts of the utterance is alleviated. The EditSpeech system is developed and evaluated on English and Chinese in multi-speaker scenarios. Objective and subjective evaluations demonstrate that EditSpeech outperforms several baseline systems in terms of lower spectral distortion and preferred speech quality. Audio samples are available online for demonstration: https://daxintan-cuhk.github.io/EditSpeech/ .
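A numpy sketch of the bidirectional fusion step, assuming the edited region has already been synthesized twice by partial inference (once left-to-right, once right-to-left): a crossfade weights each pass most heavily near the boundary whose context it conditioned on. The linear crossfade shape is an illustrative choice, not the paper's exact scheme.

```python
# Fuse forward and backward partial-inference outputs so each boundary of
# the edited region is dominated by the contextually-matched pass.

import numpy as np

def bidirectional_fusion(mel_fwd, mel_bwd):
    """mel_*: (frames, n_mels) spectrograms for the edited region."""
    T = mel_fwd.shape[0]
    w = np.linspace(1.0, 0.0, T)[:, None]   # forward pass trusted near left edge
    return w * mel_fwd + (1.0 - w) * mel_bwd

fwd = np.random.randn(120, 80)               # left-to-right partial inference
bwd = np.random.randn(120, 80)               # right-to-left partial inference
fused = bidirectional_fusion(fwd, bwd)
print(fused.shape)                           # (120, 80), smooth at both edges
```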
EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion
— AK (@ak92501) July 6, 2021
pdf: https://t.co/AFupl3y5e4
abs: https://t.co/EzYlVMX1Qd
project page: https://t.co/erZVbCwfEE pic.twitter.com/o3UHN08eaU