Hot Papers 2021-08-11

1. A Survey on Deep Reinforcement Learning for Data Processing and Analytics

Qingpeng Cai, Can Cui, Yiyuan Xiong, Zhongle Xie, Meihui Zhang

  • retweets: 2496, favorites: 263 (08/12/2021 08:06:18)
  • links: abs | pdf
  • cs.LG | cs.DB

Data processing and analytics are fundamental and pervasive. Algorithms play a vital role in data processing and analytics where many algorithm designs have incorporated heuristics and general rules from human knowledge and experience to improve their effectiveness. Recently, reinforcement learning, deep reinforcement learning (DRL) in particular, is increasingly explored and exploited in many areas because it can learn better strategies in complicated environments it is interacting with than statically designed algorithms. Motivated by this trend, we provide a comprehensive review of recent works focusing on utilizing deep reinforcement learning to improve data processing and analytics. First, we present an introduction to key concepts, theories, and methods in deep reinforcement learning. Next, we discuss deep reinforcement learning deployment on database systems, facilitating data processing and analytics in various aspects, including data organization, scheduling, tuning, and indexing. Then, we survey the application of deep reinforcement learning in data processing and analytics, ranging from data preparation, natural language interface to healthcare, fintech, etc. Finally, we discuss important open challenges and future research directions of using deep reinforcement learning in data processing and analytics.

2. Making Transformers Solve Compositional Tasks

Santiago Ontañón, Joshua Ainslie, Vaclav Cvicek, Zachary Fisher

  • retweets: 440, favorites: 121 (08/12/2021 08:06:19)
  • links: abs | pdf
  • cs.AI | cs.CL

Several studies have reported the inability of Transformer models to generalize compositionally, a key type of generalization in many NLP tasks such as semantic parsing. In this paper we explore the design space of Transformer models showing that the inductive biases given to the model by several design decisions significantly impact compositional generalization. Through this exploration, we identified Transformer configurations that generalize compositionally significantly better than previously reported in the literature in a diverse set of compositional tasks, and that achieve state-of-the-art results in a semantic parsing compositional generalization benchmark (COGS), and a string edit operation composition benchmark (PCFG).

3. AnyoneNet: Synchronized Speech and Talking Head Generation for arbitrary person

Xinsheng Wang, Qicong Xie, Jihua Zhu, Lei Xie, Odette Scharenborg

  • retweets: 437, favorites: 79 (08/12/2021 08:06:19)
  • links: abs | pdf
  • cs.CV | cs.HC

Automatically generating videos in which synthesized speech is synchronized with lip movements in a talking head has great potential in many human-computer interaction scenarios. In this paper, we present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input. In contrast to previous text-driven talking head generation methods, which can only synthesize the voice of a specific person, the proposed method is capable of synthesizing speech for any person, including persons unseen in the training stage. Specifically, the proposed method decomposes the generation of synchronized speech and talking head videos into two stages, i.e., a text-to-speech (TTS) stage and a speech-driven talking head generation stage. The proposed TTS module is a face-conditioned multi-speaker TTS model that gets the speaker identity information from face images instead of speech, which allows us to synthesize a personalized voice on the basis of the input face image. To generate the talking head videos from the face images, a facial landmark-based method that can predict both lip movements and head rotations is proposed. Extensive experiments demonstrate that the proposed method is able to generate synchronized speech and talking head videos for arbitrary persons and non-persons. Synthesized speech shows consistency with the given face in terms of the synthesized voice’s timbre and the person’s appearance in the image, and the proposed landmark-based talking head method outperforms the state-of-the-art landmark-based method on generating natural talking head videos.

4. Instance-wise Hard Negative Example Generation for Contrastive Learning in Unpaired Image-to-Image Translation

Weilun Wang, Wengang Zhou, Jianmin Bao, Dong Chen, Houqiang Li

  • retweets: 361, favorites: 68 (08/12/2021 08:06:19)
  • links: abs | pdf
  • cs.CV

Contrastive learning shows great potential in unpaired image-to-image translation, but sometimes the translated results are of poor quality and the contents are not preserved consistently. In this paper, we uncover that the negative examples play a critical role in the performance of contrastive learning for image translation. The negative examples in previous methods are randomly sampled from the patches of different positions in the source image, which are not effective at pushing the positive examples close to the query examples. To address this issue, we present instance-wise hard Negative Example Generation for Contrastive learning in Unpaired image-to-image Translation (NEGCUT). Specifically, we train a generator to produce negative examples online. The generator is novel from two perspectives: 1) it is instance-wise, which means that the generated examples are based on the input image, and 2) it can generate hard negative examples since it is trained with an adversarial loss. With the generator, the performance of unpaired image-to-image translation is significantly improved. Experiments on three benchmark datasets demonstrate that the proposed NEGCUT framework achieves state-of-the-art performance compared to previous methods.
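
To illustrate why the choice of negative examples matters, here is a minimal sketch of the standard InfoNCE contrastive loss for a single query. It is an assumption that this form matches the loss used in NEGCUT; the abstract does not spell it out. The function name `info_nce`, the temperature value, and the toy 2-D vectors are all hypothetical and chosen only for illustration.

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query: negative log-softmax score of the positive.

    query, positive: shape (d,); negatives: shape (k, d).
    The temperature value is an illustrative assumption.
    """
    def unit(v):
        # L2-normalize so that dot products are cosine similarities.
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    q, p, n = unit(query), unit(positive), unit(negatives)
    # Index 0 holds the positive similarity, the rest are negatives.
    logits = np.concatenate([[q @ p], n @ q]) / temperature
    logits -= logits.max()  # shift for numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

# Toy vectors (hypothetical): one query, one positive, two kinds of negatives.
query = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
easy_negatives = np.array([[-1.0, 0.0], [0.0, -1.0]])  # far from the query
hard_negatives = np.array([[0.8, 0.6], [0.8, -0.6]])   # close to the query

loss_easy = info_nce(query, positive, easy_negatives)
loss_hard = info_nce(query, positive, hard_negatives)
print(loss_easy, loss_hard)
```

With easy negatives the loss is already close to zero, so it provides almost no training signal; hard negatives keep the loss, and hence the gradient, substantial. This is one plausible reading of why randomly sampled patches can be ineffective.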

5. Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development

Morgan Klaus Scheuerman, Emily Denton, Alex Hanna

  • retweets: 198, favorites: 107 (08/12/2021 08:06:19)
  • links: abs | pdf
  • cs.CV | cs.HC

Data is a crucial component of machine learning. The field is reliant on data to train, validate, and test models. With increased technical capabilities, machine learning research has boomed in both academic and industry settings, and one major focus has been on computer vision. Computer vision is a popular domain of machine learning increasingly pertinent to real-world applications, from facial recognition in policing to object detection for autonomous vehicles. Given computer vision’s propensity to shape machine learning research and impact human life, we seek to understand disciplinary practices around dataset documentation: how data is collected, curated, annotated, and packaged into datasets for computer vision researchers and practitioners to use for model tuning and development. Specifically, we examine what dataset documentation communicates about the underlying values of vision data and the larger practices and goals of computer vision as a field. To conduct this study, we collected a corpus of about 500 computer vision datasets, from which we sampled 114 dataset publications across different vision tasks. Through both a structured and thematic content analysis, we document a number of values around accepted data practices, what makes desirable data, and the treatment of humans in the dataset construction process. We discuss how computer vision dataset authors value efficiency at the expense of care; universality at the expense of contextuality; impartiality at the expense of positionality; and model work at the expense of data work. Many of the silenced values we identify sit in opposition to social computing practices. We conclude with suggestions on how to better incorporate silenced values into the dataset creation and curation process.

6. FairyTailor: A Multimodal Generative Framework for Storytelling

Eden Bensaid, Mauro Martino, Benjamin Hoover, Jacob Andreas, Hendrik Strobelt

Storytelling is an open-ended task that entails creative thinking and requires a constant flow of ideas. Natural language generation (NLG) for storytelling is especially challenging because it requires the generated text to follow an overall theme while remaining creative and diverse to engage the reader. In this work, we introduce a system and a web-based demo, FairyTailor, for human-in-the-loop visual story co-creation. Users can create a cohesive children’s fairytale by weaving generated texts and retrieved images with their input. FairyTailor adds another modality and modifies the text generation process to produce a coherent and creative sequence of text and images. To our knowledge, this is the first dynamic tool for multimodal story generation that allows interactive co-formation of both texts and images. It allows users to give feedback on co-created stories and share their results.

7. Meta-repository of screening mammography classifiers

Benjamin Stadnick, Jan Witowski, Vishwaesh Rajiv, Jakub Chłędowski, Farah E. Shamout, Kyunghyun Cho, Krzysztof J. Geras

  • retweets: 182, favorites: 25 (08/12/2021 08:06:19)
  • links: abs | pdf
  • cs.LG | cs.CV

Artificial intelligence (AI) is transforming medicine and showing promise in improving clinical diagnosis. In breast cancer screening, several recent studies show that AI has the potential to improve radiologists’ accuracy, subsequently helping in early cancer diagnosis and reducing unnecessary workup. As the number of proposed models and their complexity grows, it is becoming increasingly difficult to re-implement them in order to reproduce the results and to compare different approaches. To enable reproducibility of research in this application area and to enable comparison between different methods, we release a meta-repository containing deep learning models for classification of screening mammograms. This meta-repository creates a framework that enables the evaluation of machine learning models on any private or public screening mammography data set. At its inception, our meta-repository contains five state-of-the-art models with open-source implementations and cross-platform compatibility. We compare their performance on five international data sets: two private New York University breast cancer screening data sets as well as three public (DDSM, INbreast and Chinese Mammography Database) data sets. Our framework has a flexible design that can be generalized to other medical image analysis tasks. The meta-repository is available at https://www.github.com/nyukat/mammography_metarepository.

8. U-Net-and-a-half: Convolutional network for biomedical image segmentation using multiple expert-driven annotations

Yichi Zhang, Jesper Kers, Clarissa A. Cassol, Joris J. Roelofs, Najia Idrees, Alik Farber, Samir Haroon, Kevin P. Daly, Suvranu Ganguli, Vipul C. Chitalia, Vijaya B. Kolachalama

  • retweets: 100, favorites: 12 (08/12/2021 08:06:20)
  • links: abs | pdf
  • cs.CV | cs.LG

Development of deep learning systems for biomedical segmentation often requires access to expert-driven, manually annotated datasets. If more than a single expert is involved in the annotation of the same images, then the inter-expert agreement is not necessarily perfect, and no single expert annotation can precisely capture the so-called ground truth of the regions of interest on all images. Also, it is not trivial to generate a reference estimate using annotations from multiple experts. Here we present a deep neural network, defined as U-Net-and-a-half, which can simultaneously learn from annotations performed by multiple experts on the same set of images. U-Net-and-a-half contains a convolutional encoder to generate features from the input images, multiple decoders that allow simultaneous learning from image masks obtained from annotations that were independently generated by multiple experts, and a shared low-dimensional feature space. To demonstrate the applicability of our framework, we used two distinct datasets from digital pathology and radiology, respectively. Specifically, we trained two separate models using pathologist-driven annotations of glomeruli on whole slide images of human kidney biopsies (10 patients), and radiologist-driven annotations of lumen cross-sections of human arteriovenous fistulae obtained from intravascular ultrasound images (10 patients), respectively. The models based on U-Net-and-a-half exceeded the performance of the traditional U-Net models trained on single expert annotations alone, thus expanding the scope of multitask learning in the context of biomedical image segmentation.
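
The shared-encoder, multiple-decoder layout described above can be sketched in a few lines. This is a toy under stated assumptions, not the authors' architecture: dense layers stand in for the convolutional encoder and decoders, the number of experts (3), all sizes, the random masks, and the summed binary cross-entropy loss are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(image, enc_w, dec_ws):
    """Encode once, then decode one predicted mask per expert."""
    shared = np.tanh(image @ enc_w)  # shared low-dimensional features
    return [sigmoid(shared @ w) for w in dec_ws]

def multi_expert_loss(preds, expert_masks, eps=1e-7):
    """Sum of per-expert binary cross-entropy losses (an assumed objective)."""
    total = 0.0
    for p, m in zip(preds, expert_masks):
        p = np.clip(p, eps, 1 - eps)  # avoid log(0)
        total += float(-np.mean(m * np.log(p) + (1 - m) * np.log(1 - p)))
    return total

# Hypothetical sizes: a flattened 8x8 image, 16 shared features, 3 experts.
n_pixels, n_features, n_experts = 64, 16, 3
image = rng.standard_normal(n_pixels)
enc_w = rng.standard_normal((n_pixels, n_features)) * 0.1
dec_ws = [rng.standard_normal((n_features, n_pixels)) * 0.1
          for _ in range(n_experts)]
# Random binary masks stand in for the independent expert annotations.
expert_masks = [(rng.random(n_pixels) > 0.5).astype(float)
                for _ in range(n_experts)]

preds = forward(image, enc_w, dec_ws)
loss = multi_expert_loss(preds, expert_masks)
print(len(preds), loss)
```

Because every decoder reads the same `shared` features, gradients from all expert masks would update the single encoder during training, which is the sense in which the model learns from multiple annotators simultaneously.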

9. Learning to Cut by Watching Movies

Alejandro Pardo, Fabian Caba Heilbron, Juan León Alcázar, Ali Thabet, Bernard Ghanem

  • retweets: 64, favorites: 47 (08/12/2021 08:06:20)
  • links: abs | pdf
  • cs.CV | cs.MM

Video content creation keeps growing at an incredible pace; yet, creating engaging stories remains challenging and requires non-trivial video editing expertise. Many video editing components are astonishingly hard to automate primarily due to the lack of raw video materials. This paper focuses on a new task for computational video editing, namely the task of ranking cut plausibility. Our key idea is to leverage content that has already been edited to learn fine-grained audiovisual patterns that trigger cuts. To do this, we first collected a data source of more than 10K videos, from which we extract more than 255K cuts. We devise a model that learns to discriminate between real and artificial cuts via contrastive learning. We set up a new task and a set of baselines to benchmark video cut generation. We observe that our proposed model outperforms the baselines by large margins. To demonstrate our model in real-world applications, we conduct human studies on a collection of unedited videos. The results show that our model does a better job at cutting than random and alternative baselines.

10. RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?

Yuki Tatsunami, Masato Taki

For the past ten years, CNN has reigned supreme in the world of computer vision, but recently, Transformer is on the rise. However, the quadratic computational cost of self-attention has become a severe problem in practice. There has been much research on architectures without CNN and self-attention in this context. In particular, MLP-Mixer is a simple idea designed using MLPs that achieved an accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the embedding of tokens. Thus, there is still a possibility to build a non-convolutional inductive bias into the architecture itself, and we built in an inductive bias using two simple ideas. One is to divide the token-mixing block vertically and horizontally. The other is to make spatial correlations denser among some channels of token-mixing. With this approach, we were able to improve the accuracy of the MLP-Mixer while reducing its parameters and computational complexity. Compared to other MLP-based models, the proposed model, named RaftMLP, has a good balance of computational complexity, the number of parameters, and actual memory usage. In addition, our work indicates that MLP-based models have the potential to replace CNNs by adopting inductive bias. The source code in PyTorch version is available at https://github.com/okojoalg/raft-mlp.
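
The idea of dividing token mixing vertically and horizontally can be sketched as follows. This is a simplified illustration, not the RaftMLP implementation: a single linear map per axis stands in for each token-mixing MLP, the channel-grouping idea is omitted, and the grid size (14x14), channel count, and function names are hypothetical.

```python
import numpy as np

def mix_vertical(x, w_v):
    """Mix tokens along the height axis. x: (H, W, C), w_v: (H, H)."""
    return np.einsum("gh,hwc->gwc", w_v, x)

def mix_horizontal(x, w_h):
    """Mix tokens along the width axis. x: (H, W, C), w_h: (W, W)."""
    return np.einsum("vw,hwc->hvc", w_h, x)

# Hypothetical sizes: a 14x14 grid of tokens with 8 channels each.
H, W, C = 14, 14, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((H, W, C))
w_v = rng.standard_normal((H, H))
w_h = rng.standard_normal((W, W))

# Vertical mixing followed by horizontal mixing keeps the (H, W, C) shape.
y = mix_horizontal(mix_vertical(x, w_v), w_h)

# Weights needed for one linear mixing layer, ignoring biases.
axis_params = H * H + W * W   # separate per-axis maps: 392
full_params = (H * W) ** 2    # one map over all H*W tokens: 38416
print(y.shape, axis_params, full_params)
```

In this toy setting the per-axis maps need 392 weights, against 38416 for a single map over all 196 tokens, which hints at how such a split could reduce parameters and computation.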