1. A Survey on Deep Reinforcement Learning for Data Processing and Analytics
Qingpeng Cai, Can Cui, Yiyuan Xiong, Zhongle Xie, Meihui Zhang
Data processing and analytics are fundamental and pervasive. Algorithms play a vital role in data processing and analytics, and many algorithm designs incorporate heuristics and general rules drawn from human knowledge and experience to improve their effectiveness. Recently, reinforcement learning, and deep reinforcement learning (DRL) in particular, has been increasingly explored and exploited in many areas because it can learn better strategies in the complicated environments it interacts with than statically designed algorithms can. Motivated by this trend, we provide a comprehensive review of recent works focusing on utilizing deep reinforcement learning to improve data processing and analytics. First, we present an introduction to key concepts, theories, and methods in deep reinforcement learning. Next, we discuss deep reinforcement learning deployment on database systems, facilitating data processing and analytics in various aspects, including data organization, scheduling, tuning, and indexing. Then, we survey the application of deep reinforcement learning in data processing and analytics, ranging from data preparation and natural language interfaces to healthcare, fintech, etc. Finally, we discuss important open challenges and future research directions for using deep reinforcement learning in data processing and analytics.
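To make the agent-environment loop at the core of the surveyed DRL techniques concrete, below is a minimal sketch of tabular Q-learning applied to a toy knob-tuning task. The environment, reward curve, and discretized settings are illustrative assumptions, not any system covered in the survey.

```python
# Minimal sketch of the agent-environment loop underlying the surveyed DRL
# methods, applied to a toy database knob-tuning task. The environment,
# states, and reward are hypothetical illustrations only.
import random
from collections import defaultdict

ACTIONS = [-1, 0, +1]          # decrease, keep, or increase a single knob
STATES = range(10)             # discretized knob settings (e.g., buffer size levels)

def reward(state):
    # Hypothetical throughput curve: peaks at a mid-range setting.
    return -abs(state - 6)

q = defaultdict(float)         # tabular Q-values: (state, action) -> value
alpha, gamma, eps = 0.1, 0.9, 0.2

state = 0
for step in range(5000):
    # epsilon-greedy action selection
    if random.random() < eps:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: q[(state, a)])
    next_state = min(max(state + action, 0), 9)
    r = reward(next_state)
    # Q-learning update
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (r + gamma * best_next - q[(state, action)])
    state = next_state

print("Learned best setting:", max(STATES, key=lambda s: max(q[(s, a)] for a in ACTIONS)))
```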
🎓 A Survey on Deep Reinforcement Learning for Data Processing and Analytics
— elvis (@omarsar0) August 11, 2021
Provides a comprehensive overview of how deep reinforcement learning can improve data processing and analytics applications.
A great read for ML practitioners and students. https://t.co/bk8Fj2f9Me pic.twitter.com/7F02KHhpN6
2. Making Transformers Solve Compositional Tasks
Santiago Ontañón, Joshua Ainslie, Vaclav Cvicek, Zachary Fisher
Several studies have reported the inability of Transformer models to generalize compositionally, a key type of generalization in many NLP tasks such as semantic parsing. In this paper, we explore the design space of Transformer models, showing that the inductive biases given to the model by several design decisions significantly impact compositional generalization. Through this exploration, we identify Transformer configurations that generalize compositionally significantly better than previously reported in the literature on a diverse set of compositional tasks, and that achieve state-of-the-art results on a semantic parsing compositional generalization benchmark (COGS) and on a string edit operation composition benchmark (PCFG).
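As a rough illustration of what "exploring the design space" can look like in code, the sketch below sweeps two hypothetical design choices (position-encoding type and cross-layer weight sharing) over a small PyTorch encoder. The specific options and any effect on compositional generalization are assumptions for illustration, not the configurations studied in the paper.

```python
# Sketch of sweeping Transformer design choices of the kind the paper studies.
# The options below are illustrative assumptions, not the paper's configurations.
import itertools
import torch
import torch.nn as nn

class SmallEncoder(nn.Module):
    def __init__(self, vocab=1000, d_model=64, nhead=4, layers=2,
                 pos_encoding="learned", share_layers=False):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(512, d_model) if pos_encoding == "learned" else None
        if share_layers:
            # One set of weights reused at every depth.
            block = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128,
                                               batch_first=True)
            self.layers = nn.ModuleList([block] * layers)
        else:
            self.layers = nn.ModuleList(
                [nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128,
                                            batch_first=True) for _ in range(layers)])

    def forward(self, tokens):
        x = self.embed(tokens)
        if self.pos is not None:
            positions = torch.arange(tokens.size(1), device=tokens.device)
            x = x + self.pos(positions)
        for layer in self.layers:
            x = layer(x)
        return x

# Enumerate a tiny slice of the design space.
for pos, share in itertools.product(["learned", "none"], [False, True]):
    model = SmallEncoder(pos_encoding=pos, share_layers=share)
    out = model(torch.randint(0, 1000, (2, 16)))
    print(pos, share, out.shape)
```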
Making Transformers Solve Compositional Tasks
— AK (@ak92501) August 11, 2021
paper: https://t.co/1qUhBPlTfa
explore the design space of Transformer models showing that the inductive biases given to the model by several design decisions significantly impact compositional generalization pic.twitter.com/WSMeRNl3SX
3. AnyoneNet: Synchronized Speech and Talking Head Generation for arbitrary person
Xinsheng Wang, Qicong Xie, Jihua Zhu, Lei Xie, Odette Scharenborg
Automatically generating videos in which synthesized speech is synchronized with lip movements in a talking head has great potential in many human-computer interaction scenarios. In this paper, we present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input. In contrast to previous text-driven talking head generation methods, which can only synthesize the voice of a specific person, the proposed method is capable of synthesizing speech for any person who is unseen in the training stage. Specifically, the proposed method decomposes the generation of synchronized speech and talking head videos into two stages, i.e., a text-to-speech (TTS) stage and a speech-driven talking head generation stage. The proposed TTS module is a face-conditioned multi-speaker TTS model that obtains the speaker identity information from face images instead of speech, which allows us to synthesize a personalized voice on the basis of the input face image. To generate the talking head videos from the face images, a facial landmark-based method that can predict both lip movements and head rotations is proposed. Extensive experiments demonstrate that the proposed method is able to generate synchronized speech and talking head videos for arbitrary persons and non-persons. The synthesized speech is consistent with the given face in terms of the synthesized voice’s timbre and the person’s appearance in the image, and the proposed landmark-based talking head method outperforms the state-of-the-art landmark-based method at generating natural talking head videos.
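The following sketch only mirrors the two-stage structure described in the abstract (a face-conditioned TTS stage followed by a speech-driven landmark predictor); every module, dimension, and interface is a hypothetical placeholder rather than the authors' implementation.

```python
# Structural sketch of the two-stage pipeline: (1) face-conditioned TTS,
# (2) speech-driven talking-head (landmark) prediction. All internals are
# hypothetical placeholders, not the authors' code.
import torch
import torch.nn as nn

class FaceConditionedTTS(nn.Module):
    """Stage 1: text + face image -> mel-spectrogram (placeholder)."""
    def __init__(self, d=256):
        super().__init__()
        self.face_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(d))
        self.text_encoder = nn.Embedding(100, d)
        self.decoder = nn.GRU(d, 80, batch_first=True)   # 80-bin mel frames

    def forward(self, text_ids, face_image):
        speaker = self.face_encoder(face_image)           # speaker identity from the face
        x = self.text_encoder(text_ids) + speaker.unsqueeze(1)
        mel, _ = self.decoder(x)
        return mel

class SpeechDrivenHead(nn.Module):
    """Stage 2: mel frames -> facial landmarks (lips + head pose), placeholder."""
    def __init__(self, d=256, n_landmarks=68):
        super().__init__()
        self.rnn = nn.GRU(80, d, batch_first=True)
        self.to_landmarks = nn.Linear(d, n_landmarks * 2)

    def forward(self, mel):
        h, _ = self.rnn(mel)
        return self.to_landmarks(h)                       # per-frame 2D landmarks

text = torch.randint(0, 100, (1, 20))
face = torch.rand(1, 3, 64, 64)
mel = FaceConditionedTTS()(text, face)
landmarks = SpeechDrivenHead()(mel)
print(mel.shape, landmarks.shape)
```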
AnyoneNet: Synchronized Speech and Talking Head Generation for arbitrary person
— AK (@ak92501) August 11, 2021
pdf: https://t.co/pm6IWdWScu
abs: https://t.co/d5O2t0x1zi pic.twitter.com/qSiJEwXm62
4. Instance-wise Hard Negative Example Generation for Contrastive Learning in Unpaired Image-to-Image Translation
Weilun Wang, Wengang Zhou, Jianmin Bao, Dong Chen, Houqiang Li
Contrastive learning shows great potential in unpaired image-to-image translation, but sometimes the translated results are of poor quality and the contents are not preserved consistently. In this paper, we uncover that the negative examples play a critical role in the performance of contrastive learning for image translation. The negative examples in previous methods are randomly sampled from patches at different positions in the source image, which is not effective at pushing the positive examples close to the query examples. To address this issue, we present instance-wise hard Negative Example Generation for Contrastive learning in Unpaired image-to-image Translation (NEGCUT). Specifically, we train a generator to produce negative examples online. The generator is novel from two perspectives: 1) it is instance-wise, which means that the generated examples are based on the input image, and 2) it can generate hard negative examples since it is trained with an adversarial loss. With the generator, the performance of unpaired image-to-image translation is significantly improved. Experiments on three benchmark datasets demonstrate that the proposed NEGCUT framework achieves state-of-the-art performance compared to previous methods.
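The sketch below illustrates the general idea of adversarially generated hard negatives for an InfoNCE-style loss: an encoder minimizes the contrastive loss while a generator conditioned on the query maximizes it. The network sizes, feature placeholders, and loss wiring are assumptions for illustration, not the NEGCUT code.

```python
# Sketch of adversarial hard-negative generation for a patch contrastive loss.
# Sizes, placeholders, and wiring are illustrative assumptions, not NEGCUT.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, n_neg = 128, 16
encoder = nn.Linear(256, feat_dim)                      # stand-in for patch feature projection
neg_generator = nn.Linear(feat_dim, n_neg * feat_dim)   # instance-wise: conditioned on the query

opt_enc = torch.optim.Adam(encoder.parameters(), lr=1e-4)
opt_gen = torch.optim.Adam(neg_generator.parameters(), lr=1e-4)

def info_nce(query, positive, negatives, tau=0.07):
    pos = (query * positive).sum(-1, keepdim=True) / tau
    neg = torch.einsum("bd,bkd->bk", query, negatives) / tau
    logits = torch.cat([pos, neg], dim=1)
    return F.cross_entropy(logits, torch.zeros(query.size(0), dtype=torch.long))

for step in range(100):
    raw_q, raw_p = torch.randn(8, 256), torch.randn(8, 256)   # placeholder patch features
    q = F.normalize(encoder(raw_q), dim=-1)
    p = F.normalize(encoder(raw_p), dim=-1)
    negs = F.normalize(neg_generator(q.detach()).view(8, n_neg, feat_dim), dim=-1)

    # Encoder side: minimize the contrastive loss against the generated negatives.
    loss_enc = info_nce(q, p, negs.detach())
    opt_enc.zero_grad(); loss_enc.backward(); opt_enc.step()

    # Generator side: adversarially maximize the same loss so the negatives become hard.
    loss_gen = -info_nce(q.detach(), p.detach(), negs)
    opt_gen.zero_grad(); loss_gen.backward(); opt_gen.step()
```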
Instance-wise Hard Negative Example Generation for Contrastive Learning in Unpaired Image-to-Image Translation
— AK (@ak92501) August 11, 2021
pdf: https://t.co/dUlGZSiiN3
abs: https://t.co/WfKKY2VgQY pic.twitter.com/Xzeh5LCMde
5. Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development
Morgan Klaus Scheuerman, Emily Denton, Alex Hanna
Data is a crucial component of machine learning. The field is reliant on data to train, validate, and test models. With increased technical capabilities, machine learning research has boomed in both academic and industry settings, and one major focus has been on computer vision. Computer vision is a popular domain of machine learning increasingly pertinent to real-world applications, from facial recognition in policing to object detection for autonomous vehicles. Given computer vision’s propensity to shape machine learning research and impact human life, we seek to understand disciplinary practices around dataset documentation: how data is collected, curated, annotated, and packaged into datasets for computer vision researchers and practitioners to use for model tuning and development. Specifically, we examine what dataset documentation communicates about the underlying values of vision data and the larger practices and goals of computer vision as a field. To conduct this study, we collected a corpus of about 500 computer vision datasets, from which we sampled 114 dataset publications across different vision tasks. Through both a structured and thematic content analysis, we document a number of values around accepted data practices, what makes desirable data, and the treatment of humans in the dataset construction process. We discuss how computer vision dataset authors value efficiency at the expense of care; universality at the expense of contextuality; impartiality at the expense of positionality; and model work at the expense of data work. Many of the silenced values we identify sit in opposition to social computing practices. We conclude with suggestions on how to better incorporate silenced values into the dataset creation and curation process.
Preprint announcement: "Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development" w/ myself, @cephaloponderer, @alexhanna to be published in CSCW 2021 https://t.co/xBuZd1VvdB
— Morgan Klaus Scheuerman (@morganklauss) August 11, 2021
> We discuss how computer vision datasets authors value efficiency at the expense of care; universality at the expense of contextuality; impartiality at the expense of positionality; and model work at the expense of data work
— Ali Alkhatib (@_alialkhatib) August 11, 2021
👀 very excited to read this https://t.co/OGwttZEFEy
6. FairyTailor: A Multimodal Generative Framework for Storytelling
Eden Bensaid, Mauro Martino, Benjamin Hoover, Jacob Andreas, Hendrik Strobelt
Storytelling is an open-ended task that entails creative thinking and requires a constant flow of ideas. Natural language generation (NLG) for storytelling is especially challenging because it requires the generated text to follow an overall theme while remaining creative and diverse to engage the reader. In this work, we introduce a system and a web-based demo, FairyTailor, for human-in-the-loop visual story co-creation. Users can create a cohesive children’s fairytale by weaving generated texts and retrieved images with their input. FairyTailor adds another modality and modifies the text generation process to produce a coherent and creative sequence of text and images. To our knowledge, this is the first dynamic tool for multimodal story generation that allows interactive co-formation of both texts and images. It allows users to give feedback on co-created stories and share their results.
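A minimal sketch of the human-in-the-loop interleaving described above: the system alternates generated text with retrieved images, and the user accepts or rejects each step. The generate_paragraph and retrieve_image stubs and the feedback mechanism are hypothetical, not the FairyTailor API.

```python
# Sketch of interleaved text generation and image retrieval with user feedback.
# Both helper functions are hypothetical stubs, not the FairyTailor API.
import random

def generate_paragraph(story_so_far: str) -> str:
    # Placeholder for a language-model continuation.
    return story_so_far.split()[-1] + " ... and then something magical happened."

def retrieve_image(paragraph: str) -> str:
    # Placeholder for text-to-image retrieval; returns an image identifier.
    return f"image_{abs(hash(paragraph)) % 1000}.jpg"

story, assets = "Once upon a time there was a curious fox.", []
for _ in range(3):
    candidate = generate_paragraph(story)
    keep = random.random() > 0.3          # stand-in for the user's accept/reject feedback
    if keep:
        story += " " + candidate
        assets.append(retrieve_image(candidate))

print(story)
print(assets)
```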
FairyTailor: A Multimodal Generative Framework for Storytelling
— AK (@ak92501) August 11, 2021
pdf: https://t.co/XJ7X4CDZfz
abs: https://t.co/IZJuTXsHm6
webpage: https://t.co/Fbc6bo8RAX
github: https://t.co/BYypJRTWGp pic.twitter.com/GVbviyZBmf
7. Meta-repository of screening mammography classifiers
Benjamin Stadnick, Jan Witowski, Vishwaesh Rajiv, Jakub Chłędowski, Farah E. Shamout, Kyunghyun Cho, Krzysztof J. Geras
Artificial intelligence (AI) is transforming medicine and showing promise in improving clinical diagnosis. In breast cancer screening, several recent studies show that AI has the potential to improve radiologists’ accuracy, subsequently helping in early cancer diagnosis and reducing unnecessary workup. As the number of proposed models and their complexity grows, it is becoming increasingly difficult to re-implement them in order to reproduce the results and to compare different approaches. To enable reproducibility of research in this application area and to enable comparison between different methods, we release a meta-repository containing deep learning models for classification of screening mammograms. This meta-repository creates a framework that enables the evaluation of machine learning models on any private or public screening mammography data set. At its inception, our meta-repository contains five state-of-the-art models with open-source implementations and cross-platform compatibility. We compare their performance on five international data sets: two private New York University breast cancer screening data sets as well as three public (DDSM, INbreast and Chinese Mammography Database) data sets. Our framework has a flexible design that can be generalized to other medical image analysis tasks. The meta-repository is available at https://www.github.com/nyukat/mammography_metarepository.
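The sketch below shows the kind of model-by-dataset evaluation grid such a meta-repository enables. The DummyModel interface and dataset loader are placeholders; the real framework wraps each released model's own inference code (see the linked repository).

```python
# Sketch of a model-by-dataset evaluation grid. The model interface and
# dataset loader are hypothetical placeholders, not the meta-repository API.
import numpy as np
from sklearn.metrics import roc_auc_score

class DummyModel:
    """Placeholder for a wrapped screening-mammography classifier."""
    def __init__(self, name): self.name = name
    def predict(self, images): return np.random.rand(len(images))  # malignancy scores

def load_dataset(name, n=200):
    # Placeholder: images + binary labels (1 = malignant).
    rng = np.random.default_rng(0)
    return rng.random((n, 64, 64)), rng.integers(0, 2, n)

models = [DummyModel("model_a"), DummyModel("model_b")]
datasets = ["DDSM", "INbreast", "CMMD"]

for ds in datasets:
    images, labels = load_dataset(ds)
    for m in models:
        auc = roc_auc_score(labels, m.predict(images))
        print(f"{m.name:8s} on {ds:9s}: AUC = {auc:.3f}")
```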
Today, we release an open-source *meta-repository* for breast cancer mammography classifiers! In a new paper, we use it to evaluate 5 SOTA models on 5 various datasets from around the world. Preprint is now live at: https://t.co/CpHV9Yo2b9, and a thread below: pic.twitter.com/VIwGDa1GAX
— Jan Witowski (@JanWitowski) August 11, 2021
8. U-Net-and-a-half: Convolutional network for biomedical image segmentation using multiple expert-driven annotations
Yichi Zhang, Jesper Kers, Clarissa A. Cassol, Joris J. Roelofs, Najia Idrees, Alik Farber, Samir Haroon, Kevin P. Daly, Suvranu Ganguli, Vipul C. Chitalia, Vijaya B. Kolachalama
Development of deep learning systems for biomedical segmentation often requires access to expert-driven, manually annotated datasets. If more than a single expert is involved in the annotation of the same images, then the inter-expert agreement is not necessarily perfect, and no single expert annotation can precisely capture the so-called ground truth of the regions of interest on all images. Also, it is not trivial to generate a reference estimate using annotations from multiple experts. Here we present a deep neural network, defined as U-Net-and-a-half, which can simultaneously learn from annotations performed by multiple experts on the same set of images. U-Net-and-a-half contains a convolutional encoder to generate features from the input images, multiple decoders that allow simultaneous learning from image masks obtained from annotations that were independently generated by multiple experts, and a shared low-dimensional feature space. To demonstrate the applicability of our framework, we used two distinct datasets from digital pathology and radiology, respectively. Specifically, we trained two separate models using pathologist-driven annotations of glomeruli on whole slide images of human kidney biopsies (10 patients), and radiologist-driven annotations of lumen cross-sections of human arteriovenous fistulae obtained from intravascular ultrasound images (10 patients), respectively. The models based on U-Net-and-a-half exceeded the performance of the traditional U-Net models trained on single expert annotations alone, thus expanding the scope of multitask learning in the context of biomedical image segmentation.
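A minimal sketch of the shared-encoder, per-expert-decoder idea: one convolutional encoder feeds one decoder per annotating expert, and the losses against each expert's masks are summed. The layer sizes and the omission of full U-Net skip connections are simplifications, not the authors' architecture.

```python
# Simplified sketch of a shared encoder with one decoder per expert.
# Layer sizes and missing skip connections are simplifications for illustration.
import torch
import torch.nn as nn

class UNetAndAHalfSketch(nn.Module):
    def __init__(self, n_experts=2, ch=16):
        super().__init__()
        self.encoder = nn.Sequential(            # shared feature extractor
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(ch, 2 * ch, 3, padding=1), nn.ReLU(),
        )
        self.decoders = nn.ModuleList([          # one decoder per annotating expert
            nn.Sequential(
                nn.ConvTranspose2d(2 * ch, ch, 2, stride=2), nn.ReLU(),
                nn.Conv2d(ch, 1, 1),
            )
            for _ in range(n_experts)
        ])

    def forward(self, x):
        z = self.encoder(x)                      # shared feature space
        return [dec(z) for dec in self.decoders] # one mask prediction per expert

model = UNetAndAHalfSketch(n_experts=2)
image = torch.rand(4, 1, 64, 64)
expert_masks = [torch.randint(0, 2, (4, 1, 64, 64)).float() for _ in range(2)]
loss = sum(
    nn.functional.binary_cross_entropy_with_logits(pred, mask)
    for pred, mask in zip(model(image), expert_masks)
)
loss.backward()
print(loss.item())
```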
9. Learning to Cut by Watching Movies
Alejandro Pardo, Fabian Caba Heilbron, Juan León Alcázar, Ali Thabet, Bernard Ghanem
Video content creation keeps growing at an incredible pace; yet, creating engaging stories remains challenging and requires non-trivial video editing expertise. Many video editing components are astonishingly hard to automate primarily due to the lack of raw video materials. This paper focuses on a new task for computational video editing, namely the task of ranking cut plausibility. Our key idea is to leverage content that has already been edited to learn fine-grained audiovisual patterns that trigger cuts. To do this, we first collected a data source of more than 10K videos, from which we extract more than 255K cuts. We devise a model that learns to discriminate between real and artificial cuts via contrastive learning. We set up a new task and a set of baselines to benchmark video cut generation. We observe that our proposed model outperforms the baselines by large margins. To demonstrate our model in real-world applications, we conduct human studies on a collection of unedited videos. The results show that our model does a better job at cutting than random and alternative baselines.
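The sketch below illustrates the training signal in a simplified form: snippets around real cuts mined from edited videos serve as positives, spliced mismatched snippets as artificial cuts, and a small scorer learns to tell them apart (here with a plain binary objective rather than the paper's contrastive audiovisual setup). All features and network sizes are placeholders.

```python
# Simplified real-vs-artificial cut discrimination. Feature extraction and the
# scoring network are illustrative placeholders, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat = 128
score_cut = nn.Sequential(                 # scores how plausible a (left, right) junction is
    nn.Linear(2 * feat, 256), nn.ReLU(), nn.Linear(256, 1)
)
opt = torch.optim.Adam(score_cut.parameters(), lr=1e-4)

for step in range(100):
    # Placeholder clip features: left/right of real cuts mined from edited movies.
    left, right = torch.randn(32, feat), torch.randn(32, feat)
    real = torch.cat([left, right], dim=1)
    # Artificial cuts: splice each left snippet with a mismatched right snippet.
    fake = torch.cat([left, right[torch.randperm(32)]], dim=1)

    logits = torch.cat([score_cut(real), score_cut(fake)]).squeeze(1)
    labels = torch.cat([torch.ones(32), torch.zeros(32)])
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
```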
Learning to Cut by Watching Movies
— AK (@ak92501) August 11, 2021
pdf: https://t.co/vKUSasl0O7
abs: https://t.co/G3tw5nX8Be pic.twitter.com/B2FiJxbt2r
10. RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?
Yuki Tatsunami, Masato Taki
For the past ten years, CNN has reigned supreme in the world of computer vision, but recently, Transformer is on the rise. However, the quadratic computational cost of self-attention has become a severe problem in practice. There has been much research on architectures without CNN and self-attention in this context. In particular, MLP-Mixer is a simple architecture designed using MLPs that achieves accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the embedding of tokens. Thus, there is still a possibility to build a non-convolutional inductive bias into the architecture itself, and we built in an inductive bias using two simple ideas. One is to divide the token-mixing block vertically and horizontally. The other is to make spatial correlations denser among some channels of token mixing. With this approach, we were able to improve the accuracy of the MLP-Mixer while reducing its parameters and computational complexity. Compared to other MLP-based models, the proposed model, named RaftMLP, has a good balance of computational complexity, number of parameters, and actual memory usage. In addition, our work indicates that MLP-based models have the potential to replace CNNs by adopting inductive bias. The source code (PyTorch) is available at https://github.com/okojoalg/raft-mlp.
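A minimal sketch of the vertical/horizontal split of token mixing described above: one MLP mixes tokens along the height axis and another along the width axis, with a residual connection. Dimensions and block structure are simplified assumptions; see the linked repository for the actual RaftMLP implementation.

```python
# Simplified vertical/horizontal token mixing. Dimensions and block structure
# are illustrative assumptions, not the RaftMLP code.
import torch
import torch.nn as nn

class SeparableTokenMixing(nn.Module):
    def __init__(self, h, w, hidden=64):
        super().__init__()
        self.mix_h = nn.Sequential(nn.Linear(h, hidden), nn.GELU(), nn.Linear(hidden, h))
        self.mix_w = nn.Sequential(nn.Linear(w, hidden), nn.GELU(), nn.Linear(hidden, w))

    def forward(self, x):                      # x: (B, H, W, C)
        # Mix along height: put H last, apply MLP, restore layout.
        y = x.permute(0, 3, 2, 1)              # (B, C, W, H)
        y = self.mix_h(y).permute(0, 3, 2, 1)  # back to (B, H, W, C)
        # Mix along width.
        z = y.permute(0, 3, 1, 2)              # (B, C, H, W)
        z = self.mix_w(z).permute(0, 2, 3, 1)  # back to (B, H, W, C)
        return x + z                           # residual connection

block = SeparableTokenMixing(h=14, w=14)
tokens = torch.rand(2, 14, 14, 192)            # e.g., 14x14 patch grid, 192 channels
print(block(tokens).shape)                     # torch.Size([2, 14, 14, 192])
```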
RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?
— AK (@ak92501) August 11, 2021
pdf: https://t.co/gZF22TVnnZ
abs: https://t.co/2Wr0rtSu0Z
github: https://t.co/AxBFNk1Qsj
raft-token-mixing block improves accuracy when trained on the ImageNet-1K dataset, as compared to plain MLP-Mixer pic.twitter.com/HrkxHT5xzo