Hot Papers 2020-12-29

1. Towards Fully Automated Manga Translation

Ryota Hinami, Shonosuke Ishiwatari, Kazuhiko Yasuda, Yusuke Matsui

  • retweets: 3574, favorites: 306 (12/30/2020 10:11:08)
  • links: abs | pdf
  • cs.CL

We tackle the problem of machine translation of manga, Japanese comics. Manga translation involves two important problems in machine translation: context-aware and multimodal translation. Since text and images are mixed in an unstructured fashion in manga, obtaining context from the image is essential for manga translation. However, how to extract context from the image and integrate it into MT models remains an open problem, and no corpus or benchmark to train and evaluate such models is currently available. In this paper, we make four contributions that establish the foundation of manga translation research. First, we propose a multimodal context-aware translation framework; we are the first to incorporate context information obtained from the manga image, which enables us to translate texts in speech bubbles that cannot be translated without context (e.g., texts in other speech bubbles, the gender of speakers, etc.). Second, for training the model, we propose an approach to automatic corpus construction from pairs of original manga and their translations, by which a large parallel corpus can be constructed without any manual labeling. Third, we created a new benchmark to evaluate manga translation. Finally, on top of our proposed methods, we devised the first comprehensive system for fully automated manga translation.
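
A minimal sketch of the core idea of context-aware translation: instead of translating each speech bubble in isolation, texts are first sorted in reading order and each source is augmented with the preceding bubbles as context before being passed to a translation model. The function, the separator token, and the window size are hypothetical placeholders, not the paper's actual pipeline.

```python
def build_context_aware_inputs(bubbles, window=2, sep=" <ctx> "):
    """Augment each speech-bubble text with its preceding bubbles.

    bubbles: list of source texts, already sorted in manga reading order
    window:  how many previous bubbles to include as context
    """
    inputs = []
    for i, text in enumerate(bubbles):
        context = bubbles[max(0, i - window):i]
        # Context precedes the current utterance, separated by a marker token
        inputs.append(sep.join(context + [text]) if context else text)
    return inputs

# Example: the reply in the second bubble is only translatable with context.
page = ["お兄ちゃん、どこ行くの？", "すぐ戻るよ"]
for src in build_context_aware_inputs(page):
    print(src)  # each src would be fed to a context-aware MT model
```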

2. TransPose: Towards Explainable Human Pose Estimation by Transformer

Sen Yang, Zhibin Quan, Mu Nie, Wankou Yang

  • retweets: 1225, favorites: 164 (12/30/2020 10:11:09)
  • links: abs | pdf
  • cs.CV

Deep Convolutional Neural Networks (CNNs) have made remarkable progress on the human pose estimation task. However, there is no explicit understanding of how the locations of body keypoints are predicted by a CNN, and it is also unknown what spatial dependency relationships between structural variables are learned by the model. To explore these questions, we construct an explainable model named TransPose based on the Transformer architecture and low-level convolutional blocks. Given an image, the attention layers built into the Transformer can capture long-range spatial relationships between keypoints and explain which dependencies the predicted keypoint locations rely on. We analyze the rationality of using attention as the explanation to reveal the spatial dependencies in this task. The revealed dependencies are image-specific and vary across keypoint types, layer depths, and trained models. The experiments show that TransPose can accurately predict the positions of keypoints. It achieves state-of-the-art performance on the COCO dataset, while being more interpretable, lightweight, and efficient than mainstream fully convolutional architectures.
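
A schematic of the architecture described above, assuming standard PyTorch modules; layer sizes and names are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TransPoseSketch(nn.Module):
    """Low-level conv blocks -> Transformer encoder -> per-keypoint heatmaps."""

    def __init__(self, num_keypoints=17, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        # Shallow CNN extracts a low-level feature map
        self.backbone = nn.Sequential(
            nn.Conv2d(3, d_model, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_keypoints)

    def forward(self, x):
        f = self.backbone(x)                   # (B, C, H, W)
        B, C, H, W = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W, C): one token per location
        # Attention maps inside the encoder are what make predictions inspectable
        tokens = self.encoder(tokens)
        heatmaps = self.head(tokens)           # (B, H*W, K)
        return heatmaps.transpose(1, 2).reshape(B, -1, H, W)

out = TransPoseSketch()(torch.randn(1, 3, 256, 192))
print(out.shape)  # torch.Size([1, 17, 32, 24])
```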

3. Improving the Generalization of End-to-End Driving through Procedural Generation

Quanyi Li, Zhenghao Peng, Qihang Zhang, Cong Qiu, Chunxiao Liu, Bolei Zhou

Recently there has been growing interest in the end-to-end training of autonomous driving, where the entire driving pipeline from perception to control is modeled as a neural network and jointly optimized. End-to-end driving is usually first developed and validated in simulators. However, most existing driving simulators only contain a fixed set of maps and a limited number of configurations. As a result, deep models are prone to overfitting the training scenarios. Furthermore, it is difficult to assess how well the trained models generalize to unseen scenarios. To better evaluate and improve the generalization of end-to-end driving, we introduce an open-ended and highly configurable driving simulator called PGDrive. PGDrive first defines multiple basic road blocks, such as ramps, forks, and roundabouts, with configurable settings. A range of diverse maps can then be assembled from those blocks with procedural generation, which are further turned into interactive environments. The experiments show that a driving agent trained by reinforcement learning on a small fixed set of maps generalizes poorly to unseen maps. We further validate that training with an increasing number of procedurally generated maps significantly improves the generalization of the agent across scenarios of different traffic densities and map structures. Code is available at: https://decisionforce.github.io/pgdrive
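
A toy illustration of the procedural-generation idea: maps are assembled by sampling from a set of configurable road blocks, so every training map can present a new layout. The block names and parameters below are made up for illustration and do not reflect PGDrive's actual API.

```python
import random

BLOCK_TYPES = ["straight", "ramp", "fork", "roundabout", "curve"]

def generate_map(num_blocks, seed=None):
    """Assemble a random map as a sequence of configured road blocks."""
    rng = random.Random(seed)
    blocks = []
    for _ in range(num_blocks):
        blocks.append({
            "type": rng.choice(BLOCK_TYPES),
            "lanes": rng.randint(2, 4),       # configurable settings per block
            "length_m": rng.uniform(50, 200),
        })
    return blocks

# Training on many procedurally generated maps instead of a small fixed set
train_maps = [generate_map(num_blocks=5, seed=s) for s in range(100)]
print(train_maps[0][0])
```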

4. Neural Network Training With Homomorphic Encryption

Kentaro Mihara, Ryohei Yamaguchi, Miguel Mitsuishi, Yusuke Maruyama

  • retweets: 426, favorites: 24 (12/30/2020 10:11:09)
  • links: abs | pdf
  • cs.CR

We introduce a novel method and implementation architecture for training neural networks that preserves the confidentiality of both the model and the data. Our method relies on the homomorphic capability of a lattice-based encryption scheme. Our procedure is optimized for operations on packed ciphertexts in order to achieve efficient updates of the model parameters, and it achieves a significant reduction in computation through the way we carry multiplications and rotations on packed ciphertexts from the feedforward network to the back-propagation network. To verify the accuracy of the trained model as well as the feasibility of the implementation, we tested our method on the Iris data set using the CKKS scheme with Microsoft SEAL as a back end. Although our test implementation covers only simple neural network training, we believe our basic implementation blocks can support further applications to more complex neural-network-based use cases.
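
To give a flavor of why packing matters: in CKKS, a single ciphertext packs a vector of values, and an inner product can be computed with one slot-wise multiplication followed by log2(n) rotations and additions. The sketch below simulates that rotate-and-sum pattern on plaintext NumPy arrays; a real implementation would issue the same sequence of operations through a homomorphic encryption library such as Microsoft SEAL.

```python
import numpy as np

def rotate_and_sum_dot(x, w):
    """Inner product using only the ops available on packed ciphertexts:
    slot-wise multiplication, cyclic rotation, and addition."""
    n = len(x)
    assert n & (n - 1) == 0, "length must be a power of two"
    acc = x * w                          # one homomorphic multiplication
    step = n // 2
    while step >= 1:
        acc = acc + np.roll(acc, -step)  # one rotation + one addition
        step //= 2
    return acc[0]                        # every slot now holds the full sum

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.5, 0.5, 0.5])
print(rotate_and_sum_dot(x, w), x @ w)   # 5.0 5.0
```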

5. Evolution Is All You Need: Phylogenetic Augmentation for Contrastive Learning

Amy X. Lu, Alex X. Lu, Alan Moses

Self-supervised representation learning of biological sequence embeddings alleviates computational resource constraints on downstream tasks while circumventing expensive experimental label acquisition. However, existing methods mostly borrow directly from large language models designed for NLP, rather than being designed with bioinformatics philosophies in mind. Recently, contrastive mutual information maximization methods have achieved state-of-the-art representations for ImageNet. In this perspective piece, we discuss how viewing evolution as natural sequence augmentation and maximizing information across phylogenetic “noisy channels” is a biologically and theoretically desirable objective for pretraining encoders. We first review the current contrastive learning literature, then provide an illustrative example showing that contrastive learning with evolutionary augmentation can serve as a representation learning objective that maximizes the mutual information between biological sequences and their conserved function, and finally outline the rationale for this approach.
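
A minimal sketch of contrastive learning with evolutionary augmentation: homologous sequences from different species are treated as two “views” of the same underlying function, and an InfoNCE-style loss pulls their embeddings together while pushing apart unrelated sequences. The embeddings below are random stand-ins for the output of any sequence encoder.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss where (z1[i], z2[i]) are embeddings of homologous
    sequences (evolution as natural augmentation); other rows are negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature      # (N, N) similarity matrix
    # Row-wise log-softmax; the diagonal entries are the positive pairs
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z_human = rng.normal(size=(8, 32))                   # encoder(seq), human proteins
z_mouse = z_human + 0.1 * rng.normal(size=(8, 32))   # homologs: slightly "mutated"
print(info_nce(z_human, z_mouse))
```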

6. On Generating Extended Summaries of Long Documents

Sajad Sotudeh, Arman Cohan, Nazli Goharian

  • retweets: 156, favorites: 63 (12/30/2020 10:11:10)
  • links: abs | pdf
  • cs.CL

Prior work in document summarization has mainly focused on generating short summaries of a document. While this type of summary helps provide a high-level view of a given document, it is desirable in some cases to know more detailed information about its salient points that cannot fit in a short summary. This is typically the case for longer documents such as research papers, legal documents, or books. In this paper, we present a new method for generating extended summaries of long papers. Our method exploits the hierarchical structure of the documents and incorporates it into an extractive summarization model through a multi-task learning approach. We then present our results on three long summarization datasets, arXiv-Long, PubMed-Long, and Longsumm. Our method outperforms or matches the performance of strong baselines. Furthermore, we perform a comprehensive analysis of the generated results, shedding light on future research directions for long-form summary generation. Our analysis shows that our multi-tasking approach can adjust the extraction probability distribution in favor of summary-worthy sentences across diverse sections. Our datasets and code are publicly available at https://github.com/Georgetown-IR-Lab/ExtendedSumm
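
A schematic of the multi-task setup described above, assuming a shared sentence encoder with two heads: one scores sentences for extraction, the other predicts which section a sentence belongs to, so section structure can shape the extraction distribution. Layer sizes and the auxiliary-task formulation are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class MultiTaskExtractor(nn.Module):
    """Shared sentence encoder with extraction and section-prediction heads."""

    def __init__(self, emb_dim=768, hidden=256, num_sections=6):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.extract_head = nn.Linear(hidden, 1)            # summary-worthy or not
        self.section_head = nn.Linear(hidden, num_sections) # auxiliary task

    def forward(self, sent_embs):
        h = self.shared(sent_embs)
        return self.extract_head(h).squeeze(-1), self.section_head(h)

model = MultiTaskExtractor()
sents = torch.randn(40, 768)   # one document: 40 sentence embeddings
ext_logits, sec_logits = model(sents)
loss = (nn.functional.binary_cross_entropy_with_logits(
            ext_logits, torch.randint(0, 2, (40,)).float())
        + 0.5 * nn.functional.cross_entropy(sec_logits, torch.randint(0, 6, (40,))))
loss.backward()
```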

7. Analysis of Short Dwell Time in Relation to User Interest in a News Application

Ryosuke Homma, Yoshifumi Seki, Mitsuo Yoshida, Kyoji Umemura

Dwell time has been widely used in various fields to evaluate content quality and user engagement. Although many studies have shown that content with a long dwell time tends to be of good quality, content with a short dwell time has not been discussed in detail. We hypothesize that content with a short dwell time is not always of low quality and does not always reflect low user engagement, but is instead related to user interest. The purpose of this study is to clarify what short-dwell-time browsing means in a mobile news application. First, we analyze the relation of short dwell time to user interest using large-scale user behavior logs from a mobile news application. This analysis was conducted in a vector space built from users' click histories, in which users and articles are mapped into the same space. Users with short dwell times are concentrated at specific positions in this space; thus, the length of dwell time is related to their interests. Moreover, we analyze the characteristics of short-dwell-time browsing by excluding these browses from users' click histories. Surprisingly, when short-dwell-time clicks were excluded, the cluster assignment changed for 30.87% of users, indicating that short-dwell-time click history includes some aspect of user interest. These findings demonstrate that a short dwell time does not always indicate a low level of user engagement, but can also reflect user interest.
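
A small sketch of the kind of analysis described: represent each user as the average of the vectors of the articles they clicked, cluster users, and compare cluster assignments with and without short-dwell clicks. The vector construction, dwell threshold, and clustering details are illustrative assumptions, not the paper's exact method.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
article_vecs = rng.normal(size=(500, 16))  # e.g. learned from click co-occurrence

def user_embedding(clicked_ids):
    """A user is the mean of the vectors of the articles they clicked."""
    return article_vecs[clicked_ids].mean(axis=0)

# clicks[u] = (article_ids, dwell_seconds) for user u
clicks = [(rng.integers(0, 500, 30), rng.exponential(20, 30)) for _ in range(200)]

emb_all = np.array([user_embedding(a) for a, _ in clicks])
emb_long = np.array([user_embedding(a[d >= 10]) for a, d in clicks])  # drop short dwells

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(emb_all)
changed = km.predict(emb_long) != km.predict(emb_all)
print(f"cluster changed for {changed.mean():.1%} of users")
```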

8. A Google Earth Engine-enabled Python approach to improve identification of anthropogenic palaeo-landscape features

Filippo Brandolini, Guillem Domingo Ribas, Andrea Zerboni, Sam Turner

  • retweets: 87, favorites: 18 (12/30/2020 10:11:10)
  • links: abs | pdf
  • cs.CY | cs.CV

The necessity of sustainable development for landscapes has emerged as an important theme in recent decades. Current methods take a holistic approach to landscape heritage and promote an interdisciplinary dialogue to facilitate complementary landscape management strategies. With the socio-economic values of the natural and cultural landscape heritage increasingly recognised worldwide, remote sensing tools are being used more and more to facilitate the recording and management of landscape heritage. Satellite remote sensing technologies have enabled significant improvements in landscape research. The advent of the cloud-based platform Google Earth Engine has allowed the rapid exploration and processing of satellite imagery such as the Landsat and Copernicus Sentinel datasets. In this paper, the use of Sentinel-2 satellite data for the identification of palaeo-riverscape features has been assessed in the Po Plain, selected because it has been characterized by human exploitation since the Mid-Holocene. A multi-temporal approach has been adopted to investigate the potential of satellite imagery to detect buried hydrological and anthropogenic features, together with Spectral Index and Spectral Decomposition analysis. This research represents one of the first applications of the GEE Python API in landscape studies. The complete FOSS-cloud protocol proposed here consists of a Python script developed in Google Colab which could easily be adapted and replicated in different areas of the world.
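
A minimal example of the kind of GEE Python workflow the paper builds on: filter a Sentinel-2 collection over an area of interest, build a multi-temporal composite, and compute a spectral index. The region coordinates, dates, and thresholds are placeholders; this is a generic sketch, not the paper's protocol.

```python
import ee

ee.Initialize()  # in Colab, run ee.Authenticate() first

# Area of interest (placeholder coordinates in the Po Plain)
aoi = ee.Geometry.Rectangle([10.0, 44.8, 10.5, 45.2])

# Multi-temporal Sentinel-2 composite over one season
s2 = (ee.ImageCollection("COPERNICUS/S2")
      .filterBounds(aoi)
      .filterDate("2019-06-01", "2019-09-30")
      .filter(ee.Filter.lt("CLOUDY_PIXEL_PERCENTAGE", 20))
      .median())

# Spectral index: NDVI from NIR (B8) and red (B4) can highlight buried features
ndvi = s2.normalizedDifference(["B8", "B4"]).rename("NDVI")
print(ndvi.getInfo()["bands"][0]["id"])  # 'NDVI'
```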

9. Self-supervised Pre-training with Hard Examples Improves Visual Representations

Chunyuan Li, Xiujun Li, Lei Zhang, Baolin Peng, Mingyuan Zhou, Jianfeng Gao

Self-supervised pre-training (SSP) employs random image transformations to generate training data for visual representation learning. In this paper, we first present a modeling framework that unifies existing SSP methods as learning to predict pseudo-labels. Then, we propose new data augmentation methods of generating training examples whose pseudo-labels are harder to predict than those generated via random image transformations. Specifically, we use adversarial training and CutMix to create hard examples (HEXA) to be used as augmented views for MoCo-v2 and DeepCluster-v2, leading to two variants, HEXA_{MoCo} and HEXA_{DCluster}, respectively. In our experiments, we pre-train models on ImageNet and evaluate them on multiple public benchmarks. Our evaluation shows that the two new algorithm variants outperform their original counterparts, and achieve new state-of-the-art results on a wide range of tasks where limited task supervision is available for fine-tuning. These results verify that hard examples are instrumental in improving the generalization of the pre-trained models.
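
For reference, a minimal NumPy sketch of CutMix, one of the two hard-example generators used above: a random box from one image is pasted into another, and the label (or pseudo-label) is mixed in proportion to the box area. The Beta parameter alpha = 1 is a common CutMix default, assumed here.

```python
import numpy as np

def cutmix(img_a, img_b, alpha=1.0, rng=np.random.default_rng()):
    """Paste a random box from img_b into img_a; return image and mix ratio."""
    h, w = img_a.shape[:2]
    lam = rng.beta(alpha, alpha)                # target area kept from img_a
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(h), rng.integers(w)   # box center
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mixed = img_a.copy()
    mixed[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]
    lam_adjusted = 1 - (y2 - y1) * (x2 - x1) / (h * w)  # actual area ratio
    return mixed, lam_adjusted

a, b = np.zeros((224, 224, 3)), np.ones((224, 224, 3))
mixed, lam = cutmix(a, b)
print(lam)  # fraction of the image still coming from img_a
```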

10. Universal Sentence Representation Learning with Conditional Masked Language Model

Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, Eric Darve

  • retweets: 42, favorites: 50 (12/30/2020 10:11:10)
  • links: abs | pdf
  • cs.CL

This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large-scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on the encoded vectors of adjacent sentences. Our English CMLM model achieves state-of-the-art performance on SentEval, even outperforming models learned using (semi-)supervised signals. As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains. We find that a multilingual CMLM model co-trained with bitext retrieval (BR) and natural language inference (NLI) tasks outperforms the previous state-of-the-art multilingual models by a large margin. We explore the same-language bias of the learned representations, and propose a principal-component-based approach to remove the language-identifying information from the representation while still retaining sentence semantics.
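
A schematic of the conditioning step: the adjacent sentences are encoded into a single vector, which is projected and added to the masked-token representations before the MLM head predicts the masked words. The module shapes below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CMLMConditioner(nn.Module):
    """Inject a sentence vector from adjacent context into MLM prediction."""

    def __init__(self, hidden=768, vocab=30000):
        super().__init__()
        self.project = nn.Linear(hidden, hidden)  # map sentence vector into MLM space
        self.mlm_head = nn.Linear(hidden, vocab)

    def forward(self, token_states, context_vec):
        # token_states: (B, T, H) states of the masked sentence
        # context_vec:  (B, H) encoded vector of adjacent sentences
        conditioned = token_states + self.project(context_vec).unsqueeze(1)
        return self.mlm_head(conditioned)         # logits for masked positions

B, T, H = 2, 16, 768
logits = CMLMConditioner()(torch.randn(B, T, H), torch.randn(B, H))
print(logits.shape)  # torch.Size([2, 16, 30000])
```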

11. Logic Tensor Networks

Samy Badreddine, Artur d’Avila Garcez, Luciano Serafini, Michael Spranger

  • retweets: 64, favorites: 22 (12/30/2020 10:11:10)
  • links: abs | pdf
  • cs.AI | cs.LG

Artificial Intelligence agents are required to learn from their surroundings and to reason about the knowledge that has been learned in order to make decisions. While state-of-the-art learning from data typically uses sub-symbolic distributed representations, reasoning is normally useful at a higher level of abstraction with the use of a first-order logic language for knowledge representation. As a result, attempts at combining symbolic AI and neural computation into neural-symbolic systems have been on the increase. In this paper, we present Logic Tensor Networks (LTN), a neurosymbolic formalism and computational model that supports learning and reasoning through the introduction of a many-valued, end-to-end differentiable first-order logic called Real Logic as a representation language for deep learning. We show that LTN provides a uniform language for the specification and the computation of several AI tasks such as data clustering, multi-label classification, relational learning, query answering, semi-supervised learning, regression and embedding learning. We implement and illustrate each of the above tasks with a number of simple explanatory examples using TensorFlow 2.

Keywords: Neurosymbolic AI, Deep Learning and Reasoning, Many-valued Logic.
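
To make “many-valued, end-to-end differentiable first-order logic” concrete, here is a tiny sketch of Real Logic-style operators built from the product t-norm, with truth values in [0, 1] and quantifiers as differentiable aggregations. LTN lets these operators be configured; this particular choice is just one common instance, not LTN's library API.

```python
import numpy as np

# Truth values live in [0, 1]; every operator is differentiable.
def Not(a):        return 1.0 - a
def And(a, b):     return a * b                  # product t-norm
def Or(a, b):      return a + b - a * b          # probabilistic sum
def Implies(a, b): return 1.0 - a + a * b        # Reichenbach implication
def Forall(xs):    return np.mean(xs)            # smooth universal quantifier

# "Every point in the cluster is close to the centroid" as a fuzzy formula
points = np.array([[0.1, 0.2], [0.0, 0.1], [0.9, 0.8]])
centroid = points.mean(axis=0)
close = np.exp(-np.linalg.norm(points - centroid, axis=1))  # predicate Close(x, c)
print(Forall(close), And(close[0], Not(close[2])))
```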

12. Taxonomy of multimodal self-supervised representation learning

Alex Fedorov, Tristan Sylvain, Margaux Luck, Lei Wu, Thomas P. DeRamus, Alex Kirilin, Dmitry Bleklov, Sergey M. Plis, Vince D. Calhoun

  • retweets: 52, favorites: 26 (12/30/2020 10:11:10)
  • links: abs | pdf
  • cs.LG | cs.CV

Sensory input from multiple sources is crucial for robust and coherent human perception. Different sources contribute complementary explanatory factors and are combined based on the factors they share. This mechanism has motivated the design of powerful unsupervised representation-learning algorithms. In this paper, we unify recent work on multimodal self-supervised learning under a single framework. Observing that most self-supervised methods optimize similarity metrics between a set of model components, we propose a taxonomy of all reasonable ways to organize this process. We empirically show on two versions of multimodal MNIST and a multimodal brain imaging dataset that (1) multimodal contrastive learning has significant benefits over its unimodal counterpart, (2) the specific composition of multiple contrastive objectives is critical to downstream performance, and (3) maximizing the similarity between representations has a regularizing effect on a neural network, which can sometimes reduce downstream performance but still reveal multimodal relations. Consequently, we outperform previous unsupervised encoder-decoder methods based on CCA or the variational mixture model MMVAE on various datasets under the linear evaluation protocol.
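
A condensed sketch of what “optimizing similarity metrics between model components” means: each modality has its own encoder, and a particular method in the taxonomy corresponds to a choice of component pairs whose similarity is maximized, summed with weights. The modality names, pair selection, and weights below are illustrative.

```python
import numpy as np

def align(za, zb):
    """Negative mean cosine similarity between paired representations."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    return -np.mean(np.sum(za * zb, axis=1))  # lower = better aligned

rng = np.random.default_rng(1)
components = {                   # representations produced by each encoder
    "image": rng.normal(size=(16, 64)),
    "fmri":  rng.normal(size=(16, 64)),
}

# A point in the taxonomy = which component pairs have similarity maximized
# (within-modality pairs would use two augmented views of the same input)
objective = [("image", "fmri", 1.0)]
total = sum(w * align(components[a], components[b]) for a, b, w in objective)
print(total)
```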

13. Latent Compass: Creation by Navigation

Sarah Schwettmann, Hendrik Strobelt, Mauro Martino

  • retweets: 42, favorites: 36 (12/30/2020 10:11:10)
  • links: abs | pdf
  • cs.AI | cs.CV

In Marius von Senden’s Space and Sight, a newly sighted blind patient describes the experience of a corner as lemon-like, because corners “prick” sight like lemons prick the tongue. Prickliness, here, is a dimension in the feature space of sensory experience, an effect of the perceived on the perceiver that arises where the two interact. In the account of the newly sighted, an effect familiar from one interaction translates to a novel context. Perception serves as the vehicle for generalization, in that an effect shared across different experiences produces a concrete abstraction grounded in those experiences. Cézanne and the post-impressionists, fluent in the language of experience translation, realized that the way to paint a concrete form that best reflected reality was to paint not what they saw, but what it was like to see. We envision a future of creation using AI where what it is like to see is replicable, transferable, and manipulable: part of the artist’s palette that is both grounded in a particular context and generalizable beyond it. An active line of research maps human-interpretable features onto directions in GAN latent space. Supervised and self-supervised approaches that search for anticipated directions or use off-the-shelf classifiers to drive image manipulation in embedding space are limited in the variety of features they can uncover. Unsupervised approaches that discover useful new directions show that the space of perceptually meaningful directions is nowhere close to being fully mapped. As this space is broad and full of creative potential, we want tools for direction discovery that capture the richness and generalizability of human perception. Our approach puts creators in the discovery loop during real-time tool use, in order to identify directions that are perceptually meaningful to them, and generates interpretable image translations along those directions.
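
The underlying mechanics are simple to sketch: a perceptually meaningful direction is a unit vector in the generator's latent space, and an image is “translated” by shifting its latent code along that vector. The generator is a placeholder for any pretrained GAN; the direction here is random, whereas Latent Compass would discover it interactively with the creator in the loop.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=512)            # latent code of the current image

def translate(z, direction, alpha):
    """Move a latent code along a (unit-normalized) direction by strength alpha."""
    d = direction / np.linalg.norm(direction)
    return z + alpha * d

# A direction that is perceptually meaningful to a particular creator
direction = rng.normal(size=512)

# Walk along the direction to get an interpretable image translation
frames = [translate(z, direction, a) for a in np.linspace(-3, 3, 7)]
# images = [generator(w) for w in frames]   # generator: pretrained GAN (placeholder)
print(len(frames), frames[0].shape)
```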

14. A Tutorial on Sparse Gaussian Processes and Variational Inference

Felix Leibfried, Vincent Dutordoir, ST John, Nicolas Durrande

Gaussian processes (GPs) provide a framework for Bayesian inference that can offer principled uncertainty estimates for a large range of problems. For example, if we consider regression problems with Gaussian likelihoods, a GP model can predict both the mean and variance of the posterior in closed form. However, identifying the posterior GP scales cubically with the number of training examples and requires storing all examples in memory. In order to overcome these obstacles, sparse GPs have been proposed that approximate the true posterior GP with pseudo-training examples. Importantly, the number of pseudo-training examples is user-defined and enables control over computational and memory complexity. In the general case, sparse GPs do not enjoy closed-form solutions and one has to resort to approximate inference. In this context, a convenient choice for approximate inference is variational inference (VI), where the problem of Bayesian inference is cast as an optimization problem: namely, to maximize a lower bound of the log marginal likelihood. This paves the way for a powerful and versatile framework, where pseudo-training examples are treated as optimization arguments of the approximate posterior that are jointly identified together with hyperparameters of the generative model (i.e. prior and likelihood). The framework can naturally handle a wide scope of supervised learning problems, ranging from regression with heteroscedastic and non-Gaussian likelihoods to classification problems with discrete labels, as well as multilabel problems. The purpose of this tutorial is to provide access to the basic matter for readers without prior knowledge of either GPs or VI. A proper exposition of the subject also enables access to more recent advances (such as importance-weighted VI, as well as interdomain, multioutput and deep GPs) that can serve as an inspiration for new research ideas.
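
As a companion to the tutorial's starting point, here is the closed-form GP regression posterior for a Gaussian likelihood with an RBF kernel: the predictive mean and variance follow directly from the joint Gaussian over training and test points. This is the exact (non-sparse) case whose cubic cost in the number of training examples motivates the sparse approximations discussed above; kernel hyperparameters are fixed rather than optimized here.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """RBF kernel matrix between the rows of a and b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior(X, y, X_star, noise=0.1):
    """Exact GP regression posterior mean and variance (O(N^3) in len(X))."""
    K = rbf(X, X) + noise**2 * np.eye(len(X))   # noisy training covariance
    K_s = rbf(X, X_star)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha                        # posterior mean at test points
    v = np.linalg.solve(L, K_s)
    var = np.diag(rbf(X_star, X_star)) - (v**2).sum(axis=0)  # posterior variance
    return mean, var

X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(0).normal(size=20)
mean, var = gp_posterior(X, y, np.linspace(0, 5, 50)[:, None])
print(mean.shape, var.min() > 0)
```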