All Articles

Hot Papers 2020-10-01

1. Towards decolonising computational sciences

Abeba Birhane, Olivia Guest

  • retweets: 19018, favorites: 18 (10/02/2020 08:46:22)
  • links: abs | pdf
  • cs.CY

This article sets out our perspective on how to begin the journey of decolonising computational fields, such as data and cognitive sciences. We see this struggle as requiring two basic steps: a) realisation that the present-day system has inherited, and still enacts, hostile, conservative, and oppressive behaviours and principles towards women of colour (WoC); and b) rejection of the idea that centering individual people is a solution to system-level problems. The longer we ignore these two steps, the more “our” academic system maintains its toxic structure, excludes, and harms Black women and other minoritised groups. This also keeps the door open to discredited pseudoscience, like eugenics and physiognomy. We propose that grappling with our fields’ histories and heritage holds the key to avoiding mistakes of the past. For example, initiatives such as “diversity boards” can still be harmful because they superficially appear reformatory but nonetheless center whiteness and maintain the status quo. Building on the shoulders of many WoC’s work, who have been paving the way, we hope to advance the dialogue required to build both a grass-roots and a top-down re-imagining of computational sciences — including but not limited to psychology, neuroscience, cognitive science, computer science, data science, statistics, machine learning, and artificial intelligence. We aspire for these fields to progress away from their stagnant, sexist, and racist shared past into carving and maintaining an ecosystem where both a diverse demographics of researchers and scientific ideas that critically challenge the status quo are welcomed.

2. Rethinking Attention with Performers

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller

We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.

3. Direct Multi-hop Attention based Graph Neural Network

Guangtao Wang, Rex Ying, Jing Huang, Jure Leskovec

Introducing self-attention mechanism in graph neural networks (GNNs) achieved state-of-the-art performance for graph representation learning. However, at every layer, attention is only computed between two connected nodes and depends solely on the representation of both nodes. This attention computation cannot account for the multi-hop neighbors which supply graph structure context information and have influence on the node representation learning as well. In this paper, we propose Direct Multi-hop Attention based Graph neural Network (DAGN) for graph representation learning, a principled way to incorporate multi-hop neighboring context into attention computation, enabling long-range interactions at every layer. To compute attention between nodes that are multiple hops away, DAGN diffuses the attention scores from neighboring nodes to non-neighboring nodes, thus increasing the receptive field for every message passing layer. Unlike previous methods, DAGN uses a diffusion prior on attention values, to efficiently account for all paths between the pair of nodes when computing multi-hop attention weights. This helps DAGN capture large-scale structural information in a single layer, and learn more informative attention distribution. Experimental results on standard semi-supervised node classification as well as the knowledge graph completion show that DAGN achieves state-of-the-art results: DAGN achieves up to 5.7% relative error reduction over the previous state-of-the-art on Cora, Citeseer, and Pubmed. DAGN also obtains the best performance on a large-scale Open Graph Benchmark dataset. On knowledge graph completion DAGN advances state-of-the-art on WN18RR and FB15k-237 across four different performance metrics.

4. Measuring Systematic Generalization in Neural Proof Generation with Transformers

Nicolas Gontier, Koustuv Sinha, Siva Reddy, Christopher Pal

We are interested in understanding how well Transformer language models (TLMs) can perform reasoning tasks when trained on knowledge encoded in the form of natural language. We investigate systematic generalization abilities on an inductive logical reasoning task in natural language, which involves reasoning over relationships between entities grounded in first-order logical proofs. Specifically, we perform soft theorem-proving by leveraging TLMs to generate logical proofs represented in natural language. We systematically test proof generation capabilities, along with inference capabilities leveraging the generated proofs. We observe length-generalization issues in proof generation and inference when evaluated on longer-than-trained sequences. However, we observe TLMs improve their generalization performance after being exposed to longer, exhaustive proofs. In addition, we discover that TLMs are able to generalize better using backward-chaining proofs compared to their forward-chaining counterparts, while they find it easier to generate forward chaining proofs. We observe that models that are not trained to generate proofs are better at generalizing to problems based on longer proofs. This result suggests that Transformers have efficient, yet not interpretable reasoning strategies internally. These results also highlight the systematic generalization issues in TLMs in the context of logical reasoning, and we believe this work will motivate deeper inspection of their underlying reasoning strategies.

5. Bridging Information-Seeking Human Gaze and Machine Reading Comprehension

Jonathan Malmaud, Roger Levy, Yevgeni Berzak

  • retweets: 90, favorites: 45 (10/02/2020 08:46:23)
  • links: abs | pdf
  • cs.CL

In this work, we analyze how human gaze during reading comprehension is conditioned on the given reading comprehension question, and whether this signal can be beneficial for machine reading comprehension. To this end, we collect a new eye-tracking dataset with a large number of participants engaging in a multiple choice reading comprehension task. Our analysis of this data reveals increased fixation times over parts of the text that are most relevant for answering the question. Motivated by this finding, we propose making automated reading comprehension more human-like by mimicking human information-seeking reading behavior during reading comprehension. We demonstrate that this approach leads to performance gains on multiple choice question answering in English for a state-of-the-art reading comprehension model.

6. Learning Image-adaptive 3D Lookup Tables for High Performance Photo Enhancement in Real-time

Hui Zeng, Jianrui Cai, Lida Li, Zisheng Cao, Lei Zhang

Recent years have witnessed the increasing popularity of learning based methods to enhance the color and tone of photos. However, many existing photo enhancement methods either deliver unsatisfactory results or consume too much computational and memory resources, hindering their application to high-resolution images (usually with more than 12 megapixels) in practice. In this paper, we learn image-adaptive 3-dimensional lookup tables (3D LUTs) to achieve fast and robust photo enhancement. 3D LUTs are widely used for manipulating color and tone of photos, but they are usually manually tuned and fixed in camera imaging pipeline or photo editing tools. We, for the first time to our best knowledge, propose to learn 3D LUTs from annotated data using pairwise or unpaired learning. More importantly, our learned 3D LUT is image-adaptive for flexible photo enhancement. We learn multiple basis 3D LUTs and a small convolutional neural network (CNN) simultaneously in an end-to-end manner. The small CNN works on the down-sampled version of the input image to predict content-dependent weights to fuse the multiple basis 3D LUTs into an image-adaptive one, which is employed to transform the color and tone of source images efficiently. Our model contains less than 600K parameters and takes less than 2 ms to process an image of 4K resolution using one Titan RTX GPU. While being highly efficient, our model also outperforms the state-of-the-art photo enhancement methods by a large margin in terms of PSNR, SSIM and a color difference metric on two publically available benchmark datasets.

7. Few-shot Learning for Time-series Forecasting

Tomoharu Iwata, Atsutoshi Kumagai

Time-series forecasting is important for many applications. Forecasting models are usually trained using time-series data in a specific target task. However, sufficient data in the target task might be unavailable, which leads to performance degradation. In this paper, we propose a few-shot learning method that forecasts a future value of a time-series in a target task given a few time-series in the target task. Our model is trained using time-series data in multiple training tasks that are different from target tasks. Our model uses a few time-series to build a forecasting function based on a recurrent neural network with an attention mechanism. With the attention mechanism, we can retrieve useful patterns in a small number of time-series for the current situation. Our model is trained by minimizing an expected test error of forecasting next timestep values. We demonstrate the effectiveness of the proposed method using 90 time-series datasets.

8. Attention that does not Explain Away

Nan Ding, Xinjie Fan, Zhenzhong Lan, Dale Schuurmans, Radu Soricut

Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks. A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances. Following a probabilistic view of the attention via the Gaussian mixture model, we find empirical evidence that the Transformer attention tends to “explain away” certain input neurons. To compensate for this, we propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the “explaining away” effect without introducing significant computational or memory cost. Empirically, we show that the new attention schemes result in improved performance on several well-known benchmarks.

9. On Romanization for Model Transfer Between Scripts in Neural Machine Translation

Chantal Amrhein, Rico Sennrich

  • retweets: 18, favorites: 39 (10/02/2020 08:46:23)
  • links: abs | pdf
  • cs.CL

Transfer learning is a popular strategy to improve the quality of low-resource machine translation. For an optimal transfer of the embedding layer, the child and parent model should share a substantial part of the vocabulary. This is not the case when transferring to languages with a different script. We explore the benefit of romanization in this scenario. Our results show that romanization entails information loss and is thus not always superior to simpler vocabulary transfer methods, but can improve the transfer between related languages with different scripts. We compare two romanization tools and find that they exhibit different degrees of information loss, which affects translation quality. Finally, we extend romanization to the target side, showing that this can be a successful strategy when coupled with a simple deromanization model.

10. Machine Learning and Computational Mathematics

Weinan E

Neural network-based machine learning is capable of approximating functions in very high dimension with unprecedented efficiency and accuracy. This has opened up many exciting new possibilities, not just in traditional areas of artificial intelligence, but also in scientific computing and computational science. At the same time, machine learning has also acquired the reputation of being a set of “black box” type of tricks, without fundamental principles. This has been a real obstacle for making further progress in machine learning. In this article, we try to address the following two very important questions: (1) How machine learning has already impacted and will further impact computational mathematics, scientific computing and computational science? (2) How computational mathematics, particularly numerical analysis, {can} impact machine learning? We describe some of the most important progress that has been made on these issues. Our hope is to put things into a perspective that will help to integrate machine learning with computational mathematics.