Hot Papers 2020-09-22

1. Overfit Neural Networks as a Compact Shape Representation

Thomas Davies, Derek Nowrouzezahrai, Alec Jacobson

retweets: 20493, favorites: 1 (09/24/2020 07:57:34)
links: abs | pdf
cs.GR | cs.CG | cs.CV

Neural networks have proven to be effective approximators of signed distance fields (SDFs) for solid 3D objects. While prior work has focused on the generalization power of such approximations, we instead explore their suitability as a compact - if purposefully overfit - SDF representation of individual shapes. Specifically, we ask whether neural networks can serve as first-class implicit shape representations in computer graphics. We call such overfit networks Neural Implicits. Similar to SDFs stored on a regular grid, Neural Implicits have fixed storage profiles and memory layout, but afford far greater accuracy. At equal storage cost, Neural Implicits consistently match or exceed the accuracy of irregularly-sampled triangle meshes. We achieve this with a combination of a novel loss function, sampling strategy and supervision protocol designed to facilitate robust shape overfitting. We demonstrate the flexibility of our representation on a variety of standard rendering and modelling tasks.

Purposefully overfit neural networks are an efficient surface representation for solid 3D shapes

In https://t.co/m4heg8orD7 with Thomas Davies, @DerekRenderling, we make a few observations: pic.twitter.com/ai9il42tRy
— Alec Jacobson (@_AlecJacobson) September 22, 2020

2. Towards Fast, Accurate and Stable 3D Dense Face Alignment

Jianzhu Guo, Xiangyu Zhu, Yang Yang, Fan Yang, Zhen Lei, Stan Z. Li

retweets: 3422, favorites: 264 (09/24/2020 07:57:34)
links: abs | pdf
cs.CV

Existing methods of 3D dense face alignment mainly concentrate on accuracy, thus limiting the scope of their practical applications. In this paper, we propose a novel regression framework which makes a balance among speed, accuracy and stability. Firstly, on the basis of a lightweight backbone, we propose a meta-joint optimization strategy to dynamically regress a small set of 3DMM parameters, which greatly enhances speed and accuracy simultaneously. To further improve the stability on videos, we present a virtual synthesis method to transform one still image to a short-video which incorporates in-plane and out-of-plane face moving. On the premise of high accuracy and stability, our model runs at over 50fps on a single CPU core and outperforms other state-of-the-art heavy models simultaneously. Experiments on several challenging datasets validate the efficiency of our method. Pre-trained models and code are available at https://github.com/cleardusk/3DDFA_V2.

Towards Fast, Accurate and Stable 3D Dense Face Alignment
pdf: https://t.co/NlbDhcNJZ3
abs: https://t.co/55h07qopDP
github: https://t.co/F6xkGL6wjx pic.twitter.com/lfm3D2fa5a
— AK (@ak92501) September 22, 2020

3. Thermal and IR Drop Analysis Using Convolutional Encoder-Decoder Networks

Vidya A. Chhabria, Vipul Ahuja, Ashwath Prabhu, Nikhil Patil, Palkesh Jain, Sachin S. Sapatnekar

retweets: 2481, favorites: 37 (09/24/2020 07:57:34)
links: abs | pdf
cs.AR | cs.AI | cs.LG

Computationally expensive temperature and power grid analyses are required during the design cycle to guide IC design. This paper employs encoder-decoder based generative (EDGe) networks to map these analyses to fast and accurate image-to-image and sequence-to-sequence translation tasks. The network takes a power map as input and outputs the corresponding temperature or IR drop map. We propose two networks: (i) ThermEDGe: a static and dynamic full-chip temperature estimator and (ii) IREDGe: a full-chip static IR drop predictor based on input power, power grid distribution, and power pad distribution patterns. The models are design-independent and must be trained just once for a particular technology and packaging solution. ThermEDGe and IREDGe are demonstrated to rapidly predict the on-chip temperature and IR drop contours in milliseconds (in contrast with commercial tools that require several hours or more) and provide an average error of 0.6% and 0.008% respectively.

Thermal and IR Drop Analysis Using Convolutional Encoder-Decoder Networks. #DeepLearning #DataScience #BigData #Analytics #AI #RStats #Python #Java #JavaScript #ReactJS #Serverless #IoT #Linux #100DaysofCode #Programming #MachineLearning #NeuralNetworks https://t.co/nnOnnKlRe8 pic.twitter.com/WsyPmz6Paw
— Marcus Borba (@marcusborba) September 22, 2020

4. An AI based talent acquisition and benchmarking for job

Rudresh Mishra, Ricardo Rodriguez, Valentin Portillo

retweets: 1767, favorites: 24 (09/24/2020 07:57:35)
links: abs | pdf
cs.CY | cs.AI | cs.CL

In a recruitment industry, selecting a best CV from a particular job post within a pile of thousand CV’s is quite challenging. Finding a perfect candidate for an organization who can be fit to work within organizational culture is a difficult task. In order to help the recruiters to fill these gaps we leverage the help of AI. We propose a methodology to solve these problems by matching the skill graph generated from CV and Job Post. In this report our approach is to perform the business understanding in order to justify why such problems arise and how we intend to solve these problems using natural language processing and machine learning techniques. We limit our project only to solve the problem in the domain of the computer science industry.

An #AI based talent acquisition and benchmarking for job. #DeepLearning #DataScience #BigData #Analytics #RStats #Python #Java #JavaScript #ReactJS #Serverless #IoT #Linux #100DaysofCode #Programming #DataScientists #MachineLearning #ArtificialIntelligence https://t.co/eGxmvpfzBX pic.twitter.com/t4Q7dfp2yV
— Marcus Borba (@marcusborba) September 22, 2020

5. Latin BERT: A Contextual Language Model for Classical Philology

David Bamman, Patrick J. Burns

retweets: 1295, favorites: 115 (09/24/2020 07:57:35)
links: abs | pdf
cs.CL

We present Latin BERT, a contextual language model for the Latin language, trained on 642.7 million words from a variety of sources spanning the Classical era to the 21st century. In a series of case studies, we illustrate the affordances of this language-specific model both for work in natural language processing for Latin and in using computational methods for traditional scholarship: we show that Latin BERT achieves a new state of the art for part-of-speech tagging on all three Universal Dependency datasets for Latin and can be used for predicting missing text (including critical emendations); we create a new dataset for assessing word sense disambiguation for Latin and demonstrate that Latin BERT outperforms static word embeddings; and we show that it can be used for semantically-informed search by querying contextual nearest neighbors. We publicly release trained models to help drive future work in this space.

Happy to get this out there: "Latin BERT: A Contextual Language Model for Classical Philology" (with @diyclassics) -- paper: https://t.co/wM27qUdFjn, code/model: https://t.co/r26aYeHt5U
— David Bamman (@dbamman) September 22, 2020

6. Differentiable Refraction-Tracing for Mesh Reconstruction of Transparent Objects

Jiahui Lyu, Bojian Wu, Dani Lischinski, Daniel Cohen-Or, Hui Huang

retweets: 960, favorites: 101 (09/24/2020 07:57:35)
links: abs | pdf
cs.GR

Capturing the 3D geometry of transparent objects is a challenging task, ill-suited for general-purpose scanning and reconstruction techniques, since these cannot handle specular light transport phenomena. Existing state-of-the-art methods, designed specifically for this task, either involve a complex setup to reconstruct complete refractive ray paths, or leverage a data-driven approach based on synthetic training data. In either case, the reconstructed 3D models suffer from over-smoothing and loss of fine detail. This paper introduces a novel, high precision, 3D acquisition and reconstruction method for solid transparent objects. Using a static background with a coded pattern, we establish a mapping between the camera view rays and locations on the background. Differentiable tracing of refractive ray paths is then used to directly optimize a 3D mesh approximation of the object, while simultaneously ensuring silhouette consistency and smoothness. Extensive experiments and comparisons demonstrate the superior accuracy of our method.

Differentiable Refraction-Tracing for Mesh Reconstruction of Transparent Objects
pdf: https://t.co/ggKHGMqj4K
abs: https://t.co/4fxiqmJZ7B pic.twitter.com/xTB6q404um
— AK (@ak92501) September 22, 2020

7. “Hey, that’s not an ODE”: Faster ODE Adjoints with 12 Lines of Code

Patrick Kidger, Ricky T. Q. Chen, Terry Lyons

retweets: 870, favorites: 137 (09/24/2020 07:57:36)
links: abs | pdf
cs.LG | math.CA

Neural differential equations may be trained by backpropagating gradients via the adjoint method, which is another differential equation typically solved using an adaptive-step-size numerical differential equation solver. A proposed step is accepted if its error, relative to some norm, is sufficiently small; else it is rejected, the step is shrunk, and the process is repeated. Here, we demonstrate that the particular structure of the adjoint equations makes the usual choices of norm (such as $L^2$) unnecessarily stringent. By replacing it with a more appropriate (semi)norm, fewer steps are unnecessarily rejected and the backpropagation is made faster. This requires only minor code modifications. Experiments on a wide range of tasks (including time series, generative modeling, and physical control) demonstrate a median improvement of 40% fewer function evaluations. On some problems we see as much as 62% fewer function evaluations, so that the overall training time is roughly halved.

New paper, with @RickyTQChen!

"Hey, that's not an ODE": Faster ODE Adjoints with 12 Lines of Code https://t.co/u90kK4xKMk https://t.co/6HExIcoQaY

We roughly double the training speed of neural ODEs.

1/ pic.twitter.com/CPeKZEtPru
— Patrick Kidger (@PatrickKidger) September 22, 2020
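
The step-acceptance idea in the abstract above can be illustrated with a small sketch. It is not the authors' code: the layout of the state vector (two "dynamic" components followed by three parameter-gradient components), the tolerances, and all function names are assumptions chosen for illustration. It shows only how a seminorm that ignores some components can accept a step that a full norm would reject.

```python
import math

def rms_norm(values):
    """Root-mean-square norm over all components."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def make_seminorm(num_dynamic):
    """Seminorm that only measures the first `num_dynamic` components.

    Assumption for this sketch: the remaining components are
    parameter-gradient integrals whose error need not drive the step size.
    """
    def seminorm(values):
        return rms_norm(values[:num_dynamic])
    return seminorm

def accept_step(error_estimate, state, norm, rtol=1e-3, atol=1e-6):
    """Usual adaptive-step test: the scaled error must have norm <= 1."""
    scaled = [e / (atol + rtol * abs(s)) for e, s in zip(error_estimate, state)]
    return norm(scaled) <= 1.0

# Hypothetical augmented state: 2 dynamic components, 3 gradient components.
state = [1.0, -2.0, 0.5, 0.1, 3.0]
# Small error on the dynamic part, large error on the gradient part.
error = [1e-4, 2e-4, 5e-2, 4e-2, 9e-2]

full_norm_accepts = accept_step(error, state, rms_norm)
seminorm_accepts = accept_step(error, state, make_seminorm(2))
print(full_norm_accepts, seminorm_accepts)
```

With these made-up numbers the full norm rejects the step while the seminorm accepts it, which is the mechanism by which fewer steps are rejected.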

8. Epidemic mitigation by statistical inference from contact tracing data

Antoine Baker, Indaco Biazzo, Alfredo Braunstein, Giovanni Catania, Luca Dall’Asta, Alessandro Ingrosso, Florent Krzakala, Fabio Mazza, Marc Mézard, Anna Paola Muntoni, Maria Refinetti, Stefano Sarao Mannelli, Lenka Zdeborová

retweets: 222, favorites: 62 (09/24/2020 07:57:36)
links: abs | pdf
q-bio.PE | cond-mat.stat-mech | cs.AI | cs.LG

Contact-tracing is an essential tool in order to mitigate the impact of pandemic such as the COVID-19. In order to achieve efficient and scalable contact-tracing in real time, digital devices can play an important role. While a lot of attention has been paid to analyzing the privacy and ethical risks of the associated mobile applications, so far much less research has been devoted to optimizing their performance and assessing their impact on the mitigation of the epidemic. We develop Bayesian inference methods to estimate the risk that an individual is infected. This inference is based on the list of his recent contacts and their own risk levels, as well as personal information such as results of tests or presence of syndromes. We propose to use probabilistic risk estimation in order to optimize testing and quarantining strategies for the control of an epidemic. Our results show that in some range of epidemic spreading (typically when the manual tracing of all contacts of infected people becomes practically impossible, but before the fraction of infected people reaches the scale where a lock-down becomes unavoidable), this inference of individuals at risk could be an efficient way to mitigate the epidemic. Our approaches translate into fully distributed algorithms that only require communication between individuals who have recently been in contact. Such communication may be encrypted and anonymized and thus compatible with privacy preserving standards. We conclude that probabilistic risk estimation is capable to enhance performance of digital contact tracing and should be considered in the currently developed mobile applications.

Mobile-based contact tracing could be done much more efficiently than what is currently implemented. Quantitative evidence to this is presented here: https://t.co/KcQjFBZOvf Science helps saving lives.
— Lenka Zdeborova (@zdeborova) September 22, 2020

What would happens if we run a full Bayesian reconstruction on digital contact tracing for COVID19?
Long story short: probabilistic risk estimation does enhance performance, see "Epidemic mitigation by statistical inference from contact tracing data" https://t.co/hPGrhHsutx https://t.co/aA7WggCGqQ pic.twitter.com/tdU58HSB1X
— Krzakala Florent (@KrzakalaF) September 22, 2020

9. DiffWave: A Versatile Diffusion Model for Audio Synthesis

Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro

retweets: 190, favorites: 83 (09/24/2020 07:57:36)
links: abs | pdf
eess.AS | cs.CL | cs.LG | cs.SD | stat.ML

In this work, we propose DiffWave, a versatile Diffusion probabilistic model for conditional and unconditional Waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in Different Waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.

DiffWave: A Versatile Diffusion Model for Audio Synthesis
pdf: https://t.co/WF40ojFMMs
abs: https://t.co/GpaIiQlNjq
webpage: https://t.co/KeC5GZvmZJ pic.twitter.com/9V14ABEd15
— AK (@ak92501) September 22, 2020

10. Can questions summarize a corpus? Using question generation for characterizing COVID-19 research

Gabriela Surita, Rodrigo Nogueira, Roberto Lotufo

retweets: 195, favorites: 60 (09/24/2020 07:57:36)
links: abs | pdf
cs.IR | cs.CL | cs.LG

What are the latent questions on some textual data? In this work, we investigate using question generation models for exploring a collection of documents. Our method, dubbed corpus2question, consists of applying a pre-trained question generation model over a corpus and aggregating the resulting questions by frequency and time. This technique is an alternative to methods such as topic modelling and word cloud for summarizing large amounts of textual data. Results show that applying corpus2question on a corpus of scientific articles related to COVID-19 yields relevant questions about the topic. The most frequent questions are “what is covid 19” and “what is the treatment for covid”. Among the 1000 most frequent questions are “what is the threshold for herd immunity” and “what is the role of ace2 in viral entry”. We show that the proposed method generated similar questions for 13 of the 27 expert-made questions from the CovidQA question answering dataset. The code to reproduce our experiments and the generated questions are available at: https://github.com/unicamp-dl/corpus2question

New work on using doc2query for summarization by @rodrigfnogueira et al. - works surprisingly well! https://t.co/sg1q34ixUg Samples from CORD-19 corpus related to COVID-19 below. pic.twitter.com/W5lsDwcUJe
— Jimmy Lin (@lintool) September 22, 2020
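
The aggregation step described in the corpus2question abstract (apply a question generator to each document, then count questions by frequency and time) might look roughly like the sketch below. The `generate_questions` function is a hypothetical stand-in backed by a canned lookup table, not the pre-trained model the authors use, and the document ids and months are invented. The question strings are taken from the abstract.

```python
from collections import Counter, defaultdict

# Canned outputs standing in for a pre-trained question generation model.
CANNED_QUESTIONS = {
    "doc-a": ["what is covid 19", "what is the treatment for covid"],
    "doc-b": ["what is covid 19"],
    "doc-c": ["what is the role of ace2 in viral entry", "what is covid 19"],
}

def generate_questions(doc_id):
    """Hypothetical placeholder: return the questions 'generated' for a document."""
    return CANNED_QUESTIONS[doc_id]

def aggregate(corpus):
    """Count generated questions overall and per time bucket (here: month)."""
    totals = Counter()
    by_month = defaultdict(Counter)
    for doc_id, month in corpus:
        for question in generate_questions(doc_id):
            totals[question] += 1
            by_month[month][question] += 1
    return totals, by_month

# Invented corpus of (document id, publication month) pairs.
corpus = [("doc-a", "2020-03"), ("doc-b", "2020-03"), ("doc-c", "2020-04")]
totals, by_month = aggregate(corpus)
top_question, top_count = totals.most_common(1)[0]
print(top_question, top_count)
```

In a real pipeline the lookup table would be replaced by calls to an actual question generation model; the counting logic would stay the same.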

11. Measuring the effect of Non-Pharmaceutical Interventions (NPIs) on mobility during the COVID-19 pandemic using global mobility data

Berber T Snoeijer, Mariska Burger, Shaoxiong Sun, Richard JB Dobson, Amos A Folarin

retweets: 169, favorites: 20 (09/24/2020 07:57:36)
links: abs | pdf
physics.soc-ph | cs.SI | q-bio.QM

The implementation of governmental Non-Pharmaceutical Interventions (NPIs) has been the primary means of controlling the spread of the COVID-19 disease. The intended effect of these NPIs has been to reduce mobility. A strong reduction in mobility is believed to have a positive effect on the reduction of COVID-19 transmission by limiting the opportunity for the virus to spread in the population. Due to the huge costs of implementing these NPIs, it is essential to have a good understanding of their efficacy. Using global mobility data, released by Apple and Google, and ACAPS NPI data, we investigate the proportional contribution of NPIs on i) size of the change (magnitude) of transition between pre- and post-lockdown mobility levels and ii) rate (gradient) of this transition. Using generalized linear models to find the best fit model we found similar results using Apple or Google data. NPIs found to impact the magnitude of the change in mobility were: Lockdown measures (Apple, Google Retail and Recreation (RAR) and Google Transit and Stations (TS)), declaring a state of emergency (Apple, Google RAR and Google TS), closure of businesses and public services (Google RAR) and school closures (Apple). Using cluster analysis and chi square tests we found that closure of businesses and public services, school closures and limiting public gatherings as well as border closures and international flight suspensions were closely related. The implementation of lockdown measures and limiting public gatherings had the greatest effect on the rate of mobility change. In conclusion, we were able to quantitatively assess the efficacy of NPIs in reducing mobility, which enables us to understand their fine grained effects in a timely manner and therefore facilitate well-informed and cost-effective interventions.

Read our paper using #Apple and #Google global mobility datasets to assess efficacy of different measures used to control #COVID19 around the world. https://t.co/llqfb5RFlT @phidatalab @radar_base @KCLBHI @richdobson @ShaoxiongSun @berbers1
and Mariska Burger pic.twitter.com/at4Zx9M73m
— Amos Folarin (@amosfolarin) September 22, 2020

12. PIE: Portrait Image Embedding for Semantic Control

Ayush Tewari, Mohamed Elgharib, Mallikarjun B R., Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, Christian Theobalt

retweets: 92, favorites: 29 (09/24/2020 07:57:36)
links: abs | pdf
cs.CV | cs.GR

Editing of portrait images is a very popular and important research topic with a large variety of applications. For ease of use, control should be provided via a semantically meaningful parameterization that is akin to computer animation controls. The vast majority of existing techniques do not provide such intuitive and fine-grained control, or only enable coarse editing of a single isolated control parameter. Very recently, high-quality semantically controlled editing has been demonstrated, however only on synthetically created StyleGAN images. We present the first approach for embedding real portrait images in the latent space of StyleGAN, which allows for intuitive editing of the head pose, facial expression, and scene illumination in the image. Semantic editing in parameter space is achieved based on StyleRig, a pretrained neural network that maps the control space of a 3D morphable face model to the latent space of the GAN. We design a novel hierarchical non-linear optimization problem to obtain the embedding. An identity preservation energy term allows spatially coherent edits while maintaining facial integrity. Our approach runs at interactive frame rates and thus allows the user to explore the space of possible edits. We evaluate our approach on a wide set of portrait photos, compare it to the current state of the art, and validate the effectiveness of its components in an ablation study.

PIE: Portrait Image Embedding for Semantic Control
pdf: https://t.co/VUQL2fboiI
abs: https://t.co/HT2DiqUSQn
project page: https://t.co/2BxVMtqY7n pic.twitter.com/Ge1ehN68Lu
— AK (@ak92501) September 22, 2020

13. Target Conditioning for One-to-Many Generation

Marie-Anne Lachaux, Armand Joulin, Guillaume Lample

retweets: 42, favorites: 31 (09/24/2020 07:57:37)
links: abs | pdf
cs.LG | stat.ML

Neural Machine Translation (NMT) models often lack diversity in their generated translations, even when paired with search algorithm, like beam search. A challenge is that the diversity in translations are caused by the variability in the target language, and cannot be inferred from the source sentence alone. In this paper, we propose to explicitly model this one-to-many mapping by conditioning the decoder of a NMT model on a latent variable that represents the domain of target sentences. The domain is a discrete variable generated by a target encoder that is jointly trained with the NMT model. The predicted domain of target sentences are given as input to the decoder during training. At inference, we can generate diverse translations by decoding with different domains. Unlike our strongest baseline (Shen et al., 2019), our method can scale to any number of domains without affecting the performance or the training time. We assess the quality and diversity of translations generated by our model with several metrics, on three different datasets.

Target Conditioning for One-to-Many Generation
pdf: https://t.co/PXGydtLtEM
abs: https://t.co/48iaM9bFxM pic.twitter.com/8kavPgtksF
— AK (@ak92501) September 22, 2020

14. Multi-Task Learning with Deep Neural Networks: A Survey

Michael Crawshaw

retweets: 35, favorites: 37 (09/24/2020 07:57:37)
links: abs | pdf
cs.LG | cs.CV | stat.ML

Multi-task learning (MTL) is a subfield of machine learning in which multiple tasks are simultaneously learned by a shared model. Such approaches offer advantages like improved data efficiency, reduced overfitting through shared representations, and fast learning by leveraging auxiliary information. However, the simultaneous learning of multiple tasks presents new design and optimization challenges, and choosing which tasks should be learned jointly is in itself a non-trivial problem. In this survey, we give an overview of multi-task learning methods for deep neural networks, with the aim of summarizing both the well-established and most recent directions within the field. Our discussion is structured according to a partition of the existing deep MTL techniques into three groups: architectures, optimization methods, and task relationship learning. We also provide a summary of common multi-task benchmarks.

MULTI-TASK LEARNING WITH DEEP NEURAL NETWORKS: A SURVEY https://t.co/KpaAyy1IFU
— phalanx (@ZFPhalanx) September 22, 2020

Daniele Romanini, Sune Lehmann, Mikko Kivelä

retweets: 48, favorites: 11 (09/24/2020 07:57:37)
links: abs | pdf
cs.SI | physics.soc-ph

The ability to share social network data at the level of individual connections is beneficial to science: not only for reproducing results, but also for researchers who may wish to use it for purposes not foreseen by the data releaser. Sharing such data, however, can lead to serious privacy issues, because individuals could be re-identified, not only based on possible nodes’ attributes, but also from the structure of the network around them. The risk associated with re-identification can be measured and it is more serious in some networks than in others. Various optimization algorithms have been proposed to anonymize the network while keeping the number of changes minimal. However, existing algorithms do not provide guarantees on where the changes will be made, making it difficult to quantify their effect on various measures. Using network models and real data, we show that the average degree of networks is a crucial parameter for the severity of re-identification risk from nodes’ neighborhoods. Dense networks are more at risk, and, apart from a small band of average degree values, either almost all nodes are re-identifiable or they are all safe. Our results allow researchers to assess the privacy risk based on a small number of network statistics which are available even before the data is collected. As a rule-of-thumb, the privacy risks are high if the average degree is above 10. Guided by these results we propose a simple method based on edge sampling to mitigate the re-identification risk of nodes. Our method can be implemented already at the data collection phase. Its effect on various network measures can be estimated and corrected using sampling theory. These properties are in contrast with previous methods arbitrarily biasing the data. In this sense, our work could help in sharing network data in a statistically tractable way.
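
As a rough illustration of the edge-sampling idea in this abstract, the sketch below keeps each edge independently with a fixed probability and then corrects the observed average degree by dividing by that probability. Independent (Bernoulli) sampling, the ring-lattice test graph, and all names are assumptions made for this example; the abstract itself only says "edge sampling" and "sampling theory".

```python
import random

def sample_edges(edges, keep_prob, rng):
    """Keep each edge independently with probability `keep_prob`."""
    return [edge for edge in edges if rng.random() < keep_prob]

def average_degree(edges, num_nodes):
    """Average degree of an undirected graph: 2 * |E| / |V|."""
    return 2 * len(edges) / num_nodes

def corrected_average_degree(sampled_edges, num_nodes, keep_prob):
    """Unbiased estimate of the original average degree under Bernoulli sampling."""
    return average_degree(sampled_edges, num_nodes) / keep_prob

# Invented test graph: ring lattice where each node links to its next 10 neighbours.
num_nodes = 200
edges = [(i, (i + k) % num_nodes) for i in range(num_nodes) for k in range(1, 11)]

rng = random.Random(0)
keep_prob = 0.3
sampled = sample_edges(edges, keep_prob, rng)

true_degree = average_degree(edges, num_nodes)        # 20.0 by construction
released_degree = average_degree(sampled, num_nodes)  # what would be shared
estimated_degree = corrected_average_degree(sampled, num_nodes, keep_prob)
print(true_degree, released_degree, estimated_degree)
```

Here the released graph falls below the average degree of 10 that the abstract gives as a rule-of-thumb threshold, while the original value can still be estimated from the sample.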

16. Content Planning for Neural Story Generation with Aristotelian Rescoring

Seraphina Goldfarb-Tarrant, Tuhin Chakrabarty, Ralph Weischedel, Nanyun Peng

retweets: 32, favorites: 24 (09/24/2020 07:57:37)
links: abs | pdf
cs.CL | cs.AI

Long-form narrative text generated from large language models manages a fluent impersonation of human writing, but only at the local sentence level, and lacks structure or global cohesion. We posit that many of the problems of story generation can be addressed via high-quality content planning, and present a system that focuses on how to learn good plot structures to guide story generation. We utilize a plot-generation language model along with an ensemble of rescoring models that each implement an aspect of good story-writing as detailed in Aristotle’s Poetics. We find that stories written with our more principled plot-structure are both more relevant to a given prompt and higher quality than baselines that do not content plan, or that plan in an unprincipled way.

17. Cross-Entropy Method Variants for Optimization

Robert J. Moss

retweets: 36, favorites: 18 (09/24/2020 07:57:37)
links: abs | pdf
cs.LG | math.OC | stat.ML

The cross-entropy (CE) method is a popular stochastic method for optimization due to its simplicity and effectiveness. Designed for rare-event simulations where the probability of a target event occurring is relatively small, the CE-method relies on enough objective function calls to accurately estimate the optimal parameters of the underlying distribution. Certain objective functions may be computationally expensive to evaluate, and the CE-method could potentially get stuck in local minima. This is compounded with the need to have an initial covariance wide enough to cover the design space of interest. We introduce novel variants of the CE-method to address these concerns. To mitigate expensive function calls, during optimization we use every sample to build a surrogate model to approximate the objective function. The surrogate model augments the belief of the objective function with less expensive evaluations. We use a Gaussian process for our surrogate model to incorporate uncertainty in the predictions which is especially helpful when dealing with sparse data. To address local minima convergence, we use Gaussian mixture models to encourage exploration of the design space. We experiment with evaluation scheduling techniques to reallocate true objective function calls earlier in the optimization when the covariance is the largest. To test our approach, we created a parameterized test objective function with many local minima and a single global minimum. Our test function can be adjusted to control the spread and distinction of the minima. Experiments were run to stress the cross-entropy method variants and results indicate that the surrogate model-based approach reduces local minima convergence using the same number of function evaluations.
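
For readers unfamiliar with the baseline, here is a minimal sketch of the plain cross-entropy method for one-dimensional minimization. It does not implement the surrogate-model, Gaussian-mixture, or evaluation-scheduling variants proposed in the paper; the quadratic objective, sample counts, and function names are assumptions chosen to keep the example short.

```python
import random
import statistics

def cross_entropy_minimize(objective, mean, std, rng,
                           iterations=30, samples=50, elite=10):
    """Plain CE method with a single 1-D Gaussian proposal distribution."""
    for _ in range(iterations):
        # Draw candidates from the current proposal distribution.
        candidates = [rng.gauss(mean, std) for _ in range(samples)]
        # Keep the `elite` candidates with the lowest objective values.
        candidates.sort(key=objective)
        best = candidates[:elite]
        # Refit the proposal to the elite set.
        mean = statistics.mean(best)
        std = statistics.pstdev(best) + 1e-12  # avoid a zero-width proposal
    return mean

def toy_objective(x):
    """Invented convex test function with its minimum at x = 3."""
    return (x - 3.0) ** 2

rng = random.Random(0)
estimate = cross_entropy_minimize(toy_objective, mean=0.0, std=5.0, rng=rng)
print(estimate)
```

Every iteration spends `samples` true objective evaluations, which is the cost the abstract's surrogate-model variant aims to reduce.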

18. Redundancy of Hidden Layers in Deep Learning: An Information Perspective

Chenguang Zhang, Yuexian Hou, Dawei Song, Liangzhu Ge, Yaoshuai Yao

retweets: 40, favorites: 14 (09/24/2020 07:57:37)
links: abs | pdf
cs.LG | cs.AI | physics.app-ph | stat.ML

Although the deep structure guarantees the powerful expressivity of deep networks (DNNs), it also triggers serious overfitting problem. To improve the generalization capacity of DNNs, many strategies were developed to improve the diversity among hidden units. However, most of these strategies are empirical and heuristic in absence of either a theoretical derivation of the diversity measure or a clear connection from the diversity to the generalization capacity. In this paper, from an information theoretic perspective, we introduce a new definition of redundancy to describe the diversity of hidden units under supervised learning settings by formalizing the effect of hidden layers on the generalization capacity as the mutual information. We prove an opposite relationship existing between the defined redundancy and the generalization capacity, i.e., the decrease of redundancy generally improving the generalization capacity. The experiments show that the DNNs using the redundancy as the regularizer can effectively reduce the overfitting and decrease the generalization error, which well supports above points.

19. Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data

Jonathan Pilault, Amine Elhattami, Christopher Pal

retweets: 14, favorites: 36 (09/24/2020 07:57:37)
links: abs | pdf
cs.LG | stat.ML

Multi-Task Learning (MTL) has emerged as a promising approach for transferring learned knowledge across different tasks. However, multi-task learning must deal with challenges such as: overfitting to low resource tasks, catastrophic forgetting, and negative task transfer, or learning interference. Additionally, in Natural Language Processing (NLP), MTL alone has typically not reached the performance level possible through per-task fine-tuning of pretrained models. However, many fine-tuning approaches are both parameter inefficient, e.g. potentially involving one new model per task, and highly susceptible to losing knowledge acquired during pretraining. We propose a novel transformer based architecture consisting of a new conditional attention mechanism as well as a set of task conditioned modules that facilitate weight sharing. Through this construction we achieve more efficient parameter sharing and mitigate forgetting by keeping half of the weights of a pretrained model fixed. We also use a new multi-task data sampling strategy to mitigate the negative effects of data imbalance across tasks. Using this approach we are able to surpass single-task fine-tuning methods while being parameter and data efficient. With our base model, we attain 2.2% higher performance compared to a full fine-tuned BERT large model on the GLUE benchmark, adding only 5.6% more trained parameters per task (whereas naive fine-tuning potentially adds 100% of the trained parameters per task) and needing only 64.6% of the data. We show that a larger variant of our single multi-task model approach performs competitively across 26 NLP tasks and yields state-of-the-art results on a number of test and development sets.

Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in Nlp Using Fewer Parameters & Less Data

Multi-task learning on BERT variants with better performance than fine-tuning. https://t.co/pmDlCmsbUo pic.twitter.com/LIrJA8WcDm
— Aran Komatsuzaki (@arankomatsuzaki) September 22, 2020

Published 24 Sep 2020

Tatsuya Shirakawa, ML Lead at Beatrust (https://beatrust.com)