1. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
William Fedus, Barret Zoph, Noam Shazeer
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model — with outrageous numbers of parameters — but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability — we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus” and achieve a 4x speedup over the T5-XXL model.
Introducing Switch Transformer, a simplified sparse architecture for scaling to trillion parameter language models
— Barret Zoph (@barret_zoph) January 12, 2021
Switch Transformers yield 4-7x speedups over strong Transformer T5 models w/ the same computational resources
Paper: https://t.co/JfuNCmyEka pic.twitter.com/agbZWUwLmG
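The simplified routing mentioned in the abstract above sends each token to a single expert. A minimal sketch of that top-1 idea follows, assuming a softmax router whose winning probability scales the chosen expert's output. The function names, toy experts, and logits are hypothetical and chosen for illustration; load balancing, expert capacity, and the bfloat16 handling described in the paper are not modeled.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(router_logits):
    """Return (expert_index, gate_value) per token using top-1 routing."""
    routes = []
    for logits in router_logits:
        probs = softmax(logits)
        expert = max(range(len(probs)), key=probs.__getitem__)
        routes.append((expert, probs[expert]))
    return routes

def switch_layer(tokens, router_logits, experts):
    """Send each token to exactly one expert and scale the result by its gate value."""
    outputs = []
    for x, (idx, gate) in zip(tokens, switch_route(router_logits)):
        outputs.append([gate * y for y in experts[idx](x)])
    return outputs

# Hypothetical toy experts standing in for feed-forward sub-networks.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
]
tokens = [[1.0, 2.0], [3.0, 4.0]]
router_logits = [[5.0, 0.0], [0.0, 5.0]]  # assumed router outputs, one row per token

routes = switch_route(router_logits)
outputs = switch_layer(tokens, router_logits, experts)
```

Only one expert runs per token, so the compute per token stays roughly constant as more experts (and therefore more parameters) are added.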
2. How to Train Your Energy-Based Models
Yang Song, Diederik P. Kingma
Energy-Based Models (EBMs), also known as non-normalized probabilistic models, specify probability density or mass functions up to an unknown normalizing constant. Unlike most other probabilistic models, EBMs do not place a restriction on the tractability of the normalizing constant, thus are more flexible to parameterize and can model a more expressive family of probability distributions. However, the unknown normalizing constant of EBMs makes training particularly difficult. Our goal is to provide a friendly introduction to modern approaches for EBM training. We start by explaining maximum likelihood training with Markov chain Monte Carlo (MCMC), and proceed to elaborate on MCMC-free approaches, including Score Matching (SM) and Noise Contrastive Estimation (NCE). We highlight theoretical connections among these three approaches, and end with a brief survey on alternative training methods, which are still under active research. Our tutorial is targeted at an audience with basic understanding of generative models who want to apply EBMs or start a research project in this direction.
@YSongStanford and I wrote a tutorial on training energy-based models (EBMs), which was just released on arXiv. Our goal was to provide a friendly introduction to modern parameter estimation methods. Hope it helps people get up to speed! https://t.co/E7veZEXuue
— Durk Kingma (@dpkingma) January 12, 2021
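One reason score-based training can sidestep the normalizing constant mentioned in the abstract above is that the score, the derivative of the log-density with respect to the input, does not depend on that constant. The sketch below checks this numerically for a one-dimensional Gaussian-shaped energy. The energy function, its parameters, and the finite-difference step are assumptions made for illustration and are not taken from the tutorial.

```python
def energy(x, mu=1.0, sigma=2.0):
    """Energy of an unnormalized Gaussian: p(x) is proportional to exp(-energy(x))."""
    return (x - mu) ** 2 / (2.0 * sigma ** 2)

def shifted_energy(x):
    # Adding a constant to the energy only rescales the normalizing constant.
    return energy(x) + 3.0

def score(energy_fn, x, h=1e-5):
    """Score d/dx log p(x) = -dE/dx, estimated with central differences."""
    return -(energy_fn(x + h) - energy_fn(x - h)) / (2.0 * h)

s_plain = score(energy, 0.5)
s_shifted = score(shifted_energy, 0.5)
s_exact = -(0.5 - 1.0) / 2.0 ** 2  # analytic score for this energy
```

Both numerical scores agree with the analytic value even though the two energies correspond to different normalizing constants.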
3. Predicting Individual Substance Abuse Vulnerability using Machine Learning Techniques
Uwaise Ibna Islam, Iqbal H. Sarker, Enamul Haque, Mohammed Moshiul Hoque
Substance abuse is the unrestrained and detrimental use of psychoactive chemical substances, unauthorized drugs, and alcohol. Continuous use of these substances can ultimately lead to disastrous consequences. As patients display a high rate of relapse, prevention at an early stage can be an effective restraint. We therefore propose a binary classifier to identify an individual’s present vulnerability to substance abuse by analyzing the subject’s socio-economic environment. We collected data through a questionnaire created after carefully assessing the factors commonly involved in substance abuse. Pearson’s chi-squared test of independence is used to identify key feature variables influencing substance abuse. We then build predictive classifiers using machine learning classification algorithms on those variables. A logistic regression classifier trained with 18 features predicts individual vulnerability with the best accuracy.
Predicting Individual Substance Abuse Vulnerability using Machine Learning Techniques. #ML #DeepLearning #BigData #Analytics #Python #RStats #DevCommunity #Serverless #Programming #IoT #womenwhocode #Cloud #100DaysOfCode #DataScience #AI #MachineLearning https://t.co/fchDfFG65s pic.twitter.com/ufotnG7GF0
— Marcus Borba (@marcusborba) January 12, 2021
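The feature selection step in the abstract above relies on Pearson's chi-squared test of independence. As a rough sketch, the statistic for a contingency table can be computed as below. The counts are hypothetical and are not taken from the paper's questionnaire data.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows = feature present / absent, columns = vulnerable / not vulnerable.
table = [[30, 10], [20, 40]]
stat, dof = chi_squared_statistic(table)
```

With one degree of freedom, a statistic above roughly 3.84 rejects independence at the 5% level, so a feature with counts like these would be kept as a candidate predictor.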
4. Investigating the Vision Transformer Model for Image Retrieval Tasks
Socratis Gkelios, Yiannis Boutalis, Savvas A. Chatzichristofis
This paper introduces a plug-and-play descriptor that can be effectively adopted for image retrieval tasks without prior initialization or preparation. The description method utilizes the recently proposed Vision Transformer network and does not require any training data to adjust parameters. In image retrieval tasks, handcrafted global and local descriptors have, over the last years, been very successfully replaced by Convolutional Neural Network (CNN)-based methods. However, the experimental evaluation conducted in this paper on several benchmarking datasets against 36 state-of-the-art descriptors from the literature demonstrates that a neural network that contains no convolutional layer, such as Vision Transformer, can shape a global descriptor and achieve competitive results. As fine-tuning is not required, the presented methodology’s low complexity encourages adoption of the architecture as an image retrieval baseline model, replacing the traditional and well-adopted CNN-based approaches and inaugurating a new era in image retrieval approaches.
Investigating the Vision Transformer Model for Image Retrieval Tasks
— AK (@ak92501) January 12, 2021
pdf: https://t.co/WgrzjxjGst
abs: https://t.co/wfEO6GJxOp pic.twitter.com/tqWfyJGNAf
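Once every image has a global descriptor, retrieval reduces to ranking database descriptors by similarity to the query descriptor. The sketch below assumes cosine similarity as the ranking metric and uses hypothetical three-dimensional vectors. Extracting descriptors from a Vision Transformer is not shown, and the paper's actual distance measure may differ.

```python
import math

def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def rank_by_cosine(query, database):
    """Return database indices sorted from most to least similar to the query."""
    q = l2_normalize(query)
    scored = []
    for index, descriptor in enumerate(database):
        d = l2_normalize(descriptor)
        similarity = sum(a * b for a, b in zip(q, d))
        scored.append((similarity, index))
    scored.sort(key=lambda pair: -pair[0])
    return [index for _, index in scored]

# Hypothetical global descriptors; real ones would have hundreds of dimensions.
database = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.0, 0.1]
ranking = rank_by_cosine(query, database)
```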
5. RepVGG: Making VGG-style ConvNets Great Again
Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, Jian Sun
We present a simple but powerful architecture of convolutional neural network, which has a VGG-like inference-time body composed of nothing but a stack of 3x3 convolution and ReLU, while the training-time model has a multi-branch topology. Such decoupling of the training-time and inference-time architecture is realized by a structural re-parameterization technique so that the model is named RepVGG. On ImageNet, RepVGG reaches over 80% top-1 accuracy, which is the first time for a plain model, to the best of our knowledge. On NVIDIA 1080Ti GPU, RepVGG models run 83% faster than ResNet-50 or 101% faster than ResNet-101 with higher accuracy and show favorable accuracy-speed trade-off compared to the state-of-the-art models like EfficientNet and RegNet. The code and trained models are available at https://github.com/megvii-model/RepVGG.
RepVGG https://t.co/J3GvQYQ3vD
— Yosuke Shinya (@shinya7y) January 12, 2021
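Branched architectures such as ResNet are well suited to training but slow at inference.
Branch-free architectures such as VGG are well suited to inference but less accurate.
So the branched architecture is trained first and then converted into a branch-free architecture for inference, achieving both accuracy and speed.
The conversion method is similar to ACNet https://t.co/XCRTdsxotE pic.twitter.com/ZDcEPQ8HFd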
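The structural re-parameterization described in the RepVGG abstract relies on convolution being linear: parallel 3x3, 1x1, and identity branches can be folded into one 3x3 kernel. The sketch below shows this for a single channel with zero padding. It is a simplified illustration under stated assumptions: batch normalization (which the method also folds into the kernels) and multi-channel tensors are omitted, and all names and values are hypothetical.

```python
def conv3x3(image, kernel):
    """Single-channel 3x3 cross-correlation with zero padding (output keeps the input size)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di + 1][dj + 1] * image[y][x]
            out[i][j] = acc
    return out

def multi_branch(image, k3, k1):
    """Training-time form: 3x3 branch + 1x1 branch (a scalar here) + identity branch."""
    conv_out = conv3x3(image, k3)
    return [[conv_out[i][j] + k1 * image[i][j] + image[i][j]
             for j in range(len(image[0]))] for i in range(len(image))]

def merge_branches(k3, k1):
    """Inference-time form: add the 1x1 weight and the identity to the kernel centre."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1 + 1.0
    return merged

image = [[1.0, 2.0, 0.0, -1.0],
         [0.5, 1.5, 2.5, 3.0],
         [-2.0, 0.0, 1.0, 4.0],
         [3.0, -1.0, 0.5, 2.0]]
k3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4], [0.2, 0.1, -0.1]]
k1 = 0.5

branch_out = multi_branch(image, k3, k1)
plain_out = conv3x3(image, merge_branches(k3, k1))
max_diff = max(abs(a - b) for ra, rb in zip(branch_out, plain_out) for a, b in zip(ra, rb))
```

The two outputs match up to floating-point error, which is why the multi-branch model can be replaced by a plain stack of 3x3 convolutions after training.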
6. HypoSVI: Hypocenter inversion with Stein variational inference and Physics Informed Neural Networks
Jonathan D. Smith, Zachary E. Ross, Kamyar Azizzadenesheli, Jack B. Muir
- retweets: 110, favorites: 38 (01/13/2021 09:30:47)
- physics.geo-ph | cs.LG
We introduce a scheme for probabilistic hypocenter inversion with Stein variational inference. Our approach uses a differentiable forward model in the form of a physics-informed neural network, which we train to solve the Eikonal equation. This allows for rapid approximation of the posterior by iteratively optimizing a collection of particles against a kernelized Stein discrepancy. We show that the method is well-equipped to handle highly non-convex posterior distributions, which are common in hypocentral inverse problems. A suite of experiments is performed to examine the influence of the various hyperparameters. Once trained, the method is valid for any network geometry within the study area without the need to build travel time tables. We show that the computational demands scale efficiently with the number of differential times, making it ideal for large-N sensing technologies like Distributed Acoustic Sensing.
We have a new method for probabilistic earthquake hypocenter inversion: we use a physics-informed neural network as a forward model with Stein variational inference to rapidly approximate posterior @GeologyJon @kazizzad @muir_jack https://t.co/0lUK9OarR2
— Zachary Ross (@zross_) January 12, 2021
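The particle-based posterior approximation in the HypoSVI abstract can be illustrated with a basic Stein variational gradient descent update. The sketch below is one-dimensional and uses a unit-variance Gaussian as a stand-in target. The physics-informed neural network, the Eikonal forward model, and the actual hypocenter posterior are not modeled, and the kernel bandwidth, step size, and iteration count are assumptions.

```python
import math

def grad_log_p(x, mean=2.0):
    """Gradient of the log-density of a unit-variance Gaussian (stand-in posterior)."""
    return -(x - mean)

def svgd_step(particles, step=0.1, bandwidth=1.0):
    """One SVGD update with an RBF kernel: kernel-weighted attraction plus repulsion."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2.0 * bandwidth ** 2))
            grad_k = -(xj - xi) / bandwidth ** 2 * k  # derivative of k with respect to xj
            phi += k * grad_log_p(xj) + grad_k
        updated.append(xi + step * phi / n)
    return updated

particles = [-1.0, -0.5, 0.0, 0.5, 1.0]
for _ in range(500):
    particles = svgd_step(particles)

particle_mean = sum(particles) / len(particles)
particle_spread = max(particles) - min(particles)
```

The attraction term pulls particles toward high-probability regions while the kernel gradient term keeps them apart, so the set of particles covers the distribution rather than collapsing onto its mode.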
7. Learning to Segment Rigid Motions from Two Frames
Gengshan Yang, Deva Ramanan
Appearance-based detectors achieve remarkable performance on common scenes, but tend to fail in scenarios that lack training data. Geometric motion segmentation algorithms, however, generalize to novel scenes, but have yet to achieve comparable performance to appearance-based ones, due to noisy motion estimations and degenerate motion configurations. To combine the best of both worlds, we propose a modular network, whose architecture is motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field. It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations. Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel. The inferred rigid motions lead to a significant improvement for depth and scene flow estimation. At the time of submission, our method ranked 1st on the KITTI scene flow leaderboard, outperforming the best published method (scene flow error: 4.89% vs 6.31%).
Learning to Segment Rigid Motions from Two Frames
— AK (@ak92501) January 12, 2021
pdf: https://t.co/BRMBxsW3o8
abs: https://t.co/z6ZG4eJ7Ky pic.twitter.com/fc4HY3K12y
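The abstract above parameterizes each moving object by a 3D rigid transformation, that is, a rotation followed by a translation. A minimal sketch of applying such a transformation to a set of points is shown below. The rotation axis, angle, translation, and points are hypothetical, and nothing here reflects how the network estimates these quantities.

```python
import math

def rotation_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_rigid(points, rotation, translation):
    """Apply x' = R x + t to every 3D point."""
    moved = []
    for p in points:
        rotated = [sum(rotation[r][c] * p[c] for c in range(3)) for r in range(3)]
        moved.append([rotated[k] + translation[k] for k in range(3)])
    return moved

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

points = [[1.0, 0.0, 0.0], [0.0, 2.0, 1.0]]
moved = apply_rigid(points, rotation_z(math.pi / 2), [0.5, 0.0, -1.0])
```

Distances between points on the same object are unchanged by the transformation, which is the property that distinguishes rigid motion from general scene flow.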
8. The shifted ODE method for underdamped Langevin MCMC
James Foster, Terry Lyons, Harald Oberhauser
In this paper, we consider the underdamped Langevin diffusion (ULD) and propose a numerical approximation using its associated ordinary differential equation (ODE). When used as a Markov Chain Monte Carlo (MCMC) algorithm, we show that the ODE approximation achieves a $2$-Wasserstein error of $\varepsilon$ in $\mathcal{O}(d^{1/3}/\varepsilon^{2/3})$ steps under the standard smoothness and strong convexity assumptions on the target distribution. This matches the complexity of the randomized midpoint method proposed by Shen and Lee [NeurIPS 2019] which was shown to be order optimal by Cao, Lu and Wang. However, the main feature of the proposed numerical method is that it can utilize additional smoothness of the target log-density $f$. More concretely, we show that the ODE approximation achieves a $2$-Wasserstein error of $\varepsilon$ in $\mathcal{O}(d^{2/5}/\varepsilon^{2/5})$ and $\mathcal{O}(\sqrt{d}/\varepsilon^{1/3})$ steps when Lipschitz continuity is assumed for the Hessian and third derivative of $f$. By discretizing this ODE using a fourth order splitting method, we obtain a practical MCMC method that requires just three additional gradient evaluations in each step. In our experiment, where the target comes from a logistic regression, this method shows faster convergence compared to other unadjusted Langevin MCMC algorithms.
some interesting work at the junction of numerical analysis and sampling, reducing the solution of an SDE to an ODE, to which an ODE solver is then applied.https://t.co/955bvEV5ks
— Sam Power (@sam_power_825) January 12, 2021
`The shifted ODE method for underdamped Langevin MCMC'
- J. Foster, T. Lyons, H. Oberhauser
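For readers unfamiliar with the underdamped Langevin diffusion referred to in the abstract above, it is commonly written as the following pair of stochastic differential equations. The notation here (friction $\gamma$, scaling $u$, Brownian motion $W_t$) follows a widespread convention in the sampling literature and is an assumption; the paper's own parameterization may differ.

```latex
\begin{aligned}
  dx_t &= v_t \, dt, \\
  dv_t &= -\gamma \, v_t \, dt \;-\; u \, \nabla f(x_t) \, dt \;+\; \sqrt{2 \gamma u} \, dW_t .
\end{aligned}
```

Under suitable conditions on $f$, the stationary distribution of $(x_t, v_t)$ is proportional to $\exp\!\left(-f(x) - \lVert v \rVert^2 / (2u)\right)$, so the $x$-marginal is the target density proportional to $e^{-f(x)}$.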
9. Towards Real-World Blind Face Restoration with Generative Facial Prior
Xintao Wang, Yu Li, Honglun Zhang, Ying Shan
Blind face restoration usually relies on facial priors, such as facial geometry prior or reference prior, to restore realistic and faithful details. However, very low-quality inputs cannot offer accurate geometric prior while high-quality references are inaccessible, limiting the applicability in real-world scenarios. In this work, we propose GFP-GAN that leverages rich and diverse priors encapsulated in a pretrained face GAN for blind face restoration. This Generative Facial Prior (GFP) is incorporated into the face restoration process via novel channel-split spatial feature transform layers, which allow our method to achieve a good balance of realness and fidelity. Thanks to the powerful generative facial prior and delicate designs, our GFP-GAN could jointly restore facial details and enhance colors with just a single forward pass, while GAN inversion methods require expensive image-specific optimization at inference. Extensive experiments show that our method achieves superior performance to prior art on both synthetic and real-world datasets.
Towards Real-World Blind Face Restoration with Generative Facial Prior
— AK (@ak92501) January 12, 2021
pdf: https://t.co/99L95HKYuT
abs: https://t.co/dBOfhQNp87 pic.twitter.com/3UcFelhoQb
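The channel-split spatial feature transform named in the abstract above can be read as follows: part of the feature channels pass through unchanged, while the rest are modulated by a spatially varying scale and shift. The sketch below assumes an even channel split and flattened spatial positions. The function name, the split ratio, and all values are hypothetical, and in the real model the scale and shift would be predicted from the degraded input rather than given.

```python
def cs_sft(features, alpha, beta):
    """Channel-split modulation.

    features: list of channels, each a list of values over spatial positions.
    alpha, beta: per-position scale and shift for the modulated half of the channels.
    """
    split = len(features) // 2
    # First half: identity path, intended to keep the generative prior intact.
    identity_part = [channel[:] for channel in features[:split]]
    # Second half: element-wise affine modulation, intended to preserve input fidelity.
    modulated_part = [
        [a * v + b for v, a, b in zip(channel, a_channel, b_channel)]
        for channel, a_channel, b_channel in zip(features[split:], alpha, beta)
    ]
    return identity_part + modulated_part

features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
alpha = [[2.0, 2.0], [0.5, 0.5]]
beta = [[1.0, 0.0], [0.0, 1.0]]
output = cs_sft(features, alpha, beta)
```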
10. ArrowGAN : Learning to Generate Videos by Learning Arrow of Time
Kibeom Hong, Youngjung Uh, Hyeran Byun
Training GANs on videos is even more sophisticated than on images because videos have a distinguished dimension: time. While recent methods designed a dedicated architecture considering time, generated videos are still far from indistinguishable from real videos. In this paper, we introduce the ArrowGAN framework, where the discriminators learn to classify the arrow of time as an auxiliary task and the generators try to synthesize forward-running videos. We argue that the auxiliary task should be carefully chosen regarding the target domain. In addition, we explore categorical ArrowGAN with recent techniques in conditional image generation upon the ArrowGAN framework, achieving state-of-the-art performance on categorical video generation. Our extensive experiments validate the effectiveness of arrow of time as a self-supervisory task, and demonstrate that all our components of categorical ArrowGAN lead to improvements in video inception score and Frechet video distance on three datasets: Weizmann, UCFsports, and UCF-101.
ArrowGAN : Learning to Generate Videos by Learning Arrow of Time
— AK (@ak92501) January 12, 2021
pdf: https://t.co/zmY1K7V0D3
abs: https://t.co/85pE2u9XcL pic.twitter.com/S9j1n36y27
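The arrow-of-time auxiliary task in the abstract above needs no manual labels: a clip played forward and the same clip played backward form a labeled pair. A minimal sketch of building such pairs is given below. The label convention (1 for forward, 0 for reversed) and the placeholder frames are assumptions, and the GAN losses themselves are not shown.

```python
def arrow_of_time_examples(video):
    """Return (clip, label) pairs: label 1 for forward order, 0 for reversed order."""
    forward = list(video)
    backward = forward[::-1]
    return [(forward, 1), (backward, 0)]

# Placeholder frames; real inputs would be image tensors.
video = ["frame0", "frame1", "frame2"]
examples = arrow_of_time_examples(video)
```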
11. Technology Readiness Levels for Machine Learning Systems
Alexander Lavin, Ciarán M. Gilligan-Lee, Alessya Visnjic, Siddha Ganju, Dava Newman, Sujoy Ganguly, Danny Lange, Atılım Güneş Baydin, Amit Sharma, Adam Gibson, Yarin Gal, Eric P. Xing, Chris Mattmann, James Parr
The development and deployment of machine learning (ML) systems can be executed easily with modern tools, but the process is typically rushed and means-to-an-end. The lack of diligence can lead to technical debt, scope creep and misaligned objectives, model misuse and failures, and expensive consequences. Engineering systems, on the other hand, follow well-defined processes and testing standards to streamline development for high-quality, reliable results. The extreme is spacecraft systems, where mission critical measures and robustness are ingrained in the development process. Drawing on experience in both spacecraft engineering and ML (from research through product across domain areas), we have developed a proven systems engineering approach for machine learning development and deployment. Our “Machine Learning Technology Readiness Levels” (MLTRL) framework defines a principled process to ensure robust, reliable, and responsible systems while being streamlined for ML workflows, including key distinctions from traditional software engineering. Even more, MLTRL defines a lingua franca for people across teams and organizations to work collaboratively on artificial intelligence and machine learning technologies. Here we describe the framework and elucidate it with several real world use-cases of developing ML methods from basic research through productization and deployment, in areas such as medical diagnostics, consumer computer vision, satellite imagery, and particle physics.
Excited to share this massive collab across AI industry and academia, "Technology Readiness Levels for Machine Learning Systems": https://t.co/YGNjoq55j9
— Alexander Lavin (@theAlexLavin) January 12, 2021
Our MLTRL is an industry-hardened systems engineering
framework for robust, reliable, and responsible AI & ML...
1/n pic.twitter.com/roHuViBwAn
12. A Reinforcement Learning Based Encoder-Decoder Framework for Learning Stock Trading Rules
Mehran Taghian, Ahmad Asadi, Reza Safabakhsh
A wide variety of deep reinforcement learning (DRL) models have recently been proposed to learn profitable investment strategies. The rules learned by these models outperform previous strategies, especially in high-frequency trading environments. However, it has been shown that the quality of the features extracted from a long-term sequence of raw instrument prices greatly affects the performance of the trading rules learned by these models. Employing a neural encoder-decoder structure to extract informative features from complex input time series has proved very effective in other popular tasks, such as neural machine translation and video captioning, in which the models face a similar problem. The encoder-decoder framework extracts highly informative features from a long sequence of prices while learning how to generate outputs based on the extracted features. In this paper, a novel end-to-end model based on the neural encoder-decoder framework combined with DRL is proposed to learn single-instrument trading strategies from a long sequence of raw prices of the instrument. The proposed model consists of an encoder, a neural structure responsible for learning informative features from the input sequence, and a decoder, a DRL model responsible for learning profitable strategies based on the features extracted by the encoder. The parameters of the encoder and the decoder structures are learned jointly, which enables the encoder to extract features fitted to the task of the decoder DRL. In addition, the effects of different structures for the encoder and various forms of the input sequences on the performance of the learned strategies are investigated. Experimental results showed that the proposed model outperforms other state-of-the-art models in highly dynamic environments.