Hot Papers 2021-04-07

1. ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement

Yuval Alaluf, Or Patashnik, Daniel Cohen-Or

retweets: 3074, favorites: 332 (04/08/2021 09:37:53)
links: abs | pdf
cs.CV

Recently, the power of unconditional image synthesis has significantly advanced through the use of Generative Adversarial Networks (GANs). The task of inverting an image into its corresponding latent code of the trained GAN is of utmost importance as it allows for the manipulation of real images, leveraging the rich semantics learned by the network. Recognizing the limitations of current inversion approaches, in this work we present a novel inversion scheme that extends current encoder-based inversion methods by introducing an iterative refinement mechanism. Instead of directly predicting the latent code of a given real image using a single pass, the encoder is tasked with predicting a residual with respect to the current estimate of the inverted latent code in a self-correcting manner. Our residual-based encoder, named ReStyle, attains improved accuracy compared to current state-of-the-art encoder-based methods with a negligible increase in inference time. We analyze the behavior of ReStyle to gain valuable insights into its iterative nature. We then evaluate the performance of our residual encoder and analyze its robustness compared to optimization-based inversion and state-of-the-art encoders.
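
The iterative refinement idea in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_generator` is a fixed linear map standing in for a pretrained GAN, and `toy_residual_encoder` is a hand-written stand-in for the learned encoder that deliberately recovers only half of the remaining error per pass. It is not the ReStyle implementation, only the update rule "new estimate = current estimate + predicted residual".

```python
from typing import Callable, List, Tuple

Vector = List[float]


def toy_generator(w: Vector) -> Vector:
    """Hypothetical stand-in for a pretrained generator: a fixed linear map."""
    return [2.0 * w[0], 3.0 * w[1]]


def toy_residual_encoder(target: Vector, current: Vector) -> Vector:
    """Hypothetical stand-in for the learned encoder.

    It sees the target image and the current reconstruction and returns a
    latent residual. It is deliberately imperfect: each pass recovers only
    half of the remaining error, so several passes are needed.
    """
    return [0.5 * (target[0] - current[0]) / 2.0,
            0.5 * (target[1] - current[1]) / 3.0]


def iterative_inversion(
    target: Vector,
    generator: Callable[[Vector], Vector],
    encoder: Callable[[Vector, Vector], Vector],
    w_init: Vector,
    steps: int = 5,
) -> Tuple[Vector, List[float]]:
    """Refine a latent estimate by repeatedly adding a predicted residual."""
    w = list(w_init)
    errors = []
    for _ in range(steps):
        recon = generator(w)
        # Record the worst per-pixel error before applying this update.
        errors.append(max(abs(t - r) for t, r in zip(target, recon)))
        delta = encoder(target, recon)
        w = [wi + di for wi, di in zip(w, delta)]
    return w, errors


target_image = toy_generator([1.0, -2.0])
w_est, errs = iterative_inversion(
    target_image, toy_generator, toy_residual_encoder, w_init=[0.0, 0.0]
)
print(w_est, errs)
```

In this toy setting the reconstruction error shrinks on every pass, which is the self-correcting behaviour the abstract describes. A single-pass encoder would correspond to `steps=1`.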

ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement
pdf: https://t.co/3dAepkclTV
abs: https://t.co/c1iJAj4JV5
project page: https://t.co/O1tBDNoUak
github: https://t.co/vn72cBCBdH
colab: https://t.co/1hNHzhcHdl pic.twitter.com/cqLKsdMXWB
— AK (@ak92501) April 7, 2021

2. gradSim: Differentiable simulation for system identification and visuomotor control

Krishna Murthy Jatavallabhula, Miles Macklin, Florian Golemo, Vikram Voleti, Linda Petrini, Martin Weiss, Breandan Considine, Jerome Parent-Levesque, Kevin Xie, Kenny Erleben, Liam Paull, Florian Shkurti, Derek Nowrouzezahrai, Sanja Fidler

retweets: 1122, favorites: 187 (04/08/2021 09:37:54)
links: abs | pdf
cs.CV | cs.AI | cs.LG | cs.RO

We consider the problem of estimating an object’s physical properties such as mass, friction, and elasticity directly from video sequences. Such a system identification problem is fundamentally ill-posed due to the loss of information during image formation. Current solutions require precise 3D labels which are labor-intensive to gather, and infeasible to create for many systems such as deformable solids or cloth. We present gradSim, a framework that overcomes the dependence on 3D supervision by leveraging differentiable multiphysics simulation and differentiable rendering to jointly model the evolution of scene dynamics and image formation. This novel combination enables backpropagation from pixels in a video sequence through to the underlying physical attributes that generated them. Moreover, our unified computation graph — spanning from the dynamics and through the rendering process — enables learning in challenging visuomotor control tasks, without relying on state-based (3D) supervision, while obtaining performance competitive to or better than techniques that rely on precise 3D labels.
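
The core idea, recovering a physical parameter by following gradients through a simulator, can be sketched in a few lines. This is a heavily simplified illustration under stated assumptions: the "simulator" is a closed-form expression for a body pushed from rest by a constant force, the gradient is derived by hand rather than by automatic differentiation, and the rendering stage is omitted entirely, so the loss is on position rather than pixels. None of the names below come from gradSim.

```python
from typing import Tuple


def simulate_position(mass: float, force: float, duration: float) -> float:
    """Toy simulator: distance travelled from rest under a constant force."""
    return 0.5 * (force / mass) * duration ** 2


def loss_and_grad(
    mass: float, force: float, duration: float, observed: float
) -> Tuple[float, float]:
    """Squared error against the observation and its derivative w.r.t. mass."""
    predicted = simulate_position(mass, force, duration)
    residual = predicted - observed
    # d(predicted)/d(mass) = -0.5 * force * duration^2 / mass^2
    d_pred_d_mass = -0.5 * force * duration ** 2 / mass ** 2
    return residual ** 2, 2.0 * residual * d_pred_d_mass


def estimate_mass(
    observed: float,
    force: float,
    duration: float,
    mass_init: float = 1.0,
    lr: float = 0.01,
    steps: int = 1000,
) -> float:
    """Plain gradient descent on the mass, starting from mass_init."""
    mass = mass_init
    for _ in range(steps):
        _, grad = loss_and_grad(mass, force, duration, observed)
        mass -= lr * grad
    return mass


# Hypothetical ground truth used only to create an observation.
observation = simulate_position(mass=2.0, force=10.0, duration=1.0)
print(estimate_mass(observation, force=10.0, duration=1.0))
```

The learning rate and initial guess were chosen for this toy problem. A real pipeline would backpropagate through many simulation steps and a differentiable renderer instead of using a hand-derived gradient.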

At @iclr_conf, we're excited to present 𝗴𝗿𝗮𝗱𝗦𝗶𝗺, a differentiable simulator 👇

gradSim enables visuomotor control and physical parameter estimation directly from images/videos https://t.co/gy5RsG70od

Co-led with @milesmacklin https://t.co/P3Z4kaNVyu pic.twitter.com/RuLaopPaQp
— Krishna Murthy (@_krishna_murthy) April 7, 2021

3. CodeTrans: Towards Cracking the Language of Silicone’s Code Through Self-Supervised Deep Learning and High Performance Computing

Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, Burkhard Rost

retweets: 988, favorites: 130 (04/08/2021 09:37:54)
links: abs | pdf
cs.SE | cs.AI | cs.CL | cs.LG | cs.PL

Currently, a growing number of mature natural language processing applications make people’s life more convenient. Such applications are built by source code - the language in software engineering. However, the applications for understanding source code language to ease the software engineering process are under-researched. Simultaneously, the transformer model, especially its combination with transfer learning, has been proven to be a powerful technique for natural language processing tasks. These breakthroughs point out a promising direction for process source code and crack software engineering tasks. This paper describes CodeTrans - an encoder-decoder transformer model for tasks in the software engineering domain, that explores the effectiveness of encoder-decoder transformer models for six software engineering tasks, including thirteen sub-tasks. Moreover, we have investigated the effect of different training strategies, including single-task learning, transfer learning, multi-task learning, and multi-task learning with fine-tuning. CodeTrans outperforms the state-of-the-art models on all the tasks. To expedite future works in the software engineering domain, we have published our pre-trained models of CodeTrans. https://github.com/agemagician/CodeTrans

CodeTrans: Towards Cracking the Language of Silicone’s Code Through Self-Supervised Deep Learning and High Performance Computing

Sets a new SotA on six software engineering tasks, including thirteen sub-tasks.

abs: https://t.co/xEbRtDYy3O
code: https://t.co/7NfNmXAEqf pic.twitter.com/NqVH8Jxw2b
— Aran Komatsuzaki (@arankomatsuzaki) April 7, 2021

4. Fourier Image Transformer

Tim-Oliver Buchholz, Florian Jug

retweets: 614, favorites: 112 (04/08/2021 09:37:54)
links: abs | pdf
cs.CV | cs.LG | eess.IV

Transformer architectures show spectacular performance on NLP tasks and have recently also been used for tasks such as image completion or image classification. Here we propose to use a sequential image representation, where each prefix of the complete sequence describes the whole image at reduced resolution. Using such Fourier Domain Encodings (FDEs), an auto-regressive image completion task is equivalent to predicting a higher resolution output given a low-resolution input. Additionally, we show that an encoder-decoder setup can be used to query arbitrary Fourier coefficients given a set of Fourier domain observations. We demonstrate the practicality of this approach in the context of computed tomography (CT) image reconstruction. In summary, we show that Fourier Image Transformer (FIT) can be used to solve relevant image analysis tasks in Fourier space, a domain inherently inaccessible to convolutional architectures.
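
The property that each prefix of the sequence describes the whole signal at reduced resolution can be checked on a small example. The sketch below is a 1D simplification using a naive DFT from the standard library. The paper works on 2D images, and the low-to-high ordering in `low_to_high_order` is an assumption made here for illustration, not necessarily the paper's unrolling.

```python
import cmath
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(coeffs: List[complex]) -> List[complex]:
    """Naive inverse DFT."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def low_to_high_order(n: int) -> List[int]:
    """Coefficient indices sorted from lowest to highest absolute frequency."""
    return sorted(range(n), key=lambda k: (min(k, n - k), k))


def reconstruct_from_prefix(x: List[float], prefix_len: int) -> List[complex]:
    """Keep only the first prefix_len coefficients of the ordered sequence."""
    coeffs = dft(x)
    kept = [0j] * len(x)
    for k in low_to_high_order(len(x))[:prefix_len]:
        kept[k] = coeffs[k]
    return idft(kept)


def prefix_errors(x: List[float]) -> List[float]:
    """Squared reconstruction error for every prefix length 1..n."""
    errors = []
    for prefix_len in range(1, len(x) + 1):
        approx = reconstruct_from_prefix(x, prefix_len)
        errors.append(sum(abs(a - b) ** 2 for a, b in zip(x, approx)))
    return errors


signal = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 0.0, 2.0]
print(prefix_errors(signal))
```

The shortest prefix holds only the zero-frequency coefficient and reconstructs the mean of the signal. Longer prefixes add detail, and the full sequence reproduces the signal up to floating-point error.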

The latest work by @tibuch_ now on arXiv. https://t.co/WTgmFv6EK7
We thought that Transformers on pixel sequences appear a bit misused, but on unrolled Fourier coefficients tasks such as superres or tomographic reconstructions seem quite natural.
What do you think? 🖼+🤖=🤩? pic.twitter.com/4aR7C8jNwS
— Florian Jug (@florianjug) April 7, 2021

5. What Will it Take to Fix Benchmarking in Natural Language Understanding?

Samuel R. Bowman, George E. Dahl

retweets: 506, favorites: 166 (04/08/2021 09:37:54)
links: abs | pdf
cs.CL

Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon IID benchmarks in favor of adversarially-constructed, out-of-distribution test sets ensures that current models will perform poorly, but ultimately only obscures the abilities that we want our benchmarks to measure. In this position paper, we lay out four criteria that we argue NLU benchmarks should meet. We argue most current benchmarks fail at these criteria, and that adversarial data collection does not meaningfully address the causes of these failures. Instead, restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets, the reliability with which they are annotated, their size, and the ways they handle social bias.

🚨 New cranky position paper alert! 🚨
(#naacl2021, mini-thread, https://t.co/wRvhZ70bbV) pic.twitter.com/JQwSAoeb4E
— Prof. Sam Bowman (@sleepinyourhat) April 7, 2021

6. GPU Domain Specialization via Composable On-Package Architecture

Yaosheng Fu, Evgeny Bolotin, Niladrish Chatterjee, David Nellans, Stephen W. Keckler

retweets: 256, favorites: 69 (04/08/2021 09:37:54)
links: abs | pdf
cs.AR | cs.DC | cs.LG

As GPUs scale their low precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities. We demonstrate that converged GPU design trying to address diverging architectural requirements between FP32 (or larger) based HPC and FP16 (or smaller) based DL workloads results in sub-optimal configuration for either of the application domains. We argue that a Composable On-PAckage GPU (COPA-GPU) architecture to provide domain-specialized GPU products is the most practical solution to these diverging requirements. A COPA-GPU leverages multi-chip-module disaggregation to support maximal design reuse, along with memory system specialization per application domain. We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth, 32x larger on-package cache, 2.3x higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs. This work explores the microarchitectural design necessary to enable composable GPUs and evaluates the benefits composability can provide to HPC, DL training, and DL inference. We show that when compared to a converged GPU design, a DL-optimized COPA-GPU featuring a combination of 16x larger cache capacity and 1.6x higher DRAM bandwidth scales per-GPU training and inference performance by 31% and 35% respectively and reduces the number of GPU instances by 50% in scale-out training scenarios.

Nvidia researchers proposed COPA-GPU, a domain-specialized composable GPU architecture capable to provide high levels of GPU design reuse across the #HPC and #DeepLearning domains, while enabling specifically optimized products for each domain. https://t.co/V1SiMZRgGK pic.twitter.com/5sj2YCsKHH
— Underfox (@Underfox3) April 7, 2021

7. Deep Animation Video Interpolation in the Wild

Li Siyao, Shiyu Zhao, Weijiang Yu, Wenxiu Sun, Dimitris N. Metaxas, Chen Change Loy, Ziwei Liu

retweets: 144, favorites: 98 (04/08/2021 09:37:54)
links: abs | pdf
cs.CV

In the animation industry, cartoon videos are usually produced at low frame rate since hand drawing of such frames is costly and time-consuming. Therefore, it is desirable to develop computational models that can automatically interpolate the in-between animation frames. However, existing video interpolation methods fail to produce satisfying results on animation data. Compared to natural videos, animation videos possess two unique characteristics that make frame interpolation difficult: 1) cartoons comprise lines and smooth color pieces. The smooth areas lack textures and make it difficult to estimate accurate motions on animation videos. 2) cartoons express stories via exaggeration. Some of the motions are non-linear and extremely large. In this work, we formally define and study the animation video interpolation problem for the first time. To address the aforementioned challenges, we propose an effective framework, AnimeInterp, with two dedicated modules in a coarse-to-fine manner. Specifically, 1) Segment-Guided Matching resolves the “lack of textures” challenge by exploiting global matching among color pieces that are piece-wise coherent. 2) Recurrent Flow Refinement resolves the “non-linear and extremely large motion” challenge by recurrent predictions using a transformer-like architecture. To facilitate comprehensive training and evaluations, we build a large-scale animation triplet dataset, ATD-12K, which comprises 12,000 triplets with rich annotations. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art interpolation methods for animation videos. Notably, AnimeInterp shows favorable perceptual quality and robustness for animation scenarios in the wild. The proposed dataset and code are available at https://github.com/lisiyao21/AnimeInterp/.

Deep Animation Video Interpolation in the Wild
pdf: https://t.co/ZfH53tlU1g
abs: https://t.co/oTOLO0cxrY pic.twitter.com/Qv3olll3jf
— AK (@ak92501) April 7, 2021

8. Strumming to the Beat: Audio-Conditioned Contrastive Video Textures

Medhini Narasimhan, Shiry Ginosar, Andrew Owens, Alexei A. Efros, Trevor Darrell

retweets: 126, favorites: 81 (04/08/2021 09:37:55)
links: abs | pdf
cs.CV | cs.AI | cs.MM

We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning. We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order. This classic work, however, was constrained by its use of hand-designed distance metrics, limiting its use to simple, repetitive videos. We draw on recent techniques from self-supervised learning to learn this distance metric, allowing us to compare frames in a manner that scales to more challenging dynamics, and to condition on other data, such as audio. We learn representations for video frames and frame-to-frame transition probabilities by fitting a video-specific model trained using contrastive learning. To synthesize a texture, we randomly sample frames with high transition probabilities to generate diverse temporally smooth videos with novel sequences and transitions. The model naturally extends to an audio-conditioned setting without requiring any finetuning. Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
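
The synthesis step, sampling the next frame among those with high transition probability, can be sketched as follows. The transition matrix `example_probs` and the cutoff `min_prob` are hypothetical values chosen for illustration. In the paper the probabilities come from a learned, video-specific contrastive model, which is not reproduced here.

```python
import random
from typing import List, Sequence


def sample_texture(
    transition_probs: Sequence[Sequence[float]],
    start: int,
    length: int,
    min_prob: float = 0.1,
    seed: int = 0,
) -> List[int]:
    """Sample a sequence of frame indices from frame-to-frame probabilities."""
    rng = random.Random(seed)
    frames = [start]
    for _ in range(length - 1):
        row = transition_probs[frames[-1]]
        # Keep only transitions likely enough to look temporally smooth.
        candidates = [j for j, p in enumerate(row) if p >= min_prob]
        if not candidates:
            raise ValueError("no transition clears min_prob for this frame")
        weights = [row[j] for j in candidates]
        frames.append(rng.choices(candidates, weights=weights, k=1)[0])
    return frames


# Hypothetical 4-frame transition matrix; each row sums to 1.
example_probs = [
    [0.00, 0.90, 0.10, 0.00],
    [0.05, 0.00, 0.90, 0.05],
    [0.00, 0.00, 0.10, 0.90],
    [0.80, 0.20, 0.00, 0.00],
]
print(sample_texture(example_probs, start=0, length=12))
```

Because the next frame is drawn at random rather than always taking the most likely one, repeated runs with different seeds give different but still plausible orderings, and the sequence can be extended indefinitely.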

Excited to share our work, “Strumming to the Beat: Audio-Conditioned Contrastive Video Textures” with Shiry Ginosar, @andrewhowens , Alyosha Efros, and @trevordarrell

Website: https://t.co/o6kFWOWztG
Talk: https://t.co/zIos8hVWxs
Paper: https://t.co/TVgI5qVBFE
— Medhini Narasimhan (@medhini_n) April 7, 2021

Jennifer J. Sun, Tomomi Karigo, Dipam Chakraborty, Sharada P. Mohanty, David J. Anderson, Pietro Perona, Yisong Yue, Ann Kennedy
9. The Multi-Agent Behavior Dataset: Mouse Dyadic Social Interactions


retweets: 132, favorites: 39 (04/08/2021 09:37:55)
links: abs | pdf
cs.LG | cs.CV

Multi-agent behavior modeling aims to understand the interactions that occur between agents. We present a multi-agent dataset from behavioral neuroscience, the Caltech Mouse Social Interactions (CalMS21) Dataset. Our dataset consists of the social interactions between freely behaving mice in a standard resident-intruder assay. The CalMS21 dataset is part of the Multi-Agent Behavior Challenge 2021 and for our next step, we aim to incorporate datasets from other domains studying multi-agent behavior. To help accelerate behavioral studies, the CalMS21 dataset provides a benchmark to evaluate the performance of automated behavior classification methods in three settings: (1) for training on large behavioral datasets all annotated by a single annotator, (2) for style transfer to learn inter-annotator differences in behavior definitions, and (3) for learning of new behaviors of interest given limited training data. The dataset consists of 6 million frames of unlabelled tracked poses of interacting mice, as well as over 1 million frames with tracked poses and corresponding frame-level behavior annotations. The challenge of our dataset is to be able to classify behaviors accurately using both labelled and unlabelled tracking data, as well as being able to generalize to new annotators and behaviors.

We released a large dataset (>6 million frames) of mouse social interactions for benchmarking behavior models - CalMS21🐭!
Part of the Multi-Agent Behavior Challenge @#CVPR2021.

Paper: https://t.co/nkzmKYRUim

Thanks to David Anderson Lab @Caltech for providing the data! pic.twitter.com/DwMNwTuzO1
— Jennifer J. Sun (@JenJSun) April 7, 2021

10. Comparing Transfer and Meta Learning Approaches on a Unified Few-Shot Classification Benchmark

Vincent Dumoulin, Neil Houlsby, Utku Evci, Xiaohua Zhai, Ross Goroshin, Sylvain Gelly, Hugo Larochelle

retweets: 76, favorites: 75 (04/08/2021 09:37:55)
links: abs | pdf
cs.LG | cs.CV

Meta and transfer learning are two successful families of approaches to few-shot learning. Despite highly related goals, state-of-the-art advances in each family are measured largely in isolation of each other. As a result of diverging evaluation norms, a direct or thorough comparison of different approaches is challenging. To bridge this gap, we perform a cross-family study of the best transfer and meta learners on both a large-scale meta-learning benchmark (Meta-Dataset, MD), and a transfer learning benchmark (Visual Task Adaptation Benchmark, VTAB). We find that, on average, large-scale transfer methods (Big Transfer, BiT) outperform competing approaches on MD, even when trained only on ImageNet. In contrast, meta-learning approaches struggle to compete on VTAB when trained and validated on MD. However, BiT is not without limitations, and pushing for scale does not improve performance on highly out-of-distribution MD tasks. In performing this study, we reveal a number of discrepancies in evaluation norms and study some of these in light of the performance gap. We hope that this work facilitates sharing of insights from each community, and accelerates progress on few-shot learning.

Comparing Transfer and Meta Learning Approaches on a Unified Few-Shot Classification Benchmark

Finds that large-scale transfer methods (e.g. BiT) often outperforms competing approaches like meta-learning on large-scale meta-learning benchmark. https://t.co/andt264D3R pic.twitter.com/6e5lDwcVzM
— Aran Komatsuzaki (@arankomatsuzaki) April 7, 2021

11. SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network

William Chan, Daniel Park, Chris Lee, Yu Zhang, Quoc Le, Mohammad Norouzi

retweets: 72, favorites: 54 (04/08/2021 09:37:55)
links: abs | pdf
cs.CL | cs.LG

We present SpeechStew, a speech recognition model that is trained on a combination of various publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal. SpeechStew simply mixes all of these datasets together, without any special re-weighting or re-balancing of the datasets. SpeechStew achieves SoTA or near SoTA results across a variety of tasks, without the use of an external language model. Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% on WSJ, which significantly outperforms prior work with strong external language models. We also demonstrate that SpeechStew learns powerful transfer learning representations. We fine-tune SpeechStew on a noisy low resource speech dataset, CHiME-6. We achieve 38.9% WER without a language model, which compares to 38.6% WER to a strong HMM baseline with a language model.
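
Mixing "without any special re-weighting or re-balancing" amounts to pooling every utterance and shuffling, so each corpus is sampled in proportion to its size. The sketch below shows this under toy assumptions: the corpus sizes and utterance strings are placeholders made up for illustration, not the real dataset statistics.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def mix_datasets(
    datasets: Dict[str, List[str]], seed: int = 0
) -> List[Tuple[str, str]]:
    """Pool all utterances into one list and shuffle; no per-corpus weighting."""
    mixed = [(name, utt) for name, utts in datasets.items() for utt in utts]
    random.Random(seed).shuffle(mixed)
    return mixed


# Placeholder corpora with made-up sizes (not the real dataset sizes).
toy_corpora = {
    "AMI": [f"ami-utt-{i}" for i in range(3)],
    "LibriSpeech": [f"libri-utt-{i}" for i in range(10)],
    "WSJ": [f"wsj-utt-{i}" for i in range(2)],
}

training_stream = mix_datasets(toy_corpora)
print(Counter(name for name, _ in training_stream))
```

The design consequence is that large corpora dominate each batch. The abstract reports that this simple choice still gives strong results on the smaller corpora.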

SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
pdf: https://t.co/qbnt35OnAH
abs: https://t.co/0aclGNyrIz

"SpeechStew achieves SoTA or near SoTA results across a variety of tasks, without the use of an external language model." pic.twitter.com/o7ZIPGUHZ7
— AK (@ak92501) April 7, 2021

12. Localizing Visual Sounds the Hard Way

Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

retweets: 42, favorites: 46 (04/08/2021 09:37:55)
links: abs | pdf
cs.CV | eess.AS | eess.IV

The objective of this work is to localize sound sources that are visible in a video without using manual annotations. Our key technical contribution is to show that, by training the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound, we can significantly boost the localization performance. We do so elegantly by introducing a mechanism to mine hard samples and add them to a contrastive learning formulation automatically. We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset. Furthermore, we introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently-introduced VGG-Sound dataset, where the sound sources visible in each video clip are explicitly marked with bounding box annotations. This dataset is 20 times larger than analogous existing ones, contains 5K videos spanning over 200 categories, and, differently from Flickr SoundNet, is video-based. On VGG-SS, we also show that our algorithm achieves state-of-the-art performance against several baselines.

Localizing Visual Sounds the Hard Way
pdf: https://t.co/MfACK3LagP
abs: https://t.co/52DPDTsFXB
project page: https://t.co/tuCYTM2lYf pic.twitter.com/ZesHVhAlvq
— AK (@ak92501) April 7, 2021

13. MirrorNeRF: One-shot Neural Portrait RadianceField from Multi-mirror Catadioptric Imaging

Ziyu Wang, Liao Wang, Fuqiang Zhao, Minye Wu, Lan Xu, Jingyi Yu

retweets: 42, favorites: 20 (04/08/2021 09:37:55)
links: abs | pdf
cs.CV

Photo-realistic neural reconstruction and rendering of the human portrait are critical for numerous VR/AR applications. Still, existing solutions inherently rely on multi-view capture settings, and the one-shot solution to get rid of the tedious multi-view synchronization and calibration remains extremely challenging. In this paper, we propose MirrorNeRF - a one-shot neural portrait free-viewpoint rendering approach using a catadioptric imaging system with multiple sphere mirrors and a single high-resolution digital camera, which is the first to combine neural radiance field with catadioptric imaging so as to enable one-shot photo-realistic human portrait reconstruction and rendering, in a low-cost and casual capture setting. More specifically, we propose a light-weight catadioptric system design with a sphere mirror array to enable diverse ray sampling in the continuous 3D space as well as an effective online calibration for the camera and the mirror array. Our catadioptric imaging system can be easily deployed with a low budget and the casual capture ability for convenient daily usages. We introduce a novel neural warping radiance field representation to learn a continuous displacement field that implicitly compensates for the misalignment due to our flexible system setting. We further propose a density regularization scheme to leverage the inherent geometry information from the catadioptric data in a self-supervision manner, which not only improves the training efficiency but also provides more effective density supervision for higher rendering quality. Extensive experiments demonstrate the effectiveness and robustness of our scheme to achieve one-shot photo-realistic and high-quality appearance free-viewpoint rendering for human portrait scenes.

MirrorNeRF: One-shot Neural Portrait RadianceField from Multi-mirror Catadioptric Imaging
pdf: https://t.co/X9Efzczg4V
abs: https://t.co/TeKeTbyP50 pic.twitter.com/YV3wB0GTGj
— AK (@ak92501) April 7, 2021

14. Content-Aware GAN Compression

Yuchen Liu, Zhixin Shu, Yijun Li, Zhe Lin, Federico Perazzi, S.Y. Kung

retweets: 20, favorites: 32 (04/08/2021 09:37:55)
links: abs | pdf
cs.CV

Generative adversarial networks (GANs), e.g., StyleGAN2, play a vital role in various image generation and synthesis tasks, yet their notoriously high computational cost hinders their efficient deployment on edge devices. Directly applying generic compression approaches yields poor results on GANs, which motivates a number of recent GAN compression works. While prior works mainly accelerate conditional GANs, e.g., pix2pix and CycleGAN, compressing state-of-the-art unconditional GANs has rarely been explored and is more challenging. In this paper, we propose novel approaches for unconditional GAN compression. We first introduce effective channel pruning and knowledge distillation schemes specialized for unconditional GANs. We then propose a novel content-aware method to guide the processes of both pruning and distillation. With content-awareness, we can effectively prune channels that are unimportant to the contents of interest, e.g., human faces, and focus our distillation on these regions, which significantly enhances the distillation quality. On StyleGAN2 and SN-GAN, we achieve a substantial improvement over the state-of-the-art compression method. Notably, we reduce the FLOPs of StyleGAN2 by 11x with visually negligible image quality loss compared to the full-size model. More interestingly, when applied to various image manipulation tasks, our compressed model forms a smoother and better disentangled latent manifold, making it more effective for image editing.
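
Content-aware channel pruning can be illustrated with a small ranking example. The abstract does not specify the importance metric, so the score used here (absolute activation summed inside a content mask) is an assumption chosen for simplicity, and `masked_importance` and `select_channels` are hypothetical names rather than the authors' method.

```python
from typing import List

Grid = List[List[float]]


def masked_importance(activation: Grid, mask: Grid) -> float:
    """Sum of absolute activations inside the content-of-interest mask."""
    return sum(
        abs(a) * m
        for row_a, row_m in zip(activation, mask)
        for a, m in zip(row_a, row_m)
    )


def select_channels(activations: List[Grid], mask: Grid, keep: int) -> List[int]:
    """Return the indices of the `keep` channels that matter most to the content."""
    scores = [masked_importance(act, mask) for act in activations]
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return sorted(ranked[:keep])


# Toy 2x2 feature maps; the left column is the content region (e.g. a face).
content_mask = [[1.0, 0.0], [1.0, 0.0]]
channel_maps = [
    [[5.0, 0.0], [4.0, 0.0]],  # active on the content
    [[0.0, 9.0], [0.0, 9.0]],  # active only on the background
    [[1.0, 1.0], [1.0, 1.0]],  # weakly active everywhere
]
print(select_channels(channel_maps, content_mask, keep=2))
```

The background-only channel has the largest raw activations but scores zero inside the mask, so it is the one pruned. A content-agnostic magnitude criterion would have kept it.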

Content-Aware GAN Compression
pdf: https://t.co/mFUmgUWWiN
abs: https://t.co/ErbFLOBQcv

"Notably, we reduce the FLOPs of StyleGAN2 by 11x with visually negligible image quality loss compared to the full-size model." pic.twitter.com/KewyaAdDOL
— AK (@ak92501) April 7, 2021

15. Variational Transformer Networks for Layout Generation

Diego Martin Arroyo, Janis Postels, Federico Tombari

retweets: 12, favorites: 40 (04/08/2021 09:37:56)
links: abs | pdf
cs.CV | cs.LG

Generative models able to synthesize layouts of different kinds (e.g. documents, user interfaces or furniture arrangements) are a useful tool to aid design processes and as a first step in the generation of synthetic data, among other tasks. We exploit the properties of self-attention layers to capture high level relationships between elements in a layout, and use these as the building blocks of the well-known Variational Autoencoder (VAE) formulation. Our proposed Variational Transformer Network (VTN) is capable of learning margins, alignments and other global design rules without explicit supervision. Layouts sampled from our model have a high degree of resemblance to the training data, while demonstrating appealing diversity. In an extensive evaluation on publicly available benchmarks for different layout types VTNs achieve state-of-the-art diversity and perceptual quality. Additionally, we show the capabilities of this method as part of a document layout detection pipeline.

Variational Transformer Networks for Layout Generation
pdf: https://t.co/QAGZPcmMpF
abs: https://t.co/XSRTKtjmGU pic.twitter.com/0zAYdqZKG7
— AK (@ak92501) April 7, 2021

16. MuSLCAT: Multi-Scale Multi-Level Convolutional Attention Transformer for Discriminative Music Modeling on Raw Waveforms

Kai Middlebrook, Shyam Sudhakaran, David Guy Brizan

retweets: 30, favorites: 20 (04/08/2021 09:37:56)
links: abs | pdf
cs.SD | cs.LG | eess.AS

In this work, we aim to improve the expressive capacity of waveform-based discriminative music networks by modeling both sequential (temporal) and hierarchical information in an efficient end-to-end architecture. We present MuSLCAT, or Multi-scale and Multi-level Convolutional Attention Transformer, a novel architecture for learning robust representations of complex music tags directly from raw waveform recordings. We also introduce a lightweight variant of MuSLCAT called MuSLCAN, short for Multi-scale and Multi-level Convolutional Attention Network. Both MuSLCAT and MuSLCAN model features from multiple scales and levels by integrating a frontend-backend architecture. The frontend targets different frequency ranges while modeling long-range dependencies and multi-level interactions by using two convolutional attention networks with attention-augmented convolution (AAC) blocks. The backend dynamically recalibrates multi-scale and level features extracted from the frontend by incorporating self-attention. The difference between MuSLCAT and MuSLCAN is their backend components. MuSLCAT’s backend is a modified version of BERT. While MuSLCAN’s is a simple AAC block. We validate the proposed MuSLCAT and MuSLCAN architectures by comparing them to state-of-the-art networks on four benchmark datasets for music tagging and genre recognition. Our experiments show that MuSLCAT and MuSLCAN consistently yield competitive results when compared to state-of-the-art waveform-based models yet require considerably fewer parameters.

MuSLCAT: Multi-Scale Multi-Level Convolutional Attention Transformer for Discriminative Music Modeling on Raw Waveforms
pdf: https://t.co/UtZIk2L5zn
abs: https://t.co/AzsNWv3kVq pic.twitter.com/RRpuVV1EmY
— AK (@ak92501) April 7, 2021

Published 8 Apr 2021

Tatsuya Shirakawa, ML Lead at Beatrust (https://beatrust.com)