Hot Papers 2021-08-04

1. Accountability and Forensics in Blockchains: XDC Consensus Engine DPoS 2.0

Gerui Wang, Jerome Wang, Liam Lai, Fisher Yu

  • retweets: 2905, favorites: 313 (08/05/2021 08:26:21)
  • links: abs | pdf
  • cs.CR

This document introduces XinFin DPoS 2.0, the proposed next-generation decentralized consensus engine for the XinFin XDC Network. Built upon the most advanced BFT consensus protocol, this upgrade will empower the XDC Network with military-grade security and performance while consuming extremely low resources, and will be fully backwards-compatible in terms of APIs. It will also pave the way for the future evolution of the XDC Network. The core invention is the holistic integration of accountability and forensics in blockchains: the ability to identify malicious actors with cryptographic integrity directly from the blockchain records, incorporating the latest peer-reviewed academic research with state-of-the-art engineering designs and implementation plans.

2. SphereFace2: Binary Classification is All You Need for Deep Face Recognition

Yandong Wen, Weiyang Liu, Adrian Weller, Bhiksha Raj, Rita Singh

State-of-the-art deep face recognition methods are mostly trained with a softmax-based multi-class classification framework. Despite being popular and effective, these methods still have a few shortcomings that limit empirical performance. In this paper, we first identify the discrepancy between training and evaluation in the existing multi-class classification framework and then discuss the potential limitations caused by the “competitive” nature of softmax normalization. Motivated by these limitations, we propose a novel binary classification training framework, termed SphereFace2. In contrast to existing methods, SphereFace2 circumvents the softmax normalization, as well as the corresponding closed-set assumption. This effectively bridges the gap between training and evaluation, enabling the representations to be improved individually by each binary classification task. Besides designing a specific well-performing loss function, we summarize a few general principles for this “one-vs-all” binary classification framework so that it can outperform current competitive methods. We conduct comprehensive experiments on popular benchmarks to demonstrate that SphereFace2 can consistently outperform current state-of-the-art deep face recognition methods.
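
As a rough illustration of the "one-vs-all" idea, the sketch below implements a binary-margin head in PyTorch in which every identity gets its own sigmoid-style decision on the cosine similarity to its class weight, with no softmax taken across classes. The scale `s`, margin `m`, and positive/negative weighting `lam` are assumed hyperparameters for illustration, not the exact SphereFace2 loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneVsAllBinaryMarginLoss(nn.Module):
    """Illustrative one-vs-all binary head: each class is a separate binary
    problem on the cosine similarity between the embedding and that class's
    weight vector, so no normalization couples the classes together."""

    def __init__(self, embed_dim, num_classes, s=32.0, m=0.3, lam=0.7):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(1))
        self.s, self.m, self.lam = s, m, lam

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        pos_mask = F.one_hot(labels, cos.size(1)).bool()

        # Margin pushes the target class above its decision boundary and all
        # other classes below theirs; softplus(-z) is the binary logistic loss.
        pos_logit = self.s * (cos - self.m) + self.bias
        neg_logit = self.s * (cos + self.m) + self.bias

        pos_loss = F.softplus(-pos_logit[pos_mask]).mean()
        neg_loss = F.softplus(neg_logit[~pos_mask]).mean()
        return self.lam * pos_loss + (1.0 - self.lam) * neg_loss
```

Because each class is scored independently, each binary task can improve the representation on its own, which is the property the abstract highlights in contrast to the "competitive" softmax formulation.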

3. Toward Spatially Unbiased Generative Models

Jooyoung Choi, Jungbeom Lee, Yonghyun Jeong, Sungroh Yoon

  • retweets: 272, favorites: 85 (08/05/2021 08:26:22)
  • links: abs | pdf
  • cs.LG | cs.CV

Recent image generation models show remarkable generation performance. However, they mirror the strong location preference present in their training datasets, which we call spatial bias. As a result, generators render poor samples at unseen locations and scales. We argue that the generators rely on their implicit positional encoding to render spatial content. From our observations, the generator’s implicit positional encoding is translation-variant, making the generator spatially biased. To address this issue, we propose injecting explicit positional encoding at each scale of the generator. By learning a spatially unbiased generator, we facilitate the robust use of generators in multiple tasks, such as GAN inversion, multi-scale generation, and generation at arbitrary sizes and aspect ratios. Furthermore, we show that our method can also be applied to denoising diffusion probabilistic models.
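
The sketch below shows one way such an explicit positional encoding could be injected at a given generator scale: a fixed sinusoidal grid over absolute pixel coordinates is projected to the feature dimension and added to the feature map. The 1×1 projection and the number of frequency bands are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ExplicitPositionalEncoding(nn.Module):
    """Adds a sinusoidal encoding of absolute (x, y) coordinates to a
    generator feature map, making spatial position explicit at this scale."""

    def __init__(self, channels, num_freqs=8):
        super().__init__()
        self.num_freqs = num_freqs
        # 2 axes (x, y) * 2 functions (sin, cos) * num_freqs bands.
        self.proj = nn.Conv2d(4 * num_freqs, channels, kernel_size=1)

    def forward(self, feat, y0=0.0, x0=0.0):
        b, _, h, w = feat.shape
        # Absolute offsets (y0, x0) let the same module describe shifted
        # crops, which is what makes the encoding translation-aware.
        ys = torch.arange(h, device=feat.device).float() + y0
        xs = torch.arange(w, device=feat.device).float() + x0
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        bands = []
        for i in range(self.num_freqs):
            freq = 2.0 ** i
            for g in (grid_y, grid_x):
                bands += [torch.sin(g / freq), torch.cos(g / freq)]
        enc = torch.stack(bands, dim=0).unsqueeze(0).repeat(b, 1, 1, 1)
        return feat + self.proj(enc)
```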

4. A Benchmarking Initiative for Audio-Domain Music Generation Using the Freesound Loop Dataset

Tun-Min Hung, Bo-Yu Chen, Yen-Tung Yeh, Yi-Hsuan Yang

This paper proposes a new benchmark task for generating musical passages in the audio domain by using the drum loops from the FreeSound Loop Dataset, which are publicly re-distributable. Moreover, we use a larger collection of drum loops from Looperman to establish four model-based objective metrics for evaluation, releasing these metrics as a library for quantifying and facilitating the progress of musical audio generation. Under this evaluation framework, we benchmark the performance of three recent deep generative adversarial network (GAN) models we customize to generate loops, including StyleGAN, StyleGAN2, and UNAGAN. We also report a subjective evaluation of these models. Our evaluation shows that the one based on StyleGAN2 performs the best in both objective and subjective metrics.

5. Cycle-Consistent Inverse GAN for Text-to-Image Synthesis

Hao Wang, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao

  • retweets: 64, favorites: 46 (08/05/2021 08:26:22)
  • links: abs | pdf
  • cs.CV

This paper investigates an open research task of text-to-image synthesis for automatically generating or manipulating images from text descriptions. Prevailing methods mainly use the text as conditions for GAN generation, and train different models for the text-guided image generation and manipulation tasks. In this paper, we propose a novel unified framework of Cycle-consistent Inverse GAN (CI-GAN) for both text-to-image generation and text-guided image manipulation tasks. Specifically, we first train a GAN model without text input, aiming to generate images with high diversity and quality. Then we learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image, where we introduce the cycle-consistency training to learn more robust and consistent inverted latent codes. We further uncover the latent space semantics of the trained GAN model, by learning a similarity model between text representations and the latent codes. In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes. Extensive experiments on the Recipe1M and CUB datasets validate the efficacy of our proposed framework.
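
A minimal sketch of the text-guided optimization stage is given below, assuming a pretrained `generator`, a learned `similarity_model` between text embeddings and latent codes, and an inverted code `w_init` from the inversion model; these names and the loss weights are placeholders, not the paper's exact components.

```python
import torch
import torch.nn.functional as F

def text_guided_latent_optimization(generator, similarity_model, w_init,
                                    text_emb, steps=200, lr=0.05, reg=0.1):
    """Nudge an inverted latent code so that the text-latent similarity model
    scores it higher for the target caption, while staying close to the
    initial code so content from the source image is preserved."""
    w = w_init.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        # Higher similarity between latent code and text embedding should
        # correspond to images with the described semantic attributes.
        sim = similarity_model(text_emb, w)
        loss = -sim.mean() + reg * F.mse_loss(w, w_init)
        loss.backward()
        opt.step()

    with torch.no_grad():
        return generator(w), w.detach()
```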

6. Domain Generalization via Gradient Surgery

Lucas Mansilla, Rodrigo Echeveste, Diego H. Milone, Enzo Ferrante

In real-life applications, machine learning models often face scenarios where there is a change in data distribution between training and test domains. When the aim is to make predictions on distributions different from those seen at training, we face a domain generalization problem. Methods to address this issue learn a model using data from multiple source domains, and then apply this model to the unseen target domain. Our hypothesis is that when training with multiple domains, conflicting gradients within each mini-batch contain information specific to the individual domains which is irrelevant to the others, including the test domain. If left untouched, such disagreement may degrade generalization performance. In this work, we characterize the conflicting gradients emerging in domain shift scenarios and devise novel gradient agreement strategies based on gradient surgery to alleviate their effect. We validate our approach on image classification tasks with three multi-domain datasets, showing the value of the proposed agreement strategy in enhancing the generalization capability of deep learning models in domain shift scenarios.
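
A PCGrad-style projection is one concrete form such gradient surgery can take; the sketch below de-conflicts per-domain gradients before averaging them. The paper studies several agreement strategies, so treat this as one illustrative variant (gradients are assumed to be flattened into 1-D vectors).

```python
import random
import torch

def agreement_gradient(domain_grads):
    """Given one flattened gradient per source domain, remove pairwise
    conflicting components (negative dot products) before averaging."""
    surgered = [g.clone() for g in domain_grads]
    for i, g_i in enumerate(surgered):
        others = [j for j in range(len(domain_grads)) if j != i]
        random.shuffle(others)
        for j in others:
            g_j = domain_grads[j]
            dot = torch.dot(g_i, g_j)
            if dot < 0:
                # Project away the component of g_i that conflicts with g_j.
                g_i -= dot / (g_j.norm() ** 2 + 1e-12) * g_j
    # Average the de-conflicted domain gradients for the parameter update.
    return torch.stack(surgered).mean(dim=0)
```

In a training loop, one would compute a flattened gradient per source domain in the mini-batch, pass the list to `agreement_gradient`, and apply the result with any optimizer.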

7. Where do Models go Wrong? Parameter-Space Saliency Maps for Explainability

Roman Levin, Manli Shu, Eitan Borgnia, Furong Huang, Micah Goldblum, Tom Goldstein

  • retweets: 51, favorites: 40 (08/05/2021 08:26:22)
  • links: abs | pdf
  • cs.CV | cs.LG

Conventional saliency maps highlight input features to which neural network predictions are highly sensitive. We take a different approach to saliency, in which we identify and analyze the network parameters, rather than inputs, which are responsible for erroneous decisions. We find that samples which cause similar parameters to malfunction are semantically similar. We also show that pruning the most salient parameters for a wrongly classified sample often improves model behavior. Furthermore, fine-tuning a small number of the most salient parameters on a single sample results in error correction on other samples that are misclassified for similar reasons. Based on our parameter saliency method, we also introduce an input-space saliency technique that reveals how image features cause specific network components to malfunction. Further, we rigorously validate the meaningfulness of our saliency maps on both the dataset and case-study levels.
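
As a rough sketch of parameter-space saliency, the snippet below scores each parameter by the magnitude of its loss gradient on a single (mis)classified sample, optionally standardized against gradient statistics precomputed on a reference set; the per-parameter normalization here is an assumption, whereas the paper aggregates saliency at the filter level.

```python
import torch
import torch.nn.functional as F

def parameter_saliency(model, x, y, reference_stats=None, eps=1e-8):
    """Return a dict mapping parameter names to saliency scores for one
    sample. `reference_stats[name]` is an optional (mean, std) pair of
    gradient-magnitude statistics used to measure how unusual this
    sample's gradients are."""
    model.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()

    saliency = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        s = p.grad.abs()
        if reference_stats is not None:
            mean, std = reference_stats[name]
            s = (s - mean) / (std + eps)  # standardized "abnormality" score
        saliency[name] = s
    return saliency
```

Ranking parameters by these scores then suggests which ones to prune or fine-tune for a given failure case, in the spirit of the experiments described above.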

8. Large-Scale Differentially Private BERT

Rohan Anil, Badih Ghazi, Vineet Gupta, Ravi Kumar, Pasin Manurangsi

In this work, we study the large-scale pretraining of BERT-Large with differentially private SGD (DP-SGD). We show that combined with a careful implementation, scaling up the batch size to millions (i.e., mega-batches) improves the utility of the DP-SGD step for BERT; we also enhance its efficiency by using an increasing batch size schedule. Our implementation builds on the recent work of [SVK20], who demonstrated that the overhead of a DP-SGD step is minimized with effective use of JAX [BFH+18, FJL18] primitives in conjunction with the XLA compiler [XLA17]. Our implementation achieves a masked language model accuracy of 60.5% at a batch size of 2M, for ε = 5.36. To put this number in perspective, non-private BERT models achieve an accuracy of ∼70%.
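
The core DP-SGD mechanics the abstract relies on, per-example clipping plus Gaussian noise over a (mega-)batch, look roughly like the sketch below. The paper's implementation uses JAX with the XLA compiler; this PyTorch-style version only shows the mechanism, and the clipping norm and noise multiplier are illustrative values.

```python
import torch

def dp_sgd_step(params, per_example_grads, lr, clip_norm=1.0,
                noise_multiplier=1.0):
    """One DP-SGD update on a flattened parameter vector.
    `per_example_grads` is a list of flattened per-example gradients."""
    batch_size = len(per_example_grads)
    clipped_sum = torch.zeros_like(params)
    for g in per_example_grads:
        # Per-example clipping bounds each example's influence (sensitivity).
        scale = torch.clamp(clip_norm / (g.norm() + 1e-12), max=1.0)
        clipped_sum += g * scale
    # Noise scale is tied to the clipping norm; with mega-batches the same
    # noise is shared across many more examples, which improves utility.
    noise = torch.randn_like(params) * noise_multiplier * clip_norm
    noisy_mean = (clipped_sum + noise) / batch_size
    return params - lr * noisy_mean
```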

9. EMOPIA: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation

Hsiao-Tzu Hung, Joann Ching, Seungheon Doh, Nabin Kim, Juhan Nam, Yi-Hsuan Yang

While there are many music datasets with emotion labels in the literature, they cannot be used for research on symbolic-domain music analysis or generation, as they usually contain audio files only. In this paper, we present the EMOPIA (pronounced 'yee-mò-pi-uh') dataset, a shared multi-modal (audio and MIDI) database focusing on perceived emotion in pop piano music, to facilitate research on various tasks related to music emotion. The dataset contains 1,087 music clips from 387 songs and clip-level emotion labels annotated by four dedicated annotators. Since the clips are not restricted to one clip per song, they can also be used for song-level analysis. We present the methodology for building the dataset, covering the song list curation, clip selection, and emotion annotation processes. Moreover, we prototype use cases on clip-level music emotion classification and emotion-based symbolic music generation by training and evaluating corresponding models using the dataset. The results demonstrate the potential of EMOPIA for future exploration of piano emotion-related MIR tasks.

10. Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, Xing Sun

  • retweets: 32, favorites: 20 (08/05/2021 08:26:23)
  • links: abs | pdf
  • cs.CV

Vision transformers have recently received explosive popularity, but their huge computational cost is still a severe issue. Recent efficient designs for vision transformers follow two pipelines, namely, structural compression based on a local spatial prior and non-structural token pruning. However, rough token pruning breaks the spatial structure that is indispensable for the local spatial prior. To take advantage of both pipelines, this work seeks to dynamically identify uninformative tokens for each instance and trim down both the training and inference complexity while maintaining the complete spatial structure and information flow. To achieve this goal, we propose Evo-ViT, a self-motivated slow-fast token evolution method for vision transformers. Specifically, we conduct unstructured, instance-wise token selection by taking advantage of the global class attention that is unique to vision transformers. Then, we update the informative tokens and the placeholder tokens (which contribute little to the final prediction) along different computational paths, namely, slow-fast updating. Thanks to the slow-fast updating mechanism that guarantees information flow and spatial structure, Evo-ViT can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process. Experimental results demonstrate that the proposed method can significantly reduce the computational cost of vision transformers while maintaining comparable performance on image classification. For example, our method accelerates DeiT-S by over 60% in throughput while sacrificing only 0.4% top-1 accuracy.
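
The sketch below illustrates the instance-wise token selection: the class token's attention over the patch tokens ranks them, the top fraction is kept for the full ("slow") update, and the rest are summarized into a single placeholder token for the cheap ("fast") path. The shapes and keep ratio are assumptions for illustration, not the paper's exact module.

```python
import torch

def slow_fast_token_split(tokens, cls_attn, keep_ratio=0.5):
    """Split patch tokens into informative and placeholder parts.

    tokens:   (B, N, C) patch tokens (class token excluded)
    cls_attn: (B, N) attention of the class token over the patch tokens
    """
    b, n, c = tokens.shape
    k = max(1, int(n * keep_ratio))
    # Rank tokens by how much the class token attends to them.
    scores, idx = cls_attn.topk(n, dim=1)
    keep_idx, drop_idx = idx[:, :k], idx[:, k:]

    gather = lambda t, i: t.gather(1, i.unsqueeze(-1).expand(-1, -1, c))
    informative = gather(tokens, keep_idx)                    # slow path
    # Summarize the uninformative tokens into one attention-weighted
    # placeholder token that receives only the cheap update.
    drop_w = scores[:, k:].softmax(dim=1).unsqueeze(-1)
    placeholder = (gather(tokens, drop_idx) * drop_w).sum(1, keepdim=True)
    return informative, placeholder, keep_idx, drop_idx
```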

11. Consistent Depth of Moving Objects in Video

Zhoutong Zhang, Forrester Cole, Richard Tucker, William T. Freeman, Tali Dekel

  • retweets: 20, favorites: 31 (08/05/2021 08:26:23)
  • links: abs | pdf
  • cs.CV | cs.GR

We present a method to estimate depth of a dynamic scene, containing arbitrary moving objects, from an ordinary video captured with a moving camera. We seek a geometrically and temporally consistent solution to this underconstrained problem: the depth predictions of corresponding points across frames should induce plausible, smooth motion in 3D. We formulate this objective in a new test-time training framework where a depth-prediction CNN is trained in tandem with an auxiliary scene-flow prediction MLP over the entire input video. By recursively unrolling the scene-flow prediction MLP over varying time steps, we compute both short-range scene flow to impose local smooth motion priors directly in 3D, and long-range scene flow to impose multi-view consistency constraints with wide baselines. We demonstrate accurate and temporally coherent results on a variety of challenging videos containing diverse moving objects (pets, people, cars), as well as camera motion. Our depth maps give rise to a number of depth-and-motion aware video editing effects such as object and lighting insertion.
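
The recursive unrolling of the scene-flow MLP can be pictured with the toy sketch below: a small MLP (an illustrative stand-in for the auxiliary network, not the paper's architecture) predicts per-point 3D motion to the next frame, and chaining it over several steps yields the long-range correspondences used for wide-baseline consistency terms.

```python
import torch
import torch.nn as nn

class SceneFlowMLP(nn.Module):
    """Toy MLP mapping a 3D point plus a time index to its 3D motion
    towards the next frame."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, points, t):
        t_feat = torch.full_like(points[..., :1], float(t))
        return self.net(torch.cat([points, t_feat], dim=-1))

def unroll_scene_flow(flow_mlp, points, t0, num_steps):
    """Recursively apply the per-step scene flow: chaining short-range
    predictions produces long-range 3D trajectories, against which
    wide-baseline multi-view consistency losses can be imposed."""
    trajectory = [points]
    for k in range(num_steps):
        points = points + flow_mlp(points, t0 + k)
        trajectory.append(points)
    return trajectory
```

During test-time training, such a module would be optimized together with the depth CNN so that the depth-induced 3D points move smoothly along these trajectories.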