Hot Papers 2021-05-13

1. Segmenter: Transformer for Semantic Segmentation

Robin Strudel, Ricardo Garcia, Ivan Laptev, Cordelia Schmid

retweets: 2497, favorites: 248 (05/14/2021 10:14:57)
cs.CV | cs.AI | cs.LG

Image segmentation is often ambiguous at the level of individual image patches and requires contextual information to reach label consensus. In this paper we introduce Segmenter, a transformer model for semantic segmentation. In contrast to convolution-based approaches, our approach models global context from the first layer and throughout the network. We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation. To do so, we rely on the output embeddings corresponding to image patches and obtain class labels from these embeddings with a point-wise linear decoder or a mask transformer decoder. We leverage models pre-trained for image classification and show that we can fine-tune them on the moderate-sized datasets available for semantic segmentation. The linear decoder already yields excellent results, but performance can be further improved by a mask transformer that generates class masks. We conduct an extensive ablation study to show the impact of the different parameters; in particular, performance is better for large models and small patch sizes. Segmenter attains excellent results for semantic segmentation. It outperforms the state of the art on the challenging ADE20K dataset and performs on par on Pascal Context and Cityscapes.
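To make the "point-wise linear decoder" idea concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes the encoder has already produced one embedding per patch. The function name `linear_decode`, its parameters, and the nearest-neighbour upsampling are illustrative assumptions; the abstract does not say how patch-level predictions are brought back to pixel resolution.

```python
def linear_decode(patch_embeddings, weights, bias, grid_h, grid_w, patch_size):
    """Turn per-patch embeddings into a per-pixel label map.

    patch_embeddings: grid_h * grid_w vectors of length D (row-major patch order).
    weights, bias:    one weight row (length D) and one bias per class.
    """
    patch_labels = []
    for emb in patch_embeddings:
        # Point-wise linear layer: the same weights are applied to every patch.
        logits = [sum(w * x for w, x in zip(row, emb)) + b
                  for row, b in zip(weights, bias)]
        # Pick the highest-scoring class for this patch.
        patch_labels.append(max(range(len(logits)), key=logits.__getitem__))

    # Nearest-neighbour upsampling (an assumption made for brevity):
    # every pixel inherits the label of the patch that contains it.
    label_map = []
    for r in range(grid_h * patch_size):
        label_map.append([
            patch_labels[(r // patch_size) * grid_w + (c // patch_size)]
            for c in range(grid_w * patch_size)
        ])
    return label_map


# Toy input: a 2x2 grid of patches, 2-dimensional embeddings, 2 classes.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
identity_weights = [[1.0, 0.0], [0.0, 1.0]]
toy_map = linear_decode(embeddings, identity_weights, [0.0, 0.0], 2, 2, 2)
```

With these toy inputs, `toy_map` is a 4x4 grid whose top-left and bottom-right 2x2 blocks are class 0 and whose other two blocks are class 1.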

Segmenter: Transformer for Semantic Segmentation
pdf: https://t.co/HGQVRNuMLC
abs: https://t.co/xUjh7rjBKT pic.twitter.com/O6vpNtixz7
— AK (@ak92501) May 13, 2021

2. When Does Contrastive Visual Representation Learning Work?

Elijah Cole, Xuan Yang, Kimberly Wilber, Oisin Mac Aodha, Serge Belongie

retweets: 487, favorites: 142 (05/14/2021 10:14:57)
cs.CV | cs.LG

Recent self-supervised representation learning techniques have largely closed the gap between supervised and unsupervised learning on ImageNet classification. While the particulars of pretraining on ImageNet are now relatively well understood, the field still lacks widely accepted best practices for replicating this success on other datasets. As a first step in this direction, we study contrastive self-supervised learning on four diverse large-scale datasets. By looking through the lenses of data quantity, data domain, data quality, and task granularity, we provide new insights into the necessary conditions for successful self-supervised learning. Our key findings include observations such as: (i) the benefit of additional pretraining data beyond 500k images is modest, (ii) adding pretraining images from another domain does not lead to more general representations, (iii) corrupted pretraining images have a disparate impact on supervised and self-supervised pretraining, and (iv) contrastive learning lags far behind supervised learning on fine-grained visual classification tasks.

When Does Contrastive Visual Representation Learning Work?
pdf: https://t.co/pw83mXu7Sn
abs: https://t.co/tSHW6h8hVs

benefit of additional pretraining data beyond
500k images is modest, adding pretraining images from
another domain does not lead to more general representations pic.twitter.com/XVHBNvI1bv
— AK (@ak92501) May 13, 2021

3. TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text

Amanpreet Singh, Guan Pang, Mandy Toh, Jing Huang, Wojciech Galuba, Tal Hassner

retweets: 182, favorites: 65 (05/14/2021 10:14:57)
cs.CV

A crucial component of the scene-text-based reasoning required for the TextVQA and TextCaps datasets involves detecting and recognizing text present in the images using an optical character recognition (OCR) system. Current systems are crippled by the unavailability of ground-truth text annotations for these datasets, as well as by the lack of scene text detection and recognition datasets on real images, which hinders progress in the field of OCR and the evaluation of scene-text-based reasoning in isolation from OCR systems. In this work, we propose TextOCR, an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images from the TextVQA dataset. We show that current state-of-the-art text-recognition (OCR) models fail to perform well on TextOCR and that training on TextOCR helps achieve state-of-the-art performance on multiple other OCR datasets as well. We use a TextOCR-trained OCR model to create the PixelM4C model, which can do scene-text-based reasoning on an image in an end-to-end fashion, allowing us to revisit several design choices and achieve new state-of-the-art performance on the TextVQA dataset.

TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text
pdf: https://t.co/TDw1QSlSzw
abs: https://t.co/rWRCRSlLKd
project page: https://t.co/zX0Wf1ucH6

arbitrary-shaped scene text detection and recognition with 900k annotated words pic.twitter.com/Vn8ad0EKSx
— AK (@ak92501) May 13, 2021

4. An Introduction to Algorithmic Fairness

Hilde J.P. Weerts

retweets: 196, favorites: 50 (05/14/2021 10:14:58)
cs.CY

In recent years, there has been an increasing awareness among both the public and the scientific community that algorithmic systems can reproduce, amplify, or even introduce unfairness in our societies. These lecture notes provide an introduction to some of the core concepts in algorithmic fairness research. We list different types of fairness-related harms, explain two main notions of algorithmic fairness, and map the biases that underlie these harms onto the machine learning development process.

After some initial nice responses, I've decided to put the introduction chapter of my lecture notes on algorithmic fairness on arXiv: https://t.co/3UTqNOWUcw. I hope these will make it easier for folks who are interested to learn more about fairness but don't know where to start!
— Hilde Weerts (@hildeweerts) May 13, 2021

5. Fairness and Discrimination in Information Access Systems

Michael D. Ekstrand, Anubrata Das, Robin Burke, Fernando Diaz

retweets: 156, favorites: 46 (05/14/2021 10:14:58)
cs.IR

Recommendation, information retrieval, and other information access systems pose unique challenges for investigating and applying the fairness and non-discrimination concepts that have been developed for studying other machine learning systems. While fair information access shares many commonalities with fair classification, the multistakeholder nature of information access applications, the rank-based problem setting, the centrality of personalization in many cases, and the role of user response complicate the problem of identifying precisely what types and operationalizations of fairness may be relevant, let alone measuring or promoting them. In this monograph, we present a taxonomy of the various dimensions of fair information access and survey the literature to date on this new and rapidly-growing topic. We preface this with brief introductions to information access and algorithmic fairness, to facilitate use of this work by scholars with experience in one (or neither) of these fields who wish to learn about their intersection. We conclude with several open problems in fair information access, along with some suggestions for how to approach research in this space.

🚨 new preprint alert 🚨

A survey and systematization of fairness in information retrieval and #recsys, with @d_anubrata, @rburke2233, and @841io. Share and enjoy, and please send feedback! https://t.co/B0at5cTIMl
— Michael Ekstrand (@mdekstrand) May 13, 2021

6. Operation-wise Attention Network for Tampering Localization Fusion

Polychronis Charitidis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Kompatsiaris

retweets: 56, favorites: 5 (05/14/2021 10:14:58)
cs.CV

In this work, we present a deep learning-based approach for image tampering localization fusion. This approach is designed to combine the outcomes of multiple image forensics algorithms and provides a fused tampering localization map, which requires no expert knowledge and is easier for end users to interpret. Our fusion framework includes a set of five individual tampering localization methods for splicing localization on JPEG images. The proposed deep learning fusion model is an adapted architecture, initially proposed for the image restoration task, that performs multiple operations in parallel, weighted by an attention mechanism to enable the selection of proper operations depending on the input signals. This weighting process can be very beneficial for cases where the input signal is very diverse, as in our case where the output signals of multiple image forensics algorithms are combined. Evaluation on three publicly available forensics datasets demonstrates that the performance of the proposed approach is competitive, outperforming the individual forensics techniques as well as another recently proposed fusion framework in the majority of cases.

7. GANs for Medical Image Synthesis: An Empirical Study

Youssef Skandarani, Pierre-Marc Jodoin, Alain Lalande

retweets: 30, favorites: 29 (05/14/2021 10:14:58)
eess.IV | cs.CV | cs.LG

Generative Adversarial Networks (GANs) have become increasingly powerful, generating mind-blowing photorealistic images that mimic the content of datasets they were trained to replicate. One recurrent theme in medical imaging is whether GANs can be as effective at generating workable medical data as they are at generating realistic RGB images. In this paper, we perform a multi-GAN and multi-application study to gauge the benefits of GANs in medical imaging. We tested various GAN architectures, from basic DCGAN to more sophisticated style-based GANs, on three medical imaging modalities and organs, namely cardiac cine-MRI, liver CT, and RGB retina images. GANs were trained on well-known and widely utilized datasets, from which their FID scores were computed to measure the visual acuity of their generated images. We further tested their usefulness by measuring the segmentation accuracy of a U-Net trained on these generated images. Results reveal that GANs are far from being equal, as some are ill-suited for medical imaging applications while others are much better off. The top-performing GANs are capable of generating realistic-looking medical images by FID standards that can fool trained experts in a visual Turing test and comply with some metrics. However, segmentation results suggest that no GAN is capable of reproducing the full richness of medical datasets.
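For readers unfamiliar with FID, the following is a simplified sketch of the underlying Fréchet distance between two Gaussians fitted to feature vectors. It is not the evaluation code used in the paper. Two simplifications are assumed: the covariances are treated as diagonal, so no matrix square root is needed, and the features are plain lists rather than Inception activations. Under the diagonal assumption the squared distance reduces to the sum over dimensions of (mean difference)^2 + (standard deviation difference)^2. The function name `diagonal_frechet_distance` is hypothetical.

```python
import math
from statistics import fmean, variance


def diagonal_frechet_distance(real_feats, fake_feats):
    """Squared Frechet distance between Gaussians with diagonal covariance.

    real_feats, fake_feats: lists of equal-length feature vectors,
    each set containing at least two vectors.
    """
    distance = 0.0
    for d in range(len(real_feats[0])):
        real_column = [f[d] for f in real_feats]
        fake_column = [f[d] for f in fake_feats]
        mean_gap = fmean(real_column) - fmean(fake_column)
        # Sample variance (n - 1 denominator) for each feature dimension.
        std_gap = math.sqrt(variance(real_column)) - math.sqrt(variance(fake_column))
        distance += mean_gap ** 2 + std_gap ** 2
    return distance


# Toy 1-D features: same spread, means shifted by 1.
real = [[0.0], [2.0]]
fake = [[1.0], [3.0]]
shifted = diagonal_frechet_distance(real, fake)
identical = diagonal_frechet_distance(real, real)
```

A lower value means the two feature distributions are closer; identical feature sets give zero. As the abstract stresses, a good score of this kind does not guarantee that the generated images are useful for a downstream task such as segmentation.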

GANs for Medical Image Synthesis: An Empirical Study
Youssef Skandarani, @PMJodoin, Alain Lalande https://t.co/47eV12lpQ2

tl;dr:
- FID score != downstream metrics in the medical domain
- adding GAN-generated data to the train set helps only a little if any. pic.twitter.com/Fi3SVEmgG1
— Dmytro Mishkin (@ducha_aiki) May 13, 2021

8. Collaborative Regression of Expressive Bodies using Moderation

Yao Feng, Vasileios Choutas, Timo Bolkart, Dimitrios Tzionas, Michael J. Black

retweets: 12, favorites: 43 (05/14/2021 10:14:58)
cs.CV

Recovering expressive humans from images is essential for understanding human behavior. Methods that estimate 3D bodies, faces, or hands have progressed significantly, yet separately. Face methods recover accurate 3D shape and geometric details, but need a tight crop and struggle with extreme views and low resolution. Whole-body methods are robust to a wide range of poses and resolutions, but provide only a rough 3D face shape without details like wrinkles. To get the best of both worlds, we introduce PIXIE, which produces animatable, whole-body 3D avatars from a single image, with realistic facial detail. To get accurate whole bodies, PIXIE uses two key observations. First, body parts are correlated, but existing work combines independent estimates from body, face, and hand experts, by trusting them equally. PIXIE introduces a novel moderator that merges the features of the experts, weighted by their confidence. Uniquely, part experts can contribute to the whole, using SMPL-X’s shared shape space across all body parts. Second, human shape is highly correlated with gender, but existing work ignores this. We label training images as male, female, or non-binary, and train PIXIE to infer “gendered” 3D body shapes with a novel shape loss. In addition to 3D body pose and shape parameters, PIXIE estimates expression, illumination, albedo and 3D surface displacements for the face. Quantitative and qualitative evaluation shows that PIXIE estimates 3D humans with a more accurate whole-body shape and detailed face shape than the state of the art. Our models and code are available for research at https://pixie.is.tue.mpg.de.
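The moderator idea, merging expert features weighted by their confidence, can be illustrated with a small sketch. This is a guess at the general shape, not PIXIE's actual architecture. It assumes two experts (body and one part), a linear scoring layer over the concatenated features, and a softmax with a temperature to turn the two scores into confidence weights. The names `moderate`, `score_weights`, and `temperature` are hypothetical.

```python
import math


def moderate(body_feat, part_feat, score_weights, temperature=1.0):
    """Fuse two expert feature vectors with softmax confidence weights.

    body_feat, part_feat: feature vectors of equal length D.
    score_weights:        two rows of length 2 * D (hypothetical linear layer)
                          producing one confidence score per expert.
    """
    joint = body_feat + part_feat  # list concatenation, length 2 * D
    scores = [sum(w * x for w, x in zip(row, joint)) for row in score_weights]

    # Numerically stable softmax over the two scores.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    w_body, w_part = exps[0] / total, exps[1] / total

    # Confidence-weighted merge of the two experts' features.
    fused = [w_body * b + w_part * p for b, p in zip(body_feat, part_feat)]
    return fused, (w_body, w_part)


# With an all-zero scoring layer both experts are trusted equally,
# which mirrors the "trust them equally" baseline the abstract criticizes.
fused, confidences = moderate(
    [2.0, 0.0], [0.0, 4.0],
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
)
```

In a trained model the scoring layer would be learned, so that, for example, a blurry or heavily occluded hand crop lowers the hand expert's weight and the body expert's features dominate.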

Collaborative Regression of Expressive Bodies using Moderation
pdf: https://t.co/rriYtSrXC2
abs: https://t.co/XtqaXJloGo
project page: https://t.co/u86fe8Hcy9

whole-body reconstruction method that recovers an animatable 3D avatar with a detailed face from a single RGB image pic.twitter.com/UXf1oJRoRF
— AK (@ak92501) May 13, 2021

Published 14 May 2021

Tatsuya Shirakawa, ML Lead at Beatrust (https://beatrust.com)