Hot Papers 2021-02-11

1. Training Vision Transformers for Image Retrieval

Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Hervé Jégou

retweets: 1317, favorites: 401 (02/12/2021 09:39:48)
links: abs | pdf
cs.CV

Transformers have shown outstanding results for natural language understanding and, more recently, for image classification. We here extend this work and propose a transformer-based approach for image retrieval: we adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective, which combines a contrastive loss with a differential entropy regularizer. Our results show consistent and significant improvements of transformers over convolution-based approaches. In particular, our method outperforms the state of the art on several public benchmarks for category-level retrieval, namely Stanford Online Product, In-Shop and CUB-200. Furthermore, our experiments on ROxford and RParis also show that, in comparable settings, transformers are competitive for particular object retrieval, especially in the regime of short vector representations and low-resolution images.
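
As a rough illustration of the objective described in the abstract, here is a minimal sketch of a contrastive loss combined with a nearest-neighbour entropy regularizer. It is not the authors' code. The function names, the margin of 0.5, and the weight `lam` are assumptions chosen for illustration, and the regularizer is written as the negative mean log distance to the nearest neighbour, one common estimator of differential entropy.

```python
import math

def l2_normalize(v):
    # Project a descriptor onto the unit sphere.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(emb, labels, margin=0.5):
    """Pairwise contrastive loss on cosine similarity (margin is an assumed value)."""
    n = len(emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            sim = dot(emb[i], emb[j])
            if labels[i] == labels[j]:
                total += 1.0 - sim               # pull same-label pairs together
            else:
                total += max(0.0, sim - margin)  # penalize negatives only above the margin
    return total / n

def entropy_regularizer(emb, eps=1e-8):
    """Negative mean log nearest-neighbour distance; lower means more spread out."""
    n = len(emb)
    total = 0.0
    for i in range(n):
        nearest = min(math.dist(emb[i], emb[j]) for j in range(n) if j != i)
        total += math.log(nearest + eps)
    return -total / n

def retrieval_loss(raw, labels, margin=0.5, lam=0.5):
    # lam is a placeholder weight, not a value taken from the paper.
    emb = [l2_normalize(v) for v in raw]
    return contrastive_loss(emb, labels, margin) + lam * entropy_regularizer(emb)
```

Minimizing the regularizer pushes each descriptor away from its nearest neighbour, which spreads the batch more evenly over the unit sphere, while the contrastive term keeps same-class descriptors close.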

Training Vision Transformers for Image Retrieval
pdf: https://t.co/Zp9dXHcvMT
abs: https://t.co/axwTP8u61v pic.twitter.com/DJe11NTxhv
— AK (@ak92501) February 11, 2021

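On the task of retrieving images with an image query (or of scoring the similarity between images), a method based on Vision Transformer substantially outperforms the CNN-based state of the art in accuracy. It uses the CLS embedding as the descriptor and is trained with a contrastive loss plus a regularizer that maximizes entropy (the log of the nearest-neighbor distance). https://t.co/7SgM3gRR1t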
— Daisuke Okanohara (@hillbig) February 11, 2021

Training Vision Transformers for Image Retrieval https://t.co/7Q8xcZxB6i pic.twitter.com/Niz6xGiusL
— phalanx (@ZFPhalanx) February 11, 2021

UPDATE: #ComputerVision Transformers

✅ Image retrieval https://t.co/5fmoI1t5BW
✅ Video https://t.co/Pl2Isp8B9V #YearOfTheTransformer
— Kosta Derpanis (@CSProfKGD) February 11, 2021

2. Is Space-Time Attention All You Need for Video Understanding?

Gedas Bertasius, Heng Wang, Lorenzo Torresani

retweets: 582, favorites: 213 (02/12/2021 09:39:48)
links: abs | pdf
cs.CV

We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named “TimeSformer,” adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that “divided attention,” where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically different design compared to the prominent paradigm of 3D convolutional architectures for video, TimeSformer achieves state-of-the-art results on several major action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Furthermore, our model is faster to train and has higher test-time efficiency compared to competing architectures. Code and pretrained models will be made publicly available.
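
The "divided attention" idea from the abstract can be sketched as follows: within a block, each patch first attends over the same patch location in the other frames, then over the other patches of its own frame. This is a simplified illustration and not the TimeSformer implementation. It uses a single head, identity query/key/value projections, and plain residual connections, and all function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out

def divided_attention(video):
    """video[t][s] is the feature vector of patch s in frame t."""
    T, S = len(video), len(video[0])
    # Temporal step: each patch location attends across frames.
    x = [[None] * S for _ in range(T)]
    for s in range(S):
        attended = self_attention([video[t][s] for t in range(T)])
        for t in range(T):
            x[t][s] = [a + b for a, b in zip(video[t][s], attended[t])]  # residual
    # Spatial step: each frame attends over its own patches.
    y = []
    for t in range(T):
        attended = self_attention(x[t])
        y.append([[a + b for a, b in zip(x[t][s], attended[s])] for s in range(S)])
    return y
```

With, say, 8 frames of 196 patches each (illustrative sizes), joint space-time attention would compare every token with 8 × 196 = 1568 tokens, while the divided scheme compares it with 8 + 196 = 204.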

Is Space-Time Attention All You Need for Video Understanding?
pdf: https://t.co/9GOGsQiCgP
abs: https://t.co/x29qlViXnm pic.twitter.com/NeYcHAgm4y
— AK (@ak92501) February 11, 2021

3. NAST: Non-Autoregressive Spatial-Temporal Transformer for Time Series Forecasting

Kai Chen, Guang Chen, Dan Xu, Lijun Zhang, Yuyao Huang, Alois Knoll

retweets: 336, favorites: 97 (02/12/2021 09:39:49)
links: abs | pdf
cs.LG | stat.ML

Although Transformer has made breakthrough success in widespread domains especially in Natural Language Processing (NLP), applying it to time series forecasting is still a great challenge. In time series forecasting, the autoregressive decoding of canonical Transformer models could introduce huge accumulative errors inevitably. Besides, utilizing Transformer to deal with spatial-temporal dependencies in the problem still faces tough difficulties. To tackle these limitations, this work is the first attempt to propose a Non-Autoregressive Transformer architecture for time series forecasting, aiming at overcoming the time delay and accumulative error issues in the canonical Transformer. Moreover, we present a novel spatial-temporal attention mechanism, building a bridge by a learned temporal influence map to fill the gaps between the spatial and temporal attention, so that spatial and temporal dependencies can be processed integrally. Empirically, we evaluate our model on diversified ego-centric future localization datasets and demonstrate state-of-the-art performance on both real-time and accuracy.
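
To make the accumulated-error and time-delay argument concrete, here is a toy comparison of autoregressive and non-autoregressive decoding. It is not the NAST model. The series, the constant bias of 0.1, and both predictor functions are hypothetical and exist only to show how feeding predictions back in lets a per-step error grow with the horizon.

```python
def one_step_model(x, bias=0.1):
    """Hypothetical one-step predictor for the series x_{t+1} = x_t + 1, off by `bias`."""
    return x + 1.0 + bias

def autoregressive_forecast(last, horizon, bias=0.1):
    preds, x = [], last
    for _ in range(horizon):         # `horizon` sequential model calls
        x = one_step_model(x, bias)  # each prediction is fed back in as input
        preds.append(x)
    return preds

def direct_forecast(last, horizon, bias=0.1):
    """Hypothetical non-autoregressive predictor: every step is emitted from the
    last observed value in one pass, with the same assumed per-output bias."""
    return [last + h + bias for h in range(1, horizon + 1)]

truth = [10.0 + h for h in range(1, 6)]
ar_err = [abs(p - t) for p, t in zip(autoregressive_forecast(10.0, 5), truth)]
direct_err = [abs(p - t) for p, t in zip(direct_forecast(10.0, 5), truth)]
```

In this toy setting the autoregressive error grows linearly with the forecast step, while the direct forecast keeps a constant error and needs a single pass instead of one call per step.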

NAST: Non-Autoregressive Spatial-Temporal Transformer for Time Series Forecasting
pdf: https://t.co/DIJS3cWlRp
abs: https://t.co/R8jauAdLEO pic.twitter.com/jJvhu3ulAJ
— AK (@ak92501) February 11, 2021

Published 12 Feb 2021

ML Lead at Beatrust (https://beatrust.com). Tatsuya Shirakawa on Twitter