1. Training Vision Transformers for Image Retrieval
Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Hervé Jégou
Transformers have shown outstanding results for natural language understanding and, more recently, for image classification. We here extend this work and propose a transformer-based approach for image retrieval: we adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective, which combines a contrastive loss with a differential entropy regularizer. Our results show consistent and significant improvements of transformers over convolution-based approaches. In particular, our method outperforms the state of the art on several public benchmarks for category-level retrieval, namely Stanford Online Product, In-Shop and CUB-200. Furthermore, our experiments on ROxford and RParis also show that, in comparable settings, transformers are competitive for particular object retrieval, especially in the regime of short vector representations and low-resolution images.
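The objective described in the abstract combines a contrastive loss over pairs of descriptors with a differential entropy regularizer which, as noted in the tweet below, amounts to maximizing the log distance of each embedding to its nearest neighbor in the batch. Here is a minimal PyTorch sketch of such a combined loss; the margin, the regularization weight `lambda_reg`, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_entropy_loss(embeddings, labels, margin=0.5, lambda_reg=0.7):
    """Contrastive loss + differential entropy regularizer (illustrative sketch).

    embeddings: (B, D) descriptors, e.g. the ViT CLS token output.
    labels:     (B,) class / instance ids; the batch must contain positive pairs.
    margin and lambda_reg are placeholder values, not the paper's settings.
    """
    emb = F.normalize(embeddings, dim=-1)
    sim = emb @ emb.t()                                 # cosine similarities
    dist = (2 - 2 * sim).clamp(min=1e-12).sqrt()        # distances on the unit sphere

    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos = same & ~eye
    neg = ~same

    # Contrastive term: pull positives together, push negatives beyond the margin.
    loss_pos = dist[pos].pow(2).mean()
    loss_neg = F.relu(margin - dist[neg]).pow(2).mean()

    # Differential entropy regularizer: maximize the log of each sample's
    # distance to its nearest neighbor in the batch (spreads embeddings apart).
    nn_dist = (dist + eye.float() * 1e6).min(dim=1).values
    entropy_reg = -torch.log(nn_dist + 1e-12).mean()

    return loss_pos + loss_neg + lambda_reg * entropy_reg
```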
Training Vision Transformers for Image Retrieval
pdf: https://t.co/Zp9dXHcvMT
abs: https://t.co/axwTP8u61v pic.twitter.com/DJe11NTxhv
— AK (@ak92501) February 11, 2021
On the task of retrieving images with an image as the query (or evaluating the similarity between images), a method using Vision Transformers significantly outperforms CNN-based state-of-the-art approaches in accuracy. The CLS embedding is used as the image descriptor, and the model is trained with a contrastive loss plus a regularizer that maximizes entropy (the log of the nearest-neighbor distance). https://t.co/7SgM3gRR1t
— Daisuke Okanohara (@hillbig) February 11, 2021
Training Vision Transformers for Image Retrieval https://t.co/7Q8xcZxB6i pic.twitter.com/Niz6xGiusL
— phalanx (@ZFPhalanx) February 11, 2021
UPDATE: #ComputerVision Transformers
✅ Image retrieval https://t.co/5fmoI1t5BW
✅ Video https://t.co/Pl2Isp8B9V #YearOfTheTransformer
— Kosta Derpanis (@CSProfKGD) February 11, 2021
2. Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius, Heng Wang, Lorenzo Torresani
We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named “TimeSformer,” adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that “divided attention,” where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically different design compared to the prominent paradigm of 3D convolutional architectures for video, TimeSformer achieves state-of-the-art results on several major action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Furthermore, our model is faster to train and has higher test-time efficiency compared to competing architectures. Code and pretrained models will be made publicly available.
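The "divided attention" scheme can be read as two separate self-attention steps inside every block: temporal attention across frames at a fixed spatial location, followed by spatial attention among the patches of each frame. The simplified PyTorch sketch below illustrates only that factorization, assuming standard `nn.MultiheadAttention` layers and omitting the classification token, MLP sub-layer, and other details of the full TimeSformer block.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Simplified divided space-time attention block (illustrative sketch)."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, D) patch tokens for T frames with N patches each.
        B, T, N, D = x.shape

        # Temporal attention: each spatial location attends across the T frames.
        xt = self.norm_t(x).permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt, _ = self.attn_t(xt, xt, xt)
        x = x + xt.reshape(B, N, T, D).permute(0, 2, 1, 3)

        # Spatial attention: the N patches within each frame attend to one another.
        xs = self.norm_s(x).reshape(B * T, N, D)
        xs, _ = self.attn_s(xs, xs, xs)
        return x + xs.reshape(B, T, N, D)
```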
Is Space-Time Attention All You Need for Video Understanding?
pdf: https://t.co/9GOGsQiCgP
abs: https://t.co/x29qlViXnm pic.twitter.com/NeYcHAgm4y
— AK (@ak92501) February 11, 2021
3. NAST: Non-Autoregressive Spatial-Temporal Transformer for Time Series Forecasting
Kai Chen, Guang Chen, Dan Xu, Lijun Zhang, Yuyao Huang, Alois Knoll
Although the Transformer has achieved breakthrough success across many domains, especially Natural Language Processing (NLP), applying it to time series forecasting remains a great challenge. In time series forecasting, the autoregressive decoding of canonical Transformer models inevitably introduces large accumulated errors. Moreover, handling spatial-temporal dependencies with a Transformer remains difficult. To tackle these limitations, this work is the first attempt to propose a Non-Autoregressive Transformer architecture for time series forecasting, aiming to overcome the time delay and accumulated error issues of the canonical Transformer. We also present a novel spatial-temporal attention mechanism that builds a bridge, via a learned temporal influence map, between spatial attention and temporal attention, so that spatial and temporal dependencies can be processed jointly. Empirically, we evaluate our model on diverse ego-centric future localization datasets and demonstrate state-of-the-art performance in both speed and accuracy.
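The non-autoregressive idea is that the decoder emits all future time steps in a single pass from a set of learned queries attending to the encoded history, rather than feeding each prediction back in to produce the next one, which is where the accumulated error comes from. The sketch below illustrates only this generic decoding scheme with standard PyTorch Transformer modules; the paper's spatial-temporal attention and learned temporal influence map are not reproduced, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class NonAutoregressiveForecaster(nn.Module):
    """Generic non-autoregressive forecasting head (illustrative sketch)."""

    def __init__(self, d_model=128, horizon=12, n_features=2, heads=4, layers=2):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, heads, batch_first=True), layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, heads, batch_first=True), layers)
        # One learned query per future time step; all steps are decoded in parallel.
        self.queries = nn.Parameter(torch.randn(horizon, d_model))
        self.in_proj = nn.Linear(n_features, d_model)
        self.out_proj = nn.Linear(d_model, n_features)

    def forward(self, history):
        # history: (B, L, n_features) observed past trajectory.
        memory = self.encoder(self.in_proj(history))
        queries = self.queries.unsqueeze(0).expand(history.size(0), -1, -1)
        # No causal mask and no feedback of predictions: every future step is
        # predicted simultaneously, so errors do not accumulate across steps.
        return self.out_proj(self.decoder(queries, memory))  # (B, horizon, n_features)
```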
NAST: Non-Autoregressive Spatial-Temporal Transformer for Time Series Forecasting
pdf: https://t.co/DIJS3cWlRp
abs: https://t.co/R8jauAdLEO pic.twitter.com/jJvhu3ulAJ
— AK (@ak92501) February 11, 2021