1. MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments from a Single Moving Camera
Felix Wimbauer, Nan Yang, Lukas von Stumberg, Niclas Zeller, Daniel Cremers
In this paper, we propose MonoRec, a semi-supervised monocular dense reconstruction architecture that predicts depth maps from a single moving camera in dynamic environments. MonoRec is based on a multi-view stereo (MVS) setting which encodes the information of multiple consecutive images in a cost volume. To deal with dynamic objects in the scene, we introduce a MaskModule that predicts moving object masks by leveraging the photometric inconsistencies encoded in the cost volumes. Unlike other MVS methods, MonoRec is able to predict accurate depths for both static and moving objects by leveraging the predicted masks. Furthermore, we present a novel multi-stage training scheme with a semi-supervised loss formulation that does not require LiDAR depth values. We carefully evaluate MonoRec on the KITTI dataset and show that it achieves state-of-the-art performance compared to both multi-view and single-view methods. With the model trained on KITTI, we further demonstrate that MonoRec generalizes well to both the Oxford RobotCar dataset and the more challenging TUM-Mono dataset recorded by a handheld camera. Training code and a pre-trained model will be published soon.
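To make the cost-volume idea concrete, here is a minimal sketch of a plane-sweep photometric cost volume built from consecutive frames, assuming known intrinsics and keyframe-to-source poses. The absolute-difference cost and all function and parameter names below are illustrative assumptions, not MonoRec's exact formulation (which feeds such a volume into learned modules like the MaskModule and the depth decoder).

```python
import torch
import torch.nn.functional as F

def plane_sweep_cost_volume(key, srcs, K, K_inv, poses, depths):
    """Photometric cost volume (illustrative sketch, not MonoRec's exact cost).

    key:    (1, 3, H, W) keyframe image
    srcs:   list of (1, 3, H, W) source images
    K, K_inv: (3, 3) camera intrinsics and their inverse
    poses:  list of (4, 4) keyframe-to-source relative poses
    depths: iterable of depth hypotheses
    Returns a (D, H, W) volume; low values mean photometrically consistent.
    """
    _, _, H, W = key.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], 0).reshape(3, -1)   # (3, H*W)
    rays = K_inv @ pix                                                # back-projected rays
    cost = torch.zeros(len(depths), H, W)
    for d_idx, d in enumerate(depths):
        pts = rays * d                                                # 3D points at hypothesis d
        pts_h = torch.cat([pts, torch.ones(1, pts.shape[1])], 0)      # homogeneous coordinates
        errs = []
        for src, T in zip(srcs, poses):
            p_src = K @ (T @ pts_h)[:3]                               # project into source view
            uv = p_src[:2] / p_src[2:].clamp(min=1e-6)
            grid = torch.stack([uv[0] / (W - 1) * 2 - 1,              # normalize to [-1, 1]
                                uv[1] / (H - 1) * 2 - 1], -1).reshape(1, H, W, 2)
            warped = F.grid_sample(src, grid, align_corners=True)
            errs.append((warped - key).abs().mean(1))                 # photometric error
        cost[d_idx] = torch.stack(errs).mean(0)[0]
    return cost
```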
MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments from a Single Moving Camera
— AK (@ak92501) November 25, 2020
pdf: https://t.co/iC7DyrDmLD
abs: https://t.co/gy9XCyjOLF
project page: https://t.co/8wMfO9AgR4 pic.twitter.com/i15K5bFPPW
MonoRec achieves SOTA performance in 3D reconstruction via dense depth estimation from a monocular camera. It builds a cost volume from multiple frames, estimates and removes dynamic objects, and then predicts depth. Training bootstraps from the outputs of other depth / moving-object estimators and does not require a LiDAR sensor. https://t.co/fTQPpeC0KZ https://t.co/GAZ3udeGEE
— Daisuke Okanohara (@hillbig) November 25, 2020
2. Is a Green Screen Really Necessary for Real-Time Human Matting?
Zhanghan Ke, Kaican Li, Yurou Zhou, Qiuhua Wu, Xiangyu Mao, Qiong Yan, Rynson W.H. Lau
For human matting without the green screen, existing works either require auxiliary inputs that are costly to obtain or use multiple models that are computationally expensive. Consequently, they are not usable in real-time applications. In contrast, we present a light-weight matting objective decomposition network (MODNet), which can process human matting from a single input image in real time. The design of MODNet benefits from optimizing a series of correlated sub-objectives simultaneously via explicit constraints. Moreover, since trimap-free methods usually suffer from the domain shift problem in practice, we introduce (1) a self-supervised strategy based on sub-objective consistency to adapt MODNet to real-world data and (2) a one-frame delay trick to smooth the results when applying MODNet to video human matting. MODNet is easy to train end-to-end. It is much faster than contemporaneous matting methods and runs at 63 frames per second. On a carefully designed human matting benchmark newly proposed in this work, MODNet greatly outperforms prior trimap-free methods. More importantly, our method achieves remarkable results on daily photos and videos. Now, do you really need a green screen for real-time human matting?
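The one-frame delay trick is simple enough to sketch: if the mattes of the previous and next frames agree at a pixel while the current frame disagrees with both, the current value is treated as flicker and replaced by the neighbours' average. The threshold rule and parameter names below are assumptions and only approximate the paper's exact criterion.

```python
import numpy as np

def one_frame_delay(alpha_prev, alpha_curr, alpha_next, tol=0.1):
    """One-frame-delay smoothing sketch (assumed thresholding, not MODNet's exact rule).

    All inputs are matte arrays in [0, 1] for three consecutive frames.
    Because the next frame is needed, the smoothed output lags by one frame.
    """
    neighbors_agree = np.abs(alpha_prev - alpha_next) < tol
    curr_deviates = (np.abs(alpha_curr - alpha_prev) > tol) & \
                    (np.abs(alpha_curr - alpha_next) > tol)
    flicker = neighbors_agree & curr_deviates          # likely flicker pixels
    smoothed = alpha_curr.copy()
    smoothed[flicker] = 0.5 * (alpha_prev[flicker] + alpha_next[flicker])
    return smoothed
```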
Is a Green Screen Really Necessary for Real-Time Human Matting?
— AK (@ak92501) November 25, 2020
pdf: https://t.co/qmjULLGpJB
abs: https://t.co/NA9ctiRBiS pic.twitter.com/1rXV7W0EJ9
3. Differentially Private Learning Needs Better Features (or Much More Data)
Florian Tramèr, Dan Boneh
We demonstrate that differentially private machine learning has not yet reached its “AlexNet moment” on many canonical vision tasks: linear models trained on handcrafted features significantly outperform end-to-end deep neural networks for moderate privacy budgets. To exceed the performance of handcrafted features, we show that private learning requires either much more private data, or access to features learned on public data from a similar domain. Our work introduces simple yet strong baselines for differentially private learning that can inform the evaluation of future progress in this area.
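For context, the end-to-end private training being compared against typically uses DP-SGD: clip each per-example gradient and add Gaussian noise before the update. Below is a minimal NumPy sketch of that recipe for a linear classifier on fixed (handcrafted) features; the hyperparameters are placeholders, and the privacy budget must be tracked separately with an accountant.

```python
import numpy as np

def dp_sgd_linear(features, labels, epochs=10, lr=0.1, clip=1.0,
                  noise_mult=1.1, batch_size=256, seed=0):
    """DP-SGD sketch for a linear classifier on fixed (handcrafted) features.

    Standard recipe: clip each per-example gradient to L2 norm <= `clip`,
    add Gaussian noise with std `noise_mult * clip`, average, and step.
    Hyperparameters are placeholders; privacy accounting is omitted.
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    k = labels.max() + 1
    W = np.zeros((d, k))
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            x, y = features[idx], labels[idx]
            logits = x @ W
            p = np.exp(logits - logits.max(1, keepdims=True))
            p /= p.sum(1, keepdims=True)
            p[np.arange(len(idx)), y] -= 1.0                     # softmax cross-entropy gradient
            per_ex = x[:, :, None] * p[:, None, :]               # (B, d, k) per-example gradients
            norms = np.sqrt((per_ex ** 2).sum((1, 2), keepdims=True))
            per_ex *= np.minimum(1.0, clip / (norms + 1e-12))    # clip each example's gradient
            grad = per_ex.sum(0) + rng.normal(0, noise_mult * clip, W.shape)
            W -= lr * grad / len(idx)
    return W
```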
Current algorithms for training neural nets with differential privacy greatly hurt model accuracy.
— Florian Tramèr (@florian_tramer) November 25, 2020
Can we do better? Yes!
With @danboneh we show how to get better private models by...not using deep learning!
Paper: https://t.co/5jMfcq2NXZ
Code: https://t.co/ZnudaQrZ9Q pic.twitter.com/qFTpJeJ8WC
4. HistoGAN: Controlling Colors of GAN-Generated and Real Images via Color Histograms
Mahmoud Afifi, Marcus A. Brubaker, Michael S. Brown
While generative adversarial networks (GANs) can successfully produce high-quality images, they can be challenging to control. Simplifying GAN-based image generation is critical for their adoption in graphic design and artistic work. This goal has led to significant interest in methods that can intuitively control the appearance of images generated by GANs. In this paper, we present HistoGAN, a color histogram-based method for controlling GAN-generated images’ colors. We focus on color histograms as they provide an intuitive way to describe image color while remaining decoupled from domain-specific semantics. Specifically, we introduce an effective modification of the recent StyleGAN architecture to control the colors of GAN-generated images specified by a target color histogram feature. We then describe how to expand HistoGAN to recolor real images. For image recoloring, we jointly train an encoder network along with HistoGAN. The recoloring model, ReHistoGAN, is an unsupervised approach trained to encourage the network to keep the original image’s content while changing the colors based on the given target histogram. We show that this histogram-based approach offers a better way to control GAN-generated and real images’ colors while producing more compelling results compared to existing alternative strategies.
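A histogram can only condition a generator if it is differentiable with respect to the image. The sketch below computes a soft, kernel-binned log-chroma histogram as a stand-in for such a feature; HistoGAN's actual RGB-uv histogram uses a different kernel and normalization, so the bin count, bandwidth, and coordinate choice here are assumptions.

```python
import torch

def soft_color_histogram(img, bins=16, sigma=0.4):
    """Soft (kernel-binned) log-chroma histogram, usable as a differentiable
    conditioning feature. A stand-in, not HistoGAN's exact RGB-uv feature.

    img: (B, 3, H, W) with values in (0, 1].
    Returns (B, 3, bins, bins): one 2D chroma histogram per colour channel.
    """
    B = img.shape[0]
    eps = 1e-6
    x = img.reshape(B, 3, -1).clamp(min=eps)
    centers = torch.linspace(-3.0, 3.0, bins)
    hists = []
    for c in range(3):
        o1, o2 = [i for i in range(3) if i != c]
        u = torch.log(x[:, c] / x[:, o1])                 # log-chroma coordinates
        v = torch.log(x[:, c] / x[:, o2])
        ku = torch.exp(-(u[:, None, :] - centers[None, :, None]) ** 2 / (2 * sigma ** 2))
        kv = torch.exp(-(v[:, None, :] - centers[None, :, None]) ** 2 / (2 * sigma ** 2))
        h = torch.einsum("bup,bvp->buv", ku, kv)          # soft 2D binning over pixels
        hists.append(h / (h.sum(dim=(1, 2), keepdim=True) + eps))
    return torch.stack(hists, dim=1)
```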
HistoGAN: Controlling Colors of GAN-Generated and Real Images via Color Histograms
— AK (@ak92501) November 25, 2020
pdf: https://t.co/ejPdYZCPE0
abs: https://t.co/saIMtEYXE4 pic.twitter.com/wuNESux4XE
5. MicroNet: Towards Image Recognition with Extremely Low FLOPs
Yunsheng Li, Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Lu Yuan, Zicheng Liu, Lei Zhang, Nuno Vasconcelos
In this paper, we present MicroNet, an efficient convolutional neural network with extremely low computational cost (e.g. 6 MFLOPs on ImageNet classification). Such a low-cost network is highly desirable on edge devices, yet usually suffers from significant performance degradation. We handle the extremely low FLOPs based upon two design principles: (a) avoiding the reduction of network width by lowering the node connectivity, and (b) compensating for the reduction of network depth by introducing more complex non-linearity per layer. Firstly, we propose Micro-Factorized convolution to factorize both pointwise and depthwise convolutions into low-rank matrices for a good tradeoff between the number of channels and input/output connectivity. Secondly, we propose a new activation function, named Dynamic Shift-Max, to improve the non-linearity by maxing out multiple dynamic fusions between an input feature map and its circular channel shift. The fusions are dynamic as their parameters are adapted to the input. Building upon Micro-Factorized convolution and Dynamic Shift-Max, a family of MicroNets achieves a significant performance gain over the state-of-the-art in the low-FLOP regime. For instance, MicroNet-M1 achieves 61.1% top-1 accuracy on ImageNet classification with 12 MFLOPs, outperforming MobileNetV3 by 11.3%.
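To make the "max over dynamic fusions of circular channel shifts" idea concrete, here is a simplified Dynamic Shift-Max style module. The squeeze-and-excitation style coefficient branch, the group count, and the reduction ratio are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class DynamicShiftMax(nn.Module):
    """Sketch of a Dynamic Shift-Max style activation (simplified from the paper).

    Each output channel takes the max over K dynamically weighted fusions of the
    input and its circular shifts by multiples of C/G channels; the fusion weights
    come from an SE-style branch applied to the globally pooled input.
    """
    def __init__(self, channels, groups=4, K=2, J=2, reduction=4):
        super().__init__()
        self.C, self.G, self.K, self.J = channels, groups, K, J
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels * K * J), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        a = self.fc(x.mean(dim=(2, 3)))         # dynamic coefficients from pooled input
        a = a.reshape(B, self.K, self.J, C, 1, 1)
        shifts = torch.stack([torch.roll(x, shifts=j * C // self.G, dims=1)
                              for j in range(self.J)], dim=1)   # (B, J, C, H, W)
        fused = (a * shifts.unsqueeze(1)).sum(dim=2)            # (B, K, C, H, W)
        return fused.max(dim=1).values                          # max over the K fusions
```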
MicroNet: Towards Image Recognition with Extremely Low FLOPshttps://t.co/wpBenRPA00 pic.twitter.com/xoMfUBtQKg
— phalanx (@ZFPhalanx) November 25, 2020
MicroNet: Towards Image Recognition with Extremely Low FLOPs
— AK (@ak92501) November 25, 2020
pdf: https://t.co/R0jqr1E6F5
abs: https://t.co/p1Z5B4gc3n pic.twitter.com/yisKFisAiI
6. Benchmarking Image Retrieval for Visual Localization
Noé Pion, Martin Humenberger, Gabriela Csurka, Yohann Cabon, Torsten Sattler
Visual localization, i.e., camera pose estimation in a known scene, is a core component of technologies such as autonomous driving and augmented reality. State-of-the-art localization approaches often rely on image retrieval techniques for one of two tasks: (1) provide an approximate pose estimate or (2) determine which parts of the scene are potentially visible in a given query image. It is common practice to use state-of-the-art image retrieval algorithms for these tasks. These algorithms are often trained for the goal of retrieving the same landmark under a large range of viewpoint changes. However, robustness to viewpoint changes is not necessarily desirable in the context of visual localization. This paper focuses on understanding the role of image retrieval for multiple visual localization tasks. We introduce a benchmark setup and compare state-of-the-art retrieval representations on multiple datasets. We show that retrieval performance on classical landmark retrieval/recognition tasks correlates with localization performance only for some, but not all, of these tasks. This indicates a need for retrieval approaches specifically designed for localization tasks. Our benchmark and evaluation protocols are available at https://github.com/naver/kapture-localization.
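For task (1), the simplest retrieval-based baseline simply takes over the pose of the nearest database image under a global descriptor. A minimal sketch is below; the descriptor choice, top-k pose averaging, and the omission of orientation handling are simplifying assumptions, not the benchmark's protocol.

```python
import numpy as np

def approximate_pose_by_retrieval(query_desc, db_descs, db_poses, k=1):
    """Sketch of retrieval-based pose approximation.

    query_desc: (D,) L2-normalized global descriptor of the query image
    db_descs:   (N, D) descriptors of the mapped database images
    db_poses:   (N, 3) camera positions (orientation handling omitted here)
    Returns the mean position of the top-k retrieved images.
    """
    sims = db_descs @ query_desc                 # cosine similarity for normalized vectors
    top = np.argsort(-sims)[:k]                  # indices of the k most similar images
    return db_poses[top].mean(axis=0)
```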
"Benchmarking Image Retrieval for Visual Localization" from @Poyonoz @SattlerTorsten
— Dmytro Mishkin (@ducha_aiki) November 25, 2020
is so cool, that I wrote an overview.
And my take-home messages are different from the paper conclusions ;)https://t.co/ffUmtMkWQL
paper: https://t.co/HQghxgU3oF
tl;dr below:
1/4 pic.twitter.com/dKzvzi64qv
If you are attending #3DV2020, please stop by our poster today: @Poyonoz, Martin Humenberger, Gabriela Csurka, Yohann Cabon, Torsten Sattler, Benchmarking Image Retrieval for Visual Localization, 3DV 2020, https://t.co/GOQrMwvAWr
— Torsten Sattler (@SattlerTorsten) November 25, 2020
Times (CET): 7am and 5:30pm
7. Adversarial Generation of Continuous Images
Ivan Skorokhodov, Savva Ignatyev, Mohamed Elhoseiny
In most existing learning systems, images are typically viewed as 2D pixel arrays. However, in another paradigm gaining popularity, a 2D image is represented as an implicit neural representation (INR) — an MLP that predicts an RGB pixel value given its (x,y) coordinate. In this paper, we propose two novel architectural techniques for building INR-based image decoders: factorized multiplicative modulation and multi-scale INRs, and use them to build a state-of-the-art continuous image GAN. Previous attempts to adapt INRs for image generation were limited to MNIST-like datasets and do not scale to complex real-world data. Our proposed architectural design improves the performance of continuous image generators by a factor of 6-40 and reaches FID scores of 6.27 on LSUN bedroom 256x256 and 16.32 on FFHQ 1024x1024, greatly reducing the gap between continuous image GANs and pixel-based ones. To the best of our knowledge, these are the highest reported scores for an image generator that consists entirely of fully-connected layers. Apart from that, we explore several exciting properties of INR-based decoders, such as out-of-the-box superresolution, meaningful image-space interpolation, accelerated inference of low-resolution images, the ability to extrapolate outside of image boundaries, and a strong geometric prior. The source code is available at https://github.com/universome/inr-gan.
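As a reminder of what the basic INR decoder looks like, here is a minimal coordinate MLP that renders an image by evaluating a grid of (x, y) positions. The paper's generator adds factorized multiplicative modulation and multi-scale INRs on top of this, which are not shown, and the layer sizes here are arbitrary.

```python
import torch
import torch.nn as nn

class CoordinateMLP(nn.Module):
    """Minimal INR-style image decoder: an MLP mapping (x, y) -> RGB."""
    def __init__(self, hidden=256, layers=4):
        super().__init__()
        dims = [2] + [hidden] * layers
        blocks = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.ReLU(inplace=True)]
        self.net = nn.Sequential(*blocks, nn.Linear(hidden, 3), nn.Sigmoid())

    def render(self, H, W):
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
        coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)     # (H*W, 2) pixel coordinates
        rgb = self.net(coords)                                    # one RGB value per coordinate
        return rgb.reshape(H, W, 3).permute(2, 0, 1)              # (3, H, W) image

img = CoordinateMLP().render(64, 64)   # the same network can be queried at any resolution
```

Because the decoder is defined on continuous coordinates, calling render with a larger grid is what gives the out-of-the-box superresolution mentioned in the abstract.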
Adversarial Generation of Continuous Images
— AK (@ak92501) November 25, 2020
pdf: https://t.co/PPc0geUIKL
abs: https://t.co/AHyozdAHXc pic.twitter.com/10c4zH558N
8. GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields
Michael Niemeyer, Andreas Geiger
Deep generative models allow for photorealistic image synthesis at high resolutions. But for many applications, this is not enough: content creation also needs to be controllable. While several recent works investigate how to disentangle underlying factors of variation in the data, most of them operate in 2D and hence ignore that our world is three-dimensional. Further, only a few works consider the compositional nature of scenes. Our key hypothesis is that incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis. Representing scenes as compositional generative neural feature fields allows us to disentangle one or multiple objects from the background as well as individual objects’ shapes and appearances while learning from unstructured and unposed image collections without any additional supervision. Combining this scene representation with a neural rendering pipeline yields a fast and realistic image synthesis model. As evidenced by our experiments, our model is able to disentangle individual objects and allows for translating and rotating them in the scene as well as changing the camera pose.
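The compositional step can be read as density-weighted feature averaging over per-object fields evaluated at the same 3D sample points; the sketch below follows that reading in simplified form, with the shapes and the numerical epsilon as placeholders.

```python
import torch

def composite_feature_fields(densities, features):
    """Sketch of compositing per-object feature fields at sampled 3D points.

    densities: (N_objects, ..., 1) per-object volume densities at the same points
    features:  (N_objects, ..., F) per-object feature vectors at those points
    Densities are summed and features are density-weighted averaged, so each
    object (and the background) contributes where it is "present".
    """
    sigma = densities.sum(dim=0)                                  # combined density
    feat = (densities * features).sum(dim=0) / (sigma + 1e-8)     # density-weighted features
    return sigma, feat

# toy usage: two objects, 1024 sampled points, 32-dimensional features
sigma, feat = composite_feature_fields(torch.rand(2, 1024, 1), torch.rand(2, 1024, 32))
```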
GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields
— AK (@ak92501) November 25, 2020
pdf: https://t.co/0zfGaDDRWQ
abs: https://t.co/xxlpLZbVGx pic.twitter.com/5WraAgsh83
9. Energy-Based Models for Continual Learning
Shuang Li, Yilun Du, Gido M. van de Ven, Antonio Torralba, Igor Mordatch
We motivate Energy-Based Models (EBMs) as a promising model class for continual learning problems. Instead of tackling continual learning via the use of external memory, growing models, or regularization, EBMs have a natural way to support a dynamically-growing number of tasks or classes that causes less interference with previously learned information. We find that EBMs outperform the baseline methods by a large margin on several continual learning benchmarks. We also show that EBMs are adaptable to a more general continual learning setting where the data distribution changes without the notion of explicitly delineated tasks. These observations point towards EBMs as a class of models naturally inclined towards the continual learning regime.
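One way to picture how an EBM sidesteps a fixed-size classification head: score (x, y) pairs with an energy built from a class embedding, add embedding rows when new classes arrive, and predict by taking the lowest energy over the classes seen so far. The sketch below uses a placeholder encoder and omits the training loss, so it illustrates the interface rather than the paper's model.

```python
import torch
import torch.nn as nn

class ClassConditionalEBM(nn.Module):
    """Sketch of a class-conditional energy model for class-incremental learning."""
    def __init__(self, feat_dim=128, max_classes=100):
        super().__init__()
        # placeholder encoder for flattened 28x28 inputs
        self.encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        self.class_emb = nn.Embedding(max_classes, feat_dim)

    def energy(self, x, y):
        return -(self.encoder(x) * self.class_emb(y)).sum(dim=-1)   # low energy = good match

    def predict(self, x, seen_classes):
        ys = torch.tensor(seen_classes)
        E = torch.stack([self.energy(x, torch.full((x.shape[0],), int(y))) for y in ys], dim=1)
        return ys[E.argmin(dim=1)]                                   # lowest-energy seen class
```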
Excited to share our work investigating energy-based models for continual learning and how they are naturally less prone to catastrophic forgetting: https://t.co/IFwBaJwVnw with fantastic collaborators @ShuangL13799063 @du_yilun @GMvandeVen and A. Torralba
— Igor Mordatch (@IMordatch) November 25, 2020
Energy-based models are a class of flexible, powerful models with applications in many areas of deep learning. Could energy-based models also be useful for continual learning?
— Gido van de Ven (@GMvandeVen) November 25, 2020
Yes! https://t.co/bp2Huaz9iV Work led by @ShuangL13799063.
10. From Pixels to Legs: Hierarchical Learning of Quadruped Locomotion
Deepali Jain, Atil Iscen, Ken Caluwaerts
Legged robots navigating crowded scenes and complex terrains in the real world are required to execute dynamic leg movements while processing visual input for obstacle avoidance and path planning. We show that a quadruped robot can acquire both of these skills by means of hierarchical reinforcement learning (HRL). By virtue of their hierarchical structure, our policies learn to implicitly break down this joint problem by concurrently learning High Level (HL) and Low Level (LL) neural network policies. These two levels are connected by a low-dimensional hidden layer, which we call the latent command. HL receives a first-person camera view, whereas LL receives the latent command from HL and the robot’s on-board sensors to control its actuators. We train policies to walk in two different environments: a curved cliff and a maze. We show that hierarchical policies can concurrently learn to locomote and navigate in these environments, and that they are more efficient than non-hierarchical neural network policies. This architecture also allows for knowledge reuse across tasks: LL networks trained on one task can be transferred to a new task in a new environment. Finally, HL, which processes camera images, can be evaluated at much lower and varying frequencies compared to LL, thus reducing computation times and bandwidth requirements.
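The HL/LL split can be sketched as two networks running at different rates, with the latent command cached between HL updates. All dimensions and layer sizes below are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HierarchicalPolicy(nn.Module):
    """Sketch of the two-level control structure (dimensions are placeholders).

    A high-level (HL) network maps camera images to a low-dimensional latent
    command at a low rate; a low-level (LL) network maps the latest latent
    command plus proprioception to actuator targets at every control step.
    """
    def __init__(self, latent_dim=8, proprio_dim=24, action_dim=12, hl_every=10):
        super().__init__()
        self.hl = nn.Sequential(nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(), nn.Flatten(),
                                nn.LazyLinear(latent_dim))
        self.ll = nn.Sequential(nn.Linear(latent_dim + proprio_dim, 128), nn.ReLU(),
                                nn.Linear(128, action_dim))
        self.hl_every = hl_every
        self.latent = torch.zeros(1, latent_dim)     # cached latent command

    def act(self, step, image, proprio):
        if step % self.hl_every == 0:                # HL runs at a lower, cheaper frequency
            self.latent = self.hl(image)
        return self.ll(torch.cat([self.latent, proprio], dim=-1))
```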
From Pixels to Legs: Hierarchical Learning of Quadruped Locomotion
— AK (@ak92501) November 25, 2020
pdf: https://t.co/R9bu202ERa
abs: https://t.co/beZtPYJMmq pic.twitter.com/JCb6mKbHoj