1. Contact and Human Dynamics from Monocular Video
Davis Rempe, Leonidas J. Guibas, Aaron Hertzmann, Bryan Russell, Ruben Villegas, Jimei Yang
Existing deep models predict 2D and 3D kinematic poses from video that are approximately accurate, but contain visible errors that violate physical constraints, such as feet penetrating the ground and bodies leaning at extreme angles. In this paper, we present a physics-based method for inferring 3D human motion from video sequences that takes initial 2D and 3D pose estimates as input. We first estimate ground contact timings with a novel prediction network which is trained without hand-labeled data. A physics-based trajectory optimization then solves for a physically-plausible motion, based on the inputs. We show this process produces motions that are significantly more realistic than those from purely kinematic methods, substantially improving quantitative measures of both kinematic and dynamic plausibility. We demonstrate our method on character animation and pose estimation tasks on dynamic motions of dancing and sports with complex contact patterns.
Contact and Human Dynamics from Monocular Video
— AK (@ak92501) July 24, 2020
pdf: https://t.co/ECy6u7VOcq
abs: https://t.co/SqN67KcaUw
project page: https://t.co/0ez3bjGunw pic.twitter.com/2tIAjrHvQD
Our new #ECCV2020 paper uses learned foot contact estimation and trajectory optimization to reconstruct physically-plausible 3D human motion from video. Check it out!
— Davis Rempe (@davrempe) July 24, 2020
Paper: https://t.co/1JDnFYpwy5
Video: https://t.co/IO9MrHob6t
Project: https://t.co/GTijegtYD1 pic.twitter.com/9zc5PqIf3r
2. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM
Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, Juan D. Tardós
This paper presents ORB-SLAM3, the first system able to perform visual, visual-inertial and multi-map SLAM with monocular, stereo and RGB-D cameras, using pin-hole and fisheye lens models. The first main novelty is a feature-based tightly-integrated visual-inertial SLAM system that fully relies on Maximum-a-Posteriori (MAP) estimation, even during the IMU initialization phase. The result is a system that operates robustly in real-time, in small and large, indoor and outdoor environments, and is 2 to 5 times more accurate than previous approaches. The second main novelty is a multiple map system that relies on a new place recognition method with improved recall. Thanks to it, ORB-SLAM3 is able to survive to long periods of poor visual information: when it gets lost, it starts a new map that will be seamlessly merged with previous maps when revisiting mapped areas. Compared with visual odometry systems that only use information from the last few seconds, ORB-SLAM3 is the first system able to reuse in all the algorithm stages all previous information. This allows to include in bundle adjustment co-visible keyframes, that provide high parallax observations boosting accuracy, even if they are widely separated in time or if they come from a previous mapping session. Our experiments show that, in all sensor configurations, ORB-SLAM3 is as robust as the best systems available in the literature, and significantly more accurate. Notably, our stereo-inertial SLAM achieves an average accuracy of 3.6 cm on the EuRoC drone and 9 mm under quick hand-held motions in the room of TUM-VI dataset, a setting representative of AR/VR scenarios. For the benefit of the community we make public the source code.
ORB-SLAM3 ?! !!
— Giseop Kim (@GiseopK) July 24, 2020
ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAMhttps://t.co/sPWH5dK3pe
videos https://t.co/JBhgAfi2J3
We are proud to announce ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM.
— Mingo Tardos (@mingo_tardos) July 24, 2020
Paper: https://t.co/y77D4UmzNN
Code: https://t.co/1fwh9MKCM9
Videos: https://t.co/wWO863EqW4
3. Whole-Body Human Pose Estimation in the Wild
Sheng Jin, Lumin Xu, Jin Xu, Can Wang, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo
This paper investigates the task of 2D human whole-body pose estimation, which aims to localize dense landmarks on the entire human body including face, hands, body, and feet. As existing datasets do not have whole-body annotations, previous methods have to assemble different deep models trained independently on different datasets of the human face, hand, and body, struggling with dataset biases and large model complexity. To fill in this blank, we introduce COCO-WholeBody which extends COCO dataset with whole-body annotations. To our best knowledge, it is the first benchmark that has manual annotations on the entire human body, including 133 dense landmarks with 68 on the face, 42 on hands and 23 on the body and feet. A single-network model, named ZoomNet, is devised to take into account the hierarchical structure of the full human body to solve the scale variation of different body parts of the same person. ZoomNet is able to significantly outperform existing methods on the proposed COCO-WholeBody dataset. Extensive experiments show that COCO-WholeBody not only can be used to train deep models from scratch for whole-body pose estimation but also can serve as a powerful pre-training dataset for many different tasks such as facial landmark detection and hand keypoint estimation. The dataset is publicly available at https://github.com/jin-s13/COCO-WholeBody.
Whole-Body Human Pose Estimation in the Wild
— AK (@ak92501) July 24, 2020
pdf: https://t.co/M3WDRjKHGu
abs: https://t.co/CdiEIztSjk
github: https://t.co/9c8PqBDowz pic.twitter.com/FlspnRTAXu
4. Online Robust and Adaptive Learning from Data Streams
Shintaro Fukushima, Atsushi Nitanda, Kenji Yamanishi
In online learning from non-stationary data streams, it is both necessary to learn robustly to outliers and to adapt to changes of underlying data generating mechanism quickly. In this paper, we refer to the former nature of online learning algorithms as robustness and the latter as adaptivity. There is an obvious tradeoff between them. It is a fundamental issue to quantify and evaluate the tradeoff because it provides important information on the data generating mechanism. However, no previous work has considered the tradeoff quantitatively. We propose a novel algorithm called the Stochastic approximation-based Robustness-Adaptivity algorithm (SRA) to evaluate the tradeoff. The key idea of SRA is to update parameters of distribution or sufficient statistics with the biased stochastic approximation scheme, while dropping data points with large values of the stochastic update. We address the relation between two parameters, one of which is the step size of the stochastic approximation, and the other is the threshold parameter of the norm of the stochastic update. The former controls the adaptivity and the latter does the robustness. We give a theoretical analysis for the non-asymptotic convergence of SRA in the presence of outliers, which depends on both the step size and the threshold parameter. Since SRA is formulated on the majorization-minimization principle, it is a general algorithm including many algorithms, such as the online EM algorithm and stochastic gradient descent. Empirical experiments for both synthetic and real datasets demonstrated that SRA was superior to previous methods.
プレプリントを公開しました.時系列データのオンライン学習において,外れ値への頑健性(robustness)と変化への適応性(adaptivity)の間に存在するトレードオフを評価しています.確率的近似(stochastic approximation)の収束性を用いて上記トレードオフを議論しています.https://t.co/u76LYR5eAg
— Shintaro FUKUSHIMA (@shifukushima) July 24, 2020
5. TSIT: A Simple and Versatile Framework for Image-to-Image Translation
Liming Jiang, Changxu Zhang, Mingyang Huang, Chunxiao Liu, Jianping Shi, Chen Change Loy
We introduce a simple and versatile framework for image-to-image translation. We unearth the importance of normalization layers, and provide a carefully designed two-stream generative model with newly proposed feature transformations in a coarse-to-fine fashion. This allows multi-scale semantic structure information and style representation to be effectively captured and fused by the network, permitting our method to scale to various tasks in both unsupervised and supervised settings. No additional constraints (e.g., cycle consistency) are needed, contributing to a very clean and simple method. Multi-modal image synthesis with arbitrary style control is made possible. A systematic study compares the proposed method with several state-of-the-art task-specific baselines, verifying its effectiveness in both perceptual quality and quantitative evaluations.
TSIT: A Simple and Versatile Framework for Image-to-Image Translation
— AK (@ak92501) July 24, 2020
pdf: https://t.co/mEUvgIF1q5
abs: https://t.co/1QdmjapdWb
github: https://t.co/SxcLoIsHGY pic.twitter.com/vaG2mSP4Fa
6. Funnel Activation for Visual Recognition
Ningning Ma, Xiangyu Zhang, Jian Sun
We present a conceptually simple but effective funnel activation for image recognition tasks, called Funnel activation (FReLU), that extends ReLU and PReLU to a 2D activation by adding a negligible overhead of spatial condition. The forms of ReLU and PReLU are y = max(x, 0) and y = max(x, px), respectively, while FReLU is in the form of y = max(x,T(x)), where T(x) is the 2D spatial condition. Moreover, the spatial condition achieves a pixel-wise modeling capacity in a simple way, capturing complicated visual layouts with regular convolutions. We conduct experiments on ImageNet, COCO detection, and semantic segmentation tasks, showing great improvements and robustness of FReLU in the visual recognition tasks.
Funnel Activation for Visual Recognitionhttps://t.co/J0AVaaD50D
— phalanx (@ZFPhalanx) July 24, 2020
良さげ。 pic.twitter.com/bxP7EZEPDi
7. Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval
Andrew Brown, Weidi Xie, Vicky Kalogeiton, Andrew Zisserman
Optimising a ranking-based metric, such as Average Precision (AP), is notoriously challenging due to the fact that it is non-differentiable, and hence cannot be optimised directly using gradient-descent methods. To this end, we introduce an objective that optimises instead a smoothed approximation of AP, coined Smooth-AP. Smooth-AP is a plug-and-play objective function that allows for end-to-end training of deep networks with a simple and elegant implementation. We also present an analysis for why directly optimising the ranking based metric of AP offers benefits over other deep metric learning losses. We apply Smooth-AP to standard retrieval benchmarks: Stanford Online products and VehicleID, and also evaluate on larger-scale datasets: INaturalist for fine-grained category retrieval, and VGGFace2 and IJB-C for face retrieval. In all cases, we improve the performance over the state-of-the-art, especially for larger-scale datasets, thus demonstrating the effectiveness and scalability of Smooth-AP to real-world scenarios.
We boost the performance of image retrieval systems by directly optimising a Smoothed Average Precision loss, Smooth-AP! By @AndrewB88429285 @WeidiXie @VickyKalogeiton and Andrew Zisserman.
— Visual Geometry Group (VGG) (@Oxford_VGG) July 24, 2020
website: https://t.co/3LAvbzl7Hk
arXiv: https://t.co/sXZOWzsVAm#ECCV2020 pic.twitter.com/gQvgs6l53n