1. What Should Not Be Contrastive in Contrastive Learning
Tete Xiao, Xiaolong Wang, Alexei A. Efros, Trevor Darrell
Recent self-supervised contrastive methods have been able to produce impressive transferable visual representations by learning to be invariant to different data augmentations. However, these methods implicitly assume a particular set of representational invariances (e.g., invariance to color), and can perform poorly when a downstream task violates this assumption (e.g., distinguishing red vs. yellow cars). We introduce a contrastive learning framework which does not require prior knowledge of specific, task-dependent invariances. Our model learns to capture varying and invariant factors for visual representations by constructing separate embedding spaces, each of which is invariant to all but one augmentation. We use a multi-head network with a shared backbone which captures information across each augmentation and alone outperforms all baselines on downstream tasks. We further find that the concatenation of the invariant and varying spaces performs best across all tasks we investigate, including coarse-grained, fine-grained, and few-shot downstream classification tasks, and various data corruptions.
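As a rough illustration of the multi-head idea described above, here is a minimal PyTorch sketch (class names, head count, and layer sizes are our own assumptions, not the authors' released code): a shared ResNet backbone feeds one "all-invariant" projection head plus one head per augmentation, and downstream tasks can concatenate the resulting embedding spaces.

```python
# Hypothetical sketch of a shared-backbone, multi-head contrastive encoder.
import torch
import torch.nn as nn
import torchvision.models as models

class MultiHeadContrastive(nn.Module):
    def __init__(self, num_augmentations=3, dim=128):
        super().__init__()
        backbone = models.resnet50()
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()          # shared backbone
        self.backbone = backbone
        # one "all-invariant" head plus one head per augmentation;
        # each head defines a separate embedding space
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                          nn.Linear(feat_dim, dim))
            for _ in range(num_augmentations + 1)
        ])

    def forward(self, x):
        h = self.backbone(x)
        # L2-normalized embedding in every space; downstream tasks can
        # concatenate the invariant and varying spaces
        return [nn.functional.normalize(head(h), dim=-1) for head in self.heads]
```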
What Should Not Be Contrastive in Contrastive Learning.
— Tomasz Malisiewicz (@quantombone) August 14, 2020
The model learns to capture varying and invariant factors for visual representations by constructing separate embedding spaces, each of which is invariant to all but one augmentation.https://t.co/YKQUzLN6TH#computervision pic.twitter.com/l7SzTkCvMG
2. Full-Body Awareness from Partial Observations
Chris Rockwell, David F. Fouhey
There has been great progress in human 3D mesh recovery and great interest in learning about the world from consumer video data. Unfortunately, current methods for 3D human mesh recovery work rather poorly on consumer video data, since on the Internet, unusual camera viewpoints and aggressive truncations are the norm rather than a rarity. We study this problem and make a number of contributions to address it: (i) we propose a simple but highly effective self-training framework that adapts human 3D mesh recovery systems to consumer videos and demonstrate its application to two recent systems; (ii) we introduce evaluation protocols and keypoint annotations for 13K frames across four consumer video datasets for studying this task, including evaluations on out-of-image keypoints; and (iii) we show that our method substantially improves PCK and human-subject judgments compared to baselines, both on test videos from the dataset it was trained on, as well as on three other datasets without further adaptation. Project website: https://crockwell.github.io/partial_humans
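The general shape of a pseudo-labeling self-training loop of this kind can be sketched as follows (a simplified illustration, not the paper's pipeline; `confidence_fn` and the loss are placeholder assumptions): run a pretrained mesh-recovery model over unlabeled consumer video, keep its confident predictions as pseudo-labels, and fine-tune on them.

```python
# Minimal self-training sketch with placeholder components.
import torch

def self_train(model, unlabeled_loader, confidence_fn, optimizer,
               threshold=0.9, epochs=1):
    # 1) generate pseudo-labels with the frozen pretrained model
    model.eval()
    pseudo = []
    with torch.no_grad():
        for frames in unlabeled_loader:
            preds = model(frames)                 # predicted mesh/keypoints
            mask = confidence_fn(preds) > threshold
            if mask.any():                        # keep confident frames only
                pseudo.append((frames[mask], preds[mask]))

    # 2) fine-tune the same model on its own confident predictions
    model.train()
    for _ in range(epochs):
        for frames, targets in pseudo:
            loss = torch.nn.functional.mse_loss(model(frames), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```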
How can we understand humans in internet video? Our #ECCV2020 work presents a simple but highly effective method for self-training on unlabeled video! We annotate four datasets to evaluate & show large gains.
— Chris Rockwell (@_crockwell) August 14, 2020
Project Page: https://t.co/tLLNqkLqhS
arXiv: https://t.co/nW0WOJNfUP pic.twitter.com/xx39I6UJAA
3. Compiling a Higher-Order Smart Contract Language to LLVM
Vaivaswatha Nagaraj, Jacob Johannsen, Anton Trunov, George Pîrlea, Amrit Kumar, Ilya Sergey
Scilla is a higher-order, polymorphic, typed, intermediate-level language for implementing smart contracts. In this talk, we describe a Scilla compiler targeting LLVM, with a focus on mapping Scilla types, values, and its functional language constructs to LLVM-IR. The compiled LLVM-IR, when executed with LLVM’s JIT framework, achieves a speedup of about 10x over the reference interpreter on a typical Scilla contract. This reduced latency is crucial in the setting of blockchains, where smart contracts are executed as parts of transactions, to achieve peak transactions processed per second. Experiments on the Ackermann function achieved a speedup of more than 45x. This talk abstract is aimed both at programming language researchers looking to implement an LLVM-based compiler for their functional language, and at LLVM practitioners.
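To make the execution model concrete (this is not the Scilla compiler itself, just a generic illustration of JIT-compiling and running LLVM-IR, here from Python via llvmlite), the flow is: parse the IR, hand it to an MCJIT engine, and call the resulting native function.

```python
# Generic LLVM JIT sketch with llvmlite; the IR and function are toy examples.
from ctypes import CFUNCTYPE, c_int
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

llvm_ir = r"""
define i32 @add1(i32 %x) {
entry:
  %r = add i32 %x, 1
  ret i32 %r
}
"""

target_machine = llvm.Target.from_default_triple().create_target_machine()
backing_mod = llvm.parse_assembly("")            # empty module to back the engine
engine = llvm.create_mcjit_compiler(backing_mod, target_machine)

mod = llvm.parse_assembly(llvm_ir)
mod.verify()
engine.add_module(mod)
engine.finalize_object()                         # compile to native code

func_ptr = engine.get_function_address("add1")
add1 = CFUNCTYPE(c_int, c_int)(func_ptr)
print(add1(41))                                  # -> 42
```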
Scilla to LLVM compiler project by @VaivaswathaN is in full swing for $ZIL. We are seeing 10x improvement on performance with the compiler compared to the ref interpreter. Details on the mapping in this paper: https://t.co/tx1wibceKp
— Amrit Kummer (@maqstik) August 14, 2020
CC: @secondstateinc @stevanlohja @etclabs pic.twitter.com/A1czqqPrsD
4. Powers of layers for image-to-image translation
Hugo Touvron, Matthijs Douze, Matthieu Cord, Hervé Jégou
We propose a simple architecture to address unpaired image-to-image translation tasks: style or class transfer, denoising, deblurring, deblocking, etc. We start from an image autoencoder architecture with fixed weights. For each task we learn a residual block operating in the latent space, which is iteratively applied until the target domain is reached. A specific training schedule is required to alleviate the exponentiation effect of the iterations. At test time, this design offers several advantages: the number of weight parameters is limited and the compositional design allows one to modulate the strength of the transformation with the number of iterations. This is useful, for instance, when the type or amount of noise to suppress is not known in advance. Experimentally, we provide proofs of concept demonstrating the merit of our method on many transformations. The performance of our model is comparable to or better than that of CycleGAN, with significantly fewer parameters.
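A minimal sketch of this idea, assuming simple convolutional shapes (the encoder/decoder stand in for the fixed pretrained autoencoder and are not the authors' architecture): only a single residual block is trained per task, and repeating it controls the strength of the translation.

```python
# "Powers of layers" sketch: frozen autoencoder + one learned residual block
# applied N times in latent space.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, z):
        return z + self.body(z)        # one iteration; composition = repeated calls

def translate(encoder, decoder, block, x, num_iters=4):
    # encoder/decoder weights are frozen (requires_grad=False elsewhere);
    # only `block` is learned for each task.
    with torch.no_grad():
        z = encoder(x)
    for _ in range(num_iters):         # more iterations -> stronger transformation
        z = block(z)
    return decoder(z)
```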
Powers of layers for image-to-image translation
— AK (@ak92501) August 14, 2020
pdf: https://t.co/bdW96IkDF1
abs: https://t.co/qbm88TJ8Ej pic.twitter.com/pQVm80W4SG
5. Generating Person-Scene Interactions in 3D Scenes
Siwei Zhang, Yan Zhang, Qianli Ma, Michael J. Black, Siyu Tang
High fidelity digital 3D environments have been proposed in recent years; however, it remains extremely challenging to automatically equip such environments with realistic human bodies. Existing work utilizes images, depths, or semantic maps to represent the scene, and parametric human models to represent 3D bodies in the scene. While straightforward, the generated human-scene interactions often lack naturalness and physical plausibility. Our key observation is that humans interact with the world through body-scene contact. Explicitly and effectively representing the physical contact between the body and the world is essential for modeling human-scene interaction. To that end, we propose a novel interaction representation, which explicitly encodes the proximity between the human body and the 3D scene around it. Specifically, given a set of basis points on a scene mesh, we leverage a conditional variational autoencoder to synthesize the distance from every basis point to its closest point on a human body. The synthesized proximal relationship between the human body and the scene indicates which regions a person tends to contact. Furthermore, based on such synthesized proximity, we can effectively obtain expressive 3D human bodies that naturally interact with the 3D scene. Our perceptual study shows that our model significantly improves over the state-of-the-art method, approaching the realism of real human-scene interaction. We believe our method is an important step towards the fully automatic synthesis of realistic 3D human bodies in 3D scenes. Our code and model will be publicly available for research purposes.
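The conditional-VAE part of the pipeline can be sketched roughly as follows (an illustration under assumed dimensions, not the released model): the network encodes the vector of basis-point-to-body distances conditioned on a scene feature, and at test time samples new proximity vectors from the prior.

```python
# Sketch of a conditional VAE over per-basis-point distances.
import torch
import torch.nn as nn

class ProximityCVAE(nn.Module):
    def __init__(self, num_basis_points=1024, scene_dim=512, latent_dim=32):
        super().__init__()
        self.latent_dim = latent_dim
        self.enc = nn.Sequential(
            nn.Linear(num_basis_points + scene_dim, 512), nn.ReLU(),
            nn.Linear(512, 2 * latent_dim),
        )
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + scene_dim, 512), nn.ReLU(),
            nn.Linear(512, num_basis_points), nn.Softplus(),  # distances >= 0
        )

    def forward(self, distances, scene_feat):
        mu, logvar = self.enc(torch.cat([distances, scene_feat], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        recon = self.dec(torch.cat([z, scene_feat], -1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon, kl

    def sample(self, scene_feat):
        # draw a proximity vector for a new scene from the prior
        z = torch.randn(scene_feat.size(0), self.latent_dim, device=scene_feat.device)
        return self.dec(torch.cat([z, scene_feat], -1))
```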
Generating Person-Scene Interactions in 3D Scenes
— AK (@ak92501) August 14, 2020
pdf: https://t.co/TSUtX7llnT
abs: https://t.co/JzfR21G8F6 pic.twitter.com/vPK66UgDN0
6. Overcoming Model Bias for Robust Offline Deep Reinforcement Learning
Phillip Swazinna, Steffen Udluft, Thomas Runkler
State-of-the-art reinforcement learning algorithms mostly rely on being allowed to directly interact with their environment to collect millions of observations. This makes it hard to transfer their success to industrial control problems, where simulations are often very costly or do not exist at all. Furthermore, interacting with (and especially exploring in) the real, physical environment has the potential to lead to catastrophic events. We thus propose a novel model-based RL algorithm, called MOOSE (MOdel-based Offline policy Search with Ensembles), which can train a policy from a pre-existing, fixed dataset. It ensures that dynamics models are able to accurately assess policy performance by constraining the policy to stay within the support of the data. We deliberately design MOOSE to be similar to the state-of-the-art model-free, offline (a.k.a. batch) RL algorithms BEAR and BCQ, with the main difference being that our algorithm is model-based. We compare the algorithms on the Industrial Benchmark and MuJoCo continuous control tasks in terms of robust performance and find that MOOSE almost always outperforms its model-free counterparts by a wide margin.
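A heavily simplified sketch of this model-based offline recipe (illustrative names and placeholder components such as `action_vae` and `reward_fn`, not the authors' MOOSE implementation): an ensemble of learned dynamics models estimates the policy's return from the fixed dataset, and a reconstruction-based penalty discourages actions outside the data support.

```python
# Sketch: policy objective = model-estimated return minus out-of-support penalty.
import torch

def policy_loss(policy, dynamics_ensemble, reward_fn, action_vae,
                start_states, horizon=10, penalty_weight=1.0, gamma=0.99):
    s = start_states
    ret, penalty = 0.0, 0.0
    for t in range(horizon):
        a = policy(s)
        # behavior penalty: actions the dataset VAE cannot reconstruct are
        # treated as outside the support of the data
        penalty = penalty + ((action_vae(s, a) - a) ** 2).mean()
        ret = ret + (gamma ** t) * reward_fn(s, a).mean()
        # roll the state forward with the mean of the model ensemble
        preds = torch.stack([m(s, a) for m in dynamics_ensemble])
        s = preds.mean(0)
    # maximize return while staying within the data support
    return -(ret - penalty_weight * penalty)
```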
Overcoming Model Bias for Robust Offline Deep Reinforcement Learning. #DataScience #BigData #IoT #Python #RStats #TensorFlow #Java #JavaScript #ReactJS #GoLang #Serverless #IIoT #Linux #AI #Programming #DeepLearning #MachineLearning #ArtificialIntelligencehttps://t.co/21PhHaUHe0 pic.twitter.com/ynDiyMsdRK
— Marcus Borba (@marcusborba) August 14, 2020