1. On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on conventional deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.
Stanford's ~entire AI Department has just released a 200 page 100 author Neural Scaling Laws Manifesto.
— Ethan Caballero (@ethancaballero) August 17, 2021
They're pivoting to positioning themselves as #1 at academic ML Scaling (e.g. GPT-4) research.
"On the Opportunities and Risks of Foundation Models"https://t.co/rFNh0m2CmB pic.twitter.com/B6i0zbGLGU
2. Learning Open-World Object Proposals without Learning to Classify
Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, Weicheng Kuo
Object proposals have become an integral preprocessing step of many vision pipelines, including object detection, weakly supervised detection, object discovery, tracking, etc. Compared to learning-free methods, learning-based proposals have become popular recently due to the growing interest in object detection. The common paradigm is to learn object proposals from data labeled with a set of object regions and their corresponding categories. However, this approach often struggles with novel objects in the open world that are absent from the training set. In this paper, we identify the problem: the binary classifiers in existing proposal methods tend to overfit to the training categories. Therefore, we propose a classification-free Object Localization Network (OLN) which estimates the objectness of each region purely by how well the location and shape of a region overlap with any ground-truth object (e.g., centerness and IoU). This simple strategy learns generalizable objectness and outperforms existing proposals on cross-category generalization on COCO, as well as cross-dataset evaluation on RoboNet, Object365, and EpicKitchens. Finally, we demonstrate the merit of OLN for long-tail object detection on the large-vocabulary dataset LVIS, where we observe clear improvements in rare and common categories.
Learning Open-World Object Proposals without Learning to Classify
— AK (@ak92501) August 17, 2021
pdf: https://t.co/UxOBMGqpir
abs: https://t.co/mFo2d8JdGJ pic.twitter.com/cu8d5JTx1c
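To make the classification-free objectness idea concrete, here is a minimal NumPy sketch of the two localization-quality cues the abstract mentions, centerness and IoU. The function names, the FCOS-style centerness definition, and the geometric-mean combination are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def centerness(point, box):
    """Centerness of a point (x, y) inside a ground-truth box (x1, y1, x2, y2).

    FCOS-style definition: square root of the product of the smaller-to-larger
    ratios of the left/right and top/bottom distances. Returns 0 outside the box.
    """
    x, y = point
    x1, y1, x2, y2 = box
    l, r = x - x1, x2 - x
    t, b = y - y1, y2 - y
    if min(l, r, t, b) <= 0:
        return 0.0
    return np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# Objectness is supervised by localization quality alone, with no category
# label involved; combining the two cues by a geometric mean is an assumption
# made here purely for illustration.
gt_box = (10, 10, 110, 210)
proposal = (20, 30, 120, 200)
score = np.sqrt(centerness((70, 110), gt_box) * iou(proposal, gt_box))
print(f"objectness target: {score:.3f}")
```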
3. SOTR: Segmenting Objects with Transformers
Ruohao Guo, Dantong Niu, Liao Qu, Zhenbo Li
Most recent transformer-based models show impressive performance on vision tasks, even better than Convolutional Neural Networks (CNNs). In this work, we present a novel, flexible, and effective transformer-based model for high-quality instance segmentation. The proposed method, Segmenting Objects with TRansformers (SOTR), simplifies the segmentation pipeline, building on an alternative CNN backbone appended with two parallel subtasks: (1) predicting per-instance categories via the transformer and (2) dynamically generating segmentation masks with a multi-level upsampling module. SOTR can effectively extract lower-level feature representations and capture long-range context dependencies via the Feature Pyramid Network (FPN) and the twin transformer, respectively. Meanwhile, compared with the original transformer, the proposed twin transformer is time- and resource-efficient since only row and column attention are involved in encoding pixels. Moreover, SOTR can easily be incorporated with various CNN backbones and transformer model variants, yielding considerable improvements in segmentation accuracy and training convergence. Extensive experiments show that SOTR performs well on the MS COCO dataset and surpasses state-of-the-art instance segmentation approaches. We hope our simple but strong framework can serve as a preferred baseline for instance-level recognition. Our code is available at https://github.com/easton-cau/SOTR.
SOTR: Segmenting Objects with Transformers
— AK (@ak92501) August 17, 2021
pdf: https://t.co/eplIKD4mgZ
abs: https://t.co/ARAaQ7VJAe
github: https://t.co/XlVZrJh25P
performs well on the MS COCO dataset and surpasses sota instance segmentation approaches pic.twitter.com/06tH3XPtKQ
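The twin transformer's efficiency argument, attending only along rows and then along columns of a feature map, can be sketched in a few lines of PyTorch. This is a generic axial-attention sketch under assumed shapes and module names, not the SOTR code; the actual model also handles positional information, the FPN backbone, and the dynamic mask head.

```python
import torch
import torch.nn as nn

class RowColumnAttention(nn.Module):
    """Sketch of row-then-column (axial) self-attention over a feature map.

    Attention is applied independently along each row and then along each
    column, so the cost scales with H*W*(H + W) instead of (H*W)^2.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Row attention: treat each of the B*H rows as a sequence of W tokens.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, c)
        # Column attention: treat each of the B*W columns as a sequence of H tokens.
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, c)
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(b, w, h, c).permute(0, 3, 2, 1)  # back to (B, C, H, W)

feat = torch.randn(2, 256, 32, 32)       # e.g., one FPN level
out = RowColumnAttention(dim=256)(feat)
print(out.shape)                          # torch.Size([2, 256, 32, 32])
```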
4. ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Yuhao Cui, Zhou Yu, Chunqi Wang, Zhongzhou Zhao, Ji Zhang, Meng Wang, Jun Yu
Vision-and-language pretraining (VLP) aims to learn generic multimodal representations from massive image-text pairs. While various successful attempts have been proposed, learning fine-grained semantic alignments between image-text pairs plays a key role in their approaches. Nevertheless, most existing VLP approaches have not fully utilized the intrinsic knowledge within the image-text pairs, which limits the effectiveness of the learned alignments and further restricts the performance of their models. To this end, we introduce a new VLP method called ROSITA, which integrates cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments. Specifically, we introduce a novel structural knowledge masking (SKM) strategy that uses the scene graph structure as a prior to perform masked language (region) modeling, which enhances the semantic alignments by eliminating interference information within and across modalities. Extensive ablation studies and comprehensive analysis verify the effectiveness of ROSITA in semantic alignments. Pretrained with both in-domain and out-of-domain datasets, ROSITA significantly outperforms existing state-of-the-art VLP methods on three typical vision-and-language tasks over six benchmark datasets.
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
— AK (@ak92501) August 17, 2021
pdf: https://t.co/ha8jCqGSxF
abs: https://t.co/1ZlwQ6a2vN
github: https://t.co/kYhbj9cU3L pic.twitter.com/bq3CCxt5Ff
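As an illustration of how a scene graph can drive masking, here is a toy sketch of a structural-masking step: pick a node (word or region) to predict and also mask its graph neighbors, so the model cannot recover it from directly related cues. The node naming, the neighbor set, and the single-target selection are assumptions made for illustration; ROSITA's actual SKM strategy is more involved.

```python
import random

# Toy unified scene graph: nodes are caption tokens ("t:...") and image
# regions ("r:..."); edges encode cross- and intra-modal relations.
edges = [
    ("t:dog", "t:ball"),      # textual relation: dog chasing ball
    ("t:dog", "r:region_3"),  # cross-modal alignment: "dog" grounded in region 3
    ("t:ball", "r:region_7"),
]

def neighbors(node, edges):
    return {b if a == node else a for a, b in edges if node in (a, b)}

def structural_masking(nodes, edges, seed=0):
    """Sketch of a structural-knowledge-style masking step (assumed form).

    Choose one node to predict and also mask its scene-graph neighbors, so the
    prediction cannot rely on directly related words or regions and must use
    the remaining context instead.
    """
    random.seed(seed)
    target = random.choice(nodes)
    masked = {target} | neighbors(target, edges)
    return target, masked

nodes = ["t:dog", "t:ball", "r:region_3", "r:region_7"]
target, masked = structural_masking(nodes, edges)
print("predict:", target, "| masked out:", sorted(masked))
```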
5. Who's Waldo? Linking People Across Text and Images
Claire Yuqing Cui, Apoorv Khandelwal, Yoav Artzi, Noah Snavely, Hadar Averbuch-Elor
We present a task and benchmark dataset for person-centric visual grounding, the problem of linking between people named in a caption and people pictured in an image. In contrast to prior work in visual grounding, which is predominantly object-based, our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues (such as rich interactions between multiple people), rather than learning associations between names and appearances. To facilitate this task, we introduce a new dataset, Who's Waldo, mined automatically from image-caption data on Wikimedia Commons. We propose a Transformer-based method that outperforms several strong baselines on this task, and are releasing our data to the research community to spur work on contextual models that consider both vision and language.
Who's Waldo? Linking People Across Text and Images
— AK (@ak92501) August 17, 2021
pdf: https://t.co/iiktNH0yey
abs: https://t.co/y65WaRRkVd
present a task, dataset, and method for linking people across images and text pic.twitter.com/XMlBRiqg9m
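The key design choice, masking person names so that models must rely on contextual cues rather than name-appearance associations, can be illustrated with a tiny sketch. The mask-token format and the hand-written name list below are hypothetical; the actual dataset derives names from Wikimedia Commons metadata.

```python
import re

def mask_names(caption, names, mask_token="[NAME]"):
    """Replace every listed person name in a caption with an indexed mask token.

    A minimal stand-in for the kind of name masking the task relies on; each
    distinct person keeps a distinct index so person links stay well defined.
    """
    masked = caption
    for i, name in enumerate(names):
        masked = re.sub(re.escape(name), f"{mask_token}_{i}", masked)
    return masked

caption = "Alice Smith hands the trophy to Bob Jones after the final match."
print(mask_names(caption, ["Alice Smith", "Bob Jones"]))
# [NAME]_0 hands the trophy to [NAME]_1 after the final match.
```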
6. Spectral Detection of Simplicial Communities via Hodge Laplacians
Sanjukta Krishnagopal, Ginestra Bianconi
- retweets: 56, favorites: 31 (08/18/2021 07:33:40)
- physics.data-an | cs.SI | physics.soc-ph
While the study of graphs has been very popular, simplicial complexes are relatively new to the network science community. Despite being a source of rich information, graphs are limited to pairwise interactions. However, several real-world networks, such as social networks and neuronal networks, involve simultaneous interactions between more than two nodes. Simplicial complexes provide a powerful mathematical way to model such interactions. The spectrum of the graph Laplacian is known to be indicative of community structure, with nonzero eigenvectors encoding the identity of communities. Here, we propose that the spectrum of the Hodge Laplacian, a higher-order Laplacian applied to simplicial complexes, encodes simplicial communities. We formulate an algorithm to extract simplicial communities (of arbitrary dimension). We apply this algorithm to simplicial complex benchmarks and to real data, including social networks and language networks, where higher-order relationships are intrinsic. Additionally, since datasets for simplicial complexes are scarce, we introduce a method of optimally generating a simplicial complex from its network backbone by estimating the true higher-order relationships when its community structure is known. We do so by using the adjusted mutual information to identify the configuration that best matches the expected data partition. Lastly, we demonstrate an example of persistent simplicial communities, inspired by the field of persistent homology.
Curious to know which Zachary Karate Club members had a higher-order interaction? Find out by looking at our work out today on arXiv https://t.co/3ocyBxX1JO. Many thanks to Sanjukta Krishnagopal for the wonderful collaboration! pic.twitter.com/zq9XXtuLWh
— Ginestra Bianconi (@gin_bianconi) August 17, 2021
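For readers new to Hodge Laplacians, here is a small NumPy sketch of the standard construction the abstract builds on: the 1-Hodge Laplacian L1 = B1^T B1 + B2 B2^T, assembled from node-edge and edge-triangle boundary matrices, whose spectrum is then inspected. The toy complex and the inspection step are illustrative only; the paper's community-extraction algorithm goes well beyond this.

```python
import numpy as np

# Toy simplicial complex: one filled triangle {0,1,2} plus a dangling edge {2,3}.
# Edges are oriented from the lower- to the higher-indexed node.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
triangles = [(0, 1, 2)]

# B1: node-to-edge boundary matrix (nodes x edges).
B1 = np.zeros((4, len(edges)))
for j, (u, v) in enumerate(edges):
    B1[u, j], B1[v, j] = -1, 1

# B2: edge-to-triangle boundary matrix (edges x triangles).
B2 = np.zeros((len(edges), len(triangles)))
edge_index = {e: i for i, e in enumerate(edges)}
for j, (a, b, c) in enumerate(triangles):
    B2[edge_index[(b, c)], j] = 1
    B2[edge_index[(a, c)], j] = -1
    B2[edge_index[(a, b)], j] = 1

# 1-Hodge Laplacian acting on edge signals.
L1 = B1.T @ B1 + B2 @ B2.T
eigvals, eigvecs = np.linalg.eigh(L1)
print(np.round(eigvals, 3))  # spectrum inspected for simplicial community structure
```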
7. Constrained Iterative LQG for Real-Time Chance-Constrained Gaussian Belief Space Planning
Jianyu Chen, Yutaka Shimizu, Liting Sun, Masayoshi Tomizuka, Wei Zhan
Motion planning under uncertainty is of significant importance for safety-critical systems such as autonomous vehicles. Such systems have to satisfy necessary constraints (e.g., collision avoidance) under potential uncertainties coming from either disturbed system dynamics or noisy sensor measurements. However, existing motion planning methods cannot efficiently find robust optimal solutions in general nonlinear and non-convex settings. In this paper, we formulate this problem as chance-constrained Gaussian belief space planning and propose the constrained iterative Linear Quadratic Gaussian (CILQG) algorithm as a real-time solution. In this algorithm, we iteratively calculate a Gaussian approximation of the belief and transform the chance constraints. We evaluate the effectiveness of our method in simulations of autonomous driving planning tasks with static and dynamic obstacles. Results show that CILQG handles uncertainties more appropriately and has faster computation time than baseline methods.
Our IROS paper is now up on arXiv! I'd be glad if it reaches anyone doing probabilistic path planning for robots (especially autonomous driving), or anyone who perks up at words like iLQR, iLQG, and optimal control! https://t.co/27ypLpgJ2e
— purewater0901 (@purewater0901) August 17, 2021
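A core ingredient of this kind of planner is transforming a chance constraint under a Gaussian belief into a deterministic one that an iterative LQG solver can handle. The sketch below shows the standard transformation for a linear constraint; treat the function name and the autonomous-driving numbers as assumptions, since CILQG linearizes general nonlinear constraints inside its iterations.

```python
import numpy as np
from scipy.stats import norm

def deterministic_chance_constraint(a, b, mu, Sigma, eps):
    """Tighten a linear chance constraint under a Gaussian belief.

    For x ~ N(mu, Sigma), the chance constraint P(a^T x <= b) >= 1 - eps is
    equivalent to the deterministic constraint
        a^T mu + Phi^{-1}(1 - eps) * sqrt(a^T Sigma a) <= b.
    Returns the signed constraint margin; the constraint holds if it is <= 0.
    """
    mean_term = a @ mu
    std_term = np.sqrt(a @ Sigma @ a)
    return mean_term + norm.ppf(1 - eps) * std_term - b

# Example: keep the vehicle's x-position below 5.0 m with 95% probability.
a = np.array([1.0, 0.0])        # constraint acts on the first state component
mu = np.array([4.2, 1.0])       # belief mean
Sigma = np.diag([0.09, 0.04])   # belief covariance
margin = deterministic_chance_constraint(a, b=5.0, mu=mu, Sigma=Sigma, eps=0.05)
print(f"constraint margin: {margin:.3f}" + (" (feasible)" if margin <= 0 else " (violated)"))
```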
8. Online Multi-Granularity Distillation for GAN Compression
Yuxi Ren, Jie Wu, Xuefeng Xiao, Jianchao Yang
Generative Adversarial Networks (GANs) have witnessed prevailing success in yielding outstanding images; however, they are burdensome to deploy on resource-constrained devices due to ponderous computational costs and hulking memory usage. Although recent efforts on compressing GANs have acquired remarkable results, potential model redundancies still remain and can be further compressed. To solve this issue, we propose a novel online multi-granularity distillation (OMGD) scheme to obtain lightweight GANs, which contributes to generating high-fidelity images with low computational demands. We offer the first attempt to popularize single-stage online distillation for GAN-oriented compression, where the progressively promoted teacher generator helps to refine the discriminator-free student generator. Complementary teacher generators and network layers provide comprehensive and multi-granularity concepts to enhance visual fidelity from diverse dimensions. Experimental results on four benchmark datasets demonstrate that OMGD succeeds in compressing MACs by 40x and parameters by 82.5x on Pix2Pix and CycleGAN without loss of image quality. This shows that OMGD provides a feasible solution for deploying real-time image translation on resource-constrained devices. Our code and models are made public at: https://github.com/bytedance/OMGD.
Online Multi-Granularity Distillation for GAN Compression
— AK (@ak92501) August 17, 2021
pdf: https://t.co/mYPCYIks8B
abs: https://t.co/uJ5cbfGPnL
github: https://t.co/N5XY817RGD pic.twitter.com/PDhTy3d9wK
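The discriminator-free student idea can be sketched as a single online-distillation step in PyTorch: the student generator is optimized only against the teacher's output. The tiny convolutional generators and the plain L1 output loss are placeholders; the actual OMGD scheme also distills intermediate layers and combines complementary teachers.

```python
import torch
import torch.nn as nn

# Minimal sketch of one online-distillation step: the student generator is
# supervised only by the (progressively trained) teacher's output, so the
# student never needs its own discriminator.
def make_generator(width):
    return nn.Sequential(
        nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, 3, 3, padding=1), nn.Tanh(),
    )

teacher = make_generator(width=64)   # large teacher generator
student = make_generator(width=8)    # lightweight student to deploy
opt = torch.optim.Adam(student.parameters(), lr=2e-4)

x = torch.randn(4, 3, 64, 64)        # a batch of source-domain images
with torch.no_grad():
    target = teacher(x)              # teacher output serves as the student's target
loss = nn.functional.l1_loss(student(x), target)
opt.zero_grad()
loss.backward()
opt.step()
print(f"distillation loss: {loss.item():.4f}")
```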
9. WikiChurches: A Fine-Grained Dataset of Architectural Styles with Real-World Challenges
Björn Barz, Joachim Denzler
We introduce a novel dataset for architectural style classification, consisting of 9,485 images of church buildings. Both images and style labels were sourced from Wikipedia. The dataset can serve as a benchmark for various research fields, as it combines numerous real-world challenges: fine-grained distinctions between classes based on subtle visual features, a comparatively small sample size, a highly imbalanced class distribution, a high variance of viewpoints, and a hierarchical organization of labels, where only some images are labeled at the most precise level. In addition, we provide 631 bounding box annotations of characteristic visual features for 139 churches from four major categories. These annotations can, for example, be useful for research on fine-grained classification, where additional expert knowledge about distinctive object parts is often available. Images and annotations are available at: https://doi.org/10.5281/zenodo.5166987