1. EfficientNetV2: Smaller Models and Faster Training
Mingxing Tan, Quoc V. Le
This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models. To develop this family of models, we use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency. The models were searched from the search space enriched with new ops such as Fused-MBConv. Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller. Our training can be further sped up by progressively increasing the image size during training, but it often causes a drop in accuracy. To compensate for this accuracy drop, we propose to adaptively adjust regularization (e.g., dropout and data augmentation) as well, such that we can achieve both fast training and good accuracy. With progressive learning, our EfficientNetV2 significantly outperforms previous models on ImageNet and CIFAR/Cars/Flowers datasets. By pretraining on the same ImageNet21k, our EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while training 5x-11x faster using the same computing resources. Code will be available at https://github.com/google/automl/efficientnetv2.
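The progressive-learning recipe in the abstract pairs growing image sizes with growing regularization. A minimal sketch of such a coupled schedule, assuming simple linear interpolation over a handful of training stages (the stage count, sizes, and regularization ranges below are illustrative, not the paper's exact values):

```python
def progressive_schedule(stage, num_stages=4,
                         min_size=128, max_size=300,
                         min_dropout=0.1, max_dropout=0.3,
                         min_randaug=5, max_randaug=15):
    """Return (image_size, dropout_rate, randaugment_magnitude) for a training
    stage. Small images get weak regularization, large images get strong
    regularization, so accuracy does not drop as the resolution grows."""
    t = stage / max(num_stages - 1, 1)          # progress in [0, 1]
    size = int(min_size + t * (max_size - min_size))
    dropout = min_dropout + t * (max_dropout - min_dropout)
    randaug = min_randaug + t * (max_randaug - min_randaug)
    return size, dropout, randaug

for s in range(4):
    print(progressive_schedule(s))
```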
Happy to introduce EfficientNetV2: Smaller Models and Faster Training
— Mingxing Tan (@tanmingxing) April 2, 2021
Achieved faster training and inference speed, AND also with better parameter efficiency.
Arxiv: https://t.co/YHWEb8pHmR
Thread 1/4 pic.twitter.com/LY2oZ4tSbN
2. NeRF-VAE: A Geometry Aware 3D Scene Generative Model
Adam R. Kosiorek, Heiko Strathmann, Daniel Zoran, Pol Moreno, Rosalia Schneider, Soňa Mokrá, Danilo J. Rezende
We propose NeRF-VAE, a 3D scene generative model that incorporates geometric structure via NeRF and differentiable volume rendering. In contrast to NeRF, our model takes into account shared structure across scenes, and is able to infer the structure of a novel scene — without the need to re-train — using amortized inference. NeRF-VAE’s explicit 3D rendering process further contrasts previous generative models with convolution-based rendering which lacks geometric structure. Our model is a VAE that learns a distribution over radiance fields by conditioning them on a latent scene representation. We show that, once trained, NeRF-VAE is able to infer and render geometrically-consistent scenes from previously unseen 3D environments using very few input images. We further demonstrate that NeRF-VAE generalizes well to out-of-distribution cameras, while convolutional models do not. Finally, we introduce and study an attention-based conditioning mechanism of NeRF-VAE’s decoder, which improves model performance.
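A minimal sketch of the central idea of a radiance field conditioned on a per-scene latent code, written in PyTorch; the layer sizes are illustrative, and the paper's amortized encoder, prior, and attention-based conditioning are not reproduced here:

```python
import torch
import torch.nn as nn

class LatentConditionedNeRF(nn.Module):
    """Toy radiance field f(x, d, z) -> (density, rgb) conditioned on a
    scene latent z, in the spirit of NeRF-VAE's decoder."""
    def __init__(self, latent_dim=64, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.rgb_head = nn.Sequential(
            nn.Linear(hidden + 3, hidden), nn.ReLU(),  # view direction enters late
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, x, d, z):
        h = self.trunk(torch.cat([x, z.expand(x.shape[0], -1)], dim=-1))
        density = torch.relu(self.density_head(h))
        rgb = self.rgb_head(torch.cat([h, d], dim=-1))
        return density, rgb

field = LatentConditionedNeRF()
x = torch.rand(1024, 3)      # sample points along rays
d = torch.randn(1024, 3)     # view directions
z = torch.randn(1, 64)       # scene latent, e.g. from an amortized encoder
density, rgb = field(x, d, z)
```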
NeRF-VAE: A Geometry Aware 3D Scene Generative Model
— AK (@ak92501) April 2, 2021
pdf: https://t.co/943n2Sspab
abs: https://t.co/CYlYRFvOkU
"NeRF-VAE is able to infer and render geometrically-consistent scenes from previously unseen 3D environments using very few input images" pic.twitter.com/MxlAwmbCu1
3. In&Out : Diverse Image Outpainting via GAN Inversion
Yen-Chi Cheng, Chieh Hubert Lin, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Ming-Hsuan Yang
Image outpainting seeks a semantically consistent extension of the input image beyond its available content. Compared to inpainting — filling in missing pixels in a way coherent with the neighboring pixels — outpainting can be achieved in more diverse ways since the problem is less constrained by the surrounding pixels. Existing image outpainting methods pose the problem as a conditional image-to-image translation task, often generating repetitive structures and textures by replicating the content available in the input image. In this work, we formulate the problem from the perspective of inverting generative adversarial networks. Our generator renders micro-patches conditioned on their joint latent code as well as their individual positions in the image. To outpaint an image, we seek multiple latent codes that not only recover the available patches but also synthesize diverse outpainted content through patch-based generation. This leads to richer structure and content in the outpainted regions. Furthermore, our formulation allows for outpainting conditioned on categorical input, thereby enabling flexible user controls. Extensive experimental results demonstrate that the proposed method performs favorably against existing in- and outpainting methods, featuring higher visual quality and diversity.
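The inversion step can be sketched as optimizing a latent code so that a pretrained generator reproduces the observed region, while the unobserved region is filled by the generator's prior. `G` below is a hypothetical pretrained generator; the paper's position-conditioned micro-patch generation and its additional losses are not reproduced:

```python
import torch

def invert_for_outpainting(G, target, known_mask, steps=500, lr=0.05):
    """Optimize a latent code z so that G(z) matches `target` inside
    `known_mask` (1 = observed pixels). Pixels outside the mask are left to
    the generator's prior, which is what produces the outpainted content.
    G is a stand-in for a pretrained generator mapping z -> image."""
    z = torch.randn(1, 512, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        fake = G(z)
        loss = ((fake - target) ** 2 * known_mask).sum() / known_mask.sum()
        loss.backward()
        opt.step()
    return z.detach(), G(z).detach()
```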
Wanna generate panorama from images you took during vacation? Check out our recent paper "𝐈𝐧&𝐎𝐮𝐭: 𝐃𝐢𝐯𝐞𝐫𝐬𝐞 𝐈𝐦𝐚𝐠𝐞 𝐎𝐮𝐭𝐩𝐚𝐢𝐧𝐭𝐢𝐧𝐠 𝐯𝐢𝐚 𝐆𝐀𝐍 𝐈𝐧𝐯𝐞𝐫𝐬𝐢𝐨𝐧"!
— Hsin-Ying James Lee (@hyjameslee) April 2, 2021
Project: https://t.co/jQXut5xjAp
Paper: https://t.co/BLlMoJ4V2H #snap #computervision
(1/2) pic.twitter.com/dyEFX836Np
In&Out : Diverse Image Outpainting via GAN Inversion
— AK (@ak92501) April 2, 2021
pdf: https://t.co/wCPVXSu1Pw
abs: https://t.co/yvFnkVJpiQ
project page: https://t.co/fFTrsY6no8 pic.twitter.com/JP8Lu2K0qe
4. LoFTR: Detector-Free Local Feature Matching with Transformers
Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, Xiaowei Zhou
We present a novel method for local image feature matching. Instead of performing image feature detection, description, and matching sequentially, we propose to first establish pixel-wise dense matches at a coarse level and later refine the good matches at a fine level. In contrast to dense methods that use a cost volume to search correspondences, we use self and cross attention layers in Transformer to obtain feature descriptors that are conditioned on both images. The global receptive field provided by Transformer enables our method to produce dense matches in low-texture areas, where feature detectors usually struggle to produce repeatable interest points. The experiments on indoor and outdoor datasets show that LoFTR outperforms state-of-the-art methods by a large margin. LoFTR also ranks first on two public benchmarks of visual localization among the published methods.
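A sketch of one standard way to realize the dense coarse matching described above: Transformer-conditioned descriptors from both images are compared via a similarity matrix, normalized in both directions (a dual-softmax), and mutual best matches above a confidence threshold are kept. The temperature and threshold values here are illustrative:

```python
import torch

def coarse_match(feat_a, feat_b, temperature=0.1, threshold=0.2):
    """feat_a: (N, C), feat_b: (M, C) coarse-level descriptors that are
    already conditioned on both images by self/cross attention."""
    feat_a = torch.nn.functional.normalize(feat_a, dim=-1)
    feat_b = torch.nn.functional.normalize(feat_b, dim=-1)
    sim = feat_a @ feat_b.t() / temperature           # (N, M) similarity
    conf = sim.softmax(dim=0) * sim.softmax(dim=1)    # dual-softmax confidence
    # keep mutual nearest neighbours above the confidence threshold
    mutual = (conf == conf.max(dim=1, keepdim=True).values) & \
             (conf == conf.max(dim=0, keepdim=True).values)
    idx_a, idx_b = torch.nonzero(mutual & (conf > threshold), as_tuple=True)
    return idx_a, idx_b, conf[idx_a, idx_b]
```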
LoFTR: Detector-Free Local Feature Matching with Transformers
— AK (@ak92501) April 2, 2021
pdf: https://t.co/AFNPRCJKs6
abs: https://t.co/GgQGINeKpE
project page: https://t.co/gwmw9BbE6g pic.twitter.com/kWiY5IQ0QD
LoFTR: Detector-Free Local Feature Matching with Transformers
@JiamingSuen, Zehong Shen, Yuang Wang, Hujun Bao, Xiaowei Zhou
— Dmytro Mishkin (@ducha_aiki) April 3, 2021
tl;dr: dense local descriptor -> linear transformer matcher (similar to SuperGlue). Everything is in the coarse-to-fine scheme. https://t.co/O8ze2SdTUb pic.twitter.com/yVNZv3ibzq
5. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval. The challenges in this area include the design of the visual architecture and the nature of the training data, in that the available large scale video-text training datasets, such as HowTo100M, are noisy and hence competitive performance is achieved only at scale through large amounts of compute. We address both these challenges in this paper. We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets. Our model is an adaptation and extension of the recent ViT and Timesformer architectures, and consists of attention in both space and time. The model is flexible and can be trained on both image and video text datasets, either independently or in conjunction. It is trained with a curriculum learning schedule that begins by treating images as ‘frozen’ snapshots of video, and then gradually learns to attend to increasing temporal context when trained on video datasets. We also provide a new video-text pretraining dataset WebVid-2M, comprised of over two million videos with weak captions scraped from the internet. Despite training on datasets that are an order of magnitude smaller, we show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.
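The curriculum can be sketched as a schedule over how many frames are sampled per clip: early stages use a single frame (so image datasets plug in directly as "frozen" snapshots), later stages widen the temporal context. The per-stage frame counts below are illustrative:

```python
import random

def sample_frame_indices(num_video_frames, stage):
    """Curriculum over temporal context: stage 0 uses 1 frame, so image
    datasets can be mixed in; later stages attend to more frames."""
    frames_per_stage = [1, 2, 4, 8]                  # illustrative schedule
    k = min(frames_per_stage[min(stage, len(frames_per_stage) - 1)],
            num_video_frames)
    return sorted(random.sample(range(num_video_frames), k))

print(sample_frame_indices(32, stage=0))  # e.g. [17]
print(sample_frame_indices(32, stage=3))  # e.g. 8 sorted frame indices
```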
Separation of effort in the image and video retrieval communities is suboptimal - they share a lot of overlapping info!
— Arsha Nagrani @🏠 (@NagraniArsha) April 2, 2021
Check out our NEW model for visual-text retrieval, easily trains on *both* images and videos jointly, setting new SOTA results! https://t.co/riCQF69To1 pic.twitter.com/HpORGtLMUs
6. NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video
Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, Hujun Bao
We present a novel framework named NeuralRecon for real-time 3D scene reconstruction from a monocular video. Unlike previous methods that estimate single-view depth maps separately on each key-frame and fuse them later, we propose to directly reconstruct local surfaces represented as sparse TSDF volumes for each video fragment sequentially by a neural network. A learning-based TSDF fusion module based on gated recurrent units is used to guide the network to fuse features from previous fragments. This design allows the network to capture local smoothness prior and global shape prior of 3D surfaces when sequentially reconstructing the surfaces, resulting in accurate, coherent, and real-time surface reconstruction. The experiments on ScanNet and 7-Scenes datasets show that our system outperforms state-of-the-art methods in terms of both accuracy and speed. To the best of our knowledge, this is the first learning-based system that is able to reconstruct dense coherent 3D geometry in real-time.
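A simplified sketch of the recurrent fusion: each incoming fragment's voxel features update a persistent hidden state through a GRU, from which the TSDF is predicted. The paper uses a sparse 3D convolutional GRU; the dense `GRUCell` below is only a stand-in to show the recurrence:

```python
import torch
import torch.nn as nn

class FragmentFusion(nn.Module):
    """Fuse the current fragment's voxel features with a running hidden
    state, per voxel, then predict a signed distance value."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim, feat_dim)
        self.tsdf_head = nn.Linear(feat_dim, 1)

    def forward(self, frag_feats, hidden):
        # frag_feats, hidden: (num_voxels, feat_dim)
        hidden = self.gru(frag_feats, hidden)
        tsdf = torch.tanh(self.tsdf_head(hidden))    # signed distance in [-1, 1]
        return tsdf, hidden

fusion = FragmentFusion()
hidden = torch.zeros(4096, 32)        # persistent state for 4096 voxels
for _ in range(5):                    # five incoming video fragments
    frag = torch.randn(4096, 32)
    tsdf, hidden = fusion(frag, hidden)
```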
NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video
— AK (@ak92501) April 2, 2021
pdf: https://t.co/KTjroX1B1s
abs: https://t.co/ahbtVraPkB
project page: https://t.co/w5K5HgFKHt pic.twitter.com/0X48Rj6wE0
7. Unconstrained Scene Generation with Locally Conditioned Radiance Fields
Terrance DeVries, Miguel Angel Bautista, Nitish Srivastava, Graham W. Taylor, Joshua M. Susskind
We tackle the challenge of learning a distribution over complex, realistic, indoor scenes. In this paper, we introduce Generative Scene Networks (GSN), which learns to decompose scenes into a collection of many local radiance fields that can be rendered from a free moving camera. Our model can be used as a prior to generate new scenes, or to complete a scene given only sparse 2D observations. Recent work has shown that generative models of radiance fields can capture properties such as multi-view consistency and view-dependent lighting. However, these models are specialized for constrained viewing of single objects, such as cars or faces. Due to the size and complexity of realistic indoor environments, existing models lack the representational capacity to adequately capture them. Our decomposition scheme scales to larger and more complex scenes while preserving details and diversity, and the learned prior enables high-quality rendering from viewpoints that are significantly different from observed viewpoints. When compared to existing models, GSN produces quantitatively higher-quality scene renderings across several different scene datasets.
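One way to realize the "locally conditioned" decomposition described above is to bilinearly sample a 2D grid of latent codes (a latent floorplan) at each query point's ground-plane location, and let the sampled code condition the radiance MLP. Grid resolution and dimensions below are illustrative:

```python
import torch
import torch.nn.functional as F

def sample_local_latents(latent_grid, xyz, scene_extent=1.0):
    """latent_grid: (1, C, H, W) latent 'floorplan'; xyz: (N, 3) query points.
    Returns one local latent code per point, sampled at its (x, y) location."""
    # normalize ground-plane coordinates to [-1, 1] for grid_sample
    uv = (xyz[:, :2] / scene_extent).clamp(-1, 1).view(1, 1, -1, 2)
    local = F.grid_sample(latent_grid, uv, align_corners=True)  # (1, C, 1, N)
    return local.squeeze(0).squeeze(1).t()                      # (N, C)

latent_grid = torch.randn(1, 64, 32, 32)
points = torch.rand(2048, 3) * 2 - 1
codes = sample_local_latents(latent_grid, points)  # fed to the radiance MLP
```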
Introducing Generative Scene Networks (GSN), a generative model for learning radiance fields for realistic scenes. With GSN we can sample scenes from the learned prior and move through them with a freely moving camera.
— Miguel A Bautista (@itsbautistam) April 3, 2021
Arxiv: https://t.co/rYiFH4uhLp
Scenes sampled from the prior: https://t.co/aFxnve3PEd pic.twitter.com/huUT9z8a1t
8. Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis
Ajay Jain, Matthew Tancik, Pieter Abbeel
We present DietNeRF, a 3D neural scene representation estimated from a few images. Neural Radiance Fields (NeRF) learn a continuous volumetric representation of a scene through multi-view consistency, and can be rendered from novel viewpoints by ray casting. While NeRF has an impressive ability to reconstruct geometry and fine details given many images, up to 100 for challenging 360° scenes, it often finds a degenerate solution to its image reconstruction objective when only a few input views are available. To improve few-shot quality, we propose DietNeRF. We introduce an auxiliary semantic consistency loss that encourages realistic renderings at novel poses. DietNeRF is trained on individual scenes to (1) correctly render given input views from the same pose, and (2) match high-level semantic attributes across different, random poses. Our semantic loss allows us to supervise DietNeRF from arbitrary poses. We extract these semantics using a pre-trained visual encoder such as CLIP, a Vision Transformer trained on hundreds of millions of diverse single-view, 2D photographs mined from the web with natural language supervision. In experiments, DietNeRF improves the perceptual quality of few-shot view synthesis when learned from scratch, can render novel views with as few as one observed image when pre-trained on a multi-view dataset, and produces plausible completions of completely unobserved regions.
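A sketch of the auxiliary loss: render from a random pose, embed both the rendering and an observed view with a frozen image encoder such as CLIP's, and penalize low cosine similarity. `render_at_pose` and `clip_encoder` are stand-ins for the differentiable renderer and the pretrained encoder:

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(render_at_pose, clip_encoder,
                              observed_image, random_pose):
    """Encourage a rendering from an arbitrary pose to share high-level
    semantics with an observed view. `render_at_pose` (differentiable
    renderer) and `clip_encoder` (frozen image encoder) are stand-ins."""
    rendered = render_at_pose(random_pose)            # (3, H, W), requires grad
    with torch.no_grad():
        target_emb = F.normalize(clip_encoder(observed_image[None]), dim=-1)
    render_emb = F.normalize(clip_encoder(rendered[None]), dim=-1)
    return 1.0 - (render_emb * target_emb).sum(dim=-1).mean()
```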
Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis
— AK (@ak92501) April 2, 2021
pdf: https://t.co/DKAXVYozKR
abs: https://t.co/FcfuYJsX4H
project page: https://t.co/8OAhjQMQ90 pic.twitter.com/TPKkEqeX1V
9. Reconstructing 3D Human Pose by Watching Humans in the Mirror
Qi Fang, Qing Shuai, Junting Dong, Hujun Bao, Xiaowei Zhou
In this paper, we introduce the new task of reconstructing 3D human pose from a single image in which we can see the person and the person’s image through a mirror. Compared to general scenarios of 3D pose estimation from a single view, the mirror reflection provides an additional view for resolving the depth ambiguity. We develop an optimization-based approach that exploits mirror symmetry constraints for accurate 3D pose reconstruction. We also provide a method to estimate the surface normal of the mirror from vanishing points in the single image. To validate the proposed approach, we collect a large-scale dataset named Mirrored-Human, which covers a large variety of human subjects, poses and backgrounds. The experiments demonstrate that, when trained on Mirrored-Human with our reconstructed 3D poses as pseudo ground-truth, the accuracy and generalizability of existing single-view 3D pose estimators can be largely improved.
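The core geometric constraint can be written down directly: with the mirror plane given by a unit normal n and offset d, the joints of the person seen in the mirror should coincide with the reflection of the real person's joints. A NumPy sketch of the reflection and the resulting residual (the full optimization is not shown):

```python
import numpy as np

def reflect_points(points, n, d):
    """Reflect 3D points across the plane {x : n.x + d = 0}, n a unit normal:
    p' = p - 2 (n.p + d) n"""
    n = n / np.linalg.norm(n)
    dist = points @ n + d                        # signed distance of each point
    return points - 2.0 * dist[:, None] * n

def mirror_symmetry_residual(joints_real, joints_virtual, n, d):
    """Residual used as a constraint: reflected real joints should coincide
    with the joints of the person observed in the mirror."""
    return np.linalg.norm(reflect_points(joints_real, n, d) - joints_virtual, axis=1)

# sanity check: reflecting twice returns the original points
P = np.random.rand(17, 3)
n, d = np.array([0.0, 0.0, 1.0]), -2.0
assert np.allclose(reflect_points(reflect_points(P, n, d), n, d), P)
```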
Reconstructing 3D Human Pose by Watching Humans in the Mirror
— AK (@ak92501) April 2, 2021
pdf: https://t.co/4AoRvYDIfP
abs: https://t.co/L10OpNaFQE
project page: https://t.co/0NvY0M2XrD pic.twitter.com/He65Yw5QzV
10. Real-time Data Infrastructure at Uber
Yupeng Fu, Chinmay Soman
Uber’s business is highly real-time in nature. Petabytes of data are continuously collected every day from end users such as Uber drivers, riders, restaurants, and eaters. There is a lot of valuable information to be processed, and many decisions must be made in seconds for a variety of use cases such as customer incentives, fraud detection, and machine learning model prediction. In addition, there is an increasing need to expose this ability to different user categories, including engineers, data scientists, executives, and operations personnel, which adds to the complexity. In this paper, we present the overall architecture of the real-time data infrastructure and identify three scaling challenges that we need to continuously address for each component in the architecture. At Uber, we heavily rely on open source technologies for the key areas of the infrastructure. On top of this open-source software, we add significant improvements and customizations to make the open-source solutions fit in Uber’s environment and bridge the gaps to meet Uber’s unique scale and requirements. We then highlight several important use cases and show their real-time solutions and tradeoffs. Finally, we reflect on the lessons we learned as we built, operated and scaled these systems.
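As a purely conceptual illustration of the kind of keyed, windowed aggregation such a stack (Kafka for transport, Flink for stream processing, Pinot for serving, in Uber's case) maintains continuously, here is a toy tumbling-window counter in plain Python; it is not Uber's implementation:

```python
from collections import Counter, defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """events: iterable of (timestamp_seconds, key) pairs, e.g. (t, city_id)
    for completed trips. Returns per-window counts per key, the kind of
    metric an incentive or fraud system would read with low latency."""
    windows = defaultdict(Counter)
    for ts, key in events:
        windows[int(ts // window_seconds)][key] += 1
    return dict(windows)

events = [(3, "sf"), (15, "nyc"), (61, "sf"), (62, "sf"), (130, "nyc")]
print(tumbling_window_counts(events))
# -> {0: Counter({'sf': 1, 'nyc': 1}), 1: Counter({'sf': 2}), 2: Counter({'nyc': 1})}
```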
This pretty much summarizes my Uber tenure - coauthored with Yupeng Fu (SIGMOD21): https://t.co/xAVW0U3PdU
— Chinmay Soman (@ChinmaySoman) April 2, 2021
Very grateful for @ApachePinot @apachekafka and @ApacheFlink for providing a robust foundation to power Uber's real time analytics.
#Computing
— Saif AlHarthi (@SaifAlHarthi) April 2, 2021
A paper to be published by Uber at one of the largest database conferences, describing the infrastructure for handling real-time data.
Real-time Data https://t.co/jrvvYNUp1E
11. Vertex Connectivity in Poly-logarithmic Max-flows
Jason Li, Danupon Nanongkai, Debmalya Panigrahi, Thatchaphol Saranurak, Sorrachai Yingchareonthawornchai
The vertex connectivity of an $m$-edge $n$-vertex undirected graph is the smallest number of vertices whose removal disconnects the graph, or leaves only a singleton vertex. In this paper, we give a reduction from the vertex connectivity problem to a set of maxflow instances. Using this reduction, we can solve vertex connectivity in $\tilde{O}(m^{\alpha})$ time for any $\alpha \ge 1$, if there is an $m^{\alpha}$-time maxflow algorithm. Using the current best maxflow algorithm that runs in $m^{4/3+o(1)}$ time (Kathuria, Liu and Sidford, FOCS 2020), this yields an $m^{4/3+o(1)}$-time vertex connectivity algorithm. This is the first improvement in the running time of the vertex connectivity problem in over 20 years, the previous best being an $\tilde{O}(mn)$-time algorithm due to Henzinger, Rao, and Gabow (FOCS 1996). Indeed, no algorithm with an $o(mn)$ running time was known before our work, even if we assume an $O(m)$-time maxflow algorithm. Our new technique is robust enough to also improve the best known running time for directed vertex connectivity.
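The classical building block behind this reduction can be sketched with a single maxflow call: split every vertex into an in/out pair joined by a unit-capacity edge, so that a maxflow between two non-adjacent vertices equals the minimum vertex cut separating them. The paper's contribution, not shown here, is computing the overall vertex connectivity with only polylogarithmically many such maxflow calls:

```python
import networkx as nx

def local_vertex_connectivity(G, s, t):
    """Minimum number of vertices (other than s and t) whose removal separates
    t from s in an undirected graph, via the textbook vertex-splitting maxflow
    reduction. Assumes s and t are non-adjacent."""
    H = nx.DiGraph()
    for v in G.nodes:
        if v in (s, t):
            # s and t are never deleted: leave their split edge uncapacitated
            H.add_edge((v, "in"), (v, "out"))
        else:
            # unit capacity limits how often each other vertex can be used
            H.add_edge((v, "in"), (v, "out"), capacity=1)
    for u, v in G.edges:
        # each undirected edge becomes two directed edges; omitting the
        # 'capacity' attribute means infinite capacity in networkx
        H.add_edge((u, "out"), (v, "in"))
        H.add_edge((v, "out"), (u, "in"))
    value, _ = nx.maximum_flow(H, (s, "out"), (t, "in"))
    return int(value)

# two vertex-disjoint paths separate opposite vertices of a 6-cycle
print(local_vertex_connectivity(nx.cycle_graph(6), 0, 3))   # -> 2
```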
How many vertices do you need to delete to disconnect an n-vertex graph?
— Thatchaphol Saranurak (@eig) April 2, 2021
Previous algorithms roughly call n max flows (or some optimized version of it) to compute this number.
Our new algorithm takes time proportional to polylog(n) max flows only! https://t.co/mNXHWfIN0a
12. Why is AI hard and Physics simple?
Daniel A. Roberts
- retweets: 129, favorites: 85 (04/04/2021 11:47:39)
- cs.LG | cs.AI | hep-th | physics.hist-ph | stat.ML
We discuss why AI is hard and why physics is simple. We discuss how physical intuition and the approach of theoretical physics can be brought to bear on the field of artificial intelligence and specifically machine learning. We suggest that the underlying project of machine learning and the underlying project of physics are strongly coupled through the principle of sparsity, and we call upon theoretical physicists to work on AI as physicists. As a first step in that direction, we discuss an upcoming book on the principles of deep learning theory that attempts to realize this approach.
New essay "Why is AI hard and Physics simple?" on how the tools and language of theoretical physics might be useful for making progress in AI, and specifically for deep learning. Basically, an apologia for physicists working on ML. https://t.co/IATPFBn63h
— Dan Roberts (@danintheory) April 2, 2021
1/
13. PhySG: Inverse Rendering with Spherical Gaussians for Physics-based Material Editing and Relighting
Kai Zhang, Fujun Luan, Qianqian Wang, Kavita Bala, Noah Snavely
We present PhySG, an end-to-end inverse rendering pipeline that includes a fully differentiable renderer and can reconstruct geometry, materials, and illumination from scratch from a set of RGB input images. Our framework represents specular BRDFs and environmental illumination using mixtures of spherical Gaussians, and represents geometry as a signed distance function parameterized as a Multi-Layer Perceptron. The use of spherical Gaussians allows us to efficiently solve for approximate light transport, and our method works on scenes with challenging non-Lambertian reflectance captured under natural, static illumination. We demonstrate, with both synthetic and real data, that our reconstructions not only enable rendering of novel viewpoints, but also physics-based appearance editing of materials and illumination.
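The spherical Gaussian representation is easy to evaluate directly: a lobe with axis xi, sharpness lam, and amplitude mu contributes mu * exp(lam * (v . xi - 1)) in direction v, and illumination is a sum of such lobes. A NumPy sketch with illustrative lobe parameters:

```python
import numpy as np

def eval_sg_mixture(dirs, lobe_axes, sharpness, amplitudes):
    """Evaluate a sum of spherical Gaussians at unit directions `dirs`.
    G(v; xi, lam, mu) = mu * exp(lam * (v . xi - 1)).
    dirs: (N, 3), lobe_axes: (K, 3), sharpness: (K,), amplitudes: (K, 3)."""
    cos = dirs @ lobe_axes.T                              # (N, K)
    weights = np.exp(sharpness[None, :] * (cos - 1.0))    # (N, K)
    return weights @ amplitudes                           # (N, 3) radiance

# a single broad lobe pointing up, with illustrative values
dirs = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
axes = np.array([[0.0, 0.0, 1.0]])
print(eval_sg_mixture(dirs, axes, np.array([5.0]), np.array([[1.0, 0.9, 0.8]])))
```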
PhySG: Inverse Rendering with Spherical Gaussians for Physics-based Material Editing and Relighting
— AK (@ak92501) April 2, 2021
pdf: https://t.co/nIKgSFadir
abs: https://t.co/QFMoNo8lVr
project page: https://t.co/YFvFZPCXN6 pic.twitter.com/JRu4SQi3Gk
14. Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux
We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representations, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows us to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings’ intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods. Audio samples can be found under the following link: https://resynthesis-ssl.github.io/.
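A sketch of how the three disentangled streams might be assembled into a frame-level conditioning sequence for a vocoder: discrete content units and quantized F0 are embedded per frame and combined with a speaker embedding. Codebook sizes and dimensions are illustrative, and the vocoder itself is not shown:

```python
import torch
import torch.nn as nn

class ResynthesisConditioner(nn.Module):
    """Map (content tokens, quantized F0 tokens, speaker id) to a per-frame
    conditioning sequence that a neural vocoder would consume."""
    def __init__(self, n_units=100, n_f0_bins=32, n_speakers=200, dim=128):
        super().__init__()
        self.content_emb = nn.Embedding(n_units, dim)
        self.f0_emb = nn.Embedding(n_f0_bins, dim)
        self.speaker_emb = nn.Embedding(n_speakers, dim)

    def forward(self, content_tokens, f0_tokens, speaker_id):
        # content_tokens, f0_tokens: (T,), speaker_id: scalar tensor
        spk = self.speaker_emb(speaker_id).expand(content_tokens.shape[0], -1)
        return self.content_emb(content_tokens) + self.f0_emb(f0_tokens) + spk

cond = ResynthesisConditioner()
frames = cond(torch.randint(0, 100, (50,)), torch.randint(0, 32, (50,)),
              torch.tensor(7))      # (50, 128) conditioning for the vocoder
```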
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
— AK (@ak92501) April 2, 2021
pdf: https://t.co/6uQG1W2cIX
abs: https://t.co/kp622FKD7M pic.twitter.com/OL654AlfF2
15. Sketch2Mesh: Reconstructing and Editing 3D Shapes from Sketches
Benoit Guillard, Edoardo Remelli, Pierre Yvernay, Pascal Fua
Reconstructing 3D shape from 2D sketches has long been an open problem because the sketches only provide very sparse and ambiguous information. In this paper, we use an encoder/decoder architecture for the sketch-to-mesh translation. This enables us to leverage its latent parametrization to represent and refine a 3D mesh so that its projections match the external contours outlined in the sketch. We show that this approach is easy to deploy, robust to style changes, and effective. Furthermore, it can be used for shape refinement given only single pen strokes. We compare our approach to state-of-the-art methods on sketches — both hand-drawn and synthesized — and demonstrate that we outperform them.
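The refinement step can be sketched as optimizing the latent code so that the decoded shape's projected contour matches the sketch contour under a symmetric Chamfer distance. `decode_contour_points` is a hypothetical stand-in for the differentiable decode-and-project step:

```python
import torch

def chamfer_2d(a, b):
    """Symmetric Chamfer distance between 2D point sets a: (N, 2), b: (M, 2)."""
    d = torch.cdist(a, b)                      # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def refine_latent(z_init, sketch_contour, decode_contour_points,
                  steps=200, lr=1e-2):
    """Refine the latent so the decoded shape's projected contour matches the
    sketch. `decode_contour_points(z)` stands in for a differentiable
    decode-and-project step returning (N, 2) contour points."""
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = chamfer_2d(decode_contour_points(z), sketch_contour)
        loss.backward()
        opt.step()
    return z.detach()
```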
Sketch2Mesh: Reconstructing and Editing 3D Shapes from Sketches
— AK (@ak92501) April 2, 2021
pdf: https://t.co/ABMuJtY6Vy
abs: https://t.co/atejyzpbLg pic.twitter.com/Vf365HDTh1
16. Group-Free 3D Object Detection via Transformers
Ze Liu, Zheng Zhang, Yue Cao, Han Hu, Xin Tong
Recently, directly detecting 3D objects from 3D point clouds has received increasing attention. To extract object representation from an irregular point cloud, existing methods usually take a point grouping step to assign the points to an object candidate so that a PointNet-like network could be used to derive object features from the grouped points. However, the inaccurate point assignments caused by the hand-crafted grouping scheme decrease the performance of 3D object detection. In this paper, we present a simple yet effective method for directly detecting 3D objects from the 3D point cloud. Instead of grouping local points to each object candidate, our method computes the feature of an object from all the points in the point cloud with the help of an attention mechanism in Transformers (Vaswani et al., 2017), where the contribution of each point is automatically learned in the network training. With an improved attention stacking scheme, our method fuses object features in different stages and generates more accurate object detection results. With few bells and whistles, the proposed method achieves state-of-the-art 3D object detection performance on two widely used benchmarks, ScanNet V2 and SUN RGB-D. The code and models are publicly available at https://github.com/zeliu98/Group-Free-3D.
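The central operation can be sketched with standard multi-head cross-attention: object candidate queries gather evidence from every point feature, so each object feature is a learned weighted sum over the whole cloud rather than a hand-crafted group. A single layer with illustrative sizes (the paper stacks several such layers and refines predictions at each stage):

```python
import torch
import torch.nn as nn

class ObjectPointCrossAttention(nn.Module):
    """One cross-attention step: object candidates (queries) attend to all
    point features (keys/values) in the cloud."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, object_queries, point_feats):
        # object_queries: (B, K, C), point_feats: (B, N, C)
        attended, _ = self.attn(object_queries, point_feats, point_feats)
        return self.norm(object_queries + attended)

layer = ObjectPointCrossAttention()
obj = torch.randn(2, 64, 256)     # 64 object candidates per scene
pts = torch.randn(2, 1024, 256)   # features for 1024 points
out = layer(obj, pts)             # (2, 64, 256)
```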
Group-Free 3D Object Detection via Transformers
— AK (@ak92501) April 2, 2021
pdf: https://t.co/Liq6WX16cC
abs: https://t.co/ThBRk3rPcq
github: https://t.co/YiEihL55nk
" the proposed method achieves state-of-the-art 3D object detection performance on two widely used benchmarks, ScanNet V2 and SUN RGB-D." pic.twitter.com/AodWkXb31a
17. LIFT-SLAM: a deep-learning feature-based monocular visual SLAM method
Hudson M. S. Bruno, Esther L. Colombini
The Simultaneous Localization and Mapping (SLAM) problem addresses the ability of a robot to localize itself in an unknown environment and simultaneously build a consistent map of this environment. Recently, cameras have been successfully used to capture the environment’s features to perform SLAM, which is referred to as visual SLAM (VSLAM). However, classical VSLAM algorithms can easily be made to fail when either the motion of the robot or the environment is too challenging. Although new approaches based on Deep Neural Networks (DNNs) have achieved promising results in VSLAM, they are still unable to outperform traditional methods. To leverage the robustness of deep learning to enhance traditional VSLAM systems, we propose to combine the potential of deep learning-based feature descriptors with traditional geometry-based VSLAM, building a new VSLAM system called LIFT-SLAM. Experiments conducted on the KITTI and Euroc datasets show that deep learning can be used to improve the performance of traditional VSLAM systems, as the proposed approach achieves results comparable to the state-of-the-art while being robust to sensorial noise. We enhance the proposed VSLAM pipeline by avoiding parameter tuning for specific datasets with an adaptive approach, and we evaluate how transfer learning can affect the quality of the extracted features.
18. Mining Wikidata for Name Resources for African Languages
Jonne Sälevä, Constantine Lignos
This work supports further development of language technology for the languages of Africa by providing a Wikidata-derived resource of name lists corresponding to common entity types (person, location, and organization). While we are not the first to mine Wikidata for name lists, our approach emphasizes scalability and replicability and addresses data quality issues for languages that do not use Latin scripts. We produce lists containing approximately 1.9 million names across 28 African languages. We describe the data, the process used to produce it, and its limitations, and provide the software and data for public use. Finally, we discuss the ethical considerations of producing this resource and others of its kind.
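A hedged sketch of how such name lists can be pulled from Wikidata with a SPARQL query against the public endpoint; the example retrieves person labels in Swahili (language code "sw") and is a generic query, not the authors' exact pipeline or quality filters:

```python
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?item ?label WHERE {
  ?item wdt:P31 wd:Q5 .            # instance of: human
  ?item rdfs:label ?label .
  FILTER(LANG(?label) = "sw")      # label in Swahili
}
LIMIT 100
"""

resp = requests.get(SPARQL_ENDPOINT,
                    params={"query": QUERY, "format": "json"},
                    headers={"User-Agent": "name-mining-example/0.1"})
resp.raise_for_status()
names = [row["label"]["value"] for row in resp.json()["results"]["bindings"]]
print(len(names), names[:5])
```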
19. Text to Image Generation with Semantic-Spatial Aware GAN
Wentong Liao, Kai Hu, Michael Ying Yang, Bodo Rosenhahn
A text-to-image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) the conditional batch normalization methods are applied on the whole image feature maps equally, ignoring the local semantics; (2) the text encoder is fixed during training, whereas it should be trained jointly with the image generator to learn better text representations for image generation. To address these limitations, we propose a novel framework, Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformations conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with the input text description.
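The semantic-spatial idea can be sketched as text-conditioned affine modulation of normalized image features, gated by a predicted spatial mask that decides where the text should act. Layer shapes are illustrative, and the weakly supervised mask training is not shown:

```python
import torch
import torch.nn as nn

class SemanticSpatialModulation(nn.Module):
    """Affine-modulate image features with text features, gated by a predicted
    spatial mask (a simplified take on semantic-spatial conditioning)."""
    def __init__(self, channels=64, text_dim=256):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, channels)
        self.to_beta = nn.Linear(text_dim, channels)
        self.to_mask = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1),
                                     nn.Sigmoid())
        self.norm = nn.BatchNorm2d(channels, affine=False)

    def forward(self, feat, text):
        # feat: (B, C, H, W), text: (B, text_dim)
        h = self.norm(feat)
        gamma = self.to_gamma(text)[:, :, None, None]
        beta = self.to_beta(text)[:, :, None, None]
        mask = self.to_mask(feat)              # (B, 1, H, W): where to apply text
        return h + mask * (gamma * h + beta)

block = SemanticSpatialModulation()
out = block(torch.randn(4, 64, 32, 32), torch.randn(4, 256))  # (4, 64, 32, 32)
```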
Text to Image Generation with Semantic-Spatial Aware GAN
— AK (@ak92501) April 2, 2021
pdf: https://t.co/A6Yq8P8qtn
abs: https://t.co/F3ZhzibmEO pic.twitter.com/wtOF9WxUrp