1. ByT5: Towards a token-free future with pre-trained byte-to-byte models
Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
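The byte-level input scheme is simple enough to sketch: text maps straight to its UTF-8 bytes, with a small offset reserving ids for special tokens. A minimal sketch in Python, assuming a ByT5-style offset of 3 for pad/eos/unk (the exact mapping may differ from the released models):

```python
# A minimal sketch of byte-level encoding for token-free models.
# The offset of 3 for special tokens (pad/eos/unk) mirrors the convention
# used by ByT5-style tokenizers, but is an assumption here.

def encode_bytes(text: str, offset: int = 3) -> list[int]:
    """Map text to integer ids via its UTF-8 bytes; no tokenizer needed."""
    return [b + offset for b in text.encode("utf-8")]

def decode_bytes(ids: list[int], offset: int = 3) -> str:
    """Invert the mapping, skipping ids reserved for special tokens."""
    return bytes(i - offset for i in ids if i >= offset).decode("utf-8", errors="ignore")

print(encode_bytes("héllo"))                 # works for any language out of the box
print(decode_bytes(encode_bytes("héllo")))   # 'héllo'
```

Note how the multi-byte character `é` becomes two ids, which is exactly why byte sequences are longer than token sequences and why the paper characterizes the FLOPs/speed trade-offs.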
ByT5: Towards a token-free future with pre-trained byte-to-byte models
— Aran Komatsuzaki (@arankomatsuzaki) May 31, 2021
Shows that byte-level models are competitive with their token-level counterparts and more robust to noise.
abs: https://t.co/Nt6mgTIi29
code: https://t.co/cRWQfFDBFv pic.twitter.com/wZtxmqXjsf
2. Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging
S. Mahdi H. Miangoleh, Sebastian Dille, Long Mai, Sylvain Paris, Yağız Aksoy
Neural networks have shown great ability in estimating depth from a single image. However, the inferred depth maps are well below one-megapixel resolution and often lack fine-grained details, which limits their practicality. Our method builds on our analysis of how the input resolution and the scene structure affect depth estimation performance. We demonstrate that there is a trade-off between a consistent scene structure and high-frequency details, and we merge low- and high-resolution estimations to take advantage of this duality using a simple depth merging network. We present a double estimation method that improves whole-image depth estimation and a patch selection method that adds local details to the final result. We demonstrate that by merging estimations at different resolutions with changing context, we can generate multi-megapixel depth maps with a high level of detail using a pre-trained model.
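The merging idea can be illustrated with a hand-rolled frequency-band blend: a low-resolution estimate supplies globally consistent structure, a high-resolution estimate supplies fine detail. The paper learns this merge with a network; the sketch below, with placeholder inputs and an assumed Gaussian band split, is only a stand-in for the intuition:

```python
# A minimal sketch of the double-estimation intuition: keep the
# low-frequency structure of a low-resolution depth estimate and the
# high-frequency detail of a high-resolution one. The paper uses a
# learned merging network; this Gaussian band split is a toy stand-in.
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def merge_depth(d_low: np.ndarray, d_high: np.ndarray, sigma: float = 8.0) -> np.ndarray:
    """Blend structure from d_low with detail from d_high (depth maps)."""
    factors = [h / l for h, l in zip(d_high.shape, d_low.shape)]
    base = zoom(d_low, factors, order=1)               # upsample the consistent base
    detail = d_high - gaussian_filter(d_high, sigma)   # high-frequency residual
    return gaussian_filter(base, sigma) + detail

d_low = np.random.rand(96, 128)     # placeholder estimates; in practice both
d_high = np.random.rand(384, 512)   # come from the same pre-trained network
print(merge_depth(d_low, d_high).shape)  # (384, 512)
```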
Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging
— AK (@ak92501) May 31, 2021
pdf: https://t.co/uHtoSUIsLk
abs: https://t.co/oktiTnwuOl
project page: https://t.co/2L9ZFr7zRI pic.twitter.com/aZs0VgWhLi
3. Changing the World by Changing the Data
Anna Rogers
The NLP community is currently investing far more research and resources into the development of deep learning models than into training data. While we have made a lot of progress, it is now clear that our models learn all kinds of spurious patterns, social biases, and annotation artifacts. Algorithmic solutions have so far had limited success. An alternative that is being actively discussed is more careful design of datasets so as to deliver specific signals. This position paper maps out the arguments for and against data curation, and argues that fundamentally the point is moot: curation already is happening and will continue to happen, and it is changing the world. The only question is how much thought we want to invest in that process.
🎈 #NLPaperAlert: Changing the World 🌍 by Changing the Data 🗃 https://t.co/GHDTfcbW5Y
— Anna Rogers (@annargrs) May 31, 2021
A soul-searching piece that made it to ACL 2021:
- how NLP resources affect the world
- what does it even mean to 'work in NLP'
- how we can make better use of our subcommunities.
/1
4. ResT: An Efficient Transformer for Visual Recognition
Qinglong Zhang, Yubin Yang
This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to process raw images at a fixed resolution, ResT has several advantages: (1) a memory-efficient multi-head self-attention, which compresses memory with a simple depth-wise convolution and projects the interaction across the attention-head dimension while preserving the diversity of the heads; (2) a position encoding constructed as spatial attention, which is more flexible and can handle input images of arbitrary size without interpolation or fine-tuning; (3) instead of straightforward tokenization at the beginning of each stage, a patch embedding designed as a stack of overlapping convolution operations with stride on the 2D-reshaped token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that ResT outperforms recent state-of-the-art backbones by a large margin, demonstrating its potential as a strong backbone. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.
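Point (1) can be sketched as attention over a spatially compressed token map: a strided depth-wise convolution shrinks the keys and values before attention is computed. The module below is a simplified PyTorch illustration (dimensions are made up, and the paper's extra projection across attention heads is omitted), not the released implementation:

```python
# A simplified sketch of memory-efficient multi-head self-attention in
# the spirit of ResT: K and V are computed from a token map spatially
# compressed by a strided depth-wise convolution, shrinking the
# attention matrix. Not the released implementation.
import torch
import torch.nn as nn

class EfficientMSA(nn.Module):
    def __init__(self, dim: int, heads: int = 4, sr: int = 2):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # depth-wise conv compresses the (H, W) token map by a factor of sr
        self.compress = nn.Conv2d(dim, dim, kernel_size=sr, stride=sr, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.heads, C // self.heads).transpose(1, 2)
        # reshape tokens to 2D, compress, then flatten back for K and V
        x2d = x.transpose(1, 2).reshape(B, C, H, W)
        xc = self.compress(x2d).flatten(2).transpose(1, 2)
        kv = self.kv(xc).reshape(B, -1, 2, self.heads, C // self.heads)
        k, v = kv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

msa = EfficientMSA(dim=64)
tokens = torch.randn(2, 16 * 16, 64)   # 16x16 token map, batch of 2
print(msa(tokens, H=16, W=16).shape)   # torch.Size([2, 256, 64])
```

With `sr=2`, attention costs O(N · N/4) instead of O(N²), which is where the memory savings come from.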
ResT: An Efficient Transformer for Visual Recognition
— AK (@ak92501) May 31, 2021
pdf: https://t.co/buYVDxY7wb
abs: https://t.co/MOPWQd5BA2
github: https://t.co/pwLeXbcpSE
multi-scale Transformer which produces hierarchical feature representations for dense prediction pic.twitter.com/U10h0Cnhje
5. What Is Considered Complete for Visual Recognition?
Lingxi Xie, Xiaopeng Zhang, Longhui Wei, Jianlong Chang, Qi Tian
This is an opinion paper. We hope to deliver a key message: current visual recognition systems are far from complete, i.e., from recognizing everything that humans can recognize, yet it is very unlikely that the gap can be bridged by continuously increasing human annotations. Based on this observation, we advocate for a new type of pre-training task named learning-by-compression. Computational models (e.g., a deep network) are optimized to represent the visual data using compact features, and the features preserve the ability to recover the original data. Semantic annotations, when available, play the role of weak supervision. An important yet challenging issue is the evaluation of image recovery, for which we suggest some design principles and future research directions. We hope our proposal can inspire the community to pursue the compression-recovery trade-off rather than the accuracy-complexity trade-off.
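The advocated objective reduces, in its simplest form, to a compression-recovery loop: encode an image into compact features, decode it back, and score the recovery. A toy autoencoder sketch of that objective (architecture and loss are illustrative assumptions, not the paper's proposal):

```python
# A toy sketch of the compression-recovery objective the paper advocates:
# compress images into compact features from which the original can be
# recovered. The architecture and MSE loss are illustrative assumptions.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 8, 4, stride=2, padding=1))   # compact features
decoder = nn.Sequential(nn.ConvTranspose2d(8, 32, 4, stride=2, padding=1), nn.ReLU(),
                        nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))

images = torch.rand(4, 3, 64, 64)
recovered = decoder(encoder(images))
loss = nn.functional.mse_loss(recovered, images)   # recovery quality drives training
loss.backward()
```

The paper's open question is precisely what should replace the naive MSE here as a measure of recovery quality.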
What Is Considered Complete for Visual Recognition?
— Aran Komatsuzaki (@arankomatsuzaki) May 31, 2021
Describes the limitations of current visual recognition systems and suggests some future research directions. https://t.co/mcgSXSJSnf pic.twitter.com/qO3Ajn7AzE
6. DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion
Songxiang Liu, Yuewen Cao, Dan Su, Helen Meng
Singing voice conversion (SVC) is a promising technique that can enrich human-computer interaction by endowing a computer with the ability to produce high-fidelity and expressive singing voices. In this paper, we propose DiffSVC, an SVC system based on a denoising diffusion probabilistic model. DiffSVC uses phonetic posteriorgrams (PPGs) as content features. A denoising module is trained in DiffSVC, which takes the corrupted mel spectrogram produced by the diffusion/forward process and its corresponding step information as input to predict the added Gaussian noise. We use PPGs, fundamental frequency features and loudness features as auxiliary inputs to assist the denoising process. Experiments show that DiffSVC achieves superior conversion performance in terms of naturalness and voice similarity compared to current state-of-the-art SVC approaches.
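One training step of the described denoising setup can be sketched as follows: corrupt a clean mel spectrogram via the forward process, then train a module to predict the injected noise given the step and the auxiliary conditioning. The denoiser below is a placeholder MLP and the feature dimensions are assumptions:

```python
# A minimal sketch of one diffusion training step as described: corrupt
# a mel spectrogram with the forward process, then train a denoiser to
# predict the added Gaussian noise, conditioned on the step and auxiliary
# features (PPG, f0, loudness). The MLP and dims are placeholders.
import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.06, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative noise schedule

denoiser = nn.Sequential(nn.Linear(80 + 80 + 2 + 1, 256), nn.ReLU(),
                         nn.Linear(256, 80))      # stand-in for the real module

mel = torch.randn(8, 80)        # clean mel frames (batch of 8)
cond = torch.randn(8, 82)       # PPG (80) + f0 (1) + loudness (1), assumed dims
t = torch.randint(0, T, (8,))
noise = torch.randn_like(mel)

# forward/diffusion process: x_t = sqrt(a_bar)*x_0 + sqrt(1-a_bar)*eps
a = alpha_bar[t].unsqueeze(1)
mel_t = a.sqrt() * mel + (1 - a).sqrt() * noise

pred = denoiser(torch.cat([mel_t, cond, t.float().unsqueeze(1) / T], dim=1))
loss = nn.functional.mse_loss(pred, noise)        # learn to predict the noise
loss.backward()
```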
DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion
— AK (@ak92501) May 31, 2021
pdf: https://t.co/945wanF1up
abs: https://t.co/299tKHAzzV pic.twitter.com/RVQb656dK0
7. Mapping urban socioeconomic inequalities in developing countries through Facebook advertising data
Serena Giurgola, Simone Piaggesi, Márton Karsai, Yelena Mejova, André Panisson, Michele Tizzoni
Ending poverty in all its forms everywhere is the number one Sustainable Development Goal of the UN 2030 Agenda. To monitor progress towards such an ambitious target, reliable, up-to-date and fine-grained measurements of socioeconomic indicators are necessary. When it comes to socioeconomic development, novel digital traces can provide a complementary data source to overcome the limits of traditional data collection methods, which are often not regularly updated and lack adequate spatial resolution. In this study, we collect publicly available and anonymous advertising audience estimates from Facebook to predict the socioeconomic conditions of urban residents, at a fine spatial granularity, in four large urban areas: Atlanta (USA), Bogotá (Colombia), Santiago (Chile), and Casablanca (Morocco). We find that behavioral attributes inferred from the Facebook marketing platform can accurately map the socioeconomic status of residential areas within cities, and that predictive performance is comparable in both high- and low-resource settings. We also show that training a model on attributes of adult Facebook users, aged over 25, leads to a more accurate mapping of socioeconomic conditions in all cities. Our work provides additional evidence of the value of social advertising media data to measure human development.
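The prediction setup amounts to a supervised regression from per-area audience shares to a socioeconomic index. A minimal sketch with synthetic data (the feature set, model choice, and evaluation are illustrative assumptions, not the paper's exact pipeline):

```python
# A minimal sketch of the prediction setup: per-neighborhood Facebook
# audience estimates (fractions of users with given behavioral/device
# attributes) as features, a socioeconomic index as the target. The
# synthetic data, Ridge model, and CV scheme are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 12))    # 200 areas x 12 audience-share features
y = X @ rng.normal(size=12) + rng.normal(scale=0.1, size=200)  # synthetic SES index

model = Ridge(alpha=1.0)
print(cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```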
New pre-print is out on "Mapping urban socioeconomic inequalities in developing countries through Facebook advertising data". An @ISI_Fondazione teamwork with @GiurgolaSerena @simonepiaggesi @yelenamejova @apanisson and @mtizzoni https://t.co/v4P71oMxbY pic.twitter.com/fEsHV6rYVI
— Marton Karsai (@MartonKarsai) May 31, 2021
8. Learning to Stylize Novel Views
Hsin-Ping Huang, Hung-Yu Tseng, Saurabh Saini, Maneesh Singh, Ming-Hsuan Yang
We tackle a 3D scene stylization problem: generating stylized images of a scene from arbitrary novel views, given a set of images of the same scene and a reference image of the desired style as inputs. Directly combining novel view synthesis and stylization approaches leads to results that are blurry or inconsistent across different views. We propose a point cloud-based method for consistent 3D scene stylization. First, we construct the point cloud by back-projecting the image features into 3D space. Second, we develop point cloud aggregation modules to gather the style information of the 3D scene, and then modulate the features in the point cloud with a linear transformation matrix. Finally, we project the transformed features to 2D space to obtain the novel views. Experimental results on two diverse datasets of real-world scenes validate that our method generates consistent stylized novel view synthesis results compared to alternative approaches.
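The modulation step can be sketched directly: a matrix predicted from the style image linearly transforms the per-point features before they are projected back to 2D. The matrix predictor below is a placeholder; the paper learns it with dedicated point cloud aggregation modules:

```python
# A minimal sketch of the linear feature modulation step: point-cloud
# features are transformed by a matrix predicted from the style image.
# The single linear predictor here is a placeholder for the paper's
# learned aggregation modules.
import torch
import torch.nn as nn

dim = 32
matrix_net = nn.Linear(dim, dim * dim)   # style vector -> transform matrix

points = torch.randn(1, 1000, dim)       # back-projected per-point features
style = torch.randn(1, dim)              # pooled style-image features

T = matrix_net(style).view(1, dim, dim)
stylized = torch.bmm(points, T)          # modulate features, then render views
print(stylized.shape)                    # torch.Size([1, 1000, 32])
```

Because the same transformed point cloud is projected to every viewpoint, the stylization stays consistent across views by construction.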
Learning to Stylize Novel Views
— AK (@ak92501) May 31, 2021
pdf: https://t.co/sEtbrpMlV7
abs: https://t.co/99z2DjbN1r
project page: https://t.co/9wbPWgaUsU
design a point cloud transformation module to transfer the style of the reference image to the 3D representation pic.twitter.com/2nn9ZCN3qu
9. “Why Would I Trust Your Numbers?” On the Explainability of Expected Values in Soccer
Jan Van Haaren
In recent years, many different approaches have been proposed to quantify the performances of soccer players. Since player performances are challenging to quantify directly due to the low-scoring nature of soccer, most approaches estimate the expected impact of the players’ on-the-ball actions on the scoreline. While effective, these approaches are yet to be widely embraced by soccer practitioners. The soccer analytics community has primarily focused on improving the accuracy of the models, while the explainability of the produced metrics is often much more important to practitioners. To help bridge the gap between scientists and practitioners, we introduce an explainable Generalized Additive Model that estimates the expected value for shots. Unlike existing models, our model leverages features corresponding to widespread soccer concepts. To this end, we represent the locations of shots by fuzzily assigning the shots to designated zones on the pitch that practitioners are familiar with. Our experimental evaluation shows that our model is as accurate as existing models, while being easier to explain to soccer practitioners.
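The fuzzy zone representation can be sketched with Gaussian memberships: each shot location is soft-assigned to a handful of named pitch zones, and an additive model fit on those memberships keeps one interpretable weight per zone. The zone centers, kernel, and logistic fit below are illustrative assumptions, not the paper's exact model:

```python
# A minimal sketch of the fuzzy zone idea: shot locations get soft
# memberships to named pitch zones, and an additive model on those
# memberships stays explainable. Zone centers, the Gaussian kernel,
# and the logistic fit are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

zones = {"six-yard box": (3, 0), "penalty spot": (11, 0), "edge of box": (18, 0)}
centers = np.array(list(zones.values()))

def memberships(xy: np.ndarray, bandwidth: float = 6.0) -> np.ndarray:
    """Soft-assign shot coordinates to zones with Gaussian kernels."""
    d2 = ((xy[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
shots = rng.uniform([0, -20], [25, 20], size=(500, 2))  # (dist, lateral) in meters
goals = rng.random(500) < memberships(shots)[:, 0]      # toy labels: closer = likelier

xg_model = LogisticRegression().fit(memberships(shots), goals)
print(xg_model.coef_)   # one interpretable weight per named zone
```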
The limited explainability of expected value metrics for football is holding back their adoption by practitioners. Therefore, I am exploring ways to improve their explainability in a paper that I will be presenting at the AI for Sports Analytics workshop.https://t.co/idIo3ubr0D
— Jan Van Haaren (@JanVanHaaren) May 31, 2021
10. OTTers: One-turn Topic Transitions for Open-Domain Dialogue
Karin Sevegnani, David M. Howcroft, Ioannis Konstas, Verena Rieser
Mixed initiative in open-domain dialogue requires a system to pro-actively introduce new topics. The one-turn topic transition task explores how a system connects two topics in a cooperative and coherent manner. The goal of the task is to generate a “bridging” utterance connecting the new topic to the topic of the previous conversation turn. We are especially interested in commonsense explanations of how a new topic relates to what has been mentioned before. We first collect a new dataset of human one-turn topic transitions, which we call OTTers. We then explore different strategies used by humans when asked to complete such a task, and observe that using a bridging utterance to connect the two topics is the most common approach. We finally show how existing state-of-the-art text generation models can be adapted to this task and examine the performance of these baselines on different splits of the OTTers data.
Most open-domain #ConvAI systems are purely reactive. How can a system introduce new topics without sounding abrupt or incoherent? Check-out our new paper to appear @aclmeeting with @KarinSevegnani @sinantie @_dmh https://t.co/dmqLEV2Rcn
— Verena Rieser (@verena_rieser) May 31, 2021
11. NViSII: A Scriptable Tool for Photorealistic Image Generation
Nathan Morrical, Jonathan Tremblay, Yunzhi Lin, Stephen Tyree, Stan Birchfield, Valerio Pascucci, Ingo Wald
We present a Python-based renderer built on NVIDIA’s OptiX ray tracing engine and the OptiX AI denoiser, designed to generate high-quality synthetic images for research in computer vision and deep learning. Our tool enables the description and manipulation of complex dynamic 3D scenes containing object meshes, materials, textures, lighting, volumetric data (e.g., smoke), and backgrounds. Metadata, such as 2D/3D bounding boxes, segmentation masks, depth maps, normal maps, material properties, and optical flow vectors, can also be generated. In this work, we discuss design goals, architecture, and performance. We demonstrate the use of data generated by path tracing for training an object detector and pose estimator, showing improved performance in sim-to-real transfer in situations that are difficult for traditional raster-based renderers. We offer this tool as an easy-to-use, performant, high-quality renderer for advancing research in synthetic data generation and deep learning.
NViSII: A Scriptable Tool for Photorealistic Image Generation https://t.co/j6FQLkPzML pic.twitter.com/zfiwzYxYZG
— sim2real (@sim2realAIorg) May 31, 2021
NViSII: A Scriptable Tool for Photorealistic Image Generation
— AK (@ak92501) May 31, 2021
pdf: https://t.co/yPJavd1Muw
abs: https://t.co/wnE4mIpUlf
github: https://t.co/HkiJQM4dFh pic.twitter.com/81kaGkgrHA
12. ResearchGate and Google Scholar: How much do they differ in publications, citations and different metrics and why?
Vivek Kumar Singh, Satya Swarup Srichandan, Hiran H. Lathabai
ResearchGate has emerged as a popular professional network for scientists and researchers in a very short span of time. Similar to Google Scholar, ResearchGate's indexing uses an automatic crawling algorithm that extracts bibliographic data, citations and other information about scholarly articles from various sources. However, it has been observed that the two platforms often show different publication and citation data for the same institutions, journals and authors. This paper therefore attempts to analyse and measure the differences in publication counts, citations and various metrics between the two platforms for a large data set of highly cited authors. The results indicate substantial differences in publication counts and citations for the same authors on the two platforms, with Google Scholar having higher counts in the vast majority of cases. The metrics computed by the two platforms also differ in their values, showing different degrees of correlation. Coverage policy, indexing errors, the author attribution mechanism and the strategy for dealing with predatory publishing are found to be the most probable reasons for the differences between the two platforms.
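The core analysis is a paired comparison: for the same authors, contrast counts across the two platforms and correlate the derived metrics. A minimal sketch with toy numbers (the data are invented for illustration):

```python
# A minimal sketch of the comparison the paper performs: for the same
# authors, compare citation counts across the two platforms and
# correlate the metrics. All numbers below are toy data.
import pandas as pd
from scipy.stats import spearmanr

df = pd.DataFrame({
    "author": ["A", "B", "C", "D"],
    "gs_citations": [5200, 310, 12400, 980],   # Google Scholar counts (toy)
    "rg_citations": [4100, 290, 9800, 700],    # ResearchGate counts (toy)
})
df["diff_pct"] = 100 * (df.gs_citations - df.rg_citations) / df.gs_citations
rho, p = spearmanr(df.gs_citations, df.rg_citations)
print(df)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```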
ResearchGate and Google Scholar: How much do they differ in publications, citations and different metrics and why? Preprint now in arxiv: https://t.co/sWPZ7FajyI @googlescholar_ @ResearchGate @mikethelwall @JLOrtegaPriego @eomalea @albertomartin @HIRAN31021775 @satyaswarup98
— Vivek Singh (@vivekks12) May 31, 2021
13. Knowledge Inheritance for Pre-trained Language Models
Yujia Qin, Yankai Lin, Jing Yi, Jiajie Zhang, Xu Han, Zhengyan Zhang, Yusheng Su, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou
Recent explorations of large-scale pre-trained language models (PLMs) such as GPT-3 have revealed the power of PLMs with huge numbers of parameters, setting off a wave of training ever-larger PLMs. However, training a large-scale PLM requires tremendous computational resources, which is time-consuming and expensive. In addition, existing large-scale PLMs are mainly trained from scratch individually, ignoring the availability of many well-trained existing PLMs. To this end, we explore how previously trained PLMs can benefit the training of larger PLMs in the future. Specifically, we introduce a novel pre-training framework named “knowledge inheritance” (KI), which combines both self-learning and teacher-guided learning to efficiently train larger PLMs. Extensive experimental results demonstrate the feasibility of our KI framework. We also conduct empirical analyses to explore the effects of teacher PLMs’ pre-training settings, including model architecture, pre-training data, etc. Finally, we show that KI supports lifelong learning and knowledge transfer well.
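The KI objective as described can be sketched as a weighted mix of self-learning and teacher-guided distillation, with the teacher's weight annealed away as the larger student matures. The linear schedule and KL form below are assumptions for illustration:

```python
# A minimal sketch of a knowledge-inheritance objective: a mix of the
# student's own self-supervised loss and a teacher-guided distillation
# loss, with the teacher's weight annealed toward zero. The linear
# schedule and KL form are illustrative assumptions, not the paper's.
import torch
import torch.nn.functional as F

def ki_loss(student_logits, teacher_logits, labels, step, total_steps):
    alpha = max(0.0, 1.0 - step / total_steps)   # teacher weight decays to 0
    self_loss = F.cross_entropy(student_logits, labels)
    kd_loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       F.softmax(teacher_logits, dim=-1),
                       reduction="batchmean")
    return alpha * kd_loss + (1 - alpha) * self_loss

student_logits = torch.randn(8, 30000, requires_grad=True)  # vocab-size logits
teacher_logits = torch.randn(8, 30000)
labels = torch.randint(0, 30000, (8,))
ki_loss(student_logits, teacher_logits, labels, step=100, total_steps=1000).backward()
```

Early in training the student mostly imitates the smaller teacher; as `alpha` decays, self-learning takes over, which is what lets the student eventually surpass the teacher.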
Knowledge Inheritance for Pre-trained Language Models
— AK (@ak92501) May 31, 2021
pdf: https://t.co/a1b1PcrjdD
abs: https://t.co/Mog8wrnqJF
github: https://t.co/eqHVMMrw0K
pre-training framework, knowledge inheritance, combines both self-learning and teacher-guided learning to efficiently train larger PLMs pic.twitter.com/RVlodUWKjv