1. When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations
Xiangning Chen, Cho-Jui Hsieh, Boqing Gong
Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower the models with massive data, such as large-scale pretraining and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rate). Hence, this paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models’ data efficiency at training and generalization at inference. Visualization and Hessian analysis reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with the simple Inception-style preprocessing). We show that the improved smoothness is attributable to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations. They also possess more perceptive attention maps.
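For readers unfamiliar with the sharpness-aware optimizer (SAM) the paper relies on, the sketch below shows its core two-step update in PyTorch. It is a minimal illustration of SAM in general, not the authors' training code; class and argument names (`base_optimizer_cls`, `rho`) are my own.

```python
import torch

class SAM(torch.optim.Optimizer):
    """Two-step SAM update: ascend to a nearby 'sharp' point, then apply the
    base optimizer at the original weights using the perturbed gradient."""

    def __init__(self, params, base_optimizer_cls, rho=0.05, **kwargs):
        defaults = dict(rho=rho, **kwargs)
        super().__init__(params, defaults)
        self.base_optimizer = base_optimizer_cls(self.param_groups, **kwargs)

    @torch.no_grad()
    def first_step(self):
        for group in self.param_groups:
            grads = [p.grad for p in group["params"] if p.grad is not None]
            if not grads:
                continue
            grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
            scale = group["rho"] / (grad_norm + 1e-12)
            for p in group["params"]:
                if p.grad is None:
                    continue
                e = p.grad * scale
                p.add_(e)                      # w -> w + rho * g / ||g||
                self.state[p]["e"] = e

    @torch.no_grad()
    def second_step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                p.sub_(self.state[p]["e"])     # restore the original weights
        self.base_optimizer.step()             # update with the perturbed-point gradient

# Usage sketch:
# opt = SAM(model.parameters(), torch.optim.SGD, rho=0.05, lr=0.1, momentum=0.9)
# loss_fn(model(x), y).backward(); opt.first_step(); opt.zero_grad()
# loss_fn(model(x), y).backward(); opt.second_step(); opt.zero_grad()
```

The important detail is that the gradient applied in `second_step` is the one computed at the perturbed weights, which is what biases training toward flatter minima.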
When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations
— AK (@ak92501) June 4, 2021
pdf: https://t.co/GYknaVoNAM
abs: https://t.co/kaUxIdMVNQ
+5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, with the simple Inception-style preprocessing pic.twitter.com/EI1ZSUccUn
2. The Contestation of Tech Ethics: A Sociotechnical Approach to Ethics and Technology in Action
Ben Green
Recent controversies related to topics such as fake news, privacy, and algorithmic bias have prompted increased public scrutiny of digital technologies and soul-searching among many of the people associated with their development. In response, the tech industry, academia, civil society, and governments have rapidly increased their attention to “ethics” in the design and use of digital technologies (“tech ethics”). Yet almost as quickly as ethics discourse has proliferated across the world of digital technologies, the limitations of these approaches have also become apparent: tech ethics is vague and toothless, is subsumed into corporate logics and incentives, and has a myopic focus on individual engineers and technology design rather than on the structures and cultures of technology production. As a result of these limitations, many have grown skeptical of tech ethics and its proponents, charging them with “ethics-washing”: promoting ethics research and discourse to defuse criticism and government regulation without committing to ethical behavior. By looking at how ethics has been taken up in both science and business in superficial and depoliticizing ways, I recast tech ethics as a terrain of contestation where the central fault line is not whether it is desirable to be ethical, but what “ethics” entails and who gets to define it. This framing highlights the significant limits of current approaches to tech ethics and the importance of studying the formulation and real-world effects of tech ethics. In order to identify and develop more rigorous strategies for reforming digital technologies and the social relations that they mediate, I describe a sociotechnical approach to tech ethics, one that reflexively applies many of tech ethics’ own lessons regarding digital technologies to tech ethics itself.
Really excited to share a new paper draft! I survey the significant limits and fault lines of “tech ethics,” characterizing tech ethics as a terrain of contestation where the central struggle is over *what* ethics entails and *who* gets to define it.
— Ben Green (@benzevgreen) June 4, 2021
📑: https://t.co/RRvvv5CPfn pic.twitter.com/q95Q5hgCVZ
If you're thinking about tech ethics in education then this by @benzevgreen is super important reading. In UK our only "AIed" ethics representatives are industry-friendly and politically-connected. They're defining tech ethics for education. That matters. https://t.co/S2BDSM8Xlh
— Ben Williamson (@BenPatrickWill) June 5, 2021
3. Neural Actor: Neural Free-view Synthesis of Human Actors with Pose Control
Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu Sarkar, Jiatao Gu, Christian Theobalt
We propose Neural Actor (NA), a new method for high-quality synthesis of humans from arbitrary viewpoints and under arbitrary controllable poses. Our method is built upon recent neural scene representation and rendering works which learn representations of geometry and appearance from only 2D images. While existing works demonstrated compelling rendering of static scenes and playback of dynamic scenes, photo-realistic reconstruction and rendering of humans with neural implicit methods, in particular under user-controlled novel poses, is still difficult. To address this problem, we utilize a coarse body model as the proxy to unwarp the surrounding 3D space into a canonical pose. A neural radiance field learns pose-dependent geometric deformations and pose- and view-dependent appearance effects in the canonical space from multi-view video input. To synthesize novel views of high-fidelity dynamic geometry and appearance, we leverage 2D texture maps defined on the body model as latent variables for predicting residual deformations and the dynamic appearance. Experiments demonstrate that our method achieves better quality than state-of-the-art methods on playback as well as novel pose synthesis, and can even generalize well to new poses that starkly differ from the training poses. Furthermore, our method also supports body shape control of the synthesized results.
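A rough sketch of the canonical-space idea, under simplifying assumptions: each sample point is unwarped via the nearest body-model vertex (the paper uses a proper model-based unwarping), and a texture-map latent conditions the radiance field. All names below are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CanonicalQuery(nn.Module):
    def __init__(self, feat_dim=32, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),   # RGB + density
        )

    def forward(self, x_world, verts_posed, verts_canonical, tex_feat):
        # x_world:         (N, 3) sample points along camera rays, in world space
        # verts_posed:     (V, 3) coarse body-model vertices in the current pose
        # verts_canonical: (V, 3) the same vertices in the canonical pose
        # tex_feat:        (V, F) per-vertex latents sampled from the 2D texture map
        idx = torch.cdist(x_world, verts_posed).argmin(dim=1)   # nearest body vertex
        offset = x_world - verts_posed[idx]                     # local offset to the surface
        x_canonical = verts_canonical[idx] + offset             # crude unwarp to canonical space
        return self.mlp(torch.cat([x_canonical, tex_feat[idx]], dim=-1))

# e.g. 6890 vertices as in an SMPL-style body model
out = CanonicalQuery()(torch.randn(4096, 3), torch.randn(6890, 3),
                       torch.randn(6890, 3), torch.randn(6890, 32))
print(out.shape)   # torch.Size([4096, 4])
```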
Neural Actor: Neural Free-view Synthesis of Human Actors with Pose Control
— AK (@ak92501) June 4, 2021
pdf: https://t.co/cNmHoi25tl
abs: https://t.co/Q8PObWc83w
project page: https://t.co/wRHnNJmZbm pic.twitter.com/kvn6ZVdU39
4. Luna: Linear Unified Nested Attention
Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, Luke Zettlemoyer
The quadratic computational and memory complexities of the Transformer’s attention mechanism have limited its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function. As compared to a more traditional attention mechanism, Luna introduces an additional sequence with a fixed length as input and an additional corresponding output, which allows Luna to perform attention operations linearly, while also storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate both the effectiveness and efficiency of Luna compared to a variety of strong baseline methods.
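A minimal sketch of the pack/unpack structure using PyTorch's built-in multi-head attention: a learned fixed-length sequence attends over the input (linear in sequence length), and the input then attends over that packed context. Dimensions and the handling of the extra output are illustrative, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class LunaAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4, packed_len=16):
        super().__init__()
        # The extra fixed-length sequence that gets packed against the input.
        self.p = nn.Parameter(torch.randn(packed_len, d_model))
        self.pack = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.unpack = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                                  # x: (batch, seq_len, d_model)
        b = x.size(0)
        p = self.p.unsqueeze(0).expand(b, -1, -1)          # (b, packed_len, d)
        packed, _ = self.pack(p, x, x)                     # P attends over X: O(seq_len * packed_len)
        out, _ = self.unpack(x, packed, packed)            # X attends over the packed context
        return out, packed                                 # packed context is also an output

x = torch.randn(2, 1024, 64)
out, packed = LunaAttention()(x)
print(out.shape, packed.shape)   # torch.Size([2, 1024, 64]) torch.Size([2, 16, 64])
```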
Luna: Linear Unified Nested Attention
— AK (@ak92501) June 4, 2021
pdf: https://t.co/Ui56MpOt4E
abs: https://t.co/yBwB21Ag3d
a simple, efficient and effective linear attention mechanism used as a drop-in substitute for regular softmax attention pic.twitter.com/Vj1weMJtbC
5. Single Image Depth Estimation using Wavelet Decomposition
Michaël Ramamonjisoa, Michael Firman, Jamie Watson, Vincent Lepetit, Daniyar Turmukhambetov
We present a novel method for predicting accurate depths from monocular images with high efficiency. This optimal efficiency is achieved by exploiting wavelet decomposition, which is integrated in a fully differentiable encoder-decoder architecture. We demonstrate that we can reconstruct high-fidelity depth maps by predicting sparse wavelet coefficients. In contrast with previous works, we show that wavelet coefficients can be learned without direct supervision on coefficients. Instead we supervise only the final depth image that is reconstructed through the inverse wavelet transform. We additionally show that wavelet coefficients can be learned in fully self-supervised scenarios, without access to ground-truth depth. Finally, we apply our method to different state-of-the-art monocular depth estimation models, in each case giving similar or better results compared to the original model, while requiring less than half the multiply-adds in the decoder network. Code at https://github.com/nianticlabs/wavelet-monodepth
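To make the reconstruction step concrete, the toy NumPy sketch below applies a single-level inverse Haar transform to a coarse depth map and predicted detail coefficients, doubling the resolution. The real method uses a multi-scale differentiable inverse wavelet transform inside the decoder; this only illustrates the single-level math.

```python
import numpy as np

def inverse_haar_2d(LL, LH, HL, HH):
    # Reconstruct a (2H, 2W) image from the four (H, W) wavelet bands
    # (normalized Haar convention).
    H, W = LL.shape
    out = np.zeros((2 * H, 2 * W), dtype=LL.dtype)
    out[0::2, 0::2] = (LL + LH + HL + HH) / 2
    out[0::2, 1::2] = (LL - LH + HL - HH) / 2
    out[1::2, 0::2] = (LL + LH - HL - HH) / 2
    out[1::2, 1::2] = (LL - LH - HL + HH) / 2
    return out

coarse_depth = np.random.rand(120, 160)          # low-resolution depth (LL band)
details = [np.zeros_like(coarse_depth)] * 3      # sparse detail bands predicted by the network
full_depth = inverse_haar_2d(coarse_depth, *details)
print(full_depth.shape)                          # (240, 320)
```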
I'm happy to share our upcoming #CVPR #CVPR2021 paper "Single Image Depth Prediction with Wavelet Decomposition"
— Daniyar Turmukhambetov (@dantkz) June 4, 2021
TLDR: monodepth but with wavelets for fewer decoder convolutions
arxiv: https://t.co/8JIsPX700a
code: https://t.co/e4p3FDLj1f pic.twitter.com/fBoO32ejga
6. Barbershop: GAN-based Image Compositing using Segmentation Masks
Peihao Zhu, Rameen Abdal, John Femiani, Peter Wonka
Seamlessly blending features from multiple images is extremely challenging because of complex relationships in lighting, geometry, and partial occlusion which cause coupling between different parts of the image. Even though recent work on GANs enables synthesis of realistic hair or faces, it remains difficult to combine them into a single, coherent, and plausible image rather than a disjointed set of image patches. We present a novel solution to image blending, particularly for the problem of hairstyle transfer, based on GAN-inversion. We propose a novel latent space for image blending which is better at preserving detail and encoding spatial information, and propose a new GAN-embedding algorithm which is able to slightly modify images to conform to a common segmentation mask. Our novel representation enables the transfer of the visual properties from multiple reference images including specific details such as moles and wrinkles, and because we do image blending in a latent-space we are able to synthesize images that are coherent. Our approach avoids blending artifacts present in other approaches and finds a globally consistent image. Our results demonstrate a significant improvement over the current state of the art in a user study, with users preferring our blending solution over 95 percent of the time.
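A schematic of the mask-guided blending idea, with placeholder shapes and no real GAN: spatial structure features from several inverted references are composited according to a target segmentation mask, and appearance codes are mixed with learned weights. This is not the authors' latent space or embedding algorithm, only the compositing pattern.

```python
import torch

def blend_structure(features, masks):
    # features: list of (C, H, W) structure tensors from GAN-inverted reference images
    # masks:    list of (H, W) float masks assigning each spatial region to one reference
    out = torch.zeros_like(features[0])
    for f, m in zip(features, masks):
        out = out + f * m.unsqueeze(0)
    return out

def blend_appearance(codes, weights):
    # codes: (K, L, D) style/appearance codes from K references; weights: (K,) mixing logits
    w = torch.softmax(weights, dim=0)
    return (w[:, None, None] * codes).sum(dim=0)

feats = [torch.randn(512, 32, 32), torch.randn(512, 32, 32)]
masks = [torch.zeros(32, 32), torch.zeros(32, 32)]
masks[0][:16] = 1.0   # e.g. face region taken from reference 0
masks[1][16:] = 1.0   # e.g. hair region taken from reference 1
print(blend_structure(feats, masks).shape)       # torch.Size([512, 32, 32])
```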
Barbershop: GAN-based Image Compositing using Segmentation Masks
— AK (@ak92501) June 4, 2021
pdf: https://t.co/O69XOLB8zY
abs: https://t.co/BwayqZpjUF
project page: https://t.co/epT4KebIg1 pic.twitter.com/iP1DZMymzs
7. Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?
Jieyu Zhao, Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Kai-Wei Chang
Is it possible to use natural language to intervene in a model’s behavior and alter its prediction in a desired way? We investigate the effectiveness of natural language interventions for reading-comprehension systems, studying this in the context of social stereotypes. Specifically, we propose a new language understanding task, Linguistic Ethical Interventions (LEI), where the goal is to amend a question-answering (QA) model’s unethical behavior by communicating context-specific principles of ethics and equity to it. To this end, we build upon recent methods for quantifying a system’s social stereotypes, augmenting them with different kinds of ethical interventions and the desired model behavior under such interventions. Our zero-shot evaluation finds that even today’s powerful neural language models are extremely poor ethical-advice takers, that is, they respond surprisingly little to ethical interventions even though these interventions are stated as simple sentences. Few-shot learning improves model behavior but remains far from the desired outcome, especially when evaluated for various types of generalization. Our new task thus poses a novel language understanding challenge for the community.
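A hedged sketch of the zero-shot setup: prepend an ethical intervention to the QA context and compare the model's answers with and without it. The model checkpoint, context, and intervention text below are illustrative; the paper's LEI benchmark defines the actual probes and metrics.

```python
from transformers import pipeline

# Any extractive QA checkpoint works for the illustration; this one is a common default.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = "A doctor and a nurse were waiting at the station."
question = "Who is more likely to be a woman?"
intervention = "Note: one's profession does not determine their gender."

plain = qa(question=question, context=context)
intervened = qa(question=question, context=intervention + " " + context)
print(plain["answer"], "->", intervened["answer"])  # does the stated principle change the answer?
```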
Can we intervene in a model’s behavior by natural languages? Check our #ACL2021 Findings “Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?” (https://t.co/T7CpoDzKbY). w/ @DanielKhashabi, Tushar Khot, Ashish Sabharwal, and @kaiwei_chang. 1/n pic.twitter.com/ZP0tag1TLR
— Jieyu Zhao (@jieyuzhao11) June 5, 2021
Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?
— AK (@ak92501) June 4, 2021
pdf: https://t.co/eEyLuKAUk3
abs: https://t.co/o33FyK8vOi
language understanding task, where the goal is to amend a QA model’s unethical behavior by communicating context-specific principles pic.twitter.com/2549ai2scQ
8. Container: Context Aggregation Network
Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi
Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers, originally introduced in natural language processing, have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP based solution without any traditional convolutional or Transformer components can produce effective visual representations. While CNNs, Transformers and MLP-Mixers may be considered as completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method to aggregate spatial context in a neural network stack. We present Container (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions à la Transformers while still exploiting the inductive bias of the local convolution operation, leading to the faster convergence speeds often seen in CNNs. In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named Container-Light, can be employed in object detection and instance segmentation networks such as DETR, RetinaNet and Mask-RCNN to obtain an impressive detection mAP of 38.9, 43.8, 45.1 and mask mAP of 41.3, providing large improvements of 6.6, 7.3, 6.9 and 6.6 pts respectively, compared to a ResNet-50 backbone with a comparable compute and parameter size. Our method also achieves promising results on self-supervised learning compared to DeiT on the DINO framework.
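The unified view can be summarized as computing the output from a mixture of a dynamic, input-dependent affinity matrix (attention-like) and a static, learnable one (convolution-like), gated by learnable scalars. The sketch below is a minimal single-head rendering of that idea; names and shapes are my simplifications, not the released code.

```python
import torch
import torch.nn as nn

class ContextAggregation(nn.Module):
    def __init__(self, dim, num_tokens):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        # Input-independent affinity, playing the role of a convolution's static kernel.
        self.static_affinity = nn.Parameter(torch.zeros(num_tokens, num_tokens))
        self.alpha = nn.Parameter(torch.tensor(0.5))   # gate for the dynamic branch
        self.beta = nn.Parameter(torch.tensor(0.5))    # gate for the static branch

    def forward(self, x):                              # x: (batch, num_tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        dyn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        affinity = self.alpha * dyn + self.beta * self.static_affinity
        return affinity @ v

x = torch.randn(2, 196, 384)
print(ContextAggregation(384, 196)(x).shape)   # torch.Size([2, 196, 384])
```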
Presenting the CONTAINER (CONtext Aggregation NEtwoRk) -- a unified view of Transformers, CNNs, MLP-Mixers via a general-purpose building block for multi-head context aggregation.https://t.co/tsEow09Sah by Peng, @jiasenlu, Hongsheng, @RoozbehMottaghi, @anikembhavi @allen_ai pic.twitter.com/BouZ4f40xT
— Jiasen Lu (@jiasenlu) June 4, 2021
Container: Context Aggregation Network
— AK (@ak92501) June 4, 2021
pdf: https://t.co/MWec8TBMvy
abs: https://t.co/jXa0r1XgXh
a generalized context aggregation building block that combines static and dynamic affinity matrices using learnable parameters pic.twitter.com/L9BAIixN4B
9. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh
Attention is sparse in vision transformers. We observe that the final prediction in vision transformers is based only on a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. To optimize the prediction module in an end-to-end manner, we propose an attention masking strategy to differentiably prune a token by blocking its interactions with other tokens. Benefiting from the nature of self-attention, the unstructured sparse tokens are still hardware friendly, which makes it easy for our framework to achieve an actual speed-up. By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves throughput by over 40%, while the drop in accuracy is within 0.5% for various vision transformers. Equipped with the dynamic token sparsification framework, DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet. Code is available at https://github.com/raoyongming/DynamicViT
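A simplified inference-time sketch of token sparsification: a lightweight MLP scores the tokens and only the top-k most informative ones are kept for the remaining layers. The paper's training-time machinery (Gumbel-softmax sampling and attention masking for differentiable pruning) is omitted here.

```python
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    def __init__(self, dim, keep_ratio=0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens):                         # tokens: (batch, n, dim), CLS token excluded
        n_keep = max(1, int(tokens.size(1) * self.keep_ratio))
        scores = self.score(tokens).squeeze(-1)        # (batch, n) importance per token
        keep = scores.topk(n_keep, dim=1).indices      # indices of the most informative tokens
        keep = keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, keep)                  # (batch, n_keep, dim)

x = torch.randn(2, 196, 384)                           # e.g. ViT-S patch tokens
print(TokenPruner(384)(x).shape)                       # torch.Size([2, 137, 384])
```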
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
— AK (@ak92501) June 4, 2021
pdf: https://t.co/D6PwXK7tY3
abs: https://t.co/eZ3lw3YAmh
github: https://t.co/eA82k0xuXQ
project page: https://t.co/tVtV8ghWkS pic.twitter.com/xzTYqhrWp7
10. Towards a Mathematical Theory of Abstraction
Beren Millidge
While the utility of well-chosen abstractions for understanding and predicting the behaviour of complex systems is well appreciated, precisely what an abstraction is has so far largely eluded mathematical formalization. In this paper, we aim to set out a mathematical theory of abstraction. We provide a precise characterisation of what an abstraction is and, perhaps more importantly, suggest how abstractions can be learnt directly from data both for static datasets and for dynamical systems. We define an abstraction to be a small set of 'summaries' of a system which can be used to answer a set of queries about the system or its behaviour. The difference between the ground truth behaviour of the system on the queries and the behaviour of the system predicted only by the abstraction provides a measure of the 'leakiness' of the abstraction, which can be used as a loss function to directly learn abstractions from data. Our approach can be considered a generalization of classical statistics where we are not interested in reconstructing 'the data' in full, but are instead only concerned with answering a set of arbitrary queries about the data. While highly theoretical, our results have deep implications for statistical inference and machine learning and could be used to develop explicit methods for learning precise kinds of abstractions directly from data.
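A toy numerical rendering of the leakiness idea, under my own illustrative choices (moment summaries, mean-squared gap): an abstraction is a small set of summaries of the data, and its leakiness is the gap between the true query answers and the answers recoverable from the summaries alone.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 10))                 # the "system": samples of a 10-d variable
queries = [lambda d: d.mean(axis=0),               # Q1: per-dimension mean
           lambda d: (d[:, 0] * d[:, 1]).mean()]   # Q2: a cross-moment statistic

def summaries(d):
    # The abstraction: only the first two moments of dimensions 0 and 1 are retained.
    return np.array([d[:, 0].mean(), d[:, 1].mean(), d[:, 0].var(), d[:, 1].var()])

def predict_from_summaries(s, q_index):
    # Best guess at each query given only the summaries.
    if q_index == 0:
        return np.concatenate([s[:2], np.zeros(8)])    # unknown dimensions predicted as 0
    return 0.0                                         # cross-moment not stored, predict 0

leakiness = sum(np.mean((q(data) - predict_from_summaries(summaries(data), i)) ** 2)
                for i, q in enumerate(queries))
print(leakiness)    # larger values mean the abstraction leaks more on this query set
```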
Today, I'm excited to announce a new preprint: "Towards a Mathematical Theory of Abstraction" (https://t.co/juUp7K2i3d).
— Beren Millidge (@BerenMillidge) June 4, 2021
Here we try to start formalizing our folk notion of an 'abstraction' with the goal of eventually being able to learn abstractions directly from data.
11. Towards Learning to Play Piano with Dexterous Hands and Touch
Huazhe Xu, Yuping Luo, Shaoxiong Wang, Trevor Darrell, Roberto Calandra
The virtuoso plays the piano with passion, poetry and extraordinary technical ability. As Liszt said, (a virtuoso) must call up scent and blossom, and breathe the breath of life. The strongest robots that can play a piano are based on a combination of specialized robot hands/piano and hardcoded planning algorithms. In contrast, in this paper we demonstrate how an agent can learn directly from machine-readable music scores to play the piano with dexterous hands on a simulated piano, using reinforcement learning (RL) from scratch. We demonstrate that the RL agents can not only find the correct key positions but also deal with various rhythmic, volume and fingering requirements. We achieve this by using a touch-augmented reward and a novel curriculum of tasks. We conclude by carefully studying the important aspects that enable such learning algorithms, which can potentially shed light on future research in this direction.
Towards Learning to Play Piano with Dexterous Hands and Touch
— AK (@ak92501) June 4, 2021
pdf: https://t.co/e9y37yfiE9
abs: https://t.co/IzkikMRabG pic.twitter.com/LQtZE5wwIw
12. Robust Reference-based Super-Resolution via C2-Matching
Yuming Jiang, Kelvin C.K. Chan, Xintao Wang, Chen Change Loy, Ziwei Liu
Reference-based Super-Resolution (Ref-SR) has recently emerged as a promising paradigm to enhance a low-resolution (LR) input image by introducing an additional high-resolution (HR) reference image. Existing Ref-SR methods mostly rely on implicit correspondence matching to borrow HR textures from reference images to compensate for the information loss in input images. However, performing local transfer is difficult because of two gaps between input and reference images: the transformation gap (e.g. scale and rotation) and the resolution gap (e.g. HR and LR). To tackle these challenges, we propose C2-Matching in this work, which produces explicit robust matching across the transformation and resolution gaps. 1) For the transformation gap, we propose a contrastive correspondence network, which learns transformation-robust correspondences using augmented views of the input image. 2) For the resolution gap, we adopt a teacher-student correlation distillation, which distills knowledge from the easier HR-HR matching to guide the more ambiguous LR-HR matching. 3) Finally, we design a dynamic aggregation module to address the potential misalignment issue. In addition, to faithfully evaluate the performance of Ref-SR under a realistic setting, we contribute the Webly-Referenced SR (WR-SR) dataset, mimicking the practical usage scenario. Extensive experiments demonstrate that our proposed C2-Matching significantly outperforms the state of the art by over 1dB on the standard CUFED5 benchmark. Notably, it also shows great generalizability on the WR-SR dataset as well as robustness across large scale and rotation transformations.
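A hedged sketch of the contrastive correspondence objective: descriptors of the same point under two augmented views should match each other and no other point, which an InfoNCE-style loss captures. The temperature and feature dimensions are placeholders, not the paper's exact network or training recipe.

```python
import torch
import torch.nn.functional as F

def correspondence_contrastive_loss(feat_a, feat_b, temperature=0.07):
    # feat_a, feat_b: (N, D) descriptors of the same N points under two augmentations;
    # row i of feat_a corresponds to row i of feat_b.
    feat_a = F.normalize(feat_a, dim=-1)
    feat_b = F.normalize(feat_b, dim=-1)
    logits = feat_a @ feat_b.t() / temperature        # (N, N) similarity matrix
    targets = torch.arange(feat_a.size(0))            # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = correspondence_contrastive_loss(torch.randn(256, 128), torch.randn(256, 128))
print(loss.item())
```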
Robust Reference-based Super-Resolution via C2-Matching
— AK (@ak92501) June 4, 2021
pdf: https://t.co/SRUwBqfMac
abs: https://t.co/LLQa7JwKcH
github: https://t.co/7zCCbyVpVS pic.twitter.com/wFLKCP8e4V
Our #CVPR2021 paper "Robust Reference-based Super-Resolution via C2-Matching":
— Ziwei Liu (@liuziwei7) June 4, 2021
Paper: https://t.co/kn4nPK9Nly
Code: https://t.co/p1OqjEckCU
- *C2-Matching*: a robust ref-SR framework that significantly outperforms existing methods.
- *WR-SR*: A new webly-referenced SR dataset. pic.twitter.com/O5MtxnmVAy
13. NeRFactor: Neural Factorization of Shape and Reflectance Under an Unknown Illumination
Xiuming Zhang, Pratul P. Srinivasan, Boyang Deng, Paul Debevec, William T. Freeman, Jonathan T. Barron
We address the problem of recovering the shape and spatially-varying reflectance of an object from posed multi-view images of the object illuminated by one unknown lighting condition. This enables the rendering of novel views of the object under arbitrary environment lighting and editing of the object’s material properties. The key to our approach, which we call Neural Radiance Factorization (NeRFactor), is to distill the volumetric geometry of a Neural Radiance Field (NeRF) [Mildenhall et al. 2020] representation of the object into a surface representation and then jointly refine the geometry while solving for the spatially-varying reflectance and the environment lighting. Specifically, NeRFactor recovers 3D neural fields of surface normals, light visibility, albedo, and Bidirectional Reflectance Distribution Functions (BRDFs) without any supervision, using only a re-rendering loss, simple smoothness priors, and a data-driven BRDF prior learned from real-world BRDF measurements. By explicitly modeling light visibility, NeRFactor is able to separate shadows from albedo and synthesize realistic soft or hard shadows under arbitrary lighting conditions. NeRFactor is able to recover convincing 3D models for free-viewpoint relighting in this challenging and underconstrained capture setup for both synthetic and real scenes. Qualitative and quantitative experiments show that NeRFactor outperforms classic and deep learning-based state of the art across various tasks. Our code and data are available at people.csail.mit.edu/xiuming/projects/nerfactor/.
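To show the structure of the factorization, the heavily simplified sketch below renders a single surface point as a visibility-weighted, Lambertian-only sum over light directions. NeRFactor additionally predicts a learned BRDF and optimizes everything through a re-rendering loss; treat this as a schematic only.

```python
import math
import torch
import torch.nn.functional as F

def render_lambertian(albedo, normal, light_dirs, light_rgb, visibility):
    # albedo: (3,), normal: (3,) unit vector, light_dirs: (L, 3) unit vectors,
    # light_rgb: (L, 3) environment light, visibility: (L,) in [0, 1]
    cos = torch.clamp((light_dirs * normal).sum(-1), min=0.0)          # (L,) foreshortening
    incoming = light_rgb * (visibility * cos).unsqueeze(-1)            # (L, 3) shadowed light
    return (albedo / math.pi) * incoming.sum(0)                        # (3,) rendered RGB

rgb = render_lambertian(torch.tensor([0.8, 0.6, 0.5]),
                        torch.tensor([0.0, 0.0, 1.0]),
                        F.normalize(torch.randn(64, 3), dim=-1),
                        torch.rand(64, 3),
                        torch.ones(64))
print(rgb)
```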
NeRFactor: Neural Factorization of Shape and Reflectance Under an Unknown Illumination
— AK (@ak92501) June 4, 2021
pdf: https://t.co/GolmVJOB4P
abs: https://t.co/P82yZL2lnq
project page: https://t.co/uViygdoU40
github: https://t.co/OmdOypmcMP pic.twitter.com/C4HJZyxxym
14. E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, Fei Huang
Vision-language pre-training (VLP) on large-scale image-text pairs has achieved huge success for cross-modal downstream tasks. Most existing pre-training methods adopt a two-step training procedure: they first employ a pre-trained object detector to extract region-based visual features, then concatenate the image representation and text embedding as the input to a Transformer for training. However, these methods suffer from using the task-specific visual representation of a particular object detector for generic cross-modal understanding, and from the computational inefficiency of the two-stage pipeline. In this paper, we propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, where we build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text. We incorporate the tasks of object detection and image captioning into pre-training with a unified Transformer encoder-decoder architecture to enhance visual learning. An extensive set of experiments has been conducted on well-established vision-language downstream tasks to demonstrate the effectiveness of this novel VLP paradigm.
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
— AK (@ak92501) June 4, 2021
pdf: https://t.co/g6VZqsREuq
abs: https://t.co/acUnJv2nl8
end-to-end paradigm for pixel-level vision-language pretraining, jointly learn visual representation, semantic alignments between image and text pic.twitter.com/S6oImq92pk
15. What Happened Next? Using Deep Learning to Value Defensive Actions in Football Event-Data
Charbel Merhej, Ryan Beal, Sarvapali Ramchurn, Tim Matthews
Objectively quantifying the value of player actions in football (soccer) is a challenging problem. To date, studies in football analytics have mainly focused on the attacking side of the game, while there has been less work on event-driven metrics for valuing defensive actions (e.g., tackles and interceptions). Therefore in this paper, we use deep learning techniques to define a novel metric that values such defensive actions by studying the threat of passages of play that preceded them. By doing so, we are able to value defensive actions based on what they prevented from happening in the game. Our Defensive Action Expected Threat (DAxT) model has been validated using real-world event-data from the 2017/2018 and 2018/2019 English Premier League seasons, and we combine our model outputs with additional features to derive an overall rating of defensive ability for players. Overall, we find that our model is able to predict the impact of defensive actions allowing us to better value defenders using event-data.
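A schematic of the valuation idea: the credit assigned to a defensive action is the expected threat of the passage of play it interrupted, estimated by a sequence model over the preceding events. The feature layout and LSTM head below are illustrative assumptions, not the trained DAxT model.

```python
import torch
import torch.nn as nn

class ThreatModel(nn.Module):
    def __init__(self, event_dim=8, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(event_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, events):                     # events: (batch, seq, event_dim)
        _, (h, _) = self.lstm(events)
        return torch.sigmoid(self.head(h[-1]))     # expected threat of the interrupted sequence

preceding_events = torch.randn(1, 10, 8)           # 10 events before the tackle/interception
daxt_value = ThreatModel()(preceding_events)       # credit assigned to the defensive action
print(daxt_value)
```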
Excited to announce our paper "What Happened Next? Using Deep Learning to Value Defensive Actions in Football Event-Data" has been accepted for publication in the Applied Data Science Track at #KDD2021. The paper is available here ⬇️https://t.co/tYRie5xkkp
— Ryan Beal (@ryanbeal95) June 4, 2021
16. Undecidability of Learnability
Matthias C. Caro
Machine learning researchers and practitioners steadily enlarge the multitude of successful learning models. They achieve this through in-depth theoretical analyses and experiential heuristics. However, there is no known general-purpose procedure for rigorously evaluating whether newly proposed models indeed successfully learn from data. We show that such a procedure cannot exist. For PAC binary classification, uniform and universal online learning, and exact learning through teacher-learner interactions, learnability is in general undecidable, both in the sense of independence of the axioms in a formal system and in the sense of uncomputability. Our proofs proceed via computable constructions of function classes that encode the consistency problem for formal systems and the halting problem for Turing machines into complexity measures that characterize learnability. Our work shows that undecidability appears in the theoretical foundations of machine learning: There is no one-size-fits-all algorithm for deciding whether a machine learning model can be successful. We cannot in general automatize the process of assessing new learning models.
I have a new @arxiv preprint: "Undecidability of Learnability" at https://t.co/sgi5gEkTCY. It's my first non-quantum project! pic.twitter.com/vHjrrufDcf
— Matthias C. Caro (@IMathYou2) June 4, 2021
There is a new paper on arXiv, https://t.co/2pMRJ7Fcni
— Asaf Karagila (@AsafKaragila) June 4, 2021
"Undecidability of Learnability" by Matthias C. Caro. This isn't the first type of paper like that.
If someone wants to look into these things under V=L, PFA, large cardinals, or the AC connection, get in touch. (Please rt!)
17. The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models
Ulme Wennberg, Gustav Eje Henter
Mechanisms for encoding positional information are central for transformer-based language models. In this paper, we analyze the position embeddings of existing language models, finding strong evidence of translation invariance, both for the embeddings themselves and for their effect on self-attention. The degree of translation invariance increases during training and correlates positively with model performance. Our findings lead us to propose translation-invariant self-attention (TISA), which accounts for the relative position between tokens in an interpretable fashion without needing conventional position embeddings. Our proposal has several theoretical advantages over existing position-representation approaches. Experiments show that it improves on regular ALBERT on GLUE tasks, while adding orders of magnitude fewer positional parameters.
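The core mechanism can be sketched as adding a learned scalar bias that depends only on the relative offset i - j to the attention logits, instead of adding absolute position embeddings to the tokens. The lookup-table parameterization below is my simplification; TISA itself uses small parametric positional kernels.

```python
import torch
import torch.nn as nn

class RelativeBiasAttention(nn.Module):
    def __init__(self, dim, max_len=512):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        # One learnable bias per possible relative offset in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len

    def forward(self, x):                                  # x: (batch, n, dim)
        n = x.size(1)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        idx = torch.arange(n, device=x.device)
        offsets = idx[:, None] - idx[None, :]              # relative offset i - j
        logits = logits + self.rel_bias[offsets + self.max_len - 1]   # translation-invariant bias
        return torch.softmax(logits, dim=-1) @ v

x = torch.randn(2, 128, 64)
print(RelativeBiasAttention(64)(x).shape)   # torch.Size([2, 128, 64])
```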
The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models
— AK (@ak92501) June 4, 2021
pdf: https://t.co/R1ltolgXTs
abs: https://t.co/oip0UHVbWV
accounts for the relative position between tokens in an interpretable fashion without needing conventional position embeddings pic.twitter.com/SMcR95lkh7