Transformers and related deep network architectures are summarized and implemented here.

Overview

Transformers: from NLP to CV

This is a practical introduction to Transformers, from Natural Language Processing (NLP) to Computer Vision (CV).

  1. Introduction
  2. ViT: Transformers for Computer Vision
  3. Visualizing the attention
  4. MLP-Mixer
  5. Hybrid MLP-Mixer + ViT
  6. ConvMixer
  7. Hybrid ConvMixer + MLP-Mixer

1) Introduction

What is wrong with RNNs and CNNs

Learning representations of variable-length data is a basic building block of sequence-to-sequence learning for neural machine translation, summarization, etc.

  • Recurrent Neural Networks (RNNs) are a natural fit for variable-length sentences and sequences of pixels. However, their sequential computation inhibits parallelization, and they have no explicit modeling of long- and short-range dependencies.
  • Convolutional Neural Networks (CNNs) are trivial to parallelize (per layer) and exploit local dependencies. However, modeling long-distance dependencies requires many layers.

Attention!

The Transformer architecture was proposed in the paper Attention is All You Need. As mentioned in the paper:

"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely"

"Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train"

Machine Translation (MT) is the task of translating a sentence x in one language (the source language) into a sentence y in another language (the target language). One basic and well-known neural network architecture for Neural Machine Translation (NMT) is called sequence-to-sequence (seq2seq), and it involves two RNNs.

  • Encoder: RNN network that encodes the input sequence to a single vector (sentence encoding)
  • Decoder: RNN network that generates the output sequences conditioned on the encoder's output. (conditioned language model)

The problem with vanilla seq2seq is the information bottleneck: the encoding of the source sentence must capture all information about it in a single vector.

As mentioned in the paper Neural Machine Translation by Jointly Learning to Align and Translate:

"A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus."

Attention provides a solution to the bottleneck problem.

  • Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence. Attention is basically a technique to compute a weighted sum of the values (in the encoder), dependent on another value (in the decoder).

The main idea of attention is summarized in an OpenAI article:

"... every output element is connected to every input element, and the weightings between them are dynamically calculated based upon the circumstances, a process called attention."

Query and Values

  • In the seq2seq + attention model, each decoder hidden state (query) attends to all the encoder hidden states (values).
  • The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.
  • Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query).
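
The query/values idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the notebooks: it assumes dot-product scoring between the query and each value, and the helper names (`softmax`, `attend`) and the toy vectors are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, values):
    """Return (weights, summary) for one query over a list of value vectors."""
    scores = [dot(query, v) for v in values]  # similarity of the query to each value
    weights = softmax(scores)                 # attention distribution over the values
    dim = len(values[0])
    # Weighted sum of the values: a fixed-size summary, however many values there are.
    summary = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, summary

# Toy data (assumed): three encoder hidden states and one decoder hidden state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print(weights)  # the first and third states receive more weight than the second
print(context)  # always two numbers, regardless of the number of encoder states
```

Adding more encoder states changes the weights but not the size of `context`, which is the "fixed-size representation of an arbitrary set" property described above.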

2) Transformers for Computer Vision

Transformer-based architectures are used not only for NLP but also for computer vision tasks. One important example is the Vision Transformer (ViT), which represents a direct application of Transformers to image classification, without any image-specific inductive biases. As mentioned in the paper:

"We show that reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks"

"Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks"

An input image is split into patches, which are treated the same way as tokens (words) in an NLP application. Position embeddings are added to the patch embeddings to retain positional information. Similar to BERT's class token, a learnable classification token is prepended to the sequence, and its final representation feeds a classification head that is used during pre-training and fine-tuning. The model is trained on image classification in a supervised fashion.
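
As a rough sketch of the patching step, the snippet below splits a tiny single-channel image into flattened patches and adds one position vector per patch. It is a simplification under stated assumptions: the learnable linear projection of each patch and the class token are omitted, the position vectors are fixed stand-ins for learned embeddings, and the function names are hypothetical.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened patches, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            patches.append(patch)
    return patches

def add_position_embeddings(tokens, position_embeddings):
    """Element-wise sum of each token with the embedding of its position."""
    return [[t + p for t, p in zip(token, position)]
            for token, position in zip(tokens, position_embeddings)]

# Toy 4 x 4 image whose pixel value encodes its location (assumed data).
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, patch_size=2)

# Stand-in for learned position embeddings: one vector per patch position.
positions = [[float(index)] * len(patches[0]) for index in range(len(patches))]
tokens = add_position_embeddings(patches, positions)

print(len(patches), len(patches[0]))  # 4 patches, 4 pixels each
print(patches[0])                     # [0, 1, 4, 5], the top-left 2 x 2 block
```

Without the position vectors, the sequence of patches would carry no information about where each patch came from, since attention itself is order-agnostic.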

Multi-head attention

The intuition is similar to having multiple filters in a CNN layer. Multi-head attention gives the network more capacity and the ability to learn different attention patterns. By having multiple different layers that generate (or project) the vectors of queries, keys and values, we can learn multiple representations of these queries, keys and values.

Each token is projected (in a learnable way) into three vectors Q, K, and V:

  • Q: Query vector: What I want
  • K: Key vector: What type of info I have
  • V: Value vector: What actual info I have
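
The sketch below shows one way the Q, K, V projections and multiple heads fit together, using scaled dot-product attention as in Attention is All You Need. It is illustrative only: the weights are random rather than trained, the final output projection that normally follows the concatenation is omitted, and all names and sizes are assumptions made for the example.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]

def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(tokens, w_q, w_k, w_v):
    """Scaled dot-product self-attention for a single head."""
    queries = matmul(tokens, w_q)  # what each token is looking for
    keys = matmul(tokens, w_k)     # what kind of information each token has
    values = matmul(tokens, w_v)   # the information each token actually carries
    scale = math.sqrt(len(keys[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values), weights

def multi_head_attention(tokens, heads):
    """Run every head independently and concatenate their outputs per token."""
    per_head = [attention_head(tokens, *head)[0] for head in heads]
    return [[x for out in per_head for x in out[i]] for i in range(len(tokens))]

rng = random.Random(0)

def random_matrix(rows, cols):
    """Random stand-in for a learned projection matrix."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

tokens = random_matrix(3, 4)  # 3 tokens with 4 features each
heads = [(random_matrix(4, 2), random_matrix(4, 2), random_matrix(4, 2))
         for _ in range(2)]   # 2 heads, each projecting to 2 dimensions
output = multi_head_attention(tokens, heads)
print(len(output), len(output[0]))  # prints "3 4": 3 tokens, 2 heads * 2 dims each
```

Because each head has its own projection matrices, each head can produce a different attention pattern over the same tokens.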

3) Visualizing the attention

The basic ViT architecture is used, but with only one transformer layer with one (or four) head(s) for simplicity. The model is trained on the CIFAR-10 classification task. The image is split into 12 x 12 = 144 patches as usual, and after training we can inspect the 144 x 144 attention scores (each patch can attend to all the others).

The attention map represents the correlation (attention) between all the tokens. Each row sums to 1 and represents the probability distribution of attention from a query patch to all other patches.
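
To make the shape of this map concrete, the sketch below builds a 144 x 144 attention map from stand-in scores and reshapes one row back onto the 12 x 12 patch grid, which is a common way to display what a single query patch attends to. The scores are random placeholders rather than values from the trained model, and `attention_grid` is a hypothetical helper.

```python
import math
import random

def softmax(scores):
    """Normalize one row of raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

grid = 12
num_patches = grid * grid  # 144 patches

# Placeholder scores standing in for the scaled query-key products of a trained model.
rng = random.Random(0)
scores = [[rng.gauss(0.0, 1.0) for _ in range(num_patches)]
          for _ in range(num_patches)]

# Row i holds the attention of query patch i over all 144 patches.
attention_map = [softmax(row) for row in scores]

def attention_grid(attention_map, query_index, grid):
    """Reshape the attention row of one query patch into a grid x grid heat map."""
    row = attention_map[query_index]
    return [row[r * grid:(r + 1) * grid] for r in range(grid)]

heat = attention_grid(attention_map, query_index=0, grid=grid)
print(len(attention_map), len(attention_map[0]))  # prints "144 144"
print(len(heat), len(heat[0]))                    # prints "12 12"
```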

Long-distance attention: two interesting patterns can be observed. A background patch attends to other, distant background patches, and a flight patch attends to other, distant flight patches.

We can try more heads and more transformer layers and inspect the attention patterns.


4) MLP-Mixer

MLP-Mixer was proposed in the paper MLP-Mixer: An all-MLP Architecture for Vision. As mentioned in the paper:

"While convolutions and attention are both sufficient for good performance, neither of them is necessary!"

"Mixer is a competitive but conceptually and technically simple alternative, that does not use convolutions or self-attention"

Mixer accepts a sequence of linearly projected image patches (tokens) shaped as a “patches × channels” table as an input, and maintains this dimensionality. Mixer makes use of two types of MLP layers:

  • Channel-mixing MLPs allow communication between different channels; they operate on each token independently and take individual rows of the table as inputs.
  • Token-mixing MLPs allow communication between different spatial locations (tokens); they operate on each channel independently and take individual columns of the table as inputs.

These two types of layers are interleaved to enable interaction of both input dimensions.
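
The interleaving of the two MLP types can be sketched as follows. This is a simplified illustration rather than the notebook implementation: it assumes two-layer MLPs with a GELU nonlinearity and skip connections, leaves out layer normalization, and uses random weights and made-up sizes and names.

```python
import math
import random

def gelu(x):
    """GELU activation, written with the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (used for skip connections)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def mlp(rows, w1, w2):
    """Two-layer MLP applied to every row independently, with shared weights."""
    hidden = [[gelu(x) for x in row] for row in matmul(rows, w1)]
    return matmul(hidden, w2)

def mixer_block(table, token_w1, token_w2, channel_w1, channel_w2):
    """One Mixer block on a 'patches x channels' table; the shape is preserved."""
    # Token mixing: transpose so each row is one channel across all patches (a column).
    table = add(table, transpose(mlp(transpose(table), token_w1, token_w2)))
    # Channel mixing: each row is one patch (token) across all channels.
    return add(table, mlp(table, channel_w1, channel_w2))

rng = random.Random(0)

def random_matrix(rows, cols):
    """Random stand-in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

num_patches, channels = 4, 3
token_hidden, channel_hidden = 8, 5

table = random_matrix(num_patches, channels)
token_w1 = random_matrix(num_patches, token_hidden)
token_w2 = random_matrix(token_hidden, num_patches)
channel_w1 = random_matrix(channels, channel_hidden)
channel_w2 = random_matrix(channel_hidden, channels)

output = mixer_block(table, token_w1, token_w2, channel_w1, channel_w2)
print(len(output), len(output[0]))  # prints "4 3": same shape as the input table
```

Note that the token-mixing weights are sized by the number of patches, so this sketch, like the Mixer itself, is tied to a fixed number of input patches.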

"The computational complexity of the network is linear in the number of input patches, unlike ViT whose complexity is quadratic"

"Unlike ViTs, Mixer does not use position embeddings"

It is commonly observed that the first layers of CNNs tend to learn detectors that act on pixels in local regions of the image. In contrast, Mixer allows for global information exchange in the token-mixing MLPs.

"Recall that the token-mixing MLPs allow global communication between different spatial locations."

Figure: hidden units of the four token-mixing MLPs of a Mixer trained on the CIFAR-10 dataset.


5) Hybrid MLP-Mixer and ViT

We can use both the MLP-Mixer and ViT in one network architecture to get the best of both worlds.

Adding a few self-attention sublayers to Mixer is expected to offer a simple way to trade off speed for accuracy.


6) ConvMixer

ConvMixer was proposed in the paper Patches Are All You Need? The paper asks:

Is the performance of ViTs due to the inherently more powerful Transformer architecture, or is it at least partly due to using patches as the input representation?

ConvMixer is an extremely simple model that is similar in many respects to the ViT and the even more basic MLP-Mixer.

Despite its simplicity, ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet.

While self-attention and MLPs are theoretically more flexible, allowing for large receptive fields and content-aware behavior, the inductive bias of convolution is well-suited to vision tasks and leads to high data efficiency.

ConvMixers are substantially slower at inference than the competitors!


7) Hybrid MLP-Mixer and ConvMixer

Once again, we can use both the MLP-Mixer and ConvMixer in one network architecture to get the best of both worlds.


Owner
Ibrahim Sobh