FLASH - Pytorch

Implementation of the Transformer variant proposed in the paper Transformer Quality in Linear Time

Install

$ pip install FLASH-pytorch

Usage

The main novel circuit in this paper is the "Gated Attention Unit", which they claim can replace multi-headed attention while reducing it to just one head.

It uses a ReLU-squared activation in place of softmax; this activation was first seen in the Primer paper, while the use of plain ReLU for attention appeared in the ReLA Transformer. The gating style seems mostly inspired by gMLPs.
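
To build intuition, here is a minimal sketch of the gated attention idea at toy scale. This is a simplification rather than the library's implementation: the actual GAU derives queries and keys from a single shared projection with learned per-dimension offsets and scales, and adds a relative position bias.

import torch
import torch.nn as nn
import torch.nn.functional as F

dim, qk_dim, hidden = 512, 128, 1024

to_gate  = nn.Linear(dim, hidden)   # u: the gating branch (gMLP-style)
to_value = nn.Linear(dim, hidden)   # v: the value branch
to_qk    = nn.Linear(dim, qk_dim)   # shared low-dim query / key projection (one head)
to_out   = nn.Linear(hidden, dim)

x = torch.randn(1, 1024, dim)
u, v, qk = to_gate(x), to_value(x), to_qk(x)

sim  = torch.einsum('b i d, b j d -> b i j', qk, qk) / x.shape[1]
attn = F.relu(sim) ** 2                                   # relu squared in place of softmax
out  = to_out(u * torch.einsum('b i j, b j d -> b i d', attn, v))
print(out.shape)  # torch.Size([1, 1024, 512])

The GAU module from this repository is used like so: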

import torch
from flash_pytorch import GAU

gau = GAU(
    dim = 512,
    query_key_dim = 128,     # query / key dimension
    causal = True,           # autoregressive or not
    expansion_factor = 2,    # hidden dimension = dim * expansion_factor
)

x = torch.randn(1, 1024, 512)
out = gau(x) # (1, 1024, 512)

The authors then combine GAU with Katharopoulos et al.'s linear attention, grouping the sequence to overcome a known issue with autoregressive linear attention.

They named this combination of the quadratic gated attention unit with grouped linear attention FLASH.

You can also use this quite easily.

import torch
from flash_pytorch import FLASH

flash = FLASH(
    dim = 512,
    group_size = 256,             # group size
    causal = True,                # autoregressive or not
    query_key_dim = 128,          # query / key dimension
    expansion_factor = 2.         # hidden dimension = dim * expansion_factor
)

x = torch.randn(1, 1111, 512)     # sequence will be auto-padded to nearest group size
out = flash(x) # (1, 1111, 512)
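
To see why grouping helps, here is a sketch of the inter-group linear attention under assumed mechanics; within-group interactions are handled by the quadratic GAU attention and are omitted here. Rather than the slow per-token cumulative sum of naive autoregressive linear attention, a key-value summary is computed once per group and cumulatively summed, so each group attends only to the summaries of strictly earlier groups:

import torch
import torch.nn.functional as F

b, g, n, d, e = 1, 4, 256, 128, 512   # batch, groups, group size, query/key dim, value dim
lin_q = torch.randn(b, g, n, d).relu()
lin_k = torch.randn(b, g, n, d).relu()
v     = torch.randn(b, g, n, e)

kv = torch.einsum('b g n d, b g n e -> b g d e', lin_k, v)   # one summary per group
kv = kv.cumsum(dim = 1)                                      # running sum across groups
kv = F.pad(kv, (0, 0, 0, 0, 1, -1))                          # shift so each group excludes itself
lin_out = torch.einsum('b g n d, b g d e -> b g n e', lin_q, kv)
print(lin_out.shape)  # torch.Size([1, 4, 256, 512])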

Finally, you can use the full FLASH transformer as described in the paper. It contains all the positional embeddings the paper mentions: the absolute positional embedding is a scaled sinusoidal one, the quadratic GAU attention gets a one-headed T5 relative positional bias, and on top of all this, both the GAU attention and the linear attention are rotary embedded (RoPE).
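
As a rough illustration of the first of these, a scaled sinusoidal embedding is just the standard fixed sinusoids multiplied by a learned scale. A sketch under that assumption, not the repository's exact module:

import torch
import torch.nn as nn

class ScaledSinusoidal(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1) * dim ** -0.5)   # single learned scale
        inv_freq = 1. / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)

    def forward(self, x):
        t = torch.arange(x.shape[1], device = x.device).float()
        freqs = torch.einsum('i, j -> i j', t, self.inv_freq)
        return torch.cat((freqs.sin(), freqs.cos()), dim = -1) * self.scale   # (seq, dim)

The full transformer is used as follows: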

import torch
from flash_pytorch import FLASHTransformer

model = FLASHTransformer(
    num_tokens = 20000,          # number of tokens
    dim = 512,                   # model dimension
    depth = 12,                  # depth
    causal = True,               # autoregressive or not
    group_size = 256,            # size of the groups
    query_key_dim = 128,         # dimension of queries / keys
    expansion_factor = 2.,       # hidden dimension = dim * expansion_factor
    norm_type = 'scalenorm',     # in the paper, they claimed scalenorm led to faster training at no performance hit. the other option is 'layernorm' (also default)
    shift_tokens = True          # discovered by an independent researcher in Shenzhen @BlinkDL, this simply shifts half of the feature space forward one step along the sequence dimension - greatly improved convergence even more in my local experiments
)

x = torch.randint(0, 20000, (1, 1024))
logits = model(x) # (1, 1024, 20000)
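
For training, a minimal autoregressive language modeling step might look like the sketch below, reusing the model from above; train.py in this repository is the actual reference:

import torch
import torch.nn.functional as F

seq = torch.randint(0, 20000, (1, 1025))
inp, target = seq[:, :-1], seq[:, 1:]                  # predict the next token

logits = model(inp)                                    # (1, 1024, 20000)
loss = F.cross_entropy(logits.transpose(1, 2), target)
loss.backward()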

Test on Autoregressive Enwik8

$ python train.py

Citations

@article{Hua2022TransformerQI,
    title   = {Transformer Quality in Linear Time},
    author  = {Weizhe Hua and Zihang Dai and Hanxiao Liu and Quoc V. Le},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2202.10447}
}
@software{peng_bo_2021_5196578,
    author    = {PENG Bo},
    title     = {BlinkDL/RWKV-LM: 0.01},
    month     = {aug},
    year      = {2021},
    publisher = {Zenodo},
    version   = {0.01},
    doi       = {10.5281/zenodo.5196578},
    url       = {https://doi.org/10.5281/zenodo.5196578}
}
Comments
  • einsum operation in Linear Attention Part

    Hi, thanks a lot for FLASH-pytorch, which helps a lot. I found some differences from the paper in the linear attention part: https://github.com/lucidrains/FLASH-pytorch/blob/main/flash_pytorch/flash_pytorch.py#L342-L343

    lin_kv = einsum('b g n d, b g n e -> b d e', lin_k, v) / n
    lin_out = einsum('b g n d, b d e -> b g n e', lin_q, lin_kv)
    

    Here lin_kv is three-dimensional (bde), while the code in the paper is

    lin_kv = tf.einsum('bhke,bgh->bgke', lin_kv, mask)
    linear = tf.einsum('bgnk,bgke->bgne', lin_q, lin_kv)
    

    where lin_kv is four-dimensional (bgke). It seems that the two ways are not equivalent.

    Looking forward to your reply. Best,

    opened by ShomyLiu 5
  • mask error

    x = torch.randint(0, 20000, (1, 1024))
    mask = x.ne(0)
    logits = model(x, mask=mask)
    

    RuntimeError: The size of tensor a (1024) must match the size of tensor b (128) at non-singleton dimension 2

    opened by keyunluo 1
  • Speed on TPU

    Hi, thanks for the code! I tested it on a Google TPU v3, and the training speed seems slower than I expected. Maybe there is some operation that is not lowered efficiently on TPU.

    opened by magicknight 0
  • About the "shift_tokens"

    Thank you for your amazing code.

    In the FLASH class, I found a flag shift_tokens; the corresponding code is as follows:

    if self.shift_tokens:
        x_shift, x_pass = normed_x.chunk(2, dim = -1)
        x_shift = F.pad(x_shift, (0, 0, 1, -1), value = 0.)
        normed_x = torch.cat((x_shift, x_pass), dim = -1)

    Assume normed_x has shape [1024, 512]; then x_shift and x_pass each have shape [1024, 256]. The pad adds a row of zeros at the front of x_shift and removes its last row, after which x_shift and x_pass are concatenated back into normed_x.

    In my opinion, the F.pad operation makes the rows of x_shift and x_pass no longer align with each other.

    May I know why it works?

    Kang

    opened by kangzhao2 1
  • Cross-Attention?

    Hi, @lucidrains. Thank you for sharing this excellent implementation with us all! Do you have any thoughts as to what changes would need to be made to make cross-attention possible with your FLASH model?

    opened by amorehead 2
Owner
Phil Wang
Working with Attention. It's all we need