Implementation of Nyström Self-attention, from the paper Nyströmformer

Last update: Jan 02, 2023

Overview

Nyström Attention


Install

$ pip install nystrom-attention

Usage

import torch
from nystrom_attention import NystromAttention

attn = NystromAttention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    num_landmarks = 256,    # number of landmarks
    pinv_iterations = 6,    # number of moore-penrose iterations for approximating pinverse. 6 was recommended by the paper
    residual = True         # whether to do an extra residual with the value or not. supposedly faster convergence if turned on
)

x = torch.randn(1, 16384, 512)
mask = torch.ones(1, 16384).bool()

attn(x, mask = mask) # (1, 16384, 512)
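
The `pinv_iterations` argument above sets how many steps of an iterative scheme are used to approximate the Moore-Penrose pseudoinverse. The following is a minimal sketch of such an iteration, assuming the cubic-order update and the norm-based initialisation described in the Nyströmformer paper. The name `iterative_pinv` and the small test matrix are illustrative only and are not taken from this library's source.

```python
import torch

def iterative_pinv(a: torch.Tensor, iters: int = 6) -> torch.Tensor:
    """Approximate the pseudoinverse of square matrices `a` of shape (..., m, m)."""
    abs_a = a.abs()
    # Largest absolute column sum (1-norm) and row sum (infinity-norm) per matrix.
    norm_1 = abs_a.sum(dim=-2).amax(dim=-1)
    norm_inf = abs_a.sum(dim=-1).amax(dim=-1)
    # Initial guess: scaled transpose, chosen so that the iteration converges.
    z = a.transpose(-1, -2) / (norm_1 * norm_inf)[..., None, None]
    eye = torch.eye(a.shape[-1], dtype=a.dtype, device=a.device)
    for _ in range(iters):
        az = a @ z
        z = 0.25 * z @ (13 * eye - az @ (15 * eye - az @ (7 * eye - az)))
    return z

# Hypothetical, well-conditioned test matrix (not an attention matrix).
a = torch.eye(4) + 0.1 * torch.ones(4, 4)
z = iterative_pinv(a, iters=6)
print(torch.allclose(a @ z, torch.eye(4), atol=1e-4))
```

More iterations give a closer approximation at higher cost; poorly conditioned matrices may need more than six steps to reach the same accuracy.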

Nyströmformer, layers of Nyström attention

import torch
from nystrom_attention import Nystromformer

model = Nystromformer(
    dim = 512,
    dim_head = 64,
    heads = 8,
    depth = 6,
    num_landmarks = 256,
    pinv_iterations = 6
)

x = torch.randn(1, 16384, 512)
mask = torch.ones(1, 16384).bool()

model(x, mask = mask) # (1, 16384, 512)
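
To illustrate what `num_landmarks` controls, here is a simplified single-head sketch of the Nyström approximation from the paper: queries and keys are averaged over contiguous segments to form landmarks, and the full attention matrix is replaced by a product of three smaller softmax kernels. This is not the library's code. It assumes no masking, a sequence length divisible by `num_landmarks`, and it uses `torch.linalg.pinv` in place of the iterative pseudoinverse. The function names are hypothetical.

```python
import torch

def nystrom_attention_sketch(q, k, v, num_landmarks):
    """q, k, v: (batch, seq_len, dim); seq_len must be divisible by num_landmarks."""
    b, n, d = q.shape
    assert n % num_landmarks == 0
    q = q * d ** -0.5
    seg = n // num_landmarks
    # Landmarks are the mean of each contiguous segment of queries / keys.
    q_land = q.reshape(b, num_landmarks, seg, d).mean(dim=2)
    k_land = k.reshape(b, num_landmarks, seg, d).mean(dim=2)
    kernel_1 = torch.softmax(q @ k_land.transpose(-1, -2), dim=-1)       # (b, n, m)
    kernel_2 = torch.softmax(q_land @ k_land.transpose(-1, -2), dim=-1)  # (b, m, m)
    kernel_3 = torch.softmax(q_land @ k.transpose(-1, -2), dim=-1)       # (b, m, n)
    return kernel_1 @ torch.linalg.pinv(kernel_2) @ (kernel_3 @ v)

def full_attention(q, k, v):
    """Standard quadratic softmax attention, for comparison."""
    scores = (q * q.shape[-1] ** -0.5) @ k.transpose(-1, -2)
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
q, k, v = (torch.randn(1, 8, 16, dtype=torch.float64) for _ in range(3))
approx = nystrom_attention_sketch(q, k, v, num_landmarks=4)
# With one landmark per token the three kernels coincide and the result is exact.
exact_case = nystrom_attention_sketch(q, k, v, num_landmarks=8)
```

Fewer landmarks make the three kernels smaller and cheaper to compute, at the price of a coarser approximation of full attention.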

You can also import it as Nyströmer if you wish

from nystrom_attention import Nystromer

Citations

@misc{xiong2021nystromformer,
    title   = {Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention},
    author  = {Yunyang Xiong and Zhanpeng Zeng and Rudrasis Chakraborty and Mingxing Tan and Glenn Fung and Yin Li and Vikas Singh},
    year    = {2021},
    eprint  = {2102.03902},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}

Comments

Clarification on masking
Given the dimensionality of the mask argument, (N, T), I'm assuming this is a boolean mask for masking out padding tokens. I created the following function to generate such a mask given an input tensor:

def _create_pad_mask(self, x: torch.LongTensor) -> torch.BoolTensor:
    mask = torch.ones_like(x).to(torch.bool)
    mask[x==0] = False
    return mask

Here 0 is the padding token; padded positions are set to False so that they are not attended to.
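
For reference, the same padding mask can be written as a single comparison. This is a standalone sketch: the function name `create_pad_mask`, the `pad_id` parameter and the example token ids are made up for illustration.

```python
import torch

def create_pad_mask(token_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Return a boolean mask of shape (N, T): True for real tokens, False for padding."""
    return token_ids != pad_id

# Two sequences padded to length 5 with the (assumed) padding id 0.
token_ids = torch.tensor([[5, 7, 9, 0, 0],
                          [3, 0, 0, 0, 0]])
mask = create_pad_mask(token_ids)
print(mask)
```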

However, I am unsure how to apply a causal mask to the attention layers so as to prevent my decoder from accessing future elements. I couldn't see an example of this in the full Nystromformer module. How can I achieve this?

For context, I am trying to apply the causal mask generated by the following function:

def _create_causal_mask(self, x: torch.LongTensor) -> torch.FloatTensor:
    size = x.shape[1]
    mask = (torch.triu(torch.ones(size, size)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill_(mask == 0, float('-inf')).masked_fill_(mask==1, 0.0)
    return mask
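
As a point of comparison, the additive causal mask produced by the function above can be built more compactly, and its effect is easy to see on a full (quadratic) attention matrix. The sketch below only shows what the mask does in standard softmax attention; it does not show how to apply it inside Nyström attention. The name `create_causal_mask` is illustrative.

```python
import torch

def create_causal_mask(size: int) -> torch.Tensor:
    """0.0 on and below the diagonal, -inf strictly above it."""
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

mask = create_causal_mask(4)

# Uniform scores make the effect of the mask easy to read off.
scores = torch.zeros(4, 4)
weights = torch.softmax(scores + mask, dim=-1)
print(weights)  # row i spreads its weight evenly over positions 0..i
```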

One way I can think of is to set return_attn to True, apply the mask on the returned attention weights then matmul with the value tensor. But this has a few issues:

Having to return v

Computing the full attention matrix (I think), defeating the entire point of linear attention

Needlessly calculating out only to discard it.

Is this just a limitation of Nystrom attention? Or am I overlooking something obvious?

Thanks
opened by vvvm23 3
Possible bug with padding
Hey there,

I was going through the code and I noticed the following, which I found curious.

In Line 75, you pad the input tensor to a multiple of num_landmarks from the front:

x = F.pad(x, (0, 0, padding, 0), value = 0)

In Line 144 you trim the extra padding elements you inserted in the output tensor from the end.

out = out[:, :n]

Am I not getting something, or should we be removing the front elements of out?

out = out[:, out.size(1) - n:]
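
The indexing concern can be reproduced on a toy tensor. The sketch below assumes only the `F.pad` call quoted above and checks which slice recovers the original rows. It says nothing about what the library does between padding and trimming, and the shapes are made up for illustration.

```python
import torch
import torch.nn.functional as F

num_landmarks = 4
x = torch.arange(1, 11, dtype=torch.float32).reshape(1, 5, 2)  # (batch, n, dim), no zeros
n = x.shape[1]

# Pad the sequence axis at the front up to the next multiple of num_landmarks.
padding = (num_landmarks - n % num_landmarks) % num_landmarks
padded = F.pad(x, (0, 0, padding, 0), value=0)

first_n = padded[:, :n]   # keeps the zero padding and drops the last real rows
last_n = padded[:, -n:]   # drops the zero padding and keeps every real row

print(torch.equal(first_n, x), torch.equal(last_n, x))
```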
opened by georgepar 2
Nystrom for Image processing
Thank you for sharing the wonderful code. I am working on image processing and wanted to try your code for it. I have two questions:

How should residual_conv_kernel be selected? I could not find any details about it. Also, it is enabled by a flag: when should we enable it, and when should we disable it?

Is there any guideline for choosing num_landmarks for an image processing task?
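
To make the first question concrete, here is one way a convolutional residual over the values could look. This is a sketch under a stated assumption: that the residual is a depthwise convolution of the value tensor along the sequence axis, with one filter per head. The kernel size of 33 and the tensor shapes are arbitrary illustration values, not recommendations.

```python
import torch
from torch import nn

heads, kernel_size = 8, 33  # illustrative values only

# Depthwise convolution: one filter per head, sliding along the sequence axis.
res_conv = nn.Conv2d(
    heads, heads,
    kernel_size=(kernel_size, 1),
    padding=(kernel_size // 2, 0),  # keeps the sequence length unchanged for odd kernels
    groups=heads,
    bias=False,
)

v = torch.randn(1, heads, 1024, 64)  # (batch, heads, seq_len, dim_head)
residual = res_conv(v)               # same shape as v, would be added to the attention output
print(residual.shape)
```

Under this assumption, a larger kernel mixes each value with more of its neighbours along the sequence, which is one way to think about choosing its size.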

Thanks
opened by paragon1234 1
Error when mask is of the same size as that of the input X

Hi,

First of all, thank you for putting such an easy-to-use implementation on GitHub. I'm trying to incorporate Nyström attention into a legacy codebase, which previously passed the input X and a mask (of the same dimensions as X) to a multi-head attention layer.

When I integrate Nyström attention, it runs fine without the mask. But when I pass the mask alongside the input, it throws an einops rearrange error.

Sorry if this is a very basic question, but how would you recommend I handle a 3D mask (with the same dimensions as the input) in the codebase?
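
The usage example in the README passes a mask of shape (1, 16384) for an input of shape (1, 16384, 512), that is, one boolean per token. If the legacy 3D mask simply repeats a per-token flag across the feature axis (an assumption about the legacy codebase), it can be reduced to that shape before the call. The tensors below are made up for illustration.

```python
import torch

x = torch.randn(2, 6, 4)                          # (batch, seq_len, dim)
mask_3d = torch.ones(2, 6, 4, dtype=torch.bool)   # hypothetical legacy mask, same shape as x
mask_3d[0, 4:] = False                            # sequence 0 has 4 real tokens
mask_3d[1, 2:] = False                            # sequence 1 has 2 real tokens

# Collapse the feature axis: a token counts as valid if any of its features is unmasked.
mask_2d = mask_3d.any(dim=-1)                     # (batch, seq_len)
print(mask_2d.shape)
```

If the legacy mask instead marks individual features, there is no lossless per-token equivalent, and `.any` versus `.all` becomes a modelling decision.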

Best, VB

opened by Vaibhavs10 1

ViewBackward inplace deprecation warning

Hello again,

The following code results in a UserWarning in PyTorch 1.8.1.

In [1]: from nystrom_attention.nystrom_attention import NystromAttention

In [2]: import torch

In [3]: attn = NystromAttention(256)

In [4]: x = torch.randn(1, 8192, 256)

In [5]: attn(x)
/home/alex/.tmp/nystrom-attention/nystrom_attention/nystrom_attention.py:91: UserWarning: Output 0 of ViewBackward is a view and is being modified inplace. This view is an output of a function that returns multiple views. Inplace operators on such views are being deprecated and will be forbidden starting from version 1.8. Consider using `unsafe_` version of the function that produced this view or don't modify this view inplace. (Triggered internally at  ../torch/csrc/autograd/variable.cpp:547.)
  q *= self.scale
Out[5]:
tensor([[[-0.0449, -0.1726,  0.1409,  ...,  0.0127,  0.2287, -0.2437],
         [-0.1132,  0.3229, -0.1279,  ...,  0.0084, -0.3307, -0.2351],
         [ 0.0361,  0.1013,  0.0828,  ...,  0.1045, -0.1627,  0.0736],
         ...,
         [ 0.0018,  0.1385, -0.1716,  ..., -0.0366, -0.0682,  0.0241],
         [ 0.1497,  0.0149, -0.0020,  ..., -0.0352, -0.1126,  0.0193],
         [ 0.1341,  0.0077,  0.1627,  ..., -0.0363,  0.1057, -0.2071]]],
       grad_fn=<SliceBackward>)

Not a huge issue, but worth mentioning
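
The warning points at the in-place `q *= self.scale`, applied to a view returned by a function with several outputs. A common way to avoid this class of warning is to scale out of place. The sketch below assumes `q`, `k` and `v` come from chunking one projection, which may differ from the library's actual code; all names and sizes are illustrative.

```python
import torch
from torch import nn

dim, heads, dim_head = 256, 8, 64
to_qkv = nn.Linear(dim, heads * dim_head * 3, bias=False)
scale = dim_head ** -0.5

x = torch.randn(1, 16, dim)
q, k, v = to_qkv(x).chunk(3, dim=-1)  # three views of a single projection output

# Out-of-place scaling allocates a new tensor instead of modifying the view.
q = q * scale

q.sum().backward()  # gradients still reach the projection weights
```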

opened by vvvm23 1

Relative position encoding

Similar to the question raised for the Performer architecture, is it possible to implement a relative position encoding given the way attention is calculated here?

opened by jdcla 1
How can we implement "batch_first" in Nystrom attention?

Hi,

Thanks a lot for implementing the nystromformer attention algorithm! Very nice job!

I am wondering whether it is feasible to add a "batch_first" option to the Nyström attention algorithm. This would allow the algorithm to be integrated into the existing PyTorch transformer encoder architecture.
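
Until such an option exists, one possible workaround is a thin wrapper that converts between layouts. The sketch assumes the wrapped module expects (batch, seq_len, dim), as in the usage examples in the README, while the surrounding code uses (seq_len, batch, dim). `SeqFirstWrapper` is a hypothetical name, and `nn.Linear` stands in for the attention module so that the example is self-contained.

```python
import torch
from torch import nn

class SeqFirstWrapper(nn.Module):
    """Lets a batch-first module be called with (seq_len, batch, dim) tensors."""

    def __init__(self, batch_first_module: nn.Module):
        super().__init__()
        self.inner = batch_first_module

    def forward(self, x: torch.Tensor, **kwargs) -> torch.Tensor:
        out = self.inner(x.transpose(0, 1), **kwargs)  # to (batch, seq_len, dim)
        return out.transpose(0, 1)                     # back to (seq_len, batch, dim)

layer = SeqFirstWrapper(nn.Linear(32, 32))  # stand-in for a batch-first attention module
x = torch.randn(10, 2, 32)                  # (seq_len, batch, dim)
y = layer(x)
print(y.shape)
```

Keyword arguments such as a mask are passed through unchanged, so a (batch, seq_len) mask would keep its batch-first layout.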

opened by mark0935git 0
x-transformers

Hi @lucidrains - just wondering if we can plug in Nystrom Attention with x-transformers?

I've been plugging Vision Transformers into x-transformers, but am wondering if it's possible to have a Nyström transformer with the x-transformers improvements to plug into a ViT?

opened by robbohua 0

Releases (0.0.11)

0.0.11 (Apr 6, 2021)
0.0.10 (Mar 18, 2021)
0.0.9 (Feb 24, 2021)
0.0.8 (Feb 18, 2021)
0.0.7 (Feb 14, 2021)
0.0.6 (Feb 12, 2021)
0.0.5 (Feb 12, 2021)
0.0.4 (Feb 12, 2021)
0.0.3 (Feb 12, 2021)
0.0.2 (Feb 12, 2021)
0.0.1 (Feb 11, 2021)

Owner

Phil Wang

Working with Attention. It's all we need.
