Reaction SMILES-AA mapping via language modelling

Last update: Dec 13, 2022

rxn-aa-mapper

Reaction SMILES-AA sequence mapping

setup

conda env create -f conda.yml
conda activate rxn_aa_mapper

The following sections use the provided examples to show how to use RXNAAMapper.

generate a vocabulary to be used with the `EnzymaticReactionBertTokenizer`

Create a vocabulary compatible with the enzymatic reaction tokenizer:

create-enzymatic-reaction-vocabulary ./examples/data-samples/biochemical ./examples/token_75K_min_600_max_750_500K.json /tmp/vocabulary.txt "*.csv"

use the tokenizer

Using the example vocabulary and AA tokenizer provided, we can observe the enzymatic reaction tokenizer in action:

from rxn_aa_mapper.tokenization import EnzymaticReactionBertTokenizer

tokenizer = EnzymaticReactionBertTokenizer(
    vocabulary_file="./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    aa_sequence_tokenizer_filepath="./examples/token_75K_min_600_max_750_500K.json"
)
tokenizer.tokenize("NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1.O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEISAAVMKIFDAAKAPIQANVRPCVSIEGYKFNEMYLDTVCLNIETACFATIKCSDFTEEICREVAENCKDIK>>O=C([O-])CCC(=O)C(=O)[O-]")
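The input string in the example follows the layout `reactants|AA_sequence>>products`, with individual molecules separated by `.`. The sketch below only illustrates that layout in plain Python; `EnzymaticReaction` and `split_enzymatic_reaction` are hypothetical helpers, not part of `rxn_aa_mapper`, and the amino-acid sequence is shortened for readability.

```python
from typing import List, NamedTuple


class EnzymaticReaction(NamedTuple):
    reactants: List[str]  # reactant SMILES, one entry per molecule
    aa_sequence: str      # enzyme amino-acid sequence (one-letter codes)
    products: List[str]   # product SMILES, one entry per molecule


def split_enzymatic_reaction(rxn: str) -> EnzymaticReaction:
    """Split 'reactants|AA_SEQUENCE>>products' into its three parts.

    Raises ValueError if the string does not contain exactly one '>>'
    and exactly one '|' on the left-hand side.
    """
    precursors, products = rxn.split(">>")
    reactants, aa_sequence = precursors.split("|")
    return EnzymaticReaction(reactants.split("."), aa_sequence, products.split("."))


# Shortened toy input in the same layout as the example above.
parsed = split_enzymatic_reaction(
    "O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDG>>O=C([O-])CCC(=O)C(=O)[O-]"
)
print(parsed.reactants)
print(parsed.aa_sequence)
print(parsed.products)
```

A check like this can be handy for validating rows of a dataset before tokenizing them, since a missing `|` or `>>` fails early with a `ValueError`.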

train the model

The mlm-trainer script can be used to train a model via MTL:

# The sample folders are reused for both training and validation here;
# in practice, split your data into a train and a validation folder.
# For a more realistic config see ./examples/config.json.
mlm-trainer \
    ./examples/data-samples/biochemical ./examples/data-samples/biochemical \
    ./examples/vocabulary_token_75K_min_600_max_750_500K.txt /tmp/mlm-trainer-log \
    ./examples/sample-config.json "*.csv" 1 \
    ./examples/data-samples/organic ./examples/data-samples/organic \
    ./examples/token_75K_min_600_max_750_500K.json
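The command above passes the same sample folder as both training and validation data. A minimal sketch of a file-level split into separate folders is shown below. It assumes the data is stored as several `*.csv` files that can be assigned whole to one side of the split; `split_csv_files`, its parameters, and the default fraction are hypothetical and not part of `rxn_aa_mapper`.

```python
import random
import shutil
from pathlib import Path


def split_csv_files(source_dir, train_dir, valid_dir, valid_fraction=0.1, seed=42):
    """Copy *.csv files from source_dir into a train and a validation folder.

    Returns a (n_train, n_valid) tuple with the number of files copied to each.
    """
    files = sorted(Path(source_dir).glob("*.csv"))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(files)
    # Keep at least one validation file when there is more than one file.
    n_valid = max(1, int(len(files) * valid_fraction)) if len(files) > 1 else 0

    train_dir, valid_dir = Path(train_dir), Path(valid_dir)
    train_dir.mkdir(parents=True, exist_ok=True)
    valid_dir.mkdir(parents=True, exist_ok=True)

    for index, path in enumerate(files):
        target = valid_dir if index < n_valid else train_dir
        shutil.copy2(path, target / path.name)
    return len(files) - n_valid, n_valid
```

The two resulting folders would then take the place of the repeated `./examples/data-samples/biochemical` (or `organic`) arguments. If the data lives in a single large CSV, a row-level split would be needed instead.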

Checkpoints are stored in /tmp/mlm-trainer-log for later use in identifying active sites.

A checkpoint can be converted into a Hugging Face model by running:

checkpoint-to-hf-model /path/to/model.ckpt /tmp/rxnaamapper-pretrained-model ./examples/vocabulary_token_75K_min_600_max_750_500K.txt ./examples/sample-config.json ./examples/token_75K_min_600_max_750_500K.json
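The `/path/to/model.ckpt` placeholder has to be filled in by hand. As a sketch, the most recent checkpoint in the log folder could be located from Python and turned into the argument list for the command above. This assumes checkpoints carry a `.ckpt` extension and sit somewhere under the log folder (the exact layout is not documented here); both function names are hypothetical.

```python
from pathlib import Path


def latest_checkpoint(log_dir):
    """Return the most recently modified *.ckpt file under log_dir, or None."""
    checkpoints = sorted(
        Path(log_dir).rglob("*.ckpt"), key=lambda path: path.stat().st_mtime
    )
    return checkpoints[-1] if checkpoints else None


def conversion_command(checkpoint, output_dir, vocabulary, config, aa_tokenizer):
    """Build the checkpoint-to-hf-model argument list in the order shown above."""
    return [
        "checkpoint-to-hf-model",
        str(checkpoint),
        str(output_dir),
        str(vocabulary),
        str(config),
        str(aa_tokenizer),
    ]
```

The returned list can be passed to `subprocess.run(command, check=True)`. Picking the newest file is only one possible choice; selecting the checkpoint with the best validation loss may be preferable when that information is available.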

predict active site

The trained model can be used to map reactant atoms to AA sequence locations that potentially represent the active site:

from rxn_aa_mapper.aa_mapper import RXNAAMapper

config_mapper = {
    "vocabulary_file": "./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    "aa_sequence_tokenizer_filepath": "./examples/token_75K_min_600_max_750_500K.json",
    "model_path": "/tmp/rxnaamapper-pretrained-model",
    "head": 3,
    "layers": [11],
    "top_k": 1,
}
mapper = RXNAAMapper(config=config_mapper)
mapper.get_reactant_aa_sequence_attention_guided_maps(["NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1.O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEISAAVMKIFDAAKAPIQANVRPCVSIEGYKFNEMYLDTVCLNIETACFATIKCSDFTEEICREVAENCKDIK>>O=C([O-])CCC(=O)C(=O)[O-]"])
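The `head`, `layers`, and `top_k` entries in the config suggest that the mapping is read off attention weights from a chosen head and layer. The toy sketch below illustrates the general idea of attention-guided top-k selection on a hand-written matrix. It is not the library's implementation: how RXNAAMapper actually selects and combines attention is not described in this README, and the function name, arguments, and numbers are made up for illustration.

```python
from typing import Dict, List, Sequence


def top_k_aa_tokens(
    attention: Sequence[Sequence[float]],
    reactant_positions: Sequence[int],
    aa_positions: Sequence[int],
    top_k: int = 1,
) -> Dict[int, List[int]]:
    """For each reactant token, return the top_k AA token positions by attention."""
    mapping = {}
    for row in reactant_positions:
        # Rank AA token positions by how strongly this reactant token attends to them.
        ranked = sorted(aa_positions, key=lambda col: attention[row][col], reverse=True)
        mapping[row] = ranked[:top_k]
    return mapping


# Toy 5x5 attention matrix standing in for one (layer, head) pair:
# positions 0-1 are reactant tokens, positions 2-4 are AA tokens.
attention = [
    [0.05, 0.05, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.60, 0.10, 0.10],
    [0.20, 0.20, 0.20, 0.20, 0.20],
    [0.20, 0.20, 0.20, 0.20, 0.20],
    [0.20, 0.20, 0.20, 0.20, 0.20],
]
result = top_k_aa_tokens(attention, [0, 1], [2, 3, 4], top_k=1)
print(result)  # {0: [3], 1: [2]}
```

With `top_k=1`, each reactant token is linked to the single AA token it attends to most; a larger `top_k` would return more candidate positions per reactant token.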

citation

@article{dassi2021identification,
  title={Identification of Enzymatic Active Sites with Unsupervised Language Modeling},
  author={Dassi, Lo{\"\i}c Kwate and Manica, Matteo and Probst, Daniel and Schwaller, Philippe and Teukam, Yves Gaetan Nana and Laino, Teodoro},
  year={2021},
  conference={AI for Science: Mind the Gaps at NeurIPS 2021, ELLIS Machine Learning for Molecule Discovery Workshop 2021}
}
