A Unified Generative Framework for Various NER Subtasks.

Related tags

Deep LearningBARTNER
Overview

This is the code for ACL-ICJNLP2021 paper A Unified Generative Framework for Various NER Subtasks.

Install the package in the requirements.txt, then use the following commands to install two other packages

pip install git+https://github.com/fastnlp/[email protected]
pip install git+https://github.com/fastnlp/fitlog

You need to put your data in the parallel folder of this repo

    - BARTNER/
        - train.py
        ...
    - data/
        - conll2003
            - train.txt
            - text.txt
            - dev.txt
        - en-ontonotes
            - ...
        - Share_2013
        - Share_2014
        - CADEC
        - en_ace04
        - en_ace05
        - genia

For the conll2003 and en-ontonotes you data in each split should like (The first column is words, the second column is tags. We assume the tag is the BIO-tagging)

LONDON B-LOC
1996-08-30 O

West B-MISC
Indian I-MISC
all-rounder O
Phil B-PER

For nested dataset en_ace04, en_ace05 and genia, the data should like (each line is a jsonline, contains ners and sentences keys.)

{"ners": [[[16, 16, "DNA"], [4, 8, "DNA"], [24, 26, "DNA"], [19, 20, "DNA"]], [[31, 31, "DNA"], [2, 2, "DNA"], [4, 4, "DNA"], [30, 31, "DNA"]], [[23, 24, "RNA"], [14, 15, "cell_type"], [1, 2, "RNA"]], [[2, 2, "DNA"]], [], [[0, 0, "DNA"], [9, 9, "cell_type"]]], "sentences": [["There", "is", "a", "single", "methionine", "codon-initiated", "open", "reading", "frame", "of", "1,458", "nt", "in", "frame", "with", "a", "homeobox", "and", "a", "CAX", "repeat", ",", "and", "the", "open", "reading", "frame", "is", "predicted", "to", "encode", "a", "protein", "of", "51,659", "daltons."], ["When", "the", "homeodomain", "from", "HB24", "was", "compared", "to", "known", "mammalian", "and", "Drosophila", "homeodomains", "it", "was", "found", "to", "be", "only", "moderately", "conserved,", "but", "when", "it", "was", "compared", "to", "a", "highly", "diverged", "Drosophila", "homeodomain", ",", "H2.0,", "it", "was", "found", "to", "be", "80%", "identical."], ["The", "HB24", "mRNA", "was", "absent", "or", "present", "at", "low", "levels", "in", "normal", "B", "and", "T", "lymphocytes", ";", "however,", "with", "the", "appropriate", "activation", "signal", "HB24", "mRNA", "was", "induced", "within", "several", "hours", "even", "in", "the", "presence", "of", "cycloheximide", "."], ["Characterization", "of", "HB24", "expression", "in", "lymphoid", "and", "select", "developing", "tissues", "was", "performed", "by", "in", "situ", "hybridization", "."], ["Positive", "hybridization", "was", "found", "in", "thymus", ",", "tonsil", ",", "bone", "marrow", ",", "developing", "vessels", ",", "and", "in", "fetal", "brain", "."], ["HB24", "is", "likely", "to", "have", "an", "important", "role", "in", "lymphocytes", "as", "well", "as", "in", "certain", "developing", "tissues", "."]]}
{"ners": [[[16, 16, "DNA"], [4, 8, "DNA"], [24, 26, "DNA"], [19, 20, "DNA"]], [[31, 31, "DNA"], [2, 2, "DNA"], [4, 4, "DNA"], [30, 31, "DNA"]], [[23, 24, "RNA"], [14, 15, "cell_type"], [1, 2, "RNA"]], [[2, 2, "DNA"]], [], [[0, 0, "DNA"], [9, 9, "cell_type"]]], "sentences": [["There", "is", "a", "single", "methionine", "codon-initiated", "open", "reading", "frame", "of", "1,458", "nt", "in", "frame", "with", "a", "homeobox", "and", "a", "CAX", "repeat", ",", "and", "the", "open", "reading", "frame", "is", "predicted", "to", "encode", "a", "protein", "of", "51,659", "daltons."], ["When", "the", "homeodomain", "from", "HB24", "was", "compared", "to", "known", "mammalian", "and", "Drosophila", "homeodomains", "it", "was", "found", "to", "be", "only", "moderately", "conserved,", "but", "when", "it", "was", "compared", "to", "a", "highly", "diverged", "Drosophila", "homeodomain", ",", "H2.0,", "it", "was", "found", "to", "be", "80%", "identical."], ["The", "HB24", "mRNA", "was", "absent", "or", "present", "at", "low", "levels", "in", "normal", "B", "and", "T", "lymphocytes", ";", "however,", "with", "the", "appropriate", "activation", "signal", "HB24", "mRNA", "was", "induced", "within", "several", "hours", "even", "in", "the", "presence", "of", "cycloheximide", "."], ["Characterization", "of", "HB24", "expression", "in", "lymphoid", "and", "select", "developing", "tissues", "was", "performed", "by", "in", "situ", "hybridization", "."], ["Positive", "hybridization", "was", "found", "in", "thymus", ",", "tonsil", ",", "bone", "marrow", ",", "developing", "vessels", ",", "and", "in", "fetal", "brain", "."], ["HB24", "is", "likely", "to", "have", "an", "important", "role", "in", "lymphocytes", "as", "well", "as", "in", "certain", "developing", "tissues", "."]]}
...

For discontinuous dataset Share_2013, Share_2014 and CADEC, the data should like ( each sample has two lines, if the second line is empty means there is not entity. )

Abdominal cramps , flatulence , gas , bloating .
0,1 ADR|3,3 ADR|7,7 ADR|5,5 ADR

Cramps would start within 15 minutes of taking pill , even during meals .
0,0 ADR

...

We use code from https://github.com/daixiangau/acl2020-transition-discontinuous-ner to pre-process the data.

You can run the code by directly using

python train.py

The following output should be achieved

Save cache to caches/data_facebook/bart-large_conll2003_word.pt.                                                                                                        
max_len_a:0.6, max_len:10
In total 3 datasets:
        test has 3453 instances.
        train has 14041 instances.
        dev has 3250 instances.

The number of tokens in tokenizer  50265
50269 50274
input fields after batch(if batch size is 2):
        tgt_tokens: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 8]) 
        src_tokens: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 11]) 
        first: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 11]) 
        src_seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 
        tgt_seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 
target fields after batch(if batch size is 2):
        entities: (1)type:numpy.ndarray (2)dtype:object, (3)shape:(2,) 
        tgt_tokens: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 8]) 
        target_span: (1)type:numpy.ndarray (2)dtype:object, (3)shape:(2,) 
        tgt_seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 

training epochs started 2021-06-02-11-49-26-964889
Epoch 1/30:   0%|                                                         | 15/32430 [00:06<3:12:37,  2.80it/s, loss:6.96158

Some important python files are listed below

- BartNER
  - data
     - pipe.py # load and process data
  - model
     - bart.py # the model file
  - train.py  # the training file

The different Loaders in the data/pipe.py is meant to load data, and the data.BartNERPipe class is to process data, the loader should load data into a DataBundle object, you can mock the provided Loader to write your own loader, as long as your dataset has the following four fields, the BartNERPipe should be able to process it

- raw_words  # List[str]
    # ['AL-AIN', ',', 'United', 'Arab', 'Emirates', '1996-12-06']
- entities  # List[List[str]]
    # [['AL-AIN'], ['United', 'Arab', 'Emirates']]
- entity_tags  # List[str], the same length as entities
    # ['loc', 'loc']
- entity_spans # List[List[int]], the inner list must have an even number of ints, means the start(inclusive,开区间) and end(exclusive,开区间) of an entity segment
    # [[0, 1], [2, 5]] or for discontinous NER [[0, 1, 5, 7], [2, 3, 5, 7],...]

In order to help you reproduce the results, we have hardcoded the hyper-parameters for each dataset in the code, you can change them based on your need. We conduct all experiments in NVIDIA-3090(24G memory). Some known difficulties about the reproduction of this code: (1) Some datasets (nested and discontinous) will drop to 0 or near 0 F1 during training, please drop these results; (2) randomness will cause large performance variance for some datasets, please try to run multiple times.

We deeply understand how frustrating it can be if the results are hard to reproduce, we tried our best to make sure the results were at least reproducible in our equipment (Usually take average from at least five runs).

Owner
I am currently a PhD candidate in Fudan University.
Code for the paper: Sketch Your Own GAN

Sketch Your Own GAN Project | Paper | Youtube Our method takes in one or a few hand-drawn sketches and customizes an off-the-shelf GAN to match the in

677 Dec 28, 2022
This repository contains the code for "SBEVNet: End-to-End Deep Stereo Layout Estimation" paper by Divam Gupta, Wei Pu, Trenton Tabor, Jeff Schneider

SBEVNet: End-to-End Deep Stereo Layout Estimation This repository contains the code for "SBEVNet: End-to-End Deep Stereo Layout Estimation" paper by D

Divam Gupta 19 Dec 17, 2022
Pytorch implementation of RED-SDS (NeurIPS 2021).

Recurrent Explicit Duration Switching Dynamical Systems (RED-SDS) This repository contains a reference implementation of RED-SDS, a non-linear state s

Abdul Fatir 10 Dec 02, 2022
A tool to analyze leveraged liquidity mining and find optimal option combination for hedging.

LP-Option-Hedging Description A Python program to analyze leveraged liquidity farming/mining and find the optimal option combination for hedging imper

Aureliano 18 Dec 19, 2022
Catalyst.Detection

Accelerated DL R&D PyTorch framework for Deep Learning research and development. It was developed with a focus on reproducibility, fast experimentatio

Catalyst-Team 12 Oct 25, 2021
Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome

bottom-up-attention This code implements a bottom-up attention model, based on multi-gpu training of Faster R-CNN with ResNet-101, using object and at

Peter Anderson 1.3k Jan 09, 2023
STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech

STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech Keon Lee, Ky

Keon Lee 114 Dec 12, 2022
Codes and pretrained weights for winning submission of 2021 Brain Tumor Segmentation (BraTS) Challenge

Winning submission to the 2021 Brain Tumor Segmentation Challenge This repo contains the codes and pretrained weights for the winning submission to th

94 Dec 28, 2022
[CVPR2021] Invertible Image Signal Processing

Invertible Image Signal Processing This repository includes official codes for "Invertible Image Signal Processing (CVPR2021)". Figure: Our framework

Yazhou XING 281 Dec 31, 2022
Generalized Random Forests

generalized random forests A pluggable package for forest-based statistical estimation and inference. GRF currently provides non-parametric methods fo

GRF Labs 781 Dec 25, 2022
Generating Fractals on Starknet with Cairo

StarknetFractals Generating the mandelbrot set on Starknet Current Implementation generates 1 pixel of the fractal per call(). It takes a few minutes

Orland0x 10 Jul 16, 2022
[BMVC'21] Official PyTorch Implementation of Grounded Situation Recognition with Transformers

Grounded Situation Recognition with Transformers Paper | Model Checkpoint This is the official PyTorch implementation of Grounded Situation Recognitio

Junhyeong Cho 18 Jul 19, 2022
This repository contains answers of the Shopify Summer 2022 Data Science Intern Challenge.

Data-Science-Intern-Challenge This repository contains answers of the Shopify Summer 2022 Data Science Intern Challenge. Summer 2022 Data Science Inte

1 Jan 11, 2022
Hierarchical Memory Matching Network for Video Object Segmentation (ICCV 2021)

Hierarchical Memory Matching Network for Video Object Segmentation Hongje Seong, Seoung Wug Oh, Joon-Young Lee, Seongwon Lee, Suhyeon Lee, Euntai Kim

Hongje Seong 72 Dec 14, 2022
Code for the TASLP paper "PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation".

PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation Introduction Getting Started FSD50K Recipe AudioSet Recipe Label E

Yuan Gong 84 Dec 27, 2022
Reading Group @mila-iqia on Computational Optimal Transport for Machine Learning Applications

Computational Optimal Transport for Machine Learning Reading Group Over the last few years, optimal transport (OT) has quickly become a central topic

Ali Harakeh 11 Aug 26, 2022
System-oriented IR evaluations are limited to rather abstract understandings of real user behavior

Validating Simulations of User Query Variants This repository contains the scripts of the experiments and evaluations, simulated queries, as well as t

IR Group at Technische Hochschule Köln 2 Nov 23, 2022
Modelisation on galaxy evolution using PEGASE-HR

model_galaxy Modelisation on galaxy evolution using PEGASE-HR This is a labwork done in internship at IAP directed by Damien Le Borgne (https://github

Adrien Anthore 1 Jan 14, 2022
Locally Most Powerful Bayesian Test for Out-of-Distribution Detection using Deep Generative Models

LMPBT Supplementary code for the Paper entitled ``Locally Most Powerful Bayesian Test for Out-of-Distribution Detection using Deep Generative Models"

1 Sep 29, 2022
A crossplatform menu bar application using mpv as DLNA Media Renderer.

Macast Chinese README A menu bar application using mpv as DLNA Media Renderer. Install MacOS || Windows || Debian Download link: Macast release latest

4.4k Jan 01, 2023