Chinese named entity recognization (bert/roberta/macbert/bert_wwm with Keras)

Overview

Chinese named entity recognization (bert/roberta/macbert/bert_wwm with Keras)

Project Structure

./
├── DataProcess
│   ├── __pycache__
│   ├── convert2bio.py
│   ├── convert_jsonl.py
│   ├── handle_numbers.py
│   ├── load_data.py
│   └── statistic.py
├── README.md
├── __pycache__
├── chinese_L-12_H-768_A-12                                    BERT权重
│   ├── bert_config.json
│   ├── bert_model.ckpt.data-00000-of-00001
│   ├── bert_model.ckpt.index
│   ├── bert_model.ckpt.meta
│   └── vocab.txt
├── chinese_bert_wwm                                           BERT_wwm权重
│   ├── bert_config.json
│   ├── bert_model.ckpt.data-00000-of-00001
│   ├── bert_model.ckpt.index
│   ├── bert_model.ckpt.meta
│   └── vocab.txt
├── chinese_macbert_base                                       macBERT权重
│   ├── chinese_macbert_base.ckpt.data-00000-of-00001
│   ├── chinese_macbert_base.ckpt.index
│   ├── chinese_macbert_base.ckpt.meta
│   ├── macbert_base_config.json
│   └── vocab.txt
├── chinese_roberta_wwm_ext_L-12_H-768_A-12                    roberta权重
│   ├── bert_config.json
│   ├── bert_model.ckpt.data-00000-of-00001
│   ├── bert_model.ckpt.index
│   ├── bert_model.ckpt.meta
│   └── vocab.txt
├── config                                                     
│   ├── __pycache__
│   ├── config.py                                              配置文件
│   └── pulmonary_label2id.json                                label id
├── data                                                       数据集
│   ├── pulmonary.test
│   ├── pulmonary.train
│   └── sict_train.txt
├── environment.yaml                                           conda环境配置文件
├── evaluate.py
├── generator_train.py
├── keras_bert                                                 keras_bert(可pip下)
├── keras_contrib                                              keras_contrib(可pip下)
├── log                                                        训练nohup日志
│   ├── chinese_L-12_H-768_A-12.out
│   ├── chinese_macbert_base.out
│   ├── chinese_roberta_wwm_ext_L-12_H-768_A-12.out
│   └── electra_180g_base.out
├── model.py                                                   模型构建文件
├── models                                                     保存的模型权重
│   ├── pulmonary_chinese_L-12_H-768_A-12_ner.h5
│   ├── pulmonary_chinese_bert_wwm_ner.h5
│   ├── pulmonary_chinese_macbert_base_ner.h5
│   └── pulmonary_chinese_roberta_wwm_ext_L-12_H-768_A-12_ner.h5
├── predict.py                                                 预测
├── report                                                     模型实体F1评估报告
│   ├── pulmonary_chinese_L-12_H-768_A-12_evaluate.txt
│   ├── pulmonary_chinese_L-12_H-768_A-12_predict.json
│   ├── pulmonary_chinese_bert_wwm_evaluate.txt
│   ├── pulmonary_chinese_bert_wwm_predict.json
│   ├── pulmonary_chinese_macbert_base_evaluate.txt
│   ├── pulmonary_chinese_macbert_base_predict.json
│   ├── pulmonary_chinese_roberta_wwm_ext_L-12_H-768_A-12_evaluate.txt
│   └── pulmonary_chinese_roberta_wwm_ext_L-12_H-768_A-12_predict.json
├── requirements.txt                                           pip环境
├── test.py                                                    
├── train.py                                                   训练
└── utils                                                      
    ├── FGM.py                                                 FGM对抗
    ├── __pycache__
    └── path.py                                                所有路径

56 directories, 193 files

Dataset

三甲医院肺结节数据集,20000+字,BIO格式,形如:

中	B-ORG
共	I-ORG
中	I-ORG
央	I-ORG
致	O
中	B-ORG
国	I-ORG
致	I-ORG
公	I-ORG
党	I-ORG
十	I-ORG
一	I-ORG
大	I-ORG
的	O
贺	O
词	O

ATTENTION: 在处理自己数据集的时候需要注意:

  • 字与标签之间用空格("\ ")隔开
  • 其中句子与句子之间使用空行隔开

Steps

  1. 替换数据集
  2. 使用DataProcess/load_data.py生成label2id.txt文件
  3. 修改config/config.py中的MAX_SEQ_LEN(超过截断,少于填充,最好设置训练集、测试集中最长句子作为MAX_SEQ_LEN)
  4. 下载权重,放到项目中
  5. 修改public/path.py中的地址
  6. 根据需要修改model.py模型结构
  7. 修改config/config.py的参数
  8. 训练前debug看下input_train_labels,result_train对不对,input_train_types全是0
  9. 训练

Model

BERT

roberta

macBERT

BERT_wwm

Train

运行train.py

Evaluate

运行evaluate/f1_score.py

BERT

           precision    recall  f1-score   support

     SIGN     0.6651    0.7354    0.6985       189
  ANATOMY     0.8333    0.8409    0.8371       220
 DIAMETER     1.0000    1.0000    1.0000        16
  DISEASE     0.4915    0.6744    0.5686        43
 QUANTITY     0.8837    0.9157    0.8994        83
TREATMENT     0.3571    0.5556    0.4348         9
  DENSITY     1.0000    1.0000    1.0000         8
    ORGAN     0.4500    0.6923    0.5455        13
LUNGFIELD     1.0000    0.5000    0.6667         6
    SHAPE     0.5714    0.5714    0.5714         7
   NATURE     1.0000    1.0000    1.0000         6
 BOUNDARY     1.0000    0.6250    0.7692         8
   MARGIN     0.8333    0.8333    0.8333         6
  TEXTURE     1.0000    0.8571    0.9231         7

micro avg     0.7436    0.7987    0.7702       621
macro avg     0.7610    0.7987    0.7760       621

roberta

           precision    recall  f1-score   support

  ANATOMY     0.8624    0.8545    0.8584       220
  DENSITY     0.8000    1.0000    0.8889         8
     SIGN     0.7347    0.7619    0.7481       189
 QUANTITY     0.8977    0.9518    0.9240        83
  DISEASE     0.5690    0.7674    0.6535        43
 DIAMETER     1.0000    1.0000    1.0000        16
TREATMENT     0.3333    0.5556    0.4167         9
 BOUNDARY     1.0000    0.6250    0.7692         8
LUNGFIELD     1.0000    0.6667    0.8000         6
   MARGIN     0.8333    0.8333    0.8333         6
  TEXTURE     1.0000    0.8571    0.9231         7
    SHAPE     0.5714    0.5714    0.5714         7
   NATURE     1.0000    1.0000    1.0000         6
    ORGAN     0.6250    0.7692    0.6897        13

micro avg     0.7880    0.8261    0.8066       621
macro avg     0.8005    0.8261    0.8104       621

macBERT

           precision    recall  f1-score   support

  ANATOMY     0.8773    0.8773    0.8773       220
     SIGN     0.6538    0.7196    0.6851       189
  DISEASE     0.5893    0.7674    0.6667        43
 QUANTITY     0.9070    0.9398    0.9231        83
    ORGAN     0.5882    0.7692    0.6667        13
  TEXTURE     1.0000    0.8571    0.9231         7
 DIAMETER     1.0000    1.0000    1.0000        16
TREATMENT     0.3750    0.6667    0.4800         9
LUNGFIELD     1.0000    0.5000    0.6667         6
    SHAPE     0.4286    0.4286    0.4286         7
   NATURE     1.0000    1.0000    1.0000         6
  DENSITY     1.0000    1.0000    1.0000         8
 BOUNDARY     1.0000    0.6250    0.7692         8
   MARGIN     0.8333    0.8333    0.8333         6

micro avg     0.7697    0.8180    0.7931       621
macro avg     0.7846    0.8180    0.7977       621

BERT_wwm

           precision    recall  f1-score   support

  DISEASE     0.5667    0.7907    0.6602        43
  ANATOMY     0.8676    0.8636    0.8656       220
 QUANTITY     0.8966    0.9398    0.9176        83
     SIGN     0.7358    0.7513    0.7435       189
LUNGFIELD     1.0000    0.6667    0.8000         6
TREATMENT     0.3571    0.5556    0.4348         9
 DIAMETER     0.9375    0.9375    0.9375        16
 BOUNDARY     1.0000    0.6250    0.7692         8
  TEXTURE     1.0000    0.8571    0.9231         7
   MARGIN     0.8333    0.8333    0.8333         6
    ORGAN     0.5882    0.7692    0.6667        13
  DENSITY     1.0000    1.0000    1.0000         8
   NATURE     1.0000    1.0000    1.0000         6
    SHAPE     0.5000    0.5714    0.5333         7

micro avg     0.7889    0.8245    0.8063       621
macro avg     0.8020    0.8245    0.8104       621

Predict

运行predict/predict_bio.py

Topic Inference with Zeroshot models

zeroshot_topics Table of Contents Installation Usage License Installation zeroshot_topics is distributed on PyPI as a universal wheel and is available

Rita Anjana 55 Nov 28, 2022
This repository describes our reproducible framework for assessing self-supervised representation learning from speech

LeBenchmark: a reproducible framework for assessing SSL from speech Self-Supervised Learning (SSL) using huge unlabeled data has been successfully exp

49 Aug 24, 2022
A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis

WaveGlow A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis Quick Start: Install requirements: pip install

Yuchao Zhang 204 Jul 14, 2022
Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Official code for our Interspeech 2021 - Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*. Visually-grounded spoken language datasets c

Ian Palmer 3 Jan 26, 2022
Few-shot Natural Language Generation for Task-Oriented Dialog

Few-shot Natural Language Generation for Task-Oriented Dialog This repository contains the dataset, source code and trained model for the following pa

172 Dec 13, 2022
TLA - Twitter Linguistic Analysis

TLA - Twitter Linguistic Analysis Tool for linguistic analysis of communities TLA is built using PyTorch, Transformers and several other State-of-the-

Tushar Sarkar 47 Aug 14, 2022
The model is designed to train a single and large neural network in order to predict correct translation by reading the given sentence.

Neural Machine Translation communication system The model is basically direct to convert one source language to another targeted language using encode

Nishant Banjade 7 Sep 22, 2022
nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

Bernhard Liebl 2 Jun 10, 2022
Refactored version of FastSpeech2

Refactored version of FastSpeech2. An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

ILJI CHOI 10 May 26, 2022
Download videos from YouTube/Twitch/Twitter right in the Windows Explorer, without installing any shady shareware apps

youtube-dl and ffmpeg Windows Explorer Integration Download videos from YouTube/Twitch/Twitter and more (any platform that is supported by youtube-dl)

Wolfgang 226 Dec 30, 2022
BERT-based Financial Question Answering System

BERT-based Financial Question Answering System In this example, we use Jina, PyTorch, and Hugging Face transformers to build a production-ready BERT-b

Bithiah Yuan 61 Sep 18, 2022
Mlcode - Continuous ML API Integrations

mlcode Basic APIs for ML applications. Django REST Application Contains REST API

Sujith S 1 Jan 01, 2022
A curated list of FOSS tools to improve the Hacker News experience

Awesome-Hackernews Hacker News is a social news website focusing on computer technologies, hacking and startups. It promotes any content likely to "gr

Bryton Lacquement 141 Dec 27, 2022
Fine-tuning scripts for evaluating transformer-based models on KLEJ benchmark.

The KLEJ Benchmark Baselines The KLEJ benchmark (Kompleksowa Lista Ewaluacji Językowych) is a set of nine evaluation tasks for the Polish language und

Allegro Tech 17 Oct 18, 2022
Some embedding layer implementation using ivy library

ivy-manual-embeddings Some embedding layer implementation using ivy library. Just for fun. It is based on NYCTaxiFare dataset from kaggle (cut down to

Ishtiaq Hussain 2 Feb 10, 2022
Pre-training BERT masked language models with custom vocabulary

Pre-training BERT Masked Language Models (MLM) This repository contains the method to pre-train a BERT model using custom vocabulary. It was used to p

Stella Douka 14 Nov 02, 2022
초성 해석기 based on ko-BART

초성 해석기 개요 한국어 초성만으로 이루어진 문장을 입력하면, 완성된 문장을 예측하는 초성 해석기입니다. 초성: ㄴㄴ ㄴㄹ ㅈㅇㅎ 예측 문장: 나는 너를 좋아해 모델 모델은 SKT-AI에서 공개한 Ko-BART를 이용합니다. 데이터 문장 단위로 이루어진 아무 코퍼스나

Dawoon Jung 29 Oct 28, 2022
A framework for implementing federated learning

This is partly the reproduction of the paper of [Privacy-Preserving Federated Learning in Fog Computing](DOI: 10.1109/JIOT.2020.2987958. 2020)

DavidChen 46 Sep 23, 2022
Ongoing research training transformer language models at scale, including: BERT & GPT-2

Megatron (1 and 2) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.

NVIDIA Corporation 3.5k Dec 30, 2022
ConvBERT: Improving BERT with Span-based Dynamic Convolution

ConvBERT Introduction In this repo, we introduce a new architecture ConvBERT for pre-training based language model. The code is tested on a V100 GPU.

YITUTech 237 Dec 10, 2022