Chinese named entity recognition with BiLSTM using Keras

Overview

Chinese named entity recognition (BiLSTM with Keras)

Project Structure

./
├── README.md
├── data
│   ├── README.md
│   ├── data                            dataset
│   │   ├── test.txt
│   │   └── train.txt
│   ├── plain_text.txt
│   └── vocab.txt                       vocabulary
├── evaluate
│   ├── __init__.py
│   └── f1_score.py                     computes the entity-level F1 score
├── keras_contrib                       keras_contrib package (can also be installed with pip)
├── log                                 nohup training logs
│   ├── __init__.py
│   └── nohup.out
├── model                               model
│   ├── BiLSTMCRF.py
│   ├── __init__.py
│   └── __pycache__
├── predict                             prediction output
│   ├── __init__.py
│   ├── __pycache__
│   ├── predict.py
│   └── predict_process.py
├── preprocess                          data preprocessing
│   ├── README.md
│   ├── __pycache__
│   ├── convert_jsonl.py
│   ├── data_add_line.py
│   ├── generate_vocab.py               generates the vocabulary
│   ├── process_data.py                 data processing and conversion
│   ├── splite.py
│   └── vocab.py                        vocabulary lookup utilities
├── public
│   ├── __init__.py
│   ├── __pycache__
│   ├── config.py                       training settings
│   ├── generate_label_id.py            generates the label2id file
│   ├── label2id.json                   label dictionary
│   ├── path.py                         all paths
│   └── utils.py                        utilities
├── report
│   └── report.out                      F1 evaluation report
├── train.py
└── weight                              saved weights
    └── bilstm_ner.h5

52 directories, 214 files

Dataset

A pulmonary nodule dataset from a Grade-A tertiary hospital, 20,000+ characters, in BIO format. The files look like this:

中	B-ORG
共	I-ORG
中	I-ORG
央	I-ORG
致	O
中	B-ORG
国	I-ORG
致	I-ORG
公	I-ORG
党	I-ORG
十	I-ORG
一	I-ORG
大	I-ORG
的	O
贺	O
词	O

ATTENTION: When preparing your own dataset, note the following:

  • Separate each character and its label with a tab ("\t")
  • Separate sentences with a blank line
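
To check that a dataset follows these two rules, a small parser can help. The following is a minimal sketch, not the repository's own loader; the function name `parse_bio` and the `Sentence` alias are hypothetical. It splits each line on a tab and starts a new sentence at every blank line.

```python
from typing import List, Tuple

# One sentence is a pair: (characters, tags), both of the same length.
Sentence = Tuple[List[str], List[str]]


def parse_bio(text: str) -> List[Sentence]:
    """Parse tab-separated BIO text in which blank lines separate sentences."""
    sentences: List[Sentence] = []
    chars: List[str] = []
    tags: List[str] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if chars:
                sentences.append((chars, tags))
                chars, tags = [], []
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 'char<TAB>tag', got {line!r}")
        chars.append(parts[0])
        tags.append(parts[1])
    if chars:
        # The file may end without a trailing blank line.
        sentences.append((chars, tags))
    return sentences


if __name__ == "__main__":
    sample = "中\tB-ORG\n共\tI-ORG\n致\tO\n\n贺\tO\n词\tO\n"
    for sentence_chars, sentence_tags in parse_bio(sample):
        print("".join(sentence_chars), sentence_tags)
```

A line that uses spaces instead of a tab raises a `ValueError` with its line number, which makes formatting mistakes easy to locate before training.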

Steps

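  1. Replace the dataset
  2. Update the paths in public/path.py
  3. Run public/generate_label_id.py to generate the label2id.txt file, then copy its contents into get_tag2index in preprocess/vocab.py. Note: indices must start from 0
  4. Set MAX_LEN in public/config.py (longer sentences are truncated and shorter ones are padded; ideally use the length of the longest sentence in the training and test sets)
  5. Run preprocess/generate_vocab.py to generate the vocabulary, which is ordered by frequency
  6. Modify the model structure in BiLSTMCRF.py as needed
  7. Adjust the parameters in public/config.py
  8. Before training, debug to check that train_data and train_label are correct
  9. Train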
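
Steps 3 to 5 can be sketched as follows. This is an illustration under stated assumptions, not the code in public/generate_label_id.py or preprocess/generate_vocab.py: the helper names, the `<PAD>` and `<UNK>` tokens, and the choice of 0 as the padding id are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

PAD, UNK = "<PAD>", "<UNK>"  # hypothetical special tokens


def build_label2id(tag_sequences: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map every distinct tag to a consecutive index starting from 0."""
    labels = sorted({tag for tags in tag_sequences for tag in tags})
    return {label: index for index, label in enumerate(labels)}


def build_vocab(char_sequences: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign ids in descending frequency order, after the special tokens."""
    counts = Counter(ch for chars in char_sequences for ch in chars)
    vocab = {PAD: 0, UNK: 1}
    for ch, _ in counts.most_common():
        vocab[ch] = len(vocab)
    return vocab


def pad_or_truncate(ids: List[int], max_len: int, pad_id: int = 0) -> List[int]:
    """Cut sequences longer than max_len; right-pad shorter ones with pad_id."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


if __name__ == "__main__":
    chars = [["中", "国", "中"], ["国", "中"]]
    tags = [["B-ORG", "I-ORG", "O"], ["O", "O"]]
    vocab = build_vocab(chars)
    print(build_label2id(tags))
    print(vocab)
    print(pad_or_truncate([vocab[c] for c in chars[0]], max_len=5))
```

Setting MAX_LEN to the longest sentence means no sentence is truncated, at the cost of more padding per batch.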

Model

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, None)              0
_________________________________________________________________
embedding_1 (Embedding)      (None, None, 128)         81408
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 256)         263168
_________________________________________________________________
dropout_1 (Dropout)          (None, None, 256)         0
_________________________________________________________________
bidirectional_2 (Bidirection (None, None, 128)         164352
_________________________________________________________________
dropout_2 (Dropout)          (None, None, 128)         0
_________________________________________________________________
time_distributed_1 (TimeDist (None, None, 29)          3741
_________________________________________________________________
dropout_3 (Dropout)          (None, None, 29)          0
_________________________________________________________________
crf_1 (CRF)                  (None, None, 29)          1769
=================================================================
Total params: 514,438
Trainable params: 514,438
Non-trainable params: 0
_________________________________________________________________
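
The parameter counts in the summary can be reproduced by hand, which is a useful check after changing the model structure. The sketch below uses values inferred from the summary rather than read from the repository's code: a vocabulary of 636 entries (81408 / 128), LSTM layers with 128 and 64 units per direction (from the 256 and 128 output widths), 29 tags, and a CRF layer assumed to hold an input kernel, a transition matrix, a bias, and two boundary vectors.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def bilstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights and a bias, in two directions."""
    return 2 * 4 * (input_dim + units + 1) * units


def dense_params(input_dim: int, units: int) -> int:
    """Weight matrix plus bias."""
    return input_dim * units + units


def crf_params(input_dim: int, units: int) -> int:
    """Assumed layout: input kernel, transition matrix, bias, left and right boundary vectors."""
    return input_dim * units + units * units + 3 * units


# All sizes below are inferred from the model summary (assumptions).
counts = {
    "embedding_1": embedding_params(vocab_size=636, dim=128),
    "bidirectional_1": bilstm_params(input_dim=128, units=128),
    "bidirectional_2": bilstm_params(input_dim=256, units=64),
    "time_distributed_1": dense_params(input_dim=128, units=29),
    "crf_1": crf_params(input_dim=29, units=29),
}

if __name__ == "__main__":
    for layer_name, n_params in counts.items():
        print(f"{layer_name:20s} {n_params:>8,}")
    print(f"{'total':20s} {sum(counts.values()):>8,}")  # 514,438
```

The Dropout layers contribute no parameters, so the five counts above add up to the reported total.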

Train

Run train.py

Epoch 1/500
806/806 [==============================] - 15s 18ms/step - loss: 2.4178 - crf_viterbi_accuracy: 0.9106

Epoch 00001: loss improved from inf to 2.41777, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 2/500
806/806 [==============================] - 10s 13ms/step - loss: 0.6370 - crf_viterbi_accuracy: 0.9106

Epoch 00002: loss improved from 2.41777 to 0.63703, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 3/500
806/806 [==============================] - 11s 14ms/step - loss: 0.5295 - crf_viterbi_accuracy: 0.9106

Epoch 00003: loss improved from 0.63703 to 0.52950, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 4/500
806/806 [==============================] - 11s 13ms/step - loss: 0.4184 - crf_viterbi_accuracy: 0.9064

Epoch 00004: loss improved from 0.52950 to 0.41838, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 5/500
806/806 [==============================] - 12s 14ms/step - loss: 0.3422 - crf_viterbi_accuracy: 0.9104

Epoch 00005: loss improved from 0.41838 to 0.34217, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 6/500
806/806 [==============================] - 10s 13ms/step - loss: 0.3164 - crf_viterbi_accuracy: 0.9106

Epoch 00006: loss improved from 0.34217 to 0.31637, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 7/500
806/806 [==============================] - 10s 12ms/step - loss: 0.3003 - crf_viterbi_accuracy: 0.9111

Epoch 00007: loss improved from 0.31637 to 0.30032, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 8/500
806/806 [==============================] - 10s 12ms/step - loss: 0.2906 - crf_viterbi_accuracy: 0.9117

Epoch 00008: loss improved from 0.30032 to 0.29058, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 9/500
806/806 [==============================] - 9s 12ms/step - loss: 0.2837 - crf_viterbi_accuracy: 0.9118

Epoch 00009: loss improved from 0.29058 to 0.28366, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 10/500
806/806 [==============================] - 9s 11ms/step - loss: 0.2770 - crf_viterbi_accuracy: 0.9142

Epoch 00010: loss improved from 0.28366 to 0.27696, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 11/500
806/806 [==============================] - 10s 12ms/step - loss: 0.2713 - crf_viterbi_accuracy: 0.9160

Evaluate

Run evaluate/f1_score.py

100%|█████████████████████████████████████████| 118/118 [00:38<00:00,  3.06it/s]
TP: 441
TP+FP: 621
precision: 0.7101449275362319
TP+FN: 604
recall: 0.7301324503311258
f1: 0.72

classification report:
              precision    recall  f1-score   support

     ANATOMY       0.74      0.75      0.74       220
    BOUNDARY       1.00      0.75      0.86         8
     DENSITY       0.78      0.88      0.82         8
    DIAMETER       0.82      0.88      0.85        16
     DISEASE       0.54      0.72      0.62        43
   LUNGFIELD       0.83      0.83      0.83         6
      MARGIN       0.57      0.67      0.62         6
      NATURE       0.00      0.00      0.00         6
       ORGAN       0.62      0.62      0.62        13
    QUANTITY       0.88      0.87      0.87        83
       SHAPE       1.00      0.43      0.60         7
        SIGN       0.66      0.65      0.65       189
     TEXTURE       0.75      0.43      0.55         7
   TREATMENT       0.25      0.33      0.29         9

   micro avg       0.71      0.71      0.71       621
   macro avg       0.67      0.63      0.64       621
weighted avg       0.71      0.71      0.71       621
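
The precision, recall, and F1 values at the top of this output follow directly from the three printed counts. The snippet below is a minimal sketch of that arithmetic, not the code in evaluate/f1_score.py; the function and parameter names are hypothetical, with `predicted` standing for the printed TP+FP and `gold` for the printed TP+FN.

```python
from typing import Tuple


def precision_recall_f1(tp: int, predicted: int, gold: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from entity counts.

    tp        -- entities predicted with the correct span and type
    predicted -- all predicted entities (TP + FP)
    gold      -- all annotated entities (TP + FN)
    """
    precision = tp / predicted if predicted else 0.0
    recall = tp / gold if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Counts copied from the evaluation output above.
    p, r, f1 = precision_recall_f1(tp=441, predicted=621, gold=604)
    print(f"precision: {p:.4f}  recall: {r:.4f}  f1: {f1:.2f}")
    # precision: 0.7101  recall: 0.7301  f1: 0.72
```

F1 is the harmonic mean of precision and recall, so it equals 2·TP / ((TP+FP) + (TP+FN)), here 882 / 1225 = 0.72.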

Predict

Run predict/predict_bio.py
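
Predicted BIO tags usually need to be grouped back into entity spans before they are useful. The following is a hedged sketch of that decoding step; it is not taken from the repository's predict scripts, and the function name `bio_to_entities` and its return layout are hypothetical. It assumes an `I-` tag that does not continue a matching `B-` tag should be ignored.

```python
from typing import List, Sequence, Tuple

# (entity text, entity type, start index, end index exclusive)
Entity = Tuple[str, str, int, int]


def bio_to_entities(chars: Sequence[str], tags: Sequence[str]) -> List[Entity]:
    """Group BIO tags into entity spans; stray I- tags are treated like O."""
    entities: List[Entity] = []
    start, label = None, None
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the current entity
        if label is not None:
            entities.append(("".join(chars[start:i]), label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities


if __name__ == "__main__":
    # The sample sentence from the Dataset section.
    chars = list("中共中央致中国致公党十一大的贺词")
    tags = (["B-ORG"] + ["I-ORG"] * 3 + ["O"]
            + ["B-ORG"] + ["I-ORG"] * 7 + ["O"] * 3)
    for entity in bio_to_entities(chars, tags):
        print(entity)
```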
