An open-source Kazakh named entity recognition dataset (KazNERD), annotation guidelines, and baseline NER models.

Last update: Dec 23, 2022

Overview

Kazakh Named Entity Recognition

This repository contains an open-source Kazakh named entity recognition dataset (KazNERD), named entity annotation guidelines (in Kazakh), and training code for NER models (CRF, BiLSTM-CNN-CRF, BERT, and XLM-RoBERTa).

KazNERD Corpus
Annotation Guidelines
NER Models
Citation

1. KazNERD Corpus

KazNERD contains 112,702 sentences extracted from television news text and 136,333 annotations for 25 entity classes. All sentences in the dataset were manually annotated by two native Kazakh-speaking linguists, supervised by an ISSAI researcher. The IOB2 scheme was used for annotation. The dataset, in CoNLL 2002 format, is located here.
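
To illustrate what IOB2 tags in a CoNLL 2002 style file look like in practice, here is a minimal Python sketch that reads such lines and recovers entity spans. It assumes one whitespace-separated token/tag pair per line, with the tag in the last column and a blank line between sentences; check the actual files before relying on this layout. The tokens and the tag names `PERSON` and `GPE` in the sample are illustrative only, and the helper names `read_conll` and `iob2_spans` are hypothetical (they are not part of this repository).

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, IOB2 tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse 'token ... tag' lines; a blank line ends a sentence (assumed layout)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))  # first column: token, last: tag
    if current:
        sentences.append(current)
    return sentences


def iob2_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (entity_class, start, end_exclusive) spans from IOB2 tags."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # an I- tag with no open entity of the same class is ignored here
    return spans


sample = [
    "Abai B-PERSON",
    "Kunanbaiuly I-PERSON",
    "visited O",
    "Semey B-GPE",
    "",
    "Today O",
]
sentences = read_conll(sample)
first_tags = [tag for _, tag in sentences[0]]
print(iob2_spans(first_tags))
```

In IOB2, every entity starts with a `B-` tag, so adjacent entities of the same class stay distinguishable; the span extractor above relies on that property.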

2. Annotation Guidelines

The annotation guidelines followed to build KazNERD are located here. The guidelines contain rules for annotating the 25 named entity classes, with examples. The guidelines are written in Kazakh.

3. NER Models

3.1 CRF

Conda Environment Setup for CRF

The CRF-based NER model training code targets Python 3.8. To make the experiments easier to replicate, we recommend setting up a Conda environment.

conda create --name knerdCRF python=3.8
conda activate knerdCRF
conda install -c anaconda nltk scikit-learn
conda install -c conda-forge sklearn-crfsuite seqeval

Start CRF training

$ cd crf
$ python runCRF_KazNERD.py
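
For readers new to CRF taggers, the sketch below shows the kind of hand-crafted token features such models are typically trained on. It is an illustration only: the feature set and the function names `token_features` and `sentence_features` are assumptions and are not taken from `runCRF_KazNERD.py`.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for tokens[i] from the token and its neighbours."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "lower": word.lower(),
        "suffix3": word[-3:],        # last three characters (or fewer)
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "is_digit": word.isdigit(),
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True          # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True          # end of sentence
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """One feature dict per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


features = sentence_features(["Almaty", "is", "big"])
print(features[0])
```

A list of such per-sentence feature lists, paired with the matching IOB2 tag sequences, is the input shape that `sklearn_crfsuite.CRF.fit(X, y)` accepts.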

3.2 BiLSTM-CNN-CRF

Conda Environment Setup for BiLSTM-CNN-CRF

The BiLSTM-CNN-CRF-based NER model training code targets Python 3.8 and PyTorch 1.7.1. To make the experiments easier to replicate, we recommend setting up a Conda environment.

conda create --name knerdLSTM python=3.8
conda activate knerdLSTM
# Check https://pytorch.org/get-started/previous-versions/#v171
# to install a PyTorch version suitable for your OS and CUDA
# or feel free to adapt the code to a newer PyTorch version
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch   # we used this version
conda install -c conda-forge tqdm seqeval

Start BiLSTM-CNN-CRF training

$ cd BiLSTM_CNN_CRF
$ bash run_train_p.sh
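
The environments above install `seqeval`, which scores NER output at the entity level rather than per token. The following pure-Python sketch shows what that means under a strict reading of IOB2 (a predicted entity counts only if its class and both boundaries match the gold annotation). It is a simplified illustration, not a replacement for `seqeval`, whose handling of malformed tag sequences may differ; the function names and tag names are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, str, int, int]  # (sentence index, class, start, end_exclusive)


def spans(tags: List[str], sent_id: int) -> Set[Span]:
    """Collect entity spans from one IOB2 tag sequence."""
    out: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            out.add((sent_id, label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return out


def entity_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged entity-level precision, recall, and F1."""
    gold_spans: Set[Span] = set()
    pred_spans: Set[Span] = set()
    for k, (g, p) in enumerate(zip(gold, pred)):
        gold_spans |= spans(g, k)
        pred_spans |= spans(p, k)
    tp = len(gold_spans & pred_spans)  # exact class and boundary matches
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["B-PERSON", "I-PERSON", "O", "B-GPE"]]
pred = [["B-PERSON", "O", "O", "B-GPE"]]  # PERSON boundary is wrong
print(entity_prf(gold, pred))
```

In this example the truncated `PERSON` span earns no credit, even though one of its two tokens is tagged correctly, which is why entity-level scores are usually lower than token-level accuracy.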

3.3 BERT and XLM-RoBERTa

Conda Environment Setup for BERT and XLM-RoBERTa

The BERT- and XLM-RoBERTa-based NER model training code targets Python 3.8 and PyTorch 1.7.1. To make the experiments easier to replicate, we recommend setting up a Conda environment.

conda create --name knerdBERT python=3.8
conda activate knerdBERT
# Check https://pytorch.org/get-started/previous-versions/#v171
# to install a PyTorch version suitable for your OS and CUDA
# or feel free to adapt the code to a newer PyTorch version
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch   # we used this version
conda install -c anaconda numpy
conda install -c conda-forge seqeval
pip install transformers
pip install datasets

Start BERT training

$ cd bert
$ python run_finetune_kaznerd.py bert

Start XLM-RoBERTa training

$ cd bert
$ python run_finetune_kaznerd.py roberta
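
Fine-tuning BERT-style models on word-level NER tags requires mapping each word's label onto the subword tokens the tokenizer produces. The sketch below shows one common convention: label only the first subword of each word and mask the rest. It is written against a plain list of word indices (the kind of list returned by `word_ids()` on Hugging Face fast tokenizer encodings), so it runs without `transformers`. Whether `run_finetune_kaznerd.py` uses this exact convention is an assumption; `align_labels` is a hypothetical helper name.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to subword level.

    word_ids[i] is the index of the word that subword i belongs to,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)       # special token: no loss
        elif wid != previous:
            aligned.append(word_labels[wid])   # first subword carries the label
        else:
            aligned.append(IGNORE_INDEX)       # continuation subword: no loss
        previous = wid
    return aligned


# Three words split into 2, 1, and 3 subwords, wrapped in special tokens.
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
word_labels = [1, 0, 3]  # illustrative label ids, one per word
print(align_labels(word_ids, word_labels))
```

Masking continuation subwords keeps the number of scored positions equal to the number of words, so predictions can be mapped back to the original CoNLL tokens one-to-one.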

4. Citation

@misc{yeshpanov2021kaznerd,
      title={KazNERD: Kazakh Named Entity Recognition Dataset}, 
      author={Rustem Yeshpanov and Yerbolat Khassanov and Huseyin Atakan Varol},
      year={2021},
      eprint={2111.13419},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


Owner

ISSAI
