A Chinese word segmenter based on CRF

Last update: Nov 04, 2022

Overview

Genius

Genius is an open-source Python component for Chinese word segmentation, built on the conditional random field (CRF) algorithm.

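Features

Supports Python 2.x, Python 3.x, and PyPy 2.x.
Supports simple pinyin segmentation.
Supports user-defined breaks.
Supports user-defined merge dictionaries.
Supports part-of-speech tagging.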

Source Install

Install git: 1) on Ubuntu or Debian, apt-get install git; 2) on Fedora or Red Hat, yum install git
Download the code: git clone https://github.com/duanhongyi/genius.git
Install the code: python setup.py install

PyPI Install

Run easy_install genius or pip install genius

Algorithm

Uses a trie for lookups in the merge dictionary.
Implements CRF segmentation on top of Wapiti.
The default dictionaries can be replaced through genius.loader.ResourceLoader.
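
To make the trie-based merge lookup concrete, here is a minimal sketch of the general technique. It is not Genius's actual implementation: the names TrieNode, Trie, longest_match, and combine are hypothetical, and the sketch assumes that merging means joining adjacent tokens whenever their concatenation is a dictionary entry.

```python
class TrieNode:
    """One node of a character trie (hypothetical helper, not part of Genius)."""

    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def longest_match(self, tokens, start):
        """Return the exclusive end index of the longest dictionary word
        formed by joining tokens[start:end], or start if nothing matches."""
        node = self.root
        best = start
        for end in range(start, len(tokens)):
            for ch in tokens[end]:
                node = node.children.get(ch)
                if node is None:
                    return best
            if node.is_word:
                best = end + 1
        return best


def combine(tokens, trie):
    """Merge runs of two or more adjacent tokens that spell a dictionary word."""
    merged = []
    i = 0
    while i < len(tokens):
        end = trie.longest_match(tokens, i)
        if end - i >= 2:
            merged.append(''.join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged


trie = Trie([u'长江大桥'])
print(combine([u'长江', u'大桥', u'很', u'长'], trie))
```

Walking the trie one character at a time lets the lookup stop as soon as no dictionary entry shares the current prefix, so each starting position costs at most the length of the longest entry.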

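Function 1): Segmentation with the `genius.seg_text` method

The genius.seg_text function accepts five parameters, of which only text is required:
text: the first parameter, the string to segment
use_break: whether to apply break processing to the segmentation result; default True
use_combine: whether to merge words using the dictionary; default False
use_tagging: whether to perform part-of-speech tagging; default True
use_pinyin_segment: whether to segment pinyin; default True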

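Code example (full-featured segmentation)

# -*- coding: utf-8 -*-
import genius

# Sample news sentence to segment
text = u"""昨天,我和施瓦布先生一起与部分企业家进行了交流,大家对中国经济当前、未来发展的态势、走势都十分关心。"""
# Enable every option: dictionary merging, pinyin segmentation, tagging, and breaks
seg_list = genius.seg_text(
    text,
    use_combine=True,
    use_pinyin_segment=True,
    use_tagging=True,
    use_break=True
)
# Print one word per line, followed by its part-of-speech tag
print('\n'.join(['%s\t%s' % (word.text, word.tagging) for word in seg_list]))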

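Function 2): Segmentation for indexing

The genius.seg_keywords method is intended for building search engine indexes and keeps ambiguous segmentations; only text is required.
text: the first parameter, the string to segment
use_break: whether to apply break processing to the segmentation result; default True
use_tagging: whether to perform part-of-speech tagging; default False
use_pinyin_segment: whether to segment pinyin; default False
Because merging conflicts with the purpose of this method, it does not offer merging. If you build an index with this method, passing use_combine=True to genius.seg_text at query time is not recommended.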
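
The sketch below illustrates why keeping ambiguous segmentations helps indexing. It is a toy inverted index, not anything Genius provides: build_index and search are hypothetical helpers, and both token lists are written by hand for illustration rather than taken from real Genius output.

```python
from collections import defaultdict


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def search(index, query_tokens):
    """Return the ids of documents that contain every query token."""
    if not query_tokens:
        return set()
    result = None
    for token in query_tokens:
        ids = index.get(token, set())
        result = set(ids) if result is None else result & ids
    return result


# Hand-written token lists (assumptions, not real Genius output).
coarse_docs = {1: [u'南京市', u'长江大桥']}
overlapping_docs = {1: [u'南京', u'南京市', u'市长', u'长江', u'大桥', u'长江大桥']}

print(search(build_index(coarse_docs), [u'长江']))       # no hit
print(search(build_index(overlapping_docs), [u'长江']))  # finds document 1
```

The same reasoning suggests why merging at query time is discouraged: a merged query token that was never indexed as a single unit would match nothing.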

Code example

#encoding=utf-8
import genius

seg_list = genius.seg_keywords(u'南京市长江大桥')
print('\n'.join([word.text for word in seg_list]))

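Function 3): Keyword extraction

The genius.extract_tag method is intended for extracting tag keywords; only text is required.
text: the first parameter, the string to segment
use_break: whether to apply break processing to the segmentation result; default True
use_combine: whether to merge words using the dictionary; default False
use_pinyin_segment: whether to segment pinyin; default False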

Code example

#encoding=utf-8
import genius

tag_list = genius.extract_tag(u'南京市长江大桥')
print('\n'.join(tag_list))

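Other notes 4):

The segmentation corpus currently comes from the People's Daily of January 1998, so segmentation is more accurate on news-style text.
CRF segmentation quality depends heavily on the genre and coverage of the training corpus; with a better corpus, both segmentation and tagging could improve considerably.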
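
To show how a segmented corpus feeds a CRF, here is a minimal sketch of the common character-tagging formulation. It assumes the widely used B/M/E/S tag set (begin, middle, end, single); the source does not say which tag set Genius uses, and words_to_tags and tags_to_words are hypothetical helpers.

```python
def words_to_tags(words):
    """Turn a segmented sentence into parallel character and tag lists."""
    chars, tags = [], []
    for word in words:
        if not word:
            continue
        chars.extend(word)
        if len(word) == 1:
            tags.append('S')
        else:
            tags.extend(['B'] + ['M'] * (len(word) - 2) + ['E'])
    return chars, tags


def tags_to_words(chars, tags):
    """Rebuild words from per-character tags; a word closes on E or S."""
    words, current = [], ''
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ('E', 'S'):
            words.append(current)
            current = ''
    if current:
        # Keep any trailing characters whose word was never closed.
        words.append(current)
    return words


chars, tags = words_to_tags([u'南京市', u'长江', u'大桥', u'长'])
print(' '.join(tags))
print(tags_to_words(chars, tags))
```

Under this formulation the model only learns tag patterns that appear in its training sentences, which is one way to see why the corpus genre and coverage matter so much.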

Owner

duanhongyi
