mini-segmenter is a lightweight, lexicon/dictionary-based Chinese text segmenter. It tokenizes text by inserting whitespace between the words it recognizes. For example:
Input:
应有尽有的丰富选择定将为您的旅程增添无数的赏心乐事
Output:
应有尽有 的 丰富 选择 定 将 为 您 的 旅程 增添 无数 的 赏 心 乐事
The advantage of using a lexicon/dictionary for text segmentation is that the segmenter can be localized and scaled to the text's language or domain by changing the dictionary. In support of the open source movement, the default dictionary used by mini-segmenter is MDBG's CC-CEDICT.
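To make the dictionary-based idea concrete, here is a minimal sketch of greedy forward maximum matching, one common way to segment text against a word list. It is an illustration only and not necessarily the algorithm mini-segmenter itself uses. The function name `segment` and the tiny `TOY_LEXICON` are hypothetical; a real run would use a full dictionary such as CC-CEDICT.

```python
def segment(text, lexicon, max_len=None):
    """Segment text by greedy forward maximum matching against a word set.

    Assumption: this is a generic dictionary-lookup sketch, not
    mini-segmenter's actual implementation.
    """
    if max_len is None:
        # Longest word in the lexicon bounds how far ahead we need to look.
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return " ".join(tokens)


# Hypothetical toy lexicon, chosen only to cover the example sentence.
TOY_LEXICON = {"应有尽有", "丰富", "选择", "旅程", "增添", "无数", "乐事"}

print(segment("应有尽有的丰富选择定将为您的旅程增添无数的赏心乐事", TOY_LEXICON))
```

Swapping `TOY_LEXICON` for a domain-specific word set changes the segmentation without touching the code, which is the localization benefit described above. Characters not covered by the lexicon fall through as single-character tokens.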
The sentences in the mini-segmenter test suite are taken from the Nanyang Technological University Multilingual Corpus (NTU-MC).
[Download mini-segmenter here](https://mini-segmenter.googlecode.com/files/minisegmenter-v1.1.tar.gz) =)
NOTE: The preferred encoding for mini-seg
python
Chinese
Dictionary
Analytics