I needed a Chinese UTF-8 word segmentation program for some simple tasks. After some googling I found mmseg, but it is Big5-based and written in C. So I wrote a Python version based only on its 'simple algorithm' (plain maximum matching, without the other 3 steps) and converted the lexicons to UTF-8.
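To give a rough idea of what "plain maximum matching" means, here is a minimal sketch. It is not the actual pymmseg code: the function name, the `max_word_len` limit and the tiny lexicon are made up for illustration. At each position it takes the longest lexicon word it can find and falls back to a single character when nothing matches.

```python
def simple_segment(text, lexicon, max_word_len=4):
    """Greedy forward maximum matching (illustrative sketch, not pymmseg itself)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted, so the loop always advances.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon; the real lexicons hold many thousands of entries.
toy_lexicon = {"中文", "分词", "程序"}
print(" ".join(simple_segment("我的中文分词程序", toy_lexicon)))
# prints: 我 的 中文 分词 程序
```

Characters that are not covered by any lexicon word ('我' and '的' here) simply come out as one-character words.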
Download
It sort of works, for my purposes anyway, as you can see from the screenshot.
You can see that a lot of words, like '赈灾', '板房' and '两米', don't exist in the dictionary/lexicons. That's because the lexicons were converted directly from a traditional Chinese one to simplified Chinese, so many words are missing. I am sure a dictionary built from a simplified Chinese source would greatly improve pymmseg's accuracy. But a better dictionary will not resolve ambiguity, such as '兴奋 得很 晚 都 睡不着' (which should be '兴奋 得 很晚 都 睡不着'). For that we will have to use the 'complex algorithm'.
I will create a better dictionary from a simplified Chinese corpus, and also a version employing the 'complex algorithm', when I have time.
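For the curious, here is a rough sketch of how the 'complex algorithm' could resolve that example, based on my understanding of the MMSEG description. It looks at chunks of up to three words and prefers the longest chunk, then the largest average word length, then the smallest variance in word lengths, then the largest sum of log frequencies of the one-character words. Everything named here is an assumption for illustration: the function names, the toy lexicon, and especially the character frequencies, which are invented numbers and not taken from any real chars.dic.

```python
import math
from fractions import Fraction


def candidate_words(text, start, lexicon, max_word_len=4):
    """The single character at `start`, plus every lexicon word starting there."""
    found = [text[start]]
    for length in range(2, min(max_word_len, len(text) - start) + 1):
        piece = text[start:start + length]
        if piece in lexicon:
            found.append(piece)
    return found


def chunks(text, start, lexicon, depth=3):
    """Enumerate chunks of up to `depth` consecutive words beginning at `start`."""
    if depth == 0 or start >= len(text):
        return [[]]
    result = []
    for word in candidate_words(text, start, lexicon):
        for rest in chunks(text, start + len(word), lexicon, depth - 1):
            result.append([word] + rest)
    return result


def score(chunk, char_freq):
    """Rank a chunk by the four rules, compared in order as a tuple."""
    lengths = [len(w) for w in chunk]
    total = sum(lengths)
    average = Fraction(total, len(chunk))  # exact arithmetic, so ties stay ties
    variance = sum((n - average) ** 2 for n in lengths) / len(chunk)
    # Unknown characters count as frequency 1, i.e. log(1) = 0.
    freedom = sum(math.log(char_freq.get(w, 1)) for w in chunk if len(w) == 1)
    return (total, average, -variance, freedom)


def complex_segment(text, lexicon, char_freq):
    """Pick the best chunk at each position and keep only its first word."""
    words, i = [], 0
    while i < len(text):
        best = max(chunks(text, i, lexicon), key=lambda c: score(c, char_freq))
        words.append(best[0])
        i += len(best[0])
    return words


# Hypothetical toy data: the frequencies below are invented for this example.
toy_lexicon = {"兴奋", "得很", "很晚", "睡不着"}
toy_char_freq = {"得": 1000, "晚": 50, "都": 800}
print(" ".join(complex_segment("兴奋得很晚都睡不着", toy_lexicon, toy_char_freq)))
# prints: 兴奋 得 很晚 都 睡不着
```

In this toy setup the first three rules tie between '得 很晚 都' and '得很 晚 都', so the result depends entirely on the assumed frequencies: '得' is given a much higher count than '晚', which tips the choice toward '得 很晚'. With real frequency data the outcome could differ.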
 
3 comments:
very nice package, thank you for sharing!
This is a really nice thing to share. I really appreciate it.
In the word.dic file I noticed some unusual words, though maybe I'm a little confused. 他不, 她不 and 上有 are all in there, so if the parser encounters them it will treat them as single words, but I think they should be separate. For example, the sentence:
農場上有多少工人?
Becomes separated as:
農場 上有 多少 工人
But should be:
農場 上 有 多少 工人
Can you explain the numbers in the words.dic and chars.dic files that are next to each character?
I would like to construct a new traditional Chinese dictionary, maybe based on the CC-CEDICT project.
http://cc-cedict.org/wiki/start
thanks again,
Shaun Pedicini