6/2/08

pymmseg, python mmseg

pymmseg is a Python implementation of mmseg. It's my quick-n-dirty Chinese word segmentation program.

I needed a Chinese UTF-8 word segmentation program for some simple stuff. After googling, I found mmseg, but it's Big5 based and written in C. So I created a Python version based only on its 'simple algorithm' (plain maximum matching, without the other 3 steps) and converted the lexicons to UTF-8.
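To make the 'simple algorithm' concrete, here is a minimal sketch of greedy forward maximum matching. It is not the actual pymmseg source: the function name `segment`, the `max_word_len` parameter, and the toy lexicon are all made up for illustration, and it assumes Python 3 with the lexicon held in a plain set.

```python
def segment(text, lexicon, max_word_len=None):
    """Greedy forward maximum matching (illustrative sketch, not pymmseg's code).

    At each position, take the longest lexicon word that starts there.
    If nothing matches, emit the single character and move on.
    """
    if max_word_len is None:
        # Longest entry in the lexicon bounds how far ahead we need to look.
        max_word_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon; a real one would be loaded from the lexicon files.
toy_lexicon = {"中文", "分词", "程序", "中文分词"}
print(" ".join(segment("中文分词程序", toy_lexicon)))
```

Because the match is greedy, the longest word always wins at each position, which is exactly why this approach is fast but cannot look ahead to resolve ambiguity.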

Download

It sorta works, for my purpose anyway, as you can see from the screenshot.


You can see a lot of words don't exist in the dictionary/lexicons, like '赈灾', '板房', '两米'. That's because the lexicons were converted directly to simplified Chinese from traditional ones, so a lot of words are missing. I am sure using a dictionary built from a simplified Chinese source would greatly improve pymmseg's accuracy. But a better dictionary will not solve ambiguity, such as '兴奋 得很 晚 都 睡不着' (should be '兴奋 得 很晚 都 睡不着'). For that we will have to use the 'complex algorithm'.
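To see why a better dictionary alone can't fix this, here is a small sketch that lists every way a sentence can be split using only words from a lexicon. The function name `all_segmentations` and the toy lexicon are hypothetical, chosen just to reproduce the example above; this is not part of pymmseg.

```python
def all_segmentations(text, lexicon):
    """Return every split of `text` into words that all appear in `lexicon`.

    Illustrative only: this is exponential in the worst case and is meant
    for short example sentences, not real input.
    """
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Combine this first word with every segmentation of the rest.
            for rest in all_segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical toy lexicon containing just the words needed for the example.
toy_lexicon = {"兴奋", "得", "晚", "都", "得很", "很晚", "睡不着"}
for seg in all_segmentations("兴奋得很晚都睡不着", toy_lexicon):
    print(" ".join(seg))
```

With this toy lexicon both readings are valid dictionary segmentations, so something beyond dictionary lookup has to pick between them. Greedy maximum matching grabs '得很' first simply because it is the longest match at that position.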

I will create a better dictionary using a simplified Chinese corpus, and also create a version employing the 'complex algorithm', when I have time.

Comments:

atpic said...

very nice package, thank you for sharing!

愛默愛默 said...

This is a really nice thing to share. I really appreciate it.

In the word.dic, I noticed some unusual words, maybe I'm confused a little bit. 他不, 她不, 上有, these were all in the word.dic file, so if the parser encountered these, it would treat them as single words, but I think they should be separate. For example, the sentence:
農場上有多少工人?

Becomes separated as:
農場 上有 多少 工人
But should be:
農場 上 有 多少 工人

Can you explain the numbers in the words.dic and chars.dic file that are next to each character?

I would like to construct a new traditional chinese dictionary, maybe based off CC-CEDICT project.

http://cc-cedict.org/wiki/start

thanks again,
Shaun Pedicini
