6/2/08

pymmseg, python mmseg

pymmseg is a Python implementation of mmseg. It's my quick-n-dirty Chinese word segmentation program.

I needed a Chinese UTF-8 word segmentation program for some simple stuff. After googling, I found mmseg, but it's Big5-based and written in C. I created a Python version based on just the 'simple algorithm' (maximum matching only, without the other 3 steps) and converted the lexicons to UTF-8.
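To give a rough idea of what 'simple algorithm' means here, below is a minimal sketch of greedy forward maximum matching. It is not the actual pymmseg source: the function name `simple_segment`, the `max_len` parameter and the four-word toy lexicon are all made up for illustration, and it is written for Python 3 strings.

```python
def simple_segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching (illustrative sketch, not pymmseg's code).

    At each position, take the longest lexicon word that starts there;
    if nothing matches, emit a single character.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in lexicon:
                break
        else:
            # No multi-character word matched here.
            size = 1
        words.append(text[i:i + size])
        i += size
    return words


# Hypothetical toy lexicon, only for this example.
toy_lexicon = {"研究", "研究生", "生命", "起源"}
result = simple_segment("研究生命起源", toy_lexicon)
print(" ".join(result))  # 研究生 命 起源
```

The greedy choice grabs '研究生' first and leaves '命' stranded, even though '研究 生命 起源' is also possible with the same lexicon. That's the kind of mistake the extra steps are meant to catch.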

Download

It sorta works, for my purpose anyway, as you can see from the screenshot.


You can see that a lot of words don't exist in the dictionary/lexicons, like '赈灾', '板房', '两米'. That's because the lexicons were converted directly to simplified Chinese from a traditional one, so a lot of words are missing. I am sure that using a dictionary trained on a simplified source would greatly improve pymmseg's accuracy. But a better dictionary will not solve ambiguity, such as '兴奋 得很 晚 都 睡不着' (should be '兴奋 得 很晚 都 睡不着'). For that we will have to use the 'complex algorithm'.
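To see why a better dictionary alone can't fix this, here is a small sketch that lists every segmentation a lexicon allows for that sentence. The helper `all_segmentations` and the toy lexicon are hypothetical, not part of pymmseg, and the code assumes Python 3 strings.

```python
def all_segmentations(text, lexicon, max_len=4):
    """Return every split of text into lexicon words or single characters."""
    if not text:
        return [[]]
    results = []
    for size in range(1, min(max_len, len(text)) + 1):
        head = text[:size]
        # A single character is always allowed; longer pieces must be in the lexicon.
        if size == 1 or head in lexicon:
            for rest in all_segmentations(text[size:], lexicon, max_len):
                results.append([head] + rest)
    return results


# Hypothetical toy lexicon containing both competing words.
toy_lexicon = {"兴奋", "得很", "很晚", "睡不着"}
candidates = all_segmentations("兴奋得很晚都睡不着", toy_lexicon)

wrong = ["兴奋", "得很", "晚", "都", "睡不着"]
right = ["兴奋", "得", "很晚", "都", "睡不着"]
print(wrong in candidates, right in candidates)  # True True
```

Both readings are legal as far as the lexicon is concerned. Plain maximum matching hits '得很' first when scanning left to right and never reconsiders, so choosing the second reading needs extra rules for comparing candidates.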

I will create a better dictionary using a simplified Chinese corpus, and also create a version employing the 'complex algorithm', when I have time.

2 comments:

atpic said...

very nice package, thank you for sharing!

愛默愛默 said...

This is a really nice thing to share. I really appreciate it.

In the word.dic, I noticed some unusual words, maybe I'm confused a little bit. 他不, 她不, 上有, these were all in the word.dic file, so if the parser encountered these, it would treat them as single words, but I think they should be separate. For example, the sentence:
農場上有多少工人?

Becomes separated as:
農場 上有 多少 工人
But should be:
農場 上 有 多少 工人

Can you explain the numbers in the words.dic and chars.dic file that are next to each character?

I would like to construct a new traditional Chinese dictionary, maybe based on the CC-CEDICT project.

http://cc-cedict.org/wiki/start

thanks again,
Shaun Pedicini