Wensheng: pymmseg, python mmseg

6/2/08

pymmseg is a Python implementation of mmseg. It's my quick-n-dirty Chinese word segmentation program.

I needed a Chinese UTF-8 word segmentation program for some simple stuff. After some googling I found mmseg, but it's Big5-based and written in C. So I created a Python version based only on its 'simple algorithm' (just maximum matching, without the other 3 steps) and converted the lexicons to UTF-8.
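To illustrate what 'just maximum matching' means, here is a minimal sketch of greedy forward maximum matching. It is not the pymmseg source: the function name `simple_segment`, the `max_word_len` parameter and the toy lexicon are all made up for this example, and the lexicon is assumed to be a plain set of words.

```python
def simple_segment(text, lexicon, max_word_len=None):
    """Greedy forward maximum matching (illustrative sketch, not pymmseg itself).

    `lexicon` is assumed to be a set of known words.
    """
    if max_word_len is None:
        max_word_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until one is in the lexicon.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, only for demonstration.
toy_lexicon = {"研究", "研究生", "生命", "起源"}
print(simple_segment("研究生命起源", toy_lexicon))
# ['研究生', '命', '起源']
```

The greedy match grabs '研究生' first and leaves '命' stranded, which is the kind of ambiguity that plain maximum matching cannot resolve on its own.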

Download

It sorta works, for my purpose anyway, as you can see from the screenshot.

You can see that a lot of words don't exist in the dictionary/lexicons, like '赈灾', '板房', '两米'. That's because the lexicons were converted directly to simplified Chinese from a traditional one, so a lot of words are missing. I am sure that using a dictionary trained from a simplified source would greatly improve pymmseg's accuracy. But a better dictionary will not solve ambiguity, such as '兴奋得很晚都睡不着' (should be '兴奋得很晚都睡不着'). For that we will have to use the 'complex algorithm'.
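As a rough idea of how the 'complex algorithm' deals with ambiguity, the sketch below looks ahead at chunks of up to three words and prefers the chunk with the greatest total length, then the largest average word length, then the smallest variance of word lengths. This is my reading of the first three MMSEG rules; the fourth rule needs character frequency data and is left out. All names here (`words_at`, `chunks_at`, `chunk_score`, `lookahead_segment`) and the toy lexicon are hypothetical, not part of pymmseg.

```python
def words_at(text, start, lexicon, max_word_len):
    """All lexicon words starting at `start`, plus the single character as a fallback."""
    found = [text[start:start + 1]]
    for size in range(2, min(max_word_len, len(text) - start) + 1):
        candidate = text[start:start + size]
        if candidate in lexicon:
            found.append(candidate)
    return found


def chunks_at(text, start, lexicon, max_word_len, depth=3):
    """Enumerate chunks of up to `depth` consecutive words starting at `start`."""
    if depth == 0 or start >= len(text):
        return [[]]
    result = []
    for word in words_at(text, start, lexicon, max_word_len):
        for rest in chunks_at(text, start + len(word), lexicon, max_word_len, depth - 1):
            result.append([word] + rest)
    return result


def chunk_score(chunk):
    """Rank by total length, then average word length, then low variance."""
    lengths = [len(w) for w in chunk]
    total = sum(lengths)
    mean = total / len(lengths)
    variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return (total, mean, -variance)


def lookahead_segment(text, lexicon):
    """Pick the best chunk at each position and keep only its first word."""
    max_word_len = max((len(w) for w in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        best = max(chunks_at(text, i, lexicon, max_word_len), key=chunk_score)
        words.append(best[0])
        i += len(best[0])
    return words


# Hypothetical toy lexicon, only for demonstration.
toy_lexicon = {"研究", "研究生", "生命", "起源"}
print(lookahead_segment("研究生命起源", toy_lexicon))
# ['研究', '生命', '起源']
```

With the same toy lexicon, plain greedy matching would start with '研究生'; here the two longest chunks tie on total and average length, and the variance rule picks the evenly split one.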

I will create a better dictionary using a simplified Chinese corpus, and also create a version employing the 'complex algorithm', when I have time.

3 comments:

Anonymous said...: very nice package, thank you for sharing!; 12/08/2009 12:45:00 PM
愛默愛默 said...: This is a really nice thing to share. I really appreciate it.

In word.dic I noticed some unusual words, though maybe I'm a little confused. 他不, 她不 and 上有 were all in the word.dic file, so if the parser encountered these it would treat them as single words, but I think they should be separate. For example, the sentence:
農場上有多少工人？

Becomes separated as:
農場上有多少工人
But should be:
農場上有多少工人

Can you explain the numbers in the words.dic and chars.dic files that are next to each character?

I would like to construct a new traditional chinese dictionary, maybe based off CC-CEDICT project.

http://cc-cedict.org/wiki/start

thanks again,
Shaun Pedicini; 5/04/2010 01:00:00 AM
for ict 99 said...: Great Article; 12/11/2018 09:38:00 PM
