I needed a Chinese utf8 word segmentation program for some simple stuff. After googling found mmseg. But it's big5 based and written in C. I created a python version based just on 'simple algorithm' (just do maximum matching without other 3 steps) and converted lexicons to UTF8.
It sorta works, for my purpose anyway, as you can see from screen shot.
You can see a lot of words don't exists in dictionary/lexicons like '赈灾','板房','两米'. That's because the lexicons are directly converted to simplified Chinese from a traditional one, and it's missing a lot of words. I am sure using a dictionary trained from simplified source will greatly improve pymmseg's accuracy. But using a better dictionary will not solve ambiguity, such as '兴奋 得很 晚 都 睡不着' (should be '兴奋 得 很晚 都 睡不着'). For that we will have to use 'complex algorithm'.
I will create a better dictionary using simplified chinese corpus, and also create the a version employing 'complex algorithm' when I have time.