5/16/10

Creating ooxml word docx file with python and XSLT

The idea is to take an existing Word document, make it a template, and use it with external data to create a new Word document.

If you're dealing with a 'doc' file (Word 2003 and earlier), you can use Pywin32. There is good information in Chapter 10 of the book "Python Programming on Win32", in a section called "Automating Word".
But the 'doc' format is on the way out. Starting with version 2007, Word uses the new 'docx' format. The good news is that 'docx' uses OOXML (Office Open XML), an open industry standard. This means we can, in theory anyway, create a Word docx file without using Windows. In practice, there's currently no good library that facilitates creating a docx file from XML data from scratch, and it's tedious to do so without one. Hopefully someone will create one soon.

A shortcut exists, though: create a template docx, create an XSLT stylesheet from it, then use the XSLT to transform XML data into a new docx file.

Here's a post that describes this process using C#. My code is based on the ideas and code in that post.

Here's how to do it in Python (using the excellent lxml module):

1, Create a template docx file
Just create a regular docx file with Word (example.docx)

2, Create an XSLT file from the docx file.
A docx file is a zip file. You can get the "word/document.xml" file by unzipping the docx file, but the resulting file has no line breaks, so it's very hard to edit with a regular text editor like Vim. (You can use an XML editor, but there isn't a good free one.) I have a Python program that does this (getxslt.py). All it does is get 'word/document.xml', prettify it (add line breaks), and add an XSLT header and footer.
We need to use the docx file later, but without the 'word/document.xml' file (Python's zipfile module can't replace a file inside a zip). You can achieve this by opening the docx file with 7-Zip on Windows ("Open archive"), then deleting 'word/document.xml'.
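
A minimal sketch of what such a script could look like, using only the standard library. This is not the actual getxslt.py: the function name docx_to_xslt and the bare-bones stylesheet header and footer are assumptions made for illustration.

```python
import io
import zipfile
import xml.dom.minidom

# Assumed header/footer: a single template that matches the document root.
XSLT_HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">\n'
    '<xsl:template match="/">\n'
)
XSLT_FOOTER = '</xsl:template>\n</xsl:stylesheet>\n'


def docx_to_xslt(docx_file):
    """Return word/document.xml, pretty-printed and wrapped in an XSLT template."""
    with zipfile.ZipFile(docx_file) as zf:
        raw = zf.read("word/document.xml")
    pretty = xml.dom.minidom.parseString(raw).toprettyxml(indent=" ")
    # Drop the XML declaration that minidom adds; the header supplies its own.
    body = pretty.split("\n", 1)[1]
    return XSLT_HEADER + body + XSLT_FOOTER
```

Usage would be something like open("xslt1.xml", "w", encoding="utf-8").write(docx_to_xslt("example.docx")), after which the file can be edited by hand as in step 3.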

3, Modify xslt file.
Determine where the data needs to go, then add things like:
<xsl:value-of select="..."/>
and
<xsl:for-each select="...">
...
</xsl:for-each>
You need to study a little XSLT to do this, but it's not too hard.

4, Run the XSLT transform to convert the XML data to 'word/document.xml' and add it to a new docx file.
import shutil
import zipfile
from lxml import etree

# compile the stylesheet and apply it to the data
xsl = etree.XSLT(etree.parse("xslt1.xml"))
xml = etree.parse("cddata.xml")
newxml = xsl(xml)

# nodocxml.docx is the original docx file without word/document.xml
shutil.copyfile("nodocxml.docx", "mycds.docx")

# append the transformed document.xml to the copy
mycds = zipfile.ZipFile("mycds.docx", 'a', zipfile.ZIP_DEFLATED)
mycds.writestr('word/document.xml', str(newxml))
mycds.close()

Explanation:
'xslt1.xml' is the modified XSLT file from step 3;
'cddata.xml' is the XML data file (source: w3schools XSLT tutorial);
newxml is the XML tree that results from the XSLT transformation;
'nodocxml.docx' is the same as example.docx, the original docx file, except it doesn't have 'word/document.xml';
'mycds.docx' is our final product.
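
If you'd rather not strip 'word/document.xml' by hand, the archive can also be rebuilt in one pass: copy every member except the old document.xml into a new zip, then write the new one. A minimal sketch of that alternative, assuming a helper named replace_member (the name is made up for illustration, it is not part of zipfile):

```python
import io
import zipfile


def replace_member(src_docx, dst_docx, new_data, member="word/document.xml"):
    """Copy src_docx to dst_docx, swapping in new_data for one member."""
    with zipfile.ZipFile(src_docx) as src, \
         zipfile.ZipFile(dst_docx, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == member:
                continue  # skip the old copy; the new one is written below
            dst.writestr(item, src.read(item.filename))
        dst.writestr(member, new_data)
```

With this, the last part of the script above would become replace_member("example.docx", "mycds.docx", str(newxml)), and nodocxml.docx would not be needed.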

The example files can be downloaded from here: http://bitbucket.org/wensheng/pydocx/downloads/pydocxml.zip

The example does very little transformation. If you want to do more advanced stuff like adding images, you have to be more familiar with OOXML. (Add the image to the media directory, change the relationships XML file, add a relationship anchor to document.xml, etc.)

4/13/10

Getting back data from lvm on raid1 on a single disk

One of my servers died. The only thing I have left is a hard-drive from that server. But I have important data I need to recover from this disk.

The disk was a member of a software RAID1 array. It's not bootable and contains only LVM volumes.

Here's how I got my data back:

1, Connect the hard-drive to my PC. I used a USB docking station.

2, My OS (Fedora) assigned it to /dev/sdd. It can't be mounted directly since it's a RAID member.

3, Make the RAID device node and attach the partition
mknod /dev/md0 b 9 3
mdadm --create /dev/md0 --level=raid1 --name=0 --auto=md --raid-disks=1 -f /dev/sdd1
The "-f" switch is needed to force creating a RAID1 array with only 1 disk.

4, Now if I run pvscan, vgscan, and lvscan, they show my volume group (vg1) and logical volumes:
pvscan #it shows PV /dev/md0 VG vg1
But I still can't mount them, because /dev/vg1 doesn't actually exist yet (the volume group is not active). Do this:
vgchange -a y vg1
Now /dev/vg1 is activated.

5, Do the mount as usual
mount /dev/vg1/wensheng /mnt

After I got back all my data, I disassembled the hard drive to get the magnets and threw the rest in the trash. Why? Because of this:

From my experience, this drive will die soon anyway.

11/26/09

IGot24 - a javascript game for calculating 24, hosted on AppEngine

I wrote a small game in the last few days. The game is called IGot24.
The idea is to use the 4 basic arithmetic operations to get the result 24 from 4 randomly drawn cards. I talked about the algorithm and a Python implementation in an old post. In fact, it's currently the first result on Google for "calculate 24".
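
For readers who haven't seen that old post, here is a minimal brute-force sketch of how a solver for this game can work. It is not the code from the old post and not what the AppEngine backend runs; the names solve24 and _search are made up for illustration. It repeatedly picks two numbers, combines them with one operation, and recurses on the shorter list, using exact fractions so that division doesn't introduce rounding errors.

```python
from fractions import Fraction
from itertools import combinations


def solve24(cards, target=24):
    """Return one expression that evaluates to target, or None."""
    items = [(Fraction(c), str(c)) for c in cards]
    return _search(items, Fraction(target))


def _search(items, target):
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # pick two numbers, combine them, and recurse on the smaller list
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, "(%s+%s)" % (ea, eb)),
            (a * b, "(%s*%s)" % (ea, eb)),
            (a - b, "(%s-%s)" % (ea, eb)),
            (b - a, "(%s-%s)" % (eb, ea)),
        ]
        if b != 0:
            candidates.append((a / b, "(%s/%s)" % (ea, eb)))
        if a != 0:
            candidates.append((b / a, "(%s/%s)" % (eb, ea)))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found:
                return found
    return None
```

Calling solve24([3, 3, 8, 8]) returns a fully parenthesized expression string, and solve24([1, 1, 1, 1]) returns None because no combination reaches 24. The target argument is there only because the game also has a "calculate 42" option.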

The game is at:
http://calculate24.appspot.com/

I first wrote the game in pure JavaScript. Later I thought whoever plays it may want to see solutions, so I created a backend on Google AppEngine that supplies cards and solutions. Then I added a few more options, like dealing only solvable cards, disabling the timer, calculating 42 instead of 24, etc.

The heart of the JavaScript code is a finite state machine. At first I just coded away thinking it would be simple, then gradually realized there were more states than I expected. I then drew a state diagram and found out there were 21 states! They are all branches with no loopback, so I don't think they can be reduced. (Maybe they can, but I don't want to re-read my circuit design book from ages ago to find out.)

I think the game could easily be redone in Flash using Flex, since JavaScript can be ported to AS3. It could be ported to the iPhone too, but that would be a waste of time and money (buying a Mac to do iPhone development), because I found a "calculating 24" iPhone app that had been downloaded a whopping 12 times.

11/21/09

Chrome OS on Xen HVM

Google announced the open-source Chromium OS 2 days ago. Today I tried to build it on an Ubuntu virtual machine, but I failed; I had numerous problems that I won't elaborate on. I will try again sometime later.

I searched the web and found a VMware virtual disk of Chrome OS here, so I downloaded it.

Not wanting to install VMware, I converted it to a raw disk image and loaded it into Xen as an HVM guest, and it worked. But it was basically unusable for me, mainly because mouse movement was too slow. I will try a different VNC client later to see if it improves.

Here's how to convert a vmdk image to raw:
$qemu-img convert -f vmdk chrome-os-0.4.22.8-gdgt.vmdk -O raw chrome-os.img

Then just specify the image file in the 'disk' line of the domain config file, like:
disk = [ 'file:/root/chrome-os.img,hda,w' ]

Here's a screenshot of Chrome OS:


Here's the Chrome OS login screen. I have a Fedora 12 HVM domain and an Ubuntu 9.04 32-bit domainU also running on this machine. The domain0 itself is Fedora 12 64-bit with pvops kernel 2.6.31.6.

I had to use the standard VGA driver (stdvga=1); the default Cirrus Logic driver gave me a screen like this after I logged in:

10/27/09

POC: online barcode qrcode scan with webcam

This is a proof of concept (POC). The idea is to scan rebates, coupons, tickets, etc. in the form of barcodes and QR codes using your webcam.

Such an idea is really nothing new. With a cellphone, you can now scan the barcode when you pick up an item in a store and find out whether it is cheaper at nearby stores or online. Barcodes on cellphones can also be used for mobile ticketing.

But I have not heard much about scanning rebates/coupons/tickets online with a webcam. So I created this POC project.
Screenshot:


It somewhat works. QR codes are read consistently, but most of the time the program cannot read UPC and Code 128 barcodes. I am sure I could improve the results with some added backend photo processing, but I doubt it would help a lot. Two of my 3 webcams have a focal length that makes it impossible to snap a clear image of a code, unless of course the image is really large, like 400x400 or 500x500px. So my guesstimate is that most webcams are not suited to code scanning.

Here's the actual page in an iframe:


The frontend is built with the Flex SDK. The webcam code comes straight from here.
The backend uses ZXing.
The QR code test sample was generated from here with the Google Chart API.

9/2/09

JW media player with lyrics scroller

Yesterday, I made a lyrics scroller. The thing works, but it's missing quite a few things, like a progress bar, seeking, a time display, volume control, and video.

So today I made a new one, this time using JW Media Player. JW Media Player is a full-featured web media player. It has almost everything. Now it has a lyrics scroller ;)

Below is a demo; it's an iframe of this page.


Usage:
Include 4 JavaScript files in the HTML header:
jquery.js, jquery.scrollTo-1.4.2-min.js, swfobject.js, jwplrc.js

Then put this JavaScript code in the HTML:
var flashvars = {
file:"somesong.mp3",
lrc:"somelrc.lrc"
};
create_jwplrc("player_divid","320","80",flashvars,{},"some_uniq_id");


The "file" flashvar can be used to specify a media file such as an mp3, or a playlist XML file.
If it's a single media file, you need to specify the "lrc" flashvar to tell the player where the lrc file is. (Update on 1/18/10) It can be in your web directory or anywhere on the web.
If it's a playlist, no "lrc" flashvar is needed. That implies the lrc and media files are at the same location/directory and have the same filename.
If it's a playlist, you need to specify the lrc file in "<info> </info>" inside the playlist XML file. The media files can be anywhere on the web, but the lrc files have to be on your own site.

"create_jwplrc" is just a wrapper function around "swfobject.embedSWF". The first argument to create_jwplrc is the id of the "<div>" that will become the player, the 2nd and 3rd are the width and height, the 4th is the flashvars, and the 5th is the parameters, which can be empty ({}). The 6th argument is a unique name for the player.

File download wensheng.com/code/jwlyrics.tgz (updated 1/18/10)

A javascript mp3 player with scrolling lyrics (lrc) display

When listening to music on my PC, I use a Chinese mp3 player called TTPlayer. The main feature of TTPlayer is that it displays lyrics synced with the music.
Searching for a web equivalent, the closest thing I found is this. But it has issues, for me anyway: it plays only one song, has no play/pause control, can't scroll back, can't replay, and uses an unfamiliar animation framework. So I looked at the code and came up with my own. The player is based on the soundmanager2 demo code. It uses jQuery and the jQuery scrollTo plugin.
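
The player itself is JavaScript, but the syncing idea is easy to show in a few lines of Python. An lrc file is plain text where each lyric line starts with one or more [mm:ss.xx] time tags. The sketch below is an illustration only, not the player's code, and the names parse_lrc and current_line are made up: it turns the file into a sorted list of (seconds, lyric) pairs and looks up the line that belongs to a given playback position.

```python
import bisect
import re

# matches time tags such as [01:23.45] or [00:05]
TIME_TAG = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\]")


def parse_lrc(text):
    """Return a sorted list of (seconds, lyric) pairs from LRC text."""
    entries = []
    for line in text.splitlines():
        tags = TIME_TAG.findall(line)
        if not tags:
            continue  # metadata such as [ti:...] or blank lines
        lyric = TIME_TAG.sub("", line).strip()
        # a line may carry several time tags when the lyric repeats
        for minutes, seconds in tags:
            entries.append((int(minutes) * 60 + float(seconds), lyric))
    entries.sort()
    return entries


def current_line(entries, position):
    """Return the lyric that should be highlighted at `position` seconds."""
    times = [t for t, _ in entries]
    index = bisect.bisect_right(times, position) - 1
    return entries[index][1] if index >= 0 else ""
```

A scroller then only has to ask for current_line at the player's current position every fraction of a second and scroll to that line when it changes.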




The demo is at: http://wensheng.com/code/sm_lyrics/

The code can be downloaded here.

7/31/09

Happy Girls

I will be watching tonight's Super Girls (or Happy Girls, as it's called now). I don't know why I watch it. I never watched American Idol or any other pop contest show. And I absolutely hate the most popular of the previous Super Girls winners, Li Yuchun.

But since I watched this year's Happy Girls one night with my wife, I got hooked. Maybe it's just the pretty girls I like to watch. I told a co-worker the other day that I like Happy Girls, and she said I had become a dirty middle-aged man (龌龊中年男人). Is that the reason? Ouch..

I am rooting for Liu Xijun 刘惜君. (big pic)


I also like Tan Lina 谈莉娜; she's very pretty. (pic)
I like Li Xiaoyun 李宵云 too (video), but she's not good-looking. Huang Ying is good too, but her vocal style gets old after a while. I really don't care for the other girls.

Here's a video of Liu from YouTube singing "伤痕". In my opinion, it's better than Lin Yilian's original.


7/30/09

set up geo dns (geodns) on Fedora using geoipdns pt 2

Here I describe the steps to set up Geoipdns (a Tinydns fork that does geo DNS). First follow the steps here to set up Tinydns. Even though Geoipdns is a fork, it doesn't provide configuration programs such as the tinydns-conf that comes with Tinydns. So it's much easier to set up Geoipdns after we already have Tinydns set up.

You can read the Geoipdns documentation at http://pub.mud.ro/wiki/Geoipdns.
Here are the quick steps:
$yum install inotify-tools-devel
#geoipdns uses libinotifytools.

$mkdir vdns
$cd vdns/
$wget http://pub.mud.ro/~cia/files/vdns-src.tgz
$tar xfz vdns-src.tgz
$vi conf-cc
#Add " -include /usr/include/errno.h" to the first line of conf-cc

$LOCAL_CFLAGS="-DUSE_LOCMAPS -DUSE_SFHASH -DUSE_TOUCH_RELOADS -DDEBUG_MODE -DHAVE_MMAP_READAHEAD"
$make
$mkdir /usr/local/apps
$./install
$cp -rp /usr/local/apps/vdns/bin/* /usr/local/bin/
The last step lets us run vdnsd and vdnsdb without adding their directory to PATH.

Now geoipdns is installed. Next we need to configure it so it does geo dns.
$cd /etc/tinydns
$cp run run.tinydns #backup run
$vi run
#inside run, we change /usr/local/bin/tinydns to /usr/local/bin/vdnsd

$cd root
$vi Makefile
#inside Makefile we change tinydns-data to vdnsdb

$make
$svc -t /service/tinydns

Now vdnsd should be running. Use "ps -ef" to check that vdnsd is running and tinydns is not; you should still see "supervise tinydns", though.
Do some dig queries to make sure everything still works just as it did when Tinydns was running.

Now you need to query the server from at least 2 different locations, otherwise you won't know whether geodns works.
If you set up DNS on a local network, such as 192.168.1.0, you can put the IPs of local machines in "data" and test geodns from those machines.
If you have a world-facing DNS server and you want to test geodns, you can put the IPs of your home, work, colocated server, or VPS in "data", then test DNS from those locations.

However, if you want to test it quickly from where you are, you are in luck, because there are several online dig sites you can use. I will use 2 such sites as examples:
http://dig.menandmice.com/knowledgehub/tools/dig ip is 207.57.2.84
http://www.subnetonline.com/pages/network-tools/online-dig.php ip is 85.17.250.238
Note that these IPs are as of this writing. If they have changed by the time of your testing, you need to change them in your data file too.

Now the data. Suppose your DNS server name is ns1.myserver.com (replace it with your real DNS server name). Add these lines to your /etc/tinydns/root/data file:
%onlinedig1:207.57.2.84:32
%onlinedig2:85.17.250.238:32
.example.com::ns1.myserver.com:259200
+www.example.com:1.1.1.1:3600::onlinedig1
+www.example.com:2.2.2.2:3600::onlinedig2
+www.example.com:3.3.3.3:3600::nomatch
Do a "make" to update the data hash.
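
To make it clear what these data lines are meant to do, here is a small Python model of the lookup. It is only a sketch of the behaviour described in this post, not geoipdns's actual implementation; the dictionaries and the function names locate and answer are made up for illustration. The "%" lines map a network to a location name, the "+" lines give one address per location, and clients that match no network get the "nomatch" record.

```python
import ipaddress

# location name -> network, taken from the "%name:ip:prefixlen" lines above
LOCATIONS = {
    "onlinedig1": ipaddress.ip_network("207.57.2.84/32"),
    "onlinedig2": ipaddress.ip_network("85.17.250.238/32"),
}

# location name -> address, taken from the "+www.example.com:..." lines above
RECORDS = {
    "onlinedig1": "1.1.1.1",
    "onlinedig2": "2.2.2.2",
    "nomatch": "3.3.3.3",
}


def locate(client_ip):
    """Return the name of the most specific location containing client_ip."""
    addr = ipaddress.ip_address(client_ip)
    matches = [(net.prefixlen, name)
               for name, net in LOCATIONS.items() if addr in net]
    return max(matches)[1] if matches else "nomatch"


def answer(client_ip):
    """Return the address www.example.com should resolve to for this client."""
    return RECORDS[locate(client_ip)]
```

So a query coming from 207.57.2.84 gets 1.1.1.1, one from 85.17.250.238 gets 2.2.2.2, and anything else gets 3.3.3.3.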

Now you can start testing. First do a "dig @ns1.myserver.com www.example.com" locally. It should say 3.3.3.3. Then go to http://dig.menandmice.com/knowledgehub/tools/dig, enter the name server "ns1.myserver.com", enter the domain name "www.example.com", and click "perform query"; it should say 1.1.1.1. Now do the same on http://www.subnetonline.com/pages/network-tools/online-dig.php; it should say 2.2.2.2.
If they display the correct fake addresses, your geo DNS works.

Next you need to add real IP location data and set up geo DNS for your real domains.
I will talk about this in another post.

7/29/09

Why geo dns - setting up geodns on fedora part 1

First of all, what is geodns? Let's look at an example. The following shows the output of dig and ping for www.google.com from 3 different geographical locations.
First from Dallas:

270 06:13 PM wang@ns1)dig www.google.com +short
www.l.google.com.
74.125.47.99
74.125.47.103
(truncated)
271 06:13 PM wang@ns1)ping -c 2 www.google.com
PING www.l.google.com (74.125.47.99) 56(84) bytes of data.
64 bytes from yw-in-f99.google.com (74.125.47.99): icmp_seq=1 ttl=54 time=21.2 ms
64 bytes from yw-in-f99.google.com (74.125.47.99): icmp_seq=2 ttl=54 time=21.0 ms
Next from Xi'an, China:
[wang@www ~]$ dig www.google.com +short
www.l.google.com.
64.233.189.104
64.233.189.147
64.233.189.99
[wang@www ~]$ ping -c 2 www.google.com
PING www.l.google.com (64.233.189.147) 56(84) bytes of data.
64 bytes from hk-in-f147.google.com (64.233.189.147): icmp_seq=1 ttl=242 time=45.7 ms
64 bytes from hk-in-f147.google.com (64.233.189.147): icmp_seq=2 ttl=242 time=45.8 ms
Finally from Zhejiang China:
137 03:06 PM wang@cn)dig www.google.com +short
www.l.google.com.
66.249.89.99
66.249.89.104
66.249.89.147
138 03:06 PM wang@cn)ping -c 2 www.google.com
PING www.l.google.com (66.249.89.99) 56(84) bytes of data.
64 bytes from jp-in-f99.google.com (66.249.89.99): icmp_seq=1 ttl=243 time=47.4 ms
64 bytes from jp-in-f99.google.com (66.249.89.99): icmp_seq=2 ttl=243 time=47.4 ms
You can see that the IPs for www.google.com differ between these locations. The reason is that Google wants to send www.google.com visitors to their nearest web servers. Why, you might ask? Because of network latency. Here's the output from pinging a US Google server from China:
[wang@www ~]$ ping 74.125.47.99 -c 2
PING 74.125.47.99 (74.125.47.99) 56(84) bytes of data.
64 bytes from 74.125.47.99: icmp_seq=1 ttl=44 time=258 ms
64 bytes from 74.125.47.99: icmp_seq=2 ttl=44 time=261 ms
So the ping time is more than 10 times as long as the ping time from within the US. If Google didn't have servers in China (or HK, JP, or wherever is closer to China), the experience of Chinese www.google.com visitors would be really bad (long response times, slow page loads).
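
The arithmetic behind that comparison, using the round-trip times from the ping outputs above (the variable names are just for illustration):

```python
from statistics import mean

dallas_ms = [21.2, 21.0]          # Dallas to 74.125.47.99
china_to_us_ms = [258.0, 261.0]   # Xi'an to 74.125.47.99

# ratio of the average round-trip times
ratio = mean(china_to_us_ms) / mean(dallas_ms)
print("pinging the US server from China takes about %.1f times as long" % ratio)
```

The averages are 259.5 ms and 21.1 ms, so the ratio comes out at roughly 12.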

The 3 popular DNS servers (BIND, PowerDNS, Tinydns) all have geo capability, either through a patch, a backend, or, in Tinydns's case, a fork called geoipdns. I have been using Tinydns for several years and am very satisfied with its ease of use and performance, so I am sticking with Tinydns for my geodns.
The geo fork of Tinydns, geoipdns, was written by Adrian Ilarion Ciobanu.
I will talk about how to set up Geoipdns in the next post.