5/16/10

Creating ooxml word docx file with python and XSLT

The idea is taking an existing Word document, making it a template, and using it with external data to create a new Word document.

If you're dealing with 'doc' file (word2003 and before), you can use Pywin32. There are good information in the Chapter 10 of "Python Programming on Win32" book, in a section called "Automating Word".
But 'doc' file is on the way out. Staring from version 2007, Word is using the new 'docx' format. The good news is that 'docx' use OOXML (Office Open XML), an open industry standard. This means we can, in theory anyway, create Word DOCX file without using Windows. In practice, there's currently no good library that facilitate creating a docx file from xml data from scratch. It's tedious to do so without a good library. Hopefully someone will create one soon.

A short-cut exists though, i.e. create a template docx, create XSLT from it, then use XSLT to transform xml data into a new docx file.

Here's a post that describes this process using CSharp. My code is based on ideas and codes in this post.

Here's how to do it in Python (using the excellent lxml module):

1, Create a template docx file
Just create a regular docx file with Word (example.docx)

2, Create xslt file from docx file.
A docx file is a zip file. You can get "word/document.xml" file by unzip the docx file. But the resulting file has no line breaks. It's very hard to edit it using regular text editor like VIM. (you can use xml editor, but there ain't good one that's free). I have a python program that does this (getxslt.py). All it does is getting 'word/document.xml', prettifying it(adding linebreaks), and adding XSLT header and footer.
We need to use the docx file later, but without 'word/document.xml' file (python's zipfile module can't do file replacement in zip). You can achieve this by "open archive" the docx file with '7-zip' on windows, then delete 'word/document.xml'.

3, Modify xslt file.
Determine where you need to change the data, add things like:
<xsl:value-of select="..."/>
and
<xsl:for-each select="...">
...
</xsl:for-each>
You need to study a little bit xslt to do this, it's not too hard.

4, Do XSLT transform to convert xml data to 'word/document.xml' and add it to new docx file.
import shutil
import zipfile
from lxml import etree

xsl = etree.XSLT(etree.parse("xslt1.xml"))
xml = etree.parse("cddata.xml")
newxml=xsl(xml)

#nodocxml.docx is original docx file without word\document.xml
shutil.copyfile("nodocxml.docx","mycds.docx")

mycds = zipfile.ZipFile("mycds.docx",'a',zipfile.ZIP_DEFLATED)
mycds.writestr('word/document.xml',str(newxml))
mycds.close()

explanation:
'xslt1.xml' is the modified xslt file from the step 3;
'cddata.xml' is xml data file (source: w3schools xslt tutorial );
newxml is the result xml tree of xslt transformation;
'nodocxml.docx' is the same as example.docx, the original docx file, except it doesn't have 'word/document.xml';
'mycds.docx' is our final product.

The example files can be downloaded from here: http://bitbucket.org/wensheng/pydocx/downloads/pydocxml.zip

The example does very few transformation. If you want to more advanced stuff like adding images, you have to be more familiar with OOXML. (Add image to media directory, change relationships xml file, add relationship anchor to document.xml etc.)

3 comments:

n0rus said...

Hi,

Thanks for your post. Would it be possible for you to share the getxslt.py script?

James Heywood said...

Thank you very much indeed for this, it has most certainly given me a kick start into this field.

Anonymous said...

Based on your idea, I've wrote https://github.com/backbohne/docx-xslt