Jim DeLaHunt, world-ready

The .ODT word processor documents produced by LibreOffice and Apache OpenOffice are in fact ZIP archives consisting of XML files, other files, and directories. It is actually straightforward to crack open the archive and get at the files and directories within. To edit the XML files, or just to explore them, XSL Transformations (XSLT) are a useful tool. This is a look at how to use XSLT and the xsltproc tool on XML files extracted from .ODT documents.

The basic idea is to capture the exploration or modification you want to perform as an XSLT, and run an XSLT processor to apply it to the document’s XML file. The XSLT language and processor simplify the task by letting you largely disregard the details of XML file syntax and focus on expressing the elements you want to use or the structure you want to identify. The tools handle those details. Just get the XML file from the .ODT document, use XSLT on it, put the modified XML file back into the .ODT document, then use the modified document in LibreOffice.

The first step is to crack open the archive which is the .ODT document. My earlier blog post, How to crack open LibreOffice .ODT documents for fun and bug fixing, explains two ways to do this. First, you can have LibreOffice save the document as a “Flat XML ODF Text Document”. Second, you can use a ZIP archiving and extraction tool to get at the individual XML files within the .ODT archive. That post also explains how to insert a modified XML file back into a document archive. It links to the specifications for the .ODT format and for the internals of the files within it. From here on, we will assume that you have XML files to work on.
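If you would rather script the second approach than use a ZIP tool by hand, the following is a minimal sketch using only Python's standard zipfile module. The function names read_member and replace_member are placeholders of my own, and I am assuming the document body lives in the archive member content.xml. The sketch copies entries in their original order and with their original compression type because, as I understand the ODF packaging rules, the mimetype entry should come first and be stored uncompressed.

```python
import zipfile


def read_member(odt_path, member="content.xml"):
    """Return the bytes of one file stored inside the .ODT archive."""
    with zipfile.ZipFile(odt_path) as archive:
        return archive.read(member)


def replace_member(src_path, dst_path, new_bytes, member="content.xml"):
    """Copy src_path to dst_path, swapping in new_bytes for one member.

    Entries keep their original order and compression type, so a leading,
    uncompressed 'mimetype' entry stays first and uncompressed.
    """
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for info in src.infolist():
            if info.filename == member:
                data = new_bytes
            else:
                data = src.read(info.filename)
            # Passing the ZipInfo reuses its name, date, and compression type.
            dst.writestr(info, data)


# Hypothetical usage (file names are placeholders):
#   xml_bytes = read_member("report.odt")
#   ... write xml_bytes to in.xml, run the XSLT, read out.xml ...
#   replace_member("report.odt", "report-fixed.odt", modified_bytes)
```

Writing to a new file rather than modifying the original in place means the untouched document is still there if the result does not open in LibreOffice.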

Next, brush up on XML files and on the XSLT tools. The Wikipedia articles on Extensible Markup Language (XML) and Extensible Stylesheet Language Transformations (XSLT) are reasonable overviews. From there, web searches for XML introduction and XSLT introduction will lead you to a number of tutorials. For instance, the Mozilla Developer Documentation includes XML introduction and XSLT: Extensible Stylesheet Language Transformations. Personally, I find myself re-reading the XSLT Introduction from W3Schools anytime I pick up an XSLT project and discover I’ve forgotten the basics, again. I won’t attempt to cover that material in this post.

You will need some tools to author XSLTs. XSLTs are themselves XML documents, so you can certainly use an XML editor. I use the Eclipse XML editors for that purpose. The XSLTs I have used on .ODT documents have been simple enough that, frankly, a plain text editor works fine. The fine XML editor, Oxygen, also has an XSL editor and a helper for processing XSLTs.

You will need a processor to apply XSLTs to XML files. I use xsltproc, which MacPorts supplies as part of its libxslt port. Once it is installed, you invoke xsltproc to apply an XSLT transform.xslt to an input XML file in.xml, writing output to XML file out.xml, as follows:

% xsltproc --output out.xml transform.xslt in.xml

Alternatively, you can omit the --output option, and xsltproc will send the transformed output to stdout. Here, some_processing is a placeholder for whatever you might want to do with stdout.

% xsltproc transform.xslt in.xml | some_processing > out.xml

As you write the XSLT for .ODT-derived XML files, you may run into a few details which surprised me. I will describe them, so that hopefully they cause no problem for you.

As an example, let us suppose we want to find every table in an .ODT document. Looking at expanded .ODT document archives, and the ODT specification, Open Document Format for Office Applications (OpenDocument) Version 1.3, we see that a table is stored as an element <table:table>, within an <office:text> element, which in turn is within an <office:body> element, within an <office:document-content> element.

The colon-separated prefixes in the element names, e.g. office: and table:, are XML namespace prefixes. The easiest way to find the complete list of these prefixes is to look at the beginning of an .ODT-derived XML file. There is something like the below. The <office:document-content> element contains namespace prefix expressions as attributes. I have elided most of the 35 prefix expressions for brevity.

<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:css3t="http://www.w3.org/TR/css3-text/" xmlns:grddl="http://www.w3.org/2003/g/data-view#"
…
xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
…
xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" office:version="1.3">
…
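Because those prefix expressions sit on one very long line, reading them by eye is tedious. Here is a small Python sketch, standard library only, which lists every prefix and namespace URI declared in an XML file. The function name namespace_prefixes is my own invention for illustration. It relies on the "start-ns" event of xml.etree.ElementTree.iterparse, which reports each declaration as a (prefix, URI) pair.

```python
import xml.etree.ElementTree as ET


def namespace_prefixes(xml_source):
    """Return {prefix: URI} for every xmlns declaration in the XML source.

    xml_source may be a file name or a binary file object. If a prefix is
    declared more than once, the last declaration wins in this sketch.
    """
    prefixes = {}
    for _event, (prefix, uri) in ET.iterparse(xml_source, events=("start-ns",)):
        prefixes[prefix] = uri
    return prefixes


# Hypothetical usage (content.xml is a placeholder file name):
#   for prefix, uri in sorted(namespace_prefixes("content.xml").items()):
#       print(f'xmlns:{prefix}="{uri}"')
```

The printed lines are in the same form the XSLT's root element needs, so the ones you want can be pasted into the stylesheet.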

The root element of an XSLT is an <xsl:stylesheet> element. It must contain namespace prefix expressions for the xsl: prefix, and for any other prefixes used in the XSLT. In our example, that will be expressions for the office: and table: prefixes.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0">

If you don’t include a namespace prefix expression for a prefix you use in the XSLT, then xsltproc gives an error message like:

compilation error: file transform.xslt line 29 element attribute
xsl:attribute: The prefixed QName 'office:value-type' has no namespace binding in scope in the stylesheet; this is an error, since the namespace was not specified by the instruction itself.
error
xsltCompileStepPattern : no namespace bound to prefix office
compilation error: file transform.xslt line 27 element template
xsltCompilePattern : failed to compile '…[elided]…'

The XML declaration for the XSLT should probably contain the attribute encoding="UTF-8", because this lets you use the full gamut of Unicode characters in your XSLT. By the same token, the XSLT should contain an <xsl:output> element. The following shows these two elements:

<?xml version="1.0" encoding="UTF-8"?>
…[<xsl:stylesheet elided />]…
<xsl:output method="xml" indent="no" encoding="UTF-8" />

The details of the XSLT will depend on what you want to repair or explore. But for an XSLT which copies the input document, modifies it, and delivers the modified document as output, the XSLT should probably contain an <xsl:template>…</xsl:template> element which copies everything. This should probably be followed by an <xsl:template match="…">…</xsl:template> element, with the value of the match attribute being an XPath expression to the document part you want.

Putting it all together, this is a simple XSLT which copies the input document to the output, matches tables within the document, and provides a place to modify each table.

<?xml version="1.0" encoding="UTF-8"?>
<!-- table-finder.xslt -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" 
xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0">

    <xsl:output method="xml" indent="no" encoding="UTF-8" />

    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="/office:document-content/office:body/office:text/table:table" >
        <xsl:copy>
            <!-- modification code goes here -->
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
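Before filling in the modification code, it can be worth checking which tables that match pattern will actually select. The sketch below does the check outside XSLT, with Python's xml.etree.ElementTree instead of xsltproc, so it is a cross-check and not part of the transform. The names NS, find_tables, and anywhere are my own placeholders. ElementTree paths are relative to the root element, so the leading /office:document-content step is dropped. The pattern in the stylesheet selects only tables that are direct children of <office:text>. A table nested deeper, for instance inside a cell of another table, would need a pattern such as //table:table, which corresponds to the anywhere=True case here.

```python
import xml.etree.ElementTree as ET

# The same two namespace bindings as in the stylesheet's root element.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
}


def find_tables(xml_source, anywhere=False):
    """Return the <table:table> elements found in an .ODT-derived XML file.

    By default, mimic the stylesheet's pattern and return only tables that
    are direct children of office:body/office:text. With anywhere=True,
    return tables at any depth.
    """
    root = ET.parse(xml_source).getroot()
    path = ".//table:table" if anywhere else "office:body/office:text/table:table"
    return root.findall(path, NS)


# Hypothetical usage (content.xml is a placeholder file name):
#   print(len(find_tables("content.xml")), "top-level tables")
#   print(len(find_tables("content.xml", anywhere=True)), "tables in total")
```

If the two counts differ, some tables would pass through the stylesheet untouched by the second template.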

It is tempting to compare the input and output XML files, to be sure that only the changes you intended were applied. There are two things that will make this harder. First, the XML files from .ODT documents leave out spaces and line breaks, so both input and output will appear to be just a few, very long lines. Most diff utilities don’t display differences well if lines are very long. Second, LibreOffice uses some named entities which xsltproc does not. For example, LibreOffice uses &apos; in place of an apostrophe, while xsltproc uses the literal apostrophe character. Diff utilities will flag such differences, but they represent the same content, so they may be ignored.
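One way around both obstacles is to parse each file and re-serialize it in the same way before comparing. The following is a minimal sketch using Python's standard xml.dom.minidom and difflib modules; the function names pretty_lines and xml_diff are placeholders of my own. Parsing turns &apos; and a literal apostrophe into the same character, and pretty-printing puts each element on its own line. The pretty-printed text is for reading only. I would not put it back into an .ODT document, because the added whitespace could change the text content.

```python
import difflib
from xml.dom import minidom


def pretty_lines(xml_bytes):
    """Re-serialize XML with one node per line, purely for comparison."""
    return minidom.parseString(xml_bytes).toprettyxml(indent="  ").splitlines()


def xml_diff(before, after):
    """Return unified-diff lines between two XML documents given as bytes."""
    return list(difflib.unified_diff(
        pretty_lines(before), pretty_lines(after),
        fromfile="in.xml", tofile="out.xml", lineterm=""))


# Hypothetical usage (in.xml and out.xml as in the xsltproc commands):
#   with open("in.xml", "rb") as f_in, open("out.xml", "rb") as f_out:
#       print("\n".join(xml_diff(f_in.read(), f_out.read())))
```

An empty result means the two files hold the same content after normalization, even if a plain diff of the raw files reports differences.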

XSLT is a powerful tool for operating on XML files, such as those found in .ODT documents. But namespace prefixes, and the details of the XML declaration and of the xsl:output element, make it hard to get to the stage where one can experiment with XPath and XSLT expressions and get the job done. Hopefully, this explanation, and the example XSLT file above, will give you a way to get past those obstacles quickly.


How to use XSLT to modify XML files inside .ODT documents
