Saturday, January 26, 2008

XSLT 2.0

I had a complex XML transformation to perform. I ended up using XSLT 2.0, and I was really impressed.

I needed to measure several sub-tasks in a publishing process that updates hundreds of documents in batches every few hours, using data from live sources.

I needed to compare the original implementation with simulations of three incremental optimisations, so we could see whether they were worth making.

Setting up and running the tests was out of my hands, happening in another country. What I got back were four very large XML files, in Java logger.xml format.

<record>
<date>2008-01-24T13:10:44</date>
<millis>1201176644561</millis>
<sequence>6</sequence>
<logger>app.Uploader</logger>
<level>INFO</level>
<class>app.Uploader$DataLogger</class>
<method>end</method>
<thread>10</thread>
<message>timing.folder 15 1468</message>
</record>
<record>
<date>2008-01-24T13:10:44</date>
<millis>1201176644561</millis>
<sequence>7</sequence>
<logger>app.Uploader</logger>
<level>INFO</level>
<class>app.Uploader$DataLogger</class>
<method>end</method>
<thread>10</thread>
<message>memory.folder 15 844128</message>
</record>


etc.

Thousands of them, all in a flat list. I needed CSV files to import into a spreadsheet so I could analyse the data.

The first processing pass filtered out the records I was not interested in and extracted the values from each message, adding them as individual tags to the record. The messages have this format: <message>[type].[key] [fid] [data]</message>
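The first-pass stylesheet itself is not shown here. As a minimal sketch of how such a pass could look in XSLT 2.0, the following assumes the root element is log (as the second-pass stylesheet further down expects) and that the records worth keeping are exactly those whose message has the [type].[key] [fid] [data] shape. The filter pattern is an assumption for illustration, not the original filter; the added tag names (type, key, fid, data) are the ones the second pass reads.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="log">
    <log>
      <!-- assumed filter: keep only messages shaped like "type.key fid data" -->
      <xsl:apply-templates
          select="record[matches(message, '^\S+\.\S+\s+\d+\s+\d+\s*$')]"/>
    </log>
  </xsl:template>

  <xsl:template match="record">
    <!-- split "timing.folder 15 1468" into its three space-separated parts -->
    <xsl:variable name="parts" select="tokenize(normalize-space(message), ' ')"/>
    <record>
      <!-- keep the original child elements -->
      <xsl:copy-of select="*"/>
      <!-- the type is everything before the first dot, the key everything after it,
           so a key such as put.default.preview keeps its own dots -->
      <type><xsl:value-of select="substring-before($parts[1], '.')"/></type>
      <key><xsl:value-of select="substring-after($parts[1], '.')"/></key>
      <fid><xsl:value-of select="$parts[2]"/></fid>
      <data><xsl:value-of select="$parts[3]"/></data>
    </record>
  </xsl:template>

</xsl:stylesheet>
```

For the first sample record above, this would add type "timing", key "folder", fid "15" and data "1468".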

Each log file needed to be split into separate CSV files, one for each type of measurement (timing, memory, etc.). Each row of a CSV file needed to hold all the measurements for one document update, one column per key, with measurements that share the same key added together.

It is well known that getting this kind of grouping from flat data is difficult in XSLT 1.0. I tried Muenchian grouping for a while; the code kept getting longer and more complex, and I still had not solved all of the problems. It was looking like I would need a whole chain of XSLTs to do the job, so I switched from xsltproc to Saxon, because it has a chaining command. Then I remembered that it also implements all of XSLT 2.0, which has new commands that make grouping far easier.
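To give a feel for why this gets unwieldy, here is a minimal sketch of two levels of Muenchian grouping in XSLT 1.0. It is not the abandoned code, only an illustration; it assumes the records already carry type and fid child elements, and the key names are made up. Every extra level of nesting needs another xsl:key with a composite value and another generate-id() comparison.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- one index per grouping level (hypothetical names) -->
  <xsl:key name="records-by-type" match="record" use="type"/>
  <xsl:key name="records-by-type-and-fid" match="record"
           use="concat(type, '|', fid)"/>

  <xsl:template match="log">
    <!-- visit only the first record of each distinct type -->
    <xsl:for-each select="record[generate-id() =
        generate-id(key('records-by-type', type)[1])]">
      <xsl:value-of select="type"/>
      <xsl:text>&#10;</xsl:text>
      <!-- within that type, visit only the first record of each distinct fid -->
      <xsl:for-each select="key('records-by-type', type)[generate-id() =
          generate-id(key('records-by-type-and-fid', concat(type, '|', fid))[1])]">
        <xsl:text>  </xsl:text>
        <xsl:value-of select="fid"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```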

I ended up with three nested <xsl:for-each-group select="nodes" group-by="key"> instructions. XSLT 1.0 has <xsl:for-each>, which iterates over a node list. <xsl:for-each-group> produces a list of lists and iterates through the outer one, giving you access to each inner list (via current-group()) and its key (via current-grouping-key()). This solves two difficult issues in XSLT 1.0: arbitrary grouping and making lists of unique keys.
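As a small illustration of the idea (a sketch, not the final stylesheet), the following groups records by type and then by document ID, printing each type followed by each fid within it and the number of records in that inner group. It assumes the records carry type and fid child elements.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="log">
    <!-- outer list: one group per distinct type -->
    <xsl:for-each-group select="record" group-by="type">
      <xsl:value-of select="current-grouping-key()"/>
      <xsl:text>&#10;</xsl:text>
      <!-- inner list: regroup this type's records by document ID -->
      <xsl:for-each-group select="current-group()" group-by="fid">
        <xsl:text>  </xsl:text>
        <xsl:value-of select="current-grouping-key()"/>
        <xsl:text>: </xsl:text>
        <xsl:value-of select="count(current-group())"/>
        <xsl:text> records&#10;</xsl:text>
      </xsl:for-each-group>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

No xsl:key declarations or generate-id() comparisons are needed; the grouping key and the group members are available directly inside each iteration.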

After refactoring, I had code that was easier to understand, a third of the size, and three times faster.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!--
    Output a CSV file for each unique record/type (e.g. 'timing' or 'memory'),
    with one row per unique record/fid (document ID).

    Multiple values of the same record/key within a document are summed,
    e.g. the record/key values below have multiple values:
      transform
      put.default.preview
      put.default.www
  -->

  <!-- test number, used in the output file names -->
  <xsl:param name="test">1</xsl:param>
  <xsl:output method="text"/>

  <xsl:template match="log">
    <!-- group the record nodes by record/type -->
    <xsl:for-each-group select="record" group-by="type">
      <!-- output the log type for debugging purposes -->
      <xsl:value-of select="current-grouping-key()"/>
      <xsl:text>&#10;</xsl:text><!-- new line -->
      <!-- new output file for each unique record/type -->
      <xsl:result-document href="results/{current-grouping-key()}.{$test}.csv" method="text">
        <xsl:text>id,</xsl:text><!-- first column -->
        <!-- header row: group this type's records by record/key -->
        <xsl:for-each-group select="current-group()" group-by="key">
          <!-- output each unique record/key into the first row -->
          <xsl:value-of select="current-grouping-key()"/>
          <xsl:text>,</xsl:text>
        </xsl:for-each-group>
        <xsl:text>&#10;</xsl:text><!-- new line -->
        <!-- data rows: regroup the same record/type group by record/fid (document ID);
             current-group() here is the type group again, not the key group above -->
        <xsl:for-each-group select="current-group()" group-by="fid">
          <!-- runs once per document, on the first record of its group -->
          <xsl:for-each select="current-group()[1]">
            <!-- output the document ID into the first column -->
            <xsl:value-of select="current-grouping-key()"/>
            <xsl:text>,</xsl:text>
            <!-- for each unique key in the document; this assumes every document logs
                 the same keys in the same order as the header row, otherwise the
                 columns will not line up -->
            <xsl:for-each-group select="current-group()" group-by="key">
              <!-- output the sum of same-named keys -->
              <xsl:value-of select="sum(current-group()/data)"/>
              <xsl:text>,</xsl:text>
            </xsl:for-each-group>
          </xsl:for-each>
          <xsl:text>&#10;</xsl:text><!-- new line -->
        </xsl:for-each-group>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

</xsl:stylesheet>
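One thing to watch with this layout: the header row lists every key seen for a type, while each data row lists only the keys that document actually logged, so a document with a missing key would shift its columns. The following is a hedged variant sketch, not part of the original stylesheet, that builds the key list once per type and looks each key up per document, writing 0 where a document has no value. The variable names keys and doc are made up, and for brevity it writes to the primary output instead of one file per type.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="log">
    <xsl:for-each-group select="record" group-by="type">
      <!-- every key seen for this type; the order is implementation-dependent,
           but the header and all rows share it -->
      <xsl:variable name="keys" select="distinct-values(current-group()/key)"/>
      <!-- heading line with the type, then the CSV header row -->
      <xsl:value-of select="current-grouping-key()"/>
      <xsl:text>&#10;</xsl:text>
      <xsl:value-of select="('id', $keys)" separator=","/>
      <xsl:text>&#10;</xsl:text>
      <xsl:for-each-group select="current-group()" group-by="fid">
        <!-- all records of this type for one document -->
        <xsl:variable name="doc" select="current-group()"/>
        <xsl:value-of select="current-grouping-key()"/>
        <xsl:for-each select="$keys">
          <xsl:text>,</xsl:text>
          <!-- sum() of an empty sequence is 0, which fills missing keys -->
          <xsl:value-of select="sum($doc[key = current()]/data)"/>
        </xsl:for-each>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each-group>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

If every document always logs the full set of keys in the same order, the original nested grouping produces aligned columns and this variant is unnecessary.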


I think XSLT is an amazing language.

3 comments:

Sylvain Wallez said...

... and you are able to make amazing uses of XSLT. Remember the alpha editor in Cocoon? I had never seen such a use of XSLT before :-)

Ugo said...

Clever.

I just wonder who thought that outputting large log files in XML, of all things, was a good idea. I want to poke him on the head with a cluestick.

Jeroen Reijn said...

Interesting Jeremy, very interesting!