Saturday, January 26, 2008

XSLT 2.0

I had a complex XML transformation to perform. I ended up using XSLT 2.0, and I was really impressed.

I needed to measure several sub-tasks in a publishing process that updates hundreds of documents in batches every few hours, with data from live sources.

I needed to compare the original implementation with simulations of 3 incremental optimisations, so we could see whether they were worth making.

Setting up and running the tests was out of my hands, happening in another country. What I got back were four very large XML files, in Java logger.xml format.

<record>
<date>2008-01-24T13:10:44</date>
<millis>1201176644561</millis>
<sequence>6</sequence>
<logger>app.Uploader</logger>
<level>INFO</level>
<class>app.Uploader$DataLogger</class>
<method>end</method>
<thread>10</thread>
<message>timing.folder 15 1468</message>
</record>
<record>
<date>2008-01-24T13:10:44</date>
<millis>1201176644561</millis>
<sequence>7</sequence>
<logger>app.Uploader</logger>
<level>INFO</level>
<class>app.Uploader$DataLogger</class>
<method>end</method>
<thread>10</thread>
<message>memory.folder 15 844128</message>
</record>


etc.

Thousands of them, all in a flat list. I needed CSV files to import into a spreadsheet to analyse the data.

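The first processing pass filtered out the records I was not interested in and extracted the values from each message, adding them as individual tags to each record. Each message has the form <message>[type].[key] [fid] [data]</message>, so "timing.folder 15 1468" above becomes type timing, key folder, fid 15 and data 1468.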
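A minimal sketch of what such a first pass might look like in XSLT 2.0. The filter (keep any record whose message has the four-part shape) and the use of tokenize() are assumptions for illustration; the original first-pass stylesheet is not shown. The added element names type, key, fid and data are the ones the grouping stylesheet relies on.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumed filter: keep only records whose message looks like
       "[type].[key] [fid] [data]"; the real criteria are not known -->
  <xsl:template match="log">
    <log>
      <xsl:apply-templates
          select="record[matches(normalize-space(message), '^\S+\.\S+ \S+ \S+$')]"/>
    </log>
  </xsl:template>

  <xsl:template match="record">
    <!-- "timing.folder 15 1468" becomes ("timing.folder", "15", "1468") -->
    <xsl:variable name="parts" select="tokenize(normalize-space(message), ' ')"/>
    <record>
      <!-- keep the original child elements -->
      <xsl:copy-of select="*"/>
      <!-- type is the text before the first dot; the key may itself contain dots -->
      <type><xsl:value-of select="substring-before($parts[1], '.')"/></type>
      <key><xsl:value-of select="substring-after($parts[1], '.')"/></key>
      <fid><xsl:value-of select="$parts[2]"/></fid>
      <data><xsl:value-of select="$parts[3]"/></data>
    </record>
  </xsl:template>

</xsl:stylesheet>
```

Splitting on the first dot only means a message such as "timing.put.default.www 15 20" would give type timing and key put.default.www.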

Each log file needed to be output into separate CSV files, one for each type of measurement (timing, memory etc.). Each row of a CSV file needed to contain all the different measurements for one document update, where measurements with the same key were added together.

It is well known that getting this kind of grouping from flat data is difficult in XSLT 1.0. I tried Muenchian grouping for a while; the code kept getting longer and more complex, and I still had not solved all of the problems. It was looking like I would need a whole chain of XSLTs to do the job, so I switched from xsltproc to Saxon, because it has a chaining command. Then I remembered that it also implements all of XSLT 2.0, which has new instructions that make grouping far easier.
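For comparison, here is a rough sketch of single-level Muenchian grouping in XSLT 1.0. It is not the code I abandoned, just the standard idiom, and the key name records-by-type is made up for the example. It assumes records that already carry a type child.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- index every record by its type child (key name is hypothetical) -->
  <xsl:key name="records-by-type" match="record" use="type"/>

  <xsl:template match="log">
    <!-- keep only the first record of each type: a record is "first" when it is
         the same node as the first node the key returns for its type -->
    <xsl:for-each
        select="record[generate-id() = generate-id(key('records-by-type', type)[1])]">
      <xsl:value-of select="type"/>
      <xsl:text>: </xsl:text>
      <!-- the members of the group have to be fetched again through the key -->
      <xsl:value-of select="count(key('records-by-type', type))"/>
      <xsl:text> records&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

That is one level. Grouping by fid within type, and by key within that, needs a further xsl:key for each level with a concatenated use expression such as concat(type, '|', fid), which is where the code starts to grow.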

I ended up with 3 nested <xsl:for-each-group select="nodes" group-by="key">. XSLT 1.0 has <xsl:for-each>, which iterates over a node list. <xsl:for-each-group> produces a list of lists and iterates through the outer one, giving you access to each inner list (current-group()) and its key (current-grouping-key()). This solves two difficult issues in XSLT 1.0: arbitrary grouping and making lists of unique keys.
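A minimal sketch of the instruction on its own, assuming records that already carry type and data children. It prints one line per unique type with the number of records and the sum of their data values.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="log">
    <!-- one iteration per distinct value of record/type -->
    <xsl:for-each-group select="record" group-by="type">
      <!-- the value shared by every record in this group -->
      <xsl:value-of select="current-grouping-key()"/>
      <xsl:text>: </xsl:text>
      <!-- current-group() is the sequence of records in this group -->
      <xsl:value-of select="count(current-group())"/>
      <xsl:text> records, total </xsl:text>
      <xsl:value-of select="sum(current-group()/data)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

No xsl:key declaration and no generate-id() comparison are needed, and nesting another <xsl:for-each-group select="current-group()" ...> inside gives the next level of grouping.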

After refactoring, I had code that was easier to understand, a third of the size, and three times faster.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!--

output a CSV file for each unique record/type (eg. 'timing' or 'memory')
one row per unique record/fid (document ID)

where multiple values of the same record/key are summed
eg. record/key values below have multiple values:
transform
put.default.preview
put.default.www

-->
<xsl:param name="test">1</xsl:param>
<xsl:output method="text"/>

<xsl:template match="log">
<!-- create a set of nodelists of record nodes of each unique record/type -->
<xsl:for-each-group select="record" group-by="type">
<!-- output the log type for debugging purposes -->
<xsl:value-of select="current-grouping-key()"/><xsl:text>
</xsl:text><!-- new line -->
<!-- new output file for each unique record/type -->
<xsl:result-document href="results/{current-grouping-key()}.{$test}.csv" method="text">
<xsl:text>id,</xsl:text><!-- first column -->
<!-- nodelist of records for each unique record/key, within the record/type group -->
<xsl:for-each-group select="current-group()" group-by="key">
<!-- output each unique record/key into the first row -->
<xsl:value-of select="current-grouping-key()"/><xsl:text>,</xsl:text>
</xsl:for-each-group>
<xsl:text>
</xsl:text><!-- new line -->
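<!-- create a set of nodelists of record nodes of each unique record/fid (document id),
within the current record/type group (the record/key grouping above has already closed) -->
<xsl:for-each-group select="current-group()" group-by="fid">
<!-- once per unique record/fid: the first record of the group is only the context node,
current-group() and current-grouping-key() still refer to the whole fid group -->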
<xsl:for-each select="current-group()[1]">
<!-- output the document id into the first column -->
<xsl:value-of select="current-grouping-key()"/><xsl:text>,</xsl:text>
<!-- for each unique key in the document -->
<xsl:for-each-group select="current-group()" group-by="key">
<!-- output the sum of same-named keys -->
<xsl:value-of select="sum(current-group()/data)"/><xsl:text>,</xsl:text>
</xsl:for-each-group>
</xsl:for-each>
<xsl:text>
</xsl:text><!-- new line -->
</xsl:for-each-group>
</xsl:result-document>
</xsl:for-each-group>
</xsl:template>

<xsl:template match="@*|node()">
<xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
</xsl:template>

</xsl:stylesheet>
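One thing to watch with the stylesheet above: the header row takes its keys from the whole type group, while each data row takes its keys from one document's records. The columns only line up if every document logs every key, in the same order. A sketch of a variant that drives both the header and the rows from one list of keys, so a missing key shows up as 0 instead of shifting the row (the variable names keys and rows are made up, and the $test parameter is left out for brevity):

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="log">
    <xsl:for-each-group select="record" group-by="type">
      <!-- unique keys of this type; the order is implementation-dependent,
           but the same sequence is used for the header and for every row -->
      <xsl:variable name="keys" select="distinct-values(current-group()/key)"/>
      <xsl:result-document href="results/{current-grouping-key()}.csv" method="text">
        <!-- header row -->
        <xsl:text>id,</xsl:text>
        <xsl:value-of select="$keys" separator=","/>
        <xsl:text>&#10;</xsl:text>
        <!-- one row per document id -->
        <xsl:for-each-group select="current-group()" group-by="fid">
          <xsl:variable name="rows" select="current-group()"/>
          <xsl:value-of select="current-grouping-key()"/>
          <!-- here the context item is a key string, not a node -->
          <xsl:for-each select="$keys">
            <xsl:text>,</xsl:text>
            <!-- sum() of an empty sequence is 0, so a missing key prints 0 -->
            <xsl:value-of select="sum($rows[key = current()]/data)"/>
          </xsl:for-each>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each-group>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

If the logs are perfectly regular, the original produces the same columns, apart from the trailing comma on each line.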


I think XSLT is an amazing language.

Wednesday, January 16, 2008

Thin Air

I don't know about you, but I found this year's keynote by Steve Jobs a bit thin on the ground.

I don't have a big movie-rental habit and they won't be coming to the UK for a long time, I guess ..... also I expect we'll pay a Rip-off Britain surcharge anyway .....

The MacBook Air is obviously gorgeous and looks like a really stunning bit of engineering, but I would have to think long and hard ..... I am not sure I would want to be without FireWire or Ethernet (though they are offering a USB/Ethernet adaptor).

I did think building demonstration videos into System Preferences for multi-touch gestures was a total hoot :-)

Remote-Disk sounds jolly clever. I did not hear whether it allows you to install a new System using it though.

My initial reaction to Time-Capsule is that it is a bit of a joke, why not just fix Time-Machine backups to AirPort-Disk?

I am sure that the improvements to the iPhone and iPod touch are welcomed by people who have not jailbroken theirs yet, but for me, I could not see anything that I cannot already do ...... albeit differently on my jailbroken iPod touch.

I will wait for the release of the iPhone SDK, and see what it brings, before I un-jailbreak and update the OS on mine. Even with Apple's additions, there are still compelling 3rd party apps on mine that I would not want to lose.

Charging for the applications they left out of the iPod touch is bizarre IMHO, I bet we see a refund in a few weeks ;-)
Now they are putting back the bits missing from the iPod touch, maybe they will put back the bits missing from Leopard :-)

I find it very difficult to believe their WIFI-location technology would actually work, especially outside urban America. It sounds bloody clever, but I have a GPS ..... I'd rather have a GPS module .....

So, WTF happened to "Just one more thing"?