The Simple API for XML (SAX) is a callback-based API for parsing XML documents. A SAX parser walks an XML document and calls into a known API to report XML constructs (elements, text) as it encounters them in the source document. This will (hopefully) become clearer when we get to the examples later in this post.
SAX is a de facto standard rather than a formal one, based on an original Java implementation.
http://www.saxproject.org is the official website for SAX and includes some of the history of SAX's evolution and information on writing SAX-based programs using the Java API. The book SAX2 also provides a good reference for parsing with SAX. Note that the book was published in 2002 but is still relevant today, as SAX version 2 is still the current version of the API.
Python provides its SAX support in the xml.sax module. The official documentation is in subsections 19.9, 19.10, 19.11, and 19.12 of Chapter 19: Structured Markup Processing Tools. As mentioned in my previous post, this documentation can also be accessed directly from within the Python interactive interpreter as follows:
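One way to do this, sketched on the assumption that the built-in help() function is what is meant and that Python 3 is in use, is to import the module and ask for its help text. The snippet uses pydoc.render_doc, which produces the same text as a string, so it also runs outside an interactive session:

```python
import pydoc
import xml.sax

# At the interactive prompt you would normally just type: help(xml.sax)
# render_doc returns that help text as a string instead of paging it.
doc_text = pydoc.render_doc(xml.sax, renderer=pydoc.plaintext)

# Show only the first few lines of the (long) module documentation.
print("\n".join(doc_text.splitlines()[:10]))
```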
The entry point to parsing an XML document using SAX is the xml.sax.parse() function. This function takes two required arguments (a source and a content handler) and one optional argument (an error handler). The source is either a filename or a file-like object that provides access to the XML document. The parser calls methods on the supplied content handler as it encounters constructs in the source document. The optional third argument lets you supply a custom error handler.
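As a minimal sketch of the three arguments, assuming Python 3, the snippet below feeds a deliberately malformed document to xml.sax.parse(). The class name RecordingErrorHandler and the sample document are made up for illustration. The content handler is the default one, whose methods do nothing.

```python
import io
import xml.sax

class RecordingErrorHandler(xml.sax.ErrorHandler):
    """Hypothetical error handler: remember each fatal error, then re-raise it."""

    def __init__(self):
        self.problems = []

    def fatalError(self, exception):
        self.problems.append(str(exception))
        raise exception

source = io.BytesIO(b"<a><b></a>")          # mismatched tags, so not well formed
content_handler = xml.sax.ContentHandler()  # default handler: every callback is a no-op
error_handler = RecordingErrorHandler()

try:
    xml.sax.parse(source, content_handler, error_handler)
except xml.sax.SAXParseException as exc:
    print("parse failed:", exc)

print(error_handler.problems)
```

Re-raising inside fatalError keeps the behaviour of the default error handler, which also stops parsing on a fatal error.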
The key part of the handling is the content handler. This is where your application-specific code is informed of content sourced from the input XML document. Your content handler will be an object that provides the same interface as the class xml.sax.ContentHandler. The methods defined in this class are the ones the SAX parser expects to find when invoking callbacks. The complete interface is:
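Rather than trusting a hand-typed list, the interface can be read off the class itself. This is a small sketch, assuming Python 3, that prints every public method of xml.sax.ContentHandler together with its signature:

```python
import inspect
import xml.sax

# Collect the public callbacks defined on the base content handler.
methods = [
    (name, inspect.signature(func))
    for name, func in inspect.getmembers(xml.sax.ContentHandler, inspect.isfunction)
    if not name.startswith("_")
]

for name, signature in methods:
    print(name + str(signature))

method_names = [name for name, _ in methods]
```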
The simplest way to provide this interface is to have your custom class extend xml.sax.ContentHandler and override just the methods that you are interested in receiving. The methods you don't implement will make use of the empty implementations provided by xml.sax.ContentHandler.
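A minimal sketch of that approach, assuming Python 3: the hypothetical ElementCounter class below overrides only startElement and inherits the empty implementations of everything else. The sample document is made up for illustration.

```python
import collections
import xml.sax

class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.counts = collections.Counter()

    def startElement(self, name, attrs):
        # The only callback we care about; all others stay as inherited no-ops.
        self.counts[name] += 1

counter = ElementCounter()
xml.sax.parseString(b"<a><b/><b/></a>", counter)
print(dict(counter.counts))
```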
Arguably the most relevant methods to override are startElement, endElement and characters. These three methods will receive just about all of the content from an XML document. The startElement method is called when the SAX parser encounters the opening tag of an element. The name of the element and all of its attributes are supplied. The endElement method is called when the closing tag for the element is encountered. The characters method receives all the content in between, though there is no requirement that all the text be provided in one call to characters. There may be multiple calls.
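Because the text of one element may arrive in several pieces, a handler that wants the whole string has to buffer it. The following is a minimal sketch, assuming Python 3 and elements that contain only text; the class name TextBuffer and the sample document are made up. The entity reference in the sample is there because parsers commonly split text around entities, though how many calls you actually get is up to the parser.

```python
import xml.sax

class TextBuffer(xml.sax.ContentHandler):
    """Hypothetical handler that joins the text chunks of each element."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self.calls = 0    # how many times characters() was invoked
        self.texts = []   # (element name, complete text) pairs

    def startElement(self, name, attrs):
        self._chunks = []  # start a fresh buffer for this element

    def characters(self, content):
        self.calls += 1
        self._chunks.append(content)

    def endElement(self, name):
        # Only now is the element's text known to be complete.
        self.texts.append((name, "".join(self._chunks)))
        self._chunks = []

handler = TextBuffer()
xml.sax.parseString(b"<name>Fred &amp; Freda Fox</name>", handler)
print(handler.calls, handler.texts)
```

Resetting the buffer in startElement is fine for leaf elements like this one; a document with mixed content would need a stack of buffers instead.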
Enough abstract talk. Let's see what happens when parsing the following simple XML document.
The following code will parse this document (assumed to be stored in addressbook.xml) and echo the content as it is supplied to the custom content handler.
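The sketch below is reconstructed from the output shown further down, so the exact whitespace of the document, the class name EchoHandler and the print formatting are assumptions rather than the original listing. It is written for Python 3 and parses the document from an in-memory copy so that it runs without the file being present.

```python
import io
import xml.sax

# Assumed content of addressbook.xml, inferred from the parser output.
ADDRESS_BOOK = b"""<address-book>
  <name>Fred Fox</name>
  <phone>1234567</phone>
  <address type="postal">PO Box 987, Anytown, EV</address>
  <address type="street">34 Main St, Anytown, EV</address>
</address-book>
"""

class EchoHandler(xml.sax.ContentHandler):
    """Print every callback as the parser makes it."""

    def startElement(self, name, attrs):
        print("startElement '%s'" % name)
        # Only <address> carries attributes in this document.
        if name == "address":
            print("  attribute type='%s'" % attrs.getValue("type"))

    def endElement(self, name):
        print("endElement '%s'" % name)

    def characters(self, content):
        print("characters '%s'" % content)

# io.BytesIO stands in for the file on disk; with the file present,
# xml.sax.parse("addressbook.xml", EchoHandler()) reads it directly.
xml.sax.parse(io.BytesIO(ADDRESS_BOOK), EchoHandler())
```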
Worth noting here is the implementation of startElement. This is called for each element encountered in the source document. Only the address element has attributes, so there is an explicit check for this element tag before trying to access the value of the type attribute.
Running this code against the source document generates the following output:
startElement 'address-book'
characters '
'
characters ' '
startElement 'name'
characters 'Fred Fox'
endElement 'name'
characters '
'
characters ' '
startElement 'phone'
characters '1234567'
endElement 'phone'
characters '
'
characters ' '
startElement 'address'
attribute type='postal'
characters 'PO Box 987, Anytown, EV'
endElement 'address'
characters '
'
characters ' '
startElement 'address'
attribute type='street'
characters '34 Main St, Anytown, EV'
endElement 'address'
characters '
'
endElement 'address-book'
Note the multiple calls to characters that deliver the whitespace used for indentation and newlines.
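If that whitespace is of no interest, one simple option is to drop chunks that contain nothing else. This is a minimal sketch, assuming Python 3; the class name TextCollector and the shortened sample document are made up for illustration.

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that keeps only chunks containing non-whitespace text."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # Indentation and newlines between elements strip down to an empty string.
        if content.strip():
            self.chunks.append(content)

DOC = b"<address-book>\n  <name>Fred Fox</name>\n  <phone>1234567</phone>\n</address-book>"

collector = TextCollector()
xml.sax.parseString(DOC, collector)
print(collector.chunks)
```

This filter is only appropriate when whitespace-only text is never meaningful in the document being parsed.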
There is a lot more to say about SAX parsing, but I've said enough for this post. A subsequent post will explore the namespace areas of the SAX API (startElementNS etc.).