The history of SAX is unusually well documented, because all the discussion took place on the public XML-DEV mailing list, whose archives are available at http://www.lists.ic.ac.uk/hypermail/xml-dev/. David Megginson has also summarized its history at http://www.megginson.com/SAX/history.html.
The process started late in 1997 as a result of pressure from XML users such as Peter Murray-Rust, who was developing XML applications and struggling with the needless incompatibility of different parsers. Suppliers of early XML parsers, including Tim Bray, David Megginson, and James Clark, contributed to the discussion, and many other members of the list commented on the various drafts. David Megginson devised a process, rather in the spirit of the original Internet "Request for Comments", whereby comments and suggestions could be handled promptly yet fairly, and he eventually declared the specification frozen on 11 May 1998.
One of the major reasons for the success of SAX was that along with the initial specification, Megginson supplied front-end drivers for several popular XML parsers, including his own Ælfred, Tim Bray's Lark, and Microsoft's MSXML. Once SAX was established in this way, other parser suppliers such as IBM, Sun, and Oracle were quick to incorporate native SAX interfaces into their own parsers, to enable existing applications to run with their products.
The definitive SAX specification is written in terms of Java interfaces. It has been adapted to other languages, though the only one we know of that is actively supported is an interface for the Python language, produced by Lars Marius Garshol (see http://www.stud.ifi.uio.no/~larsga/download/python/xml/saxlib.html). Of course, the Java interfaces can be used from other languages that interoperate with Java, for example by using Microsoft's Java VM that interfaces Java to COM. In this chapter, however, we'll stick to the original Java.
SAX is structured as a number of Java interfaces. It's very important to understand the difference between an interface and a class:
• An interface says what methods there are, and what kind of parameters they expect. It is purely a specification; it doesn't provide any code to execute when the methods are called. But it is a concrete specification, not just a scrap of paper, and the Java compiler will check that a class that claims to implement an interface does so correctly.
• A class provides executable methods, including public methods that can be called by the code in other classes.
• A class may implement one or more interfaces. In many cases SAX specifies several interfaces which could theoretically be implemented by separate classes, but which in practice are often implemented in combination by a single class. To implement an interface, a class must supply code for each of the methods defined in the interface.
• Several classes may implement the same interface. Of course this is the whole point of the SAX exercise – there are lots of implementations of the SAX Parser interface for you to choose from, and because they all implement the same interface, your application doesn't care which one it is using.
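The distinction can be illustrated with a small sketch that has nothing to do with SAX itself. All the names in it (Greeter, FormalGreeter, CasualGreeter, GreeterDemo) are invented for illustration.

```java
// Hypothetical names throughout; nothing here is part of SAX.
interface Greeter {
    // The interface only declares the method: there is no body to execute.
    String greet(String name);
}

// Two independent classes implement the same interface.
class FormalGreeter implements Greeter {
    @Override
    public String greet(String name) {
        return "Good day, " + name + ".";
    }
}

class CasualGreeter implements Greeter {
    @Override
    public String greet(String name) {
        return "Hi " + name + "!";
    }
}

class GreeterDemo {
    // The caller is written against the interface, so it does not care
    // which implementation it has been handed.
    static String welcome(Greeter greeter) {
        return greeter.greet("reader");
    }

    public static void main(String[] args) {
        System.out.println(welcome(new FormalGreeter()));
        System.out.println(welcome(new CasualGreeter()));
    }
}
```

If either class omitted the greet() method, the compiler would reject its "implements Greeter" claim, which is the checking described in the first point above.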
Some of the interfaces in SAX are implemented by classes within the parser, and some must be implemented by classes within the application. There are some classes supplied with SAX itself, though you don't have to use these. And there are some classes (such as the error handling classes), which the parser must provide, but which your application can override if it wishes.
The components of a simple SAX application are shown in the diagram below.
In the diagram:
• The Application is the "main program": the code that you write to start the whole process off.
• The Document Handler is code that you write to process the contents of the document.
• The Parser is an XML Parser that conforms to the SAX standard.
The job of the application is to create a parser (more technically, to instantiate a class that implements the org.xml.sax.Parser interface); to create a document handler (by instantiating a class that implements the org.xml.sax.DocumentHandler interface); to tell the parser what document handler to use (by calling the parser's setDocumentHandler() method); and to tell the parser to start processing a particular input document (by calling the parse() method of the parser).
The job of the parser is to notify the document handler of all the interesting things it finds in the document, such as element start tags and end tags.
The job of the document handler is to process these notifications to achieve whatever the application requires.
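The three roles can be sketched as follows. This is an illustration under stated assumptions rather than a definitive listing: TagRecorder and WiringSketch are invented names, the document is supplied as an in-memory string, and the parser is obtained through the JDK's JAXP factory (SAXParserFactory ... getParser()) instead of by naming a particular parser class.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

// The document handler: receives notifications from the parser.
// (TagRecorder is a hypothetical name, not part of SAX.)
@SuppressWarnings("deprecation") // the SAX 1.0 interfaces are deprecated in current JDKs
class TagRecorder extends HandlerBase {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String name, AttributeList atts) {
        events.add("start:" + name);
    }

    @Override
    public void endElement(String name) {
        events.add("end:" + name);
    }
}

// The application: creates the parser and the handler, connects them,
// and starts the parse.
@SuppressWarnings("deprecation")
class WiringSketch {
    static List<String> run(String xml) throws Exception {
        // Assumption: use whatever SAX 1.0 Parser the JDK supplies.
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        TagRecorder handler = new TagRecorder();
        parser.setDocumentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<books><book/></books>"));
    }
}
```

Run on the tiny document in main(), the recorded events are start:books, start:book, end:book, end:books.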
Let's look at a very simple application: one that simply counts how many <book> elements there are in the supplied XML file (shown later).
In this example we will simplify the structure shown in the diagram above by using the same class to act as both the application and the document handler. The reason we can do this is that one Java class can implement several interfaces, so it can perform several roles at once.
The first thing the application must do is to create a parser:
This is the only time you need to say which particular SAX parser you are using. We have chosen the xp parser produced by James Clark, and available from http://www.jclark.com. Like any other Java class you use, of course, it must be on the Java classpath.
The chosen parser must implement the SAX Parser interface org.xml.sax.Parser (if it doesn't, Java will complain loudly), so it can be assigned to a variable of type Parser. Because of the import statement at the top, Parser is actually a shorthand for org.xml.sax.Parser.
So you need to know the relevant class name of your chosen parser. Oddly, many of the available SAX parsers don't advertise their parser class name in bright lights. So here is a list of some of the more popular parsers, with the class name you need to use to instantiate them. (Note however that this may change with later versions of the products.)
Ælfred
parser class: com.microstar.xml.SAXDriver
DataChannel DXP
parser class: com.datachannel.xml.sax.SAXDriver
parser class (non-validating):
parser class (validating):
Oracle
from: http://www.oracle.com (requires TechNet registration)
parser class: oracle.xml.parser.v2.SAXParser
Sun Project X
parser class (non-validating):
parser class (validating):
xp
parser class: com.jclark.xml.sax.Driver
So, you've created a parser. Now you can start telling it what to do.
First you need to tell the parser what document handler to call when events occur. This can be any class that implements the SAX org.xml.sax.DocumentHandler interface. The simplest and most common approach is to make your application itself act as the document handler.
DocumentHandler itself is an interface defined in SAX. You could make your application program implement this interface directly, in which case you would have to provide code for all the different methods required by that interface. In our example, however, we want to ignore most of the events, so it would be rather tedious to define lots of methods that do nothing. Fortunately SAX supplies an implementation of DocumentHandler that does nothing, HandlerBase, and we can make our application extend this, so it inherits all the "do nothing" methods. Let's do this:
The call on setDocumentHandler() tells the parser that "this" class (your application program) is to receive notification of events. This class is an implementation of org.xml.sax.DocumentHandler, because it inherits from org.xml.sax.HandlerBase, which in turn implements DocumentHandler.
The parser is now almost ready to go; all it needs is a document to parse, and the Java main method that lets it operate as a standalone program. Let's give it a file to parse first:
Note that the argument to parse() is a URL, supplied as a string. We'll show you later how to supply a filename rather than a URL. Because parse() can throw checked exceptions if the input cannot be read or parsed, we must also add "throws Exception" to the countBooks method.
We need to make one more addition to get the program to run as a standalone application: the Java main method. In the main method we create an instance of the class, with new BookCounter(), and then call the object's countBooks method; we also trap exceptions again for the new object as a whole. Our code should then look like this:
The program can now be run: it will parse the document and run to completion (assuming, of course, that the document is there to be parsed).
The only snag is that the program currently produces no output. To make it useful, we need to add a method that counts the <book> start tags as they are notified, and another that prints the number of books counted at the end of the document. These methods make use of the instance variable count.
The final version of the application is shown below. You can find it on our web site on the pages for this book at http://www.wrox.com/ in the code for this chapter.
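A minimal sketch of a counter along the lines described above is given here. It is not the book's listing: the class name BookCounterSketch is invented, the parser comes from the JDK's JAXP factory rather than from xp, and countBooks() takes an InputSource so that the same code can read either a URL or an in-memory string.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

// One class plays both roles: it is the application and, by extending
// HandlerBase, also the document handler.
@SuppressWarnings("deprecation") // SAX 1.0 interfaces are deprecated in current JDKs
class BookCounterSketch extends HandlerBase {
    private int count = 0; // instance variable holding the running total

    // Parses the given source and returns the number of <book> start tags seen.
    int countBooks(InputSource source) throws Exception {
        // Assumption: any SAX 1.0 Parser will do; here the JDK supplies one.
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        parser.setDocumentHandler(this); // "this" object receives the events
        parser.parse(source);
        return count;
    }

    @Override
    public void startElement(String name, AttributeList atts) {
        if (name.equals("book")) {
            count++;
        }
    }

    @Override
    public void endDocument() {
        System.out.println("There are " + count + " books");
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the document's URL is passed as the first argument.
        new BookCounterSketch().countBooks(new InputSource(args[0]));
    }
}
```

All the other DocumentHandler methods are inherited from HandlerBase and do nothing, which is why only two of them need to be written out.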
You can now run this application from the command line, with a command of the form:
and it will print the number of <book> elements in the supplied XML file. Suppose the file c:\data\books.xml contains the following document (available for download with the code for the chapter from http://www.wrox.com):
Then the output displayed at the terminal will be:
As the example above shows, the main work in a SAX application is done in a class that implements the DocumentHandler interface. Usually we'll be interested in rather more of the events than in the simple example above, so let's look at the other methods that make up this interface.
First, there's a pair of methods that mark the start and end of document processing:
• startDocument()
• endDocument()
These two methods take no parameters and return no result. In fact, you can usually get by without them, since anything you want to do at the start can generally be done before you call parse(), and anything you want to do at the end can be done when parse() returns. However, in a more complex application you may want to make the application that calls parse() a different class from the DocumentHandler, and in this case these two methods are useful for initializing variables and tidying up at the end.
Note that a SAX parser (a single instance of the Parser class) should only be used to parse one XML document at a time. Once it has finished, you can use it again to parse another document. If you want to parse several documents concurrently, you need to create one instance of the Parser class for each. You'll almost certainly want to apply the same one-document-per-instance rule to a DocumentHandler, because there's nothing in the event information that tells you what document the event came from.
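As an illustration of these two points, the sketch below resets per-document state in startDocument(), records a result in endDocument(), and reuses one parser for several documents strictly one after another. ElementTally and SequentialParseSketch are invented names, and the JAXP-supplied parser is an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

// Hypothetical handler that counts the elements in each document it sees.
@SuppressWarnings("deprecation")
class ElementTally extends HandlerBase {
    private int elements;
    final List<Integer> totals = new ArrayList<>();

    @Override
    public void startDocument() {
        elements = 0; // initialise per-document state
    }

    @Override
    public void startElement(String name, AttributeList atts) {
        elements++;
    }

    @Override
    public void endDocument() {
        totals.add(elements); // tidy up: record the result for this document
    }
}

@SuppressWarnings("deprecation")
class SequentialParseSketch {
    static List<Integer> tally(String... documents) throws Exception {
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        ElementTally handler = new ElementTally();
        parser.setDocumentHandler(handler);
        for (String xml : documents) {
            // One document at a time: the next parse starts only after
            // the previous call to parse() has returned.
            parser.parse(new InputSource(new StringReader(xml)));
        }
        return handler.totals;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(tally("<a><b/></a>", "<x/>"));
    }
}
```

Because the state is reset in startDocument(), the code that calls parse() does not need to know anything about the handler's internals.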
As with document events, there is a pair of methods that is called to mark the start and end tags of each element in the document:
• startElement(String name, AttributeList attList)
• endElement(String name)
The name is the name that appears in the start and end tag of the element.
If the document uses the abbreviated syntax for an empty element (that is, "<tag/>"), the parser will notify both a start and end tag, exactly as if you had written "<tag></tag>". This is because XML defines these two constructs as equivalent, so your application shouldn't need to know which was used.
The attributes appearing in the start tag are bundled together into an object of class AttributeList and handed to the application all at once. This is a departure from the event-based model, in which you might expect each attribute to be notified as it occurs. AttributeList is another interface defined by SAX. It's up to the parser to define a class that implements this interface: all the application needs to know is the methods it can call to get details of individual attributes. The most useful one is:
• getValue(String name)
which returns the value of the named attribute as a String, if it is present, or null if it is absent.
One thing to remember about the AttributeList is that it's only valid for the duration of the startElement() method. Once your method returns control to the parser, it can (and often does) overwrite the AttributeList with different information. If you want to keep attribute information for later use, you'll need to make a copy. One convenient way to do this is to use the SAX "helper" class AttributeListImpl: this allows you to create another AttributeList as a private copy of the one you were given.
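The following sketch shows both getValue() and the copying technique. The class names AttributeKeeper and AttributeCopySketch, and the category and id attributes, are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical handler that remembers the attributes of every <book>.
@SuppressWarnings("deprecation")
class AttributeKeeper extends HandlerBase {
    final List<String> categories = new ArrayList<>();
    final List<AttributeList> saved = new ArrayList<>();

    @Override
    public void startElement(String name, AttributeList atts) {
        if (name.equals("book")) {
            // getValue() returns null when the attribute is absent.
            categories.add(atts.getValue("category"));
            // atts is only valid during this call, so keep a private copy.
            saved.add(new AttributeListImpl(atts));
        }
    }
}

@SuppressWarnings("deprecation")
class AttributeCopySketch {
    static AttributeKeeper scan(String xml) throws Exception {
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        AttributeKeeper keeper = new AttributeKeeper();
        parser.setDocumentHandler(keeper);
        parser.parse(new InputSource(new StringReader(xml)));
        return keeper;
    }

    public static void main(String[] args) throws Exception {
        AttributeKeeper k =
            scan("<books><book category=\"fiction\" id=\"b1\"/><book id=\"b2\"/></books>");
        // The copies are still usable after the parse has finished.
        System.out.println(k.saved.get(0).getValue("id"));
    }
}
```

Storing the AttributeList object that was passed in, without copying it, would risk reading whatever the parser had written into it for a later element.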
Character data appearing in the XML document is usually reported to the application using the method
• characters(char[] chars, int start, int len)
This interface was defined for efficiency rather than convenience. If you want to handle the character data as a String, you can easily construct one by writing:
The parser could have constructed this String for you, but creating new objects can be expensive in Java, so instead it just gives you a pointer to its internal buffer where the characters are already held.
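A minimal sketch of that conversion follows, with an invented handler class TextCollector. The point to notice is that only the slice of the array from start to start + len belongs to the event; the rest of the array is the parser's own working storage.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.HandlerBase;

// Hypothetical handler that keeps each piece of character data as a String.
@SuppressWarnings("deprecation") // SAX 1.0 interfaces are deprecated in current JDKs
class TextCollector extends HandlerBase {
    final List<String> pieces = new ArrayList<>();

    @Override
    public void characters(char[] chars, int start, int len) {
        // Copy just the reported slice out of the parser's buffer.
        pieces.add(new String(chars, start, len));
    }

    public static void main(String[] args) {
        TextCollector collector = new TextCollector();
        // Simulate a parser whose buffer holds more than the text being reported.
        char[] buffer = "<t>hello</t>".toCharArray();
        collector.characters(buffer, 3, 5);
        System.out.println(collector.pieces);
    }
}
```

In main() the handler is called directly, standing in for a parser, so that the relationship between the buffer and the reported slice is visible.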
One advantage of using Java for XML processing is that Java and XML both use the Unicode character set as standard. The characters passed in the chars array are always native Java Unicode characters, regardless of the character encoding used in the original source document. This means you never need to worry about how the characters were encoded.
One important point to remember is that the parser is allowed to break up character data however it likes, and pass it to you one piece at a time. This means that if you are looking for "gold" in your document, the following code is wrong:
Why? Because the string "gold" might appear in your document, but be notified to your application in two or more calls of the characters() method. In theory, there could be four separate calls, one for the "g", one for the "o", one for the "l", and one for the "d".
The worst aspect of this problem is that you will probably not discover your program is wrong during testing, because in practice parsers very rarely split the text in this way. They might split it only if, for example, the text happens to straddle a 4096-byte boundary in the parser's input buffer, and this might not happen until after months of successful running. Be warned.
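The difference can be demonstrated without waiting months for a parser to split the text. In the sketch below the handler is driven by hand, delivering "gold" in two pieces. The class name GoldFinder is invented, and the per-call test is only a guess at the kind of code the warning has in mind; the buffered test shows one way round the problem.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical handler comparing a per-call test with a buffered one.
@SuppressWarnings("deprecation")
class GoldFinder extends HandlerBase {
    private final StringBuffer text = new StringBuffer();
    boolean naiveHit = false;    // set by the flawed per-call test
    boolean bufferedHit = false; // set by the test on the accumulated text

    @Override
    public void startElement(String name, AttributeList atts) {
        text.setLength(0);
    }

    @Override
    public void characters(char[] chars, int start, int len) {
        // Flawed: looks at one piece of text in isolation.
        if (new String(chars, start, len).indexOf("gold") >= 0) {
            naiveHit = true;
        }
        // Safer: accumulate the pieces and decide at the end tag.
        text.append(chars, start, len);
    }

    @Override
    public void endElement(String name) {
        if (text.indexOf("gold") >= 0) {
            bufferedHit = true;
        }
        text.setLength(0);
    }

    public static void main(String[] args) {
        GoldFinder finder = new GoldFinder();
        // Simulate a parser that splits the text across two calls.
        finder.startElement("metal", new AttributeListImpl());
        finder.characters("go".toCharArray(), 0, 2);
        finder.characters("ld".toCharArray(), 0, 2);
        finder.endElement("metal");
        System.out.println("naive: " + finder.naiveHit + ", buffered: " + finder.bufferedHit);
    }
}
```

With the text split this way the per-call test never fires, while the buffered test does.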
There is one circumstance in which parsers are obliged to split the text, and that is when external entities are used. The SAX specification is quite explicit that a single call on characters() may not contain text from two different external entities.
If you want to do anything with character data other than simply copying it unconditionally to an output file, you are probably interested in knowing what element it belongs to. Unfortunately the SAX interface doesn't give you this information directly. If you need such contextual information, your application will have to maintain a data structure that retains some memory of previous events. The most common is a stack. In the next section we will show how you can use some simple data structures both to assemble character data supplied piecemeal by the parser, and to determine what element it is part of.
There is a second method for reporting character data, namely
• ignorableWhitespace(char[] chars, int start, int len)
This interface can be used to report what the SAX specification rather loosely refers to as "ignorable white space". If the DTD defines an element with "element content" (that is, the element can have children but cannot contain PCDATA), then XML permits the child elements to be separated by spaces, tabs, and newlines, even though "real" character data is not allowed. This white space is probably insignificant, so a SAX application will almost invariably ignore it: which you can do simply by having an ignorableWhitespace() method that does nothing. The only time you might want to do anything else is if your application is copying the data unchanged to an output file.
The XML specification allows a parser to ignore information in the external DTD, however. A non-validating parser will not necessarily distinguish between an element with element content and one with mixed content. In this case the ignorable white space is likely to be reported via the ordinary characters() interface. Unfortunately there is no way within a SAX application of telling whether the parser is a validating one or not, so a portable application must be prepared for either. This is another limitation that is remedied in SAX 2.0.
There is one more kind of event that parsers report, namely processing instructions. You probably won't meet these very often: they are the instructions that can appear anywhere in an XML document between the symbols "<?" and "?>". A processing instruction has a name (called a target), and arbitrary character data (instructions for the target application concerned).
Processing instructions are notified to the DocumentHandler using the method:
• processingInstruction(String name, String data)
By convention, you should ignore any processing instruction (or copy it unchanged) unless you recognize its name.
Note that the XML declaration at the start of a document may look like a processing instruction, but it is not a true processing instruction, and is not reported to the application via this interface – indeed, it is not reported at all.
Processing instructions are often written to look like element start tags, with a sequence of keyword="value" attributes. This syntax, however, is purely an application convention, and is not defined by the XML standard. So SAX doesn't recognize it; the contents of the processing instruction data are passed over in an amorphous lump.
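A sketch of a handler that follows this convention is shown below. The names StylesheetSpotter and PiSketch are invented, and the xml-stylesheet target is used purely as a familiar example of an instruction an application might recognize.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

// Hypothetical handler that acts on one target and ignores all others.
@SuppressWarnings("deprecation")
class StylesheetSpotter extends HandlerBase {
    String stylesheetData; // raw data of the recognised instruction, or null
    int ignored = 0;       // how many unrecognised instructions were skipped

    @Override
    public void processingInstruction(String target, String data) {
        if (target.equals("xml-stylesheet")) {
            // The data arrives as one unparsed string; interpreting any
            // keyword="value" structure inside it is the application's job.
            stylesheetData = data;
        } else {
            ignored++;
        }
    }
}

@SuppressWarnings("deprecation")
class PiSketch {
    static StylesheetSpotter scan(String xml) throws Exception {
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        StylesheetSpotter spotter = new StylesheetSpotter();
        parser.setDocumentHandler(spotter);
        parser.parse(new InputSource(new StringReader(xml)));
        return spotter;
    }

    public static void main(String[] args) throws Exception {
        StylesheetSpotter s = scan("<?xml version=\"1.0\"?>"
            + "<?xml-stylesheet href=\"a.xsl\" type=\"text/xsl\"?>"
            + "<doc><?printer eject?></doc>");
        System.out.println(s.stylesheetData + " / ignored: " + s.ignored);
    }
}
```

Only one instruction is counted as ignored in this example, because the XML declaration at the top is never passed to the handler.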
We've glossed over error handling so far, but as always, it needs careful thought in a real production application.
There are three main kinds of errors that can occur:
• Failure to open the XML input file, or another file that it refers to, for example the DTD or another external entity. In this case the parser will throw an IOException (input/output exception), and it is up to your application to handle it.
• XML errors detected by the parser, including well-formedness errors and validity errors. These are handled by calling an error handler which your application can supply, as described below.
• Errors detected by the application: for example, an invalid date or number in an attribute. You handle these by throwing an exception in the DocumentHandler method that detects the error.
The SAX specification defines three levels of error severity, based on the terminology used in the XML standard itself. These are:
Fatal errors: these usually mean the XML is not well-formed. The parser will call the registered error handler if there is one; if not, it will throw a SAXParseException. In most cases a parser will stop after the first fatal error it finds.
Errors: these usually mean the XML is well-formed but not valid. The parser will call the registered error handler if there is one; if not, it will ignore the error.
Warnings: these mean that the XML is correct, but there is some condition that the parser considers it useful to report. For example this might be a violation of one of the "interoperability" rules: input that is correct XML but not correct SGML. The parser will call the registered error handler if there is one; if not, it will ignore the error.
The application can register an error handler using the parser's setErrorHandler() method. An error handler contains three methods, fatalError(), error(), and warning(), reflecting the three different error severities. If you don't want to define all three, you can make an error handler that inherits from HandlerBase: this contains versions of all three methods that take the same action as if no error handler were registered.
The parameter to the error handling method, in all three cases, is a SAXParseException object. You probably think of Java Exceptions as things that are thrown and caught when errors occur; but in fact an Exception is a regular Java object and can be passed as a parameter to methods just like any other: it might never be thrown at all. The SAXParseException contains information about the error, including where in the source XML file it occurred. The most common thing for an error handler method to do is to extract this information to construct an error message, which can be written to a suitable destination: for example, a web server log file.
The other useful thing the error handling method can do is to throw an exception: usually, but not necessarily, the exception that the parser supplied as a parameter. If you do this, the parse will typically be aborted, and the top-level application will see the same exception thrown by the parse() method. It then has another opportunity to output diagnostics. Whether you generate a fatal error message from within the error handler, or do it by letting the top-level application catch the exception, is entirely up to you.
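The sketch below puts these pieces together: an error handler that builds a message from the SAXParseException for every severity, and rethrows the exception for fatal errors so that the top-level code sees it coming out of parse(). LoggingErrorHandler and ErrorHandlingSketch are invented names, and the in-memory log stands in for whatever destination a real application would use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical error handler that logs everything and aborts on fatal errors.
class LoggingErrorHandler implements ErrorHandler {
    final List<String> log = new ArrayList<>();

    private String describe(String level, SAXParseException e) {
        return level + ": " + e.getMessage() + " (" + e.getSystemId()
            + ", line " + e.getLineNumber() + ", column " + e.getColumnNumber() + ")";
    }

    @Override
    public void warning(SAXParseException e) {
        log.add(describe("warning", e));
    }

    @Override
    public void error(SAXParseException e) {
        log.add(describe("error", e));
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXParseException {
        log.add(describe("fatal", e));
        throw e; // abort the parse; the caller of parse() sees an exception
    }
}

@SuppressWarnings("deprecation") // SAX 1.0 Parser and HandlerBase are deprecated
class ErrorHandlingSketch {
    static List<String> parse(String xml) throws Exception {
        LoggingErrorHandler errors = new LoggingErrorHandler();
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        parser.setDocumentHandler(new HandlerBase());
        parser.setErrorHandler(errors);
        try {
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Second opportunity to report, this time at the top level.
            errors.log.add("parse abandoned: " + e.getMessage());
        }
        return errors.log;
    }

    public static void main(String[] args) throws Exception {
        // Not well-formed: <b> is never closed.
        for (String line : parse("<a><b></a>")) {
            System.out.println(line);
        }
    }
}
```

Here both reporting points are used, which in a real application would usually be redundant; the choice between them is the one described above.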
When your application detects an error within a DocumentHandler method (for example, a badly formatted date), the method should throw a SAXException containing an appropriate message to explain the problem. After this, the parser deals with the situation exactly as if it had detected the error itself. Typically, it doesn't attempt to catch the exception, but exits immediately from the parse() method with the same exception, which the top-level application can then catch.
When the parser detects an XML syntax error, it will supply details of the error in a SAXParseException object. This object will include details of the URL, line, and column where the error occurred (a line number on its own is not much use, because the error may be in some external entity not in the main document). When you catch the SAXParseException in your application, you can extract this information and display it so the user can locate the error.
If the problem with the XML file is detected at application level (for example, an invalid date), it is equally important to tell the user where the problem was found, but this time you can't rely on the SAXParseException to locate it. Instead, SAX defines a Locator interface. The SAX specification doesn't insist that parsers supply a Locator, but most parsers do.
One of the methods you must implement in a document handler is the setDocumentLocator() method. If the parser maintains location information it will call this method to tell the document handler where to find the Locator object. At any subsequent time while your document handler is processing an event it can ask the Locator object for details of the current coordinates in the source document. There are three coordinates:
• The URL of the document or external entity currently being processed
• The line number within that URL
• The column number within that line
This is of course exactly the same information that you can get from a SAXParseException object, and in fact one of the things you can do very easily when your application detects an error is to throw a SAXParseException that takes the coordinates directly from the Locator object: just write:
Why wasn't the location information simply included in the events passed to the document handler, such as startElement()? The reason is efficiency: most applications only want location information if something goes wrong, so there should be minimal overhead incurred when it is not needed. Supplying location information with each call from the parser to the document handler would be unnecessarily expensive.
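A sketch of an application-level check that reports its location in this way follows. The names OrderDateChecker and LocatorSketch, the date attribute, and the use of java.time.LocalDate for validation are all assumptions made for the example.

```java
import java.io.StringReader;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical handler that validates a "date" attribute on any element.
@SuppressWarnings("deprecation")
class OrderDateChecker extends HandlerBase {
    private Locator locator; // stays null if the parser does not supply one

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String name, AttributeList atts) throws SAXException {
        String date = atts.getValue("date");
        if (date == null) {
            return;
        }
        try {
            LocalDate.parse(date); // expects the form yyyy-mm-dd
        } catch (DateTimeParseException e) {
            // This constructor copies the current coordinates from the Locator.
            throw new SAXParseException("Invalid date: " + date, locator);
        }
    }
}

@SuppressWarnings("deprecation")
class LocatorSketch {
    // Returns the first problem found, or null if the document passes.
    static SAXParseException firstProblem(String xml) throws Exception {
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        parser.setDocumentHandler(new OrderDateChecker());
        try {
            parser.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParseException e =
            firstProblem("<orders>\n<order date=\"2020-13-45\"/>\n</orders>");
        System.out.println(e.getMessage() + " at line " + e.getLineNumber());
    }
}
```

The handler asks for nothing from the Locator until an error occurs, so the cost of location tracking is paid only when it is needed.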
After this excursion into the world of error handling, let's develop a slightly more complex example SAX application.
The task this time is for the application to print the average price of fiction books in the catalog. We'll use the same data file (books.xml) as in our previous example.
We are interested only in those <book> elements that have the attribute category="fiction", and for these we are interested only in the contents of the <price> child element. We add up the prices, count the books, and at the end divide the total price by the number of books.
Here's our first version of the application:
There are three main points to note in this code:
• The application needs to maintain one piece of context, namely whether the current book is fiction or not. It uses an instance variable to remember this, setting isFiction to true when a start tag for a fiction book is encountered, and to false when a start tag for a non-fiction book is read.
• See how the character content is accumulated in a Java StringBuffer and is not actually processed until the endElement() event is notified. This kills two birds with one stone: it solves the problem that the content of a single element might be broken up and notified piecemeal; at the same time, it means that when we handle the data, we know which element we are dealing with. The StringBuffer is emptied whenever a start or end tag is read, which means that when the application gets to the end tag of a PCDATA element (one that contains character data only) the buffer will contain the character data of that element.
• The application needs to do something sensible when the price of a book is not a valid number. (Until XML Schemas become standardized, we can't rely on the parser to do this piece of validation for us: DTDs provide no way of restricting the data type of character data within an element.) This condition is detected by the fact that the Java constructor Double(String s), which converts a String to a number, reports an exception. The relevant code catches this exception, and reports a SAXException describing the problem. This will cause the parsing to be terminated with an appropriate error message.
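The three points can be seen together in the following sketch. It is an approximation written for illustration, not the listing discussed above: the class name AveragePriceSketch is invented, the document layout (a category attribute on <book>, with a <price> child) is taken from the description, and Double.parseDouble() is used in place of the Double(String) constructor.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;

@SuppressWarnings("deprecation") // SAX 1.0 interfaces are deprecated in current JDKs
class AveragePriceSketch extends HandlerBase {
    private boolean isFiction = false;                        // the one piece of context
    private final StringBuffer content = new StringBuffer();  // text of the current element
    private double total = 0;
    private int count = 0;

    @Override
    public void startElement(String name, AttributeList atts) {
        if (name.equals("book")) {
            isFiction = "fiction".equals(atts.getValue("category"));
        }
        content.setLength(0);
    }

    @Override
    public void characters(char[] chars, int start, int len) {
        content.append(chars, start, len); // may be called several times per element
    }

    @Override
    public void endElement(String name) throws SAXException {
        if (name.equals("price") && isFiction) {
            try {
                total += Double.parseDouble(content.toString().trim());
                count++;
            } catch (NumberFormatException e) {
                throw new SAXException("Price is not numeric: " + content);
            }
        }
        content.setLength(0);
    }

    static double averageFictionPrice(String xml) throws Exception {
        AveragePriceSketch handler = new AveragePriceSketch();
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        parser.setDocumentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.count == 0 ? 0 : handler.total / handler.count;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample data, not the book's books.xml.
        System.out.println(averageFictionPrice(
            "<books>"
            + "<book category=\"fiction\"><title>A</title><price>10.00</price></book>"
            + "<book category=\"fiction\"><title>B</title><price>20.00</price></book>"
            + "<book category=\"reference\"><title>C</title><price>50.00</price></book>"
            + "</books>"));
    }
}
```

For the invented data in main() the two fiction prices average to 15.0; the reference book is skipped because isFiction was set to false at its start tag.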
When the code is run on our example XML file it produces the following output:
But the program isn't yet perfect.
Firstly, it can easily fail if the structure of the input document is not as expected. For example, it will give wrong answers if the <price> element occurs other than in a <book>, or if there is a <book> with no <price>, or if a <price> element has its own child elements. Such things might happen because there is no DTD, or because a non-validating parser is used that doesn't check the DTD, or because a document is submitted that uses a different DTD from that expected, or because the DTD has been enhanced since the program was written.
Secondly, the diagnostics when errors are detected are rather unfriendly. The user will be told that a price is not numeric, but there may be hundreds of books in the list: it would be more helpful to say which one. Even more helpful would be to report all the errors in a single run, so that the user doesn't have to run the program once to find and correct each separate error. (Actually, most XML parsers will only report one syntax error in a single run, so there's a limit to what we can achieve here.)
In the next section we'll look at how to maintain more information about element context, which is necessary if we're to do more thorough validation. Before that, we'll make one improvement in the area of error handling. We'll use the Locator object to determine where in the source document the error occurred, and report it accordingly.
In order to show what happens clearly we've switched from James Clark's xp parser to IBM Alphaworks' xml4j, which provides clearer messages. Here is the revised program.
This version of the application can also be found on our web site at http://www.wrox.com
This version of the code improves the diagnostics with very little extra effort. The revised application does three things:
• It keeps a note of the Locator object supplied by the parser.
• When an error occurs, it uses the Locator object to print information about the location of the error before generating the SAXException. Note that the application has to allow for the case where there is no Locator, because SAX doesn't require the parser to supply one.
• It also includes details of the original "root cause" exception (the NumberFormatException) encapsulated within the SAXException, again allowing more precise diagnostics to be written.
This is the output we got from the xml4j parser, after modifying the price of Moby Dick from "8.99" to "A.99":
In this example the application produces a message containing location information before throwing the exception, and then produces the real error message when the exception is caught at the top level. An alternative is to pass the location information as part of the exception, which could be done by throwing a SAXParseException instead of an ordinary SAXException. However, the application still has to deal with the case where there is no Locator, in which case throwing a SAXParseException is not very convenient. An alternative here would be for your application to create its own default locator (containing no useful information) for use when the parser doesn't supply one.
We've seen in both the examples so far that the DocumentHandler generally needs to maintain some kind of context information as the parse proceeds. In the first case all that it did was to accumulate a count of elements; in the second example the DocumentHandler kept track of whether or not we were currently within a <book> element with category="fiction".
Nearly all realistic SAX applications will need to maintain some context information of this kind. Very often, it's appropriate to keep track of the current element nesting, and in many cases it's also useful to know the attributes of all the ancestor elements of the data we're currently processing.
The obvious data structure to hold this information is a Stack, because it's natural to add information about an element when we reach its start tag, and to remove that information when we reach its end tag. A stack, of course, still requires far less memory than you would need to store the whole document, because the maximum number of entries on the stack is only as great as the maximum nesting of elements, which even in large and complex documents rarely exceeds a depth of ten or so.
We can see how a stack can be useful if we modify the requirements for the previous example application. This time we'll allow our book catalog to include multi-volume books, with a price for each volume and a price for the whole set. In calculating the average price, we want to consider the price of the whole set, not the price of the individual volumes.
The source document might now look like this (it's also available via the web site at http://www.wrox.com):
One way of handling this would be to introduce another flag in the program, which we set when we encounter a <volume> start tag, and unset when we find a </volume> end tag; we could ignore a <price> element if this flag is set. But this style of programming quickly leads to a proliferation of flags and complex nesting of if-then-else conditions. A better approach is to put all the information about the currently open elements on a stack, which we can then interrogate as required.
Here's the new version of the application:
Here is the expected output:
It might seem that maintaining this stack is a lot of effort for rather a small return. But it's a worthwhile investment. All real applications become more complex over time, and it's worth having a structure that allows the logic to evolve without destroying the structure of the program. Note how the condition tests such as isFiction() and isVolume() have now become methods applied to the context data structure rather than flags that are maintained as events occur. As the number of conditions to be tested multiplies, we can write more of these methods without increasing the complexity of the startElement() and endElement() methods.
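A sketch of the stack-based approach is given below. It is written from the description above rather than copied from the listing: the names OpenElement and StackedAverageSketch are invented, the nesting of <volume> inside <book> is an assumption about the document layout, and java.util.ArrayDeque is used in place of java.util.Stack.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

// One entry per currently open element (hypothetical helper class).
@SuppressWarnings("deprecation")
class OpenElement {
    final String name;
    final AttributeList atts;

    OpenElement(String name, AttributeList atts) {
        this.name = name;
        this.atts = new AttributeListImpl(atts); // private copy; the original is reused
    }
}

@SuppressWarnings("deprecation")
class StackedAverageSketch extends HandlerBase {
    private final Deque<OpenElement> open = new ArrayDeque<>();
    private final StringBuffer content = new StringBuffer();
    private double total = 0;
    private int count = 0;

    @Override
    public void startElement(String name, AttributeList atts) {
        open.push(new OpenElement(name, atts));
        content.setLength(0);
    }

    @Override
    public void characters(char[] chars, int start, int len) {
        content.append(chars, start, len);
    }

    @Override
    public void endElement(String name) throws SAXException {
        if (name.equals("price") && isFiction() && !isVolume()) {
            try {
                total += Double.parseDouble(content.toString().trim());
                count++;
            } catch (NumberFormatException e) {
                throw new SAXException("Price is not numeric: " + content);
            }
        }
        content.setLength(0);
        open.pop(); // this element is no longer open
    }

    // Context tests are questions put to the stack, not flags kept up to date.
    private boolean isFiction() {
        for (OpenElement e : open) {
            if (e.name.equals("book") && "fiction".equals(e.atts.getValue("category"))) {
                return true;
            }
        }
        return false;
    }

    private boolean isVolume() {
        for (OpenElement e : open) {
            if (e.name.equals("volume")) {
                return true;
            }
        }
        return false;
    }

    static double averageFictionPrice(String xml) throws Exception {
        StackedAverageSketch handler = new StackedAverageSketch();
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        parser.setDocumentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.count == 0 ? 0 : handler.total / handler.count;
    }
}
```

Adding a further condition, such as a test for books in a particular series, would mean writing one more method like isVolume() while leaving startElement() and endElement() almost untouched.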