Our example SAX applications have only been interested in processing one or two different element types, and the processing has been very simple. In real applications where there is a need to process many different element types, this style of program can quickly become very unstructured. This happens for two reasons: firstly, the interactions of different events processing the same global context data can become difficult to disentangle, and secondly, each of the event-handling methods is doing a number of quite unrelated tasks.
So there is a need to think carefully about the design of a SAX application to prevent this happening. This section presents some of the possibilities. We'll look at two commonly used patterns: the filter pattern and the rule-based pattern.
In the filter design pattern, which is also sometimes called the pipeline pattern, each stage of processing can be represented as a section of a pipeline: the data flows through the pipe, and each section of the pipe filters the data as it passes through. This is illustrated in the diagram below:
There are many different things a filter can do, for example:
• Remove elements of the source document that are not wanted
• Modify tags or attribute names
• Perform validation
• Normalize data values such as dates
The important characteristic of this design is that each filter has an input and an output, both of which conform to the same interface. The filter implements the interface at one end, and is a client of the same interface at the other end. So if we consider any adjacent pair of filters, the left-hand one acts as the Parser and the right-hand one as the DocumentHandler. And indeed, the filters in this structure will generally implement both the SAX Parser and DocumentHandler interfaces. ("Parser," of course, is a misnomer here. The characteristic of a SAX Parser is not that it understands the lexical and syntactic rules of XML, but that it notifies events to a DocumentHandler. Any program that performs such notification can implement the SAX Parser interface, even though it doesn't do any actual parsing.)
It is also possible for a filter to have more than one output, notifying the events to more than one recipient, or less commonly, for a filter to have more than one input, merging events from several sources.
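A filter with more than one output can be sketched as a "tee" that repeats each event to every recipient. The classes below (TeeSketch, ElementCounter, TeeDemo) are hypothetical names invented for this illustration. They use the SAX 1 interfaces that still ship with the JDK, where they are marked as deprecated, and the events are fired by hand rather than by a real parser.

```java
import java.util.Arrays;
import java.util.List;
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical "tee" stage: one input, any number of outputs.
class TeeSketch extends HandlerBase {
    private final List<DocumentHandler> outputs;

    TeeSketch(DocumentHandler... outputs) { this.outputs = Arrays.asList(outputs); }

    // Each event is repeated, in order, to every registered recipient.
    @Override public void startDocument() throws SAXException {
        for (DocumentHandler h : outputs) h.startDocument();
    }
    @Override public void endDocument() throws SAXException {
        for (DocumentHandler h : outputs) h.endDocument();
    }
    @Override public void startElement(String name, AttributeList atts) throws SAXException {
        for (DocumentHandler h : outputs) h.startElement(name, atts);
    }
    @Override public void endElement(String name) throws SAXException {
        for (DocumentHandler h : outputs) h.endElement(name);
    }
    @Override public void characters(char[] ch, int start, int length) throws SAXException {
        for (DocumentHandler h : outputs) h.characters(ch, start, length);
    }
}

// Counts start tags; it stands in for any real downstream stage.
class ElementCounter extends HandlerBase {
    int count = 0;
    @Override public void startElement(String name, AttributeList atts) { count++; }
}

class TeeDemo {
    // Fires a tiny event stream by hand and returns what each recipient counted.
    static int[] run() {
        ElementCounter left = new ElementCounter();
        ElementCounter right = new ElementCounter();
        TeeSketch tee = new TeeSketch(left, right);
        try {
            tee.startDocument();
            tee.startElement("doc", new AttributeListImpl());
            tee.startElement("p", new AttributeListImpl());
            tee.endElement("p");
            tee.endElement("doc");
            tee.endDocument();
        } catch (SAXException e) {
            throw new RuntimeException(e);
        }
        return new int[] { left.count, right.count };
    }

    public static void main(String[] args) {
        int[] counts = run();
        System.out.println(counts[0] + " " + counts[1]);
    }
}
```

Both recipients see the same two start tags. A merging filter, with several inputs and one output, would additionally need a policy for interleaving the incoming events.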
The power of the filter design pattern is that the filters are highly reusable, because just like real plumbing, the same standard filters can be plugged together in many different ways.
There are a number of tools around for constructing a pipeline of this form. The simplest is John Cowan's ParserFilter class, available from http://www.ccil.org/~cowan/XML/. This is an abstract class: it does the things that every filter needs to do, and leaves you to define a subclass for each specific filter needed in your own pipeline.
As you might expect, ParserFilter implements both the SAX Parser and DocumentHandler interfaces; in fact, for good measure, it implements the other SAX event-handling interfaces as well (DTDHandler, ErrorHandler, and EntityResolver). All that the event-handling methods in this class do is to pass the event on to the next filter in the pipeline: it's up to your subclass to override any methods that need to do useful work.
The ParserFilter class has a constructor that takes a Parser as its parameter: the effect is to create a piece of the pipeline and connect it to another piece on its left. To construct our three-stage pipeline in the diagram above, we could write:
The initial input to the pipeline is of course a SAX Parser and the final output is a SAX DocumentHandler.
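To make the plumbing concrete, here is a minimal, self-contained sketch of the same idea. SketchFilter is a hypothetical stand-in for ParserFilter, not John Cowan's class: it handles only the document events and simply delegates the other Parser settings. DropNotes and UpperCaseText are toy filters invented for the illustration, and the underlying SAX 1 parser is obtained from the JDK through the deprecated SAXParser.getParser() method.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.*;

// Hypothetical stand-in for ParserFilter: a Parser to the stage on its right,
// a DocumentHandler to the stage on its left. By default events pass through unchanged.
class SketchFilter implements Parser, DocumentHandler {
    private final Parser source;                       // stage on the left
    private DocumentHandler next = new HandlerBase();  // stage on the right

    SketchFilter(Parser source) { this.source = source; }

    // Parser side: remember the downstream handler, delegate everything else.
    public void setDocumentHandler(DocumentHandler h) { next = h; }
    public void setLocale(Locale l) throws SAXException { source.setLocale(l); }
    public void setEntityResolver(EntityResolver r) { source.setEntityResolver(r); }
    public void setDTDHandler(DTDHandler h) { source.setDTDHandler(h); }
    public void setErrorHandler(ErrorHandler h) { source.setErrorHandler(h); }
    public void parse(InputSource in) throws SAXException, IOException {
        source.setDocumentHandler(this);  // hook this stage in, then start the flow
        source.parse(in);
    }
    public void parse(String systemId) throws SAXException, IOException {
        parse(new InputSource(systemId));
    }

    // DocumentHandler side: pass each event on; subclasses override what they need.
    public void setDocumentLocator(Locator l) { next.setDocumentLocator(l); }
    public void startDocument() throws SAXException { next.startDocument(); }
    public void endDocument() throws SAXException { next.endDocument(); }
    public void startElement(String name, AttributeList atts) throws SAXException { next.startElement(name, atts); }
    public void endElement(String name) throws SAXException { next.endElement(name); }
    public void characters(char[] ch, int s, int n) throws SAXException { next.characters(ch, s, n); }
    public void ignorableWhitespace(char[] ch, int s, int n) throws SAXException { next.ignorableWhitespace(ch, s, n); }
    public void processingInstruction(String t, String d) throws SAXException { next.processingInstruction(t, d); }
}

// Toy filter: removes <note> elements together with their content.
class DropNotes extends SketchFilter {
    private int depth = 0;  // greater than zero while inside a <note>
    DropNotes(Parser p) { super(p); }
    @Override public void startElement(String name, AttributeList atts) throws SAXException {
        if (depth > 0 || name.equals("note")) depth++; else super.startElement(name, atts);
    }
    @Override public void endElement(String name) throws SAXException {
        if (depth > 0) depth--; else super.endElement(name);
    }
    @Override public void characters(char[] ch, int s, int n) throws SAXException {
        if (depth == 0) super.characters(ch, s, n);
    }
}

// Toy filter: converts character data to upper case.
class UpperCaseText extends SketchFilter {
    UpperCaseText(Parser p) { super(p); }
    @Override public void characters(char[] ch, int s, int n) throws SAXException {
        char[] up = new String(ch, s, n).toUpperCase(Locale.ROOT).toCharArray();
        super.characters(up, 0, up.length);
    }
}

class PipelineDemo {
    // Runs parser -> DropNotes -> UpperCaseText -> recorder and returns what was recorded.
    static String run(String xml) {
        StringBuilder log = new StringBuilder();
        try {
            Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
            Parser pipeline = new UpperCaseText(new DropNotes(parser));
            pipeline.setDocumentHandler(new HandlerBase() {
                @Override public void startElement(String n, AttributeList a) { log.append('<').append(n).append('>'); }
                @Override public void endElement(String n) { log.append("</").append(n).append('>'); }
                @Override public void characters(char[] c, int s, int l) { log.append(c, s, l); }
            });
            pipeline.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run("<doc><p>hello</p><note>skip</note></doc>"));
    }
}
```

Each constructor call attaches a new stage to the right of the one passed in, so the nesting order of the constructors, read from the inside out, is the order in which the events flow.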
Here is a complete working example of a ParserFilter called Indenter. This filter takes a stream of SAX events, and massages the data by adding whitespace before start and end tags to make the nested structure of the document visible on display. It then passes the massaged data to the next DocumentHandler (which might, of course, be another filter).
The code should be self-explanatory. Note how it relies on the methods in the superclass to actually send the events to the DocumentHandler:
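A minimal sketch of an indenting filter in this spirit is shown below. To stay self-contained it is written as a plain DocumentHandler that wraps the next handler, instead of as a subclass of ParserFilter. The class names and the two-space indent are assumptions made for the illustration.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical indenting stage: inserts a newline and indentation before tags.
class IndenterSketch extends HandlerBase {
    private final DocumentHandler next;
    private int level = 0;                // current nesting depth
    private boolean afterEndTag = false;  // true when the last tag reported was an end tag

    IndenterSketch(DocumentHandler next) { this.next = next; }

    @Override public void startDocument() throws SAXException { next.startDocument(); }
    @Override public void endDocument() throws SAXException { next.endDocument(); }

    @Override public void startElement(String name, AttributeList atts) throws SAXException {
        newlineAndIndent();
        level++;
        afterEndTag = false;
        next.startElement(name, atts);
    }

    @Override public void endElement(String name) throws SAXException {
        level--;
        // An end tag gets its own line only when it follows another end tag,
        // so that <p>text</p> stays on a single line.
        if (afterEndTag) newlineAndIndent();
        afterEndTag = true;
        next.endElement(name);
    }

    @Override public void characters(char[] ch, int start, int length) throws SAXException {
        next.characters(ch, start, length);
    }

    // Sends the extra whitespace downstream as ordinary character data.
    private void newlineAndIndent() throws SAXException {
        char[] ws = ("\n" + "  ".repeat(level)).toCharArray();
        next.characters(ws, 0, ws.length);
    }
}

class IndenterDemo {
    // Fires the events for <doc><p>hi</p></doc> by hand and records the result.
    static String run() {
        StringBuilder log = new StringBuilder();
        IndenterSketch indenter = new IndenterSketch(new HandlerBase() {
            @Override public void startElement(String n, AttributeList a) { log.append('<').append(n).append('>'); }
            @Override public void endElement(String n) { log.append("</").append(n).append('>'); }
            @Override public void characters(char[] c, int s, int l) { log.append(c, s, l); }
        });
        try {
            indenter.startDocument();
            indenter.startElement("doc", new AttributeListImpl());
            indenter.startElement("p", new AttributeListImpl());
            indenter.characters("hi".toCharArray(), 0, 2);
            indenter.endElement("p");
            indenter.endElement("doc");
            indenter.endDocument();
        } catch (SAXException e) {
            throw new RuntimeException(e);
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The sketch does not try to strip whitespace that is already present in the source document, so a document that is already indented would end up with both sets of whitespace.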
To actually run this example, we will need a DocumentHandler that outputs the XML. Let's suppose this exists and is called XMLOutputter (we'll show how XMLOutputter is written in the next section). We can then write a main program as follows:
And you will also have to add an import statement for the ParserManager class at the top of the file:
We've made the program a bit more realistic by making the input file an argument that you can specify on the command line (retrieved from args), and by creating the underlying SAX Parser using the ParserManager class that we introduced earlier. It's still not a production-quality program (for example, it falls over if called without an input argument), but it's getting closer. Once you have set up the classpath (remember that to use ParserManager, the file ParserManager.properties must also be on the classpath), you can run this program from the command line, for example:
The output appears nicely indented. Because the argument is a URL, you can format any XML file on the web.
Very often, as in the previous example, the final output of the pipeline will be a new XML document. So you will often need a DocumentHandler that uses the events coming out of the pipeline to generate an XML document: a sort of parser in reverse.
Surprisingly, we couldn't find a DocumentHandler on the web that does this, so we've written one and included it here.
Here is the class. It's reasonably straightforward, except for the code that generates entity and character references for special characters, which uses some of Java's less intuitive methods for manipulating Strings and arrays.
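As a rough illustration of what such a handler involves, the following sketch writes start tags, end tags, and character data to a Writer, escaping special characters as it goes. OutputterSketch and its escape() helper are hypothetical names for this example. The choice to write every non-ASCII character as a numeric character reference is an assumption, not necessarily what XMLOutputter does.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical "parser in reverse": turns SAX 1 events back into XML text.
class OutputterSketch extends HandlerBase {
    private final Writer out;

    OutputterSketch(Writer out) { this.out = out; }

    // Replaces markup characters with entity references and non-ASCII
    // characters with decimal character references.
    static String escape(String s, boolean inAttribute) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int c = s.codePointAt(i);
            i += Character.charCount(c);
            if (c == '<') sb.append("&lt;");
            else if (c == '>') sb.append("&gt;");
            else if (c == '&') sb.append("&amp;");
            else if (c == '"' && inAttribute) sb.append("&quot;");
            else if (c > 127) sb.append("&#").append(c).append(';');
            else sb.appendCodePoint(c);
        }
        return sb.toString();
    }

    @Override public void startElement(String name, AttributeList atts) throws SAXException {
        StringBuilder tag = new StringBuilder("<").append(name);
        for (int i = 0; i < atts.getLength(); i++) {
            tag.append(' ').append(atts.getName(i)).append("=\"")
               .append(escape(atts.getValue(i), true)).append('"');
        }
        write(tag.append('>').toString());
    }

    @Override public void endElement(String name) throws SAXException {
        write("</" + name + ">");
    }

    @Override public void characters(char[] ch, int start, int length) throws SAXException {
        write(escape(new String(ch, start, length), false));
    }

    @Override public void endDocument() throws SAXException {
        try { out.flush(); } catch (IOException e) { throw new SAXException(e); }
    }

    // SAX handlers cannot throw IOException, so it is wrapped in a SAXException.
    private void write(String s) throws SAXException {
        try { out.write(s); } catch (IOException e) { throw new SAXException(e); }
    }
}

class OutputterDemo {
    // Fires a few events by hand and returns the XML text that was written.
    static String run() {
        StringWriter w = new StringWriter();
        OutputterSketch o = new OutputterSketch(w);
        AttributeListImpl atts = new AttributeListImpl();
        atts.addAttribute("title", "CDATA", "a \"b\"");
        char[] text = "1 < 2 & caf\u00e9".toCharArray();
        try {
            o.startDocument();
            o.startElement("p", atts);
            o.characters(text, 0, text.length);
            o.endElement("p");
            o.endDocument();
        } catch (SAXException e) {
            throw new RuntimeException(e);
        }
        return w.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

A fuller version would also write an XML declaration, pass processing instructions through, and cope with a surrogate pair that is split across two characters() calls.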
Now you can see how SAX can be used to write XML documents as well as reading them. In fact, you can run SAX back-to-front: instead of the Parser being standard software that someone else writes, and the DocumentHandler being your specific application code, you can write an implementation of org.xml.sax.Parser that contains your application logic for generating XML, and couple it to this off-the-shelf DocumentHandler for writing XML output!
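As a small sketch of this back-to-front arrangement, the class below implements org.xml.sax.Parser without reading any XML at all: it reports a list of words as a stream of events. WordListSource and the element names words and word are invented for the illustration, and a simple recording handler stands in for the XML-writing DocumentHandler.

```java
import java.util.Locale;
import org.xml.sax.AttributeList;
import org.xml.sax.DTDHandler;
import org.xml.sax.DocumentHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical event source: application logic dressed up as a SAX 1 Parser.
class WordListSource implements Parser {
    private final String[] words;
    private DocumentHandler handler = new HandlerBase();

    WordListSource(String... words) { this.words = words; }

    public void setDocumentHandler(DocumentHandler h) { handler = h; }

    // The other configuration methods are accepted and ignored in this sketch.
    public void setLocale(Locale l) { }
    public void setEntityResolver(EntityResolver r) { }
    public void setDTDHandler(DTDHandler h) { }
    public void setErrorHandler(ErrorHandler h) { }

    public void parse(String systemId) throws SAXException { parse((InputSource) null); }

    // The input is ignored: the "document" is generated from the word list.
    public void parse(InputSource ignored) throws SAXException {
        AttributeListImpl none = new AttributeListImpl();
        handler.startDocument();
        handler.startElement("words", none);
        for (String w : words) {
            handler.startElement("word", none);
            handler.characters(w.toCharArray(), 0, w.length());
            handler.endElement("word");
        }
        handler.endElement("words");
        handler.endDocument();
    }
}

class BackToFrontDemo {
    // Couples the event source to a recording handler and returns the result.
    static String run() {
        StringBuilder log = new StringBuilder();
        WordListSource source = new WordListSource("alpha", "beta");
        source.setDocumentHandler(new HandlerBase() {
            @Override public void startElement(String n, AttributeList a) { log.append('<').append(n).append('>'); }
            @Override public void endElement(String n) { log.append("</").append(n).append('>'); }
            @Override public void characters(char[] c, int s, int l) { log.append(c, s, l); }
        });
        try {
            source.parse((InputSource) null);
        } catch (SAXException e) {
            throw new RuntimeException(e);
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```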
NamespaceFilter is a ParserFilter that implements the XML Namespaces recommendation, described in Chapter 7. It is available from John Cowan's web site at http://www.ccil.org/~cowan/XML/.
SAX was defined before the XML Namespaces recommendation was published, and takes no account of it. If an element name is written in the source document as <html:table>, then the element name passed to the startElement() method will be "html:table". There is no simple way for the application to determine which namespace "html" is referring to.
The NamespaceFilter solves this problem. It keeps track of all the namespace declarations in the document (that is, the "xmlns:xxx" attributes), and when a prefixed element or attribute name is reported by the SAX parser, it substitutes the full namespace URI for the prefix before passing it on down the pipeline. For example, if the element start tag is <html:table xmlns:html="http://www.w3.org/TR/REC-html40"> then the element name passed on to the next DocumentHandler will be "http://www.w3.org/TR/REC-html40^table". The circumflex character was chosen to separate the namespace URI from the local part of the element name because it's a character that can't appear in URIs or in XML names.
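The substitution can be sketched with a stack of prefix-to-URI maps, one per open element. The code below is an illustration of the technique, not the NamespaceFilter source. The class names are hypothetical, and two behaviors are assumptions: names with undeclared prefixes are passed through unchanged, and the xmlns attributes themselves are left in the attribute list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical namespace-resolving stage: rewrites prefix:local as uri^local.
class NamespaceSketch extends HandlerBase {
    private final DocumentHandler next;
    // One prefix-to-URI map per open element; each starts as a copy of its parent's.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    NamespaceSketch(DocumentHandler next) {
        this.next = next;
        scopes.push(new HashMap<>());
    }

    @Override public void startElement(String name, AttributeList atts) throws SAXException {
        // First pass: pick up the declarations made on this element.
        Map<String, String> scope = new HashMap<>(scopes.peek());
        for (int i = 0; i < atts.getLength(); i++) {
            String a = atts.getName(i);
            if (a.equals("xmlns")) scope.put("", atts.getValue(i));
            else if (a.startsWith("xmlns:")) scope.put(a.substring(6), atts.getValue(i));
        }
        scopes.push(scope);
        // Second pass: expand attribute names. Declarations are copied untouched,
        // and the default namespace is not applied to unprefixed attributes.
        AttributeListImpl out = new AttributeListImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            String a = atts.getName(i);
            boolean declaration = a.equals("xmlns") || a.startsWith("xmlns:");
            out.addAttribute(declaration ? a : expand(a, scope, false), atts.getType(i), atts.getValue(i));
        }
        next.startElement(expand(name, scope, true), out);
    }

    @Override public void endElement(String name) throws SAXException {
        next.endElement(expand(name, scopes.pop(), true));
    }

    @Override public void characters(char[] ch, int start, int length) throws SAXException {
        next.characters(ch, start, length);
    }

    private static String expand(String qname, Map<String, String> scope, boolean useDefault) {
        int colon = qname.indexOf(':');
        if (colon < 0 && !useDefault) return qname;
        String uri = scope.get(colon < 0 ? "" : qname.substring(0, colon));
        if (uri == null || uri.isEmpty()) return qname;  // nothing declared: leave as is
        return uri + "^" + qname.substring(colon + 1);
    }
}

class NamespaceDemo {
    // Returns the element names seen by the stage after the namespace stage.
    static List<String> run() {
        List<String> seen = new ArrayList<>();
        NamespaceSketch filter = new NamespaceSketch(new HandlerBase() {
            @Override public void startElement(String name, AttributeList atts) { seen.add(name); }
        });
        AttributeListImpl decl = new AttributeListImpl();
        decl.addAttribute("xmlns:html", "CDATA", "http://www.w3.org/TR/REC-html40");
        try {
            filter.startElement("html:table", decl);
            filter.startElement("html:tr", new AttributeListImpl());
            filter.endElement("html:tr");
            filter.endElement("html:table");
            filter.startElement("html:p", new AttributeListImpl());  // prefix is out of scope here
            filter.endElement("html:p");
        } catch (SAXException e) {
            throw new RuntimeException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Copying the parent's map for every element is simple but not the cheapest approach. A production filter would more likely record only the declarations made on each element and search the stack outwards.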
Sometimes applications want to know the prefix as well as the namespace URI (for example, for use in error messages). NamespaceFilter doesn't provide this information, but it could easily be extended to do so.
InheritanceFilter, another ParserFilter, is also available from John Cowan's web site at http://www.ccil.org/~cowan/XML/.
Many XML document designs use the concept of an inheritable attribute. The idea is that if a particular attribute is not present on an element, the value is taken from the same attribute on a containing element. The XML standard itself uses this idea for the special attributes xml:lang and xml:space, and it is extensively used in some other standards such as the XSL Formatting Objects proposal.
InheritanceFilter is a ParserFilter that extends the attribute list passed to the startElement() method to include attributes that were not actually present on that element, but were inherited from parent elements. The InheritanceFilter needs to be primed with a list of attribute names that are to be treated as inherited attributes.
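The mechanism can be sketched with a stack that records, for each open element, the inherited values currently in effect. This is an illustration of the idea and not the InheritanceFilter source. The class names are hypothetical, and reporting inherited attributes with the type CDATA is an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical inheritance stage: fills in attributes inherited from ancestors.
class InheritanceSketch extends HandlerBase {
    private final DocumentHandler next;
    private final Set<String> inheritable;
    // For each open element, the inherited values in effect inside it.
    private final Deque<Map<String, String>> stack = new ArrayDeque<>();

    // The filter is primed with the attribute names to treat as inherited.
    InheritanceSketch(DocumentHandler next, String... inheritableNames) {
        this.next = next;
        this.inheritable = new HashSet<>(Arrays.asList(inheritableNames));
        stack.push(new HashMap<>());
    }

    @Override public void startElement(String name, AttributeList atts) throws SAXException {
        Map<String, String> inEffect = new HashMap<>(stack.peek());
        AttributeListImpl out = new AttributeListImpl(atts);  // copy of the real attributes
        for (String attr : inheritable) {
            String own = atts.getValue(attr);
            if (own != null) {
                inEffect.put(attr, own);                      // an explicit value overrides
            } else if (inEffect.containsKey(attr)) {
                out.addAttribute(attr, "CDATA", inEffect.get(attr));
            }
        }
        stack.push(inEffect);
        next.startElement(name, out);
    }

    @Override public void endElement(String name) throws SAXException {
        stack.pop();
        next.endElement(name);
    }

    @Override public void characters(char[] ch, int start, int length) throws SAXException {
        next.characters(ch, start, length);
    }
}

class InheritanceDemo {
    // Returns "element=language" for each start tag seen downstream.
    static List<String> run() {
        List<String> seen = new ArrayList<>();
        InheritanceSketch filter = new InheritanceSketch(new HandlerBase() {
            @Override public void startElement(String name, AttributeList atts) {
                seen.add(name + "=" + atts.getValue("xml:lang"));
            }
        }, "xml:lang");
        AttributeListImpl en = new AttributeListImpl();
        en.addAttribute("xml:lang", "CDATA", "en");
        AttributeListImpl fr = new AttributeListImpl();
        fr.addAttribute("xml:lang", "CDATA", "fr");
        AttributeListImpl none = new AttributeListImpl();
        try {
            filter.startElement("doc", en);
            filter.startElement("p", none);
            filter.endElement("p");
            filter.startElement("q", fr);
            filter.startElement("b", none);
            filter.endElement("b");
            filter.endElement("q");
            filter.startElement("p", none);
            filter.endElement("p");
            filter.endElement("doc");
        } catch (SAXException e) {
            throw new RuntimeException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```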
XLinkFilter is a ParserFilter that provides support for the draft XLink specification for creating hyperlinks between XML documents. It is published by Simon St. Laurent at http://www.simonstl.com/projects/xlinkfilter/.
Unlike most ParserFilters, an XLinkFilter passes all the events through unchanged. While doing so, however, it constructs a data structure reflecting the XLink attributes encountered in the document. This data structure can then be interrogated by subsequent stages in the pipeline.
One kind of link defined in the XLink specification is a so-called "inclusion" link where the linked text is designed to appear inline within the main document – rather like a preprocessor #include directive in C. The XLink syntax for this is show="parsed". This is very similar to an external entity reference, except that the application has some control over the decision whether and when to include the text: for example, the user might have a choice to display the long or short forms of a document. It would be quite possible, of course, to implement a filter that expanded such links directly, presenting an included document to subsequent pipeline stages as if it were physically embedded in the original document.
One potential difficulty with a pipeline is that each filter in the pipeline has to work out for itself things that other filters already know; a common example is knowing the parent of the current element. If one filter is already maintaining a stack of elements so that it can determine this, it is wasteful for another filter to do the same thing.
You can get round this by allowing one filter to access data structures set up by a previous filter, either directly or via public methods. However, this requires that the filters in the pipeline know rather more about each other than the pure pipeline model suggests, which reduces your ability to plug filters together in any order. Arguably, when processing reaches this level of complexity, it might be better to forget event-based processing entirely and use the DOM (with a navigational design pattern) instead.
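The public-method approach might look like the following sketch, in which one stage maintains the stack of open elements and a later stage asks it for the parent of the current element. ContextTracker and its parent() method are hypothetical names for this illustration. The later stage holds a direct reference to the tracker, which is exactly the extra coupling described above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical stage that tracks the open elements on behalf of later stages.
class ContextTracker extends HandlerBase {
    private DocumentHandler next = new HandlerBase();
    private final Deque<String> open = new ArrayDeque<>();  // innermost element first

    void setNext(DocumentHandler next) { this.next = next; }

    // Name of the parent of the element currently being reported, or null at the root.
    public String parent() {
        Iterator<String> it = open.iterator();
        if (it.hasNext()) it.next();  // skip the current element
        return it.hasNext() ? it.next() : null;
    }

    @Override public void startElement(String name, AttributeList atts) throws SAXException {
        open.push(name);              // push first, so later stages see the new context
        next.startElement(name, atts);
    }

    @Override public void endElement(String name) throws SAXException {
        next.endElement(name);        // forward first, so the context is still valid
        open.pop();
    }

    @Override public void characters(char[] ch, int start, int length) throws SAXException {
        next.characters(ch, start, length);
    }
}

class ContextDemo {
    // A later stage queries the tracker instead of keeping its own stack.
    static List<String> run() {
        List<String> seen = new ArrayList<>();
        ContextTracker tracker = new ContextTracker();
        tracker.setNext(new HandlerBase() {
            @Override public void startElement(String name, AttributeList atts) {
                seen.add(name + " in " + tracker.parent());
            }
        });
        AttributeListImpl none = new AttributeListImpl();
        try {
            tracker.startElement("doc", none);
            tracker.startElement("p", none);
            tracker.startElement("b", none);
            tracker.endElement("b");
            tracker.endElement("p");
            tracker.endElement("doc");
        } catch (SAXException e) {
            throw new RuntimeException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```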