
Copyright O'Reilly & Associates, Inc.. Used with permission.

Providing entity text

For data without a URI, or that uses a URI scheme not supported by your JVM, applications must provide entity text themselves. There are two ways to provide the text through an InputSource: as character data or as binary data, which needs to be decoded into character data before it can be parsed. In both cases your application will create an open data stream and give it to the parser. It will no longer be owned by your application; the parser should later close it as part of its end-of-input processing. If you provide binary data, you might know the character encoding used with it and can give that information to the parser rather than turning it to character data yourself using something like an InputStreamReader.

InputSource(java.io.Reader in)

Use this constructor when you are providing predecoded data to the parser, which will then ignore what any XML or text declaration says about the character encoding. (Also, call setSystemId(uri) when possible.) This constructor is useful for parsing data from a java.io.Reader such as java.io.CharArrayReader and for working around configuration bugs in HTTP servers.

Some HTTP servers will misidentify the text encoding used for XML documents, using the content type text/xml for non-ASCII data, instead of text/xml;charset=... or application/xml.[1] If you know a particular server does this, and that the encoding won't be autodetected, create an InputSource by using an InputStreamReader that uses the correct encoding. If the correct encoding will be autodetectable, you can use the InputStream constructor.

InputSource(java.io.InputStream in)

Use this constructor when you are providing binary data to a parser and expect the parser to be able to detect the encoding from the binary data. (Also, call setSystemId(uri) when possible.)

For example, UTF-16 text always includes a Byte Order Mark, a document beginning <?xml ... encoding="Big5"?> is understood by most parsers as a Big5 (traditional Chinese) document, and UTF-8 is the default for XML documents without a declaration identifying the actual encoding in use.

InputSource.setEncoding(String id)

Use this method if you know the character encoding used with data you are providing as a java.io.InputStream. (Or provide a java.io.Reader if you can, though some parsers know more about encodings than the underlying JVM does.)[2] If you don't know the encoding, don't guess. XML parsers know how to use XML and text declarations to correctly determine the encoding in use. However, some parsers don't autodetect EBCDIC encodings, which are mostly used with IBM mainframes. You can use this method to help parsers handle documents using such encodings, if you can't provide the document in a fully interoperable encoding such as UTF-8.

All XML parsers support "UTF-8" and "UTF-16" values here, and most support other values, such as US-ASCII and ISO-8859-1. Consult your parser documentation for information about other encodings it supports. Typically, all encodings supported by the underlying JVM will be available, but they might be inconsistently named. (As one example, Sun's JDK supports many EBCDIC encodings, but gives them unusual names that don't suggest they're actually EBCDIC.) You should use standard Internet (IANA) encoding names, rather than Java names, where possible. In particular, don't use the name "UTF8"; use "UTF-8".
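As a rough sketch of the two options just described, the following class shows one helper that decodes the bytes itself with an InputStreamReader and another that hands the raw stream to the parser along with an IANA encoding name. The class name EncodingSources, its method names, and the example.org system ID are invented for illustration, and the parser is obtained through the JDK's bundled JAXP factory only to keep the sketch self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; none of these names come from SAX itself.
public class EncodingSources {

    // Option 1: decode the bytes ourselves, so the parser sees only characters.
    public static InputSource decoded(byte[] data, String encoding, String systemId)
            throws IOException {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(data), encoding);
        InputSource source = new InputSource(reader);
        source.setSystemId(systemId);
        return source;
    }

    // Option 2: hand over raw bytes plus an encoding name we know to be right.
    public static InputSource labeled(byte[] data, String encoding, String systemId) {
        InputSource source = new InputSource(new ByteArrayInputStream(data));
        source.setEncoding(encoding);
        source.setSystemId(systemId);
        return source;
    }

    // Parses the source and returns all the character data it reports.
    public static String textOf(InputSource source)
            throws SAXException, IOException, ParserConfigurationException {
        final StringBuilder text = new StringBuilder();
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        parser.parse(source);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // ISO-8859-1 bytes with no encoding declaration: not autodetectable.
        byte[] data = "<word>caf\u00e9</word>".getBytes(StandardCharsets.ISO_8859_1);
        String id = "http://example.org/word.xml";
        System.out.println(textOf(decoded(data, "ISO-8859-1", id)));
    }
}
```

The decoded() variant relies only on the JVM's encoding support, while labeled() leaves the decoding to the parser, which matters when the parser knows encodings the JVM does not.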

So if you want to parse some XML text you have lying around in a character array or String, the natural thing to do is package it as a java.io.Reader and wrap it up in something like this:

String    text = "<lichen color='red'/>";
Reader    reader = new StringReader (text);
XMLReader parser = ... ;

parser.setContentHandler (...);
parser.parse (new InputSource (reader));

In the same way, if you're implementing a servlet's POST handler and the servlet accepts XML text as its input, you'll create an InputSource. The InputSource will never have a URI, though you could support URIs for multipart/related content (sending a bundle of related components, such as external entities). Example 1-1 handles the MIME content type correctly, though it does so by waving a magic wand: it calls a routine that implements the rules in RFC 3023. That is, text/* content is US-ASCII (seven-bit code) by default, and any charset=... attribute is authoritative. When parsing XML requests inside a servlet, you'd typically apply a number of configuration techniques to speed up per-request processing and maintain security.[3]

Example 1-1. Parsing POST input to an HTTP Servlet

import gnu.xml.util.Resolver;

public void doPost (HttpServletRequest request, HttpServletResponse response)
throws IOException, ServletException
{
    String       type = request.getContentType ();
    InputSource  in;
    XMLReader    parser;

    // a missing content type is treated like a non-XML one
    if (type == null
            || !(type.startsWith ("text/xml")
                || type.startsWith ("application/xml"))) {
        response.sendError (HttpServletResponse.SC_UNSUPPORTED_MEDIA_TYPE,
            "non-XML content type: " + type);
        return;
    }

    // there's no URI for this input data!
    in = new InputSource (request.getInputStream ());

    // use any encoding associated with the MIME type
    in.setEncoding (Resolver.getEncoding (request.getContentType ()));

    try {
        parser = XMLReaderFactory.createXMLReader();
        ...
        parser.setContentHandler (...);
        parser.parse (in);
        // content handler expected to handle response generation

    } catch (SAXException e) {
        // the request body wasn't acceptable XML
        response.sendError (HttpServletResponse.SC_BAD_REQUEST,
            "bad input: " + e.getMessage ());
        return;

    } catch (IOException e) {
        // maybe a relative URI in the input couldn't be resolved
        response.sendError (HttpServletResponse.SC_INTERNAL_SERVER_ERROR,
            "i/o problem: " + e.getMessage ());
        return;
    }
}
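Example 1-1 leaves the RFC 3023 rules to Resolver.getEncoding(). The following is only a sketch of the rules as summarized above, not the GNU implementation: a charset parameter wins, text/* without one is US-ASCII, and anything else returns null so the parser can autodetect. The class name ContentTypeEncoding is hypothetical.

```java
import java.util.Locale;

// Hypothetical stand-in for the encoding lookup used in Example 1-1.
public class ContentTypeEncoding {

    // Returns the encoding implied by a MIME content type, or null when the
    // parser should work it out from the document itself.
    public static String getEncoding(String contentType) {
        if (contentType == null)
            return null;

        // limit -1 keeps empty trailing fields, so parts[0] always exists
        String[] parts = contentType.split(";", -1);

        // an explicit charset parameter is authoritative
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // strip optional double quotes around the value
                if (value.length() >= 2
                        && value.startsWith("\"") && value.endsWith("\""))
                    value = value.substring(1, value.length() - 1);
                return value;
            }
        }

        // text/* content without a charset is seven-bit US-ASCII
        if (parts[0].trim().toLowerCase(Locale.ROOT).startsWith("text/"))
            return "US-ASCII";

        // application/xml and the like: let the parser autodetect
        return null;
    }

    public static void main(String[] args) {
        System.out.println(getEncoding("text/xml; charset=UTF-8"));
        System.out.println(getEncoding("text/xml"));
        System.out.println(getEncoding("application/xml"));
    }
}
```

Returning null for application/xml matters because InputSource.setEncoding(null) leaves the parser free to use the byte order mark or the XML declaration.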

You might have some XML text in a database, stored as a binary large object (BLOB, accessed using java.sql.Blob) and potentially referring to other BLOBs in the database. Constructing input sources for such data should be slightly different because of those references. You'd want to be sure to provide a URI, so the references can be resolved:

String        key = "42";
byte          data [] = Storage.keyToBlob (key);
InputStream   stream = new ByteArrayInputStream (data);
InputSource   source = new InputSource (stream);
XMLReader     parser = ... ;

source.setSystemId ("blob:" + key);
parser.parse (source);

In such cases, where you are using a URI scheme that your JVM doesn't support directly, consider using an EntityResolver to create the InputSource objects you hand to parse(). Such schemes might be standard (such as members of a MIME multipart/related bundle), or they might be private to your application (like this blob: scheme). (Example 1-3 shows how to package handling for such nonstandard URI schemes so that you can use them in your application, even when your JVM does not understand them. You may need to pass such URIs using public IDs rather than system IDs, so that parsers won't report errors when they try to resolve them.)
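For a feel of what such a resolver might look like, here is a minimal sketch for the private blob: scheme. It is not the book's Example 1-3: the class name BlobResolver is made up, and an in-memory map stands in for the database so the code can run on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver for the application-private "blob:" scheme.
public class BlobResolver implements EntityResolver {
    private static final String SCHEME = "blob:";

    // Stand-in for the database; a real application would query its storage.
    private final Map<String, byte[]> blobs = new HashMap<String, byte[]>();

    public void put(String key, byte[] data) {
        blobs.put(key, data);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws IOException {
        // not one of ours: returning null lets the parser resolve it normally
        if (systemId == null || !systemId.startsWith(SCHEME))
            return null;

        byte[] data = blobs.get(systemId.substring(SCHEME.length()));
        if (data == null)
            throw new IOException("no such blob: " + systemId);

        InputSource source = new InputSource(new ByteArrayInputStream(data));
        source.setPublicId(publicId);
        source.setSystemId(systemId);   // base URI for later references
        return source;
    }
}
```

A resolver like this would be installed with parser.setEntityResolver() before calling parse(). Whether a given parser passes a blob: identifier through unchanged is parser-specific, which is why the public ID workaround mentioned above may still be needed.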

Filenames Versus URIs

Filenames are not URIs, so you may not provide them as system identifiers where SAX expects a system identifier: in parse() or in an InputSource object. If you are depending on JDK 1.2 or later, you can rely on new File(name).toURL().toString() to turn a filename into a URI. To be most portable, you may prefer to use a routine as shown in Example 1-2, which handles key issues like mapping DOS or Mac OS filenames into legal URIs.

Example 1-2. File.toURL( ) analogue for JDK 1.1

public static String fileToURL (File f)
throws IOException
{
    String      temp;

    if (!f.exists ())
        throw new IOException ("no such file: " + f.getName ());

    temp = f.getAbsolutePath ();

    if (File.separatorChar != '/')
        temp = temp.replace (File.separatorChar, '/');
    if (!temp.startsWith ("/"))
        temp = "/" + temp;
    if (!temp.endsWith ("/") && f.isDirectory ())
        temp = temp + "/";
    return "file:" + temp;
}

If you're using the GNU software distribution that is described earlier, gnu.xml.util.Resolver.fileToURL() is available so you won't need to enter that code yourself.
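On newer JVMs there is a third option. As a sketch, assuming JDK 1.4 or later (which the text above does not require), File.toURI() does the same mapping and also escapes characters such as spaces that are not legal in URIs. The class and method names here are invented for illustration.

```java
import java.io.File;

// Hypothetical helper; assumes a JDK that provides File.toURI().
public class FileIds {

    // Converts a filename into a string usable as a SAX system identifier.
    public static String toSystemId(String filename) {
        return new File(filename).toURI().toString();
    }

    public static void main(String[] args) {
        // prints a file: URI with the space escaped as %20
        System.out.println(toSystemId("doc with spaces.xml"));
    }
}
```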

Bootstrapping an XMLReader

There are several ways to obtain an XMLReader. Here we'll look at a few of them, focusing first on the most commonly available ones. These are the "pure SAX" solutions.

It's good policy to reuse parsers, rather than constantly discard and recreate them. Some parsers are more expensive to create than others, so such reuse can improve performance if you parse many documents. Similarly, factory approaches add some fixed costs to achieve vendor neutrality, and those costs can add up. In contexts like servlets, where any number of threads may need to parse XML concurrently, parsers are often pooled so those bootstrapping costs won't increase per-request service times.
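One simple way to get that kind of reuse is to keep one parser per thread. The sketch below is an assumption-laden illustration rather than a recommended design: the class name ParserCache is made up, and it bootstraps through the JDK's JAXP factory (configured for namespace processing, to match SAX2 defaults) only so that it runs without any extra parser on the class path.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical per-thread parser cache.
public class ParserCache {

    // One parser per thread: an XMLReader must not be used by two threads
    // at once, but it can be reused for one document after another.
    private static final ThreadLocal<XMLReader> READERS =
        new ThreadLocal<XMLReader>() {
            @Override
            protected XMLReader initialValue() {
                try {
                    SAXParserFactory factory = SAXParserFactory.newInstance();
                    factory.setNamespaceAware(true);
                    return factory.newSAXParser().getXMLReader();
                } catch (Exception e) {
                    throw new IllegalStateException("can't create parser", e);
                }
            }
        };

    // Returns the calling thread's parser, creating it on first use.
    public static XMLReader get() {
        return READERS.get();
    }
}
```

Callers should still set their own handlers before each parse, since a reused parser keeps whatever handlers and flags the previous user installed.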

The XMLReaderFactory Class

The simplest way to get a parser is to use the default parser for your environment, as we saw earlier:

import org.xml.sax.helpers.XMLReaderFactory;

...

XMLReader       parser = null;

try {
    parser = XMLReaderFactory.createXMLReader ();
    // success!

} catch (SAXException e) {
    System.err.println ("Can't get default parser: " + e.getMessage ());
}
            

Normally, the default parser is defined by setting the org.xml.sax.driver system property. Application startup should set that property, normally using JVM invocation flags. (In a very few cases System.setProperty() may be appropriate.)

$ java -Dorg.xml.sax.driver=gnu.xml.aelfred2.XmlReader

Unfortunately, in many cases the original reference implementation of that method is used. This is problematic in two situations: when the system property isn't set and when security permissions are set to prevent access to that system property; this is common for many applets. Good SAX2 distributions will ensure that this factory method succeeds in the face of such errors. The current release of the SAX2 helper classes makes this easy to do.[4]

Because of that problem, you may choose to code your application so parser choice is a configuration option encoded through some mechanism other than system properties. You can't keep it in your application's XML-format configuration file, since you would need a parser to read that file. Once you get that configuration data you'll probably use a different XMLReaderFactory call:

import org.xml.sax.helpers.XMLReaderFactory;

...

XMLReader       parser = null;
String          className = ...;

try {
    parser = XMLReaderFactory.createXMLReader (className);
    // success!

} catch (SAXException e) {
    System.err.println ("Can't get parser " + className + ": " + e.getMessage ());
}
            

Using this factory call, the class name identifies the SAX parser you want to use. It may well be one of the entries in Table 1-1, though some frameworks bundle other parsers.

Table 1-1. SAX2 XMLReader implementation classes

Parser (and type)                   Class name
Aelfred2 (nonvalidating)            gnu.xml.aelfred2.SAXDriver
Aelfred2 (optionally validating)    gnu.xml.aelfred2.XmlReader
Crimson (optionally validating)     org.apache.crimson.parser.XMLReaderImpl
Xerces (optionally validating)      org.apache.xerces.parsers.SAXParser

If you're using a parser without a settable option for validation, you may want to let distinct parsers be configured for validating and nonvalidating usage, assuming that your application needs both. Parsers with validation support are significantly larger than ones without it, which is partly why Aelfred2 still has a nonvalidating class.

Calling Parser Constructors

If you need to force the use of some particular parser, you can invoke its constructor directly. Every SAX2 XMLReader must have a default constructor in order to work with the XMLReaderFactory class. Since it exists, you can invoke it directly using the same class names you may have passed to the XMLReaderFactory, if you used application-level configuration:

import org.xml.sax.XMLReader;
import gnu.xml.aelfred2.XmlReader;

...

XMLReader       parser = new XmlReader ();
            

In some cases you may actually prefer to force use of some particular parser. In other cases, you may have no option, maybe because of class loader or security configuration. If you run into trouble with those mechanisms, you may not be able to use factory APIs to access parsers unless they are visible through the system class loader.

In general, avoid such nonportable coding decisions; use a factory API wherever you can.

Using JAXP

Sun's JAXP 1.1 supports yet another way to bootstrap SAX parsers. It's a more complex process, taking several steps instead of just one:

  1. First, get a javax.xml.parsers.SAXParserFactory.

  2. Tell it to return parsers that will do the kind of processing needed by your application.

  3. Ask it to give you a JAXP parser of type javax.xml.parsers.SAXParser.

  4. Finally, ask the JAXP parser to give you the XMLReader that is normally lurking inside of it.

Conceptually this is like the no-parameters XMLReaderFactory.createXMLReader() method, except it's complicated by expecting the factory to return preconfigured parsers.[5] Configuring the parser using the SAX2 flags and properties directly is preferable; the API "surface area" is smaller. Other than having different default namespace-processing modes, the practical difference is primarily availability: many implementations ensure that a JAXP system default is always accessible, but they haven't paid the same attention to providing the default SAX2 parser. (Current versions of the SAX2 classes make that easier, but you might not be using such versions.)

The code to use the JAXP bootstrap API to get a SAX2 parser looks like this:

import org.xml.sax.*;
import javax.xml.parsers.*;

XMLReader        parser;

try {
    SAXParserFactory factory;

    factory = SAXParserFactory.newInstance ();
    factory.setNamespaceAware (true);
    parser = factory.newSAXParser ().getXMLReader ();
    // success!

} catch (FactoryConfigurationError err) {
    System.err.println ("can't create JAXP SAXParserFactory, "
        + err.getMessage ());
} catch (ParserConfigurationException err) {
    System.err.println ("can't create XMLReader with namespaces, "
        + err.getMessage ());
} catch (SAXException err) {
    System.err.println ("Hmm, SAXException, " + err.getMessage ());
}

Rather than calling newInstance(), you can hardcode the constructor for a particular factory, probably using one of the classes listed in Table 1-2. It's better to keep implementation preferences as configuration issues though, and not hardwire them into source code. For situations where you may have several parsers in your class path (or a tree of class loaders, as found in many recent servlet engines), JAXP offers several methods to configure such preferences. You can associate the factory class name value with the key javax.xml.parsers.SAXParserFactory by using the key to name a system property (which sets the default parser for your JVM instance) or by putting it in the $JAVA_HOME/jre/lib/jaxp.properties property file (which sets the default policy for that JVM implementation). I prefer the jaxp.properties solution; with the other method the default parser is a function of your class path settings and even the names assigned to various JAR files. You can also embed this preference in your application's JAR files as a META-INF/services/... file, but that solution is similarly sensitive to class loader configuration issues.

Table 1-2. JAXP SAXParserFactory implementation classes

JAXP factory    Class name
Aelfred2        gnu.xml.aelfred2.JAXPFactory
Crimson         org.apache.crimson.jaxp.SAXParserFactoryImpl
Xerces          org.apache.xerces.jaxp.SAXParserFactoryImpl

If you're using JAXP to bootstrap a SAX2 parser, rather than the SAX2 APIs, the default setting for namespace processing is different: JAXP parsers don't process namespaces by default, while SAX2 parsers do. SAX2 normally removes all xmlns* attributes, reports namespace scope events, and may hide the namespace prefixes actually used by element and attribute names. JAXP does none of that unless you make it; in fact, the default parser mode for some current implementations is the illegal SAX2 mode described in the previous chapter. The example code in this section made the JAXP factory follow SAX2 defaults.

This book encourages you to use SAX2 directly, rather than through the JAXP factory mechanism. Even if JAXP is available, it's more complex to use. Also, the resulting parser is configured differently, so many of the examples in this book would break.

Configuring XMLReader Behavior

A configuration mechanism was one of the key features added in the SAX2 release. Parsers can support extensible sets of named Boolean feature flags and property objects. These function in similar ways, including using URIs to identify any number of features and properties. The exception model (presented elsewhere in this book) is used to distinguish the three basic types of feature or property: the current value may be read-only, read/write, or undefined. Some flags and properties may have rules about when they can be changed (typically not while parsing) or read.

Applications access property objects and feature flags through get*() and set*() methods and use URIs to identify the characteristic of interest. Since SAX does not provide a way to enumerate such URIs as supported by a parser, you will need to rely on parser documentation, or the tables in this section, to identify the legal identifiers. (Or consult the source code, if you have access to it.)

If you happen to be defining new handlers or features using the SAX2 framework, you don't have to ask for permission to define new property or feature flag IDs. Since they are identified using URIs, just start your ID with a base URI that you control. (Only the SAX maintainers would start with the http://xml.org/sax/ URI, for example.) Typically, it will be easiest to make up some HTTP URL based on a fully qualified domain name that you control. As with namespace URIs, these are used purely as identifiers rather than as locations from which data would be retrieved. (The "I" in URI stands for "identifier.")

XMLReader Properties

SAX2 defines two XMLReader calls for accessing named property objects. One of the most common uses for such objects is to install non-core event handlers. Accessing properties is like accessing feature flags, except that the values associated with these names are objects rather than Booleans:

XMLReader       producer = ...;
String          uri = ...;
Object          value = ...;

// Try getting and setting the property
try {
    System.out.println ("Initial property setting: "
        + producer.getProperty (uri));
    // if we get here, the property is supported

    producer.setProperty (uri, value);
    // if we get here, the parser set the property 

} catch (SAXNotSupportedException e) {
    // bad value for property ... maybe wrong type, or parser state
    System.out.println ("Can't set property: "
        + e.getMessage ());
    System.exit (1);

} catch (SAXNotRecognizedException e) {
    // property not supported by this parser
    System.out.println ("Doesn't understand property: "
        + e.getMessage ());
    System.exit (1);
}

You'll notice the URIs for these standard properties happen to have a common prefix. This means that you can declare the prefix (http://xml.org/sax/properties/) as a constant string and construct the identifiers by string catenation.
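The most common use of a property, installing a non-core handler, can be sketched as follows. This is an illustration under assumptions rather than code from the SAX distribution: the class name CommentCollector is invented, the parser comes from the JDK's JAXP factory to keep the sketch self-contained, and it uses org.xml.sax.ext.DefaultHandler2, a later SAX2 extensions class that supplies do-nothing LexicalHandler methods.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical example class: collects the comments found in a document.
public class CommentCollector {
    // declare the common prefix once, then build identifiers from it
    static final String PROPERTIES = "http://xml.org/sax/properties/";
    static final String LEXICAL_HANDLER = PROPERTIES + "lexical-handler";

    public static List<String> comments(String xml) throws Exception {
        final List<String> found = new ArrayList<String>();
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        // DefaultHandler2 implements LexicalHandler with empty methods,
        // so only the callback of interest needs overriding
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                found.add(new String(ch, start, length));
            }
        };

        parser.setContentHandler(handler);
        // throws SAXNotRecognizedException if the parser lacks this property
        parser.setProperty(LEXICAL_HANDLER, handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return found;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(comments("<a><!--hi--></a>"));
    }
}
```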

Here are the standard properties:

http://xml.org/sax/properties/declaration-handler

This property holds an implementation of org.xml.sax.ext.DeclHandler, used for reporting the DTD declarations that aren't reported through org.xml.sax.DTDHandler callbacks or, for the root element name declaration, through org.xml.sax.ext.LexicalHandler callbacks. This handler is presented elsewhere in this book.

Aelfred2, Crimson, and Xerces support this property. In fact, all JAXP-compliant processors must do so.

http://xml.org/sax/properties/dom-node

Only specialized parsers will support this property: parsers that traverse DOM document nodes to produce streams of corresponding SAX events. (Typical SAX2 parsers parse XML text instead of DOM content.) When read, this property returns the DOM node corresponding to the current SAX2 callback. The property can only be written before a parse, to specify that the DOM node beginning and ending the SAX event stream need not be an org.w3c.dom.Document. This type of parser is presented later in this chapter, in "DOM-to-SAX Event Production (and DOM4J, JDOM)."

One example of such a parser is gnu.xml.util.DomParser, which is currently packaged along with the Aelfred2 parser. At this time, neither Crimson nor Xerces includes such functionality.

http://xml.org/sax/properties/lexical-handler

This property holds an implementation of org.xml.sax.ext.LexicalHandler, used for reporting various events mostly (but not exclusively) relating to details of XML text that have no semantic or structural meaning, such as comments. This handler is presented elsewhere in this book.

Aelfred2, Crimson, and Xerces support this property. In fact, all JAXP-compliant processors must do so.

http://xml.org/sax/properties/xml-string

This property returns a literal string of characters associated with the current parser callback event. Exactly which characters are returned isn't specified by SAX2. An example would be returning all the characters in the start tag of an element, including unexpanded entity and character references as well as excess whitespace and the exact type of quote characters (single, double) used to delimit attribute values. (This feature is intended to be of use when constructing certain kinds of XML editors, or DTD analyzers, that are willing to re-parse this data.)

No widely available open source SAX2 parser currently supports this property.

Applications may find it useful to define their own types of handler interfaces, assembling sequences of SAX event "atoms" into higher-level event "molecules" that incorporate essential application-level semantics (and probably some procedural validation). This is the same kind of process model used by W3C's XML schema processing model: the Post-Schema-Validation Infoset (PSVI) additions incorporate semantics suited to processing with that kind of schema. Most applications need to associate even more semantics with data than are easily captured by such simple rules (including DTDs and all types of schema). Those semantics would likely not be understood by any common XMLReader, but other kinds of SAX processing components can help manage such application-level handlers. An example of this technique appears elsewhere in this book.

XMLReader Feature Flags

The previous chapter showed how to access feature flags from SAX parsers and used the standard validation flag as the primary example. Accessing feature flags follows the same model as accessing properties, except the values are boolean not Object. There are a handful of standard SAX2 feature flags, which are all you normally need. The namespace for features is different from the namespace for properties. You can't set a property to a java.lang.Boolean value and expect to have the same effect as setting the feature flag that happens to use the same identifier.

As with properties, the URIs for these standard feature flags happen to have a common prefix: http://xml.org/sax/features/. It's good programming practice to declare the prefix as a constant and construct these feature identifiers by string catenation, helping reduce errors. Also, remember that flags aren't necessarily either settable (read/write)[6] or readable (supported); some parsers won't recognize all these flags, and in some cases these flags expose parser behaviors that don't change.
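Since SAX offers no way to enumerate the flags a parser supports, a small probe can be handy when exploring a new parser. The following is a sketch under assumptions: the class name FeatureProbe and its methods are invented, and the parser is obtained from the JDK's JAXP factory (made namespace-aware) so the code runs without other parsers installed.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical utility that reports how a parser treats standard flags.
public class FeatureProbe {
    // the shared prefix, declared once to reduce typing errors
    static final String FEATURES = "http://xml.org/sax/features/";

    // Describes one flag: its value, or why the value can't be read.
    public static String describe(XMLReader parser, String name) {
        try {
            return name + "=" + parser.getFeature(FEATURES + name);
        } catch (SAXNotRecognizedException e) {
            return name + ": not recognized";
        } catch (SAXNotSupportedException e) {
            return name + ": recognized, but not readable right now";
        }
    }

    public static XMLReader newParser() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);    // follow SAX2 defaults
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        XMLReader parser = newParser();
        String[] names = {
            "namespaces", "namespace-prefixes", "validation",
            "external-general-entities", "string-interning"
        };
        for (String name : names)
            System.out.println(describe(parser, name));
    }
}
```

Reading a flag says nothing about whether it can be changed; the only portable way to find that out is to call setFeature() and catch the same two exceptions.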

The standard flags are as follows:

http://xml.org/sax/features/external-general-entities

The default value for this flag is parser-specific. When the parser is validating, and in most other cases, the flag is true, indicating that the parser reads all external entities used outside the DTD. When the flag is false, the XML parser won't expand references to external general entities, so applications won't see the entire body of documents using such entities. This value can't be changed during parsing.

Crimson and Xerces only support true for this property. (For such parsers, you can get most of the effect of setting this flag to false by using an EntityResolver that returns zero-length entities after the first startElement() event.) Aelfred2 supports changing the value of this property.

http://xml.org/sax/features/external-parameter-entities

The default value for this flag is parser-specific. When the parser is validating, and in most other cases, the flag is true, indicating the DTD will be completely processed. When the flag is false, the XML parser will skip any external DTD subset, as well as named external parameter entities, so it won't necessarily read the entire DTD for a document. This value can't be changed during parsing.

Skipping these entities means attributes declared in them will not be defaulted or normalized as expected, and their types won't be known. As a result, default namespace declarations may get dropped. Parts of the internal subset after a reference to a skipped external parameter entity will be ignored. It also means some general entities might not be declared, making it impossible to correctly distinguish whether references to undefined entities are well-formedness errors.

Normally, you are better off providing an entity resolver that accesses locally cached copies of your DTD components, or not using DTDs, rather than disabling processing of external parameter entities. But don't assume all the XML you work with will have these DTD entities processed; the XML processors in some web browsers will not read these entities by default.

Xerces and Crimson only support true for this property. (For such parsers, you can get an effect similar to setting this to false by using an EntityResolver that returns zero-length entities before the first startElement() event. The parser won't correctly ignore declarations found later in the DTD.) Aelfred2 supports changing the value of this property.

http://xml.org/sax/features/is-standalone

This feature flag derives its value from the document being parsed, so it is read-only and only available after the first part of the document has been parsed. When the flag is true, the document has been declared to be standalone. If that declaration is correct, then all external entities may be safely ignored. This feature is part of XML 1.0 and is intended to reduce the cost of parsing some documents.

This flag should be part of an upcoming SAX extensions release.

http://xml.org/sax/features/lexical-handler/parameter-entities

The default value for this flag is parser-specific and is implicitly false if the parser doesn't support the LexicalHandler through a parser property. When the flag is true, the parser will report the beginning and end of parameter entities through LexicalHandler calls. (Skipped parameter entities are always reported, through the appropriate ContentHandler call.) Parameter entities are distinguished from general entities because the first character of their entity name will be a percent sign (%). The value can't be changed during parsing.

Currently, only the Aelfred2 parser reports parameter entities.

http://xml.org/sax/features/namespaces

This flag defaults to true in XML parsers, which indicates the parser performs namespace processing, reporting xmlns attributes by startPrefixMapping() and endPrefixMapping() calls and providing namespace URIs for each element or attribute. Otherwise no such processing is done at the parser level. This can't be changed during parsing.

You will leave this flag at its default setting unless your XML documents aren't guaranteed to conform to the XML Namespaces specification. Setting this to false usually gives some degree of parsing speed improvement, although it will likely not provide a significant impact on overall application performance. If you disable namespaces, make sure you first enable the namespace-prefixes feature.

This is supported by all SAX2 XML parsers. Aelfred2, Crimson, and Xerces support changing the value of this property.

http://xml.org/sax/features/namespace-prefixes

This flag defaults to false in XML parsers, indicating the parser will not present xmlns* attributes in its startElement() callbacks. Unless the flag is true, parsers won't portably present the qualified names (which include the prefix) used in an XML document for elements or attributes. The value can't be changed during parsing.

If you want to see the namespace prefixes for any reason, including for generating output without further postprocessing or for performing layered DTD validation, make sure this flag is set. Also make sure this flag is set if you completely disable namespace processing (with the namespaces feature flag), because otherwise the behavior of a SAX2 parser is undefined.

This is supported by all SAX2 parsers. Aelfred2, Crimson, and Xerces support changing the value of this property.

http://xml.org/sax/features/string-interning

The default value for this flag is parser-specific. When true, this indicates that all XML name strings (except those inside attribute values) and namespace URIs returned by this parser will have been interned using String.intern(). Some kind of interning is almost always done to improve the performance of parsers, and this flag exposes this work for the benefit of applications. This value can't be changed during parsing.

When applications know interning has been done, they know they can rely on fast, identity-based tests for string equality (== or !=) rather than the more expensive String.equals() method. Using equality testing for strings will always work, but it can be much slower than identity testing. Java automatically interns all string constants. Lots of startElement() processing needs to match element and attribute name strings, so this kind of optimization can often be a win.

Aelfred2 interns all strings. Some older versions of Crimson don't recognize this flag, but all versions should correctly intern those strings. Xerces reports that it does not intern these strings.

http://xml.org/sax/features/validation

The default value for this flag is parser-specific; in most cases it is false. When the flag is true, the parser is performing XML validation (with a DTD, unless you've requested otherwise). When the flag is false, the parser isn't validating. The value can't be changed while parsing.

Crimson, Xerces, and Aelfred2 (when packaged with its optional validator) support both settings. Aelfred2 also includes a nonvalidating parser, which supports only false for this flag.

A few additional standard extension features will likely be defined, providing even more complete Infoset support from SAX2 XML parsers.
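To make the string-interning flag described above more concrete, here is a sketch of a handler that uses identity comparison only when the parser promises interned names, and falls back to String.equals() otherwise. The class name ItemCounter and the element name "item" are invented for illustration, and the parser is the JDK's JAXP default, configured for namespace processing.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that counts <item> elements.
public class ItemCounter extends DefaultHandler {
    static final String INTERNING =
        "http://xml.org/sax/features/string-interning";

    // string constants are interned automatically by Java
    private static final String ITEM = "item";

    private final boolean interned;
    private int count;

    ItemCounter(boolean interned) {
        this.interned = interned;
    }

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes atts) {
        // identity test only when the parser said names are interned
        boolean match = interned ? (localName == ITEM) : ITEM.equals(localName);
        if (match)
            count++;
    }

    public static int countItems(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader parser = factory.newSAXParser().getXMLReader();

        boolean interned;
        try {
            interned = parser.getFeature(INTERNING);
        } catch (SAXException e) {
            interned = false;       // unknown flag: assume nothing
        }

        ItemCounter counter = new ItemCounter(interned);
        parser.setContentHandler(counter);
        parser.parse(new InputSource(new StringReader(xml)));
        return counter.count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countItems("<list><item/><other/><item/></list>"));
    }
}
```

Checking the flag once per parse, rather than assuming it, keeps the handler correct with parsers that don't intern or don't recognize the flag.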

Of the widely available parsers, only Xerces has nonstandard feature flags. (The Xerces distribution includes full documentation for those flags.) As a rule, avoid most of these, because they are parser-specific and even version-specific. Some are used to disable warnings about extra definitions that aren't errors. (Most parsers don't bother reporting such nonerrors; Xerces reports them by default.) Others promote noncompliant XML validation semantics. Here are a few flags that you may want to use.

http://apache.org/xml/features/validation/schema

This tells the parser to validate with W3C-style schemas. The document needs to identify a schema, and the parser must have namespaces and validation enabled. (Defaults to false.)

W3C XML schema validation does not need to be built into XML parsers. In fact, most currently available schema validators are layered.

http://apache.org/xml/features/validation/schema-full-checking

This flag controls whether W3C schema validation involves all the specified tests. By default, some of the more expensive checks are not performed; Xerces is not "fully conforming" by default.

http://apache.org/xml/features/allow-java-encodings

This flag defaults to false, limiting the encodings that the parser accepts to a handful. When the flag is set to true, more encoding names are supported. Most other SAX2 parsers effectively have true as their default. A few of those additional encoding names are Java-specific (such as "UTF8"); most of them are standard encoding names, either the preferred version or recognized alternatives.

http://apache.org/xml/features/continue-after-fatal-error

When set, this flag permits Xerces to continue parsing after it invokes ErrorHandler.fatalError() to report a nonrecoverable error. If the error handler doesn't abort parsing by throwing an exception, Xerces will continue. The XML specification requires that no more event data be reported after fatal errors, but it allows additional errors to be reported. (Of course, depending on the initial error, many of the subsequent reports might be nonsense.)
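Because these flags are parser-specific, code that sets them should be prepared for a parser that doesn't know them. The following is a minimal sketch, not a recommended utility: the FeatureFlags class and its method names are hypothetical, and it obtains a reader through JAXP simply because that is available in any JDK.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper for setting optional, parser-specific flags.
final class FeatureFlags {
    // Try to set a flag; return false instead of failing when the
    // parser doesn't recognize the flag or can't use that value.
    static boolean trySet(XMLReader reader, String flag, boolean value) {
        try {
            reader.setFeature(flag, value);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    // Obtain whatever namespace-aware reader the JVM provides by default.
    static XMLReader newReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
    }
}
```

An application might call trySet(reader, "http://apache.org/xml/features/validation/schema", true) and fall back to a layered validator when the call returns false.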

The EntityResolver Interface

As mentioned earlier, this interface is used when a parser needs to access and parse external entities in the DTD or document content. It is not used to access the document entity itself. Cases where an EntityResolver should be used include:

  • When "more local" copies of entity data should be used. Such copies might be from a local filesystem or from a smart caching proxy. A normal web server may be unavailable or may only be accessible through a slow or congested network link; such remote access can cause application slowdowns and failures. This is generically called catalog or cache processing.

  • When the entity's systemId uses a URI scheme that is not understood by the underlying JVM. Built-in schemes usually include http://, file://, ftp://, and increasingly https://. Schemes not supported by the JVM include urn: and application-specific schemes. (You may need to put such URI schemes into publicID values, in order to prevent problems resolving relative URIs.)

  • When entities need to be constructed dynamically, or not through the standard URI resolution scheme. For example, entity text might be the result of a query through some user interface or another computation.

  • When the XML source text doesn't provide usable URIs. SGML-style systems sometimes use system identifiers that aren't really URIs; they might be relative to some base URI other than the base URI of the appropriate entity (document or DTD). Avoid this practice for XML-based systems; it's not very interoperable because most XML processors strongly expect system IDs in XML documents to be valid URIs, relative to the actual base URI of their declaration.

Applications that handle documents with DTDs should plan to use an EntityResolver so they work robustly in the face of partial network failures, and so they avoid placing excessive loads on remote servers. That is, they should try to access local copies of DTD data even when the document specifies a remote one. There are many examples of sloppily written applications that broke when a remote system administrator moved a DTD file. Examples range from purely informative services like most RSS feeds to fee-based services like some news syndication protocols.

You can implement a useful resolver with a data structure as simple as a hash table that maps identifiers to URIs. There is normally no reason to have different parsers use different entity resolvers; documents shouldn't use the same public or (absolute) system identifiers to denote different entities. You'll normally just have one resolver, and it could adaptively cache entities if you like.

More complex catalog facilities may be used by applications that follow the SGML convention that public identifiers are Formal Public Identifiers (FPIs). FPIs serve the role that Universal Resource Names (URNs) serve for Internet-oriented systems. Such mappings can also be used with URIs, if the entity text associated with URIs is as stable as an FPI. (Such stability is one of the goals of URNs.)

Applications pass objects that implement the EntityResolver interface to the XMLReader.setEntityResolver() method. The parser will then use the resolver with all external parsed entities. The EntityResolver interface has only one method, which can throw a java.io.IOException as well as the org.xml.sax.SAXException most other callbacks throw.

InputSource resolveEntity(String publicId, String systemId)

Parsers invoke this method to map entity identifiers either to other identifiers or to data that they will parse. See the discussion in "The InputSource Class," earlier in this chapter, for information about how the InputSource class is used. If null is returned, then the parser will resolve the systemId without additional assistance. To avoid parsing an entity, return a value that encapsulates a zero-length text entity.

The systemId will always be present and will be a fully resolved URI. The publicId may be null. If it's not null, it will have been normalized by mapping sequences of consecutive whitespace characters to a single space character.
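Returning a zero-length entity can be sketched like this. The SkipAllResolver name is hypothetical, and suppressing every external entity is only one possible policy; a real resolver would normally do so selectively.

```java
import java.io.StringReader;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver that replaces every external entity
// with zero-length text, so nothing is fetched or parsed.
final class SkipAllResolver implements EntityResolver {
    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        InputSource empty = new InputSource(new StringReader(""));
        // Keep the identifiers so diagnostics still name the entity.
        empty.setPublicId(publicId);
        empty.setSystemId(systemId);
        return empty;
    }
}
```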

Example 1-3 shows a simple resolver that does three things: it stands in for a web-based time service running on the local machine, it interprets a private URI scheme, and it maps public identifiers to alternative URIs using a dictionary that's maintained externally somehow. (For example, you might prime a hashtable with the public IDs for the XHTML 1.0, XHTML 1.1, and DocBook 4.0 XML DTDs to point to local files.) It delegates to another resolver for other cases.

Example 1-3. Entity resolver, with chaining

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Date;
import java.util.Dictionary;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class MyResolver implements EntityResolver
{
    private EntityResolver next;
    private Dictionary     map;

    // n -- optional resolver to consult on failure
    // m -- mapping public ids to preferred URLs
    public MyResolver (EntityResolver n, Dictionary m)
        { next = n; map = m; }

    public InputSource resolveEntity (String publicId, String systemId)
    throws SAXException, IOException
    {
        // magic URL?  supply the current date as the entity text
        if ("http://localhost/xml/date".equals (systemId)) {
            InputSource   retval = new InputSource (systemId);
            Reader        date;

            date = new StringReader (new Date().toString ());
            retval.setCharacterStream (date);
            return retval;
        }

        // nonstandard URI scheme?
        // (Storage stands for some application-specific blob store)
        if (systemId.startsWith ("blob:")) {
            InputSource   retval = new InputSource (systemId);
            String        key = systemId.substring (5);
            byte          data [] = Storage.keyToBlob (key);

            retval.setByteStream (new ByteArrayInputStream (data));
            return retval;
        }

        // use table to map public id to local URL?
        if (map != null && publicId != null) {
            String url = (String) map.get (publicId);
            if (url != null)
                return new InputSource (url);
        }

        // chain to next resolver?
        if (next != null)
            return next.resolveEntity (publicId, systemId);

        // null tells the parser to resolve the systemId itself
        return null;
    }
}

Traditionally, public identifiers are mainly used as keys to find local copies of entities. In SGML, system identifiers were optional and system-specific, so public identifiers were sometimes the only ones available. (XML changed this: system identifiers are mandatory and are URIs.) In essence, public identifiers were used in SGML to serve the role that URNs serve in web-oriented architectures. An ISO standard for FPIs exists, and now RFC 3151 (available at http://www.ietf.org/rfc/rfc3151.txt) defines a mapping from FPIs to URNs. (The FPI is normalized and transformed, then gets a urn:publicid: prefix.) When public identifiers are used with XML systems, it's largely by adopting FPI policies to interoperate with such SGML systems; however, XML public identifiers don't need to be FPIs. You may prefer to use URN schemes in newer systems. If so, be aware that some XML processing engines support only URLs as system identifiers. By letting applications interpret public IDs as URNs, SAX offers more power than some other XML APIs do.

If you want richer catalog-style functionality than the table mapping shown earlier, look for open source implementations of the XML version of the OASIS SGML/Open Catalog (SOCAT). At this time, a specification for such a catalog is a stable draft, still in development; see http://www.oasis.org/committees/entity/ for more information. This specification defines an XML text representation of mappings; the mappings can be significantly more complex than the tabular one shown earlier.

Other Kinds of SAX2 Event Producers

Normally, an XMLReader turns XML text into SAX event callbacks. This book encourages you to think of those event consumer callbacks as the most important part of the process, so using XML text as input is just one option for feeding those consumers.

For example, some SAX parsers have turned HTML text into SAX callbacks; there have even been SAX wrappers around the limited javax.swing.text.html parser. These wrappers can help migrate to XHTML, first by making sure tags are properly formed, paired, and nested, then by helping make the XHTML be valid so more tools can work with it. Malformed HTML is a huge problem; there's lots of brain-dead HTML text on the Web.[7] In practice, no generally available SAX HTML parser is quite good enough to substitute for tools like HTML Tidy (see http://tidy.sourceforge.net/) combined with manual fixup for problem cases, but that could change.

DOM-to-SAX Event Production (and DOM4J, JDOM)

It's so typical to want to turn a DOM node into a series of SAX events that SAX2 defined a standard way to do this. Several of the projects that claim to improve on DOM by being more Java-friendly, such as DOM4J and JDOM, have similar functionality.

In conjunction with any sort of SAX text output API (such as an XMLWriter), this technique is an easy way to turn a DOM tree into text. Utilities to turn a DOM node into text all need to do more or less the same thing: traverse the tree and emit the right sort of text. Using SAX (and SAX utilities) you can do this without needing support for any optional DOM Level 3 modules and without relying on any vendor-specific DOM extensions. (It's also a fine technique to use when you need a debugging snapshot and can't afford the memory needed to deep-clone a DOM document.)

Of course, any other processing can be done too, such as validating the output. After initializing and connecting an appropriate event producer, consumer-side validator, and ErrorHandler, just produce the events and watch for reports of validity errors. In some cases (as with DOM-to-SAX converters), you can look at individual element subtrees; in other cases, you'll need to examine entire documents.

Turning DOM trees into SAX events

To turn a DOM node into SAX events, you'll need to use a special parser class; normal SAX parsers require text as input and won't know the first thing about DOM. If it's a Level 2 DOM and is using namespace support, you'll probably need to manually patch up the namespace data, since DOM isn't guaranteed to maintain it. Patching can be done before or after you generate SAX events; I prefer to use a single, generic SAX2 processing component to handle namespace fixups no matter where the problem arose, since DOM isn't the only culprit. Given such a parser class (the GNU version is used here), your code will look like this:

import gnu.xml.util.DomParser;
import gnu.xml.util.XMLWriter;
import org.w3c.dom.Node;
import org.xml.sax.ContentHandler;
import org.xml.sax.XMLReader;

XMLReader       parser;
Node            node = ...;
ContentHandler  contentHandler = new XMLWriter (System.out);

parser = new DomParser ();
parser.setContentHandler (contentHandler);
// you may also set DTDHandler, LexicalHandler, and DeclHandler

parser.setProperty ("http://xml.org/sax/properties/dom-node", node);
// the argument is only a placeholder; the dom-node property is the input
parser.parse ("dom-node value gets parsed");

Neither Crimson nor Xerces currently includes support for such DOM-to-SAX transformations.

Turning DOM4J trees into SAX events

In DOM4J (http://www.dom4j.org), the conversion works as shown in the following code. The current version of DOM4J isn't as flexible or complete as a DOM-to-SAX converter, though it has a few more options than JDOM. See the current release for more information.

import gnu.xml.util.XMLWriter;
import org.dom4j.Document;
import org.dom4j.io.SAXWriter;
import org.xml.sax.ContentHandler;

SAXWriter       parser;
ContentHandler  contentHandler = new XMLWriter (System.out);
Document        doc = ...;

parser = new SAXWriter ();
parser.setContentHandler (contentHandler);
// you may also set DTDHandler and LexicalHandler

parser.write (doc);
                

Turning JDOM trees into SAX events

Here's how to do this conversion in JDOM (http://www.jdom.org). As this is being written, the current version of JDOM doesn't support the level of flexibility of a DOM-to-SAX parser; it only handles JDOM document nodes. It also doesn't support LexicalHandler or DeclHandler events. JDOM could support some of the LexicalHandler events, such as those for comments and CDATA section boundaries. See the current release for more information.

import gnu.xml.util.XMLWriter;
import org.jdom.Document;
import org.jdom.output.SAXOutputter;
import org.xml.sax.ContentHandler;

SAXOutputter    parser;
ContentHandler  contentHandler = new XMLWriter (System.out);
Document        doc = ...;

parser = new SAXOutputter (contentHandler);
// you may also set DTDHandler

parser.output (doc);

Push Mode Event Production

Since SAX event handlers are just objects, your application software can call their methods directly. This is a common technique for application code that needs to convert data structures to XML: turn them into SAX event streams for processing by other components. That component could be an XMLWriter sending data across the web to a partner, but you can do other kinds of processing too. Such application code normally has no reason to be wrapped as an implementation of XMLReader.

When used with in-memory data structures, this is part of what's sometimes called serialization. Be careful not to confuse this with the more specialized meaning in Java RMI, where serialization is a binary data format tied to individual Java classes. Other words used to describe this kind of process include "marshaling," "encoding," and "pickling." Reversing the process is an important parallel problem, since most of the time applications must both produce and consume XML data. That is, most applications round-trip data, rather than just consuming it or producing it.

This event generation technique is not restricted to data structures that were originally stored in memory. You can use it with data from databases, stored on filesystems, and entered through user interfaces. The same general technique is used in all these cases.

Turning CSV files into SAX events

Comma Separated Values, or CSV, is a data format that is widely used for some data interchange problems. Many spreadsheets and databases can read and write it, and it can be used to publish fairly large databases. It's one of the more widely understood "flat file" text formats, and it's not uncommon to need to translate data from CSV formats into XML. With luck, the meaning of each field will be documented or maybe obvious from context. A simple CSV list of some yoga classes might have five fields per record and look like this:

daniela,4:30-5:45pm,ashtanga,sun,mixed
(staff),10:30am-12:00m,sivanenda,daily,open
philippe,7-9:00pm,ashtanga,mon,mixed
larry,4:30-5:45pm,ashtanga,wed,rocket
mahadevi,6-8:00pm,sivanenda,wed,advanced
savonn,7-8:30pm,vinyasa,wed,2-3
kei,9:30-11am,vinyasa,thu,intermediate
patti,7:30-9pm,iyenegar,thur,1-2
regan,9:30-11am,bikram,fri,open
mark,12m-2pm,ashtanga,sat,mysore

The translation is easier than the parsing of CSV itself. Details like handling of empty or missing fields, quoted values, and inconsistent value syntax are messy, and critical when importing lots of data. In fact, it's so messy that Example 1-4 completely avoids such lexical issues for CSV input data. (Nonlexical issues should be delegated to XML processing layers.) The example shows one way to translate; it's packaged more simply than a real-world application would probably expect. (Making an XMLReader that emits SAX events is possible and might be convenient.) This approach turns each CSV record into a single element by using attributes (with a sneak peek at a helper class we'll see later). It prints the output as XML text, which is probably not how you'd normally work with such data; the output is more naturally sent through a processing pipeline.

Example 1-4. Producing SAX2 events from CSV input

import java.io.*;
import java.util.StringTokenizer;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import gnu.xml.util.XMLWriter;


public class csv
{
    // stdin = (simple) CSV, stdout = XML
    public static void main (String argv [])
    {
        BufferedReader  in;
        XMLWriter       out;
        ErrorHandler    errs;
        String          line;

        try {
            in = new BufferedReader (new InputStreamReader (System.in));
            out = new XMLWriter (System.out);
            errs = new DefaultHandler () {
                    public void fatalError (SAXParseException e) {
                        System.err.println ("** parse error: "
                            + e.getMessage ());
                    }
                };

            out.startElement ("", "", "yoga", new AttributesImpl ());
            while ((line = in.readLine ()) != null)
                parseLine (line, out, errs);
            out.endElement ("", "", "yoga");
            out.flush ();
        } catch (Exception e) {
            System.err.println ("** error: " + e.getMessage ());
            e.printStackTrace (System.err);
            System.exit (1);
        }
    }

    // this doesn't handle quoted strings (with commas inside),
    // empty fields, tabs used as delimiters, or column headers.
    private static void parseLine (
        String          line,
        ContentHandler  out,
        ErrorHandler    errs
    ) throws SAXException
    {
        StringTokenizer tokens = new StringTokenizer (line.trim (), ",");
        String          values [] = new String [5];

        // anything other than exactly five values is malformed
        if (tokens.countTokens () != 5) {
            errs.fatalError (
                new SAXParseException ("expected five values", null));
            return;
        }
        for (int i = 0; i < 5; i++)
            values [i] = tokens.nextToken ();

        // now that we parsed the line safely, report its contents

        // the AttributesImpl class is shown later
        AttributesImpl  atts = new AttributesImpl ();

        atts.addAttribute ("", "", "teacher", "CDATA", values [0]);
        atts.addAttribute ("", "", "time", "CDATA", values [1]);
        atts.addAttribute ("", "", "type", "CDATA", values [2]);
        atts.addAttribute ("", "", "date", "CDATA", values [3]);
        atts.addAttribute ("", "", "level", "CDATA", values [4]);

        out.ignorableWhitespace ("\n  ".toCharArray (), 0, 3);
        out.startElement ("", "", "class", atts);
        out.endElement ("", "", "class");
    } 
}

The output of that program looks somewhat like this:

<yoga>
  <class teacher="daniela" time="4:30-5:45pm" type="ashtanga" 
         date="sun" level="mixed"></class>
  <class teacher="(staff)" time="10:30am-12:00m" type="sivanenda" 
         date="daily" level="open"></class>
  <class teacher="philippe" time="7-9:00pm" type="ashtanga" 
         date="mon" level="mixed"></class>
  <class teacher="larry" time="4:30-5:45pm" type="ashtanga" 
         date="wed" level="rocket"></class>
  <class teacher="mahadevi" time="6-8:00pm" type="sivanenda" 
         date="wed" level="advanced"></class>
  <class teacher="savonn" time="7-8:30pm" type="vinyasa" 
         date="wed" level="2-3"></class>
  <class teacher="kei" time="9:30-11am" type="vinyasa" 
         date="thu" level="intermediate"></class>
  <class teacher="patti" time="7:30-9pm" type="iyenegar" 
         date="thur" level="1-2"></class>
  <class teacher="regan" time="9:30-11am" type="bikram" 
         date="fri" level="open"></class>
  <class teacher="mark" time="12m-2pm" type="ashtanga" 
         date="sat" level="mysore"></class></yoga>

This included some ignorable whitespace to prevent the output from appearing as one big line of text; enabling pretty printing would do as well. Notice that the output needed to be flushed, else the JVM would normally exit with data still buffered in memory. We haven't yet looked at the endDocument() callback that would normally flush the data. Finally, notice that handling of any CSV conversion errors is delegated to a SAX error handler, which in this case adopts a very permissive strategy.

Turning objects into SAX events

For simple objects, something like the following "Address" example works. For a more complex object, such as a purchase order with multiple addresses for shipping and billing, you'll likely have routines that encode other data and use routines like this one as subroutines. You won't need to use any other handler interfaces, though you might want to embed comments or create CDATA boundaries using a LexicalHandler. Notice that startElement() calls always have matching endElement() calls, just as if the text was generated by an XML parser. This example declares and uses namespaces; you don't need to do that on the producer side if you patch them up later, but it's a reasonable practice to adopt. As used here, the AttributesImpl class just creates an empty set of attributes to pass on because null values can't be used:

static final String nsURI = "http://example.com/xml/address";

void toXML (Address addr, ContentHandler stream)
throws SAXException
{
    char            temp [];
    Attributes      atts;

    // create an empty set of attributes
    atts = new AttributesImpl ();

    // <address xmlns="http://example.com/xml/address">
    stream.startPrefixMapping ("", nsURI);
    stream.startElement (nsURI, "address", "address", atts);

    // <street>...</street>
    stream.startElement (nsURI, "street", "street", atts);
    temp = addr.getStreet ().toCharArray ();
    stream.characters (temp, 0, temp.length);
    stream.endElement (nsURI, "street", "street");

    // <city>...</city>
    stream.startElement (nsURI, "city", "city", atts);
    temp = addr.getCity ().toCharArray ();
    stream.characters (temp, 0, temp.length);
    stream.endElement (nsURI, "city", "city");

    // <country>...</country>
    stream.startElement (nsURI, "country", "country", atts);
    temp = addr.getCountry ().toCharArray ();
    stream.characters (temp, 0, temp.length);
    stream.endElement (nsURI, "country", "country");

    // ... there would probably be more elements,
    // but not all application data in the "Address"
    // would be shared with the recipient.

    // </address>
    stream.endElement (nsURI, "address", "address");
    stream.endPrefixMapping ("");
}

If you're printing such output, you might want to add some ignorable whitespace to keep all the text from appearing on a single line. The resulting XML text will be easier to read, though having text without line breaks should not matter otherwise. (Better yet: use an XMLWriter with pretty-printing support.) If you are working with many namespaces, you may want to use the NamespaceSupport class to track and select the prefixes used in the element and attribute names you write.
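As a rough sketch of how NamespaceSupport can pick prefixes on the producer side, consider the helper below. The PrefixPicker class and its method names are hypothetical; only the NamespaceSupport calls come from SAX2 itself.

```java
import org.xml.sax.helpers.NamespaceSupport;

// Hypothetical helper that tracks declarations and builds qualified names.
final class PrefixPicker {
    private final NamespaceSupport scopes = new NamespaceSupport();

    // Call before the matching startElement(): opens a new scope
    // and records one declaration in it.
    void declare(String prefix, String uri) {
        scopes.pushContext();
        scopes.declarePrefix(prefix, uri);
    }

    // Build the qualified name for an element in the given namespace.
    String qName(String uri, String localName) {
        // Elements in the current default namespace need no prefix.
        if (uri.equals(scopes.getURI("")))
            return localName;
        String prefix = scopes.getPrefix(uri);
        if (prefix == null)
            throw new IllegalStateException("undeclared namespace: " + uri);
        return prefix + ":" + localName;
    }

    // Call after the matching endElement(): closes the scope.
    void endScope() {
        scopes.popContext();
    }
}
```

Declarations made in an outer scope stay visible in nested scopes until the matching endScope() call, which mirrors how namespace declarations are inherited by child elements.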

It may also be a good idea to write "unmarshaling" code (taking such events and recreating, or looking up, application objects) at the same time you write marshaling code (like the preceding code, creating SAX events from application objects). That helps test the code: when round-trip processing works for many different data items (save a lot of test cases), you know it's behaving. Unmarshaling code can also be an appropriate place to test for semantic validity of data: you might have reason to trust that your current marshaling code is correct, but changes made next year could break something, and it's not good to expect everyone else will marshal correctly.
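A round-trip test of this kind can be sketched with nothing but SAX handlers. In this illustration a map of field names to values stands in for the application object; the RoundTrip and Collector names are hypothetical, and namespaces are left out to keep the sketch short.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

final class RoundTrip {
    // Marshal: <address> wrapping one child element per map entry.
    static void marshal(Map<String, String> fields, ContentHandler out)
            throws SAXException {
        AttributesImpl none = new AttributesImpl();
        out.startElement("", "address", "address", none);
        for (Map.Entry<String, String> e : fields.entrySet()) {
            char[] text = e.getValue().toCharArray();
            out.startElement("", e.getKey(), e.getKey(), none);
            out.characters(text, 0, text.length);
            out.endElement("", e.getKey(), e.getKey());
        }
        out.endElement("", "address", "address");
    }

    // Unmarshal: rebuild the map from the same event stream.
    static final class Collector extends DefaultHandler {
        final Map<String, String> fields = new LinkedHashMap<>();
        private final StringBuilder text = new StringBuilder();
        private int depth;

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            depth++;
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            // Depth 2 means a child of <address>, that is, one field.
            if (depth == 2)
                fields.put(local, text.toString());
            depth--;
        }
    }

    // Marshal into the collector and return what it rebuilt.
    static Map<String, String> roundTrip(Map<String, String> in) {
        Collector collector = new Collector();
        try {
            marshal(in, collector);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.fields;
    }
}
```

A test then only needs to check that roundTrip(fields).equals(fields) holds for each saved test case.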

Data modeling concerns

As a rule of thumb, avoid assuming that your XML data model ought to match your application's data structures. Such policies can sometimes be appropriate, but more often, your application's internal data structures were optimized for something unrelated to communicating with other applications. Most systems that automatically marshal and unmarshal data structures (maybe using "reflection" in Java) will make such assumptions; they lead to tightly coupled systems. Tight coupling tends to cause fragility in the face of system evolution, since upgrades normally occur incrementally on widely distributed systems (such as almost all web-based applications).

For example, when you interchange the results of a complex set of queries from your database (perhaps for a large purchase order), it is typically appropriate to mask the exact relational structure used in your application. The recipient of your XML may well have adopted a different relational normalization. The recipient might not even expect to perform database operations on such data. Data displays may need to address usability issues that are completely unrelated to how applications "think" about the same data. Similar logic applies when the application data isn't stored in a database or is only partially stored in one.

On the other hand, if you're using XML to transfer a relation from one database to another, encoding a java.sql.ResultSet (or CSV table) into a series of elements (one element per table row, without duplications) may be exactly the right model. (The reverse transformation would be unmarshaling -- consuming XML to populate a database.) You won't always want to denormalize, even though the ability to easily do that is one of the great strengths of using XML to interchange data. Many common messaging scenarios involve the kind of data model that serves as input to normalization processes, and are oriented to individual cases not aggregates.

When you're encoding individual data items, such as integers, dates, or binary data encoded using BASE64, you should consider using the data-typing facilities in Part 2 of the W3C XML Schema specification (http://www.w3c.org/TR/xmlschema-2/). Those "simple" datatypes are intended to be used in many specifications. Their association with the particular schema system described in other parts of the W3C XML Schema specification can be viewed as a historical accident; you don't need to use W3C schemas to use these datatypes.

Producing Well-Formed Event Streams

If you are generating SAX2 events from any event producer that's not an actual XML parser (maybe by using an HTML parser or code that traverses data structures), you may need to ensure the event stream is legal before passing it to other components (maybe by printing it as XML text). There are issues of well-formedness to think about: startElement() calls need matching endElement() calls, other calls require similar start/end nesting, carriage returns are prohibited in line ends, and more. Correct reporting of namespace information is important: prefixes must be declared and correctly used. Validity will also be an issue in many contexts as a policy of eliminating data format errors as early as possible. (It's cheaper to fix bugs before you ship them in products than afterward, and validation tools make some bugs easy to find.)

The particular issues you may have depend on what kind of event producer you use and what kinds of events you generate. DOM streams can easily be namespace-invalid; for example, prefixes are often undeclared or missing. Code that generates events directly is particularly prone to violate element nesting and closure requirements and to omit namespace declarations. Few tools prevent all kinds of illegal content; ]]> could appear in CDATA sections, and - - (two hyphens) within comments, both of which will prevent generation of legal XML text.

With high-quality producer-side code, you'll have fixed all those problems before the code is released. But you'll still probably want code that dynamically verifies that there's no problem to use when debugging or troubleshooting. If you adopt a good SAX2 event pipeline framework, it can easily support components that monitor event streams to ensure they meet those data integrity constraints or, in some cases (such as namespaces), patch event streams so they are correct.
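One such monitoring component can be sketched as a handler that checks element nesting only. The NestingChecker class and its small accepts() driver are hypothetical; a fuller component would also check namespace declarations, comment and CDATA content, and the other constraints mentioned above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical monitor: verifies that start/end element events nest properly.
final class NestingChecker extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        if (open.isEmpty())
            throw new SAXException("end of " + qName + " with no open element");
        String expected = open.pop();
        if (!expected.equals(qName))
            throw new SAXException("expected end of " + expected
                + " but got end of " + qName);
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty())
            throw new SAXException("unclosed element: " + open.peek());
    }

    // Demo driver: "+name" starts an element, "-name" ends one.
    // Returns true when the whole script is properly nested.
    static boolean accepts(String script) {
        NestingChecker checker = new NestingChecker();
        try {
            for (String token : script.trim().split("\\s+")) {
                String name = token.substring(1);
                if (token.charAt(0) == '+')
                    checker.startElement("", name, name, new AttributesImpl());
                else
                    checker.endElement("", name, name);
            }
            checker.endDocument();
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

In a pipeline, the same checks would sit in a component that forwards each event to the next stage after verifying it.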

The XMLFilter Interface

SAX2 added the XMLFilter interface. XMLFilter is just an XMLReader that can be associated with a "parent" reader. What's interesting is the expectation that the parent is producing the events and the filter postprocesses them; the filter parses and modifies Infoset data, not XML text. From the perspective of your application code, a filter that you use as an XMLReader is doing some postprocessing of your parser requests, some processing on the XML data, then passing you the results; it's a preprocessor for infoset data.

The XMLFilter interface adds these methods to XMLReader:

void setParent(XMLReader parser)
XMLReader getParent()

The parent of an XMLFilter is accessed using standard JavaBeans property-naming conventions. Use this property to control which parser (or filter) generates the events to be filtered.

The role of the XMLFilter implementation is primarily to intercept and process SAX content events. Because its real work is to process those events, the code in such a filter is acting as a consumer. Implementing the XMLReader interface is a facade to make that consumer code look like a pull API (XMLReader) and let it intercept requests to an underlying parser. That is, it supports one kind of XML pipeline model.
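The arrangement can be sketched with the XMLFilterImpl helper class from org.xml.sax.helpers. The UpperCaseFilter class and its elementNames() driver are hypothetical, and the transformation it performs (upper-casing element names) is only there to make the filter's effect visible.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: consumes the parent's events and passes them on
// with element names converted to upper case.
final class UpperCaseFilter extends XMLFilterImpl {
    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        super.startElement(uri, local.toUpperCase(Locale.ROOT),
            qName.toUpperCase(Locale.ROOT), atts);
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        super.endElement(uri, local.toUpperCase(Locale.ROOT),
            qName.toUpperCase(Locale.ROOT));
    }

    // Demo driver: parse text through the filter, collecting element names.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader parser = factory.newSAXParser().getXMLReader();

            UpperCaseFilter filter = new UpperCaseFilter();
            filter.setParent(parser);           // the parent produces events
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local,
                                         String qName, Attributes atts) {
                    names.add(qName);
                }
            });
            // The application talks to the filter as if it were the parser.
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }
}
```

Note that the handler is bound to the filter, not to the parent parser; the filter installs itself as the parent's handler when parse() is called.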

Since the interesting issues are all on the consumer side, XMLFilter is discussed later with other kinds of SAX event pipeline models, along with the XMLFilterImpl helper class.

If you're using these filters as event producers, you'll need to pay attention to a secondary role of an XMLFilter : intercepting and modifying parser requests. This kind of filter is a compound object. It consists of the filter, plus a reader (which might in turn be another filter), handler bindings, and settings for feature flags and properties. The interrelationships of these parts can get murky. In simple cases you can ignore the distinction, treating this type of SAX filter just like another reader. But in other cases you may need to remember that the filter and its parent are distinct objects with different behaviors.

For example, sometimes you'll find implementations of XMLFilter that don't use mechanisms such as the EntityResolver or ErrorHandler. When you need to use those mechanisms, you'd need to bind such objects to the parent parser. But most filters pass those objects on to the parent and may even need to use them internally, so you'd bind them to the filter instead. You'll need to know which kind of filter you have. In a similar way, if an underlying parser interns its strings, but the filter changes them (for example, swapping one namespace URI for another) and doesn't intern those strings, then code that talks to the filter can't use identity tests to replace the slower equality tests. The filter would have to expose a different setting for such feature flags than the parent parser.


Footnotes:
  1. application/xml is the safest MIME type to use for *.xml, *.dtd, and other XML files. See RFC 3023 for information about XML MIME types and character encodings.

  2. JDK 1.4 includes public APIs through which applications can support new character encodings. Some applications may need to use those APIs to support encodings beyond those the JVM handles natively.

  3. You might have a pool of parsers, to reduce bootstrap costs. You'd use an entity resolver to turn most entity accesses from remote ones into local ones. Depending on your application, you might even prevent all access to nonlocal entities so the servlet won't hang when remote network accesses get delayed.

    Some security policies would also involve the entity resolver. Basically, every entity access "requested" by the client (through a reference in the document) is a potential attack. If it's not known to be safe (for example, access to standard DTD components), it may be important to prevent or nullify the access. (This does not always happen in the entity resolver; sometimes system security policies will be more centralized.) In a small trade-off against performance, security might require that the request data always be validated, and that validity errors be treated as fatal, because malformed input data is likely to affect system integrity.

  4. The current version of XMLReaderFactory has more intelligence and supports additional configuration mechanisms. For example, your application or parser distribution can configure a META-INF/services/org.xml.sax.driver resource into its class path, holding a single string to be used if the system property hasn't been set. SAX2 parser distributions are expected to work even if the system property or class path resource hasn't been set.

  5. You can also look at this as choosing between parsers. For example, JAXP 1.2 will probably say how to request that schema validation be done. That's most naturally done as a layer on top of SAX, with a parser filter postprocessing the output of some other SAX parser.

  6. SAX could support write-only flags too, but these are rarely a good idea.

  7. One early browser development policy was that there's no such thing as broken HTML, so parsers needed to accept pretty much everything. The policy helped simplify content creation when there were few tools beyond text editors, but it also led to serious problems with browser incompatibility, which are only now beginning to go away. It's also helped spread tools fostering malformed HTML (including flaky CGI scripts) and made it harder to present HTML on low-cost systems (it takes a fat parser to handle even a fraction of the different kinds of broken HTML).

    The draconian error-handling policy of the XML specification (if it's not well formed, it must be rejected) was a reaction to those problems: XML parsers don't need to compete on how well they can make sense of garbage input. It was added at the request of the main browser vendors, which were then Netscape and Microsoft. This policy makes it a lot easier to create tools to process XML text, including presentation tools (XHTML browsing) that can even work on limited resource systems (such as PDAs or cell phones), content management tools, and "screen scrapers" for mining XHTML presentation text (to repurpose the data shown there).

Chapter 3

Producing SAX2 Events

The preceding chapter provided an overview of the most widely used SAX classes and showed how they relate to each other. This chapter drills more deeply into how to produce XML events with SAX, including further customization of SAX parsers.

Pull Mode Event Production with XMLReader

Most of the time you work with SAX2, you'll be using some kind of org.xml.sax.XMLReader implementation that turns XML text into a SAX event stream. Such a class is loosely called a "SAX parser." Don't confuse this with the older SAX1 org.xml.sax.Parser class. New code should not be using that class!

This interface works in a kind of "pull" mode: when a thread makes an XMLReader.parse() request, it blocks until the XML document has been fully read and processed. Inside the parser there's a lot of work going on, including a "pull-to-push" adapter: the parser pulls data out of the input source provided to parse() and converts it to events that it pushes to event consumers. This model is different from the model of a java.io.Reader, from which applications can only get individual buffers of character data, but it's also similar because in both cases the calling thread is pulling data from a stream.

You can also have pure "push" mode event producers. The most common kind writes events directly to event handlers and doesn't use any kind of input abstraction to indicate the data's source; it's not parsing XML text. We discuss several types of such producers later in this chapter. Using threads, you could also create a producer that lets you write raw XML text, a buffer at a time, to an XMLReader that parses the text; that's another kind of "push" mode producer.
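A minimal sketch of the first kind of producer follows. Nothing is parsed: the producer simply calls ContentHandler methods in the order a parser would. The element name "greeting" and the class and method names are made up for illustration.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class PushProducerSketch {
    /** Reports a one-element document directly as SAX events. */
    public static void produce(ContentHandler handler, String text) throws SAXException {
        AttributesImpl noAttributes = new AttributesImpl();
        char[] chars = text.toCharArray();

        handler.startDocument();
        handler.startElement("", "greeting", "greeting", noAttributes);
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "greeting", "greeting");
        handler.endDocument();
    }

    /** Feeds the producer into a small consumer that records what it saw. */
    public static String trace(String text) throws SAXException {
        final StringBuilder log = new StringBuilder();
        produce(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.append('<').append(qName).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                log.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                log.append("</").append(qName).append('>');
            }
        }, text);
        return log.toString();
    }
}
```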

The XMLReader Interface

The SAX overview presented the most important parts of the XMLReader interface. Here we discuss the whole thing, in functional groups. Most of the handlers are presented in more detail in the next chapter, which focuses on the consumption side of the SAX event streaming process. Each handler has get and set accessor methods, and has a default value of null.

XMLReader has the following functional groups:

void parse(String uri)
void parse(InputSource in)

There are two methods to parse documents. In most cases, the Java environment is able to resolve the document's URI; the form with the absolute URI should be used when possible. (You may need to convert filenames to URIs before passing them to SAX. SAX specifically disallows passing relative URIs.) The second form is discussed in more detail along with the InputSource class. Both of these methods can throw a SAXException or java.io.IOException, as presented earlier. A SAXException is normally thrown only when an event handler throws it to terminate parsing. That policy is best encapsulated in an ErrorHandler, but handler methods can make such decisions themselves.

Only one thread may call a given parser's parse() method at a time; applications are responsible for ensuring that threads don't share parsers that are in active use. (SAX parsers aren't necessarily going to report applications that break that rule, though!) The thread doing the parsing will normally block only while it's waiting for data to be delivered to it, or if a handler's processing causes it to block.

void setContentHandler(ContentHandler handler)
ContentHandler getContentHandler()

Key parts of the ContentHandler interface were presented as part of the SAX overview; ContentHandler packages the fundamental parsing callbacks used by SAX event consumers.

void setDTDHandler(DTDHandler handler)
DTDHandler getDTDHandler()

The DTDHandler is presented in detail later.

void setEntityResolver(EntityResolver handler)
EntityResolver getEntityResolver()

The EntityResolver is presented later in this chapter, in "The EntityResolver Interface." It is used by the parser to help locate the content for external entities (general or parameter) to be parsed.

void setErrorHandler(ErrorHandler handler)
ErrorHandler getErrorHandler()

The ErrorHandler was presented earlier. It is often used by consumer code that interprets events reported through other handlers, since that code may need to report errors detected at higher levels than XML syntax.

void setFeature(String uri, boolean value)
boolean getFeature(String uri)

Parser feature flags are presented in more detail later in this chapter in "XMLReader Feature Flags."

void setProperty(String uri, Object value)
Object getProperty(String uri)

Parser properties are used for data such as additional event handlers, and are presented in more detail later in this chapter in "XMLReader Properties."
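To see the accessor groups side by side, here is a small setup sketch. It assumes a parser obtained through JAXP and uses the standard SAX2 identifiers for the namespaces feature and the lexical-handler property; the class and method names are hypothetical, and a given parser may reject a flag or property it doesn't support.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class ReaderSetupSketch {
    public static final String NAMESPACES =
        "http://xml.org/sax/features/namespaces";
    public static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";

    /** Returns a reader with every handler group bound to one do-nothing handler. */
    public static XMLReader configuredReader() throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DefaultHandler2 handler = new DefaultHandler2();

        // The four handler groups; each setter has a matching getter.
        reader.setContentHandler(handler);
        reader.setDTDHandler(handler);
        reader.setEntityResolver(handler);
        reader.setErrorHandler(handler);

        // A feature flag is a boolean; a property carries an object,
        // here an additional event handler.
        reader.setFeature(NAMESPACES, true);
        reader.setProperty(LEXICAL_HANDLER, handler);
        return reader;
    }
}
```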

All the event handlers and the entity resolver may be reassigned inside event callbacks. At this level, SAX guarantees "late binding" of handlers. Layers built on top of SAX might use earlier binding, which can optimize event processing.
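The following sketch shows what late binding permits: a handler that replaces itself from inside a callback, so that later events go to a different handler. The class and method names are hypothetical, and the parser is obtained through JAXP only for convenience.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class LateBindingSketch {
    /** Records which handler received each startElement event. */
    public static String trace(String xml) throws Exception {
        final XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        final StringBuilder log = new StringBuilder();

        final DefaultHandler inner = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.append("inner:").append(qName).append(' ');
            }
        };

        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.append("outer:").append(qName).append(' ');
                // Reassign the handler in mid-parse; later events go to "inner".
                reader.setContentHandler(inner);
            }
        });

        reader.parse(new InputSource(new StringReader(xml)));
        return log.toString().trim();
    }
}
```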

Many SAX parsers let you set handlers to null as a way to ignore the events reported by that type of handler. Strictly speaking, they don't need to do that; they're allowed to throw a NullPointerException when you use null. So if you need to restore the default behavior of a parser, you should use a DefaultHandler (or something implementing the appropriate extension interface) just in case, rather than use the more natural idiom of setting the handler to its default value, null.
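In code, that defensive idiom might look like the sketch below. The helper name is made up; it binds a fresh DefaultHandler, whose methods do nothing, in every handler role instead of passing null.

```java
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ResetHandlersSketch {
    /** Restores do-nothing behavior without relying on null being accepted. */
    public static void restoreDefaults(XMLReader reader) {
        DefaultHandler defaults = new DefaultHandler();
        reader.setContentHandler(defaults);
        reader.setDTDHandler(defaults);
        reader.setEntityResolver(defaults);
        reader.setErrorHandler(defaults);
    }
}
```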

If for any reason you need a push mode XML parser, which takes blocks of character or byte data (encapsulating XML text) that you write to a parser, you can easily create one from a standard pull mode parser. The cost is one helper thread and some API glue. The helper thread will call parse() on an InputSource that uses a java.io.PipedInputStream to read text. The push thread will write such data blocks to an associated java.io.PipedOutputStream when it becomes available. Most SAX parsers will in turn push the event data out incrementally, but there's no guarantee (at least from SAX) that they won't buffer megabytes of data before they start to parse.
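Here is one way that glue could be sketched, assuming UTF-8 text and a parser obtained through JAXP. The class and method names are hypothetical, and a real application would want more careful error handling than this.

```java
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class PushParserSketch {
    /** Pushes chunks of XML text at a pull mode parser; returns the element count. */
    public static int countElements(String... chunks) throws Exception {
        final XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        final AtomicInteger elements = new AtomicInteger();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                elements.incrementAndGet();
            }
        });

        PipedOutputStream out = new PipedOutputStream();
        final PipedInputStream in = new PipedInputStream(out);
        final AtomicReference<Exception> failure = new AtomicReference<>();

        // The helper thread blocks in parse(), pulling from the pipe.
        Thread helper = new Thread(() -> {
            try {
                reader.parse(new InputSource(in));
            } catch (Exception e) {
                failure.set(e);
            }
        });
        helper.start();

        // The push thread writes text as it becomes available.
        try {
            for (String chunk : chunks) {
                out.write(chunk.getBytes(StandardCharsets.UTF_8));
            }
        } finally {
            out.close();   // signals end of input to the parser
        }

        helper.join();
        if (failure.get() != null) {
            throw failure.get();
        }
        return elements.get();
    }
}
```

Note that chunk boundaries need not line up with markup boundaries; the parser sees one continuous byte stream.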

The InputSource Class

The InputSource class shows up in both places where SAX needs to parse data: for the document itself, through parse(), and for the external parsed entities it might reference through the EntityResolver interface.

In almost all cases you should simply pass an absolute URI to the XMLReader.parse() method. (If you have a relative URI or a filename, turn it into an absolute URI first.) However, there are cases when you may need to parse data that has no URI. It might be in unnamed storage like a String; or it might need to be read using a specialized access scheme (maybe a java.io.PipedInputStream, or POST input to a servlet, or something named by a URN). The web server for the URI might misidentify the document's character encoding, so you'd need to work around that server bug. In such cases, you must use the alternative XMLReader.parse() method and pass an InputSource object to the parser.

InputSource objects are fundamentally holders for one or two things: an entity's URI and the entity text. (There can be a "public ID" too, but it's rarely useful.) When only one of those is needed, an application's work for setting up the InputSource might end with choosing the right constructor. Whenever you provide the entity text, you need to pay attention to some character encoding issues. Because character encoding is easy to get wrong, avoid directly providing entity text when you can.
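For instance, when the text lives only in a String, an InputSource can hold both things at once. This is a sketch only: the class and method names are hypothetical, and the URI passed in is whatever absolute URI your application associates with the text.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class StringSourceSketch {
    /** Wraps predecoded text and records the URI it should be known by. */
    public static InputSource fromString(String xml, String absoluteUri) {
        InputSource source = new InputSource(new StringReader(xml));
        source.setSystemId(absoluteUri);
        return source;
    }

    /** Parses such a source and returns the name of the document element. */
    public static String rootName(String xml, String absoluteUri) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        final String[] root = { null };
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (root[0] == null) {
                    root[0] = qName;
                }
            }
        });
        reader.parse(fromString(xml, absoluteUri));
        return root[0];
    }
}
```

Because the text is already character data, no encoding detection is involved here.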

Always provide absolute URIs

You should try to always provide the fully qualified (absolute) URI of the entity as its systemId, even if you also provide the entity text. That URI will often be the only data you need to provide. You must convert filenames to URIs (as described later in this chapter in "Filenames Versus URIs"), and turn relative URIs into absolute ones. Some parsers have bugs and will attempt to turn relative URIs into absolute ones, guessing at an appropriate base URI. Do not rely on such behavior.
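One simple way to turn a filename into an absolute URI is to go through java.io.File, as in this sketch. The helper name is made up; File.toURI() resolves relative filenames against the current directory and escapes characters such as spaces.

```java
import java.io.File;

public class FileUriSketch {
    /** Converts a (possibly relative) filename into an absolute file: URI. */
    public static String toAbsoluteUri(String filename) {
        return new File(filename).toURI().toString();
    }
}
```

The resulting string can be passed to parse() or used as a systemId.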

If you don't provide that absolute URI, then diagnostics may be useless. More significantly, relative URIs within the document can't be correctly resolved by the parser if the base URI is forgotten. XML parsers need to handle relative URIs within DTDs. To do that they need the absolute document (or entity) base URIs to be provided in InputSource (or parse() methods) by the application. Parsers use those base URIs to absolutize relative URIs, and then use EntityResolver to map the URIs (or their public identifiers) to entity text. Applications sometimes need to do similar things to relative URIs in document content. The xml:base attribute may provide an alternative solution for applications to determine the base URI, but it is normally needed only when relative URIs are broken. This can happen when someone moves the base document without moving its associated resources, or when you send the document through DOM (which doesn't record base URIs). Moreover, relative URIs in an xml:base attribute still need to be resolved with respect to the real base URI of the document.
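Applications that need to do the same thing for relative URIs in document content can lean on java.net.URI, as sketched here. The class and method names are hypothetical, and the example.com URIs mentioned afterward are placeholders.

```java
import java.net.URI;

public class BaseUriSketch {
    /** Resolves a possibly relative reference against an absolute base URI. */
    public static String absolutize(String baseUri, String reference) {
        return URI.create(baseUri).resolve(reference).toString();
    }
}
```

With a base of http://www.example.com/docs/book.xml, the reference chapters/ch1.xml resolves to http://www.example.com/docs/chapters/ch1.xml, while a reference that is already absolute comes back unchanged.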

The following methods are used to provide absolute URIs:

InputSource(String uri)

Use this constructor when you are creating an InputSource consisting only of a fully qualified URI in a scheme understood by the JVM you are using. Such schemes commonly include http://, file://, ftp://, and increasingly https://.

InputSource.setSystemId(String uri)

Use this method to record the URI associated with text you are providing directly.

For example, these three ways to parse a document are precisely equivalent:

String    uri = ...;
XMLReader parser = ...;

parser.parse (uri);
// or
parser.parse (new InputSource (uri));
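A runnable sketch of that equivalence follows. The third form used here, an empty InputSource given its URI through setSystemId(), is an assumption based on the methods listed above; the class and method names are hypothetical.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class EquivalentParseSketch {
    /** Parses the same absolute URI three ways; returns the element count from each. */
    public static int[] parseThreeWays(String uri) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        final int[] count = { 0 };
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String u, String localName, String qName, Attributes atts) {
                count[0]++;
            }
        });
        int[] results = new int[3];

        // 1. Pass the URI directly.
        parser.parse(uri);
        results[0] = count[0];
        count[0] = 0;

        // 2. Wrap the URI in an InputSource.
        parser.parse(new InputSource(uri));
        results[1] = count[0];
        count[0] = 0;

        // 3. Start with an empty InputSource and record the URI on it.
        InputSource in = new InputSource();
        in.setSystemId(uri);
        parser.parse(in);
        results[2] = count[0];

        return results;
    }
}
```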
