Parts of a DTD
As with any great masterpiece, you need to use certain building blocks to construct your DTD. Everything about the document—from the entities to the elements that help construct the document—is defined in the DTD. A DTD contains no content, only definitions.
TIP: You must use the correct XML syntax when you create your DTDs. Otherwise, your document won’t parse, and you’ll have nothing but errors. Learn the syntax for declaring elements, attributes, and entities. If you need more information on how to read the XML specification, review Chapter 2. Information on elements, attributes, and entities is described in detail in Chapters 6, 7, 8, and 9.
When you think about it, the entire document rests on the shoulders of the DTD. It not only defines the elements, attributes, and entities of the document, but it also describes everything that can be contained within the document. The DTD actually accomplishes many things, including:
• Defines and provides the names of all the elements used in the document.
• Defines how elements may (and in some cases must) be used together to describe information.
• Defines and provides the names of all the attributes used in a document.
• Defines all attributes (and their default values) for each element.
• Defines and provides the names and the content of all the entities used in the document.
• Specifies the order in which elements must appear in the document.
• Outlines any comments that may help clarify the structural context of the document.
The DTD makes it possible for the document content to be marked up and then displayed properly when it’s parsed. To construct a well-defined DTD, you must first know exactly what individual parts are included. DTDs include the following:
• Character data, including normal character and special character data
• White space characters
• Elements, including their start and end tags (unless they’re empty elements)
• Processing instructions
Each part is used to create DTDs that make up both well-formed and valid XML documents. Let’s examine each part briefly.
TIP: We discuss both well-formed and valid documents later in this chapter.
The smallest piece of a DTD is a single character. So, how does a single character help form a well-designed DTD? It’s simple: Characters make up the content of the document and also the content of entities, elements, attributes, and even comments. Character data specifies a certain process, marks up data, or represents some type of information.
Character data can be a mixture of text and markup information. When that happens, you have mixed content. Here’s an example of what mixed content looks like:
<information>This section will help you understand
how to create well-designed DTDs.</information>
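The example above shows only character data inside the element. As a minimal sketch of what mixed content can look like once markup is interleaved with the text, the following document declares a child element named emphasis (an assumption made for illustration) and uses it inside the sentence:

```xml
<?xml version="1.0"?>
<!DOCTYPE information [
  <!-- Mixed content: text interleaved with zero or more emphasis elements -->
  <!ELEMENT information (#PCDATA | emphasis)*>
  <!ELEMENT emphasis (#PCDATA)>
]>
<information>This section will help you understand
how to create <emphasis>well-designed</emphasis> DTDs.</information>
```

In a mixed-content declaration, #PCDATA comes first and the whole group is followed by an asterisk, so the DTD can say which child elements may appear but not how many or in what order.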
All characters used in the DTD and the document itself within XML are based on the ISO 10646 character-encoding scheme, commonly referred to as Unicode. You can use Unicode to represent the same characters across different platforms. It supports encoding schemes for 8-bit, 16-bit, and 32-bit character sets.
Some special characters, however, are reserved and are used within XML to signify certain functionality. For example, the left angle bracket (<) is used to indicate element and attribute declarations, as well as to identify the beginning of a tag for an element used in an XML document. You can still use these reserved characters within the content of a document, without the processor reporting an error, by writing them either as character references based on their Unicode code points or as the predefined entities that XML reserves for them. Table 5.2 lists several reserved characters and their Unicode hexadecimal assignments along with the escape character strings used to denote them in an XML document.
Table 5.2 Reserved characters in XML.
Character    Unicode Assignment    Escape Character String
<            &#x3C;                &lt;
>            &#x3E;                &gt;
&            &#x26;                &amp;
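As a brief sketch of how these escapes are used in practice, the following document places all three reserved characters in element content. The element name note and its text are assumptions made for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
]>
<!-- The left angle bracket and the ampersand must always be escaped in content -->
<note>If a &lt; b &amp; b &lt; c, then c &gt; a. Written with a character reference: a &#x3C; b.</note>
```

The entity form and the character-reference form are interchangeable here; the processor hands the same character to the application in both cases.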
TIP: For more information about Unicode, check out the official Web site at www.unicode.org. Chapter 9 includes an in-depth discussion of how to represent non-ASCII characters in your XML documents.
White Space Characters
White space is simply empty space between characters. However, white space can be more than just space. It can also be one of the following characters:
• The space character (Unicode character #x20)
• The line feed (Unicode character #xA)
• The tab character (Unicode character #x9)
• The carriage return (Unicode character #xD)
You can combine any of these characters in a string of character data. XML processes white space by using white space handling. XML processors read all white space along with all other characters in a document. However, you need to describe to the XML processor when white space is significant by using the xml:space attribute within the attribute list. For example, if you want to signify that white space is important and needs to display in the document itself, you would define the following within the attribute list:
<!ATTLIST listing xml:space
(default | preserve) "preserve">
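To show the declaration in context, here is a minimal sketch of a complete document that uses it. The content of the listing element is invented for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE listing [
  <!ELEMENT listing (#PCDATA)>
  <!-- xml:space is declared as an enumerated type limited to default and preserve -->
  <!ATTLIST listing xml:space (default | preserve) "preserve">
]>
<listing>
  for each part
      print partno
</listing>
```

Because "preserve" is the declared default, the attribute does not have to be written on the start tag. The attribute only signals intent; it is up to the application receiving the data to keep the indentation and line breaks.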
While we’re on the subject of characters, we might as well discuss entities. XML provides a mechanism that makes it easy to create information that will be placed in the document repeatedly and will help you maintain the document over time. This mechanism is called an entity and unlike HTML, with XML, you get to create your own entities. An entity is a component, be it text or other data, that can be substituted into a document based on a declaration. The component can be a text string or any type of file; that’s right, entities can also reference entire files. That means an entity could reference a masthead, a chapter in a book, or anything to which a reference can point. Because entities allow text and files to be substituted into a document, they can be used to replace values when a document is parsed and displayed.
TIP: Chapters 1 and 2 discuss the roles entities play in markup in general and in XML in particular. Chapter 9 is entirely devoted to the ins and outs of entities.
It’s our opinion that entities are by far the easiest portion of the DTD to create, because in their most basic form, entities provide a mechanism to specify content without much effort. Entities are also the most complex, because you can do so much with them. There are two types of entities: parsed entities and unparsed entities. The following example focuses on parsed entities that associate an entity name or keyword with a text string or file, as well as on unparsed entities, which associate an entity name or keyword with content that may be non-textual (such as a graphic). All other kinds of entities are discussed in Chapter 9.
For example, if you use copyright information again and again in a document, you could create an entity with the following code:
<!ENTITY copyright "Copyright 2000">
Then, you would specify the entity within the document by using the entity reference, as follows:
All this information is &copyright;.
The previous code would display as follows when the document is parsed:
All this information is Copyright 2000.
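Put together, the declaration and the reference might sit in a single document like the following sketch. The element name notice is an assumption made for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE notice [
  <!ELEMENT notice (#PCDATA)>
  <!-- Internal parsed entity: the name copyright stands for the quoted text -->
  <!ENTITY copyright "Copyright 2000">
]>
<notice>All this information is &copyright;.</notice>
```

Changing the year now means editing one declaration rather than every place the text appears.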
You can also create entities that reference entire files to be inserted into the document. You use the SYSTEM keyword to identify the URL of the file. SYSTEM is a reserved keyword that tells the XML processor the referenced external entity is in a file. The following code inserts a file into an XML document (note that the URL used is just an example):
<!ENTITY adddata SYSTEM "http://www.adn.com/docs/document.xml">
When the XML processor encounters the adddata entity, it processes the file and then replaces the entity with the processed contents of the file.
You can also use this type of code to include other types of files—for example, the image file for a logo. Entities that reference non-text files are called unparsed entities; they are not parsed by the XML parser, but are instead handled by an application you specify in the DTD. You specify an application or processor to handle the file using the keyword NDATA. NDATA is used as a pointer to a previously declared notation to specify which application will process the entity reference. The following code allows you to insert a graphic into an XML document and have it processed as a GIF file, and not as any other type of file:
<!ENTITY adnlogo SYSTEM
"http://www.adn.com/images/logo.gif" NDATA gif >
WARNING! You can reference parsed entities anywhere in a document, but you can only reference an unparsed entity in the value of an element’s attribute.
Therefore, when you use the adnlogo entity in a document, the processor uses the GIF notation to determine how to handle the entity.
TIP: You can use the NOTATION declaration to declare the GIF notation. When you create a notation, identify the URL of the helper application using the SYSTEM keyword and the following code (note that iviewer.exe is the application that can display the GIF file):
<!NOTATION gif SYSTEM "http://site.com
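As a sketch of how a notation, an unparsed entity, and an attribute fit together, the following document declares all three. The notation's system identifier is a placeholder, and the masthead element and its logo attribute are assumptions made for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE masthead [
  <!-- Placeholder identifier standing in for a helper application -->
  <!NOTATION gif SYSTEM "viewer-placeholder">
  <!-- Unparsed entity: NDATA ties the file to the gif notation -->
  <!ENTITY adnlogo SYSTEM "http://www.adn.com/images/logo.gif" NDATA gif>
  <!ELEMENT masthead EMPTY>
  <!-- An attribute of type ENTITY is where the unparsed entity is named -->
  <!ATTLIST masthead logo ENTITY #REQUIRED>
]>
<masthead logo="adnlogo"/>
```

Note that the attribute value is the bare entity name, adnlogo, without an ampersand or semicolon. The XML processor does not fetch or parse the image; it passes the entity and notation identifiers on to the application.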
Because elements are fully explained in Chapter 6, we only discuss them briefly here. Elements construct the parts of a document. You have full control over what elements you create and use in your XML documents. You usually declare elements and their content first: at the top of the DTD file when you use an external DTD, or inside the DOCTYPE declaration that follows the XML declaration at the top of the XML document when you create an internal DTD. An element declaration looks like this:
<!ELEMENT name content>
For example, if you were creating an online parts ordering system, you could declare the part number element named partno and specify that this content can be parsed character data. To do so, you would include the following code in your DTD:
<!ELEMENT partno (#PCDATA)>
You can declare as many elements as you need in the DTD. Yet, depending upon how you structure the elements in the DTD, you may or may not need to use all those elements in your XML document. Or, depending upon how you specify the elements’ content, you may need to use one element before you use another. You determine such element preferences in the DTD. For example, to specify that the element name must contain two other elements, first and last, and that first has to appear before last, you list the two elements in that order, separated by a comma, in the content model for name. The order of the declarations themselves does not matter:
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT name (first, last)>
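The following sketch shows a document that is valid against these declarations. The personal name is sample data, and name is deliberately declared before its children to show that only the content model fixes the order:

```xml
<?xml version="1.0"?>
<!DOCTYPE name [
  <!-- The comma in the content model means: first, then last, exactly once each -->
  <!ELEMENT name (first, last)>
  <!ELEMENT first (#PCDATA)>
  <!ELEMENT last (#PCDATA)>
]>
<name>
  <first>Ada</first>
  <last>Lovelace</last>
</name>
```

A document that placed last before first, or left one of them out, would still be well formed but would not be valid.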
Attributes help describe exactly what elements are, the kind of information that must be placed in them, and the order in which the information should be placed. Attributes can be placed directly after an element has been declared, or you can place them in groups (attribute lists) after all elements have been declared within the DTD. It’s best to declare the element and then declare its attribute list because doing so makes it easier to read the DTD. You’ll know exactly what element has been declared and what attributes are specified if you follow the element with its attribute list.
A list of attributes is defined for an element using the ATTLIST assignment to specify exactly what can and cannot be placed in an element and what information is required. Because a document’s elements are completely configurable, you’re free to create attributes as necessary.
This is where the tough part begins. Although attributes help clarify what content the elements can contain, they require more information than elements because they further clarify what an element does. Therefore, their construction is a bit more complex; you really have to think about your elements and how to describe them not only to humans, but also to parsers. The parser that reads the XML DTD document uses the attributes to set certain flags, such as whether an order has been processed. The application in turn uses this flag to determine if data can be edited.
The basic format of an attribute is specified through the use of the ATTLIST assignment, as shown in the following code:
<!ATTLIST elementname attributename type default_usage>
If we break down this string of code from left to right, this is what we have:
• !ATTLIST—Begins the attribute-list declaration.
• elementname—Names the element to which the attribute is associated. Attributes can appear directly after elements or anywhere within the DTD. You need to include this information because, in many cases, the attribute doesn’t follow the actual element it describes.
• attributename—Specifies the name given to the attribute. You can give your attribute just about any name you want (within the limitations specified by the XML 1.0 and XML Namespaces recommendations). The attribute name is significant when you need to reference it while using an application and when someone’s reading your document and trying to make sense of it.
• type—Specifies whether the attribute will be a string type, tokenized type, or enumerated type.
• default_usage—Specifies the default values that can be used with attributes. Some of the default values used are:
• #IMPLIED—A value is optional for this attribute. The processor should notify the system when no value is set, but the document can still be considered valid.
• #FIXED—The value is fixed and cannot be changed. The document is not valid if the attribute is used with a value different from the default.
• #REQUIRED—A value is mandatory for this attribute. If no value is set, the document is not valid.
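As a sketch of how the three keywords and a plain default value look side by side, the following document declares four attributes on one element. The order element and all of its attribute names and values are assumptions made for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE order [
  <!ELEMENT order (#PCDATA)>
  <!-- id: tokenized type, must be supplied.  note: string type, optional.
       currency: string type, always USD.  status: enumerated type, defaults to open. -->
  <!ATTLIST order
    id        ID                  #REQUIRED
    note      CDATA               #IMPLIED
    currency  CDATA               #FIXED "USD"
    status    (open | processed)  "open">
]>
<order id="o1001">Two brackets</order>
```

Only id is written in the document. A validating processor supplies currency="USD" and status="open" from the DTD and reports note as unspecified.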
For example, suppose you want to declare an element called last and you want to create an attribute named format for it. You want to specify that this attribute must contain character data (CDATA) and not markup data. In addition, you want the attribute to be required; this tells a validating processor to report an error if the attribute is missing from the element. The code would look like this:
<!ELEMENT last (#PCDATA)>
<!ATTLIST last
format CDATA #REQUIRED>
TIP: You can do many things with attributes and use them to closely control what information is contained in a document and what form that information should take. To learn more about the various attribute options you can declare in your DTD, see Chapter 8.
Comments are important in any DTD. They not only help you remember what you placed in the DTD, but they also help others know the purpose for certain elements or attributes you create. Remember, one of the XML design principles is that XML data be humanly legible and easily understood. With a few well-placed comments, you can ensure that this design principle is followed. Here is the syntax for a comment:
<!-- comment -->
The first part, <!--, signifies the start of the comment, and --> signifies the end. If you’ve worked with HTML, this comment format probably looks very familiar to you. With the exception of two hyphens (--), anything can be placed within the comment itself. No part of the comment is displayed or processed by the parser. Comments can be on a single line or broken up and placed on multiple lines. For example:
<!-- The information
specified in this document
outlines the document structure. -->
You should use a lot of comments when you first start creating XML documents. Consider the DTD shown in Listing 5.2, which was created by David Megginson. He has fully commented the entire DTD. He has indicated each element’s purpose and specified where the entities start. When you read Listing 5.2, you’ll understand what the author’s intentions were when he created each element, attribute, and notation, because he identified each with a comment. Each section that describes a particular set of DTD components is commented. As you can see, it doesn’t matter where you place the comments. Placing them frequently throughout the DTD makes it easier to read and understand. Even the end of the DTD is described in detail.
Listing 5.2 A novel DTD.
<!-- novel.dtd: A simple XML DTD for marking-up novels.
Copyright (c) 1997 by David Megginson. -->
<!-- Content model for phrasal content -->
<!ENTITY % phrasal "#PCDATA|emphasis">
<!-- ******** -->
<!-- Elements -->
<!-- ******** -->
<!-- The top-level novel -->
<!ELEMENT novel (front, body)>
<!-- The frontmatter for the novel -->
<!ELEMENT front (title, author, revision-list)>
<!-- The list of revisions to this text -->
<!ELEMENT revision-list (item+)>
<!-- An item in the list of revisions -->
<!ELEMENT item (%phrasal;)*>
<!-- The main body of a novel -->
<!ELEMENT body (chapter+)>
<!-- A chapter of a novel -->
<!ELEMENT chapter (title, paragraph+)>
<!ATTLIST chapter
id ID #REQUIRED>
<!-- The title of a novel or chapter -->
<!ELEMENT title (%phrasal;)*>
<!-- The author(s) of a novel -->
<!ELEMENT author (%phrasal;)*>
<!-- A paragraph in a chapter -->
<!ELEMENT paragraph (%phrasal;)*>
<!-- An emphasized phrase -->
<!ELEMENT emphasis (%phrasal;)*>
<!-- **************** -->
<!-- General Entities -->
<!-- **************** -->
<!-- These really should have their Unicode equivalents. -->
<!-- em-dash -->
<!ENTITY mdash "--">
<!-- left double quotation mark -->
<!ENTITY ldquo "``">
<!-- right double quotation mark -->
<!ENTITY rdquo "''">
<!-- left single quotation mark -->
<!ENTITY lsquo "`">
<!-- right single quotation mark -->
<!ENTITY rsquo "'">
<!-- horizontal ellipse -->
<!ENTITY hellip "...">
<!-- end of DTD -->
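To see the DTD at work, here is a minimal sketch of a document written against it. It assumes the listing is saved as novel.dtd in the same location as the document and that the id attribute is declared for the chapter element, as the listing intends. The title, author, and text are invented sample data:

```xml
<?xml version="1.0"?>
<!DOCTYPE novel SYSTEM "novel.dtd">
<novel>
  <front>
    <title>A Sample Novel</title>
    <author>A. N. Author</author>
    <revision-list>
      <item>First draft</item>
    </revision-list>
  </front>
  <body>
    <chapter id="ch1">
      <title>The <emphasis>First</emphasis> Chapter</title>
      <paragraph>&ldquo;Hello,&rdquo; she said&mdash;and the story began&hellip;</paragraph>
    </chapter>
  </body>
</novel>
```

Every element appears in the order its parent's content model requires, the chapter carries its required id, and the general entities declared at the end of the DTD stand in for the punctuation marks.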
Processing instructions carry information for applications; the XML processor passes them through to the application unchanged. For example, the following processing instruction could be used in an XML document to tell an application that uses the data (a Web browser, for instance) to apply the indicated style sheet to the contents of the document in which the instruction appears:
<?xml-stylesheet href="MyStyle.xsl" type="text/xsl"?>
Every processing instruction must start with <? and end with ?>. The contents of a processing instruction begin with a target name. The target name is generally used to indicate the application (for example, an XSL processor) that will use the data contained in the processing instruction. The target name may not be the letters “xml” in any combination of upper and lowercase. These names (xml, Xml, xMl, xmL, and so on) are reserved for use by the XML declaration.
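As a sketch of where processing instructions can sit, the following document carries the style sheet instruction in its prolog and a second instruction just before the root element. The target order-app and its data are hypothetical, as is the part number:

```xml
<?xml version="1.0"?>
<?xml-stylesheet href="MyStyle.xsl" type="text/xsl"?>
<!DOCTYPE partno [
  <!ELEMENT partno (#PCDATA)>
]>
<!-- Hypothetical target: only an application that recognizes order-app acts on it -->
<?order-app recalculate="yes"?>
<partno>5567</partno>
```

The first line looks like a processing instruction but is the XML declaration, which is why its name is off limits as a target. An application that does not recognize a target is free to ignore the instruction.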