Schema Tricks and Tips
The Purpose of Schemas
What is a Schema?
A Schema is an abstract definition of an object's characteristics and interrelationships. This schema is represented in different ways in different environments.
- Databases - A database schema describes the table names and columns, describes the relationships between tables (via keys), and acts as a repository for triggers and stored procedures.
- Classes - An object oriented class interface is a schema for describing objects, including properties, methods, and events acting on the object.
- UML - A UML object is in fact an abstraction of a schema by way of a visual metaphor. UML objects can be used to generate other types of schemas, so in a sense a UML object is a meta-schema.
- XML - An XML Schema describes the interrelationship between elements and attributes that make up a given XML "object". Schemas can include data type representations, but don't necessarily have to.
What is a Schema Not?
It's worth contrasting what schemas are with what they aren't. Specifically, a schema is not ...
- An Object - While it is possible as a special case to make a schema that describes other schemas (think about XML a bit in this role), in general a schema is an abstraction - it contains no actual data, only potential data.
- A Form - A schema should be independent of the media used to display the information in the schema. While it is possible (and desirable) to annotate your schemas, keep in mind that this annotation shouldn't be tied into a specific implementation.
- A Transformation - Nope, that's XSLT ... different seminar.
XML Schema Languages
There are currently a number of different schema language representations, although by the time of this conference there will probably be one fully approved XML Schema Definition, which this discussion explores. The other schema languages include:
- Document Type Definitions (DTD) - The principal schema mechanism for SGML, DTDs are still the most common form of XML schema representation. However, they suffer from a number of significant limitations, and are being phased out of the XML oeuvre.
- XML-Data Reduced (XDR) - One of the first schema representations for XML, this was proposed in early 1998 by Microsoft, and forms the foundation of almost all of their XML data type story. XDR included the introduction of data types, but didn't include mechanisms for creating OOP representations (or generating more sophisticated datatypes).
- Schema for Object-Oriented XML (SOX) - Forming the foundation of the ebXML movement, the SOX schema architecture introduced a more sophisticated form of object oriented design, including the notion of inheritance and archetypes.
Why DTDs Are Not Enough
When XML was first designed, the SGML community assumed automatically that the SGML document definition language, DTD, would be sufficient to describe XML. However, DTDs suffer from a number of limitations that make them less than ideal as data language grammars.
- Document Centric - DTDs are designed for the manipulation of text blocks for the purpose of creating documents. XML is increasingly being called upon to handle the transport of data, not documents, and the largely macro driven facilities of DTDs are simply not up to the task.
- No Data Types - There is no way using a DTD to differentiate between the string "32.421" and the number 32.421. As a consequence, applications that use XML must have an existing convention for denoting which fields are numbers vs. strings. Moreover, there are no constraints to limit data types to ensure, for example, that a given string consists of 6 digits and two letters.
- Not in XML - A DTD is written in its own language. What this means is that validating parsers must be written to understand DTDs using separate capabilities from those they use for manipulating XML. With a schema written in XML, you can query the schema for more detailed information from an XML expression, something you couldn't do with a DTD.
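The "No Data Types" point can be seen in a small document with an internal DTD subset. This is only an illustrative sketch: the element names (order, quantity, partCode) are invented here and are not part of the original material.

```xml
<?xml version="1.0"?>
<!DOCTYPE order [
  <!-- The DTD can only say "character data"; it cannot say "number" -->
  <!ELEMENT order (quantity, partCode)>
  <!ELEMENT quantity (#PCDATA)>
  <!ELEMENT partCode (#PCDATA)>
]>
<order>
  <!-- Both values are valid against the DTD, sensible or not -->
  <quantity>lots</quantity>
  <partCode>32.421</partCode>
</order>
```

A validating parser accepts this document, so the application itself has to discover that the quantity is not a number.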
Why DTDs Are Not Enough (More Reasons)
- Entities - Entities (denoted by the &entityRef; notation) are ambiguous, especially in asynchronous environments. They can prove troublesome when creating markup code, require that entity references be resolved prior to the processing of the XML document itself, and are very difficult to manipulate in XSLT.
- Parameter Entities - Parameter entities provide a mechanism for changing an XML document based upon some external parameter. However, this essentially forces the XML document to be responsible for its own transformations, something that can be accomplished far more easily with XSLT.
- Notations - A notation maps a given element to a media application for processing that element. While useful in theory, notations frequently end up producing too tight a coupling between the data of an XML document and the corresponding implementation of that data in a given system.
- No Mechanism for Inheritance - A DTD contains no direct mechanism to handle the concept of one schema inheriting some or all of the characteristics of another. This tends to keep DTDs constrained to the level of individual object definitions, rather than larger frameworks.
Who Needs Schemas?
While these constraints provide a way of seeing what an XML schema definition language shouldn't do, the formal requirements for creating a schema language come from a number of different types of users, and understanding these conflicting needs can help to explain what the requirements of such a language are.
- Document Users and Vendors - The original XMLers, these users need a language that is sufficiently flexible to describe and contain sophisticated documents. They are mostly happy with the features of DTDs, and see the additional requirements as cumbersome.
- Database Vendors and Developers - On the opposite end of the spectrum, most traditional SQL database vendors have realized that XML can be a useful mechanism for transmitting large amounts of data over the wire without forcing highly specialized clients. Their primary requirement is for a stable set of simple datatypes.
- e-Commerce Developers and Vendors - These users have a need for flexibility, but also want ways of ensuring that there is as little work in converting between schemas as possible. As a consequence, they are the ones pushing for stronger inheritance models and other object oriented features.
- User Agent Vendors - These are the manufacturers of browsers and other XML enabled devices. They want a schema language that can be used for specifying languages that best utilize the characteristics of user agents.
Requirements for a Schema Language Definition
These often contradictory stances have made it difficult to form consensus on the "perfect" schema language. However, over time, the very tangible need for an XML based schema has forced a minimal set of requirements for such a language.
- Flexible Data Type Specification - A schema language should include the ability to both define a set of core data types and provide an extension mechanism for creating new types from these.
- Containment, Grouping, and Ordering - Any schema language should describe the relationship between the elements and attributes, including ordering, mutual exclusivity, grouping, and defaults.
- Object Oriented Design - The language should incorporate many of the features of traditional object oriented programming languages. At the very least, there should be a mechanism to accommodate inheritance.
- Text Friendly - A schema language should have flexible ways of dealing with compound documents, open element and attribute sets, and differentiation between two or more objects within the same XML document.
- Validation - One of the key roles of any schema language is to ensure that an XML document is not only well formed (it can be parsed), but valid (everything in the document is of the right type, in the right place, and in the right quantity). Validation mechanisms by constraint (minima and maxima) and text patterns (regular expressions) are also desirable.
Advantages of an XML Based Schema Language
Beyond avoiding the disadvantages that plague DTDs, XML based schemas offer a number of potential benefits as well.
- Dynamic Schemas - Especially with such items as enumerations, validation data isn't always available at design-time. While DTDs also include a macro-like validation mechanism, generating DTDs dynamically requires much more customized code.
- XSLT Manipulation - XML based schemas can be queried and manipulated by XSLT. This means that it becomes possible to use XSLT to generate an instance of data from its associated schema in a very generic fashion, especially when parameterization is employed.
- Self-Documentation - By moving documentation about a given schema into the schema itself (through the use of annotations), you can simplify the documentation about an XML structure, and even use the documentation from schemas to define column labels and other tags in interface devices.
- Auto-Schemas - With XSLT and a good parser you can convert an XML document into a first pass schema (within limits).
- Interface Definition Languages - One area where XML schemas are gaining attention is in the realm of interface definition languages, where an XML structure describes the interface of a programmatic class. As XML technologies such as SOAP become more prominent, this makes discovery of computer application classes possible.
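As a rough illustration of the XSLT Manipulation point, the following stylesheet sketch treats a schema as ordinary XML input and lists its top-level element declarations. It assumes the schema uses the namespace URI of the final XML Schema Recommendation (http://www.w3.org/2001/XMLSchema); schemas written against earlier drafts used a different URI, so the xmlns:xsd declaration would need to change to match.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsl:output method="text"/>

  <!-- The input document is the schema itself -->
  <xsl:template match="/xsd:schema">
    <!-- Visit each top-level element declaration -->
    <xsl:for-each select="xsd:element">
      <xsl:value-of select="@name"/>
      <xsl:text>: </xsl:text>
      <!-- Empty when the element uses an anonymous type -->
      <xsl:value-of select="@type"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The same approach extends to generating instances, forms, or documentation from a schema, none of which is possible when the grammar lives in a DTD.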
Understanding the XSD
W3C Schema Definition Language
As I write this, the W3C XML Schema Definition Language (XSD) has just recently gone into Candidate Recommendation Status. The XSD Standard is in fact two different documents, along with a "tutorial" document. These are listed below, and will all be covered in greater depth subsequently.
- XML Schemas Part 1: Structures - This document provides the grammar for the language, defining the notion of a schema in an abstract fashion then discussing specific implementation details.
- XML Schemas Part 2: Datatypes - This document gives the datatype primitives that are used within schemas, as well as covering some of the constraining rules and subtypes, such as regular expressions.
- XML Schemas Part 0: Primer - This 'primer' is not for the faint of heart. As a measure of the complexity of schemas, even the primer can be difficult to follow.
XSD Structures
XML Schema Structures form the meat of the schema specification. Contained within its pages are the keys to describing the structure of documents, the distinction between Simple and Complex element types, Abstraction and Inheritance, and considerably more. The document itself is divided into the following:
- Conceptual Framework - This describes the principles used to abstract the process of defining schemas. It is useful from a terms definition standpoint, but it doesn't explicitly define the XML implementation.
- Schema Components - Not there yet. Schema components describe each of the elements and attributes that are used in creating formal schemas from a strictly formalistic standpoint.
- XML Implementation - This part contains the actual implementation details, covering the XML representation of the formal grammar.
- Constraints - This section deals with ways that a given element or attribute can be constrained to work within a subset of its original domain.
- Access and Composition - This section covers how schemas work across a distributed system, including the thorny issues of namespace design and schema location and retrieval.
- Validation - This section looks at validation mechanisms and how they should be addressed by the parser vendors.
- Location is http://www.w3.org/TR/xmlschema-1/
XSD Datatypes
The Datatypes document is simpler, dealing as it does with the specific implementation of type rather than conceptual frameworks.
- Datatype Concepts - This section looks at data types in an abstract fashion, setting up the distinctions between differing archetypes of data, and introducing the notions of value and lexical spaces.
- Built-in Datatypes - This section lists the various core datatypes, including those for string, numeric, and date representation.
- Datatype Components - This section deals primarily with constraints that limit the scope of given datatypes, and includes patterns (i.e., regular expressions), enumerations, precision, and minima and maxima constraints.
- Location is http://www.w3.org/TR/xmlschema-2/
XSD Primer
In order to understand the XSD specifications, read the Primer. It can be a little cryptic, but Document 0 is still probably the best place to see Schema code in action. Just a few notes.
- Four Separate Examples - The primer looks at four different examples of schemas defined within the language, working its way up in complexity from specifying simple documents to creating abstract types and inheritance.
- Location is http://www.w3.org/TR/xmlschema-0/
Datatypes and Simple Types
Built-In Primitive Datatypes
XSD contains a number of built-in datatypes. In some cases these types are primitives - they are not made of any other types - while others are derived from more primitive types (or from derived types that are themselves built from primitives). The following list of primitive types is far from exhaustive, but covers most of the more common types.
Type || Description || Example
string || A sequence of Unicode characters. || "This is a sample string."
boolean || Either true (1) or false (0). ||
float || A single precision 32-bit floating point type. || -1E4, 2442, 342.34, 0, INF, NaN
double || A double precision 64-bit floating point type. || -1E4, 2442, 342.34, 0, INF, NaN
decimal || A decimal number of arbitrary precision. ||
timeDuration || A specific period of time, in the format PnYnMnDTnHnMnS. Only the relevant components need be shown. || P1Y2M13DT4H represents one year, two months, thirteen days and four hours.
uriReference || A Universal Resource Locator. ||
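To show how built-in types are attached to declarations, here is a minimal sketch that uses only types whose names are the same in the final XML Schema Recommendation. The element names (measurement, label, and so on) are invented for illustration.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="measurement">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Each child element is bound to a built-in primitive type -->
        <xsd:element name="label" type="xsd:string"/>
        <xsd:element name="calibrated" type="xsd:boolean"/>
        <xsd:element name="reading" type="xsd:float"/>
        <xsd:element name="preciseReading" type="xsd:double"/>
        <xsd:element name="cost" type="xsd:decimal"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

With this in place a validator can reject a reading of "lots", which a DTD would have accepted as ordinary character data.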
Built-in Derived Datatypes
Derived types were added by the W3C to cut down on developers rolling their own for many of the more common data formats such as integer or time.
Type || Description || Example
integer || A decimal value in which the scale (the number of digits after the decimal point) is 0. || ..., -2, -1, 0, 1, 2, ...
nonPositiveInteger || All integers less than or equal to 0. || ..., -3, -2, -1, 0
long || Value derived from integer, from -9223372036854775808 to 9223372036854775807. || 2214433234, 12, -32551
int || Value derived from long, from -2147483648 to 2147483647. || 32768, 12, -32551
short || Value derived from int, from -32768 to 32767. || 32765, 12, -32551
byte || Value derived from short, from -128 to 127. || 78, 12, -114
unsignedInt || Value derived from unsignedLong, from 0 to 4294967295. ||
time || An instant of time that recurs every day, given in the format HH:MM:SS-ZZ:YY, where ZZ:YY represents the time zone offset relative to Greenwich Mean Time. || 21:15:00-08:00, which is 9:15 at night in Seattle (8 hours from Greenwich Mean Time)
timeInstant || Combines the date and time formats, with the time separated from the date by a "T". || 2000-08-15T21:15:00-08:00 is August 15, 2000 at 9:15 PM in Seattle.
DTD Specific Datatypes
In addition to string, numeric, and date types, the schema reference also covers more traditional XML types that pertain to atomic units - name tokens, entities, and so forth. These are largely included for backward compatibility with DTDs.
Type || Description || Example
ID || A unique name token that identifies a given element. ||
IDREF || A reference to an existing ID for a given element. ||
NMTOKEN || A collection of alphanumeric characters and the underscore character, used within attributes. || the value 'red' in the attribute color="red"
NMTOKENS || A list of NMTOKEN items, typically as options within a given attribute, separated by white space. || the value 'red blue green' in the attribute colors="red blue green"
ENTITY || A reference to a specific entity object defined within a DTD. ||
Creating A Simple Type By Constraint
Simple types form the foundation of datatypes, and are created in one of two ways. Either they are constrained from existing datatypes, or they are aggregated from simpler datatypes. As examples of the former, consider a phoneNumber type, which is essentially a string that follows a very specific order (including potential optional characters), a purchase order quantity amount (limited to 1000 units), and an enumeration of specific holidays.
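The three constrained types described above might be sketched as follows. This uses the syntax of the final XML Schema 1.0 Recommendation, which differs in places from the Candidate Recommendation draft. The exact phone pattern, the lower bound of 1 on the quantity, the type names poQuantity and holiday, and the holiday values are all assumptions made for illustration.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- A string constrained by a pattern; the area code is optional -->
  <xsd:simpleType name="phoneNumber">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="(\d{3}-)?\d{3}-\d{4}"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- A purchase order quantity between 1 and 1000 -->
  <xsd:simpleType name="poQuantity">
    <xsd:restriction base="xsd:positiveInteger">
      <xsd:maxInclusive value="1000"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Only the listed values are valid -->
  <xsd:simpleType name="holiday">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="NewYearsDay"/>
      <xsd:enumeration value="LaborDay"/>
      <xsd:enumeration value="Thanksgiving"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```

Each type starts from an existing base and narrows it with one facet: a pattern, a maximum, or an enumeration.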
Creating A Simple Type By List
You can also create a simple type by aggregating a list of other simple types. The following, for instance, will create aggregates of the phoneNumber and poQuantity types.
No more than three poQuantities can be given
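A list type along these lines could be sketched as below, again in the syntax of the final XML Schema 1.0 Recommendation rather than the draft. The item types are repeated so the fragment stands on its own; the phone pattern and the names phoneNumberList, poQuantityList, and poQuantityTriple are assumptions.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="phoneNumber">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="(\d{3}-)?\d{3}-\d{4}"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:simpleType name="poQuantity">
    <xsd:restriction base="xsd:positiveInteger">
      <xsd:maxInclusive value="1000"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Whitespace-separated lists of the item types -->
  <xsd:simpleType name="phoneNumberList">
    <xsd:list itemType="phoneNumber"/>
  </xsd:simpleType>

  <xsd:simpleType name="poQuantityList">
    <xsd:list itemType="poQuantity"/>
  </xsd:simpleType>

  <!-- A list restricted to at most three items -->
  <xsd:simpleType name="poQuantityTriple">
    <xsd:restriction base="poQuantityList">
      <xsd:maxLength value="3"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```

Because list items are separated by white space, the item type itself should not allow spaces, which is why the phone pattern here uses hyphens only.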
A Basic Schema
A schema consists of a collection of <element> tags, along with a collection of <simpleType> and <complexType> elements that aggregate other elements into cohesive blocks. In the simplest cases, the arrangement of elements is straightforward, mapping easily to a known schema instance. For example, the schema for the slideshow that you are currently viewing is quite simple (at least at first blush) and is illustrated below.
A Basic Schema (Continued)
This in turn contains the group and slide information.
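A schema of this general shape might look like the sketch below. It is purely hypothetical: the element and type names (SlideShow, Group, Slide, Title, Body) are guesses rather than the names used in the actual slideshow schema, and the syntax is that of the final XML Schema 1.0 Recommendation.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- The single top-level element of a slideshow document -->
  <xsd:element name="SlideShow" type="SlideShowType"/>

  <!-- A slideshow is a title followed by one or more groups -->
  <xsd:complexType name="SlideShowType">
    <xsd:sequence>
      <xsd:element name="Title" type="xsd:string"/>
      <xsd:element name="Group" type="GroupType" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- A group is a title followed by one or more slides -->
  <xsd:complexType name="GroupType">
    <xsd:sequence>
      <xsd:element name="Title" type="xsd:string"/>
      <xsd:element name="Slide" type="SlideType" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- A slide is a title and a body -->
  <xsd:complexType name="SlideType">
    <xsd:sequence>
      <xsd:element name="Title" type="xsd:string"/>
      <xsd:element name="Body" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```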
Complex Types
Simple Types define atomic characteristics, and constrain an existing type. Complex Types, on the other hand, aggregate collections of elements together into a single cohesive unit. For example, a Slide element pulls in a number of string based elements into the contents of a primary Slide object, which can in turn be referenced in a type attribute by some other schema element definition.
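The reuse described here can be sketched with one named complex type referenced by two element declarations. The names SlideType, TitleSlide, Title, and Body are assumptions, and the syntax follows the final XML Schema 1.0 Recommendation.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- A named complex type that aggregates string-based elements -->
  <xsd:complexType name="SlideType">
    <xsd:sequence>
      <xsd:element name="Title" type="xsd:string"/>
      <xsd:element name="Body" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Two different elements share the same type by reference -->
  <xsd:element name="Slide" type="SlideType"/>
  <xsd:element name="TitleSlide" type="SlideType"/>
</xsd:schema>
```

Naming the type is what makes it referenceable; any other element definition can point at it through its type attribute.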
Mixed Content
Mixed content, elements containing both nodes and text, can be especially difficult to specify. The content="mixed" attribute indicates that the enclosed elements will likely occur in the company of one or more unspecified text nodes.
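For comparison, the final XML Schema 1.0 Recommendation expresses mixed content with mixed="true" on the complex type rather than the draft's content="mixed". A minimal sketch in that later syntax, with invented element names (Body, Emphasis, Code):

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Body">
    <!-- mixed="true" allows text between the child elements -->
    <xsd:complexType mixed="true">
      <xsd:choice minOccurs="0" maxOccurs="unbounded">
        <xsd:element name="Emphasis" type="xsd:string"/>
        <xsd:element name="Code" type="xsd:string"/>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
  <!-- A matching instance:
       <Body>A schema is <Emphasis>not</Emphasis> an object.</Body> -->
</xsd:schema>
```

The schema constrains which child elements may appear, but says nothing about the text that surrounds them.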
Attributes
You can define an attribute within an XSD schema using the <attribute> tag, analogous to the way that elements are defined. For example, to create an id attribute associated with the Slide element, you would include the tag:
Note that you cannot create an attribute using a complex type, although an attribute from a simple type is quite permissible.
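A declaration of this kind might be sketched as follows, in the syntax of the final XML Schema 1.0 Recommendation. The choice of xsd:ID as the attribute's type and the Title child element are assumptions.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Slide">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Title" type="xsd:string"/>
      </xsd:sequence>
      <!-- Attribute declarations come after the content model -->
      <xsd:attribute name="id" type="xsd:ID"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

The attribute's type must be a simple type, whether built in (as here) or one you have defined by constraint or list.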
Fixed Types and Default Values
You can assign a specific value to a given element or attribute. The XSD use attribute determines the role that the value plays:
- fixed, in which the element or attribute will always have that value (and as such can typically be excluded),
- default, where the value of the attribute or element is automatically set to the default value if the element is not explicitly specified,
- optional, in which the attribute or element does not need to be explicitly specified within a calling element,
- required, in which the element or attribute must be included within the parent element,
- prohibited, in which the element or attribute cannot be included in the parent element.
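In the final XML Schema 1.0 Recommendation these roles are split: use takes only optional, required, or prohibited, while fixed and default became attributes in their own right. The sketch below uses that later syntax, with invented attribute names, so it will not match the draft syntax described above character for character.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Slide">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Title" type="xsd:string"/>
      </xsd:sequence>
      <!-- Must appear on every Slide -->
      <xsd:attribute name="id" type="xsd:ID" use="required"/>
      <!-- May be omitted, with no value supplied -->
      <xsd:attribute name="notes" type="xsd:string" use="optional"/>
      <!-- Takes the value "none" when omitted -->
      <xsd:attribute name="transition" type="xsd:string" default="none"/>
      <!-- If present, must be exactly "1.0" -->
      <xsd:attribute name="version" type="xsd:string" fixed="1.0"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```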
Boundaries
You can also specify the minimum and maximum number of occurrences of a given element (by definition, an XML element cannot have more than one attribute of the same name).
- maxOccurs gives the maximum number of times a given element can appear, and ranges from 0 (only for use="prohibited") to any integer value, or the text value "unbounded" (no limitation on maximum number of elements).
- minOccurs gives the minimum number of times a given element can appear, and can take on the values 0, 1, or any value less than or equal to maxOccurs in the same section.
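The common combinations of minOccurs and maxOccurs can be sketched as follows. The element names are invented, and the syntax is that of the final XML Schema 1.0 Recommendation, where both attributes default to 1.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Slide">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Exactly once: both bounds default to 1 -->
        <xsd:element name="Title" type="xsd:string"/>
        <!-- Optional: zero or one -->
        <xsd:element name="Subtitle" type="xsd:string" minOccurs="0"/>
        <!-- Between one and five -->
        <xsd:element name="Point" type="xsd:string" maxOccurs="5"/>
        <!-- Any number, including none -->
        <xsd:element name="Note" type="xsd:string"
                     minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```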
Anonymous Types
In certain cases, a type is only specifically needed once within a complex data type definition, and as such there is no explicit need to create a formally named element type. In such cases the type element is said to be an anonymous type. Anonymous types are otherwise identical to their non-anonymous counterparts:
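A sketch of anonymous types, using the syntax of the final XML Schema 1.0 Recommendation and invented names (Slide, Title, Duration): the type definitions are nested directly inside the element declarations and carry no name attribute.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Slide">
    <!-- Anonymous complex type: defined in place, not reusable -->
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Title" type="xsd:string"/>
        <xsd:element name="Duration">
          <!-- Anonymous simple type: seconds, capped at 600 -->
          <xsd:simpleType>
            <xsd:restriction base="xsd:positiveInteger">
              <xsd:maxInclusive value="600"/>
            </xsd:restriction>
          </xsd:simpleType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

The trade-off is locality against reuse: nothing else in the schema can refer to either type.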
Annotations
Schemas describe an object structure, but that description goes beyond simply providing encapsulation and data type information. Annotations let you provide additional information about the element or attribute in question:
- Application Information (appinfo) can be used to pass specific information about the element. For example, information about how a schema-to-instance generator might render a given element would be contained here.
- Documentation (documentation) provides a way of describing the element in human readable terms, and could in turn contain specific information in different languages.
- Labels. Either <documentation> or <appinfo> can be used to provide labels for tables and other interface elements. This can pull the onus of generating output labels from XSLT or DOM into the schema itself.
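An annotated declaration might be sketched as follows, in the syntax of the final XML Schema 1.0 Recommendation. The ui:label element and its namespace URI are hypothetical, standing in for whatever vocabulary an application chooses to read from the application information block.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Title" type="xsd:string">
    <xsd:annotation>
      <!-- Human readable descriptions, one per language -->
      <xsd:documentation xml:lang="en">The heading shown at the top of a slide.</xsd:documentation>
      <xsd:documentation xml:lang="fr">Le titre en haut de la diapositive.</xsd:documentation>
      <!-- Machine readable hints for an application -->
      <xsd:appinfo>
        <ui:label xmlns:ui="http://example.com/ui">Slide Title</ui:label>
      </xsd:appinfo>
    </xsd:annotation>
  </xsd:element>
</xsd:schema>
```

A stylesheet or form generator could read the label from the schema instead of hard coding it.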
Grouping and Ordering
XSD varies from earlier schema models in that the assumed model is one where given elements appear in the sequence specified. However, there are in fact three models for containing data:
- sequence. This, the default model, requires that all elements are presented in the order given.
- choice. The choice model presents the elements as potential options that are mutually exclusive - only one element among the choices given can be a child of the indicated parent element.
Grouping and Ordering II
- all. The all model waives the specific sequencing order of the sequence model and lets you include elements in any order. The one caveat is that minOccurs must be either 0 or 1 and maxOccurs must be 1 (there should be no more than one such element of any given type in an all set). For any given group, the all element must contain all elements in that group.
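The three content models can be sketched side by side. This uses the syntax of the final XML Schema 1.0 Recommendation, in which the compositor is always written out explicitly, and all names are invented for illustration.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="SlideType">
    <!-- sequence: Title first, then whatever the choice allows -->
    <xsd:sequence>
      <xsd:element name="Title" type="xsd:string"/>
      <!-- choice: exactly one of Body or Image -->
      <xsd:choice>
        <xsd:element name="Body" type="xsd:string"/>
        <xsd:element name="Image" type="xsd:string"/>
      </xsd:choice>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="AuthorType">
    <!-- all: Name and Email in either order, Email optional -->
    <xsd:all>
      <xsd:element name="Name" type="xsd:string"/>
      <xsd:element name="Email" type="xsd:string" minOccurs="0"/>
    </xsd:all>
  </xsd:complexType>
</xsd:schema>
```

Note that sequence and choice can nest inside one another, while all has to stand alone as the whole content model.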
Attribute Groups
XSD lets you define a group of related attributes as a single entity that can be referenced by an element. For example, suppose that in a catalog of books of different genres you still had a number of common attributes: cover image (href), list price (price), and ISBN (isbn). You could define an attribute group that would capture this information, as follows:
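Such a group might be sketched as below, in the syntax of the final XML Schema 1.0 Recommendation. The attribute names href, price, and isbn come from the description above; the group name, the attribute types, and the Novel and Cookbook elements are assumptions.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Attributes shared by every kind of book -->
  <xsd:attributeGroup name="catalogAttributes">
    <xsd:attribute name="href" type="xsd:string"/>
    <xsd:attribute name="price" type="xsd:decimal"/>
    <xsd:attribute name="isbn" type="xsd:string"/>
  </xsd:attributeGroup>

  <xsd:element name="Novel">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Title" type="xsd:string"/>
      </xsd:sequence>
      <!-- Pull in all three attributes by reference -->
      <xsd:attributeGroup ref="catalogAttributes"/>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="Cookbook">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Title" type="xsd:string"/>
      </xsd:sequence>
      <xsd:attributeGroup ref="catalogAttributes"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```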
Advanced Features
This is sufficient to create basic XSD schemas, though the specification itself is considerably more complex. The XSD specification also permits the following:
- Include. You can break a schema into a set of smaller component schemas.
- Class Inheritance. It is possible to create a type that derives from other types in a very straightforward manner. This makes inheritance possible.
- Equivalence Classes make it possible to create polymorphic classes, where the same general interface can have separate implementations based upon internal data.
- Abstract Elements provide a jumping off point for the implementation of inheritance within XML terms.
- Preventing Derivations guarantees that a given type cannot be derived from any further, so that no other type can inherit from it.
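Several of these features can be sketched together. The syntax is that of the final XML Schema 1.0 Recommendation, where equivalence classes are called substitution groups; the names SlideType, TextSlideType, Slide, and TextSlide are assumptions.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Abstract base type: exists only to be derived from -->
  <xsd:complexType name="SlideType" abstract="true">
    <xsd:sequence>
      <xsd:element name="Title" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Inherits Title and adds Body; final blocks further derivation -->
  <xsd:complexType name="TextSlideType" final="#all">
    <xsd:complexContent>
      <xsd:extension base="SlideType">
        <xsd:sequence>
          <xsd:element name="Body" type="xsd:string"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <!-- Abstract head element: never appears in an instance itself -->
  <xsd:element name="Slide" type="SlideType" abstract="true"/>
  <!-- TextSlide may appear wherever a Slide is called for -->
  <xsd:element name="TextSlide" type="TextSlideType"
               substitutionGroup="Slide"/>
</xsd:schema>
```

A content model that refers to Slide will then accept TextSlide in its place, which is the polymorphism the list above describes.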
The Future of Schemas
Object Oriented XML
One of the most immediate impacts that Schemas will have upon XML-based programming is that they mix the declarative structures that are characteristic of XML with an object orientation that makes it possible to deal with identity and conceivably uniqueness in the programming sphere. Specifically,
- Distributed Objects. With the combination of XML listings of data code and either embedded or referenced schema objects, it becomes easier to maintain the integrity of a given XML "object" across a broad network.
- Primitives Repositories. If you look upon XML as a means of creating a model that describes any object in the virtual sphere, then the schema enables the concept of primitives libraries that could reside anywhere on the net.
- Instance Generation. An XSD schema, with its associated annotation resources, could very easily generate through XSLT instances of objects without explicit need for customized constructor code. This could also be used for generating forms and similar resources that depend upon the data types of the schema to determine the valid content of each field.
Will There Ever Be Universal Schemas?
One of the greatest conundrums with schemas is that it is so easy to create a schema, and so hard to get anyone else to use it. The Babel of schemas that currently exists, while bewildering, speaks to some of the strengths of schemas in general.
- Schemas as Contracts. Although schemas are technical specifications, their characteristics make them likely to become part of future legal contracts, defining the data characteristics of the common data interchange.
- Schemas and XSLT. A comprehensive Schema definition language can be used to better clarify XSLT transformations, moving it from a text manipulation language to a much more complex data manipulation language.
- Schema Variants. The XSD specification contains a number of provisions for building conditional Schemas that change in response to the internal data within the XML document the schema represents. By providing a rigorous mechanism for doing so, XSD makes it possible to have multiple "similar" transformations that provide some flexibility into the B2B sphere.
- Universal IDL. Interface Description Languages, or IDLs, are ways of describing programmatic interfaces. It is likely that an object oriented XML schema would be a major component of a universal IDL, perhaps one mediated by SOAP.