
XML and Related Technologies certification prep,
Part 2: Information modeling

Model data with an XML grammar
Skill Level: Intermediate
Louis E Mauget (LMauget@numbersix.com)

Senior Consultant
Number Six Software, Inc.
12 Sep 2006
This tutorial on information modeling is the second in a series of five tutorials that can
help you prepare for the IBM™ certification Test 142, XML and Related
Technologies. This tutorial analyzes XML data, contrasts narrative documents with
record-like documents, and models a small data problem using Document Type
Definition (DTD) grammar and several iterations of the World Wide Web Consortium
(W3C) XML Schema. It finishes with a comparison of DTD and XML Schema to help
you choose one or the other in your design.
Section 1. Before you start

In this section, you'll find out what to expect from this tutorial and how to get the
most out of it.
About this series

This series of five tutorials helps you prepare to take the IBM certification Test 142,
XML and Related Technologies, to attain the IBM Certified Solution Developer - XML
and Related Technologies certification. This certification identifies an
intermediate-level developer who designs and implements applications that make
use of XML and related technologies such as XML Schema, Extensible Stylesheet
Language Transformation (XSLT), and XPath. This developer has a strong
understanding of XML fundamentals; has knowledge of XML concepts and related
technologies; understands how data relates to XML, in particular with issues
associated with information modeling, XML processing, XML rendering, and Web
services; has a thorough knowledge of core XML-related W3C recommendations;
and is familiar with well-known best practices.
developerWorks® ibm.com/developerWorks
© Copyright IBM Corporation 1994, 2006. All rights reserved.
About this tutorial

This tutorial is for programmers who have a basic understanding of XML and whose
skills and experience are at a beginning-to-intermediate level. You should have a
general familiarity with defining, validating, and reading XML.
The standardized nature of XML has given rise to a number of cross-platform,
cross-language parsers and derivative technologies. Part 1 of this
series covered XML architecture. This second tutorial covers the characteristics of
data, narrative documents, and data documents. It goes on to model a simple case
study in XML, while showing several iterations of a grammar.
Objectives
After completing this tutorial, you will know how to:
• Analyze data and documents
• Represent structure in XML syntax
• Use namespaces appropriately
• Define DTDs
• Define grammars using XML Schema
• Determine when to use a DTD versus an XML Schema
Prerequisites
This tutorial is written for developers who have a background in programming and
scripting and who have an understanding of basic computer-science models and
data structures. You should be familiar with the following XML-related,
computer-science concepts: tree traversal, recursion, and reuse of data. You should
be familiar with Internet standards and concepts, such as Web browser,
client-server, documenting, formatting, e-commerce, and Web applications.
Experience designing and implementing Java™-based computer applications and
working with relational databases is also recommended.
System requirements
To complete the steps as shown in this tutorial you will need an up-to-date browser
and a validating XML editor. XMLSpy was the XML editor used in this tutorial. See
Resources for a link to download a copy of XMLSpy.
Section 2. XML data and documents

In this section you will learn about two basic types of XML documents, called
narrative documents and record-like documents. You will learn their characteristics
and see samples for their use. Keep in mind that it is difficult to give a rigorous
definition of a narrative format as compared to a record-like format, because each
document is really data in the end.
XML narrative documents

A narrative document is a format that specifies words, symbols, and formatting
information intended for human consumption. A narrative-style document is
generally looser and less constrained in format than a data record-like document.
Most narrative document grammars allow an infinite number of document structures,
while some record-like grammars allow just one structure.
Narrative examples
A narrative-style XML grammar can define documents that render into printed
matter. Extensible HTML (XHTML), an XML grammar for the Web, and DocBook, a
markup for technical publications, are examples of narrative markup grammars. See
Listing 1 for a simple example of a DocBook document.
Listing 1. DocBook narrative document
<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<?altova_sps http://www.altova.com/sps/Template/Publishing/docbook.sps?>
<article>
<title>Article Title</title>
<sect1>
<title>Section1 Title</title>
<para>Text</para>
</sect1>
</article>
Narrative documents have many uses in addition to rendering the printed word. For
example, Speech Synthesis Markup Language (SSML) defines documents that
render as synthetic speech audio. VoiceXML (VXML) is a speech-oriented XML
grammar specified by the W3C for bidirectional human voice interaction with a
computer. Listing 2 shows a small example.
Listing 2. VoiceXML narrative document

<vxml xmlns="http://www.w3.org/2001/vxml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/vxml
http://www.w3.org/TR/voicexml20/vxml.xsd" version="2.0">
<form>
<block>Hello from VXML!</block>
</form>
</vxml>
Record-like data documents

The founders of the XML project are probably pleasantly surprised at the ubiquitous
applications of XML for record-like data uses. As an alternative to the complex
Standard Generalized Markup Language (SGML), they conceived XML to be a
metalanguage used to create grammars for narrative-style documents. However, the
simple, yet rigorous nature of XML made it perfect for creating hierarchical
grammars of attributed elements -- sometimes better than the tabular relational
model.
A record-like data XML grammar is generally stricter than a narrative XML grammar.
Narratives are art for human consumption. Data more resembles science for
computational consumption. There is a plethora of standard industry-oriented
narrative grammars. A standardized well-known grammar is important if anonymous
folks are going to read a document. On the other hand, many record-like grammars
are specialized, often existing only for one application.
Record-like examples
If you work with Web applications, you've probably encountered Web application
deployment descriptors and JavaServer Pages™ (JSP) tag library descriptors.
These are good examples of XML used to define data records. A Web application
server needs a structured view of the semantics and locations of deployment
artifacts. The hierarchical nature of an XML-based web.xml document fills the
requirement nicely. Listing 3 shows a deployment descriptor for a simple Spring
Framework Web application.
Listing 3. A deployment descriptor for a Web servlet deployable unit
<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
"http://java.sun.com/dtd/web-app_2_3.dtd">
<web-app>
<servlet>
<servlet-name>example</servlet-name>
<servlet-class>org.springframework.web.servlet.DispatcherServlet</servlet-class>
<load-on-startup>1</load-on-startup>
</servlet>
<servlet-mapping>
<servlet-name>example</servlet-name>
<url-pattern>/example/*</url-pattern>
</servlet-mapping>
<welcome-file-list>
<welcome-file>index.jsp</welcome-file>
</welcome-file-list>
</web-app>
Listing 4 contains a JSP tag library descriptor used to associate a taglib name
with a Java class that implements the logic of a custom JSP markup tag.
Listing 4. Sample of a JSP tag library descriptor XML

<!DOCTYPE taglib PUBLIC "-//Sun Microsystems, Inc.//DTD JSP Tag Library 1.2//EN"
"http://java.sun.com/dtd/web-jsptaglibrary_1_2.dtd">
<taglib>
<tlib-version>1.2</tlib-version>
<jsp-version>1.1</jsp-version>
<short-name>bc</short-name>
<tag>
<name>BreadCrumb</name>
<tag-class>com.rogers60.taglib.BreadCrumbTag</tag-class>
</tag>
</taglib>
Section 3. Well-formed XML

This section compares similar examples of HTML and XHTML to highlight the
differences between the two and to help demonstrate the requirements for an XML
file to be well formed.
HTML
HTML is a rather loose markup language based on Standard Generalized Markup
Language (SGML). Browsers try to make rendering assumptions about markup
omissions and errors. The result often varies across vendors or releases. HTML
doesn't conform to the simple rules of well-formed XML. See an example of
nonrigorous HTML markup in Listing 5. The document has no associated DTD. The
paragraph tag is not closed. The title tag is in lowercase, but the other tags are
uppercase. Figure 1 shows how the Firefox browser renders the document correctly
despite its sloppy markup.
Listing 5. Loose markup
<HEAD>
<title>XML Tutorial</title>
</HEAD>
<BODY>
<H1>This is a heading</h1>
<P>This is a paragraph.
<P align=center>This is centered</P>
<P><B>This is bold</P></B>
</BODY>
Figure 1. Browser renders nonrigorous HTML correctly
Note that HTML is not XML. An XML document is well-formed. If an XML parser
encounters a parsing error, then its document is not really an XML document. The
document is useless until you repair it.
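The difference between loose markup and well-formed markup can be checked mechanically. The following is a minimal sketch in Python, using the standard xml.etree.ElementTree module; Python is a choice made here for illustration and is not used elsewhere in this tutorial. The two strings and the helper name is_well_formed are assumptions introduced for the example. The parser checks well-formedness only and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Loose markup in the spirit of Listing 5: mismatched case and overlapping tags.
LOOSE = "<BODY><H1>This is a heading</h1><P><B>This is bold</P></B></BODY>"

# The same content with matched, properly nested tags.
CLEAN = "<body><h1>This is a heading</h1><p><b>This is bold</b></p></body>"


def is_well_formed(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(LOOSE))  # False
print(is_well_formed(CLEAN))  # True
```

An XML parser stops at the first error, which is the go/no-go behavior that separates XML from the forgiving HTML handling in browsers.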
XHTML
XHTML is a variant of HTML based on the well-formed rigor of XML
markup. Table 1 illustrates that XHTML is partially about what forms of markup are
not allowed.
Table 1. Illegal XHTML markup

Examples of disallowed markup in XHTML
<P>Uppercase elements</P>
<p>Missing end tags<br>
<body>Missing html root</body>
<p id=id001>Missing quotes on attributes
<p><b>Overlapping extents</p></b>
<input type="checkbox" checked/>
You must encode an XHTML document in UTF-8 or UTF-16, or prefix it with an XML
declaration that declares the encoding in force. It must also have a PUBLIC ID
DOCTYPE specifying an XHTML DTD. Table 2 shows five public DTDs that describe
variants of XHTML grammar.
Table 2. XHTML DOCTYPEs

Grammar                  DTD DOCTYPE
XHTML 1.0 Strict         <!DOCTYPE html PUBLIC
                         "-//W3C//DTD XHTML 1.0 Strict//EN"
                         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
XHTML 1.0 Transitional   <!DOCTYPE html PUBLIC
                         "-//W3C//DTD XHTML 1.0 Transitional//EN"
                         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
XHTML 1.0 Frameset       <!DOCTYPE html PUBLIC
                         "-//W3C//DTD XHTML 1.0 Frameset//EN"
                         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
XHTML 1.1                <!DOCTYPE html PUBLIC
                         "-//W3C//DTD XHTML 1.1//EN"
                         "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
XHTML 2.0 (2005)         <!DOCTYPE html PUBLIC
                         "-//W3C//DTD XHTML 2.0//EN"
                         "http://www.w3.org/MarkUp/DTD/xhtml2.dtd">
Exercise
In this exercise, you will create an XHTML version of the HTML data from Listing 5
and then edit it to render as shown in Figure 1. First, invoke your XML editor, create
a new document, and choose type XHTML. The XHTML should resemble Figure 2.
If you aren't using XMLSpy, you might see some differences. Check that the doctype
is XHTML 1.0 strict.
Figure 2. XHTML boilerplate
Next, edit the boilerplate to incorporate a clean version of the sloppy HTML from
Listing 5. If you supply well-formed, valid XHTML, the validation test will pass, as
shown in Figure 3.
Figure 3. Updated XHTML 1.0 markup
Notice that you have to use a Cascading Style Sheet (CSS) text-align style to center
the middle paragraph, because the Strict DTD doesn't allow the HTML align
attribute on a paragraph. The document would not pass the validation test with
align="center" on the paragraph.
HTML, in comparison, is harder to validate because of the laid-back nature of
browsers and the complex SGML heritage of HTML. Specialized, complex tools
carry out HTML validation, while any standard XML tool validates an XHTML
document.
To test rendering, save the XHTML file and view it in a browser, or click the Browser
tab in XMLSpy. Figure 4 shows the result. Notice that it renders identically to Figure 1.
Figure 4. XHTML rendered in a Firefox Web browser
Unlike the sloppy HTML example, a browser that processes the XHTML file as XML
refuses to render it if it's not well formed. A single markup error spoils the entire document.
Why use XHTML if it's so touchy?
Browsers must make guesses to render sloppy markup in HTML files. There is no
standard for these assumptions, so the HTML might render unwanted visual artifacts
in one browser but not in another.
The valid file is XML, so all parsers are obligated to parse it and supply reasonable
visual rendering. Moreover, notice that you used an XML tool to create and edit the
file. A real value of XHTML is that you may use any XML tool or library to work with
XHTML.
Section 4. Modeling data in XML

Assume that you must model a collection of published books in XML. The goal is to
enable any programming application to use it as the input to reports, reference
searches, or, perhaps, an e-commerce catalog of books for sale.
Some authorities recommend creating a DTD or schema before you ever make a
sample document. I find this approach to be too abstract. Instead, I prefer to borrow
a page from the test-driven methods used in program design: Create a document,
then create a DTD that validates that document. The document is the test, and the
DTD is the application. You'll be able to iterate the design by changing either
document as you proceed. Later, I'll show you how to discard the DTD in favor of
creating an XML Schema for the test document, because this tutorial is about how to
model and constrain data in XML by using either approach.
You still need that test document. Your task is to model a catalog of published
books, but you might want to extend the catalog to other kinds of publications in the
future. You can think of the catalog as a list of publications. Every XML document is
a single-rooted hierarchy, so you can use the word "publications" as the root
element. See Listing 6.
Listing 6. Beginnings

<publications>
. . .
</publications>
The publications catalog contains zero or more books. It seems reasonable that a
book should be a child element of publications, as shown in Listing 7.
Listing 7. Child elements added

<publications>
<book></book>
<book></book>
<book></book>
. . .
</publications>
Each book has several common attributes: title, author, copyright, and ISBN
number. These items are attributes of a book, but do you model them as actual XML
attributes, or as XML elements? Review some capability differences between
attributes and elements, as shown in Table 3.
Table 3. Capabilities of attributes compared to elements

Capability      Attribute          Element
Hierarchical    No - flat          Yes
Ordered         No - undefined     Yes
Complex types   No - string only   Yes
Verbose         Less - usually     More
Readability     Less               More - usually
The title, author, copyright, and ISBN number seem to be immediate children of a
book. Will they eventually need children of their own? You're not sure at this point,
but you don't want to prevent that kind of extension for no good reason. This is a
point in favor of modeling them as elements.
The items probably need no ordering within a book as long as an application can
parse them by name, but ordering seems -- well -- more orderly. You can argue
each way with respect to ordering. Award no points to either side here.
Some of the items seem to be simple strings, but copyright is really a four-digit
number that you might enforce in a future version of a schema for your publications
catalog. In addition, you might later impose a formatting pattern on an ISBN. So
award a solid point to modeling as elements here.
If an XML binding technology, such as Java™ Architecture for XML Binding (JAXB),
is part of an architecture, then consider that elements translate into classes, while
attributes become properties of those classes. Thus, the number of classes is
proportional to the number of kinds of elements. This could mean more sizeable and
possibly more complex source code. However, the binding tool generates this code.
The source document is really the schema. Maintainers normally don't modify the
classes manually. Thus, XML binding might not be a factor in the elements versus
attribute decision. Award no points to either side for JAXB or XML binding in general.
The argument is currently two to zero in favor of elements over attributes -- for this
problem only. The characteristics of verbosity and readability can be matters of
personal taste or part of the design requirements. You must evaluate your actual
design task yourself.
Some schemas allow using either an attribute or an element in a given place. Both
Apache Ant and DocBook documents allow this behavior in places.
In this tutorial, use elements for title, author, copyright, and ISBN number, with
lowercase letters for ISBN. You can reserve the option to allow optional or required
attributes on the book element, such as image, used to embed an optional
picture of a book, and id, used to impose a unique identifier on a book for use as a
reference key by applications. Listing 8 shows the test document at this point. It has
no DTD or XML Schema. That's your next job. This tutorial shows you how to do one
of each.
Listing 8. Test document publications1.xml

<publications>
<book>
<title>Building J2EE Applications With IBM WebSphere</title>

<author>Nilsson and Mauget</author>
<copyright>2003</copyright>
<publisher>Wiley</publisher>
<isbn>0471281573</isbn>
</book>
<book>
<title>Linux and Windows Interoperability Guide</title>
<author>Bradford and Mauget</author>
<publisher>Pearson Education</publisher>
<isbn>0130324779</isbn>
</book>
<book>
<title>e - Directories</title>
<author>House,Hahn,Mauget,Daugherty</author>
<isbn>0471281573</isbn>
</book>
</publications>
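Because the test document is well-formed XML, an application can already read it by element name before any grammar exists. The following is a minimal sketch in Python using the standard xml.etree.ElementTree module; the language, the shortened two-book catalog, and the helper name summarize are assumptions made for illustration. It also shows why a grammar matters: without one, the reader has to cope with missing children on its own.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the test document: the second book has no copyright element.
CATALOG = """
<publications>
  <book>
    <title>Building J2EE Applications With IBM WebSphere</title>
    <author>Nilsson and Mauget</author>
    <copyright>2003</copyright>
    <publisher>Wiley</publisher>
    <isbn>0471281573</isbn>
  </book>
  <book>
    <title>Linux and Windows Interoperability Guide</title>
    <author>Bradford and Mauget</author>
    <publisher>Pearson Education</publisher>
    <isbn>0130324779</isbn>
  </book>
</publications>
"""


def summarize(xml_text):
    """Return (title, copyright) pairs for each book in the catalog."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.findall("book"):
        # findtext() returns the default when the child element is absent.
        title = book.findtext("title")
        year = book.findtext("copyright", default="unknown")
        rows.append((title, year))
    return rows


for title, year in summarize(CATALOG):
    print(year, title)
```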
You can impose a DTD-based grammar on the test document by describing the
document in DTD-speak, an older markup language borrowed from SGML. First,
what does it mean to impose a grammar?
A grammar constrains an XML document's validity, somewhat as a relational
database management system (DBMS) schema describes and constrains the form
of a database. When does a grammar apply? A DBMS schema prevents bad
structural changes or disallowed relational changes whenever an application
modifies data in its tables. A DBMS schema is mostly about writing and less about
reading.
An XML grammar applies at a different point. A user could alter an XML document
with Microsoft® Notepad, an application ignorant of XML or its grammar documents.
An XML grammar applies during parsing, or recognizing, the XML document during
reading. Thus, an XML grammar is about reading valid information, not writing it. A
well-formed document meets the requirements of XML markup, but can be invalid
according to the associated grammar. This is a go/no-go vote during parsing. A
validity miss here renders the document useless. There is no carrying on, as
browsers do with the sloppy HTML.
Section 5. DTDs and Schemas
Defining a DTD
A DTD largely consists of <!ELEMENT ... > and <!ATTLIST ... > markup
statements.
The test document's first element is publications. It contains book elements, so
place book in parentheses to indicate containment. You can suffix book with an
asterisk (*) for "zero or more," a plus sign (+) for "one or more," a question mark (?)
for "one or none," or no suffix for exactly one book. This notation is borrowed from
regular expressions. Thus, the first line of the DTD looks like this:
<!ELEMENT publications (book*)>
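The occurrence suffixes behave like their regular-expression counterparts. The following is a minimal sketch in Python that encodes a parent's children as a string of names and tests it against four patterns. The encoding (each name followed by one space) and the helper names child_signature and allowed are assumptions made for illustration; a real validating parser does not work on strings this way.

```python
import re


def child_signature(names):
    """Encode a list of child element names as 'name name ... ' for matching."""
    return "".join(name + " " for name in names)


# Regular expressions that mirror the four DTD occurrence forms.
ZERO_OR_MORE = re.compile(r"(book )*")  # (book*)
ONE_OR_MORE = re.compile(r"(book )+")   # (book+)
ZERO_OR_ONE = re.compile(r"(book )?")   # (book?)
EXACTLY_ONE = re.compile(r"book ")      # (book)


def allowed(pattern, names):
    """Return True if the whole child list satisfies the content model."""
    return pattern.fullmatch(child_signature(names)) is not None


print(allowed(ZERO_OR_MORE, []))                # True: an empty catalog is fine
print(allowed(ONE_OR_MORE, []))                 # False: at least one book needed
print(allowed(ZERO_OR_ONE, ["book", "book"]))   # False: at most one book
```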
Each book contains exactly one title, author, copyright, publisher, and isbn element,
in that order. (For the purposes of this tutorial, multiple authors are entered in the
single author element. When you create a DTD, remember to consider how to set up
the DTD to permit one or more elements.) Thus, the next markup statement is:
<!ELEMENT book (title, author, copyright, publisher, isbn)>
The remaining elements are leaf-node elements that contain character data. You
use parentheses to indicate containment, as usual. You need to declare the kind of
character data. The character strings are parsed character data, indicated by the
literal, #PCDATA:
<!ELEMENT title (#PCDATA)>
You give each book a required unique identification key through the special XML ID
kind of attribute. In addition, you can enable an optional image attribute that
contains a URL of a picture of the book cover. The ATTLIST markup takes an
element argument followed by a tuple for each attribute associated with the element.
Each tuple consists of an attribute name, its type, and an indicator of whether it's
optional or required. The DTD specification allows 10 attribute types listed in Table 4.
Table 4. DTD attribute types

DTD attribute type   Description
CDATA                Any character string acceptable in XML
NMTOKEN              Close to being an XML name; first character is looser
NMTOKENS             One or more NMTOKEN tokens separated by white space
Enumeration          List of the only allowed values for an attribute
ENTITY               Associates a name with a macro-like replacement
ENTITIES             White-space-separated list of ENTITY names
ID                   XML name unique within the entire document
IDREF                Reference to an ID attribute within the document
IDREFS               White-space-separated list of IDREF tokens
NOTATION             Associates a name with information used by the client
You indicate a required attribute by appending #REQUIRED after the type. You can
stipulate an optional attribute by appending #IMPLIED. The grammar has one
attribute of each kind on the book element. The single ATTLIST statement looks like
this:
<!ATTLIST book id ID #REQUIRED image CDATA #IMPLIED>
Listing 9 shows the completed DTD.
Listing 9. Publication DTD

<!ELEMENT publications (book*)>
<!ELEMENT book (title, author, copyright, publisher, isbn)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT copyright (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT isbn (#PCDATA)>
<!ATTLIST book
id ID #REQUIRED
image CDATA #IMPLIED>
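To make the rules in this DTD concrete, the following Python sketch checks them by hand: the root name, the fixed order of children in each book, and a required id that is unique across the document. It is a stand-in written for illustration, not a validating parser; the standard xml.etree.ElementTree module does not validate against DTDs, and the name find_problems is an assumption. It also skips checks a real validator performs, such as whether an id value is a legal XML name.

```python
import xml.etree.ElementTree as ET

# Child order required by the book content model in the DTD.
EXPECTED_CHILDREN = ["title", "author", "copyright", "publisher", "isbn"]


def find_problems(xml_text):
    """Return a list of rule violations; an empty list means none were found."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well formed
    problems = []
    seen_ids = set()
    if root.tag != "publications":
        problems.append("root element must be publications")
    for number, book in enumerate(root, start=1):
        if book.tag != "book":
            problems.append(f"child {number}: only book elements are allowed")
            continue
        if [child.tag for child in book] != EXPECTED_CHILDREN:
            problems.append(f"book {number}: children do not match the content model")
        book_id = book.get("id")
        if book_id is None:
            problems.append(f"book {number}: missing required id attribute")
        elif book_id in seen_ids:
            problems.append(f"book {number}: duplicate id {book_id}")
        else:
            seen_ids.add(book_id)
    return problems
```

A document can pass ET.fromstring and still produce a non-empty list here, which is the distinction between well-formed and valid.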
How would you associate the DTD with the document it describes? You could embed
the DTD into the XML document inline. The textbook approach is to keep the two
documents separate. The application could explicitly use the publications.dtd
document to validate the publications2.xml document. Instead, implicitly link the DTD
to the XML document with a DOCTYPE declaration:
<!DOCTYPE publications SYSTEM "publications.dtd">
Listing 10 shows the XML document linked to the new DTD. It assumes that the
DTD is located in the current directory.
Listing 10. Document linked to separate DTD

<!DOCTYPE publications SYSTEM "publications.dtd">
<publications>
<book id="_1001">
<isbn>0471281573</isbn>
</book>
<book id="_3042">
<isbn>0130324779</isbn>
</book>
<book id="_9593" image="edir.jpg">

<isbn>0471281573</isbn>
</book>
</publications>
What easily procured tools could you use to carry out validation testing? You could
use the freely downloadable Altova XMLSpy Home Edition to test the document for
validity. Place the publications.dtd file and the publications2.xml file in the same
directory, open the XML document, and then press F8. Figure 5 shows successful
validation.
Figure 5. Validation using linked DTD
What happens when the document is not valid? Remove the required id attribute
from the first book element, then press F7 to check the document for legal XML syntax.
The status reports yellow. This means that the document is well-formed. Now press
F8. Figure 6 shows the result. The status is red, meaning it's an invalid document.
XMLSpy will complain that the document is invalid if you try to save it.
Figure 6. Invalid document -- missing required id attribute
Defining an XML Schema

A W3C XML Schema is a grammar document that is, itself, an XML document.
Schematron and Relax NG are other XML validation languages. I cover only the
W3C XML Schema, but you might want to check out the others, as they are
relatively simple and have currency.
You can design an XML Schema to constrain your document to a greater degree
than is possible by using the DTD. For example, an XML Schema grammar can
specify that exactly four apple elements must always be the immediate children of
a basket element. You can also define your own types that build on built-in types
such as string. For instance, you could require a zipcode element to have a pattern
facet of value "\d\d\d\d\d-\d\d\d\d", so that values such as "95123-4823" are valid, but
"abcde-fghi" or "27703" are invalid.
What is the meaning of the term facet? An XML Schema considers a facet to be an
aspect of possible values for a simple data type. Table 5 shows the XML Schema
facets.
Table 5. XML W3C Schema facets

Facet type       Description
length           Enforces an exact length
minLength        Enforces, at least, a minimum length of a string-derived type
maxLength        Enforces, at most, a maximum length of a string-derived type
maxExclusive     Value must be < a maximum appropriate for the type
maxInclusive     Value must be <= a maximum appropriate for the type
minExclusive     Value must be > a minimum appropriate for the type
minInclusive     Value must be >= a minimum appropriate for the type
enumeration      Restricts values to a member of a defined list
totalDigits      Enforces total digits in a number; signs and decimal points skipped
fractionDigits   Enforces total fractional digits in a fractional number
whiteSpace       Used to preserve, replace, or collapse document white space
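The pattern facet from the zipcode example and the four bound facets can be imitated in a few lines of Python. This is a sketch of the semantics only: Python's re module is close to, but not the same as, the XML Schema regular-expression language, and the helper names matches_pattern and within_bounds are assumptions made for illustration. An XML Schema pattern must match the entire value, which is why fullmatch is used.

```python
import re

# The zipcode pattern from the text; a schema pattern must match the whole value.
ZIP_PLUS_4 = re.compile(r"\d\d\d\d\d-\d\d\d\d")


def matches_pattern(value):
    """Mimic a pattern facet: the complete string must match."""
    return ZIP_PLUS_4.fullmatch(value) is not None


def within_bounds(value, min_inclusive=None, min_exclusive=None,
                  max_inclusive=None, max_exclusive=None):
    """Mimic the four bound facets; inclusive bounds admit the bound itself."""
    if min_inclusive is not None and value < min_inclusive:
        return False
    if min_exclusive is not None and value <= min_exclusive:
        return False
    if max_inclusive is not None and value > max_inclusive:
        return False
    if max_exclusive is not None and value >= max_exclusive:
        return False
    return True


print(matches_pattern("95123-4823"))          # True
print(matches_pattern("27703"))               # False
print(within_bounds(10, max_inclusive=10))    # True
print(within_bounds(10, max_exclusive=10))    # False
```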
This is a hint at the granularity you have at your disposal, but you'll begin by making
an XML Schema that matches the capability of the DTD. Later, I'll show you how to
tighten it a bit to show the advantage of schema.
To begin, declare the namespace that XML Schema itself uses. Don't use a namespace for
your own grammar until later in the tutorial. The schema itself uses a namespace. By
convention, you'd use the prefix "xs". You could use any character string as the
prefix, even "radish", but why obscure the convention?
xmlns:xs="http://www.w3.org/2001/XMLSchema"
For now, declare that your own declared elements and attributes are unqualified:
elementFormDefault="unqualified"
attributeFormDefault="unqualified"
The following is the XML Schema root element after you put this together:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="unqualified"
attributeFormDefault="unqualified">
Next, you can begin to specify the publications, as well as the book, title,
author, copyright, publisher, and isbn element declarations. You can
specify the root element as:
<xs:element name="publications">
The publications root is a complex type that contains a sequence of other
elements, namely an optional book element (minOccurs="0"). It has an unbounded
maximum number of occurrences:
<xs:complexType>
  <xs:sequence>
    <xs:element name="book" minOccurs="0" maxOccurs="unbounded">
The book is a subsequent complex type that contains its own sequence of title,
author, copyright, publisher, and isbn element declarations:
<xs:complexType>
  <xs:sequence>
    <xs:element name="title"/>
    <xs:element name="author"/>
    <xs:element name="copyright"/>
    <xs:element name="publisher"/>
    <xs:element name="isbn"/>
  </xs:sequence>
  ...
Did I forget to have you add the id attribute and the image attribute to the book
element? No, you simply defer those to the end of the complex type enclosed by the
book element.
The rules (grammar) of XML Schema state that you place attributes last in the
complex type enclosed by their element. An attribute is a schema element of the
form <xs:attribute … />.
Thus, you can add the id and image attributes, as shown here:
<xs:complexType>
  <xs:sequence>
    ...
  </xs:sequence>
  <xs:attribute name="id" type="xs:string" use="required"/>
  <xs:attribute name="image" type="xs:string"/>
</xs:complexType>
The id attribute is required, but the use of the image attribute defaults to optional.
Notice the type, xs:string. You could have specified a user-defined type based on a
string. To XML, that is still a string, but to XML Schema, it is a particular kind of
string. I'll say more about user-defined types a bit later.
That just about completes your schema. Add closing markup to the open elements,
as shown in Listing 11.
Listing 11. Schema that matches DTD capabilities

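<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="unqualified"
    attributeFormDefault="unqualified">
  <xs:element name="publications">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title"/>
              <xs:element name="author"/>
              <xs:element name="copyright"/>
              <xs:element name="publisher"/>
              <xs:element name="isbn"/>
            </xs:sequence>
            <xs:attribute name="id" type="xs:string" use="required"/>
            <xs:attribute name="image" type="xs:string"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>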
You can open publications3.xsd in XMLSpy and press F8 to validate it against the
schema for the http://www.w3.org/2001/XMLSchema namespace, as if it were an XML
document -- because it is an XML document.
Associating an XML Schema to a document
The schema is only useful when applied to an XML document. How can you
associate a schema to a document? The application could explicitly use the
publications3.xsd document to validate the publications3.xml document. Instead, you
want to implicitly associate the schema with the XML document.
You can modify the document root element to link to the schema through a special
attribute. The schema doesn't use a namespace -- yet. You must add an attribute to
the publications root that shows the parser where to find the no-namespace
schema:
<publications xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="publications3.xsd">
That does it! See the publications3.xml document in Listing 12. It's the same
document contents as that shown in Listing 8, except for the schema association.
Listing 12. publications3.xml linked to W3C XML Schema

<publications xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="publications3.xsd">
<book id="_1001">
<isbn>0471281573</isbn>
</book>
<book id="_3042">
<isbn>0130324779</isbn>
</book>
<book id="_9593" image="edir.jpg">
<isbn>0471281573</isbn>
</book>
</publications>
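The schema association is an ordinary attribute in the XMLSchema-instance namespace, so any namespace-aware parser can read it. The following Python sketch uses the standard xml.etree.ElementTree module, which exposes namespaced names in the form {uri}local. The shortened document and the helper name schema_hint are assumptions made for illustration. ElementTree only reads the hint; it does not fetch the schema or validate against it.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# A shortened copy of the linked document.
DOCUMENT = """<publications xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="publications3.xsd">
  <book id="_1001">
    <isbn>0471281573</isbn>
  </book>
</publications>"""


def schema_hint(xml_text):
    """Return the no-namespace schema location on the root, or None if absent."""
    root = ET.fromstring(xml_text)
    # Attribute keys in a namespace are stored as "{namespace-uri}local-name".
    return root.get(f"{{{XSI}}}noNamespaceSchemaLocation")


print(schema_hint(DOCUMENT))  # publications3.xsd
```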
Imposing a namespace
What about imposing a namespace on your document? A DTD has no notion of
namespace. Instead, it attaches no special meaning to the colon (":") in a
namespace-qualified element or attribute name. This means that any disambiguation
is applied using the prefix preceding the colon, not the URI represented by the
prefix. It is legal for two documents to use the same prefix, yet be associated with
different namespaces. This could be a problematic naming collision.
Impose the following namespace on your document:
http://rogers60.com/xmltutorial/2
Use pub for the prefix, although you can use any legal string unique within the
document and schema. With a DTD, pub:book behaves as if it were pubbook.
With XML Schema namespace support, it behaves conceptually as if it were
http://rogers60.com/xmltutorial/2book.
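A namespace-aware parser makes this concrete: it identifies an element by the namespace URI plus the local name, and the prefix is only shorthand. The following Python sketch uses the standard xml.etree.ElementTree module, which writes the expanded name as {uri}local. The radish prefix and the http://example.com/other URI are assumptions introduced for the comparison.

```python
import xml.etree.ElementTree as ET

NS = "http://rogers60.com/xmltutorial/2"

# Same namespace URI, two different prefixes.
with_pub = ET.fromstring(f'<pub:book xmlns:pub="{NS}"/>')
with_radish = ET.fromstring(f'<radish:book xmlns:radish="{NS}"/>')

# Same prefix as the first element, but bound to a different (hypothetical) URI.
other = ET.fromstring('<pub:book xmlns:pub="http://example.com/other"/>')

print(with_pub.tag)                      # {http://rogers60.com/xmltutorial/2}book
print(with_pub.tag == with_radish.tag)   # True: the prefix does not matter
print(with_pub.tag == other.tag)         # False: the URI does
```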
The DTD behavior doesn't necessarily prevent namespace collisions. On the other
hand, the XML Schema behavior can prevent namespace collisions, if properly
declared, because the DNS-based URI is unique. Take your next baby step in the
evolution of the schema by giving it a target namespace, and declare the same URI
as both the default namespace and the namespace for a prefix of pub:
targetNamespace="http://rogers60.com/xmltutorial/2"
xmlns="http://rogers60.com/xmltutorial/2"
xmlns:pub="http://rogers60.com/xmltutorial/2"
You need to specify both the targetNamespace and xmlns attributes to target a
namespace and use a default namespace. The general convention is to enable
unprefixed elements to assume the default namespace, but leave unprefixed
attributes out of any namespace:
elementFormDefault="qualified"
attributeFormDefault="unqualified"
Recall that you declared elementFormDefault="unqualified" in the
no-namespace grammar previously. You add nothing else in your particular
schema to namespace-qualify its grammar. See Listing 13.
Listing 13. Schema that has a default namespace

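<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://rogers60.com/xmltutorial/2"
    xmlns="http://rogers60.com/xmltutorial/2"
    xmlns:pub="http://rogers60.com/xmltutorial/2"
    elementFormDefault="qualified"
    attributeFormDefault="unqualified">
  <xs:element name="publications">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title"/>
              <xs:element name="author"/>
              <xs:element name="copyright"/>
              <xs:element name="publisher"/>
              <xs:element name="isbn"/>
            </xs:sequence>
            <xs:attribute name="id" type="xs:string" use="required"/>
            <xs:attribute name="image" type="xs:string" use="optional"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>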
You need to perform minor surgery on the XML document to connect it to the
namespace. The surgery is minor because the entire document is in the default
namespace. Instead of specifying a no-namespace schema location attribute for the
root publications element, you specify:
xsi:schemaLocation="http://rogers60.com/xmltutorial/2 publicationsNS4.xsd"
Be careful here. Notice the quoted two-part, whitespace-separated pair of values.
The first component is the namespace URI, and the second is the location of the
schema file. In general, the location is a URL. Additionally, add a default namespace
and a pub prefix for the same namespace:
xmlns="http://rogers60.com/xmltutorial/2"
xmlns:pub="http://rogers60.com/xmltutorial/2"
This enables you to let an unqualified element default to the URI
http://rogers60.com/xmltutorial/2 or to specify an explicit prefix pub for
that namespace. For this example, <book> and <pub:book> refer to the identical
element declaration. See Listing 14 for a namespace-enabled version of
the test document. Notice that the final book element is explicitly qualified as
pub:book for illustration. The document is valid in the context of the schema of
Listing 13, shown previously.
Listing 14. XML document with default namespace

<publications
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://rogers60.com/xmltutorial/2
publicationsNS4.xsd"
xmlns="http://rogers60.com/xmltutorial/2"
xmlns:pub="http://rogers60.com/xmltutorial/2">
<book id="_1001">
<isbn>0471281573</isbn>
</book>
<book id="_3042">
<isbn>0130324779</isbn>
</book>
<pub:book id="_9593" image="edir.jpg">
<isbn>0471281573</isbn>
</pub:book>
</publications>
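To see why <book> and <pub:book> are the same element to a namespace-aware processor, here is a minimal Python sketch that uses only the standard xml.etree.ElementTree module. It is not part of the tutorial's XMLSpy workflow, and it does not validate against the schema. The document string is an abbreviated stand-in for Listing 14, and the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

NS = "http://rogers60.com/xmltutorial/2"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Abbreviated stand-in for the Listing 14 document (most child elements trimmed).
DOC = f"""<publications
    xmlns:xsi="{XSI}"
    xsi:schemaLocation="{NS}
                        publicationsNS4.xsd"
    xmlns="{NS}"
    xmlns:pub="{NS}">
  <book id="_1001"><isbn>0471281573</isbn></book>
  <pub:book id="_9593" image="edir.jpg"><isbn>0471281573</isbn></pub:book>
</publications>"""

root = ET.fromstring(DOC)

# ElementTree reports each name as {namespace-URI}local-name, so the
# prefixed and unprefixed spellings of book come out identical.
child_tags = [child.tag for child in root]

# Unprefixed attributes stay outside any namespace.
first_book_attrs = list(root[0].attrib)

# xsi:schemaLocation holds whitespace-separated (namespace, location) pairs.
tokens = root.get(f"{{{XSI}}}schemaLocation").split()
schema_locations = dict(zip(tokens[0::2], tokens[1::2]))

print(child_tags)
print(first_book_attrs)
print(schema_locations)
```

Both children report the same expanded name, the id attribute carries no namespace, and the schemaLocation value splits cleanly into one namespace-to-file pair.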
Built-in simple data types
The W3C XML Schema specification imposes a set of simple built-in data types.
Table 6 lists the built-in XML Schema simple types and their descriptions. The test
case used in this tutorial implicitly uses the string type, except where it explicitly
specifies it in the image and id attributes of the book element.
Table 6. Built-in W3C Schema simple data types

Type                    Description
anyURI                  Uniform Resource Identifier
base64Binary            Base64-encoded binary value
boolean                 true or false (also written 1 or 0)
byte                    Signed integer >= -128 and <= 127
dateTime                An absolute date and time
ID, IDREF, IDREFS,      Same definitions as those in Table 4
ENTITY, ENTITIES,
NOTATION, NMTOKEN,
NMTOKENS
integer                 Signed integer
language                xml:lang values from the XML 1.0 recommendation
Name                    An XML name
string                  Unicode string
A data document schema for a complex application can become large and difficult to
maintain unless it is refactored to some normalized form. In your case, you've
declared a schema that declares elements and attributes at the structural point
where you use each. You mixed the document's structure with declarations of the
elements and attributes used to build that structure. This can obscure the clarity of
the schema to a human trying to understand or maintain it. It also reduces potential
reuse of types and makes it difficult to find a type when a maintenance person needs
to change it.
How does a URI differ from a URL?
A URI is a universally unique ID. A URL is a universally unique
locator that is used to address a remote or local resource.
You might use a URL as a URI, because it is unique. A URI, in
general, is not always a URL, because it might not denote a
resource location. XML Schema uses a URI to qualify a
namespace. This URI might or might not function as a URL that
references a document describing the namespace.
Confusing? How about a concrete example? The tutorial
namespace URI http://rogers60.com/xmltutorial/2 is not a real Web
resource (URL), but it could be a universally unique URI if its base,
rogers60.com, were a registered DNS name that you could
control.
Why not centralize type declarations at one point and then refer to those types in a
separate data structure portion of the schema? If you do this, you can even break a
large schema into separate files that confine themselves to type sections or structure
sections.
Try it. Refactor the schema to separate the declaration from the structure, thus
declaring all elements in a section at the top of the document, followed by attributes,
followed by the document structure. The structural part refers to an element or
attribute by using a ref attribute whose value is the qualified name of that element
or attribute. Each reference must use the namespace prefix, because top-level
declarations always belong to the schema's target namespace.
This normalized schema layout sometimes makes a huge document easier to read,
because humans can read the declarations separated from the somewhat
less-verbose structure. In addition, this promotes reuse of items.
Listing 15 shows the small schema in this normalized form. It is actually longer than
the original, but it's easier to maintain because of the separation of declaration from
structure, and because of the potential for declaring something once instead of
multiple times.
Listing 15. Normalized W3C Schema
<?xml version="1.0" encoding="UTF-8"?>

attributeFormDefault="qualified">


<xs:attribute name="id" type="xs:string"/>

<xs:complexType>
<xs:sequence>
<xs:complexType>
<xs:sequence>
<xs:element ref="pub:title"/>
<xs:element ref="pub:author"/>
<xs:element ref="pub:copyright"/>
<xs:element ref="pub:publisher"/>
<xs:element ref="pub:isbn"/>
</xs:sequence>
<xs:attribute ref="pub:id" use="required"/>
<xs:attribute ref="pub:image" use="optional"/>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
After you link the test document to the schema, it still validates as before, so I
won't repeat the document here. Now open the schema in XMLSpy and then click the
Schema/WSDL tab to render the diagram, as shown in Figure 7. Notice the connector
symbols for the sequences, the namespace-qualified labels, and the stack of
pub:book elements.
Figure 7. W3C XML Schema diagram
The schema shows a slight error. You want to allow an empty publications list,
but notice that there must be at least one pub:book. An empty publications list
doesn't validate. It's always good to test boundary conditions. You can repair this by
adding a minOccurs attribute to that element:
<xs:element name="book" maxOccurs="unbounded" minOccurs="0">
An empty publications list will validate now.
The W3C XML Schema contains built-in simple types, but part of its attraction lies in
its ability to constrain values to more granular user-defined simple types. You'll
create two simple types -- one for the isbn element and one for the copyright
element. Base each upon xs:string, but apply restrictive patterns to them. First,
tackle the copyright format. Stipulate that it is always a four-digit number. While
there are alternative approaches for this simple restriction (for example, a decimal
with a specified length), you'll use a pattern for the form "dddd" where each "d" is
a decimal digit:
<xs:simpleType name="year">
<xs:restriction base="xs:string">
<xs:pattern value="\d\d\d\d"/>
</xs:restriction>
</xs:simpleType>
Similarly, restrict an ISBN number to have the form "d-dddd-dddd-d" where each "d"
is also a decimal digit. You know that isn't the real definitive format. ISBN recently
changed from 10 digits to 13 digits, because it ran out of numbers. However, this is a
tutorial, and the pattern fits the ISBN numbers in the test document:
<xs:simpleType name="isbn">
<xs:restriction base="xs:string">
<xs:pattern value="\d-\d\d\d\d-\d\d\d\d-\d"/>
</xs:restriction>
</xs:simpleType>
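You can try the two patterns quickly outside a schema validator. The following Python sketch is an approximation: it assumes that Python's re module behaves like the XML Schema regular-expression dialect for these simple \d patterns, and it uses fullmatch because a schema pattern must match the whole value. The sample values and the matches helper are hypothetical.

```python
import re

# XML Schema patterns implicitly cover the entire value, so the sketch
# uses fullmatch rather than search.
YEAR = re.compile(r"\d\d\d\d")
ISBN = re.compile(r"\d-\d\d\d\d-\d\d\d\d-\d")


def matches(pattern, value):
    """Return True if the whole value conforms to the pattern."""
    return pattern.fullmatch(value) is not None


# Hypothetical sample values, not taken from the test document.
for value in ["0-4712-8157-3", "04712-8157-3", "0471281573"]:
    print("isbn", value, matches(ISBN, value))

for value in ["2006", "06", "20066"]:
    print("year", value, matches(YEAR, value))
```

Only the first value in each group conforms. Dropping a dash or changing the digit count causes a mismatch, which is the kind of failure a validating parser reports for these types.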
Insert these two XML stanzas above the element and attribute declarations of the
normalized schema. Then you can refer to the new types anywhere you need them
by using a type attribute that names the new type:
<xs:element name="copyright" type="pub:year"/>
You use the namespace prefix in the type value because the named types belong to
the schema's target namespace, which is bound to the pub prefix. Listing 16 shows
the latest revision of the XML Schema that uses the simple types. I won't display
the XML document listing again here, because it doesn't vary, except to target the
latest name of the schema.
Listing 16. Schema declaring a custom type


</xs:restriction>
</xs:simpleType>
</xs:restriction>
</xs:simpleType>
<xs:element name="title" type="xs:string"/>
<xs:element name="author" type="xs:string"/>
<xs:element name="copyright" type="pub:year"/>
<xs:element name="publisher" type="xs:string"/>
<xs:element name="isbn" type="pub:isbn"/>

<xs:complexType>
<xs:sequence>
<xs:complexType>
<xs:sequence>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
The test document still validates when you test it using XMLSpy. Now remove the
first dash in the first isbn element and revalidate. This should cause it to flunk
validation. Figure 8 shows what happens.
Figure 8. Flunking validation of a custom type
You can use a divide-and-conquer technique to make your schema even easier to
understand and maintain. Borrow a technique from programming and break the
schema into declarations that reside in separate files. First, create a schema that
contains only the simple type, element, and attribute declarations of your
previously refactored schema. Listing 17 shows how.
Listing 17. Declarations in separate schema file


</xs:restriction>
</xs:simpleType>
</xs:restriction>
</xs:simpleType>
<xs:element name="title" type="xs:string"/>
<xs:element name="author" type="xs:string"/>
<xs:element name="copyright" type="pub:year"/>
<xs:element name="publisher" type="xs:string"/>
<xs:element name="isbn" type="pub:isbn"/>

</xs:schema>
Next, remove those items from a copy of the original schema, replacing them with
the following markup:
<xs:include schemaLocation="publicationsRedefine8.xsd"/>
This produces the structural schema shown in Listing 18. Its ref attributes refer to
the included schema. Notice how each file is easier to read. One is about element,
attribute, and type declaration. The other is about arranging those into a document
structure. When you alter the linkage in the test XML file to point to
publications8.xsd, the file validates correctly in XMLSpy.
Listing 18. Structural W3C Schema

<xs:include schemaLocation="publicationsRedefine8.xsd"/>
<xs:complexType>
<xs:sequence>
<xs:complexType>
<xs:sequence>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
You can use the XML Schema ability to xs:include, xs:redefine, or
xs:import external files to derive new types through xs:extension or
xs:restriction elements. This could be an entire tutorial in itself. Read about it in
the Web-based "W3C XML Schema Part 0: Primer Second Edition" listed in Resources.
Section 6. Deciding between DTD and W3C XML Schema
I touched on aspects of both DTD and W3C XML Schema. Now I'll compare and
contrast them.
Table 7 lists the basic validation features of DTDs. There is little granularity in the
control of format and types of element values and attribute values. This is usually
sufficient for narrative-style documents. Indeed, an astounding number of standard
DTDs are available for various industry-oriented narrative exchange documents.
Table 7. DTD validation

DTD validation constraints
Element nesting
Element occurrence
Permitted attributes of an element
Attribute types and default values
Data record-like documents form the other major division of XML applications.
Object serialization to and from XML requires precise specification of content. Here
is where the W3C XML Schema shines. Table 8 contains a high-level description of
its constraint features. Notice that the XML Schema shows some overlap with DTD,
but XML Schema is also able to compose new data types for a grammar. Even the
overlapping features are misleading, because XML Schema enables more precise
control of items such as element occurrence. You can stipulate that your publication
list consist of 10 and only 10 books, for example. This is not possible with DTDs.
Table 8. W3C Schema features

W3C XML Schema features
Element occurrence constraints
Namespace-qualified element and attribute declarations
Simple and complex data types
Type derivation and inheritance
It sounds as if XML Schema always wins over DTD for defining new data-oriented
grammars, but DTD can do one thing better than XML Schema. Remember entities?
Those are the macro-like declarations that can substitute named items into a
document. You can define them easily in a DTD. That functionality is difficult to
duplicate in XML Schema. General entities see common use in narrative grammars,
where DTD use remains entrenched.
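As a small illustration of that macro-like substitution, the following Python sketch declares a general entity in an internal DTD subset and lets the parser expand it. The note element and the series entity are hypothetical names chosen for the example, and the sketch relies on the standard xml.etree.ElementTree parser expanding internally declared entities.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares a general entity named series
# (a hypothetical name). The document body then refers to it as &series;.
DOC = """<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!ENTITY series "XML and Related Technologies certification prep">
]>
<note>Part 2 of &series;</note>"""


def expanded_text(xml_text):
    """Parse the document and return the root element's text content."""
    return ET.fromstring(xml_text).text


print(expanded_text(DOC))
```

The parser replaces the entity reference before the application sees the text, which is why entities work well as reusable boilerplate in narrative documents.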
XML Schema is another application of XML. In fact, it is constrained by its own XML
Schema. DTD is not XML, but rather a separate markup language. Some people
consider this a disadvantage of DTDs. Others point out that XML Schema is wordy
and can be hard to read. You've seen that normalizing a schema into sections
mitigates this somewhat. Schemas can be harder to write from scratch than DTDs.
Modern tools provide hinting assistance as you type, thus countering this argument
somewhat.
It's not valid to say, "Always design with schema." In the end, you must make the
decision based on your application, but now you have some arguments on either
side to guide you.
Section 7. Conclusion
Summary
Part 1 of this series covered XML architecture. This second tutorial discussed the
characteristics of data documents and narrative documents. It went on to model a
simple case study in XML, while showing several iterations of a grammar. Part 3
shows you how
to process XML in an application. Part 4 concentrates on transforming XML
documents into new documents, and Part 5 explains testing and tuning XML and
common related technologies.
If you study the complete series, you should have sufficient background to help you
prepare to take the IBM certification Test 142, XML and Related Technologies, to
attain the IBM Certified Solution Developer - XML and Related Technologies
certification.
Resources
Learn
• XML on developerWorks: Get the resources you need to advance your XML
skills with technical articles and tips, tutorials, standards, and IBM Redbooks.
• New to XML page (developerWorks): Browse this overview if you want to learn
about XML but don't know where to start.
• IBM XML 1.1 certification: Become an IBM Certified Developer in XML 1.1 and
related technologies.
• Introduction to XML tutorial (Doug Tidwell, developerWorks, August 2002):
Learn what XML is, why it was developed, and how it's shaping the future of
electronic commerce. You'll also cover a variety of important XML programming
interfaces and standards.
• XML Matters: Comparing W3C XML Schemas and Document Type Definitions
(DTDs) (David Mertz, developerWorks, March 2001): Compare schemas and
DTDs and clarify just what is going on in the XML schema world.
• Validating XML tutorial (Nicholas Chase, developerWorks, August 2003):
Learn what validation is and how to check a document against a Document
Type Definition (DTD) or XML Schema document.
• XML in a Nutshell, 3rd Edition (Elliotte Rusty Harold and W. Scott Means,
O'Reilly Media, 2004, ISBN: 0596007647): Check out this comprehensive XML
reference with everything from fundamental syntax rules, DTD and XML
Schema creation, XSLT transformations, processing APIs, XML 1.1, plus SAX2
and DOM Level 3.
• XML Schema Part 0: Primer Second Edition on the W3C Web site: Read about
the XML Schema and how to create schemas using the XML Schema
language.
• W3C Markup Validation Service: With this free service, check Web documents
in formats like HTML and XHTML for conformance to W3C Recommendations
and other standards.
• XHTML: Learn more about the Extensible HyperText Markup Language
(XHTML) on the Wikipedia Web site.
• VoiceXML (VXML): Read more about this XML format for interactive voice
dialogues between humans and computers on the Wikipedia Web site.
• Speech Synthesis Markup Language (SSML): Find out more about this
XML-based markup language for speech synthesis apps on the Wikipedia Web
site.
• developerWorks technical events and webcasts: Stay current with technology in
these sessions.
Get products and technologies
• IBM product evaluation versions: Download and try application development
tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and
WebSphere®.
• Trial: Rational Application Developer for WebSphere Software V6.0: Download
and explore a free trial version that supports multi-vendor runtime environments
and is powered by the Eclipse open source platform.
• Altova XMLSpy® 2006 Home Edition: Download an entry level XML editor and
development tool for designing and editing XML-based applications.
• Eclipse SDK: Download the Eclipse SDK, then browse to find useful tools and
plugins.
• XMLBuddy™ plugin for the Eclipse SDK: Edit, validate, format and more with
XMLBuddy. It supports XML, DTD, XML Schema, RELAX NG, RELAX NG
compact syntax and XSLT.
Discuss
• XML zone discussion forums: Participate in any of several XML-centered
forums.
• developerWorks blogs: Get involved in the developerWorks community.
About the author

Louis E Mauget
Lou Mauget is a senior Java™ 2 Enterprise Edition (Java EE) consultant and
published author who lives near Raleigh, North Carolina. He consults to various
enterprise and government clients. Lou was a software engineer for IBM before his
current position with Number Six Software. He continues to participate in certification
activity for IBM. You can reach Lou at LMauget@numbersix.com.
Trademarks
IBM, DB2, Lotus, Rational, Tivoli, and WebSphere are trademarks of IBM
Corporation in the United States, other countries, or both.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the
United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft
Corporation in the United States, other countries, or both.
