Wednesday May 12, 1999 SD99 Copyright 1999 Elliotte Rusty Harold elharo@metalab.unc.edu http://metalab.unc.edu/xml/slides/
What is XML?
Extensible Markup Language
A syntax for documents
A Meta-Markup Language
A Structural and Semantic language, not a formatting language
The tags you create can be documented in a Document Type Definition (DTD)
A meta syntax for domain-specific markup languages like MusicML, MathML, and CML
Converter Program
Self-Describing Data
Much data is lost due to format problems
XML is very simple
XML is self-describing
XML is well documented
<PERSON ID="p1100" SEX="M">
  <NAME>
    <GIVEN>Judson</GIVEN>
    <SURNAME>McDaniel</SURNAME>
  </NAME>
  <BIRTH>
    <DATE>21 Feb 1834</DATE>
  </BIRTH>
  <DEATH>
    <DATE>9 Dec 1905</DATE>
  </DEATH>
</PERSON>
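Because the markup names every field, a program can pull values out by name rather than by position. A minimal sketch, assuming Python's standard xml.etree.ElementTree module (the slides do not prescribe any particular parser):

```python
import xml.etree.ElementTree as ET

PERSON_XML = """
<PERSON ID="p1100" SEX="M">
  <NAME>
    <GIVEN>Judson</GIVEN>
    <SURNAME>McDaniel</SURNAME>
  </NAME>
  <BIRTH><DATE>21 Feb 1834</DATE></BIRTH>
  <DEATH><DATE>9 Dec 1905</DATE></DEATH>
</PERSON>
"""

person = ET.fromstring(PERSON_XML)

# Each value is addressed by the element names that describe it.
full_name = person.findtext("NAME/GIVEN") + " " + person.findtext("NAME/SURNAME")
born = person.findtext("BIRTH/DATE")
died = person.findtext("DEATH/DATE")

print(person.get("ID"), full_name, born, died)
```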
XML Applications
A specific markup language that uses the XML meta-syntax is called an XML application
Different XML applications have their own more constricted syntaxes and vocabularies within the broader XML syntax
Further syntax can be layered on top of this; e.g. data typing through DCDs or other schemas
Classic Literature
The Complete Plays of Shakespeare
The Bible The Koran The Book of Mormon
Vector Graphics
Vector Markup Language (VML)
Internet Explorer 5.0 Microsoft Office 2000
An Example of RDF
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/DC/">
  <rdf:Description about="http://metalab.unc.edu/xml/">
    <dc:CREATOR>Elliotte Rusty Harold</dc:CREATOR>
    <dc:TITLE>Cafe con Leche</dc:TITLE>
  </rdf:Description>
</rdf:RDF>
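The prefixes rdf: and dc: only stand in for namespace URIs, so a program looks elements up by URI. A rough sketch of reading the description, assuming Python's xml.etree.ElementTree and restating the example with every attribute value properly quoted so that it parses:

```python
import xml.etree.ElementTree as ET

RDF_XML = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/DC/">
  <rdf:Description about="http://metalab.unc.edu/xml/">
    <dc:CREATOR>Elliotte Rusty Harold</dc:CREATOR>
    <dc:TITLE>Cafe con Leche</dc:TITLE>
  </rdf:Description>
</rdf:RDF>
"""

# Prefix-to-URI map used only for lookups in this script.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/DC/",
}

root = ET.fromstring(RDF_XML)
description = root.find("rdf:Description", NS)
resource = description.get("about")
creator = description.findtext("dc:CREATOR", namespaces=NS)
title = description.findtext("dc:TITLE", namespaces=NS)

print(resource, creator, title)
```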
<MONTH>9</MONTH>
<DCD> <ElementDef Type="MONTH" Model="Data" Datatype="i1" Min="1" Max="12" /> </DCD>
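What the Min and Max constraints buy can be sketched in a few lines. This is not a DCD processor; it is an illustration, assuming Python's xml.etree.ElementTree, that reads only the Type, Min, and Max attributes shown above and treats the content as an integer. The helper names load_ranges and in_range are made up for the example.

```python
import xml.etree.ElementTree as ET

DCD_XML = ('<DCD><ElementDef Type="MONTH" Model="Data" Datatype="i1" '
           'Min="1" Max="12" /></DCD>')

def load_ranges(dcd_xml):
    """Map each declared element type to its (Min, Max) integer range."""
    ranges = {}
    for definition in ET.fromstring(dcd_xml).iter("ElementDef"):
        ranges[definition.get("Type")] = (
            int(definition.get("Min")),
            int(definition.get("Max")),
        )
    return ranges

def in_range(element_xml, ranges):
    """True when the element's text is an integer inside its declared range."""
    element = ET.fromstring(element_xml)
    low, high = ranges[element.tag]
    try:
        value = int(element.text)
    except (TypeError, ValueError):
        return False  # empty or non-numeric content
    return low <= value <= high

ranges = load_ranges(DCD_XML)
print(in_range("<MONTH>9</MONTH>", ranges))
print(in_range("<MONTH>13</MONTH>", ranges))
```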
File Formats, In-house applications, and other behind the scenes uses
Microsoft Office 2000
Federal Express Web API
Netscape What's Related
Hello XML
<?xml version="1.0" standalone="yes"?>
<FOO>
Hello XML!
</FOO>
Plain ASCII or UTF-8 text
.xml is the standard file extension
Any standard text editor will work
version attribute
required
always has the value 1.0
standalone attribute
yes no
encoding attribute
UTF-8
ISO-8859-1
etc.
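A parser can report what the XML declaration says. The sketch below assumes Python's xml.parsers.expat module, whose XmlDeclHandler callback receives the version, the encoding, and a standalone flag (1 for yes, 0 for no, -1 when omitted). The function name read_declaration is invented for the example.

```python
from xml.parsers import expat

def read_declaration(document):
    """Return (version, encoding, standalone) from the XML declaration, or None."""
    seen = {}

    def on_declaration(version, encoding, standalone):
        seen["decl"] = (version, encoding, standalone)

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(document, True)
    return seen.get("decl")

full = (b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        b'<GREETING>Hello XML!</GREETING>')
minimal = b'<?xml version="1.0"?><GREETING>Hello XML!</GREETING>'

print(read_declaration(full))
print(read_declaration(minimal))
```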
greeting.xml
<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
</GREETING>
Style sheets
Separate from the XML document
Different Languages
Cascading Style Sheets Level 1 (CSS1)
Internet Explorer 5.0 Mozilla 5.0
xml-stylesheet
Style sheets are attached via an xml-stylesheet processing instruction in the prolog
<?xml version="1.0" standalone="yes"?>
<?xml-stylesheet type="text/css" href="greeting.css"?>
<GREETING>Hello XML!</GREETING>
type attribute has the value text/css or text/xsl
href attribute is a URL to the stylesheet, possibly relative
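To a parser, the xml-stylesheet instruction is an ordinary processing instruction whose data happens to look like attributes. A small sketch, assuming Python's xml.dom.minidom; the helper name stylesheet_links and the regular expression (which handles only double-quoted values) are simplifications made for this example.

```python
import re
from xml.dom import minidom, Node

DOC = ('<?xml version="1.0" standalone="yes"?>'
       '<?xml-stylesheet type="text/css" href="greeting.css"?>'
       '<GREETING>Hello XML!</GREETING>')

def stylesheet_links(xml_text):
    """Collect the pseudo-attributes of every xml-stylesheet instruction in the prolog."""
    links = []
    for node in minidom.parseString(xml_text).childNodes:
        if (node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # node.data is the raw text after the target, e.g. type="..." href="..."
            links.append(dict(re.findall(r'(\w+)="([^"]*)"', node.data)))
    return links

print(stylesheet_links(DOC))
```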
greeting.css
GREETING {display: block; font-size: 24pt; font-weight: bold}
Sample statistics
http://cbs.sportsline.com/u/baseball/mlb/stats.htm
Within these limits there's more than one way to organize the data
Hierarchically
Relationally
Objects
Child Elements
<?xml version="1.0"?>
<SEASON>
  <YEAR>
    1998
  </YEAR>
</SEASON>
Leagues
Major league baseball is divided into two leagues
Each league has
a name
three divisions
Divisions
Each division has
name 4-6 teams
Teams
Each team has
Name City
Players
Player Data
Each player has
First name Last name
Position
Statistics
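The season, league, division, team, and player nesting described above can be sketched by building the tree in code. This assumes Python's xml.etree.ElementTree. The element name DIVISION_NAME and the player values are placeholders invented for illustration, not data from the slides.

```python
import xml.etree.ElementTree as ET

def add_child(parent, tag, text=None):
    """Append a child element with optional text and return it."""
    child = ET.SubElement(parent, tag)
    child.text = text
    return child

season = ET.Element("SEASON")
add_child(season, "YEAR", "1998")

league = add_child(season, "LEAGUE")
add_child(league, "LEAGUE_NAME", "National League")

division = add_child(league, "DIVISION")
add_child(division, "DIVISION_NAME", "East")  # assumed element name

team = add_child(division, "TEAM")
add_child(team, "TEAM_CITY", "Atlanta")
add_child(team, "TEAM_NAME", "Braves")

player = add_child(team, "PLAYER")
add_child(player, "GIVEN_NAME", "Example")    # placeholder player
add_child(player, "SURNAME", "Player")
add_child(player, "POSITION", "Shortstop")

serialized = ET.tostring(season, encoding="unicode")
print(serialized)
```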
A Style Sheet
1998shortstats.xml
baseballstats.css
<?xml-stylesheet type="text/css" href="baseballstats.css"?>
styled1998shortstats.xml
Possible Extensions
There should be captions like "RBI" or "At Bats".
Derived numbers like batting averages are not included.
The titles are short, e.g. "1998" instead of "1998 Major League Baseball".
The document is so long it's hard to read. Something similar to IE5's collapsible outline view would be nice.
Pitcher stats should be separated from batter stats.
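A derived number such as a batting average (hits divided by at bats) can be computed from the raw counts rather than stored. A minimal sketch, assuming Python's xml.etree.ElementTree and PLAYER children named AT_BATS and HITS; the sample figures are placeholders, not real statistics.

```python
import xml.etree.ElementTree as ET

def batting_average(player):
    """Return hits / at bats rounded to three places, or None with no at bats."""
    at_bats = int(player.findtext("AT_BATS", default="0"))
    hits = int(player.findtext("HITS", default="0"))
    if at_bats == 0:
        return None  # e.g. a pitcher who never batted
    return round(hits / at_bats, 3)

# Placeholder numbers chosen only to make the arithmetic easy to follow.
batter = ET.fromstring("<PLAYER><AT_BATS>400</AT_BATS><HITS>120</HITS></PLAYER>")
pitcher = ET.fromstring("<PLAYER><WINS>10</WINS></PLAYER>")

print(batting_average(batter))
print(batting_average(pitcher))
```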
Possible Solutions
CSS Level 2
XSL
XSL + JavaScript
Well-formedness Rules
Empty tags end with />
There is a unique root element
Elements may not overlap
Attribute values are quoted
< and & are only used to start tags and entities
Only the five predefined entity references are used
Bad:
<A HREF=http://metalab.unc.edu/xml/>
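< and & are only used to start tags and entities
Good:
<H1>O'Reilly &amp; Associates</H1>
Bad:
<H1>O'Reilly & Associates</H1>
Good:
<CODE>for (int i = 0; i &lt;= args.length; i++ ) { </CODE>
Bad:
<CODE>for (int i = 0; i <= args.length; i++ ) { </CODE>
Only the five predefined entity references are used
Good:
&amp; &lt; &gt; &quot; &apos;
Bad:
&copy; &reg; &tm; &alpha; &eacute;
etc.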
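These rules are easy to try against a real parser. The sketch below assumes Python's xml.etree.ElementTree, which raises ParseError on any well-formedness violation; the function name is_well_formed is chosen for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True when the parser accepts the text as a complete XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

samples = [
    '<A HREF="http://metalab.unc.edu/xml/"/>',   # quoted attribute value
    '<A HREF=http://metalab.unc.edu/xml/>',      # unquoted value, never closed
    "<H1>O'Reilly &amp; Associates</H1>",        # & written as an entity reference
    "<H1>O'Reilly & Associates</H1>",            # raw &
    "<P>&copy;</P>",                             # entity that was never declared
    "<B><I>overlap</B></I>",                     # overlapping elements
    "<A/><B/>",                                  # two root elements
]

for sample in samples:
    print(is_well_formed(sample), sample)
```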
What is a DTD?
a list of the elements, tags, attributes, and entities contained in a document, and their relationship to each other
internal vs. external DTDs
greeting.dtd:
<!ELEMENT GREETING (#PCDATA)>
Valid:
<GREETING> various random text but no markup </GREETING>
Invalid Documents
Any GREETING element that contains child elements, since the DTD allows only #PCDATA
Validating Tools
Command line programs like XJParse
Online validators
http://www.stg.brown.edu/service/xmlvalid/
http://www.cogsci.ed.ac.uk/%7Erichard/xml-check.html
Browsers
Element Declarations
Each element must be declared in a <!ELEMENT> declaration.
A <!ELEMENT> declaration gives the name and content model of the element.
The content model uses a simple regular expression-like grammar to precisely specify what is and isn't allowed in an element.
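The resemblance to regular expressions is close enough that an element-only content model can be translated into one almost mechanically. The following is an illustrative sketch in Python, not how a validating parser is implemented: it ignores #PCDATA, ANY, and EMPTY, and the helper names model_to_regex and children_match are invented here.

```python
import re

NAME = re.compile(r"[A-Za-z_][\w.\-]*")

def model_to_regex(model):
    """Translate an element-only content model into a compiled regex."""
    # Each element name becomes a group that consumes "NAME;" from the child list.
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", model)
    # A comma means "followed by", which adjacency already expresses in a regex.
    # The characters ( ) | ? * + keep their meaning unchanged.
    pattern = re.sub(r"[\s,]+", "", pattern)
    return re.compile(pattern)

def children_match(model, children):
    """True when the sequence of child element names satisfies the model."""
    encoded = "".join(name + ";" for name in children)
    return model_to_regex(model).fullmatch(encoded) is not None

season_model = "(YEAR, LEAGUE, LEAGUE)"
print(children_match(season_model, ["YEAR", "LEAGUE", "LEAGUE"]))
print(children_match(season_model, ["YEAR", "LEAGUE"]))
```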
Content Specifications
ANY
#PCDATA
Sequences
Choices
Mixed Content
Modifiers
Empty
ANY
<!ELEMENT SEASON ANY>
A SEASON can contain any child element and/or raw text (parsed character data)
#PCDATA
<!ELEMENT YEAR (#PCDATA)>
#PCDATA
Valid:
<YEAR>1999</YEAR>
<YEAR>99</YEAR>
<YEAR>1999 C.E.</YEAR>
<YEAR> The year of our Lord one thousand, nine hundred, and ninety-nine </YEAR>
Invalid:
<YEAR>
  <MONTH>January</MONTH>
  <MONTH>February</MONTH>
  <MONTH>March</MONTH>
  <MONTH>April</MONTH>
  <MONTH>May</MONTH>
  <MONTH>June</MONTH>
  <MONTH>July</MONTH>
  <MONTH>August</MONTH>
  <MONTH>September</MONTH>
  <MONTH>October</MONTH>
  <MONTH>November</MONTH>
  <MONTH>December</MONTH>
</YEAR>
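The difference between these YEAR examples comes down to whether the element has any child elements. A tiny check, assuming Python's xml.etree.ElementTree (the function name is_pcdata_only is made up for this sketch):

```python
import xml.etree.ElementTree as ET

def is_pcdata_only(element_xml):
    """True when the element holds text only, with no child elements."""
    element = ET.fromstring(element_xml)
    return len(element) == 0  # len() counts child elements, not characters

print(is_pcdata_only("<YEAR>1999 C.E.</YEAR>"))
print(is_pcdata_only("<YEAR><MONTH>January</MONTH></YEAR>"))
```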
Child Elements
To declare that a LEAGUE element must have a LEAGUE_NAME child:
<!ELEMENT LEAGUE (LEAGUE_NAME)>
<!ELEMENT LEAGUE_NAME (#PCDATA)>
Sequences
Separate multiple required child elements with commas; e.g.
<!ELEMENT SEASON (YEAR, LEAGUE, LEAGUE)>
<!ELEMENT LEAGUE (LEAGUE_NAME, DIVISION, DIVISION, DIVISION)>
<!ELEMENT PLAYER (GIVEN_NAME, SURNAME, POSITION, GAMES, GAMES_STARTED, AT_BATS?, RUNS?, HITS?, DOUBLES?, TRIPLES?, HOME_RUNS?, RBI?, STEALS?, CAUGHT_STEALING?, SACRIFICE_HITS?, SACRIFICE_FLIES?, ERRORS?, WALKS?, STRUCK_OUT?, HIT_BY_PITCH?, WINS?, LOSSES?, SAVES?, COMPLETE_GAMES?, SHUT_OUTS?, ERA?, INNINGS?, EARNED_RUNS?, HIT_BATTER?, WILD_PITCHES?, BALK?, WALKED_BATTER?, STRUCK_OUT_BATTER?)>
Finished DTD
Choices
<!ELEMENT PAYMENT (CASH | CREDIT_CARD)>
<!ELEMENT PAYMENT (CASH | CREDIT_CARD | CHECK)>
A parenthesized group can be nested inside other parentheses in place of a single element.
A parenthesized group can be suffixed with a plus sign, an asterisk, or a question mark.
<!ELEMENT dl (dt, dd)*>
<!ELEMENT ARTICLE (TITLE, (P | PHOTO | GRAPH | SIDEBAR | PULLQUOTE | SUBHEAD)*, BYLINE?)>
Mixed Content
Both #PCDATA and child elements in a choice
<!ELEMENT TEAM (#PCDATA | TEAM_CITY | TEAM_NAME | PLAYER)*>
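With mixed content, text may appear anywhere and the listed children may appear in any order and number, so the only thing left to check is that every child is one of the listed names. A sketch, assuming Python's xml.etree.ElementTree; fits_mixed_model is a name invented for the example.

```python
import xml.etree.ElementTree as ET

ALLOWED_IN_TEAM = {"TEAM_CITY", "TEAM_NAME", "PLAYER"}

def fits_mixed_model(element_xml, allowed):
    """True when every child element is one of the names listed in the model."""
    element = ET.fromstring(element_xml)
    return all(child.tag in allowed for child in element)

good = ("<TEAM>The <TEAM_CITY>Atlanta</TEAM_CITY> "
        "<TEAM_NAME>Braves</TEAM_NAME> play here</TEAM>")
bad = "<TEAM><MANAGER>unlisted child</MANAGER></TEAM>"

print(fits_mixed_model(good, ALLOWED_IN_TEAM))
print(fits_mixed_model(bad, ALLOWED_IN_TEAM))
```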
Empty elements
<!ELEMENT BR EMPTY>
<!ELEMENT IMG EMPTY>
<!ELEMENT HR EMPTY>
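An element declared EMPTY may have attributes but no content at all, not even white space. In a document it can be written as <BR/> or as <BR></BR>. A short check, assuming Python's xml.etree.ElementTree (is_empty is a hypothetical helper):

```python
import xml.etree.ElementTree as ET

def is_empty(element_xml):
    """True when the element has neither child elements nor any text."""
    element = ET.fromstring(element_xml)
    return len(element) == 0 and element.text is None

for sample in ["<BR/>", "<BR></BR>", '<IMG SRC="logo.gif"/>', "<BR> </BR>", "<HR><P/></HR>"]:
    print(is_empty(sample), sample)
```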
Internal DTDs
<?xml version="1.0"?>
<!DOCTYPE GREETING [
  <!ELEMENT GREETING (#PCDATA)>
]>
<GREETING>
Hello XML!
</GREETING>
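Even a non-validating parser reads the internal DTD subset. The sketch below assumes Python's xml.parsers.expat, which reports the DOCTYPE name and each element declaration through callbacks but does not enforce the declarations; checking the document against them would take a validating parser.

```python
from xml.parsers import expat

DOC = b"""<?xml version="1.0"?>
<!DOCTYPE GREETING [
  <!ELEMENT GREETING (#PCDATA)>
]>
<GREETING>Hello XML!</GREETING>
"""

root_names = []   # names given in DOCTYPE declarations
declared = {}     # element name -> content model tuple reported by expat

def on_doctype(name, system_id, public_id, has_internal_subset):
    root_names.append(name)

def on_element_decl(name, model):
    declared[name] = model

parser = expat.ParserCreate()
parser.StartDoctypeDeclHandler = on_doctype
parser.ElementDeclHandler = on_element_decl
parser.Parse(DOC, True)

print(root_names, sorted(declared))
```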
Questions?