An Introduction to the Extensible Markup Language (XML)

by Martin Bryan of The SGML Centre

The SGML Centre, 1997

This file gives a very brief overview of the most commonly used components of the World Wide Web Consortium's (W3C) Extensible Markup Language (XML), as specified in the W3C Recommendation approved on 10th February 1998.

What is XML?
XML is a subset of the Standard Generalized Markup Language (SGML) defined in ISO standard 8879:1986 that is designed to make it easy to interchange structured documents over the Internet. XML files always clearly mark where the start and end of each of the logical parts (called elements) of an interchanged document occurs. XML restricts the use of SGML constructs to ensure that fall-back options are available when access to certain components of the document is not currently possible over the Internet. It also defines how Internet Uniform Resource Locators can be used to identify component parts of XML data streams.

By defining the role of each element of text in a formal model, known as a Document Type Definition (DTD), users of XML can check that each component of a document occurs in a valid place within the interchanged data stream. An XML DTD allows computers to check, for example, that users do not accidentally enter a third-level heading without first having entered a second-level heading, something that cannot be checked using the HyperText Markup Language (HTML) previously used to code documents that form part of the World Wide Web (WWW) of documents accessible through the Internet.

However, unlike SGML, XML does not require the presence of a DTD. If no DTD is available, either because all or part of it is not accessible over the Internet or because the user failed to create it, an XML system can assign a default definition for undeclared components of the markup.

XML allows users to:

- bring multiple files together to form compound documents
- identify where illustrations are to be incorporated into text files, and the format used to encode each illustration
- provide processing control information to supporting programs, such as document validators and browsers
- add editorial comments to a file.

It is important to note, however, that XML is not:


- a predefined set of tags, of the type defined for HTML, that can be used to mark up documents
- a standardized template for producing particular types of documents.

XML was not designed to be a standardized way of coding text: in fact it is impossible to devise a single coding scheme that would suit all languages and all applications. Instead XML is a formal language that can be used to pass information about the component parts of a document to another computer system. XML is flexible enough to be able to describe any logical text structure, whether it be a form, memo, letter, report, book, encyclopedia, dictionary or database.

The components of XML


XML is based on the concept of documents composed of a series of entities. ('Entity' is the English spelling of the French word 'entité', the Teutonic equivalent of which is 'thing'. Those familiar with modern programming techniques will probably be more comfortable using the word 'object'. All these terms are synonymous.) Each entity can contain one or more logical elements. Each of these elements can have certain attributes (properties) that describe the way in which it is to be processed. XML provides a formal syntax for describing the relationships between the entities, elements and attributes that make up an XML document, which can be used to tell the computer how it can recognize the component parts of each document.

XML differs from other markup languages in that it does not simply indicate where a change of appearance occurs, or where a new element starts. XML sets out to clearly identify the boundaries of every part of a document, whether it be a new chapter, a piece of boilerplate text, or a reference to another publication. To allow the computer to check the structure of a document users must provide it with a document type definition that declares each of the permitted entities, elements and attributes, and the relationships between them.

How is XML used?


To use a set of markup tags that has been defined by a trade association or similar body, users need to know how the markup tags are delimited from normal text and in which order the various elements should be used. Systems that understand XML can provide users with lists of the elements that are valid at each point in the document, and will automatically add the required delimiters to the name to produce a markup tag. Where the data capture system does not understand XML, users can enter the XML tags manually for later validation. Elements and their attributes are entered between matched pairs of angle brackets (<...>) while entity references start with an ampersand and end with a semicolon (&...;).

Because XML tag sets are based on the logical structure of the document they are somewhat easier to understand, and remember, than physically based markup schemes of the type typically provided by word processors. An XML memo might be coded as:
<memo>
<to>All staff</to>
<from>Martin Bryan</from>
<date>5th November</date>
<subject>Cats and Dogs</subject>
<text>Please remember to keep all cats and dogs indoors tonight.</text>
</memo>

This form of the file is ideal for a computer to follow, and therefore to process. The start and end of each logical element of the file has been clearly identified by entry of a start-tag (e.g. <to>) and an end-tag (e.g. </to>). Notice that at this point nothing has been said about the format of the final document. From the neutral format provided by XML users can choose to display the memo on a screen, whose size can be varied to suit user preferences, to print the text onto a pre-printed form, or to generate a completely new form, positioning each element of the document where needed.
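As a minimal sketch of how a program might follow this structure, the Python fragment below reads the memo with the standard library's xml.etree.ElementTree module and lists each element with its content. The choice of Python and the helper name memo_fields are assumptions made for illustration only; nothing in XML requires them.

```python
import xml.etree.ElementTree as ET

# The memo from the text, stored as a string.
MEMO = """<memo>
<to>All staff</to>
<from>Martin Bryan</from>
<date>5th November</date>
<subject>Cats and Dogs</subject>
<text>Please remember to keep all cats and dogs indoors tonight.</text>
</memo>"""


def memo_fields(xml_text):
    """Return (element name, content) pairs for each child of the root element.

    memo_fields is a hypothetical helper name used only in this sketch.
    """
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]


fields = memo_fields(MEMO)
for name, value in fields:
    # How each field is displayed is entirely up to the program.
    print(f"{name}: {value}")
```

Because the markup says only what each part is, the same loop could just as easily position the fields on a pre-printed form instead of printing them line by line.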

Defining your own tag sets


To define tag sets users must create a Document Type Definition that formally identifies the relationships between the various elements that form their documents. For a simple memo the XML DTD might take the form:
<!DOCTYPE memo [
<!ELEMENT memo    (to, from, date, subject?, para+) >
<!ELEMENT para    (#PCDATA) >
<!ELEMENT to      (#PCDATA) >
<!ELEMENT from    (#PCDATA) >
<!ELEMENT date    (#PCDATA) >
<!ELEMENT subject (#PCDATA) >
]>

This model tells the computer that a memo consists of a sequence of header elements, <to>, <from>, <date> and, optionally, <subject>, which must be followed by the contents of the memo. The contents of the memo defined in this simple example is made up of a number of paragraphs, at least one of which must be present (this is indicated by the + immediately after para). In this simplified example a paragraph has been defined as a leaf node that can contain parsed character data (#PCDATA), i.e. data that has been checked to ensure that it contains no unrecognized markup strings. In a similar way the <to>, <from>, <date> and <subject> elements have been declared to be leaf nodes in the document structure tree.
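To make the meaning of the content model concrete, the sketch below hand-translates (to, from, date, subject?, para+) into a regular expression over the names of a memo's child elements. This is not how a validating XML parser works internally, and the names MEMO_MODEL and matches_memo_model are assumptions introduced for this illustration; it only shows what the sequence, ? and + operators permit.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of: <!ELEMENT memo (to, from, date, subject?, para+) >
# Each child element name is followed by a comma so names cannot run together.
MEMO_MODEL = re.compile(r"to,from,date,(subject,)?(para,)+")


def matches_memo_model(xml_text):
    """Return True if the children of <memo> occur in the order the model allows."""
    root = ET.fromstring(xml_text)
    if root.tag != "memo":
        return False
    child_names = "".join(child.tag + "," for child in root)
    return MEMO_MODEL.fullmatch(child_names) is not None


with_subject = ("<memo><to>a</to><from>b</from><date>c</date>"
                "<subject>d</subject><para>e</para><para>f</para></memo>")
without_subject = "<memo><to>a</to><from>b</from><date>c</date><para>e</para></memo>"
no_paragraph = "<memo><to>a</to><from>b</from><date>c</date></memo>"
wrong_order = "<memo><from>b</from><to>a</to><date>c</date><para>e</para></memo>"

for sample in (with_subject, without_subject, no_paragraph, wrong_order):
    print(matches_memo_model(sample))
```

The first two samples satisfy the model; the third fails because para+ demands at least one paragraph, and the fourth because the header elements are out of sequence.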

Where the position of an element in the model is variable the element can be defined as part of a repeatable choice of elements. For example, to allow references to books or figures to occur anywhere in the text of a paragraph, but not in the heading, the model definition for the <para> element could be modified to read:
<!ELEMENT para (#PCDATA|citation|figref)* >

where the added elements are defined as:

<!ELEMENT citation (#PCDATA) >
<!ELEMENT figref   (#PCDATA) >

Some elements do not require any contents as such. They are simply placeholders that indicate where a certain process is to take place. A special form of tag is used in XML to indicate empty elements that do not have any contents, and therefore have no end-tag. For example, a <graphic/> element is typically an empty element that acts as a placeholder for the graphical part of a figure while an optional <caption> element identifies any text associated with the illustration. Together the <graphic> and <caption> make up a <figure>, which would typically be placed at the same level as a text paragraph. The following element declarations can be used to extend the model for a <memo> to allow it to include figures as well as text:
<!ELEMENT memo    (to, from, date, subject?, (para|figure)+) >
<!ELEMENT figure  (graphic, caption?) >
<!ELEMENT graphic EMPTY >
<!ELEMENT caption (#PCDATA) >

Defining the attributes of elements


Where elements can have variable forms, or need to be linked together, they can be given suitable attributes to specify the properties to be applied to them. For example, it might be decided that the <subject> field of a memo could optionally be printed in bold or italics. A suitable attribute list declaration might, in this case, be:
<!ATTLIST subject form (bold|italic|normal) "normal" >

This tells the computer that the <subject> start-tag can be amended to read <subject form="bold"> or <subject form="italic"> if a variant font is required. If no such change is requested the program is to use the default value to make the tag read <subject form="normal">.

One especially important type of attribute is the unique identifier. Because it is unique it can be used to provide a cross reference between two points in the document. For example, you can ensure that a unique identifier is assigned to each figure by adding an attribute list declaration of the following form to the DTD:
<!ATTLIST figure id ID #REQUIRED >

This tells the computer that every <figure> element must be entered with a unique identifier within the start-tag, e.g. as <figure id="fig1"> rather than just <figure>. Unique identifiers can be referred to within the text by use of attributes that form identifier references. Typically a figure reference element might have its attribute declaration list defined as:
<!ATTLIST figref refid IDREF #IMPLIED >

The keyword #IMPLIED indicates that it is permissible to omit the attribute in some instances of the <figref> element. For example, this might need to be done if the reference was to a figure in another publication. (Unique identifiers only apply to the current XML document instance; they are not necessarily unique across document sets.)
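The ID and IDREF rules described above can be sketched as a small hand-written check. A validating parser would enforce them from the DTD itself; the function below only imitates that behaviour for the figure and figref elements, and the name check_figure_refs and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def check_figure_refs(xml_text):
    """Return a list of problems with figure ids and figref refids."""
    root = ET.fromstring(xml_text)
    problems = []
    seen_ids = set()

    # id is declared ID #REQUIRED: present on every figure and unique.
    for figure in root.iter("figure"):
        fig_id = figure.get("id")
        if fig_id is None:
            problems.append("figure without id")
        elif fig_id in seen_ids:
            problems.append(f"duplicate id {fig_id!r}")
        else:
            seen_ids.add(fig_id)

    # refid is declared IDREF #IMPLIED: optional, but if present it must
    # match an id that occurs somewhere in the same document instance.
    for ref in root.iter("figref"):
        target = ref.get("refid")
        if target is not None and target not in seen_ids:
            problems.append(f"figref points at unknown id {target!r}")
    return problems


# Hypothetical document: fig1 exists, fig9 does not.
SAMPLE = """<memo>
<para>See <figref refid="fig1"/> and <figref refid="fig9"/>.</para>
<figure id="fig1"><graphic/></figure>
</memo>"""

problems = check_figure_refs(SAMPLE)
print(problems)
```

Collecting every identifier before looking at the references matters, because a reference may legitimately appear earlier in the document than the figure it points to.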

Incorporating standard and non-standard text elements


XML also contains techniques for adding standard (boilerplate) text to a file, and for handling characters that are outside the standard character set, but which are available on certain output devices. Commonly used text can be declared within the DTD as a text entity. A typical text entity definition could take the form:
<!ENTITY company "The SGML Centre" >

Once such a declaration has been made in the DTD users can use an entity reference of the form &company; in place of the full name of the company. An advantage of using this technique is that, should the name of the company referred to by the mnemonic change later, only the entry in the DTD needs to be changed as the entity reference will automatically call in the current definition.

Text stored in another file can also be incorporated into a file using entity references. In this case the entity declaration in the DTD identifies the location of the file containing the text to be referenced, e.g.:
<!ENTITY appendix SYSTEM "http://www.myco.com/pub/bookB/appendix.xml" >

and the entity reference (&appendix;) shows where the file is to be added to the main text stream. Where non-standard characters are required special system-dependent entities can be declared to show how the characters can be generated. A typical entry might read:
<!ENTITY eacute "&#233;" >

When the string &eacute; is encountered in the text the computer will replace it by the code whose decimal value is 233. Alternatively the decimal character number, or its hexadecimal equivalent preceded by x, can be used directly as part of a character reference, e.g. &#xE9; to generate é.
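The effect of text entities and character references can be observed with a small sketch that uses Python's standard xml.etree.ElementTree module. It assumes the entity is declared in the document's internal subset, and the memo wording is invented for the example; a non-validating parser of this kind expands internally declared entities but does not fetch external files such as the appendix above.

```python
import xml.etree.ElementTree as ET

# The company entity is declared in the internal subset (between the
# square brackets); the sentence itself is a made-up illustration.
DOC = """<!DOCTYPE memo [
<!ENTITY company "The SGML Centre">
]>
<memo><para>Issued by &company;, caf&#233; or caf&#xE9;.</para></memo>"""

para = ET.fromstring(DOC).find("para")

# The entity reference and both character references (decimal and
# hexadecimal) have been replaced by the time the text reaches the program.
print(para.text)
```

If the company name changes, only the declaration needs editing; every &company; reference picks up the new text the next time the document is parsed.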

Illustrations, tables and other special elements


XML provides a number of techniques for handling non-standard document elements. Where the coding scheme of an element of the file such as an illustration differs from that used for normal text the contents of the element can be treated as an entity with a special notation, e.g.:
<!ENTITY fig1 SYSTEM "http://www.myco.com/book12/figures/fig1" NDATA GIF >

Alternatively details of the relevant notation can be defined as an attribute of an element, e.g.:
<!ATTLIST graphic
  source %URL; #REQUIRED
  type NOTATION (GIF|PNG|JPEG) "JPEG" >

To identify where the figure is to be positioned in the text you would either enter an entity reference such as &fig1; or an empty element such as:
<graphic source="http://www.myco.com/figures/fig1.gif" type="GIF"/>

In both these situations a notation declaration is required to tell the program what to do with the unparsed data that is contained in the referenced file. Typically this takes the form of a call to a program module, e.g.:
<!NOTATION GIF SYSTEM "c:\windows\system\gif.dll" >

Where text, such as computer code, has been created in a form designed to be output on a line-by-line basis with the original line breaks it can be flagged as a special type of parsed character data by addition of a special reserved attribute, xml:space, to the element declaration:
<!ELEMENT code (#PCDATA) >
<!ATTLIST code xml:space (default|preserve) #FIXED "preserve" >

where preserve means preserve the line breaks rather than use the default of replacing line breaks by spaces before justifying the contents of the element.
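How an application might honour xml:space can be sketched as follows. The render_text function is a hypothetical stand-in for whatever formatting a real application performs; the only parser behaviour relied on is that xml.etree.ElementTree reports the reserved xml: prefix under the namespace name http://www.w3.org/XML/1998/namespace. In this sketch the attribute is written in the start-tag, since a fixed value declared in an external DTD would not be seen by this non-validating parser.

```python
import xml.etree.ElementTree as ET

# ElementTree spells the reserved xml:space attribute with its namespace name.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def render_text(element):
    """Return the element's text, keeping line breaks only when asked to.

    render_text is a hypothetical formatter written for this example.
    """
    text = element.text or ""
    if element.get(XML_SPACE) == "preserve":
        return text  # keep every space and line break as entered
    return " ".join(text.split())  # default: line breaks become single spaces


code = ET.fromstring('<code xml:space="preserve">if x:\n    y = 1</code>')
para = ET.fromstring("<para>one\n   two</para>")

print(repr(render_text(code)))
print(repr(render_text(para)))
```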

Using XML coded text

An XML file normally consists of three types of markup, the first two of which are optional:

1. An XML declaration, which has the form of a processing instruction, identifying the version of XML being used, the way in which it is encoded, and whether it references other files or not, e.g.:
<?xml version="1.0" encoding="UCS2" standalone="yes"?>

2. A document type declaration that either contains the formal markup declarations in its internal subset (between square brackets) or references a file containing the relevant markup declarations (the external subset), e.g.:
<!DOCTYPE memo SYSTEM "http://www.myco.com/dtds/memo.dtd">

3. A fully-tagged document instance which consists of a root element, whose element type name must match that assigned as the document type name in the document type declaration, within which all other markup is nested.

If all three components are present, and the document instance conforms to the rules defined in the document type definition, the document is said to be valid. If only the last component is present, and no formal model is present, all the XML processor can do is to check that the document instance is well-formed, i.e. that each element is properly nested within its parent elements, and that each attribute is specified as an attribute name followed by a value indicator (=) and a quoted string.

XML-coded files are, by their nature, ideal for storing in databases. Because XML files are both object-orientated and hierarchical in nature they can be adapted to virtually any type of database, though care sometimes needs to be taken to ensure that enough structural data is retained in the database to reconstruct the original file. A standardized interface to XML data is defined through W3C's Document Object Model (DOM), which provides a CORBA IDL interface between applications exchanging XML data. Data stored using non-XML notations will need appropriate application software to process it, but the XML-coded file will correctly identify where each piece of such data belongs in the complete document and where it has been stored prior to use.

By storing data in the clearly defined format provided by XML you can ensure that your data will be transferable to a wide range of hardware and software environments. New techniques in programming and processing data will not affect the logical structure of your document's message. If more detail needs to be added to the file all you need to do is to update the model and then add new markup tags where required in the document instance. If a completely new style is required then the existing document model can be linked to the new one to provide automatic updating of document structures.

Webmaster: mtbryan@sgml.u-net.com

Lesson 1: Authoring XML Elements

What is an XML element?


XML is a meta-markup language, a set of rules for creating semantic tags used to describe data. An XML element is made up of a start tag, an end tag, and data in between. The start and end tags describe the data within the tags, which is considered the value of the element. For example, the following XML element is a <director> element with the value "Matthew Dunn."

<director>Matthew Dunn</director>
The element name "director" allows you to mark up the value "Matthew Dunn" semantically, so you can differentiate that particular bit of data from another, similar bit of data. For example, there might be another element with the value "Matthew Dunn."

<actor>Matthew Dunn</actor>
Because each element has a different tag name, you can easily tell that one element refers to Matthew Dunn, the director, while the other refers to Matthew Dunn, the actor. If there were no way to mark up the data semantically, having two elements with the same value might cause confusion. In addition, XML tags are case-sensitive, so the following are each a different element.

<City> <CITY> <city>


Attributes
An element can optionally contain one or more attributes. An attribute is a name-value pair separated by an equal sign (=).

<CITY ZIP="01085">Westfield</CITY>
In this example, ZIP="01085" is an attribute of the <CITY> element. Attributes are used to attach additional, secondary information to an element, usually meta information. Attributes can also accept default values, while elements cannot. Each attribute of an element can be specified only once, but in any order.
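The attribute rules above can be tried out with Python's standard xml.parsers.expat module, as in the sketch below. The COUNTRY attribute and its default value "USA" are hypothetical additions used only to show a default being supplied; the helper names read_attributes and is_rejected are likewise assumptions of this example.

```python
import xml.parsers.expat


def read_attributes(xml_text):
    """Return {element name: attribute dictionary} as reported by the parser."""
    found = {}
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        found[name] = dict(attrs)

    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return found


def is_rejected(xml_text):
    """Return True if the parser refuses the text as not well-formed."""
    try:
        read_attributes(xml_text)
    except xml.parsers.expat.ExpatError:
        return True
    return False


# COUNTRY and its default are hypothetical; only ZIP appears in the start-tag.
WITH_DEFAULT = """<!DOCTYPE CITY [
<!ATTLIST CITY COUNTRY CDATA "USA">
]>
<CITY ZIP="01085">Westfield</CITY>"""

attrs = read_attributes(WITH_DEFAULT)["CITY"]
print(attrs)

# Specifying the same attribute twice is an error.
duplicate_rejected = is_rejected('<CITY ZIP="01085" ZIP="01086">Westfield</CITY>')
print(duplicate_rejected)
```

The parser reports the defaulted attribute alongside the one typed in the start-tag, so the application does not need to know which of the two the author actually entered.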


Check the syntax


Because XML is a highly structured language, it is important that all XML be well-formed. That is, the XML must have both a start tag and end tag, and must be authored using the proper syntax.
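A well-formedness check of this kind can be sketched in a few lines of Python using the standard xml.etree.ElementTree module, which raises ParseError for text that is not well-formed. The function name is_well_formed and the sample elements (including the movie title) are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = [
    "<title>Casablanca</title>",         # start tag, value, end tag
    "<title>Casablanca",                 # end tag missing
    "<Title>Casablanca</title>",         # tag names differ in case
    "<a><b>text</a></b>",                # elements not properly nested
    "<CITY ZIP=01085>Westfield</CITY>",  # attribute value not quoted
]

results = {sample: is_well_formed(sample) for sample in samples}
for sample, ok in results.items():
    print(ok, sample)
```

Only the first sample passes. The third fails for the reason given earlier: tag names are case-sensitive, so </title> does not close <Title>.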
