by Martin Bryan of The SGML Centre
The SGML Centre, 1997
This file gives a very brief overview of the most commonly used components of the World Wide Web Consortium's (W3C) Extensible Markup Language (XML), as specified in the W3C Recommendation approved on 10th February 1998.
What is XML?
XML is a subset of the Standard Generalized Markup Language (SGML) defined in ISO standard 8879:1986 that is designed to make it easy to interchange structured documents over the Internet. XML files always clearly mark where the start and end of each of the logical parts (called elements) of an interchanged document occurs. XML restricts the use of SGML constructs to ensure that fallback options are available when access to certain components of the document is not currently possible over the Internet. It also defines how Internet Uniform Resource Locators can be used to identify component parts of XML data streams.
By defining the role of each element of text in a formal model, known as a Document Type Definition (DTD), users of XML can check that each component of a document occurs in a valid place within the interchanged data stream. An XML DTD allows computers to check, for example, that users do not accidentally enter a third-level heading without first having entered a second-level heading, something that cannot be checked using the HyperText Markup Language (HTML) previously used to code documents that form part of the World Wide Web (WWW) of documents accessible through the Internet.
However, unlike SGML, XML does not require the presence of a DTD. If no DTD is available, either because all or part of it is not accessible over the Internet or because the user failed to create it, an XML system can assign a default definition for undeclared components of the markup.
XML allows users to:
bring multiple files together to form compound documents;
identify where illustrations are to be incorporated into text files, and the format used to encode each illustration;
provide processing control information to supporting programs, such as document validators and browsers;
add editorial comments to a file.
It is important to note, however, that XML is not:
a predefined set of tags, of the type defined for HTML, that can be used to mark up documents;
a standardized template for producing particular types of documents.
XML was not designed to be a standardized way of coding text: in fact it is impossible to devise a single coding scheme that would suit all languages and all applications. Instead XML is a formal language that can be used to pass information about the component parts of a document to another computer system. XML is flexible enough to be able to describe any logical text structure, whether it be a form, memo, letter, report, book, encyclopedia, dictionary or database.
Because XML tag sets are based on the logical structure of the document they are somewhat easier to understand, and remember, than physically based markup schemes of the type typically provided by word processors. An XML memo might be coded as:
<memo>
  <to>All staff</to>
  <from>Martin Bryan</from>
  <date>5th November</date>
  <subject>Cats and Dogs</subject>
  <text>Please remember to keep all cats and dogs indoors tonight.</text>
</memo>
This form of the file is ideal for a computer to follow, and therefore to process. The start and end of each logical element of the file has been clearly identified by entry of a start-tag (e.g. <to>) and an end-tag (e.g. </to>). Notice that at this point nothing has been said about the format of the final document. From the neutral format provided by XML users can choose to display the memo on a screen, whose size can be varied to suit user preferences, to print the text onto a pre-printed form, or to generate a completely new form, positioning each element of the document where needed.
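To make the discussion concrete, a formal model for such a memo can be written as a DTD in the internal subset of the document itself. The following is a minimal sketch, not the author's original declaration: it assumes that the body of the memo is held in one or more para elements (rather than the text element used in the instance above) and that every header element contains only character data.

```xml
<?xml version="1.0"?>
<!DOCTYPE memo [
  <!-- A memo is a fixed sequence of header elements followed by one
       or more paragraphs; the ? makes subject optional. -->
  <!ELEMENT memo    (to, from, date, subject?, para+) >
  <!-- Leaf nodes: each may contain only parsed character data. -->
  <!ELEMENT to      (#PCDATA) >
  <!ELEMENT from    (#PCDATA) >
  <!ELEMENT date    (#PCDATA) >
  <!ELEMENT subject (#PCDATA) >
  <!ELEMENT para    (#PCDATA) >
]>
<memo>
  <to>All staff</to>
  <from>Martin Bryan</from>
  <date>5th November</date>
  <subject>Cats and Dogs</subject>
  <para>Please remember to keep all cats and dogs indoors tonight.</para>
</memo>
```

A validating parser given this file should accept it, and should reject a copy in which the para element is removed or the from element is placed before to.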
The model for this memo tells the computer that a memo consists of a sequence of header elements, <to>, <from>, <date> and, optionally, <subject>, which must be followed by the contents of the memo. The contents of the memo defined in this simple example is made up of a number of paragraphs, at least one of which must be present (this is indicated by the + immediately after para). In this simplified example a paragraph has been defined as a leaf node that can contain parsed character data (#PCDATA), i.e. data that has been checked to ensure that it contains no unrecognized markup strings. In a similar way the <to>, <from>, <date> and <subject> elements have been declared to be leaf nodes in the document structure tree.
Where the position of an element in the model is variable the element can be defined as part of a repeatable choice of elements. For example, to allow references to books or figures to occur anywhere in the text of a paragraph, but not in the heading, the model definition for the <para> element could be modified to read:
<!ELEMENT para (#PCDATA|citation|figref)* >
Some elements do not require any contents as such. They are simply placeholders that indicate where a certain process is to take place. A special form of tag is used in XML to indicate empty elements that do not have any contents, and therefore have no end-tag. For example, a <graphic/> element is typically an empty element that acts as a placeholder for the graphical part of a figure while an optional <caption> element identifies any text associated with the illustration. Together the <graphic> and <caption> make up a <figure>, which would typically be placed at the same level as a text paragraph. The following element declarations can be used to extend the model for a <memo> to allow it to include figures as well as text:
<!ELEMENT memo    (to, from, date, subject?, (para|figure)+) >
<!ELEMENT figure  (graphic, caption?) >
<!ELEMENT graphic EMPTY >
<!ELEMENT caption (#PCDATA) >
An attribute list declaration for <subject> can tell the computer that the start-tag can be amended to read <subject form="bold"> or <subject form="italic"> if a variant font is required. If no such change is requested the program is to use the default value to make the tag read <subject form="normal">.
One especially important type of attribute is the unique identifier. Because it is unique it can be used to provide a cross reference between two points in the document. For example, you can ensure that a unique identifier is assigned to each figure by adding an attribute list declaration of the following form to the DTD:
<!ATTLIST figure id ID #REQUIRED >
This tells the computer that every <figure> element must be entered with a unique identifier within the start-tag, e.g. as <figure id="fig1"> rather than just <figure>. Unique identifiers can be referred to within the text by use of attributes that form identifier references. Typically a figure reference element might have its attribute declaration list defined as:
<!ATTLIST figref refid IDREF #IMPLIED >
The keyword #IMPLIED indicates that it is permissible to omit the attribute in some instances of the <figref> element. For example, this might need to be done if the reference was to a figure in another publication. (Unique identifiers only apply to the current XML document instance: they are not necessarily unique across document sets.)
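The attribute declarations discussed above can be combined in a single document. The sketch below is illustrative rather than definitive: the declaration for the form attribute of subject is reconstructed from the description of its default value, figref is assumed to be an empty element, and the wording of the paragraph and caption is invented for the example.

```xml
<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo    (to, from, date, subject?, (para|figure)+) >
  <!ELEMENT to      (#PCDATA) >
  <!ELEMENT from    (#PCDATA) >
  <!ELEMENT date    (#PCDATA) >
  <!ELEMENT subject (#PCDATA) >
  <!ELEMENT para    (#PCDATA|figref)* >
  <!ELEMENT figref  EMPTY >
  <!ELEMENT figure  (graphic, caption?) >
  <!ELEMENT graphic EMPTY >
  <!ELEMENT caption (#PCDATA) >
  <!-- Enumerated attribute with a default: omitting form means normal. -->
  <!ATTLIST subject form  (normal|bold|italic) "normal" >
  <!-- Every figure must carry an identifier that is unique in the file. -->
  <!ATTLIST figure  id    ID    #REQUIRED >
  <!-- A figref may point at an identifier, or omit the attribute. -->
  <!ATTLIST figref  refid IDREF #IMPLIED >
]>
<memo>
  <to>All staff</to>
  <from>Martin Bryan</from>
  <date>5th November</date>
  <subject form="bold">Cats and Dogs</subject>
  <para>The notice shown in <figref refid="fig1"/> has been posted at each entrance.</para>
  <figure id="fig1">
    <graphic/>
    <caption>Notice to pet owners</caption>
  </figure>
</memo>
```

A validating parser should report an error if refid names an identifier that no element in the document carries, or if two figures are given the same id.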
Once an entity such as company has been declared in the DTD, users can use an entity reference of the form &company; in place of the full name of the company. An advantage of using this technique is that, should the name of the company referred to by the mnemonic change later, only the entry in the DTD needs to be changed as the entity reference will automatically call in the current definition.
Text stored in another file can also be incorporated into a file using entity references. In this case the entity declaration in the DTD identifies the location of the file containing the text to be referenced, e.g.:
<!ENTITY appendix SYSTEM "http://www.myco.com/pub/bookB/appendix.xml" >
and the entity reference (&appendix;) shows where the file is to be added to the main text stream. Where non-standard characters are required special system-dependent entities can be declared to show how the characters can be generated. A typical entry might read:
<!ENTITY eacute "&#233;" >
When the string &eacute; is encountered in the text the computer will replace it by the character whose decimal code is 233. Alternatively the decimal character number preceded by &#, or its hexadecimal equivalent preceded by &#x, can be used directly as a character reference, e.g. &#xE9; to generate é.
Alternatively details of the relevant notation can be defined as an attribute of an element, e.g.:
<!ATTLIST graphic
          source  %URL;                    #REQUIRED
          type    NOTATION (GIF|PNG|JPEG)  "JPEG" >
To identify where the figure is to be positioned in the text you would either enter the name of an unparsed entity, such as fig1, as the value of an attribute declared with type ENTITY, or enter an empty element such as:
<graphic source="http://www.myco.com/figures/fig1.gif" type="GIF"/>
In both these situations a notation declaration is required to tell the program what to do with the unparsed data that is contained in the referenced file. Typically this takes the form of a call to a program module, e.g.:
<!NOTATION GIF SYSTEM "c:\windows\system\gif.dll" >
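For the entity-based route, XML expects the unparsed entity to be declared with an NDATA keyword naming its notation and then to be named in an attribute of type ENTITY. The following sketch shows how the declarations might fit together; the ENTITY-typed source attribute and the caption text are assumptions for the example, and differ from the %URL; declaration shown earlier.

```xml
<?xml version="1.0"?>
<!DOCTYPE figure [
  <!ELEMENT figure  (graphic, caption?) >
  <!ELEMENT graphic EMPTY >
  <!ELEMENT caption (#PCDATA) >
  <!-- Tells the application which helper handles GIF data. -->
  <!NOTATION GIF SYSTEM "c:\windows\system\gif.dll" >
  <!-- Unparsed entity: NDATA names the notation of the external file. -->
  <!ENTITY fig1 SYSTEM "http://www.myco.com/figures/fig1.gif" NDATA GIF >
  <!-- An attribute of type ENTITY takes the entity name as its value. -->
  <!ATTLIST graphic source ENTITY #REQUIRED >
]>
<figure>
  <graphic source="fig1"/>
  <caption>Figure 1</caption>
</figure>
```

The parser does not fetch or interpret the GIF file itself; it passes the entity's location and notation on to the application.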
"here te(t, such as com!uter co e, has been create in a form esigne to be out!ut on a line9by9line basis with the original it can be flagge as a s!ecial ty!e of !arse character ata by a ition of a s!ecial reserve attribute, #ml@s&ace, to the element eclaration3
<!ELEMENT code (#PCDATA) >
<!ATTLIST code xml:space (default|preserve) #FIXED "preserve" >
where preserve means preserve the line breaks rather than use the default of replacing line breaks by spaces before justifying the contents of the element.
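As a small illustration, the sketch below applies the declaration to a two-line listing. The content of the code element is invented for the example; because the attribute is #FIXED, it does not need to be repeated in the start-tag.

```xml
<?xml version="1.0"?>
<!DOCTYPE code [
  <!ELEMENT code (#PCDATA) >
  <!-- #FIXED: every code element carries xml:space="preserve",
       whether or not the attribute is typed in the start-tag. -->
  <!ATTLIST code xml:space (default|preserve) #FIXED "preserve" >
]>
<code>10 PRINT "HELLO"
20 GOTO 10</code>
```

An application that honours xml:space should keep the line break between the two statements instead of folding it into a space.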
An XML file normally consists of three types of markup, the first two of which are optional:
1. An XML processing instruction identifying the version of XML being used, the way in which it is encoded, and whether it references other files or not, e.g.:
<?xml version="1.0" encoding="UCS2" standalone="yes"?>
2. A document type declaration that either contains the formal markup declarations in its internal subset (between square brackets) or references a file containing the relevant markup declarations (the external subset), e.g.:
<!DOCTYPE memo SYSTEM "http://www.myco.com/dtds/memo.dtd">
3. A fully-tagged document instance which consists of a root element, whose element type name must match that assigned as the document type name in the document type declaration, within which all other markup is nested.
If all three components are present, and the document instance conforms to the rules defined in the document type definition, the document is said to be valid. If only the last component is present, and no formal model is present, all the XML processor can do is to check that the document instance is well-formed, i.e. that each element is properly nested within its parent elements, and that each attribute is specified as an attribute name followed by a value indicator (=) and a quoted string.
XML-coded files are, by their nature, ideal for storing in databases. Because XML files are both object-orientated and hierarchical in nature they can be adapted to virtually any type of database, though care sometimes needs to be taken to ensure that enough structural data is retained in the database to reconstruct the original file. A standardized interface to XML data is defined through W3C's Document Object Model (DOM), which provides a CORBA IDL interface between applications exchanging XML data.
Data stored using non-XML notations will need appropriate application software to process it, but the XML-coded file will correctly identify where each piece of such data belongs in the complete document and where it has been stored prior to use.
By storing data in the clearly defined format provided by XML you can ensure that your data will be transferable to a wide range of hardware and software environments. New techniques in programming and processing data will not affect the logical structure of your document's message. If more detail needs to be added to the file all you need to do is to update the model and then add new markup tags where required in the document instance. If a completely new style is required then the existing document model can be linked to the new one to provide automatic updating of document structures.
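The difference between the two checks can be illustrated with a document that has no XML declaration and no document type declaration, so that only well-formedness can be tested. This is a sketch using the memo elements from the earlier examples; the counter-examples are kept inside a comment so that the file itself still parses.

```xml
<!-- Well-formed: every element is closed and properly nested, and every
     attribute is a name, an = sign and a quoted string. -->
<memo>
  <to>All staff</to>
  <from>Martin Bryan</from>
  <date>5th November</date>
  <subject form="bold">Cats and Dogs</subject>
  <para>Please remember to keep all cats and dogs indoors tonight.</para>
</memo>
<!-- Not well-formed, and therefore shown only inside this comment:
     <para><subject>overlap</para></subject> nests its elements
     improperly, and <subject form=bold> has an unquoted value. -->
```

Because no model is supplied, a parser cannot say whether the order of the elements is valid; it can only confirm that the markup is properly formed.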
"ebmaster3 mtbryanAsgml/u9net/com
<director>Matthew Dunn</director>
The element name "director" allows you to mark up the value "Matthew Dunn" semantically, so you can differentiate that particular bit of data from another, similar bit of data. For example, there might be another element with the value "Matthew Dunn."
<actor>Matthew Dunn</actor>
Because each element has a different tag name, you can easily tell that one element refers to Matthew Dunn, the director, while the other refers to Matthew Dunn, the actor. If there were no way to mark up the data semantically, having two elements with the same value might cause confusion. In addition, XML tags are case-sensitive, so names that differ only in case, such as <city>, <City> and <CITY>, are each a different element.
<CITY ZIP="01085">Westfield</CITY>
In this example, ZIP="01085" is an attribute of the <CITY> element. Attributes are used to attach additional, secondary information to an element, usually meta information. Attributes can also accept default values, while elements cannot. Each attribute of an element can be specified only once, but in any order.
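Default values and attribute order can be sketched with a small DTD for the example above. The STATE attribute and its default value are hypothetical additions made only to show a default; the example above shows only ZIP.

```xml
<?xml version="1.0"?>
<!DOCTYPE CITY [
  <!ELEMENT CITY (#PCDATA) >
  <!-- ZIP must always be given; STATE is a hypothetical attribute
       whose default value is used whenever it is omitted. -->
  <!ATTLIST CITY ZIP   CDATA #REQUIRED
                 STATE CDATA "MA" >
]>
<CITY STATE="MA" ZIP="01085">Westfield</CITY>
```

Writing ZIP before STATE in the start-tag would be equivalent, and omitting STATE altogether would leave a parser that reads the declarations to supply the default.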