
Text Annotation Techniques

Presented By

Sreeram Sreenivasan

What is an annotated text?

Ordinary text (e.g.): This is an ordinary text document.
Annotated text (e.g.): <html> <title> Sample Document </title> <body> This is an annotated text document. </body> </html>
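
The difference between the two forms can be made concrete by separating the markup from the text it annotates. The following is a minimal sketch using Python's standard html.parser module; the names TextOnly and strip_markup are illustrative and do not come from the original slides.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data and records which tags were seen."""

    def __init__(self):
        super().__init__()
        self.tags = []    # tag names, in document order
        self.chunks = []  # text found between tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Skip the whitespace that only separates tags.
        if data.strip():
            self.chunks.append(data.strip())


def strip_markup(annotated):
    """Return (tags, text chunks) for an annotated string."""
    parser = TextOnly()
    parser.feed(annotated)
    parser.close()
    return parser.tags, parser.chunks


annotated = (
    "<html> <title> Sample Document </title> "
    "<body> This is an annotated text document. </body> </html>"
)
tags, chunks = strip_markup(annotated)
print(tags)    # the annotations
print(chunks)  # the ordinary text they annotate
```

The tags list holds the annotations; the chunks list holds what would remain as ordinary text.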

Key Words
• DTD
• SGML

Document Type Definition


• A DTD is a specification that accompanies an annotated document.
• It tells the parser which codes (markup) separate paragraphs, identify topic headings, and so on.
• It also tells the parser how each tag is to be processed.
• The DTD for a document is generally placed at the top of the document.

Standard Generalized Markup Language


• SGML stands for Standard Generalized Markup Language.
• It is a standard for how to specify a document markup language or tag set.
• SGML is not itself a document language; each markup language specified with it is described by a DTD.
[Diagram: SGML, with HTML, XML, and WML beneath it]

• HTML (Hyper Text Markup Language)
• XML (eXtensible Markup Language)
• WML (Wireless Markup Language)
• TEI (Text Encoding Initiative)
• References

HYPER TEXT MARKUP LANGUAGE


• HTML is the set of markup symbols or codes inserted in a file intended for display in a World Wide Web browser.
• The markup tells the Web browser how to display a Web page's text and images for the user.
• Each individual markup code is called an element, although it is commonly referred to as a tag.
• Some elements come in pairs that mark where a display effect begins and where it ends.

The basic annotations on an HTML page:

1. Document tags: HTML, HEAD, TITLE, BODY, and comment tags
2. Basic text structures: headings, paragraph, line break, and blockquote
3. Anchors: HREF and NAME
4. Images: IMG, ALIGN, and ALT

Heading Tag

<H1>Heading 1</H1>
<H2>Heading 2</H2>
<H3>Heading 3</H3>
<H4>Heading 4</H4>

Rendered result:

Heading 1
Heading 2
Heading 3
Heading 4

Paragraph Tag

<P>This sort of paragraph usually deserves to be broken up into several paragraphs, since its sheer bulk dissuades the reader from attempting to plumb its depths. </P>
<P> On the other hand, they can be pretty short. </P>
<P> Really short. </P>

Rendered result:

This sort of paragraph usually deserves to be broken up into several paragraphs, since its sheer bulk dissuades the reader from attempting to plumb its depths.

On the other hand, they can be pretty short.

Really short.

HTML specifics
• Although special editors exist for writing HTML files, a general-purpose editor such as MS Word or emacs on Unix also works.
• HTML tags are not case-sensitive, i.e. <title> and <TITLE> mean the same thing.
• Because HTML is standardized, HTML files can be viewed with browsers (IE or Netscape), parsers, or SGML compilers.
• Sample document:

<html>
<title> Sample Document </title>
<body>
<p> This is a sample HTML document.</p>
<p>It illustrates the usage of tags with the actual text.</p>
</body>
</html>
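
The case-insensitivity of HTML tag names noted above can be checked with a parser. This is a small sketch using Python's standard html.parser module, which reports tag names converted to lower case; TagLogger and start_tags are illustrative names.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records the name of every start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names converted to lower case.
        self.seen.append(tag)


def start_tags(document):
    """Return the start-tag names found in an HTML string."""
    logger = TagLogger()
    logger.feed(document)
    logger.close()
    return logger.seen


lower = start_tags("<html><title>Sample Document</title></html>")
upper = start_tags("<HTML><TITLE>Sample Document</TITLE></HTML>")
print(lower == upper)  # both spellings name the same elements
```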

EXTENSIBLE MARKUP LANGUAGE


• Definition: XML is a flexible way to create common information formats and share both the format and the data on the World Wide Web, intranets, and elsewhere.

Differences with HTML


• Tags and semantics are flexible: the author can define their own tags. E.g. <P> in XML can mean a paragraph or a phone number.
• Processing of an XML document depends on the receiving application.
• XML supports links to multiple documents.
• XML tags describe the data they contain. E.g. <phoneno> may describe a telephone number. As in HTML, tags may also carry attributes.
• A forgotten tag makes an XML file unusable, whereas in HTML the omission may be tolerated.
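
The last point, that XML is strict about a forgotten tag while HTML is lenient, can be sketched with Python's standard parsers. The <contact> element and the phone number are made up for illustration, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gathers text content, ignoring whether tags are balanced."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        self.text.append(data)


def xml_is_well_formed(source):
    """Return True if an XML parser accepts the string."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


def html_text(source):
    """Return the text an HTML parser recovers from the string."""
    collector = TextCollector()
    collector.feed(source)
    collector.close()
    return "".join(collector.text)


# The closing </phoneno> tag is deliberately missing.
broken = "<contact><phoneno>555-0100</contact>"
fixed = "<contact><phoneno>555-0100</phoneno></contact>"

print(xml_is_well_formed(broken))  # the XML parser rejects the file
print(xml_is_well_formed(fixed))   # the complete version is accepted
# The HTML parser carries on even though no </p> tags are present.
print(html_text("<p>First paragraph<p>Second paragraph"))
```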

Relation with SGML


• XML is basically a subset of SGML (Standard Generalized Markup Language).
• SGML is a standard for specifying a document markup language or tag set.
• Like SGML, XML is based on the principle that documents have elements that can be described without reference to how the data should be displayed, i.e. XML files are created in terms of document structure, not appearance.

Elements of XML Language


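• An XML element consists of a start tag, an end tag, and the data between them. E.g. <director>Ed Wood</director>
• Attributes may also be assigned to an element inside its start tag, written as name="value". E.g. <director location="Hollywood">Ed Wood</director> (the attribute name "location" is illustrative).
• Note: unlike HTML tags, XML tags are case-sensitive.
• The two samples that follow show, in order, a well-formed XML document and a valid XML document.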

<?xml version="1.0"?>
<doc>
  <burns>Say<quote>goodnight</quote>, Gracie.</burns>
  <allen><quote>Goodnight, Gracie.</quote></allen>
  <applause/>
</doc>
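
As a rough illustration of how an application sees these elements, the well-formed sample can be loaded with Python's standard xml.etree.ElementTree module. The sketch also shows the case-sensitivity of XML tags; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The well-formed sample, written as one string.
sample = (
    '<?xml version="1.0"?>'
    "<doc>"
    "<burns>Say<quote>goodnight</quote>, Gracie.</burns>"
    "<allen><quote>Goodnight, Gracie.</quote></allen>"
    "<applause/>"
    "</doc>"
)

root = ET.fromstring(sample)

# Each element exposes its tag name and the text it contains.
child_tags = [child.tag for child in root]
quotes = [q.text for q in root.iter("quote")]
print(root.tag, child_tags)
print(quotes)

# XML tag names are case-sensitive, so <Quote> cannot be closed by </quote>.
try:
    ET.fromstring("<Quote>goodnight</quote>")
    mismatched_case_accepted = True
except ET.ParseError:
    mismatched_case_accepted = False
print(mismatched_case_accepted)
```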

<?xml version="1.0"?>
<!DOCTYPE PARENT [
  <!ELEMENT PARENT (CHILD*)>
  <!ELEMENT CHILD (MARK?,NAME+)>
  <!ELEMENT MARK EMPTY>
  <!ELEMENT NAME (LASTNAME+,FIRSTNAME+)*>
  <!ELEMENT LASTNAME (#PCDATA)>
  <!ELEMENT FIRSTNAME (#PCDATA)>
  <!ATTLIST MARK
    NUMBER ID #REQUIRED
    LISTED CDATA #FIXED "yes"
    TYPE (natural|adopted) "natural">
  <!ENTITY STATEMENT "This is well-formed XML">
]>
<PARENT>
  &STATEMENT;
  <CHILD>
    <MARK NUMBER="1" LISTED="yes" TYPE="natural"/>
    <NAME>
      <LASTNAME>child</LASTNAME>
      <FIRSTNAME>second</FIRSTNAME>
    </NAME>
  </CHILD>
</PARENT>
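
The distinction between well-formed and valid matters in practice because many parsers check only the former. The sketch below assumes Python's standard xml.etree.ElementTree module, which does not validate against a DTD. The reduced DTD and the <STRANGER/> element are hypothetical, chosen so that the document is well-formed but not valid.

```python
import xml.etree.ElementTree as ET

# Internal DTD: a PARENT may contain only CHILD elements.
# <STRANGER/> is a hypothetical element that the DTD does not declare.
document = """<?xml version="1.0"?>
<!DOCTYPE PARENT [
  <!ELEMENT PARENT (CHILD*)>
  <!ELEMENT CHILD (#PCDATA)>
]>
<PARENT>
  <CHILD>first</CHILD>
  <STRANGER/>
</PARENT>"""

# ElementTree checks well-formedness only, so this parse succeeds
# even though the document breaks the rules of its own DTD.
root = ET.fromstring(document)
found = [element.tag for element in root]
print(found)

# A simple manual check, standing in for what a validating parser would report.
allowed_children = {"CHILD"}
undeclared = [tag for tag in found if tag not in allowed_children]
print(undeclared)
```

A validating parser would be needed to enforce the DTD itself; the manual check here only hints at what such a parser would flag.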

Efficiency of XML in Information Retrieval


• Meaningful markup.
• A single approach accommodates both document and data structures and integrates both within documents.
• Enables transfer of data between applications.
• Structural similarity to HTML simplifies implementation using traditional web servers, browser applications, CGI, and Java.
• Files can be processed purely as data, so they can be stored or displayed.
• Files are verbose plain text, which allows easy debugging.
• It is license-free, platform-independent, and well supported.

WIRELESS MARKUP LANGUAGE


• WML is an annotation technique that allows the text portions of Web pages to be presented on cellular telephones and personal digital assistants (PDAs) via wireless access.
• WML is part of the Wireless Application Protocol (WAP), which is being proposed by several vendors to standards bodies.
• It was formerly called HDML (Handheld Devices Markup Language).
• Just like HTML and XML, WML is read and interpreted by a browser built into the WAP device.
• On WAP devices the browser is commonly called a microbrowser; it has inherently limited capabilities compared with a web browser.
• Although HTML could be used, WML is preferred because it requires less bandwidth.
• WML also takes less power to process than HTML.
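
To give a feel for what a microbrowser receives, the sketch below builds a very small WML document in Python. It assumes the basic WML structure of a <wml> deck containing <card> elements with <p> paragraphs. The function name make_wml_deck and the sample text are hypothetical, and the XML declaration and WML DOCTYPE that a real deck would carry are omitted.

```python
import xml.etree.ElementTree as ET


def make_wml_deck(card_id, title, paragraphs):
    """Build a one-card WML deck holding the given text paragraphs."""
    deck = ET.Element("wml")
    card = ET.SubElement(deck, "card", {"id": card_id, "title": title})
    for text in paragraphs:
        ET.SubElement(card, "p").text = text
    return ET.tostring(deck, encoding="unicode")


wml = make_wml_deck("home", "Headlines", ["Text-only content for a small screen."])
print(wml)
```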

TEXT ENCODING INITIATIVE


Definition: TEI is an international project to develop guidelines for the preparation and interchange of electronic texts for scholarly research.

Need for a common encoding scheme


• Until the TEI project was undertaken, there was no common encoding format for scholarly machine-readable texts.
• None of the existing encoding schemes had gained acceptance as a standard.

Origin of TEI & factors contributing to it


• TEI arose out of a planning conference convened by the Association for Computers and the Humanities (ACH) at Vassar College, Poughkeepsie, New York, in November 1987.
• Factor I: More is known now about the problems of text encoding than at the time of previous attempts.
• Factor II: The recently developed Standard Generalized Markup Language (SGML) seemed to be the ideal text-encoding scheme.

Objectives of TEI
• A. To specify a common interchange format for machine-readable texts.
• B. To provide a set of recommendations for encoding new textual materials.
• C. To document the major existing encoding schemes.

Why TEI chose SGML?


• It is easier to borrow syntax from an existing scheme.
• The syntax must be relatively simple and must allow for user-defined extensions to the predefined set of tags.
• SGML was soon shown to meet all the requirements of the TEI.
• SGML also permits the use of multiple tag sets in the same text.

References:
• HTML: http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html
• XML: http://www.w3.org/XML/
• WML: http://www.allnetdevices.com/faq/
• TEI: http://www.uic.edu/orgs/tei/
