
A methodology for indexing and retrieval of information from XML document
M. Shoaib, Shazia Arshad, A. Shah, Amjad Ali
Department of Computer Science & Engineering, University of Engineering and Technology, Lahore, Pakistan

Keywords Structured documents, structure information, hierarchical indexing, elements, attributes.

Abstract The number of XML documents carrying markup elements is increasing rapidly on the World Wide Web. The pressing
question is how the indexing and retrieval of these documents can be made more accurate and precise. Efforts to
standardize the indexing and retrieval of XML documents are growing. Structured documents currently use elements to
represent their structure, but if attributes are also used to represent structure information, the indexing and retrieval
process can be made more flexible and exact. We combine elements and attributes to represent XML documents and then apply
hierarchical, or tree-based, indexing to them, which improves the efficiency of the indexing and retrieval methodology
for XML documents.

Introduction
With the widespread use of XML documents, there is a growing demand for better techniques for
indexing and retrieving XML documents from the World Wide Web (Athena, Barbara, Anna, 2004).
XML provides a way to express the structure of a document. XML documents are content oriented
and carry markup elements (SungRan, 1997). More precisely, an XML document can be defined as a
document D that has structure information, content information, and link information, where the
content is a collection of words and the structure includes elements and attributes. Elements
contain the attributes. Elements alone are not sufficient to represent the structure of XML
documents: when only elements are used, the structure becomes static and difficult to extend
(Vakali, 2001). We therefore need attributes together with elements to represent the structure of
XML documents, and an indexing methodology applied to documents structured in this way supports
better retrieval of information from them. A hierarchical indexing methodology is used for such
collections of documents because XML documents have a tree-based structure with semantic
relationships among the elements, and these relationships make hierarchical indexing more
effective. Retrieval efficiency depends on the indexing methodology, yet there is still room to
improve the accuracy of retrieval (Vish, 2002). We propose the following way to increase retrieval
efficiency: if attributes and elements are arranged in separate tables, the time needed to reach an
attribute is reduced, because attributes are otherwise indexed after the elements. When only
attributes are needed, the attribute table is accessed directly, and when elements are required,
only elements are retrieved from the separately arranged tables. In addition, term frequency and
inverse term frequency are used to rank the documents by weight, which contributes to efficient
retrieval of the documents.

Related Work
Work on indexing and retrieval is very active but not yet standardized. Bit Cube indexing
(Kevin, 1998), a three-dimensional indexing scheme for XML documents, indexes them on the basis of
a three-dimensional matrix of triples (d, p, w): a document d contains XML elements p, where each p
has zero or more terms w bound to it. Typical indexing requires a frequency table, a two-dimensional
matrix that records the number of occurrences of each term in the documents. A large number of
indexing strategies have been proposed for XML documents (Shaorong, Qinghua, Wesley, 1999). One of
the most common is to map an XML structure to a relational database schema. This can be done by
mapping the root element of an XML document to the name of a table in an RDBMS (Relational Database
Management System). The child elements of the document are then mapped to attributes of the entity
modeled by the table. However, this approach works only for the less complicated documents that one
might find in a data-centric application. A dynamic indexing algorithm (Seung, Yoonl, 1997)
efficiently supports the indexing of new documents into an already created index structure. The
index data computed for the new set of documents is merged incrementally with the ordered documents
of the existing index structure, creating a new structure that reflects a unified indexing
organization of the entire document collection (old and new). Quasi-Bit Cube, a memory-efficient
indexing scheme extended from Bit Cube, has also been proposed. Since the information is stored in
the form of bits, the entire index structure fits into main memory, and hence I/O operations are no
longer a concern during information retrieval. I/O does, however, play a dominant role during index
creation or modification. Results show that Quasi-Bit Cube (Seung, Yoonl, 1997) manages memory much
more effectively while retaining the query processing efficiency of Bit Cube. The execution time of
Quasi-Bit Cube for different query operations is reported to be much lower than that of XQEngine.
The dynamic indexing algorithm supports incremental addition of new XML documents to an existing
index structure without requiring the entire collection to be re-indexed. Experiments show that the
dynamic indexing scheme provides better update and search costs than the traditional scheme, with
acceptable space overheads. As the incremental batch size increases, the dynamic indexing algorithm
outperforms not only the traditional scheme but also XQEngine (Venish, 1999). There is a growing
demand for XML in areas relating to XLinks[18], XPointers (Watson, 2001) and security. As part of
future work, a plan to extend the index structure to meet these growing demands has been prepared.

18th National Computer Conference 2006, © Saudi Computer Society

Proposed Methodology
There are many studies on retrieval methods for structured documents such as SGML (Standard
Generalized Markup Language) and XML (eXtensible Markup Language) (Athena, 2004), but most of them
address documents whose structure information is expressed by elements (Shah, 1996). The reason is
that the purpose of elements is to express document structure. However, if the document structure
is expressed using elements only, the structure becomes static and hard to extend. An alternative
that avoids this problem is to represent structure information using attributes. Elements can
represent the basic structure of the document, while attributes represent the detailed or
user-defined structure. For indexing and retrieval of elements, a method that converts attributes to
elements is required; this also allows existing systems to be reused. In addition, a method that
processes attribute information without changing it to elements is required to preserve the
original shape of the documents. We now describe our indexing mechanism for element retrieval. XML
documents are hierarchical, and this hierarchy gives rise to semantic relationships among the
elements. These relationships make retrieval easier and more precise.
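The attribute-to-element conversion mentioned above can be sketched as follows. This is only an illustration under our own assumptions, not a prescribed procedure: the function name `attributes_to_elements` is hypothetical, and we assume each attribute becomes a child element whose tag is the attribute name and whose text is the attribute value.

```python
import xml.etree.ElementTree as ET


def attributes_to_elements(element: ET.Element) -> ET.Element:
    """Return a copy of `element` in which every attribute is a child element.

    Assumption: the new child is tagged with the attribute name and holds the
    attribute value as text. The input tree is left unchanged.
    """
    converted = ET.Element(element.tag)
    converted.text = element.text
    converted.tail = element.tail
    # Attributes become the leading children of the converted element.
    for name, value in element.attrib.items():
        child = ET.SubElement(converted, name)
        child.text = value
    # The original children follow, converted recursively.
    for child in element:
        converted.append(attributes_to_elements(child))
    return converted


source = ET.fromstring(
    '<section type="overview"><para lang="en">XML indexing</para></section>'
)
result = attributes_to_elements(source)
print(ET.tostring(result, encoding="unicode"))
```

Because the function builds a new tree, the original document keeps its attributes, which is in line with the requirement to preserve the original shape of the documents.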

Indexing for the structured document


After the document’s attributes are converted to elements, each document is indexed hierarchically.
Index terms are distributed across all nodes in the document tree. The basic idea for assigning an
index term to an element node is that the term should characterize the concept of this element and
differentiate it from others. Thus, a rule of thumb for selecting good index terms is that if a
term appears frequently and is distributed evenly in the text of an element, and its rank is high
compared to its peer terms, it should be considered a representative term for this element. We
approach the goal of flexible element retrieval with a hierarchical indexing mechanism that indexes
not only the leaf nodes but also intermediate nodes, i.e. section and document nodes. Basically, we
use a propagation and pruning mechanism to select index terms. From the bottom up, terms that can
“exactly” describe the inherent concept of an element are propagated to it, while terms with too
broad or too narrow meanings are pruned. Index pruning ensures that an index term appearing in an
element does not appear in any of its descendant elements, so content overlap in the text is avoided
in the index. This saves storage space and retrieval time. Moreover, this hierarchical indexing
mechanism produces an index with the same structure as the document.
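The pruning property described above (no index term is repeated along a branch) can be illustrated with a small sketch. The `IndexNode` class and the `prune` function are hypothetical names introduced only for this illustration, and the sketch assumes that the candidate terms of every node have already been chosen.

```python
from dataclasses import dataclass, field


@dataclass
class IndexNode:
    """One node of the index tree; it mirrors one element of the document."""
    name: str
    terms: set[str] = field(default_factory=set)
    children: list["IndexNode"] = field(default_factory=list)


def prune(node: IndexNode, inherited: frozenset[str] = frozenset()) -> None:
    """Remove from each node any term already indexed at one of its ancestors."""
    node.terms -= inherited
    below = inherited | node.terms
    for child in node.children:
        prune(child, below)


doc = IndexNode("document", {"xml"}, [
    IndexNode("section", {"xml", "index"}, [
        IndexNode("paragraph", {"index", "entropy"}),
    ]),
])
prune(doc)
```

After pruning, each term is stored once per branch, at the highest element that carries it, so the index tree keeps the shape of the document tree without overlapping entries.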

Describing a Structured Document


To describe structured documents, we first look at how they are composed and at how each component
can be expressed. A structured document is composed of three parts: the first expresses the actual
content, the second expresses structure information, and the third expresses link information.
Content is a set of words, and structure information is made up of elements and a set of
structure-attributes. This is defined formally below. Link information is not considered in detail
in the definition.

Definition: The ith structured document $D_i$ is composed of content information $C_i$, structure
information $S_i$, and link information $L_i$. $C$ is a set of words. $S$ is the union of elements
($E$) and structure-attributes ($A^s$). Attribute information $A$ is composed of
structure-attributes ($A^s$) and general-attributes ($A^g$). A structured document is therefore
defined as follows:

$$D = (C, S, L)$$
$$C = \{w_1, w_2, \ldots, w_n\}$$
$$S = \bigcup (E_j, A^s_l), \quad l < j$$
$$A = \{A^s, A^g\}$$
To describe structure information effectively, $S_i$ must be described effectively. In existing
systems, structure information is considered a set of elements, and attribute information is
processed separately. In the definition given here, by contrast, structure information is defined
as a set of elements ($E$) and structure-attributes ($A^s$). To retrieve structured documents
effectively, an index is needed that describes the content, structure, and link information that
make up a structured document. Therefore, in this paper, an index that effectively supports
structure retrieval is designed. In particular, the structure index can describe not only elements
but also structure-attributes.
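As a rough illustration of the definition, the triple D = (C, S, L) can be written down as a small data model. The class names `Element` and `StructuredDocument`, the helper `structure_items`, and the way attributes are split into structure-attributes and general-attributes are our own assumptions for this sketch; the paper does not fix a concrete representation.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Element:
    name: str
    structure_attributes: dict[str, str] = field(default_factory=dict)  # A^s
    general_attributes: dict[str, str] = field(default_factory=dict)    # A^g
    children: list["Element"] = field(default_factory=list)


@dataclass
class StructuredDocument:
    content: set[str]                               # C: set of words
    structure: Element                              # S: elements plus structure-attributes
    links: list[str] = field(default_factory=list)  # L: not modelled in detail


def structure_items(element: Element) -> Iterator[tuple[str, Optional[str]]]:
    """Yield (element name, structure-attribute name); None marks the bare element."""
    yield (element.name, None)
    for attribute in element.structure_attributes:
        yield (element.name, attribute)
    for child in element.children:
        yield from structure_items(child)


root = Element("article", children=[
    Element("section",
            structure_attributes={"type": "chapter"},
            general_attributes={"lang": "en"}),
])
doc = StructuredDocument(content={"xml", "indexing"}, structure=root)
items = list(structure_items(doc.structure))
```

Only structure-attributes appear in `items`; general-attributes stay attached to their element but do not take part in the structure information.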

Describing Content Information


An index that represents the content of a structured document is required for content retrieval; it
is used to retrieve a document or an element that contains a certain word. It uses an inverted file
structure. It does not index the entire structure but performs indexing only at the lowest level of
the structure, which contains the actual content.
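A content index of this kind can be sketched as an inverted file built only over the leaf elements. The function name `build_inverted_index`, the tokenization into lower-cased word characters, and the path notation used as element identifier are assumptions made for this illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_inverted_index(root: ET.Element) -> dict[str, set[str]]:
    """Map each word to the paths of the leaf elements whose text contains it."""
    index: dict[str, set[str]] = defaultdict(set)

    def visit(element: ET.Element, path: str) -> None:
        children = list(element)
        if not children:
            # Leaf element: this is where the actual content is indexed.
            for word in re.findall(r"\w+", (element.text or "").lower()):
                index[word].add(path)
            return
        seen: dict[str, int] = defaultdict(int)
        for child in children:
            seen[child.tag] += 1
            visit(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    visit(root, f"/{root.tag}")
    return dict(index)


doc = ET.fromstring(
    "<article><section>"
    "<para>XML indexing</para><para>tree indexing</para>"
    "</section></article>"
)
index = build_inverted_index(doc)
```

Looking up `index["indexing"]` returns the paths of both paragraphs, while intermediate elements such as the section never appear as postings.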

Term Weighting for Elements


Because the document has a hierarchical structure, the distribution of a term in an element can be
measured by examining the term’s appearances in the descendant elements of that element. We
consider only the immediate descendant elements, because we believe that the topic of an element is
best supported by the elements that directly comprise it. If a term is distributed evenly across a
composite element’s immediate descendant elements, it is a good candidate index term for that
element. We introduce entropy as a criterion to measure the distribution of a term in an element.
For an element other than a paragraph (leaf node), we combine the term’s frequency in the element
with its distribution across the immediate descendant elements to weight the term.
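The exact weighting formula is not given here, so the following sketch shows only one plausible way to combine the two factors. It assumes that the weight is the term frequency in the element multiplied by the normalized entropy of the term's occurrences across the immediate descendant elements; the function names `evenness` and `term_weight` are hypothetical.

```python
import math


def evenness(child_counts: list[int]) -> float:
    """Normalized entropy (0..1) of a term's occurrences across immediate children.

    Assumption: a composite element with a single child counts as perfectly even.
    """
    total = sum(child_counts)
    if total == 0:
        return 0.0
    if len(child_counts) < 2:
        return 1.0
    entropy = sum(
        (count / total) * math.log(total / count)
        for count in child_counts
        if count > 0
    )
    return entropy / math.log(len(child_counts))


def term_weight(child_counts: list[int]) -> float:
    """Assumed weight: frequency in the element times evenness across children."""
    return sum(child_counts) * evenness(child_counts)


even = term_weight([2, 2, 2])     # same frequency, spread over all children
skewed = term_weight([5, 1, 0])   # same frequency, concentrated in one child
single = term_weight([6, 0, 0])   # all occurrences in one child
```

With equal frequency, the evenly spread term receives the highest weight and the term confined to one child receives a weight of zero, which matches the rule that a good index term for a composite element is both frequent and evenly distributed.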

Selection for Index Term


A term whose weight in an element is relatively high should be selected as an index term for that
element. Specifically, the propagation and pruning process realizes the selection of index terms.
The previous section gives the weight of each term in an arbitrary element. A term is propagated to
an upper element if its weight exceeds a certain threshold, and at the same time it is pruned from
the descendant elements, since it may stand for a more general concept. This process is applied
recursively from the bottom up until all nodes in the tree are assigned proper index terms, without
duplication within the same branch of the index tree. The threshold controlling term selection
should be adjusted dynamically according to the statistics of all the term weights in a specific
element. More precisely, a term is chosen as an index term for an element if and only if its weight
exceeds the average plus the standard deviation of all term weights in that element.
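The selection rule stated above can be sketched directly. The function name `select_index_terms` and the example weights are hypothetical, and the sketch assumes the population standard deviation, since the text does not say whether the population or the sample form is meant.

```python
from statistics import mean, pstdev


def select_index_terms(weights: dict[str, float]) -> set[str]:
    """Return the terms whose weight exceeds mean + standard deviation.

    Assumption: the population standard deviation (pstdev) is used.
    """
    if not weights:
        return set()
    values = list(weights.values())
    threshold = mean(values) + pstdev(values)
    return {term for term, weight in weights.items() if weight > threshold}


# Hypothetical term weights for a single element.
weights = {"xml": 9.0, "index": 2.0, "tree": 1.0, "data": 2.0, "the": 1.0}
selected = select_index_terms(weights)
```

Because the threshold is computed from the weights of the element itself, it adapts to each element: when all terms have the same weight, the standard deviation is zero and no term exceeds the threshold.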

Retrieval Methodology
According to our hierarchical indexing mechanism, an element does not share any index terms with
its ancestors. Thus an element is completely represented by the index terms of all the elements
along its path. Conversely, a path can be identified by the element at its lowest level. Therefore,
the element-ranking problem can be transformed into a path-ranking problem, that is, finding the
element paths with high relevance to the query. In this paper, an element-based hierarchical tree
structure is created, and structure-attribute information is then inserted into the tree, so that
both element-based and attribute-based structure information can be described effectively. To
improve retrieval performance, only element-based structure information is retrieved when
attribute-based information is not needed. For this purpose, an attribute table and an element
table are built separately and then linked with the element-based hierarchical tree. When a query
asks only for attributes, the quality of retrieval can otherwise drop substantially, because an
attribute can be retrieved only after the element that carries it has been found by visiting every
structure index. This shortcoming is overcome by providing an index table. In the attribute table,
the name of the attribute points to an entry in the element table, so that direct access to an
attribute is possible without visiting the entire structure index.
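The separate element and attribute tables can be sketched as two dictionaries. The function names `build_tables` and `find_by_attribute`, the use of integer element ids, and the path strings are assumptions introduced for this illustration only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_tables(root: ET.Element):
    """Build an element table and an attribute table in one pass over the tree."""
    element_table: dict[int, tuple[str, ET.Element]] = {}   # id -> (path, element)
    attribute_table: dict[str, list[int]] = defaultdict(list)  # name -> element ids

    def visit(element: ET.Element, path: str) -> None:
        element_id = len(element_table)
        element_table[element_id] = (path, element)
        for name in element.attrib:
            attribute_table[name].append(element_id)
        for child in element:
            visit(child, f"{path}/{child.tag}")

    visit(root, f"/{root.tag}")
    return element_table, dict(attribute_table)


def find_by_attribute(name, element_table, attribute_table):
    """Direct lookup by attribute name; the tree itself is not traversed."""
    return [
        (element_table[i][0], element_table[i][1].get(name))
        for i in attribute_table.get(name, [])
    ]


doc = ET.fromstring(
    '<article id="a1"><section type="intro"><para/></section>'
    '<section type="method"/></article>'
)
elements, attributes = build_tables(doc)
hits = find_by_attribute("type", elements, attributes)
```

An attribute-only query reads one entry of the attribute table and follows its ids into the element table, whereas an element-only query can ignore the attribute table entirely.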
Conclusion and Future Work
There have been many studies on element retrieval from structured documents, and methodologies for
indexing and retrieving information from XML documents have been developed. Our methodology for
indexing and retrieving information from XML documents is based on attributes as well as elements.
The proposed indexing method combines structure-attributes and elements to represent structure
information. This makes the retrieval of information from XML documents more accurate and precise,
because elements can be used for the basic structure while attributes are used for the detailed or
user-defined structure. Therefore, in the indexing and retrieval of structured documents such as XML
and SGML, using attributes as well as elements can increase retrieval performance.
References
Athena V., Barbara C., and Anna M. (2004), “XML document indexes”, Available on http://www.gorgia.com
A. Vish (2002), “Extensible Markup Language”, Available on http://www.w3c.org/XML/
A. Shanii (2001), “A structured documents retrieval method supporting Attribute-Structure Information”, Available on http://www.portal.acm.org
BitCube (2004), “A Three-Dimensional Bitmap Indexing for XML Documents”, Available on http://www.eceb.gmu.edu
Haixun W. (2001), “A dynamic index method for querying XML data by tree structures”, IBM Thomas J. Watson Research Center Hawthorne, Vol 12 No 10, pp 12-16.
Jaap, Kamps, Maarten and Marx (2000), “XML retrieval: what to retrieve?”, Language & Inference Technology Group, ILLC, University of Amsterdam, Available on http://www.lit.science.uva.nl/
Kevin R. (1998), “Indexing XML documents: A hybrid approach”, International Journal Of Software Engineering, Vol 11 No 8, Dec, pp 40-46.
Shah, B. (1996), “Efficient dynamic indexing and retrieval of XML documents Using three dimensional quasi-bit cube”, Journal on Information Technology, Vol 10 No 5, October, pp 20-50.
SungRan Cho (2000), “A meta-data indexing system for XML”, L3S, University Of Hannover, Vol 35, pp 1621-1628.
Seung K., Yoonl C. (1997), “A structured documents retrieval method supporting attribute-based structure information”, Journal of documents, Vol 2, pp 12-23.
Shaorong L., Qinghua Z., and Wesley W. Chu (1999), “Hierarchical indexing and flexible element retrieval for structured document”, Available on http://www.w3c.org
Tim B.L. (2001), “extensible markup language”, Available on www.w3c.org
T. Bray, J. Paoli, C.M. Sperberg-McQueen, E. Maler (2000), “Extensible markup language (XML)”, 2nd edition, Available on http://www.w3.org/TR/REC-xml
Uer W. (2002), “Xenogeny”, Available on http://xqengine.sourceforge.net/
Vakali B., Catania and Anna M. (2001), “XML document indexes”, Journal of Information Science, Vol 14 No 3, Jan, pp 1301-1323.
Vist A. (2003), “A dynamic index method for querying XML data by tree structures”, Available on http://www.w3.org/TR/ttr-xml
Venish A. (1999), “XML linking language”, Available on http://www.w3.org/TR/xlink/
Woosaeng K. (1995), “University extensible index technique for Storing and retrieving XML documents”, Available on http://www.w3c.org
Watson B. (2001), “Text and office system standard generalized markup language (SGML)”, Information Processing, Vol 2 No 34, pp 1234-1260.
