
aXML-Motor

XML Document Parsing


version 2011.11.04

Algorithm

Abhishek Kumar ~=ABK=~
http://github.com/abhishekkr
http://www.twitter.com/abionic

Algorithm, Ruby Source and Gem:
[axml-motor] @GitHub: http://github.com/abhishekkr/axml-motor.git
rubygem's src @GitHub: http://github.com/abhishekkr/rubygem_xml_motor.git
gem install @RubyGems: http://rubygems.org/gems/xml-motor

Algorithm-Walk-through
Example XML Content:
<BODY>
  <DIV id='banner'>
    <H1>aXML-Motor</H1>
    <H5>A new algorithm based compact XML Parser with <I>no dependencies</I>.</H5>
  </DIV>
  <DIV id='details'>
    <SPAN class='github'>@github: <A href='http://github.com/abhishekkr/axml-motor.git'>axml-motor</A></SPAN>
    <DIV class='gem'>
      <SPAN id='source' class='github'>@github: <A href='http://github.com/abhishekkr/rubygem-xml-motor.git'>rubygem-xml-motor</A></SPAN>
      <SPAN class='rubygems'>@rubygems: <A href='http://rubygems.org/gems/xml-motor.git'>xml-motor</A></SPAN>
    </DIV>
    <I>It's a new algorithm implemented to build a real compact parser (v0.0.2 has less than 200 ruby source code lines) without any dependencies.</I>
  </DIV>
</BODY>

[Step.1] Split the XML Content

(1.1) Split by '<' and store the pieces as XMLNodes


[0] BODY>
[1] DIV id='banner'>
[2] H1>aXML-Motor
[3] /H1>
[4] H5>A new algorithm based compact XML Parser with
[5] I>no dependencies
[6] /I>.
[7] /H5>
[8] /DIV>
[9] DIV id='details'>
[10] SPAN class='github'>@github:
[11] A href='http://github.com/abhishekkr/axml-motor.git'>axml-motor
[12] /A>
[13] /SPAN>
[14] DIV class='gem'>
[15] SPAN id='source' class='github'>@github:
[16] A href='http://github.com/abhishekkr/rubygem-xml-motor.git'>rubygem-xml-motor
[17] /A>
[18] /SPAN>
[19] SPAN class='rubygems'>@rubygems:
[20] A href='http://rubygems.org/gems/xml-motor.git'>xml-motor
[21] /A>
[22] /SPAN>
[23] /DIV>
[24] I>It's a new algorithm implemented to build a real compact parser (v0.0.2 has less than 200 ruby source code lines) without any dependencies.
[25] /I>
[26] /DIV>
[27] /BODY>
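Step 1.1 can be sketched in a few lines of Ruby. This is only an illustration of the idea, not the gem's actual source: the method name `split_on_open_bracket` and the shortened sample string are assumptions made for the example.

```ruby
# Sketch of step 1.1: split the raw XML on '<' so that every chunk
# begins with a tag (opening or closing) followed by its trailing text.
# The method name is hypothetical, chosen only for this illustration.
def split_on_open_bracket(xml)
  # Whatever precedes the first '<' is not a tag, so it is dropped.
  xml.split('<').drop(1)
end

xml = "<H5>A parser with <I>no dependencies</I>.</H5>"
chunks = split_on_open_bracket(xml)
chunks.each_with_index { |chunk, i| puts "[#{i}] #{chunk}" }
```

Note that this naive split assumes '<' never appears inside text or attribute values.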

(1.2) Split each result of step 1.1 by '>' and update XMLNodes

[0] ['BODY', '']
[1] ['DIV id=\'banner\'', '']
[2] ['H1', 'aXML-Motor']
[3] ['/H1', '']
[4] ['H5', 'A new algorithm based compact XML Parser with ']
[5] ['I', 'no dependencies']
[6] ['/I', '.']
[7] ['/H5', '']
[8] ['/DIV', '']
[9] ['DIV id=\'details\'', '']
[10] ['SPAN class=\'github\'', '@github: ']
[11] ['A href=\'http://github.com/abhishekkr/axml-motor.git\'', 'axml-motor']
[12] ['/A', '']
[13] ['/SPAN', '']
[14] ['DIV class=\'gem\'', '']
[15] ['SPAN id=\'source\' class=\'github\'', '@github: ']
[16] ['A href=\'http://github.com/abhishekkr/rubygem-xml-motor.git\'', 'rubygem-xml-motor']
[17] ['/A', '']
[18] ['/SPAN', '']
[19] ['SPAN class=\'rubygems\'', '@rubygems: ']
[20] ['A href=\'http://rubygems.org/gems/xml-motor.git\'', 'xml-motor']
[21] ['/A', '']
[22] ['/SPAN', '']
[23] ['/DIV', '']
[24] ['I', 'It\'s a new algorithm implemented to build a real compact parser (v0.0.2 has less than 200 ruby source code lines) without any dependencies.']
[25] ['/I', '']
[26] ['/DIV', '']
[27] ['/BODY', '']
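A minimal Ruby sketch of step 1.2 follows. The method name `split_on_close_bracket` is an assumption for illustration. The split is limited to two parts so that any later '>' stays inside the text part.

```ruby
# Sketch of step 1.2: turn each chunk into [raw_tag, text] by splitting
# on the first '>' only. The method name is hypothetical.
def split_on_close_bracket(chunks)
  chunks.map do |chunk|
    raw_tag, text = chunk.split('>', 2)
    # A chunk without '>' yields no text part; normalise it to ''.
    [raw_tag, text.to_s]
  end
end

nodes = split_on_close_bracket(["DIV id='banner'>", "H1>aXML-Motor", "/H1>"])
nodes.each_with_index { |node, i| puts "[#{i}] #{node.inspect}" }
```

This sketch assumes attribute values never contain '>'. A value that does would be cut at the wrong place.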

(1.3) Split the first element of each entry by space/tab and mark the 1st part as tag_name. Split the latter part by '=', iterating to make a key=>value pair per attribute, and update XMLNodes.

[0] [ ['BODY', {}], '' ]
[1] [ ['DIV', {'id'=>'banner'}], '' ]
[2] [ ['H1', {}], 'aXML-Motor' ]
[3] [ ['/H1', {}], '' ]
[4] [ ['H5', {}], 'A new algorithm based compact XML Parser with ' ]
[5] [ ['I', {}], 'no dependencies' ]
[6] [ ['/I', {}], '.' ]
[7] [ ['/H5', {}], '' ]
[8] [ ['/DIV', {}], '' ]
[9] [ ['DIV', {'id'=>'details'}], '' ]
[10] [ ['SPAN', {'class'=>'github'}], '@github: ' ]
[11] [ ['A', {'href'=>'http://github.com/abhishekkr/axml-motor.git'}], 'axml-motor' ]
[12] [ ['/A', {}], '' ]
[13] [ ['/SPAN', {}], '' ]
[14] [ ['DIV', {'class'=>'gem'}], '' ]
[15] [ ['SPAN', {'id'=>'source', 'class'=>'github'}], '@github: ' ]
[16] [ ['A', {'href'=>'http://github.com/abhishekkr/rubygem-xml-motor.git'}], 'rubygem-xml-motor' ]
[17] [ ['/A', {}], '' ]
[18] [ ['/SPAN', {}], '' ]
[19] [ ['SPAN', {'class'=>'rubygems'}], '@rubygems: ' ]
[20] [ ['A', {'href'=>'http://rubygems.org/gems/xml-motor.git'}], 'xml-motor' ]
[21] [ ['/A', {}], '' ]
[22] [ ['/SPAN', {}], '' ]
[23] [ ['/DIV', {}], '' ]
[24] [ ['I', {}], 'It\'s a new algorithm implemented to build a real compact parser (v0.0.2 has less than 200 ruby source code lines) without any dependencies.' ]
[25] [ ['/I', {}], '' ]
[26] [ ['/DIV', {}], '' ]
[27] [ ['/BODY', {}], '' ]
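Step 1.3 might look roughly like the Ruby sketch below. The method name `parse_tag` is an assumption. For robustness the sketch swaps the plain split by '=' for a regular-expression scan, so that quoted values containing spaces survive. That is a substitution made for this illustration, not necessarily what the gem does.

```ruby
# Sketch of step 1.3: split a raw tag into [tag_name, attributes].
# The method name is hypothetical. Assumption: attribute values are
# quoted with ' or " and do not contain their own quote character.
def parse_tag(raw_tag)
  # The first whitespace-separated token is the tag name.
  name, rest = raw_tag.strip.split(/\s+/, 2)
  attributes = {}
  # Each match yields the key, the quote character used, and the value.
  rest.to_s.scan(/([^\s=]+)\s*=\s*(['"])(.*?)\2/) do |key, _quote, value|
    attributes[key] = value
  end
  [name, attributes]
end

p parse_tag("SPAN id='source' class='github'")
p parse_tag("/H1")
```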

Here, we have the XMLNodes as we wanted them. Now it's time to index them.

[Step.2] Index the processed XMLNodes

There are three things involved in indexing XMLNodes:

Tag_Name: Iterating through all elements of XMLNodes, every element has three components (tag name, attributes and text); the Tag Name is available at XMLNodes.all[ [TAG_NAMES, *], *]
Depth: The place/level of the Node in the XML Node Tree, starting from '0'.
Index: The index value of the Node in the XMLNodes Array.

How to Index-ify?

There will be one element per Tag_Name, holding a Hash keyed by the 'Depth' at which the tag is found. Each value is an Array of 2*number_of_nodes entries: the starting and ending 'Index' of every Node with that Tag_Name at that Depth.

Example: From the XMLNodes above, ['DIV'] would hold {1=>[1,8, 9,26], 2=>[14,23]}, because 'Tag_Name' DIV has the 'Index' sets 1,8 and 9,26 at a 'Depth' of 1, and the 'Index' set 14,23 at a 'Depth' of 2.

The XMLTags calculated for the processed XMLNodes above are as follows:
['BODY'] = {0=>[0,27]}
['DIV'] = {1=>[1,8, 9,26], 2=>[14,23]}
['H1'] = {2=>[2,3]}
['H5'] = {2=>[4,7]}
['I'] = {3=>[5,6], 2=>[24,25]}
['SPAN'] = {2=>[10,13], 3=>[15,18, 19,22]}
['A'] = {3=>[11,12], 4=>[16,17, 20,21]}
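The indexing pass can be sketched in Ruby with a single depth counter. The method name `indexify` is an assumption for illustration. To keep the example short, the nodes below carry only the tag names of the example document, with empty attributes and text as stand-ins.

```ruby
# Sketch of step 2: build XMLTags as { tag_name => { depth => [indexes] } }.
# An opening tag records its index at the current depth and then goes one
# level deeper; a closing tag comes back up one level and records its index
# there. Assumption: every tag is explicitly closed (no self-closing tags).
def indexify(xml_nodes)
  xml_tags = Hash.new { |tags, name| tags[name] = Hash.new { |depths, d| depths[d] = [] } }
  depth = 0
  xml_nodes.each_with_index do |((name, _attrs), _text), index|
    if name.start_with?('/')
      depth -= 1
      xml_tags[name[1..-1]][depth] << index
    else
      xml_tags[name][depth] << index
      depth += 1
    end
  end
  xml_tags
end

# Tag names of the example document in XMLNodes order.
names = %w[BODY DIV H1 /H1 H5 I /I /H5 /DIV DIV SPAN A /A /SPAN
           DIV SPAN A /A /SPAN SPAN A /A /SPAN /DIV I /I /DIV /BODY]
xml_nodes = names.map { |name| [[name, {}], ''] }

indexify(xml_nodes).each { |tag, depths| puts "['#{tag}'] = #{depths}" }
```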

[Step.3] Grab My Node from processed XMLNodes using XMLTags

Now suppose I aim for a Tag_Name 'XYZ'. Look up XMLTags['XYZ'], iterate through all of its depths and extract 2 indexes at a time. Each pair of indexes marks the start and end node; fetch all values within those nodes from XMLNodes.

This will return the set of values held by Tag_Name 'XYZ'.

Suppose a tree form is provided as 'ABC.XYZ'. Then start from the top node, 'ABC' in this context, and grab all its Index-Ranges. Now move on to the lower node and keep only the Indexes that fall within the Index-Ranges provided by the earlier node. This ends with the filtered set of Indexes for 'XYZ' falling only under the Index-Range of 'ABC'.

To check for a Tag_Name with an attribute, for every filtered Index-Range, just check whether its starting node has the required attribute as one of its key-value pairs.
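The range filtering for a dotted path such as 'ABC.XYZ' can be sketched as follows. The method names `index_ranges` and `ranges_for_path` are assumptions for illustration. The `xml_tags` hash simply restates the XMLTags calculated in Step.2 for the example document.

```ruby
# XMLTags of the example document, as calculated in Step.2.
xml_tags = {
  'BODY' => { 0 => [0, 27] },
  'DIV'  => { 1 => [1, 8, 9, 26], 2 => [14, 23] },
  'H1'   => { 2 => [2, 3] },
  'H5'   => { 2 => [4, 7] },
  'I'    => { 3 => [5, 6], 2 => [24, 25] },
  'SPAN' => { 2 => [10, 13], 3 => [15, 18, 19, 22] },
  'A'    => { 3 => [11, 12], 4 => [16, 17, 20, 21] }
}

# All [start, end] pairs of a tag, across every depth (two indexes at a time).
def index_ranges(xml_tags, tag_name)
  xml_tags.fetch(tag_name, {}).values.flat_map { |indexes| indexes.each_slice(2).to_a }
end

# Walk the path left to right, keeping only the ranges that lie strictly
# inside some range of the previous (outer) tag.
def ranges_for_path(xml_tags, path)
  path.split('.').reduce(nil) do |outer, tag_name|
    ranges = index_ranges(xml_tags, tag_name)
    next ranges if outer.nil?
    ranges.select do |first, last|
      outer.any? { |outer_first, outer_last| outer_first < first && last < outer_last }
    end
  end
end

p ranges_for_path(xml_tags, 'H5.I')
p ranges_for_path(xml_tags, 'DIV.A')
```

Because only containment is checked, the path does not need to list every intermediate tag, only enough outer tags to narrow the scope.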

Example:

Case: Grabbing 'SPAN' with attribute class='github'
It's a single Tag_Name, so grab all its Index-Ranges: (10,13), (15,18) and (19,22).
Here, only XMLNodes[10] and XMLNodes[15] have the required attribute.
Now, grab all data from XMLNodes[10][1] to XMLNodes[13-1][1] and from XMLNodes[15][1] to XMLNodes[18-1][1].
Result: ['@github: <A href='http://github.com/abhishekkr/axml-motor.git'>axml-motor</A>',
'@github: <A href='http://github.com/abhishekkr/rubygem-xml-motor.git'>rubygem-xml-motor</A>']
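The 'SPAN' case can be reproduced with the Ruby sketch below. The method names `ranges_with_attribute`, `rebuild_tag` and `content_between` are assumptions for illustration, and rebuilding inner tags with single-quoted attributes is a guess at how the result above is assembled. Only the slice of XMLNodes needed here is listed, keyed by original index so the numbers match the walk-through.

```ruby
# Slice of the example XMLNodes (a Hash stands in for the full Array).
nodes = {
  10 => [['SPAN', { 'class' => 'github' }], '@github: '],
  11 => [['A', { 'href' => 'http://github.com/abhishekkr/axml-motor.git' }], 'axml-motor'],
  12 => [['/A', {}], ''],
  13 => [['/SPAN', {}], ''],
  15 => [['SPAN', { 'id' => 'source', 'class' => 'github' }], '@github: '],
  16 => [['A', { 'href' => 'http://github.com/abhishekkr/rubygem-xml-motor.git' }], 'rubygem-xml-motor'],
  17 => [['/A', {}], ''],
  18 => [['/SPAN', {}], ''],
  19 => [['SPAN', { 'class' => 'rubygems' }], '@rubygems: '],
  20 => [['A', { 'href' => 'http://rubygems.org/gems/xml-motor.git' }], 'xml-motor'],
  21 => [['/A', {}], ''],
  22 => [['/SPAN', {}], '']
}
span_ranges = [[10, 13], [15, 18], [19, 22]]

# Keep the ranges whose starting node carries the attribute key => value.
def ranges_with_attribute(nodes, ranges, key, value)
  ranges.select { |first, _last| nodes[first][0][1][key] == value }
end

# Re-create the markup of an inner tag from its name and attributes.
def rebuild_tag(name, attributes)
  "<#{name}#{attributes.map { |key, value| " #{key}='#{value}'" }.join}>"
end

# Text of the starting node plus every inner node (tag and text) before the end.
def content_between(nodes, first, last)
  inner = ((first + 1)...last).map do |i|
    (name, attributes), text = nodes[i]
    rebuild_tag(name, attributes) + text
  end
  nodes[first][1] + inner.join
end

matching = ranges_with_attribute(nodes, span_ranges, 'class', 'github')
puts matching.map { |first, last| content_between(nodes, first, last) }
```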

Case: Grabbing 'H5.I'
Top node is 'H5', grab all its Index-Ranges: (4,7).
Second node is 'I', grab all its Index-Ranges falling within the ranges from the previous node: (5,6).
Now, grab all data from XMLNodes[5][1] to XMLNodes[6-1][1].
Result: ['no dependencies']

Below, you'll also see that you need not give the entire hierarchy to fetch a descendant from the child tree of any node. Giving just the major scope nodes works as well as providing the exact hierarchy.

Case: Grabbing 'DIV.A'
Top node is 'DIV', grab all its Index-Ranges: (1,8), (9,26) and (14,23).
Second node is 'A', grab all its Index-Ranges falling within the ranges from the previous node: (11,12), (16,17) and (20,21).
Now, grab all data from XMLNodes[11][1] to XMLNodes[12-1][1], from XMLNodes[16][1] to XMLNodes[17-1][1] and from XMLNodes[20][1] to XMLNodes[21-1][1].
Result: ['axml-motor', 'rubygem-xml-motor', 'xml-motor']
