Using XML
WHITE PAPER
Author:
Arvind Kumar
Senior ETL Architect
Oracle, Informatica Certified
arvind@curosys.com
V Keshav
Senior Informatica Developer
keshav@curosys.com
Informatica – Handling Variable Length Files
Abstract
Informatica is the leading provider of data integration software. Informatica PowerCenter,
based on the Universal Data Services (UDS) architecture, is the foremost adaptive software for
integrating immediate, accurate, and understandable enterprise data. PowerCenter provides
improved data integrity and greater visibility of enterprise data and processes. This paper
attempts to provide a solution to the limitations of Informatica PowerCenter in dealing with
variable-length delimited flat files.
Intended Audience
This paper is intended for readers who have a business need to process variable-length flat files
within Informatica PowerCenter. The reader is expected to have a fair knowledge of XML
and an understanding of Informatica's Midstream XML transformations, introduced in
version 7.1. Although it is not mandatory, a good understanding of any programming language
that supports data structures is recommended before going through this paper.
Table of Contents
INTRODUCTION
GETTING STARTED
  The Approach
  The Business Case
THE XML GENERATION
  The Vertical File
  Assumptions
  The XML Hierarchy
  Code – Tag Resolver
  Groups & Levels
  All that’s needed…
  Immediate Parents
  Handling the Hierarchies
  Already done…?
  Generating the XML Output – Bringing them all together
APPENDIX A
  Splitter Transformation
APPENDIX B
  Implementation
APPENDIX C
  Generating the XML output – Code Sample
INTRODUCTION
Informatica PowerCenter can handle two kinds of flat files: fixed width and delimited. In a
fixed-width file, the data of each column starts at a predefined position, so the width of every
column is known beforehand. When a column has no data, it is padded with spaces to fill the
column width, irrespective of whether the data type is number or string. By their very nature,
all the records in a fixed-width file have the same length. In a delimited file, on the other
hand, the data of each column is separated by a delimiter (separator), most often a comma. When
a column has no data, the next delimiter follows immediately. Informatica can handle delimited
files in which the number of columns is fixed. Sometimes, however, the business demands the use
of variable-length delimited files, which Informatica does not support. This paper attempts to
provide an alternative by converting such a file into XML before any business rules are applied.
Please note that this paper does not present the actual code but uses pseudocode. The reader is
expected to adapt it appropriately for Informatica.
GETTING STARTED
The Approach
This section walks you through the suggested approach for handling variable-length delimited
flat files. To start with, convert the input delimited file into a vertical file; that is,
replace every delimiter with a newline so that each element (the data between two delimiters) is
on a line of its own. Now read this vertical file as the input. As usual, Informatica treats each
line as a row, so each element now becomes a row. Convert each incoming element to an XML tag
until you reach the start of the next actual record. Concatenate all this XML content and pass it
to the XML Parser, which then populates the corresponding data in each port. You can now
apply the business rules to the data coming out of the parser.
The following flowchart visualizes the process described above.
[Flowchart: for each element in the vertical file -> convert the element into an XML tag ->
generate the appropriate XML hierarchy]
Example – Input
1. [SOF]
2. ELEMENT01_00897000006~ELEMENT02_089~ELEMENT03_0001~ELEMENT04_20040809~ELEMENT05_0001~ELEMENT06_0001~ELEMENT07_002~ELEMENT08_SP~ELEMENT09_John Anderton~ELEMENT08_AB~ELEMENT09_Mike~[EOR]
3. [EOF]
In the example above, the first line is an indicator marking the Start-Of-File and the
third line is the End-Of-File indicator. Actual records always lie between these
indicators. Hence, in the example above, the second line consists of a sample input
record.
The first 5 characters in the second line are the length of the line (and hence of the
record). Then the first element (column) starts. Its name is ELEMENT01_ (the first ten
characters), and whatever follows is the element data, 00897000006 in this case. As
explained above, some elements may repeat more than once in a record, which makes the
record variable in length. The element ELEMENT07 has a value of 2, which means that
ELEMENT08 and ELEMENT09 together repeat twice, as can be seen in the data above.
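The element layout described above can be illustrated with a minimal Python sketch. It is not the paper's Informatica logic; the function name parse_record is hypothetical, and the sketch assumes that any length prefix has already been removed, since the sample record shown starts directly with ELEMENT01_.

```python
ELEMENT_NAME_WIDTH = 10  # "ELEMENT01_" occupies the first ten characters


def parse_record(record, delimiter="~", end_marker="[EOR]"):
    """Split one record into (element name, element data) pairs."""
    pairs = []
    for token in record.split(delimiter):
        if not token or token == end_marker:
            continue  # ignore the end-of-record marker and empty tokens
        name = token[:ELEMENT_NAME_WIDTH].rstrip("_")
        data = token[ELEMENT_NAME_WIDTH:]
        pairs.append((name, data))
    return pairs


record = ("ELEMENT01_00897000006~ELEMENT02_089~ELEMENT03_0001~"
          "ELEMENT04_20040809~ELEMENT05_0001~ELEMENT06_0001~ELEMENT07_002~"
          "ELEMENT08_SP~ELEMENT09_John Anderton~ELEMENT08_AB~ELEMENT09_Mike~[EOR]")

pairs = parse_record(record)
# ELEMENT07 carries the number of ELEMENT08/ELEMENT09 repetitions.
repeat_count = int(dict(pairs)["ELEMENT07"])
```

Here repeat_count matches the number of ELEMENT08/ELEMENT09 pairs that follow in the record, which is what makes the record length variable.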
This data is then converted to XML. The output for the given sample record is shown
below:
Example – Output
<ROOT>
  <DATA_TOPIC1>
    <ELEMENT01>00897000006</ELEMENT01>
    <ELEMENT02>089</ELEMENT02>
    <ELEMENT03>0001</ELEMENT03>
    <ELEMENT04>20040809</ELEMENT04>
  </DATA_TOPIC1>
  <DATA_TOPIC2>
    <AGGREGATE1>
      <AGGREGATE2>
        <ELEMENT08>SP</ELEMENT08>
        <ELEMENT09>John Anderton</ELEMENT09>
      </AGGREGATE2>
      <AGGREGATE2>
        <ELEMENT08>AB</ELEMENT08>
        <ELEMENT09>Mike</ELEMENT09>
      </AGGREGATE2>
    </AGGREGATE1>
  </DATA_TOPIC2>
</ROOT>
__________________________________________________________
<ROOT><DATA_TOPIC1><ELEMENT01>00897000006</ELEMENT01><ELEMENT02>089</ELEMENT02><ELEMENT03>0001</ELEMENT03><ELEMENT04>20040809</ELEMENT04></DATA_TOPIC1><DATA_TOPIC2><AGGREGATE1><AGGREGATE2><ELEMENT08>SP</ELEMENT08><ELEMENT09>John Anderton</ELEMENT09></AGGREGATE2><AGGREGATE2><ELEMENT08>AB</ELEMENT08><ELEMENT09>Mike</ELEMENT09></AGGREGATE2></AGGREGATE1></DATA_TOPIC2></ROOT>
The first section of the above example shows the XML hierarchy that needs to be
generated, whereas the second shows how the content has to be passed to the
XML Parser: the whole XML goes to the parser as a single line.
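As an illustration only, the single-line form can be reproduced outside PowerCenter with Python's standard xml.etree.ElementTree module. In the paper the XML is assembled by expression logic inside the mapping; this sketch merely shows that serializing the hierarchy without any pretty-printing yields one line.

```python
import xml.etree.ElementTree as ET

# Build the hierarchy of the sample output shown above.
root = ET.Element("ROOT")

topic1 = ET.SubElement(root, "DATA_TOPIC1")
for name, data in [("ELEMENT01", "00897000006"), ("ELEMENT02", "089"),
                   ("ELEMENT03", "0001"), ("ELEMENT04", "20040809")]:
    ET.SubElement(topic1, name).text = data

topic2 = ET.SubElement(root, "DATA_TOPIC2")
agg1 = ET.SubElement(topic2, "AGGREGATE1")
for code, person in [("SP", "John Anderton"), ("AB", "Mike")]:
    agg2 = ET.SubElement(agg1, "AGGREGATE2")  # one AGGREGATE2 per repetition
    ET.SubElement(agg2, "ELEMENT08").text = code
    ET.SubElement(agg2, "ELEMENT09").text = person

# Serialization adds no newlines or indentation, so the result is a single line.
single_line_xml = ET.tostring(root, encoding="unicode")
```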
1. Replace the delimiter with a newline character, land the file and read it again in
another mapping.
2. Use the Splitter Transformation *
______________________
* The Splitter Transformation is NOT included in the Informatica 7.1.1 installer. It needs to
be downloaded from the Informatica Developer Network. For further information, browse
http://devnet.informatica.com/
Note that the above mapping writes the same number of rows to the output file as it reads
from the input. However, each row that Informatica writes contains several lines (because the
expression introduces newline characters). When the next mapping reads this file, it reads each
line as a new row, because by default Informatica treats each line of a flat file as a row.
Splitter: Splits the incoming row into multiple rows based on the delimiter
specified
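The row-count behaviour described above can be seen in a small Python sketch. The function name to_vertical and the shortened sample records are hypothetical; the sketch only mimics the replace-and-re-read idea and is not the mapping itself.

```python
def to_vertical(rows, delimiter="~"):
    """Replace every delimiter with a newline; one output row per input row."""
    return [row.rstrip("\n").replace(delimiter, "\n") for row in rows]


# Shortened, made-up records in the same layout as the sample input.
source_rows = ["[SOF]", "ELEMENT01_A~ELEMENT02_B~ELEMENT03_C~[EOR]", "[EOF]"]

written_rows = to_vertical(source_rows)        # still three rows written
landed_file = "\n".join(written_rows) + "\n"   # the landed (vertical) file
reread_rows = landed_file.splitlines()         # the next mapping sees one row per line
```

The first mapping writes three rows, but the second mapping reads six, one for each element or indicator.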
Assumptions
Once the vertical file is ready, the next step is to generate the XML. Before moving on to
the XML generation, the reader needs to be acquainted with the logic behind the XML. This
section lists the assumptions made:
- The XML to be generated contains no more than 7 levels of hierarchy.
- The hierarchy and the XML Schema on which the generated XML is based are known beforehand.
- Every record has a static Record Start Indicator and Record End Indicator that help in
  generating the XML.
Root
  Ancestor
    Predecessor
      Forerunner
        Forebear
          Antecedent
            Precursor
Hierarchical Rules:
- There is one and only one Root per row.
- All the aggregates except Root are optional and may or may not appear in a row.
- Ancestor is the Data Topic level and hence cannot repeat. However, there can be any
  number of distinct data topics in a given row.
- Any number of Predecessors, Forerunners, Forebears, Antecedents and Precursors can
  exist in a given row, subject to the hierarchy given above.
- A data element can appear at any level other than Root and Ancestor; that is, data
  elements can appear at any level from Predecessor downwards.
______________________
Please note that all these rules apply to the case and scenario explained in this
paper. Actual rules may vary with the user's implementation. These rules are NOT
generic and might not apply to XML data in general.
Following table depicts the structure of the lookup file to be used for this purpose:
Lookup Column   Description
Element         Element name
Root            The Root Element of the XML Hierarchy for the given Element
Ancestor        The Data Topic Element of the XML Hierarchy for the given Element
Predecessor     Third level parent, if any
Forerunner      Fourth level parent, if any
Forebear        Fifth level parent, if any
Antecedent      Sixth level parent, if any
Precursor       Seventh level parent, if any
Example – Lookup
Element|Root|Ancestor|Predecessor|Forerunner|Forebear|Antecedent|Precursor
ELEMENT01|Root|Data_Topic1|Agg1|Agg2|Agg3|Agg4|Agg5
ELEMENT02|Root|Data_Topic1|Agg1|Agg2|Agg3||
ELEMENT03|Root|Data_Topic1|Agg1|Agg2|||
This lookup helps in fetching the XML hierarchy and makes the process more dynamic.
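To make the role of the lookup concrete, the following Python sketch loads the pipe-delimited example above and maps each element to its chain of parents. In PowerCenter this would be a Lookup transformation on the flat file; the function name load_hierarchy is hypothetical.

```python
import csv
import io

LEVELS = ["Root", "Ancestor", "Predecessor", "Forerunner",
          "Forebear", "Antecedent", "Precursor"]

LOOKUP_TEXT = """\
Element|Root|Ancestor|Predecessor|Forerunner|Forebear|Antecedent|Precursor
ELEMENT01|Root|Data_Topic1|Agg1|Agg2|Agg3|Agg4|Agg5
ELEMENT02|Root|Data_Topic1|Agg1|Agg2|Agg3||
ELEMENT03|Root|Data_Topic1|Agg1|Agg2|||
"""


def load_hierarchy(text):
    """Map each element name to its list of parents, outermost first."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    # Empty columns mean the element has no parent at that level.
    return {row["Element"]: [row[level] for level in LEVELS if row[level]]
            for row in reader}


hierarchy = load_hierarchy(LOOKUP_TEXT)
```

Because the hierarchy lives in data rather than in the mapping logic, a new element can be supported by adding a row to the lookup file.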
Example – Lookup
Element|Restart Group|Restart Levels|Skip
ELEMENT01|N|0|N
ELEMENT02|Y|1|N
ELEMENT03|N|0|Y
Mapping: Converts the delimited file to XML with the help of Code – Tag Resolver
and Groups & Levels
Immediate Parents
This section explains how to deal with the immediate parents of the XML elements. The
immediate parent information is required to determine whether to open, close or re-open a
specific XML element. Typical code to identify the immediate parent may be as follows:
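The paper's own fragment is not reproduced in this copy. As a stand-in, here is a minimal Python sketch that assumes the immediate parent of an element is the deepest populated level in its lookup row; both that assumption and the function name immediate_parent are illustrative, not taken from the original code.

```python
LEVELS = ["Root", "Ancestor", "Predecessor", "Forerunner",
          "Forebear", "Antecedent", "Precursor"]


def immediate_parent(lookup_row):
    """Return the deepest non-empty level of a lookup row, or None."""
    for level in reversed(LEVELS):  # walk from Precursor up to Root
        if lookup_row.get(level):
            return lookup_row[level]
    return None


# Lookup row for ELEMENT02 from the example lookup file.
row = {"Element": "ELEMENT02", "Root": "Root", "Ancestor": "Data_Topic1",
       "Predecessor": "Agg1", "Forerunner": "Agg2", "Forebear": "Agg3",
       "Antecedent": "", "Precursor": ""}

parent = immediate_parent(row)
```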
Similar code is applied to all the levels of the hierarchy (Root, Ancestor, …) in a bottom-up
approach to find out whether the corresponding closing tag can be generated.
Opening the hierarchies for the XML elements is equally simple, except that a top-down
approach is followed here. The following code fragment demonstrates the same:
. . . . . .
. . . . . .
Generate the concatenation of '<', Antecedent and '>' when any of the
following conditions is satisfied:
    - Current Antecedent and Previous Antecedent are NOT the same
    - Restart Group is 'Y'
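The condition in the pseudocode can be expressed as a small Python sketch for a single level. The function name opening_tag is hypothetical, and returning nothing when the level is empty is an added assumption. How the Restart Group flag interacts with the other levels is not shown in this fragment, so the sketch deliberately handles one level only.

```python
def opening_tag(current_name, previous_name, restart_group):
    """Return the opening tag for one hierarchy level, or '' if none is needed."""
    if not current_name:
        return ""  # assumption: the element has no parent at this level
    if current_name != previous_name or restart_group == "Y":
        return "<" + current_name + ">"
    return ""


# Antecedent level: unchanged, changed, and restarted (names are made up).
unchanged = opening_tag("Agg4", "Agg4", "N")
changed = opening_tag("Agg4", "Agg9", "N")
restarted = opening_tag("Agg4", "Agg4", "Y")
```

Applying the same check from Root downwards, one level at a time, gives the top-down order described above.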
Already done…?
During XML generation, there are scenarios where an aggregate repeats itself. In these
scenarios, it is not enough to generate the opening tag of the aggregate; the closing tag
of the previous occurrence of the aggregate must also be generated. To identify whether an
XML aggregate has already been processed, we use a combination of two ports: one to store
the names of all the aggregates that have been processed and another to flag whether the
current aggregate has already been processed. The following code fragment aims at
the same:
Port: v_Parsed_Tags – Stores the list of all the aggregates processed at least once
It is very important that the port v_Parsed_Tags is placed below the port containing the output
data. This ensures that an aggregate is marked as processed only after it has actually been
processed, not before.
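The ordering requirement can be illustrated with a simplified Python sketch, in which a set stands in for the v_Parsed_Tags port. The function name open_aggregate is hypothetical, and the sketch ignores nested children that may also need closing.

```python
def open_aggregate(name, parsed_tags):
    """Return the XML to emit when aggregate `name` starts."""
    already_done = name in parsed_tags  # flag: was this aggregate processed before?
    output = "<" + name + ">"
    if already_done:
        # Close the previous occurrence before opening the new one.
        output = "</" + name + ">" + output
    # Mark as processed only AFTER the output is built, mirroring the
    # placement of v_Parsed_Tags below the output port.
    parsed_tags.add(name)
    return output


parsed = set()
first = open_aggregate("AGGREGATE2", parsed)
second = open_aggregate("AGGREGATE2", parsed)
```

If the set were updated before the output was built, even the first occurrence would look like a repeat and a stray closing tag would be emitted.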
The first IF to process is that of the start element (ELEMENT01 in this case). If the input
is NOT ELEMENT01, the new content is concatenated to the existing output data; if it is
ELEMENT01, it is concatenated to NULL instead. This ensures that at the start of a record
(i.e. when ELEMENT01 appears) the output is reset to NULL. In all other cases, the partial
XML generated so far is preserved.
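A minimal Python sketch of this reset-or-append step follows. The function name accumulate and the placeholder fragments are hypothetical, and an empty string stands in for NULL.

```python
START_ELEMENT = "ELEMENT01"


def accumulate(previous_output, element_name, fragment):
    """Append the new fragment, resetting the buffer when a record starts."""
    carried = "" if element_name == START_ELEMENT else previous_output
    return carried + fragment


output = ""
history = []
# Placeholder fragments; the second ELEMENT01 marks the start of a new record.
for name, fragment in [("ELEMENT01", "<a>"), ("ELEMENT02", "<b>"), ("ELEMENT01", "<c>")]:
    output = accumulate(output, name, fragment)
    history.append(output)
```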
Five different cases of XML generation are identified for elements at the data topic level or
below, and the corresponding code is executed to ensure that each of these cases is handled.
These cases represent the different scenarios that determine the XML generation.
Case A:
Highlights:
- Closing Hierarchy is NULL
- Opening Hierarchy is NULL
- Current and Previous parents are exactly Same
- Restart Element is “Yes” for current element
Element Structure encountered:
Level_1
    Level_2                First Occurrence
        Element_1          Restart Element
        Element_2          Previous Element
    Level_2                Second Occurrence
        Element_1          Current Element – Restart Element
        Element_2          Element not yet encountered
Action:
Case B:
Highlights:
- Closing Hierarchy is NOT NULL
- Opening Hierarchy is NULL
- Current and Previous parents are different
- Restart Element is “Yes” for current element
Element Structure encountered:
Level_1
    Level_2                First Occurrence
        Element_1          First Occur. First Element (Restart Element)
        Level_3
            Element_2
            Element_3      Previous Element
    Level_2                Second Occurrence
        Element_1          Current Element – Restart Element
        Level_3
            Element_2
            Element_3      Previous Element
Action:
We need to close the hierarchy (up to the applicable level), close the immediate parent and
re-open it.
Case B.1:
Highlights:
- Closing Hierarchy is NOT NULL
- Opening Hierarchy is NULL
- Current and Previous parents are different
- Restart Element is “No” for current element
Element Structure encountered:
Level_1
    Level_2
        Element_1          First Occur. First Element (Restart Element)
        Level_3
            Element_2
            Element_3      Previous Element
Action:
We need to close the hierarchy (up to the applicable level).
Case C:
Highlights:
- Closing Hierarchy is NULL
- Opening Hierarchy is NOT NULL
- Current and Previous parents are different
Action:
Open the hierarchy required
Case D:
Highlights:
- Closing Hierarchy is NOT NULL
- Opening Hierarchy is NOT NULL
- Current and Previous parents are different
Element Structure encountered:
Level_1
    Level_2_1
        Element_1
        Level_3
            Element_2
            Element_3      Previous Element
    Level_2_2
        Element_1          Current Element – Restart Element
        Element_2
Action:
We need to close the hierarchy (up to the applicable level) and open the required hierarchy.
The code fragment in Appendix C details the implementation of all the above scenarios.
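As a compact summary of the highlights listed for each case, the following Python sketch classifies a situation into Case A, B, B.1, C or D. It is not the Appendix C code. The function name classify_case is hypothetical, and an empty string stands in for a NULL hierarchy. In the five cases listed, the parents differ exactly when at least one hierarchy is NOT NULL, so the parent comparison is not taken as a separate input.

```python
def classify_case(closing_hierarchy, opening_hierarchy, restart_element):
    """Return 'A', 'B', 'B.1', 'C' or 'D'; None if no listed case applies."""
    closing = bool(closing_hierarchy)   # closing hierarchy is NOT NULL
    opening = bool(opening_hierarchy)   # opening hierarchy is NOT NULL
    if closing and opening:
        return "D"                      # close up to the level, then open
    if opening:
        return "C"                      # only open the required hierarchy
    if closing:
        # Restart decides whether the immediate parent is also re-opened.
        return "B" if restart_element else "B.1"
    return "A" if restart_element else None


# Tag strings below are placeholders for generated hierarchy fragments.
cases = [
    classify_case("", "", True),
    classify_case("</Level_3>", "", True),
    classify_case("</Level_3>", "", False),
    classify_case("", "<Level_2>", False),
    classify_case("</Level_3></Level_2_1>", "<Level_2_2>", True),
]
```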
APPENDIX A
Splitter Transformation
With the Splitter EP/AEP in your mapping, you can read data from a source that contains a
variable number of input fields in each row. This reader capability is not available in the flat file
reader. You can use the Splitter to read a variable number of delimited fields in each input row in
one of the following modes:
External Procedure (EP) Mode
Advanced External Procedure (AEP) mode
You can use the AEP mode to split data when you do not know the number of fields in an input
stream. For example, you have a dataset where the number of fields is not known. In HL7 Data
files, data can contain delimited fields. The following example shows the HL7 input data with “|”
as the split character:
Input:
PID^JOHN^DOE^5101112222|6506506500
PID^DONALD^DUCK^5101112222|6506506500|4084084080
As illustrated in the input, the input stream for a field that you want to split may contain a variable
number of fields. Since neither the actual nor maximum number of output fields is known, the
AEP mode is used. When you use the AEP mode to split data, each field in the input stream is
sent as a row.
Output:
Row 1 PID
Row 2 JOHN
Row 3 DOE
Row 4 5101112222|6506506500
Row 5 PID
Row 6 DONALD
Row 7 DUCK
Row 8 5101112222|6506506500|4084084080
NOTE: The information in this section is an extract, provided as is, from Informatica's Splitter
AEP Transformation article. Please visit http://devnet.informatica.com for complete details.
APPENDIX B
Implementation
This solution was implemented for a customer who is a leading provider of analytical business
information in the United States. The process currently handles daily data coming from
approximately 100 countries and runs on an Informatica PowerCenter 7.1.1 server hosted on a
6-CPU, 24 GB HP-UX server. It currently delivers a performance of 250K per hour and processes
an average of 50K – 100K incremental records daily.
APPENDIX C