
All Datastage Stages

Datastage parallel stages groups


DataStage and QualityStage stages are grouped into the following logical sections:

General objects

Data Quality Stages

Database connectors

Development and Debug stages

File stages

Processing stages

Real Time stages

Restructure Stages

Sequence activities

Please refer to the list below for a description of the stages used in DataStage and QualityStage.
We classified all stages in order of importance and frequency of use in real-life deployments (and
on certification exams). The most widely used stages are marked in bold or have a link to a subpage
with a detailed description and examples.

DataStage and QualityStage parallel stages and activities

General elements

Link indicates a flow of data. There are three main types of links in Datastage: stream,
reference and reject.

Container (can be private or shared) - the main purpose of containers is to
visually simplify a complex datastage job design and keep the design easy to understand.

Annotation is used for adding floating datastage job notes and descriptions on a job
canvas. Annotations provide a great way to document the ETL process and help
understand what a given job does.

Description Annotation shows the contents of a job description field. One description
annotation is allowed in a datastage job.

Debug and development stages

Row generator produces a set of test data which fits the specified metadata (values can be
random or cycled through a specified list). Useful for testing and development.

Column generator adds one or more columns to the incoming flow and generates test
data for these columns.

Peek stage prints record column values to the job log, which can be viewed in Director. It
can have a single input link and multiple output links.

Sample stage samples an input data set. Operates in two modes: percent mode and period
mode.

Head selects the first N rows from each partition of an input data set and copies them to
an output data set.

Tail is similar to the Head stage. It selects the last N rows from each partition.

Write Range Map writes a data set in a form usable by the range partitioning method.

Processing stages

Aggregator joins data vertically by grouping the incoming data stream and calculating
summaries (sum, count, min, max, variance, etc.) for each group. The data can be
grouped using two methods: hash table or pre-sort.

Copy - copies input data (a single stream) to one or more output data flows

FTP stage uses FTP protocol to transfer data to a remote machine

Filter filters out records that do not meet specified requirements.

Funnel combines multiple streams into one.

Join combines two or more inputs according to the values of key column(s). Similar in
concept to a relational DBMS SQL join (with the ability to perform inner, left, right and full outer
joins). It can have 1 left and multiple right inputs (all need to be sorted) and produces a
single output stream (no reject link).

Lookup combines two or more inputs according to the values of key column(s). The Lookup
stage can have 1 source and multiple lookup tables. Records don't need to be sorted, and it
produces a single output stream and a reject link.

Merge combines one master input with multiple update inputs according to the values of key
column(s). All inputs need to be sorted, and unmatched secondary entries can be
captured in multiple reject links.

Modify stage alters the record schema of its input dataset. Useful for renaming columns,
non-default data type conversions and null handling

Remove duplicates stage needs a single sorted data set as input. It removes all duplicate
records according to a specification and writes to a single output

Slowly Changing Dimension automates the process of updating dimension tables where
the data changes over time. It supports SCD type 1 and SCD type 2.

Sort sorts the input data set by one or more key columns.

Transformer stage handles extracted data and performs data validation, conversions and
lookups.

Change Capture - captures before and after state of two input data sets and outputs a
single data set whose records represent the changes made.

Change Apply - applies the change operations to a before data set to compute an after
data set. It gets data from a Change Capture stage

Difference stage performs a record-by-record comparison of two input data sets and
outputs a single data set whose records represent the difference between them. Similar to
the Change Capture stage.

Checksum - generates a checksum from the specified columns in a row and adds it to the
stream. Used to determine if there are differences between records.

Compare performs a column-by-column comparison of records in two presorted input
data sets. It can have two input links and one output link.

Encode encodes data with an encoding command, such as gzip.

Decode decodes a data set previously encoded with the Encode Stage.

External Filter permits specifying an operating system command that acts as a filter on
the processed data.

Generic stage allows users to call an OSH operator from within a DataStage job, with
options as required.

Pivot Enterprise is used for horizontal pivoting. It maps multiple columns in an input row
to a single column in multiple output rows. Pivoting data results in a dataset
with fewer columns but more rows.

Surrogate Key Generator generates surrogate key for a column and manages the key
source.

Switch stage assigns each input row to an output link based on the value of a selector
field. Provides a concept similar to the switch statement in most programming
languages.

Compress - packs a data set using a GZIP utility (or compress command on
LINUX/UNIX)

Expand extracts a previously compressed data set back into raw binary data.

File stage types

Sequential file is used to read data from or write data to one or more flat (sequential)
files.

Data Set stage allows users to read data from or write data to a dataset. Datasets are
operating system files, each of which has a control file (.ds extension by default) and one
or more data files (unreadable by other applications).

File Set stage allows users to read data from or write data to a fileset. Filesets are
operating system files, each of which has a control file (.fs extension) and data files.
Unlike datasets, filesets preserve formatting and are readable by other applications.

Complex flat file allows reading from complex file structures on a mainframe machine,
such as MVS data sets, header and trailer structured files, files that contain multiple
record types, QSAM and VSAM files.

External Source - permits reading data that is output from multiple source programs.

External Target - permits writing data to one or more programs.

Lookup File Set is similar to the File Set stage. It is a partitioned hashed file which can be
used for lookups.

Database stages

Oracle Enterprise allows reading data from and writing data to an Oracle database
(database versions from 9.x to 10g are supported).

ODBC Enterprise permits reading data from and writing data to a database defined as an
ODBC source. In most cases it is used for processing data from or to Microsoft Access
databases and Microsoft Excel spreadsheets.

DB2/UDB Enterprise permits reading data from and writing data to a DB2 database.

Teradata permits reading data from and writing data to a Teradata data warehouse. Three
Teradata stages are available: Teradata connector, Teradata Enterprise and Teradata
Multiload

SQLServer Enterprise permits reading data from and writing data to Microsoft SQL
Server 2005 and 2008 databases.

Sybase permits reading data from and writing data to Sybase databases.

Stored procedure stage supports Oracle, DB2, Sybase, Teradata and Microsoft SQL
Server. The Stored Procedure stage can be used as a source (returns a rowset), as a target
(pass a row to a stored procedure to write) or a transform (to invoke procedure processing
within the database).

MS OLEDB helps retrieve information from any type of information repository, such as a
relational source, an ISAM file, a personal database, or a spreadsheet.

Dynamic Relational Stage (Dynamic DBMS, DRS stage) is used for reading from or
writing to a number of different supported relational DB engines using native interfaces,
such as Oracle, Microsoft SQL Server, DB2, Informix and Sybase.

Informix (CLI or Load)

DB2 UDB (API or Load)

Classic federation

RedBrick Load

Netezza Enterprise

iWay Enterprise

Real Time stages

XML Input stage makes it possible to transform hierarchical XML data to flat relational
data sets

XML Output writes tabular data (relational tables, sequential files or any datastage data
streams) to XML structures

XML Transformer converts XML documents using an XSLT stylesheet

Websphere MQ stages provide a collection of connectivity options to access IBM
WebSphere MQ enterprise messaging systems. There are two MQ stage types available in
DataStage and QualityStage: WebSphere MQ connector and WebSphere MQ plug-in
stage.

Web services client

Web services transformer

Java client stage can be used as a source stage, as a target and as a lookup. The java
package consists of three public classes: com.ascentialsoftware.jds.Column,
com.ascentialsoftware.jds.Row, com.ascentialsoftware.jds.Stage

Java transformer stage supports three links: input, output and reject.

WISD Input - Information Services Input stage

WISD Output - Information Services Output stage

Restructure stages

Column export stage exports data from a number of columns of different data types into a
single column of data type ustring, string, or binary. It can have one input link, one output
link and a reject link.

Column import is complementary to the Column Export stage. It is typically used to divide data
arriving in a single column into multiple columns.

Combine records stage combines rows which have identical keys, into vectors of
subrecords.

Make subrecord combines specified input vectors into a vector of subrecords whose
columns have the same names and data types as the original vectors.

Make vector joins specified input columns into a vector of columns

Promote subrecord - promotes input subrecord columns to top-level columns

Split subrecord - separates an input subrecord field into a set of top-level vector columns

Split vector promotes the elements of a fixed-length vector to a set of top-level columns

Data quality QualityStage stages

Investigate stage analyzes data content of specified columns of each record from the
source file. Provides character and word investigation methods.

Match frequency stage takes input from a file, database or processing stages and
generates a frequency distribution report.

MNS - multinational address standardization.

QualityStage Legacy

Reference Match

Standardize

Survive

Unduplicate Match

WAVES - worldwide address verification and enhancement system.

Sequence activity stage types

Job Activity specifies a Datastage server or parallel job to execute.

Notification Activity - used for sending emails to user defined recipients from within
Datastage

Sequencer is used for synchronization of the control flow of multiple activities in a job
sequence.

Terminator Activity permits shutting down the whole sequence once a certain situation
occurs.

Wait for file Activity - waits for a specific file to appear or disappear and launches the
processing.

EndLoop Activity

Exception Handler

Execute Command

Nested Condition

Routine Activity

StartLoop Activity

UserVariables Activity

=====================================================================

Configuration file:

The Datastage configuration file is a master control file (a text file which sits on the
server side) for jobs which describes the parallel system resources and architecture. The
configuration file provides the hardware configuration for supporting such architectures
as SMP (single machine with multiple CPUs, shared memory and disk), Grid, Cluster or
MPP (multiple CPUs, multiple nodes and dedicated memory per node). DataStage understands
the architecture of the system through this file.
This is one of the biggest strengths of Datastage. For cases in which you have changed
your processing configurations, or changed servers or platform, you will never have to worry
about it affecting your jobs since all the jobs depend on this configuration file for execution.
Datastage jobs determine which node to run a process on, where to store temporary data and
where to store dataset data based on the entries provided in the configuration file. There is a
default configuration file available whenever the server is installed.
The configuration files have extension ".apt". The main outcome from having the configuration
file is to separate software and hardware configuration from job design. It allows changing
hardware and software resources without changing a job design. Datastage jobs can point to
different configuration files by using job parameters, which means that a job can utilize different
hardware architectures without being recompiled.
The configuration file contains the different processing nodes and also specifies the disk
space provided for each processing node. These are logical processing nodes, so having more than
one CPU does not mean the nodes in your configuration file correspond to those CPUs. It is
possible to have more than one logical node on a single physical node. However, you should be
wise in configuring the number of logical nodes on a single physical node. Increasing the number
of nodes increases the degree of parallelism, but it does not necessarily mean better performance
because it results in a larger number of processes. If your underlying system does not have the
capability to handle these loads, then you will have a very inefficient configuration on your hands.

APT_CONFIG_FILE is the environment variable DataStage uses to determine which configuration
file to use (one project can have many configuration files). In fact, this is what is generally used
in production. However, if this environment variable is not defined, how does DataStage
determine which file to use?
If the APT_CONFIG_FILE environment variable is not defined, then DataStage looks for the default
configuration file (config.apt) in the following locations:
1. Current working directory.
2. INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top level directory of the
DataStage installation.
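
As an illustration, a minimal sketch of how this variable is typically set (the path below is an
assumed install location, not one taken from this document):

$APT_CONFIG_FILE = /opt/IBM/InformationServer/Server/Configurations/default.apt

Because the variable can also be exposed as a job parameter, each job run can point to a different
.apt file (for example a 1-node file for development and a 4-node file for production) without
recompiling the job.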


Define Node in configuration file


A node is a logical processing unit. Each node in a configuration file is distinguished by a virtual
name and defines the number and speed of its CPUs, memory availability, page and swap space,
network connectivity details, etc.


What are the different options a logical node can have in the configuration file?
fastname - The fastname is the physical node name that stages use to open connections for high
volume data transfers. The value of this option is often the network name. Typically, you can
get this name by using the Unix command uname -n.
pools - Names of the pools to which the node is assigned. Based on the characteristics of the
processing nodes you can group nodes into sets of pools.
A pool can be associated with many nodes and a node can be part of many pools.
A node belongs to the default pool unless you explicitly specify a pools list for it and omit the
default pool name ("") from the list.
A parallel job, or a specific stage in the parallel job, can be constrained to run on a pool (set of
processing nodes).
If both the job and a stage within the job are constrained to run on specific processing nodes, the
stage will run on the nodes that are common to the stage and the job.
resource - Syntax: resource resource_type "location" [{pools "disk_pool_name"}]. The resource_type
can be canonicalhostname (the quoted Ethernet name of a node in a cluster that is not connected to
the conductor node by the high-speed network), disk (a directory to which persistent data is read
and written), scratchdisk (the quoted absolute path of a directory on a file system where
intermediate data will be temporarily stored; it is local to the processing node), or an
RDBMS-specific resource (e.g. DB2, INFORMIX, ORACLE, etc.).
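
Putting these options together, here is a hedged sketch of a single node entry (the host name, pool
name and directory paths are assumed values for illustration only):

node "node0"
{
fastname "etl_server1"
pools "" "sort"
resource disk "/data/datasets" {pools ""}
resource scratchdisk "/data/scratch" {pools ""}
}

Here node0 belongs to the default pool ("") and the sort pool, reads and writes persistent data
under /data/datasets, and uses /data/scratch for intermediate data.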


How does DataStage decide on which processing node a stage should be run?

1. If a job or stage is not constrained to run on specific nodes, the parallel engine executes a
parallel stage on all nodes defined in the default node pool (default behavior).
2. If the node is constrained, then the constrained processing nodes are chosen while executing the
parallel stage.
In Datastage, the degree of parallelism, the resources being used, etc. are all determined
at run time, based entirely on the configuration provided in the APT configuration file, as
described above. There is a default configuration file available whenever the server is installed.
You can typically find it under the <>\IBM\InformationServer\Server\Configurations folder with the
name default.apt. Bear in mind that you will have to optimise these configurations for your server
based on your resources.

Recall that the configuration file contains the different processing nodes and specifies the disk
space provided for each of them, and that these are logical processing nodes which need not map
one-to-one onto physical CPUs. Now let's try our hand at interpreting a configuration file. Let's
try the sample below.
{
node "node1"
{
fastname "SVR1"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
node "node2"
{
fastname "SVR1"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
node "node3"
{
fastname "SVR2"
pools "" "sort"
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
}
This is a 3-node configuration file. Let's go through the basic entries and what they represent.
Fastname - This refers to the node name on a fast network. From this we can imply that the
nodes node1 and node2 are on the same physical node. However, if we look at node3 we can see
that it is on a different physical node (identified by SVR2). So basically in node1 and node2 all
the resources are shared. This means that the disk and scratch disk specified are actually shared
between those two logical nodes. Node3, on the other hand, has its own disk and scratch disk
space.
Pools - Pools allow us to associate different processing nodes based on their functions and
characteristics. If you see an entry like node0, or reserved node pools like sort, db2, etc.,
then it means that this node is part of the specified pool. A node is by default associated with
the default pool, which is indicated by "". Now if you look at node3 you can see that this node is
associated with the sort pool. This ensures that the Sort stage will run only on nodes that are
part of the sort pool.
Resource disk - This specifies the location on your server where the processing
node will write all the data set files. As you might know, when Datastage creates a dataset, the
file you see does not contain the actual data. The dataset file actually points to the place where
the actual data is stored.
Resource scratchdisk - The location of temporary files created during Datastage processes, like
lookups and sorts, is specified here. If the node is part of the sort pool then the scratch disk
can also be made part of the sort scratch disk pool. This ensures that the temporary files
created during sorts are stored only in this location. If such a pool is not specified, then
Datastage determines whether there are any scratch disk resources that belong to the default
scratch disk pool on the nodes that sort is specified to run on, and if so, that space is used.

SAMPLE CONFIGURATION FILES


Configuration file for a simple SMP

A basic configuration file for a single machine, two-node server (2-CPU) is shown below. The
file defines 2 nodes (node1 and node2) on a single dev server (an IP address might be provided
instead of a hostname) with 3 disk resources (d1 and d2 for the data, and Scratch as scratch
space).
The configuration file is shown below:
node "node1"
{
fastname "dev"
pool ""
resource disk "/IIS/Config/d1" { }
resource disk "/IIS/Config/d2" { }
resource scratchdisk "/IIS/Config/Scratch" { }
}
node "node2"
{
fastname "dev"
pool ""
resource disk "/IIS/Config/d1" { }
resource scratchdisk "/IIS/Config/Scratch" { }
}

Configuration file for a cluster / MPP / grid


The sample configuration file for a cluster or a grid computing on 4 machines is shown below.
The configuration defines 4 nodes (node[1-4]), node pools (n[1-4] and s[1-4]), resource pools
bigdata and sort, and a temporary space.
node "node1"
{
fastname "dev1"
pool "" "n1" "s1" "sort"
resource disk "/IIS/Config1/d1" {}
resource disk "/IIS/Config1/d2" {"bigdata"}
resource scratchdisk "/IIS/Config1/Scratch" {"sort"}
}
node "node2"
{
fastname "dev2"
pool "" "n2" "s2"
resource disk "/IIS/Config2/d1" {}
resource disk "/IIS/Config2/d2" {"bigdata"}
resource scratchdisk "/IIS/Config2/Scratch" {}
}

node "node3"
{
fastname "dev3"
pool "" "n3" "s3"
resource disk "/IIS/Config3/d1" {}
resource scratchdisk "/IIS/Config3/Scratch" {}
}
node "node4"
{
fastname "dev4"
pool "n4" "s4"
resource disk "/IIS/Config4/d1" {}
resource scratchdisk "/IIS/Config4/Scratch" {}
}

Resource disk: here a disk path is defined. The data files of datasets are stored on the resource
disk.
Resource scratchdisk: here also a path to a folder is defined. This path is used by the parallel
job stages for buffering data when the parallel job runs.
=====================================================================

Sequential_Stage :
Sequential File:
The Sequential File stage is a file stage. It allows you to read data from or write
data to one or more flat files, as shown in the figure below:

The stage executes in parallel mode by default if it is reading multiple files, but executes
sequentially if it is only reading one file.
In order to read a sequential file, datastage needs to know about the format of the file.
If you are reading a delimited file you need to specify the delimiter in the Format tab.
Reading a Fixed-Width File:
Double click on the Sequential File stage and go to the Properties tab.
Source:
File: Give the file name, including the path.
Read Method: Whether to specify filenames explicitly or use a file pattern.
Important Options:
First Line is Column Names: If set to true, the first line of the file contains column names on
writing and is ignored on reading.
Keep File Partitions: Set to True to partition the read data set according to the organization of
the input file(s).
Reject Mode: Continue to simply discard any rejected rows; Fail to stop if any row is rejected;
Output to send rejected rows down a reject link.
For fixed-width files, however, you can configure the stage to behave differently:
* You can specify that single files can be read by multiple nodes. This can improve performance on
cluster systems.
* You can specify that a number of readers run on a single node. This means, for example, that a
single file can be partitioned as it is read.
These two options are mutually exclusive.
Scenario 1:
Reading the file sequentially.

Scenario 2:
Read From Multiple Nodes = Yes

Once we set Read From Multiple Nodes = Yes, the stage by default executes in parallel mode.

If you run the job with the above configuration it will abort with the following fatal error:
sff_SourceFile: The multinode option requires fixed length records. (That means you can use this
option to read fixed-width files only.)
In order to fix the above issue, go to the Format tab and add additional parameters as shown below.
Now the job finishes successfully; see the datastage monitor below for the performance
improvement compared with reading from a single node.

Scenario 3: Read a delimited file by adding the Number of Readers Per Node option instead of the
multinode option to improve read performance; once we add this option the Sequential File stage
will execute in parallel mode by default.

If we are reading from and writing to fixed-width files, it is always good practice to add the
APT_STRING_PADCHAR Datastage environment variable and assign 0x20 (space) as the default value;
then it will pad with spaces, otherwise datastage will pad with the null value (the Datastage
default padding character).
Always keep Reject Mode = Fail to make sure the datastage job fails if we get a bad format from
source systems.

Sequential File Best Performance Settings/Tips

Important scenarios using the Sequential File stage:
Sequential file with Duplicate Records
Splitting input files into three different files using lookup

Sequential file with Duplicate Records:
A sequential file has 8 records with one column; below are the values in the column,
separated by spaces:
1 1 2 2 3 4 5 6
In a parallel job, after reading the sequential file, 2 more sequential files should be
created, one with the duplicate records and the other without duplicates.
File 1 records separated by space: 1 1 2 2
File 2 records separated by space: 3 4 5 6
How will you do it?

Sol1:
1. Introduce a Sort stage right after the sequential file.
2. Select the Key Change Column property in the Sort stage, which flags each row with 1 or 0
(the first row of a group versus the remaining, duplicate rows).
3. Put a Filter or Transformer next to it, and now you have the unique rows on one link and the
duplicates on the other link.

Sol2 (should check, though):

First of all, take the source file and connect it to a Copy stage. Then one link is connected
to an Aggregator stage and another link is connected to a Lookup stage or Join
stage. In the Aggregator stage, using the count function, calculate how many times the
values are repeated in the key column.
After calculating that, it is connected to a Filter stage where we filter on cnt=1 (cnt
is the new column holding the count of repeating rows).
Then the output from the Filter is connected to the Lookup stage as the reference. In the
Lookup stage set Lookup Failure = Reject.
Then place two output links for the Lookup: one collects the non-repeated values
and the other collects the repeated values on the reject link.

Splitting input files into three different files using lookup:
Input file A contains
1
2
3
4
5
6
7
8
9
10
input file B contains
6
7
8
9
10
11
12
13

14
15
Output file X contains
1
2
3
4
5
Output file y contains
6
7
8
9
10
Output file z contains
11
12
13
14
15
Possible solution:
Use a Change Capture stage. First, I am going to use A as the source (before) and B as the
reference (after), both of them connected to the Change Capture stage (keeping copy records in the
output). From the Change Capture stage it is connected to a Filter stage and then to the targets
X, Y and Z. In the Filter stage: change_code=2 goes to X [1,2,3,4,5],
change_code=0 goes to Y [6,7,8,9,10] and change_code=1 goes to Z
[11,12,13,14,15].
Solution 2:
Create one px job.
src file = seq1 (1,2,3,4,5,6,7,8,9,10)
1st lkp = seq2 (6,7,8,9,10,11,12,13,14,15)
o/p - matching recs - o/p 1 (6,7,8,9,10)
not-matching recs - o/p 2 (1,2,3,4,5)
2nd lkp:
src file - seq2 (6,7,8,9,10,11,12,13,14,15)
lkp file - o/p 1 (6,7,8,9,10)
not-matching recs - o/p 3 (11,12,13,14,15)

Dataset :
Inside an InfoSphere DataStage parallel job, data is moved around in data sets. These
carry metadata with them, both column definitions and information about the configuration that
was in effect when the data set was created. If, for example, you have a stage which limits
execution to a subset of available nodes, and the data set was created by a stage using all nodes,
InfoSphere DataStage can detect that the data will need repartitioning.
If required, data sets can be landed as persistent data sets, represented by a Data Set
stage. This is the most efficient way of moving data between linked jobs. Persistent data sets are
stored in a series of files linked by a control file (note that you should not attempt to manipulate
these files using UNIX tools such as rm or mv; always use the tools provided with InfoSphere
DataStage).
There are two groups of Datasets - persistent and virtual.
The first type, persistent Datasets, are marked with the *.ds extension, while for the second type,
virtual datasets, the *.v extension is reserved. (It's important to mention that no *.v files may
be visible in the Unix file system, as they exist only virtually, inhabiting RAM memory. The
extension *.v itself is characteristic strictly of OSH - the Orchestrate scripting language.)
Further differences are much more significant. Primarily, persistent Datasets are stored
in Unix files using the internal Datastage EE format, while virtual Datasets are never stored on
disk - they exist within links, in EE format, but in RAM memory. Finally,
persistent Datasets are readable and rewriteable with the Data Set stage, while virtual
Datasets can only be passed through in memory.
A data set comprises a descriptor file and a number of other files that are added as the data set
grows. These files are stored on multiple disks in your system. A data set is organized in terms
of partitions and segments.
Each partition of a data set is stored on a single processing node. Each data segment contains all
the records written by a single job. So a segment can contain files from many partitions, and a
partition has files from many segments.
Firstly, as a single Dataset contains multiple records, it is obvious that all of them must undergo
the same processes and modifications; in a word, all of them must go through the same
successive stages.
Secondly, it should be expected that different Datasets usually have different schemas, and
therefore they cannot be treated commonly.
Alias names of Datasets are:

1) Orchestrate File
2) Operating System file
A Dataset consists of multiple files:
a) Descriptor File
b) Data File
c) Control file
d) Header Files
In the Descriptor File, we can see the schema details and the address of the data.
In the Data File, we can see the data in native format.
The Control and Header files reside in the operating system.

Starting the Data Set Manager:

Choose Tools > Data Set Management; a Browse Files dialog box appears.
1. Navigate to the directory containing the data set you want to manage. By convention, data set
files have the suffix .ds.
2. Select the data set you want to manage and click OK. The Data Set Viewer appears. From here
you can copy or delete the chosen data set. You can also view its schema (column definitions)
or the data it contains.

Transformer Stage :

Various functionalities of the Transformer Stage:

Generating surrogate key using Transformer
Transformer stage using StripWhiteSpaces
TRANSFORMER STAGE TO FILTER THE DATA
TRANSFORMER STAGE USING PADSTRING FUNCTION
CONCATENATE DATA USING TRANSFORMER STAGE
FIELD FUNCTION IN TRANSFORMER STAGE
TRANSFORMER STAGE WITH SIMPLE EXAMPLE
TRANSFORMER STAGE FOR DEPARTMENT WISE DATA
HOW TO CONVERT ROWS INTO THE COLUMNS IN DATASTAGE
SORT STAGE AND TRANSFORMER STAGE WITH SAMPLE DATA EXAMPLE
FIELD FUNCTION IN TRANSFORMER STAGE WITH EXAMPLE
RIGHT AND LEFT FUNCTIONS IN TRANSFORMER STAGE WITH EXAMPLE

SOME OTHER IMPORTANT FUNCTIONS:

How to perform aggregation using a Transformer
Date and time string functions
Null handling functions
Vector function - Transformer
Type conversion functions - Transformer
How to convert a single row into multiple rows?

Data Stage Transformer Usage Guidelines
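
As a rough illustration of a few of the string functions listed above, here is a hedged sketch of
output-column derivations in a Transformer (the link and column names are made up for the example):

StripWhiteSpaces(Lnk_In.CUSTOMER_NAME)            strips white space from the value
Left(Lnk_In.ZIP_CODE, 5)                          returns the first 5 characters
Right(Lnk_In.PHONE, 4)                            returns the last 4 characters
Field(Lnk_In.FULL_NAME, ",", 2)                   returns the second comma-delimited field
Lnk_In.FIRST_NAME : " " : Lnk_In.LAST_NAME        concatenates two columns with the ":" operator

Each expression is typed into the derivation cell of the corresponding output column.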


=========================================================================================

Sort Stage:

SORT STAGE PROPERTIES:

SORT STAGE WITH TWO KEY VALUES

HOW TO CREATE GROUP ID IN SORT STAGE IN DATASTAGE

Group ids can be created in two different ways, by using:

a) Key Change Column

b) Cluster Key Change Column

Both options are used to create group ids.

When we select either option and set it to true, it will create the group ids group-wise.
The data is divided into groups based on the key column, and the stage gives 1 for
the first row of every group and 0 for the rest of the rows in all groups.

Whether to use Key Change Column or Cluster Key Change Column depends on the data we are
getting from the source:

If the data we are getting is not sorted, then we use Key Change Column to create group ids.

If the data we are getting is already sorted, then we use Cluster Key Change Column to
create group ids.


Open the Sort stage properties and select the key column. If you are getting data that is not
sorted, keep Key Change Column as True, and drag and drop the columns in the Output tab.
Group ids will be generated as 0's and 1's, group-wise, in the output.

If your data is already sorted, you need to keep Cluster Key Change Column as True instead
(don't select Key Change Column), and the rest of the process is the same as above.
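
As a small illustration (borrowing the dno/name sample data used in the Aggregator section below),
sorted input on dno with the key change option set to True would produce output along these lines,
where the generated column (shown here as keyChange) carries the group id flag:

dno,name,keyChange
10,siva,1
10,ram,0
10,sam,0
20,tom,1
20,tiny,0
30,emy,1
40,remo,1

A downstream Filter or Transformer can then test keyChange to separate the first row of each group
from the duplicates.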

Aggregator_Stage :
The Aggregator Stage:

The Aggregator stage is a processing stage in datastage used for grouping and summary operations.
By default the Aggregator stage executes in parallel mode in parallel jobs.
Note: In a parallel environment, the way that we partition data before grouping and
summarizing will affect the results. If you partition data using the round-robin method,
records with the same key values will be distributed across different partitions, and that will
give incorrect results.
Aggregation Method:
The Aggregator stage has two different aggregation methods.

1) Hash: Use hash mode for a relatively small number of groups; generally, fewer than about 1000
groups per megabyte of memory.
2) Sort: Sort mode requires the input data set to have been partition-sorted with all of the
grouping keys specified as hashing and sorting keys. Unlike the hash aggregator, the sort
aggregator requires presorted data, but only maintains the calculations for the current group in
memory.
Aggregation Data Type:
By default the aggregator stage calculation output column is of the double data type; if you want
decimal output then add the following property, as shown in the figure below.

If you are using a single key column for the grouping keys then there is no need to sort or hash
partition the incoming data.

AGGREGATOR STAGE AND FILTER STAGE WITH EXAMPLE

If we have data as below:

table_a
dno,name
10,siva
10,ram
10,sam
20,tom
30,emy
20,tiny
40,remo

We need to send the records that occur multiple times (with respect to dno) to one target,
and the records that are not repeated to another target.
Take the job design as:

Read and load the data in the sequential file.

In the Aggregator stage select Group = dno,
Aggregation Type = Count Rows,
Count Output Column = dno_count (user defined).
In the Output tab drag and drop the required columns, then click OK.
In the Filter stage:
----- first Where clause: dno_count>1, Output link = 0
----- second Where clause: dno_count<=1, Output link = 1
Drag and drop the outputs to the two targets.
Give the target file names, then compile and run the job. You will get the required data in the
targets.
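
For the sample data above, the Aggregator output and the two Filter targets would look roughly like
this (dno_count is the user-defined count column from the example):

Aggregator output:
dno,dno_count
10,3
20,2
30,1
40,1

Target 1 (dno_count>1): the groups 10 and 20
Target 2 (dno_count<=1): the groups 30 and 40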

AGGREGATOR STAGE TO FIND NUMBER OF PEOPLE GROUP WISE


We can use the Aggregator stage to find the number of people in each department.
For example, if we have the data as below
e_id,e_name,dept_no
1,sam,10
2,tom,20

3,pinky,10
4,lin,20
5,jim,10
6,emy,30
7,pom,10
8,jem,20
9,vin,30
10,den,20

Take Job Design as below


Seq.-------Agg.Stage--------Seq.File

Read and load the data in the source file.

Go to the Aggregator stage and select Group = dept_no
and Aggregation Type = Count Rows,
Count Output Column = Count (this is user determined).
Click OK (give the file name at the target as you wish).
Compile and run the job.
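
For the sample data above, the expected Aggregator output is:

dept_no,count
10,4
20,4
30,2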
AGGREGATOR STAGE WITH REAL TIME SCENARIO EXAMPLE
The Aggregator stage works on groups.
It is used for calculations and counting.
It supports 1 input and 1 output.
Example for the Aggregator stage:
Input Table to Read
e_id, e_name, e_job,e_sal,deptno
100,sam,clerck,2000,10
200,tom,salesman,1200,20
300,lin,driver,1600,20
400,tim,manager,2500,10
500,zim,pa,2200,10
600,eli,clerck,2300,20

Here our requirement is to find the maximum salary for each dept number.
According to this sample data, we have two departments.
Take a Sequential File to read the data and an Aggregator for the calculations,
and take a Sequential File to load into the target.

That is, we can take it like this:
Seq.File--------Aggregator-----------Seq.File

Read the data in the Seq.File,

and in the Aggregator stage, in Properties, select Group = deptno
and select e_sal as the column for calculation (maximum value),
because we want to calculate the maximum salary based on the dept group.
Select the output file name in the second sequential file.
Now compile and run.
It will work fine.
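
For this sample data, the expected output is the maximum salary per department (the output column
name is whatever you choose for the calculation, shown here as max_sal):

deptno,max_sal
10,2500
20,2300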


Join Stage:
MULTIPLE JOIN STAGES TO JOIN THREE TABLES:
If we have three tables to join and we don't have the same key column in all the tables, we cannot
join them using one Join stage. In this case we can use multiple Join stages to join the tables.
You can take sample data as below.
soft_com_1
e_id,e_name,e_job,dept_no
001,james,developer,10

002,merlin,tester,20
003,jonathan,developer,10
004,morgan,tester,20
005,mary,tester,20
soft_com_2
dept_no,d_name,loc_id
10,developer,200
20,tester,300
soft_com_3
loc_id,add_1,add_2
200,melbourne,victoria
300,brisbane,queensland

Take Job Design as below

Read and load the data in three sequential files.


In the first Join stage,
go to Properties ---- select the key column as dept_no
and you can select Join Type = Inner.
Drag and drop the required columns in the Output.
Click OK.
In the second Join stage,
go to Properties ---- select the key column as loc_id
and you can select Join Type = Inner.
Drag and drop the required columns in the Output.
Click OK.
Give a file name to the target file, that's it.
Compile and run the job.
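
Assuming the loc_id values in soft_com_2 and soft_com_3 line up as in the sample above, the final
output of the two inner joins would be roughly:

e_id,e_name,e_job,dept_no,d_name,loc_id,add_1,add_2
001,james,developer,10,developer,200,melbourne,victoria
002,merlin,tester,20,tester,300,brisbane,queensland
003,jonathan,developer,10,developer,200,melbourne,victoria
004,morgan,tester,20,tester,300,brisbane,queensland
005,mary,tester,20,tester,300,brisbane,queensland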


JOIN STAGE WITHOUT COMMON KEY COLUMN:


If we would like to join tables using the Join stage, we need to have common key
columns in those tables. But sometimes we get the data without a common key column.
In that case we can use a Column Generator to create a common column in both
tables.
Read and load the data in the Seq. Files.
Go to the Column Generator to create the column and sample data.
In Properties, give the name of the column to create,
and drag and drop the columns into the target.
Now go to the Join stage and select the key column which we have created (you can give
any name; based on the business requirement you can give an understandable name).
In the Output drag and drop all required columns.
Give a file name to the target file. Then
compile and run the job.
Sample tables you can take as below:
Table1
e_id,e_name,e_loc
100,andi,chicago
200,borny,Indiana
300,Tommy,NewYork

Table2
Bizno,Job
20,clerk
30,salesman

INNER JOIN IN JOIN STAGE WITH EXAMPLE:

If we have a Source data as below


xyz1 (Table 1 )
e_id,e_name,e_add
1,tim,la
2,sam,wsn
3,kim,mex
4,lin,ind
5,elina,chc

xyz2 (Table 2 )
e_id,address
1,los angeles
2,washington
3,mexico
4,indiana
5,chicago

We need the output as:


e_id,e_name,address
1,tim,los angeles
2,sam,washington
3,kim,mexico
4,lin,indiana
5,elina,chicago

Take job design as below

Read and load both the source tables in seq. files

and go to the Join stage properties.

Select the key column as e_id,
Join Type = Inner.
In the Output columns, drag and drop the required columns to go to the output file and click OK.
Give a file name for the target dataset and then
compile and run the job. You will get the required output in the target file.

Join stages and its types explained:


Inner Join:

Say we have duplicates in the left table on the key field - what will happen?
We will get all matching records, including all the matching duplicates. Here is the
table representation of the join.

Left Outer Join:
All the records from the left table plus all matching records. If a match does not exist in the
right table, the right-hand columns are populated with nulls.

Right Outer Join:

All the records from the right table plus all matching records.

Full Outer Join:

All records and all matching records.
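
Since the table representations are not reproduced here, a small made-up illustration may help.
Take a left table with rows (1,A), (2,B), (2,B2), (3,C) and a right table with rows (2,X), (3,Y),
(4,Z), joined on the numeric key:

Inner join:       (2,B,X), (2,B2,X), (3,C,Y)
Left outer join:  (1,A,null), (2,B,X), (2,B2,X), (3,C,Y)
Right outer join: (2,B,X), (2,B2,X), (3,C,Y), (4,null,Z)
Full outer join:  (1,A,null), (2,B,X), (2,B2,X), (3,C,Y), (4,null,Z)

Note how the duplicate key 2 on the left side produces a matching output row for each duplicate.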

=====================================================================

Lookup_Stage :
Lookup Stage:
The Lookup stage is most appropriate when the reference data for all lookup stages in a job
is small enough to fit into available physical memory. Each lookup reference requires a contiguous
block of shared memory. If the Data Sets are larger than available memory resources, the JOIN or
MERGE stage should be used.
Lookup stages do not require data on the input link or reference links to be sorted. Be aware,
though, that large in-memory lookup tables will degrade performance because of their paging
requirements. Each record of the output data set contains columns from a source record plus columns
from all the corresponding lookup records where corresponding source and lookup records have the
same value for the lookup key columns. The lookup key columns do not have to have the same
names in the primary and the reference links.
The optional reject link carries source records that do not have a corresponding entry in the
input lookup tables.
You can also perform a range lookup, which compares the value of a source column to a range of
values between two lookup table columns. If the source column value falls within the required range, a
row is passed to the output link. Alternatively, you can compare the value of a lookup column to a
range of values between two source columns. Range lookups must be based on column values, not
constant values. Multiple ranges are supported.

There are some special partitioning considerations for Lookup stages. You need to ensure that the data
being looked up in the lookup table is in the same partition as the input data referencing it. One way
of doing this is to partition the lookup tables using the Entire method.
Lookup stage configuration: Equal lookup

You can specify what action needs to be performed if the lookup fails.

Scenario 1: Continue

Choose Entire partitioning on the reference link.

Scenario 2: Fail

The job aborted with the following error:
stg_Lkp,0: Failed a key lookup for record 2 Key Values: CUSTOMER_ID: 3

Scenario 3: Drop

Scenario 4: Reject

If we select Reject as the lookup failure condition then we need to add a reject link, otherwise we
get a compilation error.

Range Lookup:
Business scenario: we have input data with customer id, customer name and transaction date. We
have a customer dimension table with customer address information. A customer can have multiple
records with different start and end dates, and we want to select the record where the incoming
transaction date falls between the start and end date of the customer in the dim table.
Ex Input Data:
CUSTOMER_ID  CUSTOMER_NAME  TRANSACTION_DT
1            UMA            2011-03-01
1            UMA            2010-05-01

Ex Dim Data:
CUSTOMER_ID  CITY        ZIP_CODE  START_DT    END_DT
1            BUENA PARK  90620     2010-01-01  2010-12-31
1            CYPRESS     90630     2011-01-01  2011-04-30

Expected Output:
CUSTOMER_ID  CUSTOMER_NAME  TRANSACTION_DT  CITY        ZIP_CODE
1            UMA            2011-03-01      CYPRESS     90630
1            UMA            2010-05-01      BUENA PARK  90620

Configure the lookup stage as shown below. Double click on the Lnk_input.TRANSACTION_DATE
column (specifying the condition on the input link).

You need to specify Multiple Rows Returned From Link on the reference link, otherwise you will get
the following warning in the job log. Even though we have two distinct rows based on the
customer_id, start_dt and end_dt columns, datastage considers them duplicate rows based on the
customer_id key only.
stg_Lkp,0: Ignoring duplicate entry; no further warnings will be issued for this table

Compile and run the job:

Scenario 2: Specify the range on the reference link:

This concludes the lookup stage configuration for the different scenarios.

RANGE LOOKUP WITH EXAMPLE IN DATASTAGE:

Range lookup is used to check records against a range defined in another table's records.
For example, suppose we have an employee list with salaries from $1500 to $3000,
and we would like to check which range each employee's salary falls into.
We can do this by using a range lookup.
For example, take the following sample data:
xyzcomp ( Table Name )
e_id,e_name,e_sal
100,james,2000
200,sammy,1600
300,williams,1900
400,robin,1700
500,ponting,2200
600,flower,1800
700,mary,2100

lsal is nothing but low salary,
hsal is nothing but high salary.
Now read and load the data in sequential files,
and open the Lookup stage --- select e_sal in the first table data,
and open the key expression, and
here select e_sal >= lsal And e_sal <= hsal.
Click OK.
Then drag and drop the required columns into the output and click OK.
Give a file name to the target file.
Then compile and run the job. That's it, you will get the required output.

Why is Entire partitioning used in the LOOKUP stage?

Entire partitioning puts all the reference data on every node, so while matching (in the lookup)
all the reference data is present on each node.
For a lookup, sorting is not required. If we do not use Entire partitioning, the reference data is
split across the nodes; then each primary record needs to be checked against all nodes for a
matching reference record, and we face a performance issue. If we use Entire partitioning in the
lookup, then a primary record only needs to look into one node; if a match is found the record goes
to the target, otherwise it moves to reject, drop, etc. (based on the requirement), with no need to
check another node. In this case, if we are running the job on 4 nodes, then 4 records are
processed at a time.
Note: Please remember we go for a lookup only when we have small reference data. With big reference
data it is a performance issue (the I/O work will increase) and sometimes the job will abort.

Difference between normal and sparse lookup?

Normal lookup: all the reference table data is stored in memory (a buffer) for cross-checking with
the primary table data.
Sparse lookup: each record of the primary table is cross-checked directly against the reference
table. These types of lookups arise only if the reference table is in a database, so depending on
the size of the reference table we set the type of lookup to implement.

During a lookup, what if we have duplicates in the reference table/file?
=====================================================================

Merge_Stage :
Merge Stage:

The Merge stage is a processing stage. It can have any number of input links, a single output
link, and the same number of reject links as there are update input links (according to the DS
documentation).
The Merge stage combines a master dataset with one or more update datasets based on the key
columns. The output record contains all the columns from the master record plus any additional
columns from each update record that are required.
A master record and an update record will be merged only if both have the same key column values.
The data sets input to the Merge stage must be key partitioned and sorted. This ensures
that rows with the same key column values are located in the same partition and will be processed by
the same node. It also minimizes memory requirements because fewer rows need to be in memory at
any one time.
As part of preprocessing your data for the Merge stage, you should also remove duplicate
records from the master data set. If you have more than one update data set, you must remove
duplicate records from the update data sets as well.
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject
links. You can route update link rows that fail to match a master row down a reject link that is specific
for that link. You must have the same number of reject links as you have update links. The Link
Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject
links. You can also specify whether to drop unmatched master rows, or output them on the output
data link.
Example :
Master dataset:
CUSTOMER_ID  CUSTOMER_NAME
1            UMA
2            POOJITHA

Update dataset1:
CUSTOMER_ID  CITY     ZIP_CODE  SEX
1            CYPRESS  90630
2            CYPRESS  90630

Output:
CUSTOMER_ID  CUSTOMER_NAME  CITY     ZIP_CODE  SEX
1            UMA            CYPRESS  90630
2            POOJITHA       CYPRESS  90630

Merge stage configuration steps:

Options:
Unmatched Masters Mode: Keep means that unmatched rows (those without any updates) from the
master link are output; Drop means that unmatched rows are dropped instead.
Warn On Reject Updates: True to generate a warning when bad records from any update links are
rejected.
Warn On Unmatched Masters: True to generate a warning when there are unmatched rows from the
master link.

Partitioning: hash on both the master input and the update input, as shown below:

Compile and run the job:

Scenario 2:
Remove a record from the updateds1 and check the output:

Check for the datastage warning in the job log as we have selected Warn on unmatched masters =
TRUE
stg_merge,0: Master record (0) has no updates.
stg_merge,1: Update record (1) of data set 1 is dropped; no masters are left.
Scenario 3: Drop unmatched master records and capture reject records from updateds1.

Scenario 4: Insert a duplicate record with the same customer id in the master dataset and check the
results.

Looking at the output, it is clear that the merge stage automatically dropped the duplicate record
from the master dataset.

Scenario 5: Added a new update dataset, updateds2, which contains the following data.
Update dataset2:
CUSTOMER_ID  CITIZENSHIP
1            INDIAN
2            AMERICAN

Still we have the duplicate row in the master dataset; if you compile the job with the above design
you will get a compilation error like the one below.

If you look at the above figure you can see 2 rows
in the output, because we have a matching row for customer_id = 2 in updateds2.

Scenario 6: Add a duplicate row for customer_id=1 in the updateds1 dataset.

Now we have a duplicate record both in the master dataset and in updateds1. Run the job and check
the results and warnings in the job log.

There is no change in the results; the merge stage automatically dropped the duplicate row.
Scenario 7: Modify the duplicate row for customer_id=1 in updateds1 with zipcode 90630
instead of 90620.

Run the job and check the output results.

I ran the same job multiple times and found that the merge stage takes the first record coming as
input from updateds1 and drops the subsequent records with the same customer id.
This post covered most of the merge scenarios.

=====================================================================

Filter_Stage :
Filter Stage:
The Filter stage is a processing stage used to filter data based on a filter condition.

The Filter stage is configured by creating an expression in the where clause.

Scenario 1: Check for empty values in the customer name field. We are reading from a sequential
file and hence we should check for an empty value instead of null.

Scenario 2: Comparing incoming fields - check whether the transaction date falls between str_dt and
end_dt and filter those records.
Input Data:
CUSTOMER_ID  CUSTOMER_NAME  TRANSACTION_DT  STR_DT     END_DT
1            UMA            1/1/2010        5/20/2010  12/20/2010
1            UMA            5/28/2011       5/20/2010  12/20/2010

Output:
CUSTOMER_ID  CUSTOMER_NAME  TRANSACTION_DT  STR_DT     END_DT
1            UMA            5/28/2011       5/20/2010  12/20/2010

Reject:
CUSTOMER_ID  CUSTOMER_NAME  TRANSACTION_DT  STR_DT     END_DT
1            UMA            1/1/2010        5/20/2010  12/20/2010

Partition the data based on CUSTOMER_ID to make sure all rows with the same key values are
processed on the same node.
Condition: Where TRANSACTION_DT Between STR_DT And END_DT

Actual Output:
Actual Reject Data:

Scenario 3: Evaluating input column data,
e.g. Where CUSTOMER_NAME='UMA' And CUSTOMER_ID=1

Output:

Reject:
This covers most filter stage scenarios.

FILTER STAGE WITH REAL TIME EXAMPLE:

The Filter stage is used to write conditions on columns.
We can write conditions on any number of columns.
For example, if you have data as follows:
e_id,e_name,e_sal
1,sam,2000
2,ram,2200
3,pollard,1800
4,ponting,2200
5,sachin,2200
we need to find who is getting a salary of 2200.
(In real life there will be thousands of records at the source.)

We can take a Sequential File to read the data and a Filter stage for writing the conditions,
and a Data Set file to load the data into the target.
The design is as follows: --Seq.File---------Filter------------DatasetFile

Open the sequential file and read the data.
In the Filter stage -- Properties -- write the condition in the Where clause as
e_sal=2200
Go to Output -- drag and drop,
click OK.
Go to the target Dataset file and give a name to the file, and that's it.
Compile and run;
you will get the required output in the target file.
If you are trying to write conditions on multiple columns,
write a condition in the Where clause
and give its Output Link (the link order number), for example 1,
then write another condition and select Output Link = 0
(you can see the link order numbers in the Link Ordering option).
The design is as follows: ----

Compile and run;
you will get the data in both of the targets.
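
A hedged sketch of the Filter stage properties for the two-target case, using the sample columns
above (the second condition is only an illustrative example):

Where clause: e_sal=2200        Output link = 0
Where clause: e_name='sachin'   Output link = 1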

Copy Stage :
COPY STAGE:

The Copy stage is one of the processing stages; it has one input and 'n' number of outputs. The
Copy stage is used to send one source of data to multiple copies, and this can be used for multiple
purposes. The records which we send through the Copy stage are copied without any modifications,
and we can also do the following:
a) The column order can be altered.
b) Columns can be dropped.
c) We can change the column names.
In the Copy stage we have an option called Force. It is False by default; if we set it to True, it
is used to specify that datastage should not try to optimize the job by removing a Copy operation
where there is one input and one output.
================================================================================

Funnel_Stage :
Funnel Stage:
The Funnel stage is used to combine multiple input datasets into a single output dataset. This
stage can have any number of input links and a single output link.

It operates in 3 modes:
Continuous Funnel combines records as they arrive (i.e. in no particular order);
Sort Funnel combines the input records in the order defined by one or more key fields;
Sequence copies all records from the first input data set to the output data set, then all the
records from the second input data set, etc.
Note: Metadata for all inputs must be identical.
Sort Funnel requires the data to be sorted and partitioned by the same key columns as are to be
used by the funnel operation.
Hash partitioning guarantees that all records with the same key column values are located in the
same partition and are processed on the same node.

1) Continuous Funnel:
Go to the properties page of the Funnel stage and set Funnel Type to Continuous Funnel.

2) Sequence:

Note: In order to use the Sequence funnel you need to specify the order in which the input links
are processed, and also make sure the stage runs in sequential mode.
Usually we use the Sequence funnel when we create a file with header, detail and trailer records.
3) Sort Funnel:

Note: If you are running your Sort Funnel stage in parallel, you should be aware of the various
considerations about sorting data and partitions.
That's all about funnel stage usage in datastage.

FUNNEL STAGE WITH REAL TIME EXAMPLE

Sometimes we get data in multiple files that belong to the same bank's customer information.
In that case we need to funnel the files to get the data from multiple files into a single file (table).
For example, if we have the data in two files as below:
xyzbank1
e_id,e_name,e_loc
111,tom,sydney
222,renu,melboourne
333,james,canberra
444,merlin,melbourne

xyzbank2
e_id,e_name,e_loc
555,flower,perth
666,paul,goldenbeach
777,raun,Aucland
888,ten,kiwi

For Funnel take the Job design as


Read and load the data in the two sequential files.
Go to the Funnel stage Properties and
select Funnel Type = Continuous Funnel
(or any other type according to your requirement).
Go to Output and drag and drop the columns
(remember the source column structures should be the same), then click OK.
Give a file name for the target dataset, then
compile and run the job.
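As a rough illustration only (not the DataStage engine), the three funnel modes can be pictured in Python like this; the rows are the xyzbank sample data above and the sort key (e_id) is an assumption:

import heapq

xyzbank1 = [(111, "tom", "sydney"), (222, "renu", "melbourne"),
            (333, "james", "canberra"), (444, "merlin", "melbourne")]
xyzbank2 = [(555, "flower", "perth"), (666, "paul", "goldenbeach"),
            (777, "raun", "Aucland"), (888, "ten", "kiwi")]

# Sequence funnel: all rows of input 1, then all rows of input 2.
sequence = xyzbank1 + xyzbank2

# Sort funnel: merge inputs that are already sorted on the key (here e_id).
sort_funnel = list(heapq.merge(xyzbank1, xyzbank2, key=lambda r: r[0]))

# Continuous funnel: rows taken as they arrive, no guaranteed order;
# with simple in-memory lists it is just a concatenation.
continuous = xyzbank1 + xyzbank2

print(len(sequence), len(sort_funnel), len(continuous))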

Column Generator :
Column Generator is a development/generating stage that is used to generate a column
with sample data based on a user-defined data type.
Take Job Design as

Seq.File--------------Col.Gen------------------Ds

Take source data as a


xyzbank
e_id,e_name,e_loc
555,flower,perth
666,paul,goldencopy
777,james,aucland
888,cheffler,kiwi

In order to generate a column (for example unique_id):

First read and load the data in the Sequential File stage.
Go to the Column Generator stage -- Properties -- select Column Method as Explicit.
In Column To Generate give the column name (for example unique_id).
In Output, drag and drop.
Go to Columns, write the column name; you can change the data type for unique_id in SQL Type and
can give a length with a suitable name.
Then compile and Run
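As a rough analogy (not the Column Generator itself, which fills in type-based sample values), here is a small Python sketch that appends a generated unique_id column to the sample rows above:

from itertools import count

xyzbank = [("555", "flower", "perth"), ("666", "paul", "goldencopy"),
           ("777", "james", "aucland"), ("888", "cheffler", "kiwi")]

# Append a generated unique_id value to every incoming row.
unique_id = count(start=1)
with_generated_col = [row + (next(unique_id),) for row in xyzbank]

for row in with_generated_col:
    print(row)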
================================================================================

Surrogate_Key_Stage :

Surrogate Key Importance:


SURROGATE KEY IN DATASTAGE:
Surrogate Key is a unique identification key. It is an alternative to the natural key:
a natural key may be an alphanumeric composite key, but the surrogate key is
always a single numeric key.
The surrogate key is used to generate key columns, for which characteristics can be
specified. The surrogate key generates sequential, incremental and unique integers from a
provided start point. It can have a single input and a single output link.

WHAT IS THE IMPORTANCE OF A SURROGATE KEY?

A surrogate key is a primary key for a dimension table (the surrogate key is an alternative to the primary key).
The main importance of using a surrogate key is that it is not affected by changes going on in the source database.
With a surrogate key the same business key can appear in multiple rows (for example across SCD versions),
which cannot happen when the business key itself is the primary key.
By using a surrogate key we can also continue the sequence for any job: if a job aborted after
n records were loaded, by using the surrogate key you can continue the sequence from n+1.

Surrogate Key Generator:


The Surrogate Key Generator stage is a processing stage that generates surrogate key columns and
maintains the key source.
A surrogate key is a unique primary key that is not derived from the data that it represents, therefore
changes to the data will not change the primary key. In a star schema database, surrogate keys are
used to join a fact table to a dimension table.
The Surrogate Key Generator stage can be used to:

Create or delete the key source before other jobs run
Update a state file with a range of key values
Generate surrogate key columns and pass them to the next stage in the job
View the contents of the state file

Generated keys are 64-bit integers, and the key source can be a state file or a database sequence.

Creating the key source:


Drag the surrogate key stage from palette to parallel job canvas with no input and output links.

Double click on the surrogate key stage and click on properties tab.

Properties:

Key Source Action = create


Source Type: Flat File or Database sequence (in this case we are using Flat File).
When you run the job it will create an empty state file.
If you want to check the content, change View Stat File = YES and check the job log for details:
skey_genstage,0: State file /tmp/skeycutomerdim.stat is empty.
If you try to create the same file again, the job will abort with the following error:
skey_genstage,0: Unable to create state file /tmp/skeycutomerdim.stat: File exists.
Deleting the key source:

Updating the stat File:


To update the state file, add the Surrogate Key stage to a job with a single input link from another stage.
We use this process to update the state file if it is corrupted or deleted.
1) Open the surrogate key stage editor and go to the Properties tab.

If the state file exists we can update it; otherwise we can create and then update it.
We are using the SkeyValue parameter to update the state file via a Transformer stage.

Generating Surrogate Keys:


Now that we have created the state file, we will generate keys using it.
Click on the Surrogate Key stage, go to Properties, and type a name for the surrogate key
column in the Generated Output Column Name property.

Go to Output and define the mapping like below.

In the Row Generator we are using 10 rows, hence when we run the job we see 10 skey values in the output.
I have updated the state file with 100 and below is the output.

If you want to generate the key values from the beginning, you can use the following properties in the surrogate
key stage.

If the key source is a flat file, specify how keys are generated:
To generate keys in sequence from the highest value that was last used, set the Generate Key from Last
Highest Value property to Yes. Any gaps in the key range are ignored.
To specify a value to initialize the key source, add the File Initial Value property to the Options group,
and specify the start value for key generation.
To control the block size for key ranges, add the File Block Size property to the Options group, set this
property to User specified, and specify a value for the block size.
If there is no input link, add the Number of Records property to the Options group, and specify
how many records to generate.
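The following Python sketch shows the idea behind a flat-file key source: keys are generated in sequence from the last highest value persisted in a state file. The file path and format here are illustrative assumptions, not the format DataStage actually uses:

import os

STATE_FILE = "/tmp/skeycustomerdim.stat"   # illustrative path, standing in for the state file

def next_surrogate_keys(n, state_file=STATE_FILE):
    # Return n new keys, continuing from the highest value stored in the state file.
    last = 0
    if os.path.exists(state_file):
        with open(state_file) as f:
            text = f.read().strip()
            last = int(text) if text else 0
    keys = list(range(last + 1, last + n + 1))
    with open(state_file, "w") as f:
        f.write(str(keys[-1]))             # persist the new highest value
    return keys

print(next_surrogate_keys(10))   # 1..10 on the first run
print(next_surrogate_keys(10))   # 11..20 on the next run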

==================================================
===================================

SCD :
WHAT IS SCD IN DATASTAGE ? TYPES OF SCD IN DATASTAGE?
SCDs are nothing but Slowly Changing Dimensions.
SCDs are dimensions whose data changes slowly, rather than changing on a regular, time-based schedule.

They are
Type-1 SCD
Type-2 SCD
Type-3 SCD

Type-1 SCD: In the Type-1 SCD methodology, the new data (records) overwrites the older data
(records), and therefore historical information is not maintained.
This is used for correcting the spellings of names and for small updates to customer data.
Type-2 SCD: In the Type-2 SCD methodology, the complete historical information is tracked
by creating multiple records for a given natural key (primary key) in the dimension table,
with separate surrogate keys or different version numbers. We have unlimited historical
data preservation, as a new record is inserted each time a change is made.
Here we use different options in order to track the historical data of
customers, such as:
a) Active flag
b) Date functions
c) Version numbers
d) Surrogate keys
We use these to track all the historical data of the customer;
according to our input, we use the appropriate option to track it.

Type-3 SCD: In the Type-3 SCD methodology, only partial historical

information is maintained.

HOW TO USE TYPE -2 SCD IN DATASTAGE?


SCDs are nothing but Slowly Changing Dimensions.
Slowly Changing Dimensions are dimensions whose data changes slowly rather than
on a regular, time-based schedule.
The most common Slowly Changing Dimensions are of three types:
Type-1, Type-2 and Type-3 SCDs.
Type-2 SCD: The Type-2 methodology tracks the complete historical information by creating
multiple records for a given natural key in the dimension table, with separate surrogate keys or
different version numbers.
We have unlimited history preservation, as a new record is inserted each time a change is
made.

SLOWLY CHANGING DIMENSIONS (SCD) - TYPES | DATA WAREHOUSE

Slowly Changing Dimensions: Slowly changing dimensions are the dimensions in which the data
changes slowly, rather than changing regularly on a time basis.
For example, you may have a customer dimension in a retail domain. Let's say the customer is in
India and every month he does some shopping. Creating the sales report for the customer is
easy. Now assume that the customer is transferred to the United States and does his shopping there.
How do you record such a change in your customer dimension?
You could sum or average the sales done by the customer, but then you won't get an exact
comparison of the sales done by the customer: as the customer's salary increased after the
transfer, he/she might do more shopping in the United States compared to India, so if you sum the total
sales, the sales done by the customer might look stronger than they really are. You could create a
second customer record and treat the transferred customer as a new customer, but this will
create problems too.
Handling these issues involves SCD management methodologies, which are referred to as Type 1 to
Type 3. The different types of slowly changing dimensions are explained in detail below.
SCD Type 1: SCD type 1 methodology is used when there is no need to store historical data in the
dimension table. This method overwrites the old data in the dimension table with the new data. It is
used to correct data errors in the dimension.
As an example, consider a customer table with the below data:

surrogate_key  customer_id  customer_name  Location
1              1            Marspton       Illions

Here the customer name is misspelt. It should be Marston instead of Marspton. If you use the type 1
method, it simply overwrites the data. The data in the updated table will be:

surrogate_key  customer_id  customer_name  Location
1              1            Marston        Illions

The advantage of type 1 is ease of maintenance and less space occupied. The disadvantage is that
no historical data is kept in the data warehouse.
SCD Type 3: In the type 3 method, only the current status and the previous status of the row are maintained
in the table. To track these changes, two separate columns are created in the table. The customer
dimension table in the type 3 method will look as:

surrogate_key  customer_id  customer_name  Current_Location  Previous_Location
1              1            Marston        Illions           NULL

Let's say the customer moves from Illions to Seattle; the updated table will look as:

surrogate_key  customer_id  customer_name  Current_Location  Previous_Location
1              1            Marston        Seattle           Illions

Now if the customer moves again, from Seattle to NewYork, the updated table will be:

surrogate_key  customer_id  customer_name  Current_Location  Previous_Location
1              1            Marston        NewYork           Seattle

The type 3 method keeps only limited history, and how much depends on the number of columns you create.
SCD Type 2: SCD type 2 stores the entire history of the data in the dimension table. With type 2 we
can store unlimited history in the dimension table. In type 2, you can store the data in three different
ways. They are:

Versioning

Flagging

Effective Date

SCD Type 2 Versioning: In the versioning method, a sequence number is used to represent the
change. The latest sequence number always represents the current row and the previous sequence
numbers represent the past data.
As an example, let's use the same customer who changes location. Initially the
customer is in the Illions location and the data in the dimension table will look as:

surrogate_key  customer_id  customer_name  Location  Version
1              1            Marston        Illions   1

The customer moves from Illions to Seattle and the version number is incremented. The
dimension table will look as:

surrogate_key  customer_id  customer_name  Location  Version
1              1            Marston        Illions   1
2              1            Marston        Seattle   2

Now if the customer moves to another location again, a new record will be inserted into the
dimension table with the next version number.
SCD Type 2 Flagging: In the flagging method, a flag column is created in the dimension table. The
current record has the flag value 1 and the previous records have the flag value 0.
For the first time, the customer dimension will look as:

surrogate_key  customer_id  customer_name  Location  Flag
1              1            Marston        Illions   1

Now when the customer moves to a new location, the old record is updated with flag value
0 and the latest record gets flag value 1:

surrogate_key  customer_id  customer_name  Location  Flag
1              1            Marston        Illions   0
2              1            Marston        Seattle   1

SCD Type 2 Effective Date: In the effective date method, the period of the change is tracked using the
start_date and end_date columns in the dimension table:

surrogate_key  customer_id  customer_name  Location  Start_date    End_date
1              1            Marston        Illions   01-Mar-2010   20-Feb-2011
2              1            Marston        Seattle   21-Feb-2011   NULL

The NULL in End_date indicates the current version of the data; the remaining records
represent the past data.
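To make the effective-date mechanics concrete, here is a small Python sketch (one possible implementation, not the DataStage job described below) that expires the current row and inserts a new current row when a customer changes location:

from datetime import date, timedelta

# Current dimension rows: (surrogate_key, customer_id, name, location, start_date, end_date)
dimension = [
    (1, 1, "Marston", "Illions", date(2010, 3, 1), None),   # None = current row
]

def scd2_move(dimension, customer_id, new_location, change_date):
    # Expire the current row for this customer and insert a new current row.
    updated = []
    max_key = max(r[0] for r in dimension)
    for row in dimension:
        key, cust, name, loc, start, end = row
        if cust == customer_id and end is None and loc != new_location:
            # expire the old version the day before the change
            updated.append((key, cust, name, loc, start, change_date - timedelta(days=1)))
            updated.append((max_key + 1, cust, name, new_location, change_date, None))
        else:
            updated.append(row)
    return updated

dimension = scd2_move(dimension, 1, "Seattle", date(2011, 2, 21))
for row in dimension:
    print(row)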
SCD-2 Implementation in Datastage:
Slowly changing dimension Type 2 is a model where the whole history is stored in the database. An
additional dimension record is created and the segmenting between the old record values and the new
(current) value is easy to extract and the history is clear.
The fields 'effective date' and 'current indicator' are very often used in that dimension and the fact
table usually stores dimension key and version number.
SCD 2 implementation in Datastage
The job described and depicted below shows how to implement SCD Type 2 in Datastage. It is one of
many possible designs which can implement this dimension.
For this example, we will use a table with customer data (its name is D_CUSTOMER_SCD2) which
has the following structure and data:
D_CUSTOMER dimension table before loading
Datastage SCD2 job design
The most important facts and stages of the CUST_SCD2 job processing:
The dimension table with customers is refreshed daily and one of the data sources is a text file. For
the purpose of this example the CUST_ID=ETIMAA5 differs from the one stored in the database and it
is the only record with changed data. It has the following structure and data:
SCD 2 - Customers file extract:
There is a hashed file (Hash_NewCust) which handles a lookup of the new data coming from the text
file.
A T001_Lookups transformer does a lookup into a hashed file and maps new and old values to
separate columns.
SCD 2 lookup transformer
A T002_Check_Discrepacies_exist transformer compares old and new values of records and passes
through only the records that differ.
SCD 2 check discrepancies transformer
A T003 transformer handles the UPDATE and INSERT actions for a record. The old record is updated
with the current indicator flag set to no, and the new record is inserted with the current indicator flag set to
yes, the record version increased by 1, and the current date.
SCD 2 insert-update record transformer
ODBC Update stage (O_DW_Customers_SCD2_Upd) - update action 'Update existing rows only' and
the selected key columns are CUST_ID and REC_VERSION so they will appear in the constructed
where part of an SQL statement.
ODBC Insert stage (O_DW_Customers_SCD2_Ins) - insert action 'insert rows without clearing' and
the key column is CUST_ID.
D_CUSTOMER dimension table after Datawarehouse refresh
===============================================================

Pivot_Enterprise_Stage:

Pivot Enterprise stage is a processing stage which pivots data horizontally or vertically depending
upon the requirements. There are two types:

1. Horizontal
2. Vertical

The horizontal pivot operation turns input columns into multiple rows, which is exactly opposite to the
vertical pivot operation, which turns multiple input rows into columns.
Let's try to understand them one by one with the following example.

1. Horizontal Pivot Operation

Consider the following table:

Product Type  Color_1  Color_2  Color_3
Pen           Yellow   Blue     Green
Dress         Pink     Yellow   Purple

Step 1: Design Your Job Structure Like below.

Configure above table with input sequential stage se_product_clr_det.


Step 2: Lets configure Pivot enterprise stage. Double click on it. Following window will pop up.

Select Horizontal for Pivot Type from drop-down menu under Properties tab for horizontal Pivot
operation.
Step 3: Click on the Pivot Properties tab, where we need to check the box against Pivot Index. After
that, a column named Pivot_Index will appear under the Name column; also declare a new column
named Color as shown below.

Step 4: Now we have to mention columns to be pivoted under Derivation against column Color.
Double click on it. Following Window will pop up.

Select columns to be pivoted from Available column pane as shown. Click OK.
Step 5: Under Output tab, only map pivoted column as shown.

Configure output stage. Give the file path. See below image for reference.

Step 6: Compile and run the job and check the output.

This is how we can set multiple input columns onto a single column (as here for the colors).
Vertical Pivot Operation:
Here, we are going to use Pivot Enterprise stage to vertically pivot data. We are going to set multiple
input rows to a single row. The main advantage of this stage is we can use aggregation functions like
avg, sum, min, max, first, last etc. for pivoted column. Lets see how it works.
Consider an output data of Horizontal Operation as input data for the Pivot Enterprise stage. Here, we
will be adding one extra column for aggregation function as shown in below table.

Product  Color   Prize
Pen      Yellow  38
Pen      Blue    43
Pen      Green   25
Dress    Pink    1000
Dress    Yellow  695
Dress    purple  738

Let's study the vertical pivot operation step by step.

Step 1: Design your job structure like below. Configure the above table data with the input sequential file
se_product_det.

Step 2: Open the Pivot Enterprise stage and select Pivot type as vertical under the properties tab.

Step 3: Under the Pivot Properties tab, specify at least one pivot column and one group-by column. Here, we
declared Product as the group-by column and Color and Prize as pivot columns. Let's see how to use
aggregation functions in the next step.

Step 4: On clicking "Aggregation functions required for this column" for a particular column, the following
window will pop up, in which we can select whichever functions are required for that column.
Here we are using the min, max and average functions with proper precision and scale for the Prize column
as shown.

Step 5: Now we just have to do the mapping under the output tab as shown below.

Step 6: Compile and run the job and check the output.
Output :

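As an illustration of the vertical pivot with aggregation (a Python sketch of the row logic, not the stage itself), grouping the sample rows on Product and computing min/max/avg of Prize looks like this:

from collections import defaultdict

rows = [("Pen", "Yellow", 38), ("Pen", "Blue", 43), ("Pen", "Green", 25),
        ("Dress", "Pink", 1000), ("Dress", "Yellow", 695), ("Dress", "purple", 738)]

# Group the input rows by the group-by column (Product).
groups = defaultdict(list)
for product, color, prize in rows:
    groups[product].append((color, prize))

# One output row per group, with the pivoted colors and aggregated prizes.
for product, items in groups.items():
    colors = [c for c, _ in items]
    prizes = [p for _, p in items]
    print(product, colors,
          "min=", min(prizes), "max=", max(prizes),
          "avg=", round(sum(prizes) / len(prizes), 2))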
One more approach:

Many people have the following misconceptions about Pivot stage.

1) It converts rows into columns


2) By using a pivot stage, we can convert 10 rows into 100 columns and 100 columns
into 10 rows
3) You can add more points here!!

Let me first tell you that a Pivot stage only CONVERTS COLUMNS INTO ROWS and
nothing else. Some DS Professionals refer to this as NORMALIZATION. Another fact about
the Pivot stage is that it's irreplaceable, i.e. no other stage has this functionality of
converting columns into rows!!! So, that makes it unique, doesn't it!!!
Let's cover how exactly it does it....
For example, lets take a file with the following fields: Item, Quantity1, Quantity2,
Quantity3....
Item~Quantity1~Quantity2~Quantity3
ABC~100~1000~10000
DEF~200~2000~20000
GHI~300~3000~30000
Basically you would use a pivot stage when you need to convert those 3 Quantity fields
into a single field which contains a unique Quantity value per row, i.e. you would need
the following output:
Item~Quantity
ABC~100
ABC~1000
ABC~10000
DEF~200
DEF~2000
DEF~20000
GHI~300
GHI~3000
GHI~30000

How to achieve the above in Datastage???


In this case our source would be a flat file. Read it using any file stage of your
choice: Sequential file stage, File set stage or Dataset stage. Specify 4 columns in
the Output column derivation tab.

Now connect a Pivot stage from the Tool palette to the above output link and
create an output link for the Pivot stage itself (for enabling the Output tab for the
pivot stage).
Unlike other stages, a pivot stage doesn't use the generic GUI stage page. It has a
stage page of its own. And by default the Output columns page would not have
any fields. Hence, you need to manually type in the fields. In this case just type in
the 2 field names : Item and Quantity. However manual typing of the columns
becomes a tedious process when the number of fields is large. In that case you can
use the Metadata Save - Load feature: go to the input columns tab of the pivot stage,
save the table definition and load it in the output columns tab. This is the way
I use it!!!
Now, you have the following fields in the Output Columns tab: Item and
Quantity. Here comes the tricky part, i.e. you need to specify the
DERIVATION. In case the field names in the Output columns tab are the same as in the
Input tab, you need not specify any derivation, i.e. in this case for the Item field
you need not specify any derivation. But if the Output columns tab has new field
names, you need to specify a Derivation or you would get a RUN-TIME error for
free....
For our example, you need to type the Derivation for the Quantity field as
Column name Derivation
Item Item (or you can leave this blank)
Quantity Quantity1, Quantity2, Quantity3.
Just attach another file stage and view your output!!! So, objective met!!!
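For comparison, here is the same Item/Quantity example expressed as a small Python sketch of the columns-into-rows operation; the tilde-delimited print format just mirrors the sample file above:

rows = [
    {"Item": "ABC", "Quantity1": 100, "Quantity2": 1000, "Quantity3": 10000},
    {"Item": "DEF", "Quantity1": 200, "Quantity2": 2000, "Quantity3": 20000},
    {"Item": "GHI", "Quantity1": 300, "Quantity2": 3000, "Quantity3": 30000},
]

# One output row per (Item, QuantityN) pair, i.e. columns pivoted into rows.
pivoted = [
    {"Item": r["Item"], "Quantity": r[col]}
    for r in rows
    for col in ("Quantity1", "Quantity2", "Quantity3")
]

for row in pivoted:
    print("%s~%s" % (row["Item"], row["Quantity"]))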

Sequence_Activities :
In this article I will explain how to use DataStage looping activities in a sequencer.
I have a requirement where I need to pass a file id as a parameter, reading it from a file. In future the
number of file ids will increase, so I won't have to add a job or change the sequencer if I take advantage of
DataStage looping.
Contents of the file:
1|200
2|300
3|400
I need to read the above file and pass the second field as a parameter to the job. I have created one parallel
job with pFileID as a parameter.
Step 1: Count the number of lines in the file so that we can set the upper limit in the DataStage Start
Loop activity.
Sample routine to count lines in a file:
Argument: FileName (including path)

Deffun DSRMessage(A1, A2, A3) Calling "*DataStage*DSR_MESSAGE"
Equate RoutineName To "CountLines"
Command = "wc -l ":FileName:" | awk '{print $1}'"
Call DSLogInfo("Executing Command To Get the Record Count ", Command)
* call support routine that executes a Shell command.
Call DSExecute("UNIX", Command, Output, SystemReturnCode)
* Log any and all output as an Information type log message,
* unless the system return code indicates that an error occurred,
* when we log a slightly different Warning type message.
vOutput = Convert(Char(254), "", Output)
If (SystemReturnCode = 0) And (Num(vOutput) = 1) Then
   Call DSLogInfo("Command Executed Successfully ", Command)
   Output = Convert(Char(254), "", Output)
   Call DSLogInfo("Here is the Record Count In ":FileName:" = ":Output, Output)
   Ans = Output
   *GoTo NormalExit
End Else
   Call DSLogInfo("Error when executing command ", Command)
   Call DSLogFatal(Output, RoutineName)
   Ans = 1
End

Now we use the StartLoop.$Counter variable to get the file id, using a combination of the grep and awk
commands; each iteration picks up one file id.

Finally the sequence job looks like below.

I hope everyone likes this post.


===============================================================

TRANSFORMER STAGE TO FILTER THE DATA :


TRANSFORMER STAGE TO FILTER THE DATA

Take Job Design as below

If our requirement is to filter the data department wise from the file below
samp_tabl
1,sam,clerck,10

2,tom,developer,20
3,jim,clerck,10
4,don,tester,30
5,zeera,developer,20
6,varun,clerck,10
7,luti,production,40
8,raja,priduction,40
And our requirement is to get the target data as below:
In Target1 we need the dept 10 and dept 40 employees.
In Target2 we need the dept 30 employees.
In Target3 we need the dept 20 and dept 40 employees.
Read and load the data in the source file.
In the Transformer stage just drag and drop the data to the target tables.
Write the expressions in the constraints as below:
dept_no=10 Or dept_no=40 for target 1
dept_no=30 for target 2
dept_no=20 Or dept_no=40 for target 3
Click OK.
Give a file name at each target file, then
compile and run the job to get the output.

Shared Container :
Suppose 4 jobs are controlled by a sequencer (job 1, job 2, job 3, job 4). If job 1 has 10,000
rows and after running the job only 5,000 rows have been loaded into the target table, the rest are not
loaded and the job aborts. How can you sort out the problem? If the job sequencer
synchronises or controls the 4 jobs but job 1 has a problem, go to the Director and
check what type of problem it is showing: a data type problem, a warning message, a job failure or a job
abort. If the job fails it usually means a data type problem or a missing column action. Then go to the Run
window -> Tracing -> Performance, or in your target table -> General -> Action, and select one of
these two options:
(i) On Fail -- Commit, Continue
(ii) On Skip -- Commit, Continue.
First check how many rows are already loaded, then select the On Skip option and continue; for the
remaining data that was not loaded, select On Fail, Continue and run the job again;

you will definitely get a success message.


---------------------------------------------------------------------------------------------------------
Question: I want to process 3 files sequentially, one by one. How can I do that? While processing
the files it should fetch the files automatically.
Ans: If the metadata for all the files is the same, then create a job having the file name as a parameter, then
use the same job in a routine and call the job with different file names, or you can create a sequencer to use
the job.
------------------------------------------------------------------------------------------------------------------------------------
Parameterize the file name.
Build the job using that parameter.
Build a job sequencer which will call this job and will accept the parameter for the file name.
Write a UNIX shell script which will call the job sequencer three times, passing a different file
each time.
RE: What happens if RCP is disabled?
In such a case OSH has to perform import and export every time the job runs, and the
job's processing time also increases.
-------------------------------------------------------------------------------------------------------------------
Runtime column propagation (RCP): If RCP is enabled for a job, and specifically for those
stages whose output connects to the shared container input, then metadata will be propagated at
run time, so there is no need to map it at design time.
If RCP is disabled for the job, OSH has to perform import and export every time
the job runs and the job's processing time also increases.
You then have to manually enter all the column descriptions in each stage. RCP = Runtime Column
Propagation.
Question:
Source:
Eno  Ename
1    a,b
2    c,d
3    e,f

Target:
Eno  Ename
1    a
2    b
3    c

Difference Between Join,Lookup and Merge :

Datastage Scenarios and solutions :


Field mapping using Transformer stage:
Requirement:
field will be right justified zero filled, Take last 18 characters
Solution:
Right("0000000000":Trim(Lnk_Xfm_Trans.link),18)
Scenario 1:
We have two datasets with 4 columns each, with different names. We should create a dataset with 4
columns: 3 from one dataset and one column holding the record count of the other dataset.
We can use an aggregator with a dummy column to get the count from one dataset, then do a lookup
from the other dataset and map it to the third dataset.

Something similar to the below design:

Scenario 2:
Following is the existing job design. But the requirement changed: the head and trailer datasets
should be populated even if detail records are not present in the source file. The job below doesn't
do that.

Hence the above job was changed to meet the new requirement:

A row generator was used with a copy stage, giving a default value (zero) for the count column coming in from
the row generator. If there are no detail records, it will pick the record count from the row generator.
We have a source which is a sequential file with header and footer. How to remove the header
and footer while reading this file using sequential file stage of Datastage?
Sol: Type the command in putty: sed '1d;$d' file_name > new_file_name (run this
in the job's before-job subroutine, then use the new file in the seq stage).
If I have source like Col1 = A, A, B and the target should be Col1, Col2 = (A,1), (A,2), (B,1), how
do I achieve this output using a stage variable in a Transformer stage?
Sort on Col1 with a key change column, then use: If keyChange = 1 Then 1 Else StageVariable + 1

2)input is like this:
no,char
1,a
2,b
3,a
4,b
5,a
6,a
7,b
8,a
But the output is in this form with row numbering of Duplicate occurence
output:
no,char,Count
"1","a","1"
"6","a","2"
"5","a","3"
"8","a","4"
"3","a","5"
"2","b","1"
"7","b","2"
"4","b","3"
3)Input is like this:
file1
10
20
10
10
20

30

Output is like:
file2    file3 (duplicates)
10       10
20       10
30       20
4)Input is like:
file1
10
20
10
10
20
30
Output is like multiple occurrences in one file and single occurrences in another file:
file2 (multiple occurrences)    file3 (single occurrences)
10                              30
10
10
20
20
5)Input is like this:
file1
10
20
10
10
20
30
Output is like:
file2    file3
10       30
20
6)Input is like this:
file1
1
2
3
4

5
6
7
8
9
10
Output is like:
file2 (odd)    file3 (even)
1              2
3              4
5              6
7              8
9              10

7) How to calculate Sum(sal), Avg(sal), Min(sal), Max(sal) without using the Aggregator stage?
8) How to find the first sal and last sal in each dept without using the Aggregator stage?
9) How many ways are there to perform the remove-duplicates function without using the
Remove Duplicates stage?
Scenario:
source has 2 fields like:

COMPANY  LOCATION
IBM      HYD
TCS      BAN
IBM      CHE
HCL      HYD
TCS      CHE
IBM      BAN
HCL      BAN
HCL      CHE

THEN THE OUTPUT LOOKS LIKE THIS:

Company  Loc            Count
TCS      HYD, BAN, CHE  3
IBM      HYD, BAN, CHE  3
HCL      HYD, BAN, CHE  3
Solution:
SeqFile ----> Sort ----> Transformer ----> RemoveDuplicates ----> Dataset

Sort:
Key = Company
Sort Order = Asc
Create Key Change Column = True

Transformer:
Create a stage variable Company1:
Company1 = If in.keychange = 1 Then in.Location Else Company1:',':in.Location
Drag and drop in the derivation:
Company .................... Company
Company1 ................... Location

RemoveDuplicates:
Key = Company
Duplicates To Retain = Last

11)The input is
Shirt|red|blue|green
Pant|pink|red|blue
Output should be,
Shirt:red
Shirt:blue
Shirt:green
pant:pink
pant:red
pant:blue
Solution:
It is the reverse of the Pivot stage approach; use
seq ------ sort ------ tr ------ rd ------ tr ------ tgt
In the sort stage set Create Key Change Column = True.
In the first transformer create a stage variable: if the key change column = 1 then the column value,
else stagevar::column.
In the remove duplicates stage set Duplicates To Retain = Last.
In the final transformer use the Field function to separate the columns.

Scenario: :

source
col1 col3
1 samsung
1 nokia
1 ercisson
2 iphone
2 motrolla
3 lava
3 blackberry
3 reliance
Expected Output
col1  col2     col3        col4
1     samsung  nokia       ercisson
2     iphone   motrolla
3     lava     blackberry  reliance

You can get it by using:
Sort stage --- Transformer stage --- RemoveDuplicates --- Transformer --- tgt

Ok
First Read and Load the data into your source file( For Example Sequential File )
And in Sort stage

select key change column = True ( To Generate Group ids)

Go to the Transformer stage and create one stage variable.
You can do this by right-clicking in the stage variables area, going to properties, and naming it as you wish
(for example temp),
and in the expression write as below:
if keychange column = 1

then column name

else temp:',':column name

This gives you the required column with comma-delimited values.
On the remove duplicates stage the key is col1; set the option Duplicates To Retain to Last.
In the second transformer drop the concatenated column and define 3 columns col2, col3, col4:
in the col2 derivation give Field(InputColumn, ",", 1),
in the col3 derivation give Field(InputColumn, ",", 2), and
in the col4 derivation give Field(InputColumn, ",", 3).
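A Python sketch of the same two-step idea (concatenate per key with a running stage variable, then split back with Field-like logic); the data is the col1/col3 sample above and the padding to three output columns is an assumption:

from collections import defaultdict

source = [(1, "samsung"), (1, "nokia"), (1, "ercisson"),
          (2, "iphone"), (2, "motrolla"),
          (3, "lava"), (3, "blackberry"), (3, "reliance")]

# Step 1: concatenate values per key, like the stage variable temp:',':column.
concat = defaultdict(str)
for col1, col3 in source:
    concat[col1] = col3 if concat[col1] == "" else concat[col1] + "," + col3

# Step 2: split the concatenated string back into col2..col4, like Field(col, ",", n).
for col1, joined in concat.items():
    parts = (joined.split(",") + ["", "", ""])[:3]   # pad to three columns
    print(col1, parts[0], parts[1], parts[2])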

Scenario:
12) Consider the following employees data as source:
employee_id, salary
--------------------
10, 1000
20, 2000
30, 3000
40, 5000

Create a job to find the sum of salaries of all employees, and this sum should repeat for all
the rows.
The output should look like:
employee_id, salary, salary_sum
--------------------------------
10, 1000, 11000
20, 2000, 11000
30, 3000, 11000
40, 5000, 11000

Scenario:
I have two source tables/files numbered 1 and 2.
In the target, there are three output tables/files, numbered 3, 4 and 5.
The scenario is that:
to output 4 -> the records which are common to both 1 and 2 should go;
to output 3 -> the records which are only in 1 but not in 2 should go;
to output 5 -> the records which are only in 2 but not in 1 should go.
sltn: src1 -----> copy1 ------> (records only in 1) ------> output 3
      copy1 and copy2 ------> Join (inner type) ------> output 4 (common records)
      src2 -----> copy2 ------> (records only in 2) ------> output 5

Solution (for scenario 12, the repeating salary sum):
Take Source ---> Transformer (add a new column on both output links and assign it the value 1) --->
1) Aggregator (group by that new column and sum the salary)
2) Lookup/Join (join back on that new column) ---> tgt.
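The same idea in a few lines of Python (a sketch of the aggregate-then-join-back pattern, not the DataStage job itself):

rows = [(10, 1000), (20, 2000), (30, 3000), (40, 5000)]

# Aggregator step: group on a constant dummy key and sum the salaries.
salary_sum = sum(sal for _, sal in rows)

# Join/lookup step: attach the single aggregated value back to every row.
output = [(emp, sal, salary_sum) for emp, sal in rows]
for row in output:
    print(row)     # (10, 1000, 11000), (20, 2000, 11000), ...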

Scenario:
sno,sname,mark1,mark2,mark3
1,rajesh,70,68,79
2,mamatha,39,45,78
3,anjali,67,39,78
4,pavani,89,56,45
5,indu,56,67,78
out put is
sno,snmae,mark1,mark2,mark3,delimetercount
1,rajesh,70,68,79,4
2,mamatha,39,45,78,4
3,anjali,67,39,78,4
4,pavani,89,56,45,4
5,indu,56,67,78,4
seq ---> trans ---> seq
Create one stage variable, e.g. delimiter,
and put the derivation on the stage variable as DSLink4.sno : "," : DSLink4.sname : "," : DSLink4.mark1 :
"," : DSLink4.mark2 : "," : DSLink4.mark3,
then do the mapping and create one more column, count, as integer type,
and put the derivation on the count column as Count(delimiter, ",").
scenario:
sname  total_vowels_count
Allen  2
Scott  1
Ward   1
Under Transformer Stage Description:
total_Vowels_Count=Count(DSLink3.last_name,"a")+Count(DSLink3.last_name,"e")
+Count(DSLink3.last_name,"i")+Count(DSLink3.last_name,"o")
+Count(DSLink3.last_name,"u").

Scenario:

1) Daily we are getting some huge files of data; all the files' metadata is the same. How can we load them
into the target table?
Use File Pattern in the Sequential File stage.
2) One column has 10 records; at run time we have to send the 5th and 6th records to the target.
How can we send them?
This can be done by using a UNIX command in the Sequential File stage's filter option.
How can we get 18 months date data in transformer stage?
Use transformer stage after input seq file and try this one as constraint in transformer
stage :
DaysSinceFromDate(CurrentDate(), DSLink3.date_18)<=548 OR
DaysSinceFromDate(CurrentDate(), DSLink3.date_18)<=546
where date_18 column is the column having that date which needs to be less or equal to
18 months and 548 is no. of days for 18 months and for leap year it is 546(these
numbers you need to check).

What is the difference between Force Compile and Compile?


Difference between Compile and Validate?
Compile option only checks for all mandatory requirements like link requirements, stage
options and all. But it will not check if the database connections are valid.
Validate is equivalent to Running a job except for extraction/loading of data. That is,
validate option will test database connectivity by making connections to databases.

How to FInd Out Duplicate Values Using Transformer?


You can capture the duplicate records based on keys using Transformer stage variables.
1. Sort and partition the input data of the transformer on the key(s) which defines the duplicate.
2. Define two stage variables, let's say StgVarPrevKeyCol(data type same as KeyCol) and StgVarCntr as Integer with
default value 0
where KeyCol is your input column which defines the duplicate.
Expression for StgVarCntr(1st stg var-- maintain order):
If DSLinknn.KeyCol = StgVarPrevKeyCol Then StgVarCntr + 1 Else 1
Expression for StgVarPrevKeyCol(2nd stg var):
DSLinknn.KeyCol
3. Now in the constraint, filtering rows where StgVarCntr = 1 will give you the unique records, and filtering
StgVarCntr > 1 will give you the duplicate records.
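Here is a Python sketch of the same stage-variable technique: sort on the key, keep the previous key and a counter, and the counter value tells you whether a row is the first occurrence or a repeat. The sample data is borrowed from the no/char scenario earlier:

rows = [("a", 1), ("a", 6), ("a", 5), ("a", 8), ("a", 3), ("b", 2), ("b", 7), ("b", 4)]
# Input must be sorted on the key, as in step 1 above.
rows.sort(key=lambda r: r[0])

prev_key, counter = None, 0          # the two "stage variables"
for key, no in rows:
    counter = counter + 1 if key == prev_key else 1
    prev_key = key
    print(no, key, counter)          # counter = 1 marks the first occurrence of the key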

My source is Like
Sr_no, Name
10,a
10,b
20,c
30,d
30,e
40,f

My target Should Like:


Target 1:(Only unique means which records r only once)
20,c
40,f
Target 2:(Records which r having more than 1 time)
10,a
10,b
30,d
30,e
How to do this in DataStage....
**************
Use Aggregator and Transformer stages:
source --> aggregator --> transformer --> target
Perform the count in the aggregator, then take two output links in the transformer and filter the data with
count > 1 for one link and count = 1 for the second link.
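A Python sketch of this count-and-route approach (counts per key, then two "links" filtered on the count); the Counter lookup stands in for joining the aggregator output back to the rows:

from collections import Counter

rows = [(10, "a"), (10, "b"), (20, "c"), (30, "d"), (30, "e"), (40, "f")]

# Aggregator step: count the occurrences of each key.
counts = Counter(sr_no for sr_no, _ in rows)

# Transformer step: route by count, like the two constraints count = 1 and count > 1.
target1 = [r for r in rows if counts[r[0]] == 1]   # keys that occur exactly once
target2 = [r for r in rows if counts[r[0]] > 1]    # keys that occur more than once

print(target1)   # [(20, 'c'), (40, 'f')]
print(target2)   # [(10, 'a'), (10, 'b'), (30, 'd'), (30, 'e')]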
Scenario:
In my input source I have N number of records.
In the output I have 3 targets.
I want the output like: the 1st record goes to the 1st target,
the 2nd record goes to the 2nd target,
the 3rd record goes to the 3rd target, and then
the 4th record goes to the 1st target again, and so on.
Do this "without using partitioning techniques" -- remember that.

*****************
source--->trans---->target
in trans use conditions on constraints
mod(empno,3)=1
mod(empno,3)=2
mod(empno,3)=0
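A Python sketch of the same routing idea, using the row number instead of empno (mod(empno,3) behaves the same way when empno is a sequential number):

rows = ["rec%d" % i for i in range(1, 10)]       # rec1 .. rec9

targets = {0: [], 1: [], 2: []}
for i, rec in enumerate(rows, start=1):
    targets[i % 3].append(rec)                   # row 1 -> 1, row 2 -> 2, row 3 -> 0, then repeat

print(targets[1])   # 1st, 4th, 7th records
print(targets[2])   # 2nd, 5th, 8th records
print(targets[0])   # 3rd, 6th, 9th records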
Scenario:
I have input as a single column colA:
a_b_c
x_F_I
DE_GH_IF
We have to make it into three columns:
col1  col2  col3
a     b     c
x     F     I
DE    GH    IF
*********************
Transformer
create 3 columns with derivation
col1 Field(colA,'_',1)
col2 Field(colA,'_',2)
col3 Field(colA,'_',3)
**************
The Field function divides the column based on the delimiter;
if the data in the col is like A,B,C
then
Field(col,',',1) gives A
Field(col,',',2) gives B
Field(col,',',3) gives C
Another way to find duplicate values using a Transformer:
use a Sort stage before the transformer.
In the sorter set Create Cluster Key Change Column = TRUE on the key,
then in the Transformer filter the output based on the value of the cluster key change column,
which can be put in a stage variable.
====================================================================

Scenarios_Unix :

1) Convert single column to single row:


Input: filename : try
REF_PERIOD
PERIOD_NAME
ACCOUNT_VALUE
CDR_CODE
PRODUCT
PROJECT
SEGMENT_CODE
PARTNER
ORIGIN
BILLING_ACCRUAL
Output:
REF_PERIOD PERIOD_NAME ACCOUNT_VALUE CDR_CODE PRODUCT PROJECT
SEGMENT_CODE PARTNER ORIGIN BILLING_ACCRUAL
Command: cat try | awk '{printf "%s ",$1}'

2) Print the list of employees in Technology


department :
Now department name is available as a fourth field, so need to check if $4
matches with the string Technology, if yes print the line.
Command: $ awk '$4 ~ /Technology/' employee.txt
200 Jason Developer Technology $5,500
300 Sanjay Sysadmin Technology $7,000
500 Randy DBA
Technology $6,000

Operator ~ is for comparing with the regular expressions. If it matches the


default action i.e print whole line will be performed.
3) Convert single column to multiple column :
For eg: Input file contain single column with 84 rows then output should
be single column data converted to multiple of 12 columns i.e. 12 column * 7
rows with field separtor (fs ;)
Script:
#!/bin/sh
rows=`cat input_file | wc -l`
cols=12
fs=";"
awk -v r=$rows -v c=$cols -v t=$fs '
NR<r*c{printf("%s",NR%c?$0 t:$0"\n");next}{print}
END{if(NR%c&&NR<r*c){print""}}' input_file > output_file

4) Last field print:


input:
a=/Data/Files/201-2011.csv
output:
201-2011.csv
Command: echo $a | awk -F/ '{print $NF}'

5) Count no. of fields in file:


file1: a, b, c, d, 1, 2, man, fruit
Command: cat file1 | awk 'BEGIN{FS=","};{print NF}'
and you will get the output as: 8

6) Find ip address in unix server:


Command: grep -i your_hostname /etc/hosts

7) Replace the word corresponding to search


pattern:

> cat file
the black cat was chased by the brown dog.
the black cat was not chased by the brown dog.
> sed -e '/not/s/black/white/g' file
the black cat was chased by the brown dog.
the white cat was not chased by the brown dog.

8) Below is a demo for the character A and the value 65.
ASCII value of a character: it can be done in 2 ways:
1. printf '%d' "'A"
2. echo A | tr -d '\n' | od -An -t dC
Character value from ASCII: awk -v char=65 'BEGIN { printf "%c\n", char; exit }'

9) Input file:
crmplp1 cmis461 No Online
cmis462 No Offline
crmplp2 cmis462 No Online
cmis463 No Offline
crmplp3 cmis463 No Online
cmis461 No Offline
Output >crmplp1 cmis461 No Online cmis462 No Offline
crmplp2 cmis462 No Online cmis463 No Offline
Command:
awk 'NR%2?ORS=FS:ORS=RS' file

10) Variable can used in AWK


awk -F$c -v var=$c '{print $1 var $2}' filename

11) Search pattern and use special character in sed command:

sed -e '/COMAttachJob/s#)#.:JobID)#g' input_file

12) Get the content between two patterns:

sed -n '/CREATE TABLE table/,/MONITORING/p' table_Script.sql

13) Print debugging script output in a log file. Add the following commands in the
script:
exec 1>> logfilename
exec 2>> logfilename

14) Check SQL connection:
#!/bin/sh
ID=abc
PASSWD=avd
DB=sdf
exit | sqlplus -s -l $ID/$PASSWD@$DB
echo variable:$?
exit | sqlplus -s -L avd/df@dfg > /dev/null
echo variable_crr: $?

15) Trim the spaces using sed command


echo $var | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
Another option is:
Code:
var=$(echo $var | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')
echo Start $var End
16) How to add a single quote in a statement using awk:
Input:
/Admin/script.sh abc 2011/08
29/02/2012 00:00:00
/Admin/script.sh abc 2011/08
29/02/2012 00:00:00
command:
cat command.txt | sed -e 's/[[:space:]]/ /g' | awk '{print "\x27"$1,$2,$3"\x27", "\x27"$4,$5"\x27"}'
output:
'/Admin/script.sh abc 2011/08' '29/02/2012 00:00:00'
'/Admin/script.sh abc 2011/08' '29/02/2012 00:00:00'
17)
How to get files from different servers to one server in DataStage by using a unix command?
scp test.ksh dsadm@10.87.130.111:/home/dsadm/sys/
============================================================================

Unix Interview Questions :

1. How to display the 10th line of a file?


head -10 filename | tail -1
2. How to remove the header from a file?
sed -i '1 d' filename
3. How to remove the footer from a file?
sed -i '$ d' filename
4. Write a command to find the length of a line in a file?
The below command can be used to get a line from a file:
sed -n '<n> p' filename
We will see how to find the length of the 10th line in a file:
sed -n '10 p' filename | wc -c
5. How to get the nth word of a line in Unix?
cut -f<n> -d' '
6. How to reverse a string in unix?
echo "java" | rev
7. How to get the last word from a line in Unix file?
echo "unix is good" | rev | cut -f1 -d' ' | rev
8. How to replace the n-th line in a file with a new line in Unix?
sed -i'' '10 d' filename # d stands for delete
sed -i'' '10 i new inserted line' filename # i stands for insert
9. How to check if the last command was successful in Unix?
echo $?
10. Write command to list all the links from a directory?
ls -lrt | grep "^l"
11. How will you find which operating system your system is running on in UNIX?
uname -a
12. Create a read-only file in your home directory?
touch file; chmod 400 file

13. How do you see command line history in UNIX?


The 'history' command can be used to get the list of commands that we are executed.
14. How to display the first 20 lines of a file?
By default, the head command displays the first 10 lines from a file. If we change the option of
head, then we can display as many lines as we want.
head -20 filename
An alternative solution is using the sed command
sed '21,$ d' filename
The d option here deletes the lines from 21 to the end of the file
15. Write a command to print the last line of a file?
The tail command can be used to display the last lines from a file.
tail -1 filename
Alternative solutions are:
sed -n '$ p' filename
awk 'END{print $0}' filename
16. How do you rename the files in a directory with _new as suffix?
ls -lrt|grep '^-'| awk '{print "mv "$9" "$9".new"}' | sh
17. Write a command to convert a string from lower case to upper case?
echo "apple" | tr [a-z] [A-Z]
18. Write a command to convert a string to Initcap.
echo apple | awk '{print toupper(substr($1,1,1)) tolower(substr($1,2))}'
19. Write a command to redirect the output of date command to multiple files?
The tee command writes the output to multiple files and also displays the output on the terminal.
date | tee -a file1 file2 file3
20. How do you list the hidden files in current directory?
ls -a | grep '^\.'
21. List out some of the Hot Keys available in bash shell?

Ctrl+l - Clears the Screen.

Ctrl+r - Does a search in previously given commands in shell.

Ctrl+u - Clears the typing before the hotkey.

Ctrl+a - Places cursor at the beginning of the command at shell.

Ctrl+e - Places cursor at the end of the command at shell.

Ctrl+d - Kills the shell.

Ctrl+z - Places the currently running process into background.


22. How do you make an existing file empty?
cat /dev/null > filename
23. How do you remove the first number on 10th line in file?
sed '10 s/[0-9][0-9]*//' < filename
24. What is the difference between join -v and join -a?
join -v : outputs only matched lines between two files.
join -a : In addition to the matched lines, this will output unmatched lines also.
25. How do you display from the 5th character to the end of the line from a file?
cut -c 5- filename
26. Display all the files in current directory sorted by size?
ls -l | grep '^-' | awk '{print $5,$9}' |sort -n|awk '{print $2}'
Write a command to search for the file 'map' in the current directory?
find -name map -type f
How to display the first 10 characters from each line of a file?
cut -c -10 filename
Write a command to remove the first number on all lines that start with "@"?
sed '\,^@, s/[0-9][0-9]*//' < filename
How to print the file names in a directory that has the word "term"?
grep -l term *
The '-l' option make the grep command to print only the filename without printing the content of
the file. As soon as the grep command finds the pattern in a file, it prints the pattern and stops
searching other lines in the file.
How to run awk command specified in a file?
awk -f filename
How do you display the calendar for the month march in the year 1985?
The cal command can be used to display the current month calendar. You can pass the month and
year as arguments to display the required year, month combination calendar.
cal 03 1985
This will display the calendar for the March month and year 1985.
Write a command to find the total number of lines in a file?
wc -l filename
Other ways to print the total number of lines are
awk 'BEGIN {sum=0} {sum=sum+1} END {print sum}' filename
awk 'END{print NR}' filename
How to duplicate empty lines in a file?
sed '/^$/ p' < filename
Explain iostat, vmstat and netstat?

Iostat: reports on terminal, disk and tape I/O activity.

Vmstat: reports on virtual memory statistics for processes, disk, tape and CPU activity.

Netstat: reports on the contents of network data structures.

27. How do you write the contents of 3 files into a single file?
cat file1 file2 file3 > file
28. How to display the fields in a text file in reverse order?
awk 'BEGIN {ORS=""} { for(i=NF;i>0;i--) print $i," "; print "\n"}' filename
29. Write a command to find the sum of bytes (size of file) of all files in a directory.
ls -l | grep '^-'| awk 'BEGIN {sum=0} {sum = sum + $5} END {print sum}'
30. Write a command to print the lines which end with the word "end"?
grep 'end$' filename
The '$' symbol specifies the grep command to search for the pattern at the end of the line.
31. Write a command to select only those lines containing "july" as a whole word?
grep -w july filename
The '-w' option makes the grep command to search for exact whole words. If the specified
pattern is found in a string, then it is not considered as a whole word. For example: In the string
"mikejulymak", the pattern "july" is found. However "july" is not a whole word in that string.
32. How to remove the first 10 lines from a file?
sed '1,10 d' < filename
33. Write a command to duplicate each line in a file?
sed 'p' < filename
34. How to extract the username from the 'who am i' command?
who am i | cut -f1 -d' '
35. Write a command to list the files in '/usr' directory that start with 'ch' and then display the
number of lines in each file?
wc -l /usr/ch*
Another way is
find /usr -name 'ch*' -type f -exec wc -l {} \;
36. How to remove blank lines in a file ?
grep -v ^$ filename > new_filename
37. How to display the processes that were run by your user name ?
ps -aef | grep <user_name>
38. Write a command to display all the files recursively with path under current directory?
find . -depth -print
39. Display zero byte size files in the current directory?
find -size 0 -type f
40. Write a command to display the third and fifth character from each line of a file?
cut -c 3,5 filename

41. Write a command to print the fields from 10th to the end of the line. The fields in the line are
delimited by a comma?
cut -d',' -f10- filename
42. How to replace the word "Gun" with "Pen" in the first 100 lines of a file?
sed '1,100 s/Gun/Pen/' < filename
43. Write a Unix command to display the lines in a file that do not contain the word "RAM"?
grep -v RAM filename
The '-v' option tells the grep to print the lines that do not contain the specified pattern.
44 How to print the squares of numbers from 1 to 10 using awk command
awk 'BEGIN { for(i=1;i<=10;i++) {print "square of",i,"is",i*i;}}'
45. Write a command to display the files in the directory by file size?
ls -l | grep '^-' |sort -nr -k 5
46. How to find out the usage of the CPU by the processes?
The top utility can be used to display the CPU usage by the processes.
47. Write a command to remove the prefix of the string ending with '/'.
The basename utility deletes any prefix ending in /. The usage is mentioned below:
basename /usr/local/bin/file
This will display only file
48. How to display zero byte size files?
ls -l | grep '^-' | awk '/^-/ {if ($5 !=0 ) print $9 }'
49. How to replace the second occurrence of the word "bat" with "ball" in a file?
sed 's/bat/ball/2' < filename
50. How to remove all the occurrences of the word "jhon" except the first one in a line with in
the entire file?
sed 's/jhon//2g' < filename
51. How to replace the word "lite" with "light" from 100th line to last line in a file?
sed '100,$ s/lite/light/' < filename
52. How to list the files that are accessed 5 days ago in the current directory?
find -atime 5 -type f
53. How to list the files that were modified 5 days ago in the current directory?
find -mtime 5 -type f
54. How to list the files whose status is changed 5 days ago in the current directory?
find -ctime 5 -type f
55. How to replace the character '/' with ',' in a file?
sed 's/\//,/' < filename
sed 's|/|,|' < filename

56. Write a command to find the number of files in a directory.


ls -l|grep '^-'|wc -l
57. Write a command to display your name 100 times.
The Yes utility can be used to repeatedly output a line with the specified string or 'y'.
yes <your_name> | head -100
58. Write a command to display the first 10 characters from each line of a file?
cut -c -10 filename
59. The fields in each line are delimited by a comma. Write a command to display the third field from
each line of a file?
cut -d',' -f3 filename
60. Write a command to print the fields from 10 to 20 from each line of a file?
cut -d',' -f10-20 filename
61. Write a command to print the first 5 fields from each line?
cut -d',' -f-5 filename
62. By default the cut command displays the entire line if there is no delimiter in it. Which cut
option is used to supress these kind of lines?
The -s option is used to supress the lines that do not contain the delimiter.
63. Write a command to replace the word "bad" with "good" in file?
sed s/bad/good/ < filename
64. Write a command to replace the word "bad" with "good" globally in a file?
sed s/bad/good/g < filename
65. Write a command to replace the word "apple" with "(apple)" in a file?
sed s/apple/(&)/ < filename
66. Write a command to switch the two consecutive words "apple" and "mango" in a file?
sed 's/\(apple\) \(mango\)/\2 \1/' < filename
67. Write a command to display the characters from 10 to 20 from each line of a file?
cut -c 10-20 filename
68. Write a command to print the lines that has the the pattern "july" in all the files in a particular
directory?
grep july *
This will print all the lines in all files that contain the word july along with the file name. If
any of the files contain words like "JULY" or "July", the above command would not print those
lines.
69. Write a command to print the lines that has the word "july" in all the files in a directory and
also suppress the filename in the output.
grep -h july *

70. Write a command to print the lines that has the word "july" while ignoring the case.
grep -i july *
The option i make the grep command to treat the pattern as case insensitive.
71. When you use a single file as input to the grep command to search for a pattern, it won't print
the filename in the output. Now write a grep command to print the filename in the output without
using the '-H' option.
grep pattern filename /dev/null
The /dev/null or null device is special file that discards the data written to it. So, the /dev/null is
always an empty file.
Another way to print the filename is using the '-H' option. The grep command for this is
grep -H pattern filename
72. Write a command to print the file names in a directory that does not contain the word "july"?
grep -L july *
The '-L' option makes the grep command to print the filenames that do not contain the specified
pattern.
73. Write a command to print the line numbers along with the line that has the word "july"?
grep -n july filename
The '-n' option is used to print the line numbers in a file. The line numbers start from 1
74. Write a command to print the lines that start with the word "start"?
grep '^start' filename
The '^' anchor tells grep to match the pattern only at the start of a line.
75. In the text file, some lines are delimited by colon and some are delimited by space. Write a
command to print the third field of each line.
awk -F'[: ]' '{ print $3 }' filename
(A regular-expression field separator handles both delimiters; reassigning FS inside the action only takes effect from the next record, so switching FS per line is unreliable for the current line.)
76. Write a command to print the line number before each line?
awk '{print NR, $0}' filename
77. Write a command to print the second and third line of a file without using NR.
awk 'BEGIN {RS="";FS="\n"} {print $2,$3}' filename
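If awk is not required, the same two lines can be printed with sed or a head/tail combination:
sed -n '2,3p' filename
head -3 filename | tail -2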
78. How to create an alias for the complex command and remove the alias?
The alias utility is used to create an alias for a command. The command below creates an alias for the ps -aef command:
alias pg='ps -aef'
If you use pg, it will work the same way as ps -aef.
To remove the alias simply use the unalias command as
unalias pg
79. Write a command to display today's date in the format 'yyyy-mm-dd'?
The date command can be used to display today's date with a custom format:
date '+%Y-%m-%d'
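A couple of other format strings that come up often (the -d option for relative dates is specific to GNU date):
date '+%d/%m/%Y %H:%M:%S'
date -d yesterday '+%Y-%m-%d'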

------------------------------------------------------------------------------------------------------

1) Convert single column to single row:


Input: filename : try
REF_PERIOD
PERIOD_NAME
ACCOUNT_VALUE
CDR_CODE
PRODUCT
PROJECT
SEGMENT_CODE
PARTNER
ORIGIN
BILLING_ACCRUAL
Output:
REF_PERIOD PERIOD_NAME ACCOUNT_VALUE CDR_CODE PRODUCT PROJECT
SEGMENT_CODE PARTNER ORIGIN BILLING_ACCRUAL
Command: cat try | awk '{printf "%s ", $1}'

2) Print the list of employees in the Technology department:

The department name is available as the fourth field, so we need to check whether $4 matches the string Technology and, if yes, print the line.
Command: $ awk '$4 ~ /Technology/' employee.txt
200 Jason Developer Technology $5,500
300 Sanjay Sysadmin Technology $7,000
500 Randy DBA Technology $6,000

The ~ operator compares a field against a regular expression. If it matches, the default action, i.e. printing the whole line, is performed.

3) Convert a single column to multiple columns:

For example: the input file contains a single column with 84 rows; the output should be that data converted to 12 columns (i.e. 12 columns * 7 rows) with field separator ';'.
Script:
#!/bin/sh
rows=`cat input_file | wc -l`
cols=12
fs=';'
awk -v r="$rows" -v c="$cols" -v t="$fs" '
NR<r*c { printf("%s", NR%c ? $0 t : $0 "\n"); next }
{ print }
END { if (NR%c && NR<r*c) { print "" } }' input_file > output_file
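A shorter way to get the same 12-column reshaping, if the paste utility is available, is to give paste one '-' per desired output column (12 of them here):
paste -d';' - - - - - - - - - - - - < input_file > output_file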

4) Last field print:


input:
a=/Data/Files/201-2011.csv
output:
201-2011.csv
Command: echo $a | awk -F/ '{print $NF}'

5) Count no. of fields in file:


file1: a, b, c, d, 1, 2, man, fruit
Command: cat file1 | awk 'BEGIN{FS=","} {print NF}'

and you will get the output as: 8
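If only the first line needs to be inspected (for example a header row), restrict the count to it:
head -1 file1 | awk 'BEGIN{FS=","} {print NF}'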

6) Find the IP address of a Unix server:


Command: grep -i your_hostname /etc/hosts
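Depending on the platform, the address can also be read directly from the system instead of /etc/hosts; hostname -i works on many Linux systems and ifconfig lists the interfaces on most Unixes:
hostname -i
ifconfig -a | grep inet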

7) Replace the word corresponding to a search pattern:

> cat file
the black cat was chased by the brown dog.
the black cat was not chased by the brown dog.
> sed -e '/not/s/black/white/g' file
the black cat was chased by the brown dog.
the white cat was not chased by the brown dog.

8) Below is a demo for the character 'A' and the ASCII value 65.
ASCII value of a character; it can be done in 2 ways:
1. printf '%d\n' "'A"
2. echo A | tr -d '\n' | od -An -t dC
Character from an ASCII value: awk -v char=65 'BEGIN { printf "%c\n", char; exit }'

9) Input file:
crmplp1 cmis461 No Online
cmis462 No Offline
crmplp2 cmis462 No Online

cmis463 No Offline
crmplp3 cmis463 No Online
cmis461 No Offline
Output:
crmplp1 cmis461 No Online cmis462 No Offline
crmplp2 cmis462 No Online cmis463 No Offline
Command:
awk 'ORS = NR%2 ? FS : RS' file

10) Variables can be used in awk:

awk -F"$c" -v var="$c" '{print $1 var $2}' filename

11) Search for a pattern and use a special character as the sed delimiter:

sed -e '/COMAttachJob/s#)#.:JobID)#g' input_file

12) Get the content between two patterns:

sed -n '/CREATE TABLE table/,/MONITORING/p' table_Script.sql

13) Print debugging script output to a log file. Add the following commands in the script:
exec 1>> logfilename
exec 2>> logfilename
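A minimal sketch of how these redirections are typically placed at the top of a script (the log path /tmp/myscript.log is just an example); adding set -x also echoes every command into the same log:
#!/bin/sh
exec 1>>/tmp/myscript.log
exec 2>>/tmp/myscript.log
set -x
echo "script started at `date`"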

14) Check an SQL connection:
#!/bin/sh
ID=abc
PASSWD=avd
DB=sdf
exit | sqlplus -s -l $ID/$PASSWD@$DB
echo variable: $?
exit | sqlplus -s -L avd/df@dfg > /dev/null
echo variable_crr: $?
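The exit status can then drive the rest of the script; a small sketch reusing the $ID, $PASSWD and $DB variables defined in the snippet above:
exit | sqlplus -s -L $ID/$PASSWD@$DB > /dev/null 2>&1
if [ $? -ne 0 ]; then
  echo "Unable to connect to $DB" >&2
  exit 1
fi
echo "Connection to $DB OK"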

15) Trim the spaces using the sed command:

echo "$var" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
Another option is:
Code:
var=$(echo "$var" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')
echo "Start $var End"
16) How to add a single quote in a statement using awk:
Input:
/Admin/script.sh abc 2011/08 29/02/2012 00:00:00
/Admin/script.sh abc 2011/08 29/02/2012 00:00:00
Command:
cat command.txt | sed -e 's/[[:space:]]/ /g' | awk -F" " '{print "\x27" $1, $2, $3 "\x27", "\x27" $4, $5 "\x27"}'
Output:
'/Admin/script.sh abc 2011/08' '29/02/2012 00:00:00'
'/Admin/script.sh abc 2011/08' '29/02/2012 00:00:00'
=====================================================================


Sql queries :
1.Query to display the middle records, dropping the first 5 and last 5 records of the emp table:
select * from emp where rownum<=(select count(*)-5 from emp)
minus
select * from emp where rownum<=5;
2.Query to display first N records
select * from(select * from emp order by rowid) where rownum<=&n;
3.Query to display odd records only?
Q). select * from emp where (rowid,1) in (select rowid,mod (rownum,2) from emp);
4.Query to display even records only?
Q.) select * from emp where (rowid,0) in (select rowid,mod (rownum,2) from emp);

5.How to display duplicate rows in a table?


Q). select * from emp where deptno=any
(select deptno from emp having count(deptno)>1 group by deptno);
6.Query to display 3rd highest and 3rd lowest salary?
Q). select * from emp e1 where 3=(select count(distinct sal) from emp e2 where e1.sal<=e2.sal)
union
select * from emp e3 where 3=(select count(distinct sal) from emp e4 where e3.sal>=e4.sal);
7.Query to display Nth record from the table?
Q). select * from emp where rownum<=&n minus select * from emp where rownum<&n;
8.Query to display the records from M to N;
Q.) select ename from emp group by rownum,ename having rownum>1 and rownum<6;
select deptno,ename,sal from emp where rowid in(select rowid from emp
where rownum<=7 minus select rowid from emp where rownum<4);
select * from emp where rownum<=7 minus select * from emp where rownum<5;
9.Query to delete the duplicate records?
Q). delete from dup where rowid not in(select max(rowid)from dup group by eno);
10.Query to display the duplicate records?
Q). select * from dup where rowid not in(select max(rowid)from dup group by eno);
11.Query for joining two tables(OUTER JOIN)?
Q). select e.ename,d.deptno from emp e,dept d where e.deptno(+)=d.deptno order by e.deptno;
select empno,ename,sal,dept.* from emp full outer join dept on emp.deptno=dept.deptno;
Right Outer Join:
select empno,ename,sal,dept.* from emp right outer join dept on emp.deptno=dept.deptno;
Left Outer Join:
select empno,ename,sal,dept.* from emp left outer join dept on emp.deptno=dept.deptno
12.Query for joining table it self(SELF JOIN)?
Q). select e.ename "employee name", e1.ename "manager name" from emp e, emp e1 where e.mgr=e1.empno;
13.Query for combining two tables(INNER JOIN)?
select emp.empno,emp.ename,dept.deptno from emp,dept where emp.deptno=dept.deptno;
By using aliases:
select e.empno,e.ename,d.deptno from emp e,dept d where e.deptno=d.deptno;
select empno,ename,sal,dept.* from emp join dept on emp.deptno=dept.deptno;
14.Query to find the Nth highest/lowest employee salary?
for maximum (Nth highest):
select * from emp where sal in(select min(sal) from (select sal from emp group by sal order by sal desc) where rownum<=&n);
select * from emp a where &n=(select count(distinct(sal)) from emp b where a.sal<=b.sal);
for minimum (Nth lowest):
select * from emp where sal in(select max(sal) from(select sal from emp group by sal order by sal asc) where rownum<=&n);
select * from emp a where &n=(select count(distinct(sal)) from emp b where a.sal>=b.sal);
15.Find the lowest 5 employee salaries?
Q). select * from (select * from emp order by sal asc) where rownum<6;
Find the top 5 employee salaries queries
select * from (select * from emp order by sal desc) where rownum<6;
16.Find lowest salary queries
select * from emp where sal=(select min(sal) from emp);
17.Find highest salary queries
select * from emp where sal=(select max(sal) from emp);

Sample Sql Queries :

Simple select command:


SELECT SUBPRODUCT_UID
,SUBPRODUCT_PROVIDER_UID
,SUBPRODUCT_TYPE_UID
,DESCRIPTION
,EXTERNAL_ID
,OPTION_ID
,NEGOTIABLE_OFFER_IND
,UPDATED_BY
,UPDATED_ON
,CREATED_ON
,CREATED_BY FROM schemaname.SUBPRODUCT
With Inner Join:
SELECT eft.AMOUNT AS AMOUNT,
ceft.MERCHANT_ID AS MERCHANT_ID,
ca.ACCOUNT_NUMBER AS ACCOUNT_NUMBER,
bf.MBNA_CREDIT_CARD_NUMBER AS MBNA_CREDIT_CARD_NUMBER,
ceft.CUSTOMER_FIRST_NAME AS CUSTOMER_FIRST_NAME,

ceft.CUSTOMER_LAST_NAME AS CUSTOMER_LAST_NAME,
btr.TRACE_ID AS TRACE_ID,
ROWNUM
FROM schemaname.bt_fulfillment bf
INNER JOIN schemaname.balance_transfer_request btr
ON btr.bt_fulfillment_uid = bf.bt_fulfillment_uid
INNER JOIN schemaname.electronic_funds_transfer eft
ON eft.bt_fulfillment_uid = bf.bt_fulfillment_uid
INNER JOIN schemaname.creditor_eft ceft
ON ceft.ELECTRONIC_FUNDS_TRANSFER_UID =
eft.ELECTRONIC_FUNDS_TRANSFER_UID
INNER JOIN schemaname.credit_account ca
ON ca.ELECTRONIC_FUNDS_TRANSFER_UID =
ceft.ELECTRONIC_FUNDS_TRANSFER_UID
WHERE ((btr.TYPE ='CREATE_CREDIT' AND btr.STATUS ='PENDING')
OR (btr.TYPE ='RETRY_CREDIT' AND btr.STATUS ='PENDING'))
AND btr.RELEASE_DATE < CURRENT_TIMESTAMP
=====================================================================

Star schema vs. snowflake schema: Which is better?


What are the key differences in snowflake and star schema? Where should they be applied?
The Star schema vs Snowflake schema comparison brings four fundamental differences to the fore:
1. Data optimization:
Snowflake model uses normalized data, i.e. the data is organized inside the database in order to
eliminate redundancy and thus helps to reduce the amount of data. The hierarchy of the business
and its dimensions are preserved in the data model through referential integrity.

Figure 1: Snowflake model


Star model on the other hand uses de-normalized data. In the star model, dimensions directly
refer to fact table and business hierarchy is not implemented via referential integrity between
dimensions.

Figure 2: Star model


2. Business model:
A primary key is a single unique key (data attribute) that is selected for a particular table. In the previous advertiser example, the Advertiser_ID will be the primary key (business key) of a dimension table. The foreign key (referential attribute) is just a field in one table that matches the primary key of another dimension table. In our example, the Advertiser_ID could be a foreign key in Account_dimension.
In the snowflake model, the business hierarchy of the data model is represented by primary key / foreign key relationships between the various dimension tables.
In the star model, all required dimension tables have only foreign keys in the fact tables.
3. Performance:
The third differentiator in this Star schema vs Snowflake schema face-off is the performance of these models. The Snowflake model has a higher number of joins between the dimension tables and the fact table, and hence performance is slower. For instance, if you want to know the Advertiser details, this model will ask for a lot of information such as the Advertiser Name, ID and address, for which the advertiser and account tables need to be joined with each other and then joined with the fact table.
The Star model, on the other hand, has fewer joins between the dimension tables and the fact table. In this model, if you need information on the advertiser you will just have to join the Advertiser dimension table with the fact table.
Star schema provides fast response to queries and forms the ideal source for cube structures.
4. ETL
The Snowflake model loads the data marts with dependencies between the dimension tables, and hence the ETL job is more complex in design and cannot be parallelized, as the dependency model restricts it.
The Star model loads dimension tables without dependencies between dimensions, and hence the ETL job is simpler and can achieve higher parallelism.
This brings us to the end of the Star schema vs Snowflake schema debate. But where exactly do
these approaches make sense?
Where do the two methods fit in?
With the snowflake model, dimension analysis is easier. For example, how many accounts or
campaigns are online for a given Advertiser?

The star schema model is useful for metrics analysis, such as "What is the revenue for a given customer?"

Datastage Errors and Resolution :


You may get many errors in datastage while compiling the jobs or running the jobs.
Some of the errors are as follows:
a) Source file not found. Occurs if you are trying to read a file that does not exist with that name.
b) Sometimes you may get fatal errors.
c) Data type mismatches. This occurs when data type mismatches exist in the jobs.
d) Field size errors.
e) Metadata mismatch.
f) Data type size differs between source and target.
g) Column mismatch.
h) Process time out. This error can occur when the server is busy.
Some of the errors in detail:

ds_Trailer_Rec: When checking operator: When binding output schema variable


"outRec": When binding output interface field "TrailerDetailRecCount" to field
"TrailerDetailRecCount": Implicit conversion from source type "ustring" to result type
"string[max=255]": Possible truncation of variable length ustring when converting to
string using codepage ISO-8859-1.
Solution: I resolved it by changing the extended column attribute (under the transformer's metadata) to Unicode.
When checking operator: A sequential operator cannot preserve the partitioning
of the parallel data set on input port 0.
Solution: I resolved it by changing the preserve partitioning option to 'Clear' under the transformer properties.
Syntax error: Error in "group" operator: Error in output redirection: Error in output
parameters: Error in modify adapter: Error in binding: Could not find type: "subrec", line
35
Solution: It is an issue with the level number of the columns that were being added in the transformer. Their level number was blank, while the columns taken from the CFF file had it as 02. Adding the level number made the job work.
Out_Trailer: When checking operator: When binding output schema variable "outRec":
When binding output interface field "STDCA_TRLR_REC_CNT" to field
"STDCA_TRLR_REC_CNT": Implicit conversion from source type "dfloat" to result
type "decimal[10,0]": Possible range/precision limitation.
CE_Trailer: When checking operator: When binding output interface field "Data" to field
"Data": Implicit conversion from source type "string" to result type "string[max=500]":
Possible truncation of variable length string.
Implicit conversion from source type "dfloat" to result type "decimal[10,0]": Possible
range/precision limitation.
Solution: Used the transformer function 'DFloatToDecimal', as the target field is Decimal. By default the output from the aggregator is dfloat (double); applying the above function resolved the warning.

When binding output schema variable "outputData": When binding output interface field
"RecordCount" to field "RecordCount": Implicit conversion from source type
"string[max=255]" to result type "int16": Converting string to number.

Problem(Abstract)
Jobs that process a large amount of data in a column can abort with this error:
the record is too big to fit in a block; the length requested is: xxxx, the max block length
is: xxxx.
Resolving the problem
To fix this error you need to increase the block size to accommodate the record size:
1. Log into Designer and open the job.
2. Open the job properties --> Parameters --> Add Environment Variable and select: APT_DEFAULT_TRANSPORT_BLOCK_SIZE
3. Set the value. You can set this up to 256 MB, but you really shouldn't need to go over 1 MB.
NOTE: the value is specified in bytes. For example, to set the value to 1 MB:
APT_DEFAULT_TRANSPORT_BLOCK_SIZE=1048576
The default for this value is 128 KB (131072).
When setting APT_DEFAULT_TRANSPORT_BLOCK_SIZE you want to use the
smallest possible value since this value will be used for all links in the job.
For example, if your job fails with APT_DEFAULT_TRANSPORT_BLOCK_SIZE set to 1 MB and succeeds at 4 MB, you would want to do further testing to see what is the smallest value between 1 MB and 4 MB that allows the job to run, and use that value. Using 4 MB could cause the job to use more memory than needed, since all the links would use a 4 MB transport block size.
NOTE: If this error appears for a dataset use
APT_PHYSICAL_DATASET_BLOCK_SIZE.

1. While connecting through Remote Desktop: "Terminal server has exceeded the maximum number of allowed connections."

SOL: In the Command Prompt, type: mstsc /v:<ip address of server> /admin
OR
mstsc /v:<ip address> /console

2. SQL20521N. Error occurred processing a conditional compilation directive near


string. Reason code=rc.
Following link has issue description:
http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=
%2Fcom.ibm.db2.luw.messages.sql.doc%2Fdoc%2Fmsql20521n.html

3. SK_RETAILER_GROUP_BRDIGE,1: runLocally() did not reach EOF on its input data set 0.
SOL: The warning will disappear after regenerating the SK file.

4. While connecting to the DataStage client there is no response, and while restarting WebSphere services the following errors occurred:
[root@poluloro01 bin]# ./stopServer.sh server1 -user wasadmin -password
Wasadmin0708
ADMU0116I: Tool information is being logged in file
/opt/ibm/WebSphere/AppServer/profiles/default/logs/server1/stopServer.log
ADMU0128I: Starting tool with the default profile
ADMU3100I: Reading configuration for server: server1
ADMU0111E: Program exiting with error: javax.management.JMRuntimeException:
ADMN0022E: Access is denied for the stop operation on Server MBean
because of insufficient or empty credentials.

ADMU4113E: Verify that username and password information is on the command line
(-username and -password) or in the <conntype>.client.props file.
ADMU1211I: To obtain a full trace of the failure, use the -trace option.
ADMU0211I: Error details may be seen in the file:
/opt/ibm/WebSphere/AppServer/profiles/default/logs/server1/stopServer.log

SOL: The wasadmin and xmeta passwords need to be reset; the commands are below.
[root@poluloro01 bin]# cd /opt/ibm/InformationServer/ASBServer/bin/

[root@poluloro01 bin]# ./AppServerAdmin.sh -was -user wasadmin


-password Wasadmin0708
Info WAS instance /Node:poluloro01/Server:server1/ updated with new user information
Info MetadataServer daemon script updated with new user information
[root@poluloro01 bin]# ./AppServerAdmin.sh -was -user xmeta -password Xmeta0708
Info WAS instance /Node:poluloro01/Server:server1/ updated with new user information
Info MetadataServer daemon script updated with new user information

5. The specified field doesn't exist in the view adapted schema.

SOL: Most of the time "The specified field: XXXXXX does not exist in the view adapted schema" occurs when we have missed a field to map. Every stage has an output tab if it is used in the middle of the job. Make sure you have mapped every single field required for the next stage.
Sometimes even after mapping the fields this error can occur, and one of the reasons could be that the view adapter has not linked the input and output fields. In this case the required field mapping should be dropped and recreated.

Just to give an insight on this: the view adapter is an operator which is responsible for mapping the input and output fields. DataStage creates an instance of APT_ViewAdapter which translates the components of the operator input interface schema to matching components of the interface schema. So if the interface schema does not have the same columns as the operator input interface schema, this error will be reported.

1) When we use Same partitioning in the DataStage transformer stage we get the following warning in version 7.5.2:
TFCP000043: input_tfm: Input dataset 0 has a partitioning method other than entire specified; disabling memory sharing.
This is a known issue and you can safely demote that warning to informational by adding it to the project-specific message handler.
2) Warning: A sequential operator cannot preserve the partitioning of input data set on
input port 0
Resolution: Clear the preserve partition flag before Sequential file stages.
3) DataStage parallel job fails with "fork() failed, Resource temporarily unavailable".
On AIX, execute the following command to check the maxuproc setting and increase it if you plan to run multiple jobs at the same time:
lsattr -E -l sys0 | grep maxuproc
maxuproc   1024   Maximum number of PROCESSES allowed per user   True
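To raise the limit on AIX (root access required; the value 2048 below is only an example), chdev can be used and the change verified with lsattr again:
chdev -l sys0 -a maxuproc=2048
lsattr -E -l sys0 | grep maxuproc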
4) TFIP000000: Agg_stg: When checking operator: When binding input interface field CUST_ACT_NBR to field CUST_ACT_NBR: Implicit conversion from source type string[5] to result type dfloat: Converting string to number.
Resolution: Use the Modify stage to explicitly convert the data type before sending it to the aggregator stage.
5) Warning: A user defined sort operator does not satisfy the requirements.
Resolution: Check the order of the sort columns and make sure the same order is used in the join stage that follows the sort to join the two inputs.
6) TFTM000000: Stg_tfm_header,1: Conversion error calling conversion routine timestamp_from_string; data may have been lost.
TFTM000000: xfmJournals,1: Conversion error calling conversion routine decimal_from_string; data may have been lost.
Resolution: Check for the correct date format or decimal format, and also for null values in the date or decimal fields, before passing them to the DataStage StringToDate, DateToString, DecimalToString or StringToDecimal functions.
7) TOSO000119: Join_sort: When checking operator: Data claims to already be sorted on the specified keys; the 'sorted' option can be used to confirm this. Data will be re-sorted as necessary. Performance may improve if this sort is removed from the flow.
Resolution: Sort the data before sending it to the join stage, and check the order of the sort keys and join keys to make sure both are in the same order.
8) TFOR000000: Join_Outer: When checking operator: Dropping component CUST_NBR because of a prior component with the same name.
Resolution: If you are using Join, Difference, Merge or Compare stages, make sure both links have different column names other than the key columns.
9) TFIP000022: oci_oracle_source: When checking operator: When binding output interface field MEMBER_NAME to field MEMBER_NAME: Converting a nullable source to a non-nullable result.
Resolution: If you are reading from an Oracle database, or in any processing stage where the incoming column is defined as nullable, and you define the metadata in DataStage as non-nullable, you will get the above issue. If you want to convert a nullable field to non-nullable, make sure you apply the available null-handling functions in DataStage or in the extract query.

DATASTAGE COMMON ERRORS/WARNINGS AND SOLUTIONS 2

1. No jobs or logs showing in IBM DataStage Director Client, however jobs are still
accessible from the Designer Client.
SOL: The SyncProject command that is installed with DataStage 8.5 can be run to analyze and recover projects:
SyncProject -ISFile islogin -project dstage3 dstage5 -Fix
2. CASHOUT_DTL: Invalid property value /Connection/Database
(CC_StringProperty::getValue, file CC_StringProperty.cpp, line 104)
SOL: Change the Data Connection properties manually in the produced
DB2 Connector stage.
A patch fix is available for this issue JR35643
3. Import .dsx file from command line
SOL: DSXImportService -ISFile dataconnection -DSProject dstage -DSXFile c:\export\oldproject.dsx
4. Generate Surrogate Key without Surrogate Key Stage
SOL: @PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + 1
Use the above formula in a Transformer stage to generate a surrogate key.
5. Failed to authenticate the current user against the selected Domain: Could not connect
to server.
RC: Client has invalid entry in host file
Server listening port might be blocked by a firewall
Server is down
SOL: Update the host file on client system so that the server hostname can be resolved
from client.
Make sure the WebSphere TCP/IP ports are opened by the firewall.
Make sure the WebSphere application server is running. (OR)

Restart Websphere services.


6. The connection was refused or the RPC daemon is not running (81016)
RC: The dsrpcd process must be running in order to be able to log in to DataStage.
If you restart DataStage, but the socket used by the dsrpcd (default is 31538) was busy,
the dsrpcd will fail to start. The socket may be held by dsapi_slave processes that were
still running or recently killed when DataStage was restarted.
SOL: Run ps -ef | grep dsrpcd to confirm the dsrpcd process is not running.
Run ps -ef | grep dsapi_slave to check if any dsapi_slave processes exist. If so, kill
them.
Run netstat -a | grep dsrpc to see if any processes have sockets that are ESTABLISHED, FIN_WAIT, or CLOSE_WAIT. These will prevent the dsrpcd from starting. The sockets with status FIN_WAIT or CLOSE_WAIT will eventually time out and disappear, allowing you to restart DataStage.
Then Restart DSEngine.

(If the above doesn't work, the system needs to be rebooted.)

7. To save Datastage logs in notepad or readable format


SOL: a) /opt/ibm/InformationServer/server/DSEngine (go to this directory)
./bin/dsjob -logdetail project_name job_name > /home/dsadm/log.txt
b) In the Director client: Project tab --> Print, select the "Print to file" option and save it in a local directory.
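Other dsjob logging options can be used with the same syntax; for example -logsum prints a one-line summary per log event, which is often enough for a quick check (option names can be confirmed by running dsjob with no arguments, which prints its usage):
./bin/dsjob -logsum project_name job_name > /home/dsadm/logsum.txt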
8. Run time error 457. This Key is already associated with an element of this
collection.
SOL: Needs to rebuild repository objects:
a) Login to the Administrator client
b) Select the project
c) Click on Command
d) Issue the command ds.tools
e) Select option 2
f) Keep clicking next until it finishes.
All objects will be updated.

9. To stop the DataStage jobs at the Linux level:
SOL: ps -ef | grep dsadm
(to check the process id and phantom jobs)
kill -9 process_id
10. To run datastage jobs from command line
SOL: cd /opt/ibm/InformationServer/server/DSEngine
./dsjob -server $server_nm -user $user_nm -password $pwd -run $project_nm $job_nm
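Once the run has been triggered, the same utility can report the job's current status; a sketch using the same parameters as above (the -jobinfo option prints the job status and run information):
./dsjob -server $server_nm -user $user_nm -password $pwd -jobinfo $project_nm $job_nm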
11. Failed to connect to JobMonApp on port 13401.
SOL: Needs a restart of the jobmoninit script (in /opt/ibm/InformationServer/Server/PXEngine/Java). Type:
sh jobmoninit start $APT_ORCHHOME
Add "127.0.0.1 localhost" to the /etc/hosts file.
(Without the localhost entry, the Job Monitor will be unable to use the ports correctly.)
12. SQL0752N. Connect to a database is not permitted within logical unit of work
CONNECT type 1 settings is in use.
SOL: Issue a COMMIT or ROLLBACK statement before requesting a connection to another database.
1. While running the ./NodeAgents.sh start command, getting the following error:
LoggingAgent.sh process stopped unexpectedly
SOL: Needs to kill LoggingAgentSocketImpl:
ps -ef | grep LoggingAgentSocketImpl (OR)
ps -ef | grep Agent (to check the process id of the above)

2. Warning: A sequential operator cannot preserve the partitioning of input data set on input port 0.
SOL: Clear the preserve partition flag before Sequential File stages.
3. Warning: A user defined sort operator does not satisfy the requirements.
SOL: Check the order of the sort columns and make sure the same order is used when a Join stage follows the sort to join two inputs.
4. Conversion error calling conversion routine timestamp_from_string data may have
been lost. xfmJournals,1: Conversion error calling conversion routine
decimal_from_string data may have been lost
SOL: check for the correct date format or decimal format and also null values in the
date or decimal fields before passing to datastage StringToDate,
DateToString,DecimalToString or StringToDecimal functions.
5. To display all the jobs from the command line:
SOL:
cd /opt/ibm/InformationServer/Server/DSEngine/bin
./dsjob -ljobs <project_name>
6. Error trying to query dsadm[]. There might be an issue in the database server.
SOL: Check XMETA connectivity:
db2 connect to xmeta
(A connection to or activation of database xmeta cannot be made because of BACKUP pending.)
7. DSR_ADMIN: Unable to find the new project location.
SOL: The Template.ini file might be missing in /opt/ibm/InformationServer/Server.
Copy the file from another server.

8. Designer LOCKS UP while trying to open any stage.
SOL: Double click on the stage that locks up DataStage,
press ALT+SPACE,
the Windows menu will pop up; select Restore.
It will show your properties window now.
Click on X to close this window.
Now, double click again and check whether the properties window appears.
9. Error setting up internal communications (fifo RT_SCTEMP/job_name.fifo)
SOL: Remove the locks and try to run (OR)
Restart DSEngine and try to run (OR)
Go to /opt/ibm/InformationServer/server/Projects/proj_name/
ls RT_SCT* then
rm -f RT_SCTEMP
then try to restart it.
10. While attempting to compile a job: "failed to invoke GenRunTime using Phantom process helper".
RC:
/tmp space might be full
Job status is incorrect
Format problems with the project's uvodbc.config file
SOL:
a) Clean up the /tmp directory
b) DS Director --> Job --> Clear Status File
c) Confirm uvodbc.config has the following entry/format:

[ODBC SOURCES]
<local uv>
DBMSTYPE = UNIVERSE
Network = TCP/IP
Service = uvserver
Host = 127.0.0.1

ERROR: Phantom error in jobs
Resolution: DataStage services have to be restarted.
Follow the steps below:
Login to server through putty using dsadm user.

Check whether active or stale sessions are there.


ps -ef | grep slave

Ask the application team to close the active or stale sessions running from applications
user.
If they have closed the sessions, but sessions are still there, then kill those sessions.

Make sure no jobs are running


If any, ask the application team to stop the job

ps -ef | grep dsd.run

Check the output of the below command before stopping DataStage services:
netstat -a | grep dsrpc
If any connections are in ESTABLISHED state, check that no job, stale, active or osh sessions are running.
If any connections are in CLOSE_WAIT state, then wait for some time; those processes will no longer be visible.

Stop the Datastage services.


cd $DSHOME
. ./dsenv
cd $DSHOME/bin
./uv -admin -stop
Check whether the DataStage services are stopped:
netstat -a | grep dsrpc
No output should come for the above command.

Wait for 10 to 15 minutes for the shared memory to be released by the processes holding it.
Start the DataStage services:
./uv -admin -start
If it asks for the dsadm password while firing the command, then enable impersonation through the root user:
${DSHOME}/scripts/DSEnable_impersonation.sh


InfoSphere DataStage Jobstatus returned Codes from dsjob

Equ DSJS.RUNNING     To 0    This is the only status that means the job is actually running
Equ DSJS.RUNOK       To 1    Job finished a normal run with no warnings
Equ DSJS.RUNWARN     To 2    Job finished a normal run with warnings
Equ DSJS.RUNFAILED   To 3    Job finished a normal run with a fatal error
Equ DSJS.QUEUED      To 4    Job queued waiting for resource allocation
Equ DSJS.VALOK       To 11   Job finished a validation run with no warnings
Equ DSJS.VALWARN     To 12   Job finished a validation run with warnings
Equ DSJS.VALFAILED   To 13   Job failed a validation run
Equ DSJS.RESET       To 21   Job finished a reset run
Equ DSJS.CRASHED     To 96   Job has crashed
Equ DSJS.STOPPED     To 97   Job was stopped by operator intervention (can't tell run type)
Equ DSJS.NOTRUNNABLE To 98   Job has not been compiled
Equ DSJS.NOTRUNNING  To 99   Any other status
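These codes matter mostly when scripting around dsjob; a minimal sketch (the project and job names below are placeholders, and the exact exit-code mapping of -jobstatus should be verified on your installation):
cd /opt/ibm/InformationServer/Server/DSEngine/bin
./dsjob -run -jobstatus dstage_project my_job
rc=$?
if [ $rc -eq 1 -o $rc -eq 2 ]; then
  echo "job finished OK or with warnings (status $rc)"
else
  echo "job did not finish cleanly (status $rc)" >&2
fi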



Warning: Ignoring duplicate entry at table record; no further warnings will be issued for this table

This warning is seen when multiple records with the same key column are present in the reference table from which the lookup is done. Lookup, by default, will fetch the first record that it finds as a match and will throw the warning, since it doesn't know which value is the correct one to be returned from the reference.
To solve this problem you can either select one of the reference links in the 'Multiple rows returned from link' dropdown in the Lookup constraints; in this case Lookup will return multiple rows for each row that is matched.
Or use some method to eliminate the duplicate rows with the same key columns, according to the business requirements.


How to replace ^M character in VI editor/sed?
^M is the DOS line-break character, which shows up in UNIX files when they are uploaded from a Windows file system in ASCII format.
To remove this, open your file in vi editor and type
:%s/(ctrl-v)(ctrl-m)//g
and press Enter key.
Important!! Press the (Ctrl-v)(Ctrl-m) combination to enter the ^M character; don't type ^ and M literally.
If anything goes wrong, exit without saving with :q!.

Also,
Your substitution command may catch more ^M characters than necessary. Your file may contain valid ^M characters in the middle of a line of code, for example. Use the following command instead to remove only those at the very end of lines:
:%s/(ctrl-v)(ctrl-m)*$//g

Using sed:
sed -e "s/^M//g" old_file_name > new_file_name



How to convert a single row into multiple rows ?

Below is a screenshot of our input data:

City  State  Name1  Name2  Name3
xy    FGH    Sam    Dean   Winchester

We are going to read the above data from a sequential file and transform it to look like this:

City  State  Name
xy    FGH    Sam
xy    FGH    Dean
xy    FGH    Winchester

So let's get to the job design.


Step 1: Read the input data
Step 2: Logic for Looping in Transformer
In the adjacent image you can see a new box called Loop Condition. This is where we are going to control the loop variables.
Below is the screenshot when we expand the Loop Condition box.
The Loop While constraint is used to implement functionality similar to a WHILE statement in programming. So, similar to a while statement, we need a condition to identify how many times the loop is supposed to be executed.
To achieve this the @ITERATION system variable was introduced. In our example we need to loop the data 3 times to get the column data onto subsequent rows.
So let's have @ITERATION <= 3.
Now create a new Loop variable with the name LoopName
The derivation for this loop variable should be

If @ITERATION=1 Then DSLink2.Name1 Else If @ITERATION=2 Then DSLink2.Name2


Else DSLink2.Name3
Below is a screenshot illustrating the same
Now all we have to do is map this Loop variable LoopName to our output column Name

Let's map the output to a sequential file stage and see if the output is as desired.

After running the job, we did a view data on the output stage and here is the data as desired.

Making some tweaks to the above design we can implement things like

Adding new rows to existing rows


Splitting data in a single column to multiple rows and many more such stuff..


How to perform aggregation using a Transformer


Input: Below is the sample data of three students, their marks in two subjects, the corresponding grades and the dates on which they were graded.
Output: Our requirement is to sum the marks obtained by each student in a subject and display it in the output.
Step 1: Once we have read the data from the source we have to sort data on our key field. In our
example the key field is the student name
Once the data is sorted we have to implement the looping function in transformer to calculate the
aggregate value
Before we get into the details, we need to know a couple of functions
SaveInputRecord(): This function saves the entire record in cache and returns the number of records that are currently stored in cache.
LastRowInGroup(input-column): When an input key column is passed to this function, it will return 1 when the last row for that column value is found, and in all other cases it will return 0.
To give an example, let's say our input is:
Student  Code
ABC      1
ABC      2
ABC      3
DEF      ...
For the first two records the function will return 0, but for the last record ABC,3 it will return 1, indicating that it is the last record for the group where the student name is ABC.
GetSavedInputRecord(): This function returns the record that was stored in cache by the function SaveInputRecord().

Back to the task at hand, we need 7 stage variables to perform the aggregation operation successfully.
1. LoopNumber: Holds the value of number of records stored in cache for a student
2. LoopBreak: This is to identify the last record for a particular student
3. SumSub1: This variable will hold the final sum of marks for each student in subject 1
4. IntermediateSumSub1: This variable will hold the sum of marks until the final record is evaluated
for a student (subject 1)

5. SumSub2: Similar to SumSub1 (for subject 2)


6. IntermediateSumSub2: Similar to IntermediateSumSub1 (for subject 2)
7. LoopBreakNum: Holds the value for the number of times the loop has to run
Below is the screenshot of the stage variables
We also need to define the Loop Variables so that the loop will execute for a student until his final
record is identified
To explain the above use of variables:
When the first record comes to the stage variables, it is saved in the cache using the function SaveInputRecord() in the first stage variable, LoopNumber.
The second stage variable checks whether this is the last record for this particular student; if it is, it stores 1, else 0.
The third, SumSub1, is executed only if the record is the last record.
The fourth, IntermediateSumSub1, is executed when the input record is not the last record, thereby storing the intermediate sum of the subject for a student.
The fifth and sixth are the same as the third and fourth stage variables, but for subject 2.
The seventh will have the value 1 for the first record; if the second record fetched is for the same student, it will change to 2, and so on.
The loop variable will be executed until the final record for a student is identified, and the GetSavedInputRecord() function will make sure the current record is processed before the next record is brought in for processing.
What the above logic does is, for each and every record, send the sum of marks scored by each student to the output. But our requirement is to have only one record per student in the output.
So we simply add a Remove Duplicates stage and add the student name as the primary key.
Run the job and the output will be according to our initial expectation.
We have successfully implemented AGGREGATION using TRANSFORMER Stage



Star vs Snowflake Schemas

First Answer: My personal opinion is to use the star by default, but if the product you are using for the business community prefers a snowflake, then I would snowflake it. The major difference between snowflake and star is that a snowflake will have multiple tables for a dimension and a star a single table. For example, your company structure might be

Corporate > Region > Department > Store

In a star schema, you would collapse those into a single "store" dimension. In a snowflake, you would keep them apart, with the store connecting to the fact.

Second Answer: First of all, some definitions are in order. In a star schema, dimensions
that reflect a hierarchy are flattened into a single table. For example, a star schema
Geography Dimension would have columns like country, state/province, city and postal code. In the source system, this hierarchy would probably be normalized with multiple tables with one-to-many relationships.
A snowflake schema does not flatten a hierarchy dimension into a single table. It would,
instead, have two or more tables with a one-to-many relationship. This is a more
normalized structure. For example, one table may have state/province and country columns
and a second table would have city and postal code. The table with city and postal code
would have a many-to-one relationship to the table with the state/province columns.
There are some good reasons for snowflake dimension tables. One example is a company
that has many types of products. Some products have a few attributes, others have many,
many. The products are very different from each other. The thing to do here is to create a
core Product dimension that has common attributes for all the products such as product
type, manufacturer, brand, product group, etc. Create a separate sub-dimension table for
each distinct group of products where each group shares common attributes. The subproduct tables must contain a foreign key of the core Product dimension table.
One of the criticisms of using snowflake dimensions is that it is difficult for some of the multidimensional front-end presentation tools to generate a query on a snowflake dimension. However, you can create a view for each combination of the core product/sub-product dimension tables and give the view a suitably descriptive name (Frozen Food Product, Hardware Product, etc.), and then these tools will have no problem.



Performance Tuning in Datastage
1. Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files for optimum performance.
2. Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
3. Tuned the 'Project Tunables' in Administrator for better performance.
4. Used sorted data for Aggregator.
5. Sorted the data as much as possible in the DB and reduced the use of DS-Sort for better performance of jobs.
6. Removed the data not used from the source as early as possible in the job.
7. Worked with DB-admin to create appropriate indexes on tables for better performance of DS queries.
8. Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
9. If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.
10. Before writing a routine or a transform, make sure that the required functionality is not already in one of the standard routines supplied in the sdk or ds utilities categories. Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.
11. Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even getting in before joins are made.
12. Tuning should occur on a job-by-job basis.
13. Use the power of DBMS.
14. Try not to use a Sort stage when you can use an ORDER BY clause in the database.
15. Using a constraint to filter a record set is much slower than performing a SELECT ... WHERE in the database.
16. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
17. Minimize the usage of Transformer (instead use Copy, Modify, Filter, Row Generator).
18. Use SQL code while extracting the data.
19. Handle the nulls.
20. Minimize the warnings.
21. Reduce the number of lookups in a job design.
22. Try not to use more than 20 stages in a job.
23. Use an IPC stage between two passive stages; it reduces processing time.
24. Drop indexes before data loading and recreate them after loading data into tables.
25. Check the write cache of the Hash file. If the same hash file is used for lookup as well as target, disable this option.
26. If the hash file is used only for lookup then enable 'Preload to memory'. This will improve the performance. Also check the order of execution of the routines.
27. Don't use more than 7 lookups in the same transformer; introduce new transformers if it exceeds 7 lookups.
28. Use the 'Preload to memory' option in the hash file output.
29. Use 'Write to cache' in the hash file input.
30. Write into the error tables only after all the transformer stages.
31. Reduce the width of the input record - remove the columns that you would not use.
32. Cache the hash files you are reading from and writing into. Make sure your cache is big enough to hold the hash files.
33. Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files.
34. Ideally, if the amount of data to be processed is small, configuration files with a smaller number of nodes should be used, while if the data volume is more, configuration files with a larger number of nodes should be used.
35. Partitioning should be set in such a way as to have balanced data flow, i.e. nearly equal partitioning of data should occur and data skew should be minimized.
36. In DataStage jobs where a high volume of data is processed, virtual memory settings for the job should be optimized. Jobs often abort in cases where a single lookup has multiple reference links. This happens due to low temp memory space. In such jobs $APT_BUFFER_MAXIMUM_MEMORY, $APT_MONITOR_SIZE and $APT_MONITOR_TIME should be set to sufficiently large values.
37. Sequential files should be used in the following conditions: when we are reading a flat file (fixed width or delimited) from a UNIX environment which is FTPed from some external system.
38. When some UNIX operations have to be done on the file. Don't use a sequential file for intermediate storage between jobs. It causes performance overhead, as it needs to do data conversion before writing to and reading from a UNIX file.
39. In order to have faster reading from the stage, the number of readers per node can be increased (the default value is one).
40. Usage of a Dataset results in good performance in a set of linked jobs. It helps in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order.
41. The Lookup stage is faster when the data volume is less. If the reference data volume is more, usage of the Lookup stage should be avoided, as all reference data is pulled into local memory.
42. The sparse lookup type should be chosen only if the primary input data volume is small.
43. Join should be used when the data volume is high. It is a good alternative to the Lookup stage and should be used when handling huge volumes of data.
44. Even though data can be sorted on a link, the Sort stage is used when the data to be sorted is huge. When we sort data on a link (sort/unique option), once the data size is beyond the fixed memory limit, I/O to disk takes place, which incurs an overhead. Therefore, if the volume of data is large, an explicit Sort stage should be used instead of sort on link. The Sort stage gives an option of increasing the buffer memory used for sorting; this would mean lower I/O and better performance.
45. It is also advisable to reduce the number of transformers in a job by combining the logic into a single transformer rather than having multiple transformers.
46. The presence of a Funnel stage reduces the performance of a job. It can increase the time taken by a job by 30% (observed). When a Funnel stage is to be used in a large job it is better to isolate it to its own job: write the output to Datasets and funnel them in a new job.
47. The Funnel stage should be run in continuous mode, without hindrance.
48. A single job should not be overloaded with stages. Each extra stage put in a job corresponds to a smaller number of resources available for every stage, which directly affects the job's performance. If possible, big jobs having a large number of stages should be logically split into smaller units.
49. Unnecessary column propagation should not be done. As far as possible, RCP (Runtime Column Propagation) should be disabled in the jobs.
50. A most often neglected option is 'don't sort if previously sorted' in the Sort stage; set this option to true. This improves the Sort stage performance a great deal.
51. In the Transformer stage, 'Preserve Sort Order' can be used to maintain the sort order of the data and reduce sorting in the job.
52. Reduce the number of stage variables used.
53. The Copy stage should be used instead of a Transformer for simple operations.
54. The upsert works well if the data is sorted on the primary key column of the table which is being loaded.
55. Don't read from a Sequential File using SAME partitioning.
56. By using the hashfile stage we can improve the performance. In the case of the hashfile stage we can define the read cache size and write cache size; the default size is 128 MB.
57. We can also improve the performance on active-to-active links by enabling the row buffer; the default row buffer size is 128 KB.

==================================================
=================================

TRANSFORMER STAGE TO FILTER THE DATA



Take Job Design as below

Our requirement is to filter the data department-wise from the file below:
samp_tabl
1,sam,clerck,10
2,tom,developer,20
3,jim,clerck,10

4,don,tester,30
5,zeera,developer,20
6,varun,clerck,10
7,luti,production,40
8,raja,production,40
And our requirement is to get the target data as below
In Target1 we need the 10th & 40th dept employees.
In Target2 we need the 30th dept employees.
In Target3 we need the 20th & 40th dept employees.
Read and Load the data in Source file
In Transformer Stage just Drag and Drop the data to the target tables.
Write expressions in the constraints as below:
dept_no=10 or dept_no=40 for Target1
dept_no=30 for Target2
dept_no=20 or dept_no=40 for Target3
Click ok
Give file name at the target file and
Compile and Run the Job to get the Output.
================================================================================
