General objects
Database connectors
File stages
Processing stages
Restructure stages
Sequence activities
Please refer to the list below for a description of the stages used in DataStage and QualityStage.
We classified all stages in order of importance and frequency of use in real-life deployments (and
also on certification exams). The most widely used stages are marked in bold, or a subpage with a
detailed description and examples is linked.
General elements
A link indicates a flow of data. There are three main types of links in Datastage: stream,
reference and lookup.
Annotation is used for adding floating datastage job notes and descriptions on a job
canvas. Annotations provide a great way to document the ETL process and help
understand what a given job does.
Description Annotation shows the contents of a job description field. One description
annotation is allowed in a datastage job.
Row generator produces a set of test data which fits the specified metadata (values can be
random or cycled through a specified list). Useful for testing and development.
Column generator adds one or more columns to the incoming flow and generates test
data for those columns.
Peek stage prints record column values to the job log, which can be viewed in Director. It
can have a single input link and multiple output links.
Sample stage samples an input data set. Operates in two modes: percent mode and period
mode.
Head selects the first N rows from each partition of an input data set and copies them to
an output data set.
Tail is similar to the Head stage. It selects the last N rows from each partition.
Write Range Map writes a data set in a form usable by the range partitioning method.
Processing stages
Aggregator joins data vertically by grouping the incoming data stream and calculating
summaries (sum, count, min, max, variance, etc.) for each group. The data can be
grouped using two methods: hash table or pre-sort.
Copy - copies input data (a single stream) to one or more output data flows.
Filter filters out records that do not meet specified requirements.
Funnel combines multiple streams into one.
Join combines two or more inputs according to values of key column(s). Similar in
concept to a relational DBMS SQL join (with the ability to perform inner, left, right and full
outer joins). It can have 1 left and multiple right inputs (all need to be sorted) and produces a
single output stream (no reject link).
Lookup combines two or more inputs according to values of key column(s). The Lookup
stage can have 1 source and multiple lookup tables. Records don't need to be sorted; it
produces a single output stream and a reject link.
Merge combines one master input with multiple update inputs according to values of
key column(s). All inputs need to be sorted, and unmatched secondary entries can be
captured in multiple reject links.
Modify stage alters the record schema of its input dataset. Useful for renaming columns,
non-default data type conversions and null handling
Remove duplicates stage needs a single sorted data set as input. It removes all duplicate
records according to a specification and writes to a single output
Slowly Changing Dimension automates the process of updating dimension tables where
the data changes over time. It supports SCD type 1 and SCD type 2.
Transformer stage handles extracted data and performs data validation, conversions and
lookups.
Change Capture - compares the before and after states of two input data sets and outputs a
single data set whose records represent the changes made.
Change Apply - applies the change operations to a before data set to compute an after
data set. It gets its data from a Change Capture stage.
Difference stage performs a record-by-record comparison of two input data sets and
outputs a single data set whose records represent the difference between them. Similar to
the Change Capture stage.
Checksum - generates a checksum from the specified columns in a row and adds it to the
stream. Used to determine if there are differences between records.
Decode decodes a data set previously encoded with the Encode Stage.
External Filter permits specifying an operating system command that acts as a filter on
the processed data.
Generic stage allows users to call an OSH operator from within a DataStage job, with
options as required.
Pivot Enterprise is used for horizontal pivoting. It maps multiple columns in an input row
to a single column in multiple output rows. Pivoting data results in a dataset with fewer
columns but more rows.
Surrogate Key Generator generates surrogate key for a column and manages the key
source.
Switch stage assigns each input row to an output link based on the value of a selector
field. Provides a similar concept to the switch statement in most programming
languages.
Compress - packs a data set using a GZIP utility (or compress command on
LINUX/UNIX)
Expand extracts a previously compressed data set back into raw binary data.
File stages
Sequential file is used to read data from or write data to one or more flat (sequential) files.
Data Set stage allows users to read data from or write data to a dataset. Datasets are
operating system files, each of which has a control file (.ds extension by default) and one
or more data files (unreadable by other applications).
File Set stage allows users to read data from or write data to a fileset. Filesets are
operating system files, each of which has a control file (.fs extension) and data files.
Unlike datasets, filesets preserve formatting and are readable by other applications.
Complex flat file allows reading from complex file structures on a mainframe machine,
such as MVS data sets, header and trailer structured files, files that contain multiple
record types, QSAM and VSAM files.
External Source - permits reading data that is output from multiple source programs.
Lookup File Set is similar to the FileSet stage. It is a partitioned hashed file which can be
used for lookups.
Database stages
Oracle Enterprise allows reading data from and writing data to an Oracle database
(database versions from 9.x to 10g are supported).
ODBC Enterprise permits reading data from and writing data to a database defined as an
ODBC source. In most cases it is used for processing data from or to Microsoft Access
databases and Microsoft Excel spreadsheets.
DB2/UDB Enterprise permits reading data from and writing data to a DB2 database.
Teradata permits reading data from and writing data to a Teradata data warehouse. Three
Teradata stages are available: Teradata connector, Teradata Enterprise and Teradata
Multiload
SQLServer Enterprise permits reading data from and writing data to Microsoft SQL
Server 2005 and 2008 databases.
Sybase permits reading data from and writing data to Sybase databases.
Stored procedure stage supports Oracle, DB2, Sybase, Teradata and Microsoft SQL
Server. The Stored Procedure stage can be used as a source (returns a rowset), as a target
(pass a row to a stored procedure to write) or a transform (to invoke procedure processing
within the database).
MS OLEDB helps retrieve information from any type of information repository, such as a
relational source, an ISAM file, a personal database, or a spreadsheet.
Dynamic Relational Stage (Dynamic DBMS, DRS stage) is used for reading from or
writing to a number of different supported relational DB engines using native interfaces,
such as Oracle, Microsoft SQL Server, DB2, Informix and Sybase.
Classic federation
RedBrick Load
Netezza Enterprise
iWay Enterprise
XML Input stage makes it possible to transform hierarchical XML data to flat relational
data sets
XML Output writes tabular data (relational tables, sequential files or any datastage data
streams) to XML structures
Java client stage can be used as a source stage, as a target and as a lookup. The java
package consists of three public classes: com.ascentialsoftware.jds.Column,
com.ascentialsoftware.jds.Row, com.ascentialsoftware.jds.Stage
Java transformer stage supports three links: input, output and reject.
Restructure stages
Column export stage exports data from a number of columns of different data types into a
single column of data type ustring, string, or binary. It can have one input link, one output
link and a rejects link. Click here for more..
Column import is complementary to the Column Export stage. Typically used to divide data
arriving in a single column into multiple columns.
Combine records stage combines rows which have identical keys, into vectors of
subrecords.
Make subrecord combines specified input vectors into a vector of subrecords whose
columns have the same names and data types as the original vectors.
Split subrecord - separates an input subrecord field into a set of top-level vector columns
Split vector promotes the elements of a fixed-length vector to a set of top-level columns
Investigate stage analyzes data content of specified columns of each record from the
source file. Provides character and word investigation methods.
Match frequency stage takes input from a file, database or processing stages and
generates a frequency distribution report.
QualityStage Legacy
Reference Match
Standardize
Survive
Unduplicate Match
Sequence activities
Notification Activity - used for sending emails to user-defined recipients from within
Datastage.
Terminator Activity permits shutting down the whole sequence once a certain situation
occurs.
Wait for file Activity - waits for a specific file to appear or disappear and launches the
processing.
EndLoop Activity
Exception Handler
Execute Command
Nested Condition
Routine Activity
StartLoop Activity
UserVariables Activity
=====================================================================
Configuration file:
The Datastage configuration file is a master control file (a text file which sits on the
server side) for jobs; it describes the parallel system resources and architecture. The
configuration file provides the hardware configuration for supporting such architectures
as SMP (a single machine with multiple CPUs, shared memory and disk), Grid, Cluster or
MPP (multiple CPUs, multiple nodes and dedicated memory per node). DataStage understands
the architecture of the system through this file.
This is one of the biggest strengths of Datastage. If you change your processing
configuration, or change servers or platform, you never have to worry about it affecting your
jobs, since all the jobs depend on this configuration file for execution. Datastage jobs
determine which node to run a process on, where to store the temporary data and
where to store the dataset data based on the entries provided in the configuration file. A
default configuration file is available whenever the server is installed.
Configuration files have the extension ".apt". The main purpose of the configuration
file is to separate software and hardware configuration from job design. It allows changing
hardware and software resources without changing a job design. Datastage jobs can point to
different configuration files by using job parameters, which means that a job can utilize different
hardware architectures without being recompiled.
The configuration file contains the different processing nodes and also specifies the disk
space provided for each processing node. These are logical processing nodes: if you have
more than one CPU, this does not mean the nodes in your configuration file correspond to
those CPUs. It is possible to have more than one logical node on a single physical node.
However, you should be wise in configuring the number of logical nodes on a single physical
node. Increasing the number of nodes increases the degree of parallelism, but it does not
necessarily mean better performance, because it results in more processes. If your
underlying system does not have the capability to handle these loads, you will have a
very inefficient configuration on your hands.
APT_CONFIG_FILE is the environment variable with which DataStage determines the
configuration file to be used (one can have many configuration files for a project). In fact,
this is what is generally used in production. However, if this environment variable is not
defined, how does DataStage determine which file to use?
If the APT_CONFIG_FILE environment variable is not defined, then DataStage looks for the
default configuration file (config.apt) in the following locations:
1. The current working directory.
2. INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level directory of
the DataStage installation.
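As a concrete illustration, here is a hedged sketch of both ways of pointing a run at a specific
configuration file; the paths, the project name DEV_PROJ and the job name Load_Customers are
illustrative, and the second form assumes the job defines $APT_CONFIG_FILE as a job parameter:

    # Set the configuration file for everything started from this shell
    export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/4node.apt

    # Or override it for a single run through the dsjob client
    dsjob -run -param '$APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/1node.apt' DEV_PROJ Load_Customers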
What are the different options a logical node can have in the configuration file?
fastname - The fastname is the physical node name that stages use to open connections for
high-volume data transfers. The value of this option is often the network name. Typically, you
can get this name by using the Unix command uname -n.
pools - The names of the pools to which the node is assigned. Based on the characteristics of
the processing nodes you can group nodes into sets of pools.
A pool can be associated with many nodes, and a node can be part of many pools.
A node belongs to the default pool unless you explicitly specify a pools list for it and omit the
default pool name ("") from the list.
A parallel job, or a specific stage in the parallel job, can be constrained to run on a pool (a set
of processing nodes).
If both the job and a stage within the job are constrained to run on specific processing nodes,
then the stage will run on the nodes that are common to the stage and the job.
resource - resource resource_type "location" [{pools "disk_pool_name"}] | resource
resource_type "value". The resource_type can be canonicalhostname (the quoted ethernet
name of a node in a cluster that is not connected to the Conductor node by the high-speed
network), disk (a directory to which persistent data is read and written), scratchdisk (the quoted
absolute path name of a directory on a file system where intermediate data will be temporarily
stored; it is local to the processing node), or an RDBMS-specific resource (e.g. DB2, INFORMIX,
ORACLE, etc.).
Now let's try our hand at interpreting a configuration file. Let's try the sample below.
{
	node "node1"
	{
		fastname "SVR1"
		pools ""
		resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
		resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
	}
	node "node2"
	{
		fastname "SVR1"
		pools ""
		resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
		resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
	}
	node "node3"
	{
		fastname "SVR2"
		pools "" "sort"
		resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
		resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
	}
}
This is a 3-node configuration file. Let's go through the basic entries and what they represent.
Fastname - This refers to the node name on a fast network. From this we can infer that nodes
node1 and node2 are on the same physical node. However, if we look at node3 we can see
that it is on a different physical node (identified by SVR2). So basically in node1 and node2 all
the resources are shared: the disk and scratch disk specified are actually shared between
those two logical nodes. Node3, on the other hand, has its own disk and scratch disk space.
Pools - Pools allow us to associate different processing nodes based on their functions and
characteristics. If you see an entry like node0, or reserved node pools like sort, db2, etc., it
means that the node is part of the specified pool. A node is by default associated with the
default pool, which is indicated by "". Now if you look at node3 you can see that this node is
associated with the sort pool. This ensures that the Sort stage will run only on nodes that are
part of the sort pool.
Resource disk - Specifies the location on your server where the processing node will write all
the data set files. As you might know, when Datastage creates a dataset, the file you see does
not contain the actual data; the dataset file points to the place where the actual data is stored.
Where the dataset data is stored is specified in this line.
Resource scratchdisk - The location of temporary files created during Datastage processes,
like lookups and sorts, is specified here. If the node is part of the sort pool then the scratch
disk can also be made part of the sort scratch disk pool. This ensures that the temporary files
created during sorts are stored only in this location. If such a pool is not specified, then
Datastage determines whether there are any scratch disk resources that belong to the default
scratch disk pool on the nodes that the sort is specified to run on. If this is the case then this
space will be used.
Below is the sample diagram for 1 node and 4 node resource allocation:
A basic configuration file for a single-machine, two-node server (2 CPUs) is shown below. The
file defines nodes node1 and node2 on a single dev server (an IP address might be provided
instead of a hostname) with 3 disk resources (d1 and d2 for the data, and Scratch as scratch
space), plus two further nodes on separate servers.
The configuration file is shown below:
node "node1"
{
fastname "dev"
pool ""
resource disk "/IIS/Config/d1" { }
resource disk "/IIS/Config/d2" { }
resource scratchdisk "/IIS/Config/Scratch" { }
}
node "node2"
{
fastname "dev"
pool ""
resource disk "/IIS/Config/d1" { }
resource scratchdisk "/IIS/Config/Scratch" { }
}
node "node3"
{
fastname "dev3"
pool "" "n3" "s3"
resource disk "/IIS/Config3/d1" {}
resource scratchdisk "/IIS/Config3/Scratch" {}
}
node "node4"
{
fastname "dev4"
pool "n4" "s4"
resource disk "/IIS/Config4/d1" {}
resource scratchdisk "/IIS/Config4/Scratch" {}
}
Resource disk: a disk path is defined here. The data files of any datasets are stored in the
resource disk.
Resource scratchdisk: a folder path is defined here as well. This path is used by the parallel
job stages for buffering of the data when the parallel job runs.
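Once a configuration file is in place, a hedged way to sanity-check it from the engine host
(assuming the parallel engine environment is sourced; the file path is illustrative):

    # Point the engine at the file to verify, then ask orchadmin to validate it
    export APT_CONFIG_FILE=/IIS/Config/2node.apt
    orchadmin check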
=====================================================================
Sequential_Stage :
Sequential File:
The Sequential File stage is a file stage. It allows you to read data from, or write
data to, one or more flat files.
Scenario 2:
Read From Multiple Nodes = Yes
Once we add Read From Multiple Nodes = Yes, the stage by default executes in parallel mode.
If you run the job with the above configuration it will abort with the following fatal error:
sff_SourceFile: The multinode option requires fixed length records. (That means you can use this
option to read fixed-width files only.)
In order to fix the above issue, go to the Format tab and add additional parameters as shown below.
Scenario 3: Read a delimited file by adding Number of Readers Per Node instead of the multinode
option to improve read performance; once we add this option the sequential file stage will execute
in its default parallel mode.
If we are reading from and writing to fixed-width files, it is always good practice to add the
APT_STRING_PADCHAR Datastage environment variable and assign 0x20 (space) as the default
value; it will then pad with spaces. Otherwise Datastage will pad with the null value (the
Datastage default padding character).
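A minimal sketch of that setting at the environment level (in practice the variable is usually
added as a project- or job-level environment variable in the Administrator instead):

    # Pad fixed-width string fields with spaces (0x20) rather than nulls
    export APT_STRING_PADCHAR=0x20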
Always keep Reject Mode = Fail to make sure the Datastage job will fail if we get badly
formatted records from the source systems.
Sol 1:
1. Introduce a Sort stage right after the sequential file.
2. Select the Key Change Column property in the Sort stage; it assigns 1 to the first row of each
key group and 0 to the rest, which lets you separate unique rows.
Scenario: file A contains the numbers 1-10 and file B contains 6-15. Produce output file X with
the records present only in A, file Y with the records present in both, and file Z with the records
present only in B.
Output file X contains
1
2
3
4
5
Output file y contains
6
7
8
9
10
Output file z contains
11
12
13
14
15
Possible solution:
Use a Change Capture stage. First, I am going to use A as the source (before) and B as the
reference (after), both connected to the Change Capture stage. The Change Capture stage is
connected to a Filter stage and then to targets X, Y and Z. In the Filter stage: change code = 2
goes to X [1,2,3,4,5], change code = 0 goes to Y [6,7,8,9,10], and change code = 1 goes to Z
[11,12,13,14,15].
Solution 2:
Create one px job.
src file= seq1 (1,2,3,4,5,6,7,8,9,10)
1st lkp = seq2 (6,7,8,9,10,11,12,13,14,15)
o/p - matching recs - o/p 1 (6,7,8,9,10)
not-matching records - o/p 2 (1,2,3,4,5)
2nd lkp:
src file - o/p 1 (6,7,8,9,10)
lkp file - seq 2 (6,7,8,9,10,11,12,13,14,15)
not matching recs - o/p 3 (11,12,13,14,15)
Dataset :
Inside a InfoSphere DataStage parallel job, data is moved around in data sets. These
carry meta data with them, both column definitions and information about the configuration that
was in effect when the data set was created. If for example, you have a stage which limits
execution to a subset of available nodes, and the data set was created by a stage using all nodes,
InfoSphere DataStage can detect that the data will need repartitioning.
If required, data sets can be landed as persistent data sets, represented by a Data Set
stage. This is the most efficient way of moving data between linked jobs. Persistent data sets are
stored in a series of files linked by a control file (note that you should not attempt to manipulate
these files using UNIX tools such as rm or mv; always use the tools provided with InfoSphere
DataStage).
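For instance, a hedged sketch of inspecting and removing a persistent data set with the
orchadmin utility instead of plain rm (the path is illustrative):

    # Show the schema and component files behind the descriptor
    orchadmin describe /data/ds/customers.ds
    # Remove the descriptor and its data files together
    orchadmin rm /data/ds/customers.ds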
There are two groups of Datasets: persistent and virtual.
The first type, persistent Datasets, are marked with the *.ds extension, while for the second type,
virtual datasets, the *.v extension is reserved. (It's important to mention that no *.v files may be
visible in the Unix file system, as they exist only virtually, inhabiting RAM memory. The extension
*.v itself is characteristic strictly of OSH, the Orchestrate scripting language.)
Further differences are much more significant. Primarily, persistent Datasets are stored
in Unix files using an internal Datastage EE format, while virtual Datasets are never stored on
disk; they exist within links, in EE format, but in RAM memory. Finally, persistent Datasets are
readable and rewritable with the DataSet stage, and virtual Datasets can only be passed through
in memory.
A data set comprises a descriptor file and a number of other files that are added as the data set
grows. These files are stored on multiple disks in your system. A data set is organized in terms
of partitions and segments.
Each partition of a data set is stored on a single processing node. Each data segment contains all
the records written by a single job. So a segment can contain files from many partitions, and a
partition has files from many segments.
Firstly, as a single Dataset contains multiple records, it is obvious that all of them must undergo
the same processes and modifications; in a word, all of them must go through the same
successive stages.
Secondly, it should be expected that different Datasets usually have different schemas, therefore
they cannot be treated interchangeably.
Alias names of Datasets are:
1) Orchestrate File
2) Operating System file
A Dataset consists of multiple files:
a) Descriptor file
b) Data files
c) Control file
d) Header files
In the descriptor file, we can see the schema details and the address of the data.
In the data files, we can see the data in native format.
The control and header files reside in the operating system.
Transformer_Stage :
Topics covered for this stage include:
- How to perform aggregation using a Transformer
- Date and time string functions
- Null handling functions
- Vector functions
- Type conversion functions
- How to convert a single row into multiple rows?
Sort Stage:
When we select any of these options and set it to True, the stage creates group IDs group-wise:
the data is divided into groups based on the key column, and the first row of every group gets 1
while the rest of the rows in each group get 0.
Key Change Column and Cluster Key Change Column are used based on the data we are getting:
If the incoming data is not sorted, we use Key Change Column to create group IDs.
If the incoming data is already sorted, we use Cluster Key Change Column to create group IDs.
In the Stage Properties, if you are getting unsorted data, keep Key Change Column set to True;
group IDs will be generated as 0's and 1's group-wise. Then drag and drop the required columns
into the output.
If your data is already sorted you need to keep Cluster Key Change Column set to True (don't
select Key Change Column).
Aggregator_Stage :
The Aggregator Stage:
The Aggregator stage is a processing stage in Datastage used for grouping and summary
operations. By default the Aggregator stage executes in parallel mode in parallel jobs.
Note: In a parallel environment, the way that we partition data before grouping and
summarizing affects the results. If you partition data using the round-robin method, records
with the same key values will be distributed across different partitions, and that will give
incorrect results.
Aggregation Method:
The Aggregator stage has two different aggregation methods:
1) Hash: Use hash mode for a relatively small number of groups; generally, fewer than about 1000
groups per megabyte of memory.
2) Sort: Sort mode requires the input data set to have been partition-sorted with all of the grouping
keys specified as hashing and sorting keys. Unlike the hash aggregator, the sort aggregator requires
presorted data, but it only maintains the calculations for the current group in memory.
Aggregation Data Type:
By default the aggregator stage's calculation output column is of the double data type; if you want
decimal output then add the following property as shown in the figure below.
If you are using a single key column for the grouping keys then there is no need to sort or
hash-partition the incoming data.
3,pinky,10
4,lin,20
5,jim,10
6,emy,30
7,pom,10
8,jem,20
9,vin,30
10,den,20
Here our requirement is to find the maximum salary for each dept. number.
According to this sample data, we have three departments.
Take Sequential File to read the data and take Aggregator for calculations.
Comment from reader Ram R:
Hi,
I tried this one and have some questions.
If we have a data as below
table_a
dno,name
10,siva
10,ram
10,sam
20,tom
30,emy
20,tiny
40,remo
And we need to get the records whose dno repeats multiple times into one target,
and the records whose dno is not repeated into another target.
My question:
I placed 2 seq files, one with count > 1 and the other with count <= 1. The 1st seq file output was
this:
dno count
10 3
20 2
The 2nd seq file output was like this:
dno count
40 1
30 1
Instead I wanted output like this:
dno name
10 siva
10 ram
10 sam
20 tom
20 tiny
2nd output file should be:
dno name
30 emy
40 remo
Join Stage:
MULTIPLE JOIN STAGES TO JOIN THREE TABLES:
If we have three tables to join and we don't have the same key column in all the tables, we
cannot join them using one Join stage.
In this case we can use multiple Join stages to join the tables.
You can take sample data as below
soft_com_1
e_id,e_name,e_job,dept_no
001,james,developer,10
002,merlin,tester,20
003,jonathan,developer,10
004,morgan,tester,20
005,mary,tester,20
soft_com_2
dept_no,d_name,loc_id
10,developer,200
20,tester,300
soft_com_3
loc_id,add_1,add_2
10,melbourne,victoria
20,brisbane,queensland
Table2
Bizno,Job
20,clerk
30,salesman
xyz2 (Table 2 )
e_id,address
1,los angeles
2,washington
3,mexico
4,indiana
5,chicago
Read and load both the source tables in seq. files.
Say we have duplicates in the left table on the key field; what will happen? We will get all
matching records, including all matching duplicates. Here is the table representation of the join.
Left Outer Join:
All the records from the left table plus all matching records; where no match exists in the right
table, the columns are populated with nulls.
=====================================================================
Lookup_Stage :
Lookup Stage:
The Lookup stage is most appropriate when the reference data for all lookup stages in a job
is small enough to fit into available physical memory. Each lookup reference requires a contiguous
block of shared memory. If the Data Sets are larger than available memory resources, the JOIN or
MERGE stage should be used.
Lookup stages do not require data on the input link or reference links to be sorted. Be aware,
though, that large in-memory lookup tables will degrade performance because of their paging
requirements. Each record of the output data set contains columns from a source record plus columns
from all the corresponding lookup records where corresponding source and lookup records have the
same value for the lookup key columns. The lookup key columns do not have to have the same
names in the primary and the reference links.
The optional reject link carries source records that do not have a corresponding entry in the
input lookup tables.
You can also perform a range lookup, which compares the value of a source column to a range of
values between two lookup table columns. If the source column value falls within the required range, a
row is passed to the output link. Alternatively, you can compare the value of a lookup column to a
range of values between two source columns. Range lookups must be based on column values, not
constant values. Multiple ranges are supported.
There are some special partitioning considerations for Lookup stages. You need to ensure that the data
being looked up in the lookup table is in the same partition as the input data referencing it. One way
of doing this is to partition the lookup tables using the Entire method.
Lookup stage configuration: Equal lookup
Scenario 2: Fail
The job aborted with the following error:
stg_Lkp,0: Failed a key lookup for record 2 Key Values: CUSTOMER_ID: 3
Scenario 3: Drop
Scenario 4: Reject
If we select Reject as the lookup failure condition then we need to add a reject link, otherwise
we get a compilation error.
Range Lookup:
Business scenario: we have input data with customer id, customer name and transaction date. We
have a customer dimension table with customer address information. A customer can have multiple
records with different start and end dates, and we want to select the record where the incoming
transaction date falls between the start and end dates of the customer in the dim table.
Ex Input Data:
CUSTOMER_ID  CUSTOMER_NAME  TRANSACTION_DT
1            UMA            2011-03-01
1            UMA            2010-05-01
Ex Dim Data:
CUSTOMER_ID  CITY        ZIP_CODE  START_DT    END_DT
1            BUENA PARK  90620     2010-01-01  2010-12-31
1            CYPRESS     90630     2011-01-01  2011-04-30
Expected Output:
CUSTOMER_ID  CUSTOMER_NAME  TRANSACTION_DT  CITY        ZIP_CODE
1            UMA            2011-03-01      CYPRESS     90630
1            UMA            2010-05-01      BUENA PARK  90620
You need to specify 'return multiple rows from the reference link', otherwise you will get the
following warning in the job log. Even though we have two distinct rows based on the
customer_id, start_dt and end_dt columns, Datastage considers them duplicate rows based on
the customer_id key only.
stg_Lkp,0: Ignoring duplicate entry; no further warnings will be issued for this table
Scenario 2: Specify the range on the reference link.
This concludes lookup stage configuration for different scenarios.
e_sal <=hsal
Click Ok
Than Drag and Drop the Required columns into the output and click Ok
Give File name to the Target File.
Then Compile and Run the Job . That's it you will get the required Output.
Merge_Stage :
Merge Stage:
The Merge stage is a processing stage. It can have any number of input links, a single output
link, and the same number of reject links as there are update input links (according to the DS
documentation).
The Merge stage combines a master dataset with one or more update datasets based on the key
columns. The output record contains all the columns from the master record plus any additional
columns from each update record that are required.
A master record and an update record will be merged only if both have the same key column values.
The data sets input to the Merge stage must be key partitioned and sorted. This ensures
that rows with the same key column values are located in the same partition and will be processed by
the same node. It also minimizes memory requirements because fewer rows need to be in memory at
any one time.
As part of preprocessing your data for the Merge stage, you should also remove duplicate
records from the master data set. If you have more than one update data set, you must remove
duplicate records from the update data sets as well.
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject
links. You can route update link rows that fail to match a master row down a reject link that is specific
for that link. You must have the same number of reject links as you have update links. The Link
Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject
links. You can also specify whether to drop unmatched master rows, or output them on the output
data link.
Example:
Master dataset:
CUSTOMER_ID  CUSTOMER_NAME
1            UMA
2            POOJITHA
Update dataset1 (CUSTOMER_ID, CITY, ZIP_CODE, SEX):
1  CYPRESS  90630
2  CYPRESS  90630
Output (CUSTOMER_ID, CUSTOMER_NAME, CITY, ZIP_CODE, SEX):
1  UMA       CYPRESS  90630
2  POOJITHA  CYPRESS  90630
Options:
Unmatched Masters Mode: Keep means that unmatched rows (those without any updates) from the
master link are output; Drop means that unmatched rows are dropped instead.
Warn On Reject Updates: True to generate a warning when bad records from any update links are
rejected.
Warn On Unmatched Masters: True to generate a warning when there are unmatched rows from the
master link.
Compile and run the job:
Scenario 2:
Remove a record from the updateds1 and check the output:
Check for the datastage warning in the job log as we have selected Warn on unmatched masters =
TRUE
stg_merge,0: Master record (0) has no updates.
stg_merge,1: Update record (1) of data set 1 is dropped; no masters are left.
Scenario 3: Drop the unmatched master record and capture reject records from updateds1.
Update dataset2 (CUSTOMER_ID, CITIZENSHIP):
1  INDIAN
2  AMERICAN
We still have a duplicate row in the master dataset; if you compile the job with the above design
you will get a compilation error like the one below.
If you look at the above figure you can see 2 rows in the output, because we have a matching row
for customer_id = 2 in updateds2.
No change in the results; the merge stage automatically dropped the duplicate row.
Scenario 6: modify a duplicate row for customer_id = 1 in the updateds1 dataset with zipcode
90630 instead of 90620.
=====================================================================
Filter_Stage :
Filter Stage:
The Filter stage is a processing stage used to filter data based on a filter condition.
Scenario 2: Comparing incoming fields; check whether the transaction date falls between str_dt
and end_dt and filter those records.
Input Data:
CUSTOMER_ID  CUSTOMER_NAME  TRANSACTION_DT  STR_DT     END_DT
1            UMA            1/1/2010        5/20/2010  12/20/2010
1            UMA            5/28/2011       5/20/2010  12/20/2010
Output:
CUSTOMER_ID  CUSTOMER_NAME  TRANSACTION_DT  STR_DT     END_DT
1            UMA            5/28/2011       5/20/2010  12/20/2010
Reject:
CUSTOMER_ID  CUSTOMER_NAME  TRANSACTION_DT  STR_DT     END_DT
1            UMA            1/1/2010        5/20/2010  12/20/2010
Partition data based on CUSTOMER_ID to make sure all rows with same key values process on the
same node.
Condition : where TRANSACTION_DT Between STRT_DT and END_DT
Actual Output:
Actual Reject Data:
Output :
Reject :
This covers most filter stage scenarios.
We can take a Sequential File stage to read the data, a Filter stage for the conditions,
and a Dataset file to load the data into the target.
Design as follows: --Seq.File---------Filter------------DatasetFile
Copy Stage :
COPY STAGE:
The Copy stage is a processing stage that has one input and 'n' outputs. The Copy stage is used
to send one source's data to multiple copies, which can then be used for multiple purposes. The
records which we send through the Copy stage are copied without any modifications, and we can
also do the following:
a) The column order can be altered.
b) Columns can be dropped.
c) We can change the column names.
In the Copy stage we have an option called Force. It is False by default; if we set it to True, it
specifies that Datastage should not try to optimize the job by removing a copy operation where
there is one input and one output.
================================================================================
Funnel_Stage :
Funnel Stage:
The Funnel stage is used to combine multiple input datasets into a single output dataset. This
stage can have any number of input links and a single output link.
It operates in 3 modes:
Continuous Funnel combines records as they arrive (i.e. no particular order);
Sort Funnel combines the input records in the order defined by one or more key fields;
Sequence copies all records from the first input data set to the output data set, then all the records
from the second input data set, etc.
Note:Metadata for all inputs must be identical.
Sort funnel requires data must be sorted and partitioned by the same key columns as to be used by
the funnel operation.
Hash Partition guarantees that all records with same key column values are located in the same
partition and are processed in the same node.
1)Continuous funnel:
Go to the properties of the funnel stage page and set Funnel Type to continuous funnel.
2)Sequence:
Note: In order to use the sequence funnel you need to specify the order in which the input links
are processed, and also make sure the stage runs in sequential mode.
Usually we use the sequence funnel when we create a file with header, detail and trailer records.
3) Sort Funnel:
Note: If you are running your sort funnel stage in parallel, you should be aware of the various
considerations about sorting data and partitions.
That's all about funnel stage usage in Datastage.
Sometimes we get data in multiple files which belong to the same bank's customer information.
In that case we need to funnel the tables to get the multiple files' data into a single file (table).
xyzbank1
e_id,e_name,e_loc
111,tom,sydney
222,renu,melboourne
333,james,canberra
444,merlin,melbourne
xyzbank2
e_id,e_name,e_loc
555,,flower,perth
666,paul,goldenbeach
777,raun,Aucland
888,ten,kiwi
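For intuition only, a hedged shell analogue of a continuous funnel over these two extracts
(the file names are illustrative; the real combining happens inside the Funnel stage):

    # Keep one header line, then append the data rows of both files
    head -1 xyzbank1.txt > xyzbank_all.txt
    tail -n +2 xyzbank1.txt >> xyzbank_all.txt
    tail -n +2 xyzbank2.txt >> xyzbank_all.txt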
Column Generator :
Column Generator is a development/test stage that is used to generate a column with sample
data based on a user-defined data type.
Take the job design as:
Seq.File--------------Col.Gen------------------Ds
Surrogate_Key_Stage :
Suppose n records are loaded; by using a surrogate key you can continue the sequence from n+1.
Double click on the surrogate key stage and click on the Properties tab.
Properties:
If the state file exists we can update it; otherwise we can create and update it.
We are using the SkeyValue parameter to update the state file using a transformer stage.
In the row generator we are using 10 rows, and hence when we run the job we see 10 skey values
in the output. I have updated the state file with 100, and below is the output.
If you want to generate the key value from the beginning you can use the following properties in
the surrogate key stage.
If the key source is a flat file, specify how keys are generated:
- To generate keys in sequence from the highest value that was last used, set the Generate Key
from Last Highest Value property to Yes. Any gaps in the key range are ignored.
- To specify a value to initialize the key source, add the File Initial Value property to the
Options group, and specify the start value for key generation.
- To control the block size for key ranges, add the File Block Size property to the Options
group, set this property to User specified, and specify a value for the block size.
- If there is no input link, add the Number of Records property to the Options group, and specify
how many records to generate.
=====================================================================
SCD :
WHAT IS SCD IN DATASTAGE ? TYPES OF SCD IN DATASTAGE?
SCDs are nothing but Slowly Changing Dimensions.
SCDs are dimensions containing data that changes slowly, rather than on a regular,
time-based schedule.
SCDs are implemented mainly in three types. They are:
Type-1 SCD
Type-2 SCD
Type-3 SCD
Type-1 SCD: In the Type-1 SCD methodology, the older data (records) is overwritten with the
new data (records), and therefore historical information is not maintained.
This is used for correcting the spellings of names, and for small updates to customer records.
Type-2 SCD: In the Type-2 SCD methodology, complete historical information is tracked by
creating multiple records for a given natural key (primary key) in the dimension tables, with
separate surrogate keys or different version numbers. We have unlimited historical data
preservation, as a new record is inserted each time a change is made.
Here we use different kinds of options in order to track the historical data of customers, like:
a) Active flag
b) Date functions
c) Version numbers
d) Surrogate keys
We use these to track all the historical data of the customer; according to our input, we use
the required mechanism for tracking.
Type-3 SCD: In the Type-3 SCD, partial historical information is maintained.
Slowly Changing Dimensions: Slowly changing dimensions are the dimensions in which the data
changes slowly, rather than changing regularly on a time basis.
For example, you may have a customer dimension in a retail domain. Let's say the customer is in
India and every month he does some shopping. Creating the sales report for the customer is
easy. Now assume that the customer is transferred to the United States and does his shopping
there. How do you record such a change in your customer dimension?
You could sum or average the sales done by the customer, but in that case you won't get an
exact comparison of the sales done by the customer. As the customer's salary increased after
the transfer, he/she might do more shopping in the United States compared to India; if you sum
the total sales, the customer's sales might look stronger than they really are. You could create a
second customer record and treat the transferred customer as a new customer. However, this
will create problems too.
Handling these issues involves the SCD management methodologies referred to as Type 1 to
Type 3. The different types of slowly changing dimensions are explained in detail below.
SCD Type 1: SCD type 1 methodology is used when there is no need to store historical data in the
dimension table. This method overwrites the old data in the dimension table with the new data. It is
used to correct data errors in the dimension.
As an example, I have a customer table with the data below.
Customer_Name  Location
Marspton       Illions
Here the customer name is misspelt: it should be Marston instead of Marspton. If you use the
type 1 method, it simply overwrites the data. The data in the updated table will be:
Customer_Name  Location
Marston        Illions
The advantage of type1 is ease of maintenance and less space occupied. The disadvantage is that
there is no historical data kept in the data warehouse.
SCD Type 3: In the type 3 method, only the current status and previous status of the row are
maintained in the table. To track these changes, two separate columns are created in the table.
The customer dimension table in the type 3 method will look like:
Customer_Name  Current_Location  Previous_Location
Marston        Illions           NULL
Let's say the customer moves from Illions to Seattle; the updated table will look like:
Marston        Seattle           Illions
Now again, if the customer moves from Seattle to NewYork, the updated table will be:
Marston        NewYork           Seattle
The type 3 method will have limited history and it depends on the number of columns you create.
SCD Type 2: SCD type 2 stores the entire history of the data in the dimension table. With type 2
we can store unlimited history in the dimension table. In type 2, you can store the data in three
different ways. They are:
Versioning
Flagging
Effective Date
SCD Type 2 Versioning: In the versioning method, a sequence number is used to represent the
change. The latest sequence number always represents the current row, and the previous
sequence numbers represent the past data.
As an example, let's use the same example of the customer who changes location. Initially the
customer is in Illions and the data in the dimension table will look like:
Customer_Name  Location  Version
Marston        Illions   1
The customer moves from Illions to Seattle and the version number is incremented. The
dimension table will look like:
Marston        Illions   1
Marston        Seattle   2
Now again, if the customer moves to another location, a new record will be inserted into the
dimension table with the next version number.
SCD Type 2 Flagging: In the flagging method, a flag column is created in the dimension table.
The current record will have the flag value 1 and the previous records will have the flag value 0.
For the first time, the customer dimension will look like:
Customer_Name  Location  Flag
Marston        Illions   1
Now when the customer moves to a new location, the old record is updated with flag value 0
and the latest record gets flag value 1:
Marston        Illions   0
Marston        Seattle   1
SCD Type 2 Effective Date: In Effective Date method, the period of the change is tracked using the
start_date and end_date columns in the dimension table.
Customer_Name  Location  Start_Date   End_Date
Marston        Illions   01-Mar-2010  20-Feb-2011
Marston        Seattle   21-Feb-2011  NULL
The NULL in the End_Date indicates the current version of the data and the remaining records
indicate the past data.
SCD-2 Implementation in Datastage:
Slowly changing dimension Type 2 is a model where the whole history is stored in the database. An
additional dimension record is created and the segmenting between the old record values and the new
(current) value is easy to extract and the history is clear.
The fields 'effective date' and 'current indicator' are very often used in that dimension and the fact
table usually stores dimension key and version number.
SCD 2 implementation in Datastage
The job described and depicted below shows how to implement SCD Type 2 in Datastage. It is one of
many possible designs which can implement this dimension.
For this example, we will use a table with customers data (it's name is D_CUSTOMER_SCD2) which
has the following structure and data:
D_CUSTOMER dimension table before loading
Datastage SCD2 job design
The most important facts and stages of the CUST_SCD2 job processing:
The dimension table with customers is refreshed daily and one of the data sources is a text file. For
the purpose of this example the CUST_ID=ETIMAA5 differs from the one stored in the database and it
is the only record with changed data. It has the following structure and data:
SCD 2 - Customers file extract:
There is a hashed file (Hash_NewCust) which handles a lookup of the new data coming from the text
file.
A T001_Lookups transformer does a lookup into a hashed file and maps new and old values to
separate columns.
SCD 2 lookup transformer
A T002_Check_Discrepacies_exist transformer compares old and new values of records and passes
through only records that differ.
SCD 2 check discrepancies transformer
A T003 transformer handles the UPDATE and INSERT actions for a record. The old record is
updated with the current indicator flag set to no, and the new record is inserted with the current
indicator flag set to yes, the record version increased by 1, and the current date.
SCD 2 insert-update record transformer
ODBC Update stage (O_DW_Customers_SCD2_Upd) - update action 'Update existing rows only' and
the selected key columns are CUST_ID and REC_VERSION so they will appear in the constructed
where part of an SQL statement.
ODBC Insert stage (O_DW_Customers_SCD2_Ins) - insert action 'insert rows without clearing' and
the key column is CUST_ID.
D_CUSTOMER dimension table after Datawarehouse refresh
===============================================================
Pivot_Enterprise_Stage:
The Pivot Enterprise stage is a processing stage which pivots data vertically or horizontally
depending upon the requirements. There are two types:
1. Horizontal
2. Vertical
The horizontal pivot operation maps a set of input columns to multiple output rows, which is
exactly the opposite of the vertical pivot operation, which maps input rows to multiple columns.
Let's try to understand it one by one with the following example.
Step 1: Consider the following input data:
Product Type  Color_1  Color_2  Color_3
Pen           Yellow   Blue     Green
Dress         Pink     Yellow   Purple
Step 2: Select Horizontal for Pivot Type from the drop-down menu under the Properties tab for a
horizontal pivot operation.
Step 3: Click on the Pivot Properties tab, under which we need to tick the box against Pivot Index,
after which a column named Pivot_Index will appear under the Name column; also declare a new
column named Color, as shown below.
Step 4: Now we have to mention the columns to be pivoted under Derivation against the column
Color. Double click on it and the following window will pop up.
Select the columns to be pivoted from the Available columns pane as shown. Click OK.
Step 5: Under the Output tab, map only the pivoted columns as shown.
Configure the output stage and give the file path. See the image below for reference.
Step 6: Compile and run the job, and let's see what happens in the output.
This is how we can set multiple input columns to the single column (As here for colors).
Vertical Pivot Operation:
Here, we are going to use the Pivot Enterprise stage to vertically pivot data. We are going to map
multiple input rows to a single row. The main advantage of this stage is that we can use
aggregation functions like avg, sum, min, max, first, last etc. for the pivoted columns. Let's see
how it works.
Consider the output data of the horizontal operation as the input data for the Pivot Enterprise
stage. Here, we will be adding one extra column for the aggregation functions, as shown in the
table below.
Product  Color   Prize
Pen      Yellow  38
Pen      Blue    43
Pen      Green   25
Dress    Pink    1000
Dress    Yellow  695
Dress    purple  738
Step 2: Open the Pivot Enterprise stage and select Pivot Type as Vertical under the Properties tab.
Step 3: Under the Pivot Properties tab specify at least one pivot column and one group-by
column. Here, we declared Product as the group-by column, and Color and Prize as pivot
columns. Let's see how to use aggregation functions in the next step.
Step 4: On clicking 'Aggregation functions required for this column' for a particular column, the
following window will pop up, in which we can select whichever functions are required for that
column. Here we are using the min, max and average functions, with proper precision and scale,
for the Prize column, as shown.
Step 5: Now we just have to do the mapping under the Output tab as shown below.
Step 6: Compile and run the job, and let's see what the output will be.
Output :
Let me first tell you that a Pivot stage only CONVERTS COLUMNS INTO ROWS and
nothing else. Some DS professionals refer to this as NORMALIZATION. Another fact about
the Pivot stage is that it's irreplaceable, i.e. no other stage has this functionality of
converting columns into rows!!! So, that makes it unique, doesn't it!!!
Let's cover how exactly it does it....
For example, let's take a file with the following fields: Item, Quantity1, Quantity2,
Quantity3...
Item~Quantity1~Quantity2~Quantity3
ABC~100~1000~10000
DEF~200~2000~20000
GHI~300~3000~30000
Basically you would use a Pivot stage when you need to convert those 3 Quantity fields
into a single field which contains a unique Quantity value per row, i.e. you would need
the following output:
Item~Quantity
ABC~100
ABC~1000
ABC~10000
DEF~200
DEF~2000
DEF~20000
GHI~300
GHI~3000
GHI~30000
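For intuition, a hedged shell analogue of that column-to-row pivot (assuming the sample above
is saved as items.txt, with the header on line 1; the real job does this inside the Pivot stage):

    # For every data row, emit one Item~Quantity line per quantity column
    awk -F'~' 'NR > 1 { for (i = 2; i <= NF; i++) print $1 "~" $i }' items.txt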
Now connect a Pivot stage from the tool palette to the above output link and
create an output link for the Pivot stage itself (for enabling the Output tab for the
Pivot stage).
Unlike other stages, a Pivot stage doesn't use the generic GUI stage page. It has a
stage page of its own. And by default the Output columns page does not have
any fields, hence you need to manually type in the fields. In this case just type in
the 2 field names: Item and Quantity. However, manually typing the columns
becomes a tedious process when the number of fields is larger. In that case you can
use the metadata Save/Load feature: go to the input Columns tab of the Pivot stage,
save the table definition, and load it in the output Columns tab. This is the way
I use it!!!
Now you have the following fields in the Output Columns tab: Item and
Quantity. Here comes the tricky part, i.e. you need to specify the
DERIVATION. In case the field names of the Output columns tab are the same as the
Input tab, you need not specify any derivation, i.e. in this case for the Item field
you need not specify any derivation. But if the Output columns tab has new field
names, you need to specify a derivation or you will get a RUN-TIME error for
free...
For our example, you need to type the derivation for the Quantity field as:
Column name  Derivation
Item         Item (or you can leave this blank)
Quantity     Quantity1, Quantity2, Quantity3
Just attach another file stage and view your output!!! So, objective met!!!
Sequence_Activities :
In this article I will explain how to use Datastage looping activities in a sequencer.
I have a requirement where I need to pass a file id as a parameter, reading it from a file. In
future the number of file ids will increase, but I won't have to add a job or change the sequencer
if I take advantage of Datastage looping.
Contents in the File:
1|200
2|300
3|400
I need to read the above file and pass the second field as a parameter to the job. I have created
one parallel job with pFileID as a parameter.
Step 1: Count the number of lines in the file so that we can set the upper limit in the Datastage
Start Loop activity.
A sample routine to count the lines in a file:
Argument: FileName (including path)
      Deffun DSRMessage(A1, A2, A3) Calling "*DataStage*DSR_MESSAGE"
      Equate RoutineName To "CountLines"
      Command = "wc -l ":FileName:" | awk '{print $1}'"
      Call DSLogInfo("Executing command to get the record count: ":Command, RoutineName)
* Call the support routine that executes a shell command.
      Call DSExecute("UNIX", Command, Output, SystemReturnCode)
* Log any and all output as an Information type log message,
* unless the system return code indicates that an error occurred,
* when we log a slightly different Warning type message.
      vOutput = Convert(Char(254), "", Output)
      If (SystemReturnCode = 0) And (Num(vOutput) = 1) Then
         Call DSLogInfo("Command executed successfully: ":Command, RoutineName)
         Output = Convert(Char(254), "", Output)
         Call DSLogInfo("Record count in ":FileName:" = ":Output, RoutineName)
         Ans = Output
*        GoTo NormalExit
      End Else
         Call DSLogInfo("Error when executing command: ":Command, RoutineName)
         Call DSLogFatal(Output, RoutineName)
         Ans = 1
      End
Now we use the StartLoop.$Counter variable to get the file id, using a combination of grep/sed
and awk commands; each iteration fetches one file id (a hedged shell equivalent is sketched
below).
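A minimal standalone sketch of that extraction, assuming the parameter file above is saved as
/data/params/file_ids.txt (the path and delimiter are illustrative); in the sequence, n would be
supplied by the Start Loop activity's $Counter:

    # Print the second pipe-delimited field of line n (here n=2 prints 300)
    n=2
    awk -F'|' -v n="$n" 'NR == n { print $2 }' /data/params/file_ids.txt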
If our requirement is to filter the data department wise from the file below
samp_tabl
1,sam,clerck,10
2,tom,developer,20
3,jim,clerck,10
4,don,tester,30
5,zeera,developer,20
6,varun,clerck,10
7,luti,production,40
8,raja,priduction,40
And our requirement is to get the target data as below:
In Target1 we need dept 10 & 40 employees.
In Target2 we need dept 30 employees.
In Target3 we need dept 20 & 40 employees.
Read and load the data in the source file.
In the Transformer stage just drag and drop the data to the target tables.
Write expressions in the constraints as below:
dept_no=10 Or dept_no=40 for target 1
dept_no=30 for target 2
dept_no=20 Or dept_no=40 for target 3
Click OK.
Give file names to the target files, then compile and run the job to get the output.
Shared Container :
Suppose 4 jobs are controlled by a sequencer (job 1, job 2, job 3, job 4). If job 1 has 10,000
rows and after running the job only 5,000 rows have been loaded into the target table when the
job aborts, how can you sort out the problem? The job sequencer synchronizes or controls the
4 jobs, but job 1 has a problem; in this condition you should go to the Director and check what
type of problem is being shown: a data type problem, a warning message, a job failure or a job
abort. If the job fails it means a data type problem or a missing column action. So you should
go to the Run window -> click -> Tracing -> Performance, or in your target table -> General ->
Action -> select one of these two options:
(i) On Fail -- Commit, Continue
(ii) On Skip -- Commit, Continue.
First check how much data is already loaded, then select the On Skip option and continue; for
the remaining data not yet loaded, select On Fail, Continue. Run the job again and you will
definitely get a success message.
Scenario 2:
Following is the existing job design. But the requirement changed: the header and trailer
datasets should be populated even if detail records are not present in the source file, and the job
below doesn't do that.
Solution: use a Row Generator with a Copy stage, and give a default value (zero) for the count
column coming in from the Row Generator. If there are no detail records, the record count is
picked from the Row Generator.
We have a source which is a sequential file with a header and footer. How do we remove the
header and footer while reading this file using the sequential file stage of DataStage?
Sol: Type this command in putty: sed '1d;$d' file_name > new_file_name (run it in the job's
before-job subroutine, then use the new file in the sequential file stage).
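For instance, a hedged example of that before-job call through ExecSH (paths hypothetical):

sed '1d;$d' /data/src/input_full.txt > /data/src/input_body.txt

The sequential file stage then reads /data/src/input_body.txt.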
IF I HAVE SOURCE LIKE COL1 = A, A, B AND TARGET LIKE COL1, COL2 = (A,1), (A,2), (B,1),
HOW TO ACHIEVE THIS OUTPUT USING A STAGE VARIABLE IN THE TRANSFORMER STAGE?
If keyChange = 1 Then 1 Else stageVariable + 1
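A minimal sketch of that setup (link and variable names assumed): sort the input on COL1 with
Create Key Change Column = True, then in the Transformer define

svOccurrence = If in.keyChange = 1 Then 1 Else svOccurrence + 1

and derive COL2 from svOccurrence, which yields (A,1), (A,2), (B,1).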
Question: I want to process 3 files sequentially, one by one. How can I do that, so that it fetches
the files automatically while processing?
Ans: If the metadata for all the files is the same, then create a job having the file name as a
parameter, then use the same job in a routine and call the job with a different file name, or you
can create a sequencer to use the job:
Parameterize the file name.
Build the job using that parameter.
Build a job sequencer which will call this job and will accept the parameter for the file name.
Write a UNIX shell script which will call the job sequencer three times, passing a different file
each time.
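A hedged sketch of that wrapper script (project, sequencer and file names hypothetical; dsjob
lives under $DSHOME/bin):

#!/bin/sh
# Run the sequencer once per input file, passing the file name as a parameter.
for f in /data/in/file1.txt /data/in/file2.txt /data/in/file3.txt
do
    $DSHOME/bin/dsjob -run -jobstatus -param pFileName=$f MyProject SeqProcessFile
done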
RE: What happens if RCP is disabled?
Runtime column propagation (RCP): if RCP is enabled for a job, and specifically for those
stages whose output connects to the shared container input, then metadata is propagated at run
time, so there is no need to map it at design time.
If RCP is disabled for the job, OSH has to perform an import and an export every time the job
runs, the processing time of the job increases, and you have to manually enter all the column
descriptions in each stage.
Question:
Source:
Eno
1
2
3
Target
Ename
a,b
c,d
e,f
THEN THE OUTPUT LOOKS LIKE THIS....
Ename
1
a
2
b
3
c
3) Input is like:
file1
10
20
10
10
20
30
Output is like (file2 gets the distinct values, file3 the extra duplicate occurrences):
file2    file3 (duplicates)
10       10
20       10
30       20
4) Input is like:
file1
10
20
10
10
20
30
Output is like (multiple occurrences in one file and single occurrences in the other):
file2    file3
10       30
10
10
20
20
5) Input is like this:
file1
10
20
10
10
20
30
Output is like (one row per value that occurs multiple times, singles separately):
file2    file3
10       30
20
6) Input is like this:
file1
1
2
3
4
5
6
7
8
9
10
Output is like:
file2 (odd)    file3 (even)
1              2
3              4
5              6
7              8
9              10
Input (Company, LOCATION):
TCS   HYD
TCS   BAN
TCS   CHE
IBM   HYD
IBM   CHE
IBM   BAN
HCL   BAN
HCL   CHE
HCL   HYD
LIKE THIS.......
THEN THE OUTPUT LOOKS LIKE THIS....
Company   loc           count
TCS       HYD,BAN,CHE   3
IBM       HYD,CHE,BAN   3
HCL       BAN,CHE,HYD   3
Solution:
SeqFile ......> Sort ......> Transformer ......> RemoveDuplicates ..........> Dataset
Sort:
Key = Company
Sort Order = Ascending
Create Key Change Column = True
Transformer:
Create a stage variable, Company1, with the derivation:
Company1 = If (in.keyChange = 1) Then in.Location Else Company1 : ',' : in.Location
Drag and drop in the derivations:
Company ....................> Company
Company1 ....................> Location
RemoveDuplicates:
Key = Company
Duplicates To Retain = Last
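To see why Duplicates To Retain = Last matters, here are the intermediate Transformer rows for
the TCS group under this design (a sketch, assuming the input order shown above):

Company   Company1
TCS       HYD
TCS       HYD,BAN
TCS       HYD,BAN,CHE

RemoveDuplicates on Company keeping the last row leaves only TCS with HYD,BAN,CHE.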
11)The input is
Shirt|red|blue|green
Pant|pink|red|blue
Output should be,
Shirt:red
Shirt:blue
Shirt:green
pant:pink
pant:red
pant:blue
Solution:
This is the reverse of the Pivot stage. Use
seq ------ sort ------ transformer ---- removeduplicates ----- transformer ---- target
In the Sort stage set Create Key Change Column = True.
In the first Transformer create a stage variable: If keyChange = 1 Then the key column value
Else stageVariable : ':' : column.
In the RemoveDuplicates stage set Duplicates To Retain = Last.
In the final Transformer use the Field() function to separate the columns.
A similar scenario:
source
col1 col3
1 samsung
1 nokia
1 ercisson
2 iphone
2 motrolla
3 lava
3 blackberry
3 reliance
Expected Output
col1   col2      col3         col4
1      samsung   nokia        ercisson
2      iphone    motrolla
3      lava      blackberry   reliance
You can get it by using
SeqFile ---- Sort ---- Transformer ---- RemoveDuplicates ---- Transformer ---- tgt
First read and load the data into your source file (for example a Sequential File).
In the Sort stage sort on col1 and set Create Key Change Column = True.
Go to the Transformer stage and create one stage variable. You can do this by right-clicking in
the stage variables area, going to properties and naming it as you wish (for example temp). In
the expression write:
If keyChange = 1 Then col3 Else temp : ',' : col3
(this builds the required column as a comma-delimited list).
On the RemoveDuplicates stage the key is col1; set the Duplicates To Retain option to Last.
In the final Transformer drop col3 and define 3 columns, col2, col3 and col4:
in the col2 derivation give Field(InputColumn, ",", 1),
in the col3 derivation give Field(InputColumn, ",", 2), and
in the col4 derivation give Field(InputColumn, ",", 3).
Scenario:
12) Consider the following employees data as source:
employee_id, salary
10, 1000
20, 2000
30, 3000
40, 5000
Create a job to find the sum of salaries of all employees, and this sum should repeat for all
the rows.
The output should look like this:
employee_id, salary, salary_sum
10, 1000, 11000
20, 2000, 11000
30, 3000, 11000
40, 5000, 11000
Scenario:
I have two source tables/files numbered 1 and 2.
In the target there are three output tables/files, numbered 3, 4 and 5.
The scenario is that:
to output 3 -> the records which are only in 1 but not in 2 should go;
to output 4 -> the records which are common to both 1 and 2 should go;
to output 5 -> the records which are only in 2 but not in 1 should go.
sltn: src1 -----> copy1 ------> output_3 (records only in the left table)
      copy1, copy2 ------> Join (inner type) ----> output_4 (common records)
      src2 -----> copy2 ------> output_5 (records only in the right table)
Scenario:
Create a job to find the sum of salaries of all employees, and this sum should repeat for all the
rows (the same output as in scenario 12 above).
sltn:
Take Source ---> Transformer (add a new column on both the output links and assign it the
value 1) --->
1) Aggregator (do a group by using that new column), and
2) Lookup/Join (join on that new column) --------> tgt.
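For the sample data above the design works out as follows (the column name DUMMY is
assumed):

src ---> Transformer (DUMMY = 1 on both output links)
     ---> Aggregator (group by DUMMY; salary_sum = Sum(salary) = 11000)
     ---> Join (key = DUMMY) ---> tgt (each detail row now carries salary_sum = 11000)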
Scenario:
sno,sname,mark1,mark2,mark3
1,rajesh,70,68,79
2,mamatha,39,45,78
3,anjali,67,39,78
4,pavani,89,56,45
5,indu,56,67,78
Output is
sno,sname,mark1,mark2,mark3,delimitercount
1,rajesh,70,68,79,4
2,mamatha,39,45,78,4
3,anjali,67,39,78,4
4,pavani,89,56,45,4
5,indu,56,67,78,4
seq ---> trans ---> seq
Create one stage variable, delimiter,
and put this derivation on the stage variable: DSLink4.sno : "," : DSLink4.sname : "," :
DSLink4.mark1 : "," : DSLink4.mark2 : "," : DSLink4.mark3
Then do the mapping, create one more column, count, as integer type,
and put this derivation on the count column: Count(delimiter, ",")
Scenario:
sname    total_vowels_count
Allen    2
Scott    1
Ward     1
Under the Transformer stage, the derivation is:
total_vowels_count = Count(DSLink3.sname,"a") + Count(DSLink3.sname,"e") +
Count(DSLink3.sname,"i") + Count(DSLink3.sname,"o") + Count(DSLink3.sname,"u")
Scenario:
1) Daily we are getting some huge files; all the files' metadata is the same. How can we load
them into the target table?
Use File Pattern in the sequential file stage.
2) One column has 10 records, and at run time we have to send the 5th and 6th records to the
target. How can we send them?
This can be done by using a UNIX command in the sequential file stage's filter option (for
example sed -n '5,6p').
How can we get 18 months of date data in the transformer stage?
Use a transformer stage after the input sequential file and try this as a constraint in the
transformer stage:
DaysSinceFromDate(CurrentDate(), DSLink3.date_18) <= 548 OR
DaysSinceFromDate(CurrentDate(), DSLink3.date_18) <= 546
where date_18 is the column holding the date which needs to be less than or equal to 18 months
old; 548 is the number of days in 18 months, and for a leap year it is 546 (these numbers you
need to check).
My source is Like
Sr_no, Name
10,a
10,b
20,c
30,d
30,e
40,f
*****************
source ---> trans ---> target
In the Transformer use these conditions in the constraints of the three output links:
Mod(Sr_no, 3) = 1
Mod(Sr_no, 3) = 2
Mod(Sr_no, 3) = 0
Scenario:
I'm having input as
colA
a_b_c
x_F_I
DE_GH_IF
We have to make it as
col1   col2   col3
a      b      c
x      F      I
DE     GH     IF
*********************
Transformer:
create 3 columns with derivations
col1   Field(colA, '_', 1)
col2   Field(colA, '_', 2)
col3   Field(colA, '_', 3)
**************
The Field function divides the column based on the delimiter;
if the data in the column is like A,B,C
then
Field(col,',',1) gives A
Field(col,',',2) gives B
Field(col,',',3) gives C
How to find out duplicate values using a Transformer?
Another way to find the duplicate values is to use a Sorter stage before the Transformer.
In the Sorter make Create Cluster Key Change Column = TRUE on the key.
Then in the Transformer filter the output based on the value of the cluster key change column,
which can be put in a stage variable.
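A small sketch of that filter (link and variable names assumed): map the sorter's cluster key
change column into the Transformer, define a stage variable

svIsFirst = in.clusterKeyChange

and use svIsFirst = 1 as the constraint of the 'unique' output link and svIsFirst <> 1 as the
constraint of the 'duplicates' link.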
====================================================================
Scenarios_Unix :
> cat file
the black cat was chased by the brown dog.
the black cat was not chased by the brown dog.
> sed -e '/not/s/black/white/g' file
the black cat was chased by the brown dog.
the white cat was not chased by the brown dog.
8) Below is a demo for the character A and the value 65.
ASCII value of a character; it can be done in 2 ways:
1. printf '%d\n' "'A"
2. echo A | tr -d '\n' | od -An -t dC
Character value from ASCII: awk -v char=65 'BEGIN { printf "%c\n", char; exit }'
9) Input file:
crmplp1 cmis461 No Online
cmis462 No Offline
crmplp2 cmis462 No Online
cmis463 No Offline
crmplp3 cmis463 No Online
cmis461 No Offline
Output > crmplp1 cmis461 No Online cmis462 No Offline
crmplp2 cmis462 No Online cmis463 No Offline
crmplp3 cmis463 No Online cmis461 No Offline
Command:
awk 'ORS=NR%2?FS:RS' file
13) To print debugging script output into a log file, add the following commands in the script:
exec 1>>logfilename
exec 2>>logfilename
Vmstat: reports on virtual memory statistics for processes, disk, tape and CPU activity.
27. How do you write the contents of 3 files into a single file?
cat file1 file2 file3 > file
28. How to display the fields in a text file in reverse order?
awk 'BEGIN {ORS=""} { for(i=NF;i>0;i--) print $i," "; print "\n"}' filename
29. Write a command to find the sum of bytes (size of file) of all files in a directory.
ls -l | grep '^-'| awk 'BEGIN {sum=0} {sum = sum + $5} END {print sum}'
30. Write a command to print the lines which end with the word "end"?
grep 'end$' filename
The '$' symbol specifies the grep command to search for the pattern at the end of the line.
31. Write a command to select only those lines containing "july" as a whole word?
grep -w july filename
The '-w' option makes the grep command to search for exact whole words. If the specified
pattern is found in a string, then it is not considered as a whole word. For example: In the string
"mikejulymak", the pattern "july" is found. However "july" is not a whole word in that string.
32. How to remove the first 10 lines from a file?
sed '1,10 d' < filename
33. Write a command to duplicate each line in a file?
sed 'p' < filename
34. How to extract the username from 'who am i' comamnd?
who am i | cut -f1 -d' '
35. Write a command to list the files in '/usr' directory that start with 'ch' and then display the
number of lines in each file?
wc -l /usr/ch*
Another way is
find /usr -name 'ch*' -type f -exec wc -l {} \;
36. How to remove blank lines in a file ?
grep -v '^$' filename > new_filename
37. How to display the processes that were run by your user name ?
ps -aef | grep <user_name>
38. Write a command to display all the files recursively with path under current directory?
find . -depth -print
39. Display zero byte size files in the current directory?
find -size 0 -type f
40. Write a command to display the third and fifth character from each line of a file?
cut -c 3,5 filename
41. Write a command to print the fields from 10th to the end of the line. The fields in the line are
delimited by a comma?
cut -d',' -f10- filename
42 How to replace the word "Gun" with "Pen" in the first 100 lines of a file?
sed '1,100 s/Gun/Pen/' < filename
43. Write a Unix command to display the lines in a file that do not contain the word "RAM"?
grep -v RAM filename
The '-v' option tells the grep to print the lines that do not contain the specified pattern.
44 How to print the squares of numbers from 1 to 10 using awk command
awk 'BEGIN { for(i=1;i<=10;i++) {print "square of",i,"is",i*i;}}'
45. Write a command to display the files in the directory by file size?
ls -l | grep '^-' |sort -nr -k 5
46. How to find out the usage of the CPU by the processes?
The top utility can be used to display the CPU usage by the processes.
47. Write a command to remove the prefix of the string ending with '/'.
The basename utility deletes any prefix ending in /. The usage is mentioned below:
basename /usr/local/bin/file
This will display only file
48. How to display zero byte size files?
ls -l | grep '^-' | awk '/^-/ {if ($5 !=0 ) print $9 }'
49. How to replace the second occurrence of the word "bat" with "ball" in a file?
sed 's/bat/ball/2' < filename
50. How to remove all the occurrences of the word "jhon" except the first one in a line with in
the entire file?
sed 's/jhon//2g' < filename
51. How to replace the word "lite" with "light" from 100th line to last line in a file?
sed '100,$ s/lite/light/' < filename
52. How to list the files that are accessed 5 days ago in the current directory?
find -atime 5 -type f
53. How to list the files that were modified 5 days ago in the current directory?
find -mtime 5 -type f
54. How to list the files whose status is changed 5 days ago in the current directory?
find -ctime 5 -type f
55. How to replace the character '/' with ',' in a file?
sed 's/\//,/' < filename
sed 's|/|,|' < filename
70. Write a command to print the lines that has the word "july" while ignoring the case.
grep -i july *
The '-i' option makes the grep command treat the pattern as case insensitive.
71. When you use a single file as input to the grep command to search for a pattern, it won't print
the filename in the output. Now write a grep command to print the filename in the output without
using the '-H' option.
grep pattern filename /dev/null
The /dev/null or null device is a special file that discards the data written to it. So /dev/null is
always an empty file.
Another way to print the filename is using the '-H' option. The grep command for this is
grep -H pattern filename
72. Write a command to print the file names in a directory that does not contain the word "july"?
grep -L july *
The '-L' option makes the grep command to print the filenames that do not contain the specified
pattern.
73. Write a command to print the line numbers along with the line that has the word "july"?
grep -n july filename
The '-n' option is used to print the line numbers in a file. The line numbers start from 1.
74. Write a command to print the lines that starts with the word "start"?
grep '^start' filename
The '^' symbol specifies the grep command to search for the pattern at the start of the line.
75. In the text file, some lines are delimited by colon and some are delimited by space. Write a
command to print the third field of each line.
awk '{ if( $0 ~ /:/ ) { FS=":"; } else { FS =" "; } print $3 }' filename
76. Write a command to print the line number before each line?
awk '{print NR, $0}' filename
77. Write a command to print the second and third line of a file without using NR.
awk 'BEGIN {RS="";FS="\n"} {print $2,$3}' filename
78. How to create an alias for the complex command and remove the alias?
The alias utility is used to create the alias for a command. The below command creates alias for
ps -aef command.
alias pg='ps -aef'
If you use pg, it will work the same way as ps -aef.
To remove the alias simply use the unalias command as
unalias pg
79. Write a command to display today's date in the format 'yyyy-mm-dd'.
The date command can be used to display today's date:
date '+%Y-%m-%d'
------------------------------------------------------------------------------------------------------
echo variable_crr: $?
29/02/2012 00:00:00
29/02/2012 00:00:00
command:
cat command.txt | sed -e 's/[[:space:]]/ /g' | awk '{print "\x27" $1, $2, $3 "\x27", "\x27" $4, $5 "\x27"}'
output:
/Admin/script.sh abc 2011/08 29/02/2012 00:00:00
/Admin/script.sh abc 2011/08 29/02/2012 00:00:00
=====================================================================
ceft.CUSTOMER_LAST_NAME AS CUSTOMER_LAST_NAME,
btr.TRACE_ID AS TRACE_ID,
ROWNUM
FROM schemaname.bt_fulfillment bf
INNER JOIN schemaname.balance_transfer_request btr
ON btr.bt_fulfillment_uid = bf.bt_fulfillment_uid
INNER JOIN schemaname.electronic_funds_transfer eft
ON eft.bt_fulfillment_uid = bf.bt_fulfillment_uid
INNER JOIN schemaname.creditor_eft ceft
ON ceft.ELECTRONIC_FUNDS_TRANSFER_UID =
eft.ELECTRONIC_FUNDS_TRANSFER_UID
INNER JOIN schemaname.credit_account ca
ON ca.ELECTRONIC_FUNDS_TRANSFER_UID =
ceft.ELECTRONIC_FUNDS_TRANSFER_UID
WHERE ((btr.TYPE ='CREATE_CREDIT' AND btr.STATUS ='PENDING')
OR (btr.TYPE ='RETRY_CREDIT' AND btr.STATUS ='PENDING'))
AND btr.RELEASE_DATE < CURRENT_TIMESTAMP
=====================================================================
The star schema model is useful for metrics analysis, such as 'What is the revenue for a given
customer?'
When binding output schema variable "outputData": When binding output interface field
"RecordCount" to field "RecordCount": Implicit conversion from source type
"string[max=255]" to result type "int16": Converting string to number.
Problem (Abstract)
Jobs that process a large amount of data in a column can abort with this error:
the record is too big to fit in a block; the length requested is: xxxx, the max block length is: xxxx.
Resolving the problem
To fix this error you need to increase the block size to accommodate the record size:
1. Log into Designer and open the job.
2. Open the job properties --> Parameters --> Add Environment Variable and select:
APT_DEFAULT_TRANSPORT_BLOCK_SIZE
3. You can set this up to 256 MB, but you really shouldn't need to go over 1 MB.
NOTE: the value is in bytes; for example, to set the value to 1 MB:
APT_DEFAULT_TRANSPORT_BLOCK_SIZE=1048576
The default for this value is 128 KB (131072).
When setting APT_DEFAULT_TRANSPORT_BLOCK_SIZE you want to use the smallest possible
value, since this value will be used for all links in the job.
For example, if your job fails with APT_DEFAULT_TRANSPORT_BLOCK_SIZE set to 1 MB and
succeeds at 4 MB, you would want to do further testing to see what is the smallest value
between 1 MB and 4 MB that will allow the job to run, and use that value. Using 4 MB could
cause the job to use more memory than needed, since all the links would use a 4 MB transport
block size.
NOTE: If this error appears for a dataset, use APT_PHYSICAL_DATASET_BLOCK_SIZE.
2. While connecting through Remote Desktop: the terminal server has exceeded the maximum
number of allowed connections.
SOL: Connect with mstsc /admin, or clear the stale terminal sessions on the server.
3. SK_RETAILER_GROUP_BRDIGE,1: runLocally() did not reach EOF on its input data set 0.
SOL: The warning will disappear after regenerating the SK file.
4.
While connecting to Datastage client, there is no response, and while restarting
websphere services, following errors occurred
[root@poluloro01 bin]# ./stopServer.sh server1 -user wasadmin -password
Wasadmin0708
ADMU0116I: Tool information is being logged in file
/opt/ibm/WebSphere/AppServer/profiles/default/logs/server1/stopServer.log
ADMU0128I: Starting tool with the default profile
ADMU3100I: Reading configuration for server: server1
ADMU0111E: Program exiting with error: javax.management.JMRuntimeException:
ADMN0022E: Access is denied for the stop operation on Server MBean
because of insufficient or empty credentials.
ADMU4113E: Verify that username and password information is on the command line
(-username and -password) or in the <conntype>.client.props file.
ADMU1211I: To obtain a full trace of the failure, use the -trace option.
ADMU0211I: Error details may be seen in the file:
/opt/ibm/WebSphere/AppServer/profiles/default/logs/server1/stopServer.log
SOL:
The wasadmin and xmeta passwords need to be reset; the commands are below:
[root@poluloro01 bin]# cd /opt/ibm/InformationServer/ASBServer/bin/
5. The specified field: XXXXXX does not exist in the view adapted schema.
SOL: Most of the time this occurs when we have missed a field to map. Every stage has an
output tab if used in the middle of the job; make sure you have mapped every single field
required for the next stage.
Sometimes even after mapping the fields this error can occur, and one of the reasons could be
that the view adapter has not linked the input and output fields. In this case the required field
mapping should be dropped and recreated.
Just to give an insight on this: the view adapter is an operator which is responsible for mapping
the input and output fields. DataStage creates an instance of APT_ViewAdapter which translates
the components of the operator input interface schema to matching components of the interface
schema. So if the interface schema does not have the same columns as the operator input
interface schema, then this error will be reported.
1) When we use Same partitioning in the DataStage Transformer stage we get the following
warning in version 7.5.2:
TFCP000043 input_tfm: Input dataset 0 has a partitioning method other than entire specified;
disabling memory sharing.
This is a known issue, and you can safely demote that warning to informational by adding it to
the project-specific message handler.
2) Warning: A sequential operator cannot preserve the partitioning of input data set on
input port 0
Resolution: Clear the preserve partition flag before Sequential file stages.
3) A DataStage parallel job fails with: fork() failed, Resource temporarily unavailable.
On AIX execute the following command to check the maxuproc setting, and increase it if you
plan to run multiple jobs at the same time:
lsattr -E -l sys0 | grep maxuproc
maxuproc 1024 Maximum number of PROCESSES allowed per user True
4) TFIP000000 Agg_stg: When checking operator: When binding input interface field
CUST_ACT_NBR to field CUST_ACT_NBR: Implicit conversion from source type string[5] to
result type dfloat: Converting string to number.
Resolution: Use the Modify stage to explicitly convert the data type before sending it to the
Aggregator stage.
5) Warning: A user defined sort operator does not satisfy the requirements.
Resolution: Check the order of the sorting columns and make sure you use the same order when
using a Join stage after the Sort to join the two inputs.
6) TFTM000000 Stg_tfm_header,1: Conversion error calling conversion routine
timestamp_from_string; data may have been lost.
TFTM000000 xfmJournals,1: Conversion error calling conversion routine decimal_from_string;
data may have been lost.
Resolution: Check for the correct date format or decimal format, and also for null values in the
date or decimal fields, before passing them to the DataStage StringToDate, DateToString,
DecimalToString or StringToDecimal functions.
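A hedged sketch of such a null guard in a Transformer derivation (the column name and date
format are assumed):

If IsNull(in.order_dt) Or Trim(in.order_dt) = '' Then SetNull() Else
StringToDate(in.order_dt, "%yyyy-%mm-%dd")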
7) TOSO000119 Join_sort: When checking operator: Data claims to already be sorted on the
specified keys; the 'sorted' option can be used to confirm this. Data will be re-sorted as
necessary. Performance may improve if this sort is removed from the flow.
Resolution: Sort the data before sending it to the Join stage, and check the order of the sorting
keys and join keys to make sure both are in the same order.
8) TFOR000000 Join_Outer: When checking operator: Dropping component CUST_NBR
because of a prior component with the same name.
Resolution: If you are using Join, Diff, Merge or Comp stages, make sure both links have
different column names other than the key columns.
9) TFIP000022 oci_oracle_source: When checking operator: When binding output interface
field MEMBER_NAME to field MEMBER_NAME: Converting a nullable source to a non-nullable
result.
Resolution: If you are reading from an Oracle database, or in any processing stage where the
incoming column is defined as nullable, and you define the metadata in DataStage as
non-nullable, then you will get the above issue. If you want to convert a nullable field to
non-nullable, make sure you apply the available null functions in DataStage or in the extract
query.
1. No jobs or logs showing in the IBM DataStage Director client, however jobs are still
accessible from the Designer client.
SOL: The SyncProject command that is installed with DataStage 8.5 can be run to analyze and
recover projects:
SyncProject -ISFile islogin -project dstage3 dstage5 -Fix
2. CASHOUT_DTL: Invalid property value /Connection/Database
(CC_StringProperty::getValue, file CC_StringProperty.cpp, line 104)
SOL: Change the Data Connection properties manually in the produced
DB2 Connector stage.
A patch fix is available for this issue JR35643
3. Import .dsx file from command line
SOL: DSXImportService -ISFile dataconnection DSProject dstage DSXFile
c:\export\oldproject.dsx
4. Generate a surrogate key without a Surrogate Key stage:
SOL: @PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + 1
Use the above formula in a Transformer stage to generate a surrogate key. For example, with
two partitions, partition 0 produces 1, 3, 5, ... and partition 1 produces 2, 4, 6, ...
5. Failed to authenticate the current user against the selected Domain: Could not connect to
server.
RC: The client has an invalid entry in the hosts file, the server listening port might be blocked
by a firewall, or the server is down.
SOL: Update the hosts file on the client system so that the server hostname can be resolved
from the client. Make sure the WebSphere TCP/IP ports are opened by the firewall. Make sure
the WebSphere application server is running.
2. Warning: A sequential operator cannot preserve the partitioning of input data set on input
port 0.
SOL: Clear the preserve partition flag before the Sequential File stages.
3. Warning: A user defined sort operator does not satisfy the requirements.
SOL: Check the order of the sorting columns and make sure you use the same order when using
a Join stage after the Sort to join the two inputs.
4. Conversion error calling conversion routine timestamp_from_string; data may have been
lost. xfmJournals,1: Conversion error calling conversion routine decimal_from_string; data may
have been lost.
SOL: Check for the correct date format or decimal format, and also for null values in the date or
decimal fields, before passing them to the DataStage StringToDate, DateToString,
DecimalToString or StringToDecimal functions.
5. How to list all the jobs in a project from the command line:
SOL:
cd /opt/ibm/InformationServer/Server/DSEngine/bin
./dsjob -ljobs <project_name>
6. How to stop and restart the DataStage services when active or stale sessions are present:
SOL:
Ask the application team to close the active or stale sessions running from the application user.
If they have closed the sessions but the sessions are still there, then kill those sessions:
ps -ef | grep dsd.run
Check the output of the command below before stopping the DataStage services:
netstat -a | grep dsrpc
If any processes are in ESTABLISHED state, check that no job, stale, active or osh sessions are
running.
If any processes are in CLOSE_WAIT state, then wait for some time; those processes will
disappear.
Wait for 10 to 15 minutes for shared memory to be released by the processes holding it.
Start the DataStage services:
./uv -admin -start
If it asks for the dsadm password while firing the command, then enable impersonation through
the root user:
${DSHOME}/scripts/DSEnable_impersonation.sh
Equ DSJS.RUNNING     To 0   ;* This is the only status that means the job is actually running
Equ DSJS.RUNOK       To 1   ;* Job finished a normal run with no warnings
Equ DSJS.RUNWARN     To 2   ;* Job finished a normal run with warnings
Equ DSJS.RUNFAILED   To 3   ;* Job finished a normal run with a fatal error
Equ DSJS.QUEUED      To 4   ;* Job queued waiting for resource allocation
Equ DSJS.VALOK       To 11  ;* Job finished a validation run with no warnings
Equ DSJS.VALWARN     To 12  ;* Job finished a validation run with warnings
Equ DSJS.VALFAILED   To 13  ;* Job failed a validation run
Equ DSJS.RESET       To 21  ;* Job finished a reset run
Equ DSJS.CRASHED     To 96  ;* Job has crashed
Equ DSJS.STOPPED     To 97  ;* Job was stopped by operator intervention
Equ DSJS.NOTRUNNABLE To 98  ;* Job has not been compiled
Equ DSJS.NOTRUNNING  To 99  ;* Any other status
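These are the values returned for the job status by DSGetJobInfo and by the dsjob command. A
quick hedged check from the command line (project and job names hypothetical):

$DSHOME/bin/dsjob -jobinfo MyProject MyJob
(the first line of the output reports the status, e.g. Job Status : RUN OK (1))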
This warning is seen when there are multiple records with the same key column present in the
reference table from which the lookup is done. Lookup, by default, will fetch the first record
that it gets as a match and will throw the warning, since it doesn't know which value is the
correct one to be returned from the reference.
To solve this problem you can either select one of the reference links in the 'Multiple rows
returned from link' dropdown in the Lookup constraints, in which case Lookup will return
multiple rows for each row that is matched, or use some method to eradicate the duplicate rows
with the same key columns according to the business requirements.
Also, your substitution command may catch more ^M than necessary; your file may contain a
valid ^M in the middle of a line of code, for example. Use the following command instead to
remove only those at the very end of lines:
:%s/(ctrl-v)(ctrl-m)*$//g
Using sed:
sed -e "s/^M//g" old_file_name > new_file_name
(where ^M is entered as Ctrl-V Ctrl-M)
City   State   Name1   Name2   Name3
xy     FGH     Sam     Dean    Winchester
We are going to read the above data from a sequential file and transform it to look like this:
City   State   Name
xy     FGH     Sam
xy     FGH     Dean
xy     FGH     Winchester
Let's map the output to a sequential file stage and see if the output is as desired.
After running the job, we did a View Data on the output stage, and here is the data, as desired.
Making some tweaks to the above design we can implement similar transformations using the
Transformer looping functions:
o LastRowInGroup(InputColumn): returns 1 for the last record of the group defined by the input
column. For input rows (ABC,1), (ABC,2), (ABC,3), (DEF,...) the function will return 0 for the
first two records, but for the last record (ABC,3) it will return 1, indicating that it is the last
record for the group where the student name is ABC.
o GetSavedInputRecord(): this function returns the record that was stored in the cache by the
function SaveInputRecord().
Back to the task at hand: we need 7 stage variables to perform the aggregation operation
successfully.
1. LoopNumber: holds the number of records stored in the cache for a student.
2. LoopBreak: identifies the last record for a particular student.
3. SumSub1: holds the final sum of marks for each student in subject 1.
4. IntermediateSumSub1: holds the sum of marks until the final record is evaluated for a student
(subject 1).
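A rough sketch of how those pieces fit together in the looping Transformer (variable names
from the list above; this is a sketch of the idea, not the full design): SaveInputRecord() is called
in the first stage variable to cache each arriving row, and LoopNumber takes its return value
(the number of rows cached so far); LoopBreak uses LastRowInGroup() on the student key to
detect the end of the group; IntermediateSumSub1 accumulates the subject-1 marks, and
SumSub1 copies the total when LoopBreak = 1. The loop condition is
@ITERATION <= LoopNumber, with GetSavedInputRecord() called in a loop variable, so each
cached row is emitted once the group total is known.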
First Answer: My personal opinion is to use the star by default, but if the product you are using
for the business community prefers a snowflake, then I would snowflake it. The major
difference between snowflake and star is that a snowflake will have multiple tables for a
dimension while a star has a single table. For example, your company structure might be
Second Answer: First of all, some definitions are in order. In a star schema, dimensions
that reflect a hierarchy are flattened into a single table. For example, a star schema
Geography dimension would have columns like country, state/province, city and postal code.
In the source system, this hierarchy would probably be normalized into multiple tables with
one-to-many relationships.
A snowflake schema does not flatten a hierarchy dimension into a single table. It would,
instead, have two or more tables with a one-to-many relationship. This is a more
normalized structure. For example, one table may have state/province and country columns
and a second table would have city and postal code. The table with city and postal code
would have a many-to-one relationship to the table with the state/province columns.
There are some good reasons for snowflake dimension tables. One example is a company that
has many types of products. Some products have a few attributes, others have many, many. The
products are very different from each other. The thing to do here is to create a core Product
dimension that has common attributes for all the products, such as product type, manufacturer,
brand, product group, etc. Create a separate sub-dimension table for each distinct group of
products where each group shares common attributes. The sub-product tables must contain a
foreign key of the core Product dimension table.
One of the criticisms of using snowflake dimensions is that it is difficult for some of the
multidimensional front-end presentation tools to generate a query on a snowflake dimension.
However, you can create a view for each combination of the core product/sub-product
dimension tables, give the view a suitably descriptive name (Frozen Food Product, Hardware
Product, etc.), and then these tools will have no problem.
Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server
using Hash/Sequential files for optimum performance
Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for faster
inserts, updates and selects.
Tuned the 'Project Tunables' in Administrator for better performance
Used sorted data for Aggregator
Sorted the data as much as possible in the DB and reduced the use of DS-Sort for better
performance of jobs.
Removed the data not used from the source as early as possible in the job.
Worked with the DB admin to create appropriate indexes on tables for better performance of
DS queries.
Converted some of the complex joins/business logic in DS to stored procedures on the DB for
faster execution of the jobs.
If an input file has an excessive number of rows and can be split up, then use standard logic to
run jobs in parallel.
Before writing a routine or a transform, make sure that the required functionality is not already
available in one of the standard routines supplied in the sdk or ds utilities categories.
Constraints are generally CPU intensive and take a significant amount of time to process. This
may be the case if the constraint calls routines or external macros, but if it is inline code then
the overhead will be minimal.
Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the
unnecessary records even getting in before joins are made.
Tuning should occur on a job-by-job basis.
Use the power of DBMS.
Try not to use a sort stage when you can use an ORDER BY clause in the database.
Using a constraint to filter a record set is much slower than performing a SELECT
WHERE.
Make every attempt to use the bulk loader for your particular database. Bulk loaders are
generally faster than using ODBC or OLE.
Minimize the usage of Transformers (instead use Copy, Modify, Filter, Row Generator).
Use SQL code while extracting the data.
Handle the nulls.
Minimize the warnings.
Reduce the number of lookups in a job design.
Try not to use more than 20 stages in a job.
Use an IPC stage between two passive stages; it reduces processing time.
Drop indexes before data loading and recreate them after loading data into tables.
Check the write cache of the hash file. If the same hash file is used for lookup as well as
target, disable this option.
If the hash file is used only for lookup then enable 'Preload to memory'. This will improve
the performance. Also check the order of execution of the routines.
Don't use more than 7 lookups in the same transformer; introduce new transformers if it
exceeds 7 lookups.
Use the 'Preload to memory' option in the hash file output.
Use 'Write to cache' in the hash file input.
Write into the error tables only after all the transformer stages.
Reduce the width of the input record - remove the columns that you would not use.
Cache the hash files you are reading from and writing into. Make sure your cache is big
enough to hold the hash files.
Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files.
Ideally, if the amount of data to be processed is small, configuration files with a smaller number
of nodes should be used, while if the data volume is larger, configuration files with a larger
number of nodes should be used.
Partitioning should be set in such a way as to have balanced data flow, i.e. nearly equal
partitioning of data should occur and data skew should be minimized.
In DataStage jobs where a high volume of data is processed, virtual memory settings for the job
should be optimized. Jobs often abort in cases where a single lookup has multiple reference
links; this happens due to low temp memory space. In such jobs $APT_BUFFER_MAXIMUM_MEMORY,
$APT_MONITOR_SIZE and $APT_MONITOR_TIME should be set to sufficiently large values.
Sequential files should be used in the following conditions: when we are reading a flat file
(fixed width or delimited) from a UNIX environment which is FTP'ed from some external
system, or when some UNIX operations have to be done on the file. Don't use sequential files
for intermediate storage between jobs: it causes performance overhead, as it needs to do data
conversion before writing to and reading from a UNIX file.
In order to have faster reading from the stage, the number of readers per node can be increased
(the default value is one).
Usage of Dataset results in a good performance in a set of linked jobs. They help in
achieving end-to-end parallelism by writing data in partitioned form and maintaining the
sort order.
Look up Stage is faster when the data volume is less. If the reference data volume is
more, usage of Lookup Stage should be avoided as all reference data is pulled in to local
memory
Sparse lookup type should be chosen only if primary input data volume is small.
Join should be used when the data volume is high. It is a good alternative to the lookup
stage and should be used when handling huge volumes of data.
Even though data can be sorted on a link, the Sort stage is used when the data to be sorted is
huge. When we sort data on a link (sort/unique option), once the data size is beyond the fixed
memory limit, I/O to disk takes place, which incurs an overhead. Therefore, if the volume of
data is large, an explicit Sort stage should be used instead of sorting on the link. The Sort stage
gives an option to increase the buffer memory used for sorting; this would mean lower I/O and
better performance.
It is also advisable to reduce the number of transformers in a job by combining the logic into a
single transformer rather than having multiple transformers.
The presence of a Funnel stage reduces the performance of a job. It can increase the time taken
by the job by 30% (observed). When a Funnel stage is to be used in a large job it is better to
isolate it to one job: write the output to datasets and funnel them in a new job.
The Funnel stage should be run in continuous mode, without hindrance.
A single job should not be overloaded with stages. Each extra stage put in a job corresponds to
a smaller number of resources available for every stage, which directly affects the job's
performance. If possible, big jobs having a large number of stages should be logically split into
smaller units.
Unnecessary column propagation should not be done. As far as possible, RCP (Runtime Column
Propagation) should be enabled only where it is actually needed.
================================================================================