PROCESSING STAGES:
Aggregator Stage:
Aggregator stage is an Active stage. It accepts one input link and one output link. It is used for aggregating data based on a key column: we group the data on a column and perform calculations on each group.
This stage takes data from its input and performs aggregations on each group. We can do calculations such as Sum, Average, Min, Max, Percentage, and Count of the values, based on the keys from the input.
For better performance, the key column should be hash partitioned and sorted before aggregation.
Properties:
-- Group key: The key on which we group the data. Grouping arranges similar records together, which makes any subsequent calculation easier.
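As a rough illustration (plain Python, not DataStage syntax), grouping on a key and computing the aggregates named above might look like the sketch below; the store/sales columns are made-up examples:

```python
# Sketch of Aggregator behavior: group rows on a key column, then
# compute Sum / Min / Max / Count / Average per group.
from itertools import groupby
from operator import itemgetter

rows = [
    {"store": "NY", "sales": 100},
    {"store": "NY", "sales": 300},
    {"store": "LA", "sales": 200},
]

# Sort on the group key first, mirroring the recommendation that the
# key column be sorted (and hash partitioned) before aggregation.
rows.sort(key=itemgetter("store"))

result = {}
for store, group in groupby(rows, key=itemgetter("store")):
    values = [r["sales"] for r in group]
    result[store] = {
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "count": len(values),
        "avg": sum(values) / len(values),
    }

print(result["NY"]["sum"])  # 400
```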
COPY stage:
Copy stage is an Active stage. It can have a single input link and any number of output links. It is used to copy a single input data set to multiple output data sets; each record from the input is copied to every output data set. This stage is used to take a backup of a data set to another location on disk, or when we want multiple copies of the data for different processes in other jobs. The records themselves are not altered in any way.
Properties:
-- Force = True / False
When the job has a single input and a single output, set Force to True; when there is a single input and multiple outputs, set it to False. By default it is set to False.
DATASTAGE PX
Processing stages
27/02/2009
Funnel:
Funnel stage is an Active stage. It is used to copy multiple input data sets to a single output data set. It accepts any number of input links and only one output link. All input data sets must have the same metadata for this stage.
Properties:
Funnel type: Continuous Funnel / Sort Funnel / Sequential funnel
Continuous Funnel: It picks input records in no particular order, taking one record from each input link in turn. If data is not available on one input link, the stage skips to the next link rather than waiting for the data.
Sort Funnel: It combines the input records in the order defined by the key columns, and the order of the output records is determined by these sorting keys. All input data sets for a sort funnel must be hash partitioned before they are sorted. This ensures that all records with the same key column value are located in the same partition and are processed on the same node.
Sequential funnel: It copies all records from the first input data set to the output data set, then
all the records from the second input data set, and so on.
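The three funnel types can be contrasted with a small plain-Python sketch (not DataStage syntax); the two input links and the id key column are made-up examples:

```python
# Sketch of the three funnel types combining two inputs with the same
# "metadata" into one output.
import heapq

link1 = [{"id": 3}, {"id": 1}]
link2 = [{"id": 2}, {"id": 4}]

# Sequential funnel: all of link1, then all of link2.
sequence_out = link1 + link2

# Sort funnel: merge inputs that are already sorted on the key column.
sort_out = list(heapq.merge(sorted(link1, key=lambda r: r["id"]),
                            sorted(link2, key=lambda r: r["id"]),
                            key=lambda r: r["id"]))

# Continuous funnel: take one record from each link in turn,
# skipping a link once it has no more data.
continuous_out = []
links = [iter(link1), iter(link2)]
while links:
    for it in links[:]:
        try:
            continuous_out.append(next(it))
        except StopIteration:
            links.remove(it)

print([r["id"] for r in sort_out])  # [1, 2, 3, 4]
```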
Filter stage:
Filter stage is an Active stage. It can have one input link, any number of output links, and a single reject link.
This stage routes the input records according to the given conditions and filters out the remaining records that do not satisfy any condition. The filtered-out records can also be sent to the reject link.
Properties:
Where clause: Here we provide a condition such as Store_Loc='New York'. Only the records that satisfy this condition are populated to the output. We can give multiple conditions on different columns from the input link.
Output rejects: True / False
If this option is set to True, records that do not match any condition are sent to the reject link; if set to False, the unmatched records are ignored.
Output records only once: True / False
When we give multiple conditions, a single record may sometimes satisfy more than one of them. Setting this option to False allows such a valid record to be propagated to every output whose condition it satisfies. Setting it to True prevents a record from going to multiple outputs even when it satisfies multiple conditions.
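The routing behavior of the Where clause, Output rejects, and Output records only once properties can be sketched in plain Python (not DataStage syntax); the store/sales columns and conditions are made-up examples:

```python
# Sketch of Filter routing: each Where clause feeds one output link;
# rows matching no condition go to the reject link.
rows = [
    {"store": "New York", "sales": 500},
    {"store": "Boston",   "sales": 900},
    {"store": "Chicago",  "sales": 100},
]

out_ny, out_big, rejects = [], [], []
conditions = [
    (lambda r: r["store"] == "New York", out_ny),  # Where clause 1
    (lambda r: r["sales"] > 400, out_big),         # Where clause 2
]

output_records_only_once = False  # False: a row may go to several links
for row in rows:
    matched = False
    for cond, link in conditions:
        if cond(row):
            link.append(row)
            matched = True
            if output_records_only_once:
                break
    if not matched:
        rejects.append(row)  # Output rejects = True
```

With Output records only once set to True, the New York row would reach only the first matching link instead of both.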
Sort stage:
Sort stage is an Active stage. This stage has one input link, which carries the data to be sorted, and a single output link, carrying the sorted data. We specify sorting keys as the criteria on which to perform the sort, and we can specify more than one key: the first is called the primary key and the others are secondary keys. If multiple records have the same value for the primary key column, the stage uses the secondary key columns to sort those records. The stage uses temporary disk space when performing the sort operation.
Properties:
Sort key: The column on which the data is to be sorted. For example, if we sort on City_Code, the data is ordered on that key and populated to the output.
Allow Duplicates: True / False
If multiple records have identical sort key values, only one record is retained when this is set to False; when set to True (the default), duplicate records are also populated to the output.
If Stable Sort is True, then the first record is the one retained. This property is not available for the UNIX sort type.
Sort Utility: DataStage / UNIX
--- DataStage: The default. This uses the built-in DataStage sorter; you do not require any additional software to use this option.
--- UNIX: This specifies that the UNIX sort command is used to perform the sort.
Stable Sort: Applicable when the Sort Utility is DataStage. True guarantees that the sort operation will not rearrange records that are already in a properly sorted data set. If set to False, no prior ordering of records is guaranteed to be preserved by the sorting operation.
Output Statistics: True / False
If set to True, the sort operation outputs statistics. This property is not available for the UNIX sort type. It is set to False by default.
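The interaction of primary/secondary keys, Stable Sort, and Allow Duplicates can be sketched in plain Python (not DataStage syntax); the city/emp columns are made-up, and the seq field only records input order:

```python
# Sketch of Sort stage behavior: primary + secondary keys, stable
# sorting, and Allow Duplicates = False keeping one row per key.
rows = [
    {"city": "NY", "emp": 2, "seq": 0},
    {"city": "LA", "emp": 1, "seq": 1},
    {"city": "NY", "emp": 2, "seq": 2},
    {"city": "NY", "emp": 1, "seq": 3},
]

# Primary key: city; secondary key: emp. Python's sort is stable, so
# rows with identical keys keep their input order (Stable Sort = True).
rows_sorted = sorted(rows, key=lambda r: (r["city"], r["emp"]))

# Allow Duplicates = False: keep only the first record per key value.
seen, deduped = set(), []
for r in rows_sorted:
    key = (r["city"], r["emp"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)
```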
Change Capture stage:
Properties:
Change key: Specifies the name of a difference key input column; we can specify multiple difference key input columns here. This is the key on which we identify a record as a new, updated, or deleted record.
--- Sort order: Ascending / Descending
We can sort the key values in ascending or descending order for better performance.
Change Value: The name of an input value column, which is also used to identify changes in the input records. We can select a column from the drop-down list.
Change Mode: Explicit Keys & Values / All keys Explicit values / Explicit Keys, All Values
This mode determines how keys and values are specified. Choose Explicit Keys & Values to
specify the keys and values yourself. Choose All keys Explicit values to specify that value
columns must be defined, but all other columns are key columns unless excluded. Choose
Explicit Keys, All Values to specify that key columns must be defined but all other columns are
value columns unless they are excluded.
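The change detection these key and value properties drive can be sketched in plain Python (not DataStage syntax). This is illustrative only: the id/qty columns are made up, and the real stage emits a change code column rather than separate lists:

```python
# Sketch of change detection: compare a "before" and an "after" data
# set on the change key (id), using the change value column (qty) to
# spot updated records.
before = {r["id"]: r for r in [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]}
after  = {r["id"]: r for r in [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}]}

# Keys only in "after" are new records.
inserts = [after[k] for k in after.keys() - before.keys()]
# Keys only in "before" are deleted records.
deletes = [before[k] for k in before.keys() - after.keys()]
# Shared keys whose value column differs are updated records.
updates = [after[k] for k in after.keys() & before.keys()
           if after[k]["qty"] != before[k]["qty"]]
```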
Modify Stage:
Modify stage is a processing stage. It can have one input link and a single output link.
The Modify stage alters the record schema of its input data set; the modified data set is then output. We can change data types, rename columns, drop columns, keep columns, and handle nulls.
For example, you can use the conversion hours_from_time to convert a time to an int8, or to an int16, int32, and so on. We have to provide the specification for the destination column in the properties window.
Syntax for conversion:
new_columnname:new_type = conversion_function (old_columnname)
Column_Name=Handle_Null('Column_Name',Value)
Properties:
Options:
Specification: Here we need to specify the conversion for the output column.
Ex: HIREDATE = date_from_timestamp (HIREDATE)
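The kinds of per-record changes described above (a type conversion, a rename, null handling) can be sketched in plain Python rather than Modify syntax; the EMPNAME and BONUS columns are made-up examples:

```python
# Sketch of Modify-style record alterations applied to one record.
from datetime import datetime

def modify(record):
    out = dict(record)
    # Like HIREDATE = date_from_timestamp(HIREDATE): timestamp -> date.
    out["HIREDATE"] = datetime.fromisoformat(out["HIREDATE"]).date().isoformat()
    # Rename: NAME = EMPNAME (new_columnname = old_columnname).
    out["NAME"] = out.pop("EMPNAME")
    # Like Handle_Null('BONUS', 0): replace nulls with a default value.
    if out["BONUS"] is None:
        out["BONUS"] = 0
    return out

row = modify({"HIREDATE": "2009-02-27 10:30:00",
              "EMPNAME": "Ann", "BONUS": None})
print(row["HIREDATE"])  # 2009-02-27
```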
Switch Stage:
Switch stage is a processing stage. It can have one input link, up to 128 output links, and a single reject link. The Switch stage takes a single data set as input and assigns each input row to an output data set based on the value of a selector column. This stage performs an operation similar to a C switch statement. Rows that satisfy none of the cases are populated to the reject link.
Properties:
Selector: Specifies the input column that the switch applies to. Unlike the Filter stage, we can specify multiple conditions on a single column here.
Options Category:
If not found: Fail / Drop / Output
Specifies the action to take if a row fails to match any of the case statements. It is not visible if you choose a Selector Mode of Hash. We can choose between the following options:
Fail: Causes the job to fail.
Drop: Drops the record.
Output: Record will be sent to the Reject link.
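The case routing and the If not found action can be sketched in plain Python (not DataStage syntax); the region column and its case values are made-up examples:

```python
# Sketch of Switch routing: the selector column value picks an output
# link, like a C switch; unmatched rows follow the "If not found"
# action (here: Output, i.e. the reject link).
rows = [{"region": "N"}, {"region": "S"}, {"region": "X"}]

outputs = {"N": [], "S": []}   # case value -> output link
rejects = []

for row in rows:
    link = outputs.get(row["region"])   # selector column: region
    if link is not None:
        link.append(row)
    else:
        rejects.append(row)             # If not found = Output
```

With If not found = Fail the unmatched row would abort the job instead; with Drop it would simply be discarded.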
Pivot Stage:
Pivot stage is an Active stage. It accepts one input link and a single output link. It converts multiple columns into rows. The data types of the input columns being pivoted must be the same, and the output column is created with that same data type.
Scenario: Let us assume that Mark-1 and Mark-2 are two columns in the input data set, and we need to convert these two columns into one column named "Marks". We have to provide the derivation in the Derivation field of the output column Marks; thus a new column "Marks" is derived from the input columns Mark-1 and Mark-2.
We need to provide the following derivation for the Marks column:
Marks = Mark-1, Mark-2
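The effect of that derivation can be sketched in plain Python (not DataStage syntax); the student column is a made-up example:

```python
# Sketch of a horizontal pivot: the Mark-1 and Mark-2 columns of each
# input row become two output rows with a single Marks column.
rows = [{"student": "A", "Mark-1": 60, "Mark-2": 75}]

pivoted = []
for r in rows:
    for col in ("Mark-1", "Mark-2"):   # derivation: Mark-1, Mark-2
        pivoted.append({"student": r["student"], "Marks": r[col]})
```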
JOIN Stage:
Join stage is a processing stage. It has any number of input links (at least two) and a single output link; it does not allow a reject link. It performs join operations on the data sets input to the stage and then outputs the resulting data set. The input data sets are called the left set, the right set, and (when there are more than two) intermediate sets; you can specify which is which.
The Join stage can perform four join operations: Inner Join, Left Outer Join, Right Outer Join, and Full Outer Join. The default is Inner Join.
The data sets input to the Join stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and are processed by the same node. Choosing the Auto partitioning method will ensure that partitioning and sorting are done. If sorting and partitioning are carried out in a separate stage before the Join stage, DataStage in Auto mode will detect this and won't repartition the data again.
Properties:
Join Key: This is the Column name on which the input tables are joined together and matched
data is sent to the output data set. We can select multiple keys for joining tables.
Join type: Inner / Left Outer / Right Outer / Full Outer
Inner Join: Transfers records whose key columns contain equal values from the input data sets to the output data set. Records whose key columns do not contain equal values are dropped.
Left Outer Join: Transfers all values from the left data set but transfers values from the
right data set and intermediate data sets only where key columns match. The
stage drops the key column from the right and intermediate data sets.
Right Outer Join: Transfers all values from the right data set and transfers values from
the left data set and intermediate data sets only where key columns match. The
stage drops the key column from the left and intermediate data sets.
Full Outer Join: Transfers records in which the contents of the key columns are equal
from the left and right input data sets to the output data set. It also transfers
records whose key columns contain unequal values from both input data sets to
the output data set. Full outer joins do not support more than two input
links.
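Inner and left outer behavior on two inputs can be sketched in plain Python (not DataStage syntax); the id/name/city columns are made-up examples:

```python
# Sketch of inner and left outer joins on the key column id.
left  = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
right = [{"id": 2, "city": "NY"},  {"id": 3, "city": "LA"}]

right_by_key = {r["id"]: r for r in right}

# Inner join: only rows whose key appears in both inputs.
inner = [{**l, **right_by_key[l["id"]]}
         for l in left if l["id"] in right_by_key]

# Left outer join: every left row, right columns null when unmatched.
left_outer = [{**l, **right_by_key.get(l["id"], {"city": None})}
              for l in left]
```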
Merge Stage:
Merge stage is a processing stage. It can have any number of input links, a single output
link, and the same number of reject links as there are update input links.
Merge Stage combines a sorted master data set with one or more update data sets. The
columns from the records in the master and update data sets are merged so that the output record
contains all the columns from the master record plus any additional columns from each update
record. A master record and an update record are merged only if both of them have the same
values for the merge key column(s) that you specify. Merge key columns are one or more
columns that exist in both the master and update records.
Unlike Join stage and Lookup stage, the Merge stage allows you to specify several reject
links. You must have the same number of reject links as you have update links. You can also
specify whether to drop unmatched master rows, or output them on the output data link.
The data sets input to the Merge stage must be key partitioned and sorted. This ensures
that rows with the same key column values are located in the same partition and will be
processed by the same node.
Properties:
Merge Key: The key column that exists in both the master and update records. We can select the common key from the drop-down list, and we can give multiple key columns on which to merge the tables.
Sort order: Ascending / Descending
Options:
Unmatched master mode = Drop / Keep
If set to Keep, unmatched rows from the master link are output to the merged data set. If set to Drop, the unmatched master records are dropped. It is set to Keep by default.
Warn on reject updates = True / False
Warn on Unmatched Masters = True / False
Warn On Unmatched Masters: This warns you when records from the master link are not matched. Set it to False to receive no warnings. It is set to True by default.
Warn On Reject Updates: This warns you when records from any update link are rejected. Set it to False to receive no warnings. It is set to True by default.
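The master/update matching described above can be sketched in plain Python (not DataStage syntax); the id/name/bonus columns are made-up examples:

```python
# Sketch of Merge: a sorted master plus one update set, matched on the
# merge key; unmatched update rows go to that update link's reject
# link, and unmatched masters are kept or dropped per
# Unmatched Master Mode.
master  = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
updates = [{"id": 2, "bonus": 50},   {"id": 9, "bonus": 10}]

updates_by_key = {u["id"]: u for u in updates}
merged = []
unmatched_master_mode = "Keep"

for m in master:
    u = updates_by_key.pop(m["id"], None)
    if u is not None:
        merged.append({**m, **u})          # master columns + update columns
    elif unmatched_master_mode == "Keep":
        merged.append(m)

update_rejects = list(updates_by_key.values())  # updates with no master
```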
Lookup stage:
Lookup stage is a processing stage. It can have a single input link, a single output link, a single reject link, and one or more reference links, depending upon the type and setting of the stage(s) providing the lookup information.
The Lookup stage performs lookup operations on a data set read into memory from any other parallel job stage that can output data. It can also perform lookups directly in a DB2 or Oracle database, or in a lookup table contained in a Lookup File Set stage. Lookups can also be used to validate a row: if there is no corresponding entry in the lookup table for the key values, the row is rejected.
Two types of lookup are available with this stage: normal lookup and sparse lookup. A normal lookup is used when the reference data is small compared to the master data set; when the reference data is huge, we need to set the lookup type to sparse. In that case, however, it is usually better to use a Join stage instead.
We can set the lookup operations based on the following properties available in the stage tab.
Condition Not Met and Lookup Failure:
Choose an action from the Condition Not Met drop-down list. Possible actions are:
Continue: It continues processing any further lookups before sending the row to the
output link.
Drop: Drops the row and continues with the next lookup.
Fail: Causes the job to issue a fatal error in the log file and stop.
Reject: Sends the row to the reject link.
To specify the action taken if a lookup on a link fails, choose an action from the Lookup Failure
drop-down list. Possible actions are:
Continue: Continues processing any further lookups before sending the row to the output
link.
Drop: Drops the row and continues with the next lookup.
Fail: Causes the job to issue a fatal error and stop.
Reject: Sends the row to the reject link.
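A normal lookup with these failure actions can be sketched in plain Python (not DataStage syntax); the cust/name columns are made-up examples:

```python
# Sketch of a normal lookup: the reference data is held in memory as a
# dict, and the Lookup Failure action decides what happens when a key
# is missing (here: Reject).
stream = [{"cust": 1}, {"cust": 2}]
reference = {1: {"name": "Ann"}}       # small in-memory reference set

output, rejects = [], []
lookup_failure = "Reject"              # Continue / Drop / Fail / Reject

for row in stream:
    ref = reference.get(row["cust"])
    if ref is not None:
        output.append({**row, **ref})
    elif lookup_failure == "Reject":
        rejects.append(row)
    elif lookup_failure == "Fail":
        raise RuntimeError("lookup failed for %r" % row)
    elif lookup_failure == "Continue":
        output.append(row)             # pass through without ref columns
    # "Drop": do nothing, the row is discarded
```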
Dataset stage:
Dataset stage is a file stage. It allows you to read data from or write data to a data set. The stage can have one input link or a single output link; it won't allow both input and output links at the same time.
The data in a data set is stored in an internal format. Using data sets wisely can be key to good performance in a set of linked jobs.
A Dataset consists of two parts:
1. Descriptor file: Contains metadata and data location.
2. Data file: Contains the data.
Data sets are operating system files, each referred to by a control file, which has the suffix .ds. Parallel jobs use data sets to manage data within a job. They allow you to store data in a persistent form, which can then be used by other jobs. We can also manage data sets independently of a job using the Data Set Management utility, available from the DataStage Director and Manager.
A data set is stored across nodes using the selected partitioning method, so it is always faster when used as a source or target. The stage can be configured to execute in parallel or sequential mode. If the Dataset stage is operating in sequential mode, it will first collect the data before writing it to the file, using the default Auto collection method. By default the stage partitions in Auto mode.
A sequential file used as a source or target, by contrast, needs to be repartitioned, as it is (as the name suggests) a single sequential stream of data.
Properties:
Source category:
File: The name of the control file for the data set. We can browse for the file or enter it via a job parameter. By convention this file has the suffix .ds.
Update Policy: Specifies what action will be taken if the data set you are writing to
already exists.
Append: Append any new data to the existing data.
Create (Error if exists): DataStage reports an error if the data set already exists.
Overwrite: Overwrites any existing data with new data.
Sequential File stage:
Properties:
Source category:
File: The name of the source flat file. We can browse for the file or enter a job parameter.
Read method: specific files / File pattern
Specific files: specify the pathname of the file being read from (repeat this for
reading multiple files).
File pattern: specify the pattern of the file to read from.
Options:
First line is column name: True / False
If set to True, the stage treats the first line of the file as column names rather than data; if set to False, the first line is read as data.
Keep file partitions: True / False
If set to True, the stage keeps the partitioning of the source and does not repartition the data.
Missing file Mode: Depends / Error / OK
Depends: the default is Error unless the file name has a node name prefix of *:, in which case it is OK.
Error: stop the job if one of the files mentioned does not exist.
OK: skip the missing file.
Reject mode: Continue / Fail / Output
Continue: the stage discards any rejected records.
Fail: the job stops if any record is rejected.
Output: rejected records are sent to a reject link.
Report Progress: Yes / No
Enables or disables logging of a progress report at intervals.
Options:
Column method: Explicit / Schema file
Explicit means you specify the metadata for the columns you want to generate on the Output Page Columns tab. If you use the Explicit method, you also need to specify which of the output link columns you are generating; you can repeat this property to specify multiple columns. If you use the Schema File method, you specify a schema file instead.
A schema file is a plain text file in which the metadata for a stage is specified. A schema consists of a record definition. The following is an example record schema:
record(
name: string;
address: nullable string;
date: date;
)
Properties:
Options:
Number of Records: The number of records you want the generated data set to contain. The default number is 10.
Schema File: (optional) By default the stage bases the mock data set on the metadata defined on the input link, but we can specify the column definitions in a schema file if required. We can browse for the schema file or specify a job parameter.
Peek stage:
Peek stage is a development/debug stage. It has one input link and any number of output links.
The Peek stage lets you print record column values either to the job log or to a separate
output link as the stage copies records from its input data set to one or more output data sets.
This can be helpful for monitoring the progress of your application or to diagnose a bug in your
application.
Properties:
Rows Category:
Transformer stage:
Transformer stage is a processing stage. It can have one input link and any number of output links. It can also have a reject link that takes any rows which have not been written to any of the output links because of a write failure or an expression evaluation failure. Transformer stages do not extract data or write data to a target database; they are used to handle extracted data, perform any conversions required, and pass the data to another Transformer stage or to a stage that writes data to the target.
In the Transformer Editor window, we can create new columns, delete columns from a link, move columns within a link, edit column metadata, define output column derivations, define link constraints, specify the order in which the links are to be processed, and define local stage variables. We can simply drag and drop the metadata from the input link to the output link. Derivations for columns are specified in the output pane, where we can use system variables, functions, job parameters, DataStage macros, and DataStage routines.
Stage constraints: A constraint is an expression that specifies criteria that data must meet before
it is passed to the output link. If the constraint expression evaluates to TRUE for an input row,
the data row is output on that link. Rows that are not output on any of the links can be output on
the otherwise link. Constraint expressions on different links are independent.
Stage variables: This provides a method of defining expressions which can be reused in the
output column derivations. These values are not passed to the output.
Stage variables in the Transformer must be kept in the required order if they have dependencies on one another. For example, with three stage variables A, B, and C, if B depends on A, then A must be maintained before B; otherwise you will get very strange, wrong results.
If the result of a stage variable is a single character, set the length of the variable to VarChar(1); otherwise you will not get the proper result.
The following is the order of execution at the time of processing records:
Stage variables → Constraints → Column derivations
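That execution order can be sketched in plain Python (not Transformer syntax); the qty/price columns and the two stage variables are made-up examples:

```python
# Sketch of Transformer execution order per record: stage variables
# first (in dependency order), then the link constraint, then the
# output column derivations.
rows_out = []
for row in [{"qty": 5, "price": 4}, {"qty": 0, "price": 9}]:
    # Stage variables, evaluated in order: A before B, since B uses A.
    sv_total = row["qty"] * row["price"]      # stage variable A
    sv_big = sv_total > 10                    # stage variable B (depends on A)
    # Constraint: only rows meeting it reach this output link.
    if sv_big:
        # Column derivations may reuse the stage variables.
        rows_out.append({"qty": row["qty"], "total": sv_total})
```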
Oracle Enterprise stage:
Properties:
Table: Specifies the name of the table to write to. We can specify a job parameter if required. It appears only when Write Method = Load.
Write Method: Load / Delete Rows / Upsert
Load: To load the data to the target table.
Delete Rows: Allows you to specify, in the SQL property, how the delete statement is to be derived, using the Auto-generated Delete or User-defined Delete actions.
Upsert: Allows you to provide the insert and update SQL statements for writing records. We can restrict the Oracle stage to update only, or to update and insert; this is achieved with the User-defined Update or Auto-generated Update actions available in the stage.
Write Mode: Append / Create / Replace / Truncate
It appears only when Write Method = Load.
Append: New records are appended to an existing table. This is the default option.
Create: It creates a new table. If the Oracle table already exists, an error occurs and the job
terminates. You must specify this mode if the Oracle table does not exist.
Replace: The existing table is first dropped and an entirely new table is created in its place.
Oracle uses the default partitioning method for the new table.
Truncate: The existing table attributes (including schema) and the Oracle partitioning keys
are retained, but any existing records are discarded. New records are then appended to the
table.
Connection Category:
DB Options: Specify a user name and password for connecting to the database.
DB Options Mode: Auto-generate / User-defined
Here we provide the user name and password for connecting to the remote server. If you select User-defined, you have to edit the database options yourself.
Options Category:
Disable Constraints: Set to True to disable all enabled constraints on the table when loading, then attempt to re-enable them at the end of the load.
Silently Drop Columns Not in Table: This only appears for the Load write method. It is False by default. Set it to True to silently drop all input columns that do not correspond to columns in the existing Oracle table; otherwise the stage reports an error and terminates the job.
Truncate Column Names: This only appears for the Load write method. Set this property to True to truncate column names to 30 characters.