In DataStage 8.0.1, the DataStage Manager is integrated into the DataStage Designer.
2) This is OS independent; users can be created within DataStage itself (only a one-time OS dependency remains).
7) The application server is WebSphere.
4. DataStage function enhancements:
- New client / domain compatibility check
- Before/after routines now mask encrypted parameters
- Copy project permissions from an existing project when creating a new project
- Environment variable enhancements: creation during import
- PX stage reset support added
- Enhancement to the Parallel Data Set stage
- Multiple null field values on import
- Enhancements to improve Multi-Client Manager support
5. DataStage serviceability enhancements:
- New audit tracing
- Enhanced exception dialog
- ISA Lite enhancements for DataStage
- Enhanced project creation failure details
If runtime column propagation is enabled in the DataStage Administrator, you can select the Runtime Column Propagation option to specify that columns encountered by a stage in a parallel job can be used even if they are not explicitly defined in the metadata. You should always ensure that runtime column propagation is turned on if you want to use schema files to define column metadata.
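For illustration, a minimal schema file might look like the following (the column names are made up and the syntax follows the Orchestrate record-schema format, so treat this as a sketch and check the documentation for your version):
record
(
  emp_id: int32;
  emp_name: string[max=50];
  salary: nullable decimal[8,2];
)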
There are some special considerations when using runtime column propagation with certain stage types:
# Sequential File
# File Set
# External Source
# External Target
SMP:
a) SMP supports limited parallelism, i.e. 64 processors.
b) SMP processing is SEQUENTIAL.
MPP:
a) MPP can support N number of nodes or processors [high performance].
b) MPP processing can be PARALLEL.
Example: $dsjob -run, together with options such as -mode, -param, -wait, and -jobstatus.
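A fuller sketch of a dsjob run command (the project, job and parameter names here are made up for illustration):
dsjob -run -mode NORMAL -param SRC_DIR=/data/in -wait -jobstatus MyProject MyJob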
Dataset:
1. It preserves partitioning. It stores data on the nodes, so when you read from a dataset you don't have to repartition the data.
2. It stores data in binary in the internal format of DataStage, so it takes less time to read/write from a dataset than from/to any other source/target.
3. You cannot view the data without DataStage.
4. It creates two types of files for storing the data.
A) Descriptor file: created in the defined folder/path.
B) Data file: created in the dataset folder mentioned in the configuration file.
5. A Dataset (.ds) file cannot be opened directly; to view it you have to use an alternative route, either the Data Set Management utility in the client tools (such as Designer and Manager) or the ORCHADMIN command line.
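For example, a dataset can be inspected or removed from the command line with commands along these lines (the dataset name is made up and the exact options vary by version, so treat this as a sketch):
orchadmin describe mydata.ds
orchadmin dump mydata.ds
orchadmin rm mydata.ds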
Fileset:
1. It stores data in a format similar to that of a sequential file. The only advantage of using a fileset over a sequential file is that it preserves the partitioning scheme.
2. You can view the data but in the order defined in partitioning scheme.
3. Fileset creates .fs file and .fs file is stored as ASCII format, so you could directly open it to see the path
of data file and its schema.
1) At Source level
What is the main difference between Lookup, Join and Merge stages?
Lookup: when the reference data is small we use a lookup, because the data is stored in a buffer. If the reference data is very large it will take time to load and to look up.
Join: if the reference data is very large we go for a join, because it accesses the data directly from the disk, so the processing time is less compared to a lookup. But in a join we can't capture the rejected data, so we go for a merge.
Merge: if we want to capture rejected data (when the join key is not matched) we use the Merge stage. For every detail link there is a reject link to capture rejected data.
What are the different types of lookup? When one should use sparse lookup in a job?
1 output
But in DataStage version 8, enhancements have taken place. They are:
Normal Lookup:-- In a normal lookup, all the reference records are copied to memory and the primary records are cross-verified against the reference records.
Sparse Lookup:-- In a sparse lookup, each primary record is sent to the reference source and cross-verified against the reference records.
We use a sparse lookup when the reference data is too large to fit in memory and the primary input is relatively small compared to the reference data.
Range Lookup:--- Range Lookup is going to perform the range checking on selected columns.
For example: if we want to check the range of salary in order to find the grades of the employees, then we can use the range lookup.
Sequence copies all records from the first input data set to the output data set, then all the
records from the second input data set, and so on.
For all methods the metadata of all input data sets must be identical. The names of the columns should be the same on all input links.
Using a Sort stage you have the possibility to create a KeyChangeColumn - not possible in a link sort.
Within a Sort stage you have the possibility to increase the memory size per partition.
Within a Sort stage you can set the 'don't sort' option on sort keys that are already sorted.
Link sort and stage sort both do the same thing. Only the Sort stage provides you with more options, such as the amount of memory to be used, removing duplicates, sorting in ascending or descending order, creating key change columns, etc. These options are not available to you when using a link sort.
What is main difference between change capture and change apply stages?
Change Capture stage: compares two data set(after and before) and makes a record of the
differences.
Change apply stage : combine the changes from the change capture stage with the original before data
set to reproduce the after data set.
The Change Capture stage catches the changes between two different datasets and generates a new column called change code. The change code has the following values:
0-copy
1-insert
2-delete
3-edit/update
Change apply stage applies these changes back to those data sets based on the change code column.
1. Auto
2. Same
3. Round robin
4. Hash
5. Entire
6. Random
7. Range
8. Modulus
Collecting is the opposite of partitioning and can be defined as a process of bringing back data partitions
into a single sequential stream (one data partition).
1. Auto
2. Round Robin
3. Ordered
4. Sort Merge
Auto - default. Datastage Enterprise Edition decides between using Same or Round Robin partitioning.
Typically Same partitioning is used between two parallel stages and round robin is used between a
sequential and an EE stage.
Round robin - rows are alternated evenly across partitions. This partitioning method guarantees an exact load balance (the same number of rows processed) between nodes and is very fast.
Hash - rows with the same key column (or multiple columns) go to the same partition. Hash is very often used and sometimes improves performance; however it is important to keep in mind that hash partitioning does not guarantee load balance, and misuse may lead to skewed data and poor performance.
Entire - all rows from a dataset are distributed to each partition. Duplicated rows are stored and the
data volume is significantly increased.
Range - an expensive refinement to hash partitioning. It is similar to hash but partition mapping is user-
determined and partitions are ordered. Rows are distributed according to the values in one or more key
fields, using a range map (the 'Write Range Map' stage needs to be used to create it). Range partitioning
requires processing the data twice which makes it hard to find a reason for using it.
Modulus - data is partitioned on one specified numeric field by calculating modulus against number of
partitions. Not used very often.
Auto - the default algorithm reads rows from a partition as soon as they are ready. This may lead to
producing different row orders in different runs with identical data. The execution is non-deterministic.
Round Robin - picks rows from input partition patiently, for instance: first row from partition 0, next from
partition 1, even if other partitions can produce rows faster than partition 1.
Ordered - reads all rows from first partition, then second partition, then third and so on.
Sort Merge - produces a globally sorted sequential stream from rows that are sorted within each partition. The ordering of un-keyed columns is non-deterministic. The algorithm is: always pick the partition that produces the row with the smallest key value.
#Remove duplicates using Sort Stage and Remove Duplicate Stages and Differences?
We can remove duplicates using both stages but in the sort stage we can capture duplicate records using
create key change column property.
1) The advantage of using sort stage over remove duplicate stage is that sort stage allows us to capture
the duplicate records whereas remove duplicate stage does not.
2) Using a Sort stage we can only retain the first record. Normally we would retain the last record by sorting a particular field in ascending order and taking the last record; the same result can be obtained in a Sort stage by sorting in descending order and retaining the first record.
In a Copy stage there are no constraints or derivations, so it should surely perform better than a Transformer. If you want a copy of a dataset you had better use the Copy stage, and if there are any business rules to be applied to the dataset you had better use the Transformer stage.
We use the Copy stage to change the metadata of the input dataset (for example, changing a column name).
The Pivot Enterprise stage is a processing stage that pivots data horizontally and vertically.
Specifying a horizontal pivot operation: Use the Pivot Enterprise stage to horizontally pivot data to
map sets of input columns onto single output columns.
Specifying a vertical pivot operation: Use the Pivot Enterprise stage to vertically pivot data and then
map the resulting columns onto the output columns.
Stage Variable - an intermediate processing variable that retains its value during the read and does not pass the value into a target column.
Constraint- is like a filter condition which limits the number of records coming from input according to
business rule.
The right order is: Stage variables Then Constraints Then Derivations
What is the difference between change capture and change apply stages?
Change capture stage is used to get the difference between two sources i.e. after dataset and before
dataset. The source which is used as a reference to capture the changes is called after dataset. The
source in which we are looking for the change is called before dataset. This change capture will add one
field called "change code" in the output from this stage. By this change code one can recognize which
kind of change this is like whether it is delete, insert or update.
The Change Apply stage is used along with the Change Capture stage. It takes the change code from the Change Capture stage and applies all the changes to the before dataset based on the change code.
Change Capture is used to capture the changes between the two sources.
dsimport command
The dsimport command is as follows:
dsimport.exe /D=domain /H=hostname
/U=username /P=password
/NUA project|/ALL|
/ASK dsx_pathname1 dsx_pathname2 ...
The arguments are as follows:
domain or domain:port_number. The application server name. This can also optionally have a
port number.
hostname. The IBM® InfoSphere® DataStage® Server to which the file will be imported.
username. The user name to use for connecting to the application server.
password. The user's password.
/NUA. Include this flag to disable usage analysis. This is recommended if you are importing a large project.
project, /ALL, or /ASK. Specify a project to import the components to, or specify /ALL to import to
all projects or /ASK to be prompted for the project to which to import.
dsx_pathname. The file to import from. You can specify multiple files if required.
For example, the following command imports the components in the file jobs.dsx into the project dstage1
on the R101 server:
dsimport.exe /D=domain:9080 /U=wombat /P=w1ll1am dstage1 /H=R101
C:/scratch/jobs.dsx
When importing jobs or parameter sets with environment variable parameters, the import adds that environment variable to the project definitions if it is not already present. The value for the project definition of the environment variable is set to an empty string, because the original default for the project is not known. If the environment variable value is set to $PROJDEF in the imported component, the import warns you that you need to set the environment variable value in the project yourself.
dsexport command
The dsexport command is as follows:
dsexport.exe /D domain /H hostname
/U username /P password /JOB jobname
/XML /EXT /EXEC /APPEND project pathname1
The arguments are as follows:
domain or domain:port_number. The application server name. This can also optionally have a
port number.
hostname specifies the DataStage Server from which the file will be exported.
username is the user name to use for connecting to the Application Server.
password is the user’s password.
jobname specifies a particular job to export.
project. Specify the project to export the components from.
pathname. The file to which to export.
The command takes the following options:
/XML – export in XML format, only available with /JOB=jobname option.
/EXT – export external values, only available with /XML option.
/EXEC – export job executable only, only available with /JOB=jobname and when /XML is not
specified.
/APPEND – append to an existing dsx file, only available with the /EXEC option.
For example, the following command exports the project dstage2 from the R101 to the file dstage2.dsx:
dsexport.exe /D domain:9080 /H R101 /U billg /P paddock
dstage2 C:/scratch/dstage2.dsx
In-process:
-The performance of Data Stage jobs can be improved by turning in-process row buffering on followed by
job recompilation.
-Data from connected active stages is passed through buffers instead of passing row by row.
Inter-process:
-Inter-process buffering is used when an SMP parallel system runs server jobs
-It enables running a separate process for every active stage
Transformer Remembering:
DataStage 8.5 Transformer has Remembering and key change detection which is something that ETL
experts have been manually coding into DataStage for years using some well known workarounds. A key
change in a DataStage job involves a group of records with a shared key where you want to process that
group as a type of array inside the overall recordset.
I am going to make a longer post about that later, but there are two new cache functions inside a Transformer – SaveInputRecord() and GetSavedInputRecord() – with which you can save a record and retrieve it later on to compare two or more records inside a Transformer.
There are new system variables and functions for looping and key change detection: @ITERATION; LastRow(), which indicates the last row in a job; and LastRowInGroup(InputColumn), which indicates that a particular column value will change in the next record.
DS8.7: Improvements in Xmeta. Significant Performance improvement in Job Open, Save, Compile etc.
DS9.1: No Change
DS8.7: Improved partition/sort insertion algorithm. XML parsing performance is improved by 3x or more
for large XML files.
DS9.1: No Change
DS8.7: New Feature has been added (Menu -> View -> Job Log) .
DS9.1: No Change
DS8.7: Stop/ Reset button added to Compile and Run buttons for the DS jobs.
DS9.1: No Change
The running job can be continued or aborted by using multiple breakpoints with
conditional logic per link and node. (row data or job parameter values can be examined by
breakpoint conditional logic)
DS9.1: No Change
DS8.5: Extended to current horizontal parallel pivot. Enhanced pivot stage to support vertical pivoting.
(mapping multiple input rows with a common key, to a single output row containing multiple columns)
DS8.7: No Change
DS9.1: No Change
7. Balanced Optimization:
DS8.5: Balanced Optimization automatically redesigns the job to maximize performance by minimizing the amount of input and output performed and by balancing the processing against source, intermediate, and target environments. Balanced Optimization lets you take advantage of the power of the databases without becoming an expert in native SQL.
DS8.7: No Change
8. Transformer Enhancements
DS8.5: Looping in the transformer, Multiple output rows to be produced from a single input row. 1. New
input cache: SaveInputRecord(), GetSavedInputRecord().
2. New System Variables: @ITERATION, @Loop Count, @EOD(End of data flag for last row).
3. Functions : LastRowInGroup(InputColumn).
DS8.7: No Change
DS9.1: New transformation expressions have been added. EREPLACE: a function to replace a substring in an expression with another substring. If the occurrence is not specified, each occurrence of the substring is replaced.
DS8.7: Big Data File Stage for Big Data sources (Hadoop Distributed File System-HDFS).
1. The IBM Big Data Solution Integrate and manage the full variety, velocity and volume of data.
DS9.1: Java integration provides a baseline for upcoming big data source support.
DS9.1: No Change
DS8.7: IPv6 Support: Information Server is fully compatible with IPv6 addresses and can support dual-
stack protocol implementations. (Env Variable: APT_USE_IPV4.)
DS9.1: No Change
DS9.1: Excel read capabilities on all platforms with rich features to support ranges, multiple worksheets
and New Unstructured data read.
DS9.1: New big buffer optimizations which has increased bulk load performance in DB2 and Oracle
Connector by more than 50% in many cases.
Processing stages
Aggregator joins data vertically by grouping incoming data stream and calculating summaries (sum,
count, min, max, variance, etc.) for each group. The data can be grouped using two methods: hash table
or pre-sort.
Copy - copies input data (a single stream) to one or more output data flows
Join combines two or more inputs according to values of a key column(s). Similar concept to a relational DBMS SQL join (ability to perform inner, left, right and full outer joins). Can have 1 left and multiple right inputs (all need to be sorted) and produces a single output stream (no reject link).
Lookup combines two or more inputs according to values of a key column(s). Lookup stage can have 1
source and multiple lookup tables. Records don't need to be sorted and produces single output stream
and a reject link.
Merge combines one master input with multiple update inputs according to values of a key column(s). All
inputs need to be sorted and unmatched secondary entries can be captured in multiple reject links.
Modify stage alters the record schema of its input dataset. Useful for renaming columns, non-default data
type conversions and null handling
Remove duplicates stage needs a single sorted data set as input. It removes all duplicate records
according to a specification and writes to a single output
Slowly Changing Dimension automates the process of updating dimension tables, where the data
changes in time. It supports SCD type 1 and SCD type 2.
Transformer stage handles extracted data, performs data validation, conversions and lookups.
Change Capture - captures before and after state of two input data sets and outputs a single data set
whose records represent the changes made.
Change Apply - applies the change operations to a before data set to compute an after data set. It gets
data from a Change Capture stage
Difference stage performs a record-by-record comparison of two input data sets and outputs a single data set whose records represent the difference between them. Similar to the Change Capture stage.
Checksum - generates a checksum from the specified columns in a row and adds it to the stream. Used to determine if there are differences between records.
Compare performs a column-by-column comparison of records in two presorted input data sets. It can
have two input links and one output link.
Decode decodes a data set previously encoded with the Encode Stage.
External Filter permits specifying an operating system command that acts as a filter on the processed data.
Generic stage allows users to call an OSH operator from within DataStage stage with options as
required.
Pivot Enterprise is used for horizontal pivoting. It maps multiple columns in an input row to a single column in multiple output rows. Pivoting data results in a dataset with fewer columns but more rows.
Surrogate Key Generator generates surrogate key for a column and manages the key source.
Switch stage assigns each input row to an output link based on the value of a selector field. Provides a similar concept to the switch statement in most programming languages.
Compress - packs a data set using a GZIP utility (or compress command on LINUX/UNIX)
Expand extracts a previously compressed data set back into raw binary data.
Sequential file is used to read data from or write data to one or more flat (sequential) files.
Data Set stage allows users to read data from or write data to a dataset. Datasets are operating system
files, each of which has a control file (.ds extension by default) and one or more data files (unreadable by
other applications)
File Set stage allows users to read data from or write data to a fileset. Filesets are operating system files,
each of which has a control file (.fs extension) and data files. Unlike datasets, filesets preserve formatting
and are readable by other applications.
Complex flat file allows reading from complex file structures on a mainframe machine, such as MVS
data sets, header and trailer structured files, files that contain multiple record types, QSAM and VSAM
files.
External Source - permits reading data that is output from multiple source programs.
Lookup File Set is similar to the FileSet stage. It is a partitioned hashed file which can be used for lookups.
Database stages
Oracle Enterprise allows reading data from and writing data to an Oracle database (database versions from 9.x to 10g are supported).
ODBC Enterprise permits reading data from and writing data to a database defined as an ODBC source.
In most cases it is used for processing data from or to Microsoft Access databases and Microsoft Excel
spreadsheets.
DB2/UDB Enterprise permits reading data from and writing data to a DB2 database.
Teradata permits reading data from and writing data to a Teradata data warehouse. Three Teradata
stages are available: Teradata connector, Teradata Enterprise and Teradata Multiload
SQLServer Enterprise permits reading data from and writing data to Microsoft SQL Server 2005 and 2008 databases.
Sybase permits reading data from and writing data to Sybase databases.
Stored procedure stage supports Oracle, DB2, Sybase, Teradata and Microsoft SQL Server. The Stored
Procedure stage can be used as a source (returns a rowset), as a target (pass a row to a stored
procedure to write) or a transform (to invoke procedure processing within the database).
MS OLEDB helps retrieve information from any type of information repository, such as a relational source,
an ISAM file, a personal database, or a spreadsheet.
Dynamic Relational Stage (Dynamic DBMS, DRS stage) is used for reading from or writing to a
number of different supported relational DB engines using native interfaces, such as Oracle, Microsoft
SQL Server, DB2, Informix and Sybase.
Classic federation
RedBrick Load
Netezza Enterprise
iWay Enterprise
XML Input stage makes it possible to transform hierarchical XML data to flat relational data sets
XML Output writes tabular data (relational tables, sequential files or any datastage data streams) to XML
structures
Java client stage can be used as a source stage, as a target and as a lookup. The java package consists
of three public classes: com.ascentialsoftware.jds.Column, com.ascentialsoftware.jds.Row,
com.ascentialsoftware.jds.Stage
Java transformer stage supports three links: input, output and reject.
Restructure stages:
Column export stage exports data from a number of columns of different data types into a single column
of data type ustring, string, or binary. It can have one input link, one output link and a rejects link.
Column import complementary to the Column Export stage. Typically used to divide data arriving in a
single column into multiple columns.
Combine records stage combines rows which have identical keys, into vectors of subrecords.
Make subrecord combines specified input vectors into a vector of subrecords whose columns have the
same names and data types as the original vectors.
Split subrecord - separates an input subrecord field into a set of top-level vector columns
Split vector promotes the elements of a fixed-length vector to a set of top-level columns
Notification Activity - used for sending emails to user defined recipients from within Datastage
Sequencer used for synchronization of a control flow of multiple activities in a job sequence.
Terminator Activity permits shutting down the whole sequence once a certain situation occurs.
Wait for file Activity - waits for a specific file to appear or disappear and launches the processing.
EndLoop Activity
Exception Handler
Execute Command
Nested Condition
Routine Activity
StartLoop Activity
UserVariables Activity
A primary key is a special constraint on a column or set of columns. A primary key constraint ensures
that the column(s) so designated have no NULL values, and that every value is unique. Physically, a
primary key is implemented by the database system using a unique index, and all the columns in the
primary key must have been declared NOT NULL. A table may have only one primary key, but it may be
composite (consist of more than one column).
A surrogate key is any column or set of columns that can be declared as the primary key instead of a
"real" or natural key. Sometimes there can be several natural keys that could be declared as the primary
key, and these are all called candidate keys. So a surrogate is a candidate key. A table could actually
have more than one surrogate key, although this would be unusual. The most common type of surrogate
key is an incrementing integer, such as an auto increment column in MySQL, or a sequence in Oracle, or
an identity column in SQL Server.
Ans 2): A surrogate key is an artificial identifier for an entity; surrogate keys are generated by the system sequentially. A primary key is a natural identifier for an entity; primary key values are entered manually and uniquely identify each row, so there is no replication of data.
What is the basic difference between the Transformer and BASIC Transformer stages in parallel jobs?
It supports one input link, 'n' number of output links, and only one reject link.
Can have one primary input link, multiple reference input links, and multiple output links.
The link from the main data input source is designated as the primary input link.
What is a Schema?
Star schema is a data warehouse schema where there is only one “fact table" and many de-normalized
dimension tables.
Fact table contains primary keys from all the dimension tables and other numeric columns of additive,
numeric facts.
Unlike Star-Schema, Snowflake schema contain normalized dimension tables in a tree like structure
with many nesting levels.
The star schema is the simplest data warehouse schema. The snowflake schema is a more complex data warehouse model than a star schema.
In a star schema each of the dimensions is represented in a single table; there should not be any hierarchies between dimensions. In a snowflake schema at least one hierarchy should exist between dimension tables.
A star schema contains a fact table surrounded by dimension tables; if the dimensions are de-normalized, we say it is a star schema design. A snowflake schema also contains a fact table surrounded by dimension tables; if a dimension is normalized, we say it is a snowflaked design.
In a star schema only one join establishes the relationship between the fact table and any one of the dimension tables. In a snowflake schema, since there are relationships between the dimension tables, many joins have to be done to fetch the data.
It is called a star schema because the diagram resembles a star. It is called a snowflake schema because the diagram resembles a snowflake.
Basics of SCD
Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly, rather than
changing on a time-based, regular schedule.
Type 1
The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at
all.
In this example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the
surrogate key is not necessary, since the table will be unique by the natural key (Supplier_Code).
However, the joins will perform better on an integer than on a character string.
Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply
overwrite this record:
Type 2
The Type 2 method tracks historical data by creating multiple records for a given natural key in the
dimensional tables with separate surrogate keys and/or different version numbers. With Type 2, we have
unlimited history preservation as a new record is inserted each time a change is made.
In the same example, if the supplier moves to Illinois, the table could look like this, with incremented
version numbers to indicate the sequence of changes:
Another popular method for tuple versioning is to add effective date columns.
The null End_Date in row two indicates the current tuple version. In some cases, a standardized
surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an
index, and so that null-value substitution is not required when querying.
Figure 1
Step 2: To set up the SCD properties in the SCD stage, open the stage and access the Fast Path.
Figure 2
Step 3: Tab 2 of the SCD stage is used to specify the purpose of each of the keys pulled from the referenced dimension tables.
Figure 3
Step 4: Tab 3 is used to provide the sequence generator file/table name, which is used to generate the new surrogate keys for the new or latest dimension records. These are the keys which also get passed to the fact tables for direct load.
Figure 4
Step 5: Tab 4 is used to set the properties for configuring the data population logic for the new and old dimension rows. The types of activities that we can configure as part of this tab are:
1. Generating the new surrogate key values to be passed to the dimension and fact tables
2. Mapping the source columns to the target columns
3. Setting up the expired values for the old rows
4. Defining the values that mark the current active rows out of multiple rows
Figure 5
Step 6: Set the derivation logic for the fact as a part of the last tab.
Figure 6
Step 7: Complete the remaining set-up and run the job.
Figure 7
How to perform incremental load in DataStage?
-When data is selected from the source, records are selected between the timestamp of the last load and the current time
-The parameters that are passed are the last loaded date and the current date
-The first parameter, the stored last run date, is read through job parameters
Ans-2) Incremental load means daily load. Whenever you select data from the source, select the records which are loaded or updated between the timestamp of the last successful load and today's load start date and time. For this you have to pass parameters for those two dates: store the last run date and time in a file and read it through job parameters, and set the second argument to the current date and time.
Sequence numbers can be generated in Datastage using certain routines. They are
-KeyMgtGetNextVal
-KeyMgtGetNextValConn
What are Routines and where/how are they written and have you written any routines before?
Ans: Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following are different types of routines:
1) Transform functions
Basically, an environment variable is a predefined variable that we can use while creating a DataStage job. We create/declare these variables in the DataStage Administrator. While designing the job we set the properties for these variables. Environment variables are also called global variables.
1. Local variables
2. Environment variables / global variables
Can you give some examples of environment variables, so that it will be clearer for us?
An example:
To connect to a database you need a user id, a password and a schema.
These are constant throughout the project, so they are created as environment variables.
By using this, if there is any change in the password or schema there is no need to worry about all the jobs; change it at the level of the environment variable and that will take care of all the jobs.
There is an icon to go to Job Parameters in the toolbar, or you can press Ctrl+J to enter the Job Parameters dialog box. Once you enter it, give a parameter name and a corresponding default value. This lets you enter the value when you run the job; it is not always necessary to open the job to change the parameter value. Also, when the job runs through a script it is enough to give the parameter value on the command line of the script; otherwise you would have to change the value in the job, compile, and then run the script. So it is easy for users to handle jobs using parameters.
A parameter set is new functionality provided starting with version 8.x, wherein we can define a group of parameters as a "parameter set" at the project level and then use that set in all of the jobs of the concerned project, thereby eliminating the need to define the parameters in each job individually.
Primary Key is a combination of unique and not null. It can be a collection of key values called as
composite primary key.
Partition Key is a just a part of Primary Key. There are several methods of partition like Hash, DB2, and
Random etc. While using Hash partition we specify the Partition Key.
Sequential File :
-Converted into native format from ASCII, if utilized as source while compiling
Data Set :
-Stored in the native internal (binary) format of DataStage, so no conversion is needed when it is used as a source (see the Dataset notes earlier in this document)
Tell me one situation from your last project, where you had faced problem and How did you solve
it?
Ans: The jobs in which data is read directly from OCI stages are running extremely slow. I had to stage
the data before sending to the transformer to make the jobs run faster
DataStage has a feature which allows users to suppress or demote a warning in the DataStage log file. Although this is something I've never seen used that much, it's still worth knowing. DataStage lets users do this through message handlers, which can be defined either at a job level or at a project level. Let's have a look at how to set message handlers in DataStage.
As you can see from the design, we are reading from a sequential file, sorting the data and then writing back to a sequential file. Since the sequential file stage does not run in parallel, we will get the warning below in the DataStage logs.
Now, if we need to suppress this particular warning from the job logs we should do the following:
- Right-click the warning and select the 'Add rule to message handler' option.
- This will give you a new window. In that window, select the 'Add rule to message handler' option and then click Add Rule.
- You then enter the name of the new message handler you are creating and click OK.
Your message handler has been created. The next time you run your job the warning won't be present in your logs. The list of messages handled will be present in your job log; it will be the last entry before the 'Control' entry.
The message handler you have just created is a local message handler that will be applied only to this job. However, you can apply this message handler to all the jobs in your project by setting it in the Administrator. The option is present in the 'Parallel' tab of your respective project.
The message handler I selected was the one I had just created.
SQL
SQL Statements
Most of the actions you need to perform on a database are done with SQL statements.
The general syntax is:
SELECT column_name, column_name
FROM table_name;
and
SELECT * FROM table_name;
The following SQL statement selects the "CustomerName" and "City" columns from the "Customers" table:
Example
SELECT CustomerName, City FROM Customers;
The following SQL statement selects all the columns (all the records) from the "Customers" table:
Example
SELECT * FROM Customers;
Tables are organized into rows and columns; and each table must have a name.
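The CREATE TABLE syntax these parameters belong to looks roughly like this generic sketch:
CREATE TABLE table_name
(
column_name1 data_type(size),
column_name2 data_type(size),
column_name3 data_type(size),
....
);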
The column_name parameters specify the names of the columns of the table.
The data_type parameter specifies what type of data the column can hold (e.g. varchar, integer, decimal,
date, etc.).
The size parameter specifies the maximum length of the column of the table.
Now we want to create a table called "Persons" that contains five columns: PersonID, LastName,
FirstName, Address, and City.
Example
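A sketch of the statement being described (the varchar sizes are illustrative):
CREATE TABLE Persons
(
PersonID int,
LastName varchar(255),
FirstName varchar(255),
Address varchar(255),
City varchar(255)
);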
In a table, a column may contain many duplicate values; and sometimes you only want to list the different
(distinct) values.
The DISTINCT keyword can be used to return only distinct (different) values.
The following SQL statement selects only the distinct values from the "City" columns from the
"Customers" table:
Example
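A sketch of the statement described above:
SELECT DISTINCT City FROM Customers;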
The WHERE clause is used to extract only those records that fulfill a specified criterion.
SELECT column_name,column_name
FROM table_name
WHERE column_name operator value;
The first form does not specify the column names where the data will be inserted, only their values:
The second form specifies both the column names and the values to be inserted:
Example
The following SQL statement will insert a new row, but only insert data in the "CustomerName", "City",
and "Country" columns (and the CustomerID field will of course also be updated automatically):
Example
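Sketches of the two INSERT forms and of the statement described above (the inserted values are illustrative):
INSERT INTO table_name
VALUES (value1, value2, value3, ...);
INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
INSERT INTO Customers (CustomerName, City, Country)
VALUES ('Cardinal', 'Stavanger', 'Norway');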
UPDATE table_name
SET column1=value1,column2=value2,...
WHERE some_column=some_value;
Assume we wish to update the customer "Alfreds Futterkiste" with a new contact person and city.
Example
UPDATE Customers
SET ContactName='Alfred Schmidt', City='Hamburg'
WHERE CustomerName='Alfreds Futterkiste';
Update Warning!
Be careful when updating records. If we had omitted the WHERE clause in the example above, like this:
UPDATE Customers
SET ContactName='Alfred Schmidt', City='Hamburg';
then every record in the Customers table would have been given that contact name and city.
Assume we wish to delete the customer "Alfreds Futterkiste" from the "Customers" table.
Example
It is possible to delete all rows in a table without deleting the table. This means that the table structure,
attributes, and indexes will be intact:
or
Note: Be very careful when deleting records. You cannot undo this statement!
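Sketches of the statements described above (the last form is accepted by some databases only):
DELETE FROM Customers WHERE CustomerName='Alfreds Futterkiste';
DELETE FROM table_name;
or
DELETE * FROM table_name;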
The ALTER TABLE statement is used to add, delete, or modify columns in an existing table.
To delete a column in a table, use the following syntax (notice that some database systems don't allow
deleting a column):
To change the data type of a column in a table, use the following syntax:
MySQL / Oracle:
Notice that the new column, "DateOfBirth", is of type date and is going to hold a date. The data type
specifies what type of data the column can hold. For a complete reference of all the data types available
in MS Access, MySQL, and SQL Server, go to our complete Data Types reference.
Now we want to change the data type of the column named "DateOfBirth" in the "Persons" table.
Notice that the "DateOfBirth" column is now of type year and is going to hold a year in a two-digit or four-
digit format.
Next, we want to delete the column named "DateOfBirth" in the "Persons" table.
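Sketches of the ALTER TABLE statements walked through here (the syntax differs slightly between databases; MODIFY is MySQL/Oracle style, ALTER COLUMN is SQL Server/MS Access style):
ALTER TABLE Persons ADD DateOfBirth date;
ALTER TABLE Persons MODIFY DateOfBirth year;
ALTER TABLE Persons ALTER COLUMN DateOfBirth year;
ALTER TABLE Persons DROP COLUMN DateOfBirth;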
Indexes, tables, and databases can easily be deleted/removed with the DROP statement.
What if we only want to delete the data inside the table, and not the table itself?
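The statements referred to here, in sketch form:
DROP TABLE table_name;
DROP DATABASE database_name;
TRUNCATE TABLE table_name;   (deletes only the data inside the table, not the table itself)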
The following SQL statement selects all the customers from the country "Mexico", in the "Customers" table:
Example
SELECT * FROM Customers WHERE Country='Mexico';
DELETE
1. DELETE is a DML Command.
2. DELETE statement is executed using a row lock, each row in the table is locked for deletion.
3. We can specify filters in where clause
4. It deletes specified data if where condition exists.
5. Delete activates a trigger because the operations are logged individually.
6. Slower than truncate because it keeps logs.
7. Rollback is possible.
TRUNCATE
1. TRUNCATE is a DDL command.
2. TRUNCATE TABLE always locks the table and page but not each row.
3. Cannot use Where Condition.
4. It Removes all the data.
5. TRUNCATE TABLE cannot activate a trigger because the operation does not log individual row
deletions.
6. Faster performance-wise, because it doesn't keep any logs.
7. Rollback is not possible.
DELETE and TRUNCATE can both be rolled back when used within a TRANSACTION.
If the transaction is committed, we cannot roll back the TRUNCATE command, but we can still roll back the DELETE command from the log files, because DELETE writes its records to the log file so that they can be rolled back in the future if needed.
DROP
The DROP command removes a table from the database. All the tables' rows, indexes and privileges will
also be removed. The operation cannot be rolled back.
A UNIQUE constraint and a PRIMARY KEY are similar: both enforce uniqueness of the column(s) on which they are defined.
Some basic differences between a primary key and a unique key are as follows.
Primary key
3. Primary key is implemented as indexes on the table. By default this index is clustered
index.
4. Primary key can be related with another table's as a Foreign Key.
5. We can generate ID automatically with the help of Auto Increment field. Primary key
supports Auto Increment value.
Unique Constraint
NORMALIZATION:
Some Oracle databases were modeled according to the rules of normalization that were intended to
eliminate redundancy.
Obviously, the rules of normalization are required to understand your relationships and functional
dependencies
Does not have a composite primary key. Meaning that the primary key can not be subdivided into
separate logical entities.
All the non-key columns are functionally dependent on the entire primary key.
A row is in second normal form if, and only if, it is in first normal form and every non-key attribute
is fully dependent on the key.
2NF eliminates functional dependencies on a partial key by putting the fields in a separate table from those that are dependent on the whole key. An example is resolving many-to-many relationships using an intersecting entity.
Functional dependencies on non-key fields are eliminated by putting them in a separate table. At
this level, all non-key fields are dependent on the primary key.
A row is in third normal form if and only if it is in second normal form and if attributes that do not contribute to a description of the primary key are moved into a separate table. An example is creating look-up tables.
Have no multiple sets of multi-valued dependencies. In other words, 4NF states that no entity can have
more than a single one-to-many relationship.
Alter
Drop
Truncate
Update
Delete
Revoke
Rollback
Save point
A view has a logical existence; it does not contain data. A materialized view has a physical existence.
We cannot perform DML operations on a view. We can perform DML operations on a materialized view.
When we do select * from a view it fetches the data from the base table. When we do select * from a materialized view it fetches the data from the materialized view.
ROWID
A globally unique identifier for a row in a database. It is created at the time the row is inserted into a table, and destroyed when it is removed from a table. Its format is 'BBBBBBBB.RRRR.FFFF', where BBBBBBBB is the block number, RRRR is the slot (row) number, and FFFF is the file number.
ROWNUM
For each row returned by a query, the ROWNUM pseudo column returns a number indicating the order in
which Oracle selects the row from a table or set of joined rows. The first row selected has a ROWNUM of
1. The second has 2, and so on.
You can use ROWNUM to limit the number of rows returned by a query, as in this example:
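For example (assuming an employees table):
SELECT * FROM employees WHERE ROWNUM < 11;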
Rowid: a globally unique identifier for a row in a database. It is created at the time the row is inserted into the table, and destroyed when it is removed from the table.
Rownum: a pseudocolumn that returns a number indicating the order in which Oracle selects the row from a table or set of joined rows.
SELECT column, group_function(column)
FROM table
[WHERE condition]
[GROUP BY group_by_expression]
[HAVING group_condition]
[ORDER BY column];
The WHERE clause cannot be used to restrict groups; you use the HAVING clause for that. Both the WHERE and HAVING clauses can be used to filter data.
WHERE can be used without the GROUP BY clause. HAVING cannot be used without the GROUP BY clause (it has to be used together with GROUP BY).
The WHERE clause selects rows before grouping (it applies to individual rows). The HAVING clause selects rows after grouping (it tests a condition on the group rather than on individual rows).
The WHERE clause is used to restrict rows. The HAVING clause is used to restrict groups.
In the WHERE clause every record is filtered individually. In the HAVING clause the filtering is on aggregated records (group by functions).
The WHERE clause cannot contain aggregate functions. The HAVING clause can contain aggregate functions.
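A small illustration of the difference, using the emp table from the examples below (the job name and threshold are illustrative):
SELECT deptno, AVG(sal)
FROM emp
WHERE job <> 'CLERK'
GROUP BY deptno
HAVING AVG(sal) > 2000;
Here the WHERE clause removes individual rows before grouping, and the HAVING clause removes whole groups after aggregation.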
Sub Query:
Example:
Select deptno, ename, sal from emp a where sal in (select sal from Grade where sal_grade=’A’ or
sal_grade=’B’)
Example:
Find all employees who earn more than the average salary in their department:
Select A.* from employees A where A.salary > (select avg(B.salary) from employees B where B.department_id = A.department_id
Group by B.department_id)
EXISTS:
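EXISTS returns true when the subquery produces at least one row. A typical correlated sketch, using the emp and dept tables from the other examples:
Select * from dept d where exists (select 1 from emp e where e.deptno = d.deptno);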
A sub-query is executed once for the parent query, whereas a correlated sub-query is executed once for each row of the parent query.
Example (sub-query): Select * from emp where deptno in (select deptno from dept);
Example (correlated sub-query): Select e.* from emp e where sal >= (select avg(sal) from emp a where a.deptno=e.deptno group by a.deptno);
IMPORTANT QUERIES
Get duplicate rows from the table:
Select empno, count (*) from EMP group by empno having count (*)>1;
Delete from EMP where rowid not in (select max (rowid) from EMP group by empno);
select
emp_id,
max(decode(row_id,0,address))as address1,
max(decode(row_id,1,address)) as address2,
max(decode(row_id,2,address)) as address3
group by emp_id
Other query:
select
emp_id,
max(decode(rank_id,1,address)) as add1,
max(decode(rank_id,2,address)) as add2,
max(decode(rank_id,3,address))as add3
from
(select emp_id,address,rank() over (partition by emp_id order by emp_id,address )rank_id from temp )
group by
emp_id
Rank query:
Select empno, ename, sal, r from (select empno, ename, sal, rank () over (order by sal desc) r from
EMP);
The DENSE_RANK function works acts like the RANK function except that it assigns consecutive ranks:
Select empno, ename, sal, r from (select empno, ename, sal, dense_rank () over (order by sal desc) r from
emp);
Select empno, ename, sal,r from (select empno,ename,sal,dense_rank() over (order by sal desc) r from
emp) where r<=5;
Or
Select * from (select * from EMP order by sal desc) where rownum<=5;
2nd highest Sal:
Select empno, ename, sal, r from (select empno, ename, sal, dense_rank () over (order by sal desc) r
from EMP) where r=2;
Top sal:
Select * from EMP where sal= (select max (sal) from EMP);
SQL> select *from emp where (rowid, 0) in (select rowid,mod(rownum,2) from emp);
The purpose of the SQL UNION and UNION ALL commands are to combine the results of two or more
queries into a single result set consisting of all the rows belonging to all the queries in the union. The
question becomes whether or not to use the ALL syntax.
The main difference between UNION ALL and UNION is that UNION only selects distinct values, while UNION ALL selects all values (including duplicates).
[SQL Statement 1]
UNION {ALL}
[SQL Statement 2]
[GROUP BY ...]
Sample Data
Use Authors table in SQL Server Pubs database or just use a simple table with these values
(obviously simplified to just illustrate the point):
Nashville TN 37215
Lawrence KS 66044
Corvallis OR 97330
This SQL statement combines two queries to retrieve records based on states. The two queries happen to
both get records from Tennessee ('TN'):
SELECT City, State, Zip FROM Authors WHERE State IN ('KS', 'TN')
UNION ALL
SELECT City, State, Zip FROM Authors WHERE State IN ('OR', 'TN')
Nashville TN 37215
Lawrence KS 66044
Nashville TN 37215
Corvallis OR 97330
Notice how this displays the two query results in the order they appear from the queries. The first two
records come from the first SELECT statement, and the last two records from the second SELECT
statement. The TN record appears twice, since both SELECT statements retrieve TN records.
Using the same SQL statements and combining them with a UNION command:
SELECT City, State, Zip FROM Authors WHERE State IN ('KS', 'TN')
UNION
SELECT City, State, Zip FROM Authors WHERE State IN ('OR', 'TN')
Corvallis OR 97330
Lawrence KS 66044
Nashville TN 37215
Notice how the TN record only appears once, even though both SELECT statements retrieve TN records.
The UNION syntax automatically eliminates the duplicate records between the two SQL statements and
sorts the results. In this example the Corvallis record appears first but is from the second SELECT
statement.
Answer: Oracle 9i introduced these two functions and they are used to rank the records of a table based on column(s). The syntax for using these functions in SQL queries is 'RANK()/DENSE_RANK() OVER (ORDER BY <column(s)>)'. Here the ORDER BY clause decides which columns (and in what order) will be used to group and rank the records.
The default sorting order is ascending, and if we want we may specify 'DESC' to get a descending sort order. The ranks start from 1 and not from 0.
The RANK() returns the position of a value within the partition of a result set, with gaps in the ranking
where there are ties.
The DENSE_RANK() returns the position of a value within the partition of a result set, with no gaps in the
ranking where there are ties.
RANK
Let's assume we want to assign a sequential order, or rank, to people within a department based on
salary, we might use the RANK function like.
SELECT empno,
deptno,
sal,
RANK() OVER (PARTITION BY deptno ORDER BY sal) "rank"
FROM emp;
7876 20 1100 2
7566 20 2975 3
7788 20 3000 4
7902 20 3000 4
7900 30 950 1
7654 30 1250 2
7521 30 1250 2
7844 30 1500 4
7499 30 1600 5
7698 30 2850 6
SQL>
What we see here is where two people have the same salary they are assigned the same rank. When
multiple rows share the same rank the next rank in the sequence is not consecutive.
DENSE_RANK
The DENSE_RANK function acts like the RANK function except that it assigns consecutive ranks.
SELECT empno,
deptno,
sal,
DENSE_RANK() OVER (PARTITION BY deptno ORDER BY sal) "rank"
FROM emp;
DECODE is a function in Oracle and is used to provide if-then-else type of logic to SQL. It is not available
in MySQL or SQL Server. The syntax for DECODE is:
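A sketch of the general form:
DECODE(expression, search_value_1, result_1, search_value_2, result_2, ..., default_result)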
"search_value" is the value to search for, and "result" is the value that is displayed.
Table Store_Information
if we want to display 'LA' for 'Los Angeles', 'SF' for 'San Francisco', 'SD' for 'San Diego', and 'Others' for all
other cities, we would issue the following SQL,
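A sketch of that statement, assuming the Store_Information table has Store_Name and Sales columns:
SELECT DECODE(Store_Name,
'Los Angeles', 'LA',
'San Francisco', 'SF',
'San Diego', 'SD',
'Others') AS Area, Sales
FROM Store_Information;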
"Area" is the name given to the column with the DECODE statement.
Result:
CASE and DECODE are two widely used constructs in SQL, and both provide the functionality of an IF-THEN-ELSE statement to return a specified value when some criterion is met. Even though they are used interchangeably there are some differences between them.
This article tries to list the advantages of CASE over DECODE and also explains how to convert DECODE to CASE and vice versa.
CASE was introduced in Oracle 8.1.6 as a replacement for DECODE. In any case it is a much better option than DECODE, as it is, among other things,
3. ANSI compatible
SIMPLE CASE
a. Expression Syntax
Code :
CASE [ expression ]
WHEN Value_1 THEN result_1
WHEN Value_2 THEN result_2
...
WHEN Value_n THEN result_n
[ELSE else_result]
END
Here CASE checks the value of Expression and returns the result each time for each record as specified.
Here is one such example to list the new salaries for all employees, together with the equivalent DECODE syntax (both queries return 14 rows).
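A sketch of what such a pair of queries could look like (the job names and percentage uplifts are made up for illustration):
SELECT empno, job, sal,
CASE job
WHEN 'CLERK' THEN sal * 1.10
WHEN 'MANAGER' THEN sal * 1.20
ELSE sal
END AS new_sal
FROM emp;
SELECT empno, job, sal,
DECODE(job, 'CLERK', sal * 1.10,
'MANAGER', sal * 1.20,
sal) AS new_sal
FROM emp;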
In database design, we start with one single table with all possible columns. A lot of redundant data would be present since it is a single table. The process of removing the redundant data by splitting up the table in a well-defined fashion is called normalization.
A relation is said to be in first normal form if and only if all underlying domains contain atomic values only.
After 1NF , we can still have redundant data.
A relation is said to be in 2NF if and only if it is in 1NF and every non key attribute is fully dependent on
the primary key. After 2NF , we can still have redundant data
A relation is said to be in 3NF if and only if it is in 2NF and every non key attribute is non-transitively
dependent on the primary key
In order to avoid data duplication, data is stored in related tables. The JOIN keyword is used to fetch data from related tables. A join returns rows when there is at least one match in both tables. The types of joins are:
Left Join
Returns all rows from the left table, even if there are no matches in the right table.
Right Join
Returns all rows from the right table, even if there are no matches in the left table.
Full (Outer) Join
Returns all rows when there is a match in either of the tables.
What is Self-Join?
Self-join is query used to join a table to itself. Aliases should be used for the same table comparison.
Cross Join will return all records where each row from the first table is combined with each row from the
second table.
DELETE from table_name A WHERE ROWID> (SELECT min (ROWID) from table_name B WHERE
A.Col_name=B.Col_name)
Select * from (Select * from table_name ORDER BY Salary Desc) WHERE ROWNUM<=5;
select * from (select * from emp order by sal desc) where rownum<6;
select * from (select * from emp order by sal asc) where rownum<6;
for maximum:
select * from emp where sal in(select min(sal) from(select sal from emp group by sal order by sal desc)
where rownum<=&n);
select * from emp a where &n=(select count(distinct(sal)) from emp b where a.sal<=b.sal);
for minimum:
select * from emp where sal in(select max(sal) from(select sal from emp group by sal order by sal asc)
where rownum<=&n);
select * from emp a where &n=(select count(distinct(sal)) from emp b where a.sal>=b.sal)
By using aliases:
select e.ename “employee name”,e1.ename “manger name” from emp e,emp e1 where e.mgr=e1.empno;
select * from dup where rowid not in(select max(rowid)from dup group by eno);
delete from dup where rowid not in(select max(rowid)from dup group by eno);
select ename from emp group by rownum,ename having rownum>1 and rownum<6;
select deptno,ename,sal from emp where rowid in(select rowid from emp
select * from emp where rownum<=7 minus select * from emp where rownum<5;
Select * from emp where rownum<=&n minus select * from emp where rownum<&n;
Select * from emp e1 where 3=(select count(distinct sal) from emp e2 where
e1.sal<=e2.sal)
union
Select * from emp e3 where 3=(select count(distinct sal) from emp e4 where
e3.sal>=e4.sal);
Select * from emp where (rowid,0) in (select rowid,mod (rownum,2) from emp);
Select * from emp where (rowid,1) in (select rowid,mod (rownum,2) from emp);
Query to display the middle records (i.e., drop the first 5 and the last 5 records in the emp table)?
One approach uses MINUS, as sketched below.
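A sketch of one such query, relying on Oracle's ROWNUM and the emp table used in the other examples:
select * from emp where rownum <= (select count(*) - 5 from emp)
minus
select * from emp where rownum <= 5;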
In this article I am giving examples of some SQL queries which are asked in interviews of candidates with one or two years of experience in this field. Whenever you go for a Java developer position or any other programmer position, the interviewer expects that if you have been working on a project for one or two years you have definitely had to handle this kind of database query, so they test your skill by asking these types of simple queries.
Answer : There are many ways to find second highest salary of Employee in SQL, you can either use SQL Join or
Subquery to solve this problem. Here is SQL query using Subquery :
select MAX(Salary) from Employee WHERE Salary NOT IN (select MAX(Salary) from Employee );
See How to find second highest salary in SQL for more ways to solve this problem.
Answer:
Ans: SQL has a built-in function called GetDate() which returns the current timestamp.
SELECT GetDate();
Question 4:Write an SQL Query to check whether date passed to Query is date of given format or not.
Ans: SQL has IsDate() function which is used to check passed value is date or not of specified format ,it returns
1(true) or 0(false) accordingly.
Question 5: Write a SQL Query to print the name of distinct employee whose DOB is between 01/01/1960 to
31/12/1975.
Ans:
SELECT DISTINCT EmpName FROM Employees WHERE DOB BETWEEN ‘01/01/1960’ AND ‘31/12/1975’;
Question 6:Write an SQL Query find number of employees according to gender whose DOB is between
01/01/1960 to 31/12/1975.
Answer: SELECT COUNT(*), sex FROM Employees WHERE DOB BETWEEN '01/01/1960' AND '31/12/1975' GROUP BY sex;
Question 7: Write an SQL Query to find employees whose salary is equal to or greater than 10000.
Question 8: Write an SQL Query to find the names of employees whose names start with 'M'.
Question 9: Find all Employee records containing the word "Joe", regardless of whether it was stored as JOE, Joe, or joe.
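Sketches of possible answers to Questions 7-9, assuming the same Employees table (EmpName, Salary columns) used above:
SELECT * FROM Employees WHERE Salary >= 10000;
SELECT * FROM Employees WHERE EmpName LIKE 'M%';
SELECT * FROM Employees WHERE UPPER(EmpName) LIKE '%JOE%';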
Hope this article helps you get some quick practice whenever you are going to attend an interview and do not have much time to go deep into each query.