

What is the difference between server jobs and parallel jobs?


Ans:-
Server jobs:-
a) Server jobs handle smaller volumes of data.
b) They have fewer components (stages).
c) Data processing is slower.
d) They work purely on SMP (Symmetric Multi-Processing).
e) They rely heavily on the Transformer stage.
f) Server jobs are available as soon as the DataStage Server is installed.
Parallel jobs:-
a) Parallel jobs handle high volumes of data.
b) They work on parallel processing concepts.
c) They apply parallelism techniques (pipeline and partition parallelism).
d) They follow MPP (Massively Parallel Processing).
e) They have more components compared to server jobs.
f) They work on the Orchestrate framework.
g) Parallel jobs are available only when Enterprise Edition is installed.

DIFFERENCE BETWEEN DATASTAGE 7.5X2 AND DATASTAGE 8.0.1 VERSIONS



1) In DataStage 7.5x2, there are 4 client components:

a) DataStage Designer
b) DataStage Director
c) DataStage Manager
d) DataStage Administrator

In DataStage 8.0.1, there are 5 client components:
a) DataStage Designer
b) DataStage Director
c) DataStage Administrator
d) Web Console
e) Information Analyzer

Here the DataStage Manager functionality is integrated into the DataStage Designer.

2) DataStage 7.5x2 is OS dependent; that is, the OS users are the DataStage users.

In DataStage 8.0.1 it is OS independent; that is, users can be created within DataStage itself, with only a one-time OS dependency.

3) DataStage 7.5x2 uses a file-based repository (a folder structure).

DataStage 8.0.1 uses a database-backed DataStage repository.

4) DataStage 7.5x2 has no web-based administration.

DataStage 8.0.1 provides web-based administration.

5) DataStage 7.5x2 has 2 architecture components:

a) Server
b) Client

DataStage 8.0.1 has 5 architecture components:
a) Common User Interface
b) Common Repository
c) Common Engine
d) Common Connectivity
e) Common Shared Services

6) In DataStage 7.5x2 only P-3 and P-4 can be performed:

P-3 is Data Transformation.
P-4 is Metadata Management.

In DataStage 8.0.1, P-1, P-2, P-3 and P-4 can all be performed:

P-1 is Data Profiling.
P-2 is Data Quality.
P-3 is Data Transformation.
P-4 is Metadata Management.

7) In DataStage 7.5x2 the server is IIS; in DataStage 8.0.1 the server is WebSphere.


DataStage 8.1 to DataStage 8.5

1. DataStage Designer performance improvement: by changing the metadata algorithm, copy/delete/save
operations on jobs became about 30-40% faster.
2. Parallel Engine performance and resource improvements: resource usage is about 5% lower than in
8.1 for T-Sort, and the Windows desktop heap size has been decreased by 94%.
3. Transformer enhancements - key break support:
The LastRowInGroup() function has been added. It returns true for the last record of a group.
LastRow() returns true for the last record of the input.
Output looping: allows multiple output records to be created per single input record.
Input looping: allows aggregation of input records so that aggregated data can be included with the
original input data (for example, adding an average column to the original input is now possible, as a
two-pass calculation).
New null handling: this is fairly complicated and needs more verification to explain clearly, but this is
the description available.
Null values can now be included in any expression.
-> Null values no longer need to be explicitly handled.
A null value in an expression will return a null result. As long as the target column is nullable,
records will not be dropped. Stage variables are now always nullable.

APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING is provided to support backward compatibility.

New Transformer Functions


Create/offset a time, date or timestamp from component arguments
DateFromComponents(int32 years, int32 months, int32 dayofmonth)
DateOffsetByComponents(date basedate, int32 yearoffset, int32 monthoffset, int32 dayoffset)
DateOffsetByDays(date basedate, int32 offset)
TimeFromComponents(int32 hours, int32 minutes, int32 seconds, int32 microseconds)
TimeOffsetByComponents(time basetime, int32 houroffset, int32 minuteoffset, dfloat secondoffset)
TimeOffsetBySeconds(time basetime, dfloat secondoffset)
TimestampOffsetByComponents(timestamp basetimestamp, int32 yearoffset, int32 monthoffset, int32
dayoffset, int32 houroffset, int32 minuteoffset, dfloat secondoffset)
TimestampOffsetBySeconds(timestamp basetimestamp, dfloat secondoffset)

Various packed decimal conversions


DecimalToDate(decimal basedecimal [,string format] )
DecimalToTime(decimal basedecimal [,string format] )
DecimalToTimestamp(decimal basedecimal [,string format] )
DateToDecimal(date basedate [,string format] )
TimeToDecimal(time basetime [,string format] )
TimestampToDecimal(timestamp basetimestamp [,string format] )

4. DataStage function enhancements:
- New client/domain compatibility check
- Before/after routines now mask encrypted parameters
- Copy project permissions from an existing project when creating a new project
- Environment variable enhancements: creation during import
- PX stage reset support
- Enhancements to the Parallel Data Set stage
- Multiple null field values on import
- Enhancements to improve Multi-Client Manager support
5. DataStage serviceability enhancements:
- New audit tracing
- Enhanced exception dialog
- ISA Lite enhancements for DataStage
- Enhanced project creation failure details

6. Parallel Pivot - vertical pivoting added.


7. Version control integration: the Information Server Manager has been built on Eclipse since 8.1, so the
CVS and Subversion plug-ins for Eclipse are now available for DataStage components.

What is a configuration file? What is its use in DataStage?


It is a normal text file. It holds information about the processing and storage resources that are
available for use during parallel job execution.
The default configuration file contains:
a) Node: a logical processing unit which performs the ETL operations.
b) Pools: a named collection of nodes.
c) Fastname: the server (node) name; the ETL processes for that node are executed on the machine identified by this name.
d) Resource disk: the permanent storage area where data set files are stored.
e) Resource scratchdisk: the temporary storage area where staging operations, such as sorting, are
performed.
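As an illustration, a minimal two-node configuration file might look like the sketch below. The fastname and the disk paths are hypothetical and would be replaced by the values for your own environment.

{
    node "node1"
    {
        fastname "etl_server"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etl_server"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
    }
}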
What is DataStage Runtime Column Propagation (RCP)?
If you enable the RCP feature, stages in parallel jobs can handle undefined
columns that they encounter when the job is run, and propagate these
columns through to the rest of the job.

Enable Runtime Column Propagation for Parallel Jobs


If you enable this feature, stages in parallel jobs can handle undefined columns that they encounter when
the job is run, and propagate these columns through to the rest of the job. This check box enables the
feature; to actually use it you need to explicitly select the option on each stage.

If runtime column propagation is enabled in the DataStage Administrator, you can select the Runtime
column propagation to specify that columns encountered by a stage in a parallel job can be used even if
they are not explicitly defined in the meta data. You should always ensure that runtime column
propagation is turned on if you want to use schema files to define column meta data.

There are some special considerations when using runtime column propagation with certain stage types:

# Sequential File
# File Set
# External Source
# External Target

RCP Set at DataStage Administrator:



RCP Set at DataStage Stage Output:

What is DataStage parallel Extender / Enterprise Edition (EE)?


Parallel Extender provides parallel processing of data extraction and transformation operations. There
are two types of parallel processing:
1) Pipeline parallelism
2) Partition parallelism

What is SMP & MPP? And Difference between SMP&MPP



SMP:
a) SMP supports limited parallelism, i.e. typically up to around 64 processors.
b) In SMP, the processors share memory, so scalability is limited.
MPP:
a) MPP can support N number of nodes or processors (high
performance).
b) In MPP, each node has its own memory and processors, so processing scales in parallel across nodes.

B. What is a conductor node?


Ans -> Every parallel job run contains a conductor process, where the execution starts, a
section leader process for each processing node, a player process for each set of combined
operators, and an individual player process for each uncombined operator.
Whenever we want to kill a job we should destroy the player processes first, then the section
leader processes, and then the conductor process.

C.How do you execute datastage job from command line prompt?


Using "dsjob" command as follows. dsjob -run -jobstatus projectname jobname

ex:$dsjob -run
and also the options like

-stop -To stop the running job


-lprojects - To list the projects
-ljobs - To list the jobs in project
-lstages - To list the stages present in job.
-llinks - To list the links.
-projectinfo - returns the project information(hostname and project name)
-jobinfo - returns the job information(Job status,job runtime,endtime, etc.,)
-stageinfo - returns the stage name ,stage type,input rows etc.,)
-linkinfo - It returns the link information
-lparams - To list the parameters in a job
-paraminfo - returns the parameters info
-log - add a text message to log.
-logsum - To display the log
-logdetail - To display the log with details like event_id, time, message
-lognewest - To display the newest log id.
-report - To display a report containing generated time, start time, elapsed time, status etc.
-jobid - Job id information.

D.Difference between sequential file,dataset and fileset?


Sequential File:
1. Extracting from / loading to a sequential file is limited to 2GB per file.
2. When used as a source, it is converted from ASCII into native format at compile time.
3. Does not support null values.
4. A sequential file can only be accessed on one node.

Dataset:
1. It preserves partitioning; it stores data on the nodes, so when you read from a dataset you do not have
to repartition the data.
2. It stores data in binary in the internal format of DataStage, so it takes less time to read/write from a
dataset to any other source/target.
3. You cannot view the data without DataStage.
4. It creates 2 types of files to store the data:
A) Descriptor file: created in the defined folder/path.
B) Data files: created in the dataset (resource disk) locations mentioned in the configuration file.
5. A dataset (.ds) file cannot be opened directly; instead you can use the Data Set Management
utility in the client tools (such as Designer and Manager), or the command-line utility
ORCHADMIN.

Fileset:
1. It stores data in a format similar to that of a sequential file. The main advantage of using a fileset over a
sequential file is that it preserves the partitioning scheme.
2. You can view the data, but in the order defined by the partitioning scheme.
3. A fileset creates a .fs file, which is stored in ASCII format, so you can open it directly to see the paths
of the data files and their schema.

Difference b/w Parameters and parameter set


 A parameter set is new functionality provided starting with v8.x, wherein we can define a group of
parameters as a "parameter set" at the project level, and then use that set in all of the jobs of the
concerned project, thereby eliminating the need to define the parameters in each job
specifically.
 Parameters are stored in - as properties of - individual jobs. A parameter set is a re-usable
component stored in the Repository that can be loaded into as many jobs as required. A
Parameter Set con ...

This is useful from a maintenance and reusability point of view.

What is Filter Stage? And use of it?

Filter Stage is used to block unwanted data.

We can perform Filter in 3 different ways.

1) At Source level

2) At Stages ( They are Filter, Switch , Ext. Filter )

3) At Constraints ( Tx, LookUp)

In the Filter stage we can apply conditions on multiple columns.



In the Switch stage we can apply the condition on a single column.

In the External Filter stage we can apply conditions based on UNIX commands.

Filter supports 1 input, N outputs, 1 reject link.

Switch supports 1 input, 128 outputs, 1 reject link.

External Filter supports 1 input, 1 output, no reject link.

What is the main difference between Lookup, Join and Merge stages?

All are used to join tables, but find the difference.

Lookup: when the reference data is small we use lookup, because the reference data is held in memory. If the
reference data is very large then it will take time to load and to look up.

Join: if the reference data is very large then we go for join, because it accesses the data directly from
the disk, so the
processing time will be less when compared to lookup. But in a join we can't capture the rejected data,
so we go for merge.

Merge: if we want to capture rejected data (when the join key is not matched) we use the Merge stage. For
every update (detail) link there is a reject link to capture rejected data.

Significant differences that I have noticed are:


1) Number of reject links
(Join) does not support a reject link.
(Merge) has as many reject links as there are update links (if there are n input links then 1 will be the master link and
n-1 will be update links).
2) Data selection
(Join) There are various ways in which data is selected, e.g. we have different types of joins: inner,
outer (left, right, full), cross join, etc. So you have different selection criteria for dropping/selecting a row.
(Merge) Data in the master record and update records are merged only when both have the same value for the
merge key columns.
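As a rough SQL analogy (the table and column names below are made up purely for illustration), the Join stage behaves like a database join, while the rows that a Lookup or Merge stage would send down a reject link correspond to the unmatched rows of an outer join:

SELECT s.order_id, s.customer_id, c.customer_name
FROM   orders s
LEFT JOIN customers c
       ON c.customer_id = s.customer_id;

-- Rows where c.customer_name IS NULL are the ones a Lookup or Merge
-- stage would capture on its reject link.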

What are the different types of lookup? When one should use sparse lookup in a job?

In DS 7.5, 2 types of lookup are available: 1. Normal 2. Sparse


From DS 8.0.1 onwards, 3 types of lookup are available: 1. Normal 2. Sparse 3. Range

Look Up stage is a processing stage which performs horizontal combining.

Lookup stage Supports



N inputs ( for Normal Lookup )

2 Inputs ( For Sparse Lookup)

1 output

And 1 Reject link

Up to DataStage version 7 we have only 2 types of lookups:

a) Normal lookup and b) Sparse lookup

But in DataStage version 8, enhancements have taken place. They are

c) Range lookup and d) Caseless lookup

Normal Lookup:-- In a normal lookup, all the reference records are copied into memory and the primary
records are cross-verified with the reference records.

Sparse Lookup:-- In a sparse lookup, each primary record is sent to the reference source (database) and cross-verified
with the reference records there.

We use a sparse lookup when the reference data is too large to fit in memory
and the number of primary records is relatively small compared to the reference data.

Range Lookup:--- A range lookup performs range checking on selected columns.

For example: if we want to check which range a salary falls into, in order to find the grade of the employee,
we can use the range lookup.
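Conceptually this is like joining on a BETWEEN condition in SQL; the tables and columns in the sketch below are hypothetical:

SELECT e.emp_id, e.salary, g.grade
FROM   employees e
JOIN   salary_grades g
       ON e.salary BETWEEN g.min_salary AND g.max_salary;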

Use and Types of Funnel Stage in Datastage ?


The Funnel stage is a processing stage. It copies multiple input data sets to a single output data set. This
operation is useful for combining separate data sets into a single large data set. The stage can have any
number of input links and a single output link.
The Funnel stage can operate in one of three modes:
 Continuous Funnel combines the records of the input data in no guaranteed order. It takes one
record from each input link in turn. If data is not available on an input link, the stage skips to the
next link rather than waiting.
 Sort Funnel combines the input records in the order defined by the value(s) of one or more key
columns and the order of the output records is determined by these sorting keys.

 Sequence copies all records from the first input data set to the output data set, then all the
records from the second input data set, and so on.

For all methods the metadata of all input data sets must be identical, and the names of the columns should be the same
in all input links.
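In SQL terms, the Funnel stage is loosely comparable to a UNION ALL over inputs with identical column definitions (the tables below are hypothetical); without an ORDER BY the output order is not guaranteed, much like the continuous funnel mode:

SELECT cust_id, cust_name FROM customers_region_a
UNION ALL
SELECT cust_id, cust_name FROM customers_region_b;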

What is the difference between Link Sort and the Sort Stage?


Or: difference between link sort and stage sort?

If the volume of the data is low, then we go for link sort.


If the volume of the data is high, then we go for the Sort stage.
"Link sort" uses the scratch disk (a physical location on disk), whereas the
"Sort stage" uses server RAM (memory). Hence we can change the default memory size in the Sort stage.

Using the Sort stage you have the possibility to create a key change column - not possible in link sort.
Within the Sort stage you have the possibility to increase the memory size per partition.
Within the Sort stage you can define the 'don't sort (previously sorted)' option on sort keys that are already sorted.

Link sort and stage sort both do the same thing. Only the Sort stage provides you with more options, like
the amount of memory to be used, remove duplicates, sort in ascending or descending order, create
key change columns, etc. These options are not available to you while using link sort.

What is main difference between change capture and change apply stages?

Change Capture stage: compares two data sets (before and after) and makes a record of the
differences.
Change Apply stage: combines the changes from the Change Capture stage with the original before data
set to reproduce the after data set.

The Change Capture stage catches the changes between two different data sets and generates a new column
called change code. The change code has the values
0 - copy
1 - insert
2 - delete
3 - edit/update

The Change Apply stage applies these changes back to the before data set based on the change code column.
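A hedged SQL sketch of the same idea (the before/after table names and key columns are hypothetical): a full outer join of the before and after data sets can classify each key into the change codes listed above:

SELECT COALESCE(a.cust_id, b.cust_id) AS cust_id,
       CASE
           WHEN b.cust_id IS NULL          THEN 1  -- insert (only in after)
           WHEN a.cust_id IS NULL          THEN 2  -- delete (only in before)
           WHEN a.cust_name <> b.cust_name THEN 3  -- edit/update
           ELSE 0                                  -- copy (unchanged)
       END AS change_code
FROM   before_data b
FULL OUTER JOIN after_data a
       ON a.cust_id = b.cust_id;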

Difference between Transformer and Basic Transformer stage?


Basic Difference between Transformer and BASIC transformer stage in parallel jobs?

The BASIC Transformer can be used in server jobs and in parallel jobs, but:


It supports one input link, 'n' number of output links, and only one reject link.
The BASIC Transformer operates in sequential mode.
All its functions, macros and routines are written using the BASIC language.

Parallel Transformer stages


Can have one primary input link, multiple reference input links, and multiple output links.
The link from the main data input source is designated as the primary input link.
In the PX Transformer all functions and macros are written in the C++ language.
It supports partitioning of data.
# Details of Data partitioning and collecting methods in Datastage?
The partitioning mechanism divides the data into smaller segments, which are then processed
independently by each node in parallel. It helps to take advantage of parallel architectures like SMP, MPP,
grid computing and clusters.

1. Auto
2. Same
3. Round robin
4. Hash
5. Entire
6. Random
7. Range
8. Modulus

Collecting is the opposite of partitioning and can be defined as a process of bringing back data partitions
into a single sequential stream (one data partition).

1. Auto

2. Round Robin

3. Ordered

4. Sort Merge

** DATA PARTITIONING METHODS : DATASTAGE SUPPORTS A FEW TYPES OF DATA


PARTITIONING METHODS WHICH CAN BE IMPLEMENTED IN PARALLEL STAGES:

 Auto - default. Datastage Enterprise Edition decides between using Same or Round Robin partitioning.
Typically Same partitioning is used between two parallel stages and round robin is used between a
sequential and an EE stage.

 Same - existing partitioning remains unchanged. No data is moved between nodes.

 Round robin - rows are alternated evenly across partitions. This partitioning method guarantees an
exact load balance (the same number of rows processed) between nodes and is very fast.

 Hash - rows with same key column (or multiple columns) go to the same partition. Hash is very often
used and sometimes improves performance, however it is important to have in mind that hash partitioning
does not guarantee load balance and misuse may lead to skew data and poor performance.

 Entire - all rows from a dataset are distributed to each partition. Duplicated rows are stored and the
data volume is significantly increased.

 Random - rows are randomly distributed across partitions.

 Range - an expensive refinement to hash partitioning. It is similar to hash, but the partition mapping is user-
determined and partitions are ordered. Rows are distributed according to the values in one or more key
fields, using a range map (the 'Write Range Map' stage needs to be used to create it). Range partitioning
requires processing the data twice, which makes it hard to find a reason for using it.

 Modulus - data is partitioned on one specified numeric field by calculating modulus against number of
partitions. Not used very often.

** DATA COLLECTING METHODS: A COLLECTOR COMBINES PARTITIONS INTO A SINGLE


SEQUENTIAL STREAM. DATASTAGE PARALLEL SUPPORTS THE FOLLOWING COLLECTING
ALGORITHMS:

 Auto - the default algorithm reads rows from a partition as soon as they are ready. This may lead to
producing different row orders in different runs with identical data. The execution is non-deterministic.

 Round Robin - picks rows from input partition patiently, for instance: first row from partition 0, next from
partition 1, even if other partitions can produce rows faster than partition 1.

 Ordered - reads all rows from first partition, then second partition, then third and so on.

 Sort Merge - produces a globally sorted sequential stream from rows that are sorted within each partition. Sort
Merge produces a sequential stream sorted on the key columns (non-deterministic on un-keyed columns) using the following
algorithm: always pick the partition that produces the row with the smallest key value.

#Remove duplicates using Sort Stage and Remove Duplicate Stages and Differences?

We can remove duplicates using both stages but in the sort stage we can capture duplicate records using
create key change column property.

1) The advantage of using sort stage over remove duplicate stage is that sort stage allows us to capture
the duplicate records whereas remove duplicate stage does not.
2) Using a sort stage we can only retain the first record of each group.
Normally we retain the last record by sorting the particular field in ascending order and taking the last
record; the same result can be achieved with a sort stage by sorting in descending order and retaining the first record.
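The same "retain last" idea can be expressed in SQL with a window function; the table, key and date columns below are hypothetical:

SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY cust_id
                              ORDER BY load_date DESC) AS rn
    FROM   customer_stage t
) ranked
WHERE rn = 1;   -- keeps the latest row per key, i.e. "retain last"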

#what is difference between Copy & transformer stage?

In a Copy stage there are no constraints or derivations, so it should perform better than a
Transformer. If you just want a copy of a data set, use the Copy stage; if there are any business rules
to be applied to the data set, use the Transformer stage.

We also use the Copy stage to change the metadata of the input data set (for example, changing column names).

What is the use Enterprise Pivot Stage?

The Pivot Enterprise stage is a processing stage that pivots data horizontally and vertically.

 Specifying a horizontal pivot operation: Use the Pivot Enterprise stage to horizontally pivot data to
map sets of input columns onto single output columns.

Table 1. Input data for a simple horizontal pivot operation

REPID last_name Jan_sales Feb_sales Mar_sales

100 Smith 1234.08 1456.80 1578.00

101 Yamada 1245.20 1765.00 1934.22

Table 2. Output data for a simple horizontal pivot operation

REPID last_name Q1sales Pivot_index

100 Smith 1234.08 0

100 Smith 1456.80 1

100 Smith 1578.00 2

101 Yamada 1245.20 0

101 Yamada 1765.00 1

101 Yamada 1934.22 2

 Specifying a vertical pivot operation: Use the Pivot Enterprise stage to vertically pivot data and then
map the resulting columns onto the output columns.

Table 1. Input data for vertical pivot operation

REPID last_name Q_sales

100 Smith 1234.08

100 Smith 1456.80

100 Smith 1578.00

101 Yamada 1245.20

101 Yamada 1765.00

101 Yamada 1934.22

Table 2. Output data for vertical pivot operation

REPID  last_name  Q_sales (January)  Q_sales1 (February)  Q_sales2 (March)  Q_sales_average

100    Smith      1234.08            1456.80              1578.00           1412.96

101    Yamada     1245.20            1765.00              1934.22           1648.14

# What are Stage Variables, Derivations and Constraints?

Stage variable - an intermediate processing variable that retains its value during a row's processing and does not pass the
value to a target column.

Derivation - an expression that specifies the value to be passed on to the target column.

Constraint - like a filter condition which limits the number of records coming from the input according to a
business rule.

The order of evaluation is: stage variables, then constraints, then derivations.

What is the difference between change capture and change apply stages?

Change capture stage is used to get the difference between two sources i.e. after dataset and before
dataset. The source which is used as a reference to capture the changes is called after dataset. The
source in which we are looking for the change is called before dataset. This change capture will add one
field called "change code" in the output from this stage. By this change code one can recognize which
kind of change this is like whether it is delete, insert or update.

Change apply stage is used along with change capture stage. It takes change code from the change
capture stage and apply all the changes in the before dataset based on the change code.

Change Capture is used to capture the changes between the two sources.

Change Apply will apply those changes in the output file.

What is the difference between Server Job and Parallel Jobs?


A. Server jobs do not support partitioning techniques, but parallel jobs support partitioning
techniques.
B. Server jobs do not support SMP or MPP, but parallel jobs support SMP and MPP.
C. Server jobs run on a single node, but parallel jobs run on multiple nodes.
D. Server jobs are preferred when the source data volume is low; when the data volume is huge, prefer parallel jobs.

What are parallel jobs?
These are compiled and run on the DataStage server in a similar way to server jobs, but support parallel
processing on SMP, MPP and cluster systems.

How do you use a stored procedure in a DataStage job?
Use the ODBC plug-in stage, pass one dummy column and give the procedure name in the SQL tab.
How to Import and export datastage jobs in unix?
Ans:)

Use the following command:

* To export a job:

DsExport /H=<Host> /U=<User> /P=<Password> /JOB=<job_name> <Project> <job_name.dsx>

* To export a project:

DsExport /H=<Host> /U=<User> /P=<Password> <Project> <project_name.dsx>

You can find the "DsExport" command in the DS install directory.

dsimport command
The dsimport command is as follows:
dsimport.exe /D=domain /H=hostname
/U=username /P=password
/NUA project|/ALL|
/ASK dsx_pathname1 dsx_pathname2 ...
The arguments are as follows:

 domain or domain:port_number. The application server name. This can also optionally have a
port number.
 hostname. The IBM® InfoSphere® DataStage® Server to which the file will be imported.
 username. The user name to use for connecting to the application server.
 password. The user's password .
 NUA. Include this flag to disable usage analysis. This is recommended if you are importing a
large project.
 project, /ALL, or /ASK. Specify a project to import the components to, or specify /ALL to import to
all projects or /ASK to be prompted for the project to which to import.
 dsx_pathname. The file to import from. You can specify multiple files if required.
For example, the following command imports the components in the file jobs.dsx into the project dstage1
on the R101 server:
dsimport.exe /D=domain:9080 /U=wombat /P=w1ll1am dstage1 /H=R101
C:/scratch/jobs.dsx
When importing jobs or parameter sets with environment variable parameters, the import adds that
environment variable to the project definitions if it is not already present. The value for the project
definition of the environment variable is set to an empty string, because the original default for the project
is not known. If the environment variable value is set to $PROJDEF in the imported component, the
import warns you that you need to set the environment variable value in the project yourself.

dsexport command
The dsexport command is as follows:
dsexport.exe /D domain /H hostname
/U username /P password /JOB jobname
/XML /EXT /EXEC /APPEND project pathname1
The arguments are as follows:
 domain or domain:port_number. The application server name. This can also optionally have a
port number.
 hostname specifies the DataStage Server from which the file will be exported.
 username is the user name to use for connecting to the Application Server.
 password is the user’s password.
 jobname specifies a particular job to export.
 project. Specify the project to export the components from.
 pathname. The file to which to export.
The command takes the following options:
 /XML – export in XML format, only available with /JOB=jobname option.
 /EXT – export external values, only available with /XML option.
 /EXEC – export job executable only, only available with /JOB=jobname and when /XML is not
specified.
 /APPEND – append to existing dsx file/ only available with /EXEC option.
For example, the following command exports the project dstage2 from the R101 to the file dstage2.dsx:
dsexport.exe /D domain:9080 /H R101 /U billg /P paddock
dstage2 C:/scratch/dstage2.dsx

Generate Surrogate Key without Surrogate Key Stage?

Use the following formula in Transformer stage to generate a surrogate key.



@PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + 1

For example, with 2 partitions (@NUMPARTITIONS = 2), the rows on partition 0 receive keys 1, 3, 5, ... and the rows on partition 1 receive keys 2, 4, 6, ..., so the generated keys are unique across all partitions.

Difference between In Process and Inter Process? – DataStage

In-process:

-The performance of Data Stage jobs can be improved by turning in-process row buffering on followed by
job recompilation.

-Data from connected active stages is passed through buffers instead of passing row by row.

Inter-process:

- Inter-process row buffering is used when server jobs run on an SMP parallel system.

- Inter-process buffering enables a separate process for every active stage.

- Data is passed between these processes through shared-memory buffer blocks.

Transformer Remembering:

DataStage 8.5 Transformer has Remembering and key change detection which is something that ETL
experts have been manually coding into DataStage for years using some well known workarounds. A key
change in a DataStage job involves a group of records with a shared key where you want to process that
group as a type of array inside the overall recordset.

I am going to make a longer post about that later, but there are two new cache objects inside a
Transformer - SaveInputRecord() and GetSavedInputRecord() - where you can save a record and retrieve
it later on to compare two or more records inside a Transformer.

There are new system variables and functions for looping and key change detection: @ITERATION; LastRow(),
which indicates the last row in a job; and LastRowInGroup(InputColumn), which indicates that a particular column value will
change in the next record.

Here is an aggregation example where rows are looped.

Datastage 8.5, 8.7 and 9.1 Differences


1. Design & Runtime Performance Changes:
DS8.5: Implemented by Internal code change. Design and Runtime performance is better than 8.1, 40%
performance improvement in job open, save, compile etc.

DS8.7: Improvements in Xmeta. Significant Performance improvement in Job Open, Save, Compile etc.

DS9.1: No Change

2. PX Engine Performance Changes :

DS8.5: Not Exist

DS8.7: Improved partition/sort insertion algorithm. XML parsing performance is improved by 3x or more
for large XML files.

DS9.1: No Change

3. Added View Job Log in Designer client

DS8.5: Not Exist

DS8.7: New Feature has been added (Menu -> View -> Job Log) .

Job log is now viewed in Designer client.

DS9.1: No Change

4. Added Stop/Reset buttons In Designer Client:

DS8.5: Not Exist

DS8.7: Stop/ Reset button added to Compile and Run buttons for the DS jobs.

DS9.1: No Change

5. Interactive Parallel Job Debugging

DS8.5: Not Exist

DS8.7: Breakpoints with conditional logic per link and node.

(Link -> Rclick -> Toggle Breakpoint)

The running job can be continued or aborted by using multiple breakpoints with
conditional logic per link and node. (row data or job parameter values can be examined by
breakpoint conditional logic)

DS9.1: No Change

6. Added Vertical Pivot: Pivot Enterprise Stage

DS8.5: Extended to current horizontal parallel pivot. Enhanced pivot stage to support vertical pivoting.
(mapping multiple input rows with a common key, to a single output row containing multiple columns)

DS8.7: No Change

DS9.1: No Change

7. Balance Optimization :

DS8.5: Balanced Optimization redesigns the job automatically to maximize performance by
minimizing the amount of input and output performed, and by balancing the processing against source,
intermediate, and target environments. Balanced Optimization enables you to take advantage of the
power of the databases without becoming an expert in native SQL.

DS8.7: No Change

DS9.1: Balanced Optimization for Hadoop.



8. Transformer Enhancements

DS8.5: Looping in the transformer, Multiple output rows to be produced from a single input row. 1. New
input cache: SaveInputRecord(), GetSavedInputRecord().

2. New System Variables: @ITERATION, @Loop Count, @EOD(End of data flag for last row).

3. Functions : LastRowInGroup(InputColumn).

4. Null Handling more Options.

DS8.7: No Change

DS9.1: New transformation expressions have been added. EREPLACE: a function to replace a substring in an
expression with another substring. If the occurrence is not specified, then each occurrence of the substring will be
replaced.

9. Big Data File Stage :

DS8.5: Not Exist

DS8.7: Big Data File Stage for Big Data sources (Hadoop Distributed File System-HDFS).

DS9.1: New Enhancement on

1. The IBM Big Data solution: integrate and manage the full variety, velocity and volume of data.

2. New Hadoop-based Big Data Support Any to Big Data.

3. Big Data Integration with DataStage.

10. Added Java Integration Stage

DS8.5: Not Exist

DS8.7: Not Exist

DS9.1: Allows invoking Java code and creates a baseline for upcoming big data source support.

11. Added Encryption Techniques

DS8.5: Not Exist

DS8.7: Encryption added for security reasons:

1. Strongly encrypted credential files for command line utilities.

2. Strongly encrypted job parameter files for dsjob command.

3. Encryption Algorithm and Customization.

DS9.1: No Change

12. Added Dual-stack protocol Support



DS8.5: Not Exist

DS8.7: IPv6 support: Information Server is fully compatible with IPv6 addresses and can support
dual-stack protocol implementations. (Env variable: APT_USE_IPV4.)

DS9.1: No Change

13. Added Unstructured text stage

DS8.5: Not Exist

DS8.7: Not Exist

DS9.1: Excel read capabilities on all platforms with rich features to support ranges and multiple worksheets,
and a new unstructured data read capability.

14. Added DBMS Connector Boost

DS8.5: Not Exist

DS8.7: Not Exist

DS9.1: New big buffer optimizations, which have increased bulk load performance in the DB2 and Oracle
Connectors by more than 50% in many cases.

Processing stages

Aggregator joins data vertically by grouping incoming data stream and calculating summaries (sum,
count, min, max, variance, etc.) for each group. The data can be grouped using two methods: hash table
or pre-sort.

Copy - copies input data (a single stream) to one or more output data flows

FTP stage uses FTP protocol to transfer data to a remote machine

Filter filters out records that do not meet specified requirements.

Funnel combines multiple streams into one.

Join combines two or more inputs according to values of a key column(s). Similar concept to relational
DBMS SQL join (ability to perform inner, left, right and full outer joins). Can have 1 left and multiple right
inputs (all need to be sorted) and produces single output stream (no reject link).

Lookup combines two or more inputs according to values of a key column(s). Lookup stage can have 1
source and multiple lookup tables. Records don't need to be sorted and produces single output stream
and a reject link.

Merge combines one master input with multiple update inputs according to values of a key column(s). All
inputs need to be sorted and unmatched secondary entries can be captured in multiple reject links.

Modify stage alters the record schema of its input dataset. Useful for renaming columns, non-default data
type conversions and null handling

Remove duplicates stage needs a single sorted data set as input. It removes all duplicate records
according to a specification and writes to a single output

Slowly Changing Dimension automates the process of updating dimension tables, where the data
changes in time. It supports SCD type 1 and SCD type 2.

Sort sorts input data on specified key columns.

Transformer stage handles extracted data, performs data validation, conversions and lookups.

Change Capture - captures before and after state of two input data sets and outputs a single data set
whose records represent the changes made.

Change Apply - applies the change operations to a before data set to compute an after data set. It gets
data from a Change Capture stage

Difference stage performs a record-by-record comparison of two input data sets and outputs a single
data set whose records represent the difference between them. Similar to the Change Capture stage.

Checksum - generates checksum from the specified columns in a row and adds it to the stream. Used to
determine if there are differences between records.

Compare performs a column-by-column comparison of records in two presorted input data sets. It can
have two input links and one output link.

Encode encodes data with an encoding command, such as gzip.



Decode decodes a data set previously encoded with the Encode Stage.

External Filter permits specifying an operating system command that acts as a filter on the processed
data

Generic stage allows users to call an OSH operator from within DataStage stage with options as
required.

Pivot Enterprise is used for horizontal pivoting. It maps multiple columns in an input row to a single
column in multiple output rows. Pivoting data results in obtaining a dataset with fewer number of columns
but more rows.

Surrogate Key Generator generates surrogate key for a column and manages the key source.

Switch stage assigns each input row to an output link based on the value of a selector field. Provides a
similar concept to the switch statement in most programming languages.

Compress - packs a data set using a GZIP utility (or compress command on LINUX/UNIX)

Expand extracts a previously compressed data set back into raw binary data.

File stage types

Sequential file is used to read data from or write data to one or more flat (sequential) files.

Data Set stage allows users to read data from or write data to a dataset. Datasets are operating system
files, each of which has a control file (.ds extension by default) and one or more data files (unreadable by
other applications)

File Set stage allows users to read data from or write data to a fileset. Filesets are operating system files,
each of which has a control file (.fs extension) and data files. Unlike datasets, filesets preserve formatting
and are readable by other applications.

Complex flat file allows reading from complex file structures on a mainframe machine, such as MVS
data sets, header and trailer structured files, files that contain multiple record types, QSAM and VSAM
files.

External Source - permits reading data that is output from multiple source programs.

External Target - permits writing data to one or more programs.

Lookup File Set is similar to the File Set stage. It is a partitioned hashed file which can be used for lookups.

Database stages

Oracle Enterprise allows reading data from and writing data to an Oracle database (database version
from 9.x to 10g are supported).

ODBC Enterprise permits reading data from and writing data to a database defined as an ODBC source.
In most cases it is used for processing data from or to Microsoft Access databases and Microsoft Excel
spreadsheets.

DB2/UDB Enterprise permits reading data from and writing data to a DB2 database.

Teradata permits reading data from and writing data to a Teradata data warehouse. Three Teradata
stages are available: Teradata connector, Teradata Enterprise and Teradata Multiload

SQLServer Enterprise permits reading data from and writing data to Microsoft SQL Server 2005 and
2008 databases.

Sybase permits reading data from and writing data to Sybase databases.

Stored procedure stage supports Oracle, DB2, Sybase, Teradata and Microsoft SQL Server. The Stored
Procedure stage can be used as a source (returns a rowset), as a target (pass a row to a stored
procedure to write) or a transform (to invoke procedure processing within the database).

MS OLEDB helps retrieve information from any type of information repository, such as a relational source,
an ISAM file, a personal database, or a spreadsheet.

Dynamic Relational Stage (Dynamic DBMS, DRS stage) is used for reading from or writing to a
number of different supported relational DB engines using native interfaces, such as Oracle, Microsoft
SQL Server, DB2, Informix and Sybase.

Informix (CLI or Load)

DB2 UDB (API or Load)

Classic federation

RedBrick Load

Netezza Enterprise

iWay Enterprise

Real Time stages:

XML Input stage makes it possible to transform hierarchical XML data to flat relational data sets

XML Output writes tabular data (relational tables, sequential files or any datastage data streams) to XML
structures

XML Transformer converts XML documents using an XSLT stylesheet

Websphere MQ stages provide a collection of connectivity options to access IBM WebSphere MQ


enterprise messaging systems. There are two MQ stage types available in DataStage and QualityStage:
WebSphere MQ connector and WebSphere MQ plug-in stage.

Web services client

Web services transformer

Java client stage can be used as a source stage, as a target and as a lookup. The java package consists
of three public classes: com.ascentialsoftware.jds.Column, com.ascentialsoftware.jds.Row,
com.ascentialsoftware.jds.Stage

Java transformer stage supports three links: input, output and reject.

WISD Input - Information Services Input stage

WISD Output - Information Services Output stage

Restructure stages:

Column export stage exports data from a number of columns of different data types into a single column
of data type ustring, string, or binary. It can have one input link, one output link and a rejects link.

Column import complementary to the Column Export stage. Typically used to divide data arriving in a
single column into multiple columns.

Combine records stage combines rows which have identical keys, into vectors of subrecords.

Make subrecord combines specified input vectors into a vector of subrecords whose columns have the
same names and data types as the original vectors.

Make vector joins specified input columns into a vector of columns

Promote subrecord - promotes input subrecord columns to top-level columns

Split subrecord - separates an input subrecord field into a set of top-level vector columns

Split vector promotes the elements of a fixed-length vector to a set of top-level columns

Sequence activity stage types:

Job Activity specifies a Datastage server or parallel job to execute.

Notification Activity - used for sending emails to user defined recipients from within Datastage

Sequencer used for synchronization of a control flow of multiple activities in a job sequence.

Terminator Activity permits shutting down the whole sequence once a certain situation occurs.

Wait for file Activity - waits for a specific file to appear or disappear and launches the processing.

EndLoop Activity

Exception Handler

Execute Command

Nested Condition

Routine Activity

StartLoop Activity

UserVariables Activity

What is the difference between a primary key and a surrogate key?

A primary key is a special constraint on a column or set of columns. A primary key constraint ensures
that the column(s) so designated have no NULL values, and that every value is unique. Physically, a
primary key is implemented by the database system using a unique index, and all the columns in the
primary key must have been declared NOT NULL. A table may have only one primary key, but it may be
composite (consist of more than one column).

A surrogate key is any column or set of columns that can be declared as the primary key instead of a
"real" or natural key. Sometimes there can be several natural keys that could be declared as the primary
key, and these are all called candidate keys. So a surrogate is a candidate key. A table could actually
have more than one surrogate key, although this would be unusual. The most common type of surrogate
key is an incrementing integer, such as an auto increment column in MySQL, or a sequence in Oracle, or
an identity column in SQL Server.

Ans2): A surrogate key is an artificial identifier for an entity; surrogate key values are generated by the system
sequentially. A primary key is a natural identifier for an entity; primary key values are entered
manually and uniquely identify each row, so there is no replication of data.
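A small DDL sketch of the difference (SQL Server-style IDENTITY syntax; the table itself is hypothetical), showing a surrogate primary key alongside the natural candidate key:

CREATE TABLE supplier_dim
(
    supplier_key   INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
    supplier_code  VARCHAR(10) NOT NULL UNIQUE,     -- natural / candidate key
    supplier_name  VARCHAR(100),
    supplier_state CHAR(2)
);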


What is a Schema?

Graphical Representation of the data structure.

First Phase in implementation of Universe

What is a star schema?

Star schema is a data warehouse schema where there is only one “fact table" and many de-normalized
dimension tables.

Fact table contains primary keys from all the dimension tables and other numeric columns of additive,
numeric facts.

What is a snowflake schema?

Unlike the star schema, the snowflake schema contains normalized dimension tables in a tree-like structure
with many nesting levels.

Snowflake schema is easier to maintain but queries require more joins.

What is the difference between snow flake and star schema?

Star Schema vs. Snow Flake Schema:

1) Star: The star schema is the simplest data warehouse schema.
   Snowflake: The snowflake schema is a more complex data warehouse model than a star schema.

2) Star: Each of the dimensions is represented in a single table; there should not be any hierarchies between dimension tables.
   Snowflake: At least one hierarchy should exist between dimension tables.

3) Star: It contains a fact table surrounded by dimension tables. If the dimensions are de-normalized, we say it is a star schema design.
   Snowflake: It contains a fact table surrounded by dimension tables. If a dimension is normalized, we say it is a snowflaked design.

4) Star: Only one join establishes the relationship between the fact table and any one of the dimension tables.
   Snowflake: Since there are relationships between the dimension tables, many joins are needed to fetch the data.

5) Star: A star schema optimizes performance by keeping queries simple and providing fast response time. All the information about each level is stored in one row.
   Snowflake: Snowflake schemas normalize dimensions to eliminate redundancy. The result is more complex queries and reduced query performance.

6) Star: It is called a star schema because the diagram resembles a star.
   Snowflake: It is called a snowflake schema because the diagram resembles a snowflake.
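To make the join difference concrete (the tables and columns below are hypothetical): a star schema query reaches each dimension with a single join, while a snowflaked dimension needs an extra join for each nesting level:

-- Star schema: one join per dimension
SELECT d.product_name, SUM(f.sales_amount)
FROM   sales_fact f
JOIN   product_dim d ON d.product_key = f.product_key
GROUP BY d.product_name;

-- Snowflake schema: the product dimension is normalized into a hierarchy
SELECT c.category_name, SUM(f.sales_amount)
FROM   sales_fact f
JOIN   product_dim d      ON d.product_key = f.product_key
JOIN   product_category c ON c.category_key = d.category_key
GROUP BY c.category_name;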

Datastage – Slowly Changing Dimensions



Basics of SCD

Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly, rather than
changing on a time-based, regular schedule.

Type 1
The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at
all.

Here is an example of a database table that keeps supplier information:

Supplier_Key Supplier_Code Supplier_Name Supplier_State

123 ABC Acme Supply Co CA

In this example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the
surrogate key is not necessary, since the table will be unique by the natural key (Supplier_Code).
However, the joins will perform better on an integer than on a character string.

Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply
overwrite this record:

Supplier_Key Supplier_Code Supplier_Name Supplier_State

123 ABC Acme Supply Co IL

Type 2
The Type 2 method tracks historical data by creating multiple records for a given natural key in the
dimensional tables with separate surrogate keys and/or different version numbers. With Type 2, we have
unlimited history preservation as a new record is inserted each time a change is made.

In the same example, if the supplier moves to Illinois, the table could look like this, with incremented
version numbers to indicate the sequence of changes:

Supplier_Key Supplier_Code Supplier_Name Supplier_State Version

123 ABC Acme Supply Co CA 0

124 ABC Acme Supply Co IL 1

Another popular method for tuple versioning is to add effective date columns.

Supplier_Key Supplier_Code Supplier_Name Supplier_State Start_Date End_Date



123 ABC Acme Supply Co CA 01-Jan-2000 21-Dec-2004

124 ABC Acme Supply Co IL 22-Dec-2004

The null End_Date in row two indicates the current tuple version. In some cases, a standardized
surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an
index, and so that null-value substitution is not required when querying.
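Using the supplier example above, a hedged SQL sketch of the two approaches (the column names follow the tables above; the key value and date literals are purely illustrative):

-- Type 1: overwrite in place, no history kept
UPDATE supplier_dim
SET    supplier_state = 'IL'
WHERE  supplier_code  = 'ABC';

-- Type 2: expire the current row, then insert a new version
UPDATE supplier_dim
SET    end_date = '2004-12-21'
WHERE  supplier_code = 'ABC'
  AND  end_date IS NULL;

INSERT INTO supplier_dim (supplier_key, supplier_code, supplier_name,
                          supplier_state, start_date, end_date)
VALUES (124, 'ABC', 'Acme Supply Co', 'IL', '2004-12-22', NULL);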

How to Implement SCD using DataStage 8.1 –SCD stage?

Step 1: Create a datastage job with the below structure-


1. Source file that comes from the OLTP sources
2. Old dimension reference table link
3. The SCD stage
4. Target fact table
5. Dimension update/insert link

Figure 1
Step 2: To set up the SCD properties in the SCD stage, open the stage and access the Fast Path.

Figure 2
Step 3: Tab 2 of the SCD stage is used to specify the purpose of each of the keys pulled from the
referenced dimension table.

Figure 3

Step 4: Tab 3 is used to provide the sequence generator file/table name, which is used to generate the new
surrogate keys for the new or latest dimension records. These keys also get passed to the fact
table for direct load.

Figure 4
Step 5: Tab 4 is used to set the properties for configuring the data population logic for the new and
old dimension rows. The types of activities that we can configure as part of this tab are:
1. Generating the new surrogate key values to be passed to the dimension and fact table
2. Mapping the source columns to the dimension columns
3. Setting up the expiry values for the old rows
4. Defining the values that mark the current active rows out of multiple versions of a row

Figure 5
Step 6: Set the derivation logic for the fact as a part of the last tab.

Figure 6
Step 7: Complete the remaining setup and run the job.

Figure 7
How to perform incremental load in DataStage?

Ans-1) ->Daily loading is known as incremental load.

-When data is selected from the source, the selected records are those loaded or updated between the timestamp of the last load and the
current time.

-The parameters that are passed are the last load date and the current date.

-The first parameter, the stored last-run date, is read through job parameters.

-The second parameter is the current date.

Ans-2) -> Incremental load means daily load. Whenever you are selecting data from the source, select the
records which were loaded or updated between the timestamp of the last successful load and today's load start
date and time. For this you have to pass parameters for those two dates: store the last run date and time
in a file and read it through a job parameter, and set the second argument to the current date and
time.
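A minimal extraction query for this pattern might look like the following, where the two job parameters described above are substituted into the WHERE clause; the table, column and parameter names (#LastLoadDate#, #CurrentLoadDate#) are hypothetical:

SELECT *
FROM   source_orders
WHERE  last_update_ts >  '#LastLoadDate#'
  AND  last_update_ts <= '#CurrentLoadDate#';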

How do you generate Sequence number in Datastage?

Sequence numbers can be generated in Datastage using certain routines. They are

-KeyMgtGetNextVal
-KeyMgtGetNextValConn

What are Routines and where/how are they written and have you written any routines before?

Ans: Routines are stored in the Routines branch of the DataStage Repository, where you can create, view
or edit. The following are different types of routines:

1) Transform functions

2) Before-after job subroutines

3) Job Control routines

In how many places can you call routines?

Ans: Four Places you can call

(i) Transform of routine

(A) Date Transformation

(B) Upstring Transformation

(ii) Transform of the Before & After Subroutines

(iii) XML transformation

(iv) Web base

What is Environment Variables?

Basically, an environment variable is a predefined variable that we can use while creating a DS job.
We create/declare these variables in the DS Administrator, and while designing the job we set the properties for
these variables. Environment variables are also called global variables.

There are two types of variables:

1. Local variables
2. Environment variables / global variables

Local variables: valid only for a particular job.


Environment variables: available in any job throughout your project. Some default variables are provided, and
we can also define user-defined variables.
To create project-specific environment variables: start DataStage Administrator, choose
the project and click the "Properties" button, on the General tab click the "Environment..." button, then click
on the "User Defined" folder to see the list of project-specific environment variables.

An example of an environment variable, to make this clearer:

For example,
to connect to a database you need a user id, password and schema.

These are constant throughout the project, so they will be created as environment variables.

Use them wherever you want with #Variable#.

By doing this, if there is any change in the password or schema there is no need to touch all the jobs; change it
at the level of the environment variable and that will take care of all the jobs.

Explain Job parameters?

There is an icon to go to Job Parameters in the tool bar, or you can press Ctrl+J to open the Job
Parameters dialog box. Once there, give a parameter name and a corresponding default value for it.
This lets you enter the value when you run the job; it is not necessary to open the job to change the
parameter value. Also, when the job runs through a script it is enough to give the parameter value on the
command line of the script; otherwise you would have to change the value in the job, compile and then run the script.
So it is easy for users to handle jobs using parameters.


Differentiate Primary Key and Partition Key?

Primary Key is a combination of unique and not null. It can be a collection of key values called as
composite primary key.

Partition Key is a just a part of Primary Key. There are several methods of partition like Hash, DB2, and
Random etc. While using Hash partition we specify the Partition Key.

Difference between Sequential File and Data Set – DataStage

Sequential File :

-The extraction and loading of a sequential file is limited to 2GB

-Converted into native format from ASCII, if utilized as source while compiling

-The processing is sequential

-Processing is done at the server



Data Set :

-It is an intermediate stage

-Compile time conversion is not needed.

-Supports only .ds extension

-Processing is done in local system


Tell me one situation from your last project where you faced a problem, and how did you solve
it?

Ans: The jobs in which data is read directly from OCI stages are running extremely slow. I had to stage
the data before sending to the transformer to make the jobs run faster

Using the Datastage Message handlers

Datastage has a feature that allows users to suppress or demote a warning from the Datastage
log file. Although it is something I have rarely seen used, it is still worth knowing.
Datastage lets users do this through Message handlers, which can be defined either at
a job level or at a project level. Let's have a look at how to set up message handlers in Datastage.

In this example design we are reading from a sequential file, sorting the data and then writing
back to a sequential file. Since the sequential file stage does not run in parallel, we will get the
warning shown below in the Datastage logs.

Now, if we need to suppress this particular warning from the job logs we should do the following.

- Right click the warning and select the ‘Add rule to message handler’ option.

- This will give you a new window. In that window, select the ‘Add rule to message handler’ option and
then click Add Rule.

- You then enter the name of the new message handler you are creating and click OK.

Your message handler has been created. The next time you run your job the warning won’t be present in
your logs. The list of messages handled will be present in your job log; it will be the last entry before the
‘Control’ entry.

The message handler you have just created is a local message handler that is applied only to this
job. However, you can apply this message handler to all the jobs in your project by setting it in the
Administrator. The option is present in the ‘Parallel’ tab of the respective project.

The message handler I selected was the one I had just created.

Moving Message Handlers


Most of the time, I hear people asking whether you can export your message handlers the same way you
export your Datastage jobs. As far as I know, it's not possible. You will have to move your message
handler files manually from one environment to the other. You can find these files at
D:\IBM\InformationServer\Server\MsgHandlers on Windows, and in the corresponding folder on UNIX. Copying
these files to the environment of your choice is how you transfer message handlers between
environments.

SQL
SQL Statements

Most of the actions you need to perform on a database are done with SQL statements.

The following SQL statement selects all the records in the "Customers" table:

Example

SELECT * FROM Customers;

The SQL SELECT Statement

The SELECT statement is used to select data from a database.

The result is stored in a result table, called the result-set.



SQL SELECT Syntax

SELECT column_name,column_name
FROM table_name;

and

SELECT * FROM table_name;

SELECT Column Example

The following SQL statement selects the "CustomerName" and "City" columns from the "Customers"
table:

Example

SELECT CustomerName,City FROM Customers;

SELECT * Example

The following SQL statement selects all the columns from the "Customers" table:

Example

SELECT * FROM Customers;

The SQL CREATE TABLE Statement

The CREATE TABLE statement is used to create a table in a database.

Tables are organized into rows and columns; and each table must have a name.

SQL CREATE TABLE Syntax

CREATE TABLE table_name


(
column_name1 data_type(size),
column_name2 data_type(size),
column_name3 data_type(size),
....
);

The column_name parameters specify the names of the columns of the table.

The data_type parameter specifies what type of data the column can hold (e.g. varchar, integer, decimal,
date, etc.).

The size parameter specifies the maximum length of the column of the table.

SQL CREATE TABLE Example

Now we want to create a table called "Persons" that contains five columns: PersonID, LastName,
FirstName, Address, and City.

We use the following CREATE TABLE statement:

Example

CREATE TABLE Persons


(
PersonID int,
LastName varchar(255),
FirstName varchar(255),
Address varchar(255),
City varchar(255)
);

The SQL SELECT DISTINCT Statement

In a table, a column may contain many duplicate values; and sometimes you only want to list the different
(distinct) values.

The DISTINCT keyword can be used to return only distinct (different) values.

SQL SELECT DISTINCT Syntax

SELECT DISTINCT column_name,column_name


FROM table_name;

SELECT DISTINCT Example

The following SQL statement selects only the distinct values from the "City" column of the
"Customers" table:

Example

SELECT DISTINCT City FROM Customers;

The SQL WHERE Clause

The WHERE clause is used to extract only those records that fulfill a specified criterion.

SQL WHERE Syntax

SELECT column_name,column_name
FROM table_name
WHERE column_name operator value;
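
For example, assuming the same sample "Customers" table used elsewhere in this document, a comparison operator can be used in the condition:

SELECT CustomerName, City FROM Customers
WHERE CustomerID > 50;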

The SQL INSERT INTO Statement

The INSERT INTO statement is used to insert new records in a table.

SQL INSERT INTO Syntax

It is possible to write the INSERT INTO statement in two forms.

The first form does not specify the column names where the data will be inserted, only their values:

INSERT INTO table_name


VALUES (value1,value2,value3,...);

The second form specifies both the column names and the values to be inserted:

INSERT INTO table_name (column1,column2,column3,...)


VALUES (value1,value2,value3,...);

INSERT INTO Example

Assume we wish to insert a new row in the "Customers" table.

We can use the following SQL statement:

Example

INSERT INTO Customers (CustomerName, ContactName, Address, City, PostalCode, Country)


VALUES ('Cardinal','Tom B. Erichsen','Skagen 21','Stavanger','4006','Norway');

Insert Data Only in Specified Columns

It is also possible to only insert data in specific columns.

The following SQL statement will insert a new row, but only insert data in the "CustomerName", "City",
and "Country" columns (and the CustomerID field will of course also be updated automatically):

Example

INSERT INTO Customers (CustomerName, City, Country)


VALUES ('Cardinal', 'Stavanger', 'Norway');

The SQL UPDATE Statement

The UPDATE statement is used to update existing records in a table.

SQL UPDATE Syntax

UPDATE table_name
SET column1=value1,column2=value2,...
WHERE some_column=some_value;

SQL UPDATE Example

Assume we wish to update the customer "Alfreds Futterkiste" with a new contact person and city.

We use the following SQL statement:

Example

UPDATE Customers
SET ContactName='Alfred Schmidt', City='Hamburg'
WHERE CustomerName='Alfreds Futterkiste';

Update Warning!

Be careful when updating records. If we had omitted the WHERE clause in the example above, like this:

UPDATE Customers
SET ContactName='Alfred Schmidt', City='Hamburg';

then every record in the "Customers" table would have been updated with the new contact name and city.

The SQL DELETE Statement

The DELETE statement is used to delete rows in a table.

SQL DELETE Syntax

DELETE FROM table_name


WHERE some_column=some_value;

SQL DELETE Example

Assume we wish to delete the customer "Alfreds Futterkiste" from the "Customers" table.

We use the following SQL statement:

Example

DELETE FROM Customers


WHERE CustomerName='Alfreds Futterkiste' AND ContactName='Maria Anders';

Delete All Data

It is possible to delete all rows in a table without deleting the table. This means that the table structure,
attributes, and indexes will be intact:

DELETE FROM table_name;

or

DELETE * FROM table_name;



Note: Be very careful when deleting records. You cannot undo this statement!

The ALTER TABLE Statement

The ALTER TABLE statement is used to add, delete, or modify columns in an existing table.

SQL ALTER TABLE Syntax

To add a column in a table, use the following syntax:

ALTER TABLE table_name


ADD column_name datatype

To delete a column in a table, use the following syntax (notice that some database systems don't allow
deleting a column):

ALTER TABLE table_name


DROP COLUMN column_name

To change the data type of a column in a table, use the following syntax:

SQL Server / MS Access:

ALTER TABLE table_name


ALTER COLUMN column_name datatype

My SQL / Oracle:

ALTER TABLE table_name


MODIFY COLUMN column_name datatype

Oracle 10G and later:

ALTER TABLE table_name


MODIFY column_name datatype

SQL ALTER TABLE Example

Look at the "Persons" table:

P_Id LastName FirstName Address City

1 Hansen Ola Timoteivn 10 Sandnes

2 Svendson Tove Borgvn 23 Sandnes



3 Pettersen Kari Storgt 20 Stavanger

Now we want to add a column named "DateOfBirth" in the "Persons" table.

We use the following SQL statement:

ALTER TABLE Persons


ADD DateOfBirth date

Notice that the new column, "DateOfBirth", is of type date and is going to hold a date. The data type
specifies what type of data the column can hold. For a complete reference of all the data types available
in MS Access, MySQL, and SQL Server, go to our complete Data Types reference.

The "Persons" table will now like this:

P_Id LastName FirstName Address City DateOfBirth

1 Hansen Ola Timoteivn 10 Sandnes

2 Svendson Tove Borgvn 23 Sandnes

3 Pettersen Kari Storgt 20 Stavanger

Change Data Type Example

Now we want to change the data type of the column named "DateOfBirth" in the "Persons" table.

We use the following SQL statement:

ALTER TABLE Persons


ALTER COLUMN DateOfBirth year

Notice that the "DateOfBirth" column is now of type year and is going to hold a year in a two-digit or four-
digit format.

DROP COLUMN Example

Next, we want to delete the column named "DateOfBirth" in the "Persons" table.

We use the following SQL statement:



ALTER TABLE Persons


DROP COLUMN DateOfBirth

SQL DROP INDEX, DROP TABLE, and DROP DATABASE


Indexes, tables, and databases can easily be deleted/removed with the DROP statement.

The DROP INDEX Statement

The DROP INDEX statement is used to delete an index in a table.

DROP INDEX Syntax for MS Access:

DROP INDEX index_name ON table_name

DROP INDEX Syntax for MS SQL Server:

DROP INDEX table_name.index_name

DROP INDEX Syntax for DB2/Oracle:

DROP INDEX index_name

DROP INDEX Syntax for MySQL:

ALTER TABLE table_name DROP INDEX index_name

The DROP TABLE Statement

The DROP TABLE statement is used to delete a table.

DROP TABLE table_name

The DROP DATABASE Statement

The DROP DATABASE statement is used to delete a database.

DROP DATABASE database_name

The TRUNCATE TABLE Statement

What if we only want to delete the data inside the table, and not the table itself?

Then, use the TRUNCATE TABLE statement:

TRUNCATE TABLE table_name

WHERE Clause Example

The following SQL statement selects all the customers from the country "Mexico" in the "Customers"
table:

Example

SELECT * FROM Customers


WHERE Country='Mexico';

Difference between DELETE, TRUNCATE and DROP

DELETE
1. DELETE is a DML Command.
2. DELETE statement is executed using a row lock, each row in the table is locked for deletion.
3. We can specify filters in where clause
4. It deletes specified data if where condition exists.
5. DELETE activates triggers because the operations are logged individually.
6. Slower than TRUNCATE because it keeps logs.
7. Rollback is possible.

TRUNCATE
1. TRUNCATE is a DDL command.
2. TRUNCATE TABLE always locks the table and page but not each row.
3. Cannot use Where Condition.
4. It Removes all the data.
5. TRUNCATE TABLE cannot activate a trigger because the operation does not log individual row
deletions.
6. Faster in performance, because it does not keep individual row logs.
7. Rollback is not possible.
DELETE and TRUNCATE can both be rolled back when used within a transaction.

Once the transaction is complete (i.e. COMMITTED), we cannot roll back a TRUNCATE command, but we can
still roll back a DELETE command from the log files, because DELETE writes each deleted record to the log
file in case a rollback is needed later.
DROP

The DROP command removes a table from the database. All the tables' rows, indexes and privileges will
also be removed. The operation cannot be rolled back.
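
To make the contrast concrete, here is a rough illustration on a hypothetical orders table (the table and column names are made up):

DELETE FROM orders WHERE order_date < '2010-01-01';  -- removes only the matching rows, can be rolled back
TRUNCATE TABLE orders;                               -- removes all rows but keeps the table structure
DROP TABLE orders;                                   -- removes the table itself, including its structure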

Difference between Unique key and Primary key

A UNIQUE constraint and a PRIMARY KEY are similar: both enforce uniqueness of the column(s) on which
they are defined.
Some basic differences between a Primary Key and a Unique key are as follows.

Primary key

1. Primary key cannot have a NULL value.


2. Each table can have only single primary key.

3. Primary key is implemented as indexes on the table. By default this index is clustered
index.
4. A primary key can be referenced by another table as a Foreign Key.
5. We can generate ID automatically with the help of Auto Increment field. Primary key
supports Auto Increment value.
Unique Constraint

1. Unique Constraint may have a NULL value.


2. Each table can have more than one Unique Constraint.
3. Unique Constraint is also implemented as indexes on the table. By default this index is
Non-clustered index.
4. A Unique Constraint cannot be related to another table as a Foreign Key.
5. Unique Constraint doesn't support Auto Increment value.
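
Both constraints can be seen side by side in a single CREATE TABLE statement; the table below is only an illustration:

CREATE TABLE Employees
(
EmpID int PRIMARY KEY,       -- not null, only one primary key per table
Email varchar(255) UNIQUE,   -- may be null, several unique constraints are allowed
EmpName varchar(255)
);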

NORMALIZATION:
Some Oracle databases were modeled according to the rules of normalization that were intended to
eliminate redundancy.

To apply the rules of normalization you need to understand your relationships and functional
dependencies.

First Normal Form:


A row is in first normal form (1NF) if all underlying domains contain atomic values only.

 Eliminate duplicative columns from the same table.


 Create separate tables for each group of related data and identify each row with a unique column
or set of columns (the primary key).

Second Normal Form:


An entity is in Second Normal Form (2NF) when it meets the requirement of being in First Normal Form
(1NF) and additionally:

 Does not have a composite primary key. Meaning that the primary key can not be subdivided into
separate logical entities.
 All the non-key columns are functionally dependent on the entire primary key.

 A row is in second normal form if, and only if, it is in first normal form and every non-key attribute
is fully dependent on the key.

 2NF eliminates functional dependencies on a partial key by putting the fields in a separate table
from those that are dependent on the whole key. An example is resolving many-to-many
relationships using an intersecting entity.

Third Normal Form:


An entity is in Third Normal Form (3NF) when it meets the requirement of being in Second Normal Form
(2NF) and additionally:

 Functional dependencies on non-key fields are eliminated by putting them in a separate table. At
this level, all non-key fields are dependent on the primary key.
 A row is in third normal form if and only if it is in second normal form and attributes that do not
contribute to a description of the primary key are moved into a separate table. An example is
creating look-up tables.

Boyce-Codd Normal Form:


Boyce Codd Normal Form (BCNF) is a further refinement of 3NF. In his later writings Codd refers to
BCNF as 3NF. A row is in Boyce Codd normal form if, and only if, every determinant is a candidate key.
Most entities in 3NF are already in BCNF.

Fourth Normal Form:


An entity is in Fourth Normal Form (4NF) when it meets the requirement of being in Third Normal Form
(3NF) and additionally:

Have no multiple sets of multi-valued dependencies. In other words, 4NF states that no entity can have
more than a single one-to-many relationship.

ORACLE SET OF STATEMENTS:

Data Definition Language :(DDL)


Create

Alter

Drop

Truncate

Data Manipulation Language (DML)


Insert

Update

Delete

Data Querying Language (DQL)


Select

Data Control Language (DCL)


Grant

Revoke

Transactional Control Language (TCL)


Commit

Rollback

Save point
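
A short sketch of the transactional control statements in use (emp is the sample table used throughout this document; the figures are arbitrary):

UPDATE emp SET sal = sal * 1.1 WHERE deptno = 10;
SAVEPOINT before_dept20;
UPDATE emp SET sal = sal * 1.2 WHERE deptno = 20;
ROLLBACK TO before_dept20;   -- undoes only the second update
COMMIT;                      -- makes the remaining change permanent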

What is the difference between view and materialized view?

View:

- A view has a logical existence; it does not contain data.
- It is not a database object.
- We cannot perform DML operations on a view.
- When we do select * from a view, it fetches the data from the base table.
- A view cannot be scheduled to refresh.

Materialized view:

- A materialized view has a physical existence.
- It is a database object.
- We can perform DML operations on a materialized view.
- When we do select * from a materialized view, it fetches the data from the materialized view itself.
- A materialized view can be scheduled to refresh.
- We can keep aggregated data in a materialized view, and a materialized view can be created based on
multiple tables.
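
For comparison, the two objects are created roughly as follows (Oracle syntax; the object names are illustrative only):

CREATE VIEW dept_sal_vw AS
SELECT deptno, SUM(sal) total_sal FROM emp GROUP BY deptno;

CREATE MATERIALIZED VIEW dept_sal_mv
REFRESH COMPLETE ON DEMAND AS
SELECT deptno, SUM(sal) total_sal FROM emp GROUP BY deptno;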

Difference between Rowid and Rownum?

ROWID

A globally unique identifier for a row in a database. It is created at the time the row is inserted into a table,
and destroyed when it is removed from a table.'BBBBBBBB.RRRR.FFFF' where BBBBBBBB is the block
number, RRRR is the slot(row) number, and FFFF is a file number.

ROWNUM

For each row returned by a query, the ROWNUM pseudo column returns a number indicating the order in
which Oracle selects the row from a table or set of joined rows. The first row selected has a ROWNUM of
1. The second has 2, and so on.

You can use ROWNUM to limit the number of rows returned by a query, as in this example:

SELECT * FROM employees WHERE ROWNUM < 10;

Rowid:

- Rowid is an Oracle internal id that is allocated every time a new record is inserted in a table. This id is
unique and cannot be changed by the user.
- Rowid is permanent.
- Rowid is a globally unique identifier for a row in a database. It is created at the time the row is inserted
into the table, and destroyed when it is removed from the table.

Rownum:

- Rownum is a row number returned by a select statement.
- Rownum is temporary.
- The rownum pseudocolumn returns a number indicating the order in which Oracle selects the row from a
table or set of joined rows.

Order of where and having:

SELECT column, group_function

FROM table

[WHERE condition]

[GROUP BY group_by_expression]

[HAVING group_condition]

[ORDER BY column];

The WHERE clause cannot be used to restrict groups; you use the

HAVING clause to restrict groups.

Differences between where clause and having clause

Both the WHERE clause and the HAVING clause can be used to filter data.

Where clause:

- It can be used without the GROUP BY clause.
- The WHERE clause selects rows before grouping; it applies to individual rows.
- It is used to restrict rows.
- It restricts a normal query.
- Every record is filtered based on the WHERE condition.
- The WHERE clause cannot contain aggregate functions.

Having clause:

- The HAVING clause cannot be used without the GROUP BY clause.
- The HAVING clause selects rows after grouping; it tests a condition on the group rather than on
individual rows.
- It is used to restrict groups.
- It restricts GROUP BY functions.
- It works with aggregated records (group by functions).
- The HAVING clause can contain aggregate functions.
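
Both clauses can appear in the same statement. For example, on the emp table used elsewhere in this document:

SELECT deptno, AVG(sal)
FROM emp
WHERE job <> 'CLERK'       -- filters individual rows before grouping
GROUP BY deptno
HAVING AVG(sal) > 2000;    -- filters the groups after aggregation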

What is the difference between sub-query & co-related sub query?



A sub query is executed once for the parent statement, whereas a correlated sub query is executed once
for each row of the parent query.

Sub Query:

Example:

Select deptno, ename, sal from emp a where sal in (select sal from Grade where sal_grade='A' or
sal_grade='B');

Co-related Sub query:

Example:

Find all employees who earn more than the average salary in their department.

SELECT last_name, salary, department_id FROM employees A

WHERE salary > (SELECT AVG(salary)

FROM employees B WHERE B.department_id = A.department_id

GROUP BY B.department_id);

EXISTS:

The EXISTS operator tests for existence of rows in

the results set of the subquery.

Select dname from dept where exists


(select 1 from EMP
where dept.deptno= emp.deptno);

Sub-query:

A sub-query is executed once for the parent query.

Example:

Select * from emp where deptno in (select deptno from dept);

Co-related sub-query:

A co-related sub-query is executed once for each row of the parent query.

Example:

Select e.* from emp e where sal >= (select avg(sal) from emp a where a.deptno = e.deptno
group by a.deptno);

IMPORTANT QUERIES
Get duplicate rows from the table:

Select empno, count (*) from EMP group by empno having count (*)>1;

Remove duplicates in the table:

Delete from EMP where rowid not in (select max (rowid) from EMP group by empno);

The query below transposes columns into rows.

Name No Add1 Add2

Abc 100 hyd bang

Xyz 200 Mysore pune

Select name, no, add1 from A

UNION

Select name, no, add2 from A;

The query below transposes rows into columns.

select

emp_id,

max(decode(row_id,0,address))as address1,

max(decode(row_id,1,address)) as address2,

max(decode(row_id,2,address)) as address3

from (select emp_id,address,mod(rownum,3) row_id from temp order by emp_id )

group by emp_id

Other query:

select

emp_id,

max(decode(rank_id,1,address)) as add1,

max(decode(rank_id,2,address)) as add2,

max(decode(rank_id,3,address))as add3

from

(select emp_id,address,rank() over (partition by emp_id order by emp_id,address )rank_id from temp )

group by

emp_id

Rank query:

Select empno, ename, sal, r from (select empno, ename, sal, rank () over (order by sal desc) r from
EMP);

Dense rank query:

The DENSE_RANK function acts like the RANK function except that it assigns consecutive ranks:

Select empno, ename, sal, r from (select empno, ename, sal, dense_rank() over (order by sal desc) r from
emp);

Top 5 salaries by using rank:

Select empno, ename, sal,r from (select empno,ename,sal,dense_rank() over (order by sal desc) r from
emp) where r<=5;

Or

Select * from (select * from EMP order by sal desc) where rownum<=5;

2nd highest Sal:

Select empno, ename, sal, r from (select empno, ename, sal, dense_rank () over (order by sal desc) r
from EMP) where r=2;

Top sal:

Select * from EMP where sal= (select max (sal) from EMP);

How to display alternate rows in a table?

SQL> select *from emp where (rowid, 0) in (select rowid,mod(rownum,2) from emp);

Union vs. Union All Query Syntax

The purpose of the SQL UNION and UNION ALL commands are to combine the results of two or more
queries into a single result set consisting of all the rows belonging to all the queries in the union. The
question becomes whether or not to use the ALL syntax.

The main difference between UNION ALL and UNION is that UNION only selects distinct values,
while UNION ALL selects all values (including duplicates).

The syntax for UNION {ALL} is as follows:

[SQL Statement 1]
UNION {ALL}
[SQL Statement 2]
[GROUP BY ...]

Sample Data

Use Authors table in SQL Server Pubs database or just use a simple table with these values
(obviously simplified to just illustrate the point):

City State Zip

Nashville TN 37215

Lawrence KS 66044

Corvallis OR 97330

UNION ALL Example

This SQL statement combines two queries to retrieve records based on states. The two queries happen to
both get records from Tennessee ('TN'):

SELECT City, State, Zip FROM Authors WHERE State IN ('KS', 'TN')
UNION ALL
SELECT City, State, Zip FROM Authors WHERE State IN ('OR', 'TN')

Result of UNION ALL syntax:

City State Zip

Nashville TN 37215

Lawrence KS 66044

Nashville TN 37215

Corvallis OR 97330

Notice how this displays the two query results in the order they appear from the queries. The first two
records come from the first SELECT statement, and the last two records from the second SELECT
statement. The TN record appears twice, since both SELECT statements retrieve TN records.

Union Query SQL Example

Using the same SQL statements and combining them with a UNION command:

SELECT City, State, Zip FROM Authors WHERE State IN ('KS', 'TN')
UNION
SELECT City, State, Zip FROM Authors WHERE State IN ('OR', 'TN')

Result of UNION Query



City State Zip

Corvallis OR 97330

Lawrence KS 66044

Nashville TN 37215

Notice how the TN record only appears once, even though both SELECT statements retrieve TN records.
The UNION syntax automatically eliminates the duplicate records between the two SQL statements and
sorts the results. In this example the Corvallis record appears first but is from the second SELECT
statement.

An ORDER BY clause can be added at the end to sort the list.

Difference between RANK() and DENSE RANK()

What's the difference between RANK() and DENSE_RANK() functions?

Answer: Oracle introduced these two analytic functions to rank the records of a table based
on column(s). The syntax for using them in SQL queries is 'RANK()/DENSE_RANK() OVER
(ORDER BY column(s))'. Here the ORDER BY clause decides which columns (and in what order) are used to
rank the records.

The default sorting order is 'Ascending Order' and if we want we may specify 'DESC' to have the
Descending Sort Order. The ranks start from 1 and not from 0.

The RANK() returns the position of a value within the partition of a result set, with gaps in the ranking
where there are ties.

The DENSE_RANK() returns the position of a value within the partition of a result set, with no gaps in the
ranking where there are ties.

RANK

Let's assume we want to assign a sequential order, or rank, to people within a department based on
salary, we might use the RANK function like.

SELECT empno,
deptno,
sal,
RANK() OVER (PARTITION BY deptno ORDER BY sal) "rank"
FROM emp;

EMPNO DEPTNO SAL rank


---------- ---------- ---------- ----------
7934 10 1300 1
7782 10 2450 2
7839 10 5000 3
7369 20 800 1

7876 20 1100 2
7566 20 2975 3
7788 20 3000 4
7902 20 3000 4
7900 30 950 1
7654 30 1250 2
7521 30 1250 2
7844 30 1500 4
7499 30 1600 5
7698 30 2850 6

SQL>

What we see here is that where two people have the same salary they are assigned the same rank, and when
multiple rows share the same rank the next rank in the sequence is not consecutive.

DENSE_RANK

The DENSE_RANK function acts like the RANK function except that it assigns consecutive ranks.

SELECT empno,
deptno,
sal,
DENSE_RANK() OVER (PARTITION BY deptno ORDER BY sal) "rank"
FROM emp;

EMPNO DEPTNO SAL rank


---------- ---------- ---------- ----------
7934 10 1300 1
7782 10 2450 2
7839 10 5000 3
7369 20 800 1
7876 20 1100 2
7566 20 2975 3
7788 20 3000 4
7902 20 3000 4
7900 30 950 1
7654 30 1250 2
7521 30 1250 2
7844 30 1500 3
7499 30 1600 4
7698 30 2850 5

DECODE is a function in Oracle and is used to provide if-then-else type of logic to SQL. It is not available
in MySQL or SQL Server. The syntax for DECODE is:

SELECT DECODE ( "column_name", "search_value_1", "result_1",


["search_value_n", "result_n"],
{"default_result"} );

"search_value" is the value to search for, and "result" is the value that is displayed.

For example, assume we have the following Store_Information table,

Table Store_Information

Store_Name Sales Txn_Date


Los Angeles 1500 Jan-05-1999
San Diego 250 Jan-07-1999

San Francisco 300 Jan-08-1999


Boston 700 Jan-08-1999

if we want to display 'LA' for 'Los Angeles', 'SF' for 'San Francisco', 'SD' for 'San Diego', and 'Others' for all
other cities, we would issue the following SQL,

SELECT DECODE (Store_Name,


'Los Angeles', 'LA',
'San Francisco', 'SF',
'San Diego', 'SD',
'Others') Area, Sales, Txn_Date
FROM Store_Information;

"Area" is the name given to the column with the DECODE statement.

Result:

Area Sales Txn_Date


LA 1500 Jan-05-1999
SD 250 Jan-07-1999
SF 300 Jan-08-1999
Others 700 Jan-08-1999

CASE AND DECODE : Two powerful constructs of SQL

CASE and DECODE are two widely used constructs in SQL. Both provide the functionality of an
IF-THEN-ELSE statement, returning a specified value when some criteria are met. Even though they are
used interchangeably, there are some differences between them.

This section lists the advantages of CASE over DECODE and also explains how to convert
DECODE to CASE and vice versa.

CASE was introduced in Oracle 8.1.6 as a replacement for DECODE. It is generally the better option
because it is:

1. More flexible than DECODE

2. Easier to read

3. ANSI compatible

4. Usable in a PL/SQL context

SIMPLE CASE

Generally CASE has two syntaxes as below



a. Expression Syntax

Code :

CASE [ expression ]
WHEN Value_1 THEN result_1
WHEN Value_2 THEN result_2
...
WHEN Value_n THEN result_n
[ELSE else_result]
END

Here CASE checks the value of the expression and returns the corresponding result for each record.
Here is one such example that lists the new salaries for all employees:

Code :

SQL> SELECT EMPNO,JOB , SAL ,


2 CASE JOB WHEN 'ANALYST' THEN SAL*1.2
3 WHEN 'MANAGER' THEN SAL*1.4
4 ELSE SAL END NEWSAL
5 FROM EMP;
EMPNO JOB SAL NEWSAL
---------- --------- ---------- ----------
7369 CLERK 800 800
7499 SALESMAN 1600 1600
7521 SALESMAN 1250 1250
7566 MANAGER 2975 4165
7654 SALESMAN 1250 1250
7698 MANAGER 2850 3990
7782 MANAGER 2450 3430
7788 ANALYST 3000 3600
7839 PRESIDENT 5000 5000
7844 SALESMAN 1500 1500
7876 CLERK 1100 1100
7900 CLERK 950 950
7902 ANALYST 3000 3600
7934 CLERK 1300 1300

14 rows selected.
The Equivalent DECODE syntax will be

Code :
SQL> SELECT EMPNO,JOB , SAL ,

2 DECODE (JOB,'ANALYST', SAL*1.2 ,


3 'MANAGER', SAL*1.4,
4 SAL ) NEWSAL
5 FROM EMP;

EMPNO JOB SAL NEWSAL


---------- --------- ---------- ----------
7369 CLERK 800 800
7499 SALESMAN 1600 1600
7521 SALESMAN 1250 1250
7566 MANAGER 2975 4165
7654 SALESMAN 1250 1250
7698 MANAGER 2850 3990
7782 MANAGER 2450 3430
7788 ANALYST 3000 3600
7839 PRESIDENT 5000 5000
7844 SALESMAN 1500 1500
7876 CLERK 1100 1100
7900 CLERK 950 950
7902 ANALYST 3000 3600
7934 CLERK 1300 1300

14 rows selected.
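
The second CASE form mentioned earlier is the searched CASE, where each WHEN branch carries its own condition instead of a single tested expression. A minimal sketch (the salary bands are made up for illustration):

SELECT empno, sal,
CASE WHEN sal >= 3000 THEN 'HIGH'
WHEN sal >= 1500 THEN 'MEDIUM'
ELSE 'LOW'
END sal_band
FROM emp;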

What are the different types of normalization?

In database design, we start with one single table containing all possible columns. A lot of redundant
data would be present since it is a single table. The process of removing the redundant data, by
splitting up the table in a well-defined fashion, is called normalization.

1. First Normal Form (1NF)

A relation is said to be in first normal form if and only if all underlying domains contain atomic values only.
After 1NF , we can still have redundant data.

2. Second Normal Form (2NF)

A relation is said to be in 2NF if and only if it is in 1NF and every non key attribute is fully dependent on
the primary key. After 2NF , we can still have redundant data

3. Third Normal Form (3NF)

A relation is said to be in 3NF if and only if it is in 2NF and every non key attribute is non-transitively
dependent on the primary key

Define Join and explain different type of joins?



In order to avoid data duplication, data is stored in related tables. The JOIN keyword is used to fetch data
from related tables. An (inner) join returns rows when there is at least one match in both tables. The outer
join types are:

Left Join

Returns all rows from the left table, even if there are no matches in the right table.

Right Join

Returns all rows from the right table, even if there are no matches in the left table.

Full Join

Returns rows when there is a match in either of the tables.

What is Self-Join?

A self-join is a query used to join a table to itself. Aliases should be used to distinguish the two copies of the table.

What is Cross Join?

Cross Join will return all records where each row from the first table is combined with each row from the
second table.
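
For instance, using the sample emp and dept tables, a cross join produces every combination of rows from the two tables:

SELECT e.ename, d.dname
FROM emp e CROSS JOIN dept d;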

Important SQL Queries

1) How to delete Duplicate rows in a Table?

DELETE from table_name A WHERE ROWID> (SELECT min (ROWID) from table_name B WHERE
A.Col_name=B.Col_name)

2) How to Identify Duplicates in a Table?

Select Col1,Col2,Col3....... from table_name GROUP BY Col1,Col2,Col3....Having Count(*)>1;

3) To get Top Five Salary Employee Details?

Select * from (Select * from table_name ORDER BY Salary Desc) WHERE ROWNUM<=5;

Find highest salary queries?

select * from emp where sal=(select max(sal) from emp);

Find lowest salary queries?

select * from emp where sal=(select min(sal) from emp);

Find the top 5 employee salaries?



select * from (select * from emp order by sal desc) where rownum<6;

Find the lowest 5 employee salaries?

select * from (select * from emp order by sal asc) where rownum<6;

Find the particular employee salary?

for maximum:

select * from emp where sal in(select min(sal)from

(select sal from emp group by sal order by sal desc)

where rownum<=&n);

select * from emp a where &n=(select count(distinct(sal)) from emp b where a.sal<=b.sal);

for minimum:

select * from emp where sal in(select max(sal) from(select sal from emp group by sal order by sal asc)
where rownum<=&n);

select * from emp a where &n=(select count(distinct(sal)) from emp b where a.sal>=b.sal)

Query for combining two tables(INNER JOIN)?

select emp.empno,emp.ename,dept.deptno from emp,dept where emp.deptno=dept.deptno;

By using aliases:

select e.empno,e.ename,d.deptno from emp e,dept d where e.deptno=d.deptno;

select empno,ename,sal,dept.* from emp join dept on emp.deptno=dept.deptno;

Query for joining table it self(SELF JOIN)?

select e.ename "employee name", e1.ename "manager name" from emp e, emp e1 where e.mgr=e1.empno;

Query for joining two tables(OUTER JOIN)?

select e.ename,d.deptno from emp e,dept d where e.deptno(+)=d.deptno order by e.deptno;

select empno,ename,sal,dept.* from emp full outer join dept on emp.deptno=dept.deptno;

Right Outer Join:

select empno,ename,sal,dept.* from emp right outer join dept on emp.deptno=dept.deptno;

Left Outer Join:



select empno,ename,sal,dept.* from emp left outer join dept on emp.deptno=dept.deptno

Query to display the duplicate records?

select * from dup where rowid not in(select max(rowid)from dup group by eno);

Query to delete the duplicate records?

delete from dup where rowid not in(select max(rowid)from dup group by eno);

Query to display the records from M to N?

select ename from emp group by rownum,ename having rownum>1 and rownum<6;

select deptno,ename,sal from emp where rowid in(select rowid from emp

where rownum<=7 minus select rowid from emp where rownum<4);

select * from emp where rownum<=7 minus select * from emp where rownum<5;

Query to display Nth record from the table?

Select * from emp where rownum<=&n minus select * from emp where rownum<&n;

Query to display 3rd highest and 3rd lowest salary?

Select * from emp e1 where 3=(select count(distinct sal) from emp e2 where
e1.sal<=e2.sal)
union
Select * from emp e3 where 3=(select count(distinct sal) from emp e4 where
e3.sal>=e4.sal);

How to display duplicate rows in a table?

Select * from emp where deptno=any

(Select deptno from emp having count(deptno)>1 group by deptno);

Query to display even records only?

Select * from emp where (rowid,0) in (select rowid,mod (rownum,2) from emp);

Query to display odd records only?

Select * from emp where (rowid,1) in (select rowid,mod (rownum,2) from emp);

Query to display first N records?

Select * from (select * from emp order by rowid) where rownum<=&n;



Query to display middle records (i.e., drop first 5, last 5 records in emp table)?

Select * from emp where rownum<=(select count(*)-5 from emp)

Minus

Select * from emp where rownum<=5;

10 Frequently asked SQL Query Interview Questions

In this article I am giving examples of some SQL queries that are asked in interviews of candidates with one or
two years of experience in this field. Whenever you go for a Java developer position or any other programmer position,
the interviewer expects that if you have worked on any project for one or two years you have definitely come across
such database queries, so they test your skill by asking these simple queries.

Question 1: SQL Query to find second highest salary of Employee

Answer : There are many ways to find second highest salary of Employee in SQL, you can either use SQL Join or
Subquery to solve this problem. Here is SQL query using Subquery :

select MAX(Salary) from Employee WHERE Salary NOT IN (select MAX(Salary) from Employee );

See How to find second highest salary in SQL for more ways to solve this problem.

Question 2: SQL Query to find Max Salary from each department.

Answer:

SELECT DeptID, MAX(Salary) FROM Employee GROUP BY DeptID.

Question 3: Write SQL Query to display current date.

Ans: SQL has a built-in function called GetDate() which returns the current timestamp.

SELECT GetDate();

Question 4:Write an SQL Query to check whether date passed to Query is date of given format or not.

Ans: SQL has an IsDate() function which is used to check whether the passed value is a date of the specified format;
it returns 1 (true) or 0 (false) accordingly.

SELECT ISDATE('1/08/13') AS "MM/DD/YY";

It will return 0 because passed date is not in correct format.

Question 5: Write a SQL Query to print the name of distinct employee whose DOB is between 01/01/1960 to
31/12/1975.

Ans:
SELECT DISTINCT EmpName FROM Employees WHERE DOB BETWEEN '01/01/1960' AND '31/12/1975';

Question 6:Write an SQL Query find number of employees according to gender whose DOB is between
01/01/1960 to 31/12/1975.

Answer : SELECT COUNT(*), sex FROM Employees WHERE DOB BETWEEN '01/01/1960' AND
'31/12/1975' GROUP BY sex;

Question 7:Write an SQL Query to find employee whose Salary is equal or greater than 10000.

Answer : SELECT EmpName FROM Employees WHERE Salary>=10000;

Question 8:Write an SQL Query to find name of employee whose name Start with ‘M’

Ans: SELECT * FROM Employees WHERE EmpName like 'M%';

Question 9: find all Employee records containing the word "Joe", regardless of whether it was stored as JOE,
Joe, or joe.

Answer : SELECT * FROM Employees WHERE UPPER(EmpName) LIKE '%JOE%';

Question 10: Write a SQL Query to find year from date.

Answer : SELECT YEAR(GETDATE()) as "Year";

Hopefully this article will help you take a quick practice run whenever you are about to attend an interview and do not
have much time to go deep into each query.
