
DataStage Architecture

What is the architecture of DataStage?
The architecture of DataStage is a client/server architecture. The client/server layout differs between versions; the latest version at the time of writing is DataStage 8.7.
http://www-01.ibm.com/support/docview.wss?uid=swg27008803
1. DataStage 7.5 (7.5.1 or 7.5.2) - standalone
DataStage 7.5 was a standalone version: the DataStage engine, services, and repository (metadata) were all installed on one server, and the client was installed on the local PC and accessed the server through the DS client. Users were created on the Unix/Windows DataStage server and added to the dstage group (dsadm is the owner of DataStage, and dstage is its group). To give a new user access, simply create a new Unix/Windows user on the DS server and add it to the dstage group; the user will then have access to the DataStage server from the client.
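The user setup above can be sketched as shell commands, run as root on the DS server. This is a minimal sketch: the user name newdev is a made-up example, and dsadm/dstage are assumed to exist already from the DataStage installation.

```shell
# Create a new OS user for DataStage access, with dstage as the primary group.
useradd -g dstage -m newdev   # newdev is a hypothetical user name
passwd newdev                 # set the login password
# Verify the group membership -- the user can now log in from the DS client.
id newdev
```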
Client components and server components
There are four client components:
1. DataStage Designer
2. DataStage Administrator
3. DataStage Director
4. DataStage Manager
DataStage Designer is used to design jobs. All DataStage development activity happens here, so a DataStage developer should know this component very well.
DataStage Manager is used to import and export projects and to view and edit the contents of the repository. It is handled by the DataStage operator/administrator.
DataStage Administrator is used to create and delete projects and to set environment variables. It is handled by the DataStage administrator.
DataStage Director is used to run, validate, and schedule jobs. It is handled by the DataStage developer/operator.
Server components
DS server: runs executable server jobs, under the control of the DS Director, that extract, transform, and load data into a data warehouse (DWH).
DS Package Installer: a user interface used to install packaged DS jobs and plug-ins.
Repository (project): a central store that contains all the information required to build a DWH or data mart.

More reference on DataStage 7.5


ftp://ftp.software.ibm.com/software/data/db2imstools/db2tools/pdf/d...
http://it.toolbox.com/wiki/index.php/DataStage_Enterprise_Edition
http://etl-tools.info/infosphere-datastage-ee.htm
http://h71028.www7.hp.com/enterprise/downloads/DataStage%20Product%...
http://it.toolbox.com/blogs/infosphere/new-release-datastage-753-th...
2. DataStage 8.0 (8.1 and 8.5) - standalone
DataStage 8 was also a standalone version, but here the DataStage engine and services are on the DataStage server while the repository (metadata) database is installed on a separate Oracle/DB2 database server; the client is installed on the local PC and accesses the servers through the DS client.
Metadata (repository): this is created as one database with two schemas (xmeta and iauser). It can be set up as a RAC database (Active/Active on two servers: if either database fails, the other takes over without the running DataStage jobs losing their connections), where
1. xmeta holds information about the projects and the DataStage software
2. iauser holds information about the DataStage users in IIS or the web console
Note: we can install two or three DataStage instances (such as ds-8.0, ds-8.1, and ds-8.5) on the same server and bring up whichever version we want to work on. This reduces hardware cost, but only one instance can be up and running at a time.
DataStage 8, although still standalone, introduced three distinct components:
1. Information Server (IIS) - isadmin
2. WebSphere server - wasadmin
3. DataStage server - dsadm
1. IIS, also called the DataStage web console, holds all the DataStage user information. It is generally accessed in a web browser and does not need any DataStage software installation.
After the DataStage installation, the IIS web console is generated with isadmin as its administrator. Once we log into the web console as isadmin, we need to map the dsadm user in the engine credentials (dsadm is the Unix/Windows user created on the DataStage server in the dstage group). After the mapping, new users are created in the same user components. (Note: each user xxx created here is internally tagged to the mapped dsadm user, which makes the connection between the Unix DataStage server and the IIS web console; all files, projects, etc. created by xxx are owned by the dsadm user on the Unix server.)
We can restrict these xxx users to access only one or two projects.
http://www-01.ibm.com/support/docview.wss?uid=swg27009428&aid=1
http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r1/index.jsp
https://www-304.ibm.com/support/docview.wss?uid=swg27013419
http://it.toolbox.com/blogs/infosphere/user-and-group-security-for-...
http://mayurdsguru.files.wordpress.com/2010/12/datastage_admin.pdf

Client components & server components


http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r0/index.jsp?to...
Client components are
1. DataStage Designer
2. DataStage Administrator
3. DataStage Director
4. IBM Import Export Manager
5. Web console
6. IBM InfoSphere DataStage and QualityStage Multi-Client Manager
7. Others I have not come across
DataStage Designer is used to design jobs. All DataStage development activity happens here, so a DataStage developer should know this component very well.
DataStage Administrator is used to create and delete projects and to set environment variables. It is handled by the DataStage administrator.
DataStage Director is used to run, validate, and schedule jobs. It is handled by the DataStage developer/operator.
IBM Import Export Manager is used to import and export projects and to view and edit the contents of the repository. It is handled by the DataStage operator/administrator.
The web console is used to create DataStage users and to do administration. It is handled by the DataStage administrator.
Multi-Client Manager is used to install multiple clients (such as ds-7.5, ds-8.1, or ds-8.5) on the local PC and swap to whichever version is required. It is used by the DataStage developer/operator/administrator.
Server components:
-- IBM InfoSphere Blueprint Director
-- IBM InfoSphere Business Glossary
-- IBM InfoSphere DataStage
-- IBM InfoSphere FastTrack
-- IBM InfoSphere Information Analyzer
-- IBM InfoSphere Information Services Director
-- IBM InfoSphere Metadata Server
-- IBM InfoSphere Metadata Workbench
-- IBM InfoSphere QualityStage
http://www-01.ibm.com/support/docview.wss?uid=swg27016910
http://it.toolbox.com/blogs/infosphere/ten-reasons-why-you-need-dat...

3. DataStage 8.5 - cluster (HA, High Availability)


DataStage 8.5 also supports an HA (High Availability) cluster setup. All the functions work the same as in the DataStage 8.5 standalone version, but the hardware and software layout is different:
1. the DataStage engine tier is on separate servers (two, Active/Active or Active/Passive),
2. the services tier is on separate servers (two, Active/Active or Active/Passive), and
3. the metadata database (repository) tier is on separate servers (two, Active/Active or Active/Passive), installed on Oracle/DB2 database servers with RAC (two database servers in Active/Active mode: if one database fails, the other takes over immediately with no connections lost).
The whole DataStage HA setup is built so that a failure in any part (the engine, services, or metadata tier) automatically switches over to the other active server without the currently running DataStage jobs losing their connections. This is an amazing setup; it is being implemented in our Citibank project, and I am lucky to work on it.
We can also run multiple DataStage engine tiers, for example Singapore/Malaysia/Thailand/Russia (four engine tiers), against the same two services tiers and metadata database tiers. (This reduces hardware cost.)
http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp
http://www.google.com.sg/url?q=http://www.scribd.com/doc/46695490/D...
http://it.toolbox.com/blogs/infosphere/the-2010-datastage-roadmap-n...
DataStage migration from 8.0 to 8.1 or 8.5 - activity steps

Prerequisites for DataStage 8.1 installation
1. Storage allocation
2. Management approval for the migration activities
3. Communication of the migration activity to the application team
4. Filesystem creation for the databases and permissions setup for the DS metadata database (server B)
5. Filesystem creation for the DS binaries and DataStage 8.1 software installation (server A)
6. DataStage database creation (server B)
7. DataStage database schema creation on server B (iauser and xmeta schemas)
8. DataStage database schema privileges (create table, etc.)
DataStage setup activities
9. Export all ds-8.0 projects to a tmp location
10. Take a DS database backup on the (old) server B for emergency restore (in case the 8.0-to-8.1 migration fails)
11. Back up the DataStage 8.0.1 binaries (/opt/IBM) and the project path (/projects)
12. Shut down 8.0.1 and rename the 8.0.1 files/directories to names with the extension .801
13. Install the DataStage 8.1/8.5 version
14. Install fixpack 1, the XML patch, and the other security patches, and bring up the 8.1 version
15. Configure the dsenv file for the DB2/Oracle connections
16. Map the dsadm user in the web console
17. Create the application user IDs in the web console to access the DS client
18. Rename the existing project paths to ProjectName_bkp
19. Create new projects (replicated from 8.0.1)
20. Provide access to the projects for the application user IDs created in the web console
21. Import the jobs and recompile/promote all of them
22. Test-run the jobs, benchmark performance, and observe and fix problems in DataStage 8.1
23. Basic verification and signoff
24. Complete batch-run verification and signoff
Note: only one DataStage instance can be up and running at any time, but we can switch back to version 8.0 or 8.1 if it is installed on the same servers.
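Steps 7 and 8 above (repository schema creation and privileges) might look roughly like this on an Oracle repository server. This is a hedged sketch: the passwords and tablespace name are illustrative placeholders, and the exact privilege list should come from the IBM installation guide.

```shell
# Hypothetical sketch: create the two repository schemas on server B (Oracle).
sqlplus / as sysdba <<'EOF'
CREATE USER xmeta  IDENTIFIED BY xmeta_pwd  DEFAULT TABLESPACE xmeta_ts;
CREATE USER iauser IDENTIFIED BY iauser_pwd DEFAULT TABLESPACE xmeta_ts;
-- Step 8: schema privileges (create table, etc.)
GRANT CONNECT, RESOURCE TO xmeta;
GRANT CONNECT, RESOURCE TO iauser;
EOF
```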

DataStage migration part 1


DataStage migration from DB2 to Oracle or Oracle to DB2 - part 1
As time passes, most projects move to a new environment or upgrade the old one based on project requirements. One of the usual changes is a database swap in the project. I am explaining this from my experience, where we had an already working DWH with the structure below:
OS: AIX
Database: DB2 (the DWH database)
ETL tool: DataStage
Due to a new project implementation, we got a new requirement:
OS: AIX
Database: Oracle 10g
ETL tool: DataStage
We built this environment the same as the old one; only the Oracle database is new.
The most challenging part was migrating the DataStage jobs as they were, from DB2 to Oracle, without much recoding or creating new DataStage jobs.
Let us look at the DB2 and Oracle objects in detail and the differences between them (IBM DB2 UDB object -> Oracle equivalent):
Database -> Database
Table Space -> Tablespace
Schema -> User
User -> User
Group -> Role
Table -> Table
Typed Table -> Object Table
Temporary Table -> Temporary Table
Index -> Index
Check Constraint -> Check Constraint
Column Default -> Column Default
Unique Key -> Unique Key
Primary Key -> Primary Key
Foreign Key -> Foreign Key
UDB SQL Procedure -> PL/SQL Procedure
UDB SQL Function -> PL/SQL Function
UDB Package -> PL/SQL Package
UDB Trigger -> PL/SQL Trigger
Table Alias -> Public Synonym
Sequence -> Sequence
View -> View
Typed View -> Object View
Identity Column -> Auto Increment Column
Structured Data Type -> Abstract Datatype
Datalink -> Binary File
Table containing structured data type -> Table containing abstract datatype
View containing structured data type -> View containing abstract datatype
Stored Procedure -> Stored Procedure
Function -> Function

From the list above we can see that most objects map directly, but the main issue is users: that is where the problem exists. Let's talk about users and schemas in more detail.
Oracle Database Architecture
An Oracle database system contains two main components:
1. Instance: the memory structures and background processes
2. Database: the disk resources

Instance

As covered above, the memory structures and background processes constitute an instance. The memory structure consists of the System Global Area (SGA), the Program Global Area (PGA), and an optional area, the software code area. The mandatory background processes are the Database Writer (DBWn), Log Writer (LGWR), Checkpoint (CKPT), System Monitor (SMON), and Process Monitor (PMON); optional background processes include the Archiver (ARCn), Recoverer (RECO), and others. Figure 2 illustrates the relationships between these components in an instance.

System Global Area

The SGA is the primary memory structure. When Oracle DBAs talk about memory, they usually mean the SGA. It is broken into several parts: the buffer cache, shared pool, redo log buffer, large pool, and Java pool.

Buffer Cache

The buffer cache stores copies of the data blocks retrieved from datafiles: when a user retrieves data from the database, the data is kept in the buffer cache. Its size can be set via the DB_CACHE_SIZE parameter in the init.ora initialization parameter file.

Shared Pool

The shared pool is broken into two smaller memories: the library cache and the dictionary cache. The library cache stores information about commonly used SQL and PL/SQL statements and is managed by a Least Recently Used (LRU) algorithm; it also enables the sharing of those statements among users. The dictionary cache, on the other hand, stores information about object definitions in the database, such as columns, tables, indexes, users, and privileges.

The shared pool size can be set via the SHARED_POOL_SIZE parameter in the init.ora initialization parameter file.

Redo Log Buffer

Each DML statement (insert, update, delete) executed by a user generates a redo entry. What is a redo entry? It is a record of the data changes made by users. Redo entries are stored in the redo log buffer before being written to the redo log files. To set the size of the redo log buffer, use the LOG_BUFFER parameter in the init.ora initialization parameter file.

Large Pool

The large pool is an optional area of memory in the SGA. It relieves the burden placed on the shared pool and is also used for I/O processes. Its size can be set via the LARGE_POOL_SIZE parameter in the init.ora initialization parameter file.

Java Pool

As its name suggests, the Java pool services the parsing of Java commands. Its size can be set via the JAVA_POOL_SIZE parameter in the init.ora initialization parameter file.
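Pulling the five parameters above together, an init.ora fragment sizing the SGA components might look like this. The values are illustrative only and would be tuned per workload:

```
# init.ora -- illustrative SGA sizing (example values, not recommendations)
DB_CACHE_SIZE=512M        # buffer cache
SHARED_POOL_SIZE=256M     # library cache + dictionary cache
LOG_BUFFER=10485760       # redo log buffer, in bytes
LARGE_POOL_SIZE=64M       # optional: offloads the shared pool, I/O
JAVA_POOL_SIZE=64M        # optional: Java command parsing
```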

Program Global Area

Although the result of SQL statement parsing is stored in the library cache, the values of bind variables are stored in the PGA. Why? Because they must be private, not shared among users. The PGA is also used as a sort area.

Software Code Area

The software code area is the location in memory where the Oracle application software itself resides.

Oracle Background Processes

The Oracle background processes work behind the scenes, together with the memory structures.

DBWn

The database writer (DBWn) process writes data from the buffer cache to the datafiles. Historically the database writer was named DBWR, but since some Oracle versions allow more than one database writer, the name changed to DBWn, where n is a number from 0 to 9.

LGWR

The log writer (LGWR) process is similar to DBWn: it writes the redo entries from the redo log buffer to the redo log files.

CKPT

Checkpoint (CKPT) is a process that signals DBWn to write the data in the buffer cache to the datafiles. It also updates the datafile and control file headers when a log file switch occurs.

SMON

The System Monitor (SMON) process recovers from a system crash or instance failure by applying the entries in the redo log files to the datafiles.

PMON

The Process Monitor (PMON) process cleans up after failed processes by rolling back their transactions and releasing their resources.

Database

The database refers to the disk resources and is broken into two main structures: logical structures and physical structures.

Logical Structures

The Oracle database is divided into smaller logical units to manage, store, and retrieve data efficiently. The logical units are the tablespace, segment, extent, and data block. Figure 3 illustrates the relationships between these units.
Tablespace

A tablespace is a logical grouping of database objects. A database must have one or more tablespaces. In Figure 3 we have three tablespaces: the SYSTEM tablespace, Tablespace 1, and Tablespace 2. A tablespace is composed of one or more datafiles.

Segment

A tablespace is further broken into segments. A segment stores objects of the same type: every table in the database is stored in its own data segment, and every index in its own index segment. The other segment types are the temporary segment and the rollback segment.

Extent

A segment is further broken into extents. An extent consists of one or more data blocks. When a database object grows, a new extent is allocated. Unlike a tablespace or a segment, an extent cannot be named.

Data Block

A data block is the smallest unit of storage in an Oracle database. The data block size is a specific number of bytes, set per tablespace, and every block in a tablespace has that same size.

Physical Structures

The physical structures are the parts of an Oracle database (in this case, disk files) that are not directly manipulated by users. They consist of datafiles, redo log files, and control files.

Datafiles

A datafile is a file that corresponds to a tablespace. One datafile can be used by only one tablespace, but one tablespace can have more than one datafile.
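The tablespace-to-datafile relationship shows up directly in the DDL. A hedged sketch (the tablespace name, paths, and sizes are all made up for illustration):

```shell
sqlplus / as sysdba <<'EOF'
-- One tablespace built from two datafiles
CREATE TABLESPACE dwh_data
  DATAFILE '/u01/oradata/dwh/dwh_data01.dbf' SIZE 500M;
ALTER TABLESPACE dwh_data
  ADD DATAFILE '/u01/oradata/dwh/dwh_data02.dbf' SIZE 500M;
EOF
```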

Redo Log Files

Redo log files are the files that store the redo entries generated by DML statements. They can be used in recovery processes.
Control Files

Control files store information about the physical structure of the database, such as datafile sizes and locations, redo log file locations, and so on.

http://ugweb.cs.ualberta.ca/~c391/manual/chapt2.html
http://docs.oracle.com/cd/B10500_01/server.920/a96524/c08memor.htm

DB2 Database Architecture

DB2 Administration Server (DAS): should always be up and running. There is only one DAS user for the whole server; if it is down, the DB2 server cannot be administered.
Instance (e.g. db2inst1): an instance holds multiple databases inside it, and we can have more than one instance on one DB2 server.
Fenced user (db2fence): used to run functions and procedures in the database.
Database: the database refers to the disk resources and is broken into two main structures, logical and physical.
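The pieces above can be inspected from the shell on a DB2 server. These are standard DB2 commands, though the instance name shown is illustrative:

```shell
db2ilist                  # list the instances on this server (e.g. db2inst1)
su - db2inst1             # switch to the instance owner
db2 list db directory     # list the databases catalogued in this instance
db2 "select current schema from sysibm.sysdummy1"   # show the current schema
```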

DataStage migration part 2


I have given the environment details for the migration; now let's move to the technical part.
1. The goal is to move the DataStage project from the Oracle environment to the DB2 environment.
In the old DataStage 7.5 environment there is no Information Server concept, where the DataStage repository data is stored as a separate database in an Oracle/DB2 database; instead the data is stored on the DataStage server itself. So we don't need to worry about the DataStage software - it will stay up and running fine - but the database connections to the country DB must be defined here.
We have two types of DB here:
1. Application DB (where the application repository DB is stored and maintained; also needed for user/xmeta data administration, which is used from DataStage 8.0 onwards)
2. Country DB (where the real DWH data lives)
Usually DataStage collects data from different sources, such as:
1. Text format
2. Excel format
3. Databases (Sybase, SQL Server, Oracle, DB2, Teradata, etc.)
It then puts the final processed data into some format, such as text or a database (Sybase, SQL Server, Oracle, DB2, Teradata, etc.), which is called the data warehouse. For the migration we need to consider the steps below. Make sure DataStage 7.5 (or 8.0 to 8.7) is up and running; I will give different workflows for 7.5 and 8.x.
DataStage migration for 7.5
1. Install the DataStage software on the new server (server A).
2. Take the DataStage project backup on the old server (server B).
3. Create the same projects on the new server A as on the old server B.
4. Import the project backup taken from server B into server A.
5. Configure the dsenv file with the same Oracle/DB2 entries as on the old server.
6. Create a new ODBC configuration for the Oracle/DB2 country DB entries, or update the tnsnames.ora file with the Oracle entries for the country DB.
7. Test the same jobs on the new DataStage server.
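Step 6 might look like the fragments below. The DSN name, driver path, host name, and service name are illustrative placeholders, so check them against your own environment:

```
# .odbc.ini entry for the country DB (on the DataStage server)
[CountryDB_DB2]
Driver=/opt/IBM/db2/V9/lib/libdb2.so
Description=Country DWH database (DB2)
Database=CNTRYDB

# tnsnames.ora entry for the Oracle country DB
CNTRYDB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = serverB)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = CNTRYDB))
  )
```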
Note: the main point here is the Oracle/DB2 schema concept, since DataStage jobs follow the pattern below:
1. Log in with the Unix/DataStage user user1.
2. Connect to Oracle as user2/passwd from the DataStage job and create the table. Per database convention the full table name is schema_name.table_name, so in Oracle the table will be user2.table1. (Similarly, user1.table1 if the job logs into Oracle as user1: in Oracle the schema name is the user name.)
3. Connect to DB2 as user3/passwd from the DataStage job. But in DB2 the login is the Unix login (e.g. db2admin), and the schema is named according to the user.
Example:
1. Implicit means the table name uses the connected user's default schema, e.g. db2admin.table1.
2. Explicit means the table name uses a separately defined schema, e.g. user2.table1 (here user2 is the schema name, not the user that logs into the DB2 database).
As I mentioned before, this is the major problem in the DataStage job migration: either all the developers sit and edit every job's table names after the migration from Oracle to DB2, or they can simply create the same user1 as a schema name in DB2 and run the jobs without editing :)
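If the table names do have to change, the schema prefixes in the exported job SQL can be rewritten in bulk before re-import instead of editing each job by hand. A minimal sketch with sed; user1, db2admin, and the file names are all hypothetical:

```shell
# Sample of SQL as it might appear inside an exported job definition
printf 'SELECT * FROM user1.sales_fact;\nINSERT INTO user1.dim_date VALUES (1);\n' > job_sql.txt
# Rewrite the old Oracle schema prefix to the DB2 schema
sed 's/user1\./db2admin./g' job_sql.txt > job_sql_db2.txt
cat job_sql_db2.txt
```

After the rewrite, every user1.table reference reads db2admin.table, and the edited definitions can be imported back into the project.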
DataStage 8.0 or 8.1 to 8.5 or 8.7 migration
1. Install the Oracle/DB2 database on server A.
2. Create the iauser/xmeta schemas with database privileges, including the create-database-link privilege.
3. Create the dsadm user and dstage group on DataStage server B.
4. Install the DataStage 8.x version (with the DataStage metadata repository on server A).
5. Configure the dsadm user in the web console (map the Unix dsadm user to the web console dsadm user).
6. Configure the dsenv file with the Oracle/DB2 country database entries, and also the tnsnames.ora file with the Oracle entries.
7. Make sure the DB2/Oracle client is installed on DataStage server B.
8. Create the same DataStage projects as on the old DataStage server.
9. Import the DataStage project backup from the old DataStage server.
10. Compile, run, and test the jobs.
Please let me know if there is any confusion about the DataStage migration from Oracle to DB2.
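Step 6's dsenv additions typically export the database client environment so that DataStage processes inherit it. A hedged sketch; the install paths are illustrative and depend on where the clients were installed:

```
# dsenv -- additions for Oracle and DB2 client connectivity (example paths)
ORACLE_HOME=/u01/app/oracle/product/10.2.0; export ORACLE_HOME
TNS_ADMIN=$ORACLE_HOME/network/admin;       export TNS_ADMIN
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME/lib; export LD_LIBRARY_PATH
# DB2: source the instance owner's profile
. /home/db2inst1/sqllib/db2profile
```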
