
IBM Information Platform and Solutions

Center of Excellence

IBM IPS Parallel Framework Standard Practices


Administration and Management:
DataStage EE Administration and Production Automation
Prepared by IBM Information Platform and Solutions Center of Excellence
October 29, 2007

CONFIDENTIAL, PROPRIETARY, AND TRADE SECRET NATURE OF ATTACHED DOCUMENTS


This document is Confidential, Proprietary and Trade Secret Information (Confidential Information) of IBM, Inc. and is provided solely for the purpose
of evaluating IBM products with the understanding that such Confidential Information will be disclosed only to those who have a need to know. The
attached documents constitute Confidential Information as they include information relating to the business and/or products of IBM (including, without
limitation, trade secrets, technical, business, and financial information) and are trade secret under the laws of the State of Massachusetts and the United
States.
Copyrights
© 2007 IBM Information Platform and Solutions
All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language in
any form by any means without the written permission of IBM. While every precaution has been taken in the preparation of this document to reflect
current information, IBM assumes no responsibility for errors or omissions or for damages resulting from the use of information contained herein.

IBM IPS Parallel Framework: Administration and Production Automation

October 29, 2007

1 of 72

2007 IBM Information Integration Solutions. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a
retrieval system, or translated into any language in any form by any means without the written permission of IBM.


Document Goals

Intended Use
This document presents a set of standard practices, methodologies, and an example Toolkit for administering and integrating IBM WebSphere DataStage Enterprise Edition (DSEE) with a production infrastructure. Except where noted, this document is intended to supplement, not replace, the installation documentation.

Target Audience
The primary audience for this document is DataStage Administrators and Developers who have been trained in Enterprise Edition. Information in certain sections may also be relevant for Technical Architects and System Administrators.

Product Version
This document is intended for the following product releases:
- WebSphere DataStage Enterprise Edition 7.5.2 (UNIX, Linux, USS)
- WebSphere DataStage Enterprise Edition 7.5x2 (Windows)

Document Author and Contributors

Author
Mike Carney, Advanced Consulting Engineer, carneym@us.ibm.com

Contributing Authors
Paul Christensen, Global Technical Architect, ptc@us.ibm.com
Bob Johnston, Advanced Consulting Engineer, rdj@us.ibm.com
Patrick Owen, Advanced Consulting Engineer, powen@us.ibm.com
Mike Ruland, Global Technical Architect, mruland@us.ibm.com
Jim Tsimis, Advanced Support Engineer, jtsimis@us.ibm.com

Document Revision History

Date               Rev.  Description
April 27, 2006     1.0   Initial release
July 17, 2006      1.1   Updated ETL and Project_Plus directory hierarchies for consistency across DSEE Standards. Added Staging directory hierarchy.
August 15, 2006    1.2   Updated styles and formatting.
October 5, 2006    1.3   Updated directory and Project_Plus naming standards for consistency across deliverables. Updated terminology and Naming Standards for consistency. Expanded discussion of Environment Variables and Parameters. Added Environment Variable Reference Appendix. Added Document Author and Contributors, and Package Contents.
October 17, 2006   1.4   Added Feedback section and IIS Services Offerings. Corrected Data Set and Scratch file system naming. Expanded backup discussion for DataSets.
February 8, 2007   2.0   Updated positioning, naming (IIS to IPS), Services Offerings.
October 29, 2007   3.0   First public reference release; complements the Administration and Production Automation Services Workshop.

Document Conventions
This document uses the following conventions:

Convention
Usage

Bold
In syntax, bold indicates commands, function names, keywords, and options that must be input exactly as shown. In text, bold indicates keys to press, function names, and menu selections.

Italic
In syntax, italic indicates information that you supply. In text, italic also indicates UNIX commands and options, file names, and pathnames.

Plain
In text, plain indicates Windows NT commands and options, file names, and pathnames.

Bold Italic
Indicates important information.

Lucida Console
Lucida Console text indicates examples of source code and system output.

Lucida Console Bold
In examples, Lucida Console bold indicates characters that the user types or keys the user presses (for example, <Return>).

Lucida Blue
In examples, Lucida Blue illustrates the operating system command line prompt.

→ (right arrow)
A right arrow between menu commands indicates that you should choose each command in sequence. For example, "Choose File → Exit" means you should choose File from the menu bar, and then choose Exit from the File pull-down menu.

Continuation character
The continuation character is used in source code examples to indicate a line that is too long to fit on the page, but must be entered as a single line on screen.
The following are also used:

Syntax definitions and examples are indented for ease in reading.

All punctuation marks included in the syntax (for example, commas, parentheses, or quotation
marks) are required unless otherwise indicated.

Syntax lines that do not fit on one line in this manual are continued on subsequent lines. The
continuation lines are indented. When entering syntax, type the entire syntax entry, including
the continuation lines, on the same input line.

Text enclosed in parentheses and underlined (like this) following the first use of a proper term
will be used instead of the proper term.
Interaction with our example system will usually include the system prompt (in blue) and the
command, most often on two or more lines.
If appropriate, the system prompt will include the user name and directory for context. For example:
%etl_node%:dsadm /usr/dsadm/Ascential/DataStage >
/bin/tar cvf /dev/rmt0 /usr/dsadm/Ascential/DataStage/Projects

Feedback
We value your input and suggestions for continuous improvement to this content. Direct any questions,
comments, corrections, or suggested additions to: cedifeed@us.ibm.com


Table of Contents

1  IBM INFORMATION PLATFORM AND SOLUTIONS SERVICES ............................ 5
2  DATASTAGE ADMINISTRATION .................................................... 9
   2.1  Configuring DataStage Environments for a System Life Cycle ............. 9
   2.2  Configuring DataStage File Systems and Directories ..................... 11
   2.3  Administrator Tips ..................................................... 16
   2.4  Performance Monitoring ................................................. 18
   2.5  Security, Roles, DataStage User Accounts ............................... 24
   2.6  The DataStage Administrator Project Configuration ...................... 26
3  JOB MONITOR ................................................................. 30
   3.1  Configuration .......................................................... 30
   3.2  Job Monitor Environment Variables ...................................... 30
   3.3  Starting & Stopping the Monitor ........................................ 30
   3.4  Monitoring JobMon ...................................................... 31
4  BACKUP / RECOVERY / REPLICATION / FAILOVER PROCEDURES ....................... 32
   4.1  DataStage Conductor Backup ............................................. 32
   4.2  DataStage Project Backups .............................................. 32
   4.3  DataStage Exports for Partial Back Up .................................. 33
   4.4  DataSets, Lookup File Sets and File Sets ............................... 34
   4.5  External Entities Scripts, Routines, Staging Files ..................... 34
   4.6  Replicating the DataStage Environment .................................. 34
   4.7  Important Project File System Considerations ........................... 37
5  OVERVIEW OF PRODUCTION AUTOMATION AND INFRASTRUCTURE INTEGRATION FOR
   DATASTAGE .................................................................. 39
   5.1  DataStage Job Control Development Kit .................................. 40
   5.2  Job Sequencer .......................................................... 41
   5.3  Exception Handling ..................................................... 41
   5.4  Checkpoint Restart ..................................................... 42
6  JOB PARAMETER AND ENVIRONMENT VARIABLE MANAGEMENT ........................... 46
   6.1  DataStage Environment Variables ........................................ 46
   6.2  DataStage Job Parameters ............................................... 49
   6.3  Audit and Metrics Reporting in an Automated Production Environment ..... 54
   6.4  Integrating with External Schedulers ................................... 54
   6.5  Integrating with Enterprise Management Consoles ........................ 55
7  CHANGE MANAGEMENT ........................................................... 56
   7.1  Source Control ......................................................... 56
   7.2  Production Migration Life Cycle ........................................ 57
   7.3  Security ............................................................... 58
   7.4  Upgrade Procedure (Including Fallback Emergency Patch) ................. 58

APPENDIX A: PROCESSES CREATED AT RUNTIME BY DATASTAGE EE ....................... 60
APPENDIX B: ENVIRONMENT VARIABLE REFERENCE ..................................... 67


1 IBM Information Platform and Solutions Services


IBM Information Platform and Solutions (IPS) Professional Services offers a broad range of
workshops and services designed to help you achieve success in the design, implementation, and
rollout of critical information integration projects.

Figure 1: IBM IPS Services Overview (Iterations 2 Methodology; Standard Practices; Architecture and Design; Education and Mentoring; Virtual Services; Certification)

Services Offerings
Description

Staff Augmentation and Mentoring
Whether through workshop delivery, project leadership, or mentored augmentation, the Professional Services staff of IBM Information Platform and Solutions leverages IBM's methodologies, Standard Practices, and experience developed throughout thousands of successful engagements in a wide range of industries and government entities.

Learning Services
IBM offers a variety of courses covering the IPS product portfolio. IBM's blended learning approach is based on the principle that people learn best when provided with a variety of learning methods that build upon and complement each other. With that in mind, courses are delivered through a variety of mechanisms: classroom, on-site, and Web-enabled FlexLearning.

Certification
IBM offers a number of Professional Certifications through independent testing centers worldwide. These certification exams provide a reliable, valid, and fair method of assessing product skills and knowledge gained through classroom and real-world experience.

Client Support Services
IBM is committed to providing our customers with reliable technical support worldwide. All Client Support services are available to customers who are covered under an active IBM IPS maintenance agreement. Our worldwide support organization is dedicated to assuring your continued success with IPS products and solutions.

Virtual Services
The low-cost Virtual Services offering is designed to supplement the global IBM IPS delivery team, as needed, by providing real-time, remote consulting services. Virtual Services has a large pool of experienced resources that can provide IT consulting, development, migration, and training services to customers for WebSphere DataStage Enterprise Edition (DSEE).


Center of Excellence for Data Integration (CEDI)


Establishing a CEDI within your enterprise can help increase efficiency and drive down the cost of
implementing data integration projects. A CEDI can be responsible for Competency, Readiness,
Accelerated Mentored Learning, Common Business Rules, Standard Practices, Repeatable Processes,
and the development of custom methods and components tailored to your business.
IBM IPS Professional Services offerings can be delivered as part of a strategic CEDI initiative, or on
an as-needed basis across a project lifecycle:
Installation and Configuration
Information Analysis
Data Flow and Job Design Standard Practices
Data Quality Management Standard Practices
Administration, Management, and Production Automation
Health Check Evaluation
Sizing and Capacity Planning
Performance Tuning
High Availability Architecture
Grid Computing Discovery, Architecture, and Planning
Grid Computing Installation and Deployment

Figure 2: IPS Services Offerings within an Information Integration Project Lifecycle (flowchart: Information Exchange and Discovery leads to Identify; Strategic Planning leads to Startup; Requirements Definition, Architecture, and Project Planning feed the Iterations 2 phases Analysis & Design, Build, Test & Implement, and Monitor & Refine)


Project Startup Workshops
Description

Information Exchange and Discovery Workshop
Targeted for clients new to the IBM IPS product portfolio, this workshop provides IBM's high-level recommendations on how to solve a customer's particular problem. IBM analyzes the data integration challenges outlined by the client, and develops a strategic approach for addressing those challenges.

Requirements Definition, Architecture, and Project Planning Workshop
Guiding clients through the critical process of establishing a framework for a successful future project implementation, this workshop delivers a detailed project plan, as well as a Project Blueprint. These deliverables document project parameters, current and conceptual end states, network topology, data architecture, and hardware and software specifications; outline a communication plan; define scope; and capture identified project risk.

Iterations 2
IBM's Iterations 2 is a framework for managing enterprise data integration projects that integrates with existing customer methodologies. Iterations 2 is a comprehensive, iterative, step-by-step approach that leads project teams from initial planning and strategy through to tactical implementation. This workshop includes the Iterations 2 software, along with customized mentoring.


Standard Practices Workshops
Description

Installation and Configuration Workshop
Establishes a documented, repeatable process for installation and configuration of DSEE server and client components. This may involve review and validation of one or more existing DSEE environments, or planning, performing, and documenting a new installation.

Information Analysis Workshop
Provides clients with a set of Standard Practices and a repeatable methodology for analyzing the content, structure, and quality of data sources using the combination of WebSphere ProfileStage, QualityStage, and AuditStage.

Data Flow and Job Design Standard Practices Workshop
Helps clients establish standards and templates for the design and development of parallel jobs using DSEE through practitioner-led application of IBM Standard Practices to a client's environment, business, and technical requirements. The delivery includes a customized Standards document as well as custom job designs and templates for a focused subject area.

Data Quality Management Standard Practices Workshop
Provides clients with a set of standard processes for the design and development of data standardization, matching, and survivorship processes using WebSphere QualityStage. The data quality strategy formulates an auditing and monitoring program that helps ensure on-going confidence in data accuracy, consistency, and identification through client mentoring and sharing of IBM Standard Practices.

Administration, Management, and Production Automation Workshop
This workshop provides customers with a customized Toolkit and set of proven Standard Practices for integrating DSEE into a client's existing production infrastructure (monitoring, scheduling, auditing/logging, change management) and for administering, managing, and operating DSEE environments.

Advanced Deployment Workshops
Description

Health Check Evaluation
This workshop is targeted for clients currently engaged in IPS development efforts that are not progressing according to plan, or for clients seeking validation of proposed plans prior to the commencement of new projects. It provides review of and recommendations for core ETL development and operational environments by an IBM expert practitioner.

Sizing and Capacity Planning Workshop
Provides clients with an action plan and set of recommendations for meeting current and future capacity requirements for data integration. This strategy is based on analysis of business and technical requirements, data volumes and growth projections, existing standards and technical architecture, and existing and future data integration projects.

Performance Tuning Workshop
Guides a client's technical staff through IBM Standard Practices and methodologies for review, analysis, and performance optimization using a targeted sample of client jobs and environments. This workshop can identify potential areas of improvement, demonstrate IBM's processes and techniques, and provide a final report that contains recommended performance modifications and IBM performance tuning guidelines.

High-Availability Architecture Workshop
Using IBM's IPS Standard Practices for high availability, this workshop presents a plan for meeting a customer's high availability requirements using the parallel framework of DSEE. It then implements the architectural modifications necessary for high availability computing.

Grid Computing Discovery, Architecture, and Planning Workshop
Provides the planning and readiness efforts required to support a future deployment of the parallel framework of IPS on Grid computing platforms. This workshop prepares the foundation on which a follow-on Grid installation and deployment will be executed, and includes hardware and software recommendations and estimated scope.

Grid Computing Installation and Deployment Workshop
Installs, configures, and deploys the IBM IPS Grid Enabled Toolkit in a client's Grid environments and provides integration with Grid Resource Managers, and configuration of DSEE, QualityStage/EE, and/or ProfileStage/EE.

For more details on any of these IBM IPS Professional Services offerings, and to find a local IBM
Information Integration Services contact, visit:
http://www.ibm.com/software/data/services/ii.html
Administration, Management and Production Automation Workshop
The Administration, Management and Production Automation Workshop provides a set of proven
Standard Practices and a customized toolkit for integrating DSEE into a customer's existing
production infrastructure (monitoring, scheduling, auditing/logging, change management). It also
provides expert practitioner recommendations for administering, managing, and operating DSEE
environments.
The following flowchart illustrates the various IPS Services workshops around the parallel framework
of DSEE.

Figure 3: Services Workshops for the Parallel Framework of DSEE


2 DataStage Administration
This section discusses DataStage administration and automation, joining the two disciplines in a
cohesive manner by defining standard practices that are complementary. These practices rest on a
common foundation: the operating environment and a simple life cycle methodology.

2.1 Configuring DataStage Environments for a System Life Cycle


There are many ways to configure a DataStage ETL environment. DataStage EE is a flexible piece of
software, presenting the DataStage team with many facets of configuration, administration, design,
development, and operations to consider. After the software is installed, many customers wonder what
to do next. Seeing the end picture of your environment, how users will interact with it, and how it will
function will help you decide how to configure, administer, manage, develop, and operate all aspects
of the DataStage environment. It will also help you in related areas, such as planning hardware
resources and setting up users and security.
You should be familiar with the following aspects of a DataStage project in particular:

Projects are both the logical and physical means for storing work performed in DataStage.
Projects are metadata repositories for DataStage objects, such as jobs, stages, and shared
containers.
Projects also store configuration metadata, such as environment variables.
It is possible to create many projects.
Projects are independent of each other.
DataStage object metadata can be exported to a file as well as imported.

2.1.1 A Simple DataStage Application Life Cycle


It is common to refer to a collection of related DataStage objects, such as jobs and shared containers,
as an application. As with all software, applications developed with DataStage need to be developed
and maintained under a life cycle methodology to ensure quality.
The life cycle advocated in this Standard Practice covers only a subset of larger, more comprehensive
methodologies. It is primarily concerned with the development, testing, release, and maintenance of
the DataStage application. It describes how to physically implement the environment for the life cycle,
along with activities related to operation and maintenance. It does not consider aspects of a broader
life cycle, such as design and documentation.
The DataStage application life cycle has at least three phases:
1. Development/Maintenance
2. Testing
3. Production
IBM IPS Parallel Framework: Administration and Production Automation

October 29, 2007

9 of 72

2007 IBM Information Integration Solutions. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a
retrieval system, or translated into any language in any form by any means without the written permission of IBM.

IBM Information Platform and Solutions


Center of Excellence

A more robust life cycle will utilize more testing phases.


1. Development/Maintenance
2. Integration Testing
3. Quality Assurance
4. User Acceptance Testing
5. Production
2.1.2 DataStage Installation and Configuration Considerations
In support of the life cycle, DataStage project environments should be configured for each phase of
the project. That is, for each DataStage application, create projects for development, test, production,
and so on. In addition to configuring project environments, you will need to consider system
requirements as well. It is a recognized standard practice, and common among customers, to use
completely separate systems for each phase of the life cycle. Consider the other resources that will be
used by the project, such as disk space, CPU capacity, and memory size, and whether or not the
system(s) can support the anticipated or actual workload. Application performance depends on
sufficient hardware resources to support the workload that parallel execution puts on a system.
If you plan to execute your DataStage jobs in a distributed fashion on a loosely coupled cluster (MPP,
Grid, Cluster) or employ a failover strategy, the DataStage EE environment will need to be replicated
on all physical processing nodes. See Replicating the DataStage Environment, section 4.6.
Installation of the DataStage environment requires careful planning. Choosing hardware resources is
particularly important for a DataStage EE environment. Consider separate physical environments for
each phase of a life cycle to ensure adequate performance; for example, designate a separate machine
for development, test, and production.

Figure 4: DataStage EE Physical Environments for a life cycle. The figure contrasts three
configurations: Separate Physical Environments (dedicated development, test, and production
machines; the standard practice), Mixed Physical Environments (a combined development and test
machine plus a production machine), and a Single Mixed Physical Environment (development, test,
and production on one machine).



DataStage user accounts for all phases of the ETL life cycle should be set up to allow for separate
developer accounts and a separate account for each phase of the life cycle. Each of these accounts
should be configured according to the section on configuring a DataStage user (below).
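As a sketch, the per-phase accounts might be prepared as follows. The account and group names (dsdev1, dsdev2, dstest, dsprod, group dstage) are assumptions for illustration only, and the script prints the commands for administrator review rather than executing them:

```shell
#!/bin/sh
# Print (rather than run) the account-creation commands for a hypothetical
# site: individual developer accounts plus one account per life-cycle phase.
GROUP=dstage                      # assumed DataStage primary group

for user in dsdev1 dsdev2 dstest dsprod
do
    # -g: primary group, -m: create the home directory
    echo "useradd -g $GROUP -m $user"
done
```

Generating the commands first keeps the full account list visible before any system change is made; once reviewed, the output can be executed by a privileged user.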

2.2 Configuring DataStage File Systems and Directories


DataStage Enterprise Edition requires file systems to be available for:

Software Install Directory
- DataStage Enterprise Edition executables, libraries, and pre-built components

DataStage Project (Repository) Directory

Data Storage
- DataStage temporary storage: scratch, temp, buffer
- DataStage parallel Data Set segment files
- Staging and archival storage for any source file(s)

By default, each of these directories (except for file staging) is created during installation as a
subdirectory under the base DataStage installation directory.
IMPORTANT: Each storage class should be isolated in a separate file system to accommodate
its different performance and capacity characteristics and backup requirements.
The default installation is generally acceptable for small prototype environments.
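One way to sketch such a layout is shown below. The directory names are assumptions, not product defaults; in a real deployment each path would be a dedicated mount point chosen to local standards:

```shell
#!/bin/sh
# Demonstration only: create one directory per DataStage storage class.
# In production each path would be a separate file system (mount point);
# the names below are illustrative assumptions.
BASE=${DSEE_BASE:-./dsee_demo}

for class in install projects scratch datasets staging
do
    mkdir -p "$BASE/$class"
done
ls "$BASE"
```

Keeping the classes on separate file systems means that a runaway scratch or logging area cannot exhaust the space needed by the project repository, and each class can be backed up on its own schedule.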
2.2.1 Software Install Directory
The software install directory is created by the installation process, and contains the DSEE software
file tree. The install directory grows very little over the life of a major software release, so the default
location ($HOME for dsadm, e.g.: /home/dsadm) may be adequate.
The system administrator may choose to install DataStage in a subdirectory within an overall install
file system. You should verify that the install file system has at least 1GB of space for the software
directory (2GB if you are installing RTI or other optional components).
For cluster or Grid implementations, it is generally best to share the Install file system across
servers (at the same mount point).
NOTE: the DataStage installer will attempt to rename the installation directory to support later
upgrades; if you install directly to a mount point this rename will fail and several error
messages will be displayed. Installation will succeed but the messages may be confusing.


2.2.2 DataStage Projects (Repository) Directory


The DataStage Projects subdirectory contains the repository (Universe database files) of job designs,
design and runtime metadata, logs, and components. Project directories can grow to contain thousands
of files and subdirectories depending on the number of projects, the number of jobs, and the volume of
logging information retained about each job.
During the installation process, the Projects subdirectory is created in the DataStage install directory.
By default, the DataStage Administrator client creates its projects in this Projects subdirectory.
For cluster or Grid implementations, it is generally best to share the Projects file system across
servers (at the same mount point).
IMPORTANT: It is a bad practice to create DataStage projects in the default directory within
the install file system, as disk space is typically limited.
Projects should be created in their own file system.
2.2.2.1 Creating the Projects File System
On most operating systems, it is possible to create separate file systems at non-root levels as a separate
file system for the Projects subdirectory within the DataStage installation, using the following
guidelines:

It is recommended that a separate file system be created and mounted over the default location
for projects, the $DSROOT/Projects directory. Mount this directory after installing DSEE but
before projects are created.

The Projects directory should be a mirrored file system with sufficient space (minimum 100MB
per project).

For cluster or Grid implementations, it is generally best to share the Project file system across
servers (at the same mount point).
IMPORTANT: The project file system should be monitored to ensure adequate free space
remains. If the Project file system runs out of free space during DataStage activity, the
repository may become corrupted, requiring a restore from backup.
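Because running out of space can corrupt the repository, many sites script this check. The following is a minimal sketch; the mount point /ds/projects and the 90% threshold are assumptions to adapt for your site.

```shell
#!/bin/sh
# Sketch: warn when the Projects file system is nearly full.
# /ds/projects and the 90% threshold are illustrative assumptions.

fs_used_pct() {
  # print the use% of the file system holding $1 (POSIX df -P output, line 2, field 5)
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

check_projects_fs() {
  used=$(fs_used_pct "$1")
  if [ "$used" -ge "${2:-90}" ]; then
    echo "WARNING: $1 is ${used}% full - repository corruption risk"
  fi
}

if [ -d /ds/projects ]; then
  check_projects_fs /ds/projects 90   # e.g. run from cron every 15 minutes
fi
```

A script like this would typically mail the DataStage administrator rather than just print a warning.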

Effective management of space is important to the health and performance of a project. As jobs are added to a project, new directories are created in this file tree, and as jobs are run, their log entries multiply. These activities stress the file system (for example, more time to insert or delete DataStage components, and longer update times for logs). Failure to perform routine project maintenance (for example, removing obsolete jobs and managing log entries) can cause project obesity and performance issues.


The name of a DataStage project is limited to a maximum of 18 characters and can contain alphanumeric characters and underscores.
2.2.2.2 Project Recovery Considerations
Devising a backup scheme for project directories is based on three core issues:
1. Will there be valuable data stored in Server Edition hash files¹? DataStage Server Edition files located in the DataStage file tree may require archiving from a data perspective.
2. How often will the UNIX file system containing the ENTIRE DataStage file tree be backed up? When can DataStage be shut down to enable a cold snapshot of the Universe database as well as the project files? A complete file system backup while DataStage is shut down accomplishes this.
3. How often will the projects be backed up? Keep in mind that the granularity of project backups determines the ability to recover lost work should a project or a job become corrupted.
At a minimum, a UNIX file system backup of the entire DataStage file tree should be performed at least weekly with the DataStage engine shut down, and each project should be backed up with the Manager client at least nightly with all users logged out of DataStage. This is the equivalent of a cold database backup and six nightly updates.
If your installation has valuable information in Server hash files, you should increase the frequency of
your UNIX backup OR write jobs to unload the Server files to external media.
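A minimal sketch of the weekly cold backup follows. It assumes the /.dshome convention for locating the engine and a /backup target directory (both assumptions); verify the uv -admin stop/start syntax against your release before use.

```shell
#!/bin/sh
# Sketch: weekly cold backup of the entire DataStage file tree.
# /backup is an assumed target directory; adapt for your site.

backup_name() {
  # build a date-stamped archive name, e.g. /backup/dsee_weekly_20071029.tar
  echo "/backup/dsee_weekly_$(date +%Y%m%d).tar"
}

cold_backup() {
  DSHOME=$(cat /.dshome) || return 1
  "$DSHOME/bin/uv" -admin -stop            # shut the engine down (cold)
  tar -cf "$(backup_name)" "$DSHOME/.."    # back up the whole file tree
  "$DSHOME/bin/uv" -admin -start           # restart the engine
}

# cold_backup   # uncomment to run on the DataStage server, as dsadm or root
```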
2.2.3 Data Set and Sort Directories
The DataStage installer creates the following two subdirectories within the DataStage install directory:
Datasets/ - stores individual segment files of DataStage parallel Data Sets
Scratch/ - used by the parallel framework for temporary files such as sort and buffer overflow
Avoid using these directories, and consider deleting them to ensure they are never used. This is best done immediately after installation; be sure to coordinate this standard with the rest of the team.
DataStage parallel Configuration files are used to assign resources (such as processing nodes, disk and
scratch file systems) at runtime when a job is executed.

¹ Note that the use of Server Edition components in an Enterprise Edition environment is discouraged for performance and maintenance reasons. However, if legacy Server Edition applications exist, their corresponding objects may need to be taken into consideration.


The DataStage installer creates a default parallel configuration file (Configurations/default.apt) that references the Datasets and Scratch subdirectories within the install directory. The DataStage Administrator should consider removing the default.apt file altogether, or at a minimum updating this file to reference the file systems you define (below).
2.2.3.1 Data and Scratch File Systems
It is a bad practice to share the DataStage install and Projects file systems with volatile files like scratch
files and Parallel data set segment files. Resource, scratch and sort disks service very different kinds of
data with completely opposite persistence characteristics. Furthermore, they compete directly with
each other for I/O bandwidth and service time if they share the same path.
Optimally, these file systems should not have any physical disks in common and should not share any
physical disks with databases. While it is often impossible to allocate contention-free storage, it must
be noted that at large data volumes and/or in highly active job environments, disk arm contention can
and usually does significantly constrain performance.
NOTE: For optimal performance, file systems should be created in high performance, low
contention storage. The file systems should be expandable without requiring destruction and recreation.

2.2.3.2 Data Sets


Parallel Data Sets are used for persistent data storage in parallel, in native DSEE format. The
DataStage developer specifies the location of the Data Set header file, which is a very small pointer to
the actual data segment files that are created by the DSEE engine, in the directories specified by the
disk resources assigned to each node in the parallel Configuration file. Over time, the Data Set segment
file directory(-ies) will grow to contain dozens to thousands of files depending on the number of
DataStage Data Sets used by DSEE jobs.
The need to archive Data Set segment files depends on the recovery strategy chosen by the DataStage
developer, the ability to recreate these files if the data sources remain, and the business requirements.
Whatever archive policy is chosen should be coordinated with the DataStage Administrator and Developers. If Data Set segment files are archived, care should be taken to also archive the corresponding Data Set header files.
2.2.3.3 Sort Space
As discussed, it is a recommended practice to isolate DataStage scratch space and sort space from Data Sets and flat files, because temporary files exist only while a job is running² and because they

² Some files created by database stages persist after job completion. For example, the Oracle .log, .ctl and .bad files will remain in the first Scratch resource pool after a load completes.


are warm files (that is, read and written at above-average rates). Note that sort space must accommodate only the files being sorted simultaneously and, assuming that jobs are scheduled non-concurrently, only the largest of those sorts.
There is no persistence to these temporary sort files, so they need not be archived.
Sizing DataStage scratch space is somewhat difficult. Objects in this space include lookups and intra-process buffers. Intra-process buffers absorb rows at runtime when a stage (or stages) in a partition (or all partitions) cannot process rows as fast as they are supplied. In general, there are as many buffers as there are stages on the canvas for each partition. As a practical matter, assume that scratch space must accommodate the largest volume of data in one job (see the previous formula for Data Sets and flat files). There are advanced ways to isolate buffer storage from sort storage, but this is a performance tuning exercise, not a general requirement.
2.2.3.4 Maintaining Parallel Configuration Files
DataStage parallel Configuration files are used to assign resources (such as processing nodes, disk and
scratch file systems) at runtime when a job is executed. Parallel Configuration files are discussed in
detail in the DataStage Parallel Job Advanced Developers Guide.
Parallel configuration files can be located in any directory that has suitable access permissions and are selected at runtime through the environment variable $APT_CONFIG_FILE. However, the graphical Configurations tool within the DataStage clients expects these files to be stored within the Configurations subdirectory of the DataStage install. For this reason, it is recommended that all parallel configuration files be stored in the Configurations subdirectory, with naming conventions that associate them with a particular project or application.
The default.apt file is created when DataStage is installed and references the Datasets and Scratch subdirectories of the DataStage install directory. To manage system resources and disk allocation, the DataStage administrator should consider removing this file and creating separate configuration files that are referenced by the $APT_CONFIG_FILE setting in each DataStage project.
At a minimum, the DataStage administrator should edit the default.apt configuration file to reference the newly-created Data and Scratch file systems, and ensure that these file systems are used by any other parallel configuration files.
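For illustration, a minimal two-node configuration file referencing dedicated Data and Scratch file systems might look like the following sketch (the host name "etlserver" and the mount points are assumptions for your environment):

```
{
  node "node1" {
    fastname "etlserver"
    pools ""
    resource disk "/ds/data/node1" {pools ""}
    resource scratchdisk "/ds/scratch/node1" {pools ""}
  }
  node "node2" {
    fastname "etlserver"
    pools ""
    resource disk "/ds/data/node2" {pools ""}
    resource scratchdisk "/ds/scratch/node2" {pools ""}
  }
}
```

Spreading the disk and scratchdisk resources across different physical devices per node follows the contention guidance above.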
2.2.4 Extending the DataStage Project for External Entities
It is recommended that another directory structure be created to integrate all aspects of a DataStage application that are managed outside of the DataStage Projects repository. This hierarchy should include directories for secured parameter files, Data Set header files, custom components, Orchestrate schemas, SQL scripts, and shell scripts. It may also be useful to include directories for custom job logs and reports.
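One possible layout is sketched below; the root path and directory names are illustrative, not prescriptive:

```
/ds/app/<project>/
    params/       secured parameter files
    datasets/     Data Set header (.ds) files
    components/   custom components
    schemas/      Orchestrate schema files
    sql/          SQL scripts
    bin/          shell scripts
    logs/         custom job logs and reports
```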


2.2.5 File Staging


It is recommended that a separate Staging file system and directory structure be used for storing,
managing, and archiving various source data files.

2.3 Administrator Tips


2.3.1 Shell environment
Establish a convenient environment variable pointing to the main DataStage directory, and automatically source the DataStage environment, by adding the following three lines to each user's shell profile (.profile, .bashrc, etc.):
dsroot="`cat /.dshome`/.."
export dsroot
. $dsroot/DSEngine/dsenv
Note: The /.dshome file is only created with a standard (non-itag) install of DataStage. If you
have installed multiple DataStage engines on a single server (using an itag install) then you
will need to source the appropriate dsenv file for the DataStage environment you are
managing.

2.3.2 Standard DSParams


The DSParams file for each new project is copied from a template. Configure one project as the model for environment variables, sequencer settings, and so on, then copy the DSParams file from the model project's directory to the Template directory. Every new project will then inherit the settings from the DSParams file in the template.
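A sketch of the copy step follows; the model project name MODEL_PROJ and the Projects/Template directory layout are assumptions to verify against your install.

```shell
#!/bin/sh
# Sketch: promote a model project's DSParams as the template for new projects.

copy_dsparams() {
  # $1 = model project directory, $2 = template directory
  cp "$1/DSParams" "$2/DSParams"
}

# dsroot="$(cat /.dshome)/.."
# copy_dsparams "$dsroot/Projects/MODEL_PROJ" "$dsroot/Template"
```

Take a copy of the original template DSParams first, so the install default can be restored if needed.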
2.3.3 Starting / Stopping the DataStage Engine
The DataStage installation configures your system to start the DataStage server main processes (dsrpcd and the EE job monitor, JobMonApp) automatically when the system starts. On UNIX systems, the startup script S99ds.rs is installed in /etc/rc2.d. On Windows systems, the DataStage services are set to start automatically. One exception is a non-root installation; in this case, scripts must be executed by the root user to set up impersonation and autostart.
To manually stop and start the services on Windows, invoke the DataStage Control Panel application from the Windows Control Panel.
To manually stop or start the DataStage engine on UNIX, refer to the Administrator Guide (dsadmgde.pdf), Stopping and Restarting the Server Engine.
2.3.4 Server Will Not Start Because the Port Is Busy
This usually occurs when the server is brought down before all clients have exited.


When a TCP socket is disconnected abruptly, UNIX holds its port in FIN_WAIT state for the length of the FIN_WAIT interval. While the dsrpcd listening port is in this state, the DataStage dsrpcd server will not start.
You can either wait for the FIN_WAIT interval to expire (usually 10 minutes) or, in an emergency, change the setting as root to something like 1 minute. This is a dynamic network parameter and can be set temporarily to a lower value; reset it to the original value once the server starts.
Use one of the following utilities to change this setting:
ndd - Solaris, HP-UX
no - AIX
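For example, the commands below sketch how the interval might be lowered temporarily. Parameter names and units vary by platform and OS release, so verify them against vendor documentation before use, and run as root:

```
# Solaris / HP-UX (value in milliseconds):
ndd -set /dev/tcp tcp_time_wait_interval 60000

# AIX (value in 15-second units):
no -o tcp_timewait=4      # roughly 1 minute
```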
2.3.5 Universe Shell
The DataStage server engine is based on Universe, a complete application environment containing a shell, file types, a programming language, and many facilities for application operations such as lock management. To invoke the Universe shell, the DataStage environment variables must be set; this is easily done by sourcing the dsenv file in $DSHOME. To start the shell, use these commands:
cd $DSHOME
bin/uvsh
2.3.6 Resource Locks
If a developer is working on a job in the Designer and there is a network or client machine failure, the job will remain locked. A locked job must be cleared before it can be accessed by any DataStage component. Locks can be cleared from the DataStage Director pull-down menu Job -> Cleanup Resources, which opens the Job Resources interface.


Figure 5: Releasing locks from Director


To release a locked item:
1. Select Show All in the Processes pane.
2. Select Show All in the Locks pane.
3. Locate the item ID you wish to unlock and note the PID/User#. For example, LdProductLocationJob2 has a PID of 25237.
4. Locate the PID in the Processes pane and select the row.
5. Release the lock by clicking the Logout button. This kills the process holding the lock, thus releasing it.
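When the Director client is unavailable, locks can also be examined from the Universe shell on the server. The following transcript is a sketch only (the project name is hypothetical, and command availability varies by release); consult the Administrator Guide before unlocking anything in a production project:

```
cd $DSHOME
bin/uvsh
>LOGTO DSTAGE_PROJECT        (log to the project account)
>LIST.READU EVERY            (list file and record locks, with user numbers)
>UNLOCK USER 25237 ALL       (release all locks held by that user number)
>QUIT
```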

2.4 Performance Monitoring


Monitoring the performance of DataStage EE jobs can be done using many different utilities, ranging
from the DataStage EE Job Monitor and environment variables to operating system monitoring utilities
and database monitoring utilities.
When evaluating the performance of a DataStage job, various metrics such as rows per second, elapsed time, CPU utilization, memory utilization, and I/O rates are used. This section describes the tools used to collect these metrics; it does not consider database performance tuning or monitoring.
Due to the nature of the parallel framework, where a flow runs only as fast as its slowest component or system resource, the person evaluating performance should use the tools in this section to identify processing bottlenecks. These bottlenecks could be related to job design or the operating
environment. Once the processing bottleneck is discovered, action can be taken to improve performance. For example, if a job appears to be running slowly with no indication of a CPU, I/O, or memory bottleneck, performance could be improved by creating more logical processing nodes in the DataStage EE configuration file, or the job may need to be redesigned. As parallelism is increased, more system resources are utilized, and the system itself may become the gating factor of performance. The remedy may be to increase system resources, such as adding more CPUs or spreading I/O across other physical devices and controllers.
2.4.1 DataStage EE Job Monitor
The DataStage EE job monitor (JobMonApp) provides a useful snapshot of a job's performance at a moment of execution, but does not provide thorough performance metrics. That is, a JobMonApp snapshot should not be used in place of a full run of the job, or a run with a sample set of data. Due to buffering and to some jobs' semantics, a snapshot image of the flow may not be a representative sample of performance over the course of the entire job.
The CPU Summary information provided by JobMonApp is useful as a first approximation of where time is being spent in the flow. However, it will not show operators that were inserted by the parallel framework, such as sorts that were not explicitly included and sub-operators of composites.
2.4.2 Performance Metrics with DataStage EE Environment Variables
There are a number of environment variables that direct DataStage parallel jobs to report detailed runtime information, enabling you to determine where time is being spent, how many rows were processed, and how much memory each stage instance used during a run. Setting these environment variables also reports on operators that were inserted by the parallel framework, such as implicitly inserted sorts, buffer operators, and sub-operators of composites.
APT_PM_PLAYER_MEMORY
Setting this variable causes each player process to report the process heap memory allocation in the job
log when the operator instance completes execution.
Example of player memory:
APT_CombinedOperatorController,0: Heap growth during runLocally(): 1773568 bytes
APT_PM_PLAYER_TIMING
Setting this variable causes each player process to report its call and return in the job log. The message
with the return is annotated with CPU times for the player process.
Example of player timings, showing the elapsed time of the operator and the amount of user and system time, as well as total CPU:
APT_CombinedOperatorController,0: Operator completed. status: APT_StatusOk elapsed: 0.30 user: 0.02 sys: 0.02 (total
CPU: 0.04)

APT_RECORD_COUNTS
Setting the variable causes DataStage to print to the job log, for each operator player, the number of
records input and output. Abandoned input records are not necessarily accounted for. Buffer operators
do not print this information.
Example of record counts that shows the number of rows processed for the input link and output link
for partition 0 of the Sort_3 stage.
Sort_3,0: Input 0 consumed 5000 records.
Output 0 produced 5000 records.
APT_PERFORMANCE_DATA
APT_PERFORMANCE_DATA, or the osh -pdd <performance data directory> advanced runtime option, allows you to capture raw performance data for every underlying job process at runtime.
As a job parameter, set $APT_PERFORMANCE_DATA = dirpath, where dirpath is a directory on the DataStage server in which to capture performance statistics. This creates an XML document named performance.<pid> in the specified directory. You can influence the name of the file by specifying the osh -jobid <jobid> advanced runtime option; the performance XML document will then be named performance.<jobid>.
The following XML header shows the detailed performance data captured in each record. Note that this
information is more detailed than the higher-level information captured by DSMakeJobReport and
includes information on all of the processes (including Buffer operators and framework-inserted sorts):
<?xml version="1.0" encoding="ISO-8859-1" ?>
<performance_output version="1.0" date="20050111 16:29:00"
framework_revision="7.5.0" job_ident="202416">
<layout delimiter=",">
<field name="TIME"/>
<field name="PARTITION_NUMBER"/>
<field name="PROCESS_NUMBER"/>
<field name="OPERATOR_NUMBER"/>
<field name="IDENT"/>
<field name="JOBMON_IDENT"/>
<field name="PHASE"/>
<field name="SUBPHASE"/>
<field name="ELAPSED_TIME"/>
<field name="CPU_TIME"/>
<field name="SYSTEM_TIME"/>
<field name="HEAP"/>
<field name="RECORDS"/>
<field name="STATUS"/>
</layout>
<run_data>


Starting with release 7.5, the Perl script performance_convert, located in the directory $APT_ORCHHOME/bin, can be used to convert the raw performance data into other usable formats, including:
- CSV text files
- detail Data Sets
- summary Data Sets (summarizing the total time and maximum heap memory usage per operator)
The syntax is:
perl $APT_ORCHHOME/bin/performance_convert inputfile output_base [-schema|-dataset|-summary] [-help]
where
inputfile - location of the performance data to convert
output_base - location and file prefix for all generated files
(e.g., /mydir/jobid -> /mydir/jobid.CSV)
2.4.3 iostat
iostat is useful for examining the throughput of various disk resources. If one or more disks have high
throughput, understanding where that throughput is coming from is vital. If there are spare CPU cycles,
IO is often the culprit. iostat can also help a user determine if there is excessive IO for a specific job.
The specifics of iostat output vary slightly from system to system. Here is an example from a Linux
machine which shows a relatively light load:
(The first set of output is cumulative data since the machine was booted)
$ iostat 10
Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
dev8-0           13.50       144.09       122.33   346233038   293951288

Every N seconds (10 in the command-line example) iostat outputs:

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
dev8-0            4.00         0.00        96.00           0          96
2.4.4 vmstat
vmstat is useful for examining system paging. Ideally, an EE flow, once it begins running, should never page to disk (si and so should be zero). Paging suggests that EE is consuming too much total memory.
$ vmstat 1
procs           memory               swap       io     system      cpu
 r  b  w   swpd  free  buff  cache  si so   bi bo   in cs   us sy id
 0  0  0  10692 24648 51872 228836   0  0    0  1    2  2    1  1  0
vmstat produces the following every N seconds:

0 0 0 10692 24648 51872 228836   0 328   41 1 0 99

mpstat produces a similar report for each processor of an SMP.


2.4.5 Load Average
Ideally, each flow should consume as much CPU as is available. The load average on the machine should be 2-3x the number of processors (an 8-way SMP should have a load average of roughly 16-24). Some operating systems, such as HP-UX, show per-processor load average; in this case the load average should be 2-3, regardless of the number of CPUs on the machine.
If the machine isn't CPU-saturated, a bottleneck may exist elsewhere in the flow. Over-partitioning may be a useful strategy in these cases.
If the flow pegs the machine, then the flow is likely CPU-limited, and if performance isn't adequate, some determination needs to be made as to where the CPU time is being spent. See the next section (2.4.6) for monitoring individual processes.
The commands top and uptime can report the load average, and xload can provide a histogram of the load average over time. Interactive monitors (top, topas, nmon) give a real-time view of the system and are extremely useful for evaluating a system's performance.
2.4.6 How to Monitor DataStage EE Processes
Refer to Appendix A: Processes Created at Runtime by DataStage EE for diagrams of processes
created by DataStage.
The player process identifiers (PIDs) of a job can be identified by setting the environment variable APT_PM_PLAYER_PID=TRUE. This produces messages in the job log correlating each instance of an operator with its PID.
You can also identify the processes without using APT_PM_PLAYER_PID by looking for processes that are running the osh or phantom programs. osh, the Orchestrate shell, is the main program of the parallel framework; all parallel job execution (that is, section leaders and players) is spawned from this program. osh processes are started on all physical processing nodes participating in a job's execution. Phantom is the name of the process spawned by DataStage for job control, that is, Job Sequencers. Phantom processes run only on the conductor node. When you invoke a job from DataStage, it first starts a phantom process, which controls and monitors the overall execution of the job; the phantom then invokes osh. Phantom processes can also spawn other child phantoms if your job control invokes child Job Sequencers.
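A sketch of scanning the process table for these programs follows. The filter function is illustrative, and exact ps output format varies by platform:

```shell
#!/bin/sh
# Sketch: list DataStage parallel-framework and job-control processes.

ds_procs() {
  # filter a ps listing for osh (parallel framework) and phantom (job control)
  grep -E '(osh|phantom)' | grep -v grep
}

ps -ef | ds_procs || true   # may print nothing if no jobs are running
```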


2.4.7 Engine Processes and System Resources


Refer to Appendix A: Processes Created at Runtime by DataStage EE for diagrams of the processes created by DataStage.
The DataStage server engine program is actually called dsrpcd, a daemon that manages connections to DataStage projects. dsrpcd uses semaphores and shared memory segments when it is operating; those used by dsrpcd are prefixed by the string 0xade. The UNIX command ipcs will produce a list of the semaphores and shared memory segments used by DataStage.
When a user logs into DataStage, dsrpcd spawns two processes for the session: dsapi_client and dsapi_slave. These manage all interactions with the DataStage project. One way to force a user to log off from DataStage is to kill the dsapi_slave process. Note that on UNIX the dsapi_slave process is identified as dscs.
2.4.8 Disk Space Used by DataStage
DataStage utilizes disk space in a number of places.
Within a DataStage project the following will grow over time and need to be purged on a regular basis.

Job Logs - Purge job logs by setting up a purge policy through the DataStage Administrator. This standard practice emphatically recommends setting a purge policy to avoid filling the project file system.

&PH& - This is a directory in each project used for the stderr and stdout of phantom processes. Each job execution creates a file in this directory, so over time the directory grows and should be cleaned on a regular basis to avoid filling up the project file system. Typical file size is less than 1K; files larger than 1K indicate a problem with a job. In the event of a hard crash of a job, examining the DSD.RUN* files may provide useful information in explaining the problem.

$TMPDIR - This environment variable tells DataStage where to write temporary files created by the parallel framework, such as the job score and Lookup stage temp files. This directory is cleaned up automatically by the parallel framework; however, hard crashes may leave stranded files in it. You can identify DataStage EE temp files by looking for file names that begin with APT. The default $TMPDIR is /tmp; performance improvements can be achieved by setting $TMPDIR to a faster file system.

Scratch Identifying scratch space is done so by examining the APT_CONFIG_FILE. Scratch


is used for sort and buffer overflow files. These files are temporary and are managed by the
framework. One can judge how a job is performing by examining the number of files that are

IBM IPS Parallel Framework: Administration and Production Automation

October 29, 2007

23 of 72

2007 IBM Information Integration Solutions. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a
retrieval system, or translated into any language in any form by any means without the written permission of IBM.

IBM Information Platform and Solutions


Center of Excellence

created in the scratch area. For example, if there is a bottleneck in a process that fork-joins, buffer overflow files will be written to scratch; the more files, the more buffering.
- Datasets - identify the directories used by data sets by examining the APT_CONFIG_FILE, by using the orchadmin command-line tool, or via Tools -> Data Set Management from the DataStage GUIs.
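The &PH& housekeeping described above can be scripted with find. The sketch below runs against a throwaway directory for illustration; in practice PH_DIR would point at the project's &PH& directory, and the 7-day age threshold is an assumed site policy, not a product requirement.

```shell
# Illustrative setup: a scratch stand-in for a project's &PH& directory.
PH_DIR=$(mktemp -d)
printf 'ok\n' > "$PH_DIR/DSD.RUN_old"
printf 'ok\n' > "$PH_DIR/DSD.RUN_new"
touch -t 200001010000 "$PH_DIR/DSD.RUN_old"   # pretend this file is years old

# Flag files larger than 1 KB: usually a sign of a problem job, worth
# reading before deletion.
find "$PH_DIR" -type f -size +1k -exec echo "Investigate large phantom log: {}" \;

# Purge files older than 7 days (adjust -mtime to site policy).
find "$PH_DIR" -type f -mtime +7 -exec rm -f {} \;

ls "$PH_DIR"
```

The same pattern applies to stranded APT* files under $TMPDIR after a hard crash.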
2.5 Security, Roles, DataStage User Accounts


In general, initial access to DataStage projects is enforced by operating system security: both logging into a project through DataStage Designer and the read, write, execute, and delete permissions on the project directory.
As a first level of security, administrators should leverage operating system groups to grant and deny access to a DataStage project. That is, for each project create an operating system group (the group name should be the same as the project), assign that group to the project directory (chown), and grant users access to the project by making them members of the project's group. This gives users the authorization to log into and manage objects in the project.
As a second level of control, administrators should assign DataStage roles (see below) to the groups that have access to the project. This limits what users can do within DataStage, such as creating, compiling, and running jobs.
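The first-level (operating system) setup can be sketched as follows. The real commands must be run as root and the project name "dstage1" is hypothetical; the runnable portion simulates the permission change on a temporary directory using the caller's own group.

```shell
# Real setup, run as root (names are placeholders):
#   groupadd dstage1
#   usermod -a -G dstage1 some_developer
#   chown dsadm:dstage1 /var/Projects/dstage1
#   chmod 770 /var/Projects/dstage1

# Simulation on a temporary directory, keeping the caller's own group:
PROJ_DIR=$(mktemp -d)
chmod 770 "$PROJ_DIR"          # group members get full access; others get none
perms=$(ls -ld "$PROJ_DIR" | cut -c1-10)
echo "Project directory permissions: $perms"
```

Mode 770 is the key point: members of the project group can work in the project, while all other users are denied even read access.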
2.5.1 DataStage Roles
DataStage security is based on operating system groups. When creating a DataStage project, consider limiting access to the project by creating an operating system group and assigning that group as the owner of the DataStage project directory. Then make operating system user IDs members of the group, and grant roles to users from the DataStage Administrator as described below.
The following is excerpted from the DataStage Administrator Guide (dsadmgde.pdf):
To prevent unauthorized access to DataStage projects, you must assign the users on your system to the appropriate DataStage user category. To do this, you must have administrator status. You can do many of the administration tasks described in this section if you have been defined as a DataStage Developer or a DataStage Production Manager; you do not need to have specific administration rights. However, to do some tasks you must be logged on to DataStage using a user name that gives you administrator status:
For Windows servers: You must be logged on as a member of the Windows Administrators group.
For UNIX servers: You must be logged in as root or the DataStage administrative user (dsadm by
default).
You require administrator status, for example, to change license details, add and delete projects, or to
set user group assignments.
There are four categories of DataStage user:
- DataStage Developer, who has full access to all areas of a DataStage project
- DataStage Production Manager, who has full access to all areas of a DataStage project, and can also create and manipulate protected projects. (Currently on UNIX systems the Production Manager must be root or the administrative user in order to protect or unprotect projects.)
- DataStage Operator, who has permission to run and manage DataStage jobs
- <None>, who does not have permission to log on to DataStage.

You cannot assign individual users to these categories; you must assign the operating system user group to which the user belongs. For example, a user with the user ID peter belongs to a user group called clerks. To give DataStage Operator status to user peter, you must assign the clerks user group to the DataStage Operator category.
Note: When you first install DataStage, the Everyone group is assigned to the category DataStage
Developer. This group contains all users, meaning that every user has full access to DataStage. When
you change the user group assignments, remember that these changes are meaningful only if you also
change the category to which the Everyone group is assigned.
2.5.2 User Environment
It is common for DataStage developers and administrators to utilize the UNIX or Windows command line. For this reason the DataStage user's account should be configured with the proper environment variables.
All users should have these lines added to their login profile [3]:
dsroot="`cat /.dshome`/.."
export dsroot
. $dsroot/DSEngine/dsenv

Add these lines to the end of $DSHOME/dsenv:


APT_ORCHHOME=$DSHOME/../PXEngine
export APT_ORCHHOME
APT_CONFIG_FILE=$DSHOME/../Configurations/default.apt
export APT_CONFIG_FILE
PATH=$APT_ORCHHOME/bin:$PATH
export PATH
LD_LIBRARY_PATH=$APT_ORCHHOME/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

To Configure an Orchestrate User:

The following steps explain in detail how to configure the DataStage user's environment; in most cases the steps described above in User Environment are sufficient.
1 In your .profile, .kshrc, or .cshrc, set the APT_ORCHHOME environment variable to the directory in which
Orchestrate is installed. This is either the default, /ascential/apt, or the directory you have defined as part
of the installation procedure.

[3] As noted earlier, the /.dshome file is only created on default (non-itag) installs.

2 Add $APT_ORCHHOME/bin to your PATH environment variable. This is required for access to all scripts, executable files, and maintenance commands.
3 Add $APT_ORCHHOME/osh_wrappers and $APT_ORCHHOME/user_osh_wrappers to your PATH environment variable. This is required for access to the osh operators.
4 Make sure LIBPATH has been set to /usr/lib:/lib:$APT_ORCHHOME/lib:$APT_ORCHHOME/user_lib, followed by any additional libraries you need.
5 Optionally, add the path to the C++ compiler to your PATH environment variable. Orchestrate
requires that the compiler be included in PATH if you will use the buildop utility or develop and run
programs using the Orchestrate C++ interface.
6 Add the path to the dbx debugger to your PATH variable to facilitate error reporting. If an internal
execution error occurs, Orchestrate attempts to invoke a debugger in order to obtain a stack traceback
to include in the error report; if no debugger is available, no traceback will be generated.
7 By default, Orchestrate uses the directory /tmp for some temporary file storage. If you do not want to
use this directory, assign the path name to a different directory through the environment variable
TMPDIR.
You can additionally assign this location through the Orchestrate environment variable
APT_PM_SCOREDIR.
8 Make sure you have write access to the directories $APT_ORCHHOME/user_lib and
$APT_ORCHHOME/user_osh_wrappers on all processing nodes.
9 If your system connects multiple processing nodes by means of a switch network in an MPP, set
APT_IO_MAXIMUM_OUTSTANDING which sets the amount of memory in bytes reserved for Orchestrate on
every node communicating over the network. The default setting is 2 MB. Ascential Software suggests
setting APT_IO_MAXIMUM_OUTSTANDING to no more than 64 MB (67,108,864 bytes). If your job fails with
messages about broken pipes or broken TCP connections, reduce the value to 16 MB (16,777,216 bytes). In
general, if TCP throughput is so low that there is idle CPU time, increment this variable (by doubling) until
performance improves. If the system is paging, the setting is probably too high.
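The doubling rule in step 9 can be expressed as a small loop: start from the 2 MB default and double while throughput remains poor, never exceeding the suggested 64 MB ceiling. The "throughput still low" check below is a placeholder; in practice you would re-run the job and look for idle CPU alongside low TCP throughput.

```shell
# Start at the 2 MB default; cap at the suggested 64 MB maximum.
bytes=2097152
max=67108864

throughput_still_low() {
    # Placeholder for a real measurement. Always returns "yes" here so
    # the loop runs to the cap for demonstration.
    return 0
}

while [ "$bytes" -lt "$max" ] && throughput_still_low; do
    bytes=$((bytes * 2))
done

export APT_IO_MAXIMUM_OUTSTANDING=$bytes
echo "APT_IO_MAXIMUM_OUTSTANDING=$APT_IO_MAXIMUM_OUTSTANDING"
```

Remember the converse rule as well: if jobs fail with broken-pipe or broken-TCP-connection messages, reduce the value (for example to 16 MB), and if the system starts paging, the setting is too high.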

2.6 The DataStage Administrator Project Configuration


This section describes standard practices for configuring a project.

In most cases the DataStage project should be set not to time out. This will prevent developers from losing unsaved work. However, you may find that you require an inactivity timeout, due to careless developers who leave inactive sessions open for days.

All projects should have "Enable job administration in Director" turned on. This allows developers to unlock jobs and clear a job's status file. Runtime column propagation is an extremely powerful feature of the parallel framework that allows reuse and efficient processing, and should be turned on. Auto-purge of the job log should be configured; this will help keep disk space usage on the project file system under control.

By default DataStage grants the Developer Role to all groups. You should restrict the DataStage
Developer and Production Manager roles to only trusted users.

This standard practice recommends always enabling "Automatically handle activities that fail". The other options are optional; "Add checkpoints so sequence is restartable on failure" should be configured only if this is an acceptable approach to checkpoint restart.

The "Generated OSH visible for Parallel jobs in ALL projects" option should be checked.

3 Job Monitor
The DataStage job monitor provides the capability for collecting and reporting performance metrics. It must be running in order for the Audit & Metrics system (below) to function. The job monitor may impact system performance; it can be tuned or shut off using the environment variables described below.

3.1 Configuration
The job monitor uses two TCP ports, which are chosen during installation. These should be entered in /etc/services as a manual step.
Entries should be made in the /etc/services file to protect the sockets used by the job monitor. The default socket numbers are 13400 and 13401, and the entries in this file may look like this:

dsjobmon        13400/tcp
dsjobmon        13401/tcp
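A quick verification that both ports are registered can be scripted. The sketch below greps a sample fragment written to a temporary file so it is self-contained; in practice you would grep the real /etc/services.

```shell
# Sample fragment standing in for /etc/services.
services=$(mktemp)
cat > "$services" <<'EOF'
dsjobmon        13400/tcp
dsjobmon        13401/tcp
EOF

# Verify that both job monitor ports are registered.
missing=0
for port in 13400 13401; do
    if grep -q "${port}/tcp" "$services"; then
        echo "port ${port}/tcp registered"
    else
        echo "port ${port}/tcp MISSING"
        missing=1
    fi
done
rm -f "$services"
```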

3.2 Job Monitor Environment Variables


The job monitor is controlled using the following environment variables. Standard practice in large-volume data environments is to set APT_MONITOR_SIZE to about 10000 and to turn off APT_MONITOR_TIME with $UNSET.
For an explanation of time-based versus row-based monitoring, see the Job Monitor section (page 31) of the Parallel Job Advanced Developer's Guide (advpx.pdf).
APT_MONITOR_SIZE
Determines the minimum number of records the DataStage Job Monitor reports. The default is 5000
records.
APT_MONITOR_TIME
Determines the minimum time interval in seconds for generating monitor information at runtime. The
default is 5 seconds. This variable takes precedence over APT_MONITOR_SIZE.
APT_NO_JOBMON
Turns off job monitoring entirely.
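Applying the large-volume standard practice above from a shell environment amounts to the following (in the DataStage Administrator the same effect is achieved by setting APT_MONITOR_TIME to $UNSET):

```shell
# Report progress every 10000 rows instead of the 5000-row default...
APT_MONITOR_SIZE=10000
export APT_MONITOR_SIZE

# ...and disable time-based reporting so the row-count threshold governs.
unset APT_MONITOR_TIME

echo "APT_MONITOR_SIZE=$APT_MONITOR_SIZE"
```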

3.3 Starting & Stopping the Monitor


The monitor is normally started and stopped with the DataStage server engine. The root user has permission to stop and start the job monitor using these commands:
$DSHOME/../PXHOME/java/jobmoninit stop
$DSHOME/../PXHOME/java/jobmoninit start

IBM IPS Parallel Framework: Administration and Production Automation

October 29, 2007

30 of 72

2007 IBM Information Integration Solutions. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a
retrieval system, or translated into any language in any form by any means without the written permission of IBM.

IBM Information Platform and Solutions


Center of Excellence

3.4 Monitoring jobmon


The existence of the job monitor process can be detected by looking for the JobMonApp string in the output of the ps command. For example:

ps -ef | grep JobMonApp

will produce rather long output, but you will be able to identify the process number:

root 6700 1 0 Mar24 ? 00:00:01 /var/dsadm/Ascential/DataStage/DSEngine/../PXEngine/java/jre/bin/java -classpath /var/dsadm/Ascential/DataStage/DSEngine/../PXEngine/java/JobMonApp.jar:/var/dsadm/Ascential/DataStage/DSEngine/../PXEngine/java/xerces/xercesImpl.jar:/var/dsadm/Ascential/DataStage/DSEngine/.
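The check above can be wrapped into a small health probe. This is a sketch; the restart path in the comment comes from the jobmoninit commands in the previous section, and on a machine without DataStage the probe simply reports that the monitor is not running.

```shell
# Check whether the job monitor is up; the [J] bracket trick keeps
# grep from matching its own command line.
if ps -ef 2>/dev/null | grep -q '[J]obMonApp'; then
    jobmon_status=running
    echo "Job monitor is running."
else
    jobmon_status=stopped
    echo "Job monitor is NOT running."
    # As root, restart it with (path from the section above):
    #   $DSHOME/../PXHOME/java/jobmoninit start
fi
```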

4 Backup / Recovery / Replication/ Failover Procedures


The DataStage environment should be backed up using your site's system backup tools. It is also possible to back up DataStage using archive tools such as tar and zip. In general you should protect the DataStage environment with a combination of full and incremental backups, at a frequency that is sufficient to minimize loss of work (disk crash) and minimize recovery time and effort.
In order to properly back up and promptly recover a DataStage installation and the applications developed with DataStage, you must identify the files and file systems that are required by the DataStage application. Minimal backup protection requires that the DataStage conductor and projects be backed up by system backup on a regular basis.
It is likely that external entities are closely integrated with applications developed with DataStage and will need to be backed up as well. If your site has standardized the directory structure for external entities, then identifying them for backup is straightforward. Otherwise, identification is a cumbersome ad-hoc exercise.

4.1 DataStage Conductor Backup


Also known as the DataStage installation directory, the conductor directory contains the DataStage core product software and configuration. It is critical that it is protected by regularly scheduled full and incremental backups.
Location path: ../Ascential/DataStage
Events that result in changes to the DataStage conductor files and directories include creating and deleting projects, installing patches to the engine, and manual modifications to files or subdirectories.
The DataStage installation creates the following subdirectories under ../Ascential/DataStage: Scratch, Datasets, and Projects. These directories are used to store volatile files and warrant special considerations. The project file system may be a separate file system, as recommended by the install and upgrade standard practices. See the section below for details related to backing up the Projects directory. Consider not backing up the Scratch and Datasets directories.
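A conductor backup that skips the volatile directories can be sketched with tar. The block below builds a mock tree so it is self-contained; in practice you would point at the real install root, and note that --exclude is GNU tar syntax.

```shell
# Mock install tree standing in for ../Ascential/DataStage.
root=$(mktemp -d)
mkdir -p "$root/DataStage/DSEngine" "$root/DataStage/Projects/dev" \
         "$root/DataStage/Scratch"  "$root/DataStage/Datasets"
echo conf > "$root/DataStage/DSEngine/dsenv"
echo proj > "$root/DataStage/Projects/dev/DSParams"
echo tmp  > "$root/DataStage/Scratch/tsort001"

# Full backup of the conductor tree, skipping the volatile directories.
tar -czf "$root/ds_backup.tar.gz" \
    --exclude=Scratch --exclude=Datasets \
    -C "$root" DataStage

contents=$(tar -tzf "$root/ds_backup.tar.gz")
echo "$contents"
```

The archive keeps DSEngine and Projects while Scratch and Datasets, whose contents are managed by the framework, are omitted.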

4.2 DataStage Project Backups


The location of a DataStage project can be determined when the project is created, by specifying a path. The default location is $DSHOME/Ascential/DataStage/Projects. The DataStage Projects directory will contain a subdirectory for each project. It is a useful practice to use the default project location, or to standardize on one location for all projects created on the system, because doing so simplifies identifying the location of projects for backup.
One can determine the path of a project through the DataStage Administrator.

The DataStage Repository file UV.ACCOUNT contains the directory paths for each project. This file
can be queried by the command:
echo "SELECT * FROM UV.ACCOUNT;" | bin/uvsh

DataStage projects should be protected by both full and incremental system backups, performed at
regular intervals (daily, hourly) that minimize exposure to a crash.
Special consideration should be given to development projects, since these are where developers will be saving work throughout the day. Developers and administrators should be aware that, in the event of a catastrophic storage system failure, work saved between backups could be lost.
It is best to back up the system, especially projects, when jobs are not running or when developers are not on the system. Due to the dynamic nature of a DataStage repository and its multi-file structure, there is a potential for a hot backup to contain an inconsistent view of the repository. This situation exists in almost all modern databases (except single-file ones): because the database is made up of many files that are updated at different times, getting a consistent view of all these files with a hot backup is difficult without complex solutions such as breaking volume mirrors.
Avoid storing volatile files in a DataStage project to prevent the waste of time and space required for
the project backup.
Consider locating non-volatile external entities in the project, to provide a convenient method for backing up external entities that are related to the project.
Consider the DataStage job log purge policy. In order to maximize backup efficiency, set a log retention policy that purges shortly after a backup, without erasing entries before they are backed up. For example, if you incrementally back up a project daily, then set the purge policy to every two days. This ensures all log entries are backed up, with minimal overlap.

4.3 DataStage Exports for Partial Back Up


Some customers may choose to rely on DataStage exports for backups. This is not a comprehensive solution and should only be used in conjunction with full and incremental backups of the DataStage installation, DataStage projects, and external entities.
DataStage developers can use exports to cover exposure to gaps in system backups by saving their work in between backups.
You cannot export locked jobs.
Export is a client-based Win32 application. It can be run from the DataStage Manager or using the command-line tools; note that you need to be at the console, because Windows pop-up dialog boxes sometimes appear.
4.4 Datasets, Lookup File Sets and File Sets


Before you begin backing up directories full of data sets and file sets, consider their volatility. These files are often temporary files that do not justify the time and expense related to backing them up.
The parallel framework of DSEE supports three proprietary file types:
1. Persistent data sets (.ds): native EE data types, partitioned to multiple part files.
2. Lookup file sets (.fs): native EE data types with a lookup key structure, one or more partitioned part files.
3. External file sets (.fs): external data types, one or more data files per processing node.
All three are multipart files, consisting of a descriptor file, and one or more data part files. The
descriptor and all part files need to be backed up together. Data Sets can be backed up using any UNIX
backup method so long as BOTH the control file portion and data file portion(s) of the DataSets are
backed up at the same time (and no process is writing, or waiting to write to them). Restoration
requires that the data segment files return to the EXACT location from which they came, while the
control file portion (filename.ds) can be restored anywhere.
Following the standard practice, the descriptor file should be located in a datasets directory for each project ($PROJECT_PLUS/datasets), and the part files will be located on processing nodes, as specified by the EE configuration file (APT_CONFIG_FILE). It is also important to know that the nodes holding the part files must be reflected in the APT_CONFIG_FILE used by any job that reads the data set or file set. Thus, administrators should ensure that the APT configuration files are backed up.
The orchadmin utility allows you to manage persistent data sets and lookup file sets. The utility can be accessed from the DataStage Manager, Designer, and Director by choosing Tools -> Data Set Management, or orchadmin can be invoked from the command line. Note that when using orchadmin from the command line, the user's environment must be configured according to the command-line environment setup for Orchestrate users described earlier.
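Backing up a data set means capturing the descriptor and every part file in one consistent operation, while no process is writing it. The sketch below illustrates this with a mocked layout; in reality the part-file locations come from the descriptor and the resource-disk entries of the APT_CONFIG_FILE, and all names here are hypothetical.

```shell
# Mock layout: a descriptor in the project's datasets directory plus part
# files under per-node resource-disk paths.
base=$(mktemp -d)
mkdir -p "$base/datasets" "$base/node1/resource" "$base/node2/resource"
echo descriptor > "$base/datasets/customers.ds"
echo part0 > "$base/node1/resource/customers.ds.part0"
echo part1 > "$base/node2/resource/customers.ds.part1"

# Archive descriptor AND all part files together, in one operation.
tar -czf "$base/customers_ds.tar.gz" -C "$base" \
    datasets/customers.ds node1/resource node2/resource

count=$(tar -tzf "$base/customers_ds.tar.gz" | grep -c 'customers.ds')
echo "archive members referencing the data set: $count"
```

On restore, remember the asymmetry noted above: the part files must return to their exact original paths, while the descriptor (.ds) file can be restored anywhere.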

4.5 External Entities Scripts, Routines, Staging Files


Account for all scripts that are related to applications developed with DataStage as part of the full and
incremental backup strategy.
Before you begin backing up directories full of Staging and temp files, consider their volatility. These
files are often temporary files that do not justify the time and expense related to backing them up.

4.6 Replicating the DataStage Environment

Parts of the DataStage processing environment will need to be replicated to other physical processors if
you choose to run DataStage in an MPP, Cluster or Grid environment or employ a failover strategy.
4.6.1 Replication for MPP, Cluster, Grid.
When a parallel job is run, in part or as a whole, on one or more physical processing nodes other than the DataStage conductor node, the following two configuration steps need to be performed:
- All or part of the DataStage EE environment needs to be replicated on all processing nodes.
- Ensure that the user account under which jobs will be launched from the conductor node has privileges to rsh to all the other nodes. (DataStage EE can be configured to use ssh.)
The DataStage EE environment includes the DataStage conductor directory (../Ascential/DataStage), project and job-specific object files (Transformer, Build-op, custom routines), and external entities such as third-party libraries. External entities may have specific installation requirements and dependencies; therefore replicating external entities should be done by following the vendor's instructions. Under all remote execution scenarios the $dsroot/PXEngine directory requires replication. Libraries that may be used by a job also live in other directories, such as $dsroot/DSCAPIop and $dsroot/RTIOperators; for this reason it is a standard practice to replicate the entire ../Ascential/DataStage/ directory. This is also relevant for conductor failover. Project directories should be replicated as a standard practice.
There are two methods used to replicate the DataStage environment:
1. Globally cross mounting, usually via NFS
2. Moving a physical copy, by hand or using the copy-orchdist utility
For more details refer to the Install and Upgrade Guide (dsupgde.pdf), "Copying the Parallel Engine to Your System Nodes".
For both replication methods the directory path should be identical. That is, the cross mount or
replicated copy should use the same paths on all systems. For example if DataStage is installed on
/opt/Ascential/DataStage and the project is in /var/Projects/myDataStageProject, the user would see the
same files on all systems of the cluster, when utilizing these paths.
As a standard practice adopting the global cross mount approach to replication is recommended. This
will greatly simplify propagating changes to all physical processors. For example, an upgrade, patch or
new job will be propagated to all nodes automatically.
4.6.2 Distributing Transformers at Runtime
The APT_COPY_TRANSFORM_OPERATOR environment variable can be used to distribute the
transformer shared objects. It is intended to be used in distributed environments, where the project is

IBM IPS Parallel Framework: Administration and Production Automation

October 29, 2007

35 of 72

2007 IBM Information Integration Solutions. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a
retrieval system, or translated into any language in any form by any means without the written permission of IBM.

IBM Information Platform and Solutions


Center of Excellence

not cross mounted. It is not a complete solution, because it does not copy external functions. It will
also add time to job startup.
The effect on the transformer operator is that, when the variable is set to any value, APT_TransformOperator::distributeSharedObj() is called in describeOperator() to distribute the shared object file of the sub-level transform operator.
4.6.3 Installing DataStage EE Engine on a Remote Node
If you plan to use DataStage EE on multiple nodes, you need to ensure that the EE engine components are installed on all the nodes.
1. After your initial installation on the primary conductor, copy the contents of the PXEngine directory over to all the other nodes. For example, if you installed DataStage under /apps/Ascential/DataStage, then the PXEngine directory will be under /apps/Ascential/DataStage/PXEngine. Note that the PXEngine directory has to exist in exactly the same location on all nodes; this can be a symbolic link.
2. Next, add entries in the EE configuration file for all the new nodes.
3. Ensure that the user under whom jobs will be launched from the conductor node has privileges to rsh to all the other nodes. DataStage EE can be configured to use ssh (see the next section).
If you have a large number of nodes, you can use the maintenance menu of the DataStage installer to copy the PXEngine directories to new nodes. Note that you will need to configure rsh access to the nodes before the installation.
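A useful pre-flight check for a replicated environment is to pull every fastname out of the EE configuration file and confirm each node is reachable. The sketch below parses a sample configuration written to a temporary file (the hostnames are hypothetical); a real check would read $APT_CONFIG_FILE and actually run the ssh probe instead of only printing it.

```shell
# Sample EE configuration standing in for $APT_CONFIG_FILE.
conf=$(mktemp)
cat > "$conf" <<'EOF'
{
    node "node1" { fastname "etlhost1" pools "" }
    node "node2" { fastname "etlhost2" pools "" }
}
EOF

# Extract the fastname of every node; each must have PXEngine replicated
# at the same path and must accept rsh/ssh from the conductor's job user.
nodes=$(sed -n 's/.*fastname "\([^"]*\)".*/\1/p' "$conf")
for host in $nodes; do
    echo "would verify: ssh $host true   # and check the PXEngine path"
done
rm -f "$conf"
```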
4.6.4 Replication of the Conductor for Failover/Cluster
DataStage requires one node of any processing environment to be the conductor. The conductor is the machine of the cluster on which the DataStage server runs, and the DataStage server must be running in order for the rest of the environment to function. Therefore, if the conductor node crashes, the DataStage conductor can be moved to and run on a different physical node of the cluster, grid, or MPP.
Replication of the conductor on Windows requires a trick because of the Windows registry: run the normal DataStage server installation on all nodes you intend to run DataStage on, always using the same volume and path. If you are using the cross mount method for replication, the end result is that all systems refer to the same physical copy of the DataStage server.
When building on a Windows cluster:
1. Cross mount the DataStage installation drive (e.g. D:) on all machines.
2. Install on the first machine.
3. Shut down DataStage.
4. Install on another machine, using the same location D: (overwriting the DataStage installation).
Be aware that after DataStage has been put into service, using this method to install on more machines
will overwrite a previously configured installation. Therefore, to install DataStage on additional
systems after any configuration or development has been performed follow these steps:
1. Mount temporary volume D: and perform the install.
2. Stop the DataStage server.
3. Delete the temporary volume D:.
4. Cross mount the installation that is in service as D:.

Warning: starting up and running more than one instance of the DataStage server (dsrpcd) in a cross-mounted configuration will cause corruption in the DataStage conductor repository and the project repositories.
You may either cross mount or make a physical copy of the DataStage conductor ../Ascential/DataStage. The cross mount method is recommended to simplify maintenance such as upgrades, patches, and project creation.
Typically, failover software is utilized to manage start and stop activities of the DataStage server. Because the failover software should be configured to stop and start the DataStage server, you will need to disable normal system startup (S99ds.rc):

$DSHOME/bin/uv -admin -autostart on     (enable auto start)
$DSHOME/bin/uv -admin -autostart off    (disable auto start)

The failover software will need to monitor the DataStage server process dsrpcd and the network port used by dsrpc:

> ps -ef | grep dsrpcd
root 26870 1 0 Mar14 ? 00:00:00 /var/Ascential/DataStage/DSEngine/bin/dsrpcd

> netstat -a | grep dsrpc
tcp 0 0 *:dsrpc *:* LISTEN

If the DataStage server process fails, the failover software should first try to restart it on the primary server. If the primary server is not available, then DataStage should be started on the failover server; you may have more than one failover server. This satisfies failover for the DataStage conductor. Failover and restart for applications developed in DataStage are covered in the production automation section 5.
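The ps and netstat checks above can be combined into the kind of probe the failover software would run on a schedule. This is a sketch; on a machine without DataStage it simply reports both checks as down, and the restart command in the comment is an assumption about the site's start procedure.

```shell
# Probe: is dsrpcd alive, and is the dsrpc port listening?
if ps -ef 2>/dev/null | grep -q '[d]srpcd'; then
    dsrpcd_state=up
else
    dsrpcd_state=down
fi

if netstat -a 2>/dev/null | grep -q 'dsrpc'; then
    dsrpc_port=listening
else
    dsrpc_port=closed
fi

echo "dsrpcd process: $dsrpcd_state, dsrpc port: $dsrpc_port"
# On failure, the failover software would restart the DataStage server
# on the surviving node, e.g. (assumed start procedure, run as dsadm/root):
#   $DSHOME/bin/uv -admin -start
```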

4.7 Important Project File System Considerations


From the operating system's perspective, a DataStage project is a directory in a file system.
The project directory contains files that are managed by DataStage. These files contain metadata
describing DataStage objects such as Jobs, Tables, and Shared Containers, and configuration
information such as DSParams.
Warning: the file systems in which DataStage projects reside should not be allowed to fill to 100%.
If a file system that contains a DataStage project is allowed to fill, it will not be possible to
update the DataStage project repository. Users will not be able to log in, they will lose
unsaved work, and DataStage project files may be corrupted.

In the event that a file system fills, you will not detect any corruption unless the corrupt files in the
project repository are read. You should shut down the DataStage server and run the uvbackup
command, which reads all files. The uvbackup command allows you to back up to /dev/null
on UNIX and NUL on Windows:
$ find $DSHOME/../Projects -print | $DSHOME/bin/uvbackup -f -v -l "FULL
SYSTEM BACKUP" - > /dev/null

To prevent the file system from filling up, use these three practices:
1. Establish an auto-purge policy for your job log files.
2. Do not allow users to create temporary files in the project directory. The project directory
becomes the default directory for the DataStage user, and it is common for developers to use the
project directory for hash files, data files, temp files, etc. Developers should follow the standard
practice of using a file system other than the project file system and always parameterize the
path of the file name. For example: #$ProjectPlus_tmp#/crossRef.dat
3. Monitor the project file systems closely.
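Practice 3 can be sketched as a small monitor that flags a project file system approaching full, assuming POSIX `df -P` output. The threshold and the echo-based warning are illustrative; in practice the WARN line would be fed to the enterprise monitoring agent.

```shell
#!/bin/sh
# Sketch: flag a project file system that is approaching full.
# THRESHOLD and the echo-based warning are assumptions; in practice
# the WARN line would be routed to the enterprise monitoring agent.
THRESHOLD=${THRESHOLD:-90}    # warn at 90% used

# Usage percentage of the file system holding the given path (POSIX df -P).
fs_used_pct() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Emit OK or WARN for one project path.
check_fs() {
    pct=$(fs_used_pct "$1")
    if [ "$pct" -ge "$THRESHOLD" ]; then
        echo "WARN: $1 at ${pct}% used"
    else
        echo "OK: $1 at ${pct}% used"
    fi
}
```

Run a check_fs call per project file system from cron or the monitoring agent's plug-in interface.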


5 Overview of Production Automation and Infrastructure Integration for DataStage

Production Automation is the running and monitoring of an application with no or minimal human
intervention. This section discusses issues and standard practices for developing, deploying,
running, and monitoring DataStage EE in an automated manner.
IBM IPS Services offers the Administration, Management, and Production Automation
Workshop for DataStage Enterprise Edition.
This offering provides a set of proven Standard Practices and a customized toolkit for
integrating DSEE into a customer's existing production infrastructure (monitoring, scheduling,
auditing/logging, change management), as described in this section.
Applications developed in DataStage EE are typically enterprise-class applications that run 7
days a week, 24 hours a day. For an enterprise to maintain this high standard of service, some
level of automation must be utilized. There are many tools and many ways to automate production
data processing applications, such as an Enterprise Scheduler, enterprise and system monitors, and
DataStage job control or log files.
Enterprise schedulers are powerful tools for coordinating events across multiple applications of the
enterprise, and they are a vital component of an automated infrastructure. Applications developed in
DataStage tend to be event driven. For example, before a job is executed, it usually needs to wait for a
file download to complete, or for another application to complete its work. The enterprise scheduler can
monitor and control some or all of these events. Some examples of Enterprise Schedulers are Control-M,
CA-7, AutoSys, and Tivoli (see below for a detailed discussion of Integrating with an External Scheduler).
Enterprise Management Consoles are frequently used to monitor the network, databases, machines, and
applications throughout the enterprise. Mission-critical applications usually have some form of
proactive monitoring in place to ensure they function properly. It is possible to proactively monitor the
DataStage environment and DataStage applications.
Controlling DataStage jobs in an automated fashion is done using DataStage job control. DataStage
job control has multiple interfaces, such as the Job Sequencer; all interfaces are based on the DataStage
Development Kit (below). The DataStage Development Kit (dsjob) allows external applications, such
as schedulers, to control and report on DataStage jobs. Other aspects of an automated system
addressed in this document also leverage the DataStage Development Kit, such as
Parameter Management, Exception Handling and feedback, and Check-Point/Restart.
Other capabilities of a robust automation system are related to auditability and metrics reporting.
Runtime stats regarding row counts, system resource consumption, exceptions, or elapsed run time can
be used for trend analysis, data quality analysis, and proactive monitoring. This document will discuss

the methods, tools and options for building a new audit and metrics reporting infrastructure or
integrating into an existing one.

[Diagram: the Enterprise Scheduler invokes a script (dsjob) that runs a DataStage Sequencer Job
(harness) with Pre-Execution, EE job, and Post-Execution steps. Surrounding components: DataStage
Job Logs, Parameters, Rejects, a Check-Point Log, an External Log monitored by an Enterprise
Monitoring Agent, and an Audit/Metrics Database. The script logs exceptions to the External Log.]

This diagram depicts some of the typical interactions between DataStage and the Production
Automation environment.

- The Enterprise Scheduler invokes a shell script and waits for a completion status.
- The shell script runs a DataStage Sequencer Job and waits for it to complete. It then checks the
  status of dsjob and writes exceptions to the External Log.
- The External Log is monitored by the enterprise monitoring agent for new entries, which signals
  the operations console to report errors.
- The Sequencer Job runs the DataStage jobs.

Various other features shown in this diagram are an Audit & Metrics Reporting Database, Rejects
(files and tables), the DataStage job logs, and parameters, all of which can have some interaction
with a DataStage Sequencer Job and the DataStage EE jobs.
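The script layer in this flow might look like the following sketch. The project and job names, the external log path, and the exit-code handling are assumptions: `dsjob -run -jobstatus` sets its exit status from the job's finishing status, and the codes treated as success here (1 = OK, 2 = OK with warnings) should be verified against your DataStage release.

```shell
#!/bin/sh
# Scheduler-facing wrapper sketch: run a sequencer via dsjob, check the
# result, and append failures to an External Log watched by the
# monitoring agent. All names and paths are illustrative.
PROJECT=${PROJECT:-dstage_prod}
JOB=${JOB:-seqMasterLoad}
EXTERNAL_LOG=${EXTERNAL_LOG:-/var/log/etl/exceptions.log}
DSJOB=${DSJOB:-dsjob}    # indirected so the flow can be exercised offline

run_sequence() {
    # -run -jobstatus waits for completion and sets dsjob's exit status
    # from the job's finishing status. Treating 1 (OK) and 2 (OK with
    # warnings) as success is an assumption to verify on your release.
    $DSJOB -run -jobstatus "$PROJECT" "$JOB"
    rc=$?
    if [ "$rc" -ne 1 ] && [ "$rc" -ne 2 ]; then
        echo "$(date) ERROR $PROJECT/$JOB failed, dsjob rc=$rc" >> "$EXTERNAL_LOG"
        return 1    # non-zero status percolates back to the scheduler
    fi
    return 0
}
```

The wrapper's exit status is what the enterprise scheduler sees, so failures surface without the scheduler needing any DataStage awareness.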

5.1 DataStage Job Control Development Kit


All DataStage job control is built on the DataStage Development Kit. This includes
dsjob (the command-line interface), the graphical Job Sequencer (for which job control is generated),
and custom job control using the API.

Consult the DataStage Parallel Job Advanced Developer's Guide (advpx.pdf), section DataStage
Development Kit (Job Control Interfaces), for a complete description of dsjob and the API.
Consult the DataStage Designer Guide (coredevgde.pdf), section 6 Job Sequences, for details
about Job Sequencers.
Custom job control is often implemented using DataStage routines and shell scripts.
When developing a shell script, the dsjob command-line interface will most likely be used to run a job;
experience in shell scripting is recommended (preferably Korn shell).
When developing custom job control with DataStage routines, the developer will need to program in
DataStage BASIC. A useful resource is the DataStage BASIC Guide (Basic.pdf); the online help also
does an excellent job of documenting built-in DataStage BASIC functions.
IMPORTANT: the dsjob command should be used in a conservative manner. Each time the
dsjob command is invoked, it must log into the DataStage project, so excessive use of dsjob is
slow and inefficient for frequent repetitive tasks.
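To illustrate the point, capture the output of a single dsjob invocation and parse the copy, rather than calling dsjob once per field. The field labels and report layout below are illustrative of `-jobinfo` output, not an exact rendering.

```shell
#!/bin/sh
# Sketch: one dsjob login instead of several. Each dsjob invocation must
# log into the project, so fetch -jobinfo once and parse locally.
# The "Label : value" field layout below is an illustrative assumption.
DSJOB=${DSJOB:-dsjob}

# Extract one labeled field from captured report text.
job_field() {    # usage: job_field "$report" "Job Status"
    echo "$1" | grep "$2" | head -1 | sed 's/^[^:]*:[ ]*//'
}

# One login, several fields parsed from the same capture.
report_fields() {    # usage: report_fields <project> <job>
    report=$($DSJOB -jobinfo "$1" "$2")
    echo "status=$(job_field "$report" 'Job Status')"
    echo "start=$(job_field "$report" 'Job Start Time')"
}
```

Contrast with the wasteful pattern of running `$DSJOB -jobinfo` once for each field, which performs a project login per call.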

5.2 Job Sequencer


The DataStage Job Sequencer is documented in the Designer Guide (coredevgde.pdf), Section 6 Job
Sequencer; read it for an overview of the Sequencer.
The Job Sequencer provides convenient functionality for enabling automated production environments.
Its ability to graphically represent the logical flow of an application is useful to all members of the
DataStage team. It also provides self-documentation of the flow and useful design-time features.

5.3 Exception Handling


It is extremely important for an automated production environment to have solid, reliable exception
handling. Done properly, integrated exception handling detects failures quickly and ensures high-quality,
accurate data.
Error detection must be addressed in all aspects of the ETL process. As a standard practice, for every
task performed in a process, ask the questions: What can go wrong? How can I correct it automatically?
Should someone be notified? As you write routines, jobs, sequences, and scripts, ensure that they all
handle exceptions.
The DataStage engine monitors message statuses written to the job log. Message statuses have a direct
influence on the overall execution of a sequence.


The parallel framework internally monitors all the individual player processes. It ensures that if one
player fails, all players fail on all processing nodes.
Messages written to the job log with status FATAL will cause a job to abort. Messages of type
WARNING can cause a job to abort if the warning threshold is exceeded. This threshold can be set
or turned off from the Director or using the Job Control API. When a limit is set on a sequencer job,
all child sequencers inherit that setting. The following example shows the DataStage Job Control API
calls to set limits from a job control routine:
*
* Set Limits for Job
*
LimitErr = DSSetJobLimit(JobHandle, DSJ.LIMITROWS, rowLimitParm)
LimitErr = DSSetJobLimit(JobHandle, DSJ.LIMITWARN, warningLimitParm)

Overall job design should consider exception handling, including reject settings. Stages should be
configured to handle errors and rejects. Stages like Sequential File, Transformer, Oracle EE, and Merge
support a reject link; rejects cause warnings to be written to the job log.
The Job Control API functions allow you to write messages directly to the job log from job control;
DSLogWarning and DSLogFatal are the basic versions. Messages can also be logged from dsjob.
Routines that are used in a sequencer Routine Activity must return 0 for success or 1 for error. If a
Routine Activity returns a non-zero value that is not an error, you will have to configure a trigger to
handle this case, resulting in more complex sequencer logic. So, standardize on 0 for success and
1 for error.
Routines can also be used in various Job Sequencer stages, in parameter expressions, trigger
expressions, and Variable Activity stage expressions. When used in these places, the routine returns
a value that is not tested by the Job Sequencer the way a Routine Activity's is. To ensure
exceptions are raised to the controlling job, all routines should trap exceptions and log a warning
(DSLogWarn) or fatal error (DSLogFatal) to the job log.
In general, all routines should trap errors and call DSLogWarn or DSLogFatal.
Sequencer jobs should be configured with "Automatically handle activities that fail". This ensures
that any job that aborts will be detected. Warnings are not handled automatically by the sequencer;
you must either explicitly set a Job Activity to trigger on a warning or set the warning limit on the
job.

5.4 Checkpoint Restart


For a DataStage process there are two aspects of checkpoint restart that need to be
distinguished: checkpoint restart of a DataStage Parallel job, and checkpoint restart of a DataStage
batch process or Job Sequencer.

The concept of a checkpoint relies on the ability to define a unit of work. Within the scope of a
DataStage parallel job, this is a complex problem to solve, because stage instances tend to run
independently of each other in a nondeterministic manner.
DataStage EE supports functionality that allows a unit of work to be defined in terms of data rows, but
this has only been implemented using the mqread and unitofwork stages. Therefore, in most cases, it
is up to the developer to design their ETL process in a manner that allows it to restart without
corrupting the target data sink, while minimizing reprocessing. This is typically done by determining
logical boundaries for an overall process.
The typical ETL process is usually developed with multiple EE Parallel jobs. For example, consider a
simple DataStage EE application comprised of two jobs, where the first job extracts and transforms the
data and stages load-ready data in a persistent dataset, and the second job reads the dataset and
bulk loads the database. Using logical boundaries allows a unit of work to be defined in terms of a
single DataStage Parallel job. Thus, it is straightforward to implement a checkpoint restart
architecture using the DataStage Job Sequencer and the philosophy that each job is a unit of work. A
checkpoint restart strategy connotes two modes of operation: Normal and Restart.
In Normal processing mode, a checkpoint restart should follow these logical steps:
  Pre-Execution:  Log "Checkpoint starting".
  Execution:      Log "Checkpoint running".
  Post-Execution: Check job status:
    = Success:  Log "Checkpoint Success"; exit with Success.
    = Failure:  Log "Checkpoint Failed"; exit with ERROR.
In Restart processing mode, a checkpoint restart should follow these logical steps:
  Pre-Execution:  Check what the last checkpoint was:
    = Success:  Skip execution (go to the next job in the sequence).
    = Failure:  Log "Checkpoint Restarting".
  Execution:      Log "Checkpoint running".
  Post-Execution: Check job status:
    = Success:  Log "Checkpoint Success"; exit with Success.
    = Failure:  Log "Checkpoint Failed"; exit with ERROR.
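The Normal/Restart steps above can be sketched as a small custom checkpoint layer. The log location, line format, and state names are assumptions chosen for the sketch.

```shell
#!/bin/sh
# Custom checkpoint sketch implementing the Restart decision above.
# CKPT_LOG location and the line format ("date time job state") are
# assumptions, as are the state names.
CKPT_LOG=${CKPT_LOG:-/var/etl/checkpoints.log}

log_ckpt() {    # usage: log_ckpt <job> <STARTING|RUNNING|SUCCESS|FAILED>
    echo "$(date '+%Y-%m-%d %H:%M:%S') $1 $2" >> "$CKPT_LOG"
}

last_ckpt() {    # last recorded state for a job; empty if none
    grep " $1 " "$CKPT_LOG" 2>/dev/null | tail -1 | awk '{ print $NF }'
}

# Restart mode: run the job unless its last checkpoint was SUCCESS.
should_run() {
    [ "$(last_ckpt "$1")" != "SUCCESS" ]
}
```

In Restart mode the controlling script calls should_run per job and skips any job whose prior run already logged SUCCESS, which is exactly the "Skip execution" branch above.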
Restart job control requires the developer to consider:

- Any cleanup that may be required from a prior failure. For example, you may want to delete rows
  from a database or delete temporary files. This logic should be included in pre- and post-execution
  logic.
- Whether the job sequencer should continue processing after a detected failure or abort. This
  standard practice advocates zero failure tolerance for a sequencer job: if one job fails, the
  sequencer should abort. The purpose of this standard practice is to ensure exceptions percolate up
  the sequencer hierarchy, back to the shell script that runs dsjob, and ultimately back to the
  scheduler, so that exceptions are detected in a timely manner. The job should then be restarted
  from the scheduler.
- How restart processing is invoked. Should the job be reset? Should the job be invoked with a
  restart parameter?
- Steps that always need to run, even if they succeeded in the prior run. By disabling checkpoint
  restart for an individual job, it will always run.
- Steps that only need to run once if they succeed. By enabling checkpoint restart for an individual
  job, it will be skipped the next time the controlling job is rerun.

DataStage job sequencers can use built-in checkpoint restart, or it is possible to use a custom
approach to checkpoint restart; this standard practice is impartial to the approach you choose. The
advantage of employing a custom checkpoint solution is its flexibility: custom checkpoint logic
uses parameters to influence the behavior of the restart logic, allowing you to
specifically name individual jobs or subsets of jobs to run.
Built-in Checkpoint Restart:

Figure 6: Add checkpoints so the sequence is restartable.


Warning: Reset will always clear built-in checkpoints.

Figure 7: Job Activity with checkpoint


If a sequence is restartable (i.e., is recording checkpoint information) and one of its jobs fails during
a run, the following status appears in the DataStage Director: Aborted/restartable.
In this case you can take one of the following actions:
- Run Job. The sequence is re-executed, using the checkpoint information to ensure that only the
  required components are re-executed.
- Reset Job. All the checkpoint information is cleared, ensuring that the whole job sequence will be
  run when you next specify Run Job.
Note: if, during sequence execution, the flow diverts to an error handling stage, DataStage does not
checkpoint anything more. This ensures that stages in the error handling path will not be skipped if
the job is restarted and another error is encountered.


6 Job Parameter and Environment Variable Management


An overview of DataStage job parameters can be found in the DataStage Designer Guide
(coredevgde.pdf), Job Properties: Specifying Job Parameters.
DataStage jobs can be parameterized to allow for portability and flexibility. Parameters are used to
pass values for variables into jobs at run time. There are two types of parameters supported by
DataStage jobs:

Standard Job Parameters:


o Are defined on a per job basis in job properties.
o The scope of a parameter is restricted to the job.
o Used to vary values for stage properties, and arguments to before/after job routines at
runtime
o No external dependencies, as all parameter metadata is a sub element to a single job.

Environment Variable Parameters:


o Leverage operating system environment variable concept.
o Provide a mechanism for passing the value of an environment variable into a job as a Job
Parameter. (Environment variables defined as Job Parameters start with a $ sign.)
o Similar to a standard Job Parameter in that it can be used to vary values for stage properties,
and arguments to before/after job routines
o Provide a mechanism to set the value of an environment variable at runtime. DataStage EE
provides a number of environment variables to enable / disable product features, fine tune
performance, and to specify runtime and design time functionality (for example,
$APT_CONFIG_FILE).

6.1 DataStage Environment Variables

6.1.1 DataStage Environment variable scope


Although operating system environment variables can be set in multiple places, there is a defined order
of precedence that is evaluated when a job's actual environment is established at runtime. The scope of
an environment variable depends on where it is defined. The following table shows where
environment variables are set and the order of precedence:
Where Defined                                Scope (* indicates highest precedence)
System profile                               System wide
dsenv                                        All DataStage processes
Shell script (if dsjob -local is specified)  Only DataStage processes spawned by dsjob
Project                                      All DataStage processes for a project
Job Sequencer                                Job Sequence and Sequence sub-processes
Job                                          * Current job's environment and sub-processes


1) The daemon for managing client connections to the DataStage server engine is called dsrpcd.
By default (in a root installation), dsrpcd is started when the server is installed, and should start
whenever the machine is restarted. dsrpcd can also be manually started and stopped using the
$DSHOME/bin/uv -admin command. (For more information, see the DataStage Administrator Guide.)
By default, DataStage jobs inherit the dsrpcd environment, which on UNIX platforms is set in
the /etc/profile and $DSHOME/dsenv scripts. On Windows, the default DataStage environment is
defined in the registry. Note that client connections DO NOT pick up per-user environment
settings from their $HOME/.profile script.
On USS environments, the dsrpcd environment is not inherited, since DataStage jobs do not
execute on the conductor node.
2) Environment variable settings for particular projects can be set in the DataStage Administrator
client. Any project-level settings for a specific environment variable will override any settings
inherited from dsrpcd.
Within DataStage Designer, environment variables may be defined for a particular job using the Job
Properties dialog box. Any job-level settings for a specific environment variable will override any
settings inherited from dsrpcd or from project-level defaults. Project-level environment variables are
set and defined within DataStage Administrator.
6.1.2 Special Values for DataStage Environment Variables
To avoid hard-coding default values for Job Parameters, there are three special values that can be used
for environment variables within job parameters:
Value     Use
$ENV      Causes the value of the named environment variable to be retrieved from the operating
          system of the job environment. Typically this is used to pick up values set in the
          operating system outside of DataStage.
$PROJDEF  Causes the project default value for the environment variable (as shown in the
          Administrator client) to be picked up and used to set the environment variable and job
          parameter for the job.
$UNSET    Causes the environment variable to be removed completely from the runtime environment.
          Several environment variables are evaluated only for their presence in the environment
          (for example, APT_SORT_INSERTION_CHECK_ONLY).

NOTE: $ENV should not be used for specifying the default $APT_CONFIG_FILE value because,
during job development, the Designer parses the corresponding parallel configuration file to
obtain a list of node maps and constraints (advanced stage properties).
6.1.3 Environment Variable Settings
An extensive list of environment variables is documented in the DataStage Parallel Job Advanced
Developers Guide. This section is intended to call attention to some specific environment variables,
and to document a few that are not part of the documentation.

6.1.3.1 Environment Variable Settings for All Jobs


IBM recommends the following environment variable settings for all DataStage Enterprise Edition
jobs. These settings can be made at the project level, or may be set on an individual basis within the
properties for each job. It may be helpful to create a Job Template and include these environment
variables in the parameter settings.
Environment Variable           Setting   Description
$APT_CONFIG_FILE               filepath  Specifies the full pathname to the EE configuration
                                         file. This variable should be included in all job
                                         parameters so that it can be easily changed at runtime.
$APT_DUMP_SCORE                1         Outputs the EE score dump to the DataStage job log,
                                         providing detailed information about actual job flow
                                         including operators, processes, and Data Sets.
                                         Extremely useful for understanding how a job actually
                                         ran in the environment.
$OSH_ECHO                      1         Includes a copy of the generated osh in the job's
                                         DataStage log.
$APT_RECORD_COUNTS             0         Outputs record counts to the DataStage job log as each
                                         operator completes processing. The count is per
                                         operator per partition. This setting should be disabled
                                         by default, but part of every job design so that it can
                                         be easily enabled for debugging purposes.
$APT_PERFORMANCE_DATA          $UNSET    If set, specifies the directory to capture advanced job
                                         runtime performance statistics.
$OSH_PRINT_SCHEMAS             0         Outputs actual runtime metadata (schema) to the
                                         DataStage job log. This setting should be disabled by
                                         default, but part of every job design so that it can be
                                         easily enabled for debugging purposes.
$APT_PM_SHOW_PIDS              1         Places entries in the DataStage job log showing the
                                         UNIX process ID (PID) for each process started by a
                                         job. Does not report PIDs of DataStage "phantom"
                                         processes started by Server shared containers.
$APT_BUFFER_MAXIMUM_TIMEOUT    1         Maximum buffer delay in seconds.

On Solaris platforms only: when working with very large parallel Data Sets (where the individual
data segment files are larger than 2 GB), you must define the environment variable $APT_IO_NOMAP.
On Tru64 5.1A platforms only: the environment variable $APT_PM_NO_SHARED_MEMORY should be set to 1 to
work around a performance issue with shared memory MMAP operations. This setting instructs
EE to use named pipes rather than shared memory for local data transport.


6.1.3.2 Additional Environment Variable Settings


Throughout this document, a number of environment variables will be mentioned for tuning the
performance of a particular job flow, assisting in debugging, or changing the default behavior of
specific Enterprise Edition stages. The environment variables mentioned in this document are
summarized in Appendix B: Environment Variable Reference. An extensive list of environment
variables is documented in the DataStage Parallel Job Advanced Developers Guide.
6.1.4 Migrating Project-Level Environment Variables
When migrating projects between machines or environments, it is important to note that project-level
environment variable settings are not exported when a project is exported. These settings are stored in
a file named DSParams in the project directory. If an environment variable has not been configured
for the target project, the migrated job will fail during startup.
Any project-level environment variables must be set for new projects using the Administrator client, or
by carefully editing the DSParams file within the project.
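As a pre-migration sanity check, something along these lines could confirm that the target project's DSParams file mentions the environment variables your jobs expect. The substring match is a deliberate simplification (real DSParams entries are structured), and the variable names are examples.

```shell
#!/bin/sh
# Pre-migration sketch: confirm the target project's DSParams file
# mentions each required environment variable. A plain substring grep
# is used -- a simplification, since real DSParams entries are
# structured; the variable names passed in are examples.
check_dsparams() {    # usage: check_dsparams <DSParams-path> <var> [...]
    file=$1; shift
    missing=0
    for var in "$@"; do
        if grep "$var" "$file" > /dev/null 2>&1; then
            : # present
        else
            echo "MISSING: $var"
            missing=1
        fi
    done
    return $missing
}
```

Run it against the new project's DSParams before promoting jobs; a non-zero return flags variables to set via the Administrator client first.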

6.2 DataStage Job Parameters

Parameters are passed to a job as either DataStage job parameters or as environment variables. The
naming standard for job parameters uses the suffix _parm in the variable name; environment
variables have a prefix of $ when used as job parameters.
Job parameters are passed from a job sequencer to the jobs in its control as if a user were answering the
runtime dialog questions displayed in the DataStage Director job-run dialog. As discussed later in this
section, job parameters can also be specified using a parameter file.
The scope of a parameter depends on its type. Essentially:
- The scope of a job parameter is specific to the job in which it is defined and used. Job
  parameters are stored internally within DataStage for the duration of the job, and are not
  accessible outside that job.
- The scope of a job parameter can be extended by the use of a job sequencer, which can manage
  and pass job parameters among jobs.
6.2.1 When to Use Parameters
As a standard practice, file names, database names, passwords, and message queue names should always
be parameterized. It is left to the discretion of the developer to parameterize other properties. When
deciding what to parameterize, ask these questions:
- Could this stage or link property be required to change from one project to another?
- Could this stage or link property change from one job execution to another?
If you answer yes to either of these questions, create a job parameter and set the
property to that parameter.


To facilitate production automation and file management, the Project_Plus and Staging file paths are
defined using a number of environment variables that should be used as Job Parameters.
Job parameters are required for the following DataStage programming elements:
1. File name entries in stages that use files or Data Sets must NEVER use a hard-coded operating
   system pathname.
   a. Staging area files must ALWAYS have pathnames as follows:
      /#$STAGING_DIR##$DEPLOY_PHASE_parm#[filename.suffix]
   b. DataStage datasets must ALWAYS have pathnames as follows:
      /#$PROJECT_PLUS_DATASETS#[headerfilename.ds]
2. Database stages must ALWAYS use variables for the server name, schema (if appropriate),
   userid, and password.
6.2.2 Parameter Standard Practices
File name stage properties should always be configured using two parameters: one for the directory
path and a second for the file name. The directory path delimiter should always be specified in the
property to avoid errors. Don't assume the runtime value of the directory parameter will include the
appropriate delimiter: if the user does supply it, the operating system accepts // as a delimiter, and if
it is not supplied (which is very common), the file name property specification is still correct.

Example of standard practice for file name properties: #Dir_Path#/#File_Name#

Similar to directory path delimiters, database schema names, etc. should contain any required delimiter.

Example of standard practice for table name properties: #DatabaseSchemaName#.TableName

User Accounts and passwords should always be specified as Environment Variables.


Passwords should be set to type encrypted, and the default value maintained using the DataStage
Administrator.
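A quick sketch of why embedding the delimiter in the property is safe (the directory value is hypothetical): whether or not the runtime value carries a trailing slash, the composed path still resolves, because the operating system collapses the doubled delimiter.

```shell
# Sketch: the #Dir_Path#/#File_Name# pattern is safe even when the directory
# parameter's runtime value already ends in "/", because the operating
# system accepts "//" as a delimiter. Paths below are hypothetical.
Dir_Path="/tmp/param_demo"            # runtime value WITHOUT a trailing slash
File_Name="customers.txt"
mkdir -p "$Dir_Path"
: > "$Dir_Path/$File_Name"            # property pattern: #Dir_Path#/#File_Name#

Dir_Path_Slash="/tmp/param_demo/"     # same value WITH a trailing slash
ls "$Dir_Path_Slash/$File_Name"       # the "//" form still finds the file
```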
6.2.3 Specifying Default Parameter Values
Default values of job parameters migrate with the job. If a value is not overridden, you run the risk
of unintentionally using the wrong resources, such as connecting to the wrong database or referencing
the wrong file. To mitigate this risk, developers should follow this standard practice when setting the
default value of a Parallel Job's parameters:
Standard Parameters: ensure the default value is empty
Environment Variable Parameters: ensure the default value is empty, $ENV, $PROJDEF or
$UNSET
The intent of this standard practice is to ensure a job is portable: it requires the value of a
parameter to be set independently of the job. During development of a job, consider using the standard
practice of always using a test harness Sequencer to execute a parallel job. The test harness allows the
job to be run independently and ensures the parameter values are set. When the job is ready for
integration into a production Sequencer, the test harness can be cut, pasted, and linked into the
production Sequencer. The test harness is also useful in test environments, because it allows you
to run isolated tests on a job.
In rare cases, normal parameters should be allowed default values. For example, a job may be
configured with parameters for array_size or commit_size and the corresponding database stage
properties set to these parameters. The default value should be set to an optimal value as determined
by performance testing. This value should be relatively static, and it can always be overridden by
job control. This exception also minimizes the number of parameters to manage: the optimal
array_size may differ for every job, and while you could define a unique parameter for each one,
managing those values across all jobs would be difficult.
6.2.4 Managing Lists of DataStage Parameters
Both standard job parameters and environment variable parameters can be managed as lists using the
tools and techniques described in this section. Parameter lists help you name parameters consistently
and set their values centrally. Once a parameter is entered into the list, the developer does not have
to type it into a new job, avoiding typographical errors in names. Parameter values usually need to be
modified when a job is migrated from one project to another; lists help simplify this process
and eliminate errors related to invalid job parameter values.
6.2.4.1 Environment Variable Parameter Lists
Environment variable parameters are conveniently managed by DataStage as a single list. Once an
environment variable has been entered in the list, the developer picks it from the list to add it to a job.
When a job is migrated from one project to another, the job will fail during startup if the environment
variable has not been configured for the target project.
When migrating to a new environment, you will have to enter the Environment Variable Parameters
one at a time in the DataStage Administrator. If this becomes too cumbersome, note that Environment
Variable Parameters are stored in the DSParams file, a text file that can be modified by hand. If you
choose to modify this file by hand, do so at your own risk.
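If you do take that risk, a minimal sketch of a cautious workflow follows, demonstrated on a throwaway file whose contents are an illustrative stand-in, not the real DSParams format (which is release-specific): back the file up, make the edit, and review the difference before touching a live project.

```shell
# Sketch: cautious hand-edit of a DSParams-style file — back up first, edit,
# then review the change. The file below is a throwaway stand-in; the real
# DSParams lives in the project directory and its internal format is
# release-specific, so inspect it before changing anything.
DSPARAMS=/tmp/DSParams.demo
printf 'STAGING_DIR=/staging/dev/\n' > "$DSPARAMS"    # stand-in content

cp "$DSPARAMS" "$DSPARAMS.bak"                        # backup first
sed -i 's|/staging/dev/|/staging/test/|' "$DSPARAMS"  # the edit
diff "$DSPARAMS.bak" "$DSPARAMS" || true              # review what changed
```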
6.2.4.2 Standard Parameter Lists
Standard parameter lists can be managed using shared containers. Unfortunately, this technique does
not apply to Sequencer jobs. When a shared container is converted to local in a job, DataStage adds
the container's parameters to the job, so the developer does not have to type them in. This standard
practice recommends using empty shared containers to store and organize parameters as lists.
The next two diagrams show how to add parameters to an empty parallel shared container. Notice the
name of the shared container, MasterParameterList. The second diagram shows the MasterParameterList
dropped onto the parallel job canvas with "Convert to local" selected. Once converted to local, the
parameters are added to the job. Duplicate names are detected; resolve them by assigning the job's
parameter to the shared container and then converting to local again. The empty container disappears
once converted.

Figure 8: MasterParameterList Shared Container

Figure 9: Adding parameters by converting the MasterParameterList Shared
Container to local.
6.2.5 Static and Dynamic Parameters


Classifying job parameters as either static or dynamic will help you determine which type to use
(Standard or Environment Variable) and how to configure and manage them.
6.2.5.1 Static Parameters
A parameter can be considered static if its value changes infrequently. For example, database user
accounts and passwords could be classified as static. Static parameters usually need to be changed
when porting a job to another project.
To best manage static parameters, utilize environment variables.
As a standard practice, to ensure consistency and portability, Environment Variable Parameters should
be configured using the DataStage Administrator, and the default value should be specified there.
When setting the environment variable in a job, the developer should specify $PROJDEF, $ENV or
$UNSET for the default value.
There are cases where parameter values need to be different for each developer. For example, it may
be a requirement that every developer use their own user ID and password to log into a database. In
this case the standard practice is to have a separate project for each developer.
Static parameters that are system wide can be defined in dsenv or in the system profile, so that all
DataStage processes see the same value regardless of project. The environment variable
must be configured in each project with the default value $ENV.
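A sketch of the system-wide case (the variable name and value are hypothetical): a line added to dsenv or the system profile is exported once and inherited by every DataStage process, while each project's parameter simply defaults to $ENV.

```shell
# Sketch: fragment that would live in $DSHOME/dsenv (or the system profile).
# Variable name and value are hypothetical examples.
EDW_DB_SERVER=edwprod01
export EDW_DB_SERVER

# Every child process started under this environment now sees the value;
# a job parameter defined with default $ENV picks it up at run time.
echo "EDW_DB_SERVER=$EDW_DB_SERVER"
```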
6.2.5.2 Dynamic Parameters
Parameter values that may change frequently are considered dynamic; an example is a file name
derived at runtime from the date and time. The value is usually set just before the job is run.
There are a number of supported methods for deriving values for job parameters just before a job
is started:
1. Sequencer Job Activity stage: specify a UVBasic expression for the parameter.
2. Sequencer Variable Activity stage: derive the value, then pass it to the Job Activity.
3. Propagating the parameter from a controlling job's parameter list.
4. The job control API function DSSetParam.
All of these methods provide the means to set a value; the source of that value can be anything that
you can get DataStage to talk to. It is common to derive values from a file that contains name-value
pairs.
Follow the standard practice described above in Specifying Default Parameter Values; this will help
prevent subtle errors like connecting to the test database from the production system, and it will also
force the job control to explicitly specify a value.
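A sketch of the name-value-pair pattern: a wrapper derives a date component at run time, reads a value from a name=value file, and passes the result on the dsjob command line. The file, parameter, project, and job names are hypothetical, and the dsjob invocation is only echoed here.

```shell
# Sketch: derive dynamic parameter values just before job start — a date
# component computed at run time plus a value read from a name=value file.
# All names are hypothetical; remove the echo to really invoke dsjob.
PARAM_FILE=/tmp/daily_params.txt
printf 'Extract_File=customers\n' > "$PARAM_FILE"   # stand-in for the real file

RUN_DATE=$(date +%Y%m%d)                            # derived at run time
EXTRACT=$(sed -n 's/^Extract_File=//p' "$PARAM_FILE")
FILE_NAME="${EXTRACT}_${RUN_DATE}.txt"
echo dsjob -run -jobstatus -param File_Name="$FILE_NAME" myproject LoadCustomers
```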

Developers, testers, production system operators, and the enterprise scheduler should always start
jobs using job control. The developer should build a test harness Sequencer that explicitly sets a
job's parameters. Testers should run the job as it would be run in production. Production job
invocation should be done from a shell script.

6.3 Audit and Metrics Reporting in an Automated Production Environment


Integral to the automated production environment is an infrastructure that supports an audit and
metrics reporting system. Such a system supports many aspects of an information infrastructure,
from row counts to performance metrics to usage trend analysis to automated Data Quality and Service
Level Agreement (SLA) compliance.
DataStage job log files contain most of the data needed to meet typical audit and metrics reporting
requirements. Other types of information may also be required, usually related to Data Quality and
data-level auditing, and you may even integrate system-wide metrics.
A system of this nature is best supported by a relational database schema. This allows other data
structures to be integrated with the DataStage metrics, provides an optimal means for consolidating
and organizing historical data, lets SQL-based reporting tools access the data, and lets customers
extend the core of the model to meet their specific audit and reporting requirements.
The IBM IPS Services Administration, Management, and Production Automation
Workshop provides a customized database schema and set of routines for creating and
implementing an audit and metrics reporting environment.

6.4 Integrating with External Schedulers


It is common for customers to use an enterprise scheduler to invoke DataStage jobs. DataStage
provides a command line utility, dsjob, which provides capabilities for running jobs, job log reporting,
and access to job and project metadata.
There are two approaches to integrating DataStage with an enterprise scheduler. One approach is to
encapsulate all job control within the scheduling tool and invoke the individual DataStage tasks from
the scheduler. The other is to use a combination of a shell script and DataStage Job Sequences. This
standard practice recommends the latter approach because of its superior robustness and simplicity.
As a standard practice, it is strongly recommended to implement job control in a hierarchical fashion.
The following diagram demonstrates how one might implement this hierarchy.

Enterprise Scheduler
  -> Shell script (encapsulating dsjob)
       -> Parent Sequencer
            -> Child Sequencer
                 -> Job 1, Job 2, ... Job n

It is best to use dsjob to start a job sequence and wait for it to complete via a shell script.
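A minimal sketch of such a wrapper follows. The project, sequence, and parameter names are hypothetical, and RUN=echo keeps this a dry run. With -jobstatus, dsjob waits for the sequence and encodes its finishing status in the exit code — commonly 1 for a clean finish and 2 for finished with warnings; verify the codes against your release before scripting against them.

```shell
# Sketch: shell wrapper the enterprise scheduler invokes. All names are
# hypothetical; RUN=echo prints the command instead of executing it.
PROJECT=myproject
SEQUENCE=MasterSequence
RUN="echo"     # set RUN="" on a real DataStage server

run_sequence() {
  # -jobstatus makes dsjob wait for the sequence to finish and return
  # its finishing status as the exit code.
  $RUN dsjob -run -jobstatus -param ProcessDate="$1" "$PROJECT" "$SEQUENCE"
}

run_sequence "$(date +%Y-%m-%d)"
status=$?
# Map the dsjob status to a scheduler-friendly result (1 = finished OK and
# 2 = finished with warnings are commonly treated as success).
case "$status" in
  0|1|2) echo "sequence completed (status $status)" ;;
  *)     echo "sequence FAILED (status $status)" ;;
esac
```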

6.5 Integrating with Enterprise Management Consoles


Monitoring tools usually allow applications to be monitored by means of a log watcher. These
typically tail -f a log file, detecting new log entries instantly, and provide a keyword or
regular-expression search mechanism for filtering messages and categorizing them by severity.
Detecting long-running or hung jobs should be a function of the external scheduler. If the scheduler
does not provide this capability, an alternative is to leverage information in the audit database:
set up a watchdog process that periodically queries for jobs in a running state and
compares the elapsed time to a benchmark expected run time.
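A sketch of the elapsed-time comparison at the heart of such a watchdog (the job names, start times, and benchmarks are hypothetical stand-ins for rows that would be queried from the audit database):

```shell
# Sketch: watchdog core — compare a running job's elapsed time against its
# benchmark expected run time and raise an alert when it is exceeded.
# In practice the job list would come from the audit database.
NOW=$(date +%s)

check_job() {   # args: job_name start_epoch benchmark_seconds
  elapsed=$(( NOW - $2 ))
  if [ "$elapsed" -gt "$3" ]; then
    echo "ALERT: $1 has run ${elapsed}s against a ${3}s benchmark"
  fi
}

check_job LoadCustomers "$(( NOW - 7200 ))" 3600   # 2h elapsed vs 1h benchmark
check_job QuickLookup   "$(( NOW - 60 ))"   3600   # well within benchmark
```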
Another feature of system monitoring tools, and of production operations in general, is providing the
operator with documentation related to troubleshooting the application. These documents are
usually correlated to a console alarm by a document link in the management console, or printed out
and organized in a binder.

7 Change Management
This section considers aspects of a system life cycle as it relates to DataStage.

7.1 Source Control


[Diagram: source control life cycle — objects are exported from the Development project (one object
per export) and checked into the Source Code Repository; they are checked out and imported into the
ALPHA (test) environment, tested, and promoted to BETA (test); after successful BETA testing they
are checked out, imported, and promoted to Production; maintenance changes are checked out of the
repository back into Development.]
In a perfect world every developer would have their own environment, including a DataStage project,
file systems, and database. There would be a separate ALPHA test environment (for integration
testing), where developers merge their components into a complete system, and the BETA test system
would be a replica of production (or as close as possible).
Since we do not live in a perfect world, at a minimum there should be at least one development
project, one test project, and one production project. Anything less is strongly discouraged,
because it does not provide the foundation for a robust life cycle.
The source code control repository can be either the DataStage Version Control tool or any third-party
source code control system such as IBM Rational ClearCase, SCCS, or Microsoft Visual SourceSafe.
DataStage projects can be configured to be protected, preventing any user from making
modifications to jobs, routines, and so on. It is strongly encouraged to protect all projects except
development projects. The only users who can manipulate a protected project are those assigned
the role of DataStage Production Manager, and even that role cannot perform tasks like job
compilation. Thus, when exporting jobs that will be imported into a protected project, ensure
that the executable is included in the export.

7.2 Production Migration Life Cycle


The following steps should be followed to ensure a proper code migration life cycle. They define
the responsibilities and roles of the various personnel involved in the release cycle. This does not
cover any formal paperwork an organization may require; it focuses on the process of moving objects
through the life cycle.
1. Developer's responsibility. New objects only. DataStage objects such as jobs, buildops,
routines, and sequences should be initially developed in a Development project. The developer
should thoroughly unit test their objects. The job should be instrumented with source code
control system tags to permit version tracking. See Appendix: Source Code Control System
TAGS.
2. Developer's responsibility. Objects are then exported one at a time (not by category or project),
including the executable. The export can be either .xml or .dsx, but the name of the export
should be identical to the object being exported. This may seem cumbersome, but it is the best
way to control sources: it provides the lowest level of granularity, avoids storing the
same object more than once in the source code control repository, and prevents any
unintentional overlaying of objects during import.
3. Developer's responsibility. Check the object into the source code control system.
4. Integration Tester's responsibility. Check the objects out of the source code repository and
import them into the ALPHA test environment.
5. Integration Tester's responsibility. Run integration tests in the ALPHA environment and
evaluate results. If tests fail, go to step 12; if they succeed, continue to the next step.
6. Integration Tester's responsibility. Check successfully tested objects back into the source code
repository, marking them for promotion to BETA test, and notify the Quality Assurance Tester
that objects are ready for BETA test.
7. Quality Assurance Tester's responsibility. Check the objects marked for BETA test out of the
source code repository and import them into the BETA project.
8. Quality Assurance Tester's responsibility. Run QA tests and evaluate results. If tests fail, go to
step 12; if they succeed, continue to the next step.
9. Quality Assurance Tester's responsibility. Check objects back into the source control repository,
marking them ready for production, and notify the Production Manager that objects are ready
for release.

10. Production Manager's responsibility. Ensure a system backup is made of the target production
project (this provides a means to quickly recover from problems introduced by changed code),
then check objects out of the source code control system and import them into the production
environment.
11. End of normal release cycle.
12. Developer's responsibility. Bug fixes and enhancements. Start with a clean project and import
the latest objects from the source code control system. Make enhancements and unit test. Be
sure to update the job description, for this is where the change log is kept for an object. Once
unit testing is complete, go back to step 2 and repeat the cycle.

7.3 Security
Development environment. Each user should have a user ID and project. Each project should
have a unique group identifier. Grant users access to a project by assigning them to a group.
Configure DataStage permissions by assigning the Developer role to developer groups.
Test environment. Set up test users and test groups. Define test projects with tester group
permissions. Configure DataStage permissions by assigning Developer roles to tester groups. Limit
the Production Manager role.
Production environment. Set up production operator users and groups. Define production projects.
Configure DataStage permissions by assigning Operator roles to the production operator account.

7.4 Upgrade Procedure (Including Fallback Emergency Patch)

This section discusses minimizing risk when upgrading objects in a project.
It does not pertain to making changes to the system with software supplied by IBM, such as an
upgrade or a patch to DataStage. However, a backup should be made prior to any upgrade, and
patches and upgrades should be applied using the same principles as the Production Migration Life
Cycle described above: test the upgrade in the ALPHA and BETA environments prior to upgrading
or patching the production system, and consult the install and upgrade guide and product release
notes before upgrading.
To minimize exposure to changes introduced to a production system, a backup of the project should
be made before introducing any changes. This provides a means for fast recovery if needed.
In the event that an emergency patch is required, the normal object life cycle should be followed.
However, there may be times when this cycle is too cumbersome. There are a number of options, all
of which should be avoided unless absolutely necessary:
1. Emergency patch. Make changes in development, skip ALPHA test, go directly to BETA test,
and then into production.

2. Emergency patch on the production system. Have the Production Manager unprotect the
project. This allows the developer to make code changes directly on the production system.
The changes must be exported and then checked into the source code control system. Do not
do this unless there is no alternative.
7.4.1 Automating the Build Process
DataStage provides command line utilities in the DataStage Client installation for importing, exporting
and compiling jobs and routines.
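A sketch of scripting those client tools follows. The host, credentials, project, and job names are hypothetical, and the switch syntax shown is only indicative — it varies by release, so check each tool's built-in help before relying on it. RUN=echo keeps this a dry run.

```shell
# Sketch: automated export/import/compile using the DataStage client
# command-line tools (dscmdexport, dscmdimport, dscc). All names are
# hypothetical and switch syntax is release-specific; RUN=echo prints
# the commands instead of running them.
DSHOST=dsserver; DSUSER=dsadm; DSPASS=secret
PROJECT=myproject; JOB=LoadCustomers
RUN="echo"   # set RUN="" on a machine with the DataStage client installed

$RUN dscmdexport /H=$DSHOST /U=$DSUSER /P=$DSPASS $PROJECT $JOB.dsx
$RUN dscmdimport /H=$DSHOST /U=$DSUSER /P=$DSPASS $PROJECT $JOB.dsx
$RUN dscc /H=$DSHOST /U=$DSUSER /P=$DSPASS /F COMPILE $PROJECT $JOB
```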

Appendix A: Processes Created at Runtime by DataStage EE


This is the output of a ps command, listing the processes in the order they are created, with the
earliest process at the top and the latest at the bottom.

These Processes all take place within the DSEngine environment:


DSRPC Daemon, services GUI Sessions and DSJOB invocations:
USERID PID PPID Command
root 1220 1 /u01/app/Ascential/DataStage/DSEngine/bin/dsrpcd
Primary GUI Session, child of DSRPCD:
USERID   PID    PPID  Command
abrewe   12311  1220  dscs 4 0 0

Secondary GUI Session, child of DSCS:
USERID   PID    PPID   Command
abrewe   12312  12311  dsapi_slave 8 7 0

Root process for all DS-Job trees, created by dsapi_slave for DS-Director submitted jobs:
USERID   PID    PPID   Command
abrewe   20899  12312  phantom DSD.RUN HangEquSourceRef 0/0 $APT_DISABLE_COMBINATION=True
$APT_CONFIG_FILE=/u01/app/Ascential/DataStage/Configurations/default.apt $DS_PXDEBUG=1
Phantom process, invokes the UNIX shell and acts as the bridge to UNIX:
USERID   PID    PPID   Command
abrewe   20943  20899  phantom SH -c 'RT_SC351/OshExecuter.sh RT_SC351/OshScript.osh
RT_SC351/HangEquSourceRef.fifo -monitorport 13402 -pf RT_SC351/jpfile -input_charset
ASCL_MS1252 -output_charset ASCL_MS1252 '

These Processes all take place within the Unix Shell Environment:

OshExecuter process, kicks off the OSH Conductor and waits for it to complete:
USERID   PID    PPID   Command
abrewe   20945  20943  /bin/sh RT_SC351/OshExecuter.sh RT_SC351/OshScript.osh
RT_SC351/HangEquSourceRef.fifo -monitorport 13402 -pf RT_SC351/jpfile -input_charset ASCL_MS1252
-output_charset ASCL_MS1252

Conductor process, root process for all DS-EE jobs:
USERID   PID    PPID   Command
abrewe   20946  20945  /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh -monitorport
13402 -pf RT_SC351/jpfile -input_charset ASCL_MS1252 -output_charset ASCL_MS1252 -f
RT_SC351/OshScript.osh

Proto-Section Leader #1 (this goes away immediately after starting its section leader, thus 'orphaning' the SL process):
USERID   PID    PPID   Command
abrewe   20990  20946  /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/etc/standalone

Section Leader #1:
USERID   PID    PPID   Command
abrewe   20993  20990  /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2

Players for Section Leader #1 (this represents a single instantiation of the job):
USERID   PID    PPID   Command
abrewe
20996 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
20998 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21001 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21007 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21010 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21011 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21018 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21019 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21020 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21021 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21022 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21024 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21026 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21028 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21029 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21031 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21017 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21002 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21005 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe
21004 20993 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2

Proto-Section Leader #2 (this goes away immediately after starting its section leader, thus 'orphaning' the SL process):
USERID   PID    PPID   Command
abrewe   20991  20946  /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/etc/standalone

Section Leader #2:
USERID   PID    PPID   Command
abrewe   20992  20991  /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2

Players for Section Leader #2 (this represents a single instantiation of the job):
USERID   PID    PPID   Command

abrewe
20997 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
20999 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21008 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21009 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21012 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21013 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21015 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21016 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21023 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21025 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21027 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21030 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21032 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21014 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21000 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21006 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2
abrewe
21003 20992 /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh
APT_PMsectionLeaderFlag hvcwyds0001 10001 1 30 node2 hvcwyds0001 1085435517.554825.51d2


Represented Graphically:
The UNIX tool pstree can be used to generate a graphical report for a given process ID. Most Linux platforms include pstree; for other platforms it can be downloaded (as Perl code). Here is the graphical output of the above process hierarchy:
dsrpcd(1220)---dscs(12311)---dsapi_slave(12312)
  `-uvsh(20899)---uvsh(20943)                (DSD.RUN phantom)
     `-OshExecuter.sh(20945)
        `-osh(20946)                         (Conductor)
           +-osh(20990)                      (ProtoSection Leader 1)
           |  `-osh(20993)                   (Section Leader 1)
           |     +-osh(20996)                (Players for Section Leader 1)
           |     +-osh(20998)
           |     +-osh(21001)
           |     +-osh(21002)
           |     +-osh(21004)
           |     +-osh(21005)
           |     +-osh(21007)
           |     +-osh(21010)
           |     +-osh(21011)
           |     +-osh(21017)
           |     +-osh(21018)
           |     +-osh(21019)
           |     +-osh(21020)
           |     +-osh(21021)
           |     +-osh(21022)
           |     +-osh(21024)
           |     +-osh(21026)
           |     +-osh(21028)
           |     +-osh(21029)
           |     `-osh(21031)
           `-osh(20991)                      (ProtoSection Leader 2)
              `-osh(20992)                   (Section Leader 2)
                 +-osh(20997)                (Players for Section Leader 2)
                 +-osh(20999)
                 +-osh(21000)
                 +-osh(21003)
                 +-osh(21006)
                 +-osh(21008)
                 +-osh(21009)
                 +-osh(21012)
                 +-osh(21013)
                 +-osh(21014)
                 +-osh(21015)
                 +-osh(21016)
                 +-osh(21023)
                 +-osh(21025)
                 +-osh(21027)
                 +-osh(21030)
                 `-osh(21032)

The pstree command can also be used to distinguish Section Leader processes from Player processes, since both are launched with the APT_PMsectionLeaderFlag option: running pstree against a given process ID displays child processes for a Section Leader and no child processes for a Player.
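The same check can be scripted with standard tools. A minimal sketch (the helper name and the ps/awk approach are illustrative, not part of the toolkit); a Section Leader PID reports a non-zero child count, a Player PID reports zero:

```shell
# count_children prints the number of direct children of a given PID.
count_children() {
    ps -e -o pid= -o ppid= | awk -v p="$1" '$2 == p' | wc -l | tr -d ' '
}

sleep 5 &                      # give the current shell one child for the demo
n=$(count_children $$)
echo "children of $$: $n"      # a Player-like PID would report 0
```

Running count_children against each osh PID from the ps listing above separates Section Leaders (non-zero) from Players (zero).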


Another way to look at the DSJob Tree:

DSRPCD                                                  (root process for all GUI sessions)
 `-DSAPI_Slave  (or, exclusively, a UNIX shell invoking DSJOB)
    `-phantom DSD.RUN <JobName> 0/0                     (root process for all DS jobs)
       +-phantom DSD.OshMonitor rowgen1 3266            (JobMon process)
       `-phantom SH -c 'RT_SC1/OshExecuter.sh RT_SC1/OshScript.osh RT_SC1/<JobName>.fifo -monitorport 13400 -pf RT_SC1/jpfile -input_charset ASCL_MS1252 -output_charset ASCL_MS1252'
          |                                             (root process for the OSH job tree)
          `-/bin/sh RT_SC1/OshExecuter.sh RT_SC1/OshScript.osh RT_SC1/rowgen1.fifo -monitorport 13400 -pf RT_SC1/jpfile -input_charset ASCL_MS1252 -output_charset ASCL_MS1252
             `-/scratch/Ascential/DataStage/PXEngine/bin/osh -monitorport 13400 -pf RT_SC1/jpfile -input_charset ASCL_MS1252 -output_charset ASCL_MS1252 -f RT_SC1/OshScript.osh
                |                                       (Conductor process)
                +-ProtoSectionLeader 1                  (goes away once its Section Leader has started up)
                |  `-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 0 30 node1 mk61 1084877021.130482.cc2      (Section Leader 1)
                |     +-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 1 30 node1 mk61 1084877021.130482.cc2   (Player 1)
                |     +-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 0 30 node1 mk61 1084877021.130482.cc2   (Player 2)
                |     ...
                |     `-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 1 30 node1 mk61 1084877021.130482.cc2   (Player N)
                +-ProtoSectionLeader 2                  (goes away once its Section Leader has started up)
                |  `-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 1 30 node2 mk61 1084877021.130482.cc2      (Section Leader 2)
                |     +-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 1 30 node2 mk61 1084877021.130482.cc2   (Player 1)
                |     +-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 0 30 node2 mk61 1084877021.130482.cc2   (Player 2)
                |     ...
                |     `-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 1 30 node2 mk61 1084877021.130482.cc2   (Player N)
                ...
                `-ProtoSectionLeader N                  (goes away once its Section Leader has started up)
                   `-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 0 30 nodeN mk61 1084877021.130482.cc2      (Section Leader N)
                      +-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 0 30 nodeN mk61 1084877021.130482.cc2   (Player 1)
                      +-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 0 30 nodeN mk61 1084877021.130482.cc2   (Player 2)
                      ...
                      `-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 1 30 nodeN mk61 1084877021.130482.cc2   (Player N)


Appendix B: Environment Variable Reference


This Appendix summarizes the environment variables mentioned throughout this document. These variables can be used on an as-needed basis to tune the performance of a particular job flow, to assist in debugging, and to change the default behavior of specific DataStage Enterprise Edition stages. A more extensive list of environment variables is documented in the DataStage Parallel Job Advanced Developer's Guide.
NOTE: The environment variable settings in this Appendix are only examples. Set values that are
optimal for your environment.

Job Design Environment Variables

$APT_STRING_PADCHAR  [char]
    Overrides the default pad character of 0x0 (ASCII null) used when EE extends, or pads, a
    variable-length string field to a fixed length (or a fixed-length string to a longer fixed
    length).
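For example, to pad strings with spaces rather than NUL bytes (the 0x hex notation shown here is one accepted form; verify it against your release):

```shell
# Pad variable-length strings with ASCII space (0x20) instead of the
# default 0x0 when EE extends them to a fixed length.
export APT_STRING_PADCHAR=0x20
echo "pad char: $APT_STRING_PADCHAR"
```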

Sequential File Stage Environment Variables

$APT_EXPORT_FLUSH_COUNT  [nrows]
    Specifies how frequently (in rows) the Sequential File stage (export operator) flushes its
    internal buffer to disk. Setting this value to a low number (such as 1) is useful for realtime
    applications, but there is a small performance penalty from increased I/O.

$APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS  (DataStage v7.01 and later)
    Setting this environment variable directs DataStage to reject Sequential File records with
    strings longer than their declared maximum column length. By default, imported string fields
    that exceed their maximum declared length are truncated.

$APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL  [set]
    When set, allows a zero-length null_field value with fixed-length fields. This should be used
    with care, as poorly formatted data will cause incorrect results. By default, a zero-length
    null_field value causes an error.

$APT_IMPORT_BUFFER_SIZE  [Kbytes]
$APT_EXPORT_BUFFER_SIZE  [Kbytes]
    Define the size of the I/O buffer for Sequential File reads (imports) and writes (exports),
    respectively. The default is 128 (128 KB), with a minimum of 8. Increasing these values on
    heavily loaded file servers may improve performance.

$APT_CONSISTENT_BUFFERIO_SIZE  [bytes]
    In some disk array configurations, setting this variable to a value equal to the read/write
    size in bytes can improve performance of Sequential File import/export operations.

$APT_DELIMITED_READ_SIZE  [bytes]
    Specifies the number of bytes the Sequential File (import) stage reads ahead to find the next
    delimiter. The default is 500 bytes, but this can be set as low as 2 bytes. It should be set
    to a lower value when reading from streaming inputs (for example, a socket or FIFO) to avoid
    blocking.

$APT_MAX_DELIMITED_READ_SIZE  [bytes]
    By default, Sequential File (import) reads ahead 500 bytes to find the next delimiter. If it
    is not found, the importer looks ahead 4*500=2000 bytes (1500 more), and so on (4X each time)
    up to 100,000 bytes. This variable controls that upper bound, which is 100,000 bytes by
    default. When more than 500 bytes of read-ahead is desired, use this variable instead of
    APT_DELIMITED_READ_SIZE.

$APT_IMPORT_PATTERN_USES_FILESET  [set]
    When this environment variable is set (present in the environment), file pattern reads are
    done in parallel by dynamically building a File Set header based on the list of files that
    match the given expression. For disk configurations with multiple controllers and disks, this
    significantly improves file pattern reads.
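As an illustration, a job that reads from a FIFO and writes its output in near-real time might be launched with settings like these (the project and job names are placeholders, not from the toolkit):

```shell
# Tune Sequential File behavior for a streaming (FIFO) input:
export APT_DELIMITED_READ_SIZE=2           # minimal read-ahead, avoids blocking
export APT_EXPORT_FLUSH_COUNT=1            # flush every row (small I/O penalty)
export APT_IMPORT_PATTERN_USES_FILESET=1   # parallel file-pattern reads

# dsjob -run -jobstatus MyProject MyStreamingJob   # placeholder names
echo "read-ahead=$APT_DELIMITED_READ_SIZE flush=$APT_EXPORT_FLUSH_COUNT"
```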

DB2 Environment Variables

$INSTHOME  [path]
    Specifies the DB2 install directory. This variable is usually set in a user's environment
    from .db2profile.

$APT_DB2INSTANCE_HOME  [path]
    Used as a backup for specifying the DB2 installation directory (if $INSTHOME is undefined).

$APT_DBNAME  [database]
    Specifies the name of the DB2 database for DB2/UDB Enterprise stages if the Use Database
    Environment Variable option is True. If $APT_DBNAME is not defined, $DB2DBDFT is used to find
    the database name.

$APT_RDBMS_COMMIT_ROWS  [rows]
    Specifies the number of records to insert between commits. The default value is 2000 per
    partition. Can also be specified with the Row Commit Interval stage input property.

$DS_ENABLE_RESERVED_CHAR_CONVERT
    Allows DataStage plug-in stages to handle DB2 databases which use the special characters
    # and $ in column names.

Informix Environment Variables

$INFORMIXDIR  [path]
    Specifies the Informix install directory.

$INFORMIXSQLHOSTS  [filepath]
    Specifies the path to the Informix sqlhosts file.

$INFORMIXSERVER  [name]
    Specifies the name of the Informix server matching an entry in the sqlhosts file.

$APT_COMMIT_INTERVAL  [rows]
    Specifies the commit interval in rows for Informix HPL Loads. The default is 10000 per
    partition.

Oracle Environment Variables

$ORACLE_HOME  [path]
    Specifies the installation directory for the current Oracle instance. Normally set in a
    user's environment by Oracle scripts.

$ORACLE_SID  [sid]
    Specifies the Oracle service name, corresponding to a TNSNAMES entry.

$APT_ORAUPSERT_COMMIT_ROW_INTERVAL  [num]
$APT_ORAUPSERT_COMMIT_TIME_INTERVAL  [seconds]
    These two environment variables work together to specify how often target rows are committed
    for target Oracle stages with the Upsert method. Commits are made whenever the time interval
    has passed or the row interval is reached, whichever comes first. By default, commits are
    made every 2 seconds or 5000 rows per partition.

$APT_ORACLE_LOAD_OPTIONS  [SQL*Loader options]
    Specifies the Oracle SQL*Loader options used in a target Oracle stage with the Load method.
    By default, this is set to OPTIONS(DIRECT=TRUE, PARALLEL=TRUE).

$APT_ORACLE_LOAD_DELIMITED  [char]  (DataStage 7.01 and later)
    Specifies a field delimiter for target Oracle stages using the Load method. Setting this
    variable makes it possible to load fields with trailing or leading blank characters.

$APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM
    When set, a target Oracle stage with the Load method will limit the number of players to the
    number of datafiles in the table's tablespace.

$APT_ORA_WRITE_FILES  [filepath]
    Useful in debugging Oracle SQL*Loader issues. When set, the output of a target Oracle stage
    with the Load method is written to files instead of invoking the Oracle SQL*Loader. The
    filepath specified by this environment variable specifies the file with the SQL*Loader
    commands.

$DS_ENABLE_RESERVED_CHAR_CONVERT
    Allows DataStage plug-in stages to handle Oracle databases which use the special characters
    # and $ in column names.
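For instance, to run SQL*Loader conventional-path and single-stream (for example, when loading a table whose triggers must fire), the default load options can be overridden. The option string and delimiter below are hypothetical values; confirm them against your Oracle release:

```shell
# Override the default OPTIONS(DIRECT=TRUE, PARALLEL=TRUE) load options:
export APT_ORACLE_LOAD_OPTIONS='OPTIONS(DIRECT=FALSE, PARALLEL=FALSE)'
# Load delimited so leading/trailing blanks survive (DataStage 7.01+):
export APT_ORACLE_LOAD_DELIMITED='|'
echo "load options: $APT_ORACLE_LOAD_OPTIONS"
```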


Teradata Environment Variables

$APT_TERA_SYNC_DATABASE  [name]
    Starting with v7, specifies the database used for the terasync table. By default, EE uses the

$APT_TERA_SYNC_USER  [user]
    Starting with v7, specifies the user that creates and writes to the terasync table.

$APT_TERA_SYNC_PASSWORD  [password]
    Specifies the password for the user identified by $APT_TERA_SYNC_USER.

$APT_TERA_64K_BUFFERS
    Enables 64K buffer transfers (32K is the default). May improve performance depending on
    network configuration.

$APT_TERA_NO_ERR_CLEANUP
    This environment variable is not recommended for general use. When set, it may assist in job
    debugging by preventing the removal of error tables and the partially written target table.

$APT_TERA_NO_PERM_CHECKS
    Disables permission checking on the Teradata system tables that must be readable during the
    Teradata Enterprise load process. This can be used to improve the startup time of the load.

Netezza Environment Variables

$NETEZZA  [path]
    Specifies the Netezza home directory.

$NZ_ODBC_INI_PATH  [filepath]
    Points to the location of the .odbc.ini file. This is required for ODBC connectivity on UNIX
    systems.

$APT_DEBUG_MODULE_NAMES  [odbcstmt, odbcenv, nzetwriteop, nzutils, nzwriterep, nzetsubop]
    Prints debug messages from the specified DSEE modules; useful for debugging Netezza errors.

Job Monitoring Environment Variables

$APT_MONITOR_TIME  [seconds]
    In v7 and later, specifies the time interval (in seconds) for generating job monitor
    information at runtime. To enable size-based job monitoring, unset this environment variable
    and set $APT_MONITOR_SIZE instead.

$APT_MONITOR_SIZE  [rows]
    Determines the minimum number of records the job monitor reports. The default of 5000 records
    is usually too small. To minimize the number of messages during large job runs, set this to a
    higher value (for example, 1000000).

$APT_NO_JOBMON
    Disables job monitoring completely. In rare instances, this may improve performance. In
    general, this should only be set on a per-job basis when attempting to resolve performance
    bottlenecks.

$APT_RECORD_COUNTS
    Prints record counts in the job log as each operator completes processing. The count is per
    operator per partition.

Performance-Tuning Environment Variables

$APT_BUFFER_MAXIMUM_MEMORY  (example: 41903040)
    Specifies the maximum amount of virtual memory, in bytes, used per buffer per partition. If
    not set, the default is 3MB (3145728). Setting this value higher will use more memory,
    depending on the job flow, but may improve performance.

$APT_BUFFER_FREE_RUN  (example: 1000)
    Specifies how much of the available in-memory buffer to consume before the buffer offers
    resistance to any new data being written to it. If not set, the default is 0.5 (50% of
    $APT_BUFFER_MAXIMUM_MEMORY). If this value is greater than 1, the buffer operator will read
    $APT_BUFFER_FREE_RUN * $APT_BUFFER_MAXIMUM_MEMORY before offering resistance to new data.
    When this setting is greater than 1, buffer operators will spool data to disk (by default,
    scratch disk) after the $APT_BUFFER_MAXIMUM_MEMORY threshold. The maximum disk required will
    be $APT_BUFFER_FREE_RUN * # of buffers * $APT_BUFFER_MAXIMUM_MEMORY.

$APT_PERFORMANCE_DATA  [directory]
    Enables capture of detailed, per-process performance data in an XML file in the specified
    directory. Unset this environment variable to disable.

$TMPDIR  [path]
    Defaults to /tmp. Used for miscellaneous internal temporary data, including FIFO queues and
    Transformer temporary storage. As a minor optimization, it may be best set to a filesystem
    outside of the DataStage install directory.
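The scratch-disk formula above can be checked with a quick calculation. The buffer count here is a made-up example; the real number comes from the buffer operators in the job score:

```shell
free_run=1000          # APT_BUFFER_FREE_RUN > 1 enables spill-to-disk sizing
num_buffers=4          # hypothetical count of buffer operators in the score
max_mem=3145728        # APT_BUFFER_MAXIMUM_MEMORY default: 3 MB

# Worst-case scratch disk = FREE_RUN * (# of buffers) * MAXIMUM_MEMORY
scratch=$((free_run * num_buffers * max_mem))
echo "worst-case scratch disk: $scratch bytes"
```

With these example values the worst case is roughly 12.6 GB, which shows why large free-run values demand generous scratch space.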

Job Flow Debugging Environment Variables

$OSH_PRINT_SCHEMAS
    Outputs to the DataStage log the actual schema definitions used by the parallel framework at
    runtime. This can be useful when determining whether the actual runtime schema matches the
    expected job design table definitions.

$APT_DISABLE_COMBINATION
    Disables operator combination for all stages in a job, forcing each EE operator into a
    separate process. While not normally needed in a job flow, this setting may help when
    debugging a job flow or investigating performance by isolating individual operators in
    separate processes. The Advanced Stage Properties editor in DataStage Designer v7.1 and later
    allows combination to be enabled and disabled on a per-stage basis. Note that disabling
    operator combination will generate more UNIX processes, and hence require more system
    resources (and memory); it also disables internal optimizations for job efficiency and
    run times.

$APT_PM_PLAYER_TIMING
    Prints detailed information in the job log for each operator, including CPU utilization and
    elapsed processing time.

$APT_PM_PLAYER_MEMORY
    Prints detailed information in the job log for each operator when allocating additional heap
    memory.

$APT_BUFFERING_POLICY  FORCE
    Forces an internal buffer operator to be placed between every operator. Normally, the
    parallel framework inserts buffer operators into a job flow at runtime to avoid deadlocks
    and improve performance. Using $APT_BUFFERING_POLICY=FORCE in combination with
    $APT_BUFFER_FREE_RUN effectively isolates each operator from slowing upstream production.
    Using the job monitor performance statistics, this can identify which part of a job flow is
    impacting overall performance. Setting $APT_BUFFERING_POLICY=FORCE is not recommended for
    production job runs.

$DS_PX_DEBUG
    Set this environment variable to capture copies of the job score, generated osh, and internal
    Enterprise Edition log messages in a directory corresponding to the job name. This directory
    will be created in the Debugging sub-directory of the Project home directory on the DataStage
    server.

$APT_PM_STARTUP_CONCURRENCY
    This environment variable should not normally need to be set. When trying to start very large
    jobs on heavily loaded servers, lowering this number will limit the number of processes that
    are simultaneously created when a job is started.

$APT_PM_NODE_TIMEOUT  [seconds]
    For heavily loaded MPP or clustered environments, this variable determines the number of
    seconds the conductor node will wait for a successful startup from each section leader. The
    default is 30 seconds.
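A debugging run typically combines several of these switches; a sketch of a per-job wrapper follows (the variable choices, timeout value, and job names are illustrative, and these settings should be unset again for production):

```shell
# One-off debugging profile for a single troublesome job run.
export APT_DISABLE_COMBINATION=1   # one UNIX process per operator
export APT_PM_PLAYER_TIMING=1      # CPU / elapsed time per operator in the log
export OSH_PRINT_SCHEMAS=1         # runtime schemas in the job log
export APT_PM_NODE_TIMEOUT=120     # allow slow section-leader startup

# dsjob -run -jobstatus MyProject MyDebugJob   # placeholder names
echo "timeout=$APT_PM_NODE_TIMEOUT combination_disabled=$APT_DISABLE_COMBINATION"
```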
