Center of Excellence
1 of 72
© 2007 IBM Information Integration Solutions. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language in any form by any means without the written permission of IBM.
Document Goals

Intended Use: This document presents a set of standard practices, methodologies, and an example Toolkit for administering and integrating IBM WebSphere DataStage Enterprise Edition (DSEE) with a production infrastructure. Except where noted, this document is intended to supplement, not replace, the installation documentation.

Target Audience: The primary audience for this document is DataStage Administrators and Developers who have been trained in Enterprise Edition. Information in certain sections may also be relevant for Technical Architects and System Administrators.

Product Version: This document is intended for the following product releases:
- WebSphere DataStage Enterprise Edition 7.5.2 (UNIX, Linux, USS)
- WebSphere DataStage Enterprise Edition 7.5x2 (Windows)
Mike Carney (carneym@us.ibm.com)
Paul Christensen (ptc@us.ibm.com)
Bob Johnston (rdj@us.ibm.com)
Patrick Owen (powen@us.ibm.com)
Mike Ruland (mruland@us.ibm.com)
Jim Tsimis (jtsimis@us.ibm.com)
Rev.  Date               Description
1.0                      Initial release.
1.1                      Updated ETL and Project_Plus directory hierarchies for consistency across DSEE Standards. Added Staging directory hierarchy.
1.2                      Updated styles and formatting.
1.3                      Updated directory and Project_Plus naming standards for consistency across deliverables. Updated terminology and Naming Standards for consistency.
1.4                      Expanded discussion of Environment Variables and Parameters. Added Environment Variable Reference Appendix. Added Document Author and Contributors, and Package Contents.
2.0   February 8, 2007   Added Feedback section and IIS Services Offerings. Corrected Data Set and Scratch file system naming. Expanded backup discussion for Data Sets.
3.0   October 29, 2007   Updated positioning, naming (IIS to IPS), and Services Offerings. First public reference release; complements the Administration and Production Automation Services Workshop.
Document Conventions

This document uses the following conventions:

Bold: In syntax, bold indicates commands, function names, keywords, and options that must be input exactly as shown. In text, bold indicates keys to press, function names, and menu selections.

Italic: In syntax, italic indicates information that you supply. In text, italic also indicates UNIX commands and options, file names, and pathnames.

Plain: In text, plain indicates Windows NT commands and options, file names, and pathnames.

Bold Italic: Indicates important information.
Lucida Console: Lucida Console text indicates examples of source code and system output.

Lucida Console Bold: In examples, Lucida Console bold indicates characters that the user types or keys the user presses (for example, <Return>).

Lucida Blue: In examples, Lucida Blue is used to illustrate the operating system command-line prompt.

Right arrow: A right arrow between menu commands indicates that you should choose each command in sequence. For example, "Choose File → Exit" means you should choose File from the menu bar, and then choose Exit from the File pull-down menu.

Continuation character: The continuation character is used in source code examples to indicate a line that is too long to fit on the page, but must be entered as a single line on screen. For example:
This line
continues

All punctuation marks included in the syntax (for example, commas, parentheses, or quotation marks) are required unless otherwise indicated.

Syntax lines that do not fit on one line in this manual are continued on subsequent lines. The continuation lines are indented. When entering syntax, type the entire syntax entry, including the continuation lines, on the same input line.

Text enclosed in parentheses and underlined (like this) following the first use of a proper term is used instead of the proper term.

Interaction with our example system usually includes the system prompt (in blue) and the command, most often on 2 or more lines. If appropriate, the system prompt includes the user name and directory for context. For example:

%etl_node%:dsadm /usr/dsadm/Ascential/DataStage >
/bin/tar cvf /dev/rmt0 /usr/dsadm/Ascential/DataStage/Projects
Feedback
We value your input and suggestions for continuous improvement to this content. Direct any questions,
comments, corrections, or suggested additions to: cedifeed@us.ibm.com
Table of Contents

2 DATASTAGE ADMINISTRATION .................................................. 9
3.1 CONFIGURATION .......................................................... 30
3.2 JOB MONITOR ENVIRONMENT VARIABLES ...................................... 30
3.3 STARTING & STOPPING THE MONITOR ........................................ 30
3.4 MONITORING JOBMON ...................................................... 31
4 BACKUP / RECOVERY / REPLICATION / FAILOVER PROCEDURES .................... 32
5 OVERVIEW OF PRODUCTION AUTOMATION AND INFRASTRUCTURE INTEGRATION FOR DATASTAGE ... 39
7.1 SOURCE CONTROL ......................................................... 56
7.2 PRODUCTION MIGRATION LIFE CYCLE ........................................ 57
7.3 SECURITY ............................................................... 58
7.4 UPGRADE PROCEDURE (INCLUDING FALLBACK EMERGENCY PATCH) ................. 58
Services offerings and descriptions:

Professional Services (Iterations 2 Methodology, Standard Practices, Architecture and Design, Education and Mentoring): Whether through workshop delivery, project leadership, or mentored augmentation, the Professional Services staff of IBM Information Platform and Solutions leverages IBM's methodologies, Standard Practices, and experience developed throughout thousands of successful engagements in a wide range of industries and government entities.

Learning Services: IBM offers a variety of courses covering the IPS product portfolio. IBM's blended learning approach is based on the principle that people learn best when provided with a variety of learning methods that build upon and complement each other. With that in mind, courses are delivered through a variety of mechanisms: classroom, on-site, and Web-enabled FlexLearning.

Certification: IBM offers a number of Professional Certifications through independent testing centers worldwide. These certification exams provide a reliable, valid, and fair method of assessing product skills and knowledge gained through classroom and real-world experience.

Client Support Services: IBM is committed to providing our customers with reliable technical support worldwide. All Client Support services are available to customers who are covered under an active IBM IPS maintenance agreement. Our worldwide support organization is dedicated to assuring your continued success with IPS products and solutions.

Virtual Services: The low-cost Virtual Services offering is designed to supplement the global IBM IPS delivery team, as needed, by providing real-time, remote consulting services. Virtual Services has a large pool of experienced resources that can provide IT consulting, development, Migration, and Training services to customers for WebSphere DataStage Enterprise Edition (DSEE).
[Flowchart: Information Exchange and Discovery → Identify (Strategic Planning) → Startup (Requirements Definition, Architecture, and Project Planning) → Iterations 2 (Analysis & Design, Build, Test & Implement, Monitor & Refine)]

Information Exchange and Discovery: Targeted for clients new to the IBM IPS product portfolio, this workshop provides IBM's high-level recommendations on how to solve a customer's particular problem. IBM analyzes the data integration challenges outlined by the client and develops a strategic approach for addressing those challenges.

Strategic Planning: Guiding clients through the critical process of establishing a framework for a successful future project implementation, this workshop delivers a detailed project plan as well as a Project Blueprint. These deliverables document project parameters, current and conceptual end states, network topology, data architecture, and hardware and software specifications; outline a communication plan; define scope; and capture identified project risk.

Iterations 2: IBM's Iterations 2 is a framework for managing enterprise data integration projects that integrates with existing customer methodologies. Iterations 2 is a comprehensive, iterative, step-by-step approach that leads project teams from initial planning and strategy through to tactical implementation. This workshop includes the Iterations 2 software, along with customized mentoring.
Standard Practices Workshops

Installation and Configuration Workshop: Establishes a documented, repeatable process for installation and configuration of DSEE server and client components. This may involve review and validation of one or more existing DSEE environments, or planning, performing, and documenting a new installation.

Information Analysis Workshop: Provides clients with a set of Standard Practices and a repeatable methodology for analyzing the content, structure, and quality of data sources using the combination of WebSphere ProfileStage, QualityStage, and AuditStage.

Data Flow and Job Design Standard Practices Workshop: Helps clients establish standards and templates for the design and development of parallel jobs using DSEE through practitioner-led application of IBM Standard Practices to a client's environment, business, and technical requirements. The delivery includes a customized Standards document as well as custom job designs and templates for a focused subject area.

Data Quality Management Standard Practices Workshop: Provides clients with a set of standard processes for the design and development of data standardization, matching, and survivorship processes using WebSphere QualityStage. The data quality strategy formulates an auditing and monitoring program that helps ensure ongoing confidence in data accuracy, consistency, and identification through client mentoring and sharing of IBM Standard Practices.

Administration, Management, and Production Automation Workshop: Provides customers with a customized Toolkit and set of proven Standard Practices for integrating DSEE into a client's existing production infrastructure (monitoring, scheduling, auditing/logging, change management) and for administering, managing, and operating DSEE environments.

Advanced Deployment Workshops

Health Check Evaluation: This workshop is targeted for clients currently engaged in IPS development efforts that are not progressing according to plan, or for clients seeking validation of proposed plans prior to the commencement of new projects. It provides review of and recommendations for core ETL development and operational environments by an IBM expert practitioner. It also provides clients with an action plan and set of recommendations for meeting current and future capacity requirements for data integration, based on analysis of business and technical requirements, data volumes and growth projections, existing standards and technical architecture, and existing and future data integration projects.

Performance Tuning Workshop: Guides a client's technical staff through IBM Standard Practices and methodologies for review, analysis, and performance optimization using a targeted sample of client jobs and environments. This workshop can identify potential areas of improvement, demonstrate IBM's processes and techniques, and provide a final report that contains recommended performance modifications and IBM performance tuning guidelines.

High-Availability Architecture Workshop: Using IBM's IPS Standard Practices for high availability, this workshop presents a plan for meeting a customer's high availability requirements using the parallel framework of DSEE. It then implements the architectural modifications necessary for high availability computing.

Grid Computing Discovery, Architecture and Planning Workshop: Provides the planning and readiness efforts required to support a future deployment of the parallel framework of IPS on Grid computing platforms. This workshop prepares the foundation on which a follow-on Grid installation and deployment will be executed, and includes hardware and software recommendations and estimated scope.

Grid Computing Installation and Deployment Workshop: Installs, configures, and deploys the IBM IPS Grid Enabled Toolkit in a client's Grid environments and provides integration with Grid Resource Managers and configuration of DSEE, QualityStage/EE, and/or ProfileStage/EE.
For more details on any of these IBM IPS Professional Services offerings, and to find a local IBM
Information Integration Services contact, visit:
http://www.ibm.com/software/data/services/ii.html
Administration, Management and Production Automation Workshop
The following flowchart illustrates the various IPS Services workshops around the parallel framework
of DSEE.
The Administration, Management and Production Automation Workshop is intended to provide a set of proven Standard Practices and a customized toolkit for integrating DSEE into a customer's existing production infrastructure (monitoring, scheduling, auditing/logging, change management). It also provides expert practitioner recommendations for administering, managing, and operating DSEE environments.
2 DataStage Administration
This section of the document discusses DataStage Administration and Automation. It endeavors to join these disciplines in a cohesive manner by defining standard practices that are complementary. These standard practices rest on a common foundation: the operating environment and a simple life-cycle methodology.
Projects are both the logical and physical means for storing work performed in DataStage.
Projects are metadata repositories for DataStage objects, such as jobs, stages, and shared
containers.
Projects also store configuration metadata, like environment variables.
It is possible to create many projects.
Projects are independent of each other.
DataStage object metadata can be exported to a file as well as imported.
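The points above can be illustrated from the command line. The dsjob client, for example, lists the projects on a server; project export and import themselves are performed from the DataStage Manager client or its command-line equivalents. The install path below is an example only; substitute your own DSHOME.

```shell
# List all DataStage projects on this server (no-op if DataStage is not installed here)
DSHOME=${DSHOME:-/home/dsadm/Ascential/DataStage/DSEngine}
if [ -x "$DSHOME/bin/dsjob" ]; then
    "$DSHOME/bin/dsjob" -lprojects
fi
```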
DataStage user accounts for all phases of the ETL test cycle should be set up to allow for separate developer accounts and a separate account for each phase of the life cycle. Each of these accounts should be configured according to the section on configuring a DataStage user (below).
Data Storage
- DataStage temporary storage - scratch, temp, buffer
- DataStage parallel Data Set segment files
- Staging and Archival storage for any source file(s)
By default, each of these directories (except for file staging) is created during installation as a
subdirectory under the base DataStage installation directory.
IMPORTANT: Each storage class should be isolated in separate file systems to accommodate
their different performance and capacity characteristics and backup requirements.
The default installation is generally acceptable for small prototype environments.
2.2.1 Software Install Directory
The software install directory is created by the installation process, and contains the DSEE software
file tree. The install directory grows very little over the life of a major software release, so the default
location ($HOME for dsadm, e.g.: /home/dsadm) may be adequate.
The system administrator may choose to install DataStage in a subdirectory within an overall install
file system. You should verify that the install file system has at least 1GB of space for the software
directory (2GB if you are installing RTI or other optional components).
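The free-space check described above can be scripted. The sketch below uses the portable df -Pk output; the install file-system path is an example only, so substitute your own.

```shell
# fs_free_kb: report available kilobytes on the file system holding the given path
fs_free_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

INSTALL_FS=${INSTALL_FS:-/tmp}       # substitute the install file system, e.g. /opt/dsadm
free_kb=$(fs_free_kb "$INSTALL_FS")
if [ "$free_kb" -lt 1048576 ]; then  # 1GB = 1048576 KB; use 2097152 when installing RTI
    echo "WARNING: only ${free_kb} KB free on $INSTALL_FS"
fi
```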
For cluster or Grid implementations, it is generally best to share the Install file system across
servers (at the same mount point).
NOTE: the DataStage installer will attempt to rename the installation directory to support later
upgrades; if you install directly to a mount point this rename will fail and several error
messages will be displayed. Installation will succeed but the messages may be confusing.
It is recommended that a separate file system be created and mounted over the default location
for projects, the $DSROOT/Projects directory. Mount this directory after installing DSEE but
before projects are created.
The Projects directory should be a mirrored file system with sufficient space (minimum 100MB
per project).
For cluster or Grid implementations, it is generally best to share the Project file system across
servers (at the same mount point).
IMPORTANT: The project file system should be monitored to ensure adequate free space
remains. If the Project file system runs out of free space during DataStage activity, the
repository may become corrupted, requiring a restore from backup.
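The monitoring recommended above can be done with a simple scheduled check. This is a sketch: the mount point and the 90% threshold are example values to adapt to your environment.

```shell
# fs_pct_used: report percent-used for the file system holding the given path
fs_pct_used() {
    df -Pk "$1" | awk 'NR==2 {sub(/%/,"",$5); print $5}'
}

PROJECT_FS=${PROJECT_FS:-/tmp}   # substitute the Projects file system mount point
pct=$(fs_pct_used "$PROJECT_FS")
if [ "$pct" -ge 90 ]; then       # 90% is an example alert threshold
    echo "ALERT: Projects file system $PROJECT_FS is ${pct}% full"
fi
```

Run from cron so an alert is raised before DataStage activity can exhaust the file system.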
Effective management of space is important to the health and performance of a project. As jobs are added to a project, new directories are created in this file tree, and as jobs are run, their log entries multiply. These activities cause file-system stress (for example, more time to insert or delete DataStage components, and longer update times for logs). Failure to perform routine project maintenance (for example, removing obsolete jobs and managing log entries) can cause project obesity and performance issues.
The name of a DataStage Project is limited to a maximum of 18 characters and can contain alphanumeric characters and underscores.
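These rules are easy to enforce in provisioning scripts. The function below is an illustrative validator for the stated constraints (1 to 18 characters, alphanumerics and underscores only).

```shell
# valid_project_name: return 0 if the name satisfies the DataStage project naming rules
valid_project_name() {
    [ -n "$1" ] || return 1          # must not be empty
    [ ${#1} -le 18 ] || return 1     # at most 18 characters
    case "$1" in
        *[!A-Za-z0-9_]*) return 1 ;; # alphanumerics and underscore only
    esac
    return 0
}

valid_project_name dw_sales_dev && echo "dw_sales_dev is a valid project name"
```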
2.2.2.2 Project Recovery Considerations
Devising a backup scheme for project directories is based on 3 core issues:
1. Will there be valuable data stored in Server Edition hash files¹? DataStage Server Edition files
located in the DataStage file tree may require archiving from a data perspective.
2. How often will the UNIX file system containing the ENTIRE DataStage file tree be backed up?
When can DataStage be shut down to enable a cold snapshot of the Universe database as well as
the project files? A complete file system backup while DataStage is shut down accomplishes
this backup.
3. How often will the projects be backed up? Keep in mind that the grain of project backups will
represent the ability to recover lost work should a project or a job become corrupted.
At a minimum, a UNIX file system backup of the entire DataStage file tree should be performed at least weekly with the DataStage engine shut down, and each project should be backed up with the Manager at least nightly with all users logged out of DataStage. This is the equivalent of a weekly cold database backup plus six nightly updates.
If your installation has valuable information in Server hash files, you should increase the frequency of
your UNIX backup OR write jobs to unload the Server files to external media.
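A weekly cold backup along the lines described above can be sketched as follows. The uv -admin stop/start commands and the install paths are as commonly documented for the 7.x engine; verify both against your installation, and substitute your own backup target for the example tape device.

```shell
# Weekly cold backup sketch: stop the engine, archive the whole tree, restart.
# No-op on machines without a DataStage install.
DSHOME=${DSHOME:-/home/dsadm/Ascential/DataStage/DSEngine}
DSROOT=${DSROOT:-/home/dsadm/Ascential/DataStage}
if [ -x "$DSHOME/bin/uv" ]; then
    "$DSHOME/bin/uv" -admin -stop    # cold: engine down, all users logged out
    tar cvf /dev/rmt0 "$DSROOT"      # entire DataStage file tree to tape
    "$DSHOME/bin/uv" -admin -start
fi
```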
2.2.3 Data Set and Sort Directories
The DataStage installer creates the following two subdirectories within the DataStage install directory:
Datasets/
- stores individual segment files of DataStage parallel Data Sets
Scratch/
- used by the parallel framework for temporary files such as sort and buffer overflow
Try not to use these directories and consider deleting them to ensure they are never used. This is best
done immediately after installation; be sure to coordinate this standard with the rest of the team.
DataStage parallel Configuration files are used to assign resources (such as processing nodes, disk and
scratch file systems) at runtime when a job is executed.
¹ Note that the use of Server Edition components in an Enterprise Edition environment is discouraged for
performance and maintenance reasons. However, if legacy Server Edition applications exist, their corresponding
objects may need to be taken into consideration.
The DataStage installer creates a default parallel Configuration file (Configurations/default.apt) that references the Datasets and Scratch subdirectories within the install directory. The DataStage Administrator should consider removing the default.apt file altogether, or at a minimum updating it to reference the file systems you define (below).
2.2.3.1 Data and Scratch File Systems
It is a bad practice to share the DataStage install and Projects file systems with volatile files like scratch
files and Parallel data set segment files. Resource, scratch and sort disks service very different kinds of
data with completely opposite persistence characteristics. Furthermore, they compete directly with
each other for I/O bandwidth and service time if they share the same path.
Optimally, these file systems should not have any physical disks in common and should not share any
physical disks with databases. While it is often impossible to allocate contention-free storage, it must
be noted that at large data volumes and/or in highly active job environments, disk arm contention can
and usually does significantly constrain performance.
NOTE: For optimal performance, file systems should be created in high performance, low
contention storage. The file systems should be expandable without requiring destruction and recreation.
Some files created by database stages persist after job completion. For example, the Oracle .log, .ctl and .bad
files will remain in the first Scratch resource pool after a load completes.
are warm files (e.g.: being read and written at above average rates). Note that sort space must
accommodate only the files being sorted simultaneously, and, assuming that jobs are scheduled nonconcurrently, only the maximum of said sorts.
There is no persistence to these temporary sort files so they need not be archived.
Sizing DataStage scratch space is somewhat difficult. Objects in this space include lookups and intra-process buffers. Intra-process buffers absorb rows at runtime when a stage (or stages) in a partition (or
all partitions) cannot process rows as fast as they are supplied. In general, there are as many buffers as
there are stages on the canvas for each partition. As a practical matter, assume that scratch space must
accommodate the largest volume of data in one job (see the previous formula for Data Sets and flat
files). There are advanced ways to isolate buffer storage from sort storage, but this is a performance
tuning exercise, not a general requirement.
2.2.3.4 Maintaining Parallel Configuration Files
DataStage parallel Configuration files are used to assign resources (such as processing nodes, disk and
scratch file systems) at runtime when a job is executed. Parallel Configuration files are discussed in
detail in the DataStage Parallel Job Advanced Developers Guide.
Parallel configuration files can be located in any directory that has suitable access permissions and are selected at runtime through the environment variable $APT_CONFIG_FILE. However, the graphical Configurations tool within the DataStage clients expects these files to be stored within the Configurations subdirectory of the DataStage install. For this reason, it is recommended that all parallel configuration files be stored in the Configurations subdirectory, with naming conventions to associate them with a particular project or application.
The default.apt file is created when DataStage is installed, and references the DataSets and Scratch
subdirectories of the DataStage install directory. To manage system resources and disk allocation,
the DataStage administrator should consider removing this file, creating separate configuration
files that are referenced by the $APT_CONFIG_FILE setting in each DataStage Project.
At a minimum, the DataStage administrator should edit the default.apt configuration file to reference
the newly-created Data and Scratch file systems, and to ensure that these directories are used by any
other parallel configuration files.
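As a sketch, a two-node configuration file of the kind described above might look like the following. The host name and file-system paths are examples only; they should point at the dedicated Data and Scratch file systems, not at the install directory.

```
{
  node "node1" {
    fastname "etl_server"
    pools ""
    resource disk "/data1/dstage/datasets" {pools ""}
    resource scratchdisk "/scratch1/dstage" {pools ""}
  }
  node "node2" {
    fastname "etl_server"
    pools ""
    resource disk "/data2/dstage/datasets" {pools ""}
    resource scratchdisk "/scratch2/dstage" {pools ""}
  }
}
```

Pointing $APT_CONFIG_FILE at a file such as this in each project keeps volatile Data Set and scratch I/O off the install and Projects file systems.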
2.2.4 Extending the DataStage Project for External Entities
It is recommended that another directory structure be created to integrate all aspects of a DataStage application that are managed outside of the DataStage Projects repository. This hierarchy should include directories for secured parameter files, Data Set header files, custom components, Orchestrate schemas, SQL, and shell scripts. It may also be useful to support custom job logs and reports.
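A hierarchy along these lines might look like the following sketch; the directory names are illustrative, not a prescribed standard.

```
Project_Plus/
  <project_name>/
    params/       secured parameter files
    datasets/     Data Set header (.ds) files
    bin/          custom components and shell scripts
    schemas/      Orchestrate schema files
    sql/          SQL scripts
    logs/         custom job logs
    reports/      custom reports
```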
When a TCP socket becomes disconnected, UNIX holds its port (the dsrpcd listener port) in FIN_WAIT state for the length of the FIN_WAIT timeout. While the port is in this state, the DataStage dsrpcd server will not start.

You can either wait for the FIN_WAIT timeout to expire (usually 10 minutes) or, in an emergency, change the setting as root to something like one minute. This is a dynamic network parameter and can be set temporarily to a lower value; reset it to the original value once the server starts.

Use the following utilities:
- ndd (Solaris, HP-UX)
- no (AIX)
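A sketch of the diagnosis and the tuning commands follows. The port number 31538 is the commonly documented default for dsrpcd; the tunable names and units shown in the comments are as I recall them for Solaris and AIX, so verify them against your OS release before use.

```shell
DSRPC_PORT=31538   # default dsrpcd listener port; check /etc/services for yours
# Is the port stuck in a FIN_WAIT state?
netstat -an 2>/dev/null | grep "$DSRPC_PORT" || true

# Temporarily shorten the FIN_WAIT interval (run as root), then restore it:
#   Solaris:  ndd -set /dev/tcp tcp_fin_wait_2_flush_interval 60000   # milliseconds
#   AIX:      no -o tcp_finwait2=120                                  # half-seconds
```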
2.3.5 Universe Shell
The DataStage server engine is based on Universe. It is a complete application environment containing
a shell, file types, a programming language, and many facilities for application operations such as lock
management. To invoke the Universe shell, the DataStage environment variables must be set; this is
easily done by sourcing the dsenv file in $DSHOME. Then invoke the shell with these commands:
cd $DSHOME
. ./dsenv
bin/uvsh
2.3.6 Resource Locks
If a developer is working on a job in the Designer and there is a network failure or client machine failure, the job will remain locked as far as DataStage is concerned. When a job is locked, it must be cleared before it can be accessed by any DataStage component. Locks can be cleared from the DataStage Director pull-down menu (Job > Cleanup Resources). Choosing this option opens the Job Resources interface.
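Locks can also be inspected from the Universe shell described earlier. LIST.READU is the standard Universe command for listing locks; this is a sketch (no-op without a DataStage install), and you should confirm the lock-clearing syntax for your release before using it.

```shell
DSHOME=${DSHOME:-/home/dsadm/Ascential/DataStage/DSEngine}
if [ -x "$DSHOME/bin/uvsh" ]; then
    cd "$DSHOME"
    . ./dsenv                               # set DataStage environment variables
    echo "LIST.READU EVERY" | bin/uvsh      # list current record, file, and group locks
    # Locks held by a dead client can then be cleared from the Director
    # (Job > Cleanup Resources) or with the engine's unlock facilities.
fi
```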
environment. Once the processing bottleneck is discovered, action can be taken to improve performance. For example, if a job appears to be running slowly with no indication of a CPU, I/O, or memory bottleneck, performance may be improved by creating more logical processing nodes in the DataStage EE configuration file, or the job may need to be redesigned. As parallelism is increased, more system resources are utilized, and the system itself may become the gating factor of performance. The remedy may be to increase system resources, such as adding more CPUs or spreading I/O across other physical devices and controllers.
2.4.1 DataStage EE Job Monitor
The DataStage EE job monitor (JobMonApp) provides a useful snapshot of a job's performance at a moment of execution, but does not provide thorough performance metrics. That is, a JobMonApp snapshot should not be used in place of a full run of the job, or a run with a sample set of data. Due to buffering and to some jobs' semantics, a snapshot image of the flow may not be a representative sample of the performance over the course of the entire job.
The CPU Summary information provided by JobMonApp is useful as a first approximation of where
time is being spent in the flow. However, it will not show operators that were inserted by the parallel
framework. Such operators include sorts, that were not explicitly included, and sub-operators of
composites.
2.4.2 Performance Metrics with DataStage EE Environment Variables
There are a number of environment variables that direct DataStage parallel jobs to report detailed
runtime information, enabling you to determine where time is being spent, how many rows were
processed, and how much memory each instance of a stage utilized during a run. Setting these
environment variables also allows you to report on operators that were inserted by the parallel
framework, such as sorts that were not explicitly included, buffer operators, and sub-operators of
composites.
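For example, these reporting variables can be enabled for a session in dsenv or set as job parameters; per this section, setting each variable (any value) is what activates the reporting:

```shell
# Enable detailed runtime reporting for the current session.
export APT_PM_PLAYER_MEMORY=1   # per-player heap allocation in the job log
export APT_PM_PLAYER_TIMING=1   # per-player CPU timings in the job log
export APT_RECORD_COUNTS=1      # per-player input/output record counts
```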
APT_PM_PLAYER_MEMORY
Setting this variable causes each player process to report the process heap memory allocation in the job
log when the operator instance completes execution.
Example of player memory:
APT_CombinedOperatorController,0: Heap growth during runLocally(): 1773568 bytes
APT_PM_PLAYER_TIMING
Setting this variable causes each player process to report its call and return in the job log. The message
with the return is annotated with CPU times for the player process.
Example of player timings, showing the elapsed time of the operator and the amount of user and
system time, as well as total CPU:
APT_CombinedOperatorController,0: Operator completed. status: APT_StatusOk elapsed: 0.30 user: 0.02 sys: 0.02 (total CPU: 0.04)
IBM IPS Parallel Framework: Administration and Production Automation
APT_RECORD_COUNTS
Setting this variable causes DataStage to print to the job log, for each operator player, the number of
records input and output. Abandoned input records are not necessarily accounted for, and buffer
operators do not print this information.
Example of record counts showing the number of rows processed for the input link and output link
of partition 0 of the Sort_3 stage:
Sort_3,0: Input 0 consumed 5000 records.
Output 0 produced 5000 records.
APT_PERFORMANCE_DATA
APT_PERFORMANCE_DATA, or the osh -pdd <performance data directory> advanced runtime
option, allows you to capture raw performance data for every underlying job process at runtime.
As a job parameter, set $APT_PERFORMANCE_DATA = dirpath, where dirpath is a directory
on the DataStage server in which to capture performance statistics. This will create an XML document
named performance.<pid> in the specified directory. You can influence the name of the file by
specifying the osh -jobid <jobid> advanced runtime option, in which case the performance XML
document will be named performance.<jobid>.
The following XML header shows the detailed performance data captured in each record. Note that this
information is more detailed than the higher-level information captured by DSMakeJobReport and
includes information on all of the processes (including Buffer operators and framework-inserted sorts):
<?xml version="1.0" encoding="ISO-8859-1" ?>
<performance_output version="1.0" date="20050111 16:29:00"
framework_revision="7.5.0" job_ident="202416">
<layout delimiter=",">
<field name="TIME"/>
<field name="PARTITION_NUMBER"/>
<field name="PROCESS_NUMBER"/>
<field name="OPERATOR_NUMBER"/>
<field name="IDENT"/>
<field name="JOBMON_IDENT"/>
<field name="PHASE"/>
<field name="SUBPHASE"/>
<field name="ELAPSED_TIME"/>
<field name="CPU_TIME"/>
<field name="SYSTEM_TIME"/>
<field name="HEAP"/>
<field name="RECORDS"/>
<field name="STATUS"/>
</layout>
<run_data>
Starting with release 7.5, the Perl script performance_convert located in the directory
$APT_ORCHHOME/bin can be used to convert the raw performance data into other usable formats,
including:
- CSV text files
- detail Data Sets
- summary Data Sets (summarizing the total time and maximum heap memory usage per operator)
The syntax is:
perl $APT_ORCHHOME/bin/performance_convert inputfile output_base [-schema|-dataset|-summary] [-help]
where
inputfile - location of the performance data to convert
output_base - location and file prefix for all files being generated
(ex. /mydir/jobid -> /mydir/jobid.CSV)
2.4.3 iostat
iostat is useful for examining the throughput of various disk resources. If one or more disks have high
throughput, understanding where that throughput is coming from is vital. If there are spare CPU cycles,
I/O is often the culprit. iostat can also help determine whether there is excessive I/O for a specific job.
The specifics of iostat output vary slightly from system to system. Here is an example from a Linux
machine showing a relatively light load:
(The first set of output is cumulative data since the machine was booted)
$ iostat 10
Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
dev8-0           13.50       144.09       122.33   346233038   293951288

Every N seconds (10 in the command-line example), iostat outputs:
Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
dev8-0            4.00         0.00        96.00           0          96
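As a hedged sketch, the tps figure for a single device can be pulled from iostat output with a small awk filter. The column layout varies by platform, so the field position is an assumption based on the Linux example above; the device name dev8-0 also follows that example.

```shell
# Extract the tps column (field 2 on this Linux layout) for one device
# from iostat output read on stdin.
iostat_tps() {
    awk -v dev="$1" '$1 == dev { print $2 }'
}

# Example: iostat 10 | iostat_tps dev8-0
```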
2.4.4 vmstat
vmstat is useful for examining system paging. Ideally, an EE flow, once it begins running, should never
page to disk (si and so should be zero). Paging suggests EE is consuming too much total memory.
$ vmstat 1
procs               memory              swap          io      system       cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs  us sy id
 0  0  0  10692  24648  51872 228836    0    0     0     1    2     2   1  1  0

vmstat produces the following every N seconds:
0 328 41 1 0 99
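A minimal sketch of an automated paging check, assuming the Linux column order shown above (si and so are fields 8 and 9; adjust the field numbers for your platform):

```shell
# Flag any vmstat sample row with nonzero si/so (paging activity).
# Skips the two header lines; reads `vmstat 1` output on stdin.
vmstat_paging() {
    awk 'NR > 2 && ($8 > 0 || $9 > 0) {
        print "paging detected: si=" $8 " so=" $9
    }'
}
```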
Job Logs - Purge by setting up a purge policy through the DataStage Administrator. This
standard practice emphatically recommends setting a purge policy to avoid filling the project
file system.
&PH& - This is a directory in each project that is used for the stderr and stdout of phantom
processes. Each job execution creates a file in this directory, so over time the directory will
grow and should be cleaned on a regular basis to avoid filling up the project file system.
Typical file size is less than 1K; files larger than 1K are an indication of a problem with a job.
In the event of a hard crash of a job, examining the DSD.RUN* files may provide useful
information in explaining the problem.
$TMPDIR - This environment variable tells DataStage where to write temporary files created
by the parallel framework, such as the job score and temporary files for the Lookup stage. This
directory is automatically cleaned up by the parallel framework; however, hard crashes may
leave files stranded in it. You can identify DataStage EE temp files by looking for files that
begin with APT*. The default $TMPDIR is /tmp; performance improvements can be
achieved by setting $TMPDIR to a faster file system.
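The &PH& and $TMPDIR housekeeping described above can be sketched as follows. The 7-day retention period is an example choice, and no jobs should be running when &PH& is cleaned:

```shell
# Remove &PH& files older than 7 days from a project directory ($1).
clean_ph() {
    find "$1/&PH&" -type f -mtime +7 -exec rm -f {} \;
}

# List stranded parallel-framework temp files (APT*) left by hard crashes
# in $TMPDIR (or /tmp if unset).
stranded_tmp() {
    find "${1:-/tmp}" -name 'APT*' -print 2>/dev/null
}
```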
created in the scratch area. For example, if there is a bottleneck in a process that fork-joins,
buffer overflow files will be written to scratch. The more files, the more buffering.
DataSets - The directories used by data sets can be identified by examining the
APT_CONFIG_FILE, by using the orchadmin command-line tool, or via Tools -> Data Set
Management from the DataStage GUIs.
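One way to identify data set and scratch directories without the GUI or orchadmin is to parse the configuration file directly. This sketch assumes the usual resource disk "/path" {} syntax of an EE configuration file:

```shell
# Print resource disk and scratchdisk paths from an APT_CONFIG_FILE ($1).
config_disks() {
    grep -E 'resource +(disk|scratchdisk)' "$1" |
        awk '{ print $3 }' | tr -d '"'
}
```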
DataStage Production Manager, who has full access to all areas of a DataStage project and can
also create and manipulate protected projects. (Currently, on UNIX systems, the Production Manager
must be root or the administrative user in order to protect or unprotect projects.)
DataStage Operator, who has permission to run and manage DataStage jobs.
<None>, who does not have permission to log on to DataStage.
You cannot assign individual users to these categories; you must assign the operating system user
group to which the user belongs. For example, a user with the user ID peter belongs to a user group
called clerks. To give DataStage Operator status to user peter, you must assign the clerks user group
to the DataStage Operator category.
Note: When you first install DataStage, the Everyone group is assigned to the category DataStage
Developer. This group contains all users, meaning that every user has full access to DataStage. When
you change the user group assignments, remember that these changes are meaningful only if you also
change the category to which the Everyone group is assigned.
2.5.2 User Environment
It is common for DataStage developers and administrators to utilize the UNIX or Windows command
line. For this reason the DataStage user accounts should be configured with the proper environment
variables.
All users should have these lines added to their login profile:
dsroot="`cat /.dshome`/.."
export dsroot
. $dsroot/DSEngine/dsenv
The following steps explain in detail how to configure the DataStage user's environment. The steps
described above should be sufficient.
1 In your .profile, .kshrc, or .cshrc, set the APT_ORCHHOME environment variable to the directory in which
Orchestrate is installed. This is either the default, /ascential/apt, or the directory you have defined as part
of the installation procedure.
As noted earlier, the /.dshome file is only created on default (non-itag) installs.
2 Add $APT_ORCHHOME/bin to your PATH environment variable. This is required for access to all
of the parallel engine commands.
In most cases the DataStage project should be set not to time out. This will prevent developers from
losing unsaved work. However, you may find that you require an Inactivity Timeout if careless
developers leave inactive sessions open for days.
All projects should have Enable job administration in Director turned on. This allows a developer to
unlock jobs and clear a job's status file. Runtime Column Propagation is an extremely powerful
feature of the parallel framework that allows reuse and efficient processing; it should be turned on.
Auto-purge of the job log should be configured; this will help keep disk space usage on the project
file system under control.
By default DataStage grants the Developer role to all groups. You should restrict the DataStage
Developer and Production Manager roles to trusted users only.
This standard practice recommends always enabling Automatically handle activities that fail. The
other options are optional. Add checkpoints so sequence is restartable on failure should be
configured only if this is an acceptable approach to checkpoint restart.
The Generated OSH visible for Parallel jobs in ALL projects button should be checked.
3 Job Monitor
The DataStage job monitor provides the capability for collecting and reporting performance metrics. It
must be running in order for the Audit & Metrics system (below) to function. The job monitor may
impact system performance; it can be tuned or shut off using the environment variables described
below.
3.1 Configuration
The job monitor uses two TCP ports, which are chosen during installation. As a manual step, entries
should be made in the /etc/services file to protect the sockets used by the job monitor. The default
port numbers are 13400 and 13401, and entries in this file may look like this:
dsjobmon    13400/tcp
dsjobmon    13401/tcp
environments is to use a size of about 10000 and turn off APT_MONITOR_TIME with $UNSET.
For an explanation of time-based versus row-based monitoring, see the Job Monitor section (page 31)
of the Parallel Job Advanced Developer's Guide (advpx.pdf).
APT_MONITOR_SIZE
Determines the minimum number of records the DataStage Job Monitor reports. The default is 5000
records.
APT_MONITOR_TIME
Determines the minimum time interval in seconds for generating monitor information at runtime. The
default is 5 seconds. This variable takes precedence over APT_MONITOR_SIZE.
APT_NO_JOBMON
Turns off job monitoring entirely.
The DataStage Repository file UV.ACCOUNT contains the directory paths for each project. This file
can be queried by the command:
echo "SELECT * FROM UV.ACCOUNT;" | bin/uvsh
DataStage projects should be protected by both full and incremental system backups, performed at
regular intervals (daily, hourly) that minimize exposure to a crash.
Special consideration should be given to development projects, since these are where developers will
be saving work throughout the day. Developers and administrators should be aware that, in the event
of a catastrophic storage system failure, work saved between backups could be lost.
It is best to back up the system, especially projects, when jobs are not running and developers are not
on the system. Due to the dynamic nature of a DataStage repository and its multi-file structure, there
is a potential for a hot backup to contain an inconsistent view of the repository. This situation exists
in almost all modern databases (except single-file ones): because the database is made up of many
files that are updated at different times, getting a consistent view of all these files with a hot backup is
difficult without complex solutions like breaking volume mirrors.
Avoid storing volatile files in a DataStage project, to prevent wasting the time and space required for
the project backup.
Consider locating non-volatile external entities in the project, to provide a convenient method for
backing up external entities that are related to the project.
Consider the DataStage job log purge policy. To maximize backup efficiency, set a log retention
policy that purges shortly after a backup, without erasing entries before they are backed up. For
example, if you incrementally back up a project daily, set the purge policy to every two days. This
ensures all log entries are backed up, with minimal overlap.
Parts of the DataStage processing environment will need to be replicated to other physical processors if
you choose to run DataStage in an MPP, Cluster or Grid environment or employ a failover strategy.
4.6.1 Replication for MPP, Cluster, Grid.
When a parallel job is run, in part or as a whole, on one or more physical processing nodes other than
the DataStage conductor node, the following two configuration steps must be performed:
All or part of the DataStage EE environment must be replicated on all processing
nodes.
Ensure that the user account under which jobs will be launched from the conductor
node has privileges to rsh to all the other nodes. DataStage EE can be configured
to use ssh.
The DataStage EE environment includes the DataStage conductor directory (../Ascential/DataStage),
project- and job-specific object files (Transformer, Build-op, custom routines), and external entities
such as third-party libraries. External entities may have specific installation requirements and
dependencies; therefore, replicating them should be done by following the vendor's instructions.
Under all remote execution scenarios the $dsroot/PXEngine directory requires replication. Libraries
used by a job may reside in other directories, such as $dsroot/DSCAPIop and $dsroot/RTIOperators;
for this reason it is a standard practice to replicate the entire ../Ascential/DataStage/ directory. This is
also relevant for conductor failover. Project directories should be replicated as a standard practice.
There are two methods used to replicate the DataStage environment:
1. Globally cross mounting, usually via NFS
2. Moving a physical copy, by hand or using the copy-orchdist utility
For more details refer to the Install and Upgrade Guide (dsupgde.pdf), Copying the Parallel Engine to
Your System Nodes.
For both replication methods the directory path should be identical. That is, the cross mount or
replicated copy should use the same paths on all systems. For example, if DataStage is installed in
/opt/Ascential/DataStage and the project is in /var/Projects/myDataStageProject, the user would see
the same files at these paths on all systems of the cluster.
As a standard practice, adopting the global cross mount approach to replication is recommended. This
greatly simplifies propagating changes to all physical processors; for example, an upgrade, patch, or
new job is propagated to all nodes automatically.
4.6.2 Distributing Transformers at Runtime
The APT_COPY_TRANSFORM_OPERATOR environment variable can be used to distribute the
Transformer shared objects. It is intended to be used in distributed environments where the project is
not cross mounted. It is not a complete solution, because it does not copy external functions, and it
adds time to job startup.
When this variable is set to any value, APT_TransformOperator::distributeSharedObj() is called in
describeOperator() to distribute the shared object file of the sub-level transform operator.
4.6.3 Installing DataStage EE Engine on a Remote Node
If you plan to use DataStage EE on multiple nodes, you need to ensure that the EE Engine
components are installed on all the nodes.
1. After your initial installation on the primary conductor, copy the contents of the PXEngine
directory over to all the other nodes. For example, if you installed DataStage under
/apps/Ascential/DataStage, then the PXEngine directory will be under
/apps/Ascential/DataStage/PXEngine. Note that the PXEngine directory has to exist in exactly
the same location on all nodes; this can be a symbolic link.
2. Next, add entries in the EE configuration file for all the new nodes.
3. Ensure that the user under whom the jobs will be launched from the conductor node has
privileges to rsh to all the other nodes. DataStage EE can be configured to use ssh (see the next
section).
If you have a large number of nodes, you can use the maintenance menu of the DataStage install to
copy the PXEngine directories to new nodes. Note that you will need to configure rsh access to the
nodes before the installation.
4.6.4 Replication of the Conductor for Failover/Cluster
DataStage requires one node of any processing environment to be the conductor. The conductor is the
machine of the cluster on which the DataStage Server runs, and the DataStage Server must be running
in order for the rest of the environment to function. If the conductor node crashes, the DataStage
Conductor can therefore be moved to and run on a different physical node of a cluster, grid, or MPP.
Replication of the conductor on Windows requires a trick, because of the Windows registry. The trick
requires you to run the normal DataStage Server installation on all nodes you intend to run DataStage
on, always using the same volume and path. If you are using the cross mount method for replication,
the end result is that all systems refer to the same physical copy of the DataStage server.
When building on a Windows cluster:
1. Cross mount the DataStage installation drive (e.g. D:) on all machines.
2. Install on the first machine.
3. Shut down DataStage.
4. Install on another machine, using the same location D: (overwriting the DataStage installation).
Be aware that after DataStage has been put into service, using this method to install on more machines
will overwrite a previously configured installation. Therefore, to install DataStage on additional
systems after any configuration or development has been performed, follow these steps:
1.
2.
3.
4.
Warning: starting up and running more than one instance of the DataStage server (dsrpcd) in a
cross-mounted configuration will cause corruption in the DataStage Conductor repository and the
DataStage project repositories.
You may either cross mount or make a physical copy of the DataStage Conductor
../Ascential/DataStage. The cross mount method is recommended to simplify maintenance such as
upgrades, patches, and project creation.
Typically, failover software is utilized to manage start and stop activities of the DataStage server.
Because the failover software should be configured to stop and start the DataStage server, you will
need to disable normal system startup (S99ds.rc).
$DSHOME/bin/uv -admin -autostart on   (enable auto start)
$DSHOME/bin/uv -admin -autostart off  (disable auto start)
The failover software will need to monitor the DataStage Server process dsrpcd and the network port
used by dsrpc.
> ps -ef | grep dsrpcd
root  26870  1  0 Mar14 ?  00:00:00 /var/Ascential/DataStage/DSEngine/bin/dsrpcd
> netstat -a | grep dsrpc
tcp    0    0 *:dsrpc    *:*    LISTEN
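A failover monitor's liveness check might be sketched like this, using the same ps and netstat probes shown above; the restart action itself is site-specific and belongs to the failover software:

```shell
# Return 0 if the dsrpcd process is running and its port is listening.
# The [d] trick keeps grep from matching its own process entry.
dsrpcd_alive() {
    ps -ef | grep '[d]srpcd' > /dev/null || return 1
    netstat -a 2>/dev/null | grep dsrpc > /dev/null || return 1
    return 0
}

# Example: dsrpcd_alive || echo "dsrpcd down: trigger failover" >&2
```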
If the DataStage server process fails, the failover software should try to restart it on the primary
server. If the primary server is not available, DataStage should be started on the failover server; you
may have more than one failover server. This satisfies failover for the DataStage Conductor. Failover
and restart for applications developed in DataStage are covered in the production automation
section 5.
In the event that a file system fills up, you will not detect any corruption unless the corrupt files in the
project repository are read. You should shut down the DataStage server and run the uvbackup
command, which reads all files. The uvbackup command allows you to back up to /dev/null on
UNIX and NUL on Windows.
$ find $DSHOME/../Projects -print | $DSHOME/bin/uvbackup -f -v -l "FULL SYSTEM BACKUP" - > /dev/null
To prevent the file system from filling up, use these three practices:
1. Establish an auto-purge policy for your job logs.
2. Do not allow users to create temporary files in the project directory. The project directory
becomes the default directory for the DataStage user, and it is common for developers to utilize
it for hash files, data files, temp files, etc. Developers should follow the standard practice of
using a file system other than the project file system and always parameterize the path of the
file name, for example #$ProjectPlus_tmp#/crossRef.dat.
3. Monitor the project file systems closely.
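Practice 3 (monitoring) can be sketched with a df-based check; the 90% threshold is an example value to be tuned per site:

```shell
# Print the percent-used figure (without the % sign) for a file system ($1).
fs_usage() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when usage crosses an example 90% threshold.
check_fs() {
    u=`fs_usage "$1"`
    [ "$u" -ge 90 ] && echo "WARNING: $1 is ${u}% full"
    return 0
}
```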
the methods, tools and options for building a new audit and metrics reporting infrastructure or
integrating into an existing one.
[Diagram: Pre-Execution, EE job, Post-Execution, Rejects, Check-Point Log, External Log,
Enterprise Monitoring Agent, and Audit/Metrics Database.]
This diagram depicts some of the typical interactions between DataStage and the production
automation environment:
The enterprise scheduler invokes a shell script and waits for a completion status.
The shell script runs a DataStage Sequencer job and waits for it to complete. It then checks
the status of dsjob and writes exceptions to the external log.
The external log is monitored by the enterprise monitoring agent, which signals the operations
console when new entries report errors.
The Sequencer job runs the DataStage jobs.
Various other features shown in this diagram are an Audit & Metrics Reporting Database, rejects
(files and tables), the DataStage job logs, and parameters, all of which can have some interaction
with a DataStage Sequencer job and the DataStage EE jobs.
Consult the DataStage Parallel Job Advanced Developer's Guide (advpx.pdf), section DataStage
Development Kit (Job Control Interfaces), for a complete description of dsjob and the API.
Consult the DataStage Designer Guide (coredevgde.pdf), section 6, Job Sequences, for
documentation details about Job Sequencers.
Custom job control is often implemented using DataStage routines and shell scripts.
When developing a shell script, the dsjob command-line interface will most likely be used to run a
job; experience in shell scripting (preferably Korn shell) is recommended.
When developing custom job control with DataStage routines, the developer will need to program in
DataStage BASIC. A useful resource is the DataStage BASIC Guide (Basic.pdf); the online help
also does an excellent job of documenting built-in DataStage BASIC functions.
IMPORTANT: the dsjob command should be used in a conservative manner. Each time dsjob is
invoked, it must log into the DataStage project, so excessive use of dsjob is slow and inefficient
for frequent, repetitive tasks.
The parallel framework internally monitors all the individual player processes. It ensures that if one
player fails, all players fail on all processing nodes.
Messages written to the job log with status FATAL will cause a job to abort. Messages of type
WARNING can cause a job to abort if the warning threshold is exceeded. This threshold can be set
or turned off from the Director or using the Job Control API. When a limit is set on a sequencer job,
all child sequencers inherit that setting. The following example shows the DataStage Job Control API
calls to set limits from a job control routine:
*
* Set Limits for Job
*
LimitErr = DSSetJobLimit(JobHandle, DSJ.LIMITROWS, rowLimitParm)
LimitErr = DSSetJobLimit(JobHandle, DSJ.LIMITWARN, warningLimitParm)
Overall job design should consider exception handling, including reject settings. Stages should be
configured to handle errors and rejects; stages like Sequential File, Transformer, Oracle EE, and
Merge support a reject link. Rejects cause warnings to be written to the job log.
The Job Control API functions allow you to write messages directly to the job log from job control;
DSLogWarning and DSLogFatal are the basic versions. Messages can also be logged from dsjob.
Routines that are used in a sequencer Routine Activity must return a 0 for success or a 1 for error. If a
Rountine activity returns a non-zero value that is not an error you will have to configure a trigger to
handle this case, resulting in more complex sequencer logic. So, try standardizing on 0 for success and
1 for error.
Routines can also be used in various Job Sequencer stages: in parameter expressions, trigger
expressions, and Variable Activity stage expressions. When used in these places, the return value is
not tested by the Job Sequencer the way a Routine Activity's is. To ensure exceptions are raised to the
controlling job, all routines should trap exceptions and log a warning (DSLogWarn) or fatal error
(DSLogFatal) to the job log.
In general, all routines should trap errors and call DSLogWarn or DSLogFatal.
Sequencer jobs should be configured with "Automatically handle activities that fail". This ensures
that any job that aborts will be detected. Warnings are not handled automatically by the sequencer;
you must either explicitly set a Job Activity trigger on a warning or set the warning limit on the
job.
The concept of a checkpoint relies on the ability to define a unit of work. Within the scope of a
DataStage parallel job, this is a complex problem to solve, because stage instances tend to run
independently of one another in a nondeterministic manner.
DataStage EE supports functionality that allows a unit of work to be defined in terms of data rows, but
this has only been implemented using the mqread and unitofwork stages. Therefore, in most cases, it
is up to the developer to design their ETL process in a manner that allows it to restart without
corrupting the target data sink, while minimizing reprocessing. This typically is done by determining
logical boundaries for an overall process.
The typical ETL process is developed as multiple EE parallel jobs. For example, consider a simple
DataStage EE application comprising two jobs: the first job extracts and transforms the data and
stages load-ready data in a persistent Data Set; the second job reads the Data Set and bulk loads the
database. Using logical boundaries allows a unit of work to be defined in terms of a single DataStage
parallel job. It is then straightforward to implement a checkpoint restart architecture using the
DataStage job sequencer and the philosophy that each job is a unit of work. A checkpoint restart
strategy implies two modes of operation, Normal and Restart.
In Normal processing mode, a checkpoint restart should follow these logical steps:
  Pre-Execution  - Log "Checkpoint starting".
  Execution      - Log "Checkpoint running".
  Post-Execution - Check job status:
                   = Success - Log "Checkpoint Success"; exit with SUCCESS.
                   = Failure - Log "Checkpoint Failed"; exit with ERROR.
In Restart processing mode, a checkpoint restart should follow these logical steps:
  Pre-Execution  - Check the last checkpoint:
                   = Success - Skip execution (go to the next job in the sequence).
                   = Failure - Log "Checkpoint Restarting" and continue.
  Execution      - Log "Checkpoint running".
  Post-Execution - Check job status:
                   = Success - Log "Checkpoint Success"; exit with SUCCESS.
                   = Failure - Log "Checkpoint Failed"; exit with ERROR.
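The Normal and Restart flows above can be sketched in shell, independent of any sequencer. This is a minimal sketch under stated assumptions: the checkpoint directory, the run_with_checkpoint helper, and the status-file convention are all names invented for this illustration, not part of DataStage.

```shell
#!/bin/sh
# Sketch of the checkpoint logic described above, using one status file per
# job. CKPT_DIR and run_with_checkpoint are hypothetical names for this
# illustration; they are not DataStage facilities.
CKPT_DIR=${CKPT_DIR:-$(mktemp -d)}

run_with_checkpoint() {
  # usage: run_with_checkpoint <jobname> <command that runs the job...>
  job=$1; shift
  mkdir -p "$CKPT_DIR"
  ckpt="$CKPT_DIR/$job.status"
  # Restart mode: if the last checkpoint recorded Success, skip this job.
  if [ "$(cat "$ckpt" 2>/dev/null)" = "Success" ]; then
    echo "Checkpoint $job: Success recorded - skipping"
    return 0
  fi
  echo "Running" > "$ckpt"
  echo "Checkpoint $job: running"
  if "$@"; then
    echo "Success" > "$ckpt"
    echo "Checkpoint $job: Success"
  else
    echo "Failed" > "$ckpt"
    echo "Checkpoint $job: Failed"
    return 1
  fi
}

# Demo: the first run executes the unit of work; the rerun skips it.
run_with_checkpoint LoadJob true
run_with_checkpoint LoadJob true
```

In production each unit of work would launch one parallel job, for example `run_with_checkpoint ExtractJob dsjob -run -jobstatus MyProject ExtractJob`; rerunning the whole script after a failure reruns only the jobs not yet marked Success.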
Restart job control requires the developer to consider:
IBM IPS Parallel Framework: Administration and Production Automation
DataStage job sequencers provide built-in checkpoint restart, or a custom approach can be
implemented. This standard practice is impartial to the approach you choose. The advantage of a
custom checkpoint solution is its flexibility: custom checkpoint logic uses parameters to influence the
behavior of the restart logic, allowing you to name specific individual jobs or subsets of jobs to run.
Built-in Checkpoint Restart:
6.1
Where set                                    Scope
System profile                               System wide
dsenv                                        All DataStage processes
Shell script (if dsjob local is specified)   Only DataStage processes spawned by dsjob
Project                                      All DataStage processes for a project
Job Sequencer                                Job Sequence and Sequence sub-processes
Job                                          Current job's environment and sub-processes
1) The daemon for managing client connections to the DataStage server engine is called dsrpcd.
By default (in a root installation), dsrpcd is started when the server is installed, and should start
whenever the machine is restarted. dsrpcd can also be manually started and stopped using the
$DSHOME/uv admin command. (For more information, see the DataStage Administrator Guide.)
By default, DataStage jobs inherit the dsrpcd environment, which on UNIX platforms is set in
the /etc/profile and $DSHOME/dsenv scripts. On Windows, the default DataStage environment is
defined in the registry. Note that client connections DO NOT pick up per-user environment
settings from their $HOME/.profile script.
On USS environments, the dsrpcd environment is not inherited, since DataStage jobs do not
execute on the conductor node.
2) Environment variable settings for particular projects can be set in the DataStage Administrator
client. Any project-level settings for a specific environment variable will override any settings
inherited from dsrpcd.
3) Within DataStage Designer, environment variables may be defined for a particular job using the Job
Properties dialog box. Any job-level settings for a specific environment variable will override any
settings inherited from dsrpcd or from project-level defaults. (Project-level environment variables are
set and defined within DataStage Administrator.)
6.1.2 Special Values for DataStage Environment Variables
To avoid hard-coding default values for Job Parameters, there are three special values that can be used
for environment variables within job parameters:
$ENV
    Causes the value of the named environment variable to be retrieved from the operating system of
    the job environment. Typically this is used to pick up values set in the operating system outside
    of DataStage.
$PROJDEF
    Causes the project default value for the environment variable (as shown in the Administrator
    client) to be picked up and used to set the environment variable and job parameter for the job.
$UNSET
    Causes the environment variable to be removed completely from the runtime environment.
    Several environment variables are evaluated only for their presence in the environment (for
    example, APT_SORT_INSERTION_CHECK_ONLY).
NOTE: $ENV should not be used for specifying the default $APT_CONFIG_FILE value because,
during job development, the Designer parses the corresponding parallel configuration file to
obtain a list of node maps and constraints (advanced stage properties).
6.1.3 Environment Variable Settings
An extensive list of environment variables is documented in the DataStage Parallel Job Advanced
Developer's Guide. This section calls attention to some specific environment variables, and
documents a few that are not part of that documentation.
$APT_CONFIG_FILE [filepath]
    Specifies the full pathname to the EE configuration file. This variable should be included in all
    job parameters so that it can be easily changed at runtime.
$APT_DUMP_SCORE
    Outputs the EE score dump to the DataStage job log, providing detailed information about the
    actual job flow, including operators, processes, and Data Sets. Extremely useful for
    understanding how a job actually ran in the environment.
$OSH_ECHO
    Includes a copy of the generated osh in the job's DataStage log.
$APT_RECORD_COUNTS
    Outputs record counts to the DataStage job log as each operator completes processing. The
    count is per operator, per partition. This setting should be disabled by default, but part of every
    job design so that it can be easily enabled for debugging purposes.
$APT_PERFORMANCE_DATA
    If set, specifies the directory in which to capture advanced job runtime performance statistics.
$OSH_PRINT_SCHEMAS
    Outputs actual runtime metadata (schema) to the DataStage job log. This setting should be
    disabled by default, but part of every job design so that it can be easily enabled for debugging
    purposes.
$APT_PM_SHOW_PIDS
    Places entries in the DataStage job log showing the UNIX process ID (PID) for each process
    started by a job. Does not report PIDs of DataStage phantom processes started by Server shared
    containers.
$APT_BUFFER_MAXIMUM_TIMEOUT [seconds]
    Maximum buffer delay in seconds.
6.2
Parameters are passed to a job either as DataStage job parameters or as environment variables. The
naming standard for job parameters uses the suffix _parm in the variable name. Environment
variables have a prefix of $ when used as job parameters.
Job parameters are passed from a job sequencer to the jobs in its control as if a user were answering the
runtime questions displayed in the DataStage Director job-run dialog. As discussed later in this
section, job parameters can also be specified using a parameter file.
The scope of a parameter depends on its type. Essentially:
- The scope of a job parameter is specific to the job in which it is defined and used. Job
  parameters are stored internally within DataStage for the duration of the job, and are not
  accessible outside that job.
- The scope of a job parameter can be extended by use of a job sequencer, which can manage
  and pass job parameters among jobs.
6.2.1 When to Use Parameters
As a standard practice, file names, database names, passwords, and message queue names should always
be parameterized. It is left to the discretion of the developer to parameterize other properties. When
deciding what to parameterize, ask these questions:
- Could this stage or link property need to change from one project to another?
- Could this stage or link property change from one job execution to another?
If the answer to either question is yes, create a job parameter and set the property to that parameter.
To facilitate production automation and file management, the Project_Plus and Staging file paths are
defined using a number of environment variables that should be used as Job Parameters.
Job parameters are required for the following DataStage programming elements:
1. File name entries in stages that use files or Data Sets must NEVER use a hard-coded operating
system pathname.
a. Staging area files must ALWAYS have pathnames as follows:
/#$STAGING_DIR##$DEPLOY_PHASE_parm#[filename.suffix]
b. DataStage datasets ALWAYS have pathnames as follows:
/#$PROJECT_PLUS_DATASETS#[headerfilename.ds]
2. Database stages must ALWAYS use variables for the server name, schema (if appropriate),
userid and password.
6.2.2 Parameter Standard Practices
File name stage properties should always be configured using two parameters: one for the directory
path and a second for the file name. The directory path delimiter should always be specified in the
property to avoid errors. Don't assume the runtime value of the directory parameter will include the
appropriate delimiter. If the user does supply a trailing delimiter, the operating system accepts // as a
delimiter; if it is not supplied, which is very common, the file name property specification will still be
correct.
As with directory path delimiters, database schema names, etc. should contain any required delimiter.
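A quick shell check makes the delimiter point concrete. The directory name below is just an example; the point is that the doubled slash produced when both the parameter value and the property carry a delimiter is harmless on UNIX.

```shell
#!/bin/sh
# Demonstration that a doubled / is harmless: the property can safely contain
# its own delimiter even when the directory parameter value ends in /.
# /tmp/param_demo is an example path, not a standard location.
dir=/tmp/param_demo/              # runtime value already carries a trailing /
mkdir -p "$dir"
echo hello > "${dir}data.txt"
cat "$dir/data.txt"               # resolves to /tmp/param_demo//data.txt; prints: hello
```

This is why the standard practice puts the delimiter in the property itself, e.g. `#dir_parm#/#file_parm#`, rather than trusting the parameter value to supply it.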
practice of always using a test harness sequencer to execute a parallel job. The test harness allows the
job to be run independently and ensures the parameter values are set. When the job is ready for
integration into a Production Sequencer, the test harness can be cut, pasted and linked into the
Production Sequencer. The test harness will also be useful in test environments, because it will allow
you to run isolated tests on a job.
In rare cases, parameters should be allowed to keep default values. For example, a job may be
configured with parameters for array_size or commit_size and the corresponding database stage
properties set to these parameters. The default value should be set to an optimal value as determined
by performance testing, and should be relatively static; it can always be overridden by job control.
This exception also minimizes the number of parameters. Consider that array_size is different for
every job: you could have a unique parameter for every job, but it would be difficult to manage the
values for all of them.
6.2.4 Managing Lists of DataStage Parameters
Both standard job parameters and environment variable parameters can be managed as lists if you
leverage the tools and techniques described in this section. Parameter lists help you consistently name
and centrally set job parameter values. Once a parameter is entered into the list, the developer does
not have to type it into a new job, avoiding typographical errors in names. Parameter values usually
need to be modified when a job is migrated from one project to another; lists help simplify this
process and eliminate errors related to invalid job parameter values.
6.2.4.1 Environment Variable Parameter Lists
Environment variable parameters are conveniently managed by DataStage as a unique list. Once an
environment variable has been entered in the list, the developer picks it from the list to add it to a job.
When a job is migrated from one project to another, the job will fail during startup if the environment
variable has not been configured for the target project.
When migrating to a new environment, you will have to enter the environment variable parameters
one at a time in the DataStage Administrator. If this becomes too cumbersome, consider that
environment variable parameters are stored in the DSParams file. The DSParams is a text file that can
be modified by hand. If you choose to modify this file by hand, do so at your own risk.
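As an illustration only, the snippet below pulls the environment-variable section out of a DSParams-style file. The [EnvVarDefns] section name matches the 7.x DSParams layout, but the sample file contents and entry format here are invented for the demo, and the sample path is not a real project location.

```shell
#!/bin/sh
# Miniature stand-in for a project DSParams file; on a real server the file
# lives in the project directory. The entry lines below are an invented format
# for this demo; only the [EnvVarDefns] section name reflects the 7.x layout.
# Always back up the real file before inspecting or changing it.
cat > /tmp/DSParams.sample <<'EOF'
[Defaults]
SomeSetting=1
[EnvVarDefns]
MY_DB_parm\String\my database name
MY_PASSWORD_parm\Encrypted\
[NextSection]
EOF
# Print only the environment-variable definitions.
sed -n '/^\[EnvVarDefns\]/,/^\[NextSection\]/p' /tmp/DSParams.sample | sed '1d;$d'
```

Diffing this section between a source and a target project's DSParams is a low-risk way to spot environment variables that still need to be entered in Administrator after a migration.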
6.2.4.2 Standard Parameter Lists
Standard parameter lists can be managed using shared containers. (Unfortunately, this does not apply
to sequencer jobs.) When a shared container is converted to local in a job, DataStage adds its
parameters to the job, so the developer does not have to type them in. This standard practice
recommends using empty shared containers to store and organize parameters as lists.
The next two diagrams show how to add parameters to an empty parallel shared container; notice the
name of the shared container, MasterParameterList. The second diagram shows the
MasterParameterList dropped onto the parallel job canvas and "Convert to local" selected. Once converted to local, the
parameters will be added to the job. If there are duplicate names, they will be detected. Resolve
duplicate names by assigning the job's parameter to the shared container and then selecting "Convert
to local" again. After conversion, the empty container disappears.
Developers, testers, production system operators, and the enterprise scheduler should always start
jobs using job control. The developer should build a test harness sequencer that explicitly sets a
job's parameters. Testers should run the job as it would be run in production. Production job
invocation should be done from a shell script.
[Diagram: Enterprise Scheduler -> Shell script (encapsulating dsjob) -> Parent Sequencer ->
Child Sequencer -> Job 1, Job 2, ... Job n]
It is best to use dsjob to start a job sequence and wait for it to complete via a shell script.
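A minimal sketch of such a wrapper follows. The -run and -jobstatus options are standard dsjob options (with -jobstatus, dsjob waits for completion and its exit status reports the finishing job status, where 1 conventionally means finished OK and 2 finished with warnings); the wrapper itself, the DSJOB override, and the status handling are assumptions for illustration, not a definitive implementation.

```shell
#!/bin/sh
# Hedged sketch of a scheduler-facing wrapper around dsjob. DSJOB can be
# pointed at a stub for testing; project and sequence names are arguments.
DSJOB=${DSJOB:-dsjob}

run_sequence() {
  # usage: run_sequence <project> <sequence> [name=value ...]
  project=$1; seq=$2; shift 2
  params=""
  for p in "$@"; do params="$params -param $p"; done
  # With -jobstatus, dsjob waits for completion and exits with the job status.
  "$DSJOB" -run -jobstatus $params "$project" "$seq"
  rc=$?
  case $rc in
    1) echo "$seq: finished OK" ;;
    2) echo "$seq: finished with warnings" ;;
    *) echo "$seq: FAILED (dsjob exit status $rc)" >&2; return 1 ;;
  esac
}
```

An enterprise scheduler would then call, for example, `run_sequence MyProject MasterSequence`, and rely on the wrapper's exit status to decide whether downstream steps may run.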
7 Change Management
This section considers aspects of a system life cycle as it relates to DataStage.
[Diagram: system life cycle across Development, ALPHA (test), BETA (test), Production, and
Maintenance environments, with source control flows: check out/import, and check out, import,
test, promote to Beta]
In a perfect world, every developer would have their own environment, including a DataStage project,
file systems, and database. There would be a separate ALPHA test environment (for integration
testing), where developers merge their components into a complete system. The BETA test system
should be a replica of production (or as close as possible).
Since we do not live in a perfect world, at a minimum there should be at least one development
project, one test project, and one production project. Anything less is strongly discouraged,
because it does not provide the foundation for a robust life cycle.
The source code control repository can be either the DataStage Version Control tool or any third-party
source code control system, such as IBM Rational ClearCase, SCCS, or Microsoft Visual SourceSafe.
DataStage projects can be configured to be protected, preventing any user from modifying jobs,
routines, etc. It is strongly encouraged to protect all projects except development projects. The only
users who can manipulate a protected project are those assigned the DataStage Production Manager
role, and even that role cannot perform tasks such as job compilation. Therefore, when exporting jobs
that will then be imported into a protected project, ensure that the executable is included in the export.
10. Production Manager responsibility. Ensure a system backup is made of the target production
project; this provides a means to quickly recover from problems introduced by changed code.
Check objects out of the source code control system and import them into the production
environment.
11. End of the normal release cycle.
12. Developer responsibility. Bug fixes and enhancements. Start with a clean project. Import the
latest objects from the source code control system. Make enhancements and unit test. Be sure to
update the job description, as this is where the change log is kept for an object. Once unit testing
is complete, go back to step 2 and repeat the cycle.
7.3 Security
Development environment. Each user should have a user ID and project. Each project should
have a unique group identifier. Grant users access to a project by assigning them to a group.
Configure DataStage permissions by assigning the Developer role to developer groups.
Test environment. Set up test users and test groups. Define test projects with tester group
permissions. Configure DataStage permissions by assigning the Developer role to tester groups.
Limit the Production Manager role.
Production environment. Set up production operator users and groups. Define production projects.
Configure DataStage permissions by assigning the Operator role to the production operator account.
2. Emergency patch on the production system. Have the Production Manager unprotect the
project. This allows the developer to make code changes directly on the production system.
The changes must then be exported and checked into the source code control system. Do not do
this unless there is no alternative.
7.4.1 Automating the Build Process
DataStage provides command line utilities in the DataStage Client installation for importing, exporting
and compiling jobs and routines.
PID    PPID   Command
12311  1220   dscs 4 0 0
12312  12311  dsapi_slave 8 7 0
Root process for all DS-Job trees, created by dsapi_slave for Director-submitted jobs:
USERID  PID    PPID   Command
abrewe  20899  12312  phantom DSD.RUN HangEquSourceRef 0/0 $APT_DISABLE_COMBINATION=True $APT_CONFIG_FILE=/u01/app/Ascential/DataStage/Configurations/default.apt $DS_PXDEBUG=1
Phantom process, which invokes a UNIX shell and acts as the bridge to UNIX:
USERID  PID    PPID   Command
abrewe  20943  20899  phantom SH -c 'RT_SC351/OshExecuter.sh RT_SC351/OshScript.osh RT_SC351/HangEquSourceRef.fifo -monitorport 13402 -pf RT_SC351/jpfile -input_charset ASCL_MS1252 -output_charset ASCL_MS1252 '
These Processes all take place within the Unix Shell Environment:
Proto-Section Leader #1 (goes away immediately after starting its Section Leader, thus 'orphaning' the SL process):
USERID  PID    PPID   Command
abrewe  20990  20946  /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/etc/standalone
Players for Section Leader #1; this represents a single instantiation of the job:
USERID  PID    PPID   Command
abrewe  20996  20993  /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe  20998  20993  /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe  21001  20993  /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe  21007  20993  /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe  21010  20993  /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe  21011  20993  /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
abrewe  21018  20993  /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/bin/osh APT_PMsectionLeaderFlag hvcwyds0001 10001 0 30 node1 hvcwyds0001 1085435517.554825.51d2
Proto-Section Leader #2 (goes away immediately after starting its Section Leader, thus 'orphaning' the SL process):
USERID  PID    PPID   Command
abrewe  20991  20946  /mnt/control/i386/ETL_Grid/DSRunTime/datastage_etl_701R1/etc/standalone
Players for Section Leader #2; this represents a single instantiation of the job:
USERID  PID    PPID   Command
Represented Graphically:
The UNIX tool pstree can be used to generate a graphical report for a given process ID. Most Linux
platforms include pstree, and it can be downloaded for most other platforms (as Perl code). Here is
the graphical output of the above process hierarchy:
dsrpcd(1220)---dscs(12311)---dsapi_slave(12312)---uvsh(20899)---uvsh(20943)---OshExecuter.sh(20945)---osh(20946)
osh(20946)-+-osh(20990)---osh(20993)-+-osh(20996)
           |                         |-osh(20998)
           |                         |-osh(21001)
           |                         |-osh(21002)
           |                         |-osh(21004)
           |                         |-osh(21005)
           |                         |-osh(21007)
           |                         |-osh(21010)
           |                         |-osh(21011)
           |                         |-osh(21017)
           |                         |-osh(21018)
           |                         |-osh(21019)
           |                         |-osh(21020)
           |                         |-osh(21021)
           |                         |-osh(21022)
           |                         |-osh(21024)
           |                         |-osh(21026)
           |                         |-osh(21028)
           |                         |-osh(21029)
           |                         `-osh(21031)
           `-osh(20991)---osh(20992)-+-osh(20997)
                                     |-osh(20999)
                                     |-osh(21000)
                                     |-osh(21003)
                                     |-osh(21006)
                                     |-osh(21008)
                                     |-osh(21009)
                                     |-osh(21012)
                                     |-osh(21013)
                                     |-osh(21014)
                                     |-osh(21015)
                                     |-osh(21016)
                                     |-osh(21023)
                                     |-osh(21025)
                                     |-osh(21027)
                                     |-osh(21030)
                                     `-osh(21032)
Legend: uvsh(20899) = DSD.RUN phantom; uvsh(20943) = SH phantom; osh(20946) = Conductor;
osh(20990) and osh(20991) = ProtoSection Leaders 1 and 2; osh(20993) and osh(20992) = Section
Leaders 1 and 2; the osh processes beneath each Section Leader = Players.
The pstree command can also be used to distinguish between Section Leader and Player processes,
which are both launched with the APT_PMsectionLeaderFlag option: running pstree on a given
process ID will display child processes for Section Leaders and none for Players.
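Where pstree is not installed, the same Section Leader versus Player check can be approximated with ps and awk. The helper name children below is ours for this illustration, not a system tool.

```shell
#!/bin/sh
# children() lists the child PIDs of a given process using only POSIX ps and
# awk. An osh process with children is a Section Leader; one with no children
# is a Player.
children() {
  ps -eo pid=,ppid= | awk -v parent="$1" '$2 == parent { print $1 }'
}

# Example: list the children of the current shell.
children $$
```

Against the listing shown earlier, `children 20993` would print the Player PIDs under Section Leader #1, while running it on a Player PID would print nothing.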
DSRPCD
`-DSAPI_Slave (or the UNIX shell invoking dsjob)    (root process for all DS jobs)
  `-phantom DSD.RUN <JobName> 0/0    (root process for the OSH job tree)
    +-phantom DSD.OshMonitor rowgen1 3266    (JobMon process)
    `-phantom SH -c 'RT_SC1/OshExecuter.sh RT_SC1/OshScript.osh RT_SC1/<JobName>.fifo -monitorport 13400 -pf RT_SC1/jpfile -input_charset ASCL_MS1252 -output_charset ASCL_MS1252 '
      `-/bin/sh RT_SC1/OshExecuter.sh RT_SC1/OshScript.osh RT_SC1/rowgen1.fifo -monitorport 13400 -pf RT_SC1/jpfile -input_charset ASCL_MS1252 -output_charset ASCL_MS1252
        `-/scratch/Ascential/DataStage/PXEngine/bin/osh -monitorport 13400 -pf RT_SC1/jpfile -input_charset ASCL_MS1252 -output_charset ASCL_MS1252 -f RT_SC1/OshScript.osh    (Conductor process)
          +-ProtoSectionLeader 1 (goes away once its Section Leader has started up)
          | `-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 0 30 node1 mk61 1084877021.130482.cc2    (Section Leader 1)
          |   +-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 1 30 node1 mk61 1084877021.130482.cc2    (Player 1)
          |   +-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 0 30 node1 mk61 1084877021.130482.cc2    (Player 2)
          |   .
          |   .
          |   `-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 1 30 node1 mk61 1084877021.130482.cc2    (Player N)
          +-ProtoSectionLeader 2 (goes away once its Section Leader has started up)
          | `-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 1 30 node2 mk61 1084877021.130482.cc2    (Section Leader 2)
          |   +-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 1 30 node2 mk61 1084877021.130482.cc2    (Player 1)
          |   +-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 0 30 node2 mk61 1084877021.130482.cc2    (Player 2)
          |   .
          |   .
          |   `-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 1 30 node2 mk61 1084877021.130482.cc2    (Player N)
          .
          .
          `-ProtoSectionLeader N (goes away once its Section Leader has started up)
            `-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 0 30 nodeN mk61 1084877021.130482.cc2    (Section Leader N)
              +-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 0 30 nodeN mk61 1084877021.130482.cc2    (Player 1)
              +-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 0 30 nodeN mk61 1084877021.130482.cc2    (Player 2)
              .
              .
              `-/scratch/Ascential/DataStage/PXEngine/bin/osh APT_PMsectionLeaderFlag mk61 10001 1 30 nodeN mk61 1084877021.130482.cc2    (Player N)
$APT_STRING_PADCHAR [char]

Sequential File stage environment variables:
$APT_EXPORT_FLUSH_COUNT [nrows]
$APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS
$APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL [set]
$APT_IMPORT_BUFFER_SIZE [Kbytes]
$APT_EXPORT_BUFFER_SIZE [Kbytes]
$APT_CONSISTENT_BUFFERIO_SIZE [bytes]
67 of 72
2007 IBM Information Integration Solutions. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a
retrieval system, or translated into any language in any form by any means without the written permission of IBM.
[bytes]
$APT_DELIMITED_READ_SIZE
[bytes]
$APT_MAX_DELIMITED_READ_SIZE
[set]
$APT_IMPORT_PATTERN_USES_FILESET
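These import/export settings are ordinarily exported in dsenv or supplied as job parameters. A minimal sketch follows; the values are purely illustrative assumptions, not recommendations:

```shell
# Illustrative tuning for a job that reads wide delimited files.
export APT_IMPORT_BUFFER_SIZE=256                 # per-file read buffer, in KB
export APT_MAX_DELIMITED_READ_SIZE=200000         # permit long records when scanning for delimiters
export APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS=1  # reject, rather than truncate, oversized strings
echo "import buffer: ${APT_IMPORT_BUFFER_SIZE} KB"
```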
DB2 environment variables:

$INSTHOME [path]
    Home directory of the DB2 instance; the parallel engine locates the instance configuration (including db2nodes.cfg) beneath it.

$APT_DB2INSTANCE_HOME [path]
    Overrides $INSTHOME as the DB2 instance home directory.

$APT_DBNAME [database]
    Name of the target database when none is specified in the stage.

$APT_RDBMS_COMMIT_ROWS [rows]
    Number of rows per transaction commit. Can also be specified with the Row Commit Interval stage input property.

$DS_ENABLE_RESERVED_CHAR_CONVERT
    Enables handling of column names that contain the reserved characters # and $.
Informix environment variables:

$INFORMIXSERVER [name]
    Name of the Informix server to connect to, as defined in the sqlhosts file.

$INFORMIXDIR [path]
    Informix software installation directory.

$INFORMIXSQLHOSTS [filepath]
    Full path to the Informix sqlhosts file.

$APT_COMMIT_INTERVAL [rows]
    Commit interval, in rows, for Informix write operations.
Oracle environment variables:

$ORACLE_HOME [path]
    Oracle software installation directory.

$ORACLE_SID [sid]
    System identifier of the target Oracle instance.

$APT_ORAUPSERT_COMMIT_ROW_INTERVAL [num]
$APT_ORAUPSERT_COMMIT_TIME_INTERVAL [seconds]
    Together control how often Oracle upserts commit: a commit is issued when either the row interval or the time interval is reached, whichever comes first.

$APT_ORACLE_LOAD_OPTIONS [SQL*Loader options]
    Options passed through to SQL*Loader during bulk loads (for example, OPTIONS(DIRECT=TRUE, PARALLEL=TRUE)).

$APT_ORACLE_LOAD_DELIMITED [char]
    When set, bulk load files are written in delimited rather than fixed-width format, using the given character as the delimiter.

$APT_ORA_WRITE_FILES [filepath]
    When set, the Oracle load does not invoke SQL*Loader; instead, the control and data files are written to the given path for inspection or manual loading.

$DS_ENABLE_RESERVED_CHAR_CONVERT
    Enables handling of column names that contain the reserved characters # and $.
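As a sketch of how these fit together for a bulk-load job run (the SID and commit intervals are illustrative assumptions, not values to copy):

```shell
# Hypothetical per-job Oracle load settings.
export ORACLE_SID=ORCL                         # illustrative instance name
export APT_ORAUPSERT_COMMIT_ROW_INTERVAL=5000  # commit every 5000 rows...
export APT_ORAUPSERT_COMMIT_TIME_INTERVAL=2    # ...or every 2 seconds, whichever comes first
export APT_ORACLE_LOAD_OPTIONS='OPTIONS(DIRECT=TRUE, PARALLEL=TRUE)'
echo "SQL*Loader options: ${APT_ORACLE_LOAD_OPTIONS}"
```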
Teradata environment variables:

$APT_TERA_SYNC_DATABASE [name]
    Database in which the terasync (synchronization) table is created.

$APT_TERA_SYNC_USER [user]
    User that creates and writes to the terasync table.

$APT_TER_SYNC_PASSWORD [password]
    Password for the terasync user.

$APT_TERA_64K_BUFFERS
    Uses 64 KB buffers (rather than the default 32 KB) for data transfer to and from Teradata.

$APT_TERA_NO_ERR_CLEANUP
    Leaves the Teradata error tables in place after a failed load so they can be examined.

$APT_TERA_NO_PERM_CHECKS
    Bypasses the permission checks on system tables that are normally performed at load startup.
Netezza environment variables:

$NETEZZA [path]
    Netezza client software installation directory.

$NZ_ODBC_INI_PATH [filepath]
    Path to the odbc.ini file used by the Netezza enterprise stage.

$APT_DEBUG_MODULE_NAMES
    List of internal modules to trace; the modules relevant to the Netezza enterprise stage are odbcstmt, odbcenv, nzetwriteop, nzutils, nzwriterep, nzetsubop.
Job monitoring environment variables:

$APT_MONITOR_TIME [seconds]
    In v7 and later, specifies the time interval (in seconds) for generating job monitor information at runtime. To enable size-based job monitoring, unset this environment variable and set $APT_MONITOR_SIZE below.

$APT_MONITOR_SIZE [rows]
    Determines the minimum number of records the job monitor reports. The default of 5000 records is usually too small. To minimize the number of messages during large job runs, set this to a higher value (for example, 1000000).

$APT_NO_JOBMON
    Disables job monitoring completely. In rare instances, this may improve performance. In general, this should only be set on a per-job basis when attempting to resolve performance bottlenecks.

$APT_RECORD_COUNTS
    Prints record counts in the job log as each operator completes processing. The count is per operator per partition.
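For example, switching from time-based to size-based monitoring amounts to the following (the size value is illustrative):

```shell
# Switch the job monitor from time-based to size-based reporting.
unset APT_MONITOR_TIME            # time-based reporting off
export APT_MONITOR_SIZE=1000000   # report once per 1,000,000 records
echo "monitor size: ${APT_MONITOR_SIZE} rows"
```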
Performance tuning (buffering) environment variables:

$APT_BUFFER_MAXIMUM_MEMORY  41903040 (example)
    Specifies the maximum amount of virtual memory, in bytes, used per buffer per partition. If not set, the default is 3 MB (3145728). Setting this value higher will use more memory, depending on the job flow, but may improve performance.

$APT_BUFFER_FREE_RUN  1000 (example)
    Specifies how much of the available in-memory buffer to consume before the buffer offers resistance to any new data being written to it. If not set, the default is 0.5 (50% of $APT_BUFFER_MAXIMUM_MEMORY). If this value is greater than 1, the buffer operator will read $APT_BUFFER_FREE_RUN * $APT_BUFFER_MAXIMUM_MEMORY bytes before offering resistance to new data, and buffer operators will spool data to disk (by default, scratch disk) once the $APT_BUFFER_MAXIMUM_MEMORY threshold is reached. The maximum disk required will be $APT_BUFFER_FREE_RUN * (number of buffers) * $APT_BUFFER_MAXIMUM_MEMORY.

$APT_PERFORMANCE_DATA [directory]
    When set to a directory path, run-time performance data is collected in files in that directory.

$TMPDIR [path]
    Overrides the default directory (/tmp) used by the parallel engine for miscellaneous temporary files.
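Before raising these values it is worth estimating the worst-case scratch disk consumption implied by the formula above. A sketch follows; the buffer count of 10 is an assumed figure (in practice the count comes from the job score):

```shell
# Worst-case spill = FREE_RUN x number_of_buffers x MAXIMUM_MEMORY.
APT_BUFFER_MAXIMUM_MEMORY=41903040   # ~40 MB per buffer per partition (example)
APT_BUFFER_FREE_RUN=1000             # values > 1 allow spooling to scratch disk
NUM_BUFFERS=10                       # assumption; taken from the job score in practice
MAX_DISK=$((APT_BUFFER_FREE_RUN * NUM_BUFFERS * APT_BUFFER_MAXIMUM_MEMORY))
echo "worst-case scratch usage: ${MAX_DISK} bytes"
```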
Debugging environment variables:

$OSH_PRINT_SCHEMAS
    Prints the record schema of all data sets and the interface schema of all operators to the job log.

$APT_DISABLE_COMBINATION
    Globally disables operator combination, so each operator runs in its own process. Useful for isolating a problem to a single operator; increases the process count and may slow the job.

$APT_PM_PLAYER_TIMING
    Prints the CPU time consumed by each player process to the job log.

$APT_PM_PLAYER_MEMORY
    Prints a message to the job log each time a player process allocates an additional block of memory.

$APT_BUFFERING_POLICY  FORCE
    FORCE places a buffer operator on every link. Setting $APT_BUFFERING_POLICY=FORCE is not recommended for production job runs.

$DS_PX_DEBUG
    Captures parallel job debug information, including the generated OSH, in a Debugging directory within the project.

$APT_PM_STARTUP_CONCURRENCY
    Limits how many section leaders are started concurrently during job startup.

$APT_PM_NODE_TIMEOUT [seconds]
    Number of seconds the conductor waits for a section leader to start on each node before timing out (default 30).
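These diagnostics are best enabled per job invocation rather than project-wide; an illustrative sketch:

```shell
# Verbose runtime diagnostics for a single problem job run.
export OSH_PRINT_SCHEMAS=1        # log record and interface schemas
export APT_PM_PLAYER_TIMING=1     # log CPU time per player process
export APT_DISABLE_COMBINATION=1  # one process per operator, easier to trace
echo "diagnostics enabled"
```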