DS Best Practices Collection

Avoid propagation of unnecessary metadata between the stages. Use Modify stage and drop the metadata.
Modify stage will drop the metadata only when explicitey specified using DROP clause. Do remember that Modify drops the Metadata only when it is explicitly asked to do so using KEEP/DROP clauses. Make use of Modify, Filter, and Aggregation, Col. Generator etc stages instead of Transformer stage only if the anticipated volumes are high andperformance becomes a problem. Otherwise use Transformer. Its very easy to code a transformer than a modify stage.
Turn off Runtime Column propagation wherever its not required. One of the most important mistake that developers often make is not to have a volumetric analyses done before you decide to use Join or Lookup or Merge stages. Estimate the volumes and then decide which stage to go for. Add reject files wherever you need reprocessing of rejected records or you think considerable data loss may happen. Try to keep reject file at least at Sequential file stages and writing to Database stages. Make use of Order By clause when a DB stage is being used in join. The intention is to make use of Database power for sorting instead of datastage reources. Indicate dont sort option between DB stage and join stage using sort stage when using order by clause.
Use Sort stages instead of Remove duplicate stages. Sort stage has got more grouping options and sort indicator options. One of the most frequent mistakes that developers face is lookup failures by not taking care of String padchar that datastage appends when converting strings of lower precision to higher precision.Try to decide on the APT_STRING_PADCHAR, APT_CONFIG_FILE parameters from the beginning. Ideally APT_STRING_PADCHAR should be set to OxOO (C/C++ end of string) and Configuration file to the maximum number of nodes available.
Data Partitioning is very important part of Parallel job design. Its always advisable to have the data partitioning as Auto unless you are comfortable with partitioning, since all DataStage stages are designed to perform in the required way with Auto partitioning.
While doing Outer joins, you can make use of Dummy variables for just Null checking instead of fetching an explicit column from table.
Stage Variables: Having too much variables in the transformer can impact the memory consumption. Avoid looping in transformer unless necessary. Try not to use the basic transformer when possible using other one. Buffering: Try to tune the following before changing the buffering policy in the job: 1) Job design 2) Config File 3) Disk Performance of the job can be improved if: 1) Unnecessary column are removed from the up and down stream links. 2) Removing these unnecessary columns will help reducing the memory consumption. 3) Always specify the list of columns in the select statement when reading from database. This will not bring unnecessary column data in the job which will save memory and network consumption. 4) Use RCP very carefully. 5) Understand the data-type before using them in the job. Do the data profiling before bringing data in the job. Sort operations: Always perform the following checks first before using the sort in the design: 1) Is sort really needed?
2) What is the volume of data going to be sorted? 3) Is data being read from database first and getting sorted in the job? Can we not sort the data in the database and bring sorted data? 4) What are the values set in the system related to Sort stage? If we give attention to above questions before applying sort then this will help up creating more performant job. Parallelism is not always good: Remember parallelism is not always beneficial. You have to think about the design of the job and the configuration. Degree of parallelism is determined by the configuration file where you can check how many node are defined. Increased parallelism can bring more overhead but will help distributing the work. To get the maximum performance from job: To get the maximum performance from job we should start the job design with the smaller set of the data and then increase the amount of data. We will only get the best performing job when we will experiment with the design of the job using different partitioning methods etc. Point to remember while partitioning the data: While partitioning the data make sure that the partitions are having equal amount of data in them. inequality will make the job less performant. While configuring the SMP environment: While configuring the SMP environment designers should leave some processors for the operating systems processes. Following factors affect the job design: 1. Communication between the operators should be optimized. 2. Amount of data handled by the operators should be handled by the nodes efficiently. This means the partitions should be correct. Datasets Datasets are the best when storing the results intermediately. Datasets will keep the partitions and sort order if set. This will save re-partitioning and would make the job more robust. Degree of Parallelism Degree of parallelism determined by the configuration file, by the number of nodes. More the number of nodes more the resource usage will happen. Hence having more nodes can improve the performance of the job but one should think about the resource consumption on the server also. Number of number will bring more number of processes which can exhaust the server. ULIMIT settings can prevent parallel jobs from running and processes from execution. Setting this up can help in better execution of the server and jobs. Scratch Area Scratch area is memory space which will be used by the jobs in case if the data required to read is more than the buffer size. One should keep scratch area maintained for the smooth operation of the jobs. You can not fit everything in memory One should always remember that not every thing can be fitted into memory hence the while reading / sorting large amount of data, correct stage selection is very necessary. Never use lookup stage where large amount of record set needs to be stored in virtual datasets. Stored Procedure Once should avoid using stored procedure for per row basis.
Standards
It is important to establish and follow consistent standards in: _ Directory structures for installation and application support directories. _ Naming conventions, especially for DataStage Project categories, stage names, and links. All DataStage jobs should be documented with the Short Description field, as well as Annotation fields. It is the DataStage developers responsibility to make personal backups of their work on their local workstation, using DataStage's DSX export capability. This can also be used for integration with source code control systems. Note: A detailed discussion of these practices is beyond the scope of this Redbooks publication, and you should speak to your Account Executive to engage IBM IPS Services.
Development guidelines
Modular development techniques should be used to maximize re-use of DataStage jobs and components:
_ Job parameterization allows a single job design to process similar logic instead of creating multiple copies of the same job. The Multiple-Instance job property allows multiple invocations of the same job to run simultaneously. _ A set of standard job parameters should be used in DataStage jobs for source and target database parameters (DSN, user, password, etc.) and directories where files are stored. To ease re-use, these standard parameters and settings should be made part of a Designer Job Parameter Sets. _ Create a standard directory structure outside of the DataStage project directory for source and target files, intermediate work files, and so forth. _ Where possible, create re-usable components such as parallel shared containers to encapsulate frequently-used logic. _ DataStage Template jobs should be created with: Standard parameters such as source and target file paths, and database login properties Environment variables and their default settings Annotation blocks _ Job Parameters should always be used for file paths, file names, database login settings. _ Standardized Error Handling routines should be followed to capture errors and rejects. Component usage
The following guidelines should be followed when constructing parallel jobs in IBM InfoSphere DataStage Enterprise Edition:
_ Never use Server Edition components (BASIC Transformer, Server Shared Containers) within a parallel job. BASIC Routines are appropriate only for job control sequences.
_ Always use parallel Data Sets for intermediate storage between jobs unless that specific data also needs to be shared with other applications. _ Use the Copy stage as a placeholder for iterative design, and to facilitate default type conversions. _ Use the parallel Transformer stage (not the BASIC Transformer) instead of the Filter or Switch stages. Chapter 1. IBM InfoSphere DataStage overview 29 _ Use BuildOp stages only when logic cannot be implemented in the parallel Transformer. DataStage data types
The following guidelines should be followed with DataStage data types: _ Be aware of the mapping between DataStage (SQL) data types and the internal DS/EE data types. If possible, import table definitions for source databases using the Orchestrate Schema Importer (orchdbutil) utility. _ Leverage default type conversions using the Copy stage or across the Output mapping tab of other stages. Partitioning data
In most cases, the default partitioning method (Auto) is appropriate. With Auto partitioning, the Information Server Engine will choose the type of partitioning at runtime based on stage requirements, degree of parallelism, and source and target systems. While Auto partitioning will generally give correct results, it might not give optimized performance. As the job developer, you have visibility into requirements, and can optimize within a job and across job flows. Given the numerous options for keyless and keyed partitioning, the following objectives form a methodology for assigning partitioning: _ Objective 1 Choose a partitioning method that gives close to an equal number of rows in each partition, while minimizing overhead. This ensures that the processing workload is evenly balanced, minimizing overall run time. _ Objective 2 The partition method must match the business requirements and stage functional requirements, assigning related records to the same partition if required. Any stage that processes groups of related records (generally using one or more key columns) must be partitioned using a keyed partition method. This includes, but is not limited to: Aggregator, Change Capture, Change Apply, Join, Merge, Remove Duplicates, and Sort stages. It might also be necessary for Transformers and BuildOps that process groups of related records. _ Objective 3 Unless partition distribution is highly skewed, minimize re-partitioning, especially in cluster or Grid configurations. Re-partitioning data in a cluster or Grid configuration incurs the overhead of network transport.
_ Objective 4
Partition method should not be overly complex. The simplest method that meets the above objectives will generally be the most efficient and yield the best performance. Using the above objectives as a guide, the following methodology can be applied:
a. Start with Auto partitioning (the default). b. Specify Hash partitioning for stages that require groups of related records as follows: Specify only the key column(s) that are necessary for correct grouping as long as the number of unique values is sufficient Use Modulus partitioning if the grouping is on a single integer key column Use Range partitioning if the data is highly skewed and the key column values and distribution do not change significantly over time (Range Map can be reused) c. If grouping is not required, use Round Robin partitioning to redistribute data equally across all partitions. Especially useful if the input Data Set is highly skewed or sequential d. Use Same partitioning to optimize end-to-end partitioning and to minimize re-partitioning Be mindful that Same partitioning retains the degree of parallelism of the upstream stage Within a flow, examine up-stream partitioning and sort order and attempt to preserve for down-stream processing. This may require re-examining key column usage within stages and re-ordering stages within a flow (if business requirements permit). Note: In satisfying the requirements of this second objective, it might not be possible to choose a partitioning method that gives an almost equal number of rows in each partition. Across jobs, persistent Data Sets can be used to retain the partitioning and sort order. This is particularly useful if downstream jobs are run with the same degree of parallelism (configuration file) and require the same partition and sort order.
Collecting data
Given the options for collecting data into a sequential stream, the following guidelines form a methodology for choosing the appropriate collector type: 1. When output order does not matter, use Auto partitioning (the default). 2. Consider how the input Data Set has been sorted: When the input Data Set has been sorted in parallel, use Sort Merge collector to produce a single, globally sorted stream of rows. When the input Data Set has been sorted in parallel and Range partitioned, the Ordered collector might be more efficient. 3. Use a Round Robin collector to reconstruct rows in input order for round-robin partitioned input Data Sets, as long as the Data Set has not been re-partitioned or reduced.
Sorting
Apply the following methodology when sorting in an IBM InfoSphere DataStage Enterprise Edition data flow: 1. Start with a link sort. 2. Specify only necessary key column(s). 3. Do not use Stable Sort unless needed. 4. Use a stand-alone Sort stage instead of a Link sort for options that are not available on a Link sort: The Restrict Memory Usage option should be included here. If you want more memory available for the sort, you can only set that via the Sort Stage not on a sort link. The environment variable $APT_TSORT_STRESS_BLOCKSIZE can also be used to set sort memory usage (in MB) per partition. Sort Key Mode, Create Cluster Key Change Column, Create Key Change Column, Output Statistics. Always specify DataStage Sort Utility for standalone Sort stages. Use the Sort Key Mode=Dont Sort (Previously Sorted) to resort a sub-grouping of a previously-sorted input Data Set. 32 IBM InfoSphere DataStage Data Flow and Job Design 5. Be aware of automatically-inserted sorts: Set $APT_SORT_INSERTION_CHECK_ONLY to verify but not establish required sort order. 6. Minimize the use of sorts within a job flow. 7. To generate a single, sequential ordered result set, use a parallel Sort and a Sort Merge collector.
Stage specific guidelines
The guidelines by stage are as follows: _ Transformer Take precautions when using expressions or derivations on nullable columns within the parallel Transformer: Always convert nullable columns to in-band values before using them in an expression or derivation. Always place a reject link on a parallel Transformer to capture / audit possible rejects. _ Lookup It is most appropriate when reference data is small enough to fit into available shared memory. If the Data Sets are larger than available memory resources, use the Join or Merge stage. Limit the use of database Sparse Lookups to scenarios where the number of input rows is significantly smaller (for example 1:100 or more) than the number of reference rows, or when exception processing. _ Join Be particularly careful to observe the nullability properties for input links to any form of Outer Join. Even if the source data is not nullable, the non-key
columns must be defined as nullable in the Join stage input in order to identify unmatched records. _ Aggregators Use Hash method Aggregators only when the number of distinct key column values is small. A Sort method Aggregator should be used when the number of distinct key values is large or unknown. _ Database Stages The following guidelines apply to database stages: Where possible, use the Connector stages or native parallel database stages for maximum performance and scalability. The ODBC Connector and ODBC Enterprise stages should only be used when a native parallel stage is not available for the given source or target database. When using Oracle, DB2, or Informix databases, use Orchestrate Schema Importer (orchdbutil) to properly import design metadata. Take care to observe the data type mappings. If possible, use an SQL where clause to limit the number of rows sent to a DataStage job. Avoid the use of database stored procedures on a per-row basis within a high-volume data flow. For maximum scalability and parallel performance, it is best to implement business rules natively using DataStage parallel components

DS Best Practices Collection

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

DS Best Practices Collection

Uploaded by

Copyright:

Available Formats

Avoid propagation of unnecessary metadata between the stages. Use Modify stage and drop the metadata.

Stage specific guidelines

You might also like