
Slowly Changing Dimension Stage page - General tab

All parallel stages have a Stage page, and each Stage page has a General tab. This page and tab contain the following fields:

Stage name
    Displays the name of the stage. You can edit this name.

Data Connection
    Specifies the name of a data connection object in the repository that defines a connection to a data source. Data connections are valid only when the stage generates surrogate keys by using a database sequence. The Source type field on the Surrogate Key tab must be set to DBSequence. Click the browse button to load, save, or clear a data connection object. You can also drag and drop a data connection object from the repository to the stage icon on the canvas.

Select output link
    Specifies which link is the main output from the stage. The other link will carry changes to the dimension.

Description
    Contains an optional description of the stage. Add a description because it helps the maintainability of jobs. The description is displayed in any job reports.

Fast Path
    Provides a navigation tool for the SCD stage. Click the Fast Path arrows to move through the sequence of tabs where user action is required.

The Stage page also has the following tabs:

Stage page - Advanced tab
    Use the Advanced tab to specify processing details for a stage.

Slowly Changing Dimension Stage page - NLS Locale tab
    Use the NLS Locale tab to define a collate convention for the SCD stage.

Stage page - Advanced tab


Use the Advanced tab to specify processing details for a stage. All parallel stages have a Stage page, and each Stage page in turn has an Advanced tab. In most situations, use the default values from this tab. You can use the tab to fine-tune the behavior of a stage: for example, you can force it to execute sequentially or request that partitioning be preserved if possible. The tab contains the following controls and fields:

Execution Mode
    Specify the execution mode of the stage. You can choose between parallel and sequential operation. If the execution mode for a particular type of stage cannot be changed, then this list is not available. In sequential operation the stage runs on a single node. Mixing sequential and parallel stages in a job flow causes data to be partitioned and collected between the stages. You can use the default setting for the stage (the list tells you whether this is parallel or sequential).

Combinability mode
    When a job runs, the engine can combine the operators that underlie parallel stages so that they run in the same process. This can improve performance. Specify one of these settings:
    Auto: Use the default combination setting (use this for most jobs).
    Combinable: Ignore the operator's default setting and combine if at all possible (some operators are marked as noncombinable by default).
    Don't Combine: Never combine operators.

Preserve Partitioning
    Specifies whether the stage requires that partitioning be preserved by the next stage of the job. Select one of these options:
    Set: Specifies that the next stage in the job should preserve the partitioning if possible.
    Clear: Specifies that the partitioning method that the next stage uses is not relevant.
    Propagate: Specifies that the stage will use the option that the previous stage in the job has used. If the previous stage used the Propagate option, the setting from the last stage that used either the Set option or the Clear option is used. For some stage types the Propagate option is not available.
    The default setting of a stage can be any of these options, depending on stage type, and in most cases you can accept the default. You might set something other than the default when you are implementing a particular partitioning scheme in your job. None of these options prevents repartitioning, but you are warned if repartitioning happens when the job runs.

Configuration File
    Shows the name of the configuration file that the system is currently using. The configuration file is specified by the APT_CONFIG_FILE environment variable.

Node pool and resource constraints
    Specify constraints on where the stage can run. These controls are available if you have defined multiple pools in the configuration file.
    1. Select Node pool or Resource pool from the Constraint drop-down list.
    2. If you selected Resource pool, select a type for a resource pool.
    3. Select the name of the node or resource pool.
    You can select multiple node pools or resource pools. The stage can run only on the node pools or resource pools that you specify.

Node map constraints
    Use this feature to specify a virtual node pool that does not appear in the configuration file. Select the option box and type the nodes you want the stage to be able to run on. You can also browse the available nodes.

The lists of available nodes, available node pools, and available resource pools are derived from the configuration file. Constraints are ignored for Sequential File stages, File Set stages, External Source stages, and External Target stages.
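
The Propagate rule for Preserve Partitioning can be read as a walk back through the upstream stages until one is found that chose Set or Clear. The following Python sketch only illustrates that reading; the function name `effective_preserve_setting` and the list representation of a job are assumptions made for this example and are not part of the product.

```python
from typing import Optional, Sequence


def effective_preserve_setting(upstream_settings: Sequence[str]) -> Optional[str]:
    """Resolve what a stage set to Propagate would inherit.

    upstream_settings holds the Preserve Partitioning option of each
    upstream stage, ordered from the first stage in the job to the stage
    immediately before the one being resolved (hypothetical layout).
    """
    # Walk backwards: the nearest stage that chose Set or Clear wins.
    for setting in reversed(upstream_settings):
        if setting in ("Set", "Clear"):
            return setting
    # No upstream stage chose Set or Clear.
    return None


print(effective_preserve_setting(["Set", "Propagate", "Propagate"]))
```

In this example the two intermediate Propagate stages are skipped and the setting of the first stage is used.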

Slowly Changing Dimension Stage page - NLS Locale tab


Use the NLS Locale tab to define a collate convention for the SCD stage. The collate convention defines the order in which characters are collated. The convention is used when the SCD stage evaluates derivation expressions. The NLS Locale tab has the following field:

Collate
    Displays a list of the available collate locales. You can select a locale from the list or accept the default setting for the project. Click the arrow button to insert a job parameter that supplies the locale at run time or to browse for a file that defines custom collate rules.

Slowly Changing Dimension Stage Input page - General tab


The Input page specifies information about the input data to the SCD stage, including the source data and the dimension reference data. This page and tab contain the following fields:

Input name
    Displays the name of the selected input link.

Description
    Contains an optional description of the link. This description is displayed in any job reports.

The Input page also has the following tabs:

Slowly Changing Dimension Input page - Lookup tab
    Use the Lookup tab to define the match condition for the dimension lookup, and to assign purpose codes to the dimension columns. Select the reference link in the Input name field to view this tab.

Slowly Changing Dimension Input page - Surrogate Key tab
    If you created a key source with a Surrogate Key Generator stage, use the Surrogate Key tab to specify how to use the key source to generate surrogate keys. Select the reference link in the Input name field to view this tab.

Columns tab (input)
    The Columns tab displays the column metadata for the selected input link in a grid.

Advanced tab (input)
    Use this tab to specify how the stage buffers data arriving on the input link.

Partitioning tab (input)
    Use the Partitioning tab to specify details about how the stage partitions or collects data on the current link before it processes the data or writes it to a data target.

Slowly Changing Dimension Input page - Lookup tab


Use the Lookup tab to define the match condition for the dimension lookup, and to assign purpose codes to the dimension columns. Select the reference link in the Input name field to view this tab.

To define the match condition, select a source column in the left pane and drag it to a dimension column in the right pane. You must associate at least one pair of columns, but you can define multiple pairs if required. When more than one pair is defined, the match conditions are combined. A successful lookup requires all associated pairs of columns to match.

To select purpose codes for the dimension columns, click the Purpose arrow next to each column in the right pane. For multiple columns with the same purpose code, select the columns and click Set Purpose Code from the pop-up menu. The Set Purpose Code window opens, where you can assign the same purpose code to the selected columns. The SCD stage automatically propagates the column definitions and purpose codes to the Dim Update tab.

The Lookup tab has two panes:

Left pane
    Displays the source columns on the primary input link to the stage.

Right pane
    Displays the dimension columns on the reference input link to the stage. The Key Expression field specifies the match condition between one or more source records and dimension rows. The Purpose field specifies the purpose code for each dimension column.
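
The combined match condition described above behaves like a logical AND over the associated column pairs. The following Python sketch illustrates that behavior only; the function `lookup_dimension`, the row layout, and the column names are hypothetical and do not represent how the stage is implemented.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]


def lookup_dimension(
    source: Row, dimension: List[Row], pairs: List[Tuple[str, str]]
) -> Optional[Row]:
    """Return the first dimension row in which every associated
    (source column, dimension column) pair has equal values."""
    for dim_row in dimension:
        # All pairs must match for the lookup to succeed.
        if all(source[src] == dim_row[dim] for src, dim in pairs):
            return dim_row
    return None


# Hypothetical dimension rows and column pairs.
dimension_rows = [
    {"CUST_ID": 1, "REGION": "EU", "SK": 10},
    {"CUST_ID": 1, "REGION": "US", "SK": 11},
]
key_pairs = [("cust_id", "CUST_ID"), ("region", "REGION")]

match = lookup_dimension({"cust_id": 1, "region": "US"}, dimension_rows, key_pairs)
print(match)
```

A source record that matches on only one of the two pairs, such as a known `cust_id` with an unknown `region`, is treated as not found.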

Slowly Changing Dimension Input page - Surrogate Key tab


If you created a key source with a Surrogate Key Generator stage, use the Surrogate Key tab to specify how to use the key source to generate surrogate keys. Select the reference link in the Input name field to view this tab. The Surrogate Key tab contains the following fields:

Source type
    Specifies the type of key source. The default option is Flat File.

Source name
    Specifies the name and fully qualified path of the state file, or the name of the database sequence. A state file must be accessible from all nodes that run the stage. Click the arrow button to browse for the file or to insert a job parameter.

Flat File area:

Initial value
    Specifies the value to initialize the key state the first time that the job runs. For every subsequent time that the job runs, the initial value is the start value for key generation. If the specified value is taken, the stage uses the next available value.

New surrogate keys retrieved from state file
    Specifies the block size to use for surrogate key retrieval. Select In blocks of to set the block size manually, or System selected block size to have the system pick the optimal block size based on your job configuration. System-selected block sizes usually result in larger key ranges that require less frequent state access and have better performance. However, the key sequence might have temporary gaps until the next time the job runs.

DB sequence area:

If you specified a data connection object on the Stage page, then the User name, Password, Database name, and Server name fields are populated automatically.

Database type
    Specifies the type of database where the sequence resides.

User name
    Specifies the user name for the database connection. This field is required for a remote database. If you leave this field blank and you have a local database, the stage uses the workstation login user name.

Password
    Specifies the password for the database connection. This property is required for a remote database. If you leave this field blank and you have a local database, the stage uses the workstation login password.

Database name
    Specifies the name of the database to access. This field is available if the Database type field is set to DB2. If you leave this field blank, the stage uses the value that is specified by the environment variable APT_DBNAME or by the variable DB2DBDFT if APT_DBNAME is not set.

Client instance name
    Specifies the name of the client instance. This field is available if the Database type field is set to DB2. This field is required for a remote database.

Client alias DB name
    Specifies the name of the client alias database on the remote server. This field is available if the Database type field is set to DB2 and the Client instance name field is not blank. This field is required only if the names of the alias database and the server database are different.

Server name
    Specifies the name of the server.
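
The trade-off for the Flat File block size (larger blocks mean fewer state accesses but possible gaps) can be illustrated with a small simulation. This is a sketch under stated assumptions: `KeyState` is an in-memory stand-in for a state file, `generate_keys` is a hypothetical helper, and the sketch does not model how the product later reclaims unused keys.

```python
from typing import List, Tuple


class KeyState:
    """In-memory stand-in for a surrogate key state file (hypothetical)."""

    def __init__(self, initial_value: int = 1) -> None:
        self.next_free = initial_value

    def reserve_block(self, block_size: int) -> range:
        """Reserve block_size consecutive keys and advance the state."""
        start = self.next_free
        self.next_free += block_size
        return range(start, start + block_size)


def generate_keys(state: KeyState, row_count: int, block_size: int) -> Tuple[List[int], int]:
    """Assign one key per row, reserving a new block only when the
    current block is exhausted. Returns the keys and the number of
    times the state had to be accessed."""
    keys: List[int] = []
    block = iter(())
    state_accesses = 0
    for _ in range(row_count):
        key = next(block, None)
        if key is None:
            block = iter(state.reserve_block(block_size))
            state_accesses += 1
            key = next(block)
        keys.append(key)
    return keys, state_accesses


small = KeyState(initial_value=1)
print(generate_keys(small, row_count=5, block_size=2), small.next_free)

large = KeyState(initial_value=1)
print(generate_keys(large, row_count=5, block_size=100), large.next_free)
```

With a block size of 2 the state is accessed three times for five rows; with a block size of 100 it is accessed once, but the state has moved well past the last key that was actually used, which is the kind of gap the tab description mentions.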

Slowly Changing Dimension Output page - General tab


The Output page specifies information about the output data from the SCD stage. The SCD stage has one output link and one link that updates the dimension. This page and tab contain the following fields:

Output name
    Displays the name of the selected output link.

Description
    Contains an optional description of the link. This description is displayed in any job reports.

The Output page also has the following tabs:

Slowly Changing Dimension Output page - Dim Update tab
    Use the Dim Update tab to create column derivations that specify how to update the dimension table. Select the dimension update link in the Output name field to view this tab.

Slowly Changing Dimension Output page - Output Map tab
    Use the Output Map tab to map data from the input links to the output link. Select the primary output link from the Output name field to view this tab.

Columns tab (output)
    Use the Columns tab to define the column metadata for the selected output link.

Advanced tab (output)
    Use this tab to specify how the stage buffers data on the output link.

Slowly Changing Dimension Output page - Dim Update tab


Use the Dim Update tab to create column derivations that specify how to update the dimension table. Select the dimension update link in the Output name field to view this tab.

You must create a derivation for every dimension column. Columns with a purpose code of Type 1 or Type 2 must be derived from a source column. Columns with a purpose code of Current Indicator or Expiration Date must be derived from a literal value, and must also have an Expire derivation.

To create derivations:

- Drag source columns from the left pane to the Derivation field in the right pane.
- Use the column auto-match facility (right-click the link header and select Auto Match).
- Define an expression by using the expression editor (double-click the Derivation or Expire field).

The Dim Update tab has two panes:

Left pane
    Displays the source columns and any job parameters that you defined. This information is read-only and cannot be modified.

Right pane
    Displays the dimension update columns. The Derivation field specifies how each column is derived. The Purpose field specifies the purpose code for each column. The Expire field specifies how to expire a dimension record.
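
The roles of the Derivation and Expire fields can be illustrated with a conventional Type 1 / Type 2 update. This Python sketch shows the general slowly changing dimension technique, not the stage's internal logic; the column names, the literals "Y", "N", and "9999-12-31", and the function `apply_change` are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def apply_change(dimension: List[Dict], source: Dict, next_key: int, today: str) -> None:
    """Apply one source record to the dimension rows in place.

    Hypothetical purpose codes: cust_id = business key, name = Type 1,
    city = Type 2, current = Current Indicator, expires = Expiration Date.
    """
    current: Optional[Dict] = next(
        (r for r in dimension
         if r["cust_id"] == source["cust_id"] and r["current"] == "Y"),
        None,
    )
    # Derivation values for a newly inserted row: source columns for the
    # Type 1 and Type 2 columns, literals for the indicator and date.
    new_row = {
        "sk": next_key,
        "cust_id": source["cust_id"],
        "name": source["name"],
        "city": source["city"],
        "current": "Y",
        "expires": "9999-12-31",
    }
    if current is None:
        dimension.append(new_row)
    elif current["city"] != source["city"]:
        # Type 2 change: expire the old row, then insert a new version.
        current["current"] = "N"
        current["expires"] = today
        dimension.append(new_row)
    else:
        # Type 1 change: overwrite in place, no history kept.
        current["name"] = source["name"]


rows: List[Dict] = []
apply_change(rows, {"cust_id": 1, "name": "Ann", "city": "Oslo"}, 100, "2024-01-01")
apply_change(rows, {"cust_id": 1, "name": "Anne", "city": "Oslo"}, 101, "2024-02-01")
apply_change(rows, {"cust_id": 1, "name": "Anne", "city": "Rome"}, 102, "2024-03-01")
for row in rows:
    print(row)
```

In this sketch the assignments in the Type 2 branch play the part of the Expire expressions, and the literals in `new_row` play the part of the Derivation expressions for the Current Indicator and Expiration Date columns.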

Slowly Changing Dimension Output page - Output Map tab


Use the Output Map tab to map data from the input links to the output link. Select the primary output link from the Output name field to view this tab. Define the mappings by creating column derivations. You must create a derivation for every output column.

To create derivations:

- Drag source columns from the left pane to the Derivation field in the right pane.
- Use the column auto-match facility (right-click the link header and select Auto Match).
- Define an expression by using the expression editor (double-click the Derivation or Expire field).

The Output Map tab has two panes:

Left pane
    Displays the source columns, dimension columns, and any job parameters that you defined. This information is read-only and cannot be modified.

Right pane
    Displays the output columns. The Derivation field specifies how each column is derived.

Columns tab (output)


Use the Columns tab to define the column metadata for the selected output link. You can enter column definitions by typing them in this tab, or you can load predefined column definitions from the repository. The Columns tab contains the following fields and controls:

Columns grid
    The columns grid contains the following fields:
    Column name: The name of the column.
    Key: Indicates whether the column is part of the primary key.
    SQL type: The SQL data type.
    Extended: This column gives you further control over data types used in parallel jobs when NLS is enabled. The available values depend on the base data type:
    - For Char, VarChar, and LongVarChar, select Unicode to specify that these columns require mapping. Otherwise each character is taken as representing an ASCII character that does not need mapping. For each connector that has a Code page property, instead of specifying the columns here, specify them in the Code page property and sub-property values on the Properties tab.
    - For Time, select microseconds to indicate that the field contains microseconds.
    - For Timestamp, select microseconds to indicate that the field contains microseconds.
    - For integer types, select unsigned to specify that the underlying data type is a uint.
    Length: The data precision. Specify the length for Char data and the maximum length for VarChar data.
    Scale: The data scale factor. For Sequential File stages the scale should not exceed 9.
    Nullable: Indicates whether the column can contain null values. Specify No to indicate that the column is subject to a NOT NULL constraint. The Nullable field is informative only; it does not enforce a NOT NULL constraint.
    Display: For certain stage types, this field optionally gives the maximum number of characters required to display the column data.
    Data element: For certain stage types, this field optionally enables you to specify a data element for the column, which specifies stricter data typing.
    Description: Specify a description of the column.

If you are typing column definitions in the Columns tab and want several columns to share one or more of the same properties, you can propagate the properties. Select all the affected columns and select Propagate values... from the shortcut menu. The Propagate column values dialog box appears; use it to choose the properties that you want to propagate. You can then edit rows in the Edit Column Meta Data window by selecting the row and choosing Edit Row... from the shortcut menu.

Save
    Click Save to save a copy of the column definitions as a table definition in the repository. The Save Table Definition window opens. To save a table definition:
    1. Enter a name in the Data source type field. This name forms the first part of the unique table definition identifier. By default, the field contains Saved.
    2. Enter a name in the Data source name field. This forms the second part of the table definition identifier. By default, this field contains the name of the link you are editing.
    3. Enter a name in the Table/file name field. This is the last part of the table definition identifier and is also the name that is used for the table definition in the repository. By default, this field contains the name of the link you are editing.
    4. Enter a brief description of the table definition in the Short description field. By default, this field contains the date and time you clicked Save. The format of the date and time depends on your Windows setup. This field is optional.
    5. Enter a more detailed description of the table definition in the Long description field. This field is optional.
    6. Click OK. The Save Table Definition As window opens. Select the folder in which you want to store the table definition and click Save.

Load
    Click Load to load a table definition from the repository and populate the Columns tab. The Table Definition window opens. To load a table definition:
    1. Browse the repository tree for the table definition that you want to load.
    2. Select the table definition in the tree and click OK. The Select Columns window opens.
    3. Use the arrow buttons to move the columns that you want to load from the Available columns list to the Selected column list.
    4. Click OK to load the selected column definition into the Columns tab.

Advanced tab (output)


Use this tab to specify how the stage buffers data on the output link.

By default, stages buffer data in such a way that no deadlocks can arise. A deadlock is the situation where a number of stages are mutually dependent: each is waiting for input from another stage and cannot output data until it has received that input. The size and operation of the buffer are usually the same for all links on all stages. The default values that the settings take can be set by using environment variables (see the WebSphere DataStage Parallel Job Advanced Developer Guide). Use the Advanced tab to specify buffer settings on a per-link basis. Any settings you make here automatically appear in the Advanced tab of the previous or next stage in the job.

CAUTION: Use these settings with extreme caution, because inappropriate settings can cause deadlock situations to arise.

The tab contains the following controls and fields:

Buffering mode
    Select one of the following from the drop-down list:
    (Default): The link takes the settings that are specified by the environment variables. This is Auto buffer unless you have changed the value of the APT_BUFFERING_POLICY environment variable.
    Auto buffer: The link buffers incoming data only if necessary to prevent a dataflow deadlock situation.
    Buffer: The link unconditionally buffers all outgoing data.
    No buffer: The link does not buffer data under any circumstances. This could potentially lead to deadlock situations if not used carefully.

If you choose the Auto buffer or Buffer options, you can also set the values of the various buffering parameters:

Maximum memory buffer size (bytes)
    Specifies the maximum amount of virtual memory, in bytes, used per buffer. The default size is 3145728 (3 MB).

Buffer free run (percent)
    Specifies how much of the available in-memory buffer to use before the buffer writes to disk. The value of Buffer free run is a percentage of Maximum memory buffer size. When the amount of data in the buffer is less than this value, new data is accepted automatically. When the data exceeds the value, the buffer first tries to write some of the data that it contains to disk before accepting more data. The default value is 50%. You can set it to greater than 100%, in which case the buffer continues to store data up to that percentage of Maximum memory buffer size before writing to disk.

Queue upper bound size (bytes)
    Specifies the maximum amount of data buffered at any time using both memory and disk. The default value is zero, meaning that the buffer size is limited only by the available disk space as specified by 'resource scratchdisk' in the configuration file. If you specify a value, the total buffer size is limited to this number of bytes plus one block (where the data stored in a block cannot exceed 32 KB).

Disk write increment (bytes)
    Sets the size, in bytes, of blocks of data being moved to and from disk. The default is 1048576 (1 MB). Adjusting this value trades the amount of disk access against data throughput. Increasing the block size reduces disk access, but may decrease performance when data is being read or written in smaller units. Decreasing the block size increases data throughput, but may increase the amount of disk access.

To create a buffer that will not write to disk, set Queue upper bound size to a value equal to or slightly less than Maximum memory buffer size and set Buffer free run to 1.0.

CAUTION: The size of the buffer is limited by the virtual memory of your system, and you can create deadlock if the buffer becomes full.
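
The relationship between the buffering parameters is simple arithmetic, which the following Python sketch works through for the documented defaults. The function names are hypothetical, and the sketch assumes that 32 KB means 32768 bytes; it only restates the limits described above and does not model the buffer itself.

```python
from typing import Optional

DEFAULT_MAX_MEMORY_BYTES = 3145728  # documented default, 3 MB
DEFAULT_FREE_RUN_PERCENT = 50       # documented default
MAX_BLOCK_BYTES = 32 * 1024         # assumes 32 KB = 32768 bytes


def free_run_threshold(
    max_memory_bytes: int = DEFAULT_MAX_MEMORY_BYTES,
    free_run_percent: int = DEFAULT_FREE_RUN_PERCENT,
) -> int:
    """Bytes of buffered data below which new data is accepted
    without first trying to write to disk."""
    return max_memory_bytes * free_run_percent // 100


def max_buffered_bytes(queue_upper_bound: int) -> Optional[int]:
    """Upper limit on total buffered data (memory plus disk).

    Returns None when Queue upper bound size is zero, because the
    buffer is then limited only by the available scratch disk space.
    """
    if queue_upper_bound == 0:
        return None
    return queue_upper_bound + MAX_BLOCK_BYTES


print(free_run_threshold())                      # defaults
print(free_run_threshold(free_run_percent=200))  # free run above 100%
print(max_buffered_bytes(1048576))
```

With the defaults, the buffer starts trying to write to disk once it holds more than half of the 3 MB memory buffer.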
