In this "DWBI Concepts' Original article", we put Oracle database and Informatica PowerCentre to lock horns to prove which one of them handles data SORTing operation faster. This article gives a crucial insight to application developer in order to take informed decision regarding performance tuning.
Test Preparation
We will perform the same test with different data points (data volumes) and log the results. We will start with 1 million records and double the volume for each subsequent data point. Here are the details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as the ETL tool
3. Database and Informatica installed on different physical servers running HP-UX
4. Source database table has no constraints, no indexes, no database statistics and no partitions
5. Source database table is not available in the Oracle shared pool before it is read
6. No session-level partitioning in Informatica PowerCenter
7. No parallel hint provided in the extraction SQL query
8. The source table has 10 columns and the first 8 columns will be used for sorting
9. The Informatica Sorter has enough cache size
We have used two Informatica PowerCenter mappings created in PowerCenter Designer. The first mapping, m_db_side_sort, uses an ORDER BY clause in the Source Qualifier to sort data at the database level. The second mapping, m_Infa_side_sort, uses an Informatica Sorter to sort data at the Informatica level. We have executed these mappings with the different data points and logged the results.
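To make the database-side approach concrete, here is a minimal sketch of the kind of Source Qualifier SQL override m_db_side_sort would use. The table and column names are hypothetical, since the article does not publish its exact schema; per the setup above, the first 8 of the 10 columns form the sort key.

-- Hypothetical Source Qualifier override for the database-side sort
SELECT col1, col2, col3, col4, col5, col6, col7, col8, col9, col10
FROM   src_sort_test
ORDER  BY col1, col2, col3, col4, col5, col6, col7, col8

In the Informatica-side variant the same SELECT is issued without the ORDER BY, and a Sorter transformation with those 8 ports marked as key ports does the sorting instead.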
Result
The following graph shows the performance of Informatica and the database in terms of the time taken by each system to sort the data. Time is plotted along the vertical axis and data volume along the horizontal axis.
Verdict: The above experiment demonstrates that, under our test conditions, the Oracle database is on average 14% faster than Informatica in the SORT operation.
Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments
Note
This data can only be used for performance comparison; it cannot be used for performance benchmarking. The Informatica versus Oracle performance comparison for the JOIN operation follows in the next article.
In yet another "DWBI Concepts' Original article", we test the performance of the Informatica PowerCenter 8.5 Joiner transformation against an Oracle 10g database join. This article gives application developers crucial insight for taking informed decisions regarding performance tuning.
In our previous article, we tested the performance of the ORDER BY operation in Informatica and Oracle and found that, under our test conditions, Oracle performs sorting 14% faster than Informatica. This time we will look into the JOIN operation, not only because JOIN is the single most important data set operation but also because the performance of JOIN gives a developer crucial data for implementing pushdown optimization manually. Informatica is one of the leading data integration tools in today's world; more than 4,000 enterprises worldwide rely on Informatica to access, integrate and trust their information assets. On the other hand, the Oracle database is arguably the most successful and powerful RDBMS, trusted since the 1980s across all major business domains and platforms. Both systems are best in class in the technologies they support. But when it comes to application development, developers often face the challenge of striking the right balance of operational load sharing between these systems. This article will help them take an informed decision.
Test Preparation
We will perform the same test with 4 different data points (data volumes) and log the results. We will start with 1 million rows in the detail table and 0.1 million rows in the master table. Subsequently we will test with 2 million, 4 million and 6 million detail table rows paired with 0.2 million, 0.4 million and 0.6 million master table rows. Here are the details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as the ETL tool
3. Database and Informatica installed on different physical servers running HP-UX
4. Source database tables have no constraints, no indexes, no database statistics and no partitions
5. Source database tables are not available in the Oracle shared pool before they are read
6. No session-level partitioning in Informatica PowerCenter
7. No parallel hint provided in the extraction SQL query
8. The Informatica Joiner has enough cache size
We have used two Informatica PowerCenter mappings created in PowerCenter Designer. The first mapping, m_db_side_join, uses an INNER JOIN clause in the Source Qualifier to join the data at the database level. The second mapping, m_Infa_side_join, uses an Informatica Joiner to join the data at the Informatica level. We have executed these mappings with the different data points and logged the results. Further to the above test, we will execute the m_db_side_join mapping once again, this time with proper database-side indexes and statistics, and log the results.
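For clarity, here is a minimal sketch of the Source Qualifier SQL override the m_db_side_join mapping would use, with hypothetical master/detail table and key names (the article does not publish its exact schema):

-- Hypothetical database-side join pushed into the Source Qualifier
SELECT d.detail_id, d.master_id, d.detail_amount, m.master_name
FROM   detail_tbl d
INNER JOIN master_tbl m ON m.master_id = d.master_id

In the Informatica-side variant, both tables are read through separate Source Qualifiers without the join, and a Joiner transformation with the condition master_id = master_id performs the inner join inside PowerCenter.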
Result
The following graph shows the performance of Informatica and the database in terms of the time taken by each system to join the data. The average time is plotted along the vertical axis and the data points along the horizontal axis.
Verdict: In our test environment, Oracle 10g performs the JOIN operation 24% faster than the Informatica Joiner transformation without an index, and 42% faster with database indexes.
Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments
Note
1. This data can only be used for performance comparison but cannot be used for performance benchmarking.
2. This data is only indicative and may vary under different testing conditions.
When we run a session, the Integration Service may create a reject file for each target instance in the mapping to store the rejected target records. With the help of the session log and the reject file we can identify the cause of data rejection in the session. Eliminating the cause of rejection will lead to rejection-free loads in subsequent session runs. If the Informatica Writer or the target database rejects data for any valid reason, the Integration Service logs the rejected records into the reject file. Every time we run the session, the Integration Service appends the rejected records to the reject file.
The first column of each record in the reject file is the row indicator, which tells how the Integration Service treated the row:
Row Indicator : Meaning
0 : Insert
1 : Update
2 : Delete
3 : Reject
4 : Rolled-back insert
5 : Rolled-back update
6 : Rolled-back delete
7 : Committed insert
8 : Committed update
9 : Committed delete
Next come the column data values, each followed by its column indicator, which determines the data quality of the corresponding column.
Column Indicator : Meaning
D : Valid data. The Writer passes it to the target database. The target accepts it unless a database error occurs, such as finding a duplicate key while inserting.
O : Overflow. Numeric data exceeded the specified precision or scale for the column. Bad data, if you configured the mapping target to reject overflow or truncated data.
N : Null. The column contains a null value. Good data. The Writer passes it to the target, which rejects it if the target database does not accept null values.
T : Truncated. String data exceeded the specified precision for the column, so the Integration Service truncated it. Bad data, if you configured the mapping target to reject overflow or truncated data.
Also note that the second column contains the column indicator value 'D', which signifies that the row indicator itself is valid data. Now let us see what the data in a bad file looks like:
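The original article shows a screenshot here. As an illustration only, with hypothetical column names and values, a comma-delimited bad-file record follows the pattern row-indicator, its indicator, then data value / column indicator pairs:

0,D,1001,D,John Smith,D,,N,2500,O

Read left to right: row indicator 0 (insert) with its indicator D, then a customer ID of 1001 (valid), a customer name 'John Smith' (valid), a null address column (N), and a numeric value 2500 that overflowed the target precision (O).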
The Normalizer returns a row for each store and sales combination. It also returns a generated column ID (GCID) that identifies the quarter number (1 through 4) for each output row of the target table.
For our own example, suppose the source table holds one row per person with the expenses spread across columns:

Name  Month  Transportation  House Rent  Food
Sam   Jan    200             1500        500
John  Jan    300             1200        300
Tom   Jan    300             1350        350
and we need to transform the source data and populate the target table as below:
Name  Month  Expense Type  Expense
Sam   Jan    Transport     200
Sam   Jan    House rent    1500
Sam   Jan    Food          500
John  Jan    Transport     300
John  Jan    House rent    1200
John  Jan    Food          300
Tom   Jan    Transport     300
Tom   Jan    House rent    1350
Tom   Jan    Food          350
... and so on. Below is the screen-shot of a complete mapping which shows how to achieve this result using Informatica PowerCenter Designer. Image: Normalization Mapping Example 1
In the Ports tab of the Normalizer the ports will be created automatically as configured in the Normalizer tab. Interestingly we will observe two new columns namely,
GK_EXPENSEHEAD and GCID_EXPENSEHEAD. The GK field generates a sequence number starting from the value defined in the Sequence field, while the GCID field holds the value of the occurrence field, i.e. the column number of the input expense head.
Now the GCID tells us which expense corresponds to which input column while converting columns to rows. Below is the screen-shot of the expression used to handle this GCID: Image: Expression to handle GCID. This is how we will accomplish our task!
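Since the screenshot is not reproduced here, the following is a minimal sketch of the kind of expression such an Expression transformation could use, assuming the Normalizer occurrence order of the expense columns is Transport, House rent, Food (the output port name EXPENSE_TYPE is illustrative):

EXPENSE_TYPE = DECODE(GCID_EXPENSEHEAD, 1, 'Transport', 2, 'House rent', 3, 'Food', 'Unknown')

DECODE compares GCID_EXPENSEHEAD against each search value and returns the matching literal, giving us the Expense Type column of the target.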
Dynamic Lookup Cache
Let's think about this scenario. You are loading your target table through a mapping. Inside the mapping you have a Lookup and in the Lookup, you are actually looking up the same target table you are loading. You may ask me, "So? What's the big deal? We all do it quite often...". And yes you are right. There is no "big deal" because Informatica (generally) caches the lookup table in the very beginning of the mapping, so
whatever records get inserted into the target table through the mapping will have no effect on the lookup cache. The lookup will still hold the previously cached data, even if the underlying target table is changing. But what if you want your lookup cache to get updated as and when the target table is changing? What if you want your lookup cache to always show the exact snapshot of the data in your target table at that point in time? Clearly this requirement will not be fulfilled if you use a static cache. You will need a dynamic cache to handle this.
STATIC CACHE SCENARIO
Let's suppose you run a retail business and maintain all your customer information in a customer master table (an RDBMS table). Every night, all the customers from your customer master table are loaded into a Customer dimension table in your data warehouse. Your source customer table is a transaction-system table, probably in 3rd normal form, and does not store history: if a customer changes his address, the old address is overwritten with the new address. But your data warehouse table stores the history (maybe in the form of SCD Type-II). There is a mapping that loads your data warehouse table from the source table. Typically you do a lookup on the target (static cache) and check every incoming customer record to determine whether the customer already exists in the target. If the customer does not exist in the target, you conclude the customer is new and INSERT the record, whereas if the customer already exists, you may want to update the target record with the new record (if the record has changed). This is illustrated below. You don't need a dynamic lookup cache for this.
Here are some more examples of when you may consider using a dynamic lookup:
Updating a master customer table with both new and updated customer information coming together, as shown above.
Loading data into a slowly changing dimension table and a fact table at the same time. Remember, you typically look up the dimension while loading the fact, so normally you load the dimension table before loading the fact table; with a dynamic lookup, you can load both simultaneously.
Loading data from a file with many duplicate records, and eliminating the duplicates in the target by updating the duplicate rows, i.e. keeping only the most recent (or the initial) row.
Loading the same data from multiple sources using a single mapping. Just consider the previous retail business example: if you have more than one shop and Linda has visited two of your shops for the first time, the customer record for Linda will come twice during the same load.
When the Integration Service reads a row from the source, it updates the lookup cache by performing one of the following actions:
Inserts the row into the cache: If the incoming row is not in the cache, the Integration Service inserts the row in the cache based on input ports or generated Sequence-ID. The Integration Service flags the row as insert.
Updates the row in the cache: If the row exists in the cache, the Integration Service updates the row in the cache based on the input ports. The Integration Service flags the row as update.
Makes no change to the cache: This happens when the row exists in the cache and the lookup is configured to insert new rows only; or the row is not in the cache and the lookup is configured to update existing rows only; or the row is in the cache but, based on the lookup condition, nothing changes. The Integration Service flags the row as unchanged.
Notice that the Integration Service actually flags the rows based on the above three conditions. And that's a great thing, because if you know the flag, you can reroute the row to achieve different logic. This flag port is called
NewLookupRow. Using the value of this port, the rows can be routed for insert, update, or to do nothing. You just need to use a Router or Filter transformation followed by an Update Strategy. Oh, I forgot to tell you: the actual values that you can expect in the NewLookupRow port are:
0 = Integration Service does not update or insert the row in the cache.
1 = Integration Service inserts the row into the cache.
2 = Integration Service updates the row in the cache.
When the Integration Service reads a row, it changes the lookup cache depending on the results of the lookup query and the Lookup transformation properties you define. It assigns the value 0, 1, or 2 to the NewLookupRow port to indicate whether it inserts or updates the row in the cache, or makes no change.
And here are the screenshots of the lookup. Lookup Ports tab first: Image: Dynamic Lookup Ports Tab. And here is the Dynamic Lookup Properties tab.
If you check the mapping screenshot, you will see that I have used a Router to reroute the INSERT group and the UPDATE group. The Router screenshot is also given below. New records are routed to the INSERT group and existing records are routed to the UPDATE group.
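As the screenshots are not reproduced here, this is a minimal sketch of the Router group filter conditions and the Update Strategy expressions involved (group names are illustrative):

INSERT_GROUP group filter condition: NewLookupRow = 1
UPDATE_GROUP group filter condition: NewLookupRow = 2

Update Strategy after the INSERT group: DD_INSERT
Update Strategy after the UPDATE group: DD_UPDATE

Rows with NewLookupRow = 0 fall through to the default group and can simply be dropped, since neither the cache nor the target needs any change for them.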
When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at one and increments each sequence ID by one until it reaches the smallest existing value minus one. If the Integration Service runs out of unique sequence ID numbers, the session fails.
Output old values on update: The Integration Service outputs the value that existed in the cache before it updated the row.
Output new values on update: The Integration Service outputs the updated value that it writes in the cache. The lookup/output port value matches the input/output port value.
Note: We can configure whether to output old or new values using the Output Old Value On Update transformation property.
Insert null values: The Integration Service uses null values from the source and updates the lookup cache and target table using all values from the source.
Ignore Null inputs for Update property: The Integration Service ignores the null values in the source and updates the lookup cache and target table using only the not-null values from the source.
If we know the source data contains null values and we do not want the Integration Service to update the lookup cache or target with null values, we need to check the Ignore Null property for the corresponding lookup/output port. When we choose to ignore NULLs, we must verify that we output the same values to the target that the Integration Service writes to the lookup cache; otherwise the lookup cache and the target table can become unsynchronized. We can configure the mapping based on whether we want the Integration Service to output old or new values from the lookup/output ports when it updates a row in the cache:
New values. Connect only lookup/output ports from the Lookup transformation to the target.
Old values. Add an Expression transformation after the Lookup transformation and before the Filter or Router transformation. Add output ports in the Expression transformation for each port in the target table and create expressions to ensure that we do not output null input values to the target.
But what if we don't want to compare all ports? We can choose the ports we want the Integration Service to ignore when it compares ports. The Designer only enables this property for lookup/output ports when the port is not used in the lookup condition. We can improve performance by ignoring some ports during comparison.
We might want to do this when the source data includes a column that indicates whether or not the row contains data we need to update. Select the Ignore in Comparison property for all lookup ports except the port that indicates whether or not to update the row in the cache and target table. Note: We must configure the Lookup transformation to compare at least one port; the Integration Service fails the session if we ignore all ports.
The Integration Service pushes as much transformation logic as possible to the source database. The Integration Service analyzes the mapping from the source towards the target, or until it reaches a downstream transformation it cannot push to the source database, and executes the corresponding SELECT statement.
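As an illustration of source-side pushdown (hypothetical table, columns and logic, not taken from the article), a Source Qualifier followed by a Filter on amount and an Expression deriving an upper-case name could be pushed down into a single generated statement roughly like:

SELECT order_id,
       UPPER(customer_name) AS customer_name,
       amount
FROM   orders
WHERE  amount > 100

Everything downstream of the last pushable transformation is still processed inside the Integration Service as usual.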
Use the Pushdown Optimization Viewer to examine the transformations that can be pushed to the database. Select a pushdown option or pushdown group in the Pushdown Optimization Viewer to view the corresponding SQL statement that is generated for the specified selections. When we select a pushdown option or pushdown group, we do not change the pushdown configuration. To change the configuration, we must update the pushdown option in the session properties.
1.1 Calculate the original query cost
1.2 Can the query be re-written to reduce cost?
- Can an IN clause be changed to EXISTS?
- Can a UNION be replaced with UNION ALL if we are not using any DISTINCT clause in the query?
- Is there a redundant table join that can be avoided?
- Can we include an additional WHERE clause to further limit the data volume?
- Is there a redundant column used in GROUP BY that can be removed?
- Is there a redundant column selected in the query but not used anywhere in the mapping?
1.3 Check if all the major joining columns are indexed
1.4 Check if all the major filter conditions (WHERE clause columns) are indexed
- Can a function-based index improve performance further?
1.5 Check if any exclusive query hint reduces the query cost
- Check if a parallel hint improves performance and reduces cost
1.6 Recalculate the query cost
- If the query cost is reduced, use the changed query
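To make a few of the checklist items concrete, here are Oracle-flavoured sketches with hypothetical table and column names (none of these come from the article itself):

-- 1.2: IN rewritten as EXISTS
SELECT o.order_id
FROM   orders o
WHERE  EXISTS (SELECT 1
               FROM   customers c
               WHERE  c.customer_id = o.customer_id
               AND    c.status = 'ACTIVE')

-- 1.4: function-based index supporting a filter on UPPER(customer_name)
CREATE INDEX idx_cust_upper_name ON customers (UPPER(customer_name))

-- 1.5: checking whether a parallel hint reduces the query cost
SELECT /*+ PARALLEL(o, 4) */ COUNT(*)
FROM   orders o

The query cost in steps 1.1 and 1.6 can be compared using EXPLAIN PLAN before and after each change.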
5.1 Unless unavoidable, filter the data in the source query in the Source Qualifier
5.2 Use the Filter transformation as close to the source as possible
The Integration Service creates one partition in every pipeline stage by default. If we have the Informatica Partitioning option, we can configure multiple partitions for a single pipeline stage. Setting partition attributes includes partition points, the number of partitions, and the partition types. In the session properties we can add or edit partition points. When we change partition points we can define the partition type and add or delete partitions (the number of partitions). We can set the following attributes to partition a pipeline:
Partition point: Partition points mark thread boundaries and divide the pipeline into stages. A stage is a section of a pipeline between any two partition points. The Integration Service redistributes rows of data at partition points. When we add a partition point, we increase the number of pipeline stages by one. Increasing the number of partitions or partition points increases the number of threads. We cannot create partition points at Source instances or at Sequence Generator transformations.
Number of partitions: A partition is a pipeline stage that executes in a single thread. If we purchase the Partitioning option, we can set the number of partitions at any partition point. When we add partitions, we increase the number of processing threads, which can improve session performance. We can define up to 64 partitions at any partition point in a pipeline. When we increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline, so the number of partitions remains consistent throughout the pipeline. The Integration Service runs the partition threads concurrently.
Partition types: The Integration Service creates a default partition type at each partition point. If we have the Partitioning option, we can change the partition type. The partition type controls how the Integration Service distributes data among partitions at partition points. We can define the following partition types: database partitioning, hash auto-keys, hash user keys, key range, pass-through and round-robin.
Database partitioning: The Integration Service queries the database system for table partition information and reads partitioned data from the corresponding nodes in the database.
Pass-through: The Integration Service processes data without redistributing rows among partitions. All rows in a single partition stay in that partition after crossing a pass-through partition point. Choose pass-through partitioning when we want to create an additional pipeline stage to improve performance but do not want to change the distribution of data across partitions.
Round-robin: The Integration Service distributes data evenly among all partitions. Use round-robin partitioning when we want each partition to process approximately the same number of rows, i.e. for load balancing.
Hash auto-keys: The Integration Service uses a hash function to group rows of data among partitions. The Integration Service groups the data based on a partition key, using all grouped or sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter, and unsorted Aggregator transformations.
Hash user keys: The Integration Service uses a hash function to group rows of data among partitions. We define the number of ports to generate the partition key.
Key range: The Integration Service distributes rows of data based on a port or set of ports that we define as the partition key. For each port, we define a range of values. The Integration Service uses the key and ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in the pipeline are partitioned by key range.
We cannot create a partition key for hash auto-keys, round-robin, or pass-through partitioning. Partition points are added, deleted, or edited on the Partitions view of the Mapping tab in the session properties in Workflow Manager. The PowerCenter Partitioning option increases the performance of PowerCenter through parallel data processing. This option provides a thread-based architecture and automatic data partitioning that optimizes parallel processing on multiprocessor and grid-based hardware environments.
Lookup cache persistent: To be checked, i.e. a Named Persistent Cache will be used.
Cache File Name Prefix: user_defined_cache_file_name, i.e. the Named Persistent Cache file name that will be used in all other mappings using the same lookup table. Enter the prefix name only; do not enter .idx or .dat.
Re-cache from lookup source: To be checked, i.e. the Named Persistent Cache file will be rebuilt or refreshed with the current data of the lookup table.
Next, in all the mappings where we want to use the same already-built Named Persistent Cache, we need to set two properties in the Properties tab of the Lookup transformation.
Lookup cache persistent: To be checked, i.e. the lookup will use a Named Persistent Cache that is already saved in the cache directory. If the cache file is not there, the session will not fail; it will simply create the cache file instead.
Cache File Name Prefix: user_defined_cache_file_name, i.e. the Named Persistent Cache file name that was defined in the mapping where the persistent cache file was created.
Note:
If there is any Lookup SQL Override, the SQL statement in all the lookups must match exactly; even an extra blank space will fail the session that is using the already-built persistent cache file. So if the incoming source data volume is high, the lookup table's data volume that needs to be cached is also high, and the same lookup table is used in many mappings, then the best way to handle the situation is a one-time-built, named persistent cache.
But wait... why would we do this? Aren't we complicating the thing here?
Yes, we are. But as it turns out, in many cases this can have a performance benefit (especially if the input is already sorted, or when you know the input data will not violate the order, e.g. you are loading daily data and want to sort it by day). Remember that Informatica holds all the rows in the Aggregator cache for the aggregation operation. This needs time and cache space, and it also breaks the normal row-by-row processing in Informatica. By replacing the Aggregator with an Expression, we reduce the cache space requirement and preserve row-by-row processing. The mapping below shows how to do this. Image: Aggregation with Expression and Sorter 1. Sorter (SRT_SAL) Ports Tab
Now, I am showing a sorter here just to illustrate the concept. If you already have sorted data from the source, you need not use it, which increases the performance benefit further. Expression (EXP_SAL) Ports Tab. Image: Expression Ports Tab Properties. Sorter (SRT_SAL1) Ports Tab
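Since the ports-tab screenshots are not reproduced here, the following is a minimal sketch of the variable-port logic such an Expression could use, assuming the input is sorted by DEPTNO and we want SUM(SAL) per DEPTNO (the port names are illustrative). Variable ports are evaluated top to bottom, so the comparison against the previous DEPTNO happens before the previous value is overwritten:

v_running_sum  = IIF(DEPTNO = v_prev_deptno, v_running_sum + SAL, SAL)
v_prev_deptno  = DEPTNO
o_sum_sal (output) = v_running_sum

The last row of each DEPTNO group now carries the complete group total; the second sorter (SRT_SAL1) and a downstream selection of that last row per group give the same result an Aggregator would.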
This is how we can implement aggregation without using Informatica aggregator transformation. Hope you liked it!
Connected Lookup:
- Participates in the dataflow and receives input directly from the pipeline
- Can use both dynamic and static cache
- Can return more than one column value (output ports)
- Caches all lookup columns
- Supports user-defined default values (i.e. the value to return when the lookup condition is not satisfied)
Unconnected Lookup:
- Receives input values from the result of a :LKP expression in another transformation
- Can use only a static cache
- Can return only one column value, i.e. the return port
- Caches only the lookup output ports used in the lookup condition and the return port
- Does not support user-defined default values
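For reference, an unconnected lookup is invoked from an expression with the :LKP syntax. A hypothetical example (the transformation and port names are illustrative):

IIF(ISNULL(CUST_NAME), :LKP.lkp_customer(CUST_ID), CUST_NAME)

Here lkp_customer is an unconnected Lookup transformation whose return port supplies the customer name for the given CUST_ID, and the lookup is called only when the incoming CUST_NAME is null.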
Filter transformation restricts or blocks the incoming record set based on one given condition.
Filter transformation does not have a default group. If a record does not match the filter condition, the record is blocked (dropped).
How can we update a record in target table without using Update strategy?
A target table can be updated without using an Update Strategy transformation. For this, we need to define the key of the target table at the Informatica level and connect both the key and the field we want to update in the mapping target. At the session level, we should set the target property to "Update as Update" and check the "Update" check-box. Let's assume we have a target table "Customer" with the fields "Customer ID", "Customer Name" and "Customer Address", and suppose we want to update "Customer Address" without an Update Strategy. Then we have to define "Customer ID" as the primary key at the Informatica level and connect the Customer ID and Customer Address fields in the mapping. If the session properties are set correctly as described above, the mapping will update only the customer address field for all matching customer IDs.
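Conceptually, with this configuration the writer issues an update keyed on the Informatica-defined primary key, roughly equivalent to the following statement (table and column names taken from the example above):

UPDATE Customer
SET    Customer_Address = ?
WHERE  Customer_ID = ?

so only the connected non-key field (Customer Address) is changed for each matching Customer ID.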
But what if the source is a flat file? How can we remove the duplicates from a flat file source?
To know the answer to this question and similar frequently asked Informatica questions, please read on.