
1. Explain the differences between OLTP and a Data Warehouse.

Application databases are OLTP (On-Line Transaction Processing) systems, where every transaction has to be recorded as and when it occurs. Consider the scenario where a bank ATM has disbursed cash to a customer but was unable to record this event in the bank's records. If this happens frequently, the bank wouldn't stay in business for long. So the banking system is designed to make sure that every transaction gets recorded within the time you stand before the ATM.

A Data Warehouse (DW), on the other hand, is a database (yes, it is a database) designed to facilitate querying and analysis. Often designed as OLAP (On-Line Analytical Processing) systems, these databases contain read-only data that can be queried and analyzed far more efficiently than regular OLTP application databases. In this sense, an OLAP system is designed to be read-optimized. Separation from the application database also ensures that the business intelligence solution is scalable (your bank and ATMs don't go down just because the CFO asked for a report), better documented, and better managed. Creating a DW directly increases the quality of analysis because the table structures are simpler (you keep only the needed information in simpler tables), standardized (well-documented table structures), and often de-normalized (to reduce the linkages between tables and the corresponding complexity of queries). A well-designed DW is the foundation upon which successful BI (Business Intelligence)/Analytics initiatives are built.

Data Warehouses usually store many months or years of data to support historical analysis. OLTP systems usually store data from only a few weeks or months; an OLTP system stores only as much historical data as is needed to meet the requirements of the current transaction.
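As a minimal illustration of the de-normalization point (every table, field, and value below is hypothetical, invented for this sketch), the same sale could be spread across several normalized OLTP tables but stored as one wide, pre-joined row in the warehouse:

    # Hypothetical sketch: the same sale in OLTP vs. a de-normalized DW row.

    # OLTP: normalized, write-optimized -- each entity lives in its own table,
    # and analysis requires joining them back together.
    oltp_sale     = {"sale_id": 9001, "customer_id": 17, "product_id": 342,
                     "amount": 250.00, "timestamp": "2024-03-05T10:31:00"}
    oltp_customer = {"customer_id": 17, "name": "A. Rao", "city": "Pune"}
    oltp_product  = {"product_id": 342, "name": "Savings Deposit",
                     "category": "Retail"}

    # DW: de-normalized, read-optimized -- descriptive attributes are
    # pre-joined into the row, so analytical queries need no lookups.
    dw_sale_row = {"sale_id": 9001, "customer_name": "A. Rao", "city": "Pune",
                   "product_name": "Savings Deposit",
                   "product_category": "Retail",
                   "amount": 250.00, "sale_date": "2024-03-05"}

A query such as "total amount by product category and city" runs against the de-normalized row without any joins, which is exactly the read-optimization described above.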

2. With a necessary diagram, explain the Data Warehouse Development Life Cycle.
The Data Warehouse development life cycle covers two vital areas: warehouse management and data management. The former deals with defining the project activities and gathering requirements, whereas the latter deals with modeling and designing the Warehouse.

Managing the Project: Managing a Data Warehouse project is an ongoing activity. It is not like a traditional systems project; the Data Warehouse is concerned with the execution of the warehousing process and the data.

Defining the Project: The process of defining the project typically involves the following questions: What do I want to analyze? Why do I want it? What if I do not do this? How do I get it? Once software personnel get answers to these questions, they can understand the requirements that must be addressed.

Requirements Gathering: Transaction processing systems focus on automating the process, making it faster and more efficient. This, in turn, means that the requirements for transactional systems are specific and directed more towards business process automation. In contrast, the Data Warehousing environment focuses on facilitating the analysis that will change the process to make it more effective. Common questions/information required during requirements gathering: Who is of interest to the user? What is the user trying to analyze? Why does the user need the data? When does the data need to be retrieved? Where do the relevant processes occur? How do we measure performance?

3. What is Metadata? What is its use in Data Warehouse Architecture?


Metadata in a Data Warehouse is similar to the data dictionary or data catalogue in a Database Management System. In the data dictionary, you keep information about the logical data structures, the files and addresses, the indexes, and so on; in other words, the data dictionary contains data about the data in the database. Similarly, the metadata component is the data about the data in the Data Warehouse. This is the commonly used definition, but it needs elaboration: metadata in a Data Warehouse is similar to a data dictionary, yet it is much more than a data dictionary.
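As a rough sketch of the "much more than a data dictionary" point (every name and value below is an assumption invented for illustration), metadata for one warehouse table might record source-to-target mappings, transformations, and load history alongside the structural entries a plain data dictionary would hold:

    # Hypothetical metadata entry for a single warehouse table: structural
    # information (as in a data dictionary) plus source-mapping and
    # operational details that go beyond what a DBMS dictionary keeps.
    sales_fact_metadata = {
        "table": "SALES_FACT",
        "columns": {"store_key": "INTEGER", "product_key": "INTEGER",
                    "sale_amount": "DECIMAL(12,2)"},
        "source_mapping": {"sale_amount": "ORDERS.ORD_TOTAL (billing system)"},
        "transformations": ["currency converted to a single unit",
                            "missing amounts replaced with 0"],
        "last_load": "2024-03-05",
        "refresh_frequency": "daily",
    }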

4. What is a Surrogate Key? When do we need it in a data warehouse implementation?


Surrogate Keys: How do we resolve the problem faced in the previous section? Can we use production system keys as primary keys for dimension tables? If not, what are the other candidate keys? There are two general principles to apply when choosing primary keys for dimension tables.

The first principle is derived from the problem caused when the product began to be stored in a different warehouse: the product key in the operational system has built-in meanings. Some positions in the operational system product key indicate the warehouse, and some other positions indicate the product category. The first principle, therefore, is: avoid built-in meanings in the primary key of the dimension tables.

In some companies, a few of the customers are no longer listed; they could have left many years ago, and it is possible that the customer numbers of such discontinued customers are reassigned to new customers. Now, suppose we had used the operational system customer key as the primary key for the customer dimension table. We would have a problem, because the same customer number could relate to the data of the newer customer and also to the data of the retired customer, whose data may still be used for aggregations and comparisons by city and state. The second principle, therefore, is: do not use production system keys as primary keys for dimension tables.

What, then, should we use as primary keys for dimension tables? The answer is surrogate keys. Surrogate keys are simply system-generated sequence numbers; they do not have any built-in meanings. Of course, the surrogate keys will be mapped to the production system keys, but they remain different. The general practice is to keep the operational system keys as additional attributes in the dimension tables. For example, the STORE KEY is the surrogate primary key for the store dimension table, while the operational system primary key for the store reference table may be kept as just another non-key attribute in the store dimension table.
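A minimal Python sketch of this practice for a hypothetical store dimension (the function and field names are illustrative, not from any product): surrogate keys are issued as a plain sequence with no built-in meaning, and the operational key is retained only as a non-key attribute.

    from itertools import count

    _next_store_key = count(1)   # system-generated sequence, no built-in meaning
    _store_key_map = {}          # operational (production) key -> surrogate key

    def load_store(operational_key, name, city):
        """Assign a surrogate STORE_KEY; keep the operational key as an attribute."""
        if operational_key not in _store_key_map:
            _store_key_map[operational_key] = next(_next_store_key)
        return {"STORE_KEY": _store_key_map[operational_key],  # surrogate primary key
                "OPERATIONAL_KEY": operational_key,  # non-key attribute, kept for mapping
                "STORE_NAME": name,
                "CITY": city}

    # Even if the operational key encodes warehouse or category positions,
    # the surrogate key stays a meaningless, stable sequence number.
    row = load_store("WH2-GROC-0071", "Main Street Store", "Austin")

Note that if the production key "WH2-GROC-0071" were later reassigned, a new surrogate key could be issued for the new entity while the old dimension row, with its old surrogate key, remains intact for historical analysis.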

5. What is Data Loading? Explain Full Refresh Loading.


Loading often implies the physical movement of data from the computer(s) storing the source database(s) to the one that will store the Data Warehouse database, assuming they are different. Loading takes place immediately after the extraction phase, and the most common channel for data movement is a high-speed communication link. For example, Oracle Warehouse Builder is Oracle's tool for performing the ETL task on an Oracle Data Warehouse.

Data Loading Types:
Initial Load: populating all the Data Warehouse tables for the very first time.
Incremental Load: applying ongoing changes as necessary in a periodic manner.
Full Refresh: completely erasing the contents of one or more tables and reloading them with fresh data (an initial load is a refresh of all the tables).

Because loading the Data Warehouse may take an inordinate amount of time, loads are generally of great concern; during the loads, the Data Warehouse has to be offline.

Initial Load: Suppose you are able to load the whole Data Warehouse in a single run or, as a variation, to split the load into separate subloads and run each of these as a single load. In other words, every load run creates the database tables from scratch. In these cases, you will be using the load mode discussed above. If you need more than one run to create a single table, and your load runs for a single table must be scheduled over several days, the approach is different: for the first run of the initial load of a particular table, use the load mode; all further runs apply the incoming data using the append mode. Creation of indexes on initial loads or full refreshes requires special consideration. Index creation on mass loads can be too time-consuming, so drop the indexes prior to the loads to make the loads go quicker; you may rebuild or regenerate the indexes when the loads are complete.

Incremental Loads: These are the applications of ongoing changes from the source systems. Changes to the source systems are always tied to specific times, irrespective of whether or not they are based on explicit time stamps in the source systems, so you need a method to preserve the periodic nature of the changes in the Data Warehouse. Let us review the constructive merge mode. In this mode, if the primary key of an incoming record matches the key of an existing record, the existing record is left in the target table as is, and the incoming record is added and marked as superseding the old record. If the time stamp is part of the primary key, or is included in the comparison between the incoming and existing records, then constructive merge may be used to preserve the periodic nature of changes. This is an oversimplification of the exact details, but the point is that the constructive merge mode is an appropriate method for incremental loads; the details have to be worked out based on the nature of the individual target tables. Are there cases in which the destructive merge mode may be applied? Consider a Type 1 slowly changing dimension: here, the change to a dimension table record is meant to correct an error in the existing record, so the corrected incoming record must replace the existing record, and you may use the destructive merge mode. This mode is also applicable to any target tables where the historical perspective is not important.

Full Refresh: This type of data application involves periodically rewriting the entire Data Warehouse. Sometimes you may also do partial refreshes to rewrite only specific tables, but partial refreshes are rare because every dimension table is intricately tied to the fact table. As far as the data application modes are concerned, full refresh is similar to the initial load. However, in the case of full refreshes, data exists in the target tables before incoming data is applied, and the existing data must be erased before applying the incoming data. Just as in the case of the initial load, the load and append modes are applicable to full refresh.
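The two merge modes can be sketched in a few lines of Python. This is an illustrative simplification, not how a real load utility works: the in-memory "target table" here maps a primary key to a list of record versions, and all names are made up for the sketch.

    def destructive_merge(target, incoming):
        """A matching incoming record replaces the existing one outright
        (e.g. a Type 1 correction); no history is kept."""
        for key, record in incoming.items():
            target[key] = [record]   # old version is erased
        return target

    def constructive_merge(target, incoming):
        """A matching existing record is left in place; the incoming record
        is added and marked as superseding it."""
        for key, record in incoming.items():
            versions = target.setdefault(key, [])
            if versions:
                versions[-1]["current"] = False   # old record kept, marked superseded
            versions.append({**record, "current": True})
        return target

    # Usage sketch: after the merge, both versions of record 101 remain,
    # but only the newer one is marked current.
    table = {101: [{"city": "Pune", "current": True}]}
    constructive_merge(table, {101: {"city": "Mumbai"}})

The design difference is exactly the one described above: constructive merge preserves the periodic nature of changes for historical analysis, while destructive merge is appropriate where the historical perspective does not matter.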

6. What data quality factors affect a Data Warehouse? Explain them.


The following quality problems occur during Data Warehouse creation, and all of them have to be rectified during ETL processing:
- Dummy values in source system fields
- Absence of data in source system fields
- Multipurpose fields
- Cryptic data
- Contradicting data
- Improper use of name and address lines
- Violation of business rules
- Reused primary keys
- Non-unique identifiers
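As an illustrative sketch of how a few of these problems might be detected during ETL (the field names and the dummy-value list are assumptions made for the example), a simple record scan can flag dummy values, missing data, and non-unique identifiers:

    # Hypothetical dummy values that source systems sometimes use as fillers.
    DUMMY_VALUES = {"N/A", "NA", "UNKNOWN", "99999", ""}

    def find_quality_problems(records, key_field):
        """Scan source records and flag dummy values, absent data,
        and non-unique identifiers for cleansing during ETL."""
        problems, seen_keys = [], set()
        for rec in records:
            for field, value in rec.items():
                if value is None:
                    problems.append((rec.get(key_field), field, "missing data"))
                elif str(value).strip().upper() in DUMMY_VALUES:
                    problems.append((rec.get(key_field), field, "dummy value"))
            key = rec.get(key_field)
            if key in seen_keys:
                problems.append((key, key_field, "non-unique identifier"))
            seen_keys.add(key)
        return problems

Problems such as multipurpose fields, cryptic data, or violations of business rules need domain-specific rules and cannot be caught by a generic scan like this one.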
