
FAQS ON DATA WAREHOUSING

1. What is a surrogate key? Where do we use it? Explain with examples.


A surrogate key is the primary key of a dimension table; it is a substitute for the natural primary key. It is simply a unique identifier or number for each row that can serve as the table's primary key. The only requirement for a surrogate primary key is that it be unique for each row in the table. Data warehouses typically use a surrogate key (also known as an artificial or identity key) as the primary key of dimension tables. It can be generated with an Informatica sequence generator, an Oracle sequence, or SQL Server identity values. A surrogate key is useful because the natural primary key (e.g. Customer Number in a Customer table) can change, which makes updates more difficult. Some tables have columns such as AIRPORT_NAME or CITY_NAME that business users regard as primary keys, but not only can these change, indexing on a numeric value usually performs better, so you could create a surrogate key called, say, AIRPORT_ID. This key would be internal to the system; as far as the client is concerned, you may display only AIRPORT_NAME.

Another benefit of surrogate keys (SIDs) is tracking slowly changing dimensions (SCDs). A simple, classical example: on 1 January 2002, employee 'E1' belongs to business unit 'BU1' (that is what your Employee dimension records), and turnover is allocated to him on 'BU1'. On 2 June, employee 'E1' is moved from 'BU1' to 'BU2'. All new turnover must belong to 'BU2', but the old turnover should still belong to 'BU1'. If you used the natural business key 'E1' within your data warehouse, everything would be allocated to 'BU2', even what actually belongs to 'BU1'. With surrogate keys, you can instead create a new record for employee 'E1' in your Employee dimension on 2 June, with a new surrogate key.

This way, your fact table holds the old data (before 2 June) with the SID for 'E1' + 'BU1', while all new data (after 2 June) takes the SID for 'E1' + 'BU2'. You can think of a slowly changing dimension as an enlargement of the natural key: the natural key of the employee was Employee Code 'E1', but it effectively becomes Employee Code + Business Unit, i.e. 'E1' + 'BU1' or 'E1' + 'BU2'. The difference from simply enlarging the natural key is that the fact table may not carry every part of the enlarged key, so you may not be able to join on it; hence you need another identifier.
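
The SCD scenario above can be sketched in Python. This is a minimal illustration only; the table layout, field names, and the `add_version` helper are invented for the example:

```python
from datetime import date

# Employee dimension: each row gets its own surrogate key (sid),
# so one natural key ('E1') can have several versions over time.
employee_dim = []

def add_version(natural_key, business_unit, effective_from):
    """Close the current version (if any) and insert a new row
    with a fresh surrogate key -- a Type 2 slowly changing dimension."""
    sid = len(employee_dim) + 1          # next surrogate key
    for row in employee_dim:
        if row["emp_code"] == natural_key and row["effective_to"] is None:
            row["effective_to"] = effective_from   # close the old version
    employee_dim.append({
        "sid": sid,
        "emp_code": natural_key,
        "business_unit": business_unit,
        "effective_from": effective_from,
        "effective_to": None,            # None = current version
    })
    return sid

# 1 Jan 2002: E1 belongs to BU1; 2 Jun 2002: E1 moves to BU2.
sid_bu1 = add_version("E1", "BU1", date(2002, 1, 1))
sid_bu2 = add_version("E1", "BU2", date(2002, 6, 2))

# Facts recorded before 2 June carry sid_bu1, later facts carry sid_bu2,
# so old turnover stays with BU1 and new turnover goes to BU2.
facts = [
    {"emp_sid": sid_bu1, "turnover": 100},   # before the move
    {"emp_sid": sid_bu2, "turnover": 250},   # after the move
]
```

Because each fact row stores the surrogate key in force at load time, no fact ever needs to be restated when the dimension changes.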

2. What is a linked cube?


A cube can be stored on a single Analysis Server and then defined as a linked cube on other Analysis Servers. End users connected to any of these servers can then access the cube. This arrangement avoids the more costly alternative of storing and maintaining copies of the cube on multiple Analysis Servers. Linked cubes can be connected using TCP/IP or HTTP. To end users, a linked cube looks like a regular cube.

3. What is meant by metadata in the context of a data warehouse, and why is it important?


Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.

Metadata synchronization is the process of consolidating, relating, and synchronizing data elements with the same or similar meaning from different systems. It joins these differing elements together in the data warehouse to allow for easier access.

In the context of a data warehouse, metadata means the information about the data. This information is stored in the designer repository.
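
As a tiny illustration of "data about data", a metadata record for one data element might look like the following. The field names and the `is_valid` helper are invented for the example:

```python
# Hypothetical metadata record for a single data element: the name,
# type, length, valid range, and description mentioned above.
customer_id_metadata = {
    "name": "CUSTOMER_ID",
    "data_type": "NUMBER",
    "length": 10,
    "valid_range": (1, 9_999_999_999),
    "description": "Surrogate key of the Customer dimension",
    "source_system": "CRM",          # where the element originates
}

def is_valid(value, meta):
    """Check a value against its metadata definition."""
    low, high = meta["valid_range"]
    return isinstance(value, int) and low <= value <= high
```

A repository holds many such records, which is what lets tools (and people) interpret the warehouse data without inspecting the source systems.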

6. What is the main difference between a schema in an RDBMS and schemas in a data warehouse?
RDBMS schema:
* Used for OLTP systems
* Traditional, long-established schema style
* Normalized
* Difficult to understand and navigate
* Complex extraction problems are hard to solve
* A poor model for analytical work

DWH schema:
* Used for OLAP systems
* Newer generation of schema
* De-normalized
* Easy to understand and navigate
* Complex extraction problems can be solved easily
* A very good model for analytical work

7. What are the various ETL tools in the market?


1. Informatica PowerCenter
2. Ascential DataStage
3. Hyperion Essbase
4. Ab Initio
5. BO Data Integrator
6. SAS ETL
7. MS DTS
8. Oracle OWB
9. Pervasive Data Junction
10. Cognos DecisionStream

8. What is Dimensional Modelling?

In dimensional modelling, data is stored in two kinds of tables: fact tables and dimension tables. A fact table contains the facts/measurements of the business, e.g. sales, revenue, and profit. A dimension table contains the context of those measurements, i.e. the dimensions on which the facts are calculated, such as product id, product name, and product description. Dimensional modelling is a design concept used by many data warehouse designers to build their data warehouses.

Why is data modelling important? Data modelling is probably the most labor-intensive and time-consuming part of the development process. Why bother, especially if you are pressed for time? A common response from practitioners who write on the subject is that you should no more build a database without a model than you should build a house without blueprints.

The goal of the data model is to make sure that all data objects required by the database are completely and accurately represented. Because the data model uses easily understood notation and natural language, it can be reviewed and verified as correct by the end users. The data model is also detailed enough for database developers to use as a "blueprint" for building the physical database. The information contained in the data model is used to define the relational tables, primary and foreign keys, stored procedures, and triggers. A poorly designed database will require more time in the long term: without careful planning you may create a database that omits data required for critical reports, produces incorrect or inconsistent results, and cannot accommodate changes in users' requirements.
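
The fact/dimension split can be sketched in a few lines of Python. The table and column names are invented for the example:

```python
# Dimension table: descriptive context, one row per product,
# keyed by a surrogate product_id.
product_dim = {
    1: {"product_name": "Widget", "category": "Hardware"},
    2: {"product_name": "Gadget", "category": "Electronics"},
}

# Fact table: numeric measurements, each row pointing at its dimensions.
sales_fact = [
    {"product_id": 1, "date": "2002-06-02", "revenue": 120.0},
    {"product_id": 2, "date": "2002-06-02", "revenue": 340.0},
    {"product_id": 1, "date": "2002-06-03", "revenue": 80.0},
]

# A typical analytical query: total revenue per product category,
# i.e. a fact measure grouped by a dimension attribute.
revenue_by_category = {}
for row in sales_fact:
    category = product_dim[row["product_id"]]["category"]
    revenue_by_category[category] = (
        revenue_by_category.get(category, 0.0) + row["revenue"]
    )
```

The measures live only in the fact table; everything you might group or filter by lives in the dimensions.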

9. What is a VLDB?

The perception of what constitutes a VLDB (very large database) continues to grow. A one-terabyte database would normally be considered a VLDB. Alternatively, if a database is too large to back up within its available time window, it is a VLDB.

10. What is real-time data warehousing?


Real-time data warehousing is a combination of two things: 1) real-time activity and 2) data warehousing. Real-time activity is activity that is happening right now; the activity could be anything, such as the sale of widgets. Once the activity is complete, there is data about it. Data warehousing captures business activity data; real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.

A real-time data warehouse provides live data for decision support systems (it may not be 100% up to the moment; some latency will remain). The warehouse has access to the OLTP sources, and data is loaded from source to target not daily or weekly but perhaps every 10 minutes, through replication, log shipping, or a similar mechanism. SAP BW provides real-time data warehousing with the help of its extended star schema, in which source data is shared.

11. What is Data Warehousing?


A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources. Typical relational databases are designed for online transaction processing (OLTP) and do not meet the requirements for effective online analytical processing (OLAP); as a result, data warehouses are designed differently from traditional relational databases.

Data warehousing is the process of creating, populating, and querying a data warehouse. It includes a number of discrete technologies: identifying the sources, and the process of ECCD/ETL, which covers data cleansing, data transformation, and loading into the targets. A data warehouse is a subject-oriented, authoritative, integrated historical database, reflecting changes over meaningful time periods, built to facilitate query and analysis for useful management decision making.

12. What does the level of granularity of a fact table signify?


The level of granularity defines the extent of detail stored in the fact table, and it largely determines the amount of space the table requires. A finer grain keeps more detail and implies more aggregation potential, and vice versa. As an example, consider geographical granularity: we may analyze data at the levels of COUNTRY, REGION, TERRITORY, CITY, and STREET. In this case, STREET is the finest (most detailed) grain and COUNTRY the coarsest.
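
How a finer grain preserves aggregation potential can be shown with a small sketch. The rows, levels, and the `roll_up` helper are invented for the example:

```python
# Fact rows kept at CITY grain (a fairly fine geographical grain).
sales = [
    {"country": "US", "region": "West", "city": "Seattle",  "amount": 10},
    {"country": "US", "region": "West", "city": "Portland", "amount": 20},
    {"country": "US", "region": "East", "city": "Boston",   "amount": 30},
]

def roll_up(rows, level):
    """Aggregate the amount measure up to a coarser grain."""
    totals = {}
    for row in rows:
        key = row[level]
        totals[key] = totals.get(key, 0) + row["amount"]
    return totals

# Because the data is stored at CITY grain, both roll-ups work; had we
# stored only COUNTRY totals, the REGION breakdown would be unrecoverable.
by_region = roll_up(sales, "region")
by_country = roll_up(sales, "country")
```

You can always roll up from a fine grain to a coarse one, never the reverse, which is why the grain decision is made first when designing a fact table.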

13. What is data mining?


Data mining is the process of extracting hidden trends from a data warehouse. For example, an insurance data warehouse can be mined to find the highest-risk people to insure in a certain geographical area. More generally, data mining is the discovery of hidden, unexpected information in existing data.

14. What is an ER Diagram?


ER stands for entity-relationship diagram. It is the first step in the design of a data model, which later leads to the physical database design of a possible OLTP or OLAP database.

The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a way to unify the network and relational database views. Simply stated, the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the entity-relationship diagram, which is used to visually represent data objects. Since Chen wrote his paper, the model has been extended, and today it is commonly used for database design.

For the database designer, the utility of the ER model is that it maps well to the relational model: the constructs used in the ER model can easily be transformed into relational tables. It is also simple and easy to understand with a minimum of training, so the designer can use it to communicate the design to the end user. In addition, the model can serve as a design plan for the database developer to implement a data model in specific database management software.
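
The mapping from ER constructs to relational tables can be sketched as follows. The entities, columns, and the `orders_for` helper are invented for the example:

```python
# Two entities from an ER diagram, each becoming a relational table
# (modelled here as dicts keyed by primary key).
customers = {
    1: {"name": "Acme Corp"},                 # Customer entity
}
orders = {
    100: {"customer_id": 1, "total": 250.0},  # Order entity
}

# The one-to-many relationship "Customer places Order" becomes a
# foreign key (customer_id) in the orders table.
def orders_for(customer_id):
    """Follow the relationship from a customer to its orders."""
    return [oid for oid, o in orders.items()
            if o["customer_id"] == customer_id]
```

Entities become tables, attributes become columns, and relationships become foreign keys (or junction tables for many-to-many relationships), which is exactly why the translation to a physical schema is mechanical.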

15. What is a CUBE in the data warehousing concept?


Cubes are logical representations of multidimensional data. The edges of the cube contain dimension members, and the body of the cube contains data values. A cube is a logical schema that contains facts and dimensions.
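
The edges-and-body idea can be sketched directly: the dimension members and measure values below are invented for the example.

```python
# Edges of the cube: one list of members per dimension.
products = ["Widget", "Gadget"]
regions  = ["East", "West"]
quarters = ["Q1", "Q2"]

# Body of the cube: one data value (sales) per cell, where a cell is
# identified by one member from each dimension.
cube = {
    (p, r, q): 0.0
    for p in products
    for r in regions
    for q in quarters
}
cube[("Widget", "East", "Q1")] = 150.0
cube[("Gadget", "West", "Q2")] = 275.0

# Slicing: fix one dimension member and keep the rest of the cube.
widget_slice = {cell: v for cell, v in cube.items() if cell[0] == "Widget"}
```

A real OLAP engine stores and indexes the cells far more cleverly, but the logical picture, cells addressed by a tuple of dimension members, is the same.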

16. What is an ODS?


ODS stands for Operational Data Store. It is the final integration point in the ETL process before loading the data into the data warehouse, and it contains near-real-time data. In a typical data warehouse architecture, the ODS is sometimes used for analytical reporting as well as a source for the data warehouse. An operational data store is a hybrid structure that has some aspects of a data warehouse and some aspects of an operational system: it contains integrated data, can support DSS processing, and can also support high-volume transaction processing. It is often placed between the warehouse and the web to support web users. It is the form the data warehouse takes in the operational environment. Operational data stores can be updated, provide rapid, constant response times, and contain only a limited amount of historical data.

17. What is Normalization? What are First Normal Form, Second Normal Form, and Third Normal Form?
Normalization is the process of decomposing tables to eliminate data redundancy. It is a step-by-step process of removing redundancies and dependencies among the attributes of a data structure; the condition of the data at the completion of each step is described as a normal form. Normalization improves database design, ensures minimum redundancy of data, reduces the need to reorganize data when the design is modified or enhanced, and removes anomalies from database activities.

First normal form (1NF): the table should contain only scalar (atomic) values. A table is in first normal form when it contains no repeating groups; the repeating columns of an unnormalized table are removed and put into tables of their own. Such a table becomes dependent on the parent table from which it is derived, and its key is a concatenated key, with the key of the parent table forming part of it.

Second normal form (2NF): the table should be in 1NF with no partial functional dependencies, i.e. every non-key field must depend on the whole key, not on a subset of it. For example, with key {part, supplier}, a supplier-address column depends only on supplier and must be moved to another table on whose key it depends. Tables without combination keys are automatically in second normal form.

Third normal form (3NF): the table should be in 2NF with no transitive dependencies, i.e. no non-key field may depend on another non-key field. For example, with primary key {part} and non-key columns warehouse name and warehouse address, the address depends on the warehouse name (a non-key field) and must be moved out.

Fourth and fifth normal forms (4NF, 5NF) handle multi-valued dependencies, essentially describing many-to-many relations.
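
The 2NF decomposition described above can be sketched in a few lines. The rows and column names are invented for the example:

```python
# Unnormalized rows: the supplier address repeats for every part a
# supplier provides -- a partial dependency on the {part, supplier} key.
order_rows = [
    {"part": "P1", "supplier": "S1", "sup_address": "12 Oak St"},
    {"part": "P2", "supplier": "S1", "sup_address": "12 Oak St"},
    {"part": "P3", "supplier": "S2", "sup_address": "9 Elm Rd"},
]

# Decompose: move the address into a supplier table keyed on supplier
# alone, leaving only the {part, supplier} pairs behind.
suppliers = {}
part_supplier = []
for row in order_rows:
    suppliers[row["supplier"]] = row["sup_address"]   # stored once
    part_supplier.append({"part": row["part"], "supplier": row["supplier"]})
```

After the split, each supplier's address is stored exactly once, so updating it no longer risks the inconsistent duplicates the original table allowed.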

18. What are the different methods of loading dimension tables?


Conventional load: before the data is loaded, all table constraints are checked against it. Direct load (faster): all constraints are disabled and the data is loaded directly; afterwards the data is checked against the table constraints and bad data is not indexed. The conventional and direct load methods apply only to Oracle; the naming convention is not a general one applicable to other RDBMSs such as DB2 or SQL Server.

19. What is the difference between OLTP and OLAP?


OLTP:
* Current data
* Short database transactions
* Online update/insert/delete
* Normalization is promoted
* High-volume transactions
* Transaction recovery is necessary

OLAP:
* Current and historical data
* Long database transactions
* Batch update/insert/delete
* Denormalization is promoted
* Low-volume transactions
* Transaction recovery is not necessary

--- OLTP is Online Transaction Processing: it uses normalized tables and online data, with frequent inserts, updates, and deletes. OLAP (Online Analytical Processing) holds the history of the OLTP data; it is non-volatile, acts as a decision support system, and is used for creating forecasting reports.

20. What are Data Marts?


Data Mart is a segment of a data warehouse that can provide data for reporting and analysis on a section, unit, department or operation in the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data warehouses which are usually smaller than the corporate data warehouse.

21. Compare a Data Warehouse database and an OLTP database.

The data warehouse and the OLTP database are both relational databases. However, their objectives differ. The OLTP database records transactions in real time and aims to automate the clerical data-entry processes of a business entity. Addition, modification, and deletion of data in the OLTP database are essential, and the semantics of the front-end application shape the organization of the data in the database. The data warehouse, on the other hand, does not cater to the real-time operational requirements of the enterprise; it is more a storehouse of current and historical data and may also contain data extracted from external data sources.

Differences:

* Data warehouse database: designed for analysis of business measures by categories and attributes. OLTP database: designed for real-time business operations.
* Data warehouse database: optimized for bulk loads and large, complex, unpredictable queries that access many rows per table. OLTP database: optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.
* Data warehouse database: loaded with consistent, valid data; requires no real-time validation. OLTP database: optimized for validation of incoming data during transactions; uses validation data tables.
* Data warehouse database: supports few concurrent users relative to OLTP. OLTP database: supports thousands of concurrent users.

However, the data warehouse supports the OLTP system by providing a place for the latter to offload data as it accumulates, and by providing services that would otherwise degrade the performance of the operational database.

22. What is the difference between a Data Warehouse and Online Analytical Processing?

Bill Inmon defines the data warehouse as "a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process." Ralph Kimball, the co-founder of the data warehousing concept, has defined it as "a copy of transaction data specifically structured for query and analysis." Both definitions highlight specific features of the data warehouse: the former focuses on the structure and organization of the data, the latter on its usage. A full listing of the features of a data warehouse would necessarily include the aspects highlighted in both definitions.

Data warehouse and OLAP are terms that are often used interchangeably. Actually, they refer to two different components of a decision support system: while the data in a data warehouse is the historical data of the organization, stored for end-user analysis, OLAP is a technology that enables a data warehouse to be used effectively for online analysis through complex analytical queries. The differences are tabulated below for ease of understanding:

* Data warehouse: data from different data sources is stored in a relational database for end-user analysis. OLAP: a tool to evaluate and analyze the data in the data warehouse using analytical queries.
* Data warehouse: data is organized in summarized, aggregated, subject-oriented, non-volatile patterns. OLAP: a tool that helps organize data in the data warehouse using multidimensional models of data aggregation and summarization.
* Data warehouse: data is a consolidated, flexible collection; it supports analysis of data but does not itself support online analysis. OLAP: supports the data analyst in real time and enables online analysis of data with speed and flexibility.

