You are on page 1of 24

DWH MATERIAL

Business Intelligence Study Guide Business Intelligence (BI): Business Intelligence refers to a set of methods and techniques that are used by organizations for tactical and strategic decision making. Data Warehousing: Integration of data from multiple sources into large warehouses and support of on-line analytical processing and business decision making DW vs. Operational Databases Data Warehouse Subject Oriented Integrated Nonvolatile Time variant Ad hoc retrieval Operational Databases Application oriented Limited integration Continuously updated Current data values only Predictable retrieval Data Warehouse: a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of managements decision-making process. Data Mart A monothematic data warehouse Department- oriented or business line oriented It is a subset of Data Warehouse which is department level

Praveen 9535696199

Page 1

DWH MATERIAL

Difference between data mart and data warehouse Data Mart Data mart is usually sponsored at the department level and developed with a specific issue or subject in mind, a data mart is a data warehouse with a focused objective. A data mart is used on a business division/ department level. A Data Mart is a subset of data from a Data Warehouse. Data Marts are built for specific user groups. By providing decision makers with only a subset of data from the Data Warehouse, Privacy, Performance and Clarity Objectives can be attained. Top-Down Approach (Bill Inmon) Data Warehouse Data warehouse is a Subject-Oriented, Integrated, Time-Variant, Nonvolatile collection of data in support of decision making. A data warehouse is used on an enterprise level A Data Warehouse is simply an integrated consolidation of data from a variety of sources that is specially designed to support strategic and tactical decision making. The main objective of Data Warehouse is to provide an integrated environment and coherent picture of the business at a point in time.

Praveen 9535696199

Page 2

DWH MATERIAL
Advantages A truly corporate effort, an enterprise view of data Inherently architected not a union of disparate data marts Single, central storage of data about the content Centralized rules and control May see quick results if implemented with iterations Disadvantages Takes longer to build even with an iterative method High exposure/risk to failure Needs high level of cross-functional skills High outlay without proof of concept Bottom-Up Approach (Ralph Kimball)

Advantages Faster and easier implementation of manageable pieces Favorable return on investment and proof of concept Less risk of failure Inherently incremental; can schedule important data marts first Allows project team to learn and grow Disadvantages Each data mart has its own narrow view of data Permeates redundant data in every data mart Perpetuates inconsistent and irreconcilable data Proliferates unmanageable interfaces

Data Staging Component Three major functions need to be performed for getting the data ready (ETL) extract the data transform the data and then load the data into the data warehouse storage Page 3

Praveen 9535696199

DWH MATERIAL
Data Warehouse Subject-Oriented - Data is stored by subjects Integrated Data - Need to pull together all the relevant data from the various systems Data from internal operational systems Data from outside sources Time-Variant Data - the stored data contains the current values The use needs data not only about the current purchase, but on the past purchases Nonvolatile Data - Data from the operational systems are moved into the data warehouse at specific intervals Data Granularity - Data granularity in a data warehouse refers to the level of detail The lower the level of detail, the finer the data granularity The lowest level of detail a lot of data in the data warehouse

Important features of a DWH


DRILL DOWN, DRILL ACROSS, Graphs, PI charts, dashboards and TIME HANDLING To be able to drill down/drill across is the most basic requirement of an end user in a data warehouse.

Praveen 9535696199

Page 4

DWH MATERIAL

Schema Graphical Representation of the data structure. First Phase in implementation of Universe Star schema Fact tables contain factual or quantitative data 1:N relationship between dimension tables and fact tables Dimension tables contain descriptions about the subjects of the business Dimension tables are denormalized to maximize performance

Praveen 9535696199

Page 5

DWH MATERIAL
Snow flake schema Unlike Star-Schema, Snowflake schema contain normalized dimension tables in a tree like structure with many nesting levels. Snowflake schema is easier to maintain but queries require more joins.

Granularity

Praveen 9535696199

Page 6

DWH MATERIAL

Principle: Create fact tables with the most granular data possible to support analysis of the business process. In Data warehousing grain refers to the level of detail available in a given fact table as well as to the level of detail provided by a star schema. It is usually given as the number of records per key within the table. In general, the grain of the fact table is the grain of the star schema.

Grain of the star schema


In Data warehousing grain refers to the level of detail available in a given fact table as well as to the level of detail provided by a star schema. It is usually given as the number of records per key within the table. In general, the grain of the fact table is the grain of the star schema.

Fact Table A Fact Table in a dimensional model consists of one or more numeric facts of importance to a business. Examples of facts are as follows:

the number of products sold the value of products sold the number of products produced the number of service calls received

Fact and Dimension


A "fact" is a numeric value that a business wishes to count or sum. A "dimension" is essentially an entry point for getting at the facts. Dimensions are things of interest to the business.

Types of facts

Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table. Page 7

Praveen 9535696199

DWH MATERIAL

Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others. Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Factless Fact Table Factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts. They are often used to record events or coverage information. Common examples of Factless fact tables include: Identifying product promotion events (to determine promoted products that didnt sell)

Tracking student attendance or registration events Tracking insurance-related accident events

Conformed Dimension

These dimensions are something that is built once in your model and can be reused multiple times with different fact tables. For example, consider a model containing multiple fact tables, representing different data marts.

Slowly changing dimensions Are the Customer and Product Dim independent of Time Dim? Changes in names, family status, product district/region How to handle these changes in order not to affect the history status? Eg. Insurance 3 suggestions for slowly changing dimensions
Type 1: The new record replaces the original record. No trace of the old record exists. Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people. Type 3: The original record is modified to reflect the change.

Praveen 9535696199

Page 8

DWH MATERIAL
Type 1 Slowly Changing Dimension
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept. In our example, recall we originally have the following table: Customer Key 1001 Name Christina State Illinois

After Christina moved from Illinois to California, the new information replaces the new record, and we have the following table: Customer Key 1001 Advantages: - This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information. Name Christina State California

Disadvantages: - All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before. Usage: About 50% of the time. When to use Type 1: Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.

Type 2 Slowly Changing Dimension


In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key. In our example, recall we originally have the following table: Customer Key Name State

Praveen 9535696199

Page 9

DWH MATERIAL
1001 Christina Illinois

After Christina moved from Illinois to California, we add the new information as a new row into the table: Customer Key 1001 1005 Advantages: - This allows us to accurately keep all historical information. Disadvantages: - This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern. - This necessarily complicates the ETL process. Usage: About 50% of the time. When to use Type 2: Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes. Name Christina Christina State Illinois California

Type 3 Slowly Changing Dimension


In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active. In our example, recall we originally have the following table: Customer Key 1001 Name Christina State Illinois

To accomodate Type 3 Slowly Changing Dimension, we will now have the following columns: Customer Key Name Original State

Praveen 9535696199

Page 10

DWH MATERIAL
Current State Effective Date

After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003): Customer Key 1001 Name Christina Original State Illinois Current State California Effective Date 15-JAN-2003

Advantages: - This does not increase the size of the table, since new information is updated. - This allows us to keep some part of history. Disadvantages: - Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost. Usage: Type 3 is rarely used in actual practice. When to use Type 3: Type III slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur for a finite number of time.

Junk Dimensions A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes. A good example would be a trade fact in a company that brokers equity trades.

Leave the flags in the fact tables likely sparse data no real browse entry capability can significantly increase the size of the fact table Remove the attributes from the design potentially critical information will be lost if they provide no relevance, remove them Make a flag into its own dimension may greatly increase the number of dimensions, increasing the size of the fact table can clutter and confuse the design Page 11

Praveen 9535696199

DWH MATERIAL

Combine all relevant flags, etc. into a single dimension the number of possibilities remain finite information is retained

Degenerated Dimension

An item that is in the fact table but is stripped off of its description, because the description belongs in dimension table, is referred to as Degenerated Dimension. Since it looks like dimension, but is really in fact table and has been degenerated of its description, hence is called degenerated dimension. A dimension which is located in fact table is called degenerated dimension.

Dimensional Modeling

A type of data modeling suited for data warehousing. In a dimensional model, there are two types of tables: dimensional tables and fact tables. Dimensional table records information on each dimension and fact table records all the fact or measure.

Four steps in Dimensional modeling 1. Identify the process being modeled. 2. Determine the grain at which facts will be stored. 3. Choose the dimensions. 4. Identify the numeric measures for the facts.

Praveen 9535696199

Page 12

DWH MATERIAL

There are three levels of data modeling. They are conceptual, logical, and physical. This section will explain the difference among the three, the order with which each one is created, and how to go from one level to the other.

Conceptual Data Model Features of conceptual data model include: Includes the important entities and the relationships among them.

No attribute is specified. No primary key is specified.

At this level, the data modeler attempts to identify the highest-level relationships among the different entities. Logical Data Model Features of logical data model include: Includes all entities and relationships among them.

All attributes for each entity are specified. The primary key for each entity specified. Foreign keys (keys identifying the relationship between different entities) are specified. Page 13

Praveen 9535696199

DWH MATERIAL

Normalization occurs at this level.

At this level, the data modeler attempts to describe the data in as much detail as possible, without regard to how they will be physically implemented in the database. In data warehousing, it is common for the conceptual data model and the logical data model to be combined into a single step (deliverable). The steps for designing the logical data model are as follows: 1. Identify all entities. 2. Specify primary keys for all entities. 3. Find the relationships between different entities. 4. Find all attributes for each entity. 5. Resolve many-to-many relationships. 6. Normalization. Physical Data Model Features of physical data model include: Specification all tables and columns.

Foreign keys are used to identify relationships between tables. Demoralization may occur based on user requirements. Physical considerations may cause the physical data model to be quite different from the logical data model.

At this level, the data modeler will specify how the logical data model will be realized in the database schema. The steps for physical data model design are as follows: 1. Convert entities into tables. 2. Convert relationships into foreign keys. 3. Convert attributes into columns. 1. http://www.learndatamodeling.com/dm_standard.htm 2. Modeling is an efficient and effective way to represent the organizations needs. It provides information in a graphical way to the members of an organization to understand and communicate the business rules and processes. Business Modeling and Data Modeling are the two important types of modeling. Praveen 9535696199 Page 14

DWH MATERIAL
Logical Data Model Represents business information and defines business rules Entity Attribute Primary Key Alternate Key Inversion Key Entry Rule Relationship Definition Physical Data Model Represents the physical implementation of the model in a database. Table Column Primary Key Constraint Unique Constraint or Unique Index Non Unique Index Check Constraint, Default Value Foreign Key Comment

3 types of multidimensional data Data from external sources (represented by the blue cylinder) is copied into the small red marble cube, which represents input multidimensional data What is Staging area why we need it in DWH? If target and source databases are different and target table volume is high it contains some millions of records in this scenario without staging table we need to design your informatica using look up to find out whether the record exists or not in the target table since target has huge volumes so its costly to create cache it will hit the performance. If we create staging tables in the target database we can simply do outer join in the source qualifier to determine insert/update this approach will give you good performance. Pre-calculated, stored results derived from it on-the-fly results, calculated as required at run-time, but not stored in a database

It will avoid full table scan to determine insert/updates on target. And also we can create index on staging tables since these tables were designed for specific application it will not impact to any other schemas/users. Praveen 9535696199 Page 15

DWH MATERIAL
While processing flat files to data warehousing we can perform cleansing. Data cleansing, also known as data scrubbing, is the process of ensuring that a set of data is correct and accurate. During data cleansing, records are checked for accuracy and consistency.

Since it is one-to-one mapping from ODS to staging we do truncate and reload. We can create indexes in the staging state, to perform our source qualifier best. If we have the staging area no need to relay on the informatics transformation to known whether the record exists or not.

Data cleansing Weeding out unnecessary or unwanted things (characters and spaces etc) from incoming data to make it more meaningful and informative Data merging Data can be gathered from heterogeneous systems and put together Data scrubbing Data scrubbing is the process of fixing or eliminating individual pieces of data that are incorrect, incomplete or duplicated before the data is passed to end user. Data scrubbing is aimed at more than eliminating errors and redundancy. The goal is also to bring consistency to various data sets that may have been created with different, incompatible business rules. ODS (Operational Data Sources): It is a replica of OLTP system to reduce the burden on production system (OLTP) while fetching data for loading targets. Hence it is a mandate Requirement for every Warehouse. OLTP is a sensitive database they should not allow multiple select statements it may impact the performance as well as if something goes wrong while fetching data from OLTP to data warehouse it will directly impact the business. ODS is the replication of OLTP. ODS is usually getting refreshed through some oracle jobs. It enables management to gain a consistent picture of the business.

Surrogate key

A surrogate key is a substitution for the natural primary key. Page 16

Praveen 9535696199

DWH MATERIAL

It is a unique identifier or number (normally created by a database sequence generator) for each record of a dimension table that can be used for the primary key to the table. A surrogate key is useful because natural keys may change.

Aggregation The system uses physically stored aggregates as a way to enhance performance of common queries. These aggregates, like indexes, are chosen silently by the database if they are physically present. End users and application developers do not need to know what aggregates are available at any point in time, and applications are not required to explicitly code the name of an aggregate When you go for higher level of aggregates, the sparsity percentage goes down, eventually reaching 100% of occupancy

Data Quality Issues Dummy values in fields Missing data Unofficial use of fields Cryptic values Contradicting values Reused primary keys Inconsistent values Incorrect values Multipurpose fields Steps in Data Cleansing Parsing Correcting Standardizing Matching Consolidating

OLAP defined: Praveen 9535696199 Page 17

DWH MATERIAL
On-line Analytical Processing(OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access in a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user Users need the ability to perform multidimensional analysis with complex calculations

The basic virtues of OLAP Enables analysts, executives, and managers to gain useful insights from the presentation of data Can reorganize metrics along several dimensions and allow data to be viewed from different perspectives Supports multidimensional analysis Is able to drill down or roll up within each dimension

Praveen 9535696199

Page 18

DWH MATERIAL

BUSINESS METADATA Is like a roadmap or an easy-to-use information directory showing the contents and how to get it How can I sign onto and connect with the data warehouse? Which parts of the data warehouse can I access? Can I see all the attributes from a specific table? What are the definitions of the attributes I need in my query? Are there any queries and reports already predefined to give the results I need?

TECHNICAL METADATA Technical metadata is meant for the IT staff responsible for the development and administration of the data warehouse Technical metadata is like a support guide for the IT professionals to build, maintain, and administer the data warehouse Physical Design Objectives Improve Performance In OLTP, 1-2 secs max; in DW secs to mins Ensure scalability Manage storage Provide Ease of Administration Design for Flexibility. Physical Design Steps Develop Standards Create Aggregates Plan Determine Data Partitioning Establish Clustering Options Prepare Indexing Strategy Assign storage structures Partitioning Breaking data into several physical units that can be handled separately Not a question of whether to do it in data warehouses but how to do it Granularity and partitioning are key to effective implementation of a warehouse Partitions are spread across multiple disks to boost performance Why Partition? Flexibility in managing data Smaller physical units allow easy restructuring Praveen 9535696199 Page 19

DWH MATERIAL
free indexing sequential scans if needed easy reorganization easy recovery easy monitoring Improve performance

Criterion for Partitioning Vertically (groups of selected columns together. More typical in dimension tables) Horizontally (e.g. recent events and past history. Typical in fact tables) Parallelization The argument goes: if your main problem is that your queries run too slowly, use more than one machine at a time to make them run faster (Parallel Processing). Oracle uses this strategy in its warehousing products.

Indexing Structure separate from the table data it refers to, storing the location of rows in the database based on the column values specified when the index is created. They are used in data warehouse to improve warehouse throughput Indexing and loading Indexing for large tables

Btree characteristics: Balanced Page 20

Praveen 9535696199

DWH MATERIAL
Bushy: multi-way tree Block-oriented Dynamic

Bitmap Index Bitmap indices are a special type of index designed for efficient querying on multiple keys Records in a relation are assumed to be numbered sequentially from, say, 0 Given a number n it must be easy to retrieve record n Particularly easy if records are of fixed size Applicable on attributes that take on a relatively small number of distinct values E.g. gender, country, state, E.g. income-level (income broken up into a small number of levels such as 0-9999, 10000-19999, 20000-50000, 50000- infinity) A bitmap is simply an array of bits In its simplest form a bitmap index on an attribute has a bitmap for each value of the attribute Bitmap has as many bits as records In a bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and is 0 otherwise

Clustering The technique involves placing and managing related units of data to be retrieved in the same physical block of storage This arrangement causes related units of data to be retrieved together in one single operation In a clustering index, the order of the rows is close to the index order. Close means that physical records containing rows will not have to be accessed more than one time if the index is accessed sequentially

DW Deployment Major deployment activities Complete user acceptance Perform initial loads Praveen 9535696199 Page 21

DWH MATERIAL
Get user desktops ready Complete initial user training Institute initial user support Deploy in stages DW Growth & Maintenance Monitoring the DW Collection of Stats Usage of Stats For growth planning For fine tuning User training Data Content Applications & Tools Dimensional Modeling Exercise Exercise: Create a star schema diagram that will enable FIT-WORLD GYM INC. to
analyze their revenue. 1 The fact table will include: for every instance of revenue taken attribute(s) useful for analyzing revenue. 2 The star schema will include all dimensions that can be useful for analyzing revenue. 3 The only data sources available are shown bellow. SOURCE 1 FIT-WORLD GYM Operational Database: ER-Diagram and the tables based on it (with data)
MEMBER Buys MembID MemZIP MembName Pays MEMBERSHIP MshpID MshpPrice MshpName

ONEDAYPASS PassID PassDate Is of

Date BuysVia

PASSCATEG PassCatID Price CatName SALESTRANS StID Date

SoldVia

Quantity

MERCHANDISE MrchID MrchName MrchPrice

Praveen 9535696199

Page 22

DWH MATERIAL

SOLUTION

Praveen 9535696199

Page 23

DWH MATERIAL

Praveen 9535696199

Page 24

You might also like