
Architecture of Data Warehouse

By: Er. Manu Bansal (Assistant Professor), Dept. of IT, mrmanubansal@gmail.com

Data Warehouse- Concept

A data warehouse refers to a database that is maintained separately from an organization's operational databases. The construction of data warehouses involves data cleaning, data integration, and data transformation. Data warehousing also forms an essential step in the knowledge discovery process.

Data Warehouse vs. Database


The four keywords that distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems, are:

- Subject-oriented
- Integrated
- Time-variant
- Nonvolatile

Three-Tier Architecture


[Figure: Three-tier data warehouse architecture. Bottom tier (data sources and data storage): operational databases and other sources are extracted, transformed, loaded, and refreshed into the data warehouse and data marts, coordinated by a monitor and integrator and described by metadata. Middle tier (OLAP engine): OLAP servers that serve the data. Top tier (front-end tools): analysis, query, reports, and data mining.]

Typical Components of a Data Warehouse Architecture

Operational data

Without the source systems, there would be no data warehouse. The data sources for the data warehouse are supplied as follows:

- Operational data held in network databases
- Departmental data held in file systems
- Private data held on workstations and private servers
- External systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers

Operational Data Store(ODS)

An ODS is a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but it may in fact simply act as a staging area for data to be moved into the warehouse. The objectives of an ODS are to integrate information from day-to-day systems, to allow operational lookups, and to relieve the day-to-day systems of reporting and current-data analysis demands. An ODS can be a helpful step towards building a data warehouse because it can supply data that has already been extracted from the source systems and cleaned.

Load Manager

Also called the front-end component. Data is extracted from the operational systems directly, or from the operational data store, and then loaded into the data warehouse. The load manager performs all the operations associated with the extraction and loading of data into the warehouse. These operations include sourcing, acquisition, cleanup, and transformation tools that prepare the data for entry into the warehouse. The functionality includes (see the sketch after this list):

- Removing unwanted data from operational databases
- Converting to common data names and definitions
- Calculating summaries
- Establishing defaults for missing data
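A minimal sketch of these load-manager steps in Python, assuming the operational data arrives as a list of dictionaries; the field names, rename/default rules, and the "test" status filter are hypothetical illustrations, not part of the original material.

```python
# Minimal load-manager sketch: clean, standardize, default, and summarize
# operational records before loading. Field names and defaults are hypothetical.
from collections import defaultdict

RENAME = {"cust_name": "customer_name", "amt": "amount"}   # common names/definitions
DEFAULTS = {"region": "UNKNOWN", "amount": 0.0}            # defaults for missing data

def transform(records):
    cleaned = []
    for rec in records:
        if rec.get("status") == "test":          # remove unwanted operational rows
            continue
        row = {RENAME.get(k, k): v for k, v in rec.items()}
        for field, default in DEFAULTS.items():  # establish defaults for missing data
            row.setdefault(field, default)
        cleaned.append(row)
    return cleaned

def summarize(rows):
    totals = defaultdict(float)                  # calculate summaries per region
    for row in rows:
        totals[row["region"]] += float(row["amount"])
    return dict(totals)

if __name__ == "__main__":
    source = [{"cust_name": "Acme", "amt": 120.5, "region": "North"},
              {"cust_name": "Test", "amt": 1.0, "status": "test"},
              {"cust_name": "Zen", "amt": 80.0}]
    rows = transform(source)
    print(rows)
    print(summarize(rows))
```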

Warehouse Manager

Performs all the operations associated with the management of the data in the warehouse, as follows (see the sketch after this list):

- Analysis of data to ensure consistency
- Transformation and merging of source data from temporary storage into the data warehouse tables
- Creation of indexes and views
- Backing up and archiving data
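A sketch of the merge and index/view steps using Python's sqlite3 module; the table and column names are hypothetical, and a production warehouse would run on a dedicated RDBMS rather than SQLite.

```python
# Warehouse-manager sketch with sqlite3: merge staged rows into a warehouse
# table, then create an index and a summary view. All names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staging_sales (sale_date TEXT, region TEXT, amount REAL);
    CREATE TABLE fact_sales   (sale_date TEXT, region TEXT, amount REAL);
    INSERT INTO staging_sales VALUES ('2024-01-01', 'North', 120.5),
                                     ('2024-01-01', 'South', 80.0);
""")

# Transformation/merging of source data from temporary storage into warehouse tables
con.execute("INSERT INTO fact_sales SELECT * FROM staging_sales")
con.execute("DELETE FROM staging_sales")

# Creation of indexes and views
con.execute("CREATE INDEX idx_sales_region ON fact_sales(region)")
con.execute("""CREATE VIEW v_sales_by_region AS
               SELECT region, SUM(amount) AS total_amount
               FROM fact_sales GROUP BY region""")

print(con.execute("SELECT * FROM v_sales_by_region").fetchall())
```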

Data Warehouse Database


The central repository for information. This database is almost always implemented on relational database management system (RDBMS) technology. Certain data warehouse attributes, such as very large database size, ad hoc query processing, and the need for flexible user view creation (including aggregates, multi-table joins, and drill-downs), have become drivers for different technology approaches to the data warehouse database. These approaches include:

Data Warehouse Database- Contd.

- Parallel relational database designs that require a parallel computing platform, such as symmetric multiprocessors (SMPs) or massively parallel processors (MPPs)
- Multidimensional databases (MDDBs)

Query Manager

Also called the back-end component. The query manager performs all the operations associated with the management of user queries, including directing queries to the appropriate tables and scheduling the execution of queries. A routing sketch follows.
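A simplified routing sketch, assuming hypothetical summary and fact table names; a real query manager would also consult metadata and a scheduler.

```python
# Query-manager sketch: direct a request to the summary table when it can be
# answered from pre-aggregated data, otherwise to the detailed fact table.
# Table names and the routing rule are hypothetical simplifications.

SUMMARY_COLUMNS = {"region", "total_amount"}   # columns held in the summary table

def route_query(requested_columns):
    if set(requested_columns) <= SUMMARY_COLUMNS:
        return "SELECT {} FROM summary_sales_by_region".format(", ".join(requested_columns))
    return "SELECT {} FROM fact_sales".format(", ".join(requested_columns))

print(route_query(["region", "total_amount"]))   # served by the summary table
print(route_query(["sale_date", "amount"]))      # needs the detailed data
```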

Detailed Data

Stores all the detailed data in the database schema. On a regular basis, detailed data is added to the warehouse to supplement the aggregated data.

Lightly and Highly Summarized Data


Stores all the predefined lightly and highly aggregated data generated by the warehouse manager. The purpose of summary information is to speed up query performance; it also removes the requirement to continually perform summary operations (such as sort or group by) when answering user queries. The summarized data is updated continually as new data is loaded into the warehouse (see the granularity sketch below).
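A sketch of the difference in granularity between lightly and highly summarized data, using hypothetical detailed records.

```python
# Lightly vs. highly summarized data built from the same detailed rows.
# The detailed records and the chosen granularities are hypothetical.
from collections import defaultdict

detail = [("2024-01", "North", 120.5), ("2024-01", "South", 80.0),
          ("2024-02", "North", 95.0)]

lightly = defaultdict(float)   # lightly summarized: by month and region
highly = defaultdict(float)    # highly summarized: by region only
for month, region, amount in detail:
    lightly[(month, region)] += amount
    highly[region] += amount

print(dict(lightly))
print(dict(highly))
```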

Archive/Backup Data
Stores detailed and summarized data for the purposes of archiving and backup. It may be necessary to back up online summary data if this data is kept beyond the retention period for detailed data. The data is transferred to storage archives such as magnetic tape or optical disk.

Metadata

This area of the warehouse stores all the metadata definitions used by all the processes in the warehouse. Metadata is used for a variety of purposes (a small catalog sketch follows the list):

- Extraction and loading processes
- Warehouse management process: used to automate the production of summary tables
- Query management process: used to direct a query to the most appropriate data source; end-user access tools also use metadata to understand how to build a query
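A minimal metadata-catalog sketch in Python; the table names, timestamps, transformation notes, and refresh policies are invented for illustration.

```python
# Minimal metadata-catalog sketch: a dictionary describing where each warehouse
# table came from and how it was produced. All entries are hypothetical.
metadata = {
    "fact_sales": {
        "source": "operational orders database",
        "extracted_at": "2024-01-02T02:00:00",
        "transformations": ["renamed amt -> amount", "defaulted missing region"],
        "grain": "one row per sale",
    },
    "summary_sales_by_region": {
        "derived_from": "fact_sales",
        "aggregation": "SUM(amount) GROUP BY region",
        "refresh": "nightly",
    },
}

def best_source(subject, needs_detail):
    """Direct a request to the most appropriate data source using the catalog."""
    summary_name = f"summary_{subject}_by_region"
    if not needs_detail and summary_name in metadata:
        return summary_name
    return f"fact_{subject}"

print(best_source("sales", needs_detail=False))
```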

End-user Access Tools


Users interact with the warehouse using end-user access tools, which can be categorized into five main groups:

- Data reporting and query tools (e.g., Query by Example in the MS Access DBMS)
- Application development tools (applications used to access major DBMSs such as Oracle, Sybase, ...)
- Executive information system (EIS) tools (for sales, marketing, and finance)
- Online analytical processing (OLAP) tools (allow users to analyze the data using complex, multidimensional views drawn from multiple databases; see the pivot sketch after this list)
- Data mining tools (allow the discovery of new patterns and trends by mining large amounts of data using statistical and mathematical techniques)
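An illustrative OLAP-style "slice and dice" view built with pandas pivot_table; pandas and the sample data are assumptions here, since the text does not name a specific OLAP tool.

```python
# OLAP-style multidimensional view: regions as rows, months as columns.
# The library choice (pandas) and the data are hypothetical illustrations.
import pandas as pd

sales = pd.DataFrame({
    "month":  ["2024-01", "2024-01", "2024-02", "2024-02"],
    "region": ["North", "South", "North", "South"],
    "amount": [120.5, 80.0, 95.0, 60.0],
})

cube = pd.pivot_table(sales, values="amount", index="region",
                      columns="month", aggfunc="sum")
print(cube)
```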

Data Warehousing: Data flows

Inflow
The processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse (a transformation sketch follows):
- Cleansing includes removing inconsistencies, adding missing fields, and cross-checking for data integrity.
- Transformation includes adding date/time stamp fields, summarizing detailed data, and deriving new fields to store calculated data.
- The relevant data is extracted from multiple, heterogeneous, and external sources (commercial tools are often used), then mapped and loaded into the warehouse.
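A small inflow-transformation sketch, assuming hypothetical field names, that adds a load timestamp and a derived (calculated) field to each extracted record.

```python
# Inflow transformation sketch: add a date/time stamp and a derived field to
# each extracted record before loading. Field names are hypothetical.
from datetime import datetime, timezone

def transform_inflow(records):
    loaded_at = datetime.now(timezone.utc).isoformat()
    out = []
    for rec in records:
        row = dict(rec)
        row["loaded_at"] = loaded_at                         # date/time stamp field
        row["net_amount"] = row["amount"] - row["discount"]  # derived/calculated field
        out.append(row)
    return out

print(transform_inflow([{"amount": 120.5, "discount": 10.0}]))
```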

Upflow
The processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data (a packaging sketch follows):
- Summarizing the data works by selecting, projecting, joining, and grouping relational data into views that are more convenient and useful to the end users.
- Packaging the data involves converting the detailed or summarized information into more useful formats, such as spreadsheets, text documents, charts, other graphical presentations, private databases, and animation.
- Distributing the data to appropriate groups increases its availability and accessibility.
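A packaging sketch that converts a (hypothetical) summarized result into a spreadsheet-style CSV file for distribution.

```python
# Upflow packaging sketch: convert a summarized result into a spreadsheet-style
# CSV file for distribution. File and column names are hypothetical.
import csv

summary = [{"region": "North", "total_amount": 215.5},
           {"region": "South", "total_amount": 140.0}]

with open("sales_by_region.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["region", "total_amount"])
    writer.writeheader()
    writer.writerows(summary)
```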

Downflow
The processes associated with archiving and backing up data in the warehouse (an archiving sketch follows):
- Effectiveness and performance are maintained by transferring older data of limited value to storage archives such as magnetic tape, optical disk, or other digital storage devices.
- The downflow of data also includes the processes that ensure the current state of the data warehouse can be rebuilt following data loss or software/hardware failures.
- Archived data should be stored in a way that allows the re-establishment of the data in the warehouse when required.
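An archiving sketch that moves rows older than a (hypothetical) cutoff date into a compressed archive file from which they could later be restored.

```python
# Downflow archiving sketch: move rows older than a cutoff date out of the
# active data and into a compressed archive file. All names are hypothetical.
import gzip, json

fact_sales = [{"sale_date": "2019-05-01", "amount": 50.0},
              {"sale_date": "2024-01-01", "amount": 120.5}]
CUTOFF = "2023-01-01"

old_rows = [r for r in fact_sales if r["sale_date"] < CUTOFF]
fact_sales = [r for r in fact_sales if r["sale_date"] >= CUTOFF]

with gzip.open("fact_sales_archive.json.gz", "wt") as f:
    json.dump(old_rows, f)   # archived data can be re-established when required

print(len(fact_sales), "active rows;", len(old_rows), "archived")
```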

Outflow

Involves the processes associated with making the data available to the end users. This involves two activities: data accessing and data delivery.
- Data accessing is concerned with satisfying end users' requests for the data they need. The main problem here is creating an environment in which users can effectively use the query tools to access the most appropriate data source.
- The delivery activity makes it possible to deliver information to the users' systems/workstations.

Metaflow
Metaflow is a description of the data contents of the data warehouse: what is in it, where it came from originally, and what has been done to it by way of cleansing, integrating, and summarizing.

Managing the metadata (data about the data)

Thanks
