DATA WAREHOUSING
CONTENTS
1. What is Data Warehousing?
2. Why Build a Data Warehouse?
3. Microsoft Data Warehousing Framework
4. Data Warehousing Framework Components
5. Data Warehousing Architecture
6. Data Marts
7. Building a Data Warehouse from Data Marts
8. OLAP
9. Integrated OLAP Analytical Capabilities
10. Data Warehousing and OLAP
11. Data Warehousing Components
12. Data Warehousing Characteristics
13. Designing and Building a Data Warehouse
14. Data Granularity
15. Building the OLAP Data Model
16. Process Flow Within a Data Warehouse
17. Data Warehousing and Online Analytical Processing
A data warehouse is an integrated store of information collected from other systems that becomes the foundation for decision support and data analysis. Although there are many types of data warehouses, based on different design methodologies and philosophical approaches, they all have these common traits: Information is organized around the major subjects of the enterprise (for example, customers, products, sales, or vendors), reflecting a data-driven design. Raw data is gathered from nonintegrated operational and legacy applications, cleansed, and then summarized and presented in a way that makes sense to end users. Based on feedback from end users and discoveries in the data warehouse, the data warehouse architecture will change over time, reflecting the iterative nature of the process.
The data warehousing process is inherently complex and, as a result, is costly and time-consuming. At its simplest, a data warehouse is a copy of the data in a database, created specifically to allow users to query that data. Creating a data warehouse allows you to structure the data for easier reporting and better security: for example, by pre-joining tables to assist with queries by novice users, or by removing fields that contain sensitive data. It also lets you move the data to another server to minimize the performance impact that queries can cause. Because data in a data warehouse is a snapshot of a database, it must be refreshed periodically; the exact interval depends on your application needs.
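The pre-joining and sensitive-field removal described above can be sketched in a few lines. This is only an illustration: the table names (orders, customers) and fields (ssn, amount) are hypothetical, not from any real schema.

```python
# Build a reporting "snapshot": pre-join orders with customers for novice
# users, and deliberately leave out sensitive fields such as the SSN.

def build_reporting_snapshot(orders, customers):
    cust_by_id = {c["id"]: c for c in customers}
    snapshot = []
    for o in orders:
        c = cust_by_id[o["customer_id"]]
        snapshot.append({
            "order_id": o["id"],
            "amount": o["amount"],
            "customer_name": c["name"],  # joined in so users need no join
            # c["ssn"] is intentionally omitted: sensitive data stays out
        })
    return snapshot

customers = [{"id": 1, "name": "Acme", "ssn": "123-45-6789"}]
orders = [{"id": 10, "customer_id": 1, "amount": 250.0}]
snapshot = build_reporting_snapshot(orders, customers)
```

In a real system this snapshot would be rebuilt on the periodic refresh schedule mentioned above rather than computed per query.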
Why Build a Data Warehouse?
The need for data warehousing technology arises from the requirement to organize and maintain large volumes of data, and to analyze them in a few seconds in the manner and depth required. Conventional information systems did not succeed in meeting these requirements. Conventional systems and data warehousing tackle two different activity domains: OLTP and OLAP. A conventional information system is not capable of analyzing a large number of past transactions or a large number of data records, and the cost of processing the data increases as the volume of data increases. As a result, analysts find data marts extremely useful for fast and easy analysis. Since data flows into a data mart (from the data warehouse), the department that owns it can easily customize the data. Data marts are of two types:
1. Multidimensional OLAP (MOLAP)
2. Relational OLAP (ROLAP)
Determining Business, User, and Technical Requirements
Before a data warehouse can be built, a detailed project and implementation plan should be written. The project and implementation plan includes:
Building a business case.
Gathering user requirements.
Determining the technical requirements.
Defining standard reports required by users.
Analyzing client application tools being used.
Building the business case is common at the beginning of any project. It involves determining the business needs solved by the project, the costs of the project, and the return on the investment.
Gathering user requirements largely involves interviewing the intended users of the data warehouse. The user requirements determine:
Data requirements (level of granularity).
Operational systems within the enterprise containing the data.
Business rules followed by the data.
Queries required to provide the users with data.
The technical requirements may involve determining:
Hardware architecture and infrastructure (for example, links to remote geographical regions where data marts might be located).
Backup and recovery mechanisms.
Security guidelines.
Methods of loading and transforming data from operational systems to the data warehouse.
Standard reports required by users should be analyzed to determine the tables, columns, and selection criteria necessary to create the reports, and the frequency with which they are generated. Provisions should also be made for expanding or modifying the scope of reports as required. Client application tools should be analyzed to determine if they can provide enhanced processing capabilities that help in processing data, performing queries, or generating reports.
data warehouse or data mart. This data staging process can be many levels deep, especially with enterprise data warehousing architectures, but is necessarily simplified in this illustration.
End-user tools, including desktop productivity products, specialized analysis products, and custom programs, are used to gain access to information in the data warehouse. Ideally, user access is through a directory facility that enables end-user searches for appropriate and relevant data to resolve questions and that provides a layer of security between the end users and the data warehouse systems. Finally, a variety of components can come into play for the management of the data warehousing environment, such as for scheduling repeated tasks and managing multiserver networks. The Data Warehousing Framework describes the relationships between the various components used in the process of building, using, and managing a data warehouse. Two enabling technologies comprise the core Data Warehousing Framework: the integrated metadata repository and the data transport layer (OLE DB). These technologies make possible the interoperability of many products and components involved in data warehousing. OLE DB provides for standardized, high-performance access to a wide variety of data, and allows for integration of multiple data types:
Mainframe indexed sequential access method/virtual storage access method (ISAM/VSAM) and hierarchical databases
E-mail and file system stores
Text, graphical, and geographical data
Custom business objects
Microsoft Repository provides an integrated metadata repository that is shared by the various components used in the data warehousing process. Shared metadata allows for the transparent integration of multiple products from a variety of vendors, without the need for specialized interfaces between each of the products.
OLE DB defines a collection of COM interfaces that encapsulate various database management system services. These interfaces enable the creation of software components that implement such services. OLE DB components consist of data providers (which contain and expose data), data consumers (which use data), and service components (which process and transport data). OLE DB interfaces are designed to integrate components smoothly so that vendors can bring high-quality OLE DB components to the market quickly. In addition, OLE DB includes a bridge to ODBC that enables continued support for the broad range of ODBC relational database drivers available today.
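The provider / service component / consumer division of roles can be sketched as plain objects. To be clear, this is not OLE DB or COM code; it only illustrates, under invented class names, how the three kinds of components chain together.

```python
# Three roles from the OLE DB component model, as a toy illustration:
# a provider exposes data, a service component processes it in transit,
# and a consumer uses the result.

class DataProvider:
    """Contains and exposes data."""
    def __init__(self, rows):
        self._rows = rows
    def read(self):
        return list(self._rows)

class SortService:
    """A service component: processes data on its way to the consumer."""
    def __init__(self, source):
        self._source = source
    def read(self):
        return sorted(self._source.read())

class DataConsumer:
    """Uses data; here it just scales each value."""
    def consume(self, source):
        return [value * 10 for value in source.read()]

provider = DataProvider([3, 1, 2])
result = DataConsumer().consume(SortService(provider))
```

Because the service component exposes the same `read()` interface as the provider, components can be stacked in any order, which is the integration property the interface design aims at.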
The understandable preference for, and proliferation of, data marts instead of enterprise data warehouses is causing a future systems integration issue for
some organizations. Creation of a data warehousing architecture for the enterprise saves each data mart project team from having to recreate the elements of the architecture. If an organization's structure is one that may require future data mart integration, it is less work to plan and architect for this before the data marts are created than after the fact. It behooves organizations that will operate multiple data marts to establish an information architecture or model. The goal of the data warehousing architecture is to identify the following common items:
Source systems for data to populate the data marts.
Business dimensions.
Semantics (data definitions, measurements, and aggregation formulas).
If an organization requires, or will require, that multiple data marts each download similar data from production source systems, then the most maintainable architecture may be to move data first from the source systems to a well-designed enterprise data warehouse or operational data store. Then, each data mart can be populated from the enterprise data warehouse, inheriting in the process all of the architectural elements that were implemented in the enterprise data warehouse. This approach also reduces the number of interface links between data marts and source systems that must be maintained.
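The hub-and-spoke flow just described can be reduced to a small sketch: all sources feed one warehouse load that applies shared cleansing rules, and each mart is then just a filtered view of that warehouse. The region field and the uppercasing rule are invented for illustration.

```python
# Sources -> enterprise warehouse (one shared cleansing step) -> data marts.

def load_warehouse(*sources):
    """Consolidate rows from every source, applying one cleansing rule."""
    warehouse = []
    for rows in sources:
        for r in rows:
            # Shared semantics: region codes are normalized exactly once here,
            # so every mart inherits the same definition.
            warehouse.append({**r, "region": r["region"].upper()})
    return warehouse

def load_mart(warehouse, region):
    """A mart is populated from the warehouse, not from the sources."""
    return [r for r in warehouse if r["region"] == region]

source_a = [{"region": "east", "sales": 100}]
source_b = [{"region": "west", "sales": 80}]
wh = load_warehouse(source_a, source_b)
east_mart = load_mart(wh, "EAST")
```

Note that adding a third mart adds no new source interfaces, which is the maintainability argument made above.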
Data Marts
A data mart is typically defined as a subset of the contents of a data warehouse, stored within its own database. A data mart tends to contain data focused at the department level, or on a specific business area. The data can exist at both the detail and summary levels. The data mart can be populated with data taken directly from operational sources, similar to a data warehouse, or data taken from the data warehouse itself. Because the volume of data in a data mart is less than that in a data warehouse, query processing is often faster.
Characteristics of a data mart include:
Quicker and simpler implementation.
Lower implementation cost.
Needs of a specific business unit or function met.
Protection of sensitive information stored elsewhere in the data warehouse.
Faster response times due to lower volumes of data.
Distribution of data marts to user organizations.
Built from the bottom upward.
Departmental or regional divisions often determine whether data marts or data warehouses are used. For example, if managers in different sales regions require data from only their region, then it can be beneficial to build data marts containing specific regional data. If regional managers require access to all the organization's data, then a larger data warehouse is usually necessary. Although data marts are often designed to contain data relating to a specific business function, there can be times when users need a broader level of business data. However, because this broader-level data is often only needed in summarized form, it is acceptable to store it within each data mart rather than implementing a full data warehouse.
Designing and Processing Aggregations
OLAP tools are typically used to create and manage summary data. OLAP Services allows aggregations to be stored in a variety of formats and locations, with dynamic connections to underlying details in the data warehouse. Summary data is often generated to satisfy the commonly executed queries in the data warehouse. Storing preaggregated data increases query performance, and reduces the load on the data warehouse. If a data warehouse is built so the data in it does not change, then preaggregating data in the fact table saves only the disk space required by the fact table. OLAP Services uses the processing time that would have been used to preaggregate in the fact table when it processes the fact table as it builds a cube. However, precalculated aggregations are stored in the cube and do not need to be recalculated for each query. If a hybrid OLAP (HOLAP) or relational OLAP (ROLAP) cube is used, the fact table is not copied into the cube as it is in multidimensional OLAP (MOLAP) cubes, so the overhead required to retain
availability of the detail data is only the fact table size, not processing time or query response time. The preaggregation strategy when designing a data warehouse for use by OLAP Services depends on the following variables:
Stability of the data. If the source data changes, the preaggregations have to be performed each time, whether preaggregated in the fact table or in the OLAP cubes that have to be rebuilt from the fact table.
Query response time. With properly designed OLAP cubes, the granularity of detail in the fact table has no effect on query response time for queries that do not access detail facts.
Storage requirements. A finer level of granularity in the fact table requires more storage for the fact table and for MOLAP cubes. This is a trade-off against detail availability and choice of OLAP cube storage mode. OLAP cubes tend to be large regardless of the storage type; therefore the storage required to retain fine granularity in the fact table may not be particularly significant when compared to OLAP storage needs.
When designing the data warehouse for OLAP, the user's needs should drive the preaggregation strategy. The fact table should only be preaggregated to the level of granularity below which no user would want to access detail.
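The core trade-off above, pay processing time once so that each query becomes a cheap lookup, can be sketched as follows. The fact table layout (region, month, amount) is invented for illustration.

```python
# Pre-aggregation sketch: compute (region, month) totals once at processing
# time; repeated queries then read the small summary instead of rescanning
# every fact row.

from collections import defaultdict

fact_sales = [
    ("east", "2024-01", 100),
    ("east", "2024-01", 50),
    ("west", "2024-02", 75),
]

def preaggregate(facts):
    """Done once when the cube/summary is processed."""
    totals = defaultdict(int)
    for region, month, amount in facts:
        totals[(region, month)] += amount
    return dict(totals)

aggregates = preaggregate(fact_sales)      # processing time, paid once
answer = aggregates[("east", "2024-01")]   # each query is a dictionary lookup
```

If the source data changes, `preaggregate` must run again, which is exactly the "stability of the data" variable listed above.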
OLAP
Online Analytical Processing (OLAP) systems, in contrast to conventional OLTP systems, are capable of analyzing online a large number of transactions or a large number of data records (ranging from megabytes to terabytes). This data is usually multidimensional, and multidimensionality is the key driver for OLAP technology, which is central to data warehousing. Multidimensional data cannot be processed effectively by a conventional SQL-type DBMS. For complex real-world problems the data is usually multidimensional in nature, and SQL will not be capable of handling it effectively, even if the data can be placed in a conventional relational database in normalized tables.
Integrated OLAP Analytical Capabilities
OLAP is an increasingly popular technology that can dramatically improve business analysis. Historically, OLAP has been characterized by expensive tools, difficult implementation, and inflexible deployment. OLAP Services is a new, fully featured OLAP capability provided as a component of SQL Server 7.0. OLAP Services includes a middle-tier server that allows users to perform sophisticated
analysis on large volumes of data with exceptional results. OLAP Services also includes a client-side cache and calculation engine called Microsoft PivotTable Service, which helps improve performance and reduce network traffic. PivotTable Service allows end users to conduct analyses while disconnected from the network. OLAP Services is a middle-tier OLAP server that simplifies user navigation and helps improve performance for queries against information in the data warehouse. OLAP is a key component of data warehousing, and OLAP Services provides essential functionality for a wide array of applications ranging from reporting to advanced decision support. OLAP functionality within SQL Server 7.0 helps make multidimensional analysis much more affordable and bring the benefits of OLAP to a wider audience, from smaller organizations to groups and individuals within larger corporations. Coupled with the wide variety of tools and applications supporting OLAP applications through Microsoft OLE DB for OLAP, OLAP Services helps increase the number of organizations that have access to sophisticated analytical tools and can help reduce the costs of data warehousing. For more information about Microsoft SQL Server OLAP Services, see SQL Server Books Online.
DTS and OLAP Services offer an attractive and cost-effective solution. Data warehousing and OLAP solutions using DTS and OLAP Services are developed with point-and-click graphical tools that are tightly integrated and easy to use. Furthermore, because the PivotTable service client is using OLE DB, the interface is more open to access by a variety of client applications.
Data Warehousing Components
A data warehouse always consists of a number of components, including: Operational data sources. Design/development tools. Data extraction and transformation tools. Database management system (DBMS). Data access and analysis tools. System management tools. Several years ago, Microsoft recognized the need for a set of technologies that would integrate these components. This led to the creation of the Microsoft Data Warehousing Framework, a roadmap not only for the development of Microsoft products such as SQL Server 7.0, but also for the technologies necessary to integrate products from other vendors.
mainframes, data in spreadsheets, or even data from the Internet, to be consolidated in the data warehouse, and effectively cross-referenced, giving the analysts a better understanding of the business.
Subject-oriented Data
Operational data sources across an organization tend to hold a large amount of data about a variety of business-related functions, such as customer records, product information, and so on. However, most of this information is also interspersed with data that has no relevance to business or executive reporting, and is organized in a way that makes querying the data awkward. The data warehouse organizes only the key business information from operational sources so that it is available for business analysis.
Historical Data
Data in OLTP systems correctly represents the current value at any moment in time. For example, an order-entry application always shows the current value of stock inventory; it does not show the inventory at some time in the past. Querying the stock inventory a moment later may return a different response. However, data stored in a data warehouse is accurate as of some past point in time because the data stored represents historical information. The data stored in a data warehouse typically represents data over a long period of time; perhaps up to ten years or more. OLTP systems often contain only current data, because maintaining large volumes of data used to represent ten years of information in an OLTP system can affect performance. In effect, the data warehouse stores snapshots of a business's operational data generated over a long period of time. Each snapshot is accurate for a specific moment in time and cannot change. This contrasts with an OLTP system where data is always current and can be updated when necessary.
Read-only Data
After data has been moved to the data warehouse successfully, it typically does not change unless the data was incorrect in the first place.
Because the data stored in a data warehouse represents a point in time, it must never be updated. Deletes, inserts, and updates (other than those involved in the data loading process) are not applicable in a
data warehouse. The only operations that occur in a data warehouse, when it has been set up, are loading and querying data.
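The "load and query only" rule can be made concrete with a toy store whose only write path is a bulk load and which refuses in-place changes. The class and its methods are invented for illustration.

```python
# A read-only warehouse table: rows arrive only through load(), queries are
# always allowed, and updates are rejected outright.

class WarehouseTable:
    def __init__(self):
        self._rows = []

    def load(self, rows):
        """The only write path: the periodic bulk data load."""
        self._rows.extend(rows)

    def query(self, predicate):
        return [r for r in self._rows if predicate(r)]

    def update(self, *_):
        raise RuntimeError("warehouse data is read-only once loaded")

t = WarehouseTable()
t.load([{"year": 2023, "sales": 10}, {"year": 2024, "sales": 12}])
recent = t.query(lambda r: r["year"] == 2024)
```

Enforcing this in a real warehouse is usually done with database permissions rather than application code, but the invariant is the same.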
Designing and Building a Data Warehouse
The steps required to build a data warehouse include:
Determining business, user, and technical requirements.
Designing and building the database.
Extracting and loading data into the data warehouse.
Designing and processing aggregations using OLAP tools.
Querying and maintaining the data warehouse and OLAP databases.
Data Granularity
A significant difference between an OLTP or operational system and a data warehouse is the granularity of the data stored. An operational system typically stores data at the lowest level of granularity: the maximum level of detail. However, because the data warehouse contains data representing a long period in time, simply storing all detail data from an operational system can result in an overworked system that takes too long to query. A data warehouse typically stores data in different levels of granularity or summarization, depending on the data requirements of the business. If an enterprise needs data to assist strategic planning, then only highly summarized data is required. The lower the level of granularity of data required by the enterprise, the higher the number of resources (specifically
data storage) required to build the data warehouse. The different levels of summarization, in order of increasing granularity, are:
Current operational data
Historical operational data
Aggregated data
Metadata
Current and historical operational data are taken, unmodified, directly from operational systems. Historical data is operational-level data no longer queried on a regular basis, and is often archived onto secondary storage. Aggregated, or summary, data is a filtered version of the current operational data. The design of the data warehouse affects how the current data is aggregated. Considerations for generating summary data include the period of time used to aggregate the data (for example, weekly, monthly, and so on), and the parts of the operational data to be summarized. For example, an organization can choose to aggregate at the part level the quantity of parts sold per sales representative per week. There may be several levels of summary data. It may be necessary to create summary-level data based on an aggregated version of existing summary data. This can give an organization an even higher-level view of the business. For example, an organization can choose to aggregate summary-level data further by generating the quantity of parts sold per month. Metadata does not contain any operational data, but is used to document the way the data warehouse is constructed. Metadata can describe the structure of the data warehouse, the source of the data, the rules used to summarize the data at each level, and any transformations of the data from the operational systems.
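The two summary levels from the example above, weekly quantities per sales representative, then a coarser monthly total built from the weekly summary, can be sketched directly. The rep names and figures are made up.

```python
# Detail -> weekly summary per rep -> monthly summary built from the
# weekly level, illustrating stacked levels of summarization.

from collections import defaultdict

detail = [  # (rep, week, month, qty): current operational data
    ("ann", "W1", "Jan", 5),
    ("ann", "W2", "Jan", 3),
    ("bob", "W1", "Jan", 7),
]

weekly = defaultdict(int)  # first summary level: per rep per week
for rep, week, month, qty in detail:
    weekly[(rep, week, month)] += qty

monthly = defaultdict(int)  # second level, derived from the first
for (rep, week, month), qty in weekly.items():
    monthly[month] += qty
```

Note the monthly figure is computed from the weekly summary, not from the detail rows, which is exactly the "summary of a summary" pattern described.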
the process, which ensures that the data being viewed by the OLAP user is correct, consistent, and matches organizational definitions for the data. Increasingly, information in a data warehouse is organized in star (or snowflake) schemas, which simplify user understanding of the data, maximize performance for decision support applications, and require less storage for large databases. The following illustration is an example of a star schema. In this database schema, a central fact table is linked to related dimension tables. A star (snowflake) schema is a relational approximation of the OLAP data model and can be an excellent starting point for building OLAP cube definitions. Few OLAP products, however, have taken advantage of this trend. Generally, they have not provided easy tools to map a star schema to an OLAP model, and as a result keep the cost of building OLAP models extremely high and the development time unnecessarily long.
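A minimal star schema of the kind described, a central fact table holding keys and measures, joined outward to dimension tables for labels, looks like this. All table and column names are illustrative.

```python
# Star schema sketch: the fact table stores dimension keys plus measures;
# dimension tables carry the descriptive attributes users query by.

dim_product = {1: {"name": "widget", "category": "tools"}}
dim_store = {10: {"city": "Austin"}}

fact_sales = [  # (product_key, store_key, units, revenue)
    (1, 10, 3, 29.97),
]

def resolve(fact):
    """Join one fact row to its dimensions, as a query tool would."""
    product_key, store_key, units, revenue = fact
    return {
        "product": dim_product[product_key]["name"],
        "city": dim_store[store_key]["city"],
        "units": units,
        "revenue": revenue,
    }

row = resolve(fact_sales[0])
```

Each dimension here maps naturally onto an OLAP cube dimension, which is why the text calls a star schema a relational approximation of the OLAP data model.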
PROCESS FLOW WITHIN A DATA WAREHOUSE
1. Extract and load data.
2. Clean and transform data into a form that can cope with large data volumes.
3. Back up and archive data.
4. Manage queries and direct them to the appropriate data source.
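These four steps can be strung together as one toy pipeline. Every function below is a stand-in for a real warehouse component, and the sample data and cleansing rules are invented.

```python
# The four process-flow steps as a toy pipeline:
# extract -> clean/transform -> backup/archive -> query management.

def extract(source):
    return list(source)

def clean_transform(rows):
    # Example rules: normalize item names, drop rows missing an amount.
    return [{"item": r["item"].lower(), "amount": r["amount"]}
            for r in rows if r.get("amount") is not None]

def backup(rows, archive):
    archive.extend(rows)  # stand-in for the backup/archive step
    return rows

def route_query(query, marts):
    # Query management: direct the query to the appropriate data source.
    return marts[query["subject"]]

source = [{"item": "Bolt", "amount": 4}, {"item": "Nut", "amount": None}]
archive = []
loaded = backup(clean_transform(extract(source)), archive)
mart = route_query({"subject": "sales"}, {"sales": loaded})
```

In practice each stage is a separate tool (an extraction utility, a transformation service, a backup facility, a query manager); the point here is only the order of the flow.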
DATA WAREHOUSING AND ONLINE ANALYTICAL PROCESSING
A data warehouse is often used as the basis for a decision-support system (also referred to from an analytical perspective as a business intelligence system). It is designed to overcome some of the problems encountered when an organization attempts to perform strategic analysis using the same database that is used to perform online transaction processing (OLTP). A typical OLTP system is characterized by having large numbers of concurrent users actively adding and modifying data. The database represents the state of a particular business function at a specific point in time, such as an airline reservation system. However, the large volume of data maintained in many OLTP systems can overwhelm an organization. As databases grow larger with more complex data, response time can deteriorate quickly due to competition for available resources. A typical OLTP system has many users adding new data to the database while fewer users generate
reports from the database. As the volume of data increases, reports take longer to generate. As organizations collect increasing volumes of data by using OLTP database systems, the need to analyze data becomes more acute. Typically, OLTP systems are designed specifically to manage transaction processing and minimize disk storage requirements by a series of related, normalized tables. However, when users need to analyze their data, a myriad of problems often prohibits the data from being used:
Users may not understand the complex relationships among the tables, and therefore cannot generate ad hoc queries.
Application databases may be segmented across multiple servers, making it difficult for users to find the tables in the first place.
Security restrictions may prevent users from accessing the detail data they need.
Database administrators prohibit ad hoc querying of OLTP systems, to prevent analytical users from running queries that could slow down the performance of mission-critical production databases.
By copying an OLTP system to a reporting server on a regularly scheduled basis, an organization can improve response time for reports and queries. Yet a schema optimized for OLTP is often not flexible enough for decision support applications, largely due to the volume of data involved and the complexity of normalized relational tables. For example, each regional sales manager in a company may wish to produce a monthly summary of the sales per region. Because the reporting server contains data at the same level of detail as the OLTP system, the entire month's data is summarized each time the report is generated. The result is longer-running queries that lower user satisfaction. Additionally, many organizations store data in multiple heterogeneous database systems. Reporting is more difficult because data is not only stored in different places, but in different formats. Data warehousing and online analytical processing (OLAP) provide solutions to these problems.
Data warehousing is an approach to storing data in which heterogeneous data sources (typically from multiple OLTP databases) are migrated to a separate homogenous data store. Data warehouses provide these benefits to analytical users: Data is organized to facilitate analytical queries rather than transaction processing.
Differences among data structures across multiple heterogeneous databases can be resolved.
Data transformation rules can be applied to validate and consolidate data when data is moved from the OLTP database into the data warehouse.
Security and performance issues can be resolved without requiring changes in the production systems.
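The transformation-rule idea above can be sketched as two per-source adapters feeding one consolidated shape, with a validation rule applied on the way in. The field names and rules (cents vs. dollars, non-negative amounts) are invented for illustration.

```python
# Consolidating two heterogeneous sources into one warehouse shape,
# with per-source transformation rules and a shared validation rule.

def from_system_a(row):
    # System A stores amounts as decimal strings.
    return {"customer": row["cust_name"].strip(), "amount": float(row["amt"])}

def from_system_b(row):
    # System B stores amounts in integer cents under different field names.
    return {"customer": row["name"].strip(), "amount": row["total_cents"] / 100}

def consolidate(rows_a, rows_b):
    merged = [from_system_a(r) for r in rows_a] + [from_system_b(r) for r in rows_b]
    return [r for r in merged if r["amount"] >= 0]  # validation rule

data = consolidate(
    [{"cust_name": " Acme ", "amt": "10.5"}],
    [{"name": "Globex", "total_cents": 250}],
)
```

Because the adapters run during the move into the warehouse, the production systems themselves never change, which is the benefit the list above is making.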
Sometimes organizations maintain smaller, more topic-oriented data stores called data marts. In contrast to a data warehouse, which typically encapsulates all of an enterprise's analytical data, a data mart is typically a subset of the enterprise data targeted at a smaller set of users or business functions. Whereas a data warehouse or data mart is the data store for analytical data, OLAP is the technology that enables client applications to efficiently access the data. OLAP provides these benefits to analytical users:
Pre-aggregation of frequently queried data, enabling a very fast response time to ad hoc queries.
An intuitive multidimensional data model that makes it easy to select, navigate, and explore the data.
A powerful tool for creating new views of data based upon a rich array of ad hoc calculation functions.
Technology to manage security, client/server query management and data caching, and facilities to optimize system performance based upon user needs.
The terms data warehousing and OLAP are sometimes used interchangeably. However, it is important to understand their differences because each represents a unique set of technologies, administrative issues, and user implications.
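The multidimensional model behind these benefits can be reduced to a dictionary keyed by dimension coordinates. Slicing fixes one dimension; rolling up aggregates dimensions away. The cube contents and dimension names are made up.

```python
# A tiny "cube" keyed by (product, region, quarter), with slice and
# roll-up operations in the OLAP style.

cube = {
    ("widget", "east", "Q1"): 100,
    ("widget", "west", "Q1"): 80,
    ("gadget", "east", "Q1"): 60,
}

def slice_cube(cube, region):
    """Fix the region dimension and return the remaining cells."""
    return {(p, q): v for (p, r, q), v in cube.items() if r == region}

def rollup(cube):
    """Aggregate away all dimensions: the grand total."""
    return sum(cube.values())

east = slice_cube(cube, "east")
total = rollup(cube)
```

A real OLAP server precomputes and stores many such slices and roll-ups, which is why ad hoc queries against it answer so quickly.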
CONCLUSION: Data warehousing with OLAP technology is useful for handling large amounts of data. It is also useful for analyzing past data for forecasting, for graphical analysis, and for handling multidimensional data. It enables fast analysis, in contrast to conventional databases with OLTP.
References:
Sam Anahory and Dennis Murray, Data Warehousing in the Real World
Jake Sturm, Data Warehousing with Microsoft SQL Server
Sima Yazdani and Shirley S. Wong, Data Warehousing with Oracle
www.olapreport.com