DATA MINING
Data mining is the process of extracting useful information from data stored in large databases.
It is a powerful tool that helps organizations retrieve useful information from their available data warehouses.
Data mining can be applied to relational databases, object-oriented databases, data warehouses, structured and unstructured databases, etc.
Data mining is used in numerous areas such as banking, insurance companies, pharmaceutical companies, etc.
Apart from these, data mining can also be used in areas such as production control, customer retention, science exploration, sports, astrology, and Internet Web Surf-Aid.
Market Analysis and Management
Listed below are the various fields of market analysis where data mining is used −
Customer Profiling − Data mining helps determine what kind of people buy what
kind of products.
Target Marketing − Data mining helps to find clusters of model customers who
share the same characteristics such as interests, spending habits, income, etc.
Fraud Detection
Data mining is also used in the fields of credit card services and
telecommunications to detect fraud. For fraudulent telephone calls, it helps to
find the destination of the call, the duration of the call, the time of day or
week, etc. It also analyzes patterns that deviate from expected norms.
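The deviation-based screening described above can be sketched in a few lines of Python. The call durations below are made up for illustration; a call is flagged when its z-score (distance from the mean in standard deviations) exceeds a chosen threshold.

```python
# Sketch of deviation-based fraud screening on hypothetical call durations.
from statistics import mean, stdev

durations = [3.1, 2.8, 3.4, 2.9, 3.0, 47.5, 3.2, 2.7]  # minutes per call

mu = mean(durations)
sigma = stdev(durations)

# A call is "suspicious" if it lies more than 2 standard deviations from the mean.
suspicious = [d for d in durations if abs(d - mu) / sigma > 2]
print(suspicious)  # the 47.5-minute outlier is flagged
```

Real systems use far richer features (destination, time of day, calling pattern), but the principle is the same: model the expected norm, then surface the deviations.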
Knowledge discovery from data (KDD) proceeds through data cleaning, data integration,
data selection, data transformation, data mining, pattern evaluation, and knowledge
presentation. Steps 1 through 4 are different forms of data preprocessing, where data are
prepared for mining. The data mining step may interact with the user or a knowledge
base. The interesting patterns are presented to the user and may be stored as new
knowledge in the knowledge base.
KDD vs Data Mining
KDD − It is a field of computer science that helps to extract useful and previously undiscovered knowledge from large databases by using various tools and theories.
Data Mining − Data mining is one of the important steps in the KDD process. It applies a suitable algorithm, based on the objective of the KDD process, to identify the patterns in the database.
OLAP
Online analytical processing operations make use of background knowledge
regarding the domain of the data being studied to allow the presentation of data at
different levels of abstraction. Such operations accommodate different user
viewpoints. Examples of OLAP operations include drill-down and roll-up, which
allow the user to view the data at differing degrees of summarization.
OLAP (Online Analytical Processing) provides support for a multidimensional
view of the data.
It gives users easy and efficient access to the various views of the information.
Complex queries are also processed using OLAP. It is easy to analyze information
by processing multidimensional views of the data.
The data warehouse, where an ample amount of historical data is stored, is used
to analyze this information.
OLAP Operations
OLAP techniques are applied to retrieve information from the data warehouse in the
form of multidimensional OLAP databases.
Multidimensional data mining (also called exploratory multidimensional data mining) performs
data mining in multidimensional space in an OLAP style. That is, it allows the exploration of multiple
combinations of dimensions at varying levels of granularity in data mining,
and thus has greater potential for discovering interesting patterns representing knowledge.
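The idea of exploring multiple combinations of dimensions can be sketched in plain Python: each combination of dimensions corresponds to one cuboid of a data cube, and computing all of them enumerates every level of summarization. The sales records and dimension names below are made up for illustration.

```python
# Sketch of multidimensional exploration: aggregate a fact list over every
# combination of dimensions (each combination is one cuboid of the data cube).
from itertools import combinations
from collections import defaultdict

sales = [  # (city, item, quarter, amount) -- hypothetical facts
    ("Delhi",  "computer", "Q1", 400),
    ("Delhi",  "printer",  "Q1", 150),
    ("Mumbai", "computer", "Q1", 500),
    ("Mumbai", "computer", "Q2", 350),
]
dims = ("city", "item", "quarter")

cuboids = {}
for r in range(len(dims) + 1):               # r = 0 gives the apex cuboid
    for combo in combinations(range(len(dims)), r):
        agg = defaultdict(int)
        for rec in sales:
            key = tuple(rec[i] for i in combo)
            agg[key] += rec[3]               # SUM measure
        cuboids[tuple(dims[i] for i in combo)] = dict(agg)

print(cuboids[()])          # apex cuboid: grand total
print(cuboids[("city",)])   # rolled up to the city level
```

With three dimensions there are 2^3 = 8 cuboids; mining can then look for interesting patterns at any of these granularities.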
Roll-up or consolidation − The roll-up operation performs aggregation on a data cube,
either by climbing up a concept hierarchy for a dimension or by dimension reduction.
Descriptive Function
The descriptive function deals with the general properties of data in the
database. Here is the list of descriptive functions −
Class/Concept Description
Mining of Associations
Mining of Correlations
Mining of Clusters
I. Class/Concept Description
Class/Concept refers to the data to be associated with the classes or
concepts. For example, in a company, the classes of items for sales
include computer and printers, and concepts of customers include big
spenders and budget spenders. Such descriptions of a class or a concept
are called class/concept descriptions. These descriptions can be derived
in the following two ways −
Data Characterization − This is a summarization of the general characteristics
or features of a target class of data. The data corresponding to the user-
specified class are typically collected by a query.
Simple data summaries can be based on statistical measures. The data cube-based
OLAP roll-up operation (Section 1.3.2) can be used to perform user-controlled
data summarization along a specified dimension. An attribute-oriented
induction technique can be used to perform data generalization and
characterization without step-by-step user interaction.
The output of data characterization can be presented in various forms.
Examples include pie charts, bar charts, curves, multidimensional data
cubes, and multidimensional tables, including crosstabs. The resulting
descriptions can also be presented as generalized relations or in rule form
(called characteristic rules).
Data Discrimination − This is a comparison of the general features of the target
class data objects against the general features of objects from one or multiple
contrasting classes.
II. Mining of Associations
Association analysis identifies patterns or items that frequently occur together
in transactional data; for example, in retail sales, customers who buy computers
may also tend to buy printers.
III. Mining of Correlations
Correlation analysis performs additional statistical tests to uncover statistical
correlations between associated attribute–value pairs.
IV. Mining of Clusters
Cluster refers to a group of similar objects. Cluster analysis refers
to forming groups of objects that are very similar to each other but are
highly different from the objects in other clusters.
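A minimal clustering sketch in plain Python makes the idea concrete. This is a tiny one-dimensional k-means loop; the points and initial centers are made up for illustration, and real cluster analysis works on many attributes at once.

```python
# Minimal 1-D k-means sketch: points in a cluster end up close to each other
# and far from points in other clusters. Data and centers are hypothetical.
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

points = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
groups = kmeans_1d(points, centers=[0.0, 5.0])
print(groups)  # two well-separated groups
```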
PREDICTIVE FUNCTIONS
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts. The model is derived based on the
analysis of a set of training data (i.e., data objects for which the class labels are
known). The model is then used to predict the class label of objects for which the
class label is unknown.
The derived model can be presented in the following forms −
Decision Trees
Mathematical Formulae
Neural Networks
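A derived model in rule form (the kind a decision tree yields) can be sketched as follows. The customer attributes, thresholds, and class labels are all hypothetical, standing in for what a real tree would learn from training data.

```python
# Sketch of a classification model expressed as IF-THEN rules, as a decision
# tree might produce. Attributes and thresholds are made up for illustration.
def classify(customer):
    # Hypothetical rules "learned" from labeled training data:
    if customer["income"] >= 50000 and customer["visits"] > 10:
        return "big_spender"
    return "budget_spender"

# Objects whose class label is unknown:
unlabeled = [
    {"income": 80000, "visits": 15},
    {"income": 30000, "visits": 20},
]
labels = [classify(c) for c in unlabeled]
print(labels)  # predicted class labels
```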
REGRESSION
Whereas classification predicts categorical (discrete, unordered) labels, regression
models continuous-valued functions. That is, regression is used to predict missing or
unavailable numerical data values rather than (discrete) class labels. The term
prediction refers to both numeric prediction and class label prediction.
Regression analysis is a statistical methodology that is most often used for numeric
prediction, although other methods exist as well. Regression also encompasses the
identification of distribution trends based on the available data.
Classification and regression may need to be preceded by relevance analysis,
which attempts to identify attributes that are significantly relevant to the classification
and regression process. Such attributes will be selected for the classification and
regression process. Other attributes, which are irrelevant, can then be excluded from
consideration.
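Numeric prediction by regression can be sketched with ordinary least squares on a single attribute. The data points below are made up for illustration; the fitted line is then used to predict an unavailable numeric value.

```python
# Sketch of numeric prediction: simple linear regression (ordinary least
# squares) on one attribute. Data values are hypothetical.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    slope = num / den
    intercept = my - slope * mx
    return slope, intercept

years_experience = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]          # in thousands

m, b = fit_line(years_experience, salary)
predicted = m * 6 + b                   # predict a missing numeric value
print(m, b, predicted)                  # 5.0 25.0 55.0
```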
Data Mining Task Primitives
Task-relevant data − the set of database attributes or data warehouse dimensions of interest.
Kind of knowledge to be mined − the function to be performed, such as:
Characterization
Discrimination
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Background knowledge
The background knowledge allows data to be mined at multiple levels of
abstraction. For example, the Concept hierarchies are one of the
background knowledge that allows data to be mined at multiple levels of
abstraction.
Presentation and visualization of discovered patterns − the discovered patterns may be presented in forms such as:
Tables
Charts
Graphs
Decision Trees
Cubes
Mining Methodology and User Interaction Issues
Data mining query languages and ad hoc data mining − A data mining query
language that allows the user to describe ad hoc mining tasks should be integrated
with a data warehouse query language and optimized for efficient and flexible data
mining.
Presentation and visualization of data mining results − Once the patterns are
discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
Handling noisy or incomplete data − Data cleaning methods are required to
handle noise and incomplete objects while mining data regularities. Without
data cleaning methods, the accuracy of the discovered patterns will be poor.
Performance Issues
There can be performance-related issues such as the following −
Efficiency and scalability of data mining algorithms − In order to effectively
extract information from the huge amounts of data in databases, data mining
algorithms must be efficient and scalable.
DATA WAREHOUSE
A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a single site. Data
warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.
The key characteristics of a data warehouse are as follows −
Subject Oriented − A data warehouse is organized around major subjects such as
customers, suppliers, products, and sales, rather than around day-to-day operations.
Integrated − A data warehouse is constructed by integrating data from multiple
heterogeneous sources, such as relational databases and flat files.
Time Variant − The data collected in a data warehouse is identified with a particular
time period. The data in a data warehouse provides information from a historical
point of view.
Non-volatile − Nonvolatile means the previous data is not removed when new data
is added. The data warehouse is kept separate from the operational database,
so frequent changes in the operational database are not reflected in the data
warehouse.
Data Warehousing
Data warehousing is the process of constructing and using the data
warehouse. A data warehouse is constructed by integrating the data from
multiple heterogeneous sources. It supports analytical reporting,
structured and/or ad hoc queries, and decision making.
Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases.
This approach is used to build wrappers and integrators on top of
multiple heterogeneous databases. These integrators are also known as
mediators.
When a query is issued to a client side, a metadata dictionary translates it into a form
appropriate for the individual heterogeneous sites involved. These queries are then
mapped and sent to the local query processors.
The results from the heterogeneous sites are integrated into a global answer set.
Disadvantages
This approach has the following disadvantages −
The query-driven approach requires complex integration and filtering processes.
It is very inefficient and very expensive for frequent queries, particularly queries
requiring aggregations.
Update-Driven Approach
Today's data warehouse systems follow update-driven approach rather
than the traditional approach discussed earlier. In the update-driven
approach, the information from multiple heterogeneous sources is
integrated in advance and stored in a warehouse. This information is
available for direct querying and analysis.
Advantages
This approach has the following advantages −
This approach provides high performance, since the data are copied, processed,
integrated, annotated, summarized, and restructured into the warehouse in advance,
and query processing does not require an interface to process data at the local sources.
Online selection of data mining functions − Integrating OLAP with multiple data
mining functions and online analytical mining provides users with the flexibility to
select desired data mining functions and swap data mining tasks dynamically.
Three-Tier Data Warehouse Architecture
1. The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources (e.g., customer profile information provided by external
consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to
merge similar data from different sources into a unified format), as well as load and refresh functions
to update the data warehouse.
The data are extracted using application program interfaces known as gateways. A gateway is
supported by the underlying DBMS and allows client programs to generate SQL code to be executed
at a server.
This tier also contains a metadata repository, which stores information about the data warehouse and
its contents.
2. The middle tier is an OLAP server that is typically implemented using either (1) a relational
OLAP(ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional
data to standard relational operations); or (2) a multidimensional OLAP (MOLAP) model (i.e., a
special-purpose server that directly implements multidimensional data and operations).
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data cube
A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts.
Star schema: The most common modeling paradigm is the star schema, in which
the data warehouse contains (1) a large central table (fact table) containing the bulk
of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension
tables), one for each dimension. The schema graph resembles a starburst, with the
dimension tables displayed in a radial pattern around the central fact table.
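The star schema above can be sketched concretely with SQLite (available in the Python standard library). The table names, columns, and rows below are made up for illustration: one central fact table holds the measures, and each dimension table is joined to it radially.

```python
# Sketch of a star schema in SQLite: a central fact table plus dimension
# tables, queried with a join. Names and data are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE fact_sales (                  -- central fact table
    item_key INTEGER REFERENCES dim_item(item_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    dollars_sold REAL                      -- the measure
);
INSERT INTO dim_item VALUES (1, 'computer'), (2, 'printer');
INSERT INTO dim_time VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO fact_sales VALUES (1, 1, 400), (2, 1, 150), (1, 2, 350);
""")

# Total dollars sold per item: the fact table joined to one dimension table.
rows = cur.execute("""
    SELECT i.item_name, SUM(f.dollars_sold)
    FROM fact_sales f JOIN dim_item i ON f.item_key = i.item_key
    GROUP BY i.item_name ORDER BY i.item_name
""").fetchall()
print(rows)
```

Note how the fact table stores only foreign keys and measures; the descriptive attributes live once in the dimension tables, which is the redundancy-avoiding property the star schema is designed for.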
Snowflake schema: The snowflake schema is a variant of the star schema model,
where some dimension tables are normalized, thereby further splitting the data into
additional tables. The resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to reduce
redundancies. Such a table is easy to maintain and saves storage space. However,
this space savings is negligible in comparison to the typical magnitude of the fact
table. Furthermore, the snowflake structure can reduce the effectiveness of browsing,
since more joins will be needed to execute a query. Consequently, system
performance may be adversely impacted. Hence, although the snowflake schema
reduces redundancy, it is not as popular as the star schema in data warehouse
design.
MEASURES
A multidimensional point in the data cube space can be defined by a set of dimension–value
pairs. A data cube measure is a numeric function that can be evaluated at each point in the
data cube space. A measure value is computed for a given point by aggregating the data
corresponding to the respective dimension–value pairs defining the given point. We will look
at concrete examples of this shortly.
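As a concrete sketch, a SUM measure evaluated at one cube point can be computed by aggregating exactly the facts that match the point's dimension–value pairs. The fact records below are made up for illustration.

```python
# Sketch of evaluating a measure at a point in cube space: aggregate the
# facts whose dimension values match the point's dimension-value pairs.
facts = [  # hypothetical fact records
    {"city": "Delhi",  "item": "computer", "quarter": "Q1", "amount": 400},
    {"city": "Delhi",  "item": "printer",  "quarter": "Q1", "amount": 150},
    {"city": "Mumbai", "item": "computer", "quarter": "Q1", "amount": 500},
]

def measure_at(point, facts):
    """SUM measure at the cube point given by dimension-value pairs."""
    return sum(f["amount"] for f in facts
               if all(f[dim] == val for dim, val in point.items()))

# Point defined by the pairs city="Delhi", quarter="Q1":
total = measure_at({"city": "Delhi", "quarter": "Q1"}, facts)
print(total)  # 400 + 150
```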
Roll-up: The roll-up operation (also called drill-up) performs aggregation on a data
cube, either by climbing up a concept hierarchy for a dimension or by dimension
reduction.
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more
detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a
dimension or introducing additional dimensions. Because a drill-down adds more detail to the
given data, it can also be performed by adding new dimensions to a cube.
Slice and dice: The slice operation performs a selection on one dimension of the given cube,
resulting in a subcube. The dice operation defines a subcube by performing a selection on
two or more dimensions.
Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes
in view to provide an alternative data presentation.
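The slice and dice operations can be sketched on a cube held as a plain dictionary keyed by dimension tuples. The cells below are made up for illustration; a slice selects on one dimension, a dice on two or more.

```python
# Sketch of slice and dice on a cube stored as {(city, item, quarter): amount}.
cube = {  # hypothetical cells
    ("Delhi",  "computer", "Q1"): 400,
    ("Delhi",  "printer",  "Q1"): 150,
    ("Mumbai", "computer", "Q1"): 500,
    ("Mumbai", "computer", "Q2"): 350,
}

# Slice: selection on ONE dimension (quarter = "Q1") -> a subcube.
slice_q1 = {k: v for k, v in cube.items() if k[2] == "Q1"}

# Dice: selection on TWO OR MORE dimensions (city = "Delhi" AND quarter = "Q1").
dice = {k: v for k, v in cube.items() if k[0] == "Delhi" and k[2] == "Q1"}

print(len(slice_q1), sum(dice.values()))
```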
Other OLAP operations: Some OLAP systems offer additional drilling operations.
Drill-across executes queries involving (i.e., across) more than one fact table.
The drill-through operation uses relational SQL facilities to drill through the bottom
level of a data cube down to its back-end relational tables.
Other OLAP operations may include ranking the top N or bottom N items in lists, as
well as computing moving averages, growth rates, interest, internal rates of return,
depreciation, currency conversions, and statistical functions.
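One of these operations, the moving average, can be sketched as a sliding window over a measure. The monthly values below are made up for illustration.

```python
# Sketch of a moving average over a measure, computed with a sliding window.
def moving_average(values, window):
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

monthly_sales = [100, 120, 110, 130, 150, 140]  # hypothetical measure values
print(moving_average(monthly_sales, window=3))  # [110.0, 120.0, 130.0, 140.0]
```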