One may claim that the exponential growth in the amount of data provides great
opportunities for data mining. In many real world applications, the number of sources over
which this information is fragmented grows at an even faster rate, resulting in barriers to
widespread application of data mining. A data warehouse is designed especially for decision
support queries.
Data warehousing is the process of extracting and transforming operational data into
The idea behind data mining, then, is the “nontrivial process of identifying valid,
Data mining is concerned with the analysis of data and the use of software techniques
for finding patterns and regularities in sets of data. The potential of data mining can be enhanced if
the appropriate data has been collected and stored in a data warehouse.
Data warehousing provides the means to change raw data into information for
making effective business decisions; the emphasis is on information, not data. The data
This paper also explains the partition algorithm for discovering all requirement sets from the
data warehouse using data mining. It also explains the relation between operational data,
related to all aspects of their business. But locked up in a variety of systems, most of this data is
extremely difficult to access. Only a very small part of the data captured, processed and stored is
INTRODUCTION
the key pieces of information used to manage and direct the business for the most
profitable outcome.
competitive environment. And this kind of information can be available only if there’s a totally
analysis. For such a repository, data and information are extracted from heterogeneous sources
and consolidated in a single source. This makes it much easier and more efficient to query the data.
planning). Information systems analyze the data to make decisions on how the enterprise will
operate. Not only do information systems have a different focus from operational ones, they often
There are some specific rules that govern the basic warehouse, namely that such a
Time dependent: that is, containing information collected over time, which implies there must
always be a connection between the information in the warehouse and the time when it was entered.
This is one of the most important aspects of the warehouse as it relates to data mining, because
Non-volatile: that is, data in a data warehouse is never updated but used only for queries. Thus
such data can only be loaded from other databases such as the operational database. End users who want
to update data must use the operational databases, as only the latter can be updated, changed and
deleted. This means that a data warehouse will always be filled with historical data.
Subject oriented: that is, built around all the existing applications of the operational data. Not all the
information in the operational database is useful for a data warehouse, since the data warehouse is
designed specially for decision support while the operational database contains information for
day-to-day use.
Integrated: that is, it reflects the business information of the organization. In an operational data
environment you will find many types of information being used in a variety of applications, and
some applications will be using different names for the same entities. However, in a data warehouse
it is essential to integrate this information and make it consistent; only one name must exist to
A data warehouse is designed especially for decision support queries; therefore only
the data needed for decision support is extracted from the operational data and stored in the
warehouse.
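The properties described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the `CANONICAL_NAMES` mapping and the `Warehouse` class are hypothetical and are not taken from any particular product. Operational rows arrive with application-specific field names, are renamed to a single warehouse name (integrated), keep only the fields needed for decision support (subject oriented), are stamped with a load time (time dependent) and are never modified once stored (non-volatile).

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical mapping from application-specific field names to the
# single name used in the warehouse (the "integrated" property).
CANONICAL_NAMES = {
    "cust_id": "customer_id", "client_no": "customer_id",
    "amt": "amount", "total": "amount",
}


@dataclass(frozen=True)  # frozen: a stored fact cannot be changed afterwards
class Fact:
    customer_id: str
    amount: float
    loaded_at: datetime  # every row carries its load time ("time dependent")


class Warehouse:
    """Append-only store: rows are loaded and queried, never updated."""

    def __init__(self):
        self._facts = []

    def load(self, operational_row, loaded_at=None):
        # Keep only the fields that have a warehouse name; anything else is
        # day-to-day operational detail and is dropped ("subject oriented").
        renamed = {CANONICAL_NAMES[key]: value
                   for key, value in operational_row.items()
                   if key in CANONICAL_NAMES}
        fact = Fact(str(renamed["customer_id"]),
                    float(renamed["amount"]),
                    loaded_at or datetime.now(timezone.utc))
        self._facts.append(fact)
        return fact

    def total_by_customer(self):
        # A typical decision-support query: read-only aggregation.
        totals = {}
        for fact in self._facts:
            totals[fact.customer_id] = totals.get(fact.customer_id, 0.0) + fact.amount
        return totals
```

Two applications that call the same customer `cust_id` and `client_no` end up under one `customer_id` in the warehouse, and any attempt to assign to a field of a stored `Fact` raises an error, so corrections have to be made in the operational database and reloaded.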
Users
From the definition we can infer that the data warehouse users are as follows:
1. This person’s job involves drawing conclusions from, and making decisions
2. This person doesn’t want to get involved with finding and organizing the
3. This person also doesn’t want to access a database in a highly technical fashion.
Data warehousing is one of the hottest industry trends for good reason. The
• Data marts
Physical data marts are those in which all the data for the data warehouse are stored, along
with meta data and the processing logic for scrubbing, organizing, packaging and processing the
detail data.
Logical data marts contain the same meta data as physical ones but do not contain the actual data.
Instead, they contain the information necessary to access the data wherever it resides.
DATA MARTS
overlapping data. The task of implementing a data warehouse can be a very big effort,
taking a significant amount of time. One feasible option is to start with a set of data
marts, one for each of the component departments. One can have a stand-alone data mart or
a dependent data mart. Such smaller, more manageable databases are called data marts.
Stand-alone data mart - a data mart with minimal or no impact on the enterprise
operational databases.
Dependent data mart - similar to a stand-alone data mart, except that
management of the data sources by the enterprise database is required. These data sources include
DATA WAREHOUSE ARCHITECTURE
The architecture of an information system refers to the way its pieces are laid out,
what types of tasks are allocated to each piece, how the pieces interact with each other and how
they interact with the outside world. The architecture of a data warehouse is shown in the figure.
[Figure: data warehouse architecture. Labels: operational data, external data, load manager, manager (detailed information, summary info, meta data), query manager, data dippers, OLAP.]
1. Load Manager
2. Warehouse manager
3. Query manager
Load Manager
• Perform simple transformation into a structure similar to the one in the data
warehouse.
Warehouse Manager
Query Manager
monitoring tools, native database facilities, bespoke coding, C programs and shell
scripts.
DATA EXTRACTION
DATA CLEANING
- which detects errors in the data and rectifies them when possible.
DATA TRANSFORMATION
- which converts data from legacy or host format to data warehouse format.
LOADING
- which sorts, summarizes, consolidates, computes views, checks integrity and
REFRESH
- which propagates the updates from the data sources to the
warehouse.
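The steps listed above (extraction, cleaning, transformation, loading and refresh) can be sketched as a small pipeline. This is a minimal illustration, not a description of any real tool: the function names, the row layout with `region` and `sales` fields, and the choice of a per-region total as the warehouse summary are all assumptions made for the example.

```python
from collections import defaultdict


def extract(sources):
    """Gather raw rows from several (hypothetical) source lists."""
    return [row for source in sources for row in source]


def clean(rows):
    """Detect errors and rectify them when possible; drop rows that cannot be fixed."""
    cleaned = []
    for row in rows:
        # Rectifiable: stray whitespace and inconsistent letter case.
        region = str(row.get("region", "")).strip().upper()
        try:
            sales = float(row.get("sales"))
        except (TypeError, ValueError):
            continue  # not rectifiable: no usable sales figure
        if region and sales >= 0:
            cleaned.append({"region": region, "sales": sales})
    return cleaned


def transform(rows):
    """Convert source rows to the assumed warehouse format: (region, sales) tuples."""
    return [(row["region"], row["sales"]) for row in rows]


def load(warehouse, records):
    """Sort the records and summarize them into the warehouse table."""
    for region, sales in sorted(records):
        warehouse[region] += sales
    return warehouse


def refresh(warehouse, new_source_rows):
    """Propagate updates from a data source to the warehouse."""
    return load(warehouse, transform(clean(extract([new_source_rows]))))


# Usage with two hypothetical operational sources.
warehouse = defaultdict(float)
source_a = [{"region": " north ", "sales": "10"}, {"region": "south", "sales": "oops"}]
source_b = [{"region": "North", "sales": 5}, {"region": "", "sales": 3}]
load(warehouse, transform(clean(extract([source_a, source_b]))))
refresh(warehouse, [{"region": "south", "sales": 2.5}])
```

Keeping each step as a separate function mirrors the separation in the text: cleaning can reject or repair rows without knowing the warehouse format, and refresh simply reruns the same steps on newly arrived source rows.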
the data model consisting of the data needed by users who want access at high speed, and so the data
In a data warehouse, an end user may want to make joins across many tables,
and this can place tremendous demands on the system. For that reason, the data warehouse
META DATA
In setting up a data warehouse, the end user and the administrator must have access
to all the information about the tables and attributes. They will want to know a number of things,
such as where the data is located, what data exists, what data type or format it is in, how this
data relates to other data in other databases, where the data comes from and to whom the data belongs.
For these reasons, another database containing the so-called meta data is needed, which
People always have some sort of meta data in their heads or in their files.
Computer based meta data for people to use:
Data warehouse developers often store the descriptive data in its own. This provides a
If the meta data items for people to use are stored in a well-structured, computer-readable
form, they can be read by a DBMS. This smooths the interaction between users and the warehouse.
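To make the idea of structured, computer-readable meta data concrete, here is a small sketch of a meta data catalog. The class and field names (`ColumnMetadata`, `MetadataCatalog`, `location`, `owner` and so on) are hypothetical; they simply record the questions listed earlier in this section: where the data is located, what type it has, where it comes from, how it relates to other data and to whom it belongs.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMetadata:
    table: str
    column: str
    data_type: str   # what data type or format the data is in
    location: str    # where the data is located
    source: str      # where the data comes from
    owner: str       # to whom the data belongs
    related_to: list = field(default_factory=list)  # related data in other databases


class MetadataCatalog:
    """A separate store that describes the warehouse tables and attributes."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[(entry.table, entry.column)] = entry

    def describe(self, table, column):
        return self._entries[(table, column)]

    def columns_owned_by(self, owner):
        return sorted((e.table, e.column)
                      for e in self._entries.values() if e.owner == owner)


# Usage with hypothetical entries.
catalog = MetadataCatalog()
catalog.register(ColumnMetadata("sales", "amount", "decimal", "warehouse.sales",
                                "order entry system", "finance"))
catalog.register(ColumnMetadata("sales", "customer_id", "string", "warehouse.sales",
                                "customer master file", "marketing",
                                related_to=["crm.customers.id"]))
```

Because the entries are plain structured records, both people and programs can query them, for example to list every column a department is responsible for.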
BACK FLUSHING
[Figure: back flushing. Labels: data warehouse, data mining, other data inputs / new data.]
• Data must be formatted for consistency within the warehouse. Names, meaning and
• The data must be cleaned to ensure validity. For input data, cleaning must occur before
• Recognizing erroneous and incomplete data is difficult to automate, and cleaning that
requires automatic error correction can be even tougher. They will likely want to
upgrade their data with the cleaned data. The process of returning cleaned data to the
must be installed in the data model of the warehouse. Data may have to be converted from relational,
• The data must be loaded into the data warehouse. The sheer volume of data in the warehouse
Two basic techniques are used to build a data warehouse, known as the ‘top down’ and ‘bottom up’
approaches. In the ‘top down’ approach, we first build a data warehouse and from it select
the information needed to design a data mart. In the ‘bottom up’ approach, data marts are designed first
The relationship between operational data, a data warehouse and data marts
[Figure: extract from several databases]
functionality, efficient query processing, structured queries, ad hoc queries, data mining and
materialized views. In particular, enhanced spreadsheet functionality includes support for state of
derived values.
PARTITION ALGORITHM TO DISCOVER ALL REQUIREMENT SETS FROM THE
implicit, previously unknown and potentially useful information from the data. This
dependency networks, classification, analyzing changes, and detecting anomalies. Data mining
searches for the relationships and global patterns that exist in large databases but are hidden
among vast amounts of data, such as the relationship between patient data and medical diagnosis. The
relationship represents valuable knowledge about the database and the objects in the database, if
the database is a faithful mirror of the real world registered by the database. It refers to using a
database and extracting these in such a way that they can be put to use in areas such as
between items in a database of customer transactions. The market basket analysis technique is used to
group items together. A rule may contain more than one item in the antecedent and the
consequent of the rule. In this paper, we concentrate on finding associations, but with a different
slant, i.e. by using the partition algorithm. In the next section, we review the basic concepts of
association rules.
BASICS
X. Support can also be defined as fractional support, which means the proportion of
if τ% of the transactions in D that support X also support Y. The rule X ⇒ Y has support σ in the
Each rule has a left hand side and a right hand side. The left hand side is also called the
antecedent and the right hand side is also called the consequent. In general,
both the left hand side and the right hand side can contain multiple items. Confidence (or
predictability) measures how much a particular item is dependent on another. Support does not
depend on the direction (or implication) of the rule; it is only dependent on the set of items in
the rule.
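As a small illustration of the two measures, the sketch below computes support and confidence over a list of transactions. The function names and the market-basket data are made up for the example; support is taken as the fraction of transactions containing all items of the rule, and confidence as the fraction of transactions containing the antecedent that also contain the consequent.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= frozenset(t))


def support(transactions, itemset):
    """Fraction of all transactions that contain `itemset`."""
    if not transactions:
        return 0.0
    return count(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also
    contain the consequent."""
    antecedent_count = count(transactions, antecedent)
    if antecedent_count == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / antecedent_count


# Hypothetical market-basket data: five customer transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"bread"},
    {"milk"},
]

# Support depends only on the set of items in the rule, not on its direction.
rule_support = support(baskets, {"bread", "milk"})      # 2 of 5 transactions

# Confidence depends on the direction of the rule.
bread_to_milk = confidence(baskets, {"bread"}, {"milk"})  # 2 of the 4 bread baskets
milk_to_bread = confidence(baskets, {"milk"}, {"bread"})  # 2 of the 3 milk baskets
```

In this data the rules bread ⇒ milk and milk ⇒ bread have the same support but different confidence, which is the asymmetry described in the paragraph above.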
The discovery of association rules is the most well-studied problem in data
mining. Many interesting algorithms have been proposed recently, and we shall discuss the
partition algorithm for finding associations. The features of any efficient algorithm are (a)
to reduce the I/O operations, and (b) at the same time to be efficient in computing.
PARTITION ALGORITHM
The partition algorithm is based on the observation that the frequent sets are normally
very few in number compared to the set of all item sets. The partition algorithm uses two
scans of the database to discover all frequent sets. In the first scan, it generates a set of all
potentially frequent item sets. This set is a superset of all frequent item sets, i.e. it may
contain false positives. The second scan measures the actual support of these candidates. The
algorithm executes in two
phases. In the first phase, the partition algorithm logically divides the database into a number of
non-overlapping partitions. The partitions are considered one at a time and all frequent item sets
read_in_partition(Ti in P)
end
for (k = 2; L_k^i ≠ ∅, i = 1, 2, ..., n; k++) do begin // Merge Phase
for i = 1 to n do begin
read_in_partition(Ti in P) // Phase 2
L^G = { c ∈ C^G | s(c) ≥ σ } // s(c): support of c; σ: minimum support
Answer = L^G
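The two phases can be sketched in Python as follows. This is a minimal in-memory illustration under several assumptions, not the exact procedure of the pseudocode: the function names are hypothetical, partitions are slices of a list rather than blocks read from disk, the local frequent sets are found with a simple level-wise search, and the local support threshold is the same fraction as the global one.

```python
from itertools import combinations


def local_frequent_itemsets(partition, min_support):
    """Level-wise search for all item sets frequent within one partition."""
    items = sorted({item for t in partition for item in t})
    frequent = set()
    level = [frozenset([item]) for item in items]
    while level:
        survivors = [c for c in level
                     if sum(1 for t in partition if c <= t) / len(partition) >= min_support]
        frequent.update(survivors)
        # Join survivors that differ in exactly one item to get the next level.
        level = list({a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == len(a) + 1})
    return frequent


def partition_frequent_itemsets(transactions, min_support, n_partitions):
    """Find all globally frequent item sets with two passes over the data."""
    transactions = [frozenset(t) for t in transactions]
    if not transactions:
        return set()
    size = -(-len(transactions) // n_partitions)  # ceiling division
    partitions = [transactions[i:i + size]
                  for i in range(0, len(transactions), size)]

    # Phase 1 and merge: the union of the local frequent sets is the global
    # candidate set. It can contain false positives but misses nothing.
    candidates = set()
    for part in partitions:
        candidates |= local_frequent_itemsets(part, min_support)

    # Phase 2: one more pass to count each candidate over all partitions.
    counts = dict.fromkeys(candidates, 0)
    for part in partitions:
        for t in part:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

    total = len(transactions)
    return {c for c, k in counts.items() if k / total >= min_support}
```

The candidate set is safe because an item set that reaches the support fraction over the whole database must reach it in at least one partition; otherwise its total count would fall below the threshold. The second pass then removes the candidates that were frequent only locally.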
EXAMPLE:
Let us take the database T, and let us partition, for the sake of illustration, T into
three partitions T1, T2, T3, each containing 5 transactions. The first partition T1 contains
We fix the local support as equal to the given support, that is 20%. Thus, any item set that appears
in just one of the transactions in any partition is a local frequent set in that partition.
A1 A2 A3 A4 A5 A6 A7 A8 A9
1 0 0 0 1 1 0 0 1
0 1 0 1 0 0 0 1 0
0 0 0 1 1 0 1 0 0
0 1 1 0 0 0 0 0 0
0 0 0 0 1 1 1 0 0
0 1 1 1 0 0 0 0 0
0 1 0 0 0 1 1 0 1
0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 1 0 1 0 1 0 0
0 0 0 0 1 1 0 1 0
0 1 0 1 0 1 1 0 0
1 0 1 0 1 0 1 0 0
0 1 1 0 0 0 0 0 1
{2,8},{4,5},{4,7},{4,8},{5,6},{5,7},{5,8},{6,7},{6,8},{1,6,8},{1,5,6},
{1,5,8},{2,4,8},{4,5,7},{5,6,8},{5,6,7},{1,5,6,8}}
Similarly,
L2 = {{2},{3},{4},{5},{6},{7},{8},{9},{2,3},{2,4},{2,7},{2,9},{3,4},{3,5},{3,7},
{5,7},{6,7},{6,9},{7,9},{2,3,4},{2,6,7},{1,5,8},{2,6,9},{2,7,9},{3,5,7},{2,6,7,9}}
{2,6},{2,7},{2,9},{3,5},{3,7},{3,9},{4,6},{4,7},{5,6},{5,7},{5,8},{6,7},{6,8},
{1,3,5},{1,5,7},{2,3,9},{2,4,6},{2,4,7},{3,5,7},{4,6,7},{5,6,8},{2,4,6,7}}
C = L1 ∪ L2 ∪ L3
C = {{1},{2},{3},{4},{5},{6},{7},{8},{9},{1,3},{1,5},{1,6},{1,7},{1,8},{2,3},{2,4},{2,6},
{2,7},{2,8},{2,9},{3,4},{3,5},{3,7},{3,9},{4,5},{4,6},{4,7},{4,8},{5,6},{5,7},{5,8},
{6,7},{6,8},{6,9},{7,9},{1,3,5},{1,3,7},{1,5,6},{1,5,7},{1,5,8},{1,6,8},
{2,3,4},{2,3,9},{2,4,6},{2,4,7},{2,4,8},{2,6,7},{2,6,9},{2,7,9},{3,5,7},{4,5,7},{4,6,7},
{5,6,8},{5,6,7},{1,5,6,8},{2,6,7,9},{1,3,5,7},{2,4,6,7}}
ADVANTAGES
- Data warehouses are free from the restrictions of the transactional environment
- Artificial intelligence techniques, which may include genetic algorithms and neural
networks, are used for classification and are employed to discover knowledge from
analysis in retail
Data mining has many and varied fields of applications such as:
a. Retail/Marketing
b. Banking
c. Medicine
d. Transportation
Together.
A large number of data warehouses can be identified from existing data sources
within the central government ministries. Let us examine potential areas on which data
OTHER SECTORS:
Accounts.
CRITICAL ISSUES
Data warehousing helps businesses make informed decisions. But there are a few
critical issues that must be faced head on while designing and implementing a data
• Capacity planning
• Performance tuning
• Testing
• Implementation obstacle
CONCLUSION:
Data warehousing provides the means to change raw data into information for making
effective business decisions; the emphasis is on information, not data. The data warehouse is the
hub for decision support data. Comprehensive data warehouses that integrate operational data
with customer, supplier, and market information have resulted in an explosion of information.
Competition requires timely and sophisticated analysis on an integrated view of the data.
Data mining tools can enhance the inference process and speed up the design cycle, but cannot be a
substitute for statistical and domain expertise. Data mining allows for the creation of a
self-learning organization.
So the future of data warehouses lies in their accessibility from the internet. Successful
implementation of a data warehouse and data mining requires a high-performance, scalable
combination of hardware and software which can integrate easily with existing systems, so that
customers can use the data warehouse to improve their decision-making and their competitive
advantage.
A good data warehouse provides the RIGHT data…to the RIGHT PEOPLE… at the
RIGHT time… RIGHT now! While data warehousing organizes data for business analysis,