
Data Mining

Definition of Data Mining

Data Mining and Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases toward the top. Layers from top to bottom, with the role shown beside each where one is given:]
❚ Making Decisions (End User)
❚ Data Presentation: Visualization Techniques (Business Analyst)
❚ Data Mining: Information Discovery (Data Analyst)
❚ Data Exploration: Statistical Analysis, Querying and Reporting
❚ Data Warehouses / Data Marts: OLAP (DBA)
❚ Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Data pyramid
❚ Wisdom = Knowledge + experience
❚ Knowledge = Information + rules
❚ Information = Data + context
❚ Data
Related Fields
[Figure: Data Mining and Knowledge Discovery at the center, surrounded by Machine Learning, Visualization, Statistics, and Databases]
Knowledge Discovery Process
[Figure: Raw Data → Integration → Data Warehouse → Selection & Cleaning → Target Data → Transformation → Transformed Data → Data Mining → Patterns and Rules → Interpretation & Evaluation → Knowledge → Understanding]
The Evolution of Data Analysis
❚ Data Collection (1960s)
❙ Business question: "What was my total revenue in the last five years?"
❙ Enabling technologies: computers, tapes, disks
❙ Product providers: IBM, CDC
❙ Characteristics: retrospective, static data delivery
❚ Data Access (1980s)
❙ Business question: "What were unit sales in New England last March?"
❙ Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
❙ Product providers: Oracle, Sybase, Informix, IBM, Microsoft
❙ Characteristics: retrospective, dynamic data delivery at record level
❚ Data Warehousing & Decision Support (1990s)
❙ Business question: "What were unit sales in New England last March? Drill down to Boston."
❙ Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
❙ Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
❙ Characteristics: retrospective, dynamic data delivery at multiple levels
❚ Data Mining (Emerging Today)
❙ Business question: "What’s likely to happen to Boston unit sales next month? Why?"
❙ Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
❙ Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
❙ Characteristics: prospective, proactive information delivery
Need for Data Mining
❚ Data accumulate and double every 9 months
❚ There is a big gap from stored data to
knowledge; and the transition won’t occur
automatically.
❚ Manual data analysis is not new but a bottleneck
❚ Fast developing Computer Science and
Engineering generates new demands
❚ Seeking knowledge from massive data
When is DM useful

❚ Data rich world


❚ Large data
❚ Little knowledge about data
(exploratory data analysis)
Data mining is not

❚ OLAP
❚ DATA WAREHOUSING
❚ Data Visualization
❚ SQL
❚ Ad Hoc Queries
❚ Reporting
Data Mining is…

❚ Predictive Modeling
❚ Linear/Logistic Regression
❚ Neural Networks
❚ Decision Trees
❚ Clustering
❚ Neural Networks
Clustering
Data Mining is

❚ Segmentation
❚ Decision Trees
❚ Neural Networks
❚ Predictive Modeling
❚ Affinity Analysis
❚ Association Rule
❚ Sequence Generators
Challenges

❚ Increasing data dimensionality and data size
❚ Various data forms
❚ New data types
❙ Streaming data, multimedia data
❚ Efficient search and access to
data/knowledge
❚ Intelligent update and integration
Data Mining Survey
Industry Pioneers
❚ 23% Manufacturing
❚ 19% Financial Serv.
❚ 17% Tele/Data communication
❚ 13% Media
❚ 12% Retail/Wholesaler
Objectives
❚ 21.4% Understanding Customer Segments and Preferences
❚ 19.5% Identifying Profitable Customers and Acquiring New Ones
❚ 14.1% Increasing Revenue From Customers
Results of Data Mining
Include:
❚ Forecasting what may happen in the
future
❚ Classifying people or things into groups
by recognizing patterns
❚ Clustering people or things into groups
based on their attributes
❚ Associating what events are likely to
occur together
❚ Sequencing what events are likely to
lead to later events
Data Mining versus OLAP
❚ OLAP - On-line Analytical Processing
❙ Provides you with a very good view of
what is happening, but cannot predict
what will happen in the future or why it
is happening
Data Mining Versus Statistical Analysis

❚ Data Mining
❙ Originally developed to act as expert systems to solve problems
❙ Less interested in the mechanics of the technique
❙ If it makes sense then let’s use it
❙ Does not require assumptions to be made about data
❙ Can find patterns in very large amounts of data
❙ Requires understanding of data and business problem

❚ Data Analysis
❙ Tests for statistical correctness of models
❘ Are statistical assumptions of models correct?
• E.g., is the R-Square good?
❙ Hypothesis testing
❘ Is the relationship significant?
• Use a t-test to validate significance
❙ Tends to rely on sampling
❙ Techniques are not optimised for large amounts of data
❙ Requires strong statistical skills
Data Mining Tasks...

❚ Classification
❚ Clustering
❚ Association Rule Discovery
❚ Sequential Pattern Discovery
❚ Deviation Detection
Classification Application

❚ Direct Marketing

❚ Fraud Detection

❚ Customer Attrition/Churn

❚ Sky Survey Cataloging


Data Mining Tasks:
Clustering

❚ Goal is to identify
categories
❚ Natural grouping of
customers by processing
all the available data about
them.
❚ Other applications
❙ market segmentation,
discovering affinity groups,
and defect analysis
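
The idea of a natural grouping can be made concrete with a small example. The slides do not name a clustering algorithm; the sketch below uses k-means as one common choice, with hypothetical two-attribute customer records and starting centroids picked by hand so the run is repeatable.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points with caller-supplied starting centroids."""
    groups = []
    for _ in range(iterations):
        # Assignment step: put each point in the group of its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its old centroid).
        centroids = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups

# Hypothetical customers described by two numeric attributes.
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, groups = kmeans(customers, [customers[0], customers[3]])
print(centroids)
print(groups)
```

The number of groups and the starting centroids are inputs here; in practice both usually need some experimentation.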
Data Mining Tasks:
Association Rule Discovery

❚ Given a set of records each of which
contains some number of items from a
given collection:
❙ Produce dependency rules which will predict
occurrence of an item based on occurrences of
other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
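
How strongly the five transactions above back each rule can be checked by counting. The slides do not define any rule-quality measures; the sketch below assumes the two standard ones, support (the share of transactions containing all items of the rule) and confidence (the share of transactions containing the left-hand side that also contain the right-hand side).

```python
# The five market-basket transactions from the slide (TID -> items).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Estimate of P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support=%.2f" % support(lhs | rhs),
          "confidence=%.2f" % confidence(lhs, rhs))
```

A rule-discovery algorithm would search over many candidate itemsets and keep only rules above chosen support and confidence thresholds; this sketch only scores the two rules already listed.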
Association Rule
Discovery Application

❚ Marketing and Sales Promotion

❚ Supermarket Shelf Management

❚ Inventory Management
Deviation Detection & Pattern
Discovery
Deviation Detection:
…discovering most significant changes in
data from previously measured or
normative values…

Sequential Pattern Discovery:


…process of looking for patterns and rules
that predict strong sequential
dependencies among different events…
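
One simple way to picture deviation detection is to flag new measurements that lie far from a normative baseline. The sketch below is an illustration only: the z-score rule, the threshold of 3 standard deviations, and the sales figures are all assumptions, not something the slides prescribe.

```python
from statistics import mean, stdev

def find_deviations(baseline, new_values, threshold=3.0):
    """Return (index, value, z) for new values far from the baseline.

    baseline: previously measured values that define what is normal
              (must contain at least two distinct values).
    threshold: assumed cut-off, in baseline standard deviations.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    flagged = []
    for i, value in enumerate(new_values):
        z = (value - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, value, round(z, 2)))
    return flagged

# Hypothetical daily sales: a normative history and a new week.
history = [100, 102, 98, 101, 99, 103, 97, 100]
this_week = [101, 99, 140, 100, 60]
print(find_deviations(history, this_week))
```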
Sequential Patterns

❚ Identify frequently occurring
sequences from given records
❚ 40 percent of female customers buy
a gray skirt six months after buying
a red jacket
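
A statistic like the one in this example can be computed by scanning each customer's purchase history in time order. The sketch below is a simplified illustration with hypothetical data and names; it measures from each customer's earliest purchase of the first item and does not search for unknown patterns, which a real sequential-pattern miner would do.

```python
def followed_by_rate(histories, first, later, max_gap):
    """Fraction of customers who bought `first` and then bought `later`
    within `max_gap` time units afterwards.

    histories: customer id -> list of (time, item) purchases.
    """
    bought_first = 0
    bought_both = 0
    for purchases in histories.values():
        first_times = [t for t, item in purchases if item == first]
        if not first_times:
            continue
        bought_first += 1
        start = min(first_times)  # earliest purchase of the first item
        if any(item == later and start < t <= start + max_gap
               for t, item in purchases):
            bought_both += 1
    return bought_both / bought_first if bought_first else 0.0

# Hypothetical purchase histories; time is measured in months.
histories = {
    "c1": [(0, "red jacket"), (5, "gray skirt")],
    "c2": [(1, "red jacket"), (2, "scarf")],
    "c3": [(0, "gray skirt"), (3, "red jacket")],  # skirt came first: no match
    "c4": [(2, "red jacket"), (8, "gray skirt")],
    "c5": [(4, "scarf")],
}
print(followed_by_rate(histories, "red jacket", "gray skirt", max_gap=6))
```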
Data Mining Methodology:

❚ Sample
❙ Extract a portion of the dataset for data mining
❚ Explore
❚ Modify
❙ create, select and transform variables with the
intention of building a model
❚ Model
❙ Specify a relationship of variables that reliably
predicts a desired goal
❚ Assess
❙ Evaluate the practical value of the findings and the
model resulting from the data mining effort
Data Mining Methodology:

❚ Business understanding
❚ Data understanding
❚ Data preparation
❚ Modeling
❚ Evaluation
❚ Deployment
DM Phases
Phases and Tasks
(each task is followed by its outputs)

❚ Business Understanding
❙ Determine Business Objectives: Background, Business Objectives, Business Success Criteria
❙ Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
❙ Determine Data Mining Goal: Data Mining Goals, Data Mining Success Criteria
❙ Produce Project Plan: Project Plan, Initial Assessment of Tools and Techniques

❚ Data Understanding
❙ Collect Initial Data: Initial Data Collection Report
❙ Describe Data: Data Description Report
❙ Explore Data: Data Exploration Report
❙ Verify Data Quality: Data Quality Report

❚ Data Preparation
❙ Data Set: Data Set Description
❙ Select Data: Rationale for Inclusion / Exclusion
❙ Clean Data: Data Cleaning Report
❙ Construct Data: Derived Attributes, Generated Records
❙ Integrate Data: Merged Data
❙ Format Data: Reformatted Data

❚ Modeling
❙ Select Modeling Technique: Modeling Technique, Modeling Assumptions
❙ Generate Test Design: Test Design
❙ Build Model: Parameter Settings, Models, Model Description
❙ Assess Model: Model Assessment, Revised Parameter Settings

❚ Evaluation
❙ Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria, Approved Models
❙ Review Process: Review of Process
❙ Determine Next Steps: List of Possible Actions, Decision

❚ Deployment
❙ Plan Deployment: Deployment Plan
❙ Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
❙ Produce Final Report: Final Report, Final Presentation
❙ Review Project: Experience Documentation
Major Application Areas for
Data Mining Solutions
❚ Fraud/Non-Compliance Anomaly detection
❙ Isolate the factors that lead to fraud, waste and abuse
❙ Target auditing and investigative efforts more effectively
❚ Credit/Risk Scoring
❚ Intrusion detection
❚ Parts failure prediction
❚ Recruiting/Attracting customers
❚ Maximizing profitability (cross selling, identifying profitable customers)
❚ Service Delivery and Customer Retention
❙ Build profiles of customers likely to use which services
❚ Web Mining
❚ Health Care
Case Study: Search
Engines

❚ Early search engines relied mainly on the keywords on a
page and were subject to manipulation
❚ Google's success is due to its algorithm, which relies
mainly on links to the page
❚ Google founders Sergey Brin and Larry Page were
students at Stanford doing research in databases
and data mining in 1998, which led to Google
Case Study:
Direct Marketing and CRM

❚ Most major direct marketing
companies are using modeling and
data mining
❚ Most financial companies are using
customer modeling
❚ Modeling is easier than changing
customer behaviour
Final Remarks

❚ Data Mining can be utilized in
any field that needs to find
patterns or relationships in its
data.
Special Data Types

❚ Spatial Data
❚ Streamed Data
❚ Multimedia data
Spatial Mining
Definitions
❚ Spatial data is about instances located in a
physical space
❚ Spatial data has location or geo-referenced
features
❚ Some of these features are:
❙ Address, latitude/longitude (explicit)
❙ Location-based partitions in databases
(implicit)
Applications and Problems

❚ Geographic information systems (GIS) store
information related to geographic locations on
Earth
❙ Weather, community infrastructure needs, disaster
management, and hazardous waste
❚ Homeland security issues such as prediction of
unexpected events and planning of evacuation
❚ Remote sensing and image classification
❚ Biomedical applications include medical imaging
and illness diagnosis
Use of Spatial Data
❚ Map overlay – merging disparate data
❙ Different views of the same area: (Level 1) streets, power lines,
phone lines, sewer lines, (Level 2) actual elevations, building
locations, and rivers
❚ Spatial selection – find all houses near WSU
❚ Spatial join – nearest for points, intersection for areas
❚ Other basic spatial operations
❙ Region/range query for objects intersecting a region
❙ Nearest neighbor query for objects closest to a given place
❙ Distance scan asking for objects within a certain radius
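
The three basic query types can be sketched over a handful of points. The house names and coordinates below are hypothetical, and every query is a plain linear scan; a real system would typically use a spatial index to avoid checking every object.

```python
from math import hypot

# Hypothetical houses as (name, x, y) points in arbitrary map units.
houses = [("A", 1.0, 1.0), ("B", 2.0, 5.0), ("C", 6.0, 2.0), ("D", 3.0, 3.0)]

def range_query(points, xmin, ymin, xmax, ymax):
    """Region/range query: points that fall inside an axis-aligned box."""
    return [p for p in points if xmin <= p[1] <= xmax and ymin <= p[2] <= ymax]

def nearest_neighbor(points, x, y):
    """Nearest neighbor query: the single point closest to (x, y)."""
    return min(points, key=lambda p: hypot(p[1] - x, p[2] - y))

def distance_scan(points, x, y, radius):
    """Distance scan: all points within `radius` of (x, y)."""
    return [p for p in points if hypot(p[1] - x, p[2] - y) <= radius]

print(range_query(houses, 0, 0, 4, 4))    # houses inside the box
print(nearest_neighbor(houses, 5, 2))     # house closest to (5, 2)
print(distance_scan(houses, 2, 2, 2.0))   # houses within 2 units of (2, 2)
```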
Spatial Data Structures

❚ Minimum bounding
rectangles (MBR)
❚ Different tree structures
❙ Quad tree
❙ R-Tree
❙ kd-Tree
❚ Image databases
MBR

❚ Representing a spatial object by the smallest rectangle [(x1,y1),
(x2,y2)] or rectangles
[Figure: a rectangle with lower-left corner (x1,y1) and upper-right corner (x2,y2)]
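
As a minimal sketch of the idea, an MBR can be computed from an object's vertices and then used for a cheap overlap test before any exact geometry is checked. The shapes and function names below are hypothetical, and (x1,y1) is assumed to be the lower-left corner and (x2,y2) the upper-right.

```python
def mbr(points):
    """Minimum bounding rectangle of a point list as ((x1, y1), (x2, y2)),
    with (x1, y1) the lower-left and (x2, y2) the upper-right corner."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))

def mbrs_intersect(a, b):
    """True if two MBRs overlap or touch."""
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

# Hypothetical polygon outlines given by their vertices.
lake = [(2, 1), (5, 2), (4, 6), (1, 4)]
park = [(4, 5), (8, 5), (8, 9), (4, 9)]
farm = [(10, 0), (12, 0), (12, 2), (10, 2)]

print(mbr(lake))
print(mbrs_intersect(mbr(lake), mbr(park)))
print(mbrs_intersect(mbr(lake), mbr(farm)))
```

Overlapping MBRs only mean the objects might intersect; non-overlapping MBRs guarantee they do not, which is what makes the test useful as a filter.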
R-Tree

❚ Indexing MBRs in a tree
❙ An R-tree of order m has at most m entries in one node
❙ An example (order of 3)
[Figure: rectangles R1 to R5 grouped under R6 and R7, which are grouped under the root R8; shown both as nested rectangles and as a tree]
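
The benefit of indexing MBRs in a tree is that a query can skip whole subtrees whose bounding rectangle misses the query region. The sketch below shows only the search step on a hand-built tree; the coordinates, and the assignment of R1 to R3 under R6 and R4 to R5 under R7, are assumptions made for illustration, and insertion and node splitting are not implemented.

```python
class Node:
    """R-tree node: an MBR plus child nodes or, for a leaf entry, an object name."""
    def __init__(self, mbr, children=(), name=None):
        self.mbr = mbr                  # (x1, y1, x2, y2)
        self.children = list(children)
        self.name = name                # set only on leaf entries

def intersects(a, b):
    """True if rectangles a and b overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, query, found=None):
    """Collect names of leaf entries whose MBR intersects `query`,
    skipping every subtree whose MBR does not intersect it."""
    if found is None:
        found = []
    if not intersects(node.mbr, query):
        return found
    if node.name is not None:
        found.append(node.name)
    for child in node.children:
        search(child, query, found)
    return found

# Hand-built tree of order 3 with hypothetical coordinates.
leaves = {"R1": (0, 0, 2, 2), "R2": (1, 3, 3, 5), "R3": (3, 0, 5, 2),
          "R4": (7, 1, 9, 3), "R5": (8, 4, 10, 6)}
r6 = Node((0, 0, 5, 5), [Node(leaves[n], name=n) for n in ("R1", "R2", "R3")])
r7 = Node((7, 1, 10, 6), [Node(leaves[n], name=n) for n in ("R4", "R5")])
r8 = Node((0, 0, 10, 6), [r6, r7])

print(search(r8, (4, 1, 8, 2)))
```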
Common Tasks dealing with
Spatial Data

❚ Data focusing
❙ Spatial queries
❙ Identifying interesting parts in spatial data
❙ Progressive refinement can be applied in a tree structure
❚ Feature extraction
❙ Extracting important/relevant features for an
application
❚ Classification or others
❙ Using training data to create classifiers
❙ Many mining algorithms can be used
❘ Classification, clustering, associations
Spatial Mining Tasks

❚ Spatial classification
❚ Spatial clustering
❚ Spatial association rules
Summary
❚ Spatial data can contain both spatial and non-spatial
features.
❚ When spatial information becomes the dominant
interest, spatial data mining should be applied.
❚ Spatial data structures can facilitate spatial mining.
❚ Standard data mining algorithms can be modified
for spatial data mining, with substantial
preprocessing to take spatial information into
account.
The Stream Model

❚ Data enters at a rapid rate from one
or more input ports.
❚ The system cannot store the entire
stream.
❚ How do you make critical calculations
about the stream using a limited
amount of (secondary) memory?
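
For some calculations the answer is to keep a small running summary instead of the data. As one illustration, the sketch below tracks the mean and variance of a stream in constant memory using Welford's method; the choice of statistic and the class name are assumptions, and the input values are just sample numbers.

```python
class RunningStats:
    """Mean and sample variance of a stream using O(1) memory (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared differences from the mean

    def update(self, x):
        """Fold one stream element into the summary."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have arrived."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

stats = RunningStats()
for value in [1, 5, 2, 7, 0, 9, 3]:
    stats.update(value)  # each element is seen once and then discarded
print(stats.count, stats.mean, stats.variance)
```

Only three numbers are stored no matter how long the stream runs. Statistics such as exact medians or distinct counts do not reduce to a fixed-size summary this easily and generally call for approximation.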

[Figure: the stream model. Several streams enter a processor over time, e.g. ". . . 1, 5, 2, 7, 0, 9, 3", ". . . a, r, v, t, y, h, b", and ". . . 0, 0, 1, 0, 1, 1, 0". The processor has limited storage, receives queries, and produces output.]
Applications --- (1)

❚ In general, stream processing is
important for applications where
❙ New data arrives frequently.
❙ Important queries tend to ask about the
most recent data, or summaries of data.

Applications --- (2)

❚ Mining query streams.
❙ Google wants to know what queries are
more frequent today than yesterday.
❚ Mining click streams.
❙ Yahoo wants to know which of its pages
are getting an unusual number of hits in
the past hour.
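
Both examples compare counts from a recent window with counts from an earlier one. The sketch below illustrates that comparison for queries; the function name, the thresholds, and the sample queries are hypothetical, and it keeps exact per-query counts, which a very large stream would likely have to replace with approximate counters.

```python
from collections import Counter

def rising_queries(yesterday, today, min_ratio=2.0, min_count=3):
    """Queries whose count today is at least `min_ratio` times yesterday's.

    Both thresholds are illustrative assumptions. A query unseen yesterday
    is treated as having a count of 1 to avoid dividing by zero.
    """
    before = Counter(yesterday)
    now = Counter(today)
    rising = {}
    for query, count in now.items():
        if count >= min_count and count / max(before[query], 1) >= min_ratio:
            rising[query] = count
    return rising

# Hypothetical query logs for two consecutive days.
yesterday = ["weather", "news", "weather", "maps", "news", "weather"]
today = ["eclipse", "weather", "eclipse", "eclipse", "news", "eclipse", "weather"]
print(rising_queries(yesterday, today))
```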

Applications --- (3)

❚ Sensors of all kinds need monitoring,
especially when there are many
sensors of the same type, feeding
into a central controller, most of
which are not sensing anything
important at the moment.
❚ Telephone call records summarized
into customer bills.
Applications --- (4)

❚ Intelligence-gathering.
❘ Who calls whom?
❘ Who accesses which Web pages?
❘ Who buys what where?

Characteristics of Data
Streams
❚ Data Streams
❙ Data streams—continuous, ordered, changing, fast, huge amount

❙ Traditional DBMS—data stored in finite, persistent data sets

❚ Characteristics
❙ Huge volumes of continuous data, possibly infinite
❙ Fast changing and requires fast, real-time response
❙ Data stream captures nicely our data processing needs of today
❙ Random access is expensive—single linear scan algorithm (can only have
one look)
❙ Store only the summary of the data seen thus far
❙ Most stream data are low-level or multi-dimensional in nature and
need multi-level and multi-dimensional processing
Stream Data Applications
❚ Telecommunication calling records
❚ Business: credit card transaction flows
❚ Network monitoring and traffic engineering
❚ Financial market: stock exchange
❚ Engineering & industrial processes: power supply
& manufacturing
❚ Sensor, monitoring & surveillance: video streams
❚ Security monitoring
❚ Web logs and Web page click streams
❚ Massive data sets (even when saved, random access
is too expensive)
Challenges of Stream Data
Processing

❚ Multiple, continuous, rapid, time-varying, ordered streams


❚ Main memory computations
❚ Queries are often continuous
❙ Evaluated continuously as stream data arrives
❚ Queries are often complex
❙ Beyond element-at-a-time processing
❙ Beyond stream-at-a-time processing
❙ Beyond relational queries (scientific, data mining, OLAP)
❚ Multi-level/multi-dimensional processing and data mining
❙ Most stream data are low-level or multi-dimensional in nature
Multi-Dimensional Stream Analysis:
Examples
❚ Analysis of Web click streams
❙ Raw data at low levels: seconds, web page addresses, user IP
addresses, …
❙ Analysts want: changes, trends, unusual patterns, at reasonable
levels of details
❙ E.g., “Average clicking traffic in North America on sports in the
last 15 minutes is 40% higher than that in the last 24 hours.”
❚ Analysis of power consumption streams
❙ Raw data: power consumption flow for every household, every
minute
❙ Patterns one may find: average hourly power consumption
for manufacturing companies in Chicago in the last 2 hours
is up 30% compared with the same day a week ago
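
A comparison such as "the last 15 minutes versus the last 24 hours" can be kept up to date with a bounded buffer of per-minute totals. The sketch below assumes the raw events have already been rolled up to one total per minute for a single region and category; the class name and numbers are hypothetical, and the demo uses much shorter windows than the slide's example.

```python
from collections import deque

class TrafficMonitor:
    """Keeps per-minute click totals for the last `long_minutes` minutes and
    compares the recent average against the long-run average."""

    def __init__(self, short_minutes=15, long_minutes=24 * 60):
        self.short = short_minutes
        self.totals = deque(maxlen=long_minutes)  # oldest minutes drop off

    def add_minute(self, clicks):
        self.totals.append(clicks)

    def relative_change(self):
        """short-window mean / long-window mean - 1 (0.4 means 40% higher).
        Requires at least one recorded minute with non-zero long-run mean."""
        recent = list(self.totals)[-self.short:]
        long_mean = sum(self.totals) / len(self.totals)
        short_mean = sum(recent) / len(recent)
        return short_mean / long_mean - 1

monitor = TrafficMonitor(short_minutes=3, long_minutes=12)
for clicks in [100] * 9 + [160, 160, 160]:
    monitor.add_minute(clicks)
print(round(monitor.relative_change(), 3))
```

Memory use is fixed by the long window, not by the length of the stream.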
Data Warehouse
A Data Warehouse stores data that have been extracted
from the various operational, external, and other databases
of an organization

It is a central source of data that have been cleaned,
transformed, and cataloged so they can be used by managers
and other business professionals

The acquisition process includes consolidating data from
several sources, filtering out unwanted data, correcting
incorrect data, converting data to new data elements, and
aggregating data into new data subsets
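
The acquisition steps can be sketched on a toy example. Everything below is hypothetical (the two sources, the field names, and the cleaning rules); it only illustrates consolidating, filtering, correcting, converting, and aggregating in one pass.

```python
from collections import defaultdict

# Hypothetical extracts from two operational sources.
store_sales = [
    {"region": "east", "amount": "120.50"},
    {"region": "EAST", "amount": "80"},
    {"region": "west", "amount": "-1"},   # invalid amount, filtered out
]
web_sales = [
    {"region": "West", "amount": "200.00"},
    {"region": "", "amount": "50"},       # missing region, filtered out
]

def load_warehouse(*sources):
    """Consolidate, filter, correct, convert and aggregate sales records."""
    totals = defaultdict(float)
    for source in sources:                           # consolidate sources
        for row in source:
            region = row["region"].strip().lower()   # correct inconsistent casing
            amount = float(row["amount"])            # convert text to a number
            if not region or amount < 0:             # filter unwanted records
                continue
            totals[region] += amount                 # aggregate by region
    return dict(totals)

print(load_warehouse(store_sales, web_sales))
```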
Where DW is used

❚ Data mining – data in a warehouse are
analysed to reveal hidden patterns and
trends in historical business activity
❚ OLAP
❚ Business analysis
❚ Market Research
❚ Decision Support
Components of DW

❚ Analytical data store – holds data in a
form more useful for certain types of
analysis
❚ Metadata – data that defines the
data in the data warehouse
Data Mart

❚ Data warehouses may be subdivided
into data marts
❚ Data marts hold subsets of data
from the warehouse that focus on specific aspects of
a company, such as a department or a
business process
Data Warehouse
Architecture
Data Warehouse Options
