Definition of Data Mining
Data Mining and Business Intelligence
[Figure: layered diagram with an axis labeled "Increasing potential to support business decisions". Recoverable labels: End User; Making Decisions; Data Exploration (Statistical Analysis, Querying and Reporting); Data]
Related Fields
[Figure: fields related to data mining: Machine Learning, Visualization, Statistics, Databases]
Knowledge Discovery Process
[Figure: knowledge discovery process. Recoverable labels: Raw Data; Integration; Data Warehouse; Selection & Cleaning; Target Data; Transformation; Transformed Data; Data Mining; Patterns and Rules; Interpretation & Evaluation; Knowledge; Understanding]
The Evolution of Data Analysis
Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics
Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery
Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR | Retrospective, dynamic data delivery at multiple levels
❚ OLAP
❚ DATA WAREHOUSING
❚ Data Visualization
❚ SQL
❚ Ad Hoc Queries
❚ Reporting
Data Mining is…
❚ Predictive Modeling
❚ Linear/Logistic Regression
❚ Neural Networks
❚ Decision Trees
❚ Clustering
❚ Neural Networks
Clustering
Data Mining is
❚ Segmentation
❚ Decision Trees
❚ Neural Networks
❚ Predictive Modeling
❚ Affinity Analysis
❚ Association Rule
❚ Sequence Generators
Challenges
❚ Classification
❚ Clustering
❚ Association Rule Discovery
❚ Sequential Pattern Discovery
❚ Deviation Detection
Classification Application
❚ Direct Marketing
❚ Fraud Detection
❚ Customer Attrition/Churn
❚ Goal is to identify categories
❚ Natural grouping of customers by processing all the available data about them.
❚ Other applications
❙ market segmentation, discovering affinity groups, and defect analysis
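The idea of grouping customers without predefined categories can be sketched with k-means, one common clustering algorithm. This is a minimal illustration, not the method the slides prescribe: the customer records (annual spend, visits per month), the function names, and the choice of k = 2 are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def assign(points, centroids):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, max_iterations=50, seed=0):
    """Plain k-means: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        updated = []
        for centroid, members in zip(centroids, clusters):
            if members:
                updated.append((sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members)))
            else:
                updated.append(centroid)  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical customers: (annual spend, visits per month).
customers = [(200, 2), (220, 3), (210, 2), (900, 12), (950, 11), (920, 13)]
centroids, clusters = kmeans(customers, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Because the two attributes are on different scales, spend dominates the distance here; in practice the attributes would usually be normalized first.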
Data Mining Tasks:
Association Rule Discovery
❚ Inventory Management
Deviation Detection & Pattern Discovery
Deviation Detection:
…discovering most significant changes in data from previously measured or normative values…
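One simple way to read this definition is to compare new measurements against a baseline of previously measured values and flag those that lie far from it. The sketch below uses a z-score with a threshold of three standard deviations; the function name, the threshold, and the sample figures are assumptions for illustration, and other deviation measures are equally possible.

```python
from statistics import mean, stdev

def find_deviations(baseline, new_values, threshold=3.0):
    """Return (index, value, z) for new values far from the baseline mean.

    baseline   -- previously measured (normative) values
    new_values -- fresh measurements to check
    threshold  -- how many baseline standard deviations count as a deviation
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation; z-scores are undefined")
    flagged = []
    for index, value in enumerate(new_values):
        z = (value - mu) / sigma
        if abs(z) > threshold:
            flagged.append((index, value, z))
    return flagged

# Hypothetical daily transaction counts: baseline mean 100, standard deviation 2.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
today = [101, 99, 112, 100, 91]
for index, value, z in find_deviations(baseline, today):
    print(f"value {value} at position {index} deviates by {z:+.1f} sigma")
```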
❚ Sample
❙ Extract a portion of the dataset for data mining
❚ Explore
❚ Modify
❙ Create, select and transform variables with the intention of building a model
❚ Model
❙ Specify a relationship of variables that reliably predicts a desired goal
❚ Assess
❙ Evaluate the practical value of the findings and the model resulting from the data mining effort
Data Mining Methodology:
❚ Data understanding
❚ Data preparation
❚ Modeling
❚ Evaluation
❚ Deployment
DM Phases
Phases and Tasks
❚ Business Understanding
❙ Determine Business Objectives: Background, Business Objectives, Business Success Criteria
❙ Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
❙ Determine Data Mining Goal: Data Mining Goals, Data Mining Success Criteria
❚ Data Understanding
❙ Collect Initial Data: Initial Data Collection Report
❙ Describe Data: Data Description Report
❙ Explore Data: Data Exploration Report
❙ Verify Data Quality: Data Quality Report
❚ Data Preparation
❙ Data Set: Data Set Description
❙ Select Data: Rationale for Inclusion / Exclusion
❙ Clean Data: Data Cleaning Report
❙ Construct Data: Derived Attributes, Generated Records
❙ Integrate Data: Merged Data
❙ Format Data: Reformatted Data
❚ Modeling
❙ Select Modeling Technique: Modeling Technique, Modeling Assumptions
❙ Generate Test Design: Test Design
❙ Build Model: Parameter Settings, Models, Model Description
❙ Assess Model: Model Assessment, Revised Parameter Settings
❚ Evaluation
❙ Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria, Approved Models
❙ Review Process: Review of Process
❙ Determine Next Steps: List of Possible Actions, Decision
❚ Deployment
❙ Plan Deployment: Deployment Plan
❙ Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
❙ Produce Final Report: Final Report, Final Presentation
❙ Review Project: Experience Documentation
❚ Spatial Data
❚ Streamed Data
❚ Multimedia data
Spatial Mining
Definitions
❚ Spatial data is about instances located in a physical space
❚ Spatial data has location or geo-referenced features
❚ Some of these features are:
❙ Address, latitude/longitude (explicit)
❙ Location-based partitions in databases (implicit)
Applications and Problems
❚ Minimum bounding rectangles (MBR)
❚ Different tree structures
❙ Quad tree
❙ R-Tree
❙ kd-Tree
❚ Image databases
MBR
[Figure: a minimum bounding rectangle defined by two opposite corners, (x1,y1) and (x2,y2)]
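A minimum bounding rectangle can be stored as just its two corner points, which makes overlap tests cheap. The sketch below is one possible representation; the class name, the method names, and the convention that (x1, y1) is the lower-left corner and (x2, y2) the upper-right corner are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MBR:
    """Axis-aligned rectangle; (x1, y1) lower-left, (x2, y2) upper-right (assumed)."""
    x1: float
    y1: float
    x2: float
    y2: float

    @classmethod
    def of_points(cls, points):
        """Smallest axis-aligned rectangle enclosing all (x, y) points."""
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        return cls(min(xs), min(ys), max(xs), max(ys))

    def intersects(self, other):
        """True if the two rectangles share at least one point."""
        return (self.x1 <= other.x2 and other.x1 <= self.x2 and
                self.y1 <= other.y2 and other.y1 <= self.y2)

    def contains_point(self, x, y):
        """True if (x, y) lies inside or on the border of the rectangle."""
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2

# Hypothetical spatial objects described by their vertices.
lake = MBR.of_points([(1, 1), (4, 2), (3, 5)])
park = MBR.of_points([(3, 4), (6, 7)])
print(lake)                   # MBR(x1=1, y1=1, x2=4, y2=5)
print(lake.intersects(park))  # True
```

If two MBRs do not intersect, the objects inside them cannot intersect either, which is why tree structures such as the R-tree store MBRs in their nodes.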
R-Tree
[Figure: R-tree example built over rectangles R1 through R7]
Common Tasks dealing with Spatial Data
❚ Data focusing
❙ Spatial queries
❙ Identifying interesting parts in spatial data
❙ Progressive refinement can be applied in a tree structure
❚ Feature extraction
❙ Extracting important/relevant features for an application
❚ Classification or others
❙ Using training data to create classifiers
❙ Many mining algorithms can be used
❘ Classification, clustering, associations
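The refinement idea in the data-focusing bullet can be illustrated with a two-step spatial query: a cheap bounding-box filter discards most candidates, and an exact geometric test runs only on the survivors. This is a flat sketch under stated assumptions; it uses no tree structure, and the function names, the ray-casting test, and the sample region are choices made for the example.

```python
def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def point_in_polygon(point, polygon):
    """Exact test by ray casting: count edge crossings to the right of the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        xa, ya = polygon[i]
        xb, yb = polygon[(i + 1) % n]
        if (ya > y) != (yb > y):  # edge spans the horizontal line through the point
            x_cross = xa + (y - ya) * (xb - xa) / (yb - ya)
            if x < x_cross:
                inside = not inside
    return inside

def points_in_region(points, polygon):
    """Filter step (bounding box) followed by refinement step (exact test)."""
    min_x, min_y, max_x, max_y = bounding_box(polygon)
    candidates = [p for p in points
                  if min_x <= p[0] <= max_x and min_y <= p[1] <= max_y]
    return [p for p in candidates if point_in_polygon(p, polygon)]

# Hypothetical triangular region and query points.
region = [(0, 0), (4, 0), (0, 4)]
points = [(1, 1), (3, 3), (5, 5), (0.5, 2)]
print(points_in_region(points, region))  # [(1, 1), (0.5, 2)]
```

Here (5, 5) is rejected by the filter alone, while (3, 3) passes the filter and is rejected only by the exact test.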
Spatial Mining Tasks
❚ Spatial classification
❚ Spatial clustering
❚ Spatial association rules
Summary
❚ Spatial data can contain both spatial and non-spatial features.
❚ When spatial information becomes the dominant interest, spatial data mining should be applied.
❚ Spatial data structures can facilitate spatial mining.
❚ Standard data mining algorithms can be modified for spatial data mining, with substantial preprocessing to take spatial information into account.
The Stream Model
[Figure: the stream model. Several streams enter a processor over time (. . . 1, 5, 2, 7, 0, 9, 3; . . . a, r, v, t, y, h, b; . . . 0, 0, 1, 0, 1, 1, 0). The processor has limited storage, accepts queries, and produces output.]
Applications --- (1)
Applications --- (2)
Applications --- (3)
❚ Intelligence-gathering.
❘ Who calls whom?
❘ Who accesses which Web pages?
❘ Who buys what where?
Characteristics of Data Streams
❚ Data Streams
❙ Data streams are continuous, ordered, changing, fast, and huge in volume
❚ Characteristics
❙ Huge volumes of continuous data, possibly infinite
❙ Fast changing, requiring fast, real-time response
❙ The data stream model nicely captures today's data processing needs
❙ Random access is expensive, so algorithms must work in a single linear scan (each item is seen only once)
❙ Store only a summary of the data seen thus far
❙ Most stream data are low-level or multi-dimensional in nature and need multi-level, multi-dimensional processing
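The single-scan and summary-only constraints can be made concrete with a running summary that looks at each item exactly once and keeps a fixed amount of state. The sketch below uses Welford's online algorithm for the mean and variance; the class name and the sample stream are assumptions for illustration, and real stream systems typically maintain richer summaries such as sketches or histograms.

```python
class RunningSummary:
    """Constant-memory summary of a numeric stream (Welford's online algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0       # running sum of squared deviations from the mean
        self.minimum = None
        self.maximum = None

    def update(self, value):
        """Absorb one stream item; the item itself is not stored."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    @property
    def variance(self):
        """Sample variance of the items seen so far (0.0 with fewer than two)."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

def readings():
    """Hypothetical stream; a generator can be consumed only once, like a stream."""
    yield from [3, 9, 0, 7, 2, 5, 1]

summary = RunningSummary()
for item in readings():
    summary.update(item)
print(summary.count, summary.minimum, summary.maximum)  # 7 0 9
print(round(summary.mean, 3), round(summary.variance, 3))
```

The state stays the same size no matter how long the stream runs, which is the property the limited-storage model requires.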
Stream Data Applications
❚ Telecommunication calling records
❚ Business: credit card transaction flows
❚ Network monitoring and traffic engineering
❚ Financial market: stock exchange
❚ Engineering & industrial processes: power supply & manufacturing
❚ Sensor, monitoring & surveillance: video streams
❚ Security monitoring
❚ Web logs and Web page click streams
❚ Massive data sets (even when stored, random access is too expensive)
Challenges of Stream Data Processing