
Prof. Chandan Singhavi
Process of semi-automatically analyzing large databases to find patterns that are:
valid: hold on new data with some certainty
novel: non-obvious to the system
useful: it should be possible to act on the item
understandable: humans should be able to interpret the pattern
Also known as Knowledge Discovery in Databases (KDD)

Data mining consists of finding interesting trends or patterns in large data sets to guide decisions about future activities.
It is the analysis of data in a database using tools that look for trends or anomalies without knowledge of the meaning of the data.
True data mining software does not just change the presentation; it discovers previously unknown relationships among the data.

Data Mining has been defined as a decision support process in which a search is made for patterns of information in data. To detect patterns in data, Data Mining uses sophisticated statistical analysis and modeling technologies to uncover useful relationships hidden in databases. It predicts future trends and finds behaviors, allowing businesses to make predictive, knowledge-driven decisions.
Credit ratings / targeted marketing:
Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Identify likely responders to sales promotions
Fraud detection:
Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
Customer relationship management:
Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Data Mining helps extract such information.
A major challenge in exploiting data mining is identifying suitable data to mine.

Data mining requires a single, separate, clean, integrated, and self-consistent source of data.

A data warehouse is well equipped to provide data for mining.

Data quality and consistency are prerequisites for mining, to ensure the accuracy of the predictive models. Data warehouses are populated with clean, consistent data.

It is advantageous to mine data from multiple sources to discover as many interrelationships as possible. Data warehouses contain data from a number of sources.

Selecting relevant subsets of records and fields for data mining requires the query capabilities of the data warehouse.

Results of a data mining study are useful only if there is some way to further investigate the uncovered patterns. Data warehouses provide the capability to go back to the data source.

OLAP (On-line Analytical Processing) gives you a very good view of what is happening, but cannot predict what will happen in the future or why it is happening.

OLAP Tools:
Are ad hoc, shrink-wrapped tools that provide an interface to data
Are used when you have specific, known questions
Look and feel like a spreadsheet, allowing rotation, slicing and graphing
Can be deployed to a large number of users

Data Mining Tools:
Provide methods for analyzing multiple data types (regression trees, neural networks, genetic algorithms)
Are used when you don't know what the questions are
Are usually textual in nature
Are usually deployed to a small number of analysts
Typical data mining system architecture (figure): databases and a data warehouse, built through data cleaning, data integration and filtering, feed a data mining engine; a pattern evaluation module and a knowledge base guide the mining, and results are presented through a graphical user interface.
Traditional analysis is verification-driven:
Requires a hypothesis of the desired information (target)
Requires correct interpretation of the proposed query

Discovery-driven data mining
Finds data with common characteristics
Results are ideal solutions to discovery
Finds results without previous hypothesis

Verification-driven analysis:
Define a target (hypothesis)
Search for the target
There are / are not hits
Verify or negate the hypothesis
The distribution is centred on the target


Discovery-driven mining:
Finds data with common characteristics
Results are ideal solutions to discovery
Finds results without a previous hypothesis
The target is mathematical, so it has no scientific dependency
Where are the data sources for analysis?
Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of model customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Conversion of single to a joint bank account: marriage, etc.
Cross-market analysis
Associations/correlations between product sales
Prediction based on the association information
Customer profiling
data mining can tell you what types of customers buy what
products (clustering or classification)
Identifying customer requirements
identifying the best products for different customers
use prediction to find what factors will attract new
customers
Provides summary information
various multidimensional summary reports
statistical summary information (data central tendency and
variation)
Ramakrishnan and Gehrke, Database Management Systems, 3rd Edition.
Competitive pressure!
The secret of success is to know something that
nobody else knows.
Aristotle Onassis

Competition on service, not only on price (Banks,
phone companies, hotel chains, rental car
companies)
Personalization, CRM
The real-time enterprise
Systemic listening
Security, homeland defense
Steps:
1. Data selection
2. Data cleaning
3. Data mining
4. Evaluation
Knowledge Discovery Process (figure): raw data is integrated (e.g., from the data warehouse) and selected into target data; cleaning and transformation produce transformed data; data mining yields patterns and rules; interpretation and evaluation, guided by domain understanding, turn these into knowledge.
2.1 Data preprocessing
Data selection: Identify target datasets and
relevant fields
Data cleaning
Remove noise and outliers
Data transformation
Create common units
Generate new fields
2.2 Data mining model construction
2.3 Model evaluation
Mining process (figure): original data is turned into target data by data integration and selection; preprocessing produces preprocessed data; model construction yields patterns; interpretation turns the patterns into knowledge.
Define the problem.
Select the data.
Prepare the data.
Mine the data.
Deploy the model.
Take business action.
Are you ready for Data Mining?


Discovery (patterns, relations, associations, etc.)
Prior knowledge
Information model
Deployment
Validation
Data access
Data selection
Sensitivity to data quality
Data visualization
Extensibility
Performance
Scalability
Openness
Suite of algorithms


Relational data and transactional data
Time-series data
Text
Images, video
Mixtures of data
Sequence data

Features from processing other data
sources
Industry: Application
Finance: Credit card analysis
Insurance: Claims, fraud analysis
Telecommunication: Call record analysis
Transport: Logistics management
Consumer goods: Promotion analysis
Data service providers: Value-added data
Utilities: Power usage analysis
Data Warehousing provides
the Enterprise with a
memory
Data Mining provides the
Enterprise with intelligence
Data warehouse mining:
assimilate data from operational sources
mine static data
Mining log data
Continuous mining: example in process
control
Stages in mining:
data selection, pre-processing (cleaning), transformation, mining, result evaluation, and visualization
Around 20 to 30 mining tool vendors
Major tool players:
Clementine,
Business Objects,
IBM's Intelligent Miner,
SGI's MineSet,
SAS's Enterprise Miner.
All offer pretty much the same set of tools
Many embedded products:
fraud detection:
electronic commerce applications,
health care,
customer relationship management: Epiphany
Web log analysis for site design:
what are popular pages,
what links are hard to find.
Electronic stores sales enhancements:
recommendations, advertisement:
Collaborative filtering: Net perception, Wisewire
Inventory control: what was a shopper looking for and could not find.
The US Government uses Data Mining to track
fraud
A Supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross Selling
Target Marketing
Holding on to Good Customers
Weeding out Bad Customers
Numerical or interval: Domain is ordered
and can be represented on the real line
(e.g., age, income)
Nominal or categorical: Domain is a finite
set without any natural ordering (e.g.,
occupation, marital status)
Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)
Supervised learning
Classification and regression
Unsupervised learning
Clustering
Dependency modeling
Associations, summarization, causality
Outlier and deviation detection
Trend analysis and change detection
Find correlated events:
Applications in medicine: find redundant
tests
Cross selling in retail, banking
Improve predictive capability of classifiers
that assume attribute independence
New similarity measures of categorical
attributes
Given: a database of customer transactions, where each transaction is a set of items.
Example: the transaction with TID 111 contains the items {Pen, Ink, Milk, Juice}.
TID CID Date Item Qty
111 201 5/1/99 Pen 2
111 201 5/1/99 Ink 1
111 201 5/1/99 Milk 3
111 201 5/1/99 Juice 6
112 105 6/3/99 Pen 1
112 105 6/3/99 Ink 1
112 105 6/3/99 Milk 1
113 106 6/5/99 Pen 1
113 106 6/5/99 Milk 1
114 201 7/1/99 Pen 2
114 201 7/1/99 Ink 2
114 201 7/1/99 Juice 4

Find all itemsets with support >= 75%.
(Same transaction table as above.)
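A worked answer, derived from the transaction table (not part of the original slide): there are 4 transactions, so support(Pen) = 4/4 = 100%, support(Ink) = support(Milk) = 3/4 = 75%, and support(Juice) = 2/4 = 50%. Among the larger itemsets, {Pen, Ink} and {Pen, Milk} each occur in 3 of the 4 transactions (75%). The itemsets with support >= 75% are therefore {Pen}, {Ink}, {Milk}, {Pen, Ink}, and {Pen, Milk}.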

Can you find all association rules with support >= 50%?
(Same transaction table as above.)

Co-occurrences
80% of all customers purchase items X, Y and Z together.
Association rules
60% of all customers who purchase X and Y also
buy Z.
Sequential patterns
60% of customers who first buy X also purchase Y
within three weeks.
Support
The support of an itemset is the fraction of transactions in the database that contain all the items in the itemset.

We are interested in all itemsets whose support is higher than a user-specified minimum support minsup; we call such itemsets frequent itemsets.

Fundamental property of frequent itemsets

A priori property: Every subset of a frequent itemset is also a frequent itemset. (For example, if {Pen, Ink} is frequent, then {Pen} and {Ink} must each be frequent as well.)

For each item, check whether it is a frequent itemset  // i.e., it appears in at least minsup of the transactions
k = 1
Repeat
  for each new frequent itemset I_k with k items:
    generate all candidate itemsets I_{k+1} with k+1 items such that I_k is a subset of I_{k+1}
  Scan all transactions once and check whether the generated (k+1)-itemsets are frequent
  k = k + 1
Until no new frequent itemsets are identified
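A minimal sketch of this level-wise (Apriori-style) search in Python, using the pen/ink/milk/juice transactions from the earlier table; the function and variable names are illustrative, not taken from any particular tool:

from itertools import combinations

def apriori(transactions, minsup):
    # Return every itemset whose support is at least minsup (a fraction between 0 and 1).
    n = len(transactions)
    support = lambda itemset: sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    result, k = {}, 1
    while current:
        for s in current:
            result[s] = support(s)
        # Generate (k+1)-item candidates by joining frequent k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent (a priori property).
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k))}
        # One scan over the transactions keeps only the frequent candidates.
        current = {c for c in candidates if support(c) >= minsup}
        k += 1
    return result

transactions = [
    {"Pen", "Ink", "Milk", "Juice"},   # TID 111
    {"Pen", "Ink", "Milk"},            # TID 112
    {"Pen", "Milk"},                   # TID 113
    {"Pen", "Ink", "Juice"},           # TID 114
]
for itemset, sup in sorted(apriori(transactions, 0.75).items(), key=lambda x: -x[1]):
    print(set(itemset), sup)

Run with minsup = 0.75, this prints exactly the five itemsets from the worked answer above.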
-- Customer-item pairs where the customer bought more than 5 units of that item in total:
SELECT P.custid, P.item, SUM(P.qty)
FROM Purchases P
GROUP BY P.custid, P.item
HAVING SUM(P.qty) > 5

-- Customers who bought more than 5 items in total:
SELECT P.custid
FROM Purchases P
GROUP BY P.custid
HAVING SUM(P.qty) > 5

-- Items of which more than 5 units were sold in total:
SELECT P.item
FROM Purchases P
GROUP BY P.item
HAVING SUM(P.qty) > 5

Association rule

LHS => RHS

Support: the support for a set of items is the percentage of transactions that contain all of these items.

Confidence: the percentage of the transactions containing the LHS that also contain all items in the RHS:

Confidence = sup(LHS ∪ RHS) / sup(LHS)

The confidence of a rule is an indication of the strength of the rule.
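As a quick illustration of these two formulas, a small Python helper (using the same transaction-as-set representation as the Apriori sketch above; the example rules are the ones on the next slide):

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    # sup(LHS u RHS) / sup(LHS)
    return support(lhs | rhs, transactions) / support(lhs, transactions)

transactions = [
    {"Pen", "Ink", "Milk", "Juice"},
    {"Pen", "Ink", "Milk"},
    {"Pen", "Milk"},
    {"Pen", "Ink", "Juice"},
]
print(support({"Pen", "Milk"}, transactions))        # 0.75
print(confidence({"Pen"}, {"Milk"}, transactions))   # 0.75
print(confidence({"Ink"}, {"Pen"}, transactions))    # 1.0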


Examples:
{Pen} => {Milk}
Support: 75%
Confidence: 75%
{Ink} => {Pen}
Support: 75%
Confidence: 100%

(Same transaction table as above.)

An ISA hierarchy (category hierarchy) is imposed on the set of items.

Category hierarchy: Stationery → {Pen, Ink}; Beverages → {Juice, Milk}

Support with the hierarchy:
{Ink, Juice}: 50%
{Ink, Beverages}: 75%
{Stationery, Beverages}: 100%

The rule {Pen} => {Beverages} has support and confidence 100%.
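One simple way to compute support in the presence of such a hierarchy is to extend every transaction with the categories (ancestors) of its items and then measure support as before. A minimal sketch, assuming the two-level hierarchy above:

# Two-level hierarchy: item -> category.
HIERARCHY = {"Pen": "Stationery", "Ink": "Stationery",
             "Juice": "Beverages", "Milk": "Beverages"}

def extend(transaction):
    # Add each item's category to the transaction.
    return set(transaction) | {HIERARCHY[i] for i in transaction}

def support(itemset, transactions):
    extended = [extend(t) for t in transactions]
    return sum(1 for t in extended if set(itemset) <= t) / len(transactions)

transactions = [
    {"Pen", "Ink", "Milk", "Juice"},
    {"Pen", "Ink", "Milk"},
    {"Pen", "Milk"},
    {"Pen", "Ink", "Juice"},
]
print(support({"Ink", "Juice"}, transactions))             # 0.5
print(support({"Ink", "Beverages"}, transactions))         # 0.75
print(support({"Stationery", "Beverages"}, transactions))  # 1.0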

Calendric market basket analysis

Uses a collection of calendars, e.g.
every Sunday
every first day of the month

(Same transaction table as above.)

Restricted to the calendar "every first day of the month" (transactions 111 and 114), the rule {Pen} => {Juice} has 100% support and confidence; over all transactions it has only 50% support.

Grouping attributes and more sophisticated conditions can help us identify more complex rules.
The sequence of itemsets purchased by a customer can also be analyzed: sequential pattern algorithms look for frequently occurring subsequences.



Causal links

Bayesian Networks are graphs that can be used to describe a class of such models, with one node per variable or event and arcs between nodes to indicate causality.

(Figure: a small Bayesian network over events such as "think of writing instruments", "buy pencils", "buy pens" and "buy ink", with arcs indicating possible causal links.)
Example training database
Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
Age is ordered; Car-type is a categorical attribute
The class label indicates whether the person bought the product
If the dependent attribute is categorical, the rule we learn is called classification
If the dependent attribute is numerical, it is called regression

Age Car Class
20 M Yes
30 M Yes
25 T No
30 S Yes
40 S Yes
20 T No
30 M Yes
25 M Yes
40 M Yes
20 S No
Example decision tree (figure): if the customer has been renting the property for more than 2 years and the customer is older than 45, predict "buy property"; otherwise predict "rent property".
Given old data about customers and payments, predict a new applicant's loan eligibility.
(Figure) Previous customers, described by attributes such as Age, Salary, Profession, Location and Customer type, are fed to a classifier, which learns decision rules such as "Salary > 5 L" and "Prof. = Exec"; the rules are then applied to new applicants' data to label them good or bad.
Example training database
Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
Spent indicates how much the person spent during a recent visit to the web site
The dependent attribute is numerical, so this is a regression problem
Age Car Spent
20 M $200
30 M $150
25 T $300
30 S $220
40 S $400
20 T $80
30 M $100
25 M $125
40 M $500
20 S $420


(Figure) Example decision tree: the root splits on Age; for Age < 30 there is a further split on Car Type, with Minivan → YES and Sports/Truck → NO; for Age >= 30 the prediction is YES. The equivalent partitioning of the (Age, Car Type) attribute space is YES everywhere except the region with Age < 30 and Car Type Sports or Truck.
A decision tree T encodes d (a classifier or regression function) in the form of a tree.
A node t in T without children is called a leaf node; otherwise t is called an internal node.

A decision tree is constructed in two phases:
Growth phase
Pruning phase
Each internal node has an associated splitting attribute; the most common splitting criteria are binary predicates.
Example predicates:
Age <= 20
Profession in {student, teacher}
5000*Age + 3*Salary - 10000 > 0
Encoded classifier:
If (Age < 30 and CarType = Minivan) then YES
If (Age < 30 and (CarType = Sports or CarType = Truck)) then NO
If (Age >= 30) then YES
(Decision tree as in the earlier figure: Age < 30 → split on Car Type, with Minivan → YES and Sports/Truck → NO; Age >= 30 → YES.)
BuildTree(Node t, Training database D,
Split Selection Method S)

(1) Apply S to D to find splitting criterion
(2) if (t is not a leaf node)
(3) Create children nodes of t
(4) Partition D into children partitions
(5) Recurse on each partition
(6) endif
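A hedged sketch of the same idea in Python, using scikit-learn's DecisionTreeClassifier on the Age / Car-type training database shown earlier; the library handles split selection, node creation and recursive partitioning internally, and the dummy-column encoding of Car-type is an assumption made for illustration:

# Requires scikit-learn.
from sklearn.tree import DecisionTreeClassifier, export_text

# Training database from the earlier slide: Age, Car (M = Minivan, S = Sport, T = Truck), Class.
rows = [(20, "M", "Yes"), (30, "M", "Yes"), (25, "T", "No"), (30, "S", "Yes"),
        (40, "S", "Yes"), (20, "T", "No"), (30, "M", "Yes"), (25, "M", "Yes"),
        (40, "M", "Yes"), (20, "S", "No")]

# Encode the categorical Car attribute as three 0/1 indicator columns.
car_types = ["M", "S", "T"]
X = [[age] + [1 if car == c else 0 for c in car_types] for age, car, _ in rows]
y = [label for _, _, label in rows]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned splits; they should resemble the Age / Car-type tree in the slides.
print(export_text(tree, feature_names=["Age", "Car=M", "Car=S", "Car=T"]))
print(tree.predict([[25, 1, 0, 0]]))  # a 25-year-old Minivan owner -> expected 'Yes'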
Numerical or ordered attributes: find a split point that separates the (two) classes.

(Figure: examples plotted along the Age axis; a split point between Age 30 and 35 separates the YES examples from the NO examples.)
Categorical attributes: How to group?
Sport: Truck: Minivan:

(Sport, Truck) -- (Minivan)

(Sport) --- (Truck, Minivan)

(Sport, Minivan) --- (Truck)

Decision trees (figure): an example tree whose internal nodes test "Salary < 1 M", "Prof = teacher" and "Age < 30", and whose leaves are labelled Good or Bad.
Widely used learning method
Easy to interpret: can be re-represented as if-then-else rules
Approximates the function by piecewise-constant regions
Does not require any prior knowledge of the data distribution; works well on noisy data
Has been applied to:
classifying medical patients based on the disease,
diagnosing equipment malfunction by cause,
rating loan applicants by likelihood of payment.
Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle a large number of features

Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
Neural networks mimic the human brain by learning from a training dataset and applying the learning to generalize patterns for classification and prediction.

These algorithms are useful when the data is shapeless and lacks any apparent pattern.

The basic unit, corresponding to a neuron, is called a node.

The other structure is the link, which corresponds to a connection between neurons in the brain.

(Figure: a feed-forward network with an input layer, a hidden layer and an output layer.)
Description:
Difficult interpretation
Tends to overfit the data
Requires an extensive amount of training time
Requires a lot of data preparation
Works with all data types
Useful for learning complex data like handwriting, speech and image recognition
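A minimal sketch of training such a network in Python with scikit-learn's MLPClassifier (a small feed-forward network with one hidden layer); the toy dataset, layer size and other settings are assumptions made for illustration, not anything prescribed by the slides:

# Requires scikit-learn.
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Toy training data: [age, income] -> whether the customer responded to a promotion.
X = [[22, 18000], [25, 24000], [30, 32000], [35, 60000],
     [40, 75000], [45, 90000], [50, 52000], [28, 41000]]
y = [0, 0, 0, 1, 1, 1, 1, 0]

# Neural networks are sensitive to feature scale, so standardize the inputs first.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# One hidden layer with 5 nodes; the link weights are learned by back-propagation.
net = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)
net.fit(X_scaled, y)

# Predict for a new customer, scaled the same way as the training data.
print(net.predict(scaler.transform([[33, 55000]])))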

(Figure: comparison of the decision boundaries learned by a neural network, a classification tree and linear regression.)
Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features

Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing the number of nodes
Clustering or Unsupervised Learning
Unsupervised learning is used when old data with class labels is not available, e.g. when introducing a new product.
Group/cluster existing customers based on the time series of their payment history, such that similar customers fall in the same cluster.
Key requirement: a good measure of similarity between instances.
Identify micro-markets and develop policies
for each
Customer segmentation e.g. for targeted
marketing
Group/cluster existing customers based on time
series of payment history such that similar
customers in same cluster.
Identify micro-markets and develop policies for
each
Collaborative filtering:
group based on common items purchased
Text clustering

Groups of similar customers
Similar demographics
Similar buying behavior
Similar health
Similar products
Similar cost
Similar function
Similar store

Similarity is usually domain/problem specific
(Figure: example customer segmentation over attributes such as "has children", "married", "last car was a used one" and "owns car", grouped into three income segments: high income > 8,000; middle income between 3,000 and 8,000; low income < 3,000.)
Hierarchical clustering
Partitional clustering
distance-based: K-means
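A minimal sketch of distance-based partitional clustering with scikit-learn's KMeans; the two-feature customer data and the choice of three clusters are assumptions made for illustration:

# Requires scikit-learn.
from sklearn.cluster import KMeans

# Toy customer data: [monthly income, monthly spend].
customers = [[2500, 300], [2800, 350], [4500, 900], [5000, 1100],
             [9000, 2500], [9500, 2700], [3100, 400], [8700, 2300]]

# Partition the customers into 3 clusters by minimizing the distance to the cluster centroids.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

for customer, label in zip(customers, labels):
    print(customer, "-> cluster", label)
print("Cluster centroids:", kmeans.cluster_centers_)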
Given a database of user preferences, predict the preferences of a new user
Example: predict what new movies you will
like based on
your past preferences
others with similar past preferences
their preferences for the new movies
Example: predict what books/CDs a person
may want to buy
(and suggest it, or give discounts to tempt
customer)


(Table: a user-movie preference matrix with the movies Rangeela, QSQT, 100 Days, Anand, Sholay, Deewar and Vertigo as columns and the viewers Smita, Vijay, Mohan, Rajesh and Nina as rows; the new user Nitin's preferences are all unknown.)
Possible approaches:
Average the votes along each column [same prediction for all users]
Weight the votes based on similarity of likings [GroupLens]
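A minimal sketch of the second approach (similarity-weighted voting) in Python; the numeric ratings below are invented purely for illustration, since the original table's values are not reproduced here:

import math

# Hypothetical ratings (1-5); None = not rated. Columns follow the movie list above.
movies = ["Rangeela", "QSQT", "100 Days", "Anand", "Sholay", "Deewar", "Vertigo"]
ratings = {
    "Smita":  [3, 5, 4, None, 2, 1, None],
    "Vijay":  [4, 4, None, 3, 5, 5, 2],
    "Mohan":  [None, 2, 3, 5, 4, 4, 1],
    "Rajesh": [5, 4, 4, 2, None, 3, 3],
    "Nina":   [2, None, 3, 4, 5, 5, 2],
}
nitin = [4, 5, None, None, 2, None, None]   # Nitin's few known ratings

def similarity(a, b):
    # Cosine similarity over the movies both users have rated.
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return 0.0
    num = sum(x * y for x, y in pairs)
    den = math.sqrt(sum(x * x for x, _ in pairs)) * math.sqrt(sum(y * y for _, y in pairs))
    return num / den if den else 0.0

# Predict each of Nitin's missing ratings as a similarity-weighted average of the others' votes.
for i, movie in enumerate(movies):
    if nitin[i] is not None:
        continue
    weighted = total = 0.0
    for other in ratings.values():
        if other[i] is None:
            continue
        w = similarity(nitin, other)
        weighted += w * other[i]
        total += w
    if total:
        print("Predicted rating for", movie, "=", round(weighted / total, 1))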

External attributes of people and movies could be used to cluster:
age and gender of people
actors and directors of movies
[may not be available]
Cluster people based on movie preferences:
misses information about the similarity of movies
Repeated clustering:
cluster movies based on people, then people based on movies, and repeat
ad hoc, might smear out the groups

(Tables: the same user-movie preference matrix shown twice, before and after reordering rows and columns so that similar viewers and similar movies end up next to each other; Nitin's row remains all unknown.)
