
Business Intelligence Technologies

Data Mining
http://www.monografias.com/trabajos55/mineria-de-datos/mineria-dedatos2.shtml
Lecture 1 Introduction

Agenda

Course Description

Course Logistics

Case discussion

Introduction to Data Mining

What is covered in this course

Theories/Methods
  Data mining cycle/process/methodology, evaluation
  Association rules, decision trees, clustering, nearest neighbor, neural networks, link analysis, Web mining, etc.

Applications
  Market basket analysis, customer segmentation, CRM, personalization, financial analysis, etc.

Business Cases

Hands-on Experience
  SAS Enterprise Miner

Course Objectives

Understand data mining theories
Learn popular data mining methods
Be able to solve specific business applications
Master a data mining package

Course Logistics

Qing Li: kooliqing@gmail.com

TA: Jia Wang, xiaojiajia198796@gmail.com

Office hours:
  Walk-in
  By appointment
  Before and after class
  Call me

Class Resources

Class homepage: http://liqing.cai.swufe.edu.cn/
  Slides, announcements, and downloads are posted there

Text Book + Cases + Handouts

Text Book
  Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, Second Edition
  Michael Berry and Gordon Linoff, 2004, Wiley, ISBN 0-471-47064-3

Class Schedule

No.  Topic
1    Course Overview, Intro to Data Mining
2    Market Basket Analysis & Association Rules, CRM
3    Market Segmentation & Clustering, Prepare data
4    Prediction & Classification, Decision Tree
5    Personalization & Nearest Neighbor
6    Financial Forecasting & Neural Networks
7    Link Analysis & Web mining
8    Misc. Topics
9    Guest Speaker
10   Term project presentations

Group Term Project

Group of 2-3 (3 is better)
  Due one week from now

Identify a company to study

Focus: Data and Business Intelligence
  Current practice
  Your recommendations

Two phases
  Phase 1: Identify the company and give a brief description (due 3 weeks from now)
  Phase 2: Final report + class presentation

Software

SAS Enterprise Miner
  Used for homework assignments
  Needs Windows XP Professional or Mac OS 9

I'll demo SAS in most classes.
Tutorial available on the course website
Every student is encouraged to have a copy in order to follow the class demos.
Alternative for Vista users: WEKA


Grading

Participation (15%)
  3: Excellent
  2: Good
  1: OK
  0: Absent with good reason and advance notification
  -3: Absent with no reason

Homework (50%)
  2 big assignments, 25% each
  Problem solving, data analysis and/or case discussion

Term Project (35%)
  Phase 1 report: 5%
  Final report: 20%
  Class presentation: 5%
  Peer evaluation: 5%

(No Curve)

Misc. Issues

Slides are available before class
  Download or print them before class

Lectures may differ from the textbook
  Some material in the lectures may not be in the book, so please pay attention in class
  The book is a great reference, not a bible

Finish assigned case readings before each class

Attendance is required


Survey


Case 1: Bank of America

Discussion Questions:
1. What is BoA trying to achieve?
2. What are the alternative solutions? What are the pros and cons of each?
3. What are the stages of data mining? Describe each.
4. What data mining techniques are used, and what are the findings from each technique?


Case 2: A Wireless Company

Discussion Questions:
1. What is the company trying to achieve?
2. How can data mining help?
3. Where did the data come from, and how are the data processed?
4. How is the data mining approach evaluated?


Case 3: SUV

Discussion Questions:
1. What is the company trying to achieve?
2. How can data mining help?
3. What data files are used? What information is contained in these files?
4. How are the two data mining techniques combined, and why is the combination more powerful?


What is data mining?

Informal definition: Finding patterns in data

More formal definition: Non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data

Business Intelligence: a process for increasing the competitive advantage of a business by intelligent use of available data in decision making. (one definition)

What is a pattern?

Informal definition: Any structure that can be found in the data, e.g.
  People with good credit ratings have fewer accidents
  Risk = 0.93*prior_default + 0.23*num_cards - 1.3*employed
  On Friday nights, male customers who buy diapers also tend to buy beer

Not every pattern is desirable
  People with high income buy expensive cars
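To make the idea of a formula as a pattern concrete, here is a minimal sketch that evaluates the linear risk score from the slide for two made-up applicants. The minus sign on the `employed` term is an assumption (the operator is missing in the extracted slide), and the function name and applicant values are hypothetical.

```python
def risk_score(prior_default: int, num_cards: int, employed: int) -> float:
    """Linear risk pattern from the slide.

    Assumption: the `employed` term is subtracted, i.e. being employed
    lowers the score. The sign is not legible in the source.
    """
    return 0.93 * prior_default + 0.23 * num_cards - 1.3 * employed


# Two hypothetical applicants (values invented for illustration).
applicant_a = risk_score(prior_default=1, num_cards=4, employed=0)
applicant_b = risk_score(prior_default=0, num_cards=2, employed=1)

print(f"Applicant A risk: {applicant_a:.2f}")
print(f"Applicant B risk: {applicant_b:.2f}")
```

Under these assumptions, the applicant with a prior default, more cards, and no job scores higher than the employed applicant with a clean history.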



Why Data Mining?

Because data mining affects virtually every data-intensive industry.

Marketing
  Which customers are likely to respond to this campaign?
  What other products or services should be offered to a customer? (cross-selling)
  What types of customers are loyal?

Telecommunications
  Which customers will switch to competitors?
  Which calls are fraudulent?

Retail
  Which products do customers buy together (or in sequence)?

Healthcare
  Which patients may take longer to recover?
  What is the likely cause of an illness?

Finance and Insurance
  What types of customers have high credit risks or insurance risks?
  What interest rate or insurance premium should be offered to different customers?
  Which stocks are likely to perform well in the next 3 months?

Customer Support
  Which customer service representative should be assigned to a task?
  When a customer calls, the customer representative's screen shows exactly where to lead the conversation.

Wherever there is data, there is and should be data mining!


Why Data Mining? Some Real Examples

Safeway:
  Shopper cards capture point-of-sale data and personal information.
  Arrange products on shelves: Beer & Diaper
  Sell names to suppliers so that manufacturer coupons can be targeted.

Pfizer pharmaceuticals:
  Construct a predictive model which tells patients their cholesterol risk score.
  High-risk patients can request Lipitor, Pfizer's cholesterol medication.

Fidelity:
  Cross-selling: when a customer calls, know what other services to offer
  Build models to figure out what makes a loyal customer
  These models saved a marginally profitable bill-paying service

Amazon:
  Recommendations

Capital One:
  What terms should be offered to different customers?
  The lowest loan loss rates in the industry

Why Data Mining Now?


Three trends converge to enable data mining (DM):
  Better and cheaper computing power
  Mature data mining technology
  Improved data collection and storage

Plus:
  Data is being produced at a tremendous speed.
  Competitive pressures are enormous.

Descriptive vs. Predictive Data Mining

Descriptive DM is used to learn about and understand the data.
  What items are purchased together?
  Identify and describe groups of customers with common buying behavior

Predictive DM aims to build models in order to predict unknown values of interest.
  A model that, given a customer's characteristics, predicts how much the customer will spend on the next catalog order
  Predicting which customers are likely to leave
  Which direction is Stock X going to move tomorrow?

Most predictive models are also descriptive
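As an illustration of the descriptive question "What items are purchased together?", the sketch below computes the usual association-rule measures (support, confidence, lift) for one item pair. The baskets are invented toy data and the helper names are hypothetical; this is not the procedure used by any particular package.

```python
# Toy transactions, invented for illustration only.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]


def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) from basket counts."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence relative to the consequent's baseline frequency."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
rule_lift = lift({"diapers"}, {"beer"}, baskets)

print(f"support={rule_support:.2f} "
      f"confidence={rule_confidence:.2f} lift={rule_lift:.3f}")
```

A lift above 1 suggests the two items appear together more often than they would if purchased independently. With only six toy baskets, the numbers are purely illustrative.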


Data Mining Software

Big Names:

IBM Intelligent Miner


SPSS Clementine
Microsoft SQL Server 2000 Analysis Services
Oracle 9i Data Mining
SAS Enterprise Miner

Smaller Companies:
ANGOSS KnowledgeStudio
XLMiner
MegaPuter PolyAnalyst
DBMiner

Free or Open Source:


Weka
Lots of free programs on the Internet supporting individual data mining
techniques.

A good portal for data mining related stuff:

http://www.kdnuggets.com

Virtuous Cycle of Data Mining

Finding patterns is not enough
  Must respond to the patterns by taking action

Turning:
  Data into Information
  Information into Action
  Action into Value

1, Identify the business problem
2, Mine the data to transform it into actionable information
3, Act on the information
4, Measure the results

1, Identify the Business Opportunity

Many business processes are good candidates:

New product introduction

Direct marketing campaign

Understanding customer attrition/churn

Evaluating the results of a test market

Or more specific problems

What types of customers responded to our last campaign?

Where do the best customers live?

Are long waits in check-out lines a cause of customer attrition?

What products should be promoted with our XYZ product?

TIP: When talking with business users about data mining opportunities, make sure you focus on the business problems/opportunities and not on technology and algorithms.

Another goal of this course is for you to think strategically about what business opportunities can be addressed by data mining techniques.

2, Mining the Data to Transform it into Actionable Information

Success is making business sense of the data
  Need to figure out the specific data mining tasks that address the business opportunities identified in the first step.

Deal with messy data
  Don't expect clean data. Data cleaning accounts for 70% of the effort.

Implementation problems:
  What techniques to use?
  How to use the techniques?
  Selecting the right model

Other problems
  Data privacy issues



3, Take Action

Taking action is the whole purpose of data mining
  With the patterns discovered from mining the data, we can make better-informed decisions.

Examples
  Contacting targeted customers
  Prioritizing customer service
    Cingular and AT&T were fined $1.5 million on Sept. 10, 2004 for discriminating in their services based on customers' credit ratings.
  Adjusting inventory levels
  Rearranging products on the shelves
  Verizon sends out 40k mailings to selected customers per month

4, Measuring Results

Assess the impact of the action taken
  Often overlooked, ignored, skipped
  Planning for the measurement should begin when analyzing the business opportunity, not after it is all over

Assessment questions (examples):
  Did this campaign do what we hoped?
  Did some offers work better than others?
  Did it lower costs or increase profit?
  Tons of others

Data Mining General Guidelines


The DM virtuous cycle (4 steps) is iterative
No steps should be skipped
Common sense prevails with respect to how rigorously each step is carried out
The 4 steps of the virtuous cycle expand into a more rigorous 11-step methodology


Detailed Data Mining Process: 11 Steps

1, Translate the business problem into a data mining problem
2, Select appropriate data
3, Get to know the data
4, Create a model set
5, Fix problems with the data
6, Transform data to bring information to the surface
7, Build models
8, Assess models
9, Deploy models
10, Assess results
11, Begin again


Step 1: Transforming Business Problems into DM Tasks

Business problems can often be big and vague
Data mining tasks need to be more concrete

Sample business problems:
  How to improve response to a direct marketing campaign?
  Which ads to place on web pages in order to improve click-through rate?

How to transform these into DM tasks?


Steps 2-6: Data Preparation

Get data
  Different (heterogeneous) sources
  Need to collect additional data?
  Credit card charge records, point-of-sale data, web logs, etc.

Clean/correct data
  Correct errors
  Fill in missing values
  Discard garbage, remove outliers

Transform data if needed
  Derived attributes: bring information to the surface
  Income -> Income bracket, when the model requires categorical data
  DOB -> Age
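The cleaning and transformation steps above can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: the records, the reference date `AS_OF`, the bracket cut-offs, and the choice of the median to fill a missing income.

```python
from datetime import date
from statistics import median

# Hypothetical raw customer records; one income value is missing.
raw = [
    {"id": 1, "dob": "1975-06-30", "income": 52000},
    {"id": 2, "dob": "1990-01-15", "income": None},
    {"id": 3, "dob": "1962-11-02", "income": 131000},
    {"id": 4, "dob": "1985-03-09", "income": 28000},
]
AS_OF = date(2005, 1, 1)  # assumed reference date for computing age


def age_on(dob_iso: str, as_of: date) -> int:
    """DOB -> Age: completed years between the birth date and `as_of`."""
    dob = date.fromisoformat(dob_iso)
    had_birthday = (as_of.month, as_of.day) >= (dob.month, dob.day)
    return as_of.year - dob.year - (0 if had_birthday else 1)


def income_bracket(income: float) -> str:
    """Income -> Income bracket, using arbitrary illustrative cut-offs."""
    if income < 40000:
        return "low"
    if income < 100000:
        return "medium"
    return "high"


# Fill in missing incomes with the median of the known values
# (one simple option among many).
fill_value = median(r["income"] for r in raw if r["income"] is not None)

prepared = []
for r in raw:
    income = r["income"] if r["income"] is not None else fill_value
    prepared.append({
        "id": r["id"],
        "age": age_on(r["dob"], AS_OF),
        "income_bracket": income_bracket(income),
        "income_imputed": r["income"] is None,  # keep track of filled values
    })

for row in prepared:
    print(row)
```

Keeping an `income_imputed` flag is a design choice: it lets a later modeling step distinguish observed values from filled-in ones.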

Steps 7-9: Model Building

Choice of model, model building, and model assessment

Decide what model type to use
  Descriptive or predictive model?
  Which specific technique?
  Often can try different techniques
  Things to consider:
    Computational issues
    Implementation issues
    Availability and amount of relevant data
    Do we have the necessary expertise?

Assess Models
  Accuracy on testing data
  Small is beautiful
    Easier to understand

Step 9 is more about scoring or ranking the real data
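To illustrate "accuracy on testing data" and "small is beautiful", the following sketch fits a deliberately tiny model (a single spending threshold) on a training split and measures its accuracy on a held-out test split. The data are synthetic and every name is hypothetical; a real project would use the course software or another mining package.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic customers: higher spend makes a response (label 1) more likely.
data = []
for _ in range(200):
    spend = random.uniform(0, 100)
    label = 1 if spend + random.gauss(0, 15) > 50 else 0
    data.append((spend, label))

# Hold out 30% of the rows; the model never sees them while being built.
random.shuffle(data)
train, test = data[:140], data[140:]


def accuracy(threshold, rows):
    """Share of rows where the rule 'spend > threshold' matches the label."""
    return sum((spend > threshold) == bool(label) for spend, label in rows) / len(rows)


# Build: choose the threshold with the best accuracy on the training rows.
candidates = sorted(spend for spend, _ in train)
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

train_acc = accuracy(best_threshold, train)
test_acc = accuracy(best_threshold, test)

print(f"threshold={best_threshold:.1f} "
      f"train accuracy={train_acc:.2f} test accuracy={test_acc:.2f}")
```

The test accuracy is the number to report, because the threshold was tuned to the training rows. A one-rule model like this is also easy to explain to business users.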



Step 10: Assess the Result

It's not model accuracy any more
  It's more about achieving the business goal
  It's closely related to business decisions
  E.g., if it's more expensive to deploy a data mining model, a mass mailing may be more cost-effective than a targeted one.

But it's often hard to isolate the effect of a solution. Indirect benefits may be hard to see.
  Do a market test
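The mass-mailing versus targeted-mailing trade-off can be worked through with simple arithmetic. In the sketch below all figures (mailing sizes, response rates, profit per response, cost per piece, and the cost of deploying the model) are hypothetical, chosen only to show how a deployment cost can tip the comparison.

```python
def campaign_profit(n_mailed, response_rate, profit_per_response,
                    cost_per_piece, fixed_cost=0.0):
    """Net profit of a mailing: response revenue minus mailing and fixed costs."""
    revenue = n_mailed * response_rate * profit_per_response
    return revenue - n_mailed * cost_per_piece - fixed_cost


# Mass mailing: everyone gets the offer, low response rate, no model needed.
mass = campaign_profit(100_000, 0.01, 50.0, 0.30)

# Targeted mailing: fewer pieces and a higher response rate, but the data
# mining model costs money to build and deploy (assumed fixed cost).
targeted = campaign_profit(20_000, 0.03, 50.0, 0.30, fixed_cost=15_000)

# The same targeted campaign if the model were free to deploy.
targeted_free_model = campaign_profit(20_000, 0.03, 50.0, 0.30)

print(f"mass mailing profit:      {mass:,.0f}")
print(f"targeted mailing profit:  {targeted:,.0f}")
print(f"targeted, free model:     {targeted_free_model:,.0f}")
```

With these made-up numbers the targeted campaign wins only when the model is cheap to deploy. The response rates themselves would have to come from a market test.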


Common Data Mining Mistakes

Learn things that aren't true
  Patterns may not represent any underlying rule
    Tall candidates win presidential elections
    True in the data, but has no predictive power
  The data may not reflect the relevant population
    The sample should not be biased
    Otherwise, the result cannot be generalized
    E.g., your existing customers are not like the customers you want to acquire
  Data may be at the wrong level of detail
    Refer to Simpson's paradox (next slide)

Learn things that are true, but not useful
  Things that are already known
    The majority of rules learned are normal business rules
    E.g., retired employees don't respond to retirement plan promotions
  Things that can't be used (AT&T/Cingular example)
    Inability to act upon patterns for political, legal, and ethical reasons

Simpson's Paradox

Business School
         Admit       Deny        Total
Male     480 (80%)   120 (20%)   600
Female   180 (90%)    20 (10%)   200

Law
         Admit       Deny        Total
Male      10 (10%)    90 (90%)   100
Female   100 (33%)   200 (67%)   300

Simpson's Paradox refers to the reversal of the direction of a comparison or an association when data from several groups are combined to form a single group. Women have the higher admission rate in each school, yet when the two tables are pooled men are admitted at 70% (490 of 700) and women at 56% (280 of 500). This is caused by the different admission percentages in the two tables; they really shouldn't be combined.
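The reversal can be checked directly from the counts in the two tables. The short sketch below recomputes the admission rate per school and for the pooled data; the dictionary layout and variable names are just one convenient way to hold the slide's numbers.

```python
# (admitted, total applicants) per school and gender, taken from the tables.
admissions = {
    "Business School": {"Male": (480, 600), "Female": (180, 200)},
    "Law": {"Male": (10, 100), "Female": (100, 300)},
}

# Admission rate within each school.
per_school = {
    school: {gender: admitted / total for gender, (admitted, total) in groups.items()}
    for school, groups in admissions.items()
}

# Pool the two schools by adding up the raw counts for each gender.
pooled_counts = {}
for groups in admissions.values():
    for gender, (admitted, total) in groups.items():
        prev_admitted, prev_total = pooled_counts.get(gender, (0, 0))
        pooled_counts[gender] = (prev_admitted + admitted, prev_total + total)

pooled = {gender: admitted / total
          for gender, (admitted, total) in pooled_counts.items()}

for school, rates in per_school.items():
    print(school, {gender: f"{rate:.0%}" for gender, rate in rates.items()})
print("Pooled", {gender: f"{rate:.0%}" for gender, rate in pooled.items()})
```

The pooled comparison flips because most men applied to the school with the high admission rate, while most women applied to the school with the low one.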

What to Do After Class

Read Chapters 1, 2, and 3
Read cases for Lecture 2
Install SAS
Find group members for your term project and start thinking about which company to select for your project
