
KDD-07 Invited Innovation Talk

August 12, 2007

Usama Fayyad, Ph.D.


Chief Data Officer & Executive VP
Yahoo! Inc.

Research 0

Thanks and Gratitude

My family: my wife Kristina and my 4 kids; my parents and my sisters


My academic roots: The University of Michigan, Ann Arbor; my Ph.D. committee,
including Ramasamy Uthurusamy (then at GM Research Labs); grad student colleagues (Jie
Cheng); internships at GM Research and at NASA's JPL
My Mentors and Collaborators
Caltech Astronomy (G. Djorgovski, Nick Weir), Pietro Perona and M.C. Burl
JPL-NASA Colleagues: Padhraic Smyth, Rich Doyle, Steve Chien, Paul Stolorz, Peter
Cheeseman, David Atkinson, many others
Microsoft Colleagues: Decision Theory Group, Surajit Chaudhuri, Jim Gray, Paul Bradley,
Bassel Ojjeh, Nick Besbeas, Heikki Mannila, Rick Rashid, many others
Fellows in KDD: Gregory Piatetsky-Shapiro, Daryl Pregibon, Christos Faloutsos, Geoff
Webb, Bob Grossman, Jiawei Han, Eric Tsui, Tharam Dillon, Chengqi Zhang, many, many
colleagues
My Business Partners
Bassel Ojjeh, Nick Besbeas, many VCs, many advisers and strategic clients including
Microsoft SQL Server and sales teams
My Yahoo! Colleagues:
Zod Nazem, Jerry Yang, David Filo, Yahoo! exec team, Prabhakar Raghavan, Pavel
Berkhin, Nick Weir, Hunter Madsen, Nitin Sharma, Raghu Ramakrishnan, Y! Research
folks, many at Yahoo SDS and current and previous Yahoo! employees

A Data Miner's Story
Getting to Know the Grand Challenges
Personal Observations of a Data Mining Disciple

Usama Fayyad, Ph.D.


Chief Data Officer & Executive VP
Yahoo! Inc.


Overview

The setting
Why is data mining a must?
Why is data mining not happening?
A Data Miner's Story
Grand Challenges: Pragmatic
Grand Challenges: Technical
Some case studies
Concluding Remarks


The data gap

The Machinery Moves on:


Moore's law: processing capacity doubles every 18 months (CPU,
cache, memory)
Its more aggressive cousin: disk storage capacity doubles every 9
months
The Demand is exploding:
Every business is an eBusiness
Scientific instruments and Moore's law
Government
The Internet: the ubiquity of the Web
The Talent Shortage
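
The two doubling rates on this slide compound into a widening gap between what can be stored and what can be processed. A minimal sketch of that arithmetic, assuming clean exponential doubling; the function name `growth_factor` and the three-year horizon are illustrative choices, not from the talk:

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Capacity multiplier after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_period_months)


horizon = 36  # three years, chosen only for illustration
processing = growth_factor(horizon, 18)  # processing doubles every 18 months
storage = growth_factor(horizon, 9)      # disk storage doubles every 9 months

print(f"processing grows {processing:.0f}x")            # 4x
print(f"storage grows {storage:.0f}x")                  # 16x
print(f"data gap widens {storage / processing:.0f}x")   # 4x
```

Under these assumptions, stored data outgrows processing capacity by a factor of four every three years.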


What is Data Mining?

Finding interesting structure in data


Structure: refers to statistical patterns, predictive
models, hidden relationships
Interesting: ?

Examples of tasks addressed by Data Mining


Predictive Modeling (classification, regression)
Segmentation (data clustering)
Affinity (Summarization)
relations between fields, associations, visualization


Beyond Data Analysis

Scaling analysis to large databases


How to deal with data without having to move it out?
Are there abstract primitive accesses to the data, in database
systems, that can provide mining algorithms with the
information to drive the search for patterns?
How do we minimize, or sometimes even avoid, having to scan
the large database in its entirety?
Automated search
Enumerate and create numerous hypotheses
Fast search
Useful data reductions
More emphasis on understandable models
Finding patterns and models that are interesting or novel to
users.
Scaling to high-dimensional data and models.

Data Mining and Databases

Many interesting analysis queries are difficult to state precisely

Examples:
Which records represent fraudulent transactions?
Which households are likely to prefer a Ford over a Toyota?
Who's a good credit risk in my customer DB?

Yet the database contains the information


good/bad customer, profitability
did/did not respond to mailout/survey/...
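
The point of this slide is that a query such as "who is a good credit risk?" is hard to write by hand, while labeled outcomes already sit in the database. A minimal sketch of learning the predicate from those labels, here with a one-threshold decision stump. The field `late_payments`, the labels, and the records are hypothetical:

```python
from typing import List, Tuple

# Hypothetical labeled records: (late_payments, good_customer)
records: List[Tuple[int, bool]] = [
    (0, True), (1, True), (2, True), (3, False),
    (4, False), (1, True), (5, False), (0, True),
]


def fit_stump(rows: List[Tuple[int, bool]]) -> Tuple[int, int]:
    """Return (threshold, errors) for the rule `value <= threshold -> good`
    that misclassifies the fewest labeled rows."""
    best_threshold, best_errors = 0, len(rows) + 1
    for threshold in sorted({value for value, _ in rows}):
        errors = sum((value <= threshold) != label for value, label in rows)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


threshold, errors = fit_stump(records)
print(f"predict good when late_payments <= {threshold} ({errors} errors)")
```

The learned threshold plays the role of the WHERE clause nobody could state up front.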


Data Mining Grand Vision


ACME CORP ULTIMATE DATA MINING BROWSER

What's New? What's Interesting?

Predict for me


The myths
Companies have built up some large and
impressive data warehouses
Data mining is pervasive nowadays
Large corporations know how to do it
There are tools and applications that discover
valuable information in enterprise databases


The truths
Data is a shambles:
most data mining efforts end up not benefiting
from existing data infrastructure
Corporations care a lot about data, and are
obsessed with customer behavior and
understanding it
They talk a lot about it
An extremely small number of businesses are
successfully mining data
The successful efforts are one-off, lucky
strikes

Current state of Databases

Ancient Egypt
Data navigation, exploration, & exploitation technology
is fairly primitive:
we know how to build massive data stores
we do not know how to exploit them
we do the book-keeping really well (OLTP)
Inadequate basic understanding of navigation/systems
many large data stores are write-only (= data tomb)


A Data Miner's Story

Started out in pure research


Professional student
Math and algorithms


Researcher view

Database

Algorithms and
Theory

Systems


Practitioner view

Database
Customer
Systems and integration

Algorithms


Business view

Customer Database

Systems Algorithms

$$$s


A Data Miner's Story


Started out in pure research
At NASA-JPL did basic research and applied
techniques to Science Data Analysis problems
Worked with top scientists in several fields: astronomy,
planetary geology, atmospherics, space science, remote
sensing imagery
Great results, strong group, lots of funding, high demand

So why move to Microsoft Research?



Example: Cataloging Sky Objects


Data Mining Based Solution

94% accuracy in recognizing sky objects


Speed up catalog generation by one to two orders of
magnitude (unrealistic to perform manually).
Classify objects that are at least one magnitude fainter than
catalogs to-date.
Tripled the data yield
Generate sky catalogs with much richer content:
on the order of billions of objects:
> 2 x 10^7 galaxies, > 2 x 10^8 stars, 10^5 quasars
Discovered new quasars 40 times more efficiently


A Data Miner's Story

Started out in pure research


At NASA-JPL
At Microsoft Research
Basic research in algorithms and scalability
Began to worry about building products and integrating
with database server
Two groups established: research and product

So why move out to a start-up?


Working with Large Databases

One scan (or less) of the database


terminate early if appropriate
Work within confines of a given limited RAM
buffer
Cluster a gigabyte or terabyte in, say, 10 or 100
megabytes of RAM
Anytime algorithm
best answer always handy
Pause/resume enabled, incremental
Operate on forward-only cursor over a view
(essentially a data stream)
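
The requirements above (one scan, bounded memory, a best answer always at hand, forward-only access) can be illustrated with a small sketch. It uses sequential (online) k-means as a stand-in technique; it is not the algorithm from the talk. The seeding rule (first k points) and the toy data are assumptions made for the example:

```python
from typing import Iterable, Iterator, List, Sequence


def online_kmeans(stream: Iterable[Sequence[float]], k: int) -> Iterator[List[List[float]]]:
    """Single forward-only pass; memory is O(k * dim) regardless of stream length.

    Yields the current centroids after every point, so the caller always has a
    best-so-far answer and can stop at any time.
    """
    centroids: List[List[float]] = []
    counts: List[int] = []
    for point in stream:
        if len(centroids) < k:
            # Assumption: seed centroids with the first k points seen.
            centroids.append(list(point))
            counts.append(1)
        else:
            # Assign to the nearest centroid by squared Euclidean distance.
            nearest = min(
                range(k),
                key=lambda j: sum((c - x) ** 2 for c, x in zip(centroids[j], point)),
            )
            counts[nearest] += 1
            step = 1.0 / counts[nearest]
            # Incremental mean update: the centroid stays the mean of its points.
            centroids[nearest] = [
                c + step * (x - c) for c, x in zip(centroids[nearest], point)
            ]
        yield [list(c) for c in centroids]


# Toy stream with two well-separated groups (hypothetical data).
data = [(0.0, 0.0), (10.0, 10.0), (0.0, 2.0), (10.0, 12.0), (2.0, 0.0), (12.0, 10.0)]
final = None
for final in online_kmeans(iter(data), k=2):
    pass  # a real caller could break out here at any point
print(final)
```

Only the k centroids and their counts are kept in memory, so the same loop works whether the cursor returns six rows or six billion.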


Business Results Gap

Business users are unable to apply the power of existing data mining tools to achieve results

[Figure: a gap separates two groups of labels.
Business Challenges: Acquisition, Conversion, Average Order, Retention, Loyalty
Technologies and Technical Tools: Neural Networks, CART, Segmentation, OLAP, Logistic Regressions, Decision Trees, Bayesian Networks, Genetic Algorithms, CHAID]


Business Results Gap

Business users are unable to apply the power of existing data mining tools to achieve results

[Figure: the same gap, now bridged by specialists.
Business Challenges: Acquisition, Conversion, Average Order, Retention, Loyalty
Specialists: Statisticians, Data Mining PhDs, DBAs, Consultants
Technologies and Technical Tools: Neural Networks, CART, Segmentation, OLAP, Logistic Regressions, Decision Trees, Bayesian Networks, Genetic Algorithms, CHAID]


Evolving Data Mining

Evolution on the technical front:


New algorithms
Embedded applications
Make the analyst's life easier

Evolution on the usability front


New metaphors
Vertical applications embedding
Used by the business user

In both cases, success means invisibility


Grand Challenges

Pragmatic:
Achieving integration and invisibility
Research/Technical:
Solving some serious unaddressed problems


Pragmatic Grand Challenge 1

Where is the data?


There is a glut of stored data
Very little of that data is ready for mining
Data warehousing has proven that it will not
solve the problem for us

Solution:
integration with operational systems
Take a serious database approach to solving the
storage management problem


digiMine Background

Started as a venture capital-funded company: digiMine, Inc. in March 2000.
Built, operated and hosted data warehouses
with built-in data mining apps
Headquartered in Bellevue, Washington
$45 million in funding: Mayfield, Mohr
Davidow, American Express, Deutsche Bank
Grew to over 120 employees
50+ patents in technology and processes
Both technology and services


Sample Customers


A Data Miner's Story

Started out in pure research


At NASA-JPL
At Microsoft Research
At digiMine
Lots of VC funding, great team, great press coverage,
and fast moving
great customers

So why move to DMX Group?


Why DMX Group?

At digiMine, we grew a large Professional Services organization
We learned a lot from these engagements
VC-funded companies cannot do much consulting
A fork in the road appeared
digiMine re-focused on a market vertical: behavioral
targeting for media and publishers
Renamed to Revenue Science, Inc.
Formed DMX Group which was eventually acquired by
Yahoo!


DMX Group Mission

Make enterprise data a working asset in the enterprise:
Data strategy for the business
Implementation of Business Intelligence and data
mining capabilities
Business issues around data
What is possible?
How to expose it to business users
How to train people and change processes
Integration with operational systems


Data Strategy

How can your data influence your revenues?


How do you optimize operations based on data?
How do you increase customer retention based on
data?
How do you utilize enterprise data assets to spot
new opportunities:
Cross-sell to existing customers
Grow new markets
Avoid problems such as fraud, abuse, churn, etc?


A Data Miner's Story

Started out in pure research


At NASA-JPL
At Microsoft Research
At digiMine/Revenue Science Inc.
At DMX Group


Pragmatic Grand Challenge 2

Embedding within Operational Systems


We all worry about algorithms; they are fascinating
Most of us know that data mining in practice is mostly data prep
work
Go where the data is when the data does not come to you

But how much of the problem is data mining?


facts:
The effort in embedding an application is huge, and often not
discussed
Without it, all the algorithms are useless


Case Study: Wireless Telco
Churn Modeling and Prediction


Modeling Process

[Figure: churn modeling process. Recoverable labels: data sources SMS, WAP, CDR, Billing; a Customer Interaction Database (1); numbered steps 2 Sample, 3 Build (Churn Model), 4 Score, 5 Assign (Customer Value); step 6 outputs High / Med / Low Risk and High / Med / Low Val, combined into a 3 x 3 grid of value by risk.]
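
The final step of the process places each customer in a value-by-risk grid. A minimal sketch of that assignment, assuming each customer already has a churn probability from the scoring step and a value estimate. The cut points and customer numbers are hypothetical; in practice they would come from the business:

```python
def bucket(score: float, low_cut: float, high_cut: float) -> str:
    """Map a numeric score to 'Low', 'Med', or 'High' using two cut points."""
    if score >= high_cut:
        return "High"
    if score >= low_cut:
        return "Med"
    return "Low"


def segment(churn_probability: float, value: float) -> str:
    # Assumed cut points, for illustration only.
    risk = bucket(churn_probability, low_cut=0.2, high_cut=0.6)
    val = bucket(value, low_cut=100.0, high_cut=500.0)
    return f"{val} Val / {risk} Risk"


# Hypothetical customers: (churn probability, value estimate)
customers = {"A": (0.75, 820.0), "B": (0.10, 40.0), "C": (0.35, 510.0)}
for name, (p, v) in customers.items():
    print(name, segment(p, v))
```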

LTV and Its Application

A customer's lifetime value (LTV) is the net value that the customer brings
to a business by the end of their service, i.e., their profit
contribution.

LTV allows decisions for individual customers that optimize the return on investment (ROI).
Examples:
Aggressive retention programs, such as equipment
upgrade and contract renewal for high LTV.
Differentiated customer care treatment for reactivations
by customers with low LTV
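
The talk defines LTV as net profit contribution but does not give a formula. One common way to forecast it, sketched here as an assumption rather than as the method used in the case study, is to sum the expected monthly margin weighted by the probability that the customer is still active, optionally discounted. All parameter names and numbers are illustrative:

```python
def lifetime_value(monthly_margin: float, monthly_churn: float,
                   horizon_months: int, monthly_discount: float = 0.0) -> float:
    """Expected discounted margin over the horizon.

    Month t contributes margin * P(customer still active in month t),
    discounted back to today.
    """
    total = 0.0
    survival = 1.0  # probability the customer is still active
    for month in range(1, horizon_months + 1):
        total += monthly_margin * survival / (1.0 + monthly_discount) ** month
        survival *= 1.0 - monthly_churn
    return total


# Tiny hand-checkable case: 20 + 20*0.5 + 20*0.25 = 35
print(lifetime_value(monthly_margin=20.0, monthly_churn=0.5, horizon_months=3))

# A customer whose costs exceed revenue has negative forecasted LTV.
print(lifetime_value(monthly_margin=-5.0, monthly_churn=0.02, horizon_months=24))
```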


What is Required?

Detailed data
Integration of CDR, WIG, SMS, Billing
Maintained at detailed level
Integrated data mining
Algorithms tuned to model thousands of variables and millions of
rows
Accurate Forecasts
System Robustness
Massively scalable back end system
Flexible architecture to create new variables quickly and easily
Collaborative Service Model
Service model which guarantees success
Combined IQ Model to optimize science and business knowledge
Low cost to create and maintain models


Map Segments to Actions

[Figure: grid mapping customer segments to actions. Vertical axis: Churn Probability (Low to High). Horizontal axis: Forecasted LTV (Negative, Low, High). Actions shown in the grid: Let them go, Cautiously Defend, Aggressively Defend, Save Program, Contract Renewal, Equipment Upgrade, Cost Reducing Programs, Feature Add, Elite Program, Change Plan, Grow Margin, Nurture / Maintain, Bad Behavior Migration, Feature Use, Loyalty Programs.]


Cost Rules Applied

Cost Rules are introduced to define scoring

For Example:
Network System Usage Cost
Mobile to Land Connections Costs
Technical Operations/Support Costs
Long Distance Costs
Inter-Carrier/International subsidy costs
Roaming Costs
Bad Debt Allocation
Many others
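
Cost rules like those listed above can be expressed as small functions applied to a customer's usage record, with the margin being revenue minus the summed costs. A minimal sketch; the rule names, rates, and record fields are all hypothetical and chosen only to make the arithmetic easy to follow:

```python
from typing import Callable, Dict

Usage = Dict[str, float]

# Hypothetical rules and rates; each maps a month's usage record to a cost.
COST_RULES: Dict[str, Callable[[Usage], float]] = {
    "network_usage": lambda u: 0.02 * u["minutes"],
    "long_distance": lambda u: 0.05 * u["long_distance_minutes"],
    "roaming": lambda u: 0.30 * u["roaming_minutes"],
    "support": lambda u: 4.00 * u["support_calls"],
}


def monthly_margin(usage: Usage) -> float:
    """Revenue minus the sum of every cost rule applied to the record."""
    cost = sum(rule(usage) for rule in COST_RULES.values())
    return usage["revenue"] - cost


record = {"revenue": 60.0, "minutes": 500.0, "long_distance_minutes": 100.0,
          "roaming_minutes": 10.0, "support_calls": 2.0}
print(round(monthly_margin(record), 2))
```

Keeping the rules in a table makes it cheap to add or revise one without touching the scoring code.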


Cost Rules for a Bank?

Cost Rules are introduced to define value

For Example:
Deposit Value
Product mix
Average daily balance
Monthly service fees
Technical operations/Support costs
Branch/teller usage
Late payment/Overdraft history
Interest rate
Contract term
Credit Score
Employment history/Income


Pragmatic Grand Challenge 3

Integrating domain knowledge


Data mining algorithms are knowledge free
There is no notion of common sense reasoning
Do we have to solve an AI-hard problem?

Robust and deep domain knowledge utilization


solution:
Very deep and very narrow integration
Ability to model business strategy
Reasoning capability just evolves (cf. chess players)


Cross-Sell / Up-Sell Example

Customer looking for pants

[Figure: recommendation flow. Recoverable labels: Help Me Decide, Complete the Assortment, Any Related Products; Recommendations; Collaborative Filtering; Alternates, Up Sells, Complement, Add-on, Impulse Buy; Context Sensitive Approach.]

Pragmatic Grand Challenge 4

Managing and maintaining models


When was the last time you thought about the lifetime of a
mining model?
What happens when a model is changed?
Have you tried to merge the results of two different clustering
models over time?
How many data droppings (aka temp files, quick
transformations, quick fixes) do you generate in an analysis
session?
A framework for managing, updating, and
retiring mining models
Solution: use techniques that have been invented for
this: databases, systems management, software engineering, etc.


Pragmatic Grand Challenge 5

Effectiveness Measurement
How do we measure [honestly] the effectiveness of a model in a
context?
Return on Investment (ROI) measurement
Evaluation in the context of the application
A framework and methodology for measurement
and evaluation
Build the measurement method as part of the design of the
model
An engineering recipe for measurements, and a set of metrics
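
One way to build measurement into the design is to hold out a random control group that receives no treatment and compare outcomes. The sketch below is an assumption about how such a recipe could look, not a method given in the talk; the function name, the metrics, and the campaign numbers are illustrative:

```python
def campaign_effect(treated_retained: int, treated_total: int,
                    control_retained: int, control_total: int,
                    value_per_retained: float, campaign_cost: float) -> dict:
    """Compare a treated group against a randomly held-out control group."""
    treated_rate = treated_retained / treated_total
    control_rate = control_retained / control_total
    uplift = treated_rate - control_rate  # incremental retention rate
    incremental_customers = uplift * treated_total
    incremental_value = incremental_customers * value_per_retained
    roi = (incremental_value - campaign_cost) / campaign_cost
    return {"uplift": uplift, "incremental_value": incremental_value, "roi": roi}


# Hypothetical retention campaign driven by a churn model.
result = campaign_effect(treated_retained=900, treated_total=1000,
                         control_retained=850, control_total=1000,
                         value_per_retained=200.0, campaign_cost=5000.0)
print(result)
```

Measuring against a control group credits the model only with customers who would otherwise have left, which is what makes the ROI figure honest.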

Technical Challenges


Technical Challenges

0. Public benchmark data sets


As a field we have failed to define a common data collection
Very difficult to judge research and systems advances
Not an easy task, but not impossible
A mix of
synthetic (but realistic) data sets
and real datasets


Technical Challenges

1. How does the data grow?


A theory for how large data sets get to be large
Definitely not IID sampling from a static distribution
Inappropriateness of a single-population model

2. Complexity/understandability tradeoff
Explaining how, when and why a model works
Explaining when a model fails
A Tuning Dial for reducing the complex into the
understandable

Technical Challenges
3. Interestingness
What is an interesting pattern or summary?
How do you measure novelty?
What is unusual? When is it worthy of attention?
Is it low probability events? High summarization ability? Outliers?
Good fits? Bad fits?


Technical Challenges
4. Scalability
Beyond just dealing with a large data set:
Principled feature reduction: what is the SVD equivalent? Graceful
degradation with dimensionality
Uncovering graphical structure in data
Communities, relations, link analysis
Dealing with multiple data types:
Structured, sparse, dense, text, images, video, audio, sequence
data, etc.
I have yet to see an algorithm that deals with more than one type.
Integration with DBMS
Appropriate sampling
Appropriate operator abstractions
Taking care of minor details
Initialization?
Determining k

Technical Challenges
5. A theory for what we do
What are the fundamental abstractions?
What are the basic operations? What are the basic
components of an algorithm?
What is it that we are optimizing?
What is hard? What is doable? Why?
What is a data summary?
When are two attributes similar? Can you measure
efficiently?
How do we extract the right representation?


A new theory is needed

What are the fundamental problems?


What do partial models or summaries of data really
mean?
What are the implications of post hoc data analysis?
When is it/is it not reasonable to conclude a task is
appropriate?
A new algebra for dealing with highly-summarized
views of the world
Effect of sparse spaces on dimensionality. What is the
true dimensionality of data? What are the limits?
A theory for adaptive sampling

Summary
Pragmatic and Technical Grand Challenges


Challenges

0. Public and challenging benchmark data sets


Pragmatic                     Technical
1. Where's the Data?          1. Understanding large
2. In Situ mining             2. Simplicity knob
3. Domain knowledge           3. Interestingness
4. Life-cycle maintenance     4. Scalability
5. Metrics                    5. Theory of what we do

A Scorecard for the field: At least 2 advances in the next 10 years!!!

Data Mining Grand Vision

ACME CORP ULTIMATE DATA MINING BROWSER

What's New? What's Interesting?

Predict for me


In the meantime, there is an understanding gap

The technical community speaks of tech problems
Business strategic thinking has hit an
understandability wall
Traditionally, the thinking of business
strategy never included data
A new generation of business challenges
is born


Data Strategy

Is the mapping of the capabilities enabled by data in driving the business
The Integration of data-driven capabilities in
revenue-driving activities
The Integration of data-derived metrics to
feedback into the measurement of the success
of the business
Evolving to an operational state where planning
includes data, measurability, and data-driven
feedback loops


A Data Miner's Story

Started out in pure research


At NASA-JPL
At Microsoft Research
At digiMine/Revenue Science Inc.
At DMX Group

So why join Yahoo! ?

Yahoo! Case Study
Evolving the Data Strategy as Chief Data Officer


Yahoo! is the #1 Destination on the Web

73% of the U.S. Internet population uses Yahoo!
About 500 million users per month globally!
Global network of content, commerce, media, search and access products
100+ properties including mail, TV, news, shopping, finance, autos, travel, games, movies, health, etc.
25 terabytes of data collected each day and growing
Representing thousands of cataloged consumer behaviors

More people visited Yahoo! in the past month than:
Use coupons
Vote
Recycle
Exercise regularly
Have children living at home
Wear sunscreen regularly

Data is used to develop content, consumer, category and campaign insights for our key content partners and large advertisers

Sources: Mediamark Research, Spring 2004 and comScore Media Metrix, February 2005.

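Yahoo! Data: A league of its own

[Figure: two bar charts.
"Millions of Events Processed Per Day": SABRE, VISA, NYSE, Y! Panama, Y! Data Highway. Values shown: 50, 120, 225, 2,000, 14,000.
"Terabytes of Warehoused Data": Amazon, Korea Telecom, AT&T, Y! LiveStor, Y! Panama Warehouse, Walmart, Y! Main warehouse. Values shown: 25, 49, 94, 100, 500, 1,000, 5,000.
The pairing of individual values with bars is not recoverable from this copy.]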

GRAND CHALLENGE PROBLEMS OF DATA PROCESSING


TRAVEL, CREDIT CARD PROCESSING, STOCK EXCHANGE, RETAIL, INTERNET

Y! PROBLEM EXCEEDS OTHERS BY 2 ORDERS OF MAGNITUDE


To be continued

Will cover the Yahoo! case study in Tuesday's invited talk
Will include
Strategic Importance of Data
Evolving the data strategy
Evolving towards the need to invent the new sciences
of the Internet

Hope the Data Miner's Story continues


Perhaps to a happy ending?

Thank You! & Questions?
Usama_fayyad@yahoo.com
