You are on page 1of 30

Data Mining Query

Languages
Kristen LeFevre
April 19, 2004
With Thanks to Zheng Huang and Lei Chen
Outline

 Introduce the problem of querying


data mining models
 Overview of three different solutions

and their contributions


 Topic for Discussion: What would an

ideal solution support?


Problem Description
 You guys are armed with two powerful tools
 Database management systems
 Efficient and effective data mining algorithms
and frameworks
 Generally, this work asks:
 “How can we merge the two?”
 “How can we integrate data mining more
closely with traditional database systems,
particularly querying?”
Three Different Answers
 DMQL: A Data Mining Query
Language for Relational Databases
(Han et al, Simon Fraser University)
 Integrating Data Mining with SQL
Databases: OLE DB for Data Mining
(Netz et al, Microsoft)
 MSQL: A Query Language for
Database Mining (Imielinski &
Virmani, Rutgers University)
Some Common Ground
 Create and manipulate data mining models
through a SQL-based interface (“Command-
driven” data mining)
 Abstract away the data mining particulars
 Data mining should be performed on data in
the database (should not need to export to
a special-purpose environment)
 Approaches differ on what kinds of models
should be created, and what operations we
should be able to perform
DMQL
 Commands specify the following:
 The set of data relevant to the data mining
task (the training set)
 The kinds of knowledge to be discovered
• Generalized relation
• Characteristic rules
• Discriminant rules
• Classification rules
• Association rules
DMQL

 Commands Specify the following:


 Background knowledge
• Concept hierarchies based on attribute
relationships, etc.
 Various thresholds
• Minimum support, confidence, etc.
DMQL
 Syntax
use database <database_name>
Specify background
knowledge
{use hierarchy <hierarchy_name> for
Specify rules to be
<attribute>}
discovered <rule_spec>
Relevant attributes or
aggregations related to <attr_or_agg_list>
Collect the set of from <relation(s)>
relevant data to mine
[where <conditions>]
[order by <order list>]
Specify threshold {with [<kinds of>] threshold =
parameters
<threshold_value> [for <attribute(s)>]}
DMQL
 Syntax <rule_spec>
find classification rules [as <rule_name>]
[according to <attributes>]

Find association rules [as <rule_name>]

generalize data [into <relation_name>]

others
DMQL
use database Hospital
find association rules as Heart_Health
related to Salary, Age, Smoker,
Heart_Disease
from Patient_Financial f, Patient_Medical m
where f.ID = m.ID and m.age >= 18
with support threshold = .05
with confidence threshold = .7
DMQL

 DMQL provides a display in


command to view resulting rules, but
no advanced way to query them
 Suggests that a GUI interface might

aid in the presentation of these results


in different forms (charts, graphs, etc.)
MSQL

 Focus on Association Rules


 Seeks to provide a language both to

selectively generate rules, and


separately to query the rule base
 Expressive rule generation language,

and techniques for optimizing some


commands
MSQL
 Get-Rules and Select-Rules Queries
 Get-Rules operator generates rules over
elements of argument class C, which satisfy
conditions described in the “where” clause
[Project Body, Consequent,
confidence, support]
GetRules(C) [as R1]
[into <rulebase_name>]
[where <conds>]
[sql-group-by clause]
[using-clause]
MSQL
 <conds> may contain a number of
conditions, including:
 restrictions on the attributes in the body or
consequent
in, has, and is are rule • “rule.body HAS {(Job = ‘Doctor’}”
subset, superset,
and equality • “rule1.consequent IN rule2.body”
respectively • “rule.consequent IS {Age = *}”
 pruning conditions (restrict by support,
confidence, or size)
 Stratified or correlated subqueries
MSQL
GetRules(Patients)
where Body has {Age = *}
and Support > .05 and Confidence > .7
and not exists ( GetRules(Patients)
Support > .05 and
Confidence > .7
and R2.Body HAS R1.Body)

Retrieve all rules with descriptors of the form “Age = x” in the body,
except when there is a rule with equal or greater support and
confidence with a rule containing a superset of the descriptors in
the body
MSQL
GetRules(C) R1
where <pruning-conds>
correlated and not exists ( GetRules(C) R2
where <same pruning-conds>
and R2.Body HAS R1.Body)

GetRules(C) R1
where <pruning-conds>
and consequent is {(X=*)}
stratified and consequent in (SelectRules(R2)
where consequent is {(X=*)}
MSQL
 Nested Get-Rules Queries and their
optimization
 Stratified(non-corrolated) queries are
evaluated “bottom-up.” The subquery is
evaluated first, and replaced with its results
in the outer query.
 Correlated queries are evaluated either top-
down or bottom-up (like “loop-unfolding”),
and there are rules for choosing between the
two options
MSQL
GetRules(Patients)
where Body has {Age = *}
and Support > .05 and Confidence > .7
and not exists ( GetRules(Patients)
Support > .05 and
Confidence > .7
and R2.Body HAS R1.Body)
MSQL

Top-Down Evaluation
GetRules(Patients)
where Body has {Age = *}
and Support > .05 and Confidence > .7

For each rule produced by the outer, evaluate the


inner
not exists ( GetRules(Patients)
Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)
MSQL

Bottom-Up Evaluation
not exists ( GetRules(Patients)
Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)

For each rule produced by the inner, evaluate the


outer
GetRules(Patients)
where Body has {Age = *}
and Support > .05 and Confidence > .7
MSQL
 Choosing between the two
 In general, evaluate the expression with more
restrictive conditions first
 Heuristic rules
• Evaluate the query with higher support threshold first
• Next consider confidence threshold
Meant to prevent
• A (length = x) expression is in general more restrictive
unconstrained than (length > x), which is more restrictive than (length <
queries from being x)
evaluated first • “Body IS (constant expression)” is more restrictive than
“Body HAS”, which is more restrictive than “Body IN”
• Next consider “Consequent IN” expressions
• Descriptors of for (A = a) are more restrictive than
wildcards such as (A = *)
OLE DB for DM
 An extension to the OLE DB interface for
Microsoft SQL Server
 Seeks to support the following ideas:
 Define a model by specifying the set of
attributes to be predicted, the attributes used
for the prediction, and the algorithm
 Populate the model using the training data
None of the  Predict attributes for new data using the
others
seemed to populated model
support this  Browse the mining model (not fully
addressed because it varies a lot by model
type)
OLE DB for DM
 Defining a Mining Model
 Identify the set of data attributes to be
predicted, the set of attributes to be used for
prediction, and the algorithm to be used for
building the model
 Populating the Model
 Pullthe information into a single rowset
using views, and train the model using the
data and algorithm specified
 Supports complex objects, so rowset may be
hierarchical (see paper for more complex
examples)
OLE DB for DM

 Using the mining model to predict


 Defines a new operator prediction join.
A model may be used to make
predictions on datasets by taking the
prediction join of the mining model
and the data set.
OLE DB for DM
CREATE MINING MODEL [Heart_Health Prediction]
[ID] Int Key,
[Age] Int,
[Smoker] Int,
[Salary] Double discretized,
[HeartAttack] Int PREDICT, %Prediction column
USING [Decision_Trees_101]

Identifies the source columns for the training


data, the column to be predicted, and the data
mining algorithm.
OLE DB for DM
INSERT INTO [Heart_Health Prediction]
([ID], [Age], [Smoker], [Salary])
SELECT [ID], [Age], [Smoker], [Salary] FROM
Patient_Medical M, Patient_Financial F
WHERE M.ID = F.ID

The INSERT represents using a tuple for


training the model (not actually inserting it into
the rowset).
OLE DB for DM
SELECT t.[ID],
[Heart_Health Prediction].[HeartAttack]
FROM [Heart_Health Prediction]
PREDICTION JOIN (
SELECT [ID], [Age], [Smoker], [Salary]
FROM Patient_Medical M, Patient_Financial F
WHERE M.ID = F.ID) as t
ON [Heart_Health Prediction].Age = t.Age AND
[Heath_Health Prediction].Smoker = t.Smoker
AND [Heart_Health Prediction].Salary =
t.Salary

Prediction join connects the model and an actual data


table to make predictions
Key Ideas

 Important to have an API for creating


and manipulating data mining models
 The data is already in the DBMS, so it

makes sense to do the data mining


where the data is
 Applications already use SQL, so a

SQL extension seems logical


Key Ideas
 Need a method for defining data mining
models, including algorithm specification,
specification of various parameters, and
training set specification (DMQL, MSQL,
ODBDM)
 Need a method of querying the models
(MSQL)
 Need a way of using the data mining model
to interact with other data in the database,
for purposes such as prediction (ODBDM)
Discussion Topic:
What Functionality would
and Ideal Solution
Support?

You might also like