Data Mining Query Languages Compared

Data Mining Query
Languages
Kristen LeFevre
April 19, 2004
With Thanks to Zheng Huang and Lei Chen
Outline
 Introduce the problem of querying

data mining models
 Overview of three different solutions
and their contributions

 Topic for Discussion: What would an
ideal solution support?

Problem Description
 You guys are armed with two powerful tools
 Database management systems
 Efficient and effective data mining algorithms
and frameworks
 Generally, this work asks:
 “How can we merge the two?”
 “How can we integrate data mining more
closely with traditional database systems,
particularly querying?”
Three Different Answers
 DMQL: A Data Mining Query
Language for Relational Databases
(Han et al, Simon Fraser University)
 Integrating Data Mining with SQL
Databases: OLE DB for Data Mining
(Netz et al, Microsoft)
 MSQL: A Query Language for
Database Mining (Imielinski &
Virmani, Rutgers University)
Some Common Ground
 Create and manipulate data mining models
through a SQL-based interface (“Command-
driven” data mining)
 Abstract away the data mining particulars
 Data mining should be performed on data in
the database (should not need to export to
a special-purpose environment)
 Approaches differ on what kinds of models
should be created, and what operations we
should be able to perform
DMQL
 Commands specify the following:
 The set of data relevant to the data mining
task (the training set)
 The kinds of knowledge to be discovered
• Generalized relation
• Characteristic rules
• Discriminant rules
• Classification rules
• Association rules
DMQL
 Commands Specify the following:

 Background knowledge
• Concept hierarchies based on attribute
relationships, etc.
 Various thresholds
• Minimum support, confidence, etc.
DMQL
 Syntax
use database <database_name>
Specify background
knowledge
{use hierarchy <hierarchy_name> for
Specify rules to be
<attribute>}
discovered <rule_spec>
Relevant attributes or
aggregations related to <attr_or_agg_list>
Collect the set of from <relation(s)>
relevant data to mine
[where <conditions>]
[order by <order list>]
Specify threshold {with [<kinds of>] threshold =
parameters
<threshold_value> [for <attribute(s)>]}
DMQL
 Syntax <rule_spec>
find classification rules [as <rule_name>]
[according to <attributes>]
Find association rules [as <rule_name>]
generalize data [into <relation_name>]
others
DMQL
use database Hospital
find association rules as Heart_Health
related to Salary, Age, Smoker,
Heart_Disease
from Patient_Financial f, Patient_Medical m
where f.ID = m.ID and m.age >= 18
with support threshold = .05
with confidence threshold = .7
DMQL
 DMQL provides a display in

command to view resulting rules, but
no advanced way to query them
 Suggests that a GUI interface might
aid in the presentation of these results

in different forms (charts, graphs, etc.)
MSQL
 Focus on Association Rules

 Seeks to provide a language both to
selectively generate rules, and

separately to query the rule base
 Expressive rule generation language,
and techniques for optimizing some

commands
MSQL
 Get-Rules and Select-Rules Queries
 Get-Rules operator generates rules over
elements of argument class C, which satisfy
conditions described in the “where” clause
[Project Body, Consequent,
confidence, support]
GetRules(C) [as R1]
[into <rulebase_name>]
[where <conds>]
[sql-group-by clause]
[using-clause]
MSQL
 <conds> may contain a number of
conditions, including:
 restrictions on the attributes in the body or
consequent
in, has, and is are rule • “rule.body HAS {(Job = ‘Doctor’}”
subset, superset,
and equality • “rule1.consequent IN rule2.body”
respectively • “rule.consequent IS {Age = *}”
 pruning conditions (restrict by support,
confidence, or size)
 Stratified or correlated subqueries
MSQL
GetRules(Patients)
where Body has {Age = *}
and Support > .05 and Confidence > .7
and not exists ( GetRules(Patients)
Support > .05 and
Confidence > .7
and R2.Body HAS R1.Body)
Retrieve all rules with descriptors of the form “Age = x” in the body,
except when there is a rule with equal or greater support and
confidence with a rule containing a superset of the descriptors in
the body
MSQL
GetRules(C) R1
where <pruning-conds>
correlated and not exists ( GetRules(C) R2
where <same pruning-conds>
GetRules(C) R1
where <pruning-conds>
and consequent is {(X=*)}
stratified and consequent in (SelectRules(R2)
where consequent is {(X=*)}
MSQL
 Nested Get-Rules Queries and their
optimization
 Stratified(non-corrolated) queries are
evaluated “bottom-up.” The subquery is
evaluated first, and replaced with its results
in the outer query.
 Correlated queries are evaluated either top-
down or bottom-up (like “loop-unfolding”),
and there are rules for choosing between the
two options
MSQL
GetRules(Patients)
and not exists ( GetRules(Patients)
Support > .05 and
Confidence > .7
MSQL
Top-Down Evaluation
GetRules(Patients)
For each rule produced by the outer, evaluate the

inner
not exists ( GetRules(Patients)
Support > .05 and Confidence > .7
MSQL
Bottom-Up Evaluation
not exists ( GetRules(Patients)
Support > .05 and Confidence > .7
For each rule produced by the inner, evaluate the

outer
GetRules(Patients)
MSQL
 Choosing between the two
 In general, evaluate the expression with more
restrictive conditions first
 Heuristic rules
• Evaluate the query with higher support threshold first
• Next consider confidence threshold
Meant to prevent
• A (length = x) expression is in general more restrictive
unconstrained than (length > x), which is more restrictive than (length <
queries from being x)
evaluated first • “Body IS (constant expression)” is more restrictive than
“Body HAS”, which is more restrictive than “Body IN”
• Next consider “Consequent IN” expressions
• Descriptors of for (A = a) are more restrictive than
wildcards such as (A = *)
OLE DB for DM
 An extension to the OLE DB interface for
Microsoft SQL Server
 Seeks to support the following ideas:
 Define a model by specifying the set of
attributes to be predicted, the attributes used
for the prediction, and the algorithm
 Populate the model using the training data
None of the  Predict attributes for new data using the
others
seemed to populated model
support this  Browse the mining model (not fully
addressed because it varies a lot by model
type)
OLE DB for DM
 Defining a Mining Model
 Identify the set of data attributes to be
predicted, the set of attributes to be used for
prediction, and the algorithm to be used for
building the model
 Populating the Model
 Pullthe information into a single rowset
using views, and train the model using the
data and algorithm specified
 Supports complex objects, so rowset may be
hierarchical (see paper for more complex
examples)
OLE DB for DM
 Using the mining model to predict

 Defines a new operator prediction join.
A model may be used to make
predictions on datasets by taking the
prediction join of the mining model
and the data set.
OLE DB for DM
CREATE MINING MODEL [Heart_Health Prediction]
[ID] Int Key,
[Age] Int,
[Smoker] Int,
[Salary] Double discretized,
[HeartAttack] Int PREDICT, %Prediction column
USING [Decision_Trees_101]
Identifies the source columns for the training

data, the column to be predicted, and the data
mining algorithm.
OLE DB for DM
INSERT INTO [Heart_Health Prediction]
([ID], [Age], [Smoker], [Salary])
SELECT [ID], [Age], [Smoker], [Salary] FROM
Patient_Medical M, Patient_Financial F
WHERE M.ID = F.ID
The INSERT represents using a tuple for

training the model (not actually inserting it into
the rowset).
OLE DB for DM
SELECT t.[ID],
[Heart_Health Prediction].[HeartAttack]
FROM [Heart_Health Prediction]
PREDICTION JOIN (
SELECT [ID], [Age], [Smoker], [Salary]
FROM Patient_Medical M, Patient_Financial F
WHERE M.ID = F.ID) as t
ON [Heart_Health Prediction].Age = t.Age AND
[Heath_Health Prediction].Smoker = t.Smoker
AND [Heart_Health Prediction].Salary =
t.Salary
Prediction join connects the model and an actual data

table to make predictions
Key Ideas
 Important to have an API for creating

and manipulating data mining models
 The data is already in the DBMS, so it
makes sense to do the data mining

where the data is
 Applications already use SQL, so a
SQL extension seems logical

Key Ideas
 Need a method for defining data mining
models, including algorithm specification,
specification of various parameters, and
training set specification (DMQL, MSQL,
ODBDM)
 Need a method of querying the models
(MSQL)
 Need a way of using the data mining model
to interact with other data in the database,
for purposes such as prediction (ODBDM)
Discussion Topic:
What Functionality would
and Ideal Solution
Support?

Data Mining Query Languages Compared

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Mining Query Languages Compared

Uploaded by

Copyright:

Available Formats

Data Mining Query

 Introduce the problem of querying

and their contributions

ideal solution support?

 Commands Specify the following:

Find association rules [as <rule_name>]

generalize data [into <relation_name>]

 DMQL provides a display in

aid in the presentation of these results

 Focus on Association Rules

selectively generate rules, and

and techniques for optimizing some

For each rule produced by the outer, evaluate the

For each rule produced by the inner, evaluate the

 Using the mining model to predict

Identifies the source columns for the training

The INSERT represents using a tuple for

Prediction join connects the model and an actual data

 Important to have an API for creating

makes sense to do the data mining

SQL extension seems logical

You might also like