Foundations 6 Data Mining

Texas A&M
GSPLab
Data Mining
Edward R. Dougherty
Department of Electrical and Computer Engineering
Center for Bioinformatics and Genomic Systems Engineering
Texas A&M University
gsp.tamu.edu
Reading
Book: Chapter 8
Paper: Dougherty, E. R., Prudence, Risk, and
Reproducibility in Biomarker Discovery,
BioEssays, Vol. 34, No. 4, 277-279, 2012.
03/28/15
Knowledge Discovery
Knowing the constitution of scientific knowledge
and how to validate it leaves open the question of
how to discover knowledge.
Obviously, we need to observe Nature, but in what
manner?
Bacon on Planned Experiments

Francis Bacon (Novum Organum, 1620):
There remains simple experience which, if
taken as it comes, is called accident; if sought
for, experiment. But this kind of experience
is ... a mere groping, as of men in the dark ...
But the true method of experience, on the
contrary, first lights the candle, and then by
means of the candle shows the way;
commencing as it does with experience duly
ordered and digested, not bungling or erratic,
and from it educing axioms, and from
established axioms again new experiments.
Experimental Design: The Path of Progress

Immanuel Kant (Critique of Pure Reason, 1781): It is
only when experiment is directed by rational principles
that it can have any real utility. Reason must approach
nature with the view, indeed, of receiving information
from it, not, however, in the character of a pupil, who
listens to all that his master chooses to tell him, but in
that of a judge, who compels the witnesses to reply to
those questions which he himself thinks fit to propose.
To this single idea must the revolution be ascribed, by
which, after groping in the dark for so many centuries,
natural science was at length conducted into the path of
certain progress.
Judicious Feature Selection

James Clerk Maxwell: The feature which
presents itself most forcibly to the untrained
inquirer may not be that which is considered
most fundamental by the experienced man of
science; for the success of any physical
investigation depends on the judicious
selection of what is to be observed as of
primary importance, combined with a
voluntary abstraction of the mind from those
features which, however attractive they
appear, we are not yet sufficiently advanced in
science to investigate with profit.
An Experiment is a Question
Hans Reichenbach (Rise of Scientific
Philosophy): An experiment is a question
addressed to Nature. As long as we
depend on the observation of occurrences
not involving our assistance, the
observable happenings are usually the
product of so many factors that we cannot
determine the contribution of each
individual factor to the total result.
Reasoning to Science
Hans Reichenbach: By means of the
artificial occurrences of planned
experiments, the complex occurrence of
Nature is thus analyzed into its
components. That Greek science did not
use experiments in any significant way
proves how difficult it was to turn from
reasoning to empirical science.
Science is not constituted by reasoning about
data; it is constituted by pragmatic, predictive
models.
Foolish Questions Yield Foolish Answers

Arturo Rosenblueth and Norbert
Wiener: An experiment is a question. A
precise answer is seldom obtained if the
question is not precise; indeed, foolish
answers, i.e., inconsistent, discrepant or
irrelevant experimental results, are
usually indicative of a foolish question.
Models Depend on Questions Asked

Werner Heisenberg: The most important new
result of nuclear physics was the recognition of
the possibility of applying quite different types
of natural laws, without contradiction, to one
and the same physical event. This is due to the
fact that within a system of laws which are
based on certain fundamental ideas only certain
quite definite ways of asking questions make
sense, and thus, that such a system is separated
from others which allow different questions to
be put.
Mere Observation
Hannah Arendt: [Natural science]
seemed to be liberated by the discovery that
our senses by themselves do not tell the
truth. Henceforth, sure of the unreliability
of sensation and the resulting insufficiency
of mere observation, the natural sciences
turned toward the experiment, which, by
directly interfering with nature, assured the
development whose progress has ever since
appeared to be limitless.
Answers Without Questions

Hannah Arendt: The experiment being a
question put before nature (Galileo), the
answers of science will always remain
replies to questions asked by men; the
confusion in the issue of objectivity was to
assume that there could be answers without
questions and results independent of a
question-asking being.
Efficient Experimentation
Douglas Montgomery: If an experiment is
to be performed most efficiently, then a
scientific approach to planning the
experiment must be considered. By the
statistical design of experiments we refer to
the process of planning the experiment so
that appropriate data will be collected, which
may be analyzed by statistical methods
resulting in valid and objective conclusions.
The statistical approach to experimental
design is necessary if we wish to draw
meaningful conclusions from the data.
Everyday Classification
Some algorithm is proposed.
The algorithm separates some data set.
We are not told the distribution from which the data come.
An estimation rule is used to estimate the error.

We are given no reason why the estimate should be good.
In fact, often we expect that the estimate is not good.
The estimate is small and the algorithm is claimed to
be validated.
We are given no justification for the claim.
We are given no conditions under which it is valid.
Data Mining Definition 1

(Merriam-Webster) Type of database analysis that
attempts to discover useful patterns or relationships in
a group of data. The analysis uses advanced statistical
methods, such as cluster analysis, and sometimes
employs artificial intelligence or neural network
techniques. A major goal of data mining is to discover
previously unknown relationships among the data,
especially when the data come from different
databases.
Relations among data, not among variables: no science!
Data Mining Definition 2

(Wikipedia) Data analysis has increasingly been
augmented with indirect, automated data processing,
aided by other discoveries in computer science, such as
neural networks, cluster analysis, genetic algorithms, and
support vector machines. Data mining is the process of
applying these methods with the intention of uncovering
hidden patterns in large data sets.
Uncovering patterns in data sets: no science!
Data Mining
Data mining is a return to pre-Baconian groping, albeit at
a much faster groping rate than was then possible.
It suffers from three debilitating properties:
It does not ask precise questions.
There is no statistical characterization of the procedure.
As opposed to pattern recognition, it lacks a characterization of
prediction in the context of a distribution.
Sometimes it is justified by large sample theory,
typically absent a rigorous analysis of the problem at hand.
Statistics for the Proletariat

Julian L. Simon (Resampling: The New Statistics,
1997): Monte Carlo resampling simulation takes the
mumbo-jumbo out of statistics and enables even
beginning students to understand completely everything
that is done. Even many experts are unable to
understand intuitively the formal mathematical approach
to the subject. Clearly, we need a method free of the
formulas that bewilder almost everyone.
Everyday common sense should replace the mumbo-jumbo
of scientific rigor and, to a great extent, it has.
The Numbers Speak for Themselves

Chris Anderson (The End of Theory: The Data Deluge
Makes the Scientific Method Obsolete): The more we
learn about biology, the further we find ourselves from a
model that can explain it. There is now a better way.
Petabytes allow us to say: "Correlation is enough." We
can stop looking for models. We can analyze the data
without hypotheses about what it might show. We can
throw the numbers into the biggest computing clusters
the world has ever seen and let statistical algorithms find
patterns where science cannot. ... With enough data, the
numbers speak for themselves.
Consistency (Asymptotic Convergence)

For a sample S of size n, there is a design cost: $\Delta_n = \varepsilon_n - \varepsilon_{\mathrm{Bayes}}$.
A classification rule is consistent if $E[\Delta_n] \to 0$ as $n \to \infty$.
An error estimator is consistent if the estimate converges
to the true error as $n \to \infty$.
What good is this for small samples?
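A minimal sketch of how the expected design cost behaves at small and large sample sizes. The setting is an assumption made for illustration and is not taken from the slides: one feature, classes N(0, 1) and N(delta, 1) with equal priors, and a nearest-sample-mean rule standing in for LDA. All function names are hypothetical.

```python
import math
import random

def phi_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def true_error(m0, m1, delta):
    """Exact error of the nearest-sample-mean rule for N(0,1) vs N(delta,1)."""
    t = 0.5 * (m0 + m1)  # decision threshold between the two sample means
    err = 0.5 * (1.0 - phi_cdf(t)) + 0.5 * phi_cdf(t - delta)
    # If the sample means come out in the wrong order, the labels are swapped.
    return err if m0 <= m1 else 1.0 - err

def expected_design_cost(n_per_class, delta=1.0, reps=500, seed=0):
    """Monte Carlo estimate of E[true error - Bayes error] (assumed model)."""
    rng = random.Random(seed)
    bayes = phi_cdf(-delta / 2.0)  # Bayes error for this Gaussian model
    total = 0.0
    for _ in range(reps):
        m0 = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_class)) / n_per_class
        m1 = sum(rng.gauss(delta, 1.0) for _ in range(n_per_class)) / n_per_class
        total += true_error(m0, m1, delta) - bayes
    return total / reps

for n in (5, 50, 500):
    print(n, round(expected_design_cost(n), 5))
```

Consistency only says the printed values shrink toward zero as n grows; it gives no bound on the cost at n = 5.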
Asymptotic Convergence is Irrelevant

Appeal to laws of large numbers or central limit
theorems in small-sample settings is unwarranted.
Training-data-based error estimation methods, such as
cross-validation and bootstrap, converge asymptotically
as the sample size goes to infinity, but this is of
virtually no value for small samples.
Asymptopia
Edward Leamer: Two of the latest products-to-end-all-suffering
are nonparametric estimation
and consistent standard errors, which promise
results without assumptions, as if we were
already in Asymptopia where data are so plentiful
that no assumptions are needed. ... By
disguising the assumptions on which nonparametric
methods and consistent standard errors rely, the purveyors
of these methods have made it impossible to have an
intelligible conversation about the circumstances in which
their gimmicks do not work well and ought not to be used.
As for me, I prefer to carry parameters on my journey so I
know where I am and where I am going, not travel stoned
on the latest euphoria drug.
Tackling Small Sample Problems

Ronald A. Fisher (1925): Little experience
is sufficient to show that the traditional [large
sample] machinery of statistical processes is
wholly unsuited to the needs of practical
research. Not only does it take a cannon to
shoot a sparrow, but it misses the sparrow!...
The elaborate mechanism built on the theory of infinitely
large samples is not accurate enough for simple laboratory
data. Only by systematically tackling small sample problems
on their merits does it seem possible to apply accurate tests to
practical data.
Full Knowledge is in Sampling Distribution

Harald Cramér (1946): It is clear that a
knowledge of the exact form of a sampling
distribution would be of a far greater value
than the knowledge of a number of moment
characteristics or a limiting expression for
large values of n. Especially when we are
dealing with small samples, as is often the
case in the applications, the asymptotic
expressions are sometimes grossly inadequate,
and a knowledge of the exact form of the
distribution would then be highly desirable.
Humean Trap: Data Without Reason

Hume (Treatise of Human Nature): The
mind is a kind of theatre, where several
perceptions successively make their
appearance; pass, repass, glide away, and
mingle in an infinite variety of postures
and situations. There is properly no
simplicity in it at one time, nor identity in
different [times].
A definition of radical empiricism.
Data mining.
Necessity of an Intelligent Idea

William Barrett (Illusion of
Technique): The absence of an
intelligent idea in the grasp of a
problem cannot be redeemed by the
elaborateness of the machinery one
subsequently employs.
The Imprint of Mind

William Barrett (Illusion of Technique): The scientist's
mind is not a passive mirror that reflects the facts as they
are in themselves (whatever that might mean); the
scientist constructs models, which are not found among
the things given him in his experience, and proceeds to
impose those models upon Nature. And he must often
construct those models conceptually before they are
translated at any point into the material constructions of
his apparatus in the laboratory. The imprint of mind is
everywhere on the body of this science, and without the
founding power of mind it would not exist.
Radical Empiricism Denies Knowledge

Hans Reichenbach (Rise of Scientific
Philosophy): A mere report of relations
observed in the past cannot be called
knowledge. If knowledge is to reveal
objective relations of physical objects, it
must include reliable predictions. A radical
empiricism, therefore, denies the
possibility of knowledge.
A collection of measurements, together with
statements about the measurements, is not
scientific knowledge.
A Huge Challenge
Janet Woodcock (Director, Center for Drug
Evaluation and Research, FDA): [As much as
75 percent of published biomarker associations
are not replicable] This poses a huge
challenge for industry in biomarker
identification and diagnostics development.
Dougherty, E. R., Prudence, Risk, and Reproducibility in
Biomarker Discovery, BioEssays, 34(4), 277-279, 2012.
Yousefi, M., and E. R. Dougherty, Performance Reproducibility
Index for Classification, Bioinformatics, 28(21), 2824-2833,
2012.
Reporting Bias When Using Real Data

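m data sets of size n, LDA with 10-fold cross-validation.
$\varepsilon_{\mathrm{est}}(0)$ and $\varepsilon_{\mathrm{true}}(0)$ are the estimated and true errors for the
sample with the lowest error estimate, and $E[\varepsilon_{\mathrm{true}}]$ is the
expected true error over all samples.
Left: $\varepsilon_{\mathrm{est}}(0) - \varepsilon_{\mathrm{true}}(0)$; right: $\varepsilon_{\mathrm{est}}(0) - E[\varepsilon_{\mathrm{true}}]$; n = 60, 120.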
Yousefi, M. R., Hua, J., Sima, C., and E. R. Dougherty, Reporting Bias When Using Real
Data Sets to Analyze Classification Performance, Bioinformatics, 26(1),
68-76, 2010.
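The selection effect can be sketched by simulation: generate m data sets, report only the one with the lowest error estimate, and average the difference between its estimated and true errors. This is an illustration under assumptions that differ from the slide: one feature, classes N(0, 1) and N(DELTA, 1) with equal priors, a nearest-sample-mean rule in place of LDA, leave-one-out in place of 10-fold cross-validation, and 15 points per class. All names are hypothetical.

```python
import math
import random

DELTA = 1.0  # assumed distance between the two class means

def phi_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def classify(x, m0, m1):
    """Nearest-sample-mean rule; ties go to class 0."""
    return 0 if abs(x - m0) <= abs(x - m1) else 1

def true_error(m0, m1):
    """Exact error of the designed rule under the assumed Gaussian model."""
    t = 0.5 * (m0 + m1)
    err = 0.5 * (1.0 - phi_cdf(t)) + 0.5 * phi_cdf(t - DELTA)
    return err if m0 <= m1 else 1.0 - err

def loo_estimate(x0, x1):
    """Leave-one-out cross-validation error estimate."""
    n0, n1 = len(x0), len(x1)
    s0, s1 = sum(x0), sum(x1)
    wrong = sum(classify(x, (s0 - x) / (n0 - 1), s1 / n1) != 0 for x in x0)
    wrong += sum(classify(x, s0 / n0, (s1 - x) / (n1 - 1)) != 1 for x in x1)
    return wrong / (n0 + n1)

def one_study(n_per_class, rng):
    """Return (estimated error, true error) for one random data set."""
    x0 = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    x1 = [rng.gauss(DELTA, 1.0) for _ in range(n_per_class)]
    est = loo_estimate(x0, x1)
    return est, true_error(sum(x0) / n_per_class, sum(x1) / n_per_class)

def reporting_bias(m, n_per_class, reps, seed=0):
    """Average (estimate - true error) for the best-looking of m data sets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        studies = [one_study(n_per_class, rng) for _ in range(m)]
        est0, true0 = min(studies, key=lambda s: s[0])  # lowest estimate wins
        total += est0 - true0
    return total / reps

for m in (1, 5, 20):
    print(m, round(reporting_bias(m, 15, 300), 3))
```

A negative value means the reported estimate is optimistic; the sketch lets one check how that optimism changes as more data sets are tried.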
Multiple-Rule Bias
Use r classification rules and s error
estimation rules. Select the pair with
the minimum estimated error, $\varepsilon_{\min,\mathrm{est}}$.
$\mathrm{Bias}(m) = E[\varepsilon_{\min,\mathrm{est}} - \varepsilon_{\mathrm{true}}(i_{\min})]$, over
sampling distribution, m = rs, n = 60.
Yousefi, M. R., Hua, J., and E. R. Dougherty, Multiple-Rule Bias in the Comparison of Classification Rules,
Bioinformatics, 27(12), 1675-1683, 2011.
Reproducibility Performance Index

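A preliminary study of size n is reproducible with
accuracy $\rho \ge 0$ if $\varepsilon_n \le \varepsilon_n^{\mathrm{est}} + \rho$.
A follow-on study will be performed if $\varepsilon_n^{\mathrm{est}} \le \tau$.
$R_n(\rho, \tau) = P(\varepsilon_n \le \varepsilon_n^{\mathrm{est}} + \rho \mid \varepsilon_n^{\mathrm{est}} \le \tau)$.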
Real data sets: LDA, n = 60, 5 features by t-test.
Yousefi, M., and E. R. Dougherty, Performance Reproducibility Index for
Classification, Bioinformatics, 28(21), 2824-2833, 2012.
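A Monte Carlo sketch of the reproducibility index: among simulated preliminary studies whose error estimate is small enough to trigger a follow-on study, count the fraction whose true error is no worse than the estimate plus an accuracy margin. The argument names rho (accuracy margin) and tau (follow-on threshold) are this sketch's own labels, and the setting is assumed rather than taken from the slide: one feature, classes N(0, 1) and N(DELTA, 1), a nearest-sample-mean rule, leave-one-out error estimation, and 15 points per class.

```python
import math
import random

DELTA = 1.0  # assumed distance between the two class means

def phi_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def nearest(x, m0, m1):
    """Nearest-sample-mean rule; ties go to class 0."""
    return 0 if abs(x - m0) <= abs(x - m1) else 1

def simulate_study(n, rng):
    """Return (leave-one-out estimate, exact true error) for one study."""
    x0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x1 = [rng.gauss(DELTA, 1.0) for _ in range(n)]
    s0, s1 = sum(x0), sum(x1)
    wrong = sum(nearest(x, (s0 - x) / (n - 1), s1 / n) != 0 for x in x0)
    wrong += sum(nearest(x, s0 / n, (s1 - x) / (n - 1)) != 1 for x in x1)
    m0, m1 = s0 / n, s1 / n
    t = 0.5 * (m0 + m1)
    err = 0.5 * (1.0 - phi_cdf(t)) + 0.5 * phi_cdf(t - DELTA)
    return wrong / (2 * n), (err if m0 <= m1 else 1.0 - err)

def reproducibility_index(studies, rho, tau):
    """Fraction of followed-up studies (est <= tau) with true <= est + rho."""
    followed = [(est, tru) for est, tru in studies if est <= tau]
    if not followed:
        return None  # no study would have been followed up
    return sum(tru <= est + rho for est, tru in followed) / len(followed)

rng = random.Random(0)
studies = [simulate_study(15, rng) for _ in range(2000)]
for tau in (0.2, 0.3, 0.4):
    print(tau, reproducibility_index(studies, rho=0.05, tau=tau))
```

Because the true error can never fall below the Bayes error, a very demanding threshold selects exactly those studies whose estimates are too good to be reproduced.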
Reproducibility with Reporting Bias

Reproducibility index for m = 5 data sets, LDA, 5F-CV,
5 features, Gaussian with equal covariance matrices,
uncorrelated features
(a) n = 60, = 0.0005; (b) n = 60, = 0.05;
(c) n = 120, = 0.0005; (d) n = 120, = 0.05;
Separate Sampling: Classifier Error

Class sizes, n0 and n1, pre-determined.
Hence, no estimate for class prior
probability c = P(Y = 0).
Random sampling: $r = n_0/n \to c$ as $n \to \infty$ (in probability).
Fix c, expected error for r (QDA).

Dark blue (c = 0.3), black (c = 0.4), light
blue (c = 0.5), red (c = 0.6), green (c = 0.7).
r* (crossing point) is minimax value.
Top: equal covariance; bottom: unequal
covariance.
Esfahani, M. S., and E. R. Dougherty, Effect of
Separate Sampling on Classification Accuracy,
Bioinformatics,
Separate Sampling: Error Estimation

Class sizes, n0 and n1, pre-determined.
Apply classical 5-fold cross-validation on
the data set to estimate the error (dashed
lines).
Apply separate-sampling 5-fold cross-validation (solid lines).
Fix c, bias for r (L-SVM).

Dark blue (c = 0.3), black (c = 0.4), light blue
(c = 0.5), red (c = 0.6), green (c = 0.7)
Top: n = 80; bottom: n = 1000
Braga-Neto, U. M., Zollanvari, A., and E. R.
Dougherty, Cross-Validation Under Separate
Sampling: Optimistic Bias and How to Correct It,
Apparent Patterns in Microarray Data

[Figure: microarray expression data; rows are genes, columns are time course or experiments. Annotations: patterns; relationship?]
What Does This Mean?

Data are clustered
by some clustering
algorithm.
Is there scientific
knowledge here?
Clustering Algorithm
An algorithm that partitions a set of points into several
groups, based on a measure of similarity (or
dissimilarity) between the points.
Example: [Figure: points in the (x1, x2, x3) feature space partitioned into Group 1, Group 2, and Group 3.]
Expression Profile Clustering

Cluster expression vectors: clusters indicate
potential co-regulation in time-course data analysis.
Cluster samples: clusters indicate potential similar
sources, a sort of classification.
Methods
Fuzzy c-means
K-means
S.O.M.
Hierarchical clustering (Euclidean distance)
Hierarchical clustering (correlation)
K-means Clustering
Goal: Partition points into tight clusters.
Algorithm:
Randomly initialize with k means $m_1, \ldots, m_k$
Place x into $C_i$ if $\|x - m_i\| \le \|x - m_j\|$ for $j = 1, \ldots, k$
Update $m_1, \ldots, m_k$ as the means of $C_1, \ldots, C_k$
Repeat until means do not change
Clusters determined by Voronoi diagram of $m_1, \ldots, m_k$
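The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the initial means are drawn at random from the data points, an empty cluster keeps its previous mean, and the function names, the seed, and the iteration cap are choices made for this sketch.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_of(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (means, clusters) after iterating assignment and update steps."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # random initialization from the data
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, means[j]))
            clusters[nearest].append(p)
        # Update step: recompute means; an empty cluster keeps its old mean.
        new_means = [mean_of(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:  # means did not change, so stop
            break
        means = new_means
    return means, clusters

blob_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
blob_b = [(10, 10), (10, 11), (11, 10), (11, 11)]
means, clusters = kmeans(blob_a + blob_b, k=2)
print(means)
print(clusters)
```

The algorithm always returns k groups, whether or not the data were generated by k distinct sources.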
Hierarchical Clustering
Iteratively join clusters based on similarity measure
(agglomerative clustering).
Farthest neighbor similarity measure:
$d(C_i, C_j) = \max\{\|x - y\| : x \in C_i,\ y \in C_j\}$
Algorithm (complete linkage clustering):

Initialize clusters by $C_i = \{x_i\}$ for $i = 1, \ldots, n$
Iteratively merge the clusters for which the greatest distance
between points in the two clusters is minimized
Halts when the similarity measure exceeds a pre-defined
threshold
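A minimal sketch of complete-linkage agglomeration as described above. It recomputes all pairwise farthest-neighbor distances at every merge, which is simple but slow; the function name and the test points are illustrative assumptions.

```python
import math
from itertools import combinations

def complete_linkage(points, threshold):
    """Merge clusters until the smallest farthest-neighbor distance exceeds threshold."""
    clusters = [[p] for p in points]  # start with one cluster per point

    def farthest(ci, cj):
        # Farthest-neighbor measure: greatest distance between the two clusters.
        return max(math.dist(x, y) for x in ci for y in cj)

    while len(clusters) > 1:
        # Pick the pair of clusters whose greatest inter-point distance is smallest.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: farthest(clusters[ij[0]], clusters[ij[1]]))
        if farthest(clusters[i], clusters[j]) > threshold:
            break  # halt: the measure exceeds the pre-defined threshold
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

blob_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
blob_b = [(10, 10), (10, 11), (11, 10), (11, 11)]
print(complete_linkage(blob_a + blob_b, threshold=3.0))
```

The number of clusters returned depends entirely on the chosen threshold, which is one reason the output alone carries no statement of validity.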
Hierarchical Clustering Example
A. cholesterol biosynthesis
B. cell cycle
C. immediate-early response
D. signaling and angiogenesis
E. wound healing and tissue remodeling
Source: Michael B. Eisen et al., PNAS, Vol. 95, 1998.
The Clustering Problem

Jain et al.: Clustering is a subjective process; the
same set of data items often needs to be partitioned
differently for different applications.
Jain, A. K., Murty, M. N., and P. J. Flynn, Data Clustering: A Review,
ACM Computing Surveys, 31(3), 264-323, 1999.
Solution
Mathematical theory
Pattern recognition theory and random set theory
What Are Good Clusters?

Example:
- 2 or 3 clusters?
- What is the best separation?
Naive Clustering Error

Generate set of points from different distributions: $A_1, \ldots, A_k$.
Use clustering algorithm to form clusters: $C_1, \ldots, C_k$.
Align point sets and clusters, and count errors.
Average over a number of randomly generated sets.
Dougherty, E. R., Barrera, J., Brun, M., Kim, S., Cesar, R. M., Chen, Y.,
Bittner, M. L., and J. M. Trent, "Inference From Clustering with Application to
Gene-Expression Microarrays," Journal of Computational Biology, 9(1), 105-126, 2002.
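The align-and-count step can be sketched as follows: since cluster numbering is arbitrary, every relabeling of the clusters is tried and the one with the fewest disagreements is kept. This is a brute-force illustration that is practical only for small k; the function name and the toy labels are assumptions of this sketch.

```python
from itertools import permutations

def clustering_error(true_labels, cluster_labels, k):
    """Smallest fraction of mislabeled points over all cluster relabelings."""
    n = len(true_labels)
    best = n
    for perm in permutations(range(k)):
        # perm[c] is the source label assigned to cluster c under this alignment.
        mismatches = sum(perm[c] != t for t, c in zip(true_labels, cluster_labels))
        best = min(best, mismatches)
    return best / n

# Toy labels invented for illustration (three sources, six points).
truth = [0, 0, 1, 1, 2, 2]
print(clustering_error(truth, [1, 1, 2, 2, 0, 0], k=3))  # same partition, renamed
print(clustering_error(truth, [1, 1, 2, 0, 0, 0], k=3))  # one point misplaced
```

Averaging this quantity over many randomly generated point sets gives the Monte Carlo error estimate described on the slide.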
Synthetic Example
5 synthetic templates
Simulated data from the templates, different variances
5 different clustering methods
Single Experiment ($\sigma^2 = 0.25$)

No error!
Tighter clusters due to small variance
Results from fuzzy c-means
Experiment ($\sigma^2 = 3.0$)
Many misclassifications
Clusters start mixing
22 misclassifications (8.8%)
Hierarchical Clustering Error!!!
Before clustering
After clustering with a NICE dendrogram
24.5% Error!!
Algorithm: Hierarchical clustering with correlation measure
Clustering Error
Points are a realization S of a labeled random point
process.
Clustering algorithm assigns to S a label function $\phi_S$.
The error of $\phi$ is the expected difference between its
labels and the labels generated by the point process.
Error must take into account that we do not care about
the ordering, only the partitions generated.
Expectation taken with respect to the distribution of the
point process.
Example of Clustering Error

Left: Realization of point process
Right: Output of hierarchical clustering
Error: 40%
Clustering Validity
Clustering validity is analogous to classification
validity.
Replace classifier with cluster operator and
classification error with clustering error.
Validation Indices
Validation indices are meant to judge the validity of a
clustering output.
They can be based on a number of heuristic
considerations and methodologies.
Do they correspond to scientific validity?
Does a validation index correlate to clustering error?
Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., and E. R.
Dougherty, Model-Based Evaluation of Clustering Validation Measures,
Pattern Recognition, 40 (3), 807-824, 2007.
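Whether a validation index tracks clustering error can be checked with a rank correlation. The sketch below computes Kendall's tau (the tau-a form, which does not correct for ties) between index values and clustering errors. The numbers are invented purely to exercise the function and are not results from the cited study.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1   # both quantities move in the same direction
        elif s < 0:
            discordant += 1   # they move in opposite directions
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Made-up values: one index score and one clustering error per clustering run.
index_values = [0.81, 0.64, 0.70, 0.42, 0.55]
clustering_errors = [0.05, 0.20, 0.12, 0.35, 0.30]
print(kendall_tau(index_values, clustering_errors))
```

For an index where larger means better, a value near -1 would indicate that the index ranks clusterings in line with their error; a value near 0 would indicate that it says little about error.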
Kendall's Correlation for Indices

Top: Realization of
point process
Bottom: Kendall's
correlation:
Dunn's index, D correl,
silhouette, figure of merit
Kendall's Correlation for Indices

Top: Realization of point
process
Bottom: Kendall's
correlation:
Dunn's index, D correl,
silhouette, figure of merit
Scientific Knowledge
Requires a mathematical model.
In classification, the model is learned from training data.
Requires a methodology to test the model.

Can inferences be made from the model?
Classification and Knowledge

The model is composed of a classifier (decision
function) and an error; a data point is observed and
it is assigned to a class.
The model is inferred from data by classification and
error-estimation rules.
Model validity is determined by properties of the
error estimation rule.
Probabilistic Theory of Clustering

Clustering theory in the context of random sets.
Probabilistic error measure based on points being
clustered correctly.
Bayes clusterer (optimal clustering algorithm).
Learning theory for clustering algorithms.
Dougherty, E. R., and M. Brun, A Probabilistic Theory of Clustering, Pattern
Recognition, 37(5), 917-925, 2004.
Data Mining Violates Basic Principles

Data mining violates two basic principles of
experimental design: (1) constrain the variables so
that the experiment is only minimally affected by
external conditions and the results elucidate clear
mathematically describable behavior; and (2) all
modeling is done within a rigorous statistical setting
in which both constraints and the sampling
distribution are clearly expressed.
What Data Mining Has Produced

Absent a sound epistemology there is no ground of
knowledge and therefore no knowledge.
There are thousands of papers in the literature for
which there is no demonstration of any meaning at all.
This has several serious consequences:
Huge waste of resources.

Literature is untrustworthy and much of it is useless.
Propagation of meaningless results on meaningless results.
Lack of progress on consequential problems.
Is Data Mining a Serious Scientific Endeavor?

Dougherty and Bittner (Epistemology of the Cell):
Does anyone really believe that data mining could
produce the general theory of relativity?
