CS 395T
Unique Number: 49460
Course Announcement
Spring 2000
M-W 4:00-5:30pm
CPE 2.206
Paper Readings
Class Projects
Current Projects
Sample project descriptions and associated resources.
Handouts
Material to be covered
Related Courses
Stanford's CS 349, Data Mining, Search, and the World Wide Web,
Fall 1998.
UC Berkeley's CS 294-7, Large Datasets, Fall 1999.
UT Austin's ECE course EE 380L, A Practicum in Data
Mining, Fall 1999.
Princeton's CIS 700/702, Information Retrieval, ?.
The Data Mining Lab (DML) is led by Prof. Inderjit Dhillon. It is
closely affiliated with the Machine Learning Research Group
(MLRG) (led by Prof. Mooney) and the Intelligent Data Exploration
and Analysis Laboratory (IDEAL) (led by Prof. Ghosh of ECE). For
applications in bioinformatics, the group collaborates closely with Prof.
Marcotte, who is a faculty member in the Chemistry/Biochemistry
department and the Center for Computational Biology and
Bioinformatics (CCBB).
The Data Mining Lab at UT Austin is focused on the analysis of very
large data sets, especially those that arise in the application areas of text
mining and bioinformatics. The emphasis is on finding sound,
theoretically motivated algorithms for the central tasks in data mining,
such as high-dimensional clustering, classification, and data
visualization.
The current focus of the group is on uncovering the latent low-
dimensional structure that is often inherent in high-dimensional data. In
many important applications, such as text mining and face recognition,
the data matrices that arise are sparse and non-negative. Thus it is
natural to seek low-dimensional approximations that preserve these
properties -- sparsity in the approximation yields economy of
representation, while non-negativity aids interpretation (note that
traditional methods such as the SVD and PCA preserve neither
property).
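
To make the contrast concrete, the following sketch (illustrative code only, not the lab's software; the toy matrix and function names are made up) computes a rank-2 non-negative factorization of a small term-document count matrix using the standard Lee-Seung multiplicative updates, and checks that, unlike a truncated SVD of the same matrix, both factors remain entrywise non-negative:

    import numpy as np

    def nmf_multiplicative(A, k, n_iter=200, eps=1e-9, seed=0):
        """Rank-k factorization A ~= W @ H via Lee-Seung multiplicative
        updates; starting from non-negative factors, W and H stay
        entrywise non-negative throughout."""
        rng = np.random.default_rng(seed)
        m, n = A.shape
        W = rng.random((m, k))
        H = rng.random((k, n))
        for _ in range(n_iter):
            H *= (W.T @ A) / (W.T @ W @ H + eps)
            W *= (A @ H.T) / (W @ H @ H.T + eps)
        return W, H

    # Toy term-document count matrix (purely illustrative).
    A = np.array([[3., 0., 1., 0.],
                  [2., 0., 0., 1.],
                  [0., 4., 0., 2.],
                  [0., 3., 1., 3.]])

    W, H = nmf_multiplicative(A, k=2)
    print("NMF factors non-negative:", bool((W >= 0).all() and (H >= 0).all()))

    # Rank-2 truncated SVD of the same matrix: optimal in Frobenius norm,
    # but its factors are generally dense and mixed-sign.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U2, Vt2 = U[:, :2] * s[:2], Vt[:2, :]
    print("SVD factors contain negatives:", bool((U2 < 0).any() or (Vt2 < 0).any()))

The non-negative factors W and H play the role of the non-negative, interpretable low-rank approximations described above.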
With the above goals in mind, the lab has recently been exploring the
application of information theory to data mining tasks. Information
theory provides a natural way of dealing with non-negative data
vectors by treating them as probability vectors. Problems such as
clustering can then be posed as information-theoretic optimization
problems, for example maximizing mutual information. Applied to
text mining, this approach has been shown to reveal the semantic
similarity of words, leading to a substantial reduction in classifier
complexity and to higher document-classification accuracy when
training data is sparse. Further directions currently being explored
include: (a) information-theoretic clustering and approximation of
higher-order non-negative tensors (which often arise in applications
as multidimensional contingency tables), and (b) new algorithms for
low-rank non-negative matrix factorization.
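
A minimal sketch of the word-clustering idea (hypothetical code, not the lab's released algorithm or software): rows of a word-by-class count matrix are normalized into conditional probability vectors, and words are grouped by a k-means-style loop that assigns each word to the cluster whose frequency-weighted centroid distribution is nearest in KL divergence, in the spirit of divisive information-theoretic word clustering:

    import numpy as np

    def kl(p, q, eps=1e-12):
        """KL divergence D(p || q) between two probability vectors."""
        p, q = p + eps, q + eps
        return float(np.sum(p * np.log(p / q)))

    def cluster_words(counts, k, n_iter=20, seed=0):
        """Each row of `counts` holds a word's co-occurrence counts with
        the document classes.  Rows are normalized to p(class | word) and
        clustered by KL divergence to frequency-weighted centroids."""
        rng = np.random.default_rng(seed)
        weights = counts.sum(axis=1)            # word frequencies
        P = counts / weights[:, None]           # p(class | word) per row
        assign = rng.integers(k, size=len(P))   # random initial clusters
        for _ in range(n_iter):
            centroids = np.vstack([
                np.average(P[assign == j], axis=0, weights=weights[assign == j])
                if (assign == j).any() else P[rng.integers(len(P))]
                for j in range(k)
            ])
            assign = np.array([np.argmin([kl(p, c) for c in centroids]) for p in P])
        return assign

    # Toy word-by-class count matrix (illustrative only): 4 words, 2 classes.
    counts = np.array([[9., 1.],
                       [8., 2.],
                       [1., 9.],
                       [2., 8.]])
    print(cluster_words(counts, k=2))  # expect words 0-1 and 2-3 in different clusters

Words that land in the same cluster can then be merged into a single feature, which is what yields the reduction in classifier complexity mentioned above.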
The Data Mining Lab has disseminated publications, software, and
results for document clustering, clustering of gene expression data in
bioinformatics, and multidimensional data visualization.