
BioE 530 Statistics and Machine Learning

Final project
This is a group project for teams of 3 people. Each team chooses one dataset to work on and
completes the project tasks below.

Nine datasets are available from the following sources. Details on each dataset are provided at the URLs
and in the references therein. You may also use your own research data or another dataset from a public
repository.
1) Diabetic Retinopathy Debrecen Data Set
(http://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set)
2) One-hundred plant species leaves data set
(https://archive.ics.uci.edu/ml/datasets/One-hundred+plant+species+leaves+data+set)
3) Gene expression cancer RNA-Seq Data Set
(http://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq)
4) Quality Assessment of Digital Colposcopies Data Set
(http://archive.ics.uci.edu/ml/datasets/Quality+Assessment+of+Digital+Colposcopies)
5) Parkinsons data set (http://archive.ics.uci.edu/ml/datasets/Parkinsons)
6) LSVT Voice Rehabilitation data set
(http://archive.ics.uci.edu/ml/datasets/LSVT+Voice+Rehabilitation#)
7) Thoracic Surgery Data Data Set
(http://archive.ics.uci.edu/ml/datasets/Thoracic+Surgery+Data)
8) Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set
(http://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions).
For this problem you could choose one activity type against the rest, e.g. sitting against the
other five types, to develop a binary classifier. You could also choose to develop multiclass
classifiers.
9) Epileptic Seizure Recognition Data Set
(http://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition)
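
The one-against-the-rest setup suggested for dataset 8 amounts to relabeling the chosen activity
as 1 and every other activity as 0. A minimal sketch follows, written in Python purely for
illustration (the submitted program itself is expected in Matlab or R). The function name and the
activity labels are assumptions made for this example and are not read from the dataset files.

```python
def one_vs_rest(labels, positive):
    """Map multiclass labels to 1 for the chosen class and 0 for everything else."""
    return [1 if label == positive else 0 for label in labels]


# Illustrative activity labels (assumed names, not loaded from the dataset).
activities = ["WALKING", "SITTING", "STANDING", "SITTING", "LAYING", "WALKING"]
binary = one_vs_rest(activities, "SITTING")
print(binary)  # [0, 1, 0, 1, 0, 0]
```

The resulting 0/1 vector can then serve as the response for any of the binary classifiers.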

Tasks to perform:
Task 1: Perform exploratory data analysis (e.g. scatter plots, clustering, PCA, etc.) to reveal the
relationships among features and any potential hidden structure.
Task 2: Establish classification models based on the following methods (each person in the group
should work on an independent method)
i) k-nearest neighbor
ii) logistic regression
iii) Naive Bayes
iv) Linear Discriminant Analysis (LDA) or SVM
Task 3: Evaluate and compare the selected models and draw conclusions based on the computational
evaluation of your models.
Other requirements:
Your implementation should include a feature selection component that determines the number of
features used. The models need to be trained with cross-validation. The criterion for model selection
should be the area under the ROC curve (with the exception of the k-nearest neighbor method).
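
One possible way to combine these requirements is sketched below: features are ranked inside each
training fold, a classifier is fit on the top m features, and the mean cross-validated AUC is
compared across values of m. This is only an illustration under stated assumptions. It is written
in Python for compactness (the submitted program is expected in Matlab or R), it runs on synthetic
data, it uses Gaussian Naive Bayes as the example classifier, and the t-like ranking score and all
function names are choices made for this sketch rather than anything prescribed by the assignment.

```python
import math
import random


def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) identity; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1  # average 1-based rank of the tie group
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def class_stats(X, y, j):
    """Per-class (mean, variance) of feature j; variance is floored to stay positive."""
    stats = {}
    for c in (0, 1):
        vals = [row[j] for row, label in zip(X, y) if label == c]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats[c] = (mean, max(var, 1e-9))
    return stats


def rank_features(X, y):
    """Feature indices sorted by a t-like class-separation score, best first."""
    scored = []
    for j in range(len(X[0])):
        s = class_stats(X, y, j)
        (m0, v0), (m1, v1) = s[0], s[1]
        scored.append((abs(m1 - m0) / math.sqrt(v0 + v1), j))
    return [j for _, j in sorted(scored, reverse=True)]


def nb_scores(train_X, train_y, test_X, feats):
    """Gaussian Naive Bayes log-odds of class 1 for each test row, using only `feats`."""
    stats = {j: class_stats(train_X, train_y, j) for j in feats}
    p1 = sum(train_y) / len(train_y)
    out = []
    for row in test_X:
        s = math.log(p1 / (1 - p1))
        for j in feats:
            (m0, v0), (m1, v1) = stats[j][0], stats[j][1]
            s += -0.5 * math.log(v1) - (row[j] - m1) ** 2 / (2 * v1)
            s -= -0.5 * math.log(v0) - (row[j] - m0) ** 2 / (2 * v0)
        out.append(s)
    return out


def cv_auc(X, y, n_feats, k=5, seed=0):
    """Mean AUC over k stratified folds; features are re-ranked on each training fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for c in (0, 1):
        members = [i for i, label in enumerate(y) if label == c]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    aucs = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_te, y_te = [X[i] for i in test_idx], [y[i] for i in test_idx]
        feats = rank_features(X_tr, y_tr)[:n_feats]  # selection sees training data only
        aucs.append(auc(nb_scores(X_tr, y_tr, X_te, feats), y_te))
    return sum(aucs) / k


# Synthetic data: features 0 and 1 carry signal, features 2-5 are pure noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(100)]
X = [[data_rng.gauss(2.0 * label, 1.0), data_rng.gauss(1.0 * label, 1.0)]
     + [data_rng.gauss(0.0, 1.0) for _ in range(4)] for label in y]

results = {m: cv_auc(X, y, m) for m in range(1, 7)}
best_m = max(results, key=results.get)
print(results, best_m)
```

Here `results` maps each candidate number of features to its mean cross-validated AUC, and `best_m`
is the count with the highest value. Ranking the features inside each training fold keeps the
held-out fold from influencing which features are chosen, which would otherwise inflate the AUC.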

Presentation requirements (15 minutes, ~10 slides)


Intro to the example (3 slides)
Approach (2-3 slides)
Results (3-4 slides)
Conclusion (1 slide)

What to submit
Report (typed, font size 12pt, double spaced). The report should include the sections
Introduction, Methods, Results, Conclusion and References. Each member's individual contribution
to the project should be indicated in the final paragraph of the report. The report should be
fewer than 10 pages long.
Your group presentation file.
Your program used to generate the results (Matlab/R).
Submit all files in a single zip file.
