Achilles Rasquinha
March 31, 2017
1 Introduction
This project proposal is for the Helikar Lab, under the organization
Computational Biology at the University of Nebraska, Lincoln. I am keen to work
on the project CancerDiscover: a GUI for cancer prediction and biometric
identification using microarray data.
2 Student Information
General Information
Name: Achilles Rasquinha
Alternate Names:
GitHub: github.com/achillesrasquinha
LinkedIn: linkedin.com/in/achillesrasquinha
Email: achillesrasquinha@gmail.com
Time Zone: Asia/Kolkata (UTC +05:30)
Website: Brains from Scratch
Background Information
Education
University of Mumbai
Bachelor's Degree, Computer Engineering (2013 - Present)
Courses: Machine Learning, Artificial Intelligence, Data
Warehouse and Mining, Soft Computing, Distributed Databases,
Software Engineering, Structured and Object-Oriented Analysis
and Design.
St. Xavier's College, Mumbai
Higher Secondary Certificate (2011 - 2013)
Courses: Physics, Applied and Organic Chemistry, Applied
Mathematics, Economics and Biology.
What are your languages of choice?
I'm an eternal Pythonista and currently, a Pythoneer. I'm also well-versed
in C++ (STL and Boost) and Java. I also currently work
on various projects written in JavaScript (Node.js).
I've also authored SnackJS - Android Snackbars for the web, a Responsive
Web Design UI component for notification and feedback
on the web, written in TypeScript and SASS. Currently with 1,236
downloads, snackjs has turned out to be a favourite among web developers.
(npmjs.com/package/snackjs)
and hurdles I may face, if given the chance to implement the said
project.
3 Project Information
Proposal Title
Problem
As of today, CancerDiscover has a cumbersome build and workflow for
users conducting microarray experiments. The current setup requires
users to manually download Affymetrix CEL files and, separately, to
label them. Users are given no visual analysis at any point during the
course of an experiment. Users are also limited to the default parameters
for each classifier (during analysis), which prevents them from building
more efficient prediction models. Finally, installation and usage of the
overall workflow are poorly documented.
Solution
candis - a minimalistic, clean Graphical User Interface integrated with
the current command-line tool - will not only provide a smooth build and
workflow for an experiment setup, but also provide remote downloading
of data sets, quality control visualizations and user-defined
parameters during analysis. Such a modular framework will also allow
future extensions for more methods and techniques. Users will also
have access to a well-documented manual that eases the overall use of the
proposed software. The ongoing development of this application
can be viewed at github.com/achillesrasquinha/candis
Proposal Description
The first tab in the pipeline, the Source frame, shall provide the
following functionalities:
A live search functionality for users to select and download raw
Affymetrix CEL files from the National Center for Biotechnology
Information website (either via NCBI FTP or via NCBI's preferred
API) simply based on a user search query, using requests.
Downloaded data sets will be cached onto the user's local disk
for future use.
A dataset loader for datasets available on a local machine.
An editor to input label names (normal, tumour, etc.) for each
custom selected dataset.
A visual list to display metadata about available datasets and
to custom select datasets (Data Selection) for the next stages in
the pipeline.
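The download-and-cache behaviour described for the Source frame could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name cached_download, the hashed file naming and the injectable fetch callable are hypothetical and not part of candis; only the use of requests comes from the proposal.

```python
import hashlib
from pathlib import Path


def cached_download(url, cache_dir, fetch=None):
    """Return the local path for `url`, downloading only on a cache miss.

    `fetch` is a hypothetical hook (a callable taking a URL and returning
    bytes) so the caching logic can be exercised without network access.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Derive a stable file name from the URL (an assumption for this sketch).
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if target.exists():
        return target  # cache hit: nothing is downloaded

    if fetch is None:
        import requests  # imported lazily; only needed for real downloads

        def fetch(address):
            response = requests.get(address, timeout=30)
            response.raise_for_status()
            return response.content

    target.write_bytes(fetch(url))
    return target
```

A second call with the same URL and cache directory returns the cached file without contacting the server again.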
Preview
The next frame in the pipeline, the Preview frame, shall provide
the following functionalities:
Quality Control Statistics and Visualization such as box-and-whisker
plots (upper quartile, interquartile range and lower quartile for
each microarray sample versus their intensities on a logarithmic
scale) and density plots (intensities on a logarithmic scale versus
their densities), using matplotlib.
Any microarray samples flagged for removal will be highlighted
in red.
Save visualizations in a format of one's choice (.png, .jpeg, etc.),
using matplotlib.
The Preview frame will be designed so that developers can
smoothly register more visualization methods as the software evolves
over time.
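As a rough sketch of the quality-control statistics above, the quartiles of each sample can be computed on a log2 scale and drawn with matplotlib. The rule used here to flag a sample (its median lying more than a tolerance away from the median of all sample medians) is purely an assumption for illustration, since the proposal does not state a criterion; all function names are hypothetical.

```python
import math
import statistics


def log2_intensities(intensities):
    """Map raw intensities onto a log2 scale, skipping non-positive values."""
    return [math.log2(value) for value in intensities if value > 0]


def box_stats(intensities):
    """Quartiles of one sample on the log2 scale."""
    values = sorted(log2_intensities(intensities))
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


def flag_samples(samples, tolerance=1.0):
    """Names of samples whose median is far from the overall median.

    Hypothetical rule for this sketch, not the proposal's criterion.
    """
    medians = {name: box_stats(vals)["median"] for name, vals in samples.items()}
    centre = statistics.median(medians.values())
    return [name for name, m in medians.items() if abs(m - centre) > tolerance]


def plot_boxes(samples, path):
    """Draw one box per sample and save it; format follows the file extension."""
    import matplotlib
    matplotlib.use("Agg")  # off-screen backend, no display needed
    import matplotlib.pyplot as plt

    names = list(samples)
    fig, ax = plt.subplots()
    ax.boxplot([log2_intensities(samples[name]) for name in names])
    ax.set_xticklabels(names)
    ax.set_ylabel("log2 intensity")
    fig.savefig(path)
    plt.close(fig)
```

Keeping the statistics separate from the plotting function means the flagged sample names can be reused by the frame to colour the corresponding boxes.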
Preprocess
Model
We move to the next frame - Model. Here, we move from the pre-
processing phase to the modelling and analysis phase. Model shall
provide the following functionalities:
A custom dialog titled Experiment for users to select:
Name: A user-defined microarray experiment name.
Description: A user-defined description for the experiment.
Classifier : Random Forest (default), Decision Trees, Sup-
port Vector Machines, k-Nearest Neighbors, etc.
Parameters: A JSON (Python dict) editor for users to
tweak the individual parameter set belonging to each classifier
(e.g. entropy or gini as the split criterion if the classifier is
a Decision Tree).
Training Size: A user-defined training size (ratio) within
the range (0, 1].
k-Folds: A user-defined fold count for cross-validation. Users
can tick a check-box if they wish to use a validation
split.
Dimensionality Reduction: A choice for a feature selec-
tion technique to be used (defaults to Correlation-based Fea-
ture Selection).
Users can view the current experiments (complete or in-progress)
in a tabular form and choose a model for the next stage in the
pipeline (Analysis and Prediction). Users can view the training
time taken for each experiment as well.
Users can save such experiments (serializing candis.Experiment
objects into a .cache directory on a user's local disk and reload
them into the current frame when needed).
FUTURE SCOPE : Since the Model frame provides combi-
nations of various phases and techniques, a base framework for
users to create experiments using flow graphs will be imple-
mented. The current implementation however, will be a single
parent candis.Model.Dialog (the Experiment dialog) and one
or more child QtWidgets.QDialogs.
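The fields of the Experiment dialog and the save/reload behaviour could be modelled along these lines. This is a minimal sketch, not the candis.Experiment implementation: the default training size and fold count, the from_editor helper and the choice of JSON files for serialization are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Experiment:
    name: str
    description: str = ""
    classifier: str = "Random Forest"
    parameters: dict = field(default_factory=dict)
    training_size: float = 0.7  # assumed default, must lie in (0, 1]
    k_folds: int = 10           # assumed default
    feature_selection: str = "Correlation-based Feature Selection"

    def __post_init__(self):
        if not 0 < self.training_size <= 1:
            raise ValueError("training_size must lie in (0, 1]")
        if self.k_folds < 2:
            raise ValueError("k_folds must be at least 2")

    @classmethod
    def from_editor(cls, name, parameters_json, **options):
        """Build an experiment from the text typed into the JSON editor."""
        parameters = json.loads(parameters_json)
        if not isinstance(parameters, dict):
            raise ValueError("classifier parameters must be a JSON object")
        return cls(name=name, parameters=parameters, **options)

    def save(self, cache_dir=".cache"):
        """Write the experiment as JSON into the cache directory."""
        directory = Path(cache_dir)
        directory.mkdir(parents=True, exist_ok=True)
        path = directory / f"{self.name}.json"
        path.write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")
        return path

    @classmethod
    def load(cls, path):
        """Reload a previously saved experiment."""
        return cls(**json.loads(Path(path).read_text(encoding="utf-8")))
```

Validating in __post_init__ means a reloaded experiment is checked against the same rules as one created from the dialog.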
Analysis
Each trained model can be pushed into the Analysis frame which
shall provide the following functionalities:
A complete generalized report of the experiment with necessary
infographics.
Visualizations for analysis such as - Confusion Matrix, ROC
Curve, etc.
Metrics such as Accuracy, Precision, Recall, F1-Score, etc.
Users can save a complete report in the format of one's choice
(HTML, PDF, etc.) as well as individual visual plots.
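For a two-class experiment (e.g. tumour versus normal), the metrics listed above follow directly from the confusion matrix counts. The sketch below uses plain Python and hypothetical function names; it only illustrates the definitions and is not how candis computes its report.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one positive label."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def classification_metrics(actual, predicted, positive):
    """Accuracy, precision, recall and F1-score; 0.0 when undefined."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Returning 0.0 for an undefined ratio is a convention chosen for this sketch so that a report can always be rendered.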
Predict
The last frame in the pipeline, the Predict frame shall provide the
following functionalities:
Load a trained model in the pipeline for prediction.
Load a previously trained experiment from the local disk.
Perform a prediction based on user input.
Perform a prediction from an unknown sample.
Requirements
GUI Framework: PyQt5 - A Python binding for Qt5.
Visualization: matplotlib
HTTP requests: requests
Documentation: sphinx
Timeline
Community Bonding - (May 5th to May 30th, 2017)
During the community bonding phase, I would like to get familiar
with the organization, gather and understand as much necessary
information and as many references as possible (in particular,
quality control techniques) and build working prototypes for the same.
A complete wireframe will be structured out during this phase with
modularity in mind. I would also like to prioritize the modules to be
implemented based on inputs and feedback received from mentors, and to
discuss the preferred way of reviewing the latest development and the
choice of platform for simultaneous documentation, continuous
integration and code coverage. I would also like to re-factor (and
probably document) the existing command-line tool, CancerDiscover, and
implement front-end functional prototypes smoothly integrated with the
back-end command-line scripts.
Possible Hurdles
Unknown dependency issues that may arise in the first stage of
development.
Visual Embedding integrations from WEKA's ARFF to pandas
and matplotlib.
Week 5 - 8 - (July 1st to July 28th, 2017)
Week 5, 6 and 7 will be dedicated to implementing the next 2 tabs in
the pipeline (Model and Analysis). There exists a mutual dependency
between the two phases which requires simultaneous development of
the two. Week 5 and 6 will be dedicated to each of the tabs, during
which candis.Experiment, an object which abstracts all necessary
information related to each experiment, will be implemented. Week 7
will be dedicated to integrating candis.Experiment instances with the
visual embeddings proposed for the two tabs. Appropriate unit tests
and documentation will be done during Week 8 (July 24th to July
28th). At the same time, possible bugs will be handled and feedback
from mentors will be considered for improvements in the latest
development. A working but unstable bleeding-edge prototype will be
assured by the completion of the first 8 weeks.
Possible Hurdles
CancerDiscover integration with candis's Model and Analysis frames.
Week 9 - 12 - (July 28th to August 28th, 2017)
Week 9 and 10 will be dedicated to implementing unit tests for the
previous 2 tabs and the final tab in the pipeline - Predict. Week
11 will be utilized for experimenting with candis, rectifying possible
bugs as well as providing more enhancements. Reducing dependencies,
cross-platform testing and code coverage will be done during Week
11. Week 11 will also be dedicated to completing full-fledged
documentation, with user and developer guides (accessible online via
readthedocs.io), for the same. A production-ready software (versioned 1.0)
will be released during the final evaluation.
4 Commitments
When do your classes and exams finish?
My classes for my final semester end on 18th April, 2017. However,
my examinations extend until the end of April. I shall be free full-time
thereafter.
Do you have any other school-related activities scheduled during
the coding period?
None.
Do you have a full or part-time job or internship this summer?
5 Additional Information
Résumé: Achilles Rasquinha's Résumé
Contact: +91 9821410251
I'm eager and excited to work with the research team at Helikar Lab and I
hope you will consider me fit for the project to be implemented under the GSoC
banner.