Decision Tree model – tuning CP parameter
We will try to find the best CP parameter for the tree. We will use 10-fold cross-validation to evaluate the
tree, so the best CP is the one for which the cross-validation error of the DT model is minimal.
We will train 50 DT instances with different values of CP and observe the corresponding cross-validation error,
starting at CP = 0.026 and ending with CP near 0. We will also record the corresponding
tree size (number of nodes), so we can see how changes of CP prune the tree.
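The sweep described above can be sketched in R roughly as follows. This is a hypothetical reconstruction, not the actual code from utils.R; the variable names (`cps`, `results`) and the node-count formula are illustrative, and it assumes a `train` data frame with a `class` column, as used later in the report.

```r
# Sketch of the CP sweep using rpart's built-in 10-fold cross-validation.
library(rpart)

cps <- seq(0.026, 0.0005, length.out = 50)   # 50 CP values from 0.026 towards 0
results <- data.frame(cp = cps, xerror = NA, size = NA)

for (i in seq_along(cps)) {
  model <- rpart(class ~ ., data = train, method = 'class',
                 cp = cps[i], xval = 10)     # 10-fold CV built into rpart
  cpt <- model$cptable
  results$xerror[i] <- cpt[nrow(cpt), 'xerror']            # relative CV error
  results$size[i]   <- 2 * sum(model$frame$var == '<leaf>') - 1  # nodes of a binary tree
}

results[which.min(results$xerror), ]         # best CP by cross-validation error
```

Plotting `results$xerror` and `results$size` against `results$cp` then yields the two plots discussed below.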
As we can see from the plots below, the dependency of the cross-validation error on CP is not monotonic. On the
right side of the plot the tendency seems proportional and regular – the lower the CP, the lower
the error. Moreover, not every change of CP implies a change of the error. From CP ≈ 0.010 to the left,
however, the dependency is very unpredictable, and it seems that the lower the CP, the higher the
error. CP = 0.010 therefore seems to be the best choice: the error is minimal for this CP, and
the tree is not that large (11 nodes) – almost the same as the tree for the maximal
CP = 0.025 (7 nodes), which is a very good compromise.
[Plot: dependency of cross-validation error on CP]
[Plot: dependency of tree size on CP; tree size [# of nodes] vs. CP]
Decision Tree model – confidence intervals for the mean accuracy value
# Evaluate final DT model using 10-fold cross-validation and compute confidence intervals
library('knitr')
source('utils.R') # Custom code used for multiple t-tests
kable(t)
conf. level a b
0.90 0.831 0.859
0.95 0.827 0.863
0.99 0.820 0.870
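The intervals above can be computed from the 10 per-fold accuracies with a standard t-based confidence interval. The helper below is a sketch of what the custom code in utils.R might do; `ci_for_level` and `acc` are hypothetical names, not the actual implementation.

```r
# t-based confidence interval for the mean accuracy over CV folds.
ci_for_level <- function(acc, level) {
  n    <- length(acc)
  m    <- mean(acc)
  half <- qt(1 - (1 - level) / 2, df = n - 1) * sd(acc) / sqrt(n)
  c(a = m - half, b = m + half)   # lower and upper bound of the interval
}

# acc <- vector of per-fold accuracies from 10-fold cross-validation
# t(sapply(c(0.90, 0.95, 0.99), function(l) ci_for_level(acc, l)))
```

Note that wider fold-to-fold variance in accuracy directly widens these intervals, which matters when comparing models later.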
# Plot final DT model with the best value of CP (using whole dataset as training data)
library('rpart')
library('rpart.plot')
train <- within(data, rm(protein_id)) # Remove 'protein_id' column
model <- rpart(class ~ ., data = train, method = 'class', cp = 0.010)
rpart.plot(model)
[Plot: final decision tree (11 nodes); root split on protrusion < 122, further splits on hydrophilic, bfactor, aromatic and atomDensity]
Table 3: Feature importance that rpart uses for tree split selection
feature importance
protrusion 793.82
atoms 177.34
atomC 118.86
hydrophilic 117.64
atomDensity 114.55
hydrophatyIndex 67.32
apRawValids 64.58
polar 56.45
aromatic 55.06
atomicHydrophobicity 53.69
hBondDonor 49.45
ionizable 47.82
bfactor 46.74
acidic 45.68
vsAromatic 32.32
hydrophobic 8.22
vsHydrophobic 4.03
The importance of all features can be read directly from the table. As we can observe, the most important feature
is protrusion; it is also at the root of the tree. However, the mere fact that a node sits lower in the tree does
not automatically imply that the feature associated with it is unimportant. Our tree is quite small,
due to our substantiated choice of the CP value, but if it were larger, we might see that some features are
used more often (and are thus more important) than other features that are nevertheless closer to the root.
Ensemble method – AdaBoost
Parameter tuning
We will try to build a good AdaBoost model. Training an AdaBoost model is a very time-consuming operation
compared to training a single (yet large) decision tree, especially for some combinations of hyper-parameters –
for example, a large number of DTs in the ensemble. We will use grid search for parameter tuning. It is
therefore almost impossible to use cross-validation to evaluate the model while tuning the parameters, because
cross-validation requires learning a new model on each fold iteration. Instead, we will split our data into
two sets (train and test), train a new model for each parameter combination on the train
set and then evaluate it on the test set.
We will focus mainly on the ensemble size parameter. We will try over 100 different values in steps of 2,
starting at 3 and ending around 200. Because the parameter space is really large, instead of trying multiple
combinations of this parameter with a variety of others, we will try just one other option:
we will compare two models differing in the hyper-parameters of the individual trees, and for each model try the
different ensemble sizes as described. The first model will use regular DTs with the default values set in the
rpart package. Note that the default value of CP is 0.010, which is our optimum from the previous section. The
second model will be a stump ensemble, so the rpart parameters will be set accordingly – to obtain stumps. The
choice of stumps as weak learners is legitimate, because they generally achieve good results, and it is in any
case interesting to compare them with full DTs.
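A sketch of this grid search with the adabag package is given below. The control settings used to force stumps and the variable names (`sizes`, `full_err`, `stump_err`) are illustrative assumptions, not the report's actual code; it assumes `train` and `test` data frames with a factor `class` column.

```r
# Sketch of the ensemble-size sweep: full trees vs. stumps.
library(adabag)
library(rpart)

sizes <- seq(3, 201, by = 2)
stump_ctrl <- rpart.control(maxdepth = 1, cp = -1, minsplit = 0)  # force stumps

test_error <- function(model, newdata) {
  predict.boosting(model, newdata = newdata)$error
}

full_err <- stump_err <- numeric(length(sizes))
for (i in seq_along(sizes)) {
  full  <- boosting(class ~ ., data = train, mfinal = sizes[i])
  stump <- boosting(class ~ ., data = train, mfinal = sizes[i],
                    control = stump_ctrl)
  full_err[i]  <- test_error(full,  test)
  stump_err[i] <- test_error(stump, test)
}
```

With roughly 100 sizes and two model families, this trains about 200 ensembles, which explains the long running time mentioned below.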
[Plot: dependency of test error on ensemble size; test error [%] vs. ensemble size, one curve for Full DTs and one for Stumps]
After a night of computation we obtained the plot above. We can see that both curves oscillate intensively – a
small change in ensemble size can have a big impact on the error rate of the whole ensemble. Some changes are
very abrupt (up to about 3%). Apart from the first few models with a low ensemble size, we can observe that
stump ensembles do about 1.4% better on average than full-DT ensembles. In the extreme case of ensemble
size 140, the stump ensemble is more than 3% better than the full-DT ensemble; in the interesting case of
ensemble size 48, the error rate of both ensembles is the same.
Next, we will use stump ensembles, because they seem to do a better job and are also easier to
train. We will select the 3 best values of ensemble size and evaluate the corresponding stump ensembles using
cross-validation. According to the results above, we select sizes 16, 178 and 205.
Evaluation
First, we will perform 10-fold cross-validation on the selected ensembles and measure their accuracy and error
rate. Next, we will perform randomized 10-fold cross-validation and measure the accuracy and error rate as well.
Randomized means that we will perform 10-fold cross-validation 10 times on each ensemble, and each time we
will randomly permute the protein_id values before splitting the data into folds. This approach should give us
more accurate results.
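The randomized variant can be sketched as a thin wrapper around the regular cross-validation. Both `randomized_cv` and the inner `cv10` helper are hypothetical names standing in for the report's actual code; the key step is only the row permutation before the fold split.

```r
# Sketch of randomized 10-fold CV: repeat regular CV on freshly permuted data.
randomized_cv <- function(data, reps = 10) {
  unlist(lapply(seq_len(reps), function(r) {
    shuffled <- data[sample(nrow(data)), ]   # permute rows before the fold split
    cv10(shuffled)                           # regular 10-fold CV (hypothetical helper)
  }))
}
```

Averaging over 10 × 10 folds reduces the variance caused by any single unlucky fold assignment.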
conf. level a b
0.90 0.827 0.851
0.95 0.824 0.853
0.99 0.818 0.860
As we can see from the table above, the results are not as good as expected. The error estimate from the plot
on the previous page is probably somewhat distorted. The reason could be a higher sensitivity to the training
data, which would have been avoided if we had used cross-validation during parameter tuning. The rates of the
different ensembles are very similar, although the sizes of the ensembles differ significantly, which is quite a
surprising result. Even the output of the randomized validation is very close to the output of the regular
cross-validation. Despite the bad results, it seems that the error goes slightly down with increasing ensemble
size. Therefore we will select the ensemble of size 205 as our final choice. Confidence intervals for its mean
accuracy are printed below the summary table.
If we compare the ensemble with the simple decision tree from the previous section, we will be disappointed. The
error rate of the tree is around 15.5%, whereas the error rate of our best ensemble is around 16.1%. These are
mean values; if we compare the corresponding confidence intervals, we can conclude that the error rates of the
simple decision tree and the stump ensemble are almost the same (on our protein ligandability recognition
task).
Searching for compact feature set – SVM model
Training a good Support Vector Machine requires careful consideration of the hyper-parameters, especially the
kernel transformation. The choice of an appropriate kernel requires good knowledge of the data – the geometry
of the problem. Because we are not biologists, it is almost impossible for us to really understand the
background of the problem. We could try some data analysis (e.g. testing the linear separability of the
data), but it would ultimately be harder than simply trying several options.
Another disadvantage is that it is almost impossible to use the cross-validation error as a criterion for tuning
the hyper-parameters – it is too computationally expensive. So we have to make a compromise. We will start with
the choice of the kernel. During this search we should also try some variety of the other hyper-parameters
(cost, gamma and degree), because we could reject an appropriate kernel just because of a bad choice of the
other parameters.
First, we will split the data into two fixed sets (train and test). The train set will then be internally split
into a real train set and a tune set. This is done internally by the R tune function (once we know it exists).
The real train subset is used for training the model with a certain combination of hyper-parameters; the tune
set is used for evaluating that model. The result of this evaluation is referred to as the tuning error. The
combination of hyper-parameters with the minimal tuning error is finally selected as the winner. We will leave
the test set for the final evaluation of the model with the best hyper-parameters. This will prevent us from
choosing an overfitted model, despite its good tuning error.
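The kernel search described above can be sketched with e1071's `tune` wrapper. The parameter grids below are illustrative assumptions, not the values actually used in the report; it assumes a `train` data frame with a factor `class` column.

```r
# Sketch of the kernel search: tune() does the internal train/tune split.
library(e1071)

kernels <- c('linear', 'polynomial', 'radial')
tuned <- lapply(kernels, function(k) {
  tune(svm, class ~ ., data = train, kernel = k,
       ranges = list(cost  = c(0.1, 1, 10, 100),
                     gamma = c(0.01, 0.1, 1)))
})
names(tuned) <- kernels

sapply(tuned, function(t) t$best.performance)   # minimal tuning error per kernel
```

The winning model's test error is then measured separately on the held-out test set, as the text explains.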
As we can see from the table, the polynomial and radial kernels have a very good tuning error (almost 10%),
but their test error is very bad (around 20%) compared to the linear kernel. This is probably because
polynomial- and radial-kernel SVMs are very prone to overfitting. The linear kernel has a test error around
15%; the reason could be that the data are quite well linearly separable. We will therefore choose the linear
kernel, and next we will try to tune its cost parameter.
[Plot: tuning error vs. cost parameter, cost ranging from 0 to 50]
We have chosen the range from 1 to 50 for the cost parameter, according to the results of the previous tuning –
the resulting cost parameter there was 10, so this order of magnitude seems reasonable. As we can observe from
the plot, the winning cost parameter is obvious – cost = 1.
Evaluation
We will evaluate our final SVM model using 10-fold cross-validation and compute corresponding confidence
intervals.
Table 8: Cross-validation results of chosen SVM
conf. level a b
0.90 0.831 0.859
0.95 0.828 0.862
0.99 0.820 0.870
As we can see in Table 8, the error differs a lot across the folds – it ranges between 12% and 18%. This
could be because some parts of the data tend to be more linearly separable than others, and the SVM with the
linear kernel therefore fits those parts better. As a result, the confidence intervals are also wider in
comparison with the other models we have trained.
In conclusion, the final mean error rate of the SVM is 15.5%. The error rate of the DT is around 15.5% and
the error rate of the AdaBoost ensemble is 16.1%. So this SVM model is closely comparable to our DT model,
which is quite interesting, because the SVM is a linear model (in our setup) and DTs are non-linear models,
yet both have very similar accuracy.
Searching for a compact feature set
We will implement the forward selection method, following the instructions. We cannot afford to tune the
hyper-parameters during the search, because of the computational complexity of cross-validation. We will rather
concentrate on selecting the right feature set and will keep our k at the lowest possible level (k = 1).
source('utils.R')
In each iteration we add k features to the set and evaluate it using 10-fold cross-validation. We also
periodically check the tendency of the last x cross-validation errors, computed in the last x iterations. The
chosen value of x is 5 (the tendency range). We detect two possible tendency scenarios:
1. Whether the last x values fall within a specified threshold range, i.e. whether the tendency of the last x
values is constant up to some deviation (the threshold). This is what the lastXInT function does.
2. Whether all the deltas between the last x + 1 values are at least some threshold. This is what the
lastXAtLeastT function does.
The thresholds for both scenarios above are set to 1% by default.
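The two checks might look roughly as follows. This is a guess at what utils.R contains, written to match the descriptions above; the real implementation may differ in details.

```r
# Scenario 1: last x errors are roughly constant (within threshold t).
lastXInT <- function(errors, x = 5, t = 0.01) {
  last <- tail(errors, x)
  length(errors) >= x && max(last) - min(last) <= t
}

# Scenario 2: every delta between the last x + 1 errors is at least t.
lastXAtLeastT <- function(errors, x = 5, t = 0.01) {
  last <- tail(errors, x + 1)
  length(errors) >= x + 1 && all(abs(diff(last)) >= t)
}
```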
Once we have an implementation, the last thing we need is the importance ranking. We will train our
ensemble model again and obtain the feature importance list. As we can see from the table below, many
features have an importance of 0. We do not expect these features in the output of the forward selection method.
kable(imp, digits = 2)
Table 10: Feature importance obtained from ensemble model
feature importance
protrusion 80.88
vsAromatic 5.41
atomO 3.67
hydrophilic 3.11
bfactor 1.59
vsHydrophobic 1.22
polar 0.98
vsAcceptor 0.78
apRawValids 0.62
aliphatic 0.29
apRawInvalids 0.24
atomicHydrophobicity 0.14
hydrophobic 0.11
vsAnion 0.04
hydrophatyIndex 0.01
acidic 0.00
amide 0.00
aromatic 0.00
atomC 0.00
atomDensity 0.00
atomN 0.00
atoms 0.00
basic 0.00
hAcceptorAtoms 0.00
hBondAcceptor 0.00
hBondDonor 0.00
hBondDonorAcceptor 0.00
hDonorAtoms 0.00
hydroxyl 0.00
ionizable 0.00
negCharge 0.00
posCharge 0.00
sulfur 0.00
vsCation 0.00
vsDonor 0.00
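Driven by this ranking, the selection loop itself can be sketched as below. `forward_select` and `cv_error` are hypothetical names (the latter standing for a 10-fold cross-validation helper); only `lastXInT` corresponds to a function actually mentioned in the text.

```r
# Sketch of forward selection with k = 1, stopping when the error curve flattens.
forward_select <- function(ranked_features, data, x = 5, t = 0.01) {
  selected <- character(0)
  errors   <- numeric(0)
  for (f in ranked_features) {
    selected <- c(selected, f)                            # add next-best feature
    errors   <- c(errors, cv_error(data[, c(selected, 'class')]))
    if (lastXInT(errors, x, t)) break                     # tendency is constant
  }
  list(features = selected, errors = errors)
}
```

Features with zero importance sit at the end of the ranking, so the loop stops long before reaching them.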
In the barplot below, we can observe the progress of the feature forward selection method. Each bar in the
plot represents the i-th iteration of the method. Note that the feature printed on the x axis is the name of
the feature that was added to the set in the i-th step, so the set of features in the i-th iteration consists
of all preceding features plus the one printed on the x axis.
[Barplot: progress of the feature forward selection method; cross-validation error per iteration, features added in order: protrusion, vsAromatic, atomO, hydrophilic, bfactor, vsHydrophobic, polar, vsAcceptor, apRawValids, aliphatic]
As we can see from the barplot, the method selected the first 10 features. Exactly the last 5 iterations have
very similar cross-validation error rates, so the stopping criterion was triggered. Now we have to choose the
first n features as the representative compact feature set. n = 8 seems a reasonable choice, because the error
rate stays the same beyond this boundary. Finally, let's print out our compact feature set:
feature importance
protrusion 80.88
vsAromatic 5.41
atomO 3.67
hydrophilic 3.11
bfactor 1.59
vsHydrophobic 1.22
polar 0.98
vsAcceptor 0.78
ROC evaluation
We will run the ROC evaluation, plot the corresponding ROC curves and compute the AUC values. We will use
10-fold cross-validation when evaluating each model to obtain the ROC data, so we will actually get 10 ROC
curves per model. Note that the curves in the plots below are average ROC curves, obtained by averaging the
10 per-fold curves with the threshold (cutoff) averaging method.
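With the ROCR package, which supports threshold averaging directly, this step can be sketched as follows. The inputs `probs` and `labels` are assumed to be lists with one element per CV fold (predicted class probabilities and true classes, respectively); the names are illustrative.

```r
# Sketch of threshold-averaged ROC over CV folds with ROCR.
library(ROCR)

pred <- prediction(probs, labels)            # one slot per CV fold
perf <- performance(pred, 'tpr', 'fpr')
plot(perf, avg = 'threshold')                # average the 10 curves by cutoff

auc <- mean(unlist(performance(pred, 'auc')@y.values))  # mean AUC over folds
```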
[Plot: ROC curve for Decision Tree model; average true positive rate vs. false positive rate, AUC = 0.671, random-guess diagonal shown]
[Plot: ROC curve for AdaBoost model; AUC = 0.463, below the random-guess diagonal]
[Plot: ROC curve for SVM model; AUC = 0.792, random-guess diagonal shown]
# AUC table summary
smr <- data.frame('model' = c('Decision Tree', 'AdaBoost', 'SVM'),
'AUC' = c(DTAuc, AdaBoostAuc, SVMAuc))
kable(smr, digits = 3)
model AUC
Decision Tree 0.671
AdaBoost 0.463
SVM 0.792
Final prediction on the blind test set
As we can see, mainly from the ROC curves and AUC values, the ensemble model – AdaBoost – is the worst one,
even though one could expect an ensemble method to have a good chance of being a very strong predictor.
Although possible reasons have been discussed, the ROC curve shows that it is even worse than random
guessing, which is a signal that something went very wrong. We will not try to reimplement our solution;
failure is also a valid conclusion. Instead, we will choose from the two remaining models, whose results are
not that bad. According to the ROC and AUC, we choose the SVM as our primary model and the Decision Tree as
the secondary one.
Used R packages
• data.table
• dplyr
• rlist
• knitr
• rpart
• rpart.plot
• adabag
• e1071
• ROCR