Decision Tree model – tuning CP parameter
We will try to find the best CP parameter for the tree. We will use 10-fold cross-validation to evaluate the
tree, so the best CP is the one for which the cross-validation error of the DT model is minimal.
We will train 50 DT instances with different values of CP and observe the corresponding cross-validation error,
starting at CP = 0.026 and ending with CP near 0. We will also record the corresponding
tree size (number of nodes), so we can see how changes of CP prune the tree.
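The sweep described above can be sketched in R roughly as follows. This is a hypothetical reconstruction, not the actual code from utils.R; the variable names (`cps`, `results`) and the node-count formula are illustrative, and it assumes a `train` data frame with a `class` column, as used later in the report.

```r
# Sketch of the CP sweep using rpart's built-in 10-fold cross-validation.
library(rpart)

cps <- seq(0.026, 0.0005, length.out = 50)   # 50 CP values from 0.026 towards 0
results <- data.frame(cp = cps, xerror = NA, size = NA)

for (i in seq_along(cps)) {
  model <- rpart(class ~ ., data = train, method = 'class',
                 cp = cps[i], xval = 10)     # 10-fold CV built into rpart
  cpt <- model$cptable
  results$xerror[i] <- cpt[nrow(cpt), 'xerror']            # relative CV error
  results$size[i]   <- 2 * sum(model$frame$var == '<leaf>') - 1  # nodes of a binary tree
}

results[which.min(results$xerror), ]         # best CP by cross-validation error
```

Plotting `results$xerror` and `results$size` against `results$cp` then yields the two plots discussed below.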
As we can see from the plots below, the dependency of the cross-validation error on CP is not monotonic. On the
right side of the plot the tendency seems proportional and regular – the lower the CP, the lower
the error. Moreover, not every change of CP implies a change of the error. From CP ≈ 0.010 to the left,
however, the dependency is very unpredictable, and it seems that the lower the CP, the higher the
error. CP = 0.010 therefore seems to be the best choice: the error is minimal for this CP, and
the tree is not that large (11 nodes) – almost the same as the tree for the maximal
CP = 0.025 (7 nodes), which is a very good compromise.
[Plot: dependency of cross-validation error on CP]
[Plot: dependency of tree size on CP; tree size [# of nodes] vs. CP]
Decision Tree model – confidence intervals for the mean accuracy value
# Evaluate final DT model using 10-fold cross-validation and compute confidence intervals
library('knitr')
source('utils.R') # Custom code used for multiple t-tests
kable(t)
conf. level a b
0.90 0.831 0.859
0.95 0.827 0.863
0.99 0.820 0.870
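The intervals above can be computed from the 10 per-fold accuracies with a standard t-based confidence interval. The helper below is a sketch of what the custom code in utils.R might do; `ci_for_level` and `acc` are hypothetical names, not the actual implementation.

```r
# t-based confidence interval for the mean accuracy over CV folds.
ci_for_level <- function(acc, level) {
  n    <- length(acc)
  m    <- mean(acc)
  half <- qt(1 - (1 - level) / 2, df = n - 1) * sd(acc) / sqrt(n)
  c(a = m - half, b = m + half)   # lower and upper bound of the interval
}

# acc <- vector of per-fold accuracies from 10-fold cross-validation
# t(sapply(c(0.90, 0.95, 0.99), function(l) ci_for_level(acc, l)))
```

Note that wider fold-to-fold variance in accuracy directly widens these intervals, which matters when comparing models later.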
# Plot final DT model with the best value of CP (using whole dataset as training data)
library('rpart')
library('rpart.plot')
train <- within(data, rm(protein_id)) # Remove 'protein_id' column
model <- rpart(class ~ ., data = train, method = 'class', cp = 0.010)
rpart.plot(model)
[Plot: final decision tree (11 nodes); root split on protrusion < 122, further splits on hydrophilic, bfactor, aromatic and atomDensity]
Table 3: Feature importance that rpart uses for tree split selection
feature importance
protrusion 793.82
atoms 177.34
atomC 118.86
hydrophilic 117.64
atomDensity 114.55
hydrophatyIndex 67.32
apRawValids 64.58
polar 56.45
aromatic 55.06
atomicHydrophobicity 53.69
hBondDonor 49.45
ionizable 47.82
bfactor 46.74
acidic 45.68
vsAromatic 32.32
hydrophobic 8.22
vsHydrophobic 4.03
The importance of all features can be read directly from the table. As we can observe, the most important feature
is protrusion; it is also at the root of the tree. However, the mere fact that a node sits lower in the tree does
not automatically imply that the feature associated with it is unimportant. Our tree is quite small,
due to our substantiated choice of the CP value, but if it were larger, we might see that some features are
used more often (and are thus more important) than other features that are nevertheless closer to the root.
Ensemble method – AdaBoost
Parameter tuning
We will try to build a good AdaBoost model. Training an AdaBoost model is a very time-consuming operation
compared to training a single (yet large) decision tree, especially for some combinations of hyper-parameters –
for example, a large number of DTs in the ensemble. We will use grid search for parameter tuning. It is
therefore almost impossible to use cross-validation to evaluate the model while tuning the parameters, because
cross-validation requires learning a new model on each fold iteration. Instead, we will split our data into
two sets (train and test), train a new model for each parameter combination on the train
set and then evaluate it on the test set.
We will focus mainly on the ensemble size parameter. We will try over 100 different values in steps of 2,
starting at 3 and ending around 200. Because the parameter space is really large, instead of trying multiple
combinations of this parameter with a variety of others, we will try just one other option:
we will compare two models differing in the hyper-parameters of the individual trees, and for each model try the
different ensemble sizes as described. The first model will use regular DTs with the default values set in the
rpart package. Note that the default value of CP is 0.010, which is our optimum from the previous section. The
second model will be a stump ensemble, so the rpart parameters will be set accordingly – to obtain stumps. The
choice of stumps as weak learners is legitimate, because they generally achieve good results, and it is in any
case interesting to compare them with full DTs.
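A sketch of this grid search with the adabag package is given below. The control settings used to force stumps and the variable names (`sizes`, `full_err`, `stump_err`) are illustrative assumptions, not the report's actual code; it assumes `train` and `test` data frames with a factor `class` column.

```r
# Sketch of the ensemble-size sweep: full trees vs. stumps.
library(adabag)
library(rpart)

sizes <- seq(3, 201, by = 2)
stump_ctrl <- rpart.control(maxdepth = 1, cp = -1, minsplit = 0)  # force stumps

test_error <- function(model, newdata) {
  predict.boosting(model, newdata = newdata)$error
}

full_err <- stump_err <- numeric(length(sizes))
for (i in seq_along(sizes)) {
  full  <- boosting(class ~ ., data = train, mfinal = sizes[i])
  stump <- boosting(class ~ ., data = train, mfinal = sizes[i],
                    control = stump_ctrl)
  full_err[i]  <- test_error(full,  test)
  stump_err[i] <- test_error(stump, test)
}
```

With roughly 100 sizes and two model families, this trains about 200 ensembles, which explains the long running time mentioned below.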
[Plot: dependency of test error on ensemble size; test error [%] vs. ensemble size, one curve for Full DTs and one for Stumps]
After a night of computation we obtained the plot above. We can see that both curves oscillate intensively – a
small change in ensemble size can have a big impact on the error rate of the whole ensemble. Some changes are
very abrupt (up to about 3%). Apart from the first few models with a low ensemble size, we can observe that
stump ensembles do about 1.4% better on average than full-DT ensembles. In the extreme case of ensemble
size 140, the stump ensemble is more than 3% better than the full-DT ensemble; in the interesting case of
ensemble size 48, the error rate of both ensembles is the same.
Next, we will use stump ensembles, because they seem to do a better job and are also easier to
train. We will select the 3 best values of ensemble size and evaluate the corresponding stump ensembles using
cross-validation. According to the results above, we select sizes 16, 178 and 205.
Evaluation
First, we will perform 10-fold cross-validation on the selected ensembles and measure their accuracy and error
rate. Next, we will perform randomized 10-fold cross-validation and measure the accuracy and error rate as well.
Randomized means that we will perform 10-fold cross-validation 10 times on each ensemble, and each time we
will randomly permute the protein_id values before splitting the data into folds. This approach should give us
more accurate results.
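The randomized variant can be sketched as a thin wrapper around the regular cross-validation. Both `randomized_cv` and the inner `cv10` helper are hypothetical names standing in for the report's actual code; the key step is only the row permutation before the fold split.

```r
# Sketch of randomized 10-fold CV: repeat regular CV on freshly permuted data.
randomized_cv <- function(data, reps = 10) {
  unlist(lapply(seq_len(reps), function(r) {
    shuffled <- data[sample(nrow(data)), ]   # permute rows before the fold split
    cv10(shuffled)                           # regular 10-fold CV (hypothetical helper)
  }))
}
```

Averaging over 10 × 10 folds reduces the variance caused by any single unlucky fold assignment.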
conf. level a b
0.90 0.827 0.851
0.95 0.824 0.853
0.99 0.818 0.860
As we can see from the table above, the results are not as good as expected. The error estimate from the plot
on the previous page is probably somewhat distorted. The reason could be a higher sensitivity to the training
data, which would have been avoided if we had used cross-validation during parameter tuning. The rates of the
different ensembles are very similar, although the sizes of the ensembles differ significantly, which is quite a
surprising result. Even the output of the randomized validation is very close to the output of the regular
cross-validation. Despite the bad results, it seems that the error goes slightly down with increasing ensemble
size. Therefore we will select the ensemble of size 205 as our final choice. Confidence intervals for its mean
accuracy are printed below the summary table.
If we compare the ensemble with the simple decision tree from the previous section, we will be disappointed. The
error rate of the tree is around 15.5%, whereas the error rate of our best ensemble is around 16.1%. These are
mean values; if we compare the corresponding confidence intervals, we can conclude that the error rates of the
simple decision tree and the stump ensemble are almost the same (on our protein ligandability recognition
task).
Searching for compact feature set – SVM model
Training a good Support Vector Machine requires careful consideration of the hyper-parameters, especially the
kernel transformation. The choice of an appropriate kernel requires good knowledge of the data – the geometry
of the problem. Because we are not biologists, it is almost impossible for us to really understand the
background of the problem. We could try some data analysis (e.g. testing the linear separability of the
data), but it would ultimately be harder than simply trying several options.
Another disadvantage is that it is almost impossible to use the cross-validation error as a criterion for tuning
the hyper-parameters – it is too computationally expensive. So we have to make a compromise. We will start with
the choice of the kernel. During this search we should also try some variety of the other hyper-parameters
(cost, gamma and degree), because we could reject an appropriate kernel just because of a bad choice of the
other parameters.
First, we will split the data into two fixed sets (train and test). The train set will then be internally split
into a real train set and a tune set. This is done internally by the R tune function (once we know it exists).
The real train subset is used for training the model with a certain combination of hyper-parameters; the tune
set is used for evaluating that model. The result of this evaluation is referred to as the tuning error. The
combination of hyper-parameters with the minimal tuning error is finally selected as the winner. We will leave
the test set for the final evaluation of the model with the best hyper-parameters. This will prevent us from
choosing an overfitted model, despite its good tuning error.
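The kernel search described above can be sketched with e1071's `tune` wrapper. The parameter grids below are illustrative assumptions, not the values actually used in the report; it assumes a `train` data frame with a factor `class` column.

```r
# Sketch of the kernel search: tune() does the internal train/tune split.
library(e1071)

kernels <- c('linear', 'polynomial', 'radial')
tuned <- lapply(kernels, function(k) {
  tune(svm, class ~ ., data = train, kernel = k,
       ranges = list(cost  = c(0.1, 1, 10, 100),
                     gamma = c(0.01, 0.1, 1)))
})
names(tuned) <- kernels

sapply(tuned, function(t) t$best.performance)   # minimal tuning error per kernel
```

The winning model's test error is then measured separately on the held-out test set, as the text explains.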
As we can see from the table, the polynomial and radial kernels have a very good tuning error (almost 10%),
but their test error is very bad (around 20%) compared to the linear kernel. This is probably because
polynomial- and radial-kernel SVMs are very prone to overfitting. The linear kernel has a test error around
15%; the reason could be that the data are quite well linearly separable. We will therefore choose the linear
kernel, and next we will try to tune its cost parameter.
[Plot: tuning error vs. cost parameter, cost ranging from 0 to 50]
We have chosen the range from 1 to 50 for the cost parameter, according to the results of the previous tuning –
the resulting cost parameter there was 10, so this order of magnitude seems reasonable. As we can observe from
the plot, the winning cost parameter is obvious – cost = 1.
Evaluation
We will evaluate our final SVM model using 10-fold cross-validation and compute corresponding confidence
intervals.
Table 8: Cross-validation results of chosen SVM
conf. level a b
0.90 0.831 0.859
0.95 0.828 0.862
0.99 0.820 0.870
As we can see in Table 8, the error differs a lot across the folds – it ranges between 12% and 18%. This
could be because some parts of the data tend to be more linearly separable than others, and the SVM with the
linear kernel therefore fits those parts better. As a result, the confidence intervals are also wider in
comparison with the other models we have trained.
In conclusion, the final mean error rate of the SVM is 15.5%. The error rate of the DT is around 15.5% and
the error rate of the AdaBoost ensemble is 16.1%. So this SVM model is closely comparable to our DT model,
which is quite interesting, because the SVM is a linear model (in our setup) and DTs are non-linear models,
yet both have very similar accuracy.
Searching for a compact feature set
We will implement the forward selection method, following the instructions. We cannot afford to tune the
hyper-parameters during the search, because of the computational complexity of cross-validation. We will rather
concentrate on selecting the right feature set and will keep our k at the lowest possible level (k = 1).
source('utils.R')
In each iteration we add k features to the set and evaluate it using 10-fold cross-validation. We also
periodically check the tendency of the last x cross-validation errors, computed in the last x iterations. The
chosen value of x is 5 (the tendency range). We detect two possible tendency scenarios:
1. Whether the last x values fall within a specified threshold range, i.e. whether the tendency of the last x
values is constant up to some deviation (the threshold). This is what the lastXInT function does.
2. Whether all the deltas between the last x + 1 values are at least some threshold. This is what the
lastXAtLeastT function does.
The thresholds for both scenarios above are set to 1% by default.
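The two checks might look roughly as follows. This is a guess at what utils.R contains, written to match the descriptions above; the real implementation may differ in details.

```r
# Scenario 1: last x errors are roughly constant (within threshold t).
lastXInT <- function(errors, x = 5, t = 0.01) {
  last <- tail(errors, x)
  length(errors) >= x && max(last) - min(last) <= t
}

# Scenario 2: every delta between the last x + 1 errors is at least t.
lastXAtLeastT <- function(errors, x = 5, t = 0.01) {
  last <- tail(errors, x + 1)
  length(errors) >= x + 1 && all(abs(diff(last)) >= t)
}
```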
Once we have an implementation, the last thing we need is the importance ranking. We will train our
ensemble model again and obtain the feature importance list. As we can see from the table below, many
features have an importance of 0. We do not expect these features in the output of the forward selection method.
kable(imp, digits = 2)
Table 10: Feature importance obtained from ensemble model
feature importance
protrusion 80.88
vsAromatic 5.41
atomO 3.67
hydrophilic 3.11
bfactor 1.59
vsHydrophobic 1.22
polar 0.98
vsAcceptor 0.78
apRawValids 0.62
aliphatic 0.29
apRawInvalids 0.24
atomicHydrophobicity 0.14
hydrophobic 0.11
vsAnion 0.04
hydrophatyIndex 0.01
acidic 0.00
amide 0.00
aromatic 0.00
atomC 0.00
atomDensity 0.00
atomN 0.00
atoms 0.00
basic 0.00
hAcceptorAtoms 0.00
hBondAcceptor 0.00
hBondDonor 0.00
hBondDonorAcceptor 0.00
hDonorAtoms 0.00
hydroxyl 0.00
ionizable 0.00
negCharge 0.00
posCharge 0.00
sulfur 0.00
vsCation 0.00
vsDonor 0.00
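Driven by this ranking, the selection loop itself can be sketched as below. `forward_select` and `cv_error` are hypothetical names (the latter standing for a 10-fold cross-validation helper); only `lastXInT` corresponds to a function actually mentioned in the text.

```r
# Sketch of forward selection with k = 1, stopping when the error curve flattens.
forward_select <- function(ranked_features, data, x = 5, t = 0.01) {
  selected <- character(0)
  errors   <- numeric(0)
  for (f in ranked_features) {
    selected <- c(selected, f)                            # add next-best feature
    errors   <- c(errors, cv_error(data[, c(selected, 'class')]))
    if (lastXInT(errors, x, t)) break                     # tendency is constant
  }
  list(features = selected, errors = errors)
}
```

Features with zero importance sit at the end of the ranking, so the loop stops long before reaching them.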
In the barplot below, we can observe the progress of the feature forward selection method. Each bar in the
plot represents the i-th iteration of the method. Note that the feature printed on the x axis is the name of
the feature that was added to the set in the i-th step, so the set of features in the i-th iteration consists
of all preceding features plus the one printed on the x axis.
[Barplot: progress of the feature forward selection method; cross-validation error per iteration, features added in order: protrusion, vsAromatic, atomO, hydrophilic, bfactor, vsHydrophobic, polar, vsAcceptor, apRawValids, aliphatic]
As we can see from the barplot, the method selected the first 10 features. Exactly the last 5 iterations have
very similar cross-validation error rates, so the stopping criterion was triggered. Now we have to choose the
first n features as the representative compact feature set. n = 8 seems a reasonable choice, because the error
rate stays the same beyond this boundary. Finally, let's print out our compact feature set:
feature importance
protrusion 80.88
vsAromatic 5.41
atomO 3.67
hydrophilic 3.11
bfactor 1.59
vsHydrophobic 1.22
polar 0.98
vsAcceptor 0.78
ROC evaluation
We will run the ROC evaluation, plot the corresponding ROC curves and compute the AUC values. We will use
10-fold cross-validation when evaluating each model to obtain the ROC data, so we will actually get 10 ROC
curves per model. Note that the curves in the plots below are average ROC curves, obtained by averaging the
10 per-fold curves with the threshold (cutoff) averaging method.
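With the ROCR package, which supports threshold averaging directly, this step can be sketched as follows. The inputs `probs` and `labels` are assumed to be lists with one element per CV fold (predicted class probabilities and true classes, respectively); the names are illustrative.

```r
# Sketch of threshold-averaged ROC over CV folds with ROCR.
library(ROCR)

pred <- prediction(probs, labels)            # one slot per CV fold
perf <- performance(pred, 'tpr', 'fpr')
plot(perf, avg = 'threshold')                # average the 10 curves by cutoff

auc <- mean(unlist(performance(pred, 'auc')@y.values))  # mean AUC over folds
```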
[Plot: ROC curve for Decision Tree model; average true positive rate vs. false positive rate, AUC = 0.671, random-guess diagonal shown]
[Plot: ROC curve for AdaBoost model; AUC = 0.463, below the random-guess diagonal]
[Plot: ROC curve for SVM model; AUC = 0.792, random-guess diagonal shown]
# AUC table summary
smr <- data.frame('model' = c('Decision Tree', 'AdaBoost', 'SVM'),
'AUC' = c(DTAuc, AdaBoostAuc, SVMAuc))
kable(smr, digits = 3)
model AUC
Decision Tree 0.671
AdaBoost 0.463
SVM 0.792
Final prediction on the blind test set
As we can see, mainly from the ROC curves and AUC values, the ensemble model – AdaBoost – is the worst one,
even though one could expect an ensemble method to have a good chance of being a very strong predictor.
Although possible reasons have been discussed, the ROC curve shows that it is even worse than random
guessing, which is a signal that something went very wrong. We will not try to reimplement our solution;
failure is also a valid conclusion. Instead, we will choose from the two remaining models, whose results are
not that bad. According to the ROC and AUC, we choose the SVM as our primary model and the Decision Tree as
the secondary one.
Used R packages
• data.table
• dplyr
• rlist
• knitr
• rpart
• rpart.plot
• adabag
• e1071
• ROCR