Pima Indians Diabetes Database Analysis

https://www.kaggle.com/lbronchal/pima-indians-diabetes-database-analysis
Luis Bronchal
March 12, 2017
Summary
Exploratory Data Analysis [EDA]
Data loading and cleaning
Variable analysis
Outcome
Correlation between variables
Univariable analysis
Machine learning model
Baseline model
Improving baseline model
Feature importance analysis
Explanatory models
Predictive model
Model comparison
Conclusion
Next things to try
Reference
Summary
This is an analysis of the Pima Indians Diabetes Database, obtained from Kaggle (https://www.kaggle.com/uciml/pima-diabetes-database). It is a small dataset with missing values. We used imputation techniques and tried some explanatory models (classification tree and logistic regression) and predictive models (random forest and xgboost).
Exploratory Data Analysis [EDA]

library(needs)
needs(ggplot2,
      dplyr,
      corrplot,
      gridExtra,
      rpart.plot,
      e1071,
      mice,
      DMwR,
      pROC,
      caTools,
      caret,
      doMC)
registerDoMC(cores = detectCores() - 1)
Data loading and cleaning
dat <- read.csv("../input/diabetes.csv")
str(dat)
'data.frame': 768 obs. of 9 variables:

$ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
$ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
$ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
$ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
$ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
$ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
$ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
$ Age : int 50 31 32 21 33 30 26 29 53 54 ...
$ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
dat$Outcome <- factor(make.names(dat$Outcome))
Let’s take a look at the data:
summary(dat)
  Pregnancies        Glucose      BloodPressure    SkinThickness
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
    Insulin           BMI        DiabetesPedigreeFunction      Age
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00
 Median : 30.5   Median :32.00   Median :0.3725           Median :29.00
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00
 Outcome
 X0:500
 X1:268
It looks like there are no explicit missing values, but on closer inspection some biological measurements have a value of zero in the dataset, which is impossible:
biological_data <- dat[,setdiff(names(dat), c('Outcome', 'Pregnancies'))]
features_miss_num <- apply(biological_data, 2, function(x) sum(x<=0))
features_miss <- names(biological_data)[ features_miss_num > 0]
features_miss_num
                 Glucose            BloodPressure            SkinThickness
                       5                       35                      227
                 Insulin                      BMI DiabetesPedigreeFunction
                     374                       11                        0
                     Age
                       0
Let’s see how many rows have more than one of these invalid values:
rows_errors <- apply(biological_data, 1, function(x) sum(x<=0)>1)
sum(rows_errors)
[1] 234
That is a lot of rows, more than 30% of the dataset:
sum(rows_errors)/nrow(dat)
[1] 0.3046875
biological_data[biological_data<=0] <- NA
dat[, names(biological_data)] <- biological_data
We have very little data, so we can’t afford to get rid of these rows. We are going to try to impute the missing data.
dat_original <- dat

dat[,-9] <- knnImputation(dat[,-9], k=5)
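The idea behind k-nearest-neighbour imputation can be sketched in base R as follows. This is a simplified illustration, not the algorithm that DMwR's `knnImputation` actually runs (that function has its own scaling and weighting options). The helper `knn_impute_sketch` and the toy data frame are hypothetical names introduced only for this example.

```r
# Simplified k-nearest-neighbour imputation (illustrative sketch only).
# knn_impute_sketch() is a hypothetical helper, not part of DMwR.
knn_impute_sketch <- function(df, k = 5) {
  x <- scale(as.matrix(df))                  # standardise so no column dominates the distance
  complete_rows <- which(stats::complete.cases(x))
  out <- df
  for (i in which(!stats::complete.cases(x))) {
    miss <- is.na(x[i, ])
    obs  <- !miss
    # Euclidean distance to every complete row, using only the columns observed in row i
    diffs <- sweep(x[complete_rows, obs, drop = FALSE], 2, x[i, obs])
    d <- sqrt(rowSums(diffs^2))
    nn <- complete_rows[order(d)[seq_len(min(k, length(d)))]]
    # fill each missing cell with the mean of its neighbours, on the original scale
    out[i, miss] <- colMeans(df[nn, miss, drop = FALSE])
  }
  out
}

# Toy data (assumed values, loosely shaped like the dataset's columns)
toy <- data.frame(Glucose = c(148, 85, 183, 89, NA, 116),
                  BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1, NA),
                  Age     = c(50, 31, 32, 21, 33, 30))
imputed <- knn_impute_sketch(toy, k = 2)
print(imputed)
```

Rows that were already complete are left untouched; only the `NA` cells are filled, using rows that are close on the columns that were actually observed.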
    
Variable analysis
Outcome
Let’s look at the class proportions of the outcome.
prop.table(table(dat$Outcome))
X0 X1
0.6510417 0.3489583
The classes are quite unbalanced: there are almost twice as many non-diabetic cases (X0) as diabetic ones (X1).
Correlation between variables

Let’s see the correlation between numerical variables. Some variables are highly correlated; that is the case of Age, for example.
correlat <- cor(dat[, setdiff(names(dat), 'Outcome')])
Correlation matrix
corrplot(correlat)
Univariable analysis
univar_graph <- function(univar_name, univar, data, output_var) {
  g_1 <- ggplot(data, aes(x=univar)) + geom_density() + xlab(univar_name)
  g_2 <- ggplot(data, aes(x=univar, fill=output_var)) + geom_density(alpha=0.4) + xlab(univar_name)
  grid.arrange(g_1, g_2, ncol=2, top=paste(univar_name,"variable", "/ [ Skew:",skewness(univar),"]"))
}
for (x in 1:(ncol(dat)-1)) {
  univar_graph(names(dat)[x], dat[,x], dat, dat[,'Outcome'])
}
There are variables with a strong right skew (Insulin, DiabetesPedigreeFunction, Age) and others with a strong left skew, like Blo…
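To make the notion of skew concrete, here is a small sketch of the classical moment-based skewness coefficient in base R. It is an illustration under stated assumptions: `skew_sketch` is a hypothetical helper, and e1071's `skewness()` offers several estimator variants, so its values may differ slightly from this one. A positive value indicates a long right tail, a negative value a long left tail.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
# skew_sketch() is a hypothetical helper, not the e1071 function.
skew_sketch <- function(x) {
  x  <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^(3/2)
}

# First Insulin values shown by str(dat) above: many zeros and one large value
right_tailed <- c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0)
symmetric    <- c(-2, -1, 0, 1, 2)

skew_right <- skew_sketch(right_tailed)  # positive: long right tail
skew_sym   <- skew_sketch(symmetric)     # zero: perfectly symmetric
print(c(right = skew_right, symmetric = skew_sym))
```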
Machine learning model
Baseline model
Let’s create a baseline model. We’ll see later if it is necessary to improve it.
set.seed(1234)
dindex <- createDataPartition(dat$Outcome, p=0.7, list=FALSE)
train_data <- dat_original[dindex,]
test_data <- dat_original[-dindex,]
We are going to impute the missing data in the training and test sets separately.
mice_train_mod <- mice(train_data[, features_miss], method='rf', seed=1234, printFlag = FALSE)
mice_test_mod <- mice(test_data[, features_miss], method='rf', seed=1234, printFlag = FALSE)
train_data[, features_miss] <- complete(mice_train_mod)
test_data[, features_miss] <- complete(mice_test_mod)
The training set contains both possible cases and it is unbalanced.
table(train_data$Outcome)
X0 X1
350 188
Let’s try a logistic regression model with all the features:
fitControl <- trainControl(method = "cv",
                           number = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
model_glm <- train(Outcome~.,
                   train_data,
                   method="glm",
                   metric="ROC",
                   tuneLength=10,
                   preProcess = c('center', 'scale'),
                   trControl=fitControl)
pred_glm <- predict(model_glm, test_data)
cm_glm <- confusionMatrix(pred_glm, test_data$Outcome, positive="X1")
cm_glm
Confusion Matrix and Statistics
Reference
Prediction X0 X1
X0 129 29
X1 21 51
Accuracy : 0.7826
95% CI : (0.7236, 0.8341)
No Information Rate : 0.6522
P-Value [Acc > NIR] : 1.156e-05
Kappa : 0.5094
Mcnemar's Test P-Value : 0.3222
Sensitivity : 0.6375
Specificity : 0.8600
Pos Pred Value : 0.7083
Neg Pred Value : 0.8165
Prevalence : 0.3478
Detection Rate : 0.2217
Detection Prevalence : 0.3130
Balanced Accuracy : 0.7488
'Positive' Class : X1
pred_prob_glm <- predict(model_glm, test_data, type="prob")

roc_glm <- roc(test_data$Outcome, pred_prob_glm$X1)
colAUC(pred_prob_glm$X1, test_data$Outcome, plotROC = TRUE)
[,1]
X0 vs. X1 0.8595833
We can see the result of this baseline model:
The accuracy is not bad (0.7826), but it is not the best metric for an unbalanced dataset like this one.
The AUC has a value of 0.8595833
The F1 score is 0.6710526
The recall (sensitivity) is quite poor: 0.6375
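The figures above can be recomputed by hand from the four cells of the confusion matrix printed earlier, taking X1 as the positive class. The following sketch uses only base R; the variable names are chosen for this example.

```r
# Cells of the baseline confusion matrix (positive class = X1)
tp <- 51   # predicted X1, reference X1
fn <- 29   # predicted X0, reference X1
fp <- 21   # predicted X1, reference X0
tn <- 129  # predicted X0, reference X0

sensitivity <- tp / (tp + fn)                  # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)                  # positive predictive value
accuracy    <- (tp + tn) / (tp + tn + fp + fn)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision, f1 = f1), 4))
```

Because only 51 of the 80 diabetic cases in the test set are detected, sensitivity stays low even though overall accuracy looks acceptable.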
Next things to consider in order to build a better model than this baseline:
We have to think about the features to include in the model, because some are highly correlated (we can try PCA,…)
We have to deal with the class imbalance (oversampling, synthetic cases,…)
We can try different machine learning models
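As an illustration of the oversampling idea mentioned in the list above, here is a minimal base R sketch of random oversampling: minority-class rows are resampled with replacement until every class has as many rows as the largest one. `oversample_sketch` and the toy data are hypothetical and are not part of the original analysis; in practice this would be applied to the training set only, and caret's `trainControl` also has a `sampling` argument that can do this inside resampling.

```r
# Random oversampling of the minority class (illustrative sketch).
# oversample_sketch() is a hypothetical helper.
oversample_sketch <- function(df, target, seed = 1234) {
  set.seed(seed)
  counts <- table(df[[target]])
  counts <- counts[counts > 0]          # ignore unused factor levels
  n_max  <- max(counts)
  parts <- lapply(names(counts), function(cls) {
    idx <- which(df[[target]] == cls)
    extra <- if (length(idx) < n_max) {
      # draw additional row indices with replacement
      idx[sample.int(length(idx), n_max - length(idx), replace = TRUE)]
    } else {
      integer(0)
    }
    df[c(idx, extra), , drop = FALSE]
  })
  do.call(rbind, parts)
}

# Toy data: four X0 rows and two X1 rows
toy <- data.frame(Glucose = c(148, 85, 183, 89, 137, 116),
                  Outcome = factor(c("X1", "X0", "X1", "X0", "X0", "X0")))
balanced <- oversample_sketch(toy, "Outcome")
print(table(balanced$Outcome))
```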
Improving baseline model

We can decide to build an explanatory model or a highly predictive model. We will try both cases:
Feature importance analysis

We are going to use the Boruta technique to find the most relevant features and to check whether there are unimportant variables:
library(Boruta)
boruta_results <- Boruta(Outcome~., train_data)
boruta_results
Boruta performed 26 iterations in 6.13668 secs.

7 attributes confirmed important: Age, BMI,
DiabetesPedigreeFunction, Glucose, Insulin and 2 more;
1 attributes confirmed unimportant: BloodPressure;
plot(boruta_results)
All the variables are important except BloodPressure. Glucose is the most important one.
Looking at the correlation matrix, we can see some correlation between variables, but all values are below 0.75, which is consistent with Boruta, and it looks like we can’t get rid of any feature:
findCorrelation(correlat, cutoff=0.75)
integer(0)
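The empty result means that no pair of predictors exceeds the cutoff. A simple way to inspect this directly is to list the pairs whose absolute correlation is above a threshold. The sketch below is an assumption-laden illustration in base R: `high_corr_pairs` is a hypothetical helper, and it is not what caret's `findCorrelation` does internally (that function returns column indices suggested for removal).

```r
# List variable pairs with |correlation| above a cutoff (illustrative sketch).
# high_corr_pairs() is a hypothetical helper.
high_corr_pairs <- function(cor_mat, cutoff = 0.75) {
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
  data.frame(var1 = rownames(cor_mat)[idx[, 1]],
             var2 = colnames(cor_mat)[idx[, 2]],
             r    = cor_mat[idx],
             stringsAsFactors = FALSE)
}

# Toy data: b is almost a copy of a, c is independent noise
set.seed(1)
a   <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))

pairs_found <- high_corr_pairs(cor(toy))
print(pairs_found)
```

Applied to `correlat` with `cutoff = 0.75`, a data frame with zero rows would correspond to the `integer(0)` shown above.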
We are going to use a different approach: recursive feature elimination, to explore which is the best feature set for a linear model:
caretFuncs$summary <- twoClassSummary
rfe_ctl <- rfeControl(functions=caretFuncs,
                      method = "cv",
                      number = 10,
                      returnResamp="final",
                      verbose = FALSE)
rfe_glm <- rfe(train_data[ , setdiff(names(dat), 'Outcome')],
               train_data$Outcome,
               sizes=c(1:8),
               rfeControl=rfe_ctl,
               method="glm",
               metric = "ROC",
               trControl = fitControl)
Let’s look at the resampling performance for each subset size:
rfe_glm
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold)
Resampling performance over subset size:
Variables ROC Sens Spec ROCSD SensSD SpecSD Selected

1 0.6197 0.9286 0.1751 0.06304 0.05260 0.10556
2 0.8115 0.8857 0.5111 0.04118 0.08193 0.07848
3 0.8088 0.8857 0.5164 0.04089 0.08518 0.06389
4 0.8155 0.8800 0.5322 0.04104 0.08497 0.09402
5 0.8128 0.8800 0.5322 0.04272 0.08497 0.09402
6 0.8186 0.8829 0.5424 0.04397 0.07184 0.09386
7 0.8275 0.8829 0.5582 0.04192 0.07309 0.10363
8 0.8310 0.8857 0.5582 0.04511 0.07499 0.10363 *
The top 5 variables (out of 8):

Glucose, BMI, Pregnancies, DiabetesPedigreeFunction, Age
predictors(rfe_glm)
[1] "Glucose" "BMI"

[3] "Pregnancies" "DiabetesPedigreeFunction"
[5] "Age" "Insulin"
[7] "SkinThickness" "BloodPressure"
plot(rfe_glm, type=c("g", "o"))
With this approach, it looks like all the features are needed.
We are going to try explanatory models: logistic regression and classification trees.
Explanatory models
Logistic Regression with regularization
We used a logistic regression as the baseline model. Its summary shows:
summ_model_glm <- summary(model_glm$finalModel)

summ_model_glm
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7007 -0.7340 -0.4207 0.7024 2.4104
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.831720 0.114078 -7.291 3.08e-13 ***
Pregnancies 0.359195 0.127200 2.824 0.004745 **
Glucose 1.125086 0.149146 7.544 4.57e-14 ***
BloodPressure 0.006958 0.124510 0.056 0.955433
SkinThickness 0.043982 0.145204 0.303 0.761969
Insulin -0.102950 0.134696 -0.764 0.444682
BMI 0.510131 0.154228 3.308 0.000941 ***
DiabetesPedigreeFunction 0.314418 0.114818 2.738 0.006174 **
Age 0.153331 0.130900 1.171 0.241452
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 696.28 on 537 degrees of freedom

Residual deviance: 506.51 on 529 degrees of freedom
AIC: 524.51
Number of Fisher Scoring iterations: 5
The model shows which are the most relevant features:
coef_glm <- summ_model_glm$coefficients %>%

as.data.frame() %>%
mutate(Feature=rownames(summ_model_glm$coefficients)) %>%
filter(Feature != "(Intercept)")
coef_glm %>% filter(`Pr(>|z|)` < 0.05) %>% arrange(`Pr(>|z|)`)
Estimate Std. Error z value Pr(>|z|) Feature

1 1.1250863 0.1491456 7.543545 4.573636e-14 Glucose
2 0.5101313 0.1542279 3.307645 9.408386e-04 BMI
3 0.3591953 0.1272002 2.823859 4.744927e-03 Pregnancies
4 0.3144183 0.1148182 2.738402 6.173855e-03 DiabetesPedigreeFunction
The most relevant feature related to diabetes is Glucose, followed by BMI, Pregnancies and DiabetesPedigreeFunction.
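Because the model was fitted on centered and scaled predictors, each coefficient is the change in log-odds for an increase of one standard deviation in that feature. Exponentiating gives an odds ratio, which is often easier to read. The sketch below simply copies the four significant estimates from the table above; the variable names are chosen for this example.

```r
# Significant coefficients copied from the table above
# (log-odds per one standard deviation of each predictor)
coefs <- c(Glucose = 1.1250863,
           BMI = 0.5101313,
           Pregnancies = 0.3591953,
           DiabetesPedigreeFunction = 0.3144183)

# Odds ratio per one-standard-deviation increase
odds_ratios <- exp(coefs)
print(round(odds_ratios, 2))
```

An odds ratio above 1 means the odds of the positive class (X1) increase as the feature increases; the ranking of the features is the same as for the raw coefficients.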
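model_glmnet <- train(Outcome~.,
                      train_data,
                      method="glmnet",
                      metric="ROC",
                      tuneLength=20,
                      # the end of this call is cut off in the source;
                      # trControl=fitControl is assumed here, as in model_glm
                      trControl=fitControl)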
pred_glmnet <- predict(model_glmnet, test_data)

cm_glmnet <- confusionMatrix(pred_glmnet, test_data$Outcome, positive="X1")
cm_glmnet
Reference
Prediction X0 X1
X0 132 35
X1 18 45
Accuracy : 0.7696
95% CI : (0.7097, 0.8224)
P-Value [Acc > NIR] : 7.748e-05
Kappa : 0.4656
Prevalence : 0.3478
pred_prob_glmnet <- predict(model_glmnet, test_data, type="prob")

roc_glmnet <- roc(test_data$Outcome, pred_prob_glmnet$X1)
The results are no better than those of the baseline model: specificity improves slightly (0.88 vs 0.86), but sensitivity and F1 are lower:
cm_glmnet$byClass
Sensitivity Specificity Pos Pred Value

0.5625000 0.8800000 0.7142857
Neg Pred Value Precision Recall
0.7904192 0.7142857 0.5625000
F1 Prevalence Detection Rate
0.6293706 0.3478261 0.1956522
Detection Prevalence Balanced Accuracy
0.2739130 0.7212500
Predictive model
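Random Forest
model_rf <- train(Outcome~.,
                  train_data,
                  method="ranger",
                  metric="ROC",
                  tuneLength=20,
                  # the end of this call is cut off in the source;
                  # trControl=fitControl is assumed here, as in model_glm
                  trControl=fitControl)
note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .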
pred_rf <- predict(model_rf, test_data)

cm_rf <- confusionMatrix(pred_rf, test_data$Outcome, positive="X1")
cm_rf
Reference
Prediction X0 X1
X0 122 28
X1 28 52
Accuracy : 0.7565
95% CI : (0.6958, 0.8105)
P-Value [Acc > NIR] : 0.000421
Kappa : 0.4633
Prevalence : 0.3478
pred_prob_rf <- predict(model_rf, test_data, type="prob")

roc_rf <- roc(test_data$Outcome, pred_prob_rf$X1)
cm_rf$byClass
Sensitivity Specificity Pos Pred Value

0.6500000 0.8133333 0.6500000
Neg Pred Value Precision Recall
0.8133333 0.6500000 0.6500000
F1 Prevalence Detection Rate
0.6500000 0.3478261 0.2260870
Detection Prevalence Balanced Accuracy
0.3478261 0.7316667
Model comparison
We are going to compare these models using the cross-validation resamples from training:
model_list <- list(GLM=model_glm, GMLNET=model_glmnet , RPART=model_rpart, RF=model_rf, XGBOOST=model_xgbTree, KNN=model_knn)
resamples <- resamples(model_list)
bwplot(resamples, metric="ROC")
This is the correlation between models. This information can be useful if we decide to combine some models to build a stacked model:
model_cor <- modelCor(resamples)

model_cor
               GLM      GMLNET      RPART          RF     XGBOOST         KNN
GLM      1.0000000  0.29620506 -0.1539184 -0.16271384  0.20154705  0.16901408
GMLNET   0.2962051  1.00000000 -0.3128653 -0.07633388 -0.18250330 -0.33037695
RPART   -0.1539184 -0.31286529  1.0000000  0.28665403 -0.43370053 -0.13364488
RF      -0.1627138 -0.07633388  0.2866540  1.00000000  0.02043130 -0.08902903
XGBOOST  0.2015471 -0.18250330 -0.4337005  0.02043130  1.00000000  0.09460115
KNN      0.1690141 -0.33037695 -0.1336449 -0.08902903  0.09460115  1.00000000
corrplot(model_cor)
Let’s see the results of the models on the test data:
results_glm <- c(cm_glm$byClass['Sensitivity'], cm_glm$byClass['F1'], roc_glm$auc)
results_glmnet <- c(cm_glmnet$byClass['Sensitivity'], cm_glmnet$byClass['F1'], roc_glmnet$auc)
results_rpart <- c(cm_rpart$byClass['Sensitivity'], cm_rpart$byClass['F1'], roc_rpart$auc)
# sensitivity taken from cm_rf, the random forest's own confusion matrix
results_rf <- c(cm_rf$byClass['Sensitivity'], cm_rf$byClass['F1'], roc_rf$auc)
results_xgbTree <- c(cm_xgbTree$byClass['Sensitivity'], cm_xgbTree$byClass['F1'], roc_xgbTree$auc)
results_knn <- c(cm_knn$byClass['Sensitivity'], cm_knn$byClass['F1'], roc_knn$auc)
results <- data.frame(rbind(results_glm, results_glmnet, results_rpart, results_rf, results_xgbTree, results_knn))
names(results) <- c("Sensitivity", "F1", "AUC")
results
                Sensitivity        F1       AUC
results_glm          0.6375 0.6710526 0.8595833
results_glmnet       0.5625 0.6293706 0.8586667
results_rpart        0.6375 0.6071429 0.7812500
results_rf           0.6500 0.6500000 0.8317500
results_xgbTree      0.4500 0.5413534 0.8252917
results_knn          0.5625 0.5625000 0.7662917
Simple logistic regression looks like the best model here: it has the best F1 score and AUC, and its sensitivity (0.6375) is second only to that of the random forest (0.65).
Conclusion
We have developed some explanatory models (classification tree and logistic regression). They show which factors are most associated with a person having diabetes. Predictive models should improve prediction performance, but they do not achieve outstanding results.
Next things to try

Try different imputation techniques
Split dataset into training, validation and testing set in order to find optimal threshold over the validation data.
Try different machine learning models
Build a stacked model
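The threshold idea from the list above can be sketched as follows: instead of the default 0.5 cutoff, pick the probability threshold that maximises Youden's J (sensitivity + specificity - 1) on a validation set. This is a minimal base R sketch under stated assumptions; `best_threshold_sketch` and the validation vectors are hypothetical, and other criteria (for example maximising F1) could be used instead. pROC's `coords()` can also return a "best" threshold from a ROC object.

```r
# Choose the cutoff that maximises Youden's J on validation data
# (illustrative sketch; best_threshold_sketch() is a hypothetical helper).
best_threshold_sketch <- function(prob, truth, positive = "X1") {
  thresholds <- sort(unique(prob))
  is_pos <- truth == positive
  j <- vapply(thresholds, function(t) {
    pred_pos <- prob >= t
    sens <- sum(pred_pos & is_pos) / sum(is_pos)
    spec <- sum(!pred_pos & !is_pos) / sum(!is_pos)
    sens + spec - 1
  }, numeric(1))
  thresholds[which.max(j)]   # first threshold reaching the maximum
}

# Hypothetical validation-set probabilities for class X1 and true labels
val_prob  <- c(0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90)
val_truth <- factor(c("X0", "X0", "X0", "X1", "X0", "X1", "X1", "X1"))

threshold <- best_threshold_sketch(val_prob, val_truth)
print(threshold)
```

Lowering the cutoff below 0.5 trades specificity for sensitivity, which addresses the weak recall seen in every model here. The chosen threshold would then be applied unchanged to the test set.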
Reference
This kernel has been released under the Apache 2.0 open source license.
Code
---
title: "Pima Indians Diabetes Database Analysis"
author: "Luis Bronchal"
date: "March 12, 2017"
knit: (function(inputFile, encoding) {
  out_dir <- '../output';
  rmarkdown::render(inputFile,
                    encoding=encoding,
                    output_file=file.path(dirname(inputFile), out_dir, 'analysis.html')) })

output:
  html_document:
    theme: lumen
    toc: true
  html_notebook: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(comment=NA, message=FALSE, warning=FALSE)
```

# Summary

This is an analysis of the *Pima Indians Diabetes Database*, obtained from [Kaggle](https://www.kaggle.com/uciml/pima-diabetes-database).
It is a small dataset with missing values. We used imputation techniques and tried some explanatory models (classification tree and logistic regression) and predictive models (random forest and xgboost).


# Exploratory Data Analysis [EDA]

```{r}
library(needs)
needs(ggplot2,
      dplyr,
      corrplot,
      gridExtra,
      rpart.plot,
      e1071,
      mice,
      DMwR,
      pROC,
      caTools,
      caret,
      doMC)

registerDoMC(cores = detectCores() - 1)
```

## Data loading and cleaning

```{r}
dat <- read.csv("../input/diabetes.csv")
```

```{r}
str(dat)
```

```{r}
dat$Outcome <- factor(make.names(dat$Outcome))
```

Let's take a look at the data:
```{r}
summary(dat)
```

It looks like there are no explicit missing values, but on closer inspection some biological measurements have a value of zero in the dataset, which is impossible:

```{r}
biological_data <- dat[,setdiff(names(dat), c('Outcome', 'Pregnancies'))]
features_miss_num <- apply(biological_data, 2, function(x) sum(x<=0))
features_miss <- names(biological_data)[ features_miss_num > 0]
features_miss_num
```

Let's see how many rows have more than one of these invalid values:
```{r}

rows_errors <- apply(biological_data, 1, function(x) sum(x<=0)>1)
sum(rows_errors)
```

That is a lot of rows, more than 30% of the dataset:
```{r}
sum(rows_errors)/nrow(dat)
```



```{r}
biological_data[biological_data<=0] <- NA
dat[, names(biological_data)] <- biological_data
```

We have very little data, so we can't afford to get rid of these rows. We are going to try to impute the missing data.


```{r}
dat_original <- dat
dat[,-9] <- knnImputation(dat[,-9], k=5)
```



## Variable analysis

### Outcome

Let's look at the class proportions of the outcome.

```{r}
prop.table(table(dat$Outcome))
```

The classes are quite unbalanced: there are almost twice as many non-diabetic cases as diabetic ones.

### Correlation between variables {.tabset}

Let's see the correlation between numerical variables.
Some variables are highly correlated; that is the case of Age, for example.

```{r}
correlat <- cor(dat[, setdiff(names(dat), 'Outcome')])
```


#### Correlation matrix
```{r}
corrplot(correlat)
```

#### Data
```{r}
correlat
```

## Univariable analysis


```{r}
univar_graph <- function(univar_name, univar, data, output_var) {
  g_1 <- ggplot(data, aes(x=univar)) + geom_density() + xlab(univar_name)
  g_2 <- ggplot(data, aes(x=univar, fill=output_var)) + geom_density(alpha=0.4) + xlab(univar_name)
  grid.arrange(g_1, g_2, ncol=2, top=paste(univar_name,"variable", "/ [ Skew:",skewness(univar),"]"))
}

for (x in 1:(ncol(dat)-1)) {
  univar_graph(names(dat)[x], dat[,x], dat, dat[,'Outcome'])
}
```

There are variables with a strong right skew (Insulin, DiabetesPedigreeFunction, Age) and others with a strong left skew, like Blo…

# Machine learning model

## Baseline model

Let's create a baseline model. We'll see later if it is necessary to improve it.


```{r}
set.seed(1234)
dindex <- createDataPartition(dat$Outcome, p=0.7, list=FALSE)
train_data <- dat_original[dindex,]
test_data <- dat_original[-dindex,]
```

We are going to impute the missing data in the training and test sets separately.

```{r}
mice_train_mod <- mice(train_data[, features_miss], method='rf', seed=1234, printFlag = FALSE)
mice_test_mod <- mice(test_data[, features_miss], method='rf', seed=1234, printFlag = FALSE)
```

```{r}
train_data[, features_miss] <- complete(mice_train_mod)
test_data[, features_miss] <- complete(mice_test_mod)
```


The training set contains both possible cases and it is unbalanced.
```{r}
table(train_data$Outcome)
```

Let's try a logistic regression model with all the features:

```{r}
fitControl <- trainControl(method = "cv",
                           number = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
```

```{r}
model_glm <- train(Outcome~.,
                   train_data,
                   method="glm",
                   metric="ROC",
                   tuneLength=10,
                   preProcess = c('center', 'scale'),
                   trControl=fitControl)
```

```{r}
pred_glm <- predict(model_glm, test_data)
cm_glm <- confusionMatrix(pred_glm, test_data$Outcome, positive="X1")
cm_glm

```

```{r}
pred_prob_glm <- predict(model_glm, test_data, type="prob")
roc_glm <- roc(test_data$Outcome, pred_prob_glm$X1)
colAUC(pred_prob_glm$X1, test_data$Outcome, plotROC = TRUE)
```

We can see the result of this baseline model:

- The accuracy is not bad (`r cm_glm$overall['Accuracy']`), but it is not the best metric for an unbalanced dataset like this one.
- The AUC has a value of `r auc(roc_glm)`
- The F1 score is `r cm_glm$byClass['F1']`
- The recall (sensitivity) is quite poor: `r cm_glm$byClass['Sensitivity']`

Next things to consider in order to build a better model than this baseline:

- We have to think about the features to include in the model, because some are highly correlated (we can try PCA,…)
- We have to deal with the class imbalance (oversampling, synthetic cases,...)
- We can try different machine learning models

## Improving baseline model

We can decide to build an explanatory model or a highly predictive model.
We will try both cases:

### Feature importance analysis

We are going to use the Boruta technique to find the most relevant features.
Let's check whether there are unimportant variables with the Boruta technique:

```{r}
library(Boruta)
boruta_results <- Boruta(Outcome~., train_data)
boruta_results
plot(boruta_results)
```

All the variables are important except BloodPressure. Glucose is the most important one.

Looking at the correlation matrix, we can see some correlation between variables, but all values are below 0.75, which is consistent with Boruta,
and it looks like we can't get rid of any feature:
```{r}
findCorrelation(correlat, cutoff=0.75)
```


We are going to use a different approach: recursive feature elimination, to explore which is the best feature set for a linear model:

```{r}
caretFuncs$summary <- twoClassSummary
rfe_ctl <- rfeControl(functions=caretFuncs,
                      method = "cv",
                      number = 10,
                      returnResamp="final",
                      verbose = FALSE)
```

```{r}
rfe_glm <- rfe(train_data[ , setdiff(names(dat), 'Outcome')],
               train_data$Outcome,
               sizes=c(1:8),
               rfeControl=rfe_ctl,
               method="glm",
               metric = "ROC",
               trControl = fitControl)
```

Let's look at the resampling performance for each subset size:
```{r}
rfe_glm
predictors(rfe_glm)
```

```{r}
plot(rfe_glm, type=c("g", "o"))
```

With this approach, it looks like all the features are needed.

We are going to try explanatory models: logistic regression and classification trees.

### Explanatory models {.tabset}

#### Logistic Regression with regularization

We used a logistic regression as the baseline model. Its summary shows:
```{r}
summ_model_glm <- summary(model_glm$finalModel)
summ_model_glm
```

The model shows which are the most relevant features:

```{r}
coef_glm <- summ_model_glm$coefficients %>%
  as.data.frame() %>%
  mutate(Feature=rownames(summ_model_glm$coefficients)) %>%
  filter(Feature != "(Intercept)")
```

```{r}
coef_glm %>% filter(`Pr(>|z|)` < 0.05) %>% arrange(`Pr(>|z|)`)
```

The most relevant feature related to diabetes is Glucose, followed by BMI, Pregnancies and DiabetesPedigreeFunction.



```{r}
model_glmnet <- train(Outcome~.,
                      train_data,
                      method="glmnet",
                      metric="ROC",
                      tuneLength=20,
                      # the end of this call is missing from the extracted listing;
                      # trControl=fitControl is assumed here, as in model_glm
                      trControl=fitControl)
```

```{r}
pred_glmnet <- predict(model_glmnet, test_data)
cm_glmnet <- confusionMatrix(pred_glmnet, test_data$Outcome, positive="X1")
cm_glmnet
```

```{r}
pred_prob_glmnet <- predict(model_glmnet, test_data, type="prob")
roc_glmnet <- roc(test_data$Outcome, pred_prob_glmnet$X1)
```

The results are no better than those of the baseline model:
```{r}
cm_glmnet$byClass
```

#### Classification trees

```{r}
model_rpart <- train(Outcome~.,
                     train_data,
                     method="rpart",
                     metric="ROC",
                     tuneLength=20,
                     # the end of this call is missing from the extracted listing;
                     # trControl=fitControl is assumed here, as in model_glm
                     trControl=fitControl)
```

```{r}
pred_rpart <- predict(model_rpart, test_data)
cm_rpart <- confusionMatrix(pred_rpart, test_data$Outcome, positive="X1")
cm_rpart
```

```{r}
pred_prob_rpart <- predict(model_rpart, test_data, type="prob")
roc_rpart <- roc(test_data$Outcome, pred_prob_rpart$X1)
```

These are the results of the classification tree:
```{r}
cm_rpart$byClass
```

```{r}
rpart.plot(model_rpart$finalModel, type = 2, fallen.leaves = T, extra = 2)
```

This is consistent with the results of the logistic regression model. For people with high Gluco…


### Predictive model {.tabset}

#### Random Forest

```{r}
model_rf <- train(Outcome~.,
                  train_data,
                  method="ranger",
                  metric="ROC",
                  tuneLength=20,
                  # the end of this call is missing from the extracted listing;
                  # trControl=fitControl is assumed here, as in model_glm
                  trControl=fitControl)
```

```{r}
pred_rf <- predict(model_rf, test_data)
cm_rf <- confusionMatrix(pred_rf, test_data$Outcome, positive="X1")
cm_rf
```

```{r}
pred_prob_rf <- predict(model_rf, test_data, type="prob")
roc_rf <- roc(test_data$Outcome, pred_prob_rf$X1)
```


```{r}
cm_rf$byClass
```


#### XGBOOST

```{r}
xgb_grid_1 = expand.grid(
  nrounds = 50,
  eta = c(0.03),
  max_depth = 1,
  gamma = 0,
  colsample_bytree = 0.6,
  min_child_weight = 1,
  subsample = 0.5
)
model_xgbTree <- train(Outcome~.,
                       train_data,
                       method="xgbTree",
                       metric="ROC",
                       tuneGrid=xgb_grid_1,
                       # the end of this call is missing from the extracted listing;
                       # trControl=fitControl is assumed here, as in model_glm
                       trControl=fitControl)
```

```{r}
pred_xgbTree <- predict(model_xgbTree, test_data)
cm_xgbTree <- confusionMatrix(pred_xgbTree, test_data$Outcome, positive="X1")
cm_xgbTree
```

```{r}
pred_prob_xgbTree <- predict(model_xgbTree, test_data, type="prob")
roc_xgbTree <- roc(test_data$Outcome, pred_prob_xgbTree$X1)
```

The result is not great:
```{r}
cm_xgbTree$byClass
```

#### KNN

```{r}
model_knn <- train(Outcome~.,
                   train_data,
                   method="knn",
                   metric="ROC",
                   tuneGrid = expand.grid(.k=c(3:10)),
                   # the end of this call is missing from the extracted listing;
                   # trControl=fitControl is assumed here, as in model_glm
                   trControl=fitControl)
```


```{r}
pred_knn <- predict(model_knn, test_data)
cm_knn <- confusionMatrix(pred_knn, test_data$Outcome, positive="X1")
cm_knn
```

```{r}
pred_prob_knn <- predict(model_knn, test_data, type="prob")
roc_knn <- roc(test_data$Outcome, pred_prob_knn$X1)
```

```{r}
cm_knn$byClass
```

### Model comparison

We are going to compare these models using the cross-validation resamples from training:

```{r}
model_list <- list(GLM=model_glm, GMLNET=model_glmnet , RPART=model_rpart, RF=model_rf, XGBOOST=model_xgbTree, KNN=model_knn)
resamples <- resamples(model_list)
bwplot(resamples, metric="ROC")
```


This is the correlation between models. This information can be useful if we decide to combine some models to build a stacked model:

```{r}
model_cor <- modelCor(resamples)
model_cor
corrplot(model_cor)
```

Let's see the results of the models on the test data:

```{r}
results_glm <- c(cm_glm$byClass['Sensitivity'], cm_glm$byClass['F1'], roc_glm$auc)
results_glmnet <- c(cm_glmnet$byClass['Sensitivity'], cm_glmnet$byClass['F1'], roc_glmnet$auc)
results_rpart <- c(cm_rpart$byClass['Sensitivity'], cm_rpart$byClass['F1'], roc_rpart$auc)
# sensitivity taken from cm_rf, the random forest's own confusion matrix
results_rf <- c(cm_rf$byClass['Sensitivity'], cm_rf$byClass['F1'], roc_rf$auc)
results_xgbTree <- c(cm_xgbTree$byClass['Sensitivity'], cm_xgbTree$byClass['F1'], roc_xgbTree$auc)
results_knn <- c(cm_knn$byClass['Sensitivity'], cm_knn$byClass['F1'], roc_knn$auc)

results <- data.frame(rbind(results_glm, results_glmnet, results_rpart, results_rf, results_xgbTree, results_knn))
names(results) <- c("Sensitivity", "F1", "AUC")
results
```

Simple logistic regression looks like the best model here: it has the best F1 score and AUC, and its sensitivity is close to the best.


# Conclusion

We have developed some explanatory models (classification tree and logistic regression). They show which factors are most associated with a person having diabetes. Predictive models should improve prediction performance, but they do not achieve outstanding results.

# Next things to try

- Try different imputation techniques
- Split dataset into training, validation and testing set in order to find optimal threshold over the validation data.
- Try different machine learning models
- Build a stacked model

# Reference

- [Missing values](http://machinelearningmastery.com/how-to-handle-missing-values-in-machine
- [Imputation kernel inspiration](https://www.kaggle.com/hinchou/d/uciml/pima-indians-diabet
- [Feature selection](http://machinelearningmastery.com/feature-selection-with-the-caret-r-p

Data
Data Sources
Pima Indians Diabetes Database
  Pima Indians Diabetes Database Predict the onset of diabetes based on diagnostic mea
Last Updated: 2 years ago (Version 1)
 diabetes.csv 768 x 9
About this Dataset
Context
This dataset is originally from the National Institute of Diabetes and
Kidney Diseases. The objective of the dataset is to diagnostically pre
not a patient has diabetes, based on certain diagnostic measuremen
the dataset. Several constraints were placed on the selection of thes
from a larger database. In particular, all patients here are females at
old of Pima Indian heritage.
Content
The datasets consists of several medical predictor variables and one
Outcome . Predictor variables includes the number of pregnancies t
had, their BMI, insulin level, age, and so on.
Acknowledgements
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes,
Using the ADAP learning algorithm to forecast the onset of diabetes
Proceedings of the Symposium on Computer Applications and Medic
-265). IEEE Computer Society Press.
Inspiration
Can you build a machine learning model to accurately predict wheth
patients in the dataset have diabetes or not?
Run Info
Succeeded: True
Exit Code: 0
Run Time: 209.6 seconds
Queue Time: 0 seconds
Docker Image Name: kaggle/rstats (Dockerfile)
Output Size: 0
Timeout Exceeded: False
Used All Space: False
Failure Message:
Log
Time Line # Log Message
1.3s 1 processing file: script.Rmd
1.5s 7 Load `package:needs` in an interactive session to set auto-load flag
26.8s 53 Loading required package: ranger
33.4s 60 You have loaded plyr after dplyr - this is likely to cause problems.
         If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
         library(plyr); library(dplyr)
33.4s 61 Attaching package: 'plyr'
         The following object is masked from 'package:DMwR':
             join
         The following objects are masked from 'package:dplyr':
             arrange, count, desc, failwith, id, mutate, rename, summarise,
             summarize
88.0s 73 Loading required package: glmnet
88.2s 74 Loading required package: Matrix
         Loaded glmnet 2.0-5
         Attaching package: 'glmnet'
         The following object is masked from 'package:pROC':
             auc
112.8s 100 Loading required package: xgboost
209.2s 127 ...
209.2s 128 Complete. Exited with code 0.
Comments (8)
Pranav Pandya • Posted on Version 8 • 2 years ago
Nice!
Luis Bronchal Kernel Author • Posted on Version 8 • 2 years ago
Thanks
Carlos Crosetti • Posted on Version 8 • 2 years ago
Luis, nice work. When looking at your analysis of missing values (for instance blood pressure cannot be zero, same for BMI) you documented the remediation by invoking some function (kmi…) could you please elaborate a bit further?
Luis Bronchal Kernel Author • Posted on Version 8 • 2 years ago
You have to deal with the missing data. There are different approaches to do this. I have tried KNN imputation (with the function knnImputation) to do a first analysis, but it is possible to try other techniques (it is the first item in the section 'Next things to try' of my report).
Here you can see an interesting intro about this subject:
http://r-statistics.co/Missing-Value-Treatment-With-R.html
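To illustrate the idea behind KNN imputation mentioned in the reply above, here is a minimal base-R sketch. It is not the `knnImputation` function from DMwR and may differ from it in scaling, distance, and aggregation details; `knn_impute_sketch` is a hypothetical helper written for this example, and the toy values and the use of the neighbours' median are assumptions.

```r
# Minimal base-R sketch of k-nearest-neighbour imputation.
# knn_impute_sketch is a hypothetical helper, not DMwR::knnImputation.
knn_impute_sketch <- function(df, k = 5) {
  m <- as.matrix(df)
  scaled <- scale(m)  # column-wise standardisation; NAs are ignored
  complete_rows <- which(stats::complete.cases(m))
  for (i in which(!stats::complete.cases(m))) {
    observed <- !is.na(m[i, ])
    # Euclidean distance to every complete row, using observed columns only
    diffs <- sweep(scaled[complete_rows, observed, drop = FALSE], 2,
                   scaled[i, observed])
    dists <- sqrt(rowSums(diffs^2))
    neighbours <- complete_rows[order(dists)[seq_len(min(k, length(dists)))]]
    # Replace each missing value with the median of the neighbours' values
    for (j in which(!observed)) {
      m[i, j] <- median(m[neighbours, j])
    }
  }
  as.data.frame(m)
}

# Toy data: zeros stand in for missing measurements, as in this dataset
toy <- data.frame(
  Glucose = c(148, 85, 183, 89, 137, 116, 78, 115),
  BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 0),
  Insulin = c(0, 94, 168, 88, 543, 0, 88, 120)
)
toy[toy == 0] <- NA
imputed <- knn_impute_sketch(toy, k = 3)
```

Only rows without any missing value are used as candidate neighbours here, so the approach needs enough complete rows to be meaningful.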
Harshita S. Jain • Posted on Version 8 • 2 years ago
the unit of insulin 2 hour serum test is muU/ml. can anybody tell me the normal range in order to differe
towards the diabetic prone person. i am able to find insulin test data in mg/dl or mol/l. so i am unable to
kindly help me either in conversion of muU/ml to mg/dl or mol/l, or kindly tell me the range of muU/ml
Luis Bronchal Kernel Author • Posted on Latest Version • 2 years ago
I am not an expert on this business domain, so I can't help you with that.
Xiao Liu • Posted on Latest Version • 2 years ago
Beautiful work! :)
liangwei93 • Posted on Latest Version • a year ago
When I use the kNN imputation,

dat[,-9] <- knnImputation(dat[,-9], k=5)
Error: Invalid column indexes: 310, 347, 25, 256, 52

I got this error, any idea why?