1/27/2019 Pima Indians Diabetes Database Analysis | Kaggle

https://www.kaggle.com/lbronchal/pima-indians-diabetes-database-analysis

Pima Indians Diabetes Database Analysis

Luis Bronchal

March 12, 2017

Summary
Exploratory Data Analysis [EDA]
Data loading and cleaning
Variable analysis
Outcome
Correlation between variables
Univariable analysis
Machine learning model
Baseline model
Improving baseline model
Feature importance analysis
Explanatory models
Predictive model
Model comparison
Conclusion
Next things to try
Reference

Summary
This is an analysis of the Pima Indians Diabetes Database, obtained from Kaggle (https://www.kaggle.com/uciml/pima-
diabetes-database). It is a small dataset with missing values. We have used imputation techniques and tried some explanatory models
(classification tree and logistic regression) and predictive models (random forest and xgboost).

Exploratory Data Analysis [EDA]


library(needs)
needs(ggplot2,
      dplyr,
      corrplot,
      gridExtra,
      rpart.plot,
      e1071,
      mice,
      DMwR,
      pROC,
      caTools,
      caret,
      doMC)

registerDoMC(cores = detectCores() - 1)


Data loading and cleaning

dat <- read.csv("../input/diabetes.csv")

str(dat)

'data.frame': 768 obs. of 9 variables:


$ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
$ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
$ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
$ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
$ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
$ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
$ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
$ Age : int 50 31 32 21 33 30 26 29 53 54 ...
$ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...

dat$Outcome <- factor(make.names(dat$Outcome))

Let’s take a look at the data:

summary(dat)

  Pregnancies        Glucose       BloodPressure    SkinThickness
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
    Insulin            BMI        DiabetesPedigreeFunction      Age
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00
 Median : 30.5   Median :32.00   Median :0.3725           Median :29.00
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00
 Outcome
 X0:500
 X1:268

It looks like there aren’t explicit missing values, but if we look in detail we can see that some biological measurements have
zero values in the dataset, and that’s impossible:


biological_data <- dat[, setdiff(names(dat), c('Outcome', 'Pregnancies'))]
features_miss_num <- apply(biological_data, 2, function(x) sum(x <= 0))
features_miss <- names(biological_data)[features_miss_num > 0]
features_miss_num

                 Glucose            BloodPressure            SkinThickness
                       5                       35                      227
                 Insulin                      BMI DiabetesPedigreeFunction
                     374                       11                        0
                     Age
                       0

Let’s see how many rows have more than one of these invalid values:

rows_errors <- apply(biological_data, 1, function(x) sum(x <= 0) > 1)
sum(rows_errors)

[1] 234

That is a lot of rows, more than 30% of the dataset:

sum(rows_errors)/nrow(dat)

[1] 0.3046875

biological_data[biological_data<=0] <- NA
dat[, names(biological_data)] <- biological_data
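
The recoding of impossible zeros as missing values can be illustrated on a small example. This is only a sketch in base R: the data frame `toy` and its values are hypothetical and are not rows from the real dataset.

```r
# Toy data frame (hypothetical values, not the real dataset)
toy <- data.frame(Glucose = c(148, 0, 183),
                  BMI     = c(33.6, 26.6, 0),
                  Age     = c(50, 31, 32))

# Count non-positive entries per column, as the analysis does with apply()
zero_counts <- colSums(toy <= 0)

# Recode the impossible values as NA so that imputation methods can see them
toy[toy <= 0] <- NA
na_counts <- colSums(is.na(toy))
```

After the recoding, `na_counts` should report one missing value for `Glucose` and one for `BMI`, and none for `Age`.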

We have very little data, so we can’t get rid of these rows. We are going to try to impute the missing data.

dat_original <- dat


dat[,-9] <- knnImputation(dat[,-9], k=5)

    
Variable analysis

Outcome
Let’s see the proportion of the outcome output.

prop.table(table(dat$Outcome))

X0 X1
0.6510417 0.3489583

It is quite unbalanced: there are almost twice as many non-diabetic cases as diabetic ones.
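
As a quick check of the imbalance, the class ratio can be computed directly from the counts reported by `summary(dat)` (500 and 268). The variable names below are only illustrative.

```r
# Class counts taken from the summary output above
counts <- c(X0 = 500, X1 = 268)

# Proportions (same numbers as prop.table) and majority/minority ratio
props <- counts / sum(counts)
ratio <- counts[["X0"]] / counts[["X1"]]
```

The ratio comes out a little below 2, so a classifier that always predicts `X0` would already reach about 65% accuracy.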

Correlation between variables



Let’s see the correlation between numerical variables. There are variables which are highly correlated. That’s the case
of Age, for example.

correlat <- cor(dat[, setdiff(names(dat), 'Outcome')])

Correlation matrix Data

corrplot(correlat)

Univariable analysis

univar_graph <- function(univar_name, univar, data, output_var) {
  g_1 <- ggplot(data, aes(x=univar)) + geom_density() + xlab(univar_name)
  g_2 <- ggplot(data, aes(x=univar, fill=output_var)) + geom_density(alpha=0.4) + xlab(univar_name)
  grid.arrange(g_1, g_2, ncol=2, top=paste(univar_name,"variable", "/ [ Skew:",skewness(univar),"]"))
}

for (x in 1:(ncol(dat)-1)) {
univar_graph(names(dat)[x], dat[,x], dat, dat[,'Outcome'])
}

There are variables with high right skew (Insulin, DiabetesPedigreeFunction, Age) and others with high left skew, like BloodPressure.
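
To make the sign of the skew concrete, here is a minimal base R sketch of the moment-based skewness coefficient. The helper `moment_skewness` and the two sample vectors are illustrative assumptions; `e1071::skewness` offers several estimator types whose small-sample corrections can give slightly different numbers.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)
  m3 <- mean(centered^3)
  m3 / m2^(3 / 2)
}

right_tailed <- c(1, 1, 2, 2, 3, 10)   # long tail to the right
left_tailed  <- -right_tailed          # mirror image, long tail to the left

skew_right <- moment_skewness(right_tailed)
skew_left  <- moment_skewness(left_tailed)
```

A positive value indicates a long right tail and a negative value a long left tail; a symmetric sample gives a value of zero.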

Machine learning model

Baseline model
Let’s create a baseline model. We’ll see later if it is necessary to improve it.

set.seed(1234)
dindex <- createDataPartition(dat$Outcome, p=0.7, list=FALSE)
train_data <- dat_original[dindex,]
test_data <- dat_original[-dindex,]
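
For a factor outcome, `createDataPartition` samples within each class so that the split keeps the class proportions. The following base R sketch shows the same idea; `stratified_split` is a hypothetical helper written for illustration, not caret's implementation, and it will not reproduce the exact rows caret selects.

```r
# Stratified split: sample a fraction p of the indices inside each class
stratified_split <- function(y, p = 0.7, seed = 1234) {
  set.seed(seed)
  by_class <- split(seq_along(y), y)
  idx <- unlist(lapply(by_class, function(i) {
    i[sample.int(length(i), size = round(p * length(i)))]
  }))
  sort(unname(idx))
}

# Outcome vector with the same class counts as the dataset (500 / 268)
y <- factor(rep(c("X0", "X1"), times = c(500, 268)))
train_idx <- stratified_split(y)
train_counts <- table(y[train_idx])
```

With `p = 0.7` this yields 350 `X0` and 188 `X1` cases, the same class counts the training set shows in this analysis.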

We are going to impute the missing data in the training and testing sets separately.

mice_train_mod <- mice(train_data[, features_miss], method='rf', seed=1234, printFlag = FALSE)
mice_test_mod <- mice(test_data[, features_miss], method='rf', seed=1234, printFlag = FALSE)

train_data[, features_miss] <- complete(mice_train_mod)
test_data[, features_miss] <- complete(mice_test_mod)

The training set contains both possible cases and it is unbalanced.



table(train_data$Outcome)

X0 X1
350 188

Let’s try a logistic regression model with all the features

fitControl <- trainControl(method = "cv",
                           number = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

model_glm <- train(Outcome~.,
                   train_data,
                   method="glm",
                   metric="ROC",
                   tuneLength=10,
                   preProcess = c('center', 'scale'),
                   trControl=fitControl)

pred_glm <- predict(model_glm, test_data)


cm_glm <- confusionMatrix(pred_glm, test_data$Outcome, positive="X1")
cm_glm

Confusion Matrix and Statistics

          Reference
Prediction  X0  X1
        X0 129  29
        X1  21  51

               Accuracy : 0.7826
                 95% CI : (0.7236, 0.8341)
    No Information Rate : 0.6522
    P-Value [Acc > NIR] : 1.156e-05

                  Kappa : 0.5094
 Mcnemar's Test P-Value : 0.3222

            Sensitivity : 0.6375
            Specificity : 0.8600
         Pos Pred Value : 0.7083
         Neg Pred Value : 0.8165
             Prevalence : 0.3478
         Detection Rate : 0.2217
   Detection Prevalence : 0.3130
      Balanced Accuracy : 0.7488

       'Positive' Class : X1

pred_prob_glm <- predict(model_glm, test_data, type="prob")


roc_glm <- roc(test_data$Outcome, pred_prob_glm$X1)
colAUC(pred_prob_glm$X1, test_data$Outcome, plotROC = TRUE)

[,1]
X0 vs. X1 0.8595833
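
The AUC can be read as the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. A minimal base R sketch of that rank-based (Mann-Whitney) computation follows; `auc_rank` and the toy scores are illustrative assumptions, not the code used by `pROC` or `caTools`.

```r
# Rank-based AUC: sum of the ranks of the positive cases, corrected
# for the minimum possible rank sum, over the number of pos/neg pairs
auc_rank <- function(scores, labels, positive = "X1") {
  pos   <- labels == positive
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # ties receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c("X0", "X0", "X1", "X1")
scores <- c(0.10, 0.40, 0.35, 0.80)
toy_auc <- auc_rank(scores, labels)
```

In this toy example three of the four positive/negative pairs are ordered correctly, so the result is 0.75.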

We can see the result of this baseline model:

The accuracy is not bad (0.7826), but it is not the best metric in this case.
The AUC has a value of 0.8595833.
The F1 score is 0.6710526.
The recall (sensitivity) is quite poor: 0.6375.
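
These figures can be recomputed by hand from the four cells of the baseline confusion matrix (rows are predictions, columns are the reference, and `X1` is the positive class). The sketch below uses only the counts printed above; the variable names are illustrative.

```r
# Cells of the baseline confusion matrix
tp <- 51   # predicted X1, reference X1
fn <- 29   # predicted X0, reference X1
fp <- 21   # predicted X1, reference X0
tn <- 129  # predicted X0, reference X0

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)          # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)          # positive predictive value
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
```

The values agree with the `confusionMatrix` output: accuracy 0.7826, sensitivity 0.6375, specificity 0.86 and F1 0.6710526.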

Next things to consider in order to build a better model than this baseline one:

We have to think about the features to include in the model, because some are highly correlated (we can try PCA,…
We have to work with the unbalanced problem (oversampling, synthetic cases,…)
We can try different machine learning models
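
One of the options listed above, oversampling, can be sketched in base R as random resampling of the minority class with replacement. The helper `oversample_minority` and the toy data are hypothetical; such a step should only be applied to the training data, never to the test set.

```r
# Random oversampling: resample each smaller class with replacement
# until it has as many rows as the largest class
oversample_minority <- function(df, target, seed = 1234) {
  set.seed(seed)
  counts <- table(df[[target]])
  n_max  <- max(counts)
  parts <- lapply(names(counts), function(cls) {
    rows  <- which(df[[target]] == cls)
    extra <- rows[sample.int(length(rows), n_max - length(rows), replace = TRUE)]
    df[c(rows, extra), , drop = FALSE]
  })
  do.call(rbind, parts)
}

toy <- data.frame(Glucose = c(90, 100, 110, 120, 170, 180),
                  Outcome = factor(c("X0", "X0", "X0", "X0", "X1", "X1")))
balanced <- oversample_minority(toy, "Outcome")
balanced_counts <- table(balanced$Outcome)
```

Both classes end up with four rows here; the added `X1` rows are copies of existing ones, which is why synthetic approaches are sometimes preferred.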

Improving baseline model


We can decide to build an explanatory model or a highly predictive model. We will try both cases:


Feature importance analysis


We are going to use the Boruta technique to find the most relevant features. Let’s take a look to see if there are unimportant
variables with the Boruta technique:

library(Boruta)
boruta_results <- Boruta(Outcome~., train_data)
boruta_results

Boruta performed 26 iterations in 6.13668 secs.


7 attributes confirmed important: Age, BMI,
DiabetesPedigreeFunction, Glucose, Insulin and 2 more;
1 attributes confirmed unimportant: BloodPressure;

plot(boruta_results)

All the variables are important except BloodPressure. Glucose is the most important one.

If we look at the correlation matrix between variables, we can see some correlation, but all values are below 0.75, so that’s coherent
with Boruta, and it looks like we can’t get rid of any feature:

findCorrelation(correlat, cutoff=0.75)

integer(0)
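
`findCorrelation` returns the indices of columns it suggests removing, so `integer(0)` means no pair exceeds the cutoff. To see which pairs would be flagged, a base R sketch like the following can help; `high_cor_pairs` is a hypothetical helper and the correlation matrix is made up for illustration.

```r
# List variable pairs whose absolute correlation exceeds the cutoff,
# looking only at the upper triangle so each pair appears once
high_cor_pairs <- function(cor_mat, cutoff = 0.75) {
  idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
  data.frame(var1 = rownames(cor_mat)[idx[, "row"]],
             var2 = colnames(cor_mat)[idx[, "col"]],
             cor  = cor_mat[idx],
             stringsAsFactors = FALSE)
}

vars <- c("a", "b", "c")
toy_cor <- matrix(c(1.0, 0.9, 0.2,
                    0.9, 1.0, 0.1,
                    0.2, 0.1, 1.0),
                  nrow = 3, dimnames = list(vars, vars))

flagged      <- high_cor_pairs(toy_cor, cutoff = 0.75)
none_flagged <- high_cor_pairs(toy_cor, cutoff = 0.95)
```

With the 0.75 cutoff only the pair (`a`, `b`) is reported; with 0.95 the result is an empty data frame, the analogue of `integer(0)` above.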

We are going to use a different approach. We are going to recursively explore which is the best feature set for a linear
model:


caretFuncs$summary <- twoClassSummary
rfe_ctl <- rfeControl(functions=caretFuncs,
                      method = "cv",
                      number = 10,
                      returnResamp="final",
                      verbose = FALSE)

rfe_glm <- rfe(train_data[ , setdiff(names(dat), 'Outcome')],
               train_data$Outcome,
               sizes=c(1:8),
               rfeControl=rfe_ctl,
               method="glm",
               metric = "ROC",
               preProcess = c('center', 'scale'),
               trControl = fitControl)

It looks like one of the features (SkinThickness) can be omitted:

rfe_glm

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold)

Resampling performance over subset size:

Variables ROC Sens Spec ROCSD SensSD SpecSD Selected


1 0.6197 0.9286 0.1751 0.06304 0.05260 0.10556
2 0.8115 0.8857 0.5111 0.04118 0.08193 0.07848
3 0.8088 0.8857 0.5164 0.04089 0.08518 0.06389
4 0.8155 0.8800 0.5322 0.04104 0.08497 0.09402
5 0.8128 0.8800 0.5322 0.04272 0.08497 0.09402
6 0.8186 0.8829 0.5424 0.04397 0.07184 0.09386
7 0.8275 0.8829 0.5582 0.04192 0.07309 0.10363
8 0.8310 0.8857 0.5582 0.04511 0.07499 0.10363 *

The top 5 variables (out of 8):


Glucose, BMI, Pregnancies, DiabetesPedigreeFunction, Age

predictors(rfe_glm)

[1] "Glucose" "BMI"


[3] "Pregnancies" "DiabetesPedigreeFunction"
[5] "Age" "Insulin"
[7] "SkinThickness" "BloodPressure"

plot(rfe_glm, type=c("g", "o"))


From this approach, it looks like all features are needed.

We are going to try explanatory models: logistic regression and classification trees.

Explanatory models

Logistic Regression with regularization Classification trees

We used a logistic regression as the baseline model. Its summary shows:

summ_model_glm <- summary(model_glm$finalModel)


summ_model_glm

Call:
NULL

Deviance Residuals:
Min 1Q Median 3Q Max
-2.7007 -0.7340 -0.4207 0.7024 2.4104

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.831720 0.114078 -7.291 3.08e-13 ***
Pregnancies 0.359195 0.127200 2.824 0.004745 **
Glucose 1.125086 0.149146 7.544 4.57e-14 ***
BloodPressure 0.006958 0.124510 0.056 0.955433
SkinThickness 0.043982 0.145204 0.303 0.761969
Insulin -0.102950 0.134696 -0.764 0.444682
BMI 0.510131 0.154228 3.308 0.000941 ***
DiabetesPedigreeFunction 0.314418 0.114818 2.738 0.006174 **
Age 0.153331 0.130900 1.171 0.241452
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 696.28 on 537 degrees of freedom


Residual deviance: 506.51 on 529 degrees of freedom
AIC: 524.51

Number of Fisher Scoring iterations: 5

The model shows which are the most relevant features:

coef_glm <- summ_model_glm$coefficients %>%
  as.data.frame() %>%
  mutate(Feature=rownames(summ_model_glm$coefficients)) %>%
  filter(Feature != "(Intercept)")

coef_glm %>% filter(`Pr(>|z|)` < 0.05) %>% arrange(`Pr(>|z|)`)

Estimate Std. Error z value Pr(>|z|) Feature


1 1.1250863 0.1491456 7.543545 4.573636e-14 Glucose
2 0.5101313 0.1542279 3.307645 9.408386e-04 BMI
3 0.3591953 0.1272002 2.823859 4.744927e-03 Pregnancies
4 0.3144183 0.1148182 2.738402 6.173855e-03 DiabetesPedigreeFunction

The most relevant feature related to diabetes is Glucose, followed by BMI, Pregnancies and DiabetesPedigreeFunction.

It looks like one of the features (SkinThickness) can be omitted:

rfe_glm

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold)

Resampling performance over subset size:

Variables ROC Sens Spec ROCSD SensSD SpecSD Selected


1 0.6197 0.9286 0.1751 0.06304 0.05260 0.10556
2 0.8115 0.8857 0.5111 0.04118 0.08193 0.07848
3 0.8088 0.8857 0.5164 0.04089 0.08518 0.06389
4 0.8155 0.8800 0.5322 0.04104 0.08497 0.09402
5 0.8128 0.8800 0.5322 0.04272 0.08497 0.09402
6 0.8186 0.8829 0.5424 0.04397 0.07184 0.09386


7 0.8275 0.8829 0.5582 0.04192 0.07309 0.10363
8 0.8310 0.8857 0.5582 0.04511 0.07499 0.10363 *

The top 5 variables (out of 8):


Glucose, BMI, Pregnancies, DiabetesPedigreeFunction, Age

predictors(rfe_glm)

[1] "Glucose" "BMI"


[3] "Pregnancies" "DiabetesPedigreeFunction"
[5] "Age" "Insulin"
[7] "SkinThickness" "BloodPressure"

model_glmnet <- train(Outcome~.,


train_data,
method="glmnet",
metric="ROC",
tuneLength=20,
preProcess = c('center', 'scale'),
trControl=fitControl)

pred_glmnet <- predict(model_glmnet, test_data)


cm_glmnet <- confusionMatrix(pred_glmnet, test_data$Outcome, positive="X1")
cm_glmnet

Confusion Matrix and Statistics

Reference
Prediction X0 X1
X0 132 35
X1 18 45

Accuracy : 0.7696
95% CI : (0.7097, 0.8224)
No Information Rate : 0.6522
P-Value [Acc > NIR] : 7.748e-05

Kappa : 0.4656
Mcnemar's Test P-Value : 0.02797

Sensitivity : 0.5625
Specificity : 0.8800
Pos Pred Value : 0.7143
Neg Pred Value : 0.7904
Prevalence : 0.3478
Detection Rate : 0.1957
Detection Prevalence : 0.2739
Balanced Accuracy : 0.7212

'Positive' Class : X1

pred_prob_glmnet <- predict(model_glmnet, test_data, type="prob")


roc_glmnet <- roc(test_data$Outcome, pred_prob_glmnet$X1)

The results are not better than those of the baseline model (only specificity improves slightly, from 0.86 to 0.88):

cm_glmnet$byClass

Sensitivity Specificity Pos Pred Value


0.5625000 0.8800000 0.7142857
Neg Pred Value Precision Recall
0.7904192 0.7142857 0.5625000
F1 Prevalence Detection Rate
0.6293706 0.3478261 0.1956522
Detection Prevalence Balanced Accuracy
0.2739130 0.7212500

Predictive model

Random Forest XGBOOST KNN

model_rf <- train(Outcome~.,


train_data,
method="ranger",
metric="ROC",
tuneLength=20,
trControl=fitControl)

note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .

pred_rf <- predict(model_rf, test_data)


cm_rf <- confusionMatrix(pred_rf, test_data$Outcome, positive="X1")
cm_rf

Confusion Matrix and Statistics

Reference
Prediction X0 X1
X0 122 28
X1 28 52

Accuracy : 0.7565
95% CI : (0.6958, 0.8105)
No Information Rate : 0.6522

P-Value [Acc > NIR] : 0.000421

Kappa : 0.4633
Mcnemar's Test P-Value : 1.000000

Sensitivity : 0.6500
Specificity : 0.8133
Pos Pred Value : 0.6500
Neg Pred Value : 0.8133
Prevalence : 0.3478
Detection Rate : 0.2261
Detection Prevalence : 0.3478
Balanced Accuracy : 0.7317

'Positive' Class : X1

pred_prob_rf <- predict(model_rf, test_data, type="prob")


roc_rf <- roc(test_data$Outcome, pred_prob_rf$X1)

cm_rf$byClass

Sensitivity Specificity Pos Pred Value


0.6500000 0.8133333 0.6500000
Neg Pred Value Precision Recall
0.8133333 0.6500000 0.6500000
F1 Prevalence Detection Rate
0.6500000 0.3478261 0.2260870
Detection Prevalence Balanced Accuracy
0.3478261 0.7316667

Model comparison
We are going to compare these models over the training and resampling data:

model_list <- list(GLM=model_glm, GMLNET=model_glmnet, RPART=model_rpart, RF=model_rf, XGBOOST=model_xgbTree, KNN=model_knn)
resamples <- resamples(model_list)
bwplot(resamples, metric="ROC")


This is the correlation between models. This information can be used if we decide to combine some models to build a stacked model.

model_cor <- modelCor(resamples)


model_cor

               GLM      GMLNET      RPART          RF     XGBOOST         KNN
GLM      1.0000000  0.29620506 -0.1539184 -0.16271384  0.20154705  0.16901408
GMLNET   0.2962051  1.00000000 -0.3128653 -0.07633388 -0.18250330 -0.33037695
RPART   -0.1539184 -0.31286529  1.0000000  0.28665403 -0.43370053 -0.13364488
RF      -0.1627138 -0.07633388  0.2866540  1.00000000  0.02043130 -0.08902903
XGBOOST  0.2015471 -0.18250330 -0.4337005  0.02043130  1.00000000  0.09460115
KNN      0.1690141 -0.33037695 -0.1336449 -0.08902903  0.09460115  1.00000000

corrplot(model_cor)


Let’s see the models’ results when they are applied to the test data:

results_glm <- c(cm_glm$byClass['Sensitivity'], cm_glm$byClass['F1'], roc_glm$auc)
results_glmnet <- c(cm_glmnet$byClass['Sensitivity'], cm_glmnet$byClass['F1'], roc_glmnet$auc)
results_rpart <- c(cm_rpart$byClass['Sensitivity'], cm_rpart$byClass['F1'], roc_rpart$auc)
results_rf <- c(cm_rf$byClass['Sensitivity'], cm_rf$byClass['F1'], roc_rf$auc)
results_xgbTree <- c(cm_xgbTree$byClass['Sensitivity'], cm_xgbTree$byClass['F1'], roc_xgbTree$auc)
results_knn <- c(cm_knn$byClass['Sensitivity'], cm_knn$byClass['F1'], roc_knn$auc)

results <- data.frame(rbind(results_glm, results_glmnet, results_rpart, results_rf, results_xgbTree, results_knn))
names(results) <- c("Sensitivity", "F1", "AUC")
results

                Sensitivity        F1       AUC
results_glm          0.6375 0.6710526 0.8595833
results_glmnet       0.5625 0.6293706 0.8586667
results_rpart        0.6375 0.6071429 0.7812500
results_rf           0.6500 0.6500000 0.8317500
results_xgbTree      0.4500 0.5413534 0.8252917
results_knn          0.5625 0.5625000 0.7662917

Simple logistic regression looks like the best model here: it has the best F1 score and AUC, and its sensitivity (0.6375) is close to the best one (0.65, from the random forest).

Conclusion
We have developed some explanatory models (classification tree and logistic regression). They show us which are the most relevant
factors for a person to have diabetes. Predictive models should improve prediction performance, but they don’t achieve
outstanding results.

Next things to try


Try different imputation techniques
Split dataset into training, validation and testing set in order to find optimal threshold over the validation data.
Try different machine learning models
Build a stacked model
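
The threshold idea from the list above could look like the following base R sketch: scan a grid of cut-offs on validation-set probabilities and keep the one with the best F1 score. The helper `best_threshold`, the grid and the toy probabilities are all assumptions made for illustration; F1 is only one possible criterion.

```r
# Pick the probability cut-off that maximises F1 on validation data
best_threshold <- function(prob, truth, positive = "X1",
                           thresholds = seq(0.05, 0.95, by = 0.05)) {
  f1_at <- function(t) {
    pred_pos <- prob >= t
    tp <- sum(pred_pos & truth == positive)
    fp <- sum(pred_pos & truth != positive)
    fn <- sum(!pred_pos & truth == positive)
    if (tp == 0) return(0)
    2 * tp / (2 * tp + fp + fn)
  }
  f1 <- vapply(thresholds, f1_at, numeric(1))
  list(threshold = thresholds[which.max(f1)], f1 = max(f1))
}

# Hypothetical validation-set probabilities for class X1
prob  <- c(0.12, 0.22, 0.31, 0.37, 0.42, 0.81)
truth <- c("X0", "X0", "X0", "X1", "X1", "X1")
res <- best_threshold(prob, truth)
```

In this toy case a cut-off of 0.35 separates the classes perfectly, whereas the default 0.5 would miss two of the three positive cases.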

Reference
This kernel has been released under the Apache 2.0 open source license.


Code

---
title: "Pima Indians Diabetes Database Analysis"
author: "Luis Bronchal"
date: "March 12, 2017"
knit: (function(inputFile, encoding) {
  out_dir <- '../output';
  rmarkdown::render(inputFile,
                    encoding=encoding,
                    output_file=file.path(dirname(inputFile), out_dir, 'analysis.html')) })

output:
  html_document:
    theme: lumen
    toc: true
  html_notebook: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(comment=NA, message=FALSE, warning=FALSE)
```

# Summary

This is an analysis of the *Pima Indians Diabetes Database*, obtained from [Kaggle](https://
It is a small dataset with missing values. We have used imputation techniques and tried some explanatory models (classification tree and logistic regression) and predictive models (random forest and xgboost).


# Exploratory Data Analysis [EDA]

```{r}
library(needs)
needs(ggplot2,
      dplyr,
      corrplot,
      gridExtra,
      rpart.plot,
      e1071,
      mice,
      DMwR,
      pROC,
      caTools,
      caret,
      doMC)

registerDoMC(cores = detectCores() - 1)
```

## Data loading and cleaning

```{r}
dat <- read.csv("../input/diabetes.csv")
```

```{r}
str(dat)
```

```{r}
dat$Outcome <- factor(make.names(dat$Outcome))
```

Let's take a look at the data:
```{r}
summary(dat)
```

It looks like there aren't explicit missing values, but if we see in detail we can see some
impossible:

```{r}
biological_data <- dat[,setdiff(names(dat), c('Outcome', 'Pregnancies'))]
features_miss_num <- apply(biological_data, 2, function(x) sum(x<=0))
features_miss <- names(biological_data)[ features_miss_num > 0]
features_miss_num
```

Let's see how many rows have more than one of these invalid values:
```{r}

rows_errors <- apply(biological_data, 1, function(x) sum(x<=0)>1)
sum(rows_errors)
```

That is a lot of rows, more than 30% of the dataset:
```{r}
sum(rows_errors)/nrow(dat)
```

```{r}
biological_data[biological_data<=0] <- NA
dat[, names(biological_data)] <- biological_data
```

We have very little data, so we can't get rid of these rows. We are going to try to impute the missing data.

```{r}
dat_original <- dat
dat[,-9] <- knnImputation(dat[,-9], k=5)
```

## Variable analysis

### Outcome

Let's see the proportion of the outcome output.

```{r}
prop.table(table(dat$Outcome))
```

It is quite unbalanced: there are almost twice as many non-diabetic cases as diabetic ones.

### Correlation between variables {.tabset}

Let's see the correlation between numerical variables.
There are variables which are highly correlated. That's the case of Age, for example.

```{r}
correlat <- cor(dat[, setdiff(names(dat), 'Outcome')])
```


#### Correlation matrix
```{r}
corrplot(correlat)
```

#### Data
```{r}
correlat
```

## Univariable analysis


```{r}
univar_graph <- function(univar_name, univar, data, output_var) {
  g_1 <- ggplot(data, aes(x=univar)) + geom_density() + xlab(univar_name)
  g_2 <- ggplot(data, aes(x=univar, fill=output_var)) + geom_density(alpha=0.4) + xlab(univar_name)
  grid.arrange(g_1, g_2, ncol=2, top=paste(univar_name,"variable", "/ [ Skew:",skewness(univar),"]"))
}

for (x in 1:(ncol(dat)-1)) {
  univar_graph(names(dat)[x], dat[,x], dat, dat[,'Outcome'])
}
```

There are variables with high right skew (Insulin, DiabetesPedigreeFunction, Age) and others with high left skew, like BloodPressure.

# Machine learning model

## Baseline model

Let's create a baseline model. We'll see later if it is necessary to improve it.


```{r}
set.seed(1234)
dindex <- createDataPartition(dat$Outcome, p=0.7, list=FALSE)
train_data <- dat_original[dindex,]
test_data <- dat_original[-dindex,]
```

We are going to impute the missing data in the training and testing sets separately.

```{r}
mice_train_mod <- mice(train_data[, features_miss], method='rf', seed=1234, printFlag = FALSE)
mice_test_mod <- mice(test_data[, features_miss], method='rf', seed=1234, printFlag = FALSE)
```

```{r}
train_data[, features_miss] <- complete(mice_train_mod)
test_data[, features_miss] <- complete(mice_test_mod)
```


The training set contains both possible cases and it is unbalanced.
```{r}
table(train_data$Outcome)
```

Let's try a logistic regression model with all the features:

```{r}
fitControl <- trainControl(method = "cv",
                           number = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
```

```{r}
model_glm <- train(Outcome~.,
                   train_data,
                   method="glm",
                   metric="ROC",
                   tuneLength=10,
                   preProcess = c('center', 'scale'),
                   trControl=fitControl)
```

```{r}
pred_glm <- predict(model_glm, test_data)
cm_glm <- confusionMatrix(pred_glm, test_data$Outcome, positive="X1")
cm_glm

```

```{r}
pred_prob_glm <- predict(model_glm, test_data, type="prob")
roc_glm <- roc(test_data$Outcome, pred_prob_glm$X1)
colAUC(pred_prob_glm$X1, test_data$Outcome, plotROC = TRUE)
```

We can see the result of this baseline model:

- The accuracy is not bad (`r cm_glm$overall['Accuracy']`), but it is not the best metric in this case.
- The AUC has a value of `r auc(roc_glm)`
- The F1 score is `r cm_glm$byClass['F1']`
- The recall (sensitivity) is quite poor: `r cm_glm$byClass['Sensitivity']`

Next things to consider in order to build a better model than this baseline one:

- We have to think about the features to include in the model, because some are highly correlated (we can try PCA,…
- We have to work with the unbalanced problem (oversampling, synthetic cases,...)
- We can try different machine learning models

## Improving baseline model

We can decide to build an explanatory model or a highly predictive model.
We will try both cases:

### Feature importance analysis

We are going to use the Boruta technique to find the most relevant features.
Let's take a look to see if there are unimportant variables with the Boruta technique:

```{r}
library(Boruta)
boruta_results <- Boruta(Outcome~., train_data)
boruta_results
plot(boruta_results)
```

All the variables are important except BloodPressure. Glucose is the most important one.

If we look at the correlation matrix between variables, we can see some correlation, but all values are below 0.75, so that's coherent with Boruta, and
it looks like we can't get rid of any feature:
```{r}
findCorrelation(correlat, cutoff=0.75)
```


We are going to use a different approach. We are going to recursively explore which is the best feature set for a linear model:

```{r}
caretFuncs$summary <- twoClassSummary
rfe_ctl <- rfeControl(functions=caretFuncs,
                      method = "cv",
                      number = 10,
                      returnResamp="final",
                      verbose = FALSE)
```

```{r}
rfe_glm <- rfe(train_data[ , setdiff(names(dat), 'Outcome')],
               train_data$Outcome,
               sizes=c(1:8),
               rfeControl=rfe_ctl,
               method="glm",
               metric = "ROC",
               preProcess = c('center', 'scale'),
               trControl = fitControl)
```

It looks like one of the features (SkinThickness) can be omitted:
```{r}
rfe_glm
predictors(rfe_glm)
```

```{r}
plot(rfe_glm, type=c("g", "o"))
```

From this approach, it looks like all features are needed.

We are going to try explanatory models: logistic regression and classification trees.

### Explanatory models {.tabset}

#### Logistic Regression with regularization

We used a logistic regression as the baseline model. Its summary shows:
```{r}
summ_model_glm <- summary(model_glm$finalModel)
summ_model_glm
```

The model shows which are the most relevant features:

```{r}
coef_glm <- summ_model_glm$coefficients %>%
  as.data.frame() %>%
  mutate(Feature=rownames(summ_model_glm$coefficients)) %>%
  filter(Feature != "(Intercept)")


```

```{r}
coef_glm %>% filter(`Pr(>|z|)` < 0.05) %>% arrange(`Pr(>|z|)`)
```

The most relevant feature related to diabetes is Glucose, followed by BMI, Pregnancies and DiabetesPedigreeFunction.


It looks like one of the features (SkinThickness) can be omitted:
```{r}
rfe_glm
predictors(rfe_glm)
```



```{r}
model_glmnet <- train(Outcome~.,
                      train_data,
                      method="glmnet",
                      metric="ROC",
                      tuneLength=20,
                      preProcess = c('center', 'scale'),
                      trControl=fitControl)
```

```{r}
pred_glmnet <- predict(model_glmnet, test_data)
cm_glmnet <- confusionMatrix(pred_glmnet, test_data$Outcome, positive="X1")
cm_glmnet
```

```{r}
pred_prob_glmnet <- predict(model_glmnet, test_data, type="prob")
roc_glmnet <- roc(test_data$Outcome, pred_prob_glmnet$X1)
```

The results are not better than those of the baseline model (only specificity improves slightly):
```{r}
cm_glmnet$byClass
```

#### Classification trees

```{r}
model_rpart <- train(Outcome~.,
                     train_data,
                     method="rpart",
                     metric="ROC",
                     tuneLength=20,
                     trControl=fitControl)
```

```{r}
pred_rpart <- predict(model_rpart, test_data)
cm_rpart <- confusionMatrix(pred_rpart, test_data$Outcome, positive="X1")
cm_rpart
```

```{r}
pred_prob_rpart <- predict(model_rpart, test_data, type="prob")
roc_rpart <- roc(test_data$Outcome, pred_prob_rpart$X1)
```

This is the best model at the moment:
```{r}
cm_rpart$byClass
```

```{r}
rpart.plot(model_rpart$finalModel, type = 2, fallen.leaves = T, extra = 2)
```

This is coherent with the results of the logistic regression model. For people with high Gluco


390 ### Predictive model {.tabset}
391
392 #### Random Forest
393
394 ```{r}
395 model_rf <- train(Outcome~.,
396 train_data,
397 method="ranger",
398 metric="ROC",
399 tuneLength=20,
400 trControl=fitControl)
401 ```
402
403 ```{r}
404 pred_rf <- predict(model_rf, test_data)
405 cm_rf <- confusionMatrix(pred_rf, test_data$Outcome, positive="X1")
406 cm_rf
407 ```
408
409 ```{r}
410 pred_prob_rf <- predict(model_rf, test_data, type="prob")
411 roc_rf <- roc(test_data$Outcome, pred_prob_rf$X1)
412 ```
413
414
415 ```{r}
416 cm_rf$byClass
417 ```
418
419
#### XGBOOST

```{r}
xgb_grid_1 <- expand.grid(
  nrounds = 50,
  eta = c(0.03),
  max_depth = 1,
  gamma = 0,
  colsample_bytree = 0.6,
  min_child_weight = 1,
  subsample = 0.5
)
model_xgbTree <- train(Outcome~.,
                       train_data,
                       method="xgbTree",
                       metric="ROC",
                       tuneGrid=xgb_grid_1,
                       trControl=fitControl)
```
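
Note that `xgb_grid_1` holds a single value per parameter, so `expand.grid()` yields a one-row grid and `train()` evaluates only that one configuration instead of searching. The sketch below shows how a wider grid would be expanded. The candidate values are arbitrary assumptions chosen for illustration, not tuned recommendations, and only three parameters are shown for brevity; a grid passed to `train()` would still need the same columns as `xgb_grid_1`.

```r
# One value per parameter gives exactly one candidate configuration.
single_grid <- expand.grid(nrounds = 50, eta = 0.03, max_depth = 1)
print(nrow(single_grid))

# Several values per parameter give every combination (2 * 3 * 2 rows).
wider_grid <- expand.grid(
  nrounds   = c(50, 100),
  eta       = c(0.01, 0.03, 0.1),
  max_depth = c(1, 3)
)
print(nrow(wider_grid))
print(head(wider_grid))
```

Each extra value multiplies the number of models fitted per resample, so wider grids trade run time for a better chance of finding a good configuration.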

```{r}
pred_xgbTree <- predict(model_xgbTree, test_data)
cm_xgbTree <- confusionMatrix(pred_xgbTree, test_data$Outcome, positive="X1")
cm_xgbTree
```

```{r}
pred_prob_xgbTree <- predict(model_xgbTree, test_data, type="prob")
roc_xgbTree <- roc(test_data$Outcome, pred_prob_xgbTree$X1)
```

The result is not great:
```{r}
cm_xgbTree$byClass
```

#### KNN

```{r}
model_knn <- train(Outcome~.,
                   train_data,
                   method="knn",
                   metric="ROC",
                   tuneGrid = expand.grid(.k=c(3:10)),
                   trControl=fitControl)
```
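
To make the role of `k` concrete, here is a minimal base-R sketch of the k-nearest-neighbours vote for a single new observation. It is not how caret implements `method="knn"`; the training points are invented, and `knn_predict()` is a hypothetical helper. In practice the predictors would usually be scaled first so that no single variable dominates the distance.

```r
# Hypothetical helper: classify one new point by majority vote of its k nearest neighbours.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(d)[seq_len(k)]          # indices of the k closest rows
  votes <- table(train_y[nearest])         # class counts among the neighbours
  names(votes)[which.max(votes)]
}

# Toy training set: two well-separated groups (invented values).
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7), ncol = 2, byrow = TRUE)
train_y <- c("X0", "X0", "X0", "X1", "X1", "X1")

pred <- knn_predict(train_x, train_y, new_x = c(6.5, 6.5), k = 3)
print(pred)
```

Small values of `k` follow the training data closely, while larger values smooth the decision boundary, which is why the grid above tries a range of them.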

```{r}
pred_knn <- predict(model_knn, test_data)
cm_knn <- confusionMatrix(pred_knn, test_data$Outcome, positive="X1")
cm_knn
```

```{r}
pred_prob_knn <- predict(model_knn, test_data, type="prob")
roc_knn <- roc(test_data$Outcome, pred_prob_knn$X1)
```

```{r}
cm_knn$byClass
```

### Model comparison

We now compare these models on their resampling results from the training data:

```{r}
model_list <- list(GLM=model_glm, GLMNET=model_glmnet , RPART=model_rpart, RF=model_rf, XGBO
resamples <- resamples(model_list)
bwplot(resamples, metric="ROC")
```

This is the correlation between models. This info can be used if we decide to combine some m

```{r}
model_cor <- modelCor(resamples)
model_cor
corrplot(model_cor)
```
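
Conceptually, the matrix returned by `modelCor()` is a correlation of per-resample metric values between models. The base-R sketch below mimics that idea with invented per-fold ROC values for three models; the numbers are illustrative assumptions only and do not come from the fits in this report.

```r
# Invented per-fold ROC values (rows = resamples, columns = models).
fold_auc <- data.frame(
  GLM   = c(0.82, 0.85, 0.79, 0.88, 0.84),
  RPART = c(0.74, 0.80, 0.70, 0.79, 0.77),
  KNN   = c(0.78, 0.76, 0.80, 0.77, 0.79)
)

# Pairwise Pearson correlation between the models' fold-level scores.
model_cor_toy <- cor(fold_auc)
print(round(model_cor_toy, 2))
```

Models whose fold-level scores are only weakly correlated tend to make different mistakes, which is what makes them candidates for combining.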

Let's look at the models' results on the test data:

```{r}
# Sensitivity, F1 and AUC on the test set for each model
results_glm <- c(cm_glm$byClass['Sensitivity'], cm_glm$byClass['F1'], roc_glm$auc)
results_glmnet <- c(cm_glmnet$byClass['Sensitivity'], cm_glmnet$byClass['F1'], roc_glmnet$au
results_rpart <- c(cm_rpart$byClass['Sensitivity'], cm_rpart$byClass['F1'], roc_rpart$auc)
results_rf <- c(cm_rf$byClass['Sensitivity'], cm_rf$byClass['F1'], roc_rf$auc)
results_xgbTree <- c(cm_xgbTree$byClass['Sensitivity'], cm_xgbTree$byClass['F1'], roc_xgbTre
results_knn <- c(cm_knn$byClass['Sensitivity'], cm_knn$byClass['F1'], roc_knn$auc)

results <- data.frame(rbind(results_glm, results_glmnet, results_rpart, results_rf, results_
names(results) <- c("Sensitivity", "F1", "AUC")
results
```

Simple logistic regression looks to be the best model here: best sensitivity, F1 score


# Conclusion

We have developed some explanatory models (classification tree and logistic regression). They

# Next things to try

- Try different imputation techniques
- Split the dataset into training, validation and test sets in order to find optimal threshold
- Try different machine learning models
- Build a stacked model
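
As a rough illustration of the threshold idea in the list above, the following base-R sketch scans candidate cut-offs on a validation set and keeps the one with the best F1 score. The probabilities and labels are invented toy values, F1 is just one possible selection criterion, and `f1_at()` and `best_threshold()` are hypothetical helpers, not part of caret or pROC.

```r
# Toy validation set: predicted probability of diabetes and the true label (1 = positive).
val_prob  <- c(0.92, 0.81, 0.65, 0.55, 0.48, 0.42, 0.33, 0.22, 0.15, 0.05)
val_label <- c(1,    1,    1,    0,    1,    0,    1,    0,    0,    0)

# Hypothetical helper: F1 score when predicting positive for prob >= threshold.
f1_at <- function(threshold, prob, label) {
  pred <- as.integer(prob >= threshold)
  tp <- sum(pred == 1 & label == 1)
  fp <- sum(pred == 1 & label == 0)
  fn <- sum(pred == 0 & label == 1)
  if (tp == 0) return(0)        # avoid 0/0 when nothing is predicted correctly
  2 * tp / (2 * tp + fp + fn)
}

# Hypothetical helper: pick the candidate threshold with the highest F1.
best_threshold <- function(prob, label, candidates = seq(0.1, 0.9, by = 0.1)) {
  f1 <- sapply(candidates, f1_at, prob = prob, label = label)
  list(threshold = candidates[which.max(f1)], f1 = max(f1))
}

best <- best_threshold(val_prob, val_label)
print(best)
```

The chosen threshold would then be applied once to the held-out test set, so that the test data plays no part in selecting it.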

# Reference

- [Missing values](http://machinelearningmastery.com/how-to-handle-missing-values-in-machine
- [Imputation kernel inspiration](https://www.kaggle.com/hinchou/d/uciml/pima-indians-diabet
- [Feature selection](http://machinelearningmastery.com/feature-selection-with-the-caret-r-p


Data

Data Sources
Pima Indians Diabetes Database
  Pima Indians Diabetes Database Predict the onset of diabetes based on diagnostic mea
Last Updated: 2 years ago (Version 1)
 diabetes.csv 768 x 9
About this Dataset

Context
This dataset is originally from the National Institute of Diabetes and
Kidney Diseases. The objective of the dataset is to diagnostically pre
not a patient has diabetes, based on certain diagnostic measuremen
the dataset. Several constraints were placed on the selection of thes
from a larger database. In particular, all patients here are females at
old of Pima Indian heritage.

Content
The datasets consists of several medical predictor variables and one
Outcome . Predictor variables includes the number of pregnancies t
had, their BMI, insulin level, age, and so on.

Acknowledgements
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes,
Using the ADAP learning algorithm to forecast the onset of diabetes
Proceedings of the Symposium on Computer Applications and Medic
-265). IEEE Computer Society Press.

Inspiration
Can you build a machine learning model to accurately predict wheth
patients in the dataset have diabetes or not?

Run Info

Succeeded True Run Time 209.6 seconds

Exit Code 0 Queue Time 0 seconds

Docker Image Name kaggle/rstats(Dockerfile) Output Size 0

Timeout Exceeded False Used All Space False

Failure Message


Log

Time Line # Log Message


1.3s 1

processing file: script.Rmd




1.5s 7
Load `package:needs` in an interactive session to set auto-load flag

26.8s 53 Loading required package: ranger
33.4s 60 -------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dpl
library(plyr); library(dplyr)
33.4s 61 -------------------------------------------------------------------------

Attaching package: 'plyr'

The following object is masked from 'package:DMwR':

join

The following objects are masked from 'package:dplyr':

arrange, count, desc, failwith, id, mutate, rename, summarise,
summarize

88.0s 73 Loading required package: glmnet
88.2s 74 Loading required package: Matrix
Loaded glmnet 2.0-5

Attaching package: 'glmnet'

The following object is masked from 'package:pROC':

auc

112.8s 100 Loading required package: xgboost
209.2s 126
209.2s 127 ...
209.2s 128 Complete. Exited with code 0.


Comments (8)

Pranav Pandya • Posted on Version 8 • 2 years ago

Nice!

Luis Bronchal Kernel Author • Posted on Version 8 • 2 years ago

Thanks

Carlos Crosetti • Posted on Version 8 • 2 years ago

Luis, nice work. When looking at your analysis of missing values (for instance blood pressure cannot be z
same for BMI) you documented the remediation by invoking some function (kmi…) could you please ela
bit further?

Luis Bronchal Kernel Author • Posted on Version 8 • 2 years ago

You have to deal with the missing data. There are different approaches to do this. I have trie
KNN imputation (with the function knnImputation) to do a first analysis, but it is possible to
techniques (it is the first thing in the section 'Next things to try' of my report).
Here you can see an interesting intro about this subject:

http://r-statistics.co/Missing-Value-Treatment-With-R.html

Harshita S. Jain • Posted on Version 8 • 2 years ago

the unit of insulin 2 hour serum test is muU/ml. can anybody tell me the normal range in order to differe
towards the diabetic prone person. i am able to find insulin test data in mg/dl or mol/l. so i am unable to
kindly help me either in conversion of muU/ml to mg/dl or mol/l, or kindly tell me the range of muU/ml

Luis Bronchal Kernel Author • Posted on Latest Version • 2 years ago

I am not an expert on this business domain, so I can't help you with that.

Xiao Liu • Posted on Latest Version • 2 years ago



Beautiful work! :)

liangwei93 • Posted on Latest Version • a year ago

When I use the kNN imputation,


dat[,-9] <- knnImputation(dat[,-9], k=5)

Error: Invalid column indexes: 310, 347, 25, 256, 52


I got this error, any idea why?
