
Give Me Some Credit

QTM Final Paper

Marshall Khakshouri,
Trevor Lomba,
Micah Nelson,
Kim Tran,
Delane Zahoruiko

Overview:
The purpose of this project was to create an algorithm or model that would assist banks when
determining whether or not they should grant clients loans. To construct our models we used data
provided by Kaggle. Our initial data included 250,000 observations and eleven unique variables. Our
observations consisted of historical credit data from 250,000 individual clients, recorded in the form of
the eleven variables. Our target would be to determine whether or not a client would become seriously
delinquent within two years. If a client is seriously delinquent they have defaulted on a loan. If it is
expected that a client will become seriously delinquent within two years, it is recommended that a bank
should not offer the client a loan. Before constructing our models we did a fair amount of data cleaning.
First, we removed any unwanted variables, which included X (a random number assigned to each
observation) and the row number for each observation. We then removed any observations that had
missing data points. Then we took a sample of approximately 5,000 observations. Next, we partitioned the
sample into training and test subsets using a 75/25 split. We were then ready to begin constructing
the models.
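A condensed sketch of this preprocessing is shown below; the file name and random seed are placeholders, and the full script appears in the Appendix.

big <- read.csv("cs-training.csv")                       # path is a placeholder
big$SeriousDlqin2yrs <- as.factor(big$SeriousDlqin2yrs)  # treat the target as a class label
big$X <- NULL                                            # drop the row-identifier column
big <- na.omit(big)                                      # remove observations with missing values

set.seed(1)                                              # assumed seed, for a reproducible sample
df <- big[sample(nrow(big), 5000), ]                     # sample of approximately 5,000 observations

trainingcases <- sample(nrow(df), round(nrow(df) * 0.75))  # 75/25 split
training <- df[trainingcases, ]
test     <- df[-trainingcases, ]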
When deciding which models to pursue we considered that our target was categorical and that our
predictors were primarily numeric. Due to the nature of our target and predictors we decided to pursue
General Logistic Models, Classification Trees, and k-Nearest-Neighbors Models. We would gauge the
success of our models based on both a traditional error rate and how profitable a bank would be if it
used one of our models. To determine profitability we researched both how much money a bank typically
loses on a loan if a client defaults and how much money it typically gains if a client does not default. We
found these figures to be roughly 48% and 98% of the loan amount, respectively. To simplify our
calculations we decided that any correctly predicted no default would result in a $100,000 profit to the
bank, whereas any incorrectly predicted no default (a client predicted not to default who then defaults)
would result in a $50,000 loss. Correctly and incorrectly predicted defaults would both result in no direct
monetary loss or gain; however, an incorrectly predicted default carries an opportunity cost of $100,000,
reflected by there being one less client in the correctly predicted no default category. We found
profitability to be a good measure of model success because maximizing profit is a common goal
amongst banks.
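As a concrete illustration of this profit measure, a small R helper (the function name is hypothetical; the per-loan figures are the simplified $100,000 and $50,000 assumptions above) computes profit from the counts in a confusion table:

# Estimated profit under our simplifying assumptions: each correctly predicted
# no-default earns $100,000 and each incorrectly predicted no-default loses $50,000.
profitEstimate <- function(correct_no_default, incorrect_no_default) {
  correct_no_default * 100000 - incorrect_no_default * 50000
}
profitEstimate(1163, 71)   # counts reported later for the GLM at a .60 cutoff: $112,750,000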
Data Exploration
The target SeriousDlqin2yrs is a binary variable that indicates whether or not someone
defaulted on their bank loan. There are 10 predictors in the data set:

Age is the age of the person whose data is being collected.

RevolvingUtilizationOfUnsecuredLines is the person's balance on lines of credit divided by the
sum of their credit limits.

DebtRatio is the sum of the payments the person needs to make every month divided by their
gross monthly income.

MonthlyIncome is the person's monthly income.

NumberOfOpenCreditLinesAndLoans and NumberRealEstateLoansOrLines together make up all of
the lines of credit and loans that the person has with their bank.

NumberOfDependents is the number of dependents in the person's family, such as their
children or spouse.

The last three variables, NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse,
and NumberOfTimes90DaysLate, capture those who have been late on their payments and the extent of
their lateness. This is split into three time periods: the person is 30-59 days past due but no worse,
60-89 days past due but no worse, or 90 days or more past due. Specifically in our GLM, we combined all
three of these variables because it allowed us to create a simpler model that was easier to handle (a
sketch of this combination follows below). In our other two models, k-NN and classification trees, we
kept these variables separate.
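A one-line sketch of that combination for the GLM, assuming "combined" means summing the three counts (the variable name matches the NumberofTimesLate predictor used in the Appendix):

# Assumed combination: sum the three lateness counts. read.csv converts the hyphens
# in the original column names into dots.
df$NumberofTimesLate <- df$NumberOfTime30.59DaysPastDueNotWorse +
  df$NumberOfTime60.89DaysPastDueNotWorse +
  df$NumberOfTimes90DaysLate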
In the following graphs, the dark blue and 1.0s indicate that someone has defaulted, as opposed to the
light blue and 0.0s, which indicate that they have not. The first pie chart, Target, shows that the
majority of people did not default on their loan; only 6% of the records actually defaulted. The stacked
bar chart labelled Age takes a closer look at the two categories of people by age group. The most
common age group in the data set is the 40s to 50s, and the graph shows that those in the younger age
categories are more likely to default, because the dark blue portions are proportionally larger on the
left of the graph than on the right.
The bar chart Income & RevUtil shows the distribution of average monthly income and average revolving
utilization of unsecured lines for each age group. Those who have not defaulted have slightly higher
incomes than their peers in the same age group who have defaulted, with the exception of the 90s age
group in the graph to the right, which has the highest average monthly income at over $8,000. From an
age perspective, these two graphs show income increasing as people in their teens and twenties start
their careers, peaking around their 50s, and declining thereafter, perhaps due to high rates of
retirement.
In the charts regarding revolving utilization, those who do not default on their loans use more of their
credit line as their income rises. However, among those who have defaulted, the younger ones have high
revolving utilization rates, possibly due to using credit irresponsibly or because they need money to
begin to settle down. Comparatively, those in their 40s to 60s have very low utilization rates; this age
range tends to have higher incomes, which makes it easier for them to pay off their balances.
The plot to the right depicts the average number of open credit lines and loans for each age group on
the top, and the average number of real estate loans and credit lines on the bottom. For those who have
not defaulted, the number of credit lines follows a pattern similar to their income: the more they make,
the more comfortable they are spending and the more credit cards they open. However, those that have
defaulted continue to open more credit lines. In terms of real estate loans or lines, there is a similar
trend for when both groups take on mortgages; however, those who have defaulted generally take longer to
repay their mortgages, since the older age groups among them still have higher average real estate
loans.

The three pie charts below indicate that the longer it takes someone to make their payments, the more
likely they are to default. Furthermore, if someone is late on a payment, it implies that they currently
do not have the money to repay their loan and thus are more likely to default.
In the three graphs labelled Lateness, the average number of times a person is late within each of the
time periods is broken down by age group. Most of the people who are late are young, twenty to thirty
years of age. Those who ended up defaulting had also been late in their respective time periods more
times on average than those who did not default. Essentially, those who have defaulted are late more
often than their counterparts who have not defaulted, as indicated by the large amount of dark blue.
Due to poor data creating extreme outliers, those who were 94 and 103 years of age were identified as
the ages with the highest debt ratios; their median debt ratios were recorded as 20,809 and 899.25,
respectively. This is highly unlikely: it would mean that someone is spending roughly 20,000 times more
than their income per month. Therefore, after removing these two data points, the plot SeriousDlqin2yrs
/ Age (bin) was made. It shows that those who did not default on their loan had debt ratios that rose
into their 40s and then decreased again. This is reasonable because this age group generally has the
most responsibilities and bills to pay: they will likely have a mortgage, children to take care of, and
bills (e.g., utilities). Meanwhile, those who defaulted show an increasing trend in debt ratio into
their sixties because they have not been able to repay their loans. This drops suddenly at sixty years
of age; given the limited number of variables we have access to, we cannot determine exactly why. Many
of the trends we identified visually were supported by our different models.
Model 1: Logistic Regression
We chose to explore the Generalized Linear Model (GLM) because our target variable is categorical and
our predictors are numerical; the GLM is designed to handle exactly these kinds of inputs and outputs,
in addition to categorical inputs, which our data set did not include. The GLM can be used either to
determine the class of a new observation (classification) or to define similarities between observations
of a shared class (profiling). Clearly, our objective called for classifying individual observations as
either seriously delinquent within two years or not. For these purposes we defined two broad steps our
model had to accomplish: estimating the probability that an observation would be seriously delinquent,
and making a prediction on whether or not it would be based on this probability.
Step one is made very simple by R, since many of the minor steps needed to move from the logit produced
by the GLM to probabilities are handled behind the scenes. To start, it is necessary to define within the
software the independent and dependent variables we intend to build the model with. We used stepwise
regression (via the step() function) to move from a model that used every predictor, its square, and its
square root toward a model with no predictors. R systematically removed the variable whose removal most
improved the Akaike Information Criterion (AIC) until removing any further variable would worsen the
model's overall AIC. This process produced a model that balanced effectiveness with simplicity. After
running a summary of the model, we evaluated the predictors' individual p-values. A p-value over .05 for
any predictor indicated that there was not a significant relationship between the dependent variable and
that independent variable, and so the independent variable was removed. The following table shows each
predictor used in the model, its beta, and its respective p-value before the insignificant variables were
removed.

Coefficients:
                                          Beta         P-Value
(Intercept)                               -2.520e+00   8.49e-05 ***
age                                       -1.847e-02   0.001755 **
DebtRatio                                 -7.533e-01   0.053061 .
MonthlyIncome                             -1.198e-04   0.122300
NumberOfOpenCreditLinesAndLoans            1.847e-01   0.040005 *
NumberOfDependents                         2.381e-01   3.37e-05 ***
sqrt(DebtRatio)                            2.584e+00   0.000802 ***
sqrt(MonthlyIncome)                        2.169e-02   0.113445
sqrt(NumberOfOpenCreditLinesAndLoans)     -1.255e+00   0.011451 *
NumberofTimesLate                         -8.831e-02   < 2e-16  ***
sqrt(NumberofTimesLate)                    1.614e+00   < 2e-16  ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interpretation of the beta values is as follows: for every one-unit increase in age, the log-odds of
an individual being seriously delinquent drop by 1.847e-02 (equivalently, the odds are multiplied by
exp(-0.01847), about 0.98). Given these betas we can see that the square roots of debt ratio, number of
open credit lines and loans, and number of times late have the most severe effect on these odds. The
number of times late and its square root proved to be the most significant amongst the selected
independent variables.
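As a worked example of this interpretation, a short sketch using the reported age coefficient:

beta_age <- -1.847e-02
exp(beta_age)       # ~0.982: each additional year of age multiplies the odds of delinquency by ~0.982
exp(10 * beta_age)  # ~0.831: ten additional years reduce the odds by roughly 17%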
When applied to the test set, this model provides the calculated log-odds of an individual being
seriously delinquent, which can be converted to corresponding probabilities using the predict() function
in R (with type = "response") or manually via probability = odds / (1 + odds), where odds = exp(log-odds).
Using R, we in turn get a table of probabilities for each observation, ranging in value from 0.000 (the
model is certain an individual will not default) to 1.000 (certain they will default). The employment of
cutoff values turns these probabilities into predictions of each observation's class.
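A minimal sketch of that conversion, assuming the downmodel and test objects from the Appendix; the manual formula is shown only to illustrate what type = "response" computes:

probs   <- predict(downmodel, test, type = "response")  # probabilities directly
logodds <- predict(downmodel, test, type = "link")      # the raw logit for each observation
manual  <- exp(logodds) / (1 + exp(logodds))            # identical values to probs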
Cutoff values define the probability at which the model classifies an observation into a certain class,
in this case seriously delinquent within two years. So, given a cutoff value of .5, every observation
with a probability of .500 or higher will be classified as TRUE. In R, we employed cutoff values by
defining a logical expression, predTF, that assigns TRUE to observations with probabilities above the
cutoff value and FALSE to observations with probabilities below it.
Cutoff values are highly subjective and can be altered to optimize a model in different ways. As stated
earlier, the goal of this model is to maximize profit, not necessarily minimize error, and so this
decision was made with profit as the sole determinant. To reiterate, we assigned a $100,000 net gain to
correctly predicting that an individual would not be seriously delinquent within two years (true
negative), and a $50,000 net loss to incorrectly predicting that an individual would not be seriously
delinquent within two years (false negative). There is no upfront cost associated with not distributing
a loan when our model predicted TRUE, though false positives did represent a $100,000 opportunity cost.
These criteria were used to decide which cutoff value would produce the best model for our purposes.

Using simple cross tables to compare our predictions with the cases from our test set, we obtained our
true negative and false negative counts, which can be used to calculate the net gains and losses for a
given cutoff value. For example, keeping the test and training data sets constant and assigning a cutoff
value of .50, our false negative count was 63 and our true negative count was 1,154, yielding an
estimated profit of 63(-$50,000) + 1,154($100,000), or $112.25 million. Raising our cutoff value to .60
brought us to 1,163 true negatives and 71 false negatives, a profit of $112.75 million. It turns out that
keeping this cutoff value of .60 yielded the greatest profit, given this model and these training and
test sets. The exhibit below and to the left shows how profit changed with our various cutoff values.
After repartitioning our data frame and running our code again, we found that this cutoff value
consistently maximized profit. On the right is a cross table showing how well our model classified the
test set with a cutoff value of .60, and how each classification affected our profit.
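A sketch of how such a cutoff sweep can be coded, reusing the objects from the Appendix (pred holds the predicted probabilities from the stepwise model; the dollar figures are our simplifying assumptions; the grid of cutoffs is illustrative):

cutoffs <- seq(0.40, 0.80, by = 0.05)                   # illustrative grid of cutoff values
profit <- sapply(cutoffs, function(cut) {
  predTF <- (pred > cut)                                # TRUE = predicted seriously delinquent
  tn <- sum(!predTF & test$SeriousDlqin2yrs == "0")     # correctly predicted no default
  fn <- sum(!predTF & test$SeriousDlqin2yrs == "1")     # incorrectly predicted no default
  tn * 100000 - fn * 50000
})
cutoffs[which.max(profit)]                              # cutoff that maximizes estimated profit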

Maximizing profit involved classifying the test set with high specificity (true negative rate) and lower
sensitivity (true positive rate). On this particular run with a cutoff value of .60, our classification
rate was ~.936; specificity was 1163/1167 = ~.997, a near perfect rate, but sensitivity was 10/81 = ~.123,
which is clearly a much less appealing score. This is because, given our assumptions, the value of
classifying a true negative far outweighs the value of identifying a true positive. If we were to minimize
our error rate instead of maximizing profit, we would choose a cutoff value closer to .45 and receive a
marginally lower specificity (~.987) but a greater sensitivity and classification rate (~.272 and ~.941,
respectively). Most relevant for our purposes, though, the estimated profit would be $500,000 less, and
so this version was deemed inferior despite its lower error rate.

Model 2: Classification Trees

To begin, we ran the code for classification trees without including stopping rules. From this came the
tree pictured below and to the left. We then added stopping rules and got the tree pictured below and to
the right. There is a clear difference in what the two trees include. Upon increasing the minsplit and
minbucket percentages, the tree became smaller, with fewer splits and variables. Contrastingly, when we
lowered the percentages, the tree grew, increasing the number of splits and variables. This is one of the
problems with using classification trees: there is some uncertainty in how R will split the tree.
Therefore, it is useful to change the minbucket and minsplit values to see the differences in each
outcome. Nonetheless, both trees we created are useful and unveil different aspects of the dataset.

For both versions of the code, with and without stopping rules, we created an unpruned and a pruned tree
to uncover any differences. In both scenarios, the trees remained almost identical. The error rates for
the pruned and unpruned trees were also extremely similar, averaging out around .06, while the benchmark
was approximately .07. Both qualities of the pruned and unpruned trees are attributed to the dataset
being relatively small after the initial cleaning. From the beginning, we removed all unnecessary
variables, thus shrinking the data set. Furthermore, the data set was quite small to begin with, having
fewer variables than many other groups' data sets.
By analyzing the tree that incorporated stopping rules, we can learn a lot about how banks determine
whether or not clients are likely to default on a loan. In the first split, the predictor is the number
of times a person was ninety days late on a payment. Of the entire population, 93% of people are never
ninety days late; of those people, 95% do not default on their loan while 5% do. This is an interesting
point: it means that some people who are never ninety days late, and appear to be keeping up with all of
their payments, still end up defaulting. The other branch of the first split contains the 7% of the
population who are ninety days late on a payment at least once. The bank's uncertainty about default
visibly increases with the node's lighter color; of these people, 56% will not default, and 44% will
default on their loan.
The second split is again on the number of times a person was ninety days late on a payment, this time
asking whether it happened more than twice. It splits off the 44% of people in this node who were ninety
days late on a payment more than two times. It is apparent from the blue color of the next node that the
bank predicts these people will default on their loan: of them, 37% do not default, while 63% do. The
next split uses the predictor debt ratio. 63% of these people have a debt ratio greater than 10%, and
37% have a debt ratio lower than 10%. If a client has a debt ratio greater than 10%, the bank predicts
they will default on their loan; however, if a client has a debt ratio less than 10%, the bank predicts
they will not default. Those whom the bank predicts will default based on debt ratio make up only 1% of
the total population.
Going back to the other side of the second split, 56% of the people in that node were ninety days late on
payments two times or fewer. Of these people, 63% do not default on their loan and 37% do, and they
represent about 4% of the total population. This brings us to the next split: the number of times a
person has been past due sixty to eighty-nine days on a payment. If a person has been past due, the bank
predicts that person will default; 46% of these people do not default while 54% do. If a person has not
been past due, the bank predicts that person will not default; of these people, 69% do not default and
31% do. In the next split, we can see that the older a person is, the less likely the bank predicts they
are to default: if a person is under fifty-two years of age, 32% do not default while 62% do, whereas if
a person is older than fifty-two, only 33% default while 67% do not. In the final split, the predictor is
monthly income. The basic takeaway here is that the more money a person makes, the less likely the bank
is to predict they will default on their loan. Of the people who make less than $1,811 per month, 29% do
not default on the loan and 71% do; of the people who make more than $1,811 per month, 67% do not default
and 33% do.
We also learned a lot about the bank's predictions from the tree that did not have stopping rules
incorporated in the code. The beginning of this tree is similar to the tree that did use stopping rules.
In the first split, the predictor is again the number of times a person was ninety days late on a
payment. Of the entire population, 93% of people are never ninety days late; of those people, 95% do not
default on their loan while 5% do. The other branch of the first split contains the 7% of the population
who are ninety days late on a payment at least once. Once again, the bank's uncertainty about default
increases with the node's lighter color; of these people, 56% will not default, and 44% will default on
their loan.
The second split is on whether a person was ninety days late on a payment more than once. It splits off
the 44% of people in this node who were ninety days late more than once. It is apparent from the blue
color of the next node that the bank predicts these people will default on their loan: of them, 38% do
not default, while 62% do.
Going back to the other side of the second split, we can note that of the people who were ninety days
late on a payment only once, 67% do not default on their loans and 33% do. Revolving utilization of
unsecured lines is the next predictor. This variable has a lot to do with credit score and is calculated
by dividing a person's credit card balance by the credit limit on that account. The lower an individual's
revolving utilization of unsecured lines, the less risky they are as a client. More specifically, for
those with a utilization rate less than 54% the bank predicts no default; 81% of these people do not
default and 19% do. Those with a utilization rate greater than 54% are predicted to default on their
loan; 59% do not default and 41% actually do.
The final split involves the combined number of credit cards and loans a person has. The bank predicts
that a person will default if this number is greater than four: of these people, 41% do not default and
59% do. If the combined number of credit cards and loans for an individual is less than four, the bank
predicts they will not default: of these people, 76% do not default and 24% do.
An ideal, low-risk client for the bank has the following characteristics: over the age of fifty-two, an
annual income of at least $21,732, a revolving utilization of unsecured lines less than 54%, and a total
of no more than four credit cards and loans. The older a person is, the less likely the bank thinks they
will default on a loan; therefore, banks generally view younger individuals as riskier clients. Banks
also take into account the monthly income of clients. If a person makes less than $1,811 per month, or
less than $21,732 per year, the bank predicts that they are more likely to default on a loan. As income
increases, client risk decreases. Revolving utilization of unsecured lines should be no greater than .54,
or 54%. Revolving utilization has a lot to do with credit scores; from it, we learn to avoid carrying a
balance greater than 54% of our credit card limit. Banks perceive individuals with a high revolving
utilization of unsecured lines as riskier clients who are more likely to default on loans. Having more
debt, in terms of the number of credit cards and loans, also means a person will be viewed as a riskier
client. Arguably the most important predictor is the number of times a person has been ninety days late
on a payment; both trees use this variable for their first split. It is the first thing the bank looks at
when predicting whether a person will default on a loan.
After analyzing the trees, we calculated the bank's profitability using the classification tree. To do
this we made a cross table, pictured below. From this cross table, we took the profit from the correctly
predicted no-defaults and subtracted the loss from the incorrectly predicted no-defaults (seen in the
table below), resulting in a total profit of $112,500,000, or $112.5 million.
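Using the counts reported for the tree in the model-comparison table below (1,157 correctly predicted no-defaults and 64 incorrectly predicted no-defaults), the arithmetic is:

1157 * 100000 - 64 * 50000   # 115,700,000 - 3,200,000 = 112,500,000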


Model 3: K Nearest Neighbors


For the final model we chose to use K-Nearest Neighbors (k-NN). k-NN was employed in our project because
we had a categorical target and numerical predictors. We took several steps to ensure we created a
proficient k-NN model (a brief sketch of this workflow follows the list):
1.) First, we picked a number for k, which represents the predefined number of data points that each
observation will be compared to when we attempt to categorize it. If k = 3, we compare each data point to
the three other points closest to it and place the observation in whichever category holds the majority
of those neighbors. You do not want k to be too small, because randomness in the training set can then
have a large effect on predictions, since each data point is compared to only a few neighbors; you also
do not want k to be too large, because the most common class in the training set will then dominate the
predictions.
2.) After writing all the necessary code and producing your cross table, you evaluate your model. You do
this based on the number of false positives and false negatives relative to the number of predictions.
3.) After evaluating your model, you may see that your benchmark error rate is lower than your error
rate. No fear, this can happen and is not a rare occurrence. When dealing with distance, it is very
important to make sure that no one variable dominates your predictions. For example, a variable like age
may have a range of 1-100, whereas a variable like number of dependents may have a range of 1-10.
4.) The solution to this problem is either normalizing or standardizing your data. After you do this, you
should pick the best value of k based on the error rates (the lower the error rate, the better).
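A condensed sketch of that workflow under our setup, assuming the normalized data frame and the class::knn() function used in the Appendix; the candidate values of k are illustrative:

library(class)     # knn()
library(gmodels)   # CrossTable()

for (k in c(1, 3, 5, 7)) {                               # illustrative candidate values of k
  pred <- knn(training, test, cl = training$SeriousDlqin2yrs, k = k)
  err  <- sum(pred != test$SeriousDlqin2yrs) / nrow(test)
  cat("k =", k, "error rate =", round(err, 4), "\n")
}
CrossTable(pred, test$SeriousDlqin2yrs,
           prop.r = FALSE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)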
The results of the initial runs of our k-NN model were very promising. The cross table in the exhibit
below shows the model perfectly predicted when someone would not default on their loan and was also very
effective in predicting when someone would default. To that point, our false negative rate here was
0/85 = .000, and our false positive rate was only 1/1162 = ~.001. After the code was run multiple times
we found this model yielded very similar results consistently.


A key takeaway from the k-NN model in relation to our data is that normalizing the data helped more than
standardizing it. To normalize the data, the minimum value of each independent variable was subtracted
from each of its values, and the result was divided by the difference between the variable's maximum and
minimum values. This is called feature scaling and is represented by the equation (x - xmin) / (xmax -
xmin). It brings all variable values between 0 and 1, which allows the k-NN model to establish distances
between data points without regard to scale. For example, if the predictors were left unnormalized,
variables such as income would have dominated the grouping algorithm over predictors with a smaller
range. The benefit of normalization is putting the ranges on a comparable footing, and this worked out
perfectly for our data. You may ask how an error rate of .0008 is possible; normalization of our data is
what made it possible.
After looking at the error rates for different values of k, we concluded that k = 3 yielded the lowest
error rate at ~.001. Our benchmark error rate was much higher, at ~.068. Upon comparison of these error
rates, we determined that the k-NN model is effective.

The above scatter plot exemplifies how the k-NN model was able to identify significant relationships
between the target and certain predictors. Here, the distribution of seriously delinquent observations
(red) and non-seriously delinquent observations (blue) indicates that while age may not clearly indicate
a particular class, the total number of times late does. Very few individuals who were late three or
fewer times turned out to be seriously delinquent within two years, but as the total number of times late
increased, the likelihood of these individuals being seriously delinquent within two years also
increased. Thus, it is simple for k-NN to group observations into categories based on these predictors.
Lastly, in terms of profit, as mentioned in the first section, the profit for correctly predicting that
someone will not default on his or her loan is $100,000, and the cost of falsely predicting that someone
will not default is $50,000. In this case, we predicted everything correctly except for predicting that
one person would default when he or she did not. We did not make a single false prediction that someone
would not default, so the profit the bank receives is 1,162 x $100,000 = $116,200,000.

Model Comparison
As previously mentioned, we gauged the success of our models by looking at both traditional
error rates and profitability. The error rates and profitability for each model can be easily compared in the
chart below.

Model                  Profit          Correctly predicted      Incorrectly predicted     Error Rate
                                       no default (+$100k)      no default (-$50k)
GLM                    $112,750,000    1163                     71                        0.060
Classification Tree    $112,500,000    1157                     64                        0.064
k-NN                   $116,200,000    1162                     0                         0.0008

When comparing our error rates and profitability figures, k-NN is the clear winner: it achieved both the
lowest error rate and the highest profitability, by substantial margins. It is also interesting to note
that models with lower error rates consistently achieved higher profits.

Conclusion


We would recommend that banks use our k-NN model when trying to determine whether or not clients should
be granted loans. We believe that our k-NN model can accurately predict whether or not a client will
default, and thus it should result in high profitability for banks. That being said, we also believe
there is room for improvement in our models. Our models only allow two possible outcomes: grant a client
a loan or do not grant a client a loan. As different clients carry varying degrees of risk, it may be
prudent for the model outcomes to reflect that variation, for example by charging higher-risk clients
higher interest rates on their loans rather than rejecting them outright. This would allow banks to
accept more clients and ultimately to be more profitable.


Appendix
Data Preprocessing:
library(rpart)
library(rpart.plot)
library(rattle)
library(class)
library(gmodels)
source("C:/Users/mnelson2/Dropbox/OurQTM2000/QTM2000.R")
source("C:/Users/ttravassoslomba1/Dropbox/OurQTM2000/QTM2000.R")
source("C:/Users/ktran2/Dropbox/OurQTM2000/QTM2000.R")
source("C:/Users/mkhakshouri1/Dropbox/OurQTM2000/QTM2000.R")
source("C:/Users/nkarst/Dropbox/OurQTM2000/QTM2000.R")
source("C:/Users/dzahoruiko1/Dropbox/OurQTM2000/QTM2000.R")
big <- read.csv("C:/Users/ttravassoslomba1/Dropbox/QTM2000Group Project/cs-training.csv")
big <- read.csv("C:/Users/ktran2/Dropbox/QTM2000Group Project/cs-training.csv")
big <- read.csv("C:/Users/mnelson2/Dropbox/QTM2000Group Project/cs-training.csv")
big <- read.csv("C:/Users/mkhakshouri1/Dropbox/QTM2000Group Project/cs-training.csv")
big <- read.csv("C:/Users/dzahoruiko1/Dropbox/QTM2000Group Project/cs-training.csv")
big <- read.csv("C:/Users/nkarst/Dropbox/QTM2000Group Project/cs-training.csv")
#^^ No need to rewrite this script so it knows where to look on your computer! At least one of the
#   read.csv calls above will find the correct file and define 'big' as such
#^^ The lines that don't 'stick' will just be dismissed by R and won't affect the rest of the script at all
big$SeriousDlqin2yrs = as.factor(big$SeriousDlqin2yrs)
big = na.omit(big) #remove the objects with missing values
big$X = NULL
Nbig = nrow(big) #define N as the number of rows in dataset big, ~120,000
data.frame.size = round(Nbig*0.041508) #define data.frame.size as approximately 5,000 observations (~4% of Nbig)
data.frame= sample(Nbig,data.frame.size) #takes a sample of dfsize from the elements of Nbig
#View(data.frame)
df = big[data.frame,] #creates a data frame df using all remaining columns from big and only the rows sampled into data.frame
write.table(df, file = "C:/Users/ttravassoslomba1/Dropbox/QTM2000Group Project/df.csv", row.names=FALSE, sep=",")
#ABOVE HERE IS TAKING OUR 120,000 OBJECT DATASET, REMOVING NULLS, AND BRINGING SIZE DOWN TO ABOUT 5,000 OBJECTS
#FROM HERE ON OUT, THE DATA FRAME TO WORK WITH IS 'df'
Model 1:
N = nrow(df)
trainingsize = round(N*.75)
trainingcases = sample(N, trainingsize)
training = df[trainingcases,]
test = df[-trainingcases,]
# The three lateness counts are combined into one predictor for the GLM, as described in the paper
# (assumed here to be their sum):
training$NumberofTimesLate = training$NumberOfTime30.59DaysPastDueNotWorse +
  training$NumberOfTime60.89DaysPastDueNotWorse + training$NumberOfTimes90DaysLate
test$NumberofTimesLate = test$NumberOfTime30.59DaysPastDueNotWorse +
  test$NumberOfTime60.89DaysPastDueNotWorse + test$NumberOfTimes90DaysLate
# note: in an R formula, squared terms must be written I(x^2); written as x^2 they reduce to x
hugemodel = glm(SeriousDlqin2yrs ~ RevolvingUtilizationOfUnsecuredLines + age + DebtRatio +
  MonthlyIncome + NumberOfOpenCreditLinesAndLoans + NumberRealEstateLoansOrLines +
  NumberOfDependents + RevolvingUtilizationOfUnsecuredLines^2 + age^2 + DebtRatio^2 +
  MonthlyIncome^2 + NumberOfOpenCreditLinesAndLoans^2 + NumberRealEstateLoansOrLines^2 +
  NumberOfDependents^2 + sqrt(RevolvingUtilizationOfUnsecuredLines) + sqrt(age) + sqrt(DebtRatio) +
  sqrt(MonthlyIncome) + sqrt(NumberOfOpenCreditLinesAndLoans) + sqrt(NumberRealEstateLoansOrLines) +
  sqrt(NumberOfDependents) + NumberofTimesLate + sqrt(NumberofTimesLate) + NumberofTimesLate^2,
  data = training, family = binomial(logit))
tinymodel = glm(SeriousDlqin2yrs ~ 1, data = training, family = binomial(logit))
downmodel = step(hugemodel, scope=list("lower" = tinymodel), direction = "backward")
summary(downmodel) #the coefficient table reported in the paper comes from the reduced model
pred= predict(downmodel, test, type = "response")
pred
predTF = (pred > .60) #cutoff of .60 chosen to maximize profit, as described in the Model 1 section
predTF
summary(predTF)
CrossTable(predTF,test$SeriousDlqin2yrs, expected=F, prop.r=F, prop.c=F, prop.t=F, prop.chisq=F)
observations = (test$SeriousDlqin2yrs == "1") #compare factor levels explicitly; as.logical() on a factor returns NA
error_rate = sum(predTF!= observations)/nrow(test)
error_rate
Model 2: Classification Trees
N = nrow(df)
trainingsize = round(N*.75)
trainingcases = sample(N, trainingsize)
training = df[trainingcases,]
test = df[-trainingcases,]
stoppingRules = rpart.control(minsplit=round(nrow(training)*0.02),
minbucket=round(nrow(training)*0.01))
model = rpart(SeriousDlqin2yrs ~ ., data=training, control=stoppingRules)
fancyRpartPlot(model,main="Default Tree")
pred = predict(model, test, type="class")
testerrorRate = sum(pred != test$SeriousDlqin2yrs)/nrow(test)
print(testerrorRate)
pruned=easyPrune(model)
fancyRpartPlot(pruned, main="Pruned Tree")
predpruned = predict(pruned, test, type="class")
prunederrorRate = sum(predpruned != test$SeriousDlqin2yrs)/nrow(test)
print(prunederrorRate)
benchmarkErrorRate(training$SeriousDlqin2yrs, test$SeriousDlqin2yrs)
CrossTable(pred,test$SeriousDlqin2yrs, expected=F, prop.r=F, prop.c=F, prop.t=F, prop.chisq=F)
Model 3:
N = nrow(df)

trainingsize = round(N*.75)
trainingcases = sample(N, trainingsize)
training = df[trainingcases,]
test = df[-trainingcases,]
plot(df$NumberOfDependents, df$age, type="p", col="red")
#Normalize
# SeriousDlqin2yrs was converted to a factor during preprocessing; convert it back to numeric
# so it can be rescaled along with the other columns
df$SeriousDlqin2yrs = as.numeric(as.character(df$SeriousDlqin2yrs))
df$SeriousDlqin2yrs = (df$SeriousDlqin2yrs-min(df$SeriousDlqin2yrs))/(max(df$SeriousDlqin2yrs)-min(df$SeriousDlqin2yrs))
df$age = (df$age-min(df$age))/(max(df$age)-min(df$age))
df$RevolvingUtilizationOfUnsecuredLines = (df$RevolvingUtilizationOfUnsecuredLines-min(df$RevolvingUtilizationOfUnsecuredLines))/(max(df$RevolvingUtilizationOfUnsecuredLines)-min(df$RevolvingUtilizationOfUnsecuredLines))
df$NumberOfTime30.59DaysPastDueNotWorse = (df$NumberOfTime30.59DaysPastDueNotWorse-min(df$NumberOfTime30.59DaysPastDueNotWorse))/(max(df$NumberOfTime30.59DaysPastDueNotWorse)-min(df$NumberOfTime30.59DaysPastDueNotWorse))
df$DebtRatio = (df$DebtRatio-min(df$DebtRatio))/(max(df$DebtRatio)-min(df$DebtRatio))
df$MonthlyIncome = (df$MonthlyIncome-min(df$MonthlyIncome))/(max(df$MonthlyIncome)-min(df$MonthlyIncome))
df$NumberOfOpenCreditLinesAndLoans = (df$NumberOfOpenCreditLinesAndLoans-min(df$NumberOfOpenCreditLinesAndLoans))/(max(df$NumberOfOpenCreditLinesAndLoans)-min(df$NumberOfOpenCreditLinesAndLoans))
df$NumberOfTimes90DaysLate = (df$NumberOfTimes90DaysLate-min(df$NumberOfTimes90DaysLate))/(max(df$NumberOfTimes90DaysLate)-min(df$NumberOfTimes90DaysLate))
df$NumberRealEstateLoansOrLines = (df$NumberRealEstateLoansOrLines-min(df$NumberRealEstateLoansOrLines))/(max(df$NumberRealEstateLoansOrLines)-min(df$NumberRealEstateLoansOrLines))
df$NumberOfTime60.89DaysPastDueNotWorse = (df$NumberOfTime60.89DaysPastDueNotWorse-min(df$NumberOfTime60.89DaysPastDueNotWorse))/(max(df$NumberOfTime60.89DaysPastDueNotWorse)-min(df$NumberOfTime60.89DaysPastDueNotWorse))
df$NumberOfDependents = (df$NumberOfDependents-min(df$NumberOfDependents))/(max(df$NumberOfDependents)-min(df$NumberOfDependents))

plot(df$NumberOfDependents, df$age, xlim= c(0,8), ylim= c(0,80), type="p", col="red")

N = nrow(df)
trainingsize = round(N*.75)
trainingcases = sample(N, trainingsize)
training = df[trainingcases,]
test = df[-trainingcases,]
str(df)
pred = knn(training, test, training$SeriousDlqin2yrs, k = 3)

# evaluate performance

CrossTable(pred,test$SeriousDlqin2yrs, expected=F, prop.r=F, prop.c=F, prop.t=F, prop.chisq=F)


error_rate = sum(pred != test$SeriousDlqin2yrs)/nrow(test)
benchmarkErrorRate(training$SeriousDlqin2yrs, test$SeriousDlqin2yrs)

