
Preston May

Yue Shen Gu
Edgardo Estrabo
Professor Duncan Temple Lang
Lending Club: Who Should We Lend To?
In recent years, the US economy has struggled, to say the least, making it difficult for people to make safe and profitable investments. The stock market, although trending upward in recent months, has been unpredictable overall, banks have been stingy with interest rates, and bonds are expected to lose value over time. So where should people invest their money so that it is safe and produces a respectable return?
In addition to this issue of where to put our money, the recent economic struggles have also put people in debt and kept them there. Private debt ratios have been at all-time highs in the past five years (Chart 1), and with rising interest rates, it has become extremely difficult for these debtors to pay off their debts completely. So, once someone is in debt, how can they expect to get out of it in a reasonable time frame?


CHART 1 From http://www.creditwritedowns.com/2012/10/us-household-debt-to-income-debt-servicing-cost-ratios.html. Debt-to-income ratio refers to private debt divided by household income; it peaked in 2008 and has almost doubled since 1982.
Lendingclub.com provides a solution to both of the problems presented above. Lending Club is a company that makes it possible for private investors to loan money to debtors at interest rates higher than banks would pay out to investors, but lower than banks or credit cards would charge debtors. A debtor requests a loan, providing their credit score, their monthly income, their reason for the loan (typically to pay off credit card debt; see CHART 2), and so on, and investors then decide whether or not to loan their money to that person. Lending Club has minimized the risk to the individual investor by allowing each investor to loan out as little as $25 to any given debtor, so that investors can create a diverse portfolio. This way, if a debtor defaults on their loan, the loss is split among several investors, not just one. But having a debtor default on a loan you funded would still set back your portfolio, so how would you decide whom to invest in?

CHART 2 From Lendingclub.com Statistics. Over 75% of loans taken out on
Lendingclub.com are for the purpose of paying off/consolidating preexisting debts.

Lending Club gives every potential investee a grade from A1 through G5, A1 being the safest and G5 being the riskiest. However, although A1 is the safest, it pays the lowest interest rate, 6.03%, while G5 pays over 25%, so there is an incentive to invest in the riskier loans. Looking at CHART 3, the proportion of bad loans for each grade is as we would expect: A1 is the best, and the grades gradually get worse all the way to G5.

CHART 3 Manually constructed in R using the data provided by Lending Club. For code, see Appendix CHART3.
So it appears that whatever Lending Club uses to classify each debtor is a successful classifier, since the default rate gradually increases with each riskiness grade. However, we do not want to rely on these grades when deciding which loans to invest in, because the safer ones pay a smaller interest rate. So, we want to create our own classification model to help us not only invest in safe borrowers, but also maximize the returns on our investments.
Using the data provided by Lending Club on their website, we will attempt to classify each loan as a good loan or a bad loan, a good loan being one that was fully paid and a bad loan being one that was either not paid or paid late. Before we attempt to classify each loan as good or bad, let's first see whether good loans and bad loans have different mean values for some key variables.
The variables provided by Lending Club that I think should differ the most between good and bad loans are as follows. Monthly income and monthly payment are both important, but they don't necessarily tell us much individually, since someone could be taking out a very small loan but make next to nothing. So, we will combine them by creating a variable that represents the debtor's monthly income divided by their monthly payment. This value should be a rough representation of their ability to pay off their loan. Another variable of interest is their credit score. Lending Club provides us with a credit score range, so we will simply take the lower bound of this range for all of our observations. The number of years employed should also tell us something about their likelihood of paying off their loan, because someone who has held the same job for a long time should have a steady, consistent income. The number of open credit lines someone has could speak to their ability to actually get out of debt. However, I think the reciprocal of the number of open credit lines is more meaningful, since it gives less weight to each additional credit line as the count gets higher and higher (see the sketch below). One last variable that should be helpful is the debt-to-income ratio. So that we expect our good loans to have a higher value, we will subtract each debt-to-income ratio from 100.
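To see how the reciprocal downweights each additional credit line, here is a minimal illustration in R; the shrinking gaps between successive values show that going from 1 to 2 open lines moves the variable far more than going from 9 to 10:

#value of the credit-line variable for 1 through 10 open credit lines
round(1/(1:10), 3)
# 1.000 0.500 0.333 0.250 0.200 0.167 0.143 0.125 0.111 0.100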
Before we start using these variables, let's make sure that we don't have any strange observations. Some observations contained NAs in important variables such as length of employment and FICO score. To deal with these, we deleted the entire observation wherever these NAs were present. Since there are not too many NAs in our data set, this should not affect our data or our results significantly.
After deleting these NA observations, let's take a look at the densities of each of the variables that we will be using in our difference of means test (SEE CHART 4). The first five graphs are the densities of our variables, and they all seem to be distributed in a way that makes sense for our data except payment ratio. Payment ratio has extreme values to the right of the mean, so we created a plot that excludes most of these extreme values in the sixth graph. After investigating why some of our payment ratios are so high, it seems that these extreme values are due to people taking out loans of only $1,000. So, the fact that there are extreme values for this variable doesn't necessarily mean that they are incorrect. In fact, Lending Club checks each client to make sure they are telling the truth about themselves, so we should be able to trust that our observations are accurate.


Once we have created all of the above variables and checked their legitimacy, we will subset our data into two separate data frames: one containing all good loans, and one containing all bad loans. The code for all of this can be found in Appendix Code1. Using these subsets, we can run a difference of means Hotelling's T²-test with the null hypothesis that their true means are equal (See Appendix Test1). The means of each variable for both of our subsets are as follows:

means_bad
[1] 85.8060328 21.0530092 701.1652447 4.9735070 0.1413222
means_good
[1] 87.4459685 26.0582156 717.4825052 4.6609517 0.1366743

Without running any tests, it is pretty clear that our subsets' mean values are not equivalent. Although the last two, which represent years of employment and the reciprocal of the number of open credit lines respectively, appear to be fairly close in value, the other three, which represent 100 minus the debt-to-income ratio, monthly income divided by monthly payment, and credit score respectively, are not close at all.
After running our Hotelling's T²-test, we end up with an F* value of 909.784, while our critical F value is only 15.09436. Since our F* value is so much greater than our critical F value, we reject our null hypothesis and conclude that the true means of our variables differ between good and bad loans. This is not surprising, since we had already observed that our mean vectors did not appear to be very close to each other in value. So, now that we have determined that the two subsets of our data differ, we can begin creating a regression model to predict the probability that any given borrower will pay back their loan in full.
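For reference, the statistic computed in Appendix Test1 is the standard two-sample Hotelling's form. With p = 5 variables, mean vectors x̄_good and x̄_bad, pooled covariance matrix S_pooled, and group sizes n_good and n_bad:

F* = (x̄_good − x̄_bad)' [S_pooled/n_good + S_pooled/n_bad]^(−1) (x̄_good − x̄_bad)

which is compared against the critical value [p(n_good + n_bad − 2)/(n_good + n_bad − p − 1)] × F(0.99; p, n_good + n_bad − p − 1), exactly as in the appendix code.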

Logistic Regression: For code, see Appendix Logistic Regression
Logistic regression is a technique for fitting a regression surface to data in which the dependent variable has two outcomes. In this case, the dependent variable is whether a loan is good (the borrower pays back the loan fully) or bad (the borrower does not pay back the loan, or pays it back late). This model will describe the relationship between this binary response variable and several explanatory variables, most of which are continuous, though some are categorical. In the end, not all of the predictors will be included in the final model, for reasons that will be discussed later.
To create the initial model, the observations with an already determined status (labeled 0 for a bad loan, 1 for a good loan) were divided into a training set containing roughly 70% of these observations and a testing set containing the rest. The training set will be used to create the logistic model, while the test set will be used to test the accuracy of this model. 12 numeric variables and 2 categorical variables (Purpose of Loan and Home Ownership) were entered as predictors. Variable selection was primarily determined by using the stepAIC function from R's MASS package. This function selects the best subset of predictors by stepwise regression guided by Akaike's Information Criterion; generally, a smaller AIC value indicates that a model minimizes information loss better than one with a larger value. Running the initial model, we obtain the coefficients for the predictors below:


Coefficients in a logistic regression model indicate the expected change in the log odds of the response variable per one-unit increase in the predictor. Note that the odds are defined as the probability of success divided by the probability of failure; in this case, odds = P(Good Loan)/P(Bad Loan) or, equivalently, P(Good Loan)/[1 − P(Good Loan)]. We then simply take the natural log of these odds to get the log odds. This transformation is necessary in order to obtain the nice mathematical properties needed for the modeling. For instance, unlike probability, log odds are not bounded between the values 0 and 1. To actually calculate the coefficients, maximum likelihood estimation is used. Basically, MLE chooses the coefficients that maximize the log likelihood of the observed outcomes given the set of explanatory variables.
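Concretely, if y_i is the observed status (0 or 1) of loan i and p_i is its fitted probability of being a good loan, MLE maximizes the binomial log likelihood

l(alpha, beta) = sum over i of [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ]

over the intercept alpha and the coefficients beta.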
To interpret the coefficients, we first need to convert them from log odds back to plain odds. To do this, we simply exponentiate the coefficient of interest. For example, to interpret a dummy variable such as Loan Purpose: Debt Consolidation, we take the coefficient's exponential (exp(0.3681) = 1.44) and say that, holding everything else constant, people who borrow from Lending Club for the purpose of Debt Consolidation are 1.44 times as likely, or 44% more likely, to fully pay back their loan than those who borrow for some other purpose. For a numeric variable such as employment length, we can say that, holding everything else constant, we expect a 4% decrease (exp(−0.03714) = 0.96, and 0.96 − 1 = −0.04) in the odds that the borrower fully pays back the loan, per one-year increase in employment length. Because the predictors were not standardized before entering the model, the metric of the original values (e.g., 100 − Debt to Income Ratio being in decimals vs. Credit Grade (Lower Bound) being in the hundreds) should be taken into consideration before assessing the importance of these predictors.
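As a quick sketch of this conversion in R, assuming a fitted glm object named fit (a placeholder name; the appendix code fits several such models):

#express the fitted coefficients as odds ratios
odds.ratios <- exp(coef(fit))
#percent change in the odds per one-unit increase in each predictor
round(100 * (odds.ratios - 1), 1)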
P(y = 1) = e^(alpha + beta_1*x_1 + ... + beta_k*x_k) / [1 + e^(alpha + beta_1*x_1 + ... + beta_k*x_k)]
For easier interpretation, we can also transform the log odds given by the original model back into a probability using the equation above. Just to reiterate, P(y = 1) denotes the probability of a borrower fully paying back a loan. This probability is determined by the intercept (alpha) and the explanatory variables (beta coefficients) provided on the previous page. Using the predict() function in R, we obtain these probabilities, which are visualized by the histogram below.

A quick look at this histogram indicates that the majority of loans in the testing
set have greater than 0.5 probability of being paid in full. To check the accuracy of this
prediction, observations with predicted probabilities greater than .5 were assigned as
Good Loans, and observations with less than .5 probabilities were assigned as Bad
Loans. These were then compared to the actual status of these loans. A confusion matrix
displaying the accuracy of this classifier is displayed below:

Using a .5 probability cutoff, the accuracy of this classifier is 76.12%, which is good, but
not great. Adjusting the probability cutoff to .6 instead (such that Good Loans are those
with predicted probabilities > .6), we obtain the following confusion matrix:

which indicates that we obtain an overall less accurate classifier by increasing our probability cutoff. However, this may be a better classifier than the previous one if we care most about minimizing the rate at which bad loans are predicted to be good loans (and less about good loans being predicted to be bad loans).
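A small sketch of this tradeoff in R, assuming a vector prob of predicted probabilities and a factor actual with levels "Bad Loan" and "Good Loan" (names chosen here to match the appendix code, but placeholders in this sketch):

#accuracy and the bad-predicted-as-good rate at a given probability cutoff
cutoff.errors <- function(prob, actual, cutoff) {
  predicted <- factor(ifelse(prob < cutoff, "Bad Loan", "Good Loan"),
                      levels = levels(actual))
  tab <- table(actual, predicted)
  c(accuracy = sum(diag(tab))/sum(tab),
    #proportion of loans called "Good" that are actually bad
    bad.as.good = tab["Bad Loan", "Good Loan"]/sum(tab[, "Good Loan"]))
}
#compare the two cutoffs used above
sapply(c(.5, .6), function(k) cutoff.errors(prob, actual, k))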

For parsimony's sake, we will try to create another logistic model that uses fewer predictor variables. The first sets of predictor variables to be removed from the model are Loan Purpose (Car1, CreditCard1, etc.) and Home Ownership (with Rent being the only significant dummy variable). The reason for this is the unevenness of the cases in these variables. This unevenness may have caused the model to blow up, i.e., made the coefficient estimation inaccurate (refer to CHART 2 in part I to see the Loan Purpose pie chart; notice that over 75% of loans from Lending Club are for the purpose of debt consolidation). Other predictors that will be removed are Amount Funded By Investors and Public Records on File, mainly because their p-values were large compared to those of the other predictors. Having excluded these predictors from the model, we obtain the coefficients for the final model. A mosaic plot visualizing this model's accuracy is also displayed below. It can be seen that the model's accuracy did not decrease at all after removing a considerable number of predictors.


Blue: Correct Prediction. Red: Incorrect Prediction.









Now considering this classifier's practicality, we can infer from the mosaic plot that this classifier misclassifies bad loans as good loans at a fairly high rate. Because of this, there's a good chance that Lending Club lenders would lend money to someone whom they thought to be a good borrower but who is, in fact, a bad borrower. Still, this classifier predicts loan status accurately at greater than chance level, so it should still be of some use.
We will now predict whether the loans with undetermined status will turn out to be good loans or bad loans. We run another logistic model, now using the combined training and test data set. This model is then used to predict the loan status of observations labeled as Current or Performing Payment Plan. As indicated by the histogram and table below, the majority of these loans are predicted to be good loans. In the next section, we will attempt to increase the accuracy of this classifier by applying bootstrap aggregating, a.k.a. bagging, to this model.











Bootstrap Aggregation using Logistic Regression: For code, see Appendix Bagging:
Logistic Regression

Bootstrap aggregation attempts to increase the power of a predictive statistical
model by taking multiple random samples (with replacement) from a training data set.
These random (bootstrap) samples will then be used to create separate models, in this
case, logistic regression models, which will be used to create separate predictions for the
observations of a given test set. Finally, the average value of these predictions will be
used as the final prediction values.
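In outline, the procedure looks like the condensed sketch below; the full function we used is in Appendix Bagging (which, unlike this classic with-replacement bootstrap, draws smaller subsamples and includes a stepAIC step), and the data frame names here are placeholders:

#fit B logistic models on bootstrap samples and average their predictions
bag.predict <- function(training, testing, B = 100) {
  preds <- replicate(B, {
    rows <- sample(nrow(training), replace = TRUE)   #bootstrap sample
    fit <- glm(Status ~ ., family = binomial, data = training[rows, ])
    predict(fit, newdata = testing, type = "response")
  })
  rowMeans(preds)   #final prediction = average over the B models
}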
To determine whether bootstrap aggregation was an appropriate choice, we
simply compare the accuracy rate of the original logistic regression model with the
bootstrap aggregated version. The accuracy rate for the final logistic model from the
previous section was 76.29%. Using bootstrap aggregation, the accuracy rate stayed
roughly the same at 76.27%. The lack of improvement in accuracy could be due to the fact that the original model's predictors were already stable to begin with. Bootstrap aggregation tends to be more suitable for unstable models: the greater the variability in the predictions of the individual models, the greater the improvement seen in the final averaged predictions. In any case, we will simply keep the final model from the previous section and disregard this one.


Below are two plots generated using the predicted probability values given by the logistic model from the previous section. Looking back at CHART 3, we see that the default rate decreases as the letter grade gets better. This is consistent with the first chart below, which shows that the better the letter grade assigned by Lending Club, the higher the probability that the loan will be paid back in full. The next plot shows the same probability, but now in terms of the borrower's state. Looking at this plot, we can see that borrowers from Iowa, Washington DC, West Virginia, and Massachusetts are the most likely to pay back their loans in full. In contrast, borrowers from Wyoming, Arkansas, Vermont, and Hawaii are the least likely to pay back their loans properly.


Now that we have created a logistic model for our data, we will attempt to use algorithms that were actually designed to act as classifiers. There are several methods we could utilize. One is the k-nearest neighbors technique, which takes each observation and compares it to the observations it is most similar to. Essentially, this method asks, "which observations do you look like, and what were their classifications?" Since so many of our independent variables are categorical, it would be extremely difficult to implement this method, so we will attempt to classify our data in another way.
Two other methods that we will be running are random forests and boosting.

Random Forest: For code, see Appendix Classification Random Forest

Another method of classification is the random forest. A random forest creates an ensemble of tree models using bagging, randomly selecting a reduced number of candidate predictors at each split of each tree. In other words, a random forest creates a group of trees that are very different from each other. The forest then predicts the response value for new observations by letting the trees in the ensemble vote for the best outcome (a minimal sketch of this voting is shown below).
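As a minimal sketch of this voting with the randomForest package (the data frame names train and test are placeholders; column layout mirrors the appendix, where columns 1-12 are predictors and column 13 is the class):

library(randomForest)
#an ensemble of 20 trees, each considering 3 randomly chosen predictors per split
rf <- randomForest(x = train[, 1:12], y = train[, 13], ntree = 20, mtry = 3)
#fraction of the 20 trees voting for each class, per new observation
head(predict(rf, newdata = test[, 1:12], type = "vote"))
#the default prediction is the majority vote
head(predict(rf, newdata = test[, 1:12]))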
This widely used model works well with datasets containing both numerical and categorical variables. The package we used is randomForest. Our goal for the modeling is to minimize the ratio of loans that are actually bad among those predicted to be good, over all loans predicted to be good. The reason we are interested in this ratio is that it is the error rate we would face if we invested in every loan predicted as good (a small helper for computing it is sketched below).
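Given a confusion matrix with actual classes in the rows and predicted classes in the columns, this ratio is one line of R (the level names match the confusion matrices shown below):

#proportion of loans predicted "Good" that are actually "Bad"
bad.given.good <- function(tab) tab["Bad", "Good"] / sum(tab[, "Good"])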

We first used the cross-validation function in randomForest (rfcv) to determine the best number of candidate predictors at each split. The output is as follows:

#> rfloanscv$error.cv
#        12         6         3         1
# 0.1828936 0.1847596 0.1828851 0.1850041
Looking at the result from rfcv, the best number of candidate predictors at each split is 3, because it gives the smallest cross-validated error rate compared to the other values.

The next tuning parameter that we tried to optimize is the number of trees in the ensemble. Using a sapply() loop, we generated the following graph, which shows the value of the ratio on the training dataset for each additional 10 trees in the ensemble. We can see that the best number of trees obtained from this graph is 20.

Using 20 trees in the ensemble and 3 candidate predictors at each split, we built a random forest model. The following is the confusion matrix on the training dataset. The model predicts the training dataset nearly perfectly; the ratio value for the training dataset is 0.7%. An explanation for such a near-perfect prediction on the training dataset is that each observation is in the OOB dataset for only about 1/3 of the 20 trees (see the quick check below). With the other 2/3 of the trees having used the observation to build their models, there is an overwhelming number of trees that can predict that observation really well.

predict
actual Bad Good
Bad 2996 73
Good 3 10220
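The "about 1/3" figure is the usual bootstrap out-of-bag rate: the chance that one of n observations is left out of a bootstrap sample of size n is (1 − 1/n)^n, which approaches e^(−1) ≈ 0.368. A one-line check in R:

#probability an observation is out-of-bag, for a large sample size
(1 - 1/10000)^10000   #about 0.368, i.e., roughly 1/3 of the trees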

However, the predictions on the OOB dataset aren't as good as the ones for the training dataset. The ratio value for the OOB dataset is 20.5%, with a 77% error rate for observations that are actually "Bad" and a 10% error rate for observations that are actually "Good". The confusion matrix for the OOB dataset is as follows, with the rows representing actual values and the columns representing predicted values.

Bad Good class.error
Bad 695 2373 0.7734681
Good 1029 9192 0.1006751

Similar to the confusion matrix on the OOB dataset, the confusion matrix on the testing dataset gives a ratio value of 19.65%, with an 80% error rate for actual "Bad" loans and a 5% error rate for actual "Good" loans.

predict
actual Bad Good
Bad 231 944
Good 208 3859

The following table shows the importance of each predictor by its mean decrease in Gini. Based on this table, interest rate is the most important predictor and public records on file is the least important.

MeanDecreaseGini
Loan.Length 160.48590
Loan.Purpose 367.97988
Home.Ownership 120.17518
Revolving.CREDIT.Balance 645.74894
Delinquencies..Last.2.yrs. 84.41817
Public.Records.On.File 48.61846
DTI_inv 664.52132
payment_ratio 679.23169
lower_score 455.15278
length_of_employment 309.10431
inv_num_cards 417.22794
new.Interest.Rate 714.86348

Boosting: For code, see Appendix Classification Boosting

We also used a boosting approach to model the loan data. The package we used is ada. The first thing we did was optimize the number of trees to be included in the boosting ensemble. We used a sapply() loop similar to the one for the random forest to figure out the best number of trees. The following plot shows that 200 trees is the best value.

Using 200 trees in the ensemble, we built a boosting model and obtained the following confusion matrix on the training dataset. The ratio value is 21%, with an error rate of 84% for "Bad" loans and 3% for "Good" loans.

Final Prediction
True value Bad Good
Bad 487 2582
Good 322 9901

The confusion matrix on the testing dataset is as follows. The ratio value is 22%, with an error rate of 98% for "Bad" loans and 0.3% for "Good" loans.

predicted
actual Bad Good
Bad 24 1151
Good 12 4055

The following plot shows the importance of each variable. Interest rate and number of credit cards are the two most important variables, whereas loan purpose is the least important.


It is difficult to determine exactly which model is best, since there are different types of error. For our data, incorrectly predicting that a loan is good is much worse than predicting that a loan is bad when it is actually good. The reason is that when we predict a loan is good, we would invest in it, and if it turns out to be bad, we will not get our money back. However, if we mark a good loan as a bad loan, it simply means we are not investing in someone who would have paid us back in full. We will call the less harmful error type I error and the more harmful error type II error.
First, let's look at our regression models. Counting just the type II error, the error for our full logistic model was 22.355% when we used a .5 probability cutoff, but it decreased to 19.93% when we increased the probability cutoff to .6. We would expect our error to decrease as the cutoff increases, since we are then only classifying loans as good when their predicted probability of paying us back is 60% or higher, rather than only 50%. A 19.93% type II error is pretty good and would put us at the same risk level as if we had invested all of our money in people with a credit grade between B2 and B3 (See Chart 3).
Looking at our classification models, random forests produced a type II error of only 19.65%, while boosting produced a 22.11% type II error. Boosting was also undesirable because it classified only 36 loans as bad out of the 5000+ in our test data. So, out of all of our models, random forests was the best at minimizing type II error.

So, was this classification model any more useful than simply using the credit grades given to us by Lending Club? The type II error in our random forests model was 19.65%, which means that if we had invested in all of the "good" loans from our random forests model, the default rate would have been 19.65%. This default rate is slightly higher than that of the loans with a B2 grade and slightly lower than that of the loans with a B3 grade, so we would be investing with approximately the same risk as investing in everyone with a B2 or B3 credit grade. However, looking at the credit grades of our "good" loans, their median credit grade is B4. So, although we would be investing with the same risk as investing in everyone with a B2 or B3 credit grade, we would be receiving interest rates similar to having invested in all the B4s. In other words, we receive the same interest rate as someone who invested in all B4s, but fewer of our loans default.
This finding is interesting because it shows that our classification models are more useful than simply using Lending Club's credit grade system. Investing in all of our "good" loans lets you invest with a lower risk and receive a higher interest rate, which is exactly what we want when investing. Using this classification method rather than the given credit grades would yield an expected additional return on our investments of approximately 1.5% (B4 - (B2+B3)/2). This additional 1.5% translates to an additional $682.92 earned over five years on an initial investment of $5,000!
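To see where a dollar figure like this comes from, compound the two rates over five years; the exact amount depends on the base interest rate assumed, so the rate below is a placeholder rather than the actual B2-B4 average:

#extra dollars earned over five years on $5,000 at a rate 1.5 points higher
base <- 0.155   #assumed base annual rate (placeholder)
5000 * ((1 + base + 0.015)^5 - (1 + base)^5)   #approximately $685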












APPENDIX:
loans = read.csv("C:\\Users\\Owner\\Downloads\\LoanStats.csv")

CODE1
#create our new variables that we want to run a difference of means test on
#use regular expressions to extract only the numerical value from the debt-to-income percentage
DTI_num = as.numeric(gsub("%", "", loans$Debt.To.Income.Ratio))
#subtract the percentage from 100 so that the expected "good" debtor has a higher value
loans$DTI_inv = 100 - DTI_num
#divide each monthly income by the monthly payment
loans$payment_ratio = loans$Monthly.Income/loans$Monthly.PAYMENT
#use the substring function to get only the lower bound of the FICO score
loans$lower_score = as.integer(as.character(substr(loans$FICO.Range, 1, 3)))
#use regular expressions to isolate only the number in employment length
b <- gsub('(.*[[:digit:]]).*', '\\1', loans$Employment.Length)
b[b == '< 1'] <- 0
loans$length_of_employment = as.numeric(b)
#take the reciprocal of the number of open credit lines so that our predicted "good" debtor has a higher value
loans$inv_num_cards = 1/loans$Open.CREDIT.Lines

#delete all observations that have an NA response for any of our variables
new_loans = loans[!is.na(loans$length_of_employment), ]
#index on new_loans here, not loans, so the rows line up after the first deletion
new_loans = new_loans[!is.na(new_loans$lower_score), ]

#subset our data into good loans and bad loans
bad_loans = subset(new_loans, Status == "Charged Off" | Status == "Default" |
    Status == "Late (16-30 days)" | Status == "Late (31-120 days)")
good_loans = subset(new_loans, Status == "Fully Paid")

CHART3
#calculate the default rate for each credit grade by dividing the number of bad loans
#for each grade by the total number of completed loans for each grade
Default_Rate = 100*sort(summary(bad_loans$CREDIT.Grade) /
    (summary(good_loans$CREDIT.Grade) + summary(bad_loans$CREDIT.Grade)))
barplot(Default_Rate, xlab = "Letter Grade", ylab = "Default Percentage",
    main = "Default Rate By Letter Grade")

CHART4
#plot the distributions of all of our variables of interest
par(mfrow = c(2,3))
plot(density(new_loans$length_of_employment), main = "Density for Length of Employment")
plot(density(new_loans$inv_num_cards), main = "Density for Inverse # of cards")
plot(density(new_loans$payment_ratio[-which(new_loans$payment_ratio > 250)]),
    main = "Fixed Density for Payment Ratio")
plot(density(new_loans$payment_ratio), main = "Density for Payment Ratio")
plot(density(new_loans$DTI_inv), main = "Density for Debt to Income Ratio")
plot(density(new_loans$lower_score), main = "Density for FICO Score")

TEST1
#create a data matrix for both of our subsets
data_bad = matrix(c(bad_loans$DTI_inv, bad_loans$payment_ratio, bad_loans$lower_score,
    bad_loans$length_of_employment, bad_loans$inv_num_cards), ncol = 5)
data_good = matrix(c(good_loans$DTI_inv, good_loans$payment_ratio, good_loans$lower_score,
    good_loans$length_of_employment, good_loans$inv_num_cards), ncol = 5)

#create two numerical vectors of the means of each of our variables of interest for both subsets
means_bad = colMeans(data_bad)
means_good = colMeans(data_good)

#create the variance/covariance matrices for both of our subsets
#(note: these use the data matrices created above)
s_bad = cov(data_bad)
s_good = cov(data_good)

#calculate the number of observations in each of our subsets
n_bad = nrow(data_bad)
n_good = nrow(data_good)

#pool our variance/covariance matrices since our variances seem to be almost equivalent
s_pooled = (n_good * s_good)/(n_bad + n_good) + (n_bad * s_bad)/(n_bad + n_good)
s_diff = s_pooled/n_good + s_pooled/n_bad
s_diff_inv = solve(s_diff)

#calculate our F* value using a Hotelling's T-test
FStar = t(means_good - means_bad) %*% s_diff_inv %*% (means_good - means_bad)
p = 5
#calculate our critical F-value which we will compare to our calculated F* value
F = ((p*(n_good + n_bad - 2))/(n_good + n_bad - p - 1)) * qf(.99, p, n_good + n_bad - p - 1)


Logistic Regression
####################################
##variables for logistic regression#
####################################

##########response##################
#remove current and performing payment plan status
data <- subset(loans, Status != "Current" & Status != "Performing Payment Plan ")
table(data$Status)

#recode Status: Charged Off, Default, In Grace Period, Lates: 0. Fully Paid: 1
data$Status <- as.factor(data$Status)
data$Status <- factor(with(data, ifelse((Status != "Fully Paid"), 0, 1)))
table(data$Status)

#########predictors################
#years employed
data[40] <- as.numeric(factor(data[[40]],
    levels = c("< 1 year", "1 year", "2 years", "3 years", "4 years", "5 years",
               "6 years", "7 years", "8 years", "9 years", "10+ years", "n/a"),
    labels = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", NA)))
table(data[40])

#lower bound of the FICO range
#convert FICO range to numeric
a <- substr(as.character(data[[26]]), 1, nchar(as.character(data[[26]])) - 4)
data[[26]] <- as.numeric(a)
head(data[26])

#1/number of open credit lines
data[[28]] <- 1/data[[28]]
head(data[28])

#[16] 100 - debt to income ratio
data[[16]] <- 100 - data[[16]]
head(data[16])

#[25]/[13] monthly income/monthly payment
data[[25]] <- data[[25]]/data[[13]]
head(data[25])

#[5] loan length, 0: 36 months, 1: 60 months
data$Loan.Length <- factor(with(data, ifelse((Loan.Length == "36 months"), 0, 1)))
head(data[5])

#rename some variables
colnames(data)[c(26, 28, 16, 25)] <- c("CreditGrade(LowerBound)",
    "OpenCreditLines(Reciprocal)", "100-DebtToIncomeRatio", "MonthlyIncome/MonthlyPayment")

#combine everything so far
dataB <- cbind(data[c(14, 40, 26, 28, 16, 25, 3, 5, 4, 30, 35, 37)])
summary(dataB)
dim(dataB)

##include dummy variables
##[11] loan purpose## 0: no, 1: yes. Columns 13-25 on the final data frame
levels(data[[11]])
#"other" would be all 0s
Car <- factor(with(data, ifelse((Loan.Purpose != "car"), 0, 1)))
CreditCard <- factor(with(data, ifelse((Loan.Purpose != "credit_card"), 0, 1)))
DebtConsol <- factor(with(data, ifelse((Loan.Purpose != "debt_consolidation"), 0, 1)))
Educational <- factor(with(data, ifelse((Loan.Purpose != "educational"), 0, 1)))
HomeImprov <- factor(with(data, ifelse((Loan.Purpose != "home_improvement"), 0, 1)))
House <- factor(with(data, ifelse((Loan.Purpose != "house"), 0, 1)))
MajorPurchase <- factor(with(data, ifelse((Loan.Purpose != "major_purchase"), 0, 1)))
Medical <- factor(with(data, ifelse((Loan.Purpose != "medical"), 0, 1)))
Moving <- factor(with(data, ifelse((Loan.Purpose != "moving"), 0, 1)))
RenewableEnergy <- factor(with(data, ifelse((Loan.Purpose != "renewable_energy"), 0, 1)))
SmallBusiness <- factor(with(data, ifelse((Loan.Purpose != "small_business"), 0, 1)))
Vacation <- factor(with(data, ifelse((Loan.Purpose != "vacation"), 0, 1)))
Wedding <- factor(with(data, ifelse((Loan.Purpose != "wedding"), 0, 1)))

#[24] home ownership, "NONE" would be all 0s. Columns 26-29 on the final data frame
levels(data[[24]])
Mortgage <- factor(with(data, ifelse((Home.Ownership != "MORTGAGE"), 0, 1)))
Other <- factor(with(data, ifelse((Home.Ownership != "OTHER"), 0, 1)))
Own <- factor(with(data, ifelse((Home.Ownership != "OWN"), 0, 1)))
Rent <- factor(with(data, ifelse((Home.Ownership != "RENT"), 0, 1)))

###combine everything###
#use list() rather than c() so each dummy stays a factor in its own column
dataB[c(13:29)] <- list(Car, CreditCard, DebtConsol, Educational, HomeImprov, House,
    MajorPurchase, Medical, Moving, RenewableEnergy, SmallBusiness, Vacation, Wedding,
    Mortgage, Other, Own, Rent)

colnames(dataB)[13:29] <- c("Car", "CreditCard", "DebtConsol", "Educational",
    "HomeImprov", "House", "MajorPurchase", "Medical", "Moving", "RenewableEnergy",
    "SmallBusiness", "Vacation", "Wedding", "Mortgage", "Other", "Own", "Rent")

#remove rows with NAs
dataB <- na.omit(dataB)

#final data frame of response + predictors
summary(dataB)
names(dataB)

#divide into test and training data
load("testID.R")

TestSet <- dataB[testID, ]

TrainingSet <- dataB[-testID, ]

############################
#Fitting the logistic Model#
############################
logit.train <- glm(Status ~ .,
family = binomial,
na.action = na.exclude,
data = TrainingSet)

summary(logit.train)
library("MASS")
fit1r <- stepAIC(logit.train)

TestSet$Prediction1 <- as.numeric(predict(fit1r, newdata = TestSet, type = "response"))
hist(TestSet$Prediction1, ylab = "Frequency", xlab = "Predicted Probability",
    main = "Predicted Probability for Testing Set", col = "lightblue")

#assessing accuracy of the model
library("caret")

#classify as "Good Loan" when the predicted probability exceeds .5, then compare to actual status
TestSet$Predicted1b <- factor(with(TestSet, ifelse((Prediction1 < .5), "Bad Loan", "Good Loan")))
TestSet$Actual <- factor(with(TestSet, ifelse((Status == 0), "Bad Loan", "Good Loan")))
cm1 <- confusionMatrix(data = TestSet$Predicted1b, reference = TestSet$Actual)
as.table(cm1)

#using a .6 cutoff
TestSet$Predicted1c <- factor(with(TestSet, ifelse((Prediction1 < .6), "Bad Loan", "Good Loan")))
cm2 <- confusionMatrix(data = TestSet$Predicted1c, reference = TestSet$Actual)
as.table(cm2)

# reducing the number of predictors for the sake of parsimony
TrainingSet2 <- TrainingSet[c(1:12)]
logit.2 <- glm(Status ~ .,
family = binomial,
na.action = na.exclude,
data = TrainingSet2)
fit2r <- stepAIC(logit.2)

#final Model for testing set, remove unwanted predictors
TrainingSet3 <- TrainingSet2[c(1:6,8:9)]
logit.3 <- glm(Status ~ .,
family = binomial,
na.action = na.exclude,
data = TrainingSet3)
FinalModel <- stepAIC(logit.3)

summary(FinalModel)
TestSet$Prediction2 <- as.numeric(predict(FinalModel, newdata = TestSet, type = "response"))

#assessing the accuracy of the Final Model
TestSet$Predicted2b <- factor(with(TestSet, ifelse((Prediction2 < .5), "Bad Loan", "Good Loan")))
cm3 <- confusionMatrix(data = TestSet$Predicted2b, reference = TestSet$Actual)

library(vcd)
sieve(as.table(cm3), shade = TRUE, main = "Mosaic Plot for Accuracy of Final Model")
cotabplot(as.table(cm3), shade = TRUE, main = "Mosaic Plot for Accuracy of Final Model")

#########################################
#prediction for undetermined loan status#
#########################################

#all observations with determined status
dataC <- dataB[c(1:6, 8:9)]

#(repetitive code omitted)
Predict2 <- Predict[c(14, 40, 26, 28, 16, 25, 5, 4)]
names(Predict2)[c(1:8)] <- names(TrainingSet3)

logit.4 <- glm(Status ~ .,
family = binomial,
na.action = na.exclude,
data = dataC)
Classifier <- stepAIC(logit.4)
summary(Classifier)

Predict2$Probability <- as.numeric(predict(Classifier, newdata = Predict2, type = "response"))
library(lattice)
histogram(Predict2$Probability, main = "Predicted Probability of Good Loans",
    col = "lightblue", xlab = "Probability", ylab = "Frequency", type = "percent")

Predict2$PredictedStatus <- factor(with(Predict2, ifelse((Probability < .5), "Bad Loan", "Good Loan")))
table(Predict2$PredictedStatus)

Bagging: Logistic Regression
###########
# Bagging #
###########
#code from: http://www.r-bloggers.com/improve-predictive-performance-in-r-with-bagging/
library(foreach)

logistic.bagging <- function(training, testing, length_divisor = 4, iterations = 1000)
{
  predictions <- foreach(m = 1:iterations, .combine = cbind) %do% {
    training_positions <- sample(nrow(training),
        size = floor(nrow(training)/length_divisor))
    train_pos <- 1:nrow(training) %in% training_positions
    lm_fit <- glm(Status ~ ., family = binomial,
        na.action = na.exclude, data = training[train_pos, ])
    lm_fit2 <- stepAIC(lm_fit)
    #predict with the stepAIC-reduced model, not the full fit
    predict(lm_fit2, newdata = testing, type = "response")
  }
  rowMeans(predictions)
}

TestPrediction.bag <- logistic.bagging(training = TrainingSet3, testing = TestSet[-1])
TestSet$Predict.bag <- factor(ifelse((TestPrediction.bag < .5), "Bad Loan", "Good Loan"))
cmTest.bag.logistic <- confusionMatrix(data = as.factor(TestSet$Predict.bag),
    reference = TestSet$Actual)

#Prediction for undetermined loan status
logistic.prediction.bag <- logistic.bagging(training = dataC, testing = Predict2[c(-1, -9)])

#plots
Predict2$CreditGrade <- Predict$CREDIT.Grade
ProbabilityByGrade <- tapply(Predict2$Probability, INDEX = Predict2$CreditGrade, FUN = mean)

ByCreditGrade <- as.matrix(cbind(c(as.numeric(ProbabilityByGrade)),
    c("A1", "A2", "A3", "A4", "A5", "B1", "B2", "B3", "B4", "B5",
      "C1", "C2", "C3", "C4", "C5", "D1", "D2", "D3", "D4", "D5",
      "E1", "E2", "E3", "E4", "E5", "F1", "F2", "F3", "F4", "F5",
      "G1", "G2", "G3", "G4", "G5")))

barchart(ProbabilityByGrade, xlab = "Mean Probability of Loan being Paid in Full",
    ylab = "Letter Grade", main = "Probability that a loan is a Good Loan by Letter Grade",
    col = c(rep("firebrick2", 5), rep("darkorange2", 5), rep("gold2", 5),
            rep("darkseagreen4", 5), rep("dodgerblue2", 5), rep("darkorchid4", 5),
            rep("darkviolet", 5)))

#attach the borrower's state from the original data frame
Predict2$State <- Predict$State
ProbabilityByState <- tapply(Predict2$Probability, INDEX = Predict2$State,
    FUN = mean, na.rm = TRUE)
sort(ProbabilityByState)

dotplot(as.matrix(sort(ProbabilityByState)),
    xlab = "Mean Probability of Loan being Paid in Full", ylab = "State",
    main = "Probability that a Loan is a Good Loan by State", col = "blue")


Classification:

#recreated the data in an easier way on a later date
loans <- loandat[loandat$Status %in% c('Charged Off', 'Default', 'Fully Paid',
    'Late (16-30 days)', 'Late (31-120 days)'), ]

#change Interest.Rate to numeric values
loans$new.Interest.Rate <- as.numeric(gsub("%", "", loans$Interest.Rate))/100
#dependent variable
loans$type <- 'Bad'
loans$type[loans$Status == 'Fully Paid'] <- 'Good'
loans$type <- as.factor(loans$type)

loans.new <- loans[, c(5, 11, 24, 30, 35, 37, 43:49)]

testID <- sample(1:nrow(data), round(nrow(data)*.3))   #roughly 28% of the data ends up in the test set
save('testID', file = 'C:/Users/Yue Shen/Desktop/testID.R')
load('C:/Users/Yue Shen/Desktop/testID.R')
loans.test <- loans.new[testID, ]
loans.test <- loans.test[!is.na(loans.test$length_of_employment), ]    #get rid of NA's
loans.train <- loans.new[-testID, ]
loans.train <- loans.train[!is.na(loans.train$length_of_employment), ] #get rid of NA's
loans.train <- loans.train[!is.na(loans.train$lower_score), ]

Random Forest
library(randomForest)

rfloanscv <- rfcv(trainx = loans.train[, 1:12], trainy = loans.train[, 13],
    cv.fold = 5, do.trace = T)
#> rfloanscv$error.cv
#        12         6         3         1
# 0.1828936 0.1847596 0.1828851 0.1850041
#thus the error doesn't depend much on the number of variables tried at each split

numtreerf <- sapply(1:50, function(i) {
  model <- randomForest(x = loans.train[, 1:12], y = loans.train[, 13],
      do.trace = T, ntree = 10*i)
  predGoodactBad <- model$confusion[1, 2]/sum(model$confusion[, 2])
  return(predGoodactBad)
})
plot(numtreerf,
    main = '(Actual=Bad and Predict=Good)/(Predict=Good) by number of trees in random forest',
    type = 'l', xlab = 'Number of trees X 10', ylab = 'Value for the ratio', xaxt = 'n')
axis(1, at = c(1:50), labels = c(1:50))
which.min(numtreerf) * 10   #20 trees gives the smallest ratio

rfloansfinal <- randomForest(x = loans.train[, 1:12], y = loans.train[, 13],
    do.trace = T, ntree = 20)
rfloansfinal$confusion
rfloansfinalpredictTrain <- predict(rfloansfinal, loans.train)
table(actual = loans.train[, 13], predict = rfloansfinalpredictTrain)
rfloansfinalpredictTest <- predict(rfloansfinal, loans.test)
matrixRF <- table(actual = loans.test[, 13], predict = rfloansfinalpredictTest)
rfabadpgood <- matrixRF[1, 2]/sum(matrixRF[, 2])
rfloansfinal$importance

#to calculate the median credit grade of loans predicted "Good"
IDpredGoodRF <- names(rfloansfinalpredictTest[rfloansfinalpredictTest == 'Good'])
CreditGradeGoodRF <- loandat[IDpredGoodRF, 'CREDIT.Grade']
summary(CreditGradeGoodRF)

Boosting
library(ada)
#optimize the number of boosting iterations (trees), 10 to 200 in steps of 10
numtreeb <- sapply(1:20, function(i) {
  cat(i)
  adamodel <- ada(type ~ ., data = loans.train, iter = 10*i)
  predGoodactBad <- adamodel$confusion[1, 2]/sum(adamodel$confusion[, 2])
  return(predGoodactBad)
})
plot(numtreeb,
    main = '(Actual=Bad and Predict=Good)/(Predict=Good) by number of trees in boosting',
    type = 'l', xlab = 'Number of trees X 10', ylab = 'Value for the ratio', xaxt = 'n')
axis(1, at = c(1:20), labels = c(1:20))

adamodel200tree <- ada(type ~ ., data = loans.train, iter = 200)
adamodel200tree$confusion
#predict on the test set with the 200-tree model
predtestada200tree <- predict(adamodel200tree, newdata = loans.test)
matrixB <- table(actual = loans.test[, 13], predicted = predtestada200tree)
babadpgood <- matrixB[1, 2]/sum(matrixB[, 2])

varplot(adamodel200tree)
