
/************ BASICS OF CREDIT RISK MODELLING *****************/
There are three keywords to this course:
1. Credit - Funds lent out on the 'credit' of the borrowers, to be repaid later.
2. Risk - An environment of uncertainty, which leads to randomness in the cash flows. Risk, in business, refers to a situation where uncertainty in the outcome creates randomness in the cash flows.
3. Modelling - A model is a replication of a real-life business problem. Models help us identify, ex-ante, the probable outcomes that can happen. Models are constructed under certain assumptions and help in predicting the possible average outcomes based on the present average outcomes embodied in the data.
How does a bank make profit from the credit business?
    -> A bank borrows from a low-risk segment and lends out to a high-risk segment. It offers a lower rate of interest to the depositors and charges a higher rate of interest from its borrowers. For example: Mr.A keeps Rs.10000 in ABC bank. Mr.B has a requirement of Rs.10000 for a personal loan. So ABC bank uses the money kept by Mr.A to finance Mr.B. Now the market rate of interest on personal loans is 14% and the rate of interest on savings accounts is 4%. Then the profit made by ABC on each Rupee lent out is (14% - 4%) = 10%.
    -> Overdraft is a type of revolving credit that banks offer on current accounts. How does the bank make money on O/D? If an account is overdrawn, the bank generally borrows from the treasury and gives the money to the customer. The rate of interest on borrowing from the treasury is lower than the rate of interest charged on the overdrawn current account. Therefore, the interest differential is the profit of the bank, or the Credit interest income. Now, some banks also give interest on current accounts if some funds are maintained there. In this case, the bank lends these funds to the treasury and earns interest. So, the interest income from this source is the interest earned from the treasury less the debit interest given to the customer. This is known as the Debit interest income. The Credit interest income + Debit interest income together is called the Net Interest Income (NII).
How is the risk coming into play?
    -> Now suppose after paying back Rs.3000, Mr.B defaults on the payment and does not give the rest back.
        -> This is CREDIT RISK, where the borrower does not pay back the loan. This causes a randomness in the cash flows which can manifest into different kinds of risk.
The next day Mr.A comes in to withdraw his money. What happens?
        -> The first risk that materialises is: the bank cannot pay Mr.A the entire sum of money. Hence, the bank goes in for liquidation and asks the authorities to bail it out. This is known as Liquidity Risk.
        -> Once the bank files for liquidation, the market loses confidence in the bank. Therefore, the reputation of the bank is damaged. This is the case of Reputational Risk.
Different kinds of risk that the bank faces:
1. Credit Risk: The risk arising from lending out to borrowers. Hence the factors which explain the borrower's risk also explain the credit risk.
2. Market Risk: The risk of losses experienced due to fluctuations in the financial system or the entire market. Such risks include: stock market crashes, price fluctuations, interest rate fluctuations etc.
3. Operational Risk: Risks resulting from breakdowns in the internal procedures and operational inefficiencies of the bank. For ex: server breakdowns, improper working of ATMs etc.
4. Reputational Risk: Risk arising from negative perception on the part of the customers.
5. Liquidity Risk: Risk that the asset owner is unable to recover the full value of the asset when it is sold.
How can the credit risk be managed?
    -> Identify the sources of the credit risk: the borrower's capacity to pay back the loan, stability of the borrower, willingness of the borrower to pay back, changes in the customer's risk profile, exposure of the borrower to macroeconomic fluctuations.
    -> Manage these sources from where the risks are likely to arise.
    -> To manage these sources of the borrower's risk, we need to identify the major drivers of the borrower's risk. That is where the role of credit risk models largely comes in.
For ex: A customer applies for a loan. The bank asks for all the pre-requisite information of the customer and feeds it into the system. The business problem here is: whether the customer is a good customer or a bad customer. So if I am developing a model to identify whether an applicant is good or bad, then I need to technically define the dependent variable for my model. The dependent variable here is:
Y = 1 if the customer is good
  = 0 otherwise.
We try to identify the chances of a customer being a good customer. In the statistical paradigm, what we do is: calculate the probability of the customer being a good or a bad customer, i.e. based on the variables we have to explain the behaviour of an average customer, we predict P(Y=1) and resultantly P(Y=0). Now, the decision rule which helps us do this is known as the Application score. From P(Y=1) an application score is calculated, and there is a pre-defined application cut-off score. If the application score of an applicant lies above the cut-off score then the customer is given the loan, and if the score lies below the application cut-off score, the application is rejected. This is known as the Application Scorecard. It is a basic credit risk model which helps in managing the credit risk that may arise at the time of origination of accounts in the books of the bank.
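The accept/reject decision rule described above reduces to a simple comparison against the cut-off. A minimal sketch (the cut-off value of 600 and the boundary convention are purely hypothetical):

```python
def application_decision(application_score, cutoff_score=600):
    # At or above the pre-defined cut-off -> grant the loan; below -> reject.
    return "APPROVE" if application_score >= cutoff_score else "REJECT"

decision_a = application_decision(640)  # above cut-off
decision_b = application_decision(550)  # below cut-off
```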
Now suppose the customer who applied for the loan has been granted it. If he is granted the loan, it means that he passed all the application criteria and the affordability requirement. So the bank is mostly sure that the customer has the capacity to pay back the loan. What it would now be concerned with is: is the customer willing to pay back the loan? This calls for analysing the behaviour of the customer. Banks want to understand the chances of the customer defaulting over the next 12 months. So the business problem would be:
Y = 1 if the customer defaults over the next 12 months
  = 0 otherwise.
So the bank wants to assess the probability of the customer defaulting, i.e. their objective is to calculate P(Y=1). Now, based on the probability of default, the customer is assigned a score. The score is generally negatively associated with the probability of default of the customer. This score is known as the Behaviour score.

What are the uses of this score?
    -> The behaviour score is used for account management or collection management strategies. For ex: there are two credit card holders who have similar account utilisation and payment patterns. The bank wants to give a Credit Limit Increase (CLI) of 20% to one of the two customers. Of these two customers, only the one who has the higher behaviour score will be given the CLI. Similar decisions can be taken for giving out top-ups on personal loans.
OBSERVATION: For most of the cases in credit risk modelling, the dependent variable is categorical in nature. The most in-use technique for modelling categorical variables is the Logistic Regression model.
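As a minimal sketch of the technique, here is a tiny logistic regression fitted by gradient ascent on a made-up one-variable dataset. In practice scorecard models are fitted on many derived variables with dedicated software; the data and learning-rate settings here are illustrative only:

```python
import math

def fit_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit P(Y=1|x) = 1/(1 + exp(-(b0 + b1*x))) by gradient ascent
    on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # gradient w.r.t. the intercept
            g1 += (y - p) * x    # gradient w.r.t. the slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def predict_proba(b0, b1, x):
    # Predicted P(Y=1) for a new observation.
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Made-up data: x = a risk driver (say, utilisation), y = 1 for 'good' customers.
xs = [0.10, 0.20, 0.30, 0.60, 0.70, 0.90]
ys = [1, 1, 1, 0, 0, 0]
b0, b1 = fit_logistic(xs, ys)
```

Low values of x go with good customers in this toy data, so the fitted model assigns them a higher P(Y=1).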
TYPES OF CREDIT LENDING DONE BY BANKS:
There are two types of credit lending done by the banks: 1. Unsecured Lending 2. Secured Lending. Unsecured lending refers to lines of lending which are not backed by collateral. Secured lending, on the other hand, is backed by an underlying mortgaged asset. In the case of retail products: personal loans, credit cards and current account overdrafts are examples of unsecured lending. Since most of the products in the unsecured section are not backed by underlying assets and have an attached revolving facility, they are known as 'open-end loans'. For ex: a credit card is an example of an 'open-end product', while a personal loan is an unsecured loan but a 'closed-end product'. Secured lending, on the other hand, is mostly 'closed-end lending'. Secured lending comprises: home loans, vehicle loans etc. Similarly, there are different examples of secured and unsecured lending for commercial portfolios as well. For example: Asset Backed Loans and Cash Flow Loans.
CREDIT CARDS: A card issued by a financial institution which gives the holder the benefit of borrowing funds, at different points of time, up to a pre-defined limit. Credit cards charge interest and are usually designed to finance short-term needs. Interest begins from one month after a purchase is made and the borrowing limit is pre-set. Credit cards have a higher rate of interest and the user of the card can revolve the credit. What is the credit risk associated with credit cards? How can the risk be modelled?
    -> The biggest risk of the credit card is that people might have the incentive to flunk payments, since the loan is unsecured. So from the modelling perspective it becomes very critical to identify the willingness of the customer to pay back. However, if all customers paid back exactly on time, it would not be a profitable proposition for the bank. Again, if customers revolve their entire credit limit and never pay anything back, that would be a risk to the bank. The most profitable customers for the bank are those who revolve a decent percentage of their credit line and make payments regularly to keep things in check. So the variables which capture the willingness of the customer to pay back are the most important for capturing the default probabilities of the customer. If, given the capacity to pay back, the willingness of the customer to pay back is low, then the chances of default will be higher. However, it can also be the case that the capability of the customer to pay back has fallen. Therefore, the final model variables in a behaviour scorecard model for a credit card must capture both these aspects of the customer. These are captured using derived variables.
What are derived variables? -> Derived variables are variables which are derived from the raw data variables. Raw variables are also known as primary variables. Some primary variables which are seen in credit card datasets:
    1. Date Variables: Account Opening Date, Date of First Origin with the Bank, Application Date, Date of Last Purchase, Date of Last Payment, As on Date, Account Closing Date.
    2. Unique identifiers or ID variables: CustomerID, AccountID, ProductID etc.
    3. Categorical Variables: Product_type, VIP_status, Fraud_Bankruptcy_indicator, Delinquency_status, Account_Status etc.
    4. Numerical Variables: Amount of payments made, No. of payments made, Credit Limit, Balance Outstanding, Purchase amount, No. of purchases made, Number of cash payments made, Amount of cash payments made, Salary or the income of the customer etc.
Derived variables are obtained as a function of the primary variables. Some important derived variables that we can frame here are as follows:
    1. Highest Balance Ever in the last 12 months: Let the balance outstanding be BO1, BO2, ..., BO12 for the last 12 months. Then the Highest Balance Ever for the customer is = MAX(BO1, BO2, ..., BO12). Why is this derived variable important to me? -> The higher the value of this variable for a given account, the higher is the tendency of the account to revolve the credit.
        Is it possible for me to say that a customer whose highest balance ever in the last 12 months is $2000 is more risky than a customer whose highest balance ever in the last 12 months is $900?
            -> This variable gives a partial picture of the customer's riskiness. It might be that the person who has $2000 as the highest balance in the last twelve months generally does not carry that high an outstanding, or even if he does, he repays it over time. Therefore, we need variables which capture other aspects of the borrower, like his average outstanding and his payment propensities.
    2. Average Outstanding in the last twelve months: Sum of the total outstanding balances / total number of months (12). This gives us an idea about the average outstanding balance of a given account.
        Using 1 and 2 we may create a third variable as well. Let us call this variable:
            V001 = Average balance in the last 12 months / Highest balance ever in the last 12 months.
        For a particular account: Average bal = $1800 and highest balance = $2000, therefore V001 = 0.9.

    3. Balance Payment Ratio: This variable is defined as the balance outstanding at a given month / payment made by the customer in that month. Looking into the average trend of the balance payment ratio, we can form an idea about the proportion of the balance which is paid by the customer. For example: balance = $2000 and payment = $200. Balance/Payment ratio = 2000/200 = 10. So the balance outstanding is 10 times the payment made.
    4. Month on Books (MOB): This measures the number of months for which the account has been on the books. This is generally the difference between the account open date and the as-on date. It has been seen that the more tenured an account is, the lower are the chances of default.
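The four derived variables above can be computed directly from the monthly raw fields. A sketch with entirely made-up monthly figures and dates:

```python
from datetime import date

# Hypothetical last-12-months raw data for one account.
balances = [1200, 1500, 1800, 2000, 1700, 1600, 1500, 1400, 1900, 1800, 1750, 1650]
payments = [200, 250, 180, 200, 300, 250, 220, 210, 190, 230, 240, 200]

highest_balance_12m = max(balances)                  # 1. Highest Balance Ever
avg_outstanding_12m = sum(balances) / len(balances)  # 2. Average Outstanding
v001 = avg_outstanding_12m / highest_balance_12m     # V001 = average / highest
bal_pay_ratio = [b / p for b, p in zip(balances, payments)]  # 3. per-month ratio

def months_on_books(open_date, as_on_date):
    # 4. MOB = difference in months between account open date and as-on date.
    return (as_on_date.year - open_date.year) * 12 + (as_on_date.month - open_date.month)

mob = months_on_books(date(2013, 9, 30), date(2015, 3, 31))
```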
PERSONAL LOANS: These are loans lent out to meet small and medium financial needs. A personal loan is an unsecured lending and a closed-end business product. The main challenges in personal loans are:
    1. Loan Origination: This is basically an acquisition problem. Which of the Through The Door (TTD) customers is the bank going to accept? What are the factors with which the bank can capture the riskiness of the applicants? How would the bank design its origination business problem?
    -> The business problem of the bank is the variable whose behaviour it is trying to predict or capture. Therefore, the business problem at the origination stage is:
Y = 1 if the customer is good
  = 0 otherwise.
    -> WHAT IS THE 'GOOD' DEFINITION??? -> What are the parameters based on which 'Good' is defined?
    2. Account Management: In this part, generally, the payment patterns and the delinquency of the customers are looked into. The changes (if any) in the risk profile of the customers are observed and monitored. The chances of the customer defaulting over the next 12 months are identified at this stage. So the business problem of the bank here is:
Y = 1 if the customer defaults over the next 12 months
  = 0 otherwise.
What do you mean by 'defaults'? And why are we looking at 'default over the next 12 months'?
    -> How to create 'Next_12_months_default_flag'?

Cust_id   Account_id   As_on_date   Default_flag
A1123     PL_12356     31.03.2015   0
A1123     PL_12356     30.06.2015   0
A1123     PL_12356     30.09.2015   1
A1123     PL_12356     31.12.2015   1

Next_12_months_default_flag = if the account has defaulted even once in the next 12 months, then he is a defaulter. Sum(Default_flag) = 0+0+1+1 = 2.
'Once a defaulter always a defaulter':
if sum(Default_flag) > 0 then Next_12_months_default_flag = 1 else 0.
The objective here is to model P(Y=1) and estimate how probable the customer is to default over the next 12 months.
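The 'once a defaulter always a defaulter' rule above reduces to a one-line aggregation over the periodic default flags:

```python
def next_12_months_default_flag(default_flags):
    # Flag = 1 if the account defaulted even once across the next 12 months.
    return 1 if sum(default_flags) > 0 else 0

# Account PL_12356 from the table: flags 0, 0, 1, 1 -> sum = 2 > 0 -> defaulter.
flag = next_12_months_default_flag([0, 0, 1, 1])
```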

How would the variables for personal loans look?
    -> 1. Date Variables: Account Opening Date, Date of First Origin with the Bank, Application Date, Date of Last Purchase, Date of Last Payment, As on Date, Account Closing Date, Date of Birth.
    2. Unique identifiers or ID variables: CustomerID, AccountID, ProductID etc.
    3. Categorical Variables: Product_type, VIP_status, Fraud_Bankruptcy_indicator, Delinquency_status, Account_Status etc.
    4. Numerical Variables: Salary of the individual, Principal of the loan, Interest of the loan, EMI, Annual Percentage Rate (APR), Fixed Obligation to Income Ratio.
    5. Particulars related to other loans: current account outstanding, credit card outstanding balances, higher OD limit, lower OD limit, obligations with other banks.
Seasoning of Loan: For calculating the Probability of Default (PD) for retail portfolios, institutions shall identify and analyse expected changes of risk parameters over the life of credit exposures. Loan age has very important implications for predicting credit card default rates. Seasoning analysis is the plot of the default rates against the MOB. The seasoning cut-off is the critical MOB value after which the default rates stabilise. Generally, it is seen that a loan seasons after 12-18 months of origination for retail products. If PD is calculated on a non-matured loan, then there are chances of underestimation.
As per BASEL Paragraph 467: "Seasoning can be quite material for some long term retail exposures characterised by the Seasoning effects that peak several years after origination"
Identifying the different sources of risk arising from personal loans and how they can be captured using derived variables:
    1. MOB (Month on Books): This is the difference between the Account Opening Date and the As on Date. This helps in identifying whether an account has matured sufficiently for it to be analysed. The higher the MOB, the lower are the chances of defaulting, given that the account is non-default as on date.
    2. Time with Bank: This is the difference between the first date of origination and the As on Date. This variable is used to distinguish between 'relationship' and 'non-relationship' customers. If, for an account, Time with Bank = MOB, then it is the customer's first account with the bank and prior to this he was a 'non-relationship' customer. If Time with Bank > MOB, then the customer was previously a 'relationship' customer.
    3. Fixed Obligation to Income Ratio (FOIR): This is the total burden of debt which a customer already has, divided by the total 'take-home' pay of the customer. For ex: a person has a monthly income of 100000 INR. He already pays out 35000 as EMIs on existing loan commitments. His credit card bill on average is 15000. For his survival, his average monthly expenditure is 20000. So, his fixed obligations = 35000 + 15000 + 20000 = 70000. Now, the FOIR = 70000/100000 = 0.7. Therefore, the releasable income to pay the EMIs for the personal loan = 0.3 of his total disposable income.
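The FOIR computation in the worked example, as a short function:

```python
def foir(monthly_income, fixed_obligations):
    # Fixed Obligation to Income Ratio = total fixed obligations / take-home pay.
    return sum(fixed_obligations) / monthly_income

# Example from the text: EMIs 35000 + credit card 15000 + living expenses 20000,
# against a monthly take-home income of 100000 INR.
ratio = foir(100000, [35000, 15000, 20000])
releasable_share = 1 - ratio  # share of income left for the new loan's EMI
```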
So derived variables help us to identify different application and behavioural traits of the customers. Now, out of all the variables derived, which are the most important in predicting the behaviour of the customer? We need a mechanism to identify the most significant variables in deciding a 'good' or a 'bad' customer, a 'default' or a 'non-default' customer. Therefore, we need to develop a scorecard. A scorecard acts as a decision-making tool. It assigns a score to each account and helps in distinguishing risk profiles in a portfolio. Based on a cut-off score, the appropriate decision is taken. An arbitrary scorecard is described below:
The scorecard below is a behaviour scorecard for a credit card portfolio. The dependent variable which has been modelled here is:
Y = 1 if the customer defaults over the next 12 months
  = 0 otherwise.
Let the final scorecard variables which are seen to be most significant in identifying the behaviour of a customer be: Current Utilisation, %time in delinquency > 0 in 12 months, Number of months with purchase > 0 in 12 cycles, and Month on Books. (Basically, the most significant variables are derived using Logistic Regression.) Now, each of the scorecard variables has its categories, and corresponding to each category there is a score. So a typical scorecard would look like:
Variable Name                     Variable_Category   Score
Current Utilisation               < 5%                28
                                  [5%,15%]            21
                                  [15%,25%]           17
                                  [25%,40%]           10
                                  > 50%               03
%time in delinquency > 0          < 5%                37
in the last 12 months             [5%,25%]            18
                                  [25%,50%]           09
                                  > 50%               00
Number of months with             < 3                 27
purchase > 0 in the               [3,6]               20
last 12 months                    [6,9]               16
                                  > 9                 05
Month on books                    < 6                 00
                                  [6,18]              25
                                  > 18                37

The base score is 88. So if a person has a score below 88, he would be treated as a defaulter. Else, if he has a score above 88, he will not be considered a defaulter.
An account has an outstanding of 20000 and the card limit is 50000. This customer had defaulted once in the last 12 months. He generally purchases on his credit card for just three months and tries to repay the majority of the amount within the next 30 days. Now the score would be:
Current Utilisation = 20000/50000 = 0.4 = 40% -> Score = 10
%time in delinquency in the last 12 months = 8% -> Score = 18
Number of months with purchases > 0 = 3 -> Score = 20
Total score of the customer = 10+18+20 = 48. Now 48 < 88. Therefore, the customer is Bad as per the developed scorecard.
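The worked example can be reproduced in code. Following the example in the text, this sketch scores only the three variables used there and omits the Month-on-Books points; the band edges are read off the sample scorecard above, and the treatment of boundary values is an assumption:

```python
def score_utilisation(u):
    # Bands and points taken from the sample behaviour scorecard.
    if u < 0.05:  return 28
    if u <= 0.15: return 21
    if u <= 0.25: return 17
    if u <= 0.40: return 10
    return 3

def score_pct_time_delinquent(p):
    if p < 0.05:  return 37
    if p <= 0.25: return 18
    if p <= 0.50: return 9
    return 0

def score_months_with_purchase(m):
    if m < 3:  return 27
    if m <= 6: return 20
    if m <= 9: return 16
    return 5

def behaviour_score(utilisation, pct_time_delinquent, months_with_purchase):
    return (score_utilisation(utilisation)
            + score_pct_time_delinquent(pct_time_delinquent)
            + score_months_with_purchase(months_with_purchase))

BASE_SCORE = 88

# Worked example: 40% utilisation (10) + 8% time delinquent (18) + 3 purchase months (20)
total = behaviour_score(0.40, 0.08, 3)
decision = "Bad" if total < BASE_SCORE else "Good"
```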

How to build a scorecard? -> We would discuss the main steps of building a scorecard.
/* **** STEPS OF BUILDING A SCORECARD **** */
Step-1. Understanding the Business Problem.
    -> Justification to the business about the model development. It involves putting forward arguments as to why the proposed model is to be developed. To provide the justification, the following are the important lines of argument that are seen frequently:
        a. The existing model in place is not performing well in terms of stability, accuracy or distinguishing capacity. Therefore, the reasons for the improper working of the present model have to be identified. So any scorecard model development, in reality, begins with the validation of the existing model. Following are some of the important observations that the model developer can make:
            -> The model's discriminatory power has deteriorated: deterioration of a scorecard is identified through the change in the Gini coefficient.
            -> The population for which the scorecard was developed has changed: a huge shift in the Population Stability Index is observed.
            -> The variables used in the model have changed over time in terms of their characteristics: a huge shift in the Variable Deviation Index and the Characteristic Stability Index is observed.
            -> The segments in the model have changed -> the segments do not rank order. It may be that over time the segments have shrunk in size, which prevents proper rank ordering.
LIST OF CONCEPTS:
1. GINI COEFFICIENT 2. POPULATION STABILITY INDEX 3. CHARACTERISTIC STABILITY INDEX 4. VARIABLE DEVIATION INDEX 5. RANK ORDERING 6. SEGMENTATION
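Of these, the Population Stability Index has a simple closed form: PSI = sum over score bands of (actual% - expected%) * ln(actual% / expected%). A sketch with made-up band distributions; the 0.10/0.25 thresholds in the comment are a commonly quoted industry rule of thumb, not part of the source:

```python
import math

def population_stability_index(expected_pct, actual_pct):
    # PSI = sum over bands of (actual - expected) * ln(actual / expected).
    # Each term is non-negative, so PSI >= 0.
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected_pct, actual_pct))

dev_sample  = [0.10, 0.20, 0.30, 0.25, 0.15]  # score-band mix at development time
curr_sample = [0.12, 0.18, 0.28, 0.27, 0.15]  # score-band mix on recent data
psi = population_stability_index(dev_sample, curr_sample)
# Rule of thumb: PSI < 0.10 stable, 0.10-0.25 moderate shift, > 0.25 major shift.
```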
    -> Analysis of the portfolio: A portfolio is defined as a collection of loans. It is characterised by the number of accounts and the receivables (or the balance outstanding) for the portfolio. A further deep-dive analysis of the portfolio comprises understanding the balance by the delinquency buckets. Looking into the distribution of the accounts and the balance by the delinquency buckets gives the analyst an idea about the riskiness associated with the portfolio.
Step-2. Defining the dependent variable and understanding the relevant independent variables.
    -> The dependent variable in a credit risk model is the variable which is to be modelled. For ex: in an application scorecard, the probability of the customer being a 'good' customer is estimated. Therefore, the dependent variable will be:
Y = 1 if the customer is good
  = 0 otherwise -> We develop a model to predict the chances of the customer being a good customer, i.e. P(Y=1).
Similarly, for developing a behaviour scorecard, the dependent variable in the model is:
Y = 1 if the customer is a defaulter
  = 0 otherwise -> The problem is to model P(Y=1), i.e. the chances of the customer being a defaulter.

How to define the dependent variable?


    -> The most important concept in a credit risk model development exercise is to define the 'bad' customer or the 'good' customer. How to define a 'good' or a 'bad' customer? Some concepts are important for framing the definition:
        a. Snapshot Period: A snapshot is a period of data which has been picked up as a sample by the modeller. Say, account level information has been picked up for March 2014.
        b. Observation Period: The observation period is the period over which the data is observed. Generally, the last twelve months' behaviour from a selected snapshot is the observation period. For the example above: April 2013 - March 2014 is the observation period.
        c. Performance Period: The performance period for an account is the next twelve months of performance data for the account, starting from the snapshot date. Therefore, for the example above, April 2014 - March 2015 is the performance period. So the performance period is the period over which the behaviour or the performance of the account is observed.
How to identify which months in a year should be used as snapshots?
    -> Snapshot months are selected from every quarter of each year. So there are four snapshots from each year. This is done to reduce the bias in the snapshot data by considering the entire year. It gives a 'full round the year' kind of view, which accounts for seasonal differences.
    -> While selecting snapshots, it must be kept in mind that each snapshot must have twelve months of observation and twelve months of performance.
March 2014 <---- Observation window ----> March 2015 <---- Performance window ----> Feb 2016
This is called the Performance window because the 'performance' of the account is to be monitored over this period. This performance window will give us the dependent variable of our model.
For ex: for a behaviour scorecard we want to develop a PREDICTIVE model to estimate the chances of default. The dependent variable has to be predictive or forward-looking in that case. So, the modellers estimate the chance that the customer will default in the NEXT 12 MONTHS.
How to estimate the optimal performance period and how to obtain the bad definition?
    -> Identifying the optimal performance period: For determining the performance period, we look for a time frame over which the loss rates stabilise for the portfolio. For Business As Usual models there are different bad definition and performance period analyses, but for BASEL models or other regulatory models like IFRS9 etc., the performance period for retail loans is specified as 12 months. How do we identify the optimal window over which the loss rates stabilise? There are some important analyses which are done: 1. ROLL RATE ANALYSIS 2. VINTAGE ANALYSIS. Under these analyses we check how much of the total accounts are flowing from one delinquency bucket to another. The bucket at which the flow rate stabilises is taken to be the bad definition. As the flow rate of the delinquent accounts stabilises, the loss rates also stabilise. This provides our performance definition. For ex:

Time      CD0-CD1   CD1-CD2   CD2-CD3   CD3-CD4   CD4-CD5
2012-13   10%       34%       80%       81%       78%

So we can see that accounts which enter the CD3 bucket have a very high chance of going into the higher buckets of delinquency. Therefore, CD3+ is taken to be the 'Bad' definition of the model. Similarly, an account which is CD2 is called an indeterminate account, since from this delinquency bucket the account may roll on to higher delinquency buckets as well as to lower delinquency buckets. Similarly, accounts which are CD0 or CD1 are considered to be good customers.
CD -> Cycle of delinquency
CD0 -> Customers who are in zero cycle of delinquency. These are customers who h
ave never been delinquent in the performance period.
CD1 -> Accounts which are 1 cycle delinquent, i.e. 1-29 days delinquent.
CD2 -> Accounts which are 2 cycles delinquent, i.e. 30-59 days delinquent.
CD3 -> Accounts which are 3 cycles delinquent, i.e. 60-89 days delinquent.
CD4 -> Accounts which are 4 cycles delinquent i.e. 90-119 days delinquent.
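Under the 30-day cycle convention listed above, days past due map to a CD bucket as follows (a sketch; buckets beyond CD4 are assumed to continue in further 30-day steps):

```python
def cycle_of_delinquency(days_past_due):
    # CD0 = not delinquent; CD1 = 1-29 days; CD2 = 30-59; CD3 = 60-89; CD4 = 90-119; ...
    if days_past_due <= 0:
        return 0
    return days_past_due // 30 + 1

buckets = [cycle_of_delinquency(d) for d in (0, 15, 45, 75, 100)]
```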
How to define the independent variables?
    -> Not all the independent variables present in the data are used for developing scorecards. Only variables which are relevant and reflect credit and borrower risk are chosen. So variables which reflect operational risk, market risk etc. are removed from the data. Operational variables include variables which give branch information, cheque numbers etc. Market factors like stock market index values, interest rate values etc. are removed.
    -> Creating the derived variables from the primary variables. Derived variables are used to capture the tendencies of the customers at different stages of the model development.
Step-3: Pulling the data from databases.
    Pulling the data for both dependent and independent variables includes simple exercises such as importing a dataset into the SAS environment from other environments such as Excel, csv, txt files etc. If the database is primarily maintained in the SAS format, then the data extraction exercise would merely mean copying a dataset from a given library to a user-specific library. For more complex data architectures, data extraction might involve extraction from Access databases by writing SQL code. When data is being imported from an exogenous environment into the SAS environment, some inherent challenges of formatting may arise. As such, some basic yet important checks need to be done on the data:
    1. After the import is done, it is recommended that the number of variables and the number of observations in the created SAS file is reconciled with the original data file.
    2. After the import is done, we need to check whether all variables have the desired format, i.e. whether the numerical variables have been imported as numeric and the character variables as character (PROC CONTENTS in SAS).
    3. If a master data is to be created, then we need to validate that the 'unique merging key' exists in both the datasets which need to be clubbed. For example:

Dataset A        Dataset B           Master_Data
IDVar X1 X2      IDVar X3 X4    ->   IDVar X1 X2 X3 X4
1                1                   1
2                2                   2
3                3                   3

The variable IDVar is the unique merging key. One data challenge can be that the ID variables are maintained in a different format across the two datasets; then merging them becomes very difficult. Therefore, we must ensure that the unique merging key exists. For retail banking data, Customer ID and Account ID are the two most widely used merging keys. For commercial portfolios, Obligor ID and Transaction ID are used as the common merging keys. (MERGING AND APPENDING IN SAS - data step merge and PROC SQL joins.)
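The merge on the unique key can be sketched in pure Python as the analogue of a SAS data step MERGE or a PROC SQL inner join (the field values below are made up):

```python
def merge_on_key(dataset_a, dataset_b, key="IDVar"):
    # Inner join two lists of records on a unique merging key.
    b_by_key = {row[key]: row for row in dataset_b}
    master = []
    for row in dataset_a:
        if row[key] in b_by_key:
            merged = dict(row)                  # X1, X2 from Dataset A
            merged.update(b_by_key[row[key]])   # X3, X4 from Dataset B
            master.append(merged)
    return master

dataset_a = [{"IDVar": 1, "X1": 10, "X2": 20},
             {"IDVar": 2, "X1": 11, "X2": 21},
             {"IDVar": 3, "X1": 12, "X2": 22}]
dataset_b = [{"IDVar": 1, "X3": 30, "X4": 40},
             {"IDVar": 2, "X3": 31, "X4": 41},
             {"IDVar": 3, "X3": 32, "X4": 42}]
master_data = merge_on_key(dataset_a, dataset_b)
```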
    4. All the variables should be in sync with their business definitions. For ex: Loan outstanding -> this variable cannot have a zero value for accounts which are on the books. If an account is still on the books, it means that some fraction of the loan is still outstanding; therefore, Loan outstanding > 0. If Loan outstanding is 0 and the account is still open, then there is some data issue: this is actually a missing observation under Loan outstanding which has been recorded as 0. So it is suggested that for such variables a frequency distribution is done for zero and non-zero values, and it is checked that the accounts with Loan outstanding = 0 are actually closed. Another check of this kind is to see whether the account opening date > as-on date. If this condition holds true, then there is some issue on the data entry side and actually the account opening date is missing. Such missing values need to be addressed.
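The two business-definition checks described in point 4 can be coded as simple validations (the function and field names are illustrative, not from the source):

```python
from datetime import date

def validate_account(loan_outstanding, account_open, open_date, as_on_date):
    """Return a list of data issues found for one account record."""
    issues = []
    if account_open and loan_outstanding <= 0:
        # An account still on books must have Loan outstanding > 0;
        # a zero here is likely a missing value recorded as 0.
        issues.append("zero outstanding on an open account")
    if open_date > as_on_date:
        # Opening date after the as-on date points to a data entry issue.
        issues.append("account opening date after as-on date")
    return issues

bad_record = validate_account(0, True, date(2016, 5, 1), date(2015, 12, 31))
good_record = validate_account(5000, True, date(2014, 1, 1), date(2015, 12, 31))
```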
Step 4 - Data Quality Checks: One of the most important mandates faced by the banks is to ensure that they maintain modelling data of sufficiently high standards. To ensure the robustness of the data, certain checks need to be performed. In the risk modelling domain, datasets are maintained at different frequencies of time. For eg: some organisations maintain their data at quarterly intervals, some others at monthly intervals. Banks mostly maintain data on a monthly basis. Monthly snapshots are used for the model development exercise. Now, when data over a long horizon of time is used, it becomes necessary to check whether the behaviour of a variable is consistent or robust over time. A list of checks is performed on the characteristics of the data. Such checks are known as Data Quality Checks.
Some basic nuances of the data quality check procedure:
        a. To check for recent database changes or changes in the data architecture of the organisation -> If there is a database change in the neighbourhood of the model development window, it is important to check the common variables between the two databases and identify whether they have the same values recorded at a given point of time. The distribution of the common variables also needs to be checked in the immediate neighbourhood of the time when the database change took place.
        b. To check for the presence of variables over time -> (First Occurrence and Last Occurrence Analysis). For building business models like credit scorecards this analysis is not always important, because the span of time over which the data is considered is not very large. However, for developing Basel models this analysis is important, because a sufficiently long period of time is considered for the model development data. Such long periods are enough to span policy changes and changes in management decisions. Thus, first occurrence and last occurrence analysis is important for regulatory model building.
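A minimal sketch of a first/last occurrence analysis, assuming the data is available as one list of populated variables per monthly snapshot; the months and variable names are hypothetical.

```python
# Toy monthly extracts: month -> the set of variables populated in that
# month's snapshot. Months and variable names are hypothetical.
snapshots = {
    "201401": {"balance", "limit"},
    "201402": {"balance", "limit", "utilisation"},
    "201403": {"balance", "utilisation"},
}

# Record the first and last month in which each variable appears, so that
# variables introduced late or retired early can be spotted.
occurrence = {}
for month in sorted(snapshots):
    for var in snapshots[month]:
        first, _ = occurrence.get(var, (month, month))
        occurrence[var] = (first, month)
```

Here "limit" disappears after 201402 and "utilisation" only appears from 201402 onwards - exactly the kind of policy-driven change this analysis is meant to surface.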
        c. To check for relevant variables and observations -> For developing credit risk models, not all the variables in the database can be used; only those variables which reflect credit and borrower risk will be included. There are two types of exclusions: observation exclusions and performance exclusions. Observations which satisfy an exclusion criterion in the observation period are called observation exclusions; similarly, accounts which satisfy an exclusion criterion in the performance period are known as performance exclusions. Any variables which capture operational risk or market risk are to be removed. For example, operational variables like Cheque_Bk_number, Branch_code, etc. are removed from the analysis since they reflect operational aspects of the information. Similarly, not all accounts (observations) are used in the model development exercise. Accounts which are fraud or bankrupt are removed since they are operational risk factors. Similarly, accounts which are deceased, closed or immaterial are removed from the analysis. For credit cards, 'lost cards' also form an important exclusion criterion.
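The exclusion logic above can be sketched as a two-step filter: first drop excluded observations (rows), then drop operational variables (columns). The flag values and field names below are assumptions for illustration.

```python
# Hypothetical exclusion flags and operational variables, mirroring the
# examples in the notes.
OBS_EXCLUSION_FLAGS = {"fraud", "bankrupt", "deceased", "closed", "lost_card"}
OPERATIONAL_VARS = {"Cheque_Bk_number", "Branch_code"}

accounts = [
    {"id": "A1", "excl_flag": None,     "Branch_code": "B01", "balance": 1200},
    {"id": "A2", "excl_flag": "fraud",  "Branch_code": "B02", "balance": 300},
    {"id": "A3", "excl_flag": "closed", "Branch_code": "B01", "balance": 0},
]

# Keep only non-excluded observations, then strip operational variables.
kept = [
    {k: v for k, v in a.items() if k not in OPERATIONAL_VARS}
    for a in accounts
    if a["excl_flag"] not in OBS_EXCLUSION_FLAGS
]
```

In practice the same filter would be run twice, once against observation-period flags and once against performance-period flags.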
        d. Identifying the missing percentage in the data -> Some variables must be looked into for missing observations, particularly ID variables, date variables and categorical variables (like product type, account type, etc.).
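A simple sketch of computing the missing percentage per variable, assuming missing values are recorded as None; the records are invented.

```python
# Invented records; None marks a missing observation.
records = [
    {"product_type": "card", "open_date": "2014-01-01"},
    {"product_type": None,   "open_date": "2015-06-01"},
    {"product_type": "loan", "open_date": None},
    {"product_type": None,   "open_date": "2013-03-01"},
]

# Percentage of missing observations for each variable.
missing_pct = {
    var: 100.0 * sum(r[var] is None for r in records) / len(records)
    for var in records[0]
}
```

Variables with a high missing percentage are candidates for exclusion or for an explicit missing-value treatment before modelling.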
        e. Descriptive univariate analysis of numerical and categorical variables -> For numerical variables, univariate analysis comprises the basic measures of central tendency (mean, median, mode), the basic measures of dispersion (range, variance, standard deviation) and the measures of location (percentiles, deciles, quartiles, etc.). The measures of location are very important for identifying and treating outliers. For categorical variables, the frequency distribution is used to analyse the behaviour of the variable over time. (PROC UNIVARIATE in SAS)
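The notes point to PROC UNIVARIATE in SAS; the same descriptive measures can be sketched with Python's standard statistics module. The balance figures are made up.

```python
import statistics

balances = [100, 200, 200, 300, 400, 500, 900]  # made-up numerical variable

summary = {
    "mean": statistics.mean(balances),
    "median": statistics.median(balances),
    "mode": statistics.mode(balances),
    "range": max(balances) - min(balances),
    "std_dev": statistics.pstdev(balances),             # population std dev
    "quartiles": statistics.quantiles(balances, n=4),   # Q1, Q2, Q3
}
```

The quartiles (and, with n=10 or n=100, deciles and percentiles) are the measures of location used to spot outliers such as the 900 here.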
How do we infer about abnormal trends in the data quality exercise?
          -> A RED-GREEN trigger is used to identify abnormal trends in the behaviour of the values. A normal distribution is a symmetric distribution of a variable about its mean; it identifies with a distribution which does not have any asymmetry created by the presence of extreme observations. Assuming that a variable has a symmetric frequency distribution, about 99.73% of the observations are expected to lie within +-3 standard deviations of the mean. So, a standard normal variate is created:
                Z = (Value - Mean)/Std_Deviation. If |Z| > 3 then a RED is triggered, else a GREEN is triggered.
For example, there is a balance_outstanding variable for twelve months in 2015. For each month the average value is calculated and the trend of that average is checked:

Month:            201501 201502 201503 201504 201505 201506 201507 201508 201509 201510 201511 201512
Mean_Outstnd_bal:   1500   1800   1650   1570   1770   1590   1700   1850   2000  15000   1680   2200

For 201510 the standard normal variate is greater than 3; therefore, we can say that there was an abnormal trend in that month.
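The RED-GREEN trigger on this example can be verified with a short script over the monthly means from the table above; only the 201510 value should trip the |Z| > 3 threshold.

```python
import statistics

# Monthly means of balance_outstanding from the example above.
monthly_mean = {
    "201501": 1500, "201502": 1800, "201503": 1650, "201504": 1570,
    "201505": 1770, "201506": 1590, "201507": 1700, "201508": 1850,
    "201509": 2000, "201510": 15000, "201511": 1680, "201512": 2200,
}

mu = statistics.mean(monthly_mean.values())
sigma = statistics.pstdev(monthly_mean.values())  # population std deviation

# RED when the standard normal variate exceeds 3 in absolute value.
triggers = {m: ("RED" if abs((v - mu) / sigma) > 3 else "GREEN")
            for m, v in monthly_mean.items()}
```

Note that the 15000 spike inflates both the mean and the standard deviation, which is why its Z value only just clears 3; in practice robust alternatives (median-based Z scores) are sometimes preferred, though the notes describe only the plain version.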
Step 5 - Variable Selection Process : This process describes the techniques of selecting the independent variables in the model. It acts as a waterfall of variables and helps us zero in on the most important variables which we need for developing our scorecard.
          Given a variable X1 for my scorecard, when will I consider it to be a potential variable for my model? (Remember the dependent variable was Y = 1 if event, Y = 0 if non-event.) -> X1 will be included as an explanatory variable for Y if it has the capacity to distinguish between the Y = 1 and Y = 0 groups. What are the techniques that will help me know whether a variable has this capacity?
          -> Parametric and non-parametric mean-difference tests (for numerical variables)
          -> Kolmogorov-Smirnov test (KS test) - compares the cumulative distributions of a variable between the event and non-event groups

-> Weight of Evidence and Information Value


-> Multicollinearity checks (Factor Analysis to solve the issue).
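As a sketch of the Weight of Evidence and Information Value check, the snippet below uses one common convention - WOE = ln(% of non-events / % of events) per bin, IV = sum of (% non-events - % events) x WOE. The bin counts are invented.

```python
import math

# Invented bin counts for a candidate variable; "good" = non-event (Y = 0),
# "bad" = event (Y = 1).
bins = {
    "low":    {"good": 400, "bad": 10},
    "medium": {"good": 300, "bad": 30},
    "high":   {"good": 300, "bad": 60},
}

total_good = sum(b["good"] for b in bins.values())
total_bad = sum(b["bad"] for b in bins.values())

woe = {}
iv = 0.0
for name, b in bins.items():
    pct_good = b["good"] / total_good   # share of non-events in this bin
    pct_bad = b["bad"] / total_bad      # share of events in this bin
    woe[name] = math.log(pct_good / pct_bad)
    iv += (pct_good - pct_bad) * woe[name]
```

By a common rule of thumb, an IV above roughly 0.3 marks a strong predictor; the invented counts here give an IV of about 0.62, so this variable would survive the waterfall.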
Step 6: Model Development Process : The model development process mainly comprises the statistical procedures that are used for identifying the causality between the independent variables and the dependent variable. For choosing the appropriate statistical technique for model development we must look into the type of the dependent variable: for continuous dependent variables we use the Linear Regression technique, while for modelling discrete dependent variables (categorical variables) we use the Logistic Regression approach. So the two main types of regression approach that we would discuss at this stage are:
1. LINEAR REGRESSION APPROACH 2. LOGISTIC REGRESSION APPROACH. Apart from the logistic and linear regression approaches, DECISION TREES (CHAID, CART analysis) are also used for developing segment-oriented models. There are two things that we need to understand from the model development perspective: a. Parameter estimation and significance tests of the parameters (TESTING OF A STATISTICAL HYPOTHESIS - basic concepts and understanding, STATISTICAL TESTS - T-TEST, F-TEST, CHI-SQUARE TEST, Z-TEST)
b. Score generation - The basic objective of a risk scorecard is to generate scores for the accounts. Under the model development exercise it is important to identify the appropriate choice of score generation and calibration mechanism.
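One widely used score-generation convention - not prescribed by these notes - is Points-to-Double-the-Odds (PDO) scaling, which maps a model probability into a score that rises by a fixed number of points each time the good:bad odds double. A sketch with assumed anchor values (600 points at 50:1 odds, PDO = 20):

```python
import math

# Points-to-Double-the-Odds scaling. The anchor values below are
# assumptions for illustration, not taken from the notes.
PDO = 20.0          # score points that double the good:bad odds
BASE_SCORE = 600.0  # score assigned at the base odds
BASE_ODDS = 50.0    # good:bad odds at the base score

factor = PDO / math.log(2)
offset = BASE_SCORE - factor * math.log(BASE_ODDS)

def pd_to_score(p: float) -> float:
    """Map a model probability of the event (default) to a scorecard score."""
    odds = (1.0 - p) / p                 # good:bad odds implied by the PD
    return offset + factor * math.log(odds)
```

Lower PDs map to higher scores, and moving from 50:1 to 100:1 odds adds exactly PDO points - the property that makes the scale easy to explain to business users.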
Step 8: Model Validation Process : The model validation process comprises techniques to identify the performance of the model on a validation set (which is different from the model development set) in order to understand the robustness of the model. The main validation checks include: tests of the model's discriminatory power, the model's accuracy, the model's stability and the rank ordering of the model. Each of these metrics is calculated for both the validation dataset and the model development dataset, and they are reconciled to check the robustness of the developed model.
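As a sketch of one discriminatory-power check, the KS statistic is the maximum gap between the cumulative capture rates of events and non-events when accounts are sorted by score. The validation records below are invented.

```python
# Invented (score, event_flag) pairs from a validation sample;
# event_flag = 1 marks a default.
validation = [
    (710, 0), (695, 0), (680, 0), (660, 1), (655, 0),
    (640, 0), (620, 1), (600, 0), (580, 1), (540, 1),
]

total_events = sum(flag for _, flag in validation)
total_non_events = len(validation) - total_events

# Walk from the riskiest (lowest) score upward; KS is the widest gap
# between the cumulative event and non-event capture rates.
ks = 0.0
cum_events = cum_non_events = 0
for _, flag in sorted(validation):
    cum_events += flag
    cum_non_events += 1 - flag
    ks = max(ks, abs(cum_events / total_events
                     - cum_non_events / total_non_events))
```

The same statistic computed on the development sample would then be reconciled against this validation figure; a large drop signals an unstable model.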
Step 9: Model Implementation and Future monitoring (OUT OF SCOPE OF SYLLABUS, ON
LY A WALK THROUGH DISCUSSION WILL BE DONE)