Comparing Tree Based Methods

Distinguishing the Forest from the TREES:
A Comparison of Tree Based Data Mining Methods

Richard Derrig, Ph. D. and Louise Francis, FCAS, MAAA
Ri c h a r d De r r i g , P h D,
OP AL Co n s u l t i n g LLC
41 F o s d y k e St r e e t
P r o v i d e n c e
R h o d e I s l a n d , 02906, U. S. A.
P h o n e : 0 0 1 - 4 0 1 - 8 6 1 - 2 8 5 5
emai l : r i c h a r d @d e r r i g . c o m
Lo u i s e Fr a n c i s , F CAS , AAA
Fr a n c i s Ana l yf i c s & Ac t ua r i a l Da t a Mi n i n g
706 L o mb a r d St r e e t
Ph i l a d e l p h i a
Pe n n s y l v a n i a , 19147, U. S. A.
P h o n e : 0 0 1 - 2 1 5 - 9 2 3 - 1 5 6 7
emai l : l o u i s e _ f r a n c i s @ms n . c o m
Abstract
In recent years a number of "data mining" approaches for modeling data containing nonlinear and ot her
complex dependencies have appeared in the literature. One of the key data mining techniques is decision trees,
also referred to as classification and regression trees or CART (Breiman et al, 1993). That met hod results in
relatively easy to apply decision rules that partition data and model many of the complexities in insurance data.
In recent years considerable effort has been expended to i mprove the qualit 3" of the fit of regression trees.
These new met hods are based on ensembles or networks of trees and cart 3, names like TREENET and
Random Forest. Viaene et al (2002) compared several data mining procedures, including tree met hods and
logistic regression, for prediction accuracy on a small fixed data set of fraud indicators or "red flags". They
found simple logistic regression did as well at predicting expert opinion as the more sophisticated procedures.
In this paper we will introduce some available regression tree approaches and explain how they are used to
model nonlinear dependencies in insurance claim data. We investigate the relative performance of several
software products in predicting the key claim variables for the decision to investigate for excessive a nd/ or
fraudulent practices, and the expectation of favorable results from the investigation, in a large claim database.
.Mnong the software programs we will investigate are CART, S-PLUS, TREENET, Random Forest and
Insightful Miner Tree procedures. The data used for this analysis are the approximately 500,000 auto injury
claims reported to the Detailed Claim Database (DCD) of the Automobile Insurers Bureau of Massachusetts
from accident years 1995 t hrough 1997. The decision to order an independent medical examination or a
special investigation for fraud, and the favorable outcomes of such decisions, are the modeling targets. We find
that the methods all provide some predictive value or lift from the available DCD variables with significant
differences among the met hods and the four targets. All modeling outcomes are compared to logistic
regression as in Viaene et al. with some model / soft ware combinations doing significantly betxer than the
logistic model.
Keywords: Fraud, Data Mining, ROC Cu~,e, Variable Importance, Decision Trees
- )
Derrig-Francis _005 - No more than two pqragraphs or one table or figure can be quoted without written
permission of the authors before March 1, 2006.
Casualty Actuarial Society Forum, Winter 2006 1
Distinguishing the Forest from the T R E E S
I NT R ODUC T I ON
I n recent years a number of approaches for model i ng data cont ai ni ng nonl i near and ot her
complex dependencies have appeared i n the literature. Many of the met hods were
developed by researchers from the comput er science, artificial intelligence and statistics
disciplines 1. The met hods have been widely characterized as data mining techniques. These
procedures include several that shoul d be of interest to actuaries dealing wi t h large and
complex data sets. The procedures of interest for the purposes of this paper are various
varieties of classification and regression trees or CART. Viaene et al (2002) applied a wider
set of procedures, i ncl udi ng neural networks, support vect or machines, and a dassical
general linear model, logistic regression, on a small single data set of i nsurance dai r n fraud
indicators or "red flags" as predictors of suspicion of fraud. They found simple logistic
regression did as well at predicting expert opi ni on on the presence of fraud as the mor e
sophisticated procedures. Stated differently, the logistic model performed well enough i n
model i ng the expert opi ni on of fraud that there was litde need for t he more sophisticated
procedures:.
A wide variety of statistical software is now available for i mpl ement i ng fraud and ot her
predictive models t hrough clustering and data mi ni ng. I n this paper we will i nt roduce a
variety of Regression Tree data mi ni ng approaches 3 and explain how they are used to model
nonl i near dependencies i n i nsurance claim data. We also investigate the relative performance
of several software products that i mpl ement these models. As an example of relative
performance, we test for the key claim variables i n the decision to investigate for excessive
a nd/ or fraudulent practices i n a large claim database. The software programs we
investigate are CART, S-PLUS, TREENET, Random Forests, and Insi ght ful Tree and
Ensembl e from the Insightful I~finer package. Naive Bayes and Logistic model s are used as
benchmarks. The data used for this analysis are the aut o bodily injury liability dai ms
reported to the Detailed Claim Dat abase 0DCD) of the Aut omobi l e Insurers Bureau of
Massachusetts from accident years 1995 t hrough 1997 ~. Three types of variables are
employed. Several variables t hought to be related to the decision to investigate are i ncl uded
here as report ed to the DCD, such as out pat i ent provi der medical bill amount s. A few
variables are i ncl uded that are derived from publicly available demographic data sources,
such as i ncome per househol d for each claimant' s zip code. Addi t i onal variables are derived
by accumul at i ng proport i onal statistics from the DCD; e.g., the distance from the claimant' s
zip code to the zip code of the first medical provi der or claimant' s zip code rank for t he
number of plaintiff attorneys per zip code. The decision to order an i ndependent medical
exami nat i on or a special investigation for fraud, and a favorable out come for each, are the
model i ng target.
Ei ght model i ng software results will be compared for effectiveness based on a st andard
procedure, the area under the receiver operating characteristic curve (AUROC). We find
that t he met hods all provide some predictive value or lift from the DCD variables we make
available, wi t h significant differences among the eight met hods and four targets. Model i ng
out comes can be compared to logistic regression as i n Viaene et al. but the results here are
different. They show some soft ware/ met hods can i mprove significantly on t he predictive
2 C a s u a l t y Ac t u a r i a l S o c i e t y Forum, Wi n t e r 2 0 0 6
Distinguishing the Forest from the T REES
ability of t he logistic model. That result may be due to the relative richness of this data set
a nd/ or t he types of i ndependent variables at hand compared to the Viaene data. We show
how "i mpor t ant " each variable is wi t hi n each soft ware/ model tested s and not e the type of
data that are i mpor t ant for this analysis. Thi s entire exercise shoul d provide practicing
actuaries with guidance on regression tree software and market met hods to analyze complex
nonl i near relationship commonl y found i n all types of i nsurance data.
The paper is organized as follows. Section 1 i nt roduces the general not i on of non-l i near
dependencies i n i nsurance data. Section 2 describes the data set of Massachusetts aut o bodily
injury liability claims and variables used for illustrating the model s and software
i mpl ement at i ons. Descriptions and illustrations of the data mi ni ng met hods applied i n the
paper appear i n Section 3 while the specific software procedures are covered i n Section 4.
Comparative out comes for the variables ("importance") and software ("AUROC") are
report ed i n Sections 5 and 6. We provide some i nt erpret at i on of the results i n terms of the
decision to investigate wi t hi n the Massachusetts data as an illustration of the usefulness of
the model i ng effort i n Section 7. Implications for the use of the software models are
discussed i n section 8. Cont us i ons are shown i n Section 9.
S E C T I ON 1. NONL I NE AR I T Y I N I NS UR ANC E DATA
Actuaries are nearly inseparable from data and data mani pul at i on techniques. Dat a come i n
all forms as a mat t er of course. Numeri c (loss ratios), categorical (injury types), and text
(accident description) data all flood insurers on a daily basis. Reserving and pricing are two
maj or functions of casualty actuaries. Reserving involves compi l i ng and underst andi ng
t hrough mathematical techniques historical patterns of a portfolio of i nsurance claims i n
order to predict an ultimate value. Pricing involves taking the best estimates of historical
cost data on claims and expenses, combi ni ng that data wi t h financial asset pricing models
that include projecting future values i n order to arrive at best estimates of all costs of
accepting underwri t i ng risk. Of course, actuaries continually l ook back at bot h analytic
exercises to det ermi ne the accuracy of those estimates as the real account i ng data develops
over time.
Traditionally, actuarial models were confi ned to linear, multiplicative or mi xed algebraic
equations i n the absence of the powerful comput i ng envi r omnent we enj oy today. Those
mostly manual met hods provided crude approximations that sufficed when alternative
met hods were unavailable or non-existent. Simple deviations from linear relationships, such
as escalating inflation, could be handled by simple t ransformat i ons of the data (log
transform) that allowed linear techniques to be applied to t he data. Gradually, over time
these t ransformat i on techniques became mor e sophisticated and could be applied to many
probl ems wi t h a variety of non-l i near data ~'.
Tr end fines of time series data, such as dal rn severity or frequency, are generally amenabl e to
linear techniques. However, data where interactions and cross correlations are essential to
the model i ng of the dynamics of the process underl yi ng the data, require mor e
Distinguishing the Forest from the TREES
compr ehensi ve t echni ques t hat yield mor e preci si on on mor e types of data compl exi t i es.
Fi gure 1-1 shows a part i cul ar non-l i near rel at i onshi p bet ween t wo i nsurance variables t hat
woul d be difficult, i f not i mpossi bl e, t o modal wi t h si mpl e techniques. One pur pose o f this
paper is t o demonst r at e a range of so-cal l ed artificial intelligence or statistical l earni ng
techniques t hat have been devel oped t o handl e compl i cat ed rel at i onshi ps wi t hi n dat a sets.
An Insurance Nonl i ne ar Funct i on:
Provi der Bill vs. Probabi l i ty of I ndependent Medi c al Ex a m
O9 O-
O 8 O -
o 7 0 -
|
~ o ~ -
o 5 o -
o 4o-
o a o -
F
1
I I I I I I I I f l l l l l l l l l l ml l l l l l l l l
P r o v i d e r 2 Bi l l
Figure 1 -1
Nearl y all regressi on and economet r i c academi c courses address the t opi c o f nonl i neari t y, at
least briefly. St udent s are i nst ruct ed in met hods t o det ect nonl i neari t y and how t o model it.
Det ect i on generally involves usi ng scat t er pl ot s of i ndependent versus dependent vari abl es
or evaluating pl ot s of residuals. Two met hods o f model i ng nonl i neari t y t hat are generally
taught: are 1) t r ansf or mat i on of variables and 2) pol ynomi al regressi on (Miller and Wi c he m 7,
1977, and Net er et al, 1985). For instance, i f an exami nat i on o f residual pl ot s i ndi cat es t hat
the magni t ude o f the residuals increases wi t h the size o f an i ndependent variable, the l og
t r ansf or mat i on is r ecommended. Pol ynomi al regressi ons are consi der ed useful
appr oxi mat i ons when a curvilinear rel at i onshi p exists but its exact f or m is unknown.
A general i zat i on of linear model s "known as Gener al i zed Li near Model s or GLM (McCullagh
and Nel der, 1989) enabl ed the model i ng o f mul t i vari at e rel at i onshi ps i n the presence o f
certain ki nds o f non- nor mal i t y (i.e. where the r andom component is f r om the exponent i al
family of di st ri but i on). The link funct i on of GLMs formal i zes t he i ncor por at i on of certain
nonl i near rel at i onshi ps i nt o the model i ng procedure: The t ransformat i ons i ncor por at ed i nt o
the c ommon GLMs are:
The i dent i t y link: h(Y) = Y
4 Casualty Actuarial Society Forum, Winter 2006
The l og link: h(Y) = lnC~
1
The inverse link: h(Y) = - - (1)
Y
The l ogi t link: h(Y) = l n ( l _ ~ )
The pr obi t link: h(Y) = ~( Y) , denot es t he nor mal CDF
Of these t ransformat i ons, the l og and logit t r ansf or mat i on appear frequently in the i nsurance
literature. Because many i nsurance variables are right skewed, the l og t r ansf or mat i on is
appl i ed to attained appr oxi mat e normal i t y and homogenei t y o f variance. I n addi t i on, apri ori
or domai n consi derat i ons (e.g., the rel at i onshi p bet ween the i ndependent variables and the
dependent variable is bel i eved to be mulfiplicative) somet i mes suggest the l og
t ransformat i on. The l ogi t t r ansf or m is commonl y used when the dependent variable is
binary.
Unf or mnat dy, while the t echni ques cited above add significantly to the analyst' s ability t o
model nonlinearity, t hey are not sufficient for many situations encount er ed in practice. I n
actual i nsurance data, compl ex nonl i near rel at i onshi ps are the rule r at her t han the except i on.
Some of the reasons t he t radi t i onal approaches oft en do not pr ovi de a satisfactory
appr oxi mat i on t o nonl i near funct i ons are:
The f or m of the nonl i neari t y may be ot her t han one o f t hose per mi t t ed by the
-known t r ansf or mat i ons whi ch pr oduce linearity. Fi gure 1-1 displays one such non-
linear funct i on based on the i nsurance dat abase used in this analysis.
Whi l e a pol ynomi al o f adequate degree can appr oxi mat e many compl ex functions,
ext rapol at i on beyond the data, or i nt erpol at i on wi t hi n the data, may be probl emat i c,
particularly for hi gher or der pol ynomi al s.
Det er mi ni ng the appr opr i at e t r ansf or mat i on (or pol ynomi al ) can be difficult i f not
i mpossi bl e when t here are many i ndependent variables, and the appr opr i at e rel at i on
bet ween the target and each i ndependent variable must be found.
The rel at i onshi p bet ween a dependent variable and an i ndependent variable may be
conf ounded by a t hi rd variable due to i nt eract i on or correl at i ons t hat are not si mpl e
to approxi mat e.
To r emedy these pr obl ems requires met hods where:
Any nonl i near rel at i onshi p can be approxi mat ed.
The analyst does not need to -know the f or m of the nonlinearity.
The effect o f i nt eract i ons can be easily det er mi ned and i ncor por at ed i nt o the model .
The met hod generalizes well on out - of - sampl e data for i nt er pol at i on or ext rapol at i on
purposes.
The regressi on tree met hods i ncl uded in our analysis meet these condi t i ons. Sect i on 3 of
this paper describes how each o f our met hods model s nonlinearity. We now t urn to a
descri pt i on o f the data set we will use in this analysis.
C a s u a l t y Ac t u a r i a l S o c i e t y Forum, Wi n t e r 2 0 0 6 5
SECTION 2. DESCRIPTION OF THE MASSACHUSETTS AUTO BODILY
INJURY DATA
The dat abase we will use for our analysis is a subset of the Aut omobi l e Insurers Bureau of
Massachuset t s Det ai l Claim Dat abase ( DCD) ; namely, t hose claims f r om acci dent years
1995-1997 t hat had cl osed by J une 30, 2003 (AIB, 2004). Al l aut o claims s arising f r om i nj ury
coverages: Per sonal Inj ury Pr ot ect i on ( PI P) / Me di c a l payment s excess of PI P 9, Bodi l y I nj ur y
Liability (BIL), Uni nsur ed and Under i nsur ed Mot ori st . Whi l e t here are mor e t han 500,000
claims in this subset of DCD data, we will rest ri ct our analysis t o the 162,761 t hi rd par t y BI L
coverage claims. This will allow us t o divide the sampl e i nt o training, test, and hol dout sub
sampl es, each cont ai ni ng in excess of 50,000 claims TM. The dat aset cont ai ns fi ft y-four
variables relating t o the i nsured, dai mant , accident, injury, medi cal t r eat ment , out pat i ent
medi cal pr ovi der s (2 maxi mum) , at t orney presence, and three claims handl i ng t echni ques for
mi t i gat i ng dai ms cost for t hei r presence, out come, and formul ai c savings amount s.
The claims handl i ng techniques t racked are: I ndependent Medi cal Exami nat i on (IME),
Medi cal Audi t (MA) and Special Invest i gat i on (SIU). I MEs are per f or med by l i censed
physicians o f t he same type as the treating physi ci an u. They cost appr oxi mat el y $350 per
exam wi t h a charge of $75 for no shows. They are desi gned t o verify cl ai med injuries and t o
evaluate t r eat ment modalities. One sign of a weak or bogus claim is the failure t o submi t t o
an I ME and, thus, an I ME can serve as a screeni ng devi ce for det ect i ng fraud and bui l d- up
claims. MAs are peer reviews of the injury, t r eat ment and billing. They are typically done by
physicians wi t hout a cl ai mant exami nat i on, by nurses on i nsurers' st af f or by t hi rd par t y
organi zat i ons, but also f r om expert systems t hat revi ew the billing and t r eat ment pat t erns 12.
Favor abl e out comes are r epor t ed by insurers when the damages are mi t i gat ed, the billing and
t r eat ment are curtailed, and when the cl ai mant refuses to under go t he I ME or does not
show. I n the l at t er t wo situations the i nsurer is on solid gr ound t o reduce or deny payment s
under the fai l ure-t o-cooperat e clause in the pol i cy) 3
Special Invest i gat i on (SIU) is r epor t ed when claims are handl ed t hrough non- r out i ne
investigative t echni ques (accident reconst ruct i on, exami nat i ons under oat h and surveillance
are examples), possi bl y i ncl udi ng an I ME or Medi cal Audi t , on suspi ci on of fraud. For the
mos t part , these claims are handl ed by Special Invest i gat i ve Units (SIU) wi t hi n the cl ai m
depar t ment or by some t hi rd part y investigative service. Occasi onal l y, compani es will be
or gani zed so t hat addi t i onal adjusters, not specifically a par t o f the company SIU, may also
conduct special investigations on suspi ci on of fraud. Bot h types are r epor t ed t o DCD and
we refer t o bot h by the shor t hand SIU in subsequent tables and figures. Favor abl e out comes
are r epor t ed for SIU i f the claim is deni ed or compr omi s ed based on the SIU i nvest i gat i on.
For pur poses o f this analysis and demonst r at i on of non-l i near model s and soft ware, we
empl oy t went y- one pot ent i al l y predi ct i ng variables and four target variables. Thi r t een
pr edi ct i ng variables are numeri c, t wo f r om DCD fields (F), eight deri ved f r om i nt ernal
demogr aphi c type dat a (DV), and three variables deri ved f r om ext ernal demogr aphi c dat a
(DM) as shown i n Tabl e 2-1.
6 C a s u a l t y Ac t u a r i a l S o c i e t y Forum, Wi n t e r 2 0 0 6
Aut o Injury Liability Claim Nume r i c Vari abl es
Variable N Type
Provider I_BILL 162,761 F
Provider 2_BILL 162,761 F
ARe 155,438 DV
Report La~ 162,709 DV
Treatla~ 147,296 DV
HouseholdsPerZipcode 118,976 DM
AveralgeHouseValue Per Zip 118,976 DM
IncomePerHousehold Per Zip 1 1 8 , 9 7 6 DM
Distance ~IP1 Zip to CLT. Zip) 72, 786 DV
Rankattl (rank art/zip/ 129,174 DV
Rankdoc2 (rank prov/zip/ 109,387 DV
Rankci~. (rank claimant city,) 118,976 DV
Rnkpcity (rank provider ci~') 162,761 DV
Valid N (lJstwise) 70,397
Mi ni mum M~Lximum
0 1,861,399
0 360,000
0 104
0 2,793
1 9
0 69,449
0 1,000,001
0 185,466
0 769
1 3,314
1 2,598
1 1,874
0 1,305
Std.
Me an De vi at i on
2,671.92 6,640.98
544. 78 1,805.93
34.15 15.55
47.94 144.44
3.29 1.89
10,868.87 5,975.44
166,816.75 77,314.11
43,160.69 17,364.45
38.85 76.44
150.34 343.07
110.85 253.58
77.37 172.76
30.84 91.65
N = Number of non missing records; F=DCD Field, DV = Internal derived variable, DM = External derived
variable
Source; Automobile Insurers Bureau of Massachusetts, Detail Claim Database, AY 1995-1997 aud Authors' Cakulations.
Tabl e 2-1
Ei g h t pr e di c t i ng var i abl es, a n d f our t ar get var i abl es ( I ME and SI U, De c i s i o n a nd Fa vor a bl e
Ou t c o me f or each) , are cat egor i cal var i abl es, all t aken as r e por t e d f r o m DC D a nd as s h o wn
i n Tabl e 2-2.
Vari abl e
Policy Type
Emergent, Treatment 162,761
Health Insurance 162,756
Provider I - Type 162,761
Provider 2 - T}'pe 162,761
2001 Territory 162,298
Attorney 162,761
Suspl (SIU Done / 162,761
Susp2 (IME Done / 162,761
Susp3 (SIU Favorable) 162,761
Susp4 (IME Favorable / 162,761
Injury Type 162,298
N = Nttmber of non missing records
Aut o Injury Liability Claim CateBorical Variables
N
Type Type Description
162,761 F Personal 92%, Commercial 8%
F None 9%, Onl,v 22%, w Outpatient 68%
F Yes, 15%, No 26%, Unknown 60%
F Chi r o 41%, Physical Th. 19%, Medical 30%, None 10%
F Chi r o 6%, Physical Th. 6%, Medical 36%, None 52%
F Rating Territories 1 (2.2%) Through 26 (1.3%); Territory 1-
16 by increasin~ risk, 17-26 is Boston
F :kttorne~, present (89%), no attorney (11%)
F Special Invesfi~tion Done (70/0/, No SIU (93%)
Independent Medical Examination Done (8%), No IME
F (920/o /
Special Investagation Favorable 0.4%), Not Favorable/Not
F Done (95.6% /
Independent Medical Exa m Favorable (4.4%), Not
F Favorable/Not Done (96.6% /
Injury Types (24) including manor visible (4O/o), strain or
F sprain, back and/or neck (81%), fatality (0.4%), disk
herniation (1%) and others
F= DCD Field
Note: Descriptive percentages may not add to 100% due to rounding
Source: Automobile Insurers Bureau of Massachusetts, Detail Claim Database, AY 1995-1997 andAuthors' Calculations.
Table 2-2
Similar claim investigation variables are now bei ng collected by t he Insurance Research
Council in their periodic sampling of countr3avide injury claims (IRC, 2004a, pp 89-104) 14.
Nationally, about 4% and 2% of BI claims i nvol ved I MEs and SIU respectively, only one-
hal f t o one-quart er o f t he Massachusetts rate. Most likely, this is because (1) a maj ori t y of
ot her states have a full t ort system and so BI L contains all injury claims and (2)
Massachusetts is a fairly urban state wi t h high claim frequencies and mor e dubious claimslk
I n fact, t he mos t recent IRC study shows (IRC, 2004b, p25) Massachusetts has t he hi ghest
percentage o f BI claims in no-fault states that are suspected of fraud (23%) a nd/ or buildup
(41%). I t is t herefore, entirely consi st ent for t he Massachusetts claims t o exhibit mor e non-
rout i ne claim handling techniques. Favorabl e out comes average about 67% when an I ME is
done or a claim is referred t o SIU. We now t urn t o descriptions o f t he types of model s, and
t he software that i mpl ement s t hem, in t he next t wo sections bef or e we describe how t hey are
applied to model the I ME and SIU target variables.
S E C T I ON 3. MODE L S F OR NON- L I NE AR DE P E NDE NC I E S
Ho w mo d e l s ha ndl e nonl i ne a r i t y
Tradi t i onal actuarial and statistical techniques oft en assume that t he functional r dat i onshi p
bet ween t he i ndependent variables and t he dependent variable is linear or that s ome
t ransformat i on of the data exists that can be treated as linear. Insurance data of t en cont ai n
variables where t he relationship among variables is nonlinear. Typically when nonl i near
relationships exist, t he exact nature o f the nonlinearity (i.e., wher e some t ransformat i on can
be used t o establish linearity) is not known. I n t he field o f data mining, a number of
nonparamet ri c techniques have been devel oped whi ch can model nonl i near relations wi t hout
any assumpt i on bei ng made about t he nature of t he nonlinearity. We cover how each of our
met hods reviewed i n this paper model s nonlinearities in t he fol l owi ng t wo examples. The
variables i n this exampl e were selected because of a known nonl i near relationship bet ween
i ndependent and dependent variables.
Ex. 1 The dependent variable, a numeri c variable, is total paid losses and t he
i ndependent variable is pr ovi der 2 bill. Tabl e 3-1 displays average paid losses at various
bands o f provi der 2 bilP ~.
Ex. 2 The dependent variable, a binary categorical variable, is whet her or not an
i ndependent medical exam is request ed and t he i ndependent variable again is provi der 2
bill.
Nonlinear Example Data
Provider 2 Bill (Banded)
Zero
1 - 250
251 - 500
501 - 1,000
1,001 - 1,500
1,501 - 2,500
2,501 - 5,000
5,001 - 10,000
10,001+
All Claims
Avg Provider 2 Bill Avg Total Paid
9,063
Percent IME
6%
154 8,761 8%
375 9,726 9%
731 11,469 10%
1,243 14,998 13%
1,915 17,289 14%
3,300 23,994 15%
6,720 47,728 15%
21,350 83261 15%
545 11,224 8%
Table 3-1
Tr e e s
Trees, also known as classification and regression trees (CART) fit a model by recursively
partitioning the data i nt o t wo groups, one group wi t h a hi gher value on the dependent
variable and the ot her group wi t h a l ower value on t he dependent variable. Each partition
o f the tree is referred t o as a node. When a parent node is split, t he t wo children nodes, or
"l eaves" of t he tree, are each mor e homogenous (i.e., less variable) wi t h respect t he
dependent variable 17. A goodness o f fit statistic is used t o select t he split whi ch maximizes
t he difference bet ween the t wo nodes. When t he i ndependent variable is numeric, such as
pr ovi der 2 bill, t he split takes t he f or m o f a cut poi nt , or threshold: x > c and x < c as in
Figure 3-1.
CART Exampl e of Parent and Children Node s
Tot al Paid as a Functi on of Provider 2 Bill
1 ~ S p l i t
I
i i
Figure 3-1
The cut poi nt c is f ound by evaluating all possible values for splitting the numeri c variable
i nt o higher and l ower groups, and selecting the value that optimizes t he split i n some
manner . When t he dependent variable is numeri c, the split is typically based on the value
whi ch results i n the greatest reduct i on i n residual sum of squares. For this example, all values
of provi der 2 bill are searched and a split is made at t he value $5,021. All claims wi t h
provi der 2 bill less t han $5,021 go to t he left node and "predi ct " a total paid of $10,770 and
all claims wi t h provi der 2 bill greater t han $5,021 go t o the right node, and "predi ct " a total
paid of $59,250. Thi s is depicted i n Figure 3-1. The tree graph shows that the total paid
mean is significantly lower for the claims wi t h provi der 2 bills less t han $5,021.
One statistic oft en used as a goodness of fit measure t o optimize tree splits is sum squared
error or the total squared devi at i on of actual values ar ound the predicted values. The selected
cut poi nt is t he one whi ch produces the largest reduct i on i n total sum squared errors (SSE).
That is, for the entire database t he total squared deviation of paid losses ar ound the
predicted value (i.e., t he mean) of paid losses is 4.95x10 is. The SSE declines t o 4.66x10 is
after the data are part i t i oned usi ng $5,021 as the cut-point. Any ot her part i t i on of the
provi der bill produces a larger SSE t han 4.66x1013. For instance, i f a cut' point of $10,000 is
selected, t he SSE is 4.76x1013.
The two nodes i n Figure 3-1 can each be split i nt o to children nodes and these can t hen be
further split. The sequential splitting cont i nues unt i l no i mpr ovement i n the goodness of fit
statistic occurs. The nodes cont ai ni ng the result of all the splits resulting from applying a
sequence of decision rules are the final nodes oft en referred to as t ermi nal nodes. The
t ermi nal nodes provi de t he predicted values of t he dependent variables. When the dependent
variable is numeri c, the mean of the dependent variable at the t ermi nal nodes is the
prediction.
The curve of the predicted value resulting from a tree fit to total paid losses is a step
function. As shown i n Figure 3-2A, with only two terminal nodes, the fitted funct i on is flat
unt i l $5,021, steps up to a higher value and t hen remains flat. Figure 3-2B displays the
predicted values of a tree wi t h 7 terminal nodes. The steps or increases are mor e gradual for
this function.
CART Exampl e w/ t h Two and Seven No de s
Total Paid as a Funct i on of Provider 2 Bill
| t
' 1 4 -
o
o
Fi gure 3-2A Fi gure 3-2B
The procedure for model i ng data where the dependent variable is categorical (binary i n our
example) is similar to that of a numeri c variable. For instance, one of the fraud surrogates is
i ndependent medical exam (IME) requested. The target class is claimants for whom an I ME
was requested and the non-t arget group of (presumably) legitimate claims is that where an
I ME was not requested. At each step, the tree procedure selects the split that best i mproves
or lowers node impurity. That is, i t attempts t o part i t i on the data i nt o two groups so that
one part i t i on has a significantly higher pr opor t i on of the target category, I ME requested,
t han the ot her node. A number of statistical goodness of fit statistics measures is used i n
different product s to select the optimal split. These i ncl ude ent ropy/ devi ance and Gi ni
i ndex (which is described later i n this paper). Kant ardzi c (2003), Brei man et al (1993) and
Venibles and Ripley (1999) describe the comput at i on and application of the Gi ni i ndex and
ent ropy/ devi ance measures is. A score or probability can be comput ed for each node after a
split is performed. This is generally estimated based on the number of observations i n the
target groups versus the total number of observations at the node. The score or probability
Casualty Actuarial Society Forum, Wi nt er 2006 11
is frequently used to assign records t o one of the t wo classes. Typically, i f t he model score
exceeds a t hreshol d such as 0.5, the record is assigned t o t he target class; ot herwi se it is
assigned t o the non-t arget class.
Figure 3-3A displays t he result of using a tree pr ocedur e t o predict a categorical variable
f r om t he AI B data. The graph shows that each t i me t he data is split on pr ovi der 2 bill; one
child node has a l ower pr opor t i on and t he ot her a hi gher pr opor t i on o f claimants recei vi ng
I MEs. The fitted tree funct i on is used t o model a nonl i near relationship bet ween prmfider
bill and t he probability that a claim receives an I ME as shown in Figure 3-3B.
CART Exampl e wi t h Seven No de s
IME Proporti on as a Funct i on of Provider 2 Bi l l
I .
t
!
e
Fi gure 3-3A
CART Exampl e wi t h Seven Step Func t i ons
IME Proporti on as a Funct i on of Provider 2 Bill
Fi gure 3-3B
Tr ee model s use categorical as well as numeri c i ndependent variables in model i ng compl ex
data. However , because t he levels on categorical data may not be ordered, all possible t wo-
way splits o f categorical variables must be consi dered bef or e t he data are partitioned.
E n s e mb l e Mo d e l s - B o o s t i n g
Ensembl e model s are composi t e tree models. A series of trees is fit and each tree i mpr oves
t he overall fit o f t he model. I n t he data mi ni ng literature t he t echni que is oft en referred t o as
12 Casualty Actuarial Society Forum, Wi nt er 2006
"boost i ng" (Hastie et al 2001, Frei dman, 2001). The met hod initially fits a small tree of say 5
to 10 t ermi nal nodes on a training dataset. Typically, the user specifies the number of
terminal nodes, and every tree fit has the same number of t ermi nal nodes. The error, or
difference bet ween t he actual and fitted values, is comput ed and used i n anot her r ound of
fitting as a dependent variable. The error is also used i n the comput at i on of t he weight i n
subsequent rounds of fitting, wi t h records cont ai ni ng larger errors receiving higher weighting
i n the next r ound of estimation.
One algorithm for comput i ng t he weight is described by Hastie et a119. Consi der an ensembl e
of trees 1, 2, ...,M. The error for the m 'h tree measures the departure of the actual from t he
fitted value on the test data after the m 'h model has been fit. When the dependent variable is
categorical, as it is i n the fraud application i n this paper, a c ommon error measure used i n
boost i ng is:
N
~"w I( y, * F ( x ) )
e r r = ' =' N (2)
~ w
I=1
where N is the total number of records, w, is a weight (which is initialized to 1 / N i n the first
r ound of fitting), I is an i ndi cat or funct i on equal to zero i f the category is correctly predicted
and one i f the class assigned is incorrect, y, is the dependent variable, x is a matrix of
predictors and Fm(x ) is the prediction for the i 'h record of the m 'h tree.
Then, the coefficient alpha is a funct i on of the weight:
log(1 - e r r m
~m = )
e r r ,
and the new weight is:
w,.m+ 1 =wm exp( aml ( y, #Fm (x)))
(3)
The process is performed many times unt i l no further statistical i mpr ovement i n the fit is
obtained.
The specific boost i ng procedures i mpl ement ed differ among different software products.
For instance, TREENET (Freidman, 2001) uses stochastic gradi ent boosting. Stochastic
gradient boost i ng i ncorporat es a number of procedures whi ch at t empt to build a mor e
robust model by cont rol l i ng the t endency of large complex model s to overfit the data. A key
technique used is resampling. A new sample is randoml y drawn from the training data each
time a new tree is fit to t he residuals from the prior r ound of model estimation. The
goodness of fit of the model is assessed on data not i ncl uded i n the sample, the test data.
Anot her procedure used by TREENET to cont rol overfitfing is s hri nkage or regul af f z at i on. A
simple way to i mpl ement shrinkage is to apply a weight whi ch is greater t han zero and less
t han one to the cont r i but i on of each tree as it is added to the weighted average estimate.
Casual t y Act uar i al Soci et y Forum, Wi nt er 2006 13
Alternatively, the Insi ght ful Miner Ensembl e model employs a simpler i mpl ement at i on of
boost i ng whi ch applies non-st ochast i c boost i ng and uses all the training data i n each r ound
of fitting.
The final estimate resulting from an ensembl e approach will be a weighted average o f all t he
trees fit. Usi ng a large collection of trees allows:
Many different variables to be used. Some of these woul d not be used i n smaller
model s a'.
Many different models are used. The predictive model i ng literature (Hasfie et al.,
2001, Francis, 2003a, 2003c) indicates that composites of multiple model s per f or m
bet t er t han the prediction of a single model -~1.
Di fferent training and test records are used (with stochastic gradient boosting). Thi s
makes the procedure mor e r obust to the i nfl uence of a few extreme observations.
The met hod of fitting many (often 100 or more) small trees results i n fitted curves whi ch are
al most smoot h. Figures 3-4A and 3-4B display two nonl i near funct i ons fit to total paid and
I ME variables by the TREENET ensembl e model.
Ensemble Prediction of Total Paid
~ 4000000-
~ 30000 00-
_~ 20000 00-
f
i i i i l ~ l l l l l l l l l l l l l l 1 1 1 1 1 1 1
P r o v i d e r 2 B i l l
Figuee 3-4 A
09O-
0 S O -
o 7o-
~ o e o -
|
o . . , o-
o 4o.
o 3 o -
V-
i i I I I I I I I I I t l l l l l l l l l l l l l l l ' l
P r o v i d e r 2 B i l l
Figure 3-4 B
E n s e mb l e Mo d e l s - B a g g i n ~
Bagging is an ensembl e approach based on resampling or bootstrapping. Bagging is an
acronym for "boot st rap aggregation" (Hastie et al., 2000). Bagging does not use the error
from the prior r ound of fitting as a dependent variable or weight i n subsequent rounds of
fitting. Bagging uses recursive sampling of records i n the data to fit many trees. For
instance an analyst may decide to take a 50% of the data as a training set each time a model
is fit. Under bagging, 100 or mor e model s may be fit, each one t o a di fferent sample. The
trees fit are not necessarily small trees wi t h 5 t o 10 terminal nodes as wi t h boost i ng and each
tree may have a di fferent number of terminal nodes. By averaging t he predictions of a
number o f boot st rap samples, bagging reduces t he predi ct i on variance. The i mpl ement at i on
of bagging used in this paper is known as Random Forest. I n addition to using onl y a
sample of t he data each t i me a tree model is fit, Random For est also samples f r om t he
variables. For t he analysis in this paper, one third of t he variables were sampl ed for each
tree fit.
Figures 3-5A displays an ensembl e Random For est tree fit t o total paid losses and Fi gure 3-
5B displays a tree fit t o I ME.
Random Forest Predi ct i on of Tot al Pai d
I I I I I I I
0 50000 150000 250000 350000
Provider 2 B i l l
Figure 3-5 A
Random Forest Predi ct i on of I ME
c;
o
o
c;
g
c~
I i I I i I i
50000 150000 250000 350000
Pr0v~der 2 B i l l
Figure 3-5 B
Nai ve Baves
The Naive Bayes met hod is a relatively simple and easy to i mpl ement met hod. I n our
comparison, we xdew it as a benchmar k data mi ni ng met hod. That is, we are interested i n
how more complex met hods i mpr ove performance (or not) against an approach where
simplifying assumpt i ons are made i n order to make the comput at i ons mor e tractable. We
also use logistic regression model s as a second benchmark.
The Naive Bayes met hod was developed for categorical data. Specifically, bot h dependent
and i ndependent variables are categorical. Therefore, its application to fitting nonl i near
functions will be illustrated onl y for the categorical target variable IME. I n order to utilize
numeri c predictor variables it was necessary to derive new categorical variables based on
discretizing, or "bi nni ng", the di st ri but i on of data for the numer i c variables = .
The key simplifying assumpt i on of the Naive Bayes met hod is the assumpt i on of
i ndependence. Al l predictor variables are assumed to act i ndependendy i n i nfl uenci ng the
target variable. Int eract i ons and correlations among the predi ct or variables are not
considered:
Bayes rule is used to estimate the probability that a record wi t h gi ven i ndependent variable
vect or X = {x} is i n category C = {c,} of the dependent variable.
P(cj Ix,)=P(x, Icl) P(cl) / P(x,) (4a)
Because of the Naive Bayes assumpt i on of condi t i onal i ndependence, the probability that an
observat i on ~11 have a specific set of values for the i ndependent variables is the pr oduct of
the conditional probabilities of observi ng each of the values given category c,
P ( X I cs ) =I'-I P(x , I c, ) (4b)
J
The met hod is described i n more detail i n Kant ardzi c (2003). To illustrate the use of Naive
Bayes i n predicting discrete variables, the provi der 2 bill data was bi nned i nt o groups based
on the quintiles of the distribution. Because about 50 percent of the dai ms have a value of
zero for provi der 2 bill, onl y four categories are created by t he bi nni ng procedure. The new
variable was used to estimate the I ME targets. Figure 3-6 displays a bar pl ot of t he predicted
probability of an I ME for each of the groups. Figure 3-7 displays the fitted function. Thi s
funct i on is a step funct i on whi ch changes value at each boundar y of a provider 2 bill bin.
Bayes Predi cted Probability I ME Req uest ed vs. Quintile of Provider 2 Bill
~ . t ac ~ x -
) . 1c ~ x -
: , xc c ~x-
Provider 2 Bill Ouintile
Figure 3-6
OA20000-
|
5
~ 0 0000o0.
Na i v e Bayes Predi cted I ME vs. Provi der 2 Bill
i l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l
Provi der 2 Bi l l
Figure 3-7
S E C T I ON 4. S OF T WAR E F OR MODE L I NG NON- L I NE AR DE P E NDE NC I E S
No n a d d i t i v i t v : i n t e r a c t i o n s
Convent i onal statistical model s such as r egr essi on and logistic r egr essi on assume not onl y
linearity, but also addi t i vi t y o f t he pr edi ct or variables. Un d e r additivity, t he effect o f each
vari abl e can be can be added t o t he model one at a time. Wh e n t he exact f or m o f t he
r el at i onshi p bet ween a de pe nde nt and i nde pe nde nt vari abl e depends o n t he val ue o f one or
mor e ot he r vari abl es, t he effect s are not addi t i ve and one or mor e i nt er act i ons exist. For
i nst ance, t he r el at i onshi p bet ween pr ovi der 2 bill and I ME may var y by t ype o f i nj ury (i.e.
t raumat i c i nj uri es versus sprai ns and strains). I nt er act i ons are c o mmo n i n i nsur ance dat a
(Wei sberg and Der r i g, 1998, Franci s, 2003c).
Wi t h convent i onal l i near statistical model s, i nt er act i ons are i ncor por at ed wi t h mul t i pl i cat i ve
t erms:
Y = a + bl X 1 + b2X2 + b3*XI*X 2
(s)
I n t he case o f a t wo-way i nt er act i on, t he i nt er act i on t er ms appear as pr oduct s o f t wo
variables. I f one o f t he t wo vari abl es is categorical, t he i nt er act i on t er ms allow t he sl ope o f
t he fitted line t o vary wi t h t he levels o f t he categorical vari abl e. I f b o t h vari abl es are
cont i nuous t he i nt er act i on is a bi l i near i nt er act i on (Jicard and Turri si , 2003) and t he sl ope o f
one vari abl e changes as a l i near f unct i on o f t he ot he r variable. I f b o t h vari abl es are
categorical t he mode l is equi val ent t o a t wo f act or ANOVA wi t h i nt eract i ons.
The convent i onal approach to handl i ng interactions has some limitations.
Onl y a limited number of types of interactions can be model ed easily.
I f many predictor variables are i ncl uded i n the model , as is oft en t he case i n many
predictive model i ng applications, it can be tedious, i f not impossible, to find all the
significant interactions. Incl udi ng all possible interactions i n the model wi t hout
regard to their significance likely results i n a model whi ch is over-parameterized.
The tree-based data mi ni ng techniques used i n this paper each have efficient met hods for
handl i ng interactions.
Int eract i ons are i nherent i n the met hod used by trees to part i t i on data. Once data
have been partitioned, different partitions can and typically do split on different
variables and capture different interactions among the predictor variables. When t he
decision rules used by a tree to reach a t ermi nal node i nvol ve mor e t han one variable,
i n general, an interaction is bei ng modeled.
Ensembl e met hods incorporate interactions because they are based on the tree
approach.
Nai ve Bayes, because it assumes condi t i onal i ndependence of the predictors, ignores
interactions.
Logistic regression incorporates interactions i n the same way ordi nary least squares
regression does, with product i nt eract i on terms. I n this fraud compari son study, no
at t empt was made to i ncorporat e i nt eract i on terms as this procedure lacks an
efficient way to search for the significant interactions.
Mul t i pl e predictors
Thus far, the discussion of the tree-based models concerned onl y simple one or t wo variable
models. Ext endi ng the tree met hods to i ncorporat e many pot ent i al predictors is
straightforward. For each tree fit, the met hod proceeds as follows:
For each variable determine the best two-way part i t i on of the data.
Select the variable which produces the best i mpr ovement i n the goodness of fit
statistic to split the data at a particular node.
Repeat the process unt i l no further i mpr ovement i n fit can be obtained.
Software for model i ng nonlinear dependenci es and testi ng the model s
Four software products were i ncl uded i n our fraud comparison: They are CART,
TKEENET,
S-PLUS (R) and Insightful Mi n e r 23.
CART and TREENET are Salford Systems stand-alone software product s that each
performs one technique. CART (Classification and Regression Trees) does tree analysis and
TREENET applies stochastic gradient boost i ng usi ng the met hod described by Fr ei dman
(2001). All the software tested produce SAS c o d e 24 that can be used to i mpl ement the model
i n a pr oduct i on stage. All the product s cont ai n a procedure for handl i ng mi ssi ng values
usi ng surrogate variables. At any given split poi nt , CART and TREENET find the variable
that is next i n i mport ance i n i nfl uenci ng the target variable and they use this variable to
replace the mi ssi ng data. The specific statistic used to rank t he variables and find the
surrogates is described i n Bri eman et. al. (1993). Di fferent versions of CART and
TREENET handle different size databases. The number of levels of categorical variables
affects how much memor y is needed, as mor e levels necessitate mor e memory. The 128k
versi on of each product was used for this analysis. Wi t h approximately 100,000 records i n
the t rai ni ng data, occasional memor y probl ems were experienced and i t became necessary to
sample fewer records. One of the very useful features of the Salford Systems software is
that all the products rank variables i n i mport ance :5.
S-PLUS and R are comprehensi ve statistical languages used to per f or m a range of statistical
analyses i ncl udi ng exploratory data analysis, regression, ANOVA, generalized linear models,
trees and neural networks. Bot h S-PLUS and R are derived from S, a statistical programmi ng
language originally developed at Bell Labs. The S progeny, S-PLUS and R, are popul ar
among academic statisticians. S-PLUS is a commercial product sold by Insightful whi ch has
a true GUI interface that facilitates easier handl i ng of some functions. Insightful also
supplies technical support. The S-PLUS programmi ng language is widely used by analysts
who do serious number crunching. They find it more effective, especially for processes that
are frequently repeated. R is free open source statistical software that is supported largely by
academic statisticians and comput er science faculty. It has onl y limited GUI functionality
and the data mi ni ng funct i ons must be accessed t hrough the language. Most code written
for S-PLUS will also work for R. One not abl e difference is that data must be convert ed to
text mode to be read by R (a bi t of an i nconveni ence, but usually not an i nsurmount abl e
one). Fox (2002) points out some of the differences bet ween the two languages, where they
exist. The S-PLUS procedures used here i n the fraud compari son are found i n bot h S-PLUS
and R. However one ensembl e tree met hod used, Random Forest, appears onl y to be
available i n R. The S-PLUS (R) procedures used were: the tree funct i on for decision trees
and the gl m (generalized linear models) for logistic regression. S-PLUS (R) incorporates
relatively crude met hods for handl i ng mi ssi ng values. These i ncl ude eliminating all records
with a mi ssi ng value on any variable, an approach whi ch is generally not r ecommended
(Francis 2005, AUsion 2002). S-PLUS also creates a new category for missing values (on
categorical variables) and allows abort i ng the analysis i f a missing value is found. I n general,
i t is necessary to preprocess the data (at least the numeri c variables where there is no mi ssi ng
value met hod 2~) to make a provi si on for the missing values. I n the fraud comparison, a
const ant not i n the range of the data was substituted i nt o the variable and an indicator
dummy variable for mi ssi ng was created for each numeri c variable with missing values. S-
PLUS and R are generally not considered optimal choices for analyzing large databases.
Aft er experiencing some difficulty reading training data of about 100,000 records i nt o S-
PLUS, the database was reduced to cont ai n onl y the variables used i n the analysis. Once the
data was read i nt o S-PLUS, few probl ems were experienced. Anot her eccentricity is that the
S-PLUS tree funct i on can onl y handle 32 levels on any given categorical variable, so i n the
preprocessing the number of levels may need to be reduced 27. The R Random Forest
funct i on incorporates a procedure that can be used to rank variables i n importance. The
procedure produces an imp urity statistic which can be used to rank the variables. The
i mpuri t y is based on t he Gi ni index for classification applications and mean squared error for
numeri c dependent variables. The S-PLUS tree f uncdon cont ai ns no bui l t -i n capability for
ranki ng variables i n importance. Therefore usi ng the S-PLUS language, an algorithm was
coded i nt o S-PLUS to rank the variables. The met hod is described i n Francis (2001) and
Pot t s (2000). The procedure quantifies how much the error increases when a variable is
removed from the model ; the larger the increase i n errors, the more i mpor t ant the variable.
The Insi ght ful Mi ner is a data mi ni ng s.uite that cont ai ns the most common data mi ni ng
tools: regression, logistic regression, trees, ensembl e trees, neural networks and Nai ve
Bayes ~. As ment i oned earlier, Insightful also markets S-PLUS. However, the Insi ght ful
Mi ner has been opt i mi zed for large databases and cont ai ns met hods (Naive Bayes) whi ch are
not part of S-PLUS (R). The Naive Bayes, Tree and Ensembl e Tree procedures f r om
Insightful Mi ner are used here i n the fraud comparison. The insightful Mi ner has several
procedures for automatically handl i ng mi ssi ng values. These are 1) drop records wi t h
mi ssi ng values, 2) randoml y generate a value, 3) replace wi t h the mean, 4) replace wi t h a
const ant and 5) carry forward the last obseta-ation. Each mi ssi ng value was replaced wi t h a
constant. I n theory, the data mi ni ng met hods used, such as trees, shoul d be able to part i t i on
records coded for mi ssi ng from the ot her obse~-ations wi t h legitimate categorical or numer i c
values and separately estimate their impact on the target variable (possible after allowing for
interactions wi t h ot her variables). Server versions of the Insightful Mi ner generate C code
that can be used i n deploying the model, but the versi on used i n this analysis did not have
that capability. As ment i oned above some preprocessing was necessary for the Naive Bayes
procedure. Since Insi ght ful Miner cont ai ns no procedure for ranki ng variables i n
i mport ance, no rankings were provided for the I mi ner met hods.
Val i dat i ne and Te s t i n~
v
It is common i n data mi ni ng circles to partition the data i nt o three groups (Hastie et al.,
2001). One group is used for "training", or fitting the model. Anot her group, referred t o as
the validation set, is used for "testing" the fit of the model and re-estimating parameters i n
order to obt ai n a bet t er model. It is common for a number of iterations of testing and
fitting to occur before a final model is selected. The third group of data, the "hol dout "
sample, is used to obt ai n an unbi ased test of the model ' s accuracy. An alternative approach
to a validation sample that is especially appropriate when the sample size used i n the analysis
is relatively modest , is cross-validation. Cross-validation is a met hod i nvol vi ng hol di ng out a
por t i on of the t rai ni ng sample, say one fifth of the data, fitting a model to the remai nder of
the data and testing it on the held out data. I n the case of 5-fold cross validation, the
process is repeated five times and the average goodness of fit of the five validations is
comput ed. The various software products and procedures have different met hods for
validating the models. Some (Insightful Mi ner Tree) onl y allow cross-validation. Ot hers
( TREENET) use a validation sample. S-PLUS (R) allows either approach -~ to be used (so a
test sample of about 20% of the training data was used as we had a relatively large database).
Nei t her validation sample nor cross-validation was used with Nai ve Bayes, Logistic
Regression or the Ensembl e Tree.
I n this analysis, approximately a third of t he data, about 50,000 records, was used as the
hol dout sample for the final testing and compari son of the models. Two key statistics oft en
used to compare model s accuracy are sensitivity and spedficity. Sensitiv ity is the percentage of
events (i.e., claims wi t h an I ME or referred t o a special investigation unit) that were
predicted to be events. The q ~ecifid~y is the percentage of nonevent s (in our applications
claims believed to be legitimate) that were predicted to be nonevent s. Bot h of these
statistics should be high for a good model. Tabl e 4-1, oft en referred to as a confusi on
matrix (Hasfie et. al., 2001), presents an example of the calculation.
Sample Conf usi on Matrix: Sensitivity and Specificity
True Cl ass
Predi ct i on No Yes Row Tot al
No 800 200 1 ,000
Yes 200 400 600
Col umn Tot al 1 ,000 600
Correct Tot al Pement Correct
S ensi t i vi t y 800 1 ,000 80%
S p eci fi ci t y 400 600 67%
Table 4-1
I n the example confusi on matrix, 800 of 1,000 non- event s are predicted to be non-event s so
the sensitivity is 80%. The specificity is 67% since 400 of 600 true positives are accurately
predicted.
S ECTI ON 5. SOFTWARE RANKI NGS OF "I MPORTANT" VARI ABLES I N
T HE DECI S I ON TO I NVESTI GATE: I ME AND SI U
The remainder of this paper is devoted to illustrating the usefulness and effectiveness of
eight model / soft ware combinations applied to our Example 2, the decision to investigate via
IMEs or referral to SIU. We model the presence and proport i on of favorable outcomes, of
each investigative technique for the DCD subset of automobile bodily injury liability (third
party) claims from 1995-1997 accident years. 3" We employ twenty-one potentially predicting
variables of three types: (1) eleven typical claim variable fields informative of injury claims as
reported, bot h categorical and numeric, (2) three external demographic variables that may
play a role in capturing variations in investigative dairn types by geographic region of
Massachusetts, and (3) seven internal "demographic" variables derived from informative
pattern variables in the database. Variables of type 3 are commonly used in predictive
modeling for marketing purposes. The variables used for these illustrations are by no means
optimal choices for all three types of variables. Optimization can be approached by other
procedures (beyond the scope of this paper) that maximize information and minimize cross
correlations and by variable construction and selection by domain experts.
The eight model / soft ware combinations we xxtll use here are of two categories: six tree
models, and two benchmark models (Naive Bayes and Logistic). They are:
1) TREENET 5) Iminer Ensemble
2) Iminer Tree 6) Random Forest
3) SPLUS Tree 7) Naive Bayes
4) CART 8) Logistic
As described in Section 4, CART and TREENET are Salford Systems stand-alone software
products that each performs one technique. CART (Classification and Regression Trees)
does tree analysis, and TREENET applies stochastic gradient boosting to an ensemble of
trees using the met hod described by Freidman (2001). The S-PLUS procedures used here in
the fraud comparison are found in bot h S-PLUS and in a freeware version in R. These were:
the tree function for decision trees, and the GLM (generalized linear models) for logistic
regression.
Insightful Miner is a data mining suite. The Naive Bayes, Tree and Ensemble Tree
procedures, from Insightful Miner are used here in the fraud comparison.
Model performance is covered in the next section, section 6, as we first cover the ranking of
variables by "importance" in r dat i on to the target variables: the decision to perform an I ME
or a Special Investigation (SIU) and the favorable outcomes of each investigative technique.
The training data of approximately 75,000 records was used in the ranking evaluations.
Data mining modds are typically complex models where it is difficult to determine the
relevance of predictors to the model result. One of the handy tasks that some of the data
mi ni ng software product s perform is to rank the predictor variables by their i mport ance to
the model i n predicting the dependent variable. Where the software did not supply a
ranking, we omi t t ed an i mport ance ranki ng leaving five model / sof t war e det ermi nat i ons of
i mport ance for the twenty-one variables. Di fferent procedures are used for different
met hods and different products.
Two software products, CART and TREENET supply i mport ance rankings. The
procedures used are:
CART: CART uses a goodness of fit measure, also referred to i n the literature as an
i mpuri t y measure, and comput ed over the entire tree, to det ermi ne a variable' s importance.
I n this study the goodness of fit measure was the Gi ni Index defined bel ow (Hastie, et al.,
p.271-272):
i ( t ) = 1 - ~ p , i--the cat egori es of t he dependent vari abl e and p.is the probability of
I
class (6)
Each split of the tree lowers the overall value for the statistic. CART keeps track of the
i mpuri t y i mpr ovement at each node for bot h the variable used i n the split and for surrogate
variables used as a repl acement i n the case of mi ssi ng values. A consequence of this is that a
variable not used for splitting may rank higher i n i mport ance t han a variable that is.
TREENET: Because i t is composed of many small CART trees, TREENET uses the same
met hod as CART to comput e i mport ance rankings.
S-PLUS CR) does not supply an i mport ance ranking, but the programmi ng language can be
used to program a procedure to comput e rankings. A sensitix4ty value was comput ed for
each variable i n the model. The sensitivity is a measure of how much t he predicted value' s
error increases when the variables are excluded from the model one at a time. However,
instead of actually exdudi ng variables and refitting the model, their values are fixed at a
const ant value. (See Francis, 2001 for a detailed recipe for applying the approach). The
sensiti~fty statistic was used to rank the variables from the tree function. For the logistic
regression, i nf or mat i on about the variables cont ri but i on to sum of squared variation
explained by the model was used to rank it. Like CART and TREENET, Random Forest
uses an i mpuri t y measure (i.e., Gi ni Index) to produce an i mport ance ranking.
Insightful Mi ner does not supply i mport ance rankings. Unlike S-PLUS (R), the analytical
met hods are not accessed t hrough the language but t hrough a series of i cons placed on a
palate. Thus, we were not able to cust om program a ranki ng procedure for application wi t h
the Imi ner' s model i ng methods. The resulting i mport ance rankings were used i n Tables 5-
1A & 5-2A for the decisions to investigate and 5-1B and 5-2B for the favorable outcomes.
Each of five model / soft ware combi nat i on out put s allowed for the evaluation of the
predicting variables i n rank order of i mport ance, when significant, together wi t h a measure
of the relative value of i mport ance on a scale of zero (insignificant) to 100 (most significant
vari abl e). Tabl e 5- 1A displays t he i mpor t a nc e resul t s f or pr edi ct i ng an I ME usi ng t he fi ve
t r ee model s whi l e Tabl e 5-1B displays t hos e resul t s f or t he r emai ni ng fi ve mo d e l / s o f t wa r e
combi nat i ons , i ncl udi ng t he be nc hma r k Nai ' ve Bayes and Logi st i c. Th e pr edi ct i ng var i abl es
are l i st ed i n t he or de r o f i mpor t a nc e i n t he T R E E NE T model , wher e all var i abl es are
si gni fi cant . Th e n u mb e r o f si gni f i cant var i abl es f ound r anges f r om a l o w o f t wel ve var i abl es
( S- PLUS Tr ee) t o all t went y one ( T RE E NE T ) .
Variable
Provider 2 Bill
Attorneys Per Zip
Territo~"
Health Insurance
Injury Type
Provider 1 Bill
Provider 1 Type
Report Lag
Attorney
Age
Provider 2 Type
Income Household/Zip
Avg. Household Price/Zip .
Providers per City
Claimants per City
Providers / Zip
Households/Zip
Treatment Lag
Distance~lPl Zip to Clt Zip
Emergency Treatment
Policy. Type
Software Ranki ng of Variables for I ME De c i s i on
By Importance Rank and Val ue
(1)
T RE E NE T
1 (1 00)
2 (80)
3 (71)
4 (61)
5 (50)
6 (4"0
7 (31)
8 (31)
9 (25)
10 (23)
11 (19)
12 (18)
13 (17)
14 (17)
15(16)
16 (16)
17 (16)
18 (14)
19 (13)
20 (4)
21 (3)
( 2)
S Pl us Tree
2 (91)
5 (26)
(32)
1 (lOO)
6 (24)
3 (51)
9 (7)
7 (16)
12 (3)
8 (9)
11 (3)
l o (4)
(3)
CART
1 (100)
13 (9)
11 (11)
3 (68)
5 (47)
4 (58)
8 (1 8 )
17 (2)
10 (13)
15 (5)
9 (15)
1 8 (2)
20 (0.1)
7 (20)
19 (2)
(4 )
Random
Forest
1 (100)
6 0 4)
3 (59)
2 (8 4 )
10 (18)
4 (59)
12 (15)
8 (27)
19 (5)
17 (8)
5 (42)
11 (1 6)
1 6 (9)
7 (32)
15 (13)
13 (15)
9 (24)
14 (14)
1 8 (6)
20 (0)
(s)
Logi sdc
lO (1 )
11 (1)
1 (1 00)
2 (51)
6 (8 )
1 3 (1 )
5 (1 8 )
3 (47)
9 (2)
12 (1)
8 (2)
7 (2)
4 (24)
Note: * represents insignificance of variable in the model.
Tabl e 5-1 A
Th e s ame set o f mo d e l / s o f t wa r e c ombi na t i ons was us ed wi t h t he same set o f t we nt y- one
pr edi ct i ng var i abl es t o pr edi ct t he f avor abl e o u t c o me o f t he I ME. Tabl e 5-1B s hows t he
i mpor t a nc e o f each o f t he 21 pr edi ct or s f or mode l i ng f avor abl e out c ome s o f I MEs .
2 6 C a s u a l t y Ac t u a r i a l S o c i e t y Forum, Wi n t e r 2 0 0 6
Variable
Provider 2 Bill
Attorneys Per Zip
Territory
Health Insurance
Injury Type
Provider I Bill
Provider I Type
Report Lag
Attorney
Age
Proxdder 2 Type
Income Househol d/ Zi p
Avg. Household Pri ce/ Zi p
Providers per City
Claimants per City
Providers/Zip
Households/Zip
Treatment Lag
Distance MP1 Zip to Clt Zip
Emergency Treatment
Vor~q Type
Software Ranki ng of Variables for IME Favorable
By Importance Rank and Val ue
( 1 ) ( 2) ( 3 )
TREENET S Plus Tree CART
5 (64) 3 (22) 4 (37)
11 (28) * 11 (6)
2 (98) 2 (43) 12 (5)
1 (1 00) 1 (1 00) 1 (1 00)
( 4)
Random
Forest
5 (49)
13 (28)
1 (lOO)
2 (71)
(s)
Logi st i c
2 (13)
11 (1)
4 (9)
1 (lOO)
4 (76)
7 ( 4 5)
8 ( 38 )
6 (53) .
12 (25)
13 (24)
1 0 ( 29)
20 ( 7) .
15 (16)
1 9 ( 8 )
9 (36)
17 (12)
16 (15)
14 ( 22)
3 (78) .
1 8 ( 9)
21 (5)
5 ( 1 0) 9 ( 1 5)
4 (15) 2 (51)
9 (16) 5 0 6)
8(7) 18(o)
* 19 (0)
* 6 0 0 )
11 (4) 17 (0)
* l S ( 0)
* 8 (17)
12 (3) 13 (2)
13 (2) 7 (20)
7 (7) 16 (0)
1 4 (1 ) 1 0 (6)
6 (8) 14 (1)
10 (6) 3 (44)
4 (67) 3 (13)
3(70) *
10 02) 5 (9)
6 (45) 8 (6)
18 (3) 7 (8)
9 (33) *
12 01) *
8 (33) 1 0 g)
15 (23) *
16 (22) 13 (1)
11 01) *
7 (37) 9 (-'2)
14 (28) 6 (8)
17 (5) 12 (1)
Tabl e 5-1B
Th e s a me s et o f f i ve mo d e l / s o f t wa r e c o mb i n a t i o n s was us e d wi t h t he s a me s et o f t we nt y-
o n e pr e di c t i ng var i abl es t o p r e d i c t t he us e o f s p e d a l i nve s t i ga t i on or SI U. Ta bl e s 5- ZR a n d
5- 2B s h o w t he c o r r e s p o n d i n g r a n k i n g o f var i abl es by i mp o r t a n c e f or e a c h o f t he f i ve mo d e l
c o mb i n a t i o n s a nd t wo t a r ge t var i abl es , de c i s i on a nd f avor abl e.
C a s u a l t y Ac t u a r i a l S o c i e t y Forum, Wi n t e r 2 0 0 6 2 7
Software Ranki ng of Vari abl es for SIU Deci si on
By I mpor t ance Rank and Val ue
Vari abl e
Provi ders/ Zi p
Provider 2 Type
Territo9,
Health lnsumnce
Provider 1 Bill
l nj u~ Type
Attorney
Provider 1 Type
Age
Provider 2 Bill
Report lag
Average House Price
Attorneys/zip
Distance to Provider
(1)
TREENET
1 (100)
2 (98)
3 (92)
4 (64)
5 (59)
6 (52)
7 (47)
8 08)
9 (31)
10 (30)
11 (28)
Emergency, Treatment
Income/ Cap Household
Claimants per City
Treatment Lag
Househol ds/ Zi p
Policy Type
Providers per City
12 (28)
13 (22)
14 (20)
15 (19)
16 (18)
17 (17)
18 (1 6)
19 (16)
20 (8)
21 (6)
(2)
S Pl us Tree
1 (100)
10 (3)
5 (18)
3 (33)
2 (51)
7(6)
8 (4.5)
4 (29)
6 (8)
11 (3)
9 (34)
12 (1)
(3)
CART
8 (37)
1 5 (34)
3 (84)
7 (52)
2 (8 5)
5 (59)
17 (13)
4 (69)
1 (1 00)
6 (54 )
15 (18)
14 (20)
19(4)
13 (27)
9 (4.5)
12 (30)
18 (1 2)
16 (16)
11 (30)
( 4 )
Random
Forest
3 (74)
10 (30)
1 (1 00)
6 (50)
2 (89)
16 (5)
18(4)
s (8 1 )
17 (5)
4 (74 )
8 (10)
9 00)
15 (18)
19 (4)
13 (21)
11 (26)
14 (20)
12 (21 )
20 (1)
7(44)
( s)
Logi suc
6 (39)
7G8)
14 (2)
2 (71)
3 (63)
1 (100)
13 (5)
11 (17)
12 (7)
4 (58)
5 (4 9)
9 (27)
15 (2)
8 (28 )
10 (22)
Tabl e 5-2A
Vari abl e
Providers/Zip
Provider 2 Type
Territo~"
Health Insurance
Provider 1 Bill
Injuq, Type
Attorney
Provider 1 Type
Age
Proxader 2 Bill
Report lag
Average House Price
Attorneys/zip
Distance to Provider
Emergency Treatment
Income/Cap Household
Claimants per City
Treatment Lag
Households/Zip
Policy T.vpe
Providers per City
Sof t ware Ra nk i ng of Vari abl es for SI U Favorabl e
By I mpor t anc e Ra n k and Val ue
(6) (7) (8)
TREENET $ Pl us Tree CART
10 (20) 10 (6) 12 (25)
4 (41) * 7 (35)
1 (1 00) 2 (94) 1 (100)
13 (18) 6 (16) *
6 (30) . 13 (4) . 15 (9) .
3 (58) 5 (16) . 6 (39)
14 (16) 12 (4) 9 (27)
5 (40) 1 (100) 3 (50) .
8 (22) * 1 7 (7)
2 (66) 4 (18) 8 (32)
. 7 (25) . 7 (1 4) 1 9 (2) .
15 (16) * 13 (24)
11 (19) 8 (14) 4 (45)
16 (15) 9 (14) . 5 (39)
21 (9) , 3 (72) 14 (17)
17 (14) 11 (5) . 2 (61)
12 (19) * 11 (25)
19 (13) * 18 (4)
18 (13) * 16 (9) .
20 (10) * *
9 (21) 14 (3) 10 (26)
Tabl e 5-2B
( 9)
Ra n d o m
Fores t
7 (24)
9 08)
1 (1 00)
1 5 (1 0)
5 (29)
16 (8)
18 (6)
3 (33)
13 (13)
6 (26)
2 (36)
lO (1 7)
14 (11)
11 (16)
12 (13)
17 (6)
8 (1 9)
4 (31 )
(10)
Logistic
1 3 (2)
5 (21 )
1 (100)
7 (19)
14 (1)
3 (41)
6 (20)
2 (4 5)
11(2)
9 (3)
12 (2)
4 (25)
15 (1 )
8 (5)
lO (2)
Cl earl y, i n b o t h i ns t a nc e s o f t ar get var i abl es t he speci f i c mo d e l a nd s of t wa r e i mp l e me n t a t i o n
de t e r mi ne s h o w t o u n wi n d t he cr os s c or r e l a t i ons t o ext r act t he mo s t i n f o r ma t i o n f or
pr e di c t i on pur pos e s . F o r e xa mpl e , t he di s t a nc e b e t we e n t he c l a i ma nt ' s zi p c ode a n d t he f i r st
o u t p a t i e n t p r o v i d e r ( Di s t ance) r a nks l o w i n i mp o r t a n c e ( 19/ 21) i n t he T R E E NE T
a ppl i c a t i on f or t he I ME de c i s i on t ar get b u t i t is qui t e i mp o r t a n t i n t he T R E E NE T mo d e l f or
f a vor a bl e I ME o u t c o me ( 3/ 21) . No t e , h o we v e r , p r o v i d e r 2 bi l l is d e e me d hi ghl y i mp o r t a n t
i n all I ME n o n - b e n c h ma r k appl i cat i ons . On e way t o i s ol at e t he i mp o r t a n c e o f e a c h
pr e di c t i ng va r i a bl e is t o tally a s u mma r y i mp o r t a n c e s cor e acr os s mode l s . We wi l l us e a
s cor e o f ( 21- r a nk) *( i mpor t a nc e ) , wi t h all i ns i gni f i c a nt var i abl es a s s i gne d ze r o i mp o r t a n c e ,
s u mme d o v e r all r e l e va nt mo d e l c o mb i n a t i o n s . F o r e xa mpl e , t he var i abl e pr oxqder 2 t ype
wo u l d ha ve a s u mma r y s cor e r el at i ng t o t he I ME t a r ge t acr os s t he fi ve t r ee mo d e l s f or a t ot al
i mp o r t a n c e s cor e o f 2, 268. Thi s s c or i ng f o r mu l a i s t3"pical o f t he ad h o c me t h o d s c o mmo n
t o dat a mi n i n g anal yt i cs. Th e mul t i pl i cat i ve f o r m gi ves e mp h a s i s t o b o t h t he cat egor i cal r a n k
a nd t he i mp o r t a n c e s c or e i n a dua l mo n o t o n e way. Th e n u me r i c va l ue o f t he s c or e i s l ess
i mp o r t a n t t h a n t he fi nal r a nki ngs o f t he var i abl es . Ta bl e s 5 - 3 A&B a n d 5 - 4 A&B s h o w t he
r a nge o f var i abl e i mp o r t a n c e s u mma r y s cor es f or all var i abl es r el at i ve t o t he t wo t ar get s,
I ME a nd SI U, r espect i vel y. Th e r a nks o f t he var i abl es a c c o r d i n g t o t he t wo s u mma r y s cor es
ar e hi ghl y ( Pear s on) c or r e l a t e d as, f or e xa mpl e , t h e de c i s i on s u mma r y r a nks a n d f a vor a bl e
s u mma r y r a nks ha ve c or r e l a t i on coef f i ci ent s o f 0. 65 f or I ME a nd 0. 57 f or SI U. Th e t abl es
Cas ual t y Act uar i al Soci et y Forum, Wi n t e r 2006 29
al so i ndi cat e t he vari abl e cat egor y o f ori gi nal DC D field (F), an i nt ernal l y der i ved vari abl e
( DV) a nd an ext er nal de mogr a phi c vari abl e (DM). The ext ernal de mogr a phi c var i abl es do
not s e e m t o be ver y i nf or mat i ve i n t he pr es ence o f t he field and der i ved vari abl es chos en.
Important Variable Summari zadons for IME
Tree Model s Appl i ed to Dec i s i on and Favorable Targets
Total Dec i s i on Favorabl e
Score Score Score
Variable Total
Variable type Score Rank Rank Rank
Health Insurance F 17,206 1 2 1
Provider 2 Bill F 10,820 2 1 4
Territota? F 7,871 3 5 2
Provider 1 Bill F 6,726 4 4 3
Injury Type F 6,084 5 6 5
Attorneys Per Zip DV 3,102 6 3 15
Provider 2 Type F 2,873 7 8 9
Report Lag DV 2,859 8 16 7
Provider 1 Type F 2,531 9 10 6
Distance MP1 Zip to Clt Zip DV 1,655 10 11 8
Treatment Lag DV 1,331 11 17 16
Emergency. Treatment F 1,216 12 7 10
Claimants per Cit 3, DV 1,146 13 14 13
Income Household/Zip DM 987 14 13 17
Attorney F 971 15 9 19
Households/Zip DM 957 16 19 11
Age F 881 17 12 14
Pro~ders/Zip DV 838 18 18 12
Providers per City DV 719 19 20 18
Avg. Household Price/Zip DM 262 20 15 20
Policy. Type F 4 21 21 21
Tabl e 5-3
Important Variable Summari zat i ons for SIU
Tree Model s Appl i ed to Dec i s i on and Favorabl e Targets
Variable Tot al
Variable Type Score
Territory F 15,242
Provider 1 Type F 9, 965
Providers/Zip DV 6, 676
Provider I Bill F 6, 240
Provider 2 Bill F 6, 030
Injury Type F 5, 845
Provider 2 Type F 4, 753
Health Insurance F 4, 262
Emergency Treatment F 3, 039
Attorney F 2, 705
Report lag DV 2, 642
Providers per City. DV 2, 275
Attorneys/zip DV 2, 183
Distance to Provider DV 2, 109
Income/Cap Household DM 2, 091
Claimants per City DV 1, 142
Households/Zip DM 1,061
Age F 830
Treatment Lag DV 706
Average House Price DM 648
Policy Type F 19
Total
Score
Dec i s i on Favorable
Score Score
Rank Rank Rank
1 2 1
2 4 2
3 1 13
4 3 10
5 5 4
6 7 3
7 8 6
8 6 15
9 13 5
10 9 14
11 10 9
12 12 10
13 14 8
14 11 14
15 15 7
16 18 16
17 16 18
18 19 17
19 17 20
20 20 9
21 21 21
Table 5-4
Ad d i t i o n a l An a l y s e ~
Mo s t s of t war e al l ow f or addi t i onal di agnos t i c t ool s t hat f ocus o n t he i mp o r t a n c e o f
i ndi vi dual var i abl e l evel s i n t he pr edi ct i ve mode l . We f ocus o n t wo s uch feat ures: par t i al
d e p e n d e n c y pl ot s a n d p r u n i n g o f t rees. Bo t h f eat ur es ar e de s i gne d t o i l l ust rat e t he
c ont r i but i on o f each kv el o f cat egor i cal var i abl e a nd each interv al o f c ont i nuous var i abl es
cr eat ed by t he cut poi nt s . We i l l ust rat e t he addi t i onal anal yses us i ng t he Ra n d o m Fo r e s t a nd
S- PLUS' s t r ee s of t war e.
P a r t i a l De p e n d e n c e
Th e part i al d e p e n d e n c e pl ot is a us ef ul way t o vi sual i ze t he e f f e c t o f t he val ues o f a speci f i c
var i abl e o n a d e p e n d e n t var i abl e wh e n a c o mp l e x mo d e l i n g me t h o d s uc h as Ra n d o m Fo r e s t
is used. Th e part i al d e p e n d e n c e pl ot is a gr a ph o f t he mar gi nal e f f e c t o f a var i abl e o n t he
cl ass pr obabi l i t y. F o r a cl assi f i cat i on appl i cat i on (in Ra n d o m For es t ) , t he part i al pl ot us es
t he l ogi t or l og o f t he odds r at i o ( t he odds o f be i ng i n t h e t ar get cat egor y ver s us i t s
c ompl i me nt ) r a t he r t ha n t he act ual pr obabi l i t y.
K
f (x ) = log p, (x) - ~ log(p, ) (7)
/ = 1
Figures 5-1 and 5-2 show the partial dependence plot for the two IME targets for the most
important variable in Table 5-4, territory.
Random Forest: IME Requested
Part i al D e p e n d e n c e o n Ter r i t o r y
.D
m
Q
t o
o
o
t ' o
o
o4
o
x . -
o
O
O
01 04 07 10 13 16 19
Territory
F i g u r e 5-1
Random Forest: IME Favorable
22 25
32 Ca s u a l t y Ac t u a r i a l So c i e t y Forum, Wi n t e r 2 0 0 6
Par t i al D e p e n d e n c e o n Te r r i t o r y
o
o
o
o
o
o
01 04 07 1 0 1 3 1 6 1 9 22 25
Territ ory
Figure 5-2
Both bar graphs have a distinctive right shift i n the size of the partial dependency on the
territon, variable. This result is not surprising given that Massachusetts aut omobi l e
territories are set every two years based upon the calculation of a single 5-coverage pure
pr emi um i ndex for each of 350 towns. Towns are t hen grouped i nt o 16 nearly homogenous
territories with the i ndex generally rising from territory 1 (lowest) to territory 16 (highest).
Territories 17-26 are 10 individual parts of Bost on that vary widely i n this calculated pure
pr emi um i ndex (Conger, 1987). Figure 5-3 shows a bar graph of the pure pr emi um indices
for the 26 territories used i n this analysis for compari son purposes.
Massachuset t s Rat i ng Territories
Fi ve Cover age Pur e Pr e mi u ms
1400.00 T
1200.00
1000, 00
800.00
600.00
400.00
200.00
0 O0
1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 1 4 1 5 1 6 1 7 1 8 1 0 2 0 21 2 2 23 24 25
F i g u r e 5 - 3
26
Fi gure 5-4 displays the pr opor t i on of dai rns wi t h an I ME r equest ed (not margi nal effects) by
territory, super i mposed on the pur e pr emi um t erri t ory levels. In cont rast to the similarity, of
the margi nal i mpor t ance of the I ME t erri t ory variable t o the t erri t ory pur e pr emi ums, the
pr opor t i ons of claims with I ME request ed shown in Fi gur e 5-4 show mor e uni f or mi t y across
territories, i ndi cat i ng a real dependence on ot her i mpor t ant variables.
Mas s achus et t s Rat i ng Ter r i t or i es
Fi ve Cover age Pur e Pr e mi um vs I ME Reques t Rat i os
1400.00 0.1200
1200.00
1000.00
800.00
600.00
400.00
200.00
0.1000
0.0800
0.0600
0.0400
0.0200
0.00
1 2 3 4 5 6 7 8 9 10 1112 1314 151617 18 19 2 0 2 12 2 2 32 4 2 52 6
Territory
Figure 5-4
0.0000
Pruni n~ t he Trees
Simple trees 3~ that ext end to a large number of t ermi nal nodes are difficult to assess t he flail
i mport ance of individual variable levels because (1) later node splits may or may not be
statistically significant dependi ng on the software algorithms employed and (2) t ermi nal
nodes on the order of fifty plus may obscure the precise cont r i but i on of each variable level
despite the i mpor t ance value described above for the overall variable.
The full tree produced by the software can be pruned back to the "best " tree wi t h a pre-
det ermi ned number of nodes. For example, Figure 5-5 shows a best 10 node pr uned tree
from S-PLUS. It begins with the health i nsurance variable as the "root " node (Y/N t o the
left and U to the fight) 32 and proceeds to make general node splits based onl y on the
provi der 2 bill amount . The uni verse of records is t hen classified by t ermi nal node I ME
requested ratios rangi ng from 0.019 to 0.170. A similar pr uned tree can be pr oduced for t he
ot her three targets.
S-PLUS TREE: IME Requested
Best Te n No d e Pruned Tree
h~nllh into Jmnr.~'h
I
/
mn2 hi 1<4Zr,,4 mn~) hJ<4R.q
L i- j o. -, L
mp2 h i l l 3010.5 mp2 hill 11~qF~1.5 mp? hill 15.o,6.5
0.019 0.100 0.100
1
0.060 0.072 0.150 0.090 0.170 0.120
F i g u r e S -S
We next t ur n t o cons i der at i on o f model per f or mance as a whol e i n sect i on 6 wi t h an
i nt er pr et at i on o f t he model s and vari abl es relative t o t he pr obl e m at ha nd (exampl e 2) i n
Sect i on 7.
S E C T I ON 6. R OC CURVE S AND L I F T F OR S OF T WAR E : T R E E S , NAI VE
BAYES AND L OGI S T I C MODE L S
The sensitivity and specificity measur es di scussed i n Sect i on 4 are de pe nde nt o n t he choi ce
o f a cut of f val ue f or t he pr edi ct i on. Many model s score each r ecor d wi t h a val ue bet ween
zero and one, t hough s ome ot he r scor i ng scale can be used. Thi s score is somet i mes t r eat ed
like a probability., al t hough t he concept is mu c h cl oser i n spi ri t t o a fuzzy set me a s ur e me nt
f unct i on 33. A c o mmo n c ut of f poi nt is .5 and r ecor ds wi t h scores gr eat er t han .5 are classified
as event s and r ecor ds bel ow t hat val ue are classified as non- e ve nt s 3~. However , ot he r cut of f
Di st i ngui shi ng the Forest f r om the T R E E S
values can be used. Thus, i f a cut of f l ower t han 50% were selected, mor e event s woul d be
accurately predi ct ed and fewer non-event s woul d be accurately predicted.
Because t he accuracy of a predi ct i on depends on t he selected cut of f point, t echni ques for
assessing t he accuracy o f model s over a range o f cut of f poi nt s have been devel oped. A
c ommon procedure for visualizing t he accuracy of model s used for classification is t he
recei ver operat i ng characteristic (ROC) curve 3~. This is a curve o f sensitivity versus
specificity (or mor e accurately 1.0 mi nus the specificity) over a range o f cut of f points. I t
illustrates graplfically the sensitivity or true positive rate compar ed t o 1- specificity or false
alarm rate. When the cut of f poi nt is very high (i.e. 1.0) all claims are classified as legitimate.
The spedfi ci t y is 100% (1.0 mi nus t he specificity is 0), but t he sensitivity is 0%. As t he
cut of f poi nt is lowered, t he sensitivity increases, but so does 1.0 mi nus t he specificity.
Ultimately a poi nt is reached where all claims are predi ct ed to be events, and t he specificity
declines t o zero (1.0 - specificity = 1.0). The baseline ROC curve (where no model is used)
can be t hought of as a straight line f r om the origin wi t h a 45-degree angle. I f t he model ' s
sensitivity increases faster than the specificity decreases, the curve "l i ft s" or rises above a 45-
degree line quickly. The higher the "l i ft " or "gai n"; the mor e accurate t he model 3c'. ROC
curves have been used in pri or studies of insurance claims and fraud det ect i on regressi on
model s (Derrig and \Veisberg, 1998 and Viaene et al., 2002). The use of ROC curves in
building model s as well as compar i ng per f or mance of compet i ng model s is a well established
procedure (Flach et al (2003)).
A statistic that proxddes a one-di mensi onal summary of t he predi ct i ve accuracy of a mo d d as
measured by an ROC curve is the area under the ROC curve (AUROC). I n general,
AUROC values can distinguish good model s f r om bad model s but may not be able t o
distinguish among good model s 0Vlarzban, 2004). A curve that rises quickly has mor e area
under t he ROC curve. A model wi t h an area of .50 demonst rat es no predi ct i ve ability, whi l e
a model wi t h an area of 1.0 is a perfect predi ct or (on t he sample the test is per f or med on).
For this analysis, SPSS was used t o pr oduce t he ROC cuin-es and area under t he ROC
curves. SPSS generates cut of f values midway bet ween each unique score in t he data and
uses the trapezoidal rule t o comput e the AUROC. A non-paramet ri c met hod was used to
comput e t he standard error o f t he AUROC. The formula for t he standard error 37 is:
I A(I - A) + (n+ - l)(Ql - A 2 ) + (n_ - I)(Q2 - A 2 )
SE( A) (8)
n+N
Where n+ is t he number o f event s, n. is t he number o f non-event s, N is t he sampl e si ze
A is t he AUROC and scores are denot ed as x
= = j x [ n + > / - n+ > z r/+= /
/./r/+2 E x t'/- , + + + ]
1
o , : 7 g + Xx + : jxt , , : + o , , + , :,
Tables 6-1A&B show t he values of AUROC for each of eight model / s of t war e combi nat i ons
in predicting a decision to investigate wi t h an I ME (6-1A) and an SIU (6-1B). for the
Massachusetts auto bodily injury liability claims that compri se t he hol dout sample, about
50,000 claims. Upper and l ower bounds for t he "t r ue" AUROC value are shown as t he
AUROC value + t wo standard deviation det ermi ned by equat i on (7). TREENET, Random
For est bot h do well wi t h AUROC values about 0.7, significantly bet t er than t he logistic
model. The I mi ner model s (Tree, Ensembl e and Nai've Bayes) generally have AUROC
values significantly b d o w t he t op t wo performers, wi t h t wo (Tree and Ensembl e)
significantly bel ow the Logistic and t he I mi ner Nai ve Bayes benchmarks. CART also scores
at or bel ow the benchmarks and significantly bel ow T R E E NE T and Random Forest. On the
ot her hand, S-PLUS (R) tree scores at or somewhat above the benchmarks.
Area Unde r t he ROC Curve - I ME De c i s i o n
CART
Tr e e
0.669
0.661
0.678
S-PLUS
Tr e e
AUROC
Lower Bound
Upper Bound
I mi ne r
Ens e mbl e Ra ndo m
Fores t
AUR()C 0.649 703
Lower Bound 0.641 695
Upper Bound 0.657 711
I mi ne r Tr e e
Tabl e 6-1A
T RE E NE T
0.688 0.629 0.701
0.680 0.620 0.693
0.696 0.637 0.708
I mi ne r
Na i v e
Bayes
0.676
0.669
0.684
Logi st i c
0.677
0.669
0.685
Area Unde r the ROC Curve - I ME Favorabl e
CART
Tree
AUROC
Lower Bound
Upper Bound
AUROC
Lower Bound
Upper Bound
0.651
0.641
0.662
I mi ne r
Ens e mbl e
0.654
0.643
0.665
S-PLUS
Tr e e I mi ne r Tr e e TREENET
0.664 0.591 0.683
0.653 0.578 0.673
0.675 0.603 0.693
Ra ndo m
Fores t
0.692
0.681
0.702
I mi ne r
Na / v e
Bayes
0.670
0.660
0.681
Tabl e 6-1B
Logi s t i c
0.677
0.667,
0.687
Tabl es 6- 2A&B s h o w t he val ues o f AUR OC f or t he mo d e l / s o f t wa r e c o mb i n a t i o n s t e s t e d
f or t he SI U d e p e n d e n t vari abl e. We f i r st n o t e t hat , i n gener al , t he mo d e l pr e di c t i ons as
me a s ur e d by AUR OC ar e si gni fi cant l y l owe r t ha n f or I ME acr os s all ei ght mo d e l / s o f t wa r e
c ombi na t i ons . Thi s r e duc t i on i n AUR OC val ues ma y be a r ef l ect i on o f t he e xpl a na t or y
var i abl es us e d i n t he anal ysi s; i.e., t hey may be mo r e i nf or ma t i ve a bout d a l m bui l d- up, f or
wh i c h I ME is t he pr i nci pal i nvest i gat i ve t ool , t ha n a b o u t cl ai m f r aud, f or wh i c h SI U is t h e
pr i nci pal i nvest i gat i ve t ool .
Area Unde r t he ROC Curve - SI U De c i s i o n
AUROC
Lower Bound
Upper Bound
AUROC
Lower Bound
Upper Bound
CART S-PLUS
Tr e e Tr e e
0.607
0.598
0.617
I mi ne r
Ens e mbl e
0.539
0.530
0.548
I mi ne r Tr e e TREENET
0.616 0.565 0.643
0.607 0.555 0.634
0.626 0.575 0.652
Ra ndo m
Fores t
I mi ne r
Na i v e
Bayes
0.615 0.677
Logistic
0.612
0.668 0.605 0.603
0.686 0.625 0.621
Tabl e 6-2A
Distinguishing the Fore#from the TREES
Area Under t he ROC Curve - SIU Favorable
CART
Tr e e
AUROC 0.598
Lower Bound 0.584
Upper Bound 0.612
I mi n e r
E n s e mb l e
AUROC 0.57.5
Lower Bound 0.530
Upper Bound 0.548
S-PLUS
Tr e e I mi n e r Tr e e T RE E NE T
0.616 0.547 0.678
0.607 0.555 0.667
0.626 0.575 0.689
Ra n d o m
Fo r e s t
0.645
I mi ner
Nai ve
Ba y e s
0.607
Logmic
0.610
0.631 0.593 0.596
0.658 0.625 0.623
Tabl e 6-2B
T R E E NE T a nd Ra n d o m Fo r e s t p e r f o r m si gni f i candy be t t e r t ha n all o t h e r mo d a l / s o f t wa r e
c o mb i n a t i o n s o n t he f avor abl e t ar get vari abl es. Bo t h p e r f o r m si gni fi cant l y be t t e r t ha n t he
Logi st i c. I mi n e r Tr e e a nd En s e mb l e agai n d o poor l y o n t he I ME a nd SI U Fa vor a bl e h o l d o u t
sampl es.
Fi gur es 6-1 t o 6-4 s h o w t he ROC cur ves f or T R E E NE T c o mp a r e d t o t he Logi st i c f or b o t h
I ME and SI U 38. As we can see, a s i mpl e di spl ay o f t he ROC cur ves ma y n o t be s uf f i ci ent t o
di s t i ngui s h p e r f o r ma n c e o f t he mo d e l s as wel l as t he AUR OC val ues.
1 .0
T R E E NE T R OC Cu r v e - I ME
AUR OC = 0. 701
~ . Q . 6-
~ 0 4 -
0 2 -
0.0
0 0
I I I I
02 0.4 0.6 0 8 1.0
1 - sp eci r. :~ y
F i g u r e 6-1
|.0
0 5-
t
~g 0.4-
02-
T RE E NE T R OC Cu r v e - S I U
AUR OC = 0. 677
0.0
I I I I
1- smar~
F i g u r e 6 - 2
Casualty Actuarial Society Forum, Wi nt er 2006 41
L o g i s t i c R OC C u r v e - I ME
AUR OC = 0. 643
1 .D
~>qXS-
W
1
~D. 4 -
02-
1.0
0.8-
i
~'3 D.4-
L o g i s t i c R OC C u r v e - S I U
AUR OC = 0. 612
{32-
/ .
0 . 0 I I OI I
{}0 Q2 0.4 0 0 5
0~0 I l 01 . I
oo 02 0.4 0 0.8 ~.o 1 - Sp ecWci t y
1 - s p e c mc ~
F i g u r e 6- 3
1.0
F i g u r e 6- 4
Fi nal l y, Ta b l e 6- 3 di s pl ays t h e r el at i ve p e r f o r ma n c e o f t h e mo d e l / s o f t wa r e c o mb i n a t i o n s
a c c o r d i n g t o AUR OC va l ue s a n d t hei r r anks . Wi t h Nai ' ve Ba ye s a n d Logi s t i c as t h e
b e n c h ma r k s , T R E E NE T , R a n d o m Fo r e s t a n d S P LUS Tr e e d o be t t e r t h a n t he b e n c h ma r k s
whi l e C AR T Tr e e , I n' f i ner Tr e e , a n d I mi n e r E n s e mb l e d o wor s e .
Ranki ng of Met hods By AUROC - Deci s i on
Met hod SIU AUROC
Random Forest 0.645
TREENET 0.643
S-PLUS Tree 0.616
Iminer NaPce Bayes 0.615
Logistic 0.612
CART Tree 0.607
1miner Tree 0.565
Iminer Ensemble 0.539
; IU Ra nk I ME Ra nk I ME
AUROC
1 0.703
2 0.701
3 0.688
4 5 0.676
4 0.677
6 0.669
8 0.629
7 0.649
Tabl e 6-3A
1 2 0.683
2 1 0.692
3 5 0.664
4 3 0.677
4 0.670
6 7 0.651
6 0.654
8 0.591
Ranl dng of Met hods By AUROC - Favorabl e
Met hod SIU AUROC SIU Rank IME Rank I ME
AUROC
TREENET 0.678
Random Forest 0.645
S-PLUS Tree 0.61
Logistic 0.61C
Iminer Naive Bayes 0.607
CART Tree 0.598
Iminer Ensemble 0.575
Iminer Tree 0.547
Tabl e 6-3B
Finally, Fi gures 6- 5A&B s how t he relative per f or mance i n a graphi c. Pr ocedur es woul d wor k
equally on b o t h I ME and SI U i f t hey lie o n t he 45 degree line. To t he ext ent t hat
per f or mance is bet t er o n t he I ME t arget s, pr ocedur es woul d be above t he diagonal. Bet t er
per f or mance is s hown by posi t i ons f ar t her t o t he f i ght and cl oser t o t he t op o f t he square.
Thi s gr aphi c dear l y shows t hat T R E E NE T and Ra ndom For est pr ocedur es do bet t er t han
t he ot her tree pr ocedur es and t he benchmar ks .
Pl ot of AUROC for SIU vs. I ME De c i s i on
0. 65
|
8
~ 0 . 6 0
I M Ensembl e
I M Tret
r r N net Rand~ n Fores
S- PL US Tree
L o g ~ U c / N ah ~ b y e s
CART
0 50 ' I I
0 S0 0 55 0 60 0, 65
AUROC. sI U
Fi gure 6-5A
Pl ot of AUROC for SIU vs. IME Favorabl e
, t
o. 6s - i
0. 60 -
0 55 "
R m d o m F o ~ s t
L ogisticR , , g n , s s ~
N~ L Ue Tr ee
I M E ~ CART
0 55 0 60 0 65
SI U A U R O C
Figure 6-5B
S E C T I ON 7. C ONC L US I ON
Insurance data oft en involves bot h large vol umes of i nf or mat i on and nonl i neari t y of variable
relationships. A range of data mani pul at i on techniques have been developed by comput er
scientists and statisticians that are now categorized as data mi ni ng, techniques with principal
advantages bei ng precisely the efficient handl i ng of large data sets and the fitting of non-
linear funct i ons to that data. I n this paper we illustrate the use of software i mpl ement at i ons
of CART and ot her tree-based met hods, together wi t h benchmar k procedures of Naive
Bayes and Logistic regression. Those eight model / soft ware combi nat i ons are applied to data
arising i n the Detail Claim Dat abase (DCD) of aut o injury liability claims i n Massachusetts.
Twent y- one variables were selected to use i n prediction model s usi ng the DCD and external
demographi c variables. Four target categorical variables were selected to model: The decision
to request an i ndependent medical exami nat i on (IME) or a special investigation (SIU) and
the favorable out come of each investigation. The two decision targets are the pri me claim
handl i ng techniques that insurers can use to reduce the asymmetry of i nformat i on bet ween
the dai mant and the i nsurer i n order to distinguish valid claims from those i nvol vi ng
bui l dup, exaggerated injuries or treatment, or outright fraud.
Ei ght model i ng software results were compared for effectiveness of model i ng the targets
based on a standard procedure, the area under the receiver operating characteristic curve
(AUROC). We find t hat the met hods all provide some predictive value or lift from t he
predicting variables we make available, with significant differences among the eight met hods
and four targets. Seven model i ng out comes are compared t o logistic regression as i n Vi aene
et al. (2002) but the results here are different. They show some soft ware/ met hods can
i mprove on the predictive ability of the logistic model. TREENET, Random Forest and
SPLUS Tree do bet t er t han the benchmark Naive Bayes and Logistic met hods, while CART
tree, Iminer tree, and Iminer Ensemble do worse. That some modal / soft ware combinations
do better than the logistic model may be due to the relative size and richness of this data set
and/ or the t3,pes of independent variables at hand compared to the Viaene et al. data.
We show how "important" each variable is within each soft ware/ model tested and note the
type of data that are important for this analysis. I n general, variables taken directly from
DCD fields and variables derived as demographic type variables based on DCD r i d& do
better than variables derived from external demographic data. Variables relating to the injury
and medical treatment dominate the highly important variables while the presence of an
attorney, age of the daimant, and policy type, personal or commercial, are less important in
making the decision to im, oke these two investigative techniques.
No general conclusions about auto injury claims can be drawn from the exercise presented
here except that these modeling techniques should have a place in the actuary's repertory of
data manipulation techniques. Technological advancements in database assembly and
management, especially the availability of text mining for the production of variables,
together with the easy access to computer power, will make the use of these techniques
mandatory for analyzing the nonlinearity of insurance data. As for our part in advancing the
use of data mining in actuarial work, we will continue to test various software products that
implement these and other data mining techniques (e.g. support vector machines).
REFERENCES
Allison, P., Missing Data. Sage Publications, 2002.
Automobile Insurers Bureau of Massachusetts, Detail Claim Database Claim Distribution
Characteristics, Accident Years 1995-1997, Boston MA, 2004.
Brieman, L., J. Freidman, R. Olshen, and C. Stone, Classification ~md Regression Trees,
Chapman Hall, 1993.
Conger, R.F., The Construction of Automobile Rating Territories in Massachusetts, Casualty
ActuarialSodely Forum, pp. 230-335, Fall 1987.
Derrig, R.A., and L. Francis, Comparison of Methods and Software for Modeling Nonlinear
Dependencies: A Fraud Application, Working Paper, 2005.
Flach, P., H. Blockeel, C. Ferri, J. Hernandez-Orallo, andJ . Struyf, Decision Support for
Data M.ining: An Introduction to ROC Analysis and Its Applications, Chapter 7 in Data
Mining and Decision Support, D. Mlandenic, Nada Lavrac, Marko Bohanec and Steve Moyle
Eds., Kluwer Academic, N.Y., 2003.
Fox, J, An R and S-PLUS Companion to Applied Regression, SAGE Publications, 2002.
Francis, L.A., An introduction to Neural Networks in Insurance, Intelligent and Ot her
Computational Techniques in Insurance, Chapter 2, A.F. Shapiro and L.C. Jain, Eds, World
Scientific, pp. 51-1, 2003a.
Francis, L.A., Mardan Chronicles: Is MARS better than Neural Networks? Casual~
ActuarialSodety Forum, Winter, pp. 253-320, 2003b.
Francis, L.A., Practical Applications of Neural Networks in Property and Casualty
Insurance, Intellieent and Ot her Computational Techniques in Insurance. Chapter 3, A.F.
Shapiro and L.C. Jain, Editors, World Scientific, pp. 104-136, 2003c.
Francis, L.A., Neural Networks Demystified, Casualty Actuarial Sode~y Forum, Winter, pp. 254-
319, 2001.
Francis, L.A., A Comparison of TREENET and Neural Networks in Insurance Fraud
Prediction, presentation at CART Data Mining Conference, 2005.
Friedman, J., Greedy Function Approximation: The Gradient Boosting Machine, Annals of
Statistics, 2001.
Hassoun, M.H., Fundamentals of Artificial Neural Ne~, orks. MI T Press, Cambridge MA,
1995.
Hastie, T., R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer,
New York., 2001.
Hosmer, David W. and Stanley Lemshow, Applied Lodstic Regression, J ohn Wiley and
Sons, 1989.
Insurance Research Council, Auto Injury Insurance Claims: Countrywide Patterns in
Treatment Cost, and Compensation, Malvern, PA, 2004a.
Insurance Research Council, Fraud and Buildup in Auto Injury Insurance Claims, Malvem,
PA, 2004b.
Jacard, J and Turrisi, Interaction Effects in Multinle Recessi on, SAGE Publications, 2003.
Kantardzic, M., Data Mining. J ohn Wiley and Sons, 2003.
Lawrence, Jeannette, Introduction to Neural Networks: Design, Theory and Applications,
California Sdentiflc Software, 1994.
Marzban, C., A Comment on the ROC curve and the Area Under it as Performance
Measures, 2004. Weather and Forecasting, Vol. 19, No. 6, 1106-1114.
bicCullagh, P. and J. Nelder, General and Linear Models, Chapman and Hall, London, 1989.
Miller, R.B. and D.B. Wichern, Intermediate Business Statistics, Holden-Day, San Francisco,
1997.
Neter, John, Mihael H. Kutner, William Waserman, and Christopher J. Nachtsheim, Applied
Linear Reeression Models, 4 a' Edition, 1985.
v
Salford Systems, Data mining with Decision Trees: Advanced CART Techniques, Notes
from Course, 1999.
Smith, Murry, (1996), Neural Networks for Statistical Modeling, International Thompson
Computer Press, 1996.
Viaene, S., R.A. Derrig, and G. Dedene, A Case Study of Applying Boosting Nai've Bayes
for Claim Fraud Diagnosis, I EEE Transactions on Knowledge and Data Engineering, 2004,
Volume 16, (5), pp. 612-620, May.
Viaene, S., R.A. Derrig, and G. Dedene, Illustrating the Explicative Capabilities of Bayesian
Learning Neural Networks for Auto Claim Fraud Detection, Intelligent and Other
Comtmtational Techniaues in the Insurance, Chapter 10, A.F. Shapiro and L.C. Jain,
Editors, World Scientific, pp. 365-399, 2003.
Viaene, S., B. Baesens, G. Dedene, and R. A. Derrig, A Comparison of State-of-the-Art
Classification Techniques for Expert Automobile Insurance Fraud Detection, Journal of Rdsk
andlnsurance, 2002, Volume 69, (3), pp. 373-421.
Weisberg, H.I. and R.A. Derrig, Methodes Quantitatives Pour La Detection Des Demandes
D' Indemnisation Frauduleuses, RISQUES, No. 35, Juillet-Septembre, 1998, Paris.
Zhou, X-H, D.K. McClish, and N.A. Obuchowski, Statistical Methods in Diagnostic
Medicine, J ohn Wiley and Sons, New York, 2002.
A good up-to-date and comprehensive source for a variety of data manipulation procedures is Hastie,
Tibshirani, and Friedman (2001 ), Elements of Statistical Learning, Springer.
2 They also found that augmenting the categorized red flag variables with some other claim data (e.g. age,
report lag) improved the lift as measured by AUROC across all methods but the logistic model still did as
well as the other methods (Viaene et al., 2002, Table 6, p.400-40 l).
3 A wider set of data mining techniques is considered in Derrig, R.A. and L A. Francis, Comparison of
Methods and Software Modeling Nonlinear Dependencies: A Fraud Application, Congress of Actuaries,
Paris, June 2006
4 See section 2 for an overview of the database and descriptions of the variables used for this paper.
5 The relative importance of the independent variables in modeling the dependent variable within these
methods are analogous to statistical significance or p-values in ordinary regression models.
6 See, for example, 2004 Discussion Paper Program, Applying and Evaluating Generalized Linear Models,
May 16-19, 2004, Casualty Actuarial Society.
7 This was the text used by the Casualty Actuarial Society for the exam on applied statistics during the
19g0s
8 Claims that involve only third party subrogation of personal injury protection (no fault) claims but no
separate indemnity payment or no separate claims handling on claims without payment are not reported to
DCD.
9 Combined payments under PIP and Medical Payments are reported to DCD.
J0 With a large holdout sample, we are able to estimate tight confidence intervals for testing model results
in section 6 using the area under the ROC curve measure.
H This fact is a matter of Massachusetts law which does not permit IMEs by one type of physician, say an
orthopedist, when another physician type is treating, say a chiropractor. This situation may differ in other
jurisdictions.
12 Because expert bill review systems became pervasive by 2003, reaching 100% in some cases, DCD
redefined the reported MA to encompass only peer reviews by physicians or nurses for claims reported
after July l, 2003..
J3 The standard Massachusetts auto policy has a cooperation clause for IME both in the first party PIP
coverage and in the third party BI liability coverage.
14 The IRC also includes an index bureau check as one of the claims handling activities
~5 Prior studies of Massachusetts Auto Injury claim data for fraud content included Weisberg and Derrig
(1998, Suspicion Regression Models) and Derrig and Weisberg (1998, Claim Screening with Scoring
Models).
~6 See Section 5 for the importance of the provider 2 bill variable in the decision to investigate claims for
fraud (SIU) and/or buildup (IME).
~7 There are Tree Software models that may split nodes into three or more branches. SPSS classification
trees is an example of such software.
is For binary categorical data assumed to be generated from a binomial distribution, entropy and deviance
are essentially the same measure. Deviance is a generalized linear model concept and is closely related to
the log of the likelihood function.
19 Hastie et al., p. 301 Note that Hastie et al. describe other error and weight functions. [endnote]
.~0 Note that the ensemble tree methods employ all 21 variables in the models. See tables 5-1 and 5-2.
2~ The ROC curve results in Section 6 show that TREENET generally provides the best prediction models
for the Massachusetts data.
22 The numeric variables were grouped into five bins or into quintiles in this instance.
~,3 The software product MARS also was used to rank variables in importance. MARS implements
multivariate adaptive regression splines and is described in Francis (2003).
24 The SAS code is generally relatively easy to edit if some other language is used to implement the model
25 See Section 5 for the importance of variables in our study.
26 S-PLUS would convert the numeric variable into a categorical variable with a level for every nmnerie
value that is in the training data, including missing data, but the result would have far too many categories
to be feasible.
27 Generally by collapsing sparsely populated categories into an "all other" category
2s It al so cont ai ns some di mens i on reduct i on met hods such as cl ust er i ng and Pr i nci pal Component s whi ch
are al so cont ai ned in S-PLUS.
29 In gener al , some pr ogr ammi ng i s requi red to appl y ei t her appr oach in S- PLUS (R)
30 The dat a set is descr i bed in mor e det ai l in Sect i on 2 above.
31 Pr uni ng i s not f easi bl e or necessar y for t he exampl e t r ee met hods such as TREENET or Random
TREENET.
32 The S- PLUS t ree gr aph does not pri nt out t he val ues of cat egor i cal var i abl es, al t hough i t di spl ays t he
val ues of t he numer i c var i abl es. For cat egor i cal var i abl es letters are assi gned and di spl ayed i nst ead o f t he
cat egor y val ues.
33 See Os t as zews ki ( 1993) or Der r i g and Os t as zews ki (1999).
34 One way o f deal i ng wi t h val ues equal t o t he cut of f poi nt i s t o consi der such obser vat i ons as one- hal f in
t he event gr oup and one- hal f in t he non- event gr oup
35 A ROC cur ve i s one exampl e of a so- cal l ed "gai ns " chart.
36 ROC cur ves wer e devel oped ext ens i vel y for use in medi cal di agnosi s t es t i ng in t he 1970s and 1980s
(Zhou et al. 2004 and mor e r ecent l y i n weat her f or ecast i ng (Marzban, 2004) and (St ephenson, 2000).
37 The det ai l s of t he f or mul a wer e suppl i ed by SPSS.
38 Al l t went y ROC cur ves are avai l abl e from t he aut hors.
Acknowledgement: The authors gratefully acknowledge the production assistance of Eilish Browne and helpful
comments from Rudy Palenik on a prior version of the paper.

Comparing Tree Based Methods

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Comparing Tree Based Methods

Uploaded by

Copyright:

Available Formats

Distinguishing the Forest from the TREES:

A Comparison of Tree Based Data Mining Methods

You might also like