Richard Derrig, Ph. D. and Louise Francis, FCAS, MAAA Ri c h a r d De r r i g , P h D, OP AL Co n s u l t i n g LLC 41 F o s d y k e St r e e t P r o v i d e n c e R h o d e I s l a n d , 02906, U. S. A. P h o n e : 0 0 1 - 4 0 1 - 8 6 1 - 2 8 5 5 emai l : r i c h a r d @d e r r i g . c o m Lo u i s e Fr a n c i s , F CAS , AAA Fr a n c i s Ana l yf i c s & Ac t ua r i a l Da t a Mi n i n g 706 L o mb a r d St r e e t Ph i l a d e l p h i a Pe n n s y l v a n i a , 19147, U. S. A. P h o n e : 0 0 1 - 2 1 5 - 9 2 3 - 1 5 6 7 emai l : l o u i s e _ f r a n c i s @ms n . c o m Abstract In recent years a number of "data mining" approaches for modeling data containing nonlinear and ot her complex dependencies have appeared in the literature. One of the key data mining techniques is decision trees, also referred to as classification and regression trees or CART (Breiman et al, 1993). That met hod results in relatively easy to apply decision rules that partition data and model many of the complexities in insurance data. In recent years considerable effort has been expended to i mprove the qualit 3" of the fit of regression trees. These new met hods are based on ensembles or networks of trees and cart 3, names like TREENET and Random Forest. Viaene et al (2002) compared several data mining procedures, including tree met hods and logistic regression, for prediction accuracy on a small fixed data set of fraud indicators or "red flags". They found simple logistic regression did as well at predicting expert opinion as the more sophisticated procedures. In this paper we will introduce some available regression tree approaches and explain how they are used to model nonlinear dependencies in insurance claim data. We investigate the relative performance of several software products in predicting the key claim variables for the decision to investigate for excessive a nd/ or fraudulent practices, and the expectation of favorable results from the investigation, in a large claim database. .Mnong the software programs we will investigate are CART, S-PLUS, TREENET, Random Forest and Insightful Miner Tree procedures. The data used for this analysis are the approximately 500,000 auto injury claims reported to the Detailed Claim Database (DCD) of the Automobile Insurers Bureau of Massachusetts from accident years 1995 t hrough 1997. The decision to order an independent medical examination or a special investigation for fraud, and the favorable outcomes of such decisions, are the modeling targets. We find that the methods all provide some predictive value or lift from the available DCD variables with significant differences among the met hods and the four targets. All modeling outcomes are compared to logistic regression as in Viaene et al. with some model / soft ware combinations doing significantly betxer than the logistic model. Keywords: Fraud, Data Mining, ROC Cu~,e, Variable Importance, Decision Trees - ) Derrig-Francis _005 - No more than two pqragraphs or one table or figure can be quoted without written permission of the authors before March 1, 2006. Casualty Actuarial Society Forum, Winter 2006 1 Distinguishing the Forest from the T R E E S I NT R ODUC T I ON I n recent years a number of approaches for model i ng data cont ai ni ng nonl i near and ot her complex dependencies have appeared i n the literature. Many of the met hods were developed by researchers from the comput er science, artificial intelligence and statistics disciplines 1. The met hods have been widely characterized as data mining techniques. These procedures include several that shoul d be of interest to actuaries dealing wi t h large and complex data sets. The procedures of interest for the purposes of this paper are various varieties of classification and regression trees or CART. Viaene et al (2002) applied a wider set of procedures, i ncl udi ng neural networks, support vect or machines, and a dassical general linear model, logistic regression, on a small single data set of i nsurance dai r n fraud indicators or "red flags" as predictors of suspicion of fraud. They found simple logistic regression did as well at predicting expert opi ni on on the presence of fraud as the mor e sophisticated procedures. Stated differently, the logistic model performed well enough i n model i ng the expert opi ni on of fraud that there was litde need for t he more sophisticated procedures:. A wide variety of statistical software is now available for i mpl ement i ng fraud and ot her predictive models t hrough clustering and data mi ni ng. I n this paper we will i nt roduce a variety of Regression Tree data mi ni ng approaches 3 and explain how they are used to model nonl i near dependencies i n i nsurance claim data. We also investigate the relative performance of several software products that i mpl ement these models. As an example of relative performance, we test for the key claim variables i n the decision to investigate for excessive a nd/ or fraudulent practices i n a large claim database. The software programs we investigate are CART, S-PLUS, TREENET, Random Forests, and Insi ght ful Tree and Ensembl e from the Insightful I~finer package. Naive Bayes and Logistic model s are used as benchmarks. The data used for this analysis are the aut o bodily injury liability dai ms reported to the Detailed Claim Dat abase 0DCD) of the Aut omobi l e Insurers Bureau of Massachusetts from accident years 1995 t hrough 1997 ~. Three types of variables are employed. Several variables t hought to be related to the decision to investigate are i ncl uded here as report ed to the DCD, such as out pat i ent provi der medical bill amount s. A few variables are i ncl uded that are derived from publicly available demographic data sources, such as i ncome per househol d for each claimant' s zip code. Addi t i onal variables are derived by accumul at i ng proport i onal statistics from the DCD; e.g., the distance from the claimant' s zip code to the zip code of the first medical provi der or claimant' s zip code rank for t he number of plaintiff attorneys per zip code. The decision to order an i ndependent medical exami nat i on or a special investigation for fraud, and a favorable out come for each, are the model i ng target. Ei ght model i ng software results will be compared for effectiveness based on a st andard procedure, the area under the receiver operating characteristic curve (AUROC). We find that t he met hods all provide some predictive value or lift from the DCD variables we make available, wi t h significant differences among the eight met hods and four targets. Model i ng out comes can be compared to logistic regression as i n Viaene et al. but the results here are different. They show some soft ware/ met hods can i mprove significantly on t he predictive 2 C a s u a l t y Ac t u a r i a l S o c i e t y Forum, Wi n t e r 2 0 0 6 Distinguishing the Forest from the T REES ability of t he logistic model. That result may be due to the relative richness of this data set a nd/ or t he types of i ndependent variables at hand compared to the Viaene data. We show how "i mpor t ant " each variable is wi t hi n each soft ware/ model tested s and not e the type of data that are i mpor t ant for this analysis. Thi s entire exercise shoul d provide practicing actuaries with guidance on regression tree software and market met hods to analyze complex nonl i near relationship commonl y found i n all types of i nsurance data. The paper is organized as follows. Section 1 i nt roduces the general not i on of non-l i near dependencies i n i nsurance data. Section 2 describes the data set of Massachusetts aut o bodily injury liability claims and variables used for illustrating the model s and software i mpl ement at i ons. Descriptions and illustrations of the data mi ni ng met hods applied i n the paper appear i n Section 3 while the specific software procedures are covered i n Section 4. Comparative out comes for the variables ("importance") and software ("AUROC") are report ed i n Sections 5 and 6. We provide some i nt erpret at i on of the results i n terms of the decision to investigate wi t hi n the Massachusetts data as an illustration of the usefulness of the model i ng effort i n Section 7. Implications for the use of the software models are discussed i n section 8. Cont us i ons are shown i n Section 9. S E C T I ON 1. NONL I NE AR I T Y I N I NS UR ANC E DATA Actuaries are nearly inseparable from data and data mani pul at i on techniques. Dat a come i n all forms as a mat t er of course. Numeri c (loss ratios), categorical (injury types), and text (accident description) data all flood insurers on a daily basis. Reserving and pricing are two maj or functions of casualty actuaries. Reserving involves compi l i ng and underst andi ng t hrough mathematical techniques historical patterns of a portfolio of i nsurance claims i n order to predict an ultimate value. Pricing involves taking the best estimates of historical cost data on claims and expenses, combi ni ng that data wi t h financial asset pricing models that include projecting future values i n order to arrive at best estimates of all costs of accepting underwri t i ng risk. Of course, actuaries continually l ook back at bot h analytic exercises to det ermi ne the accuracy of those estimates as the real account i ng data develops over time. Traditionally, actuarial models were confi ned to linear, multiplicative or mi xed algebraic equations i n the absence of the powerful comput i ng envi r omnent we enj oy today. Those mostly manual met hods provided crude approximations that sufficed when alternative met hods were unavailable or non-existent. Simple deviations from linear relationships, such as escalating inflation, could be handled by simple t ransformat i ons of the data (log transform) that allowed linear techniques to be applied to t he data. Gradually, over time these t ransformat i on techniques became mor e sophisticated and could be applied to many probl ems wi t h a variety of non-l i near data ~'. Tr end fines of time series data, such as dal rn severity or frequency, are generally amenabl e to linear techniques. However, data where interactions and cross correlations are essential to the model i ng of the dynamics of the process underl yi ng the data, require mor e Casualty Actuarial Society Forum, Winter 2006 3 Distinguishing the Forest from the TREES compr ehensi ve t echni ques t hat yield mor e preci si on on mor e types of data compl exi t i es. Fi gure 1-1 shows a part i cul ar non-l i near rel at i onshi p bet ween t wo i nsurance variables t hat woul d be difficult, i f not i mpossi bl e, t o modal wi t h si mpl e techniques. One pur pose o f this paper is t o demonst r at e a range of so-cal l ed artificial intelligence or statistical l earni ng techniques t hat have been devel oped t o handl e compl i cat ed rel at i onshi ps wi t hi n dat a sets. An Insurance Nonl i ne ar Funct i on: Provi der Bill vs. Probabi l i ty of I ndependent Medi c al Ex a m O9 O- O 8 O - o 7 0 - | ~ o ~ - o 5 o - o 4o- o a o - F 1 I I I I I I I I f l l l l l l l l l l ml l l l l l l l l P r o v i d e r 2 Bi l l Figure 1 -1 Nearl y all regressi on and economet r i c academi c courses address the t opi c o f nonl i neari t y, at least briefly. St udent s are i nst ruct ed in met hods t o det ect nonl i neari t y and how t o model it. Det ect i on generally involves usi ng scat t er pl ot s of i ndependent versus dependent vari abl es or evaluating pl ot s of residuals. Two met hods o f model i ng nonl i neari t y t hat are generally taught: are 1) t r ansf or mat i on of variables and 2) pol ynomi al regressi on (Miller and Wi c he m 7, 1977, and Net er et al, 1985). For instance, i f an exami nat i on o f residual pl ot s i ndi cat es t hat the magni t ude o f the residuals increases wi t h the size o f an i ndependent variable, the l og t r ansf or mat i on is r ecommended. Pol ynomi al regressi ons are consi der ed useful appr oxi mat i ons when a curvilinear rel at i onshi p exists but its exact f or m is unknown. A general i zat i on of linear model s "known as Gener al i zed Li near Model s or GLM (McCullagh and Nel der, 1989) enabl ed the model i ng o f mul t i vari at e rel at i onshi ps i n the presence o f certain ki nds o f non- nor mal i t y (i.e. where the r andom component is f r om the exponent i al family of di st ri but i on). The link funct i on of GLMs formal i zes t he i ncor por at i on of certain nonl i near rel at i onshi ps i nt o the model i ng procedure: The t ransformat i ons i ncor por at ed i nt o the c ommon GLMs are: The i dent i t y link: h(Y) = Y 4 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the T R E E S The l og link: h(Y) = lnC~ 1 The inverse link: h(Y) = - - (1) Y The l ogi t link: h(Y) = l n ( l _ ~ ) The pr obi t link: h(Y) = ~( Y) , denot es t he nor mal CDF Of these t ransformat i ons, the l og and logit t r ansf or mat i on appear frequently in the i nsurance literature. Because many i nsurance variables are right skewed, the l og t r ansf or mat i on is appl i ed to attained appr oxi mat e normal i t y and homogenei t y o f variance. I n addi t i on, apri ori or domai n consi derat i ons (e.g., the rel at i onshi p bet ween the i ndependent variables and the dependent variable is bel i eved to be mulfiplicative) somet i mes suggest the l og t ransformat i on. The l ogi t t r ansf or m is commonl y used when the dependent variable is binary. Unf or mnat dy, while the t echni ques cited above add significantly to the analyst' s ability t o model nonlinearity, t hey are not sufficient for many situations encount er ed in practice. I n actual i nsurance data, compl ex nonl i near rel at i onshi ps are the rule r at her t han the except i on. Some of the reasons t he t radi t i onal approaches oft en do not pr ovi de a satisfactory appr oxi mat i on t o nonl i near funct i ons are: The f or m of the nonl i neari t y may be ot her t han one o f t hose per mi t t ed by the -known t r ansf or mat i ons whi ch pr oduce linearity. Fi gure 1-1 displays one such non- linear funct i on based on the i nsurance dat abase used in this analysis. Whi l e a pol ynomi al o f adequate degree can appr oxi mat e many compl ex functions, ext rapol at i on beyond the data, or i nt erpol at i on wi t hi n the data, may be probl emat i c, particularly for hi gher or der pol ynomi al s. Det er mi ni ng the appr opr i at e t r ansf or mat i on (or pol ynomi al ) can be difficult i f not i mpossi bl e when t here are many i ndependent variables, and the appr opr i at e rel at i on bet ween the target and each i ndependent variable must be found. The rel at i onshi p bet ween a dependent variable and an i ndependent variable may be conf ounded by a t hi rd variable due to i nt eract i on or correl at i ons t hat are not si mpl e to approxi mat e. To r emedy these pr obl ems requires met hods where: Any nonl i near rel at i onshi p can be approxi mat ed. The analyst does not need to -know the f or m of the nonlinearity. The effect o f i nt eract i ons can be easily det er mi ned and i ncor por at ed i nt o the model . The met hod generalizes well on out - of - sampl e data for i nt er pol at i on or ext rapol at i on purposes. The regressi on tree met hods i ncl uded in our analysis meet these condi t i ons. Sect i on 3 of this paper describes how each o f our met hods model s nonlinearity. We now t urn to a descri pt i on o f the data set we will use in this analysis. C a s u a l t y Ac t u a r i a l S o c i e t y Forum, Wi n t e r 2 0 0 6 5 Distinguishing the Forest from the T R E E S SECTION 2. DESCRIPTION OF THE MASSACHUSETTS AUTO BODILY INJURY DATA The dat abase we will use for our analysis is a subset of the Aut omobi l e Insurers Bureau of Massachuset t s Det ai l Claim Dat abase ( DCD) ; namely, t hose claims f r om acci dent years 1995-1997 t hat had cl osed by J une 30, 2003 (AIB, 2004). Al l aut o claims s arising f r om i nj ury coverages: Per sonal Inj ury Pr ot ect i on ( PI P) / Me di c a l payment s excess of PI P 9, Bodi l y I nj ur y Liability (BIL), Uni nsur ed and Under i nsur ed Mot ori st . Whi l e t here are mor e t han 500,000 claims in this subset of DCD data, we will rest ri ct our analysis t o the 162,761 t hi rd par t y BI L coverage claims. This will allow us t o divide the sampl e i nt o training, test, and hol dout sub sampl es, each cont ai ni ng in excess of 50,000 claims TM. The dat aset cont ai ns fi ft y-four variables relating t o the i nsured, dai mant , accident, injury, medi cal t r eat ment , out pat i ent medi cal pr ovi der s (2 maxi mum) , at t orney presence, and three claims handl i ng t echni ques for mi t i gat i ng dai ms cost for t hei r presence, out come, and formul ai c savings amount s. The claims handl i ng techniques t racked are: I ndependent Medi cal Exami nat i on (IME), Medi cal Audi t (MA) and Special Invest i gat i on (SIU). I MEs are per f or med by l i censed physicians o f t he same type as the treating physi ci an u. They cost appr oxi mat el y $350 per exam wi t h a charge of $75 for no shows. They are desi gned t o verify cl ai med injuries and t o evaluate t r eat ment modalities. One sign of a weak or bogus claim is the failure t o submi t t o an I ME and, thus, an I ME can serve as a screeni ng devi ce for det ect i ng fraud and bui l d- up claims. MAs are peer reviews of the injury, t r eat ment and billing. They are typically done by physicians wi t hout a cl ai mant exami nat i on, by nurses on i nsurers' st af f or by t hi rd par t y organi zat i ons, but also f r om expert systems t hat revi ew the billing and t r eat ment pat t erns 12. Favor abl e out comes are r epor t ed by insurers when the damages are mi t i gat ed, the billing and t r eat ment are curtailed, and when the cl ai mant refuses to under go t he I ME or does not show. I n the l at t er t wo situations the i nsurer is on solid gr ound t o reduce or deny payment s under the fai l ure-t o-cooperat e clause in the pol i cy) 3 Special Invest i gat i on (SIU) is r epor t ed when claims are handl ed t hrough non- r out i ne investigative t echni ques (accident reconst ruct i on, exami nat i ons under oat h and surveillance are examples), possi bl y i ncl udi ng an I ME or Medi cal Audi t , on suspi ci on of fraud. For the mos t part , these claims are handl ed by Special Invest i gat i ve Units (SIU) wi t hi n the cl ai m depar t ment or by some t hi rd part y investigative service. Occasi onal l y, compani es will be or gani zed so t hat addi t i onal adjusters, not specifically a par t o f the company SIU, may also conduct special investigations on suspi ci on of fraud. Bot h types are r epor t ed t o DCD and we refer t o bot h by the shor t hand SIU in subsequent tables and figures. Favor abl e out comes are r epor t ed for SIU i f the claim is deni ed or compr omi s ed based on the SIU i nvest i gat i on. For pur poses o f this analysis and demonst r at i on of non-l i near model s and soft ware, we empl oy t went y- one pot ent i al l y predi ct i ng variables and four target variables. Thi r t een pr edi ct i ng variables are numeri c, t wo f r om DCD fields (F), eight deri ved f r om i nt ernal demogr aphi c type dat a (DV), and three variables deri ved f r om ext ernal demogr aphi c dat a (DM) as shown i n Tabl e 2-1. 6 C a s u a l t y Ac t u a r i a l S o c i e t y Forum, Wi n t e r 2 0 0 6 Distinguishing the Forest from the TREES Aut o Injury Liability Claim Nume r i c Vari abl es Variable N Type Provider I_BILL 162,761 F Provider 2_BILL 162,761 F ARe 155,438 DV Report La~ 162,709 DV Treatla~ 147,296 DV HouseholdsPerZipcode 118,976 DM AveralgeHouseValue Per Zip 118,976 DM IncomePerHousehold Per Zip 1 1 8 , 9 7 6 DM Distance ~IP1 Zip to CLT. Zip) 72, 786 DV Rankattl (rank art/zip/ 129,174 DV Rankdoc2 (rank prov/zip/ 109,387 DV Rankci~. (rank claimant city,) 118,976 DV Rnkpcity (rank provider ci~') 162,761 DV Valid N (lJstwise) 70,397 Mi ni mum M~Lximum 0 1,861,399 0 360,000 0 104 0 2,793 1 9 0 69,449 0 1,000,001 0 185,466 0 769 1 3,314 1 2,598 1 1,874 0 1,305 Std. Me an De vi at i on 2,671.92 6,640.98 544. 78 1,805.93 34.15 15.55 47.94 144.44 3.29 1.89 10,868.87 5,975.44 166,816.75 77,314.11 43,160.69 17,364.45 38.85 76.44 150.34 343.07 110.85 253.58 77.37 172.76 30.84 91.65 N = Number of non missing records; F=DCD Field, DV = Internal derived variable, DM = External derived variable Source; Automobile Insurers Bureau of Massachusetts, Detail Claim Database, AY 1995-1997 aud Authors' Cakulations. Tabl e 2-1 Ei g h t pr e di c t i ng var i abl es, a n d f our t ar get var i abl es ( I ME and SI U, De c i s i o n a nd Fa vor a bl e Ou t c o me f or each) , are cat egor i cal var i abl es, all t aken as r e por t e d f r o m DC D a nd as s h o wn i n Tabl e 2-2. Casualty Actuarial Society Forum, Winter 2006 7 Distinguishing the Forest from the T REES Vari abl e Policy Type Emergent, Treatment 162,761 Health Insurance 162,756 Provider I - Type 162,761 Provider 2 - T}'pe 162,761 2001 Territory 162,298 Attorney 162,761 Suspl (SIU Done / 162,761 Susp2 (IME Done / 162,761 Susp3 (SIU Favorable) 162,761 Susp4 (IME Favorable / 162,761 Injury Type 162,298 N = Nttmber of non missing records Aut o Injury Liability Claim CateBorical Variables N Type Type Description 162,761 F Personal 92%, Commercial 8% F None 9%, Onl,v 22%, w Outpatient 68% F Yes, 15%, No 26%, Unknown 60% F Chi r o 41%, Physical Th. 19%, Medical 30%, None 10% F Chi r o 6%, Physical Th. 6%, Medical 36%, None 52% F Rating Territories 1 (2.2%) Through 26 (1.3%); Territory 1- 16 by increasin~ risk, 17-26 is Boston F :kttorne~, present (89%), no attorney (11%) F Special Invesfi~tion Done (70/0/, No SIU (93%) Independent Medical Examination Done (8%), No IME F (920/o / Special Investagation Favorable 0.4%), Not Favorable/Not F Done (95.6% / Independent Medical Exa m Favorable (4.4%), Not F Favorable/Not Done (96.6% / Injury Types (24) including manor visible (4O/o), strain or F sprain, back and/or neck (81%), fatality (0.4%), disk herniation (1%) and others F= DCD Field Note: Descriptive percentages may not add to 100% due to rounding Source: Automobile Insurers Bureau of Massachusetts, Detail Claim Database, AY 1995-1997 andAuthors' Calculations. Table 2-2 Similar claim investigation variables are now bei ng collected by t he Insurance Research Council in their periodic sampling of countr3avide injury claims (IRC, 2004a, pp 89-104) 14. Nationally, about 4% and 2% of BI claims i nvol ved I MEs and SIU respectively, only one- hal f t o one-quart er o f t he Massachusetts rate. Most likely, this is because (1) a maj ori t y of ot her states have a full t ort system and so BI L contains all injury claims and (2) Massachusetts is a fairly urban state wi t h high claim frequencies and mor e dubious claimslk I n fact, t he mos t recent IRC study shows (IRC, 2004b, p25) Massachusetts has t he hi ghest percentage o f BI claims in no-fault states that are suspected of fraud (23%) a nd/ or buildup (41%). I t is t herefore, entirely consi st ent for t he Massachusetts claims t o exhibit mor e non- rout i ne claim handling techniques. Favorabl e out comes average about 67% when an I ME is done or a claim is referred t o SIU. We now t urn t o descriptions o f t he types of model s, and t he software that i mpl ement s t hem, in t he next t wo sections bef or e we describe how t hey are applied to model the I ME and SIU target variables. S E C T I ON 3. MODE L S F OR NON- L I NE AR DE P E NDE NC I E S Ho w mo d e l s ha ndl e nonl i ne a r i t y Tradi t i onal actuarial and statistical techniques oft en assume that t he functional r dat i onshi p bet ween t he i ndependent variables and t he dependent variable is linear or that s ome t ransformat i on of the data exists that can be treated as linear. Insurance data of t en cont ai n 8 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the TREES variables where t he relationship among variables is nonlinear. Typically when nonl i near relationships exist, t he exact nature o f the nonlinearity (i.e., wher e some t ransformat i on can be used t o establish linearity) is not known. I n t he field o f data mining, a number of nonparamet ri c techniques have been devel oped whi ch can model nonl i near relations wi t hout any assumpt i on bei ng made about t he nature of t he nonlinearity. We cover how each of our met hods reviewed i n this paper model s nonlinearities in t he fol l owi ng t wo examples. The variables i n this exampl e were selected because of a known nonl i near relationship bet ween i ndependent and dependent variables. Ex. 1 The dependent variable, a numeri c variable, is total paid losses and t he i ndependent variable is pr ovi der 2 bill. Tabl e 3-1 displays average paid losses at various bands o f provi der 2 bilP ~. Ex. 2 The dependent variable, a binary categorical variable, is whet her or not an i ndependent medical exam is request ed and t he i ndependent variable again is provi der 2 bill. Nonlinear Example Data Provider 2 Bill (Banded) Zero 1 - 250 251 - 500 501 - 1,000 1,001 - 1,500 1,501 - 2,500 2,501 - 5,000 5,001 - 10,000 10,001+ All Claims Avg Provider 2 Bill Avg Total Paid 9,063 Percent IME 6% 154 8,761 8% 375 9,726 9% 731 11,469 10% 1,243 14,998 13% 1,915 17,289 14% 3,300 23,994 15% 6,720 47,728 15% 21,350 83261 15% 545 11,224 8% Table 3-1 Tr e e s Trees, also known as classification and regression trees (CART) fit a model by recursively partitioning the data i nt o t wo groups, one group wi t h a hi gher value on the dependent variable and the ot her group wi t h a l ower value on t he dependent variable. Each partition o f the tree is referred t o as a node. When a parent node is split, t he t wo children nodes, or "l eaves" of t he tree, are each mor e homogenous (i.e., less variable) wi t h respect t he dependent variable 17. A goodness o f fit statistic is used t o select t he split whi ch maximizes t he difference bet ween the t wo nodes. When t he i ndependent variable is numeric, such as pr ovi der 2 bill, t he split takes t he f or m o f a cut poi nt , or threshold: x > c and x < c as in Figure 3-1. Casualty Actuarial Society Forum, Winter 2006 9 Distinguishing the Forest from the TREES CART Exampl e of Parent and Children Node s Tot al Paid as a Functi on of Provider 2 Bill 1 ~ S p l i t I i i Figure 3-1 The cut poi nt c is f ound by evaluating all possible values for splitting the numeri c variable i nt o higher and l ower groups, and selecting the value that optimizes t he split i n some manner . When t he dependent variable is numeri c, the split is typically based on the value whi ch results i n the greatest reduct i on i n residual sum of squares. For this example, all values of provi der 2 bill are searched and a split is made at t he value $5,021. All claims wi t h provi der 2 bill less t han $5,021 go to t he left node and "predi ct " a total paid of $10,770 and all claims wi t h provi der 2 bill greater t han $5,021 go t o the right node, and "predi ct " a total paid of $59,250. Thi s is depicted i n Figure 3-1. The tree graph shows that the total paid mean is significantly lower for the claims wi t h provi der 2 bills less t han $5,021. One statistic oft en used as a goodness of fit measure t o optimize tree splits is sum squared error or the total squared devi at i on of actual values ar ound the predicted values. The selected cut poi nt is t he one whi ch produces the largest reduct i on i n total sum squared errors (SSE). That is, for the entire database t he total squared deviation of paid losses ar ound the predicted value (i.e., t he mean) of paid losses is 4.95x10 is. The SSE declines t o 4.66x10 is after the data are part i t i oned usi ng $5,021 as the cut-point. Any ot her part i t i on of the provi der bill produces a larger SSE t han 4.66x1013. For instance, i f a cut' point of $10,000 is selected, t he SSE is 4.76x1013. The two nodes i n Figure 3-1 can each be split i nt o to children nodes and these can t hen be further split. The sequential splitting cont i nues unt i l no i mpr ovement i n the goodness of fit statistic occurs. The nodes cont ai ni ng the result of all the splits resulting from applying a sequence of decision rules are the final nodes oft en referred to as t ermi nal nodes. The t ermi nal nodes provi de t he predicted values of t he dependent variables. When the dependent 10 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the TREES variable is numeri c, the mean of the dependent variable at the t ermi nal nodes is the prediction. The curve of the predicted value resulting from a tree fit to total paid losses is a step function. As shown i n Figure 3-2A, with only two terminal nodes, the fitted funct i on is flat unt i l $5,021, steps up to a higher value and t hen remains flat. Figure 3-2B displays the predicted values of a tree wi t h 7 terminal nodes. The steps or increases are mor e gradual for this function. CART Exampl e w/ t h Two and Seven No de s Total Paid as a Funct i on of Provider 2 Bill | t ' 1 4 - o o Fi gure 3-2A Fi gure 3-2B The procedure for model i ng data where the dependent variable is categorical (binary i n our example) is similar to that of a numeri c variable. For instance, one of the fraud surrogates is i ndependent medical exam (IME) requested. The target class is claimants for whom an I ME was requested and the non-t arget group of (presumably) legitimate claims is that where an I ME was not requested. At each step, the tree procedure selects the split that best i mproves or lowers node impurity. That is, i t attempts t o part i t i on the data i nt o two groups so that one part i t i on has a significantly higher pr opor t i on of the target category, I ME requested, t han the ot her node. A number of statistical goodness of fit statistics measures is used i n different product s to select the optimal split. These i ncl ude ent ropy/ devi ance and Gi ni i ndex (which is described later i n this paper). Kant ardzi c (2003), Brei man et al (1993) and Venibles and Ripley (1999) describe the comput at i on and application of the Gi ni i ndex and ent ropy/ devi ance measures is. A score or probability can be comput ed for each node after a split is performed. This is generally estimated based on the number of observations i n the target groups versus the total number of observations at the node. The score or probability Casualty Actuarial Society Forum, Wi nt er 2006 11 Distinguishing the Forest from the T REES is frequently used to assign records t o one of the t wo classes. Typically, i f t he model score exceeds a t hreshol d such as 0.5, the record is assigned t o t he target class; ot herwi se it is assigned t o the non-t arget class. Figure 3-3A displays t he result of using a tree pr ocedur e t o predict a categorical variable f r om t he AI B data. The graph shows that each t i me t he data is split on pr ovi der 2 bill; one child node has a l ower pr opor t i on and t he ot her a hi gher pr opor t i on o f claimants recei vi ng I MEs. The fitted tree funct i on is used t o model a nonl i near relationship bet ween prmfider bill and t he probability that a claim receives an I ME as shown in Figure 3-3B. CART Exampl e wi t h Seven No de s IME Proporti on as a Funct i on of Provider 2 Bi l l I . t ! e Fi gure 3-3A CART Exampl e wi t h Seven Step Func t i ons IME Proporti on as a Funct i on of Provider 2 Bill Fi gure 3-3B Tr ee model s use categorical as well as numeri c i ndependent variables in model i ng compl ex data. However , because t he levels on categorical data may not be ordered, all possible t wo- way splits o f categorical variables must be consi dered bef or e t he data are partitioned. E n s e mb l e Mo d e l s - B o o s t i n g Ensembl e model s are composi t e tree models. A series of trees is fit and each tree i mpr oves t he overall fit o f t he model. I n t he data mi ni ng literature t he t echni que is oft en referred t o as 12 Casualty Actuarial Society Forum, Wi nt er 2006 Distinguishing the Forest from the T R E E S "boost i ng" (Hastie et al 2001, Frei dman, 2001). The met hod initially fits a small tree of say 5 to 10 t ermi nal nodes on a training dataset. Typically, the user specifies the number of terminal nodes, and every tree fit has the same number of t ermi nal nodes. The error, or difference bet ween t he actual and fitted values, is comput ed and used i n anot her r ound of fitting as a dependent variable. The error is also used i n the comput at i on of t he weight i n subsequent rounds of fitting, wi t h records cont ai ni ng larger errors receiving higher weighting i n the next r ound of estimation. One algorithm for comput i ng t he weight is described by Hastie et a119. Consi der an ensembl e of trees 1, 2, ...,M. The error for the m 'h tree measures the departure of the actual from t he fitted value on the test data after the m 'h model has been fit. When the dependent variable is categorical, as it is i n the fraud application i n this paper, a c ommon error measure used i n boost i ng is: N ~"w I( y, * F ( x ) ) e r r = ' =' N (2) ~ w I=1 where N is the total number of records, w, is a weight (which is initialized to 1 / N i n the first r ound of fitting), I is an i ndi cat or funct i on equal to zero i f the category is correctly predicted and one i f the class assigned is incorrect, y, is the dependent variable, x is a matrix of predictors and Fm(x ) is the prediction for the i 'h record of the m 'h tree. Then, the coefficient alpha is a funct i on of the weight: log(1 - e r r m ~m = ) e r r , and the new weight is: w,.m+ 1 =wm exp( aml ( y, #Fm (x))) (3) The process is performed many times unt i l no further statistical i mpr ovement i n the fit is obtained. The specific boost i ng procedures i mpl ement ed differ among different software products. For instance, TREENET (Freidman, 2001) uses stochastic gradi ent boosting. Stochastic gradient boost i ng i ncorporat es a number of procedures whi ch at t empt to build a mor e robust model by cont rol l i ng the t endency of large complex model s to overfit the data. A key technique used is resampling. A new sample is randoml y drawn from the training data each time a new tree is fit to t he residuals from the prior r ound of model estimation. The goodness of fit of the model is assessed on data not i ncl uded i n the sample, the test data. Anot her procedure used by TREENET to cont rol overfitfing is s hri nkage or regul af f z at i on. A simple way to i mpl ement shrinkage is to apply a weight whi ch is greater t han zero and less t han one to the cont r i but i on of each tree as it is added to the weighted average estimate. Casual t y Act uar i al Soci et y Forum, Wi nt er 2006 13 Distinguishing the Forest from the TREES Alternatively, the Insi ght ful Miner Ensembl e model employs a simpler i mpl ement at i on of boost i ng whi ch applies non-st ochast i c boost i ng and uses all the training data i n each r ound of fitting. The final estimate resulting from an ensembl e approach will be a weighted average o f all t he trees fit. Usi ng a large collection of trees allows: Many different variables to be used. Some of these woul d not be used i n smaller model s a'. Many different models are used. The predictive model i ng literature (Hasfie et al., 2001, Francis, 2003a, 2003c) indicates that composites of multiple model s per f or m bet t er t han the prediction of a single model -~1. Di fferent training and test records are used (with stochastic gradient boosting). Thi s makes the procedure mor e r obust to the i nfl uence of a few extreme observations. The met hod of fitting many (often 100 or more) small trees results i n fitted curves whi ch are al most smoot h. Figures 3-4A and 3-4B display two nonl i near funct i ons fit to total paid and I ME variables by the TREENET ensembl e model. 14 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the T REES Ensemble Prediction of Total Paid ~ 4000000- ~ 30000 00- _~ 20000 00- f i i i i l ~ l l l l l l l l l l l l l l 1 1 1 1 1 1 1 P r o v i d e r 2 B i l l Figuee 3-4 A 09O- 0 S O - o 7o- ~ o e o - | o . . , o- o 4o. o 3 o - V- i i I I I I I I I I I t l l l l l l l l l l l l l l l ' l P r o v i d e r 2 B i l l Figure 3-4 B E n s e mb l e Mo d e l s - B a g g i n ~ Bagging is an ensembl e approach based on resampling or bootstrapping. Bagging is an acronym for "boot st rap aggregation" (Hastie et al., 2000). Bagging does not use the error from the prior r ound of fitting as a dependent variable or weight i n subsequent rounds of fitting. Bagging uses recursive sampling of records i n the data to fit many trees. For instance an analyst may decide to take a 50% of the data as a training set each time a model Casualty Actuarial Society Forum, Winter 2006 15 Distinguishing the Forest from the T REES is fit. Under bagging, 100 or mor e model s may be fit, each one t o a di fferent sample. The trees fit are not necessarily small trees wi t h 5 t o 10 terminal nodes as wi t h boost i ng and each tree may have a di fferent number of terminal nodes. By averaging t he predictions of a number o f boot st rap samples, bagging reduces t he predi ct i on variance. The i mpl ement at i on of bagging used in this paper is known as Random Forest. I n addition to using onl y a sample of t he data each t i me a tree model is fit, Random For est also samples f r om t he variables. For t he analysis in this paper, one third of t he variables were sampl ed for each tree fit. Figures 3-5A displays an ensembl e Random For est tree fit t o total paid losses and Fi gure 3- 5B displays a tree fit t o I ME. Random Forest Predi ct i on of Tot al Pai d I I I I I I I 0 50000 150000 250000 350000 Provider 2 B i l l Figure 3-5 A 16 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the TREES Random Forest Predi ct i on of I ME c; o o c; g c~ I i I I i I i 50000 150000 250000 350000 Pr0v~der 2 B i l l Figure 3-5 B Nai ve Baves The Naive Bayes met hod is a relatively simple and easy to i mpl ement met hod. I n our comparison, we xdew it as a benchmar k data mi ni ng met hod. That is, we are interested i n how more complex met hods i mpr ove performance (or not) against an approach where simplifying assumpt i ons are made i n order to make the comput at i ons mor e tractable. We also use logistic regression model s as a second benchmark. The Naive Bayes met hod was developed for categorical data. Specifically, bot h dependent and i ndependent variables are categorical. Therefore, its application to fitting nonl i near functions will be illustrated onl y for the categorical target variable IME. I n order to utilize numeri c predictor variables it was necessary to derive new categorical variables based on discretizing, or "bi nni ng", the di st ri but i on of data for the numer i c variables = . The key simplifying assumpt i on of the Naive Bayes met hod is the assumpt i on of i ndependence. Al l predictor variables are assumed to act i ndependendy i n i nfl uenci ng the target variable. Int eract i ons and correlations among the predi ct or variables are not considered: Bayes rule is used to estimate the probability that a record wi t h gi ven i ndependent variable vect or X = {x} is i n category C = {c,} of the dependent variable. P(cj Ix,)=P(x, Icl) P(cl) / P(x,) (4a) Casualty Actuarial Society Forum, Winter 2006 17 Distinguishing the Forest from the T REES Because of the Naive Bayes assumpt i on of condi t i onal i ndependence, the probability that an observat i on ~11 have a specific set of values for the i ndependent variables is the pr oduct of the conditional probabilities of observi ng each of the values given category c, P ( X I cs ) =I'-I P(x , I c, ) (4b) J The met hod is described i n more detail i n Kant ardzi c (2003). To illustrate the use of Naive Bayes i n predicting discrete variables, the provi der 2 bill data was bi nned i nt o groups based on the quintiles of the distribution. Because about 50 percent of the dai ms have a value of zero for provi der 2 bill, onl y four categories are created by t he bi nni ng procedure. The new variable was used to estimate the I ME targets. Figure 3-6 displays a bar pl ot of t he predicted probability of an I ME for each of the groups. Figure 3-7 displays the fitted function. Thi s funct i on is a step funct i on whi ch changes value at each boundar y of a provider 2 bill bin. Bayes Predi cted Probability I ME Req uest ed vs. Quintile of Provider 2 Bill ~ . t ac ~ x - ) . 1c ~ x - : , xc c ~x- Provider 2 Bill Ouintile Figure 3-6 18 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the TREES OA20000- | 5 ~ 0 0000o0. Na i v e Bayes Predi cted I ME vs. Provi der 2 Bill i l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l Provi der 2 Bi l l Figure 3-7 S E C T I ON 4. S OF T WAR E F OR MODE L I NG NON- L I NE AR DE P E NDE NC I E S No n a d d i t i v i t v : i n t e r a c t i o n s Convent i onal statistical model s such as r egr essi on and logistic r egr essi on assume not onl y linearity, but also addi t i vi t y o f t he pr edi ct or variables. Un d e r additivity, t he effect o f each vari abl e can be can be added t o t he model one at a time. Wh e n t he exact f or m o f t he r el at i onshi p bet ween a de pe nde nt and i nde pe nde nt vari abl e depends o n t he val ue o f one or mor e ot he r vari abl es, t he effect s are not addi t i ve and one or mor e i nt er act i ons exist. For i nst ance, t he r el at i onshi p bet ween pr ovi der 2 bill and I ME may var y by t ype o f i nj ury (i.e. t raumat i c i nj uri es versus sprai ns and strains). I nt er act i ons are c o mmo n i n i nsur ance dat a (Wei sberg and Der r i g, 1998, Franci s, 2003c). Wi t h convent i onal l i near statistical model s, i nt er act i ons are i ncor por at ed wi t h mul t i pl i cat i ve t erms: Y = a + bl X 1 + b2X2 + b3*XI*X 2 (s) I n t he case o f a t wo-way i nt er act i on, t he i nt er act i on t er ms appear as pr oduct s o f t wo variables. I f one o f t he t wo vari abl es is categorical, t he i nt er act i on t er ms allow t he sl ope o f t he fitted line t o vary wi t h t he levels o f t he categorical vari abl e. I f b o t h vari abl es are cont i nuous t he i nt er act i on is a bi l i near i nt er act i on (Jicard and Turri si , 2003) and t he sl ope o f one vari abl e changes as a l i near f unct i on o f t he ot he r variable. I f b o t h vari abl es are categorical t he mode l is equi val ent t o a t wo f act or ANOVA wi t h i nt eract i ons. Casualty Actuarial Society Forum, Winter 2006 19 Distinguishing the Forest from the T REES The convent i onal approach to handl i ng interactions has some limitations. Onl y a limited number of types of interactions can be model ed easily. I f many predictor variables are i ncl uded i n the model , as is oft en t he case i n many predictive model i ng applications, it can be tedious, i f not impossible, to find all the significant interactions. Incl udi ng all possible interactions i n the model wi t hout regard to their significance likely results i n a model whi ch is over-parameterized. The tree-based data mi ni ng techniques used i n this paper each have efficient met hods for handl i ng interactions. Int eract i ons are i nherent i n the met hod used by trees to part i t i on data. Once data have been partitioned, different partitions can and typically do split on different variables and capture different interactions among the predictor variables. When t he decision rules used by a tree to reach a t ermi nal node i nvol ve mor e t han one variable, i n general, an interaction is bei ng modeled. Ensembl e met hods incorporate interactions because they are based on the tree approach. Nai ve Bayes, because it assumes condi t i onal i ndependence of the predictors, ignores interactions. Logistic regression incorporates interactions i n the same way ordi nary least squares regression does, with product i nt eract i on terms. I n this fraud compari son study, no at t empt was made to i ncorporat e i nt eract i on terms as this procedure lacks an efficient way to search for the significant interactions. Mul t i pl e predictors Thus far, the discussion of the tree-based models concerned onl y simple one or t wo variable models. Ext endi ng the tree met hods to i ncorporat e many pot ent i al predictors is straightforward. For each tree fit, the met hod proceeds as follows: For each variable determine the best two-way part i t i on of the data. Select the variable which produces the best i mpr ovement i n the goodness of fit statistic to split the data at a particular node. Repeat the process unt i l no further i mpr ovement i n fit can be obtained. Software for model i ng nonlinear dependenci es and testi ng the model s Four software products were i ncl uded i n our fraud comparison: They are CART, TKEENET, S-PLUS (R) and Insightful Mi n e r 23. CART and TREENET are Salford Systems stand-alone software product s that each performs one technique. CART (Classification and Regression Trees) does tree analysis and TREENET applies stochastic gradient boost i ng usi ng the met hod described by Fr ei dman (2001). All the software tested produce SAS c o d e 24 that can be used to i mpl ement the model 20 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the TREES i n a pr oduct i on stage. All the product s cont ai n a procedure for handl i ng mi ssi ng values usi ng surrogate variables. At any given split poi nt , CART and TREENET find the variable that is next i n i mport ance i n i nfl uenci ng the target variable and they use this variable to replace the mi ssi ng data. The specific statistic used to rank t he variables and find the surrogates is described i n Bri eman et. al. (1993). Di fferent versions of CART and TREENET handle different size databases. The number of levels of categorical variables affects how much memor y is needed, as mor e levels necessitate mor e memory. The 128k versi on of each product was used for this analysis. Wi t h approximately 100,000 records i n the t rai ni ng data, occasional memor y probl ems were experienced and i t became necessary to sample fewer records. One of the very useful features of the Salford Systems software is that all the products rank variables i n i mport ance :5. S-PLUS and R are comprehensi ve statistical languages used to per f or m a range of statistical analyses i ncl udi ng exploratory data analysis, regression, ANOVA, generalized linear models, trees and neural networks. Bot h S-PLUS and R are derived from S, a statistical programmi ng language originally developed at Bell Labs. The S progeny, S-PLUS and R, are popul ar among academic statisticians. S-PLUS is a commercial product sold by Insightful whi ch has a true GUI interface that facilitates easier handl i ng of some functions. Insightful also supplies technical support. The S-PLUS programmi ng language is widely used by analysts who do serious number crunching. They find it more effective, especially for processes that are frequently repeated. R is free open source statistical software that is supported largely by academic statisticians and comput er science faculty. It has onl y limited GUI functionality and the data mi ni ng funct i ons must be accessed t hrough the language. Most code written for S-PLUS will also work for R. One not abl e difference is that data must be convert ed to text mode to be read by R (a bi t of an i nconveni ence, but usually not an i nsurmount abl e one). Fox (2002) points out some of the differences bet ween the two languages, where they exist. The S-PLUS procedures used here i n the fraud compari son are found i n bot h S-PLUS and R. However one ensembl e tree met hod used, Random Forest, appears onl y to be available i n R. The S-PLUS (R) procedures used were: the tree funct i on for decision trees and the gl m (generalized linear models) for logistic regression. S-PLUS (R) incorporates relatively crude met hods for handl i ng mi ssi ng values. These i ncl ude eliminating all records with a mi ssi ng value on any variable, an approach whi ch is generally not r ecommended (Francis 2005, AUsion 2002). S-PLUS also creates a new category for missing values (on categorical variables) and allows abort i ng the analysis i f a missing value is found. I n general, i t is necessary to preprocess the data (at least the numeri c variables where there is no mi ssi ng value met hod 2~) to make a provi si on for the missing values. I n the fraud comparison, a const ant not i n the range of the data was substituted i nt o the variable and an indicator dummy variable for mi ssi ng was created for each numeri c variable with missing values. S- PLUS and R are generally not considered optimal choices for analyzing large databases. Aft er experiencing some difficulty reading training data of about 100,000 records i nt o S- PLUS, the database was reduced to cont ai n onl y the variables used i n the analysis. Once the data was read i nt o S-PLUS, few probl ems were experienced. Anot her eccentricity is that the S-PLUS tree funct i on can onl y handle 32 levels on any given categorical variable, so i n the preprocessing the number of levels may need to be reduced 27. The R Random Forest funct i on incorporates a procedure that can be used to rank variables i n importance. The Casualty Actuarial Society Forum, Winter 2006 21 Distinguishing the Forest from the TREES procedure produces an imp urity statistic which can be used to rank the variables. The i mpuri t y is based on t he Gi ni index for classification applications and mean squared error for numeri c dependent variables. The S-PLUS tree f uncdon cont ai ns no bui l t -i n capability for ranki ng variables i n importance. Therefore usi ng the S-PLUS language, an algorithm was coded i nt o S-PLUS to rank the variables. The met hod is described i n Francis (2001) and Pot t s (2000). The procedure quantifies how much the error increases when a variable is removed from the model ; the larger the increase i n errors, the more i mpor t ant the variable. The Insi ght ful Mi ner is a data mi ni ng s.uite that cont ai ns the most common data mi ni ng tools: regression, logistic regression, trees, ensembl e trees, neural networks and Nai ve Bayes ~. As ment i oned earlier, Insightful also markets S-PLUS. However, the Insi ght ful Mi ner has been opt i mi zed for large databases and cont ai ns met hods (Naive Bayes) whi ch are not part of S-PLUS (R). The Naive Bayes, Tree and Ensembl e Tree procedures f r om Insightful Mi ner are used here i n the fraud comparison. The insightful Mi ner has several procedures for automatically handl i ng mi ssi ng values. These are 1) drop records wi t h mi ssi ng values, 2) randoml y generate a value, 3) replace wi t h the mean, 4) replace wi t h a const ant and 5) carry forward the last obseta-ation. Each mi ssi ng value was replaced wi t h a constant. I n theory, the data mi ni ng met hods used, such as trees, shoul d be able to part i t i on records coded for mi ssi ng from the ot her obse~-ations wi t h legitimate categorical or numer i c values and separately estimate their impact on the target variable (possible after allowing for interactions wi t h ot her variables). Server versions of the Insightful Mi ner generate C code that can be used i n deploying the model, but the versi on used i n this analysis did not have that capability. As ment i oned above some preprocessing was necessary for the Naive Bayes procedure. Since Insi ght ful Miner cont ai ns no procedure for ranki ng variables i n i mport ance, no rankings were provided for the I mi ner met hods. Val i dat i ne and Te s t i n~ v It is common i n data mi ni ng circles to partition the data i nt o three groups (Hastie et al., 2001). One group is used for "training", or fitting the model. Anot her group, referred t o as the validation set, is used for "testing" the fit of the model and re-estimating parameters i n order to obt ai n a bet t er model. It is common for a number of iterations of testing and fitting to occur before a final model is selected. The third group of data, the "hol dout " sample, is used to obt ai n an unbi ased test of the model ' s accuracy. An alternative approach to a validation sample that is especially appropriate when the sample size used i n the analysis is relatively modest , is cross-validation. Cross-validation is a met hod i nvol vi ng hol di ng out a por t i on of the t rai ni ng sample, say one fifth of the data, fitting a model to the remai nder of the data and testing it on the held out data. I n the case of 5-fold cross validation, the process is repeated five times and the average goodness of fit of the five validations is comput ed. The various software products and procedures have different met hods for validating the models. Some (Insightful Mi ner Tree) onl y allow cross-validation. Ot hers ( TREENET) use a validation sample. S-PLUS (R) allows either approach -~ to be used (so a test sample of about 20% of the training data was used as we had a relatively large database). Nei t her validation sample nor cross-validation was used with Nai ve Bayes, Logistic Regression or the Ensembl e Tree. 22 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the TREES I n this analysis, approximately a third of t he data, about 50,000 records, was used as the hol dout sample for the final testing and compari son of the models. Two key statistics oft en used to compare model s accuracy are sensitivity and spedficity. Sensitiv ity is the percentage of events (i.e., claims wi t h an I ME or referred t o a special investigation unit) that were predicted to be events. The q ~ecifid~y is the percentage of nonevent s (in our applications claims believed to be legitimate) that were predicted to be nonevent s. Bot h of these statistics should be high for a good model. Tabl e 4-1, oft en referred to as a confusi on matrix (Hasfie et. al., 2001), presents an example of the calculation. Sample Conf usi on Matrix: Sensitivity and Specificity True Cl ass Predi ct i on No Yes Row Tot al No 800 200 1 ,000 Yes 200 400 600 Col umn Tot al 1 ,000 600 Correct Tot al Pement Correct S ensi t i vi t y 800 1 ,000 80% S p eci fi ci t y 400 600 67% Table 4-1 I n the example confusi on matrix, 800 of 1,000 non- event s are predicted to be non-event s so the sensitivity is 80%. The specificity is 67% since 400 of 600 true positives are accurately predicted. Casualty Actuarial Society Forum, Winter 2006 23 Distinguishing the Forest from the TREES S ECTI ON 5. SOFTWARE RANKI NGS OF "I MPORTANT" VARI ABLES I N T HE DECI S I ON TO I NVESTI GATE: I ME AND SI U The remainder of this paper is devoted to illustrating the usefulness and effectiveness of eight model / soft ware combinations applied to our Example 2, the decision to investigate via IMEs or referral to SIU. We model the presence and proport i on of favorable outcomes, of each investigative technique for the DCD subset of automobile bodily injury liability (third party) claims from 1995-1997 accident years. 3" We employ twenty-one potentially predicting variables of three types: (1) eleven typical claim variable fields informative of injury claims as reported, bot h categorical and numeric, (2) three external demographic variables that may play a role in capturing variations in investigative dairn types by geographic region of Massachusetts, and (3) seven internal "demographic" variables derived from informative pattern variables in the database. Variables of type 3 are commonly used in predictive modeling for marketing purposes. The variables used for these illustrations are by no means optimal choices for all three types of variables. Optimization can be approached by other procedures (beyond the scope of this paper) that maximize information and minimize cross correlations and by variable construction and selection by domain experts. The eight model / soft ware combinations we xxtll use here are of two categories: six tree models, and two benchmark models (Naive Bayes and Logistic). They are: 1) TREENET 5) Iminer Ensemble 2) Iminer Tree 6) Random Forest 3) SPLUS Tree 7) Naive Bayes 4) CART 8) Logistic As described in Section 4, CART and TREENET are Salford Systems stand-alone software products that each performs one technique. CART (Classification and Regression Trees) does tree analysis, and TREENET applies stochastic gradient boosting to an ensemble of trees using the met hod described by Freidman (2001). The S-PLUS procedures used here in the fraud comparison are found in bot h S-PLUS and in a freeware version in R. These were: the tree function for decision trees, and the GLM (generalized linear models) for logistic regression. Insightful Miner is a data mining suite. The Naive Bayes, Tree and Ensemble Tree procedures, from Insightful Miner are used here in the fraud comparison. Model performance is covered in the next section, section 6, as we first cover the ranking of variables by "importance" in r dat i on to the target variables: the decision to perform an I ME or a Special Investigation (SIU) and the favorable outcomes of each investigative technique. The training data of approximately 75,000 records was used in the ranking evaluations. Data mining modds are typically complex models where it is difficult to determine the relevance of predictors to the model result. One of the handy tasks that some of the data 24 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the TREES mi ni ng software product s perform is to rank the predictor variables by their i mport ance to the model i n predicting the dependent variable. Where the software did not supply a ranking, we omi t t ed an i mport ance ranki ng leaving five model / sof t war e det ermi nat i ons of i mport ance for the twenty-one variables. Di fferent procedures are used for different met hods and different products. Two software products, CART and TREENET supply i mport ance rankings. The procedures used are: CART: CART uses a goodness of fit measure, also referred to i n the literature as an i mpuri t y measure, and comput ed over the entire tree, to det ermi ne a variable' s importance. I n this study the goodness of fit measure was the Gi ni Index defined bel ow (Hastie, et al., p.271-272): i ( t ) = 1 - ~ p , i--the cat egori es of t he dependent vari abl e and p.is the probability of I class (6) Each split of the tree lowers the overall value for the statistic. CART keeps track of the i mpuri t y i mpr ovement at each node for bot h the variable used i n the split and for surrogate variables used as a repl acement i n the case of mi ssi ng values. A consequence of this is that a variable not used for splitting may rank higher i n i mport ance t han a variable that is. TREENET: Because i t is composed of many small CART trees, TREENET uses the same met hod as CART to comput e i mport ance rankings. S-PLUS CR) does not supply an i mport ance ranking, but the programmi ng language can be used to program a procedure to comput e rankings. A sensitix4ty value was comput ed for each variable i n the model. The sensitivity is a measure of how much t he predicted value' s error increases when the variables are excluded from the model one at a time. However, instead of actually exdudi ng variables and refitting the model, their values are fixed at a const ant value. (See Francis, 2001 for a detailed recipe for applying the approach). The sensiti~fty statistic was used to rank the variables from the tree function. For the logistic regression, i nf or mat i on about the variables cont ri but i on to sum of squared variation explained by the model was used to rank it. Like CART and TREENET, Random Forest uses an i mpuri t y measure (i.e., Gi ni Index) to produce an i mport ance ranking. Insightful Mi ner does not supply i mport ance rankings. Unlike S-PLUS (R), the analytical met hods are not accessed t hrough the language but t hrough a series of i cons placed on a palate. Thus, we were not able to cust om program a ranki ng procedure for application wi t h the Imi ner' s model i ng methods. The resulting i mport ance rankings were used i n Tables 5- 1A & 5-2A for the decisions to investigate and 5-1B and 5-2B for the favorable outcomes. Each of five model / soft ware combi nat i on out put s allowed for the evaluation of the predicting variables i n rank order of i mport ance, when significant, together wi t h a measure of the relative value of i mport ance on a scale of zero (insignificant) to 100 (most significant Casualty Actuarial Society Forum, Winter 2006 25 Distinguishing the Forest from the TREES vari abl e). Tabl e 5- 1A displays t he i mpor t a nc e resul t s f or pr edi ct i ng an I ME usi ng t he fi ve t r ee model s whi l e Tabl e 5-1B displays t hos e resul t s f or t he r emai ni ng fi ve mo d e l / s o f t wa r e combi nat i ons , i ncl udi ng t he be nc hma r k Nai ' ve Bayes and Logi st i c. Th e pr edi ct i ng var i abl es are l i st ed i n t he or de r o f i mpor t a nc e i n t he T R E E NE T model , wher e all var i abl es are si gni fi cant . Th e n u mb e r o f si gni f i cant var i abl es f ound r anges f r om a l o w o f t wel ve var i abl es ( S- PLUS Tr ee) t o all t went y one ( T RE E NE T ) . Variable Provider 2 Bill Attorneys Per Zip Territo~" Health Insurance Injury Type Provider 1 Bill Provider 1 Type Report Lag Attorney Age Provider 2 Type Income Household/Zip Avg. Household Price/Zip . Providers per City Claimants per City Providers / Zip Households/Zip Treatment Lag Distance~lPl Zip to Clt Zip Emergency Treatment Policy. Type Software Ranki ng of Variables for I ME De c i s i on By Importance Rank and Val ue (1) T RE E NE T 1 (1 00) 2 (80) 3 (71) 4 (61) 5 (50) 6 (4"0 7 (31) 8 (31) 9 (25) 10 (23) 11 (19) 12 (18) 13 (17) 14 (17) 15(16) 16 (16) 17 (16) 18 (14) 19 (13) 20 (4) 21 (3) ( 2) S Pl us Tree 2 (91) 5 (26) (32) 1 (lOO) 6 (24) 3 (51) 9 (7) 7 (16) 12 (3) 8 (9) 11 (3) l o (4) (3) CART 1 (100) 13 (9) 11 (11) 3 (68) 5 (47) 4 (58) 8 (1 8 ) 17 (2) 10 (13) 15 (5) 9 (15) 1 8 (2) 20 (0.1) 7 (20) 19 (2) (4 ) Random Forest 1 (100) 6 0 4) 3 (59) 2 (8 4 ) 10 (18) 4 (59) 12 (15) 8 (27) 19 (5) 17 (8) 5 (42) 11 (1 6) 1 6 (9) 7 (32) 15 (13) 13 (15) 9 (24) 14 (14) 1 8 (6) 20 (0) (s) Logi sdc lO (1 ) 11 (1) 1 (1 00) 2 (51) 6 (8 ) 1 3 (1 ) 5 (1 8 ) 3 (47) 9 (2) 12 (1) 8 (2) 7 (2) 4 (24) Note: * represents insignificance of variable in the model. Tabl e 5-1 A Th e s ame set o f mo d e l / s o f t wa r e c ombi na t i ons was us ed wi t h t he same set o f t we nt y- one pr edi ct i ng var i abl es t o pr edi ct t he f avor abl e o u t c o me o f t he I ME. Tabl e 5-1B s hows t he i mpor t a nc e o f each o f t he 21 pr edi ct or s f or mode l i ng f avor abl e out c ome s o f I MEs . 2 6 C a s u a l t y Ac t u a r i a l S o c i e t y Forum, Wi n t e r 2 0 0 6 Distinguishing the Forest from the T R E E S Variable Provider 2 Bill Attorneys Per Zip Territory Health Insurance Injury Type Provider I Bill Provider I Type Report Lag Attorney Age Proxdder 2 Type Income Househol d/ Zi p Avg. Household Pri ce/ Zi p Providers per City Claimants per City Providers/Zip Households/Zip Treatment Lag Distance MP1 Zip to Clt Zip Emergency Treatment Vor~q Type Software Ranki ng of Variables for IME Favorable By Importance Rank and Val ue ( 1 ) ( 2) ( 3 ) TREENET S Plus Tree CART 5 (64) 3 (22) 4 (37) 11 (28) * 11 (6) 2 (98) 2 (43) 12 (5) 1 (1 00) 1 (1 00) 1 (1 00) ( 4) Random Forest 5 (49) 13 (28) 1 (lOO) 2 (71) (s) Logi st i c 2 (13) 11 (1) 4 (9) 1 (lOO) 4 (76) 7 ( 4 5) 8 ( 38 ) 6 (53) . 12 (25) 13 (24) 1 0 ( 29) 20 ( 7) . 15 (16) 1 9 ( 8 ) 9 (36) 17 (12) 16 (15) 14 ( 22) 3 (78) . 1 8 ( 9) 21 (5) 5 ( 1 0) 9 ( 1 5) 4 (15) 2 (51) 9 (16) 5 0 6) 8(7) 18(o) * 19 (0) * 6 0 0 ) 11 (4) 17 (0) * l S ( 0) * 8 (17) 12 (3) 13 (2) 13 (2) 7 (20) 7 (7) 16 (0) 1 4 (1 ) 1 0 (6) 6 (8) 14 (1) 10 (6) 3 (44) 4 (67) 3 (13) 3(70) * 10 02) 5 (9) 6 (45) 8 (6) 18 (3) 7 (8) 9 (33) * 12 01) * 8 (33) 1 0 g) 15 (23) * 16 (22) 13 (1) 11 01) * 7 (37) 9 (-'2) 14 (28) 6 (8) 17 (5) 12 (1) Note: * represents insignificance of variable in the model. Tabl e 5-1B Th e s a me s et o f f i ve mo d e l / s o f t wa r e c o mb i n a t i o n s was us e d wi t h t he s a me s et o f t we nt y- o n e pr e di c t i ng var i abl es t o p r e d i c t t he us e o f s p e d a l i nve s t i ga t i on or SI U. Ta bl e s 5- ZR a n d 5- 2B s h o w t he c o r r e s p o n d i n g r a n k i n g o f var i abl es by i mp o r t a n c e f or e a c h o f t he f i ve mo d e l c o mb i n a t i o n s a nd t wo t a r ge t var i abl es , de c i s i on a nd f avor abl e. C a s u a l t y Ac t u a r i a l S o c i e t y Forum, Wi n t e r 2 0 0 6 2 7 Distinguishing the Forest from the TREES Software Ranki ng of Vari abl es for SIU Deci si on By I mpor t ance Rank and Val ue Vari abl e Provi ders/ Zi p Provider 2 Type Territo9, Health lnsumnce Provider 1 Bill l nj u~ Type Attorney Provider 1 Type Age Provider 2 Bill Report lag Average House Price Attorneys/zip Distance to Provider (1) TREENET 1 (100) 2 (98) 3 (92) 4 (64) 5 (59) 6 (52) 7 (47) 8 08) 9 (31) 10 (30) 11 (28) Emergency, Treatment Income/ Cap Household Claimants per City Treatment Lag Househol ds/ Zi p Policy Type Providers per City 12 (28) 13 (22) 14 (20) 15 (19) 16 (18) 17 (17) 18 (1 6) 19 (16) 20 (8) 21 (6) (2) S Pl us Tree 1 (100) 10 (3) 5 (18) 3 (33) 2 (51) 7(6) 8 (4.5) 4 (29) 6 (8) 11 (3) 9 (34) 12 (1) (3) CART 8 (37) 1 5 (34) 3 (84) 7 (52) 2 (8 5) 5 (59) 17 (13) 4 (69) 1 (1 00) 6 (54 ) 15 (18) 14 (20) 19(4) 13 (27) 9 (4.5) 12 (30) 18 (1 2) 16 (16) 11 (30) ( 4 ) Random Forest 3 (74) 10 (30) 1 (1 00) 6 (50) 2 (89) 16 (5) 18(4) s (8 1 ) 17 (5) 4 (74 ) 8 (10) 9 00) 15 (18) 19 (4) 13 (21) 11 (26) 14 (20) 12 (21 ) 20 (1) 7(44) ( s) Logi suc 6 (39) 7G8) 14 (2) 2 (71) 3 (63) 1 (100) 13 (5) 11 (17) 12 (7) 4 (58) 5 (4 9) 9 (27) 15 (2) 8 (28 ) 10 (22) Note: * represents insignificance of variable in the model. Tabl e 5-2A 2 8 C a s u a l t y Ac t u a r i a l S o c i e t y Forum, Wi n t e r 2 0 0 6 Distinguishing the Forest from the T REES Vari abl e Providers/Zip Provider 2 Type Territo~" Health Insurance Provider 1 Bill Injuq, Type Attorney Provider 1 Type Age Proxader 2 Bill Report lag Average House Price Attorneys/zip Distance to Provider Emergency Treatment Income/Cap Household Claimants per City Treatment Lag Households/Zip Policy T.vpe Providers per City Sof t ware Ra nk i ng of Vari abl es for SI U Favorabl e By I mpor t anc e Ra n k and Val ue (6) (7) (8) TREENET $ Pl us Tree CART 10 (20) 10 (6) 12 (25) 4 (41) * 7 (35) 1 (1 00) 2 (94) 1 (100) 13 (18) 6 (16) * 6 (30) . 13 (4) . 15 (9) . 3 (58) 5 (16) . 6 (39) 14 (16) 12 (4) 9 (27) 5 (40) 1 (100) 3 (50) . 8 (22) * 1 7 (7) 2 (66) 4 (18) 8 (32) . 7 (25) . 7 (1 4) 1 9 (2) . 15 (16) * 13 (24) 11 (19) 8 (14) 4 (45) 16 (15) 9 (14) . 5 (39) 21 (9) , 3 (72) 14 (17) 17 (14) 11 (5) . 2 (61) 12 (19) * 11 (25) 19 (13) * 18 (4) 18 (13) * 16 (9) . 20 (10) * * 9 (21) 14 (3) 10 (26) Note: * represents insignificance of variable in the model. Tabl e 5-2B ( 9) Ra n d o m Fores t 7 (24) 9 08) 1 (1 00) 1 5 (1 0) 5 (29) 16 (8) 18 (6) 3 (33) 13 (13) 6 (26) 2 (36) lO (1 7) 14 (11) 11 (16) 12 (13) 17 (6) 8 (1 9) 4 (31 ) (10) Logistic 1 3 (2) 5 (21 ) 1 (100) 7 (19) 14 (1) 3 (41) 6 (20) 2 (4 5) 11(2) 9 (3) 12 (2) 4 (25) 15 (1 ) 8 (5) lO (2) Cl earl y, i n b o t h i ns t a nc e s o f t ar get var i abl es t he speci f i c mo d e l a nd s of t wa r e i mp l e me n t a t i o n de t e r mi ne s h o w t o u n wi n d t he cr os s c or r e l a t i ons t o ext r act t he mo s t i n f o r ma t i o n f or pr e di c t i on pur pos e s . F o r e xa mpl e , t he di s t a nc e b e t we e n t he c l a i ma nt ' s zi p c ode a n d t he f i r st o u t p a t i e n t p r o v i d e r ( Di s t ance) r a nks l o w i n i mp o r t a n c e ( 19/ 21) i n t he T R E E NE T a ppl i c a t i on f or t he I ME de c i s i on t ar get b u t i t is qui t e i mp o r t a n t i n t he T R E E NE T mo d e l f or f a vor a bl e I ME o u t c o me ( 3/ 21) . No t e , h o we v e r , p r o v i d e r 2 bi l l is d e e me d hi ghl y i mp o r t a n t i n all I ME n o n - b e n c h ma r k appl i cat i ons . On e way t o i s ol at e t he i mp o r t a n c e o f e a c h pr e di c t i ng va r i a bl e is t o tally a s u mma r y i mp o r t a n c e s cor e acr os s mode l s . We wi l l us e a s cor e o f ( 21- r a nk) *( i mpor t a nc e ) , wi t h all i ns i gni f i c a nt var i abl es a s s i gne d ze r o i mp o r t a n c e , s u mme d o v e r all r e l e va nt mo d e l c o mb i n a t i o n s . F o r e xa mpl e , t he var i abl e pr oxqder 2 t ype wo u l d ha ve a s u mma r y s cor e r el at i ng t o t he I ME t a r ge t acr os s t he fi ve t r ee mo d e l s f or a t ot al i mp o r t a n c e s cor e o f 2, 268. Thi s s c or i ng f o r mu l a i s t3"pical o f t he ad h o c me t h o d s c o mmo n t o dat a mi n i n g anal yt i cs. Th e mul t i pl i cat i ve f o r m gi ves e mp h a s i s t o b o t h t he cat egor i cal r a n k a nd t he i mp o r t a n c e s c or e i n a dua l mo n o t o n e way. Th e n u me r i c va l ue o f t he s c or e i s l ess i mp o r t a n t t h a n t he fi nal r a nki ngs o f t he var i abl es . Ta bl e s 5 - 3 A&B a n d 5 - 4 A&B s h o w t he r a nge o f var i abl e i mp o r t a n c e s u mma r y s cor es f or all var i abl es r el at i ve t o t he t wo t ar get s, I ME a nd SI U, r espect i vel y. Th e r a nks o f t he var i abl es a c c o r d i n g t o t he t wo s u mma r y s cor es ar e hi ghl y ( Pear s on) c or r e l a t e d as, f or e xa mpl e , t h e de c i s i on s u mma r y r a nks a n d f a vor a bl e s u mma r y r a nks ha ve c or r e l a t i on coef f i ci ent s o f 0. 65 f or I ME a nd 0. 57 f or SI U. Th e t abl es Cas ual t y Act uar i al Soci et y Forum, Wi n t e r 2006 29 Distinguishing the Forest from the TREES al so i ndi cat e t he vari abl e cat egor y o f ori gi nal DC D field (F), an i nt ernal l y der i ved vari abl e ( DV) a nd an ext er nal de mogr a phi c vari abl e (DM). The ext ernal de mogr a phi c var i abl es do not s e e m t o be ver y i nf or mat i ve i n t he pr es ence o f t he field and der i ved vari abl es chos en. Important Variable Summari zadons for IME Tree Model s Appl i ed to Dec i s i on and Favorable Targets Total Dec i s i on Favorabl e Score Score Score Variable Total Variable type Score Rank Rank Rank Health Insurance F 17,206 1 2 1 Provider 2 Bill F 10,820 2 1 4 Territota? F 7,871 3 5 2 Provider 1 Bill F 6,726 4 4 3 Injury Type F 6,084 5 6 5 Attorneys Per Zip DV 3,102 6 3 15 Provider 2 Type F 2,873 7 8 9 Report Lag DV 2,859 8 16 7 Provider 1 Type F 2,531 9 10 6 Distance MP1 Zip to Clt Zip DV 1,655 10 11 8 Treatment Lag DV 1,331 11 17 16 Emergency. Treatment F 1,216 12 7 10 Claimants per Cit 3, DV 1,146 13 14 13 Income Household/Zip DM 987 14 13 17 Attorney F 971 15 9 19 Households/Zip DM 957 16 19 11 Age F 881 17 12 14 Pro~ders/Zip DV 838 18 18 12 Providers per City DV 719 19 20 18 Avg. Household Price/Zip DM 262 20 15 20 Policy. Type F 4 21 21 21 Tabl e 5-3 30 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the T REES Important Variable Summari zat i ons for SIU Tree Model s Appl i ed to Dec i s i on and Favorabl e Targets Variable Tot al Variable Type Score Territory F 15,242 Provider 1 Type F 9, 965 Providers/Zip DV 6, 676 Provider I Bill F 6, 240 Provider 2 Bill F 6, 030 Injury Type F 5, 845 Provider 2 Type F 4, 753 Health Insurance F 4, 262 Emergency Treatment F 3, 039 Attorney F 2, 705 Report lag DV 2, 642 Providers per City. DV 2, 275 Attorneys/zip DV 2, 183 Distance to Provider DV 2, 109 Income/Cap Household DM 2, 091 Claimants per City DV 1, 142 Households/Zip DM 1,061 Age F 830 Treatment Lag DV 706 Average House Price DM 648 Policy Type F 19 Total Score Dec i s i on Favorable Score Score Rank Rank Rank 1 2 1 2 4 2 3 1 13 4 3 10 5 5 4 6 7 3 7 8 6 8 6 15 9 13 5 10 9 14 11 10 9 12 12 10 13 14 8 14 11 14 15 15 7 16 18 16 17 16 18 18 19 17 19 17 20 20 20 9 21 21 21 Table 5-4 Ad d i t i o n a l An a l y s e ~ Mo s t s of t war e al l ow f or addi t i onal di agnos t i c t ool s t hat f ocus o n t he i mp o r t a n c e o f i ndi vi dual var i abl e l evel s i n t he pr edi ct i ve mode l . We f ocus o n t wo s uch feat ures: par t i al d e p e n d e n c y pl ot s a n d p r u n i n g o f t rees. Bo t h f eat ur es ar e de s i gne d t o i l l ust rat e t he c ont r i but i on o f each kv el o f cat egor i cal var i abl e a nd each interv al o f c ont i nuous var i abl es cr eat ed by t he cut poi nt s . We i l l ust rat e t he addi t i onal anal yses us i ng t he Ra n d o m Fo r e s t a nd S- PLUS' s t r ee s of t war e. P a r t i a l De p e n d e n c e Th e part i al d e p e n d e n c e pl ot is a us ef ul way t o vi sual i ze t he e f f e c t o f t he val ues o f a speci f i c var i abl e o n a d e p e n d e n t var i abl e wh e n a c o mp l e x mo d e l i n g me t h o d s uc h as Ra n d o m Fo r e s t is used. Th e part i al d e p e n d e n c e pl ot is a gr a ph o f t he mar gi nal e f f e c t o f a var i abl e o n t he cl ass pr obabi l i t y. F o r a cl assi f i cat i on appl i cat i on (in Ra n d o m For es t ) , t he part i al pl ot us es t he l ogi t or l og o f t he odds r at i o ( t he odds o f be i ng i n t h e t ar get cat egor y ver s us i t s c ompl i me nt ) r a t he r t ha n t he act ual pr obabi l i t y. Casualty Actuarial Society Forum, Winter 2006 31 Distinguishing the Forest from the T REES K f (x ) = log p, (x) - ~ log(p, ) (7) / = 1 Figures 5-1 and 5-2 show the partial dependence plot for the two IME targets for the most important variable in Table 5-4, territory. Random Forest: IME Requested Part i al D e p e n d e n c e o n Ter r i t o r y .D m Q t o o o t ' o o o4 o x . - o O O 01 04 07 10 13 16 19 Territory F i g u r e 5-1 Random Forest: IME Favorable 22 25 32 Ca s u a l t y Ac t u a r i a l So c i e t y Forum, Wi n t e r 2 0 0 6 Distinguishing the Forest from the TREES Par t i al D e p e n d e n c e o n Te r r i t o r y o o o o o o 01 04 07 1 0 1 3 1 6 1 9 22 25 Territ ory Figure 5-2 Both bar graphs have a distinctive right shift i n the size of the partial dependency on the territon, variable. This result is not surprising given that Massachusetts aut omobi l e territories are set every two years based upon the calculation of a single 5-coverage pure pr emi um i ndex for each of 350 towns. Towns are t hen grouped i nt o 16 nearly homogenous territories with the i ndex generally rising from territory 1 (lowest) to territory 16 (highest). Territories 17-26 are 10 individual parts of Bost on that vary widely i n this calculated pure pr emi um i ndex (Conger, 1987). Figure 5-3 shows a bar graph of the pure pr emi um indices for the 26 territories used i n this analysis for compari son purposes. Massachuset t s Rat i ng Territories Fi ve Cover age Pur e Pr e mi u ms Casualty Actuarial Society Forum, Winter 2006 33 Distinguishing the Forest from the TREES 1400.00 T 1200.00 1000, 00 800.00 600.00 400.00 200.00 0 O0 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 1 4 1 5 1 6 1 7 1 8 1 0 2 0 21 2 2 23 24 25 F i g u r e 5 - 3 26 Fi gure 5-4 displays the pr opor t i on of dai rns wi t h an I ME r equest ed (not margi nal effects) by territory, super i mposed on the pur e pr emi um t erri t ory levels. In cont rast to the similarity, of the margi nal i mpor t ance of the I ME t erri t ory variable t o the t erri t ory pur e pr emi ums, the pr opor t i ons of claims with I ME request ed shown in Fi gur e 5-4 show mor e uni f or mi t y across territories, i ndi cat i ng a real dependence on ot her i mpor t ant variables. 34 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the TREES Mas s achus et t s Rat i ng Ter r i t or i es Fi ve Cover age Pur e Pr e mi um vs I ME Reques t Rat i os 1400.00 0.1200 1200.00 1000.00 800.00 600.00 400.00 200.00 0.1000 0.0800 0.0600 0.0400 0.0200 0.00 1 2 3 4 5 6 7 8 9 10 1112 1314 151617 18 19 2 0 2 12 2 2 32 4 2 52 6 Territory Figure 5-4 0.0000 Casualty Actuarial Society Forum, Winter 2006 35 Distinguishing the Forest from the T REES Pruni n~ t he Trees Simple trees 3~ that ext end to a large number of t ermi nal nodes are difficult to assess t he flail i mport ance of individual variable levels because (1) later node splits may or may not be statistically significant dependi ng on the software algorithms employed and (2) t ermi nal nodes on the order of fifty plus may obscure the precise cont r i but i on of each variable level despite the i mpor t ance value described above for the overall variable. The full tree produced by the software can be pruned back to the "best " tree wi t h a pre- det ermi ned number of nodes. For example, Figure 5-5 shows a best 10 node pr uned tree from S-PLUS. It begins with the health i nsurance variable as the "root " node (Y/N t o the left and U to the fight) 32 and proceeds to make general node splits based onl y on the provi der 2 bill amount . The uni verse of records is t hen classified by t ermi nal node I ME requested ratios rangi ng from 0.019 to 0.170. A similar pr uned tree can be pr oduced for t he ot her three targets. 36 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the TREES S-PLUS TREE: IME Requested Best Te n No d e Pruned Tree h~nllh into Jmnr.~'h I / mn2 hi 1<4Zr,,4 mn~) hJ<4R.q L i- j o. -, L mp2 h i l l 3010.5 mp2 hill 11~qF~1.5 mp? hill 15.o,6.5 0.019 0.100 0.100 1 0.060 0.072 0.150 0.090 0.170 0.120 F i g u r e S -S We next t ur n t o cons i der at i on o f model per f or mance as a whol e i n sect i on 6 wi t h an i nt er pr et at i on o f t he model s and vari abl es relative t o t he pr obl e m at ha nd (exampl e 2) i n Sect i on 7. S E C T I ON 6. R OC CURVE S AND L I F T F OR S OF T WAR E : T R E E S , NAI VE BAYES AND L OGI S T I C MODE L S The sensitivity and specificity measur es di scussed i n Sect i on 4 are de pe nde nt o n t he choi ce o f a cut of f val ue f or t he pr edi ct i on. Many model s score each r ecor d wi t h a val ue bet ween zero and one, t hough s ome ot he r scor i ng scale can be used. Thi s score is somet i mes t r eat ed like a probability., al t hough t he concept is mu c h cl oser i n spi ri t t o a fuzzy set me a s ur e me nt f unct i on 33. A c o mmo n c ut of f poi nt is .5 and r ecor ds wi t h scores gr eat er t han .5 are classified as event s and r ecor ds bel ow t hat val ue are classified as non- e ve nt s 3~. However , ot he r cut of f Casualty Actuarial Society Forum, Winter 2006 37 Di st i ngui shi ng the Forest f r om the T R E E S values can be used. Thus, i f a cut of f l ower t han 50% were selected, mor e event s woul d be accurately predi ct ed and fewer non-event s woul d be accurately predicted. Because t he accuracy of a predi ct i on depends on t he selected cut of f point, t echni ques for assessing t he accuracy o f model s over a range o f cut of f poi nt s have been devel oped. A c ommon procedure for visualizing t he accuracy of model s used for classification is t he recei ver operat i ng characteristic (ROC) curve 3~. This is a curve o f sensitivity versus specificity (or mor e accurately 1.0 mi nus the specificity) over a range o f cut of f points. I t illustrates graplfically the sensitivity or true positive rate compar ed t o 1- specificity or false alarm rate. When the cut of f poi nt is very high (i.e. 1.0) all claims are classified as legitimate. The spedfi ci t y is 100% (1.0 mi nus t he specificity is 0), but t he sensitivity is 0%. As t he cut of f poi nt is lowered, t he sensitivity increases, but so does 1.0 mi nus t he specificity. Ultimately a poi nt is reached where all claims are predi ct ed to be events, and t he specificity declines t o zero (1.0 - specificity = 1.0). The baseline ROC curve (where no model is used) can be t hought of as a straight line f r om the origin wi t h a 45-degree angle. I f t he model ' s sensitivity increases faster than the specificity decreases, the curve "l i ft s" or rises above a 45- degree line quickly. The higher the "l i ft " or "gai n"; the mor e accurate t he model 3c'. ROC curves have been used in pri or studies of insurance claims and fraud det ect i on regressi on model s (Derrig and \Veisberg, 1998 and Viaene et al., 2002). The use of ROC curves in building model s as well as compar i ng per f or mance of compet i ng model s is a well established procedure (Flach et al (2003)). A statistic that proxddes a one-di mensi onal summary of t he predi ct i ve accuracy of a mo d d as measured by an ROC curve is the area under the ROC curve (AUROC). I n general, AUROC values can distinguish good model s f r om bad model s but may not be able t o distinguish among good model s 0Vlarzban, 2004). A curve that rises quickly has mor e area under t he ROC curve. A model wi t h an area of .50 demonst rat es no predi ct i ve ability, whi l e a model wi t h an area of 1.0 is a perfect predi ct or (on t he sample the test is per f or med on). For this analysis, SPSS was used t o pr oduce t he ROC cuin-es and area under t he ROC curves. SPSS generates cut of f values midway bet ween each unique score in t he data and uses the trapezoidal rule t o comput e the AUROC. A non-paramet ri c met hod was used to comput e t he standard error o f t he AUROC. The formula for t he standard error 37 is: I A(I - A) + (n+ - l)(Ql - A 2 ) + (n_ - I)(Q2 - A 2 ) SE( A) (8) n+N Where n+ is t he number o f event s, n. is t he number o f non-event s, N is t he sampl e si ze A is t he AUROC and scores are denot ed as x = = j x [ n + > / - n+ > z r/+= / /./r/+2 E x t'/- , + + + ] 3 8 C a s u a l t y Ac t u a r i a l S o c i e t y Forum, Wi n t e r 2 0 0 6 Distinguishing the Forest from the T REES 1 o , : 7 g + Xx + : jxt , , : + o , , + , :, Tables 6-1A&B show t he values of AUROC for each of eight model / s of t war e combi nat i ons in predicting a decision to investigate wi t h an I ME (6-1A) and an SIU (6-1B). for the Massachusetts auto bodily injury liability claims that compri se t he hol dout sample, about 50,000 claims. Upper and l ower bounds for t he "t r ue" AUROC value are shown as t he AUROC value + t wo standard deviation det ermi ned by equat i on (7). TREENET, Random For est bot h do well wi t h AUROC values about 0.7, significantly bet t er than t he logistic model. The I mi ner model s (Tree, Ensembl e and Nai've Bayes) generally have AUROC values significantly b d o w t he t op t wo performers, wi t h t wo (Tree and Ensembl e) significantly bel ow the Logistic and t he I mi ner Nai ve Bayes benchmarks. CART also scores at or bel ow the benchmarks and significantly bel ow T R E E NE T and Random Forest. On the ot her hand, S-PLUS (R) tree scores at or somewhat above the benchmarks. Area Unde r t he ROC Curve - I ME De c i s i o n CART Tr e e 0.669 0.661 0.678 S-PLUS Tr e e AUROC Lower Bound Upper Bound I mi ne r Ens e mbl e Ra ndo m Fores t AUR()C 0.649 703 Lower Bound 0.641 695 Upper Bound 0.657 711 I mi ne r Tr e e Tabl e 6-1A T RE E NE T 0.688 0.629 0.701 0.680 0.620 0.693 0.696 0.637 0.708 I mi ne r Na i v e Bayes 0.676 0.669 0.684 Logi st i c 0.677 0.669 0.685 Casualty Actuarial Society Forum, Winter 2006 39 Distinguishing the Forest from the T REES Area Unde r the ROC Curve - I ME Favorabl e CART Tree AUROC Lower Bound Upper Bound AUROC Lower Bound Upper Bound 0.651 0.641 0.662 I mi ne r Ens e mbl e 0.654 0.643 0.665 S-PLUS Tr e e I mi ne r Tr e e TREENET 0.664 0.591 0.683 0.653 0.578 0.673 0.675 0.603 0.693 Ra ndo m Fores t 0.692 0.681 0.702 I mi ne r Na / v e Bayes 0.670 0.660 0.681 Tabl e 6-1B Logi s t i c 0.677 0.667, 0.687 Tabl es 6- 2A&B s h o w t he val ues o f AUR OC f or t he mo d e l / s o f t wa r e c o mb i n a t i o n s t e s t e d f or t he SI U d e p e n d e n t vari abl e. We f i r st n o t e t hat , i n gener al , t he mo d e l pr e di c t i ons as me a s ur e d by AUR OC ar e si gni fi cant l y l owe r t ha n f or I ME acr os s all ei ght mo d e l / s o f t wa r e c ombi na t i ons . Thi s r e duc t i on i n AUR OC val ues ma y be a r ef l ect i on o f t he e xpl a na t or y var i abl es us e d i n t he anal ysi s; i.e., t hey may be mo r e i nf or ma t i ve a bout d a l m bui l d- up, f or wh i c h I ME is t he pr i nci pal i nvest i gat i ve t ool , t ha n a b o u t cl ai m f r aud, f or wh i c h SI U is t h e pr i nci pal i nvest i gat i ve t ool . Area Unde r t he ROC Curve - SI U De c i s i o n AUROC Lower Bound Upper Bound AUROC Lower Bound Upper Bound CART S-PLUS Tr e e Tr e e 0.607 0.598 0.617 I mi ne r Ens e mbl e 0.539 0.530 0.548 I mi ne r Tr e e TREENET 0.616 0.565 0.643 0.607 0.555 0.634 0.626 0.575 0.652 Ra ndo m Fores t I mi ne r Na i v e Bayes 0.615 0.677 Logistic 0.612 0.668 0.605 0.603 0.686 0.625 0.621 Tabl e 6-2A 40 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Fore#from the TREES Area Under t he ROC Curve - SIU Favorable CART Tr e e AUROC 0.598 Lower Bound 0.584 Upper Bound 0.612 I mi n e r E n s e mb l e AUROC 0.57.5 Lower Bound 0.530 Upper Bound 0.548 S-PLUS Tr e e I mi n e r Tr e e T RE E NE T 0.616 0.547 0.678 0.607 0.555 0.667 0.626 0.575 0.689 Ra n d o m Fo r e s t 0.645 I mi ner Nai ve Ba y e s 0.607 Logmic 0.610 0.631 0.593 0.596 0.658 0.625 0.623 Tabl e 6-2B T R E E NE T a nd Ra n d o m Fo r e s t p e r f o r m si gni f i candy be t t e r t ha n all o t h e r mo d a l / s o f t wa r e c o mb i n a t i o n s o n t he f avor abl e t ar get vari abl es. Bo t h p e r f o r m si gni fi cant l y be t t e r t ha n t he Logi st i c. I mi n e r Tr e e a nd En s e mb l e agai n d o poor l y o n t he I ME a nd SI U Fa vor a bl e h o l d o u t sampl es. Fi gur es 6-1 t o 6-4 s h o w t he ROC cur ves f or T R E E NE T c o mp a r e d t o t he Logi st i c f or b o t h I ME and SI U 38. As we can see, a s i mpl e di spl ay o f t he ROC cur ves ma y n o t be s uf f i ci ent t o di s t i ngui s h p e r f o r ma n c e o f t he mo d e l s as wel l as t he AUR OC val ues. 1 .0 T R E E NE T R OC Cu r v e - I ME AUR OC = 0. 701 ~ . Q . 6- ~ 0 4 - 0 2 - 0.0 0 0 I I I I 02 0.4 0.6 0 8 1.0 1 - sp eci r. :~ y F i g u r e 6-1 |.0 0 5- t ~g 0.4- 02- T RE E NE T R OC Cu r v e - S I U AUR OC = 0. 677 0.0 I I I I 1- smar~ F i g u r e 6 - 2 Casualty Actuarial Society Forum, Wi nt er 2006 41 Distinguishing the Forest from the T REES L o g i s t i c R OC C u r v e - I ME AUR OC = 0. 643 1 .D ~>qXS- W 1 ~D. 4 - 02- 1.0 0.8- i ~'3 D.4- L o g i s t i c R OC C u r v e - S I U AUR OC = 0. 612 {32- / . 0 . 0 I I OI I {}0 Q2 0.4 0 0 5 0~0 I l 01 . I oo 02 0.4 0 0.8 ~.o 1 - Sp ecWci t y 1 - s p e c mc ~ F i g u r e 6- 3 1.0 F i g u r e 6- 4 Fi nal l y, Ta b l e 6- 3 di s pl ays t h e r el at i ve p e r f o r ma n c e o f t h e mo d e l / s o f t wa r e c o mb i n a t i o n s a c c o r d i n g t o AUR OC va l ue s a n d t hei r r anks . Wi t h Nai ' ve Ba ye s a n d Logi s t i c as t h e b e n c h ma r k s , T R E E NE T , R a n d o m Fo r e s t a n d S P LUS Tr e e d o be t t e r t h a n t he b e n c h ma r k s whi l e C AR T Tr e e , I n' f i ner Tr e e , a n d I mi n e r E n s e mb l e d o wor s e . Ranki ng of Met hods By AUROC - Deci s i on Met hod SIU AUROC Random Forest 0.645 TREENET 0.643 S-PLUS Tree 0.616 Iminer NaPce Bayes 0.615 Logistic 0.612 CART Tree 0.607 1miner Tree 0.565 Iminer Ensemble 0.539 ; IU Ra nk I ME Ra nk I ME AUROC 1 0.703 2 0.701 3 0.688 4 5 0.676 4 0.677 6 0.669 8 0.629 7 0.649 Tabl e 6-3A 42 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the T REES 1 2 0.683 2 1 0.692 3 5 0.664 4 3 0.677 4 0.670 6 7 0.651 6 0.654 8 0.591 Ranl dng of Met hods By AUROC - Favorabl e Met hod SIU AUROC SIU Rank IME Rank I ME AUROC TREENET 0.678 Random Forest 0.645 S-PLUS Tree 0.61 Logistic 0.61C Iminer Naive Bayes 0.607 CART Tree 0.598 Iminer Ensemble 0.575 Iminer Tree 0.547 Tabl e 6-3B Finally, Fi gures 6- 5A&B s how t he relative per f or mance i n a graphi c. Pr ocedur es woul d wor k equally on b o t h I ME and SI U i f t hey lie o n t he 45 degree line. To t he ext ent t hat per f or mance is bet t er o n t he I ME t arget s, pr ocedur es woul d be above t he diagonal. Bet t er per f or mance is s hown by posi t i ons f ar t her t o t he f i ght and cl oser t o t he t op o f t he square. Thi s gr aphi c dear l y shows t hat T R E E NE T and Ra ndom For est pr ocedur es do bet t er t han t he ot her tree pr ocedur es and t he benchmar ks . Pl ot of AUROC for SIU vs. I ME De c i s i on 0. 65 | 8 ~ 0 . 6 0 I M Ensembl e I M Tret r r N net Rand~ n Fores S- PL US Tree L o g ~ U c / N ah ~ b y e s CART 0 50 ' I I 0 S0 0 55 0 60 0, 65 AUROC. sI U Fi gure 6-5A Pl ot of AUROC for SIU vs. IME Favorabl e Casualty Actuarial Society Forum, Winter 2006 43 Distinguishing the Forest from the T REES , t o. 6s - i 0. 60 - 0 55 " R m d o m F o ~ s t L ogisticR , , g n , s s ~ N~ L Ue Tr ee I M E ~ CART 0 55 0 60 0 65 SI U A U R O C Figure 6-5B S E C T I ON 7. C ONC L US I ON Insurance data oft en involves bot h large vol umes of i nf or mat i on and nonl i neari t y of variable relationships. A range of data mani pul at i on techniques have been developed by comput er scientists and statisticians that are now categorized as data mi ni ng, techniques with principal advantages bei ng precisely the efficient handl i ng of large data sets and the fitting of non- linear funct i ons to that data. I n this paper we illustrate the use of software i mpl ement at i ons of CART and ot her tree-based met hods, together wi t h benchmar k procedures of Naive Bayes and Logistic regression. Those eight model / soft ware combi nat i ons are applied to data arising i n the Detail Claim Dat abase (DCD) of aut o injury liability claims i n Massachusetts. Twent y- one variables were selected to use i n prediction model s usi ng the DCD and external demographi c variables. Four target categorical variables were selected to model: The decision to request an i ndependent medical exami nat i on (IME) or a special investigation (SIU) and the favorable out come of each investigation. The two decision targets are the pri me claim handl i ng techniques that insurers can use to reduce the asymmetry of i nformat i on bet ween the dai mant and the i nsurer i n order to distinguish valid claims from those i nvol vi ng bui l dup, exaggerated injuries or treatment, or outright fraud. Ei ght model i ng software results were compared for effectiveness of model i ng the targets based on a standard procedure, the area under the receiver operating characteristic curve (AUROC). We find t hat the met hods all provide some predictive value or lift from t he predicting variables we make available, with significant differences among the eight met hods and four targets. Seven model i ng out comes are compared t o logistic regression as i n Vi aene et al. (2002) but the results here are different. They show some soft ware/ met hods can i mprove on the predictive ability of the logistic model. TREENET, Random Forest and SPLUS Tree do bet t er t han the benchmark Naive Bayes and Logistic met hods, while CART 44 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the TREES tree, Iminer tree, and Iminer Ensemble do worse. That some modal / soft ware combinations do better than the logistic model may be due to the relative size and richness of this data set and/ or the t3,pes of independent variables at hand compared to the Viaene et al. data. We show how "important" each variable is within each soft ware/ model tested and note the type of data that are important for this analysis. I n general, variables taken directly from DCD fields and variables derived as demographic type variables based on DCD r i d& do better than variables derived from external demographic data. Variables relating to the injury and medical treatment dominate the highly important variables while the presence of an attorney, age of the daimant, and policy type, personal or commercial, are less important in making the decision to im, oke these two investigative techniques. No general conclusions about auto injury claims can be drawn from the exercise presented here except that these modeling techniques should have a place in the actuary's repertory of data manipulation techniques. Technological advancements in database assembly and management, especially the availability of text mining for the production of variables, together with the easy access to computer power, will make the use of these techniques mandatory for analyzing the nonlinearity of insurance data. As for our part in advancing the use of data mining in actuarial work, we will continue to test various software products that implement these and other data mining techniques (e.g. support vector machines). REFERENCES Allison, P., Missing Data. Sage Publications, 2002. Automobile Insurers Bureau of Massachusetts, Detail Claim Database Claim Distribution Characteristics, Accident Years 1995-1997, Boston MA, 2004. Brieman, L., J. Freidman, R. Olshen, and C. Stone, Classification ~md Regression Trees, Chapman Hall, 1993. Conger, R.F., The Construction of Automobile Rating Territories in Massachusetts, Casualty ActuarialSodely Forum, pp. 230-335, Fall 1987. Derrig, R.A., and L. Francis, Comparison of Methods and Software for Modeling Nonlinear Dependencies: A Fraud Application, Working Paper, 2005. Flach, P., H. Blockeel, C. Ferri, J. Hernandez-Orallo, andJ . Struyf, Decision Support for Data M.ining: An Introduction to ROC Analysis and Its Applications, Chapter 7 in Data Casualty Actuarial Society Forum, Winter 2006 45 Distinguishing the Forest from the TREES Mining and Decision Support, D. Mlandenic, Nada Lavrac, Marko Bohanec and Steve Moyle Eds., Kluwer Academic, N.Y., 2003. Fox, J, An R and S-PLUS Companion to Applied Regression, SAGE Publications, 2002. Francis, L.A., An introduction to Neural Networks in Insurance, Intelligent and Ot her Computational Techniques in Insurance, Chapter 2, A.F. Shapiro and L.C. Jain, Eds, World Scientific, pp. 51-1, 2003a. Francis, L.A., Mardan Chronicles: Is MARS better than Neural Networks? Casual~ ActuarialSodety Forum, Winter, pp. 253-320, 2003b. Francis, L.A., Practical Applications of Neural Networks in Property and Casualty Insurance, Intellieent and Ot her Computational Techniques in Insurance. Chapter 3, A.F. Shapiro and L.C. Jain, Editors, World Scientific, pp. 104-136, 2003c. Francis, L.A., Neural Networks Demystified, Casualty Actuarial Sode~y Forum, Winter, pp. 254- 319, 2001. Francis, L.A., A Comparison of TREENET and Neural Networks in Insurance Fraud Prediction, presentation at CART Data Mining Conference, 2005. Friedman, J., Greedy Function Approximation: The Gradient Boosting Machine, Annals of Statistics, 2001. Hassoun, M.H., Fundamentals of Artificial Neural Ne~, orks. MI T Press, Cambridge MA, 1995. Hastie, T., R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, New York., 2001. Hosmer, David W. and Stanley Lemshow, Applied Lodstic Regression, J ohn Wiley and Sons, 1989. Insurance Research Council, Auto Injury Insurance Claims: Countrywide Patterns in Treatment Cost, and Compensation, Malvern, PA, 2004a. Insurance Research Council, Fraud and Buildup in Auto Injury Insurance Claims, Malvem, PA, 2004b. Jacard, J and Turrisi, Interaction Effects in Multinle Recessi on, SAGE Publications, 2003. Kantardzic, M., Data Mining. J ohn Wiley and Sons, 2003. 46 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the TREES Lawrence, Jeannette, Introduction to Neural Networks: Design, Theory and Applications, California Sdentiflc Software, 1994. Marzban, C., A Comment on the ROC curve and the Area Under it as Performance Measures, 2004. Weather and Forecasting, Vol. 19, No. 6, 1106-1114. bicCullagh, P. and J. Nelder, General and Linear Models, Chapman and Hall, London, 1989. Miller, R.B. and D.B. Wichern, Intermediate Business Statistics, Holden-Day, San Francisco, 1997. Neter, John, Mihael H. Kutner, William Waserman, and Christopher J. Nachtsheim, Applied Linear Reeression Models, 4 a' Edition, 1985. v Salford Systems, Data mining with Decision Trees: Advanced CART Techniques, Notes from Course, 1999. Smith, Murry, (1996), Neural Networks for Statistical Modeling, International Thompson Computer Press, 1996. Viaene, S., R.A. Derrig, and G. Dedene, A Case Study of Applying Boosting Nai've Bayes for Claim Fraud Diagnosis, I EEE Transactions on Knowledge and Data Engineering, 2004, Volume 16, (5), pp. 612-620, May. Viaene, S., R.A. Derrig, and G. Dedene, Illustrating the Explicative Capabilities of Bayesian Learning Neural Networks for Auto Claim Fraud Detection, Intelligent and Other Comtmtational Techniaues in the Insurance, Chapter 10, A.F. Shapiro and L.C. Jain, Editors, World Scientific, pp. 365-399, 2003. Viaene, S., B. Baesens, G. Dedene, and R. A. Derrig, A Comparison of State-of-the-Art Classification Techniques for Expert Automobile Insurance Fraud Detection, Journal of Rdsk andlnsurance, 2002, Volume 69, (3), pp. 373-421. Weisberg, H.I. and R.A. Derrig, Methodes Quantitatives Pour La Detection Des Demandes D' Indemnisation Frauduleuses, RISQUES, No. 35, Juillet-Septembre, 1998, Paris. Zhou, X-H, D.K. McClish, and N.A. Obuchowski, Statistical Methods in Diagnostic Medicine, J ohn Wiley and Sons, New York, 2002. A good up-to-date and comprehensive source for a variety of data manipulation procedures is Hastie, Tibshirani, and Friedman (2001 ), Elements of Statistical Learning, Springer. Casualty Actuarial Society Forum, Winter 2006 47 Distinguishing the Forest from the TREES 2 They also found that augmenting the categorized red flag variables with some other claim data (e.g. age, report lag) improved the lift as measured by AUROC across all methods but the logistic model still did as well as the other methods (Viaene et al., 2002, Table 6, p.400-40 l). 3 A wider set of data mining techniques is considered in Derrig, R.A. and L A. Francis, Comparison of Methods and Software Modeling Nonlinear Dependencies: A Fraud Application, Congress of Actuaries, Paris, June 2006 4 See section 2 for an overview of the database and descriptions of the variables used for this paper. 5 The relative importance of the independent variables in modeling the dependent variable within these methods are analogous to statistical significance or p-values in ordinary regression models. 6 See, for example, 2004 Discussion Paper Program, Applying and Evaluating Generalized Linear Models, May 16-19, 2004, Casualty Actuarial Society. 7 This was the text used by the Casualty Actuarial Society for the exam on applied statistics during the 19g0s 8 Claims that involve only third party subrogation of personal injury protection (no fault) claims but no separate indemnity payment or no separate claims handling on claims without payment are not reported to DCD. 9 Combined payments under PIP and Medical Payments are reported to DCD. J0 With a large holdout sample, we are able to estimate tight confidence intervals for testing model results in section 6 using the area under the ROC curve measure. H This fact is a matter of Massachusetts law which does not permit IMEs by one type of physician, say an orthopedist, when another physician type is treating, say a chiropractor. This situation may differ in other jurisdictions. 12 Because expert bill review systems became pervasive by 2003, reaching 100% in some cases, DCD redefined the reported MA to encompass only peer reviews by physicians or nurses for claims reported after July l, 2003.. J3 The standard Massachusetts auto policy has a cooperation clause for IME both in the first party PIP coverage and in the third party BI liability coverage. 14 The IRC also includes an index bureau check as one of the claims handling activities ~5 Prior studies of Massachusetts Auto Injury claim data for fraud content included Weisberg and Derrig (1998, Suspicion Regression Models) and Derrig and Weisberg (1998, Claim Screening with Scoring Models). ~6 See Section 5 for the importance of the provider 2 bill variable in the decision to investigate claims for fraud (SIU) and/or buildup (IME). ~7 There are Tree Software models that may split nodes into three or more branches. SPSS classification trees is an example of such software. is For binary categorical data assumed to be generated from a binomial distribution, entropy and deviance are essentially the same measure. Deviance is a generalized linear model concept and is closely related to the log of the likelihood function. 19 Hastie et al., p. 301 Note that Hastie et al. describe other error and weight functions. [endnote] .~0 Note that the ensemble tree methods employ all 21 variables in the models. See tables 5-1 and 5-2. 2~ The ROC curve results in Section 6 show that TREENET generally provides the best prediction models for the Massachusetts data. 22 The numeric variables were grouped into five bins or into quintiles in this instance. ~,3 The software product MARS also was used to rank variables in importance. MARS implements multivariate adaptive regression splines and is described in Francis (2003). 24 The SAS code is generally relatively easy to edit if some other language is used to implement the model 25 See Section 5 for the importance of variables in our study. 26 S-PLUS would convert the numeric variable into a categorical variable with a level for every nmnerie value that is in the training data, including missing data, but the result would have far too many categories to be feasible. 27 Generally by collapsing sparsely populated categories into an "all other" category 48 Casualty Actuarial Society Forum, Winter 2006 Distinguishing the Forest from the TREES 2s It al so cont ai ns some di mens i on reduct i on met hods such as cl ust er i ng and Pr i nci pal Component s whi ch are al so cont ai ned in S-PLUS. 29 In gener al , some pr ogr ammi ng i s requi red to appl y ei t her appr oach in S- PLUS (R) 30 The dat a set is descr i bed in mor e det ai l in Sect i on 2 above. 31 Pr uni ng i s not f easi bl e or necessar y for t he exampl e t r ee met hods such as TREENET or Random TREENET. 32 The S- PLUS t ree gr aph does not pri nt out t he val ues of cat egor i cal var i abl es, al t hough i t di spl ays t he val ues of t he numer i c var i abl es. For cat egor i cal var i abl es letters are assi gned and di spl ayed i nst ead o f t he cat egor y val ues. 33 See Os t as zews ki ( 1993) or Der r i g and Os t as zews ki (1999). 34 One way o f deal i ng wi t h val ues equal t o t he cut of f poi nt i s t o consi der such obser vat i ons as one- hal f in t he event gr oup and one- hal f in t he non- event gr oup 35 A ROC cur ve i s one exampl e of a so- cal l ed "gai ns " chart. 36 ROC cur ves wer e devel oped ext ens i vel y for use in medi cal di agnosi s t es t i ng in t he 1970s and 1980s (Zhou et al. 2004 and mor e r ecent l y i n weat her f or ecast i ng (Marzban, 2004) and (St ephenson, 2000). 37 The det ai l s of t he f or mul a wer e suppl i ed by SPSS. 38 Al l t went y ROC cur ves are avai l abl e from t he aut hors. Acknowledgement: The authors gratefully acknowledge the production assistance of Eilish Browne and helpful comments from Rudy Palenik on a prior version of the paper. Casualty Actuarial Society Forum, Winter 2006 49