
Hospitalization prediction from health care claims

Adithya Renduchintala, M.S. Computer Science, University of Colorado Boulder, adithya.r@gmail.com, SID: 8109540901
Benjamin Martin, Software Engineer, Raytheon Company, bnjmnmrtn@gmail.com, SID: 101262220
Lance Legel, M.S. Telecommunications, University of Colorado Boulder, lancelegel@gmail.com, SID: 100788934

4.4 million preventable hospitalizations occur each year in the US, costing health care providers over $29 billion [1]. In response, Heritage Provider Network has organized a competition to develop data mining algorithms that predict future hospitalizations [2]. As participants in this competition, we have applied machine learning algorithms to find correlations between 2.6 million health care claims and hospitalization outcomes for about 70 thousand people. We began by implementing binary classifiers, using support vector machines (SVM) and artificial neural networks (ANN), to determine whether or not people will go to the hospital. We used the Orange [3], PyBrain [4], and SVMlight [5] libraries in Python. We faced high computational costs of training and poor performance of standard algorithms on unbalanced data sets with rare positive entries. We partially overcame the first problem through cloud computing resources [6], and the second through ensembles of SVMs, based on an original staircase model with meta-learning [7,8]. Our model was submitted to the competition and earned a root mean squared logarithmic error (RMSLE) of 0.502, beating all-zero predictions and a few hundred competitors. The skew of our data and the complexity of processing it were central reasons why our algorithms alone could not converge onto deeply accurate models for prediction.

Why focus on hospitalization?

Researchers are actively pursuing innovative ways to minimize the need for people to be hospitalized, because its impact on people is harsh and costly. The Agency for Healthcare Research and Quality estimates that in the United States alone there are about 4.4 million hospitalizations each year that could be prevented through less intrusive and less expensive intervention [1]. Collectively, health care providers in the US lose about $29 billion each year to such hospitalizations. Beyond the financial costs of not intervening, people's lives are devastated by illnesses that are not actively prevented.

What can data mining do about this?

Data mining of historical health care and hospitalization outcomes can radically change the way health care is administered in the US and around the world. By identifying correlations between real health care claims (physician visits, diagnoses, drug counts, and so on) and real hospitalizations, data mining algorithms can help health care providers identify those most at risk of future hospitalization. These providers can then intervene to make sure their members get the treatment they need.


Data of Heritage Health Prize

To capitalize on the opportunities identified above, Heritage Provider Network (HPN) is funding the Heritage Health Prize [2], a data mining competition worth more than $3 million to its winner. HPN provides participants with data on over 100,000 patients across 3 years, comprising over 2 million health care claims in total. Contestants are told how many days, if any, these patients spent in the hospital over the three years. From this and the other data provided, participants must predict how many days in the hospital a subset of these patients will spend in a 4th year following the 3 years of data.

We participated in the Heritage Health Prize to further our applied understanding of advanced data mining techniques. Submissions are evaluated according to the following equation, where i is the current member, n is the number of members, p_i is the predicted number of days in hospital, and a_i is the actual number of days in hospital:

\epsilon = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left[ \log(p_i + 1) - \log(a_i + 1) \right]^{2} }

The goal is to minimize \epsilon, with a competition target of 0.4.
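For concreteness, here is a minimal NumPy sketch of this metric; the function and variable names are our own, not part of any competition kit.

    import numpy as np

    def rmsle(predicted, actual):
        """Root mean squared logarithmic error over per-member day counts."""
        predicted = np.asarray(predicted, dtype=float)
        actual = np.asarray(actual, dtype=float)
        return np.sqrt(np.mean((np.log(predicted + 1.0) - np.log(actual + 1.0)) ** 2))

    # An all-zeros prediction scores well whenever most members have zero days in hospital.
    print(rmsle([0, 0, 0, 0, 0], [0, 0, 0, 3, 0]))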

Before discussing the techniques we have applied, we will explain what data we are actually mining.

What is our data?

Figure 1 Sample from our primary data table, organized for participation in the Heritage Health Prize. Attributes are explained in the section "What is our data?".

Several of the 14 attributes are very rich dimensionally:

1. MemberID - ID for patient
2. ProviderID - ID for doctor providing service
3. Vendor - ID for company issuing bill
4. PCP - ID for member's primary care physician
5. Year - year of claim, either 1, 2, or 3
6. Specialty - expertise of doctor servicing
7. PlaceSvc - type of location serviced
8. PayDelay - days between claim and payment
9. LengthOfStay - days spent being serviced
10. DSFS - days since first service of year
11. PrimaryConditionGroup - code for diagnosis
12. CharlsonIndex - comorbidity score
13. ProcedureGroup - code for treatment
14. SupLOS - binary indication of data suppression

In Figure 1, above, we see a sampling of 4 instances from the 2,668,991 total instances of year one data.

In Figure 1, consider the range of values for Specialty, the specialty of the doctor a patient visits: internal, anesthesiology, diagnostic imaging, and emergency. Yet the richness of this range is dwarfed by the possible values of PrimaryConditionGroup and ProcedureGroup. In this sampling, PrimaryConditionGroup ranges from a cardiac condition (MISCHRT) to a spine disorder (ARTHSPIN) to chest pain (ROAMI) to a disease of pulmonary circulation (HEART2); similarly, ProcedureGroup ranges from surgery of the cardiovascular system (SCS) to anesthesia (ANES) to radiology (RAD) to surgery of the digestive system (SDS). We expected strong correlations to emerge relating these attributes to days hospitalized in subsequent years.

In addition to the 2.6 million instances of the attributes outlined in Figure 1, our data set also contains the sex and age group of every member, the number of drugs prescribed per event, and the number of laboratory and pathology tests per event.

Figure 2 shows how the hospitalization data is skewed. We will ultimately discuss how this skew created dramatic challenges to overcome, because most learning algorithms are not designed to detect what are essentially outliers.

Figure 2 Count of days hospitalized after one year. The count for the majority class, zero days (64,269), is too large to be shown on this graph. The bump at the right of the graph appears because the final bin captures all hospitalizations of 15 or more days.

Preprocessing: data integration

Upon getting to know our data, we realized that it was formatted in a way that was not conducive to the prediction problem. We were provided a randomized set of 2.6 million instances, with members averaging about two dozen instances across 3 years. We needed our learning algorithms to consider all instances for each member as a single object for classification. This meant integrating multiple instances into single entries for each year of prediction. To integrate the data, we wrote a Python script to scan through the complete set of claims and build a dictionary for every member that indicates all of the claims recorded in the year prior to prediction. Figure 3 shows the type of data to be integrated, and explains further how we reformatted it.

Figure 3 Flattening and normalizing of data on one member, 99977491, in year 3. For any attribute of categorical data type (e.g. PrimaryConditionGroup), categories are expanded into multiple new attributes (ROAMI, MISCHRT, ARTHSPIN, SKNAUT, ...). Each new single category attribute receives a normalized value based upon how many instances with this category are recorded for this member, and what the maximum count of instances recorded of this category is for any member. In this example, member 99977491 has two instances with the category ROAMI. So, if the flattened file has a value of 0.1429 for the new attribute ROAMI_PCP-Y3, then any member with a single instance recorded of ROAMI would have a value of 0.07142, and the maximum instances recorded in a single member of ROAMI must be 1/0.07142 = 14, according to standard normalization procedure. Because some attributes are not categorical (e.g. ProviderID), they are excluded during this procedure, even though they contain interesting information. Learning with these attributes can be incorporated separately as features.
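Schematically, the flattening and normalization of Figure 3 works like the sketch below. The column names, CSV layout, and choice of categorical attributes are assumptions for illustration; only categorical attributes are expanded into count features.

    from collections import defaultdict
    import csv

    # Categorical attributes that get expanded into per-category count features (assumed names).
    CATEGORICAL = ["Specialty", "PlaceSvc", "PrimaryConditionGroup", "ProcedureGroup", "CharlsonIndex"]

    def flatten_claims(claims_csv, year="Y3"):
        """Collapse one row per claim into one count vector per member, normalized by column maxima."""
        counts = defaultdict(lambda: defaultdict(int))  # member -> feature -> raw count
        with open(claims_csv) as f:
            for row in csv.DictReader(f):
                if row["Year"] != year:
                    continue
                for col in CATEGORICAL:
                    counts[row["MemberID"]][col + "=" + row[col]] += 1

        # Normalize each feature by the maximum count observed for any single member, so that
        # e.g. two ROAMI claims against a member maximum of 14 becomes 2/14 = 0.1429.
        max_count = defaultdict(int)
        for features in counts.values():
            for feature, c in features.items():
                max_count[feature] = max(max_count[feature], c)
        return {member: {f: c / float(max_count[f]) for f, c in features.items()}
                for member, features in counts.items()}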

In addition to our flattening and normalizing outlined in Figure 3 above, we also chose to filter out instances where members are not present in at least one of the years, in order to establish a mining structure that considers time. This ultimately converted a file with 2.6 million instances and 14 attributes into a file with 42,900 instances (one per member) and 547 attributes. There are some significant drawbacks to this integration process. We lose granularity in our temporal richness, and more generally, our ability to mine frequent interrelationships of attribute data across single instances.


By combining instances that occurred at various times into a single time-invariant vector, we lose the ability to mine frequent sequences of attribute data. There was no obvious way to preserve the timestamp for each set of instance data (days since first service of year, DSFS) within a high-dimensional aggregation of many instances. For example, in Figure 3, we see that the PrimaryConditionGroup categories SKNAUT and ARTHSPIN occur around the same time (DSFS: 0-1 months), and before ROAMI and MISCHRT (DSFS: 5-6 months). It is possible that this sequence of conditions translates into different outcomes than if, say, ROAMI and MISCHRT came before SKNAUT and ARTHSPIN. However, we could not learn this pattern from our preprocessed data, since DSFS in an aggregated vector does not correlate with specific instance data.

Beyond time dilution, our member-centric dataset more generally gives up the ability to mine contextual considerations of attribute data within single instances. The interestingness of attribute values often depends upon their context, e.g. whether it is 80 °F in the winter or the summer. We lose our ability to make such distinctions when we aggregate all of our unique instances into larger, diluted structures.

Design of machine learning strategy

Our initial strategy for learning correlations across the health care data set broke the prediction problem down into binary classification followed by categorical classification. We anticipated getting good results from a support vector machine (SVM) in classifying whether or not an individual was to be hospitalized, and then successfully using an artificial neural network (ANN) on the subset of those predicted to be hospitalized to determine the number of days. This strategy of integrating two different machine learning techniques proved to be unnecessary and probably impractical, as both encountered similar problems that were not solved for either. In particular, the issue of skew in our data set, illustrated by Figure 2, proved to be hugely challenging for the algorithms.

Artificial Neural Networks

Using the PyBrain library [4], we constructed multiple feed-forward back-propagation neural networks with 96 inputs, 96 hidden nodes, and 1 output (a binary classifier). In overcoming problems with overflows and multiplication errors, we discovered an issue in the way the tanh gradient error was calculated, and reformulated this calculation as a linear one.

When trained against the flattened files described previously, our ANNs ran but generally did not converge towards a low error, even after running for several days straight on a corporate server. This disappointing result stems from the skew in the data set, where the majority class represents about 85% of all data. The skew causes the ANN to probabilistically choose the majority class too often. As we will see in the next section, support vector machines encountered the same problem. To overcome it with neural networks, we tried random sampling of the null class so that equal amounts of null and positive training occurred. This also failed to converge to a reasonably low error. The resulting neural network actually proved to be less accurate than a classifier that always guessed null (no hospitalization) values.
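For reference, a feed-forward network of the shape described above can be assembled in PyBrain roughly as follows. The data here is a random placeholder, and the reformulated tanh gradient mentioned above is not shown.

    import numpy as np
    from pybrain.tools.shortcuts import buildNetwork
    from pybrain.structure import TanhLayer
    from pybrain.datasets import SupervisedDataSet
    from pybrain.supervised.trainers import BackpropTrainer

    # 96 inputs, 96 hidden nodes, 1 output acting as a binary (hospitalized or not) classifier.
    net = buildNetwork(96, 96, 1, hiddenclass=TanhLayer, bias=True)

    ds = SupervisedDataSet(96, 1)
    for _ in range(100):  # placeholder rows; the real rows come from the flattened member file
        ds.addSample(np.random.rand(96), (float(np.random.randint(0, 2)),))

    trainer = BackpropTrainer(net, ds, learningrate=0.01, momentum=0.1)
    for epoch in range(30):
        error = trainer.train()  # one pass over the dataset; returns the average error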


Another approach to using neural networks was to forgo the member-centric data set described in the preprocessing section and instead see whether specific individual claims carried predictive value. All member information was appended to each claim, including gender, drug counts, days in hospital, and so on. This produced better results, but the training behavior did not improve quickly, asymptotically approaching about 10% error (not in RMSLE, only a relative measure). After starting at about 25% error, it dropped to about 13% after two iterations, before making three small drops of about 1% each over the following 30 epochs. The lack of real improvement thereafter was surprising compared to the thousands of epochs over which smaller example sets were seen to converge.

Support Vector Machines

Using the Orange and SVMlight Python libraries [3,5], we experimented extensively with SVMs. In seeking to train against the skewed data set described in Figure 2, we found that SVMs suffered the same problem of over-predicting null values. We therefore needed to attack this problem in a more sophisticated way, and previous researchers have found success with a partitioning method [7], described in Figure 4.

Figure 4 From [7], a diagram explaining the architecture of our approach to solving the issue of skewed datasets for SVM ensemble prediction. In this figure, negative training data are decomposed into K partitions, and each partition is then combined with the single set of positive training data. SVMs are trained independently for each partition, while a top-level aggregation scheme uses the predictions of these SVM sets to make a single set of predictions. We tested multiple means of aggregation, including: simple majority vote of rounded weights, unanimous positive vote or else negative, maximum output weight wins, and weight summations. The best-performing aggregation method was unanimous positive vote or else negative.

In partitioning the majority null data to remove skew, we also wondered whether better performance could be obtained by clustering the null set. The interest in selective partitioning lies in the possibility that SVMs may be able to detect patterns more easily in similar data. Even if true, there remains the challenge of how information from different patterns could best be aggregated. To explore this question empirically, we tested various distance measures and clustering techniques, with the support of the Python library Pycluster [9]. We tried cosine similarity, k-means clustering, Pearson's correlation, variable and heuristic clustering, average distance, median distance, and common Euclidean metrics. As might have been anticipated, clustering based upon these measures does not naturally produce clusters of equal (or nearly equal) sizes. In other words, clustering as a form of partitioning for the purpose of removing skew does not directly work; we must impose some structure onto the process. The recursive partitioning process we used in response is explained in Figure 5.


Figure 5 Illustration of recursive clustering until all negative data is clustered into equal partitions, according to the requirements explained in Figure 4. The process starts by applying a traditional clustering algorithm such as k-means to the negative data. It then checks whether any clusters of negative data have substantially more instances than the positive set. If so, it splits them, and checks again until all remaining clusters are of nearly equal size.
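A rough sketch of this recursive splitting is below; it assumes scikit-learn's KMeans for the base clustering step (our experiments used Pycluster [9]) and numeric feature vectors for the negative members.

    import numpy as np
    from sklearn.cluster import KMeans

    def recursive_partition(negatives, target_size, k=2):
        """Split the negative set with k-means until every cluster is near the positive-set size."""
        clusters, queue = [], [np.asarray(negatives)]
        while queue:
            block = queue.pop()
            if len(block) <= target_size or len(block) <= k:
                clusters.append(block)  # small enough to pair with the positive set
                continue
            labels = KMeans(n_clusters=k).fit_predict(block)
            parts = [block[labels == c] for c in range(k)]
            if any(len(p) == len(block) for p in parts):
                clusters.append(block)  # degenerate split; stop recursing on this block
                continue
            queue.extend(parts)
        return clusters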

Recursive clustering as an intelligent means of partitioning did not directly lead to better results. We believe this is due to the complexity of data integration. Research into how the data structures are transformed by the machine learning algorithms could enable improvements. However, we decided to proceed for the time being using random partitioning of negative data.
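With random partitioning, the ensemble scheme of Figure 4 reduces to a short sketch like the one below; scikit-learn's SVC stands in here for the SVMlight and Orange learners we actually used.

    import numpy as np
    from sklearn.svm import SVC

    def train_ensemble(X_pos, X_neg, k):
        """Split the negatives into k random partitions and pair each with all positives (Figure 4)."""
        models = []
        for part in np.array_split(np.random.permutation(X_neg), k):
            X = np.vstack([X_pos, part])
            y = np.r_[np.ones(len(X_pos)), np.zeros(len(part))]
            models.append(SVC(kernel="linear").fit(X, y))
        return models

    def predict_unanimous(models, X):
        """Our best aggregation rule: predict positive only when every SVM votes positive."""
        votes = np.array([m.predict(X) for m in models])
        return votes.min(axis=0)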

The approach explained in Figure 4 is a means of binary classification in the face of skewed data, to predict whether or not someone will go to the hospital. But how could we expand our SVM model to predict the number of days hospitalized? Our data set is skewed in many ways, so we needed to preserve our solution in any extension. With this in mind, we implemented an original SVM classification scheme, which we refer to as a staircase model. Figure 6 explains the fundamentals of how we built an SVM learning system that can make categorical predictions based on iterations of binary predictors. This system incorporates 15 ensembles of SVMs.

Figure 6 Illustration of the staircase SVM model. The first step is to predict whether hospitalization will be 0 days (red, negative) or more than 0 days (green, positive). This means training the ensemble of SVMs as explained in Figure 4. Then we classify whether those predicted to be hospitalized for more than 0 days will be hospitalized for 1 day or more than 1 day; for those classified as more than 1 day, we see if they are classified as being hospitalized for 2 days or more than 2 days; we repeat the process until we have classifications completed for all 15 days. This requires training 15 unique ensembles of SVMs using as many unique slices of the data. In this way, the system can iteratively step up towards classification of increasingly small minority classes.

The models that are trained can be tested and used step-by-step, with one ensemble model being run after the next. This can reduce computational demands if it is decided that the first negative output is the predicted output for the instance (negative outputs are shown in red in Figure 6). However, there is an opportunity for meta-learning, or meta-meta-learning, considering how much hierarchy is in this model already. We can process all 15 models independently of each other, and potentially do so much faster than one at a time if we parallelize the task through cluster computing. (More can be found on such parallel computing considerations in our section detailing work done with Amazon EC2.) What we found with this second approach, perhaps unsurprisingly, is that the 15 models rarely made predictions in harmony. For example, one ensemble may strongly predict that the number of days hospitalized is over 8, while another ensemble believes just as strongly, as measured by its output weights, that this person will not be going to the hospital at all.
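As a sketch under the same assumptions, the staircase of Figure 6 can be expressed with the hypothetical train_ensemble and predict_unanimous helpers from the previous sketch:

    import numpy as np

    def train_staircase(X, days, train_ensemble, k_partitions=10):
        """For each step d, learn to separate 'exactly d days' from 'more than d days' (Figure 6)."""
        ensembles = []
        for d in range(15):
            at_step = days >= d                # members who reached this step of the staircase
            X_pos = X[at_step & (days > d)]    # hospitalized more than d days
            X_neg = X[at_step & (days == d)]   # hospitalized exactly d days
            # Real data needs a guard here for steps where one of the slices is empty.
            ensembles.append(train_ensemble(X_pos, X_neg, k_partitions))
        return ensembles

    def predict_staircase(ensembles, x, predict_unanimous):
        """Step upward until an ensemble votes negative; the step reached is the predicted days."""
        for d, ensemble in enumerate(ensembles):
            if predict_unanimous(ensemble, x.reshape(1, -1))[0] == 0:
                return d
        return 15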


The question then becomes: How to deal with the seeming chaos of contradictions in a smart way? We decided to construct basic rules for the 15 ensemble outputs to go through in determining a single final prediction. Rules included giving precedence to ensemble predictions with stronger weight, and averaging across multiple ensembles.

Decision Trees

We explored other simple data mining techniques to see if perhaps we could obtain better results than those from the SVM and neural network. Some interesting results were obtained from working with decision tree classifiers based on the Gini index for attribute selection. Figure 7 shows the ROC curve (true-positive versus false-positive rate) for our decision tree in predicting hospitalizations of one day.

Figure 7 ROC curve for a decision tree tested on hospitalizations of one day. The area under the ROC curve (AUC) measures 0.851. The decision tree ultimately performed better than solo, unadorned SVMs on non-zero data, receiving an RMSLE of 0.6991 compared to 0.7146.
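The decision tree experiment can be reproduced in spirit with a short scikit-learn sketch (Orange was used in our actual runs); the data below is a random placeholder for the flattened member vectors.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_auc_score

    rng = np.random.RandomState(0)
    X = rng.rand(1000, 50)                                    # placeholder member feature vectors
    y = (X[:, 0] + 0.3 * rng.rand(1000) > 0.9).astype(int)    # placeholder "hospitalized one day" label

    tree = DecisionTreeClassifier(criterion="gini", max_depth=8)  # Gini index for attribute selection
    tree.fit(X[:800], y[:800])

    scores = tree.predict_proba(X[800:])[:, 1]                # scores that trace out the ROC curve
    print("AUC:", roc_auc_score(y[800:], scores))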

Amazon Elastic Compute Cloud

As we sought to explore the performance of neural networks and SVMs, we needed to deal with about 150 MB of raw data, plus other data created during preprocessing. We were able to reduce the processing constraints by sampling our data set, but we often had to run against the entire data set as we tried to tweak our algorithms in favor of eliciting better global performance. Therefore, we faced serious hardware challenges with our finite computers, which slowed us down for a time.

We believed that cloud computing was an excellent solution to our problem. While we knew that we could enable running multiple jobs at once, we were especially interested in running jobs on faster processors. After a brief survey of the market for cloud computing services, we decided to work with Amazon Web Services for its extensive support and simple, open APIs [6]. So we signed up for a private account and got to work exploring how we could run our programs on it.

The challenge came in figuring out how to organize our system, because no tutorials were really specific to our task. First we needed to choose the type of service we wanted, which turned out to be Amazon Elastic Compute Cloud (EC2). Other services were of interest, such as Amazon Simple Storage Service (S3) for dealing with large amounts of persistent data, and Elastic MapReduce, which had Apache Hadoop clustering implementations hard-wired into a simple offering. However, we realized that we could save our information through EC2's own functionality, making S3 unnecessary; and the machine learning libraries we had could not easily be converted into MapReduce functions designed for parallel computing, so we could not employ the Hadoop-based service. Next, we needed to choose the type of virtual machine our cloud server would run, known as an Amazon Machine Image (AMI). Hundreds of public and private AMIs were available, ranging from a raw Linux setup to Windows servers to customized implementations for scientific computing. After exploring the nature of the system we needed, we chose the StarCluster AMI, an open source project from MIT [10]. (STAR is an abbreviation for Software Tools for Academics and Researchers.) It ultimately turned out that running a Python program on the cloud with various library dependencies required only a few simple steps.

StarCluster provides a Linux machine equipped with the Python libraries we needed installed and optimized, like NumPy and SciPy, in addition to a large number of other customized features.

Once we selected the AMI, engaging with it was as simple as launching an instance (job) through Amazon's EC2 management console, and then connecting with the appropriate security credentials via SSH. Once connected, we had access to a command line interface identical to a normal Linux machine. We were then able to install the additional dependencies we needed from PyBrain, SVMlight, and elsewhere using a simple git clone command, and we could upload our data and Python programs to the instance using SFTP. Finally, we recognized that we could save our libraries and setup for future jobs by exporting the existing instance into a completely new AMI. We were then able to boot up our completely customized AMI on the cloud and run newly uploaded Python programs in a matter of minutes.

While the privileges of being able to run multiple jobs in the cloud simultaneously were real, the efficiency improvements that we expected to gain proved not to be. We established this through an experiment and confirmed it with follow-up research. Our experiment was to simultaneously launch a job on one of our machines and on EC2's largest single-instance offering, the High Memory Quadruple Extra Large Instance, which allegedly measures about 26 GHz of processing power with 68 GB of RAM. Our machine, by comparison, was 2.7 GHz with 8 GB. Ten minutes into the experiment, we were delighted to find the neural network on EC2 converging about 4 times faster, reaching 12 epochs to our machine's 3. But our machine took the lead after 30 minutes, reaching 58 epochs to EC2's 54, and slowly ran away after 60 minutes: 108 epochs to 93. Why was this so?

Our disappointing efficiency with EC2 was confirmed by research indicating that average CPU utilization in public clouds like Amazon's is only about 7% of the nominal offering [11]. Apparently, while companies like Amazon do dedicate the stated amount of hardware from their public servers to clients, much of that allocation is intentionally left unused for the purpose of isolating multiple divergent systems on the same machine. But that did not completely explain the issue. We discovered after some probing that our CPU utilization in the experiment plateaued at about 12.5 percent (see Figure 8), while our own machine allocated 25 percent of CPU usage to the Python process. These values were striking because of their reciprocity with the numbers 8 and 4. That is when we realized that the EC2 instance comprised 8 virtual cores, while our own machine had an Intel i7 quad-core: our algorithms were not computing across cores!

As machines with multiple cores have only become common within about the past decade, software development regimes for multi-core computing are still nascent. Further, the problems created by multi-core computing seem very similar to those of clustered computing, which are solved by Hadoop.

Figure 8 Measurements from CloudWatch service of Amazon EC2 on CPU performance during an experiment with our neural network algorithms. EC2 ultimately optimized at 12.5%, which we realized was a full utilization of a single core among 8 virtual cores.

Results and conclusions

Why did our machine learning algorithms not converge to better results? It is indeed a complex domain for any scientist, and an argument could be made that there is simply not enough data on the underlying processes to make extremely accurate predictions. But we believe there are two key reasons we did not perform better relative to competitors: losses in preprocessing the complex data, and the major skew in learning from it. We know we could at least mitigate these challenges with more time. Our preprocessing strategy was sufficient to convert our data into a format amenable to the learning algorithms, but it prevented us from mining frequent itemsets and sequences within individual instances. Essentially, we assumed that we could predict hospitalization from decoupled pools of attribute value counts, which may be naive to a significant extent. Our domain model is already complex because different people can respond differently to the exact same circumstances (set of instances). By stripping temporal and structural information from our data, we made our problem more confusing, even for a perfect learner. Still, we believe the data as we preprocessed it holds more resources for accurate prediction than our algorithms found. Therefore, the main issue we spent most of our energy tackling was that of the data being dominated by negative values.

Our best result, obtained by the SVM staircase model, had an RMSLE of 0.502, and was ranked 720th among 990 submissions to the HHP. It performed better than a submission of all zeros, which had an RMSLE of 0.522. The worst team with multiple submissions had an RMSLE of 1.03, while the best team at the time of writing had 0.451. A bare SVM run on hospitalization data of greater than 0 days earned an RMSLE of 0.714; a decision tree on non-zero data had 0.699. Ultimately, these results are encouraging, although they reflect that there is still much work to be done in training a classifier that can be trusted in a real operational setting.

Fortunately, our approach to rare class prediction through ensembles delivered much better performance than raw SVMs and neural networks. By giving more weight to the positive instances of hospitalization in our learning, we built a system that, in the face of a balanced data set, is much more likely to generate false positives than false negatives. This is appropriate for our specific real-world problem, since preventative care is significantly cheaper than hospitalization, and people prefer to be risk averse with their lives at stake. So it is better to have a predictive system that sets off alarms even when there is only a 10% chance of something terrible happening, like being hospitalized. However, RMSLE does not account for such a bias towards survival, and accordingly a prediction of all zeros, which is a perfect prediction for 85% of the data, was not easy to beat.

Future work and research

We would like to further explore other methods of parsing and formatting our data for the purpose of extracting other features and dimensions from it. This means mapping out what the data physically represents, and attacking it through different slices, integrations, and transformations. It may turn out that some of the most explanatory attributes and structures of our data were discarded, stripped, or ignored in processing.

We hope to further develop original models for using ensembles. In dealing with complex data, complex algorithms, and complex preprocessing, we have a lot of opportunity to deviate from the path we have taken and try completely new approaches. We believe there are processing structures similar in depth to our staircase model in Figure 6, but completely different in design, which could perform very well on our data.


Parallel computing interests us, across both distributed clusters and the multiple cores of single machines. Few data scientists understand how to run machine learning algorithms through such pipelines, but knowing how to implement parallelization can be very rewarding, especially for the development of persistently running automated systems. We are particularly interested in the active research question of how Hadoop with MapReduce can enable fast classification of outliers in categorical datasets, specifically with neural networks [12]. Such research could be harvested into an application framework for dealing with a huge diversity of data mining problems efficiently.

Finally, most of our work dealt with existing libraries of advanced algorithms, but it is certainly of interest to understand the structures of these black boxes better. Having a deeper knowledge of the inner workings of neural networks and support vector machines, in particular, will make us more prudent in knowing how to use them for attacking certain data types; it could potentially allow us, during our careers, to develop custom adaptations of these algorithms that solve unique real-world problems.

References

1. Russo, A., Jiang, H.J., Barrett, M. Trends in potentially preventable hospitalizations among adults and children. Healthcare Cost and Utilization Project Statistical Brief #36 (2007).
2. Heritage Health Prize: http://www.heritagehealthprize.com
3. Orange: http://orange.biolab.si
4. PyBrain: http://pybrain.org
5. SVMlight: http://svmlight.joachims.org/
6. Amazon EC2: http://aws.amazon.com/ec2/
7. Yan, R., Liu, Y., Jin, R., Hauptmann, A. On Predicting Rare Class with SVM Ensemble in Scene Classification. Computer Science Department Paper 998 (2003).
8. Wang, B., Japkowicz, N. Boosting Support Vector Machines for Imbalanced Data Sets. ISMIS '08, 38-47 (2008).
9. Pycluster: http://pypi.python.org/pypi/Pycluster
10. StarCluster: http://web.mit.edu/star/cluster/
11. Liu, H. A Measurement Study of Server Utilization in Public Clouds. Accenture Technology Labs (2012).
12. Koufakou, A. et al. Fast parallel outlier detection for categorical datasets using MapReduce. Neural Networks, 2008 (IEEE World Congress on Computational Intelligence), 3298-3304 (2008).

