
ROC AND AUC

Really great question, and one that I find most people don't really understand on an
intuitive level. AUC is in fact often preferred over accuracy for binary classification for a
number of different reasons. First though, let's talk about exactly what AUC is. Honestly, for
being one of the most widely used efficacy metrics, it's surprisingly opaque to figure out
exactly how AUC works.
AUC stands for Area Under the Curve. Which curve, you ask? Well, that would be
the ROC curve. ROC stands for Receiver Operating Characteristic, which is admittedly a slightly
non-intuitive name. The implicit goal of AUC is to deal with situations where you have a very skewed
sample distribution, and don't want to overfit to a single class.
A great example is in spam detection. Generally spam data sets are STRONGLY biased
towards ham, or not-spam. If your data set is 90% ham, you can get a pretty damn good
accuracy by just saying that every single email is ham, which is obviously something that
indicates a non-ideal classifier. Let's start with a couple of metrics that are a little more
useful for us, specifically the true positive rate (TPR) and the false positive rate (FPR):
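
As a concrete sketch of those two rates (the helper functions and the spam/ham counts below are my own illustration, not from the original post), computed from raw confusion-matrix counts:

```python
# True positive rate and false positive rate from confusion-matrix counts.
def true_positive_rate(tp, fn):
    # TPR = TP / (TP + FN): the fraction of actual positives we caught.
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    # FPR = FP / (FP + TN): the fraction of actual negatives we flagged by mistake.
    return fp / (fp + tn)

# Illustrative spam example: 100 spam emails (80 caught, 20 missed)
# and 900 ham emails (45 flagged as spam by mistake, 855 left alone).
print(true_positive_rate(80, 20))   # 0.8
print(false_positive_rate(45, 855)) # 0.05
```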

Now in this graph, TPR is specifically the ratio of true positives to all positives, and FPR is the
ratio of false positives to all negatives. (Keep in mind, this is only for binary classification.)
On a graph like this, it should be pretty straightforward to figure out that a prediction of all
0's or all 1's will result in the points (0,0) and (1,1) respectively. If you draw a line
through these points you get something like this:

Which looks basically like a diagonal line (it is), and by some easy geometry, you can see
that the AUC of such a model would be 0.5 (the area under it is a triangle with base 1 and
height 1). Similarly, if you predict a random assortment of 0's and 1's, let's say 90% 1's,
you would get a point near (0.9, 0.9), which again falls along that diagonal line.
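As a quick sanity check on that claim (this simulation is my own sketch, assuming NumPy is available), random guessing lands on the diagonal no matter what the true labels are:

```python
# Predicting "1" for a random 90% of examples gives TPR ~ 0.9 and FPR ~ 0.9,
# i.e. a point on the diagonal, regardless of the true labels.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100_000)          # arbitrary true labels
y_pred = (rng.random(100_000) < 0.9).astype(int)   # predict 1 for ~90% of cases

tp = np.sum((y_pred == 1) & (y_true == 1))
fn = np.sum((y_pred == 0) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))

print("TPR:", tp / (tp + fn))  # ~0.9
print("FPR:", fp / (fp + tn))  # ~0.9
```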
Now comes the interesting part. What if we weren't only predicting 0's and 1's? What if
instead we set a cutoff, above which every result is a 1, and below which every result is a 0?
This means that at the extremes you get the original situations of all 1's and all 0's (at a
cutoff of 0 and 1 respectively), but also a series of intermediate states that fall within the
1x1 graph that contains your ROC. In practice you get a curve like this:
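
To make the sweep concrete, here is a minimal sketch of that cutoff idea (the toy labels and scores are my own, assuming the classifier outputs a score in [0, 1]); each cutoff produces one (FPR, TPR) point, and together those points trace the ROC curve:

```python
import numpy as np

def roc_points(y_true, scores, thresholds):
    # One (FPR, TPR) point per cutoff: everything at or above the cutoff is a 1.
    points = []
    for t in thresholds:
        y_pred = (scores >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.05, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.75, 0.9, 0.1])
for fpr, tpr in roc_points(y_true, scores, np.linspace(0, 1, 6)):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")  # runs from (1, 1) at cutoff 0 down to (0, 0) at cutoff 1
```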
So what you're actually getting when you use AUC over accuracy is a metric that strongly
discourages models that are representative but not discriminative: it only rewards models
whose true positive rate sits well above their false positive rate across thresholds, which
plain accuracy does not guarantee.

AUC and accuracy are fairly different things. AUC applies to binary classifiers that have
some notion of a decision threshold internally. For example, logistic regression returns
positive or negative depending on whether the logistic function is greater or smaller than a
threshold, usually 0.5 by default. When you choose your threshold, you have a classifier.
You have to choose one.
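
A short sketch of that point, assuming scikit-learn is available (the toy data and variable names here are mine): the model produces a probability, and picking the 0.5 cutoff is what turns it into a hard classifier with a particular accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy labels, purely for illustration

model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X)[:, 1]       # the classifier's internal value

threshold = 0.5                            # choosing a threshold gives you a classifier
y_pred = (proba >= threshold).astype(int)
print("accuracy at threshold 0.5:", (y_pred == y).mean())
```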

For a given choice of threshold, you can compute accuracy, which is the proportion of true
positives and true negatives in the whole data set.

AUC measures how true positive rate (recall) and false positive rate trade off, so in that
sense it is already measuring something else. More importantly, AUC is not a function of
threshold. It is an evaluation of the classifier as threshold varies over all possible values. It
is in a sense a broader metric, testing the quality of the internal value that the classifier
generates and then compares to a threshold. It is not testing the quality of a particular
choice of threshold.
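
A small sketch of that contrast, again with scikit-learn and made-up scores (not from the original answer): accuracy moves around as the threshold changes, while ROC AUC is computed from the raw scores and never sees a threshold at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6, 0.55, 0.3])

for t in (0.2, 0.5, 0.8):
    y_pred = (scores >= t).astype(int)
    print(f"accuracy @ {t}:", accuracy_score(y_true, y_pred))

print("ROC AUC:", roc_auc_score(y_true, scores))  # one number, no threshold involved
```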

AUC also has a different interpretation: it is the probability that a randomly
chosen positive example is ranked above a randomly chosen negative example, according
to the classifier's internal value for the examples.
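
That interpretation is easy to check numerically; the sketch below (mine, using the same toy scores as above) compares the fraction of positive/negative pairs ranked correctly against scikit-learn's roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6, 0.55, 0.3])

pos = scores[y_true == 1]
neg = scores[y_true == 0]
diffs = pos[:, None] - neg[None, :]           # every positive paired with every negative
rank_prob = (np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)) / diffs.size

print(rank_prob)                              # P(random positive ranked above random negative)
print(roc_auc_score(y_true, scores))          # matches the pairwise probability
```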

AUC is computable even if you have an algorithm that only produces a ranking on
examples. AUC is not computable if you truly only have a black-box classifier, and not one
with an internal threshold. This usually dictates which of the two is even available for the
problem at hand.

AUC is, I think, a more comprehensive measure, although applicable in fewer situations. It's
not strictly better than accuracy; it's different. It depends in part on whether you care more
about true positives, false negatives, etc.

F-measure is more like accuracy in the sense that it's a function of a classifier and its threshold
setting. But it measures precision vs. recall (true positive rate), which is not the same as either
of the above.
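
For completeness, a brief sketch of those quantities using scikit-learn's standard metric functions on a fixed set of hard predictions (the labels below are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

y_true = [0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN), i.e. TPR
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two
print("accuracy: ", accuracy_score(y_true, y_pred))
```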

As answered before, on an imbalanced dataset, always predicting the majority class will lead to
high accuracy, which makes accuracy a misleading measure. AUC aggregates over confidence
thresholds, for good and bad. The good: you get a weighted result over all confidence levels. The
bad: you usually care only about the confidence level you will actually use, and the
rest are irrelevant.

However, I want to remark on choosing a proper performance measure for a model. You
should compare models by their goal. The goal of a model is not a question of machine
learning or statistics; it is a question of the business domain and its needs.

If you are digging for gold (a scenario in which you have huge benefit from a true positive,
not too high cost of a false positive) then recall is a good measure.

If you are trying to decide whether to perform a complex medical procedure on people (high
cost of false positive, hopefully a low cost of false negative), precision is the measure you
should use.
