10/26/2014

Goodness of Fit Measures - University of Strathclyde


Goodness of Fit Measures in Logistic Regression

One question that obviously arises when we are looking at regression models is that of the overall
fit of the model. As in ordinary linear regression, we can find measures of model fit for logistic
regression models.
There are, in fact, a number of different measures of goodness of fit for logistic regression
models. The first, most straightforward, measure is simply to look at the extent to which the
model accurately predicts the dependent variable, or, in other words, how accurate the model is
at predicting whether or not a pupil in this sample is likely to report having literature in their
home. This is calculated by comparing the predicted score of the individual students (as either
possessing or not possessing literature) on the basis of the two independent variables we have in
our model (gender and mother's education) with their actual group membership as given by the
data (in other words, what the data tell us about whether students have actually said they
possess or don't possess literature in the home). SPSS gives us this output in the box labelled
classification table. In this example, this table is as follows:
Observed (Possessions literature Q17h)    Predicted Tick    Predicted No Tick    Percentage Correct
Step 1   Tick                             1707              3212                 34.7
         No Tick                          1211              6663                 84.6
         Overall Percentage                                                      65.4
The overall percentage is given as 65.4%. This means that 65.4% of students have been
accurately classified as either possessing or not possessing literature on the basis of our
two-variable model. There is no absolute cut-off point which tells us whether or not this represents
good fit, but obviously 100% would represent perfect fit, in that all students would be correctly
classified on the basis of our model. This extreme situation is highly unlikely to occur in practice.
Rather, the key question is the extent to which our model is better able to predict group
membership (do students fall into the possessing-literature or the not-possessing-literature group?)
than a model without any of our independent variables. An indication of this is given by the
initial classification table at the start of the SPSS output:
Observed (Possessions literature Q17h)    Predicted Tick    Predicted No Tick    Percentage Correct
Step 0   Tick                             0                 4919                 .0
         No Tick                          0                 7874                 100.0
         Overall Percentage                                                      61.5

In this null model, all values have been assigned to the modal value, no tick, which means not
possessing literature in the home. As can be seen, this gives us an overall accuracy of 61.5%,
meaning that 61.5% of students would be correctly classified if we merely assumed that they
belonged to the largest group, not possessing literature in the home. To say that our model is
worthwhile, we need to do better than this. If we look at the previous table, we can see that our
classification accuracy was 65.4%. Adding mother's education and gender has thus increased the
likelihood of a correct prediction of possession of literature in the home, but not by much. This
would suggest that this is not a particularly accurate model. We can also see in that table that
prediction of not possessing literature is more accurate than prediction of possessing literature. It
will usually be the case that prediction will be more accurate for the larger than for the smaller
group.
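As a sketch of where these percentages come from, the cell counts in the two tables above can be combined in a few lines of Python (the counts are taken from this example's output; the variable names are illustrative):

```python
# Cell counts keyed by (observed group, predicted group),
# taken from the Step 1 classification table above.
counts = {
    ("tick", "tick"): 1707, ("tick", "no tick"): 3212,
    ("no tick", "tick"): 1211, ("no tick", "no tick"): 6663,
}
n = sum(counts.values())  # 12793 students in total

# Two-variable model: a student is classified correctly when the
# predicted group matches the observed group.
model_correct = counts[("tick", "tick")] + counts[("no tick", "no tick")]
model_accuracy = 100 * model_correct / n

# Null model: everyone is assigned to the modal "no tick" group, so
# only the observed "no tick" students are classified correctly.
null_correct = counts[("no tick", "tick")] + counts[("no tick", "no tick")]
null_accuracy = 100 * null_correct / n

print(round(model_accuracy, 1))  # 65.4, as in the Step 1 table
print(round(null_accuracy, 1))   # 61.5, as in the null-model table
```

The 3.9-point gap between these two figures is exactly the "improvement over the null model" discussed above.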
While intuitively easy to understand, this measure of fit is also somewhat limited. It does not
give us any measure of significance, and is not easily comparable to measures of fit in linear
regression. Because of these limitations a number of other measures of model fit have been
developed for logistic regression.

Hosmer-Lemeshow Test
One of these measures is the Hosmer-Lemeshow test of goodness of fit. This is similar to a Chi
Square test, and indicates the extent to which the model provides better fit than a null model
with no predictors, or, in a different interpretation, how well the model fits the data, as in log-linear modelling. If the chi-square goodness of fit is not significant, then the model has adequate fit.
By the same token, if the test is significant, the model does not adequately fit the data. The
Hosmer and Lemeshow (H-L) goodness of fit test divides subjects into deciles based on
predicted probabilities, then computes a chi-square from the observed and expected frequencies.
A probability (p) value is then computed from the chi-square distribution to test the fit of the
logistic model. If the p value for the H-L goodness-of-fit test is greater than .05, as we want for
well-fitting models, we fail to reject the null hypothesis that there is no difference between
observed and model-predicted values, implying that the model's estimates fit the data at an
acceptable level. That is, well-fitting models show nonsignificance on the goodness-of-fit test,
indicating model prediction that is not significantly different from observed values.
A disadvantage of this goodness of fit measure is that it is a significance test, with all the
limitations this entails. Like other significance tests, it only tells us whether the model fits or not,
and does not tell us anything about the extent of the fit. Similarly, like other significance tests, it
is strongly influenced by the sample size (sample size and effect size both determine
significance), and in large samples, such as the PISA dataset we are using here, a very small
difference will lead to significance. As the sample size gets large, the H-L statistic can find
smaller and smaller differences between observed and model-predicted values to be significant.
Small sample sizes are also problematic, however, as, this being a Chi Square-based test, we
can't have too many groups (no more than 10%) with expected frequencies of less than five.
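The decile-based procedure described above can be sketched as follows. This is an illustration, not SPSS's exact implementation; `hosmer_lemeshow` is an illustrative helper name, and the resulting statistic is referred to a chi-square distribution with g − 2 degrees of freedom:

```python
def hosmer_lemeshow(y, p, g=10):
    """Sketch of the Hosmer-Lemeshow statistic: sort cases by predicted
    probability, split them into g equal-sized groups (deciles when
    g = 10), and compare observed with expected event counts per group."""
    pairs = sorted(zip(p, y))              # order cases by predicted probability
    n = len(pairs)
    stat = 0.0
    for i in range(g):
        group = pairs[i * n // g:(i + 1) * n // g]
        observed = sum(yi for _, yi in group)  # observed events in the group
        expected = sum(pi for pi, _ in group)  # expected events in the group
        n_g = len(group)
        denom = expected * (1 - expected / n_g)
        if denom > 0:                      # skip degenerate groups
            stat += (observed - expected) ** 2 / denom
    return stat  # compare to a chi-square distribution with g - 2 df
```

A large statistic (and hence a small p value) indicates poor fit; when observed counts match expected counts in every decile, the statistic is zero.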
The Hosmer-Lemeshow test of goodness of fit is not automatically a part of the SPSS logistic
regression output. To get this output, we need to go into options and tick the box marked
Hosmer-Lemeshow test of goodness of fit. In our example, this gives us the following output:

Step    Chi-square    df    Sig.
1       142.032             .000

Our test is therefore significant, suggesting the model does not fit the data. However, as we have a
sample size of over 13,000, even very small divergences of the model from the data will be
flagged up and cause significance. With samples of this size it is therefore hard to find models
that are parsimonious (i.e. that use the minimum number of independent variables to explain the
dependent variable) and fit the data, and other fit indices might be more appropriate.
In ordinary linear regression, our primary measure of model fit was R2, an indicator of
the percentage of variance in the dependent variable explained by the model. It would be useful
for us to have a similar measure for logistic regression. However, the R2 measure is only
appropriate to linear regression, with its continuous dependent variables. To get around this
problem, a number of statisticians have developed so-called Pseudo R2 measures that aim to
mimic R2 for logistic regression models. In contrast to the actual R2, these are approximations,
so there are a number of different Pseudo R2 measures, each taking a different conceptual
approach to what R2 means.
The most commonly used interpret R2 as representing the improvement of the model we are
using (in our case, the two-variable model) over the null model with no independent variables
(also called predictors). Other approaches are based on viewing R2 as explained variance.
In the first category we can find the two Pseudo R2 measures used in SPSS, Cox and Snell's and
Nagelkerke's. Cox and Snell's R2 is based on calculating the proportion of unexplained variance
that is reduced by adding variables to the model.
The formula for Cox and Snell's Pseudo R2 is:

R2_CS = 1 − exp( ( (−2LL_k) − (−2LL_null) ) / n )
where −2LL_null is −2 times the log-likelihood of the empty model, −2LL_k is −2 times the
log-likelihood of the model with the independent variables, and n is the sample size.
There is a major problem with Cox and Snell's Pseudo R2, however, which is that its
maximum can be (and usually is) less than 1.0, making it difficult to interpret. That is why
Nagelkerke developed a modified version of Cox and Snell's measure that varies from 0 to 1. To
achieve this, Nagelkerke's R2 divides Cox and Snell's R2 by its maximum value, giving us the
formula:

R2_N = R2_CS / ( 1 − exp( −(−2LL_null) / n ) )
Therefore Nagelkerke's Pseudo R2 will normally be higher than the Cox and Snell measure. Both
of these Pseudo R2 measures will tend to be lower than traditional ordinary least squares
R2 measures.

Efron's Pseudo R-squared

An example of an approach that views R2 as explained variance is Efron's Pseudo R2. This
measure takes the model residuals, squares and sums them, divides by the total variability in the
dependent variable, and subtracts the result from one:

R2_Efron = 1 − Σ(y_i − π_i)2 / Σ(y_i − ȳ)2

where y_i are the observed outcomes, π_i the predicted probabilities, and ȳ the mean of the
observed outcomes.
This R-squared is equal to the squared correlation between the predicted values and actual
values.
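A minimal sketch of Efron's measure (the function name is illustrative; `y` holds the observed 0/1 outcomes and `p` the model's predicted probabilities):

```python
def efron_r2(y, p):
    """Efron's Pseudo R2: one minus the sum of squared residuals
    over the total variability in the dependent variable."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, p))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Perfect predictions give 1; predicting the mean for everyone gives 0.
print(efron_r2([0, 1, 0, 1], [0.0, 1.0, 0.0, 1.0]))  # 1.0
print(efron_r2([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))  # 0.0
```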
Other Pseudo R2s, such as McFadden's and McKelvey and Zavoina's measures, also exist, but we
will not discuss them all in detail here. However, the existence of multiple measures, as opposed
to the single R2 we had in ordinary linear regression, points to the fact that these are
approximations of R2, inexact and disputable to some extent, and it is important to
remember that they will give us somewhat different numbers.


Pseudo R-Squared in SPSS


In SPSS we only get two Pseudo R2 measures in the output, Cox and Snell and Nagelkerke. These
are given in the box labelled model summary in the output:

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       16327.952(a)         .055                    .074

As we can see, Nagelkerke's measure gives us a higher value than Cox and Snell's, as we
would expect. We also said earlier that Nagelkerke's measure is a correction of Cox and Snell's,
allowing the measure to use the full 0-1 range, so we will prefer to use this one.
Whichever of the measures we use, however, we can see that the fit of the model is poor. As .074 is
close to 0, our model is not a great improvement over the null model with no predictors.
Summarising what these three measures tell us about the fit of our model, the Hosmer and
Lemeshow goodness of fit Chi Square test indicates that our model does not fit the data.
However, this measure is sensitive to sample size, and in our large sample few parsimonious
models would fit. Nevertheless, poor fit is also indicated by the other measures. Accuracy of
prediction has improved over the null model, but only by about 4 percentage points. Nagelkerke's
Pseudo R2 is only .074, again (think of an analogous R2 in ordinary linear regression) indicating
poor fit. So, even though both our predictors are significant, they are weak predictors of
possessing literature in the home.

Task 1
Let's now try and improve our model. We will add two more independent variables that may be
related to whether or not pupils say they possess literature in the home. Firstly, we would like to
know whether there is any difference between the three countries in our sample, Finland,
Scotland and Flanders. Secondly, it might be the case that achievement in English would lead to a
greater awareness of literature in the home, so we shall include English reading test scores in the
model as well. The two variables we already have (gender and mother's education) will remain in
our model. [See Module 4 (on multiple regression with nominal independent variables) for how to
create the dummy variables for Finnish and Flemish, if you don't have these in the version of the
dataset that you are using.]

View answer
Task 2
We said earlier that, as the coefficients are unstandardised, we cannot directly compare the
strength of these variables to each other. What can we do to see whether country or reading score
has the strongest relationship with the dependent variable?

View answer