
Reliability

CH 6 & 7

Reliability
the proportion of variance in a set of test
scores that is due to the real or true attributes
of the persons being measured, rather than
error
Also, repeatability, consistency, or stability
r_xx = σ²_True / σ²_Observed = σ²_T / (σ²_T + σ²_e)

0 ≤ r_xx ≤ 1

Reliability as Repeatability
Conceptually, any observation has some degree of error or
imprecision
Observed score = TRUE SCORE + ERRORS OF MEASUREMENT

By taking multiple measurements it is presumed that these random errors will cancel each other out
Under certain assumptions the mean of repeated measurements
is considered an estimate of the true score
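A minimal sketch of this idea, using a hypothetical true weight of 133 lb and purely random measurement error:

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 133.0                      # hypothetical true weight (lb)
errors = rng.normal(0, 5, size=1000)    # random, unsystematic measurement errors
observed = true_score + errors          # observed score = true score + error

# With many repeated measurements, the random errors tend to cancel out,
# so the mean of the observed scores approaches the true score.
print(observed.mean())   # approximately 133
```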

Components of Reliability
Test scores reflect:
Consistency
stable characteristics
Inconsistency
factors that affect the scores
but have nothing to do with characteristics being
measured

Components of Reliability
Want a statistic of the proportion of total test
score variance that is due to the true score
variance
i.e., what proportion is not due to error
variance?
Defining true score variance as the consistent,
stable variance

Classical Test Theory (CTT) Reliability


Observed score = true score + error
X = True + error

σ²_X = σ²_T + σ²_e

What is observed is a function of the variability in the true score and the variability of the errors of measurement
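A short simulation of this decomposition (the variance values below are arbitrary illustration choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

true_scores = rng.normal(100, 15, n)   # sigma^2_T = 225
errors = rng.normal(0, 5, n)           # sigma^2_e = 25, uncorrelated with true scores
observed = true_scores + errors        # X = T + e

# Observed-score variance is approximately the sum of true and error variance,
# and reliability is the ratio of true-score variance to observed variance.
print(observed.var())                        # roughly 225 + 25 = 250
print(true_scores.var() / observed.var())    # roughly 225 / 250 = .90
```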

Assumptions of Classical
Test Theory
Error of measurement is the unsystematic or random deviation of an individual's score from a theoretically expected observed score (true score)
Observed score = True Score + error
True score is an expected or mean score, not a real trait; it represents a combination of all the factors that lead to consistency in the measurement
Errors are not correlated with true score (i.e.,
random)
For a given individual, an error may not be a completely
random event. However, across a number of individuals,
the causes of error are assumed to be random.

Errors on two different tests are not correlated

Reliability
Completely Reliable
scale
133 lb
133 lb
133 lb
133 lb
133 lb

Reliability
Completely Unreliable
scale
115 lb
140 lb
141 lb
122 lb
118 lb

Reliability
Reliability is highest when, in X = T + E, the error component E is small (less error)

Methods of Assessing
Reliability
1. Test-retest
2. Alternate Forms
3. Split-half
4. Internal Consistency
5. Inter-Rater Reliability

1. Test-Retest Reliability
Temporal stability: simply, the rank-order stability of scores from one administration of a test to another.
1. Administer the test to a group of people
2. Re-administer it at some other time to the same group of people
3. Correlate Time 1 and Time 2 scores

If the correlation is < 1.0, the difference is due to error variance
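A minimal sketch of steps 1-3, with made-up scores for the same five people at the two administrations:

```python
import numpy as np

# Hypothetical scores for the same five people at Time 1 and Time 2
time1 = np.array([12, 18, 25, 9, 30])
time2 = np.array([14, 17, 27, 10, 28])

# Test-retest reliability is the Pearson correlation between the two administrations
r_tt = np.corrcoef(time1, time2)[0, 1]
print(round(r_tt, 2))
```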

Test-Retest Issues
The true score should remain
the same and that is what is
correlating across the time
points
The lower the correlation, the
less stable the scores and the
more error or extraneous variation

Problems with this approach?

Problem One
Characteristics or attributes being
measured may change between Time 1 and 2
Why might this happen?
Change in true score; almost all
psychological traits exhibit some change
across a long enough time interval (e.g.,
reading ability of children)

Use short time intervals (1 week, 1 month) to estimate reliability
Want error due to random fluctuations, not long-term changes

Problem Two
Practice Effects (a.k.a. carry-over effects)
Learning might occur during the first
administration
Remembering content
Especially an issue if the time between
administrations is too short
More of a problem for performance-type
measures

Problem Three
Reactivity Effects
The experience of taking the test itself can change a person's true score
E.g., on a test of geography, test takers may become curious about the correct answers to the questions and go out and study.
E.g., a test of marital satisfaction may involve questions addressing dimensions the person had never thought of before; they may then start paying attention to that dimension and change accordingly. Thus, the mere act of measurement may change the test-taker.

2. Parallel
(Alternative) Forms
Reliability
Using the same test on repeated occasions
has certain problems (test-retest)

Therefore, we can use parallel forms of the test on the two occasions
Persons take one form at Time 1 and the alternate form at Time 2 (e.g., GRE)
The correlation between the tests is the reliability of the test
The mean of the two scores is an estimate of an individual's true score
Correlation between parallel forms is also
an index of temporal stability

To Be Parallel You Must


Have the same means and standard
deviations
Items must be of the same difficulty
Same number of items, expressed in the
same form, and cover the same content
ALL other characteristics must be the
same
Examinees should be indifferent to the
form

Issues With Parallel Forms
Nice idea, but it is not easy to construct identical or very similar forms every time.
Like the test-retest method, this method requires two separate test administrations; thus, it can be quite costly and cumbersome.
Unless the forms are perfectly parallel, this form of reliability violates the assumptions of CTT.

3. Split-Half
Reliability
Instead of creating two
different forms, why not create
one form and split it into two?
Reliability is the correlation
between the two halves

Issues with Split-Half Reliability
How should we split the test?
1st half vs. 2nd half? (not a good idea; think about fatigue effects)
Even-odd
Random halves

Fact: Half does not contain all the items
Problem, because there is a direct relationship between test length and reliability
Only using half the items reduces our estimate of reliability

Fundamentally violates the assumptions of CTT

Spearman-Brown Formula
A way to correct for using only half of the
items
Formula that estimates the reliability if a test were longer or shorter
So it corrects for the small number of items
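The standard Spearman-Brown prophecy formula is r_new = n·r / [1 + (n − 1)·r], where n is the factor by which the test is lengthened; for a split-half correlation (n = 2) this gives r_full = 2·r_half / (1 + r_half). A small sketch with made-up item data:

```python
import numpy as np

def spearman_brown(r, n=2.0):
    """Project the reliability if the test were n times its current length."""
    return (n * r) / (1 + (n - 1) * r)

# Hypothetical data: 100 people, 10 items driven by one underlying construct
rng = np.random.default_rng(0)
true_score = rng.normal(0, 1, size=(100, 1))
items = true_score + rng.normal(0, 1, size=(100, 10))

odd_half = items[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
even_half = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...

r_half = np.corrcoef(odd_half, even_half)[0, 1]   # correlation between the halves
r_full = spearman_brown(r_half, n=2)              # corrected to full test length
print(round(r_half, 2), round(r_full, 2))
```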

4. Internal Consistency
Average Item Intercorrelation
Cronbach's Coefficient Alpha
Take the logic of split-half and
parallel forms reliability to the
extreme
Every ITEM is a parallel test of the
construct
Therefore, the average correlation among
items is an index of reliability
Last, the correlation of an item to the
total is an index of reliability

Average Item
Intercorrelation
Internal consistency reliability
ICR estimates are different from the
other methods; focus is on # of items
in the test and the intercorrelations
among the items and their correlation
to the test as a whole.
An example: imagine two people who
are taking an internally consistent
test of extraversion

Internal Consistency
Example
Person A is very extraverted, Person B
is not
For every item, Person A always responds
true and Person B always responds false

So, within a sample of different people, the responses to items will be correlated
People who score high on item 1 will also score high on items 2, 3, …, n
Internal consistency

A Second Example
Imagine Person A and Person B take an
internally consistent test of
intelligence
Person A is very intelligent; Person B is
not so bright
Person A passes every item; Person B fails
nearly every item
Again within a sample of different people,
the item responses will be correlated
People who pass item 1 will tend to pass items 2, 3, …, n

Internal Consistency
Data
1. Administering a test to a group of individuals
2. Computing correlations among all items and computing the average of those intercorrelations
3. Computing the correlation of an item score to the score of the test without that item (a computational sketch of steps 2 and 3 follows below)
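A rough sketch of steps 2 and 3, assuming a hypothetical `items` array with one row per person and one column per item:

```python
import numpy as np

def average_interitem_r(items):
    """Mean of the unique off-diagonal inter-item correlations (step 2)."""
    r = np.corrcoef(items, rowvar=False)        # item-by-item correlation matrix
    return r[np.triu_indices_from(r, k=1)].mean()

def item_total_correlations(items):
    """Correlation of each item with the total score excluding that item (step 3)."""
    total = items.sum(axis=1)
    return np.array([
        np.corrcoef(items[:, j], total - items[:, j])[0, 1]
        for j in range(items.shape[1])
    ])

# Hypothetical data: 100 people answering 10 items driven by one construct
rng = np.random.default_rng(0)
items = rng.normal(0, 1, size=(100, 1)) + rng.normal(0, 1, size=(100, 10))

print(round(average_interitem_r(items), 2))
print(item_total_correlations(items).round(2))
```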

Average Inter-Item Correlations

        I1     I2     I3     I4     I5    ...    In
I1    1.00
I2     .89   1.00
I3     .91    .92   1.00
I4     .88    .93    .95   1.00
I5     .84    .86    .92    .85   1.00
In     .88    .91    .95    .87    .85    .90   1.00

Internal Consistency
Reliability
The reliability of a test is based on the
number of items in the test (k) and the
average intercorrelation among test items.
Thus, ICR methods are mathematically
linked to the split-half method.
alpha represents the mean reliability coefficient
one would obtain from all possible split halves.
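One common way to write this link is the standardized form of alpha, which applies the Spearman-Brown logic to the average inter-item correlation r̄ of a k-item test:

α_standardized = k·r̄ / [1 + (k − 1)·r̄]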

Cronbach's Coefficient Alpha
Most commonly used measure of
internal reliability
Alpha is the average value of all
possible split-half reliabilities

α = [k / (k − 1)] × [1 − (Σ s²_i / s²_total)]

where k is the number of items, s²_i is the variance of item i, and s²_total is the variance of total test scores
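A minimal sketch of this computation, assuming `items` is a people-by-items score matrix (the data below are made up):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha from a people-by-items score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)        # s^2_i for each item
    total_variance = items.sum(axis=1).var(ddof=1)    # s^2 of the total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical data: 100 people, 10 items reflecting one construct
rng = np.random.default_rng(0)
items = rng.normal(0, 1, size=(100, 1)) + rng.normal(0, 1, size=(100, 10))
print(round(cronbach_alpha(items), 2))
```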

Internal Consistency as
a Method
The principal advantage of ICR
methods is practicality.
You don't need to administer a test multiple times.

Split-half methods may look more computationally challenging.
This is not an issue anymore; computers generate reliability estimates in a second.

Using Cronbach's Alpha

By convention, alpha should be at least .70 to retain an item in an "adequate" scale;
Many researchers require a cut-off of
.80 for a good scale.
Most consider alpha above .95
indication of redundancy.
However, specificity/generality of
the scale focus affects the estimates

Some Examples of Reliability

The Data
From a recent scale development project.
Included 308 potential items over 5 facets.
This facet ended up with 25 items.
291 participants.
[Data excerpt: a handful of observations' responses to Items 1, 2, 3, 5, and 9, with the sample mean for each item.]

Split-Half Reliability
Items 1-13 vs. 14-25: r = .92
Odd vs. even: r = .92
By 3s: r = .97

Average Inter-item Correlations

           Item 3   Item 5   Item 9   Item 12   Item 25
Item 3      1.00
Item 5      0.68     1.00
Item 9      0.58     0.52     1.00
Item 12     0.71     0.67     0.55      1.00
Item 25     0.71     0.68     0.53      0.70      1.00

Pure mean = 0.56

Cronbach's Alpha

           Alpha   Alpha if Item Deleted   Item-Total Correlation
Item 3      .97            .97                     .80
Item 5      .97            .97                     .74
Item 9      .97            .97                     .75
Item 12     .97            .97                     .81
Item 25     .97            .97                     .71

α = 0.97 (SPSS output)

Reliability Estimates
and Error

5. Inter-rater
Reliability
What if you're not using a test but instead observing individuals' behaviors as a psychological assessment tool?
How can we tell if the judges
(assessors) are reliable?

Typically a set of criteria is established for judging the behavior, and the judge is trained on the criteria
Then to establish the reliability of
both the set of criteria and the
judge, multiple judges rate the same
series of behaviors
The correlation between the judges is
the typical measure of reliability

Kappa is a measure of inter-rater reliability that controls for chance agreement (a computational sketch follows below)
Values range from -1 (less agreement than expected by chance) to +1 (perfect agreement)
Above .75: excellent
.40 - .75: fair to good
Below .40: poor
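A small sketch of Cohen's kappa for two raters assigning categorical codes to the same behaviors, using κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance (the ratings below are made up):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters' categorical judgments of the same cases."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n     # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[cat] * c2[cat] for cat in c1) / n ** 2       # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of ten behaviors by two trained judges
rater1 = ["on-task", "off-task", "on-task", "on-task", "off-task",
          "on-task", "on-task", "off-task", "on-task", "on-task"]
rater2 = ["on-task", "off-task", "on-task", "off-task", "off-task",
          "on-task", "on-task", "on-task", "on-task", "on-task"]

print(round(cohens_kappa(rater1, rater2), 2))   # about .52, "fair to good"
```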
