CH 6 & 7
Reliability
The proportion of variance in a set of test scores that is due to the real or true attributes of the persons being measured, rather than error.
Also: repeatability, consistency, or stability.
$$r_{xx} = \frac{\sigma^2_{\text{true}}}{\sigma^2_{\text{observed}}} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_e}, \qquad 0 \le r_{xx} \le 1$$
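To make the variance ratio concrete, here is a minimal simulation under the classical model (the numbers are illustrative, not from the chapters): with a true-score SD of 10 and an error SD of 5, reliability should come out near .80.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000                      # simulated test takers
true = rng.normal(50, 10, n)    # true scores: sigma_T = 10
error = rng.normal(0, 5, n)     # random measurement error: sigma_e = 5
observed = true + error         # classical model: X = T + e

# Reliability = true-score variance / observed-score variance
r_xx = true.var() / observed.var()
print(round(r_xx, 2))           # about 100 / (100 + 25) = 0.80
```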
Reliability as Repeatability
Conceptually, any observation has some degree of error or
imprecision
Observed score = TRUE SCORE + ERRORS OF MEASUREMENT
Components of Reliability
Test scores reflect:
Consistency: stable characteristics
Inconsistency: factors that affect the scores but have nothing to do with the characteristics being measured
Components of Reliability
We want a statistic for the proportion of total test-score variance that is due to true-score variance (i.e., what proportion is not due to error variance?).
Defining true-score variance as the consistent, stable variance:
$$\sigma^2_X = \sigma^2_T + \sigma^2_e$$
Assumptions of Classical
Test Theory
Error of measurement is the unsystematic or random deviation of an individual's observed score from the theoretically expected observed score (the true score)
Observed score = True score + error
True score is an expected or mean score, not a real trait; it represents a combination of all the factors that lead to consistency in the measurement
Errors are not correlated with the true score (i.e., they are random)
For a given individual, an error may not be a completely random event. However, across a number of individuals, the causes of error are assumed to be random.
Reliability
A completely reliable scale gives the same reading every time: 133 lb, 133 lb, 133 lb, 133 lb, 133 lb.
Reliability
A completely unreliable scale gives a different reading every time: 115 lb, 140 lb, 141 lb, 122 lb, 118 lb.
Reliability
In $X = T + E$, reliability is highest when the error component $E$ is small (less error).
Methods of Assessing
Reliability
1. Test-Retest
2. Alternate Forms
3. Split-Half
4. Internal Consistency
5. Inter-Rater Reliability
1. Test-Retest
Reliability
"Temporal stability: Simply, the rankorder stability of scores from one
administration of a test to another.
1.Administer the test to a group of people
2.Re-administer it at some other time to the
same group of people
3.correlate Time 1 and Time 2 scores
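A minimal sketch of step 3, assuming the two sets of scores are stored in NumPy arrays (the data here are made up):

```python
import numpy as np

# Same people measured at two administrations (made-up scores)
time1 = np.array([12, 15, 9, 20, 17, 11, 14])
time2 = np.array([13, 14, 10, 19, 18, 10, 15])

# Test-retest reliability: Pearson correlation across occasions
r_tt = np.corrcoef(time1, time2)[0, 1]
print(round(r_tt, 2))
```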
Test-Retest Issues
The assumption is that the true score stays the same; it is this stable component that correlates across the two time points.
The lower the correlation, the less stable the scores and the greater the error or extraneous variation.
Problem One
Characteristics or attributes being
measured may change between Time 1 and 2
Why might this happen?
Change in true score; almost all
psychological traits exhibit some change
across a long enough time interval (e.g.,
reading ability of children)
Problem Two
Practice Effects (a.k.a. carry-over effects)
Learning might occur during the first
administration
Remembering content
Especially an issue if the time between
administrations is too short
More of a problem for performance-type
measures
Problem Three
Reactivity Effects
The experience of taking the test itself can change a person's true score
E.g., on a test of geography, test takers may become curious about the correct answers to the questions and go out and study.
E.g., a test of marital satisfaction may involve questions addressing dimensions the person had never thought of before; they may then start paying attention to that dimension and change accordingly. Thus, the mere act of measurement may change the test-taker.
2. Parallel (Alternate) Forms Reliability
Using the same test on repeated occasions has certain problems (see test-retest above)
Instead, construct two equivalent forms of the test, administer both, and correlate the scores
3. Split-Half
Reliability
Instead of creating two different forms, why not create one form and split it into two?
Reliability is the correlation between the two halves
Fact: each half contains only half the items, and, other things being equal, fewer items means lower reliability
Spearman-Brown Formula
A way to correct for using only half of the items
A formula that computes the reliability if a test were longer or shorter
So it corrects for the small number of items
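The slides do not show the formula itself; the standard Spearman-Brown prophecy formula, with $r$ the observed reliability and $n$ the factor by which the test is lengthened, is

$$r_{\text{new}} = \frac{n\,r}{1 + (n-1)\,r}$$

A small sketch in code (names are illustrative), using the split-half value reported later in these slides:

```python
def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability if the test is lengthened by a factor of n."""
    return (n * r) / (1 + (n - 1) * r)

# A half-test correlation of .92 implies a full-length (n = 2)
# reliability of about .96
print(round(spearman_brown(0.92, 2), 2))
```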
4. Internal Consistency
Average Item Intercorrelation
Cronbach's Coefficient Alpha
Take the logic of split-half and
parallel forms reliability to the
extreme
Every ITEM is a parallel test of the
construct
Therefore, the average correlation among
items is an index of reliability
Last, the correlation of an item with the total score is an index of reliability
Average Item
Intercorrelation
Internal consistency reliability (ICR) estimates are different from the other methods; the focus is on the number of items in the test, the intercorrelations among the items, and their correlation with the test as a whole.
An example: imagine two people who
are taking an internally consistent
test of extraversion
Internal Consistency
Example
Person A is very extraverted; Person B is not
For every item, Person A always responds "true" and Person B always responds "false"
Within a sample of such people, the item responses will therefore be correlated
A Second Example
Imagine Person A and Person B take an
internally consistent test of
intelligence
Person A is very intelligent; Person B is
not so bright
Person A passes every item; Person B fails
nearly every item
Again, within a sample of different people, the item responses will be correlated:
People who pass item 1 will tend to pass items 2, 3, …, n
Internal Consistency
Data
1. Administer a test to a group of individuals
2. Compute correlations among all items and compute the average of those intercorrelations
3. Compute the correlation of each item score with the score of the test without that item (see the sketch below)
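A minimal sketch of steps 2 and 3, assuming responses are held in a person-by-item NumPy matrix (the data are made up):

```python
import numpy as np

# Illustrative item-response matrix: rows = people, columns = items
X = np.array([
    [4, 5, 4, 5, 4],
    [2, 1, 2, 1, 2],
    [3, 3, 4, 3, 3],
    [5, 4, 5, 5, 4],
    [1, 2, 1, 2, 1],
], dtype=float)

k = X.shape[1]
R = np.corrcoef(X, rowvar=False)          # k x k inter-item correlations

# Average inter-item correlation: mean of the off-diagonal entries
avg_r = (R.sum() - k) / (k * (k - 1))

# Corrected item-total correlation: each item vs. the sum of the others
item_total = [np.corrcoef(X[:, i], np.delete(X, i, axis=1).sum(axis=1))[0, 1]
              for i in range(k)]

print(round(avg_r, 2), [round(r, 2) for r in item_total])
```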
Average Inter-Item Correlations

        I1     I2     I3     I4     I5     ...    In
I1     1.00
I2      .89   1.00
I3      .91    .92   1.00
I4      .88    .93    .95   1.00
I5      .84    .86    .92    .85   1.00
...
In      .88    .91    .95    .87    .85    .90   1.00
Internal Consistency
Reliability
The reliability of a test is based on the
number of items in the test (k) and the
average intercorrelation among test items.
Thus, ICR methods are mathematically
linked to the split-half method.
Alpha represents the mean reliability coefficient one would obtain from all possible split halves.
Cronbach's Coefficient Alpha
Most commonly used measure of
internal reliability
Alpha is the average value of all
possible split-half reliabilities
$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} s_i^2}{s_{\text{total}}^2}\right)$$
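A direct translation of the formula into code, assuming the same kind of person-by-item matrix as in the earlier sketch (sample variances, ddof=1):

```python
import numpy as np

def cronbach_alpha(X: np.ndarray) -> float:
    """Cronbach's alpha for a person-by-item score matrix X."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)        # s_i^2 for each item
    total_var = X.sum(axis=1).var(ddof=1)    # s^2 of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```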
Internal Consistency as
a Method
The principal advantage of ICR
methods is practicality.
You don't need to administer a test multiple times.
Some Examples of Reliability
The Data: from a recent scale development project with 291 participants. The project included 308 potential items over 5 facets; this facet ended up with 25 items.
[Slide figure: excerpt of the person-by-item data matrix for Items 1, 2, 3, 5, and 9; item sample means were 2.66, 2.14, 2.74, 1.82, and 2.51.]
Split-Half Reliability
Items 1-13 vs. items 14-25: r = .92
Odd vs. even items: r = .92
By 3s: r = .97
Average Inter-Item Correlations

           Item 3   Item 5   Item 9   Item 12   Item 25
Item 3      1.00
Item 5      0.68     1.00
Item 9      0.58     0.52     1.00
Item 12     0.71     0.67     0.55     1.00
Item 25     0.71     0.68     0.53     0.70      1.00
Cronbach's Alpha (SPSS output)

Item       Alpha If Item Deleted   Item-Total Correlation
Item 3     .97                     .80
Item 5     .97                     .74
Item 9     .97                     .75
Item 12    .97                     .81
Item 25    .97                     .71

α = 0.97
Reliability Estimates
and Error
5. Inter-rater
Reliability
What if you're not using a test but instead observing individuals' behaviors as a psychological assessment tool?
How can we tell whether the judges (assessors) are reliable?
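The slides leave this as a question; one widely used index for two judges assigning categorical ratings (not named on the slide) is Cohen's kappa. A minimal sketch with made-up ratings:

```python
import numpy as np

def cohens_kappa(r1: np.ndarray, r2: np.ndarray) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    categories = np.union1d(r1, r2)
    p_obs = np.mean(r1 == r2)                 # observed agreement
    # Chance agreement: product of each rater's marginal proportions
    p_exp = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (p_obs - p_exp) / (1 - p_exp)

# Two judges categorizing the same 10 observed behaviors (made-up data)
judge1 = np.array([1, 2, 2, 3, 1, 2, 3, 3, 1, 2])
judge2 = np.array([1, 2, 2, 3, 1, 1, 3, 3, 2, 2])
print(round(cohens_kappa(judge1, judge2), 2))   # about 0.70
```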