CH 6 & 7
Reliability
The proportion of variance in a set of test scores that is due to the real or true attributes of the persons being measured, rather than error.
Also: repeatability, consistency, or stability.
$$r_{xx} = \frac{\sigma^2_{\text{true}}}{\sigma^2_{\text{observed}}} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_e}, \qquad 0 \le r_{xx} \le 1$$
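To make the variance ratio concrete, here is a minimal simulation under the classical model (the numbers are illustrative, not from the chapters): with a true-score SD of 10 and an error SD of 5, reliability should come out near .80.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000                      # simulated test takers
true = rng.normal(50, 10, n)    # true scores: sigma_T = 10
error = rng.normal(0, 5, n)     # random measurement error: sigma_e = 5
observed = true + error         # classical model: X = T + e

# Reliability = true-score variance / observed-score variance
r_xx = true.var() / observed.var()
print(round(r_xx, 2))           # about 100 / (100 + 25) = 0.80
```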
Reliability as Repeatability
Conceptually, any observation has some degree of error or
imprecision
Observed score = TRUE SCORE + ERRORS OF MEASUREMENT
Components of Reliability
Test scores reflect:
Consistency: stable characteristics
Inconsistency: factors that affect the scores but have nothing to do with the characteristics being measured
Components of Reliability
We want a statistic for the proportion of total test-score variance that is due to true-score variance (i.e., what proportion is not due to error variance?).
Defining true-score variance as the consistent, stable variance:
$$\sigma^2_X = \sigma^2_T + \sigma^2_e$$
Assumptions of Classical
Test Theory
Error of measurement is the unsystematic or random deviation of an individual's observed score from the theoretically expected observed score (the true score)
Observed score = True score + error
True score is an expected or mean score, not a real trait; it represents a combination of all the factors that lead to consistency in the measurement
Errors are not correlated with the true score (i.e., they are random)
For a given individual, an error may not be a completely random event. However, across a number of individuals, the causes of error are assumed to be random.
Reliability
A completely reliable scale gives the same reading every time: 133 lb, 133 lb, 133 lb, 133 lb, 133 lb.
Reliability
A completely unreliable scale gives a different reading every time: 115 lb, 140 lb, 141 lb, 122 lb, 118 lb.
Reliability
In $X = T + E$, reliability is highest when the error component $E$ is small (less error).
Methods of Assessing
Reliability
1. Test-Retest
2. Alternate Forms
3. Split-Half
4. Internal Consistency
5. Inter-Rater Reliability
1. Test-Retest
Reliability
"Temporal stability: Simply, the rankorder stability of scores from one
administration of a test to another.
1.Administer the test to a group of people
2.Re-administer it at some other time to the
same group of people
3.correlate Time 1 and Time 2 scores
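A minimal sketch of step 3, assuming the two sets of scores are stored in NumPy arrays (the data here are made up):

```python
import numpy as np

# Same people measured at two administrations (made-up scores)
time1 = np.array([12, 15, 9, 20, 17, 11, 14])
time2 = np.array([13, 14, 10, 19, 18, 10, 15])

# Test-retest reliability: Pearson correlation across occasions
r_tt = np.corrcoef(time1, time2)[0, 1]
print(round(r_tt, 2))
```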
Test-Retest Issues
The assumption is that the true score stays the same; it is this stable component that correlates across the two time points.
The lower the correlation, the less stable the scores and the greater the error or extraneous variation.
Problem One
Characteristics or attributes being
measured may change between Time 1 and 2
Why might this happen?
Change in true score; almost all
psychological traits exhibit some change
across a long enough time interval (e.g.,
reading ability of children)
Problem Two
Practice Effects (a.k.a. carry-over effects)
Learning might occur during the first
administration
Remembering content
Especially an issue if the time between
administrations is too short
More of a problem for performance-type
measures
Problem Three
Reactivity Effects
The experience of taking the test itself can change a person's true score
E.g., on a test of geography, test takers may become curious about the correct answers to the questions and go out and study.
E.g., a test of marital satisfaction may involve questions addressing dimensions the person had never thought of before; they may then start paying attention to that dimension and change accordingly. Thus, the mere act of measurement may change the test-taker.
2. Parallel (Alternate) Forms Reliability
Using the same test on repeated occasions has certain problems (see test-retest above)
Instead, construct two equivalent forms of the test, administer both, and correlate the scores
3. Split-Half
Reliability
Instead of creating two different forms, why not create one form and split it into two?
Reliability is the correlation between the two halves
Fact: each half contains only half the items, and, other things being equal, fewer items means lower reliability
Spearman-Brown Formula
A way to correct for using only half of the items
A formula that computes the reliability if a test were longer or shorter
So it corrects for the small number of items
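The slides do not show the formula itself; the standard Spearman-Brown prophecy formula, with $r$ the observed reliability and $n$ the factor by which the test is lengthened, is

$$r_{\text{new}} = \frac{n\,r}{1 + (n-1)\,r}$$

A small sketch in code (names are illustrative), using the split-half value reported later in these slides:

```python
def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability if the test is lengthened by a factor of n."""
    return (n * r) / (1 + (n - 1) * r)

# A half-test correlation of .92 implies a full-length (n = 2)
# reliability of about .96
print(round(spearman_brown(0.92, 2), 2))
```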
4. Internal Consistency
Average Item Intercorrelation
Cronbach's Coefficient Alpha
Take the logic of split-half and
parallel forms reliability to the
extreme
Every ITEM is a parallel test of the
construct
Therefore, the average correlation among
items is an index of reliability
Last, the correlation of an item with the total score is an index of reliability
Average Item
Intercorrelation
Internal consistency reliability (ICR) estimates are different from the other methods; the focus is on the number of items in the test, the intercorrelations among the items, and their correlation with the test as a whole.
An example: imagine two people who
are taking an internally consistent
test of extraversion
Internal Consistency
Example
Person A is very extraverted; Person B is not
For every item, Person A always responds "true" and Person B always responds "false"
Within a sample of such people, the item responses will therefore be correlated
A Second Example
Imagine Person A and Person B take an
internally consistent test of
intelligence
Person A is very intelligent; Person B is
not so bright
Person A passes every item; Person B fails
nearly every item
Again, within a sample of different people, the item responses will be correlated:
People who pass item 1 will tend to pass items 2, 3, …, n
Internal Consistency
Data
1. Administer a test to a group of individuals
2. Compute correlations among all items and compute the average of those intercorrelations
3. Compute the correlation of each item score with the score of the test without that item (see the sketch below)
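A minimal sketch of steps 2 and 3, assuming responses are held in a person-by-item NumPy matrix (the data are made up):

```python
import numpy as np

# Illustrative item-response matrix: rows = people, columns = items
X = np.array([
    [4, 5, 4, 5, 4],
    [2, 1, 2, 1, 2],
    [3, 3, 4, 3, 3],
    [5, 4, 5, 5, 4],
    [1, 2, 1, 2, 1],
], dtype=float)

k = X.shape[1]
R = np.corrcoef(X, rowvar=False)          # k x k inter-item correlations

# Average inter-item correlation: mean of the off-diagonal entries
avg_r = (R.sum() - k) / (k * (k - 1))

# Corrected item-total correlation: each item vs. the sum of the others
item_total = [np.corrcoef(X[:, i], np.delete(X, i, axis=1).sum(axis=1))[0, 1]
              for i in range(k)]

print(round(avg_r, 2), [round(r, 2) for r in item_total])
```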
Average Inter-Item Correlations

        I1     I2     I3     I4     I5     ...    In
I1     1.00
I2      .89   1.00
I3      .91    .92   1.00
I4      .88    .93    .95   1.00
I5      .84    .86    .92    .85   1.00
...
In      .88    .91    .95    .87    .85    .90   1.00
Internal Consistency
Reliability
The reliability of a test is based on the
number of items in the test (k) and the
average intercorrelation among test items.
Thus, ICR methods are mathematically
linked to the split-half method.
Alpha represents the mean reliability coefficient one would obtain from all possible split halves.
Cronbach's Coefficient Alpha
Most commonly used measure of
internal reliability
Alpha is the average value of all
possible split-half reliabilities
$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} s_i^2}{s_{\text{total}}^2}\right)$$
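A direct translation of the formula into code, assuming the same kind of person-by-item matrix as in the earlier sketch (sample variances, ddof=1):

```python
import numpy as np

def cronbach_alpha(X: np.ndarray) -> float:
    """Cronbach's alpha for a person-by-item score matrix X."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)        # s_i^2 for each item
    total_var = X.sum(axis=1).var(ddof=1)    # s^2 of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```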
Internal Consistency as
a Method
The principal advantage of ICR
methods is practicality.
You don't need to administer a test multiple times.
Some Examples of Reliability
The Data: from a recent scale development project with 291 participants. The project included 308 potential items over 5 facets; this facet ended up with 25 items.
[Slide figure: excerpt of the person-by-item data matrix for Items 1, 2, 3, 5, and 9; item sample means were 2.66, 2.14, 2.74, 1.82, and 2.51.]
Split-Half Reliability
Items 1-13 vs. items 14-25: r = .92
Odd vs. even items: r = .92
By 3s: r = .97
Average Inter-Item Correlations

           Item 3   Item 5   Item 9   Item 12   Item 25
Item 3      1.00
Item 5      0.68     1.00
Item 9      0.58     0.52     1.00
Item 12     0.71     0.67     0.55     1.00
Item 25     0.71     0.68     0.53     0.70      1.00
Cronbach's Alpha (SPSS output)

Item       Alpha If Item Deleted   Item-Total Correlation
Item 3     .97                     .80
Item 5     .97                     .74
Item 9     .97                     .75
Item 12    .97                     .81
Item 25    .97                     .71

α = 0.97
Reliability Estimates
and Error
5. Inter-rater
Reliability
What if you're not using a test but instead observing individuals' behaviors as a psychological assessment tool?
How can we tell whether the judges (assessors) are reliable?
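The slides leave this as a question; one widely used index for two judges assigning categorical ratings (not named on the slide) is Cohen's kappa. A minimal sketch with made-up ratings:

```python
import numpy as np

def cohens_kappa(r1: np.ndarray, r2: np.ndarray) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    categories = np.union1d(r1, r2)
    p_obs = np.mean(r1 == r2)                 # observed agreement
    # Chance agreement: product of each rater's marginal proportions
    p_exp = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (p_obs - p_exp) / (1 - p_exp)

# Two judges categorizing the same 10 observed behaviors (made-up data)
judge1 = np.array([1, 2, 2, 3, 1, 2, 3, 3, 1, 2])
judge2 = np.array([1, 2, 2, 3, 1, 1, 3, 3, 2, 2])
print(round(cohens_kappa(judge1, judge2), 2))   # about 0.70
```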