SEPTEMBER, 1951
I. Historical Resumé
Any research based on measurement must be concerned with the accuracy or dependability or, as we usually call it, reliability of measurement. A reliability coefficient demonstrates whether the test designer was correct in expecting a certain collection of items to yield interpretable statements about individual differences (25).

Even those investigators who regard reliability as a pale shadow of the more vital matter of validity cannot avoid considering the reliability of their measures. No validity coefficient and no factor analysis can be interpreted without some appropriate estimate of the magnitude of the error of measurement. The preferred way to find out how accurate one's measures are is to make two independent measurements and compare them. In practice, psychologists and educators have often not had the opportunity to recapture their subjects for a second test. Clinical tests, or those used for vocational guidance, are generally worked into a crowded schedule, and there is always a de-

*The assistance of Dora Damrin and Willard Warrington is gratefully acknowledged. Miss Damrin took major responsibility for the empirical studies reported. This research was supported by the Bureau of Research and Service, College of Education.
PSYCHOMETRIKA

LEE J. CRONBACH
\[
r_{tt} = \frac{n}{n-1}\cdot\frac{V_t - \sum p_i q_i}{V_t}\,; \qquad (i = 1, 2, \ldots, n). \tag{1}
\]

\[
\alpha = \frac{n}{n-1}\left(1 - \frac{\sum V_i}{V_t}\right). \tag{2}
\]
Here V_t is the variance of test scores, and ΣV_i is the sum of the item-score variances after weighting. This formula reduces to (1) when all items are scored 1 or zero. The variants reported by Dressel (10) for certain weighted scorings, such as Rights-minus-Wrongs, are also special cases of (2), but for most data computation directly from (2) is simpler than by Dressel's method. Hoyt's derivation (20) arrives at a formula identical to (2), although he draws attention to its application only to the case where items are scored 1 or 0. Following the pattern of any of the other published derivations of (1) (19, 22), making the same assumptions but imposing no limit on the scoring pattern, will permit one to derive (2).
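In modern notation, formula (2) can be checked numerically. The sketch below (plain Python, with a small made-up item-score matrix) computes α by (2); for items scored 1 or 0 the item variance V_i equals p_i q_i, so the same routine reproduces formula (1), Kuder-Richardson formula 20.

```python
def variance(xs):
    # Population variance, as used for V_t and the item variances V_i.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def coefficient_alpha(scores):
    # Formula (2): alpha = [n/(n-1)] * (1 - sum(V_i)/V_t).
    # scores: one row per person, one column per item; any scoring weights.
    n = len(scores[0])
    v_t = variance([sum(row) for row in scores])
    sum_vi = sum(variance([row[i] for row in scores]) for i in range(n))
    return (n / (n - 1)) * (1 - sum_vi / v_t)

# Hypothetical pass/fail records for five examinees on three items.
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]]
```

With these dichotomous scores Σp_i q_i = .64 and V_t = 1.36, so (1) and (2) agree at α = 1.5 × (1 − .64/1.36) ≈ .794.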
Since each writer offering a derivation used his own set of assumptions, and in some cases criticized those used by his predecessors, the precise meaning of the formula became obscured. The original derivation unquestionably made much more stringent assumptions than necessary, which made it seem as if the formula could properly be applied only to rare tests which happened to fit these conditions. It has generally been stated that α gives a lower bound to "the true reliability"--whatever that means to that particular writer. In this paper, we take formula (2) as given, and make no assumptions regarding it. Instead, we proceed in the opposite direction, examining the properties of α and thereby arriving at an interpretation.
We introduce the symbol α partly as a convenience. "Kuder-Richardson Formula 20" is an awkward handle for a tool that we expect to become increasingly prominent in the test literature. A second reason for the symbol is that α is one of a set of six analogous coefficients (to be designated β, γ, δ, etc.) which deal with such other
[Formulas (3) and (4), and the derivation surrounding them, are too garbled in this copy to reproduce.]
[Table 1, comparing split-half coefficients of equivalence — the entering data and the formulas assuming σ_a = σ_b, numbered 2A through 5B and credited to Guttman, Mosier, and Rulon, among them 4A–4B, r_tt = 1 − σ_d²/σ_t² — is too garbled to reproduce.]

Let m be the ratio of the half-test standard deviations. Then the ratio of the Spearman-Brown estimate (Formula 1B) to the estimate from Formula 2A is

\[
k_1 = \frac{m^2 + 2mr_{ab} + 1}{2m(1 + r_{ab})}\,. \tag{5}
\]
[Formula (6), the corresponding ratio k₂ for another pair of the tabled formulas, is illegible in this copy.]
When m equals 1, that is, when the two standard deviations are equal, the formula in Column B is identical to that in Column A. As Table 2 shows, there is increasing disagreement between Formula 1B and those in Column A as m departs from unity. The estimate by the Spearman-Brown formula is always slightly larger than the coefficient of equivalence computed by the more tenable definition of comparability.
TABLE 2
Ratio of Spearman-Brown Estimate to More Exact Split-Half Estimate of
Coefficient of Equivalence when S.D.'s are Unequal

Ratio of Half-Test           Correlation between halves (r_ab)
S.D.'s (greater/lesser)    .00     .20     .40     .60     .80    1.00
       1.0                  1       1       1       1       1      1
       1.1                 1.005   1.004   1.003   1.003   1.003  1.002
       1.2                 1.017   1.014   1.012   1.010   1.009  1.008
       1.3                 1.035   1.029   1.025   1.022   1.020  1.017
       1.4                 1.057   1.048   1.041   1.036   1.032  1.029
       1.5                 1.083   1.069   1.060   1.052   1.046  1.042
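The entries of Table 2 follow directly from formula (5). A short computational sketch (Python; the inputs are the tabled values of m and r_ab) compares the two estimates:

```python
def spearman_brown(r_ab):
    # Formula 1B: step the half-test correlation up to full test length.
    return 2 * r_ab / (1 + r_ab)

def k1(m, r_ab):
    # Formula (5): ratio of the Spearman-Brown estimate to the exact
    # split-half coefficient (2A) when the half-test S.D.'s stand in ratio m.
    return (m * m + 2 * m * r_ab + 1) / (2 * m * (1 + r_ab))
```

k1(1.5, .20) reproduces the 1.069 entry of Table 2, and at m = 1 the ratio is exactly 1 for every r_ab, as the text states.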
FIGURE 1
Schematic Division of the Matrix of Item Variances and Covariances.
[The figure partitions the matrix of item variances V_i and interitem covariances into the within-half covariances C_a (first half) and C_b (second half) and the between-halves covariance C_ab.]
C_ab is the total covariance of all item pairs such that one item is within a and the other is within b; it is the "between halves" covariance. Then
\[
C_{ab} = r_{ab}\,\sigma_a\sigma_b\,; \tag{7}
\]

\[
C_t = C_a + C_b + C_{ab}\,; \qquad V_t = V_a + V_b + 2C_{ab}\,; \tag{8}
\]

\[
V_a = \sum_a V_i + 2C_a\,; \tag{9}
\]

\[
V_b = \sum_b V_i + 2C_b\,; \tag{10}
\]

\[
r_{tt} = 2\left(1 - \frac{V_a + V_b}{V_t}\right) = 2\,\frac{V_t - V_a - V_b}{V_t}\,; \tag{11}
\]

\[
r_{tt} = \frac{4C_{ab}}{V_t}\,. \tag{12}
\]
This indicates that whether a particular split gives a high or low coefficient depends on whether the high interitem covariances are placed in the "between halves" covariance or whether the items having high correlations are placed instead within the same half.
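The dependence of the split-half coefficient on where the large covariances fall can be made concrete. The sketch below (Python, with an invented four-item covariance matrix) applies formula (12) to two different splits of the same test:

```python
def split_half(cov, half_a):
    # Formula (12): r_tt = 4*C_ab / V_t, where C_ab is the between-halves
    # covariance and V_t is the sum of all entries of the covariance matrix.
    n = len(cov)
    half_b = [j for j in range(n) if j not in half_a]
    v_t = sum(cov[i][j] for i in range(n) for j in range(n))
    c_ab = sum(cov[i][j] for i in half_a for j in half_b)
    return 4 * c_ab / v_t

# Hypothetical item variance-covariance matrix for a four-item test.
cov = [[1.0, 0.5, 0.3, 0.2],
       [0.5, 1.0, 0.4, 0.3],
       [0.3, 0.4, 1.0, 0.5],
       [0.2, 0.3, 0.5, 1.0]]
```

Splitting items {1, 2} against {3, 4} leaves the two largest covariances (.5) within halves and yields .571; the split {1, 3} against {2, 4} places them between halves and yields .762.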
Now we rewrite α:

\[
\alpha = \frac{n}{n-1}\cdot\frac{V_t - \sum V_i}{V_t} \tag{13}
\]

\[
\phantom{\alpha} = \frac{n}{n-1}\cdot\frac{2C_t}{V_t}\,. \tag{14}
\]

The mean interitem covariance is

\[
\bar{C}_{ij} = \frac{C_t}{n(n-1)/2}\,. \tag{15}
\]

Therefore

\[
\alpha = \frac{n^2\,\bar{C}_{ij}}{V_t}\,. \tag{16}
\]
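Formulas (13) through (16) are algebraically equivalent routes to α. A quick numerical check (Python, arbitrary three-item covariance matrix) confirms that n²C̄_ij/V_t matches formula (2):

```python
def alpha_from_items(cov):
    # Formula (2)/(13): uses the diagonal (item variances) and V_t.
    n = len(cov)
    v_t = sum(cov[i][j] for i in range(n) for j in range(n))
    sum_vi = sum(cov[i][i] for i in range(n))
    return (n / (n - 1)) * (1 - sum_vi / v_t)

def alpha_from_mean_covariance(cov):
    # Formula (16): n^2 times the mean interitem covariance, over V_t.
    n = len(cov)
    v_t = sum(cov[i][j] for i in range(n) for j in range(n))
    off_diag = [cov[i][j] for i in range(n) for j in range(n) if i != j]
    return n * n * (sum(off_diag) / len(off_diag)) / v_t

cov = [[1.0, 0.5, 0.3],
       [0.5, 1.0, 0.4],
       [0.3, 0.4, 1.0]]
```

Both routes give α = .667 for this matrix; the identity holds for any covariance matrix, since V_t = ΣV_i + 2C_t.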
We proceed now by determining the mean coefficient from all (2n')!/[2(n'!)²] possible splits of the test, where n = 2n'. From (12),

\[
r_{tt} = \frac{4C_{ab}}{V_t}\,. \tag{17}
\]

In any split, a particular C_ij has a probability of n/[2(n−1)] of falling between halves. The total interitem covariance is

\[
C_t = \sum\sum C_{ij}\,; \qquad (i = 1, 2, \ldots, n-1;\ j = i+1, \ldots, n). \tag{18}
\]

But

\[
C_t = \frac{n(n-1)}{2}\,\bar{C}_{ij}\,, \tag{19}
\]

\[
\bar{C}_{ab} = \frac{n}{2(n-1)}\,C_t\,, \tag{20}
\]

and

\[
\bar{C}_{ab} = \frac{n^2}{4}\,\bar{C}_{ij}\,. \tag{21}
\]

From (17),

\[
\bar{r}_{tt} = \frac{4}{V_t}\cdot\frac{n^2}{4}\,\bar{C}_{ij} = \frac{n^2\,\bar{C}_{ij}}{V_t}\,. \tag{22}
\]

Therefore

\[
\bar{r}_{tt} = \alpha\,. \tag{23}
\]

From (14), we can also write α in the form

\[
\alpha = \frac{n}{n-1}\cdot\frac{\sum\sum C_{ij}}{V_t}\,; \qquad (i, j = 1, 2, \ldots, n;\ i \neq j). \tag{24}
\]
This important relation states a clear meaning for α as n/(n−1) times the ratio of interitem covariance to total variance. The multiplier n/(n−1) allows for the proportion of variance in any item which is due to the same elements as the covariance.
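The averaging argument behind (17)-(23) can be verified by brute force: enumerate every split of a small test, average the coefficients from formula (12), and compare with α. A sketch in Python, reusing an arbitrary covariance matrix:

```python
from itertools import combinations

def mean_over_all_splits(cov):
    # Average of formula (12) over all (2n')!/[2(n'!)^2] distinct half-splits.
    n = len(cov)
    v_t = sum(cov[i][j] for i in range(n) for j in range(n))
    coefs = []
    for half in combinations(range(n), n // 2):
        if 0 not in half:          # count each unordered split only once
            continue
        other = [j for j in range(n) if j not in half]
        c_ab = sum(cov[i][j] for i in half for j in other)
        coefs.append(4 * c_ab / v_t)
    return sum(coefs) / len(coefs)

def alpha(cov):
    # Formula (2), from the same covariance matrix.
    n = len(cov)
    v_t = sum(cov[i][j] for i in range(n) for j in range(n))
    return (n / (n - 1)) * (1 - sum(cov[i][i] for i in range(n)) / v_t)

cov = [[1.0, 0.5, 0.3, 0.2],
       [0.5, 1.0, 0.4, 0.3],
       [0.3, 0.4, 1.0, 0.5],
       [0.2, 0.3, 0.5, 1.0]]
```

For this four-item test the three possible splits give .571, .762, and .762; their mean, .698, is exactly α.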
α as a special case of the split-half coefficient. Not only is α a function of all the split-half coefficients for a test; it can also be shown to be a special case of the split-half coefficient.

If we assume that the test is divided into equivalent halves such that the mean between-halves covariance (i.e., C_ab/n'²) equals C̄_ij, the assumptions for formula 2A still hold. We may designate the split-half coefficient for this splitting as r_tt₁. From (12),

\[
r_{tt_1} = \frac{4C_{ab}}{V_t}\,. \tag{12}
\]

Then

\[
r_{tt_1} = \frac{4n'^2\,\bar{C}_{ij}}{V_t} = \frac{n^2\,\bar{C}_{ij}}{V_t}\,. \tag{25}
\]

From (16),

\[
r_{tt_1} = \alpha\,. \tag{26}
\]
are not selected at random for psychological tests where any differentiation among the items' contents or difficulties permits a planned selection. Two planned samplings may be expected to have higher correlations than two random samplings, as Kelley pointed out (25). We shall show that this difference is usually small.
IV. An Examination of Previous Interpretations and Criticisms of α
1. Is α a conservative estimate of reliability? The findings just presented call into question the frequently repeated statement that α is a conservative estimate or an underestimate or a lower bound to "the reliability coefficient." The source of this conception is the original derivation, where Kuder and Richardson set up a definition of two equivalent tests, expressed their correlation algebraically, and proceeded to show by inequalities that α was lower than this correlation. Kuder and Richardson assumed that corresponding items in test and parallel test have the same common content and the same specific content, i.e., that they are as alike as two trials of the same item would be. In other words, they took the zero-interval retest correlation as their standard. Guttman also began his derivation by defining equivalent tests as identical. Coombs (6) offers the somewhat more satisfactory name "coefficient of precision" for this index, which reports the absolute minimum error to be found if the same instrument is applied twice independently to the same subject. A coefficient of stability can be obtained by making the two observations with any desired interval between. A rigorous definition of the coefficient of precision, then, is that it is the limit of the coefficient of stability, as the time between testings becomes infinitesimal.

Obviously, any coefficient of equivalence is less than the coefficient of precision, for one is based on a comparison of different items, the other on two trials of the same items. To put it another way: α or any other coefficient of equivalence treats the specific content of an item as error, but the coefficient of precision treats it as part of the thing being measured. It is very doubtful if testers have any practical need for a coefficient of precision. There is no practical testing problem where the items in the test, and only these items, constitute the trait under examination. We may be unable to compose more items because of our limited skill as testmakers, but any group of items in a test of intelligence or knowledge or emotionality is regarded as a sample of items. If there weren't "plenty more where these came from," performance on the test would not represent performance on any more significant variable.
We therefore turn to the question, does α underestimate appropriate coefficients of equivalence? Following Kelley's argument, the way to make equivalent tests is to make them as similar as possible, similar in distribution of item difficulty and in item content. A pair of tests so designed that corresponding items measure the same factors, even if each one also contains some specific variance, will have a higher correlation than a pair of tests drawn at random from the pool of items. A planned split, where items in opposite halves are as similar as the test permits, may logically be expected to have a higher between-halves covariance than within-halves covariance, and in that case the obtained coefficient would be larger than α. α is the same type of coefficient as the split-half coefficient, and while it may be lower, it may also be higher than the value obtained by actually splitting a particular test at random. Both the random or odd-even split-half coefficient and α will theoretically be lower than the coefficient from parallel forms or parallel splits.
2. Is α less than the coefficient of stability? Some writers expect α to be lower than the coefficient of stability. Thus Guttman says (34, p. 311):

For the case of scale scores, then, ... we have the assurance that if the items are approximately scalable [in which case α will be high], then they necessarily have very substantial test-retest reliability.
firmed by Air Force psychologists who made a similar attempt to categorize items from a mechanical reasoning test and found that they
could not. These items, they note, "are typically complex factorially"
(15, p. 309).
Eight items which some students omitted were dropped. An item
analysis was made for 50 papers. Using this information, ten parallel splits were made such that items in opposite halves had comparable
difficulty. These we call Type I splits. Then eight more splits were
made, placing items in opposite halves on the basis of both difficulty
and apparent content (Type II splits). Fifteen random splits were
made. For all splits, Formula 2A was applied, using the 200 remaining cases. Results appear in Table 3.
TABLE 3

                             All Splits                Splits with Equal Half-Test S.D.'s
Type of Split          No. of                          No. of
                       Coeffi-                         Coeffi-
                       cients   Range       Mean       cients   Range       Mean
Random                   15     .779-.860   .810          8     .795-.860   .817
Parallel Type I          10     .798-.846   .820          6     .798-.846   .822
Parallel Type II          8     .801-.833   .817          4     .809-.826   .818
There are only 126 possible splits for the morale test, and it is
possible to compute all half-test standard deviations directly from the
item variances and covariances. Of the 126 splits, six were designated
in advance as Type II parallel splits, on the basis of content and an
item analysis of a supplementary sample of papers. Results based on
200 cases appear in Table 4.
TABLE 4

                             All Splits                Splits with Equal Half-Test S.D.'s
Type of Split          No. of                          No. of
                       Coeffi-                         Coeffi-
                       cients   Range       Mean       cients   Range       Mean
All Splits              126     .609-.797   .715         82     .609-.797   .717
Parallel (Type II)        6     .681-.780   .737          5     .712-.780   .748
The highest and lowest coefficients for the mechanical test differ
by only .08, a difference which would be important only when a very
precise estimate of reliability is needed. The range for the morale
scale is greater (.20), but the probability of obtaining one of the extreme values in sampling is slight. Our findings agree with Clark, that
the variation from split to split is less than the variation expected
from sample to sample for the same split. The standard error of a
Spearman-Brown coefficient based on 200 cases using the same split is .03 when r_tt = .8, .04 when r_tt = .7. The former value compares with a standard deviation of .02 for all random-split coefficients of the mechanical test. The standard error of .04 compares with a standard deviation of .035 for the 126 coefficients of the morale test.
This bears on Kelley's comment on proposals to obtain a unique estimate: "A determinate answer would result if the mean for all possible splits were gotten, but, even neglecting the labor involved, this would seem to contravene the judgment of comparability." (25, p. 79). As our tables show, the splittings where half-test standard deviations are unequal, which "contravene the judgment of comparability," have coefficients about like those which have equal standard deviations.
Combining our findings with those of Clark and Cronbach we
have studies of seven tests which seem to show that the variation from
split to split is too small to be of practical importance. Brownell finds
appreciable variation, however, for the four tests he studied. The apparent contradiction is explained by the fact that the former results
applied to tests having fairly large coefficients of equivalence (.70 or
over). Brownell worked with tests whose coefficients were much lower, and the larger range of r's does not represent any greater variation in z values at this lower level.
In Tables 3 and 4, the values obtained from deliberately equated half-tests differ slightly, but only slightly, from those for random splits. Where α is .715 for the morale scale, the mean of parallel splits is .748, a difference of no practical importance. One parallel split reaches .780, but this split could not have been defended a priori as more logical than the other planned splits. In Table 3, we find that neither Type I nor Type II splits averaged more than .01 higher than α. Here, then, is evidence that the sort of judgment a tester might make on typical items, knowing their content and difficulty, does not, contrary to the earlier opinion of Kelley and Cronbach, permit him to make more comparable half-tests than would be obtained by random splitting. The data from Cronbach's earlier study agree with this. This conclusion seems to apply to tests of any length (the morale scale has
only ten items). Where items fall into obviously diverse subgroups in either content or difficulty, as, say, in the California Test of Mental Maturity, the tester's judgment could provide a better-than-random split. It is dubious whether he could improve on a random division within subtests.
It should be noted that in this empirical study no attempt was made to divide items on the basis of r_it, as Gulliksen (18, p. 207-210) has recently suggested. Provided this is done on a large sample of cases other than those used to estimate r_it, Gulliksen's plan might indeed give parallel-split coefficients which are consistently at least a few points higher than α.
The failure of the data to support our expectation led to a further study of the problem. We discovered that even tests which seem to be heterogeneous are often highly saturated with the first factor among the items. This forces us not only to extend the interpretation of α, but also to reexamine certain theories of test design.
Factorial composition of the test variance. To make fully clear the relations involved, our analytic procedure will be spelled out in detail. We postulate that the variance of any item can be divided among k + 1 orthogonal factors (k common with other items and one unique). Of these, we shall refer to the first, f₁, as the general factor, even though it is possible that some items would have a zero loading on this factor.* Then if f_zi is the loading of common factor z on item i,

\[
1.00 = f_{1i}^2 + f_{2i}^2 + f_{3i}^2 + \cdots + f_{Ui}^2\,; \tag{27}
\]

\[
C_t = \sum\sum C_{ij}\,; \qquad (i = 1, 2, \ldots, n-1;\ j = i+1, \ldots, n)\,; \tag{28}
\]

\[
V_t = \sum_i \sigma_i^2\left(f_{1i}^2 + \cdots + f_{Ui}^2\right) + 2\sum\sum_{i<j} \sigma_i\sigma_j\left(f_{1i}f_{1j} + \cdots + f_{ki}f_{kj}\right)\,. \tag{29}
\]
The total variance thus comprises

n₁² terms of the form σᵢσⱼ f₁ᵢf₁ⱼ, plus
n₂² terms of the form σᵢσⱼ f₂ᵢf₂ⱼ, plus
n₃² terms of the form σᵢσⱼ f₃ᵢf₃ⱼ, plus (and so on to)
n_k² terms of the form σᵢσⱼ f_kᵢf_kⱼ, plus
n terms of the form σᵢ² f_Uᵢ², (31)

where n_z is the number of items having non-zero loadings on factor z.
[Equations (32) through (38), which evaluate these variance proportions and their limits for the illustrative test structure discussed below, are too garbled in this copy to reproduce.]
Note that among the terms making up the variance of any test, the number of terms representing the general factor is n times the number representing item specific and error factors.

We have seen that the general factor cumulates a very large influence in the test. This is made even clearer by Figure 2, where we plot the trend in variance for a particular case of the above test structure. Here we set k = 6, n₁ = n, n₂ = n₃ = n₄ = n₅ = n₆ = n/5. Then we assume that each item has the composition: 9% general factor, 9% from some one group factor, 82% unique. Further, the unique variance is divided in the ratio 70/12 between error and specific stable variance. It is seen that even with unreliable items such as these, which intercorrelate only .09 or .18, the general factor quickly becomes the predominant portion of the variance. In the limit, as n becomes indefinitely large, the general factor is 5/6 of the variance, and each group factor is 1/30 of the total variance.

FIGURE 2
Change in Proportion of Test Variance due to General, Group, and Unique Factors among the Items as n Increases.
(From the figure, each item has the composition X = .3G + .3F + .35S + .84E.)
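The cumulation of the general factor can be reproduced from the term counts in (31). The sketch below (Python) uses the illustrative structure just described, each standardized item carrying 9% general-factor variance, 9% on one of five group factors, and 82% unique variance, and shows the general factor's share approaching 5/6:

```python
def variance_shares(n, g2=0.09, f2=0.09, groups=5):
    # Shares of total test variance, counting terms as in (31):
    # n^2 general-factor terms, (n/groups)^2 terms for each group factor,
    # and n unique (specific-plus-error) diagonal terms.
    general = g2 * n * n
    group = groups * f2 * (n / groups) ** 2
    unique = (1 - g2 - f2) * n
    total = general + group + unique
    return general / total, group / total, unique / total
```

At n = 20 the general factor already accounts for about 60% of the test variance; as n grows without limit, the general and group shares approach 5/6 and 1/6 (1/30 per group factor).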
TABLE 5
Factor Composition of Tests Having Certain Item Characteristics as a Function of Test Length
[The body of the table — giving factor patterns, the number of items containing each factor, the per cent of variance in any item, and the resulting composition of the test variance for several values of n up to n → ∞ — is too garbled to reproduce.]
TABLE 6
Composition of the Between-Halves Covariance for Tests of Certain Patterns
[The body of the table — giving common-factor patterns, the number of items having non-zero loadings in each factor, and the resulting covariance terms for n = 8, 20, and 60 — is too garbled to reproduce.]
[Equations (40) and (41), expressing the mean interitem correlation and the proportion of the total variance and covariance due to common factors, are too garbled to reproduce.] In terms of the intercorrelations of standardized components, α becomes

\[
\alpha = \frac{n}{n-1}\cdot\frac{2\sum\sum r_{ij}}{\,n + 2\sum\sum r_{ij}\,}\,; \qquad (i = 1, 2, \ldots, n-1;\ j = i+1, \ldots, n)\,. \tag{42}
\]
test authors, these scores correlate .668. Then, by (42), α, the common-factor concentration, is .80. Turning to the Primary Mental Abilities Tests, we have a set of moderate positive correlations reported when these were given to a group of eighth-graders (35). The question may be asked: How much would a composite score on these tests reflect common elements rather than a hodgepodge of elements each specific to one subtest? The intercorrelations suggest that there is one general factor among the tests. Computing α on the assumption of equal subtest variances, we get .77. The total score is loaded to this extent with a general intellective factor. Our third illustration relates to four Air Force scores related to carefulness. Each score is the count of number wrong on a plotting test. The four scores have rather small intercorrelations (15, p. 687), and each score has such low reliability that its use alone as a measure of carefulness is not advisable. The question therefore arises whether the tests are enough intercorrelated that the general factor would cumulate in a preponderant way in their total. The sum of the six intercorrelations is 1.76. Therefore α is .62. I.e., 62% of the variance in the equally weighted composite is due to the common factor among the tests.
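These illustrations can be recomputed directly from formula (42). A sketch in Python, assuming standardized components so that V_t = n + 2Σr_ij:

```python
def common_factor_concentration(n, sum_r):
    # Formula (42): alpha from the sum of the n(n-1)/2 intercorrelations,
    # assuming equal component variances (standardized scores).
    v_t = n + 2 * sum_r
    return (n / (n - 1)) * (2 * sum_r / v_t)
```

common_factor_concentration(2, .668) gives the .80 reported for the two correlated scores, and common_factor_concentration(4, 1.76) gives the .62 for the four carefulness scores.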
From this approach comes a suggestion for obtaining a superior coefficient of equivalence for the "lumpy" test. It was shown that a test containing distinct clusters of items might have a parallel-split coefficient appreciably higher than α. If so, we should divide the test into subtests, each containing what appears to be a homogeneous group of items. α_i is computed for each subtest separately by (2). Then α_i V_i gives the covariance of each cluster with the opposite cluster in a parallel form, and the covariance between subtests is an estimate of the covariance of similar pairs "between forms." Hence

\[
r'_{tt} = \frac{\sum_i \alpha_i V_i + \sum\sum_{i \neq j} C_{ij}}{V_t}\,; \qquad (i, j = 1, 2, \ldots, n)\,. \tag{43}
\]
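The proposal amounts to what is now called a stratified coefficient: within-subtest consistency enters through α_iV_i, and between-subtest covariances enter directly. A sketch in Python (the two-subtest figures are invented):

```python
def lumpy_test_coefficient(sub_alphas, sub_vars, cov_between):
    # Formula (43): r'_tt = [sum(alpha_i * V_i) + between-subtest covariance]
    # divided by V_t.  cov_between counts each ordered pair (i, j), i != j,
    # i.e., twice the sum over distinct unordered pairs of subtests.
    v_t = sum(sub_vars) + cov_between
    within = sum(a * v for a, v in zip(sub_alphas, sub_vars))
    return (within + cov_between) / v_t
```

For two clusters with α = .80 and variance 4 each, and covariance 1.6 between them, r'_tt = 9.6/11.2 ≈ .857, noticeably above the .571 that formula (2) applied to the two clusters as "items" would give.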
FIGURE 3
Certain Coefficients Related to the Composition of the Test Variance.
[The figure locates, along a per cent scale of the total test variance, the principal common factor, other common elements, item-specific variance, and error, and marks the parallel-forms coefficient, the greatest split-half coefficient, and r_tt based on α within clusters.]
[Equations (44) and (45) are too garbled in this copy to reproduce.]
[Equations (46) and (47) are too garbled in this copy to reproduce.]

FIGURE 4
Relation of φ_ij to p_i and p_j for Several Levels of Correlation.
5. Is the usefulness of α limited by properties of the phi coefficient between items having unequal difficulties? The criticism has been made, most vehemently by Loevinger (27), that α is a poor index because, being based on product-moment correlations, it cannot attain unity unless all items have distributions of the same shape. For the pass/fail item, this requires that all p_i be equal. The inference is drawn that since the coefficient cannot reach unity for such items, α and φ do not properly represent homogeneity.

There are two ways of examining this criticism. The simpler is empirical. The alleged limitation upon the product-moment coefficient has no practical effect upon the coefficient, for items of the sort customarily employed in psychological tests. To demonstrate this, we consider the change in φ_ij with changes in item difficulty. To hold constant the relation between the "underlying traits," we fix the tetrachoric correlation. When the tetrachoric coefficient is .30, p_i = .50, and p_j ranges from .10 to .90, φ_ij ranges only from .14 to .19. Figure 4 shows the relation of φ_ij to p_i and p_j for three levels of correlation: r_tet = .30, r_tet = .50, and r_tet = .80. The correlation among items in psychometric tests is ordinarily below .30. For example, even for a five-grade range of talent, the φ_ij for the California Test of Mental Maturity subtests range only from .13 to .25. That is, for tests having the degree of item intercorrelation found in present practice, φ_ij is very nearly constant over a wide range of item difficulties.
TABLE 7
Variation in Certain Indices of Interitem Consistency with Changes in Item
Difficulty (Tetrachoric Correlation Held Constant)

p_i     p_j      r_ij(tet)   φ_ij     H_ij
.50    →.00        .30      →.00    →1.00
.50     .10        .30       .14      .42
.50     .20        .30       .17      .34
.50     .40        .30       .19      .23
.50     .50        .30       .19      .19
.50     .60        .30       .19      .23
.50     .80        .30       .17      .34
.50     .90        .30       .14      .42
.50   →1.00        .30      →.00    →1.00
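The algebraic ceiling that prompts the criticism is easy to exhibit. For pass/fail items, the covariance is largest when everyone passing the harder item also passes the easier one, which caps φ below unity whenever the difficulties differ. A sketch in Python:

```python
import math

def max_phi(p1, p2):
    # Largest attainable phi between two pass/fail items with pass rates
    # p1 and p2: under perfect nesting the joint passing proportion is
    # min(p1, p2), so the maximum covariance is min(p1, p2) - p1*p2.
    cov = min(p1, p2) - p1 * p2
    return cov / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
```

max_phi(.5, .9) is only 1/3, while max_phi(.5, .5) is 1.0; yet, as Table 7 shows, at the interitem correlations found in practice the actual φ_ij hardly moves across the same range of difficulties.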
\[
\alpha = \frac{n\,\bar{\varphi}_{ij}}{\,1 + (n-1)\,\bar{\varphi}_{ij}\,}\,. \tag{48}
\]

[The heading of the following table is garbled; it compares the mean interitem φ̄_ij, and the α to which it leads, for tests whose item difficulties are spread over a range against tests whose difficulties are peaked at the mean value.]

     Distribution of   Range of
     Difficulties      p_i          Mean p   φ̄_ij    Diff.    α      Diff.
A    Normal            .20 to .80    .50     .181             .909
A'   Peaked            .50           .50     .192    .011     .914   .005
B    Normal            .10 to .90    .50     .176             .906
B'   Peaked            .50           .50     .192    .016     .914   .008
C    Normal            .50 to .90    .70     .170             .902
C'   Peaked            .70           .70     .181    .011     .909   .007
D    Rectangular       .10 to .90    .50     .153             .892
D'   Peaked            .50           .50     .192    .039     .914   .022
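α values of the kind tabled above follow from the mean interitem correlation by the general Spearman-Brown relation implied by (16) for standardized items, α = nφ̄_ij/[1 + (n−1)φ̄_ij]. A sketch in Python; the test length n = 45 used in the check is an inference from the tabled values, not stated in the text:

```python
def alpha_from_mean_phi(n, mean_phi):
    # Alpha for n equally weighted items whose mean intercorrelation
    # (here, mean phi) is mean_phi.  Follows from (16) when all item
    # variances are equal.
    return n * mean_phi / (1 + (n - 1) * mean_phi)
```

With n = 45, a mean φ of .181 gives .909 and .192 gives .914; the loss of internal consistency from spreading difficulties over .20 to .80 rather than peaking them is thus only about .005.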
it is believed that the test designer should seek interitem consistency, and judge the effectiveness of his efforts by the coefficient α, the pure scale should not be viewed as an ideal. It should be remembered that Tucker (36) and Brogden (1) have demonstrated that increases in internal consistency may lead to decreases in the product-moment validity coefficient when the shape of the test-score distribution differs from that of the criterion distribution.

1. Formulas for split-half coefficients of equivalence are compared, and those of Rulon and Guttman are advocated for practical use rather than the Spearman-Brown formula.

2. α, the general formula of which Kuder-Richardson formula 20 is a special case, is found to have the following important meanings:

(a) α is the mean of all possible split-half coefficients.
REFERENCES

1. Brogden, H. E. Variation in test validity with variation in the distribution of item difficulties, number of items, and degree of their intercorrelation. Psychometrika, 1946, 11, 197-214.
2. Brown, W. Some experimental results in the correlation of mental abilities. Brit. J. Psychol., 1910, 3, 296-322.
3. Brownell, W. A. On the accuracy with which reliability may be measured by correlating test halves. J. exper. Educ., 1933, 1, 204-215.
4. Burt, C. The influence of differential weighting. Brit. J. Psychol., Stat. Sect., 1950, 3, 105-128.
5. Clark, E. L. Methods of splitting vs. samples as sources of instability in test-reliability coefficients. Harvard educ. Rev., 1949, 19, 178-182.
6. Coombs, C. H. The concepts of reliability and homogeneity. Educ. psychol. Meas., 1950, 10, 43-56.
7. Cronbach, L. J. On estimates of test reliability. J. educ. Psychol., 1943, 34, 485-494.
9. Cronbach, L. J. Test "reliability": its meaning and determination. Psychometrika, 1947, 12, 1-16.
10. Dressel, P. L. Some remarks on the Kuder-Richardson reliability coefficient. Psychometrika, 1940, 5, 305-310.
11. Ferguson, G. The factorial interpretation of test difficulty. Psychometrika, 1941, 6, 323-329.
12. Ferguson, G. The reliability of mental tests. London: Univ. of London Press, 1941.
13. Festinger, L. The treatment of qualitative data by "scale analysis." Psychol. Bull., 1947, 44, 149-161.
14. Goodenough, F. L. A critical note on the use of the term "reliability" in mental measurement. J. educ. Psychol., 1936, 27, 173-178.
15. Guilford, J. P., ed. Printed classification tests. Report No. 5, Army Air Forces Aviation Psychology Program. Washington: U. S. Govt. Print. Off., 1947.
16. Guilford, J. P. Fundamental statistics in psychology and education. Second ed. New York: McGraw-Hill, 1950.
17. Guilford, J. P., and Michael, W. B. Changes in common-factor loadings as tests are altered homogeneously in length. Psychometrika, 1950, 15, 237-249.
18. Gulliksen, H. Theory of mental tests. New York: Wiley, 1950.
19. Guttman, L. A basis for analyzing test-retest reliability. Psychometrika, 1945, 10, 255-282.
20. Hoyt, C. Test reliability estimated by analysis of variance. Psychometrika, 1941, 6, 153-160.
21. Humphreys, L. G. Test homogeneity and its measurement. Amer. Psychologist, 1949, 4, 245.
22. Jackson, R. W., and Ferguson, G. A. Studies on the reliability of tests. Bull. No. 12, Dept. of Educ. Res., University of Toronto, 1941.
23. Kelley, T. L. Note on the reliability of a test: a reply to Dr. Crum's criticism. J. educ. Psychol., 1924, 15, 193-204.
24. Kelley, T. L. Statistical method. New York: Macmillan, 1924.
25. Kelley, T. L. The reliability coefficient. Psychometrika, 1942, 7, 75-83.
26. Kuder, G. F., and Richardson, M. W. The theory of the estimation of test reliability. Psychometrika, 1937, 2, 151-160.
27. Loevinger, J. A systematic approach to the construction and evaluation of tests of ability. Psychol. Monogr., 1947, 61, No. 4.
28. Loevinger, J. The technic of homogeneous tests compared with some aspects of "scale analysis" and factor analysis. Psychol. Bull., 1948, 45, 507-529.
29. Mosier, C. I. A short cut in the estimation of split-halves coefficients. Educ. psychol. Meas., 1941, 1, 407-408.
30. Richardson, M. Combination of measures, pp. 379-401 in Horst, P. (Ed.) The prediction of personal adjustment. New York: Social Science Res. Council, 1941.
31. Rulon, P. J. A simplified procedure for determining the reliability of a test by split-halves. Harvard educ. Rev., 1939, 9, 99-103.
32. Shannon, C. E. The mathematical theory of communication. Urbana: Univ. of Ill. Press, 1949.
33. Spearman, C. Correlation calculated with faulty data. Brit. J. Psychol., 1910, 3, 271-295.
34. Stouffer, S. A., et al. Measurement and prediction. Princeton: Princeton Univ. Press, 1950.
35. Thurstone, L. L., and Thurstone, T. G. Factorial studies of intelligence, p. 37. Chicago: Univ. of Chicago Press, 1941.
36. Tucker, L. R. Maximum validity of a test with equivalent items. Psychometrika, 1946, 11, 1-13.
37. Vernon, P. E. An application of factorial analysis to the study of test items. Brit. J. Psychol., Stat. Sec., 1950, 3, 1-15.
38. Wherry, R. J., and Gaylord, R. H. The concept of test and item reliability in relation to factor pattern. Psychometrika, 1943, 8, 247-264.
39. Woodbury, M. A. On the standard length of a test. Res. Bull. 50-53, Educ. Test. Service, 1950.