found in texts on psychological statistics (e.g., Guilford and Fruchter, 1973). Essentially, such an equation is based on the correlation of each test with the criterion, as well as on the intercorrelations among the tests. Obviously, those tests that correlate higher with the criterion should receive more weight. It is equally important, however, to take into account the correlation of each test with the other tests in the battery. Tests correlating highly with each other represent needless duplication, since they cover to a large extent the same aspects of the criterion. The inclusion of two such tests will not appreciably increase the validity of the entire battery, even though both tests may correlate highly with the criterion. In such a case, one of the tests would serve about as effectively as the pair; only one would therefore be retained in the battery.

Even after the most serious instances of duplication have been eliminated, however, the tests remaining in the battery will correlate with each other to varying degrees. For maximum predictive value, tests that make a more nearly unique contribution to the total battery should receive greater weight than those that partly duplicate the functions of other tests. In the computation of a multiple regression equation, each test is weighted in direct proportion to its correlation with the criterion and in inverse proportion to its correlations with the other tests. Thus, the highest weight will be assigned to the test with the highest validity and the least amount of overlap with the rest of the battery.

The validity of the entire battery can be found by computing the multiple correlation (R) between the criterion and the battery. This correlation indicates the highest predictive value that can be obtained from the given battery, when each test is given optimum weight for predicting the criterion in question. The optimum weights are those determined by the regression equation.

It should be noted that these weights are optimum only for the particular sample in which they were found. Because of chance errors in the correlation coefficients used in deriving them, the regression weights may vary from sample to sample. Hence, the battery should be cross-validated by correlating the predicted criterion scores with the actual criterion scores in a new sample. Formulas are available for estimating the amount of shrinkage in a multiple correlation to be expected when the regression equation is applied to a second sample, but empirical verification is preferable whenever possible. The larger the sample on which regression weights were derived, the smaller the shrinkage will be.

Under certain conditions the predictive validity of a battery can be improved by including in the regression equation a test having a zero correlation with the criterion but a high correlation with another test in the battery. This curious situation arises when the test that is uncorrelated with the criterion acts as a suppressor variable to eliminate or suppress the irrelevant variance in the other test. For example, reading comprehension might correlate highly with scores on a mathematical or a mechanical aptitude test, because the test problems require the understanding of complicated written instructions. If reading comprehension is irrelevant to the job behavior to be predicted, the reading comprehension required by the tests introduces error variance and lowers the predictive validity of the tests. Administering a reading comprehension test and including scores on this test in the regression equation will eliminate this error variance and raise the validity of the battery. The suppressor variable appears in the regression equation with a negative weight. Thus, the higher an individual's score on reading comprehension, the more is deducted from his score on the mathematical or mechanical test.

The use of suppressor variables is illustrated by a study of 63 industrial mechanics (Sorenson, 1966). The most effective battery for predicting job performance in this group included: (1) a questionnaire covering education, previous mechanical experience, and other background data (criterion correlation = .30); (2) a mechanical insight test stressing practical mechanics of the "nuts-and-bolts" type (criterion correlation = .22); and (3) a test of mechanical comprehension oriented toward the academic understanding of mechanical principles (criterion correlation = -.04; correlation with test 2 = .71). The third test functioned as a suppressor variable, as can be seen from the following regression equation:

C = 17T1 + 10T2 - 6T3 + 866

Without the suppressor variable the test battery would "overpredict" the job performance of individuals who obtain high scores on the practical mechanics test through an application of mechanical principles but who lack the practical mechanical know-how required on the job. The irrelevant contribution of academic knowledge of mechanical principles to scores on the practical mechanics test was thus ruled out by the suppressor variable.

Attempts to employ suppressor variables to improve the validity of personality tests have proved disappointing (J. S. Wiggins, 1973). In any situation, moreover, the more direct procedure of revising a test to eliminate the irrelevant variance is preferable to the indirect statistical elimination of such variance through a suppressor variable. When changes in the test are not feasible, the investigation of suppressor variables should be considered.4

4 There has also been some exploration of the inclusion of continuous moderator variables in regression equations through nonadditive and higher-order functions; but the results have not been promising (Kirkpatrick et al., 1968; Saunders, 1956; J. S. Wiggins, 1973).
Validity: Measurement and Interpretation
MULTIPLE CUTOFF SCORES. An alternative strategy for combining test scores utilizes multiple cutoff points. Briefly, this procedure involves the establishment of a minimum cutoff score on each test. Every individual who falls below such a minimum score on any one of the tests is rejected. Only those persons who reach or exceed the cutoff scores in all tests are accepted. An example of this technique is provided by the General Aptitude Test Battery (GATB) developed by the United States Employment Service for use in the occupational counseling program of its State Employment Service offices (U.S. Department of Labor, 1970a). Of the nine aptitude scores yielded by this battery, those to be considered for each occupation were chosen on the basis of criterion correlations as well as means and standard deviations of workers in that occupation.

TABLE 16
Illustrative Data Used to Establish Cutoff Scores on GATB

Aptitude                    Mean    SD     Criterion Correlation
General Learning Ability    75.1    14.2   -.094
Verbal                      80.1    11.3   -.085
Numerical                   73.2    18.4   -.064
Spatial                     78.9    15.9   -.041
Form Perception             80.1    23.5   -.012
Clerical Perception         86.3    16.6    .088
Motor Coordination          89.3    20.7    .316*
Finger Dexterity            92.4    18.1    .155
Manual Dexterity            88.2    18.6    .437**

* Significant at .05 level.
** Significant at .01 level.

The development of GATB occupational standards for machine cutters in the food-canning and preserving industry is illustrated in Table 16. In terms of standard scores with a mean of 100 and an SD of 20, the cutoff scores for this occupation were set at 75 in Motor Coordination (K), Finger Dexterity (F), and Manual Dexterity (M) (U.S. Department of Labor, 1970a, Section IV, p. 51). Table 16 gives mean, standard deviation, and correlation with the criterion (supervisory ratings) for each of the nine scores in a group of 57 women workers. On the basis of criterion correlations, Manual Dexterity and Motor Coordination appeared promising. Finger Dexterity was added because it yielded the highest mean score in the battery, even though individual differences within the group were not significantly correlated with criterion ratings. It would seem that women who enter or remain in this type of job are already selected with regard to Finger Dexterity.5

The validity of the composite KFM pattern of cutting scores in a group of 194 workers is shown in Table 17. It will be seen that, of 150 good workers, 120 fell above the cutting scores in the three aptitudes and 30 were false rejects, falling below one or more cutoffs. Of the 44 poor workers, 30 were correctly identified and 14 were false acceptances. The overall efficacy of this cutoff pattern is indicated by a tetrachoric correlation of .70 between predicted status and criterion ratings.

TABLE 17
Effectiveness of GATB Cutoff Scores on Aptitudes K, F, and M in Identifying Good and Poor Workers
(From U.S. Department of Labor, 1958, p. 14)

Criterion Rating    Nonqualifying Scores    Qualifying Scores    Total
Good                30                      120                  150
Poor                30                      14                   44
Total               60                      134                  194

5 The data have been somewhat simplified for illustrative purposes. Actually, the final choice of aptitudes and cutting scores was based on separate analyses of three groups of workers on related jobs, on the results obtained in a combined sample of 194 cases, and on qualitative job analyses of the operations involved.

If only scores yielding significant validity coefficients are taken into account, one or more essential abilities in which all workers in the occupation excel might be overlooked. Hence the need for considering also those aptitudes in which workers excel as a group, even when individual differences beyond a certain minimum are unrelated to degree of job success. The multiple cutoff method is preferable to the regression equation in situations such as these, in which test scores are not linearly related to the criterion. In some jobs, moreover, workers may be so homogeneous in a key trait that the range of individual differences is too narrow to yield a significant correlation between test scores and criterion.

The strongest argument for the use of multiple cutoffs rather than a regression equation centers around the question of compensatory qualifications. With the regression equation, an individual who rates low in one test may receive an acceptable total score because he rates very high in
some other test in the battery. A marked deficiency in one skill may thus be compensated for by outstanding ability along other lines. It is possible, however, that certain types of activity may require essential skills for which there is no substitute. In such cases, individuals falling below the required minimum in the essential skill will fail, regardless of their other abilities. An opera singer, for example, cannot afford to have poor pitch discrimination, regardless of how well he meets the other requirements of such a career. Similarly, operators of sound-detection devices in submarines need good auditory discrimination. Those men incapable of making the necessary discriminations cannot succeed in such an assignment, regardless of superior mechanical aptitude, general intelligence, or other traits in which they may excel. With a multiple cutoff strategy, individuals lacking any essential skill would always be rejected, while with a regression equation they might be accepted.

When the relation between tests and criterion is linear and additive, on the other hand, a higher proportion of correct decisions will be reached with a regression equation than with multiple cutoffs. Another important advantage of the regression equation is that it provides an estimate of each person's criterion score, thereby permitting the relative evaluation of all individuals. With multiple cutoffs, no further differentiation is possible among those accepted or among those rejected. In many situations, the best strategy may involve a combination of both procedures. Thus, the multiple cutoff may be applied first, in order to reject those falling below minimum standards on any test, and predicted criterion scores may then be computed for the remaining acceptable cases by the use of a regression equation. If enough is known about the particular job requirements, the preliminary screening may be done in terms of only one or two essential skills, prior to the application of the regression equation.

THE NATURE OF CLASSIFICATION. Psychological tests may be used for purposes of selection, placement, or classification. In selection, each individual is either accepted or rejected. Deciding whether or not to admit a student to college, to hire a job applicant, or to accept an army recruit for officer training are examples of selection decisions. When selection is done sequentially, the earlier stages are often called "screening," the term "selection" being reserved for the more intensive final stages. "Screening" may also be used to designate any rapid, rough selection process even when not followed by further selection procedures.

Both placement and classification differ from selection in that no one is rejected, or eliminated from the program. All individuals are assigned to appropriate "treatments" so as to maximize the effectiveness of outcomes. In placement, the assignments are based on a single score. This score may be derived from a single test, such as a mathematics achievement test. If a battery of tests has been administered, a composite score computed from a single regression equation would be employed. Examples of placement decisions include the sectioning of college freshmen into different mathematics classes on the basis of their achievement test scores, assigning applicants to clerical jobs requiring different levels of skill and responsibility, and placing psychiatric patients into "more disturbed" and "less disturbed" wards. It is evident that in each of these decisions only one criterion is employed and that placement is determined by the individual's position along a single predictor scale.

Classification, on the other hand, always involves two or more criteria. In a military situation, for example, classification is a major problem, since each man in an available manpower pool must be assigned to the military specialty where he can serve most effectively. Classification decisions are likewise required in industry, when new employees are assigned to training programs for different kinds of jobs. Other examples include the counseling of students regarding choice of college curriculum (science, liberal arts, etc.), as well as field of concentration. Counseling is based essentially on classification, since the client is told his chances of succeeding in different academic programs or occupations. Clinical diagnosis is likewise a classification problem, the major purpose of each diagnosis being a decision regarding the most appropriate type of therapy.

Although placement can be done with either one or more predictors, classification requires multiple predictors whose validity is individually determined against each criterion. A classification battery requires a different regression equation for each criterion. Some of the tests may have weights in all the equations, although of different values; others may be included in only one or two equations, having zero or negligible weights for some of the criteria. Thus, the combination of tests employed out of the total battery, as well as the specific weights, differs with the particular criterion. An example of such a classification battery is that developed by the Air Force for assignment of personnel to different training programs (DuBois, 1947). This battery, consisting of both paper-and-pencil and apparatus tests, provided stanine scores for pilots, navigators, bombardiers, and a few other air-crew specialties. By finding an individual's estimated criterion scores from the different regression equations, it was possible to predict whether, for example, he was better qualified as a pilot than as a navigator.

MAXIMIZING THE UTILIZATION OF TALENT. Differential prediction of criteria with a battery of tests permits a fuller utilization of available human resources than is possible with a single general test or with a composite
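The classification procedure described in this section, in which each criterion has its own regression equation and a person is assigned where the estimated criterion score is highest, can be sketched as follows. Everything numeric here is hypothetical: the test names, weights, and constants are invented for illustration and are not the Air Force values, which the text does not report.

```python
# Hypothetical regression equations, one per criterion (job). A weight of
# zero means the test does not enter that equation.
EQUATIONS = {
    "pilot": {
        "weights": {"spatial": 0.6, "numerical": 0.1, "coordination": 0.5},
        "constant": 1.0,
    },
    "navigator": {
        "weights": {"spatial": 0.3, "numerical": 0.7, "coordination": 0.0},
        "constant": 1.5,
    },
}


def predicted_criteria(scores):
    """Estimated criterion score for each job from its own equation."""
    return {
        job: eq["constant"]
        + sum(weight * scores[test] for test, weight in eq["weights"].items())
        for job, eq in EQUATIONS.items()
    }


def classify(scores):
    """Assign the person to the job with the highest estimated criterion score."""
    predictions = predicted_criteria(scores)
    return max(predictions, key=predictions.get)


# Two hypothetical examinees with stanine-like scores.
examinee_a = {"spatial": 7, "numerical": 3, "coordination": 8}
examinee_b = {"spatial": 4, "numerical": 9, "coordination": 3}

print(predicted_criteria(examinee_a), classify(examinee_a))
print(predicted_criteria(examinee_b), classify(examinee_b))
```

The same test scores feed every equation; only the weights differ, which is why tests with clearly different weights across criteria are the useful ones for classification.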
the selected employees decreases; but it remains better than chance even when the correlation is .80. With lower selection ratios we can of course obtain better qualified personnel. As can be seen in Table 18, however, for each selection ratio, mean job performance is better when applicants are chosen through classification than through selection strategies.

A practical illustration of the advantages of classification strategies is provided by the use of Aptitude Area scores in the assignment of personnel to military occupational specialties in the U.S. Army (Maier & Fuchs, 1972). Each Aptitude Area corresponds to a group of Army jobs requiring a similar pattern of aptitudes, knowledge, and interests. From a 13-test classification battery, combinations of three to five tests are used to find the individual's score in each Aptitude Area. Figure 20 shows the results of an investigation of 7,500 applicants for enlistment in which the use of Aptitude Area scores was compared with the use of a global screening test, the Armed Forces Qualification Test (AFQT). It will be noted that only 56 percent of this group reached or exceeded the 50th percentile on the AFQT, while 80 percent reached or exceeded the average standard score of 100 on their best Aptitude Area. Thus, when individuals are allocated to specific jobs on the basis of the aptitudes required for each job, a very large majority are able to perform as well as the average of the entire sample or better. This apparent impossibility, in which nearly everyone could be above average, can be attained by capitalizing on the fact that nearly everyone excels in some aptitude.
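A rough sense of why scoring on one's best area raises the proportion "above average" can be had from a deliberately simplified model. The sketch below assumes that aptitude areas are statistically independent and symmetrically distributed about their means; neither assumption comes from the Army data, and the function name is hypothetical. Under those assumptions the chance of reaching the mean on the best of k areas is 1 - 0.5^k.

```python
def share_at_or_above_mean_on_best(k: int) -> float:
    """Probability that the best of k independent, symmetric scores reaches the mean.

    Each area is at or above its mean with probability 0.5, so all k fall
    below the mean with probability 0.5 ** k.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - 0.5 ** k


for k in (1, 2, 3, 5):
    print(k, share_at_or_above_mean_on_best(k))
```

Because real aptitude scores are positively correlated, actual gains are smaller than this independence model suggests, but the direction of the effect is the same.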
TABLE 18
Mean Standard Criterion Score of Persons Placed on Two Jobs by Selection or Classification Strategies
(Adapted from Brogden, 1951, p. 182)

Selection Ratio    Selection:          Classification: Two Predictors Whose Intercorrelation Is
for Each Job       Single Predictor    0       .20     .40     .60     .80
5%                 .88                 1.03    1.02    1.01    1.00    .96
10                 .70                 .87     .86     .84     .82     .79
20                 .48                 .68     .67     .65     .62     .59
30                 .32                 .55     .53     .50     .46     .43
40                 .18                 .42     .41     .37     .34     .29
50                 .00                 .31     .28     .25     .22     .17

[Figure 20: pictograph comparing the percentage scoring at the 50th percentile or higher on AFQT with the percentage obtaining a standard score of 100 or higher in best Aptitude Area.]
FIG. 20. Percentages Scoring Above Average on AFQT and on Best Aptitude Area of Army Classification Battery in a Sample of 7,500 Applicants for Enlistment.
(Data from U.S. Army Research Institute for the Behavioral and Social Sciences. Courtesy J. E. Uhlaner, 1974.)

The same point is illustrated with a quite different population in a study of gifted children by Feldman and Bratton (1972). For demonstration purposes, 49 children in two fifth-grade classes were evaluated on each of 19 measures, all of which had previously been used to select students for special programs for the gifted. Among these measures were scores on a group intelligence test and on an educational achievement battery, tests of separate aptitudes and separate academic areas such as reading and arithmetic, a test of creative thinking, grades in music and art, and teachers' nominations of the most "gifted" and the most "creative" children in each class. When the five highest ranking children selected by each criterion were identified, they included 92 percent of the group. Thus, it was again shown that nearly all members of a group excel when multivariate criteria are employed.

DIFFERENTIAL VALIDITY. In the evaluation of a classification battery, the major consideration is its differential validity against the separate criteria. The object of such a battery is to predict the difference in each person's performance in two or more jobs, training programs, or other criterion situations. Tests chosen for such a battery should yield very different validity coefficients for the separate criteria. In a two-criterion classification problem, for example, the ideal test would have a high correlation with one criterion and a zero correlation (or preferably a negative correlation) with the other criterion. General intelligence tests are relatively poor for classification purposes, since they predict success about equally well in most areas. Hence, their correlations with the criteria to be differentiated would be too similar. An individual scoring high on such a test would be classified as successful for either assignment, and it would be impossible to predict in which he would do better. In a classification battery, we need some tests that are good predictors of criterion A and poor predictors of criterion B, and other tests that are poor predictors of A and good predictors of B.

Statistical procedures have been developed for selecting tests so as to maximize the differential validity of a classification battery (Brogden, 1951; Horst, 1954; Mollenkopf, 1950b; Thorndike, 1949). When the number of criteria is greater than two, however, the problem becomes very complex, and no completely analytical solutions are yet available for these situations. In practice, various empirical approaches are followed to approximate the desired goals.

MULTIPLE DISCRIMINANT FUNCTIONS. An alternative way of handling classification decisions is by means of the multiple discriminant function (French, 1966). Essentially, this is a mathematical procedure for determining how closely the individual's scores on a whole set of tests approximate the scores typical of persons in a given occupation, curriculum, psychiatric syndrome, or other category. A person would then be assigned to the particular group he resembles most closely. Although the regression equation permits the prediction of degree of success in each field, the multiple discriminant function treats all persons in one category as of equal status. Group membership is the only criterion data utilized by this method. The discriminant function is useful when criterion scores are unavailable and only group membership can be ascertained. Some tests, for instance, are validated by administering them to persons in different occupations, although no measure of degree of vocational success is available for individuals within each field.

The discriminant function is also appropriate when there is a nonlinear relation between the criterion and one or more predictors. For example, in certain personality traits there may be an optimum range for a given occupation. Individuals having either more or less of the trait in question would thus be at a disadvantage. It seems reasonable to expect, for instance, that salesmen showing a moderately high amount of social dominance would be most likely to succeed, and that the chances of success would decline as scores move in either direction from this region. With the discriminant function, we would tend to select individuals falling within this optimum range. With the regression equation, on the other hand, the more dominant the score, the more favorable would be the predicted outcome. If the correlation between predictor and criterion were negative, of course, the regression equation would yield more favorable predictions for the low scorers. But there is no direct way whereby an intermediate score would receive maximum credit. Although in many instances the two techniques would lead to the same choices, there are situations in which persons would be differently classified by regression equations and discriminant functions. For most psychological testing purposes, regression equations provide a more effective technique. Under certain circumstances, however, the discriminant function is better suited to yield the required information.

THE PROBLEM. If we want to use tests to predict outcomes in some future situation, such as an applicant's performance in college or on a job, we need tests with high predictive validity against the specific criterion. This requirement is commonly overlooked in the development of so-called culture-fair tests (to be discussed further in a later chapter). In the effort to include in such tests only functions common to different cultures or subcultures, we may choose content that has little relevance to any criterion we wish to predict. A better solution is to choose criterion-relevant content and then investigate the possible effect of moderator variables on test scores. Validity coefficients, regression weights, and cutoff scores may
vary as a function of differences in the examinees' experiential backgrounds. These values should therefore be checked within subgroups for whom there is reason to expect such effects.

It should be noted, however, that the predictive characteristics of test scores are less likely to vary among cultural groups when the test is intrinsically relevant to criterion performance. If a verbal test is employed to predict nonverbal job performance, a fortuitous validity may be found in one cultural group because of traditional associations of past experiences within that culture. In a group with a different experiential background, however, the validity of the test may disappear. On the other hand, a test that directly samples criterion behavior, or one that measures essential prerequisite skills, is likely to retain its validity in different groups.

Since the mid-1960s, there has been a rapid accumulation of research on possible ethnic differences in the predictive meaning of test scores. In this connection, the EEOC Guidelines (Appendix B) explicitly state: "Data must be generated and results separately reported for minority and nonminority groups wherever technically feasible." The functions and implications of separate validation studies are also discussed in the reports of the APA task forces on the testing of minority groups for both educational and employment purposes (American Psychological Association, 1968; Cleary, Humphreys, Kendrick, & Wesman, 1975). The large majority of studies conducted thus far have dealt with black Americans, although a few have included other ethnic minorities. The problems investigated are generally subsumed under the heading of test bias. In this context, the term "bias" is employed in its well-established statistical sense, to designate constant or systematic error as opposed to chance error. This is the same sense in which we speak of a biased sample, in contrast to a random sample. The principal questions that have been raised regarding test bias pertain to validity coefficients (slope bias) and to the relationship between group means on the test and on the criterion (intercept bias). These questions will be examined in the next two sections.

through these tally marks is known as the regression line, and its equation is the regression equation. In this example, the regression equation would have only one predictor. The multiple regression equations discussed earlier in this chapter have several predictors, but the principle is the same.

When both test and criterion scores are expressed as standard scores (SD = 1.00), the slope of the regression line equals the correlation coefficient. For this reason, if a test yields a significantly different validity coefficient in the two groups, this difference is described as slope bias. Figure 21 provides schematic illustrations of regression lines for several
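The link between validity coefficients and slope bias rests on a simple identity: when test and criterion are both expressed as standard scores, the least-squares regression slope equals the correlation coefficient. The sketch below checks this numerically. The data are invented for illustration, and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw scores to standard scores (mean 0, SD 1)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def regression_slope(xs, ys):
    """Least-squares slope for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def pearson_r(xs, ys):
    """Product-moment correlation between xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical test scores and criterion ratings for one group.
test_scores = [40, 45, 50, 55, 60, 65, 70]
criterion = [2, 3, 3, 5, 4, 6, 7]

raw_slope = regression_slope(test_scores, criterion)
z_slope = regression_slope(standardize(test_scores), standardize(criterion))
r = pearson_r(test_scores, criterion)

print(f"raw-score slope:      {raw_slope:.4f}")
print(f"standard-score slope: {z_slope:.4f}")
print(f"correlation r:        {r:.4f}")
```

The raw-score slope depends on the units of measurement, whereas the standard-score slope matches r, so comparing validity coefficients across groups amounts to comparing standardized slopes.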
INTERVAL SCALES. The percentage of persons passing an item expresses item difficulty in terms of an ordinal scale; that is, it correctly indicates the rank order or relative difficulty of items. For example, if Items 1, 2, and 3 are passed by 30, 20, and 10 percent of the cases, respectively, we can conclude that Item 1 is the easiest and Item 3 is the hardest of the three. But we cannot infer that the difference in difficulty between Items 1 and 2 is equal to that between Items 2 and 3. Equal percentage differences would correspond to equal differences in difficulty only in a rectangular distribution, in which the cases were uniformly distributed throughout the range. This problem is similar to that encountered in connection with percentile scores, which are also based on percentages of cases. It will be recalled from Chapter 4 that percentile scores do not represent equal units, but differ in magnitude from the center to the extremes of the distribution (Fig. 4, Ch. 4).

If we assume a normal distribution of the trait measured by any given item, the difficulty level of the item can be expressed in terms of an equal-unit interval scale by reference to a table of normal curve frequencies. In Chapter 4 we saw, for example, that approximately 34 percent of the cases in a normal distribution fall between the mean and a distance of 1σ in either direction (Fig. 3, Ch. 4). With this information, we can examine Figure 22, which shows the difficulty level of an item passed by 84 percent of the cases. Since it is the persons in the upper part of the distribution who pass and those in the lower part who fail, this 84 percent includes the upper half (50%) plus 34 percent of the cases from the lower half (50 + 34 = 84). Hence, the item falls 1σ below the mean, as shown in Figure 22. An item passed by 16 percent of the cases would fall 1σ above the mean, since above this point there are 16 percent of the cases (50 - 34 = 16). An item passed by exactly 50 percent of the cases falls at the mean and would thus have a 0 value on this scale. The more difficult items have plus values, the easier items minus values. The difficulty value corresponding to any percentage passing can be found by reference to a normal curve frequency table, given in any standard statistics text.

Because item difficulties expressed in terms of normal curve σ-units involve negative values and decimals, they are usually converted into a more manageable scale. One such scale, employed by Educational Testing Service in its test development, uses a unit designated by the Greek letter delta (Δ). The relation between Δ and normal curve σ-values (x) is shown below:

Δ = 13 + 4x

The constants 13 and 4 were chosen arbitrarily in order to provide a scale that eliminates negative values and yields a range of integers wide enough to permit the dropping of decimals. An item passed by nearly 100 percent of the cases (99.87%), falling at -3σ, would have a Δ of: 13 + (4)(-3) = 1. This is the lowest value likely to be found in most groups. At the other extreme, an item passed by less than 1 percent (0.13%) of the cases will have a value of +3σ and a Δ of: 13 + (4)(3) = 25. An item falling at the mean will have a 0 σ-value and a Δ of: 13 + (4)(0) = 13. The Δ scale is thus a scale in which practically all items will fall between 1 and 25, and the mean difficulty value within any given group corresponds to 13.

An important practical advantage of the Δ scale over other possible conversions is that a table is available (Fan, 1952) from which Δ can be found by simply entering the value of p (proportion of persons passing the item). The table eliminates the necessity of looking up normal curve σ-values and transforming these values to Δ's. For most practical purposes, an ordinal measure of item difficulty, such as percentage passing, is adequate. For more precise statistical analyses, requiring the measurement of difficulty on an interval scale, Δ values can be obtained with little additional effort.
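The conversion from proportion passing to a delta value can also be computed directly rather than read from a table. The sketch below is an illustration, not the published table: it uses the inverse normal function from Python's standard library in place of a normal curve frequency table, and the function name is hypothetical. The sign is reversed because easier items (higher p) fall below the mean on the difficulty scale.

```python
from statistics import NormalDist


def delta_from_p(p: float) -> float:
    """Convert proportion passing (p) to the delta scale, delta = 13 + 4x.

    x is the item's difficulty in normal curve sigma units: positive for
    hard items (p below .50), negative for easy items (p above .50).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    x = -NormalDist().inv_cdf(p)
    return 13.0 + 4.0 * x


for p in (0.84, 0.50, 0.16):
    print(f"p = {p:.2f}  delta = {delta_from_p(p):.1f}")
```

An item passed by 84 percent of the cases comes out near 9 (about 1σ below the mean), one passed by 50 percent at 13, and one passed by 16 percent near 17, in line with the worked values in the text.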
selves widely distributed in difficulty level is illustrated in Figure 25. The three distributions of total scores given in Figure 25 were obtained by Ebel (1965) with three 16-item tests assembled for this purpose. The items in Test 1 were chosen so as to cluster close to the .50 difficulty level; those in Test 2 are widely distributed over the entire difficulty range; and those in Test 3 fall at the two extremes of difficulty. It will be noted that the widest spread of total test scores was obtained with the items concentrated around .50. The reliability coefficient was also highest in this case, and it was particularly low in the test composed of items with extreme difficulty values (Test 3). This simple demonstration serves only to clarify the point; similar conclusions have been reached in more technical analyses of the problem, with both statistical and experimental procedures (Cronbach & Warrington, 1952; Lord, 1952; Lord & Novick, 1968).

FIG. 25. Relation between Distribution of Test Scores and Distribution of Item Difficulty Values.
(From Ebel, 1965, p. 563.)

1 Because of the nature of many of the tests, the term "exercise" was considered more appropriate than "item." For purposes of the present discussion, they can be regarded as items.
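One simple way to see why items near the .50 level do the most to spread persons apart is to count how many pairs of persons an item separates: every person who passes is distinguished from every person who fails. The sketch below is an illustration under that assumption only; the function name `pairs_discriminated` and the group size of 100 are hypothetical and are not taken from Ebel's data.

```python
def pairs_discriminated(n_total: int, n_pass: int) -> int:
    """Number of person pairs an item separates (one passes, the other fails)."""
    if not 0 <= n_pass <= n_total:
        raise ValueError("n_pass must lie between 0 and n_total")
    return n_pass * (n_total - n_pass)


if __name__ == "__main__":
    n_total = 100  # hypothetical group size
    for n_pass in (100, 90, 70, 50, 30, 10, 0):
        print(f"{n_pass:>3} of {n_total} pass -> "
              f"{pairs_discriminated(n_total, n_pass)} pairs separated")
```

The count peaks when half the group passes and drops to zero when everyone or no one passes. It says nothing about the correlations among items, which also affect the spread of total scores.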
206 Principles of Psychological Testing / Item Analysis 207
the exercises were "easy" (.90); one third were "average" (.50); and one third were "difficult" (.10). The actual percentages of persons passing each exercise varied somewhat around these values. But the goal of the test constructors was to approximate the three values as closely as possible.

A third example of the choice of item difficulty levels in terms of special testing goals is to be found in mastery testing. It will be recalled (Ch. 4) that mastery testing is often associated with criterion-referenced testing. If the purpose of the test is to ascertain whether an individual has adequately mastered the basic essentials of a skill or whether he has acquired the prerequisite knowledge to advance to the next step in a learning program, then the items should probably be at the .80 or .90 difficulty level. Under these conditions, we would expect the majority of those taking the examination to complete nearly all items correctly. Thus, the very easy items (even those passed by 100% of the cases), which are discarded as nondiscriminative in the usual standardized test, are the very items that would be included in a mastery test. Similarly, a pretest, administered prior to a learning unit to determine whether any of the students have already acquired the skills to be taught, will yield very low percentages of passing for each item. In this case, items with very low or even zero p values should not be discarded, since they reveal what remains to be learned.

It is apparent from these examples that the appropriate difficulty level of items depends upon the purpose of the test. Although in most testing situations items clustering around a medium difficulty (.50) yield the maximum information about the individual's performance level, decisions about item difficulty cannot be made routinely, without knowing how the test scores will be used.

ITEM-CRITERION RELATIONSHIPS. All indices of item validity are based on the relationship between item response and criterion performance. Any criterion employed for test validation is also suitable for item validation. Item analysis can be employed to improve not only the convergent but also the discriminant validity of a test (see Ch. 6). Thus, items can be chosen on the basis of high correlation with a criterion and low correlation with any irrelevant variable that may affect test performance. In the development of an arithmetic reasoning test, for example, items that correlate significantly with a reading comprehension test would be discarded.

Since item responses are generally recorded as pass or fail, the measurement of item validity usually involves a dichotomous variable (the item) and a continuous variable (the criterion). In certain situations, the criterion too may be dichotomous, as in graduation versus nongraduation from college or success versus failure on a job. Moreover, a continuous criterion may be dichotomized for purposes of analysis. The basic relationship between item and criterion is illustrated by the three item characteristic curves reproduced in Figure 26. Using fictitious data, each of these curves shows the percentages of persons in each class-interval of criterion score who pass the item. It can be seen that Item 1 has a low validity, since it is passed by nearly the same proportion of persons throughout the criterion range. Items 2 and 3 are better, showing a closer correspondence between percentage passing and criterion score. Of these two, 3 is the more valid item, since its curve rises more steeply.

FIG. 26. Item Characteristic Curves for Three Hypothetical Items. Horizontal axis: Criterion Score.
(Adapted from Lord, 1953, p. 520.)

Although item characteristic curves can provide a vivid graphic representation of differences in item validities, decisions about individual items can be made more readily if validity is expressed by a single numerical index for each item. Over fifty different indices of item validity have been developed and used in test construction. One difference among them pertains to their applicability to dichotomous or continuous measures. Among those applicable to dichotomous variables, moreover, some assume a continuous and normal distribution of the underlying trait, on which the dichotomy has been artificially imposed; others assume a true dichotomy. Another difference concerns the relation of item difficulty to item validity. Certain indices measure item validity independently of item difficulty. Others yield higher validities for items close to the .50 difficulty level than for those at the extremes of difficulty.

Despite differences in procedure and assumptions, most item validity indices provide closely similar results. Although the numerical values of the indices may differ, the items that are retained and those that are rejected on the basis of different validity indices are largely the same. In
fact, the variation in item validity data from sample to sample is generally greater than that among different methods. For this reason, the choice of method is often based on the amount of computational labor required and the availability of special computational aids. Among the published computational aids are a number of abacs or nomographs. These are computing diagrams with which, for example, the value of an item-criterion correlation can be read directly if the percentages of persons passing the item in high and low criterion groups are known (Guilford & Fruchter, 1973, pp. 445-458; Henrysson, 1971).

USE OF EXTREME GROUPS. A common practice in item analysis is to compare the proportion of cases who pass an item in contrasting criterion groups. When the criterion is measured along a continuous scale, as in the case of course grades, job ratings, or output records, upper (U) and lower (L) criterion groups are selected from the extremes of the distribution. Obviously, the more extreme the groups the sharper will be the differentiation. But the use of very extreme groups, such as upper and lower 10 percent, would reduce the reliability of the results because of the small number of cases utilized. In a normal distribution, the optimum point at which these two conditions balance is reached with the upper and lower 27 percent (Kelley, 1939). When the distribution is flatter than the normal curve, the optimum percentage is slightly greater than 27 and approaches 33 (Cureton, 1957b). With small groups, as in an ordinary classroom, the sampling error of item statistics is so large that only rough results can be obtained. Under these conditions, therefore, we need not be too concerned about the exact percentage of cases in the two contrasted groups. Any convenient number between 25 percent and 33 percent will serve satisfactorily.

With the large and normally distributed samples employed in the development of standardized tests, it has been customary to work with the upper and lower 27 percent of the criterion distribution. Many of the tables and abacs prepared to facilitate the computation of item validity indices are based on the assumption that the "27 percent rule" has been followed. As high-speed computers become more generally available, it is likely that the various labor-saving procedures developed to facilitate item analysis will be gradually replaced by more exact and sophisticated methods. With computer facilities, it is better to analyze the results for the entire sample, rather than working with upper and lower extremes. Mathematical procedures have also been developed for measuring item validity from the item characteristic curves, but their application is not feasible without access to a computer (Baker, 1971; Henrysson, 1971; Lord & Novick, 1968).

SIMPLE ANALYSIS WITH SMALL GROUPS. Because item analysis is frequently conducted with small groups, such as the students who have taken a classroom quiz, we shall consider first a simple procedure especially suitable for this situation. Let us suppose that in a class of 60 students we have chosen the 20 students (33%) with the highest and the 20 with the lowest scores. We now have three groups of papers which we may call the Upper (U), Middle (M), and Lower (L) groups. First we need to tally the correct responses to each item given by students in the three groups. This can be done most readily if we list the item numbers in one column and prepare three other columns headed U, M, and L. As we come to each student's paper, we simply place a tally next to each item he answered correctly. This is done for each of the 20 papers in the U group, then for each of the 20 in the M group, and finally for each of the 20 in the L group. We are now ready to count up the tallies and record totals for each group as shown in Table 19. For illustrative purposes, the first seven items have been entered. A rough index of the validity or discriminative value of each item can be found by subtracting the number of persons answering it correctly in the L group from the number answering it correctly in the U group. These U-L differences are given in the last column of Table 19. A measure of item difficulty can be obtained with the same data by adding the number passing each item in all three criterion groups (U + M + L).

TABLE 19
Simple Item Analysis Procedure: Number of Persons Giving Correct Response in Each Criterion Group

Item    U (20)    M (20)    L (20)    Difficulty (U + M + L)    Discrimination (U - L)
1       15        9         7         31                        8
2       20        20        16        56*                       4
3       19        18        9         46                        10
4       10        11        16        37                        -6*
5       11        13        11        35                        0*
6       16        14        9         39                        7
7       5         0         0         5*                        5

* Items chosen for discussion.
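The tallying procedure for small groups can be sketched in Python as follows. This is a minimal illustration using the seven items of Table 19; the names `TABLE_19` and `simple_item_analysis` are hypothetical, and the counts are assumed to have been tallied already for each group of 20 papers.

```python
# Correct responses per item in the Upper, Middle, and Lower groups
# (20 papers each), as entered in Table 19.
TABLE_19 = {
    1: (15, 9, 7),
    2: (20, 20, 16),
    3: (19, 18, 9),
    4: (10, 11, 16),
    5: (11, 13, 11),
    6: (16, 14, 9),
    7: (5, 0, 0),
}


def simple_item_analysis(counts):
    """Return difficulty (U + M + L) and discrimination (U - L) for each item."""
    results = {}
    for item, (upper, middle, lower) in counts.items():
        results[item] = {
            "difficulty": upper + middle + lower,  # number passing in all 60 papers
            "discrimination": upper - lower,       # rough index of item validity
        }
    return results


if __name__ == "__main__":
    for item, row in simple_item_analysis(TABLE_19).items():
        print(f"Item {item}: difficulty = {row['difficulty']:>2}, "
              f"U - L = {row['discrimination']:>3}")
```

Note that the difficulty column here is a count of persons passing, so a higher value means an easier item.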
Examination of Table 19 reveals four questionable items that have been identified for further consideration or for class discussion. Two items, 2 and 7, have been singled out because one seems to be too easy, having been passed by 56 out of 60 students, and the other too difficult, having been passed by only 5. Items 4 and 5, while satisfactory with regard to difficulty level, show a negative and zero discriminative value, respectively. We would also consider in this category any items with a very small positive U-L difference, of roughly three or less when groups of approximately this size are being compared. With larger groups, we would expect larger differences to occur by chance in a nondiscriminating item.

The purpose of item analysis in a teacher-made test is to identify deficiencies either in the test or in the teaching. Discussing questionable items with the class is often sufficient to diagnose the problem. If the wording of the item was at fault, it can be revised or discarded in subsequent testing. Discussion may show, however, that the item was satisfactory, but the point being tested had not been properly understood. In that case, the topic may be reviewed and clarified. In narrowing down the source of the difficulty, it is often helpful to carry out a supplementary analysis, as shown in Table 20, with at least some of the items chosen for discussion. This tabulation gives the number of students in the U and L groups who chose each option in answering the particular items.

TABLE 20
Response Analysis of Individual Items

                       Response Options
Item    Group     1      2      3      4      5
2       Upper     0      0      0      20*    0
        Lower     2      0      1      16*    1
4       Upper     0      10*    9      0      1
        Lower     2      16*    2      0      0
5       Upper     2      3      2      11*    2
        Lower     1      3      3      11*    2
7       Upper     5*     3      5      4      3
        Lower     0*     5      8      3      4

* Correct option.

Although Item 2 has been included in Table 20, there is little more we can learn about it by tabulating the frequency of each wrong option, since only 4 persons in the L group and none in the U group chose wrong answers. Discussion of the item with the students, however, may help to determine whether the item as a whole was too easy and therefore of little intrinsic value, whether some defect in its construction served to give away the right answer, or whether it is a good item dealing with a point that happened to have been effectively taught and well remembered. In the first case, the item would probably be discarded, in the second it would be revised, and in the third it would be retained unchanged.

The data on Item 4 suggest that the third option had some unsuspected implications that led 9 of the better students to prefer it to the correct alternative. The point could easily be settled by asking those students to explain why they chose it. In Item 5, the fault seems to lie in the wording either of the stem or of the correct alternative, because the students who missed the item were uniformly distributed over the four wrong options. Item 7 is an unusually difficult one, which was answered incorrectly by 15 of the U and all of the L group. The slight clustering of responses on incorrect option 3 suggests a superficial attractiveness of this option, especially for the more easily misled L group. Similarly, the lack of choices of the correct response (option 1) by any of the L group suggests that this alternative was so worded that superficially, or to the uninformed, it seemed wrong. Both of these features, of course, are desiderata of good test items. Class discussion might show that Item 7 is a good item dealing with a point that few class members had actually learned.

THE INDEX OF DISCRIMINATION. If the numbers of persons passing each item in U and L criterion groups are expressed as percentages, the difference between these two percentages provides an index of item validity that can be interpreted independently of the size of the particular sample in which it was obtained. This index has been repeatedly described in the psychometric literature (see, e.g., Ebel, 1965; Johnson, 1951; Mosier & McQuitty, 1940) and has been variously designated as ULI, ULD, or simply D. Despite its simplicity, it has been shown to agree quite closely with other, more elaborate measures of item validity (Engelhart, 1965). The computation of D can be illustrated by reference to the data previously reported in Table 19. First, the numbers of persons passing each item in the U and L groups are changed to percentages. Because the
number of cases in each group is 20, we could divide each number by 20 and multiply the result by 100. It is easier, however, to divide 100 by 20, which gives 5, and then multiply each number by that constant. Thus, for Item 1, 15 × 5 = 75 (U group) and 7 × 5 = 35 (L group). For this item, then, D is: 75 - 35 = 40. The remaining values for the seven items are given in Table 21.³

3 The alert reader may have noticed that the same result can be obtained by simply multiplying the differences in the last column of Table 19 by the constant, 5.

TABLE 21
Computation of Index of Discrimination (Data from Table 19)

            Percentage Passing            Difference
Item    Upper Group    Lower Group    (Index of Discrimination)
1       75             35             40
2       100            80             20
3       95             45             50
4       50             80             -30
5       55             55             0
6       80             45             35
7       25             0              25

D can have any value between +100 and -100. If all members of the U group and none of the L group pass an item, D equals 100. Conversely, if all members of the L group and none of the U group pass it, D equals -100. If the percentages of both groups passing an item are equal, D will be zero. D has several interesting properties. It has been demonstrated (Ebel, 1965; Findley, 1956) that D is directly proportional to the difference between the numbers of correct and incorrect discriminations made by an item. Correct discriminations are based on the number of passes in the U group versus the number of failures in the L group; incorrect discriminations are based on the number of failures in the U group versus the number of passes in the L group. Ebel (1967) has also shown that there is a close relation between the mean D index of the items and the reliability coefficient of the test. The higher the mean D, the higher the reliability.

Another noteworthy characteristic of D is one it shares with several other indices of item validity. The values of D are not independent of item difficulty but are biased in favor of intermediate difficulty levels. Table 22 shows the maximum possible value of D for items with different percentages of correct responses. If either 100 percent or 0 percent of the cases in the total sample pass an item, there can be no difference in percentage passing in U and L groups; hence D is zero. At the other extreme, if 50 percent pass an item, it would be possible for all the U cases and none of the L cases to pass it, thus yielding a D of 100 (100 - 0 = 100). If 70 percent pass, the maximum value that D could take can be illustrated as follows: (U) 50/50 = 100%; (L) 20/50 = 40%; D = 100 - 40 = 60. It will be recalled that, for most testing purposes, items closer to the 50 percent difficulty level are preferable. Hence, item validity indices that favor this difficulty level are often appropriate for item selection.

TABLE 22
Relation of Maximum Value of D to Item Difficulty

Percentage Passing Item    Maximum Value of D
100                        0
90                         20
70                         60
50                         100
30                         60
10                         20
0                          0

PHI COEFFICIENT. Many indices of item validity report the relationship between item and criterion in the form of a correlation coefficient. One of these is the phi coefficient (φ). Computed from a fourfold table, φ is based on the proportions of cases passing and failing an item in U and L criterion groups. Like all correlation coefficients, it yields values between +1.00 and -1.00. The φ coefficient assumes a genuine dichotomy in both item response and criterion variable. Consequently, it is strictly applicable only to the dichotomous conditions under which it was obtained and cannot be generalized to any underlying relationship between the traits measured by item and criterion. Like the D index, φ is biased toward the middle difficulty levels; that is, it yields the highest possible correlations for dichotomies closest to a 50-50 split.

Several computational aids are available for finding φ coefficients. When the number of cases in U and L criterion groups is equal, φ can be found with the Jurgensen tables (1947) by simply entering the percentages passing the item in U and L groups. Since in conducting an item analysis it is usually feasible to select U and L groups of equal
size, the Jurgensen tables are widely used for this purpose. When the two criterion groups are unequal, φ can be found with another set of tables, prepared by Edgerton (1960), although their application is slightly more time consuming.

The significance level of a φ coefficient can be readily computed through the relation of φ to both Chi Square and the Normal Curve Ratio. Applying the latter, we can identify the minimum value of φ that would reach statistical significance at the .05 or .01 levels with the following formulas:

\[ \phi_{.05} = \frac{1.96}{\sqrt{N}} \]

\[ \phi_{.01} = \frac{2.58}{\sqrt{N}} \]

In these formulas, N represents the total number of cases in both criterion groups combined. Thus, if there were 50 cases in U and 50 in L groups, N would be 100 and the minimum φ significant at the .05 level would be 1.96 ÷ √100 = .196. Any item whose φ reached or exceeded .196 would thus be valid at the .05 level of significance.

BISERIAL CORRELATION. As a final example of a commonly used measure of item validity, we may consider the biserial correlation coefficient (r_bis), which contrasts with φ in two major respects. First, r_bis assumes a continuous and normal distribution of the trait underlying the dichotomous item response. Second, it yields a measure of item-criterion relationship that is independent of item difficulty. To compute r_bis directly from the data, we would need the mean criterion score of those who pass and those who fail the item, as well as the proportion of cases passing and failing the item in the entire sample and the standard deviation of the criterion scores.

Computing all the needed terms and applying the r_bis formula for each item can be quite time consuming. Tables have been prepared from which r_bis can be estimated by merely entering the percentages passing the item in the upper and lower 27 percent of the criterion group (Fan, 1952; 1954). These are the previously mentioned tables obtainable from Educational Testing Service. With these tables it is possible by entering the percentage passing in U and L groups to find three values: an estimate of p, the percentage who pass the item in the entire sample; the previously described Δ, a measure of item difficulty on an interval scale; and r_bis between item and criterion. These tables are only applicable when exactly 27 percent of the cases are placed in U and L groups.

There is no way of computing exact significance levels for these estimated biserial correlations, but it has been shown that their standard errors are somewhat larger than those of biserial correlations computed from all the data in the usual way. That is, the r_bis estimated from the Fan tables fluctuates more from sample to sample than does the r_bis computed by formula. With this information, one could use the standard error of r_bis to estimate approximately how large the correlation should be for statistical significance.⁴ It should be reiterated that, with computer facilities, biserial correlations can be readily found from the responses of the total sample; and this is the preferred procedure.

4 The formula for the standard error of r_bis can be found in any standard statistics text, such as Guilford and Fruchter (1973, pp. 293-296).

Item analysis is frequently conducted against total score on the test itself. This is a common practice in the case of achievement tests and especially teacher-made classroom tests, for which an external criterion is rarely available. As was noted in Chapter 6, this procedure yields a measure of internal consistency, not external validity. Such a procedure is appropriate as a refinement of content validation and of certain aspects of construct validation.

For tests requiring criterion-related validity, however, the use of total score in item analysis needs careful scrutiny. Under certain conditions, the two approaches may lead to opposite results, the items chosen on the basis of external validity being the very ones rejected on the basis of internal consistency. Let us suppose that the preliminary form of a scholastic aptitude test consists of 100 arithmetic items and 50 vocabulary items. In order to select items from this initial pool by the method of internal consistency, the biserial correlation between performance on each item and total score on the 150 items may be used.⁵ It is apparent that such biserial correlations would tend to be higher for the arithmetic than for the vocabulary items, since the total score is based on twice as many arithmetic items. If it is desired to retain the 75 "best" items in the final form of the test, it is likely that most of these items will prove to be arithmetic problems. In terms of the criterion of scholastic achievement, however, the vocabulary items might have been more valid predictors than the arithmetic items. If such is the case, the item analysis will have served to lower rather than raise the validity of the test.

5 Such part-whole correlations will be somewhat inflated by the common specific and error variance in the item and the test of which it is a part. Formulas have been developed to correct for this effect (Guilford & Fruchter, 1973, pp. 454-456).

The practice of rejecting items that have low correlations with total score provides a means of purifying or homogenizing the test. By such a
procedure, the items with the highest average intercorrelations will be retained. This method of selecting items will increase test validity only when the original pool of items measures a single trait and when this trait is present in the criterion. Some types of tests, however, measure a combination of traits required by a complex criterion. Purifying the test in such a case may reduce its criterion coverage and thus lower validity.

The selection of items to maximize test validity may be likened to the selection of tests that will yield the highest validity for a battery. It will be recalled (Ch. 7) that the test contributing most toward battery validity is one having the highest correlation with the criterion and the lowest correlation with the other tests in the battery. If this principle is applied to the selection of items, it means that the most satisfactory items are those with the highest item validities and the lowest coefficients of internal consistency. On this basis, it is possible to determine the net effectiveness of an item, that is, the net increase in test validity that accrues from the addition of that particular item. Thus, an item that has a high correlation with the external criterion but a relatively low correlation with total score would be preferred to one correlating highly with both criterion and test score, since the first item presumably measures an aspect of the criterion not adequately covered by the rest of the test.

It might seem that items could be selected by the same methods used in choosing tests for inclusion in a battery. Thus, each item could be correlated with the criterion and with every other item. The best items chosen by this method could then be weighted by means of a regression equation. Such a procedure, however, is neither feasible nor theoretically defensible. Not only would the computation labor be prohibitive, but interitem correlations are also subject to excessive sampling fluctuation and the resulting regression weights would be too unstable to provide a basis for item selection, unless extremely large samples are used. For these reasons, several approximation procedures have been developed for selecting items in terms of their net contribution to test validity. Some of these methods involve an empirical build-up process, whereby items are added to an ever-growing pool, and the validity of each successive composite is recomputed. Others begin with the complete set of items and reduce the pool by successive elimination of the poorest items until the desired test validity is attained. Because even these techniques require considerable computational labor, their use is practicable only when computer facilities are available (Fossum, 1973; Henrysson, 1971).

It should be noted that all techniques for selecting items in terms of their net effectiveness represent the opposite approach from that followed when items are chosen on the basis of internal consistency. In the former procedure, a high item-test correlation increases the probability that the item will be rejected; in the latter, a high item-test correlation increases the probability of its acceptance. The objectives of the two procedures are, of course, unlike. One aims to increase the breadth of criterion coverage and reduce duplication; the other attempts to raise the homogeneity of the test. Both are desirable objectives of test construction. The appropriate procedure depends largely on the nature and purpose of the test. One extreme can be illustrated by a biographical inventory, in which items can only be evaluated and selected in terms of an external criterion and the coverage of content is highly heterogeneous. The opposite extreme can be illustrated by a spelling test, whose content is highly homogeneous and in which internal consistency would be a desirable goal for item selection.

For many testing purposes, a satisfactory compromise is to sort the relatively homogeneous items into separate tests, or subtests, each of which will cover a different aspect of the criterion. Thus, breadth of coverage is achieved through a variety of tests, each yielding a relatively unambiguous score, rather than through heterogeneity of items within a single test. By such a procedure, items with low indices of internal consistency would not be discarded, but would be segregated. Within each subtest or item group, fairly high internal consistency could thus be attained. At the same time, internal consistency would not be accepted as a substitute for criterion-related validity, and some attention would be given to adequacy of coverage and to the avoidance of excessive concentration of items in certain areas.

Whether or not speed is relevant to the function being measured, item indices computed from a speeded test may be misleading. Except for items that all or nearly all subjects have had time to attempt, the item indices found from a speed test will reflect the position of the item in the test rather than its intrinsic difficulty or validity. Items that appear late in the test will be passed by a relatively small percentage of the total sample, because only a few persons have time to reach these items. Regardless of how easy the item may be, if it occurs late in a speeded test, it will appear difficult. Even if the item merely asked for the subject's name, the percentage of persons who pass it might be very low when the item is placed toward the end of a speeded test.

Similarly, item validities tend to be overestimated for those items that have not been reached by all subjects. Because the more proficient individuals tend to work faster, they are more likely to reach one of the later items in a speed test (Mollenkopf, 1950a). Thus, regardless of the nature of the item itself, some correlation between the item and the criterion would be obtained if the item occurred late in a speed test.

To avoid some of these difficulties, we could limit the analysis of each
Principles of Psychological Testing
item to those persons who have reached the item. This is not a completely satisfactory solution, however, unless the number of persons failing to reach the item is small. Such a procedure would involve the use of a rapidly shrinking number of cases, and would thus render the results on the later items quite unreliable. Moreover, the persons on whom the later items are analyzed would probably constitute a selected sample, and hence would not be comparable to the larger samples used for the earlier items. As has already been pointed out, the faster subjects tend also to be the more proficient. The later items would thus be analyzed on a superior sample of individuals. One effect of such a selective factor would be to lower the apparent difficulty level of the later items, since the percentage passing would be greater in the selected superior group than in the entire sample. It will be noted that this is the opposite error from that introduced when the percentage passing is computed in terms of the entire sample. In that case, the apparent difficulty of items is spuriously raised.

The effect of the above procedure on indices of item validity is less obvious, but nonetheless real. It has been observed, for example, that some low-scoring examinees tend to hurry through the test, marking items almost at random in their effort to try all items within the time allowed. This tendency is much less common among high-scoring examinees. As a result, the sample on which a late-appearing item is analyzed is likely to consist of some very poor respondents, who will perform no better than chance on the item, and a larger number of very proficient and fast respondents, who are likely to answer the item correctly. In such a group, the item-criterion correlation will probably be higher than it would be in a more representative sample. In the absence of such random respondents, on the other hand, the sample on which the later items are analyzed will cover a relatively narrow range of ability. Under these conditions, the validities of the later items will tend to be lower than they would be if computed on the entire unselected sample.

The anticipated effects of speed on indices of item difficulty and item validity have been empirically verified, both when item statistics are computed with the entire sample (Wesman, 1949) and when they are computed with only those persons who attempt the item (Mollenkopf, 1950a). In the latter study, comparable groups of high school students were given two forms of a verbal test and two forms of a mathematics test. Each of the two forms contained the same items as the other, but items occurring early in one form were placed late in the other. Each form was administered with a short time limit (speed conditions) and with a very liberal time limit (power conditions). Various intercomparisons were thus possible between forms and timing conditions. The results clearly showed that the position of an item in the speed tests affected its indices of difficulty and validity. When the same item occurred later in a speeded test, it was passed by a greater percentage of those attempting it, and it yielded a higher item-criterion correlation.

The difficulties encountered in the item analysis of speeded tests are fundamentally similar to those discussed in Chapter 5 in connection with the reliability of speeded tests. Various solutions, both empirical and statistical, have been developed for meeting these difficulties. One empirical solution is to administer the test with a longer time limit to the group on which item analysis is to be carried out. This solution is satisfactory provided that speed itself is not an important aspect of the ability to be measured by the test. Apart from the technical problems presented by specific tests, however, it is well to keep in mind that item-analysis data obtained with speeded tests are suspect and call for careful scrutiny.

MEANING OF CROSS VALIDATION. It is essential that test validity be computed on a different sample of persons from that on which the items were selected. This independent determination of the validity of the entire test is known as cross validation (Mosier, 1951). Any validity coefficient computed on the same sample that was used for item-selection purposes will capitalize on chance errors within that particular sample and will consequently be spuriously high. In fact, a high validity coefficient could result under such circumstances even when the test has no validity at all in predicting the particular criterion.

Let us suppose that out of a sample of 100 medical students, the 30 with the highest and the 30 with the lowest medical school grades have been chosen to represent contrasted criterion groups. If, now, these two groups are compared in a number of traits actually irrelevant to success in medical school, certain chance differences will undoubtedly be found. Thus, there might be an excess of private-school graduates and of red-haired persons within the upper criterion group. If we were to assign each individual a score by crediting him with one point for private-school graduation and one point for red hair, the mean of such scores would undoubtedly be higher in the upper than in the lower criterion group. This is not evidence for the validity of the predictors, however, since such a validation process is based on a circular argument. The two predictors were chosen in the first place on the basis of the chance variations that characterized this particular sample. And the same chance differences are operating to produce the mean differences in total score.
When tested in another sample, however, the chance differences in frequency of private-school graduation and red hair are likely to disappear or be reversed. Consequently, the validity of the scores will collapse.

AN EMPIRICAL EXAMPLE. A specific illustration of the need for cross validation is provided by an early investigation conducted with the Rorschach inkblot test (Kurtz, 1948). In an attempt to determine whether the Rorschach could be of any help in selecting sales managers for life insurance agencies, this test was administered to 80 such managers. These managers had been carefully chosen from several hundred employed by eight life insurance companies, so as to represent an upper criterion group of 42 considered very satisfactory by their respective companies, and a lower criterion group of 38 considered unsatisfactory. The 80 records were studied by a Rorschach expert, who selected a set of 32 signs, or response characteristics, occurring more frequently in one criterion group than in the other. Signs found more often in the upper criterion group were scored +1 if present and 0 if absent; those more common in the lower group were scored -1 or 0. Since there were 16 signs of each type, total scores could range theoretically from -16 to +16.

When the scoring key based on these 32 signs was reapplied to the original group of 80 persons, 79 of the 80 were correctly classified as being in the upper or lower group. The correlation between test score and criterion would thus have been close to 1.00. However, when the test was cross-validated on a second comparable sample of 41 managers, 21 in the upper and 20 in the lower group, the validity coefficient dropped to a negligible .02. It was thus apparent that the key developed in the first sample had no validity for selecting such personnel.

AN EXAMPLE WITH CHANCE DATA. That the use of a single sample for item selection and test validation can produce a completely spurious validity coefficient under pure chance conditions was vividly demonstrated by Cureton (1950). The criterion to be predicted was the grade-point average of each of 29 students registered in a psychology course. This criterion was dichotomized into grades of B or better and grades below B. The "items" consisted of 85 tags, numbered from 1 to 85 on one side. To obtain a test score for each subject, the 85 tags were shaken in a container and dropped on the table. All tags that fell with numbered side up were recorded as indicating the presence of that particular item in the student's test performance. Twenty-nine throws of the 85 tags thus provided complete records for each student, showing the presence or absence of each item or response sign. Because of the procedure followed in generating these chance scores, Cureton facetiously named the test the "B-Projective Psychokinesis Test."

An item analysis was then conducted, with each student's grade-point average as the criterion. On this basis, 24 "items" were selected out of the 85. Of these, 9 occurred more frequently among the students with an average grade of B or better and received a weight of +1; 15 occurred more frequently among the students with an average grade below B and received a weight of -1. The sum of these item weights constituted the total score for each student. Despite the known chance derivation of these "test scores," their correlation with the grade criterion in the original group of 29 students proved to be .82. Such a finding is similar to that obtained with the Rorschach scores in the previously cited study. In both instances, the apparent correspondence between test score and criterion resulted from the utilization of the same chance differences both in selecting items and in determining validity of total test scores.

CONDITIONS AFFECTING VALIDITY SHRINKAGE. The amount of shrinkage of a validity coefficient in cross validation depends in part on the size of the original item pool and the proportion of items retained. When the number of original items is large and the proportion retained is small, there is more opportunity to capitalize on chance differences and thus obtain a spuriously high validity coefficient. Another condition affecting amount of shrinkage in cross validation is size of sample. Since spuriously high validity in the initial sample results from an accumulation of sampling errors, smaller groups (which yield larger sampling errors) will exhibit greater validity shrinkage.

If items are chosen on the basis of previously formulated hypotheses, derived from psychological theory or from past experience with the criterion, validity shrinkage in cross validation will be minimized. For example, if a particular hypothesis required that the answer "Yes" be more frequent among successful students, then the item would not be retained if a significantly larger number of "Yes" answers were given by the unsuccessful students. The opposite, blindly empirical approach would be illustrated by assembling a miscellaneous set of questions with little regard to their relevance to the criterion behavior, and then retaining all items yielding significant positive or negative correlations with the criterion. Under the latter circumstances, we would expect much more shrinkage than under the former. In summary, shrinkage of test validity in cross validation will be greatest when samples are small, the initial item pool is large, the proportion of items retained is small, and items are assembled without previously formulated rationale.
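The chance-data demonstration, and the shrinkage it implies, can be imitated in a short simulation. The following Python sketch is illustrative only and is not Cureton's actual procedure: the group sizes (14 and 15), the selection rule (`min_diff`, a hypothetical cutoff on the difference in proportions between criterion groups), and all function names are assumptions introduced here. It builds a +1/-1 scoring key from purely random "items" in one sample and then applies the same key to a second, independent chance sample.

```python
import random
from statistics import mean


def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx and syy else 0.0


def chance_sample(rng, n_people, n_items):
    """Dichotomous criterion (1 = upper group) plus random 0/1 item records."""
    n_upper = n_people // 2  # assumed split; the source does not give group sizes
    criterion = [1] * n_upper + [0] * (n_people - n_upper)
    items = [[rng.randint(0, 1) for _ in range(n_items)] for _ in range(n_people)]
    return criterion, items


def build_key(criterion, items, min_diff):
    """Keep items whose frequency differs between groups by at least min_diff."""
    upper = [row for c, row in zip(criterion, items) if c == 1]
    lower = [row for c, row in zip(criterion, items) if c == 0]
    key = {}
    for j in range(len(items[0])):
        diff = mean(r[j] for r in upper) - mean(r[j] for r in lower)
        if abs(diff) >= min_diff:
            key[j] = 1 if diff > 0 else -1  # +1 favors upper group, -1 lower
    return key


def score(items, key):
    """Total score = sum of the weights of the keyed items that are present."""
    return [sum(w * row[j] for j, w in key.items()) for row in items]


rng = random.Random(1950)  # fixed seed so the run is repeatable

# Sample A: select items and "validate" on the same 29 chance records.
crit_a, items_a = chance_sample(rng, n_people=29, n_items=85)
key = build_key(crit_a, items_a, min_diff=0.25)
r_original = pearson(score(items_a, key), crit_a)

# Sample B: cross-validate the same key on fresh chance records.
crit_b, items_b = chance_sample(rng, n_people=29, n_items=85)
r_cross = pearson(score(items_b, key), crit_b)

print(f"items retained: {len(key)} of 85")
print(f"validity in original sample: {r_original:.2f}")
print(f"validity in cross validation: {r_cross:.2f}")
```

Because the key is built from the very differences on which it is later scored, `r_original` is necessarily positive, whereas `r_cross` should hover around zero. Reducing `n_people`, enlarging `n_items`, or raising `min_diff` (so that a smaller proportion of items is retained) can be expected to widen the gap, in line with the conditions summarized above.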
EXPLORATORY STUDIES. Insofar as diverse cultures or subcultures foster the development of different skills and knowledge, these differences will be reflected in test scores. An individual's general level of performance will be higher in those aptitudes stimulated and encouraged by his particular experiential background. A further question pertains to the relative difficulty of items for groups with dissimilar cultural backgrounds. If difficulty is measured in the usual way, in terms of percentage of respondents passing each item, will the rank order of the items be the same across groups, regardless of overall level of performance? Early investigations of this question with urban and rural children revealed a number of significant differences in relative item difficulties on the Stanford-Binet (Jones, Conrad, & Blanchard, 1930) and on a general information test (Shimberg, 1929).

A comprehensive test of group differences in relative item difficulties is provided by measures of item × group interaction, computed through an analysis of variance. Another procedure is to correlate either the percentage passing or the delta values (Δ) of the same items in two groups. If there is no significant item × group interaction, i.e., if the relative difficulties of items are the same for both groups, this correlation should be close to 1.00. These more sophisticated statistical techniques have been employed in studies with the College Board's Preliminary Scholastic Aptitude Test, administered to high school students. Relative item difficulties were investigated with reference to ethnic, socioeconomic, and urban-rural categories (Angoff & Ford, 1973; Cleary & Hilton, 1968). The results show some significant though small item × group interactions. Correlations between the delta values for two ethnic groups were slightly lower than the correlations for two random samples of the same ethnic group. Two of the bivariate distributions of such correlations are illustrated in Figures 27 and 28. When two random samples of white high school students were compared (Fig. 27), the item deltas were closely similar, yielding a correlation of .987. When black and white samples were compared (Fig. 28), the items not only proved more difficult as a whole for the black students, but they also showed more discrepancies in relative difficulty, as indicated by a correlation of .929. Efforts to identify reasons for these differences led to two suggestive findings. First, examination of item content failed to reveal any relation between the affected items and known differences in the experiential backgrounds of the groups. Second, equating the groups on a related cognitive variable reduced both the group difference in mean scores and the item × group interaction. The latter finding suggests that the relative difficulty of items depends at least in part on the absolute performance level in the ability measured by the test. It is possible, for example, that persons at different aptitude levels utilize different work methods, problem-solving techniques, or cognitive skills in responding to the same items. Those items that prove relatively difficult when solved by method A could prove relatively easy when solved by method B, and vice versa.

FIG. 27. Bivariate Distribution of Item Difficulties of Preliminary Scholastic Aptitude Test for Two Random Samples of White High School Students. (From Angoff & Ford, 1973, p. 99. Reproduced by permission of National Council on Measurement in Education.)

FIG. 28. Bivariate Distribution of Item Difficulties of Preliminary Scholastic Aptitude Test for Random Samples of Black and White High School Students. (From Angoff & Ford, 1973, p. 100. Reproduced by permission of National Council on Measurement in Education.)

It should be added that all the techniques used in studying item × group interactions in ability tests are also applicable to personality tests. In the latter tests, what is measured is not the difficulty of items but the relative frequency in choice of specific response options, as on an attitude scale or personality inventory.

ITEM SELECTION TO MINIMIZE OR MAXIMIZE GROUP DIFFERENCES. In the construction of certain tests, item × group interactions have been used as one basis for the selection of items. In the development of the Stanford-Binet, for example, an effort was made to exclude any item that favored either sex significantly, on the assumption that such items might reflect purely fortuitous and irrelevant differences in the experiences of the two sexes (McNemar, 1942, Ch. 5). Owing to the limited number of items available for each age level, however, it was not possible to eliminate all sex-differentiating items. In order to rule out sex differences in total score, therefore, the remaining sex-differentiating items were balanced, approximately the same number favoring boys and girls.

No generalization can be made regarding the elimination of sex differences, or any other group differences, in the selection of test items. While certain tests, like the Stanford-Binet, have sought to equalize the performance of the two sexes, others have retained such differences and report separate norms for the two sexes. This practice is relatively common in the case of special aptitude tests, in which fairly large differences in favor of one or the other sex have been consistently found.

Under certain circumstances, moreover, items may be chosen, not to minimize, but to maximize, sex differentiation. An example of the latter procedure is to be found in the masculinity-femininity scales developed for use with several personality inventories (to be discussed in Ch. 17). Since the purpose of these scales is to measure the degree to which an individual's responses agree with those characteristic of men or of women in our culture, only those items that differentiate significantly between the sexes are retained.

A similar diversity of procedure can be found with reference to other group differences in item performance. In the development of a socioeconomic status scale for the Minnesota Multiphasic Personality Inventory, only those items were retained that differentiated significantly between the responses of high school students in two contrasted socioeconomic groups (Gough, 1948). Cross validation of this status scale on a new sample of high school students yielded a correlation of .50 with objective indices of socioeconomic status. The object of this test is to determine the degree to which an individual's emotional and social responses resemble those characteristic of persons in upper or lower socioeconomic levels, respectively. Hence, those items showing the maximum differentiation between social classes were included in the scale, and those showing little or no differentiation were discarded. This procedure is similar to that followed in the development of masculinity-femininity scales. It is apparent that in both types of tests the group differentiation constitutes the criterion in terms of which the test is validated. In such cases, socioeconomic level and sex, respectively, represent the most relevant variables on the basis of which items can be chosen.

Examples of the opposite approach to socioeconomic or cultural differentials in test responses can also be found. An extensive project on such cultural differentials in intelligence test items was conducted at the University of Chicago (Eells et al., 1951). These investigators believed that most intelligence tests might be unfair to children from lower socioeconomic levels, since many of the test items presuppose information, skills, or interests typical of middle-class children. To obtain evidence for such a hypothesis, a detailed item analysis was conducted on eight widely used group intelligence tests. For each item, the frequencies of correct responses by children in higher and lower socioeconomic levels were compared. Following this investigation, two members of the research team prepared a special test designed to be "fair" to lower-class urban American children. In the construction of this test, an effort was made to exclude the types of items previously found to favor middle-class children.⁶

As in the case of sex differences, no rigid policy can be laid down regarding items that exhibit cultural differentiation. Certain basic facts of test construction and interpretation should, however, be noted. First, whether items that differentiate significantly between certain groups are retained or discarded should depend on the purpose for which the test is designed. If the criteria to be predicted show significant differences between the sexes, socioeconomic groups, or other categories of persons, then it is to be expected that the test items will also exhibit such group

⁶ Known as the Davis-Eells Games, this test was subsequently discontinued because it proved unsatisfactory in a number of ways, including low validity in predicting academic achievement and other practical criteria. Moreover, the anticipated advantage of lower-class children on this test did not hold up in other samples.
the differentiation between such groups. For these tests, items showing the largest group differences in response should be chosen, as in the case of the masculinity-femininity and social status scales cited above.

The third point is of primary concern, not to the test constructor, but to the test user and the general student of psychology who wishes to interpret test results properly. Tests whose items have been selected with reference to the responses of any special groups cannot be used to compare such groups. For example, the statement that boys and girls do not differ significantly in Stanford-Binet IQ provides no information whatever regarding sex differences. Since sex differences were deliberately eliminated in the process of selecting items for the test, their absence from the final scores merely indicates that this aspect of test construction was successfully executed. Similarly, lack of socioeconomic differences on a test constructed so as to eliminate such differences would provide no information on the relative performance of groups varying in socioeconomic status.

Tests designed to maximize group differentiation, such as the masculinity-femininity and social status scales, are equally unsuitable for group comparisons. In these cases, the sex or socioeconomic differentiation in personality characteristics would be artificially magnified. To obtain an unbiased estimate of the existing group differences, the test items must be selected without reference to the responses of such groups. The principal conclusion to be drawn from the present discussion is that proper interpretation of scores on any test requires a knowledge of the basis on which items were selected for that test.
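The comparison of relative item difficulties across groups described earlier in this chapter can be sketched numerically. The Python fragment below is a minimal illustration under stated assumptions: the proportions passing are invented for the example, the function names are hypothetical, and delta is computed by the conventional ETS formula (13 + 4z, with z the normal deviate corresponding to the proportion failing), which is assumed here rather than taken from the passage above.

```python
from statistics import NormalDist, mean


def delta(p_correct):
    """Item delta, assuming the conventional form 13 + 4z.

    z is the normal deviate cutting off the proportion failing the item,
    so harder items (lower p_correct) receive higher delta values.
    """
    z = NormalDist().inv_cdf(1.0 - p_correct)
    return 13.0 + 4.0 * z


def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical proportions passing the same five items in two groups.
p_group_a = [0.90, 0.75, 0.60, 0.45, 0.30]
p_group_b = [0.80, 0.62, 0.50, 0.30, 0.22]

deltas_a = [delta(p) for p in p_group_a]
deltas_b = [delta(p) for p in p_group_b]

# Correlation of item deltas across groups: values near 1.00 suggest
# little item-by-group interaction in relative difficulty.
r = pearson(deltas_a, deltas_b)

print("mean delta, group A:", round(mean(deltas_a), 2))
print("mean delta, group B:", round(mean(deltas_b), 2))
print("correlation of deltas:", round(r, 3))
```

A correlation near 1.00 would indicate that the items keep nearly the same order of difficulty in both groups, even when one group finds every item harder; a markedly lower value would point to an item × group interaction.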
11ldividlwl Tests 231
i30 Tests of Gcnuallllfellccfllal Lct:cl
~validatedagainst relatively broad criteria. They characteristically provide Detailed instructions for administering and scoring each test: were pro-
'~ sinO"lescore, such as an IQ, indicating the individual's general intcl- vided, and the IQ was employed for the first time in any psychological
test.
:iectu~l level. A typical approach is to arrive at this global estimate of
"intellectual performance by "the sinking of shafts at critical points" The second Stanford reVlSlon, appearing in 1937, consisted of two
:(Terman & Merrill, 1937, p. 4). In other words, a wide variety of tasks equivalent forms, Land M (Terman & Merrill, 1937). In this revision,
,is presented to the subject in the expectation that an adequate sampling the scale was greatly expanded and completely restandardized on a new
1>ofall important intellectual functions will thus be covered. In actual and carefully chosen sample of the U.S. population. The 3,184 subjects
';practice, the tests are usually overloaded with certain functions, such as employed for this purpose included approximately 100 children at each
tverbal ability, and completely omit others.
half-year interval from 1 Yz to 511z years, 200 at each age from 6 to 14,
" Because so many intelligence tests are validated against measures of and 100 at each age from 15 to 18. All subjects were within one month of
'~cademic achievement, they are often designated as tests of scholastic a birthday (or half-year birthday) at the time of testing, and every age
~aptitude. Intelligence tests are frequently employe? as preliminar!' group co~tained an equal number of boys and girls. From age 6 up,
.'screening instruments, to be followed by tests of speCIal aptItudes. ThIS most subjects were tested in school, although a few of the older subjects
'practice is especially prevalent in the testing of normal adolescents. or were obtained outside of school in order to round out the sampling.
~adults for educational and vocational counseling, personnel selectIOn, Preschool children were contacted in a variety of wavs, manv of them
;:i:andsimilar purposes. Another common use of general inte.lligence tests. is being siblings of the schoolchildren included in the sa~ple. D~spite seri-
,ltobe found in clinical testing, especially in the identification and classIfi- ous efforts to obtain a representative cross-section of the population,
;'cation of the mentally retarded. For clinical purposes, individual tests the sampling was somewhat higher ,than the U.S. population in socio-
-'are generally employ~d. Among the most widely used in~ividual ~ntell~- economic level, contained an excess of urban cases, and included only
native-born whites. . '
. gence tests are the Stanford-Binet and \\'echsler scales dIscussed m thIS
A third revision, published in 1960, provided a single form (L-M)
~chapter.
incorporating the best items from the two 1937 forms (Terman & !vler-
rill, 1960). \Vithout introducing any new content, it was thus possible
to eliminate obsolescent items and to relocate items whose difficultv level
had al~ered during the intervening years owing to cultural changes. In
EVOLUTIONOF THE SCALES.The ori£!inal Binet-Simon scales kLVe al- prepanng the 1960 Stanford-Binet, the authors were faced with a com-
>ready been described briefly in Chapter 1. It will be recalled that the mon dilemma of psychological testing. On the one hand, frequent re-
'1905 scale consisted simply of 30 short tests, arranged in ascending order visions are desirable in order to profit from technical advances and re-
of difficulty. The 1908 scale was the first age scale; and the 1911 scale finements in test construction and from prior experienccln the use of
introduced minor improvements and additions. The age range covered the test, as well as to keep test content up to date. The last-named con-
by the 1911 re\ision extended from 3 vears to the adult level. Among the sideration is especially important for information items and for pictorial
n;anv translations and adaptations of the early Binet te~;-s were a number material which may be affected by changing fashions in dress, household
, of American revisions, of '\vhich the most viable has been the Stanford- appliances, cars, and ether common articles. The use of obsolete test
, Binet.' The first Stanford revision of the Binet-Simon scales, prepared by content may seriously undem1ine rapport and may alter the difficulty
Terman and his associates at Stanford University, was published in 1916 level of items. On the other hand, revision mav render much of the ac-
(Terman, 1916). This revision introduced so many changes and additions cumulated data inapplicable to the new for~. Tests that have been
as to represent virtually a new test. Over one third of the items \\'ere ne\\", Widel~' used. for many years have acquired a rich body of interpretive,
and a number of old items were revis'ed, reallocated to different age matenal whICh should be carefully weighed against the need for re-
levels, or discarded. The entire scale was restandardized on an American VISIon. It was for these reasons that the authors of the Stanford-Binet
sample of approximately one thousand children and four hundred adults. chose to condense the two earlier forms into one, tl;ercby steering a
course between the twin hazards of obsolescence and discontinuity. The
1 A detailed account of the Binet-Simon scales and of the development, use, and clinical interpretation of the Stanford-Binet can be found in Sattler (1974, Chs.
2 The items in the Binet scales are commonly called "tests," since each is separately administered and may contain several parts.
loss of a parallel form was not too great a price to pay for accomplishing this purpose. By 1960 there was less need for an alternate form than there had been in 1937 when no other well-constructed individual intelligence scale was available.

In the preparation of the 1960 Stanford-Binet, items were selected from Forms L and M on the basis of the performance of 4,498 subjects, aged 2½ to 18 years, who had taken either or both forms of the test between 1950 and 1954. The subjects were examined in six states situated in the Northeast, in the Midwest, and on the West Coast. Although these cases did not constitute a representative sampling of American schoolchildren, care was taken to avoid the operation of major selective factors.3 The 1960 Stanford-Binet did not involve a restandardization of the normative scale. The new samples were utilized only to identify changes in item difficulty over the intervening period. Accordingly, the difficulty of each item was redetermined by finding the percentage of children passing it at successive mental ages on the 1937 forms. Thus for purposes of item analysis the children were grouped, not according to their chronological age, but according to the mental age they had obtained on the 1937 forms. Consequently, mental ages and IQ's on the 1960 Form L-M were still expressed in terms of the 1937 normative sample.

The next stage was the 1972 restandardization of Form L-M (Terman & Merrill, 1973, Part 4). This time the test content remained unchanged,4 but the norms were derived from a new sample of approximately 2,100 cases tested during the 1971-1972 academic year. To achieve national representativeness despite the practical impossibility of administering individual tests to very large samples, the test publishers took advantage of a sample of approximately 20,000 children at each age level, employed in the standardization of a group test (Cognitive Abilities Test). This sample of some 200,000 schoolchildren in grades 3 through 12 was chosen from communities stratified in terms of size, geographical region, and economic status, and included black, Mexican-American, and Puerto Rican children.

The children to be tested with the Stanford-Binet were identified through their scores on the verbal battery of the Cognitive Abilities Test, so that the distribution of scores in this subsample corresponded to the national distribution of the entire sample. The only cases excluded were those for whom the primary language spoken in the home was not English. To cover ages 2 to 8, the investigators located siblings of the group-tested children, choosing each child on the basis of the Cognitive Abilities Test score obtained by his or her older sibling. Additional cases at the upper ages were recruited in the same way. The Stanford-Binet sample included approximately 100 cases in each half-year age group from 2 to 5½ years, and 100 at each year group from 6 to 18.

3 For special statistical analyses, there were two additional samples of California children, including 100 6-year-olds stratified with regard to father's occupation and 100 15-year-olds stratified with regard to both father's occupation and grade distribution.
4 With only two very minor exceptions: the picture on the "doll card" at age II was updated; and the word "charcoal" was permitted as a substitute for "coal" in the Similarities test at age VII.

In comparison with the 1937 norms, the 1972 norms are based on a more representative sample, as well as being updated and hence reflecting any effects of intervening cultural changes on test performance. It is interesting to note that the later norms show some improvement in test performance at all ages. The improvement is substantial at the preschool ages, averaging about 10 IQ points. The test authors attribute this improvement to the impact of radio and television on young children and the increasing literacy and educational level of parents, among other cultural changes. There is a smaller but clearly discernible improvement at ages 15 and over which, as the authors suggest, may be associated with the larger proportion of students who continued their education through high school in the 1970s than in the 1930s.

ADMINISTRATION AND SCORING. The materials needed to administer the Stanford-Binet are shown in Figure 29. They include a box of standard toy objects for use at the younger age levels, two booklets of printed cards, a record booklet for recording responses, and a test manual. The tests are grouped into age levels extending from age II to superior adult. Between the ages of II and V, the test proceeds by half-year intervals. Thus, there is a level corresponding to age II, one to age II-6, one to age III, and so forth. Because progress is so rapid during these early ages, it proved feasible and desirable to measure change over six-month intervals. Between V and XIV the age levels correspond to yearly intervals. The remaining levels are designated as Average Adult and Superior Adult levels I, II, and III. Each age level contains six tests, with the exception of the Average Adult level, which contains eight.

The tests within any one age level are of approximately uniform difficulty and are arranged without regard to such residual differences in difficulty as may be present. An alternate test is also provided at each age level. Being of approximately equivalent difficulty, the alternate may be substituted for any of the tests in the level. Alternates are used if one of the regular tests must be omitted because special circumstances make it inappropriate for the individual or because some irregularity interfered with its standardized administration.

Four tests in each year level were selected on the basis of validity and representativeness to constitute an abbreviated scale for use when time
at which they are credited. The vocabulary test, for example, may be scored anywhere from level VI to Superior Adult III, depending on the number of words correctly defined.

The items passed and failed by any one individual will show a certain amount of scatter among adjacent year levels. We do not find that examinees pass all tests at or below their mental age level and fail all tests above such a level. Instead, the successfully passed tests are spread over several year levels, bounded by the subject's basal age at one extreme and his ceiling age at the other. The subject's mental age on the Stanford-Binet is found by crediting him with his basal age and adding to that age further months of credit for every test passed beyond the basal level. In the half-year levels between II and V, each of the six tests counts as one month; between VI and XIV each of the six tests corresponds to two months of credit. Since each of the adult levels (AA, SA I, SA II, and SA III) covers more than one year of mental age, the months of credit for each test are adjusted accordingly. For example, the Average Adult level includes eight tests, each of which is credited with two months; the Superior Adult I level contains six tests, each receiving four months.

The highest mental age theoretically attainable on the Stanford-Binet is 22 years and 10 months. Such a score is not, of course, a true mental age, but a numerical score indicating degree of superiority above the Average Adult performance. It certainly does not correspond to the achievement of the average 22-year-old; according to the 1972 norms, the average 22-year-old obtains a mental age of 16-8. For any adult over 18 years of age, a mental age of 16-8 yields an IQ of 100 on this scale. In fact, above 13 years, mental ages cease to have the same significance as they do at lower levels, since it is just beyond 13 that the mean MA begins to lag behind CA on this scale. The Stanford-Binet is not suitable for adult testing, especially within the normal and superior range. Despite the three Superior Adult levels, there is insufficient ceiling for most superior adults or even for very superior adolescents (Kennedy et al., 1960). In such cases, it is often impossible to reach a ceiling age level at which all tests are failed. Moreover, most of the Stanford-Binet tests have more appeal for children than for adults, the content being of relatively little interest to most adults.

NORMATIVE INTERPRETATION. A major innovation introduced in the 1960 Stanford-Binet was the substitution of deviation IQ's for the ratio IQ's used in the earlier forms. These deviation IQ's are standard scores with a mean of 100 and an SD of 16. As explained in Chapter 4, the principal advantage of this type of IQ is that it provides comparable scores at all age levels, thus eliminating the vagaries of ratio IQ's. Despite the care with which the 1937 scales were developed in the effort to obtain constant IQ variability at all ages, the SD's of ratio IQ's on these scales fluctuated from a low of 13 at age VI to a high of 21 at age II-6. Thus, an IQ of 113 at age VI corresponded to an IQ of 121 at age II-6. Special correction tables were developed to adjust for the major IQ variations in the 1937 scales (McNemar, 1942, pp. 172-174). All these difficulties were circumvented in the 1960 form through the use of deviation IQ's, which automatically have the same SD throughout the age range.

As an aid to the examiner, Pinneau prepared tables in which deviation IQ's can be looked up by entering MA and CA in years and months. These Pinneau tables are reproduced in the Stanford-Binet manual (Terman & Merrill, 1973). The latest manual includes both the 1972 and the 1937 normative IQ tables. For most testing purposes, the 1972 norms are appropriate, showing how the child's performance compares with that of others of his own age in his generation. To provide comparability with IQ's obtained earlier, however, the 1937 norms are more suitable. They would thus be preferred in a continuing longitudinal study, or in comparing an individual's IQ with the IQ he obtained on the Stanford-Binet at a younger age. When used in this way, the 1937 standardization sample represents a fixed reference group, just as the students taking the College Board Scholastic Aptitude Test in 1941 provide a fixed reference group for that test (see Ch. 4).

Although the deviation IQ is the most convenient index for evaluating an individual's standing in his age group, the MA itself can serve a useful function. To say that a 6-year-old child performs as well as a typical 8-year-old usually conveys more meaning to a layman than saying he has an IQ of 137. A knowledge of the child's MA level also facilitates an understanding of what can be expected of him in terms of educational achievement and other developmental norms of behavior. It should be noted, however, that the MA's obtained on the Stanford-Binet are still expressed in terms of the 1937 norms. It is only the IQ tables that incorporate the updated 1972 norms. Reference to these tables will show, for example, that if a child whose CA is 5-0 obtains an MA of 5-0, his IQ is not 100. To receive an IQ of 100 with the 1972 norms, this child would need an MA of 5-6.

One of the advantages of the Stanford-Binet derives from the mass of interpretive data and clinical experience that have been accumulated regarding this test. For many clinicians, educators, and others concerned with the evaluation of general ability level, the Stanford-Binet IQ has become almost synonymous with intelligence. Much has been learned about what sort of behavior can be expected from a child with an IQ of 50 or 80 or 120 on this test. The distributions of IQ's in the successive standardization samples (1916, 1937, 1972) have provided a common frame of reference for the interpretation of IQ's.
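The rule for finding mental age (basal age plus months of credit for each test passed beyond the basal level) can be sketched in a few lines of Python. This is only an illustrative sketch, not the scoring procedure of the manual. The names `MONTHS_PER_TEST`, `mental_age_months`, and `as_years_months` are hypothetical, and the credits of five and six months assumed for Superior Adult II and III are not stated alongside the rule itself; they are chosen because they reproduce the stated maximum of 22 years and 10 months.

```python
# Months of credit per test at each level, following the rule in the text.
# ASSUMPTION: SA II = 5 and SA III = 6 months per test (not given with the rule).
MONTHS_PER_TEST = {
    **{level: 1 for level in ("II", "II-6", "III", "III-6", "IV", "IV-6", "V")},
    **{level: 2 for level in ("VI", "VII", "VIII", "IX", "X", "XI", "XII", "XIII", "XIV")},
    "AA": 2,      # Average Adult: eight tests, two months each
    "SA I": 4,    # Superior Adult I: six tests, four months each
    "SA II": 5,   # assumed
    "SA III": 6,  # assumed
}


def mental_age_months(basal_age_months, passed_beyond_basal):
    """Return mental age in months.

    basal_age_months: basal age expressed in months (e.g. 6-0 -> 72).
    passed_beyond_basal: mapping of level name -> number of tests passed
        at that level, for levels above the basal level only.
    """
    credit = sum(MONTHS_PER_TEST[level] * n for level, n in passed_beyond_basal.items())
    return basal_age_months + credit


def as_years_months(months):
    """Format a number of months in the 'years-months' notation used in the text."""
    return f"{months // 12}-{months % 12}"


# A child with a basal age of 6-0 who passes four tests at VII and two at VIII.
example_ma = as_years_months(mental_age_months(72, {"VII": 4, "VIII": 2}))

# Basal age of 14-0 with every adult-level test passed: the theoretical ceiling.
ceiling_ma = as_years_months(
    mental_age_months(14 * 12, {"AA": 8, "SA I": 6, "SA II": 6, "SA III": 6})
)
```

Under these assumptions the ceiling case gives 14-0 plus 106 months of credit, which agrees with the maximum of 22 years and 10 months cited above.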
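The lack of comparability of ratio IQ's across ages, noted in the discussion of normative interpretation, amounts to a change of SD. A minimal sketch, assuming only the SD's quoted in the text (13 at age VI, 21 at age II-6, and 16 for the deviation IQ), expresses an IQ in SD units at one age and converts it back to IQ points at another. The function names are hypothetical.

```python
MEAN_IQ = 100.0


def to_sd_units(iq, sd):
    """Express an IQ as a distance from the mean in SD units."""
    return (iq - MEAN_IQ) / sd


def from_sd_units(z, sd):
    """Convert a distance from the mean in SD units back to IQ points."""
    return MEAN_IQ + z * sd


# A ratio IQ of 113 at age VI (SD = 13) lies 1 SD above the mean.
z = to_sd_units(113, 13)

# The same relative standing at age II-6 (SD = 21) is a ratio IQ of 121.
equivalent_at_II_6 = from_sd_units(z, 21)

# On the deviation IQ scale (SD = 16 at every age) it is 116 throughout.
deviation_iq = from_sd_units(z, 16)
```

The design point is simply that a deviation IQ fixes the SD, so that the same relative standing yields the same score at every age.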
Because of the size of the error of measurement of a Stanford-Binet IQ, it is customary to allow approximately a 10-point band on either side of the obtained IQ for chance variation. Thus any IQ between 90 and 110 is considered equivalent to the average IQ of 100; IQ's above 110 represent superior deviations, those below 90 inferior deviations. There is no generally accepted frame of reference for classifying superior IQ's. It may be noteworthy, however, that in the classical, long-term investigation of gifted children by Terman and his co-workers a minimum IQ of 140 was required for inclusion in the principal part of the project (Terman & Oden, 1959).

At the other end of the scale, a widely used educational classification of mental retardates recognizes the educable, trainable, and custodial categories. The educable group, in the IQ range from 50 to 75, can advance to at least the third grade in academic work, and possibly as far as the sixth grade, if taught in a specially adapted classroom situation. The trainable group, with IQ's between 25 and 50, can be taught self-care and social adjustment in a protected environment. Those below IQ 25 generally require custodial and nursing care.

In its manual on terminology and classification, the American Association on Mental Deficiency (AAMD) lists four levels of mental retardation, defined more precisely in terms of SD units. This classification is given in Table 23, together with the Stanford-Binet IQ ranges corresponding to each level and the expected percentage of cases. It will be noted that the classification is based on a division of the lower portion of the normal distribution curve into steps of 1 SD each, beginning at -2 SD. The advantage of such a classification is that it can be readily translated into standard scores or deviation IQ's in any scale. Since the Stanford-Binet deviation IQ scale has an SD of 16, the mild level, extending from -2 SD down to -3 SD, ranges from 68 (100 - 2 × 16) to 52 (100 - 3 × 16). The other IQ ranges can be found in a similar manner. The percentages of cases at each level are those expected in a normal distribution (see Fig. 6, Ch. 4). They agree quite closely with the percentages of persons at these IQ levels found empirically in the general population. The frequency of mental retardation in the general population is usually estimated as close to 2 percent. The Stanford-Binet manual contains still another classification of levels of mental retardation, based on somewhat different IQ limits, which has been widely used as an interpretive frame of reference by clinical psychologists (Terman & Merrill, 1973, p. 18).

TABLE 23
Levels of Mental Retardation as Defined in Manual of American Association on Mental Deficiency
(Data in first two columns from Grossman, 1973, p. 18. Reprinted by permission of the American Association on Mental Deficiency.)

Cutoff Points (in SD units from Mean) | Range of Stanford-Binet IQ (SD = 16) | Percentage of Cases

The use of such classifications of IQ level, although of unquestionable help in standardizing the interpretation of test performance, carries certain dangers. Like all classifications of persons, it should not be rigidly applied, nor used to the exclusion of other data about the individual. There are, of course, no sharp dividing lines between the "mentally retarded" and the "normal" or between the "normal" and the "superior." Individuals with IQ's of 60 have been known to make satisfactory adjustments to the demands of daily living, while some with IQ's close to 100 may require institutional care.

Decisions regarding institutionalization, parole, discharge, or special training of mental retardates must take into account not only IQ but also social maturity, emotional adjustment, physical condition, and other circumstances of the individual case. The AAMD defines mental retardation as "significantly subaverage general intellectual functioning existing concurrently with deficits in adaptive behavior, and manifested during the developmental period" (Grossman, 1973, p. 11). This definition is further explicated in the stipulation that a child should not be classified as mentally retarded unless he is deficient in both intellectual functioning, as indicated by IQ level, and in adaptive behavior, as measured by such instruments as the Vineland Social Maturity Scale or the AAMD Adaptive Behavior Scales (to be discussed in Chapter 10).

Nor is high IQ synonymous with genius. Persons with IQ's of 160 do occasionally lead undistinguished lives, while some with IQ's much closer to 100 may make outstanding contributions. High-level achievement in specific fields may require special talents, originality, persistence, singleness of purpose, and other propitious emotional and motivational conditions.
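Two of the figures quoted in this discussion lend themselves to a quick arithmetic check: the IQ limits of a level defined in SD units, and the band allowed for chance variation around an obtained IQ. The sketch below is illustrative only. The function names are hypothetical; the percentage is computed from the normal curve rather than taken from Table 23; and the reliability of .90 used for the standard error of measurement is an assumed, illustrative value, combined through the standard formula SD × √(1 − r).

```python
import math

MEAN_IQ = 100
SD_IQ = 16  # SD of the Stanford-Binet deviation IQ


def iq_at(z):
    """IQ corresponding to a cutoff expressed in SD units from the mean."""
    return MEAN_IQ + z * SD_IQ


def normal_cdf(z):
    """Proportion of a normal distribution falling below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def percent_between(z_lower, z_upper):
    """Expected percentage of cases between two cutoffs in a normal distribution."""
    return 100.0 * (normal_cdf(z_upper) - normal_cdf(z_lower))


def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


# Mild level: from -2 SD down to -3 SD.
mild_upper, mild_lower = iq_at(-2), iq_at(-3)   # 68 and 52
mild_percent = percent_between(-3, -2)          # about 2.14 percent

# Everyone below -2 SD, for comparison with the estimate of close to 2 percent.
below_minus_2_percent = 100.0 * normal_cdf(-2)  # about 2.28 percent

# ASSUMPTION: reliability of .90, used here only for illustration.
sem = standard_error_of_measurement(SD_IQ, 0.90)  # about 5.06 IQ points
band_95 = 1.96 * sem                              # about 9.9 IQ points
```

With these values the 95 percent band comes to roughly 10 points on either side of the obtained IQ, in line with the customary allowance mentioned above.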
reference to age and IQ level of subjects (McNemar, 1942, Ch. 6). In general, the Stanford-Binet tends to be more reliable for the older than for the younger ages, and for the lower than for the higher IQ's. Thus, at ages 2½ to 5½, the reliability coefficients range from .83 (for IQ 140-149) to .91 (for IQ 60-69); for ages 6 to 13, they range from .91 to .97, respectively, for the same IQ levels; and for ages 14 to 18, the corresponding range of reliability coefficients extends from .95 to .98.

The increasing reliability of scores with increasing age is characteristic of tests in general. It results in part from the better control of conditions that is possible with older subjects (especially in comparison with the preschool ages). Another factor is the slowing down of developmental rate with age. When reliability is measured by retesting, individuals who are undergoing less change are also likely to exhibit less random fluctuation over short periods of time (Pinneau, 1961, Ch. 5).

The higher reliability obtained with lower IQ levels at any given CA, on the other hand, appears to be associated with the specific structural characteristics of the Stanford-Binet. It will be recalled that because of the difference in number of items available at different age levels, each item receives a weight of one month at the lowest levels, a weight of two months at the intermediate levels, and weights of four, five, or six months at the highest levels. This weighting tends to magnify the error of measurement at the upper levels, because the chance passing or failure of a single item makes a larger difference in total score at these levels than it does at lower levels. Since at any given CA, individuals with higher IQ's are tested with higher age levels on the scale, their IQ's will have a larger error of measurement and lower reliability (Pinneau, 1961, Ch. 5). The relationship between IQ level and reliability of the Stanford-Binet is also illustrated graphically in Figure 30, showing the bivariate distribution of IQ's obtained by 7-year-old children on Forms L and M. It will be observed that the individual entries fall close to the diagonal at lower IQ levels and spread farther apart at the higher levels. This indicates closer agreement between L and M IQ's at lower levels and wider discrepancies between them at upper levels. With such a fan-shaped scatter diagram, a single correlation coefficient is misleading. For this reason, separate reliability coefficients have been reported for different portions of the IQ range.

FIG. 30. Parallel-Form Reliability of the Stanford-Binet: Bivariate Distribution of IQ's Obtained by Seven-Year-Old Children on Forms L and M. Horizontal axis: IQ on Form M.
(From Terman & Merrill, 1973, p. 45. Reproduced by permission of Houghton Mifflin Company.)

On the whole, the data indicate that the Stanford-Binet is a highly reliable test, most of the reported reliability coefficients for the various age and IQ levels being over .90. Such high reliability coefficients were obtained despite the fact that they were computed separately within each age group. It will be recalled in this connection that all subjects in the standardization sample were tested within a month of a birthday or half-year birthday. This narrowly restricted age range would tend to produce lower reliability coefficients than found for most tests, which employ more heterogeneous samples. Translated in terms of individual IQ's, a reliability coefficient of .90 and an SD of 16 give an error of measurement of approximately 5 IQ points (see Ch. 5). In other words, the chances are about 2:1 that a child's true Stanford-Binet IQ differs by 5 points or less from the IQ obtained in a single testing, and the chances are 95 out of 100 that it varies by no more than 10 points (5 × 1.96 = 9.8). Reflecting the same differences found in the reliability coefficients, these errors of measurement will be somewhat higher for younger than for older children, and somewhat higher for brighter than for duller individuals.

VALIDITY. Some information bearing on the content validity of the Stanford-Binet is provided by an examination of the tasks to be performed by the examinee in the various tests. These tasks run the gamut from simple manipulation to abstract reasoning. At the earliest age levels, the tests require chiefly eye-hand coordination, perceptual discrimination, and ability to follow directions, as in block building, stringing beads, comparing lengths, and matching geometric forms. A relatively large number of tests at the lower levels also involve the identification of common objects presented in toy models or in pictures.

Several tests occurring over a wide age range call for practical judgment or common sense. For example, the child is asked, "What should you do if you found on the streets of a city a three-year-old baby that was lost from its parents?" In other tests the examinee is asked to explain why certain practices are commonly followed or certain objects are employed in daily living. A number of tests calling for the interpretation of pictorially or verbally presented situations, or the detection of absurdities in either pictures or brief stories, also seem to fall into this category.

Memory tests are found throughout the scale and utilize a wide variety of materials. The individual is required to recall or recognize objects, pictures, geometric designs, bead patterns, digits, sentences, and the content of passages. Several tests of spatial orientation occur at widely scattered levels. These include maze-tracing, paper-folding, paper-cutting, rearrangement of geometric figures, and directional orientation. Skills acquired in school, such as reading and arithmetic, are required for successful performance at the upper year levels.

The most common type of test, especially at the upper age levels, is that employing verbal content. In this category are to be found such well-known tests as vocabulary, analogies, sentence completion, disarranged sentences, defining abstract terms, and interpreting proverbs. Some stress verbal fluency, as in naming unrelated words as rapidly as possible, giving rhymes, or building sentences containing three given words. It should also be noted that many of the tests that are not predominantly verbal in content nevertheless require the understanding of fairly complex verbal instructions. That the scale as a whole is heavily weighted with verbal ability is indicated by the correlations obtained between the 45-word vocabulary test and mental ages on the entire scale. These correlations were found to be .71, .83, .86, and .83 for groups of examinees aged 8, 11, 14, and 18 years, respectively (McNemar, 1942, pp. 139-140; see also A. J. Edwards, 1963).5 The correlations are at least as high as those normally found between tests designed to measure the same functions, and they fall within the range of common reliability coefficients.

Insofar as all the functions listed are relevant to what is commonly regarded as intelligence, the scale may be said to have content validity. The preponderance of verbal content at the upper levels is defended by the test authors on theoretical grounds. Thus, they write:

At these levels the major intellectual differences between subjects reduce largely to differences in the ability to do conceptual thinking, and facility in dealing with concepts is most readily sampled by the use of verbal tests. Language, essentially, is the shorthand of the higher thought processes, and the level at which this shorthand functions is one of the most important determinants of the level of the processes themselves (Terman & Merrill, 1937).

It should be added that clinical psychologists have developed several schemes for classifying Stanford-Binet tests, as aids in the qualitative description of the individual's test performance (see Sattler, 1974, Ch. 10). Pattern analyses of the examinee's successes and failures in different functions may provide helpful cues for further clinical exploration. The results of such analyses, however, should be regarded as tentative and interpreted with caution. Most functions are represented by too few tests to permit reliable measurement; and the coverage of any one function varies widely from one year level to another.

Data on the criterion-related validity of the Stanford-Binet, both concurrent and predictive, have been obtained chiefly in terms of academic achievement. Since the publication of the original 1916 Scale, many correlations have been computed between Stanford-Binet IQ and school grades, teachers' ratings, and achievement test scores. Most of these correlations fall between .40 and .75. School progress was likewise found to be related to Stanford-Binet IQ. Children who were accelerated by one or more grades averaged considerably higher in IQ than those at normal age-grade location; and children who were retarded by one or more grades averaged considerably below (McNemar, 1942, Ch. 3).

Like most intelligence tests, the Stanford-Binet correlates highly with performance in nearly all academic courses, but its correlations are highest with the predominantly verbal courses, such as English and history. Correlations with achievement test scores show the same pattern. In a study of high school sophomores, for example, Form L IQ's correlated
.73 with Reading Comprehension scores, .54 with Biology scores, and .48 with Geometry scores (Bond, 1940). Correlations in the .50's and .60's have been found with college grades. Among college students, both selective factors and insufficient test ceiling frequently lower the correlations.

There have been relatively few validation studies with the 1960 Form L-M (see Himelstein, 1966). Kennedy, Van de Riet, and White (1963) report a correlation of .69 with total score on the California Achievement Test in a large sample of black elementary school children. Correlations with scores on separate parts of the same battery were: Reading, .68; Arithmetic, .64; and Language, .70.

In interpreting the IQ, it should be borne in mind that the Stanford-Binet, like most so-called intelligence tests, is largely a measure of scholastic aptitude and that it is heavily loaded with verbal functions, especially at the upper levels. Individuals with a language handicap, as well as those whose strongest abilities lie along nonverbal lines, will thus score relatively low on such a test. Similarly, there are undoubtedly a number of fields in which scholastic aptitude and verbal comprehension are not of primary importance. Obviously, to apply any test to situations for which it is inappropriate will only reduce its effectiveness. Because of the common identification of Stanford-Binet IQ with the very concept of intelligence, there has been a tendency to expect too much from this one test.

Data on the construct validity of the Stanford-Binet come from many sources. Continuity in the functions measured in the 1916, 1937, and 1960 scales was ensured by retaining in each version only those items that correlated satisfactorily with mental age on the preceding form. Hence, the information that clinicians have accumulated over the years regarding typical behavior of individuals at different MA and IQ levels can be utilized in their interpretation of scores on this scale.

Age differentiation represents the major criterion in the selection of Stanford-Binet items. Thus, there is assurance that the Stanford-Binet measures abilities that increase with age during childhood and adolescence in our culture. In each form, internal consistency was a further criterion for item selection. That there is a good deal of functional homogeneity in the Stanford-Binet, despite the apparent variety of content, is indicated by a mean item-scale correlation of .66 for the 1960 revision. The predominance of verbal functions in the scale is shown by the higher correlation of verbal than nonverbal items with performance on the total scale (Terman & Merrill, 1973, pp. 33-34).

Further data pertaining to construct validity are provided by several independent factor analyses of Stanford-Binet items. If IQ's are to be comparable at different ages, the scale should have approximately the same factorial composition at all age levels. For an unambiguous interpretation of IQ's, moreover, the scale should be highly saturated with a single common factor. The latter point has already been discussed in connection with homogeneity in Chapter 5. If the scores were heavily weighted with two group factors, such as verbal and numerical aptitudes, an IQ of, let us say, 115 obtained by different persons might indicate high verbal ability in one case and high numerical ability in the other.

McNemar (1942, Ch. 9) conducted separate factorial analyses of Stanford-Binet items at 14 age levels, including half-year groups from 2 to 5 and year groups at ages 6, 7, 9, 11, 13, 15, and 18. The number of subjects employed in each analysis varied from 99 to 200, and the number of items ranged from 19 to 35. In each of these analyses, tetrachoric correlations were computed between the items, and the resulting correlations were factor analyzed. By including items from adjacent year levels in more than one analysis, some evidence was obtained regarding the identity of the common factor at different ages. The factor loadings of tests that recur at several age levels provided further data on this point. In general, the results of these analyses indicated that performance on Stanford-Binet items is largely explicable in terms of a single common factor. Evidence of additional group factors was found at a few age levels, but the contribution of these factors was small. It was likewise demonstrated that the common factor found at adjacent age levels was essentially the same, although this conclusion may not apply to more widely separated age levels. In fact, there was some evidence to suggest that the common factor becomes increasingly verbal as the higher ages are approached. The common factor loading of the vocabulary test, for example, rose from .59 at age 6 to .91 at age 18.

Other factor-analytic studies of both the 1937 and the 1960 forms have used statistical techniques designed to bring out more fully the operation of group factors (L. V. Jones, 1949, 1954; Ramsey & Vane, 1970; Sattler, 1974, Ch. 10; Stott & Ball, 1965). Among the factors thus identified were several verbal, memory, reasoning, spatial visualization, and perceptual abilities. In general, the results suggest that there is much in common in the scale as a whole, a characteristic that is largely built into the Stanford-Binet by selecting items that have high correlations with total scores. At the same time, performance is also influenced by a number of special abilities whose composition varies with the age level tested.

The rest of this chapter is concerned with the intelligence scales prepared by David Wechsler. Although administered as individual tests and designed for many of the same uses as the Stanford-Binet, these scales differ in several important ways from the earlier test. Rather than being
organized into age levels, all items of a given type are grouped into subtests and arranged in increasing order of difficulty within each subtest. In this respect the Wechsler scales follow the pattern established for group tests, rather than that of the Stanford-Binet. Another characteristic feature of these scales is the inclusion of verbal and performance subtests, from which separate verbal and performance IQ's are computed.

Besides their use as measures of general intelligence, the Wechsler scales have been investigated as a possible aid in psychiatric diagnosis. Beginning with the observation that brain damage, psychotic deterioration, and emotional difficulties may affect some intellectual functions more than others, Wechsler and other clinical psychologists argued that an analysis of the individual's relative performance on different subtests should reveal specific psychiatric disorders. The problems and results pertaining to such a profile analysis of the Wechsler scales will be analyzed in Chapter 16, as an example of the clinical use of tests.

The interest aroused by the Wechsler scales and the extent of their use is attested by some 2,000 publications appearing to date about these scales. In addition to the usual test reviews in the Mental Measurements Yearbooks, research pertaining to the Wechsler scales has been surveyed periodically in journals (Guertin et al., 1956, 1962, 1966, 1971; Littell, 1960; Rabin & Guertin, 1951; Zimmerman & Woo-Sam, 1972) and has been summarized in several books (Glasser & Zimmerman, 1967; Matarazzo, 1972; Wechsler, 1958; Zimmerman, Woo-Sam, & Glasser, 1973).

ANTECEDENTS OF THE WAIS. The first form of the Wechsler scales, known

lation of words received undue weight in the traditional intelligence test. He likewise called attention to the inapplicability of mental age norms to adults, and pointed out that few adults had previously been included in the standardization samples for individual intelligence tests.

It was to meet these various objections that the original Wechsler-Bellevue was developed. In form and content, this scale was closely similar to the more recent Wechsler Adult Intelligence Scale (WAIS), which has now supplanted it. The earlier scale had a number of technical deficiencies, particularly with regard to size and representativeness of the normative sample and reliability of subtests, which were largely corrected in the later revision.

DESCRIPTION. Published in 1955, the WAIS comprises eleven subtests. Six subtests are grouped into a Verbal Scale and five into a Performance Scale. These subtests are listed and briefly described below in the order of their administration.

VERBAL SCALE

1. Information: 29 questions covering a wide variety of information that adults have presumably had an opportunity to acquire in our culture. An effort was made to avoid specialized or academic knowledge. It might be added that questions of general information have been used for a long time in informal psychiatric examinations to establish the individual's intellectual level and his practical orientation.

2. Comprehension: 14 items, in each of which the examinee explains what should be done under certain circumstances, why certain practices are followed,
FIG. 31. The Block Design Test of the Wechsler Adult Intelligence Scale.
(Courtesy The Psychological Corporation.)

FIG. 32. Easy Item from the WAIS Picture Arrangement Test.
(Reproduced by permission. Copyright © 1955, The Psychological Corporation, New York, N.Y. All rights reserved.)
subtest. As in the other Wechsler scales, the subtests are grouped into a Verbal and a Performance scale as follows:

The numbers correspond to the order in which the subtests are administered. Unlike the procedure followed in the WAIS and the earlier WISC, the Verbal and Performance subtests in the WISC-R are administered in alternating order. The Mazes subtest, which requires more time, may be substituted for Coding if the examiner so prefers. Any other substitution, including the substitution of Mazes for any other subtest and the substitution of Digit Span for any of the Verbal subtests, should be made only if one of the regular subtests must be omitted because of special handicaps or accidental disruption of testing procedure. The supplementary tests may always be administered in addition to the

FIG. 34. The Object Assembly Test of the Wechsler Intelligence Scale for Children-Revised.
(Courtesy The Psychological Corporation.)
each subtest with Verbal, Performance, and Full Scale scores, and of these three composite scores with each other. All correlations are given separately for the 200 cases in each of the 11 age groups in the standardization sample. The correlations between total Verbal and Performance scores ranged from .60 to .73 within age groups, averaging .67. Thus the two parts of the scale have much in common, although the correlations between them are low enough to justify the retention of the separate scores.

Factorial analyses of the earlier WISC subtests identified factors quite similar to those found in the adult scales, namely general, verbal comprehension, perceptual-spatial, and memory (or freedom from distractibility) factors (see Littell, 1960; Zimmerman & Woo-Sam, 1972). In a more recent study (Silverstein, 1973), the WISC subtests were factor analyzed separately in groups of 505 English-speaking white, 318 black, and 487 Mexican-American children aged 6 to 11 years. The results revealed a verbal comprehension factor having substantial correlations with the five verbal tests; and a perceptual organization factor having substantial correlations with Block Design and Object Assembly. A major finding of this study was the similarity of factor structure across the three ethnic groups, suggesting that the tests measure the same abilities in these groups. A factor analysis of the WISC-R scores of the standardization sample at 11 age levels between 6½ and 16½ years yielded clear evidence of three major factors at each age level (Kaufman, 1975a). These factors corresponded closely to the previously described factors of verbal comprehension, perceptual organization, and freedom from distractibility.

DESCRIPTION. In more than one sense, the Wechsler Preschool and Primary Scale of Intelligence (WPPSI) is the baby of the series. Published in 1967, this scale is designed for ages 4 to 6½ years. The scale includes 11 subtests, only 10 of which are used in finding the IQ. Eight of the subtests are downward extensions and adaptations of WISC subtests; the other three were newly constructed to replace WISC subtests that proved unsuitable for a variety of reasons. As in the WISC and WAIS, the subtests are grouped into a Verbal and a Performance scale, from which Verbal, Performance, and Full Scale IQ's are found. As in WISC-R, the administration of Verbal and Performance subtests is alternated in order to enhance variety and help to maintain the young child's interest and cooperation. Total testing time ranges from 50 to 75 minutes, in one or two testing sessions.

In the following list the new subtests have been starred:

VERBAL SCALE                        PERFORMANCE SCALE
Information                         *Animal House
Vocabulary                          Picture Completion
Arithmetic                          Mazes
                                    *Geometric Design
                                    Block Design
*Sentences (Supplementary Test)

FIG. 35. The Animal House Test of the Wechsler Preschool and Primary Scale of Intelligence.
(Courtesy The Psychological Corporation.)

"Sentences" is a memory test, substituted for the WISC Digit Span. The child repeats each sentence immediately after oral presentation by the examiner. This test can be used as an alternate for one of the other verbal tests, or it can be administered as an additional test to provide further information about the child, in which case it is not included in the total score in calculating the IQ. "Animal House" is basically similar to the WAIS Digit Symbol and the WISC Coding test. A key at the top of the board has pictures of dog, chicken, fish, and cat, each with a differently colored cylinder (its "house") under it. The child is to insert the correctly colored cylinder in the hole beneath each animal on the board (see Fig.
35). Time, errors, and omissions determine the score. "Geometric Design" requires the copying of 10 simple designs with a colored pencil.

The possibility of using short forms seems to have aroused as much interest for WPPSI as it had for WAIS and WISC. Some of the same investigators have been concerned with the derivation of such abbreviated scales at all three levels, notably Silverstein (1968a, 1968b, 1970, 1971). In a particularly well-designed study, Kaufman (1972) constructed a short form consisting of two Verbal subtests (Arithmetic and Comprehension) and two Performance subtests (Block Design and Picture Completion). Within individual age levels, this battery yielded reliability coefficients ranging from .91 to .94 and correlations with Full Scale IQ ranging from .89 to .92. Half the WPPSI standardization sample of 1,200 cases was used in the selection of the tests; the other half was used to cross validate the resulting battery. Kaufman reiterates the customary caution to use the short form only for screening purposes when time does not permit the administration of the entire scale.

NORMS. The WPPSI was standardized on a national sample of 1,200 children: 100 boys and 100 girls in each of six half-year age groups from 4 to 6½. Children were tested within six weeks of the required birthday or midyear date. The sample was stratified against 1960 census data with reference to geographical region, urban-rural residence, proportion of whites and nonwhites, and father's occupational level. Raw scores on each subtest are converted to normalized standard scores with a mean of 10 and an SD of 3 within each quarter-year group. The sums of the scaled scores on the Verbal, Performance, and Full Scale are then converted to deviation IQ's with a mean of 100 and an SD of 15. Although Wechsler argues against the use of mental age scores because of their possible misinterpretations, the manual provides a table for converting raw scores on each subtest to "test ages" in quarter-year units.

RELIABILITY. For every subtest except Animal House, reliability was found by correlating odd and even scores and applying the Spearman-Brown formula. Since scores on Animal House depend to a considerable extent on speed, its reliability was found by a retest at the end of the testing session. Reliability coefficients were computed separately within each half-year age group in the standardization sample. While varying with subtest and age level, these reliabilities fell mostly in the .80's. Reliability of the Full Scale IQ varied between .92 and .94; for Verbal IQ, it varied between .87 and .90; and for Performance IQ, between .84 and .91. Standard errors of measurement are also provided in the manual, as well as tables for evaluating the significance of the difference between scores. From these data it is suggested that a difference of 15 points or more between Verbal and Performance IQ's is sufficiently important to be investigated.

Stability over time was checked in a group of 50 kindergarten children retested after an average interval of 11 weeks. Under these conditions, reliability of Full Scale IQ was .92; for Verbal IQ it was .86; and for Performance IQ, .89.

VALIDITY. As is true of the other two Wechsler scales, the WPPSI manual contains no section labeled "validity," although it does provide some data of tangential relevance to the validity of the instrument. Intercorrelations of the 11 subtests within each age level in the standardization sample fall largely between .40 and .60. Correlations between Verbal and Performance subtests are nearly as high as those within each scale. The overlap between the two scales is also indicated by an average correlation of .66 between Verbal and Performance IQ's.

The manual reports a correlation of .75 with Stanford-Binet IQ in a group of 98 children aged 5 to 6 years. As in the case of the WISC, the Stanford-Binet correlates higher with the Verbal IQ (.76) than with the Performance IQ (.56). This finding was corroborated in subsequent studies by other investigators working with a variety of groups. In thirteen studies surveyed by Sattler (1974, p. 209), median correlations of WPPSI with Stanford-Binet IQ were .82, .81, and .67 for Full Scale, Verbal, and Performance IQ's, respectively. Correlations have also been found with a number of other general ability tests (for references, see Sattler, 1974, p. 210 and Appendix B-9). Data on predictive validity are meager (Kaufman, 1973a).

A carefully designed reanalysis of the standardization sample of 1,200 cases (Kaufman, 1973b) provided information on the relation of WPPSI scores to socioeconomic status (as indicated by father's occupational level), urban versus rural residence, and geographic region. For each of these three variables, WPPSI Verbal, Performance, and Full Scale IQ's were compared between samples matched on all the other stratification variables (including the other two variables under investigation plus sex, age, and color).

Socioeconomic status yielded significant differences only at the extremes of the distribution. Children with fathers in the professional and technical categories averaged significantly higher than all other groups (Mean IQ = 110); and children whose fathers were in the unskilled category averaged significantly lower than all other groups (Mean IQ = 92.1). Geographic region showed no clear relation to scores. No significant differences were found between matched urban and rural samples, unlike earlier studies with the WISC (Seashore, Wesman, & Doppelt, 1950) and the Stanford-Binet (McNemar, 1942). The investigator al-
with two broad group factors: a verbal factor with substantial loadings in the six verbal subtests in each age group; and a performance factor with substantial loadings in the five performance tests in the two older groups and somewhat lower but still appreciable loadings in the youngest group (ages 4 to 4½ years). The separation of the two group factors was generally less clear-cut in the youngest group, a finding that is in line with much of the earlier research on the organization of abilities in young children. When subtest scores were factor analyzed separately for black and white children in the standardization sample, the results in both groups were closely similar to those obtained in the total sample (Kaufman & Hollenbeck, 1974).

CONCLUDING REMARKS ON THE WECHSLER SCALES. The currently available forms of the three Wechsler scales reflect an increasing level of sophistication and experience in test construction, corresponding to the decade when they were developed: WAIS (1955), WPPSI (1967), WISC-R (1974). In comparison with other individually administered tests, their principal strengths stem from the size and representativeness of the standardization samples, particularly for adult and preschool populations, and the technical quality of their test construction procedures. The treatment of reliability and errors of measurement is especially commendable. The weakest feature of all three scales is the dearth of empirical data on validity. The factor-analytic studies contribute to a clarification of the constructs in terms of which performance on the Wechsler scales may be described; but even these studies would have been more informative if they had included more indices of behavior external to the scales themselves.

properly or adequately examined with traditional instruments, such as the individual scales described in the preceding chapter or the typical group tests to be considered in the next chapter. Historically, the kinds of tests surveyed in this chapter were designated as performance, nonlanguage, or nonverbal.

Performance tests, on the whole, involve the manipulation of objects, with a minimal use of paper and pencil. Nonlanguage tests require no language on the part of either examiner or examinee. The instructions for these tests can be given by demonstration, gesture, and pantomime, without the use of oral or written language. A prototype of nonlanguage group tests was the Army Examination Beta, developed for testing foreign-speaking and illiterate recruits during World War I (Yerkes, 1921). Revisions of this test were subsequently prepared for civilian use. For most testing purposes, it is not necessary to eliminate all language from test administration, since the examinees usually have some knowledge of a common language. Moreover, short, simple instructions can usually be translated or given successively in two languages without appreciably altering the nature or difficulty of the test. None of these tests, however, requires the examinee himself to use either written or spoken language.

Still another related category is that of nonverbal tests, more properly designated as nonreading tests. Most tests for primary school and preschool children fall into this category, as do tests for illiterates and nonreaders at any age level. While requiring no reading or writing, these tests make extensive use of oral instructions and communication on the part of the examiner. Moreover, they frequently measure verbal comprehension, such as vocabulary and the understanding of sentences and short paragraphs, through the use of pictorial material supplemented with oral instructions to accompany each item. Unlike the nonlanguage tests, they would thus be unsuited for foreign-speaking or deaf persons.
Because of its highly individualized procedures, Piagetian testing is well suited for clinical work. It has also attracted the attention of educators because it permits the integration of testing and teaching.4 Its most frequent use, however, is still in research on developmental psychology. The three sets of tests described below have been selected in part because of their present or anticipated availability for use outside of the authors'

3 "Schemata" is the plural of "schema," a term commonly encountered in Piagetian writings and signifying essentially a framework into which the individual fits incoming sensory data.
4 An example of such an application is "Let's Look at Children," to be discussed in Chapter 14.
5 Other major projects on the standardization of Piagetian scales are being conducted by S. K. Escalona at the Albert Einstein College of Medicine in New York City, R. D. Tuddenham at the University of California in Berkeley, N. P. Vinh-Bang and B. Inhelder in Piaget's laboratory in Geneva, Switzerland, and E. A. Lunzer at the University of Manchester in England.
6 Personal communication from Professor A. Pinard, April 3, 1974.

a set of toy lampposts in a straight line between two toy houses; placing a toy man in the same spots in the child's landscape that he occupies in the examiner's identical landscape; designating right and left on the child's own body, on the examiner in different positions, and in the relation of objects on the table; and problems of perspective, in which the
child indicates how three toy mountains look to a man standing in different places. Several of these spatial tests deal with the "egocentrism" of the young child's thinking, which makes it difficult for him to regard objects from viewpoints other than his own.

The complete protocol of the child's responses to each test is scored as a unit, in terms of the developmental level indicated by the quality of the responses. Laurendeau and Pinard have subjected their tests to extensive statistical analyses. Their standardization sample of 700 cases included 25 boys and 25 girls at each six-month interval from 2 to 5 years and at each one-year interval from 5 to 12. The children were selected so as to constitute a representative sample of the French Canadian population of Montreal with regard to father's occupational level and school grade (or number of children in the family at the preschool ages). Besides providing normative age data, the authors analyzed their results for ordinality, or uniformity of sequence in the attainment of response levels by different children. They also investigated the degree of similarity in the developmental stages reached by each child in the different tests. Intercorrelations of scores on the five causality tests ranged from .59 to .78; on the five space tests, the correlations ranged from .37 to .67 (Laurendeau & Pinard, 1962, p. 236; 1970, p. 412).

The Ordinal Scales of Psychological Development prepared by Uzgiris and Hunt (1975) are designed for a much younger age level than are those of Laurendeau and Pinard, extending from the age of 2 weeks to 2 years. These ages cover approximately what Piaget characterizes as the sensorimotor period and within which he recognizes six stages. In order to increase the sensitivity of their instruments, Uzgiris and Hunt classify the responses into more than six levels, the number varying from 7 to 14 in the different scales. The series includes six scales, designated as follows:

1. Object Permanence: the child's emerging notion of independently existing objects is indicated by following an object and searching for an object after it is hidden with increasing degrees of concealment.

2. Development of Means for achieving desired environmental ends: use of own hands in reaching for objects and of other means such as strings, stick, support, etc.

3. Imitation: including both gestural and vocal imitation.

4. Operational Causality: recognizing and adapting to objective causality, ranging from visual observation of one's own hands to eliciting desired behavior from a human agent and activating a mechanical toy.

5. Object Relations in Space: coordination of schemata of looking and listening in localizing objects in space; understanding such relations as container, equilibrium, gravity.

6. Development of Schemata for relating to objects: responding to objects by looking, feeling, manipulating, dropping, throwing, etc., and by socially instigated schemata appropriate to particular objects (e.g., "driving" toy car, building with blocks, wearing beads, naming objects).

No norms are provided, but the authors collected data on several psychometric properties of their scales by administering them to 84 infants, including at least four at each month of age up to one year and at least four at each two months of age between one and two years. Most of these subjects were children of graduate students and staff at the University of Illinois. Both observer agreement and test-retest agreement after a 48-hour interval are reported. In general, the tests appear quite satisfactory in both respects. An index of ordinality, computed for each scale from the scores of the same 84 children, ranged from .802 to .991. The authors report that .50 is considered minimally satisfactory evidence of ordinality with the index employed.7

7 Procedures for the measurement of ordinality and the application of scalogram analysis to Piagetian scales are still controversial, a fact that should be borne in mind in interpreting any reported indices of ordinality (see Hooper, 1973; Wohlwill, 1970).

Uzgiris and Hunt clearly explain that these are only provisional scales, although they are available to other investigators for research purposes. Apart from journal articles reporting specific studies in which the scales were employed, the authors describe the tests in a book (Uzgiris & Hunt, 1975) and also provide six sound films demonstrating their use. The scales were originally designed to measure the effects of specific environmental conditions on the rate and course of development of infants. Studies of infants reared under different conditions (Paraskevopoulos & Hunt, 1971) and on infants participating in intervention programs (Hunt, Paraskevopoulos, Schickedanz, & Uzgiris, 1975) have thus far indicated significant effects of these environmental variables on the mean age at which children attain different steps in the developmental scales.

Unlike the first two examples of Piagetian scales, the Concept Assessment Kit-Conservation is a published test which may be purchased on the same basis as other psychological tests. Designed for ages 4 to 7 years, it provides a measure of one of the best known Piagetian concepts. Conservation refers to the child's realization that such properties of objects as weight, volume, or number remain unchanged when the objects undergo transformations in shape, position, form, or other specific attributes. The authors (Goldschmid & Bentler, 1968b) focused on conservation as an indicator of the child's transition from the preoperational to the concrete operational stage of thinking, which Piaget places roughly at the age of 7 or 8 years.

Throughout the test, the procedure is essentially the same. The child is
shown two identical objects; then the examiner makes certain transformations in one of them and interrogates the child about their similarity or difference. After answering, the child is asked to explain his answer. In each item, one point is scored for the correct judgment of equivalence and one point for an acceptable explanation. For example, the examiner begins with two standard glasses containing equal amounts of water (continuous quantity) or grains of corn (discontinuous quantity) and pours the contents into a flat dish or into several small glasses. In another task, the examiner shows the child two equal balls of Playdoh and then flattens one into a pancake and asks whether the ball is as heavy as the pancake.

Three forms of the test are available. Forms A and B are parallel, each providing six tasks: Two-Dimensional Space, Number, Substance, Continuous Quantity, Weight, and Discontinuous Quantity. The two forms were shown to be closely equivalent in means and SD's and their scores correlated .95. Form C includes two different tasks, Area and Length; it correlates .76 and .74 with Forms A and B, respectively. Administration is facilitated by printing all essential directions on the record form, including diagrams of the materials, directions for manipulating materials, and verbal instructions.

Norms were established on a standardization sample of 560 boys and girls between the ages of 4 and 8, obtained from schools, day care centers, and Head Start centers in the Los Angeles, California area. The sample included both blacks and whites and covered a wide range of socioeconomic level, but with a slight overrepresentation of the lower-middle class. Percentiles are reported for each age level. These norms, of course, must be regarded as tentative in view of the small number of cases at each age and the limitations in representativeness of the sample. Mean scores for each age show a systematic rise with age, with a sharp rise between 6 and 8 years, as anticipated from Piagetian theory.

Both in the process of test construction and in the evaluation of the final forms, the authors carried out various statistical analyses to assess scorer reliability; Kuder-Richardson, parallel-form, and retest reliability; scalability, or ordinality; and factorial composition (see also Goldschmid & Bentler, 1968a). Although based on rather small samples, the results indicate generally satisfactory reliability and give good evidence of ordinality and of the presence of a large common factor of conservation throughout the tasks.

Comparative studies in seven countries suggest that the test is applicable in widely diverse cultures, yielding high reliabilities and showing approximately similar age trends (Goldschmid et al., 1973). Differences among cultures and subcultures, however, have been found in the mean ages at which concepts are acquired, i.e., the age curves may be displaced horizontally by one or two years (see also Figurelli & Keller, 1972; Wasik & Wasik, 1971). Training in conservation tasks has also been found to improve scores significantly (see also Goldschmid, 1968; Zimmerman & Rosenthal, 1974a, 1974b). The manual cites several studies on small groups that contribute suggestive data about the construct validity of the test. Some evidence of predictive validity is provided by significant correlations in the .30's and .40's with first-grade achievement, the correlation being highest with arithmetic grades (.52).

DEAFNESS. Owing to their general retardation in linguistic development, deaf children are usually handicapped on verbal tests, even when the verbal content is presented visually. In fact, the testing of deaf children was the primary object in the development of some of the earliest performance scales, such as the Pintner-Paterson Performance Scale and the Arthur Performance Scale. In Revised Form II of the Arthur scale, the verbal instructions required in the earlier form were further reduced in order to increase the applicability of the test to deaf children. Special adaptations of the Wechsler scales are sometimes employed in testing deaf persons. The verbal tests can be administered if the oral questions are typed on cards. Various procedures for communicating the instructions for the performance tests have also been worked out (see Sattler, 1974, pp. 170-172). With such modifications of standard testing procedures, however, one cannot assume that reliability, validity, and norms remain unchanged. Nonlanguage group tests, such as the Army Beta, are also used in testing the deaf.

Whether or not they require special procedural adaptations, all the tests mentioned thus far were standardized on hearing persons. For many purposes, it is of course desirable to compare the performance of the deaf with general norms established on hearing persons. At the same time, norms obtained on deaf children are also useful in a number of situations pertaining to the educational development of these children.

To meet this need, the Hiskey-Nebraska Test of Learning Aptitude was developed and standardized on deaf and hard-of-hearing children. This is an individual test suitable for ages 3 to 16. Speed was eliminated, since it is difficult to convey the idea of speed to young deaf children. An attempt was also made to sample a wider variety of intellectual functions than those covered by most performance tests. Pantomime and practice exercises to communicate the instructions, as well as intrinsically interesting items to establish rapport, were considered important requirements for such a test. All items were chosen with special reference to the limitations of deaf children, the final item selection being based chiefly on the criterion of age differentiation.
5. Paper Folding (Patterns)
6. Visual Attention Span
8. Completion of Drawings
9. Memory for Digits
11. Picture Analogies
12. Spatial Reasoning

Norms were derived separately from 1,079 deaf and 1,074 hearing children between the ages of 3 and 17 years, tested in 10 states. Split-half reliabilities in the .90's are reported for deaf and hearing groups. Intercorrelations of the 12 subtests range from the .30's to the .70's among younger children (ages 3 to 10) and from the .20's to the .40's among older children (ages 11 to 17). Correlations of .78 to .86 were found between the Hiskey-Nebraska and either the Stanford-Binet or the Wechsler Intelligence Scale for Children in small groups of hearing children. Further evidence of validity was provided by substantial correlations with achievement tests among deaf children. The manual contains a discussion of desirable practices to be followed in testing deaf children.

BLINDNESS. Testing the blind presents a very different set of problems from those encountered with the deaf. Oral tests can be most readily adapted for blind persons, while performance tests are least likely to be applicable. In addition to the usual oral presentation by the examiner, other suitable testing techniques have been utilized, such as phonograph records and tape or wire recordings. Some tests are also available in braille. The latter technique is somewhat limited in its applicability, however, by the greater bulkiness of materials printed in braille as compared with inkprint, by the slower reading rate for braille, and by the number of blind persons who are not facile braille readers. The examinee's responses may likewise be recorded in braille or on a typewriter. Specially prepared embossed answer sheets or cards are also available for use with true-false, multiple-choice, and other objective-type items. In many individually administered tests, of course, oral responses can be obtained.

Among the principal examples of general intelligence tests that have been adapted for blind persons are the Binet and the Wechsler. The first Hayes-Binet revision for testing the blind was based on the 1916 Stanford-Binet. In 1942, the Interim Hayes-Binet³ was prepared from the 1937 Stanford-Binet (S. P. Hayes, ...). All tests that could be administered without the use of vision were selected from both Form L and Form M. This procedure yielded six tests for each year level from VIII to XIV, and eight tests at the Average Adult level. In order to assemble enough tests for year levels III to VI, it was necessary to draw on some of the special tests devised for use in the earlier Hayes-Binet. Most of the tests in the final scale are oral, a few requiring braille materials. A retest reliability of .90 and a split-half reliability of .91 are reported by Hayes. Correlations with braille editions of standard achievement tests ranged from .82 to .93. The validity of this test was also checked against school progress.

³ Originally designated as an interim edition because of the tentative nature of its standardization, this revision has come to be known by this name in the literature.

The Wechsler scales have also been adapted for blind examinees. These adaptations consist essentially in using the verbal tests and omitting the performance tests. A few items inappropriate for the blind have also been replaced by alternates. When tested under these conditions, blind persons as a group have been found to equal or excel the general seeing norms.

A different approach is illustrated by the Haptic Intelligence Scale for Adult Blind. This was developed as a nonverbal test to be used in conjunction with the Verbal Scale of the WAIS. Four of the tests are adaptations of performance tests from the WAIS, namely, Digit Symbol, Block Design, Object Assembly, and Object Completion; two were newly devised, including Pattern Board and Bead Arithmetic. The tests utilize a completely tactile approach and, if given to the partially sighted, require the wearing of a blindfold. For this reason, among others, they are probably best suited for testing the totally blind. Standardization procedures followed closely those employed with the WAIS. The blind subjects tested in the standardization sample included a proportional number of nonwhites and were distributed over the major geographical regions of the country. Subtest scores and deviation IQ's are found as in the WAIS. Split-half reliability for the entire test was found to be .95 and for subtests varied from .79 (Object Assembly) to .94 (Bead Arithmetic). A six-month retest of 136 subjects yielded a total-score reliability of .91 and subtest reliabilities ranging from .70 to .81. Correlation with WAIS Verbal Scale in the 20-34 age group of blind subjects was .65. The materials are bulky and administration time is long, requiring from 1½ to 2 hours; but blind examinees generally find the tests interesting and enjoyable. The authors caution that this is a provisional scale, requiring further research. It can provide useful information when employed by a trained clinician.

A number of group intelligence tests have likewise been adapted for use with the visually handicapped and are available in both large-type and braille editions. Examples include the School and College Ability Tests (SCAT), the College Board Scholastic Aptitude Test (SAT), and the Aptitude Test of the Graduate Record Examinations (GRE). Research with a tactile form of the Progressive Matrices has shown it to have promise as a nonverbal intelligence test for blind children between the ages of 9 and 15 years (Rich & Anderson, 1965). An adaptation of the Vineland Social Maturity Scale for blind preschool children was developed and standardized by Maxfield and Buchholz (1957).

ORTHOPEDIC HANDICAPS. Although usually able to receive auditory and visual stimulation, the orthopedically handicapped may have such severe motor disorders as to make either oral or written responses impracticable. The manipulation of form boards or other performance materials would likewise meet with difficulties. Working against a time limit or in strange surroundings often increases the motor disturbance in the orthopedically handicapped. Their greater susceptibility to fatigue makes short testing sessions necessary.

Some of the severest motor handicaps are found among the cerebral palsied. Yet surveys of these cases have frequently employed common intelligence tests such as the Stanford-Binet or the Arthur Performance Scale. In such studies, the most severely handicapped were usually excluded as untestable. Frequently, informal adjustments in testing procedure are made in order to adapt the test to the child's response capacities. Both of these procedures, of course, are makeshifts.

A more satisfactory approach lies in the development of testing instruments suitable for even the most severely handicapped individuals. A number of specially designed tests or adaptations of existing tests are now available for this purpose, although their normative and validity data are usually meager. Several of the tests to be discussed in the next section, originally designed for use in cross-cultural testing, have also proved applicable to the handicapped. Adaptations of the Leiter International Performance Scale and the Porteus Mazes, suitable for administration to cerebral-palsied children, have been prepared (Allen & Collins, 1955; Arnold, 1951). In both adapted tests, the examiner manipulates the test materials, while the subject responds only by appropriate head movements. A similar adaptation of the Stanford-Binet has been proposed (E. Katz, 1958). The Progressive Matrices provide a promising tool for this purpose. Since this test is given with no time limit, and since the response may be indicated orally, in writing, or by pointing or nodding, it appears to be especially appropriate for the orthopedically handicapped. Despite the flexibility and simplicity of its response indicator, this test covers a wide range of difficulty and provides a fairly high test ceiling. Successful use of this test has been reported in studies of cerebral-palsied children and adults (Allen & Collins, 1955; Holden, 1951; Tracht, 1948).

Another type of test that permits the utilization of a simple pointing response is the picture vocabulary test. These tests provide a rapid measure of "use" vocabulary, especially applicable to persons unable to vocalize well (such as the cerebral palsied) and to the deaf. Since they are easy to administer and can be completed in about 15 minutes or less, they are also useful as a rapid screening device in situations where no trained examiner is available but individual testing is needed.

The Peabody Picture Vocabulary Test (PPVT) is typical of these instruments. It consists of a series of 150 plates, each containing four pictures. As each plate is presented, the examiner provides a stimulus word orally; the child responds by pointing to or in some other way designating the picture on the plate that best illustrates the meaning of the stimulus word. Although the entire test covers a range from 2½ to 18 years, each individual is given only the plates appropriate to his own performance level, as determined by a specified run of successes at one end and failures at the other. Raw scores can be converted to mental ages, deviation IQ's, or percentiles. The PPVT is untimed but requires from 10 to 15 minutes. It is available in two parallel forms which utilize the same set of cards with different stimulus words.

The standardization sample for the PPVT included a total of 4,012 cases between the ages of 2½ and 18 years tested in Nashville, Tennessee, and its environs. Alternate form reliability coefficients for different age levels within the standardization sample ranged from .67 to .84. Reliability coefficients within the same range were subsequently obtained in several mentally retarded or physically handicapped groups. Validity was originally established in terms of age differentiation. Since its publication, the test has been employed in a number of studies with normal, mentally retarded, emotionally disturbed, or physically handicapped children. These studies have yielded validity coefficients in the .60's with individual and group intelligence scales within relatively homogeneous age groups. Understandably, these correlations were higher with verbal than with performance tests. There is also some evidence of moderate concurrent and predictive validity against academic achievement tests. A limitation of this test for certain testing purposes is suggested by the finding that culturally disadvantaged children tend to perform more poorly on it than on other intelligence tests (Costello & Ali, 1971; Cundick, 1970; Milgram & Ozer, 1967; Rosenberg & Stroud, 1966). On the other hand, participants in preschool compensatory education programs showed more improvement on this test than on the Stanford-Binet (Howard & Plant, 1967; Klaus & Gray, 1968; Milgram, 1971). Scores on the PPVT may reflect in part the child's degree of cultural assimilation.

Similar procedures of test administration have been incorporated in pictorial classification tests, as illustrated by the Columbia Mental Maturity Scale (CMMS). Originally developed for use with cerebral-palsied children, this scale comprises 92 items, each consisting of a set of three, four, or five drawings printed on a large card. The examinee is required to identify the drawing that does not belong with the others, indicating his choice by pointing or nodding (see Fig. 39). To heighten interest and
appeal, the cards and drawings are varicolored. The objects depicted were chosen to be within the range of experience of most American children. Scores are expressed as Age Deviation Scores, which are normalized standard scores within age groups, with a mean of 100 and an SD of 16. Percentile and stanine equivalents for these scores are also provided. To meet the demand for developmental norms, the manual includes a Maturity Index, indicating the age group in the standardization sample whose test performance is most similar to that of the child.

FIG. 39. Examiner Administering Columbia Mental Maturity Scale to Child.
(From Columbia Mental Maturity Scale: Guide for Administering and Interpreting, 1972, p. 11. Copyright © 1972 by Harcourt Brace Jovanovich, Inc. Reproduced by permission.)

The standardization sample for the CMMS comprised 2,600 children, including 100 boys and 100 girls in each of 13 six-month age groups between the ages of 3-6 and 9-11. The sample was stratified in terms of the 1960 U.S. Census with regard to parental occupational level, race, and geographical region; proportion of children living in metropolitan and nonmetropolitan areas was also approximately controlled. Split-half reliabilities within single age groups ranged from .85 to .91. Standard errors of measurement of the Age Deviation Scores are between 5 and 6 points. Retest of three age groups after an interval of 7 to 10 days yielded reliabilities of .84 to .86. A correlation of .67 with Stanford-Binet was found in a group of 52 preschool and first-grade children. Correlations with achievement test scores in first- and second-grade samples fell mostly between the high .40's and the low .60's. More extensive data on validity and on applicability to various handicapped groups are available for an earlier form of the test.

THE PROBLEM. The testing of persons with highly dissimilar cultural backgrounds has received increasing attention since midcentury. Tests are needed for the maximum utilization of human resources in the newly developing nations in Africa and elsewhere. The rapidly expanding educational facilities in these countries require testing for admission purposes as well as for individual counseling. With increasing industrialization, there is a mounting demand for tests to aid in the job selection and placement of personnel, particularly in mechanical, clerical, and professional fields.

In America the practical problems of cross-cultural testing have been associated chiefly with subcultures or minority cultures within the dominant culture. There has been widespread concern regarding the applicability of available tests to culturally disadvantaged groups. It should be noted parenthetically that cultural disadvantage is a relative concept. Objectively there is only cultural difference between any two cultures or subcultures. Each culture fosters and encourages the development of behavior that is adapted to its values and demands. When an individual must adjust to and compete within a culture or subculture other than that in which he was reared, then cultural difference is likely to become cultural disadvantage.

Although concern with cross-cultural testing has been greatly stimulated by recent social and political developments, the problem was recognized at least as early as 1910. Some of the earliest cross-cultural tests were developed for testing the large waves of immigrants coming to the United States at the turn of the century. Other early tests originated in basic research on the comparative abilities of relatively isolated cultural groups. These cultures were often quite primitive and had had little or no contact with Western civilization within whose framework most psychological tests had been developed.

Traditionally, cross-cultural tests have tried to rule out one or more parameters along which cultures vary. A well-known example of such a parameter is language. If the cultural groups to be tested spoke different languages, tests were developed that required no language on the part of either examiner or subjects. When educational backgrounds differed widely and illiteracy was prevalent, reading was ruled out. Oral language was not eliminated from these tests because they were designed for persons speaking a common language. Another parameter in which cultures or subcultures differ is that of speed. Not only the tempo of daily life, but also the motivation to hurry and the value attached to rapid performance vary widely among national cultures, among ethnic minority groups within a single nation, and between urban and rural subcultures (see, e.g., Klineberg, 1925; Knapp, 1960). Accordingly, cross-cultural tests have often, though not always, tried to eliminate the influence of speed by allowing long time limits and giving no premium for faster performance.

Still other parameters along which cultures differ pertain to test content. Most nonlanguage and nonreading tests, for example, call for items of information that are specific to certain cultures. Thus, they may require the examinee to understand the function of such objects as violin, postage stamp, gun, pocketknife, telephone, piano, or mirror. Persons reared in certain cultures may lack the experiential background to respond correctly to such items. It was chiefly to control this type of cultural parameter that the classic "culture-free" tests were first developed. Following a brief examination of typical tests designed to eliminate one or more of the above parameters, we shall turn to an analysis of alternative approaches to cross-cultural testing.
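The standard errors of measurement quoted earlier for the Columbia Mental Maturity Scale can be checked against its reported reliabilities. The sketch below is an illustrative aside, not a procedure taken from the test manual: it assumes the usual classical formula, SEM = SD x sqrt(1 - r), and the function name is hypothetical. With the Age Deviation Score SD of 16 and split-half reliabilities of .85 to .91, it gives values of roughly 6.2 and 4.8 points, which is in line with the reported range of 5 to 6 points.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory estimate: SEM = SD * sqrt(1 - r).

    Hypothetical helper for illustration; the source does not state
    which formula the CMMS manual actually used.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


if __name__ == "__main__":
    AGE_DEVIATION_SD = 16.0  # SD of CMMS Age Deviation Scores, as reported
    for r in (0.85, 0.91):   # reported range of split-half reliabilities
        sem = standard_error_of_measurement(AGE_DEVIATION_SD, r)
        print(f"r = {r:.2f}: SEM = {sem:.1f} points")
```

The same relation shows why a modest drop in reliability matters in practice: the error band around an individual score widens as the square root of the unreliable proportion of variance.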
high school pupils. The scale was subsequently applied to several African groups by Porteus and to a few other national groups by other investigators. A later revision, issued in 1948, was based on further testing of American children, high school students, and Army recruits during World War II. A distinctive feature of the Leiter scale is the almost complete elimination of instructions, either spoken or pantomime. Each test begins with a very easy task of the type to be encountered throughout that test. The comprehension of the task is treated as part of the test. The materials consist of a response frame, illustrated in Figure 40, with an adjustable card holder. All tests are administered by attaching the appropriate card, containing printed pictures, to the frame. The examinee chooses the blocks with the proper response pictures and inserts them into the frame.

The Leiter scale was designed to cover a wide range of functions, similar to those found in verbal scales. Among the tasks included may be mentioned: matching identical colors, shades of gray, forms, or pictures; copying a block design; picture completion; number estimation; analogies; series completion; recognition of age differences; spatial relations; footprint recognition; similarities; memory for a series; and classification of animals according to habitat. Administered individually, with no time limit, these tests are arranged into year levels from 2 to 18. The scale is scored in terms of MA and ratio IQ, although there is no assurance that such an IQ retains the same meaning at different ages. In fact, the published data show considerable fluctuation in the standard deviation of the IQ's at different age levels. Split-half reliabilities of .91 to .94 are reported from several studies, but the samples were quite heterogeneous in age and probably in other characteristics. Validation data are based principally on age differentiation and internal consistency. Some correlations are also reported with teachers' ratings of intelligence and with scores on other tests, including the Stanford-Binet and the WISC. These correlations range from .56 to .92, but most were obtained on rather heterogeneous groups.

The tests in year levels 2 to 12 are also available as the Arthur Adaptation of the Leiter International Performance Scale. This adaptation, considered most suitable for testing children between the ages of 3 and 8 years, was standardized by Grace Arthur in 1952. Its norms must be regarded as quite limited, being derived from a standardization sample of 289 children from a middle-class, midwestern metropolitan background. Like the original scale, the Arthur adaptation yields an MA and a ratio IQ.

The Culture Fair Intelligence Test, developed by R. B. Cattell and published by the Institute for Personality and Ability Testing (IPAT), is a
paper-and-pencil test. This test is available in three levels: Scale 1, for ages 4 to 8 and mentally retarded adults; Scale 2, for ages 8 to 13 and average adults; and Scale 3, for grades 10 to 16 and superior adults. Each scale has been prepared in two parallel forms, A and B. Scale 1 requires individual administration for at least some of the tests; the other scales may be given either as individual or as group tests. Scale 1 comprises eight tests, only four of which are described by the author as culture-fair. The other four involve both verbal comprehension and specific cultural information. It is suggested that the four culture-fair tests can be used as a sub-battery, separate norms being provided for this abbreviated scale. Scales 2 and 3 are alike, except for difficulty level. Each consists of the following four tests, sample items from which are shown in Figure 41.

1. Series: Select the item that completes the series.
2. Classification: Mark the one item in each row that does not belong with the others.
3. Matrices: Mark the item that correctly completes the given matrix, or pattern.
4. Conditions: Insert a dot in one of the alternative designs so as to meet the same conditions indicated in the sample design. Thus, in the example reproduced in Figure 41, the dot must be in the two rectangles, but not in the circle. This condition can be met only in the third response alternative, which has been marked.

FIG. 41. Sample Items from Culture Fair Intelligence Test, Scale 2.
(Copyright by Institute of Personality and Ability Testing.)

For Scale 1, only ratio IQ's are provided. In Scales 2 and 3, scores can be converted into deviation IQ's with an SD of 16 points. Scales 2 and 3 have been standardized on larger samples than Scale 1, but the representativeness of the samples and the number of cases at some age levels still fall short of desirable test-construction standards. Although the tests are highly speeded, some norms are provided for an untimed version. Fairly extensive verbal instructions are required, but the author asserts that giving these instructions in a foreign language or in pantomime will not affect the difficulty of the test.

Internal consistency and alternate-form reliability coefficients are marginal, especially for Scale 3, where they fall mostly in the .50's and .60's. Validity is discussed chiefly in terms of saturation with a general intellective factor (g), having been investigated largely through correlation with other tests and through factor analysis. Scattered studies of concurrent and predictive validity show moderate correlations with various academic and occupational criteria. The Cattell tests have been administered in several European countries, in America, and in certain African and Asian cultures. Norms tended to remain unchanged in cultures moderately similar to that in which the tests were developed; in other cultures, however, performance fell considerably below the original norms. Moreover, black children of low socioeconomic level tested in the United States did no better on this test than on the Stanford-Binet (Willard, 1968).

The Progressive Matrices, developed in Great Britain by Raven, were also designed as a measure of Spearman's g factor. Requiring chiefly the eduction of relations among abstract items, this test is regarded by most British psychologists as the best available measure of g. It consists of 60 matrices, or designs, from each of which a part has been removed. The subject chooses the missing insert from six or eight given alternatives. The items are grouped into five series, each containing 12 matrices of increasing difficulty but similar in principle. The earlier series require accuracy of discrimination; the later, more difficult series involve analogies, permutation and alternation of pattern, and other logical relations. Two sample items are reproduced in Figure 42. The test is administered with no time limit, and can be given individually or in groups. Very simple oral instructions are required.

Percentile norms are provided for each half-year interval between 8 and 14 years, and for each five-year interval between 20 and 65 years. These norms are based on British samples, including 1,407 children, 3,665 men in military service tested during World War II, and 2,192 civilian adults. Closely similar norms were obtained by Rimoldi (1943) on 1,680
children in Argentina. Use of the test in several European countries likewise indicated the applicability of available norms. Studies in a number of non-European cultures, however, have raised doubts about the suitability of this test for groups with very dissimilar backgrounds. In such groups, moreover, the test was found to reflect amount of education and to be susceptible to considerable practice effect.

FIG. 42. Sample Items from the Progressive Matrices.
(Reproduced by permission of J. C. Raven.)

The manual for the Progressive Matrices is quite inadequate, giving little information on reliability and none on validity. Many investigations have been published, however, that provide relevant data on this test. In a review of publications appearing prior to 1957, Burke (1958) lists over 50 studies appearing in England, 14 in America, and 10 elsewhere. Since that time, research has continued at a rapid pace, especially in America where this test has received growing recognition. The Seventh Mental Measurements Yearbook lists nearly 400 studies, many dealing with the use of this test with clinical patients.

Retest reliability in groups of older children and adults that were moderately homogeneous in age varies approximately between .70 and .90. At the lower score ranges, however, reliability falls considerably below these values. Correlations with both verbal and performance tests of intelligence range between .40 and .75, tending to be higher with performance than with verbal tests. Studies with the mentally retarded and with different occupational and educational groups indicate fair concurrent validity. Predictive validity coefficients against academic criteria run somewhat lower than those of the usual verbal intelligence tests. Several factorial analyses suggest that the Progressive Matrices are heavily loaded with a factor common to most intelligence ... reasoning, perceptual accuracy, and other group factors also influence performance (Burke, 1958).

An easier form, the Coloured Progressive Matrices, is available for children between the ages of 5 and 11 years and for mentally retarded adults. A more advanced form has also been developed for superior adults, but its distribution is restricted to approved and registered users.

A still different approach is illustrated by the Goodenough Draw-a-Man Test, in which the examinee is simply instructed to "make a picture of a man; make the very best picture that you can." This test was in use without change from its original standardization in 1926 until 1963. An extension and revision was published in 1963 under the title of Goodenough-Harris Drawing Test (D. B. Harris, 1963). In the revision, as in the original test, emphasis is placed on the child's accuracy of observation and on the development of conceptual thinking, rather than on artistic skill. Credit is given for the inclusion of individual body parts, clothing details, proportion, perspective, and similar features. A total of 73 scorable items were selected on the basis of age differentiation, relation to total scores on the test, and relation to group intelligence test scores. Data for this purpose were obtained by testing samples of 50 boys and 50 girls at each grade level from kindergarten through the ninth grade in urban and rural areas of Minnesota and Wisconsin, stratified according to father's occupation.

In the revised scale, subjects are also asked to draw a picture of a woman and of themselves. The Woman scale is scored in terms of 71 items similar to those in the Man scale. The Self scale was developed as a projective test of personality, although available findings from this application are not promising. Norms on both Man and Woman scales were established on new samples of 300 children at each year of age from 5 to 10, selected so as to be representative of the United States population with regard to father's occupation and geographical region. Point scores on each scale are transmuted into standard scores with a mean of 100 and an SD of 15. In Figure 43 will be found three illustrative drawings produced by children aged 5-8, 8-8, and 12-11, together with the corresponding raw point scores and standard scores. An alternative, simplified scoring procedure is provided by the Quality scales for both Man and Woman drawings. Instead of the point scoring, the Quality scales utilize a global, qualitative assessment of the entire drawing, obtained by matching the child's drawing with the one it resembles most closely in a graded series of 12 samples.

The reliability of the Draw-a-Man Test has been repeatedly investigated by a variety of procedures. In one carefully controlled study of the earlier form administered to 386 third- and fourth-grade schoolchildren,
the retest correlation after a one-week interval was .68, and split-half reliability was .89 (McCarthy, 1944). Rescoring of the identical drawings

FIG. 43. Specimen Drawings Obtained in Goodenough-Harris Drawing Test. Panels: Man, Raw Score 7, CA 5-8; Woman, Raw Score 31, CA 8-8, Standard Score 103; Man, Raw Score 66, CA 12-11, Standard Score 134.
(Courtesy Dale B. Harris.)

scales, information regarding the construct validity of the test is provided by correlations with other intelligence tests. These correlations vary widely, but the majority are over .50. In a study with 100 fourth-grade children, correlations were found between the Draw-a-Man Test and a number of tests of known factorial composition (Ansbacher, 1952). Such correlations indicated that, within the ages covered, the Draw-a-Man Test correlates highest with tests of reasoning, spatial aptitude, and perceptual accuracy. Motor coordination plays a negligible role in the test at these ages. For kindergarten children, the Draw-a-Man Test correlated higher with numerical aptitude and lower with perceptual speed and accuracy than it did for fourth-grade children (D. B. Harris, 1963). Such findings suggest that the test may measure somewhat different functions at different ages.

The original Draw-a-Man Test has been administered widely in clinics as a supplement to the Stanford-Binet and other verbal scales. It has also been employed in a large number of studies on different cultural and ethnic groups, including several American Indian samples. Such investigations have indicated that performance on this test is more dependent on differences in cultural background than was originally assumed. In a review of studies pertaining to this test, Goodenough and Harris (1950, p. 399) expressed the opinion that "the search for a culture-free test, whether of intelligence, artistic ability, personal-social characteristics, or any other measurable trait is illusory." This view was reaffirmed by Harris in his 1963 book. More recently, Dennis (1966) analyzed comparative data obtained with this test in 40 widely different cultural groups, principally from 6-year-old children. Mean group scores appeared to be most closely related to the amount of experience with representational art within each culture. In the case of groups with little indigenous art, it
by a difT'erent scorer yielded r, scorer relia bilit\' of .90, and resconnp b]-
\\'as hypothesized that test performance reflects degree of acculturation
tl;e same scorer correlated .94. Studies with the new form (Dunn, 196 i;
to \ \' estern civilization. ~'
D. B. Harris, 1963) have yielded similar results. Readrninistration of, the
test to groups of kindergarten children on consecutive days re.vealen no Cultural differences in experiential background were again rewaled in
a well-designed comparative investigatio;1 of l\Iexican~ and American
significant difference in performance on different days. ~X~l1ml:er effect
was also found to be 11<:,gligible,as was the effect of art trammg m school. children with the Goodenough-Harris test (Laosa, Swartz. & Diaz-
The old and new scales are apparently quite similar; their scores cor- Guerrero, 1974). In studies of this test in i\igeria (Bakare, 1972) and
relate between .q and .98 in homogeneous age groups. The correlation Turkey (U9man, 1972), mean scores increased consistently and signifi-
of the }'1an and Woman scales is about as high as the split-half reliability cantly \\'ith the children's socioeconomic level. It should be added that
of the Man scale found in comparable samples. On this basis, Harris rec- these findings with the Goodenough-Harr:s test are typical of results ob-
ommends that the two scales be regarded as alternate forms and that the tained with all tests initially designed to be "culture-free" o'r "culture-
fai,r."
mean of their standard scores he used for greater reliability. The Quality
scales, representing a quicker but crueler scoring metholl, yield interscorn
Icliabilities dusterin':!, in the .80's. Correlations of about the same mao;-
nitude have been fo~md between Quality scale ratings and point scon:s AI'PHOACHLS TO CHOSS'CVLTURAL TESTI:\'f;, Theoreticallv we can identify
thr.:e npproadws to the c1evclonment of if",!, for n,C're;))],: [(,'2rce:1 in cl;{-
obl8ilipd for the same clrawin£':s.
f('rent cultUTi"S or subcultures, , . hO;i ir, ill ,-;f'tjr.(' ;",.,':,- [,,;;':r'- ";,, , : '::11
,\na~t froln the ilf:'ln-analvsis--'data q~ltht'li·-d in. the JC\'e~ 'lprncllt of the.
~ 1.." • '-'
three may be combined. The first approach involves the choice of items common to many cultures and the validation of the resulting test against local criteria in many different cultures. This is the basic approach of the culture-fair tests, although their repeated validation in different cultures has often been either neglected altogether or inadequately executed. Without such a step, however, we cannot be sure that the test is relatively free from culturally restricted elements. Moreover, it is unlikely that any single test could be designed that would fully meet these requirements across a wide range of cultures.

On the other hand, cross-cultural assessment techniques are needed for basic research on some very fundamental questions. One of these questions pertains to the generality of psychological principles and constructs derived within a single culture (Anastasi, 1958a, Ch. 18). Another question concerns the role of environmental conditions in the development of individual differences in behavior, a problem that can be more effectively studied within the wide range of environmental variation provided by highly dissimilar cultures. Research of this sort calls for instruments that can be administered under at least moderately comparable conditions in different cultures. Safeguards against incorrect interpretations of results obtained with such instruments should be sought in appropriate experimental designs and in the investigators' thorough familiarity with the cultures or subcultures under investigation.

A second major approach is to make up a test within one culture and administer it to individuals with different cultural backgrounds. Such a procedure would be followed when the object of testing is prediction of a local criterion within a particular culture. In such a case, if the specific cultural loading of the test is reduced, the test validity may also drop, since the criterion itself is culturally loaded. On the other hand, we should avoid the mistake of regarding any test developed within a single cultural framework as a universal yardstick for measuring "intelligence." Nor should we assume that a low score on such a test has the same causal explanation when obtained by a member of another culture as when obtained by a member of the test culture. What can be ascertained by such an approach is the cultural distance between groups, as well as the individual's degree of acculturation and his readiness for educational and vocational activities that are culture-specific.

As a third approach, different tests may be developed within each culture and validated against local criteria only. This approach is exemplified by the many revisions of the original Binet scales for use in different European, Asian, and African cultures, as well as by the development of tests for industrial and military personnel within particular cultures. A current example is provided by the test-development program conducted in several developing nations of Africa, Asia, and Latin America by the American Institutes for Research, under the sponsorship of the United States Agency for International Development (Schwarz, 1964a, 1964b; Schwarz & Krug, 1972). Another example is the long-term testing program of the National Institute of Personnel Research in Johannesburg (Blake, 1972). In such instances, the tests are validated against the specific educational and vocational criteria they are designed to predict, and performance is evaluated in terms of local norms. Each test is applied only within the culture in which it was developed and no cross-cultural comparisons are attempted. If the criteria to be predicted are technological, however, "Western-type intelligence" is likely to be needed, and the tests will reflect the direction in which the particular culture is evolving rather than its prevalent cultural characteristics at the time (see also Vernon, 1969, Ch. 14).

Attention should also be called to the publication, in the late 1960s and early 1970s, of several handbooks concerned with cross-cultural testing and research, and with the use of tests in developing countries (Biesheuvel, 1969; Brislin, Lonner, & Thorndike, 1974; Schwarz & Krug, 1972). All provide information on recommended tests, adaptations of standardized tests, and procedural guidelines for the development and application of tests. Further indication of the widespread interest in cross-cultural testing can be found in the report of an international conference on Mental Tests and Cultural Adaptation held in 1971 in Istanbul, Turkey (Cronbach & Drenth, 1972). The papers presented at this conference reflect the wide diversity of interests and backgrounds of the participants. The topics range from methodological problems and evaluations of specific instruments to theoretical discussions and reports of empirical studies.

The principal focus in both the handbooks and the conference report is on major cultural differences, as found among nations and among peoples at very different stages in their cultural evolution. In addition, a vast amount of literature has accumulated in the decades of the 1960s and 1970s on the psychological testing of minorities in the United States, chiefly for educational and vocational purposes. In the present book, this material is treated wherever it can be most clearly presented. Thus in Chapter 3, the focus was on social and ethical concerns and responsibilities in the use of tests with cultural minorities. The technical psychometric problems of test bias and item-group interactions were considered in Chapters 7 and 8. In the present chapter, attention was centered on instruments developed for cross-cultural ability testing. Problems in the interpretation of the results of cross-cultural testing will be considered in Chapter 12.

A final point should be reiterated about the instruments discussed in this section. Although initially developed for cross-cultural testing, several of these instruments have found a major application in the arma-
Tests of General Intellectual Level

While individual tests such as the Stanford-Binet and the Wechsler scales find their principal application in the clinic, group tests are used primarily in the educational system, civil service, industry, and the military services. It will be recalled that mass testing began during World War I with the development of the Army Alpha and the Army Beta for use in the United States Army. The former was a verbal test designed for general screening and placement purposes. The latter was a nonlanguage test for use with men who could not properly be tested with the Alpha owing to foreign-language background or illiteracy. The pattern established by these tests was closely followed in the subsequent development of a large number of group tests for civilian application.

Revisions of the civilian forms of both original army tests are still in use as Alpha Examination, Modified Form 9 (commonly known as Alpha 9) and as Revised Beta Examination. In the armed services, the Armed Forces Qualification Test (AFQT) is now administered as a preliminary screening instrument, followed by classification batteries developed within each service for assignment to occupational specialties. The AFQT provides a single score based on an equal number of vocabulary, arithmetic, spatial relations, and mechanical ability items.

In this chapter, major types of group tests in current use will be surveyed. First we shall consider the principal differences between group and individual tests. Then we shall discuss the characteristics of multilevel batteries designed to cover a wide age or grade range, with typical illustrations from different levels. Finally, group tests designed for use at the college level and beyond will be examined.
tests can be administered simultaneously to as many persons as can be fitted comfortably into the available space and reached through a microphone. Large-scale testing programs were made possible by the development of group testing techniques. By utilizing only printed items and simple responses that can be recorded on a test booklet or answer sheet, the need for a one-to-one relationship between examiner and examinee was eliminated.

A second way in which group tests facilitated mass testing was by greatly simplifying the examiner's role. In contrast to the extensive training and experience required to administer the Stanford-Binet, for example, most group tests require only the ability to read simple instructions to the examinees and to keep accurate time. Some preliminary training sessions are desirable, of course, since inexperienced examiners are likely to deviate inadvertently from the standardized procedure in ways that may affect test results. Because the examiner's role is minimized, however, group testing can provide more uniform conditions than does individual testing. The use of tapes, records, and film in test administration offers further opportunities for standardizing procedure and eliminating examiner variance in large-scale testing.

Scoring is typically more objective in group testing and can be done by a clerk. Most group tests can also be scored by computers through several available test-scoring services. Moreover, whether hand-scored or machine-scored, group tests usually provide separate answer sheets and reusable test booklets. Since in these tests all responses are written on the answer sheet, the test booklets can be used indefinitely until they wear out, thereby effecting considerable economy. Answer sheets also take up less room than test booklets and hence can be more conveniently filed for large numbers of examinees.

From another angle, group tests characteristically provide better established norms than do individual tests. Because of the relative ease and rapidity of gathering data with group tests, it is customary to test very large, representative samples in the standardization process. In the most recently standardized group tests, it is not unusual for the normative samples to number between 100,000 and 200,000, in contrast to the 2,000 to 4,000 cases laboriously accumulated in standardizing the most carefully developed individual intelligence scales.

Group tests necessarily differ from individual tests in form and arrangement of items. Although open-ended questions calling for free responses could be used (and were used in the early group tests), today the typical group test employs multiple-choice items. This change was obviously required for uniformity and objectivity of scoring, whether by hand or machine. With regard to arrangement of items, whereas the Binet type of scale groups items into age levels, group tests characteristically group items of similar content into separately timed subtests. Within each subtest, items are usually arranged in increasing order of difficulty. This arrangement ensures that each examinee has an opportunity to try each type of item (such as vocabulary, arithmetic, spatial, etc.) and to complete the easier items of each type before trying the more difficult ones on which he might otherwise waste a good deal of time.

A practical difficulty encountered with separate subtests, however, is that the less experienced or less careful examiners may make timing errors. Such errors are more likely to occur and are relatively more serious with several short time limits than with a single long time limit for the whole test. To reconcile the use of a single time limit with an arrangement permitting all examinees to try all types of items at successively increasing difficulty levels, some tests utilize the spiral-omnibus format. One of the earliest tests to introduce this format was the Otis Self-Administering Tests of Mental Ability which, as its name implies, endeavored to reduce the examiner's role to a minimum. The same arrangement is followed in the Otis-Lennon Mental Ability Test from the fourth-grade level up. In a spiral-omnibus test, the easiest items of each type are presented first, followed by the next easiest of each type, and so on in a rising spiral of difficulty level, as illustrated below:

1. The opposite of hate is:
   1. enemy, 2. fear, 3. love, 4. friend, 5. joy
2. If 3 pencils cost 25 cents, how many pencils can be bought for 75 cents?
3. A bird does not always have:
   1. wings, 2. eyes, 3. feet, 4. a nest, 5. a bill
4. The opposite of honor is:
   1. glory, 2. disgrace, 3. cowardice, 4. fear, 5. defeat

In order to avoid the necessity of repeating instructions in each item and to reduce the number of shifts in instructional set required of the examinees, some tests apply the spiral-omnibus arrangement not to single items but to blocks of 5 to 10 items. This practice is followed, for example, in the Armed Forces Qualification Test and in the Scholastic Aptitude Test of the College Entrance Examination Board.

DISADVANTAGES OF GROUP TESTING. Although group tests have several desirable features and serve a well-nigh indispensable function in present-day testing, their limitations should also be noted. In group testing, the examiner has much less opportunity to establish rapport, obtain cooperation, and maintain the interest of examinees. Any temporary condition of the examinee, such as illness, fatigue, worry, or anxiety, that may interfere with test performance is less readily detected in group than in individual testing. In general, persons unaccustomed to testing may be
somewhat more handicapped on group than on individual tests. There is also some evidence suggesting that emotionally disturbed children may perform better on individual than on group tests (Bower, 1969; Willis, 1970).

From another angle, group tests have been attacked because of the restrictions imposed on the examinee's responses. Criticisms have been directed particularly against multiple-choice items and against such standard item types as analogies, similarities, and classification (Hoffman, 1962; LaFave, 1966). Some of the arguments are ingenious and provocative. One contention is that such items may penalize a brilliant and original thinker who sees unusual implications in the answers. It should be noted parenthetically that if this happens, it must be a rare occurrence in view of the item analysis and validity data. Some critics have focused on the importance of analyzing errors and inquiring into the reasons why an individual chooses a particular answer, as in the typical Piagetian approach (Sigel, 1963). It is undoubtedly true that group tests provide little or no opportunity for direct observations of the examinee's behavior or for identifying the causes of atypical performance. For this and other reasons, when important decisions about individuals are to be made, it is desirable to supplement group tests either with individual examination of doubtful cases or with additional information from other sources.

Still another limitation of traditional group testing is its lack of flexibility, insofar as every examinee is ordinarily tested on all items. Available testing time could be more effectively utilized if each examinee concentrated on items appropriate to his ability level. Moreover, such a procedure would avoid boredom from working on too easy items, at one extreme, and mounting frustration and anxiety from attempting items beyond the individual's present ability level, at the other. It will be recalled that in some individual tests, such as the Stanford-Binet and the Peabody Picture Vocabulary Test, the selection of items to be presented by the examiner depends upon the examinee's prior responses. Thus in these tests the examinee is given only items within a difficulty range appropriate to his ability level.

COMPUTER UTILIZATION IN GROUP TESTING. In the effort to combine some of the advantages of individual and group testing, several innovative techniques are being explored. Major interest thus far has centered on ways of adjusting item coverage to the response characteristics of individual examinees. In the rapidly growing literature on the topic, this approach has been variously designated as adaptive, sequential, branched, tailored, individualized, programed, dynamic, or response-contingent testing (Baker, 1971; Glaser & Nitko, 1971; Weiss & Betz, 1973). Although it is possible to design paper-and-pencil group tests that incorporate such adaptive procedures (Cleary, Linn, & Rock, 1968; Lord, 1971a), these techniques lend themselves best to computerized test administration.

FIG. 44. Two-Stage Adaptive Testing with Three Measurement Levels. Each examinee takes routing test and one measurement test.

Adaptive testing can follow a wide variety of procedural models (DeWitt & Weiss, 1974; Larkin & Weiss, 1975; Weiss, 1974; Weiss & Betz, 1973). A simple example involving two-stage testing is illustrated in Figure 44. In this hypothetical test, all examinees take a 10-item routing test, whose items cover a wide difficulty range. Depending on his performance on the routing test, each examinee is directed to one of the three 20-item measurement tests at different levels of difficulty. Thus each person takes only 30 items, although the entire test comprises 70 items. A different arrangement is illustrated in the pyramidal test shown in Figure 45. In this case, all examinees begin with an item of intermediate difficulty. If an individual's response to this item is correct, he is routed upward to the next more difficult item; if his response is wrong, he moves downward to the next easier item. This procedure is repeated after each item response, until the individual has given 10 responses. The illustra-
individual's responses provide enough information for a decision about

needs of individual examinees, computers can help to circumvent other limitations of traditional group tests (Baker, 1971; Glaser & Nitko, 1971; B. F. Green, 1970). One potential contribution is the analysis of wrong responses, in order to identify types of error in individual cases. Another is the use of response modes that permit the examinee to try alternative responses in turn, with immediate feedback, until the correct response is chosen. Still another possibility is the development of special response procedures and item types to investigate the examinee's problem-solving techniques. For instance, following the initial presentation of the problem, the examinee may have to ask the computer for further information needed to proceed at each step in the solution; or he may be required to respond by indicating the steps he follows in arriving at the solution. Considerable research is also in progress with several relatively unrestricted response modes, such as underlining the appropriate words in sentences or constructing single-word responses.
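
The two adaptive procedures described earlier in this section (two-stage routing and the pyramidal arrangement) can be made concrete with a minimal sketch. The item counts (10 routing items, three 20-item measurement tests, 10 pyramidal responses) come from the text; the routing cutoffs, the level names, and all function names are hypothetical and serve only as illustration.

```python
# Minimal sketch of two-stage and pyramidal adaptive testing.
ROUTING_ITEMS = 10        # from the text: 10-item routing test
MEASUREMENT_ITEMS = 20    # from the text: each measurement test has 20 items
LEVELS = ("easy", "intermediate", "difficult")  # level names are assumed


def choose_measurement_test(routing_score, cutoffs=(4, 7)):
    """Route an examinee to one of the three measurement tests.

    The cutoffs are hypothetical; the text does not state routing rules.
    """
    if not 0 <= routing_score <= ROUTING_ITEMS:
        raise ValueError("routing score must lie between 0 and 10")
    low_cut, high_cut = cutoffs
    if routing_score < low_cut:
        return LEVELS[0]
    if routing_score < high_cut:
        return LEVELS[1]
    return LEVELS[2]


def items_taken():
    """Items answered by one examinee: routing test plus one measurement test."""
    return ROUTING_ITEMS + MEASUREMENT_ITEMS


def total_pool():
    """Items in the whole two-stage test: routing test plus all three levels."""
    return ROUTING_ITEMS + len(LEVELS) * MEASUREMENT_ITEMS


def pyramidal_path(responses):
    """Difficulty rank of each item presented in a pyramidal test.

    Rank 0 is the intermediate starting item; each correct response moves
    the examinee one step up, each wrong response one step down.
    """
    rank = 0
    path = []
    for correct in responses:
        path.append(rank)              # item presented at the current rank
        rank += 1 if correct else -1   # move before the next item is chosen
    return path


if __name__ == "__main__":
    print(choose_measurement_test(5), items_taken(), total_pool())
    print(pyramidal_path([True, True, False, True]))
```

Under these assumptions each examinee answers 30 of the 70 items in the two-stage design, while the pyramidal path records how difficulty rises after correct responses and falls after wrong ones.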
orally. By so doing, the examiner also controls the amount of time available to complete each item (about 15 seconds). The whole test requires approximately 25 to 30 minutes and is administered in two parts with a short rest period in between. Part I consists of 23 classification items; Part II contains a total of 32 items, designed to measure verbal conceptualization, quantitative reasoning, general information, and ability to follow directions. Typical items in each of these categories are shown in Figure 46.

Norms for all levels of the Otis-Lennon battery were obtained on a carefully chosen representative sample of over 200,000 pupils in 100 school systems drawn from all 50 states. Scores can be expressed as deviation IQ's with an SD of 16. Percentile ranks and stanines can also be found with reference to both age and grade norms. Well-constructed tests for the primary level have generally been found to have satisfactory reliability and criterion-related validity. The Otis-Lennon Primary II yielded an alternate-form reliability of .87 in a sample of 1,047 first-grade children, over an interval of two weeks. Split-half reliability in the total

General Information: Mark the picture of the thing we talk into.
Following Directions: Mark the picture that shows a glass inside a square with a cross on top.
FIG. 46. Items Illustrative of the Otis-Lennon Mental Ability Test, Primary I and Primary II Levels.
(Copyright, 1967, Harcourt Brace Jovanovich, Inc.)
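
The score conversions mentioned above for the Otis-Lennon norms (deviation IQ's with an SD of 16, percentile ranks, and stanines) can be illustrated with a short sketch. It assumes a normal distribution of scores, a deviation IQ mean of 100 (the passage states only the SD), and the conventional stanine scale with a mean of 5 and an SD of 2; published norms are read from tables built on the standardization sample, so the functions below are illustrative stand-ins with hypothetical names.

```python
import math


def deviation_iq(z, mean=100.0, sd=16.0):
    """Convert a z score to a deviation IQ (mean of 100 is an assumption)."""
    return mean + sd * z


def percentile_rank(z):
    """Percentage of a normal distribution falling below the given z score."""
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def stanine(z):
    """Stanine (1 to 9) for a z score: round 2z + 5, then clip to the scale."""
    raw = int(math.floor(2.0 * z + 5.0 + 0.5))
    return max(1, min(9, raw))


if __name__ == "__main__":
    for z in (-1.0, 0.0, 1.0):
        print(z, deviation_iq(z), round(percentile_rank(z), 1), stanine(z))
```

Under these assumptions, a pupil one standard deviation above the age mean would receive a deviation IQ of 116, a percentile rank of about 84, and a stanine of 7.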
elements; the items bear relatively little relation to formal school instruction.

Each subtest is preceded by practice exercises, the same set being used for all levels. In Figures 47, 48, and 49 will be found a typical item from each of the 10 subtests, with highly condensed instructions. In difficulty level, these items correspond to items given in grades 4 to 6. The authors recommend that all three batteries be given to each child, in three testing sessions. For most children, the Nonverbal Battery does not predict school achievement as well as do the Verbal and Quantitative batteries. However, a comparison of performance on the three batteries may provide useful information regarding special abilities or disabilities.

1. Quantitative Relations: If the amount or quantity in Column I is more than in Column II, mark A; if it is less, mark B; if they are equal, mark C.
2. Number Series: The numbers at the left are in a certain order. Find the number at the right that should come next.
3. Equation Building: Arrange the numbers and signs below to make true equations and then choose the number at the right that gives you a correct answer.
FIG. 48. Typical Items from Quantitative Battery of Cognitive Abilities Test. Answers: 1-B, 2-B, 3-A.
(Reproduced by courtesy of Robert L. Thorndike and Elizabeth Hagen.)

1. Figure Classification: The first three figures are alike in some way. Find the figure at the right that goes with the first three.
2. Figure Analogies: Decide how the first two figures are related to each other. Then find the one figure at the right that goes with the third figure in the same way that the second figure goes with the first.
3. Figure Synthesis: For each shaded area, decide whether or not it can be completely covered by using all the given black pieces without overlapping any.
FIG. 49. Typical Items from Nonverbal Battery of Cognitive Abilities Test. Answers: 1-B, 2-D, 3-3 & 4.
(Reproduced by courtesy of Robert L. Thorndike and Elizabeth Hagen and with permission of Houghton Mifflin Company.)

The standardization sample, including approximately 20,000 cases in each of the 10 grade groups, was carefully chosen to represent the school population of the country. Raw scores on each battery are translated into a uniform scale across all levels to permit continuity of measurement and comparability of scores in different grades. For normative interpretations, the scores on each battery can be expressed as normalized standard scores within each age, with a mean of 100 and an SD of 16. Percentiles and stanines can also be found, within ages and within grades. The manual advises against combining scores from the three batteries into a single index.

Kuder-Richardson reliabilities of the three battery scores, computed within grades, are mostly in the .90's. The manual also reports standard errors of measurement for different grades and score levels, as well as the minimum interbattery score differences that can be considered to have statistical and practical significance. Intercorrelations of battery scores are in the high .60's and .70's; intercorrelations of subtests are also unusually high. Factor analyses likewise showed the presence of a large general factor through the three batteries, probably representing principally the ability to reason with abstract and symbolic content.

The Cognitive Abilities Test was standardized on the same normative sample as two achievement batteries, the Iowa Tests of Basic Skills (ITBS) for grades 3 to 8 and the Tests of Academic Progress (TAP) for grades 9 to 12. Concurrent validity of the Cognitive Abilities Test against the ITBS, found within single-grade groups of 500 cases from the standardization sample, ranged from the .50's to the .70's. As is generally found with academic criteria, the Verbal Battery yielded the highest correlations with achievement in all school subjects, except for arithmetic which tended to correlate slightly higher with the Quantitative Battery. Correlations with the Nonverbal Battery were uniformly lower than with the other two batteries.

Correlations with achievement tests over a three-year interval are of about the same magnitude as the concurrent correlations. Predictive validity coefficients against school grades obtained from one to three years later run somewhat lower, falling mostly in the .50's and .60's. These correlations are probably attenuated by unreliability and other extraneous variance in grading procedures.

HIGH SCHOOL LEVEL. It should be noted that the high school levels of multilevel batteries, as well as other tests designed for high school students, are also suitable for testing general, unselected adult groups. Another source of adult tests is to be found in the tests developed for military personnel and subsequently published in civilian editions. Short screening tests for job applicants will be considered in Chapter 15.
An example of a group intelligence test for the high school level is
provided by Level 2 of the School and College Ability Tests (SCAT),
Series II, designed for grades 9 to 12. At all levels of the SCAT series,
tests are available in two equivalent forms, A and B. Oriented specifically
toward the prediction of academic achievement, all levels yield a verbal,
a quantitative, and a total score. The verbal score is based on a verbal
analogies test. The items in this test, however, differ from the traditional
analogies items in that the respondent must choose both words in the
second pair, rather than just the fourth word. The quantitative score is
derived from a quantitative comparison test designed to assess the
examinee's understanding of fundamental number operations. Covering
both numerical and geometric content, these items require a minimum of
reading and emphasize insight and resourcefulness rather than traditional
computational procedures. Figure 50 shows sample verbal and quantitative
items taken from a Student Bulletin distributed to SCAT examinees
for preliminary orientation purposes. The items reproduced in
Figure 50 fall approximately in the difficulty range covered by Level 2.
At all levels, testing time is 40 minutes, 20 minutes for each part.

Part I - Verbal Ability: each item begins with two words that go together
in a certain way. Find the two other words that go together in about the
same way.

1  tool : hammer ::
   A  table : chair
   B  toy : doll
   C  weapon : metal
   D  sleigh : bell

2  braggart : humility ::
   A  traitor : repentance
   B  radical : conventionality
   C  precursor : foresight
   D  sophisticate : predisposition

Part II - Mathematical Ability: each item is made up of two amounts or
quantities, one in Column A and one in Column B. You have four choices:
A, B, C, or D. Choose A if the quantity in Column A is greater, B if that
in Column B is greater, C if the quantities are equal, or D if there is not
enough information for you to tell about their sizes.

(Items 3 to 5: quantity comparisons in Columns A and B, including the
areas of triangles STU and PQR; figure not reproduced.)

FIG. 50. Typical Items from SCAT Series II, Level 2, for Grades 9 to 12.
Answers: 1-B, 2-B, 3-D, 4-A, 5-C.
(From Student Bulletin-SCAT-Series II. Copyright © 1967 by Educational Testing
Service. All rights reserved. Reprinted by permission.)

In line with current trends in testing theory, SCAT undertakes to
measure developed abilities. This is simply an explicit admission of what
is more or less true of all intelligence tests, namely that test scores reflect
the nature and amount of schooling the individual has received rather
than measuring "capacity" independently of relevant prior experiences.
Accordingly, SCAT draws freely on word knowledge and arithmetic
processes learned in the appropriate school grades. In this respect, SCAT
does not really differ from other intelligence tests, especially those
designed for the high school and college levels; it only makes overt a
condition sometimes unrecognized in other tests.
Verbal, quantitative, and total scores from all SCAT levels are
expressed on a common scale which permits direct comparison from one
level to another. These scores can in turn be converted into percentiles
or stanines for the appropriate grade. A particularly desirable feature of
SCAT scores is the provision of a percentile band in addition to a single
percentile for each obtained score. Representing a distance of
approximately one standard error of measurement on either side of the
corresponding percentile, the percentile band gives the 68 percent confidence
interval, or the range within which are found 68 percent of the cases in
a normal curve. In other words, if we conclude that an individual's true
score lies within the given percentile band, the chances of our being
correct are 68 out of 100 (roughly 2:1 ratio). As explained in Chapter 5,
the error of measurement provides a concrete way of taking the reliability
of a test into account when interpreting an individual's score.
If two percentile bands overlap, the difference between the scores can
be ignored; if they do not overlap, the difference can be regarded as
significant. Thus, if two students were to obtain total SCAT scores that fall
in the percentile bands 55-68 and 74-84, we could conclude with fair
confidence that the second actually excels the first and would continue to
do so on a retest. Percentile bands likewise help in comparing a single
individual's relative standing on verbal and quantitative parts of the test.
If a student's verbal and quantitative scores correspond to the percentile
bands 66-86 and 58-78, respectively, we would conclude that he is not
significantly better in verbal than in quantitative abilities, because his
percentile bands for these two scores overlap (see Fig. 51).

FIG. 51. SCAT-II Profile, Illustrating Percentile Bands. (Profile chart of
percentile bands on a scale running from Very Low through Low, Average,
and High to Very High.)
(From Student Bulletin-SCAT-Series II. Copyright © 1967 by Educational Testing
Service. All rights reserved. Reprinted by permission.)

The SCAT standardization sample of over 100,000 cases was selected
so as to constitute a representative cross-section of the national school
population in grades 4 through 12 and the first two years of college. For
this purpose, a three-stage sampling procedure was employed, in which
the units to be sampled were school systems (public and private),
schools, and classrooms, respectively. Similar successive sampling
procedures were followed for the college sample. The selection of the
standardization sample, as well as other test-construction procedures
followed in the development of SCAT, sets an unusually high standard of
technical quality.
Reliability coefficients for verbal, quantitative, and total scores were
separately computed within single grade groups by the Kuder-Richardson
technique. The reported reliabilities are uniformly high. Within separate
grade groups, from grade 4 to 14, total score reliabilities are all .90 or
above; verbal and quantitative reliabilities vary between .83 and .91.
These reliabilities may be spuriously high because the tests are somewhat
speeded. The percentage of students reaching the last item in different
grades ranged from 65 to 96 on the verbal test and from 55 to 85 on
the quantitative test. Under these circumstances, equivalent-form
reliability would seem more appropriate. If the reliability coefficients are in
fact spuriously high, the errors of measurement are underestimated; and
hence the percentile bands should be wider.
It should be noted, however, that many of the students who did not
record an answer for all items may have abandoned any attempt to solve
the more difficult items even though they had enough time. In the
quantitative test, moreover, a student can waste an inordinate amount of time
in reaching the answer to certain items by counting or computation,
when the proper recognition of numerical relations would yield the
answer almost instantaneously. Under these conditions, speed of
performance would be highly correlated with the quantitative reasoning abilities
the test is designed to measure.
In view of the stated purpose for which SCAT was developed, its
predictive validity against academic achievement is of prime relevance.
Follow-up data in grades 5, 8, 11, and 12 were obtained from schools
participating in the standardization sample. Validity coefficients were
computed in each school and then averaged for each grade across
schools, the number of available schools per grade ranging from 3 to 26.
For the four grades investigated, the average correlation between SCAT
Total and grade-point average ranged from .59 to .68; for SCAT Verbal
and English grades, the range was .41 to .69; and for SCAT Quantitative
and mathematics grades, the range was .43 to .65. Because individual
correlations varied widely from school to school, however, the manual
recommends local validation.
Correlations with achievement tests (Sequential Tests of Educational
Progress) generally fell between .60 and .80. Quantitative scores tend to
correlate higher than Verbal scores with mathematics achievement; and
Verbal scores tend to correlate higher than Quantitative scores with
achievement in all other subjects. Total SCAT scores, however, generally
yield validity coefficients as high as those of either of the two part scores
or higher. Thus the effectiveness of Verbal and Quantitative scores as
differential predictors of academic achievement remains uncertain. It is
noteworthy in this connection that the Verbal and Quantitative scores are
themselves highly correlated. These correlations are in the .70's, except at
the lowest and highest grades, where they drop to the .60's. Such close
similarity may result from the item types employed in the two tests,
which involve largely the ability to detect and utilize relations in
symbolic and abstract content. Like other tests discussed in this chapter,
SCAT was designed principally as a measure of general intellectual
development and only secondarily as an indicator of intraindividual
aptitude differences.
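
The percentile-band reasoning described for SCAT can be illustrated with a short sketch. It is not the publisher's procedure: actual SCAT bands are read from norm tables, whereas this sketch assumes normally distributed scores and uses the standard relation SEM = SD × sqrt(1 - reliability). The function names and the mean and SD values are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r); a lower reliability gives a larger SEM."""
    return sd * math.sqrt(1.0 - reliability)


def percentile_band(score, mean, sd, reliability):
    """Percentiles at one SEM below and above the obtained score.

    Assumes a normal score distribution (an assumption of this sketch,
    not something stated for SCAT norms).
    """
    sem = standard_error_of_measurement(sd, reliability)
    dist = NormalDist(mean, sd)
    low = 100.0 * dist.cdf(score - sem)
    high = 100.0 * dist.cdf(score + sem)
    return low, high


def bands_overlap(band_a, band_b):
    """True if two (low, high) percentile bands share any point."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]


# The two comparisons given in the text.
print(bands_overlap((55, 68), (74, 84)))  # two students, total scores
print(bands_overlap((66, 86), (58, 78)))  # one student, verbal vs. quantitative

# Hypothetical scale (mean 50, SD 10): a lower reliability widens the band.
print(percentile_band(55.0, 50.0, 10.0, 0.91))
print(percentile_band(55.0, 50.0, 10.0, 0.83))
```

The last two calls show why a spuriously high reliability coefficient matters: if the true reliability is lower than reported, the error of measurement is larger and the band should be wider.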
COLLEGE ADMISSION. A number of tests have been developed for use in
the admission, placement, and counseling of college students. An
outstanding example is the Scholastic Aptitude Test (SAT) of the College
Entrance Examination Board. Several new forms of this test are
prepared each year, a different form being employed in each administration.
Separate scores are reported for the Verbal and Mathematics sections of
the test. Figures 52 and 53 contain brief descriptions of the verbal and
mathematics item types, with illustrations. The items reproduced are
among those given in the orientation booklet distributed to prospective
examinees (College Entrance Examination Board, 1974b). Changes
introduced in 1974 on an experimental basis include the addition of a Test
of Standard Written English and the separate reporting of a vocabulary
score (based on the antonyms and analogies items) and a reading
comprehension score (based on the sentence completion and reading
comprehension items).

Sentence Completions: choose the one word or set of words which, when
inserted in the sentence, best fits in with the meaning of the sentence as a
whole.

2. From the first the islanders, despite an outward __, did what they could
   to __ the ruthless occupying power.
   (A) harmony .. assist   (B) enmity .. embarrass   (C) rebellion .. foil
   (D) resistance .. destroy   (E) acquiescence .. thwart

Analogies: select the lettered pair which best expresses a relationship
similar to that expressed in the original pair.

3. CRUTCH : LOCOMOTION ::
   (A) paddle : canoe   (B) hero : worship   (C) horse : carriage
   (D) spectacles : vision   (E) statement : contention

Reading Comprehension: examinee reads a passage and answers multiple-choice
questions assessing his understanding of its content.

FIG. 52. Illustrative Verbal Items from CEEB Scholastic Aptitude Test.
Instructions have been highly condensed. Answers: 1-B, 2-E, 3-D.
(From About the SAT 1974-75, pp. 4, 5. College Entrance Examination Board, New
York. Reprinted by permission of Educational Testing Service, copyright owner
of the test questions.)

Standard Multiple-Choice Questions: drawing upon elementary arithmetic,
algebra, and geometry taught in the ninth grade or earlier, these items
emphasize insightful reasoning and the application of principles.

Quantitative Comparisons: mark (A) if the quantity in Column A is the
greater, (B) if the quantity in Column B is the greater, (C) if the two
quantities are equal, (D) if the relationship cannot be determined from the
information given.

   Column A          Column B
   3 × 353 × 8       4 × 352 × 6

FIG. 53. Illustrative Mathematics Items from CEEB Scholastic Aptitude Test.
Instructions have been highly condensed. Answers: 1-E, 2-C.
(From About the SAT 1974-75, pp. 4, 5. College Entrance Examination Board, New
York. Reprinted by permission of Educational Testing Service, copyright owner
of the test questions.)

First incorporated into the CEEB testing program in 1926, the SAT has
undergone continuing development and extensive research of high
technical quality. One of the reviewers in the Seventh Mental Measurements
Yearbook writes: "Technically, the SAT may be regarded as highly
perfected, possibly reaching the pinnacle of the current state of the art
of psychometrics" (DuBois, 1972). Another comments: "The system of
pretesting of items, analysis, and standardization of new forms
exemplifies the most sophisticated procedures in modern psychometrics" (W. L.
Wallace, 1972). Several aspects of the SAT research have been cited in
different chapters of this book to illustrate specific procedures. A
detailed account of methodology and results can be found in the technical
report edited by Angoff (1971b). A shorter comparable form, known as
the Preliminary SAT, has also been in use since 1959. Generally taken at
an earlier stage, this test provides a rough estimate of the high school
student's aptitude for college work and has been employed for
educational counseling and other special purposes. Both tests are restricted to
the testing program administered by the College Entrance Examination
Board on behalf of member colleges. All applicants to these colleges take
the SAT. Some colleges also require one or more achievement tests in
special fields, likewise administered by CEEB.
Another nationwide program, launched in 1959, is the American
College Testing Program (ACT). Originally limited largely to state
university systems, this program has grown rapidly and is now used by many
colleges throughout the country. The ACT Test Battery includes four
parts: English Usage, Mathematics Usage, Social Studies Reading, and
Natural Sciences Reading. Reflecting the point of view of its founder,
E. F. Lindquist, the examination provides a set of work samples of
college work. It overlaps traditional aptitude and achievement tests,
focusing on the basic intellectual skills required for satisfactory performance in
college.
Technically, the ACT does not come up to the standards set by the
SAT. Reliabilities are generally lower than desirable for individual
decisions. The separate scores are somewhat redundant insofar as the four
parts are heavily loaded with reading comprehension and highly
intercorrelated. On the other hand, validity data compare favorably with
those found for other instruments in similar settings. Correlations
between composite scores on the whole battery and college grade-point
averages cluster around .50. Most of these validity data were obtained
through research services made available to member colleges by the ACT
program staff. The program also provides extensive normative and
interpretive data and other ancillary services.
In addition to the above restricted tests, a number of tests designed
for college-bound high school students and for college students are
commercially available to counselors and other qualified users. An example
is the College Qualification Tests. This battery offers six possible scores:
Verbal, Numerical, Science Information, Social Studies Information,
Total Information, and a Total score on the entire test. The information
demanded in the various fields is of a fairly general and basic nature and is
not dependent on technical aspects of particular courses. Reliability and
normative data compare favorably with those of similar tests. Validity
data are promising but understandably less extensive than for tests that
have been used more widely.
It will be noted that, with the exception of the College Board's SAT
(which can be supplemented with achievement tests), all these tests
sample a combination of general aptitudes and knowledge about (or
ability to handle) subject matter in major academic fields. When
separate scores are available, their differential validity in predicting
achievement in different fields is questionable. It would seem that total score
provides the best predictor of performance in nearly all college courses.
Among the part-scores, verbal scores are usually the best single
predictors. Another important point to bear in mind is that scores on any of
these tests are not intended as substitutes for high school grades in the
prediction of college achievement. High school grades can predict college
achievement as well as most tests or better. When test scores are
combined with high school grades, however, the prediction of college
performance is considerably improved.

GRADUATE SCHOOL ADMISSION. The practice of testing applicants for
admission to college was subsequently extended to include graduate and
professional schools.1 Most of the tests designed for this purpose
represent a combination of general intelligence and achievement tests. A
well-known example is the Graduate Record Examinations (GRE). This series
of tests originated in 1936 in a joint project of the Carnegie Foundation
for the Advancement of Teaching and the graduate schools of four
eastern universities. Now greatly expanded, the program is conducted by
Educational Testing Service, under the general direction of the Graduate
Record Examinations Board. Students are tested at designated centers
prior to their admission to graduate school. The test results are used by
the universities to aid in making admission and placement decisions and
in selecting recipients for scholarships, fellowships, and special
appointments.

1 The testing of applicants to professional schools will be discussed in
Chapter 15, in connection with occupational tests.

The GRE include an Aptitude Test and an Advanced Test in the
student's field of specialization. The latter is available in many fields, such
as biology, English literature, French, mathematics, political science, and
psychology. The Aptitude Test is essentially a scholastic aptitude test
suitable for advanced undergraduates and graduate students. Like many
such tests, it yields separate Verbal and Quantitative scores. The verbal
items require verbal reasoning and comprehension of reading passages
taken from several fields. The quantitative items require arithmetic and
algebraic reasoning, as well as the interpretation of graphs, diagrams,
and descriptive data.
Scores on all GRE tests are reported in terms of a single standard score
scale with a mean of 500 and an SD of 100. These scores are directly
comparable for all tests, having been anchored to the Aptitude Test
scores of a fixed reference group of 2,095 seniors examined in 1952 at 11
colleges. A score of 500 on an Advanced Physics Test, for example, is the
score expected from physics majors whose Aptitude Test score equals the
mean Aptitude Test score of the reference group. Since graduate school
applicants are a selected sample with reference to academic aptitude,
the means of most groups actually taking each Advanced Test in the
graduate student selection program will be considerably above 500.
Moreover, there are consistent differences in the intellectual caliber of
students majoring in different subjects. For normative interpretation,
therefore, the current percentiles given for specific groups are more
relevant and local norms are still better.
The reliability and validity of the GRE have been investigated in a
number of different student samples (Guide for the use of the GRE,
1973). Kuder-Richardson reliabilities of the Verbal and Quantitative
scores of the Aptitude Test and of total scores on the Advanced Tests
are consistently over .90. Several Advanced Tests also report scores in
two or three major subdivisions of the field, such as experimental and
social psychology. The reliabilities of these subscores are mostly in the
.80's. The lower reliabilities, as well as the high intercorrelations among
[...] mathematical ability is of major importance; the reverse was true of
such fields as English. The GRE Advanced Test was the most generally valid
single predictor among those investigated. Illustrative data from three
fields can be seen in Figure 54, showing the percentage of students
attaining the PhD in successive intervals of Advanced Test scores. The three
coefficients given in Figure 54 are biserial correlations between GRE
Advanced Test scores and attainment or nonattainment of the PhD.

FIG. 54. Percentage of Students at Various Levels of GRE Advanced Test
Scores Who Attained the PhD within 10 Years.
(From Willingham, 1974, p. 276; data from Creager, 1965. Copyright 1974 by the
American Association for the Advancement of Science.)

The highest validities were obtained with weighted composites of
undergraduate grade-point average and one or more GRE scores. These
multiple correlations fell mostly between .40 and .45 for various criteria
and for different fields. It should be noted that the narrow range of
talent covered by graduate school applicants necessarily results in lower
correlations than are obtained with the SAT among college applicants.
This finding does not imply that the GRE is intrinsically less valid than
the SAT; rather it means that finer discriminations are required within
the more narrowly restricted graduate school population.
Another widely used test for the selection of graduate students is the
Miller Analogies Test. Consisting of complex analogies items whose
subject matter is drawn from many academic fields, this test has an
unusually high ceiling. Although a 50-minute time limit is imposed, the
test is primarily a power test. The Miller Analogies Test was first
developed for use at the University of Minnesota, but later forms were
made available to other graduate schools and it was subsequently
published by the Psychological Corporation. Its administration, however, is
restricted to licensed centers in universities or business organizations. The
test is used both in the selection of graduate students and in the
evaluation of personnel for high-level jobs in industry. It is available in five
parallel forms, one of which is reserved for reexaminations.
Percentile norms on the Miller Analogies Test are given for graduate
and professional school students in several fields and for groups of
industrial employees and applicants. Over half of these groups contained
500 or more cases and none had less than 100. Marked variations in test
performance are found among these different samples. The median of one
group, for example, corresponds to the 90th percentile of another. Means
and SD's for additional, smaller industrial samples are also reported as
further aids in normative interpretation.
Odd-even reliability coefficients of .92 to .95 were found in different
samples, and alternate-form reliabilities ranged from .85 to .90.
Correlations with several individual and group tests of intelligence and academic
aptitudes fall almost entirely between the .50's and the .70's. Over 100
validity coefficients are reported for graduate and professional student
groups and for a few industrial samples. These coefficients vary widely.
Slightly over a third are between .30 and .60. About an equal number
are clearly too low to be significant. The field of specialization, the nature
of the criteria employed, and the size, heterogeneity, and other
characteristics of the samples are obvious conditions affecting these
coefficients. Means and SD's of several contrasting groups in different settings
provide some additional promising validity data. It is evident that the
validity of this test must be evaluated in terms of the specific context in
which it is to be used.

SUPERIOR ADULTS. Any test designed for college or graduate students is
also likely to be suitable for examining superior adults for occupational
assessment, research, or other purposes. The use of the Miller Analogies
Test for the selection and evaluation of high-level industrial personnel
has already been mentioned. Another test that provides sufficient ceiling
for the examination of highly superior adults is the Concept Mastery
Test (CMT). Originating as a by-product of Terman's extensive
longitudinal study of gifted children, Form A of the Concept Mastery Test
was developed for testing the intelligence of the gifted group in early
maturity (Terman & Oden, 1947). For a still later follow-up, when the
gifted subjects were in their mid-forties, Form T was prepared (Terman
& Oden, 1959). This form, which is somewhat easier than Form A, was
subsequently released for more general use.
The Concept Mastery Test consists of both analogies and
synonym-antonym items. Like the Miller Analogies Test, it draws on concepts from
many fields, including physical and biological sciences, mathematics,
history, geography, literature, music, and others. Although
predominantly verbal, the test incorporates some numerical content in the
analogies items.
Percentile norms are given for approximately 1,000 cases in the
Stanford gifted group, tested at a mean age of 41 years, as well as for smaller
samples of graduate students, college seniors applying for Ford
Foundation Fellowships in Behavioral Sciences, and engineers and scientists in
a navy electronics laboratory. To provide further interpretive data, the
manual (with 1973 supplement) reports means and SD's of some 20
additional student and occupational samples.
Alternate-form reliabilities of the CMT range from .86 to .94. Scores
show consistent rise with increasing educational level and yield
correlations clustering around .60 with predominantly verbal intelligence tests.
Significant correlations with grade-point averages were found in seven
college samples, the correlations ranging from .26 to .59. Some suggestive
findings in other contexts are also cited. For example, in two groups of
managers participating in advanced management training programs,
CMT scores correlated .40 and .45 with peer ratings of ability to think
critically and analytically. And in a group of 200 experienced elementary
and secondary school teachers, CMT correlated .54 with a scale designed
to measure teachers' attitude toward gifted children. Evidently the
teachers who themselves scored higher on this test had more favorable
attitudes toward gifted children.
Because of its unique features, the Concept Mastery Test can
undoubtedly serve a useful function for certain testing purposes. On the
other hand, it is clearly not an instrument that can be used or interpreted
routinely. A meaningful interpretation of CMT scores requires a careful
study of all the diverse data accumulated in the manual, preferably
supplemented by local norms.
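
The point made earlier in connection with the GRE, that a narrow range of talent necessarily lowers validity coefficients, can be made concrete with a small sketch. The text gives no formula; the one used here is the commonly cited correction for direct selection on the predictor, which assumes a linear relation and equal error variance across the range. The function names are hypothetical, the unrestricted SD of 100 follows the GRE score scale, and the restricted SD of 50 is an invented value for illustration only.

```python
import math


def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Estimate the correlation in the wider group from the one observed
    in a group selected directly on the predictor."""
    k = sd_unrestricted / sd_restricted
    r2 = r_restricted ** 2
    return (r_restricted * k) / math.sqrt(1.0 - r2 + r2 * k * k)


def restrict_range(r_unrestricted, sd_unrestricted, sd_restricted):
    """Inverse direction: the correlation expected after the predictor's
    SD shrinks from sd_unrestricted to sd_restricted."""
    u = sd_restricted / sd_unrestricted
    r2 = r_unrestricted ** 2
    return (r_unrestricted * u) / math.sqrt(1.0 - r2 + r2 * u * u)


# Hypothetical case: r = .45 observed where the predictor SD is 50
# instead of the scale SD of 100.
r_wide = correct_for_range_restriction(0.45, 100.0, 50.0)
print(round(r_wide, 2))
print(round(restrict_range(r_wide, 100.0, 50.0), 2))
```

The same test can thus show a noticeably lower coefficient in a selected group without being intrinsically less valid, which is the interpretation the text gives for the GRE relative to the SAT.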
C HA PTE R 12
An important approach to the understanding of the construct, "intel-
ligence," is through longitudinal studies of the same individuals over
PS)lclwlogicallssucs long periods of time. AlthouO'h
as contributing
o such investiaations
b. may. be reO'arded
to the long-term predictiw validation of speCific tests,
,:,
they have broader implications for the nature of intelligence and the
in In.telligcncc Testing meaning of an IQ. When intelligence was believed to be largely an ex-
pression of hereditary potential, each indi\'idual's IQ was expected to
remain very nearly constant throuO'hout o. life.
_ An\' obs·erved'·variationJill._
_._ ..- ~
retesmtg-,'\Cl'S<l.ttrihuTea~fC)\\·E'aknesses ii1the n';;;suring instrument-
either inadequate reliability or poor selection of functions tested. \Vith
P.:,
:.
.
, SYCHOLOGICAL
t 1 Like all tools
tests should be regar d e d as 00 s. . .'
~heir eEectiveness depends on the knowledge, skill, and .mtegnty of
b l'
the user. A hammer can e emp 0) e 0 u·
d t b ild a crude ]'atchen table
.
, increasing research on the nature of intelligence, however, has come the
realization that il~~:2ce itself is both complex and dynamic. In the
following sectioris, we sllaIr examine typJCiil £ndil1gs of longitudinal
or a fine cabinet-or as a weapon of assault. Since psychol~g1Ca~ tes:s studies of inteJJigence and shall inquire into the conditions making for
are measures of behavior, the interpretation of test results requnes kno\\l- both stability and instability of the IQ.
ed\!e about human behavior. ~\',~ho.!og~ca,l ~~sts can~?~. ,~~ y~_op~r1y,ap.::
pli~d outside the context of pS)'chological science. F~ml1Jant) wlth ~ele-
-"aJ'lt bel1a\'ioral lesearch is needed not only by the test constmctor but STABILITY OF THE IQ. An extensiH' bod v of data has accumulated
showing that, over the elementary, high sch~ol, and college period, intel-
also b\' the test user. .' . f
An inevitable consequence of the expansion an~ gr?\\'mg ~omplexlty 0 ligence test performance is quite stable (see Anastasi, 19.58a, pp. 2:32-
an\' scientific endeavor is an increasing speclalizat101.1 of mterests and 238; ~'lcCal1, Appelbaum, & Bogart\·, 1973). In a S'-vedish stud\' of a
fU;lctions among its practitioners. Such specialization lS. clearly apparent relatively unselected population: fo;' example, Husen (19.51) f~und a
in the relationship of psn:hological testing to the mamstream of co.n- correlation of .';2. between the test scores of 618 third-grade school boys
temporary psychology (Anastasi, 1967). Specialists in psychometrics have raised techniques of test construction to truly impressive pinnacles of quality. While providing technically superior instruments, however, they have given relatively little attention to ensuring that test users had the psychological information needed for the proper use of such instruments. As a result, outdated interpretations of test performance all too often survived without reference to the results of pertinent behavioral research. This partial isolation of psychological testing from other fields of psychology, with its consequent misuses and misinterpretations of tests, accounts for some of the public discontent with psychological testing in the decades of the [...]. The topics chosen for discussion in this chapter illustrate ways in which the findings of psychological research can contribute to the effective use of intelligence tests and can help to correct popular misconceptions about the IQ and similar scores.

[...] and the scores obtained by the same persons 10 years later on their induction into military service. In a later Swedish study, Harnqvist (1968) reports a correlation of .78 between tests administered at 13 and 18 years of age to over 4,500 young men. Even preschool tests show remarkably high correlations with later retests. In a longitudinal study of 140 children conducted at Fels Research Institute (Sontag, Baker, & Nelson, 1958), Stanford-Binet scores obtained at 3 and at 4 years of age correlated .83. The correlation with the 3-year tests decreased as the interval between retests increased, but by age 12 it was still as high as .46. Of special relevance to the Stanford-Binet is the follow-up conducted by Bradway, Thompson, and Cravens (1958) on children originally tested between the ages of 2 and 5½ as part of the 1937 Stanford-Binet standardization sample. Initial IQ's correlated .65 with 10-year retests and .59 with 25-year retests. The correlation between the 10-year retest (mean age = 14 years) and 25-year retest (mean age = 29 years) was .85.
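The fall in retest correlation as the interval lengthens (.83 from age 3 to 4, but .46 from age 3 to 12) is roughly what a purely cumulative model of growth would produce. The following is a minimal sketch, not a method taken from the studies cited. It assumes, for illustration only, that each child's score is the running sum of independent yearly increments of equal variance, in which case the correlation between scores at ages a and b (a no greater than b) is sqrt(a/b). The function names, sample size, and seed are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def expected_retest_r(age_first, age_retest):
    """Theoretical correlation under the assumed model: sqrt(a / b)."""
    return math.sqrt(age_first / age_retest)


def simulate_retest_r(age_first, age_retest, n_children=20000, seed=0):
    """Simulate scores as cumulative sums of independent yearly increments.

    Assumption (illustrative only): every yearly increment is drawn
    independently from the same normal distribution, and scores carry no
    measurement error. Requires 1 <= age_first <= age_retest.
    """
    rng = random.Random(seed)
    first_scores, retest_scores = [], []
    for _ in range(n_children):
        total = 0.0
        score_at_first = 0.0
        for year in range(1, age_retest + 1):
            total += rng.gauss(0.0, 1.0)  # this year's new acquisitions
            if year == age_first:
                score_at_first = total    # score recorded at the first test
        first_scores.append(score_at_first)
        retest_scores.append(total)       # score recorded at the retest
    return pearson(first_scores, retest_scores)


if __name__ == "__main__":
    for a, b in [(3, 4), (3, 12)]:
        print(a, b, round(expected_retest_r(a, b), 2),
              round(simulate_retest_r(a, b), 2))
```

Under these assumptions `expected_retest_r(3, 4)` is about .87 and `expected_retest_r(3, 12)` is .50, the same order of magnitude as the .83 and .46 reported for the Fels sample. Real test scores also contain errors of measurement, which the sketch ignores.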
As would be expected, retest correlations are higher, the shorter the interval between tests. With a constant interval between tests, moreover, retest correlations tend to be higher the older the children. The effects of age and retest interval on retest correlations exhibit considerable regularity and are themselves highly predictable (Thorndike, 1933, 1940).

One explanation for the increasing stability of the IQ with age is provided by the cumulative nature of intellectual development. The individual's intellectual skills and knowledge at each age include all his earlier skills and knowledge plus an increment of new acquisitions. Even if the annual increments bear no relation to each other, a growing consistency of performance level would emerge, simply because earlier acquisitions constitute an increasing proportion of total skills and knowledge as age increases. Predictions of IQ from age 10 to 16 would thus be more accurate than from 3 to 9, because scores at 10 include over half of what is present at 16, while scores at 3 include a much smaller proportion of what is present at 9.

Anderson (1940) described this relationship between successive scores as the overlap hypothesis. He maintained that, "Since the growing individual does not lose what he already has, the constancy of the IQ is in large measure a matter of the part-whole or overlap relation" (p. 394). In support of this hypothesis, Anderson computed a set of correlations between initial and terminal "scores" obtained with shuffled cards and random numbers. These correlations, which depended solely on the extent of overlap between successive measures, agreed closely with empirical test-retest correlations in intelligence test scores found in three published longitudinal studies. In fact, the test scores tended to give somewhat lower correlations, a difference Anderson attributed to such factors as errors of measurement and change in test content with age.

Although the overlap hypothesis undoubtedly accounts for some of the increasing stability of the IQ in the developing individual, two additional conditions merit consideration. The first is the environmental stability characterizing the developmental years of most individuals. Children tend to remain in the same family, the same socioeconomic level, and the same cultural milieu as they grow up. They are not typically shifted at random from intellectually stimulating to intellectually handicapping environments. Hence, whatever intellectual advantages or disadvantages they had at one stage in their development tend to persist in the interval between retests.

A second condition contributing to the general stability of the IQ pertains to the role of prerequisite learning skills on subsequent learning. Not only does the individual retain prior learning, but much of his prior learning provides tools for subsequent learning. Hence, the more progress he has made in the acquisition of intellectual skills and knowledge at any one point in time, the better able he is to profit from subsequent learning experiences. The concept of readiness in education is an expression of this general principle. The sequential nature of learning is also implied in the previously discussed Piagetian approach to mental development, as well as in the various individualized instructional programs.

Applications of the same principle underlie Project Head Start and other compensatory educational programs for culturally disadvantaged preschool children (Bloom, Davis, & Hess, 1965; Gordon & Wilkerson, 1966; Sigel, 1973; Stanley, 1972, 1973; Whimbey, 1975). Insofar as children from disadvantaged backgrounds lack some of the essential prerequisites for effective school learning, they would only fall farther and farther behind in academic achievement as they progressed through the school grades. It should be added that learning prerequisites cover not only such intellectual skills as the acquisition of language and of quantitative concepts, but also attitudes, interests, motivation, problem-solving styles, reactions to frustration, self-concepts, and other personality characteristics.

The object of compensatory educational programs is to provide the learning prerequisites that will enable children to profit from subsequent schooling. In so doing, of course, these programs hope to disrupt the "stability" of IQ's that would otherwise have remained low. Compensatory education programs provide one example of the interaction between initial score and treatment in the prediction of subsequent score, discussed in Chapter 7. That intellectual skills can also be effectively taught at the adult level is suggested by the promising exploratory research reported by Whimbey (1975), which he describes as "cognitive therapy."

INSTABILITY OF THE IQ. Correlational studies on the stability of the IQ provide actuarial data, applicable to group predictions. For the reasons given above, IQ's tend to be quite stable in this actuarial sense. Studies of individuals, on the other hand, may reveal large upward or downward shifts in IQ.¹ Sharp rises or drops in IQ may occur as a result of major environmental changes in the child's life. Drastic changes in family structure or home conditions, adoption into a foster home, severe or prolonged illness, and therapeutic or remedial programs are examples of the type of events that may alter the child's subsequent intellectual development. Even children who remain in the same environment, however, may show large increases or decreases in IQ on retest.² These changes mean, of course,

¹ See, e.g., Bayley (1955), Bayley & Schaefer (1964), Bradway (1945), Bradway & Robinson (1961), Haan (1963), Honzik, Macfarlane, & Allen (1948), Kagan & Freeman (1963), Kagan, Sontag, Baker, & Nelson (1958), McCall, Appelbaum, & Hogarty (1973), Rees & Palmer (1970), Sontag, Baker, & Nelson (1958), Wiener, Rider, & Oppel (1963).
² Pinneau (1961) has prepared tables showing the median and range of individual IQ changes found in the Berkeley Growth Study for each age at test and retest from 1 month to 17 years.
330 Tests of General Intellectual Level
Psychological Issues in Intelligence Testing 331
that the child is developing at a faster or a slower rate than that of the normative population on which the test was standardized. In general, children in culturally disadvantaged environments tend to lose and those in superior environments to gain in IQ with age. Investigations of the specific characteristics of these environments and of the children themselves are of both theoretical and practical interest.

Typical data on the magnitude of individual IQ changes are provided by the California Guidance Study. In an analysis of retest data from 222 cases in this study, Honzik, Macfarlane, and Allen (1948) reported individual IQ changes of as much as 50 points. Over the period from 6 to 18 years, when retest correlations are generally high, 59 percent of the children changed by 15 or more IQ points, 37 percent by 20 or more points, and 9 percent by 30 or more. Nor are most of these changes random or erratic in nature. On the contrary, children exhibit consistent upward or downward trends over several consecutive years; and these changes are related to environmental characteristics. In the California Guidance Study, detailed investigation of home conditions and parent-child relationships indicated that large upward or downward shifts in IQ were associated with the cultural milieu and emotional climate in which the child was reared. A further follow-up conducted when the participants had reached the age of 30 still found significant correlations between test scores and family milieu as assessed at the age of 21 months (Honzik, 1967). Parental concern with the child's educational achievement emerged as an important correlate of subsequent test performance, as did other variables reflecting parental concern with the child's general welfare.

In the previously mentioned follow-up of the 1937 Stanford-Binet standardization sample, Bradway (1945) selected for special study the 50 children showing the largest IQ changes from the preschool to the junior high school period. Results of home visits and interviews with parents again indicated that significant rises or drops in IQ over the 10-year period were related to various familial and home characteristics.

In a reanalysis of results obtained in five published longitudinal studies, including some of those cited directly in this chapter, Rees and Palmer (1970) found changes in IQ between 6 and 12 years to be significantly related to socioeconomic status as indicated by father's educational and occupational level. A similar relationship was observed by Harnqvist (1968) in his Swedish study. In their 10- and 25-year follow-ups of children who had taken the Stanford-Binet at preschool ages, Bradway and Robinson (1961) computed an index based on parents' education, father's occupation, and occupation of both grandfathers. Although they labeled this measure an ancestral index, rather than an index of socioeconomic status, their results are consistent with those of other investigators: the index yielded significant correlations of approximately .30 with IQ's in both follow-ups. Several longitudinal studies have demonstrated a relation between amount and direction of change in intelligence test scores and amount of formal schooling the individual himself has completed in the interval between test and retest (see Harnqvist, 1973). The score differences associated with schooling are larger than those associated with socioeconomic status of the family.

Some investigators have concentrated more specifically on the personality characteristics associated with intellectual acceleration and deceleration. At the Fels Research Institute, 140 children were included in an intensive longitudinal study extending from early infancy to adolescence and beyond (Kagan & Freeman, 1963; Kagan, Sontag, Baker, & Nelson, 1958; Sontag, Baker, & Nelson, 1958). Within this group, those children showing the largest gains and those showing the largest losses in IQ between the ages of 4½ and 6 were compared in a wide variety of personality and environmental measures; the same was done with those showing the largest IQ changes between 6 and 10. During the preschool years, emotional dependency on parents was the principal condition associated with IQ loss. During the school years, IQ gains were associated chiefly with high achievement drive, competitive striving, and curiosity about nature. Suggestive data were likewise obtained regarding the role of parental attitudes and child-rearing practices in the development of these traits.

A later analysis of the same sample, extending through age 17, focused principally on patterns of IQ change over time (McCall, Appelbaum, & Hogarty, 1973). Children exhibiting different patterns were compared with regard to child-rearing practices as assessed through periodic home visits. A typical finding is that the parents of children whose IQ's showed a rising trend during the preschool years presented "an encouraging and rewarding atmosphere, but one with some structure and enforcement of policies" (McCall et al., 1973, p. 54). A major condition associated with rising IQ's is described as accelerational attempt, or the extent to which "the parent deliberately trained the child in various mental and motor skills which were not yet essential" (p. 52).

Another approach to an understanding of IQ changes is illustrated by Haan's (1963) follow-up study of 49 men and 50 women who had participated in a long-term growth study. IQ's were obtained with a group test administered when the subjects were about 12 years old and again when they were in their middle or late 30's. Personality characteristics were investigated through a self-report inventory and a series of intensive interviews conducted at the time of the adult follow-up. The upper and lower 25 percent of the group in terms of IQ change, designated as accelerators and decelerators, were compared with special reference to their reliance on coping or defense mechanisms. These mechanisms refer to contrasting personality styles in dealing with problems and frustrations.
Coping mechanisms in general represent an objective, constructive, realistic approach; defense mechanisms are characterized by withdrawal, denial, rationalization, and distortion. The results confirmed the hypothesis that accelerators made significantly more use of coping mechanisms and decelerators of defense mechanisms.

Similar results are reported by Moriarty (1966), from a longitudinal study of 65 children tested from two to four times between infancy and the early teens. On the basis of IQ changes, the children were classified into four categories: (a) relatively constant (40 percent); (b) accelerative spurts in one or more areas of functioning (25 percent); (c) slow, delayed, or inhibited development (9 percent); (d) erratic score changes, inconsistent performance in different functions, or progressive IQ decline (26 percent). Intensive case studies of the individual children in these four categories led Moriarty to hypothesize that characteristic differences in coping mechanisms constitute a major factor in the observed course of IQ over time.

Research on the factors associated with increases and decreases in IQ throws light on the conditions determining intellectual development in general. It also suggests that prediction of subsequent intellectual status can be improved if measures of the individual's emotional and motivational characteristics and of his environment are combined with initial test scores. From still another viewpoint, the findings of this type of research point the way to the kind of intervention programs that can [...]

The assessment of intelligence at the two extremes of the age range presents special theoretical and interpretive problems. One of these problems pertains to the functions that should be tested. What constitutes intelligence for the infant and the preschool child? What constitutes intelligence for the older adult? The second problem is not entirely independent of the first. Unlike the schoolchild, the infant and preschooler have not been exposed to the standardized series of experiences represented by the school curriculum. In developing tests for the elementary, high school, and college levels, the test constructor has a large fund of common experiential material from which he can draw test items. Prior to school entrance, on the other hand, the child's experiences are far less standardized, despite certain broad cultural uniformities in child-rearing practices. Under these conditions, both the construction of tests and the interpretation of test results are much more difficult. To some extent the same difficulty is encountered in testing older adults, whose schooling was completed many years earlier and who have since been engaged in highly diversified activities. In this and the next section, we shall examine some of the implications of these problems for early childhood and adult testing, respectively.

PREDICTIVE VALIDITY OF INFANT AND PRESCHOOL TESTS. The conclusion that emerges from longitudinal studies is that preschool tests (especially when administered after the age of 2 years) have moderate validity in predicting subsequent intelligence test performance, but that infant tests have virtually none (Bayley, 1970; Lewis, 1973; McCall, Hogarty, & Hurlburt, 1972). Combining the results reported in eight studies, McCall and his associates computed the median correlations between tests administered during the first 30 months of life and childhood IQ obtained between 3 and 18 years (McCall et al., 1972). Their findings are reproduced in Table 20. Several trends are apparent in this table. First, tests given during the first year of life have little or no long-term predictive value. Second, infant tests show some validity in predicting IQ at preschool ages (3-4 years), but the correlations exhibit a sharp drop beyond that point, after children reach school age. Third, after the age of 18 months, validities are moderate and stable. When predictions are made from these ages, the correlations seem to be of the same order of magnitude regardless of the length of the retest interval.

TABLE 20. Median correlations between infant tests and childhood IQ
(From McCall, Hogarty, & Hurlburt, 1972. Copyright 1972 by the American Psychological Association. Reprinted by permission.)

Childhood Age            Age in Months at Initial Test
in Years (Retest)        1-6      7-12     13-18    19-30
8-18                     .01      .20      .21      .49
5-7                      .01      .06      .30      .41
3-4                      .23      .33      .47      .54

The lack of long-term predictive validity of infant tests needs to be evaluated further with regard to other related findings. First, a number of clinicians have argued that infant tests do improve the prediction of subsequent development, but only if interpreted in the light of concomitant clinical observations (Donofrio, 1965; Escalona, 1950; Knobloch & Pasamanick, 1960). Predictions might also be improved by a consideration of developmental trends through repeated testing, a procedure originally proposed by Gesell with reference to his Developmental Schedules.

In the second place, several investigators have found that infant tests have much higher predictive validity within nonnormal, clinical populations than within normal populations. Significant validity coefficients in the .60's and .70's have been reported for children with initial IQ's below 80, as well as for groups having known or suspected neurological abnormalities (Ireton, Thwing, & Gravem, 1970; Knobloch & Pasamanick, 1963, 1966, 1967; Werner, Honzik, & Smith, 1968). Infant tests appear to be most useful as aids in the diagnosis of defective development resulting from organic pathology of either hereditary or environmental origin. In the absence of organic pathology, the child's subsequent development is determined largely by the environment in which he is reared. This the test cannot be expected to predict. In fact, parental education and other characteristics of the home environment are better predictors of subsequent IQ than are infant test scores; and beyond 18 months, prediction is appreciably improved if test scores are combined with indices of familial socioeconomic status (Bayley, 1955; McCall et al., 1972; Pinneau, 1961; Werner, Honzik, & Smith, 1968).

NATURE OF EARLY CHILDHOOD INTELLIGENCE. Several investigators have concluded that, while lacking predictive validity for the general population, infant intelligence tests are valid indicators of the child's cognitive abilities at the time (Bayley, 1970; Stott & Ball, 1965; Thomas, 1970). According to this view, a major reason for the negligible correlations between infant tests and subsequent performance is to be found in the changing nature and composition of intelligence with age. Intelligence in infancy is qualitatively different from intelligence at school age; it consists of a different combination of abilities.

This approach is consistent with the concept of developmental tasks proposed by several psychologists in a variety of contexts (Erikson, 1950; Havighurst, 1953; Super et al., 1957). Educationally and vocationally, as well as in other aspects of human development, the individual encounters typical behavioral demands and problems at different life stages, from infancy to senescence. Although both the problems and the appropriate reactions vary somewhat among cultures and subcultures, modal requirements can be specified within a given cultural setting. Each life stage makes characteristic demands upon the individual. Mastery of the developmental tasks of earlier stages influences the individual's handling of the behavioral demands of the next.

Within the more circumscribed area of cognitive development, Piagetian stages provide a framework for examining the changing nature of intelligence. McCall and his associates at the Fels Research Institute (McCall et al., 1972) have explored the interrelations of infant behavior in terms of such a Piagetian orientation. Through sophisticated statistical analyses involving intercorrelations of different skills within each age as well as correlations among the same and different skills across ages, these investigators looked for precursors of later development in infant behavior. Although the findings are presented as highly tentative and only suggestive, the authors describe the major component of infant intelligence at each six-month period during the first 2 years of life. These descriptions bear a rough resemblance to Piagetian developmental sequences. The main developmental trends at 6, 12, 18, and 24 months, respectively, are summarized as follows: (1) manipulation that produces contingent perceptual responses; (2) imitation of fine motor and social-vocal-verbal behavior; (3) verbal labeling and comprehension; (4) further verbal development, including fluent verbal production and grammatical maturity.

Apart from a wealth of provocative hypotheses, one conclusion that emerges clearly from the research of McCall and his co-workers is that the predominant behavior at different ages exhibits qualitative shifts and fails to support the conception of a "constant and pervasive" general intellectual ability (McCall et al., 1972, p. 746). The same conclusion was reached by Lewis (1973) on the basis of both his own research and his survey of published studies. Lewis describes infant intelligence test performance as being neither stable nor unitary. Negligible correlations may be found over intervals even as short as three months; and correlations with performance on the same or different scales at the age of two years and beyond are usually insignificant. Moreover, there is little correlation among different scales administered at the same age. These results have been obtained both with standardized instruments such as the Bayley Scales of Infant Development and with ordinal scales of the Piagetian type (Gottfried & Brody, 1975; King & Seegmiller, 1971; Lewis & McGurk, 1972). In place of the traditional model of a "developmentally constant general intelligence," Lewis proposes an interactionist view, emphasizing both the role of experience in cognitive development and the specificity of intellectual skills.

IMPLICATIONS FOR INTERVENTION PROGRAMS. The late 1960s and early 1970s witnessed some disillusionment and considerable confusion regarding the purposes, methods, and effectiveness of compensatory preschool educational programs, such as Project Head Start. Designed principally to enhance the academic readiness of children from disadvantaged backgrounds, these programs differed widely in procedures and results. Most were crash projects, initiated with inadequate planning. Only a few could demonstrate substantial improvements in the children's performance, and such improvements were often limited and short-lived (Stanley, 1972).

Against this background, the Office of Child Development of the U.S. Department of Health, Education, and Welfare sponsored a conference of a panel of experts to try to define "social competency" in early childhood (Anderson & Messick, 1974). The panel agreed that social competency includes more than the traditional concept of general intelligence. After working through a diversity of approaches and some thorny theoretical issues, the panel drew up a list of 29 components of social competency, which could serve as possible goals of early intervention programs. Including emotional, motivational, and attitudinal as well as cognitive variables, these components ranged from self-care and a differentiated self-concept to verbal and quantitative skills, creative thinking, and the enjoyment of humor, play, and fantasy. Assessment of these components requires not only a wide variety of tests but also other measurement techniques, such as ratings, records, and naturalistic observations. Few if any intervention programs could undertake to implement all the goals. But the selection of goals should be explicit and deliberate; and it should guide both intervention procedures and program evaluation.

The importance of evaluating the effectiveness of an intervention program in terms of the specific skills (cognitive or noncognitive) that the program was designed to improve is emphasized by Lewis (1973). In line with the previously cited specificity of behavioral development in early childhood, Lewis urges the measurement of specific skills rather [...] intervention program should be tailored to fit the needs of the individual child, with reference to the development attained in specific skills.

Sigel (1973) gives an incisive analysis of preschool programs against the background of current knowledge regarding both child development and educational techniques. In line with available knowledge about developmental processes, he, too, recommends the use of specific achievement tests to assess progress in the skills developed by the educational programs, instead of such global scores as IQ's. He also emphasizes the importance of process, interrelation of changes in different functions, and patterns of development, as in the characteristic Piagetian approach. And he urges the reformulation of the goals of early childhood education in more realistic terms.

PROBLEMS IN THE TESTING OF ADULT INTELLIGENCE

AGE DECREMENT. A distinctive feature introduced by the Wechsler scales for measuring adult intelligence (Ch. 9) was the use of a declining norm to compute deviation IQ's. It will be recalled that raw scores on the WAIS subtests are first transmuted into standard scores with a mean of 10 and an SD of 3. These scaled scores are expressed in terms of a fixed reference group consisting of the 500 persons between the ages of 20 and 34 years included in the standardization sample. The sum of the scaled scores on the 11 subtests is used in finding the deviation IQ in the appropriate age table. If we examine the sums of the scaled scores directly, however, we can compare the performance of different age groups in terms of a single continuous scale. Figure 55 shows the means of these total scaled scores for the age levels included in the national standardization sample and for the more limited "old-age sample" of 475 persons aged 60 years and over (Doppelt & Wallace, 1955).

As can be seen in Figure 55, the scores reach a peak between the ages of 20 and 34 and then decline slowly until 60. A sharper rate of decline was found after 60 in the old-age sample. The deviation IQ is found by referring an individual's total scaled score to the norm for his own age

FIG. 55. Decline in WAIS Scaled Scores with Age. Horizontal axis: Age in Years (Midpoints of Age Groups).
(From Doppelt & Wallace, 1955, p. 323.)
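The two-step scoring described for the WAIS can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published scoring procedure: the actual scale uses printed conversion tables, whereas the sketch uses linear standard-score formulas. The scaled-score metric (mean 10, SD 3) is taken from the text; the deviation IQ metric (mean 100, SD 15) is the usual Wechsler convention. All norm values in `HYPOTHETICAL_AGE_NORMS` are invented for the example, and the function names are hypothetical.

```python
def scaled_score(raw, ref_mean, ref_sd):
    """Subtest scaled score with mean 10 and SD 3 in the fixed reference group.

    ref_mean and ref_sd are the raw-score mean and SD of the reference
    group (persons aged 20 to 34 in the standardization sample).
    """
    return 10 + 3 * (raw - ref_mean) / ref_sd


def deviation_iq(total_scaled, age_norm_mean, age_norm_sd):
    """Deviation IQ (mean 100, SD 15) relative to the person's own age group."""
    return 100 + 15 * (total_scaled - age_norm_mean) / age_norm_sd


# Hypothetical norms for the sum of scaled scores: (mean, SD) per age group.
# These numbers are NOT taken from the WAIS manual; they only mimic a
# declining norm, with a lower mean for the older group.
HYPOTHETICAL_AGE_NORMS = {
    "20-34": (110.0, 25.0),
    "60-64": (90.0, 25.0),
}

if __name__ == "__main__":
    total = 100.0  # the same sum of scaled scores for two persons
    for age_group, (mean, sd) in HYPOTHETICAL_AGE_NORMS.items():
        print(age_group, deviation_iq(total, mean, sd))
```

With these invented norms, the same total scaled score of 100 corresponds to a deviation IQ of 94 in the younger group and 106 in the older group. This is what a declining norm implies: an older person needs a lower total scaled score to obtain the same IQ.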
~~_':'O~':,:'~d;:~~
__"""
cros~'S~;:t!Q'l2'
---,fJ:.1
't"
t5
I
50 l- c'O'''.'''O",' \ 50 I, 5 50
I ,
"15')
.~j.;
v:
could be located were contact~d and 302 of them were giwn the same I
tests again. This subs ample was shown to be closely comparable to the
original group in age, sex ratio, and socioeconomic level.
1., i
I
>-
45f
I
r
!
1
The design of this study permits two types of comparisons: (1) a cross-
2~ 30 35 40 45 50 5S 60 65 70
I 2~ 3~ 35 4(} 4=. 50 55 !
sectional comparison among different age groups from 20 to 70 tested at
Ag.
, Ag, I
the same time, and (2) a longitudinal comparison \\ithin the same indi-
viduals, initially tested at ages ranging from 20 to 70 and retested after :: I :] ,,,'"' ' " 0""",,,, ""''''' 1:: I
seven years, The results of the cross-sectional comparisons sh2.we~. signifi- .
~ I !~f ~~- ~ I
i
cant .inlergener.aJion differences oQ...alLtesis .. In other \\'ords, those born
_~'~::t~":n~!~
1
and reared more recently performed better than those Lorn and reared 50
at an earlier time period. Longitudinal comparisons, on the other hand, I I cross secli~-,<!!
45 r
sl}.o\\'cl-,a..J.e.ud.eugdQr mean s...C:Q1:.e.s-ei~rto ri~ or remail~,..':!..nchanged -1 II 40 45 I
!
\"hen ~iliYiduals were retested. ~The one major exception oC~dlll .lfl-:;y I 2: 30 ~o 40 45 ~D 5'5 6'~ de 7~' I
two highly speeded tests, in which performance was significantly poorer after the seven-year interval.

The contrast between the results of the cross-sectional and longitudinal approaches is illustrated in Figure 56, showing the trends obtained with four of the tests.³ Similar results were found in a second seven-year retest of 161 of the original participants (Schaie & Labouvie-Vief, 1974). Further corroboration was provided by a still different approach involving the testing of three independent age-stratified samples drawn from the same population in 1956, 1963, and 1970 (Schaie, Labouvie, & Buech, 1973).

³ These are the tests that most closely resemble intelligence tests in their content. Of these four tests, only Reasoning showed a barely significant (p < .05) relation to age in the longitudinal study. The magnitude of the decline, however, is smaller than in the cross-sectional comparison.

FIG. 56. Differences in Adult Intelligence as Assessed by Cross-Sectional and Longitudinal Studies.
(From Schaie & Strother, 1968, pp. 675, 676. Copyright 1968 by the American Psychological Association. Reprinted by permission.)

In general, the results of the better designed studies of adult intelligence strongly suggest that the ability decrements formerly attributed to aging are actually intergenerational or intercohort differences, probably associated with cultural changes in our society. Genuine ability decrements are not likely to be manifested until well over the age of 60. Moreover, any generalization, whether pertaining to age decrement or cohort differences, must be qualified by a recognition of the wide individual variability found in all situations. Individual differences within any one age level are much greater than the average difference between age levels. As a result, the distributions of scores obtained by persons of different ages overlap extensively. This simply means that large numbers of older persons can be found whose performance equals that of younger persons. Moreover, the best performers within the older groups excel the poorest performers within the younger groups. Nor is such overlapping limited to adjacent age levels; the ranges of performance still overlap when extreme groups are compared. Thus some 80-year-olds will do better than some 20-year-olds.

What is even more relevant to the present topic, however, is that the changes that occur with aging vary with the individual. Thus between the ages of 50 and 60, for example, some persons may show a decrease, some no appreciable change, and some an increase in test performance. The amount of change, whether it be a drop or a rise, will also vary widely among individuals. Moreover, intensive studies of persons of advanced age, extending into the seventh, eighth, and ninth decades of life, indicate that intellectual functioning is more closely related to the individual's health status than to his chronological age (Birren, 1968; Palmore, 1970).

NATURE OF ADULT INTELLIGENCE. Within the life span, testing has been oriented chiefly toward the schoolchild and college student. At these levels, the test constructor can draw on the large common pool of experiences that have been organized into academic curricula. Most intelligence tests measure how well the individual has acquired the intellectual skills taught in our schools; and they can in turn predict how well he is prepared for the next level in the educational hierarchy. Tests for adults, including the Wechsler scales, draw largely on this identifiable common fund of experience. As the individual grows older and his own formal educational experiences recede farther into the past, this fund of common experience may become increasingly less appropriate to assess his intellectual functioning. Adult occupations are more diversified than childhood schooling. The cumulative experiences of adulthood may thus stimulate a differential development of abilities in different persons.

Because intelligence tests are closely linked to academic abilities, it is not surprising to find that longitudinal studies of adults show larger age increments in score among those individuals who have continued their education longer (D. P. Campbell, 1965; Härnqvist, 1968; Husén, 1951; Lorge, 1945; Owens, 1953). Similarly, persons whose occupations are more "academic" in content, calling into play verbal and numerical abilities, are likely to maintain their performance level or show improvement in intelligence test scores over the years, while those engaged in occupations emphasizing mechanical activities or interpersonal relations may show a loss. Some suggestive data in support of this hypothesis are reported by Williams (1960), who compared the performance of 100 persons, ranging in age from 65 to over 90, on a series of verbal and nonverbal tests. Rather striking correspondences were found between the individual's occupation and his relative performance on the two types of tasks. Longitudinal investigations of adults have also found suggestive relationships between total IQ changes and certain biographical inventory items (Charles & James, 1964; Owens, 1966).

Each time and place fosters the development of skills appropriate to its characteristic demands. Within the life span, these demands differ for the infant, the schoolchild, the adult in different occupations, and the retired septuagenarian. An interesting demonstration of the implications of this fact for intelligence testing was provided by Demming and Pressey (1957). These investigators began with a task analysis of typical adult functions, conducted through informal surveys of reading matter and of reported daily activities and problems. On this basis, they prepared preliminary forms of some 20 tests "indigenous" to the older years. The tests emphasized practical information, judgment, and social perception. Results with three of these tests, administered together with standard verbal and nonverbal tests to samples of different ages, showed that the older persons excelled the younger on the new tests, while the reverse relationship held for the traditional tests. All these types of research suggest that whether intelligence test scores rise or decline with increasing age in adulthood depends largely on what experiences the individual undergoes during those years and on the relationship between these experiences and the functions covered by the tests.

PROBLEMS IN CROSS-CULTURAL TESTING

The use of tests with persons of diverse cultural backgrounds has already been considered from various angles in earlier parts of this book. Chapter 3 was concerned with the social and ethical implications of such testing, particularly with reference to minority groups within a broader national culture. Technical problems pertaining to test bias and to item-group interaction were analyzed in Chapters 7 and 8. And in Chapter 10 we examined typical tests designed for various transcultural applications. In this section, we shall present some basic theoretical issues about the role of culture in behavior, with special reference to the interpretation of intelligence test scores.

LEVELS OF CULTURAL DIFFERENTIALS. Cultural differences may operate in many ways to bring about group differences in behavior. The level at which cultural influences are manifested varies along a continuum extending from superficial and temporary effects to those which are basic, permanent, and far-reaching. From both a theoretical and a practical standpoint, it is important to inquire at what level of this continuum any observed behavioral difference falls. At one extreme we find cultural differences that may affect only responses on a particular test and thus reduce its validity for certain groups. There are undoubtedly test items that have no diagnostic value when applied to persons from certain cultures because of lack of familiarity with specific objects or other relatively trivial experiential differences.

Most cultural factors that affect test responses, however, are also likely to influence the broader behavior domain that the test is designed to sample. In an English-speaking culture, for example, inadequate mastery of English may handicap a child not only on an intelligence test but also in his school work, contact with associates, play activities, and other situations of daily life. Such a condition would thus interfere with the child's subsequent intellectual and emotional development and would have practical implications that extend far beyond immediate test performance. At the same time, deficiencies of this sort can be remedied without much difficulty. Suitable language training can bring the individual up to an effective functioning level within a relatively short period.
The language an individual has been taught to speak was chosen in the above example because it provides an extreme and obvious illustration of several points: (1) it is clearly not a hereditary condition; (2) it can be altered; (3) it can seriously affect performance on a test administered in a different language; (4) it will similarly affect the individual's educational, vocational, and social activities in a culture that uses an unfamiliar language. Many other examples can be cited from the middle range of the continuum of cultural differentials. Some are cognitive differentials, such as reading disability or ineffective strategies for solving abstract problems; others are attitudinal or motivational, such as lack of interest in intellectual activities, hostility toward authority figures, low achievement drive, or poor self-concept. All such conditions can be ameliorated by a variety of means, ranging from functional literacy training to vocational counseling and psychotherapy. All are likely to affect both test performance and the daily life activities of the child and adult.

As we move along the continuum of cultural differentials, we must recognize that the longer an environmental condition has operated in the individual's lifetime, the more difficult it becomes to reverse its effects. Conditions that are environmentally determined are not necessarily remediable. Adverse experiential factors operating over many years may produce intellectual or emotional damage that can no longer be eliminated by the time intervention occurs. It is also important to bear in mind, however, that the permanence or irremediability of a psychological condition is no proof of hereditary origin.

An example of cultural differentials that may produce permanent effects on individual behavior is provided by research on complications of pregnancy and parturition (Knobloch & Pasamanick, 1966; Pasamanick & Knobloch, 1966). In a series of studies on large samples of blacks and whites, prenatal and perinatal disorders were found to be significantly related to mental retardation and behavior disorders in the offspring. An important source of such irregularities in the process of childbearing and birth is to be found in deficiencies of maternal nutrition and other conditions associated with low socioeconomic status. Analysis of the data revealed a much higher frequency of all such medical complications in lower than in higher socioeconomic levels, and a higher frequency among blacks than among whites. Here then is an example of cultural differentials producing organic disorders that in turn may lead to behavioral deficiencies. The effects of this type of cultural differential cannot be completely reversed within the lifetime of the individual, but require more than one generation for their elimination. Again it needs to be emphasized, however, that such a situation does not imply hereditary defect, nor does it provide any justification for failure to improve the environmental conditions that brought it about.

CULTURAL DIFFERENCES AND CULTURAL HANDICAP. When psychologists began to develop instruments for cross-cultural testing in the first quarter of this century, they hoped it would be at least theoretically possible to measure hereditary intellectual potential independently of the impact of cultural experiences. The individual's behavior was thought to be overlaid with a sort of cultural veneer whose penetration became the objective of what were then called "culture-free" tests. Subsequent developments in genetics and psychology have demonstrated the fallacy of this concept. We now recognize that hereditary and environmental factors interact at all stages in the organism's development and that their effects are inextricably intertwined in the resulting behavior. For man, culture permeates nearly all environmental contacts. Since all behavior is thus affected by the cultural milieu in which the individual is reared and since psychological tests are but samples of behavior, cultural influences will and should be reflected in test performance. It is therefore futile to try to devise a test that is free from cultural influences. The present objective in cross-cultural testing is rather to construct tests that presuppose only experiences that are common to different cultures. For this reason, such terms as "culture-common," "culture-fair," and "cross-cultural" have replaced the earlier "culture-free" label.

No single test can be universally applicable or equally "fair" to all cultures. There are as many varieties of culture-fair tests as there are parameters in which cultures differ. A nonreading test may be culture-fair in one situation, a nonlanguage test in another, a performance test in a third, and a translated adaptation of a verbal test in a fourth. The varieties of available cross-cultural tests are not interchangeable but are useful in different types of cross-cultural comparisons.

It is unlikely, moreover, that any test can be equally "fair" to more than one cultural group, especially if the cultures are quite dissimilar. While reducing cultural differentials in test performance, cross-cultural tests cannot completely eliminate such differentials. Every test tends to favor persons from the culture in which it was developed. The mere use of paper and pencil or the presentation of abstract tasks having no immediate practical significance will favor some cultural groups and handicap others. Emotional and motivational factors likewise influence test performance. Among the many relevant conditions differing from culture to culture may be mentioned the intrinsic interest of the test content, rapport with the examiner, drive to do well on a test, desire to excel others, and past habits of solving problems individually or cooperatively. In testing children of low socioeconomic level, several investigators have found that the examinees rush through the test, marking answers almost at random and finishing before time is called (Eells et al., 1951). The same reaction has been observed among Puerto Rican schoolchildren tested in New
York City and in Hawaii (Anastasi & Cordova, 1953; S. Smith, 1942). Such a reaction may reflect a combination of lack of interest in the relatively abstract test content and expectation of low achievement on tasks resembling school work. By hurrying through the test, the child shortens the period of discomfort.

Each culture and subculture encourages and fosters certain abilities and ways of behaving, and discourages or suppresses others. It is therefore to be expected that, on tests developed within the majority American culture, for example, persons reared in that culture will generally excel. If a test were constructed by the same procedures within a culture differing markedly from ours, Americans would probably appear deficient in terms of test norms. Data bearing on this type of cultural comparison are meager. What evidence is available, however, suggests that persons from our culture may be just as handicapped on tests prepared within other cultures as members of those cultures are on our tests (Anastasi, 1958a, pp. 566-568). Cultural differences become cultural handicaps when the individual moves out of the culture or subculture in which he was reared and endeavors to function, compete, or succeed within another culture. From a broader viewpoint, however, it is these very contacts and interchanges between cultures that stimulate the advancement of civilizations. Cultural isolation, while possibly more comfortable for individuals, leads to societal stagnation.

LANGUAGE IN TRANSCULTURAL TESTING. Most cross-cultural tests utilize nonverbal content in the hope of obtaining a more nearly culture-fair measure of the same intellectual functions measured by verbal intelligence tests. Both assumptions underlying this approach are questionable. First, it cannot be assumed that nonverbal tests measure the same functions as verbal tests, however similar they may appear. A spatial analogies test is not merely a nonverbal version of a verbal analogies test. Some of the early nonlanguage tests, such as the Army Beta, were heavily loaded with spatial visualization and perceptual [...]

Investigations with a wide variety of cultural groups in many countries have found larger group differences in performance and other nonverbal tests than in verbal tests (Anastasi, 1961; Irvine, 1969a, 1969b; Jensen, [year illegible]; Ortar, 1963; Vernon, [year illegible]). In a provocative analysis of the problem, Ortar (1963, pp. 232-233) writes:

On the basis of our results it appears that, both from the practical point of view and on theoretical grounds, the verbal tests and items are better suited as intercultural measuring instruments than any other kind. They must, of course, be translated and adapted, but this adaptation is infinitely easier and more reliable than the well-nigh impossible task of "translating" and adapting a performance test. The language of performance is the cultural perception, but its "words" and grammar and syntax are not even completely understood, let alone organized in national entities. We do not know how to "translate" a picture into the representational language of a different culture, but we are thoroughly familiar with the technique and requirements of translating verbal contents. . . . A concept that is non-existent in a certain language simply cannot be translated into this language, a factor which acts as a safeguard against mechanical use of a given instrument when adapting it for a different culture.

Among the examples cited by Ortar is the observation that, when presented with a picture of a head from which the mouth was missing, Oriental immigrant children in Israel said the body was missing. Unfamiliar with the convention of considering the drawing of a head as a complete picture, these children regarded the absence of the body as more important than the omission of a mere detail like the mouth. For a different reason, an item requiring that the names of the seasons be arranged in the proper sequence would be more appropriate in a cross-cultural test than would an item using pictures of the seasons. The seasons would not only look different in different countries for geographical reasons, but they would also probably be represented by means of conventionalized pictorial symbols which would be unfamiliar to persons from another culture.

The use of pictorial representation itself may be unsuitable in cultures unaccustomed to representative drawing. A two-dimensional reproduction [...]
other cultural contexts may be much less accustomed to such problem-solving approaches.

It should be added that nonverbal tests have fared no better in the testing of minority groups and persons of low socioeconomic status within the United States. On the WISC, for instance, black children usually find the Performance tests as difficult as or more difficult than the Verbal tests; this pattern is also characteristic of children from low socioeconomic levels (Caldwell & Smith, 1965; Cole & Hunter, 1971; Goffeney, Henderson, & Butler, 1971; Hughes & Lessler, 1965; Teahan & Drews, 1962). The same groups tend to do better on the Stanford-Binet than on either Raven's Progressive Matrices (Higgins & Sivers, 1958) or Cattell's Culture Fair Intelligence Test (Willard, 1968).

There is of course no procedural difficulty in administering a verbal test across cultures speaking a common language. When language differences necessitate a translation of the test, problems arise regarding the comparability of norms and the equivalence of scores. It should also be noted that a simple translation would rarely suffice. Some adaptation and revision of content is generally required. Of interest in this connection is the procedure developed in equating the scales of the CEEB Scholastic Aptitude Test (SAT) and the Prueba de Aptitud Académica (PAA), a Spanish version of the SAT (Angoff & Modu, 1973).

The PAA was originally developed for local use in Puerto Rico, but has subsequently been considered by continental American universities as a possible aid in the admission of Spanish-speaking students. With such objectives in mind, an exploratory project in scale equating was undertaken. This project provides a demonstration of an ingenious method applicable to other situations requiring testing in multiple languages. Essentially, the procedure consists of two steps. The first step is the selection of a common set of anchor items judged to be equally appropriate for both groups of students. These items are administered in English to the English-speaking students and in Spanish to the Spanish-speaking students. The performance of these groups provides the data for measuring the difficulty level (Δ value) and discriminative power (r_bis with total test score) of each item. For the final set of anchor items, any items showing appreciable item-group interaction are discarded (see Ch. 8). These would be the "biased" items that are likely to have a psychologically different meaning for the two groups. The final anchor items are those having approximately the same relative difficulty for the English-speaking and the Spanish-speaking samples, in addition to meeting the specifications for difficulty level and discriminative power.

The second step is to include these anchor items in a regular administration of the SAT and the PAA and to use the scores on the anchor items as a basis for converting all test scores to a single scale. This step involves the same equating procedure regularly followed in converting scores on successive forms of the SAT to a uniform scale.

MEANING OF AN IQ. For the general public, the IQ is not identified with a particular type of score on a particular test, but is often a shorthand designation for intelligence. So prevalent has this usage become, that it cannot be merely ignored or deplored as a popular misconception. To be sure, when considering the numerical value of a given IQ, we should always specify the test from which it was derived. Different intelligence tests that yield an IQ do in fact differ in content and in other ways that affect the interpretation of their scores. Some of these differences among tests sharing the common label of "intelligence test" were apparent in the examples considered in the preceding chapters. Nonetheless, there is a need to reexamine the general connotations of the construct "intelligence," as symbolized by the IQ. It might be added that the prevalent conception of intelligence has been shaped to a considerable degree by the characteristics of the Stanford-Binet scale, which for many years provided the only instrument for the intensive measurement of intelligence and which was often used as a criterion for validating new tests.

First, intelligence should be regarded as a descriptive rather than an explanatory concept. An IQ is an expression of an individual's ability level at a given point in time, in relation to his age norms. No intelligence test can indicate the reasons for his performance. To attribute inadequate performance on a test or in everyday-life activities to "inadequate intelligence" is a tautology and in no way advances our understanding of the individual's handicap. In fact, it may serve to halt efforts to explore the causes of the handicap in the individual's history.

Intelligence tests, as well as any other kind of tests, should be used not to label an individual but to help in understanding him. To bring a person to his maximum functioning level we need to start where he is at the time; we need to assess his strengths and weaknesses and plan accordingly. If a reading test indicates that a child is retarded in reading, we do not label him as a nonreader and stop; nor do we give him a nonverbal test to conceal his handicap. Instead we concentrate on teaching him to read.

An important goal of contemporary testing, moreover, is to contribute to self-understanding and personal development. The information provided by tests is being used increasingly to assist individuals in educational and vocational planning and in making decisions about their own lives. The attention being given to effective ways of communicating test
results to the individual attests to the growing recognition of this application of testing.

A second major point to bear in mind is that intelligence is not a single, unitary ability, but a composite of several functions. The term is commonly used to cover that combination of abilities required for survival and advancement within a particular culture. It follows that the specific abilities included in this composite, as well as their relative weights, will vary with time and place. In different cultures and at different historical periods within the same culture, the qualifications for successful achievement will differ. The changing composition of intelligence can also be recognized within the life of the individual, from infancy to adulthood. An individual's relative ability will tend to increase with age in those functions whose value is emphasized by his culture or subculture; and his relative ability will tend to decrease in those functions whose value is deemphasized (see, e.g., Levinson, 1959, 1961).

Typical intelligence tests designed for use in our culture with school-age children or adults measure largely verbal abilities; to a lesser degree, they also cover abilities to deal with numerical and other abstract symbols. These are the abilities that predominate in school learning. Most intelligence tests can therefore be regarded as measures of scholastic aptitude. The IQ is both a reflection of prior educational achievement and a predictor of subsequent educational performance. Because the functions taught in the educational system are of basic importance in our culture, the IQ is also an effective predictor of performance in many occupations and other activities of adult life.

On the other hand, there are many other important functions that intelligence tests have never undertaken to measure. Mechanical, motor, musical, and artistic aptitudes are obvious examples. Motivational, emotional, and attitudinal variables are important determiners of achievement in all areas. Current creativity research is identifying both cognitive and personality variables that are associated with creative productivity. All this implies, of course, that both individual and institutional decisions should be based on as much relevant data as can reasonably be gathered. To base decisions on tests alone, and especially on one or two tests alone, is clearly a misuse of tests. Decisions must be made by persons. Tests represent one source of data utilized in making decisions; they are not themselves decision-making instruments.

HERITABILITY AND MODIFIABILITY. Much confusion and controversy have resulted from the application of heritability estimates to intelligence test scores. A well-known example is an article by Jensen (1969), which has engendered great furor and led to many heated arguments. Although there are several aspects to this controversy and the issues are complicated, a major substantive source of controversy pertains to the interpretation of heritability estimates. Specifically, a heritability index shows the proportional contribution of genetic or hereditary factors to the total variance of a particular trait in a given population under existing conditions. For example, the statement that the heritability of Stanford-Binet IQ among urban American high school students is .70 would mean that 70 percent of the variance found in these scores is attributable to hereditary differences and 30 percent is attributable to environment.

Heritability indexes have been computed by various formulas (see, e.g., Jensen, 1969; Loehlin, Lindzey, & Spuhler, 1975), but their basic data are measures of familial resemblance in the trait under consideration. A frequent procedure is to utilize intelligence test correlations of monozygotic (identical) and dizygotic (fraternal) twins. Correlations between monozygotic twins reared together and between monozygotic twins reared apart in foster homes have also been used.

Several points should be noted in interpreting heritability estimates. First, the empirical data on familial resemblances are subject to some distortion because of the unassessed contributions of environmental factors. For instance, there is evidence that monozygotic twins share a more closely similar environment than do dizygotic twins (Anastasi, 1958a, pp. 287-288; Koch, 1966). Another difficulty is that twin pairs reared apart are not assigned at random to different foster homes, as they would be in an ideal experiment; it is well known that foster home placements are selective with regard to characteristics of the child and the foster family. Hence the foster home environments of the twins within each pair are likely to show sufficient resemblance to account for some of the correlation between their test scores. There is also evidence that twin data regarding heritability may not be generalizable to the population at large because of the greater susceptibility of twins to prenatal trauma leading to severe mental retardation. The inclusion of such severely retarded cases in a sample may greatly increase the twin correlation in intelligence test scores (Nichols & Broman, 1974).

Apart from questionable data, heritability indexes have other intrinsic limitations (see Anastasi, 1971; Hebb, 1970). It is noteworthy that in the early part of the previously cited article, Jensen (1969, pp. 33-46) clearly lists these limitations among others. First, the concept of heritability is applicable to populations, not individuals. For example, in trying to establish the etiology of a particular child's mental retardation, the heritability index would be of no help. Regardless of the size of the heritability index in the population, the child's mental retardation could have resulted from a defective gene (as in phenylketonuria or PKU), from prenatal brain damage, or from extreme experiential deprivation.

Second, heritability indexes refer to the population on which they were found at the time. Any change in either hereditary or environmental
Psychological Issues in Intelligence Testing 353
(;2 Tests of GCllcrallntcllcctll~Ji L~~cl . '. .
conditions would alter the heritability index. For instance, an increase in inbreeding, as on an isolated island, would reduce the variance at-

that trait is unimportant. An extreme example may help to clarify the

a revision of the Army Alpha of World War I. The distribution of this

the correlation between reading comprehension at grade 2 and at grade 8 rose from .52 to .72 when father's occupation was added as a rough index of the cultural level of the home (p. 119). Again, a reanalysis of intelligence test data from the Harvard Growth Study showed that the correlation between intelligence test scores at ages 7 and 16 could be raised from .58 to .92, when parents' educational level was included in the multiple correlation (p. 80).

different abilities, but also the way in which intelligence becomes differentiated into identifiable traits.4 There is empirical evidence that the
number and nature of traits or abilities may change over time and may differ among cultures or subcultures (Anastasi, 1970).

An individual's intelligence at any one point in time is the end product of a vast and complex sequence of interactions between hereditary and environmental factors. At any stage in this causal chain, there is opportunity for interaction with new factors; and because each interaction in turn determines the direction of subsequent interactions, there is an ever-widening network of possible outcomes. The connection between the genes an individual inherits and any of his behavioral characteristics is thus highly indirect and devious (see Anastasi, 1958b, 1973; Hebb, 1953).

INTELLIGENCE AND PERSONALITY. Although it is customary and convenient to classify tests into separate categories, it should be recognized that all such distinctions are superficial. In interpreting test scores, personality and aptitudes cannot be kept apart. An individual's performance on an aptitude test, as well as his performance in school, on the job, or in any other context, is influenced by his achievement drive, his persistence, his value system, his freedom from handicapping emotional problems, and other characteristics traditionally classified under the heading of "personality."

Even more important is the cumulative effect of personality characteristics on the direction and extent of the individual's intellectual development. Some of the evidence for this effect, collected through longitudinal studies of children and adults, was summarized earlier in this chapter. Other investigations on groups ranging from preschool children to college students have been surveyed by Dreger (1968). Although some of the research on young children utilized a longitudinal approach, data from older subjects were gathered almost exclusively through concurrent correlations of personality test scores with intelligence test scores and indices of academic achievement. The data assembled by Dreger indicate the importance of considering appropriate personality variables as an aid in understanding an individual's intelligence test performance and in predicting his academic achievement.

It would thus seem that prediction of a child's subsequent intellectual development could be improved by combining information about his emotional and motivational characteristics with his scores on ability tests. A word should be added, however, regarding the assessment of "motivation." In the practical evaluation of schoolchildren, college students, job applicants, and other categories of persons, psychologists are often asked for a measure of the individual's "motivation." When thus worded, this is a meaningless request, since motivation is specific. What is needed is an indication of the individual's value system and the intensity with which he will strive toward specific goals. The strength of these specific motives will interact with situational factors, as well as with aptitudes, to determine the individual's actual performance in given situations.

The relation between personality and intellect is reciprocal. Not only do personality characteristics affect intellectual development, but intellectual level also affects personality development. Suggestive data in support of this relation are provided in a study by Plant and Minium (1967). Drawing upon the data gathered in five available longitudinal investigations of college-bound young adults, the authors selected the upper and lower 25 percent of each sample in terms of intelligence test scores. These contrasted groups were then compared on a series of personality tests that had been administered to one or more of the samples. The personality tests included measures of attitudes, values, motivation, and interpersonal and other noncognitive traits. The results of this analysis revealed a strong tendency for the high-aptitude groups to undergo substantially more "psychologically positive" personality changes than did the low-aptitude groups.

The success the individual attains in the development and use of his aptitudes is bound to influence his emotional adjustment, interpersonal relations, and self-concept. In the self-concept we can see most clearly the mutual influence of aptitudes and personality traits. The child's achievement in school, on the playground, and in other situations helps to shape his self-concept; and his self-concept at any given stage influences his subsequent performance, in a continuing spiral. In this regard, the self-concept operates as a sort of private self-fulfilling prophecy.

At a more basic theoretical level, K. J. Hayes (1962) has proposed a broadly oriented hypothesis concerning the relationship of drives and intellect. Regarding intelligence as a collection of learned abilities, Hayes maintains that the individual's motivational makeup influences the kind and amount of learning that occurs. Specifically, it is the strength of the "experience-producing drives" that affects intellectual development. These drives are illustrated by exploratory and manipulatory activities, curiosity, play, the babbling of infants, and other intrinsically motivated behavior. Citing chiefly research on animal behavior, Hayes argues that these experience-producing drives are genetically determined and represent the only hereditary basis of individual differences in intelligence. It might be added that the hereditary or environmental basis of the experience-producing drives need not alter the conceptualization of their role in intellectual development. These two parts of the theory may be considered independently.

Whatever the origin of the experience-producing drives, the individual's experience is regarded as a joint function of the strength of these drives and the environment in which they operate. The cumulative effect of
these experiences in turn determines the individual's intellectual level at any given time. This is a provocative hypothesis, through which Hayes integrates a considerable body of data from many types of research on both human and animal behavior.

On the basis of 25 years of research on achievement motivation, J. W. Atkinson (1974, 1976) and his co-workers have formulated a comprehensive schema representing the interrelationships of abilities, motivation, and environmental variables. The approach is dynamic in that it implies systematic change rather than constancy in the individual's lifetime. It also incorporates the reciprocal effects of aptitudes and motivational variables; and it emphasizes the contribution of motivational variance to test performance. To illustrate the application of this conceptual schema, computer simulations were employed, showing how ability and motivation can jointly influence both intelligence test performance and cumulative achievement. Some supporting empirical data are also cited regarding the high school grade-point average of boys as predicted from earlier intelligence test scores and a measure of achievement motivation (Atkinson, O'Malley, & Lens, 1976).

A diagram of Atkinson's conceptual schema is reproduced in Figure 57. Beginning at the left, this figure shows the combined operation of heredity and past formative environment in the development of both cognitive and noncognitive aspects of the total personality. The present environment provides the immediate task and contributes to motivational strength for this task relative to the competing motivational strengths of alternative actions. Motivation affects both the efficiency with which the task is performed and the time spent on it (e.g., studying, carrying out a job-related activity). Efficiency reflects the relationship between nature of task and current motivation. Level of performance results from the individual's relevant abilities (e.g., as assessed by test scores) and the efficiency with which he applies those abilities to the current task. The final achievement or product shows the combined effects of level of performance while at work and time spent at work. Another and highly important consequent of level of performance × time spent at work is the lasting cumulative effects of this activity or experience on the individual's own cognitive and noncognitive development. This last step represents a feedback loop to the individual's personality, whose effects are likely to be reflected in his future test scores.

Many implications of this approach for the interpretation of test scores are spelled out in the original sources cited above, which should be examined for further details. The schema provides a promising orientation for the more effective utilization of tests and for the better understanding of the conditions underlying human achievement.
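
The relations among ability, motivation, time at work, and cumulative achievement can be made concrete with a small simulation. The following Python sketch is illustrative only and is not Atkinson's own program. Only the rule that achievement is level of performance × time spent at work comes from the text. The sketch further assumes that level of performance is the product of ability and efficiency, that the share of available time given to the task is proportional to the strength of motivation for the task relative to that for alternatives, and that a fixed fraction of each period's achievement is added back to ability as the cumulative effect. The names `Person`, `time_share`, and `growth_rate`, and all numerical values, are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass
class Person:
    ability: float          # relevant ability in arbitrary units (e.g., a scaled test score)
    task_motivation: float  # strength of motivation for the task
    alt_motivation: float   # strength of motivation for alternative actions


def time_share(person: Person) -> float:
    """Fraction of available time spent on the task (assumed proportional rule)."""
    total = person.task_motivation + person.alt_motivation
    if total <= 0:
        raise ValueError("total motivational strength must be positive")
    return person.task_motivation / total


def level_of_performance(person: Person, efficiency: float) -> float:
    """Ability scaled by efficiency (the multiplicative form is an assumption)."""
    return person.ability * efficiency


def achievement(person: Person, efficiency: float, hours_available: float) -> float:
    """Level of performance x time spent at work."""
    time_spent = time_share(person) * hours_available
    return level_of_performance(person, efficiency) * time_spent


def simulate(person, efficiency, hours_available, periods, growth_rate):
    """Repeat the cycle, feeding part of each period's achievement back into ability.

    The feedback rule (ability += growth_rate * achievement) is hypothetical.
    Returns the achievement per period and the final ability.
    """
    current = replace(person)  # work on a copy so the caller's object is unchanged
    history = []
    for _ in range(periods):
        gained = achievement(current, efficiency, hours_available)
        history.append(gained)
        current.ability += growth_rate * gained
    return history, current.ability


if __name__ == "__main__":
    student = Person(ability=10.0, task_motivation=3.0, alt_motivation=1.0)
    per_period, final_ability = simulate(student, 0.8, 10.0, 3, 0.01)
    print(per_period, final_ability)
```

Under these assumptions, two persons of equal initial ability diverge over successive periods if one has stronger motivation for the task, because more time at work yields more achievement and, through the feedback step, higher later ability.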
" b
[Figure 57: diagram of Atkinson's conceptual schema. Recoverable labels: Heredity; Formative environment; Personality (Abilities, Motives); Immediate environment as guide to action; Nature of the task; Efficiency; Level of performance while at work; Strength of motivation; Time spent at work; Achievement; Cumulative effects; Strength of motivation for alternatives; Immediate environment as goad to action.]