
Test Validity and the Ethics of Assessment

SAMUEL MESSICK
Educational Testing Service, Princeton, New Jersey

ABSTRACT: Questions of the adequacy of a test as
a measure of the characteristic it is interpreted to
assess are answerable on scientific grounds by apprais-
ing psychometric evidence, especially construct validity.
Questions of the appropriateness of test use in proposed
applications are answerable on ethical grounds by
appraising potential social consequences of the testing.
The first set of answers provides an evidential basis
for test interpretation, and the second set provides a
consequential basis for test use. In addition, this
article stresses (a) the importance of construct validity
for test use because it provides a rational foundation for
predictiveness and relevance, and (b) the importance of
taking into account the value implications of test inter-
pretations per se. By thus considering both the evi-
dential and consequential bases of both test interpreta-
tion and test use, the roles of evidence and social values
in the overall validation process are illuminated, and test
validity comes to be based on ethical as well as
evidential grounds.
Fifteen years ago or so, in papers dealing with per-
sonality measurement and the ethics of assessment,
I drew a straightforward but deceptively simple
distinction between the psychometric adequacy of a
test and the appropriateness of its use (Messick,
1964, 1965). I argued that not only should tests be
evaluated in terms of their measurement properties
but that testing applications should be evaluated in
terms of their potential social consequences. I urged
that two questions be explicitly addressed whenever
a test is proposed for a specific purpose: First, is the
test any good as a measure of the characteristics it
is interpreted to assess? Second, should the test be
used for the proposed purpose in the proposed way?
The first question is a scientific and technical one
and may be answered by appraising evidence for the
test's psychometric properties, especially construct
validity. The second question is an ethical one, and
its answer requires a justification of the proposed
use in terms of social values. Good answers to the
first question are not satisfactory answers to the
second. Justification of test use by an appeal to
empirical validity is not enough; the potential social
consequences of the testing should also be appraised,
not only in terms of what it might entail directly as
costs and benefits but also in terms of what it makes
more likely as possible side effects.
These two questions were phrased to parallel two
recurrent criticisms of testing (that some tests are of poor quality and that tests are often misused) in an attempt to separate the frequently blurred issues in the typical critical interchange into (a) questions of test bias or the adequacy of measurement, and (b) questions of test fairness or the appropriateness of use (Messick, 1965).
It was in the context of appraising personality
measurement for selection purposes that I originally
stressed the need for ethical standards for justifying
test use (Messick, 1964). Although at that time
personality tests appeared inadequate for the selec-
tion task when systematically evaluated against
measurement and prediction standards, it seemed
likely that rapidly advancing research technology
would, in the relatively near future, produce psy-
chometrically sophisticated personality assessment
devices. Therefore, questions might soon arise in
earnest as to the scope of their practical application
beyond clinical and counseling usage. With vari-
ables as value-laden as personality characteristics,
it seemed critical that values as well as validity be
considered in contemplating test use.
Kaplan (1964) pointed out that "the validity of
a measurement consists in what it is able to accom-
plish, or more accurately, in what we are able to do
with it. . . . The basic question is always whether the measures have been so arrived at that they can serve effectively as means to the given end" (p. 198). Also at issue is whether the measures should
serve as means to the given end, in light of other
ends they might inadvertently serve and in con-
sideration of the place of the given end in the social fabric of pluralistic alternatives.

(This article was an invited address to the Divisions of Educational Psychology and of Evaluation and Measurement, presented at the meeting of the American Psychological Association, New York City, September 1, 1979. Requests for reprints should be sent to Samuel Messick, Educational Testing Service, Princeton, New Jersey 08541.)

For example,
should a psychometrically sound measure of "flex-
ibility versus rigidity" be used for selection in a
particular college if it significantly improves the
multiple prediction of grade point average there?
What if the direction of prediction favored rigid
students? What if entrance to a military academy
were at issue, or a medical school? What if the
scores had been interpreted instead as measures of
"confusion versus control"? What if there were
large sex differences in the score distributions? In
a different arena, what minimal levels of knowledge
and skill should be required for graduation from
high school and in what areas?
It seemed clear at this point that value issues in
measurement were not limited to personality assess-
ment, nor to selection applications, but should be
extended to all psychological and educational mea-
surement (Messick, 1975). This is primarily be-
cause psychological and educational variables all
bear, either directly or indirectly, on human char-
acteristics, processes, and products and hence are
inherently, though variably, value-laden. The
measurement of such characteristics entails value
judgments (at all levels of test construction, analysis, interpretation, and use), and this raises ques-
tions both of whose values are the standard and of
what should be the consequences of negative valu-
ation. Values thus appear to be as pervasive and
critical for psychological and educational measure-
ment as is testing's acknowledged touchstone,
validity. Indeed, "The root meaning of the word
'validity' is the same as that of the word 'value':
both derive from a term meaning strength" (Kap-
lan, 1964, p. 198).
It should be emphasized that value questions
arise with any approach to psychological and edu-
cational testing, whether it be norm-referenced or criterion-referenced (Glaser & Nitko, 1971), a
construct-based ability test or a content-sampled
achievement test (Messick, 1975), a reactive task
or an unobtrusive observation (Webb, Campbell,
Schwartz, & Sechrest, 1966), a sign or a sample
(Goodenough, 1969), or whatever, but the nature
of the critical value questions may differ from one
approach to another. For example, many of the
advantages of samples over signs derive from the
similarity of past behaviors to desired future be-
haviors, which makes it more likely that behavior-
sample tests will be judged relevant in both content
and process to the task or job domain about which
inferences are to be drawn. It is also likely that
scores from such samples, because of behavioral
consistency from one time to another, will be pre-
dictive of performance in those domains (Wern-
imont & Campbell, 1968). A key value question is
whether such "persistence forecasting," as Wallach
(1976) calls it, is desirable in a particular domain
of application. In higher education, for example,
the appropriate model might not be persistence but
development and change, which suggests that in
such instances we be wary of selection procedures
that restrict individual opportunity on the basis of
behavior to date (Hudson, 1976).
The distinction stressed thus far between the
adequacy of a test as a measure of the character-
istic it is interpreted to assess and the appropriate-
ness of its use in specific applications underscores
in the first instance the evidential basis of test
interpretation, especially the need for construct
validity evidence, and in the second instance the
consequential basis of test use, through appraisal
of potential social consequences. In developing
this distinction in prior work I emphasized the
importance of construct validity for test use as
well, arguing "that even for purposes of applied
decision making reliance upon criterion validity or
content coverage is not enough," that the meaning
of the measure must also be comprehended in order
to appraise potential social consequences sensibly
(Messick, 1975, p. 956). The present article
extends this argument for the importance of con-
struct validity in test use still further by stressing
its role in providing a "rational foundation for pre-
dictive validity" (Guion, 1976b). After thus
elaborating the evidential basis of test use, I con-
sider the value implications of test interpretations
per se, especially those that bear evaluative and
ideological overtones going beyond intended mean-
ings and supporting evidence; the circle is thereby
completed with an examination of the consequential
basis of test interpretation. Finally, the dynamic
interplay between test interpretation and its value
implications, on the one hand, and test use and its
social consequences, on the other, is sketched in a
feedback model that incorporates a pragmatic
component for the empirical evaluation of testing
consequences.
Validity as Inference From Evidence
According to the Standards for Educational and
Psychological Tests (American Psychological Asso-
ciation et al., 1974), "Questions of validity are
questions of what may properly be inferred from a
test score; validity refers to the appropriateness of
inferences from test scores or other forms of assess-
ment. . . . It is important to note that validity is
itself inferred, not measured. . . . It is, therefore,
something that is judged as adequate, or marginal,
or unsatisfactory" (p. 2 5). This document also
points out that the many forms of validity ques-
tions fall into two broad classes, those dealing with
inferences about what is being measured by the
test and those inquiring into the usefulness of the
measurement as a predictor of other variables.
Furthermore, there are a variety of validation
methods available, but they all entail in principle
a clear designation of what is to be inferred from
the scores and the presentation of data to support
such inferences.
Unfortunately, after this splendid beginning, this
and other official documents, namely, the Division of Industrial and Organizational Psychology's (1975) Principles for the Validation and Use of Personnel Selection Procedures and the Equal Employment Opportunity Commission et al.'s (1978) "Uniform Guidelines on Employee Selection Procedures," proceed, as Dunnette and Borman (1979) lament, to "perpetuate a conceptual compartmentalization of 'types' of validity: criterion-related, content, and construct. . . . the implication that validities come in different types leads to confusion and, in the face of confusion, over-simplification" (p. 483). One consequence
of this simplism is that many test users focus on
one or another of the types of validity, as though
any one would do, rather than on the specific infer-
ences they intend to make from the scores. There
is an implication that once evidence of one type of
validity is forthcoming, one is relieved of respon-
sibility for further inquiry. Indeed, the "Uniform
Guidelines" seem to treat the three types of valid-
ity, in Guion's (1980) words, "as something of a Holy Trinity representing three different roads to psychometric salvation. If you can't demonstrate one kind of validity, you've got two more chances!" (p. 4).
Different kinds of inferences from test scores
require different kinds of evidence, not different
kinds of validity. By "evidence" I mean both
data, or facts, and the rationale or arguments that
cement those facts into a justification of test-score
inferences. "Another way to put this is to note
that data are not information; information is that
which results from the interpretation of data"
(Mitroff & Sagasti, 1973, p. 12 3). O r as Kaplan
(1964) states, '-'What serves as evidence is the
result of a process of interpretationfacts do not
speak for themselves; nevertheless, facts must be
given a hearing, or the scientific point to the process
of interpretation is lost" (p. 375). Facts and
rationale thus blend in this view of evidence, and
the tolerable balanced between them in the arena of
test validity extends over a considerable range,
possibly even falling just short of the one extreme
where facts are left to speak for, themselves and the
other extreme where a logical rationale alone is
deemed self-evident.
By focusing on the nature of the evidence in
relation to the nature of the inferences drawn from
test scores, we come to view validity as a general
imperative in measurement. Validity is the overall
degree of justification for test interpretation and
use. It is "an evaluation, considering all things,
of a certain kind of inference about people who
obtain a certain score" (Guion, 1978b, p. 50 0 ).
Although it may prove helpful conceptually to
discuss the interdependent features of the generic
concept in terms of different aspects or facets, it is
simplistic to think of different types or kinds of
validity.
From this standpoint we are not very well served
by labeling different aspects of a general concept
with the name of the concept, as in criterion-related
validity, content validity, or construct validity, or
by proliferating a host of specialized validity mod-
ifiers, such as discriminant validity, trait validity,
factorial validity, structural validity, or population
validity, each delimiting some aspect of a broader
meaning. The substantive points associated with
each of these terms are important ones, but their
distinctiveness is blunted by calling them all
"validity." Since many of the referents are similar
but not identical, they tend to be assimilated one
to another, leading to confusion among them and
to a blurring of the different forms of evidence that
the terms were invoked to highlight in the first
place. Worse still, any one of these so-called
validities, or a small set of them, might be treated
as the whole of validity, while the entire collection
to date might still not exhaust the essence of the
whole.
We would be much better off conceptually to use
labels more descriptive of the character and intent
of each aspect, such as content relevance and con-
tent coverage rather than content validity, or pop-
ulation generalizability rather than population
validity. Table 1 lists a number of currently used
validity terms along with a tentative descriptive
designation for each that is intended to underscore
differences among the concepts while at the same
time highlighting the key feature of each, such as
consistency or utility, and pointing to essential
areas of similarity and overlap, as with criterion
relatedness, nomological relatedness, and external
relatedness. With one possible exception to be
discussed subsequently, none of these concepts
qualify for the accolade of validity, for at best they
are only one facet of validity and at worst, as in
the case of content coverage, they are not validity
at all. So-called "content validity" refers to the
relevance and representativeness of the task con-
tent used in test construction and does not refer to
test scores at all, let alone evidence to support
inferences from test scores, although such content
considerations do permit elaborations on score
inferences supported by other evidence (Guion,
1977a, 1978a; Messick, 1975; Tenopyr, 1977).
I will comment on most of the concepts in Table
1 in passing while considering the claim of the one exception noted earlier (namely, construct validity) to bear the name "validity" and to wear the mantle of all that name implies. I have pressed in previous writing for the view that "all measurement should be construct-referenced" (Messick, 1975, p. 957). Others have similarly argued that "any inference relative to prediction and . . . all inferences relative to test scores, are based upon underlying constructs" (Tenopyr, 1977, p. 48). Guion (1977b, p. 410) concluded that "all validity
is at its base some form of construct validity. . . .
It is the basic meaning of validity." I will argue,
building on Guion's (1976b) conceptual ground-
work, that construct validity is indeed the unifying
concept of validity that integrates criterion and
content considerations into a common framework
for testing rational hypotheses about theoretically
relevant relationships. The bridge or unifying
theme that permits this integration is the meaning-
fulness or interpretability of the test scores, which
is the goal of the construct validation process. This
construct meaning provides a rational basis both
for hypothesizing predictive relationships and for
judging content relevance and representativeness.
I stop short, however, as did Guion (1980), of
equating construct validity with validity in gen-
eral, but for different reasons. The main basis for
hesitancy on my part, as we shall see, is that valid-
ity entails an evaluation of the value implications
of both test interpretation and test use. These
implications derive primarily from the test's con-
struct meaning, to be sure, and they feed back
into the construct validation process, but they
also derive in part from broader social ideologies,
TABLE 1
Alternative Descriptors for Aspects of Test Validity

Validity designation      Descriptive designation
Content validity          Content relevance (domain specifications); Content coverage (domain representativeness)
Criterion validity        Criterion relatedness
Predictive validity       Predictive utility
Concurrent validity       Diagnostic utility; Substitutability
Construct validity        Interpretive meaningfulness
Convergent validity       Convergent coherence
Discriminant validity     Discriminant distinctiveness
Trait validity            Trait correspondence
Nomological validity      Nomological relatedness
Factorial validity        Factorial composition
Substantive validity      Substantive consistency
Structural validity       Structural fidelity
External validity         External relatedness
Population validity       Population generalizability
Ecological validity       Ecological generalizability
Temporal validity         Temporal continuity (across developmental levels); Temporal generalizability (across historical periods)
Task validity             Task generalizability
such as the ideologies of social science or of educa-
tion or of social justice, and hence go beyond con-
struct meaning per se.
INTERPRETIVE MEANINGFULNESS
Construct validation is a process of marshaling evi-
dence to support the inference that an observed
response consistency in test performance has a
particular meaning, primarily by appraising the
extent to which empirical relationships with other
measures, or the lack thereof, are consistent with
that meaning. These empirical relationships may
be assessed in a variety of ways, for example, by
gauging the degree of consistency in correlational
patterns and factor structures, in group differences,
response processes, and changes over time, or in
responsiveness to experimental treatments. The
process attempts to link the reliable response con-
sistencies summarized by test scores to nontest
behavioral consistencies reflective of a presumably
common underlying construct, usually an attribute
or process or trait that is itself embedded in a more
comprehensive network of theoretical propositions
or laws called a nomological network (Feigl, 1956;
Hempel, 1970; Margenau, 1950). An empirically
grounded pattern of such links provides an evi-
dential basis for interpreting the test scores in
construct or process terms, as well as a rational
basis for inferring testable implications of the
scores from the broader theoretical network of the
construct's meaning (Cronbach & Meehl, 1955;
Messick, 1975). Constructs are thus chosen or
created "to organize experience into general law-
like statements" (Gronbach, 1971, p. 462 ).
Construct validation entails both confirmatory
and disconfirmatory strategies, one to provide
convergent evidence that the measure in question
is coherently related to other measures of the same
construct as well as to other variables that it should
relate to on theoretical grounds, and the other to
provide discriminant evidence that the measure is
not related unduly to exemplars of other distinct
constructs (D. T. Campbell & Fiske, 1959). Dis-
criminant evidence is particularly critical for dis-
counting plausible counterhypotheses to the con-
struct interpretation (Popper, 1959), especially
those pointing to the possibility that the observed
consistencies might instead be attributable to
shared method constraints, response sets, or other
contaminants.
Construct validity emphasizes two intertwined
sets of relationships for the test: one between the
test and different methods for measuring the same
construct or trait, and the other between measures
of the focal construct and exemplars of different
constructs predicted to be variously related to it
on theoretical grounds. Theoretically relevant
empirical consistencies in the first set, indicating a
correspondence between measures of the same con-
struct, have been called trait validity, and those in
the second set, indicating a lawful relatedness be-
tween measures of different constructs, have been
called nomological validity (D. T. Campbell, 1960;
Cronbach & Meehl, 1955). In order to discount
competing hypotheses involving alternative con-
structs or method contaminants, the two sets are
often analyzed simultaneously in a multitrait-
multimethod strategy that employs multiple meth-
ods for assessing each of two or more different
constructs (D. T. Campbell & Fiske, 1959). Such
an approach highlights the need for both convergent
and discriminant evidence in both trait and nomo-
logical validity.
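
As a purely illustrative sketch of how such convergent and discriminant evidence might be inspected, the following Python fragment (with simulated data and invented variable names, not anything drawn from the studies cited) assembles correlations among two hypothetical traits each measured by two hypothetical methods and compares the same-trait, different-method correlations with the different-trait correlations.

import numpy as np

# Simulated scores for 200 examinees on two traits (A, B), each measured by
# two methods (1, 2); the layout mirrors a minimal multitrait-multimethod design.
rng = np.random.default_rng(0)
n = 200
trait_a = rng.normal(size=n)
trait_b = 0.3 * trait_a + rng.normal(size=n)      # traits allowed to correlate modestly
method_1 = rng.normal(size=n)                     # shared method variance, method 1
method_2 = rng.normal(size=n)                     # shared method variance, method 2
scores = np.column_stack([
    trait_a + 0.5 * method_1 + rng.normal(scale=0.5, size=n),  # measure A1
    trait_a + 0.5 * method_2 + rng.normal(scale=0.5, size=n),  # measure A2
    trait_b + 0.5 * method_1 + rng.normal(scale=0.5, size=n),  # measure B1
    trait_b + 0.5 * method_2 + rng.normal(scale=0.5, size=n),  # measure B2
])
r = np.corrcoef(scores, rowvar=False)

# Convergent evidence: same trait measured by different methods.
convergent = [r[0, 1], r[2, 3]]
# Discriminant evidence: different traits, with and without a shared method.
discriminant = [r[0, 2], r[1, 3], r[0, 3], r[1, 2]]
print("convergent (same trait):", np.round(convergent, 2))
print("discriminant (different traits):", np.round(discriminant, 2))

# A construct interpretation gains support when the convergent values are
# substantial and clearly exceed the discriminant values, including those that
# share a method, which would otherwise point to method contamination rather
# than trait variance.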
Trait validity deals with the fit between measure-
ment operations and conceptual definitions of the
construct, and nomological validity deals with the fit
between obtained data patterns and theoretical pre-
dictions about those patterns (Cook & Campbell,
1979). The former is concerned with the meaning
of the measure as a reflection of the construct, and
the latter is concerned with the meaning of the
construct as reflected in the measure's relational
properties. Both aspects are intrinsic to construct
validity, and the interplay between them leads
to iterative refinements of measures, constructs, and
theories over time. Thus, the paradox that mea-
sures are needed to define constructs and constructs
are needed to build measures is resolved, like all
existential dilemmas in science, by a process of
successive approximation (Kaplan, 1964; Lenzen,
1955).
It will be recalled that the Standards for Educa-
tional and Psychological Tests (APA et al., 1974)
condensed the variety of validity questions into
two types, those dealing with the intrinsic nature
or meaning of the measure and those dealing with
its use as an indicator or predictor of other vari-
ables. In the present context, this distinction
should be seen as a whole-part relationship: Evi-
dence bearing on the meaning of the measure
embraces all of construct validity, whereas evidence
for certain predictive relationships contributes to
that part called nomological validity. Some predictive relationships, namely, those between the measure and specific applied criterion behaviors, are traditionally singled out for special attention
under the rubric of "criterion-related validity," and
it therefore follows that this too is subsumed con-
ceptually as part of construct validity.
This does not mean, however, that construct
validity in general can replace criterion-related
validity in particular in applied settings. The
criterion correlates of a measure constitute strands
in the construct's nomological network, but their
empirical basis is still to be checked. Thus,
"criterion-related validity is intended to show the
validity, not of the test, but of that hypothesis" of
relationship to the criterion (Guion, 1978a, p. 207).
The analysis of criterion variables within the mea-
sure's construct network, especially if conducted in
tandem with the construct validation of the criterion
measures themselves, provides a powerful rational
basis for criterion prediction (Guion, 1976b).
CRITERION RELATEDNESS
So-called "criterion-related validity" is usually con-
sidered to comprise two types, concurrent validity
and predictive validity, which differ respectively in
terms of whether the test and criterion data were
collected at the same time or at different times. A
more fundamental distinction would recognize that
concurrent correlations with criteria are usually
obtained either to appraise the diagnostic effective-
ness of the test in detecting current behavioral pat-
terns or to assess the suitability of substituting the
test for a longer, more cumbersome, or more expen-
sive criterion measure. It would also be more help-
ful in both the predictive and the concurrent case to
characterize the function of the relationship in terms
of utility rather than validity. Criterion relatedness
differs from the more general nomological relatedness
in being more narrowly stated and pointed toward
specific sets of data and specific applied settings.
In criterion relatedness we are concerned not just
with verifying the existence of relationships and
gauging their strength, but with identifying useful
relationships under the applied conditions. Utility
is the more appropriate concept in such instances
because it implies interpretation of the correlations
in the decision context in terms of indices of predic-
tive efficiency relative to base rates, mean gains in
criterion performance due to selection, the dollar
value of such gains relative to costs, and so forth
(Brogden, 1946; Cronbach & Gleser, 1965; Curtis
& Alf, 1969; Darlington & Stauffer, 1966; H unter,
Schmidt, & Rauschenberger, 1977).
In developing rational hypotheses of criterion
relatedness, we not only need a conception of the
construct meaning of the predictor measures, as we
have seen, but we also need to conceptualize criterion
constructs, basing judgments on data from job or
task analyses and the construct validation of pro-
visional criterion measures (Guion, 1976b). In the
last analysis, the ultimate criterion is determined on
rational grounds (Thorndike, 1949); in any event,
it "can best be described as a psychological construct
. . . [and] the process of determining the relevance
of the immediate to the ultimate criterion becomes
one of construct validation" (Kavanagh, Mac-
Kinney, & Wolins, 1971, p. 35). It is particularly
crucial to identify criterion constructs whenever
potentially contaminated criterion measures, such as
ratings or especially multiple ratings from different
sources, are employed (James, 1973). In the face
of impure or contaminated criterion measures, the
question of the intrinsic nature of the relation be-
tween predictor and criterion comes to the fore
(Gulliksen, 1950), and construct validity is needed to broach that issue. "In other words, an orientation toward construct validation in criterion research is the best way of guarding against a hopelessly incomplete job of criterion development"
(Smith, 1976, p. 768). Thus, if construct validity
is not available on the predictor side, it better be
on the criterion side, and both "must have adequate
construct validity for their respective sides if the
theory is to be tested adequately" (Guion, 1976b,
p. 802).
Implicit in this rational approach to predictive
hypotheses there is thus also a rational basis for
judging the relevance of the test to the criterion
domain. This provides a means of coping with the
quasi-judicial term job-relatedness, even in the case
where criterion-related empirical verification is
missing. "Where it is clearly not feasible to do the
study, the defense of the predictor can rest on a
combination of its construct validity and the
rational justification for the inclusion of the con-
struct in the predictive hypothesis" (Guion, 1974,
p. 291). The case becomes stronger if the pre-
dicted relationship has been verified empirically in
other settings. Guion (1974), for one, has main-
tained that this stance offers better evidence of
job-relatedness than does a tenuous criterion-
related study done under pressure with small
samples, low variances, or questionable criterion
measures. On the other hand, the simple demon-
stration of an empirical relationship between a
measure and a criterion in the absence of a cogent
rationale is a dubious basis for justifying relevance
or use (Messick, 1964, 1975).
CONTENT RELEVANCE AND CONTENT COVERAGE
The other major basis for judging the relevance
of the test to the behavioral domain about which
inferences are to be drawn is so-called "content
validity." Content validity in its classic form
(Cronbach, 1971) is limited to the strict behavioral
language of task description, for otherwise, con-
structs are apt to be invoked and we have another
case of construct validity. There are two main
facets to content validity: One is content relevance,
which refers to the specification of the behavioral
domain in question and the attendant specification
of the task or test domain. Specifying domain
boundaries is essentially a requirement of opera-
tional definition and, in the absence of appeal to
a construct theory of task performance, is limited
to a statement of admissible task characteristics
and behavioral requirements. The other facet is
content coverage, which refers to the specification
of procedures for sampling the domain in some
representative fashion. The concern is thus with
content sampling of a specified content domain,
which is a prescription for test construction, not
validity. Consensual judgments about the rele-
vance of the test domain as defined to a particular behavioral domain of interest (as, for example,
when choosing a standardized achievement test to
evaluate a new curriculum), along with judgments
of the adequacy of content coverage in the test,
are the kinds of evidence usually offered for content
validity. But note that this is not evidence in
support of inferences from test scores, although it
might influence the nature of those inferences.
This attempt to define content validity as sep-
arate from construct validity produces a dysfunc-
tional strain to avoid constructs, as if shunning
them in test development somehow lessens the
import "of response processes in test performance.
The important sampling consideration in test con-
struction is not representativeness of the surface
content of tasks but representativeness of the pro-
cesses employed by subjects in arriving at a
response (Lennon, 1956). This puts content
validity squarely in the realm of construct validity
(Messick, 1975). Rather than strain after neb-
ulous distinctions, we should inquire how content
considerations contribute to construct validity and
how to strengthen that contribution (Tenopyr,
1977).
Loevinger (1957) incorporated content as an
important feature of construct validity by consider-
ing content representativeness and response con-
sistency jointly. What she called "substantive
validity" is "the extent to which the content of the
items included in (and excluded from?) the test
can be accounted for in terms of the trait believed
to be measured and the context of measurement"
(Loevinger, 1957, p. 661). This notion was intro-
duced "because of the conviction that consider-
ations of content alone are not sufficient to establish
validity even when the test content resembles the
trait, and considerations of content cannot be
excluded when the test content least resembles the
trait" (Loevinger, 1957, p. 657). The elimination
of certain items from the test because of poor
empirical response properties may sometimes distort
the test's representativeness in covering the con-
struct domain as originally conceived, but it is
justified if the resulting test thereby becomes a
better exemplar of the construct as empirically
grounded (Loevinger, 1957; Messick, 1975).
Content validity has little to say about the scor-
ing of content samples, and as a result scoring pro-
cedures are typically ad hoc (Guion, 1978b). Scor-
ing models in the construct framework, in contrast,
logically parallel the structural relations inherent in
behavioral manifestations of the construct being
measured. Loevinger (1957) drew explicit atten-
tion to the need for rational scoring models by
coining the term structural validity, which includes
"both the fidelity of the structural model to the
structural characteristics of non-test manifestations
of the trait and the degree of inter-item structure"
(p. 661).
Even in instances where the test is an undisputed
representative sample of the behavioral domain of
interest and the concern is with the demonstration
of task accomplishment per se regardless of the
processes underlying performance (cf. Ebel, 1961,
1977), empirical evidence of response consistency
and not just representative content sampling is
important. In such cases, inferences are usually
drawn from the sample performance to domain per-
formance, and these inferences should be buttressed
by indices of the internal-consistency type to gauge
the extent of generalizability to other items like
those in the sample, to other tests developed in
parallel fashion, and so forth (J. P. Campbell, 1976; Cronbach, Gleser, Nanda, & Rajaratnam, 1972).
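
As one hedged illustration of the internal-consistency indices meant here, the Python fragment below computes coefficient alpha for a small invented persons-by-items matrix; the data and the helper function are hypothetical, and estimating generalizability to parallel forms or to other facets of the testing procedure would in practice draw on the fuller framework of Cronbach, Gleser, Nanda, and Rajaratnam (1972).

import numpy as np

def coefficient_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a persons-by-items matrix of scores."""
    n_items = item_scores.shape[1]
    item_variance_sum = item_scores.var(axis=0, ddof=1).sum()
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1 - item_variance_sum / total_variance)

# Invented example: six examinees responding to four dichotomously scored tasks
# sampled in the same way from the same content domain.
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
], dtype=float)
print(round(coefficient_alpha(responses), 2))

# Higher values indicate that inferences from this sample of tasks are more
# likely to generalize to other items drawn in the same fashion from the domain.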
We should also consider the possibility that the
test might contain sources of variance irrelevant to
domain performance, which is a particularly impor-
tant consideration in interpreting low scores. Con-
tent validity at best is a unidirectional concept:
Although it may undergird certain straightforward
interpretations for high scorers (such as "they
possess suitable skills to perform the tasks cor-
rectly, because they did so repeatedly"), it provides
no basis for interpreting low scores in terms of
incompetence or lack of skill. To do that requires
the discounting of plausible counterhypotheses
about such irrelevancies in the testing as anxiety,
defensiveness, inattention, or low motivation
(Guion, 1978a; Messick, 1975, 1979). And the
empirical discounting of plausible rival hypotheses
is the hallmark of construct validation.
GENERALITY OF CONSTRUCT MEANING
The issue of generalizability just broached for con-
tent sampling permeates all of validity. Several
aspects of generalizability of special concern have
been given distinctive labels, but unfortunately
these labels once again invoke the sobriquet
validity. The extent to which a measure's empir-
ical relations and construct interpretation gen-
eralize to other population groups is called "popula-
tion validity" (Shulman, 1970 ); to other situations
or settings, "ecological validity" (Bracht & Glass,
1018 N O VEMBER 1980 AMERICAN PSYCHOLOGIST
1968; Snow, 1974); to other times,- "temporal
validity" (Messick & Barrows, 1972 ); and to
other tasks representative of the operations called
for in the particular domain of interest, "task
validity" (Shulman, 1970 ).
The label validity is especially unsuitable for
these important facets of generalizability, for such
usage might be taken to imply that the more
generalizable a measure is, the more valid. This is
not always the case, however, as in the measure-
ment of such constructs as mood, which fluctuates
over time, or concrete operations, which typify a
particular developmental stage, or administrative
role, which operates in special organizational set-
tings, or delusions, which are limited to specific
psychotic groups. Rather, the appropriate degree
of generalizability for a measure depends upon the
nature of the construct assessed and the scope of
its theoretical applicability. A closely related issue
of "referent generality" (Coan, 1964; Snow, 1974),
called "referent validity" by Cook and Campbell
(1979), concerns the extent to which research evi-
dence supports a measure's range of reference and
the multiplicity of its referent terms. This con-
cept points to the need to tailor the level of con-
struct interpretation to the limits of the evidence
and to avoid both oversimplification and over-
generalization in the connotation of construct
labels. Nonetheless, constructs refer not only to
available evidence but to potential evidence, so
that the choice of construct labels is influenced by
theory as well as by evidence and, as we shall see,
by ideologies about the nature of humanity and
society which add value implications that go
beyond evidential validity per se.
EVIDENTIAL BASIS OF TEST INTERPRETATION AND USE
To recapitulate thus far, construct validity is the
evidential basis of test interpretation. It entails
both convergent and discriminant evidence docu-
menting theoretically relevant empirical relation-
ships (a) between the test and different methods
for measuring the same construct, as well as (b)
between measures of the construct and exemplars
of different constructs predicted to be related
nomologically. For test use, the relevance of the
construct for the applied purpose is determined, in addition, by developing rational hypotheses relating
the construct to performance in the applied domain.
Some of the construct's nomological relations thus
become criterial when made specific to the applied
setting. The empirical verification of this rational
hypothesis contributes to the construct validity of
both the measure and the criterion, and the utility
of the applied relation supports the practicality of
the proposed use. Thus, the evidential basis of
test use is also construct validitybut elaborated
to determine the relevance of the construct to the
applied purpose and the utility of the measure in
the applied setting.
In all of this discussion I have tried to avoid the
language of necessary and sufficient requirements,
because such language seemed simplistic for a com-
plex and holistic concept like test validity. On the
one hand, construct validation is a continuous,
never-ending process developing an ever-expanding
mosaic of research evidence. At any point new
evidence may dictate a change in construct, theory,
or measurement, so that in the long run it is diffi-
cult to claim sufficiency for any piece. On the
other hand, given that the mosaic of evidence is
reasonably dense, it is difficult to claim that any
piece is necessary, even, as we have seen, empirical
evidence for criterion-related predictive relation-
ships in specific applied settings, provided, of
course, that other evidence consistently supports
a compelling rationale for the application.
Since the evidence in these evidential bases de-
rives from empirical studies evaluating hypotheses
about relationships or about the structure of sets
of relationships, we must also be concerned about
the quality of those studies themselves and about
the extent to which the research conclusions are
tenable or are threatened by plausible counter-
hypotheses to explain the results (Guion, 1980).
Four classes of threats to the tenability and gen-
eralizability of research conclusions are discussed
by Cook and Campbell (1979), with primary
reference to quasi-experimental and experimental
research but also relevant to nonexperimental cor-
relational studies. These four classes deal, respec-
tively, with the questions of (a) whether a relation-
ship exists between two variables, an issue called
"statistical conclusion validity"; (b) whether the
relationship is plausibly causal from one variable
to the other, called "internal validity"; (c) what
interpretive constructs underlie the relationship,
called "construct validity"; and (d) the extent to
which the interpreted relationship generalizes to
and across other population groups, settings, and
times, called "external validity."
I will not discuss here the first question raised
by Cook and Campbell except simply to affirm that
the tenability of statistical conclusions about the
existence and strength of relationships is of course
basic to the whole enterprise. I have already dis-
cussed construct validity and external generalizabil-
ity, although it is important to note in connection
with the latter that I was referring to the generaliz-
ability of a measure's empirical relations and con-
struct interpretation to other populations, settings,
and times, whereas Cook and Campbell (1979)
were referring to the generalizability of research
conclusions that two variables (and their attendant
constructs) are causally related one to the other.
My emphasis was on the generality of a measure's
construct meaning based on any relevant evidence
(Messick, 1975; Messick & Barrows, 1972 )com-
monality of factor structures, for examplewhile
theirs was on the generality of a causal relationship
from one measure or construct to another based on
experimental or quasi-experimental treatments.
Verification of the hypothesis of causal relation-
ship is what Cook and Campbell term internal
validity, and such evidence contributes importantly
to the nomological basis of a measure's construct
meaning for those construct theories entailing
causal claims. Internal validity thus provides the
evidential basis for causal strands in a nomological
network. The tenability of cause-effect implica-
tions is important for the construct validity of a
variety of educational and psychological measures,
such as those interpreted in terms of intelligence,
achievement, or motivation. Indeed, the causal
overtones of constructs are one source of the value
implications of test interpretation, a topic I will
turn to shortly.
Validity as Evaluation of Implications
Since validity is an evaluation of evidence, a judg-
ment rather than an entity, and since some evi-
dential basis should be provided for the interpreta-
tion and use of any test, validity has always been
an ethical imperative in testing. As Burton (1978)
put it, "Validity (as the word implies) has been
primarily an ethical requirement of tests, a pre-
requisite guarantee, rather than an active com-
ponent of the use and interpretation of tests" (p. 264). She went on to argue that with criterion-referenced testing, "Glaser, in essence, was taking traditional validity out of the realm of ethics into the active arena of test use" (p. 264). Glaser
may have taken traditional validity into the active
arena of test use, as it were, but it never left the
realm of ethics because test use itself is an ethical
issue.
If test validity is the overall degree of justification
for test interpretation and use, and if human and
social values encroach on both interpretation and
use, as they do, then test validity should take
account of those value implications in the overall
judgment. The concern here, as in most ethical
issues, is with evaluating the present and future
consequences of interpretation and use (Church-
man, 1961). If, as an intrinsic part of the overall
validation process, we weigh the actual and poten-
tial consequences of our testing practices in light
of considerations of what future society might need
or desire, then test validity comes to be based on
ethical as well as evidential grounds.
CONSEQUENTIAL BASIS OF TEST USE
Value issues have long been recognized in connec-
tion with test use. We have seen that one of the
key questions to be posed whenever a test is sug-
gested for a specific purpose is "Should it be used
for that purpose?" Answers to that question
require an evaluation of the potential consequences
of the testing in terms of social values, but that is
no trivial enterprise. There is no guarantee that
at any point in time we will identify all of the
critical possibilities, especially those unintended
side effects that are distal to the manifest testing
aims.
There are few prescriptions for how to proceed
here, but one recommendation is to contrast the
potential social consequences of the proposed test-
ing with the potential social consequences of alter-
native procedures and even of procedures antago-
nistic to testing. This pitting of the proposed test
use against alternative proposals is an instance of
what Churchman (1971) has called Kantian
inquiry; the pitting against antithetical counter-
proposals is called Hegelian inquiry. The intent
of these strategies is to draw attention to vulner-
abilities in the proposed use and to expose its tacit
value assumptions to open debate. In the context
of testing, a particularly powerful and general form
of counterproposal is to weigh the potential social
consequences of the proposed test use against the
potential social consequences of not testing at all
(Ebel, 1964).
The role of values in test use has been intensively
examined in certain selection applications, namely,
in those where different population groups display
significantly different means on predictors, or
criteria, or both. Since fair test use implies that
selection decisions will be equally appropriate
regardless of an individual's group membership, and
since different selection systems yield different
proportions of selected individuals in the different
groups, the question of test fairness arises in ear-
nest. In good Kantian fashion, several models of
fair selection were formulated and contrasted with
each other (Cleary, 1968; Cole, 1973; Darlington,
1971; Einhorn & Bass, 1971; Linn, 1973, 1976;
Thorndike, 1971); some, having been found incom-
patible or even mutually contradictory, offered good
Hegelian contrasts (Peterson & Novick, 1976). It
soon became apparent in comparing these models
that each accorded a different importance or value
to the various subsets of selected versus rejected
and successful versus unsuccessful individuals in
the different population groups (Dunnette & Bor-
man, 1979; Linn, 1973). Moreover, the values
accorded are a function not only of desired criterion
performance but of desired individual and group
attributes (N ovick & Ellis, 1977). Thus, each
model not only constitutes a different definition of
fairness but also implies a particular ethical posi-
tion (H unter & Schmidt, 1976). Each view is
ostensibly fair under certain conditions, so that
arguments over the fairness of test use turn out in
many instances to be disagreements as to what the
conditions are or ought to be.
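
To make one of these contrasts concrete, the sketch below (using the regression approach usually associated with Cleary, 1968, and entirely hypothetical data and names) checks whether a single pooled prediction equation systematically over- or under-predicts the criterion for either of two groups; the competing models cited above differ mainly in how they value the selection outcomes such an equation produces.

import numpy as np

def mean_prediction_error_by_group(x, y, group):
    """Mean residual from a pooled regression, computed within each group.

    Under the regression (Cleary-type) definition, use of a common prediction
    equation is fair when neither group's criterion scores are systematically
    over- or under-predicted, that is, when these means are near zero.
    """
    slope, intercept = np.polyfit(x, y, deg=1)    # pooled prediction equation
    residuals = y - (slope * x + intercept)
    return {g: float(residuals[group == g].mean()) for g in np.unique(group)}

# Hypothetical predictor (test score), criterion, and group labels.
rng = np.random.default_rng(1)
group = np.repeat(["group_1", "group_2"], 100)
x = np.concatenate([rng.normal(0.3, 1.0, 100), rng.normal(-0.3, 1.0, 100)])
y = 0.5 * x + rng.normal(scale=1.0, size=200)     # same true relation in both groups
print(mean_prediction_error_by_group(x, y, group))

# Near-zero means for both groups are consistent with fairness in the Cleary
# sense; quota, conditional-probability, and utility-based models can still
# disagree because they weigh the selected and rejected subgroups differently.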
With the recognition that fundamental value
differences were at issue, several utility models were
developed that required specific value positions to
be taken (Cronbach, 1976; Gross & Su, 1975;
P eterson & N ovick, 1976; Sawyer, Cole, & Cole,
1976), thereby incorporating social values explicitly
with measurement technology. But making values
explicit does not determine choices among them,
and at this point it appears difficult if not impos-
sible to be fair to individuals in terms of equity, to
groups in terms of parity or adverse impact, to
institutions in terms of efficiency, and to society in
terms of benefits and risks all at the same time. A
workable balancing of the needs of all of the parties
is likely to require successive approximations over
time, with iterative modifications of utility matrices
based on experience with the consequences of
decision processes to date (Darlington, 1976).
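
A minimal computational sketch of what making such value positions explicit can look like follows; the utility entries and outcome probabilities are invented placeholders, not the specific formulations of the models just cited, and choosing the actual numbers is precisely the value judgment that measurement technology cannot make by itself.

# Expected utility of a selection rule under an explicitly stated utility matrix.
utilities = {
    ("selected", "successful"): 1.0,
    ("selected", "unsuccessful"): -0.5,
    ("rejected", "successful"): -1.0,   # an opportunity denied; weight as policy dictates
    ("rejected", "unsuccessful"): 0.0,
}

# Hypothetical joint probabilities of the four outcomes under one cutoff,
# estimated separately for two applicant groups.
outcome_probabilities = {
    "group_1": {("selected", "successful"): 0.30, ("selected", "unsuccessful"): 0.10,
                ("rejected", "successful"): 0.15, ("rejected", "unsuccessful"): 0.45},
    "group_2": {("selected", "successful"): 0.20, ("selected", "unsuccessful"): 0.10,
                ("rejected", "successful"): 0.25, ("rejected", "unsuccessful"): 0.45},
}

for group, probabilities in outcome_probabilities.items():
    expected = sum(probabilities[outcome] * utilities[outcome] for outcome in probabilities)
    print(group, round(expected, 3))

# Revising the utility entries in light of observed consequences and recomputing
# is the kind of iterative adjustment of utility matrices the text describes.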
CONSEQUENTIAL BASIS OF TEST INTERPRETATION
In contrast to test use, the value issues in test
interpretation have not been as vigorously ad-
dressed. That social values impinge upon theoretical
interpretation may not be as obvious, but it is no
less serious. "Data come to us only in answer to
questions. . . . How we put the question reflects
our values on the one hand, and on the other hand
helps determine the answer we get" (Kaplan, 1964,
p. 385). Facts and values thus go hand in hand
(Churchman, 1961), and "we cannot avoid ethics
breaking into inductive logic" (Braithwaite, 1956,
p. 174). As Kaplan (1964) put it, "Data are
the product of a process of interpretation, and
though there is some sense in which the materials
for this process are 'given' it is only the product
which has a scientific status and function. In a
word, data have meaning, and this word 'meaning,'
like its cognates 'significance' and 'import,' includes
a reference to values" (p. 385). Thus, just as
data and theoretical interpretation were seen to be
intimately intertwined in the concept of evidence,
so data and values are intertwined in the concept
of interpretation, and fact, value, and meaning
become three faces of the substance of science.
Whenever an event or relationship is concep-
tualized, it is judgedeven if only tacitlyas be-
longing to some broader category to which value
already attaches. If a crime, for example, is seen
as a violation of the social order, the modal societal
response is to seek correction, which is a derivative
of the value context of this way of seeing. If crime
is seen as a violation of the moral order, expiation
will be sought. And if seen as a sign of distress,
especially if the distress can be assimilated to a
narrower category like mental illness, then a claim
of compassion and help attaches to the valuation.
In Vickers's (1970) terms, the conceptualization
of an event or relationship within a broader cate-
gory is a process of "matching," which is an infor-
mational concept involving the comparison of
forms. The assimilation of the value attached to
the broader schema is a process of "weighing,"
which is a dynamic concept involving the compari-
son of forces. For Vickers (1970), "the elaboration
of the reality system and the value system proceed
together. Facts are relevant only to some standard
of value; values are applicable only to some con-
figuration of fact" (p. 134). H e uses the term
appreciation to refer to those conjoint judgments of
fact and value (Vickers, 1965).
In the construct interpretation of tests, such
appreciative processes are central, though typically
latent. Constructs are broader conceptual cate-
gories than the test behaviors, and they carry with
them into the testing arena value connotations
stemming from three major sources: First are the
evaluative overtones of the construct labels them-
selves; next are the value connotations of the
broader theories or nomological networks in which
constructs are embedded; and last are the implica-
tions of the still broader ideologies about the na-
ture of humanity, society, and science that color
how we proceed. Ideology is a complex configura-
tion of values, affects, and beliefs that provides,
among other things, an existential perspective for
viewing the world, a "stage-setting," as it were,
for interpreting the human drama in ethical, sci-
entific, or whatever terms (Edel, 1970). The
ideological overlay subtly influences test interpre-
tation, especially for very general constructs like
intelligence, in ways that go beyond empirically
verified connections in the nomological network
(Crawford, 1979). The hope here in drawing
attention explicitly to the value implications of test
interpretation is that some of these ideological and
valuative links might be exposed to inquiry and
subjected either to empirical grounding or to policy
debate.
Exposing the value assumptions of a construct
theory and its more subtle links to ideology (possibly to multiple, cross-cutting ideologies) is an awesome challenge. One approach is to follow
Churchman's (1971) lead and attempt to contrast
each construct theory with an alternative perspec-
tive for interpreting the test scores, as in the
Kantian mode of inquiry; better still, for probing the ethical implications of a theory is to contrast it with an antithetical, though plausible, Hegelian
counterperspective. This raises to the grander level
of theory-comparison the strategy of focusing on
plausible rival hypotheses and counterhypotheses
in evaluating the basis for relationships within a
theory. Systematic competition between counter-
theories in attempting to explain the conjoint data
derivable from each also tends to offset the concern
that scientific observations are theory-laden or
theory-dependent and that the presumption of a
single theory might thereby preclude uncovering
the most challenging test data for that theory
(Feyerabend, 1975; Mitroff, 1973). Moreover, as
Churchman (1961) stresses, although consensus is
the decision rule of traditional science, conflict is
the decision rule of ethics. Since the one thing we
universally disagree about is "what ought to be,"
any scientific approach to ethics should allow for
conflict and debate, as should any attempt to assess
the ethical implications of science. "Thus, in
order to derive the 'ethical' implications of any
technical or scientific model, we explicitly incor-
porate a dialectical mode of examining (or testing)
models" (Mitroff & Sagasti, 1973, p. 133). In a
sense we are asking, as did Churchman's mentor
E. A. Singer (1959), what the consequences would
be if a given scientific judgment had the status
of an ethical judgment.
It should be noted that value issues intrude in
the testing process at all levels, not just at the
grand level of broad construct interpretation. For
example, values influence the relative emphasis on
different types of content in test construction
(Nunnally, 1967) and procedures for scoring the quality of performance on content samples (Guion, 1978b), but the concern here is limited to the value
implications of test interpretation. Consider first
the evaluative overtones of the construct label itself.
I have already suggested that a measure interpreted
in terms of "flexibility versus rigidity" would be
utilized differently if it were instead labeled "con-
fusion versus control." Similarly, a measure called
"inhibited versus impulsive" would have different
consequences if it were labeled "self-controlled
versus uninhibited." So would a variable like
"stress" if it were relabeled "challenge." The point
is not that we would make a concept like stress
into a good thing by renaming it but that by not
presuming it to be a bad thing we would investigate
broader consequences, facilitative as well as
debilitative (McGrath, 1976). In choosing a con-
struct label, we should strive for consistency be-
tween the trait and evaluative implications of the
name, attempting to capture as closely as possible
the essence of the construct's theoretical import,
especially its empirically grounded import, in terms
reflective of its salient value connotations. This
may prove difficult, however, because many traits
are open to conflicting value interpretations and
thus call for systematic examination of counter-
hypotheses about value outcomes, if not to reach
convergence, at least to clarify the basis of the
conflict. Some traits may also imply different
value outcomes under different circumstances,
which suggests the possible utility of differentiated
trait labels to embody these value distinctions, as in
the case of "debilitating anxiety" and "facilitating
anxiety." Rival theories of the construct might
also highlight different value implications, of
course, and lead to conflict between the theories
not only in trait interpretation but also in value
interpretation.
                      Test Interpretation     Test Use
Evidential Basis      Construct Validity      Construct Validity + Relevance/Utility
Consequential Basis   Value Implications      Social Consequences

Figure 1. Facets of test validity.

Apart from its normative and evaluative overtones, perhaps the most important feature of a construct in regard to value connotations is its breadth, or the range of its theoretical and empirical referents. This is the issue that Snow
(1974) called referent generality. The broader the
construct, the more difficult it is to embrace all of
its critical features in a single measure and the
more we are open to what Coombs (1954) has
called "operationism in reverse," that is, "endowing
the measures with all the meanings associated with
the concept" (p. 476). In choosing the appro-
priate breadth or level of generality for a construct
and its label, one is buffeted by opposing counter-
pressures toward oversimplification on the one hand
and overgeneralization on the other. At one
extreme is the apparent safety in using merely
descriptive labels tightly tied to behavioral exem-
plars in the test (such as Adding Two-Digit Num-
bers). Choices on this side sacrifice interpretive
power and range of application if the test might
also be defensibly viewed more broadly (e.g., Num-
ber Facility). At the other extreme is the apparent
richness of high-level inferential labels (such as
Intelligence, Creativity, or Introversion). Choices
on this side are subject to the dangers of mis-
chievous dispositional connotations and the backlash
of conceptual imperialism.
At first glance, one might think that the appro-
priate level of construct reference should be tied
not to test behavior but to the level of generaliza-
tion supported by the convergent and discriminant
research evidence in hand. But constructs refer to
potential relationships as well as actual relation-
ships, so their level of generality should in principle
be tied to their range of reference in the nomo-
logical theory, with the important proviso that this
range be restricted or extended when research evi-
dence so indicates. The scope of the original theo-
retical formulation is thus modified by the research
evidence available, but it is not limited to the
research evidence available. As Cook and Camp-
bell (1979) put it, "The data edit the kinds of
general statements we can make" (p. 88). And
debating the value implications of test interpreta-
tion may also edit the kinds of general statements
we should make.
Validity as Evaluation of
Evidence and Consequence
Test validity is thus an overall evaluative judgment
of the adequacy and appropriateness of inferences
drawn from test scores. This evaluation rests on
four bases: (1) an inductive summary of convergent
and discriminant research evidence that the test
scores are interpretable in terms of a particular
construct meaning, (2) an appraisal of the value
implications of that interpretation, (3) a rationale
and evidence for the relevance of the construct and
the utility of the scores in particular applications,
and (4) an appraisal of the potential social con-
sequences of the proposed use and of the actual
consequences when used.
Putting these bases together, we can see test
validity to have two interconnected facets linking
the source of justification (either evidential or
consequential) to the function or outcome of the
testing (either interpretation or use). This cross-
ing of basis and function is portrayed in Figure 1.
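Purely as a reading aid, and not anything proposed in the article itself, the fourfold crossing in Figure 1 can be written down as a small mapping from the source of justification and the function of testing to the facet of validity each cell highlights; the names and structure below are an editor's illustrative sketch.

    # Illustrative sketch only: Figure 1's crossing of basis and function,
    # expressed as a lookup table. Not a formal model from the article.
    FACETS = {
        ("evidential", "interpretation"): "construct validity",
        ("evidential", "use"): "construct validity + relevance/utility",
        ("consequential", "interpretation"): "value implications",
        ("consequential", "use"): "social consequences",
    }

    def facet(basis: str, function: str) -> str:
        """Return the facet of validity highlighted by a given basis and function."""
        return FACETS[(basis, function)]

    # Example: the consequential basis of test use points to social consequences.
    print(facet("consequential", "use"))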
The interactions among these aspects are more
dynamic in practice, however, than is implied by
a fourfold classification. In an attempt to rep-
resent the interdependence and feedback among
the components, a flow diagram is presented in
Figure 2. The double arrows linking construct
validity and test interpretation in the diagram are
meant to imply a continuous process that starts
sometimes with a construct in search of proper
measurement and sometimes with an existing test
in search of proper meaning.
[Figure 2. Feedback model for test validity.]

The model also includes a pragmatic component
for the evaluation of actual consequences of test
practice, pragmatic in the sense that this com-
ponent is oriented, like pragmatic philosophy,
toward outcomes rather than origins and seeks
justification for use in the practical consequences
of use. The primary concern of this component is
the balancing of the instrumental value of the test
in accomplishing its intended purpose with the
instrumental value of any negative side effects and
positive by-products of the testing. Most test
makers acknowledge responsibility for providing
general evidence of the instrumental value of the
test. The terminal value of the test in terms of the
social ends to be served goes beyond the test maker
to include as well the decisionmaker, policymaker,
and test user, who are responsible for specific evi-
dence of instrumental value in their particular
setting and for the specific interpretations and
uses made of the test scores. In the final analysis,
"responsibility for valid use of a test rests on the
person who interprets it" (Cronbach, 1969, p. 51),
and that interpretation entails responsibility for
its value consequences.
Intervening in the model between test use and
the evaluation of consequences is a decision matrix
to emphasize the point that tests are rarely used
in isolation but rather in combination with other
information in broader decision systems. The
decision process is profoundly influenced by social
values and deserves, in its own right, massive
research attention beyond the good beginning pro-
vided by utility models. As Guion (1976a)
phrased it, "The formulation of hypotheses is or
should be applied science, the validation of hypoth-
eses is applied methodology, but the act of making
. . . [a] decision is ... still an art" (p. 646). The
feedback model as portrayed is a closed system, to
emphasize the point that even when consequences
are evaluated favorably they should be contin-
uously or periodically monitored to permit the
detection of changing circumstances and of delayed
side effects.
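As a concrete illustration of the kind of formal aid that utility models lend to such decision systems (a sketch under assumed numbers, not a procedure prescribed by this article), a simple selection decision can be framed as a comparison of expected utilities across the four decision outcomes; every utility value and probability below is hypothetical.

    # Illustrative sketch only: a toy expected-utility selection rule.
    # All utility values and the predicted probability are hypothetical.

    def expected_utility(p_success: float, utilities: dict) -> dict:
        """Expected utility of accepting vs. rejecting an applicant, given the
        predicted probability of success and utilities for the four outcomes."""
        accept = (p_success * utilities["true_positive"]
                  + (1 - p_success) * utilities["false_positive"])
        reject = (p_success * utilities["false_negative"]
                  + (1 - p_success) * utilities["true_negative"])
        return {"accept": accept, "reject": reject}

    # Hypothetical utility matrix: the benefits and costs a decision maker assigns
    # to each outcome, which is one place social values enter the decision process.
    utilities = {"true_positive": 1.0, "false_positive": -0.5,
                 "false_negative": -0.3, "true_negative": 0.2}

    print(expected_utility(0.7, utilities))
    # accept: 0.7*1.0 + 0.3*(-0.5) = 0.55; reject: 0.7*(-0.3) + 0.3*0.2 = -0.15

The point of such a sketch is only that the weighing of costs and benefits is explicit; the choice of the utility values themselves remains a value judgment, which is why the decision process deserves research attention in its own right.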
The model is closed, and this article is closed,
with the provocative words of Sir Geoffrey Vickers
(1970): "If indeed we have reached the end of
ideology (in Daniel Bell's phrase) it is not because
we can do without ideologies but because we should
now know enough about them to show a proper
respect for our neighbour's and a proper sense of
responsibility for our own" (p. 109).
REFERENCES
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. Standards for educational and psychological tests. Washington, D.C.: American Psychological Association, 1974.
Bracht, G. H., & Glass, G. V. The external validity of experiments. American Educational Research Journal, 1968, 5, 437-474.
Braithwaite, R. B. Scientific explanation. Cambridge, England: Cambridge University Press, 1956.
Brogden, H. E. On the interpretation of the correlation coefficient as a measure of predictive efficiency. Journal of Educational Psychology, 1946, 37, 65-76.
Burton, N. W. Societal standards. Journal of Educational Measurement, 1978, 15, 263-271.
Campbell, D. T. Recommendations for APA test standards regarding construct, trait, or discriminant validity. American Psychologist, 1960, 15, 546-553.
Campbell, D. T., & Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 1959, 56, 81-105.
Campbell, J. P. Psychometric theory. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.
Churchman, C. W. Prediction and optimal decision: Philosophical issues of a science of values. Englewood Cliffs, N.J.: Prentice-Hall, 1961.
Churchman, C. W. The design of inquiring systems: Basic concepts of systems and organization. New York: Basic Books, 1971.
Cleary, T. A. Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 1968, 5, 115-124.
Coan, R. W. Facts, factors, and artifacts: The quest for psychological meaning. Psychological Review, 1964, 71, 123-140.
Cole, N. S. Bias in selection. Journal of Educational Measurement, 1973, 10, 237-255.
Cook, T. D., & Campbell, D. T. Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally, 1979.
Coombs, C. H. Theory and methods of social measurement. In L. Festinger & D. Katz (Eds.), Research methods in the behavioral sciences. New York: Holt, Rinehart & Winston, 1954.
Crawford, C. George Washington, Abraham Lincoln, and Arthur Jensen: Are they compatible? American Psychologist, 1979, 34, 664-672.
Cronbach, L. J. Validation of educational measures. Proceedings of the 1969 Invitational Conference on Testing Problems: Toward a theory of achievement measurement. Princeton, N.J.: Educational Testing Service, 1969.
Cronbach, L. J. Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.
Cronbach, L. J. Equity in selection--Where psychometrics and political philosophy meet. Journal of Educational Measurement, 1976, 13, 31-41.
Cronbach, L. J., & Gleser, G. C. Psychological tests and personnel decisions (2nd ed.). Urbana: University of Illinois Press, 1965.
Cronbach, L. J., Gleser, G., Nanda, H., & Rajaratnam, N. The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley, 1972.
Cronbach, L. J., & Meehl, P. E. Construct validity in psychological tests. Psychological Bulletin, 1955, 52, 281-302.
Curtis, E. W., & Alf, E. F. Validity, predictive efficiency, and practical significance of selection tests. Journal of Applied Psychology, 1969, 53, 327-337.
Darlington, R. B. Another look at "culture fairness." Journal of Educational Measurement, 1971, 8, 71-82.
Darlington, R. B. A defense of "rational" personnel selection, and two new methods. Journal of Educational Measurement, 1976, 13, 43-52.
Darlington, R. B., & Stauffer, G. F. Use and evaluation of discrete test information in decision making. Journal of Applied Psychology, 1966, 50, 125-129.
Division of Industrial and Organizational Psychology, American Psychological Association. Principles for the validation and use of personnel selection procedures. Hamilton, Ohio: Hamilton Print Co., 1975.
Dunnette, M. D., & Borman, W. C. Personnel selection and classification systems. In M. R. Rosenzweig & L. W. Porter (Eds.), Annual review of psychology (Vol. 30). Palo Alto, Calif.: Annual Reviews, 1979.
Ebel, R. L. Must all tests be valid? American Psychologist, 1961, 16, 640-647.
Ebel, R. L. The social consequences of educational testing. Proceedings of the 1963 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1964.
Ebel, R. L. Comments on some problems of employment testing. Personnel Psychology, 1977, 30, 55-63.
Edel, A. Science and the structure of ethics. In O. Neurath, R. Carnap, & C. Morris (Eds.), Foundations of the unity of science: Toward an international encyclopedia of unified science (Vol. 2). Chicago: University of Chicago Press, 1970.
Einhorn, H. J., & Bass, A. R. Methodological considerations relevant to discrimination in employment testing. Psychological Bulletin, 1971, 75, 261-269.
Equal Employment Opportunity Commission, Civil Service Commission, U.S. Department of Labor, & U.S. Department of Justice. Uniform guidelines on employee selection procedures. Federal Register (August 25, 1978), 43(166), 38290-38315.
Feigl, H. Some major issues and developments in the philosophy of science of logical empiricism. In H. Feigl & M. Scriven (Eds.), Minnesota studies in the philosophy of science: The foundations of science and the concepts of psychology and psychoanalysis. Minneapolis: University of Minnesota Press, 1956.
Feyerabend, P. Against method: Outline of an anarchist theory of knowledge. London, England: New Left Books, 1975.
Glaser, R., & Nitko, A. J. Measurement in learning and instruction. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.
Goodenough, F. L. Mental testing: Its history, principles, and applications. New York: Holt, Rinehart & Winston, 1969.
Gross, A. L., & Su, W. Defining a "fair" or "unbiased" selection model: A question of utilities. Journal of Applied Psychology, 1975, 60, 345-351.
Guion, R. M. Open a new window: Validities and values in psychological measurement. American Psychologist, 1974, 29, 287-296.
Guion, R. M. The practice of industrial and organizational psychology. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976. (a)
Guion, R. M. Recruiting, selection, and job placement. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976. (b)
Guion, R. M. Content validity--The source of my discontent. Applied Psychological Measurement, 1977, 1, 1-10. (a)
Guion, R. M. Content validity: Three years of talk--What's the action? Public Personnel Management, 1977, 6, 407-414. (b)
Guion, R. M. "Content validity" in moderation. Personnel Psychology, 1978, 31, 205-213. (a)
Guion, R. M. Scoring of content domain samples: The problem of fairness. Journal of Applied Psychology, 1978, 63, 499-506. (b)
Guion, R. M. On trinitarian doctrines of validity. Professional Psychology, 1980, 11, 385-398.
Gulliksen, H. Intrinsic validity. American Psychologist, 1950, 5, 511-517.
Hempel, C. G. Fundamentals of concept formation in empirical science. In O. Neurath, R. Carnap, & C. Morris (Eds.), Foundations of the unity of science: Toward an international encyclopedia of unified science (Vol. 2). Chicago: University of Chicago Press, 1970.
Hudson, L. Singularity of talent. In S. Messick (Ed.), Individuality in learning. San Francisco: Jossey-Bass, 1976.
Hunter, J. E., & Schmidt, F. L. Critical analysis of the statistical and ethical implications of various definitions of test bias. Psychological Bulletin, 1976, 83, 1053-1071.
Hunter, J. E., Schmidt, F. L., & Rauschenberger, J. M. Fairness of psychological tests: Implications of four definitions for selection utility and minority hiring. Journal of Applied Psychology, 1977, 62, 245-260.
James, L. R. Criterion models and construct validity for criteria. Psychological Bulletin, 1973, 80, 75-83.
Kaplan, A. The conduct of inquiry: Methodology for behavioral science. San Francisco: Chandler, 1964.
Kavanagh, M. J., MacKinney, A. C., & Wolins, L. Issues in managerial performance: Multitrait-multimethod analyses of ratings. Psychological Bulletin, 1971, 75, 34-49.
Lennon, R. T. Assumptions underlying the use of content validity. Educational and Psychological Measurement, 1956, 16, 294-304.
Lenzen, V. F. Procedures of empirical science. In O. Neurath, R. Carnap, & C. W. Morris (Eds.), International encyclopedia of unified science (Vol. 1, Pt. 1). Chicago: University of Chicago Press, 1955.
Linn, R. L. Fair test use in selection. Review of Educational Research, 1973, 43, 139-161.
Linn, R. L. In search of fair selection procedures. Journal of Educational Measurement, 1976, 13, 53-58.
Loevinger, J. Objective tests as instruments of psychological theory. Psychological Reports, 1957, 3, 635-694 (Monograph Supplement 9).
Margenau, H. The nature of physical reality. New York: McGraw-Hill, 1950. (Reprinted, Woodbridge, Conn.: Oxbow, 1977.)
McGrath, J. E. Stress and behavior in organizations. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.
Messick, S. Personality measurement and college performance. Proceedings of the 1963 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1964.
Messick, S. Personality measurement and the ethics of assessment. American Psychologist, 1965, 20, 136-142.
Messick, S. The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 1975, 30, 955-966.
Messick, S. Potential uses of noncognitive measurement in education. Journal of Educational Psychology, 1979, 71, 281-292.
Messick, S., & Barrows, T. S. Strategies for research and evaluation in early childhood education. In I. J. Gordon (Ed.), Early childhood education: The seventy-first yearbook of the National Society for the Study of Education. Chicago: University of Chicago Press, 1972.
Mitroff, I. I. 'Be it resolved that structured debate not consensus ought to form the epistemic cornerstone of OR/MS': A reaction to Ackoff's note on systems science. Interfaces, 1973, 3, 14-17.
Mitroff, I. I., & Sagasti, F. Epistemology as general systems theory: An approach to the design of complex decision-making experiments. Philosophy of Social Science, 1973, 3, 117-134.
Novick, M. R., & Ellis, D. D. Equal opportunity in educational and employment selection. American Psychologist, 1977, 32, 306-320.
Nunnally, J. Psychometric theory. New York: McGraw-Hill, 1967.
Peterson, N. S., & Novick, M. R. An evaluation of some models for culture-fair selection. Journal of Educational Measurement, 1976, 13, 3-29.
Popper, K. R. The logic of scientific discovery. New York: Basic Books, 1959.
Sawyer, R. L., Cole, N. S., & Cole, J. W. L. Utilities and the issue of fairness in a decision theoretic model for selection. Journal of Educational Measurement, 1976, 13, 59-76.
Shulman, L. S. Reconstruction of educational research. Review of Educational Research, 1970, 40, 371-396.
Singer, E. A. Experience and reflection (C. W. Churchman, Ed.). Philadelphia: University of Pennsylvania Press, 1959.
Smith, P. C. Behaviors, results, and organizational effectiveness: The problem of criteria. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.
Snow, R. E. Representative and quasi-representative designs for research on teaching. Review of Educational Research, 1974, 44, 265-291.
Tenopyr, M. L. Content-construct confusion. Personnel Psychology, 1977, 30, 47-54.
Thorndike, R. L. Personnel selection: Test and measurement techniques. New York: Wiley, 1949.
Thorndike, R. L. Concepts of culture-fairness. Journal of Educational Measurement, 1971, 8, 63-70.
Vickers, G. The art of judgment. New York: Basic Books, 1965.
Vickers, G. Value systems and social process. Harmondsworth, Middlesex, England: Penguin Books, 1970.
Wallach, M. A. Psychology of talent and graduate education. In S. Messick (Ed.), Individuality in learning. San Francisco: Jossey-Bass, 1976.
Webb, E. J., Campbell, D. T., Schwartz, R. D., & Sechrest, L. Unobtrusive measures: Nonreactive research in the social sciences. Chicago: Rand McNally, 1966.
Wernimont, P. F., & Campbell, J. P. Signs, samples, and criteria. Journal of Applied Psychology, 1968, 52, 372-376.