SAMUEL MESSICK
Educational Testing Service, Princeton, New Jersey

ABSTRACT: Questions of the adequacy of a test as a measure of the characteristic it is interpreted to assess are answerable on scientific grounds by appraising psychometric evidence, especially construct validity. Questions of the appropriateness of test use in proposed applications are answerable on ethical grounds by appraising potential social consequences of the testing. The first set of answers provides an evidential basis for test interpretation, and the second set provides a consequential basis for test use. In addition, this article stresses (a) the importance of construct validity for test use because it provides a rational foundation for predictiveness and relevance, and (b) the importance of taking into account the value implications of test interpretations per se. By thus considering both the evidential and consequential bases of both test interpretation and test use, the roles of evidence and social values in the overall validation process are illuminated, and test validity comes to be based on ethical as well as evidential grounds.

This article was an invited address to the Divisions of Educational Psychology and of Evaluation and Measurement, presented at the meeting of the American Psychological Association, New York City, September 1, 1979. Requests for reprints should be sent to Samuel Messick, Educational Testing Service, Princeton, New Jersey 08541.

Fifteen years ago or so, in papers dealing with personality measurement and the ethics of assessment, I drew a straightforward but deceptively simple distinction between the psychometric adequacy of a test and the appropriateness of its use (Messick, 1964, 1965). I argued that not only should tests be evaluated in terms of their measurement properties but that testing applications should be evaluated in terms of their potential social consequences. I urged that two questions be explicitly addressed whenever a test is proposed for a specific purpose: First, is the test any good as a measure of the characteristics it is interpreted to assess? Second, should the test be used for the proposed purpose in the proposed way? The first question is a scientific and technical one and may be answered by appraising evidence for the test's psychometric properties, especially construct validity. The second question is an ethical one, and its answer requires a justification of the proposed use in terms of social values.

Good answers to the first question are not satisfactory answers to the second. Justification of test use by an appeal to empirical validity is not enough; the potential social consequences of the testing should also be appraised, not only in terms of what it might entail directly as costs and benefits but also in terms of what it makes more likely as possible side effects. These two questions were phrased to parallel two recurrent criticisms of testing (that some tests are of poor quality and that tests are often misused) in an attempt to separate the frequently blurred issues in the typical critical interchange into (a) questions of test bias or the adequacy of measurement, and (b) questions of test fairness or the appropriateness of use (Messick, 1965).

It was in the context of appraising personality measurement for selection purposes that I originally stressed the need for ethical standards for justifying test use (Messick, 1964). Although at that time personality tests appeared inadequate for the selection task when systematically evaluated against measurement and prediction standards, it seemed likely that rapidly advancing research technology would, in the relatively near future, produce psychometrically sophisticated personality assessment devices.
Therefore, questions might soon arise in earnest as to the scope of their practical application beyond clinical and counseling usage. With variables as value-laden as personality characteristics, it seemed critical that values as well as validity be considered in contemplating test use.

Kaplan (1964) pointed out that "the validity of a measurement consists in what it is able to accomplish, or more accurately, in what we are able to do with it. . . . The basic question is always whether the measures have been so arrived at that they can serve effectively as means to the given end" (p. 198). Also at issue is whether the measures should serve as means to the given end, in light of other ends they might inadvertently serve and in consideration of the place of the given end in the social fabric of pluralistic alternatives. For example, should a psychometrically sound measure of "flexibility versus rigidity" be used for selection in a particular college if it significantly improves the multiple prediction of grade point average there? What if the direction of prediction favored rigid students? What if entrance to a military academy were at issue, or a medical school? What if the scores had been interpreted instead as measures of "confusion versus control"? What if there were large sex differences in the score distributions? In a different arena, what minimal levels of knowledge and skill should be required for graduation from high school and in what areas?

It seemed clear at this point that value issues in measurement were not limited to personality assessment, nor to selection applications, but should be extended to all psychological and educational measurement (Messick, 1975). This is primarily because psychological and educational variables all bear, either directly or indirectly, on human characteristics, processes, and products and hence are inherently, though variably, value-laden. The measurement of such characteristics entails value judgments (at all levels of test construction, analysis, interpretation, and use), and this raises questions both of whose values are the standard and of what should be the consequences of negative valuation. Values thus appear to be as pervasive and critical for psychological and educational measurement as is testing's acknowledged touchstone, validity. Indeed, "The root meaning of the word 'validity' is the same as that of the word 'value': both derive from a term meaning strength" (Kaplan, 1964, p. 198).

It should be emphasized that value questions arise with any approach to psychological and educational testing, whether it be norm-referenced or criterion-referenced (Glaser & Nitko, 1971), a construct-based ability test or a content-sampled achievement test (Messick, 1975), a reactive task or an unobtrusive observation (Webb, Campbell, Schwartz, & Sechrest, 1966), a sign or a sample (Goodenough, 1969), or whatever, but the nature of the critical value questions may differ from one approach to another.
For example, many of the advantages of samples over signs derive from the similarity of past behaviors to desired future behaviors, which makes it more likely that behavior-sample tests will be judged relevant in both content and process to the task or job domain about which inferences are to be drawn. It is also likely that scores from such samples, because of behavioral consistency from one time to another, will be predictive of performance in those domains (Wernimont & Campbell, 1968). A key value question is whether such "persistence forecasting," as Wallach (1976) calls it, is desirable in a particular domain of application. In higher education, for example, the appropriate model might not be persistence but development and change, which suggests that in such instances we be wary of selection procedures that restrict individual opportunity on the basis of behavior to date (Hudson, 1976).

The distinction stressed thus far between the adequacy of a test as a measure of the characteristic it is interpreted to assess and the appropriateness of its use in specific applications underscores in the first instance the evidential basis of test interpretation, especially the need for construct validity evidence, and in the second instance the consequential basis of test use, through appraisal of potential social consequences. In developing this distinction in prior work I emphasized the importance of construct validity for test use as well, arguing "that even for purposes of applied decision making reliance upon criterion validity or content coverage is not enough," that the meaning of the measure must also be comprehended in order to appraise potential social consequences sensibly (Messick, 1975, p. 956). The present article extends this argument for the importance of construct validity in test use still further by stressing its role in providing a "rational foundation for predictive validity" (Guion, 1976b). After thus elaborating the evidential basis of test use, I consider the value implications of test interpretations per se, especially those that bear evaluative and ideological overtones going beyond intended meanings and supporting evidence; the circle is thereby completed with an examination of the consequential basis of test interpretation. Finally, the dynamic interplay between test interpretation and its value implications, on the one hand, and test use and its social consequences, on the other, is sketched in a feedback model that incorporates a pragmatic component for the empirical evaluation of testing consequences.

Validity as Inference From Evidence

According to the Standards for Educational and Psychological Tests (American Psychological Association et al., 1974), "Questions of validity are questions of what may properly be inferred from a test score; validity refers to the appropriateness of inferences from test scores or other forms of assessment. . . . It is important to note that validity is itself inferred, not measured. . . . It is, therefore, something that is judged as adequate, or marginal, or unsatisfactory" (p. 25). This document also points out that the many forms of validity questions fall into two broad classes, those dealing with inferences about what is being measured by the test and those inquiring into the usefulness of the measurement as a predictor of other variables.
Furthermore, there are a variety of validation methods available, but they all entail in principle a clear designation of what is to be inferred from the scores and the presentation of data to support such inferences.

Unfortunately, after this splendid beginning, this and other official documents (namely, the Division of Industrial and Organizational Psychology's (1975) Principles for the Validation and Use of Personnel Selection Procedures and the Equal Employment Opportunity Commission et al.'s (1978) "Uniform Guidelines on Employee Selection Procedures") proceed, as Dunnette and Borman (1979) lament, to "perpetuate a conceptual compartmentalization of 'types' of validity: criterion-related, content, and construct. . . . the implication that validities come in different types leads to confusion and, in the face of confusion, over-simplification" (p. 483). One consequence of this simplism is that many test users focus on one or another of the types of validity, as though any one would do, rather than on the specific inferences they intend to make from the scores. There is an implication that once evidence of one type of validity is forthcoming, one is relieved of responsibility for further inquiry. Indeed, the "Uniform Guidelines" seem to treat the three types of validity, in Guion's (1980) words, "as something of a Holy Trinity representing three different roads to psychometric salvation. If you can't demonstrate one kind of validity, you've got two more chances!" (p. 4).

Different kinds of inferences from test scores require different kinds of evidence, not different kinds of validity. By "evidence" I mean both data, or facts, and the rationale or arguments that cement those facts into a justification of test-score inferences. "Another way to put this is to note that data are not information; information is that which results from the interpretation of data" (Mitroff & Sagasti, 1973, p. 123). Or as Kaplan (1964) states, "What serves as evidence is the result of a process of interpretation: facts do not speak for themselves; nevertheless, facts must be given a hearing, or the scientific point to the process of interpretation is lost" (p. 375). Facts and rationale thus blend in this view of evidence, and the tolerable balance between them in the arena of test validity extends over a considerable range, possibly even falling just short of the one extreme where facts are left to speak for themselves and the other extreme where a logical rationale alone is deemed self-evident.

By focusing on the nature of the evidence in relation to the nature of the inferences drawn from test scores, we come to view validity as a general imperative in measurement. Validity is the overall degree of justification for test interpretation and use. It is "an evaluation, considering all things, of a certain kind of inference about people who obtain a certain score" (Guion, 1978b, p. 500). Although it may prove helpful conceptually to discuss the interdependent features of the generic concept in terms of different aspects or facets, it is simplistic to think of different types or kinds of validity.
From this standpoint we are not very well served by labeling different aspects of a general concept with the name of the concept, as in criterion-related validity, content validity, or construct validity, or by proliferating a host of specialized validity modifiers, such as discriminant validity, trait validity, factorial validity, structural validity, or population validity, each delimiting some aspect of a broader meaning. The substantive points associated with each of these terms are important ones, but their distinctiveness is blunted by calling them all "validity." Since many of the referents are similar but not identical, they tend to be assimilated one to another, leading to confusion among them and to a blurring of the different forms of evidence that the terms were invoked to highlight in the first place. Worse still, any one of these so-called validities, or a small set of them, might be treated as the whole of validity, while the entire collection to date might still not exhaust the essence of the whole.

We would be much better off conceptually to use labels more descriptive of the character and intent of each aspect, such as content relevance and content coverage rather than content validity, or population generalizability rather than population validity. Table 1 lists a number of currently used validity terms along with a tentative descriptive designation for each that is intended to underscore differences among the concepts while at the same time highlighting the key feature of each, such as consistency or utility, and pointing to essential areas of similarity and overlap, as with criterion relatedness, nomological relatedness, and external relatedness. With one possible exception to be discussed subsequently, none of these concepts qualify for the accolade of validity, for at best they are only one facet of validity and at worst, as in the case of content coverage, they are not validity at all. So-called "content validity" refers to the relevance and representativeness of the task content used in test construction and does not refer to test scores at all, let alone evidence to support inferences from test scores, although such content considerations do permit elaborations on score inferences supported by other evidence (Guion, 1977a, 1978a; Messick, 1975; Tenopyr, 1977).

I will comment on most of the concepts in Table 1 in passing while considering the claim of the one exception noted earlier (namely, construct validity) to bear the name "validity" and to wear the mantle of all that name implies. I have pressed in previous writing for the view that "all measurement should be construct-referenced" (Messick, 1975, p. 957). Others have similarly argued that "any inference relative to prediction and . . . all inferences relative to test scores, are based upon underlying constructs" (Tenopyr, 1977, p. 48). Guion (1977b, p. 410) concluded that "all validity is at its base some form of construct validity. . . . It is the basic meaning of validity." I will argue, building on Guion's (1976b) conceptual groundwork, that construct validity is indeed the unifying concept of validity that integrates criterion and content considerations into a common framework for testing rational hypotheses about theoretically relevant relationships. The bridge or unifying theme that permits this integration is the meaningfulness or interpretability of the test scores, which is the goal of the construct validation process.
This construct meaning provides a rational basis both for hypothesizing predictive relationships and for judging content relevance and representativeness. I stop short, however, as did Guion (1980), of equating construct validity with validity in general, but for different reasons. The main basis for hesitancy on my part, as we shall see, is that validity entails an evaluation of the value implications of both test interpretation and test use. These implications derive primarily from the test's construct meaning, to be sure, and they feed back into the construct validation process, but they also derive in part from broader social ideologies, such as the ideologies of social science or of education or of social justice, and hence go beyond construct meaning per se.

TABLE 1
Alternative Descriptors for Aspects of Test Validity

Validity designation: Descriptive designation
Content validity: Content relevance (domain specifications); content coverage (domain representativeness)
Criterion validity: Criterion relatedness
Predictive validity: Predictive utility
Concurrent validity: Diagnostic utility; substitutability
Construct validity: Interpretive meaningfulness
Convergent validity: Convergent coherence
Discriminant validity: Discriminant distinctiveness
Trait validity: Trait correspondence
Nomological validity: Nomological relatedness
Factorial validity: Factorial composition
Substantive validity: Substantive consistency
Structural validity: Structural fidelity
External validity: External relatedness
Population validity: Population generalizability
Ecological validity: Ecological generalizability
Temporal validity: Temporal continuity (across developmental levels); temporal generalizability (across historical periods)
Task validity: Task generalizability

INTERPRETIVE MEANINGFULNESS

Construct validation is a process of marshaling evidence to support the inference that an observed response consistency in test performance has a particular meaning, primarily by appraising the extent to which empirical relationships with other measures, or the lack thereof, are consistent with that meaning. These empirical relationships may be assessed in a variety of ways, for example, by gauging the degree of consistency in correlational patterns and factor structures, in group differences, response processes, and changes over time, or in responsiveness to experimental treatments. The process attempts to link the reliable response consistencies summarized by test scores to nontest behavioral consistencies reflective of a presumably common underlying construct, usually an attribute or process or trait that is itself embedded in a more comprehensive network of theoretical propositions or laws called a nomological network (Feigl, 1956; Hempel, 1970; Margenau, 1950). An empirically grounded pattern of such links provides an evidential basis for interpreting the test scores in construct or process terms, as well as a rational basis for inferring testable implications of the scores from the broader theoretical network of the construct's meaning (Cronbach & Meehl, 1955; Messick, 1975). Constructs are thus chosen or created "to organize experience into general law-like statements" (Cronbach, 1971, p. 462).
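As a schematic illustration of this kind of appraisal (my own sketch, not a procedure from the article; the variable names and correlations are invented), such consistency checking can be as simple as tabulating whether observed correlations match the directions a construct's nomological network predicts:

```python
# Hypothetical sketch: checking observed correlations against the signs
# predicted by a construct's nomological network. Names and values invented.

# Theory-derived predictions: how a "flexibility" score should relate
# to other measures (+1 positive, -1 negative, 0 near zero).
predicted_sign = {
    "openness":        +1,
    "dogmatism":       -1,
    "vocabulary_size":  0,   # should be largely unrelated
}

# Observed correlations with the flexibility measure (toy numbers).
observed_r = {
    "openness":        0.48,
    "dogmatism":      -0.37,
    "vocabulary_size": 0.05,
}

NEGLIGIBLE = 0.10  # threshold below which r is treated as "near zero"

for var, sign in predicted_sign.items():
    r = observed_r[var]
    if sign == 0:
        ok = abs(r) < NEGLIGIBLE
    else:
        ok = abs(r) >= NEGLIGIBLE and (r > 0) == (sign > 0)
    print(f"{var:15s} r={r:+.2f} predicted sign={sign:+d} consistent={ok}")
```

A full construct validation would of course weigh many such patterns jointly (factor structures, group differences, change over time), but the logic is the same: the evidence is evaluated against what the construct's meaning implies.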
Construct validation entails both confirmatory and disconfirmatory strategies, one to provide convergent evidence that the measure in question is coherently related to other measures of the same construct as well as to other variables that it should relate to on theoretical grounds, and the other to provide discriminant evidence that the measure is not related unduly to exemplars of other distinct constructs (D. T. Campbell & Fiske, 1959). Discriminant evidence is particularly critical for discounting plausible counterhypotheses to the construct interpretation (Popper, 1959), especially those pointing to the possibility that the observed consistencies might instead be attributable to shared method constraints, response sets, or other contaminants.

Construct validity emphasizes two intertwined sets of relationships for the test: one between the test and different methods for measuring the same construct or trait, and the other between measures of the focal construct and exemplars of different constructs predicted to be variously related to it on theoretical grounds. Theoretically relevant empirical consistencies in the first set, indicating a correspondence between measures of the same construct, have been called trait validity, and those in the second set, indicating a lawful relatedness between measures of different constructs, have been called nomological validity (D. T. Campbell, 1960; Cronbach & Meehl, 1955). In order to discount competing hypotheses involving alternative constructs or method contaminants, the two sets are often analyzed simultaneously in a multitrait-multimethod strategy that employs multiple methods for assessing each of two or more different constructs (D. T. Campbell & Fiske, 1959). Such an approach highlights the need for both convergent and discriminant evidence in both trait and nomological validity.

Trait validity deals with the fit between measurement operations and conceptual definitions of the construct, and nomological validity deals with the fit between obtained data patterns and theoretical predictions about those patterns (Cook & Campbell, 1979). The former is concerned with the meaning of the measure as a reflection of the construct, and the latter is concerned with the meaning of the construct as reflected in the measure's relational properties. Both aspects are intrinsic to construct validity, and the interplay between them leads to iterative refinements of measures, constructs, and theories over time. Thus, the paradox that measures are needed to define constructs and constructs are needed to build measures is resolved, like all existential dilemmas in science, by a process of successive approximation (Kaplan, 1964; Lenzen, 1955).
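The multitrait-multimethod logic can be made concrete with a small sketch (mine, not the article's; the traits, methods, and toy correlation matrix are all hypothetical) that separates convergent from discriminant correlations among trait-method units, in the spirit of Campbell and Fiske (1959):

```python
# Hypothetical sketch of a multitrait-multimethod (MTMM) summary,
# after Campbell & Fiske (1959). Rows/columns are trait-method units.
import numpy as np

traits = ["flexibility", "anxiety"]        # two constructs
methods = ["self_report", "peer_rating"]   # two measurement methods
units = [(t, m) for m in methods for t in traits]

# Toy correlation matrix among the four trait-method units (symmetric).
R = np.array([
    [1.00, 0.30, 0.55, 0.20],   # flexibility / self-report
    [0.30, 1.00, 0.25, 0.60],   # anxiety / self-report
    [0.55, 0.25, 1.00, 0.35],   # flexibility / peer rating
    [0.20, 0.60, 0.35, 1.00],   # anxiety / peer rating
])

convergent, discriminant = [], []
for i in range(len(units)):
    for j in range(i + 1, len(units)):
        same_trait = units[i][0] == units[j][0]
        if same_trait:
            convergent.append(R[i, j])     # same trait, different method
        else:
            discriminant.append(R[i, j])   # different traits

# Convergent correlations should clearly exceed discriminant ones,
# including those that merely share a method of measurement.
print(f"mean convergent r  = {np.mean(convergent):.2f}")
print(f"mean discriminant r = {np.mean(discriminant):.2f}")
```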
It will be recalled that the Standards for Educational and Psychological Tests (APA et al., 1974) condensed the variety of validity questions into two types, those dealing with the intrinsic nature or meaning of the measure and those dealing with its use as an indicator or predictor of other variables. In the present context, this distinction should be seen as a whole-part relationship: Evidence bearing on the meaning of the measure embraces all of construct validity, whereas evidence for certain predictive relationships contributes to that part called nomological validity. Some predictive relationships (namely, those between the measure and specific applied criterion behaviors) are traditionally singled out for special attention under the rubric of "criterion-related validity," and it therefore follows that this too is subsumed conceptually as part of construct validity. This does not mean, however, that construct validity in general can replace criterion-related validity in particular in applied settings. The criterion correlates of a measure constitute strands in the construct's nomological network, but their empirical basis is still to be checked. Thus, "criterion-related validity is intended to show the validity, not of the test, but of that hypothesis" of relationship to the criterion (Guion, 1978a, p. 207). The analysis of criterion variables within the measure's construct network, especially if conducted in tandem with the construct validation of the criterion measures themselves, provides a powerful rational basis for criterion prediction (Guion, 1976b).

CRITERION RELATEDNESS

So-called "criterion-related validity" is usually considered to comprise two types, concurrent validity and predictive validity, which differ respectively in terms of whether the test and criterion data were collected at the same time or at different times. A more fundamental distinction would recognize that concurrent correlations with criteria are usually obtained either to appraise the diagnostic effectiveness of the test in detecting current behavioral patterns or to assess the suitability of substituting the test for a longer, more cumbersome, or more expensive criterion measure. It would also be more helpful in both the predictive and the concurrent case to characterize the function of the relationship in terms of utility rather than validity. Criterion relatedness differs from the more general nomological relatedness in being more narrowly stated and pointed toward specific sets of data and specific applied settings. In criterion relatedness we are concerned not just with verifying the existence of relationships and gauging their strength, but with identifying useful relationships under the applied conditions. Utility is the more appropriate concept in such instances because it implies interpretation of the correlations in the decision context in terms of indices of predictive efficiency relative to base rates, mean gains in criterion performance due to selection, the dollar value of such gains relative to costs, and so forth (Brogden, 1946; Cronbach & Gleser, 1965; Curtis & Alf, 1969; Darlington & Stauffer, 1966; Hunter, Schmidt, & Rauschenberger, 1977).
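As an illustration of this utility framing (a sketch of my own; all figures are invented), the Brogden-Cronbach-Gleser linear model expresses the expected gain from top-down selection as the number selected, times the validity coefficient, times the criterion standard deviation in dollars, times the mean standard predictor score of those selected, minus testing costs:

```python
# Hypothetical sketch of Brogden-Cronbach-Gleser selection utility.
# All numbers are invented for illustration.
from statistics import NormalDist

n_applicants = 500
selection_ratio = 0.20            # top 20% are selected
n_selected = int(n_applicants * selection_ratio)

r_xy = 0.40                       # predictor-criterion validity coefficient
sd_y = 10_000.0                   # SD of criterion performance, in dollars
cost_per_test = 25.0              # testing cost per applicant

# Mean standard score of selectees under top-down selection from a
# normal predictor: ordinate at the cutoff divided by the selection ratio.
cutoff = NormalDist().inv_cdf(1 - selection_ratio)
mean_z_selected = NormalDist().pdf(cutoff) / selection_ratio

gain = n_selected * r_xy * sd_y * mean_z_selected
costs = n_applicants * cost_per_test
print(f"expected utility gain: ${gain - costs:,.0f}")
```

Under these invented figures the estimated gain dwarfs testing costs; the point of such models is that the assumptions behind a claim of practical usefulness are made explicit rather than left implicit in a bare correlation.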
In developing rational hypotheses of criterion relatedness, we not only need a conception of the construct meaning of the predictor measures, as we have seen, but we also need to conceptualize criterion constructs, basing judgments on data from job or task analyses and the construct validation of provisional criterion measures (Guion, 1976b). In the last analysis, the ultimate criterion is determined on rational grounds (Thorndike, 1949); in any event, it "can best be described as a psychological construct . . . [and] the process of determining the relevance of the immediate to the ultimate criterion becomes one of construct validation" (Kavanagh, MacKinney, & Wolins, 1971, p. 35).

It is particularly crucial to identify criterion constructs whenever potentially contaminated criterion measures, such as ratings or especially multiple ratings from different sources, are employed (James, 1973). In the face of impure or contaminated criterion measures, the question of the intrinsic nature of the relation between predictor and criterion comes to the fore (Gulliksen, 1950), and construct validity is needed to broach that issue. "In other words, an orientation toward construct validation in criterion research is the best way of guarding against a hopelessly incomplete job of criterion development" (Smith, 1976, p. 768). Thus, if construct validity is not available on the predictor side, it better be on the criterion side, and both "must have adequate construct validity for their respective sides if the theory is to be tested adequately" (Guion, 1976b, p. 802).

Implicit in this rational approach to predictive hypotheses there is thus also a rational basis for judging the relevance of the test to the criterion domain. This provides a means of coping with the quasi-judicial term job-relatedness, even in the case where criterion-related empirical verification is missing. "Where it is clearly not feasible to do the study, the defense of the predictor can rest on a combination of its construct validity and the rational justification for the inclusion of the construct in the predictive hypothesis" (Guion, 1974, p. 291). The case becomes stronger if the predicted relationship has been verified empirically in other settings. Guion (1974), for one, has maintained that this stance offers better evidence of job-relatedness than does a tenuous criterion-related study done under pressure with small samples, low variances, or questionable criterion measures. On the other hand, the simple demonstration of an empirical relationship between a measure and a criterion in the absence of a cogent rationale is a dubious basis for justifying relevance or use (Messick, 1964, 1975).

CONTENT RELEVANCE AND CONTENT COVERAGE

The other major basis for judging the relevance of the test to the behavioral domain about which inferences are to be drawn is so-called "content validity." Content validity in its classic form (Cronbach, 1971) is limited to the strict behavioral language of task description, for otherwise, constructs are apt to be invoked and we have another case of construct validity. There are two main facets to content validity: One is content relevance, which refers to the specification of the behavioral domain in question and the attendant specification of the task or test domain. Specifying domain boundaries is essentially a requirement of operational definition and, in the absence of appeal to a construct theory of task performance, is limited to a statement of admissible task characteristics and behavioral requirements. The other facet is content coverage, which refers to the specification of procedures for sampling the domain in some representative fashion. The concern is thus with content sampling of a specified content domain, which is a prescription for test construction, not validity.
Consensual judgments about the relevance of the test domain as defined to a particular behavioral domain of interest (as, for example, when choosing a standardized achievement test to evaluate a new curriculum), along with judgments of the adequacy of content coverage in the test, are the kinds of evidence usually offered for content validity. But note that this is not evidence in support of inferences from test scores, although it might influence the nature of those inferences.

This attempt to define content validity as separate from construct validity produces a dysfunctional strain to avoid constructs, as if shunning them in test development somehow lessens the import of response processes in test performance. The important sampling consideration in test construction is not representativeness of the surface content of tasks but representativeness of the processes employed by subjects in arriving at a response (Lennon, 1956). This puts content validity squarely in the realm of construct validity (Messick, 1975). Rather than strain after nebulous distinctions, we should inquire how content considerations contribute to construct validity and how to strengthen that contribution (Tenopyr, 1977).

Loevinger (1957) incorporated content as an important feature of construct validity by considering content representativeness and response consistency jointly. What she called "substantive validity" is "the extent to which the content of the items included in (and excluded from?) the test can be accounted for in terms of the trait believed to be measured and the context of measurement" (Loevinger, 1957, p. 661). This notion was introduced "because of the conviction that considerations of content alone are not sufficient to establish validity even when the test content resembles the trait, and considerations of content cannot be excluded when the test content least resembles the trait" (Loevinger, 1957, p. 657). The elimination of certain items from the test because of poor empirical response properties may sometimes distort the test's representativeness in covering the construct domain as originally conceived, but it is justified if the resulting test thereby becomes a better exemplar of the construct as empirically grounded (Loevinger, 1957; Messick, 1975).

Content validity has little to say about the scoring of content samples, and as a result scoring procedures are typically ad hoc (Guion, 1978b). Scoring models in the construct framework, in contrast, logically parallel the structural relations inherent in behavioral manifestations of the construct being measured. Loevinger (1957) drew explicit attention to the need for rational scoring models by coining the term structural validity, which includes "both the fidelity of the structural model to the structural characteristics of non-test manifestations of the trait and the degree of inter-item structure" (p. 661).

Even in instances where the test is an undisputed representative sample of the behavioral domain of interest and the concern is with the demonstration of task accomplishment per se regardless of the processes underlying performance (cf. Ebel, 1961, 1977), empirical evidence of response consistency and not just representative content sampling is important. In such cases, inferences are usually drawn from the sample performance to domain performance, and these inferences should be buttressed by indices of
the internal-consistency type to gauge the extent of generalizability to other items like those in the sample, to other tests developed in parallel fashion, and so forth (J. P. Campbell, 1976; Cronbach, Gleser, Nanda, & Rajaratnam, 1972). We should also consider the possibility that the test might contain sources of variance irrelevant to domain performance, which is a particularly important consideration in interpreting low scores. Content validity at best is a unidirectional concept: Although it may undergird certain straightforward interpretations for high scorers (such as "they possess suitable skills to perform the tasks correctly, because they did so repeatedly"), it provides no basis for interpreting low scores in terms of incompetence or lack of skill. To do that requires the discounting of plausible counterhypotheses about such irrelevancies in the testing as anxiety, defensiveness, inattention, or low motivation (Guion, 1978a; Messick, 1975, 1979). And the empirical discounting of plausible rival hypotheses is the hallmark of construct validation.

GENERALITY OF CONSTRUCT MEANING

The issue of generalizability just broached for content sampling permeates all of validity. Several aspects of generalizability of special concern have been given distinctive labels, but unfortunately these labels once again invoke the sobriquet validity. The extent to which a measure's empirical relations and construct interpretation generalize to other population groups is called "population validity" (Shulman, 1970); to other situations or settings, "ecological validity" (Bracht & Glass, 1968; Snow, 1974); to other times, "temporal validity" (Messick & Barrows, 1972); and to other tasks representative of the operations called for in the particular domain of interest, "task validity" (Shulman, 1970).

The label validity is especially unsuitable for these important facets of generalizability, for such usage might be taken to imply that the more generalizable a measure is, the more valid. This is not always the case, however, as in the measurement of such constructs as mood, which fluctuates over time, or concrete operations, which typify a particular developmental stage, or administrative role, which operates in special organizational settings, or delusions, which are limited to specific psychotic groups. Rather, the appropriate degree of generalizability for a measure depends upon the nature of the construct assessed and the scope of its theoretical applicability. A closely related issue of "referent generality" (Coan, 1964; Snow, 1974), called "referent validity" by Cook and Campbell (1979), concerns the extent to which research evidence supports a measure's range of reference and the multiplicity of its referent terms. This concept points to the need to tailor the level of construct interpretation to the limits of the evidence and to avoid both oversimplification and overgeneralization in the connotation of construct labels. Nonetheless, constructs refer not only to available evidence but to potential evidence, so that the choice of construct labels is influenced by theory as well as by evidence and, as we shall see, by ideologies about the nature of humanity and society which add value implications that go beyond evidential validity per se.
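One simple empirical check on population generalizability (again my own sketch, not a prescription from the article; the coefficients and sample sizes are invented) is to compare a measure's validity coefficients across groups using the Fisher r-to-z transformation:

```python
# Hypothetical check on population generalizability: do two groups
# show comparable test-criterion correlations? (Invented numbers.)
import math

def fisher_z(r: float) -> float:
    """Fisher r-to-z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

# Observed validity coefficients and sample sizes per group.
r1, n1 = 0.45, 220
r2, n2 = 0.32, 180

# Standard two-sample test on the transformed correlations.
z_stat = (fisher_z(r1) - fisher_z(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
print(f"z = {z_stat:.2f}")  # |z| > 1.96 suggests the relation may not generalize
```

Whether such generalization is even expected is, as the text stresses, a theoretical question about the construct, not a blanket requirement.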
EVIDENTIAL BASIS OF TEST INTERPRETATION AND USE

To recapitulate thus far, construct validity is the evidential basis of test interpretation. It entails both convergent and discriminant evidence documenting theoretically relevant empirical relationships (a) between the test and different methods for measuring the same construct, as well as (b) between measures of the construct and exemplars of different constructs predicted to be related nomologically. For test use, the relevance of the construct for the applied purpose is determined, in addition, by developing rational hypotheses relating the construct to performance in the applied domain. Some of the construct's nomological relations thus become criterial when made specific to the applied setting. The empirical verification of this rational hypothesis contributes to the construct validity of both the measure and the criterion, and the utility of the applied relation supports the practicality of the proposed use. Thus, the evidential basis of test use is also construct validity, but elaborated to determine the relevance of the construct to the applied purpose and the utility of the measure in the applied setting.

In all of this discussion I have tried to avoid the language of necessary and sufficient requirements, because such language seemed simplistic for a complex and holistic concept like test validity. On the one hand, construct validation is a continuous, never-ending process developing an ever-expanding mosaic of research evidence. At any point new evidence may dictate a change in construct, theory, or measurement, so that in the long run it is difficult to claim sufficiency for any piece. On the other hand, given that the mosaic of evidence is reasonably dense, it is difficult to claim that any piece is necessary, even, as we have seen, empirical evidence for criterion-related predictive relationships in specific applied settings, provided, of course, that other evidence consistently supports a compelling rationale for the application.

Since the evidence in these evidential bases derives from empirical studies evaluating hypotheses about relationships or about the structure of sets of relationships, we must also be concerned about the quality of those studies themselves and about the extent to which the research conclusions are tenable or are threatened by plausible counterhypotheses to explain the results (Guion, 1980). Four classes of threats to the tenability and generalizability of research conclusions are discussed by Cook and Campbell (1979), with primary reference to quasi-experimental and experimental research but also relevant to nonexperimental correlational studies. These four classes deal, respectively, with the questions of (a) whether a relationship exists between two variables, an issue called "statistical conclusion validity"; (b) whether the relationship is plausibly causal from one variable to the other, called "internal validity"; (c) what interpretive constructs underlie the relationship, called "construct validity"; and (d) the extent to which the interpreted relationship generalizes to and across other population groups, settings, and times, called "external validity."

I will not discuss here the first question raised by Cook and Campbell except simply to affirm that the tenability of statistical conclusions about the existence and strength of relationships is of course basic to the whole enterprise.
I have already dis- cussed construct validity and external generalizabil- ity, although it is important to note in connection with the latter that I was referring to the generaliz- ability of a measure's empirical relations and con- struct interpretation to other populations, settings, and times, whereas Cook and Campbell (1979) were referring to the generalizability of research conclusions that two variables (and their attendant constructs) are causally related one to the other. My emphasis was on the generality of a measure's construct meaning based on any relevant evidence (Messick, 1975; Messick & Barrows, 1972 )com- monality of factor structures, for examplewhile theirs was on the generality of a causal relationship from one measure or construct to another -based on experimental or quasi-experimental treatments. Verification of the hypothesis of causal relation- ship is what Cook and Campbell term internal validity, and such evidence contributes importantly to the nomological basis of a measure's construct meaning for those construct theories entailing causal claims. Internal validity thus provides the evidential basis for causal strands in a nomological network. The tenability of cause-effect implica- tions is important for the construct validity of a variety of educational and psychological measures, such as those interpreted in terms of intelligence, achievement, or motivation. Indeed, the causal overtones of constructs are one source of the value implications of test interpretation, a topic I will turn to shortly. Validity as Evaluation of Implications Since validity is an evaluation of evidence, a judg- ment rather than an entity, and since some evi- dential basis should be provided for the interpreta- tion and use of any test, validity has always been an ethical imperative in testing. As Burton (1978) put it, "Validity (as the word implies) has been primarily an ethical requirement of tests, a pre- requisite guarantee, rather than an active com- ponent of the use and interpretation of tests" (p. 2 64). She went on to argue that with criterion- referenced testing, "Glaser in essence, was taking traditional validity ,out of the realm of ethics into the active arena of test use" (p. 2 64). Glaser may have taken traditional validity into the active arena of test use, as it were, but it never left the realm of ethics because test use itself is an ethical issue. If test validity is the overall degree of justification for test interpretation and use, and if human and social values encroach on both interpretation and use, as they do, then test validity should take account of those value implications in the overall judgment. The concern here, as in most ethical issues, is with evaluating the present and future consequences of interpretation and use (Church- man, 1961). If, as an intrinsic part of the overall validation process, we weigh the actual and poten- tial consequences of our testing practices in light of considerations of what future society might need or desire, theh test validity comes to be based on ethical as well as evidential grounds. CO N SEQUEN TIAL BASIS OF TEST USE Value issues have long been recognized in connec- tion with test use. We have seen that one of the key questions to be posed whenever a test is sug- gested for a specific purpose is "Should it be used for that purpose?" Answers to that question require an evaluation of the potential consequences of the testing in terms of social values, but that is no trivial enterprise. 
There is no guarantee that at any point in time we will identify all of the critical possibilities, especially those unintended side effects that are distal to the manifest testing aims. There are few prescriptions for how to proceed here, but one recommendation is to contrast the potential social consequences of the proposed testing with the potential social consequences of alternative procedures and even of procedures antagonistic to testing. This pitting of the proposed test use against alternative proposals is an instance of what Churchman (1971) has called Kantian inquiry; the pitting against antithetical counterproposals is called Hegelian inquiry. The intent of these strategies is to draw attention to vulnerabilities in the proposed use and to expose its tacit value assumptions to open debate. In the context of testing, a particularly powerful and general form of counterproposal is to weigh the potential social consequences of the proposed test use against the potential social consequences of not testing at all (Ebel, 1964).

The role of values in test use has been intensively examined in certain selection applications, namely, in those where different population groups display significantly different means on predictors, or criteria, or both. Since fair test use implies that selection decisions will be equally appropriate regardless of an individual's group membership, and since different selection systems yield different proportions of selected individuals in the different groups, the question of test fairness arises in earnest. In good Kantian fashion, several models of fair selection were formulated and contrasted with each other (Cleary, 1968; Cole, 1973; Darlington, 1971; Einhorn & Bass, 1971; Linn, 1973, 1976; Thorndike, 1971); some, having been found incompatible or even mutually contradictory, offered good Hegelian contrasts (Peterson & Novick, 1976). It soon became apparent in comparing these models that each accorded a different importance or value to the various subsets of selected versus rejected and successful versus unsuccessful individuals in the different population groups (Dunnette & Borman, 1979; Linn, 1973). Moreover, the values accorded are a function not only of desired criterion performance but of desired individual and group attributes (Novick & Ellis, 1977). Thus, each model not only constitutes a different definition of fairness but also implies a particular ethical position (Hunter & Schmidt, 1976). Each view is ostensibly fair under certain conditions, so that arguments over the fairness of test use turn out in many instances to be disagreements as to what the conditions are or ought to be.

With the recognition that fundamental value differences were at issue, several utility models were developed that required specific value positions to be taken (Cronbach, 1976; Gross & Su, 1975; Peterson & Novick, 1976; Sawyer, Cole, & Cole, 1976), thereby incorporating social values explicitly with measurement technology. But making values explicit does not determine choices among them, and at this point it appears difficult if not impossible to be fair to individuals in terms of equity, to groups in terms of parity or adverse impact, to institutions in terms of efficiency, and to society in terms of benefits and risks all at the same time. A workable balancing of the needs of all of the parties is likely to require successive approximations over time, with iterative modifications of utility matrices based on experience with the consequences of decision processes to date (Darlington, 1976).
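To make one of these fairness models concrete, the following sketch (mine; the groups and scores are simulated placeholders) implements the regression criterion associated with Cleary (1968), under which a test is considered fair for a group if a common regression line neither systematically over- nor under-predicts that group's criterion performance:

```python
# Hypothetical sketch of the Cleary (1968) regression model of test bias.
# A test is "fair" in this sense if the common regression line does not
# systematically over- or under-predict the criterion for any group.
import numpy as np

rng = np.random.default_rng(0)

def fit_line(x, y):
    """Least-squares slope and intercept of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

# Invented predictor (test) and criterion scores for two groups.
x_a = rng.normal(50, 10, 200); y_a = 0.6 * x_a + rng.normal(0, 8, 200)
x_b = rng.normal(45, 10, 150); y_b = 0.6 * x_b + rng.normal(0, 8, 150)

# Common line fit to the pooled data, then mean residual per group.
slope, intercept = fit_line(np.concatenate([x_a, x_b]),
                            np.concatenate([y_a, y_b]))
for label, x, y in [("A", x_a, y_a), ("B", x_b, y_b)]:
    residual = np.mean(y - (slope * x + intercept))
    print(f"group {label}: mean prediction error = {residual:+.2f}")
# Mean errors near zero for both groups are consistent with Cleary fairness;
# a markedly nonzero mean error signals systematic mis-prediction.
```

As the text notes, other models (Thorndike, Darlington, Cole) define fairness differently and can contradict this one; which cell of outcomes is valued, and how much, is the ethical choice the statistics cannot make.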
CONSEQUENTIAL BASIS OF TEST INTERPRETATION

In contrast to test use, the value issues in test interpretation have not been as vigorously addressed. That social values impinge upon theoretical interpretation may not be as obvious, but it is no less serious. "Data come to us only in answer to questions. . . . How we put the question reflects our values on the one hand, and on the other hand helps determine the answer we get" (Kaplan, 1964, p. 385). Facts and values thus go hand in hand (Churchman, 1961), and "we cannot avoid ethics breaking into inductive logic" (Braithwaite, 1956, p. 174). As Kaplan (1964) put it, "Data are the product of a process of interpretation, and though there is some sense in which the materials for this process are 'given' it is only the product which has a scientific status and function. In a word, data have meaning, and this word 'meaning,' like its cognates 'significance' and 'import,' includes a reference to values" (p. 385). Thus, just as data and theoretical interpretation were seen to be intimately intertwined in the concept of evidence, so data and values are intertwined in the concept of interpretation, and fact, value, and meaning become three faces of the substance of science.

Whenever an event or relationship is conceptualized, it is judged, even if only tacitly, as belonging to some broader category to which value already attaches. If a crime, for example, is seen as a violation of the social order, the modal societal response is to seek correction, which is a derivative of the value context of this way of seeing. If crime is seen as a violation of the moral order, expiation will be sought. And if seen as a sign of distress, especially if the distress can be assimilated to a narrower category like mental illness, then a claim of compassion and help attaches to the valuation. In Vickers's (1970) terms, the conceptualization of an event or relationship within a broader category is a process of "matching," which is an informational concept involving the comparison of forms. The assimilation of the value attached to the broader schema is a process of "weighing," which is a dynamic concept involving the comparison of forces. For Vickers (1970), "the elaboration of the reality system and the value system proceed together. Facts are relevant only to some standard of value; values are applicable only to some configuration of fact" (p. 134). He uses the term appreciation to refer to those conjoint judgments of fact and value (Vickers, 1965).

In the construct interpretation of tests, such appreciative processes are central, though typically latent. Constructs are broader conceptual categories than the test behaviors, and they carry with them into the testing arena value connotations stemming from three major sources: First are the evaluative overtones of the construct labels themselves; next are the value connotations of the
Ideology is a complex configura- tion of values, affects, and beliefs that provides, among other things, an existential perspective for viewing the worlda "stage-setting," as it were, for interpreting the human drama in ethical, sci- entific, or whatever terms (Ed'el, 1970 ). The ideological overlay subtly influences test interpre- tation, especially for very general constructs like intelligence, in ways that go beyond empirically verified connections; in the nomological network (Crawford, 1979). The hope here in drawing attention explicitly to the value implications of test interpretation is that some of these ideological and valuative links might be exposed to inquiry and subjected either to empirical grounding or to policy debate. Exposing the value assumptions of a construct theory and its more subtle links to ideologypos- sibily to multiple, cross-cutting ideologiesis an awesome challenge. One approach is to follow Churchman's (1971) lead arid attempt to contrast each construct theory with an alternative perspec- tive for interpreting the test scores, as in the Kantian mode of inquiry; better still f,or probing the ethical implications of a theory is to contrast it with an antithetical, though plausible, H egelian counterperspective. This raises to the grander level of theory-comparison the strategy of focusing on plausible rival hypotheses and counterhypotheses in evaluating the basis for relationships within a theory. Systematic competition between counter- theories in attempting to explain the conjoint data derivable from each also tends to offset the concern that scientific observations are theory-laden or theory-dependent and that the presumption of a single theory might thereby preclude uncovering the most challenging test data for that theory (Feyerabend, 1975; Mitroff, 1973). Moreover, as Churchman (1961) stresses, although consensus is the decision rule of traditional science, conflict is the decision rule of ethics. Since the one thing .we universally disagree about is "what ought to be," any scientific approach to ethics should allow for conflict ^and debate, as should any attempt to assess the ethical implications of science. "Thus, in order to derive the 'ethical' implications of any technical or scientific model, we explicitly incor- porate a dialectical mode of examining (or testing) models" (Mitroff & Sagasti, 1973, p. 133). In a sense we are asking, as did Churchman's mentor E. A. Singer (19S9), what the consequences" would be if a given scientific judgment had the status of an ethical judgment. It should be noted that value issues intrude in the testing process at all levels, not just at the grand level of broad construct interpretation. For example, values influence the relative emphasis on different types of content in test construction (N unnally, 1967) and procedures for scoring the quality of performance on content samples (Guion,. 1978b), but the concern here is limited to the value implications of test interpretation. Consider first the evaluative overtones of the construct label itself. I have already suggested that a measure interpreted in terms of "flexibility versus rigidity" would be utilized differently if it were instead labeled "con- fusion versus control." Similarly, a measure called "inhibited versus impulsive" would have different consequences if it were labeled "self-controlled versus uninhibited." So would a variable like "stress" if it were relabeled "challenge." 
The point is not that we would make a concept like stress into a good thing by renaming it but that by not presuming it to be a bad thing we would investigate broader consequences, facilitative as well as debilitative (McGrath, 1976). In choosing a construct label, we should strive for consistency between the trait and evaluative implications of the name, attempting to capture as closely as possible the essence of the construct's theoretical import, especially its empirically grounded import, in terms reflective of its salient value connotations. This may prove difficult, however, because many traits are open to conflicting value interpretations and thus call for systematic examination of counterhypotheses about value outcomes, if not to reach convergence, at least to clarify the basis of the conflict. Some traits may also imply different value outcomes under different circumstances, which suggests the possible utility of differentiated trait labels to embody these value distinctions, as in the case of "debilitating anxiety" and "facilitating anxiety." Rival theories of the construct might also highlight different value implications, of course, and lead to conflict between the theories not only in trait interpretation but also in value interpretation.

Apart from its normative and evaluative overtones, perhaps the most important feature of a construct in regard to value connotations is its breadth, or the range of its theoretical and empirical referents. This is the issue that Snow (1974) called referent generality. The broader the construct, the more difficult it is to embrace all of its critical features in a single measure and the more we are open to what Coombs (1954) has called "operationism in reverse," that is, "endowing the measures with all the meanings associated with the concept" (p. 476). In choosing the appropriate breadth or level of generality for a construct and its label, one is buffeted by opposing counterpressures toward oversimplification on the one hand and overgeneralization on the other. At one extreme is the apparent safety in using merely descriptive labels tightly tied to behavioral exemplars in the test (such as Adding Two-Digit Numbers). Choices on this side sacrifice interpretive power and range of application if the test might also be defensibly viewed more broadly (e.g., Number Facility). At the other extreme is the apparent richness of high-level inferential labels (such as Intelligence, Creativity, or Introversion). Choices on this side are subject to the dangers of mischievous dispositional connotations and the backlash of conceptual imperialism.

At first glance, one might think that the appropriate level of construct reference should be tied not to test behavior but to the level of generalization supported by the convergent and discriminant research evidence in hand. But constructs refer to potential relationships as well as actual relationships, so their level of generality should in principle be tied to their range of reference in the nomological theory, with the important proviso that this range be restricted or extended when research evidence so indicates.
The scope of the original theoretical formulation is thus modified by the research evidence available, but it is not limited to the research evidence available. As Cook and Campbell (1979) put it, "The data edit the kinds of general statements we can make" (p. 88). And debating the value implications of test interpretation may also edit the kinds of general statements we should make.

Validity as Evaluation of Evidence and Consequence

Test validity is thus an overall evaluative judgment of the adequacy and appropriateness of inferences drawn from test scores. This evaluation rests on four bases: (1) an inductive summary of convergent and discriminant research evidence that the test scores are interpretable in terms of a particular construct meaning, (2) an appraisal of the value implications of that interpretation, (3) a rationale and evidence for the relevance of the construct and the utility of the scores in particular applications, and (4) an appraisal of the potential social consequences of the proposed use and of the actual consequences when used.

Putting these bases together, we can see test validity to have two interconnected facets linking the source of justification (either evidential or consequential) to the function or outcome of the testing (either interpretation or use). This crossing of basis and function is portrayed in Figure 1. The interactions among these aspects are more dynamic in practice, however, than is implied by a fourfold classification. In an attempt to represent the interdependence and feedback among the components, a flow diagram is presented in Figure 2. The double arrows linking construct validity and test interpretation in the diagram are meant to imply a continuous process that starts sometimes with a construct in search of proper measurement and sometimes with an existing test in search of proper meaning.

Figure 2. Feedback model for test validity. [Flow diagram not reproduced; recoverable labels include "Implications for Test Interpretation" and "Evaluate Consequences."]

The model also includes a pragmatic component for the evaluation of actual consequences of test practice, pragmatic in the sense that this component is oriented, like pragmatic philosophy, toward outcomes rather than origins and seeks justification for use in the practical consequences of use. The primary concern of this component is the balancing of the instrumental value of the test in accomplishing its intended purpose with the instrumental value of any negative side effects and positive by-products of the testing. Most test makers acknowledge responsibility for providing general evidence of the instrumental value of the test. The terminal value of the test in terms of the social ends to be served goes beyond the test maker to include as well the decision maker, policymaker, and test user, who are responsible for specific evidence of instrumental value in their particular setting and for the specific interpretations and uses made of the test scores. In the final analysis, "responsibility for valid use of a test rests on the person who interprets it" (Cronbach, 1969, p. 51), and that interpretation entails responsibility for its value consequences.

Intervening in the model between test use and the evaluation of consequences is a decision matrix, to emphasize the point that tests are rarely used in isolation but rather in combination with other information in broader decision systems.
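To suggest in miniature how such decision systems trade on explicit utilities, consider a hypothetical sketch (in Python) in the spirit of Cronbach and Gleser's (1965) decision-theoretic treatment; every quantity below is invented for illustration, not estimated from any real testing program.

    # Hypothetical decision-theoretic comparison of two testing policies,
    # in the spirit of Cronbach and Gleser (1965). All figures are invented.

    def expected_utility(p_success, benefit_success, cost_failure, cost_testing):
        """Expected payoff per accepted applicant under a given policy."""
        return (p_success * benefit_success
                - (1 - p_success) * cost_failure
                - cost_testing)

    # Without the test, suppose half of those accepted succeed; with a
    # cutoff on a valid test, suppose .65 of those accepted succeed.
    no_test = expected_utility(p_success=0.50, benefit_success=100,
                               cost_failure=80, cost_testing=0)
    with_test = expected_utility(p_success=0.65, benefit_success=100,
                                 cost_failure=80, cost_testing=5)

    print("expected utility per acceptance, no test:  ", no_test)
    print("expected utility per acceptance, with test:", with_test)

The comparison turns not only on validity, which moves the success rate, but on the utilities assigned to success, failure, and the testing itself, and those assignments are value judgments rather than psychometric facts.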
The decision process is profoundly influenced by social values and deserves, in its own right, massive research attention beyond the good beginning provided by utility models. As Guion (1976a) phrased it, "The formulation of hypotheses is or should be applied science, the validation of hypotheses is applied methodology, but the act of making . . . [a] decision is . . . still an art" (p. 646). The feedback model as portrayed is a closed system, to emphasize the point that even when consequences are evaluated favorably they should be continuously or periodically monitored to permit the detection of changing circumstances and of delayed side effects.

The model is closed, and this article is closed, with the provocative words of Sir Geoffrey Vickers (1970): "If indeed we have reached the end of ideology (in Daniel Bell's phrase) it is not because we can do without ideologies but because we should now know enough about them to show a proper respect for our neighbour's and a proper sense of responsibility for our own" (p. 109).

REFERENCES

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. Standards for educational and psychological tests. Washington, D.C.: American Psychological Association, 1974.
Bracht, G. H., & Glass, G. V. The external validity of experiments. American Educational Research Journal, 1968, 5, 437-474.
Braithwaite, R. B. Scientific explanation. Cambridge, England: Cambridge University Press, 1956.
Brogden, H. E. On the interpretation of the correlation coefficient as a measure of predictive efficiency. Journal of Educational Psychology, 1946, 37, 65-76.
Burton, N. W. Societal standards. Journal of Educational Measurement, 1978, 15, 263-271.
Campbell, D. T. Recommendations for APA test standards regarding construct, trait, or discriminant validity. American Psychologist, 1960, 15, 546-553.
Campbell, D. T., & Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 1959, 56, 81-105.
Campbell, J. P. Psychometric theory. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.
Churchman, C. W. Prediction and optimal decision: Philosophical issues of a science of values. Englewood Cliffs, N.J.: Prentice-Hall, 1961.
Churchman, C. W. The design of inquiring systems: Basic concepts of systems and organization. New York: Basic Books, 1971.
Cleary, T. A. Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 1968, 5, 115-124.
Coan, R. W. Facts, factors, and artifacts: The quest for psychological meaning. Psychological Review, 1964, 71, 123-140.
Cole, N. S. Bias in selection. Journal of Educational Measurement, 1973, 10, 237-255.
Cook, T. D., & Campbell, D. T. Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally, 1979.
Coombs, C. H. Theory and methods of social measurement. In L. Festinger & D. Katz (Eds.), Research methods in the behavioral sciences. New York: Holt, Rinehart & Winston, 1954.
Crawford, C. George Washington, Abraham Lincoln, and Arthur Jensen: Are they compatible? American Psychologist, 1979, 34, 664-672.
Cronbach, L. J. Validation of educational measures. Proceedings of the 1969 Invitational Conference on Testing Problems: Toward a theory of achievement measurement. Princeton, N.J.: Educational Testing Service, 1969.
Cronbach, L. J. Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.
Cronbach, L. J. Equity in selection: Where psychometrics and political philosophy meet. Journal of Educational Measurement, 1976, 13, 31-41.
Cronbach, L. J., & Gleser, G. C. Psychological tests and personnel decisions (2nd ed.). Urbana: University of Illinois Press, 1965.
Cronbach, L. J., Gleser, G., Nanda, H., & Rajaratnam, N. The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley, 1972.
Cronbach, L. J., & Meehl, P. E. Construct validity in psychological tests. Psychological Bulletin, 1955, 52, 281-302.
Curtis, E. W., & Alf, E. F. Validity, predictive efficiency, and practical significance of selection tests. Journal of Applied Psychology, 1969, 53, 327-337.
Darlington, R. B. Another look at "culture fairness." Journal of Educational Measurement, 1971, 8, 71-82.
Darlington, R. B. A defense of "rational" personnel selection, and two new methods. Journal of Educational Measurement, 1976, 13, 43-52.
Darlington, R. B., & Stauffer, G. F. Use and evaluation of discrete test information in decision making. Journal of Applied Psychology, 1966, 50, 125-129.
Division of Industrial and Organizational Psychology, American Psychological Association. Principles for the validation and use of personnel selection procedures. Hamilton, Ohio: Hamilton Print Co., 1975.
Dunnette, M. D., & Borman, W. C. Personnel selection and classification systems. In M. R. Rosenzweig & L. W. Porter (Eds.), Annual Review of Psychology (Vol. 30). Palo Alto, Calif.: Annual Reviews, 1979.
Ebel, R. L. Must all tests be valid? American Psychologist, 1961, 16, 640-647.
Ebel, R. L. The social consequences of educational testing. Proceedings of the 1963 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1964.
Ebel, R. L. Comments on some problems of employment testing. Personnel Psychology, 1977, 30, 55-63.
Edel, A. Science and the structure of ethics. In O. Neurath, R. Carnap, & C. Morris (Eds.), Foundations of the unity of science: Toward an international encyclopedia of unified science (Vol. 2). Chicago: University of Chicago Press, 1970.
Einhorn, H. J., & Bass, A. R. Methodological considerations relevant to discrimination in employment testing. Psychological Bulletin, 1971, 75, 261-269.
Equal Employment Opportunity Commission, Civil Service Commission, U.S. Department of Labor, & U.S. Department of Justice. Uniform guidelines on employee selection procedures. Federal Register (August 25, 1978), 43(166), 38290-38315.
Feigl, H. Some major issues and developments in the philosophy of science of logical empiricism. In H. Feigl & M. Scriven (Eds.), Minnesota studies in philosophy of science: The foundations of science and the concepts of psychology and psychoanalysis. Minneapolis: University of Minnesota Press, 1956.
Feyerabend, P. Against method: Outline of an anarchist theory of knowledge. London, England: New Left Books, 1975.
Glaser, R., & Nitko, A. J. Measurement in learning and instruction. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.
Goodenough, F. L. Mental testing: Its history, principles, and applications. New York: Holt, Rinehart & Winston, 1969.
Defining a "fair" or "unbiased" selection model: A question of utilities. Journal of Ap- plied Psychology, 1975, 60, 345-351. Guion, R. M. O pen a new window: Validities and values in psychological measurement. American Psychologist, 1974, 29, 2 87-2 96. Guion, R. M. The practice of industrial and organizational psychology. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McN ally, 1976. (a) , Guion, R. M. Recruiting, selection, and job placement. In M. D. Dunnette (Ed.), Handbook of industrial and or- ganizational psychology. Chicago: Rand McN ally, 1976. (b) Guion, R. M. Content validityThe source of my discon- tent. Applied Psychological Measurement, 1977, 1, 1- 10 . (a) Guion, R. M. Content validity: Three years of talk What's the action? Public Personnel Management, 1977, 6, 407-414. (b) Guion, R. M: "Content validity" in moderation. Person- nel Psychology, 1978, 31, 205-213. (a) Guion, R. M. Scoring of content domain samples: The problem of fairness. Journal of Applied Psychology, 1978,^,499-50 6. (b) Guion, R. M. O n trinitarian doctrines of validity. Pro- fessional Psychology, 1980 ,11, 385-398. Gulliksen, H . Intrinsic validity. American Psychologist, 1950 ,5,511-517. H empel, C. G. Fundamentals of concept formation in em- pirical science. In O . N eurath, R. Carnap, & C. Morris (Eds.),' Foundations of the unity of science: Toward an international encyclopedia of unified science (Vol. 2 ). Chicago: University of Chicago Press, 1970. H udson, L. Singularity of talent. In S. Messick (Ed,), . Individuality in learning. San Francisco: Jossey-Bass, ,1976. H unter, J. E., & Schmidt, F. L. Critical analysis of the statistical and ethical implications of various definitions of test bias. Psychological Bulletin, 1976, 83, 1053-1071. H unter, J. E., Schmidt, F. L., & Rauschenberger, J. M. Fairness of psychological tests: Implications of four def- initions for selection utility and minority hiring. Journal of Applied Psychology, 1977, 62, 2 45-2 60 . James, L. R. Criterion models and construct validity for criteria. Psychological Bulletin, 1973, 80, 75-83. Kaplan, A: The conduct of inquiry: Methodology for be- havioral science. San Francisco: Chandler, 1964. Kavanagh, M. J., MacKinney, A. C., & Wolins, L. Issues in managerial performance: Multitrait-multimethod anal- yses of ratings. Psychological Bulletin, 1971, 75, 3449. Lennon, R. T. Assumptions underlying the use of content validity. Educational and Psychological Measurement, 1956, 16, 2 94-30 4. Lenzen, V. F. P rocedures of empirical science. In O . N eu- rath, R. Carnap, & C. W. Morris (Eds.), International encyclopedia of unified science (Vol. 1, P t. 1). Chicago: University of Chicago P ress, 1955. Linn, R. L. Fair test use in selection. Review of Educa- tional Research, 1973, 43, -139-161. Linn, R. L. In search of fair selection procedures. Journal of Educational Measurement, 1976, 13, 53-58. Loevinger, J. O bjective tests as instruments df psychologi- cal theory. Psychological Reports, 1957, 3, 635-694 (Monograph Supplement 9). Margenau, H . The nature of physical reality. N ew Y ork: McGraw-H ill, 1950. (Reprinted, Woodbridge, Conn.: O xbow, 1977.) McGrath, J. E. Stress and behavior in organizations. In M. D. Dunnette (Ed.), Handbook of industrial and or- ganizational psychology. Chicago: Rand McN ally, 1976. Messick, S. P ersonality measurement and college perform- ance. Proceedings of the 1963 Invitational Conference on Testing Problems. 
Messick, S. Personality measurement and college performance. Proceedings of the 1963 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1964.
Messick, S. Personality measurement and the ethics of assessment. American Psychologist, 1965, 20, 136-142.
Messick, S. The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 1975, 30, 955-966.
Messick, S. Potential uses of noncognitive measurement in education. Journal of Educational Psychology, 1979, 71, 281-292.
Messick, S., & Barrows, T. S. Strategies for research and evaluation in early childhood education. In I. J. Gordon (Ed.), Early childhood education: The seventy-first yearbook of the National Society for the Study of Education. Chicago: University of Chicago Press, 1972.
Mitroff, I. I. "Be it resolved that structured debate, not consensus, ought to form the epistemic cornerstone of OR/MS": A reaction to Ackoff's note on systems science. Interfaces, 1973, 3, 14-17.
Mitroff, I. I., & Sagasti, F. Epistemology as general systems theory: An approach to the design of complex decision-making experiments. Philosophy of Social Science, 1973, 3, 117-134.
Novick, M. R., & Ellis, D. D. Equal opportunity in educational and employment selection. American Psychologist, 1977, 32, 306-320.
Nunnally, J. Psychometric theory. New York: McGraw-Hill, 1967.
Peterson, N. S., & Novick, M. R. An evaluation of some models for culture-fair selection. Journal of Educational Measurement, 1976, 13, 3-29.
Popper, K. R. The logic of scientific discovery. New York: Basic Books, 1959.
Sawyer, R. L., Cole, N. S., & Cole, J. W. L. Utilities and the issue of fairness in a decision theoretic model for selection. Journal of Educational Measurement, 1976, 13, 59-76.
Shulman, L. S. Reconstruction of educational research. Review of Educational Research, 1970, 40, 371-396.
Singer, E. A. Experience and reflection (C. W. Churchman, Ed.). Philadelphia: University of Pennsylvania Press, 1959.
Smith, P. C. Behaviors, results, and organizational effectiveness: The problem of criteria. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.
Snow, R. E. Representative and quasi-representative designs for research on teaching. Review of Educational Research, 1974, 44, 265-291.
Tenopyr, M. L. Content-construct confusion. Personnel Psychology, 1977, 30, 47-54.
Thorndike, R. L. Personnel selection: Test and measurement techniques. New York: Wiley, 1949.
Thorndike, R. L. Concepts of culture-fairness. Journal of Educational Measurement, 1971, 8, 63-70.
Vickers, G. The art of judgment. New York: Basic Books, 1965.
Vickers, G. Value systems and social process. Harmondsworth, Middlesex, England: Penguin Books, 1970.
Wallach, M. A. Psychology of talent and graduate education. In S. Messick (Ed.), Individuality in learning. San Francisco: Jossey-Bass, 1976.
Webb, E. J., Campbell, D. T., Schwartz, R. D., & Sechrest, L. Unobtrusive measures: Nonreactive research in the social sciences. Chicago: Rand McNally, 1966.
Wernimont, P. F., & Campbell, J. P. Signs, samples, and criteria. Journal of Applied Psychology, 1968, 52, 372-376.