Principles of Language Assessment - Practicality, Reliability, Validity, Authenticity, and Washback


A. Practicality
An effective test is practical. This means that it

• is not excessively expensive,
• stays within appropriate time constraints,
• is relatively easy to administer, and
• has a scoring/evaluation procedure that is specific and time-efficient.

A test that is prohibitively expensive is impractical. A test of language
proficiency that takes a student five hours to complete is impractical; it consumes
more time (and money) than necessary to accomplish its objective. A test that
requires individual one-on-one proctoring is impractical for a group of several
hundred test-takers and only a handful of examiners. A test that takes a few
minutes for a student to take and several hours for an examiner to evaluate is
impractical for most classroom situations.

B. Reliability
A reliable test is consistent and dependable. If you give the same test to the same
student or matched students on two different occasions, the test should yield
similar results. The issue of reliability of a test may best be addressed by
considering a number of factors that may contribute to the unreliability of a test.
Consider the following possibilities (adapted from Mousavi, 2002, p. 804):
fluctuations in the student, in scoring, in test administration, and in the test itself.

• Student-Related Reliability

The most common learner-related issue in reliability is caused by temporary illness,
fatigue, a “bad day,” anxiety, and other physical or psychological factors, which
may make an “observed” score deviate from one’s “true” score. Also included in
this category are such factors as a test-taker’s “test-wiseness” or strategies for
efficient test taking (Mousavi, 2002, p. 804).

• Rater Reliability

Human error, subjectivity, and bias may enter into the scoring process. Inter-rater
unreliability occurs when two or more scorers yield inconsistent scores on the same
test, possibly for lack of attention to scoring criteria, inexperience, inattention, or
even preconceived biases. In the story above about the placement test, the initial
scoring plan for the dictations was found to be unreliable; that is, the two scorers
were not applying the same standards.
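One way to quantify how consistently two scorers are applying the same standards is to compare their ratings directly. The sketch below, in Python, computes simple percent agreement and Cohen's kappa for a small set of hypothetical band scores; the scale, the data, and the choice of statistic are illustrative assumptions, and operational studies would use larger samples and often other indices (for example, a correlation coefficient for continuous scores).

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical scores."""
    n = len(rater_a)
    # Observed proportion of exact agreement
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters assigned categories independently
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical scores (1-4 band scale) given by two scorers to the same ten essays
scorer_1 = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
scorer_2 = [3, 2, 3, 3, 1, 2, 4, 4, 2, 2]

agreement = sum(a == b for a, b in zip(scorer_1, scorer_2)) / len(scorer_1)
print(f"Exact agreement: {agreement:.0%}")
print(f"Cohen's kappa:   {cohens_kappa(scorer_1, scorer_2):.2f}")
```

A low kappa would signal that the scorers need to revisit the scoring criteria together before the results can be trusted.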

• Test Administration Reliability

Unreliability may also result from the conditions in which the test is administered.
I once witnessed the administration of a test of aural comprehension in which a
tape recorder played items for comprehension, but because of street noise outside
the building, students sitting next to windows could not hear the tape accurately.
This was a clear case of unreliability caused by the conditions of the test
administration. Other sources of unreliability are found in photocopying variations,
the amount of light in different parts of the room, variations in temperature, and
even the condition of desks and chairs.

• Test Reliability

Sometimes the nature of the test itself can cause measurement errors. If a test is too
long, test-takers may become fatigued by the time they reach the later items and
hastily respond incorrectly. Timed tests may discriminate against students who do
not perform well on a test with a time limit. We all know people (and you may be
included in this category) who “know” the course material perfectly but who are
adversely affected by the presence of a clock ticking away. Poorly written test
items (that are ambiguous or that have more than one correct answer) may be a
further source of test unreliability.

C. Validity
By far the most complex criterion of an effective test, and arguably the most
important principle, is validity, “the extent to which inferences made from
assessment results are appropriate, meaningful, and useful in terms of the purpose
of the assessment” (Gronlund, 1998, p. 226). A valid test of reading ability actually
measures reading ability, not 20/20 vision, nor previous knowledge in a subject, nor
some other variable of questionable relevance. To measure writing ability, one
might ask students to write as many words as they can in 15 minutes, then simply
count the words for the final score. Such a test would be easy to administer
(practical), and the scoring quite dependable (reliable). But it would not constitute
a valid test of writing ability without some consideration of comprehensibility,
rhetorical discourse elements, and the organization of ideas, among other factors.
• Content-Related Evidence

If a test actually samples the subject matter about which conclusions are to be
drawn, and if it requires the test-takers to perform the behavior that is being
measured, it can claim content-related evidence of validity, often popularly
referred to as content validity (e.g., Mousavi, 2002; Hughes, 2003). You can
usually identify content-related evidence observationally if you can clearly define
the achievement that you are measuring.

• Criterion-Related Evidence

A second form of evidence of the validity of a test may be found in what is called
criterion-related evidence, also referred to as criterion-related validity, or the extent
to which the “criterion” of the test has actually been reached. You will recall that
in Chapter I it was noted that most classroom-based assessment with teacher-
designed tests fits the concept of criterion-referenced assessment. In such tests,
specified classroom objectives are measured, and implied predetermined levels of
performance are expected to be reached (80 percent is considered a minimal
passing grade).

• Construct-Related Evidence

A third kind of evidence that can support validity, but one that does not play as
large a role for classroom teachers, is construct-related validity, commonly referred
to as construct validity. A construct is any theory, hypothesis, or model that attempts
to explain observed phenomena in our universe of perceptions. Constructs may or
may not be directly or empirically measured; their verification often requires
inferential data.
• Consequential Validity

As well as the above three widely accepted forms of evidence that may be
introduced to support the validity of an assessment, two other categories may be of
some interest and utility in your own quest for validating classroom tests. Messick
(1989), Gronlund (1998), McNamara (2000), and Brindley (2001), among others,
underscore the potential importance of the consequences of using an assessment.
Consequential validity encompasses all the consequences of a test, including such
considerations as its accuracy in measuring intended criteria, its impact on the
preparation of test-takers, its effect on the learner, and the (intended and
unintended) social consequences of a test’s interpretation and use.

• Face Validity

An important facet of consequential validity is the extent to which “students view
the assessment as fair, relevant, and useful for improving learning” (Gronlund,
1998, p. 210), or what is popularly known as face validity. “Face validity refers to
the degree to which a test looks right, and appears to measure the knowledge or
abilities it claims to measure, based on the subjective judgment of the examinees
who take it, the administrative personnel who decide on its use, and other
psychometrically unsophisticated observers” (Mousavi, 2002, p. 244).

D. Authenticity
A fourth major principle of language testing is authenticity, a concept that is a
little slippery to define, especially within the art and science of evaluating and
designing tests. Bachman and Palmer (1996, p. 23) define authenticity as “the
degree of correspondence of the characteristics of a given language test task to the
features of a target language task,” and then suggest an agenda for identifying
those target language tasks and for transforming them into valid test items.

E. Washback
A facet of consequential validity, discussed above, is “the effect of testing on
teaching and learning” (Hughes, 2003, p. 1), otherwise known among language-
testing specialists as washback. In large-scale assessment, washback generally refers
to the effects tests have on instruction in terms of how students prepare for the
test. “Cram” courses and “teaching to the test” are examples of such washback.
Another form of washback that occurs more in classroom assessment is the
information that “washes back” to students in the form of useful diagnoses of
strengths and weaknesses. Washback also includes the effects of an assessment on
teaching and learning prior to the assessment itself, that is, on preparation for the
assessment.

F. Applying Principles to the Evaluation of Classroom Tests


The five principles of practicality, reliability, validity, authenticity, and
washback go a long way toward providing useful guidelines for both evaluating an
existing assessment procedure and designing one on your own. Quizzes, tests, final
exams, and standardized proficiency tests can all be scrutinized through these five
lenses.

1. Are the test procedures practical?
2. Is the test reliable?
3. Does the procedure demonstrate content validity?
4. Is the procedure face valid and “biased for best”?
5. Are the test tasks as authentic as possible?
6. Does the test offer beneficial washback to the learner?

TYPES OF VALIDITY EVIDENCE

• Content validity
• Face validity
• Curricular validity
• Criterion-related validity
• Predictive validity
• Concurrent validity
• Construct validity
• Convergent validity
• Discriminant validity
• Consequential validity
Content Validity

Content validity addresses the match between test questions and the content or subject area they are intended to
assess. This concept of match is sometimes referred to as alignment, while the content or subject area of the test may
be referred to as a performance domain.

Experts in a given performance domain generally judge content validity. For example, the content of the SAT
Subject Tests™ is evaluated by committees made up of experts who ensure that each test covers content that
matches all relevant subject matter in its academic discipline. Both a face validity and a curricular validity study
may be used to establish the content validity of a test.
Face Validity refers to the extent to which a test or the questions on a test appear to measure a particular construct
as viewed by laypersons, clients, examinees, test users, the public, or other stakeholders. In other words, it looks like
a reasonable test for the purpose for which it is being used. This common-sense approach to validity is often important in
convincing laypersons to allow the use of a test, regardless of the availability of more scientific means.
Content-related evidence of validity comes from the judgments of people who are either experts in the testing of
that particular content area or are content experts.
Because these two groups may approach a test from different perspectives, it is important to recognize the valuable
contributions made by both.

Curricular Validity is the extent to which the content of the test matches the objectives of a specific curriculum as
it is formally described.
Curricular validity takes on particular importance in situations where tests are used for high-stakes decisions, such as
state high school exit examinations. In these situations, curricular validity means that the content of a test that is
used to make a decision about whether a student receives a high school diploma should measure the curriculum that
the student is taught in high school.

Curricular validity is evaluated by groups of curriculum/content experts. The experts are asked to judge whether the
content of the test is parallel to the curriculum objectives and whether the test and curricular emphases are in proper
balance.
Criterion-Related Validity

Criterion-related validity looks at the relationship between a test score and an outcome. For example, SAT™ scores
are used to determine whether a student will be successful in college. First-year grade point average becomes the
criterion for success. Looking at the relationship between test scores and the criterion can tell you how valid the test
is for determining success in college. The criterion can be any measure of success for the behavior of interest. In the
case of a placement test, the criterion might be grades in the course.

A criterion-related validation study is completed by collecting both the test scores that will be used and information
on the criterion for the same students (e.g., SAT scores and first-year grade point average, or AP® Calculus exam
grades and performance in the next level college calculus course). The test scores are correlated to the criterion to
determine how well they represent the criterion behavior.
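A minimal sketch of such a study, assuming paired test scores and first-year grade point averages for a hypothetical group of students, is to compute the Pearson correlation between the two measures. The data below are invented for illustration; real validation studies would also consider sample size, restriction of range, and other factors.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical data: admission test scores and first-year GPA for eight students
test_scores = [1050, 1180, 1250, 1320, 1400, 1100, 1480, 1220]
first_year_gpa = [2.4, 2.9, 3.1, 3.3, 3.6, 2.6, 3.8, 3.0]

r = pearson_r(test_scores, first_year_gpa)
print(f"Criterion-related validity coefficient: r = {r:.2f}")
```

The higher the correlation, the stronger the criterion-related evidence that the test scores represent the criterion behavior.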
A criterion-related validation study can be either predictive of later behavior or a concurrent measure of behavior or
knowledge.
Predictive validity refers to the "power" or usefulness of test scores to predict future performance.
Examples of such future performance may include academic success (or failure) in a particular course, good driving
performance (if the test was a driver's exam), or aviation performance (predicted from a comprehensive piloting
exam). Establishing predictive validity is particularly useful when colleges or universities use standardized test
scores as part of their admission criteria for enrollment or for admittance into a particular program.

The same placement study used to determine the predictive validity of a test can be used to determine an optimal cut
score for the test.
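As an illustration only (not a description of any particular standard-setting procedure), one simple approach is to sweep candidate cut scores and keep the one that best separates students who later succeeded in the course from those who did not. The placement data and the accuracy criterion below are invented for this sketch.

```python
# Hypothetical placement data: (test score, succeeded in the target course?)
placement_data = [
    (42, False), (48, False), (55, False), (58, True), (61, False),
    (63, True), (67, True), (70, True), (74, True), (79, True),
]

def classification_accuracy(cut, data):
    """Fraction of students correctly classified by a cut score:
    predicted to succeed if score >= cut, predicted to struggle otherwise."""
    correct = sum((score >= cut) == succeeded for score, succeeded in data)
    return correct / len(data)

candidate_cuts = sorted({score for score, _ in placement_data})
best_cut = max(candidate_cuts, key=lambda c: classification_accuracy(c, placement_data))
print(f"Best cut score: {best_cut} "
      f"(accuracy {classification_accuracy(best_cut, placement_data):.0%})")
```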

Concurrent Validity needs to be examined whenever one measure is substituted for another, such as allowing
students to pass a test instead of taking a course. Concurrent validity is determined when test scores and criterion
measurement(s) are either made at the same time (concurrently) or in close proximity to one another. For example, a
successful score on the CLEP® College Algebra exam may be used in place of taking a college algebra course. To
determine concurrent validity, students completing a college algebra course are administered the CLEP College
Algebra exam. If there is a strong relationship (correlation) between the CLEP exam scores and course grades in
college algebra, the test is valid for that use.

Construct Validity

Construct validity refers to the degree to which a test or other measure assesses the underlying theoretical construct
it is supposed to measure (i.e., the test is measuring what it is purported to measure).
As an example, think about a general knowledge test of basic algebra. If a test is designed to assess knowledge
of facts concerning rate, time, distance, and their interrelationship with one another, but test questions are phrased in
long and complex reading passages, then perhaps reading skills are inadvertently being measured instead of factual
knowledge of basic algebra.
Construct validation requires the compilation of multiple sources of evidence. In order to demonstrate construct
validity, evidence that the test measures what it purports to measure (in this case basic algebra) as well as evidence
that the test does not measure irrelevant attributes (reading ability) are both required. These are referred to as
convergent and discriminant validity.

Convergent validity consists of providing evidence that two tests that are believed to measure closely related skills
or types of knowledge correlate strongly. That is to say, the two different tests end up ranking students similarly.
Discriminant validity, by the same logic, consists of providing evidence that two tests that do not measure closely
related skills or types of knowledge do not correlate strongly (i.e., dissimilar ranking of students).
Both convergent and discriminant validity provide important evidence in the case of construct validity. As noted
previously, a test of basic algebra should primarily measure algebra-related constructs and not reading constructs. In
order to determine the construct validity of a particular algebra test, one would need to demonstrate that the
correlations of scores on that test with scores on other algebra tests are higher than the correlations of scores on
reading tests.
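This comparison can be sketched numerically. Assuming hypothetical scores on a new algebra test, an established algebra test, and a reading test for the same ten students, the snippet below (using NumPy for brevity) checks that the algebra-to-algebra correlation (convergent evidence) is clearly higher than the algebra-to-reading correlation (discriminant evidence). All data are invented for illustration.

```python
import numpy as np

# Hypothetical scores for the same ten students on three tests
new_algebra = np.array([55, 62, 70, 48, 81, 66, 74, 59, 88, 51])
other_algebra = np.array([58, 60, 73, 50, 79, 68, 71, 62, 85, 54])
reading = np.array([72, 61, 66, 70, 75, 64, 80, 58, 69, 77])

convergent_r = np.corrcoef(new_algebra, other_algebra)[0, 1]
discriminant_r = np.corrcoef(new_algebra, reading)[0, 1]

print(f"Convergent (algebra vs. algebra):   r = {convergent_r:.2f}")
print(f"Discriminant (algebra vs. reading): r = {discriminant_r:.2f}")
# Construct-related evidence requires the convergent correlation to be clearly higher
```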

Consequential Validity

Some testing experts use consequential validity to refer to the social consequences of using a particular test for a
particular purpose. The use of a test is said to have consequential validity to the extent that society benefits from that
use of the test. Other testing experts believe that the social consequences of using a test—however important they
may be—are not properly part of the concept of validity.

Messick (1988) makes the point that ". . . it is not that adverse social consequences of test use render the use invalid
but, rather, that adverse social consequences should not be attributable to any source of test invalidity such as
construct-irrelevant variance."

For example, suppose some subgroups obtain lower scores on a mathematics placement test and, consequently, are
required to take developmental courses. According to Messick, this action alone does not render the test scores
invalid. However, suppose it was determined that the test was measuring different traits for the particular subgroup
than for the larger group, and those traits were not important for doing the required mathematics. In this case, one
could conclude that the adverse social consequences (e.g., more subgroup members in developmental mathematics
courses) were caused by using the test scores and were traceable to sources of invalidity. In that case, the validity of
the test use (course placement) would be jeopardized.
