
Authenticity in Language Test Design

Written By:
Miriam A. Alkubaidi
2009

Teachers are designers. An essential act of our profession is the crafting of curriculum and learning experiences to meet specified purposes. We are also designers of assessment to diagnose student needs to guide our teaching and to enable us, our students, and others (parents and administrators) to determine whether we have achieved our goal (Wiggins & McTighe, 2005, p. 1).

The goal Wiggins and McTighe are referring to is the imparting of knowledge to language learners. Teachers may work towards this goal through accurate test design, and in particular through the implementation of authentic testing. Although authenticity has been perceived from various perspectives, on the whole it has never been fully realized to the high standards scholars propose. This paper therefore suggests a possible definition of authenticity under which a test may be regarded as authentic in terms of language testing, and it argues that a test can be designed to be authentic by the definition proposed. The paper will defend this definition through various readings, while also supporting the notion that a truly authentic test is impossible to create.

Understandings of authenticity:

To establish the importance of authenticity in language testing, one may first attempt to define 'authenticity'. The meaning of authenticity has been much debated both in applied linguistics and in the field of education as a whole (Lewkowicz, 2000, p. 43). Since the 1990s, many scholars have agreed that language testing is itself a discipline derived from applied linguistics (Bachman, 1991, p. 672). Davies, Brown, Elder, Hill, Lumley and McNamara (1999, p. 13) suggest that authenticity in testing occurs when both 'the content and skills' of the test mirror one another. For instance, when testing students' speaking ability, learners may be asked to role-play using phrases taught in class. Bachman (1991, p. 671), in fact, determines authenticity through an interactional model which links the test method's characteristics with the actual
language proficiency being tested. Additionally, Bachman and Palmer (1996, p. 23)
reach a similar conclusion: test tasks should correspond with the language being tested through tasks which, they say, can be described as 'relatively authentic'. From their definition we can deduce that full authenticity cannot be achieved with certainty. Furthermore, they believe that it is the 'degree of communication' to which the test task represents the actual taught language abilities that elicits natural responses from test takers (Bachman & Palmer, 1996, p. 23). This may be achieved in language classes when learners communicate with one another, for instance by requesting directions as a task previously taught in class.
On the other side of the coin, McNamara (1996, p. 11) argues that a performance assessment is authentic if it replicates real life and simulates real-life responses. One may counter, however, that such an assessment is a virtual imitation and therefore only partially achieves authenticity. In contrast, Spolsky (1985, p. 32) points out that to be authentic a test must draw either on language teaching material extracted from primary resources (materials) or on the function of its tasks (methodology). In regard to testing listening, for instance, numerous questions arise: do we merely read the passage to the participants, or should the passage be extracted from a radio broadcast, for example? Which more closely replicates a real-life situation? And is it, in fact, measuring what the test is intended to measure (validity)?

On the topic of validity, which is essential to achieving authenticity in test design, Messick (1996, pp. 243-244) indicates that authenticity is simply direct testing with specific validity standards expressed through communicative behaviour, i.e. listening, speaking, reading and writing, whereas Spolsky (1985, p. 34) suggests that 'face validity' is indeed only 'real' language use. In line with these findings, we may well conclude that realistically authentic tests do not exist per se; rationales supporting this argument will be presented later in this paper. In essence, we may view authenticity in testing as an ongoing process that uses specific validity constructs as a medium to achieve its precise designated purposes through the operation of primary resources as learning materials, in order to simulate an exclusive register. This definition stems from previous readings in which authenticity has been treated as a property of the test itself, whereas this paper suggests that authenticity is a continuous assessment process that cannot be fully achieved through the reproduction of so-called real-life situations. A test is a test, which is a form of assessment, and it can never authentically replicate real-life situations, however much we may try to simulate responses that lead us as close as possible to a true-to-life test design. Therefore, a new approach to authenticity is needed.

Issues and problems revolving around authenticity and its relation to construct validity:

There are different reasons why tests cannot be truly authentic. Throughout the literature on authenticity it is suggested that an authentic test is one that replicates real life. Davies et al. (1999, p. 13) confirm that authenticity can never be completely achieved. This is highly probable for four reasons. To begin with, the very fact that a task is under assessment shatters the concept of a genuine real-life situation (Davies et al., 1999, p. 13). This is called the 'real-life' (RL) approach to authenticity (Bachman, 1990, p. 301). Spolsky (1985, p. 31) confirms this notion and, as a result, suggests that observation of authentic language behaviour, or 'authentic test language' (Stevenson & Spolsky, cited in Shohamy & Reves, 1985, p. 54), produced by participants may yield an authentic form of assessment. Spolsky adds that participants undertaking tests are placed under scrutiny, arousing anxiety which in turn affects the results of testing in either a positive or negative way. This is a logical conclusion, since the 'communicative context' of a test is an assessment context (Stevenson, 1985, p. 44).

Secondly, tests are bound by language limitations, such as a specific target language use (TLU), which is unrealistic. However useful it may be for the learners' purposes in learning a language, it does not replicate reality. Moreover, it confines the learner to simply emulating the TLU.

Thirdly, the fact that tests are administered at a specific time, in a specific place, and with specific participants contradicts the notion of authenticity in testing, whereby authenticity is perceived at times as a reproduction of real-life situations. In addition, in real life variables change and variously affect the process of language use, whereas tests have controlled variables.
Finally, this approach is merely concerned with 'face validity' (Bachman, 1990, p. 315), which in turn neglects accurate assessment. For instance, when students are debating a topic, the teacher may be attempting to assess their speaking proficiency; however, given the natural flow of language, how can criteria be designed to accommodate efficient assessment? The fact remains that the assessment is murky and therefore erroneous in itself.

Equally, Bachman (1990, p. 302) approaches authenticity from an 'interactional/ability (IA)' perspective. This approach focuses on the characteristics of communicative language use, the ability to communicate through the learners' language ability, which is measured by constructs of validity. Bachman and Palmer
(1996, p. 18, cited in Fulcher & Davidson, 2007, p. 15) use 'usefulness' and 'construct validity' interchangeably. In light of this, they define authenticity as 'the relationship between test task characteristics and the characteristics of tasks in the real world'. However, the difficulty with extrapolating constructs lies in identifying and 'finding suitable criterion measures' and linking them to language proficiency (Alderson, Wall and Clapham, 1991, p. 209).

The authenticity of testing is embedded within the test design. Spence-Brown (2001, p. 463) emphasises that authenticity lies in the implementation of the activity, not the test design; Bachman (1990, p. 300), however, locates authenticity in the recreation of language use through testing. In other words, to achieve authenticity, test design is an essential factor. To evaluate a test, its constructs need to be verified against the learning outcomes. All in all, construct validity is operationalised through the test items.

In a study conducted by Wu and Stansfield (2001, pp. 188-189), which followed a model originally created by Bachman and Palmer (1996) for English for Specific Purposes (ESP), authenticity of the task was realized through a systematic taxonomy of task classification. A structured verification procedure was put in place to ensure that task authenticity was achieved by using authentic 'primary resource' materials. The authors claim that the authenticity of any test depends upon the test's purpose. This is a logical perception, for if we do not have a purpose, how do we identify what it is we are testing? Shohamy (1985, p. 25) also agrees that in order for authenticity to be achieved, the purpose of the test needs to be clearly identified.
However, this alone cannot be claimed as an application of authenticity through the test's constructs. For instance, 'selected response', introduced by Brown and Hudson (1998, p. 658), is a method by which fill-in-the-blank and true-false test items are designed in accordance with the test's purpose, and consequently the test is objectively scored. Even so, this type of testing, useful as it is, is neither derived from authentic interaction nor a reproduction of real-world tasks. In other words, we are unable to assess the learners' communicative competence: a learner may be able to write, yet be orally inept. Besides, such test items involve a high degree of guessing and, occasionally, ambiguity, which in turn reflects upon the accuracy of scoring. By contrast, 'constructed response' is a method of assessment that allows room for the learner's creativity and engagement in simulated situations close to real life, with activities such as performance tasks and open-ended questions
(Brown and Hudson, 1998, p. 661). Even though some may argue that such tasks suffer from subjective scoring as well as undefined constructs, these disadvantages may perhaps be overcome by creating a variety of test items that cover all the objectives set out as the purpose of the test, which will also achieve comprehensiveness in test design. We would add that rating scales with specified expectancies may be designed to overcome subjective scoring.

Furthermore, Weir (2005, p. 14) indicates that, to achieve accurate validity, we need to define the construct to be measured, and the precise procedure for measuring it, explicitly before designing the test. Because a construct is a psychological trait which operates in the mind, it needs to be interpreted with great care (Brown, 1996, pp. 239-240). Through accurately designing a test's constructs, authenticity may be partially achieved, as the constructs represent both the purpose of the test and the backbone of its design. As initially stated in this paper, in order for authenticity in tests to be achieved, the constructs must possess crystal-clear objectives, and those objectives must be measured to obtain the level of proficiency. That is to say, constructs should be quantified in measurable terms (Fulcher & Davidson, 2007, p. 7). For instance, in order to assess speaking, the examiner must formulate a scale in which an array of constructs measures the learners' proficiency level. As a result, there will be a direct correlation between the test's constructs and its design.

From another perspective, Messick (1995, p. 742) defines construct validity as an interpretation of test scores, which indeed complements Weir's definition: if the construct represents the test's purpose, then it may well result in accurate scoring of the intended construct. It is essential to mention here that construct validity is an approach usually chosen when there is a theory to be measured (Raatz, 1985, p. 62). However, the theory should not dominate the test design but rather be shaped around its rationale.

Messick (1995, p. 745) identifies six aspects of construct validity; for the purposes of this paper, only three will be rationalised, as these bear most directly on the authenticity of test design. First, the content aspect: content is bound by specified domains of knowledge, such as attitudes and skills, which should feature in performance tasks (Messick, 1995, p. 745). Content should include all domains, including those of 'functional importance', representing real-life-like tasks wherein the participants' responses are genuine and natural. Moreover, in reference to the proposed definition of authenticity, if primary resource materials are used in test tasks, then the likelihood of achieving authentic responses will inevitably rise. At a micro level this cannot always be achieved; one may call it a hit-or-miss affair. Nevertheless, it can be approached through continuous testing. Accordingly, Spence-Brown (2001, p. 479) suggests that authenticity should not be considered an 'absolute' value, but rather a 'continuum'. Thus, no matter how much a task may resemble reality, it is unrealistic to expect that responses will resemble real-life responses each and every time. It is worth recalling here that the definition of authenticity proposed at the beginning of this paper described it as an ongoing process.

Moreover, the content of an assessment task should be extracted from authentic materials, such as newspaper articles, to assess reading and writing. Learners may read the article, debate it, and then write a summary (integrative testing). This has been practised by the author in Dublin, Ireland, with refugee learners at Integrate Ireland Language and Training (IILT), where the Common European Framework of Reference for Languages (CEFR) was put into practice. Communicative language activities were practised as designed by the framework: 'reception, production, interaction and mediation' (Little, 2006, p. 168). One of the enduring advantages of integrative testing is, as Hwang (2005, p. 3) asserts, that authentic materials generate a successful natural flow of language acquisition. Undeniably, such an approach fosters communicative competence. Indeed, integrative testing techniques such as dictation have been widely supported: in a study conducted by Rahimi (2008, p. 45), it was found that both listening comprehension and grammatical skills improved with the use of activities such as dictation.

Another fundamental feature Messick (1995, pp. 745-749) mentions is the 'structural' aspect: in simpler words, to what extent does the scoring reflect the task? There is a direct negative correlation between such scoring and the RL approach introduced by Bachman, an approach that is difficult or impossible to evaluate with precision. For the same reason the third aspect of construct validity, the 'consequential' aspect, is also imprecise. Messick suggests that consequential validity concerns the implications and outcomes of task scoring; that is to say, the implications and the test's use are linked to validation in a 'progressive matrix formation'. We may thus safely conclude that such an approach is a weak one, as it results in an array of contradictions.

Conclusion:


As proposed at the beginning of this paper, authenticity in test design cannot be fully realized; however, we may attempt to authenticate a test through a progressive continuum of testing, so that a single test is not the only source of assessment. In each test, specific constructs should be designed to measure the 'can dos'. Additionally, to come close to real-life situations and to elicit natural responses from learners, primary resource material should be intertwined with the learning process to produce the required register for assessment. A realistic approach should be taken to the application of authenticity, and this may be achieved through the definition proposed.
Reference list:
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L. F. (1991). What does language testing have to offer? TESOL Quarterly, 25(4), 671-704.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall Regents.
Brown, J. D., & Hudson, T. (1998). The alternatives in language assessment. TESOL Quarterly, 32(4), 653-675.
Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999). Dictionary of language testing. Cambridge: Cambridge University Press.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. London: Routledge.
Hwang, C. C. (2005). Effective EFL education through popular authentic materials. Asian EFL Journal, 7(1), 1-12.
Lewkowicz, J. A. (2000). Authenticity in language testing: Some outstanding questions. Language Testing, 17(1), 43.
Little, D. (2006). The Common European Framework of Reference for Languages: Content, purpose, origin, reception and impact. Language Teaching, 39(3), 167-190.
McNamara, T. F. (1996). Measuring second language performance. New York: Addison Wesley Longman.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 742-749.
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 243-244.
Raatz, U. (1985). Better theory for better tests? Language Testing, 2(1), 60.
Rahimi, M. (2008). Using dictation to improve language proficiency. The Asian EFL Journal Quarterly, 10(1), 33.
Shohamy, E., & Reves, T. (1985). Authentic language tests: Where from and where to? Language Testing, 2(1), 48.
Spence-Brown, R. (2001). The eye of the beholder: Authenticity in an embedded assessment task. Language Testing, 18(4), 463.
Spolsky, B. (1985). The limits of authenticity in language testing. Language Testing, 2(1), 31.
Stevenson, D. K. (1985). Authenticity, validity and a tea party. Language Testing, 2(1), 41.
Wall, D., Clapham, C., & Alderson, J. C. (1991). Validating tests in difficult circumstances. In J. C. Alderson & B. North (Eds.), Language testing in the 1990s (pp. 209-225). London: Macmillan.
Weir, C. J. (2005). Limitations of the Common European Framework for developing comparable examinations and tests. Language Testing, 22(3), 281.
Wiggins, G. P., & McTighe, J. (2005). Understanding by design. Alexandria, VA: Association for Supervision and Curriculum Development.
Wu, W. M., & Stansfield, C. W. (2001). Towards authenticity of task in test development. Language Testing, 18(2), 187.

