Language Testing

Liu Jianda
Syllabus
It is expected that, by the end of this module, participants should be
able to do the following:
Understand the general considerations that must be addressed in the
development of new tests or the selection of existing language tests;
Make their own judgements and decisions about either selecting an
existing language test or developing a new language test;
Familiarise themselves with the fundamental issues, approaches, and
methods used in measurement and evaluation;
Design, develop, evaluate and use language tests in ways that are
appropriate for a given purpose, context, and group of test takers;
Understand the future development of language testing and the
application of IT to computerized language testing.
Syllabus
In order to achieve these objectives, the module gives participants the opportunity
to develop the following skills:
writing test items
collecting test data and conducting item analysis
evaluating language tests with regard to validity and reliability

This is done by considering a wide range of issues and topics related to language
testing. These include the following:
General concepts in language testing and evaluation
Evaluation of a language test: reliability and validity
Communicative approach to language testing
Design of a language test
Item writing and item analysis
Interpreting test results
Item response theory and its applications
Computerized language testing and its future development
Class Schedule
Sessions 1-12:
Basic concepts in language testing
Test validation: reliability and validity (1)
Test validation: reliability and validity (2)
Test construction (1)
Test construction (4)
Rasch analysis (1)
Rasch analysis (2)
Language testing and modern technology
Assessment
One 5000-6000 word paper on language
testing
Collaborative work:
You'll be divided into groups of four to complete
the development of a test paper. Each of you
will be responsible for one part of the test
paper, and each part should contribute
equally to the whole test paper. Therefore,
besides developing your part, you need to
come together to discuss the whole test
paper in terms of reliability and validity.
Course books
Bachman, L. F. & Palmer, A. (1996). Language Testing in
Practice. Oxford: Oxford University Press.
Brown, J. D. (1996). Testing in Language Programs. Upper
Saddle River, NJ: Prentice Hall Regents.
Li, X. (1997). The Science and Art of Language Testing.
Changsha: Hunan Educational Press.
McNamara, T. (1996). Measuring Second Language
Performance. London; New York: Longman.
Website:
http://www.clal.org.cn/personal/testing/Leeds
Session 1
Basic concepts in language testing
1. A short history of language testing
Spolsky (1978) classified the development
of language testing into three periods, or
trends:
the prescientific period
the psychometric/structuralist period
the integrative/sociolinguistic period.
The prescientific period
grammar-translation approaches to
language teaching
translation and free composition tests
difficult to score objectively
no statistical techniques applied to
validate the tests
simple, but unfair to students
The psychometric-structuralist period
audio-lingual and related teaching methods

objectivity, reliability, and validity of tests
considered
measure discrete structure points
multiple-choice format (standardized tests)
follow scientific principles, have trained
linguists and language testers
The integrative-sociolinguistic period

communicative competence
Chomsky's (1965) distinction between competence and performance
Competence: an ideal speaker-listener's knowledge of the rules of the
language;
Performance: the actual use of language in concrete situations
Hymes's (1972) proposal of communicative competence
the ability of native speakers to use their language in ways that are not only
linguistically accurate but also socially appropriate.
Canale & Swain's (1980) framework of communicative competence:
Grammatical competence, mastery of the language code such as morphology,
lexis, syntax, semantics, phonology;
Sociolinguistic competence, mastery of appropriate language use in different
sociolinguistic contexts;
Discourse competence, mastery of how to achieve coherence and cohesion in
spoken and written communication;
Strategic competence, mastery of communication strategies used to
compensate for breakdowns in communication and to enhance the
effectiveness of communication.

Bachman's (1990) framework of communicative
language ability:
Language competence: grammatical, sociolinguistic, and
discourse competence (Canale & Swain):
organizational competence: grammatical competence, textual competence
pragmatic competence: illocutionary competence, sociolinguistic competence
Strategic competence: performs assessment, planning, and
execution functions in determining the most effective means
of achieving a communicative goal
Psychophysiological mechanisms: characterize the channel
(auditory, visual) and mode (receptive, productive)
Oller's (1979) pragmatic proficiency test:
Temporally and sequentially consistent with the
real-world occurrences of language forms
Linking to a meaningful extralinguistic context
familiar to the testees
Clark's (1978) direct assessment:
approximating the testing context to the real world
to the greatest extent possible
Cloze test and dictation (Yang, 2002b)
Communicative testing or to test
communicatively
Performance tests (Brown, Hudson, Norris, &
Bonk, 2002; Norris, 1998)
Not discrete-point in nature
Integrating two or more of the language skills of
listening, speaking, reading, writing, and other
aspects like cohesion and coherence,
suprasegmentals, paralinguistics, kinesics,
pragmatics, and culture
Task-based: essays, interviews, extensive reading
tasks
Performance Tests
Three characteristics:
The task should:
be based on needs analysis (What criteria should be
used? What content and context? How should experts
be used?)
be as authentic as possible with the goal of measuring
real-world activities
sometimes have collaborative elements that stimulate
communicative interactions
be contextualized and complex
integrate skills with content
be appropriate in terms of number, timing, and
frequency of assessment
be generally non-intrusive, that is, be aligned with the
daily actions in the language classroom
Performance Tests
Raters should be appropriate in terms of:
number of raters
overall expertise
familiarity and training in use of the scale
The rating scale should be based on appropriate:
categories of language learning and development
breadth of information regarding learner
performance abilities
standards that are both authentic and clear to students
To enhance the reliability and validity of decisions as well as
accountability, performance assessments should be combined with
other methods for gathering information (e.g. self-assessments,
portfolios, conferences, classroom behaviors, and so forth)
Development graph (Li, 1997: 5)
2. Theoretical issues
Language testing is concerned with
both content and methodology.
Development since 1990
Communicative language testing
(Weir, 1990)
Reliability and validity
Social functions of language testing
Ethical language testing

Washback (impact) (Qi, 2002; Wall, 1997)
impact: effects of tests on individuals, policies or practices within the classroom,
the school, the educational system or society as a whole
washback: effects of tests on language teaching and learning
Ways of investigating washback:
analyses of test results
teachers' and students' accounts of what takes place in the classroom (questionnaires and
interviews)
classroom observation
Ethics of test use
use with care (Spolsky, 1981: 20)
codes of practice
Professionalization of the field
training of professionals
development of standards of practice and mechanisms for their implementation and
enforcement
Critical language testing
situate language testing in its social context
Factors affecting performance of examinees (figure): the TEST SCORE is shown as
influenced by communicative language ability, test method facets, personal
attributes, and random factors.

Testing interlanguage pragmatic knowledge
currently at the research level
focus on method validation
web-based test by Roever
Computerized language testing
Item banking
Computer-assisted language testing
Computerized adaptive language testing
Test items adapted for individuals
Test ends when the examinee's ability is determined
Test time is much shorter
Web-based testing
Phonepass testing
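The adaptive idea listed above (items chosen for the individual, and the test stopping once ability is pinned down) can be sketched in a few lines. This is only an illustration under stated assumptions: items follow the dichotomous Rasch model, ability is estimated as a posterior mean over a grid with a standard normal prior, and the test stops when the posterior standard deviation falls below a target. The item bank, the stopping values, and the simulated examinee are all hypothetical.

```python
import math

def p_correct(theta, b):
    """Rasch probability that ability theta answers an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

GRID = [g / 10.0 for g in range(-40, 41)]  # ability grid, -4 to +4 logits

def eap_estimate(responses):
    """Posterior mean and SD of ability for (difficulty, score) pairs, N(0,1) prior."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalised prior density
        for b, score in responses:
            p = p_correct(theta, b)
            w *= p if score == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, se_target=0.6, max_items=20):
    """Administer items one at a time until the ability estimate is precise enough."""
    remaining = list(bank)
    responses = []
    theta, se = 0.0, 1.0
    while remaining and len(responses) < max_items and se > se_target:
        # Under the Rasch model the most informative item is the one whose
        # difficulty is closest to the current ability estimate.
        item = min(remaining, key=lambda b: abs(b - theta))
        remaining.remove(item)
        responses.append((item, answer(item)))
        theta, se = eap_estimate(responses)
    return theta, se, len(responses)

# Hypothetical bank of 25 item difficulties from -3 to +3 logits.
BANK = [k / 4.0 for k in range(-12, 13)]

def simulated_answer(difficulty):
    """Hypothetical examinee who answers every item easier than 1.0 correctly."""
    return 1 if difficulty < 1.0 else 0

theta, se, used = run_cat(BANK, simulated_answer)
print(f"ability estimate {theta:.2f}, SE {se:.2f}, items used {used} of {len(BANK)}")
```

Because each item is targeted at the current estimate, the loop usually stops well before the bank is exhausted, which is why adaptive tests tend to be shorter than fixed-form tests.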

Language testing and second
language acquisition (Bachman &
Cohen, 1998)
Help to define the construct of language
ability
Use findings of language testing to test
hypotheses in SLA
Provide SLA researchers with tests and
standards of testing
Development of research methodology

Factor analysis
The main applications of factor analytic
techniques are:
(1) to reduce the number of variables and
(2) to detect structure in the relationships
between variables, that is, to classify variables.
Therefore, factor analysis is applied as a
data reduction or structure detection
method
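Both uses can be illustrated with a small sketch. As a simplifying assumption it uses principal component extraction on a correlation matrix (a common first step, not a full factor analysis with rotation), computed by power iteration so that only the standard library is needed. The correlation values and the idea of four subtests are invented for illustration.

```python
import math

def mat_vec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def leading_component(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    vector = [float(i + 1) for i in range(len(matrix))]
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    eigenvalue = sum(v * w for v, w in zip(vector, mat_vec(matrix, vector)))
    return eigenvalue, vector

def principal_components(corr, n_components):
    """Return (eigenvalue, loadings) pairs; loadings = sqrt(eigenvalue) * eigenvector."""
    size = len(corr)
    residual = [row[:] for row in corr]
    components = []
    for _ in range(n_components):
        eigenvalue, vector = leading_component(residual)
        components.append((eigenvalue, [math.sqrt(eigenvalue) * v for v in vector]))
        # Deflate: remove the variance already explained by this component.
        residual = [[residual[i][j] - eigenvalue * vector[i] * vector[j]
                     for j in range(size)] for i in range(size)]
    return components

# Hypothetical correlations among four subtests: the first two correlate
# strongly with each other, as do the last two.
CORR = [
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
]

components = principal_components(CORR, 2)
explained = sum(eigenvalue for eigenvalue, _ in components) / len(CORR)
print(f"share of variance kept by two components: {explained:.1%}")
for eigenvalue, loadings in components:
    print(f"eigenvalue {eigenvalue:.3f}, loadings {[round(x, 2) for x in loadings]}")
```

Keeping two components instead of four variables is the data reduction. The signs of the loadings on the second, unrotated component split the variables into two groups, which is the structure detection; the overall sign of a component is arbitrary.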
Generalizability theory (Bachman, 1997;
Bachman, Lynch, & Mason, 1995)
Estimating the relative effects of different
factors on test scores (facets)
The most generalizable indicator of an
individual's language ability is the universe
score. In the real world, however, we can only
obtain scores from a limited sample of
measures, so we need to estimate the
dependability of a given observed score as an
estimate of the universe score.
Two stages are involved in applying G-theory to
test development
G-study
The purpose is to estimate the effects of the various
facets in the measurement procedure (usually
conducted in pretesting).
e.g. persons (differences in individuals' speaking ability),
raters (differences in severity among raters), tasks
(differences in difficulty of tasks);
two-way interactions:
task x rater: different raters rate the different tasks
differently
person x task: some tasks are differentially difficult for
different groups of test takers (source of bias)
person x rater: some raters score the performance of
different groups of test takers differently (indication of rater
bias)
Two stages are involved in applying G-theory to test development
D-study
The purpose is to design an optimal measure for
the interpretations or decisions that are to be made
on the basis of the test scores (estimation of
dependability).
The generalizability coefficient (G coefficient) provides
an estimate of the proportion of an individual's
observed score that can be attributed to his or her
universe score, taking into consideration the effects
of the different conditions of measurement
specified in the universe of generalization. However, it is
appropriate only for norm-referenced tests.
For criterion-referenced tests, use the phi coefficient.
(GENOVA)
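The two stages can be sketched for the simplest case. This sketch assumes a fully crossed persons x raters design with one rating per cell; it uses the standard mean-square estimators for the variance components and is not a reproduction of GENOVA. The ratings and function names are hypothetical.

```python
def variance_components(scores):
    """G-study: scores[p][r] is the rating of person p by rater r (no missing data)."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]
    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pr = ss_total - ss_p - ss_r  # interaction confounded with error
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    var_pr = ms_pr
    # Negative estimates are floored at zero, a common convention.
    var_p = max((ms_p - ms_pr) / n_r, 0.0)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)
    return var_p, var_r, var_pr

def d_study(var_p, var_r, var_pr, n_raters):
    """D-study: G (relative) and phi (absolute) coefficients for n_raters raters."""
    g = var_p / (var_p + var_pr / n_raters)
    phi = var_p / (var_p + (var_r + var_pr) / n_raters)
    return g, phi

# Hypothetical speaking ratings: 5 test takers (rows) by 3 raters (columns).
RATINGS = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 4, 2],
    [1, 2, 1],
]

var_p, var_r, var_pr = variance_components(RATINGS)
for n in (1, 2, 3):
    g, phi = d_study(var_p, var_r, var_pr, n)
    print(f"{n} rater(s): G = {g:.3f}, phi = {phi:.3f}")
```

Phi is never larger than G because it also counts differences in rater severity as error, which matters when scores are compared with a fixed criterion rather than with other test takers.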
Item response theory (Rasch model)
It enables us to estimate the statistical
properties of items and the abilities of
test takers so that these are not
dependent upon a particular group of test
takers or a particular form of a test. It is
widely used in large-scale standardized
tests.
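A minimal sketch of the dichotomous Rasch model may make this concrete. The probability that a test taker of ability theta answers an item of difficulty b correctly is exp(theta - b) / (1 + exp(theta - b)), with ability and difficulty on the same logit scale. The function names and the example values are illustrative assumptions.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_odds(probability):
    return math.log(probability / (1.0 - probability))

# Hypothetical abilities and item difficulties, in logits.
for ability in (-1.0, 0.0, 1.0):
    row = [rasch_probability(ability, b) for b in (-0.5, 0.0, 0.5)]
    print(f"ability {ability:+.1f}: " + "  ".join(f"{p:.3f}" for p in row))

# The gap in log-odds between two test takers is the same on every item,
# which is the sense in which person comparisons do not depend on the items.
gap_easy = log_odds(rasch_probability(1.0, -0.5)) - log_odds(rasch_probability(0.0, -0.5))
gap_hard = log_odds(rasch_probability(1.0, 0.5)) - log_odds(rasch_probability(0.0, 0.5))
print(f"log-odds gap on easy item {gap_easy:.3f}, on hard item {gap_hard:.3f}")
```

When ability equals difficulty the probability is exactly 0.5, so an item's difficulty can be read as the ability level at which a test taker has an even chance of success.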
Structural equation modeling (Kunnan, 1998)
A combination of multiple regression,
path analysis and factor analysis
Attempts to explain a correlation or a
covariance data matrix derived from a
set of observed variables; latent
variables are assumed to be responsible for the
covariance among the measured
variables.
Basic procedures in SEM (Example from
Purpura, 1998)
Examine the relationships between strategy use and
second language test performance.
Design two questionnaires for cognitive strategies
and metacognitive strategies (40 items)
Ask respondents to answer the questionnaires
Respondents take a foreign language test
Cluster the 40 items to measure several variables
Compute the reliability of the variables
Conduct factor analysis to identify factors
Conduct SEM analysis (AMOS, EQS, LISREL)
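The reliability step in the procedure above is often reported as Cronbach's alpha. The slides do not say which index was used in the Purpura (1998) example, so alpha is an assumption here, and the questionnaire responses below are invented.

```python
def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def cronbach_alpha(responses):
    """responses[i][j]: answer of respondent i to item j of a single scale."""
    k = len(responses[0])
    item_variances = [sample_variance([row[j] for row in responses]) for j in range(k)]
    total_variance = sample_variance([sum(row) for row in responses])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)

# Hypothetical answers of five respondents to four Likert items
# that are meant to measure the same strategy variable.
RESPONSES = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 2, 1],
]

alpha = cronbach_alpha(RESPONSES)
print(f"Cronbach's alpha = {alpha:.3f}")
```

Alpha rises when the items in a cluster vary together, so a low value for one cluster suggests its items may not be measuring a single variable.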
Qualitative method
Verbal report (think-aloud,
introspective)
Observation
Questionnaires and interviews
Discourse analysis
3. Classification of language tests
According to families
Norm-referenced tests
Criterion-referenced tests
Norm-referenced tests
Measure global language abilities (e.g.
listening, reading, speaking, writing)
Score on a test is interpreted relative to
the scores of all other students who
took the test
Normal distribution
http://stat-www.berkeley.edu/~stark/Java/NormHiLite.htm
Students know the format of the test
but do not know what specific content
or skill will be tested
A few relatively long subtests with a
variety of question contents
Criterion-referenced tests
Measure well-defined and fairly specific objectives
Interpretation of scores is considered absolute
without referring to other students' scores
Distribution of scores need not be normal
Students know in advance what types of questions,
tasks, and content to expect for the test
A series of short, well-defined subtests with similar
question contents
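The difference between the two families shows up in how one raw score is read. The sketch below interprets the same score first against the group (z-score and percentile rank) and then against a fixed standard (percentage and a cut score). The group scores, the 100-point maximum, and the 80% cut score are hypothetical.

```python
from statistics import mean, pstdev

def norm_referenced(score, group_scores):
    """z-score and percentile rank of a score within the group that took the test."""
    z = (score - mean(group_scores)) / pstdev(group_scores)
    below = sum(1 for s in group_scores if s < score)
    percentile = 100.0 * below / len(group_scores)
    return z, percentile

def criterion_referenced(score, max_score, cut_percent):
    """Percentage of the content mastered and whether it meets the cut score."""
    percent = 100.0 * score / max_score
    return percent, percent >= cut_percent

# Hypothetical raw scores of eight students on a 100-point test.
GROUP = [52, 60, 65, 70, 71, 75, 80, 88]

z, percentile = norm_referenced(75, GROUP)
percent, mastered = criterion_referenced(75, max_score=100, cut_percent=80.0)
print(f"norm-referenced: z = {z:.2f}, percentile rank = {percentile:.1f}")
print(f"criterion-referenced: {percent:.0f}% of content, mastered = {mastered}")
```

In this invented example the student is above the group average yet below the mastery standard, which is why the two interpretations support different decisions.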
According to decision purposes
Proficiency tests
Placement tests
Achievement tests
Diagnostic tests
Proficiency tests
Test students' general levels of language
proficiency
The test must provide scores that form a
wide distribution so that interpretations of the
differences among students will be as fair as
possible
Can dramatically affect students' lives, so
slipshod decision making in this area would
be particularly unprofessional
Placement tests
Group students of similar ability levels
(homogeneous ability levels)
Help decide what each student's
appropriate level will be within a
specific program
Right tests for right purposes
Achievement tests
About the amount of learning that students have
done
The decision may involve who will be advanced to the
next level of study or which students should graduate
Must be designed with specific reference to a
particular course
Criterion-referenced, conducted at the end of the
program
Used to make decisions about students' levels of
learning; can also be used to effect curriculum
changes and to test those changes continually
against the program realities
Diagnostic tests
Aimed at fostering achievement by promoting strengths and
eliminating the weaknesses of individual students
Require more detailed information about the very specific areas
in which students have strengths and weaknesses
Criterion-referenced, conducted at the beginning or in the
middle of a language course
The same test can be diagnostic at the beginning or in the middle
and an achievement test at the end
Perhaps the most effective use of a diagnostic test is to report
the performance level on each objective (in a percentage) to
each student so that he or she can decide how and where to
invest time and energy most profitably
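The per-objective report described above amounts to grouping items by the objective they test and converting each group to a percentage. A minimal sketch follows; the objective labels and item scores are hypothetical.

```python
def objective_report(item_objectives, item_scores):
    """Percentage correct per objective; scores are 1 (correct) or 0 (incorrect)."""
    totals = {}
    for objective, score in zip(item_objectives, item_scores):
        correct, count = totals.get(objective, (0, 0))
        totals[objective] = (correct + score, count + 1)
    return {objective: 100.0 * correct / count
            for objective, (correct, count) in totals.items()}

# Hypothetical six-item diagnostic quiz and one student's item scores.
OBJECTIVES = ["past tense", "past tense", "articles", "articles", "articles", "word order"]
SCORES = [1, 1, 0, 1, 0, 1]

report = objective_report(OBJECTIVES, SCORES)
for objective, percent in sorted(report.items(), key=lambda pair: pair[1]):
    print(f"{objective}: {percent:.0f}%")
```

Sorting from the lowest percentage upward puts the objectives that need the most attention at the top of the student's report.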
Formative assessment vs. summative
assessment
Formative: a judgment of an ongoing program used
to provide information for program review,
identification of the effectiveness of the instructional
process, and the assessment of the teaching process
Summative: a terminal evaluation employed in the
general assessment of the degree to which the larger
outcomes have been obtained over a substantial part
of or all of a course. It is used in determining
whether or not the learner has achieved the ultimate
objectives for instruction which were set up in
advance of the instruction.
Public examinations vs.
classroom tests
Purpose: proficiency vs. achievement
(placement, diagnostic)
Format: standardized vs. open
(objective vs. subjective)
Scale: large-scale vs. small-scale (self-assessment)
Scores: normality, backwash
