Week 2 Lta 200913

WEEK 2
Fundamentals of Testing & Assessment

1. Validity
2. Reliability
3. Practicability
4. Backwash/Washback
TE30503 LTA, Dr Lee
Introduction
In this lecture, we will be looking at a number
of terms that are central to
language testing and assessment.
Familiarize yourself with all these terms quickly, as
we will be using them regularly
throughout the course.

CRITERIA OF A GOOD LANGUAGE TEST

Is there a perfect language test?
A perfect language test, if such a thing
could be found, would satisfy all of the
following criteria.
In practice, we have to achieve a balance
between different criteria, particularly
between validity, reliability, and
practicability.
A. VALIDITY 1
Validity is arguably the most important criterion for the quality
of a test.
The term validity refers to whether or not the test measures
what it claims to measure.
A test with high validity will have items closely linked to the
test's intended focus.
Grant (1987: 89) defines validity as follows:
Validity in general refers to the appropriateness of a given test
or any of its component parts as a measure of what it is
purported to measure.
A test is said to be valid to the extent that it measures what it
is supposed to measure.
Validity 2
The validity of a test is critical because,
without sufficient validity, test scores have no
meaning.
There are several ways to estimate the validity
of a test, including content validity,
concurrent validity, predictive validity,
construct validity, and face validity.
1. Content Validity
A test is said to have content validity if its content
constitutes a representative sample of the language
skills, structures, etc. with which it is meant to be
concerned.
To judge content validity, we need a specification of the skills,
structures, etc. that the test is meant to cover.
A comparison of test specification and test content is the
basis for judgement of content validity.
Content validity is important because:
1. It helps the test provide an accurate measure of what it is supposed to measure.
2. It affects backwash: areas which are not tested are usually ignored
in teaching and learning.
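The comparison of test specification and test content can be sketched in a few lines of Python. This is only an illustration: the function name `content_coverage`, the skill labels, and the idea of tagging each item with one skill are assumptions made for the example, not part of any standard procedure.

```python
from collections import Counter


def content_coverage(specification, item_tags):
    """Compare a test specification with the areas the items actually test.

    specification: areas the test is meant to cover (hypothetical labels).
    item_tags: one label per test item, naming the area that item tests.
    """
    spec = set(specification)
    # Count how many items fall in each specified area.
    items_per_area = Counter(tag for tag in item_tags if tag in spec)
    untested = sorted(spec - set(items_per_area))   # specified but never tested
    off_spec = sorted(set(item_tags) - spec)        # tested but not specified
    coverage = len(items_per_area) / len(spec) if spec else 0.0
    return {
        "coverage": coverage,
        "untested": untested,
        "off_spec": off_spec,
        "items_per_area": dict(items_per_area),
    }


# Hypothetical grammar test: four specified areas, four items.
spec = ["past simple", "present perfect", "conditionals", "passive voice"]
items = ["past simple", "past simple", "conditionals", "articles"]
report = content_coverage(spec, items)
print(report)
```

Areas listed under `untested` are the ones at risk of being ignored in teaching and learning; items listed under `off_spec` test something the specification never asked for.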

Criterion-related validity:
Criterion-related validity refers to how far
results of the test agree with those
provided by some independent and highly
dependable assessment of the candidate's
ability.
Two kinds of criterion-related validity:
concurrent validity and predictive validity.
2. Concurrent Validity
Concurrent validity assesses how well a test agrees with
a concurrent assessment of a different type.
i.e., the extent to which the test provides results which are more
or less comparable with the results of other language
tests.
Concurrent validation involves the comparison of the test
scores with some other measure for the same candidates
taken at roughly the same time as the test.
This other measure may be scores from a parallel version
of the same test or some other test; the candidates'
self-assessments of their language abilities; or ratings of the
candidates on relevant dimensions by teachers, subject
specialists or other informants.

3. Predictive Validity
Predictive validity is a test's ability to predict how a
person will perform at a later date on a different
assessment of ability (performance in school or on a
job, for example).
i.e., the extent to which the results of the test accurately predict
the language performance of the candidates when they
use the language in the real world.
Predictive validation is most common with proficiency
tests: tests which are intended to predict how well
somebody will perform in the future.
The simplest form of predictive validation is to give
students a test, and then at some appropriate point in the
future give them another test of the ability the initial test
was intended to predict.
Criterion-related validity
The results of the assessment of predictive and concurrent
validity are expressed as correlation coefficients, and the
absolute minimum standard is .71.
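As a rough illustration of how such a coefficient is obtained, the sketch below computes a Pearson correlation between test scores and a criterion measure and compares it with the .71 threshold quoted above. The scores are invented for the example, and `pearson_r` is a hand-written helper, not a library function.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return sxy / math.sqrt(sxx * syy)


# Invented data: eight candidates' test scores and teacher ratings
# collected at roughly the same time (a concurrent criterion).
test_scores = [52, 61, 47, 70, 58, 66, 43, 75]
teacher_ratings = [55, 64, 50, 68, 54, 70, 48, 72]

r = pearson_r(test_scores, teacher_ratings)
meets_minimum = r >= 0.71  # threshold taken from the slide above
print(f"validity coefficient r = {r:.2f}, meets minimum: {meets_minimum}")
```

For predictive validation the same calculation applies; the criterion scores are simply collected at a later date.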
Often evidence of predictive validity is impossible to obtain.
For example, university entrance examinations notoriously fail
to predict success in university.
The reason, though, is not necessarily the inadequacy of the
examinations but the inadequacy of the sample.
When you compare scores on the entrance examination with
success in university, you're looking at the success only of the
highest-scoring students on the entrance examination.
If people with middling or low scores on the entrance
examination were admitted to university then you could well
find a relationship between the entrance examination results
and success in university.
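The sampling problem described above (often called restriction of range) can be shown with a small simulation. Everything here is an assumption made for illustration: the scores are randomly generated, the true correlation of 0.6 and the 20% admission rate are arbitrary, and `simulate_admissions` is a hypothetical helper.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_admissions(n=5000, true_r=0.6, top_fraction=0.2, seed=0):
    """Return (r for all applicants, r for admitted applicants only)."""
    rng = random.Random(seed)
    applicants = []
    for _ in range(n):
        entrance = rng.gauss(0, 1)
        # University success is built to correlate with the entrance score.
        noise = rng.gauss(0, 1)
        success = true_r * entrance + math.sqrt(1 - true_r ** 2) * noise
        applicants.append((entrance, success))
    full_r = pearson_r([e for e, _ in applicants], [s for _, s in applicants])
    # Admit only the highest-scoring applicants on the entrance examination.
    ranked = sorted(applicants, reverse=True)
    admitted = ranked[: max(2, int(n * top_fraction))]
    restricted_r = pearson_r([e for e, _ in admitted], [s for _, s in admitted])
    return full_r, restricted_r


full_r, restricted_r = simulate_admissions()
print(f"all applicants: r = {full_r:.2f}")
print(f"admitted only:  r = {restricted_r:.2f}")
```

The correlation among admitted students comes out clearly lower than the correlation across all applicants, even though the examination is equally predictive in both cases.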
4. Construct Validity
Construct refers to any underlying ability (or trait)
which is hypothesised in a theory of language ability.
Construct validity is the extent to which the test matches a
coherent view of the nature of language and the nature
of language learning.
Does the test really adopt the theory of
language/language learning which it claims to adopt?
A test is said to have construct validity if it is capable of
measuring certain specific characteristics in accordance
with a theory of language behaviour and learning.
The more direct the test, the better its construct
validity.
For example, if a communicative approach to language
teaching and learning has been adopted throughout a
course, a test comprising chiefly MCQ items (indirect test)
will lack construct validity.
5. Face Validity
A test is said to have face validity if it looks as if it
measures what it is supposed to measure.
If a test item looks right to other testers, teachers,
moderators, and testees, it can be described as having
face validity.
For example, a test which claimed to measure
pronunciation ability but which did not require the
candidate to speak might be thought to lack face validity.
Face validity is hardly a scientific concept, yet it is very
important.
A test which does not have face validity may not be
accepted by candidates, teachers, education authorities
or employers.
The use of validity
What use is the teacher to make of the notion of
validity?
First, every effort should be made in constructing
tests to ensure content validity.
Where possible, the tests should be validated
empirically against some criterion.
If indirect testing is used, reference should be made
to the research literature to confirm that
measurement of the relevant underlying constructs
has been demonstrated using the testing techniques
that are to be used.
B. RELIABILITY
Reliability is simply consistency.
i.e., the extent to which the test produces consistent
results if the same candidates take the test
on repeated occasions.
According to Alderson, Clapham and Wall
(1995, p. 6):
"Reliability is the extent to which test scores
are consistent: if candidates took the same
test again tomorrow after taking it today,
would they get the same result?"

Reliability 2
Another way to look at reliability is as the degree to
which test scores are free from measurement error
and are consistent from one occasion to another
(the degree of stability on repeated administrations
of the test).
Sources of measurement error, which include
fatigue, nervousness, content sampling, answering
mistakes, misinterpreting instructions and guessing,
contribute to an individual's score and lower a test's
reliability.
Reliability 3
The reliability of a test can be quantified in the form of
a reliability coefficient.
Reliability coefficients allow us to compare the reliability of
different tests.
The ideal reliability coefficient is 1.
A test with a reliability coefficient of 1 is one which would give
precisely the same results for a particular set of candidates
regardless of when it happened to be administered.
A test which had a reliability coefficient of zero would give
sets of results quite unconnected with each other.
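The two extremes can be illustrated with a small test-retest sketch. It assumes the coefficient is estimated as the Pearson correlation between two administrations, which is one common choice; the scores are invented so that the two cases come out at exactly 1 and 0.

```python
import math


def reliability_coefficient(first, second):
    """Test-retest estimate: Pearson correlation between two administrations."""
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    cov = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    var_1 = sum((a - mean_1) ** 2 for a in first)
    var_2 = sum((b - mean_2) ** 2 for b in second)
    return cov / math.sqrt(var_1 * var_2)


first_sitting = [48, 49, 50, 51, 52]

# Every candidate scores 3 points higher the second time:
# the ranking is preserved perfectly, so the coefficient is 1.
consistent_retest = [score + 3 for score in first_sitting]

# Second-sitting scores bear no linear relation to the first: coefficient 0.
unrelated_retest = [52, 49, 48, 49, 52]

r_consistent = reliability_coefficient(first_sitting, consistent_retest)
r_unrelated = reliability_coefficient(first_sitting, unrelated_retest)
print(round(r_consistent, 2), round(r_unrelated, 2))
```

Note that a uniform gain of 3 points still gives a coefficient of 1: the coefficient reflects how consistently candidates are ordered, not whether their raw scores are identical.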
Certain authors have suggested how high a reliability
coefficient we should expect for different types of language
tests.
Reliability 4
Lado (1961) says that good vocabulary,
structure and reading tests are usually in the
.90 to .99 range.
Auditory comprehension tests are more often
in the .80 to .89 range.
Oral production tests may be in the .70 to .79
range.
Reliability 5
Different types of reliability estimates should be used
to estimate the contributions of different sources of
measurement error.
Inter-rater reliability coefficients provide estimates
of errors due to inconsistencies in judgment
between raters.
Alternate-form reliability coefficients provide
estimates of the extent to which individuals can be
expected to rank the same on alternate forms of a
test.
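The slides do not name a particular statistic for inter-rater reliability. As one possible illustration, the sketch below uses Cohen's kappa, which measures agreement between two raters after allowing for the agreement expected by chance. The band labels and ratings are invented.

```python
from collections import Counter


def cohens_kappa(rater_1, rater_2):
    """Cohen's kappa for two raters who each assign one category per script."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("both raters must rate the same non-empty set of scripts")
    n = len(rater_1)
    # Proportion of scripts on which the raters gave the same band.
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    # Agreement expected by chance, from each rater's own band frequencies.
    counts_1 = Counter(rater_1)
    counts_2 = Counter(rater_2)
    bands = set(counts_1) | set(counts_2)
    expected = sum(counts_1[band] * counts_2[band] for band in bands) / n ** 2
    if expected == 1:
        return 1.0  # both raters used a single identical band throughout
    return (observed - expected) / (1 - expected)


# Invented data: two raters band the same eight compositions A, B or C.
rater_a = ["A", "B", "B", "C", "A", "C", "B", "A"]
rater_b = ["A", "B", "C", "C", "A", "C", "B", "B"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 6 of 8 scripts (75%), but because some of that agreement would occur by chance, kappa is lower than 0.75.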
Reliability 6
Reliability should be assessed at every administration of
the test.
Tests are used if they have been reliable in the past, but
they are only useful to you if they are reliable when you
use them.
Groups of people tested can differ in many important
ways, some of which can affect reliability.
It is not unusual to find that a highly acclaimed test fails
to live up to its history of reliability when you use it.
Usually this is not a reflection on the quality of the test
(or of you as a test administrator), but simply a reflection
of the facts of life: no test is appropriate for everyone.
For example, the reliability of many tests varies markedly
with the age of the people taking them.
How to make tests more reliable
1. Take enough samples of behaviour. Other
things being equal, the more items you have
on a test, the more reliable that test will be.
2. Do not allow candidates too much freedom.
Candidates should not be given a choice, and
the range over which possible answers might
vary should be restricted.

How to make tests more reliable 2
3. Write unambiguous items. Candidates should
not be presented with items whose meaning
is not clear or to which there is an acceptable
answer which the test writer has not anticipated.
Moderation can help to minimize this
problem.
4. Provide clear and explicit instructions to avoid
misinterpretation of what candidates are
asked to do.

5. Ensure that tests are well laid out and
perfectly legible. Avoid tests that are badly
typed (or handwritten), have too much text in
too small a space, or are poorly reproduced.
6. Candidates should be familiar with format
and testing techniques. Any aspect of a test
that is unfamiliar to candidates is likely
to affect their performance.

7. Provide uniform and non-distracting conditions of
administration.
8. Use items that permit scoring which is as objective
as possible.
9. Make comparisons between candidates as direct as
possible. For example, scoring compositions that are all
on one topic will be more reliable than if the
candidates are allowed to choose from six topics.
10. Provide a detailed scoring key.
11. Train scorers.

12. Agree acceptable responses and appropriate
scores at the outset of scoring.
13. Identify candidates by number, not name.
14. Employ multiple, independent scoring.
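Point 1 above (more items, more reliability) can be made concrete with the Spearman-Brown prophecy formula. The slides do not mention this formula; it is brought in here as an illustration, and it assumes that the added items are of the same kind and quality as the existing ones. With current reliability r and a test lengthened by a factor k, the predicted reliability is k*r / (1 + (k - 1)*r).

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    Assumes the added items are parallel to the existing ones.
    """
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical 25-item test with a reliability coefficient of .60.
current = 0.60
doubled = spearman_brown(current, 2)    # 50 items
tripled = spearman_brown(current, 3)    # 75 items
halved = spearman_brown(current, 0.5)   # shortened to about half the items

print(f"x2: {doubled:.2f}  x3: {tripled:.2f}  x0.5: {halved:.2f}")
```

The gains shrink as the test grows: doubling the length adds more reliability than the further step from double to triple, which is one reason test length has to be balanced against practicability.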

Reliability and Validity
To be valid a test must provide consistently accurate
measurements.
It must therefore be reliable.
A reliable test, however, may not be valid at all.
For example, as a writing test we might require
candidates to write down the translation equivalents of
500 words in their own language. This could well be a
reliable test; but it is unlikely to be a valid test of writing.
In making tests reliable, we must be wary of reducing their
validity, e.g. restricting the scope of what candidates are
permitted to write in a composition might diminish the
validity of the task.
There will always be some tension between reliability and
validity.
We need to balance gains in one against losses in the other.

C. PRACTICABILITY
Can the test be administered reasonably easily?
Can the test be administered without unreasonable
expenditure?
Can the test be administered in a reasonable
amount of time?
Can the test be marked reasonably easily?
Other things being equal, it is good that a test
should be easy and cheap to construct, administer,
score and interpret.
Practicability 2
Practicability is sometimes called 'logistics'.
Frankly, testers wish logistics would just go
away. It is a nuisance. We are annoyed at
the influence of features like staffing, cost,
space, time and other logistical concerns
that impinge on test development.

Practicability 3
A good way to conceive of this problem is as a
tension between 'fast', 'cheap' and 'good'.
The idea is that you can only have two of those
features: a test can be cheap and good, but it will
take forever to develop or re-engineer.
A test can be quick and good, but it will cost a
fortune because you will have to divert other
resources to the test.
Finally, a test can be 'fast' and 'cheap', but it will be
terrible: it will probably lack reliability and validity.
D. WASHBACK/BACKWASH EFFECT
What influence will the test have on
the teaching which takes place before
the test?
Will this influence be positive (i.e. will
it encourage good learning habits)?
Or will the influence be negative?

WASHBACK/BACKWASH EFFECT 2
Washback effect is powerful: it can be
beneficial or detrimental.
If we use a test to improve classroom
teaching, then the test is said to have positive
washback.
However, if the test has a negative effect on
teaching, it is said to have negative
washback.
How to achieve Beneficial Backwash
1. Test the abilities whose development you want to
encourage. If you want to encourage oral ability, then test
oral ability.
2. Sample widely and unpredictably. It is important that the
sample taken should represent as far as possible the full
scope of what is specified.
3. Use direct testing. Direct testing implies testing of
performance skills, with texts and tasks as authentic as
possible.
4. Make tests criterion-referenced. If test specifications make
clear just what candidates have to be able to do, and with
what degree of success, then students will have a clear
picture of what they have to do.
How to achieve Beneficial Backwash 2
5. Base achievement tests on objectives. If
achievement tests are based on objectives,
rather than on detailed teaching and textbook
content, they will provide a truer picture of
what has actually been achieved.
6. Ensure the test is known and understood by
students and teachers.
7. Where necessary, provide assistance to
teachers (especially when a new test is introduced).

Conclusion
We have looked at four fundamental terms associated
with LTA:
Validity
Reliability
Practicability
Backwash/Washback

