
System 28 (2000) 523–539

www.elsevier.com/locate/system

What to look for in ESL admission tests: Cambridge certificate exams, IELTS, and TOEFL
Micheline Chalhoub-Deville a,*, Carolyn E. Turner b
a Foreign Language & ESL Education, Division of Curriculum & Instruction, University of Iowa, N254 Lindquist Center, Iowa City, IA 52242-1529, USA
b Department of Second Language Education, McGill University, 3700 McTavish Street, Montreal, QC, Canada H3A 1Y2

Received 15 December 1999; received in revised form 1 February 2000; accepted 8 February 2000

Abstract

The article is intended to familiarize test-users with the issues they need to consider when employing assessments for screening/admission purposes. The article examines the purpose, content, and scoring methods of three English as a second language admission tests (the Cambridge certificate exams, the International English Language Testing System, and the Test of English as a Foreign Language computer-based test) and discusses reliability and validity considerations salient to each instrument. The validity and reliability discussion is guided by the "Standards for Educational and Psychological Testing" (1999). The article indicates that the scores obtained from these assessments are used to help make critical decisions that affect test-takers' lives. It is critical, therefore, that these scores provide high quality information. Developers of large-scale tests such as those reviewed in the present article have the responsibility to: construct instruments that meet professional standards; continue to investigate the properties of their instruments and the ensuing scores; and make test manuals, user guides and research documents available to the public. Test-users also have a responsibility. Test-users need to be cognizant of the properties of the instruments they employ and ensure appropriate interpretation and use of the test scores provided. Test-users need to carry out local investigations to make sure that their admission requirements are based on an informed analysis of their academic programs and the language ability score profiles necessary to succeed in these programs. © 2000 Elsevier Science Ltd. All rights reserved.

This paper is based on discussion presentations given by the authors at an institute entitled "Using English Screening Tests at Your Institute" at the annual meeting of TESOL, March 1999, in New York.
* Corresponding author. Tel.: +1-319-335-5606; fax: +1-319-335-5608. E-mail addresses: m-chalhoub-deville@uiowa.edu (M. Chalhoub-Deville), cx9x@musica.mcgill.ca (C.E. Turner).

Keywords: Examination; Teaching; Cambridge exams; Test-users; TOEFL; IELTS

1. Introduction

Historically, large-scale English as a second language (ESL) admission testing has been dominated by two test batteries: the Cambridge exams, sponsored by the University of Cambridge Local Examinations Syndicate (UCLES), and the Test of English as a Foreign Language (TOEFL), from Educational Testing Service (ETS). These two organizations differ in their ideologies and approaches to assessing the language ability of non-native speakers of English (Spolsky, 1995). UCLES tends to emphasize a close relationship between testing and teaching. The Cambridge exams have been constructed more like an achievement test, with strong links between the examination and teaching syllabi. The exams incorporate a variety of item types that reflect those used in instructional settings. The hallmark of TOEFL, on the other hand, is its psychometric qualities, with a strong emphasis on reliability. ETS adheres to a more psychometric approach to test construction, favoring objective (e.g. multiple-choice) items. In terms of instruction, ETS emphasizes dissociation from any particular instructional program. It markets the TOEFL as a proficiency test.

This article examines issues related to the three Cambridge certificate exams; the International English Language Testing System (IELTS), which is also operated by UCLES; and the TOEFL. Several publications have examined a variety of issues related to the instruments under investigation (e.g. Bachman et al., 1993, 1988; Spolsky, 1995). These articles, however, were intended primarily for language testing researchers. The present article addresses the needs of test-users who wish to enhance their knowledge about how to use these assessments for screening/admission purposes. As such, this article is more in line with the somewhat dated Alderson et al. (1987) publication, which provides descriptive and evaluative reviews of the major ESL tests in use around the world.

The present article familiarizes readers with the issues that need to be considered by those selecting and using ESL admission instruments. The review of the Cambridge certificate exams, IELTS, and TOEFL includes a summary of basic information (e.g. test purpose, length, format), a description of test content and scoring, and a discussion of various reliability and validity issues. Before embarking on the instrument review, a brief overview of the concepts of validity and reliability is provided.

2. Reliability and validity considerations

The "Standards for Educational and Psychological Testing" (AERA et al., 1999) is a widely recognized professional publication whose purpose is to provide criteria and guidelines to be observed by all participants involved in the testing process.

The 'Standards' describe reliability as the:

…degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable, and repeatable for an individual test taker; the degree to which scores are free of errors of measurement for a given group. (p. 180)

In other words, reliability refers to the degree to which test scores represent test-takers' true scores. Typically, when examining test reliability, issues such as the following are considered:

1. The degree to which the conditions under which the test is administered are conducive to optimal performance. In a second language test, any variable that affects test scores, other than the language ability being measured, is considered a potential source of measurement error. Errors of measurement can limit the reliability and generalizability of the scores obtained.
2. The psychometric properties, e.g. difficulty and discrimination indices of test tasks/items, and internal consistency of test tasks/items. In the case of objective types of items, internal consistency refers to the extent to which items measuring a particular aspect of the language construct intercorrelate with each other. In the case of open-ended types of items, consistency is often examined in terms of rater agreement in scoring.
3. Standard error of measurement (SEM), which summarizes fluctuations in scores due to various and to-be-expected imperfect measurement conditions; the SEM should be considered especially at or near cut-off/passing scores (a brief numerical sketch of the SEM is given at the end of this section).

In terms of validity, the 'Standards' (AERA et al., 1999, p. 9) state:

Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing and evaluating tests. The process of validation involves accumulating evidence to provide a sound scientific basis for the proposed score interpretations… When test scores are used or interpreted in more than one way, each intended interpretation must be validated.

This conceptualization of validity highlights the following points:

1. The validation process includes gathering evidence regarding the relevance and the representativeness of the content covered from the specified construct domain (Messick, 1989, 1996). Additionally, validation emphasizes theoretical arguments and empirical evidence to support test score interpretation and use.

2. Test purpose should guide the language test developer with respect to all aspects of test construction and research to support the meaning and use of the ensuing test scores.
3. The validation process is not a one-time activity but an ongoing process. Validation research emphasizes an ongoing and systematic research agenda that documents the properties and interpretations of test scores and provides evidence to support their use as specified in the test purpose.

The separate description of reliability and validity may give the impression that the two are distinct. This is not the case; the two are interrelated. For example, although the discussion below about cut-off scores and computer familiarity is addressed primarily in terms of reliability (the dependability and generalizability of scores), the discussion can as easily be said to focus on validity concerns (the appropriate representation of language proficiency and the use of test scores). In short, research documenting the quality of test results needs to assemble multiple sources of evidence that include various aspects of validity and reliability.

The salient points of reliability and validity considered above will serve as the main components for evaluating the Cambridge certificate exams, IELTS, and TOEFL. Given that what needs to be considered for any given instrument is complex and far-ranging, and in order not to sacrifice the depth of the arguments at the expense of breadth, the discussion will focus on a limited number of the salient issues for each instrument. Nevertheless, the cumulative information provided can assist readers who wish to better understand ESL admission tests.
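
To make the SEM notion above concrete, the following minimal sketch computes the classical-test-theory SEM from a score standard deviation and a reliability estimate and shows the band it implies around a cut-off score. The numbers are invented for illustration and are not drawn from any of the test manuals discussed in this article.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical-test-theory SEM: the typical score fluctuation expected
    from imperfect measurement, given the score SD and a reliability index."""
    return sd * math.sqrt(1.0 - reliability)

# Illustrative values only (not taken from any test manual).
sd, reliability, cut_off = 40.0, 0.90, 213.0
sem = standard_error_of_measurement(sd, reliability)
print(f"SEM = {sem:.1f}")

# A test-taker observed near the cut-off could plausibly have a true score
# anywhere within roughly one SEM of the observed score.
print(f"Scores within one SEM of the cut-off: {cut_off - sem:.0f} to {cut_off + sem:.0f}")
```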

3. The Cambridge certificate exams

3.1. Purpose

The three Cambridge certificate exams, namely the First Certificate in English (FCE), the Certificate in Advanced English (CAE), and the Certificate of Proficiency in English (CPE), are among the oldest and most established products in the ESL testing field. Exam developers maintain that success on these instruments can be considered proof of English language ability that satisfies entrance requirements at most British universities. The exams are also used to evaluate ESL for use in commerce and industry. The present review addresses the academic aspects of the exams only.

3.2. Content

The FCE, CAE, and CPE exams have five obligatory sections, called papers: Reading, Writing, English in Use, Listening, and Speaking. The exams vary in length and can take about 5–6 h. The Reading, Writing, and English in Use papers are administered in 1 day. Separate arrangements are made for Listening and Speaking.

The exams are the lengthiest of all the admission instruments mentioned in this article.

Each paper of the Cambridge certificate exams presents a variety of tasks with regard to both type of input and response type. For example, according to the "CPE Handbook" (UCLES, 1998a), the Listening section includes three tasks. The first task involves listening to a debate with a multiple-choice response format; the second task requires listening to a radio interview with sentence completion as the response format; and the third comprises listening to a discussion with a matching response format. Research such as that by Shohamy (1984) and Chalhoub-Deville (1995) supports including a variety of tasks and response types (see also Bachman, 1990). This research shows that diverse methods influence test-takers' performance differently. As such, test-method variation helps minimize the advantage one test-taker may have over another with respect to their different abilities in performing one particular method. This potential benefit, however, may be to the detriment of reliability across the tasks. UCLES also makes available test practice and preparation material to further reduce the test method effect. Texts in the Listening paper are played twice. Listening texts present a variety of accents; all three exams define "variety of accents" as "accents corresponding to standard variants of English native speaker accent, and to English non-native speaker accents that approximate to the norms of native speaker accents" (UCLES, 1998a, p. 40). These last two features of the exams, i.e. playing the text twice and the accent variety, are not always present in other admission tests such as TOEFL.

The Reading paper differs across the three certificate exams. The length of the reading texts ranges between 350 and 1200 words. Texts are obtained from various sources, e.g. literary books, journals, newspapers, magazines, etc. Tasks require test-takers to interact with the text in various ways: to focus on main points, detail, and text structure, recognize an attitude, make an inference, etc. Response type is limited to multiple-choice and multiple-matching.

The English in Use paper also differs across the FCE, CAE, and CPE. The purpose of this paper is to examine test-takers' knowledge and control of the formal elements of the language system. The paper includes task types such as multiple-choice, cloze, error correction, word formation and note expansion.

Across the three exams, the Writing paper requires test-takers to produce extended discourse. The "CAE Handbook" (UCLES, 1999a), for example, states that the CAE Writing paper includes two parts. The first part is compulsory. It requires test-takers to perform several tasks based on input materials such as texts and visuals. In the second part, test-takers have to select one of four tasks to perform. These tasks include writing an article, a report, an information leaflet, etc.

The Speaking paper involves a pair of test-takers along with two examiners. One examiner serves as an interlocutor and the other serves as an assessor. Tasks require each candidate to speak and call for varied interaction among the participants. Pair and group testing of speaking are welcome alternatives to the typical one-on-one interview that has been investigated in the literature (Shohamy et al., 1986; Fulcher, 1996).

3.3. Scoring method

The raw score for each of the five Cambridge exam papers is derived differently. Scoring is contingent on the type of item. For example, in the CAE (UCLES, 1999a) Reading paper, one mark is given for each correct answer to the multiple-matching tasks and two marks are given to the multiple-choice and gapped items; and in the Listening paper, each correct answer is given one mark. For the Writing and Speaking papers, impression marks with holistic band descriptions are used. The raw score or band level for each paper is then converted to a weighted score out of 40, resulting in a test total of 200 marks. Therefore, all papers are weighted equally. The total score is then translated into a passing letter grade (A, B or C) or a failing letter grade (E or F). Information is not provided on how raw scores and band levels are converted to weighted scores or how letter grades are derived from total scores.

3.4. Reliability

Rigorous work to document the psychometric properties of the Cambridge exams has not been emphasized by UCLES. Inadequate documentation of test reliability, as well as of evidence to support the equivalence of different test forms, is one of the problems that have been voiced with regard to the certificate exams (Alderson et al., 1987; Bachman et al., 1993). Such issues raise concerns about the quality and the fairness of the scores obtained. Internal UCLES reports indicate that considerable effort has been made in recent years to document the internal consistency of the objective tests. Also, UCLES manuals state that improvements have been made in terms of establishing equivalence across test forms through the use of item banks where items have identified psychometric properties. Nevertheless, published documents continue to report very little information about such analyses (Fulcher, 2000). Making such information more available is mandatory to help test-takers and test-users make informed evaluations of the quality of the tests and the ensuing scores.

Published information is also lacking with regard to the productive skills. UCLES reports mention that assessors for both the writing and oral tests are adequately trained and monitored. While such information is reassuring, it is not supported with research evidence. Information such as intra- and inter-rater reliability needs to be collected and published.

Cambridge exam developers need to maintain a continuous research agenda that investigates and reports to test-users various aspects of test reliability. At the very least, the research should provide:

1. separate reliability indices for each paper/section;
2. SEM indices, especially for the various pass/fail cut-off scores;
3. reliability indices for the various forms/versions of the tests;
4. information regarding the equivalency of different forms/versions of the tests; and
5. rater reliability indices for the subjective items/papers (simple examples of such indices are sketched below).
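
As an illustration of item 5, the sketch below computes two simple indices of rater consistency (exact agreement and a Pearson correlation) for a pair of raters. The band scores are invented, and operational testing programs would normally report more sophisticated analyses (e.g. generalizability or many-facet Rasch studies); this is only a minimal example of the kind of evidence that could be published.

```python
from statistics import mean

def exact_agreement(r1, r2):
    """Proportion of scripts to which the two raters assigned the same band."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def pearson(r1, r2):
    """Pearson correlation between two raters' scores (consistency, not strict agreement)."""
    m1, m2 = mean(r1), mean(r2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(r1, r2))
    var1 = sum((a - m1) ** 2 for a in r1)
    var2 = sum((b - m2) ** 2 for b in r2)
    return cov / (var1 * var2) ** 0.5

# Hypothetical band scores awarded by two raters to the same ten scripts.
rater_a = [3, 4, 4, 5, 2, 3, 4, 5, 3, 4]
rater_b = [3, 4, 5, 5, 2, 3, 3, 5, 4, 4]
print(f"exact agreement = {exact_agreement(rater_a, rater_b):.2f}")
print(f"inter-rater correlation = {pearson(rater_a, rater_b):.2f}")
```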

The results of these analyses should be reported regularly, and certainly any time new papers are created. Finally, it is important for UCLES to report basic descriptive information about the test-takers and their performance. To explicate, the "CAE Report" (UCLES, 1998b) includes the percentage of candidates at each overall grade. (Similar information is not reported for the other exams, however.) Such information is helpful but not sufficient. Information about the performance of candidates by variables such as language background, native country, academic level, age, gender, etc., needs to be reported. This information can then be used, for example, to complement cut-off scores considered for admission purposes (see the discussion of cut-off scores below). UCLES collects information on many of these variables and is capable, therefore, of publishing it.

3.5. Validity

The most distinctive feature of the Cambridge certificate exams is their close connection to the educational context. The Cambridge ESL testing tradition grows out of the context of examinations for schools. "Examination syllabuses…often serve as the starting point for teaching syllabuses, and school students are taught in preparation for specific examinations in particular subjects" (UCLES, 1998c, p. 4). Having a direct impact on teaching has always been an intended feature of the Cambridge exams. UCLES publications do not shy away from actually making specific recommendations regarding instructional practices. UCLES handbooks include a variety of suggestions for teachers to promote classroom learning (e.g. see "FCE Handbook", UCLES, 1997, p. 10). As such, the Cambridge exam developers encourage teachers and students to view exam results as indicators of learning, e.g. the extent to which learning goals at different levels have been achieved. This loop of communication between teaching and testing is likely to have, in validity terms, a positive washback effect on the instructional process. (For a discussion of washback, see Wall, 1997, 2000.)

This close link between testing and teaching/learning can also be observed in the manner in which language specialists are included in all stages of test construction and scoring and in the way the scores are reported. The exams are developed and edited to a large extent by applied linguists and ESL teachers with considerable teaching experience. These language specialists are also included in the scoring process as scorers and as trainers of those who do the scoring. In terms of score interpretation, the Cambridge exam letter grades are accompanied by statements that indicate the sections of the tests in which test-taker performance was weak/strong. Certificates documenting test-takers' language ability according to the Cambridge instructional system are awarded to successful test-takers.

With regard to scores, however, two issues need to be considered. First, as indicated above, the FCE, CAE, and CPE exams yield overall grades that provide a general indication of the test-takers' language ability. As such, the considerable work invested in test development and scoring is somewhat undermined given the loss of information resulting from combining the five paper scores into one final score/grade.

While the final grade can be useful for an overall indication of language ability, individual paper scores would add rich and useful information that could provide feedback to the instructional process. Paper scores can also be used to make better selection and admission decisions into the various academic programs at a particular institution. Second, Cambridge exam publications indicate that the letter grades are computed based on the percentage of the total marks. For example, the "CAE Handbook" (UCLES, 1999a) states that the passing grade 'C' is equivalent to approximately 60% of the 200 total points on the exam. The handbook does not explain why 60% has been chosen (why not 70%?). Also, the handbook does not explain what the meaning of 'C' is with regard to instructional goals. It is important for test-users to know how and why the raw scores are transformed into letter grades, and whether such grades are more meaningful and afford more appropriate interpretation and use.

4. IELTS

4.1. Purpose

IELTS is another instrument administered by UCLES and used for university admission purposes. IELTS is intended to measure both academic and general English language proficiency. IELTS includes six sections, called modules. All test-takers are administered the same Listening and Speaking modules. Test-takers then choose to take either the General Training or the Academic Reading and Writing modules. The General Training modules measure test-takers' readiness to work in English language environments or undertake work-related training, or provide language ability evidence for the purpose of immigration. The Academic modules measure test-takers' academic readiness to study or receive training in English at the undergraduate or graduate level. The present review focuses on the Academic uses of the test only.

4.2. Content

The length of IELTS is about 2 h and 30 min. The modules correspond to the traditional skills: reading, writing, listening, and speaking. The modules have no central theme or topic. Similar to the Cambridge certificate exams, IELTS includes a variety of tasks and response types within each module. An example of the variety of response types is illustrated in the IELTS Listening module, which includes: multiple-choice items; short answers; labeling diagrams; summarizing information; taking notes; and matching lists. IELTS publications state that the four parts within the Listening module are progressively more difficult. Specific information on how difficulty level is determined is not provided. The information included mentions only the characteristics of the listening text (length, topic, style, number of participants, setting, etc.).

To help familiarize test-takers with the various task formats, IELTS developers provide practice and preparation materials.

The Listening module includes a variety of excerpts. Excerpts are played only once. Questions are provided in test booklets. Test-takers are given time to preview the questions. Test-takers are required to respond to the questions while the tape is playing. They write their answers directly in the test booklet. This appears reasonable, as some of the tasks require labeling diagrams, matching figures with words, etc. Test-takers are then given time to transfer their answers to an answer sheet, probably to allow for the machine scanning of the responses. Test-takers need to be cautioned about this two-step recording of answers, as it provides opportunity for error.

The Reading module uses text sources from the UK and Australia. The texts used are intended for a non-specialist audience (Clapham, 2000). A Reading module includes at least one text with detailed logical argument and one text with non-verbal materials such as graphs, diagrams, etc. The variation of response types prompts test-takers to work with texts in different ways. IELTS developers indicate that the parts within the Reading module, similar to the Listening module, become increasingly difficult. It is not clear how difficulty level is defined, however.

The Writing module requires two writing samples. The first writing task asks test-takers to write a summary of information presented in a table, graph or diagram. The task specifies an audience and a genre to consider in the response. Such contextualization of writing is innovative and corresponds to a more authentic approach to assessment. The second task requests an essay on a given topic; only one choice of topic is given. This is a fairly usual procedure on standardized tests.

The IELTS Speaking module consists of a 10–15 min interview between the test-taker and one examiner. The interview requires test-takers to describe, narrate, and provide explanations on a variety of personal and general interest topics. The interview also includes an elicitation task, which is in the form of a role-play. Interviews are recorded in case they need to be double marked. The IELTS manual (UCLES, 1999b) indicates that in evaluating interviews examiners consider "evidence of communicative strategies, and appropriate and flexible use of grammar and vocabulary" (p. 14). It is not clear from the manual description what the rating procedures are, e.g. whether a holistic or detailed scoring system is used. The manual points out that the interview is being revised based on research conducted and in consultation with experts in the field.

4.3. Scoring method

The score report provides separate band scores for the four modules. In addition, module scores are averaged to yield an overall band score. Descriptive statements of language proficiency are provided for the nine band levels. As with the Cambridge certificate exams, IELTS developers do not explain how the Listening and Reading raw scores are converted into band levels. Additionally, the criteria scales used for rating Writing and Speaking, as provided, are scanty, and information on how ratings are converted to band scores is lacking. Such information, however, is important to test-users and helps in the interpretation of the ratings obtained.
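
As a simple illustration of how separate module bands and an overall band might relate, the sketch below averages four module scores. The half-band rounding rule is an assumption made purely for the example, since the publications reviewed here do not spell out the conversion.

```python
import math

def overall_band(listening: float, reading: float, writing: float, speaking: float) -> float:
    """Average the four module bands and round to the nearest half band.
    The half-band rounding is an illustrative assumption, not a documented rule."""
    average = (listening + reading + writing + speaking) / 4
    return math.floor(average * 2 + 0.5) / 2  # e.g. 6.25 -> 6.5 under this assumed convention

# Hypothetical module bands for one candidate.
print(overall_band(listening=6.5, reading=6.0, writing=5.5, speaking=7.0))  # -> 6.5
```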

4.4. Reliability

IELTS differs from the Cambridge exams in that published reports recognize the need to address reliability and include information to that effect. For example, IELTS manuals describe a detailed approach to the certification of interviewers/assessors for the speaking test and raters for the writing component that requires recertification procedures every two years. This process of training raters is commendable. In terms of documenting rater reliability, however, the information is lacking. IELTS developers report that the speech and writing samples are re-rated when there is an inconsistency in the profile of the scores and that centers are monitored as part of regular reliability studies conducted by the developers. Such information, while reassuring, needs to be augmented with research evidence. To avoid repetition, and since the reliability measures outlined as necessary for the Cambridge exams are pertinent to IELTS as well, the reader is referred to the "Reliability" section under "The Cambridge certificate exams" (Section 3.4). In short, IELTS publications need to provide more documentation of rater reliability, the reliability of the instrument, and the ensuing scores.

Another aspect of reliability to consider with admission tests is the dependability of decisions made based on cut-off scores. Often, institutions do not carry out any systematic local investigations to decide upon cut-off scores, but instead base their decisions on hearsay and what other institutions are doing. Such a practice is ill-advised. Different institutions are likely to have different program requirements, and their cut-off scores should reflect their local needs. Although the "IELTS Annual Review" (UCLES, 1999b) provides information regarding scores used by various institutions, the publication urges test-users to conduct local research to verify the appropriateness and dependability of adopting a particular band scale for admission. This call for local research to determine the appropriate cut-off score is quite appropriate and should be heeded. Additionally, IELTS developers caution institutions that "IELTS band scores reflect English language proficiency alone and are not predictors of academic success or failure" (UCLES, 1999b, p. 8). The cautionary remark is quite appropriate and is supported by the developers of all three instruments reviewed in the present article: the Cambridge certificate exams, IELTS, and TOEFL. All developers agree that a valid approach to making admission decisions demands that institutions examine variables other than ESL ability (Section 6).

4.5. Validity

The phrase "international English language" in IELTS' name represents a distinguishing feature of this assessment, as it acknowledges the ever-expanding status of English as an international language. Researchers such as Ingram and Wylie (1993) and Clapham (1996), however, indicate that IELTS' representation of the language construct is rooted in the skills and components models typically used in language testing (Lado, 1961; Canale and Swain, 1980; Canale, 1983; Bachman, 1990). An examination of IELTS' research and manuals documents a conspicuous absence of any reference to the international English knowledge base (e.g. Quirk and Widdowson, 1985; Kachru, 1992; Crystal, 1997).

IELTS' manuals point out that the internationalization of the test refers to the partnership of the British Council and UCLES, on the one hand, and the International Development Projects (IDP) Education Australia, on the other. Additionally, IELTS' publications state: "[t]he fact that test materials are generated in both the UK and Australia ensures that the content of each test reflects an international dimension" (IELTS, July 1996, p. 16). Such collaboration and approach to test development is likely to avoid country-specific lexical or cultural knowledge that might disadvantage test-takers who do not have that specific knowledge. Nevertheless, it does not automatically render the test international. Research documenting claims that IELTS can be used as a measure of English as an international language needs to be made available. Given the ever-increasing status of English as an international language, research into how to operationalize the international English construct is likely to be of great value for both language teachers and testers.

The appropriateness of using test scores in contexts for which they are not intended also needs to be considered. IELTS scores have been intended mainly for use in the UK and Australia. Increasingly, however, IELTS is marketing the test in North America. An important issue to address, therefore, is the comparability of language use in North American academic institutions to that in the other two countries. In other words, research is needed to investigate the appropriateness of scores obtained from IELTS as measures of academic language use in North America. Without such research, it is difficult to ascertain what the ensuing test scores in this context mean and how they should be used.

IELTS' commitment to research and its responsiveness to research findings is well documented in the literature. In the late 1980s and early 1990s, IELTS underwent major changes from a test with three academic subject modules to its current form. According to Clapham (1996), IELTS had intended the change to be more of a revision. Nevertheless, based on the results of various investigations, IELTS' developers made more comprehensive changes to the test. As such, IELTS has shown commitment to test practices informed by research findings. As argued above in the definition of validity, the validation process is not a one-time activity but an ongoing process. IELTS' developers need to maintain their validation research and document the psychometric properties of their test and the resulting scores.

5. Test of English as a Foreign Language computer-based test (TOEFL-CBT)

5.1. Purpose

The purpose of TOEFL is to measure the English proficiency of non-native speakers who intend to study in institutions of higher learning in the USA and Canada. In addition, scores are used by certain medical certification and licensing agencies. As with the other tests, TOEFL scores are increasingly being used by institutions, private organizations, and government agencies in other countries as well.

5.2. Content

In 1998, TOEFL was converted from a paper-and-pencil (P&P) test to a CBT. In July 1998, the TOEFL-CBT was rolled out in specific areas around the world, and the TOEFL program is progressively implementing the CBT in the remaining areas. The present review will focus on the TOEFL-CBT.

TOEFL-CBT includes three sections: Listening, Structure/Writing combined, and Reading. Speaking is assessed separately using the Test of Spoken English (TSE). Although the TSE can be administered along with TOEFL, it is an independent test with different procedures and scheduling. For that reason, the following discussion will focus on TOEFL-CBT. The length of TOEFL-CBT, without the speaking component, is approximately 4 h. This time frame includes the mandatory Tutorial, which is intended to help familiarize students with needed computer functions and test skills.

A major difference between the TOEFL-CBT and the other tests in this article is the computer delivery system and the adaptive algorithm in the Listening and Structure sections. An adaptive test differs from a traditional, linear test in that an item is selected based on a test-taker's performance on previous items. Ideally, a computer-adaptive test (CAT) optimizes the testing situation by targeting each test-taker's ability level (for more detail on CBT and CAT, see Chalhoub-Deville and Deville, 1999; Alderson, 2000).

The Listening section uses visuals that require test-takers to view a picture and listen at the same time. The listening input is played only once. Test-takers are not given the opportunity to preview the questions, nor to see them while the listening input is being played, nor to take notes. Once the input has finished, the question is heard and both the question and the response options are displayed on the screen. In addition to traditional multiple-choice items, the test includes item types that ask test-takers to select two options, match or order objects, and select a visual.

Similar to the P&P test, the Structure section contains two types of multiple-choice items: selecting the option that best completes a sentence and identifying an incorrect option. Because the Listening and the Structure sections are adaptive, test-takers must answer each question before the next item is administered. Additionally, test-takers cannot return to previous items.

The Writing section is the only part of TOEFL-CBT where test-takers are requested to construct a response. Test-takers are required to write an essay on a generic topic. Only one topic is provided. No information is provided about the audience, purpose, etc., to help test-takers contextualize their essay. Test-takers can either hand-write or type their essay. Handwritten essays are scanned before they are sent to raters for evaluation.

In the Reading section, test-takers are typically administered four to five texts with 10–14 items per text. Interesting tasks have been developed. One such item requires test-takers to insert a sentence into its appropriate place in a paragraph. Test-takers are administered linear sections of reading texts and items. The computer algorithm administers to test-takers individualized sets of texts and items that meet the content and statistical requirements of the test. As no adaptive algorithm is used in this section, however, test-takers can return to previous items.
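
To make the adaptive idea concrete, the sketch below selects the next item by maximizing item information at the test-taker's current ability estimate under the three-parameter logistic (3PL) model referred to in the scoring description that follows. This is a generic, simplified illustration with invented item parameters; the operational TOEFL algorithm also enforces content coverage and item exposure constraints that are not shown here.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c, D=1.7):
    """Fisher information an item contributes at ability level theta."""
    p = p_3pl(theta, a, b, c, D)
    q = 1 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1 - c)) ** 2

# Hypothetical item pool: (discrimination a, difficulty b, guessing c).
pool = [(1.2, -0.5, 0.20), (0.9, 0.3, 0.25), (1.5, 0.8, 0.15), (1.1, 1.5, 0.20)]

theta_hat = 0.4  # provisional ability estimate after the items answered so far
next_item = max(pool, key=lambda item: item_information(theta_hat, *item))
print("administer item with parameters:", next_item)
```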

5.3. Scoring method

Scoring in the adaptive Listening and Structure sections is cumulative. The final section score depends on item difficulty and the number of items answered correctly. Correct answers to more difficult questions carry more weight than correct answers to easier ones. Item difficulty is estimated using the three-parameter item response theory model. Scores in the linear Reading section are based on the number of correct answers, but adjusted for potential discrepancies in the individualized sets of reading texts. Two independent raters evaluate essays in the Writing section. Essays are rated using a six-point scale. Section scores are converted into scaled scores (ETS, 1998). The scaled scores of the three sections contribute equally to the total scaled score.

5.4. Reliability

TOEFL has been described as the prototypical psychometric-based ESL screening test (Spolsky, 1995). Pierce (1994) documents that the process of TOEFL item development and revision is dictated to a large extent by psychometric analyses. Additionally, the emphasis on the selected-response item type is meant to help ensure high reliability standards. With regard to scores, TOEFL is grounded in a norm-referenced approach where results are represented as numeric scores that indicate the relative standing of students in comparison to a criterion group's performance. Descriptive statements that explicate the test-takers' performance in terms of language ability accompany only the writing and the TSE scores.

It is the research supporting TOEFL that informs its constituencies about the properties of the test scores. Annual reports and manuals present extensive information that documents the reliability of the scores and the performance of test-takers with respect to various background variables (e.g. native language, gender, academic level, etc.). The program also produces several publications a year within the TOEFL Research Report and Technical Report series that investigate a variety of issues to enhance researchers' knowledge about, and practitioners' use of, test scores.

As indicated in the definition of reliability above, any variable that affects test scores, other than the language ability being measured, is considered a potential source of measurement error. Measurement error can limit the reliability and generalizability of the scores obtained. Given the computer format of TOEFL-CBT, an important issue to consider is whether test-takers' performance is adversely affected because they are not familiar with the computer medium.

Before TOEFL was administered in its CBT format, research investigating test-takers' experience with computers and its effect on TOEFL performance was conducted (Kirsch et al., 1998; Eignor et al., 1998; Jamieson et al., 1998; Taylor et al., 1998). A representative sample of approximately 90,000 TOEFL test-takers was surveyed regarding computer familiarity. Results indicated that 16% of the test-takers were identified as having low computer familiarity. Consequently, TOEFL researchers developed a tutorial that test-takers took before starting a TOEFL-CBT.

Research results showed that upon taking the tutorial, there were no practical differences in scores between computer-familiar and computer-unfamiliar test-takers. Nevertheless, as the researchers themselves write, more research is needed to examine the relationship between various background variables and CBT performance. In addition to the mandatory tutorial included in TOEFL-CBT, the instructional CD-ROM TOEFL Sampler is being disseminated to further help enhance test-takers' familiarity with TOEFL-CBT.

CBT represents the next generation of tests, and the challenges encountered by the TOEFL program are to be expected. As more testing organizations convert to the computer medium (e.g. UCLES has introduced CommuniCAT), the collective effort of researchers should help resolve many of the current concerns. Nonetheless, computer familiarity and P&P and CBT/CAT equivalency research should become standard practice for any testing organization planning to use computers as a medium of test delivery.

5.5. Validity

Preoccupation with the psychometric qualities of TOEFL helps ensure good testing practices. Nevertheless, it has made the TOEFL somewhat resistant to, and slow in, incorporating changes that might jeopardize its high reliability standards. Also, the continued commercial success of TOEFL has contributed to its adherence to the status quo. Whereas the validity of test scores is undermined when reliability standards are not upheld, reliability documentation alone cannot make up for inadequate validity evidence. In other words, a strong reliability agenda is not sufficient to ensure meaningful inferences made from TOEFL scores. TOEFL's emphasis on scientific accuracy through its stringent reliability analyses has been done "with a hazardous disregard for some aspects of validity" (Spolsky, 1995, p. 356).

Anastasi (1986) emphasized the importance of building validity into the test development process from the beginning. She states:

Validity is thus built into the test from the outset rather than being limited to the last stages of test development… The validation process begins with the formulation of detailed trait or construct definitions, derived from psychological [communicative] theory, prior research, or systematic observation and analyses of the relevant behavior domain. (p. 3)

In test development, construct delineation is a prerequisite to rendering meaningful scores. Post-test analyses of TOEFL scores, no matter how rigorous they are, cannot make up for a limited representation of the language construct of the test. One can argue that insufficient validation effort in the test construction phase renders the interpretation of subsequent test scores suspect.

The psychometric-structuralist representation of the language construct as operationalized in TOEFL has changed little over the years, despite advancement in the field's understanding of the construct. Additionally, Spolsky (1995) argues that changes made are typically driven by factors such as the maintenance of the test and enhancing its marketing edge rather than in response to theoretical developments and research findings.

Improvements made to the TOEFL-CBT, while commendable, still fall short in their representation of a communicative language construct. For example, the TSE continues to be administered independently. The writing prompts, while requiring extended responses, do not have writers contextualize their essays (Section 5.2). Also, the focus in the scoring of writing is more on grammatical and textual competence than on sociolinguistic competence.

It is worth noting that in recent years ETS and the TOEFL program have invested considerable resources to conceptualize, design and construct a new battery of tests. Work on this battery has been carried out under the name "TOEFL 2000". TOEFL 2000 includes researchers from ETS and academia with varied expertise and professional backgrounds who are working outside the confines of the present TOEFL. They are considering various integrative models of communicative competence, examining academic language use, investigating a variety of item types, and exploring different forms of score reporting to accommodate the needs of the various TOEFL constituents. TOEFL 2000, as it is currently being discussed, represents a very promising endeavor.

6. Conclusion

Language ability scores obtained from the Cambridge certificate exams, IELTS, and TOEFL are used to help make critical decisions concerning admission into institutions for academic training. It is critical, therefore, that the scores obtained provide high quality information. Developers of large-scale tests such as those reviewed in the present article have the responsibility to: construct instruments that meet professional standards; continue to investigate the properties of their instruments and the ensuing scores; and make test manuals, user guides and research documents available to the public.

Test-users also have a responsibility. As the 'Standards' (AERA et al., 1999) state, test developers should "provide information on the strengths and weaknesses of their instruments…. However, the ultimate responsibility for appropriate test use and interpretation lies predominantly with the test user" (p. 113). Test-users need to be cognizant of the properties of the instruments they employ and ensure appropriate interpretation and use of the test scores provided. Test-users need to carry out local investigations to make sure that their admission requirements are based on an informed analysis of their academic programs and the language ability score profiles necessary to succeed in these programs.

When making admission decisions, good practice dictates taking into account the overall score as well as scores in the various skill/component areas, since different academic programs may require different profiles of language ability. Additionally, SEM indices that reflect score fluctuations should be considered. Other student variables should also be examined. These variables include students' past academic performance, previous experiences and credentials, performance on academic achievement tests, local test results, interview performance, etc. In short, admission decisions should consider how language ability, individual factors, and academic requirements fit together to ensure more dependable admission decisions.
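
As an illustration of the kind of local screening check an institution might set up, the sketch below compares a candidate's section scores against a locally determined minimum profile and flags any section that falls within one SEM of its cut-off for closer review. All names, score values and the SEM are invented for the example; they do not correspond to any published requirement.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    """Hypothetical minimum section scores a local program might require."""
    listening: float
    reading: float
    writing: float
    speaking: float

def screen(candidate: dict, required: Profile, sem: float) -> str:
    """Compare a candidate's section scores with a locally set profile,
    flagging sections that fall within one SEM of their cut-offs."""
    borderline, below = [], []
    for section, cut in vars(required).items():
        score = candidate[section]
        if score < cut - sem:
            below.append(section)
        elif score < cut + sem:
            borderline.append(section)
    if below:
        return f"does not meet profile: {', '.join(below)}"
    if borderline:
        return f"meets profile, but review sections near cut-off: {', '.join(borderline)}"
    return "meets profile"

# Invented numbers purely for illustration.
required = Profile(listening=6.0, reading=6.5, writing=6.0, speaking=6.0)
candidate = {"listening": 6.5, "reading": 6.5, "writing": 5.5, "speaking": 6.0}
print(screen(candidate, required, sem=0.5))
```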

Finally, it must be noted that there is no best test. There could be a test, however, that is more appropriate for, and corresponds better to, the particular needs of a given institution. A specific context is always needed to put the usefulness of a test into perspective. When evaluating a test for use in a particular context, it is important to be aware of the considerations and issues put forth in the present article.

References
AERA, APA, NCME, 1999. Standards for Educational and Psychological Testing. American Educational Research Association, American Psychological Association, National Council on Measurement in Education, Washington, DC.
Alderson, J.C., Krahnke, K.J., Stansfield, C.W., 1987. Reviews of English Language Proficiency Tests. TESOL, Washington, DC.
Alderson, J.C., 2000. Technology in testing: the present and the future. System 28, 593–603.
Anastasi, A., 1986. Evolving concepts of test validation. Annual Review of Psychology 37, 1–15.
Bachman, L.F., 1990. Fundamental Considerations in Language Testing. OUP, Oxford.
Bachman, L.F., Davidson, F., Foulkes, J., 1993. A comparison of the abilities measured by the Cambridge and Educational Testing Service EFL test batteries. In: Douglas, D., Chapelle, C. (Eds.), A New Decade of Language Testing Research. TESOL, Alexandria, VA, pp. 25–45.
Bachman, L.F., Kunnan, A., Vanniarajan, S., Lynch, B., 1988. Task and ability analysis as a basis for examining content and construct comparability in two EFL proficiency test batteries. Language Testing 5, 128–159.
Canale, M., 1983. On some dimensions of language proficiency. In: Oller Jr., J.W. (Ed.), Issues in Language Testing Research. Newbury House, Rowley, MA, pp. 333–342.
Canale, M., Swain, M., 1980. Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics 1, 1–47.
Chalhoub-Deville, M., 1995. Deriving oral assessment scales across different tests and rater groups. Language Testing 12, 16–33.
Chalhoub-Deville, M., Deville, C., 1999. Computer adaptive testing in second language contexts. Annual Review of Applied Linguistics 19, 273–299.
Clapham, C., 1996. The Development of IELTS: A Study of the Effect of Background Knowledge on Reading Comprehension. Cambridge University Press, Cambridge.
Clapham, C., 2000. Assessment for academic purposes: where next? System 28, 511–521.
Crystal, D., 1997. English as a Global Language. Cambridge University Press, Cambridge.
Eignor, D., Taylor, C., Kirsch, I., Jamieson, J., 1998. Development of a Scale for Assessing the Level of Computer Familiarity of TOEFL Test Takers (TOEFL Research Report No. 60). Educational Testing Service, Princeton, NJ.
ETS, 1998. Computer-Based TOEFL: Score User Guide. Educational Testing Service, Princeton, NJ.
Fulcher, G., 1996. Testing tasks: issues in task design and the group oral. Language Testing 13, 23–51.
Fulcher, G., 2000. The 'communicative' legacy in language testing. System 28, 483–497.
IELTS (International English Language Testing System), July 1996. IELTS Annual Report: 1995. UCLES, The British Council, and IDP Education Australia, Cambridge.
Ingram, D.E., Wylie, E., 1993. Assessing speaking proficiency in the international English language testing system. In: Douglas, D., Chapelle, C. (Eds.), A New Decade of Language Testing Research. TESOL, Alexandria, VA, pp. 220–234.
Jamieson, J., Taylor, C., Kirsch, I., Eignor, D., 1998. Design and Evaluation of a Computer-based TOEFL Tutorial (TOEFL Research Report No. 62). Educational Testing Service, Princeton, NJ.
Kachru, B.B. (Ed.), 1992. The Other Tongue: English Across Cultures. University of Illinois Press, Urbana, IL.

Kirsch, I., Jamieson, J., Taylor, C., Eignor, D., 1998. Computer Familiarity among TOEFL Test Takers (TOEFL Research Report No. 59). Educational Testing Service, Princeton, NJ.
Lado, R., 1961. Language Testing. McGraw-Hill, New York.
Messick, S., 1989. Validity. In: Linn, R.L. (Ed.), Educational Measurement, 3rd Edition. American Council on Education/Macmillan, New York, pp. 13–103.
Messick, S., 1996. Validity and washback in language testing. Language Testing 13, 241–256.
Pierce, B., 1994. The test of English as a foreign language: developing items for reading comprehension. In: Hill, C., Parry, K. (Eds.), From Testing to Assessment: English as an International Language. Longman, New York, pp. 39–60.
Quirk, R., Widdowson, H.G. (Eds.), 1985. English in the World: Teaching and Learning the Language and Literatures. Cambridge University Press, Cambridge.
Shohamy, E., 1984. Does the testing method make a difference? The case of reading comprehension. Language Testing 1, 147–170.
Shohamy, E., Reves, T., Bejarano, Y., 1986. Introducing a new comprehensive test of oral proficiency. English Language Teaching Journal 40, 212–222.
Spolsky, B., 1995. Measured Words. Oxford University Press, Oxford.
Taylor, C., Jamieson, J., Eignor, D., Kirsch, I., 1998. The Relationship Between Computer Familiarity and Performance on Computer-based TOEFL Test Tasks (TOEFL Research Report No. 61). Educational Testing Service, Princeton, NJ.
UCLES, 1997. FCE Handbook. University of Cambridge Local Examinations Syndicate, Cambridge.
UCLES, 1998a. CPE Handbook. University of Cambridge Local Examinations Syndicate, Cambridge.
UCLES, 1998b. CAE Report. University of Cambridge Local Examinations Syndicate, Cambridge.
UCLES, 1998c. Producing Cambridge EFL Examinations: Key Considerations and Issues. University of Cambridge Local Examinations Syndicate, Cambridge.
UCLES, 1999a. CAE Handbook. University of Cambridge Local Examinations Syndicate, Cambridge.
UCLES, 1999b. IELTS Annual Review: 1998/1999. University of Cambridge Local Examinations Syndicate, The British Council, and IDP Education Australia, Cambridge.
Wall, D., 1997. Impact and washback in language testing. In: Clapham, C., Corson, D. (Eds.), Language Testing and Assessment, Encyclopedia of Language and Education, Vol. 7. Kluwer, Dordrecht, pp. 291–302.
Wall, D., 2000. The impact of high-stakes testing on teaching and learning: can this be predicted or controlled? System 28, 499–509.
