
A Rasch-based validation of the Vocabulary Size Test

Language Testing
27(1) 101–118
© The Author(s) 2010
Reprints and permission: http://www.
sagepub.co.uk/journalsPermission.nav
DOI: 10.1177/0265532209340194
http://ltj.sagepub.com

David Beglar

Temple University, Japan

Abstract
The primary purpose of this study was to provide preliminary validity evidence for a 140-item
form of the Vocabulary Size Test, which is designed to measure written receptive knowledge of
the first 14,000 words of English. Nineteen native speakers of English and 178 native speakers of
Japanese participated in the study. Analyses based on the Rasch model were focused on several
aspects of Messick's validation framework. The findings indicated that (1) the items and examinees
generally performed as predicted by a priori hypotheses, (2) the overwhelming majority of the items
displayed good fit to the Rasch model, (3) the items displayed a high degree of unidimensionality
with the Rasch model accounting for 85.6% of the variance, (4) the items showed a strong
degree of measurement invariance with disattenuated Pearson correlations for person measures
estimated with different sets of items of 0.91 and 0.96, and (5) various combinations of items
provided precise measurement for this sample of examinees as indicated by Rasch reliability
indices >0.96. The Vocabulary Size Test provides teachers and researchers with a new instrument
that greatly extends the range of measurement provided by other measures of written receptive
vocabulary size.

Keywords

item invariance, Rasch model, test validity, unidimensionality, vocabulary size, vocabulary test

Acquiring a large vocabulary is an enterprise that occurs gradually over a number of


years for both native speakers and foreign language learners, and because of the key role
that lexical knowledge plays in reading and listening, it is important that estimates of
receptive vocabulary size be available to administrators, teachers, and the learners themselves. One vocabulary size test that has received attention for a number of years has
been the Eurocentres Vocabulary Size Test (Meara and Buxton, 1987; Meara and Jones,
1990; see Read, 2000, pp. 126–132 for a detailed description and evaluation of the test).
This test has been used as a placement instrument and as a measure of second language
learners' vocabulary size in a number of studies (current versions of the test are available
at http://www.lognostics.co.uk/). A second instrument that has been used by a number of
researchers as a measure of vocabulary size is the Vocabulary Levels Test (Nation, 1983,
1990). However, Nation (2001) has stated that the test is "a diagnostic test" (p. 373)
Corresponding author:
David Beglar, Temple University Japan Campus, Osaka Eki-mae Building 3, 21F, 11-32100 Umeda Kita-ku,
Osaka, Japan 530-0001.
E-mail: dbeglar@mac.com OR beglar@tuj.ac.jp


whose main purpose is to "let teachers quickly find out whether learners need to be working on high frequency or low frequency words" (pp. 21–22); thus, Nation himself does
not appear to consider the test a comprehensive measure of vocabulary size.
Currently, no well-accepted test of written receptive vocabulary size for non-native
speakers of English exists despite the fact that such a test could serve a number of important roles in foreign language curricula. For instance, vocabulary size test results could be
used to determine the current vocabulary sizes of individuals as well as groups of English
language learners, compare the vocabulary sizes of those same individuals and groups,
chart the growth of their vocabularies as they progress through educational programs,
determine how successfully learners can perform such everyday tasks as reading newspapers, watching movies, and listening to friendly conversations (see Nation, 2006 for further information about the vocabulary sizes needed to perform such tasks), select individuals displaying specific levels of vocabulary knowledge for particular educational experiences, make
mastery decisions, determine the degree to which a course or program is meeting lexical
objectives, and better understand the impact of educational reform on vocabulary growth.
Accomplishing this variety of educational purposes requires a test capable of measuring
the vocabulary sizes of beginning as well as highly advanced learners.
This paper describes an initial effort to provide validity evidence for a test designed
to measure written receptive vocabulary size. This task will be approached following
Messick (1989, 1995), who identified six aspects of construct validity that "...function as general validity criteria or standards for all educational and psychological measurement" (Messick, 1995, p. 6): content, substantive, structural, generalizability, external, and consequential validity. The first four of these aspects, as well as two further aspects of validity proposed by the Medical Outcomes Trust Scientific Advisory Committee (1995),
responsiveness and interpretability, will be investigated. Additionally, a Rasch-based
approach to instrument validation described by Wolfe and Smith (2007a, 2007b) will be
applied. The validity evidence presented in this paper must be viewed in the light of the
understanding that "Tests do not have reliabilities and validities, only test responses do ... test responses are a function not only of the items, tasks, or stimulus conditions but of the persons responding and the context of measurement" (italics in the original) (Messick,
1989, p. 14). Thus, the primary purpose of this study is to investigate the functioning of
one form of the Vocabulary Size Test (VST) with one group of test-takers in a specific
context.

Method
Participants
Four groups of participants (N = 197) took part in this study: (1) adult native speakers of
English (NSE group) studying in a Master's or Doctoral program in education at a major
American university (n = 19); (2) advanced English proficiency native speakers of
Japanese (High group) studying in the same Master's and Doctoral programs (TOEFL
560–617) (n = 29); (3) intermediate English proficiency native speakers of Japanese
(Mid group) studying in an intensive language program at the same American university


and undergraduate students studying in a prestigious private Japanese university (n = 53);


and (4) low English proficiency native speakers of Japanese (Low group) studying in a
less prestigious Japanese university (n = 96). The participants were fully informed of the
purposes of the test and permission to use the TOEFL scores of the Japanese participants
was obtained.

The Instrument
The Vocabulary Size Test was developed to provide a reliable, accurate, and comprehensive measure of second language English learners' written receptive vocabulary size
from the first 1000 to the fourteenth 1000 word families of English.
Three 140-item forms of the Vocabulary Size Test have been written; however, the
focus of this study is on Form 1 (One version of the Vocabulary Size Test is available at
http://www.victoria.ac.nz/lals/staff/paul-nation/nation.aspx). The test form investigated
in this study consists of 10 items from each 1000-word level for a total of 140 items. The
words included on the Vocabulary Size Test are based on fourteen 1000 BNC word lists
developed by Nation (2006) (available at http://www.vuw.ac.nz/lals/staff/paul-nation/
nation.aspx). These lists use the notion of word family as the unit of organization. The
word family is an appropriate unit for a receptive vocabulary measure because second
language learners beyond a beginning proficiency level have some control of word
building devices and are able to identify both formal and meaning-based relationships
between regularly affixed members of a word family (e.g., produce, producing, producer). Empirical evidence also supports the idea that the word family is a psychologically real unit (Bertram, Baayen, & Schreuder, 2000; Bertram, Laine, & Virkkala, 2000; Nagy et al., 1989). The word
family unit used in the fourteen 1000 BNC word family lists is set at level 6 of Bauer and
Nation's (1993) scale of levels. All of the family members at level 6 meet the criteria of
regularity, frequency, productivity, and predictability.
The word lists used to choose and sequence the test items are not based on the entire
100,000,000 token British National Corpus because the formal written nature of the
British National Corpus strongly affected the words included in the higher frequency
levels. For instance, items such as cat and hello occur in the fourth 1000-word list and
relatively formal words such as civil and commission occur in the first 1000-word list. As
a result, the first twelve 1000-word lists were revised using word family range and frequency figures from the 10 million token spoken section of the British National Corpus.
(The spoken corpus lists upon which the Vocabulary Size Test is based are available from
http://www.vuw.ac.nz/lals/staff/paul-nation/nation.aspx.)
A multiple-choice format was selected for the Vocabulary Size Test in order to (1) allow a wide range of content to be sampled efficiently; (2) allow the test to be used with learners from a variety of language backgrounds (i.e., many learners are familiar with the multiple-choice format); (3) control the level of difficulty of the items by demanding approximately the same degree of knowledge for each item (achieved through the consistent use of one set of item writing procedures); (4) make marking as efficient and reliable as possible; and (5) make learners demonstrate knowledge of each item. Each


item is placed in a short non-defining context as shown in the following example from
the fifth 1000-word level.
1. miniature: It is a miniature.
   a. a very small thing of its kind
   b. an instrument for looking at very small objects
   c. a very small living creature
   d. a small line to join letters in handwriting

The four options were written using a restricted vocabulary. For items at the first and
second 1000-word frequency levels, only words from the first 1000 of West's (1953)
General Service List were used. To the extent possible, the words in the definitions were
of higher frequency than the item being defined, but for the highest frequency items, this
was not always possible (e.g., time could not be defined except with words of a slightly
lower frequency such as hours). For items at the third 1000-word level and above, the
defining vocabulary was drawn from the first 2000 words of West's General Service List.
When it was necessary to use a word not on West's list, the frequency of the defining
word and the test item were checked using the British National Corpus, and a defining
word that was always significantly more frequent than the item being defined was
selected (e.g., the target word haunt was defined using the word ghost).
All four answer options are substitutable in the context sentence, and the context sentences were chosen to reflect the most frequent environments for the target item. For
example, the word instance occurs most frequently in the phrase for instance, so this was
used as the sentence context. Where the plural of an item was significantly more frequent
than the singular, the context was made plural (e.g., standards). The part of speech chosen for the item was also a reflection of the highest frequency environment.
Test-takers must have a fairly well-developed idea of the meaning of the word to correctly answer the items because the correct answer and the distractors frequently share
elements of meaning. This makes the Vocabulary Size Test a more demanding test than
the Eurocentres Vocabulary Size Test because test-takers do not have to directly demonstrate knowledge of the meaning of the words in the checklist format.

Procedures
Three versions of form 1 of the Vocabulary Size Test were administered during regular
class meetings to different groups of test-takers based on their English proficiency level.
The NSE and High groups took a 140-item form made up of items from the first to the
fourteenth 1000-word frequency levels, the Mid group took an 80-item form made up of
items from the first to the eighth 1000-word frequency levels, and the Low group took a
40-item form made up of items from the first to the fourth 1000-word frequency levels.
The data from the completed tests were entered into an Excel 11.3.5 spreadsheet, exported
to WINSTEPS 3.64.2 (Linacre, 2007a), and calibrated using the Rasch dichotomous
model (Rasch, 1960), which is defined mathematically by the following formula: Pni = exp(Bn − Di) / [1 + exp(Bn − Di)], where Pni is the probability of a person n with ability Bn succeeding on item i, which has a difficulty of Di, and exp is the exponential function based on the natural constant e = 2.71828.
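For readers who prefer a computational statement of the model, the following short Python sketch (purely illustrative; WINSTEPS performed all estimation reported in this study) evaluates the Rasch dichotomous probability for a given person ability and item difficulty, both expressed in logits.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch dichotomous model.

    Both arguments are expressed in logits.
    """
    return math.exp(ability - difficulty) / (1.0 + math.exp(ability - difficulty))

# Example: a person 1.2 logits above an item's difficulty.
print(round(rasch_probability(ability=0.7, difficulty=-0.5), 3))  # ~0.769
```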
The Rasch measurement model was selected because it provides a way to construct
linear item and person measures, relate the empirical item and person hierarchy to a
priori hypotheses concerning the latent variable and person responses to that variable,
examine differences between observed responses and model expected responses, and
determine the dimensionality of the data through an analysis of item residual variances
and the degree to which the residuals appear to form meaningful measures of secondary
constructs.
As all test-takers completed the first 40 items on the test, a common-item design with
an internal anchor (Wolfe, 2004) was used to calibrate the remaining items. Twenty-three
items displaying infit mean-square indices between 0.90 and 1.10 were selected as
anchors and convergence criteria were set to 10 times the usual degree of strictness. Link
quality was evaluated by inspecting item displacement values and by cross-plotting item
difficulty and person ability estimates with and without the anchor set. No problems
were noted.
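A minimal sketch of the kind of link-quality check described above, using simulated values in place of the study's actual WINSTEPS output; the variable names and numbers are hypothetical and only illustrate how displacement and a cross-plot correlation for the 23 anchor items might be inspected.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the anchor items' difficulties estimated with the
# anchor values fixed and then re-estimated freely.
anchored = rng.normal(0.0, 1.0, size=23)
free = anchored + rng.normal(0.0, 0.08, size=23)

displacement = free - anchored                 # analogous to an item displacement check
r = np.corrcoef(anchored, free)[0, 1]          # cross-plot correlation

print(f"max |displacement| = {np.abs(displacement).max():.2f} logits")
print(f"cross-plot correlation r = {r:.3f}")
# Small displacements and a near-linear cross-plot suggest a sound common-item link.
```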

Content aspect of construct validity


The first aspect of construct validity concerns test content. Messick (1995, p. 6) stated, "The content aspect of construct validity includes evidence of content relevance, representativeness, and technical quality." Content relevance was described in the previous
section, so the focus here is on representativeness and technical quality.

Representativeness
The second aspect of construct validity, representativeness, is a particularly important
characteristic of a test, as this aspect concerns the degree to which a test is sensitive to
variations in the construct being measured (Borsboom et al., 2004). Three issues are of
concern here: (1) whether a sufficient number of items are included on the measurement
instrument; (2) whether the empirical item hierarchy shows sufficient spread; and (3)
whether gaps exist in the empirical item hierarchy. Each of these issues is addressed in
the output shown in the item-person map displayed in Figure 1, which shows the linear
relationship between the Rasch calibrations for the 197 test-takers and the 140 items. On
the far left side of the figure is the Rasch logit, which has been transformed into a CHIPS
scale for easier interpretation (E. V. Smith, 2000). The CHIPS scale, which has an item
mean of 50, is useful for predicting expected examinee performance. More able persons
and more difficult items are located toward the top of the figure and less able persons and
less difficult items are located toward the bottom of the figure. The items are labeled
according to their word frequency level in the BNC corpus and the item number on the
test form (e.g., 5000-8 means the fifth 1000-word level, item 8).
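As an illustration of the rescaling just mentioned, the sketch below converts logit measures to a CHIP-style user scale. The item mean of 50 comes from the text; the 4.55 scale units per logit is the conventional Winsteps-style rescaling and is assumed here, since the exact constant used in the study is not reported.

```python
def logit_to_chips(logit: float, item_mean_chips: float = 50.0,
                   units_per_logit: float = 4.55) -> float:
    """Rescale a Rasch logit onto a CHIP-style scale (constants are assumed).

    Items are centred at 50; 4.55 chips per logit is a common convention,
    not a value taken from the article.
    """
    return item_mean_chips + units_per_logit * logit

# A person 2 logits above the mean item difficulty:
print(logit_to_chips(2.0))  # ~59.1
```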
Figure 1 indicates that a sufficient number of items is included on the VST, as 10
items per level has resulted in a test that is capable of measuring the written receptive


Figure 1. Wright map of person measures and item calibrations

Note: Each # represents approximately 3 persons. Each * represents approximately 1 person. M = the mean of the person or item estimates. S = 1 standard deviation from the mean. T = 2 standard deviations from the mean.

vocabulary knowledge of low-proficiency EFL students as well as high-proficiency non-native speakers of English studying in Master's or Doctoral programs of Education in an
English-medium university. No floor or ceiling effects were present for any of the
Japanese test-takers, as the least able Japanese test-taker had an ability estimate of 36.1
(i.e., written receptive knowledge of approximately 800 word families), while the most


able Japanese participant had an ability estimate of 72.6 (i.e., written receptive knowledge of approximately 13,100 word families). Item difficulties ranged from 33.2 to 74.4.
The spread of item calibrations was determined statistically by calculating item strata,
which is calculated using the following formula: Item strata = (4Gitem + 1) / 3, where
Gitem is Rasch item separation. Gitem = SAitem / SEitem, where SAitem is the adjusted
item standard deviation, and SEitem is the average measurement error. The item strata
statistic indicates the number of statistically distinct strata of item difficulty that the test-takers have distinguished, separated by at least three errors of measurement (Wright and Masters, 2002). The item strata statistic for this sample of 140 items was 7.29. This finding confirms that the difficulty of the items defining the empirical item hierarchy varies widely (i.e., more than seven statistically distinct levels); thus, the VST can be used with learners with widely differing levels of written receptive vocabulary knowledge and provides a sufficient number of strata for measuring learners' lexical acquisition over long
periods of time.
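The computation is straightforward to reproduce; the sketch below follows the formulas just given, with illustrative numbers rather than the study's actual summary statistics.

```python
def separation(adjusted_sd: float, rmse: float) -> float:
    """Rasch separation G: adjusted ('true') standard deviation over average error."""
    return adjusted_sd / rmse

def strata(g: float) -> float:
    """Number of statistically distinct strata: (4G + 1) / 3."""
    return (4.0 * g + 1.0) / 3.0

# Illustrative values only (not the study's): adjusted item SD and average error in logits.
g_item = separation(adjusted_sd=1.9, rmse=0.36)
print(f"item separation = {g_item:.2f}, item strata = {strata(g_item):.2f}")
```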
Finally, Figure 1 shows that there are not only no gaps in the empirical item hierarchy,
there is considerable redundancy, as items with similar difficulty estimates are found
along nearly the entire measurement range. Ten items per level is more than sufficient to estimate the test-takers' lexical knowledge with a high degree of precision as shown by the relatively low standard errors (a range of 1.0–1.8, M = 1.6, SD = 0.3) for the Japanese test-takers' ability estimates. As shown below, it is possible to substantially reduce the
number of items used per word frequency level without encountering significant reductions in measurement precision.

Technical quality
Technical quality was evaluated by inspecting Rasch standardized item weighted mean-square fit statistics as estimated with the 178 Japanese participants. A standardized infit
value of > +2.00 was adopted as the criterion for determining underfit to the Rasch
model because the standardized fit index maintains a similar Type I error rate across
varying sample sizes (Smith et al., 1998; R. M. Smith, 2000). Five items, 1000-10, basis (Infit Mnsq = 2.05, Infit Zstd = 6.4), 3000-4, scrub (Infit Mnsq = 1.19, Infit Zstd = 2.7), 3000-9, rove (Infit Mnsq = 1.50, Infit Zstd = 4.3), 11,000-10, hessian (Infit Mnsq = 1.48, Infit Zstd = 2.1), and 14,000-8, erythrocyte (Infit Mnsq = 1.36, Infit Zstd = 2.1), displayed standardized infit values > +2.00. The most misfitting examinee response strings were inspected for each of the five misfitting items. With the exception of items 1000-10 and 3000-9, the underfit was caused by fewer than four examinees who had responses resulting in large residuals. An inspection of the distractor functioning revealed the reason for the misfit for item 1000-10. One of the distractors (main
part) was selected by 60% of the Japanese examinees; their mean person ability estimate was 53.24. The correct answer (reason) was selected by only 23% of the examinees and their average person ability estimate, 49.88, was below that of the examinees
who selected the distractor. As even some of the NSE participants expressed confusion
about this item, it appears that main part needs to be replaced with a distractor that is
less similar to the correct answer.


An inspection of the distractor functioning of item 30009 also indicated problems as


two incorrect options (getting drunk and making a musical sound through closed lips)
attracted nearly as many responses (23% and 29%, respectively) as the correct answer
(traveling around) (31%). Although the mean person ability estimates of the examinees
selecting each of the responses were similar (51.81, 50.50, and 53.15, respectively), it is
difficult to conclude that the item should be rewritten as the two distractors are distinctly
different from the correct answer.
The second way in which technical quality was assessed was through an inspection of
overfitting items. Overfit to the Rasch model, which is defined in this study as items displaying standardized infit and outfit values < −2.00, does not pose the same threat to the accurate measurement of the latent variable that underfitting items do; however, such items can indicate that the assumption of local independence has been violated, and they decrease error variance and increase reliability estimates. This can lead to overly optimistic impressions of an instrument's reliability for a particular group of examinees (E. V. Smith, 2005). However, on this form of the VST, only three items (2.1%), item 2000-1, maintain (Infit Zstd = −4.5, Outfit Zstd = −2.2), item 2000-5, patience (Infit Zstd = −2.6, Outfit Zstd = −2.1), and item 5000-3, nun (Infit Zstd = −3.9, Outfit Zstd = −2.9), overfit the Rasch model with standardized infit and outfit statistics < −2.00. This number
of overfitting items was not considered problematic in the context of the entire test, given
that E. V. Smith (2005) concluded that where the proportion of overfitting items is less
than 5%, item difficulty and person ability estimates are not affected substantially.
In sum, although item 1000-10 should be rewritten and the other four misfitting items (items 3000-9, 2000-1, 2000-5, and 5000-3) bear watching in future uses of this test
form, the overall number of misfitting items is not problematic when viewed in the context of the entire 140-item test, as the five items represent a 3.6% misfit rate, which is less
than the 5% rate expected to occur by chance given the nature of the z distribution. The
care that went into writing the items is apparent given the excellent psychometric characteristics displayed by the vast majority of the items.

The substantive aspect of construct validity


Messick (1995, p. 6) defined the substantive aspect of construct validity as "theoretical rationales for the observed consistencies in test responses ... along with empirical evidence that the theoretical processes are actually engaged by respondents in the assessment tasks." This aspect of construct validity was investigated first by determining the
degree to which item difficulty estimates are in the hypothesized order.
It was hypothesized that the VST items would form a difficulty continuum based on
their frequency in the BNC list, given that word frequency is an indicator of the probability
that an individual will encounter a word in an authentic communicative context. Word
frequency corpora make it clear that over large quantities of authentic data, the probability
of meeting some words is far greater than that of meeting other words, a phenomenon that
is elucidated in exposure theory, which has been used to explain the development of receptive vocabulary acquisition in native speakers of English (Miller and Gildea, 1987; Stenner
et al., 1983; see Ellis (2002) for a discussion of the impact of frequency on second language acquisition). A number of previous second language vocabulary researchers have confirmed that higher frequency words tend to be better known than lower frequency words (e.g., Greidanus and Nienhuis, 2001; Laufer and Nation, 1999).

Figure 2. Mean difficulties and 95% confidence intervals for the 14 word frequency levels
This hypothesis was tested by computing the ensemble means for each of the fourteen
1000-word frequency levels that make up the test. This involved computing the mean
Rasch item difficulty estimates for the 10 items that were calibrated for each word frequency level. In this way, the idiosyncratic behavior of individual items, caused, for instance, by their status as loan words in Japanese (e.g., beagle and caffeine), was ameliorated. The results displayed in Figure 2 show that the mean ensemble difficulties of the
14 word frequency levels are generally in line with theoretical expectations. The easiest
group of words was the first 1000-word level items, which had a mean item difficulty
estimate of 42.59, and the most difficult group was the fourteenth 1000-word level items,
which had a mean item difficulty estimate of 62.98. The eighth 1000-word frequency
level was easier than predicted, in part due to one extremely easy item (item 8000-4,
kindergarten) that is an English loanword in Japanese. In addition, the items comprising
the second through the fourth and the eleventh through the fourteenth 1000-word frequency levels were not clearly distinguished from one another in terms of ensemble difficulty. This may have occurred for the second through fourth 1000 levels because the
presence of a large number of English loanwords in these frequency bands may have
served to equalize word difficulty to some degree. The lack of differences among the eleventh through the fourteenth 1000-word frequency levels was likely caused by the sample of examinees in this study; most of the High group participants had written receptive vocabulary sizes between 7500 and 10,000 word families, and the native speakers of English had relatively few difficulties with most of the items at those lower word frequency levels; thus, very few of the participants possessed vocabulary sizes that could distinguish any differences among those levels.

Figure 3. Mean Rasch ability estimates and 95% confidence intervals for the four groups of participants
Next, the degree to which examinee ability estimates were in the hypothesized order
was investigated by determining whether the construct measured by the VST followed the
pattern predicted by theory as well as previous empirical research. A developmental model
in which it was hypothesized that written receptive lexical knowledge is greater for individuals at higher levels of general English proficiency was adopted. Thus, it was predicted
that the NSE test-takers would display the greatest lexical knowledge, followed by the High
group (TOEFL > 575), the Mid group (TOEFL < 525), and the Low group (in descending
order). It was also predicted that the gap between the NSE and High groups and the High
and Mid groups would be much greater than that between the Mid and Low groups
because of the larger general proficiency differences separating those groups. As shown in
Figure 3, the calibrations show that the four groups of examinees are in the predicted
order, and, as predicted, the gap separating the Low and Mid groups is considerably
smaller than the ones separating the Mid and High groups and the High and NSE groups.


The structural aspect of construct validity


With regard to the structural aspect of construct validity, Messick (1989, p. 43) stated that "...scoring models should be rationally consistent with what is known about the structural relations inherent in behavioral manifestations of the construct in question" and "...the degree of homogeneity in the test should be commensurate with this characteristic degree of homogeneity associated with the construct." It was hypothesized that the VST
would show a high degree of psychometric unidimensionality. This hypothesis was
partly based on Beglar and Hunt's (1999) research with two sections of the Vocabulary Levels Test (Nation, 1990) showing that the majority of items on 54-item versions of the 2000-word frequency level and University Word List test measured a single factor.
Following proposals by Hattie (1985) and Linacre (1992, 1998), the dimensionality of
the VST was investigated by identifying items that appeared to measure a secondary dimension based on a principal components analysis of item residuals. Several studies (e.g., Smith
and Miao, 1994) have shown that Rasch analysis is superior to traditional factor analytic
approaches for assessing the dimensionality of instruments, such as the VST, that are
designed to produce a unidimensional measure of a latent variable. The Rasch model
extracts the first major dimension in the data, which is the systematic variation accounted
for by the primary latent construct, and if the data are unidimensional and they fit the Rasch
model, no systematic relationships should be present in the residuals. On this version of the
Vocabulary Size Test, the Rasch model accounted for 85.6% of the total variance, a figure
that is well above the 60% criterion proposed by Linacre (2007b). In addition, the first four
residual components each accounted for between 0.4 and 0.6% of the variance in the data
(total variance of the four components = 1.9%), figures that are well below the 3% criterion
proposed by Linacre. Following Stevens' (2002) guidelines, components on which at least
three items had absolute loadings >0.80, at least four items had absolute loadings >0.60, or
at least 10 items had absolute loadings >0.40 were considered to indicate the presence of a
second meaningful construct in the data. None were found; thus, no meaningful secondary
dimension was identified and none of the residual components were measured with sufficient reliability to warrant interpretation. Taken together, the results indicate that this version of the VST displays a high degree of psychometric unidimensionality, a conclusion
that is further bolstered by the tests of item invariance described below.
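A rough sketch of the logic of this dimensionality check is given below. The study's analysis was carried out in Winsteps on the actual response data; here, a simulated response matrix that fits the Rasch model stands in, purely to show how standardized residuals are formed and their principal components inspected.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated dichotomous data generated from the Rasch model (hypothetical values).
abilities = rng.normal(0, 1.5, size=200)
difficulties = rng.normal(0, 1.5, size=140)
p = 1 / (1 + np.exp(-(abilities[:, None] - difficulties[None, :])))
responses = (rng.random(p.shape) < p).astype(float)

# Standardized residuals: (observed - expected) / sqrt(Bernoulli variance).
residuals = (responses - p) / np.sqrt(p * (1 - p))

# Principal components of the item residual correlations; with unidimensional data
# no residual contrast should account for much common variance.
eigvals = np.linalg.eigvalsh(np.corrcoef(residuals.T))[::-1]
print("largest residual contrasts (eigenvalues):", np.round(eigvals[:3], 2))
```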
A corresponding principal components analysis for the examinees indicated that
85.6% of the variance in person ability estimates was explained by the Rasch measures,
and the variance explained by the first five contrasts was 0.9%, 0.6%, 0.5%, 0.5%, and
0.4%, respectively; thus, the variance in the examinees performances was well accounted
for by the Rasch person measures.

The generalizability aspect of construct validity


Messick (1989, p. 56) stated that "The extent to which a measure's construct interpretation empirically generalizes to other population groups is here called population generalizability ... and to other tasks representative of operations called for in the particular domain of reference, task generalizability" (italics in the original). This aspect of validity concerns


the principle of invariance, which was described by Rasch (1960, p. 332) as follows: "The comparison between two stimuli should be independent of which particular individuals were instrumental for the comparison; and it should also be independent of which other stimuli within the considered class were or might also have been compared."
The use of the Rasch model does not in any way guarantee that test data behave as
described in the above quote; invariance must be demonstrated and is a question of the
degree to which item and person calibrations are invariant across measurement contexts.
Item calibration invariance was investigated by testing for uniform differential item
functioning (DIF) with the separate calibration t-test approach (Wright and Stone, 1979).
The purpose of the DIF analysis was to determine whether male and female test-takers
displayed different probabilities of answering items correctly after being matched on
measures of written receptive vocabulary knowledge. Because the NSE test-takers were
predominately male and the High group test-takers were predominately female, DIF was
investigated for the first through the eighth 1000-word frequency level items with the responses provided by the Mid and Low group participants (n = 147). After applying a Holm's Bonferroni adjustment to protect against committing a Type I error, two of the 80 items, 1000-10 (basis) and 3000-9 (rove), exhibited statistically significant DIF, both in
favor of the female examinees. Note that these two items were also the most problematic
in terms of underfitting the Rasch model in the previous analyses. As only two of the 80
items investigated displayed statistically significant levels of DIF, it was concluded that
the majority of items had passed the first test of invariance.
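A compact sketch of the separate-calibration comparison follows; it assumes item difficulties and standard errors have already been estimated separately for the two groups, and the numbers are hypothetical. In the study itself, a Holm's Bonferroni adjustment was additionally applied across the 80 comparisons.

```python
import math

def dif_t(d_group1: float, se_group1: float, d_group2: float, se_group2: float) -> float:
    """Separate-calibration t statistic comparing an item's difficulty across two groups."""
    return (d_group1 - d_group2) / math.sqrt(se_group1 ** 2 + se_group2 ** 2)

# Hypothetical calibrations (logits) for one item in the female and male groups.
t = dif_t(d_group1=-0.40, se_group1=0.21, d_group2=0.35, se_group2=0.24)
print(f"t = {t:.2f}")  # a |t| well above ~2 would flag the item for closer inspection
```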
A second way in which invariance was investigated was by randomly selecting five
items per frequency level from the 140-item test in order to produce two 70-item test
forms. Rasch person ability estimates were then calibrated first with the full 140-item
form and then with the two 70-item forms (Forms A and B), and Pearson correlations were
calculated with the resulting person measures. High correlations are an indication that the
items display a high degree of invariance in the sense that different sets of items tap the
same latent construct and therefore result in similar person ability estimates. The Pearson
correlations between the 140-item form and Forms A and B were identical at 0.98, and the
correlation between Form A and Form B was 0.93 (correlation disattenuated for measurement error = 0.96). All correlations were statistically significant at p = .01 (2-tailed test).
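The disattenuation step is the standard correction for attenuation, dividing the observed correlation by the square root of the product of the two forms' reliabilities. The sketch below reproduces the calculation under assumed reliabilities of 0.92 for each half-test; the reliabilities are assumptions for illustration, not values reported in the article.

```python
import math

def disattenuate(r_observed: float, rel_a: float, rel_b: float) -> float:
    """Correct an observed correlation for measurement error in both sets of measures."""
    return r_observed / math.sqrt(rel_a * rel_b)

# Observed Form A / Form B correlation of 0.93 (reported above) under assumed
# person reliabilities of 0.92 per form.
print(round(disattenuate(0.84, 0.92, 0.92), 2))  # ~0.91 for the residual-split halves
print(round(disattenuate(0.93, 0.92, 0.92), 2))  # ~1.01; values near 1 are capped in practice
```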
Third, a stricter test of invariance suggested by Linacre (2007a) was performed by
splitting the test items into two subtests based on the positive and negative item residual
loadings obtained in Winsteps. Person ability estimates were first obtained using the
items with positive residual loadings and then using those with negative residual loadings. The Pearson correlation between the resulting person measures was r = 0.84 (disattenuated correlation = 0.91). These correlation coefficients indicated a relatively high
degree of measurement invariance, as the person ability measures produced by these two
maximally different sub-tests were similar.
Taken together, the results of the three analyses indicate that various combinations of
VST items will likely display high degrees of invariance and thereby produce similar
person ability estimates. The results also confirm the findings of the PCA of item residuals
reported above given that high degrees of measurement invariance strongly suggest that
the VST primarily measures a single latent variable, which is presumably written receptive vocabulary knowledge.


Table 1. Rasch item reliability and item separation for five forms of the Vocabulary Size Test
Number
of items

Number
Frequency
Test-takers
Rasch item
of items
levels
reliability
per level

140
10
80
10
40 5
40
10
20 5

1st14th
1st8th
1st8th
1st4th
1st4th

ALL
Low and Mid
Low and Mid
Low and Mid
Low and Mid

0.96
0.96
0.96
0.98
0.98

Rasch item
separation
5.22
4.71
4.93
6.25
6.39

Note: The combined Low and Mid groups n-size = 147.

A final way in which the generalizability aspect of validity was investigated was by
showing the degree to which various versions of the Vocabulary Size Test were free of
measurement error in the context of this study. Rasch-based reliability estimates were
produced for five scoring designs: the full 140-item test as well as four shortened test
forms. In two of the test forms, only five items per word frequency level were used. In
both of these cases, five items per frequency level were randomly deleted; thus, no
attempt was made to select optimally performing items.
Because reliability indices are not linear and can encounter ceiling effects (E. V.
Smith, 2001), the Rasch item separation index, which indicates the adequacy of the measures in terms of defining a line of increasing knowledge and which is not subject to
ceiling effects, was also calculated for each test form. As shown in Table 1, all five test
forms were of approximately equal reliability, and for the test-takers in this study even
the shortened forms of the Vocabulary Size Test displayed adequate reliability.

Responsiveness
Responsiveness (Medical Outcomes Trust Scientific Advisory Committee, 1995) refers
to the sensitivity of the measurement instrument to detect change. This is a critical quality
of an effective vocabulary size test as the purpose of the test is either to determine an
individuals standing in relation to other examinees (i.e., a norm-referenced interpretation) or whether an individual has achieved specific abilities, levels, or knowledge (i.e.,
criterion-referenced interpretations). The responsiveness of the Vocabulary Size Test was
investigated by determining person strata, which is calculated using the following equation: Person strata = (4Gperson + 1) / 3, where Gperson is the Rasch person separation statistic: Gperson = SAperson / SEperson, where SAperson is the adjusted person standard deviation and SEperson is the average measurement error. The person strata statistic indicates the number of statistically distinct levels of
person ability on the measurement continuum that are separated by at least three errors of
measurement (Wright and Masters, 2002, p. 888). More strata indicate that an instrument
can distinguish persons into more levels of ability. Person strata for the 197 participants
was 7.15; thus, this version of the Vocabulary Size Test was able to distinguish approximately seven statistically distinct levels of written receptive vocabulary knowledge


among the test-takers in this sample. This indicates a high degree of responsiveness for
this version of the VST as well as its potential to measure examinees who vary widely in
terms of written receptive lexical knowledge and to measure changes in lexical knowledge over long periods of time. It is also worth noting that few second language learners
of English would encounter a ceiling effect with a vocabulary test that extends to the
14,000 word frequency level. This characteristic of the VST would allow it to be used in
nearly any ESL/EFL context.

Interpretability
Interpretability (Medical Outcomes Trust Scientific Advisory Committee, 1995) is the
degree to which qualitative meaning can be assigned to quantitative measures. This is
typically accomplished through the use of norms in norm-referenced tests and cut-scores
in criterion-referenced tests.
Before considering how the person measures produced by the VST can be interpreted,
it is necessary to recognize three limitations of the test. First, because the test is a measure of written receptive vocabulary size, using the test to measure test-takers' listening vocabulary size is not recommended as reading and listening vocabulary sizes can vary considerably. Second, test-takers' responses to the VST provide little indication of how well the test words could be used in speaking and writing tasks. Finally, although vocabulary knowledge is the most important factor affecting the readability of a text (Klare, 1974), test-takers' responses provide only a rough indication of how well they can read,
so the VST should not be viewed as a substitute for a reading test.
In order to arrive at reasonably precise estimates of vocabulary size, learners should be
asked to sit at least two levels beyond their present level because although frequency level
is strongly related to the likelihood of a word being known, other factors, such as part of
speech, word length, and loan word status, are involved in lexical acquisition. In addition,
frequency counts can give differing results depending on the size and nature of the corpus
used. Therefore, a learner with a vocabulary size of 3000 words will know some words beyond this level and will not know some words within or below this level.
The most straightforward way to interpret the VST results is to view each item as
representing 100 word families (assuming that 10 items are used at each 1000-word frequency level). A test-taker's raw score would be multiplied by 100 to obtain their total vocabulary size (up to the fourteenth 1000-word family level). For instance, a low English proficiency test-taker who sat the first 40 items (i.e., the 1000–4000-word frequency levels) and answered 25 items correctly would have an estimated vocabulary size of approximately 2500 words (25 × 100). In this approach, items at the same word
frequency level are seen as being largely interchangeable (i.e., of similar difficulty).
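A literal rendering of this first scoring approach (the 100-families-per-item assumption and the worked example come directly from the text above):

```python
def estimated_size_from_raw_score(raw_score: int, families_per_item: int = 100) -> int:
    """First interpretation approach: each correct item stands for 100 word families."""
    return raw_score * families_per_item

# Worked example from the text: 25 of the first 40 items answered correctly.
print(estimated_size_from_raw_score(25))  # 2500
```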
A second approach to interpreting the VST results, based on item response theory,
involves viewing test items as indicators of the underlying trait being measured (i.e.,
written receptive lexical knowledge of English in this case). In this approach, no strong
assumption is made that items at the same word frequency level are necessarily similar
in terms of difficulty. Interpreting test results using this approach provides a number of
advantages over the first approach: missing responses do not present a problem, idiosyncratic answering patterns by test-takers and idiosyncratic item performance can be

Table 2. Conversion table of Rasch ability estimates and number of word families known

Rasch ability estimate   Number of word families known
89.6                     14,000
72.0                     13,000
67.5                     12,000
64.3                     11,000
61.7                     10,000
59.4                      9000
57.3                      8000
55.2                      7000
53.2                      6000
51.0                      5000
48.6                      4000
45.9                      3000
42.5                      2000
37.5                      1000

identified through the inspection of person and item fit indices, the degree of invariance
of different test forms can be readily identified, and through the use of carefully selected
anchor items, a multitude of test forms can be used to place test-takers on a single continuum of breadth of written receptive vocabulary knowledge, provided that the items fit
the Rasch model and that the property of item invariance holds. Regarding this last point,
this study shows how useful it would be to use different test forms to measure different
groups of test-takers' lexical knowledge. For instance, as noted above, the High group
participants in this study generally had vocabulary sizes between 7500 and 10,000 word
families. Rather than being asked to sit the entire 14-level test, future test-takers in this proficiency group could be administered items ranging from the 6000 to the 12,000 word frequency levels. As some of these items would act as anchors derived from the full 140-item test, the test-takers' ability estimates could be placed on a common scale with participants who took different versions of the test (e.g., appropriate items for the Mid
participants would be from the 2000 to the 7000 word frequency level). The results of
this approach are shown in Table 2 in a simplified Rasch logit-to-word family conversion
chart produced by Winsteps for the data analyzed in this study. As shown in the table, a
test-taker with a Rasch ability estimate of 55.2 would have a vocabulary size of approximately 7000 word families; this finding would hold regardless of the particular test form
that the test-taker completes provided that the test forms are properly anchored.
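Under the anchored-forms approach, converting a Rasch ability estimate to an approximate vocabulary size reduces to a lookup against a conversion chart such as Table 2. The sketch below uses the values reported in Table 2; the linear interpolation between tabled points is an illustrative convenience, not something the article prescribes.

```python
import numpy as np

# Rasch ability estimates and corresponding word-family counts from Table 2.
ability = np.array([37.5, 42.5, 45.9, 48.6, 51.0, 53.2, 55.2, 57.3,
                    59.4, 61.7, 64.3, 67.5, 72.0, 89.6])
families = np.array([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000,
                     9000, 10000, 11000, 12000, 13000, 14000])

def estimated_vocabulary_size(rasch_estimate: float) -> float:
    """Interpolate a Rasch ability estimate against the tabled conversion points."""
    return float(np.interp(rasch_estimate, ability, families))

print(estimated_vocabulary_size(55.2))  # 7000.0, as in the worked example above
print(estimated_vocabulary_size(56.0))  # a value between 7000 and 8000
```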

Conclusion
The primary purpose of this paper has been to present initial validity evidence for one
version of the Vocabulary Size Test using an approach that combines a priori hypotheses
concerning the latent variable, an operational definition of that latent variable, and a


measurement model that produces interval person measures. A carefully designed set of
items was developed in order to measure a single latent construct: written receptive
knowledge of the first 14,000 word families of English. Thus, the primary goal of this
study was to determine whether any of the items failed to act as an accurate measure of
this construct. Three hypotheses concerning item functioning (see Figure 2), test-taker
performance (see Figure 3), and test dimensionality were stated prior to data analysis,
and each of the hypotheses was confirmed. The vast majority of items fulfilled the criteria of good measurement by (1) displaying adequate fit to the Rasch model, (2) contributing to a strong degree of psychometric unidimensionality as shown by the analysis of
item residuals, and (3) showing a high degree of measurement invariance as indicated by
the strongly similar person measures produced by various forms of the test. Furthermore,
the test-takers were measured with a high degree of precision on multiple versions of the
test as indicated by low standard errors and high reliability estimates across the entire
measurement range.
New measurement instruments are useful if they extend the range of measurement
shown by previous instruments and the Vocabulary Size Test does so. It is difficult to
conceive of contexts in which it would be necessary to measure the written receptive
lexical knowledge of second language learners of English beyond the 14,000-word frequency level, as knowledge of the most frequent 14,000 words of English along with proper nouns accounts for over 99% of the running words in written and spoken text
(Nation, 2006).
In this study, because of the cross-sectional nature of data collection, the focus was on
interindividual measurement, that is, the way in which the latent construct varied over
different persons. In future studies of the Vocabulary Size Test, intraindividual change
should be investigated by measuring variation in person measures over time with the
same persons. Indeed, the greatest value of the test will likely be in measuring learners'
progress in vocabulary learning over time.

Acknowledgements
The following people were involved in the making of this test under the direction of Paul
Nation: Vera Humayun, Elizabeth Warren, Denise Worthington, and especially Winifred
Bauer, who did a very large amount of work on the final form of the test items.

References
Bauer L and Nation ISP (1993). Word families. International Journal of Lexicography, 6, 253–279.
Beglar D and Hunt A (1999). Revising and validating the 2000 word level and university word level tests. Language Testing, 16, 131–162.
Bertram R, Baayen R and Schreuder R (2000). Effects of family size for complex words. Journal of Memory and Language, 42, 390–405.
Bertram R, Laine M and Virkkala M (2000). The role of derivational morphology in vocabulary acquisition: Get by with a little help from my morpheme friends. Scandinavian Journal of Psychology, 41, 287–296.
Borsboom D, Mellenbergh GJ and van Heerden J (2004). The concept of validity. Psychological Review, 111, 1061–1071.
Ellis NC (2002). Frequency effects in language processing: A review with implications for theories of implicit and explicit language acquisition. Studies in Second Language Acquisition, 24, 143–188.
Greidanus T and Nienhuis L (2001). Testing the quality of word knowledge in a second language by means of word associations: Types of distractors and types of associations. The Modern Language Journal, 85, 567–577.
Hattie J (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9, 139–164.
Klare GR (1974). Assessing readability. Reading Research Quarterly, 10, 62–102.
Laufer B and Nation P (1999). A vocabulary-size test of controlled productive ability. Language Testing, 16, 33–51.
Linacre JM (1992). Prioritizing misfit indicators. Rasch Measurement Transactions, 9, 422–423.
Linacre JM (1998). Detecting multidimensionality: Which residual data-type works best? Journal of Outcome Measurement, 2, 266–283.
Linacre JM (2007a). A user's guide to WINSTEPS. Chicago: winsteps.com.
Linacre JM (2007b). Dimensionality: Contrasts and variances [online]. Available: http://www.winsteps.com/winman/principalcomponents.htm.
Meara P and Buxton B (1987). An alternative to multiple choice vocabulary tests. Language Testing, 4, 142–154.
Meara P and Jones G (1990). Eurocentres Vocabulary Size Test, Version E1.1/K10. Zurich: Eurocentres Learning Service.
Medical Outcomes Trust Scientific Advisory Committee (1995). Instrument review criteria. Medical Outcomes Trust Bulletin, 3, 1–4.
Messick S (1989). Validity. In Linn RL (Ed), Educational measurement (3rd ed) (pp. 13–103). New York: Macmillan.
Messick S (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
Miller GA and Gildea PM (1987). How children learn words. Scientific American, 257, 94–99.
Nagy WE, Anderson R, Schommer M, Scott JA and Stallman A (1989). Morphological families in the internal lexicon. Reading Research Quarterly, 24, 263–282.
Nation ISP (1983). Testing and teaching vocabulary. Guidelines, 5, 12–25.
Nation ISP (1990). Teaching and learning vocabulary. New York: Newbury House.
Nation ISP (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.
Nation ISP (2006). How large a vocabulary is needed for reading and listening? Canadian Modern Language Review, 63, 59–82.
Rasch G (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut.
Read J (2000). Assessing vocabulary. Cambridge: Cambridge University Press.
Smith Jr EV (2000). Metric development and score reporting in Rasch measurement. Journal of Applied Measurement, 1, 303–326.
Smith Jr EV (2001). Evidence for the reliability of measures and the validity of measure interpretation: A Rasch measurement perspective. Journal of Applied Measurement, 2, 281–311.
Smith Jr EV (2005). Effect of item redundancy on Rasch item and person estimates. Journal of Applied Measurement, 6, 147–163.
Smith RM (2000). Fit analysis in latent trait measurement models. Journal of Applied Measurement, 1, 199–218.
Smith RM, Schumacker RE and Bush MJ (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2, 66–78.
Smith RM and Miao C (1994). Assessing dimensionality for Rasch measurement. In Wilson M (Ed), Objective measurement: Theory into practice, Volume 2 (pp. 316–327). Norwood, NJ: Ablex.
Stenner AJ, Smith M and Burdick DS (1983). Toward a theory of construct definition. Journal of Educational Measurement, 20, 305–315.
Stevens J (2002). Applied multivariate statistics for the social sciences (4th ed). Mahwah, NJ: Lawrence Erlbaum.
West M (1953). A general service list of English words. London: Longman, Green.
Wolfe EW (2004). Equating and item banking with the Rasch model. In Smith Jr EV and Smith RM (Eds), Introduction to Rasch measurement (pp. 366–390). Maple Grove, MN: JAM Press.
Wolfe EW and Smith Jr EV (2007a). Instrument development tools and activities for measure validation using Rasch models: Part I – Instrument development tools. Journal of Applied Measurement, 8, 97–123.
Wolfe EW and Smith Jr EV (2007b). Instrument development tools and activities for measure validation using Rasch models: Part II – Validation activities. Journal of Applied Measurement, 8, 204–234.
Wright BD and Masters GN (2002). Number of person or item strata. Rasch Measurement Transactions, 16, 888.
Wright BD and Stone MH (1979). Best test design. Chicago: MESA Press.
