Reviewed by: Adam W. McCrimmon & Emma A. Climie, University of Calgary, Calgary, Alberta, Canada
DOI: 10.1177/0734282911406646
Test Description
The Test of Written Language—Fourth Edition (TOWL-4), published by PRO-ED, is a newly
updated measure of written language that can be administered individually or in groups to
students aged 9 years, 0 months through 17 years, 11 months. The stated purposes of the measure are to identify students
in need of support or intervention in the area of written language, identify strengths and weak-
nesses in students’ writing abilities, document progress resulting from written language interven-
tions, and provide measurement in written language research.
The TOWL-4 is classified as a Level B measure, and may be administered by psychologists
and nonpsychologists who have undergone formal training in standardized psychoeducational
assessment. The stated administration and scoring time is approximately 60 to 90 min. The
TOWL-4 consists of seven subtests that combine to form two composites (Contrived Writing and
Spontaneous Writing) and an Overall Writing score. Contrived Writing tasks focus on discrete
aspects of written discourse (e.g., spelling, punctuation, word usage) whereas Spontaneous
Writing tasks examine an individual’s functional writing ability (i.e., quality). Scaled scores and
percentiles are provided for subtests and composites, and scoring parameters for each subtest are
provided in the Examiner’s Manual.
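Percentiles for the TOWL-4 come from its own norm tables, but the general relationship between a standard score and a percentile rank can be illustrated with a short sketch. Purely for illustration, this assumes subtest scaled scores follow the common convention of mean 10, SD 3, and composites mean 100, SD 15 (conventions not quoted from the manual), and that scores are approximately normally distributed:

```python
from math import erf, sqrt

def percentile_from_standard_score(score, mean, sd):
    """Approximate percentile rank assuming a normal score distribution."""
    z = (score - mean) / sd
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

# Assumed conventions (illustrative, not from the TOWL-4 manual):
# subtest scaled scores: mean 10, SD 3; composites: mean 100, SD 15.
print(round(percentile_from_standard_score(13, 10, 3)))    # scaled score of 13 -> 84
print(round(percentile_from_standard_score(100, 100, 15)))  # average composite -> 50
```

In practice an examiner would read the percentile directly from the norm tables, which also capture any non-normality in the standardization sample.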
The TOWL-4 kit consists of an Examiner’s Manual, a Supplemental Practice Scoring Booklet,
two Record/Story Scoring Forms (Form A and Form B), two Student Response Booklets (Form
A and Form B), and three colored Picture Cards (a sample card and one card for Form A and
Form B, respectively). The Examiner’s Manual is effectively laid out, beginning with a discus-
sion of the history of the previous editions of the measure and a brief overview of written lan-
guage assessment. Following this, the Examiner’s Manual provides sections on administration
instructions, recording and interpretation, a description of the normative sample, and presenta-
tion of the psychometric properties of the measure, including a small section on controlling for
test bias.
For items receiving a score of 0, 1, or 2, scoring is primarily based on yes/no qualifications (i.e., the
element is present or it is not) or quantifiable data (e.g., 0 = not evident, 1 = 1-2 items evident,
2 = 3+ items evident). The Supplemental Practice Scoring Booklet contains 10 sample stories
and the corresponding scoring for each and is intended to provide examples of scoring for the
benefit of examiners who desire additional practice in scoring these subtests.
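The quantifiable rubric quoted above (0 = not evident, 1 = 1-2 items evident, 2 = 3+ items evident) amounts to a simple thresholding rule. A minimal sketch, with the function name chosen here for illustration:

```python
def score_element(count):
    """Score one quantifiable story element per the quoted rubric:
    0 = not evident, 1 = 1-2 items evident, 2 = 3+ items evident."""
    if count <= 0:
        return 0
    return 1 if count <= 2 else 2

print(score_element(0))  # -> 0
print(score_element(2))  # -> 1
print(score_element(5))  # -> 2
```

Yes/no qualifications reduce to the same rule with a single threshold (present = 1, absent = 0).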
For all subtests, there is an initial sample item followed by a number of test items of increasing
difficulty. Vocabulary is the second task in the administration sequence although it is listed as
Subtest 1. This task requires the examinee to write a sentence containing a specified word. Each
item is scored as either 1 (correct) or 0 (incorrect). For this subtest, as with all subsequent subtests,
examples of each score, as well as rationale, are provided in the Examiner’s Manual.
Subtests 2 (Spelling) and 3 (Punctuation) are administered simultaneously. The examiner reads
a sentence to the examinee, who then writes the sentence in their Response Booklet. Each sen-
tence is then scored either 1 (correct) or 0 (incorrect) for accuracy in spelling and punctuation.
Logical Sentences, Subtest 4, presents the examinee with 22 sentences in the Response Booklet,
each with an incorrect element of logic, such as use of a homonym (e.g., “here” rather than “hear”)
or incongruence of terms (e.g., The shoes were hungry to run).
Subtest 5, Sentence Combining, requires examinees to view several sentences and combine
them into one coherent sentence. Initial items contain two sentences that can be combined
simply by adding the word “and” between them, whereas subsequent items are made more
difficult through the addition of more sentences and the complexity of language required to
effectively combine them.
Technical Adequacy
Development and Standardization
The TOWL-4 norming sample was collected throughout 2006-2007 and consisted of 2,205
children and adolescents ranging in age from 9 years to 17 years, 11 months across 17 American
states. The sample had a relatively consistent number of participants within each age group
(N = 201-294). Demographics were based on data from the 2007 U.S. census (U.S. Bureau of the
Census, 2007), and sample characteristics took into consideration gender, race, ethnicity,
household income, parental education level, geographic region, and exceptionality status of
the child (e.g., learning disability, attention-deficit/hyperactivity disorder, hearing impairment,
speech-language disorder, emotional disturbance, blindness/partial sight, physical impairment,
gifted/talented, and autism spectrum disorder). It should be noted that no Canadian norms
have been created for this measure.
Reliability
Internal consistency. Internal consistency was measured using Cronbach’s coefficient alpha
(examining both Forms A and B) and through administration of alternate forms.
Coefficient alpha scores (across ages) for the subtests were generally acceptable, ranging from
.74 to .92. The three composites (Contrived Writing, Spontaneous Writing, and Overall Writ-
ing) yielded good to excellent coefficient alpha scores of .84 to .96.
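Coefficient alpha is computed from the item variances and the variance of total scores: alpha = k/(k-1) × (1 - Σ item variances / total-score variance). A self-contained sketch on fabricated 0/1 item data (the data are synthetic, not from the TOWL-4 standardization sample):

```python
def variance(xs):
    """Population variance of a list of scores."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def cronbach_alpha(items):
    """items: one list of scores per item, aligned across examinees."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(variance(i) for i in items)
    return k / (k - 1) * (1 - item_var / variance(totals))

# Synthetic 0/1 scores: 3 items x 4 examinees (illustrative only).
items = [[1, 1, 0, 1],
         [1, 0, 0, 1],
         [1, 1, 0, 0]]
print(round(cronbach_alpha(items), 2))  # -> 0.63
```

Values of .74 to .92 for subtests, as reported here, indicate that most total-score variance is shared across items rather than item-specific noise.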
The TOWL-4 provides alternate-forms analysis in both immediate and delayed timeframes.
In general, students who immediately completed both forms of the TOWL-4 demonstrated rela-
tively consistent results, with subtest mean correlations ranging from .74 to .86 on subtest scores
and .82 to .94 on composite scores. With the delayed administration, students were administered
both forms and then readministered both forms 2 weeks later. As with the immediate timeframe,
alternate forms testing was found to be acceptable, with 9 of 10 subtests and composite scores
correlating at .80 or higher.
Test-retest reliability. Test-retest reliability was conducted using a subsample of 84 students in
Texas (aged 9 years-17 years, 11 months). Both forms of the TOWL-4 were administered approx-
imately 2 weeks apart. The standard scores for each of these subtests were correlated, with a
majority (93%) having correlations rounding to .80 or higher and 54% rounding to or exceeding
.90, indicating that there is acceptable test-retest reliability between Forms A and B.
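The test-retest and alternate-forms coefficients reported above are Pearson correlations between the two sets of standard scores. A minimal sketch, using hypothetical Form A/Form B scores for five students (the data are invented for illustration):

```python
def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical standard scores on Forms A and B (illustrative only).
form_a = [95, 102, 88, 110, 101]
form_b = [97, 100, 90, 108, 99]
print(round(pearson_r(form_a, form_b), 2))  # -> 0.98
```

Coefficients at or above .80, the conventional benchmark applied in this review, indicate that the two forms rank examinees in nearly the same order.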
Scorer differences. In the case of tests such as the TOWL-4, clear scoring criteria are important
to reduce the subjectivity of scoring. To examine interrater scoring, 41 protocols were scored
independently by two individuals highly familiar with the scoring criteria. A majority of the sub-
test and composite scores resulted in correlations at or above .90 in magnitude, indicating strong
correlation. The two subtests that did not meet this cutoff were part of the Story Composition
composite, but both coefficients were still within the acceptable range (.80s).
Validity
Content validity. The TOWL-4 manual provides specific details on the rationale for including
each subtest, including consideration of individual items and composite domains. As well, careful
consideration was given to selecting stimulus pictures to ensure that pictures were appropriate,
child-friendly, and recognizable. A substantial review process was undertaken and included the
solicitation of feedback from teachers, university staff, school assessment personnel, and creators
of other assessment measures.
Test creators also took into consideration the possibility of item bias and conducted statistical
analyses to determine whether there may be a bias in item responses based on gender, race, or
ethnic background. Results of the analyses led authors to conclude that the TOWL-4 is within the
acceptable limits regarding item bias.
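The manual does not specify the bias analyses in detail here, but one standard approach to flagging differential item functioning is the Mantel-Haenszel procedure, which compares two groups' odds of answering an item correctly after matching on total score. A sketch of that technique (the procedure and counts below are illustrative assumptions, not taken from the TOWL-4 manual):

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio for one item.
    strata: (ref_correct, ref_incorrect, focal_correct, focal_incorrect)
    counts per matched total-score stratum. A value near 1.0 suggests the
    item functions similarly for both groups."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Fabricated counts for one item across three ability strata.
strata = [(30, 10, 28, 12), (20, 20, 19, 21), (10, 30, 11, 29)]
print(round(mantel_haenszel_or(strata), 2))  # -> 1.08
```

An odds ratio well above or below 1.0 across strata would flag the item for review as potentially biased.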
Criterion-prediction validity. A subset of participants was given both the TOWL-4 and
another standardized measure designed to assess a similar construct (e.g., reading or writing)
to determine the ability of the TOWL-4 to effectively predict an individual’s writing and read-
ing performance. Test scores on the TOWL-4 were correlated with performance on the Written
Language Observation Scale (WLOS; Hammill & Larson, 2009), the Reading Observation
Scale (ROS; Wiederholt, Hammill, & Brown, 2009), and the Test of Reading Comprehension—
4th edition (TORC-4; Brown, Wiederholt, & Hammill, 2009). Analyses revealed no significant
differences between mean scores on the TOWL-4 composite scores and those on the WLOS,
ROS, and TORC-4, indicating consistent performance of the participants between these tests.
Construct validity. A three-step process was undertaken to examine the TOWL-4’s ability to
accurately measure an individual’s writing ability. First, authors identified several constructs pre-
sumed to underlie writing performance. Second, using these constructs, a number of hypotheses
were created. Finally, these hypotheses were tested empirically. The first hypothesis, that
writing ability should be related to age, was supported through correlational
analyses. As well, consistent with the second hypothesis, moderate correlations (.31 to .70) were
found between subtests, indicating that the subtests were correlated but not so closely related that
they were redundant. The third hypothesis predicted that writing ability should correlate with intel-
ligence. Using the Wechsler Intelligence Scale for Children—4th Edition (WISC-IV; Wechsler,
2003) and the Comprehensive Tests of Nonverbal Intelligence (CTONI; Hammill, Pearson, &
Wiederholt, 1996), overall correlations found a moderately strong relationship between the
TOWL-4 and WISC-IV (.53 to .75) and an acceptable correlation between the TOWL-4 and the
CTONI (.36 to .58). Finally, it was predicted that performance would differ between students
of average writing ability and those known to be poor or adept writers. Indeed, atypical
populations scored significantly lower (e.g., students with learning disabilities) or higher
(e.g., gifted/talented students) than typically developing students.
References
Brown, V. L., Wiederholt, J. L., & Hammill, D. D. (2009). Test of Reading Comprehension (4th ed.). Austin, TX:
Hammill Institute on Disabilities.
Hammill, D. D., & Larson, S. C. (2009). Written Language Observation Scale. Austin, TX: Hammill Institute
on Disabilities.
Hammill, D. D., Pearson, N., & Wiederholt, J. L. (1996). Comprehensive Test of Nonverbal Intelligence.
Austin, TX: PRO-ED.
U. S. Bureau of the Census. (2007). Statistical Abstract of the United States (126th ed.). Washington, DC:
Author.
Wechsler D. (2001). Wechsler Individual Achievement Test (2nd ed.). San Antonio, TX: Psychological
Corporation.
Wechsler, D. (2003). Wechsler Intelligence Scale for Children (4th ed.). San Antonio, TX: Psychological
Corporation.
Wiederholt, J. L., Hammill, D. D., & Brown, V. L. (2009). Reading Observation Scale. Austin, TX: Hammill
Institute on Disabilities.
Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III Tests of Achievement.
Itasca, IL: Riverside.