Linda L. Liu, David H. Uttal, Loren M. Marulis, Alison R. Lewis, and Christopher M. Warren
Northwestern University
Nora S. Newcombe
Temple University
This work was supported by the Spatial Intelligence and Learning Center (NSF Grant ). We
thank Spyros Konstantopoulos and Larry Hedges for their help. Send correspondence to
David Uttal (duttal@northwestern.edu) or Nora Newcombe (newcombe@temple.edu).
Training of spatial ability 2
Abstract
We meta-analyzed 113 research studies that attempted to improve spatial reasoning. Our goals
were to determine the extent to which spatial skills can be improved by training, to identify
factors that moderate training effects, and to ascertain the durability and generalizability of
training effects. We found that effect sizes were affected substantially by the presence and type
of control or comparison groups. After treatment group improvement was separated from control
group improvement, the mean effect size for treatment was g = .75 (SE = .03). Treatment group
effect sizes did not differ for children (g = .70, SE = .05) and adults (g = .77, SE = .03) or for
males (g = .58, SE = .06) and females (g = .59, SE = .06). Training effects were stable; effect
sizes were not affected by delays between training and posttesting. Although the magnitude of
transfer was not as great as the magnitude of original training, there was evidence of both near
and far transfer. Considered together, the results suggest that spatially
enriched education could pay high dividends in terms of improved participation in mathematics,
The central question in the study of development is how a mature form arises from initial
departure points, and what influences the course of such development. Related questions include
the origins and determinants of individual variation, and the existence of sensitive periods for
influencing development for better or for worse (Bornstein, 1989). Although spirited debate on
these matters continues, there has been an increasing emphasis on malleability in the
neuroscience of development (e.g., Johnson, Munakata & Gilmore, 2002; Shonkoff & Phillips,
2000). Relatedly, there has been a focus on education that maximizes human potential and
reduces inequality, both in preschool children (e.g., Heckman & Masterov, 2007) and for a
variety of subjects taught in school, such as reading (e.g., Rayner, Foorman, Perfetti, Pesetsky &
Seidenberg, 2001) and mathematics (e.g., National Mathematics Advisory Panel, 2008).
Spatial skills are not a school subject, but they have been shown to be an important
predictor of students’ interest and success in science, technology, engineering and mathematics
(STEM; Hedges & Chung, in prep; Humphreys, Lubinski & Yao, 1993; Shea, Lubinski &
Benbow, 2001). Building on such considerations, the National Research Council (2006) recently
published a report, Learning to Think Spatially, which emphasized the importance of spatial
thinking in science and mathematics education and called for educators to incorporate
Efforts to improve spatial skills are predicated on the assumption that spatial
skills are in fact malleable. Yet, the fundamental question of whether long-lasting improvements
in spatial skill can be attained through training has yet to be resolved conclusively. Diverse
claims have been made regarding the effectiveness of spatial training. Some investigators have
argued that training spatial performance leads only to fleeting improvement, limited to cases
where there is a high degree of similarity between the tasks trained and the outcome measures of
interest (Eliot, 1987; Eliot & Fralley, 1976; Maccoby & Jacklin, 1974; Sims & Mayer, 2002). In
fact, the recent NRC report questioned the generality of training effects and concluded that
transfer of spatial improvements has not been convincingly demonstrated. The report called for
(Learning to Think Spatially, 2006). Therefore, we tested the
hypothesis that training, education, or life experience can improve spatial skills. In particular, we
address four main questions about spatial training, using a meta-analysis of existing training
First, we began simply by examining the magnitude of the improvements that can be
obtained, broken down by the type of spatial skill assessed and subsequently by variables such as
duration of training and study design. We grouped major spatial dependent variables into a set of
conceptual categories, and examined the impact of coded study characteristics (e.g., training
duration, frequency, type of control group) for each category of dependent measure. If the sizes
of training effects are heavily dependent on the spatial measures chosen in training studies, then
the effect sizes of these conceptual categories might vary a great deal. On the other hand, these
conceptual categories might not vary in malleability, yielding effect sizes that are similar in
magnitude. Finding the latter could suggest either that training exerts a general, as opposed to
task-specific, influence on spatial ability or that differences in study characteristics (e.g., training
duration, study methodology) are stronger determinants of the size of training effects than are the
generalizable effects, then we should expect performance to improve not only on tasks directly
trained but also on transfer tests—measures that were not directly trained but that were
administered along with the trained task to assess whether there was transfer to related skills. We
should expect some limits of this transfer, with the magnitude of training effects being larger for
near transfer, when the training and reference tests are similar, dropping off as training and
Third, we addressed the question of durability. Are training-related gains maintained and,
if so, for how long? We ascertained the durability of training and transfer by estimating the point
at which, if ever, training effects declined to pretest levels. We performed these analyses in
two ways: 1) across studies, comparing the effects of posttesting after the different intervals of
time used in different studies, and 2) within studies, for those studies that included both
Fourth, we analyzed whether training effects are more pronounced for some groups than
for others. For example, it is often argued that females should improve more with training
than males because they have been more deprived of spatial experience (e.g., Sherman, 1967);
however, a prior meta-analysis found that males and females improved in parallel (Baenninger &
Newcombe, 1989). It might also be predicted that children would improve more than adults,
either because they are lower in spatial skill and hence have more to gain, or are in a sensitive
period that has closed by adulthood. Across training studies, do children, in fact, show larger
effects of training compared to adults? What is the impact of other grouping variables, such as
In summary, despite the high volume of research investigating the impact of various
training interventions on spatial outcomes, the field lacks a systematic and comprehensive
analysis of training effects for a variety of spatial skills for specified groups, and, especially,
lacks an accurate assessment of durability and transfer. In this paper, we examine the questions
of what spatial outcome measures are most and least amenable to training as a function of the
various types of training used to improve each spatial outcome and address how the effects of
training are moderated by individuals’ pre-existing levels of performance. We examine the effect
of study characteristics such as participant screening criteria, method of training, and frequency
of training as a means for accounting for the variability that exists in the magnitude of training
effects. We hope to shed light on the controversy surrounding what constitutes the most
appropriate methods for training different spatial skills as well as questions regarding the
durability and generalizability of spatial training effects. Ultimately, we hope to be able to make
informed recommendations to educators about the most appropriate ways to train spatial skills, to
establish guidelines regarding the extent to which different types of spatial skills are typically
improved with training, and to inform the design of educational interventions to help improve
Method
Eligibility Criteria
Several criteria were used to determine whether a study would be included in this meta-
analysis:
1. The study included at least one spatial outcome variable. Examples include, but are
reaction time on a spatial task (e.g., mental rotation or finding an embedded figure), or
2. The study used training, education, or another type of specific intervention that was
either the study included a control group that did not receive the training, the study
obtained before the intervention was given, or the study compared the effects of
training on pre-existing groups (e.g., engineering and liberal arts students) that were
4. The study focused on a non-clinical population. Thus, we excluded studies that used
spatial training to improve spatial skills after brain damage or to ameliorate the
risk populations.
(Educational Resources Information Center) through May 31, 2007. The search included foreign-language
articles if they included an English abstract. The goal was to perform a
comprehensive screening of all studies reporting on the effects of spatial training so that each
AND (spatial OR visuospatial OR geospatial). This search resulted in 788 hits. We read the
abstracts of these articles to determine whether they met the criteria described above. To ensure
that these decisions were made reliably, two researchers read through 155 (20%) randomly
selected abstracts, rated their eligibility, and reached consensus through discussion for any
group consisting of the two raters and two additional raters. This process resulted in 158
We also contacted experts and authors in the field for any published and
unpublished data (of their own or that of their colleagues) and relevant references. We sent out
approximately 205 inquiries, and received 34 responses (17%), which yielded 43 manuscripts.
Two researchers independently read each of these additional studies, again rating the relevance
of the articles on the criteria mentioned above. In total, 201 abstracts were acquired and reviewed
studies were deemed relevant after being read in full by at least two researchers. The level of
agreement between the two raters on their decisions for inclusion and exclusion was Cohen’s kappa
= .74, which is “substantial” as defined by Landis and Koch (1977) and just below the .75 cut-off
of “excellent” defined by Capozzoli, McSweeney and Sinha (1999). A review of the
reference lists of these articles yielded another 14 relevant articles. In all, the review process
produced a sample of 113 (i.e., 99 + 14) studies. These included articles published in scientific
also included several non-English manuscripts, which translators familiar with
the psychology literature translated from Korean, Dutch, Romanian,
We took steps to avoid the “file drawer problem” (Rosenthal, 1979), which potentially
could inflate our estimate of the magnitude of training effects. Because studies that find large
effects of training are more likely to be published than those that report little or no effect
(Rosenthal, 1979), we sought to reduce publication bias by increasing our access to
unpublished work. First, when we wrote to authors and experts, we explicitly asked them to
include unpublished work. Second, we searched the reference lists of our articles for relevant
unpublished conference proceedings, and we also looked through the tables of contents of any
recent relevant conference proceedings that were accessible online. Third, our search of
Dissertation Abstracts International yielded many unpublished articles, which were included
when they were relevant. For dissertations that were eventually published, we examined both the
published article and original dissertation. We augmented the data from the published article if
the dissertation contained additional, unpublished data that were relevant to our objectives.
In some cases, authors did not provide sufficient information in the papers to allow us
to calculate effect sizes. To address this problem, we contacted the authors. For example, in
some cases, we requested separate means for control and treatment groups when only the F- or t-
statistics summarizing significant group differences were reported. Authors provided usable data
in approximately 20% of these cases and we used these data to compute effect sizes, separately
for males and females and control and treatment groups whenever possible.
The data from each study were entered into the software program Comprehensive Meta-Analysis
(CMA; Borenstein, Hedges, Higgins, & Rothstein, 2005). The program accepts multiple
types of data input, including not only means and standard deviations but also categorical data,
odds ratios, etc. Measures of effect size typically quantify the magnitude of gain associated with
a particular treatment relative to the improvement observed in the control group. Gains can be
conceptualized as an improvement in score and effect sizes can also be computed from an F
statistic, t statistic or chi-square value as well as from change scores representing the difference
in mean performance at two points in time. Thus, in some cases, it was possible to obtain effect
sizes without having the actual mean scores associated with a treatment (See Schmidt and
All effect sizes were expressed as Hedges’ g, a slightly more conservative derivative of
correction for biases due to sample size, weighting each effect size by the standard error of the
effect size so that less precise estimates are given less weight in the analyses. Hedges’ g is
computed as:
$$g = \frac{M_{\text{treatment}} - M_{\text{control}}}{\sqrt{MSE_{\text{within S's}}}}$$
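As a minimal sketch of the computation above, the following Python function computes a standardized mean difference from group means, standard deviations, and sample sizes, and then applies a small-sample correction. Several pieces are assumptions made for illustration: the pooled within-group standard deviation stands in for the square root of the within-subjects mean square error, the correction factor J = 1 - 3/(4df - 1) is the common approximation, and the function name and example values are hypothetical. This is not a description of the procedure implemented in CMA.

```python
import math


def hedges_g(m_treat, m_ctrl, sd_treat, sd_ctrl, n_treat, n_ctrl):
    """Return (g, se) for a treatment-vs-control comparison.

    Assumption: the pooled within-group SD is used as the denominator,
    in place of sqrt(MSE within S's) from the formula in the text.
    """
    df = n_treat + n_ctrl - 2
    pooled_var = ((n_treat - 1) * sd_treat ** 2 + (n_ctrl - 1) * sd_ctrl ** 2) / df
    d = (m_treat - m_ctrl) / math.sqrt(pooled_var)

    # Small-sample correction factor (common approximation, assumed here).
    j = 1.0 - 3.0 / (4.0 * df - 1.0)
    g = j * d

    # Approximate sampling variance of d, scaled by j**2 to apply to g.
    n_total = n_treat + n_ctrl
    var_d = n_total / (n_treat * n_ctrl) + d ** 2 / (2.0 * n_total)
    se = math.sqrt(j ** 2 * var_d)
    return g, se


# Hypothetical example values, not taken from any study in the sample.
g, se = hedges_g(m_treat=24.0, m_ctrl=20.0, sd_treat=5.0, sd_ctrl=5.0,
                 n_treat=30, n_ctrl=30)
print(f"g = {g:.3f}, SE = {se:.3f}")
```

The standard error returned here is what an inverse-variance weighting scheme would use, so that less precise estimates receive less weight.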
We used a random effects model in order to ensure that our results can be generalized
beyond the studies selected and to increase the likelihood that we can apply our inferences more
Random effects models are used when there is reason to suspect that variability would not be due
solely to sampling error (Lipsey & Wilson, 2001). Given the broadness and complexity of
spatial cognition, the studies we included differed in several ways beyond the
effects of sampling, making it unlikely that all sources of variance could be accounted for by a
single model. Using random effects allowed stronger and wider generalizations to be made.
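To make the random effects idea concrete, here is a minimal sketch of a random-effects weighted mean in Python. It assumes the DerSimonian-Laird method-of-moments estimator for the between-study variance; the text does not say which estimator the analyses used, so this choice, the function name, and the example inputs are all assumptions for illustration.

```python
def random_effects_mean(effects, variances):
    """Return (mean, se, q, tau2) under a random-effects model.

    Assumption: between-study variance (tau2) is estimated with the
    DerSimonian-Laird method of moments.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * g for wi, g in zip(w, effects)) / sum_w

    # Homogeneity statistic Q: weighted squared deviations from the
    # fixed-effect mean.
    q = sum(wi * (g - fixed_mean) ** 2 for wi, g in zip(w, effects))

    # Method-of-moments estimate of between-study variance, floored at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Re-weight with tau2 added to each study's sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    mean = sum(wi * g for wi, g in zip(w_star, effects)) / sum(w_star)
    se = (1.0 / sum(w_star)) ** 0.5
    return mean, se, q, tau2


# Hypothetical effect sizes and sampling variances, for illustration only.
effects = [0.2, 0.5, 0.8, 1.1]
variances = [0.04, 0.04, 0.04, 0.04]
mean, se, q, tau2 = random_effects_mean(effects, variances)
print(f"mean = {mean:.2f}, SE = {se:.3f}, Q = {q:.2f}, tau^2 = {tau2:.2f}")
```

Because the between-study variance is added to every study's sampling variance, the random-effects standard error is never smaller than its fixed-effect counterpart, which is the price of generalizing beyond the studies sampled.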
Effect sizes were meta-analyzed using SPSS 16.0 macros (Wilson, 2002). We used
Wilson’s (2002) SPSS macros to calculate mean effect size (MEANES) and to
multiple regression (METAREG) for effect sizes. The macros
allowed us to test for both simple effects and interactions among grouping variables (e.g., study
characteristics) on effect size. Our sample before trimming consisted of 113 manuscripts (87
published and 26 unpublished) with relevant, available and usable data, yielding 634 effect sizes.
We coded the methods and procedures used in each study. The coding scheme addressed the
following characteristics of each study: the sample, spatial measures used, study design,
nature of control group, and details about the procedure such as length and frequency of
training and the delay between the end of training and the posttest.
Both outcome measures and methods of training were classified into categories to
facilitate generating conclusions about groups of studies. We defined five categories of outcome
and mental rotation. As summarized in Table 1, these five categories of outcome measures
overlapped in part with the three categories that Linn and Petersen (1985) used in
their meta-analysis of gender differences in spatial skill. Our categories also map onto three
of the five factors established by Carroll’s (1993) extensive re-analysis of individual differences
measuring Visuospatial Perceptual Speed (Carroll, 1993) or Spatial visualization (Linn &
categories used by both sets of researchers. Mental rotation, a category also included in Linn and
Petersen’s classification, was designated as Spatial Relations (also called Speeded Rotation) by
Carroll (1993).
We also included two additional categories. We added the category of Spatial principles
because this domain represents a distinct and highly specific skill that was not
included in the factor-analytic studies reanalyzed by Carroll (1993), as it grew out of a different
(Piagetian) research tradition; this category also corresponds to what Linn and Petersen (1985)
referred to as spatial perception. We also included Perspective taking, which was not included in
Linn and Petersen’s analysis. Perspective taking also grew out of the Piagetian tradition and has
been shown to be distinct from mental rotation even though the two can be considered
computationally equivalent (Hegarty & Waller, 2004; Huttenlocher & Presson, 1973, 1979).
Seven studies that reported the effects of training on an entire spatial test battery without
providing means for the individual subtests within the battery (making it impossible to calculate
separate effect sizes for the spatial components represented by the subtests) were excluded.
classified studies by method of training using the flowchart shown in Figure 1. Coders read the
Method and Results sections of each study and evaluated the training procedure used to obtain
each effect size. Coders then classified whether each effect size was the result of training using a
course, videogame, or a spatial task (either specific practice involving the outcome measure of
interest or transfer to an untrained spatial task). Within the course category, we distinguished
studies in which enrollment in a course was the sole manipulation from those studies that
compared the effects of an enhanced course to the typical version of the course. The former were
typically a semester in length whereas the latter were conducted in a shorter length of time within
a semester. Within instances of videogame training, we coded whether the videogame focused on
Finally, we also coded more general details about the studies themselves, such as
publication year, country of origin, and socioeconomic status of country. Two trained coders
coded all 113 studies and inter-rater agreement was 95% on categorical measures and 91% on
continuous measures.
publication bias. Five studies that reported very extreme effect sizes (some as large as 8.33)
were excluded because they were conducted with participants from significantly underprivileged
countries who might have been expected to have very little experience with testing or spatial
testing in general. These five studies differed significantly from the rest of the sample in terms of
mean effect size (the mean, g = 1.71, SE = .07, k = 93, was more than twice the group mean of
the remaining sample, g = .65, SE = .03, k = 509), Q (1, 601) = 187.99, p < .001. 2 The countries
represented in these studies were also ranked significantly lower according to the Human
Development Index, a composite of standard of living, life expectancy, well-being and education
that provides a general indicator of a nation’s quality of life (Human Development Report,
2007/2008). The three countries represented in these five studies, Papua New Guinea, Bahrain
and Nigeria, were ranked 145, 41 and 158, respectively, compared to an average ranking of 12.19
for the remaining 101 studies in our sample. Because studies focusing on more underprivileged
populations typically report effect sizes many times larger than those obtained from more typical
populations (REFS), the inclusion of these studies would lead to an exaggerated view of the
malleability of spatial training. Consistent with this view, when these studies were included, we
found a significant correlation in our sample between HDI ranking and effect size,
Spearman’s rho (df = 600) = .30, p < .001, suggesting that the inclusion of these studies from
very low HDI ranking countries would inflate the overall effect size. Thus, these studies were
We also took steps to curb the effects of less extreme outliers. We Winsorized effect sizes
that were more than 4 SD (SD = .475) above the mean of the remaining sample by
capping them at 4 SD above the mean. In other words, the value of any effect size greater than
2.55 was reset to 2.55. We did this for a total of 14 effect sizes. Winsorizing had a
negligible effect on the mean effect size of the overall sample (unadjusted g = .65, SE = .03 vs.
Winsorized g = .64, SE = .02) but had the desired effect of curtailing the value of the effect
sizes at the upper extreme (Figure 2). This recoding of extreme values also reduced their impact
on the mean effect size of the subgroups considered in the next section. All analyses reported
below were conducted on the Winsorized values.
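The capping rule described above can be sketched in a few lines of Python. The mean (.65) and SD (.475) are the values reported in the text, and 4 SD above that mean gives the stated ceiling of 2.55. The function name and the list of effect sizes are hypothetical, and only the upper tail is capped, following the description.

```python
def cap_upper(effect_sizes, mean, sd, n_sd=4.0):
    """Cap values above mean + n_sd * sd at that ceiling.

    Returns (capped_values, ceiling, number_of_values_capped).
    Only the upper tail is adjusted, as described in the text.
    """
    ceiling = mean + n_sd * sd
    capped = [min(g, ceiling) for g in effect_sizes]
    n_capped = sum(1 for g in effect_sizes if g > ceiling)
    return capped, ceiling, n_capped


# Mean and SD are the values reported in the text; the effect sizes
# below are hypothetical and only illustrate the rule.
capped, ceiling, n_capped = cap_upper([0.10, 0.64, 2.40, 3.10, 5.00],
                                      mean=0.65, sd=0.475)
print(f"ceiling = {ceiling:.2f}, values capped = {n_capped}")
```

Capping rather than deleting extreme values keeps every effect size in the sample while limiting how far any single one can pull a subgroup mean.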
studies, we were concerned that the mean effect size of the sample could be biased upward due
to a publication bias toward statistically significant findings (Lipsey & Wilson, 1993). To
measure the size of this potential publication bias, we first compared the average effect size of
published (g = .66, SE = .03) and unpublished (g = .58, SE = .04) studies in our sample and
found only a marginally significant difference, Q (1, 507) = 2.71, p = .10. To ascertain whether
this difference was large enough to affect our conclusions, we also calculated the Fail-safe n,
which estimates the number of studies with a null outcome (i.e., the number of studies reporting
g = 0) that would be required to render the overall mean effect size to be negligible in magnitude.
For the purposes of calculation, we set the value for a negligible effect size to be .10, which is
smaller than Cohen’s (1979) criterion for a low effect size. Thus, we were able to use Orwin’s
(1983) formula to calculate the fail-safe n for each partition. According to Orwin’s formula K0 =
k [ESk/ESc - 1], where ESk is the mean effect size, ESc is a mean effect size judged to be
negligible in magnitude (in this case, .10), and K0 is the fail-safe n, or the number of studies with
null results that would render the mean effect size to be negligible (i.e., equal to .10). This
analysis revealed that there would need to be 2794 “file drawer” studies reporting
an effect size of zero to reduce the mean effect size of the sample to .10. Taken together, these
results suggest that our sampling procedure produced an adequate and representative sample of
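Orwin's formula as stated above is simple enough to check directly. The sketch below assumes that k is the number of effect sizes in the trimmed sample (509) and uses the rounded, unadjusted mean of .65 from the text; the function name is hypothetical. With these rounded inputs the result is roughly 2,800, close to the 2794 reported, and the small difference presumably reflects rounding of the mean effect size.

```python
def orwin_failsafe_n(k, mean_es, criterion_es):
    """Orwin's fail-safe N: K0 = k * (ES_k / ES_c - 1).

    k            -- number of effect sizes in the sample
    mean_es      -- observed mean effect size (ES_k)
    criterion_es -- effect size judged negligible (ES_c), must be > 0
    """
    if criterion_es <= 0:
        raise ValueError("criterion_es must be positive")
    return k * (mean_es / criterion_es - 1.0)


# k = 509 and the criterion of .10 come from the text; .65 is the rounded
# mean reported there, so the result only approximates the published value.
k0 = orwin_failsafe_n(k=509, mean_es=0.65, criterion_es=0.10)
print(f"fail-safe N is approximately {k0:.0f}")
```

The formula makes the logic visible: the more the observed mean exceeds the negligible criterion, the more null results would be needed to dilute it.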
Our final sample consisted of 101 studies with 509 effect sizes: 76 (75%) that were
published in journals and 25 (25%) from dissertations, unpublished data or conference papers. In
our sample, 85 studies (84%) were conducted in the United States. The characteristics of these
Even with the outliers removed, the sample was highly heterogeneous (Q = 2351.57, df =
508, p < .0001), indicating that the effect sizes included in our sample differed in systematic
ways that might be uncovered by partitioning them into smaller groups. We followed the
procedure described by Hedges and Becker (1986) of using study descriptors to create effect size
partitions for reducing heterogeneity. The goal was to devise subgroups of interest in which the
main source of variability is sampling error (Lipsey & Wilson, 2001). Put another way, Voyer et
al. (2007) conceptualized homogeneous clusters as groups of studies that
Therefore, we followed the method of Voyer, Voyer and Bryden (1995) and Voyer, Postma,
Brake and Imperato-McGinley (2007) of relaxing the criterion for homogeneity: rather than
requiring p > .05 on the test for homogeneity, we also accepted p values that were less than
.05 but greater than .005. As these authors noted, homogeneity can be
difficult to achieve within meta-analyses. By this convention, a sample of effect sizes whose
homogeneity statistic has a significance level p > .005 would be considered statistically
homogeneous. Partitioning had the desired effect of reducing heterogeneity, although complete
homogeneity was difficult to attain, a common obstacle reported in meta-analyses (Voyer et al.
2005). Because the amount of variability depends on the underlying literature, it is particularly
difficult to attain homogeneity with highly diverse samples of research. To determine the
appropriate partitions to make sense of the variation in effect sizes, we identified variables that
Research designs. An effect size indicating the success of a single training intervention is
typically expressed as the extent to which the treatment group outperforms the control group
(noted as E vs. C, or Ec). In other words, this effect size summarizes the impact of training on
spatial skill measured either as a pretest to posttest change, as a between subjects comparison of
design.
When multiple training interventions are compared across different studies, however, Ec
can be difficult to interpret. Given that Ec reflects the improvement of trained groups relative to
control groups, if control groups also improve to varying degrees, whether Ec is large because of
To illustrate, in the studies sampled here, we found that the presence of a control group
had a significant effect on the magnitude of Ec. Comparing the mean effect sizes for the three
categories of study design, pretest-posttest with control, treatment vs. control only, and pretest-
posttest only, we found that mean effect sizes differed significantly depending on whether the
effectiveness of a treatment intervention was evaluated relative to a control group, Q (2, 507) =
17.88, p < .001. A post-hoc comparison, using an adjusted alpha of .01 to reduce the Type I error
rate, revealed that studies that used a pretest-posttest only design, in which no control group
offsets the training-related gains of the treatment group, reported the highest mean Ec overall (g =
.85, SE = .07, k = 47). The Pretest-posttest only group was significantly higher than both the
Treatment vs. Control group (g = .50, SE = .05, k = 127, p < .001) and the Pretest-posttest with
control group (g = .64, SE = .03, k = 334, p < .01), which were also marginally different from
each other (p < .05). Thus, studies that did not include a control group tended to report higher
We also tested whether the nature of the control group manipulation affected the
magnitude of control group improvement by comparing effect sizes by type of control group (No
The significance of this result reflected only the difference between studies with or without
control groups, Q (4, 507) = 16.74, p < .01, with studies with no control group yielding
significantly higher Ec effect sizes than those with a control group. There were no other
against the backdrop of the gains experienced by the control group. To isolate the magnitude of
treatment-related gains, we focused our analyses on the studies in which it was possible to
calculate effect sizes for the treatment and control group separately. Before doing so, we tested
whether the mean effect size for the “inseparable” studies (those for which treatment and
control effect sizes could not be calculated separately) differed
significantly from the “separable” studies on which we planned to base our analyses. In fact,
there was no significant difference between the Ec effect sizes from the inseparable studies, M =
.70, SE = .05, k = 79, and the Ec effect sizes from the separable studies, M = .62, SE = .04, k =
256, Q (1, 334) = 2.81, p > .09. This result suggests that the nonseparability of the studies was
likely due only to a difference in preference for how the data were reported and not due to
the improvements made by treatment and control groups separately (Wilson, Lipsey, & Derzon,
2003).
Among the 55 studies that used a pretest-posttest with control design and provided
sufficient data for this analysis, treatment groups (g = .75, SE = .03, k = 246) improved
significantly more than control groups did, g = .56, SE = .03, k = 224, Q (1, 453) = 23.38, p <
.001. In other words, when treatment effect sizes are isolated from the improvements made by
the control groups, we can conclude that training improves spatial skills by an average of .75, or
¾ of a SD. Our remaining analyses focus on these separable studies, but we retained the
inseparable studies to be used in specific comparisons when appropriate. The remainder of the
paper focuses on the factors that moderate this improvement of treatment groups alone (as
opposed to treatment relative to control), and we will refer to this improvement as treatment
effect size. Mean effect sizes for each study, along with key characteristics of each study, are
summarized in Appendix A.
Analysis plan
Our analysis plan addressed our four major questions. First, does average effect size
differ for different categories of spatial outcome measures and how do they rank against one
another in terms of their malleability to training? Second, what is the magnitude of the test-retest
effect for spatial training and how large are the gains attributable to training, above and beyond
the retesting effect? Third, what is the duration of training effects?
Are training-related gains maintained and for how long? Fourth, to what degree
do individuals starting at different spatial skill levels benefit from spatial training?
To what extent are some spatial measures more malleable than others? In other words,
does the size of training effects depend on the outcome measure? To what extent
is malleability attributable to the effects of repeated practice? How much does training
transfer to untrained tasks? These questions have important implications for the design of
educational interventions, in which it is necessary to know both the magnitude of gains that can
be expected and the generality of the benefits that can be expected with training.
To answer these questions, we first drew upon the largest number of studies that could be
compared by using Ec as the measure of effect size. This focus allowed us to make a
comprehensive determination of whether there were large differences in malleability among the
transformation, spatial principles, and mental rotation. Overall, we found statistically significant
differences in mean effect size (Ec) by outcome measure, Q (4, 508) = 12.68, p < .05. Post-hoc
comparisons of the Ec effect sizes (with alpha reduced to .01 to control the Type I error
rate) suggested that spatial principles showed the largest gains with training while studies of
spatial perception showed the smallest gains with training (p < .01).
We also analyzed the treatment and control group effect sizes separately to eliminate the
ambiguity of evaluating effect sizes relative to control groups. Among the smaller sample of 55
studies with complete data, we still found significant differences by outcome measure, Q (4, 245)
= 11.20, p < .05. However, as shown in Table 3, the ordering of effect sizes revealed a different
pattern once the effect of control groups was removed and only the effect of treatment was
considered. In fact, although spatial perception studies yielded the lowest mean Ec effect size of
all outcome measures, the mean effect size for spatial perception treatment groups was the
highest of all groups, significantly higher than mental rotation (p < .01), with no other pairwise
Across the five outcome measures, there were also significant differences in the pretest to
posttest gains of the control groups, Q (4, 223) = 25.80, p < .001. Spatial perception and
Assembly/transformation control groups improved the
.05) also showed significantly larger gains than mental rotation control groups did, M = .51, SE =
.04, p < .01. On the other hand, control groups for spatial principles (M = .18, SE = .10) showed
significantly smaller gains than all other control groups (ps < .01). No other pairs were
significantly different. These results suggest that spatial perception measures, in fact, yielded
the largest gains with training but that these gains may have been underestimated because of the
relatively large gains made concurrently by the control groups. In contrast, the spatial principles
treatment groups showed relatively modest gains with training, but when gains were evaluated
relative to control groups, spatial principles appeared to be highly malleable because the control
If differences in mean effect size between spatial perception and spatial principles were
primarily the result of differences in control group improvement, then we would also expect the
effect size onto Group (treatment vs. control), Outcome measure (spatial perception vs. spatial
principles), and Group x Outcome. The results were consistent with the apparent difference in
malleability being the result of differential control group improvement. As shown in Figure 3, we
found a significant interaction between group and outcome measure, providing strong evidence
that the difference between treatment and control group improvement was smaller in studies of
spatial perception than in studies of spatial principles. Thus, the apparent difference in
malleability between spatial perception and spatial principles is not due to the effects of training
but is, instead, primarily the result of differences in the improvement of control groups, with spatial
perception control groups improving a great deal and spatial principles control groups improving
very little. In sum, our results suggest that spatial perception is, in fact, highly malleable to
training, as evidenced by the high gains observed in its treatment groups, while spatial principles
is significantly less malleable. Both of these results are consistent with our claim that control
groups moderate the size of training effects observed for different outcome measures. These
results also illustrate what is emerging as an important theme of this meta-analysis:
control groups play an important moderating role in the size of training effects.
Why is it that control groups for spatial perception improved to a larger extent than
control groups for spatial principles? One possibility is a greater proportion of “reactive”
manipulations being used for some outcome measures than others. To operationalize reactivity,
we rank ordered the different types of control groups by effect size to determine which control
group manipulations were associated with higher and lower effect sizes. Those yielding higher
effect sizes were judged to be more reactive. For example, spatial perception control groups
might show larger effect sizes because a larger proportion of them receive reactive
manipulations. In contrast, spatial principles control groups could yield smaller effect sizes
because a larger proportion of these studies used inert control group manipulations.
We analyzed the relative frequency with which each type of control group was used
within each outcome category (Figure 4). The majority of spatial perception
studies used an alternative treatment as a control group (51 out of 87 or 59% of spatial
perception effect sizes). In contrast, in the majority of studies of spatial
principles the control groups received no treatment at all (66 out of 100 or
66% of spatial principles effect sizes). The difference in proportions of control groups observed
for these two outcome categories was statistically significant, χ2 (4, 187) = 87.04, p < .001.
However, we found that control group effect sizes did not differ significantly for the four types of
control group (i.e., received nothing, treatment as usual, diluted treatment, alternative treatment),
Q (3, 223) = 5.69, p < .13. Among these groups, control groups that received a diluted version of
the treatment improved the least (M = .42, SE = .09) while those that received nothing (M = .62,
SE = .05) or received an alternative treatment (M = .60, SE = .05) improved slightly more. Thus,
the high proportion of alternative treatments used for spatial perception studies might explain the
high control group improvement observed for this group. In contrast, the high proportion of
control groups that received nothing in the spatial principles group does not seem to explain the
The previous analysis rank ordered the outcome measures in terms of their malleability to
training, although most of the measures exhibited similar effect sizes, on average. Significant
heterogeneity remained in each outcome category, suggesting that other study characteristics
might moderate the size of training effects. We next considered the effect sizes of the treatment
groups for each outcome category separately. Whenever possible we used coded study
variables and differences in training to account for variability that remained within each outcome
measure category.
Coded study variables. As shown in Table 4, the average length of training across all
studies was 21 days (range = 195). This translated into an average of 6.13 (SD = 400) hours of
training, with the majority of studies administering training in one single session. The number of
test items or trials varied considerably by outcome measure, with studies training to improve
mental rotation skills using the largest number of trials (M = 213.68, SD = 108.46) and studies of
perspective taking skill using the smallest number of trials (M = 6.71, SD = 1.60), F (4, 52) =
Several variables had no significant effect on effect size for any of the outcome measures
we tested: Neither age (younger than 13 vs. 13 – 18 years vs. older than 18 years) nor publication
status (published or not) was significantly related to treatment group effect size. Whether
training was administered in a classroom or elsewhere also had no
significant impact on training effect size, either overall (p > .39) or within any of the outcome
Three variables had a significant effect on some but not all outcome measures (Table 4).
For the sample as a whole and within 4 out of the 5 outcome measure categories there was no
significant effect of training frequency on treatment effect size. The lone exception was mental
rotation, which yielded the lowest treatment effect size overall (g = .38, SE = .08) and
showed significantly higher effect sizes when multiple sessions of training were
used instead of a single session. Whether feedback was provided also had a
significant impact on effect size. For Assembly/transformation and spatial principles, studies that
provided feedback during training yielded significantly higher treatment effect sizes than studies
that did not, ps < .05. The reverse was true for spatial perception, where the mean treatment
effect size was lower for studies that provided feedback than it was for those that did not.
Finally, random (vs. nonrandom) assignment was generally associated with significantly lower
effect sizes for the overall sample (p < .01) as well as for Assembly/transformation and mental
rotation (ps < .05) training studies, but this was not true for any other outcome measure.
Types of training. We also classified training into three major categories: course training,
videogame playing, and performance of spatial tasks. Each category was subdivided further to
help pinpoint the components of each type of training that were related to the size of training
effects.
course constituted the treatment, from training in which a short-term course enhancement was
compared to the standard version of the course. In the former case, the control group was
drafting). In the latter case, training consisted of a single unit or course module, administered to a
small number of students who were compared to students receiving the course as usual.
We found that the average treatment effect size for pure course training (M = 1.11, SE =
.10, k = 15) was significantly higher than the treatment effect size when training consisted of an
enhanced course, M = .58, SE = .05, k = 60, Q (1, 74) = 21.66, p < .001. A similar trend was
observed within the outcome measure categories in which there were sufficient numbers of effect
sizes to perform a comparison. As shown in Table 5, mean treatment effect size was higher for
pure course training compared to enhanced course training for Assembly and transformation,
Q(1, 36) = 7.69, p < .01, and to a lesser degree for mental rotation, Q (1, 34) = 3.34, p < .07.
Because this analysis compared only the performance of the treatment groups, these results
are not confounded by control group performance. One disadvantage of this approach is
that many full-length courses used designs in which it was not possible to calculate separate
treatment and control effect sizes and, thus, were not included in the previous analysis.
Furthermore, because many studies examining course training provided few details about how
training was administered (e.g., frequency of sessions, session length), focusing only on
treatment groups made it difficult to obtain enough cases to identify characteristics that might
explain the advantage of full-length course training vs. short-term enhanced course training.
Therefore, we also compared the effect of full-length course training and short-term
enhanced course training for the two categories of “inseparable” studies: those using
pretest-posttest only designs and those using treatment vs. control (i.e., between subjects) designs. We
found largely the same pattern of results that was observed when only treatment groups were
considered. As summarized in Table 5, among both pretest-posttest only studies and treatment vs.
control studies, the mean effect size for full-length course training was again significantly higher
than the mean for enhanced courses. Within outcome measure categories, this was true for
mental rotation and assembly/transformation (although the result did not reach significance for
results, we pooled the effect sizes from the separable and inseparable in order to compare full-
courses and found a number of significant differences. On average, training for full courses was
longer in duration (p < .01) and was administered in more sessions (p < .001). Full course
training took place over an average of 78.91 days (SD = 12.94) compared to 45.17 days (SD =
38.98) for enhanced courses. Full course training was given in nearly twice as many sessions (M
= 16.27, SD = 5.24) compared to enhanced courses (M = 8.69, SD = 4.90).
The difference in total number of hours spent training was not
significant, but the mean number of hours was higher for
enhanced course training (M = 52.54, SD = 85.86) than for full course
training (M = 32.07, SD = 11.23), p > .43. Consequently, when training was converted to a
per diem rate (total hours of training divided by total length of training period), we found that
enhanced courses could be characterized as more intense, performing a higher number of hours
of training in a shorter number of days. Enhanced courses trained participants at a rate of 2.07
hours per day (SD = 3.03) compared to full-length courses, whose training averaged .43 hours
These results are consistent with the performance advantage that results from distributed
versus massed practice (Ebbinghaus, 1885; Donovan & Radosevich, 1999): Short-term enhanced
courses and full-length courses require a similar number of hours of training; however, in short-
term enhanced courses, the distribution of these hours over a shorter training period and into a
smaller number of sessions may account for their smaller effect sizes relative to traditional,
semester-long courses that distribute the same amount of training over a longer period and spread
it out across a greater number of sessions. In short, our results are consistent with past research
Videogame training. Training effects from videogame play were similar for all outcome
measure categories, Q (4, 56) = 2.45, p > .65. However, videogames that entailed mental
rotation (e.g., Tetris) were associated with stronger training effects (M = .87, SE = .09) compared
to games that did not involve mental rotation, M = .63, SE = .06, Q (1, 56) = 5.22, p < .05. The
.91, SE = .09) yielded larger effects than non-mental rotation videogames (M = .35, SE = .09) did
on tasks requiring spatial assembly or transformation, Q (1, 11) = 17.54, p < .001. Whether
videogames involved mental rotation did not have a significant impact on mental rotation task
performance, however. This may have been because treatment effect sizes were large for
videogame training on mental rotation outcome measures regardless of whether the videogames
used mental rotation (M = .85, SE = .12) or not (M = .72, SE = .08), p > .36.
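The last comparison above can be approximated from the reported means and standard errors alone. The sketch below assumes independent groups and a normal approximation, which may not match the exact procedure used in this analysis; the function name `compare_two_means` is hypothetical. Under these assumptions the two-sided p-value comes out near .37, in line with the reported p > .36.

```python
import math


def compare_two_means(m1: float, se1: float, m2: float, se2: float):
    """Return (z, Q, two-sided p) for the difference between two mean effect sizes.

    Assumes independent groups and a normal approximation; with two groups,
    the between-groups Q (1 df) equals z squared.
    """
    z = (m1 - m2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # Two-sided p from the standard normal distribution: p = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, z ** 2, p


# Values reported above for mental rotation outcome measures:
# rotation games M = .85, SE = .12; non-rotation games M = .72, SE = .08.
z, q, p = compare_two_means(0.85, 0.12, 0.72, 0.08)
print(f"z = {z:.2f}, Q(1) = {q:.2f}, p = {p:.3f}")
```

The same function can be applied to any pair of means and standard errors in this section, although rounding in the reported values means the result will only approximate the published statistics.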
Spatial task training. The final category of training encompassed studies that used spatial
tasks as a form of training. These were mutually exclusive of studies that used courses or videogames as training.
Within this category, we examined the effectiveness of training that involved direct or repeated
practice on a task of interest and then measured the improvements on the task of interest as well
as the transfer of training to other reference tests administered in the study. We distinguish
practice from other types of training in that we define practice as repetitions of the same task and
training as more varied and less task-specific. We consider them separately as two different
methods of training but acknowledge that either can lead to generalization and transfer. In cases
of transfer, following Barnett and Ceci’s (2002) distinction, we distinguished near transfer, in
which training and the reference test were highly similar (e.g., Tetris-playing to mental rotation
or WLT using a round flask to WLT using an irregularly-shaped flask) from far transfer, in which
training and reference test were more dissimilar (e.g., Tetris-playing to Paper Folding or
In this section, we focus on the treatment effect sizes for training that involved repeated
practice on the same task versus training that required transfer to an untrained task. However, we
acknowledge that the absolute number of tests given may also play an important role in
increasing training effects. Thus, we focus on treatment effect sizes here but, in the next section,
consider the effects of training that includes one single task versus training that includes
Whenever an outcome measure was not identical to the training task, it was counted as a
test of transfer. Only for Assembly/transformation tasks was the mean treatment effect size
significantly higher for repeated practice than when transfer was required (p < .05). The
similarity between effect sizes for repeated practice and transfer tests within the Mental rotation
category initially seems to suggest a uniformly high rate of successful transfer and to contradict
past findings suggesting that mental rotation training effects are highly specific (e.g., Sims &
Mayer, 1996). We found, however, that this result was largely due to our broad construal of
Because our definition of transfer included a wide range of training and tests, we also
subdivided our transfer effect sizes and compared cases of “near” transfer, where the training
task and outcome measure were highly similar (e.g., training on rotating 2-D figures and testing
on Card Rotations Test), with cases of “far” transfer, where training and outcome were more
dissimilar. In most cases, near transfer produced significantly higher effect sizes than far transfer
did. The mean treatment effect size for near transfer was significantly higher than for far transfer
for the overall sample, p < .001, spatial perception, p < .001, and mental rotation, p < .001. The
significant difference between near and far transfer makes sense in light of past work: although
there was no difference between treatment effect sizes for repeated practice and transfer overall,
training that constituted “near” transfer was significantly more effective in improving spatial
skills than training that was considered to be “far” transfer. Thus, our results are consistent with
work suggesting that spatial training is more effective when it is similar to the task of interest. In
sum, in most cases, training that more closely approximated the outcome measure of interest was
Finally, to gain a sense for what has “worked” in terms of type of training and outcome
measure, we rank ordered treatment effect sizes into quartiles and compared the proportion of
each type of training and each outcome measure found in each quartile. This provided another
way of examining how effect sizes clustered in order to identify the characteristics of the most
We found no significant association between type of training and treatment effect size
quartile, χ2 (df = 9) = 12.03, p > .21. There was, however, a significant association between
outcome measure and treatment effect size quartile, χ2 (df = 12) = 21.32, p < .05. As summarized
in Table 6, this is consistent with our results in that it revealed that the majority of treatment
effect sizes for spatial perception were found in the highest (4th) quartile while the majority of
treatment effect sizes for spatial principles were found in the lowest (1st) quartile. This analysis
I. Summary
In this section, we investigated whether different spatial outcomes show different gains
with training. We compared the effect sizes for the treatment groups of each of the five outcome
assembly/transformation, spatial principles, and mental rotation. By analyzing only the treatment
group effect sizes, we avoided the problem of confounding effect size with control group
improvement.
Once the effect sizes for treatment groups were considered separately from those of
control groups, we found that spatial perception, in fact, was highly malleable to training; its low
effect size Ec was the result of its control groups also showing large gains with training. On the
other hand, studies of spatial principles showed large Ec effect sizes, largely because its control
groups showed extremely small gains with training. Studies of spatial
perception most often presented control groups with an alternative treatment in place of the
training intervention, while control groups in studies of spatial
principles tended to receive no treatment. This may help to explain why the control groups for
In addition to the presence of control groups, certain study characteristics also had a
significant impact on treatment effect size. Mental rotation benefited from multiple sessions of
training (compared to one single session); for all other outcome measures, however, treatment
effect sizes were similar for multi-session and single-session studies. Providing feedback during
training produced mixed results: For assembly/transformation and spatial principles, providing
feedback was associated with higher treatment effect sizes while for spatial perception, it was
associated with lower effect sizes. Type of training also moderated effect sizes for different
outcome measures. We found that videogames that entailed mental rotation produced larger
training effects than nonrotation games, but all types of spatial outcome measures appeared to
Training in the form of a lengthy course compared to an alternative course (e.g., drafting
vs. water purification) led to larger effect sizes than shorter-term course enhancements
(e.g., addition of 3-D module to an existing course). This result held
across all three types of study designs (i.e., pretest-posttest with control, treatment vs. control,
pretest-posttest only). Full-length courses included a similar number
of hours of training as enhanced courses but distributed it over a longer period of time and
divided it into a greater number of sessions. This finding may explain the consistently higher
effect sizes associated with full-length courses, which used distributed training, compared to
short-term enhanced courses, which used massed training. Finally, we found that training effects
were similar across studies for repeated practice and for training that required a degree of
transfer to a different task. However, training effects were significantly smaller for tasks
requiring far transfer, whereas the size of training effects for near transfer was large, on average.
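The hours-per-day contrast behind the distributed versus massed interpretation can be sketched from the group means reported earlier in this section. This is only an illustration: the helper name `per_diem` is hypothetical, and a ratio of group means need not equal the mean of per-study rates, which is presumably why the values below differ from the reported averages of 2.07 and .43 hours per day.

```python
def per_diem(total_hours: float, total_days: float) -> float:
    """Hours of training per day over the whole training period."""
    if total_days <= 0:
        raise ValueError("training period must be positive")
    return total_hours / total_days


# Group means reported earlier in this section (hours, days).
full_course_rate = per_diem(32.07, 78.91)
enhanced_course_rate = per_diem(52.54, 45.17)

print(f"full-length courses: {full_course_rate:.2f} hours per day")
print(f"enhanced courses:    {enhanced_course_rate:.2f} hours per day")
```

Either way the direction of the contrast is the same: enhanced courses packed their training into fewer days than full-length courses did.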
Taken together, these results provide some evidence that the effects of training do extend
beyond mere practice on a task. Evidence for far transfer suggests that improvements in
untrained tasks accompany the improvements that result directly from training. In the next
section, we will decompose these effects of practice to try and specify the extent and limit to
II. What is the magnitude of the test-retest effect in spatial training and what factors are
In the previous section, we showed that when the influence of control groups is removed,
both repeated practice and training using highly similar spatial tasks led to improvements
in spatial skills. Because we wanted to separate
the influence of training from the gains experienced by control groups, the previous analysis
included only those studies that provided separate information about the performance of both the
treatment and control groups. The analysis focused on the performance of the
treatment groups within those studies in which it was possible to calculate separate effect sizes
This aspect of the analysis limits the conclusions that can be drawn in a few ways. First,
it reduced the number of effect sizes that could be included in the analysis. Second, it examined
in isolation the use of practice as a treatment manipulation when, in reality, repeated practice is
often given in combination with other training methods. For example, many studies
used an additive methodology in which the control group received repeated practice and the
treatment group received the same practice plus an additional
treatment intervention (e.g., Terlecki, Newcombe, & Little, 2008). Thus, the linking of individual
treatment effect sizes to a particular aspect of training may have ignored the fact that studies may
present participants with a collection of tests and that the unique constellation of tests, as well as
the characteristics of the individual tests themselves, may have a modulating influence on effect
size.
In this section we consider factors that influenced learning in the control groups. Our
working hypothesis is that the type of experience subjects had in the control group strongly
influenced the magnitude of their improvement. In addition, we test the hypothesis that
improvements in the control group represent something more than simple test-retest effects.
To address this issue, we also considered that the type of filler task separating the
administration of the test and retest might affect the size of the retesting effect. For example,
the magnitude of test-retest effects might depend on whether filler tasks were spatial in nature or
left the testing site and returned at a designated time to complete the retest. Of the 203 control
group effect sizes included in this analysis, 88 used a spatial filler task and 115 used a nonspatial
filler task. There were 119 effect sizes derived from test-retest on a single measure and 84 effect
sizes from test-retest on multiple (i.e., more than one) measures. We compared these
effects. Our focus in this section is therefore only
on the instances of testing and retesting among the control groups.
We began by examining test-retest effects
among the following types of control groups: 1) those who practiced a single task repeatedly in
lieu of the training received by the treatment group; 2) those that received nothing or performed a
nonspatial filler task between a pretest and posttest on a single measure; 3) those that performed
a spatial filler task between a pretest and posttest on a single measure; 4) those that received
nothing or performed a nonspatial filler task between pretest and posttest administrations of
multiple measures; and 5) those that performed a spatial filler task between a pretest and posttest
We then compared the mean effect sizes of these five types of groups and found
significant differences among these variants of test-retest control groups, Q (4, 223) =
18.32, p < .01. The means are summarized in Figure 5. A meta-regression
confirmed significant main effects of both number of measures (single vs. multiple, β = .39, p <
.001) and type of filler task (spatial vs. nonspatial, β = .26, p < .001) as well as a significant
These results suggest that the act of retesting on a single test (when a nonspatial filler task
is used) does have an effect on raising scores from pretest levels (g = .37, SE = .05). This is
similar in magnitude to the average test-retest effect reported in the literature (.28). This test-
retest effect is even larger, however, when a spatial filler task is used (g = .66, SE = .06). The
presence of a significant interaction indicated that the difference in mean effect size between
single and multiple measures was significant for studies using a nonspatial filler task (p < .001)
but not for studies using a spatial filler task (p = .26). In other words, among subjects given
nonspatial filler tasks, control group subjects whose test-retest protocol included multiple
measures improved more than subjects who were only tested and retested on a single measure. In
contrast, test-retest procedures that included a single or multiple measures generated similar
levels of improvement when the test and retest were separated by a spatial filler task. This
finding is to be expected if we consider that subjects given a spatial filler task may still be
learning something. These improvements were statistically similar to the gains observed when
The total number of tests that control groups completed was also associated with higher
effect sizes, Q (2, 223) = 12.03, p < .01. As shown in Figure 6, control groups that completed a
test-retest procedure on a single test (M = .49, SE = .04) improved significantly less than those
that completed five or more tests (M = .78, SE = .08, p < .01) and those that completed 2-4 tests
(M = .59, SE = .04), although the latter difference was significant only at p < .05. Because the
two categories of multiple tests did not differ, the main distinction appeared to be between
control groups receiving a single test and those receiving multiple tests, M = .64, SE = .04, Q (1,
223) = 7.66, p < .01. There was not an appreciable gain for control groups that received 4 tests
or more than 4 tests in a test-retest design. In fact, there was a large and significant gain in
control group performance between those completing pretest-posttest on only one test (M = .49,
SE = .04) versus those receiving two tests, M = .70, SE = .06, Q (1, 166) = 9.09, p < .01. Thus,
the inclusion of at least two different measures in a test-retest design appears to provide a level of
A second finding of interest here is that control groups that merely practiced a single task
repeatedly improved to a similar degree as control groups that took a pre- and posttest on
multiple measures. Although the two groups may learn different things with repeated testing,
multiple (more than 2) repetitions on a single test seem to yield similar gains as two (pretest and
Finally, it is also interesting to note that, on average, control groups that received a spatial
filler task improved to a similar degree as those that received a nonspatial filler task or no filler
task at all. This is consistent with our earlier analysis of control groups that revealed a significant
difference between studies that had no control group and those that had a control group but no
significant differences between the different types of control groups. However, those who
received a nonspatial filler task showed larger improvements as more measures were included
while those who received a spatial filler task performed similarly whether a single measure or
II. Summary
The typical improvement that is expected to result from retesting on a single test is .29.
Some aspects of our results were consistent with the typical pattern observed in past
research: Control groups that received a test-retest regimen on a single measure (with no spatial
filler) improved by about .38. However, we also found that the number of tests that accompany a
measure and the nature of the filler task also affect the size of the test-retest
improvement observed.
In general, control group participants who were tested and retested on multiple measures
improved significantly more than those who did so on only a single measure. This
tendency was particularly true when the intervening time between the tests was spent
either doing nothing or completing a nonspatial filler task. We also found a similar degree of
improvement in control groups that practiced one task repeatedly and control groups that
completed only a pretest and posttest but on multiple measures. This might also help to explain
the high gains observed earlier when battery was considered as an outcome measure, since
studies in that category typically administered a large number of tests at pretest and posttest.
We noted earlier that the success of interventions is typically judged by looking at the Ec
effect sizes, which summarize the gains shown by the treatment group relative to the gains
shown by the control group. In the previous section, we showed that spatial skills improve
regardless of whether individuals receive repeated practice on a task of interest or if they are
trained on related spatial skills, with larger training effects being observed when there is a higher
degree of similarity between the training and outcome measures. However, in this section we
showed that not only is the degree of similarity between training and outcome important, the
number of tests included within a training regimen also affects the size of training effects.
Specifically, control groups that received multiple tests in a test-retest design improved more
than control groups that received only a single test and improved to the same degree as control
training individuals—has implications for how the effectiveness of training interventions are
evaluated. Because a large improvement by the control group within a single study attenuates
the effect size for the intervention, a study that administers multiple measures may report a
different effect size than one that administers a single measure. It also suggests that the concept
of what constitutes training may need to be reframed. Training is not limited to the content of the
material; it also consists of increasing familiarity with procedures and taking tests. Thus, control
groups, who do not receive the same material content as treatment groups, still show sizable
improvements when they are enrolled in a study. Furthermore, control groups that complete a
test-retest on multiple tests have even more opportunities for being “trained” and, as such, show
larger improvements compared to control group individuals who take a single test.
III. Are the effects of spatial training durable and how long do they last?
The majority of studies tested only the immediate effects of training. However, we found
that pretest to posttest improvements did not differ significantly among posttests given
immediately, 2 weeks after, or more than 2 weeks after the end of training (Figure 7). The last
category includes posttests that were given up to 3 months after the end of training.
Because studies that administered a delayed posttest did not test long-term retention more
than once, the previous analysis was based on the comparison of durability across different
studies. What about studies that employed both an immediate and a delayed posttest? As shown
in Figure 8, we found the same to be true within studies that administered both immediate and
delayed posttests: the effect size from pretest to immediate posttest (g = .64) was not
significantly different from the effect size from pretest to delayed posttest (g = .65). Thus, the
effects of training are durable; they do not decline
We also tested whether the effects of training on transfer to reference tests were also
durable. Far transfer was actually more durable than near transfer was (Figure 9). There was no
interaction between
Delay type and Transfer type. This result could suggest that training designed to achieve far
transfer may also be more likely to yield training effects that are durable. Very few studies tested
the effects of long delays after the conclusion of training, but these results do suggest that
transfer is durable, although certainly more research is needed that examines the long-term
III. Summary
Some researchers have speculated that spatial training effects are neither long-lasting nor
generalizable to tasks beyond those directly trained. Contrary to these assumptions, we found
that the consensus in past research is that the effects of training are durable. There were no
significant losses in pretest-posttest gains resulting from training, even for studies that retested
participants 3 months from the end of training. We also found that training generalizes: In the studies
variety of reference tests that were not trained directly. The gains on these tests of transfer also
Why haven’t past attempts at spatial training led to stronger effects? First, we have
shown that the extent to which control groups improve is important to consider, and we have also
shown that control groups may improve substantially (g = .56 in our sample). This is very large given
that the test-retest effect alone is .26. Furthermore, training may not be long enough. If
long periods of training are used (e.g., Terlecki, Newcombe & Little, 2008), durable training
IV. How do individuals’ pre-existing levels of performance modulate the size of training-
In this section, we address the questions related to whether high- and low-performing
individuals benefit from training to the same degree, whether they be males vs. females, or adults
vs. children, etc. Our goal was to shed light on some of the sources of individual differences in
receptivity to training and also to ascertain the extent that methodological factors (such as
differences in improvement by control groups) could account for some of these apparent
Determining whether the effects of training and experience on spatial ability are stronger
or weaker within different populations is of critical importance for determining who stands to
Within the literature on the effects of training or experience on spatial performance, some
of the largest improvements in spatial performance have been observed in
populations with limited exposure to spatial tasks (e.g., Saunderson, 1973; Seddon & Shubbar,
1984; Seddon, Eniaiyeju & Jusoh, 1984; Seddon & Shubber, 1985; Shubbar, 1990). Higher
versus lower degrees of prior experience with spatial tasks also have a modulating influence on
the size of training effects. For example, Gagnon (1985) found that female, but not male,
demanding video game, probably because the females reported lower levels of previous
game-playing experience. However, the spatial skills of males and females
with low levels of gaming experience improved equally after spatial video game training
(Dorval & Pepin, 1986). Thus, it is of great theoretical and
practical interest to test the hypothesis that differing amounts of spatial experience and activities
lead to differences (or different increases) in spatial ability. This hypothesis has received
substantial attention in the literature; researchers have called it the experiential hypothesis
(Baenninger & Newcombe, 1989), Sherman’s hypothesis, or the bent twig model (Casey, 1996).
individuals will show larger training effects compared to more spatially-experienced individuals.
As such, training effects should be larger for studies conducted in less-industrialized versus
more-industrialized countries, larger for females than males due to their differing degrees of
spatial experience (Baenninger & Newcombe, 1989), and also larger for children compared to
adults. Our earlier results are consistent with the first prediction: We excluded 6 studies from
nonindustrialized countries (g = 1.67, SE = .07) from the main analysis because their mean
effect size was significantly higher than that of the remaining studies, which were from more
industrialized countries, g = .68, SE = .02, Q (1, 633) = 167.61, p < .001. Even after these
extreme cases were excluded, there remained a significant, negative relationship between HDI
ranking and treatment group effect size, r (236) = -.21, p < .01. The same negative relationship
was found when all Ec effect sizes were considered, r (508) = -.12, p < .01. Figure 10 shows the
Within the studies retained in our sample (i.e., after the exclusion of outliers), we were
interested in whether training had the same impact on males and females and children and adults.
We begin this section with a discussion of how spatial training might play a role in modifying
differences in spatial skills that have been observed as a function of sex. We also examine
whether there is support in our data for an age difference in malleability to training.
Sex differences. Mirroring the gender gap in the STEM disciplines, spatial skills are one
cognitive domain in which sex differences have been reliably and systematically found (Linn &
Peterson, 1985). Men consistently score higher than women on most standardized measures of
spatial skills, with the notable exception of object location memory (Voyer, Voyer & Postma,
2003). The most popular explanation of this discrepancy, first advanced by Fennema and
Sherman (1977), attributes the difference in scores to the cultural sex-typing of spatial activities
such that males are more likely to engage in them. This difference, in turn, leads to greater and
richer levels of spatial experience among males than females. Indeed, research has shown that
males are much more likely to participate in spatial activities such as sports and construction
play while females are traditionally much more likely to engage in play with dolls, cooking and
art (Baenninger & Newcombe, 1995; Voyer, Nolan, & Voyer, 2000). These spatial experiences
and activities can, in a sense, be considered a form of training that likely enhances spatial skills.
There have been many individual studies of the effects of training on sex
differences in spatial cognition, but to our knowledge, the last meta-analysis of this topic was
conducted almost twenty years ago (Baenninger & Newcombe, 1989). Baenninger and
Newcombe (1989) examined the relation between spatial activity participation or experience and
scores on psychometric tests of spatial ability. They found a weak but significant relation
between spatial activity participation and spatial ability for both males and
females, supporting the notion that males may have an advantage in
spatial ability due to their greater amount of spatial experience over females. However, they also
found that spatial ability test performance was equally improved with training for females and
males. In other words, Baenninger and Newcombe did not find the
Sex by Training interaction that would have been predicted by the
hypothesis that spatial experience is the key to enhanced spatial skills.
Recent work by Levine et al. (2005) provides additional support for the experiential
hypothesis. They found that the emergence of sex differences on spatial rotation and other
tasks depended on SES level; low-SES boys and girls performed the same, but mid- and high-
SES boys performed better than their female counterparts. Levine et al. speculated that
low-SES children do not have as much access as higher-SES children do to stimulating materials
that have been shown to enhance spatial ability, such as video games, puzzles, and Legos. If
the experiences that tend to enhance spatial abilities are absent from the lives of low-SES
children and less prevalent for girls of all SES levels, then we would expect that they would do
worse on spatial tests but might improve substantially if provided with adequate or compensating
experience or training.
Although sex differences were not the primary focus of our analysis, we were able to test
for sex differences with our data set for all studies in which separate means for males and
females were provided by the authors. Table 7 gives the mean effect sizes of
control and treatment groups for both sexes. There were no
significant sex differences in mean effect size for either the control groups (p > .56) or treatment
groups (p > .18). Treatment groups performed significantly better than the control groups did
(p < .001), but this replicates analyses reported above. Using the metaregression macro, we
found that the interaction between Condition and Sex was not significant (p > .54).
This is consistent with previous research (e.g., Baenninger & Newcombe, 1989), which found
that males and females both benefit from training. Our results indicate that training
leads to comparable gains in males and females, in both control and training groups.
Contrary to the experiential hypothesis, we did not find that
women improved more than men did.
Our results indicate that the magnitude of improvement is similar for males and
females. However, these results do not address whether there are pre-existing differences
across studies and whether these differences are reduced or even eliminated with training.
We next addressed whether males and females began at the same or at different levels
of performance. Despite the high volume of research on sex differences, no consensus has been
reached on how training modifies pre-existing levels of performance in males and females.
Figure 11 depicts two competing scenarios for the start and end points for males and females
in spatial skill. The first (a) is a remediation scenario, in which the mean performance of males is
higher at pretest but females catch up with practice. The second (b) is one of parallel
improvement; males perform better at the beginning than females do, and training leads to
comparable improvement in both groups. Thus the male advantage is maintained across training.
To determine which of the two hypothetical scenarios provides the better summary of the
effects of spatial training on the performance of males and females, we first computed pretest
and posttest performance by sex and then computed the magnitude of sex differences at pretest
and posttest for every study for which this was possible. We included only those studies that
provided both the mean pretest and posttest scores for male and female participants separately.
Thirty-five studies met these criteria. Following the procedure described by Voyer, Voyer, &
Postma (2007), we calculated the size of the sex difference in pretest scores and the sex
difference in posttest scores for each of the 35 studies. The size of the sex difference is
summarized by the Hedges’ g statistic. A larger and more positive g in this case reflects a larger
sex difference favoring males over females, a negative g would indicate a sex difference favoring
females over males, and a g close to zero would suggest no sex difference. Using the SPSS
macros, we tested the significance of sex differences at pretest and again at posttest to
determine whether significant
Across the 35 studies, males performed better than females did at both pretest
(g = .50, SE = .04, k = 48) and posttest (g = .44, SE = .04, k = 48), and
the size of this difference did not change significantly as a result of participating in training, Q
(1, 95) = .86, n.s. To explore the possibility that the male advantage was reduced for some
outcome measures but not others, we repeated this analysis separately for each of the five
outcome measure categories and found the same pattern of results: Sex
differences favoring men were statistically unchanged from pretest to posttest for all measures.
These results are summarized in Table 8.
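As an illustration of the statistic used in this analysis, the following minimal Python sketch computes Hedges' g for a sex difference from group means and standard deviations, using the standard small-sample correction and the sign convention described above (positive values favor males). The function name and the input values are hypothetical and are not taken from any study in the sample; the SPSS macros used in the actual analysis are not reproduced here.

```python
import math


def hedges_g(mean_m, sd_m, n_m, mean_f, sd_f, n_f):
    """Return (g, se) for the male-minus-female difference.

    Positive g favors males, negative g favors females.
    """
    df = n_m + n_f - 2
    # Pooled within-group standard deviation.
    pooled_sd = math.sqrt(((n_m - 1) * sd_m ** 2 + (n_f - 1) * sd_f ** 2) / df)
    d = (mean_m - mean_f) / pooled_sd
    # Small-sample correction factor that turns Cohen's d into Hedges' g.
    j = 1 - 3 / (4 * df - 1)
    g = j * d
    # Common large-sample approximation to the sampling variance of g.
    var_g = (n_m + n_f) / (n_m * n_f) + g ** 2 / (2 * (n_m + n_f))
    return g, math.sqrt(var_g)


# Hypothetical pretest scores: males M = 22, females M = 20, SD = 5, n = 30 each.
g, se = hedges_g(22.0, 5.0, 30, 20.0, 5.0, 30)
print(f"g = {g:.2f}, SE = {se:.2f}")
```

Swapping the male and female arguments flips the sign of g, which corresponds to a difference favoring females.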
Our findings are consistent with the parallel improvement scenario. On
average, males perform better than females at pretest. Training leads to similar improvement in
males and females, and males therefore maintain their initial advantage. Training does not seem
to reduce the gender gap in spatial skills, but it does help both men and women perform at
substantially higher levels. In other words, the consensus of the studies we survey here is
that there is a male advantage on spatial skills and that it persists despite the fact that both males
and females improve with training. Thus, our results suggest that the male advantage in spatial
Age differences. We also tested whether children (who have more limited spatial
experience) would show consistently larger training effects than adults did. We
found that the overall Ec effect size was higher for children (g = .75, SE = .04) than adults, g =
.61, SE = .03, Q (1, 538) = 8.69, p < .01, and that this difference was driven by an age difference
in the extent to which the control groups improved during testing. As shown in Table 9, when the
control and treatment groups were considered separately, a significant effect of age was found for
the control groups only, with adult control groups improving significantly more than children’s,
p < .001. There was no significant age difference, however, in the size of improvement by the
treatment groups (p > .11) nor was there a significant Age x Condition interaction (p > .16). For
both children and adults, treatment groups yielded significantly higher effect sizes than control
groups did, ps < .001. Thus, both children and adults improved to an equal extent with training
and, despite differences in their pre-existing levels of performance, there was no evidence to
suggest that children showed larger gains with training than adults did.
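The subgroup comparisons reported in this section rest on a between-groups homogeneity statistic, Q. As a rough illustration, the sketch below computes a fixed-effect Q from subgroup means and standard errors, using the rounded Ec values reported above for children and adults. It is an approximation under stated assumptions: the function names are hypothetical, the weighting model used in the actual analysis is not specified here, and because the inputs are rounded, the result (about 7.8) is close to but not identical to the reported Q of 8.69.

```python
import math


def q_between(groups):
    """Fixed-effect between-groups Q from a list of (mean, se) pairs."""
    # Inverse-variance weights.
    weights = [1.0 / se ** 2 for _, se in groups]
    # Weighted grand mean across subgroups.
    grand = sum(w * m for w, (m, _) in zip(weights, groups)) / sum(weights)
    # Weighted sum of squared deviations from the grand mean.
    return sum(w * (m - grand) ** 2 for w, (m, _) in zip(weights, groups))


def p_value_df1(q):
    """Upper-tail p value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(q / 2.0))


# Rounded subgroup summaries from the text: children vs. adults.
q = q_between([(0.75, 0.04), (0.61, 0.03)])
p = p_value_df1(q)
print(f"Q = {q:.2f}, p = {p:.4f}")
```

With two subgroups, this Q reduces to the squared difference between the means divided by the sum of the squared standard errors, so it can be checked by hand.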
We also tested whether the nonsignificance of the overall sex difference could be
attributed to the existence of a significant Age x Sex interaction. In other words, if a sex
difference existed but was more pronounced in either children or adults, this interaction would
obscure an overall main effect of sex. Because we found a significant difference in control group
effect size for children and adults, we tested for an Age x Sex interaction on the treatment groups
only. A metaregression of effect size on Age, Sex, and the Age x Sex interaction revealed no
significant effects, ps ≥ .08. The only outcome that approached significance was a main effect of
Sex (p = .08). Otherwise, there was no evidence to suggest there was a sex difference in the size of
Thus far, we have focused on age comparisons between children and adults. Yet, the
experiential hypothesis would also predict that training effects would be stronger for younger
children than older children. To test this prediction, we compared effect sizes (from treatment
groups only) of individuals under 13 years, from 13 – 18 years, and older than 18 years. We
found no significant differences in treatment group effect size, p > .72. We also compared the
effect sizes from control groups to determine whether younger and older children’s control
groups perform differently in training studies. We reported earlier that children’s control groups
showed significantly smaller gains with training compared to adults. When younger and older
children were considered separately, we found a significant effect of age on control group effect
size, with control groups for the youngest children improving significantly less than those of
older children and adults (Table 10). Thus, our results suggest again that there is no age
difference in effect size for treatment groups, only age differences in the extent to which control
groups improve.
We investigated a possible explanation for why the youngest control group participants
showed significantly smaller improvements compared to those who were older by analyzing the
proportion of each control group type used by age. If certain types of control groups were used
more often with the youngest participants than with the other two age groups and these were
more inert (i.e., less reactive and likely to produce gains), this might explain the low levels of
improvement in the youngest control groups. We analyzed the proportion of studies that used each
type of control group and compared these for each age group. The data for the Age x Type of
control group analysis are summarized in Table 11, both for the separable studies and for the
The analysis of Age by Control group type was not significant when only the separable
studies were considered (χ2 = 6.61, p < .36) but was significant when the entire sample was
included, χ2 = 14.33, p < .05. Both analyses converged on the same result, that for the youngest
age group, control groups were more likely to receive no treatment than one of the other control
group manipulations. In contrast, control groups for the middle age group (13 – 18 years) were
given treatment as usual most often while control groups for the oldest age group (over 18 years)
Was the type of control groups used with the youngest participants responsible for the
low performance of the Under 13 control groups? Specifically, was there one type of control
group associated with low effect sizes and was this type overrepresented in the youngest group of
participants? To answer this question, we compared the mean effect sizes for each type of control
group for each age group (Figure 12). Recall that our earlier analysis (when all age groups were
considered together) had indicated that mean effect size did not differ by type of control group.
In contrast, when this analysis was broken down by age, the type of control group was
significantly related to control group effect size (p < .01 for the oldest group, p < .05 for the
middle age group, and p = .08 for the youngest group). Thus, although type of control group had
a significant impact on control group improvement, the types of control groups that were
associated with the highest and lowest effect sizes were not the same for each age group.
We also tested whether the lowest-performing control group for the youngest age group
(i.e., diluted treatment) was also the type of control group used most frequently by the youngest
group. This turned out not to be the case: the lowest-performing control group was not the most
frequently-occurring. Instead, control groups that received nothing both were the most
Taken together, this analysis reveals three main findings: 1) The control groups for the
youngest participants improved the least of all three age groups; 2) The types of control group
that were used most frequently varied significantly by age; and 3) The nature of what the control
group received had a significant impact on effect size. However, there was no evidence to
suggest that the choice of control groups accounted for the low gains observed for the youngest
control groups, since the most frequently used control group (receives nothing) was associated
with the highest mean effect size. Thus, it appears that control groups from studies of young
children tend to show small gains but that this cannot be explained by their receiving a higher
proportion of ineffectual control group manipulations. In sum, there is not strong evidence to
suggest that the types of control groups used accounted for the low effect sizes of the youngest
Low vs. High-performers. Finally, we also tested whether the process of screening out
high-performing individuals at the start of training has a significant impact on the size of training
effects. For example, past work has shown that differences in spatial skill among low- and high-
frequency video gamers can be reduced or eliminated if the low-gamers are given additional
video game playing experience (Gagnon, 1985; Dorval & Pepin, 1986; Okagaki & Frensch,
1994). Do training studies that incorporate a screening procedure yield higher effect sizes
In all, 11 out of 101 studies (107 effect sizes) tested only individuals who were defined
during a screening procedure as low scorers. An ANOVA on Ec effect sizes revealed that the 11
studies that tested only low scorers (M = .74, SE = .05) were significantly higher in mean effect
size than the remaining 90 studies, M = .61, SE = .03, Q (1, 507) = 4.90, p < .05. Of the 55
separable studies, there was no difference in the mean effect size for the control groups of the 7
studies that prescreened and the 48 that did not prescreen participants, p > .61. A significant
difference in mean effect size was found for the treatment groups, however, with treatment group
effect size being significantly larger for studies employing prescreening (M = .90, SE = .07, k =
40) compared to those that did not, M = .72, SE = .03, k = 206, Q (1, 245) = 5.31, p < .05. These
results are also consistent with the experiential hypothesis: Studies that focus on training low-
performing individuals report significantly larger effect sizes compared to studies that train
IV. Summary
Testing individuals who vary widely in spatial ability is important to our investigation of
spatial training effects because it addresses the extent to which it is possible to improve the
spatial skills of low-performing groups and helps to identify the situations in which
the gap between low- and high-performers can be reduced or even eliminated.
We first tested whether the typical male advantage in specific spatial skills was reduced,
or even eliminated, when females were given additional training or experience. Our results were
most consistent with a parallel improvement outcome pattern, indicating that both males and
females improved with spatial training but that the male advantage present at pretest remained at
posttest after training. Thus, although a number of individual studies have reported that pretest
sex differences favoring males were erased after females showed larger gains with training
compared to males (e.g., Gittler & Gluck, 1998; Kass, Ahlers & Dugger, 1998; Larson et al.,
1999; Lohman & Nichols, 1990; Parameswaran, 1996; Vasta, Knott & Gaze, 1996), the
consensus across studies is most consistent with males and females showing equal gains with
training. After combining results across studies, we conclude that although males and females
both respond to training, spatial training does not eliminate the sex differences in spatial skills
Comparisons of children and adults likewise did not reveal significant differences in the
size of training effects obtained for each group. It is sometimes assumed that children will show
greater effects of training compared to adults. However, once we accounted for the influence of
control groups, we found no evidence to support this difference. We suggest that one reason why
training studies of children may appear to yield larger effect sizes than studies of adults is that
children’s control groups, on average, improve significantly less than adults’ do. This does not
appear to be due to a tendency to use ineffectual control group manipulations with younger
children; the highest-performing type of control group was also the most frequently used type
among the youngest group. Likewise, the lowest-performing type of control group was the least
frequently used. Thus, the low performance of the youngest control groups may be accounted for
by more general factors, such as lack of familiarity with testing or possible failure to develop
strategies spontaneously, factors that may be less likely to limit performance in adult control
groups.
prescreening procedure) tended to report higher effect sizes than those that tested participants
regardless of skill level. This is consistent with the pattern we observed earlier for studies
focusing on low-SES populations likely to have limited spatial experiences. Like the current
group of studies that focuses on low-performers, these studies also reported larger than average
training effects. Overall, our results in this section provide qualified support for the experiential
hypothesis, with both sexes benefiting equally with training and training improving spatial skills
to a similar degree for different age groups but training having a larger effect on individuals who
General Discussion
training, yet there is considerable variability in the magnitude of training effects that have been
reported. Despite the large number of studies that have found positive effects of training on
spatial performance, other studies have found minimal or even some negative effects of receiving
interventions (e.g., Johnson, 1991; Larson, 1996; Simmons, 1998; Kirby &
Boulter, 1999; Faubion, Cleveland & Harrel, 1942; Smedslund, 1963; Gagnon, 1985; Kass,
Ahlers & Dugger, 1998; McGillicuddy-DeLisi, DeLisi & Youniss, 1978; Vasta,
We suggest that the mixed results of past research on training can be attributed, in part, to
overlooking the size of improvements made by control groups and failing to consider how
incidental aspects of study design, such as the number of measures included within a battery of
Overall, our results clearly indicate that spatial training yields substantial, durable gains in
spatial skills that generalize to other tasks. Generally, we found that both men and women and
children and adults benefit from training and that there were limited differences in malleability.
The differences in malleability among the types of outcome measures and training methods were
somewhat limited and these differences did not seem to be as important as the differences
nearly all cases, the size of the training-related improvements was heavily dependent on whether
studies included a control group and, if so, the size of the gains observed within their control
groups.
We also argue for a broader conception of what constitutes training. We suggest that a full
characterization of spatial training entails not only examining the content of courses or training
regimens but also examining the nature of the practice effect that results from being enrolled in a
training study and being tested multiple times. We found that control group participants who
were otherwise “untrained” showed differences in improvement when they received multiple
tests or a single test, suggesting that even untrained participants in training studies learn
something important. This may be learning about the act of taking a test, becoming familiar
spatial measures, both of which are enhanced when individuals take multiple tests, when there is
the opportunity to compare items across tests and to learn by making contrasts (Gentner &
Markman, 1994; 1997). Alignable differences are highlighted when similar entities are
compared, so the act of taking multiple tests permits comparisons to be made across different
spatial tests and can potentially highlight important similarities and differences in test content
and strategy.
When we isolated the size of the treatment effects and compared them across outcome
measures, we found that there were relatively few differences in treatment effect sizes but large
differences in the extent to which control groups improved. We also found that it was not the
total number of hours spent training that made a difference but rather how it was distributed:
Full-length and enhanced courses used a comparable number of hours of training, but full-length
courses spread out this training over a longer training period, resulting in a less-intensive but
magnitude to cases of “near” transfer, where the practiced task and the outcome measure were
highly similar. Thus, training and outcome do not need to be identical in order for training-
related gains to be observed. However, repeated practice and near transfer both led to
significantly higher gains than “far” transfer did. This suggests that while it is efficacious to train
spatial skills, there are limits to how these training effects will generalize to tasks that are
dissimilar to those trained. It is worth noting, however, that far transfer effects were more durable
than near transfer effects. We also found that taking multiple pretests and posttests yields larger
improvements on average compared to taking a single pretest and posttest, and that the effect of
giving multiple measures was similar to the improvement generated from practicing repeatedly
on a single task.
In our sample, the majority of studies used immediate posttesting after the conclusion of training.
However, among the studies that delayed the administration of the posttest there was strong
evidence for the successful maintenance of training effects. The magnitude of training effects
was statistically similar for posttests given immediately, 2 weeks after, and even more than 2
With regard to Sherman’s hypothesis, we found qualified support for the role of pre-
existing levels of spatial experience on the size of spatial training effects. The size of spatial
training effects depended on a country’s HDI ranking, with lower ranking countries reporting
much larger training effects than countries that ranked higher in HDI. Likewise, studies that
focused on remediating low-performing individuals reported significantly higher effect sizes than
studies that tested all ability levels. This suggests that whether spatial interventions are deemed
to be effective depends heavily on the populations selected for study. Studies focusing on low-
SES or low-performing individuals will tend to report higher effect sizes than studies testing
On the other hand, we found that neither sex nor age was a consistent
predictor of training effects. Specifically, the overall pattern for males and females was
consistent with a scenario of parallel improvement, with both sexes improving to an equal extent
with training but males maintaining their advantage over females across all different types of
spatial tasks and training. Comparisons of children and adults likewise did not reveal significant
differences in the size of training effects obtained for each group. Once the effect of control
group improvement was removed, a direct comparison of treatment group effect sizes revealed
no significant differences between children and adults. There was also no evidence that this
effect was modified by sex; it appears that children and adults improve with training to a similar
extent regardless of sex. Thus, these results suggest that when comparing training effects, it is at
least as important to consider the methodological characteristics of studies (i.e., the nature and
variables.
individuals are relatively untrained (i.e., receive little practice on or have little experience in)
should show particularly large gains in training. Although humans process and navigate a rich
array of spatial information on a daily basis, it is possible that not all spatial skills receive equal
training in naturalistic settings and that an increased reliance on technological aids (e.g., online
maps, GIS) has removed many opportunities to receive natural practice on spatial skills. Certain
professions might yield specific types of spatial experiences that are relevant to spatial test-
taking (e.g., dress making and the SR-DAT, Workman, Caldwell & Kallal, 1999; working in a
restaurant and the Water level task, Vasta, Rosenberg, Knott & Gaze, 1997). For most
individuals, however, opportunities for naturalistic spatial training are less available. Thus,
secular trends describing the interaction between technological devices and changes in spatial
Conclusion
Ultimately, the goal of research on spatial skills is to translate the results of individual
training studies into the development of best-practice guidelines for spatial interventions. Reports
of success for individual training regimens on isolated spatial tasks are important in that they
attest to the efficacy of training these skills. However, success in the STEM disciplines depends
on improving more molar measures, such as grades in school and the performance in situated
contexts of tasks requiring spatial skills. Thus, success in improving component skills such as
mental rotation or spatial perception is not noteworthy unless it can be shown that these
improvements translate into the skills that are relevant to success in STEM. This attitude is also
important when evaluating the implications for some of our other results, namely that across
studies, spatial training was not successful in closing the gap between male and female
performance levels. On one hand, one might interpret this result negatively in that it could imply
a level of inevitability about the male advantage in spatial skills or the impossibility of female
students catching up. However, it is important to note, first, that acknowledging the existence of
a gender gap in basic spatial skills is not the same as conceding that males should outperform
females in all of the molar measures relevant to STEM success (e.g., grades, job performance).
In other words, the goal of future research is not to focus on remediation in order to close the
gender gap in basic spatial skills but to close the gap in STEM success.
Directions for future research. This metaanalysis was limited to studies that included at
least one spatial outcome measure. One important direction for future work is to analyze the
relationship between basic spatial skills and measures that are directly relevant to STEM success,
such as classroom behaviors, grades, and occupational success. Our analysis identified some key
similarity between training task and outcome measure, providing feedback during training, and
training regimens. Across all studies, 81% tested the effects of training immediately after the
conclusion of training, indicating the need for more research that includes delayed posttesting.
Related to the issue of control groups, we found that control group improvement varies
widely but that the magnitude of gains experienced by a control group can heavily influence
whether a training intervention is judged to be effective. For example, studies that fail to include
a control group typically report significantly higher effect sizes than those that include some type
of control group. When control groups are included but improve a great deal, this also masks the
children, control groups tend to perform very poorly, which also can potentially lead to inflated
estimates about the effectiveness of training. Thus, training effect sizes must be interpreted
cautiously, and the specification of control groups is an important
In this metaanalysis, we also found not only that test-retest effects are sizable but that
they increase with the number of separate tests included as part of a training study. Taken
together, these findings suggest that a significant component of training, one that improves with
age, is the act of learning to take tests and the development of spontaneous strategies for
approaching spatial tests. On one hand, designing training interventions that surpass the sizable
improvements attained through repeated practice is a challenge. On the other hand, it also
suggests a potentially important mechanism for raising spatial skills to a minimum level of
performance. Our finding that the act of taking multiple separate tests provides “training”
suggests the importance of training test literacy as well as focusing on component spatial skills.
References
Baenninger, M., & Newcombe, N. S. (1989). The role of experience in spatial test performance:
Barnett, S. M., & Ceci, S. J. (2002). When and where do we apply what we learn? A taxonomy
31-43.
* Batey, A. H. (1986). The effects of training specificity on sex differences in spatial ability.
* Battista, M. T., Wheatley, G. H., & Talsma, G. (1982). The importance of spatial visualization
* Beilin, H., Kagan, J., & Rabinowitz, R. (1966). Effects of verbal and perceptual training on
* Ben-Chaim, D., Lappan, G., & Houang, R. T. (1988). The effect of instruction on spatial
visualization skills of middle school boys and girls. American Educational Research
* Blade, M. F., & Watson, W. S. (1955). Increase in spatial visualization test scores during
engineering study. Psychological Monographs: General and Applied, 69 (12, Whole No.
397), 1-13.
* Blatter, P. (1983). Training in spatial ability: A test of Sherman's hypothesis. Perceptual &
Borenstein, M., Hedges, L., Higgins, J., Rothstein, H. (2005). Comprehensive Meta-analysis Ver-
Capozzoli, M. V., McSweeney, L., Sinha, D. (1999). Beyond kappa: A review of interrater
* Carpenter, F., Brinkmann, E. H., & Lirones, D. S. (1965). Educability of students in the
visualization of objects in space (Cooperative Research Project No. 1474). Ann Arbor,
* Chatters, L. B. (1984). An assessment of the effects of video game practice on the visual motor
Toledo.
* Churchill, R. D., Curtis, J. M., Coombs, C. H., & Harrell, T. W. (1942). Effect of engineer
Measurement, 2, 279-280.
* Ciganko, R. A. (1973). The effect of spatial information training and drawing practice upon
* Clements, D. H., Battista, M. T., Sarama, J., & Swaminathan, S. (1997). Development of
students' spatial thinking in a unit on geometric motions and area. The Elementary School
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
* Connor, J. M., Schackman, M., & Serbin, L. A. (1978). Sex-related differences in response to
49, 24-29.
* Day, J. D., Engelhardt, J. L., Maxwell, S. E., & Bolig, E. E. (1997). Comparison of static and
* De Lisi, R., & Cammarano, D. M. (1996). Computer experience and gender differences in
361.
* De Lisi, R. & Wolford, J. L. (2002). Improving children's mental rotation accuracy with
* Deratzou, S. (2007). A qualitative inquiry into the effects of visualization on high school
effect: Now you see it, now you don’t. Journal of Applied Psychology, 84, 795-805.
* Dorval, M. & Pepin, M. (1986). Effect of playing a video game on a measure of spatial
University of Minnesota.
* Duesbury, R. T., & O'Neil, H. F. (1996). Effect of type of practice in a computer-aided design
* Eliot, J. (1966). The effects of age and training upon children's conceptualization of space.
* Emler, N. & Valiant, G. L. (1982). Social interaction and cognitive conflict in the
* Faubion, R. W., Cleveland, E. A., & Harrell, T. W. (1942). The influence of training on
* Feng, J. (2006). Cognitive training using action video games: A new approach to close the
* Feng, J., Spence, I., & Pratt, J. (2007). Playing an action video game reduces gender
18, 126-140.
Gentner, D., & Markman, A. B. (1997). Structure mapping in analogy and similarity. American
* Gerson, H. B. P., Sorby, S. A., Wysocki, A., & Baartmans, B. J. (2001). The development and
* Geva, E., & Cohen, R. (1987). Transfer of Spatial Concepts from Logo to Map-Reading
Department of Education.
* Gittler, G., & Gluck, J. (1998). Differential transfer of learning: Effects of instruction in
descriptive geometry on spatial test performance. Journal for Geometry and Graphics, 2,
71-84.
Halpern, D. F., Benbow, C. P., Geary, D. C., Gur, R. C., Hyde, J. S., & Gernsbacher, M. A.
Science, 8, 1-51.
Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., & Gerrard, M. O. M. (in press). Retesting in
Applied Psychology.
Heckman, J. J. & Masterov, D. V. (2006). The productivity argument for investing in young
Hedges, L. V., & Chung, V. (in preparation). Does spatial ability predict STEM college major
* Heil, M., Rossler, F., Link, M., & Bajric, J. (1998). What is improved if a mental rotation task
* Hsi, S., Linn, M. C., & Bell, J. E. (1997). The role of spatial reasoning in engineering and the
Humphreys, L. G., Lubinski, D., & Yao, G. (1993). Utility of predicting group membership and
* Johnson, J. E. (1991). Can spatial visualization skills be improved through training that
of Minnesota.
* Johnson, S., Flinn, J. M., & Tyer, Z. E. (1979). Effect of practice and training in spatial skills
on embedded figures scores of males and females. Perceptual & Motor Skills, 48, 975-
984.
Johnson, M. H., Munakata, Y., & Gilmore, R. O. (Eds.) (2002). Brain Development and
Kanaya, T., Scullin, M. H., & Ceci, S. J. (2003). The Flynn effect and U.S. policies: The
* Kaplan, B. J., & Weisberg, F. B. (1987). Sex differences and practice effects on two visual-
* Kass, S. J., Ahlers, R. H., & Dugger, M. (1998). Eliminating gender differences through
performance on a field-based map skills task. Cognition and Instruction, 25, 45–74.
* Kastens, K. A., Kaplan, D., & Christie-Blick, K. (2001). Development and evaluation of
“Where are We?” map skills software and curriculum. Journal of Geoscience Education,
49, 249-266.
* Kirby, J. R., & Boulter, D. R. (1998). Spatial abilities and transformational geometry.
12, 146-155.
* Kozhevnikov, M., & Thornton, R. (2006). Real-time data display, spatial visualization ability,
and learning force and motion concepts. Journal of Science Education and Technology,
15, 111-132.
* Kwon, O. N., Kim, S. H., & Kim, Y. (2002). Enhancing spatial visualization through Virtual
Reality (VR) on the web: Software design and impact analysis. Journal of Computers in
program and paper-based program. In C. W. Chung, C. K., Kim, W. Kim, T. W. Ling, &
Landis, J. R. & Koch, G. G. (1977). The measurement of observer agreement for categorical
* Larson, P. et al. (1999). Gender issues in the use of virtual environments. CyberPsychology &
Behavior, 2, 113-123.
development of 7-8 year-old children. Korean Journal of Child Studies, 16, 79-88.
* Leino, V., & Willemsen, E. (1976). Use of a perceptually based apparatus to train adult
* Lohman, D. F., & Nichols, P. D. (1990). Training spatial abilities: Effects of practice on
experience with related games. Educational & Psychological Measurement, 50, 1-6.
* Luursema, J., Verwey, W. B., Kommers, P. A. M., Geelkerken, R. H., & Vos, H. J. (2006).
* McClurg, P. A., & Chaille, C. (1987). Computer games: Environments for developing spatial
* McGee, M. G. (1978). Effects of training and practice on sex differences in mental rotation
* McGillicuddy-De Lisi, A. V., De Lisi, R., & Youniss, J. (1978). Representation of the
horizontal coordinate with and without liquid. Merrill-Palmer Quarterly, 24, 199-208.
* Miller, G. G., & Kapel, D. E. (1985). Can non-verbal, puzzle type microcomputer software
affect spatial discrimination and sequential thinking skills of 7th and 8th graders?
* Miller, J. W., Boismier, J. D., & Hooks, J. (1969). Training in spatial conceptualization:
Morris, S. B. (in press). Estimating effect sizes from the pretest-posttest-control design.
National Research Council (2006). Learning to Think Spatially. Washington D. C.: The
* Okagaki, L., & Frensch, P. A. (1994). Effects of video game playing on measures of spatial
* Pepin, M., & Dorval, M. (1986, April). Effect of playing a video game on adults’ and
adolescents’ spatial visualization. Paper presented at the annual meeting of the American
* Piburn, M. D., Reynolds, S. J., McAuliffe, C., Leedy, D. E., Birk, J. P., & Johnson, J. K.
* Priddle, R. E., & Rubin, K. H. (1977). A comparison of two methods for the training of spatial
* Ranucci, E. R. (1952). Effect of the study of solid geometry on certain aspects of space
Rayner, K., Foorman, B. R., Perfetti, C. A., Pesetsky, D., & Seidenberg, M. S. (2001). How
Rosenthal, R. (1979). The “file drawer problem” and tolerance for null results. Psychological
* Saunderson, A. (1973). The effect of a special training programme on spatial ability test
* Schaeffer, P. D., & Thomas, J. (1998). Difficulty of a spatial task and sex difference in gains
* Schaie, K. & Willis, S. L. (1986). Can decline in adult intellectual functioning be reversed?
* Schofield, N. J., & Kirby, J. R. (1994). Position location on topographical maps: Effects of
task factors, training and strategies. Cognition and Instruction, 12, 35-60.
* Seddon, G. M., & Shubber, K. E. (1984). The effects of presentation mode and colour in
* Seddon, G. M., & Shubber, K. E. (1985a). The effects of colour in teaching the visualization
* Seddon, G. M., Eniaiyeju, P. A., & Jusoh, I. (1984). The visualization of rotation in diagrams
* Shavalier, M. (2004). The effects of CAD-like software on the spatial ability of middle school
Shea, D. L., Lubinski, D., Benbow, C. P. (2001). Importance of assessing spatial ability in
Sherman, J. A. (1967). Problem of sex differences in space perception and aspects of intellectual
Shonkoff, J., & Phillips, D. A (Eds.). (2000). From neurons to neighborhoods. Washington, DC:
* Simmons, N. A. (1998). The effect of orthographic projection instruction on the cognitive style
* Sims, V. K. & Mayer, R. E. (2002). Domain specificity of spatial expertise: The case of video
* Smith, G. G. (1998). Computers, computer games, active control and spatial visualization
* Smith, W. S., & Litman, C. I. (1979). Early adolescent girls' and boys' learning of a spatial
* Smith, W. S., & Schroeder, C. K. (1979). Instruction of fourth grade girls and boys on spatial
* Sorby, S. A., & Baartmans, B. J. (1996). A course for the development of 3-D spatial
* Sorby, S. A. (2008). Applied Educational Research in Developing 3-D spatial skills for
Spelke, E. S. (2005). Sex differences in intrinsic aptitude for mathematics and science?: A
* Stringer, P. (1975). Drawing training and spatial ability. Ergonomics, 18, 101-108.
* Subrahmanyam, K. & Greenfield, P. M. (1994). Effect of video game practice on spatial skills
Terlecki, M. S., & Newcombe, N. S. (2005). How important is the digital divide? The relation
of computer and videogame usage to gender differences in mental rotation ability. Sex
* Terlecki, M. S., Newcombe, N. S., & Little, M. (2007). Durable and generalized effects of
DOI: 10.1002/acp.1420.
training for male and female engineering graphics students with low, middle, and high
levels of visualization skill as measured by mental rotation and hidden figures tasks.
* Turner, G. F. W. (1997). The effects of stimulus complexity, training, and gender on mental
U. S. Department of Education (2008). The Final Report of the National Mathematics Advisory
http://www.ed.gov/about/bdscomm/list/mathpanel/report/final-report.pdf
* Vasta, R., Knott, J. A., & Gaze, C. E. (1996). Can spatial training erase the gender differences
Vasta, R., Rosenberg, D., Knott, J. A., & Gaze, C. E. (1997). Experience and the water-level task
* Wang, C. H., Chang, C. Y., & Li, T. Y. (2006). The comparative efficacy of 2D-versus 3D-
based media design for influencing spatial visual skills. Computers in Human Behavior,
23, 1943-1957.
* Weidenbauer, G., Schmid, J., & Jansen-Osmann, P. (2006). Manual training of mental rotation.
Wilson, D. B. (2002). SPSS macros for metaanalysis. Downloaded with permission from
http://mason.gmu.edu/~dwilsonb/ma.html
* Workman, J. E., Caldwell, L. F., & Kallal, M. J. (1999). Development of a test to measure
spatial abilities associated with apparel design and product development. Clothing &
visualization test and paper folding test. Clothing & Textiles Research Journal, 22, 22-
30.
* Workman, J. E. & Zhang, L. (1999). Relationship of general and apparel spatial visualization
* Wright, R., Thompson, W. L., Ganis, G., Newcombe, N. S., & Kosslyn, S. M. (2008). Training
* Yates, L. G. (1986). Effect of visualization training on spatial ability test scores. Journal of
Author Note
Linda L. Liu, David H. Uttal, Loren M. Marulis, Christopher M. Warren, and Alison R.
We thank Kate O’Doherty, Bridget O’Brien, Maggie Carlin, Melissa Sifuentes, Bonnie
Vu, and Eleanor Tushman for their assistance in coding the studies included in this metaanalysis.
Footnotes
1
We excluded cases in which the training outcome was a single summary score from an
entire psychometric test (e.g., WPPSI-R or the Kit of Factor Referenced Tests) in which no
breakdown of subtests was included. Because of the high internal consistency of standardized
test batteries, psychometric test scores appear highly malleable, but this is a characteristic of the
high internal consistency of its test items rather than due to aspects related to training procedures.
2
The Q statistic represents the total homogeneity statistic. When an analysis analogous to
ANOVA is performed on effect sizes, the Q is partitioned into the portion represented by the
grouping variable and the portion representing the within groups residual. It follows a chi-square distribution.
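The partition of Q described in this footnote can be illustrated with a minimal sketch. It assumes fixed-effect, inverse-variance weights (w = 1 / SE²), which is one common way of computing Q and may differ in detail from the analysis reported here. The names `Effect`, `weighted_mean`, `q_statistic`, and `partition_q` are hypothetical and introduced only for illustration, and the numbers in the usage example are invented.

```python
from dataclasses import dataclass


@dataclass
class Effect:
    g: float     # effect size (e.g., Hedges' g) for one comparison
    se: float    # standard error of g
    group: str   # level of the grouping (moderator) variable


def weighted_mean(effects):
    """Inverse-variance weighted mean effect size (assumed weighting)."""
    weights = [1.0 / e.se ** 2 for e in effects]
    return sum(w * e.g for w, e in zip(weights, effects)) / sum(weights)


def q_statistic(effects):
    """Homogeneity Q: weighted squared deviations from the weighted mean."""
    center = weighted_mean(effects)
    return sum((e.g - center) ** 2 / e.se ** 2 for e in effects)


def partition_q(effects):
    """Return (Q_total, Q_between, Q_within) for one grouping variable."""
    grand_mean = weighted_mean(effects)
    q_total = q_statistic(effects)

    # Collect the effect sizes belonging to each level of the grouping variable.
    groups = {}
    for e in effects:
        groups.setdefault(e.group, []).append(e)

    # Within-groups residual: sum of each group's own Q.
    q_within = sum(q_statistic(members) for members in groups.values())

    # Between-groups portion: weighted squared deviations of the group means
    # from the grand mean, each weighted by the group's total weight.
    q_between = 0.0
    for members in groups.values():
        group_weight = sum(1.0 / e.se ** 2 for e in members)
        q_between += group_weight * (weighted_mean(members) - grand_mean) ** 2

    return q_total, q_between, q_within


if __name__ == "__main__":
    # Invented effect sizes, for illustration only.
    data = [
        Effect(0.2, 0.1, "children"),
        Effect(0.4, 0.1, "children"),
        Effect(0.9, 0.2, "adults"),
        Effect(1.1, 0.2, "adults"),
    ]
    total, between, within = partition_q(data)
    print(f"Q_total={total:.2f}  Q_between={between:.2f}  Q_within={within:.2f}")
```

Under these assumptions the between-groups and within-groups portions sum to the total Q. In the standard treatment, each portion is referred to a chi-square distribution with (number of groups − 1) and (number of effect sizes − number of groups) degrees of freedom, respectively.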
Table captions
Table 1. Defining characteristics of the outcome measure categories and their correspondence to
Table 2. Study characteristics for the 101 studies remaining in the metaanalysis after the
exclusion of outliers.
Table 3. Mean effect sizes for control groups, treatment groups, and Ec summarized by outcome
measure category.
Table 4. Treatment group effect sizes by outcome measure and study variables.
Table 5. Treatment group effect sizes by outcome measure and type of training.
Table 7. Mean effect sizes for control groups and treatment groups by sex.
Table 8. Mean effect size that summarizes the size of the sex difference favoring males over
females. Averaged over all outcome measures and also presented for each outcome measure.
Table 9. Mean effect sizes for control groups and treatment groups for children vs. adults.
Table 10. Average treatment and control group effect size for younger and older children and
adults.
Figure Captions
Figure 3. Comparison of treatment vs. control group improvement for spatial perception and
spatial principles.
Figure 4. Relative proportion of control group effect size for each type of control group.
Figure 6. Number of separate tests included in training regimen (control group effect sizes).
Figure 7. Mean effect size as a function of time between end of training and posttest.
Figure 8. Comparison of effect sizes for immediate posttest vs. delayed posttest among studies
Figure 9. Mean effect size for near and far transfer after immediate and delayed administration
of posttest.
Figure 10. Scatterplot of mean effect size g vs. HDI ranking of country.
Figure 11. Three hypothetical scenarios for the effect of training on sex differences in spatial
skill.
Figure 12. Mean effect size by type of control group and age.
Table 1
Table 2
9 – 20 hours 16 25.00
21 – 40 hours 5 7.81
41 – 100 hours 2 3.13
More than 100 hours 2 3.13
Total number of sessions (out of 77 studies)
1 (one-time session) 33 42.86
2 7 9.09
3–7 13 16.88
8 – 14 21 27.27
15 – 21 8 10.39
Frequency of training (out of 78 studies)
One-time session 37 47.44
One session per 1-2 weeks 14 17.95
2-3 sessions per week 14 17.95
4-5 sessions per week or “daily” 14 17.95
Days from end of training to posttest (out of 98)
None (immediate posttest) 84 85.71
1–6 7 7.14
7 – 31 11 11.22
61-90 1 1.02
More than 90 1 1.02
Categories of outcome measures b
Spatial principles c 11 10.89
Spatial perception 21 20.79
Perspective taking 9 8.91
Mental rotation 48 47.52
Assembly/Transformation 50 49.50
Training categories
Videogames 18 17.82
Courses
Course alone was treatment 15 14.85
Enhanced course 27 26.73
Spatial training
Tested effects of repeated practice 36 35.64
Tested for transfer to untrained tasks 83 82.18
Study characteristics
Published (out of 113 studies) 76 75.25
Publication year (out of 113 studies)
Through 1970s 27 26.73
1980s 20 19.80
1990s 31 30.69
2000s 24 23.76
Location of study b
Australia 1 1%
Austria 1 1%
Canada 6 7%
China 1 1%
Germany 2 2%
Greece 1 1%
Korea 3 3%
Norway 1 1%
Spain 1 1%
The Netherlands 1 1%
United Kingdom 2 2%
United States 84 83%
a
Data were not reported in a way that separate effect sizes could be obtained for each sex.
b
Percentages do not sum to 100% because of studies that tested multiple age groups, used more
than one type of control group, included outcome measures from multiple categories, or tested
participants from more than one country.
c
Same as Linn and Petersen’s (1985) category of Spatial Perception
Table 3
Outcome category      Control g (SE)   N    Treatment g (SE)   N    C vs. T sig.   Effect size Ec g (SE)   n
Spatial perception    .65 (.10)        11   .96 (.11)          11   p < .05        .52 (.12)               11
Perspective taking    .46 (.12)        5    .89 (.10)          5    p < .01        .89 (.18)               5
Assembly/transform.   .71 (.05)        25   .78 (.05)          25   n. s.          .54 (.07)               25
Spatial principles    .18 (.09)        7    .75 (.07)          7    p < .001       .89 (.11)               7
Mental rotation       .51 (.04)        31   .67 (.04)          31   p < .01        .61 (.06)               31
AVERAGE               .56 (.03)        55   .75 (.03)          55   p < .001       .62 (.04)               55
† Homogeneity achieved
ab
Groups labeled with different superscripts are significantly different.
* Age x Control group type χ2 significant, p < .05
Effect size by Age p > .30 p > .26 p > .25 p > .63 p > .33 p > .52
Under 13 .69 (.31), 4 .99 (.14), 13 .52 (.21), 4 .79 (.11), 22 .55 (.10), 13 .73 (.06), 56
13 – 18 -- -- .88 (.10), 17 -- .75 (.10), 14 .83 (.07), 31
Over 18 1.04 (.15), 19 .53 (.39), 1 .76 (.07), 39 .71 (.13), 10 .67 (.04), 82 .74 (.03), 151
Feedback provided after each trial? p < .05 n. s. p < .01 p < .05 n. s. n. s.
Yes .48 (.25), 5 .81 (.31), 2 † .99 (.09), 17 .89 (.10), 21 .63 (.06), 33 .77 (.05), 78
No 1.13 (.14), 18 .97 (.14), 12 .70 (.06), 43 .51 (.13), 11 † .67 (.04), 75 .73 (.03), 159
Study published? p > .37 p > .26 p > .56 -- p > .82 p > .69
Yes 1.01 (.14), 21 .99 (.14), 13 .76 (.07), 38 .76 (.60), 32 .66 (.04), 75 .76 (.03), 179
No .61 (.43), 2 .53 (.39), 1 .82 (.09), 22 -- .68 (.06), 34 .73 (.05), 59
Random assignment? p > .22 p > .75 p < .05 p > .48 p < .01 p < .001
Yes .90 (.14), 19 .85 (.32), 2 .64 (.08), 27 .80 (.11), 20 .57 (.05), 57 .66 (.04), 125
No 1.31 (.31), 4 .96 (.14), 12 .90 (.07), 33 .68 (.14), 12 .78 (.05), 52 .85 (.04), 113
Classroom? p > .58 p > .19 p > .10 p > .55 p > .11 p > .39
Took place in classroom .65 (.61), 1 .84 (.15), 8 .89 (.08), 22 .85 (.15), 12 .60 (.06), 41 .73 (.04), 84
Outside of classroom .99 (.14), 22 1.22 (.25), 6 .72 (.07), 38 .74 (.10), 19 .72 (.05), 68 .77 (.03), 153
† Homogeneous (p > .005)
Groups are significantly different: * p < .05, + p < .01, & p < .001
a Average of all multi-session groups
Videogames .56 (.22), 3 .53 (.37), 1 .63 (.07), 12 .28 (.38), 1 .75 (.06), 40 .71 (.05), 57
Non-MR videogame -- -- .35 (.09), 4 & -- .72 (.08), 27 .63 (.06), 36 *
MR videogame -- -- .91 (.09), 8 & -- .85 (.12), 13 .87 (.09), 21 *
Spatial training p > .93 p > .31 p < .05 p > .55 p > .24 p > .49
Repeated practice 1.01 (.18), 4 .90 (.15), 7 1.10 (.21), 3 * .82 (.10), 17 .53 (.09), 20 .93 (.07), 34
Transfer from training on a different spatial task .94 (.12), 14 1.17 (.24), 6 .65 (.07), 32 * .72 (.13), 14 .62 (.06), 41 .65 (.04), 131
“Near” transfer 1.66 (.15), 8 & -- .65 (.19), 3 .73 (.45), 14 1.03 (.08), 11 & 1.01 (.07), 36 &
“Far” transfer .29 (.14), 6 & 1.38 (.36), 6 .64 (.06), 29 -- .44 (.05), 30 & .56 (.04), 79 &
-- No cases found
† Homogeneity achieved (p > .005)
Groups are significantly different: * p < .05, + p < .01, & p < .001
Q1 4 0 13 11 32
Q2 6 5 22 5 25
Q3 4 1 19 7 28
Q4 9 8 14 9 24
† Homogeneity achieved
ab
Groups labeled with different superscripts are significantly different.
* Outcome measure x Quartile χ2 significant, p < .05
Table 7
Control a Experimental b
† Homogeneity achieved
ab
Groups labeled with different superscripts are significantly different.
Table 8
Perspective taking -- -- --
† Homogeneity achieved
ab
Groups labeled with different superscripts are significantly different.
Table 9
Control a Experimental b
† Homogeneity achieved
ab
Groups labeled with different superscripts are significantly different.
Table 10
Control Experimental
† Homogeneity achieved
ab
Groups labeled with different superscripts are significantly different.
13 – 18 years 3 6 2 1 4 8 2 1
† Homogeneity achieved
ab
Groups labeled with different superscripts are significantly different.
* Age x Control group type χ2 significant, p < .05
Figure 1
[Flowchart. Decision nodes: “Did training match the outcome measure?” and “Were Ss enrolled in a course where the training occurred?” Terminal boxes: “Stop. Enter ‘SPECIFIC’” and “Stop. Enter ‘COURSES’”.]
Figure 2
[Plot. Y-axis: g (0.0 to 6.0). Categories: Unadjusted g, Winsorized g.]
Figure 3
[Bar chart. Y-axis: Mean effect size (g), 0 to 1.2. Spatial principles: Control 0.18, Treatment 0.76. Spatial perception: Control 0.64, Treatment 0.95.]
Figure 4
[Bar chart. Y-axis: Proportion of effect sizes (0 to 0.8). X-axis category labels were garbled in extraction.]
Figure 5
[Bar chart. X-axis: Type of filler task (Nonspatial, Spatial). Y-axis: 0 to 0.8. Bar labels: 0.54, 0.37.]
Figure 6
[Bar chart. Y-axis: Mean effect size (g), 0 to 0.9. X-axis: Number of test-retest measures per study (One measure, 2 - 4 measures, 5 or more measures). Bar labels: 0.78, 0.59, 0.49.]
Figure 7
[Bar chart. Y-axis: Effect size (g), 0 to 1. X-axis: Immediate test, Up to 2 weeks, More than 2 weeks. Bar labels: 0.76, 0.59, 0.57.]
Figure 8
[Bar chart. Y-axis: Effect size (g), 0 to 1. X-axis: Immediate test, Delayed test. Bar labels: 0.64, 0.65.]
Figure 9
[Bar chart. Y-axis: Effect size (g), 0 to 1.2. X-axis: Transfer type (Near, Far). Bar labels: 1.05, 0.72, 0.58, 0.37.]
Figure 10
Figure 11
[Schematic panels a) and b) with lines labeled M and F; remaining panel content not recoverable.]
Figure 12
[Bar chart. Y-axis: Average effect size (g), -0.2 to 1.2. X-axis: Younger than 13, 13 - 18 years, Older than 18. Annotations: p < .05, p < .01, p = .08.]
Appendix
Mean effect sizes and key characteristics of studies included in the metaanalysis.
Churchill, Curtis, Coombs & Harrell 1942 - Drafting course vs. Surface 1.255 1
Control control (Water Development Test
3 1 2 1
Churchill, Curtis, Coombs & Harrell 1942 - Purification course) 1.391 1
Exp
2D and 3D visualization Multiple Aptitudes .678 2
Ciganko 1973 - Control practice vs. control (2D Test of 2-D Spatial
2 1 3 2
and 3D observational Relations .797 2
Ciganko 1973 - Exp practice)
Clements et al. 1997 Geometry training in
slides, flips, turns etc. Wheatley Spatial
4 4 1, 2 2 1.191 2
using video game Test (MRT)
Tumbling Tetrominoes
Folding Blocks
Connor, Schackman & Serbin 1978 – Unsep Task (adapted from 1 1, 2 2 .259 2
SR-DAT)
Connor, Schackman & Serbin 1978 – Control Training in visuospatial .618 2
1, 2
disembedding vs.
control (no training) Children's 5 1, 2 2
.969 2
Connor, Schackman & Serbin 1978 – Exp Embedded Figures
Test
DeLisi & Wolford 2002 - Control Video game training French Kit Card .341 2
with Tetris vs. control 4 Rotation Test 4 1, 2 2
DeLisi & Wolford 2002 - Exp (Carmen Sandiego) .597 2
Deratzou 2006 Visualization training Card rotation, cube
with problems sets, comparison, Form
journals, videos, lab 3 Board, Paper 4 1, 2 2 .583 10
experiments, computers Folding, Surface
Development Test
.265 2
Dorval & Pepin 1986 - Control Zaxxon video game
playing vs. control (no Embedded Figures
game play) 4 Test 5 1, 2 1
.540 2
Dorval & Pepin 1986 - Exp
Kozhevnikov & Thornton 2006 - Control Added Interactive Paper Folding Test,
Lecture Demonstrations MRT .399 9
(ILDs) to physics
Kozhevnikov & Thornton 2006 - Exp instruction for Dickinson
vs. Tufts science and 2, 3 1, 4 3 1
nonscience majors and
.424 9
middle-school and high
school science teachers
Miller & Kapel 1985 - Control 7th vs. 8th grade Gifted
vs. control (Average Wheatley Spatial .750 4
Miller & Kapel 1985 - Exp ability) students trained 4 Test (MRT) 4 3 2
with problem solving .939 4
video game
Miller, Boismier & Hooks 1969 Teacher-directed
training vs. automated Perspective Ability
training in sighting vs. 2 Test (PAT) 2 1, 2 2 .711 2
combination vs. control
(no training)
Moses 1979 Spatial mathematics Form Board,
course including Punched Holes,
lessons on 3D and 2D Card Rotations
objects vs. control from Kit of
3 1, 4 3 2 .516 4
(math class as usual) Reference Tests for
Cognitive Factors,
Gulliksen’s
Identical Blocks
Mullin 2006 Physical vs. cognitive Wayfinding to
vs. no physical control target, pointing to
over navigation, with 1 target, recalling 5 3 1 .392 32
attention vs. distracted object locations
during wayfinding
Okagaki & Frensch 1994 - Control Tetris video game Form Board, .239 6
training vs. control (no Card Rotation,
Okagaki & Frensch 1994 - Exp 4 1, 4 1, 2 1
video game) Cube Comparison, .420 6
(from French kit)
Parameswaran 2003 Ages 5, 6, 7, 8, 9: Water level task,
Graduated training vs. Verticality task
Demonstration training 1, 2 3 1, 2 2 .870 40
vs. control (completed
task with no feedback)
Parameswaran 1996 - unseparated Tutor guided direct Van verticality test,
Parameswaran 1996 - Control instruction in principle Water-clock and
1, 2 3 1, 2 1 .703 16
Parameswaran 1996 - Exp vs. Learner guided self- cross-bar tests of
discovery vs. control horizontality
(no feedback) Water level task .103 2
1 3 1, 2 1
.525 4
Pepin & Dorval 1986 - Control Zaxxon video game .157 4
training vs. control (no SR-DAT
4 1 1, 2 1, 2
Pepin & Dorval 1986 - Exp training) .332 4
Sims & Mayer 2002 - Control Tetris players vs. non- Paper folding test,
Tetris players vs. Form Board and 1.111 9
control (no video game MRT (with Tetris vs.
Sims & Mayer 2002 - Exp 4 1, 4 1 1
play) non-Tetris shapes or
letters), Card 1.193 9
Rotations
Smedslund 1963
Ink horizontality training Water-level task
1 3 3 2 .184 1
5 = Spatial perception