
Running head: WHAT DO MULTIPLE-CHOICE TESTS ACTUALLY MEASURE?








What Do Multiple-Choice Tests Actually Measure?
Jay C. Powell, Ph.D., C.E.O.
Better Schooling Systems
107 Shawnee Road
Pittsburgh, PA 15241
Ph. (412) 835-2116

Abstract
Two alternative test-scoring approaches are presented and compared: 1) total-correct score
changes and 2) answer-selection changes. Common practice compares changes in acceptable
(type A) answer-selection frequencies, ignoring the selection of unacceptable (type non-A)
answers. Alternatively, students bring their entire selves to the test situation, interpret the
question using all their current skills, and then select the closest match to their interpretation
from those offered. That is, tests measure students' item-interpretation skills. In this context, all
answer-selection frequencies can be scaled, providing greater granularity for performance
interpretation. Support for this contention is provided using behavioral analysis of test items
used in North America and in Asia. A method to bypass linear dependency was developed to
determine the strength of answer-selection-change frequencies. The results from several studies
are combined to show that specific answer changes are based upon item-interpretation changes.
A representative sub-sample from a 2,000+ balanced sample of students from the third grade
through the end of high school in a Midwestern city was scored using both techniques. Total-
score changes agreed with item-interpretation changes in fewer than one case in three (16 of 52).
If answer-interpretation changes are more educationally meaningful (being qualitative and
formative) than total-score changes (being quantitative and summative), then current test-scoring
practice needs reconsideration. Because item answer selection involves students' thinking and
question-interpretation skills in addition to their acquired content knowledge, exploring ideas
may be more educationally productive than acquiring content. This paradigm shift implies a
corresponding change in teaching from didactic presentation to exploring ideas.

Key Words: broad perspective, chronological-age scaling, cognitive maturity, creativity,
formative assessment, general linear model, item interpretation, item response theory,
linear dependency, multiple-choice tests, narrow perspective, test scoring, total-correct
scoring, qualitative, quantitative, summative, response spectrum evaluation, Thurs
statistic, uncertainty.

"This universe can only be explored, not captured."
"Understanding demands placing something into a context." (Byers, 2011, pp. 8-9)
Introduction
This paper addresses three central issues in educational measurement when using
multiple-choice tests. These are the:
1. Purposes of educational measurement.
2. Nature of the behaviors these tests observe.
3. Mathematical and psychological validity of current testing practices.
Purposes for testing
Educational measurement may be defined as standardized, systematic observations of
students' behavior. These observations serve three distinct purposes (Wainer & Thissen, 1961),
with a fourth added by the present author:
1. Assessing student performance,
2. Making comparisons,
3. Conducting contests and
4. Informing teaching.
Assessing student performance
For academic performance assessment these observations are standardized for three
essential reasons:
1. The tasks are representative of the knowledge domain being tested (Content
validity).
2. The tasks are constructed to incorporate the behavior(s) to be observed (Construct
validity).
3. The observed behaviors identify the current performance status of the respondent
(Reliability).
Tests are systematic because they form a representative sample of the domain of tasks,
content, or skills being examined for proficiency, organized to invoke acceptable (type A)
answer selections.
As observations, they require students to make structured responses uniquely identifiable
against task requirements, making it possible to use student output to define proficiency levels.
Multiple-choice (m-c) tests have a distinct advantage in this context because they can assess
tightly defined performances. The problem addressed in this paper is that, for well-crafted
tests, the non-A selections also identify uniquely definable selection procedures, but this
information is ignored in current scoring practices. This paper challenges the wisdom of this
oversight.
Making comparisons
Originally, multiple-choice tests were developed to estimate the capabilities of specific
individuals. More recently, as the accountability movement overtook the educational enterprise,
these tests have been used increasingly to make comparisons among students, schools, school
systems and even nations (Quality control).
Conducting contests
Testing as a contest has a long history of use to conserve limited resources. For this
purpose, tests are regularly used as one criterion for college admission and for
employment decisions (Scalability). They can also be used for competitive identification of
talent, such as in programming contests, mathematics contests and the like (Talent identification).
Informing teaching
Since the realization that human thought is a cybernetic system, the role of feedback has
become an important part of our education system. It is being used for proficiency assessment
and quality control. If there is something wrong with the manner of this application, our
educational system could be in serious trouble.
To detect an error in our use of test information for feedback, we need to examine the
psychology of how test answers are selected or generated by students and how test-scores are
generated in the test-scoring process. In this paper, we consider several test items using
interview-based answer-selection or item structure analysis to determine the answering
rationales.
This fourth use for testing is less common for professionally prepared tests but is the
main purpose for teacher-prepared tests. It was added to the Wainer and Thissen trilogy by the
present author. It focuses primarily upon the diagnostic dimension of educational delivery
(Powell, 2009). Commonly, these tests are designed specifically to identify particular
weaknesses or disorders such as for special education placement. In this paper a broader view is
taken. The perspective here is to use tests to help teachers identify specific strengths or
weaknesses of students (from their unacceptable, type non-A, answer selections) for program
planning and delivery decisions and to assess the success of attempted interventions (Diagnostic
feedback). Errors occur in layers and progressively disappear as understanding increases.
What do multiple-choice tests measure?
The main issue in this paper is: What are m-c tests actually measuring? Practitioners
assume that achievement tests measure the course-content knowledge and skills that students
have acquired through teaching. This paper suggests a third measurement outcome: the item-
interpretation skills of students. Thus the measurements include (in order of educational
importance):
1. Item interpretation skills,
2. Course specific skills, and
3. Subject-matter knowledge.
At the lowest level are the content- or procedure-recognition skills students have
acquired; they answer from memory with little additional cognition. In current test-scoring
practice, without regard to their selection methods, their frequency of selecting type A answers is
assumed to be proportional to what they know. This equivalence assumption has frequently gone
untested. For instance, Powell (1968) showed that students' reasoning reports for answer
selection significantly predicted which non-A answer they selected. Answer selection was based
upon interpretation strategies and not knowledge; total scores could therefore be confounded
by students' item-interpretation skills.
Beyond memorization may be several levels of reading comprehension and response
capabilities that are developmentally appropriate to their current cognitive maturity (CM). At
least, Item Response Theory (IRT) response curves often show an ordered sequence of answer
selection among the non-A answers (DeMars, 2010, pp. 22-27, etc.). When response selections
display such systematic distributions, it is unreasonable to reject them as random events. At
present, IRT is founded upon a know-guess presumption in which the frequency of type A
answer selection is assumed to contain all the educationally meaningful information. The
possibility that answer selection is thoughtfully based upon criteria other than the course content
is not being considered. If such alternative bases are in fact being used, these tests are measuring
these other criteria. In that case, conclusions derived from current scoring will be spurious,
which would explain why ten years of effort have failed to produce effective educational
improvement (Guisbond et al., 2012).
Further, students have abilities to perform information-management tasks that may or may
not have been intended by the item designers. Finally, examinees may arrive with profound
understanding beyond that intended by the item writers, leading to selection of type non-A answers
for metacognitive reasons (Bond and Fox, 2007). Bond and Fox (p. 22) attribute this phenomenon
to inattention rather than metacognition without testing their hypothesis.
Course skills
Some test items require more than the recall of factual material. In this case, application
of knowledge is expected. These applications can vary from mathematical computations to
deciphering logical relationships. If special notation systems or response structures are also
required, the student must be familiar with expected response styles to avoid response style
errors confounding the assessments of understanding.
In every case, the test items are designed to converge to specific expectations. Some
clinical and psychological tests, such as the MMPI, are scored based upon response patterns that
categorize the examinees, making diversity of responses assessable. Such scoring procedures are
uncommon in achievement testing. Note that convergence to certainty is contrary to the
characteristics of knowledge as defined by Byers (2011) in the epigraph quoted above.
Item interpretation skills
In any case, when non-A options are ignored, the rationale for these errors is lost. If that
rationale is educationally valuable, then its value is also lost. It is now appropriate
to look closely at what m-c tests actually measure from three perspectives, using real test items
and live data, with interviews or analysis of item linguistic structure to determine answer-selection
rationales:
1. Reading skills,
2. Common misconceptions and
3. Developmentally appropriate thinking processes.
Reading skills:
Here is a sample item illustrating how reading errors can influence answer interpretation.
This item was taken from Powell, Bernauer, and Agnihorti (2010) and is derived from a test
given in India to 16,000+ students in Years 4, 6, and 8 in mathematics and science.
The commentary is also adapted from that publication.
Figure 1. India data; science item 26
This item requires the student to analyze two separate
statements for their meaning and then to synthesize the
resulting relationship.
The dynamics of the answer selection, drawn from the selection proportions at each age
level, support the interpretation-skills presumption about selection dynamics. The table and
graph in Figure 1 provide this information.
Alternative A is remarkable in that it shows no change with age. This answer is achieved
by responding only to the emphasized wording of the question. This response is clearly a
misreading of the item. The words "double" and "half" are capitalized, and the reverse outcome
would be the true effect. Looking for key words is an important reading skill, here misapplied.
Those who chose option B have the difference in the correct direction, but did not get the
proportion relationships correct. Their reading of the item gave them partial understanding.
Those who chose option C did not understand the item. They responded to the
illustration, which shows both balls appearing to be of equal size.
The proportion choosing the acceptable option D increased from about 1 in 6 to a bit better than
1 in 4. When teachers emphasize information acquisition instead of how to think, this outcome
would be expected with such questions. This item is measuring item-interpretation skills, the
lack of which confounds the assessment of understanding. Alternative C declined significantly in
selection proportion, alternative D increased significantly in selection proportion, and the
selection of the rest can be inferred from the item structure. Therefore, most of the selections
were thoughtful and showed the ways these students interpreted this question. The fact that the
overall selection proportions suggest primarily random selection could mean that this item would be
discarded, and the selection information just presented would be lost as feedback on teaching
procedures for these teachers.
The criticism being raised here is that it is not usual for a production-level test item to
have its answer selections considered this closely, leaving the teachers of these students
uninformed about the details of their skill levels.
At issue, therefore, is that this item was set with an arbitrary expected answer and
machine scored, with only the degree of rightness collected, losing all other educationally
pertinent information.
When we consider the reasoning that must have occurred to produce the selection of these
options, the lack of reading skill becomes immediately evident. Students' ability improved by
about 50% in four years, but is still not good enough for most of these students to read accurately
for details. Would this additional information help effective remediation of these students'
difficulties? As a special-education teacher in high school for a number of years, my experience
gives an affirmative answer to this question.
Subject matter knowledge
Tests are designed to be a representative sample of a specific body of content. The
acceptable (type A) answers represent the expected selections from course content. The frequency
of selection of these answers is assumed to be directly proportional to the level of knowledge
achieved.

The optional (type non-A) answers in such tests, variously called foils, misleads, or
distractors, presumably contain little useful student-performance information. Additionally,
because including them in a linear analysis produces linear-dependency problems, test scoring
avoids this problem by converting non-A answers to zero (0) in the scoring process. Any
information these answer selections might contain disappears into cyberspace. In this way the
dichotomous data required for the general linear model (GLM) are preserved by converting
polytomous data to dichotomous data, with an unknown level of data and information loss. We
are arguing here that these are untested hypotheses being confounded by students' item-
interpretation skills. At the least, this paper raises an issue worthy of deep consideration;
students the world over are in peril if this challenge has substance.
Actually, these data are collected as members of a nominal scale for each item, with only
one of the options offered expected to be chosen. Should we question the mathematical validity
of converting nominal-scale data to interval-scale data by accepting only one of the possible
answers within each item? Any performance information contained in these ignored non-A
answers is lost in the scoring process. If these answers could provide more useful information
for teaching than the frequency of type A answers, then the potential of these tests to inform
teaching is also lost. This omission could be producing a "blind spot" (Byers, 2011) in the
observation of the dynamics of learning.
Common Misconceptions
Figure 2: India science test item 23.
The fascinating outcome of this question is its indication of a serious non-A answer
increase with age, illustrating the reason for concept inventories (Halloun and Hestenes, 1985).
Option C increased from 50% to 71% selection over this four-year interval, with all of the
other three options declining proportionally. Dr. Agnihorti revealed that in India, New York is
explained as being geographically opposite to Mumbai, which became the basis for this error.
This question illustrates several aspects of item interpretation. First, there is error
persistence. From the correct answer selection proportions alone this is a difficult item whose
difficulty increased with age, clearly identifying a concept that needs remediation.


Second, the particular error selected explains why this item is so difficult.
Third, if this item had been discarded because of its difficulty, teachers would not know
this difficulty existed.
Fourth, it illustrates the fact that correct answers are context specific. They should not
be treated as absolute in test scoring because some students may have more knowledge than
expected. The author has regularly encountered instances of such items in standardized tests.
Developmentally Appropriate Thinking Processes


Figure 3: Item 18 (Gorham, D., Proverbs Test, 1957).
"Quickly come, quickly go." (Easy come, easy go.)
a. Always coming and going and never satisfied.
b. What you get easily does not mean much to you.
c. Always do things on time.
d. Most people do as they please and go as they please.

This item is part of a 40-item, 4-option m-c test which has now been used in four studies.
(The reasoning reports come from interviews.)

Eight-year-olds choose "c" (timely action) because this is what the teacher always tells them in
the busy third-grade classroom. They are reading only the first part of the proverb and relating
correctly, egocentrically, to their personal in-classroom activities.

Ten-year-olds choose "d" because it talks about coming and going. They have shifted
to attempting a literal interpretation and correctly match their interpretation to the nearest good-
fitting option.

Fourteen-year-olds choose "a" because they read the first part to mean "A rolling stone
gathers no moss." Once again this is correctly fitting their interpretive conclusion to the option
selected.
Sixteen-year-olds choose "b" (the expected answer) because they read the whole proverb and
selected the acceptable answer, correctly derived from familiarity with its use or inferred
correctly from their achieved ability to deal with figurative instead of literal presentations.

Some more profoundly informed students choose "a" because the "never satisfied"
phrase deals best with the way this proverb is used in an interpersonal context. These students
are bringing deeper understanding to the question than the item designer expected (Bond and
Fox, 2007, interpret this phenomenon as mental laziness instead of profundity of interpretation).
All these answers are developmentally appropriate, informing teachers about how these
students are thinking and only indirectly what they know. Furthermore, each answer selected
represents a discrete change of orientation in the interpretation of this item, making the learning
sequence discontinuously non-linear. Thus the application of the general linear model (GLM) is a
violation of the nature of these data.
Figure 4 provides the results of the analysis of most answer selections (both acceptable and
unacceptable) on this test. These data are qualitative and formative. Ignoring this discontinuity
changes the item interpretations from the qualitative and formative mode into the quantitative
and summative mode. The present author has used this information to improve teaching
(Powell, 2010a), finding the qualitative information extremely useful for program planning and
instructional delivery.
Figure 4. Correlation matrix of answer-selection subtests (reproduced from Powell, 1977).
These answers form a transformative developmental sequence, which in the correlation
matrix reproduced in Figure 4 in Powell (1977) is conjoint (Luce and Tukey, 1964), explaining
why the subtest scores sequence that emerged correlated perfectly (r = 1.00) with the age modes
of selection. Thus we obtain an answer sequence based upon chronological age (CA). Herein
lies a critical interpretive problem. Since we consider only the frequency of type A answers, we
could be converting our test-score interpretations from formative (developmentally appropriate)
to summative in violation of the nature of these data.
Notice that most of the correlation triads are conjoint. The Concrete Right Answers and
the bimodal subset form the two exceptions. A third subset, not included here, was the nine non-A
answers that did not cluster. These and the bimodals formed the subset that was chosen by the
profoundly informed and was characteristic of 18-year-olds, given as A+ for Item 18 in its
developmental sequence. This shift from a narrow-perspective type A answer to a broad-
perspective non-A answer (according to the scoring key) actually adds a multidimensional
thinking phase to Piaget's model and implies that total-correct scores are insufficient to account
for all the psychological aspects of the performance being measured. This shift could be the step
necessary to release student creativity by empowering consideration of more than two options.
The developmental sequence for all answers in Item 18 is: C → D → A → B* → A+, where the
corresponding CA levels are: R_m (all ages), 8, 10, 14, 16+ (R_u), 18+. These represent the CM
formative levels: 1) memory-based answers (requiring little cognition), 2) egocentric thinking,
3) literal thinking, 4) transitional thinking, 5) formal thinking (yes-no logic) and
6) multidimensional thinking (profound understanding involving N-value logic). In this last case
we see students considering options beyond the information given (Bruner, 1973).
Here is an illustration from the mathematics test in the India data:
Figure 5: Mathematics item 13.

Although this study did not involve repeated measures, so that changes of answers by
individuals were not available, the age-order sequence of alternative answer selection is still
visible from the selection ratios. Option A displaced option B between Years 4 and 6, and the
right answer (option C) is still not the most-selected option in Year 8. Alternative D is barely
functioning. Is this an estimate of the amount of guessing on this item? If so, it is under 10% for
all three ages. This low level of guessing can hardly explain the appearance of the normal
bivariate curve in the score frequencies, and the graphs themselves clearly show a strategy or
orientation change between Years 4 and 6.
In all of these illustrations from two continents, an identifiable developmental sequence
underlying non-A answers is evident.
Defining Cognitive Maturity
At this point, it is essential that a definition of cognitive maturity (CM) be given. It is:
the average current status, on an age scale, of each student's performance, based upon the
norming modal ages for every answer the student selected.
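Operationally, this definition can be read as an average of the norming modal ages attached to
the answers a student selected. The sketch below illustrates that reading; the answer-to-modal-age
map is hypothetical, and the published scaling (Powell, 1977; Powell and Shklov, 1992) may
weight or sequence the levels differently.

    # Hypothetical sketch: cognitive maturity (CM) as the average of the norming
    # modal ages of the answers a student selected.  The modal ages shown are
    # illustrative only, not the published norms.

    ITEM_18_MODAL_AGES = {"C": 8, "D": 10, "A": 14, "B*": 16, "A+": 18}  # assumed

    def cognitive_maturity(selections, modal_ages_by_item):
        """selections: item id -> chosen option; modal_ages_by_item: item id -> {option: age}."""
        ages = [modal_ages_by_item[item][choice] for item, choice in selections.items()]
        return sum(ages) / len(ages)

    # Toy usage with two items that share the Item 18 age scale.
    norms = {"item18": ITEM_18_MODAL_AGES, "item19": ITEM_18_MODAL_AGES}
    print(cognitive_maturity({"item18": "D", "item19": "A"}, norms))  # -> 12.0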
In the Proverbs Test, which we are treating as a reading comprehension test, and in the two
tests from Asia, one in mathematics and one in science, the selection ratios tell the same story:
when all options are considered, there is a hierarchy of cognitive complexity in the answers
selected, and in many cases these form part of a behaviorally appropriate developmental sequence.
If interpretation skills are being measured, not knowledge, then using total-correct score
changes to assess knowledge changes may be psychologically invalid, unless item interpretation
skill and accumulated knowledge are equivalent cognitive behaviors.
This supposition gives us a mathematically testable hypothesis: the equivalence
assumption that total-score changes correctly describe performance changes. Scoring a repeated
administration of the same test with the same students, using both total-score changes and all
answer changes, should give nearly identical results for this assumption to be true. If they do not,
then because scoring the changes among the non-A answers provides more detailed information
about changes in student status, using all answers in test scoring may be necessary. In that case,
we must change the ways we score, interpret and use test-derived information, accounting for
these observations, in order to improve educational effectiveness.
The type A answer for item 18 was classified in our research as a memory-based answer
using a second cross-tabulated test, "Schools I Would Like to See" (Powell, Cottrell, and Lever,
1977), that was used to distinguish between two reasoning styles (narrow perspective: self-
protective/self-indulgent thinking; and broad perspective: self-development and others-oriented
thinking), while option A was bimodal at ages 14 and 18, suggesting two different selection
rationales.
Our research (Powell, 1977) with this test showed the non-A answers to be conjoint
(Luce and Tukey, 1964) with 13 sequential levels, exactly recovering modes of answer selection
by age.
Subsequently (Powell and Shklov, 1992), we established seventeen such levels, beginning
at CA 7 years and extending into the early 20s, with an interpretation shift for nearly every year
for half of these categories and a greater frequency of subscores for the younger children. These
results mean that, by scoring answer changes, we can consider a selection moving from one level
of CM to another, across all forty items, to represent +1 year of gain in proficiency.
Our research found two other developmental pathways in this test. One was almost
completely opposite: B* → A → D → C. Most of the 16-year-old high school students
who withdrew from school before the March administration chose D or C in the previous
October. This sequence began with 10-year-olds and ended most frequently with 16-year-olds.
Scoring these changes for CM represents a -1 year transformation along the CM scale for
students' performance declines.
The third pathway stalled at literal interpretations. In the case of item 18, this appears as
repeated choices of option D (Literal Interpretation) for both October and March administrations,
through the end of high school for each age cohort, representing about 30% of all high school
graduates. This observation could mean that about one third of all students in this system stopped
intellectual advancement in the fourth or fifth grade. Such criticisms have been leveled at schools
throughout North America for many years in the absence of firm evidence.
Thus we can represent three developmental patterns for CM, namely: gains, declines and
no change. This pattern of change, i.e., 1) score increase, 2) score decline, 3) score stalled, was
also found in the changes among the total-correct scores for these same students.
Here are the learning dynamics for Item 18:
Figure 6: Three developmental pathways in one item (from Powell, 2010)

The left-hand panel in Figure 6 shows the mainstream pathway, including the shift to option
A+, while the right-hand panel shows the deteriorating cognitive pathway, with the school-leaving
age of 16 years marked. The vertical sequence from D to D is the stalled sequence at the
literal-thinking level. This item was the only one of the 40 items that contained all five advancing
selection stages.

The other 39 items showed shifts within one or more stages. The test was administered
only twice because the study lasted only one year. It is not possible, therefore, to determine
whether these within-item, within-stage shifts represent greater data detail, eddies in the
dynamics, or additional pathways. Nonetheless, the revelation that learning-dynamics pathways
become observable in the repeated-measures context when all answer options are included in test
interpretation is of psychometric importance. Notice that answer selections are discrete events
and are therefore non-linear, and that the interpretive changes are formative, not summative,
making current practice a violation of the nature of these data in two distinct ways. First, because
the interpretation of each answer is known, these data are formative, not summative. Second,
these data show students staying at the same answers on both administrations; in terms of the
developmental sequences observed, this pathway is step-wise, not linear.

The distinction between high and very high proportions represents interpretations from a
new statistic, the Thurs (Θ), which is an adaptation of the multinomial procedure identifying
the strength of the contribution of the observed frequency or less made by each matrix cell to the
entire table (Powell and Shklov, 1992). This approach bypasses linear dependency, making
the inclusion of all answer selections possible. It was used to validate the sequences just
given.
Mathematical and psychological validity
This discussion can now turn to addressing the mathematical and psychological validity
of these two scoring procedures. The discussion deals with:
1. Linear dependency
2. Cognitive maturity and
3. Comparison methods
The question then becomes: Are the changes in CM equivalent, for each examinee, to
their total-correct score change? This question is critically important because changes in
total-correct scores are essential to current approaches to quality-control assessment under the
Federal legislation behind No Child Left Behind, which has failed to meet its objectives
(Guisbond, 2012), Race to the Top (NCES, 2011), and any other quality-control procedure
that might be proposed in the future.
If the directions of changes in CM as defined by scoring all answer selection changes are
not statistically different from the performance changes defined by changes in total-correct
scores, then these two variables may be considered to be equivalent and the current scoring
practice of ignoring non-A answer selection is supported. No further research involving the
selection of these options need be conducted. If they are not equivalent, then we must decide
which variable, total scores or CM scores, provides the better performance assessment.
The frequency of right answers forms a linear scale, implying cumulative knowledge as
represented by the test content. From our interview data, however, the answer-selection changes
are predominantly transformational, involving reorientation of students' thinking processes. As
such, they are non-linear and discrete.
So far, the research has implied that the interpretation of test items predominates,
representing reading skills, logic skills, linguistic skills, misconceptions and metacognitive skills
combining to influence answer selections. The test examples given represent the assessment of
many aspects of performance and not merely knowledge acquisition. We must decide which way
to proceed because the validity of current practice and, ultimately, the effectiveness of education
are at stake.
School psychologists administering individual tests have found that the manner of
answering can be more meaningful than the answers given. Thus the omission of non-A answers
is now being addressed in some current higher-order procedural developments, such as
multidimensional IRT (Reckase, 2009), a polytomous, linearly based IRT mathematical model.
Linear dependency
However, response spectrum evaluation (RSE) does not directly measure knowledge
acquisition. It measures students' skills for acquiring conceptual understanding. Using this
information requires teachers to change their teaching approach from transmitting information,
"the sage on the stage," to being "a guide on the side" (Aronsen et al., 1978; King, 1993;
Deslauriers et al., 2011) by exploring ideas in a playful and gaming milieu. Perhaps we
should teach thinking, learning and self-development skills, using the course content as the
vehicle and provoking student self-generated knowledge through self-developed insights.
This perspective is the essence of Byers' (2011) position for all scientific pursuits,
arguing that the certainty we assume will come from statistical analysis is illusory and that
exploration is the viable alternative for all science, not just education.
Underlying this entire report is the data set collected to establish developmental
sequences among all answers based upon chronological-age (CA) selection modes (Powell, 1977;
Powell and Shklov, 1992). Giving the Proverbs reading comprehension test (Gorham, 1957) twice
(October and March), cross-tabulating these two administrations, and defining the cluster-
established subtests using interviews has revealed the answer-selection dynamics. It should be
noted that this research is
data-analysis based and not theory based. We are arguing that a theory of the dynamics of
learning is not yet sufficiently well formulated to conduct research from a theoretical base,
although the approach utilized here brings that day much closer than is possible with current test-
scoring practices.
Excel profiles from subset change scores show ascending reverse-S curves or
descending S curves along the CA (y) axis. Cognitive maturity (CM) increases when higher-CM
sub-scores increase and lower-CM sub-scores decrease. Hence, the top bulge reveals
advancing changes to the right with increasing subscores, and the bottom bulges to the left with
decreasing CM subscores in the lower achievement range. Declining cognition reverses the
pattern with a horizontal flip: the higher cognitive subscores then show a decline in selection
frequency and the lower ones increase across observations.
Figure 7: Sample student profile. (The profile plots the percentage change in each subscore
between the October and March administrations against that subscore's age mode; declines
extend to the left and increases to the right.)

Change Category        Age Mode    % Change
Over Generalized          14         -16
Transposition             13         -25
Irrelevant                12         -20
Over Simplify             11         +17
Word Association           9         +25
Redefined Terms            8         -22
Isolated Response          8         +13
Narrow Perspective        14         +21
Broad Perspective         18         -11
Total type A              16         +7.5
Figure 7 provides an example of a change-scores profile (from Powell, 2010b, p. 44),
with percentages (to adjust for unequal subtest sizes) given beside each changing profile
element on the x axis. The age modes are at the extreme right. When there is no overall change,
or the change is orthogonal to the CA axis, the horizontal bars of the profile display are scattered
along the y axis, with neither the positive nor the negative clustering patterns at closely associated
CAs (as seen here) being evident.
This example (note the S curve) signifies performance decline. The logic is as given
above. If higher CM subscores decline and lower CM subscores increase, the overall pattern is a
decline in CM. Subscores that did not change were omitted.
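The profile-reading logic just stated can be expressed as a simple sign rule: if the higher-CM
subscores mostly decline while the lower-CM subscores mostly increase, the profile is read as a
CM decline, and vice versa. The sketch below is a rough rendering of that rule using the tabled
Figure 7 values; the age cutoff and the use of summed percentages are illustrative assumptions,
since the study classified profiles visually.

    # Rough sketch of the S-curve reading rule; the split_age cutoff is an
    # illustrative assumption, not part of the published visual procedure.

    def classify_profile(changes, split_age=12):
        """changes: list of (age_mode, percent_change) pairs for the changed subscores."""
        high = sum(pct for age, pct in changes if age >= split_age)
        low = sum(pct for age, pct in changes if age < split_age)
        if high > 0 and low < 0:
            return "CM gain"
        if high < 0 and low > 0:
            return "CM decline"
        return "no clear change"

    # Non-A subscore changes from the Figure 7 profile (age mode, % change).
    figure7 = [(14, -16), (13, -25), (12, -20), (11, 17), (9, 25), (8, -22), (8, 13)]
    print(classify_profile(figure7))  # -> "CM decline"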
This S-curve profile shows a student whose lower not-A profile sub-scores increased,
with one exception (Redefined Terms), while the higher-CM performance declined. This student's
CM was 12-8 in October and 11-6 in March, representing a decline of more than one year in CM
in five months. The student's CA was 13-3 in October. The March score would be similar to
an IQ of about 80 (138/164), for those who appreciate such calculations.
On the other hand, this student's type A score (top three bars) increased by 7.5%
(1.5 years) over these same five months. This increase was caused by a 21% increase in narrow-
perspective selections and an 11% decline in broad-perspective selections. The student's CM
has collapsed from the transition towards formal operations back towards concrete thinking (age
modes 9, 10) and egocentric thinking (age mode 8).
The increase in total-correct score misclassifies this student's CM change by about three
years. Relying upon type A selection score gains means that such students' performance collapses
would not be identified using current practice, delaying attention to their learning needs. This
student was one of 9 with the same change pattern, drawn from a sample of 52 students (every
50th student drawn from the CA-ordered data). Because the equivalence assumption is an all-or-
nothing assumption, this one example is a sufficient condition to falsify it.
In addition, the increase in narrow-perspective type A answers exceeded the decline in
broad-perspective answers. Thus, this increase in total score also represents a decline in overall
CM performance by both measures. Since the equivalence assumption requires most or
all changes in both variables to have the same direction of change, this one profile is sufficient to
cast doubt upon the validity of the assumption.
The sub-score shift may illustrate the deterioration in high level CM identified by the
Fordham study (Shah, 2011).
Cognitive Maturity
Now let us become quite specific. We have a student, Joe Smith, in the eleventh grade,
who scored 27 on the Proverbs test described above. For promotion reasons we have decided he
must achieve a score of at least 26 to pass. Joe is 17 years old.

What do we know about him other than that he passed the test? Absolutely nothing; we
don't even know how he answered item 18 above. The comments that follow are similar to the
profile in Figure 7.

What could we know? Suppose that of his 27 right answers, 20 were narrow-perspective
items (out of 24) and 7 were broad-perspective items (out of 16). Suppose also that 9 of his 13
wrong answers were at the age 9 or 10 maturity level, including alternative D in item 18.
What do we now know about Joe? We now know that he is a narrow-perspective thinker
tending toward literal thinking. His cognitive maturity is about half his chronological age. However,
he will be promoted to the twelfth grade with this huge cognitive deficit if we use the original
decision rule.
This illustration demonstrates the nature of RSE analytic interpretation and displays how
CM is determined using all answer selections. If choices are thoughtful, then they contain useful
information for teachers not currently accessible. Item response theory (IRT) can supply scale
order for not-A answer selections (DeMars, 2010). Our research supports non-A answer
scalability by showing that their categorical subscores are conjoint (Luce and Tukey, 1964;
Powell, 2010a). In contrast with the common assessment myth, probably less than 10% of
students' answers are uneducated guesses (Powell et al., 2011).
RSE (Powell, 2010c) development began with these observations. Our research suggests
that IRT not-A scale differences may be related to students' cognitive maturity (CM). Students
select their answers based upon their position in a Piaget-like developmental sequence or some
other developmental sequence (Powell, 2010d; Powell et al., 2011). Behavioral observations
show increasing CM when students change their answer selection upscale within an item:
1. From a not-A option to a type A option of higher cognitive complexity,
2. From a not-A option to another not-A option of higher cognitive complexity, or
3. From a type A option in a lower cognitive domain to a not-A option of higher
cognitive complexity (Powell, 2010b).
They show no increase in CM when their answer selection in an item:
1. Does not change over time,
2. Changes to another answer in the same item that represents a similar CM level, with
relatively equivalent age-appropriate reasoning, or
3. Shows no clear S-curve pattern on the CA scale.
Meaningful shifts in answer selection from one option to another at the same CM level may
indicate additional developmental pathways. A one-year study cannot provide this information.
These reasoning changes may be related to diverse learning pathways (Gardener, 1993), to
insight-generating interventions by teachers, or to some undefined causes.
They show a decline in CM when they change their answer selection in a particular item:
1. From a type A option to a not-A (wrong) option in a lower cognitive domain,
2. From a not-A option to one in a lower domain, or perhaps
3. From a not-A option to a type A (right) option in a lower cognitive domain (Powell
and Shklov, 1992).
Figure 7 (above) shows all the decline characteristics given above: an increase in
lower-domain type A and not-A selections and a decline in higher-level type A and not-A answer
selections.
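Taken together, these rules amount to comparing the cognitive-complexity level of the answer
chosen on the first administration with that chosen on the second, item by item. The sketch
below is a hypothetical rendering of that comparison (the level numbers are illustrative); the
published procedure also inspects the overall S-curve pattern across subscores, which a per-item
rule alone cannot capture.

    # Hypothetical per-item rendering of the CM change rules listed above.
    # 'level' is the cognitive-complexity level of the selected option;
    # the numeric coding (1..5) is illustrative.

    def cm_change_direction(level_first, level_second):
        """Return +1 for a CM gain, -1 for a decline, 0 for no change."""
        if level_second > level_first:
            return +1   # moved upscale to an option of higher cognitive complexity
        if level_second < level_first:
            return -1   # moved downscale to an option in a lower cognitive domain
        return 0        # same option, or a different option at a similar CM level

    # Toy usage with assumed Item 18 levels C=1, D=2, A=3, B*=4, A+=5.
    levels = {"C": 1, "D": 2, "A": 3, "B*": 4, "A+": 5}
    print(cm_change_direction(levels["D"], levels["A"]))   # -> 1  (gain)
    print(cm_change_direction(levels["B*"], levels["D"]))  # -> -1 (decline)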
These dynamics, as revealed in our research, describe CM as a complex discontinuous
multidimensional non-linear dynamic learning process. It is impossible to detect this process
from dichotomous test scoring wherein all not-A option selections are collapsed to zero (0) and
lost in cyberspace. Additionally, the GLM requires dichotomous data to be the basic data type
within the current scoring paradigm, whereas m-c tests have more than two options, violating
this mathematical requirement. When all options are used to illuminate CM diversity a different
perspective emerges. The GLM procedures cannot detect such response-selection complexity.
For these reasons, when not-A answer-selections are ignored and only type A selections
are considered, misclassifying student performance (as in Figure 7) becomes possible because
the mathematical model being applied is inappropriate for the data set being studied. The bell-
shaped curve may be the statistical noise released from an oversimplifying procedure with these
data, which unintentionally removes the signal from these data.
These studies show that, in most cases, not-A answer selections are strategy dependent
(Powell, 1968, 1977, 2010b and 2010d). Changes in selection reflect students' personal insights
as they try to resolve disequilibrium (Piaget, 1985). Not all such resolutions are growth fostering
(Powell and Shklov, 1992).
When not-A sub-score selection frequencies are combined with the type A sub-score
selections, operationally they measure cognitive maturity (CM). We have found that these changes
in answer-selection behavior are perfectly chronologically (CA) ordered by selection mode
(Powell, 1977 and 2010b). RSE educational-performance observations may augment
multidimensional item response theory (MIRT) (Reckase, 2009). RSE may also provide detailed
performance diagnosis to teachers, once the learning significance of answer-selection pattern
changes is fully understood. That is, this procedure observes the dynamics of learning for each
student on an answer-by-answer basis.
In summary, when all selected answers from m-c tests are used to track learning through
the complex diversity of answer-selection changes (as in the case of Joe Smith), we find
discontinuously multidimensional learning dynamics. Discontinuities occur when identical
answers are selected repeatedly or when answer changes cross CM levels. Stability is evident
where students stall at a particular level, such as a literal-thinking pattern persisting through high
school (Powell and Shklov, 1992; Powell, 2010b).
About one in three students emerge routinely into adult life unable to make finer
distinctions than Yes/No and All/Nothing. Many such students are unable to compete with
their own past performance. They move into business and marriage unable to cooperate, to honor
diversity, to consider multiple options and to find common ground in disputes. A large range of
learning types, where learning is defined as behavior change, is being overlooked because
answer selection diversity is unintentionally suppressed by current test-scoring practices. Yes,
current test scoring practices could actually be this bad. They may be invalid both
mathematically and psychologically.
CM may credibly measure student performance of critical importance to life-long
learning. To determine how credible this option may be, we now compare it with current scoring
procedures to determine the incidence of student misclassification from current practice compared
with using RSE to obtain CM estimates. If the misclassification incidence is too high, then RSE is
the better assessment approach.
Comparison methods
Four essential assumptions from classical test theory (CTT) (Lord, 1952; Lord and
Novick, 1968) drive the current practices for assessment under the NCLB initiative, RTTT
funding, and other achievement assessments. These are:
1. Successful students possess the necessary knowledge to answer correctly.
2. Equivalence occurs between score change direction and cognitive development
direction.
3. High total-correct scores signify similarly high CM.
4. Relative score change is proportional to CM change magnitude.
By implication, in the classical test theory (CTT) mode, type A selections are necessary
and sufficient to establish performance levels with confidence. Unacceptable (not-A) selections
contain little additional information about students' proficiency levels. This paper challenges this
equivalence assumption for NCLB and RTTT assessments through the National Assessment of
Educational Progress (NAEP) (NCES, 2011) and other assessment applications. Accountability
objectives require that these decisions must be based upon grounded data (Glaser and Strauss,
1967). Note once again, this research began with data, not theory.
Identifying the discrepancy between CM decline and type A selection increase (Figure
7) arises from the increased performance detail available from this profile (Powell, 2011). If
such misclassification is not a rare event, current test-scoring practice should be replaced by a
procedure similar to RSE.
To support or challenge the equivalence assumption, the hypotheses become:

H0: Equivalence exists between CM level changes defined by the RSE procedure and
corresponding changes in type A selection frequencies.

H1: Omitting type non-A answer-selection frequencies misclassifies performance
changes.

H2: Using the RSE procedure captures sufficient variability, minimizing student
misclassification when all answer selections are included in the scoring.
These data are a 52-student representative sample drawn from the 2,000+ students from
the third grade through the completion of high school, being every 50th student when the larger
sample is ordered by each student's October CA.
Excel profiles classified individuals into nine frequency categories in a 3 × 3 table
with the variables:
1. Increase,
2. No change, and
3. Decline.
These variables are placed orthogonally, with the CA-based CM scale on the y axis and
aggregate selection-change proportion frequencies on the x axis. The classification procedure
involved visual identification from the profile curves, using the reverse-S and S-curve patterns
illustrated in Figure 7.
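For concreteness, the cross-tabulation that produces the 3 × 3 table can be sketched as follows.
The per-student direction labels in the example are invented; in the study itself the directions
were read visually from the Excel profiles, as just described.

    # Building the 3 x 3 table of CM change direction by total-score change
    # direction from per-student labels (hypothetical data shown).
    from collections import Counter

    DIRECTIONS = ("up", "none", "down")

    def cross_tabulate(students):
        """students: iterable of (cm_direction, score_direction) pairs."""
        counts = Counter(students)
        return [[counts[(cm, score)] for score in DIRECTIONS] for cm in DIRECTIONS]

    # Toy usage with three invented students.
    print(cross_tabulate([("up", "up"), ("none", "up"), ("down", "none")]))
    # -> [[1, 0, 0], [1, 0, 0], [0, 1, 0]]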
Four independent tests of equivalence were used:

1. The captured equivalence proportion in the main diagonal becomes the variance
recovered. Its square root estimates the product-moment correlation (Fisher test). This
value must be near 1.000, giving a significant Fisher value, for the equivalence assumption to be
true.

2. An expectation table designed to represent equivalence, using χ² to determine the
conformity to expectations for all 9 cells. A good fit to the model, distributing high values on the
main diagonal and low values off the diagonal, should produce a non-significant χ².

3. The proportion of within-expectation capture to outside-expectation capture
establishes the degree of equivalence, using a difference-of-proportions test. This should be
significant in favor of the main diagonal.

4. The high Θ values (the basis of RSE) should be in the main diagonal.

Conclusions are drawn from the combination of these four statistical tests.

Results
The main diagonal of Table 1 contains 16 of the 52 members, providing a captured variance
estimate (r²) of 0.308, for which the correlational equivalent is r = 0.555. This value gives a
Fisher z = 1.627, which is not significant at the p = 0.05 level. There is therefore no agreement
between the main-diagonal contents and the expectation that most of the frequencies would reside
there. Notice that the joint-increase cell, the objective for improving education, contains exactly
one fourth of the table frequency for this four-option m-c test. The goal of improving educational
effectiveness in this case will be achieved at no better than the chance level, as supported by
Guisbond (2012).
Notice that the comparison equates commonly used score changes against score changes
derived from a new procedure based upon a new statistic. This second variable (Θ)
is not currently used to assess educational progress. This procedure's credibility hinges upon the
reasonableness of using changes among all answer selections to determine CM. Many years of
investigation have culminated in this report.


Table 1. Observed frequencies

                          Total-Score Change Direction
Cognitive Change      Up       None     Down     Marginal Totals
Up                    13       2        6        21
None                  14       2        2        18
Down                  9        3        1        13
Totals                36       7        9        52

Table 2. Expected frequencies for the χ² test of equivalence

                          Total-Score Change Direction
Cognitive Change      Up       None     Down     Marginal Totals
Up                    15.3     1        1        17.3
None                  1        15.3     1        17.3
Down                  1        1        15.4     17.4
Totals                17.3     17.3     17.4     52

Answer selection has been shown repeatedly to be dependent upon question interpretation.
Table 2 equalizes the main diagonal and places a 1 in all the off-diagonal cells because we cannot
use zero (0) as an expected value for χ².
In Table 2, the χ² analysis gives a value of 289.3. A χ² sum exceeding the critical value of
15.507 (p = 0.05, 8 degrees of freedom) rejects the equivalence hypothesis. Placing an
expectation of 44 in the both-increase cell and 1 in the remainder is even worse. The model does
not fit these data using the χ² test for agreement.
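The first two checks can be reproduced directly from the frequencies reported in Tables 1 and 2.
The minimal sketch below assumes nothing beyond those tables; it recovers the captured variance
of 16/52 ≈ 0.308 (r ≈ 0.555) and a χ² close to the reported 289.3 (with the rounded expectations
shown it comes to about 289.4). The Fisher z reported above is not recomputed here.

    # Reproducing the captured-variance and chi-square checks from Tables 1 and 2.
    import math

    observed = [[13, 2, 6],
                [14, 2, 2],
                [ 9, 3, 1]]                       # Table 1
    expected = [[15.3, 1.0, 1.0],
                [1.0, 15.3, 1.0],
                [1.0, 1.0, 15.4]]                 # Table 2

    n = sum(map(sum, observed))                              # 52
    diagonal = sum(observed[i][i] for i in range(3))         # 16
    r_squared = diagonal / n                                 # ~0.308
    r = math.sqrt(r_squared)                                 # ~0.555

    chi_square = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
                     for i in range(3) for j in range(3))    # ~289.4

    print(round(r_squared, 3), round(r, 3), round(chi_square, 1))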
The observed proportional difference compares 16:52 with 46:52, where 46 is the sum
of the main-diagonal elements in the expectation table. The resulting z = 19.041 indicates
that these two proportions are meaningfully different, in the direction opposite to the fit
requirement.
The new Thurs (Θ) statistic being employed here needs explanation. Developed by
Powell and Shklov (1992), it was intended to test whether answer changes from one non-A
answer category to another fitted the chronological order found for these categories. Verifying
this sequence required bypassing linear dependency. This adaptation of the multinomial procedure
gives two cumulative probabilities. It compares the cumulative probability of the observed cell
frequency or less with the same cell's total possible cumulative probability. It is calculated
independently for each cell in the entire matrix by collapsing each cell to a 2 × 2 table, preserving
the matrix structure. Its range is 0.000 ≤ Θ ≤ 1.000.
Table 3. Θ values computed from Table 1

                          Total-Score Change Direction
Cognitive Change      Up       None     Down     Marginal Totals
Up                    0.261    0.402    0.984    1.647
None                  0.903    0.539    0.327    1.769
Down                  0.627    0.944    0.275    1.846
Totals                1.791    1.885    1.586    5.262

The Θ value reveals the pattern when all changes are considered meaningful. Particularly
important are the three highest cells (0.903, 0.944 and 0.984), which are within two standard
errors (2σe) of 1.000. These are the highest observed frequencies in their respective columns.
Similar strength is not found in the main diagonal, where it should reside. These three cells
account for 23 of the 36 misclassified students, or about two thirds. In Table 3, the Θ value was
calculated for each cell as described above.
For teacher information, students whose CM increased while their type A score declined
need a different scoring system. Those whose scores increased with no CM change, and those
whose CM declined with no score change, need investigation. The 9 students represented by the
Figure 7 group did not emerge as the most important misclassified group. A second-order analysis
for Θ, in which the main diagonal is made zero before the off-diagonal calculation, picks this
group up. This observation may be the way linear dependency appears with the Θ. Sub-score
change characteristics provide specific diagnostic information.
The fact that the sum of the Θ values is greater than 1.000 and is approximately one tenth of the
total frequency may mean that these two values are related and that each cell's Θ expresses its
contribution to the table's variability.
All four tests of the equivalence assumption failed to support it. Using type A selection
frequencies to assess performance change misclassifies more than two thirds of these students.
Such a high misclassification level violates the terms of the grant awards and has unconscionably
high human costs. A reasonable resolution of the problem would require shifting from current
practice to RSE scoring or something similar.
Conclusions and Implications
The main observation instrument used in these studies is Gorham's (1956) Proverbs
Test. It is actually a clinical test developed with a diagnostic purpose. The current author was
intrigued by the presence of two type A answer scores, one for concrete answers and one for
abstract answers. This feature of the test meant that it might reveal details of the
transition from concrete to abstract thinking (Piaget, 1985) by including non-A answer changes
in the test interpretation (Powell, 1977). The result was a perfect correlation between
chronological age and the subtest age modes for all answers. This idea seemed a reasonable use
of the test following the observation that reasoning reports predicted answer selection (Powell,
1968). In other words, this series of studies began with the possibility that the non-A answers,
instead of being randomly distributed blind guesses, displayed a strong developmental order.
The research, therefore, although founded upon Piaget's clinical studies, began with fortuitous
data observations and not with psychological or mathematical theories.
This test is not commonly used for assessing reading comprehension. However, this
research showed, using answer-selection reasoning interviews, that item-comprehension behavior
was what this test was measuring. Support for this interpretation has been established with
several studies (Powell, 1968; Powell, 1970; Powell, 1977; Powell, 2010a; and Powell and
Powell, 2011).
Although it is intuitively reasonable that a test drawn from a representative catalogue of
subject-area concepts and procedures should provide a set of observations indicating students'
knowledge of this catalogue, the studies reported here seem to contradict this intuition.
The possibility that item interpretation is what is actually being measured by other m-c
tests is supported by a separate study (Powell et al., 2011) using data from 16,000+ students in
India with two tests, one in mathematics and the other in science, used in elementary and middle
school; three item samples are reported above.
It now behooves the major test-development companies to protect their interests by
showing that answer interpretations do not contaminate their results in major tests, such as the
SAT, ACT, and GED tests, not to mention more specialized tests such as the GRE, professional
certification tests and the like. Because of the professional importance attached to these tests,
these companies could be open to serious class-action suits if they are unable to show that these
findings do not generalize to their products.
It has long been known by statisticians (Keeping, 1956) that applying the GLM
to non-linear data is invalid and can conceal non-linear dynamics within a data set. In fact, the
Thurs procedure has been applied to sociological (Mishra and Powell, 2011) and medical data
(Powell, 2010d) with interesting results. This procedure may have applications well beyond
educational data and perhaps should be developed as a more generalized statistical tool. It is
currently being implemented as an Excel spreadsheet for test-scoring applications.
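The spreadsheet itself is not reproduced here. The following minimal sketch (written in Python rather than Excel, with hypothetical option labels and student data; it is not the Thurs procedure itself) illustrates only the kind of tally that RSE-style scoring requires: recording each student's (first sitting, second sitting) answer selection for every item, rather than only the change in the count of type A selections.

from collections import Counter

# Hypothetical 5-item, 4-option example; "A" is taken here as the keyed (type A) answer.
KEY = ["A", "A", "A", "A", "A"]

def total_correct_change(first, second, key):
    """Conventional scoring: change in the number of keyed (type A) selections."""
    before = sum(a == k for a, k in zip(first, key))
    after = sum(a == k for a, k in zip(second, key))
    return after - before

def selection_changes(first, second):
    """Answer-change tally: frequency of each (first choice, second choice) pair across items."""
    return Counter(zip(first, second))

# One student's answers on the two sittings (illustrative data only).
first = ["B", "A", "C", "A", "A"]
second = ["C", "A", "A", "D", "A"]

print(total_correct_change(first, second, KEY))  # 0 -- the total score shows no change
print(selection_changes(first, second))
# Counter({('A', 'A'): 2, ('B', 'C'): 1, ('C', 'A'): 1, ('A', 'D'): 1})
# Three of the five items show a changed selection, including one A -> D shift
# that total-correct scoring cannot see.

In this toy example the total score is unchanged between sittings, yet three answer selections change; those per-item change frequencies are the detail that total-correct scoring discards.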
The change combinations possible with a 40-item, 4-option M/C test are so numerous (160²
possible pairs; see the arithmetic sketch below) that tracking learning diversity could become
routine. The recognition that we may actually be measuring thinking and learning skills with such
tests provides the basis for shifting the teaching-learning paradigm from "sage on the stage" to
"guide on the side" (Aronson, et al., 1978; King, 1993; Deslauriers, et al., 2011). In this case, the
main teaching/learning thrust becomes thinking and learning skills, leading to students' capacity
for self-initiated, life-long learning.
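As a rough arithmetic sketch of the granularity claim (assuming, as reconstructed above, that the parenthetical count refers to ordered pairs of the test's 160 answer options):

\[
% Counts implied by a 40-item, 4-option test; the pairing interpretation is an assumption.
40 \times 4 = 160 \ \text{answer options}, \qquad 160^{2} = 25{,}600 \ \text{ordered option pairs overall};
\]
\[
\text{counted item by item: } 4 \times 4 = 16 \ \text{selection pairs per item}, \qquad 40 \times 16 = 640 \ \text{item-level change pairs}.
\]

Either way of counting, the number of distinguishable answer-change patterns far exceeds the 41 possible total-correct scores.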
We have not yet determined whether there is an equivalence relationship between IRT u
values and these two type A categories, or those for the non-A categories created using RSE.
Students with a Narrow Perspective seem to be the ones with closed minds. Perhaps they fit
the Rm (type A selection from memory) category or the easy (IRT) answers.
Currently, by focusing narrowly on type A answers, educational assessment may have
become a tool to enforce cultural conformity. This paradigm shift could be the route to unlimited
possibilities and generate considerable new research.
Mathematics is more than a set of rules to memorize; it is a way to perceive reality by
structuring our observations logically and numerically (Lakoff and Núñez, 2000). Science is more
than a list of discovered facts; it is a way of bringing systematic observation to what we
examine (Popper, 1968; Kuhn, 1996; Byers, 2009). Language studies are more than the contents
of libraries; they are ways of comparing our thinking to the thinking of the great communicators
throughout the ages and of learning to do the same (Powell, 2010d). Uncertainty is more
than a puzzling anomaly of physics (Heisenberg, 2007); it is central to freedom of thought
because it opens the door upon many creative ways of viewing reality (Byers, 2011) that we have
not yet imagined. Mistakes are lessons not yet learned and should be honored for the helpful
information they provide.
The emerging digital age requires information access and information management skills
more than encyclopedic personal knowledge levels. The Internet removed the physical barriers
among peoples. We now need to remove the psychological barriers among peoples. This second
transformation can only be accomplished with an educational system that recognizes and honors
diversity, promotes cooperation in place of competition, and encourages mutual respect for the
differences that contribute to and support sustainability.
This shift requires the acceptance of a both/and style of relating to the world inside us
and around us, replacing our either/or style of perceiving. It requires the ability to resolve
conflicts non-violently through insight sharing (Rosenberg, 1999). Research suggests that the
frequency of interpersonal violence is declining (Shermer, 2011). Education (using RSE) might
eliminate violence arising from excessive intrapersonal, interpersonal and intergroup
competition. This change could lead to making the entire world into a learning
community.
This paper is an invitation for those who accept these results to join in furthering this
research as much as it is a challenge to current practice.
According to Byers (2011) the presumption of certainty itself is an error. In an emerging
future of voluminous information, educators cannot afford to create the illusion of certainty.
Particularly, as Byers points out, uncertainty is the window opening upon productive creativity.
Insisting upon certainty through a right-wrong scoring philosophy apparently squelches creativity
and neglects to determine, during the educational process, where constructive opportunities for
exercising creativity might be found. The headline quotation from his book pinpoints this error,
indicating that exploring ideas as a teaching modality may be the only resolution of this
dilemma. Student errors appear to delineate the frontiers of their current capabilities. Attending
to these errors, and helping students reorient their information-access and information-management
procedures through insight-promoting, self-aware changes in their thought processes, seems to
produce highly motivated learning and rapid progress.
Much more research must be done. Is the bell-shaped curve the noise left in the data after the
signal has been destroyed by the current data-processing procedure? Can we make parallel tests
by matching every answer using the Thurs? Can we engineer CAI systems that create insightful
learning, or is this an inherently human skill requiring a human teacher? Would a long longitudinal
study reveal more than three developmental pathways? Would this RSE procedure make it
possible to probe learning in depth, explain consciousness and resolve many other unanswered
psychological issues? Would this granularity make it possible to identify the best teaching methods
for definable subpopulations? Does the Thurs have applications in other areas of science?
If, indeed, m-c tests are measuring CM and not knowledge acquisition, then the
application of this paradigm shift in test scoring and interpretation could inform the paradigm
shift from the "sage on the stage" delivery modality to the "guide on the side" modality
(Aronson, et al., 1978; King, 1993), bringing the constructionist delivery perspective (Papert and
Harel, 1991; Deslauriers, et al., 2011) to the fore to replace the current didactic delivery
perspective.
In conclusion, learning, as revealed by this research, is not the simple acquisition of new
information, as is implied by a simple increase in the frequency of selection of acceptable
answers. As a process, it is a subtle and complex psychological event that is insufficiently
described by linear models of acquisition. The nuances of the process are concealed in the details
of the ways in which students interpret test questions. These details are found in the reasoning
behind the selection of the unacceptable answers that are effectively making the discriminations in
superbly crafted tests. The removal of these answers by the commonly used test-scoring
procedures decimates the information available from these tests, providing inadequate feedback
for monitoring the learning process and rendering the effective improvement of education
impossible. Put simply, learning is transformative, not cumulative, meaning that the wrong
mathematical models are being applied to these data, providing spurious and inaccurate results
for process feedback, in like manner to the application of Ptolemaic cosmology to the
interpretation of our solar system. The impact of this change in perspective could be every bit as
profound as that one was, leading to a new Renaissance for the same reason: profound,
life-critical decisions are being made upon untested hypotheses.

References
Aronson, E., Blaney, N., Stephan, C., Sikes, J. and Snapp, M. (1978). The Jigsaw Classroom.
Beverly Hills, CA: Sage.
Bond, T. G. & Fox, C. M. (2007). (2nd Ed.). Applying the Rasch Model: Fundamental
Measurement in the Human Sciences. Mahwah, NJ: Lawrence Erlbaum.
Bruner, Jerome S & Anglin, Jeremy M. (1973). Beyond the Information Given: Studies in the
Psychology of Knowing. New York: W. W. Norton.
Byers, W. (2011). The Blind Spot: Science and the Crisis of Uncertainty (pp. 8-9). Princeton,
NJ: Princeton University Press.
De Bono, E. (2005). The six value medals. London: Vermilion.
Deslauriers, L., Schelew, E. and Wieman, C. (2011). Improved Learning in a Large-Enrollment
Physics Class. Science, 332(6031), pp. 862-864.
DeMars, C. (2010). Item Response Theory. New York, NY: Oxford University Press.
Gardner, Howard. (1993). (10th Anniversary Ed.) Frames of mind: the theory of multiple
intelligences. [With a new introduction by the author]. New York, NY: Basic Books.
Guisbond, Lisa, Neill, Monty, and Schaeffer, Bob. (2012). NCLB's Lost Decade for Educational
Progress: What Can We Learn from this Policy Failure? Jamaica Plain, MA: National
Center for Fair and Open Testing.
Halloun, I. & Hestenes, D. L. (1985). Common sense concepts about motion. American Journal of
Physics, 53(11), 1056-1065.
Heisenberg, W. (2007) Physics and Philosophy: The Revolution in Modern Science. New York:
Harper Perennial Modern Classics. (Full text of 1958 version).
Keeping, R. (1957). Introductory statistics teacher, University of Alberta, Edmonton, Alberta,
Canada. Personal conversation, March 13.
King, A. (1993). From Sage on the Stage to Guide on the Side. College Teaching, 41(1), 30-35.
Kolen, Michael J. and Brennan, Robert L. (2004). Test Equating, Scaling and Linking:
Methods and Practices (2nd Ed.). New York: Springer Science + Business Media. ISBN: 0-387-
40086-9.
Kuhn, Thomas S. (1996). The Structure of Scientific Revolutions (3rd Ed.). Chicago, IL:
University of Chicago Press.
Lakoff, G. & Núñez, Rafael. (2000). Where Mathematics Comes From: How the Embodied Mind
Brings Mathematics into Being. New York: Basic Books.
Lord, Frederic. (1952). A Theory of Test Scores. Psychometric Monographs, Number 7.
Philadelphia, PA: Ferguson.
Lord, F. M. & Novick, M. R. (1968). Statistical theories of mental test scores. Reading MA:
Addison-Wesley.
Luce, R. D. & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new type of
fundamental measurement. Journal of Mathematical Psychology, 1 (1), 1-27.
Marzano, R. J. & Kendall, J. S. (1996). A comprehensive guide to designing standards-based
districts, schools and classrooms. Alexandria, VA: Association for Supervision and
Curriculum Development.
Mithra, S. and Powell, J. C. (2011). Identifying Information Security Governance Dimensions: A
Multinomial Analysis. Issues in Information Systems, XII(1), pp. 271-279. Cluster
analysis in this study used this procedure.
National Center for Educational Statistics, NAEP Questions Tool, Explore NAEP Questions,
http://nces.ed.gov/nationsreportcard/itmrlsx/default.aspx, accessed July 27, 2011
Piaget, J. (1985). The Equilibration of Cognitive Structures: The Central Problem of Intellectual
Development. Chicago, IL: University of Chicago Press. (New translation of The
Development of Thought.)
Piburn, M. D. (1989). Reliability and Validity of the Propositional Logic Test. Educational and
Psychological Measurement, 49(3), 667-672.
Popper, Karl R. (1968). Conjectures and Refutations: The Growth of Scientific Knowledge. New
York: Harper and Row.
Powell, J. C. (1968). The interpretation of wrong answers from a multiple-choice test.
Educational and Psychological Measurement, 28, 403-412.
Powell, J. C. (1977). The developmental sequence of cognition as revealed by wrong answers.
Alberta Journal of Educational Research, XXIII(1), 43-51. ERIC EJ159335.
Powell, J. C. (2010a). Beyond Linearity: Tracking Accommodative Learning and other Non-
linear Events Using Response Spectrum Evaluation (RSE). PowerPoint presentation given
to the staff at the Pittsburgh Supercomputer Center.
Powell, J. C. (2010b). Chapter 3: Testing as feedback to inform teaching. In J. Michael Spector,
et al. (2010) Learning and Instruction in the Digital Age: Making a Difference through
Cognitive Approaches. New York: Springer. ISBN: 978-1-4419-1551-4
Powell, J. C. (2010c). Do profoundly informed students choose wrong answers, lowering their
scores? Research report presented to the Psychometric Society, Athens, GA
Powell, J. C. (2010d). Making Peasants into Kings. Bloomington, IN: AuthorHouse. ISBN: 978-
1-4490-0634-1.
Powell, J. C. (2011). Observing Cognition from Unacceptable Answers. Presented to CELDA,
2011 in Rio de Janeiro, Brazil, November.
Powell, J. C., Bernauer, James and Agnihotri, Vishnu. (2010). An Analysis of Answer Selection
Patterns from Multiple-Choice Items. In Isaias, P. et al. (eds.), Towards Learning and
Instruction in Web 3.0: Advances in Cognitive and Educational Psychology. DOI
10.1007/978-1-4614-1539-8_16. London, UK: Springer Science + Business Media.
Powell, J. C., Cottrell, D. J. and Lever, Margaret (1977). Schools I Would Like to See: An
opinion survey with interesting possibilities. Alberta Journal of Educational Research,
XXIII(3), 226-241.
Powell, J. C. & Miki, H. (1985). Answer anomalies, how serious? Nashville, TN: Paper
presented to the Psychometric Society.
Powell, J. C. & Shklov, N. (1992). Obtaining information about learners' thinking strategies from
wrong answers on multiple-choice tests. Educational and Psychological Measurement,
52, 847-865.
Powell, J. C. & Powell, Valerie J. H. (2010). Do profoundly informed students choose wrong
answers, lowering their scores? Presentation to the Psychometric Society; Athens, GA.
Reckase, M. D. (2009). Multidimensional Item Response Theory. New York: Springer.
Shah, N. (2011). Early Achievers Lose Academic Edge, Researchers Conclude: Report cites
NCLB, anti-ability-grouping policies. Education Week, 31(5), September 28. Report of
Thomas Fordham Foundation study.
Shermer, M. (2011). The Decline of Violence: Be skeptical of claims that we live in an ever more
dangerous world. Scientific American, 305(4), October, p. 90.
Sternberg, R. J. (2003). Wisdom, Intelligence and Creativity Synthesized. New York: Cambridge
Univ. Press.
Wainer, H. & Thissen, D. (1961). On the general laws and the meaning of measurement in
psychology. In Proceedings of the Fourth Berkeley Symposium on Mathematical
Statistics and Probability. (Vol. 4, pp. 321-334). Berkeley: University of California
Press.
Wainer, H. (2011). Uneducated Guesses: Using Evidence to Uncover Misguided Educational
Policies. Princeton, NJ: Princeton University Press. ISBN: 978-0-691-14928-8.
