
A Facets Model for Judgmental Scoring

Abstract: An extension to the Rasch model for fundamental measurement is proposed in which there is parameterization not only for examinee ability and item difficulty but also for judge severity. Several variants of this model are discussed. Its use and characteristics are explained by an application of the model to an empirical testing situation.
Key-words: Rater, Rasch, Judging, Latent Trait, IRT
Authors:
John M. Linacre, University of Chicago
Benjamin D. Wright, University of Chicago
Mary E. Lunz, American Society of Clinical Pathologists.
MESA Memo 61, written in July 1990 and accepted for a special issue of Applied
Measurement in Education, but not published due to lack of space.
I. Introduction:
This century has seen efforts to remove subjectivity from the measurement of examinee
ability, aptitude or knowledge. There are still many areas, however, in which performance
ratings depend on assessments made by judges. Artistic skills, essay writing and science
projects are but a few of the many such areas in education. In measuring professional
performance, "By far the most widely used performance measurement techniques are
judgmental ones" (Landy and Farr, 1983 p.57).
Four factors dominate the rating given to an examinee's performance: the ability of the
examinee, the difficulty of the task performed, the severity of the judge and the way in
which the judge applies the rating scale. In a diving competition, for example, each diver
performs a series of dives, the test "items", and is rated on each dive by several judges,
who probably differ in levels of severity, as well as in their application of the rating scale. The
aim is to determine the diver of the highest skill, regardless of which dives are actually
performed, and which judges happen to rate them.
When there are several judges, it would be ideal if all judges gave exactly the same rating
to a particular performance. Then each performance need only be evaluated by one such
ideal judge and all ratings would be directly comparable. Practically speaking, minor
random differences in the ratings of the same performance by different judges might be
acceptable. However, even this level of agreement is hard to obtain. "In a study designed
to see how 'free from error' ratings could be under relatively ideal conditions, Borman
selected expert raters, used a carefully designed instrument, and tested it by using
videotapes of behavior enacted to present clearly different levels of performance; he
succeeded in getting an agreement among the raters of above .80, and yet concluded that
'ratings are far from perfect'" (Gruenfeld, 1981 p.12).

Since differences between judge ratings are usually non-trivial, it becomes necessary to
determine how judges differ, and how these differences can be accounted for, and hence
controlled, in a measurement model.
II. Expanding the partial credit model to include judging.
In "Probabilistic Models for Some Intelligence and Attainment Tests" (1960/1980), Georg
Rasch writes that "we shall try to define simultaneously the meaning of two concepts:
degree of ability and degree of difficulty", and presents a model to enable this. Rasch's
initial model has been expanded to allow for partial credit items by including a set of
parameters which describe the partial credit steps associated with each item (Wright &
Masters 1982, Masters 1982).
With the inclusion of judges in the measurement process, it is useful to define
simultaneously not only the ability of the examinee, and the difficulty of the test item, but
also the severity of the judge. This is accomplished by expanding the partial credit model
to include parameters describing each judge's method of applying the rating scale.
Because this involves the additional facet of judges, beyond the facets of examinees and
items, this expansion is called the "facets model", and the computer program which
performs the analysis, FACETS.
III. The facets model.
Here is a facets model for a rating scale that is the same for all judges on all items.
log_e ( P_nijk / P_nij(k-1) ) = B_n - D_i - C_j - F_k        (1)
Where
P_nijk is the probability of examinee n being awarded, on item i by judge j, a rating of category k,
P_nij(k-1) is the probability of examinee n being awarded, on item i by judge j, a rating of category k-1,
B_n is the ABILITY of examinee n, where n = 1,...,N,
D_i is the DIFFICULTY of item i, where i = 1,...,L,
C_j is the SEVERITY of judge j, where j = 1,...,J,
F_k is the HEIGHT of step k on a partial credit scale of K+1 categories, labelled 0,1,...,K in ascending order of perceived quality, where k = 1,...,K.
In this model, each test item is characterized by a difficulty, D_i, each examinee by an ability, B_n, and each judge by a level of severity, C_j. The log-odds formulation of (1) places these parameters on a common scale of log-odds units ("logits").
Judges apply a rating scale to the performance of each examinee on each item. Each successive category represents a further step of discernibly better performance on the underlying variable being judged. The simple term F_k has one subscript. This defines the rating scale to have the same structure for every item and judge. This "common step" model is always the case when every observation is a dichotomy, because then K=1 and F_0=F_1=0.
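To make the model concrete, the following minimal Python sketch (an illustration only, not the FACETS program; all numerical values are hypothetical) computes the category probabilities implied by equation (1): the log-odds of adjacent categories is B_n - D_i - C_j - F_k, so the probabilities follow from cumulating these terms and normalizing.

```python
import math

def category_probabilities(B_n, D_i, C_j, F):
    """Category probabilities P_nij0 ... P_nijK under facets model (1).

    F is the list of step heights [F_1, ..., F_K] in logits; the adjacent-
    category log-odds log(P_nijk / P_nij(k-1)) equals B_n - D_i - C_j - F_k."""
    exponents = [0.0]          # category 0 carries an empty (zero) sum
    running = 0.0
    for F_k in F:
        running += B_n - D_i - C_j - F_k
        exponents.append(running)
    denominator = sum(math.exp(e) for e in exponents)
    return [math.exp(e) / denominator for e in exponents]

# Hypothetical values: an able examinee, an average item, a somewhat severe
# judge, and a 4-category (0-3) scale like the one analyzed later in the paper.
probabilities = category_probabilities(B_n=1.5, D_i=0.0, C_j=0.5, F=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probabilities])   # the four probabilities sum to 1.0
```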
This model allows for the estimation of differences in severity between judges, and thus
eliminates this kind of judge "bias" from the calibration of items and the measurement of
examinees.
Its FACETS implementation does not require that every examinee be rated by every judge
on every item. It is only necessary that observations be designed to create a network through which every parameter can be linked to every other parameter, directly or indirectly, by some connecting observations (Wright & Stone 1979, p. 98-106). This network enables all
measures and calibrations estimated from the observations to be placed on one common
scale.
A tempting way to organize a panel of judges might be for one judge to rate all the
performances on one item, while another judge rates all the performances on another
item. But this judging plan provides no way to discern whether a mean difference in
ratings between the two items was because one item was harder, or because one judge
was more severe. This confounding can be overcome by rotating judges among items so
that, although the performance of an examinee on any particular item is rated once by
only one judge, the total performance of each examinee is rated by two or more judges.
Further, as part of the rotation design, each item is rated by several judges during the
course of the examination scoring process.
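The linkage requirement can be checked mechanically. The sketch below (an illustration, not part of FACETS) treats every examinee, item and judge as a node and every observation as a tie among its three nodes; a judging plan that leaves the graph in more than one piece cannot place all parameters on one scale. Connectivity is necessary but not by itself sufficient: the one-judge-per-item plan just described is connected, yet still confounds item difficulty with judge severity.

```python
def is_connected(observations, n_examinees, n_items, n_judges):
    """True if the judging plan links all examinees, items and judges into one
    network. observations: iterable of (examinee, item, judge) index triples."""
    parent = list(range(n_examinees + n_items + n_judges))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for n, i, j in observations:      # each rating ties its three facets together
        union(n, n_examinees + i)
        union(n, n_examinees + n_items + j)
    return len({find(x) for x in range(len(parent))}) == 1

# Hypothetical plans: judge 0 rates examinees 0-1 on item 0 only, judge 1 rates
# examinees 2-3 on item 1 only -- two disjoint networks, so no common scale.
split_plan = [(0, 0, 0), (1, 0, 0), (2, 1, 1), (3, 1, 1)]
print(is_connected(split_plan, 4, 2, 2))                 # False
# One crossing observation is enough to join the two halves.
print(is_connected(split_plan + [(1, 1, 1)], 4, 2, 2))   # True
```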
IV. Two other FACETS formulations.
a) Here is a facets model for a rating scale that holds across items but differs among judges.

log_e ( P_nijk / P_nij(k-1) ) = B_n - D_i - C_j - F_jk        (2)

Where
P_nijk is the probability of examinee n being awarded, on item i by judge j, a rating of category k,
P_nij(k-1) is the probability of examinee n being awarded, on item i by judge j, a rating of category k-1,
B_n is the ABILITY of examinee n, where n = 1,...,N,
D_i is the DIFFICULTY of item i, where i = 1,...,L,
C_j is the SEVERITY of judge j, where j = 1,...,J,
F_jk is the HEIGHT of step k on a partial credit scale of K+1 categories, labelled 0,1,...,K in ascending order of perceived quality as applied by judge j, where k = 1,...,K.
In this model, the heights of the steps between adjacent rating categories vary among judges. The two-subscript term F_jk is the height of the step from the lower category k-1 up to the next higher category k as used by judge j.
b) Here is a facets model for a rating scale that differs for each item/judge combination.

log_e ( P_nijk / P_nij(k-1) ) = B_n - D_i - C_j - F_ijk        (3)

Where:
F_ijk is the HEIGHT of step k on item i as rated by judge j.
The complex term F_ijk allows each judge to have a different way of using the rating categories on each item. This model is useful for examinations in which items differ in step structure, so that judges who differ in their judging styles can also differ in the way they use different rating categories.
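The three formulations differ only in how the step term is indexed. A small sketch (with hypothetical data structures, for illustration only) makes the distinction explicit:

```python
def step_term(model, F, i, j, k):
    """Step height entering the log-odds for step k under each formulation."""
    if model == "common":            # equation (1): one scale for all items and judges
        return F[k]
    if model == "per_judge":         # equation (2): each judge has a personal scale
        return F[j][k]
    if model == "per_item_judge":    # equation (3): a scale for each item/judge pair
        return F[i][j][k]
    raise ValueError("unknown model: " + model)
```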
V. The measurement properties of the facets model.
Test results obtained through the ratings of judges are descriptions of a local interaction
between examinees, items and judges. Such results remain ambiguous with respect to
inference, unless they can be transformed into measures with general meaning. It is
essential for inference that the measures estimated from the test results be independent
of the particular sample of examinees and items comprising the test situation. This
requirement is especially apparent when examinees do not face identical testing
situations. In circumstances where examinees are rated by different judges, respond to
different sets of test items, or perform different demonstrations of competence, measures
must be independent of the particular local judging situation to be meaningful at all. The
construction of this independence is always necessary when the intention is to compare
examinees on a common scale.
This generalizability of measurement estimates is called objectivity (Rasch 1968).
Objectivity of examinee measures is modelled to exist when the same measures for
examinees are obtained regardless of which sample of items, from the universe of
relevant items, and which panel of judges, from the universe of relevant judges, were used
in the test.
The facets model can be derived directly from the requirement of objectivity in the same
manner as other Rasch models (Linacre 1987), and consequently also satisfies the
mathematical requirements for fundamental measurement. Counts of steps taken are
sufficient statistics for each parameter (Fisher, 1922), and the parameters for each facet
can be estimated independently of estimates of the other facets. The measures of the
examinees are thus "test-freed" and "judge-freed" (Wright 1968, 1977). Complete
objectivity may not be obtainable for steps of the rating scale, however, when the definition
of the scale depends on the observed structure of a particular testing situation (Wright &
Douglas 1986).
VI. Estimating the parameters of the facets model.
The FACETS estimation equations are derived by Linacre (1987) in a manner similar to
those obtained for the partial credit model by Wright & Masters (1982 p.86), using
unconditional maximum likelihood (Fisher 1922, JMLE). These equations yield sufficient
parameter estimates and asymptotic standard errors for the ability of each examinee, the
difficulty of each item, the severity of each judge, and the additional level of performance

represented by each step on the partial credit scale. Mean square fit statistics (Wright &
Masters, 1982 p.100) are also obtained.
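As a rough indication of how such an unconditional (joint) maximum likelihood scheme can proceed, the following Python sketch iterates Newton-Raphson updates for a dichotomous simplification of the facets model, with log-odds(x=1) = B_n - D_i - C_j. It is an illustration only: it omits the step parameters, extreme-score handling, convergence checks, and the standard-error and fit computations of the full FACETS procedure.

```python
import math

def jmle_dichotomous(obs, N, L, J, iterations=50):
    """obs: list of (examinee, item, judge, score) with score 0 or 1.
    Returns estimated abilities B, difficulties D, severities C in logits."""
    B, D, C = [0.0] * N, [0.0] * L, [0.0] * J
    for _ in range(iterations):
        rB, vB = [0.0] * N, [0.0] * N     # residual and information sums
        rD, vD = [0.0] * L, [0.0] * L
        rC, vC = [0.0] * J, [0.0] * J
        for n, i, j, x in obs:
            p = 1.0 / (1.0 + math.exp(-(B[n] - D[i] - C[j])))
            w = p * (1.0 - p)             # model variance of this observation
            rB[n] += x - p; vB[n] += w
            rD[i] += x - p; vD[i] += w
            rC[j] += x - p; vC[j] += w
        for n in range(N):                # Newton-Raphson step, other facets fixed
            if vB[n] > 0: B[n] += rB[n] / vB[n]
        for i in range(L):
            if vD[i] > 0: D[i] -= rD[i] / vD[i]
        for j in range(J):
            if vC[j] > 0: C[j] -= rC[j] / vC[j]
        for vec in (D, C):                # centre items and judges at 0 logits
            m = sum(vec) / len(vec)
            for k in range(len(vec)):
                vec[k] -= m
    return B, D, C
```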
VII. An application of FACETS.
The facets model was applied to performance ratings obtained by an examination board
which certifies histotechnologists. 287 examinees were rated on 15 items by a panel of 15
judges. The ratings were on a 4-category partial credit scale labelled from 0 to 3, in which
0 means "poor/unacceptable", 1 means "below average", 2 means "average", and 3 means
"above average" performance, as defined during the thorough training which the judges
received.
Each examinee's performance on each item was rated only once. However, the 15 items
were divided into 3 groups of 5 (1-5, 6-10, 11-15) so that each group of 5 items for each
examinee could be rated by a different judge. Thus each examinee was rated by three
judges, over the 15 items. Judges rotated through the groups of 5 items, so that each
judge rated all 15 items over the course of the scoring session. The rotation was also
designed so that the combinations of three judges per examinee varied over examinees.
This provided a network of connections which linked all judges, items and examinees into
one common measurement system, while enabling the separate estimation of the
parameters of each facet.
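The board's exact rotation schedule is not reproduced here; the following sketch shows one simple scheme with the stated properties (three item groups per examinee, three different judges per examinee, varying judge combinations, and every judge eventually rating every item group). It is illustrative only, not the plan actually used.

```python
def rotation_plan(n_examinees=287, judges="ABCDEFGHIJKLMNO"):
    """Assign one judge to each of the three 5-item groups for each examinee."""
    n_judges = len(judges)
    plan = {}
    for e in range(n_examinees):
        # Offsetting by 5 judges per group guarantees three distinct judges,
        # and shifting with the examinee index varies the combinations and
        # rotates every judge through all three item groups.
        plan[e] = {group: judges[(e + 5 * group) % n_judges] for group in range(3)}
    return plan

plan = rotation_plan()
print(plan[0])   # {0: 'A', 1: 'F', 2: 'K'}
print(plan[1])   # {0: 'B', 1: 'G', 2: 'L'} -- a different trio of judges
```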
Two aspects of judge behavior were examined. First, the extent to which judges differed in
severity. Second, the extent to which each judge had his own way of using the rating
scale, and how this affected his awarding of credit.
First, judges were calibrated under the assumption that they all applied the rating scale in
the same way, but that each judge represented a different degree of severity. This is the
facets model given in equation (1), in which rating scale steps are represented by Fk.
Judge severity was calibrated at the logit value where the probability of awarding category
"2" equalled that of awarding category "3" (rather than at equal probability of awarding
category "0" or category "3") because these judges awarded far more "2" or "3" ratings
than "0" or "1". This prevented perturbations in the infrequent awarding of 0 ratings from
disturbing the estimation of judge severity. The resulting estimates are in Table 1. (The
counts of ratings are in the last line of Table 3.)
Judge                Count of    Count of    Sum of    Severity            Mn Sq
Label                Examinees   Ratings     Ratings   in Logits   S.E.    Fit
--------------------------------------------------------------------------------
A (most severe)         68          340         803       .51       .08     .85
B                       58          290         668       .35       .08    1.03
C                       69          345         830       .23       .08     .89
D                       43          215         533       .11       .10     .80
E                       45          225         560       .08       .10     .87
F                       79          395         982       .06       .08     .98
G                       48          240         591       .00       .10     .78
H                       47          235         594      -.01       .10     .91
I                       52          260         652      -.02       .10     .96
J                       32          160         409      -.05       .13    1.01
K                       50          250         646      -.07       .10    1.05
L                       58          290         728      -.18       .09    1.23
M                      106          530        1382      -.33       .07    1.33
N                       48          240         635      -.33       .12    1.22
O (most lenient)        58          290         767      -.35       .10    1.27
--------------------------------------------------------------------------------
Mean:                 57.4        287.0       718.7       .00       .09    1.01
Standard Deviation:   17.2         86.1       221.8       .24       .02     .17

Table 1. COMMON Scale Judge Calibrations at the step from category 2 to category 3.

The "Count" column in Table 1 shows that these judges rated different number of
examinees over the course of the examination, e.g. Judge M rated 106 examinees, while
Judge J rated only 32. (Since each judge rated an examinee on 5 of the examinee's 15
items, the count of ratings is five times the count of examinees.) "Sum of Ratings" is the
grand total of the ratings given by each judge. "Severity in logits" is the calibration of each
judge according to the facets model (1),(and the choice of reference point at the transition
from "2" to "3"). "Severity in logits" is accompanied by its modelled asymptotic standard
error. Finally, a mean-square fit statistic is reported. Values greater than one indicate more
variance in the ratings than was modelled. Values less than one indicate more
dependence in the ratings than was modelled.
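The mean-square statistics follow Wright & Masters (1982); as a simplified illustration (not the exact FACETS computation), the sketch below shows an information-weighted (infit-style) mean square, assuming the model expectation and variance of each rating are already available from the estimated parameters.

```python
def mean_square_fit(observations):
    """Infit-style mean-square for one judge: observed squared residuals
    divided by their modelled variance, summed over that judge's ratings.
    observations: list of (observed_rating, expected_rating, model_variance)."""
    squared_residuals = sum((x - e) ** 2 for x, e, _ in observations)
    modelled_variance = sum(v for _, _, v in observations)
    return squared_residuals / modelled_variance

# Hypothetical ratings: values near 1 mean the judge's ratings scatter about as
# much as the model predicts; muted ratings give < 1, erratic ratings give > 1.
print(round(mean_square_fit([(2, 2.3, 0.6), (3, 2.4, 0.55), (1, 1.8, 0.7)]), 2))
```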
In the literature, judges are often used as though they were interchangeable. Each judge
is thought to be "equivalent" to an "ideal" judge but for some small error variance. Were
this the case, judge severities would be homogeneous.
But a chi-square for homogeneity (116 with 14 d.f.) among these 15 judges is significant at
the .01 level. The hypothesis that these judges are interchangeable approximations of
some ideal is unsupportable.
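One standard fixed-effects homogeneity statistic, chi-square = sum of (C_j - C_bar)^2 / SE_j^2 with an information-weighted mean C_bar, can be computed directly from the Table 1 calibrations; the sketch below reproduces the reported value to within rounding, though the exact formula used in the original analysis is an assumption here.

```python
# Judge severities and standard errors from Table 1 (judges A through O).
severity = [0.51, 0.35, 0.23, 0.11, 0.08, 0.06, 0.00, -0.01,
            -0.02, -0.05, -0.07, -0.18, -0.33, -0.33, -0.35]
se = [0.08, 0.08, 0.08, 0.10, 0.10, 0.08, 0.10, 0.10,
      0.10, 0.13, 0.10, 0.09, 0.07, 0.12, 0.10]

weights = [1.0 / s ** 2 for s in se]                     # information = 1 / SE^2
weighted_mean = sum(w * c for w, c in zip(weights, severity)) / sum(weights)
chi_square = sum(w * (c - weighted_mean) ** 2 for w, c in zip(weights, severity))
print(round(chi_square), "on", len(severity) - 1, "d.f.")   # about 116 on 14 d.f.
```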
[Figure 1. Examinee score vs measure, on the COMMON judge scale (W, X, Y discussed in text). Scatter plot, not reproduced here: examinee raw score (ordinate, roughly 20 to 45 points) plotted against examinee measure (abscissa, -1 to 4 logits); the plot is annotated to note the curvilinearity and obtuseness of the relation.]

The effect of this variation in judge severity can be demonstrated by comparing each
objective examinee measure with its judge-dependent raw score. These are plotted in Figure
1. The ordinate is the raw score for each examinee, the sum of the 15 ratings each
received, which has a possible range of zero to 45 points.
The abscissa is the logit measure estimated for each examinee from the ratings each
received, but adjusted for variation in judge severity by the facets measurement model.
The horizontal spread of measures corresponding to each raw score shows the degree to
which different levels of judge severity disturb the meaning of a raw score. Similarly,
the vertical spread in raw scores corresponding to each logit measure shows the range of
raw scores that an examinee of any given ability might receive depending on the
combination of judges who rated him.
As can be seen, one examinee (W), who scored 25, is estimated to have greater ability
(0.43 logits) than another examinee (X), who scored 33, with a measure of 0.41 logits. Raw
scores are biased against the examinee who scored 25, and would be unfair were the pass-fail criterion 30 points and only raw scores considered. The bias in the raw scores is
entirely due to the particular combinations of severe and lenient judges that rated these
examinees.
The bias in raw scoring is also brought out by a comparison of examinee W with examinee Y,
who also scored 25, but measured only -0.19. The raw scores of W and Y are identical, but
the measured difference in their ability is 0.43 + 0.19 = 0.62 logits. Since the measures
of W and Y each have standard errors of 0.27 logits, this difference, though not conventionally
significant (t = 1.62, p = 0.1), may be large enough to alarm an examining board concerned with
making fair and defensible pass-fail decisions.
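The quoted t statistic follows directly from the two measures and their standard errors, with the standard error of the difference obtained by combining the two in quadrature:

```python
import math

# Measures and standard errors (in logits) of examinees W and Y, both of whom
# received a raw score of 25.
measure_W, se_W = 0.43, 0.27
measure_Y, se_Y = -0.19, 0.27

difference = measure_W - measure_Y                 # 0.62 logits
se_difference = math.sqrt(se_W ** 2 + se_Y ** 2)   # about 0.38 logits
print(round(difference / se_difference, 2))        # t = 1.62
```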

The introduction into the measurement model of parameters calibrating and hence adjusting
for the severity of judges enables the obvious inequities due to variance in judge
severity to be removed from examinee measures.
So far, we have allowed each judge to have his own level of severity, but have acted as
though each uses the rating scale in the same way. Experience suggests that each judge,
though thoroughly trained and experienced, applies the rating scale in a slightly
different, though self-consistent, manner.
Consequently, we will now model each judge to have his own personal way of using the
rating scale. This corresponds to an analysis based on model equation (2), in which the
step structure is represented by Fjk. Table 2 shows the judge severity estimates when each
judge is calibrated with his personal rating scale. Again severity was calibrated at the
logit value where the probability of awarding category "2" equalled that of awarding
category "3".
Judge                Count of    Count of    Sum of    Severity            Mn Sq
                     Examinees   Ratings     Ratings   in Logits   S.E.    Fit
--------------------------------------------------------------------------------
A                       68          340         803       .90       .08    1.01
B                       58          290         668       .41       .08    1.06
C                       69          345         830       .16       .08     .91
D                       43          215         533       .22       .10     .81
E                       45          225         560       .37       .11    1.02
F                       79          395         982       .09       .08    1.00
G                       48          240         591       .41       .11    1.01
H                       47          235         594       .19       .11    1.06
I                       52          260         652      -.16       .09     .93
J                       32          160         409       .06       .13    1.13
K                       50          250         646       .02       .11    1.09
L                       58          290         728      -.48       .09    1.08
M                      106          530        1382      -.67       .07    1.04
N                       48          240         635      -.75       .10     .90
O                       58          290         767      -.82       .10    1.05
--------------------------------------------------------------------------------
Mean:                 57.4        287.0       718.7       .00       .10    1.01
Standard Deviation:   17.2         86.1       221.8       .47       .02     .08

Table 2. PERSONAL Scale Judge Calibrations.
As would be expected, giving each judge his own rating scale has lessened the degree of
unexpected behavior. The fit statistics are closer to their expected value of one when
judges are modelled for personal scales. Comparing Tables 1 and 2, this is most clearly
noticeable for the more lenient judges (L, M, N, O) and the more severe judges (A, E).
             Category 0             Category 1             Category 2             Category 3
Judge    Count  Used%  Rel.%    Count  Used%  Rel.%    Count  Used%  Rel.%    Count  Used%  Rel.%
--------------------------------------------------------------------------------------------------
A           14     4    135        18     3     88       139    11    145       169     7     81
B           15     5    170        27     5    156       103     9    128       145     7     82
C           12     3    114        36     5    173        97     7    100       200     8     94
D            9     4    138         8     2     62        69     8    113       129     8     96
E            5     2     73         8     2     59        84    10    130       128     8     91
F           11     3     92        27     3    113       116     8    104       241     8     98
G            3     1     41        12     3     82        96    10    139       129     7     86
H            3     1     42        13     3     91        76     8    113       143     8     97
I            7     3     88        20     4    127        67     7     91       166     9    102
J            1     1     21        10     3    102        48     8    104       101     8    100
K            4     2     53        10     2     66        72     7    100       164     9    104
L            9     3    102        24     4    137        67     6     82       190     9    105
M           23     4    143        18     2     57       103     5     68       386    10    115
N            9     4    123         9     2     62        40     4     59       182    10    119
O            6     2     68        20     3    114        45     4     55       219    10    118
--------------------------------------------------------------------------------------------------
All        131     3    100       260     3    100      1222     7    100      2692     8    100

Table 3. Use frequency of rating scale categories.

In this examination, each judge rated a more or less random sample of examinees, and so an
inspection of how many ratings each judge awarded in each category provides an explanation
for the change in fit statistics when judges are calibrated on their personal scales.
Table 3 gives the percentage of ratings, "Used %", given in each category by each judge.
These percents show how judges differ in the way they used the rating scale. The "Rel. %"
columns show how much each judge used each category relative to the use of each category
by all the judges. The more severe judges (A through H) used relatively more ratings of
"2" than were expected from the common scale. When calibrated on the common scale, these
judges had less dispersion, more central tendency, in their ratings than was expected, and
so their ratings were less stochastic than expected, resulting in mean square fits of less
than one. On the other hand, the more lenient judges (I through O) awarded relatively more
extreme ratings of "3". When modelled on the common scale, their fit statistics were
greater than one, showing more dispersion in their ratings than was modelled.
Nevertheless, the patterns of responses in Table 3 show considerable similarity in the way
that these judges viewed the rating scale, once the variation in their severity levels is
accounted for.
In fact, this panel of judges is so well trained that none of their fit statistics is
unacceptable. Of the residual variance obtained when modelling all judges to be identical,
xx% is explained by allowing each judge his own severity but using a common scale, and
only a further xx% by modelling each judge to have his own scale.

[Figure 2. Examinee measures on COMMON Judge scale vs PERSONAL Judge scales. Scatter plot, not reproduced here: examinee measure on the PERSONAL judge scales (ordinate, 0 to 4 logits) plotted against examinee measure on the COMMON judge scale (abscissa, -1 to 4 logits); the plot is annotated to note the collinearity and acuity of the relation, and marks examinees W, X and Y.]
In Figure 2, the measures obtained for each examinee when the judges are regarded as using
a common scale are plotted against those obtained when each judge is allowed his personal
scale. The examinee points are located close to the identity line. This is a visual
representation of the fact that giving the judges their own scales has had very little
effect on the ordering of examinees by ability. The Spearman rank order correlation of the
two examinee measures is 0.998, indicating that almost no examinee's pass-fail decision
would be affected by choice of models.
In contrast, the Spearman correlation between raw scores and common scale measures is
0.976. This suggests that introducing the extra judge parameters into the model need not
result in a meaningful difference in so far as examinee measures are concerned.
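A rank-order comparison of this kind is straightforward to reproduce; the sketch below uses hypothetical columns of measures (the real comparison used all 287 examinees) and SciPy's Spearman correlation.

```python
from scipy.stats import spearmanr

# Hypothetical examinee measures (logits) under the COMMON-scale model (1)
# and the PERSONAL-scale model (2); the real comparison used 287 examinees.
common_scale   = [0.41, 0.43, -0.19, 1.10, 2.05, -0.50]
personal_scale = [0.44, 0.40, -0.22, 1.15, 2.00, -0.55]

rho, _ = spearmanr(common_scale, personal_scale)
print(round(rho, 3))   # a rho near 1 means the rank order of examinees barely changes
```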
Allowing each judge his own rating scale weakens inference because it lessens the
generality of the measures obtained. Were a new judge included, it would be necessary to
estimate not only his level of severity but also his own personal manner of using the
rating scale.
In Tables 1 and 2, the judge severity calibrations for common and personal scales are in
statistically equivalent order. The personal scale calibrations, however, have twice the
range of the common scale calibrations. Modelling a common scale has forced judges to seem
more alike in severity. Fortunately Figure 2 shows that the effect on examinee measures of
this compression of differences in judge severity is immaterial. But, for the study of
judging and judge training, specifying each judge to have his own scale brings out
noteworthy features of judge behavior.
VIII. Conclusion.
The facets model is an extension of the partial credit model, designed for examinations
which include subjective judgments. Its development enables the benefits of "sample-free",
"test-free", and "judge-free" measurement to be realized in this hitherto intractable
area. The use of the facets model yields greater freedom from judge bias and greater
generalizability of the resulting examinee measures than has previously been available.
The practicality of the facets model in allowing for simple, convenient, and efficient
judging designs has proved of benefit to those for whom rapid, efficient judging is a
priority. Further, the diagnostic information is of use in judge training.
In the examination that was analyzed, it is clear that judges differ significantly in
their severity, and that it is necessary to model this difference in determining examinee
measures. Some evidence was also found to suggest that, even after allowing for this
difference in severities, judges use the categories of the rating scale differently.
However, for well-trained judges, the consequent differences in examinee measures do not
appear to be large enough to merit modelling a separate rating scale for each judge.
Modelling one common scale was satisfactory, for practical purposes.
IX. Bibliography.

Borman, W.C. (1978). Exploring upper limits of reliability and validity in job performance ratings. Journal of Applied Psychology, 63(2), 134-144.
Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222, 309-368.
Gruenfeld, E.F. (1981). Performance Appraisal: Promise and Peril. Ithaca, New York: Cornell University Press.
Landy, F.J., & Farr, J.L. (1983). The Measurement of Work Performance. New York: Academic Press.
Linacre, J.M. (1987). An Extension of the Rasch Model to Multi-Faceted Situations. Chicago: University of Chicago, Department of Education.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Rasch, G. (1960/1980). Probabilistic Models for Some Intelligence and Attainment Tests. (Copenhagen, 1960.) Chicago: University of Chicago Press, 1980.
Rasch, G. (1968). A mathematical theory of objectivity and its consequences for model construction. In Report from the European Meeting on Statistics, Econometrics and Management Sciences, Amsterdam.
Wright, B.D. (1968). Sample-free test calibration and person measurement. In Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service.
Wright, B.D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.
Wright, B.D., & Douglas, G.A. (1986). The Rating Scale Model for Objective Measurement. MESA Research Memorandum No. 35. Chicago: University of Chicago.
Wright, B.D., & Masters, G.N. (1982). Rating Scale Analysis: Rasch Measurement. Chicago: MESA Press.
Wright, B.D., & Stone, M.H. (1979). Best Test Design: Rasch Measurement. Chicago: MESA Press.
