
Efficient Elicitation of Annotations for Human Evaluation of Machine Translation
Keisuke Sakaguchi∗, Matt Post†, Benjamin Van Durme†

Center for Language and Speech Processing

Human Language Technology Center of Excellence
Johns Hopkins University, Baltimore, Maryland
{keisuke,post,vandurme}@cs.jhu.edu

Abstract

A main output of the annual Workshop on Statistical Machine Translation (WMT) is a ranking of the systems that participated in its shared translation tasks, produced by aggregating pairwise sentence-level comparisons collected from human judges. Over the past few years, there have been a number of tweaks to the aggregation formula in attempts to address issues arising from the inherent ambiguity and subjectivity of the task, as well as weaknesses in the proposed models and the manner of model selection.

We continue this line of work by adapting the TrueSkill™ algorithm — an online approach for modeling the relative skills of players in ongoing competitions, such as Microsoft's Xbox Live — to the human evaluation of machine translation output. Our experimental results show that TrueSkill outperforms other recently proposed models on accuracy, and also can significantly reduce the number of pairwise annotations that need to be collected by sampling non-uniformly from the space of system competitions.

#    score   range   system
1    0.638   1       UEDIN-HEAFIELD
2    0.604   2-3     UEDIN
     0.591   2-3     ONLINE-B
4    0.571   4-5     LIMSI-SOUL
     0.562   4-5     KIT
     0.541   5-6     ONLINE-A
7    0.512   7       MES-SIMPLIFIED
8    0.486   8       DCU
9    0.439   9-10    RWTH
     0.429   9-11    CMU-T2T
     0.420   10-11   CU-ZEMAN
12   0.389   12      JHU
13   0.322   13      SHEF-WPROA

Table 1: System rankings presented as clusters (WMT13 French-English competition). The score column is the percentage of time each system was judged better across its comparisons (§2.1).

1 Introduction

The Workshop on Statistical Machine Translation (WMT) has long been a central event in the machine translation (MT) community for the evaluation of MT output. It hosts an annual set of shared translation tasks focused mostly on the translation of western European languages. One of its main functions is to publish a ranking of the systems for each task, which are produced by aggregating a large number of human judgments of sentence-level pairwise rankings of system outputs. While the performance on many automatic metrics is also reported (e.g., BLEU (Papineni et al., 2002)), the human evaluation is considered primary, and is in fact used as the gold standard for its metrics task, where evaluation metrics are evaluated.

In machine translation, the longstanding disagreements about evaluation measures do not go away when moving from automatic metrics to human judges. This is due in no small part to the inherent ambiguity and subjectivity of the task, but also arises from the particular way that the WMT organizers produce the rankings. The system-level rankings are produced by collecting pairwise sentence-level comparisons between system outputs. These are then aggregated to produce a complete ordering of all systems, or, more recently, a partial ordering (Koehn, 2012), with systems clustered where they cannot be distinguished in a statistically significant way (Table 1, taken from Bojar et al. (2013)).

A number of problems have been noted with this approach. The first has to do with the nature of ranking itself. Over the past few years, the WMT organizers have introduced a number of minor tweaks to the ranking algorithm (§2) in reaction to largely intuitive arguments that have been
raised about how the evaluation is conducted (Bojar et al., 2011; Lopez, 2012). While these tweaks have been sensible (and later corroborated), Hopkins and May (2013) point out that this is essentially a model selection task, and should properly be driven by empirical performance on held-out data according to some metric. Instead of intuition, they suggest perplexity, and show that a novel graphical model outperforms existing approaches on that metric, with less data.

A second problem is the deficiency of the models used to produce the ranking, which work by computing simple ratios of wins (and, optionally, ties) to losses. Such approaches do not consider the relative difficulty of system matchups, and thus leave open the possibility that a system is ranked highly from the luck of comparisons against poorer opponents.

Third, a large number of judgments need to be collected in order to separate the systems into clusters to produce a partial ranking. The sheer size of the space of possible comparisons (all pairs of systems times the number of segments in the test set) requires sampling from this space and distributing the annotations across a number of judges. Even still, the number of judgments needed to produce statistically significant rankings like those in Table 1 grows quadratically in the number of participating systems (Koehn, 2012), often forcing the use of paid, lower-quality annotators hired on Amazon's Mechanical Turk. Part of the problem is that the sampling strategy collects data uniformly across system pairings. Intuitively, we should need many fewer annotations between systems with divergent base performance levels, instead focusing the collection effort on system pairs whose performance is more closely matched, in order to tease out the gaps between similarly-performing systems. Why spend precious human time on redundantly affirming predictable outcomes?

To address these issues, we developed a variation of the TrueSkill model (Herbrich et al., 2006), an adaptive model of competitions originally developed for the Xbox Live online gaming community. It assumes that each player's skill level follows a Gaussian distribution N(µ, σ²), in which µ represents a player's mean performance, and σ² the system's uncertainty about its current estimate of this mean. These values are updated after each "game" (in our case, the value of a ternary judgment) in proportion to how surprising the outcome is. TrueSkill has been adapted to a number of areas, including chess, advertising, and academic conference management.

The rest of this paper provides an empirical comparison of a number of models of human evaluation (§2). We evaluate on perplexity and also on accuracy, showing that the two are not always correlated, and arguing for the primacy of the latter (§3). We find that TrueSkill outperforms other models (§4). Moreover, TrueSkill also allows us to drastically reduce the amount of data that needs to be collected by sampling non-uniformly from the space of all competitions (§5), which also allows for greater separation of the systems into ranked clusters (§6).

2 Models

Before introducing our adaptation of the TrueSkill model for ranking translation systems with human judgments (§2.3), we describe two points of comparison: the "Expected Wins" model used in recent evaluations (§2.1), and the Bayesian model proposed by Hopkins and May (§2.2).

As we described briefly in the introduction, WMT produces system rankings by aggregating sentence-level ternary judgments of the form:

(i, S_1, S_2, π)

where i is the source segment (id), S_1 and S_2 are the system pair drawn from a set of systems {S}, and π ∈ {<, >, =} denotes whether the first system was judged to be better than, worse than, or equivalent to the second. These ternary judgments are obtained by presenting judges with a randomly-selected input sentence and the reference, followed by five randomly-selected translations of that sentence. Annotators are asked to rank these systems from best (rank 1) to worst (rank 5), ties permitted, and with no meaning ascribed to the absolute values of or differences between ranks. This is done to accelerate data collection, since it yields ten pairwise comparisons per ranking. Tens of thousands of judgments of this form constitute the raw data used to compute the system-level rankings. All the work described in this section is computed over these pairwise comparisons, which are treated as if they were collected independently.
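To make the data format concrete, the following is a minimal sketch of how a single 5-way ranking is unrolled into its ten pairwise judgments (the function and variable names here are illustrative, not the actual WMT collection code):

    from itertools import combinations

    def unroll_ranking(segment_id, ranked_systems):
        """Convert one 5-way ranking into its C(5,2) = 10 pairwise judgments.

        `ranked_systems` is a list of (system, rank) pairs for one source
        segment, where rank 1 is best and tied systems share a rank.
        Returns tuples (i, S1, S2, pi) with pi in {'<', '>', '='},
        where '<' means S1 was judged better than S2.
        """
        judgments = []
        for (s1, r1), (s2, r2) in combinations(ranked_systems, 2):
            pi = '<' if r1 < r2 else '>' if r1 > r2 else '='
            judgments.append((segment_id, s1, s2, pi))
        return judgments

    # Example: one annotator's ranking of five systems for segment 17
    print(unroll_ranking(17, [('sysA', 1), ('sysB', 2), ('sysC', 2),
                              ('sysD', 4), ('sysE', 5)]))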

2.1 Expected Wins

The "Expected Wins" model computes the percentage of times that each system wins in its pairwise comparisons. Let A be the complete set of annotations or judgments of the form {i, S_1, S_2, π_R}. We assume these judgments have been converted into a normal form where S_1 is either the winner or is tied with S_2, and therefore π_R ∈ {<, =}. Let δ(x, y) be the Kronecker delta function, which is 1 if x = y and 0 otherwise. We then define the function:

wins(S_i, S_j) = Σ_{n=1}^{|A|} δ(S_i, S_1^(n)) · δ(S_j, S_2^(n)) · δ(π_R^(n), <)

which counts the number of annotations for which system S_i was ranked better than system S_j. We define a single-variable version that marginalizes over all annotations:

wins(S_i) = Σ_{S_j ≠ S_i} wins(S_i, S_j)

We also define analogous functions for loses and ties. Until the WMT11 evaluation (Callison-Burch et al., 2011), the score for each system S_i was computed as follows:

score(S_i) = [wins(S_i) + ties(S_i)] / [wins(S_i) + ties(S_i) + loses(S_i)]

Bojar et al. (2011) suggested that the inclusion of ties biased the results, due to their large numbers, the underlying similarity of many of the models, and the fact that they are counted for both systems in the tie, and proposed the following modified scoring function:

score(S_i) = (1 / |{S}|) Σ_{S_j ≠ S_i} wins(S_i, S_j) / [wins(S_i, S_j) + wins(S_j, S_i)]

This metric computes an average relative frequency of wins, excluding ties, and was used in WMT12 and WMT13 (Callison-Burch et al., 2012; Bojar et al., 2013).

The decision to exclude ties is not without its problems; for example, an evaluation where two systems are nearly always judged equivalent should be relevant in producing the final ranking of systems. Furthermore, as Hopkins and May (2013) point out, throwing out data to avoid biasing a model suggests a problem with the model. We now turn to a description of their model, which addresses these problems.

2.2 The Hopkins and May (2013) model

Recent papers (Koehn, 2012; Hopkins and May, 2013) have proposed models focused on the relative ability of the competition systems. These approaches assume that each system has a mean quality represented by a Gaussian distribution with a fixed variance shared across all systems. In the graphical model formulation of Hopkins and May (2013), the pairwise judgments (i, S_1, S_2, π) are imagined to have been generated according to the following process:

• Select a source sentence i.

• Select two systems S_1 and S_2. A system S_j is associated with a Gaussian distribution N(µ_{S_j}, σ_a²), samples from which represent the quality of its translations.

• Draw two "translations", adding random Gaussian noise with variance σ_obs² to simulate the subjectivity of the task and the differences among annotators:

  q_1 ∼ N(µ_{S_1}, σ_a²) + N(0, σ_obs²)
  q_2 ∼ N(µ_{S_2}, σ_a²) + N(0, σ_obs²)

• Let d be a nonzero real number that defines a fixed decision radius. Produce a rating π according to:

  π = <  if q_1 − q_2 > d
      >  if q_2 − q_1 > d
      =  otherwise

  (Note that better systems have higher relative abilities {µ_{S_j}}; better translations subsequently have on-average higher values {q_i}, which translate into a lower, i.e. better, ranking π.)

The task is then to infer the posterior parameters, given the data: the system means µ_{S_j} and, by necessity, the latent values {q_i} for each of the pairwise comparison training instances. Hopkins and May do not publish code or describe details of this algorithm beyond mentioning Gibbs sampling, so we used our own implementation (github.com/keisks/wmt-trueskill), and describe it here for completeness.
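As an illustration, the following minimal sketch draws synthetic judgments from this generative story (illustrative code only, not the implementation used in our experiments; the parameter defaults follow the notation above):

    import random

    def sample_judgment(mu, s1, s2, sigma_a=0.5, sigma_obs=1.0, d=0.5):
        """Generate one pairwise judgment from the Hopkins and May (2013) story.

        `mu` maps each system to its latent mean quality.  The latent
        translation qualities q1, q2 receive per-system noise (sigma_a)
        plus annotator noise (sigma_obs); the decision radius d turns
        their difference into a ternary rating.
        """
        q1 = random.gauss(mu[s1], sigma_a) + random.gauss(0.0, sigma_obs)
        q2 = random.gauss(mu[s2], sigma_a) + random.gauss(0.0, sigma_obs)
        if q1 - q2 > d:
            return '<'   # S1 judged better
        elif q2 - q1 > d:
            return '>'   # S2 judged better
        return '='       # tie

    mu = {'sysA': 1.0, 'sysB': 0.0}
    print([sample_judgment(mu, 'sysA', 'sysB') for _ in range(5)])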

After initialization, we have training instances of the form (i, S_1, S_2, π_R, q_1, q_2), where all but the q_i are observed. At a high level, the sampler iterates over the training data, inferring values of q_1 and q_2 for each annotation together in a single step of the sampler from the current values of the system means {µ_{S_j}} (this worked better than a version of the sampler that changed one value at a time). At the end of each iteration, these means are then recomputed by re-averaging all values of {q_i} associated with that system. After the burn-in period, the µs are stored as samples, which are averaged when the sampling concludes.

During each iteration, q_1 and q_2 are resampled from their corresponding system means:

q_1 ∼ N(µ_{S_1}, σ_a²)
q_2 ∼ N(µ_{S_2}, σ_a²)

We then update these values to respect the annotation π as follows. Let t = q_1 − q_2 (S_1 is the winner by the human judgment), and ensure that the values are outside the decision radius d:

q_1' = q_1 if t ≥ d;  q_1 + (d − t)/2 otherwise
q_2' = q_2 if t ≥ d;  q_2 − (d − t)/2 otherwise

In the case of a tie:

q_1' = q_1 + (d − t)/2 if t ≥ d;  q_1 if −d < t < d;  q_1 + (−d − t)/2 if t ≤ −d
q_2' = q_2 − (d − t)/2 if t ≥ d;  q_2 if −d < t < d;  q_2 − (−d − t)/2 if t ≤ −d

These values are stored for the current iteration and averaged at its end to produce new estimates of the system means. The quantity d − t can be interpreted as a loss function, returning a high value when the observed outcome is unexpected and a low value otherwise (Figure 1).

2.3 TrueSkill

Prior to 2012, the WMT organizers included reference translations among the system comparisons. These were used as a control against which the evaluators could be measured for consistency, on the assumption that the reference was almost always best. They were also included as data points in computing the system ranking. Another of Bojar et al. (2011)'s suggestions was to exclude this data, because systems compared more often against the references suffered unfairly. This can be further generalized to the observation that not all competitions are equal, and a good model should incorporate some notion of "match difficulty" when evaluating systems' abilities. The inference procedure above incorporates this notion implicitly, but the model itself does not include a notion of match difficulty or outcome surprisal.

A model that does is TrueSkill (Herbrich et al., 2006). (The goal of this section is to provide an intuitive description of TrueSkill as adapted for WMT manual evaluations, with enough detail to carry the main ideas; for more details, please see Herbrich et al. (2006).) TrueSkill is an adaptive, online system that also assumes that each system's skill level follows a Gaussian distribution, maintaining a mean µ_{S_j} for each system S_j representing its current estimate of that system's native ability. However, it also maintains a per-system variance, σ_{S_j}², which represents TrueSkill's uncertainty about its estimate of each mean. After an outcome is observed (a game in which the result is a win, loss, or draw), the size of the updates is proportional to how surprising the outcome was, which is computed from the current system means and variances. If a translation from a system with a high mean is judged better than one from a system with a much lower mean, the result is not surprising, and the update to the corresponding system means will be small. On the other hand, when an upset occurs in a competition, the means will receive larger updates.

Before defining the update equations, we need to be more concrete about how this notion of surprisal is incorporated. Let t = µ_{S_1} − µ_{S_2}, the difference in system relative abilities, and let ε be a fixed hyperparameter corresponding to the earlier decision radius. We then define two loss functions of this difference, for wins and for ties:

v_win(t, ε) = N(−ε + t) / Φ(−ε + t)
v_tie(t, ε) = [N(−ε − t) − N(ε − t)] / [Φ(ε − t) − Φ(−ε − t)]

where Φ(x) is the cumulative distribution function and the Ns are Gaussian densities. Figures 1 and 2 display plots of these two functions compared to the Hopkins and May model. Note how v_win (Figure 1) increases exponentially as µ_{S_2} becomes greater than that of the (purportedly) better system, µ_{S_1}.

Figure 1: TrueSkill's v_win and the corresponding loss function in the Hopkins and May model, as a function of the difference t of system means (ε = 0.5, c = 0.8 for TrueSkill, and d = 0.5 for the Hopkins and May model).

Figure 2: TrueSkill's v_tie and the corresponding loss function in the Hopkins and May model, as a function of the difference t of system means (ε = 0.5, c = 0.3, and d = 0.5).

As noted above, TrueSkill maintains not only estimates {µ_{S_j}} of system abilities, but also system-specific confidences about those estimates, {σ_{S_j}}. These confidences also factor into the updates: while surprising outcomes result in larger updates to system means, higher confidences (represented by smaller variances) result in smaller updates. TrueSkill defines the following value:

c² = 2β² + σ_{S_1}² + σ_{S_2}²

which accumulates the variances along with β, another free parameter. We can now define the update equations for the system means:

µ_{S_1} = µ_{S_1} + (σ_{S_1}² / c) · v(t/c, ε/c)
µ_{S_2} = µ_{S_2} − (σ_{S_2}² / c) · v(t/c, ε/c)

The second term in these equations captures the idea about balancing surprisal with confidence described above.

In order to update the system-level confidences, TrueSkill defines another set of functions, w, for the cases of wins and ties. These functions are multiplicative factors that affect the amount of change in σ²:

w_win(t, ε) = v_win · (v_win + t − ε)
w_tie(t, ε) = v_tie² + [(ε − t) · N(ε − t) + (ε + t) · N(ε + t)] / [Φ(ε − t) − Φ(−ε − t)]

The underlying idea is that these functions capture the outcome surprisal via v. This update always decreases the size of the variances σ², which means the uncertainty about µ decreases as comparisons go on. With these defined, we can conclude by defining the updates for σ_{S_1}² and σ_{S_2}²:

σ_{S_1}² = σ_{S_1}² · [1 − (σ_{S_1}² / c²) · w(t/c, ε/c)]
σ_{S_2}² = σ_{S_2}² · [1 − (σ_{S_2}² / c²) · w(t/c, ε/c)]

One final complication not presented here but relevant to adapting TrueSkill to the WMT setting: the parameter β and another parameter (not discussed), τ, are incorporated into the update equations to give more weight to recent matches. This "latest-oriented" property is useful in the gaming setting for which TrueSkill was built, where players improve over time, but is not applicable in the WMT competition setting. To cancel this property in TrueSkill, we set τ = 0 and β = 0.025 · |A| · σ² in order to lessen the impact of the order in which annotations are presented to the system.
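Putting the pieces together, a single pairwise update can be sketched as follows (a condensed illustration of the update equations above, not the factor-graph implementation in the TrueSkill toolkit; the helper names are ours):

    import math
    from scipy.stats import norm

    def trueskill_update(mu1, var1, mu2, var2, outcome, eps=0.25, beta=0.5):
        """One TrueSkill-style update for a match between S1 and S2.

        `outcome` is 'win' if S1 beat S2, or 'tie'.  Returns the updated
        (mu1, var1, mu2, var2), following the mean and variance update
        equations in the text; beta is the extra variance term in c^2.
        """
        c = math.sqrt(2 * beta ** 2 + var1 + var2)
        t, e = (mu1 - mu2) / c, eps / c
        if outcome == 'win':
            v = norm.pdf(-e + t) / norm.cdf(-e + t)
            w = v * (v + t - e)
        else:  # tie
            v = (norm.pdf(-e - t) - norm.pdf(e - t)) / (norm.cdf(e - t) - norm.cdf(-e - t))
            w = v ** 2 + ((e - t) * norm.pdf(e - t) + (e + t) * norm.pdf(e + t)) / \
                (norm.cdf(e - t) - norm.cdf(-e - t))
        mu1 += (var1 / c) * v            # surprising outcomes move the means more
        mu2 -= (var2 / c) * v
        var1 *= 1 - (var1 / c ** 2) * w  # variances only ever shrink
        var2 *= 1 - (var2 / c ** 2) * w
        return mu1, var1, mu2, var2

    print(trueskill_update(0.0, 0.25, 0.0, 0.25, 'win'))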

2.4 Data selection with TrueSkill

A drawback of the standard WMT data collection method is that it samples uniformly from the space of pairwise system combinations. This is undesirable: systems with vastly divergent relative ability need not be compared as often as systems that are more evenly matched. Unfortunately, one cannot sample non-uniformly without knowing ahead of time which systems are better. TrueSkill provides a solution to this dilemma with its match-selection ability: systems with similar means and low variances can be confidently considered to be close matches. This presents a strong possibility of reducing the amount of data that needs to be collected in the WMT competitions. In fact, the TrueSkill formulation provides a way to compute the probability of a draw between two systems, which can be used to compute for a system S_i a conditional distribution over matches with the other systems {S_j≠i}.

Formally, in the TrueSkill model, the match selection (chance to draw) between two players (systems in WMT) is computed as follows:

p_draw = √(2β² / c²) · exp(−(µ_a − µ_b)² / (2c²))

However, our settings for canceling the "latest-oriented" property affect this match quality equation, making most systems appear almost equally competitive (p_draw ≈ 1). Therefore, we modify the equation in the following manner, which simply depends on the difference of the µs:

p̂_draw = 1 / exp(|µ_a − µ_b|)

TrueSkill selects the matches it would like to create according to this selection criterion. We do this according to the following process:

1. Select a system S_1 (e.g., the one with the highest variance)

2. Compute a normalized distribution over matches with the other systems

3. Draw a system S_2 from this distribution

4. Draw a source sentence, and present the pair to the judge for annotation

3 Experimental setup

3.1 Datasets

We used the evaluation data released by WMT13 (statmt.org/wmt13/results.html). The data contains (1) five-way system rankings made by either researchers or Turkers and (2) translation data consisting of source sentences, human reference translations, and submitted translations. Data exists for 10 language pairs. More details about the dataset can be found in the WMT 2013 overview paper (Bojar et al., 2013).

Each five-way system ranking was converted into ten pairwise judgments (§2). We trained the models using randomly selected sets of 400, 800, 1,600, 3,200, and 6,400 pairwise comparisons, each produced in two ways: selecting from all researchers, or split between researchers and Turkers. An important note is that the training data differs according to the model. For the Expected Wins and the Hopkins and May models, we simply sample uniformly at random. The TrueSkill model, however, selects its own training data (with replacement) according to the description in Section 2.4. (We use a Python implementation of TrueSkill, github.com/sublee/trueskill.)

For tuning hyperparameters and reporting test results, we used development and test sets of 2,000 comparisons drawn entirely from the researcher judgments, and fixed across all experiments.

3.2 Perplexity

We first compare the Hopkins and May model and TrueSkill using perplexity on the test data T, computed as follows:

ppl(p|T) = 2^(−(1/|T|) Σ_{(i,S_1,S_2,π)∈T} log₂ p(π | S_1, S_2))

where p is the model under consideration. The probability of each observed outcome π between two systems S_1 and S_2 is computed by taking the difference of the Gaussian distributions associated with those systems:

N(µ_δ, σ_δ²) = N(µ_{S_1}, σ_{S_1}²) − N(µ_{S_2}, σ_{S_2}²)
             = N(µ_{S_1} − µ_{S_2}, σ_{S_1}² + σ_{S_2}²)

This Gaussian can then be carved into three pieces: the area where S_1 loses, the middle area representing ties (defined by a decision radius, r, whose value is fit using development data), and a third area representing where S_1 wins. By integrating over each of these regions, we have a probability distribution over these outcomes:

p(π | S_1, S_2) = ∫_{−∞}^{0} N(µ_δ, σ_δ²)   if π is >
                  ∫_{0}^{r} N(µ_δ, σ_δ²)    if π is =
                  ∫_{r}^{∞} N(µ_δ, σ_δ²)    if π is <

We do not compute perplexity for the Expected Wins model, which does not put any probability mass on ties.
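A minimal sketch of this computation (illustrative only; the function names are ours) makes the three integration regions and the perplexity explicit:

    import math
    from scipy.stats import norm

    def outcome_probs(mu1, var1, mu2, var2, r):
        """P(pi | S1, S2) for pi in {'<', '=', '>'} from the difference Gaussian.

        The difference of the two system Gaussians has mean mu1 - mu2 and
        variance var1 + var2; it is split into the three regions above.
        """
        mu_d, sd_d = mu1 - mu2, math.sqrt(var1 + var2)
        p_worse = norm.cdf(0.0, loc=mu_d, scale=sd_d)         # pi is '>'
        p_tie = norm.cdf(r, loc=mu_d, scale=sd_d) - p_worse   # pi is '='
        return {'>': p_worse, '=': p_tie, '<': 1.0 - p_worse - p_tie}

    def perplexity(params, test_data, r):
        """2 to the negative average log2-likelihood of the observed outcomes."""
        total = 0.0
        for (_, s1, s2, pi) in test_data:
            (mu1, var1), (mu2, var2) = params[s1], params[s2]
            total += math.log2(outcome_probs(mu1, var1, mu2, var2, r)[pi])
        return 2.0 ** (-total / len(test_data))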

3.3 Accuracy

Perplexity is often viewed as a neutral metric, but without access to unbounded training data or the true model parameters, it can only be approximated. Furthermore, it does not always correlate perfectly with evaluation metrics. As such, we also present accuracy results, measuring each model's ability to predict the values of the ternary pairwise judgments made by the annotators. These are computed using the above equation, picking the highest value of p(π) for all annotations between each system pair (S_i, S_j). As with perplexity, we emphasize that these predictions are functions of the system pair only, and not of the individual sentences under consideration, so the same outcome is always predicted for all sentences between a system pair.

3.4 Parameter Tuning

We follow the settings described in Hopkins and May (2013) for their model: σ_a = 0.5, σ_obs = 1.0, and d = 0.5. In TrueSkill, in accordance with the Hopkins and May model, we set the initial µ and σ values for each system to 0 and 0.5 respectively, and ε to 0.25.

For test data, we tuned the "decision radius" parameter r by doing grid search over {0.001, 0.01, 0.1, 0.3, 0.5}, searching for the value which minimized perplexity and maximized accuracy on the development set. We do this for each model and language pair. When tuned by perplexity, r is typically either 0.3 or 0.5 for both models and all language pairs, whereas, for accuracy, the best r is either 0.001, 0.01, or 0.1.

4 Results

4.1 Model Comparison

Figure 3 shows the perplexity of the two models with regard to the number of training comparisons. The perplexities in the figure are averaged over all ten language pairs in the WMT13 dataset. Overall, perplexities decrease as the training size increases. The Hopkins and May and TrueSkill models trained on both researcher and Turker judgments are comparable, whereas the Hopkins and May model trained on researcher judgments alone shows lower perplexity than the corresponding TrueSkill model.

Figure 3: Model perplexities for the WMT13 dataset. "all" indicates that models are trained on both researcher and Turker judgements, and "res" means that models are trained on only researcher judgements.

In terms of accuracy, we see that the TrueSkill model has the highest accuracies, saturating at just over 3,000 training instances (Figure 4). TrueSkill outperforms Expected Wins and the Hopkins and May model, especially when the training size is small (Table 2). We also note that training on researcher judgments alone (dashed lines) results in better performance than training on both researcher and Turker judgments. This likely reflects both a better match between training and test data (recall that the test data consists of researcher judgments only), as well as the higher consistency of this data, as evidenced by the annotator agreement scores published in the WMT overview paper (Bojar et al., 2013). Recall that the models only have access to the system pair (and not the sentences themselves), and thus make the same prediction for π for a particular system pair, regardless of which source sentence was selected. As an upper bound for performance on this metric, Table 2 contains an oracle score, which is computed by selecting, for each pair of systems, the highest-probability ranking. (Note that this might not represent a consistent ranking among systems, but it is itself an upper bound on the highest-scoring consistent ranking.)

Comparing the plots, we see that there is not a perfect relationship between perplexity and accuracy among the models: lower perplexity does not imply higher accuracy, and in fact the ordering of the models under the two metrics differs.
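The accuracy computation of §3.3 amounts to predicting the argmax outcome for each system pair; a minimal sketch (with a hypothetical predict function standing in for whichever model is being evaluated):

    def accuracy(predict, test_data):
        """Fraction of ternary judgments matched by the per-pair prediction.

        `predict(S1, S2)` returns a dict of outcome probabilities for that
        pair; the same argmax outcome is predicted for all of its sentences.
        """
        correct = 0
        for (_, s1, s2, pi) in test_data:
            probs = predict(s1, s2)
            correct += (max(probs, key=probs.get) == pi)
        return correct / len(test_data)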

Figure 4: Model accuracies with different training domains for the WMT13 dataset.

Train Size    Exp-Win   HM      TrueSkill
all   400     0.465     0.471   0.479
      800     0.471     0.475   0.483
      1600    0.479     0.477   0.493
      3200    0.486     0.489   0.493
      6400    0.487     0.490   0.495
res   400     0.460     0.463   0.484
      800     0.475     0.473   0.488
      1600    0.481     0.482   0.493
      3200    0.492     0.494   0.497
      6400    0.495     0.496   0.497
Upper Bound   0.525

Table 2: Model accuracies; models are tuned by accuracy instead of perplexity. The upper bound is computed by selecting the most frequent choice (<, >, =) for each system pair.

4.2 Free-for-all matches

TrueSkill need not deal with judgments in pairs only, but was in fact designed to be used in a variety of settings, including N-way free-for-all games with many players all competing for first place. This adapts nicely to WMT's actual collection setting. Recall that annotators are presented with five translations which are then ranked; we can treat this setting as a 5-way free-for-all match. While the details of these updates are beyond the scope of this paper, they are presented in the original model and are implemented in the toolkit we used. We thus also conducted experiments varying the value of N from 2 to 5.

#      N=2     N=3     N=4     N=5
400    0.479   0.482   0.491   0.492
800    0.483   0.493   0.495   0.495
1600   0.493   0.492   0.497   0.495
3200   0.493   0.494   0.498   0.497
6400   0.495   0.498   0.498   0.498

Table 3: Accuracies when training with N-way free-for-all models, fixing the number of matches.

#      N=2     N=3     N=4     N=5
400    0.479   0.475   0.470   0.459
800    0.483   0.488   0.476   0.466
1600   0.493   0.488   0.481   0.481
3200   0.493   0.492   0.487   0.489
6400   0.495   0.496   0.494   0.495

Table 4: Accuracies when training with N-way free-for-all models, fixing the number of pairwise comparisons.

The results are shown in Tables 3 and 4, which hold constant the number of matches and the number of pairwise judgments, respectively. When fixing the number of matches, the 5-way setting is at an advantage, since there is much more information in each match; in contrast, when fixing the number of pairwise comparisons, the 5-way setting is at a disadvantage, since many fewer competitions constitute these comparisons. The results bear this out, but also suggest that the standard WMT setting — which extracts ten pairwise comparisons from each 5-way match and treats them independently — works well. We will not speculate further here, but provide this experiment purely to motivate potential future work; in what follows we restrict our conclusions to the pairwise ranking scenario.

5 Reduced Data Collection with Non-uniform Match Selection

As mentioned earlier, a drawback of the selection of training data for annotation is that it is sampled uniformly from the space of system pair competitions, and an advantage of TrueSkill is its ability to instead compute a distribution over pairings and thereby focus annotation efforts on competitive matches. In this section, we report results in the form of heat maps indicating the percentage of pairwise judgments requested by TrueSkill across the full cross-product of system pairs, using the WMT13 French-English translation task.
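For reference, the match-selection procedure of §2.4 that produces this non-uniform distribution of judgments can be sketched as follows (illustrative code with hypothetical system names, not the exact implementation used in our experiments):

    import math
    import random

    def select_match(mu, var):
        """Pick the next pair to annotate, following Section 2.4.

        Start from the system we are least certain about (highest variance),
        then draw an opponent with probability proportional to the modified
        draw probability 1 / exp(|mu_a - mu_b|), which favors close matches.
        """
        s1 = max(var, key=var.get)
        others = [s for s in mu if s != s1]
        weights = [1.0 / math.exp(abs(mu[s1] - mu[s])) for s in others]
        s2 = random.choices(others, weights=weights, k=1)[0]
        return s1, s2

    mu = {'sysA': 0.60, 'sysB': 0.55, 'sysC': -0.30}
    var = {'sysA': 0.05, 'sysB': 0.20, 'sysC': 0.10}
    print(select_match(mu, var))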
Figure 5 depicts a system-versus-system heat map for all judgments in the dataset. Across this figure and the next two, systems are sorted along each axis by the final values of µ inferred by TrueSkill during training, and the heat of each square is proportional to the percentage of judgments obtained between those two systems. The diagonal reflects the fact that systems do not compete against themselves, and the stripe at row and column 5 reflects a system that was entered late into the WMT13 competition and thus had many fewer judgments. It is clear that these values are roughly uniformly distributed. This figure serves as a sort of baseline, demonstrating the lack of patterns in the data-selection process.

Figure 5: Heat map of the ratio of pairwise judgments across the full cross-product of systems in the WMT13 French-English translation task.

The next two figures focus on the data that TrueSkill itself selected for its use from among all of the available data. Figure 6 is a second heat map presenting the set of system pairs selected by TrueSkill for the first 20% of its matches chosen during training, while Figure 7 presents a heat map of the last 20%. The contrast is striking: whereas the judgments are roughly uniformly distributed at the beginning, the bulk of the judgments obtained for the last set are clustered along the diagonal, where the most competitive matches lie.

Figure 6: Heat map of the ratio of pairwise judgments across the full cross-product of systems used in the first 20% of the TrueSkill model's matches.

Figure 7: Heat map of the ratio of pairwise judgments across the full cross-product of systems used in the last 20% of the TrueSkill model's matches.

Together with the higher accuracy of TrueSkill, this suggests that it could be used to decrease the amount of data that needs to be collected in future WMT human evaluations by focusing the annotation effort on more closely-matched systems.

6 Clustering

As pointed out by Koehn (2012), a ranking presented as a total ordering among systems conceals the closeness of comparable systems. In the WMT13 competition, systems are grouped into clusters, which is equivalent to presenting only a partial ordering among the systems. Clusters are constructed using bootstrap resampling to infer many system rankings. From these rankings, rank ranges are then collected, which can be used to construct 95% confidence intervals and, in turn, to cluster systems whose ranges overlap. We use a similar approach for clustering in the TrueSkill model. We obtain rank ranges for each system by running the TrueSkill model 100 times (we also tried sampling 1,000 times, and the clustering granularities were the same), throwing out the top and bottom 2 rankings for each system, and clustering where rank ranges overlap. For comparison, we also do this for the other two models, altering the amount of training data from 1k to 25k in increments of 1,000, and plotting the number of clusters that can be obtained from each technique on each amount of training data.
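A minimal sketch of this clustering step (illustrative only; helper names are ours):

    def cluster_by_rank_ranges(rankings, trim=2):
        """Group systems whose rank ranges overlap, as described above.

        `rankings` is a list of runs (e.g., 100 TrueSkill runs), each an
        ordering of system names from best to worst.  The top and bottom
        `trim` ranks per system are discarded before forming each range.
        """
        ranks = {}
        for run in rankings:
            for r, s in enumerate(run, start=1):
                ranks.setdefault(s, []).append(r)
        ranges = {s: (sorted(rs)[trim], sorted(rs)[-trim - 1])
                  for s, rs in ranks.items()}

        clusters = []
        for s, (lo, hi) in sorted(ranges.items(), key=lambda kv: kv[1]):
            if clusters and lo <= clusters[-1][1]:   # overlaps the previous cluster
                clusters[-1] = (clusters[-1][0] + [s], max(hi, clusters[-1][1]))
            else:
                clusters.append(([s], hi))
        return [members for members, _ in clusters]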

Figure 8 shows the number of clusters as the amount of training data increases, for the three models. TrueSkill splits the systems into clusters more efficiently than the other two methods. Figures 9 and 10 present the results of clustering with two different sizes of training data (1K and 25K pairwise comparisons) for the TrueSkill model, and indicate that the rank ranges narrow and yield reasonable clusters as the number of training samples increases. The ranking and clusters are slightly different from the official result (Table 1), mainly because the official result is based on Expected Wins.

Figure 8: The number of clusters as the amount of training data increases, for WMT13 French-English (13 systems in total).

Figure 9: The result of clustering by the TrueSkill model with 1K training comparisons from WMT13 French-English. The boxes range from the lower to upper quartile values, with means in the middle. The whiskers show the full range of each system's rank after the bootstrap resampling.

Figure 10: The result of clustering by the TrueSkill model with 25K training comparisons. Dashed lines separate systems with non-overlapping rank ranges, splitting the data into clusters.

One noteworthy observation is that the ranking of systems between Figure 9 and Figure 10 is the same, further corroborating the stability and accuracy of the TrueSkill model even with a small amount of data. Furthermore, while the need to cluster systems forces the collection of significantly more data than if we wanted only to report a total ordering, TrueSkill here produces nicely-sized clusters with only 25K pairwise comparisons, which is nearly one-third the size of the data used in the WMT13 campaign (80K for French-English, yielding 8 clusters).

7 Conclusion

Models of "relative ability" (Koehn, 2012; Hopkins and May, 2013) are a welcome addition to methods for inferring system rankings from human judgments. The TrueSkill variant presented in this paper is a promising further development, both in its ability to achieve higher accuracy levels than alternatives, and in its ability to sample non-uniformly from the space of system pair matchings. It is possible that future WMT evaluations could significantly reduce the amount of data they need to collect, also potentially allowing them to draw from expert annotators alone (the developers of the participating systems), without the need to hire non-experts on Mechanical Turk.

One piece missing from the methods explored and proposed in this paper is models of the actual translations being compared by judges. Clearly, it is properties of the sentences themselves that judges use to make their judgments, a fact which is captured only indirectly by modeling translation qualities sampled from system abilities. This observation has been used in the development of automatic evaluation metrics (Song and Cohn, 2011), and is something we hope to explore in future work for system ranking.

References

Ondřej Bojar, Miloš Ercegovčević, Martin Popel, and Omar Zaidan. 2011. A Grain of Salt for the WMT Manual Evaluation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 1–11, Edinburgh, Scotland, July. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria, August. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan. 2011. Findings of the 2011 Workshop on Statistical Machine Translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 22–64, Edinburgh, Scotland, July. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 Workshop on Statistical Machine Translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 10–51, Montréal, Canada, June. Association for Computational Linguistics.

Ralf Herbrich, Tom Minka, and Thore Graepel. 2006. TrueSkill™: A Bayesian Skill Rating System. In Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, pages 569–576, Vancouver, British Columbia, Canada, December. MIT Press.

Mark Hopkins and Jonathan May. 2013. Models of translation competitions. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1416–1424, Sofia, Bulgaria, August. Association for Computational Linguistics.

Philipp Koehn. 2012. Simulating Human Judgment in Machine Translation Evaluation Campaigns. In Proceedings of the 9th International Workshop on Spoken Language Translation (IWSLT), pages 179–184, Hong Kong, China, December. International Speech Communication Association.

Adam Lopez. 2012. Putting Human Assessments of Machine Translation Systems in Order. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 1–9, Montréal, Canada, June. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, July. Association for Computational Linguistics.

Xingyi Song and Trevor Cohn. 2011. Regression and Ranking based Optimisation for Sentence Level MT Evaluation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 123–129, Edinburgh, Scotland, July. Association for Computational Linguistics.

