Professional Documents
Culture Documents
Translation
Keisuke Sakaguchi∗ , Matt Post† , Benjamin Van Durme†
∗
Center for Language and Speech Processing
†
Human Language Technology Center of Excellence
Johns Hopkins University, Baltimore, Maryland
{keisuke,post,vandurme}@cs.jhu.edu
1
Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 1–11,
c
Baltimore, Maryland USA, June 26–27, 2014.
2014 Association for Computational Linguistics
raised about how the evaluation is conducted (Bo- is. TrueSkill has been adapted to a number of
jar et al., 2011; Lopez, 2012). While these tweaks areas, including chess, advertising, and academic
have been sensible (and later corroborated), Hop- conference management.
kins and May (2013) point out that this is essen- The rest of this paper provides an empirical
tially a model selection task, and should prop- comparison of a number of models of human eval-
erly be driven by empirical performance on held- uation (§2). We evaluate on perplexity and also
out data according to some metric. Instead of in- on accuracy, showing that the two are not always
tuition, they suggest perplexity, and show that a correlated, and arguing for the primacy of the lat-
novel graphical model outperforms existing ap- ter (§3). We find that TrueSkill outperforms other
proaches on that metric, with less amount of data. models (§4). Moreover, TrueSkill also allows us to
A second problem is the deficiency of the mod- drastically reduce the amount of data that needs to
els used to produce the ranking, which work by be collected by sampling non-uniformly from the
computing simple ratios of wins (and, option- space of all competitions (§5), which also allows
ally, ties) to losses. Such approaches do not con- for greater separation of the systems into ranked
sider the relative difficulty of system matchups, clusters (§6).
and thus leave open the possibility that a system 2 Models
is ranked highly from the luck of comparisons
against poorer opponents. Before introducing our adaptation of the TrueSkill
model for ranking translation systems with human
Third, a large number of judgments need to be
judgments (§2.3), we describe two comparisons:
collected in order to separate the systems into clus-
the “Expected Wins” model used in recent evalu-
ters to produce a partial ranking. The sheer size of
ations, and the Bayesian model proposed by Hop-
the space of possible comparisons (all pairs of sys-
kins and May (§2.2).
tems times the number of segments in the test set)
As we described briefly in the introduction,
requires sampling from this space and distributing
WMT produces system rankings by aggregating
the annotations across a number of judges. Even
sentence-level ternary judgments of the form:
still, the number of judgments needed to produce
statistically significant rankings like those in Ta- (i, S1 , S2 , π)
ble 1 grows quadratically in the number of par-
where i is the source segment (id), S1 and S2
ticipating systems (Koehn, 2012), often forcing
are the system pair drawn from a set of systems
the use of paid, lower-quality annotators hired on
{S}, and π ∈ {<, >, =} denotes whether the
Amazon’s Mechanical Turk. Part of the prob-
first system was judged to be better than, worse
lem is that the sampling strategy collects data uni-
than, or equivalent to the second. These ternary
formly across system pairings. Intuitively, we
judgments are obtained by presenting judges with
should need many fewer annotations between sys-
a randomly-selected input sentence and the refer-
tems with divergent base performance levels, in-
ence, followed by five randomly-selected transla-
stead focusing the collection effort on system pairs
tions of that sentence. Annotators are asked to
whose performance is more matched, in order to
rank these systems from best (rank 1) to worst
tease out the gaps between similarly-performing
(rank 5), ties permitted, and with no meaning as-
systems. Why spend precious human time on re-
cribed to the absolute values or differences be-
dundantly affirming predictable outcomes?
tween ranks. This is done to accelerate data collec-
To address these issues, we developed a varia- tion, since it yields ten pairwise comparisons per
tion of the TrueSkill model (Herbrich et al., 2006), ranking. Tens of thousands of judgments of this
an adaptative model of competitions originally de- form constitute the raw data used to compute the
veloped for the Xbox Live online gaming commu- system-level rankings. All the work described in
nity. It assumes that each player’s skill level fol- this section is computed over these pairwise com-
lows a Gaussian distribution N (µ, σ 2 ), in which parisons, which are treated as if they were col-
µ represents a player’s mean performance, and σ 2 lected independently.
the system’s uncertainty about its current estimate
of this mean. These values are updated after each 2.1 Expected Wins
“game” (in our case, the value of a ternary judg- The “Expected Wins” model computes the per-
ment) in proportion to how surprising the outcome centage of times that each system wins in its
2
pairwise comparisons. Let A be the complete 2.2 The Hopkins and May (2013) model
set of annotations or judgments of the form Recent papers (Koehn, 2012; Hopkins and May,
{i, S1 , S2 , πR }. We assume these judgments have 2013) have proposed models focused on the rel-
been converted into a normal form where S1 is ei- ative ability of the competition systems. These
ther the winner or is tied with S2 , and therefore approaches assume that each system has a mean
πR ∈ {<, =}. Let δ(x, y) be the Kronecker delta quality represented by a Gaussian distribution with
function.1 We then define the function: a fixed variance shared across all systems. In the
graphical model formulation of Hopkins and May
wins(Si , Sj ) = (2013), the pairwise judgments (i, S1 , S2 , π) are
|A|
X imagined to have been generated according to the
(n) (n) (n)
δ(Si , S1 )δ(Sj , S2 )δ(πR , <) following process:
n=1
• Select a source sentence i
which counts the number of annotations for which
• Select two systems S1 and S2 . A system
system Si was ranked better than system Sj . We
Sj is associated with a Gaussian distribution
define a single-variable version that marginalizes
N (µSj , σa2 ), samples from which represent
over all annotations:
the quality of translations
X
wins(Si ) = wins(Si , Sj ) • Draw two “translations”, adding random
Sj 6=Si 2 to simulate
Gaussian noise with variance σobs
We also define analogous functions for loses and the subjectivity of the task and the differences
ties. Until the WMT11 evaluation (Callison-Burch among annotators:
et al., 2011), the score for each system Si was q1 ∼ N (µS1 , σa2 ) + N (0, σobs
2
)
computed as follows: q2 ∼ N (µS2 , σa2 ) + N (0, σobs
2
)
wins(Si ) + ties(Si ) • Let d be a nonzero real number that defines
score(Si ) =
wins(Si ) + ties(Si ) + loses(Si ) a fixed decision radius. Produce a rating π
according to:2
Bojar et al. (2011) suggested that the inclusion of
ties biased the results, due to their large numbers, < q 1 − q2 > d
the underlying similarity of many of the models, π= > q 2 − q1 > d
and the fact that they are counted for both systems = otherwise
in the tie, and proposed the following modified
The task is to then infer the posterior parameters,
scoring function:
given the data: the system means µSj and, by ne-
1 X wins(Si , Sj ) cessity, the latent values {qi } for each of the pair-
score(Si ) =
|{S}| wins(Si , Sj ) + wins(Sj , Si ) wise comparison training instances. Hopkins and
Sj 6=Si
May do not publish code or describe details of this
This metric computes an average relative fre- algorithm beyond mentioning Gibbs sampling, so
quency of wins, excluding ties, and was used we used our own implementation,3 and describe it
in WMT12 and WMT13 (Callison-Burch et al., here for completeness.
2012; Bojar et al., 2013). After initialization, we have training instances
The decision to exclude ties isn’t without of the form (i, S1 , S2 , πR , q1 , q2 ), where all but the
its problems; for example, an evaluation where qi are observed. At a high level, the sampler iter-
two systems are nearly always judged equivalent ates over the training data, inferring values of q1
should be relevant in producing the final ranking and q2 for each annotation together in a single step
of systems. Furthermore, as Hopkins and May of the sampler from the current values of the sys-
(2013) point out, throwing out data to avoid bi- tems means, {µj }.4 At the end of each iteration,
asing a model suggests a problem with the model. 2
Note that better systems have higher relative abilities
We now turn to a description of their model, which {µSj }. Better translations subsequently have on-average
higher values {qi }, which translate into a lower ranking π.
addresses these problems. 3
github.com/keisks/wmt-trueskill
4
1 if x = y This worked better than a version of the sampler that
1
δ(x, y) = changed one at a time.
0 o.w.
3
these means are then recomputed by re-averaging not all competitions are equal, and a good model
all values of {qi } associated with that system. Af- should incorporate some notion of “match diffi-
ter the burn-in period, the µs are stored as samples, culty” when evaluating system’s abilities. The
which are averaged when the sampling concludes. inference procedure above incorporates this no-
During each iteration, q1 and q2 are resampled tion implicitly in the inference procedure, but the
from their corresponding system means: model itself does not include a notion of match
difficulty or outcome surprisal.
q1 ∼ N (µS1 , σa2 ) A model that does is TrueSkill5 (Herbrich et al.,
q2 ∼ N (µS2 , σa2 ) 2006). TrueSkill is an adaptive, online system that
also assumes that each system’s skill level follows
We then update these values to respect the annota- a Gaussian distribution, maintaining a mean µSj
tion π as follows. Let t = q1 −q2 (S1 is the winner for each system Sj representing its current esti-
by human judgments), and ensure that the values mate of that system’s native ability. However, it
are outside the decision radius, d: also maintains a per-system variance, σS2 j , which
( represents TrueSkill’s uncertainty about its esti-
q1 t≥d
q10 = 1 mate of each mean. After an outcome is observed
q1 + (d − t) otherwise (a game in which the result is a win, loss, or draw),
2
( the size of the updates is proportional to how sur-
q2 t≥d prising the outcome was, which is computed from
0
q2 = 1 the current system means and variances. If a trans-
q2 − (d − t) otherwise
2 lation from a system with a high mean is judged
In the case of a tie: better than a system with a greatly lower mean, the
result is not surprising, and the update size for the
1
q1 + (d − t)
t≥d corresponding system means will be small. On the
2
0
q1 = q1 t<d other hand, when an upset occurs in a competition,
the means will receive larger updates.
q1 + 1 (−d − t)
t ≤ −d
2 Before defining the update equations, we need
to be more concrete about how this notion of sur-
1 prisal is incorporated. Let t = µS1 − µS2 , the dif-
q2 − (d − t) t≥d
2 ference in system relative abilities, and let be a
q20 = q2 t<d fixed hyper-parameter corresponding to the earlier
1
q2 − (−d − t) decision radius. We then define two loss functions
t ≤ −d
2 of this difference for wins and for ties:
These values are stored for the current iteration
and averaged at its end to produce new estimates N (− + t)
vwin (t, ) =
of the system means. The quantity d − t can be in- Φ(− + t)
terpreted as a loss function, returning a high value N (− − t) − N ( − t)
when the observed outcome is unexpected and a vtie (t, ) =
Φ( − t) − Φ(− − t)
low value otherwise (Figure 1).
where Φ(x) is the cumulative distribution function
2.3 TrueSkill and the N s are Gaussians. Figures 1 and 2 display
Prior to 2012, the WMT organizers included refer- plots of these two functions compared to the Hop-
ence translations among the system comparisons. kins and May model. Note how vwin (Figure 1) in-
These were used as a control against which the creases exponentially as µS2 becomes greater than
evaluators could be measured for consistency, on the (purportedly) better system, µS1 .
the assumption that the reference was almost al- As noted above, TrueSkill maintains not only
ways best. They were also included as data points estimates {µSj } of system abilities, but also
in computing the system ranking. Another of system-specific confidences about those estimates
Bojar et al. (2011)’s suggestions was to exclude
5
this data, because systems compared more of- The goal of this section is to provide an intuitive descrip-
tion of TrueSkill as adapted for WMT manual evaluations,
ten against the references suffered unfairly. This with enough detail to carry the main ideas. For more details,
can be further generalized to the observation that please see Herbrich et al. (2006).
4
The second term in these equations captures the
1.5
TrueSkill idea about balancing surprisal with confidence,
HM described above.
In order to update the system-level confidences,
1.0 TrueSkill defines another set of functions, w, for
the cases of wins and ties. These functions are
v(t, )
Figure 1: TrueSkill’s vwin and the corresponding The underlying idea is that these functions cap-
loss function in the Hopkins and May model as ture the outcome surprisal via v. This update al-
a function of the difference t of system means ways decreases the size of the variances σ 2 , which
( = 0.5, c = 0.8 for TrueSkill, and d = 0.5 for means uncertainty of µ decreases as comparisons
Hopkins and May model). go on. With these defined, we can conclude by
defining the updates for σS2 1 and σS2 2 :
" #
1.0
2 2
σS2 1 t
TrueSkill σS1 = σS1 · 1 − 2 · w ,
HM c c c
" #
0.5
2 2
σS2 2 t
σS2 = σS2 · 1 − 2 · w ,
c c c
v(t, )
0.0
One final complication not presented here but rel-
evant to adapting TrueSkill to the WMT setting:
−0.5
the parameter β and another parameter (not dis-
cussed) τ are incorporated into the update equa-
−1.0 tions to give more weight to recent matches. This
−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5
t = µS1 − µS2 “latest-oriented” property is useful in the gaming
Figure 2: TrueSkills vtie and the corresponding setting for which TrueSkill was built, where play-
loss function in the Hopkins and May model as ers improve over time, but is not applicable in the
a function of the difference t of system means WMT competition setting. To cancel this property
( = 0.5, c = 0.3, and d = 0.5). in TrueSkill, we set τ = 0 and β = 0.025 · |A| · σ 2
in order to lessen the impact of the order in which
{σSj }. These confidences also factor into the up- annotations are presented to the system.
dates: while surprising outcomes result in larger
2.4 Data selection with TrueSkill
updates to system means, higher confidences (rep-
resented by smaller variances) result in smaller A drawback of the standard WMT data collection
updates. TrueSkill defines the following value: method is that it samples uniformly from the space
of pairwise system combinations. This is undesir-
c2 = 2β 2 + σS2 1 + σS2 2 able: systems with vastly divergent relative abil-
ity need not be compared as often as systems that
which accumulates the variances along β, another are more evenly matched. Unfortunately, one can-
free parameter. We can now define the update not sample non-uniformly without knowing ahead
equations for the system means: of time which systems are better. TrueSkill pro-
vides a solution to this dilemma with its match-
σS2 1 t selection ability: systems with similar means and
µS1 = µS1 + ·v ,
c c c low variances can be confidently considered to be
σS2 2 t close matches. This presents a strong possibility
µS 2 = µS 2 − ·v , of reducing the amount of data that needs to be
c c c
5
collected in the WMT competitions. In fact, the each produced in two ways: selecting from all re-
TrueSkill formulation provides a way to compute searchers, or split between researchers and Turk-
the probability of a draw between two systems, ers. An important note is that the training data
which can be used to compute for a system Si a differs according to the model. For the Expected
conditional distribution over matches with other Wins and Hopkins and May model, we sim-
systems {Sj6=i }. ply sample uniformly at random. The TrueSkill
Formally, in the TrueSkill model, the match- model, however, selects its own training data (with
selection (chance to draw) between two players replacement) according to the description in Sec-
(systems in WMT) is computed as follows: tion 2.4.7
r For tuning hyperparameters and reporting test
2β 2 (µa − µb )2 results, we used development and test sets of 2,000
pdraw = · exp(− )
c2 2c2 comparisons drawn entirely from the researcher
judgments, and fixed across all experiments.
However, our setting for canceling the “latest-
oriented” property affects this matching quality
3.2 Perplexity
equation, where most systems are almost equally
competitive (≈ 1). Therefore, we modify the equa- We first compare the Hopkins and May model and
tion in the following manner which simply de- TrueSkill using perplexity on the test data T , com-
pends on the difference of µ. puted as follows:
1 −
P
log2 p(π|S1 ,S2 )
p̂draw = ppl(p|T ) = 2 (i,S1 ,S2 ,π)∈T
exp(|µa − µb |)
TrueSkill selects the matches it would like to where p is the model under consideration. The
create, according to this selection criteria. We do probability of each observed outcome π between
this according to the following process: two systems S1 and S2 is computed by taking a
difference of the Gaussian distributions associated
1. Select a system S1 (e.g., the one with the with those systems:
highest variance)
N (µδ , σδ2 ) = N (µS1 , σS2 1 ) − N (µS2 , σS2 2 )
2. Compute a normalized distribution over
matches with other systems pairs p̂draw = N (µS1 − µS2 , σS2 1 + σS2 2 )
3. Draw a system S2 from this distribution This Gaussian can then be carved into three pieces:
the area where S1 loses, the middle area represent-
4. Draw a source sentence, and present to the
ing ties (defined by a decision radius, r, whose
judge for annotation
value is fit using development data), and a third
3 Experimental setup area representing where S1 wins. By integrating
over each of these regions, we have a probability
3.1 Datasets distribution over these outcomes:
We used the evaluation data released by WMT13.6
The data contains (1) five-way system rankings
R0
2
−∞ N (µδ , σδ ) if π is >
made by either researchers or Turkers and (2)
translation data consisting of source sentences, hu- Rr
p(π | S1 , S2 ) = 2
man reference translations, and submitted transla-
0 N (µδ , σδ ) if π is =
tions. Data exists for 10 language pairs. More de- R
∞ N (µδ , σ 2 ) if π is <
tails about the dataset can be found in the WMT r δ
2013 overview paper (Bojar et al., 2013).
Each five-way system ranking was converted We do not compute perplexity for the Expected
into ten pairwise judgments (§2). We trained the Wins model, which does not put any probability
models using randomly selected sets of 400, 800, mass on ties.
1,600, 3,200, and 6,400 pairwise comparisons,
7
We use a Python implementation of TrueSkill
6
statmt.org/wmt13/results.html (github.com/sublee/trueskill).
6
3.3 Accuracy
3.00
Perplexity is often viewed as a neutral metric, but HM-all
HM-res
without access to unbounded training data or the TS-all
true model parameters, it can only be approxi- 2.95 TS-res
Perplexity
late perfectly with evaluation metrics. As such, 2.90
we also present accuracy results, measuring each
model’s ability to predict the values of the ternary
pairwise judgments made by the annotators. These 2.85
7
# N=2 N=3 N=4 N=5
0.500
400 0.479 0.482 0.491 0.492
0.495
800 0.483 0.493 0.495 0.495
0.490 1600 0.493 0.492 0.497 0.495
0.485
3200 0.493 0.494 0.498 0.497
Accuracy
8
1 2 3 4 5 6 7 8 9 10 11 12 13
into the WMT13 competition and thus had many
2.0
1 fewer judgments. It is clear that these values are
1.8
2
roughly uniformly distributed. This figure serves
3 1.6
4
as a sort of baseline, demonstrating the lack of pat-
1.4
5 terns in the data-selection process.
1.2
6 The next two figures focus on the data that
7 1.0
8
TrueSkill itself selected for its use from among all
0.8
9 of the available data. Figure 6 is a second heat
0.6
10 map presenting the set of system pairs selected by
0.4
11
TrueSkill for the first 20% of its matches chosen
12 0.2
13 during training, while Figure 7 presents a heat map
0.0
of the last 20%. The contrast is striking: whereas
the judgments are roughly uniformly distributed at
the beginning, the bulk of the judgments obtained
Figure 5: Heat map for the ratio of pairwise judg-
for the last set are clustered along the diagonal,
ments across the full cross-product of systems in
where the most competitive matches lie.
the WMT13 French-English translation task.
Together with the higher accuracy of TrueSkill,
1 2 3 4 5 6 7 8 9 10 11 12 13
this suggests that it could be used to decrease the
1
2.0
amount of data that needs to be collected in future
2 1.8
WMT human evaluations by focusing the annota-
3 1.6
tion effort on more closely-matched systems.
4
1.4
5
6
1.2 6 Clustering
7 1.0
8
0.8
As pointed out by Koehn (2012), a ranking pre-
9
0.6
sented as a total ordering among systems con-
10
11 0.4
ceals the closeness of comparable systems. In the
12 0.2
WMT13 competition, systems are grouped into
13
0.0
clusters, which is equivalent to presenting only
a partial ordering among the systems. Clusters
are constructed using bootstrap resampling to in-
Figure 6: Heat map for the ratio of pairwise judg- fer many system rankings. From these rankings,
ments across the full cross-product of systems rank ranges are then collected, which can be used
used in the first 20% of TrueSkill model. to construct 95% confidence intervals, and, in turn,
to cluster systems whose ranges overlap. We use
1 2 3 4 5 6 7 8 9 10 11 12 13 a similar approach for clustering in the TrueSkill
2.0
1 model. We obtain rank ranges for each system by
1.8
running the TrueSkill model 100 times,9 throw-
2
3 1.6
4 ing out the top and bottom 2 rankings for each
1.4
5 system, and clustering where rank ranges overlap.
1.2
6
For comparison, we also do this for the other two
7 1.0
8 models, altering the amount of training data from
0.8
9 1k to 25k in increments of 1,000, and plotting the
0.6
10
number of clusters that can be obtained from each
11 0.4
12
technique on each amount of training data.
0.2
13 Figure 8 show the number of clusters according
0.0
to the increase of training data for three models.
TrueSkill efficiently split the systems into clusters
Figure 7: Heat map for the ratio of pairwise judg-
compared to other two methods. Figure 9 and 10
ments across the full cross-product of systems
present the result of clustering two different size of
used in the last 20% of TrueSkill model.
9
We also tried the sampling 1,000 times and the clustering
granularities were the same.
9
1
7 2
ExpWin
3
6 HM
4
TS
5
5
Num. of Clusters
7
4
8
3 9
10
2 11
12
1 13
3
ferent from the official result (Table 1) mainly be- 4
8
same, further corroborating the stability and ac- 9
10
References Pennsylvania, July. Association for Computational
Linguistics.
Ondřej Bojar, Miloš Ercegovčević, Martin Popel, and
Omar Zaidan. 2011. A Grain of Salt for the WMT Xingyi Song and Trevor Cohn. 2011. Regression and
Manual Evaluation. In Proceedings of the Sixth Ranking based Optimisation for Sentence Level MT
Workshop on Statistical Machine Translation, pages Evaluation. In Proceedings of the Sixth Workshop
1–11, Edinburgh, Scotland, July. Association for on Statistical Machine Translation, pages 123–129,
Computational Linguistics. Edinburgh, Scotland, July. Association for Compu-
tational Linguistics.
Ondřej Bojar, Christian Buck, Chris Callison-Burch,
Christian Federmann, Barry Haddow, Philipp
Koehn, Christof Monz, Matt Post, Radu Soricut, and
Lucia Specia. 2013. Findings of the 2013 Work-
shop on Statistical Machine Translation. In Pro-
ceedings of the Eighth Workshop on Statistical Ma-
chine Translation, pages 1–44, Sofia, Bulgaria, Au-
gust. Association for Computational Linguistics.
Chris Callison-Burch, Philipp Koehn, Christof Monz,
and Omar Zaidan. 2011. Findings of the 2011
Workshop on Statistical Machine Translation. In
Proceedings of the Sixth Workshop on Statisti-
cal Machine Translation, pages 22–64, Edinburgh,
Scotland, July. Association for Computational Lin-
guistics.
Chris Callison-Burch, Philipp Koehn, Christof Monz,
Matt Post, Radu Soricut, and Lucia Specia. 2012.
Findings of the 2012 Workshop on Statistical Ma-
chine Translation. In Proceedings of the Seventh
Workshop on Statistical Machine Translation, pages
10–51, Montréal, Canada, June. Association for
Computational Linguistics.
Ralf Herbrich, Tom Minka, and Thore Graepel. 2006.
TrueSkillTM : A Bayesian Skill Rating System. In
Proceedings of the Twentieth Annual Conference on
Neural Information Processing Systems, pages 569–
576, Vancouver, British Columbia, Canada, Decem-
ber. MIT Press.
Mark Hopkins and Jonathan May. 2013. Models of
translation competitions. In Proceedings of the 51st
Annual Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers), pages
1416–1424, Sofia, Bulgaria, August. Association for
Computational Linguistics.
Philipp Koehn. 2012. Simulating Human Judgment
in Machine Translation Evaluation Campaigns. In
Proceedings of the 9th International Workshop on
Spoken Language Translation (IWSLT), pages 179–
184, Hong Kong, China, December. International
Speech Communication Association.
Adam Lopez. 2012. Putting Human Assessments of
Machine Translation Systems in Order. In Proceed-
ings of the Seventh Workshop on Statistical Machine
Translation, pages 1–9, Montréal, Canada, June. As-
sociation for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. BLEU: a Method for Automatic
Evaluation of Machine Translation. In Proceedings
of 40th Annual Meeting of the Association for Com-
putational Linguistics, pages 311–318, Philadelphia,
11