
Statistical Model to Predict the Outcome of Tennis Matches

Vijay Menon

Abstract Before the start of a tournament, various statistics on each tennis player are available: the success the player has enjoyed in a particular year, his ranking, the percentage of points won on first serve, the number of games won, and so on. General comparisons are made from these before the start of a match, but commentators and tennis analysts seldom use these metrics to predict the outcome of a match. The only predictions that are made use the ranking of a player, which is not a good predictor, since the level of tennis displayed by, say, the top 5 players is not comparable with that of the players ranked lower. We therefore develop a statistical model which predicts the outcome of a tennis match taking into account all relevant metrics beyond ranking. To predict the result, we take into consideration the success the player has had up to the tournament. We consider two main factors: points won when serving and points won when receiving. The model additionally takes into consideration the head-to-head record between the two players, and also their current form (the last 52 weeks). After considering these parameters, we assign a weight to each of them (for instance, the current form and head-to-head record are given higher weights) and finally arrive at the probability that a particular player will win the match.

Modelling a Match

Any tennis match can be considered as a repetition of events: a sequence of points forms a game, a sequence of games constitutes a set, and a sequence of sets forms a match, which ultimately decides the winner. A model to predict the outcome of a match can therefore be built by first modelling a game and then a set. Since most tournaments use a tie-breaker to decide the outcome of a set when the score reaches 6:6 (except for the 5th set in 3 out of the 4 Grand Slam tournaments), we also model a tie-breaker, which is then combined with the models for a game and a set to decide the outcome of the match.

1.1 Probability of winning a game

Consider a player A and let the probability of him winning a point be p. Here p is assumed to be constant throughout the game, and an independence assumption is made, i.e. the outcome of a point is assumed not to depend on the outcome of any previous point. To calculate the probability of winning a game, we consider the different possibilities: winning the game while losing 0 points, 1 point or 2 points, or reaching deuce (40-40) after each player has won 3 points, after which the first player to go two points ahead wins the game. The associated probability of winning the game, given p, is:

$$G(p) = -\frac{20\,(p-1)^3\,p^5}{2\,(p-1)\,p + 1} + 10\,(p-1)^2\,p^4 - 4\,(p-1)\,p^4 + p^4 \tag{1}$$

Now, to calculate the probability that a particular player, say A, will win a game while serving, we feed the probability of A winning a point on serve (s) into the above equation to get G(s). Similarly, to obtain the probability that A will win a game while facing the serve, we feed in the probability of A winning a point while returning (r) to get G(r).
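As a minimal sketch, equation (1) can be evaluated from its building blocks rather than from the expanded polynomial. The function name `game_win_prob` is illustrative and not part of the paper; the code assumes, as the text does, that points are independent and that p is constant.

```python
def game_win_prob(p: float) -> float:
    """Probability of winning a game, given point-win probability p.

    Follows the case split described in the text: win while losing
    0, 1 or 2 points, or reach deuce and win from there.
    """
    q = 1.0 - p
    # Win to love, to 15, to 30: the last point is always won by the player,
    # so the earlier points can be arranged in C(3,0), C(4,1), C(5,2) ways.
    before_deuce = p**4 * (1 + 4 * q + 10 * q**2)
    # Reach deuce (three points each) in C(6,3) = 20 ways.
    reach_deuce = 20 * p**3 * q**3
    # From deuce: win two points in a row before the opponent does.
    win_from_deuce = p**2 / (1 - 2 * p * q)
    return before_deuce + reach_deuce * win_from_deuce


if __name__ == "__main__":
    s, r = 0.65, 0.38  # hypothetical point-win probabilities on serve / return
    print(f"G(s) = {game_win_prob(s):.4f}, G(r) = {game_win_prob(r):.4f}")
```

For p = 0.6 this gives roughly 0.736, which illustrates how a modest edge on each point is amplified at game level.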

1.2 Probability of winning a Tie-Breaker

Since most sets use a tie-breaker to decide the winner when the score is tied at 6:6, we first model the tie-breaker before modelling a set. A tie-breaker is a 12-point game in which the first player to reach 7 points with a lead of at least two points over his opponent is declared the winner. When the tie-breaker score is tied at 6-6, the first player to win two consecutive points is declared the winner. To model a tie-breaker for a player A, we consider two parameters: the probability of winning a point on serve (s) and the probability of winning a point on return (r). Using these, the probability of winning a tie-breaker, T(s,r), is derived as:

$$
\begin{aligned}
T(s,r) ={}& \frac{(s-1)^6 r^7 s}{2rs - r - s + 1} + \frac{36\,(s-1)^5 (r-1)\, r^6 s^2}{2rs - r - s + 1} + \frac{225\,(s-1)^4 (r-1)^2 r^5 s^3}{2rs - r - s + 1} \\
&+ \frac{400\,(s-1)^3 (r-1)^3 r^4 s^4}{2rs - r - s + 1} + \frac{225\,(s-1)^2 (r-1)^4 r^3 s^5}{2rs - r - s + 1} + \frac{36\,(s-1)(r-1)^5 r^2 s^6}{2rs - r - s + 1} \\
&+ \frac{(r-1)^6 r s^7}{2rs - r - s + 1} - (s-1)^5 r^6 s - 30\,(s-1)^4 (r-1)\, r^5 s^2 - 150\,(s-1)^3 (r-1)^2 r^4 s^3 \\
&- 200\,(s-1)^2 (r-1)^3 r^3 s^4 - 75\,(s-1)(r-1)^4 r^2 s^5 - 6\,(r-1)^5 r s^6 \\
&+ 5\,(s-1)^4 r^6 s + 50\,(s-1)^3 (r-1)\, r^5 s^2 + 100\,(s-1)^2 (r-1)^2 r^4 s^3 + 50\,(s-1)(r-1)^3 r^3 s^4 + 5\,(r-1)^4 r^2 s^5 \\
&- 10\,(s-1)^3 r^5 s^2 - 40\,(s-1)^2 (r-1)\, r^4 s^3 - 30\,(s-1)(r-1)^2 r^3 s^4 - 4\,(r-1)^3 r^2 s^5 \\
&+ 6\,(s-1)^2 r^4 s^3 + 16\,(s-1)(r-1)\, r^3 s^4 + 6\,(r-1)^2 r^2 s^5 - 3\,(s-1)\, r^4 s^3 - 4\,(r-1)\, r^3 s^4 + r^4 s^3
\end{aligned} \tag{2}
$$
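The long closed form in equation (2) can be cross-checked numerically. The sketch below does not evaluate the polynomial; instead it walks through the tie-break score states recursively, which is a swapped-in but equivalent technique under the same independence assumption. It assumes player A serves the first point and that the serve then alternates in the usual A, BB, AA, BB, ... order; the function name `tiebreak_win_prob` is hypothetical.

```python
from functools import lru_cache


def tiebreak_win_prob(s: float, r: float) -> float:
    """P(player A wins a tie-break), assuming A serves the first point.

    s: probability that A wins a point on serve
    r: probability that A wins a point on return
    """

    @lru_cache(maxsize=None)
    def win(a: int, b: int) -> float:
        if a == 6 and b == 6:
            # From 6-6, every pair of points has one serve per player;
            # A must win both points of a pair before B does.
            # (Undefined only in the degenerate cases s=1, r=0 or s=0, r=1.)
            return s * r / (1 - s * (1 - r) - r * (1 - s))
        if a == 7:
            return 1.0
        if b == 7:
            return 0.0
        n = a + b                            # points played so far
        a_serves = ((n + 1) // 2) % 2 == 0   # serve order A, BB, AA, BB, ...
        p = s if a_serves else r
        return p * win(a + 1, b) + (1 - p) * win(a, b + 1)

    return win(0, 0)


if __name__ == "__main__":
    print(f"T(0.65, 0.38) = {tiebreak_win_prob(0.65, 0.38):.4f}")
```

The terms of equation (2) without a denominator correspond to the states where A reaches 7 points first (7-0 up to 7-5), and the terms with the denominator 2rs - r - s + 1 correspond to the 6-6 branch of the recursion.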

1.3 Probability of winning a Set

Each set is composed of games, and each game in turn is composed of points. Therefore, to derive the probability of winning a set, we make use of the probability of winning a game on serve G(s), the probability of winning a game on return G(r) and the probability of winning the tie-breaker T(s,r). This can be modelled by considering the different possibilities: winning the set while losing 0 games, losing 1 game, and so on.

1.3.1 Probability of winning a set with TB:


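The expression is:

$$
\begin{aligned}
S_{TB}(s,r) ={}& G(s)^3\, G(r)^3 \\
&+ \sum_{i=0}^{1} \binom{3}{i} G(s)^{4-i} (1-G(s))^{i} \binom{3}{1-i} G(r)^{i+2} (1-G(r))^{1-i} \\
&+ \sum_{i=0}^{2} \binom{4}{i} G(s)^{4-i} (1-G(s))^{i} \binom{3}{2-i} G(r)^{i+2} (1-G(r))^{2-i} \\
&+ \sum_{i=0}^{3} \binom{4}{i} G(s)^{5-i} (1-G(s))^{i} \binom{4}{3-i} G(r)^{i+1} (1-G(r))^{3-i} \\
&+ \sum_{i=0}^{4} \binom{5}{i} G(s)^{5-i} (1-G(s))^{i} \binom{4}{4-i} G(r)^{i+1} (1-G(r))^{4-i} \\
&+ \left[ \sum_{i=0}^{5} \binom{5}{i} G(s)^{5-i} (1-G(s))^{i} \binom{5}{5-i} G(r)^{i} (1-G(r))^{5-i} \right] G(s)\, G(r) \\
&+ \left[ \sum_{i=0}^{5} \binom{5}{i} G(s)^{5-i} (1-G(s))^{i} \binom{5}{5-i} G(r)^{i} (1-G(r))^{5-i} \right] \bigl( G(s) + G(r) - 2\, G(s)\, G(r) \bigr)\, T(s,r)
\end{aligned} \tag{3}
$$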

1.3.2 Probability of winning a set without TB:

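The expression is:

$$
\begin{aligned}
S_{NTB}(s,r) ={}& G(s)^3\, G(r)^3 \\
&+ \sum_{i=0}^{1} \binom{3}{i} G(s)^{4-i} (1-G(s))^{i} \binom{3}{1-i} G(r)^{i+2} (1-G(r))^{1-i} \\
&+ \sum_{i=0}^{2} \binom{4}{i} G(s)^{4-i} (1-G(s))^{i} \binom{3}{2-i} G(r)^{i+2} (1-G(r))^{2-i} \\
&+ \sum_{i=0}^{3} \binom{4}{i} G(s)^{5-i} (1-G(s))^{i} \binom{4}{3-i} G(r)^{i+1} (1-G(r))^{3-i} \\
&+ \sum_{i=0}^{4} \binom{5}{i} G(s)^{5-i} (1-G(s))^{i} \binom{4}{4-i} G(r)^{i+1} (1-G(r))^{4-i} \\
&+ \left[ \sum_{i=0}^{5} \binom{5}{i} G(s)^{5-i} (1-G(s))^{i} \binom{5}{5-i} G(r)^{i} (1-G(r))^{5-i} \right] \frac{G(s)\, G(r)}{1 - G(s)\,(1-G(r)) - G(r)\,(1-G(s))}
\end{aligned} \tag{4}
$$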
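The two set expressions, (3) and (4), share every term up to 5-5 and differ only in what happens afterwards. A minimal sketch of that structure follows; it takes G(s), G(r) and, optionally, T(s,r) as plain numbers so that it stays self-contained. The name `set_win_prob` and its argument names are illustrative, and the code assumes (as the equations do) that player A serves the first game of the set.

```python
from math import comb
from typing import Optional


def set_win_prob(gs: float, gr: float, tb: Optional[float] = None) -> float:
    """Probability that A wins a set.

    gs: G(s), probability that A wins a game on serve
    gr: G(r), probability that A wins a game on return
    tb: T(s, r) for a tie-break set, or None for a set without tie-breaker
    """

    def wins(n: int, k: int, p: float) -> float:
        """Binomial probability of winning exactly k of n games."""
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    def reach(a_serve: int, b_serve: int, lost: int) -> float:
        """P(A loses exactly `lost` games) over a_serve + b_serve games."""
        return sum(
            wins(a_serve, a_serve - i, gs) * wins(b_serve, b_serve - (lost - i), gr)
            for i in range(lost + 1)
            if i <= a_serve and lost - i <= b_serve
        )

    total = 0.0
    for lost in range(5):                  # set won 6-0, 6-1, ..., 6-4
        played = 5 + lost                  # games before the deciding one
        a_serve, b_serve = (played + 1) // 2, played // 2
        last = gr if played % 2 else gs    # A serves the odd-numbered games
        total += reach(a_serve, b_serve, lost) * last

    five_all = reach(5, 5, 5)
    if tb is None:
        # Equation (4): win two games in a row from 5-5, repeated as needed.
        tail = gs * gr / (1 - gs * (1 - gr) - gr * (1 - gs))
    else:
        # Equation (3): win 7-5, or split the next two games and win the TB.
        tail = gs * gr + (gs + gr - 2 * gs * gr) * tb
    return total + five_all * tail


if __name__ == "__main__":
    # Hypothetical inputs: G(s) = 0.83, G(r) = 0.24, T(s, r) = 0.56
    print(f"S_TB  = {set_win_prob(0.83, 0.24, 0.56):.4f}")
    print(f"S_NTB = {set_win_prob(0.83, 0.24):.4f}")
```

Passing the game and tie-break probabilities in as numbers keeps the set model independent of how G and T are computed.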

1.4 Probability of winning a Match

Now that we have modelled a set, we can model a match, since the probability of winning a match depends on the probability of winning a set, S(s,r).

1.4.1 Probability of winning a 3 set match

In a 3 set match the first player to win 2 sets is declared the winner, and typically every set is decided by a tie-breaker if the score reaches 6:6. The expression for this probability is:

$$M_3(s,r) = S_{TB}(s,r)^2 \bigl( 1 + 2\,(1 - S_{TB}(s,r)) \bigr) \tag{5}$$

1.4.2 Probability of winning a 5 set match

In a 5 set match the first player to win 3 sets is declared the winner. With the sole exception of the U.S. Open, the major tournaments do not have a tie-breaker in the final (5th) set. The expression for this probability (for all tournaments except the U.S. Open) is:

$$M_5(s,r) = S_{TB}(s,r)^3 \bigl( 1 + 3\,(1 - S_{TB}(s,r)) \bigr) + 6\,(1 - S_{TB}(s,r))^2\, S_{TB}(s,r)^2\, S_{NTB}(s,r) \tag{6}$$

In the case of the U.S. Open, where the fifth set is also a tie-break set, the expression is:

$$M_5(s,r) = S_{TB}(s,r)^3 \bigl( 1 + 3\,(1 - S_{TB}(s,r)) \bigr) + 6\,(1 - S_{TB}(s,r))^2\, S_{TB}(s,r)^3 \tag{7}$$
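The match expressions are short enough to sketch directly. The function names below are illustrative; `s_final` is a hypothetical parameter standing for the probability of winning the fifth set, so that the same function covers both the case with and the case without a final-set tie-breaker.

```python
def match_win_prob_3(s_tb: float) -> float:
    """Best-of-3 match: win 2-0, or 2-1 (two orders for the lost set)."""
    return s_tb**2 * (1 + 2 * (1 - s_tb))


def match_win_prob_5(s_tb: float, s_final: float) -> float:
    """Best-of-5 match.

    s_tb:    probability of winning any of the first four sets (tie-break sets)
    s_final: probability of winning the fifth set; pass the no-tie-break set
             probability where the fifth set has no tie-breaker, or s_tb
             where it does
    """
    q = 1 - s_tb
    in_three_or_four = s_tb**3 * (1 + 3 * q)   # win 3-0 or 3-1
    in_five = 6 * q**2 * s_tb**2 * s_final     # 2-2 after four sets, then win
    return in_three_or_four + in_five


if __name__ == "__main__":
    s_tb, s_ntb = 0.60, 0.61  # hypothetical set probabilities
    print(f"M3 = {match_win_prob_3(s_tb):.4f}")
    print(f"M5 (no final-set TB) = {match_win_prob_5(s_tb, s_ntb):.4f}")
    print(f"M5 (final-set TB)    = {match_win_prob_5(s_tb, s_tb):.4f}")
```

With a set probability of 0.6 throughout, the best-of-3 value is 0.648 and the best-of-5 value is about 0.683, so the longer format favours the stronger player slightly more.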

Using the Model to predict the World Tour Final

In order to predict matches with the model, the most important inputs are the probability that a player wins a point on serve (s) and the probability that the player wins a point on return (r). The website of the Association of Tennis Professionals (ATP) provides tables that rank current players by percentage of serves won, percentage of break points converted, number of aces delivered and so on, in terms of what it calls a reliability index. These rankings are not a very useful measure, since a player who has played very few games may well have faced very few break points and managed to convert most of them. Therefore, to get a fair estimate of the real probabilities of a player winning on serve or return, we derive these probabilities by tracking a player over his entire career and over his recent form. For the recent form, we consider the performance of the player over a 52 week period. Although taking the recent form to be a period such as the last 15-20 weeks would be more accurate, we do not do this here because sufficiently detailed stats are not available. To calculate the probability of winning on serve (s), both for the 52 week period and over the entire career, we consider the percentage of first serves in (f1), the percentage of first serve points won (p1) and the percentage of second serve points won (p2), and we calculate s as follows:

$$s = f_1\, p_1 + (1 - f_1)\, p_2$$

For the probability of winning on return, we directly take the percentage of return points won. All the data used here were taken from [1].
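As a small worked example of the serve formula, the sketch below weights the first-serve and second-serve win rates by how often each kind of serve is played. The function name and the sample percentages are hypothetical and not taken from the data in [1].

```python
def serve_point_prob(f1: float, p1: float, p2: float) -> float:
    """Probability of winning a point on serve.

    f1: fraction of first serves in
    p1: fraction of first-serve points won
    p2: fraction of second-serve points won
    """
    return f1 * p1 + (1 - f1) * p2


if __name__ == "__main__":
    # Hypothetical player: 62% first serves in, 75% / 55% of points won.
    s = serve_point_prob(0.62, 0.75, 0.55)
    print(f"s = {s:.3f}")
```

With these made-up inputs the result is 0.62 x 0.75 + 0.38 x 0.55 = 0.674.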

2.1 Using the data collected to predict the winner

In the previous section, we calculated the probability of winning on serve and on return individually for each player. The objective now is to combine these to predict the outcome of a match between, say, player A and player B. It is often the case that a player, say A, has a winning record against most players and is ranked high, but fails to win against certain other players whose game does not suit his style. For instance, a typical serve-and-volley player is bound to have a tougher time against more defensive opponents and baseliners than against most other opponents. Therefore, one other important criterion for predicting the outcome of a match is how player A fares against the particular opponent. To include this, we take into account the Head-to-Head (H-H) stats as well. As in the previous section, we calculate the H-H stats, i.e. the probability of winning on serve and on return against this particular opponent, first over the last 52 weeks and then over their entire careers. If there have been no matches between the players in the last 52 weeks, we take the 52 week values to be the same as their career probabilities. We therefore have the following data, for each player, at our disposal to predict the outcome:

1. H-H
   52 week probability of winning on serve, return
   Career probability of winning on serve, return
2. Individual stats
   52 week probability of winning on serve, return
   Career probability of winning on serve, return

To predict the winner of a match between two players, say A and B, we have to consider all the parameters mentioned above. The H-H stats are already in normalized form, meaning that the probability of winning on serve for player A = 1 - probability of winning on return for player B, and vice versa. The individual stats are not, and therefore cannot be used directly, unlike the H-H stats. To normalize the individual stats for this particular tournament, we do the following.

Normalization: Normalization is done so that the stats can be used for direct comparison between the two players. The serve and return ability of each player is measured against the average serve and return stats for the most recent hard court tournament (i.e. the U.S. Open) and is computed in the following way:

$$\mathrm{Serve}(P1, P2) = s_{avg} + (s_{P1} - s_{P2}) - (r_{P2} - r_{P1})$$

$$\mathrm{Return}(P1, P2) = r_{avg} + (r_{P1} - r_{P2}) - (s_{P2} - s_{P1})$$

where Serve(P1, P2) is the normalized probability of winning on serve for P1 against P2, Return(P1, P2) is the normalized probability of winning on return for P1 against P2, and s_avg = 0.636 and r_avg = 0.3639 are the average serve and return percentages from the U.S. Open. With these definitions, Serve(P1, P2) + Return(P2, P1) = s_avg + r_avg, which is approximately 1, so the normalized individual stats have the same property as the H-H stats.

After having normalized all the data, the next step is to derive a single probability of winning on serve and a single probability of winning on return from all the data we have collected. To achieve this, we assign weights to all the components, namely 52 week H-H serve, career H-H serve, normalized 52 week serve and normalized career serve (in the case of serve), and then combine them to arrive at a single number for the probability of winning on serve. We do the same to arrive at the probability of winning on return.

Methodology used for assigning weights: In any match the performance of a player is largely dependent on the opponent against whom he is playing. Therefore, at the outset, we should give a higher weight to the H-H stats. Within the H-H stats, the performance of a player in the forthcoming match will largely depend on the success or failure he has had against this particular opponent in recent times, and hence the 52 week H-H stat is given a higher weight. After considering how good he is against this particular opponent, we should consider how good the player has been in recent times against all opponents (i.e. his current form). Hence the next highest weight is assigned to the player's individual 52 week stat. To assign the weights we use the rank reciprocal method and get the following weights for each of the components:

1. H-H stat - 0.5454
   52 week (serve/return) - 0.6
   Career (serve/return) - 0.4
2. 52 week (serve/return) - 0.3546
3. Career (serve/return) - 0.1
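The normalization and weighting steps can be sketched as follows. This is an illustration under stated assumptions: the minus signs in the normalization follow the reading in which Serve(P1, P2) + Return(P2, P1) = s_avg + r_avg, and the 0.6 / 0.4 split is read as applying inside the H-H component before the 0.5454 weight is applied. The function names and all sample inputs are hypothetical.

```python
S_AVG, R_AVG = 0.636, 0.3639           # U.S. Open averages quoted in the text

W_HH, W_52, W_CAREER = 0.5454, 0.3546, 0.1   # top-level weights from the text
W_HH_52, W_HH_CAREER = 0.6, 0.4              # split inside the H-H component


def normalize(s1: float, r1: float, s2: float, r2: float):
    """Normalized (serve, return) probabilities of player 1 against player 2."""
    serve = S_AVG + (s1 - s2) - (r2 - r1)
    ret = R_AVG + (r1 - r2) - (s2 - s1)
    return serve, ret


def combine(hh_52: float, hh_career: float,
            ind_52: float, ind_career: float) -> float:
    """Weighted single probability from the four components.

    Assumes the 0.6 / 0.4 weights apply within the H-H stat (an
    interpretation of the nested list of weights, not confirmed by the text).
    """
    hh = W_HH_52 * hh_52 + W_HH_CAREER * hh_career
    return W_HH * hh + W_52 * ind_52 + W_CAREER * ind_career


if __name__ == "__main__":
    # Hypothetical individual stats (serve, return) for players 1 and 2.
    serve_52, _ = normalize(0.68, 0.40, 0.65, 0.38)       # last 52 weeks
    serve_career, _ = normalize(0.67, 0.39, 0.66, 0.37)   # career
    # Hypothetical H-H serve probabilities, already normalized.
    s = combine(hh_52=0.66, hh_career=0.64,
                ind_52=serve_52, ind_career=serve_career)
    print(f"combined probability of winning a point on serve: {s:.4f}")
```

The same `combine` call with the return components gives the probability of winning on return, and the two numbers are then the s and r fed into the match model.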

Results

Having obtained the weights in the previous section, we now use them to derive the probability of winning on serve and on return for each player in a particular match. We then feed these into the model developed in the first section to finally predict the outcome of all the matches at the World Tour Final. The table below gives the predicted winner for each match and compares it with the actual result.
| Match | Predicted winner | % of winning | Actual winner |
| --- | --- | --- | --- |
| Murray vs Berdych | Murray | 57.49 | Murray |
| Djokovic vs Tsonga | Djokovic | 91.13 | Djokovic |
| Federer vs Tipsarevic | Federer | 96.39 | Federer |
| Ferrer vs del Potro | Ferrer | 80.66 | Ferrer |
| Djokovic vs Murray | Djokovic | 66.98 | Djokovic |
| Berdych vs Tsonga | Berdych | 73.85 | Berdych |
| Federer vs Ferrer | Federer | 93.39 | Federer |
| Del Potro vs Tipsarevic | Del Potro | 70.52 | Del Potro |
| Murray vs Tsonga | Murray | 86.24 | Murray |
| Djokovic vs Berdych | Djokovic | 87.62 | Djokovic |
| Federer vs del Potro | Federer | 79.26 | Del Potro |
| Ferrer vs Tipsarevic | Ferrer | 75.13 | Ferrer |
| Djokovic vs del Potro | Djokovic | 83.23 | Djokovic |
| Federer vs Murray | Federer | 59.51 | Federer |
| Djokovic vs Federer | Djokovic | 54.83 | Djokovic |
