Frequently Asked Questions
How are the ratings calculated?
The ratings are an adaptation of the TrueSkill rating system developed by Microsoft for its online gaming platforms. TrueSkill is an improvement on the classic Elo system that has been used for decades to rate international chess players. Elo ratings have also been used in many other settings, including international soccer, college football, e-games, and other forms of competitive gaming.
In both Elo and TrueSkill, the difference in ratings between two competitors is an indicator of the probability of the outcome of a match between them. Team ratings go up or down based on the difference between their predicted performance and their actual performance. Opponent strength is integrated from the beginning of the calculation. If the difference in rating between teams is large and the favorite wins, then neither team’s rating will be affected very much because the result was as the algorithm already predicted. Conversely, if the lower ranked team wins, then the effect will be much larger as the ratings adjust to the new information.
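To illustrate this update behavior, here is a minimal sketch using the open-source trueskill Python package. This is not necessarily the exact implementation behind these ratings, and the mu and sigma values are made up for the example; the point is only that an expected win moves the numbers very little while an upset moves them a lot.

```python
# Sketch only: illustrative values, open-source `trueskill` package as a stand-in.
import trueskill

favorite = trueskill.Rating(mu=32.0, sigma=2.0)   # established, highly rated team
underdog = trueskill.Rating(mu=22.0, sigma=2.0)   # established, lower rated team

# Expected result: the favorite wins, so neither rating moves very much.
fav_after, dog_after = trueskill.rate_1vs1(favorite, underdog)
print(round(fav_after.mu - favorite.mu, 2), round(dog_after.mu - underdog.mu, 2))

# Upset: the underdog wins, so both ratings move much more.
dog_after, fav_after = trueskill.rate_1vs1(underdog, favorite)  # winner listed first
print(round(dog_after.mu - underdog.mu, 2), round(fav_after.mu - favorite.mu, 2))
```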
The ratings are solely based on the accumulated head-to-head results of all teams. There are no assumptions made about the quality of a tournament or the abstract value of making it to a particular elimination round. The ratings avoid the impulse to assign impressionistic and potentially arbitrary bonuses or penalties to different kinds of debates. Opponent strength is “baked in” to the ratings at the most basic level of the calculation, so there is no need to add further variables.
Other, earlier rating systems depended on assumptions about what constituted a “good” tournament. For example, Jon Bruschke’s ratings started with a calculation that took the size of a tournament as an indicator of its quality, and then derived the value of a win from that quality. This approach is something of a kludge that can be a valuable workaround in a world of inadequate data, but when we can access the results of every round it should not be necessary.
If you want more information on the nitty gritty of the TrueSkill algorithm, see the question “What is TrueSkill?”
What is TrueSkill?
TrueSkill is a rating system developed by Microsoft for its online gaming platforms. It is an improvement on the classic Elo system that has been used for decades to rate international chess players. TrueSkill employs Bayesian inference and assumes that a team’s skill can be represented as a Gaussian (normal) distribution. Essentially, a team’s skill is a bell curve of possibilities – they may perform better or worse in a given round, but the range of those possibilities can be represented as a normal distribution with a particular mean and standard deviation. The system predicts the outcome of a specific debate round by comparing the possibilities represented by each team’s curve.
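For the curious, the standard way to turn two skill curves into a predicted win probability looks roughly like the sketch below. It uses the open-source trueskill Python package and its default beta parameter; the ratings described here may use different settings, and the example numbers are purely illustrative.

```python
# Sketch of the "curve comparison": P(A beats B) from the two skill Gaussians.
import math
import trueskill

def win_probability(a: trueskill.Rating, b: trueskill.Rating,
                    beta: float = trueskill.BETA) -> float:
    """P(A beats B) = Phi(delta_mu / sqrt(2*beta^2 + sigma_a^2 + sigma_b^2))."""
    delta_mu = a.mu - b.mu
    denom = math.sqrt(2 * beta ** 2 + a.sigma ** 2 + b.sigma ** 2)
    return 0.5 * (1 + math.erf(delta_mu / (denom * math.sqrt(2))))

team_a = trueskill.Rating(mu=28.0, sigma=3.0)
team_b = trueskill.Rating(mu=24.0, sigma=3.0)
print(win_probability(team_a, team_b))  # roughly 0.7 with these illustrative numbers
```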
There are two core elements of a team’s TrueSkill rating: mu and sigma – the mean and the deviation. Mu is the average skill level that we expect of a team, and sigma is the uncertainty in that estimate – the range within which their performances are expected to fall. As teams accumulate more rounds, their deviation should fall as the algorithm gains confidence in its assessment of their rating. For this reason, ratings are much more volatile early in the season and grow more stable as time goes on.
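A small simulation shows the effect: starting from a default rating and feeding in rounds one at a time, sigma drops quickly at first and then levels off. Everything below – the opponents, the results, the round counts – is invented for illustration, again using the trueskill package as a stand-in.

```python
# Sketch: sigma shrinks as a team accumulates results, so early-season ratings
# swing more than late-season ones. Opponents and outcomes are random here.
import random
import trueskill

random.seed(0)
team = trueskill.Rating()  # package default: mu=25, sigma=25/3
for round_number in range(1, 31):
    opponent = trueskill.Rating(mu=random.uniform(20, 30), sigma=3.0)
    if random.random() < 0.5:
        team, _ = trueskill.rate_1vs1(team, opponent)   # team wins
    else:
        _, team = trueskill.rate_1vs1(opponent, team)   # team loses
    if round_number % 10 == 0:
        print(round_number, round(team.mu, 2), round(team.sigma, 2))
# sigma falls steadily; after 20-30 rounds it is far smaller than its starting value
```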
Resources concerning TrueSkill can be found at Microsoft Research, which provides a good summary, and Jeff Moser has written an excellent in-depth explanation of the mathematical principles at work.
Why do you hate me?
These are not my personal opinions. The algorithm is fixed and runs independently of how I may personally feel about teams. I do not put my finger on the scale. In fact, I frequently have personal opinions about teams that deviate from the computer ratings.
If you feel strongly that you belong on the list but do not see your name, the most likely explanation is that the ratings do not yet have enough data on you. Given that a team’s sigma (deviation) becomes fairly stable around 2.0 (which roughly corresponds to 20-24 rounds, depending on whom you have debated), I have set the minimum cutoff to be listed at 20 rounds.
Why aren't the results from X tournament included?
The most likely explanation is that the tournament has not posted its results on Tabroom.com in a way that allows me to obtain them. The tournament administrator must have the "Result" column displaying the winner of each debate on the results page for each round.
How do you account for opponent strength at the beginning of the year?
Evaluating opponent strength at the beginning of the year is a problem for any rating system that starts with zero assumptions about each team’s skill level. At the extreme, how can it know in the first debate of the year if you are debating against the eventual NDT champion? If the system could run forever, the starting point would not matter much because the ratings would eventually sort themselves out. However, college debate has a finite season in which even the top teams might only participate in somewhere between 50 and 80 rounds. The ratings must work quickly, and excessive error at the beginning of the season could influence the final ratings.
The solution to this problem is relatively simple. Given that the issue is a lack of information to form a reliable rating, the answer is to give the algorithm more information. Since the rating algorithm is being run after sufficient data has been collected, it is possible to use the results from subsequent rounds to form a better estimate of how good a team is. Effectively, what the ratings do is run the algorithm twice. On the first pass, it creates a provisional rating for each team that uses all of the available information. On the second pass, it will use those provisional ratings in its predictions to estimate opponent strength until the time when that opponent has a sufficiently reliable rating.
This method does not involve double counting. The provisional rating is only ever used to evaluate how strong an opponent is. The second pass starts each team’s actual rating from scratch. When Team A debates Team B in round 1 of the season opener, the ratings create a separate prediction for each side. One prediction will be between the ratings for a blank slate Team A versus a reliable Team B; the other vice versa. The first will be used to update Team A’s rating, the latter to update Team B’s rating. The algorithm stops using a team’s provisional rating once that team’s actual rating becomes reliable enough (i.e., its sigma becomes small enough – a length of time that varies, but it is usually reached around 20 to 25 rounds).
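In pseudocode terms, the two-pass procedure described above might look something like the sketch below. This is a reconstruction from the description, not the actual code: the data structures, the sigma threshold, the chronological list of rounds, and the use of the trueskill package are all assumptions made for illustration.

```python
# Rough sketch of the two-pass idea. `rounds` is a hypothetical chronological
# list of (winner, loser) team names covering the whole season.
import trueskill

SIGMA_RELIABLE = 2.0  # illustrative threshold; the text says roughly 20-25 rounds

def first_pass(rounds):
    ratings = {}
    for winner, loser in rounds:
        w = ratings.setdefault(winner, trueskill.Rating())
        l = ratings.setdefault(loser, trueskill.Rating())
        ratings[winner], ratings[loser] = trueskill.rate_1vs1(w, l)
    return ratings  # provisional ratings built from all available information

def second_pass(rounds, provisional):
    ratings = {}
    for winner, loser in rounds:
        w = ratings.setdefault(winner, trueskill.Rating())
        l = ratings.setdefault(loser, trueskill.Rating())
        # Each side gets its own prediction: the opponent is represented by its
        # provisional rating until the opponent's actual rating becomes reliable.
        loser_for_winner = l if l.sigma <= SIGMA_RELIABLE else provisional[loser]
        winner_for_loser = w if w.sigma <= SIGMA_RELIABLE else provisional[winner]
        ratings[winner], _ = trueskill.rate_1vs1(w, loser_for_winner)
        _, ratings[loser] = trueskill.rate_1vs1(winner_for_loser, l)
    return ratings

# provisional = first_pass(rounds); final = second_pass(rounds, provisional)
```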
What is mu?
See “What is TrueSkill?”
What is sigma?
See “What is TrueSkill?”
Are elimination rounds weighted?
Yes and no. There is a multiplier applied to the amount that a team’s rating increases when they win an elimination round. This is not a flat value; rather, it is a function of how good the opponent is. There is very little “extra credit” for winning as a heavy favorite, but the bonus grows with how “big” the win is. After running several experiments, I found that the accuracy of the ratings went up when teams received a bonus for elim wins. However, the same did not hold true for weighting elim losses more heavily. Thus, there is no added penalty for losing in elims.
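Since the exact multiplier is not published here, the sketch below is only a hypothetical illustration of the idea: scale the winner’s rating gain by a bonus that grows as the win becomes less likely, while leaving the loser’s update untouched. The elim_bonus function and its numbers are invented, and the trueskill package is again a stand-in.

```python
# Hypothetical illustration only: the real multiplier function is not known here.
import trueskill

def elim_bonus(win_probability: float, max_bonus: float = 0.5) -> float:
    # Near-certain wins earn almost no extra credit; big upsets earn up to max_bonus.
    return 1.0 + max_bonus * (1.0 - win_probability)

def rate_elim_win(winner, loser, win_probability):
    new_winner, new_loser = trueskill.rate_1vs1(winner, loser)
    # Scale only the winner's gain; the loser is updated normally (no added penalty).
    boosted_mu = winner.mu + elim_bonus(win_probability) * (new_winner.mu - winner.mu)
    return trueskill.Rating(mu=boosted_mu, sigma=new_winner.sigma), new_loser
```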
Why is the “rating” lower than the “mu”?
A team’s final rating is an adjustment of their mean skill level (mu) that accounts for the size of their deviation (sigma). The adjustment is flat for teams that have reached a sufficiently small sigma, but it grows larger for teams that the ratings are less confident about. In effect, it says that we are fairly confident that a team is "at least" as good as their adjusted rating. If two teams have about the same mean rating but one has significantly fewer rounds than the other, then we should be less confident that the latter’s rating is accurate.
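A common way to implement this kind of conservative adjustment is to display mu minus some multiple of sigma. The sketch below uses the conventional TrueSkill factor of three; the actual adjustment used for these ratings may differ, and the example teams are invented.

```python
# Sketch of a conservative displayed rating: less-certain teams are pulled down more.
import trueskill

def displayed_rating(r: trueskill.Rating, k: float = 3.0) -> float:
    return r.mu - k * r.sigma

veteran = trueskill.Rating(mu=30.0, sigma=1.5)   # many rounds, small sigma
newcomer = trueskill.Rating(mu=30.0, sigma=4.0)  # few rounds, large sigma
print(displayed_rating(veteran), displayed_rating(newcomer))  # 25.5 vs 18.0
```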
While this is helpful for weeding out teams with high deviations, there is a limit to the usefulness of the procedure. When most regularly travelling teams end the season with somewhere around 80 rounds, it is somewhat silly to use deviation as a tool to distinguish between them.
How are panels tabulated by the system?
Panels are tabulated as binary wins and losses. I experimented with tabulating panels as a fraction of the ballot count (e.g., a 2-1 victory as 2/3 of a win). However, I found that the predictions became more accurate when I treated panels no differently than single-judge rounds.
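In practice this just means the ballot split is thrown away before the update, something like the following sketch (again using the trueskill package as a stand-in; the function and its parameters are hypothetical).

```python
# Sketch: a 2-1 panel decision enters the algorithm exactly like a 3-0 or a
# single-judge win; the ballot split is discarded before rating.
import trueskill

def rate_panel(winner, loser, winner_ballots, loser_ballots):
    # Ballot counts are ignored on purpose: binary results predicted better
    # than fractional credit in the experiments described above.
    return trueskill.rate_1vs1(winner, loser)
```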
What if a debater has multiple partners over the course of the year?
Unfortunately, each of those partnerships is treated as an independent unit. There are practical reasons for this related to sample size, but the main reason is that it is simply impossible for the ratings to know how responsible each member of the team is for their overall record.