Judging accuracy - an oxymoron?
Can dressage judging ever be accurate?
There are only ever about 30 5* judges in the world. They are the high achievers, the best of the best, and solely qualified to judge at the Olympics and the World Equestrian Games. It takes over 20 years of training and judging at the highest international level to reach this position.
So why is it that when 5 of them judged at the 2017 Hickstead CDIO the results were inconsistent? Not only did they not agree on percentage scores for each horse, they disagreed on their ranking. Is it that judges have opinions that just don’t agree, something like “beauty is in the eye of the beholder”? Or could it all be explained by each judge having a different viewing position? Or, as the judges themselves claim, is it because their extensive training is inadequate? In this article, I am going to suggest a very different reason - “Compound Deductions” - I will explain what I mean by this later.
To summarise the results at Hickstead, the GP had five 5* judges - all current and amongst the best in the world. The results were nothing short of shocking. They didn’t agree on the scores for each horse nor on the ranking.
Look at the large percentage differences:
- 34% of the class had scoring differences of 5% or more - at this level judges are supposed to sit down after the class and discuss any discrepancy. With so many more than 5%, they must have been up late.
- 62% of the class had scoring differences of 4% or more. There is nothing special about 5% - when a class is separated by fractions of one percent, saying we will discuss anything at 5% is giving themselves a VERY wide margin of error.
The judges admitted problems with the percentages but said they got the ranking right. This would be some consolation, but how do they know? The actual ranking differences didn’t remotely support this contention:
- 48% of the class had judges disagreeing by 10 places on their rank
- One rider placed 1st and 20th
- Two other riders placed 2nd and 19th
- One rider placed 5th and 20th
- One rider placed 5th and 19th
- One rider placed 7th and 21st (judges at E and B)
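To make the two kinds of disagreement concrete, here is a minimal Python sketch of how a percentage spread and a ranking spread could be computed from per-judge scores. The riders, scores and function names are made up purely for illustration; they are not the Hickstead results.

```python
def score_spread(scores):
    """Difference between the highest and lowest percentage given to one rider."""
    return max(scores) - min(scores)


def ranks_by_judge(scores_by_rider):
    """For each rider, list the place (1 = best) that each judge's scores imply."""
    riders = list(scores_by_rider)
    n_judges = len(next(iter(scores_by_rider.values())))
    ranks = {rider: [] for rider in riders}
    for j in range(n_judges):
        # Order the riders by this judge's score, highest first.
        ordered = sorted(riders, key=lambda r: scores_by_rider[r][j], reverse=True)
        for place, rider in enumerate(ordered, start=1):
            ranks[rider].append(place)
    return ranks


def share_at_least(values, threshold):
    """Fraction of the class whose spread is at least `threshold`."""
    values = list(values)
    return sum(v >= threshold for v in values) / len(values)


# Hypothetical class: 4 riders, 3 judges (illustrative numbers only).
scores = {
    "A": [72.0, 70.5, 66.0],
    "B": [69.0, 71.0, 70.0],
    "C": [68.0, 67.5, 71.5],
    "D": [65.0, 66.0, 65.5],
}

pct_spreads = {rider: score_spread(s) for rider, s in scores.items()}
ranks = ranks_by_judge(scores)
rank_spreads = {rider: max(p) - min(p) for rider, p in ranks.items()}

print(pct_spreads)                               # rider A differs by 6.0 points
print(share_at_least(pct_spreads.values(), 5.0))  # 0.25 of this made-up class
print(ranks["C"])                                # [3, 3, 1]: 3rd, 3rd and 1st
```

In this made-up class, rider C is placed 3rd by two judges and 1st by the third, which is the same kind of split as the placings listed above, only smaller.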
So what is the problem?
It isn’t the judges.
To save IDOC (International Dressage Officials Club - the judges and stewards club) time writing another letter complaining that I am gratuitously judge bashing and that I don’t mean it when I say it isn’t the judges: I do mean it. It isn’t the judges. They do a far better job than most could, including myself, and I sincerely believe that they are not the problem.
At least they are not the problem in the way they give marks, but some of them are a problem in the way they obstruct apparently any change that may improve accuracy or consistency.
By claiming that the problem is simply education, some of the top judges are hiding from addressing the core issue. I will show that with our current system of judging, more training alone is definitely not the answer.
Compound Deductions are the basis of our current system - the judge observes a movement, considers the observed errors and the quality of execution, and decides on a single mark representing an overall impression of that movement.
Importantly, neither the errors nor quality of execution are enumerated individually nor are they assigned a particular weighting. So, if the rider or public wants to understand why a mark was given it is not possible - it is an expert’s view based on years of experience. Great for hiding behind arcane rules and interpretation, but it isn’t very transparent for the public, riders or trainers.
From a judge training perspective, using this Compound Deduction approach, the judge has to be shown many different “scenarios” of errors and qualities of execution - each will have a mark and the judge has to memorise these. When judging a particular movement and horse, they recall that scenario and allocate the appropriate mark based on how close the execution is to what they have seen. Clearly, a judge at the highest level will have had to see many thousands of scenarios over many, many horses and have an encyclopaedic knowledge.
Implications... for training and judging
If there were only 10 or even 100 different scenarios per movement this approach may work. Even with a much larger number it could work, and training would be straightforward. Most “normal” judges could reach international level and even 5* level.
However, I have done a theoretical calculation of the possible maximum number of scenarios for passage alone and it is 31 million (for the actual calculation see note below). This is an impossible number to witness and memorise, so it is no wonder accuracy is the casualty - it is amazing they can do as well as they do. Of course, this is a theoretical calculation - in reality, the number will be significantly smaller as many problems come in groups or won’t be considered.
Even from this simple analysis, it is obvious that Compound Deductions are probably the single biggest cause of error in our current judging system and would account for the significant inconsistencies we regularly see.
As an example, take the judges’ suggestion for the extended walk, a 4 is defined as:
- 4 - Regularity of 4 beat is lost, shows some pacing steps, loss of suppleness, relaxation and freedom of the shoulders. No over-track.
There is not one description here, there are 5. So the question is what score should the judge give if one or more of the 5 problems aren’t there? For example, what if there are no pacing steps but it has the other problems? How many is “some” pacing steps? Or what if the horse shows good relaxation but has the other problems?
This is the essence of the issue - this approach can’t be accurate or consistent in theory nor in practice.
What is the solution?
Other judged Olympic sports, except boxing, use a “discrete”, Cumulative Deduction approach and it works for them. When they have a disagreement between judges it is because one of them makes a mistake not because they disagree on the quality of execution or what deduction should be given to an error. We need to get to that point.
The approach is simple and could be applied to dressage. Every quality or error that is considered important to judge is assigned a deduction. There are not so many of them, 43 for passage, and most of them are the same for most movements. Each quality/error can have 4 grades of execution: minor, significant, major and serious. For example, if there is one step with a rhythm error, this would be considered minor. If there were 2 or 3 steps, this could be considered significant. If there were more than that but less than half of the movement, this would be major. If it were for more than 50% of the movement, this would be serious. The 4 grades could be awarded deductions: 0.1 for minor, 0.5 for significant, 1.0 for major and 2.0 for serious. When a judge looks at a movement, there should normally be only 1 or 2 problems, and the judge would add the deductions to arrive at a final score. The numbers used here are just examples; the actual numbers would need to be decided by an expert committee.
Let’s say a horse has a rhythm problem for 5 steps: this would be a 1.0 deduction. At the same time, the contact was not steady, and this continued for several more steps: this would be a 2.0 deduction. Also, the impulsion is not high quality, so this would be an additional 1.0 deduction. The final deduction would be 4.0. These deductions would be given during the movement, i.e. as they are seen, not at the end.
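The arithmetic of this example can be sketched in a few lines of Python. This is only an illustration of the idea, not System X itself. The deduction values are the example numbers from this article. Everything else is my assumption: the function names, a movement of 12 steps, unsteady contact for 8 of them, the "more than half" rule taking precedence in very short movements, and exactly half counting as major.

```python
# Example deduction values from the article, held in tenths of a mark so that
# sums stay exact (0.1 -> 1, 0.5 -> 5, 1.0 -> 10, 2.0 -> 20).
DEDUCTION_TENTHS = {"minor": 1, "significant": 5, "major": 10, "serious": 20}


def grade_for_steps(faulty_steps, total_steps):
    """Map the number of faulty steps in a movement to one of the 4 grades."""
    if not 1 <= faulty_steps <= total_steps:
        raise ValueError("faulty_steps must be between 1 and total_steps")
    # Assumption: "more than 50% of the movement" wins over the step counts.
    if faulty_steps * 2 > total_steps:
        return "serious"
    if faulty_steps == 1:
        return "minor"
    if faulty_steps <= 3:
        return "significant"
    # More than 3 steps but not more than half (assumption: exactly half is major).
    return "major"


def total_deduction(grades):
    """Add up the deductions for every problem seen in the movement."""
    return sum(DEDUCTION_TENTHS[g] for g in grades) / 10


# The worked example, with an assumed movement length of 12 steps.
TOTAL_STEPS = 12  # hypothetical
observed = [
    grade_for_steps(5, TOTAL_STEPS),  # rhythm problem for 5 steps -> major
    grade_for_steps(8, TOTAL_STEPS),  # unsteady contact, assumed 8 steps -> serious
    "major",                          # impulsion, graded directly by the judge
]

print(observed)                   # ['major', 'serious', 'major']
print(total_deduction(observed))  # 4.0
```

Each deduction is recorded on its own, so a rider, trainer or second judge can see exactly which problem cost which part of the mark.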
The way forward
Clearly, the move to a Cumulative Deduction system is not trivial. At the request of the IDRC General Assembly 2015, we developed System X along these lines taking the gymnastics approach and modifying it for dressage.
The next stage will be to train a number of judges to the point where they can do the deductions without thinking and then to run a number of desktop video trials.
After further revision or revisions, it would be time to try this in practice on live horses in a controlled environment.
If it proves to be more consistent and accurate than the current system, we can look to implementing this, or something like this.
Judges’ reaction so far?
The reaction I have had from the top judges is less than enthusiastic. It ranges from “we don’t need it” through to “no one could ever do it”.
I chose one show to highlight the problem but there are many that refute the notion that the current system is good enough. If the judges in other sports can do it, then I very much doubt that our judges can’t - I suspect that it is more that no one likes change, especially change that will require a fairly significant amount of work. But, if dressage is to continue as an Olympic sport, it has to change, fundamentally.
Who are the experts?
Not strictly related to the article, but who says judges should get to make the rules?
Judges often say that they are the ones that set the standard and have the responsibility to maintain our classical heritage. Why? What is the basis for this claim? The role of a judge is to give a mark against a predetermined standard. No more than that.
Dressage needs a Standards of Excellence Committee. Logically, its members should be the ones who actually do the training: the talented riders and trainers who have the depth of knowledge to say what is good or not and what is important. They should be supported by experts in related fields such as equine ethology, ethics and veterinary medicine.
Certainly, some judges have trained and can train horses to a high level. But this is their ability as a trainer, not a judge.
Candidly, I suspect the judges will move into overdrive to stop any progress in this area. Maybe because they really do believe that the only thing missing is yet more training. If they want to continue to state this, they need to move from assertion to evidence - what do they have that leads them to this conclusion and how can we test it?
The Dressage Judging Working Group couldn’t get HiLo passed and this is a simple, sensible bandaid that would help judging consistency with no downside to judges nor riders. What chance has a fundamental change got...?
In passage, there are 43 different deductions outlined in the Dressage Handbook. If we said only 4 errors of these 43 would be used to determine the mark, there are 31 million different combinations (assuming 4 grades of execution):
43C4 x 4^4 = 123,410 x 256 = 31,592,960
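The count in this note, choosing 4 of the 43 deductions and giving each one of 4 grades, can be checked with a few lines of Python. The function and variable names are mine, for illustration only.

```python
from math import comb


def scenario_count(available, used, grades):
    """Ways to pick `used` errors from `available`, each with one of `grades` grades."""
    return comb(available, used) * grades ** used


print(comb(43, 4))               # 123410 ways to choose 4 of the 43 deductions
print(scenario_count(43, 4, 4))  # 31592960, i.e. about 31 million
```

As the article says, this is a theoretical maximum: it treats every combination as equally possible, whereas in reality many problems come in groups.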
- by Wayne Channon