Bartosz Jura: "Scoring Statistics Prejudiced Against Judges?!"

Guest Column

David Stickland and Global Dressage Analytics (GDA) have published a number of diagrams and commentaries, titled "GDA Dressage Facts News", on dressage judging statistics. Because these diagrams are used as arguments in public discussions about changes to the dressage judging system, I would like to take the opportunity to point out a number of issues with the diagrams, the conclusions drawn from them and the accompanying commentaries.

Figure scoring precision – important preselection factor not mentioned

According to the dressage rules, the marks 6 to 8 span the range from minimal (6) to expected (8) quality of an athlete's performance. As GDA analysed international PSG results only, the competing athletes had already been preselected in national competitions to meet this minimal-to-expected quality for most PSG movements. Not surprisingly, then, international judges agreed with national judges that the athletes do present minimal-to-expected quality, and the marks awarded fall in the 6-8 range (with small deviation) for most movements!

For marks in the ranges 1-5 and 9-10, a much larger deviation was observed, due both to the smaller population and, of course, to disagreement between judges on penalties for underperformance and rewards for well-performed exercises. To draw any further conclusions, one must drill down and analyse the root cause of why judges disagree on marks of 1-5 and 9-10. No further conclusions can be drawn from the presented diagrams, especially since they cover only a few international PSG results!

Controversial, large differences between judges

GDA analysed the percentage of instances in which judges disagree on a final score by more than 5% for various pairs of judging positions in CDI Grand Prix classes with five judges. The conclusion that can be drawn from the diagram is that there is an evident but relatively small difference between the view of the E/B judges and that of the H/C/M judges. However, the chart cannot substantiate any other conclusion because of a fault in the design of the analysis. The diagram presents the difference between two final cumulative scores, each built from the marks for 37 dressage movements. Mathematically, this is an accumulation of 37 separately measured, multi-attribute variables, in which individual disagreements can cancel each other out, and that is methodologically wrong.

To draw any conclusions, one should analyse large differences for each of the thirty-seven dressage movements separately. This is the only mathematically sound way to calculate the precision with which a variable is measured. One would then be able to conclude, for example, whether or not judges consistently fail to observe certain mistakes from certain positions. At present, this diagram should not be used to look for a root cause, as it contains no further statistical information.
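
The point about accumulated scores can be illustrated with a minimal Python sketch. The marks below are invented for illustration only (six movements instead of 37, all coefficients assumed equal to 1), and the one-mark threshold for a "large" difference is an assumption, not a GDA or FEI definition. Two judges arrive at exactly the same final percentage while disagreeing by a full mark or more on four of the six movements, which a comparison of final scores cannot show.

```python
# Hypothetical marks (0-10) given by two judges to the same ride; not real data.
judge_e = [7.0, 6.0, 8.0, 5.0, 7.5, 6.5]
judge_c = [6.0, 7.0, 6.5, 6.5, 7.0, 7.0]

# Assumed threshold for a "large" per-movement difference (illustrative only).
LARGE_GAP = 1.0


def percent(marks):
    """Final score as a percentage of the maximum, all coefficients assumed 1."""
    return 100.0 * sum(marks) / (10.0 * len(marks))


# Difference between the two final (accumulated) scores.
final_gap = abs(percent(judge_e) - percent(judge_c))

# Difference for each movement separately.
per_movement_gap = [abs(a - b) for a, b in zip(judge_e, judge_c)]

# Movement numbers (1-based) where the judges differ by LARGE_GAP or more.
large = [i + 1 for i, gap in enumerate(per_movement_gap) if gap >= LARGE_GAP]

print(f"final score gap: {final_gap:.2f} percentage points")
print(f"per-movement gaps: {per_movement_gap}")
print(f"movements with a large gap: {large}")
```

With real score sheets, the same per-movement comparison could be tabulated by judging position to see whether particular movements are systematically marked differently from particular positions.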

Accuracy of GP judging for international shows, which is not judging precision at all

There are three important mathematical deficiencies in the diagram: 1) the relative standard deviation (RSD) and the coefficient of variation (CV) are the common measures of precision, yet neither is used by GDA in this diagram; 2) mathematically speaking, dressage judging is an example of five measurements of the same variable by five judges, hence precision should be calculated across all five judges and not for pairs of judges; 3) as indicated earlier, precision should be calculated for each movement separately and not for final scores. As a result, what the diagram presents as precision is not precision at all and could be misleading. It is worth mentioning that there may be a number of different reasons for average scores rising over time: more thorough preselection of athletes at national shows (the data come from international competitions), improved schooling of athletes and improved quality of horses. This could be an interesting topic to analyse.
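
As a rough illustration of the precision measure argued for above, the following Python sketch computes the coefficient of variation (the relative standard deviation expressed in percent) across all five judges for each movement separately. The marks are invented, only three movements are shown, and the use of the sample standard deviation is an assumption; this is not a reconstruction of any GDA calculation.

```python
from statistics import mean, stdev

# Hypothetical marks: one row per movement, one column per judge (E, H, C, M, B).
marks = [
    [7.0, 7.0, 7.5, 7.0, 6.5],  # judges largely agree
    [6.0, 7.5, 5.0, 8.0, 6.5],  # judges disagree strongly
    [8.0, 8.0, 8.0, 8.0, 8.0],  # judges agree completely
]


def cv_percent(values):
    """Coefficient of variation in percent: sample standard deviation / mean."""
    return 100.0 * stdev(values) / mean(values)


# One precision figure per movement, computed over all five judges at once.
cv_by_movement = [cv_percent(row) for row in marks]

for number, cv in enumerate(cv_by_movement, start=1):
    print(f"movement {number}: CV = {cv:.1f}%")
```

A lower CV means the five judges agree more closely on that movement. Comparing the CV across movements, rather than comparing final scores for pairs of judges, would show where disagreement actually arises.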

Collective scores – misleading interpretation of analysis results?

Contrary to the GDA statement that collective marks serve no purpose (“I’m pretty sure it does mean that either way we could live without them”), one can see that these marks have a clear influence on athletes' final results. The correlation is positive and inflates the final score. Let’s have a look at the results from the analysed show. If an athlete gets just 65% for the figure score, the collective marks are also 65%, which means that a merely decent performance earns no bonus from the collectives. For a figure score of 70%, however, the collective marks are about 72%, so they inflate the athlete's final score. For a 75% figure score the collectives have even more influence (78%, an extra 3 percentage points), for an 80% figure score they hit 85% (an extra 5), and for an 85% figure score they break the 91% mark (an extra 6). Yes, collectives work!
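
The arithmetic behind these figures can be checked with a short Python sketch. The pairs below are the approximate values quoted in the paragraph above, not the underlying data set. The sketch only computes the gap between collective and figure scores; how much of that gap reaches the final score depends on the weight of the collective marks, which is not stated here.

```python
# (figure score %, collective marks %) as quoted in the text; values approximate.
readings = [(65, 65), (70, 72), (75, 78), (80, 85), (85, 91)]

# Gap in percentage points between collective marks and figure score.
bonus = [collective - figure for figure, collective in readings]

for (figure, collective), extra in zip(readings, bonus):
    print(f"figure {figure}% -> collectives {collective}% (+{extra} points)")

# The gap never shrinks as the figure score rises.
gap_is_non_decreasing = all(b >= a for a, b in zip(bonus, bonus[1:]))
print(f"gap grows with figure score: {gap_is_non_decreasing}")
```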

As you can see, the picture that emerges from the GDA diagrams is completely different from the one painted by the commentary, and it is the right picture of collective marks. The better the execution of the movements, the higher the chances of adherence to the training scale, proper aids, submissiveness, clear rhythm, … Everything is fine: judges apply the dressage rules and reward riders who follow them. Each of us can imagine two rides of the same test receiving the same technical marks for the figures. One was a superb, fluent ride on a happy, supple and well-trained horse, although some mistakes happened due to wind, stress, etc. The other was without mistakes but showed some horse-rider deficiencies, e.g. lack of rhythm or collection, forced contact, open mouth, uneven steps, tension, etc. The first ride would be rewarded with significantly higher collective marks, and this is exactly right according to the Dressage Rules and the spirit of proper dressage. In fact, the diagram itself contains such examples: please have a look at the figure scores of 68%, 70% and 76%.

This becomes even more evident when one analyses the ranking instability diagram for this one show. GDA says: “the collective marks are largely inoffensive but they do increase the sensitivity to a single judge changing the ranking. So perhaps they should be discontinued, or maybe just have their overall weight reduced (coef 1).” However, regardless of the mathematics behind it, this conclusion cannot be supported by the analysis. The following are the accurate conclusions from the diagram:

if collective marks were removed, ranking instability would drop for ranks 3-5 only and increase for the rest of the ranking, so removing the collectives would actually increase ranking instability!
collective scores are influential for the whole ranking except the top 3 riders; further analysis is needed to determine the root cause of the top 3's resilience to individual judges' preferences: celebrity judging or the top riders' superior skills.

Summary

The presented GDA analysis:

was limited to some aspects of judging only, fragmented in terms of tests, level of athletes and judges, and geography, and not comprehensive in terms of root-cause analysis of the observed trends;
was based on source data which are not coherent: some analyses were performed on national competition data, some on international shows, some on one show only;
was performed with some deficiencies in statistical methodology.

With the above in mind, one may say that extending the unsupported conclusions to dressage globally and postulating systemic changes on that basis is methodologically inappropriate and risky.

Next Steps - Recommendation

Scientific studies, if they are to form a solid foundation for any improvement effort, require publication of all the details of the analyses performed, including data sets, assertions made, the mathematical apparatus used, complete results and criticism of the results. Such an approach ensures transparency and objectivity, since any reader may apply scepticism and conduct alternative studies in order to validate or refute the arguments made.

Based on our life experience, we would all agree that only such thorough validation can safeguard the adequacy of a public dispute that affects the lives of a large group of dressage stakeholders. With all due respect to David Stickland and GDA, at the moment the FEI, judges, athletes, spectators and dressage fans are being attacked through public and social media communication that uses GDA diagrams and conclusive statements which have deficiencies. As a result, readers, most of whom do not have a sufficient mathematical background, are exposed to arguments which they can neither validate nor contest, and can easily be influenced or manipulated.

I am personally convinced that some statements made by GDA, or based on GDA diagrams, are more than just a prejudice towards dressage judges. However, before the FEI or any of the stakeholders proceed further with discussions on changes to the judging system, it is highly recommended to apply due care in this respect. It would be just normal practice to engage some sports universities to obtain a few different perspectives based on statistical analysis of the full population of FEI competition scores for 2014-2017. Once sports scientists have provided objective input, the international dressage forum will be equipped with the facts, figures and arguments required to discuss and decide on the future of dressage.

by Bartosz Jura

Bartosz Jura is a certified internal auditor, certified IT auditor, and certified risk management assurer with a passion for dressage. A national dressage judge in Poland, Jura works and lives in Belgium with his wife who is a Grand Prix dressage rider.