Every year, hundreds of new computational methods are published in the field of biomedical image analysis. For a long time, validation of new algorithms was based on authors’ private datasets, rendering fair and direct comparison of solutions impossible. Over time, common research practice has evolved to favor international competitions (‘challenges’) that allow for comparative benchmarking of algorithms in a controlled environment.
Our comprehensive analysis of biomedical image analysis challenges [1] revealed a huge discrepancy between the impact of a challenge and the quality (control) of its design and reporting. We showed that (1) "common practice related to challenge reporting is poor and does not allow for adequate interpretation and reproducibility of results", (2) "challenge design is very heterogeneous and lacks common standards, although these are requested by the community" and (3) "challenge rankings are sensitive to a range of challenge design parameters, such as the metric variant applied, the type of test case aggregation performed and the observer annotating the data". We also showed that security holes in challenge design can potentially be exploited by both challenge organizers and participants to tune rankings, for example through selective test case submission (participants) or retrospective tuning of the ranking scheme (organizers) [2].
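To make the sensitivity to test case aggregation concrete, the following toy example (our own illustration, not data from [1]) shows how identical metric values can produce opposite winners depending on whether metric values are averaged before ranking ("aggregate then rank") or algorithms are first ranked per test case and the ranks are then averaged ("rank then aggregate"):

```r
# Toy metric values (higher is better) for two algorithms on three test cases.
scores <- data.frame(
  algorithm = rep(c("A", "B"), each = 3),
  case      = rep(c("c1", "c2", "c3"), times = 2),
  value     = c(0.9, 0.2, 0.2,   # algorithm A
                0.5, 0.3, 0.3)   # algorithm B
)

# Aggregate then rank: average the metric per algorithm, then rank the averages.
mean_scores <- tapply(scores$value, scores$algorithm, mean)
rank(-mean_scores)   # A = 1, B = 2 (mean metric 0.43 vs. 0.37)

# Rank then aggregate: rank algorithms within each case, then average the ranks.
per_case_ranks <- ave(-scores$value, scores$case, FUN = rank)
mean_ranks <- tapply(per_case_ranks, scores$algorithm, mean)
rank(mean_ranks)     # B = 1, A = 2 (mean rank 1.33 vs. 1.67)
```

Algorithm A wins the metric-based ranking on the strength of a single case, whereas algorithm B wins the rank-based ranking by being more consistent across cases; the choice of ranking scheme, not the data, decides the winner.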
Given these shortcomings, we presented challengeR [3], a toolkit that provides a set of methods to comprehensively analyze and visualize the results of single- and multi-task challenges; its applicability has been demonstrated on a number of simulated and real-life challenges. Our approach offers an intuitive way to gain important insights into the relative and absolute performance of algorithms and their specific strengths and weaknesses that cannot be revealed by commonly applied visualization techniques. The online tool presented on this website, Rankings Reloaded, extends the capabilities of challengeR and broadens its usability, especially to developers not familiar with the R language.
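For readers who do work in R, the sketch below outlines what a typical single-task analysis with the challengeR package looks like. The function names follow the package documentation, but the input file, column names and parameter values are illustrative assumptions; please consult the challengeR README for the authoritative interface.

```r
library(challengeR)

# Long-format results table: one metric value per algorithm and test case
# (file name and column names are placeholders for your own results).
data_matrix <- read.csv("challenge_results.csv")

# Declare which columns hold the algorithm, test case and metric value,
# and whether smaller metric values indicate better performance.
challenge <- as.challenge(data_matrix,
                          algorithm = "algorithm",
                          case = "case",
                          value = "value",
                          smallBetter = FALSE)

# Compute a ranking, here by aggregating the metric with the mean before ranking.
ranking <- aggregateThenRank(challenge, FUN = mean, na.treat = 0, ties.method = "min")

# Assess ranking stability via bootstrapping and generate a benchmarking report.
set.seed(1)
ranking_bootstrapped <- bootstrap(ranking, nboot = 1000)
report(ranking_bootstrapped,
       title = "Example challenge",
       file = "example_report",
       format = "PDF")
```

Rankings Reloaded is intended to expose this kind of workflow through a web interface, so that a comparable report can be generated without writing any R code.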
This project is being developed within the following organizations:
[1] Maier-Hein, L., Eisenmann, M., Reinke, A. et al. Why rankings of biomedical image analysis competitions should be interpreted with care. Nat Commun 9, 5217 (2018). https://doi.org/10.1038/s41467-018-07619-7
[2] Reinke, A., Eisenmann, M., Onogur, S. et al. How to exploit weaknesses in biomedical challenge design and organization. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 388-395). Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00937-3_45
[3] Wiesenfarth, M., Reinke, A., Landman, B.A., Eisenmann, M., Aguilera Saiz, L., Cardoso, M.J., Maier-Hein, L. and Kopp-Schneider, A. Methods and open-source toolkit for analyzing and visualizing challenge results. Sci Rep 11, 2369 (2021). https://doi.org/10.1038/s41598-021-82017-6