How does it work?

Before using Rankings Reloaded, make sure that you understand its concepts. The Definitions section gives brief explanations of the input components, and the Data structure section describes how the input data file is organized. Once you understand these concepts, you are ready to prepare your own data file.

Definitions [1]

  • Task: A problem to be solved in the scope of a challenge/scientific work for which a dedicated ranking/leaderboard is provided (if any). Depending on the design, a challenge/problem can have a single task or multiple tasks. Single-task works involve only one job, for example detecting human faces in a set of images. Multi-task challenges can be organized in different ways. For example, the first task could be to detect an object and the second to segment the detected object in the same data. Another example is the segmentation of multiple objects in the same data. Some multi-task challenges instead require performing the same assignment on different datasets/data types, for example segmenting the liver from CT and from MRI images.
  • Case: A subset of the dataset for which the algorithm(s) of interest produce results. In general, a case refers to a subset of the data stemming from a single acquisition. For example, a dataset with 50 cases typically contains image series from 50 different acquisitions.
  • Algorithm: A method designed to solve the tasks in the challenge.
  • Metric: A measure used to compute the performance of a given algorithm for a given case, typically based on the known correct answer. Metrics can differ in their range and ordering: for one metric, higher values indicate better results, while for another, lower values do. Therefore, metrics are usually normalized to yield values in the interval between 0 (worst performance) and 1 (best performance). However, this is not essential for using Rankings Reloaded: you can use any metric range as long as you specify the sort direction, i.e., whether smaller or larger values are better.
  • Assessment method: The assessment method defines how to create a ranking of the participating algorithms. Different approaches exist; the most common are "rank then aggregate" and "aggregate then rank". Assessment methods may vary across different tasks of a challenge.
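
To illustrate the difference between the two assessment methods, here is a minimal Python sketch. It uses pandas and made-up metric values, assumes that higher metric values are better, and is not the toolkit's actual implementation (Rankings Reloaded computes the rankings for you):

import pandas as pd

# Toy results for one task: three cases, two algorithms (values are made up),
# assuming that higher metric values indicate better performance.
df = pd.DataFrame({
    "Case":        ["case1", "case1", "case2", "case2", "case3", "case3"],
    "Algorithm":   ["A1", "A2", "A1", "A2", "A1", "A2"],
    "MetricValue": [0.27, 0.20, 0.57, 0.95, 0.91, 0.88],
})

# "Aggregate then rank": average the metric per algorithm, then rank the means.
aggregate_then_rank = (
    df.groupby("Algorithm")["MetricValue"].mean()
      .rank(ascending=False)                      # rank 1 = best mean value
)

# "Rank then aggregate": rank the algorithms per case, then average the ranks.
df["CaseRank"] = df.groupby("Case")["MetricValue"].rank(ascending=False)
rank_then_aggregate = df.groupby("Algorithm")["CaseRank"].mean().rank()

print(aggregate_then_rank)   # A2 first: it has the higher mean metric value
print(rank_then_aggregate)   # A1 first: it is ranked best in two of three cases

Note that the two methods can disagree, as in this toy example, which is one reason why the choice of assessment method matters.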

Data structure

To generate your benchmarking report, your data must be provided as a CSV file. For example, in a challenge/problem with two tasks, two test cases, and two algorithms, the data might look like this:

Task      Case      Algorithm   Metric Value
Task 1    case1     A1          0.27
Task 1    case1     A2          0.20
Task 1    case2     A1          0.57
Task 1    case2     A2          0.95
Task 2    case1     A1          0.37
Task 2    case1     A2          0.89
Task 2    case2     A1          0.91
Task 2    case2     A2          NA
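
If you prefer to create the file programmatically, the following minimal Python sketch (using pandas; the column names and the file name data.csv are only examples, not requirements of the tool) writes a CSV with the structure shown above:

import pandas as pd

# Recreate the sample table; None becomes NaN and is written as an empty field,
# which counts as a missing observation.
rows = [
    ("Task 1", "case1", "A1", 0.27),
    ("Task 1", "case1", "A2", 0.20),
    ("Task 1", "case2", "A1", 0.57),
    ("Task 1", "case2", "A2", 0.95),
    ("Task 2", "case1", "A1", 0.37),
    ("Task 2", "case1", "A2", 0.89),
    ("Task 2", "case2", "A1", 0.91),
    ("Task 2", "case2", "A2", None),   # missing metric value
]
df = pd.DataFrame(rows, columns=["Task", "Case", "Algorithm", "MetricValue"])
df.to_csv("data.csv", index=False)     # file name is just an example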

The column structure of your data must be the same as in the sample above. The columns represent:

  • A task identifier. The ranking analysis may be performed for different tasks, which are defined via this column. A task may, for example, refer to

    • the classical definition for multi-task challenges, for example: Task 1 - classification, Task 2 - segmentation,
    • different settings in the challenge, for example different levels of difficulty: Task 1 - Stage 1, Task 2 - Stage 2, or
    • different metrics to be tested, for example: Task 1 - Dice similarity coefficient (DSC), Task 2 - Normalized surface distance, Task 3 - Sensitivity.
    • Remark: If you are comparing different metrics, make sure that they are ordered in the same way! The toolkit requires a boolean specification of whether small values are better, i.e., the sort direction of the metric values. Example: The first metric is the Dice similarity coefficient (DSC) (range: [0,1]) and should be compared to rankings based on the volumetric difference (range: [0,x], x not specified). High values of the DSC (close to 1) correspond to high performance, whereas high values of the volumetric difference correspond to low performance. To resolve this and make the comparison work as a multi-task challenge, you can invert the DSC values so that high values also correspond to low performance (see the sketch after this list).

  • A case identifier. This column should contain all cases (images) that are used in the challenge/benchmarking experiment. Make sure that each case appears only once for the same algorithm and task.

  • An algorithm identifier. This column should contain all algorithms/methods to be compared. Each algorithm should appear exactly once per case.

  • A metric value. The calculated metric values appear in this column. For a single metric, a value should be provided for each algorithm and each case. If a metric value is missing, a missing observation has to be provided (either as a blank field or as “NA”); otherwise the system cannot generate the report. For example, in task “Task 2”, test case “case2”, algorithm “A2” did not give a prediction, and thus NA is inserted to denote the missing value in the table above. Rankings Reloaded will ask you how to handle such missing cases (e.g., by assigning them the worst possible metric score) during report generation.
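
The following minimal Python sketch (again pandas-based, with hypothetical file and column names) illustrates the remarks above: inverting a "higher is better" metric so that it can be combined with a "lower is better" metric, checking that each task/case/algorithm combination appears only once, and locating missing metric values before uploading the file:

import pandas as pd

# Hypothetical file and column names; adapt them to your own data.
df = pd.read_csv("data.csv")

# Sort-direction remark: assuming Task 1 holds DSC values (higher is better)
# and another task holds a distance-based metric (lower is better), invert
# the DSC (which lies in [0, 1]) so that, as for the distance, smaller values
# mean better performance for both metrics.
dsc_rows = df["Task"] == "Task 1"
df.loc[dsc_rows, "MetricValue"] = 1.0 - df.loc[dsc_rows, "MetricValue"]

# Every task/case/algorithm combination must occur exactly once.
duplicated = df.duplicated(subset=["Task", "Case", "Algorithm"], keep=False)
assert not duplicated.any(), "duplicate task/case/algorithm combinations found"

# Blank fields and "NA" are both read as NaN, so missing values are easy to list.
print(df[df["MetricValue"].isna()])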

Download Sample Data

You can also download our sample_data.csv to examine the structure of the data file. In this file, results for three tasks were generated to illustrate different scenarios (ideal, random, worst case). This file is used to generate the sample_data_report.pdf file on the "Use Cases" page.

Ready for Report Generation!

Congratulations! Your benchmarking report is only a few clicks away. You are ready to prepare your challenge data for report generation. Before you continue, you may want to check the Citation and FAQ pages. Then prepare your data and click the button below:




[1] Maier-Hein, L., Eisenmann, M., Reinke, A. et al. Why rankings of biomedical image analysis competitions should be interpreted with care. Nat Commun 9, 5217 (2018). https://doi.org/10.1038/s41467-018-07619-7