Submission Format

Performance on the test set will be evaluated on microscope videos only. Tray videos are simply meant as contextual information that participants may or may not use. The reason is that tool/tissue contacts, as seen in the microscope videos, are more relevant for surgical workflow analysis than tool (in)visibility in the tray videos.

Results on the test set must be submitted as a zip archive named after the participating team. This archive should contain one result file per microscope video in the test set. Each result file should be a comma-separated value (CSV) file named ‘test<video id>.csv’ (example: ‘test01.csv’ for video ‘test01’). Result files should follow these guidelines:

  • Each line in these CSV files should start with the frame ID (i.e. the JPEG image name, without the file extension) and then provide the algorithm’s confidence level for each tool.
  • A confidence level can be any real number. In particular, it does not have to be a probability. The only restriction is that a larger number indicates a larger confidence that the tool is being used.
  • The confidence levels for the tools should be written in the result files in the same order as in the ground truth files (i.e. the order used in Fig. 3 on the ‘data’ page).
  • Note that the ground truth files differ in that they include a header line listing the tool names; this header should be omitted from the result files.

An example of a valid result file (for a very short video with only five frames) is given below:

1, -0.37, 0.73, -0.38, -0.93, -0.10, 0.16, -1.97, -0.44, -1.17, -0.72, -1.39, -0.61, 0.20, 0.15, 0.57, 1.10, 0.05, 1.09, -0.27, 0.85, -0.86
2, 0.63, 0.63, -0.74, 1.03, -0.06, 1.46, 0.39, -0.54, -0.84, 0.05, 0.26, 0.73, 0.81, -0.87, -0.57, 1.28, 1.42, 1.57, 0.75, 0.88, -1.36
3, 0.97, -1.12, 0.41, 1.28, 1.10, -0.52, -1.29, -0.88, 1.37, -1.49, 0.94, 0.34, 0.27, -0.67, 0.43, -0.14, 0.31, -0.72, 0.95, -1.08, 0.62
4, 0.95, -0.17, -0.11, -1.57, -0.55, 0.56, -0.62, 0.82, 1.18, 0.43, -0.49, -0.35, 0.72, -1.45, -3.36, 0.96, -0.12, -1.06, -0.71, 0.04, -1.74
5, -0.76, -0.16, -0.63, -0.13, -1.37, -1.39, -0.40, -1.47, -0.03, -1.13, -0.06, 0.32, 0.95, 0.76, -0.64, 0.81, 1.04, -0.48, -1.03, 0.32, 2.65
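
For reference, the snippet below sketches how such a result file could be written in Python. The frame IDs, the ‘scores’ array and the output path are made up for illustration; only the format (frame ID first, then the 21 confidence values in the ground truth tool order, no header) follows the guidelines above.

import numpy as np

# Hypothetical inputs: frame IDs (JPEG names without extension) and an
# (n_frames x 21) array of per-tool confidence levels for one test video.
frame_ids = [str(i) for i in range(1, 6)]
scores = np.random.randn(len(frame_ids), 21)

# One result file per microscope video: no header, one line per frame,
# frame ID first, then the 21 confidence levels in the ground truth tool order.
with open('test01.csv', 'w') as f:
    for frame_id, row in zip(frame_ids, scores):
        f.write(frame_id + ', ' + ', '.join('%.2f' % v for v in row) + '\n')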

Performance Evaluation

Evaluation Metric

Submissions will be evaluated by the mean per-tool area under the ROC curve (mAz):

  1. The annotation performance for each tool Ti, i=1...21, is defined as Az(Ti), the area under the ROC curve for tool Ti.
    • Each Az(Ti) value is computed over all microscope videos in the test set.
    • Frames associated with a disagreement for tool Ti are ignored when computing Az(Ti).
    • Empty frames added for synchronization purposes are ignored when computing Az(Ti).
  2. A global figure of merit mAz is defined by taking the average of all Az(Ti) values.

The evaluation script (evaluate.py) can be downloaded.
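
For illustration only (this sketch is not the official evaluate.py), mAz could be computed as follows, assuming that the ground truth and the predictions for all test videos have been stacked into two (n_frames x 21) arrays, and that the frames to skip for a given tool (annotator disagreements, empty synchronization frames) are flagged in a boolean mask. All variable names are assumptions.

import numpy as np
from sklearn.metrics import roc_auc_score

def mean_auc(y_true, y_score, ignore):
    """Mean per-tool area under the ROC curve (mAz).

    y_true  -- (n_frames, n_tools) binary ground truth labels (1 = tool in use)
    y_score -- (n_frames, n_tools) predicted confidence levels
    ignore  -- (n_frames, n_tools) boolean mask, True for frames to skip for a
               given tool (disagreements, empty frames added for synchronization)
    """
    az = []
    for tool in range(y_true.shape[1]):
        keep = ~ignore[:, tool]
        az.append(roc_auc_score(y_true[keep, tool], y_score[keep, tool]))
    return float(np.mean(az)), az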

Rank Computation

Participants will be ranked in decreasing order of mAz, the mean per-tool area under the ROC curve. In case of a tie, the tied participants will be assigned the same rank.
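
A minimal sketch of this tie handling, with made-up team names and scores; the ‘1-2-2-4’ (competition) convention is shown here, and the organizers may of course resolve ties differently (e.g. with dense ranking).

def rank_teams(maz_scores):
    """Rank teams by decreasing mAz; equal scores share the same rank."""
    ordered = sorted(maz_scores.items(), key=lambda kv: -kv[1])
    ranks, prev_score, prev_rank = {}, None, 0
    for position, (team, maz) in enumerate(ordered, start=1):
        rank = prev_rank if maz == prev_score else position
        ranks[team] = rank
        prev_score, prev_rank = maz, rank
    return ranks

# Hypothetical example: teams B and C tie and share rank 2.
print(rank_teams({'A': 0.95, 'B': 0.91, 'C': 0.91, 'D': 0.88}))
# {'A': 1, 'B': 2, 'C': 2, 'D': 4}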

Missing Data Handling

Only complete submissions will be evaluated.
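
Before zipping a submission, a quick completeness check can help, e.g. verifying that a result file exists for every test video and that it contains one line of 22 values (frame ID plus 21 confidence levels) per frame. The video list and frame counts below are placeholders.

import csv, os

# Placeholder values: replace with the actual test videos and frame counts.
expected = {'test01': 5, 'test02': 5}

for video, n_frames in expected.items():
    path = video + '.csv'
    assert os.path.exists(path), 'missing result file: ' + path
    with open(path) as f:
        rows = list(csv.reader(f))
    assert len(rows) == n_frames, 'wrong number of lines in ' + path
    assert all(len(row) == 22 for row in rows), 'expected frame ID + 21 values'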

Uncertainty Handling

A confidence interval (CI) will be defined for the performance score mAz of each participant:

  1. CIs on the per-tool Az will be computed using DeLong’s method,
  2. their radii will then be combined using the root mean square, assuming independence between tools.
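
A sketch of this computation is given below. The function names are hypothetical, the O(m*n) placement computation is written for clarity rather than speed, and the root mean square combination simply follows the rule stated above (the official script may scale it differently).

import numpy as np
from scipy.stats import norm

def delong_ci_radius(y_true, y_score, level=0.95):
    """Half-width of a DeLong confidence interval for one tool's Az."""
    x = y_score[y_true == 1]   # confidence levels on positive frames
    y = y_score[y_true == 0]   # confidence levels on negative frames
    m, n = len(x), len(y)
    # Mann-Whitney kernel psi(x_i, y_j): 1 if x_i > y_j, 0.5 if equal, 0 otherwise.
    psi = (x[:, None] > y[None, :]) + 0.5 * (x[:, None] == y[None, :])
    v10 = psi.mean(axis=1)     # structural components of the positive frames
    v01 = psi.mean(axis=0)     # structural components of the negative frames
    var_az = v10.var(ddof=1) / m + v01.var(ddof=1) / n
    return norm.ppf(0.5 + level / 2) * np.sqrt(var_az)

def maz_ci_radius(per_tool_radii):
    """Combine the 21 per-tool radii with a root mean square."""
    r = np.asarray(per_tool_radii, dtype=float)
    return float(np.sqrt(np.mean(r ** 2)))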

Statistical Tests

Confidence intervals on mAz will be used to assess whether two consecutive participants in the ranking have statistically different scores.
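
One simple reading of this rule, assuming that non-overlapping confidence intervals indicate statistically different scores (the organizers may apply a more formal test):

def significantly_different(maz_a, radius_a, maz_b, radius_b):
    """True if the two mAz confidence intervals do not overlap."""
    return abs(maz_a - maz_b) > radius_a + radius_b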