Abstract

The output of many algorithms in computer vision is either non-binary maps or binary maps (e.g., salient object detection and object segmentation). Several measures have been suggested to evaluate the accuracy of these foreground maps. In this paper, we show that the most commonly used measures for evaluating both non-binary maps and binary maps do not always provide a reliable evaluation. This includes the Area-Under-the-Curve measure, the Average-Precision measure, the F-measure, and the evaluation measure of the PASCAL VOC segmentation challenge. We start by identifying three causes of inaccurate evaluation. We then propose a new measure that amends these flaws. An appealing property of our measure is that it is an intuitive generalization of the F-measure. Finally, we propose four meta-measures to compare the adequacy of evaluation measures. We show via experiments that our novel measure is preferable.

Paper

Ran Margolin, Lihi Zelnik-Manor and Ayellet Tal, “How to Evaluate Foreground Maps?”, CVPR 2014 [pdf]

"Funny behaviour" of common measures

Previous measures, such as Area-Under-the-Curve (AUC), Average-Precision (AP), the F-measure, and PASCAL VOC's segmentation measure, exhibit "funny behaviours".
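As a reminder of what two of these measures compute on binary maps, here is a minimal sketch of the classical F-measure and the PASCAL VOC overlap score (intersection over union). It is not the authors' MATLAB code and does not implement the weighted F-measure proposed in the paper. The function names and the nested-list representation of a binary map are assumptions made for illustration.

```python
def confusion_counts(pred, gt):
    """Count true positives, false positives and false negatives.

    pred and gt are binary maps of equal size, given as nested lists of 0/1
    (an illustrative representation, not the one used by the released code).
    """
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif p and not g:
                fp += 1
            elif g and not p:
                fn += 1
    return tp, fp, fn


def f_measure(pred, gt, beta=1.0):
    """Classical F-beta measure: weighted harmonic mean of precision and recall."""
    tp, fp, fn = confusion_counts(pred, gt)
    if tp == 0:
        return 0.0  # no correctly detected foreground pixel
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def voc_overlap(pred, gt):
    """PASCAL VOC segmentation score: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, gt)
    union = tp + fp + fn
    # Convention chosen here (an assumption): two empty maps overlap perfectly.
    return tp / union if union else 1.0


# Toy example: 3 pixels agree as foreground, 1 false positive, 1 false negative.
gt = [[0, 1, 1],
      [0, 1, 1]]
pred = [[1, 1, 0],
        [0, 1, 1]]
print(f_measure(pred, gt))    # precision = recall = 0.75
print(voc_overlap(pred, gt))  # 3 / 5
```

Both scores depend only on the counts of true positives, false positives and false negatives. Where an error falls in the map therefore has no effect on either score.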

Our Measure

No "Funny behaviour"

Agrees with application: Context-based image retrieval

Code

Our proposed measure - Weighted F-measure

Foreground evaluation code (MATLAB) [download] - Tested on Windows 64-bit with MATLAB 2012b and 2013a.

The code is for academic purposes only. Please cite this paper if you make use of it:

@conference{margoinEval14,
  title = {How to Evaluate Foreground Maps?},
  author = {Margolin, R. and Zelnik-Manor, L. and Tal, A.},
  year = {2014},
  booktitle = {CVPR}
}

In case of any problems, please contact us at: margolin (at) tx.technion.ac.il or hovav (at) ee.technion.ac.il