Performance evaluation in content-based image retrieval: overview and proposals

https://doi.org/10.1016/S0167-8655(00)00118-5

Abstract

Evaluation of retrieval performance is a crucial problem in content-based image retrieval (CBIR). Many different methods for measuring the performance of a system have been created and used by researchers. This article discusses the advantages and shortcomings of the performance measures currently used. Problems such as defining a common image database for performance comparisons and a means of getting relevance judgments (or ground truth) for queries are explained. The relationship between CBIR and information retrieval (IR) is made clear, since IR researchers have decades of experience with the evaluation problem. Many of their solutions can be used for CBIR, despite the differences between the fields. Several methods used in text retrieval are explained. Proposals for performance measures and means of developing a standard test suite for CBIR, similar to that used in IR at the annual Text REtrieval Conference (TREC), are presented.

Introduction

Early reports of the performance of content-based image retrieval (CBIR) systems were often restricted to printing the results of one or more example queries (e.g. Flickner et al., 1995). This is easily tailored to give a positive impression, since developers can select queries that give good results; hence it is neither an objective performance measure nor a means of comparing different systems. Researchers have subsequently developed a variety of CBIR performance measures, which are discussed in Section 4. Narasimhalu et al. (1997) give a useful grouping of multimedia retrieval systems for evaluation and provide some guidelines for the construction of evaluation measures, and MIR (1996) offers a further survey of performance measures. However, few standard methods exist that are used by a large number of researchers. Many of the measures used in CBIR (such as precision, recall and their graphical representation) have long been used in information retrieval (IR), and several other standard IR tools, e.g. relevance feedback, have recently been imported into CBIR. To avoid reinventing existing techniques, it seems logical to make a systematic review of the evaluation methods used in IR and their suitability for CBIR.
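Since precision and recall recur throughout this discussion, a brief illustration of how they are typically computed for a single ranked query result may be helpful. This is a minimal sketch under the common definitions; the function name, image identifiers and cut-off values are illustrative assumptions, not taken from the paper.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall after inspecting the top-k retrieved images.

    precision = |retrieved[:k] ∩ relevant| / k
    recall    = |retrieved[:k] ∩ relevant| / |relevant|
    """
    retrieved = set(ranked_ids[:k])
    hits = len(retrieved & set(relevant_ids))
    return hits / k, hits / len(relevant_ids)


# Hypothetical example: 4 relevant images for the query, 8 images retrieved.
ranking = ["img3", "img7", "img1", "img9", "img4", "img2", "img8", "img5"]
relevant = {"img3", "img1", "img4", "img6"}
for k in (2, 4, 8):
    p, r = precision_recall_at_k(ranking, relevant, k)
    print(f"P@{k} = {p:.2f}, R@{k} = {r:.2f}")
```

Plotting such precision/recall pairs over all cut-offs gives the graphical representation mentioned above.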

In the 1950s, IR researchers were already discussing performance evaluation, and the first concrete steps were taken with the development of the SMART system in 1961 (Salton, 1971b). Other important steps towards common performance measures were made with the Cranfield tests (Cleverdon et al., 1966). Finally, the TREC series started in 1992, combining many efforts to provide common performance tests. The TREC project (see TRE, 1999; Voorhees and Harman, 1998) provides a focus for these activities and has become the de facto worldwide standard for IR evaluation. Nevertheless, much research remains to be done on the evaluation of interactive systems and on including the user in the query process. Such innovations are regularly incorporated into TREC, e.g. the interactive track introduced in 1994. Salton (1992) gives an overview of IR system evaluation.

Section snippets

Textual information retrieval

Although performance evaluation in IR started in the 1950s, here we focus on newer results and especially on TREC and its achievements in the IR community.

Basic problems in CBIR performance evaluation

The current status of performance evaluation in CBIR is far from that in IR. Many different research groups work with their own sets of specialized images. There is neither a common image collection, nor a common way to obtain relevance judgments, nor a common evaluation scheme.

User comparison

User comparison is an interactive method: users judge the success of a query directly after it is executed. Because such comparisons are time-consuming, it is hard to collect them in large numbers.

Before-after comparison. This is the simplest test method: users are shown two or more different result sets and asked to choose the one they prefer or find most relevant to the query. The method requires a baseline system or, at least, another system for comparison.
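Tallying before-after judgments requires very little machinery. The sketch below assumes a simple per-query log of user choices (the field names and system labels are hypothetical) and reports the fraction of queries on which each system was preferred.

```python
from collections import Counter

# Hypothetical judgment log: for each query, which result set the user preferred.
# "A" could be the baseline system, "B" the modified one, "tie" neither.
judgments = [
    {"query": "sunset", "preferred": "B"},
    {"query": "red car", "preferred": "A"},
    {"query": "portrait", "preferred": "B"},
    {"query": "beach", "preferred": "tie"},
]

counts = Counter(j["preferred"] for j in judgments)
total = len(judgments)
for system in ("A", "B", "tie"):
    print(f"{system}: preferred in {counts[system]}/{total} queries "
          f"({100 * counts[system] / total:.0f}%)")
```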

Single-valued measures

Rank of the best match. Berman
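The measure named in this snippet, the rank of the best match, is commonly defined as the position at which the first relevant image appears in the ranked result list. A minimal sketch under that common definition, assuming 1-based ranks and returning None when no relevant image is retrieved:

```python
def rank_of_best_match(ranked_ids, relevant_ids):
    """1-based rank of the first relevant image, or None if none is retrieved."""
    relevant = set(relevant_ids)
    for position, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            return position
    return None


print(rank_of_best_match(["img7", "img3", "img1"], {"img3", "img1"}))  # -> 2
```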

Proposals

In the preceding sections, a large number of different evaluation techniques have been described. It is apparent that many of them are equivalent or convey the same information. Clearly, it would benefit the CBIR community if only standardized names and definitions were used for performance measures. Since scaling or the use of partial graphs impedes interpretation, these techniques should be used only for emphasis, and only in conjunction with a complete graph.
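To make the call for standardized definitions concrete, the sketch below computes a few measures commonly reported in TREC-style evaluations (precision at fixed cut-offs, R-precision and non-interpolated average precision). It illustrates that family of measures under assumed names and cut-offs; it is not a transcription of the exact set of measures proposed in this paper.

```python
def trec_style_measures(ranked_ids, relevant_ids, cutoffs=(20, 50)):
    """Precision at fixed cut-offs, R-precision and average precision for one query."""
    relevant = set(relevant_ids)
    n_rel = len(relevant)

    def precision_at(k):
        return sum(1 for image_id in ranked_ids[:k] if image_id in relevant) / k

    measures = {f"P@{k}": precision_at(k) for k in cutoffs}
    measures["R-precision"] = precision_at(n_rel)

    # Average precision: mean of precision at the rank of each relevant image.
    hits, ap = 0, 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            ap += hits / rank
    measures["AP"] = ap / n_rel
    return measures


# Hypothetical query with 3 relevant images and a ranking of 5.
ranking = ["img3", "img7", "img1", "img9", "img4"]
relevant = {"img3", "img1", "img4"}
print(trec_style_measures(ranking, relevant, cutoffs=(3, 5)))
```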

We propose to use only image

Conclusions

This article gives an overview of existing performance evaluation measures in CBIR. The need for standardized evaluation measures is clear, since several measures are slight variations of the same definition. This makes it very hard to compare the performance of systems objectively. To overcome this problem a set of standard performance measures and a standard image database is needed. We have proposed such a set of measures, similar to those used in TREC. A frequently updated shared image

Acknowledgements

This work is supported by the Swiss National Foundation for Scientific Research (grant no. 2000-052426.97).

References (41)

  • Aksoy, S., Haralick, R.M., 1999. Graph theoretic clustering for image grouping and retrieval. In: Proc. 1999 IEEE Conf....
  • ANN, 1999. Annotated groundtruth database, Department of Computer Science and Engineering, University of Washington,...
  • Belongie, S., Carson, C., Greenspan, H., Malik, J., 1998. Color- and texture-based image segmentation using EM and its...
  • Berman, A.P., Shapiro, L.G., 1999. Efficient content-based retrieval: Experimental results. In: IEEE Workshop on...
  • Borgman, C.L., 1989. All users of information retrieval systems are not created equal: an exploration into individual differences. Information Processing and Management.
  • Cleverdon, C.W., Mills, L., Keen, M., 1966. Factors determining the performance of indexing systems, Technical report,...
  • Comaniciu, D., Meer, P., Xu, K., Tyler, D., 1999. Retrieval performance improvement through low rank corrections. In:...
  • COR, 1999. Corel clipart and photos,...
  • Cox, I.J., Miller, M.L., Omohundro, S.M., Yianilos, P.N., 1996. Target testing and the PicHunter Bayesian multimedia...
  • Dy, J.G., Brodley, C.E., Kak, A., Shyu, C.-R., Broderick, L.S., 1999. The customized-queries approach to CBIR using EM....
  • Flickner, M., et al., 1995. Query by image and video content: The QBIC system. IEEE Computer.
  • Gargi, U., Kasturi, R., 1999. Image database querying using a multi-scale localized color representation. In: IEEE...
  • He, Q., 1997. An evaluation on MARS – an image indexing and retrieval system, Technical report, Graduate School of...
  • Huet, B., Hancock, E.R., 1999. Inexact graph retrieval. In: IEEE Workshop on Content-based Access of Image and Video...
  • Hwang, W.-S., Weng, J.J., Fang, M., Qian, J., 1999. A fast image retrieval algorithm with automatically extracted...
  • Iqbal, Q., Aggarwal, J.K., 1999. Applying perceptual grouping to content-based image retrieval:Building images. In:...
  • Markkula, M., Sormunen, E., 1998. Searching for photos – journalists' practices in pictorial IR, In: Eakins, J.P.,...
  • Martinez, A., 1999. Face image retrieval using HMMs. In: IEEE Workshop on Content-based Access of Image and Video...
  • MIR, 1996. MIRA: Evaluation frameworks for interactive multimedia retrieval applications. Esprit working group 20039.,...
  • MPEG Requirements Group, 1998. MPEG-7: Context and objectives (version 10 Atlantic City), Doc. ISO/IEC JTC1/SC29/WG11,...