
On Annotation Methodologies for Image Search Evaluation

Published: 27 March 2019

Abstract

Image search engines differ significantly from general web search engines in how they present search results. This difference leads to different interaction and examination behavior patterns and therefore calls for changes in evaluation methodology. However, image search evaluation still relies on methods designed for general web search. In particular, offline metrics are computed from coarse-grained topical relevance judgments under the assumption that users examine results sequentially.

In this article, we investigate annotation methods for image search evaluation via crowdsourcing, based on a lab-based user study. Using user satisfaction as the gold standard, we make several interesting findings. First, annotating relevance row by row is more efficient than item-based annotation without hurting evaluation performance. Second, beyond topical relevance, image quality plays a crucial role in evaluating image search results, and its importance varies with search intent. Third, a fine-grained annotation scale significantly outperforms the traditional four-level scale. To the best of our knowledge, this work is the first to systematically study how various factors in data annotation affect image search evaluation. Our results suggest different strategies for exploiting crowdsourcing to collect annotations under different conditions.
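To make the sequential-examination assumption behind the offline metrics mentioned above concrete, the following is a minimal sketch (not taken from the article) of how graded relevance judgments feed a metric such as nDCG, where gains are discounted by rank under a top-down scanning model; the four-level labels (0-3) and the cutoff k are illustrative assumptions only.

    import math

    def dcg_at_k(labels, k):
        """Discounted cumulative gain: each graded label contributes a gain
        discounted by its rank, mirroring the assumption that users examine
        results sequentially from top to bottom."""
        return sum((2 ** rel - 1) / math.log2(rank + 2)
                   for rank, rel in enumerate(labels[:k]))

    def ndcg_at_k(labels, k):
        """Normalize by the DCG of the ideal (best-possible) ordering."""
        ideal = dcg_at_k(sorted(labels, reverse=True), k)
        return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

    # Hypothetical four-level judgments (0-3) for the top results of one query.
    labels = [3, 2, 0, 1, 2, 0]
    print(round(ndcg_at_k(labels, k=5), 3))

As the abstract notes, this sequential model is questionable for image search, where results are laid out in a grid and examined row by row rather than item by item.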



    • Published in

ACM Transactions on Information Systems, Volume 37, Issue 3
July 2019
335 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/3320115

      Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 27 March 2019
      • Accepted: 1 January 2019
      • Revised: 1 November 2018
      • Received: 1 August 2018
Published in TOIS Volume 37, Issue 3


      Qualifiers

      • research-article
      • Research
      • Refereed
