Abstract
Image search engines differ significantly from general web search engines in how they present search results. This difference leads to different interaction and examination behavior patterns, and therefore calls for changes in evaluation methodology. Nevertheless, image search evaluation still relies on methods designed for general web search: in particular, offline metrics are computed from coarse-grained topical relevance judgments under the assumption that users examine results sequentially.
In this article, we investigate crowdsourced annotation methods for image search evaluation through a lab-based user study. Using user satisfaction as the gold standard, we make several interesting findings. First, annotating relevance row by row, rather than item by item, is more efficient without hurting evaluation performance. Second, beyond topical relevance, image quality plays a crucial role in evaluating image search results, and its importance varies with search intent. Third, fine-grained annotation significantly outperforms the traditional four-level relevance scale. To the best of our knowledge, this work is the first to systematically study how various data annotation factors impact image search evaluation. Our results suggest different strategies for exploiting crowdsourcing to obtain annotated data under different conditions.
- On Annotation Methodologies for Image Search Evaluation