Abstract
Automatic image annotation is one of the fundamental problems in computer vision and machine learning. Given an image, the goal is to predict a set of textual labels that describe its semantics. Over the last decade, a large number of image annotation techniques have been proposed, and they have been shown to achieve encouraging results on various annotation datasets. However, their scope has mostly remained restricted to quantitative results on the test data, ignoring key aspects of dataset properties and evaluation metrics that inherently affect performance to a considerable extent. In this paper, we first evaluate ten state-of-the-art approaches for image annotation (both deep-learning based and non-deep-learning based) using the same baseline CNN features. We then propose new quantitative measures to examine various issues in the image annotation domain, such as dataset-specific biases, per-label versus per-image evaluation criteria, and the impact of changing the number and type of predicted labels. We believe that the conclusions derived in this paper through thorough empirical analyses will be helpful in making systematic advancements in this domain.
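To make the distinction between per-label and per-image evaluation concrete, the following minimal sketch computes mean precision and recall along both axes of a toy annotation matrix. It is an illustration of the general idea only, not the measures proposed in the paper. The toy data, the function name `mean_precision_recall`, and the convention of scoring an undefined ratio as 0 are all assumptions made for this example.

```python
import numpy as np

def mean_precision_recall(y_true, y_pred, axis):
    """Mean precision and recall over labels (axis=0) or over images (axis=1).

    y_true and y_pred are boolean arrays of shape (n_images, n_labels).
    Assumption for this sketch: an undefined ratio (nothing predicted, or
    nothing in the ground truth) is scored as 0.
    """
    true_pos = (y_true & y_pred).sum(axis=axis).astype(float)
    n_pred = y_pred.sum(axis=axis)
    n_true = y_true.sum(axis=axis)
    precision = np.divide(true_pos, n_pred,
                          out=np.zeros_like(true_pos), where=n_pred > 0)
    recall = np.divide(true_pos, n_true,
                       out=np.zeros_like(true_pos), where=n_true > 0)
    return precision.mean(), recall.mean()

# Hypothetical toy data: 4 images (rows), 3 labels (columns).
# Label 0 is frequent, label 2 is rare.
Y_TRUE = np.array([[1, 1, 0],
                   [1, 0, 0],
                   [1, 0, 1],
                   [1, 1, 0]], dtype=bool)

# A predictor that always outputs the two most frequent labels.
Y_PRED = np.array([[1, 1, 0],
                   [1, 1, 0],
                   [1, 1, 0],
                   [1, 1, 0]], dtype=bool)

label_p, label_r = mean_precision_recall(Y_TRUE, Y_PRED, axis=0)
image_p, image_r = mean_precision_recall(Y_TRUE, Y_PRED, axis=1)

print(f"per-label: precision={label_p:.3f}, recall={label_r:.3f}")
print(f"per-image: precision={image_p:.3f}, recall={image_r:.3f}")
```

On this toy data the per-image averages (precision 0.75, recall 0.875) are noticeably higher than the per-label averages (precision 0.5, recall about 0.667), because the rare label that is never predicted contributes a zero to the per-label mean but only slightly lowers the score of a single image.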
Acknowledgments
Yashaswi Verma would like to thank the Department of Science and Technology (India) for the INSPIRE Faculty Award 2017.
Cite this article
Dutta, A., Verma, Y. & Jawahar, C.V. Automatic image annotation: the quirks and what works. Multimed Tools Appl 77, 31991–32011 (2018). https://doi.org/10.1007/s11042-018-6247-3
DOI: https://doi.org/10.1007/s11042-018-6247-3