A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics

Published in: International Journal of Computer Vision

Abstract

This paper investigates the problem of modeling Internet images and associated text or tags for tasks such as image-to-image search, tag-to-image search, and image-to-tag search (image annotation). We start with canonical correlation analysis (CCA), a popular and successful approach for mapping visual and textual features to the same latent space, and incorporate a third view capturing high-level image semantics, represented either by a single category or multiple non-mutually-exclusive concepts. We present two ways to train the three-view embedding: supervised, with the third view coming from ground-truth labels or search keywords; and unsupervised, with semantic themes automatically obtained by clustering the tags. To ensure high accuracy for retrieval tasks while keeping the learning process scalable, we combine multiple strong visual features and use explicit nonlinear kernel mappings to efficiently approximate kernel CCA. To perform retrieval, we use a specially designed similarity function in the embedded space, which substantially outperforms the Euclidean distance. The resulting system produces compelling qualitative results and outperforms a number of two-view baselines on retrieval tasks on three large-scale Internet image datasets.
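To make the two-view starting point concrete, below is a minimal sketch (not the paper's exact pipeline): a linear CCA fit between visual and tag features using scikit-learn, followed by tag-to-image search via cosine similarity in the shared latent space. All feature matrices, dimensions, and the plain cosine similarity are illustrative placeholders; the paper combines multiple strong visual features, approximates kernel CCA with explicit nonlinear mappings, and uses a specially designed similarity function instead.

```python
# Minimal two-view CCA retrieval sketch with synthetic data (illustrative only).
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
n_images, d_visual, d_text, d_latent = 500, 128, 64, 32

X_visual = rng.standard_normal((n_images, d_visual))  # e.g., pooled visual descriptors
X_text = rng.standard_normal((n_images, d_text))      # e.g., bag-of-tags vectors

# Map both modalities into a shared latent space with linear CCA.
cca = CCA(n_components=d_latent, max_iter=1000)
cca.fit(X_visual, X_text)
Z_visual, _ = cca.transform(X_visual, X_text)  # embedded database images

def tag_to_image_search(tag_query, top_k=5):
    """Rank database images against a tag query in the embedded space."""
    # A dummy visual input is passed only to reuse transform(); its output is discarded.
    _, z_query = cca.transform(np.zeros((1, d_visual)), tag_query.reshape(1, -1))
    # Plain cosine similarity stands in here; the paper reports that a specially
    # designed similarity in the embedded space outperforms Euclidean distance.
    sims = normalize(Z_visual) @ normalize(z_query).T
    return np.argsort(-sims.ravel())[:top_k]

print(tag_to_image_search(rng.standard_normal(d_text)))
```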

Notes

  1. It can be shown that CCA with labels as one of the views is equivalent to Linear Discriminant Analysis (LDA) (Bartlett 1938).
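     One compact way to state this equivalence, assuming centered features X (n × d) and a one-hot class-indicator matrix Y (n × C) as the label view:

     ```latex
     \[
       \max_{w_x,\, w_y}\;
       \frac{w_x^{\top} X^{\top} Y\, w_y}
            {\sqrt{w_x^{\top} X^{\top} X\, w_x}\;\sqrt{w_y^{\top} Y^{\top} Y\, w_y}}
     \]
     ```

     The maximizing directions w_x coincide, up to scaling, with the LDA discriminant directions, i.e., the leading eigenvectors of S_W^{-1} S_B formed from the within-class and between-class scatter of X.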

References

  • Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6, 1817–1853.

  • Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.

  • Barnard, K., & Forsyth, D. (2001). Learning the semantics of words and pictures. In ICCV (Vol. 2, pp. 408–415).

  • Bartlett, M. S. (1938). Further aspects of the theory of multiple regression. Mathematical Proceedings of the Cambridge Philosophical Society, 34(1), 33–40.

  • Berg, T., & Forsyth, D. (2006). Animals on the web. In CVPR.

  • Berg, T. L., & Berg, A. C. (2009). Finding iconic images. In Second workshop on Internet vision at CVPR.

  • Blaschko, M., & Lampert, C. (2008). Correlational spectral clustering. In CVPR.

  • Blei, D., & Jordan, M. (2003). Modeling annotated data. In ACM SIGIR (pp. 127–134).

  • Blei, D., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  • Carneiro, G., Chan, A., Moreno, P., & Vasconcelos, N. (2007). Supervised learning of semantic classes for image annotation and retrieval. In PAMI.

  • Chapelle, O., Weston, J., & Schölkopf, B. (2003). Cluster kernels for semi-supervised learning. In NIPS.

  • Chen, N., Zhu, J., Sun, F., & Xing, E. P. (2012). Large-margin predictive latent subspace learning for multi-view data analysis. In PAMI.

  • Chen, X., Yuan, X.-T., Chen, Q., Yan, S., & Chua, T.-S. (2011). Multi-label visual classification with label exclusive context. In ICCV.

  • Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., & Zheng, Y.-T. (2009). NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of ACM conference on image and video retrieval (CIVR’09), Santorini, Greece.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.

  • Datta, R., Joshi, D., Li, J., & Wang, J. Z. (2008). Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 40(2), 1–60.

  • Deng, J., Dong, W., Socher, R., Li, L., & Li, K. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.

  • Duygulu, P., Barnard, K., de Freitas, N., & Forsyth, D. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In ECCV.

  • Fan, J., Shen, Y., Zhou, N., & Gao, Y. (2010). Harvesting large-scale weakly-tagged image databases from the web. In CVPR (pp. 802–809).

  • Farhadi, A., Hejrati, M., Sadeghi, A., Young, P., Rashtchian, C., Hockenmaier, J., & Forsyth, D. A. (2010). Every picture tells a story: Generating sentences for images. In ECCV.

  • Foster, D. P., Johnson, R., Kakade, S. M., & Zhang, T. (2010). Multi-view dimensionality reduction via canonical correlation analysis. Tech Report. Rutgers University.

  • Frankel, C., Swain, M. J., & Athitsos, V. (1997). Webseer: An image search engine for the World Wide Web. In CVPR.

  • Gehler, P., & Nowozin, S. (2009). On feature combination for multiclass object classification. In ICCV.

  • Gong, Y., & Lazebnik, S. (2011). Iterative quantization: A procrustean approach to learning binary codes. In CVPR.

  • Globerson, A., & Roweis, S. (2005). Metric Learning by collapsing classes. In NIPS.

  • Goldberger, J., Roweis, S., & Hinton, G. (2004). Neighbourhood components analysis. In NIPS.

  • Grangier, D., & Bengio, S. (2008). A discriminative kernel-based model to rank images from text queries. In PAMI.

  • Grubinger, M., Clough, P. D., Müller, H., & Deselaers, T. (2006). The IAPR TC-12 benchmark—A new evaluation resource for visual information systems. In Proceedings of the international workshop OntoImage’2006 language resources for content-based image retrieval (pp. 13–23).

  • Gordo, A., Rodriguez-Serrano, J., Perronnin, F., & Valveny, E. (2012). Leveraging category-level labels for instance-level image retrieval. In CVPR.

  • Guillaumin, M., Mensink, T., Verbeek, J., & Schmid, C. (2009). TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In ICCV.

  • Guillaumin, M., Verbeek, J., & Schmid, C. (2010). Multimodal semi-supervised learning for image classification. In CVPR.

  • Hardoon, D., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), 2639–2664.

  • Hofmann, T. (1999). Probabilistic latent semantic indexing. In SIGIR.

  • Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.

  • Hsu, D., Kakade, S., Langford, J., & Zhang, T. (2009). Multi-label prediction via compressed sensing. In NIPS.

  • Hwang, S. J., & Grauman, K. (2010). Accounting for the relative importance of objects in image retrieval. In BMVC.

  • Hwang, S. J., & Grauman, K. (2011). Learning the relative importance of objects from tagged images for retrieval and cross-modal search. In IJCV.

  • Krapac, J., Allan, M., Verbeek, J., & Jurie, F. (2010). Improving web-image search results using query-relative classifiers. In CVPR.

  • Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Tech Report. University of Toronto.

  • Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A. C., & Berg, T. L. (2011). Babytalk: Understanding and generating simple image descriptions. In CVPR.

  • Larsen, R. M. (1998). Lanczos bidiagonalization with partial reorthogonalization. Technical report, Department of Computer Science, Aarhus University.

  • Lavrenko, V., Manmatha, R., & Jeon, J. (2003). A model for learning the semantics of pictures. In NIPS.

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.

  • Li, J., & Wang, J. (2008). Real-time computerized annotation of pictures. In PAMI.

  • Liu, C., Yuen, J., & Torralba, A. (2010). SIFT flow: Dense correspondence across different scenes. In PAMI.

  • Liu, Y., Xu, D., Tsang, I., & Luo, J. (2009). Using large-scale web data to facilitate textual query based retrieval of consumer photos. In ACM MM.

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. In IJCV.

  • Lucchi, A., & Weston, J. (2012). Joint image and word sense discrimination for image retrieval. In ECCV.

  • Maji, S., & Berg, A. (2009). Max-margin additive classifiers for detection. In CVPR.

  • Makadia, A., Pavlovic, V., & Kumar, S. (2008). A new baseline for image annotation. In ECCV.

  • Mensink, T., Verbeek, J., Csurka, G., & Perronnin, F. (2012). Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV.

  • Monay, F., & Gatica-Perez, D. (2004). PLSA-based image auto-annotation: Constraining the latent space. In ACM Multimedia.

  • Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In NIPS.

  • Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. In IJCV.

  • Ordonez, V., Kulkarni, G., & Berg, T. L. (2011). Im2text: Describing images using 1 million captioned photographs. In NIPS.

  • Perronnin, F., Sanchez, J., & Liu, Y. (2010). Large-scale image categorization with explicit data embedding. In CVPR.

  • Quadrianto, N., & Lampert, C. H. (2011). Learning multi-view neighborhood preserving projections. In ICML.

  • Quattoni, A., Collins, M., & Darrell, T. (2007). Learning visual representations using images with captions. In CVPR.

  • Raguram, R., & Lazebnik, S. (2008). Computing iconic summaries for general visual concepts. In First workshop on Internet vision at CVPR.

  • Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. In NIPS.

  • Rai, P., & Daumé, H. (2009). Multi-label prediction via sparse infinite CCA. In NIPS.

  • Rasiwasia, N., & Vasconcelos, N. (2007). Bridging the gap: Query by semantic example. IEEE Transactions on Multimedia, 9(5), 923–938.

  • Rasiwasia, N., Pereira, J. C., Coviello, E., Doyle, G., Lanckriet, G., Levy, R., et al. (2010). A new approach to cross-modal multimedia retrieval. In ACM MM.

  • Schölkopf, B., Smola, A., & Müller, K.-R. (1997). Kernel principal component analysis. In ICANN.

  • Schroff, F., Criminisi, A., & Zisserman, A. (2007). Harvesting image databases from the Web. In ICCV.

  • Sharma, A., Kumar, A., Daumé, H., & Jacobs, D. (2012). Generalized multiview analysis: A discriminative latent space. In CVPR.

  • Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. In PAMI.

  • Smeulders, A. W., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. The IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1349–1380.

  • Tighe, J., & Lazebnik, S. (2010). Superparsing: Scalable nonparametric image parsing with superpixels. In ECCV.

  • Udupa, R., & Khapra, M. (2010). Improving the multilingual user experience of Wikipedia using cross-language name search. In NAACL.

  • van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2010). Evaluating color descriptors for object and scene recognition. In PAMI.

  • Vedaldi, A., & Zisserman, A. (2010). Efficient additive kernels via explicit feature maps. In CVPR.

  • Verma, Y., & Jawahar, C. V. (2012). Image annotation using metric learning in semantic neighbourhoods. In ECCV.

  • Vinokourov, A., Shawe-Taylor, J., & Cristianini, N. (2002). Inferring a semantic representation of text via cross-language correlation analysis. In NIPS.

  • von Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. In ACM SIGCHI.

  • Wang, C., Blei, D., & Li, F. (2009a). Simultaneous image classification and annotation. In CVPR (pp. 1903–1910).

  • Wang, G., Hoiem, D., & Forsyth, D. (2009b). Building text features for object image classification. In CVPR.

  • Wang, G., Hoiem, D., & Forsyth, D. (2009c). Learning image similarity from Flickr groups using stochastic intersection kernel machines. In ICCV.

  • Wang, X.-J., Zhang, L., Li, X., & Ma, W.-Y. (2008). Annotating images by mining image search results. The IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1919–1932.

  • Weston, J., Bengio, S., & Usunier, N. (2011). Wsabie: Scaling up to large vocabulary image annotation. In IJCAI.

  • Wei, X., & Croft, W. B. (2006). LDA-based document models for ad-hoc retrieval. In SIGIR.

  • Weinberger, K., Blitzer, J., & Saul, L. (2005). Distance metric learning for large margin nearest neighbor classification. In NIPS.

  • Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In CVPR.

  • Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In SIGIR.

  • Yakhnenko, O., & Honavar, V. (2009). Multiple label prediction for image annotation with multiple kernel correlation models. In Workshop on visual context learning (in conjunction with CVPR).

  • Zhang, Y., & Schneider, J. (2011). Multi-label output codes using canonical correlation analysis. In AISTATS.

  • Zhu, S., Ji, X., Xu, W., & Gong, Y. (2005). Multi-labelled classification using maximum entropy method. In ACM SIGIR.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive comments; Jason Weston for advice on implementing the Wsabie method; Albert Gordo and Florent Perronnin for useful discussions; and Joseph Tighe, Hongtao Huang, Juan Caicedo, and Mariyam Khalid for helping with manual evaluation of the auto-tagging experiments. Gong and Lazebnik were supported by NSF grant IIS 1228082, the DARPA Computer Science Study Group (D12AP00305), and a Microsoft Research Faculty Fellowship.

Author information

Corresponding author

Correspondence to Yunchao Gong.

About this article

Cite this article

Gong, Y., Ke, Q., Isard, M. et al. A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics. Int J Comput Vis 106, 210–233 (2014). https://doi.org/10.1007/s11263-013-0658-4
