A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics

Published in: International Journal of Computer Vision

Abstract

This paper investigates the problem of modeling Internet images and associated text or tags for tasks such as image-to-image search, tag-to-image search, and image-to-tag search (image annotation). We start with canonical correlation analysis (CCA), a popular and successful approach for mapping visual and textual features to the same latent space, and incorporate a third view capturing high-level image semantics, represented either by a single category or multiple non-mutually-exclusive concepts. We present two ways to train the three-view embedding: supervised, with the third view coming from ground-truth labels or search keywords; and unsupervised, with semantic themes automatically obtained by clustering the tags. To ensure high accuracy for retrieval tasks while keeping the learning process scalable, we combine multiple strong visual features and use explicit nonlinear kernel mappings to efficiently approximate kernel CCA. To perform retrieval, we use a specially designed similarity function in the embedded space, which substantially outperforms the Euclidean distance. The resulting system produces compelling qualitative results and outperforms a number of two-view baselines on retrieval tasks on three large-scale Internet image datasets.
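To make the two-view starting point concrete, below is a minimal sketch (not the paper's exact pipeline): a linear CCA fit between visual and tag features using scikit-learn, followed by tag-to-image search via cosine similarity in the shared latent space. All feature matrices, dimensions, and the plain cosine similarity are illustrative placeholders; the paper combines multiple strong visual features, approximates kernel CCA with explicit nonlinear mappings, and uses a specially designed similarity function instead.

```python
# Minimal two-view CCA retrieval sketch with synthetic data (illustrative only).
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
n_images, d_visual, d_text, d_latent = 500, 128, 64, 32

X_visual = rng.standard_normal((n_images, d_visual))  # e.g., pooled visual descriptors
X_text = rng.standard_normal((n_images, d_text))      # e.g., bag-of-tags vectors

# Map both modalities into a shared latent space with linear CCA.
cca = CCA(n_components=d_latent, max_iter=1000)
cca.fit(X_visual, X_text)
Z_visual, _ = cca.transform(X_visual, X_text)  # embedded database images

def tag_to_image_search(tag_query, top_k=5):
    """Rank database images against a tag query in the embedded space."""
    # A dummy visual input is passed only to reuse transform(); its output is discarded.
    _, z_query = cca.transform(np.zeros((1, d_visual)), tag_query.reshape(1, -1))
    # Plain cosine similarity stands in here; the paper reports that a specially
    # designed similarity in the embedded space outperforms Euclidean distance.
    sims = normalize(Z_visual) @ normalize(z_query).T
    return np.argsort(-sims.ravel())[:top_k]

print(tag_to_image_search(rng.standard_normal(d_text)))
```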

Notes

  1. It can be shown that CCA with labels as one of the views is equivalent to Linear Discriminant Analysis (LDA) (Bartlett 1938).
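     One compact way to state this equivalence, assuming centered features X (n × d) and a one-hot class-indicator matrix Y (n × C) as the label view:

     ```latex
     \[
       \max_{w_x,\, w_y}\;
       \frac{w_x^{\top} X^{\top} Y\, w_y}
            {\sqrt{w_x^{\top} X^{\top} X\, w_x}\;\sqrt{w_y^{\top} Y^{\top} Y\, w_y}}
     \]
     ```

     The maximizing directions w_x coincide, up to scaling, with the LDA discriminant directions, i.e., the leading eigenvectors of S_W^{-1} S_B formed from the within-class and between-class scatter of X.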

References

  • Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6, 1817–1853.

  • Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.

  • Barnard, K., & Forsyth, D. (2001). Learning the semantics of words and pictures. In ICCV (Vol. 2, pp. 408–415).

  • Bartlett, M. S. (1938). Further aspects of the theory of multiple regression. Mathematical Proceedings of the Cambridge Philosophical Society, 34(1), 33–40.

  • Berg, T., & Forsyth, D. (2006). Animals on the web. In CVPR.

  • Berg, T. L., & Berg, A. C. (2009). Finding iconic images. In Second workshop on Internet vision at CVPR.

  • Blaschko, M., & Lampert, C. (2008). Correlational spectral clustering. In CVPR.

  • Blei, D., & Jordan, M. (2003). Modeling annotated data. In ACM SIGIR (pp. 127–134).

  • Blei, D., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  • Carneiro, G., Chan, A., Moreno, P., & Vasconcelos, N. (2007). Supervised learning of semantic classes for image annotation and retrieval. In PAMI.

  • Chapelle, O., Weston, J., & Schölkopf, B. (2003). Cluster kernels for semi-supervised learning. In NIPS.

  • Chen, N., Zhu, J., Sun, F., & Xing, E. P. (2012). Large-margin predictive latent subspace learning for multi-view data analysis. In PAMI.

  • Chen, X., Yuan, X.-T., Chen, Q., Yan, S., & Chua, T.-S. (2011). Multi-label visual classification with label exclusive context. In ICCV.

  • Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., & Zheng, Y.-T. (2009). NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of ACM conference on image and video retrieval (CIVR’09), Santorini, Greece.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.

  • Datta, R., Joshi, D., Li, J., & Wang, J. Z. (2008). Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 40(2), 1–60.

  • Deng, J., Dong, W., Socher, R., Li, L., & Li, K. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.

  • Duygulu, P., Barnard, K., de Freitas, N., & Forsyth, D. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In ECCV.

  • Fan, J., Shen, Y., Zhou, N., & Gao, Y. (2010). Harvesting large-scale weakly-tagged image databases from the web. In CVPR (pp. 802–809).

  • Farhadi, A., Hejrati, M., Sadeghi, A., Young, P., Rashtchian, C., Hockenmaier, J., & Forsyth, D. A. (2010). Every picture tells a story: Generating sentences for images. In ECCV.

  • Foster, D. P., Johnson, R., Kakade, S. M., & Zhang, T. (2010). Multi-view dimensionality reduction via canonical correlation analysis. Tech Report. Rutgers University.

  • Frankel, C., Swain, M. J., & Athitsos, V. (1997). Webseer: An image search engine for the World Wide Web. In CVPR.

  • Gehler, P., & Nowozin, S. (2009). On feature combination for multiclass object classification. In ICCV.

  • Gong, Y., & Lazebnik, S. (2011). Iterative quantization: A procrustean approach to learning binary codes. In CVPR.

  • Globerson, A., & Roweis, S. (2005). Metric Learning by collapsing classes. In NIPS.

  • Goldberger, J., Roweis, S., & Hinton, G. (2004). Neighbourhood components analysis. In NIPS.

  • Grangier, D., & Bengio, S. (2008). A discriminative kernel-based model to rank images from text queries. In PAMI.

  • Grubinger, M., Clough, P. D., Müller, H., & Deselaers, T. (2006). The IAPR TC-12 benchmark—A new evaluation resource for visual information systems. In Proceedings of the international workshop OntoImage’2006 language resources for content-based image retrieval (pp. 13–23).

  • Gordo, A., Rodriguez-Serrano, J., Perronnin, F., & Valveny, E. (2012). Leveraging category-level labels for instance-level image retrieval. In CVPR.

  • Guillaumin, M., Mensink, T., Verbeek, J., & Schmid, C. (2009). TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In ICCV.

  • Guillaumin, M., Verbeek, J., & Schmid, C. (2010). Multimodal semi-supervised learning for image classification. In CVPR.

  • Hardoon, D., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), 2639–2664.

  • Hofmann, T. (1999). Probabilistic latent semantic indexing. In SIGIR.

  • Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.

  • Hsu, D., Kakade, S., Langford, J., & Zhang, T. (2009). Multi-label prediction via compressed sensing. In NIPS.

  • Hwang, S. J., & Grauman, K. (2010). Accounting for the relative importance of objects in image retrieval. In BMVC.

  • Hwang, S. J., & Grauman, K. (2011). Learning the relative importance of objects from tagged images for retrieval and cross-modal search. In IJCV.

  • Krapac, J., Allan, M., Verbeek, J., & Jurie, F. (2010). Improving web-image search results using query-relative classifiers. In CVPR.

  • Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Tech Report. University of Toronto.

  • Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A. C., & Berg, T. L. (2011). Babytalk: Understanding and generating simple image descriptions. In CVPR.

  • Larsen, R. M. (1998). Lanczos bidiagonalization with partial reorthogonalization. Technical report, Department of Computer Science, Aarhus University.

  • Lavrenko, V., Manmatha, R., & Jeon, J. (2003). A model for learning the semantics of pictures. In NIPS.

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.

  • Li, J., & Wang, J. (2008). Real-time computerized annotation of pictures. In PAMI.

  • Liu, C., Yuen, J., & Torralba, A. (2010). SIFT flow: Dense correspondence across different scenes. In PAMI.

  • Liu, Y., Xu, D., Tsang, I., & Luo, J. (2009). Using large-scale web data to facilitate textual query based retrieval of consumer photos. In ACM MM.

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. In IJCV.

  • Lucchi, A., & Weston, J. (2012). Joint image and word sense discrimination for image retrieval. In ECCV.

  • Maji, S., & Berg, A. (2009). Max-margin additive classifiers for detection. In CVPR.

  • Makadia, A., Pavlovic, V., & Kumar, S. (2008). A new baseline for image annotation. In ECCV.

  • Mensink, T., Verbeek, J., Csurka, G., & Perronnin, F. (2012). Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV.

  • Monay, F., & Gatica-Perez, D. (2004). PLSA-based image auto-annotation: Constraining the latent space. In ACM Multimedia.

  • Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In NIPS.

  • Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. In IJCV.

  • Ordonez, V., Kulkarni, G., & Berg, T. L. (2011). Im2text: Describing images using 1 million captioned photographs. In NIPS.

  • Perronnin, F., Sanchez, J., & Liu, Y. (2010). Large-scale image categorization with explicit data embedding. In CVPR.

  • Quadrianto, N., & Lampert, C. H. (2011). Learning multi-view neighborhood preserving projections. In ICML.

  • Quattoni, A., Collins, M., & Darrell, T. (2007). Learning visual representations using images with captions. In CVPR.

  • Raguram, R., & Lazebnik, S. (2008). Computing iconic summaries for general visual concepts. In First workshop on Internet vision at CVPR.

  • Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. In NIPS.

  • Rai, P., & Daumé, H. (2009). Multi-label prediction via sparse infinite CCA. In NIPS.

  • Rasiwasia, N., & Vasconcelos, N. (2007). Bridging the gap: Query by semantic example. IEEE Transactions on Multimedia, 9(5), 923–938.

  • Rasiwasia, N., Pereira, J. C., Coviello, E., Doyle, G., Lanckriet, G., Levy, R., et al. (2010). A new approach to cross-modal multimedia retrieval. In ACM MM.

  • Schölkopf, B., Smola, A., & Müller, K.-R. (1997). Kernel principal component analysis. In ICANN.

  • Schroff, F., Criminisi, A., & Zisserman, A. (2007). Harvesting image databases from the Web. In ICCV.

  • Sharma, A., Kumar, A., Daumé, H., & Jacobs, D. (2012). Generalized multiview analysis: A discriminative latent space. In CVPR.

  • Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. In PAMI.

  • Smeulders, A. W., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. The IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1349–1380.

  • Tighe, J., & Lazebnik, S. (2010). Superparsing: Scalable nonparametric image parsing with superpixels. In ECCV.

  • Udupa, R., & Khapra, M. (2010). Improving the multilingual user experience of Wikipedia using cross-language name search. In NAACL.

  • van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2010). Evaluating color descriptors for object and scene recognition. In PAMI.

  • Vedaldi, A., & Zisserman, A. (2010). Efficient additive kernels via explicit feature maps. In CVPR.

  • Verma, Y., & Jawahar, C. V. (2012). Image annotation using metric learning in semantic neighbourhoods. In ECCV.

  • Vinokourov, A., Shawe-Taylor, J., & Cristianini, N. (2002). Inferring a semantic representation of text via cross-language correlation analysis. In NIPS.

  • von Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. In ACM SIGCHI.

  • Wang, C., Blei, D., & Li, F. (2009a). Simultaneous image classification and annotation. In CVPR (pp. 1903–1910).

  • Wang, G., Hoiem, D., & Forsyth, D. (2009b). Building text features for object image classification. In CVPR.

  • Wang, G., Hoiem, D., & Forsyth, D. (2009c). Learning image similarity from Flickr groups using stochastic intersection kernel machines. In ICCV.

  • Wang, X.-J., Zhang, L., Li, X., & Ma, W.-Y. (2008). Annotating images by mining image search results. The IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1919–1932.

  • Weston, J., Bengio, S., & Usunier, N. (2011). Wsabie: Scaling up to large vocabulary image annotation. In IJCAI.

  • Wei, X., & Croft, W. B. (2006). LDA-based document models for ad-hoc retrieval. In SIGIR.

  • Weinberger, K., Blitzer, J., & Saul, L. (2005). Distance metric learning for large margin nearest neighbor classification. In NIPS.

  • Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In CVPR.

  • Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In SIGIR.

  • Yakhnenko, O., & Honavar, V. (2009). Multiple label prediction for image annotation with multiple kernel correlation models. In Workshop on visual context learning (in conjunction with CVPR).

  • Zhang, Y., & Schneider, J. (2011). Multi-label output codes using canonical correlation analysis. In AISTATS.

  • Zhu, S., Ji, X., Xu, W., & Gong, Y. (2005). Multi-labelled classification using maximum entropy method. In ACM SIGIR.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive comments; Jason Weston for advice on implementing the Wsabie method; Albert Gordo and Florent Perronnin for useful discussions; and Joseph Tighe, Hongtao Huang, Juan Caicedo, and Mariyam Khalid for helping with manual evaluation of the auto-tagging experiments. Gong and Lazebnik were supported by NSF grant IIS 1228082, the DARPA Computer Science Study Group (D12AP00305), and a Microsoft Research Faculty Fellowship.

Author information

Corresponding author

Correspondence to Yunchao Gong.

About this article

Cite this article

Gong, Y., Ke, Q., Isard, M. et al. A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics. Int J Comput Vis 106, 210–233 (2014). https://doi.org/10.1007/s11263-013-0658-4
