End-to-End Learning of Deep Visual Representations for Image Retrieval

International Journal of Computer Vision

Abstract

While deep learning has become a key ingredient in the top performing methods for many computer vision tasks, it has failed so far to bring similar improvements to instance-level image retrieval. In this article, we argue that reasons for the underwhelming results of deep methods on image retrieval are threefold: (1) noisy training data, (2) inappropriate deep architecture, and (3) suboptimal training procedure. We address all three issues. First, we leverage a large-scale but noisy landmark dataset and develop an automatic cleaning method that produces a suitable training set for deep retrieval. Second, we build on the recent R-MAC descriptor, show that it can be interpreted as a deep and differentiable architecture, and present improvements to enhance it. Last, we train this network with a siamese architecture that combines three streams with a triplet loss. At the end of the training process, the proposed architecture produces a global image representation in a single forward pass that is well suited for image retrieval. Extensive experiments show that our approach significantly outperforms previous retrieval approaches, including state-of-the-art methods based on costly local descriptor indexing and spatial verification. On Oxford 5k, Paris 6k and Holidays, we respectively report 94.7, 96.6, and 94.8 mean average precision. Our representations can also be heavily compressed using product quantization with little loss in accuracy.
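To make the triplet-based training mentioned above more concrete, here is a minimal sketch of a triplet ranking loss in plain Python. It assumes L2-normalised descriptors and a squared Euclidean distance, which is a common formulation; the function names, the margin value of 0.1, and the toy vectors are illustrative assumptions and are not taken from the article.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(query, positive, negative, margin=0.1):
    """Hinge-style triplet ranking loss on L2-normalised descriptors.

    The loss is zero once the positive is closer to the query than the
    negative by at least `margin` (an assumed, illustrative value).
    """
    q, p, n = (l2_normalize(v) for v in (query, positive, negative))
    return 0.5 * max(0.0, margin + squared_distance(q, p) - squared_distance(q, n))


# Toy descriptors (hypothetical): the positive matches the query, the negative does not.
well_ranked = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# Swapping positive and negative violates the ranking and yields a positive loss.
badly_ranked = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In a three-stream siamese setup, the query, positive and negative descriptors would each come from one stream of the same network with shared weights, so a loss of this kind only updates the network on triplets that are still ranked incorrectly or within the margin.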

Notes

  1. Note that this differs from the original setup of Tolias et al. (2016), which resizes images to 1024 pixels, and leads to different results in Table 1. See Gordo et al. (2016) for a discussion of this issue.

References

  • Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., et al. (2015). VQA: Visual question answering. In ICCV.

  • Arandjelovic, R., & Zisserman, A. (2012). Three things everyone should know to improve object retrieval. In CVPR.

  • Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., & Sivic, J. (2016). NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR.

  • Azizpour, H., Razavian, A., Sullivan, J., Maki, A., & Carlsson, S. (2015). Factors of transferability for a generic convnet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, (99):1–1.

  • Babenko, A., & Lempitsky, V. S. (2015). Aggregating deep convolutional features for image retrieval. In ICCV.

  • Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. S. (2014). Neural codes for image retrieval. In ECCV.

  • Chopra, S., Hadsell, R., & Lecun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In Proceedings of computer vision and pattern recognition conference.

  • Chum, O., Philbin, J., Sivic, J., Isard, M., & Zisserman, A. (2007). Total recall: Automatic query expansion with a generative feature model for object retrieval. In ICCV.

  • Chum, O., Mikulik, A., Perdoch, M., & Matas, J. (2011). Total recall II: Query expansion revisited. In CVPR.

  • Danfeng, Q., Gammeter, S., Bossard, L., Quack, T., & Van Gool, L. (2011). Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors. In CVPR.

  • Deng, C., Ji, R., Liu, W., Tao, D., & Gao, X. (2013). Visual reranking through weakly supervised multi-graph learning. In ICCV.

  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.

  • Douze, M., Jégou, H., & Perronnin, F. (2016). Polysemous codes. In ECCV.

  • Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Ranzato, M. A., & Mikolov, T. (2013). Devise: A deep visual-semantic embedding model. In NIPS.

  • Girshick, R. (2015). Fast R-CNN. In CVPR.

  • Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.

  • Gong, Y., Wang, L., Guo, R., & Lazebnik, S. (2014). Multi-scale orderless pooling of deep convolutional activation features. In ECCV.

  • Gordo, A., Rodríguez-Serrano, J. A., Perronnin, F., & Valveny, E. (2012). Leveraging category-level labels for instance-level image retrieval. In CVPR.

  • Gordo, A., Almazán, J., Revaud, J., & Larlus, D. (2016). Deep image retrieval: Learning global representations for image search. In ECCV.

  • Hadsell, R., Chopra, S., & Lecun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In CVPR.

  • Hays, J., & Efros, A. A. (2008). im2gps: Estimating geographic information from a single image. In CVPR.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.

  • Hoffer, E., & Ailon, N. (2015). Deep metric learning using triplet network. In SIMBAD.

  • Hu, J., Lu, J., & Tan, Y. P. (2014). Discriminative deep metric learning for face verification in the wild. In CVPR.

  • Jégou, H., & Chum, O. (2012). Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening. In ECCV.

  • Jégou, H., & Zisserman, A. (2014). Triangulation embedding and democratic aggregation for image search. In CVPR.

  • Jégou, H., Douze, M., & Schmid, C. (2008). Hamming embedding and weak geometric consistency for large scale image search. In ECCV.

  • Jégou, H., Douze, M., & Schmid, C. (2010). Improving bag-of-features for large scale image search. In IJCV.

  • Jégou, H., Douze, M., Schmid, C., & Pérez, P. (2010). Aggregating local descriptors into a compact image representation. In CVPR.

  • Jégou, H., Douze, M., & Schmid, C. (2011). Product quantization for nearest neighbor search. In TPAMI.

  • Kalantidis, Y., Mellina, C., & Osindero, S. (2016). Cross-dimensional weighting for aggregated deep convolutional features. In Workshop on web-scale vision and social media (VSM), ECCV.

  • Karpathy, A., Joulin, A., & Fei-Fei, L. (2014). Deep fragment embeddings for bidirectional image-sentence mapping. In NIPS.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In NIPS.

  • Laptev, D., Savinov, N., Buhmann, J. M., & Pollefeys, M. (2016). Ti-pooling: Transformation-invariant pooling for feature learning in convolutional neural networks. In CVPR.

  • Li, X., Larson, M., & Hanjalic, A. (2015). Pairwise geometric matching for large-scale object retrieval. In CVPR.

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR.

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. In IJCV.

  • Makadia, A., Pavlovic, V., & Kumar, S. (2008). A new baseline for image annotation. In ECCV.

  • Mikolajczyk, K., & Schmid, C. (2004). Scale and affine invariant interest point detectors. In IJCV.

  • Mikulík, A., Perdoch, M., Chum, O., & Matas, J. (2010). Learning a fine vocabulary. In ECCV.

  • Mikulik, A., Perdoch, M., Chum, O., & Matas, J. (2013). Learning vocabularies over a fine quantization. In IJCV.

  • Ng, J. Y. H., Yang, F., & Davis, L. S. (2015). Exploiting local features from deep networks for image retrieval. In CVPR workshops.

  • Nister, D., & Stewenius, H. (2006). Scalable recognition with a vocabulary tree. In CVPR.

  • Paulin, M., Douze, M., Harchaoui, Z., Mairal, J., Perronnin, F., & Schmid, C. (2015). Local convolutional features with unsupervised training for image retrieval. In ICCV.

  • Perdoch, M., Chum, O., & Matas, J. (2009). Efficient representation of local geometry for large scale object retrieval. In CVPR.

  • Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In CVPR.

  • Perronnin, F., & Larlus, D. (2015). Fisher vectors meet neural networks: A hybrid classification architecture. In CVPR.

  • Perronnin, F., Liu, Y., Sánchez, J., & Poirier, H. (2010). Large-scale image retrieval with compressed fisher vectors. In CVPR.

  • Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007). Object retrieval with large vocabularies and fast spatial matching. In CVPR.

  • Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2008). Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR.

  • Philbin, J., Isard, M., Sivic, J., & Zisserman, A. (2010). Descriptor learning for efficient retrieval. In ECCV.

  • Radenovic, F., Jégou, H., & Chum, O. (2015). Multiple measurements and joint dimensionality reduction for large scale image search with short vectors-extended version. In International Conference on Multimedia Retrieval.

  • Radenovic, F., Tolias, G., & Chum, O. (2016). CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In ECCV.

  • Razavian, A.S., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. In CVPR deep vision workshop.

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS.

  • Rodriguez-Serrano, J., Larlus, D., & Dai, Z. (2015). Data-driven detection of prominent objects. In TPAMI.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. In IJCV.

  • Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In CVPR.

  • Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In CVPR.

  • Shen, X., Lin, Z., Brandt, J., & Wu, Y. (2014). Spatially-constrained similarity measure for large-scale object retrieval. In TPAMI.

  • Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., & Moreno-Noguer, F. (2015). Discriminative learning of deep convolutional feature point descriptors. In ICCV.

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR.

  • Sivic, J., & Zisserman, A. (2003). Video google: A text retrieval approach to object matching in videos. In ICCV.

  • Song, H.O., Xiang, Y., Jegelka, S., & Savarese, S. (2016). Deep metric learning via lifted structured feature embedding. In CVPR.

  • Sun, Y., Chen, Y., Wang, X., & Tang, X. (2014). Deep learning face representation by joint identification-verification. In NIPS.

  • Tao, R., Gavves, E., Snoek, C. G., & Smeulders, A. W. (2014). Locality in generic instance search from one example. In CVPR.

  • Tolias, G., & Jégou, H. (2015). Visual query expansion with or without geometry: Refining local descriptors by feature aggregation. In PR.

  • Tolias, G., Avrithis, Y., & Jégou, H. (2015). Image search with selective match kernels: Aggregation across single and multiple images. In IJCV.

  • Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max-pooling of CNN activations. In ICLR.

  • Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on PAMI, 30(11), 1958–1970. doi:10.1109/TPAMI.2008.128.

  • Turcot, P., & Lowe, D.G. (2009). Better matching with fewer features: The selection of useful features in large database recognition problems. In ICCV Workshops.

  • Vardi, Y., & Zhang, C. H. (2004). The multivariate L1-median and associated data depth. In Proceedings of the National Academy of Sciences.

  • Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., et al. (2014). Learning fine-grained image similarity with deep ranking. In CVPR.

Author information

Corresponding author

Correspondence to Diane Larlus.

Additional information

Communicated by Svetlana Lazebnik, Cordelia Schmid.

About this article

Cite this article

Gordo, A., Almazán, J., Revaud, J. et al. End-to-End Learning of Deep Visual Representations for Image Retrieval. Int J Comput Vis 124, 237–254 (2017). https://doi.org/10.1007/s11263-017-1016-8

  • DOI: https://doi.org/10.1007/s11263-017-1016-8
