
The Open Images Dataset V4

Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale

Published in the International Journal of Computer Vision.

Abstract

We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection, and visual relationship detection. The images have a Creative Commons Attribution license that allows sharing and adapting the material, and they were collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide \(15\times\) more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide comprehensive statistics about the dataset, validate the quality of the annotations, study how the performance of several modern models evolves with increasing amounts of training data, and demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.
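To make the scale concrete, here is a minimal sketch (not the authors' code) of tallying boxes per image from the box annotations distributed on the Open Images website. The file name and the "ImageID" column follow the public CSV release and are assumptions of this sketch, not details stated in the paper.

```python
# Minimal sketch: count Open Images V4 boxes per image, assuming the
# publicly released CSV layout in which each row is one bounding box.
import pandas as pd

boxes = pd.read_csv("train-annotations-bbox.csv")  # assumed file name

per_image = boxes.groupby("ImageID").size()
print(f"{len(boxes):,} boxes over {len(per_image):,} images, "
      f"{per_image.mean():.1f} boxes per image on average")
```

On the training split described above, this should report on the order of 15.4M boxes over 1.9M images, i.e. roughly 8 boxes per image.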




Notes

  1. Image hosting service (www.flickr.com)

  2. In Flickr terms, images are served at different sizes (Thumbnail, Large, Medium, etc.). The Original size is a pristine copy of the image that was uploaded by the author.

  3. More details at https://storage.googleapis.com/openimages/web/2018-05-17-rotation-information.html.

  4. Image ids are generated from hashes of the data, so the sampling within a stratum is effectively pseudo-random and deterministic (see the sampling sketch after these notes).

  5. Note that while in theory logit scores are unbounded, we rarely observe values outside of \([-8,8]\), so the number of strata is bounded in practice.

  6. These are truly unique objects: each object is annotated only with its leafmost label, e.g. a man has a single box and is not additionally annotated as a person.

  7. We thank Ross Girshick for suggesting this type of visualization.

  8. To find the triplets in common between two datasets, we matched the class names by lexicographical comparison and aggregated annotations in VG by relationship; since VG contains somewhat inconsistent relationship names, we used loose string matching to match relationships (see the matching sketch after these notes).

  9. https://storage.googleapis.com/openimages/web/evaluation.html.
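Two of the notes above describe small algorithmic devices that are easy to sketch. First, the deterministic sampling of note 4: hashing an image id yields a stable pseudo-random value, so thresholding that value selects the same subset of a stratum on every run. This is a minimal sketch of the idea, not the authors' implementation; the hash function, sampling rate, and example ids are illustrative assumptions.

```python
import hashlib

def in_sample(image_id: str, rate: float = 0.1) -> bool:
    """Deterministically keep about `rate` of ids: the hash fixes the
    outcome per id, so the same subset is selected on every run."""
    digest = hashlib.sha256(image_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < rate

stratum = ["0001eeaf4aed83f9", "000a1249af2bc5f0", "000fb9572298416c"]
print([i for i in stratum if in_sample(i)])  # stable across runs
```

Second, the loose string matching of note 8 can be sketched with a standard edit-similarity ratio; the threshold is an illustrative assumption, not the authors' exact procedure.

```python
from difflib import SequenceMatcher

def loose_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Match relationship names that differ only in minor spelling."""
    a, b = a.strip().lower(), b.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(loose_match("holds", "hold"))   # True: ratio 2*4/9 ~ 0.89
print(loose_match("holds", "wears"))  # False: ratio 2*1/10 = 0.2
```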


Author information

Corresponding author: Jordi Pont-Tuset.

Additional information

Communicated by Antonio Torralba.



Cite this article

Kuznetsova, A., Rom, H., Alldrin, N. et al. The Open Images Dataset V4. Int J Comput Vis 128, 1956–1981 (2020). https://doi.org/10.1007/s11263-020-01316-z

