DOI: 10.1145/3240508.3241916
research-article

Fluid Annotation: A Human-Machine Collaboration Interface for Full Image Annotation

Published: 15 October 2018

ABSTRACT

We introduce Fluid Annotation, an intuitive human-machine collaboration interface for annotating the class label and outline of every object and background region in an image. Fluid Annotation is based on three principles: (I) Strong machine-learning aid. We start from the output of a strong neural network model, which the annotator can edit by correcting the labels of existing regions, adding new regions to cover missing objects, and removing incorrect regions. The edit operations are also assisted by the model. (II) Full image annotation in a single pass. As opposed to performing a series of small annotation tasks in isolation [51,68], we propose a unified interface for full image annotation in a single pass. (III) Empower the annotator. We let the annotator choose what to annotate and in which order. This makes it possible to concentrate on what the machine does not already know, i.e., to put human effort only on the errors it made, which makes effective use of the annotation budget.

Through extensive experiments on the COCO+Stuff dataset [11,51], we demonstrate that Fluid Annotation produces accurate annotations very efficiently, taking 3× less annotation time than the popular LabelMe interface [70].
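To make principle (I) concrete, here is a minimal Python sketch of the three edit operations (relabel, add, remove) applied to a machine-generated full-image annotation. It assumes regions are stored as sets of pixel coordinates stacked in depth order, so later regions paint over earlier ones, and it interprets "model-assisted" adding as picking the highest-scoring machine proposal that contains the annotator's click. All names (`Region`, `FullImageAnnotation`, `add_from_click`, `uncovered`) are hypothetical; this is an illustration under those assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, List, Optional, Tuple

Pixel = Tuple[int, int]  # (row, col)


@dataclass
class Region:
    """One labeled image region (hypothetical representation)."""
    label: str
    pixels: FrozenSet[Pixel]
    score: float = 1.0  # model confidence; an assumed field


class FullImageAnnotation:
    """Editable stack of regions covering one image (hypothetical).

    Regions later in the list are painted on top of earlier ones, so the
    rendered label map gives each pixel exactly one label, or None if no
    region covers it yet.
    """

    def __init__(self, height: int, width: int, machine_regions: List[Region]):
        self.height = height
        self.width = width
        # Start from the model's output instead of a blank canvas.
        self.regions = list(machine_regions)

    def relabel(self, index: int, new_label: str) -> None:
        """Correct the class label of an existing region."""
        self.regions[index].label = new_label

    def remove(self, index: int) -> None:
        """Delete an incorrect region."""
        del self.regions[index]

    def add_from_click(self, click: Pixel, proposals: List[Region]) -> Optional[Region]:
        """Add the highest-scoring proposal containing the clicked pixel.

        Assumed form of model assistance: the annotator clicks once and the
        machine supplies the outline. Returns None if no proposal covers
        the click.
        """
        hits = [p for p in proposals if click in p.pixels]
        if not hits:
            return None
        # Copy so later edits do not mutate the shared proposal object.
        best = replace(max(hits, key=lambda p: p.score))
        self.regions.append(best)
        return best

    def label_map(self) -> List[List[Optional[str]]]:
        """Render the region stack into a per-pixel label grid."""
        grid: List[List[Optional[str]]] = [
            [None] * self.width for _ in range(self.height)
        ]
        for region in self.regions:  # later regions overwrite earlier ones
            for r, c in region.pixels:
                grid[r][c] = region.label
        return grid

    def uncovered(self) -> List[Pixel]:
        """Pixels that no region covers yet, in row-major order."""
        grid = self.label_map()
        return [(r, c) for r in range(self.height) for c in range(self.width)
                if grid[r][c] is None]


if __name__ == "__main__":
    top = frozenset((0, c) for c in range(3))
    bottom = frozenset((1, c) for c in range(3))
    ann = FullImageAnnotation(2, 3, [Region("sky", top, 0.95),
                                     Region("sand", bottom, 0.40)])
    ann.relabel(1, "grass")  # fix a wrong machine label
    ann.add_from_click((1, 2), [Region("dog", frozenset({(1, 2)}), 0.9),
                                Region("cat", frozenset({(1, 1), (1, 2)}), 0.4)])
    print(ann.label_map())
    # [['sky', 'sky', 'sky'], ['grass', 'grass', 'dog']]
```

Because every edit acts on one shared full-image state rather than on isolated sub-tasks, this structure also mirrors principle (II); and a helper like `uncovered()` is one way an interface could point the annotator at what the machine missed, in the spirit of principle (III).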

References

  1. D. Acuna, H. Ling, A. Kar, and S. Fidler. 2018. Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN.
  2. (2018).
  3. P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. 2014. Multiscale Combinatorial Grouping. In CVPR.
  4. A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei. 2016. What's the point: Semantic segmentation with point supervision. In ECCV.
  5. S. Bell, P. Upchurch, N. Snavely, and K. Bala. 2015. Material Recognition in the Wild with the Materials in Context Database. In CVPR.
  6. T. Berg and D. A. Forsyth. 2006. Animals on the web. In CVPR.
  7. H. Bilen and A. Vedaldi. 2016. Weakly Supervised Deep Detection Networks. In CVPR.
  8. Arijit Biswas and Devi Parikh. 2013. Simultaneous active learning of classifiers & attributes via relative feedback. In CVPR.
  9. Y. Boykov and M. P. Jolly. 2001. Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images. In ICCV.
  10. Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie. 2010. Visual recognition with humans in the loop. In ECCV.
  11. H. Caesar, J. R. R. Uijlings, and V. Ferrari. 2018. COCO-Stuff: Thing and Stuff Classes in Context. In CVPR.
  12. J. Carreira and C. Sminchisescu. 2010. Constrained Parametric Min-Cuts for Automatic Object Segmentation. In CVPR.
  13. L. Castrejón, K. Kundu, R. Urtasun, and S. Fidler. 2017. Annotating object instances with a Polygon-RNN. In CVPR.
  14. L.-C. Chen, A. Hermans, F. Schroff, G. Papandreou, P. Wang, and H. Adam. 2017. MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features. ArXiv (2017).
  15. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. 2018. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. on PAMI (2018).
  16. R. G. Cinbis, J. Verbeek, and C. Schmid. 2014. Multi-fold MIL Training for Weakly Supervised Object Localization. In CVPR.
  17. M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. 2016. The Cityscapes Dataset for Semantic Urban Scene Understanding. In CVPR.
  18. J. Dai, K. He, Y. Li, S. Ren, and J. Sun. 2016. Instance-sensitive Fully Convolutional Networks. In ECCV.
  19. Jia Deng, Olga Russakovsky, Jonathan Krause, Michael S. Bernstein, Alex Berg, and Li Fei-Fei. 2014. Scalable Multi-label Annotation. In Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems (CHI '14). ACM, 3099–3102.
  20. T. Deselaers, B. Alexe, and V. Ferrari. 2010. Localizing Objects while Learning Their Appearance. In ECCV.
  21. I. Endres and D. Hoiem. 2014. Category-Independent Object Proposals with Diverse Ranking. IEEE Trans. on PAMI, Vol. 36, 2 (2014), 222–234.
  22. M. Everingham, S. Eslami, L. van Gool, C. Williams, J. Winn, and A. Zisserman. 2015. The PASCAL Visual Object Classes Challenge: A Retrospective. IJCV (2015).
  23. P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. 2010. Object Detection with Discriminatively Trained Part Based Models. IEEE Trans. on PAMI, Vol. 32, 9 (2010).
  24. R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. 2010. Learning Object Categories From Internet Image Searches. In Proceedings of the IEEE.
  25. Y. Freund and R. E. Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences (1997).
  26. R. Girshick, J. Donahue, T. Darrell, and J. Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.
  27. Greg Griffin, Alex Holub, and Pietro Perona. 2007. The Caltech-256. Technical Report. Caltech.
  28. B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. 2014. Simultaneous Detection and Segmentation. In ECCV.
  29. M. Haußmann, F. A. Hamprecht, and M. Kandemir. 2017. Variational Bayesian Multiple Instance Learning with Gaussian Processes. In CVPR.
  30. K. He, G. Gkioxari, P. Dollár, and R. Girshick. 2017. Mask R-CNN. In ICCV.
  31. K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In CVPR.
  32. J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. 2017. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR.
  33. M. Huh, P. Agrawal, and A. A. Efros. 2016. What makes ImageNet good for transfer learning? In NIPS LSCVS Workshop.
  34. S. Jain and K. Grauman. 2016. Click Carving: Segmenting Objects in Video with Point Clicks. In Proceedings of the Fourth AAAI Conference on Human Computation and Crowdsourcing.
  35. Suyog Dutt Jain and Kristen Grauman. 2013. Predicting sufficient annotation strength for interactive foreground segmentation. In ICCV.
  36. B. Jin, M. V. Ortiz-Segovia, and S. Süsstrunk. 2017. Webly supervised semantic segmentation. In CVPR.
  37. Ajay J. Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. 2009. Multi-class active learning for image classification. In CVPR.
  38. V. Kantorov, M. Oquab, M. Cho, and I. Laptev. 2016. ContextLocNet: Context-aware Deep Network Models for Weakly Supervised Localization. In ECCV.
  39. Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell. 2007. Active learning with Gaussian processes for object categorization. In ICCV.
  40. A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele. 2017. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR.
  41. A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár. 2018. Panoptic Segmentation. In ArXiv.
  42. A. Kolesnikov and C. H. Lampert. 2016. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV.
  43. K. Konyushkova, J. R. R. Uijlings, C. Lampert, and V. Ferrari. 2018. Learning Intelligent Dialogs for Bounding Box Annotation. In CVPR.
  44. Adriana Kovashka, Sudheendra Vijayanarasimhan, and Kristen Grauman. 2011. Actively selecting annotations among objects and attributes. In ICCV.
  45. I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, S. Kamali, M. Malloci, J. Pont-Tuset, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy. 2017. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html (2017).
  46. A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS.
  47. A. Li, A. Jabri, A. Joulin, and L. van der Maaten. 2017. Learning Visual N-Grams from Web Data. In ICCV.
  48. X. Li, L. Chen, L. Zhang, F. Lin, and W.-Y. Ma. 2006. Image Annotation by Large-scale Content-based Image Retrieval. In ACM Multimedia.
  49. J. H. Liew, Y. Wei, W. Xiong, S.-H. Ong, and J. Feng. 2017. Regional interactive image segmentation networks. In ICCV.
  50. D. Lin, J. Dai, J. Jia, K. He, and J. Sun. 2016. ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation. In CVPR.
  51. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. Microsoft COCO: Common Objects in Context. In ECCV.
  52. W. Liu, A. Rabinovich, and A. C. Berg. 2016. ParseNet: Looking Wider to See Better. In ICLR Workshop.
  53. J. Long, E. Shelhamer, and T. Darrell. 2015. Fully Convolutional Networks for Semantic Segmentation. In CVPR.
  54. D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. 2018. Exploring the limits of weakly supervised pretraining. In ArXiv.
  55. K.-K. Maninis, S. Caelles, J. Pont-Tuset, and L. Van Gool. 2018. Deep Extreme Cut: From Extreme Points to Object Segmentation. In CVPR.
  56. Pascal Mettes, Jan C. van Gemert, and Cees G. M. Snoek. 2016. Spot On: Action Localization from Pointly-Supervised Proposals. In ECCV.
  57. R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. 2014. The role of context for object detection and semantic segmentation in the wild. In CVPR.
  58. R. Mottaghi, S. Fidler, J. Yao, R. Urtasun, and D. Parikh. 2013. Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs. In CVPR. 3143–3150.
  59. N. S. Nagaraja, F. R. Schmidt, and T. Brox. 2015. Video Segmentation with Just a Few Strokes. In ICCV.
  60. Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, and Vittorio Ferrari. 2017a. Extreme clicking for efficient object annotation. In ICCV.
  61. Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, and Vittorio Ferrari. 2017b. Training object class detectors with click supervision. In CVPR.
  62. D. P. Papadopoulos, Jasper R. R. Uijlings, F. Keller, and V. Ferrari. 2016. We don't need no bounding-boxes: Training object class detectors using only human verification. In CVPR.
  63. Amar Parkash and Devi Parikh. 2012. Attributes for classifier feedback. In ECCV.
  64. D. Pathak, P. Krähenbühl, and T. Darrell. 2015. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV.
  65. Guo-Jun Qi, Xian-Sheng Hua, Yong Rui, Jinhui Tang, and Hong-Jiang Zhang. 2008. Two-dimensional active learning for image classification. In CVPR.
  66. S. Ren, K. He, R. Girshick, and J. Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS.
  67. C. Rother, V. Kolmogorov, and A. Blake. 2004. GrabCut: interactive foreground extraction using iterated graph cuts. SIGGRAPH, Vol. 23, 3 (2004), 309–314.
  68. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. 2015a. ImageNet Large Scale Visual Recognition Challenge. IJCV (2015).
  69. O. Russakovsky, L.-J. Li, and L. Fei-Fei. 2015b. Best of both worlds: human-machine collaboration for object annotation. In CVPR.
  70. B. Russell and A. Torralba. 2008. LabelMe: a database and web-based tool for image annotation. IJCV, Vol. 77, 1–3 (2008), 157–173.
  71. B. C. Russell, K. P. Murphy, and W. T. Freeman. 2008. LabelMe: a database and web-based tool for image annotation. IJCV (2008).
  72. F. Schroff, D. Kalenichenko, and J. Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In CVPR.
  73. J. Shotton, J. Winn, C. Rother, and A. Criminisi. 2009. TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Appearance, Shape and Context. IJCV, Vol. 81, 1 (2009), 2–23.
  74. A. Shrivastava, A. Gupta, and R. Girshick. 2016. Training region-based object detectors with online hard example mining. In CVPR.
  75. Behjat Siddiquie and Abhinav Gupta. 2010. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In CVPR.
  76. K. Simonyan and A. Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR.
  77. H. Su, J. Deng, and L. Fei-Fei. 2012. Crowdsourcing annotations for visual object detection. In AAAI Human Computation Workshop.
  78. C. Sun, A. Shrivastava, S. Singh, and A. Gupta. 2017. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV.
  79. C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI.
  80. Joseph Tighe and Svetlana Lazebnik. 2013. Superparsing: Scalable Nonparametric Image Parsing with Superpixels. IJCV, Vol. 101, 2 (2013), 329–349.
  81. J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. 2013. Selective search for object recognition. IJCV (2013).
  82. Sudheendra Vijayanarasimhan and Kristen Grauman. 2008. Multi-Level Active Prediction of Useful Image Annotations for Recognition. In NIPS.
  83. Sudheendra Vijayanarasimhan and Kristen Grauman. 2009. What's it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations. In CVPR.
  84. Sudheendra Vijayanarasimhan and Kristen Grauman. 2014. Large-scale live active learning: Training object detectors with crawled data and crowds. IJCV, Vol. 108, 1–2 (2014), 97–114.
  85. Catherine Wah, Grant Van Horn, Steve Branson, Subhrajyoti Maji, Pietro Perona, and Serge Belongie. 2014. Similarity comparisons for interactive fine-grained categorization. In CVPR.
  86. T. Wang, B. Han, and J. Collomosse. 2014. TouchCut: Fast image and video segmentation using single-touch interaction. CVIU (2014).
  87. Z. Wu, C. Shen, and A. van den Hengel. 2016. Bridging Category-level and Instance-level Semantic Image Segmentation. ArXiv (2016).
  88. J. Xiao, K. Ehinger, J. Hays, A. Torralba, and A. Oliva. 2014. SUN Database: Exploring a Large Collection of Scene Categories. IJCV (2014), 1–20.
  89. J. Xu, A. G. Schwing, and R. Urtasun. 2015. Learning to Segment Under Various Forms of Weak Supervision. In CVPR.
  90. N. Xu, B. Price, S. Cohen, J. Yang, and T. S. Huang. 2016. Deep interactive object selection. In CVPR.
  91. Angela Yao, Juergen Gall, Christian Leistner, and Luc Van Gool. 2012. Interactive object detection. In CVPR.
  92. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. 2017. Scene Parsing through ADE20K Dataset. In CVPR.
  93. Y. Zhu, Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao. 2017. Soft Proposal Networks for Weakly Supervised Object Localization. In ICCV.

Published in: MM '18: Proceedings of the 26th ACM International Conference on Multimedia, October 2018, 2167 pages. ISBN: 9781450356657. DOI: 10.1145/3240508.

Copyright © 2018 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

MM '18 Paper Acceptance Rate: 209 of 757 submissions, 28%. Overall Acceptance Rate: 995 of 4,171 submissions, 24%.
