Object Level Grouping for Video Shots

Sivic, Josef; Schaffalitzky, Frederik; Zisserman, Andrew

doi:10.1007/s11263-005-4264-y

Published: 01 January 2006

Volume 67, pages 189–210, (2006)

International Journal of Computer Vision

Abstract

We describe a method for automatically obtaining object representations suitable for retrieval from generic video shots. The object representation consists of an association of frame regions. These regions provide exemplars of the object’s possible visual appearances.

Two ideas are developed: (i) associating regions within a single shot to represent a deforming object; (ii) associating regions from the multiple visual aspects of a 3D object, thereby implicitly representing 3D structure. For the association we exploit temporal continuity (tracking) and wide baseline matching of affine covariant regions.

In the implementation there are three areas of novelty: First, we describe a method to repair short gaps in tracks. Second, we show how to join tracks across occlusions (where many tracks terminate simultaneously). Third, we develop an affine factorization method that copes with motion degeneracy.

We obtain tracks that last throughout the shot, without requiring a 3D reconstruction. The factorization method is used to associate tracks into object-level groups that share a common motion. As a result, separate parts of an object that are not simultaneously visible (such as the front and back of a car, or the front and side of a face) are associated together. In turn, this enables object-level matching and recognition throughout a video.
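The idea of grouping tracks by common motion can be illustrated with the standard affine rank constraint: for points on one rigid object viewed by an affine camera, the centred matrix of stacked track coordinates has rank at most 3. The sketch below is only an illustration of that textbook constraint on synthetic data, not the degeneracy-aware factorization developed in the paper; the function names `affine_rank_residual` and `project`, the array layout, and the random test data are all assumptions introduced here.

```python
import numpy as np


def affine_rank_residual(tracks: np.ndarray) -> float:
    """Relative energy of the centred measurement matrix outside its best rank-3 fit.

    tracks: array of shape (F, P, 2) holding the image position of P tracks
    in each of F frames (hypothetical layout, complete tracks assumed).
    A value near 0 is consistent with a single rigid affine motion.
    """
    n_frames, n_points, _ = tracks.shape
    # Build the 2F x P measurement matrix with rows x_1, y_1, x_2, y_2, ...
    W = tracks.transpose(0, 2, 1).reshape(2 * n_frames, n_points)
    # Subtract the per-row centroid, which removes the per-frame translation.
    W = W - W.mean(axis=1, keepdims=True)
    s = np.linalg.svd(W, compute_uv=False)
    total = np.sum(s ** 2)
    if total == 0.0:
        return 0.0
    return float(np.sqrt(np.sum(s[3:] ** 2) / total))


def project(points3d: np.ndarray, n_frames: int, rng: np.random.Generator) -> np.ndarray:
    """Project 3 x P points with a random affine camera per frame (synthetic data only)."""
    cams = rng.normal(size=(n_frames, 2, 3))
    trans = rng.normal(size=(n_frames, 1, 2))
    # (F, 2, 3) @ (3, P) -> (F, 2, P), then reorder to (F, P, 2).
    return (cams @ points3d).transpose(0, 2, 1) + trans


rng = np.random.default_rng(0)
obj_a = project(rng.normal(size=(3, 12)), 8, rng)  # tracks on one rigid object
obj_b = project(rng.normal(size=(3, 12)), 8, rng)  # tracks on a second, independent object

single = affine_rank_residual(obj_a)
mixed = affine_rank_residual(np.concatenate([obj_a, obj_b], axis=1))
print(f"one object: {single:.2e}, two objects mixed: {mixed:.2e}")
```

Under these assumptions, tracks from one object give a residual at numerical-noise level, while pooling tracks from two independently moving objects gives a clearly larger one, which is the kind of signal a motion-grouping step can use. Real tracks are noisy and incomplete, so a practical method needs robust fitting and handling of missing data, which this sketch omits.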

We illustrate the method on the feature film “Groundhog Day.” Examples are given for the retrieval of deforming objects (heads, walking people) and rigid objects (vehicles, locations).

Author information

Authors and Affiliations

Department of Engineering Science, University of Oxford, Oxford, OX1 3PJ, UK
Josef Sivic, Frederik Schaffalitzky & Andrew Zisserman

About this article

Cite this article

Sivic, J., Schaffalitzky, F. & Zisserman, A. Object Level Grouping for Video Shots. Int J Comput Vision 67, 189–210 (2006). https://doi.org/10.1007/s11263-005-4264-y

Received: 21 September 2004
Revised: 11 April 2005
Accepted: 03 May 2005
Published: 01 January 2006
Issue Date: April 2006
DOI: https://doi.org/10.1007/s11263-005-4264-y
