People Watching: Human Actions as a Cue for Single View Geometry

Fouhey, David F.; Delaitre, Vincent; Gupta, Abhinav; Efros, Alexei A.; Laptev, Ivan; Sivic, Josef

doi:10.1007/s11263-014-0710-z

Published: 22 March 2014

International Journal of Computer Vision, Volume 110, pages 259–274 (2014)

David F. Fouhey¹,
Vincent Delaitre²,
Abhinav Gupta¹,
Alexei A. Efros¹,
Ivan Laptev² &
Josef Sivic²

Abstract

We present an approach which exploits the coupling between human actions and scene geometry to use human pose as a cue for single-view 3D scene understanding. Our method builds upon recent advances in still-image pose estimation to extract functional and geometric constraints on the scene. These constraints are then used to improve single-view 3D scene understanding approaches. The proposed method is validated on monocular time-lapse sequences from YouTube and still images of indoor scenes gathered from the Internet. We demonstrate that observing people performing different actions can significantly improve estimates of 3D scene geometry.

Acknowledgments

This work was supported by NSF Graduate Research and NDSEG Fellowships to DF, and by ONR-MURI N000141010934, NSF IIS-1320083, the MSR-INRIA laboratory, the EIT-ICT labs, Google, ERC Activia, and the Quaero Programme, funded by OSEO.

Author information

Alexei A. Efros
Present address: EECS Department at UC Berkeley, Berkeley, CA, USA

Authors and Affiliations

Robotics Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
David F. Fouhey, Abhinav Gupta & Alexei A. Efros
WILLOW Project, Département d’Informatique de l’École Normale Supérieure, ENS/INRIA/CNRS UMR 8548, 23 Avenue d’Italie, 75013 Paris, France
Vincent Delaitre, Ivan Laptev & Josef Sivic

Corresponding author

Correspondence to David F. Fouhey.

Additional information

Communicated by Carlo Colombo.

About this article

Cite this article

Fouhey, D.F., Delaitre, V., Gupta, A. et al. People Watching: Human Actions as a Cue for Single View Geometry. Int J Comput Vis 110, 259–274 (2014). https://doi.org/10.1007/s11263-014-0710-z

Received: 23 June 2013
Accepted: 25 February 2014
Published: 22 March 2014
Issue Date: December 2014
DOI: https://doi.org/10.1007/s11263-014-0710-z
