Article

Evaluating bag-of-visual-words representations in scene classification

Authors:
Jun Yang, Carnegie Mellon University, Pittsburgh, PA
Yu-Gang Jiang, City University of Hong Kong, Hong Kong, China
Alexander G. Hauptmann, Carnegie Mellon University, Pittsburgh, PA
Chong-Wah Ngo, City University of Hong Kong, Hong Kong, China

MIR '07: Proceedings of the international workshop on Workshop on multimedia information retrieval, September 2007, Pages 197–206. https://doi.org/10.1145/1290082.1290111

Published: 24 September 2007

ABSTRACT

Based on keypoints extracted as salient image patches, an image can be described as a "bag of visual words", and this representation has been used in scene classification. The choices of dimension, selection, and weighting of visual words in this representation are crucial to classification performance but have not been thoroughly studied in previous work. Given the analogy between this representation and the bag-of-words representation of text documents, we apply techniques used in text categorization, including term weighting, stop word removal, and feature selection, to generate image representations that differ in the dimension, selection, and weighting of visual words. The impact of these representation choices on scene classification is studied through extensive experiments on the TRECVID and PASCAL collections. This study provides an empirical basis for designing visual-word representations that are likely to produce superior classification performance.
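The text-categorization analogy in the abstract can be illustrated with a minimal Python sketch. It assumes each image has already been reduced to a list of visual-word indices (for example, by assigning keypoint descriptors to cluster centers). The abstract does not say which weighting scheme or stop-word criterion the paper uses, so the tf-idf formula, the document-frequency threshold `max_df_ratio`, and all function names below are illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter


def bag_of_visual_words(word_ids, vocab_size):
    """Count how often each visual word (cluster index) occurs in one image."""
    counts = Counter(word_ids)
    return [counts.get(i, 0) for i in range(vocab_size)]


def document_frequencies(histograms):
    """For each visual word, count the images in which it appears at least once."""
    vocab_size = len(histograms[0])
    return [sum(1 for h in histograms if h[i] > 0) for i in range(vocab_size)]


def remove_stop_words(histograms, max_df_ratio):
    """Drop visual words that occur in more than max_df_ratio of all images.

    The threshold is a hypothetical stop-word criterion chosen for illustration.
    Returns the reduced histograms and the indices of the words that were kept.
    """
    n_images = len(histograms)
    df = document_frequencies(histograms)
    kept = [i for i, d in enumerate(df) if d / n_images <= max_df_ratio]
    return [[h[i] for i in kept] for h in histograms], kept


def tf_idf(histograms):
    """Weight each count by normalized term frequency times log(N / df)."""
    n_images = len(histograms)
    df = document_frequencies(histograms)
    idf = [math.log(n_images / d) if d > 0 else 0.0 for d in df]
    weighted = []
    for h in histograms:
        total = sum(h)
        if total == 0:
            weighted.append([0.0] * len(h))
        else:
            weighted.append([(c / total) * w for c, w in zip(h, idf)])
    return weighted


# Toy data: three images over a vocabulary of four visual words.
images = [[0, 0, 1, 3], [0, 2, 2], [0, 3]]
histograms = [bag_of_visual_words(ids, vocab_size=4) for ids in images]
filtered, kept = remove_stop_words(histograms, max_df_ratio=0.9)
weighted = tf_idf(filtered)
```

In this toy data, visual word 0 appears in every image, so the stop-word step discards it and the remaining three words are reweighted. Varying the vocabulary size, the threshold, or the weighting function yields representations that differ in dimension, selection, and weighting, which are the three choices the abstract sets out to evaluate.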

Index Terms

1. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing

Published in
MIR '07: Proceedings of the international workshop on Workshop on multimedia information retrieval
September 2007
343 pages
ISBN: 9781595937780
DOI: 10.1145/1290082
General Chairs:
James Z. Wang, The Pennsylvania State University, USA
Nozha Boujemaa, INRIA Rocquencourt, France
Program Chairs:
Alberto Del Bimbo, University of Florence, Italy
Jia Li, The Pennsylvania State University, USA
Copyright © 2007 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
bag-of-visual-words
keypoint
local interest point
scene classification
Qualifiers
- Article