ABSTRACT
Based on keypoints extracted as salient image patches, an image can be described as a "bag of visual words," and this representation has been used in scene classification. The choice of dimension, selection, and weighting of visual words in this representation is crucial to classification performance but has not been thoroughly studied in previous work. Given the analogy between this representation and the bag-of-words representation of text documents, we apply techniques used in text categorization, including term weighting, stop word removal, and feature selection, to generate image representations that differ in the dimension, selection, and weighting of visual words. The impact of these representation choices on scene classification is studied through extensive experiments on the TRECVID and PASCAL collections. This study provides an empirical basis for designing visual-word representations that are likely to produce superior classification performance.
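To make the text-categorization analogy concrete, the following is a minimal sketch of two of the techniques named above, stop word removal and term weighting, applied to visual-word histograms. It assumes tf-idf as the weighting scheme and a document-frequency cutoff for stop words; the abstract does not say which schemes or thresholds the study uses, so the function names, the `max_df_ratio` parameter, and the toy counts are all hypothetical.

```python
import math

def remove_stop_words(histograms, max_df_ratio=0.9):
    """Drop visual words that occur in more than max_df_ratio of the images.

    histograms: list of per-image lists of raw visual-word counts,
    all of the same length (the vocabulary size).
    The cutoff value is an illustrative assumption, not taken from the paper.
    """
    n_images = len(histograms)
    vocab_size = len(histograms[0])
    keep = [
        w for w in range(vocab_size)
        if sum(1 for h in histograms if h[w] > 0) / n_images <= max_df_ratio
    ]
    return [[h[w] for w in keep] for h in histograms], keep

def tf_idf(histograms):
    """Weight visual-word counts by term frequency times inverse document frequency."""
    n_images = len(histograms)
    vocab_size = len(histograms[0])
    # Document frequency: number of images that contain each visual word.
    df = [sum(1 for h in histograms if h[w] > 0) for w in range(vocab_size)]
    weighted = []
    for h in histograms:
        total = sum(h)
        row = []
        for w in range(vocab_size):
            tf = h[w] / total if total else 0.0
            idf = math.log(n_images / df[w]) if df[w] else 0.0
            row.append(tf * idf)
        weighted.append(row)
    return weighted

# Toy data: 3 images, vocabulary of 4 hypothetical visual words.
counts = [
    [5, 0, 2, 1],
    [3, 1, 0, 1],
    [4, 0, 0, 2],
]
filtered, kept = remove_stop_words(counts, max_df_ratio=0.9)
weights = tf_idf(filtered)
```

In this toy example, the two visual words that appear in every image are treated as stop words and dropped, which also shrinks the dimension of the representation. The remaining words receive weights that grow with their frequency in an image and shrink with the number of images that contain them.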