ABSTRACT
Previous work shows that a web page can be partitioned into multiple segments or blocks, and that the blocks in a page are often not equally important. It has also been shown that distinguishing noisy or unimportant blocks within pages can facilitate web mining, search and accessibility. However, no uniform approach or model has been presented to measure the importance of different segments in web pages. Through a user study, we found that people do have a consistent view about the importance of blocks in web pages. In this paper, we investigate how to find a model that automatically assigns importance values to the blocks in a web page. We define block importance estimation as a learning problem. First, we use a vision-based page segmentation algorithm to partition a web page into semantic blocks with a hierarchical structure. Then spatial features (such as position and size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. Based on these features, learning algorithms are used to train a model that assigns importance to the different segments in the web page. In our experiments, the best model achieves a Micro-F1 of 79% and a Micro-Accuracy of 85.9%, which is quite close to a person's view.
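The feature construction step described above can be illustrated with a minimal sketch. It assumes a block is already available as a rectangle with image and link counts. The `Block` class, its field names, and the normalization by page size are hypothetical choices made for illustration; they are not the feature set or the segmentation output used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One page segment. All field names are hypothetical, for illustration only."""
    x: float        # left edge, in pixels
    y: float        # top edge, in pixels
    width: float    # block width, in pixels
    height: float   # block height, in pixels
    num_images: int
    num_links: int


def block_features(block: Block, page_width: float, page_height: float) -> List[float]:
    """Combine spatial and content features of one block into a feature vector.

    Spatial features are normalized by the page size (an assumption made here)
    so that blocks from pages of different sizes are comparable.
    """
    center_x = (block.x + block.width / 2) / page_width
    center_y = (block.y + block.height / 2) / page_height
    relative_area = (block.width * block.height) / (page_width * page_height)
    return [
        center_x,
        center_y,
        relative_area,
        float(block.num_images),
        float(block.num_links),
    ]


# Two made-up blocks on a 1000 x 1000 pixel page: a link-heavy side bar
# and a large central block.
side_bar = Block(x=0, y=0, width=200, height=1000, num_images=0, num_links=20)
main_area = Block(x=200, y=0, width=800, height=1000, num_images=2, num_links=4)

features = [block_features(b, 1000, 1000) for b in (side_bar, main_area)]
```

Each resulting vector, paired with an importance label from human assessors, would form one training example for whichever learning algorithm is chosen.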
Index Terms
- Learning block importance models for web pages