DOI: 10.1145/988672.988700
Article

Learning block importance models for web pages

Published: 17 May 2004

ABSTRACT

Previous work shows that a web page can be partitioned into multiple segments, or blocks, and that these blocks are often not equally important. It has also been shown that separating noisy or unimportant blocks from the rest of a page can facilitate web mining, search, and accessibility. However, no uniform approach or model has been presented for measuring the importance of different segments in web pages. Through a user study, we found that people do have a consistent view of the importance of blocks in web pages. In this paper, we investigate how to find a model that automatically assigns importance values to the blocks in a web page. We define block importance estimation as a learning problem. First, we use a vision-based page segmentation algorithm to partition a web page into semantic blocks with a hierarchical structure. Then spatial features (such as position and size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. Based on these features, learning algorithms are used to train a model that assigns importance to the different segments of the web page. In our experiments, the best model achieves a Micro-F1 of 79% and a Micro-Accuracy of 85.9%, which is quite close to a person's view.
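As a rough illustration of the pipeline the abstract describes, the sketch below builds a per-block feature vector from spatial and content features and computes micro-averaged scores for predicted importance labels. It is a minimal sketch under stated assumptions, not the paper's implementation: the `Block` fields, the normalisation by page size, and the pooled one-vs-rest definitions of Micro-F1 and Micro-Accuracy are all assumptions, and the segmentation and learning steps are omitted.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Block:
    """One page block. All field names are hypothetical, chosen for illustration."""
    x: float           # left edge, in pixels
    y: float           # top edge, in pixels
    width: float
    height: float
    num_images: int
    num_links: int


def feature_vector(block: Block, page_width: float, page_height: float) -> List[float]:
    """Spatial features (scaled by page size, an assumption) followed by content counts."""
    return [
        block.x / page_width,
        block.y / page_height,
        block.width / page_width,
        block.height / page_height,
        float(block.num_images),
        float(block.num_links),
    ]


def micro_scores(
    y_true: Sequence[int], y_pred: Sequence[int], labels: Sequence[int]
) -> Tuple[float, float]:
    """Micro-averaged F1 and accuracy, pooling one-vs-rest decisions over all labels.

    Assumes y_true and y_pred are non-empty and of equal length.
    """
    tp = fp = fn = tn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
            else:
                tn += 1
    micro_f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    micro_accuracy = (tp + tn) / (tp + fp + fn + tn)
    return micro_f1, micro_accuracy


# Usage: one block on a 1000x1000 page, then scores for four labelled blocks.
example = Block(x=100, y=50, width=400, height=200, num_images=2, num_links=5)
vector = feature_vector(example, page_width=1000, page_height=1000)
f1, accuracy = micro_scores([1, 2, 3, 1], [1, 2, 1, 1], labels=[1, 2, 3])
```

In a full system, vectors like `vector` would be passed to whichever learning algorithm is chosen, and the trained model's predictions would be scored against user-assigned importance labels.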

Published in

WWW '04: Proceedings of the 13th international conference on World Wide Web
May 2004
754 pages
ISBN: 158113844X
DOI: 10.1145/988672

Copyright © 2004 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
