Learning block importance models for web pages

Authors:
Ruihua Song, Microsoft Research Asia, Beijing, P.R. China
Haifeng Liu, University of Toronto, Toronto, ON, Canada
Ji-Rong Wen, Microsoft Research Asia, Beijing, P.R. China
Wei-Ying Ma, Microsoft Research Asia, Beijing, P.R. China

WWW '04: Proceedings of the 13th international conference on World Wide Web, May 2004, Pages 203–211
https://doi.org/10.1145/988672.988700

Published: 17 May 2004

ABSTRACT

Previous work shows that a web page can be partitioned into multiple segments, or blocks, and that the blocks in a page are often not equally important. It has also been shown that distinguishing noisy or unimportant blocks from the rest of a page can facilitate web mining, search, and accessibility. However, no uniform approach or model has been presented for measuring the importance of different segments in web pages. Through a user study, we found that people do have a consistent view of the importance of blocks in web pages. In this paper, we investigate how to find a model that automatically assigns importance values to the blocks in a web page. We formulate block importance estimation as a learning problem. First, we use a vision-based page segmentation algorithm to partition a web page into semantic blocks with a hierarchical structure. Then, spatial features (such as position and size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. Based on these features, learning algorithms are used to train a model that assigns importance to the different segments in the web page. In our experiments, the best model achieves a Micro-F1 of 79% and a Micro-Accuracy of 85.9%, which is quite close to a person's view.
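
To make the feature-vector and evaluation steps described in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract names only the feature families (position, size, number of images, number of links), so the `Block` fields, the normalization by page size, and the function names are all assumptions made for illustration. The metric definitions are also an assumption: they pool per-class binary decisions in the usual micro-averaging style, which may differ in detail from the definitions used in the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Block:
    """A segmented page block. All field names are hypothetical."""
    x: float        # left edge, in pixels
    y: float        # top edge, in pixels
    width: float
    height: float
    num_images: int
    num_links: int


def feature_vector(block: Block, page_width: float, page_height: float) -> List[float]:
    """Spatial features scaled by page size (an assumption), then content counts."""
    return [
        block.x / page_width,
        block.y / page_height,
        block.width / page_width,
        block.height / page_height,
        float(block.num_images),
        float(block.num_links),
    ]


def micro_scores(
    y_true: Sequence[int], y_pred: Sequence[int], labels: Sequence[int]
) -> Tuple[float, float]:
    """Return (micro-F1, micro-accuracy) from pooled per-class binary decisions."""
    tp = fp = fn = tn = 0
    for label in labels:
        # Treat each importance level as its own yes/no decision.
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1
            elif t == label:
                fn += 1
            else:
                tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return f1, accuracy
```

Under these pooled definitions, when every block receives exactly one label, micro-F1 equals the fraction of correctly labeled blocks, while micro-accuracy also credits true negatives and is therefore at least as high. The vectors returned by `feature_vector` could be passed to any standard multiclass learner.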

Index Terms

1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information systems applications
    1. Data mining

Published in
WWW '04: Proceedings of the 13th international conference on World Wide Web
May 2004
754 pages
ISBN: 158113844X
DOI: 10.1145/988672
Conference Chairs: Stuart Feldman (IBM Research), Mike Uretsky (New York University)
Program Chairs: Marc Najork (Microsoft Research), Craig Wills (Worcester Polytechnic Institute)
Copyright © 2004 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
block importance model
classification
page segmentation
web mining
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%