Abstract
A Web page typically contains many information blocks. Apart from the main content blocks, it usually has blocks such as navigation panels, copyright and privacy notices, and advertisements. We call these noisy blocks. The noise in Web pages can seriously harm Web data mining. To eliminate this noise, we introduce a new tree structure, called the Style Tree, and study an algorithm for constructing a site style tree. The Style Tree model is then used to detect and eliminate noise in any Web page of the site. An information-based measure for determining which element nodes are noisy is also constructed. In addition, the applications of this method are discussed in detail. Experimental results show that our noise elimination technique improves the mining results significantly.
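The abstract does not spell out the tree layout or the measure, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes that each page is given as a nested `(tag, children)` tuple, that an element node records the distinct child layouts ("style nodes", a name assumed here) seen across the site's pages, and that the information-based measure is the entropy of how often each layout is used. The names `ElementNode`, `StyleNode`, `merge`, and `style_diversity` are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ElementNode:
    tag: str
    pages: int = 0                                # pages in which this element occurs
    styles: dict = field(default_factory=dict)    # child-tag sequence -> StyleNode


@dataclass
class StyleNode:
    pages: int = 0                                # pages that use this child layout
    children: list = field(default_factory=list)  # one ElementNode per child tag


def merge(element, dom):
    """Merge one page's DOM subtree, given as (tag, [children]), into the tree."""
    _tag, kids = dom
    element.pages += 1
    key = tuple(child_tag for child_tag, _ in kids)
    style = element.styles.get(key)
    if style is None:
        style = StyleNode(children=[ElementNode(t) for t in key])
        element.styles[key] = style
    style.pages += 1
    for child_node, child_dom in zip(style.children, kids):
        merge(child_node, child_dom)


def style_diversity(element):
    """Entropy of layout usage under `element`, normalised by log base `pages`.

    Assumed measure: a value near 0 means every page presents this element
    the same way (a noise candidate); a value near 1 means the layout
    changes from page to page.
    """
    m = element.pages
    if m <= 1:
        return 1.0  # a single page gives no evidence of repetition
    h = 0.0
    for style in element.styles.values():
        p = style.pages / m
        h -= p * math.log(p, m)
    return h


# Three toy pages: an identical navigation block, a varying main block.
nav_dom = ("nav", [("a", []), ("a", [])])
site_pages = [
    ("body", [nav_dom, ("main", [("p", [])])]),
    ("body", [nav_dom, ("main", [("h1", []), ("p", [])])]),
    ("body", [nav_dom, ("main", [("table", [])])]),
]

root = ElementNode("body")
for page in site_pages:
    merge(root, page)

nav, main = root.styles[("nav", "main")].children
print(nav.tag, style_diversity(nav))    # close to 0: same layout on every page
print(main.tag, style_diversity(main))  # close to 1: layout differs per page
```

This sketch compares only the tag structure of each block. A fuller measure would presumably also take the text and links inside each block into account, which is left out here.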
Additional information
Foundation item: Supported by the National Natural Science Foundation of China (60003013)
Biography: ZHAN Cheng-li (1979-), male, Master candidate, research direction: Intelligent Information System.
Cite this article
Cheng-li, Z., Dong-yun, Y. A method of eliminating noises in Web pages by style tree model and its applications. Wuhan Univ. J. Nat. Sci. 9, 611–616 (2004). https://doi.org/10.1007/BF02831651
DOI: https://doi.org/10.1007/BF02831651