A method of eliminating noises in Web pages by style tree model and its applications

  • Semantic Web and Intelligent Web
Wuhan University Journal of Natural Sciences

Abstract

A Web page typically contains many information blocks. Besides the main content blocks, it usually includes navigation panels, copyright and privacy notices, and advertisements. We call these noisy blocks. Noise in Web pages can seriously harm Web data mining. To eliminate this noise, we introduce a new tree structure, called the Style Tree, and study an algorithm for constructing a site style tree. The Style Tree Model is then used to detect and eliminate noise in any Web page of the site. We also construct an information-based measure for determining which element nodes are noisy. In addition, applications of the method are discussed in detail. Experimental results show that our noise elimination technique improves mining results significantly.
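
The abstract does not give the construction algorithm or the noise measure, so the following is only a minimal sketch of the general idea, under stated assumptions. Pages are modeled as nested (tag, children) tuples rather than parsed HTML. The names ElementNode, StyleNode, merge, build_site_style_tree and style_entropy are hypothetical. The score is an entropy over child layouts, chosen here for illustration and not taken from the paper. Text content is ignored, so leaf nodes would need a separate content-based score in practice.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ElementNode:
    tag: str
    pages: int = 0  # number of site pages that contain this element
    styles: dict = field(default_factory=dict)  # child-tag tuple -> StyleNode


@dataclass
class StyleNode:
    pages: int = 0  # number of pages that use this child layout
    children: list = field(default_factory=list)  # one ElementNode per child


def merge(element, page_node):
    """Merge one page's subtree into the style tree rooted at `element`."""
    _tag, kids = page_node
    element.pages += 1
    key = tuple(kid[0] for kid in kids)  # the layout: sequence of child tags
    style = element.styles.get(key)
    if style is None:
        style = StyleNode(children=[ElementNode(t) for t in key])
        element.styles[key] = style
    style.pages += 1
    for child_element, kid in zip(style.children, kids):
        merge(child_element, kid)


def build_site_style_tree(pages):
    """Assumes every page shares the same root tag."""
    root = ElementNode(pages[0][0])
    for page in pages:
        merge(root, page)
    return root


def style_entropy(element):
    """Assumed measure: entropy of layouts, log base m = pages containing it.

    Near 0: every page lays this element out the same way (likely noise).
    Near 1: layouts differ from page to page (likely main content).
    """
    m = element.pages
    if m <= 1:
        return 1.0
    return -sum((s.pages / m) * math.log(s.pages / m, m)
                for s in element.styles.values())


# Four toy pages: an identical navigation block plus a varying content block.
nav = ("div", [("a", []), ("a", []), ("a", [])])
contents = [
    [("p", [])],
    [("p", []), ("p", [])],
    [("img", []), ("p", [])],
    [("table", [])],
]
pages = [("html", [("body", [nav, ("div", c)])]) for c in contents]

site = build_site_style_tree(pages)
body = site.styles[("body",)].children[0]
nav_node, main_node = body.styles[("div", "div")].children
nav_score = style_entropy(nav_node)
main_score = style_entropy(main_node)
```

In this toy site the navigation block has a single layout across all pages, while the content block has a different layout on each page, so a threshold on the score would separate the two. Where that threshold should sit is not something the abstract specifies.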

Author information

Correspondence to Zhao Cheng-li.

Additional information

Foundation item: Supported by the National Natural Science Foundation of China (60003013)

Biography: ZHAO Cheng-li (1979-), male, Master's candidate, research direction: Intelligent Information Systems.

About this article

Cite this article

Cheng-li, Z., Dong-yun, Y. A method of eliminating noises in Web pages by style tree model and its applications. Wuhan Univ. J. Nat. Sci. 9, 611–616 (2004). https://doi.org/10.1007/BF02831651

  • DOI: https://doi.org/10.1007/BF02831651
