Skip to main content

Web Page Template and Data Separation for Better Maintainability

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 11233))

Abstract

Separating a web page into template code and data records populated into the template is an important problem. This problem has a wide range of applications in web page compression and information extraction. We study this problem with the aim to separate a web page into easily maintainable template code and data records. We show that this problem is NP-hard. We then propose a heuristic algorithm to solve the problem. The main idea of our algorithm is to parse a web page into a tree and then to process it recursively in a bottom-up manner with three steps: splitting, folding, and alignment. We perform experiments on real datasets to evaluate the performance of our proposed algorithms in maximizing the maintainability of the template code produced. The experimental results show that our proposed algorithms outperform the baseline algorithms by 25% in the maintainability measure.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

References

  1. Counsell, S., et al.: Re-visiting the ‘maintainability index’ metric from an object-oriented perspective. In: SEAA, pp. 84–87 (2015)

    Google Scholar 

  2. Hammouda, K.M., Kamel, M.S.: Phrase-based document similarity based on an index graph model. In: ICDM, pp. 203–210 (2002)

    Google Scholar 

  3. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W., Bohlinger, J.D. (eds.) Complexity of Computer Computations. The IBM Research Symposia Series, pp. 85–103. Springer, Boston (1972). https://doi.org/10.1007/978-1-4684-2001-2_9

    Chapter  Google Scholar 

  4. Kayed, M., Chang, C.H.: FiVaTech: page-level web data extraction from template pages. IEEE Trans. Knowl. Data Eng. 22(2), 249–263 (2010)

    Article  Google Scholar 

  5. Liu, B., Grossman, R., Zhai, Y.: Mining data records in web pages. In: KDD, pp. 601–606 (2003)

    Google Scholar 

  6. McCabe, T.J.: A complexity measure. IEEE Trans. Softw. Eng. 4, 308–320 (1976)

    Article  MathSciNet  Google Scholar 

  7. Omari, A., Kimelfeld, B., Yahav, E., Shoham, S.: Lossless separation of web pages into layout code and data. In: KDD, pp. 1805–1814 (2016)

    Google Scholar 

  8. Pang, C., Zhang, R., Zhang, Q., Wang, J.: Dominating sets in directed graphs. Inf. Sci. 180(19), 3647–3652 (2010)

    Article  MathSciNet  Google Scholar 

  9. Rao, R.V., Savsani, V.J., Vakharia, D.: Teaching-learning-based optimization: a novel method for constrained mechanical design optimization problems. Comput.-Aided Des. 43(3), 303–315 (2011)

    Article  Google Scholar 

  10. Yamada, Y., Craswell, N., Nakatoh, T., Hirokawa, S.: Testbed for information extraction from deep web. In: WWW, pp. 346–347 (2004)

    Google Scholar 

  11. Zhai, Y., Liu, B.: Web data extraction based on partial tree alignment. In: WWW, pp. 76–85 (2005)

    Google Scholar 

Download references

Acknowledgment

This work is supported by Australian Research Council (ARC) Future Fellowships Project FT120100832 and Discovery Project DP180102050.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rui Zhang .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Zhao, C., Zhang, R., Qi, J. (2018). Web Page Template and Data Separation for Better Maintainability. In: Hacid, H., Cellary, W., Wang, H., Paik, HY., Zhou, R. (eds) Web Information Systems Engineering – WISE 2018. WISE 2018. Lecture Notes in Computer Science(), vol 11233. Springer, Cham. https://doi.org/10.1007/978-3-030-02922-7_30

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-02922-7_30

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-02921-0

  • Online ISBN: 978-3-030-02922-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics