A learning framework for information block search based on probabilistic graphical models and Fisher Kernel

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Traditional Web information retrieval methods return only a ranked list of Web pages and accept only search terms in the query. In contrast, we have developed a novel learning framework that retrieves precise information blocks from Web pages given a query, which may contain search terms together with prior information such as the layout format of the data. The problem involves two challenging sub-tasks. The first is information block detection, in which a Web page is automatically segmented into blocks. The second is finding the information blocks relevant to the query. Existing page segmentation methods use only visual layout information or only content information and do not consider the query, so the resulting segmentation can conflict with the information need the query expresses. Our framework models the query and the block features with a probabilistic graphical model to capture both keyword information and prior information. The Fisher Kernel, which can effectively incorporate the graphical model, is then employed to accomplish the two sub-tasks in a unified manner, optimizing the final goal of block retrieval performance. We have conducted experiments on benchmark datasets and real-world data, and compared our framework with existing methods to evaluate its effectiveness.
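
To make the role of the Fisher Kernel concrete, the following is a minimal sketch and not the model used in the article. It assumes a much simpler generative model, a smoothed unigram distribution over block terms, in place of the probabilistic graphical model, and it uses the common simplification of replacing the Fisher information matrix with the identity, so the kernel is just the dot product of two Fisher scores. The function names (`fit_unigram`, `fisher_score`, `fisher_kernel`) and the toy blocks are hypothetical.

```python
from collections import Counter


def fit_unigram(blocks, vocab, alpha=1.0):
    """Estimate add-alpha smoothed term probabilities over a fixed vocabulary."""
    counts = Counter(t for block in blocks for t in block if t in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}


def fisher_score(tokens, theta):
    """Gradient of log p(tokens | theta) w.r.t. the softmax parameters.

    For a unigram model this is n_w - N * theta_w for each term w, where
    n_w is the count of w and N is the number of in-vocabulary tokens.
    """
    counts = Counter(t for t in tokens if t in theta)
    n = sum(counts.values())
    return {t: counts[t] - n * p for t, p in theta.items()}


def fisher_kernel(a, b, theta):
    """Dot product of Fisher scores (identity used for the information matrix)."""
    ua, ub = fisher_score(a, theta), fisher_score(b, theta)
    return sum(ua[t] * ub[t] for t in theta)


# Toy data (assumed): three already-segmented blocks and a one-term query.
blocks = [
    ["price", "usd", "price"],
    ["contact", "email"],
    ["price", "discount"],
]
vocab = sorted({t for block in blocks for t in block})
theta = fit_unigram(blocks, vocab)
query = ["price"]

# Rank block indices by kernel similarity to the query, highest first.
ranking = sorted(
    range(len(blocks)),
    key=lambda i: fisher_kernel(query, blocks[i], theta),
    reverse=True,
)
```

The sketch only illustrates how a generative model yields a similarity measure between a query and a block. Per the abstract, the framework itself also handles segmentation and layout priors, which this example does not attempt.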

Notes

  1. Details of the DOM can be found at http://www.w3.org/DOM/

  2. VIPS can be obtained from http://www.cad.zju.edu.cn/home/dengcai/VIPS/VIPS.html

Acknowledgements

The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. 14203414 and Project No. UGC/FDS11/E06/14).

Author information

Corresponding author

Correspondence to Tak-Lam Wong.

About this article

Cite this article

Wong, TL., Xie, H., Lam, W. et al. A learning framework for information block search based on probabilistic graphical models and Fisher Kernel. Int. J. Mach. Learn. & Cyber. 9, 1473–1487 (2018). https://doi.org/10.1007/s13042-017-0657-9

  • DOI: https://doi.org/10.1007/s13042-017-0657-9
