DOI: 10.1145/2835776.2835778
Research article, public access

Long-tail Vocabulary Dictionary Extraction from the Web

Published: 8 February 2016

ABSTRACT

A dictionary (a set of instances belonging to the same conceptual class) is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries given a few example seeds, but methods to date cannot obtain long-tail (rare) items with high accuracy and recall.

In this paper, we develop a novel method to construct high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Our algorithm obtains long-tail items by building and executing high-quality webpage-specific extractors. These extractors use webpage-specific structural and textual information to detect the long-tail items on a single webpage more accurately, and they are obtained via a co-training procedure using distantly supervised training data. By aggregating the page-specific dictionaries of many webpages, our system, Lyretail, outputs a high-quality, comprehensive dictionary.
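
The final aggregation step can be illustrated with a minimal sketch. The abstract does not state how Lyretail combines page-specific dictionaries, so the scoring rule here (sum of per-page confidences after case normalization) and the names `aggregate_dictionaries` and `pages` are assumptions made purely for illustration.

```python
from collections import defaultdict


def aggregate_dictionaries(page_dicts):
    """Merge per-page dictionaries into one ranked list of (item, score).

    page_dicts: list of {item: confidence} mappings, one per webpage.
    ASSUMPTION: summing per-page confidences is an illustrative rule,
    not the aggregation rule described in the paper.
    """
    totals = defaultdict(float)
    for page in page_dicts:
        for item, confidence in page.items():
            # Normalize surface form so "Toyota" and "toyota" are merged.
            totals[item.strip().lower()] += confidence
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical per-page extractor outputs for a "car makers" topic.
pages = [
    {"Honda": 0.9, "Toyota": 0.8, "Contact us": 0.2},
    {"toyota": 0.7, "Koenigsegg": 0.6},
    {"Honda": 0.9, "Koenigsegg": 0.5},
]
ranked = aggregate_dictionaries(pages)
for item, score in ranked:
    print(f"{item}\t{score:.2f}")
```

Under this rule, a rare item such as "koenigsegg" that several page-specific extractors agree on outranks page noise such as "contact us", which appears once with low confidence.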

Our experiments demonstrate that, in long-tail vocabulary settings, our method achieves a 17.3% improvement in mean average precision for dictionary generation and a 30.7% improvement in F1 for page-specific extraction, compared to previous state-of-the-art methods.
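
For readers unfamiliar with the two metrics, the sketch below shows their standard textbook definitions. It is not the paper's evaluation code; in particular, normalizing average precision by the number of relevant items is one common convention and is an assumption here, as are the function names.

```python
def average_precision(ranked, relevant):
    """Average precision of a ranked list against a set of relevant items.

    ASSUMPTION: the sum of precision-at-hit values is divided by the total
    number of relevant items (a common convention).
    """
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over two item sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: two of three extracted items are correct, one gold item is missed.
print(average_precision(["a", "x", "b"], {"a", "b"}))
print(f1_score(["a", "b", "x"], ["a", "b", "c"]))
```

Mean average precision rewards ranking correct dictionary entries near the top, while F1 treats the page-specific extraction as an unordered set and balances precision against recall.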

      Published in

      WSDM '16: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining
      February 2016
      746 pages
      ISBN: 9781450337168
      DOI: 10.1145/2835776

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      WSDM '16 paper acceptance rate: 67 of 368 submissions, 18%
      Overall acceptance rate: 498 of 2,863 submissions, 17%
