Long-tail Vocabulary Dictionary Extraction from the Web

Authors:
Zhe Chen, University of Michigan, Ann Arbor, MI, USA
Michael Cafarella, University of Michigan, Ann Arbor, MI, USA
H. V. Jagadish, University of Michigan, Ann Arbor, MI, USA

WSDM '16: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, February 2016, Pages 625–634. https://doi.org/10.1145/2835776.2835778

Published: 08 February 2016

ABSTRACT

A dictionary --- a set of instances belonging to the same conceptual class --- is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries given a few example seeds, but methods to date cannot obtain long-tail (rare) items with high accuracy and recall.

In this paper, we develop a novel method to construct high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Our system, Lyretail, obtains long-tail (i.e., rare) items by building and executing high-quality webpage-specific extractors. It uses webpage-specific structural and textual information to build more accurate per-page extractors that detect the long-tail items on a single webpage. These extractors are obtained via a co-training procedure using distantly supervised training data. By aggregating the page-specific dictionaries of many webpages, Lyretail outputs a high-quality, comprehensive dictionary.
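The final aggregation step can be illustrated with a minimal sketch. The abstract does not specify how page-specific dictionaries are combined, so the scoring rule here (summing per-page confidences after light normalization) and the names `aggregate_page_dictionaries` and `page_dicts` are assumptions for illustration only, not the paper's method.

```python
from collections import defaultdict


def aggregate_page_dictionaries(page_dicts):
    """Merge per-page {item: confidence} maps into one ranked list.

    Hypothetical rule (not from the paper): an item's score is the sum
    of its confidences over all pages on which it was extracted.
    """
    totals = defaultdict(float)
    for page in page_dicts:
        for item, confidence in page.items():
            # Light normalization so "Toyota" and "toyota" are merged.
            totals[item.strip().lower()] += confidence
    # Highest score first; ties are broken alphabetically.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy per-page outputs for a hypothetical "car makers" topic.
pages = [
    {"Honda": 0.9, "Toyota": 0.8},
    {"toyota": 0.7, "Koenigsegg": 0.6},
    {"Koenigsegg": 0.5, "Contact us": 0.1},
]
ranked = aggregate_page_dictionaries(pages)
ranked_items = [item for item, _ in ranked]
```

Under this rule, a rare item that appears on only a few pages can still rank well if the per-page extractors are confident about it, while page noise with low confidence sinks to the bottom.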

Our experiments demonstrate that in long-tail vocabulary settings, we obtain a 17.3% improvement in mean average precision for the dictionary generation process, and a 30.7% improvement in F1 for the page-specific extraction, compared to previous state-of-the-art methods.
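For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. The abstract does not give the exact definitions used in the experiments, so the normalization of average precision by the number of gold items, and the function names, are assumptions.

```python
def average_precision(ranked, relevant):
    """Average precision of a ranked list against a set of gold items."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Assumed convention: normalize by the number of gold items.
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, gold_set) pairs."""
    return sum(average_precision(r, g) for r, g in runs) / len(runs)


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over two item sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: one ranked dictionary and one page-level extraction.
ap = average_precision(["a", "x", "b"], {"a", "b"})
f1 = f1_score({"a", "b", "x"}, {"a", "b", "c"})
```

Mean average precision rewards rankings that place correct items early, which suits a ranked dictionary; F1 treats the page-specific extraction as an unranked set.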

Index Terms

Information systems → Information systems applications → Data mining

Published in
WSDM '16: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining
February 2016, 746 pages
ISBN: 9781450337168
DOI: 10.1145/2835776
General Chairs: Paul N. Bennett (Microsoft Research), Vanja Josifovski (Pinterest)
Program Chairs: Jennifer Neville (Purdue University), Filip Radlinski (Microsoft)
Copyright © 2016 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
information extraction
long-tail dictionary
set expansion
Qualifiers
- research-article
Acceptance Rates
WSDM '16 Paper Acceptance Rate: 67 of 368 submissions, 18%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%