ABSTRACT
A dictionary, a set of instances belonging to the same conceptual class, is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries given a few example seeds, but methods to date cannot obtain long-tail (rare) items with high accuracy and recall.
In this paper, we develop a novel method, Lyretail, to construct high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Our algorithm obtains long-tail items by building and executing high-quality webpage-specific extractors, which use each page's structural and textual information to detect the long-tail items on that single webpage. These extractors are trained via a co-training procedure on distantly supervised training data. By aggregating the page-specific dictionaries of many webpages, Lyretail outputs a high-quality, comprehensive dictionary.
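The final aggregation step can be illustrated with a minimal sketch. The abstract does not say how page-specific dictionaries are combined, so everything below is an assumption made for illustration: the function name `aggregate_page_dictionaries`, the per-item confidence values, the lowercase normalization, and the sum-of-confidences scoring rule are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def aggregate_page_dictionaries(page_dicts, min_pages=1):
    """Merge per-page extractions into one ranked dictionary.

    page_dicts maps a page id to {item: confidence in [0, 1]}, standing in
    for the output of hypothetical per-page extractors. The scoring rule
    (sum of per-page confidences) is an illustrative assumption.
    """
    score = defaultdict(float)
    support = defaultdict(int)
    for extracted in page_dicts.values():
        # Normalize surface forms and keep the best confidence per page,
        # so one page never counts twice for the same item.
        per_page = {}
        for item, conf in extracted.items():
            key = item.strip().lower()
            per_page[key] = max(conf, per_page.get(key, 0.0))
        for key, conf in per_page.items():
            score[key] += conf
            support[key] += 1
    kept = [(key, s) for key, s in score.items() if support[key] >= min_pages]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


pages = {
    "page_a": {"Fender Jaguar": 0.9, "Gibson SG": 0.8},
    "page_b": {"gibson sg": 0.7, "Teisco Spectrum 5": 0.6},
}
for item, s in aggregate_page_dictionaries(pages):
    print(f"{item}\t{s:.2f}")
```

In this sketch `min_pages` defaults to 1 because a long-tail item may appear on only a single page; raising the threshold would trade recall on rare items for precision.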
Our experiments demonstrate that, in long-tail vocabulary settings, our method achieves a 17.3% improvement in mean average precision for dictionary generation and a 30.7% improvement in F1 for page-specific extraction, compared to previous state-of-the-art methods.
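For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. It uses the textbook definitions (average precision normalized by the number of relevant items, and set-based F1); the abstract does not specify the exact evaluation protocol, so the function names and the toy data are assumptions for illustration only.

```python
def average_precision(ranked, relevant):
    """Average precision of a ranked list against a set of relevant items."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, g) for r, g in runs) / len(runs)


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over two item sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: two of three ranked items are correct.
print(average_precision(["a", "x", "b"], {"a", "b"}))
print(f1_score(["a", "x"], ["a", "b"]))
```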
Index Terms
- Long-tail Vocabulary Dictionary Extraction from the Web