Entity Extraction with Knowledge from Web Scale Corpora

Wen, Zeyi; Huang, Zeyu; Zhang, Rui

doi:10.1007/978-3-030-39469-1_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12008)

Included in the following conference series:

Australasian Database Conference

Abstract

Entity extraction is an important task in text mining and natural language processing. A popular method for entity extraction is to compare substrings from free text against a dictionary of entities. In this paper, we present several techniques that serve as a post-processing step to improve the effectiveness of the existing entity extraction technique. These techniques utilise models trained on web-scale corpora, which makes them robust and versatile. Experiments show that our techniques bring a notable improvement in efficiency and effectiveness.
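
The dictionary-based approach mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `extract_entities`, whitespace tokenisation, case-insensitive exact matching, and the `max_len` limit are not taken from the paper, and the abstract does not say which matching criterion or post-processing models the authors use.

```python
def extract_entities(text, dictionary, max_len=3):
    """Return (start, end, entity) for every token span found in the dictionary.

    Illustrative only: exact, case-insensitive matching on whitespace tokens.
    """
    tokens = text.lower().split()
    entries = {entry.lower() for entry in dictionary}
    matches = []
    for start in range(len(tokens)):
        # Try every span of up to max_len tokens beginning at `start`.
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            candidate = " ".join(tokens[start:end])
            if candidate in entries:
                matches.append((start, end, candidate))
    return matches


hits = extract_entities("I flew from New York to Perth", ["New York", "Perth", "York"])
print(hits)  # [(3, 5, 'new york'), (4, 5, 'york'), (6, 7, 'perth')]
```

The overlapping match "york" inside "new york" is the kind of candidate that a post-processing step could filter out; how the paper's techniques handle such cases is not described in the abstract.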

Author information

Authors and Affiliations

The University of Western Australia, Perth, Australia
Zeyi Wen
The University of Melbourne, Melbourne, Australia
Zeyu Huang & Rui Zhang

Corresponding author

Correspondence to Zeyi Wen.

Editor information

Editors and Affiliations

University of Melbourne, Parkville, Australia
Renata Borovica-Gajic
School of Computing and Information Systems, University of Melbourne, Parkville, VIC, Australia
Jianzhong Qi
Monash University, Clayton, Australia
Weiqing Wang

About this paper

Cite this paper

Wen, Z., Huang, Z., Zhang, R. (2020). Entity Extraction with Knowledge from Web Scale Corpora. In: Borovica-Gajic, R., Qi, J., Wang, W. (eds) Databases Theory and Applications. ADC 2020. Lecture Notes in Computer Science, vol 12008. Springer, Cham. https://doi.org/10.1007/978-3-030-39469-1_14

DOI: https://doi.org/10.1007/978-3-030-39469-1_14
Published: 21 January 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39468-4
Online ISBN: 978-3-030-39469-1
eBook Packages: Computer Science, Computer Science (R0)
