Entity Extraction with Knowledge from Web Scale Corpora

  • Conference paper
Databases Theory and Applications (ADC 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12008)

Abstract

Entity extraction is an important task in text mining and natural language processing. A popular method for entity extraction is to compare substrings of free text against a dictionary of entities. In this paper, we present several techniques that serve as a post-processing step to improve the effectiveness of this existing dictionary-based entity extraction technique. These techniques utilise models trained on web-scale corpora, which makes them robust and versatile. Experiments show that our techniques bring a notable improvement in both efficiency and effectiveness.
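
To make the dictionary-based approach mentioned in the abstract concrete, the following is a minimal sketch of comparing substrings of free text against a dictionary of entities. It is not the authors' algorithm: it assumes exact, case-insensitive matching on word tokens with a longest-match-first policy, and the names `tokenize`, `extract_entities`, and `max_tokens` are hypothetical, introduced only for illustration.

```python
import re


def tokenize(text):
    """Split text into lower-case word tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"\w+", text.lower())


def extract_entities(text, dictionary, max_tokens=4):
    """Return (start_token, end_token, entity) triples for dictionary entries in text.

    Assumption: exact token-level matching only. Approximate matching and any
    post-processing with learned models are outside this sketch.
    """
    # Index dictionary entries by their token tuples for constant-time lookup.
    lookup = {tuple(tokenize(entity)): entity for entity in dictionary}
    tokens = tokenize(text)
    matches = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate substring first, so that
        # "new york city" is preferred over "new york".
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in lookup:
                matches.append((i, i + n, lookup[candidate]))
                i += n  # skip past the matched span
                matched = True
                break
        if not matched:
            i += 1
    return matches


if __name__ == "__main__":
    entities = ["New York", "New York City", "Melbourne"]
    sentence = "I flew from New York City to Melbourne last week."
    for start, end, entity in extract_entities(sentence, entities):
        print(start, end, entity)
```

A matcher of this kind returns every dictionary hit regardless of context, which is the sort of output a post-processing step could then filter or re-rank.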

Author information

Corresponding author

Correspondence to Zeyi Wen.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Wen, Z., Huang, Z., Zhang, R. (2020). Entity Extraction with Knowledge from Web Scale Corpora. In: Borovica-Gajic, R., Qi, J., Wang, W. (eds) Databases Theory and Applications. ADC 2020. Lecture Notes in Computer Science, vol 12008. Springer, Cham. https://doi.org/10.1007/978-3-030-39469-1_14

  • DOI: https://doi.org/10.1007/978-3-030-39469-1_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-39468-4

  • Online ISBN: 978-3-030-39469-1

  • eBook Packages: Computer Science, Computer Science (R0)
