Auto-labelling entities in low-resource text: a geological case study

  • Regular Paper
Knowledge and Information Systems

Abstract

Studies on named entity recognition (NER) often require a substantial amount of human-annotated training data. This makes domain-specific NER on technical industry data especially challenging, as labelled data are scarce. Although the surface language is English, the jargon and writing conventions of technical documents pose the same challenges as a low-resource language, where techniques such as transfer learning hardly work. Reducing labour-intensive annotation through automatic labelling is thus an important research topic, one that seeks ways to obtain labelled data quickly and consistently. In this work, we propose an iterative deep learning NER framework that uses distant supervision to label domain-specific datasets automatically. The framework was applied to mineral exploration reports and produced a large BIO-annotated dataset with six geological categories. This quality-labelled dataset, OzROCK, is made publicly available to support future research on technical domain NER. Experimental results demonstrated the effectiveness of this approach, which was further confirmed by domain experts. Its generalisation ability was verified by applying the framework to two other datasets: one for disease names and the other for chemical names. Overall, our approach can effectively reduce annotation effort by identifying a much smaller subset that is challenging for automatic labelling and thus requires attention from human experts.
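
To make the terms "distant supervision" and "BIO-annotated" concrete, the following is a minimal sketch of dictionary-based labelling, assuming a greedy longest-match lookup over lower-cased tokens. It is not the iterative deep learning framework proposed in the paper; the function name `bio_label`, the dictionary entries and the category names `ROCK` and `MINERAL` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def bio_label(tokens: List[str], dictionary: Dict[Tuple[str, ...], str]) -> List[str]:
    """Tag tokens with BIO labels by greedy longest-match dictionary lookup."""
    max_len = max((len(key) for key in dictionary), default=0)
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word terms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = dictionary.get(tuple(lowered[i:i + n]))
            if label is not None:
                tags[i] = f"B-{label}"          # first token of the entity
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"      # remaining tokens of the entity
                i += n
                matched = True
                break
        if not matched:
            i += 1                              # token stays outside any entity ("O")
    return tags


# Hypothetical dictionary entries and category names, for illustration only.
DICTIONARY = {
    ("banded", "iron", "formation"): "ROCK",
    ("quartz",): "MINERAL",
    ("gold",): "MINERAL",
}

tokens = "Gold occurs in quartz veins within banded iron formation".split()
tags = bio_label(tokens, DICTIONARY)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Labels produced this way are only as good as the dictionary: terms missing from it are tagged `O`, and ambiguous terms are tagged regardless of context. This is the kind of noise that a learned model and targeted expert review are meant to address.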

Notes

  1. https://www.clips.uantwerpen.be/conll2003/ner/.

  2. https://catalog.ldc.upenn.edu/LDC2013T19.

  3. http://www.dmp.wa.gov.au/Geological-Survey/Mineral-exploration-Reports-1401.aspx.

  4. https://github.com/majiga/OzROCK.

  5. https://www.dmp.wa.gov.au/Explanatory-Notes-System-ENS-15063.aspx.

  6. https://www.mindat.org/.

  7. http://www.geonames.org.

  8. https://www.ga.gov.au/data-pubs/datastandards/stratigraphic-units.

  9. https://en.wikipedia.org.

Acknowledgements

We thank the Geological Survey and Resource Strategy Division (GSRSD) of the Department of Mines, Industry Regulation and Safety in Western Australia for assistance with accessing the WAMEX dataset and the GSRSD Explanatory Notes System database. Paul Duuring publishes with permission from the Executive Director of the Geological Survey of Western Australia.

Author information

Corresponding author

Correspondence to Majigsuren Enkhsaikhan.

About this article

Cite this article

Enkhsaikhan, M., Liu, W., Holden, EJ. et al. Auto-labelling entities in low-resource text: a geological case study. Knowl Inf Syst 63, 695–715 (2021). https://doi.org/10.1007/s10115-020-01532-6

  • DOI: https://doi.org/10.1007/s10115-020-01532-6
