Citation Field Learning by RNN with Limited Training Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11154)

Abstract

Citation field learning segments a plain-text citation string into fields of interest such as author, title, and venue. We are interested in citation field learning from researchers’ homepages. This task is challenging because homepage creators use free-form citation styles. We address the challenge with neural-network-based approaches that learn citation field styles automatically. Such approaches are data-hungry, but manually labeled training data is expensive to obtain. We therefore propose a novel framework that uses auto-generated training data and domain adaptation to enhance a manually labeled training dataset of limited size. We also design an adaptive Recurrent Neural Network (RNN) to learn citation styles effectively from the enhanced training data. Extensive experiments show that the proposed methods outperform state-of-the-art methods for citation field learning.
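
To make the task format concrete, the following is a minimal sketch of what a segmented citation might look like once every token carries a field label. It is not the paper's model: the labels are hand-assigned for illustration, the `year` field and the `segment` helper are assumptions introduced here, and a learned tagger such as an RNN would have to predict the labels.

```python
from itertools import groupby

# Hypothetical input: one citation string split on whitespace, with a
# hand-assigned field label per token (a trained tagger would predict these).
tokens = ["Daumé", "III,", "H.:", "Frustratingly", "easy", "domain",
          "adaptation.", "In:", "ACL", "(2007)"]
labels = ["author", "author", "author", "title", "title", "title",
          "title", "venue", "venue", "year"]


def segment(tokens, labels):
    """Merge consecutive tokens that share a label into (field, text) pairs."""
    pairs = zip(labels, tokens)
    return [
        (field, " ".join(tok for _, tok in group))
        for field, group in groupby(pairs, key=lambda p: p[0])
    ]


fields = segment(tokens, labels)
for field, text in fields:
    print(f"{field:>6}: {text}")
```

Grouping consecutive labels keeps the sketch independent of any particular tagging scheme; a BIO-style scheme would need an extra step to strip the `B-`/`I-` prefixes before grouping.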

Notes

  1. http://citationstyles.org/styles/.

  2. https://github.com/brechtm/citeproc-py.

  3. http://www.arc.gov.au/rfcd-seo-and-anzsic-codes.

  4. Datasets available at: https://github.com/yiqingzhang/citationlearn.

  5. The \(\dagger\) symbol in the tables denotes that the result is statistically significantly better than all the baselines, with \(p < .01\) based on a t-test.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

Corresponding author

Correspondence to Yiqing Zhang.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, Y., Dai, Y., Qi, J., Xu, X., Zhang, R. (2018). Citation Field Learning by RNN with Limited Training Data. In: Ganji, M., Rashidi, L., Fung, B., Wang, C. (eds) Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 11154. Springer, Cham. https://doi.org/10.1007/978-3-030-04503-6_23

  • DOI: https://doi.org/10.1007/978-3-030-04503-6_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04502-9

  • Online ISBN: 978-3-030-04503-6

  • eBook Packages: Computer Science (R0)