Citation Field Learning by RNN with Limited Training Data

Zhang, Yiqing; Dai, Yimeng; Qi, Jianzhong; Xu, Xinxing; Zhang, Rui

doi:10.1007/978-3-030-04503-6_23

Conference paper
First Online: 21 November 2018

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11154)

Abstract

Citation field learning segments a plain-text citation string into fields of interest such as author, title, and venue. We are interested in citation field learning from researchers’ homepages. This task is challenging because homepage creators use free-form citation styles. We address the challenge with neural network based approaches that learn the citation field styles automatically. Such approaches are data-hungry, but manually labeled training data is expensive to obtain. We therefore propose a novel framework that uses auto-generated training data and domain adaptation to enhance a manually labeled training dataset of limited size. We also design an adaptive Recurrent Neural Network (RNN) to learn citation styles effectively from the enhanced training data. Extensive experiments show that the proposed methods outperform state-of-the-art methods for citation field learning.
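
To make the idea of auto-generated training data concrete, the following is a minimal sketch and not the authors' implementation. It assumes a structured bibliographic record and a handful of style templates, all of which are hypothetical (`RECORD`, `STYLES`, `render_labeled`, and `generate` are illustrative names). Rendering the same record in different styles yields citation strings whose per-token field labels are known for free, which is the kind of labeled sequence a tagging model could train on.

```python
import random

# Hypothetical structured record (illustrative values only).
RECORD = {
    "author": "J. Smith and A. Lee",
    "title": "Learning to parse citations",
    "venue": "PAKDD",
    "year": "2018",
}

# Hypothetical style templates: each style is an ordered list of
# (field name, separator appended to the last word of that field).
STYLES = [
    [("author", "."), ("title", "."), ("venue", ","), ("year", ".")],
    [("title", ","), ("author", ","), ("year", ","), ("venue", "")],
    [("author", ""), ("year", "."), ("title", "."), ("venue", ".")],
]


def render_labeled(record, style):
    """Render a record in one style; return (tokens, labels), aligned one-to-one."""
    tokens, labels = [], []
    for field, sep in style:
        words = record[field].split()
        if sep:
            words[-1] += sep  # attach the separator to the field's last word
        tokens.extend(words)
        labels.extend([field] * len(words))
    return tokens, labels


def generate(record, styles, n, seed=0):
    """Produce n labeled citation strings by sampling styles at random."""
    rng = random.Random(seed)
    return [render_labeled(record, rng.choice(styles)) for _ in range(n)]


if __name__ == "__main__":
    for tokens, labels in generate(RECORD, STYLES, 3):
        print(" ".join(tokens))
        print(" ".join(labels))
```

Each generated pair gives one training sequence: the tokens are the model input and the labels are the target field tags. Real homepages will contain styles that no template covers, which is presumably why the framework combines such synthetic data with a small manually labeled set through domain adaptation.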

Notes

1. http://citationstyles.org/styles/
2. https://github.com/brechtm/citeproc-py
3. http://www.arc.gov.au/rfcd-seo-and-anzsic-codes
4. Datasets available at: https://github.com/yiqingzhang/citationlearn
5. The \(\dagger\) symbol in the tables denotes that the result is statistically significantly better than all the baselines, with \(p < .01\) based on a t-test.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

Author information

Authors and Affiliations

School of CIS, The University of Melbourne, Parkville, Australia
Yiqing Zhang, Yimeng Dai, Jianzhong Qi & Rui Zhang
Institute of High Performance Computing, A*STAR, Singapore, Singapore
Xinxing Xu

Corresponding author

Correspondence to Yiqing Zhang.

Editor information

Editors and Affiliations

University of Melbourne, Melbourne, VIC, Australia
Mohadeseh Ganji
University of Melbourne, Melbourne, VIC, Australia
Lida Rashidi
McGill University, Montreal, QC, Canada
Benjamin C. M. Fung
Griffith University, Gold Coast, QLD, Australia
Can Wang

Cite this paper

Zhang, Y., Dai, Y., Qi, J., Xu, X., Zhang, R. (2018). Citation Field Learning by RNN with Limited Training Data. In: Ganji, M., Rashidi, L., Fung, B., Wang, C. (eds) Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 11154. Springer, Cham. https://doi.org/10.1007/978-3-030-04503-6_23

DOI: https://doi.org/10.1007/978-3-030-04503-6_23
Published: 21 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04502-9
Online ISBN: 978-3-030-04503-6
eBook Packages: Computer Science, Computer Science (R0)
