research-article

Automatic keyphrase extraction from scientific documents using N-gram filtration technique

Authors:
Niraj Kumar, IIIT-Hyderabad, Hyderabad, India
Kannan Srinathan, IIIT-Hyderabad, Hyderabad, India

DocEng '08: Proceedings of the eighth ACM symposium on Document engineering, September 2008, Pages 199–208, https://doi.org/10.1145/1410140.1410180

Published: 16 September 2008

ABSTRACT

In this paper we present an automatic keyphrase extraction technique for English documents in the scientific domain. The devised algorithm uses an n-gram filtration technique, which filters sophisticated n-grams (1 ≤ n ≤ 4), along with their weights, from the words of the input document. To develop the n-gram filtration technique, we have used (1) an LZ78 data-compression-based technique, (2) a simple refinement step, (3) a simple pattern filtration algorithm, and (4) a term weighting scheme. In the term weighting scheme, we introduce the importance of the position of the sentence (where a given phrase first occurs) in the document and the position of the phrase within the sentence, for documents of the scientific domain (which are typically more organized than those of other domains). The entire system is based upon statistical observations, simple grammatical facts, heuristics, and lexical information of the English language. We remark that the devised system does not require a learning phase. Our experimental results with a publicly available text dataset show that the devised system is comparable with other known algorithms.
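
The abstract names an LZ78-based step and two positional cues but does not spell out the procedure. The following is a minimal sketch, not the authors' implementation: it assumes LZ78 dictionary building is applied to word tokens rather than characters, caps phrases at four words, and looks up where a phrase first occurs. The function names `lz78_word_ngrams` and `first_occurrence` are hypothetical, and no weighting formula is shown because the abstract does not give one.

```python
def lz78_word_ngrams(tokens, max_n=4):
    """LZ78-style parse over word tokens (an assumption of this sketch).

    Returns the phrases, as tuples of words, in the order they enter the
    dictionary. No phrase is longer than max_n words.
    """
    dictionary = set()
    phrases = []
    current = ()  # longest known phrase matched so far
    for tok in tokens:
        candidate = current + (tok,)
        if candidate in dictionary and len(candidate) < max_n:
            # Known phrase that may still grow: keep extending it.
            current = candidate
        else:
            # New phrase (or length cap reached): record it and restart.
            if candidate not in dictionary:
                dictionary.add(candidate)
                phrases.append(candidate)
            current = ()
    return phrases


def first_occurrence(sentences, phrase):
    """Return (sentence_index, word_offset) of the first occurrence of
    phrase in a list of tokenized sentences, or None if it never occurs."""
    n = len(phrase)
    for s_idx, sent in enumerate(sentences):
        for w_idx in range(len(sent) - n + 1):
            if tuple(sent[w_idx:w_idx + n]) == phrase:
                return (s_idx, w_idx)
    return None


if __name__ == "__main__":
    words = "data compression data compression data compression technique".split()
    print(lz78_word_ngrams(words))
    # [('data',), ('compression',), ('data', 'compression'),
    #  ('data', 'compression', 'technique')]
```

Repeated word sequences grow into longer dictionary entries, which is why a compression-style parse can surface candidate multiword phrases. The two positions returned by `first_occurrence` correspond to the cues the abstract mentions; how they are combined into a weight is left open here.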

Index Terms

Automatic keyphrase extraction from scientific documents using N-gram filtration technique
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction
  2. Information storage systems

Published in
DocEng '08: Proceedings of the eighth ACM symposium on Document engineering
September 2008
312 pages
ISBN:9781605580814
DOI:10.1145/1410140
General Chair:
Maria de Graça Pimentel, Universidade de São Paulo, Brazil
Program Chairs:
Dick C.A. Bulterman, CWI and VU, Netherlands
Luis Fernando Gomes Soares, PUC-Rio, Brazil
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
information extraction
information retrieval
keyphrase extraction
scientific domain
Qualifiers
- research-article
Acceptance Rates
DocEng '08 Paper Acceptance Rate: 21 of 62 submissions, 34%. Overall Acceptance Rate: 178 of 537 submissions, 33%.