Character N-Gram Tokenization for European Language Text Retrieval

McNamee, Paul; Mayfield, James

doi:10.1023/B:INRT.0000009441.78971.be

Published: January 2004

Information Retrieval, Volume 7, pages 73–97 (2004)

Abstract

The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n = 4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.
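
As an illustration of the overlapping character n-gram tokenization described above, here is a minimal Python sketch. It is not the authors' implementation: the function name `char_ngrams`, the case folding, the whitespace collapsing, and the choice to let n-grams span word boundaries are all assumptions made for this example. Only the default of n = 4 comes from the abstract.

```python
import re


def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of `text`, in order.

    Assumptions (not taken from the paper): text is lowercased, runs of
    whitespace are collapsed to a single space, and n-grams are allowed
    to span word boundaries.
    """
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    if not normalized:
        return []
    if len(normalized) < n:
        # Hypothetical fallback: keep a short string as one token.
        return [normalized]
    # Slide a window of width n one character at a time.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


grams = char_ngrams("text retrieval")
print(len(grams), grams[:3])
```

Under these assumptions, a 14-character string yields 14 - 4 + 1 = 11 index terms where word tokenization would yield 2. That gives a rough sense of why n-gram indexing raises storage and time requirements.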

Author information

Authors and Affiliations

Applied Physics Laboratory, Johns Hopkins University, 11100 Johns Hopkins Road, Laurel, MD, 20723-6099, USA
Paul McNamee & James Mayfield

About this article

Cite this article

McNamee, P., Mayfield, J. Character N-Gram Tokenization for European Language Text Retrieval. Information Retrieval 7, 73–97 (2004). https://doi.org/10.1023/B:INRT.0000009441.78971.be

Issue Date: January 2004
DOI: https://doi.org/10.1023/B:INRT.0000009441.78971.be
