Abstract
The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n = 4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.
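As a rough illustration of the overlapping character n-gram tokenization described above, the following minimal Python sketch slides a window of n = 4 characters across a string. The function name `char_ngrams` is hypothetical, and the normalization choices (lowercasing, collapsing whitespace, padding with one space on each side, letting n-grams span word boundaries) are assumptions made for the example; the abstract does not specify them.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of `text` (illustrative sketch)."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    words = text.lower().split()
    if not words:
        return []
    # Assumed padding: one space on each side, so word-initial and
    # word-final n-grams are distinguishable from word-internal ones.
    normalized = " " + " ".join(words) + " "
    # Text shorter than n yields a single, shorter token.
    if len(normalized) < n:
        return [normalized]
    # Slide a window of width n one character at a time.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("text retrieval"))
```

Because consecutive n-grams overlap by n - 1 characters, a text of length L produces roughly L tokens rather than L divided by the average word length, which is consistent with the increased storage and time requirements the abstract mentions.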
McNamee, P., Mayfield, J. Character N-Gram Tokenization for European Language Text Retrieval. Information Retrieval 7, 73–97 (2004). https://doi.org/10.1023/B:INRT.0000009441.78971.be