ABSTRACT
A trigram is a three-element sequence of characters. In this paper we demonstrate the effectiveness of a trigram-based index for morphologically based retrievals from a full-text document retrieval system. Retrieved documents are considered relevant if they contain exact matches for each of the query terms. Under this definition of relevance we consistently achieve a recall rate of 100%. In the experiments described here, we used sets of 100 ANDed three-term queries, and the average precision per set varied from 47% to 87%. We propose a method for increasing the average precision to 100%. Using overlapping trigrams extracted from the Brown Corpus [KUCE67] and a character set of 45 elements, we found that the number of entries in a trigram-based index approaches a horizontal asymptote near 11,000. Finally, we show that a trigram-based system provides a reasonable alternative to a word-based one and is superior to it for retrieving word fragments.
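The retrieval scheme summarized in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`trigrams`, `build_index`, `candidates`, `verify`), the lowercasing step, and the toy documents are all assumptions made for illustration. The final exact-match filter is one plausible way to raise precision to 100%; the abstract does not say what the proposed method is.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of overlapping three-character sequences in text."""
    text = text.lower()  # assumption: case-insensitive matching
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def candidates(index, docs, terms):
    """Documents containing every trigram of every query term (ANDed)."""
    result = set(docs)
    for term in terms:
        for gram in trigrams(term):
            result &= index.get(gram, set())
    return result


def verify(docs, doc_ids, terms):
    """Keep only candidates that contain each term as an exact substring."""
    return {d for d in doc_ids
            if all(t.lower() in docs[d].lower() for t in terms)}


# Hypothetical toy collection.
docs = {
    1: "the cat sat on the mat",
    2: "abc bcd",  # holds every trigram of "abcd" but never "abcd" itself
    3: "abcd",
}
index = build_index(docs)
hits = candidates(index, docs, ["abcd"])   # trigram lookup only
exact = verify(docs, hits, ["abcd"])       # exact-match post-filter
```

In this toy case the trigram lookup returns documents 2 and 3, while only document 3 contains the query term exactly. No relevant document is missed, since any document containing a term also contains all of its trigrams, which is consistent with the 100% recall reported; the false hit on document 2 shows why precision can fall below 100% before verification. For scale, a 45-character set admits at most 45^3 = 91,125 distinct trigrams, so an asymptote near 11,000 corresponds to roughly 12% of the possible entries.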
- ADAM92. Adams, Elizabeth, "A Study of Trigrams and Their Feasibility as Index Terms in a Full Text Information Retrieval System", D.Sc. dissertation, George Washington University, 1992.
- COML90. Comlekoglu, Fatih, "Optimizing a Text Retrieval System Utilizing N-gram Indexing", D.Sc. dissertation, George Washington University, 1990.
- D'AM85. D'Amore, Raymond J., and Mah, Clinton P., "One-Time Complete Indexing of Text: Theory and Practice", Research and Development in Information Retrieval: Eighth Annual International ACM SIGIR Conference, pp. 155-164, Montreal, Quebec, Canada, 1985.
- FOX90. Fox, Christopher, "A Stop List for General Text", SIGIR Forum, Vol. 24, Nos. 1-2, pp. 19-35, Fall 89/Winter 90.
- JONE84. Jones, Kevin P., and Bell, Colin L.M., "The Automatic Extraction of Words from Texts Especially for Input into Information Retrieval Systems Based on Inverted Files", pp. 410-419, in Research and Development in Information Retrieval, van Rijsbergen, C.J., ed., Cambridge University Press, 1984.
- KRAC81. Kracsony, Peter, Kowalski, Gerald, and Meltzer, Arnold, "Comparative Analysis of Hardware Versus Software Text Search", Chapter 17, in Information Retrieval Research, Oddy, R.N., Robertson, S.E., van Rijsbergen, C.J., and Williams, P.W., eds., Butterworths, London, 1981.
- KUCE67. Kucera, Henry, and Francis, W. Nelson, Computational Analysis of Present-Day American English, Brown University Press, Providence, Rhode Island, 1967.
- LAPI91. Lapir, G.M., "Associative Technique for Database Access", Report of the Institute for System Studies, USSR Academy of Sciences, 1991 (in Russian).
- LESL51. Leslie, Louis, 20,000 Words Spelled, Divided, and Accented for the Use of Students, Authors and Proofreaders, Third Edition, Gregg Publishing Division of McGraw-Hill Book Company, Inc., New York, 1951.
- MELT87. Meltzer, Arnold C., and Kowalski, Gerald, "Text Searching Using an Inversion Database Consisting of Trigrams", IEEE Proceedings of Second International Conference on Computers and Applications, pp. 65-69, 1987.
- SALT83. Salton, Gerard, and McGill, Michael J., Introduction to Modern Information Retrieval, McGraw Hill, New York, 1983.
- WILK89. Wilkinson, Leland, SYSTAT: The System for Statistics, SYSTAT, Inc., Evanston, IL, 1989.
- WILL79. Willett, Peter, "Document Retrieval Experiments Using Indexing Vocabularies of Varying Size - 2. Hashing, Truncation, Digram and Trigram Encoding of Index Terms", Journal of Documentation, Vol. 35, No. 4, pp. 296-305, December 1979.
- WISN87. Wisniewski, Janusz L., "Effective Text Compression with Simultaneous Digram and Trigram Encoding", Journal of Information Science: Principles & Practice, Vol. 13, No. 3, pp. 159-164, 1987.
- YOCH85. Yochum, Julian A., "A High-Speed Text Scanning Algorithm Utilizing Least Frequent Trigraphs", IEEE Proceedings New Directions in Computing Symposium, Trondheim, Norway, pp. 114-121, 1985.
Index Terms
- Trigrams as index element in full text retrieval: observations and experimental results