Abstract
We present a natural language processing (NLP) approach to automatic indexing over a controlled vocabulary that accounts for term variation. The approach combines a part-of-speech tagger, a generator of morphologically related forms, and a shallow transformational parser. The system is applied to French; it is trained on newspaper articles and tested on scientific literature.
The precision of indexing on terms and their variants is 97.2%, only slightly lower than the precision of indexing without accounting for term variation (99.7%). The recall of indexing on terms and their variants (93.4%) is much higher than the recall of indexing on term occurrences only (72.4%). Conflation of term variants increases indexing coverage by up to 30%.
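The trade-off reported above can be checked with a few lines of arithmetic. The sketch below uses only the four percentages quoted in the abstract; the function name `relative_gain` is illustrative, and reading the "up to 30%" coverage figure as a relative recall gain is an assumption, not something the abstract states.

```python
# Figures quoted in the abstract, expressed as fractions.
precision_terms_only = 0.997
precision_with_variants = 0.972
recall_terms_only = 0.724
recall_with_variants = 0.934


def relative_gain(baseline: float, improved: float) -> float:
    """Relative change from baseline to improved, as a fraction of baseline."""
    return (improved - baseline) / baseline


# Absolute precision cost of accepting variants, in percentage points.
precision_drop_points = (precision_terms_only - precision_with_variants) * 100

# Absolute and relative recall benefit of accepting variants.
recall_gain_points = (recall_with_variants - recall_terms_only) * 100
recall_gain_relative = relative_gain(recall_terms_only, recall_with_variants)

print(f"precision drop: {precision_drop_points:.1f} points")
print(f"recall gain:    {recall_gain_points:.1f} points")
print(f"relative recall gain: {recall_gain_relative:.1%}")
```

Under this reading, a loss of 2.5 points of precision buys 21 points of recall, a relative recall gain of about 29%.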
The system is a convincing example of the potential synergy between full-fledged morphological analysis and local syntactic analysis. Many details are provided on the implementation of the system. Illustrative examples of syntactic transformations for the French language are given together with the theoretical and empirical methods for their formulation.
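As a rough illustration of how morphology and local syntax can cooperate, here is a toy sketch. It is not the authors' system: the tagged sentence is written by hand in place of a real tagger, and the `FAMILY` table, the `SKIPPABLE` tag set, the `find_variants` function, and the French example are all hypothetical.

```python
# Hypothetical morphological families: lemma -> family identifier.
# A real system would generate these with a morphological analyzer.
FAMILY = {
    "mesure": "mesur",
    "mesurer": "mesur",
    "débit": "debit",
}

# Tags allowed between the two content words of a variant (assumed tag set).
SKIPPABLE = {"DET", "PREP", "ADJ"}


def find_variants(tagged, term, max_gap=3):
    """Return spans where both term families occur close together.

    tagged: list of (word, lemma, pos) tuples, standing in for tagger output.
    term:   (head_family, argument_family) for a controlled-vocabulary term.
    """
    head, arg = term
    hits = []
    for i, (_, lemma, _) in enumerate(tagged):
        if FAMILY.get(lemma) != head:
            continue
        # Scan a short window to the right of the head word.
        for j in range(i + 1, min(i + 2 + max_gap, len(tagged))):
            _, lemma2, pos2 = tagged[j]
            if FAMILY.get(lemma2) == arg:
                hits.append(" ".join(tok[0] for tok in tagged[i:j + 1]))
                break
            if pos2 not in SKIPPABLE:
                break  # a content word intervenes: not a local variant
    return hits


# Controlled term "mesure de débit" reduced to its two families.
TERM = ("mesur", "debit")

# Hand-tagged sentence "on mesure le débit" (verbal variant of the term).
sentence = [
    ("on", "on", "PRON"),
    ("mesure", "mesurer", "VERB"),
    ("le", "le", "DET"),
    ("débit", "débit", "NOUN"),
]

variants = find_variants(sentence, TERM)
print(variants)
```

The family lookup plays the role of the morphological component and the windowed scan plays the role of a very crude local syntactic pattern; the transformations in the chapter are presumably far richer than this.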
Copyright information
© 1999 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Jacquemin, C., Tzoukermann, E. (1999). NLP for Term Variant Extraction: Synergy Between Morphology, Lexicon, and Syntax. In: Strzalkowski, T. (eds) Natural Language Information Retrieval. Text, Speech and Language Technology, vol 7. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2388-6_2
DOI: https://doi.org/10.1007/978-94-017-2388-6_2
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5209-4
Online ISBN: 978-94-017-2388-6
eBook Packages: Springer Book Archive