NLP for Term Variant Extraction: Synergy Between Morphology, Lexicon, and Syntax

Natural Language Information Retrieval

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 7))

Abstract

We present a natural language processing (NLP) approach to automatic indexing over a controlled vocabulary that accounts for term variation. The approach combines a part-of-speech tagger, a generator of morphologically related forms, and a shallow transformational parser. The system is applied to French; it is trained on newspaper articles and tested on scientific literature.
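As a rough illustration of how morphologically related forms and a small syntactic window can combine to recognise a term variant, here is a minimal sketch. It is not the chapter's system: the `MORPH_FAMILY` table, the `find_variants` function, the `max_gap` parameter, and the example term are all hypothetical, and the part-of-speech tagging step is left out entirely.

```python
# Toy sketch only: hand-written data, no tagger, no real transformational rules.

# Hypothetical table of morphologically related forms for each term word.
MORPH_FAMILY = {
    "mesure": {"mesure", "mesures", "mesurer", "mesuré", "mesurée"},
    "pression": {"pression", "pressions"},
}


def family(word):
    """Return the related forms of a word (just the word itself if unknown)."""
    return MORPH_FAMILY.get(word, {word})


def find_variants(term, tokens, max_gap=2):
    """Find token spans where the words of `term` occur in order.

    Each term word may be replaced by a related form, and at most
    `max_gap` other tokens may sit between consecutive term words.
    Returns a list of (start, end) index pairs, end exclusive.
    """
    heads = term.split()
    spans = []
    for start, tok in enumerate(tokens):
        if tok not in family(heads[0]):
            continue
        pos = start
        matched = True
        for head in heads[1:]:
            # Look only at the next few tokens after the last matched word.
            window = tokens[pos + 1 : pos + 2 + max_gap]
            hit = next(
                (i for i, t in enumerate(window) if t in family(head)), None
            )
            if hit is None:
                matched = False
                break
            pos = pos + 1 + hit
        if matched:
            spans.append((start, pos + 1))
    return spans


tokens = "on a mesuré la pression du gaz".split()
spans = find_variants("mesure pression", tokens)
matched_text = [" ".join(tokens[s:e]) for s, e in spans]
```

In this toy run, the term words `mesure pression` are matched against the verbal form `mesuré la pression`, a variant that exact string matching on the term would miss. A real system would restrict such matches with part-of-speech information and explicit syntactic patterns, which is the role the abstract assigns to the tagger and the shallow parser.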

The precision of indexing on terms and their variants is 97.2%, only slightly lower than the precision of indexing without accounting for term variation (99.7%). The recall of indexing on terms and their variants (93.4%) is much higher than the recall of indexing on term occurrences only (72.4%). Conflation of term variants increases indexing coverage up to 30%.
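For readers who want to relate these figures to one another, the following sketch recalls the usual definitions of precision and recall and computes the gap between the two reported recall values. The function name and the counts passed to it in the test are hypothetical; only the four percentages come from the abstract. Whether the "up to 30%" coverage figure is derived in this way is an assumption, not something the abstract states.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Recall figures reported in the abstract, in percent.
recall_terms_only = 72.4
recall_with_variants = 93.4

# Difference in percentage points, and the same difference relative to
# the terms-only baseline.
absolute_gain = recall_with_variants - recall_terms_only
relative_gain = absolute_gain / recall_terms_only * 100
```

The absolute gain is about 21 percentage points, which is roughly 29% of the terms-only recall. The corresponding drop in precision, from 99.7% to 97.2%, is 2.5 points.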

The system is a convincing example of the potential synergy between full-fledged morphological analysis and local syntactic analysis. Many details are provided on the implementation of the system. Illustrative examples of syntactic transformations for the French language are given together with the theoretical and empirical methods for their formulation.


Copyright information

© 1999 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Jacquemin, C., Tzoukermann, E. (1999). NLP for Term Variant Extraction: Synergy Between Morphology, Lexicon, and Syntax. In: Strzalkowski, T. (ed.) Natural Language Information Retrieval. Text, Speech and Language Technology, vol 7. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2388-6_2

  • DOI: https://doi.org/10.1007/978-94-017-2388-6_2

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-5209-4

  • Online ISBN: 978-94-017-2388-6

  • eBook Packages: Springer Book Archive
