NLP for Term Variant Extraction: Synergy Between Morphology, Lexicon, and Syntax

Jacquemin, Christian; Tzoukermann, Evelyne

doi:10.1007/978-94-017-2388-6_2


Part of the book series: Text, Speech and Language Technology (TLTB, volume 7)


Abstract

We present a natural language processing (NLP) approach to automatic indexing over a controlled vocabulary that accounts for term variation. The approach combines a part-of-speech tagger, a generator of morphologically related forms, and a shallow transformational parser. The system is applied to French; it is trained on newspaper articles and tested on scientific literature.

The precision of indexing on terms and their variants is 97.2%, only slightly lower than the precision of indexing without accounting for term variation (99.7%). The recall of indexing on terms and their variants (93.4%) is much higher than the recall of indexing on term occurrences only (72.4%). Conflation of term variants increases indexing coverage by up to 30%.
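As a rough check on how these figures relate, the sketch below computes the relative gain in recall obtained by conflating variants. Reading the reported coverage increase of up to 30% as this relative recall gain is an assumption, not something the abstract states, and the function name `relative_gain` is illustrative.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative increase of `improved` over `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Recall figures reported in the abstract, in percent.
recall_terms_only = 72.4
recall_with_variants = 93.4

gain = relative_gain(recall_terms_only, recall_with_variants)
print(f"Relative recall gain: {gain:.1%}")  # about 29%
```

Under this reading, the 21-point gain in recall corresponds to roughly 29% more term occurrences indexed, which is consistent with the stated upper bound.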

The system is a convincing example of the potential synergy between full-fledged morphological analysis and local syntactic analysis. The implementation of the system is described in detail. Illustrative examples of syntactic transformations for French are given, together with the theoretical and empirical methods used to formulate them.
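To make the combination of morphology and local syntax concrete, here is a minimal sketch of variant matching over tagged text. It is not the authors' system: the morphological families, the tag set, the skip rule, and the names `FAMILIES`, `SKIPPABLE`, and `find_variants` are all hypothetical, and the French example term is chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str, str]  # (surface form, lemma, part-of-speech tag)

# Hypothetical morphological families: lemma -> set of related lemmas.
FAMILIES: Dict[str, Set[str]] = {
    "mesure": {"mesure", "mesurer"},
    "débit": {"débit"},
}

# Toy syntactic constraint: tags that may occur between the two term words.
SKIPPABLE = {"DET", "PREP", "ADJ", "ADV"}


def find_variants(
    term: Tuple[str, str], tokens: List[Token], max_gap: int = 3
) -> List[Tuple[int, int]]:
    """Return (start, end) token indices where both words of a two-word term,
    or members of their morphological families, occur with at most `max_gap`
    tokens in between, all of them carrying skippable tags."""
    first, second = (FAMILIES.get(word, {word}) for word in term)
    spans = []
    for i, (_, lemma_i, _) in enumerate(tokens):
        if lemma_i not in first:
            continue
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            _, lemma_j, pos_j = tokens[j]
            if lemma_j in second:
                spans.append((i, j))
                break
            if pos_j not in SKIPPABLE:
                break  # an unskippable word blocks the match
    return spans


# "mesurer le débit" as a verbal variant of the term "mesure de débit".
tagged = [
    ("mesurer", "mesurer", "VERB"),
    ("le", "le", "DET"),
    ("débit", "débit", "NOUN"),
]
print(find_variants(("mesure", "débit"), tagged))  # [(0, 2)]
```

The tagger supplies lemmas and tags, the family table stands in for the generator of morphologically related forms, and the skip rule plays the role of a single, very coarse syntactic transformation.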



Author information

Authors and Affiliations

Institut de Recherche en Informatique de Nantes, 2, chemin de la Houssinière, BP 92208, 44322, Nantes Cedex 3, France
Christian Jacquemin
Bell Laboratories, Lucent Technologies, 700 Mountain Avenue, Room 2d-448, P.O. Box 636, Murray Hill, NJ, 07974, USA
Evelyne Tzoukermann


Editor information

Editors and Affiliations

General Electric, Research & Development, 12301, Schenectady, NY, USA
Tomek Strzalkowski


About this chapter

Cite this chapter

Jacquemin, C., Tzoukermann, E. (1999). NLP for Term Variant Extraction: Synergy Between Morphology, Lexicon, and Syntax. In: Strzalkowski, T. (ed.) Natural Language Information Retrieval. Text, Speech and Language Technology, vol 7. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2388-6_2


DOI: https://doi.org/10.1007/978-94-017-2388-6_2
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5209-4
Online ISBN: 978-94-017-2388-6
eBook Packages: Springer Book Archive
