Statistical models of morphology predict eye-tracking measures during visual word recognition

Lehtonen, Minna; Varjokallio, Matti; Kivikari, Henna; Hultén, Annika; Virpioja, Sami; Hakala, Tero; Kurimo, Mikko; Lagus, Krista; Salmelin, Riitta

doi:10.3758/s13421-019-00931-7

Open access
Published: 17 May 2019

Volume 47, pages 1245–1269 (2019)

Memory & Cognition

Minna Lehtonen^1,2,3,
Matti Varjokallio⁴,
Henna Kivikari^3,5,
Annika Hultén⁵,
Sami Virpioja⁴,
Tero Hakala⁵,
Mikko Kurimo⁴,
Krista Lagus⁶ &
…
Riitta Salmelin⁵

Abstract

We studied how statistical models of morphology that are built on different kinds of representational units, i.e., models emphasizing either holistic units or decomposition, perform in predicting human word recognition. More specifically, we studied the predictive power of such models at early vs. late stages of word recognition by using eye-tracking during two tasks. The tasks included a standard lexical decision task and a word recognition task that assumedly places less emphasis on postlexical reanalysis and decision processes. The lexical decision results showed good performance of Morfessor models based on the Minimum Description Length optimization principle. Models which segment words at some morpheme boundaries and keep other boundaries unsegmented performed well both at early and late stages of word recognition, supporting dual- or multiple-route cognitive models of morphological processing. Statistical models based on full forms fared better in late than early measures. The results of the second, multi-word recognition task showed that early and late stages of processing often involve accessing morphological constituents, with the exception of short complex words. Late stages of word recognition additionally involve predicting upcoming morphemes on the basis of previous ones in multimorphemic words. The statistical models based fully on whole words did not fare well in this task. Thus, we assume that the good performance of such models in global measures such as gaze durations or reaction times in lexical decision largely stems from postlexical reanalysis or decision processes. This finding highlights the importance of considering task demands in the study of morphological processing.

Introduction

Processing of morphologically complex words (e.g., screen+ing+s) is an active topic in visual word recognition research. Studies on morphological processing have focused on determining whether complex words are recognized by decomposing them into their morphological constituents or whether they are stored as holistic units in our mental lexicon. A variety of cognitive models have been proposed which span from full decomposition models (e.g., Taft, 1979, 2004), assuming that all words are represented as morphemes, to full form models that claim that all known words are initially accessed via their whole-word representations (e.g., Butterworth, 1983). In addition, there are dual/multiple-route models (e.g., Schreuder & Baayen, 1995; Kuperman, Schreuder, Bertram, & Baayen, 2009) which assume that the mental processing system may include both types of representations and utilize different kinds of information in order to process words effectively. Processing of morphologically complex words has been studied by utilizing various tools such as reaction time (RT) measurements in visual word recognition tasks, tracking of eye-movements during reading, and techniques measuring brain activity elicited by visual or auditory presentation of words.

Furthermore, the temporal order in which these kinds of representations become active during visual word recognition has been subject to debate (see, e.g., Rastle & Davis, 2008; New, Brysbaert, Segui, Ferrand, & Rastle, 2004; Giraudo & Grainger, 2003a, 2003b). For example, a widely held view states that morphologically complex words are segmented into their constituents at early stages of word recognition (see, e.g., Rastle & Davis, 2008, for a review). At a later stage in which the semantic and syntactic features are accessed, these decomposed parts are then assumedly recombined to form a meaningful whole (Schreuder & Baayen, 1995; Taft, 2004). This stage can thus be sensitive to full-form measures such as surface frequency even if decomposition has taken place, i.e., such effects would reflect recombination of the morphemes already segmented at earlier levels of processing (Taft, 2004; Fruchter & Marantz, 2015). Previous eye-tracking research on recognition of morphologically complex words has revealed effects of both the whole words and the morphological constituents (e.g., Andrews, Miller, & Rayner, 2004; Hyönä, Bertram, & Pollatsek, 2004). In compound words, whole-word frequency effects have been observed earlier in time than effects of the constituents (Kuperman et al., 2009), challenging the obligatory early decomposition accounts that are supported by findings from, e.g., lexical decision (e.g., Taft, 2004; Rastle & Davis, 2008). The present study aims to better understand the processes and representations accessed at different stages of word recognition. To do this, we study how different computational models that are based on different kinds of representational units correspond to measures of participants’ eye-movement behavior during visual word recognition.

One central theme in morphological processing studies has been the issue of optimization, i.e., determining the optimal units of representation in the mental lexicon, in terms of minimizing both storage demands and processing time (Schreuder & Baayen, 1995). Finnish, for example, is a morphologically rich language where each noun has about 150 paradigmatic forms, and various clitic particles can additionally be attached to these forms. Storing all these word forms as whole units is thus unlikely to be economical for the storage capacity of the mental lexicon, suggesting that decomposing them into morphological constituents is a useful strategy for the cognitive system. However, inflected Finnish words robustly elicit longer RTs, higher error rates, and longer eye-fixations than matched monomorphemic words (Niemi, Laine, & Tuominen, 1994; Hyönä, Laine, & Niemi, 1995; Bertram, Laine, & Karvinen, 1999; Lehtonen & Laine, 2003), suggesting that decomposition may also entail a cost. It is, however, unclear what an optimal balance between these two costs is and whether it differs in early vs. later stages of word recognition.

Computational models can provide useful means for addressing issues related to optimization. In contrast to psycholinguistic models that are typically descriptive, the output of computational models is quantitative. It can therefore be directly compared to continuous performance measures such as RTs in a word recognition task or eye-tracking measures during reading. If a computational model is able to successfully predict variation in these cognitive measures, it is likely able to tell us something essential about the cognitive operations relevant in these tasks. Previous work on statistical modeling of morphological processing has utilized a variety of approaches, many of which have not assumed morphemes themselves to have an influential role in word processing. Such approaches include the distributed-connectionist models (Seidenberg, 2005; McClelland, 1988; Gonnerman, Seidenberg, & Andersen, 2007, see Rueckl, 2010 for a review) and the amorphous Naïve Discriminative Reader model (Baayen, Milin, Filipovic, Hendrix, & Marelli, 2011; Baayen, Shaoul, Willits, & Ramscar, 2016), which maps orthographic or phonetic input units directly to symbolic semantic units, without hidden units or a morphological level. Here, in contrast, we focus on models that allow morphological information to be utilized in storage of words and models that are based on the principle of optimization, a principle that is likely to bear relevance in the human cognitive system.

Morfessor (Creutz & Lagus, 2007) is a computational model that is able to learn segmentations of words in an unsupervised manner from unannotated data, and it applies a principle of optimization in building a concise lexicon of morphs. First, it stores word forms as wholes (assuming one word is one morph, e.g., dog, dogs). Then it utilizes these stored morphs in segmenting new incoming words (e.g., after storing dog, -s is also stored separately when the word dogs is encountered). Morfessor searches for a segmentation that is simultaneously compact and an accurate representation of the data. As an illustration, an extremely compact lexicon would include only letters but it would not provide a good representation of the data, whereas listing all words as whole units in the lexicon would be a very accurate but not compact representation of the data. Via the cost function based on the two-part coding scheme of the Minimum Description Length (MDL) principle (Rissanen, 1978), Morfessor attempts to find an optimal balance between the two. The first part of the cost function represents a storage cost for the lexicon, where larger units are more costly. The second part, in turn, represents a cost for computation, where holistic units reduce the cost. If only the second part were included, all words would be stored as full forms, and this would be a problem, e.g., when encountering words with novel combinations of known morphemes.
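The two-part trade-off described above can be written schematically as follows. This is a generic MDL-style formulation offered as an illustration, not Morfessor's exact cost function; the symbols $\mathcal{L}$ (the lexicon of stored morphs) and $D$ (the training corpus) are notation introduced here.

```latex
\operatorname{Cost}(\mathcal{L}, D)
  = \underbrace{L(\mathcal{L})}_{\text{cost of storing the lexicon}}
  + \underbrace{L(D \mid \mathcal{L})}_{\text{cost of encoding the corpus with that lexicon}},
\qquad
L(D \mid \mathcal{L}) = -\sum_{w \in D} \log_2 P(w \mid \mathcal{L})
```

A lexicon of single letters makes the first term very small and the second very large; a lexicon listing every word form does the reverse.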

In Morfessor, it is also possible to manipulate the emphasis the model places on these parts, or decomposed vs. full-form units. This can be done by manipulating a hyperparameter alpha, which enables one to vary the length of the units that Morfessor tends to produce. A small value of the hyperparameter provides a lexicon of short units (or morphs that the model stores), whereas a large value provides a lexicon of long units. As an example of extremes, Morfessor with an alpha value of 0.01 leads to a lexicon of units which largely resemble linguistically analyzed morphs, whereas an alpha value of 10 yields a lexicon of full forms (Virpioja, Lehtonen, Hultén, Kivikari, Salmelin, & Lagus, 2018). This feature allows us to investigate, within the same model type, whether a solution that decomposes words at practically all morpheme boundaries corresponds better to human word-recognition measures than one that keeps some or all boundaries unsegmented. Unsupervised models such as Morfessor utilize general learning principles in extracting regularities from the input and can in this way mimic the kind of human learning in which discovering regular structures and patterns from the linguistic environment is central. An interesting comparison point is provided by supervised models trained on pre-given linguistically structured input, for which parallels can be found in human learning with innate constraints.
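The effect of alpha can be illustrated with a toy calculation. The sketch below is not the Morfessor implementation: it assumes that alpha multiplies the corpus part of a simplified two-part cost, compares only two hand-written candidate segmentations instead of searching for one, and uses made-up English counts. All function and variable names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus: word form -> token count (made-up numbers).
CORPUS = {"dog": 100, "dogs": 50, "cat": 80, "cats": 40, "bird": 60, "birds": 30}

# Two hand-written candidate analyses of the same corpus.
FULL_FORMS = {word: [word] for word in CORPUS}
SEGMENTED = {
    "dog": ["dog"], "dogs": ["dog", "s"],
    "cat": ["cat"], "cats": ["cat", "s"],
    "bird": ["bird"], "birds": ["bird", "s"],
}

# Assumption: each stored symbol (26 letters + end-of-morph marker) costs the same.
BITS_PER_SYMBOL = math.log2(27)


def lexicon_cost(segmentation):
    """Bits needed to spell out every distinct morph once (longer units cost more)."""
    morphs = {m for parts in segmentation.values() for m in parts}
    return sum((len(m) + 1) * BITS_PER_SYMBOL for m in morphs)


def corpus_cost(segmentation, corpus):
    """Bits needed to encode the corpus as a sequence of morph tokens."""
    counts = Counter()
    for word, n in corpus.items():
        for m in segmentation[word]:
            counts[m] += n
    total = sum(counts.values())
    return -sum(n * math.log2(n / total) for n in counts.values())


def weighted_cost(segmentation, corpus, alpha):
    """Simplified two-part cost with alpha weighting the corpus part."""
    return lexicon_cost(segmentation) + alpha * corpus_cost(segmentation, corpus)


def preferred(alpha):
    """Name of the cheaper candidate analysis for a given alpha."""
    candidates = {"full forms": FULL_FORMS, "segmented": SEGMENTED}
    return min(candidates, key=lambda k: weighted_cost(candidates[k], CORPUS, alpha))
```

With these toy numbers, `preferred(0.1)` selects the segmented analysis and `preferred(10)` selects the full forms, mirroring the direction of the effect described above. The crossover point depends entirely on the invented counts and says nothing about the alpha values used in the study.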

Morfessor was initially studied in a psycholinguistic context by Virpioja, Lehtonen, Hultén, Salmelin, and Lagus (2011b), who demonstrated that the self-information values predicted by Morfessor correlated highly with word recognition times for morphologically complex Finnish words in a visual lexical decision task. These correlations were higher for Morfessor than for typically used psycholinguistic variables, such as lemma frequency, length, or morphological family size. Following this first investigation, Virpioja et al. (2018) utilized Morfessor and other statistical models based on self-information in studying the optimal balance of storage and decomposition in the human mental lexicon. They used simple statistical models of morphology that are based on different representational units: words thoroughly decomposed based on their linguistic analysis, full word forms, and a solution which segments words at some morpheme boundaries and leaves others unsegmented. They compared these models’ predictions with lexical decision RTs and aimed to uncover whether human representations for morphologically complex words are based on decomposed morphemes, full forms, or something in-between. The best correspondence was found by using a combination of two models: an instance of Morfessor that segments words at some morpheme boundaries but not at others (Morfessor with an alpha value of 0.8), and a whole-word model. While Morfessor does not incorporate information about different types of morphemes, the output segmentations differ to some extent for words carrying different types of morphemes. In the analysis of Virpioja et al. (2018), the best-performing Morfessor instance left most derivational morpheme boundaries unsegmented (in line with previous behavioral studies on derivational processing, e.g., Niemi et al., 1994; Bozic & Marslen-Wilson, 2010; Laudanna, Badecker, & Caramazza, 1992) whereas all clitic particles were kept separate from the rest of the word. Interestingly, it also left a large proportion of the inflectional suffixes unsegmented. The results were interpreted to support dual-route accounts of morphological processing.

As the Virpioja et al. (2011b, 2018) studies were based on lexical decision RTs, it is unclear whether the good performance of Morfessor and the whole-word model stems from particular, possibly different stages of the word recognition process. Word recognition times in a lexical decision task necessarily include several stages, including form-level (e.g., letter and bigram) processing and access to more abstract lexical representations (e.g., whole words or morphemes) but also decision-making processes and button-press-related motor preparation. Tracking of eye-movements during reading can be used to study more automatic aspects of the process. It provides measures that allow a look at the processes at play during word recognition itself, enabling improved temporal resolution. First fixation duration (FFD) is an eye-tracking measure expected to reflect early stages of word recognition, while more global measures such as gaze duration (GD; sum of the durations of all fixations on the word) are assumed to emphasize also later processing stages (see, e.g., Hyönä et al., 1995). In addition to these well-established measures, we also include a further measure of the later stages, namely gaze duration minus first fixation duration (GmF), to more closely focus on the processes taking place after the initial landing of the eyes on the word.
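The three measures can be sketched as simple arithmetic over the fixations recorded on one word. The function name and the input format (a list of fixation durations in milliseconds, in temporal order) are assumptions made for illustration; which fixations count toward gaze duration in practice depends on the analysis pipeline and is not specified here.

```python
def eye_measures(fixation_durations_ms):
    """Return FFD, GD and GmF for one word.

    fixation_durations_ms: durations (ms) of the fixations on the word,
    in the order they occurred (hypothetical input format).
    """
    if not fixation_durations_ms:
        raise ValueError("the word received no fixations")
    ffd = fixation_durations_ms[0]    # first fixation duration
    gd = sum(fixation_durations_ms)   # gaze duration: sum of all fixations
    return {"FFD": ffd, "GD": gd, "GmF": gd - ffd}
```

For a word fixated three times for 210, 180, and 95 ms, `eye_measures([210, 180, 95])` gives an FFD of 210 ms, a GD of 485 ms, and a GmF of 275 ms. A word fixated only once has a GmF of 0.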

Using these measures, we aim to better understand whether the predictive power of unsupervised Morfessor in lexical decision (Virpioja et al., 2011b, 2018) stems primarily from early or late word recognition processes. We investigate how the MDL-based optimization principle of Morfessor and its different model variants (e.g., those that decompose words exhaustively vs. those that keep many words unsegmented) perform in predicting the different eye-tracking measures during word recognition. Our first aim is thus to study the question of optimal units of processing utilized at different processing levels, for a variety of morphologically complex (inflected and derived) words. We compare the relative performance of statistical models that are based on different kinds of representational units, e.g., those close to linguistically defined morphemes, full forms, or a solution which finds an optimal balance between the two: for some morpheme combinations this may be full forms and for some decomposed representations. To study the optimal balance between the two extremes, we vary the hyperparameter alpha in the first type of the Morfessor method, Morfessor Baseline. We compare these three Morfessor instances to a similar simple model which is, however, trained using linguistically pre-segmented input in a supervised manner and thus fully morpheme-based (Morph unigram model), and to a full-form model based on surface frequencies (Word unigram model). With this approach and the temporal dimension provided by eye-tracking, we aim to study the sensitivity of early vs. late word recognition processes to morpheme-based vs. more holistic units.

Our second aim is to investigate statistical models which predict upcoming morphological information on the basis of previously observed morphs. We investigate to what extent these kinds of predictive processes are used in online recognition of morphologically complex words. We hypothesize that information about the morpheme context is to some extent utilized in recognition of multimorphemic words, at least after initial landing of the eyes on the word and after accessing the first morphological constituent. An unsupervised model type that allows testing the effect of morpheme context is Morfessor Categories-MAP (CatMAP) (Creutz & Lagus, 2005a, 2007), which incorporates rudimentary structural information regarding word formation, i.e., that words may include prefixes, stems, and suffixes. The segmentations provided by the CatMAP method correspond in most cases more accurately to linguistic morph segmentations than the segmentation given by the Morfessor Baseline algorithm (Creutz & Lagus, 2007). However, there are still differences compared to the linguistically defined morphemes. Therefore, as a comparison, we investigate the performance of a supervised model (Morph bigram model) that also predicts upcoming morphs on the basis of previous ones in the same word but that is given linguistically pre-segmented input during training, i.e., it utilizes units that strictly correspond to linguistic morphemes.
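The idea of predicting an upcoming morph from the preceding one can be sketched with a minimal bigram model over pre-segmented words. This is an illustration under simplifying assumptions, not the Morph bigram model used in the study: it uses unsmoothed relative frequencies, hypothetical word-boundary symbols, and a handful of made-up segmented Finnish forms.

```python
from collections import Counter

START, END = "<w>", "</w>"  # hypothetical word-boundary symbols


def train_morph_bigram(segmented_words):
    """Count morph bigrams within words; each word is a list of morphs."""
    bigrams, contexts = Counter(), Counter()
    for morphs in segmented_words:
        seq = [START, *morphs, END]
        for prev, nxt in zip(seq, seq[1:]):
            bigrams[(prev, nxt)] += 1
            contexts[prev] += 1
    return bigrams, contexts


def next_morph_probability(bigrams, contexts, prev, nxt):
    """Unsmoothed estimate of P(next morph | previous morph)."""
    if contexts[prev] == 0:
        return 0.0  # unseen context; a real model would need smoothing
    return bigrams[(prev, nxt)] / contexts[prev]


# Made-up training data: talo 'house', auto 'car', -ssa inessive, -n genitive.
TOY_WORDS = [
    ["talo", "ssa"], ["talo", "ssa"], ["talo", "n"], ["talo"],
    ["auto", "ssa"], ["auto", "n"],
]
BIGRAMS, CONTEXTS = train_morph_bigram(TOY_WORDS)
```

In this toy data, the stem talo is followed by -ssa in two of its four occurrences, so `next_morph_probability(BIGRAMS, CONTEXTS, "talo", "ssa")` is 0.5. A suffix that is highly expected after a given stem receives a high probability and, correspondingly, a low processing cost under such a model.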

Our focus is on computational models that provide self-information estimates. The measure of self-information or “surprisal” is the negative logarithm of the word’s probability estimated by a statistical language model and is a measure of how unexpected a word form is. This measure has previously been used, e.g., in the context of auditory word recognition (Balling & Baayen, 2012; Ettinger, Linzen, & Marantz, 2014) and can be assumed to correspond to a cost of storage, i.e., the minimum number of bits required to encode the word using the model.
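As a minimal sketch of this definition, self-information in bits is the negative base-2 logarithm of a probability. The second function estimates a word's probability from raw token counts, which is one simple way to obtain a whole-word estimate; the function names and the counts are hypothetical.

```python
import math


def self_information(probability):
    """Surprisal in bits: the negative base-2 logarithm of a probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def word_surprisal_from_counts(word, counts):
    """Surprisal of a word form whose probability is its relative frequency."""
    total = sum(counts.values())
    return self_information(counts[word] / total)
```

An event with probability 0.5 carries 1 bit and one with probability 0.125 carries 3 bits, so rarer forms are more surprising. With the toy counts `{"talo": 6, "talossa": 2}`, the form talossa has a relative frequency of 0.25 and a surprisal of 2 bits.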

The kind of information that is relevant to extract from the visual input during word recognition may depend on the task. Overall, we expect eye-movement measures to reflect at least to some extent more automatic processes than behavioral reaction times. In two experiments, we employ different ways of presenting the words to the participants during the measurement of their eye-movements: 1) standard visual lexical decision combined with eye-tracking, to enable direct comparisons to the previous lexical decision studies by Virpioja et al. (2011b, 2018), and 2) a task in which the target words are presented embedded in rows of several unrelated letter strings, to better mimic eye-movement behavior in natural reading. In the latter task, the participants are to evaluate the lexicality of the unrelated letter strings presented in the row (i.e., whether they are all real words or not). This is done in order to keep the main cognitive aspects of the second task as similar as possible to the lexical decision experiments. Additionally, by using unrelated words instead of sentences, we want to control for predictive spill-over effects from previous words, i.e., predicting upcoming words on the basis of sentence context (see, e.g., Hyönä, Vainio, & Laine, 2002). While the task is still essentially lexical decision, a behavioral response is not required on every item read, and the probability of observing a pseudoword is lower than in a standard lexical decision task. We assume that this aspect of the task reduces postlexical processes, such as demands to reanalyze the words, and puts more emphasis on primary lexical access processes in our measures. Thus, we ask to what extent the nature of the task affects the relative performance of the models, by comparing the standard visual lexical decision to a task that assumedly reduces the cost of reanalysis, check-up, and decision-making processes, which are unlikely to be among the most central aspects of word recognition in ecologically valid conditions.

Taken together, by using statistical models of morphology we study what kind of information is accessed during recognition of multimorphemic words. In particular, we are interested in the nature of the optimal units of processing (e.g., whether they are morpheme- or full-form-based) at different stages of word recognition and whether readers predict morphemes on the basis of previous ones. We additionally study to what extent particular task demands affect the kind of information used during online word recognition.

Experiment 1

Method

Participants

Twenty-four healthy volunteers (22 females; mean age 26.3 years; SD 5.6) participated in the lexical decision experiment. All were native speakers of Finnish with no diagnosed language disorders or neurological illnesses, and they were remunerated for their time. The study was approved by the Aalto University Research Ethics Committee.

Materials

The word stimuli were the same as those used in Virpioja et al. (2018) and consisted of 360 Finnish nouns with one or multiple (1-5) morphemes. In multimorphemic words the stem was accompanied by one or several inflectional, derivational, or possessive suffixes and/or clitic particles. The number of morphs was first calculated using the FINTWOL morphological analyzer (Lingsoft, Inc.) and further corrected by two native speakers of Finnish on the basis of linguistic assessment of derivational suffixes’ regularity and productivity, according to Karlsson (1983). The word materials had broad statistical distributions for several lexical parameters, permitting a correlational analysis for eye-tracking data. Three hundred words were randomly selected from the Morpho Challenge corpus (Kurimo, Creutz, & Varjokallio, 2008), which includes over 2.2 million word types and 44 million word tokens. This list was complemented with 60 additional randomly selected higher-frequency words because the random sample overemphasized low-frequency and bimorphemic words. For properties of the word stimuli with respect to statistical language models and descriptive statistics, see Table 1. Lemma frequency is the summative frequency of all the inflectional variants of a single stem (e.g., Baayen, Dijkstra, & Schreuder, 1997; Bertram, Baayen, & Schreuder, 2000; Taft, 1979) and is assumed to affect the speed of accessing the stem when decomposing complex words. Morphological family size is the number of derivations and compounds where the noun occurs as a constituent (e.g., Bertram et al., 2000; del Prado, Bertram, Häikiö, Schreuder, & Baayen, 2004; Schreuder & Baayen, 1997).
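The two background variables defined at the end of this paragraph amount to simple counts, as the following sketch illustrates. The function names, the input formats, and the counts are made up for illustration, and the sketch assumes that a morphological analysis linking each form to its stem and constituents is already available.

```python
def lemma_frequency(form_counts, inflected_forms):
    """Summed token frequency of all inflectional variants of one stem."""
    return sum(form_counts.get(form, 0) for form in inflected_forms)


def family_size(stem, complex_words):
    """Number of derivations and compounds that contain the stem as a constituent.

    complex_words maps each derived or compound word to its list of constituents.
    """
    return sum(1 for parts in complex_words.values() if stem in parts)


# Made-up counts for forms of talo 'house'.
FORM_COUNTS = {"talo": 120, "talossa": 40, "talon": 65, "auto": 90}
TALO_FORMS = ["talo", "talossa", "talon"]

# Made-up constituent analyses of derived and compound words.
COMPLEX_WORDS = {
    "kerrostalo": ["kerros", "talo"],
    "talous": ["talo", "us"],
    "autotalli": ["auto", "talli"],
}
```

Here `lemma_frequency(FORM_COUNTS, TALO_FORMS)` sums to 225, and `family_size("talo", COMPLEX_WORDS)` is 2.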

Table 1 Properties of the stimulus words and cross-entropy values for the language models. For the models, the range and mean (SD) represent their self-information values
