Introduction

Listeners are exposed to highly variable, continuous speech and map it to discrete sound categories. To do so, they first learn as infants what the relevant sounds of their language are, and, subsequently, map the incoming signal to learned categories. This is generally a robust process—infants learn about the sounds of their language as early as 6 months (Kuhl et al., 1992) and, for the most part, listeners process what they are hearing in an effortless manner. However, although listeners appear to solve these tasks with ease, the tasks are computationally difficult. In fact, after decades of research in this area, researchers have not yet established a robust one-to-one mapping between signal and category that works to anywhere near the degree of success of human listeners.

The reason these tasks are so computationally difficult is because there is a large amount of variability in the speech signal, which can lead to acoustic overlap between different sound categories (Bion et al., 2013). One sound category can be acoustically realized in infinitely many ways, and two different sound categories can have identical acoustic realizations. This makes establishing a one-to-one mapping between speech and category difficult. Although we focus on speech perception in this paper, this problem is not unique to speech. For example, a particular visual stimulus may appear completely different across instances, due to, for example, lighting conditions, viewing angles, or occlusion. As in speech, two different objects can also have identical physical attributes, and, yet, for the most part, people effortlessly identify what they are seeing (Bar, 2004).

The basic problem is that absolute acoustic or other perceptual cues are insufficient to separate categories as well as humans do. This has led researchers to propose that listeners may be relying on context to help map from signal to categories. The role of context is widely studied in cognitive science, and fundamental to many cognitive theories, with most researchers largely agreeing that it is crucial in speech perception, language acquisition, object recognition, and visual perception, along with many other domains (e.g., Warren, 1970; Ganong, 1980; Port & Dalby, 1982; Mann & Repp, 1980; Bar & Ullman, 1996; Bar, 2004).

In the speech domain, researchers have identified two main, non-mutually exclusive ways that listeners could rely on context, based on two ways that context affects a speaker’s production. The first is that context affects which sounds are likely to occur—e.g., /æ/ is much more likely than /ε/ to occur in the context th_t (‘that’ is a word, ‘thet’ is not), so listeners could be biased to perceive acoustics in that frame as /æ/ rather than /ε/. That is, top-down information could guide expectations about what category was likely to be heard. This type of information can supplement the acoustics, and we will refer to accounts of this kind as ‘top-down information’ accounts.

The second is that context affects how sounds are produced. For example, who is speaking will significantly and systematically alter the acoustics of the signal. This leads to variability in how a particular sound is produced, and can lead to overlap between different sound categories (e.g., one speaker’s /s/ could be another speaker’s /ʃ/ as shown in Newman et al., 2001). Listeners could, thus, factor out systematic variability stemming from contextual factors like speaker (but also, speech rate, position in an utterance, neighboring sounds, and so forth) from their input. Removing variability may lead to less overlap between categories, and make the mapping from acoustics to categories clearer. In other words, context could be used to pre-process the acoustics that are used for categorization decisions. These types of accounts have generally been termed ‘normalization’ accounts.

The top-down information and normalization account examples provided above make use of two different contextual factors (i.e. neighboring sounds vs. speaker information), but many contextual factors can affect both stages of production (i.e. which category is produced and how it is produced). That is, the core difference between these two accounts is not which contextual factors are used, but rather how they are used. In this paper, we will broadly define context to include neighboring sounds, position in a word/utterance, part of speech of the word the sound was produced in, speech rate, speaker, as well as aspects of the sound itself that have already been processed. Many of these contextual factors could be useful in both top-down information and normalization accounts. For example, a particular phoneme may be a priori more likely to be produced word-finally, in which case a listener would benefit from a bias towards perceiving that phoneme word-finally, as in a top-down information strategy. At the same time, sounds are acoustically longer word-finally, so a listener would separately benefit from accounting for this difference in how the sound was produced, as in a normalization strategy.

These two ways of using context have both been studied extensively. There is a large body of experimental and computational work supporting the notions (i) that context does affect both which sound is produced and how it is produced, (ii) that listeners can make use of these strategies, and (iii) that listeners do make use of these strategies to help overcome the overlapping categories problem. Both ways of using context are relatively well accepted in the speech perception literature.

However, there are two main limitations of previous work that warrant further study. First, these two ways of using context, although different, have been somewhat conflated in previous work, and have been difficult to dissociate experimentally. In particular, experiments that have been used to argue for one over the other generally show that an acoustic signal is perceived as one category in a particular context, but when the same signal is placed in a different context, it is perceived differently. This type of finding has been used to argue for both top-down information and normalization accounts but, depending on the specifics, it often merely shows that context is used, not how it is used. Therefore, it is not entirely clear whether listeners are using both of these strategies, and if not, which one they are using. Addressing this limitation requires separating these accounts and testing them individually, which computational methods allow us to do.

Second, these ideas have mostly been studied on synthetic or carefully controlled lab speech, which differs in important ways from the naturalistic and spontaneous speech that listeners actually learn from and process. It is not clear whether promising results from controlled lab speech generalize to more variable spontaneous speech; indeed, where this has been tested, they often have not generalized (e.g., Antetomaso et al., 2017). In addition, most of the debate so far has centered on whether listeners do or do not make use of these strategies, and has assumed that if listeners did use these strategies, doing so would help them process naturalistic speech. However, there is actually little to no evidence so far that these strategies are effective on naturalistic speech. Addressing this limitation requires applying these two strategies to naturalistic speech of the type that listeners are mostly exposed to, and testing whether they are effective in separating overlapping categories.

In this work, we study how context can be effectively used in speech perception, taking these two issues into account. We implement top-down information and normalization accounts separately and evaluate their relative contribution in the process of going from speech signal to categories—and we do so on spontaneous speech. We focus on the test case of Japanese vowel length, a test case with particularly overlapping categories that current computational models fail to learn. We find that top-down information is helpful in separating the sound categories, remaining robust even on spontaneously produced speech. However, contrary to expectations, we find that normalization is not helpful, at least as it has often been implemented in the cognitive literature. We then study why exactly the discrepancy between our results and previous findings occurs. We find that the discrepancy results from the difference between controlled lab speech and spontaneous speech, by showing that the exact same normalization process we use works if we apply it to lab speech that is more similar to the speech used in previous work. Simulations and a mathematical analysis reveal that one property of spontaneous speech that seems to play a particularly important role is the fact that categories do not occur uniformly across contexts in spontaneous speech, as they do in controlled lab speech. Imbalances in where categories occur—precisely the type of signal that is helpful in top-down information accounts—can hurt normalization. That is, this work not only dissociates two strategies that have often been conflated, but shows interesting interactions between them, such that properties of the input that make one of them effective can make the other ineffective.

Past research on these cognitive theories has tended to focus on whether listeners do or do not use these strategies, assuming that using them would actually solve the overlapping categories problem present in speech. While our results validate this assumption for top-down information accounts, our results show that in our case study, this assumption is wrong for a common implementation of normalization. It is possible that the theory about how listeners normalize could be repaired in light of these findings, as we will discuss, and this warrants further study. Overall, these results highlight the importance of studying speech perception using spontaneous speech, in addition to carefully controlled lab speech, as results from one do not necessarily generalize to the other.

Background

The Japanese vowel length contrast

This paper uses the Japanese vowel length contrast as a test case to compare the relative efficacy of top-down information and normalization strategies. In Japanese, there are two sound categories along the duration dimension—referred to as ‘short’ vowels and ‘long’ vowels (Vance, 1987). Which category is used can change the meaning of a word. For example, /biru/ with a short vowel means ‘building,’ while /biːru/ with a long vowel means ‘beer’. Results from perception and production studies reveal that Japanese speakers differentiate short and long vowels: they produce short and long vowels differently and can identify which vowel length category a particular vowel belongs to (Chen et al., 2016; Hisagi et al., 2010; Mugitani et al., 2009; Werker et al., 2007). Based primarily on studies of controlled laboratory speech, vowel length is often thought to be signaled primarily by the vowel duration cue and, to a lesser extent, by formant values (e.g., Arai et al., 1999; Kinoshita et al., 2002; Lehnert-LeHouillier, 2010). Some researchers have alternatively hypothesized that relativized vowel duration (the ratio between a vowel’s duration and the duration of its neighboring sound or the word it is in) might be the primary cue to vowel length instead (Hirata, 2004). However, there is no conclusive evidence in either direction, so we follow a substantial body of work in using vowel duration as the cue to vowel length, for two reasons. First, absolute duration can be more easily and reliably measured in naturalistic speech, where, for example, vowels often occur in isolation without any neighboring sounds to relativize against. Second, one of the theories we consider—normalization—has historically only operated over absolute cues. This is because it has been treated as an alternative, not a supplement, to relativizing cues: both are ways to transform the acoustics in such a way as to remove systematic contextual variability, and doing both could be redundant. Nonetheless, future work should study how these results generalize when using duration ratios, and we return to this issue in the General Discussion.

At this point, we wish to highlight an important terminological distinction between vowel length and vowel duration—and the corresponding two meanings that short/long can have in this context. Vowel length refers to the category status of a vowel—i.e., whether it is the vowel category that will result in /biru/ (‘building’) or /biːru/ (‘beer’). Vowel duration refers to the acoustic property of a vowel—i.e., how long it took the speaker to articulate the vowel—and is thought to be a cue to vowel length. Therefore, a vowel can be referred to as short (or long) if it belongs to the short (or long) category, but it can also be referred to as short (or long) depending on its physical duration. In this paper, we will use ‘phonologically short/long,’ ‘phonemically short/long,’ or simply ‘short/long’ to refer to category status, and ‘acoustically short/long’ to refer to physical vowel duration.

This distinction is critical because a vowel’s duration and length do not always line up. Recent work has shown that although short vowels and long vowels are different categories, the ranges of durations they can take overlap substantially (Bion et al., 2013). While long vowels are, on average, acoustically longer than short vowels, a particular production of a phonologically short vowel can be acoustically longer than a particular production of a phonologically long vowel. In fact, because only 9% of Japanese vowels are phonologically long, the combined distribution of all vowels is unimodal along the duration dimension (Fig. 1). Therefore, while vowel duration is thought to be the primary cue to vowel length, it is insufficient to completely separate short and long vowels in spontaneous productions. This is precisely what has led some researchers to instead consider relativized vowel duration as the primary cue to vowel length; however, that work has only considered controlled lab speech. On naturalistic speech, this problem persists, regardless of which type of cue is used (Bion et al., 2013).
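To see concretely why such a frequency imbalance can hide the contrast, the following sketch evaluates a hypothetical mixture density in which 9% of vowels are long; the category means, spread, and grid are illustrative assumptions, not values estimated from the corpus.

import numpy as np

# Illustrative (not corpus-derived) parameters: log-duration is modeled as
# Gaussian within each category, with 91% short and 9% long vowels.
w_short, w_long = 0.91, 0.09
mu_short, mu_long = np.log(0.07), np.log(0.14)   # roughly 70 ms vs. 140 ms
sigma = 0.3                                      # same spread for both categories

def gauss(x, mu, sd):
    """Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

x = np.linspace(-5.0, 0.0, 2001)  # grid over log-duration (seconds)
mixture = w_short * gauss(x, mu_short, sigma) + w_long * gauss(x, mu_long, sigma)

# Count local maxima of the combined density: despite containing two
# categories, the mixture has a single mode, as in Fig. 1.
modes = np.sum((mixture[1:-1] > mixture[:-2]) & (mixture[1:-1] > mixture[2:]))
print("modes in the combined distribution:", modes)  # prints 1

With these particular (hypothetical) parameters, changing the weights to 0.5/0.5 in the same script yields two modes, illustrating that the imbalance in category frequencies, and not only the acoustic overlap itself, contributes to the unimodal shape.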

Fig. 1 Distribution of R-JMICC dataset vowels (by log-duration): all are unimodal distributions. Values displayed are logs of the vowel durations in seconds. As a result, log-durations will be negative whenever the vowel is less than a second long

Note that vowel length is not the only way that Japanese listeners need to categorize incoming vowels. There are ten total vowel categories in Japanese, a short and long version of five different vowel qualities (/a/, /e/, /i/, /o/, /u/), so Japanese listeners need to determine both the vowel length and the vowel quality of incoming vowels. However, the acquisition and processing of vowel quality and vowel length seem to be relatively independent processes. It is thought that Japanese infants learn the vowel length contrast at around 10 months of age, about 6 months after they have been argued to learn the vowel qualities (Sato et al., 2010). For the purpose of this paper, we simply consider how Japanese listeners may learn and process vowel length, treating vowel quality as something that is already known and can help in the categorization.

Japanese vowel length is just one instance of a commonly observed overlapping categories problem, both in speech perception (Allen et al., 2003; Hillenbrand et al., 1995; Hillenbrand et al., 2001; Narayan, 2013; 2008; Narayan et al., 2017; Newman et al., 2001; Swingley & Alarcon, 2018), and more generally (Adelson, 1993; Bar, 2004; Todorović, 2010), where the physical cues are insufficient on their own to explain human perception. The Japanese vowel length contrast is a good first test case to consider because (i) existing computational models fail to adequately learn and classify these vowels due to overlap between the categories, (ii) contextual information has been argued to play a role in processing and learning, making it a good test case for studying how context is helpful, and (iii) there exists a hand-annotated dataset consisting of both child- and adult-directed Japanese spontaneous speech.

However, one important question to consider is the extent to which our findings on vowel length will generalize to other contrasts. Our results are likely to be informative about other cases of overlapping categories, which may arise for similar reasons as in this test case. Nonetheless, the Japanese vowel length contrast has some unique properties that might make it different from some other contrasts. First, there are disproportionately many short vowels relative to long vowels, while other contrasts are often more balanced in numbers. In fact, the overlapping categories problem arises because of this imbalance: if short vowels and long vowels were equally frequent, then the distribution might be bimodal (as in controlled lab speech), which might make the contrast easier to learn. Second, the main acoustic cue to this contrast—duration—is particularly influenced by other linguistic and non-linguistic factors. This means that the underlying relationship between category and acoustic cue may be particularly difficult to recover for Japanese vowel length compared to a different contrast where the acoustic cue is less affected by other factors. Third, the Japanese vowel length contrast has relatively low functional load, because it does not distinguish many minimal pairs. As a result, it is possible that the acoustics are less important than in other contrasts, because which sound was produced might be more predictable. Finally, the Japanese vowel length contrast is acquired relatively late compared to other contrasts (e.g., vowel quality contrasts) in both phonetic development (Sato et al., 2010) and phonological development (Mugitani et al., 2009). As a result, it could be that learners have access to different information or that a different learning mechanism is involved in learning this contrast. All of these properties suggest that Japanese vowel length may be a particularly overlapping and difficult contrast to learn, and results from these analyses could be illuminating for less extreme contrasts. While we will speculate on how our results might generalize to other types of contrasts throughout the paper, future work will need to investigate this issue more thoroughly.

Categorization - using unnormalized acoustic cues

Japanese listeners must first determine how many sounds there are along the duration dimension during acquisition and, once they have learned the language and its categories, they must decide which of the vowels they hear are short or long through a categorization process. We will test the usefulness of top-down information and normalization strategies by implementing them computationally and seeing how well they perform in categorizing Japanese vowels as short or long. We will compare their performance against a baseline model that categorizes exclusively based on unaltered, unnormalized acoustic cues.

All of the models we test are supervised and rely on already knowing the distinction between short and long vowels. As a result, these results are only directly applicable to adult speech perception, where the task is precisely to categorize vowels, and not to acquisition, where the task is to discover that there are two categories to begin with. Nonetheless, the results of this paper can provide some insight into acquisition, by pointing to promising directions to pursue in the future. Our categorization analyses reveal how well a strategy can, at its best, separate short vowels from long vowels. If a strategy cannot separate short vowels from long vowels in a supervised model, then it would be hard for an infant to use it to learn, and it is less promising to pursue in the context of acquisition. A strategy that can separate short and long vowels in a supervised setting is a much more promising one to pursue in the unsupervised acquisition setting, even though the analyses in this paper cannot make claims about how exactly infants learn these distinctions. In what follows, we lay out what this base categorization model looks like, before turning to a discussion of how context could be used in the process.

A categorization model can take many forms, but for the purpose of this paper, we model categorization using logistic regression, following previous work (McMurray & Jongman, 2011). Our logistic regression models will take as input a set of cues and map them to vowel category (either short or long). The baseline categorization model—argued to be insufficient in Bion et al., (2013) as described in the previous section—will take as input a vowel’s acoustic cues—duration and formant values—and will categorize the vowel as short or long depending on those cues. It will do so by weighting each of the cues (in terms of how much they contribute to whether the vowel should be short or long), summing the weighted acoustic cues, and then transforming this sum into the probability that the vowel is long versus short. That is, if we consider the acoustic cues d, f1, f2, f3, logistic regression takes the following form:

$$ P(\text{long} | d, f_{1}, f_{2}, f_{3}) = \frac{1}{1+e^{\beta_{0} + \beta_{d}d + \beta_{f_{1}}f_{1} + \beta_{f_{2}}f_{2} + \beta_{f_{3}}f_{3}}} $$
(1)

where the β terms are weights on the cues—duration, d, and formants, f1-f3. The probability that the vowel is short is 1 − P(long|d, f1,f2,f3). The model categorizes the sound as belonging to the category (short or long) that has the higher probability.

Learning this function involves learning an intercept (β0), as well as a weight for each cue (\(\beta _{d} ... \beta _{f_{3}}\)). The model is trained on data that consist of the unnormalized acoustic cues of a vowel, labeled with the category that vowel belongs to, and weights are learned so as to optimally separate the short vowels from the long vowels. Once we learn this function, we can take any new vowel and calculate the probability that that vowel is long (or short). The only information that this model has access to is the acoustics of the vowel, so it will be insufficient when categories overlap in acoustic cues, but incorporating context could help.
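To make this concrete, the baseline model can be implemented in a few lines. The sketch below is a minimal illustration, not the implementation used in our analyses; the tiny data frame and its column names (duration, f1, f2, f3, is_long) are hypothetical.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled training data: one row per vowel token,
# duration in ms, formants in Hz, is_long = 1 for phonemically long vowels.
vowels = pd.DataFrame({
    "duration": [62, 180, 75, 210, 95, 70, 160, 88],
    "f1":       [750, 760, 420, 430, 300, 740, 410, 310],
    "f2":       [1200, 1180, 800, 790, 2300, 1210, 810, 2250],
    "f3":       [2600, 2580, 2700, 2690, 3000, 2610, 2680, 2990],
    "is_long":  [0, 1, 0, 1, 0, 0, 1, 0],
})

X = vowels[["duration", "f1", "f2", "f3"]]
y = vowels["is_long"]

# Fitting learns an intercept and one weight per acoustic cue, chosen to
# separate short from long vowels as well as possible (cf. Eq. 1).
baseline = LogisticRegression(max_iter=5000).fit(X, y)

# For a new vowel, the model returns the probability that it is long,
# based only on its unnormalized acoustics.
new_vowel = pd.DataFrame({"duration": [120], "f1": [500], "f2": [1000], "f3": [2650]})
print(baseline.predict_proba(new_vowel)[0, 1])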

How context could be used

There are two ways that listeners could use context, which we define broadly to include neighboring sounds, speaker, position in a word/utterance, speech rate, already processed aspects of the sound itself (like vowel quality), and so forth (see full list in the first column of Table 1). To illustrate the two ways, it is helpful to consider the production process.

Table 1 The full set of contextual factors available for each dataset, with factors that were included in the normalization upper-bound shown in bold (as described in the sections on normalization methods). In the case of the R-JMICC corpus, these are taken from the linear regression normalization method, which outperformed the neural network normalization method

When a speaker produces a vowel, they first decide which category to produce (short or long vowel) depending on what word they are producing, and then utter a particular acoustic value for that vowel based on the vowel category they are producing. Both of these components of the production process are affected by the context of the vowel, but this is ignored in the base categorization model. The following two sections introduce the two ways context affects a speaker’s sound production and, consequently, the two ways that context could be used to improve categorization.

Top-down information accounts

First, the context of a sound directly relates to which vowel category is more or less likely to be produced. An English speaker is much more likely to produce an /æ/ vowel (as in ‘mat’) than an /ε/ vowel (as in ‘met’) in the context th_t (the word ‘that’ exists, but ‘thet’ does not), but the opposite holds when the context is w_t instead (the word ‘wet’ exists, but ‘wat’ does not). In Japanese, a speaker is relatively more likely to produce a long vowel if they are saying an /o/ vowel than if they are saying an /a/ vowel, as can be seen in Fig. 1. A listener could benefit from taking this type of prior knowledge into account, and indeed listeners’ perception appears to be biased by which sound was a priori more likely to occur.

This type of strategy is often referred to as a ‘top-down information’ account, as it makes use of listeners’ prior knowledge of which sounds are likely to occur in which contexts, in addition to the sounds’ bottom-up acoustic cues. It can also be thought of as a ‘predictive’ strategy, in the sense that context is used to directly predict which sound occurred.

In this paper, we will use the term ‘top-down information’ account to refer to the use of any prior knowledge—including information at the phonemic level—to directly bias perception. We wish to make explicit that we are using the term more broadly than it sometimes is used in the literature. It is sometimes used to refer only to lexical, syntactic, or semantic factors influencing speech perception. However, we use it in the sense of any non-acoustic information directly biasing speech perception, which can include information at the phonemic level (e.g., vowel quality or neighboring sounds).

The categorization model presented in the previous section does not, in its current form, take this type of information into account: it only takes into account whether one of the vowel categories is more likely to occur overall, not whether vowel categories are more likely to occur in particular contexts. To illustrate why this is problematic, consider the toy case shown in Fig. 2, in which there are two categories (short and long), and only two contexts (let’s say phrase-medial vowels and phrase-final vowels). Overall, phonemically short and long vowels occur with identical frequency; however, the phonemically short category is much more likely to occur in phrase-final position and the phonemically long category is much more likely to occur in phrase-medial position. The base categorization model will simply place the category boundary halfway between the short and long vowel means in (c), when really this category boundary should be at a shorter duration for vowels that occur phrase-medially and at a longer duration for vowels that occur phrase-finally. This means that the base categorization model will misclassify too many phrase-final short vowels as long (its boundary is too low for a context where short vowels are much more likely to occur) and will misclassify too many phrase-medial long vowels as short (its boundary is too high for a context where long vowels are much more likely to occur). However, taking into account context as a top-down influence can help correct this problem. In particular, if the model or listener takes into account expectations about which vowel is more likely to occur in the context heard, then they will, all else being equal, be biased towards categorizing vowels as short in contexts where short vowels are more likely to occur, and biased towards categorizing vowels as long in contexts where long vowels are more likely to occur.

Fig. 2 A toy example demonstrating how using contextual information as top-down information can be helpful. Although short vowels and long vowels are equally common overall, short vowels are much more common phrase-finally, and the opposite holds phrase-medially. Our baseline categorization model will not be able to take this into account and will miscategorize some vowels as long in phrase-final position and miscategorize some vowels as short in phrase-medial position

The base categorization model can be augmented to take this into account, in order to reflect what listeners are thought to do. In the logistic regression, this could be accomplished by adding the contexts as independent predictors. For example, in our Japanese example, if we added the vowel quality, q, of the vowel as an independent predictor, this could encode the fact that vowels that are /o/ are relatively more likely to be long than /a/ vowels:

$$ P(\text{long} | d, f_{1}, f_{2}, f_{3}, q) = \frac{1}{1+e^{\beta_{0} + \beta_{1}d + \beta_{2}f_{1} + \beta_{3}f_{2} + \beta_{4}f_{3} + \beta_{5}q}} $$
(2)

Essentially, this means that in addition to the acoustic cues affecting the relative probabilities of the vowel being short or long, the quality of vowel can also affect the categorization decision. Additional terms could be added depending on what other contextual factors are thought to predict category membership.
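In code, the only change from the baseline sketch above is that the contextual factor enters the model as an additional (here one-hot encoded) predictor. Again, the data and column names are hypothetical, and this is an illustration of the form of Eq. 2 rather than our exact implementation.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled data, now including a contextual factor: vowel quality.
vowels = pd.DataFrame({
    "duration": [62, 180, 75, 210, 95, 70, 160, 88],
    "f1":       [750, 760, 420, 430, 300, 740, 410, 310],
    "f2":       [1200, 1180, 800, 790, 2300, 1210, 810, 2250],
    "f3":       [2600, 2580, 2700, 2690, 3000, 2610, 2680, 2990],
    "quality":  ["a", "a", "o", "o", "i", "a", "o", "i"],
    "is_long":  [0, 1, 0, 1, 0, 0, 1, 0],
})

# One-hot encode the categorical context so that it contributes additional
# weighted terms to the regression (the q term in Eq. 2), alongside the acoustics.
X = pd.get_dummies(vowels[["duration", "f1", "f2", "f3", "quality"]],
                   columns=["quality"], drop_first=True)
y = vowels["is_long"]

top_down = LogisticRegression(max_iter=5000).fit(X, y)

# The weights on the quality dummies shift the short/long boundary up or down
# depending on which vowel quality a token occurs with.
print(dict(zip(X.columns, top_down.coef_[0].round(3))))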

The effect is that instead of having one categorization boundary overall, the boundary between categorizing a vowel as short and categorizing it as long will shift depending on the context, and how likely short vs. long vowels are to occur in that context. If phonemically long vowels are relatively more likely to occur in a particular context than short vowels, then the boundary between short and long vowels will shift towards vowels of shorter durations, such that more vowels are classified as long, and the opposite holds in contexts where phonemically short vowels are relatively more likely.

Crucially, this model assumes that the acoustics of a sound will be the same regardless of the context it was produced in, and so it cannot take into account the fact that vowel durations may systematically vary between different contexts (due to e.g., acoustic lengthening effects).

Normalization accounts

In the previous section, we saw that context can affect which category a speaker is likely to produce.

The next and final component of the speaker’s production process is to actually produce an acoustic value for the vowel category they have chosen. This portion of the production process is also affected by context, as context systematically and predictably affects how a particular sound category is acoustically realized. As an example, vowels uttered in fast speech are, all else being equal, acoustically shorter than vowels uttered in slow speech. Similarly, vowels uttered phrase-finally are, all else being equal, acoustically longer than vowels uttered phrase-medially.

This can introduce variability and overlap between short vowels and long vowels into the overall distribution—and is problematic for a categorization model simply relying on absolute acoustic cues. Consider the toy case in Fig. 3. Here again, there are two vowel categories (short vowels and long vowels) and there are two contexts (let’s say phrase-medial and phrase-final). In phrase-medial position, short vowels are produced with an average acoustic duration of 150 ms and long vowels are produced with an average acoustic duration of 300 ms. In phrase-final position, vowels are systematically acoustically lengthened by 100 ms. This scenario is problematic for the base categorization model because the overall distribution will reveal a lot of overlap between sound categories. In particular, long vowels in phrase-medial position will overlap with short vowels in phrase-final position. The baseline categorization model presented previously learns a categorization boundary between short vowels and long vowels, which is the same for all vowels, regardless of context. This will cause the model to overclassify vowels occurring in lengthening contexts as phonemically long, and overclassify vowels occurring in acoustically short contexts as phonemically short.

Fig. 3 A toy example demonstrating how using contextual information to normalize acoustics can be helpful. Phrase-final vowels are systematically acoustically lengthened, which introduces overlap in the overall distribution of short vowels and long vowels. However, a listener who knows that phrase-final vowels are systematically acoustically lengthened could normalize for this acoustic lengthening, and reduce the overall overlap between short vowels and long vowels in their input

However, the shifts in acoustic cue values are systematic and predictable once the context is known, so using contextual information can help overcome these problems—and listeners have been argued to do so in listening situations. There are various ways this problem could be overcome, and corresponding ways the baseline logistic regression model could be augmented. One is that listeners might build a separate mapping between acoustics and category membership for each context they encounter, such that lengthening contexts will have a boundary between short/long vowels at a higher duration, and vice versa for shortening contexts. This idea is referred to as adaptation (Kleinschmidt and Jaeger, 2015), and we will return to it in later discussion, but do not directly study it in this paper. Instead, we focus on a second idea, referred to as normalization.

The idea behind normalization is that instead of creating a different acoustic boundary between short/long vowels for every context encountered, all acoustics are mapped to the same context-independent acoustic space and then one boundary is estimated in this context-independent space. This is done by estimating how much any particular context lengthens or shortens the vowels, and then undoing all lengthening or shortening processes. Returning to our example, normalization would work by estimating that the vowels in phrase-final context are on average 100 ms longer than vowels in phrase-medial context, and then essentially shifting the distributions to compensate for this lengthening. Another way to think about it is that each vowel is represented relative to the mean duration of vowels that occurred in the same context. Acoustic cues that have been mapped to this context-independent space are referred to as normalized cues.
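The sketch below applies this idea to the toy scenario of Fig. 3: durations are simulated for two contexts (the means, the 100-ms lengthening, the added noise, and the equal category frequencies are illustrative assumptions), each vowel is re-expressed relative to the mean duration of its context, and a single duration boundary is then applied before and after this step.

import numpy as np

rng = np.random.default_rng(0)

def simulate_context(n_per_category, lengthening, sd=30.0):
    """Durations (ms) for one context: equal numbers of short and long vowels."""
    short = rng.normal(150 + lengthening, sd, n_per_category)
    long_ = rng.normal(300 + lengthening, sd, n_per_category)
    labels = np.array([0] * n_per_category + [1] * n_per_category)  # 1 = long
    return np.concatenate([short, long_]), labels

# Phrase-medial vowels vs. phrase-final vowels lengthened by 100 ms (Fig. 3).
medial_dur, medial_lab = simulate_context(1000, lengthening=0)
final_dur, final_lab = simulate_context(1000, lengthening=100)

durations = np.concatenate([medial_dur, final_dur])
labels = np.concatenate([medial_lab, final_lab])

# Normalization: represent each vowel relative to the mean duration of the
# vowels produced in the same context.
normalized = np.concatenate([medial_dur - medial_dur.mean(),
                             final_dur - final_dur.mean()])

def single_boundary_accuracy(cue, labels):
    """Classify as long whenever the cue exceeds one overall boundary."""
    boundary = cue.mean()
    return np.mean((cue > boundary).astype(int) == labels)

print(f"raw durations:        {single_boundary_accuracy(durations, labels):.3f}")
print(f"normalized durations: {single_boundary_accuracy(normalized, labels):.3f}")

On the raw durations, a single boundary confuses phrase-medial long vowels with phrase-final short vowels; once each context's mean has been subtracted out, the same single boundary separates the two categories almost perfectly.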

In the top-down information accounts, the logistic regression in Eq. 1 was augmented by adding additional predictors based on the context of the sound in question (e.g., the vowel quality, q). In normalization accounts, the logistic regression in Eq. 1 is changed by performing a preprocessing step (which will be described in detail below), and inputting normalized cues (dnorm, \(f_{1}^{norm}\), \(f_{2}^{norm}\), \(f_{3}^{norm}\)) into the logistic regression, instead of unnormalized cues as before (dunnorm, \(f_{1}^{unnorm}\), \(f_{2}^{unnorm}\), \(f_{3}^{unnorm}\)).

This means that while information about the context a sound occurs in is a direct input to the logistic regression categorization model in top-down information accounts, it is not in normalization accounts. Rather, in normalization accounts, the contextual information is used to obtain normalized acoustic cues, which are ultimately the only input to the categorization model (e.g., Cole et al., 2010; McMurray & Jongman, 2011).

The normalized cues of a sound are obtained by predicting its expected cue values based on the context it occurs in, and then comparing these expected cue values against its actual cue values.

$$ cue^{norm} = cue^{unnorm} - cue^{expected} $$
(3)

The expected cues can be calculated from a sound’s contextual information using various methods, and we make use of two such methods. First, we follow past work, and train a linear regression to predict a sound’s acoustic cues from the context the sound occurs in (Cole et al., 2010; McMurray & Jongman, 2011).
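A sketch of this pre-processing step is given below. It follows the general residualization recipe described above rather than the exact implementation of the cited studies, and the data frame, its column names, and the choice of contextual factors are hypothetical.

import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical vowel tokens: acoustic cues, contextual factors, and labels.
vowels = pd.DataFrame({
    "duration": [62, 180, 95, 210, 150, 70, 260, 88],
    "f1":       [750, 760, 420, 430, 300, 740, 410, 310],
    "f2":       [1200, 1180, 800, 790, 2300, 1210, 810, 2250],
    "f3":       [2600, 2580, 2700, 2690, 3000, 2610, 2680, 2990],
    "speaker":  ["m1", "m1", "m2", "m2", "m1", "m2", "m1", "m2"],
    "position": ["medial", "medial", "final", "final",
                 "final", "medial", "final", "medial"],
    "is_long":  [0, 1, 0, 1, 0, 0, 1, 0],
})

cue_cols = ["duration", "f1", "f2", "f3"]
context = pd.get_dummies(vowels[["speaker", "position"]], drop_first=True)

# Step 1 (normalization): predict each cue from the context with a linear
# regression and keep only the residual, i.e. cue_norm = cue_unnorm - cue_expected.
normalized = pd.DataFrame(index=vowels.index)
for cue in cue_cols:
    expected = LinearRegression().fit(context, vowels[cue]).predict(context)
    normalized[cue] = vowels[cue] - expected

# Step 2 (categorization): only the normalized cues are input to the logistic
# regression; the contextual factors themselves are not.
norm_model = LogisticRegression(max_iter=5000).fit(normalized, vowels["is_long"])
print(norm_model.coef_.round(3))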

The second method involves training a neural network to predict a sound’s acoustic cues from the context the sound occurs in. The benefit of this method is that it allows for more powerful, non-linear normalization functions to be learned. Once the pre-processing step is complete and we have normalized all of the acoustic cues relative to context, we can then replace the unnormalized cues with normalized cues in the logistic regression categorization model:

$$ \begin{array}{@{}rcl@{}} P(\text{long} | d^{norm}, f_{1}^{norm}, f_{2}^{norm}, f_{3}^{norm})\\ = \frac{1}{1+e^{\beta_{0} + \beta_{1}d^{norm} + \beta_{2}f_{1}^{norm} + \beta_{3}f_{2}^{norm} + \beta_{4}f_{3}^{norm}}} \end{array} $$
(4)
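The second method differs only in how the expected cue values are computed: the linear regression in the sketch above is swapped for a small neural network. The sketch below uses scikit-learn's MLPRegressor purely as a stand-in architecture; it is not the network used in our analyses.

import numpy as np
from sklearn.neural_network import MLPRegressor

def normalize_with_network(context, cue_values):
    """cue_norm = cue_unnorm - cue_expected, with the expectation given by
    a small (stand-in) neural network instead of a linear regression."""
    net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
    net.fit(context, cue_values)
    return np.asarray(cue_values) - net.predict(context)

Replacing the LinearRegression call in the previous sketch with this function yields neural-network-normalized cues, which then enter the categorization model exactly as in Eq. 4.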

Normalizing has the effect of shifting where the boundary between short and long vowels falls. In particular, in the example in Fig. 3, the vowels in phrase-final position are shifted by a constant amount relative to those in phrase-medial position. Ignoring context will cause a large degree of variability and overlap between short and long vowel acoustics in the overall distributions. However, normalizing out this variability, by shifting the two contexts so that they line up, will help. In particular, vowels that are acoustically quite long can still be correctly classified as short, because listeners may account for the fact that these vowels were lengthened and undo this effect. That is, a long acoustic duration presented in phrase-medial context may be perceived as long; however, when placed in phrase-final context, that same vowel with the same acoustics may now be perceived as phonemically short, because it is short relative to other vowels that occur in that same lengthening context.

There are other implementations of normalization, including z-scoring, vocal tract normalization, relativizing cues, as well as proposals by Dillon et al., (2013) that we do not test in this work. We return to the question of how our results generalize to other normalization methods in the General Discussion, but future work should investigate this question more thoroughly.

Adaptation accounts

Another idea that has been proposed is that of adaptation (Kleinschmidt and Jaeger, 2015). Under ‘adaptation’ accounts, listeners build a separate model for each context they encounter, so they have a different mapping from acoustic space to categories for each context a sound occurs in. For example, a listener using an adaptation strategy would build a separate model for utterance-medial /o/ vowels, utterance-final /a/ vowels, etc. (see Kleinschmidt & Jaeger, 2015 for a more thorough explanation of adaptation). In doing so, adaptation allows listeners to ignore systematic acoustic variability that stems from the context a sound occurs in. These models would encode the fact that a shorter absolute duration is required to classify a vowel as long in utterance-medial position than in utterance-final position, without transforming the vowels’ acoustic cues as is done in normalization.

While both normalization and adaptation aim to explain how listeners account for systematic acoustic variability, they do so in different ways. In particular, under normalization accounts, all acoustics are mapped to one context-independent acoustic space using an explicit normalization function, and listeners only estimate one boundary between short and long vowels in the context-independent acoustic space. Under adaptation accounts, listeners estimate a different acoustic boundary between short and long vowels for each context they encounter, without considering data from other contexts. This means that in addition to accounting for systematic acoustic variability, adaptation can also encode top-down information. Building a separate model for each encountered context necessarily encodes relative frequency of occurrence of different sound categories across different contexts, and this could bias perception. Therefore, adaptation can take advantage both of factoring out systematic variability and using top-down information. Because we wish to disentangle the relative contribution of these two ideas, we do not study the efficacy of adaptation strategies here, but we return to the idea of adaptation in the General Discussion.
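Although we do not test adaptation here, the contrast with normalization can be made concrete: rather than residualizing the cues and fitting one model, an adaptation-style learner fits a separate acoustics-to-category mapping for each context. The sketch below is schematic, with hypothetical data and column names.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical tokens: duration (ms), one contextual factor, and category labels.
vowels = pd.DataFrame({
    "duration": [70, 150, 90, 160, 170, 260, 150, 280],
    "position": ["medial", "medial", "medial", "medial",
                 "final", "final", "final", "final"],
    "is_long":  [0, 1, 0, 1, 0, 1, 0, 1],
})

# Adaptation: one model per context, trained only on that context's tokens.
# Each model implicitly encodes both the context-specific acoustics and the
# context-specific frequencies of short vs. long vowels.
adapted = {
    ctx: LogisticRegression(max_iter=5000).fit(group[["duration"]], group["is_long"])
    for ctx, group in vowels.groupby("position")
}

# The same 200-ms vowel can receive different categorizations in different contexts.
probe = pd.DataFrame({"duration": [200]})
for ctx, model in adapted.items():
    print(f"{ctx}: P(long | 200 ms) = {model.predict_proba(probe)[0, 1]:.2f}")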

Crucially, we have seen that these two distinct theories about how listeners could use context produce similar changes in categorization, which has led researchers to sometimes conflate them in the literature. In what follows, for each of the two strategies, we first review the evidence that has been used to argue in favor of listeners using them, and then review evidence that both ways of using context could potentially be helpful in Japanese.

Top-down information

Evidence for top-down information accounts

Experimental and computational work suggests that people can and do make use of higher-level linguistic information in a top-down fashion—at least on synthesized or carefully controlled laboratory speech. In various experiments, researchers have presented participants with stimuli in which portions of the acoustic information have been removed or degraded, and shown that participants make use of contextual information to compensate. Warren (1970) showed that when adult participants are played a sentence with a single phone (and its transitional cues) completely removed and replaced with a cough, they nonetheless report hearing the sound, suggesting that linguistic context affects speech perception.

This is true even when full acoustic information is available. In a classic study, Ganong (1980) played participants acoustic continua that ranged between a non-word and a word (i.e. from dask to task, or from dash to tash), and showed that participants were biased towards categorizing the initial sound in a way that resulted in a word. They were more likely to classify a given sound as /t/ for dask-task, but as /d/ for dash-tash, suggesting that listeners use context (in this case, lexical information), in addition to acoustics, to constrain their categorization decisions. Similar work has shown that phonotactic constraints also affect categorization decisions, such that listeners are more likely to classify a particular sound in a way that adheres to, rather than violates, the phonotactics of their language (e.g., Brown and Hildum, 1956; Massaro & Cohen, 1983).

Particularly relevant to our test case, there is experimental evidence from Moreton and Amano (1999) that Japanese speakers may make use of higher-level contextual information to make decisions about vowel length. Words in Japanese fall into four main groups based on their historical origin (e.g., Foreign words, Sino-Japanese words, etc.) and these word groups differ in their properties. For example, long /a/ occurs in Foreign words, but not Sino-Japanese words, and Sino-Japanese and Foreign words have different frequency distributions over consonants (e.g., /p/ is frequent in Foreign words, but very rare in Sino-Japanese words, and vice versa for /hy/). Taken together, this means that, for example, an /a/ vowel that occurs in a word with /hy/ is almost certainly phonemically short, while an /a/ vowel that occurs in a word with /p/ could also be long. In a series of experiments, Moreton and Amano (1999) showed that Japanese listeners make use of these regularities when identifying vowels: the other consonants a particular vowel token co-occurred with affected whether participants categorized it as short or long, again showing that top-down information affects adults’ categorization.

Children also seem to use top-down information to guide acquisition and processing. A number of studies have shown that both adults and infants use lexical context while acquiring sound categories (Thiessen, 2007; Swingley, 2009; Feldman et al., 2013b). For example, Feldman et al., (2013b) showed that adults and infants were more likely to assign acoustically similar vowels (/ɑ/ vs. /ɔ/) to different sound categories when they were not exposed to minimal pairs between them (i.e., when they did not occur in the same phonetic contexts) than when they were exposed to minimal pairs (i.e., when the vowels occurred in identical phonetic contexts). In addition, Feldman et al., (2013a) showed that a computational model that made use of information about the word frames that sounds occurred in resulted in an improvement in phonetic category learning over models that did not incorporate lexical information.

The idea that higher-level information influences speech perception and language acquisition has been replicated many times over, and is mostly accepted in the field. Most of the support for this idea, however, comes from work on simplified speech data. Furthermore, the model from Feldman et al., (2013a) was recently applied to the problem of Japanese vowel length we study here, and was found to be ineffective on spontaneous speech (Antetomaso et al., 2017). Therefore, there is some recent doubt that this strategy could be helpful on spontaneous speech. However, phonemically short vowels and phonemically long vowels have been shown to differ in the contexts in which they are likely to occur in Japanese, so there is potentially signal that would be helpful to a listener relying on such a strategy. We discuss this evidence in the following section.

Evidence that there is top-down information in Japanese

With the exception of Moreton and Amano (1999) and Antetomaso et al., (2017), there has not been much work on studying the role of top-down information in the acquisition and processing of Japanese vowel length. However, there is independent evidence that there are systematic differences between short and long vowels in the types of contexts/environments they occur in that listeners could make use of.

First, different vowel qualities have different relative probabilities of short and long vowels, as seen in Fig. 1. In particular, long vowels make up a greater proportion of /o/ vowels than /a/ vowels.

Short and long vowels also differ in the types of sounds they co-occur with, due, for example, to properties of various subsets of the Japanese lexicon as seen in the previous section (Moreton and Amano, 1999). Similarly, in some dialects of Japanese, long vowels do not occur before nasals, due to phonotactic constraints. Vowels also tend to be phonologically short when adjacent to long consonants. Therefore, the adjacent sounds of a vowel could potentially provide useful, disambiguating information about the length status of a target vowel (Isei-Jaakkola, 2004).

Finally, prosodic position could also be helpful, as phonemically long vowels are less likely to occur domain-finally (e.g., Kubozono, 2002). As a result, listeners could exploit the prosodic position of the vowel to help determine the length of a vowel: they could be biased towards classifying a domain-final (e.g., word-final) vowel as short.

Overall, there are various patterns due to phonological, historical, or lexical reasons that result in differences in how likely short versus long vowels are to occur in particular contexts. Listeners could exploit this information in a top-down fashion to categorize and learn the vowel length contrast. We test how effective this strategy could be by applying it to the Japanese vowel length contrast.

Normalization

Evidence for normalization

A body of experimental work has been used to argue that listeners can and do normalize when making categorization decisions—at least on the carefully controlled laboratory speech or synthetic speech that is typically studied (but see Johnson, 1997, 2006; Pierrehumbert, 2002, which argue against normalization). This work generally shows that listeners’ perception of a particular sound can change by modifying the context it appears in. As we saw, modifying the context can also change listeners’ perception if they are relying on a top-down information strategy. Therefore, this evidence is insufficient to argue uniquely for normalization as a useful strategy when the contextual factor being normalized out could also prove helpful in a top-down information account (e.g., neighboring sounds, prosodic position).

However, for contextual factors that do not influence which category is more likely to be produced (e.g., speech rate and speaker), there is extensive evidence that listeners factor out systematic variability from the acoustics of lab speech, though these studies do not necessarily pinpoint normalization as the involved mechanism—as opposed to adaptation, for example (Kleinschmidt & Jaeger, 2015). In this section, we review the literature that has been taken as support for normalization in the field, even if it could also be used to argue for adaptation or top-down accounts, but we return to the issue of how to properly dissociate these accounts, and what evidence could be taken as unequivocal support for one of these theories, in the General Discussion.

Nearey (1978) studied synthetic speech and showed that listeners factor out systematic variability stemming from speaker. His study showed that listeners’ category boundaries were shifted upward in F1 and F2 when a target sound followed a vowel that sounded like a child produced it instead of a man. This type of result has been repeatedly reported (e.g., Strand and Johnson, 1996).

Mann and Repp (1980) studied synthetic speech and argued that listeners also take into account coarticulatory influences. They played participants a fricative from the /ʃ/ to /s/ continuum, followed by either the rounded vowel /u/ or the unrounded vowel /a/. They found that participants were more likely to identify the fricative as /s/ when it was followed by /u/ than when it was followed by /a/. Fujisaki and Kunisaki (1978) found a similar effect with Japanese speakers.

Various studies have also shown that listeners take into account the influence of speech rate. These findings are particularly relevant to the Japanese vowel length case, because they offer evidence that participants using durational cues also take into account systematic variability due to context. Using synthesized speech, Fujisaki et al., (1975) studied Japanese listeners’ perception of the contrast between short and long consonants as a function of contextual speech rate. They played participants synthesized syllables ranging from /ise/ to /isse/ and found that the absolute duration at which participants’ percept changed from a short consonant to a long consonant was affected by the speech rate of the utterance.

Analogous effects have been found for English vowel and voicing categorization, as well as /b/-/w/ distinctions, and recent work has even suggested that changing the speech rate of neighboring consonants can cause listeners to not hear or insert entire function words (e.g., Ainsworth, 1974; Verbrugge et al., 1976; Ainsworth, 1973; Dilley and Pitt, 2010; Miller & Liberman, 1979; Minifie et al., 1977; Summerfield, 1981). Overall, the general finding that listeners’ perceptions of a sound (or even a word) change as a function of the context it occurs in has been replicated many times over (e.g., Crystal & House, 1990; Miller, 1981; Miller et al., 1984; Miller et al., 1997; Newman & Sawusch, 1996; Pickett & Decker, 1960; Sawusch & Newman, 2000; Wayland et al., 1992, 1994) and has often since been taken as evidence for normalization.

However, as mentioned above, recent work has also suggested that some of the experimental findings that have been taken as evidence for factoring out systematic variability may actually be support for participants making use of top-down information. In a classic study, Port and Dalby (1982) argued that listeners use durations of neighboring sounds, in addition to utterance speech rate, to calibrate (or normalize) the durational cues of the target sound. They ran several experiments studying English listeners’ voicing judgments in synthesized minimal pairs like rapid versus rabid. They showed that the duration of a vowel neighboring a stop could affect listeners’ perception of whether that stop was voiced or voiceless (Port & Dalby, 1982), and similar findings have been reported in other research as well (e.g., Boucher, 2002; Summerfield, 1981). These findings have classically been interpreted as evidence that listeners factor out the effect of speech rate, using the duration of the stop’s closure relative to the neighboring vowel to do so. However, Toscano and McMurray (2012) argued that these same findings were consistent with the alternative idea that listeners are using neighboring vowel duration as a direct cue to the voicing of the target stop (parallel to closure duration or VOT), rather than normalizing for it. Although this reinterpretation has been discussed with reference to a particular set of studies (Boucher, 2002; Port & Dalby, 1982; Summerfield, 1981), it raises the interesting possibility that other studies arguing for normalization could also be used as evidence for a top-down information account, rather than for normalization. In particular, this holds true for all studies where the contextual factor that is normalized out could also prove helpful in a top-down information account—for example, neighboring sounds.

Experimental findings in support of normalization have been supplemented by recent computational work, which has generally found that models that normalize for systematic variability achieve better sound category identification results, and better match human performance than models that do not.

McMurray and Jongman (2011) showed that a model that normalized for multiple contextual factors better matched human behavior than a model that did not. They took lab recordings of the 8 English fricatives /f, v, θ, ð, s, z, ʃ, ʒ/ produced in the initial position of a CVC syllable, where the vowel was one of six vowels, and the final consonant was always /p/. They had measurements of 24 cues from these tokens (Jongman et al., 2000). They presented a subset of these recordings to listeners and asked them to identify the syllable-initial fricative. They then used a method from Nearey (1990) and Cole et al., (2010), that we also make use of in this paper, to compare whether normalized or unnormalized cues led to more human-like identification in their model. They found that the version that normalized for speaker and neighboring vowel yielded a better match to human categorization than the version that used unnormalized cues. This finding has been replicated many times, sometimes with different normalization implementations (Apfelbaum and McMurray, 2015; Cole et al., 2010; Richter et al., 2017); however, these models have, for the most part, only been applied to controlled and well-enunciated lab speech.

There has also been some work looking at normalization in acquisition. Dillon et al., (2013) considered the problem of learning the phonological system of Inuktitut, using elicited speech. Inuktitut has three vowels (/i/, /u/, /a/), but these vowels are lowered when followed by uvular consonants. The researchers found that a computational model that learned from the unnormalized vowel formants failed to learn the correct sound categories of Inuktitut (learning six categories instead), but when they subtracted out the influence of the neighboring uvular and used these normalized vowel formants as input to the model, it was able to learn the correct three categories of Inuktitut, just as infants do, suggesting that normalization is a possible strategy that infants could be using in learning the sounds of their language.

Because most cognitive research has focused on carefully controlled laboratory research or synthesized speech, and because many of the empirical studies supporting normalization could also be in support of top-down information accounts, it is hard to draw strong conclusions about the efficacy of normalization in naturalistic listening environments. This paper further tests its efficacy in naturalistic listening situations.

Evidence that factoring out systematic variability might be useful in Japanese

Factors other than phonological length influence the duration of Japanese vowels, and could cause the overlap between short and long vowels. This is variability that normalization could, in principle, help reduce.

First, the quality of a vowel systematically affects its duration. Hirata (2004) had Japanese participants produce disyllabic non-words in a carrier phrase and found that the vowel /e/ tended to be acoustically longer than /o/ and /u/. In addition, Bion et al., (2013) analyzed a corpus of spontaneously produced infant-directed speech and found that low vowels were acoustically longer than high vowels.

Japanese vowels are also acoustically shorter in fast speech than slow speech, all else being equal. Hirata (2004) had participants produce Japanese sentences (including non-words) at three different speech rates—slow, normal, and fast speech—and found that as the speech rate quickened, the vowels became acoustically shorter.

There is evidence that the prosodic position of a sound influences the duration of a vowel, as well. There are various prosodic phrase types in Japanese—utterances are made up of intonational phrases (IPs), which are, in turn, made up of accentual phrases (APs)—and a vowel’s position relative to these phrasal units affects its duration. Bion et al., (2013) found that in spontaneous infant-directed speech, vowels are acoustically longer when followed by an intonational phrase boundary, but acoustically shorter when followed by a word boundary that is not an intonational phrase boundary. Martin et al., (2016) calculated the average mora duration in various prosodic positions in spontaneously produced adult- and infant-directed speech. They found that the average mora duration increases, moving from more phrase-medial to more phrase-final positions (from phrase-medial, to AP-final, to IP-final, to utterance-final position), which suggests that segments are acoustically lengthened phrase-finally.

Some work has also shown that neighboring sounds can influence the duration of a vowel. For example, several studies have found that vowels tend to be acoustically longer before a geminate than a singleton consonant (Fukui, 1978; Han, 1994; Kawahara, 2006). Other work has suggested that accented vowels tend to be acoustically longer than unaccented vowels (Hirata, 2004).

Finally, although these factors have not been studied in Japanese, work in other languages suggests that sounds may be acoustically lengthened at the beginning of a phrase, in addition to phrase-finally (Keating et al., 2004; Rakerd et al., 1987), that sounds may be acoustically shorter in function words than in content words, and that other features of neighboring consonants, such as voicing, may affect the duration of the target vowel (House, 1961; Luce and Charles-Luce, 1985; Umeda, 1975; Van Santen, 1992). In sum, there are a priori reasons to believe that normalization could be helpful for the vowel length contrast.

Testing the efficacy of top-down information on Japanese vowel length

In this section, we test how helpful using contextual information as top-down information can be in categorizing Japanese vowels, by testing to what extent it helps separate short and long vowels in spontaneous speech. We compare various logistic regression models that make use of higher-level contextual factors to the baseline logistic regression that only uses unnormalized duration and formants.

Data

The data we use come from the RIKEN Japanese Mother-Infant Conversational Corpus (R-JMICC) (Mazuka et al., 2006). It is spontaneously produced child-directed speech. Mazuka et al., (2006) collected the data by recording the speech of 22 mothers who visited the lab with their 18- to 24-month-old children. The mothers first played with their child with picture books for about 15 min. They then played with their child with toys for about 15 min. Finally, a female experimenter came into the room and talked to the mother. The mothers’ speech in the first two parts, where they interacted only with their child, was labeled as child-directed speech. The mothers’ speech in the third part, where they interacted with the experimenter, was labeled as adult-directed speech. The corpus consists of about 14 total hours of speech, and is labeled for both phonetic and prosodic information.

We extracted information about each of the vowels produced by the mothers, but excluded singing, coughing, devoiced vowels, diphthongs, and any segments that the researchers could not transcribe. We also excluded any vowels that were not labeled with prosodic information. This left 92003 total vowels, 30035 of which were in the adult-directed section of the corpus and 61968 of which were in the child-directed section of the corpus. All of the analyses we report were run on the child-directed part of the corpus; however, we also ran these analyses on the adult-directed parts and did not find substantial differences in model performance (see Supplementary Materials).

We extracted both acoustic information and contextual information about each vowel, as described below. The list of the features we extracted is also compiled in Table 1.

Acoustic cues

  • Duration: We extracted the duration of each vowel in milliseconds.

  • Formants: We extracted the first three formants, and used these as direct acoustic cues to vowel length. While duration is thought to be the primary acoustic predictor of vowel length in Japanese, previous work has shown that spectral information can improve categorization performance (e.g., Arai et al., 1999; Kinoshita et al., 2002; Lehnert-LeHouillier, 2010). The formants were automatically extracted using Praat (Boersma, 2001) in previous work on this corpus (Antetomaso et al., 2017) and we used the formant values at the midpoint of the vowel.

Contextual factors

In addition to extracting acoustic information, we also extracted contextual information about each vowel that has been shown to be relevant for normalization or top-down information accounts:

  • Vowel quality: This was a categorical variable that took one of five values (/a/, /e/, /i/, /o/, /u/) and was taken from the coding of what the mother said.

  • Speaker: This was a categorical variable, with 22 different possible speaker values.

  • Neighboring sounds: We extracted the identity of the previous sound and the following sound (both categorical variables), as labeled by the phonetic transcription. This was marked as ‘#’ if the vowel was preceded by silence. Because the vowel length contrast is thought to be learned later than other types of contrasts (Sato et al., 2010), it is reasonable to assume that infants can make use of the other contrasts in their language to learn vowel length.

  • Prosodic position: We represented prosodic position in three different ways. First, we extracted a categorical variable that ranged from 1 to 4, which indicated whether the word that the vowel occurred in was not phrase-final at all (1), was AP-final (at the end of an accentual phrase) (2), was IP-final (at the end of an intonational phrase) (3), or was utterance-final (4) (BI). Second, we extracted a second categorical variable that ranged from 1 to 4, which indicated whether the word that the vowel was in was not phrase-initial (1), was AP-initial (2), was IP-initial (3), or was utterance-initial (4) (BIstart). Third and finally, we extracted a vector of length 12, which represented the prosodic position of the vowel itself in a bit more detail. Namely, each element of the 12-long vector was a binary categorical variable, with three elements of the 12 elements corresponding to whether the vowel itself was word-initial, word-medial, word-final, three to whether the vowel itself was AP-initial, AP-medial, AP-final, three to whether the vowel itself was IP-initial, IP-medial, IP-final, and three to whether the vowel itself was utterance-initial, utterance-medial, utterance-final. That is, while the first two categorical variables represented the prosodic position of the word the vowel was in, and would, thus, have the same value for every vowel in a given word, the vector represented the prosodic position of the vowel itself.

  • Accented?: This was a binary variable that took a value of 1 if the vowel was accented and 0 if it was not.

  • Speech rate: We extracted the durations of the immediately preceding and immediately following sounds as proxies for speech rate. If the vowel was immediately preceded (or followed) by silence, we did not use the duration of the silence; instead, we used the average duration of the preceding (or following) sound, computed across all vowels that were not preceded (or followed) by silence.

  • Condition of the vowel: This was a categorical variable with a value of ‘B’ if the vowel occurred when mother and child were playing with books and ‘T’ if it occurred when mother and child were playing with toys. We include this to account for the possibility that the mothers’ speech was consistently different (e.g., more or less clear) while playing with books than toys.

  • Part of speech: This was a categorical variable taken from the annotation in the corpus. In our simulations, we either use full part-of-speech information, or simplified part-of-speech information, which only considers the distinction between function and content words. We vary this because we want our results to be applicable to language acquisition. Infants show evidence of distinguishing function vs. content words using acoustic correlates as early as birth (Shi et al., 1999; Shi and Werker, 2001), so it is relatively likely that they can make use of this knowledge in learning the contrast. However, it is less clear that they could make use of full part-of-speech information for this task, as cross-linguistic evidence suggests that infants have much of this knowledge only after Japanese infants have learned the vowel length contrast (Höhle et al., 2004; Mintz, 2006; Shi & Melançon, 2010). That being said, He and Lidz (2017) show evidence that infants know the distinction between nouns and verbs as early as 12 months, so while infants might not have complete part-of-speech information, they may be able to use more than just the distinction between function and content words for acquiring the vowel length contrast. Testing function vs. content word distinctions in addition to full part-of-speech allows us to determine whether our qualitative results hold true regardless of what infants know.
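To make the feature set concrete, the following sketch (in Python; not the authors' code, and with hypothetical column names) shows one way the factors above could be encoded as numeric predictors for the models described next.

```python
# Hypothetical sketch: encoding the contextual factors listed above as numeric
# predictors. All column names are illustrative, not the corpus's field names.
import pandas as pd

CATEGORICAL = ["vowel_quality", "speaker", "prev_sound", "next_sound",
               "BI", "BIstart", "condition", "pos"]         # pos: simplified or full
NUMERIC = ["accented", "prev_sound_dur", "next_sound_dur"]  # binary flag + rate proxies
ACOUSTIC = ["duration_ms", "F1", "F2", "F3"]

def build_predictors(vowels: pd.DataFrame, include_acoustics: bool = True) -> pd.DataFrame:
    """One-hot encode categorical context; keep numeric context (and acoustics) as-is.
    The 12 binary vowel-position indicators would be appended as extra 0/1 columns."""
    context = pd.get_dummies(vowels[CATEGORICAL].astype(str), drop_first=True)
    parts = [context, vowels[NUMERIC]]
    if include_acoustics:
        parts.append(vowels[ACOUSTIC])
    return pd.concat(parts, axis=1)
```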

Methods

We compare the results of four models, which fall into three types. The baseline model is a logistic regression that learns to predict short/long from only a vowel's absolute duration and formant values (Baseline). The next two models are logistic regressions that learn to predict short/long from the contextual factors listed previously and in Table 1, in addition to absolute acoustic cues (Acoustic and Top-Down Information Models). The first of these makes use of all of the contextual factors listed in Table 1, with part-of-speech simplified to indicate only whether the word was a function or content word. The second makes use of all of the contextual factors, including detailed part-of-speech, exactly as annotated in the corpus. Finally, we test how much signal the contextual factors alone provide, by running a logistic regression model that learns to categorize vowels as short/long using only the contextual factors, without any access to acoustic information (Top-Down Information Model Without Acoustics). Studying the results of this model allows us to understand how much of the work context does: it reveals how many vowels can be identified just from the context they occur in, without turning to acoustic information at all, or, in other words, how much information is lost when acoustics are removed.

We split the dataset into a training subset (90% of the data) and a test set (the remaining 10% of the data), keeping the proportions of short and long vowels equal in the two sets. The training and test sets consisted of the same tokens for all of the simulations run in this paper.
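A minimal sketch of this setup, assuming a numeric predictor matrix X like the one sketched above and a label vector y (1 for long, 0 for short); the function name and random seed are ours, not the original pipeline:

```python
# Minimal sketch of the 90/10 stratified split and model fitting/evaluation.
# X: numeric predictors (see encoding sketch above); y: 1 = long vowel, 0 = short.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_and_score(X, y, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=seed)  # equal short/long proportions
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    y_te = np.asarray(y_te)
    overall = float((pred == y_te).mean())
    short_acc = float((pred[y_te == 0] == 0).mean())
    long_acc = float((pred[y_te == 1] == 1).mean())
    return model, overall, short_acc, long_acc
```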

Once the logistic regression equations were estimated from the training set, we simply applied each equation to the vowels in the unseen test set to make a prediction about whether that vowel was short or long, as described previously. We compared the models’ predictions to the true labels. We report two types of evaluation metrics for each tested model.

First, we report overall categorization accuracy, which is simply the percentage of all of the vowels in the test set that the model categorized correctly, as well as accuracy on just the short vowels and accuracy on just the long vowels. Second, we report the Bayesian Information Criterion (BIC) for each model, computed over the training set. The BIC is a common metric used to select between different models (Schwarz, 1978). The benefit of the BIC is that it balances how well the model works (the likelihood of the model given the data) with how complicated the model is (how many parameters it uses), so it will prefer simpler models, all else being equal. The BIC is calculated as follows and lower values are better:

$$ \text{BIC} = -2\ln(L) + k\ln(n) $$
(5)

where L is the likelihood of the model given the data, k is the number of parameters, and n is the number of samples.
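As a concrete illustration, the BIC of a fitted logistic regression could be computed as in the sketch below (our own illustration; the parameter count here simply takes one weight per predictor column plus the intercept):

```python
# Sketch of BIC = -2 ln(L) + k ln(n) for a fitted binary logistic regression.
import numpy as np

def bic(model, X_train, y_train):
    probs = np.clip(model.predict_proba(X_train)[:, 1], 1e-12, 1 - 1e-12)  # P(long | cues)
    y = np.asarray(y_train)
    log_lik = np.sum(y * np.log(probs) + (1 - y) * np.log(1 - probs))
    k = X_train.shape[1] + 1      # weights plus intercept
    n = X_train.shape[0]
    return -2.0 * log_lik + k * np.log(n)
```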

We ran each model ten times and averaged performance across these ten runs.

Results

The results from this analysis on child-directed speech are summarized in Table 2.

Table 2 Summary of top-down information results from the R-JMICC dataset

Baseline model

The baseline model reached an overall accuracy of 91.1%. It correctly categorized 99.1% of short vowels, and 12.2% of long vowels. It had a BIC of 28716. Because 90.9% of vowels in the R-JMICC corpus are short, this model performs comparably to a model that simply categorizes every incoming vowel as short, and has failed to learn anything meaningful about the distinction between short and long vowels.

Acoustic and top-down information model

The following models used contextual factors as direct predictors of category membership, in addition to using absolute duration and formant values. When part-of-speech was simplified to the distinction between function and content words, the model reached an overall accuracy of 95.2%, correctly classifying 98.8% of short vowels and 59.0% of long vowels. The BIC was 15193. When we included full part-of-speech information, the model achieved an overall accuracy of 95.7%, correctly classifying 98.8% of short vowels and 63.9% of long vowels. The BIC was 13106. Including additional part-of-speech information led to performance improvements, but both models substantially outperformed the baseline model. Table 3 analyzes the role of each contextual factor by showing how well a model with each factor as its only piece of top-down information performs. The most helpful factors include part-of-speech, the previous sound, the following sound, whether the sound is accented, prosodic information (BI and BIstart, as described previously), and vowel quality.

Table 3 Results showing the contribution of each contextual factor. This table shows model results when each available contextual factor is included as the only piece of top-down information in a logistic regression model. Factors are ranked from lowest BIC (best) to highest BIC (worst)

Top-down information model without acoustics

Even without any acoustic information, using only contextual information, the top-down information model achieved an overall accuracy of 94.5%, correctly classifying 98.6% of short vowels and 54.0% of long vowels. The model's BIC was 16301. That is, although there was a slight dip in performance when we removed acoustic information, top-down information models can still perform well without acoustics, suggesting a large role for context.

Discussion

In these analyses, we investigated the hypothesis that infants and adults learn and process the Japanese vowel length contrast by combining bottom-up acoustic cues with top-down expectations about which category is likely to occur in a particular context. To implement this hypothesis, we included contextual factors listed in Table 1 as direct predictors of category membership in the logistic regression model (in addition to absolute acoustic cues), and compared its performance against a model that only makes use of absolute acoustic cues as predictors.

We found that including these additional contextual factors as predictors drastically improved accuracy and lowered BIC scores, suggesting that this method does quite well at separating short vowels from long vowels. Given the relatively small set of factors we used—for example, the only word-level information we used was part-of-speech—it is quite impressive that the model achieved this level of performance, and it suggests that this may be a hypothesis worth pursuing as a way that infants could learn and adults could process the Japanese vowel length contrast.

In fact, although excluding acoustic information did hurt performance, a model relying on contextual information alone still performs very well. Even without any acoustic information, this model can correctly identify nearly all short vowels and more than half of all long vowels. This illustrates just how much signal there is in contextual information.

This work shows that top-down information could be very useful in adult speech perception, and also has implications for acquisition. Although these are supervised models that have much more information available to them than infants learning language, and there is still work to be done to show that this is a strategy that could be helpful in acquisition, our analysis does reveal that there is signal to separate short and long vowels that could be exploited in a future unsupervised model.

Testing the efficacy of normalization on Japanese vowel length

In this section, we test whether normalization can help categorize Japanese vowels, by comparing models that use normalized acoustic cues to models that use unnormalized acoustic cues.

Data

The data are exactly the same as in the first analysis, but the contextual factors listed in Table 1 are normalized out of the acoustics (as described below in the Methods section), instead of being included as independent predictors in the logistic regression categorization model. The same training and test sets are used as in the previous analysis, which allows us to directly compare results.

Methods

In testing the efficacy of normalization on spontaneous speech, we implement and test two normalization methods. First, we apply methods from previous work (Cole et al., 2010; McMurray & Jongman, 2011; Nearey, 1990) to the Japanese vowel length contrast, by using linear regression to normalize out systematic variability from vocalic acoustic cues. Second, we implement normalization using a neural network, which has the advantage over past implementations that it can represent more powerful, non-linear normalization functions. Our results can only directly tell us about the two implementations we use, and future work should investigate other ways of normalizing. We return to the question of how these results would generalize to other contrasts in the General Discussion.

Normalization implementation

We use either unnormalized or normalized acoustic cues as predictors of vowel length. Using unnormalized cues simply involves representing the absolute acoustic cues, so this section will focus on how we implement normalization. The basic idea underlying both of the implementations we use is to learn a function that predicts acoustic features (duration and formants) of a vowel from the context that a vowel occurs in (i.e., vowel quality, speaker, prosodic position). Once we learn this function, we can make a prediction about a vowel's duration and formants based on everything we know about where it occurs. We can then use the residuals, or the difference between how long we expect the vowel to be given all of the factors and how long it actually is, to represent a normalized version of this vowel. That is, we have excluded the influence of contextual factors and have recoded the acoustic cues in terms of their difference from expected values. Once we learn this function from the training set, we recode both the training set and the test set in normalized terms. We use two different methods to represent the function from contextual factors to acoustic cues: linear regression, and neural networks, which can learn non-linear functions. We describe each in turn.

Linear regression as normalization

Following previous work, we first use linear regression to factor out systematic variability (Cole et al., 2010; McMurray & Jongman, 2011; Nearey, 1990). Linear regression models represent a relationship between a continuous dependent variable and a set of independent variables. In this particular case, we try to estimate an equation that can predict what the acoustic features (duration and formants) of a vowel should be from its context. Each of the factors (e.g., vowel quality, speaker, prosodic position from Table 1) is weighted and combined linearly to yield a prediction. That is, given the factors x1,x2,...,xn, linear regression models take the form:

$$ \text{acoustic cue} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{n}x_{n} $$
(6)

Learning this function involves learning an intercept (β0), as well as a weight for each contextual factor (β1...βn). The data it learns from consist of the contextual information we want to factor out of the acoustic cues, as well as the known acoustic cue values of the vowel, and weights are learned so as to minimize the error in predicting the acoustic cue values (e.g., the duration) of the vowel.
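A minimal sketch of this normalization step, assuming the contextual factors have been encoded numerically as before (our illustration, not the original implementation; variable names are ours):

```python
# Sketch of normalization by linear regression: predict each acoustic cue
# (duration, F1-F3) from the encoded contextual factors, then replace each cue
# with its residual (observed minus expected value given the context).
from sklearn.linear_model import LinearRegression

def normalize_cues(C_train, A_train, C_test, A_test):
    """C_*: encoded contextual factors; A_*: acoustic cues. Returns residualized cues."""
    reg = LinearRegression().fit(C_train, A_train)   # multi-output: one fit per cue
    resid_train = A_train - reg.predict(C_train)
    resid_test = A_test - reg.predict(C_test)        # same function applied to unseen tokens
    return resid_train, resid_test
```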

Neural networks as normalization

The linear regression models we use can only learn linear normalization functions and do not include interaction terms, even though previous work did. This is because our analyses use a total of 23 contextual factors, so considering all possible interactions would be computationally difficult. To test the possibility that our linear regression without interactions was insufficient to handle spontaneous speech, we also implemented normalization using a neural network. We train a neural network on the training set to predict the duration and formants of a vowel token from its context. Once we have a trained neural network, we use it to predict expected acoustic cues for each vowel, subtract these predictions from the vowel's true acoustic values, and use the resulting residuals as input to a logistic regression model.

We use a simple feed-forward neural network. We use five-fold cross-validation on the training data to tune parameters of the neural network. We manipulate the number of hidden layers, the batch size, the number of nodes in the hidden layers (either keeping this constant for all of the layers or decreasing the number of nodes progressively deeper into the network), learning rate, number of epochs, and regularization factors. We choose the parameters that minimize the mean squared error, averaged across the cross-validation folds of the training set.
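As an illustration only, the sketch below implements this kind of tuned feed-forward normalizer with scikit-learn's MLPRegressor as a stand-in for whatever framework was actually used; the hyperparameter grid contains placeholder values rather than the settings explored in this work.

```python
# Illustrative feed-forward normalizer tuned by five-fold cross-validation.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

def fit_nn_normalizer(C_train, A_train):
    grid = {
        "hidden_layer_sizes": [(64,), (64, 32), (128, 64, 32)],  # depth and width
        "batch_size": [32, 128],
        "learning_rate_init": [1e-3, 1e-2],
        "max_iter": [200, 500],          # roughly, number of epochs
        "alpha": [1e-4, 1e-3],           # L2 regularization strength
    }
    search = GridSearchCV(MLPRegressor(), grid, cv=5,
                          scoring="neg_mean_squared_error")
    search.fit(C_train, A_train)         # picks settings minimizing cross-validated MSE
    return search.best_estimator_        # residuals then feed the logistic regression
```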

Logistic regressions

To test the efficacy of normalization, we compare seven total logistic regression models, which can be grouped into three types. The first model is the baseline, which, as before, uses absolute (unnormalized) duration and formants to predict category membership (short or long). Then, for each type of normalization (linear regression and neural networks), we run three models. First, we regress out all of the contextual factors listed in Table 1 with part-of-speech in simplified form (i.e., function/content word distinctions). Second, we regress out all of the contextual factors listed in Table 1 including full, detailed part-of-speech information. In both of these models, the normalization function is trained completely independently of the subsequent logistic regression. That is, the normalization function is not trained to maximize categorization performance. The third and final model is an oracle model: we choose the subset of contextual factors from Table 1 that maximizes categorization performance, which gives us an estimate of the upper bound on normalization performance. This is useful because it is possible that we are wrongly including some factors in the first two normalization models and underestimating the efficacy of normalization. Running this oracle model allows us to see what the best normalization performance could be.
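The oracle model amounts to a search over subsets of contextual factors. A brute-force sketch of such a search is given below (our own illustration; the get_context helper that encodes a chosen subset is hypothetical, and exhaustive search is only practical for a modest number of factor groups).

```python
# Brute-force sketch of the oracle model: try subsets of contextual factor groups,
# normalize with each subset, and keep the subset whose residualized cues give the
# best categorization accuracy.
from itertools import combinations
from sklearn.linear_model import LinearRegression, LogisticRegression

def oracle_subset(factor_groups, get_context, A_tr, A_te, y_tr, y_te):
    best_subset, best_acc = None, -1.0
    for r in range(1, len(factor_groups) + 1):
        for subset in combinations(factor_groups, r):
            C_tr, C_te = get_context(subset)          # encode only the chosen factors
            reg = LinearRegression().fit(C_tr, A_tr)
            R_tr = A_tr - reg.predict(C_tr)           # residualized training cues
            R_te = A_te - reg.predict(C_te)           # residualized test cues
            clf = LogisticRegression(max_iter=1000).fit(R_tr, y_tr)
            acc = clf.score(R_te, y_te)
            if acc > best_acc:
                best_subset, best_acc = subset, acc
    return best_subset, best_acc
```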

Results

A summary of the results is presented in Table 4.

Table 4 Summary of normalization results on R-JMICC corpus

Unnormalized model

The baseline model is identical to the baseline model from the previous analysis and uses unnormalized duration and formants as predictors of category membership, without running any linear regression models. As a reminder, this logistic regression model reached an overall accuracy of 91.1%. It correctly classified 99.1% of short vowels and 12.2% of long vowels. Its BIC was 28716.

Linear regression normalization models

When all of the contextual factors with simplified part-of-speech (function vs. content word) were regressed out, the model had an overall categorization accuracy of 91.2%, correctly classifying 99.5% of the short vowels and 8.3% of the long vowels. It had a BIC of 30774. The set of factors used accounted for 26.8% of the variance in duration, 23.0% of the variance in F1, 40.2% of the variance in F2, and 8.1% of the variance in F3.

When all of the contextual factors including full part-of-speech information were regressed out, the model had an overall categorization accuracy of 91.2%, correctly classifying 99.6% of the short vowels and 7.6% of the long vowels. It had a BIC of 30990. The set of factors used accounted for 27.8% of the variance in duration, 23.1% of the variance in F1, 40.3% of the variance in F2, and 8.3% of the variance in F3. Figure 4 plots the normalized durations by vowel length for this model.

Fig. 4 Distribution of normalized durations (through linear regression). Normalizing does not appear to decrease overlap between short and long vowels

Finally, the oracle normalization model included the following five contextual factors: speaker, whether the vowel itself was word-final, whether the vowel itself was AP-final, whether the vowel itself was IP-final, and whether the vowel itself was utterance-final. This oracle model had an overall accuracy of 91.2%, and correctly classified 99.0% of the short vowels and 13.6% of the long vowels. It had an overall BIC of 28122. The set of factors that resulted in the best categorization performance accounted for 11.7% of the variance in duration, 3.6% of the variance in F1, 3.3% of the variance in F2, and 3.8% of the variance in F3.

Neural network normalization models

When all of the contextual factors with simplified part-of-speech (function vs. content word) were normalized out, the model had an overall categorization accuracy of 91.1%, correctly classifying 99.8% of short vowels, and 5.1% of long vowels. The BIC was 32356.

When all of the contextual factors, including fully detailed part-of-speech information, were normalized out, the model reached an overall categorization accuracy of 91.1%, correctly classifying 99.7% of short vowels, and 5.8% of long vowels. The BIC was 31738.

Finally, the oracle model normalized out the following factors from the acoustics: whether the vowel itself was word-final, whether the vowel itself was AP-initial, whether the vowel itself was AP-final, and whether the vowel itself was utterance-final. The oracle model had an overall accuracy of 91.2%, and correctly classified 99.0% of the short vowels and 13.4% of the long vowels. It had an overall BIC of 28188.

Discussion

Previous work has argued that normalization can be helpful in acquisition and processing (Cole et al., 2010; Dillon et al., 2013; McMurray & Jongman, 2011); however, our results on Japanese vowel length did not lend additional support to this hypothesis. We compared the Japanese vowel length categorization performance of a logistic regression model that used unnormalized acoustic cues to the performance of various logistic regression models that used normalized acoustic cues. We considered two different normalization implementations, and three different instantiations of normalized cues for each. The first normalized all available contextual factors, with simplified part-of-speech information (i.e., whether the word containing a vowel was a function or content word). The second normalized all available contextual factors, including detailed part-of-speech information. The third and final model normalized out the subset of contextual factors that led to the best categorization performance. Crucially, in the first two models, as in past work, normalization was not optimized to give the best categorization. The final, oracle model considered categorization performance in choosing how to normalize, giving it the best possible chance to succeed.

The main finding was that, at its best, normalization resulted in only a modest improvement in BIC and essentially no improvement in accuracy, regardless of which implementation we used. Although the overall accuracy of all of the models is quite high, simply guessing that every vowel was short would yield comparable results. Normalization left overall accuracy essentially unchanged, but improved the BIC from 28716 for the unnormalized version to 28122 for the best normalized version. While this does constitute an improvement, it is a modest one, and a listener would need to learn precisely which factors to normalize out. Of course, it is possible that results would be better on a larger corpus with more information about the contextual factors. We used previous and following sound duration as a proxy for speech rate, and other measures of speech rate might lead to better performance; we return to this possibility in the discussion. However, given how prevalent normalization is in the field, the results are surprisingly poor and call into question the efficacy of normalization, at least in this task.

Although it is difficult to directly compare this degree of improvement to the improvement shown in past studies merely on the basis of accuracy, past studies that have implemented and tested normalization reported that normalization resulted in an increase in performance from 28.63% to 54% and from 83.3% to 92.9%, respectively (Cole et al., 2010; McMurray & Jongman, 2011). In comparison, in this work, the overall accuracy did not change depending on whether cues were unnormalized or normalized, and the long vowel accuracy increased from 12.2% to 13.6%, a much weaker increase in performance than has been observed previously.

It is important to emphasize that there are many ways that normalization could be implemented. Here, we have only tested one that has been proposed and well studied in the literature, as well as a neural network extension of it. It is possible that a different implementation of normalization could yield different results, and future work should test this, in addition to developing additional specific proposals for how normalization could operate. Nonetheless, we have some evidence that this normalization model may not be as helpful as previously thought, and we explore why in the following sections. Understanding why normalization is not helpful will also let us speculate whether these results will generalize to other normalization implementations.

Do differences between controlled lab speech and spontaneous speech explain discrepancies in results?

Previous results found normalization to be helpful; however, our results were surprising in that they showed that normalization was unhelpful—even when the process was fully supervised. The biggest difference between previous work and our own is that most previous work has explored normalization on controlled and carefully enunciated lab speech, but our work looked at normalization on spontaneously produced speech. To bring these results more in line with each other, we apply the same normalization analyses we used on the R-JMICC Spontaneous Speech corpus to a corpus of read speech that more closely resembles controlled, lab speech. We find that the same linear regression normalization process that was not helpful on spontaneous speech is helpful on read lab speech, suggesting that the discrepancy in results between our work and previous work arises from differences between spontaneous and controlled speech.

Data

The data we use here come from Werker et al., (2007). The data consist of recordings of ten mothers teaching their 12-month-old infants a set of 16 nonce CVCV words, while looking at picture books together. This interaction included both a reading task, in which mothers were asked to read sentences containing the nonce words with pictures of the novel object (Werker Read dataset - Fig. 5), and a spontaneous speech task, in which mothers were asked to describe a scene that contained the novel object, using the nonce word as much as possible (Werker Spontaneous dataset - Fig. 6). The nonce words were constructed using only /i/ and /e/ as critical vowels, so the data do not contain any annotated instances of /a/, /o/, or /u/, unlike the R-JMICC corpus. The data were collected in the NTT Communication Science Laboratories in Keihanna, Japan and were labeled by trained phoneticians.

Fig. 5 Distribution of Werker Read IDS vowels (by log-duration). Log-durations will be negative whenever the vowel is less than a second long

Fig. 6 Distribution of Werker Spontaneous IDS vowels (by log-duration). Log-durations will be negative whenever the vowel is less than a second long

These data were much more similar to datasets that had previously been used to study normalization, though not identical. The experimenter controls the environment in which target sounds occur, and artificially changes the statistical co-occurrences relative to those of naturalistic speech. This is especially true for the read portions, but still true for the spontaneous subset, in which researchers still decided what the nonce words were and, therefore, what sounds each target vowel was likely to occur next to. In addition, the productions are relatively well enunciated because the parents are trying to teach their children new words. That being said, even the read speech is less constrained than many speech recordings used for research, in which words are often recorded in isolation, or in highly constrained contexts like "Now I will say __."

It is also worth pointing out that though we, and the past researchers, refer to one portion of the Werker data as spontaneous, it is quite different from the spontaneous R-JMICC data, in that nonce words were used, only a subset of vowels is represented, mothers were instructed to teach their infants, and they were provided with highly constrained images to describe.

Given these data, we extracted information about each of the vowels produced by the mothers, excluding any segments that the researchers could not annotate with certainty. The read speech data consisted of 798 vowels, of which 381 (47.7%) were phonemically short vowels and the remaining 417 (52.3%) were phonemically long vowels. The spontaneous speech data consisted of 1382 vowels, exactly half of which were phonemically short and half of which were phonemically long. As with the R-JMICC data, the information we extracted was used either as an acoustic predictor or as a contextual factor to be normalized out.

Acoustic cues

As before, we used the duration of the vowel in milliseconds and the first three formants, as direct predictors of vowel length. These acoustic cues were either represented unnormalized as in the corpus, or underwent normalization through linear regression.

Contextual factors

We also extracted all of the contextual information that the original researchers had labeled on this dataset. The set of factors available for the Werker data is largely a subset of what was available for the R-JMICC dataset, with the exception that fundamental frequency (F0) was available for the Werker data while it was not extracted for R-JMICC. In addition, the labeling for prosodic position information was much simpler for the Werker data than for R-JMICC data, as described below. We collected the following pieces of information about each of the extracted vowels.

  • Vowel quality: This was a categorical variable that took one of two values (/e/ or /i/) and was taken from the coding of what the mother said.

  • Speaker: This was a categorical variable with one of ten different possible values.

  • Prosodic position: Prosodic position took one of four values: ‘Independent Word,’ if the vowel occurred in a free-standing word, or ‘Sentence Initial,’ ‘Sentence Medial,’ or ‘Sentence Final,’ depending on whether the syllable the vowel occurred in was first, in the middle, or last in the sentence. This was controlled in the Werker Read data.

  • Neighboring sounds: We extracted the previous and following sound. Unlike in the R-JMICC data, these were controlled to always be consonants.

  • Fundamental frequency: We extracted the F0 at the vowel’s midpoint.

Methods

We use linear regression to implement normalization. We did not use neural networks because they require large amounts of data, which we do not have for controlled lab speech, and because they performed worse than linear regression normalization models in our previous analyses. This is a limitation because it does not provide as strong a test of the normalization hypothesis; however, our results will show that even a linear normalization function suffices to obtain better normalization results on the Werker data than on the R-JMICC data. The methods were otherwise identical to the normalization methods run on the R-JMICC data set.Footnote 2

Results

The results are summarized in Table 5.

Table 5 Comparison of normalization results on R-JMICC spontaneous speech corpus and Werker controlled laboratory data. The Werker speech corpus had a read component and a spontaneous component, but even the spontaneous component was relatively controlled by the experimenters, as the experimenters provided nonce words for the parents to teach their children

Werker read speech data

Unnormalized model

The unnormalized model achieves 91.4% overall accuracy on the Werker Read speech. Although this is a similar overall accuracy to the R-JMICC spontaneous data, this corpus is much more balanced than the R-JMICC corpus. In the Werker Read speech, about 47.7% of the vowels are short, compared to 90.9% in the R-JMICC corpus, so a strategy of simply categorizing every vowel as short (or long) will not yield results as good on the Werker Read speech as on the R-JMICC corpus. The unnormalized model correctly classifies 89.7% of short vowels and 92.9% of long vowels, achieving a BIC of 246.

Normalized models

When we normalize out all available factors, the model's overall accuracy is 86.1%, and it correctly classifies 83.9% of the short vowels and 88.1% of the long vowels. Its BIC is 399. That is, normalizing all available factors does not improve performance. When we instead choose the best subset of factors, the model no longer factors out the effect of the following consonant, and shows a boost in performance. It achieves an overall accuracy of 95.1%, a short vowel accuracy of 92.3%, a long vowel accuracy of 97.6%, and a BIC of 105.

Werker spontaneous speech data

Unnormalized model

The unnormalized model achieves 82.9% overall accuracy on the Werker Spontaneous speech, and correctly classifies 90% of the short vowels and 75.7% of the long vowels. It achieves a BIC of 1072. Again, in the Werker Spontaneous speech, exactly 50% of the vowels are short, so the unnormalized model substantially outperforms one that simply guesses that each vowel is short, unlike on the R-JMICC corpus.

Normalized models

The model that normalizes out all available contextual factors listed previously and in Table 1 achieves an overall accuracy of 78.5%, correctly classifying 85.9% of short vowels and 71.1% of long vowels. When we allow subsequent categorization results to drive which subset of contextual factors is included in normalization, the model achieves an overall accuracy of 90.0%, and correctly classifies 92.9% of short vowels and 87.1% of long vowels. It achieves a BIC of 869. As with the Werker Read speech, normalizing out all of the factors does not help, but depending on which factors are normalized out, normalization can help, and substantially.

Discussion

In this section, we applied the same linear regression normalization analyses that we ran on the R-JMICC spontaneous speech corpus to the Werker corpus. The idea was to test whether we would see normalization results similar to those previously reported when we used data that more closely resembled the data used in previous work. We found that normalization could help on the read speech, as well as on the Werker spontaneous speech, even though it did not help when all available contextual factors were factored out.

That is, on the same contrast in the same language, normalization was helpful on carefully controlled lab speech, but was unhelpful on naturalistic, uncontrolled spontaneous speech. This suggests that normalization may be ineffective on spontaneous speech.

Another interesting finding was that the Werker Spontaneous speech patterned similarly to the Werker Read speech, instead of the R-JMICC Spontaneous speech. The overall results were worse on the Werker Spontaneous speech than on the Werker Read speech; however, normalization was helpful on the Werker Spontaneous speech, but not the R-JMICC Spontaneous speech. One reason for this could be that duration seems to be used differently by speakers in the R-JMICC data versus the Werker data, perhaps reflecting the fact that the Werker data is not nearly as natural as the R-JMICC data. In particular, in comparing Fig. 1 to Figs. 5 and 6, it seems that the contrast is being produced differently in the two datasets, such that duration is a much better cue for vowel length in the Werker data than in the R-JMICC data. There is less overlap between the short and long vowel categories in the Werker data: there is a duration such that all vowels that are acoustically longer than it are reliably long vowels. In the R-JMICC data, however, this is not the case: some short vowels are as acoustically long as the most acoustically lengthened long vowels.

It is important to emphasize that although both the R-JMICC and Werker Spontaneous speech datasets are referred to as spontaneous speech, they differed quite substantially in nature. In particular, in the Werker Spontaneous speech, mothers were producing nonce words that were created by researchers, were instructed to teach their infants, and were given pictures to describe. In the R-JMICC Spontaneous speech, mothers were given toys and books, but were given very little instruction, so they were free to talk about anything. It is important to keep these types of distinctions in mind when developing and comparing performance across various spontaneously produced speech datasets.

Overall, the simulations we have presented have disentangled normalization and top-down information accounts and evaluated their relative efficacy on relatively naturalistic, spontaneously produced speech. Our results from Japanese vowel length suggest that while top-down information accounts are extremely useful even on spontaneous speech, results that argue for this normalization model only hold for controlled laboratory speech and do not generalize to the type of spontaneous speech that listeners hear. These results force us to scrutinize the role that this model, and normalization more broadly, can play in learning and processing, as well as the ways in which the primary cue for a distinction can shift across different domains of speech.

In the following two sections, we consider what properties of spontaneous speech cause normalization to be ineffective. We provide simulations, followed by a theoretical analysis demonstrating that a listener that makes use of normalization will be impeded if sound categories in their input differ in the types of contexts they are likely to occur in.

Simulating how contextual category imbalances affect normalization performance

We showed that normalization can help reduce category overlap between Japanese short and long vowels when applied to controlled lab speech, but not when applied to spontaneous speech. What are the properties of spontaneous speech that make this normalization implementation ineffective?

In this section, we provide simulations that reveal that one property of spontaneous speech that seems to play an important role is the fact that categories do not occur uniformly across contexts in spontaneous speech, as they do in controlled lab speech. That is, imbalances in where categories occur—precisely the type of signal that is helpful in top-down information accounts—can hurt normalization. We provide an example from the Werker controlled lab speech, in which we take advantage of one contextual factor—the following sound—that is not balanced between short vowels and long vowels. In particular, the consonants /g/, /s/, and /z/ (three of the eight consonants used in the study) each followed either only short vowels or only long vowels. Even within following consonants that occurred both with short and long vowels, there were large imbalances in which vowels occurred with which consonants. These types of imbalances are uncommon in carefully controlled lab speech, where researchers ensure that each vowel occurs in each context—but are extremely common in spontaneous speech, which has phonotactic constraints and phonological alternations. We previously showed that when the effect of the following consonant was one of the contextual factors normalized out, normalization hurt on Werker Read speech, but when it was not normalized out, normalization helped. Here we show that this is because of the large imbalance observed between short and long vowels, by artificially balancing the dataset and showing that normalizing out the following consonant becomes helpful once it is balanced.

Methods and data

The data we use come from the Werker dataset, as described previously. We test the efficacy of normalization (implemented via linear regression) on various subsets of the Werker Read speech data. We limit normalization to one contextual factor—following consonant.

The first dataset is simply the full dataset (Full). As described previously, some of the consonants in the dataset exclusively follow either short vowels or long vowels (i.e., /g/, /s/, and /z/). To create the second dataset, we remove all vowel tokens that precede one of these consonants and test the efficacy of normalization on this partially balanced dataset (Partially Balanced). The remaining consonants (/b/, /d/, /k/, and /p/) are still all much more likely to follow one of the vowel categories than the other. For example, /k/ is twice as likely to follow short vowels as long vowels, even though it co-occurs with both. Therefore, to create the third dataset, we randomly remove enough tokens such that each following consonant is preceded by the same number of short and long vowels (Fully Balanced). The Fully Balanced dataset most resembles typical controlled lab speech corpora, as it completely controls for which vowels occur with which consonants. For each of these three datasets, we test the efficacy of normalizing out the effect of the following consonant, by seeing whether normalized or unnormalized cues result in a better separation between short and long vowels. To ensure that differences in normalization efficacy between datasets are not due to changes in overall proportions of short/long vowels, or due to differences in dataset size, we create two additional control datasets: a Control for the Partially Balanced Data and a Control for the Fully Balanced Data. To create these datasets, we randomly remove the same number of short vowels and long vowels from the full dataset as are removed in the Partially Balanced and Fully Balanced datasets, but remove them randomly and uniformly from all contexts, instead of removing them based on the following consonant. We run normalization using linear regression on each of these five datasets, and test whether normalization is helpful on each of them.
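A sketch of the balancing step (our own illustration, with hypothetical column names): within each following-consonant context, the majority category is randomly downsampled so that short and long vowels are equally represented, and contexts that occur with only one category are dropped.

```python
# Sketch of constructing the Fully Balanced dataset. Column names ("next_sound",
# "length") are illustrative, not the corpus's field names.
import pandas as pd

def fully_balance(vowels: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    balanced = []
    for consonant, group in vowels.groupby("next_sound"):
        counts = group["length"].value_counts()      # counts of 'short' and 'long'
        if len(counts) < 2:
            continue                                  # consonant occurs with one category only
        n = counts.min()
        for label, sub in group.groupby("length"):
            balanced.append(sub.sample(n=n, random_state=seed))
    return pd.concat(balanced, ignore_index=True)
```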

Results

The results are summarized in Table 6. Normalizing for the effect of the following consonant was ineffective on the Full dataset: unnormalized cues resulted in 91.4% overall accuracy, while normalized cues resulted in 82.7% overall accuracy. However, normalization was more effective on the Partially Balanced dataset, which removed all vowel tokens that preceded a consonant that only occurred either with long vowels or with short vowels. Unnormalized cues resulted in 90.1% accuracy, while normalized cues resulted in 92.6% accuracy. Finally, normalization was even more effective on the Fully Balanced dataset: unnormalized cues resulted in 90.1% accuracy, while normalized cues brought the accuracy up to 93.8%. That is, each step of removing imbalances in the data resulted in improvements in normalization performance. In fact, when we completely balanced the dataset, normalization was effective. Just reducing the size of the tested dataset or changing the relative proportion of short/long vowels was not enough to explain this effect, as normalization was still ineffective on both control datasets.

Table 6 Results from balancing how often short/long vowels precede different sounds in the Werker Read Speech corpus. Results indicate that the more balanced the corpus, the better normalization performs

Discussion

In this section, we explored why normalization is unhelpful on spontaneous speech. One difference between spontaneous speech and lab speech is that sound categories in spontaneous speech often differ in the contexts they are likely to occur in, while in lab speech, researchers specifically control where sounds occur to make sure that the dataset is fully balanced. We took advantage of one contextual factor within the Werker data for which this was not true, and found that when there were imbalances in a particular context, normalization hurt, but when we artificially balanced the context, normalization was helpful. That is, listeners relying on a normalization strategy when their input contains strong imbalances between categories would be hurt, unless they could somehow learn that they should not normalize for factors that are imbalanced. In this particular case, that would mean learning to normalize for the previous consonant, but not the following consonant.

If category imbalances across contexts were the only factor impeding normalization in our analyses, then we would expect a similar manipulation to make normalization effective on the R-JMICC data. However, in further analyses (not described in detail here), we were unable to show that balancing contextual factors on the spontaneous R-JMICC data made normalization effective. This suggests that although contextual imbalances of this type constitute one key difference between lab speech and spontaneous speech, they are not the only reason that normalization is ineffective on spontaneous speech but not lab speech. Another possibility is that duration is less of a primary cue to vowel length in spontaneous speech than lab speech, and this could make normalization ineffective.

That being said, this simulation points to an interesting interaction between normalization and top-down information accounts, because the imbalances that are harmful for normalization are precisely the imbalances that are helpful for top-down information accounts. That is, when there is signal in the input that is helpful for top-down information accounts, normalization suffers. In the following section, we delve into this interaction in more detail.

A mathematical analysis of how properties of naturalistic input affect the efficacy of normalization and top-down information approaches

We have seen that there are two ways that context affects sound production: it affects how likely a particular sound category is to be produced a priori, and, once that is decided, it affects what acoustic realization that sound category is likely to have. As a result, there are also two main ways that listeners might make use of contextual information when processing or learning the sounds of their language. They could either make use of it to normalize the acoustics, or they could make use of it as top-down information that biases their category perception directly. Thus far, we have shown that in the case of Japanese vowel length, top-down information accounts are robust even on naturalistic speech, but that normalization is not effective on naturalistic speech.

The previous simulation suggests that signal in the input that is helpful for top-down information accounts may be harmful for normalization accounts. In this section, we provide a theoretical analysis about how listeners relying on each of these two strategies will fare depending on the kinds of information sources that are present in their input, including what pitfalls they might encounter. We ultimately show that a listener relying on a normalization strategy when their input contains imbalances in categories across contexts may be misled, consistent with our previous simulation, while a listener who relies on a top-down information strategy when their input contains systematic variability resulting from context will not be. Overall, the results in this section suggest that top-down information strategies are much more robust to various types of input than normalization strategies are.

How do contextual category imbalances affect normalization performance?

In this section, we consider how a listener relying on a normalization strategy will fare when their input contains imbalances in category membership—of the type that are helpful in top-down information accounts.

We begin by recapping what inference task we assume the listener is performing, and how exactly we implement normalization. As discussed previously, we use a logistic regression categorization model, which involves calculating the relative probability that a particular vowel is long (or short) as follows, where d refers to duration, f1, f2, and f3 refer to the formants, and all β's refer to learned weights in the logistic regression.

$$ \begin{array}{@{}rcl@{}} &&P(\text{long} | d^{norm}, f_{1}^{norm}, f_{2}^{norm}, f_{3}^{norm})\\ &&\quad= \frac{1}{1+e^{\beta_{0} + \beta_{1}d^{norm} + \beta_{2}f_{1}^{norm} + \beta_{3}f_{2}^{norm} + \beta_{4}f_{3}^{norm}}} \end{array} $$
(7)

That is, this relies on having normalized duration and formants to categorize a particular vowel as phonemically short or long. There are a number of ways that normalization can be implemented. Our upcoming analyses focus exclusively on the linear regression implementation, which is a simple but commonly used normalization method.

In order for normalization to be helpful, we would expect normalization to push the means of the short and long vowel categories apart. To study when normalization is or is not helpful, we derive an equation that quantifies how the distance between category means changes as a result of normalization. The mean of short vowels before normalization, \(\mu_{l = \text{short}}^{\text{unnorm}}\), is the average of the mean duration of short vowels in each context that short vowels occur in, weighted by the fraction of all short vowels that occur in that context. In the following equation, \(N_{l = \text{short}, c=j}\) is the number of short (l = short) vowels in context j (c = j), \(N_{l = \text{short}}\) is the total number of short vowels, and \(\mu_{l = \text{short}, c=j}\) is the mean duration of short (l = short) vowels in context j (c = j).

$$ \mu_{l = \text{short}}^{\text{unnorm}} = \sum\limits_{j} \frac{N_{l = \text{short}, c=j}}{N_{l = \text{short}}} \mu_{l = \text{short}, c=j}^{\text{unnorm}} $$
(8)

An analogous equation holds for the mean of long vowels before normalization, \(\mu_{l = \text{long}}^{\text{unnorm}}\). We can then compute a closed-form value for the means of the short and long vowel categories after normalization with linear regression, \(\mu_{l = \text{short}}^{\text{norm}}\) and \(\mu_{l = \text{long}}^{\text{norm}}\), respectively. Each vowel token is normalized by taking the difference between that vowel's acoustic cue and the average acoustic cue of vowels that occur in that vowel's context. Once we obtain closed-form values for the mean acoustics of short and long vowels pre- and post-normalization, we can derive the following equation, which shows how the difference between short and long vowel means changes as a result of normalization. This allows us to describe under what conditions category means will move closer together or farther apart as a result of normalization. Of course, the success of categorization depends not just on the difference in means, but on how large this difference is compared to the variance. But in the simplest case, where normalization applies an additive mean shift without changing the variance, it is clear that normalization will hurt performance when the means become closer together. See the Appendix for a full derivation of this equation.

$$ \begin{array}{@{}rcl@{}} &&\left( \mu^{norm}_{l=\text{long}} - \mu^{norm}_{l=\text{short}}\right) - \left( \mu^{unnorm}_{l=\text{long}} - \mu^{unnorm}_{l=\text{short}}\right) \\ &&\quad= \sum\limits_{j} \left[\frac{N_{l=\text{short}, c=j}}{N_{l=\text{short}}} - \frac{N_{l=\text{long}, c=j}}{N_{l=\text{long}}}\right]\left[\frac{N_{l=\text{long}, c=j}}{N_{c=j}}\mu_{l=\text{long}, c=j}^{\text{unnorm}}\right.\\ &&\quad\left.+ \frac{N_{l=\text{short}, c=j}}{N_{c=j}}\mu_{l=\text{short}, c=j}^{\text{unnorm}}\right] \end{array} $$
(9)

In this equation, \(N_{l,c}\) is the number of vowels of length l in context c and \(\mu_{l,c}\) is the mean of vowels of length l in context c. The first term in the sum, \(\frac{N_{l=\text{short}, c=j}}{N_{l=\text{short}}} - \frac{N_{l=\text{long}, c=j}}{N_{l=\text{long}}}\), corresponds to the difference between the fraction of all short vowels that occur in the jth context and the fraction of all long vowels that occur in the jth context. The second term in the sum, \(\frac{N_{l=\text{long}, c=j}}{N_{c=j}}\mu_{l=\text{long}, c=j}^{\text{unnorm}} + \frac{N_{l=\text{short}, c=j}}{N_{c=j}}\mu_{l=\text{short}, c=j}^{\text{unnorm}}\), is a weighted average between the mean of the long vowels in this context, weighted by the proportion of vowels in this context that are long, and the mean of the short vowels in this context, weighted by the proportion of vowels in this context that are short. The product of these two terms is summed over all contexts. When the value of \(\left(\mu^{norm}_{l=\text{long}} - \mu^{norm}_{l=\text{short}}\right) - \left(\mu^{unnorm}_{l=\text{long}} - \mu^{unnorm}_{l=\text{short}}\right)\) from Eq. 9 is greater than zero, normalization has pushed the categories apart; when this value is less than zero, normalization has pushed the categories closer together. This equation reveals that if there are imbalances in the contexts in which different categories are likely to occur, then a listener relying on normalization alone may be misled.

To illustrate why, consider a context that is dominated by long vowels (i.e., there are more long vowels than short vowels in this context). For such a context, we would typically expect the first bracketed term (of two) in Eq. 9 to be negative. This is because it is likely that the proportion of all long vowels that are in this context, \(\frac{N_{l=\text{long}, c=j}}{N_{l=\text{long}}}\), is greater than the proportion of all short vowels that are in this context, \(\frac{N_{l=\text{short}, c=j}}{N_{l=\text{short}}}\) (although this need not be the case if, for example, there are many more long vowels than short vowels overall: \(N_{l=\text{long}} > N_{l=\text{short}}\)). In this long-dominated context, the second bracketed term would be relatively large, for the following reason. Most of the vowels in this context are long (by virtue of it being a long-dominated context), so \(\frac{N_{l=\text{long}, c=j}}{N_{c=j}}\) will be relatively large and \(\frac{N_{l=\text{short}, c=j}}{N_{c=j}}\) will be relatively small. The second bracketed term then puts a higher weight on the long vowel mean than on the short vowel mean, which pushes its value towards the long vowel mean (and thus higher). Taking the product, the value within the sum will be a relatively large negative number for long-dominated contexts.

Conversely, in a context that is dominated by short vowels (i.e., there are more short vowels than long vowels in this context), we would typically expect the first term to be positive, and the second term to be relatively small, due to a heavier weighting on the short vowel mean than on the long vowel mean (which pushes the value towards shorter durations). Taking the product, the value within the sum will be a relatively small positive number for short-dominated contexts.
Overall, then, we would expect the sum over all contexts to be negative, since, as we saw, the negative summands should be relatively large, and the positive summands should be relatively small. This means that imbalances in sound categories across contexts (i.e., large differences in the relative proportion of short and long vowels within particular contexts) can lead to normalization bringing the category means closer together, rather than farther apart.
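To make the direction of this effect concrete, consider a small worked example (the counts and durations here are invented purely for illustration and are not drawn from our corpora). Suppose there are only two contexts and that the unnormalized category means are identical in both: long vowels average 150 ms and short vowels 100 ms. Context 1 is long-dominated (90 long vowels, 10 short), and context 2 is short-dominated (10 long, 90 short). Plugging these values into Eq. 9 gives

$$ \begin{array}{@{}rcl@{}} \text{context 1:} && \left[\frac{10}{100} - \frac{90}{100}\right]\left[\frac{90}{100}(150) + \frac{10}{100}(100)\right] = (-0.8)(145) = -116 \\ \text{context 2:} && \left[\frac{90}{100} - \frac{10}{100}\right]\left[\frac{10}{100}(150) + \frac{90}{100}(100)\right] = (0.8)(105) = 84 \end{array} $$

so the sum is -32 ms. Normalization therefore shrinks the separation between the category means from 50 ms (unnormalized) to 18 ms (normalized), even though the speaker's productions do not differ acoustically across the two contexts.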

Another way to think about this is that vowels are normalized relative to the context they occur in, by subtracting the mean of all of the vowels in that context from each vowel's own acoustic values. An imbalance between short and long vowels in a particular context will artificially lower or inflate that context's mean, depending on which category dominates. All else being equal, in a context that consists of a majority of long vowels, the mean duration will be inflated, so the normalized cues will be artificially low. A parallel effect will cause the normalized cues in short-dominated contexts to be artificially high. That is, vowels in contexts that are majority long will be shifted towards shorter durations, and vowels in contexts that are majority short will be shifted towards longer durations, which will push the short and long vowel distributions together. Essentially, the problem is that imbalances in where categories occur make it hard to estimate a proper normalization function.

Consider again the toy example in Fig. 2. In this toy example, there are short vowels and long vowels and only two contexts. Note that the acoustics of the short and long vowels do not change across contexts – the average short vowel and long vowel durations are not shifted. However, there is a large imbalance between phonemically short and phonemically long vowels in particular contexts, such that there are many more long vowels than short vowels in phrase-medial position, and many more short vowels than long vowels in phrase-final position. This will cause the mean duration in the phrase-medial context to be much higher than the mean duration in the phrase-final context. A listener who normalizes relative to these contexts would actually increase the amount of within-category variability present in the speech stream and push the categories together. Overall, differences between sound categories, in terms of the contexts they are likely to occur in, can impede a listener who relies on normalization strategies.
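The same point can be checked with a short simulation. The Python sketch below (not the code used for the analyses reported in this paper; the distributions, counts, and context labels are invented to mirror the structure of the toy example) normalizes durations by subtracting each context's mean and shows that the category means end up closer together:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vowel durations (ms); the categories themselves do not
# shift across contexts, mirroring the structure of the toy example.
def sample_context(n_short, n_long):
    short = rng.normal(100, 10, n_short)   # short vowels
    long_ = rng.normal(150, 15, n_long)    # long vowels
    return short, long_

# Phrase-medial: mostly long vowels; phrase-final: mostly short vowels.
medial_short, medial_long = sample_context(n_short=100, n_long=900)
final_short, final_long = sample_context(n_short=900, n_long=100)

# "Normalize" by subtracting each context's overall mean duration.
medial_mean = np.concatenate([medial_short, medial_long]).mean()
final_mean = np.concatenate([final_short, final_long]).mean()

short_raw = np.concatenate([medial_short, final_short])
long_raw = np.concatenate([medial_long, final_long])
short_norm = np.concatenate([medial_short - medial_mean, final_short - final_mean])
long_norm = np.concatenate([medial_long - medial_mean, final_long - final_mean])

print("raw separation (ms):       ", long_raw.mean() - short_raw.mean())    # ~50
print("normalized separation (ms):", long_norm.mean() - short_norm.mean())  # ~18

Under these particular made-up numbers, the separation between the category means shrinks from roughly 50 ms to roughly 18 ms purely because of the category imbalance; no acoustic shift across contexts was built into the data.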

To be clear, this analysis only shows that normalization will be problematic for factors that can also be useful under top-down information accounts (e.g., neighboring sounds, prosodic position), not for factors like speech rate or speaker, which only affect how a sound is produced and not which sound is likely to be produced. That being said, normalization will be ineffective whenever it is difficult to estimate a normalization function, and there is reason to believe that this might be the case for factors like speech rate too. In particular, it has been shown that changes in speech rate affect the two categories unequally: long vowels, for example, seem to be acoustically lengthened more than short vowels in slow speech. In its current form, then, normalization will be incorrect for factors like speech rate, because it estimates only one normalization function for both short vowels and long vowels, instead of using a different function for each category. We return to this point, and to what it tells us about the efficacy of normalization, in the General Discussion.

How does systematic acoustic variability affect top-down information performance?

While a listener relying on normalization when there is signal in the input for top-down information accounts will be misled, the opposite does not hold. A listener relying on a top-down information strategy when there is systematic variability to be normalized in their input will not be misled relative to a listener who simply relies on the acoustics. The model making use of contextual information has access to all the information that the baseline absolute acoustics model does (and more), so it will necessarily perform at least as well. It can always learn to put no weight on contextual factors and implement exactly the baseline model. Therefore, no matter what the acoustics are like, using top-down information will never mislead a listener more than relying on acoustic information alone. More strongly, a listener relying only on contextual information, without any access to acoustics, cannot be misled by systematic variability in the signal, precisely because they make no use of acoustic information. Therefore, a listener who makes use of top-down information as a contextual bias in learning and perception will avoid pitfalls that a listener making use of normalization may encounter (as long as they are trained on the right distributions).
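This nesting argument can be illustrated with a small simulation (a sketch under our own assumptions, with invented data; scikit-learn's LogisticRegression is used purely for illustration and is not one of the models reported in this paper):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical data: overlapping duration distributions plus a binary
# context feature that is predictive of vowel length (a category imbalance).
n = 2000
context = rng.integers(0, 2, n)                    # 0 = short-dominated, 1 = long-dominated
is_long = rng.random(n) < np.where(context == 1, 0.9, 0.1)
duration = np.where(is_long, rng.normal(150, 30, n), rng.normal(100, 30, n))

acoustic_only = LogisticRegression().fit(duration[:, None], is_long)
with_context = LogisticRegression().fit(np.column_stack([duration, context]), is_long)

print(acoustic_only.score(duration[:, None], is_long))
print(with_context.score(np.column_stack([duration, context]), is_long))
# In this simulation the context-aware model matches or improves on the
# acoustic-only model: if the context feature were uninformative, its
# weight could simply be driven towards zero.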

Discussion

In this section, we provided a mathematical analysis showing that listeners who make use of a normalization strategy may suffer when there are imbalances in category membership of the type that is useful for top-down information accounts. However, the opposite is not true—listeners who make use of top-down information will not be hurt by systematic variability in the signal.

Category imbalances are extremely common in natural language, due to phonotactic constraints, phonological alternations, historical reasons, and more. Our mathematical analysis shows that listeners who rely on normalization strategies may suffer when their input contains these types of category imbalances. Therefore, for factors that affect which sound category is likely to be produced, normalization is not an effective way to deal with context in processing and especially sound category learning for learners who cannot yet separate categories. Instead, a listener would be much better off making use of top-down information, which is immune to systematic variability in acoustics. Overall, in order to make a claim that listeners do use normalization in order to learn and process sounds, it will become important to explain how listeners can overcome the problems presented by contextual imbalances in category membership.

General discussion

In this paper, we tested the efficacy of two ways of using context in helping to disambiguate overlapping categories. We tested top-down information accounts, where listeners make use of context to bias their expectations of what category they will observe, as well as normalization accounts, where listeners use context to help factor out systematic variability. Although well studied, these ideas have been somewhat conflated in past work and have rarely and with limited success been applied to naturalistic spontaneous speech. In this paper, we further explored these two ideas, trying to overcome these issues with past work. We disentangled these two ideas and carefully studied the relative contribution of each of them to the listener’s task, applying them to spontaneous speech.

Our simulations showed that a top-down information strategy is effective even on spontaneous speech, but that normalization, at least as it has often been implemented, is not. This result was surprising given that normalization has been found to be effective in the past. We resolved this discrepancy by showing that normalization was helpful when we ran the same analyses on simplified, controlled lab data—of the type generally studied in the normalization literature—rather than on naturalistic spontaneous speech. We then provided simulations and a mathematical analysis showing that normalization may be ineffective when there are context-specific category imbalances—precisely of the type that are useful for top-down information accounts. This suggests that a learner whose input contains imbalances, through phenomena such as phonotactic constraints and phonological alternations, is better off using context to bias their perception in a top-down fashion rather than normalizing it out, at least as we have implemented these strategies here. In what follows, we discuss where this leaves normalization in the literature, how top-down information may be used in acquisition models, how these results generalize to other problems in speech perception and cognition, as well as the importance of testing ideas on spontaneous data in addition to controlled lab data.

The status of normalization

The idea that normalization plays a role in processing and acquisition is widely held, but our results bring up important issues with it that complement other problems discussed in earlier work (Johnson, 1997, 2006; Pierrehumbert, 2002). Although our results can only directly speak to the two implementations that we studied in this paper, the problem appears to arise because of properties of the input that might also hurt other normalization implementations, as we discuss below.

In order to normalize well, it is important to be able to estimate the correct normalization function. Failing to do so can actually increase the amount of variability and overlap between categories, rather than reduce it, as we saw in the toy example in Fig. 2. The results in this paper show that one obstacle to estimating the normalization function well is the fact that different sound categories occur in different contexts with different probabilities. For example, short and long vowels differ in what consonants they are likely to precede, which makes factoring out systematic acoustic variability from the following consonant difficult. This is a problem for any contextual factor that both affects which target sound is likely to occur a priori, and systematically affects a target sound’s acoustics (i.e., any contextual factor that would be a good top-down predictor of vowel category). Therefore, this is not a problem for sources of systematic variability like speech rate or gender: short (or long) vowels are no more likely to occur in fast speech than slow speech, or in speech by men than women (and vice versa), so speech rate and speaker gender are unlikely to be informative about whether a particular vowel is short or long.

That being said, there are other problems with normalization that would also affect factors like speech rate and gender. Context can affect how some categories are produced more than others. For example, long vowels might be acoustically lengthened more than short vowels in slow speech. In the implementation adopted here, normalization cannot handle these types of sources of systematic variability because it uses one normalization function across all categories. If different categories are actually differentially affected, then the learned normalization function is guaranteed to be wrong for some tokens, and this may increase variability and overlap rather than reduce it.

Our results directly study linear regression normalization methods; however, our analyses reveal that the properties of naturalistic speech that hurt the studied normalization method would likely also hurt many of the most commonly discussed normalization methods. Essentially, the linear regression normalization method fails because it assumes that the only way the mean value of an acoustic cue could change between contexts is if speakers are acoustically altering their productions between those contexts. However, this assumption is not valid in naturalistic speech, where simply having more instances of a particular category in a context can change the mean value of an acoustic cue. Most, if not all, concrete implementations of normalization that have been proposed in the cognitive literature share this assumption. For example, z-scoring and the normalization method in Dillon et al. (2013) both rely on transforming the acoustics relative to the overall mean acoustic cue in each context. Therefore, while we only directly study one implementation, other implementations could suffer from the same problem.

One possible exception is the idea of relativizing cues. The idea is that, instead of normalizing by learning an explicit normalization function (implemented here as a linear regression or neural network), listeners might rely on an alternative set of acoustic cues that are more invariant than those that are typically measured and described. For example, for Japanese vowel length, listeners might use the ratio of a vowel’s duration to the word duration or the duration of a neighboring sound as the primary cue (Hirata, 2004). For other contrasts, researchers have argued that cues like the ratio of first and second formants to third formant values, as well as ratios between stop closure and previous vowel duration could be helpful for perception of vowel qualities and stops, respectively (e.g., Monahan and Idsardi, 2010; Port & Dalby, 1982).

These accounts have support in the literature: MEG experiments have shown that the auditory cortex is sensitive to ratios of these sorts (Monahan & Idsardi, 2010), and analyses have shown that these can be clear cues to category membership (Hirata, 2004; Monahan & Idsardi, 2010), though these analyses have been of controlled lab speech rather than naturalistic speech.

In our own preliminary work and in work by Bion et al. (2013), these relativized cues do not help with the Japanese vowel length contrast when considering naturalistic speech. However, it is possible that other, untested relativized acoustic cues would help, and future work should systematically study this class of ideas as applied to naturalistic speech.

Given that so many of the already proposed normalization implementations are likely to suffer from the problem exposed in this paper, the question that remains is whether current normalization methods could be altered, or if new normalization methods could be developed, to overcome these issues. The problem we point out is that it is difficult to estimate the correct normalization function. However, it is possible that normalization would be effective with a better estimation of the function. Here, we discuss possible changes that could accomplish that.

With regards to adult speech perception, one possibility would be for listeners to learn different normalization functions for each category type (i.e., one normalization function for short vowels and another for long vowels). This would also allow the process to take into account category imbalances. However, if the listener is equipped with one normalization function for short vowels and one normalization function for long vowels, they will not know which function to use until they have already categorized the sound, so normalization would not, in this case, be helpful during the categorization process (only after). Another possibility would be that listeners build separate normalization functions for separate categories, but average them during categorization, weighting them by the relative proportion of each category type. These ideas have not yet been tested, and it is currently unclear that they would increase the efficacy of normalization on spontaneous speech, but future work should investigate them.
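As a concrete sketch of the second possibility (hypothetical code written for illustration only; the function names and the specific weighting scheme are our own assumptions, and, as noted above, this idea has not been evaluated):

import numpy as np

def per_category_offsets(durations, categories, contexts):
    # Estimate a separate normalization offset for each (category, context)
    # pair: how far that category's mean in that context sits from the
    # category's grand mean.
    offsets, counts = {}, {}
    for cat in np.unique(categories):
        grand_mean = durations[categories == cat].mean()
        for ctx in np.unique(contexts):
            mask = (categories == cat) & (contexts == ctx)
            if mask.any():
                offsets[(cat, ctx)] = durations[mask].mean() - grand_mean
                counts[(cat, ctx)] = mask.sum()
            else:
                offsets[(cat, ctx)], counts[(cat, ctx)] = 0.0, 0
    return offsets, counts

def normalize_token(duration, ctx, offsets, counts, cats=("short", "long")):
    # Average the per-category offsets, weighted by how likely each category
    # is in this context, since the incoming token is not yet categorized.
    n = np.array([counts[(c, ctx)] for c in cats], dtype=float)
    weights = n / n.sum()
    return duration - sum(w * offsets[(c, ctx)] for w, c in zip(weights, cats))

Because each offset is estimated within a category, category imbalances across contexts no longer contaminate the estimated offsets; the open question is whether the category-weighted average is a good enough correction before the incoming token has been categorized.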

The issues with normalization become even more problematic when considering acquisition. The learner does not yet know the distinction between short and long vowels, and cannot take into account category imbalances. As a result, they will necessarily be applying inaccurate normalization functions, which may actually increase category overlap rather than reduce it. In addition, throughout the paper, we saw that normalization performance depended on the precise set of factors being normalized out. Therefore, a learner would have to determine which factors to normalize out—and would need to learn that some factors that systematically affect acoustic productions should be factored out, while other factors that similarly affect acoustic productions should not be. These issues complicate the view that normalization is helpful in language acquisition.

Overall, although normalization has received a lot of support in the literature, there is actually little to no current evidence suggesting that this is a strategy that could be helpful for acquisition and processing naturalistic speech. Much of the evidence that has been used to argue for normalization is also consistent with a top-down information strategy, which, unlike normalization, was shown to be effective here, as well as adaptation accounts (Kleinschmidt and Jaeger, 2015). In addition, normalization has mostly been tested on controlled lab speech, rather than the speech that listeners primarily hear and learn from. We showed here that these results from lab speech do not necessarily generalize to naturalistic speech (and did not in the case of Japanese vowel length). This work calls into question the role that normalization could play in acquisition and processing. Certainly it is possible that amending the normalization process helps, but for existing concrete proposals, there is more evidence against normalization than for it. In order to stand by the idea that normalization helps disambiguate overlapping categories, it is critical to find some evidence that normalization—in any form—is actually effective in separating categories when applied to spontaneously produced speech.

That being said, the fact that we show that normalization may not lead to better separation between short and long vowels does not imply that listeners do not normalize. If it is the case that listeners process their input by normalizing acoustics relative to context, then our results indicate that listeners are overcoming even more overlap between short and long vowels than represented in Fig. 1. Our results show that normalization is unlikely to be the solution to the overlapping categories problem.

As discussed before, normalizing is only one way to factor out systematic variability resulting from the context, and other alternatives may be more effective. One particularly promising alternative would be an adaptation strategy, which reduces systematic variability without having to calculate an explicit normalization function. It does so by essentially learning a separate mapping between acoustics and linguistic category for each context observed. This avoids the need to learn a precise normalization function, but can still overcome systematic variability by treating each context separately (see Kleinschmidt & Jaeger, 2015 for a more extensive discussion). Adaptation is promising to pursue because it does not encounter any of the issues that normalization does, and can explain the experimental findings that have been used to argue for normalization. In particular, an adaptation mechanism would also be able to cope with variability due to speech rate or gender, so studies of these factors do not provide independent evidence that normalization must be present. At the same time, the adaptation account is independently supported by experimental evidence regarding other kinds of variability between speakers, such as dialect. Even young infants are capable of learning to adapt to dialect variation given sufficient evidence (van Heugten & Johnson, 2014). Normalization is not a good account of this learning process, since it would require complex phonological alternations such as vowel shifts to be normalized directly in the acoustic space rather than learned phonologically (Elsner et al., 2013a).
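To make the contrast with an explicit normalization function concrete, the following toy Python sketch (our own construction for illustration, not the model of Kleinschmidt and Jaeger, 2015) fits a separate acoustics-to-category classifier for each observed context:

import numpy as np
from sklearn.linear_model import LogisticRegression

class PerContextClassifier:
    """Adaptation-style learner: one duration-to-length classifier per context.
    Each per-context model implicitly encodes both the acoustic mapping and
    the category prior for that context."""

    def fit(self, durations, contexts, labels):
        self.models = {
            ctx: LogisticRegression().fit(durations[contexts == ctx].reshape(-1, 1),
                                          labels[contexts == ctx])
            for ctx in np.unique(contexts)
        }
        return self

    def predict(self, durations, contexts):
        return np.array([self.models[ctx].predict([[d]])[0]
                         for d, ctx in zip(durations, contexts)])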

Currently, we run into data sparsity issues when trying to investigate the adaptation idea, as this requires splitting already small datasets by context; however, this remains a promising direction to pursue.

Top-down information in acquisition

Our results indicated that using contextual information in a top-down fashion is promising and merits continued study. In the current work, we only included a small set of contextual factors, and adding in additional factors could help separate short and long vowels even further. In particular, with the exception of part-of-speech, our work did not include any word type or word-level information, which has been argued to be helpful in the past (Swingley, 2009; Feldman et al., 2013a, b). This suggests that using contextual information as top-down information or to guide expectations could be extremely helpful in adult speech perception.

All of the models presented here are supervised, meaning that they have information about what the sound categories and relevant contextual factors are. They directly reveal how helpful a top-down information strategy would be for adult speech perception. When applied to acquisition, our results have shown that the prerequisite for a top-down information strategy to be effective in acquisition is met: there is signal in the input that can separate short vowels from long vowels. However, we have not shown how an infant could actually use this contextual information to acquire the vowel length contrast. In order to do this, we would need to propose an unsupervised category learning model. This is the primary challenge facing top-down information accounts, and future work will need to apply these ideas more directly to acquisition, as has been done in the past, to try to gain a more complete account of how the listener solves these overlapping category problems.

A lot of work has already applied top-down information strategies to acquisition. Past research has shown that infants seem to make use of word-level information in acquiring sound categories (Feldman et al., 2013b; Thiessen, 2007), and computational models have shown that word-level information can be helpful (Feldman et al., 2013a). However, as with most speech perception research, these ideas have largely been tested on controlled lab speech, and, in fact, recent work showed that the model from Feldman et al. (2013a) was no longer effective when applied to spontaneous Japanese speech (Antetomaso et al., 2017). As a result, we still do not have a solution for how contextual information could be used in a top-down fashion for learning from spontaneously produced speech. However, most past work has focused exclusively on word-level information, so it is possible that making use of the other aspects of context that we considered in this paper (e.g., prosodic position, neighboring sounds), in addition to word-level information, will result in models that work on naturalistic speech. In what follows, we outline a few possibilities for how top-down information could be useful.

An adaptation strategy, which builds a separate mapping from acoustics to categories for each context encountered, could again be helpful (Kleinschmidt and Jaeger, 2015). In doing so, it has access to information about which categories are more/less likely to occur within a particular context. Therefore, it is possible that within particular contexts, the short and long vowel categories are more separated than they appear overall. For example, short vowels and long vowels might be well separated when they occur in phrase-final position and preceded by a particular consonant. If the distribution is bimodal along the duration dimension in a particular context like this one, then the learner could learn that there are two categories along the duration dimension via a process of distributional learning (Maye et al., 2002), and then generalize this to other contexts where the distinction is less clear.
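A minimal sketch of this within-context distributional learning step (all durations are invented, and scikit-learn's GaussianMixture merely stands in for whatever distributional learning mechanism infants actually use):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Hypothetical durations (ms) from one well-separated context, e.g.,
# phrase-final vowels preceded by a particular consonant.
durations = np.concatenate([rng.normal(90, 10, 300),    # short vowels
                            rng.normal(170, 15, 120)])  # long vowels
X = durations.reshape(-1, 1)

# Compare one-category vs. two-category accounts of this context; a lower
# BIC for two components is evidence of a length distinction that could
# then be generalized to murkier contexts.
bic_one = GaussianMixture(n_components=1, random_state=0).fit(X).bic(X)
bic_two = GaussianMixture(n_components=2, random_state=0).fit(X).bic(X)
print(bic_one, bic_two)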

In particular, our results have shown that there seems to be something about carefully enunciated Japanese speech that reliably elicits different durations for short versus long vowels. While most of the input infants hear is highly-variable and spontaneous, infants do sometimes get some exposure to clearer instances of speech. Infants may hear their parents read them books. In addition, parents often use repetitions when speaking, which could help children by providing them points of comparison. Finally, Japanese child-directed speech involves a lot of mimetics, or onomatopoeic words, which have been shown to help in lexical acquisition (e.g., Imai & Kita, 2014), and might differ from other subsets of IDS in terms of the relative proportion of short and long vowels, or in terms of how well the length distinction is enunciated. Therefore, it is, in principle, possible that children learn which words have which vowels precisely by listening to subsets of their data in which their parents speak carefully.

This type of adaptation strategy, in which children learn about the distinction in a particular context and generalize it, is particularly promising, as it provides both a way to take advantage of top-down expectations of category membership and a way to remove systematic variability. By building a separate mapping from signal to category for each context, it has access to top-down information about which categories are more or less likely to occur in a particular context, and, therefore, can account for results that have been used to argue for top-down information accounts. At the same time, experimental results that have been used to argue for normalization functions can also be explained by adaptation strategies (Kleinschmidt & Jaeger, 2015), as these results show that listeners account for the fact that sounds are produced differently in different contexts, but cannot dissociate whether listeners do so via an explicit normalization function or by building a separate model for each context. As a result, adaptation accounts can explain, in a unified way, the experimental findings that have been used to argue in favor of both top-down information accounts and normalization accounts. However, again, infant-directed speech corpora are generally quite small, and it is difficult to test adaptation accounts without running into data sparsity issues.

The above possibility relies on the learner considering the distribution of vowels across many different contexts and observing a bimodal (or substantially less overlapping) distribution within one of them. Another possibility is that the distribution remains unimodal in every context, but that its shape nonetheless changes across contexts, and this could reveal the presence of multiple categories. What our results have revealed is that there are radically different proportions of short and long vowels across different contexts. As a result, the overall distribution of vowels along the duration dimension will change depending on the relative proportion of short and long vowels within it. For example, a context in which almost all vowels are long will be centered at longer durations, with a tail of short-vowel durations on the left, whereas a context in which almost all vowels are short will be centered at shorter durations, with a tail of long-vowel durations on the right. It is possible that these kinds of shape changes only occur along acoustic dimensions that are contrastive in a language (e.g., along duration in Japanese, but not in French, which does not use duration contrastively). If this is true, then a learner might be able to detect that a particular dimension is contrastive in their language by observing these changes in distribution shape across contexts. A learner could apply this across contexts or, alternatively, within frequent words—to see whether the vowel distributions within particular word frames differ (i.e., depending on what proportion of the time the word frames contain short vowels or long vowels).
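The following sketch illustrates the kind of shape diagnostic a learner could, in principle, compute (the durations and proportions are invented, and scipy's skew is just one convenient summary of distribution shape):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)

def context_durations(n, prop_long):
    # Hypothetical vowel durations (ms) for a context containing a given
    # proportion of long vowels; heavy overlap keeps the mixture unimodal.
    is_long = rng.random(n) < prop_long
    return np.where(is_long, rng.normal(150, 30, n), rng.normal(100, 30, n))

for prop_long in (0.1, 0.5, 0.9):
    print(prop_long, round(skew(context_durations(5000, prop_long)), 2))
# Short-dominated contexts show a right tail of long vowels (positive skew),
# long-dominated contexts a left tail of short vowels (negative skew); a
# dimension that is not contrastive should not change shape across contexts
# in this way.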

There are other ways that a learner could take advantage of contextual information in a top-down fashion, and future work will implement these strategies computationally and test how effective they are at learning the contrast between short vowels and long vowels.

How generalizable are these results?

We focus on the Japanese vowel length contrast as a test case, but to what extent do our results generalize to other, similar overlapping category problems in speech perception and more generally? The vowel length contrast has a number of properties that make it somewhat unusual, and it is possible that these properties explain the results we observe. For example, vowel length is acquired relatively late (Sato et al., 2010; Mugitani et al., 2009), and it could be that earlier-learned contrasts rely more on normalization.

First, it is relatively likely that our results showing the efficacy of top-down information accounts would generalize to other tasks in speech perception, phonetic learning, and cognition. In all of these areas, there is already ample evidence that top-down information is useful (though mostly from controlled lab data). Our results suggest that this usefulness will generalize to more realistic data, because there are systematic regularities in which contexts sounds (or objects) of all types occur in; however, this hypothesized generality will need to be demonstrated in future work.

The picture is a bit more complicated for our results on normalization. Our analysis reveals that normalization is ineffective when it is difficult to estimate the normalization function. As we have shown here, it will be difficult to estimate normalization functions for contextual factors that would be helpful for top-down information accounts (i.e., when there are regularities in which categories occur in which contexts). In addition, as we discussed in the previous section, it will also be difficult to estimate one good normalization function for contextual factors that affect the productions of different categories differently (e.g., a contextual factor that acoustically lengthens long vowels more than it lengthens short vowels). To the extent that people are dealing with contextual factors that do not fall into one of these classes, normalization could very well help for the tasks of speech perception and phonetic learning. In particular, it is possible that in the case of Japanese vowel length, there is sufficient signal via top-down information to distinguish most short/long minimal pairs without attending to the acoustic duration at all, so that in conversational speech, the durational contrast is mostly neutralized. It may be the case that normalization is ineffective for contrasts with low functional load (like Japanese vowel length), but more effective for contrasts with high functional load, where speakers must produce a perceptible contrast in order to be understood. We nonetheless speculate that the ineffectiveness of normalization will generalize to many other contrasts, as naturalistic speech is full of top-down information, which helps predict which sound will be uttered even without hearing the acoustics of the sound, but hurts normalization. However, further work will need to be done to study the extent to which these findings generalize to other contrasts within the domains of speech perception and phonetic learning.

Dissociating top-down information from normalization accounts

The two ideas we study here—normalization and top-down information accounts—have often been conflated in the literature. Part of the reason why this might be the case is that they are difficult to dissociate experimentally. Many of the studies that have been used to argue for one or the other are actually compatible with both alternatives, because they show that contextual information is used, but cannot pinpoint exactly how. In addition, these two ideas have largely been treated in separate literatures, such that where computational models have been proposed, they have never been directly compared.

In this work, by implementing these two ideas separately, and testing their relative efficacy on the same task, we are able to dissociate these two ideas. On the same task, top-down information accounts performed well, but normalization accounts performed poorly, showing that these two accounts are theoretically very different, even if it has been difficult to separate them empirically.

It is very well known that contextual information is used in speech perception, but, as this paper highlights, there are many ways that contextual information is used, and it will be important to get a better understanding of how exactly listeners use context. Towards this end, future work should devise ways to test how listeners use context in speech perception and acquisition, in a way that can differentiate between different accounts.

There has been some work that has succeeded in dissociating these two accounts. Much of this work comes from testing contextual factors that could be helpful under one account but not the other, and showing that listeners use them. For example, the Ganong effect, in which participants preferentially categorize sounds so as to form words (Ganong, 1980), and the phonemic restoration effect, in which participants report hearing a sound that is not physically present in the speech (Warren, 1970), show that top-down information is used and are incompatible with normalization accounts. On the other hand, experiments showing that listeners change their perception based on speech rate or speaker (e.g., Nearey, 1978; Fujisaki et al., 1975) are incompatible with top-down information accounts, because, for example, a speaker is unlikely to produce more /s/ phones just because they are speaking quickly. These studies show us that listeners are factoring out systematic variability in one way or another, though such results cannot dissociate a combination of normalization and top-down information from an adaptation account. Finally, there have also been experiments that directly compared two different ways of using context (Toscano & McMurray, 2012). In so doing, the researchers showed that an effect that is typically taken as evidence for normalization can also be explained by other ways of using context. Studies that put these two theories in conflict can be particularly helpful, although, because the strategies are not mutually exclusive, it is possible for listeners to use both.

Future work should build off of these cases to help us gain a more nuanced view of how listeners rely on context in speech perception.

Controlled lab speech vs. naturalistic speech

The results of this paper reiterate once again that there is a crucial distinction between controlled laboratory speech and spontaneously produced naturalistic speech. Essentially all of our understanding of speech perception comes from work on carefully controlled and carefully enunciated laboratory speech, but almost all of our experience as listeners comes from messy, variable spontaneous speech. These two types of speech differ quite substantially from each other in nature, both in how the speech is produced, as well as the content of the speech. Indeed, where tested, many of the ideas developed on controlled lab speech have been shown to be ineffective on spontaneous speech. Previous work has shown that top-down information accounts developed and tested on carefully controlled or synthesized speech do not generalize to spontaneously produced lab speech (Antetomaso et al., 2017). The current work shows that normalization is helpful on lab speech, but ineffective on spontaneously produced speech.

There is obviously a great deal of value that comes from working on speech where various factors are controlled for and isolated. In addition, listeners can process synthesized and controlled lab speech effortlessly, so our theories must be able to handle clear, enunciated speech, in addition to more naturalistic daily speech. However, what is critical is for ideas generated and tested on this lab speech to then be applied to spontaneous speech, to make sure that researchers are working on the same problem that listeners are solving.

There has certainly been some research starting to look at spontaneous speech, especially with the development of hand-annotated child-directed speech corpora such as the one developed by Mazuka et al. (2006). As we have discussed, Antetomaso et al. (2017) applied the model from Feldman et al. (2013a) to spontaneous Japanese speech, showing that the model's success did not generalize to spontaneous speech. Other work has also investigated spontaneous speech corpora, both in the case of the overlapping categories problem (Narayan et al., 2017; Swingley & Alarcon, 2018) and more widely (Guevara-Rukoz et al., 2018; Ludusan et al., 2016; Ludusan et al., 2017; Martin et al., 2016). However, such work is still not prevalent, and our work aligns with previous work in revealing that studying spontaneous speech is critical for ensuring our ideas apply to naturalistic listening situations.

Conclusions

In this paper, we compared the relative efficacy of two ways of using context to help in phonetic learning. The first involved making use of contextual information as top-down information to guide expectations about what category was likely to be heard. The second involved factoring out systematic acoustic variability that resulted from the context a sound was produced in. These ideas have been conflated and almost entirely studied on controlled lab speech, not naturalistic speech. In this work, we showed that, for the case of the Japanese vowel length distinction, a top-down information strategy is effective even on spontaneous speech, but, contrary to previous findings, normalization is not. We resolved this discrepancy in findings by demonstrating that the same normalization procedure is helpful on lab speech—the focus of most previous studies—but ineffective on spontaneous speech—the focus of our study. We then provided simulations and a mathematical analysis showing that normalization may be ineffective when there are context-specific category imbalances—precisely of the type that are useful for top-down information accounts. These results suggest the need to reevaluate the role that normalization can play in acquisition and processing. In addition, they reveal the importance of applying ideas tested on well-enunciated lab speech to the highly variable spontaneous speech that is present in most listening situations.