Keywords
annotation, language, speech, narrative, naturalistic stimulus, fMRI, studyforrest
Cognitive and psychiatric neuroimaging are moving towards studying brain functions under conditions of lifelike complexity1,2. Motion pictures3 and continuous narratives4,5 are increasingly used as so-called “naturalistic stimuli”. Naturalistic stimuli are usually designed for commercial purposes and to entertain their audiences. Thus, the temporal structure of their feature space is usually not explicitly known, leading to an “annotation bottleneck”6 when they are used for neuroscientific research.
Data-driven methods like inter-subject correlation (ISC)7 or independent component analysis (ICA)8 are often used to analyze such fMRI data in order to circumvent this bottleneck. However, use of data-driven methods alone falls short of associating results with particular stimulus events9. Model-driven methods, like the general linear model (GLM), which are based on stimulus annotations can be useful to test hypotheses on specific brain functions under more ecologically valid conditions, to statistically control confounding stimulus features, and to explain not just “how” the brain is responding to a stimulus but also “why”10. Studies using GLMs based on annotations of a stimulus’ temporal structure have elucidated, for example, how the brain responds to visual features of a movie11 or speech-related features of a narrative12. Furthermore, stimulus annotations can inform data-driven methods about a stimulus’ temporal dynamics, or model-driven and data-driven methods can be combined to improve the interpretability of results13.
Here we provide an annotation with the exact onset and offset of each sentence, word and phoneme (see Table 1 for an overview) spoken in the audio-visual movie “Forrest Gump”14 and its audio-description (i.e. the movie’s soundtrack with an additional narrator)15. fMRI data of participants watching the audio-visual movie16 and listening to the audio-description17 are the core data of the publicly available studyforrest dataset (studyforrest.org). The current publication enables researchers to model hemodynamic brain responses that correlate with a variety of aspects of spoken language, ranging from a speaker’s identity to phonetics, grammar, syntax, and semantics. This publication extends already available annotations of portrayed emotions18, perceived emotions19, as well as cuts and locations depicted in the movie20. All annotations can be used in any study focusing on aspects of real-life cognition by serving as additional confound measures describing the temporal structure and feature space of the stimuli.
We annotated speech in the slightly shortened “research cut”16 of the movie “Forrest Gump” and its temporally aligned audio-description17 that was broadcast as an additional audio track for visually impaired listeners on Swiss public television15. The plot of the original movie is already carried by an off-screen voice of the main character Forrest Gump. In the audio-description, an additional male narrator describes essential aspects of the visual scenery whenever there is no off-screen voice, dialog, or other relevant auditory content.
Preliminary manual orthographic transcripts of dialogues and non-speech vocalizations (e.g. laughter or groaning), together with the script for the audio-description’s narrator, were merged and converted to Praat’s21 TextGrid format. This merged transcript contained rough onset and offset timings for small groups of sentences, and was further edited in Praat for manual validation against the actual content of the audio material. The following steps were performed by a single person, already familiar with the stimulus, in several passes to iteratively improve the quality of the data: approximate temporal onsets and offsets were corrected; intervals containing several sentences were split into intervals containing only one sentence; when two or more persons were speaking simultaneously, the less dominant voice was dropped; low-volume non-speech vocalizations or low-volume background speech (especially during music or continuous environmental noise) that were subjectively assessed to be incomprehensible for the audience were also dropped.
We then used the Montreal Forced Aligner v1.0.122 to algorithmically identify the exact onset and offset of each word and phoneme. To enable the aligner to look up the phonemes embedded within each word, we chose the accompanying German pronunciation dictionary provided by Prosodylab23 that uses the Prosodylab PhoneSet to describe the pronunciation of phonemes. To improve the detection rate of the automatic alignment, the dictionary was manually updated with German words that occur in the stimuli but were originally missing in the dictionary. The pronunciation of English words and phonemes occurring in the otherwise German audio track was taken from the accompanying English pronunciation dictionary (following the ARPAbet PhoneSet). The audio track of the audio-description was converted from FLAC to WAV via FFmpeg v4.1.424 to meet the aligner’s input requirements. This WAV file, the merged transcription, and the updated dictionary were submitted to the aligner that first trained an acoustic model on the data and then performed the alignment.
The resulting timings of words and phonemes were corrected manually and iteratively in several passes using Praat v6.0.2221. In a first step, onsets and offsets for which the automatic alignment performed only moderately well were corrected. Some low-volume sentences that are spoken in continuously noisy settings (e.g. during battle or hurricane) were removed due to poor overall alignment performance. In a second step, the complete sentences of the orthographic transcription were copied into the annotation created by the aligner. In a third step, a speaker’s identity was added for each sentence (see Table 2 for the most often occurring speakers). During every step, previous results were repeatedly checked for errors and further improved.
We employed the Python package spaCy v2.2.125 and its accompanying German language model (de_core_news_md) that was trained on the TIGER Treebank corpus26 to automatically analyze linguistic features of each word in their corresponding sentence. Non-speech vocalizations were dropped from the sentences before analysis to improve results. We then performed analyses regarding part-of-speech (i.e. grammatical tagging or word-category disambiguation), syntactic dependencies, lemmatization, word embedding (i.e. a multi-dimensional meaning representation of a word), and if the word is one of the most common words of the German language (i.e. if the word is part of a stop list).
The annotation is available in two different versions, both providing the same information: a) as a text-based Praat TextGrid file, and b) as a text-based, tab-separated value (TSV) formatted table. The following descriptions refer to the ten columns of the TSV file, namely onset, duration, person, text, pos, tag, dep, lemma, stop, vector.
Onset (onset)
The onset of the sentence, word or phoneme. Time stamps are provided in the format seconds.milliseconds from stimulus onset.
Duration (duration)
The duration of the sentence, word or phoneme provided in the format seconds.milliseconds.
Speaker identity (person)
Name of the person that speaks the sentence, word or phoneme. See Table 2 for the ten most often occurring speakers.
Text (text)
The text of a spoken sentence or word, or the pronunciation of a phoneme. Phonemes of German words follow the Prosodylab PhoneSet, English words follow the ARPAbet PhoneSet.
Simple part-of-speech tag (pos)
A simple part-of-speech tagging (grammatical tagging; word-category disambiguation) of words. The tag labels of this simple part-of-speech tagging follow the Universal Dependencies v2 POS tag set (universaldependencies.org). See Table 3 for a description of the labels and the respective counts of all 15 labels. Nouns that spaCy mistook for proper nouns or vice versa were corrected via script. Additionally, in this column sentences are tagged as SENTENCE and phonemes as PHONEME to facilitate filtering in further processing steps.
Detailed part-of-speech tag (tag)
A detailed part-of-speech tagging of words following the TIGER Treebank annotation scheme26 which is based on the Stuttgart-Tübingen-Tagset27. See Table 4 for a description of the labels and the respective counts of the 15 most often occurring labels (overall 43 labels). Nouns that spaCy mistook for proper nouns or vice versa were corrected via script.
Syntactic dependency (dep)
Information about a word’s syntactic dependencies with other words within the same sentence. Information follows the TIGER Treebank annotation scheme26 and is given in the format “arc label;word’s head;word’s child1, word’s child2, ...”, where the “arc label” (see Table 5) describes the type of syntactic relation that connects a “child” (the current word) to its “head”.
Lemmatization (lemma)
The base form (root) of a word.
Common Word (stop)
Indicates whether the word is part of a stop list, i.e. whether it is one of the most common words of the German language (True vs. False).
Word embedding (vector)
A 300-dimensional word vector providing a multi-dimensional meaning representation of a word. For out-of-vocabulary words, whose vector would consist of 300 zeroes, the cell was set to # to save space.
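The dep and vector cells described above can be unpacked with a few lines of plain Python. The following sketch assumes the serializations as documented here; the example cell values and the whitespace-separated float encoding of in-vocabulary vectors are illustrative assumptions, not taken from the dataset:

```python
def parse_dep(cell):
    """Split a dep cell of the form "arc label;word's head;child1, child2, ..."
    into its three parts."""
    arc, head, children = cell.split(";", 2)
    kids = [c.strip() for c in children.split(",") if c.strip()]
    return {"arc": arc, "head": head, "children": kids}

def parse_vector(cell, dim=300):
    """Expand a vector cell into a list of floats. A '#' cell marks an
    out-of-vocabulary word whose vector is all zeros; in-vocabulary cells
    are assumed (the exact serialization may differ) to hold
    whitespace-separated floats."""
    if cell == "#":
        return [0.0] * dim
    return [float(x) for x in cell.split()]

# Hypothetical cell values for illustration:
dep = parse_dep("sb;rennt;kleine, Forrest")  # subject arc to head "rennt"
oov = parse_vector("#")                      # 300 zeros for an OOV word
```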
The annotation comes in two different versions. First, as a text-based TextGrid file (annotation/fg_rscut_ad_ger_speech_tagged.TextGrid) to be conveniently edited using the software Praat21. Second, as a text-based, tab-separated-value (TSV) formatted table (annotation/fg_rscut_ad_ger_speech_tagged.tsv) in accordance with the brain imaging data structure (BIDS)28. The dataset and validation data are available from Open Science Framework, DataLad and Zenodo (see Underlying data)29,30,31. The source code for all descriptive statistics included in this paper is available in code/descriptive-statistics.py (Python script).
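As a minimal sketch of how the TSV version can be consumed, the following stdlib-only snippet filters word-level events from sentence- and phoneme-level rows. The sample rows are made up for illustration and only mimic the ten-column layout; actual values in the dataset will differ:

```python
import csv
import io

# Hypothetical sample mimicking the annotation's ten-column TSV layout
# (onset, duration, person, text, pos, tag, dep, lemma, stop, vector).
SAMPLE = (
    "onset\tduration\tperson\ttext\tpos\ttag\tdep\tlemma\tstop\tvector\n"
    "12.480\t1.920\tNARRATOR\tForrest rennt.\tSENTENCE\tSENTENCE\t\t\t\t\n"
    "12.480\t0.610\tNARRATOR\tForrest\tPROPN\tNE\tsb;rennt;\tForrest\tFalse\t#\n"
    "13.100\t0.540\tNARRATOR\trennt\tVERB\tVVFIN\tROOT;rennt;Forrest\trennen\tFalse\t#\n"
)

def read_words(tsv_text):
    """Return only word-level rows, i.e. rows that are neither whole
    sentences (pos == 'SENTENCE') nor phonemes (pos == 'PHONEME')."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["pos"] not in ("SENTENCE", "PHONEME")]

words = read_words(SAMPLE)
```

The same filter works on the real file by replacing `io.StringIO(tsv_text)` with an open file handle.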
In order to assess the annotation’s quality, we investigated whether contrasting speech-related events with events without speech leads to increased activation in areas known to be involved in language processing32. Moreover, we tested whether two similar linguistic concepts carrying high semantic information (proper nouns and nouns), each contrasted with a concept carrying low semantic information (coordinate conjunctions), lead to increased activation in congruent brain areas.
We used a dataset providing blood oxygenation level-dependent (BOLD) functional magnetic resonance imaging (fMRI) data of 20 subjects (age 21–38 years, mean age 26.6 years, 12 male) listening to the 2 h audio-description (7 Tesla, 2 s repetition time, 3599 volumes, 36 axial slices, thickness 1.4 mm, 1.4 × 1.4 mm in-plane resolution, 224 mm field-of-view)17. Data were already corrected for motion at the scanner computer. Further, individual BOLD time-series were already aligned by non-linear warping to a study-specific T2*-weighted echo planar imaging (EPI) group template (cf. ref. 17 for exact details).
All further steps for the current analysis were carried out using FEAT v6.00 (FMRI Expert Analysis Tool)33 as part of FSL v5.0.9 (FMRIB’s Software Library)34. Data of one participant were dropped due to invalid distortion correction during scanning. Data were temporally high-pass filtered (cut-off 150 s), spatially smoothed (Gaussian kernel; 4.0 mm FWHM), and the brain was extracted from surrounding tissue. A grand-mean intensity normalization of the entire 4D dataset was performed by a single multiplicative factor.
We implemented a standard three-level, voxel-wise general linear model (GLM) to average parameter estimates across the eight stimulus segments, and later across 19 subjects. At the first level, analyzing each segment for each subject individually, we created 26 regressors (see Table 6) based on events drawn from the annotation. The 20 most often occurring detailed part-of-speech labels (nn with N=2620 to prf with N=157) were modeled as boxcar functions from onset to offset of each word. The remaining part-of-speech labels were pooled into a single new label (tag_other; N=1123) and modeled as a boxcar function from a word’s onset to offset. The 80 most often occurring phonemes (n with N=6053 to IY1 with N=32) were pooled to phonemes (N=65251) and modeled as a boxcar function from a phoneme’s onset to offset. The end of each complete grammatical sentence was modeled as an impulse event (N=1651) to capture variance correlating with sentence comprehension. “No-speech” events (no-sp; N=264) serving as a control condition were created such that a sufficient number of events and a minimum separation of speech and non-speech events were achieved. Events were randomly positioned in intervals without audible speech that lasted at least 3.6 s. Each event of the no-speech condition had to have a minimum distance of 1.8 s to any onset or offset of a word, and to any onset of another no-speech event. A length of 70 ms was chosen for no-speech events, matching the average length of phonemes. Lastly, we used continuous bins of information about low-level auditory features (left–right difference in volume and root mean square energy), averaged across the length of every movie frame (40 ms), to capture variance correlating with assumed low-level perceptual processes. Time series of events were convolved with FSL’s “Double-Gamma HRF” as a model of the hemodynamic response function to create the actual regressors.
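The placement rules for the no-speech control events can be sketched in a few lines. This is an illustrative re-implementation of the stated constraints (gaps of at least 3.6 s, 1.8 s margins, 70 ms events, interpreting the margin as applying to both event onset and offset), not the original code:

```python
import random

EVENT_LEN = 0.07  # 70 ms, matching the average phoneme length
MARGIN = 1.8      # minimum distance to word boundaries and other events
MIN_GAP = 3.6     # only speech-free intervals at least this long qualify

def place_no_speech(gaps, seed=0):
    """For each speech-free interval (start, end) of at least MIN_GAP
    seconds, draw random event onsets that keep MARGIN seconds distance
    to the surrounding words and to each other (rejection sampling)."""
    rng = random.Random(seed)
    events = []
    for start, end in gaps:
        if end - start < MIN_GAP:
            continue
        lo, hi = start + MARGIN, end - MARGIN - EVENT_LEN
        placed = []
        for _ in range(200):  # fixed sampling budget per gap
            if hi <= lo:
                break
            onset = rng.uniform(lo, hi)
            if all(abs(onset - other) >= MARGIN for other in placed):
                placed.append(onset)
        events.extend(sorted(placed))
    return events
```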
The Pearson correlation coefficients of the 26 regressors across the time course of all stimulus segments can be seen in Figure 1. Temporal derivatives were also included in the design matrix to compensate for regional differences between modeled and actual HRF. Finally, six motion parameters were used as additional nuisance regressors and the design was subjected to the same temporal filtering as the BOLD time series. The following three t-contrasts were defined: 1) words (all 21 tag-related regressors) > no-speech (no-sp), 2) proper nouns (ne) > coordinate conjunctions (kon), and 3) nouns (nn) > coordinate conjunctions (kon).
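The regressor correlations reported in Figure 1 are plain Pearson coefficients. A generic stdlib-only re-implementation, shown here with toy boxcar series rather than the actual regressors, looks like this:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long
    regressor time series (generic re-implementation for illustration)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Two toy boxcar regressors: perfectly alternating events correlate at -1.0.
speech = [0, 1, 1, 0, 1, 0]
no_sp = [1, 0, 0, 1, 0, 1]
r = pearson(speech, no_sp)  # -1.0
```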
The second-level analysis that averaged contrast estimates across the eight stimulus segments per subject was carried out using a fixed effects model by forcing the random effects variance to zero in FLAME (FMRIB’s Local Analysis of Mixed Effects)35,36. The third level analysis which averaged contrast estimates across subjects was carried out using a mixed-effects model (FLAME stage 1) with automatic outlier deweighting36,37. Z (Gaussianised T/F) statistic images were thresholded using clusters determined by Z>3.4 and a corrected cluster significance threshold of p<.0537. Brain regions associated with observed clusters were labeled using the Jülich Histological Atlas38,39 and the Harvard-Oxford Cortical Atlas40 provided by FSL.
Figure 2 depicts the results of the three contrasts (z-threshold Z>3.4; p<.05 cluster-corrected). The contrast words > no-speech yielded four significant clusters (see Table 7): one left-lateralized cluster spanning from the angular gyrus and inferior posterior supramarginal gyrus across the superior and middle temporal gyrus, including parts of Heschl’s gyrus and the planum temporale; a second left cluster in (inferior) frontal regions, including the precentral gyrus, pars opercularis (Brodmann Area 44; BA44) and pars triangularis (BA45); similarly, in the right hemisphere, one cluster spanning from the angular gyrus across the superior and middle temporal gyrus but additionally including inferior frontal regions (pars opercularis and pars triangularis); and a fourth significant cluster in the left thalamus.
The contrast proper nouns > coordinate conjunctions yielded nine significant clusters (see Table 8): one left-lateralized cluster spanning from the angular gyrus across planum temporale and superior temporal gyrus, partially covering the Heschl’s gyrus, into the anterior middle temporal gyrus. A largely congruent but smaller cluster in the right hemisphere. Two clusters in posterior cingulate cortex and precuneus of both hemispheres. Three small clusters in the right occipital pole, right Heschl’s gyrus and left superior lateral occipital pole.
The contrast nouns > coordinate conjunctions yielded four significant clusters (see Table 9): two clusters that are slightly smaller than the lateral temporal clusters of the contrast proper nouns > coordinate conjunctions, in this case spanning from the angular gyrus in the left hemisphere and from the planum temporale in the right hemisphere into the anterior part of the superior temporal cortex. Finally, two small clusters in the right posterior cingulate gyrus and right precuneus.
For the contrast words > no-speech, results show increased hemodynamic activity in a bilateral cortical network including temporal, parietal and frontal regions related to processing spoken language32,41,42. These clusters resemble results of previous studies that implemented an ISC approach to analyze fMRI data of naturalistic auditory stimuli5,43,44. We do not find significantly increased activations in midline areas (like the posterior cingulate cortex and precuneus, or the anterior cingulate cortex and medial frontal cortex) which showed synchronized activity across subjects in previous studies. In this regard, our results are similar to those of a previous study4 that implemented both an ISC and a GLM analysis: the ISC analysis showed synchronized activity in midline areas, but the GLM analysis contrasting blocks of listening to narratives with blocks of a resting condition showed significantly decreased activity in these areas.
The two contrasts that compared nouns and proper nouns, respectively, to coordinate conjunctions yielded increased activation partially located in early sensory regions (Heschl’s gyrus45) and most prominently in bilaterally adjacent regions (planum temporale; superior temporal gyrus46,47). We chose nouns and proper nouns for these two contrasts because they represent linguistically similar concepts but are uncorrelated in the German language and stimulus (cf. Figure 1). We contrasted nouns and proper nouns with coordinate conjunctions because nouns and proper nouns are linguistically different from, as well as uncorrelated with, coordinate conjunctions. Despite the fact that nouns and proper nouns are uncorrelated, both contrasts led to largely spatially congruent clusters. Results suggest that models based on our annotation of similar linguistic concepts correlate with hemodynamic activity in spatially similar areas. We confirmed the validity of this interpretation by testing whether the spatial congruency could be attributed to a negative correlation of coordinate conjunctions with the modeled time series, which turned out not to be the case. In summary, results of our exploratory analyses suggest that the annotation of speech meets basic quality requirements to serve as a basis for model-based analyses that investigate language perception under more ecologically valid conditions.
Zenodo: A studyforrest extension, an annotation of spoken language in the German dubbed movie “Forrest Gump” and its audio-description (annotation). https://doi.org/10.5281/zenodo.4382143 (ref. 29).
Dataset 1. The annotation (v1.0; registered) as a tab-separated-value (TSV) formatted table and a text-based TextGrid file (the native format of the software Praat).
Zenodo: A studyforrest extension, an annotation of spoken language in the German dubbed movie “Forrest Gump” and its audio-description (validation analysis). https://doi.org/10.5281/zenodo.4382188 (ref. 30).
Dataset 2. The data of the analysis (v1.0; registered) that we ran as a validation of the annotation’s content and quality.
Open Science Framework: studyforrest-paper-speechannotation. https://doi.org/10.17605/OSF.IO/GFRME (ref. 31).
The paper as LaTeX document, and the accompanying datasets 1 and 2 (up-to-date; unregistered), accessible as DataLad (RRID:SCR_003931) datasets.
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
COH is grateful to Valeri Kippes who took care of the author’s mental sanity by providing excellent training at his gym in Jülich during the mentally draining period of manual corrections of the annotation.
Is the rationale for creating the dataset(s) clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Yes
Are sufficient details of methods and materials provided to allow replication by others?
Yes
Are the datasets clearly presented in a useable and accessible format?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: naturalistic fMRI, natural language processing, psychoinformatics
Is the rationale for creating the dataset(s) clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Yes
Are sufficient details of methods and materials provided to allow replication by others?
Partly
Are the datasets clearly presented in a useable and accessible format?
Yes
Competing Interests: Invited the authors to a symposium on naturalistic fMRI, scheduled for March 2021; using their data from studyforrest.org for my own work
Reviewer Expertise: fMRI, language
Is the rationale for creating the dataset(s) clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Yes
Are sufficient details of methods and materials provided to allow replication by others?
Yes
Are the datasets clearly presented in a useable and accessible format?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Neuroscience