Data Note

A studyforrest extension, an annotation of spoken language in the German dubbed movie “Forrest Gump” and its audio-description

[version 1; peer review: 1 approved, 2 approved with reservations]
PUBLISHED 28 Jan 2021

Abstract

Here we present an annotation of speech in the audio-visual movie “Forrest Gump” and its audio-description for a visually impaired audience, as an addition to a large public functional brain imaging dataset (studyforrest.org). The annotation provides information about the exact timing of each of the more than 2500 spoken sentences, 16,000 words (including 202 non-speech vocalizations), 66,000 phonemes, and their corresponding speaker. Additionally, for every word, we provide lemmatization, a simple part-of-speech-tagging (15 grammatical categories), a detailed part-of-speech tagging (43 grammatical categories), syntactic dependencies, and a semantic analysis based on word embedding which represents each word in a 300-dimensional semantic space. To validate the dataset’s quality, we build a model of hemodynamic brain activity based on information drawn from the annotation. Results suggest that the annotation’s content and quality enable independent researchers to create models of brain activity correlating with a variety of linguistic aspects under conditions of near-real-life complexity.

Keywords

annotation, language, speech, narrative, naturalistic stimulus, fMRI, studyforrest

Introduction

Cognitive and psychiatric neuroimaging are moving towards studying brain functions under conditions of lifelike complexity1,2. Motion pictures3 and continuous narratives4,5 are increasingly utilized as so-called “naturalistic stimuli”. Naturalistic stimuli are usually designed for commercial purposes and to entertain their audiences. Thus, the temporal structure of their feature space is usually not explicitly known, leading to an “annotation bottleneck”6 when used for neuroscientific research.

Data-driven methods like inter-subject correlation (ISC)7 or independent component analysis (ICA)8 are often used to analyze such fMRI data in order to circumvent this bottleneck. However, use of data-driven methods alone falls short of associating results with particular stimulus events9. Model-driven methods, like the general linear model (GLM), which are based on stimulus annotations can be useful to test hypotheses on specific brain functions under more ecologically valid conditions, to statistically control confounding stimulus features, and to explain not just “how” the brain is responding to a stimulus but also “why”10. Studies using GLMs based on annotations of a stimulus’ temporal structure have elucidated, for example, how the brain responds to visual features of a movie11 or speech-related features of a narrative12. Furthermore, stimulus annotations can inform data-driven methods about a stimulus’ temporal dynamics, or model-driven and data-driven methods can be combined to improve the interpretability of results13.

Here we provide an annotation with the exact onset and offset of each sentence, word and phoneme (see Table 1 for an overview) spoken in the audio-visual movie “Forrest Gump”14 and its audio-description (i.e. the movie’s soundtrack with an additional narrator)15. fMRI data of participants watching the audio-visual movie16 and listening to the audio-description17 are the core data of the publicly available studyforrest dataset (studyforrest.org). The current publication enables researchers to model hemodynamic brain responses that correlate with a variety of aspects of spoken language, ranging from a speaker’s identity to phonetics, grammar, syntax, and semantics. This publication extends already available annotations of portrayed emotions18, perceived emotions19, as well as cuts and locations depicted in the movie20. All annotations can be used in any study focusing on aspects of real-life cognition by serving as additional confound measures describing the temporal structure and feature space of the stimuli.

Table 1. Overview of the annotation’s content for the audio-description of “Forrest Gump” (i.e. the audio-only variant of the movie) that comprises the additional narrator. Counts are given for the whole stimulus (all) and its individual segments used during fMRI scanning. The category sentences comprises complete grammatical sentences which are additionally marked in the annotation with a full stop at the end (“my feet hurt.”). It also comprises questions (“do you want a chocolate?”), exclamations (“run away!”), or non-speech vocalizations in quick succession (“ha, ha, ha”), or in isolation (e.g. “Forrest?”, “Forrest!”, “ha”) at time points when speakers switch rapidly. The category words comprises each word or non-speech vocalization (N=202) in isolation.

Category    All     1     2     3     4     5     6     7     8
Sentences   2528    292   366   320   352   344   289   365   200
Words       16187   2089  2162  2115  2035  2217  2033  2322  1214
Phonemes    66611   8802  8727  8770  8557  9197  8353  9351  4854

Materials and methods

Stimulus

We annotated speech in the slightly shortened “research cut”17 of the movie “Forrest Gump” and its temporally aligned audio-description16 that was broadcast as an additional audio track for visually impaired listeners on Swiss public television15. The plot of the original movie is already carried by an off-screen voice of the main character Forrest Gump. In the audio-description, an additional male narrator describes essential aspects of the visual scenery when there is no off-screen voice, dialog, or other relevant auditory content.

Annotation procedure

Preliminary, manual orthographic transcripts of dialogues, non-speech vocalizations (e.g. laughter or groaning) and the script for the audio-description’s narrator were merged and converted to Praat’s21 TextGrid format. This merged transcript contained rough onset and offset timings for small groups of sentences, and was further edited in Praat for manual validation against the actual content of the audio material. The following steps were performed by a single person, already familiar with the stimulus, in several passes to iteratively improve the quality of the data: approximate temporal onsets and offsets were corrected; intervals containing several sentences were split into intervals containing only one sentence; when two or more persons were speaking simultaneously the less dominant voice was dropped; low volume non-speech vocalizations or low volume background speech (especially during music or continuous environmental noise) which were subjectively assessed to be incomprehensible for the audience were also dropped.

We then used the Montreal Forced Aligner v1.0.122 to algorithmically identify the exact onset and offset of each word and phoneme. To enable the aligner to look up the phonemes embedded within each word, we chose the accompanying German pronunciation dictionary provided by Prosodylab23 that uses the Prosodylab PhoneSet to describe the pronunciation of phonemes. To improve the detection rate of the automatic alignment, the dictionary was manually updated with German words that occur in the stimuli but were originally missing in the dictionary. The pronunciation of English words and phonemes occurring in the otherwise German audio track was taken from the accompanying English pronunciation dictionary (following the ARPAbet PhoneSet). The audio track of the audio-description was converted from FLAC to WAV via FFmpeg v4.1.424 to meet the aligner’s input requirements. This WAV file, the merged transcription, and the updated dictionary were submitted to the aligner that first trained an acoustic model on the data and then performed the alignment.
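As an illustration, the two command-line steps described above could be reproduced roughly as follows. This is a minimal sketch, not the authors’ actual pipeline: the file and directory names are hypothetical, and the exact name and argument order of the MFA v1.0.x entry point (mfa_train_and_align) are assumptions based on the aligner’s documentation.

```python
# Minimal sketch (hypothetical paths): convert the FLAC audio track to WAV
# with FFmpeg and run the Montreal Forced Aligner in train-and-align mode.
import subprocess

# FFmpeg: FLAC -> WAV, as required by the aligner.
subprocess.run(["ffmpeg", "-i", "fg_ad.flac", "fg_ad.wav"], check=True)

# MFA v1.0.x (assumed entry point and argument order):
# corpus directory (WAV + transcript), pronunciation dictionary, output directory.
subprocess.run(
    ["mfa_train_and_align", "corpus/", "german_prosodylab.dict", "aligned/"],
    check=True,
)
```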

The resulting timings of words and phonemes were corrected manually and iteratively in several passes using Praat v6.0.2221: in a first step, onsets and offsets for which the automatic alignment had performed only moderately well were corrected. Some low-volume sentences that are spoken in continuously noisy settings (e.g. during battle or hurricane) were removed due to poor overall alignment performance. In a second step, the complete sentences of the orthographic transcription were copied into the annotation created by the aligner. In a third step, a speaker’s identity was added for each sentence (see Table 2 for the most often occurring speakers). During every step, previous results were repeatedly checked for errors and possible further improvements.

Table 2. Sentences spoken by the ten most often occurring speakers sorted alphabetically. The narrator only occurs in the audio-description. Overall 97 persons were identified. Names are mostly identical to the names used in18.

Name             All   1    2    3    4    5    6    7    8
Bubba            74    0    16   40   18   0    0    0    0
Forrest          354   22   37   22   48   50   61   49   65
Forrest (child)  19    17   2    0    0    0    0    0    0
Forrest (v.o.)   369   61   48   53   51   37   40   63   16
Hancock          16    16   0    0    0    0    0    0    0
Jenny            177   0    46   30   3    25   0    57   16
Jenny (child)    23    7    16   0    0    0    0    0    0
Lt. Dan          183   0    0    49   33   65   28   0    8
Mrs. Gump        53    38   2    0    0    0    13   0    0
Narrator         903   111  134  78   139  93   115  147  86

We employed the Python package spaCy v2.2.125 and its accompanying German language model (de_core_news_md) that was trained on the TIGER Treebank corpus26 to automatically analyze linguistic features of each word in its corresponding sentence. Non-speech vocalizations were dropped from the sentences before analysis to improve results. We then performed analyses regarding part-of-speech (i.e. grammatical tagging or word-category disambiguation), syntactic dependencies, lemmatization, word embedding (i.e. a multi-dimensional meaning representation of a word), and whether the word is one of the most common words of the German language (i.e. whether the word is part of a stop list).
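For illustration, a minimal spaCy sketch of the per-word analyses listed above; the example sentence is made up, and the actual annotation was of course derived from the stimulus transcript.

```python
# Minimal sketch of the per-word linguistic analysis with spaCy's German model.
import spacy

nlp = spacy.load("de_core_news_md")        # model trained on the TIGER Treebank
doc = nlp("Mein Name ist Forrest Gump.")   # illustrative sentence

for token in doc:
    print(
        token.text,          # word
        token.pos_,          # simple part-of-speech tag (Universal Dependencies)
        token.tag_,          # detailed part-of-speech tag (TIGER/STTS)
        token.dep_,          # syntactic dependency (arc label)
        token.head.text,     # syntactic head of the word
        token.lemma_,        # lemma (base form)
        token.is_stop,       # part of spaCy's German stop list?
        token.vector.shape,  # 300-dimensional word embedding
    )
```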

Data legend

The annotation is available in two different versions, both providing the same information: a) as a text-based Praat TextGrid file, and b) as a text-based, tab-separated value (TSV) formatted table. The following descriptions refer to the ten columns of the TSV file, namely onset, duration, person, text, pos, tag, dep, lemma, stop, vector.

Start (start)

The onset of the sentence, word or phoneme. Time stamps are provided in the format seconds.milliseconds from stimulus onset.

Duration (duration)

The duration of the sentence, word or phoneme provided in the format seconds.milliseconds.

Speaker identity (person)

Name of the person that speaks the sentence, word or phoneme. See Table 2 for the ten most often occurring speakers.

Text (text)

The text of a spoken sentence or word, or the pronunciation of a phoneme. Phonemes of German words follow the Prosodylab PhoneSet, English words follow the ARPAbet PhoneSet.

Simple part-of-speech tag (pos)

A simple part-of-speech tagging (grammatical tagging; word-category disambiguation) of words. The tag labels of this simple part-of-speech tagging follow the Universal Dependencies v2 POS tag set (universaldependencies.org). See Table 3 for a description of the labels and the respective counts of all 15 labels. Nouns that spaCy mistook for proper nouns or vice versa were corrected via script. Additionally, in this column sentences are tagged as SENTENCE and phonemes as PHONEME to facilitate filtering in potential further processing steps.
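For example, the TSV version of the annotation (see Dataset content) could be split into sentence, word, and phoneme events via this column. A minimal sketch with pandas, assuming the column order listed above; whether the released file ships with a header row is an assumption to check.

```python
# Minimal sketch: load the TSV annotation and filter by the pos column.
import pandas as pd

columns = ["onset", "duration", "person", "text", "pos",
           "tag", "dep", "lemma", "stop", "vector"]

# If the released file already contains a header row, use header=0 and drop names=.
anno = pd.read_csv("annotation/fg_rscut_ad_ger_speech_tagged.tsv",
                   sep="\t", header=None, names=columns)

sentences = anno[anno["pos"] == "SENTENCE"]
phonemes = anno[anno["pos"] == "PHONEME"]
words = anno[~anno["pos"].isin(["SENTENCE", "PHONEME"])]

print(len(sentences), len(words), len(phonemes))
```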

Table 3. Simple part-of-speech tagging (pos) performed by the Python package spaCy25. All 15 labels sorted alphabetically. Descriptions were taken from spaCy.explain(). Non-speech vocalizations (NONSPEECH) were manually identified. Counts for the whole stimulus (all) and for each of the eight stimulus segments refer to the audio-description.

Label      Description                All    1    2    3    4    5    6    7    8
ADJ        adjective                  916    138  126  106  96   130  118  128  74
ADP        adposition                 1429   181  176  176  194  188  183  213  118
ADV        adverb                     1332   166  169  220  162  178  169  193  75
AUX        auxiliary                  807    102  120  92   96   125  110  112  50
CONJ       conjunction                525    74   63   71   49   61   80   86   41
DET        determiner                 1754   257  243  198  219  220  222  254  141
NONSPEECH  non-speech vocalization    202    23   21   9    23   55   44   13   14
NOUN       noun                       2620   361  341  332  343  331  356  351  205
NUM        numeral                    66     8    11   11   7    4    9    14   2
PART       particle                   572    60   100  90   62   83   53   86   38
PRON       pronoun                    2348   275  321  328  260  348  262  362  192
PROPN      proper noun                1012   131  135  119  168  162  116  117  64
SCONJ      subordinating conjunction  172    19   18   20   15   31   27   26   16
VERB       verb                       2317   285  308  320  319  289  274  349  173
X          other                      108    8    10   21   21   11   9    17   11

Detailed part-of-speech tag (tag)

A detailed part-of-speech tagging of words following the TIGER Treebank annotation scheme26 which is based on the Stuttgart-Tübingen-Tagset27. See Table 4 for a description of the labels and the respective counts of the 15 most often occurring labels (overall 43 labels). Nouns that spaCy mistook for proper nouns or vice versa were corrected via script.

Table 4. Detailed part-of-speech tagging (tag) performed by the Python package spaCy25. The 15 most often occurring labels (overall 43 labels) sorted alphabetically. Descriptions were taken from spaCy.explain(). Counts for the whole stimulus (all) and for each of the eight stimulus segments refer to the audio-description.

Label   Description                          All    1    2    3    4    5    6    7    8
ADJA    adjective, attributive               478    73   58   58   51   77   58   70   33
ADJD    adjective, adverbial or predicative  438    65   68   48   45   53   60   58   41
ADV     adverb                               1181   146  145  201  143  157  149  174  66
APPR    preposition; circumposition left     1192   156  146  156  152  157  150  178  97
ART     definite or indefinite article       1340   199  183  140  178  159  176  191  114
KON     coordinate conjunction               475    58   58   66   45   58   76   78   36
NE      proper noun                          1012   131  135  119  168  162  116  117  64
NN      noun, singular or mass               2620   361  341  332  343  331  356  351  205
PPER    non-reflexive personal pronoun       1638   183  210  221  168  246  176  287  147
PPOSAT  attributive possessive pronoun       274    34   47   36   23   39   32   40   23
PTKVZ   separable verbal particle            353    34   63   49   46   41   33   60   27
VAFIN   finite verb, auxiliary               767    96   108  89   92   116  106  110  50
VVFIN   finite verb, full                    1512   181  213  201  202  172  181  228  134
VVINF   infinitive, full                     271    37   25   51   32   42   27   40   17
VVPP    perfect participle, full             329    37   40   35   58   44   51   50   14

Syntactic dependency (dep)

Information about a word’s syntactic dependencies with other words within the same sentence. Information follows the TIGER Treebank annotation scheme26 and is given in the format: “arc label;word’s head;word’s child1, word’s child2, ...”, where the “arc label” (see Table 5) describes the type of syntactic relation that connects a “child” (the current word) to its “head”.
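A minimal sketch of how a cell of this column could be split into its three parts; the example string is made up for illustration and assumes that every cell contains exactly three semicolon-separated fields as described above.

```python
# Minimal sketch: split a dep cell of the form "arc label;head;child1, child2, ..."
def parse_dep(cell):
    label, head, children = cell.split(";", 2)
    return label, head, [c.strip() for c in children.split(",") if c.strip()]

print(parse_dep("sb;ist;mein, Name"))
# ('sb', 'ist', ['mein', 'Name'])
```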

Table 5. Syntactic dependency parsing (dep) performed by the Python package spaCy25. The 15 most often occurring labels (overall 37 labels) sorted alphabetically. Descriptions were taken from spaCy.explain(). Counts for the whole stimulus (all) and for each of the eight stimulus segments refer to the audio-description.

Label  Description               All    1    2    3    4    5    6    7    8
cd     coordinating conjunction  335    48   44   48   34   41   53   42   25
cj     conjunct                  524    65   74   88   53   65   80   65   34
cp     complementizer            160    17   17   20   16   29   25   21   15
da     dative                    170    15   30   27   19   23   18   27   11
ju     junctor                   130    10   13   16   12   16   22   31   10
mnr    postnominal modifier      245    30   29   33   44   31   28   27   23
mo     modifier                  2634   349  345  355  327  356  334  384  184
nk     noun kernel element       3763   516  482  448  475  507  485  551  299
oa     accusative object         1036   117  139  149  148  146  126  134  77
oc     clausal object            732    98   86   97   94   105  97   115  40
pd     predicate                 301    39   50   40   25   45   41   38   23
pnc    proper noun component     154    36   19   15   14   28   22   15   5
ROOT   root of sentence          2417   285  349  322  336  317  267  358  183
sb     subject                   2231   280  306  271  276  301  281  340  176
svp    separable verb prefix     355    36   65   45   49   43   33   56   28

Lemmatization (lemma)

The base form (root) of a word.

Common Word (stop)

This column indicates whether the word is part of a stop list, i.e. whether it is one of the most common words of the German language (True vs. False).

Word embedding (vector)

A 300-dimensional word vector providing a multi-dimensional meaning representation of a word. For out-of-vocabulary words, whose vector would consist of 300 zeroes, the cell was set to # to save space.
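A minimal sketch of turning such a cell into a NumPy array and comparing two words by cosine similarity; the separator within the cell (assumed here to be a comma-separated list of floats) and the handling of the # placeholder are assumptions about the file format.

```python
# Minimal sketch: parse a vector cell and compute cosine similarity.
import numpy as np

def parse_vector(cell, dims=300):
    if cell == "#":  # out-of-vocabulary word stored as a placeholder
        return np.zeros(dims)
    return np.array([float(x) for x in cell.split(",")])

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0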

Dataset content

The annotation comes in two different versions. First, as a text-based TextGrid file (annotation/fg_rscut_ad_ger_speech_tagged.TextGrid) to be conveniently edited using the software Praat21. Second, as a text-based, tab-separated-value (TSV) formatted table (annotation/fg_rscut_ad_ger_speech_tagged.tsv) in accordance with the brain imaging data structure (BIDS)28. The dataset and validation data are available from Open Science Framework, DataLad and Zenodo (see Underlying data)29,30,31. The source code for all descriptive statistics included in this paper is available in code/descriptive-statistics.py (Python script).

Dataset validation

In order to assess the annotation’s quality, we investigated whether contrasting speech-related events with events without speech leads to increased activation in areas known to be involved in language processing32. Moreover, we tested whether two similar linguistic concepts providing high semantic information (proper nouns and nouns), each contrasted with a concept providing low semantic information (coordinate conjunctions), lead to increased activation in congruent brain areas.

We used a dataset providing blood oxygenation level-dependent (BOLD) functional magnetic resonance imaging (fMRI) data of 20 subjects (age 21–38 years, mean age 26.6 years, 12 male) listening to the 2 h audio-description (7 Tesla, 2 s repetition time, 3599 volumes, 36 axial slices, thickness 1.4 mm, 1.4 × 1.4 mm in-plane resolution, 224 mm field-of-view)17. Data were already corrected for motion at the scanner computer. Further, individual BOLD time-series were already aligned by non-linear warping to a study-specific T2*-weighted echo planar imaging (EPI) group template (cf.17 for exact details).

All further steps for the current analysis were carried out using FEAT v6.00 (FMRI Expert Analysis Tool)33 as part of FSL v5.0.9 (FMRIB’s Software Library)34. Data of one participant were dropped due to invalid distortion correction during scanning. Data were temporally high-pass filtered (cut-off 150 s), spatially smoothed (Gaussian kernel; 4.0 mm FWHM), and the brain was extracted from surrounding tissue. A grand-mean intensity normalization of the entire 4D dataset was performed by a single multiplicative factor.

We implemented a standard three-level, voxel-wise general linear model (GLM) to average parameter estimates across the eight stimulus segments, and later across 19 subjects. At the first level, which analyzed each segment for each subject individually, we created 26 regressors (see Table 6) based on events drawn from the annotation. The 20 most often occurring detailed part-of-speech labels (nn with N=2620 to prf with N=157) were modeled as a boxcar function from onset to offset of each word. The remaining part-of-speech labels were pooled into a single new label (tag_other; N=1123) and modeled as a boxcar function from a word’s onset to offset. The 80 most often occurring phonemes (n with N=6053 to IY1 with N=32) were pooled into a single label (phones; N=65251) and modeled as a boxcar function from a phoneme’s onset to offset. The end of each complete grammatical sentence was modeled as an impulse event (N=1651) to capture variance correlating with sentence comprehension. “No-speech” events (no-sp; N=264) serving as a control condition were created such that a sufficient number of events and a minimum separation of speech and non-speech events were achieved. Events were randomly positioned in intervals without audible speech that lasted at least 3.6 s. Each event of the no-speech condition had to have a minimum distance of 1.8 s to any onset or offset of a word, and to any onset of another no-speech event. A length of 70 ms was chosen for no-speech events, matching the average length of phonemes. Lastly, we used continuous bins of information about low-level auditory features (left-right difference in volume and root mean square energy) that were averaged across the length of every movie frame (40 ms) to capture variance correlating with assumed low-level perceptual processes. Time series of events were convolved with FSL’s “Double-Gamma HRF” as a model of the hemodynamic response function to create the actual regressors. The Pearson correlation coefficients of the 26 regressors across the time course of all stimulus segments can be seen in Figure 1. Temporal derivatives were also included in the design matrix to compensate for regional differences between modeled and actual HRF. Finally, six motion parameters were used as additional nuisance regressors, and the design was subjected to the same temporal filtering as the BOLD time series. The following three t-contrasts were defined: 1) words (all 21 tag-related regressors) > no-speech (no-sp), 2) proper nouns (ne) > coordinate conjunctions (kon), and 3) nouns (nn) > coordinate conjunctions (kon).
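As an illustration of the regressor construction, a minimal sketch outside of FEAT: events are expanded into a boxcar time course at high temporal resolution, convolved with a double-gamma HRF, and sampled at the repetition time of 2 s. The HRF parameters below are conventional SPM/FSL-style defaults and an approximation, not FSL’s exact implementation; the number of volumes refers to the complete stimulus, whereas the actual analysis modeled each segment separately.

```python
# Minimal sketch: build one HRF-convolved boxcar regressor from annotation events.
import numpy as np
from scipy.stats import gamma

TR, DT = 2.0, 0.01   # repetition time (s), high-resolution sampling step (s)

def double_gamma_hrf(dt=DT, duration=32.0):
    t = np.arange(0.0, duration, dt)
    h = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0   # positive response minus undershoot
    return h / h.sum()

def boxcar_regressor(onsets, durations, n_vol, tr=TR, dt=DT):
    ts = np.zeros(int(n_vol * tr / dt))
    for on, dur in zip(onsets, durations):
        ts[int(on / dt):int((on + dur) / dt)] = 1.0       # boxcar from onset to offset
    conv = np.convolve(ts, double_gamma_hrf())[:len(ts)]  # convolve with the HRF
    return conv[(np.arange(n_vol) * tr / dt).astype(int)] # sample at every volume

# e.g. all words with the detailed tag NN from the annotation table loaded earlier:
# nn = words[words["tag"] == "NN"]
# reg_nn = boxcar_regressor(nn["onset"].astype(float),
#                           nn["duration"].astype(float), n_vol=3599)
```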

Table 6. Overview of events that were used to create the 26 regressors of the GLM analysis. The respective counts are given for the whole stimulus and the eight segments that were used during fMRI scanning. The 20 most often occurring labels from the detailed part-of-speech tagging (tag) were used as such. Words belonging to all other labels were pooled into tag_other. The label sentence contains the end of complete grammatical sentences. The label phones contains events of the 80 most often occurring phonemes (phoneme n with N=6053 to phoneme IY1 with N=32). The label no-sp represents moments when no speech was audible. fg_ad_lrdiff (left-right volume difference) and fg_ad_rms (root mean square energy) were computed for and averaged across every movie frame (40 ms) via Python script. Events were convolved with FSL’s “Double-Gamma HRF” to create the regressors. The correlation of these regressors over the time course of the whole stimulus can be seen in Figure 1.

Label         Description                          All     1      2      3      4      5      6      7      8
adja          adjective, attributive               478     73     58     58     51     77     58     70     33
adjd          adjective, adverbial or predicative  438     65     68     48     45     53     60     58     41
adv           adverb                               1181    146    145    201    143    157    149    174    66
appr          preposition; circumposition left     1192    156    146    156    152    157    150    178    97
apprart       preposition with article             233     24     28     20     42     31     32     35     21
art           definite or indefinite article       1340    199    183    140    178    159    176    191    114
kon           coordinate conjunction               475     58     58     66     45     58     76     78     36
ne            proper noun                          1012    131    135    119    168    162    116    117    64
nn            noun, singular or mass               2620    361    341    332    343    331    356    351    205
pds           substituting demonstrative pronoun   192     16     32     31     25     33     27     17     11
pis           substituting indefinite pronoun      217     36     30     35     28     30     21     23     14
pper          non-reflexive personal pronoun       1638    183    210    221    168    246    176    287    147
pposat        attributive possessive pronoun       274     34     47     36     23     39     32     40     23
prf           reflexive personal pronoun           157     18     25     15     23     17     26     19     14
ptkvz         separable verbal particle            353     34     63     49     46     41     33     60     27
vafin         finite verb, auxiliary               767     96     108    89     92     116    106    110    50
vmfin         finite verb, modal                   183     28     21     29     24     28     15     30     8
vvfin         finite verb, full                    1512    181    213    201    202    172    181    228    134
vvinf         infinitive, full                     271     37     25     51     32     42     27     40     17
vvpp          perfect participle, full             329     37     40     35     58     44     51     50     14
tag_other     all other TAG categories             1123    153    165    174    124    169    121    153    64
sentence      complete grammatical sentences       1651    205    231    200    215    212    198    249    141
phones        80 most often occurring phonemes     65251   8589   8534   8597   8387   8976   8184   9232   4752
no-sp         no-speech                            264     16     20     23     50     25     27     56     47
fg_ad_lrdiff  left-right volume difference         180133  22574  22075  21925  24425  23125  21975  27175  16859
fg_ad_rms     root mean square                     180133  22574  22075  21925  24425  23125  21975  27175  16859

Figure 1. Pearson correlation coefficients of the 26 regressors used in the analysis to validate the annotation.

Regressors were created by convolving the events with FSL’s “Double-Gamma HRF” as a model of the hemodynamic response function, temporally filtered with the same high-pass filter (cut-off 150 s) as the BOLD time series, and concatenated across runs before computing the correlation.

The second-level analysis that averaged contrast estimates across the eight stimulus segments per subject was carried out using a fixed effects model by forcing the random effects variance to zero in FLAME (FMRIB’s Local Analysis of Mixed Effects)35,36. The third level analysis which averaged contrast estimates across subjects was carried out using a mixed-effects model (FLAME stage 1) with automatic outlier deweighting36,37. Z (Gaussianised T/F) statistic images were thresholded using clusters determined by Z>3.4 and a corrected cluster significance threshold of p<.0537. Brain regions associated with observed clusters were labeled using the Jülich Histological Atlas38,39 and the Harvard-Oxford Cortical Atlas40 provided by FSL.

Figure 2 depicts the results of the three contrasts (z-threshold Z>3.4; p<.05 cluster-corrected). The contrast words > no-speech yielded four significant clusters (see Table 7): one left-lateralized cluster spanning from the angular gyrus and inferior posterior supramarginal gyrus across the superior and middle temporal gyrus, including parts of Heschl’s gyrus and planum temporale. A second left cluster in (inferior) frontal regions, including precentral gyrus, pars opercularis (Brodmann Areal 44; BA44) and pars triangularis (BA45). Similarly in the right hemisphere, one cluster spanning from the angular gyrus across the superior and middle temporal gyrus but including frontal inferior regions (pars opercularis and pars triangularis). A fourth significant cluster is located in the left thalamus.

Table 7. Significant clusters (z-threshold Z>3.4; p<.05 cluster-corrected) for the contrast words (all 21 tag-related regressors) > no-speech. Clusters sorted by voxel size. The first brain structure given contains the voxel with the maximum Z-Value, followed by brain structures from posterior to anterior, and partially covered areas (l. = left; r. = right; c. = cortex; g. = gyrus).

                          Max location (MNI)      Center of gravity (MNI)
Voxels  p corr.  Z-max    x      y      z         x      y      z       Structure
14990   <.001    6.31     -49    -24.7  6.35      -54.8  -32.5  3.73    l. Heschl’s g.; lateral superior occipital c., angular g., superior & middle temporal g. (posterior to anterior); parts of supramarginal g. & planum temporale
14469   <.001    6.48     55     -14.9  -6.9      54.1   -23.1  0.374   r. superior temporal g.; angular g., superior (and middle) temporal g. (posterior to anterior), Heschl’s g.; parts of supramarginal g., planum temporale, pars opercularis (BA44) & pars triangularis (BA45)
1971    <.001    5.26     -51.1  25.6   -10.5     -53.6  17.8   10.2    l. frontal orbital c.; pars opercularis (BA44), pars triangularis (BA45); parts of precentral g.
217     .002     4.55     -4.48  -13.7  10.3      -6.46  -14.9  9.96    l. thalamus

Figure 2. Results of the mixed-effects group-level (N=14) GLM t-contrasts for the audio-description of the movie “Forrest Gump”.

Significant clusters (Z>3.4, p<0.05 cluster-corrected) are overlaid on the MNI152 T1-weighted head template (grey). Light grey: the audio-description dataset’s field-of-view (cf.17).

The contrast proper nouns > coordinate conjunctions yielded nine significant clusters (see Table 8): one left-lateralized cluster spanning from the angular gyrus across planum temporale and superior temporal gyrus, partially covering Heschl’s gyrus, into the anterior middle temporal gyrus. A largely congruent but smaller cluster in the right hemisphere. Four clusters in the posterior cingulate cortex and precuneus of both hemispheres. Three small clusters in the right occipital pole, right Heschl’s gyrus, and left superior lateral occipital cortex.

Table 8. Significant clusters (z-threshold Z>3.4; p<.05 cluster-corrected) for the contrast proper nouns (ne) > coordinate conjunctions (kon). Clusters sorted by voxel size. The first brain structure given contains the voxel with the maximum Z-Value, followed by brain structures from posterior to anterior, and partially covered areas (l. = left; r. = right; c. = cortex; g. = gyrus).

                          Max location (MNI)      Center of gravity (MNI)
Voxels  p corr.  Z-max    x      y      z         x      y      z       Structure
7691    <.001    6.23     -61.2  -22.3  11.6      -55.9  -20.7  4.03    l. planum temporale; posterior inferior supramarginal g., superior temporal g., planum polare, parts of posterior angular g., Heschl’s g., middle temporal gyrus
5928    <.001    5.5      57.5   -26.2  15.9      58.2   -15.8  3.55    r. planum temporale; Heschl’s g., superior temporal g., planum polare, temporal pole; parts of angular g. & posterior inferior supramarginal gyrus
479     <.001    4.62     -5.42  -32.3  25.3      -4.28  -39.4  22.8    l. posterior cingulate g.
420     <.001    4.85     -4.76  -71.4  40.1      -3.74  -68.5  36.2    l. precuneus
407     <.001    5.07     6.83   -40.1  24.5      6.67   -38.7  23.1    r. posterior cingulate c.
294     <.001    4.57     17     -69.1  34.6      17.7   -67.1  34.9    r. precuneus
121     .024     3.95     8.12   -98.2  0.359     8.75   -97.7  -3.15   r. occipital pole
117     .027     4.38     36.9   -24.8  4.55      37.4   -23    3.09    r. Heschl’s g.
115     .029     4.08     -44.6  -71.7  21.7      -43.6  -70.8  23.4    l. superior lateral occipital c.

The contrast nouns > coordinate conjunctions yielded four significant clusters (see Table 9): two clusters that are slightly smaller than the lateral temporal clusters of the contrast proper nouns > coordinate conjunctions, in this case spanning from the angular gyrus in the left hemisphere and from the planum temporale in the right hemisphere into the anterior part of the superior temporal cortex. Finally, two small right-lateralized clusters in the posterior cingulate gyrus and the precuneus.

Table 9. Significant clusters (z-threshold Z>3.4; p<.05 cluster-corrected) for the contrast nouns (nn) > coordinate conjunctions (kon). Clusters sorted by voxel size. The first brain structure given contains the voxel with the maximum Z-Value, followed by brain structures from posterior to anterior, and partially covered areas (l. = left; r. = right; c. = cortex; g. = gyrus).

                          Max location (MNI)      Center of gravity (MNI)
Voxels  p corr.  Z-max    x      y      z         x      y      z       Structure
3166    <.001    5.75     -61.3  -10.6  -2.93     -57.7  -14.3  1.47    l. anterior superior (and middle) temporal g.; planum temporale, planum polare, anterior superior temporal g.; part of posterior supramarginal g., Heschl’s g.
1753    <.001    4.99     63.3   -15.1  8.41      58     -13    4.02    r. planum temporale, anterior superior temporal g., planum polare; part of Heschl’s g.
166     .004     4.5      6.83   -40.1  24.5      7.01   -39.7  24.2    r. posterior cingulate g.
149     .008     4.13     18.2   -67.8  36        19.8   -66.4  34.6    r. precuneus

For the contrast words > no-speech, results show increased hemodynamic activity in a bilateral cortical network including temporal, parietal and frontal regions related to processing spoken language32,41,42. These clusters resemble results of previous studies that implemented an ISC approach to analyze fMRI data of naturalistic auditory stimuli5,43,44. We do not find significantly increased activations in midline areas (such as the posterior cingulate cortex and precuneus, or anterior cingulate cortex and medial frontal cortex), which showed synchronized activity across subjects in previous studies. In this regard, our results are similar to those of4, who implemented both an ISC and a GLM analysis. In that study, the ISC analysis showed synchronized activity in midline areas, but the GLM analysis contrasting blocks of listening to narratives with blocks of a resting condition showed significantly decreased activity in these areas.

The two contrasts that compared nouns and proper nouns, respectively, to coordinate conjunctions yielded increased activation partially located in early sensory regions (Heschl’s gyrus;45) and, most prominently, in adjacent regions bilaterally (planum temporale; superior temporal gyrus;46,47). We chose nouns and proper nouns for these two contrasts because they represent linguistically similar concepts but are uncorrelated in the German language and stimulus (cf. Figure 1). We contrasted nouns and proper nouns, respectively, with coordinate conjunctions because nouns and proper nouns are linguistically different from coordinate conjunctions as well as uncorrelated with them. Despite the fact that nouns and proper nouns are uncorrelated, both contrasts lead to largely spatially congruent clusters. Results suggest that models based on our annotation of similar linguistic concepts correlate with hemodynamic activity in spatially similar areas. We confirmed the validity of this interpretation by testing whether the spatial congruency could be attributed to a negative correlation of coordinate conjunctions with the modeled time series, which turned out not to be the case. In summary, results of our exploratory analyses suggest that the annotation of speech meets basic quality requirements to serve as a basis for model-based analyses that investigate language perception under more ecologically valid conditions.

Data availability

Underlying data

Zenodo: A studyforrest extension, an annotation of spoken language in the German dubbed movie “Forrest Gump” and its audio-description (annotation). https://doi.org/10.5281/zenodo.438214329.

Dataset 1. The annotation (v1.0; registered) as a tab-separated-value (TSV) formatted table and a text-based TextGrid file (the native format of the software Praat).

Zenodo: A studyforrest extension, an annotation of spoken language in the German dubbed movie “Forrest Gump” and its audio-description (validation analysis). https://doi.org/10.5281/zenodo.438218830.

Dataset 2. The data of the analysis (v1.0; registered) that we ran as a validation of the annotation’s content and quality.

Open Science Framework: studyforrest-paper-speechannotation. https://doi.org/10.17605/OSF.IO/GFRME31.

The paper as LaTeX document, and accompanying datasets 1 and 2 (up-to-date; unregistered) accessible as DataLad (RRID:SCR_003931) datasets.

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Author contributions

COH designed, performed, and validated the annotation, and wrote the manuscript. MH provided critical feedback on the procedure and wrote the manuscript.
