Introduction

Among the variables known to affect the speed and accuracy of lexical–semantic processing in both healthy and brain-damaged participants are semantic typicality, age of acquisition, and concept familiarity. In most studies, these variables have been assessed empirically by asking participants to estimate the age at which they learned a word (age of acquisition), how familiar a concept is in a person’s individual experience (concept familiarity), or how well it represents a particular semantic category (semantic typicality). Other variables that affect word processing are intrinsic to each word and can be determined directly from its surface structure (e.g., word length in terms of number of syllables, phonemes, or letters). Furthermore, variables such as word frequency are determined by counting the occurrences of words in large language corpora (for German, see the CELEX database: Baayen, Piepenbrock, & van Rijn, 1993; or the dlexDB database: Heister, Würzner, Bubenzer, Pohl, Hanneforth, Geyken, & Kliegl, 2011).

The use of category norms for selecting stimuli for experimental investigations has had a long tradition in cognitive psychology. In her pioneering work on the internal structure of semantic categories, Rosch (1975) used examples from different semantic categories that had been directly generated by speakers in a norming study by Battig and Montague (1969; see also Van Overschelde, Rawson, & Dunlosky, 2004, for an updated and extended version of these norms). For these empirically generated category exemplars, Rosch collected norms for estimated within-category typicality using a 7-point scale (see also Uyeda & Mandler, 1980, for an extension of Rosch’s typicality norms). As such, semantic typicality reflects the degree to which a concept (e.g., penguin, robin) is representative of a given category (e.g., birds; Rosch & Mervis, 1975). It has been shown that typicality influences semantic-processing performance in online categorization or semantic decision tasks for both healthy (Holmes & Ellis, 2006; Morrison & Gibbons, 2006; Rips, Shoben, & Smith, 1973) and aphasic (Kiran, Ntourou, & Eubank, 2007; Kiran & Thompson, 2003a; Stanczak, Waters, & Caplan, 2006) processing. Semantic typicality has also been shown to influence processing speed in picture naming in healthy adults (Dell’Acqua, Lotto, & Job, 2000; Holmes & Ellis, 2006), as well as picture-naming accuracy in patients suffering from different neuropsychological disorders (Laiacona, Luzzatti, Zonca, Guarnaschelli, & Capitani, 2001; Woollams, Cooper-Pye, Hodges, & Patterson, 2008). Furthermore, within the framework of the “complexity account of treatment efficacy” (Thompson, 2007), it has been suggested that the treatment of aphasic word-finding difficulties is possibly more effective when targeting atypical items during treatment (Kiran & Thompson, 2003b). 
Differences in typicality are also reflected in differential neurophysiological responses, as atypical items have been shown to increase the N400 component in healthy participants (e.g., Heinze, Muente, & Kutas, 1998; Monetta, Tremblay, & Joanette, 2003; Núñez-Peña & Honrubia-Serrano, 2005; Stuss, Picton, & Cerri, 1988).

Age of acquisition refers to the age at which a word was learned. There are several assumptions as to why age of acquisition might affect word processing. One is that earlier-acquired concepts might build the basis for the acquisition of later concepts; hence, they might be more connected and/or more often used. The greater use of early-acquired concepts can also be described as a higher cumulative frequency of their associated words or by different frequency trajectories (Zevin & Seidenberg, 2002, 2004). Another assumption is that “different” or “better” learning mechanisms are available at early ages based on specific biological foundations such as brain plasticity (see Hernandez & Li, 2007, for a review). Although some studies have used objective measures of age of acquisition with data from children who were asked to name pictures (Álvarez & Cuetos, 2007; Morrison, Chappell, & Ellis, 1997; Pind, Jónsdóttir, Gossurardóttir, & Jónsson, 2000), most researchers have used subjective measures of estimated age of acquisition judged retrospectively by adult participants. In these estimates, age of acquisition has usually been rated on a 7-point scale (after Gilhooly & Logie, 1980). Since the original studies of Carroll and White (1973a) and Gilhooly and Logie (1980), norm data of estimated age of acquisition have been collected for a number of different languages (e.g., Alario & Ferrand, 1999; Bonin, Peereman, Malardier, Méot, & Chalard, 2003; Cameirão & Vicente, 2010; Cortese & Khanna, 2008; Dell’Acqua et al., 2000; Dimitropoulou, Duñabeitia, Blitsas, & Carreiras, 2009; Ghyselinck, De Moor, & Brysbaert, 2000; Izura, Hernández-Muñoz, & Ellis, 2005; Khanna & Cortese, 2011; Manoiloff, Artstein, Canavoso, Fernández, & Segui, 2010; Marques, Fonseca, Morais, & Pinto, 2007; Nishimoto, Miyawaki, Ueda, Une, & Takahashi, 2005; Pind et al., 2000; Ruts, De Deyne, Ameel, Vanpaemel, Verbeemen, & Storms, 2004; Sirois, Kremin, & Cohen, 2006; Tsaparina, Bonin, & Méot, 2011). 
Overall, those ratings seem to be consistent, as indexed by high intra- and intergroup reliability measures, with high correlations both between the rating scores within a group of participants and between the rating scores for the same words used in different studies with different participant populations. In addition, a number of studies have found high correlations between estimated and objective age of acquisition, measured as the age by which children can read words (Carroll & White, 1973a) or as the age by which children can name pictures (e.g., Morrison et al., 1997; Pind et al., 2000; Schröder, Kauschke, & De Bleser, 2004). However, it has also been shown that objective and estimated age of acquisition differ, in that the subjective measure based on adult estimates is more influenced by word frequency and the familiarity of concepts than are objective age-of-acquisition data (Morrison et al., 1997). Nevertheless, as objective and estimated age-of-acquisition values are highly correlated, adult estimates are regarded as being adequate measures of age of acquisition (Morrison et al., 1997).

In general, words acquired earlier in life are processed faster or more accurately than words acquired later in life in various language-processing tasks. Age of acquisition has been discussed as an important variable at the lexical processing level, where it affects the speed of processing in word recognition (Baumgaertner & Tompkins, 1998; Turner, Valentine, & Ellis, 1998) and picture-naming (e.g., Barry, Morrison, & Ellis, 1997; Carroll & White, 1973b; Chalard & Bonin, 2006; Cuetos, Ellis, & Alvarez, 1999; Hodgson & Ellis, 1998; Johnston & Barry, 2006; Morrison & Ellis, 1995) tasks. In support of semantic hypotheses (e.g., the “semantic locus” theory of Brysbaert, Van Wijnendaele, & De Deyne, 2000), age of acquisition also affects the semantic system, with faster responses for earlier- than for later-acquired words in various semantic tasks (e.g., Brysbaert et al., 2000; Cortese & Khanna, 2007; De Deyne & Storms, 2007; Ghyselinck, Custers, & Brysbaert, 2004; Morrison & Gibbons, 2006). However, some studies have not reported an influence of age of acquisition on semantic processing (e.g., Catling & Johnston, 2006; Morrison, Ellis, & Quinlan, 1992), especially when the items used were controlled for semantic typicality (Holmes & Ellis, 2006). Age of acquisition influences the speed and the accuracy of picture naming in normal aging (Morrison, Hirsh, Chappell, & Ellis, 2002), as well as affecting word processing in patients suffering from different neuropsychological conditions, with words acquired early being better preserved than words acquired later (e.g., Cuetos, Herrera, & Ellis, 2010; De Bleser & Kauschke, 2003; Gerhand & Barry, 2000; Lambon Ralph, Graham, Ellis, & Hodges, 1998; Nickels & Howard, 1995; see Ellis, in press, for an overview). 
Interestingly, the data from event-related potentials (ERP) and functional magnetic resonance imaging (fMRI) studies have suggested that words acquired early versus later may be represented differently in the brain (Cuetos, Barbón, Urrutia, & Domínguez, 2009; Fiebach, Friederici, Müller, von Cramon, & Hernandez, 2003).

The term familiarity has been used in the literature in the senses of both lexical familiarity with the word form (subjective frequency/subjective familiarity; Balota, Pilotti, & Cortese, 2001; Gernsbacher, 1984; Gilhooly & Logie, 1980; Stadthagen-Gonzalez & Davis, 2006) and familiarity with the concept of an object (e.g., Snodgrass & Vanderwart, 1980). Snodgrass and Vanderwart defined familiarity as “the degree to which you come in contact with or think about the concept” (p. 183). These or similar instructions, which explicitly cover not only the familiarity of the word form but also the usage of an item, have been used in several rating studies in which the familiarity of objects has been rated by participants after presentation of pictures (e.g., Alario & Ferrand, 1999; Bonin et al., 2003; Cuetos et al., 1999; Genzel, Kerkhoff, & Scheffter, 1995; Morrison et al., 1997; Snodgrass & Vanderwart, 1980) as well as of words (Izura et al., 2005). In contrast to word frequency and age of acquisition, it is less clear how concept familiarity and lexical retrieval in picture naming are related. An influence of familiarity on lexical retrieval in picture naming, with better processing of highly familiar words, has been found in some studies (e.g., Cuetos et al., 1999; Snodgrass & Yuditsky, 1996), but not in others (Bonin et al., 2003; Ellis & Morrison, 1998). In neuropsychological research, familiarity is regarded as a variable that influences semantic processing. In patients with acquired semantic-processing disorders, highly familiar items seem to be protected better against loss than are less familiar items, leading to better performance with highly familiar items in lexical retrieval and comprehension tasks (e.g., Funnell & De Mornay Davies, 1996; Hirsh & Funnell, 1995; Lambon Ralph et al., 1998; Woollams et al., 2008).

Within the study of so-called category-specific semantic disorders, it has been shown that some “category-specific” effects may arise due only to material–intrinsic differences in word frequency, concept familiarity, or age of acquisition (Cappa, Frugoni, Pasquali, Perani, & Zorat, 1998; Funnell & De Mornay Davies, 1996; Funnell & Sheridan, 1992; Stewart, Parkin, & Hunkin, 1992). In some studies, sets of items from animate categories (e.g., animals, fruits, vegetables) were less frequent or were acquired earlier than items from inanimate categories (e.g., furniture, tools, musical instruments; Funnell & De Mornay Davies, 1996; Howard, Best, Bruce, & Gatehouse, 1995). In line with this, it has been shown in normative studies that items from different semantic categories may vary in their mean ratings, with animals rated as being acquired relatively early and highly typical, yet rated as relatively low in familiarity (Izura et al., 2005). Likewise, Snodgrass and Vanderwart (1980) found that the items from different semantic categories could be grouped by their significant differences in their mean familiarity ratings, with items from the category of animals, together with birds and musical instruments, getting the lowest familiarity ratings, and items from the categories of furniture and kitchen utensils, together with body parts, gaining the highest familiarity ratings. Hence, controlling variables when designing experiments is particularly important when performance is assessed for different semantic categories.

During the last decade, several normative databases for age of acquisition, semantic typicality, and concept familiarity have been collected in different languages. Cross-linguistic comparisons have shown that one needs to be careful when using norms from one language in another, because culture-specific differences may arise not only with respect to name agreement, but also with regard to the conceptual familiarity of objects (Cuetos et al., 1999; Dell’Acqua et al., 2000; Sanfeliu & Fernandez, 1996). In addition, comparisons of correlations between variables have shown that although there is considerable overlap in the types of correlations, the magnitudes of these relations vary across studies. For this reason, it has been suggested that normative data should be collected for each language separately (Bonin et al., 2003).

In sum, language-specific norm data on typicality, age of acquisition, and familiarity are needed for selecting items in research on healthy and impaired language processing. In recent years, some extensive German databases of more than 2,000 words have been published for imageability, concreteness, emotional valence, and arousal (Lahl, Göritz, Pietrowsky, & Rosenberg, 2009; Võ, Conrad, Kuchinke, Urton, Hofmann, & Jacobs, 2009; Võ, Jacobs, & Conrad, 2006). By contrast, German databases with rather limited lists of items exist with norms for familiarity, visual complexity, and age of acquisition (i.e., norms for N = 244–255 items from Snodgrass & Vanderwart, 1980, are provided in Genzel et al., 1995, and Schröder et al., 2004). In addition, German norms for typicality are—to our knowledge—not yet available. Therefore, the present study had two main objectives:

  • First, to provide substantial German norm data for semantic typicality, age of acquisition, and concept familiarity for a large number of words from various semantic categories. Despite the existence of such databases in other languages, there is no such instrument in German.

  • Second, to investigate the characteristics of the present database in terms of an analysis of its intra- and interstudy reliabilities, the degree of intercorrelations between variables, and differences in rating scores with regard to different semantic categories.

Method

Four different studies were conducted for developing the current database of German norms for the semantic typicality, age of acquisition, and concept familiarity of 824 exemplars of 11 semantic categories (animals, birds, fruits, vegetables, clothing, furniture, vehicles, tools, musical instruments, professions, and sports). First, all category exemplars were collected in an exemplar generation study. Subsequently, three different rating studies were conducted to gather German norm data for the semantic typicality (Rating Study 1), age of acquisition (Rating Study 2), and concept familiarity (Rating Study 3) of the collected items. All materials were presented in German. For the present purpose, the closest English equivalent was chosen to describe the data set.

Participants

Table 1 lists the overall characteristics of the 160 participants who took part in the exemplar generation study and the three rating studies. The specific characteristics of the participants in the different rating studies are listed in the separate subsections for the four studies. All participants gave signed consent for participation and were monolingual native speakers of German. Some of the participants were enrolled in university degree programs and received course credit for their participation. The participants took part in only one study; that is, there was no overlap of participants across the four different studies.

Table 1 Age and years of education of participants in the four studies

Selection of stimuli: Exemplar generation study

Participants, materials, and procedure

A group of 20 participants (15 female, 5 male) took part in the exemplar generation study. Participants were provided with a booklet containing a list of 11 category labels (vegetables, vehicles, tools, clothing, furniture, sports, birds, fruits, animals, professions, and musical instruments Footnote 1). Each category label was presented on a separate sheet of paper. Participants were asked by written instructions to write down as many examples as they could think of for each semantic category. No time limit was given to complete the task.

Data analysis

All responses were considered for further analyses. For each item, its generation frequency (number of participants listing that item) was coded. Adaptation of the raw data was kept to a minimum in order to retain a wide range of category exemplars to be rated for semantic typicality, age of acquisition, and familiarity in the rating studies. Items that were judged by two independent raters as not belonging to the named category, as well as homographs (e.g., kiwi: bird, fruit; horn: musical instrument, part of an animal), were eliminated. Singular and plural forms of the same lemma were merged, as were synonyms. For the category of professions, each item was coded in its singular male word form. In the case of synonyms, the term generated by the majority of participants was selected. Items were regarded as synonyms (e.g., German: Grapefruit, Pampelmuse; English: grapefruit, shaddock) only if they were coded as such in a German online database of the University of Leipzig (Biemann, Bordag, Heyer, Quasthoff, & Wolff, 2004, http://wortschatz.uni-leipzig.de). All other items with minimal semantic differences (e.g., German: Stöckelschuh, Pumps; English: stiletto, pumps) remained in the set. Items listed for both the categories of animals and birds (i.e., three items: duck, parrot, and chicken), as well as superordinates (e.g., wildcat, cat of prey) and subordinates (e.g., kitchen table, dining table), remained in the set and were rated for their within-category typicality in the typicality rating study (1,123 exemplars).
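
The coding of generation frequency described above can be illustrated with a minimal Python sketch. The response lists, the normalization table, and the exclusion set are hypothetical examples and not data from the study; the function name generation_frequency is likewise an illustrative choice.

```python
from collections import Counter

# Hypothetical responses: one list of generated exemplars per participant.
responses = [
    ["Apfel", "Äpfel", "Birne", "Pampelmuse"],
    ["Apfel", "Grapefruit", "Kiwi"],
    ["Birne", "Grapefruit", "Apfel"],
]

# Hypothetical normalization table: plural forms and synonyms are mapped
# onto the label that is kept in the item set.
CANONICAL = {"Äpfel": "Apfel", "Pampelmuse": "Grapefruit"}

# Hypothetical exclusion set (e.g., homographs or non-members).
EXCLUDED = {"Kiwi"}


def generation_frequency(responses, canonical, excluded):
    """Count how many participants listed each normalized item."""
    counts = Counter()
    for listed in responses:
        # A set ensures that each participant counts at most once per item,
        # even if both a singular and a plural form were written down.
        items = {canonical.get(word, word) for word in listed}
        counts.update(items - excluded)
    return counts


freq = generation_frequency(responses, CANONICAL, EXCLUDED)
for item, n in sorted(freq.items()):
    print(item, n)
```

Counting over a set per participant keeps the measure a number of participants rather than a number of tokens, which matches the definition of generation frequency given in the text.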

Rating Study 1: Semantic typicality

Participants, materials, and procedure

A group of 20 participants (15 female, 5 male) took part in the semantic typicality rating study. A total of 1,123 exemplars of the 11 categories collected in the exemplar generation study were included in the typicality rating. Following Rosch (1975), items were presented block-wise within their corresponding categories. Two lists with different randomizations (appearance of categories and items within their categories) were presented. The participants were asked to rate the typicality of the category exemplar on a 7-point scale from 1 (very good example of the category/typical) to 7 (bad example of the category/atypical; see Appendix A and B for the specific instructions). In addition, participants could indicate if they did not know the item (unfamiliar) or if they thought that the item was not a member of the requested category (not a category member).

Data analysis

Items that were judged either as being unfamiliar or as not being a category member by 25% (5/20) or more of the participants were removed from the item set (n = 63 items). In addition, items that showed a high variability in judgments, resulting in standard deviations greater than 2, were also removed from the item set (n = 264 items). The final set of items consisted of 870 items that were included in the age-of-acquisition rating and familiarity rating studies.
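
The two exclusion criteria can be sketched in Python as follows. The ratings are invented for illustration, None stands for an "unfamiliar" or "not a category member" response, and the use of the sample standard deviation is an assumption, because the text does not state which estimator was applied.

```python
from statistics import stdev

# Hypothetical ratings on the 7-point typicality scale from 20 raters;
# None marks an "unfamiliar" or "not a category member" response.
ratings = {
    "Apfel": [1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1],
    "Durian": [5, None, None, 6, None, 4, None, 5, None, 6,
               5, 4, 6, 5, 5, 4, 6, 5, 5, 6],
    "Tomate": [1, 7, 1, 7, 2, 6, 1, 7, 1, 7, 2, 6, 1, 7, 1, 7, 2, 6, 1, 7],
}


def keep_item(scores, max_flag_share=0.25, max_sd=2.0):
    """Apply both criteria: share of flagged responses and rating spread."""
    flagged = sum(score is None for score in scores)
    # Criterion 1: flagged by 25% or more of the raters.
    if flagged / len(scores) >= max_flag_share:
        return False
    valid = [score for score in scores if score is not None]
    # Criterion 2: standard deviation greater than 2 (sample SD assumed).
    return stdev(valid) <= max_sd


kept = [item for item, scores in ratings.items() if keep_item(scores)]
print(kept)
```

In this invented example, one item is dropped because 5 of 20 raters flagged it and another because its ratings are split between the two ends of the scale.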

Rating Study 2: age of acquisition

Participants, materials, and procedure

A group of 60 participants (35 female, 25 male) took part in the age-of-acquisition rating study. The 870 words rated for typicality were divided into three lists of items (n = 290 items each). Items from the 11 semantic categories were equally distributed across the three lists. Items in each list (and within each category) did not differ in terms of typicality (t test for unrelated samples, all ps > .1). Following Gilhooly and Logie (1980), participants were asked to indicate on a 7-point scale when they thought they had learned the words. At the top of each page, the 7-point scale was explained, in which 1 = 0–2 years, 2 = 3–4 years, 3 = 5–6 years, 4 = 7–8 years, 5 = 9–10 years, 6 = 11–12 years, and 7 = 13 years or older. An additional column (item unknown) was added (after Marques et al., 2007; see Appendixes A and B for the specific instructions), and each of the three lists was rated by 20 new participants. Items were presented in blocks (Footnote 2) within their corresponding categories. Two lists with different randomizations (appearance of categories and items within their categories) were presented.
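
For readers who want to relate scale points to ages, the 7-point scale can be written as a small mapping. The following Python sketch only restates the age bands given above; the function name aoa_scale_point is hypothetical, and participants in the study chose scale points directly rather than reporting ages.

```python
def aoa_scale_point(age_years: int) -> int:
    """Map an age in whole years to the 7-point age-of-acquisition scale."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years <= 2:
        return 1  # 0-2 years
    if age_years >= 13:
        return 7  # 13 years or older
    # Ages 3-12 fall into two-year bands: 3-4 -> 2, 5-6 -> 3, ..., 11-12 -> 6.
    return (age_years - 1) // 2 + 1


for age in (1, 4, 6, 10, 15):
    print(age, aoa_scale_point(age))
```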

Data analysis

Items that were judged as being unknown by 25% (5/20) or more of the participants were removed from the item set (n = 22 items). In addition, items that showed a high variability in judgments, resulting in standard deviations greater than 2, were also removed from the item set (n = 5 items).

Rating Study 3: Concept familiarity

Participants, materials, and procedure

A group of 60 participants (31 female, 29 male) took part in the familiarity rating study. The 870 words rated for typicality in Rating Study 1 were divided into three lists of items (n = 290 items each). Items from the 11 semantic categories were distributed equally across the three lists. Items in each list (and within each category) did not differ in terms of typicality (t test for unrelated samples, all ps > .1). Following Snodgrass and Vanderwart (1980), participants were asked to estimate the degree to which they thought about or came in contact with a concept, using a 5-point scale ranging from 1 (very unfamiliar) to 5 (very familiar). Care was taken to make sure that the estimate was attributed to the concept itself and not to the word (see Appendixes A and B for the specific instructions). Each of the three lists was rated by 20 new participants. Each item was presented together with its category label. Two lists with different randomizations of items across categories (with no more than two items from the same semantic category appearing consecutively) were presented.
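
One simple way to obtain an item order in which no more than two items from the same category appear consecutively is to reshuffle until the constraint holds. The Python sketch below illustrates this idea with invented items; it is an assumption about how such a list could be built, not a description of the procedure used in the study, and repeated shuffling may become slow for long lists dominated by one category.

```python
import random


def max_run(categories):
    """Return the length of the longest run of identical consecutive labels."""
    longest = current = 0
    previous = object()  # sentinel that equals no category label
    for category in categories:
        current = current + 1 if category == previous else 1
        longest = max(longest, current)
        previous = category
    return longest


def constrained_shuffle(items, limit=2, seed=None, max_attempts=10_000):
    """Shuffle (word, category) pairs until no category run exceeds `limit`."""
    rng = random.Random(seed)
    order = list(items)
    for _attempt in range(max_attempts):
        rng.shuffle(order)
        if max_run([category for _word, category in order]) <= limit:
            return order
    raise RuntimeError("no valid order found within max_attempts")


# Hypothetical items with their category labels.
items = [
    ("Hund", "animals"), ("Katze", "animals"), ("Pferd", "animals"),
    ("Apfel", "fruits"), ("Birne", "fruits"), ("Kirsche", "fruits"),
    ("Hammer", "tools"), ("Säge", "tools"), ("Zange", "tools"),
    ("Tisch", "furniture"), ("Stuhl", "furniture"), ("Bett", "furniture"),
]

order = constrained_shuffle(items, limit=2, seed=1)
print(order)
```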

Data analysis

Items that were judged as unfamiliar by 25% (5/20) or more of the participants were removed from the item set (n = 8 items). None of the items remaining in the item set showed high variability in judgments that resulted in standard deviations greater than 2.

Characteristics of the final database

Finally, the data from the exemplar generation study and the three rating studies were subsumed into a single database. The database consisted of 824 German nouns that were exemplars from 11 semantic categories (animals, birds, fruits, vegetables, clothing, furniture, vehicles, tools, musical instruments, professions, and sports). Each semantic category included between 40 and 193 exemplars that were generated in the exemplar generation study and rated for semantic typicality, age of acquisition, and concept familiarity by 20 different participants. For each category exemplar, its exemplar generation frequency (number of participants listing that item in the exemplar generation study) is provided in the database. In addition, norms for semantic typicality, age of acquisition, and familiarity are provided. Furthermore, for each word in the database, measures of word length (number of phonemes, number of syllables) and word frequency (normalized lemma frequency per million and logarithmic normalized lemma frequency) are given. All frequency values given in the database were taken from the German dlexDB database (www.dlexdb.de; Heister et al., 2011), which is based on the reference corpus of the German language compiled by the Digital Dictionary of the German Language (DWDS) with a size of about 100 million words (tokens) and 2.3 million distinct words (types).Footnote 3 The full database can be downloaded from www.springerlink.com.

Results

Reliability

The intrastudy reliability of the data was tested by computing split-half correlations of the mean rating values for two different lists of randomized items. The results showed high intrastudy reliabilities for all three rating studies, with strong correlations between the mean rating values of the two lists of randomized items (typicality rating, r = .87; age-of-acquisition rating, r = .92; familiarity rating, r = .79). Interstudy reliability was examined by carrying out cross-study correlations on the variables in common on subsets of identical items included in other, comparable databases. A database was included if there was an overlap of about 100 or more items in both databases.Footnote 4 For the measures of semantic typicality, items were only included if they were estimated in relation to the same category in the comparable studies.Footnote 5 Table 2 depicts the results for the cross-study correlations on subsets of identical items—precisely, 2 other German studies and 15 other studies carried out in 10 different languages. Overall, there were highly significant correlations across studies for all three ratings carried out in the present study. The strongest correlations were found for measures of age of acquisition and conceptual familiarity carried out in 2 other German studies by Schröder et al. (2004) and by Genzel et al. (1995). For ratings obtained in studies from other languages, moderate to strong correlations were shown for the estimates of semantic typicality, age of acquisition, and concept familiarity (see Table 2). Negative values were simply due to the fact that in some studies, rating scales opposite to the ones used in our studies were used, with high values on the rating scale representing lower values of the estimated variables, and vice versa.
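
The split-half procedure can be illustrated with a short Python sketch: item means are computed separately for the raters of each list, and the two vectors of means are correlated. The ratings below are invented, and the helper names are hypothetical. The last lines also show why reversed rating scales in other studies yield negative correlations of the same magnitude.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_means(ratings_by_rater):
    """Mean rating per item; rows are raters, columns are items."""
    return [mean(scores) for scores in zip(*ratings_by_rater)]


# Hypothetical ratings: each half contains two raters and the same four items.
half_a = [[1, 3, 5, 7], [2, 3, 6, 6]]
half_b = [[1, 4, 5, 6], [1, 2, 6, 7]]

means_a = item_means(half_a)
means_b = item_means(half_b)
r_split_half = pearson_r(means_a, means_b)

# Reversing a 7-point scale (1 becomes 7, and so on) flips the sign of r.
reversed_b = [8 - m for m in means_b]
r_reversed = pearson_r(means_a, reversed_b)

print(round(r_split_half, 3), round(r_reversed, 3))
```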

Table 2 Correlations of the variables in common between the ratings of the present study and other studies (Pearson’s r)

Intercorrelations between variables

Table 3 provides the intercorrelations of the German norms collected for semantic typicality, age of acquisition, and concept familiarity in the present study. In addition, the correlations of these variables with measures of word frequency and word length are also reported.Footnote 6

Table 3 Correlations among semantic typicality, age of acquisition, concept familiarity, word frequency, and word length (N = 824 items)

The correlational analyses showed that all variables were significantly correlated with each other (all ps < .01). The two measures of word length (i.e., number of phonemes and syllables) were strongly correlated. In addition, there were moderate correlations between age of acquisition, familiarity, typicality, and word frequency, showing that the words in the database that were acquired early also tended to be more familiar, more typical, and more frequent than words acquired later. Furthermore, word length (e.g., in syllables) was moderately correlated with age of acquisition and word frequency, as well as being weakly but still significantly correlated with semantic typicality and concept familiarity (see Table 3).

Effects of semantic category

To analyze an effect of semantic category, one-way analyses of variance were used to compare the mean ratings of each item on each of the variables (semantic typicality, age of acquisition, and concept familiarity). Table 4 provides a description of the mean rating values for the 11 semantic categories in each of the rating studies.
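
As an illustration of the analysis, the following Python sketch computes a one-way analysis of variance with items as cases and semantic category as the grouping factor. The three groups of mean ratings are invented, and the function name one_way_anova is hypothetical. With 11 categories and 824 items, the same formulas give 10 and 813 degrees of freedom.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a list of groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    group_means = [mean(group) for group in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (gm - grand_mean) ** 2
        for group, gm in zip(groups, group_means)
    )
    # Variation of the items around their own group mean.
    ss_within = sum(
        (value - gm) ** 2
        for group, gm in zip(groups, group_means)
        for value in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_value, df_between, df_within, eta_squared


# Hypothetical mean ratings of items in three categories.
groups = [[2.0, 2.5, 3.0], [3.0, 3.5, 4.0], [4.0, 4.5, 5.0]]
f_value, df_between, df_within, eta_squared = one_way_anova(groups)
print(f_value, df_between, df_within, eta_squared)
```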

Table 4 German database with norms for semantic typicality, age of acquisition, and concept familiarity: Mean ratings of items in each category

Rating Study 1: Semantic typicality

The analysis revealed a main effect of semantic category on the typicality ratings, F(10, 813) = 7.33, p < .001, η² = .08. Pairwise comparisons between the mean rating values in the different semantic categories showed that items in the categories of vegetables, fruits, and birds were rated as more typical than were items from the categories of clothing, sports, and vehicles, which were rated as more atypical (Tukey’s HSD, all ps < .05).

Rating Study 2: Age of acquisition

The analysis revealed a main effect of semantic category on the age-of-acquisition ratings, F(10, 813) = 21.97, p < .001, η² = .21. Pairwise comparisons between the mean rating values in the different semantic categories (see Table 4) showed that items in the category of animals were rated as being acquired earlier than were items in any of the other categories (Tukey’s HSD, all ps < .01), except for furniture and vehicles. By contrast, words in the category of professions were rated as being acquired later than were words in any of the other categories (Tukey’s HSD, all ps < .01), except for musical instruments, tools, and sports.

Rating Study 3: Concept familiarity

The analysis of variance revealed a main effect of semantic category on the familiarity ratings, F(10, 813) = 21.12, p < .001, η² = .21. Pairwise comparisons between the mean rating values in the different semantic categories showed that items in the categories of vegetables, fruits, and furniture had relatively high familiarity ratings and were rated as more familiar than were items in any of the other categories (Tukey’s HSD, all ps < .01). By contrast, items in the category of musical instruments were rated as less familiar than were items in any of the other categories (Tukey’s HSD, all ps < .01) except for birds and sports.
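
In a one-way analysis of variance, the effect size can be recovered from the F value and its degrees of freedom, because eta squared equals F × df1 / (F × df1 + df2). The following Python sketch applies this identity to the three F values reported above as a consistency check; the function name eta_squared_from_f is hypothetical.

```python
def eta_squared_from_f(f_value, df_between, df_within):
    """Eta squared of a one-way ANOVA, derived from F and its df."""
    return (f_value * df_between) / (f_value * df_between + df_within)


# F values and effect sizes as reported for the three rating studies.
reported = {
    "semantic typicality": (7.33, 0.08),
    "age of acquisition": (21.97, 0.21),
    "concept familiarity": (21.12, 0.21),
}

recomputed = {
    name: eta_squared_from_f(f_value, 10, 813)
    for name, (f_value, _eta_reported) in reported.items()
}

for name, (_f_value, eta_reported) in reported.items():
    print(f"{name}: recomputed {recomputed[name]:.2f}, reported {eta_reported:.2f}")
```

Rounded to two decimals, the recomputed values agree with the reported effect sizes.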

Discussion

The present study had two aims: (a) to provide a large German database containing norms for semantic typicality, age of acquisition, and concept familiarity for German nouns from numerous semantic categories, and (b) to provide a descriptive analysis of the database that included an examination of the intra- and interstudy reliabilities, an analysis of intercorrelations between the estimated variables, word frequency, and word length, and the distributions of the mean rating scores for different semantic categories.

To meet the first goal, of establishing a large German database with norms for semantic typicality, age of acquisition, and concept familiarity, we collected a large list of exemplars of 11 semantic categories that were directly generated by native speakers of German in an exemplar generation study. For each of these 824 category exemplars, norms for semantic typicality, age of acquisition, and concept familiarity were gathered. In addition, values of word frequency taken from the German lexical database dlexDB (Heister et al., 2011) and measures of word length (number of phonemes, number of syllables) were included in the database.

Second, we characterized our database by providing analyses of inter- and intrastudy reliabilities. To obtain a measure of intrastudy reliability for each rating study, split-half Pearson’s rs were computed for the participants rating the two lists of randomized items. Overall, the data showed high intrastudy reliability, with scores of r = .87 for the typicality rating, r = .79 for the familiarity rating, and r = .92 for the age-of-acquisition rating. The reliability scores obtained in the present study for the familiarity and typicality ratings are somewhat lower than those reported previously (e.g., for typicality, r = .90 or higher in Ruts et al., 2004, and Rosch, 1975; for familiarity, r = .92, Izura et al., 2005), whereas the intrastudy correlations obtained for the age-of-acquisition ratings were similar to those reported in the literature (e.g., r = .98, Gilhooly & Logie, 1980; r = .88, Izura et al., 2005). It is difficult to interpret these findings, as a number of studies have not reported split-half reliabilities at all (e.g., Dell’Acqua et al., 2000; Sirois et al., 2006; Snodgrass & Vanderwart, 1980). Overall, all three rating studies reached high interrater reliabilities, as all correlations were quite strong and reached statistical significance (all ps < .01).

The analysis of interstudy reliability revealed moderate to strong correlations between the data from this study and studies carried out in American English (Cortese & Khanna, 2008; Rosch, 1975; Snodgrass & Vanderwart, 1980; Uyeda & Mandler, 1980), British English (Morrison et al., 1997), Dutch (Ruts et al., 2004), French (Alario & Ferrand, 1999), German (Genzel et al., 1995; Schröder et al., 2004), Greek (Dimitropoulou et al., 2009), Icelandic (Pind et al., 2000), Italian (Dell’Acqua et al., 2000), Japanese (Nishimoto et al., 2005), Spanish (Izura et al., 2005; Manoiloff et al., 2010; Sanfeliu & Fernandez, 1996), and Russian (Tsaparina et al., 2011), providing further evidence for the reliability of the obtained data.

For semantic typicality, no cross-linguistic comparisons between different studies are reported in the literature. We conducted five cross-study correlations with the typicality ratings for overlapping, identical items used in our study and in the studies of Rosch (1975), Uyeda and Mandler (1980), Ruts et al. (2004), Izura et al. (2005), and Dell’Acqua et al. (2000) in order to validate the present database further. The results of these correlations showed moderate correlations of r = .55 (Uyeda & Mandler, 1980), r = .60 (Rosch, 1975), r = .63 (Izura et al., 2005), r = −.65 (Ruts et al., 2004), and r = −.74 (Dell’Acqua et al., 2000). Dell’Acqua et al. carried out their typicality rating on the basis of a set of line drawings, whereas the items used by Izura et al. were selected by their lexical availability (produced in a category-fluency task within 2 min; Izura et al., 2005, p. 387). The other three studies (Uyeda & Mandler, 1980; Rosch, 1975; Ruts et al., 2004) based their typicality ratings on items that had been generated in exemplar generation studies within 30 s. It is possible that the items used in those studies have a higher production frequency and higher semantic typicality than do the items used in our study (produced without time limit). To explore this issue further, we compared the set of n = 220 overlapping items in our study and the study by Rosch (see Footnote 7). Overall, we think that selected items from our study and the study by Rosch are quite comparable in terms of production frequency. In our study, we included every category exemplar, even if it was produced by only 1 of the participants (corresponding to a production frequency of 0.05%). Rosch (p. 197) stated that she included all items that had been produced by 10 (2.3% of the participants) or more subjects in the Battig and Montague (1969) study, as well as items that were produced “by fewer subjects in the Battig and Montague norms.” For example, for the category of furniture, she included the items magazine rack, closet, and fan, which were only listed by 3 (0.7%), 2 (0.45%), and 1 (0.23%) participant(s) in the Battig and Montague study. Thus, in both our and Rosch’s studies, some of the included items had relatively high (e.g., bed) and low (e.g., newspaper/magazine rack) production frequencies. In addition, the two sets did not differ in terms of their distributions of rated typicality (for our study, M = 2.42, SD = 1.09, range = 1.00–5.74, Mdn = 2.25; for Rosch, M = 2.36, SD = 1.01, range = 1.02–5.90, Mdn = 2.25), which makes it unlikely that the moderate correlation of r = .60 between our and Rosch’s data was due to a general difference in the mean typicality of the items included in the analysis.
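
The cross-study comparisons described above restrict each correlation to the items that two sets of norms have in common. The following sketch shows one way such a comparison could be set up. The dictionaries, item labels, and rating values are invented for illustration, and the code assumes that items have already been aligned across languages under a shared label (e.g., via translation equivalents), a step the sketch does not perform.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cross_study_r(norms_a, norms_b):
    """Correlate two {item: mean rating} mappings on their shared items only.

    Returns the coefficient together with the number of overlapping items,
    because the size of the overlap differs from one study pair to the next.
    """
    shared = sorted(set(norms_a) & set(norms_b))
    if len(shared) < 3:
        raise ValueError("too few overlapping items for a meaningful correlation")
    r = pearson_r([norms_a[i] for i in shared], [norms_b[i] for i in shared])
    return r, len(shared)


# Hypothetical mean typicality ratings for the category "furniture".
ours = {"bed": 1.1, "chair": 1.2, "lamp": 2.9, "fan": 5.0, "stool": 3.1}
other = {"bed": 1.3, "chair": 1.0, "lamp": 2.5, "fan": 5.5, "piano": 4.0}

r, n_shared = cross_study_r(ours, other)
print(f"r = {r:.2f} on n = {n_shared} overlapping items")
```

Reporting the size of the overlap alongside each coefficient makes it easier to judge how much weight a given cross-study correlation can bear.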

In sum, we cannot rule out the possibility that the differences in study design (i.e., the time limit) could contribute to the moderate correlations of the overlapping items. However, as the reported correlations are comparable for all studies, despite the use of different study designs, we think that the differences shown in the magnitudes of the correlation coefficients were also influenced by cultural and/or linguistic differences across the different studies. That is, some items that are more typical representatives of a given category in one culture may not be as representative in another culture (e.g., potato is rated as being a relatively typical representative of the category of vegetables in German, whereas papaya seems to be a relatively typical representative of the category of fruits for most of the American raters; see Footnote 8).

For age of acquisition and concept familiarity, comparisons on subsets of identical items across studies showed that the highest correlations were obtained between the present study and two other studies conducted in German (age of acquisition, r = .93; concept familiarity, r = .85). For all of the studies carried out in other languages, moderate to strong correlations were obtained (age of acquisition, all rs between .51 and .81; concept familiarity, all rs between .58 and .81; see Table 2). These results mirror those from other studies in which comparable analyses have been performed (e.g., Alario & Ferrand, 1999; Dell’Acqua et al., 2000; Nishimoto et al., 2005; Sanfeliu & Fernandez, 1996; Tsaparina et al., 2011). Whereas most of the ratings on concept familiarity reported in the literature were carried out with pictures as input stimuli, our study was carried out with words (note that it would have been very difficult or impossible to find nonambiguous pictures for some of the words of the exemplar generation study, especially for those that were not very typical in a given semantic category or that belonged to the categories professions or sports). Furthermore, the correlational analyses indicated that the differences in magnitudes of the correlations obtained in the present study are not attributable to differences in the input modalities. In fact, the highest cross-study correlations of concept familiarity were shown with two other studies carried out with pictures (r = .85, Genzel et al., 1995; r = .81, Tsaparina et al., 2011). At the same time, moderate correlations were found with overlapping items from two Spanish studies, one carried out with pictures (r = .59, Sanfeliu & Fernandez, 1996) and one with words (r = .58, Izura et al., 2005).
These results suggest that the input modality does not seem to influence the familiarity ratings and that the participants of our study were rating the object concept and not the specific word form of each exemplar given.

Overall, the pattern of significant cross-study correlations obtained in the present study indicates that the items in our database share most aspects of age of acquisition, semantic typicality, and concept familiarity across different cultures and languages. Nevertheless, objects that are common in one culture may not be as common in another culture (Sanfeliu & Fernandez, 1996), and words that are acquired early in one language may be acquired later in another language, especially when they differ in morphological complexity, word frequency, or word length. In sum, the analyses of intra- and interstudy correlations have provided further evidence for the reliability of our data. Importantly, as several differences across languages occurred, ratings of semantic typicality, age of acquisition, and concept familiarity should be carried out for each language separately.

The results of the intercorrelational analyses of the variables investigated in the present study showed that semantic typicality, age of acquisition, and concept familiarity, as well as word frequency, were moderately correlated with each other. It is assumed that this finding reflects a natural correlation of these variables (e.g., typical exemplars of a semantic category tend to be more familiar and more frequent in adult language and to be acquired earlier during childhood than atypical exemplars). By nature, age of acquisition and frequency are correlated, as highly frequent words are learned earlier in life and are more central (De Deyne & Storms, 2008; Morrison et al., 1997). Especially when age of acquisition is estimated retrospectively by adults, the correlation of estimated age of acquisition with word frequency and concept familiarity is high (e.g., Morrison et al., 1997). However, frequency and age of acquisition are not interchangeable with each other, as age of acquisition still significantly accounts for variance in performance, even when other variables such as frequency are controlled for (for recent data, see, e.g., Brysbaert & Cortese, 2010). Note that the influence of age of acquisition (and its relation to other variables, such as imageability) depends highly on the task applied (e.g., written naming vs. lexical decision; Cortese & Khanna, 2007) or the “transparency” or “regularity” of the input and output variables, as suggested by computational models (Zevin & Seidenberg, 2002). In sum, our findings provide further evidence that the degree of natural correlations needs to be taken into account when interpreting any effects seen in experimental investigations of language processing, and further research needs to investigate the independent contributions of each of these variables on different aspects of language processing in various experimental tasks.

Whereas the present study was conducted with a group of participants of a relatively wide age range (20–70 years), most of the norming studies mentioned above were conducted with young participants, mainly college students. However, Hodgson and Ellis (1998) suggested that the age of acquisition and familiarity of certain objects can be different for participants of different age groups, because they may have encountered the objects at different stages in their lives. In line with this suggestion, Sirois et al. (2006) found age-related differences in familiarity ratings, with older participants (60–85 years) judging 388 pictured objects as being more familiar than did young (18–39 years) or middle-aged (40–59 years) participants. In the same study, young participants estimated that they had learned the words corresponding to those pictures earlier than did the middle-aged and older participants. De Deyne and Storms (2007) suggested that differences in age-of-acquisition ratings may occur especially for words introduced in recent decades (e.g., exotic fruits, such as mango). In line with this, the age-of-acquisition ratings of elderly raters (61–85 years) for some specific words (e.g., robot, television, lime) differed substantially from those of younger participants in a study by Cuetos, Samartino, and Ellis (2011). To investigate whether there was a tendency for any age effects in our data, we conducted several separate analyses for the groups of younger (20–40 years) and elderly (41–70 years) participants (n = 10 participants each). The results of the correlational analyses showed that there was a high overlap between the ratings of both age groups (r = .82 for the typicality rating, r = .85 for the age-of-acquisition rating, and r = .77 for the familiarity rating).
In an analysis of variance, we found no main effects of age for either the typicality [F(1, 18) = 0.00, p = .986, η² = .00] or the age-of-acquisition [F(1, 58) = 2.75, p = .103, η² = .02] rating. However, for the familiarity rating, a significant effect of age [F(1, 58) = 5.12, p = .027, η² = .04] was found (see Footnote 9). Given our relatively small sample size, we are cautious in interpreting these results. Still, these data indicate that further studies should explore possible age differences to avoid over- or underestimating any effects shown.
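
To make the reported statistics easier to read, the following sketch computes an F ratio and η² for a one-way between-groups design. It is an illustration only: the exact ANOVA design used for each rating is not spelled out in this section, the per-participant values are invented, and η² is taken here as SS_between / SS_total, which coincides with partial η² only in a design with a single factor. No p value is computed, because that would require the F distribution from a statistics library.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a one-way design.

    `groups` is a list of lists, one list of scores per group.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variability of individual scores around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_ratio, df_between, df_within, eta_squared


# Invented mean familiarity ratings per participant (hypothetical data),
# ten participants per age group, which yields df = (1, 18).
younger = [2.1, 2.4, 2.0, 2.6, 2.3, 2.2, 2.5, 2.4, 2.1, 2.4]
older = [2.6, 2.9, 2.4, 2.8, 2.7, 2.5, 3.0, 2.6, 2.7, 2.8]

f_ratio, df_b, df_w, eta_sq = one_way_anova([younger, older])
print(f"F({df_b}, {df_w}) = {f_ratio:.2f}, eta squared = {eta_sq:.2f}")
```

With two groups, η² expresses the share of the total variance in the ratings that is associated with age group, which is why a significant F can still correspond to a small effect.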

The findings of the present study add to recent research showing that ratings of age of acquisition, semantic typicality, and concept familiarity may differ for items from different semantic categories (Izura et al., 2005; Snodgrass & Vanderwart, 1980; see also Footnote 10). As in the study by Izura et al., items in our database in the category of animals were estimated as being highly typical and being acquired earlier than items in most of the other categories. Similarly, we were able to replicate some findings of other studies, with items in the category of furniture being rated as highly familiar (Izura et al., 2005; Snodgrass & Vanderwart, 1980) and items in the categories of birds and musical instruments being rated as relatively low in concept familiarity (Snodgrass & Vanderwart, 1980). It is possible that this result reflects only random differences in the distributions of category exemplars in our database, with some categories being more widely dispersed (note that all items generated in the exemplar generation study were included in the database, even if they were generated only by 1 or 2 participants). However, these replicated findings suggest that differences in semantic typicality, age of acquisition, and concept familiarity may be inherent to certain semantic categories. It is therefore important to control for age of acquisition, semantic typicality, and concept familiarity when designing experiments for research on semantic processing. Future studies should explore these effects in greater detail and in various populations, such as different age groups or clinical populations.

Conclusion

The present study provides the first substantial German database of 824 nouns from 11 semantic categories, with norms for semantic typicality, age of acquisition, concept familiarity, word frequency, and word length that can be used by researchers from different scientific fields. Overall, high inter- and intrastudy reliabilities were shown. The results revealed that items in different semantic categories might vary with regard to semantic typicality, age of acquisition, and concept familiarity, indicating that it is important to control for these variables when designing experiments for psycho- or neurolinguistic research. In sum, the present database increases the pool of available German norm data and will serve as an important tool for selecting stimuli in research on healthy lexical–semantic processing and in the assessment and rehabilitation of patients suffering from lexical–semantic impairments.