Introduction

Imagine you set out to test or manipulate the creative ability of your participants using the Remote Associates Test (RAT; Mednick & Mednick, 1971). This test consists of giving participants three words—like water, mine, and shaker—and asking them which word might be related to all three (Footnote 1). You would first have to find and select a set of RAT queries, and then control as many variables about them as you can.

Various creativity tests are available: the Alternative Uses Test (Guilford, 1967), the Remote Associates Test (Mednick & Mednick, 1971), the Torrance Tests of Creative Thinking (Kim, 2006), the Wallach–Kogan tests (Wallach & Kogan, 1965), riddles (used by Whitt & Prentice, 1977; Qiu et al., 2008), and empirical insight tests from Duncker (1945), Maier (1931), Saugstad and Raaheim (1957), and others. However, some of these tests are not easily available (the TTCT), some are non-standardized or provide no norms, and others provide only small sets of stimuli. Many creativity tests would therefore benefit from modernization: being normed, having more factors controlled for, and having ampler sets of stimuli developed, in order to provide more varied testing conditions.

The Remote Associates Test (Mednick & Mednick, 1971) has been used to measure creativity and has been adapted to various other languages, for example Japanese (Baba, 1982) and Dutch (Chermahini, Hickendorff, & Hommel, 2012). It was rated the second most used creativity test in a meta-analysis surveying 45 neuroimaging studies (Arden, Chavez, Grazioplene, & Jung, 2010). The RAT is assumed to measure creative convergent thinking, unlike creativity tests better suited to measuring divergent thinking, such as the Alternative Uses Test. The RAT is also very useful for measuring insight effects, as performance on the RAT has been shown to correlate with performance on insight problems (Schooler & Melcher, 1995).

Various types of investigations in and beyond creative cognition use the Remote Associates Test. Among others, these include studies of the effects of incubation with Go players (Sio & Rudowicz, 2007) and baseball players (Wiley, 1998), the relation between REM sleep and creativity (Cai, Mednick, Harrison, Kanady, & Mednick, 2009), synesthesia and creativity (Sitton & Pierce, 2004; Ward, Thompson-Lake, Ely, & Kaminski, 2008), the role of affect in problem solving (Fodor, 1999), memory (Storm, Angello, & Bjork, 2011), and peripheral attention (Ansburg & Hill, 2003). Because of this wide use, the scientific community would benefit from an ampler set of standardized stimuli for the RAT.

Worthen and Clark (1971) made the case that different categories of stimuli can be distinguished within the original stimuli of Mednick and Mednick (1971); specifically, they differentiated between functional items and structural items (the latter sometimes appearing in the literature under the name of compound items).

The CreaCogs framework (Olteţeanu, 2014, 2016) for creative problem solving uses knowledge organization to support a unified set of core creative problem-solving processes, such as association, associative convergence, re-representation, restructuring, search, and substitution. The framework and its processes are validated against human performance (Olteţeanu, Falomir, & Freksa, in press) by implementing systems that show creative problem-solving abilities and can solve creativity tests designed for humans. Among such systems, the comRAT-C system (Olteţeanu & Falomir, 2015) has explicitly addressed the computational solving of compound RAT queries. When solving the RAT, comRAT-C calculated the probability of finding a solution based on the frequency of query and answer words, as will be shown in “Generating new remote associates test items with comRAT-G”. A highly significant correlation, ranging between 0.3 and 0.52 for different solving times, has been observed between the results of comRAT-C and the difficulty of RAT queries for humans, as expressed in the percentage of solvers and the response times in the human normative data (Bowden & Jung-Beeman, 2003). This correlation showed that the frequency of query items plays an important role in how difficult a query is for human participants, and a standardized set of RAT stimuli should take this into account. Furthermore, Olteţeanu and Schultheis (in press) manipulated frequency and probability independently, keeping each at low and high levels, and showed that both frequency and probability influence accuracy and response times when solving the RAT.

For the Remote Associates Test, normative data from human participants exists in Bowden and Jung-Beeman (2003), which provides mean time-to-solution and the percentage of participants solving each of 144 compound RAT problems, under four different time limits. Though very useful, this work does not provide queries standardized by the frequency of occurrence of the query or answer words, the importance of which has been shown by Olteţeanu and Falomir (2015). This paper aims to enrich the existing pool of compound Remote Associates Test items and to provide a standardized treatment that allows control over frequency of occurrence and the probability of finding an answer. Seventeen million new compound RAT items are constructed, using the entire space of frequent noun expressions in American English, thus providing the largest standardized treatment of the compound RAT to date. These items are computationally generated by adapting the previously implemented computational solver of the Remote Associates Test, comRAT-C (Olteţeanu & Falomir, 2015), into a generative variant, comRAT-G. The frequency of items from the COCA corpus (Footnote 2) and comRAT-C’s probability of finding an answer are indexed in the provided repository and can be used to generate sets of controlled queries. Subsets of queries in which one query word or the answer is kept constant can also be extracted.

The rest of this paper is structured as follows. A brief overview of the comRAT solver and its transformation into comRAT-G is provided in “From comRAT to comRAT-G”. The methodology of generating new Remote Associates Test items with comRAT-G is explained, together with examples, in “Generating new remote associates test items with comRAT-G”. The evaluation of 100 queries with human participants is presented in “Evaluation”. The type of data generated in the repository is described in “The Repository”, and various possible uses are showcased. Finally, future work is discussed. The 100 items used for evaluation are presented in the Appendix.

From comRAT to comRAT-G

The comRAT-C system (Olteţeanu & Falomir, 2015) was built to solve the compound variant of the Remote Associates Test, in the tradition of Psychometric AI (Bringsjord, 2011) and as an exploration of the processes of the CreaCogs creative problem-solving framework (Olteţeanu, 2014; Olteţeanu & Falomir, 2016).

As data, comRAT-C takes the most frequent 2-grams from the COCA corpus. Knowledge organization plays an important role in CreaCogs and in comRAT’s knowledge base in the following way: whenever a 2-gram is given to comRAT-C, it is stored as an object of the Expression class, which is constructed from two objects of the Concept class and a Link between them. If one of the Concepts is already present, only the other Concept and the new Link are added; if both Concepts are present, only the Link is added. For example, if the 2-gram “cake flour” is read, comRAT-C will check whether it knows the Concepts “cake” and “flour”. If it does not know one of them, that item will be added as a Concept (Footnote 3); if it knows neither, both items will be added. comRAT-C then adds a Link between those Concepts, with a numeric tag attached—the frequency of the 2-gram as taken from the corpus.
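To make this concrete, a minimal Python sketch of such a knowledge base (our illustration, with plain dictionaries standing in for the Concept, Expression, and Link classes) could look as follows:

```python
from collections import defaultdict

class KnowledgeBase:
    """Illustrative sketch of comRAT's knowledge organization: Concepts are
    created implicitly, and each Link carries the 2-gram frequency as a tag."""
    def __init__(self):
        # links[concept] maps each linked Concept to the 2-gram frequency
        self.links = defaultdict(dict)

    def add_expression(self, w1, w2, freq):
        # Adds any missing Concepts and a frequency-tagged Link between them
        self.links[w1][w2] = freq
        self.links[w2][w1] = freq

kb = KnowledgeBase()
kb.add_expression("cake", "flour", 1200)      # illustrative frequencies,
kb.add_expression("cottage", "cheese", 3400)  # not actual COCA counts
```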

Over time, each Concept ends up with a set of Links to all the other Concepts it has appeared in an Expression with, as shown in Fig. 1. This associative structure constitutes the knowledge organization of comRAT-C. When three words are given to comRAT-C, as they would be in the context of a RAT query, each of these words activates the set of Concepts it is linked to. An overlapping activation starting from two or three of the initial words can sometimes be observed. In Fig. 1, the initial words, depicted in green, are cottage, Swiss, and cake. A two-item convergence of activation is observed for the word chocolate, and a three-item convergence for the word cheese. The three-item convergences are possible RAT query responses. Multiple two- and three-item convergences are of course possible for the same query.
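Continuing the illustrative sketch above, convergence of activation can be expressed as set intersections over each query word's linked Concepts (again an illustration, not the system's actual code):

```python
def convergences(kb, q1, q2, q3):
    """Activate the linked Concepts of each query word and intersect them."""
    s1, s2, s3 = (set(kb.links[q]) for q in (q1, q2, q3))
    three_way = s1 & s2 & s3                                   # candidate RAT answers
    two_way = ((s1 & s2) | (s1 & s3) | (s2 & s3)) - three_way  # partial convergences
    return three_way, two_way

# e.g., answers, partial = convergences(kb, "cottage", "swiss", "cake")
```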

Fig. 1 A graphical depiction of the link structure obtained in comRAT-C. Only a few links are shown for visual readability

The comRAT-C system performs well in answering compound RAT queries even without weighting the Links between Concepts by 2-gram frequency; however, adding these frequency weights improves comRAT-C’s performance and helps break ties between multiple possible three-item convergences.

An analysis of comRAT-C’s probability of finding an answer based on the 2-gram frequency data revealed a correlation with human normative performance data. This correlation suggests that the frequency of 2-grams, on which the probability of finding the answer is based, might influence the process of solving compound RAT queries. In order to keep compound queries controlled for the frequency variable, to check for other influences (like order), and to understand these influences in more detail, frequency-based probability data needs to be gathered on a large set of queries. A large enough set of queries can also be used to keep part of the query words or the answer word constant (different queries, same answer). In order to gather such data and construct a large set of queries, we proposed reverse-engineering our computational approach so as to generate new compound RAT items.

Thus, instead of using the organization structure of comRAT to provide answers, it is used to provide queries. From this vantage point, each word $w_{ans}$ that has at least three links, say to words $w_a$, $w_b$, and $w_c$, is a potential answer to a RAT query.

Generating new remote associates test items with comRAT-G

The process of generating new Remote Associates Test items unfolds as follows. All the high-frequency noun–noun 2-grams in COCA are organized into Concepts and Links in the knowledge base of comRAT-G. The selection of noun–noun 2-grams is done using the UCREL CLAWS7 tagset (Footnote 4); tags from this tagset are provided with the 2-grams dataset. The current version of comRAT-G uses nouns alone, unlike comRAT-C, which used more parts of speech, as described in Olteţeanu and Falomir (2015). The set of the 2 million most frequent 2-grams is thus reduced to a set of 43,908 expressions.
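To illustrate this filtering step, the following sketch assumes each 2-gram row carries its two CLAWS7 tags; the simple NN-prefix test for common nouns (e.g., NN1 singular, NN2 plural) is an assumption of the sketch, not the system's exact selection rule:

```python
def is_noun(tag):
    # CLAWS7 common-noun tags share the NN prefix; treating any NN* tag
    # as a noun is a simplification made for this illustration
    return tag.startswith("NN")

def noun_noun_2grams(rows):
    """rows: iterable of (word1, tag1, word2, tag2, frequency) tuples."""
    return [(w1, w2, fr) for (w1, t1, w2, t2, fr) in rows
            if is_noun(t1) and is_noun(t2)]
```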

First, comRAT-G iterates through the words and provides preliminary results, which we shall henceforth call type 1 results; a sample is shown in Table 1. Type 1 results consider each word as a potential answer word. Thus, in Table 1, $w_{ans}$ stands for the answer word and $w_q$ for a potential query word which can be used to arrive at the answer word. Terms $w_q$ can be further integrated in a RAT query in positions $w_a$, $w_b$, or $w_c$. The third column gives the frequency of association between the query word $w_q$ and the answer word $w_{ans}$. The fourth column gives the frequency of association between the query word and any word. The fifth column gives the probability of answer $w_{ans}$ given the specific query word $w_q$ (Footnote 5), calculated as shown in Eq. 1. Query words $w_q$ are only generated for those $w_{ans}$ which have at least three $w_q$. Applying this process yields a total of approximately 81,500 unique ($w_{ans}$, $w_q$) combinations, based on 9,601 unique answer words.

$$ P(w_{ans} \mid w_{q}) = \frac{fr(w_{q}, w_{ans})}{\sum\limits_{k=1}^{n} fr(w_{q}, w_{k})} $$
(1)
Table 1 An example of preliminary results, focusing on words as answers to possible queries
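In code, Eq. 1 amounts to normalizing one Link frequency by the query word's total Link frequency. The following sketch reuses the illustrative KnowledgeBase from above:

```python
def p_answer_given_query(kb, w_q, w_ans):
    """Eq. 1: fr(w_q, w_ans) normalized by the sum of fr(w_q, w_k)
    over all words w_k linked to the query word w_q."""
    total = sum(kb.links[w_q].values())
    return kb.links[w_q].get(w_ans, 0) / total if total else 0.0
```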

After the type 1 results have been produced and stored, the new RAT queries are generated using a combinatoric approach. For each answer word, the set of query words is retrieved and all three-word combinations are generated. This applies the well-known combinatorics formula shown in Eq. 2, with n being the number of query items connected to a specific answer word and k being 3.

$$ \binom{n}{k} = \frac{n!}{k!(n-k)!} $$
(2)

For each possible answer with n < 100 (and, of course, n > 2), all unique combination triples are produced. We capped n at 100 because of computational costs (\({100 \choose 3}\) is already 161,700 possible combinations over the terms connected to a single answer word) and diminishing returns: an answer word $w_{ans}$ connected to over 100 items might be a very common word, or might form much weaker bonds with each of its words, so RAT items constructed from its terms might not be very interesting or intuitive to solve (lower associative power of triggering the result).

In order to construct all such combinations in a computationally feasible manner, comRAT-G uses Alan Tucker’s combinatorics algorithm (Tucker, 2006). For 9,601 answer words, and with n capped at 100 (which translates into using only about 9,200 answers), we obtain about 17 million possible RAT triples. The probability of answering a query is calculated from the conditional probability of the answer being triggered by each of the three query items. This probability currently assumes an equal weighting of the three items, as shown in Eq. 3 (as in comRAT-C; Olteţeanu & Falomir, 2015). However, different weighting schemes can be considered for modeling purposes, which is why we also provide the conditional probability of each item, as shown in the Appendix. The various types of data items captured by this ample list of possible RAT queries, and the roles in which such data can be used in empirical research, are presented in “The Repository”.

$$ P(w_{ans}) = \frac{ P(w_{ans} \mid w_{a})+P(w_{ans} \mid w_{b})+P(w_{ans} \mid w_{c})}{3} $$
(3)
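The generation step can be sketched as follows, with Python's itertools.combinations standing in as an illustrative substitute for the combinatorics algorithm cited above, and the p_answer_given_query sketch implementing Eq. 1:

```python
from itertools import combinations

def generate_queries(kb, n_cap=100):
    """Yield (w_a, w_b, w_c, w_ans, P(w_ans)) for every answer word with
    at least 3 and fewer than n_cap linked query words (Eqs. 2 and 3)."""
    for w_ans, neighbors in list(kb.links.items()):
        if 3 <= len(neighbors) < n_cap:
            for w_a, w_b, w_c in combinations(sorted(neighbors), 3):
                # Eq. 3: unweighted mean of the three conditional probabilities
                p = (p_answer_given_query(kb, w_a, w_ans)
                     + p_answer_given_query(kb, w_b, w_ans)
                     + p_answer_given_query(kb, w_c, w_ans)) / 3
                yield w_a, w_b, w_c, w_ans, p
```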

Evaluation

In order to check whether the queries created by comRAT-G are suitable, valid, and reliable RAT queries, which can be solved by human participants and are coherent with existing RAT datasets, we set up a study in which human performance on comRAT-G queries was compared to the normative data of Bowden and Jung-Beeman (2003).

Method

Two sets of query items, one comprising randomly selected comRAT-G queries and the other comprising randomly selected queries from the normative data of Bowden and Jung-Beeman (2003), were presented in mixed random order to native speakers in an online study. Accuracy and response times for solving the items were recorded. The purpose of the study was to check (i) whether correlations between performance indicators hold between the comRAT-G and Bowden & Jung-Beeman items, which would show the validity of the comRAT-G items, and (ii) whether the comRAT-G queries are a reliable tool, as measured by Cronbach’s alpha, compared to the Bowden & Jung-Beeman items.

Participants

A total of 113 native English speakers, 72 female and 41 male, were recruited at the University of Pittsburgh and on Crowdflower, and volunteered to take part in our study, which was set up online. Participants covered a wide range of ages, education levels, and self-rated creativity levels, as shown in Table 2.

Table 2 Descriptive data on the age, education, and self-rated creativity level of the participants, n = 113

Materials

Fifty compound RAT queries were randomly selected from the items produced by comRAT-G. Another 50 queries were randomly selected from the query set of Bowden & Jung-Beeman. The comRAT-G queries can be found in the Appendix. From Bowden and Jung-Beeman (2003) we used queries 5, 6, 11, 15, 17, 20, 21, 22, 24, 26, 28, 29, 30, 37, 38, 40, 45, 46, 50, 51, 53, 58, 62, 65, 68, 71, 72, 74, 76, 79, 82, 84, 87, 90, 95, 96, 99, 106, 110, 111, 114, 116, 122, 124, 130, 131, 133, 136, 139 and 144.

Procedure

The task was explained with two query examples. Then, five training queries were presented. These queries were taken from the Bowden & Jung-Beeman set and did not overlap with our random selection of 50 items. After the participants attempted to solve the training queries, feedback including the correct answer was presented. Then, the 100 queries (50 from Bowden & Jung-Beeman, 50 from comRAT-G) were presented in random order.

Results

The dependent variables were (i) accuracy, measured as the number of correct responses for each participant in comRAT-G and Bowden & Jung-Beeman queries, and (ii) response times, measured as the number of seconds each participant spent on answering each comRAT-G and Bowden & Jung-Beeman query.

As Table 3 shows, the mean accuracy was 26.20 (SD = 7.03) problems correctly solved (52.4%) for comRAT-G items and 26.41 (SD = 11.24) problems correctly solved (52.82%) for Bowden & Jung-Beeman items. The mean response time for correct solutions (n = 112; Footnote 6) was 14.52 s (SD = 9.89) for comRAT-G items and 16.56 s (SD = 12.84) for Bowden & Jung-Beeman items, as shown in Table 4.

Table 3 Descriptive statistics on accuracy in number of queries solved, n = 113
Table 4 Descriptive statistics on response times (RT) in seconds for queries solved, n = 112

As shown in Table 5, the mean number of participants solving each comRAT-G query was 59.28, and the mean number solving each Bowden & Jung-Beeman query was 59.92. The mean time spent per comRAT-G query, whether it was solved or not, was 21.9 s, while the mean time spent per Bowden & Jung-Beeman query was 23.12 s.

Table 5 Descriptive statistics on number of participants solving per query (of 113), and mean time spent per query (whether a correct answer was given or not)

Accuracy and response times per query for the comRAT-G dataset are shown in the Appendix.

Accuracy showed a significant, moderate correlation between the comRAT-G and Bowden & Jung-Beeman datasets of r = 0.54, p < 0.0001 (Fig. 2a). Response times showed a highly significant, large correlation between the two datasets of r = 0.75, p < 0.0001 (Fig. 2b). Note that response times were calculated only for correct answers.

Fig. 2 Correlations on (a) accuracy and (b) response times
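For reference, the per-participant validity check can be reproduced along the following lines; the sketch uses placeholder scores, and only the Pearson statistic (SciPy's pearsonr) reflects the actual analysis:

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder scores standing in for the real data: each participant's
# number of correct answers on the 50 comRAT-G and 50 B&JB items
rng = np.random.default_rng(0)
acc_comrat_g = rng.integers(0, 51, size=113)
acc_bjb = rng.integers(0, 51, size=113)

r, p = pearsonr(acc_comrat_g, acc_bjb)  # Pearson correlation and p-value
print(f"r = {r:.2f}, p = {p:.4f}")
```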

We then measured the scale reliability of the two datasets (comRAT-G items and Bowden & Jung-Beeman items) using Cronbach’s alpha as an internal consistency measure. As Table 6 shows, Cronbach’s alpha on accuracy was 0.851 for comRAT-G items, 0.932 for Bowden & Jung-Beeman items, and 0.936 for both sets of queries taken together. Cronbach’s alpha on response times (on both correct and incorrect answers) was 0.991 for comRAT-G items, 0.99 for Bowden & Jung-Beeman items, and 0.995 for both sets of queries taken together.

Table 6 Cronbach’s alpha internal consistency measures
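Cronbach's alpha can be computed with the standard formula as follows; this is a generic sketch, not our exact analysis script:

```python
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array of shape (participants, items),
    e.g., 113 x 50 binary accuracy scores for one query set."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items (queries)
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)
```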

As a final point, we checked whether the accuracy and response time data we obtained on the Bowden & Jung-Beeman dataset with our participants correlated with the data obtained by Bowden and Jung-Beeman. As shown in Table 7, all accuracy measures and all but one of the response time measures correlated significantly.

Table 7 Correlation of performance on the Bowden & Jung-Beeman queries, between our participants and Bowden & Jung-Beeman’s participants

Discussion

The descriptive data are similar between the comRAT-G and Bowden & Jung-Beeman query sets, on both mean accuracy and mean response times. The moderate and large correlations obtained between participants' performance on the comRAT-G and Bowden & Jung-Beeman item sets, on both accuracy and response times, support the validity of the comRAT-G dataset, indicating that comRAT-G items measure the same skill as the Bowden & Jung-Beeman items. The high Cronbach’s alpha internal consistency scores, which remain the same or increase when the two item sets are put together, show that both sets are highly reliable and consistent with each other. The comRAT-G data are thus in all crucial respects similar to the established query set.

The Repository

In the following, the generated RAT queries and the data accompanying them are explained in terms of the data items (columns) provided, the ways items can be searched and ordered, and some examples of possible empirical research using these data. Table 8 shows a sample of the generated query data and its form.

Table 8 An example of generated queries, organized by $w_{ans}$, which here is ability

The generated compound RAT queries can thus be ordered in the following ways:

  (1) Alphabetically by the first, second, and third word ($w_a$, $w_b$, and $w_c$). The ability to search alphabetically ordered RAT queries allows empirical research that keeps the first letter or an entire query word (or more than one word) constant. This can be used in various forms, in the extreme allowing the entire query to be kept constant while checking for different possible answers.

  (2) By the answer ($w_{ans}$). This allows comparisons of query difficulty in which the query terms differ while the answer is kept constant (Footnote 7). Thus, for the queries (a) health, child, and center and (b) insurance, hair, and child, the answer is the same, care, and so is one of the given terms, child. However, the likelihood of reaching this answer is not the same. Keeping the answer the same can help assess the influence of the different terms and their frequencies on performance.

  (3) By the frequency of the favorable cases ($fr(w_a, w_{ans})$, $fr(w_b, w_{ans})$, $fr(w_c, w_{ans})$).

  (4) By the (sum) frequency of the given words ($fr(w_a)$, $fr(w_b)$, $fr(w_c)$). This allows frequency-based influences to be studied separately from probability. It should thus be possible to explore empirically whether keeping the frequency constant across words (query or answer) among different queries has an impact on answer performance, and what the functional relationship between frequency and answer performance is.

  (5) By the probability of the answer being found. Multiple queries with similar probabilities can thus be analyzed, and queries from different probability classes (low, medium, and high probability) can be compared with respect to their influence on human performance.

The frequency of each query word with the answer, and the frequency of the query words occurring in other combinations, are provided separately, since the probability of finding the answer has been calculated here taking the influence of the three words to be equal. It might be the case that the first two words have a higher influence (see Olteţeanu, 2014), and providing the frequency explicitly for each of the query words allows the study of such order effects.
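As a usage example, controlled subsets such as those listed above can be extracted with a few lines of pandas; the file name and column names below are illustrative, not the repository's actual headers:

```python
import pandas as pd

# Hypothetical export of the repository (columns: w_a, w_b, w_c, w_ans,
# fr_a, fr_b, fr_c, p_ans); adjust the names to the actual data files
queries = pd.read_csv("comRAT-G_queries.csv")

# (2) Same answer, different query terms: all queries answered by "care",
# sorted by the probability of the answer being found
same_answer = queries[queries["w_ans"] == "care"].sort_values("p_ans")

# (5) A difficulty-controlled band: only low-probability queries
low_p = queries[queries["p_ans"] < 0.05]
```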

An interface permitting access to the queries constructed by comRAT-G can be found here: http://creacogcomp.com/comRAT-G.html.

Conclusions and future work

An ample set of about 17 million queries was generated using a generative variant of a computational RAT solver, comRAT-G. This set of queries aims to fill a gap by providing normative frequency-based data and an ampler set of stimuli for cognitive and computational creativity research. Frequency and frequency-based probability of finding the answer have been computed for all the generated queries and are provided with the data. The contributed repository allows further control over variables when testing for the influence of frequency, constant words, and word order in Remote Associates queries.

The entire list of queries or a subset thereof can be obtained by contacting the authors. As future work, we will aim to make the following contributions:

  (i) Improve the online interface with more search and selection features, for easy access.

  (ii) Generate an updated version of this repository by also parsing compound nouns from the corpus automatically, and offer queries based on compound nouns as another controllable variable. The motivation for this is that ($w_{ans}$, $w_q$) items parsed from compound nouns might be associated more tightly than items which have 2-grams as their point of origin.

  (iii) Generate a version of the repository which includes queries made of other parts of speech than nouns alone.

  (iv) Offer modelers the ability to collapse plural and singular forms.

  (v) Add free association norms data to the query–answer pairs, where available.

  (vi) Enable control of query and answer word length.

  (vii) Enable control over the semantic domain of words.

  (viii) Rate a part of these queries for interestingness and hardness, in order to further refine the generating algorithm.