The feeling of sudden clarity and understanding, often accompanied by a sub-vocal or exuberantly shouted “aha,” is known to many as insight in problem-solving contexts. This feeling of insight (also known as an aha experience) has been shown both to improve motivation in problem solving (Liljedahl, 2005) and to facilitate recall (Danek, Fraps, von Müller, Grothe, & Öllinger, 2013; Kizilirmak, Gomes da Silva, Imamoglu, & Richardson-Klavehn, 2016). Despite these benefits, finding methods that reliably test insight is a recognised challenge (Bowden, Jung-Beeman, Fleck, & Kounios, 2005). For instance, although the investigation of insight has had a long history (Duncker, 1945; see, e.g., Gilhooly & Murphy, 2005; Jung-Beeman et al., 2004; Knoblich, Ohlsson, Haider, & Rhenius, 1999; Köhler, 1921; Maier, 1931; Metcalfe, 1986a; see also Sternberg & Davidson, 1995), only a handful of studies have investigated which specific problems reliably elicit the feeling of insight (Davidson, 1995; Metcalfe, 1986b; Metcalfe & Wiebe, 1987). Furthermore, the studies that have investigated insight problem reliability were predominantly concerned with the subjective experiences (e.g., the feeling of warmth approaching a solution) leading to solution in insight and noninsight problems. The goal of the present article is to provide a detailed investigation of the strength and reliability of a range of problems used to elicit the cognitive processes and affective components of insight in the individual solving the problem.

Recent research in cognitive neuroscience has demonstrated that problems currently used as insight problems (i.e., compound remote associates, remote associate problems, and anagrams) can elicit both insight and noninsight responses (Aziz-Zadeh, Kaplan, & Iacoboni, 2009; Bowden & Beeman, 1998; Kounios et al., 2008; see Kounios & Beeman, 2014, for a review; see also Luo & Knoblich, 2007). However, these studies treat insight as a categorical response (i.e., participants indicate whether the solution occurred to them through insight or noninsight); consequently, this method does not reveal the strength of the aha experience elicited. A more recent investigation (Danek, Wiley, & Öllinger, 2016) examined the strength of insight elicited by insight problems, but only for three classic problems, making the results difficult to generalize. Here we test a wide array of different problems using a continuous measure of insight strength.

Although insight is a subjective experience, there are a number of good reasons to study it. Aside from evidence indicating that the experience of insight is common (Jarman, 2014; Ovington, Saliba, Moran, Goldring, & MacDonald, 2015) and thus of significant general interest, insight has been associated with new and innovative thinking (Feynman, 1999; Poincaré, 1913; Schultz, 1890), facilitated recall (Danek et al., 2013; Kizilirmak, Thuerich, Folta-Schoofs, Schott, & Richardson-Klavehn, 2016), improved learning (Dominowski & Buyer, 2000; Kizilirmak, Gomes da Silva, et al., 2016), and increased motivation (Liljedahl, 2004, 2005). For example, Kizilirmak, Thuerich, et al. (2016) presented participants with a series of compound remote associates (a type of insight problem that has been increasingly used, particularly in the cognitive neuroscience literature). Aha experiences during encoding predicted a significantly higher proportion of solutions being both recalled and recognised during subsequent testing, presumably because of the deeper encoding afforded by the sudden realisation of the relation between the word problems. To take a second example, Liljedahl (2005) evaluated the impact of the aha experience on motivation for learning mathematics; students who had had an aha experience became less anxious about mathematics and more willing to persist with a problem-solving process until they reached the solution. The investigation of insight thus probes the processes underlying these effects, an understanding of which may aid creative problem solving, motivation in learning, and memory.

Defining insight

Definitions of insight can be approached in three ways: (1) the process-based approach, which is concerned with the cognitive processes involved in problem solving; (2) the task-based approach, which is concerned with identifying problems that are capable of eliciting insight, with much of this approach being used to determine insight problems that elicit insight processes (e.g., Davidson, 1995; Metcalfe & Wiebe, 1987; Weisberg, 1995b); and (3) the phenomenological approach, which focuses more on the feeling of insight (Chronicle, MacGregor, & Ormerod, 2004). Both task and process approaches to insight require an understanding of the problem space associated with each problem; that is, the mapping of all possible steps from an unresolved question or issue to the solution. When the steps from one point in the problem space to the next are clear, problem solving is able to progress in steady, incremental steps. However, when the steps toward solution are not clear, problem solving becomes discontinuous (Weisberg, 1995b); that is, there is a need to wait until further thought about the problem reveals or clarifies the solution process, or until a mental restructuring occurs (Ohlsson, 1984; Sandkühler & Bhattacharya, 2011). The term restructuring implies that the way an individual perceives or conceives of a problem, and possibly the solution pathway, is fundamentally changed (Weisberg, 1995b). It is this sudden restructuring that is presumed to elicit the phenomenological component of insight (Cushen & Wiley, 2012; Fleck & Weisberg, 2004). In contrast, from a problem-space perspective, a noninsight problem is one that does not require restructuring, because all problem-solving steps are known from the outset, or at least follow logically from the first step.

Cognitive restructuring is a fundamental aspect of contemporary research on insight (e.g., Ash & Wiley, 2006; Cushen & Wiley, 2012; Sandkühler & Bhattacharya, 2011; Weisberg, 1995a), which focuses on (1) the psychological response leading to and resulting from restructuring of a problem space (Ash & Wiley, 2006); (2) the use of heuristics (Chronicle et al., 2004; Öllinger, Jones, Faber, & Knoblich, 2012); and (3) progress monitoring (in which a problem solver attempts to minimize the gap between the current state of the problem and the goal state; see, e.g., Jones, 2003; MacGregor, Ormerod, & Chronicle, 2001). In process-based approaches, the solution of an insight problem is often presumed to indicate insight, which in turn depends upon the definition of the problem itself.

Task-oriented approaches to defining insight are similarly concerned with designing or identifying those problems that require restructuring for their solutions (i.e., insightful processing). This is often achieved by creating a problem with an initially uncertain or unusual path from problem to solution (i.e., an ill-defined problem space), perhaps by encouraging a faulty initial representation of the problem, through the overrepresentation of problem constraints (i.e., subjects are encouraged to believe that the problem includes constraints that are not there), infrequent word use, uncommon object use, or suggestive instruction. Insight tasks (insight problems) are then compared to tasks that require incremental solutions (see the supplementary materials for a selection of insight and noninsight problems).

Finally, a phenomenological approach to defining insight focuses on the experience of insight, including the emotional components of that experience (Danek, Fraps, von Müller, Grothe, & Öllinger, 2014a; Shen, Yuan, Liu, & Luo, 2016), and what might elicit or predict those feelings (Topolinski & Reber, 2010a). This area of research has grown rapidly in the last decade, with a number of researchers noting the somewhat circular reasoning of terming insight problems “problems that require insight” and inferring that “insight occurs when insight problems are solved” (Öllinger & Knoblich, 2009, p. 277). To break this circularity, investigators have used self-report to determine whether a given problem has elicited an experience of insight or otherwise (Bowden & Jung-Beeman, 2003a; Danek et al., 2014a; Danek et al., 2016). These self-reports may be gathered either during problem solving (e.g., Metcalfe & Wiebe, 1987) or directly after problem solving (e.g., Bowden & Jung-Beeman, 2003a; Danek et al., 2014a; Kounios et al., 2008). In the present article, we have opted to use the post-problem self-report scales developed by Danek et al. (2014a), which are concerned with the phenomenological components of insight; namely, confidence, aha experience, surprise, pleasure, and impasse.

One of the most distinctive components of an experience of insight is the aha experience, which has often been used as a synonym for insight itself. It is generally described as sudden, accompanied by strong emotional arousal that may be either positive or negative (Danek et al., 2014a; Hill & Kemp, 2016b; Shen et al., 2016), and by a strong sense of certainty arising from the reanalysis of the problem. A number of researchers consider the aha experience to be definitive of an insightful solution (Cushen & Wiley, 2011; Gick & Lockhart, 1995; Metcalfe & Wiebe, 1987), or at least the characteristic most indicative of insight problem solving (Danek et al., 2014a; Faber, 2012; Jung-Beeman et al., 2004; Schooler, Ohlsson, & Brooks, 1993).

Much of the cognitive neuroscience literature on insight has focused on validating the procedure developed by Bowden (1997), who solicited trial-by-trial judgments from participants regarding whether a solution was derived through a process of insight or through a process of analysis. Bowden (1997) found that the conscious awareness of insight is related to unconscious processing prior to the experience of insight (i.e., when solution words are presented subliminally, solutions are rated by participants as feeling insightful). Subsequent research using this procedure has indicated that solutions involving insight are associated with distinct patterns of brain activation (Jung-Beeman et al., 2004; Kounios et al., 2006; Subramaniam, Kounios, Parrish, & Jung-Beeman, 2009), with specific areas associated with distinct stages of preparation for problem solving. However, these trial-by-trial procedures have consistently measured insight as a binary or categorical response [e.g., “Was this problem solved: (1) with insight, (2) not with insight, (3) unsure”]; consequently, the strength of the insight response has so far been inferred only indirectly, from differences in physiological measures (Hill & Kemp, 2016a). Although self-reported strength and these physiological measures should, one hopes, correlate, there is no direct evidence that they do.

Finally, some researchers have considered the aha experience sufficient to define insight (Gick & Lockhart, 1995; Kounios & Beeman, 2009), whereas others dissociate the aha experience from the experience of insight (Danek et al., 2014a; Sandkühler & Bhattacharya, 2011), arguing that insight comprises many components (e.g., surprise, confidence, and impasse; Danek et al., 2014a), of which a feeling of aha is only one (Danek et al., 2014a; Danek, Fraps, von Müller, Grothe, & Öllinger, 2014b; Klein & Jarosz, 2011). Yet others consider the aha experience a mere epiphenomenon of restructuring the problem space (Ormerod, MacGregor, & Chronicle, 2002; Sandkühler & Bhattacharya, 2011; Weisberg & Alba, 1981). Irrespective of this debate, the aha experience is a strong emotional marker that has been associated with new discoveries (Feynman, 1999; Poincaré, 1913; Schultz, 1890), facilitated recall (Danek et al., 2013), improved learning (Dominowski & Buyer, 2000; Kizilirmak, Gomes da Silva, et al., 2016), and increased motivation (Liljedahl, 2004). As such, it is worthy of study regardless of whether it is necessary and/or sufficient as an indicator of an insight experience. In this article, we investigate the validity of a number of commonly used insight and noninsight tasks by testing each problem’s ability to elicit insight.

Tasks used to elicit insight and their controls

Insight problems are designed to elicit a feeling of impasse, or being stuck, by creating a problem with an uncertain or unusual path from problem to solution (a so-called ill-defined problem space). For example:

A man is escaping from a 60-m tower. He has a length of rope that is 30 m long. He cuts the rope in half, ties it together again, and uses it to escape. How does he do this?

The answer may or may not be immediately clear; however, the solution becomes obvious if one thinks about cutting the rope along its length rather than its width (see Footnote 1). It is this sudden clarity of solution and feeling of aha that is used as an indication of insight processes. However, the initial misinterpretation and consequent misrepresentation of the problem space vary across solvers, as problem solvers are able to solve these problems using both logical deduction and mental leaps toward a solution (Weisberg, 2014).

In contrast, noninsight problems are designed to be solvable through a simple, incremental process, with a clear path through the problem space from the initial problem to the solution. Classic examples of noninsight problems are logic-based questions, though there are also many examples using fluid intelligence tasks (such as Raven’s Advanced Progressive Matrices; Raven, 2000):

Bob’s father is three times as old as Bob. They were both born in October. Four years ago, he was four times as old. How old are Bob and his father?

The solution (Bob is 12; his father is 36) requires basic arithmetic (3 × 12 = 36; 36 − 4 = 32; 4 × 8 = 32); however, although this question arguably requires simply stepping through the arithmetic, it does require a problem solver to remember their basic maths, and not to be caught out by multiplying the three and the four to get a 12-year-old father, which is actually a frequent response. Thus, the sudden memory of how to solve the problem may result in a feeling of insight. The tendency for problem solvers to solve insight problems using both insightful and analytic methods and feelings was made particularly clear in recent research by Danek et al. (2016), who tested three classic insight problems and found that problem solvers solved these problems both with and without insight affect.
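For completeness, the algebra behind this solution (a minimal derivation, writing $B$ for Bob’s age and $F$ for his father’s):

$$F = 3B, \qquad F - 4 = 4(B - 4) \;\Rightarrow\; 3B - 4 = 4B - 16 \;\Rightarrow\; B = 12,\ F = 36.$$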

Types of insight and noninsight problems

So far we have discussed predominantly “classic problems” (so dubbed by Cunningham, MacGregor, Gibb, & Haar, 2009); however, although these problem types were initially the most frequently used, they have been superseded in recent years by other tasks, such as compound remote associates, anagrams, matchstick arithmetic, and rebus puzzles (Table 1 outlines the problem types, along with links to the studies that introduced them into the literature or to normative studies, where available). The majority of research into the ability of insight problems to actually elicit insight has been conducted on compound remote associates (see, e.g., Jung-Beeman et al., 2004; Kounios et al., 2006; Salvi, Bricolo, Bowden, Kounios, & Beeman, 2016; Sandkühler & Bhattacharya, 2011; Wegbreit, Suzuki, Grabowecky, Kounios, & Beeman, 2012), but many of the theories of insight processes arise from research on classic problems (see Sternberg & Davidson, 1995, for a comprehensive review of this literature). We next review classic insight problems and then the more contemporary problem types, including the aforementioned compound remote associates.

Table 1 Types of insight and noninsight problems, examples, and directions for further reading

Classic problems

The example above (i.e., the rope problem) is an example of a classic insight problem. These are often riddle-type vignettes, sometimes accompanied by images to create a spatial problem (see the supplementary materials for a list of problems and solutions). Classic insight problems are typically described as impossible to solve without restructuring (Ash & Wiley, 2006; Gilhooly & Murphy, 2005; Weisberg, 1995b); that is, without developing a mental representation of the problem that considers the relations between the elements of the problem in a way other than as presented. Weisberg (1995a, b) developed a taxonomy of insight and noninsight problems based on the degree of restructuring required and on whether or not a problem was discontinuous (whether a problem solver needs to change direction or start again in order to proceed). This taxonomy outlines “pure” noninsight problems, for which no restructuring is required; “pure” insight problems, which are both discontinuous and require restructuring; and hybrid problems, which are discontinuous and may require restructuring on a subject-to-subject basis. Gilhooly and Murphy (2005) compared performance on 24 presumed insight and ten presumed noninsight problems in a cluster analysis and found clusters that were congruent with Weisberg’s (1995b) taxonomy, including hybrid problems.

The other example presented above (i.e., Bob’s father) is of a classic noninsight problem, and a large literature has been concerned with testing the procedural differences between classic insight and noninsight problems (e.g., Gilhooly & Murphy, 2005; Metcalfe & Wiebe, 1987; Weisberg, 1995b). However, there are instances in which problems classified as “noninsight” have been solved with insight-like feelings or patterns of solution (e.g., Davidson, 1995; Webb, Little, & Cropper, 2016b). For example, Davidson (1995) noted that 12%–13% of noninsight problems were solved with the same feeling-of-warmth (FOW) ratings as insight problems. Webb, Little, and Cropper (2016a, b) investigated a subset of classic insight and noninsight problems and found that, as with compound remote associates, noninsight problems may also be solved with feelings of insight.

Contemporary problems

In this context, we distinguish between classic and contemporary problems in the following fashion: Classic problems are riddles and puzzles drawn from, and discussed in, the literature before or during 1995. Classic problems predominantly have a vignette component (either as the entirety of the problem or accompanying a spatial puzzle) and require at least 3 min, on average, to solve. In contrast, contemporary problems are those that have been developed or discussed predominantly after 1995. These include problems such as compound remote associates (Bowden & Jung-Beeman, 2003b), anagrams (Kounios et al., 2008), and rebus puzzles (MacGregor & Cunningham, 2008). We differentiate these from classic problems because, though some of these problems were used in the cognitive literature prior to 1995, they have only been applied to the study of insight more recently (see Bowden et al., 2005, for a discussion of this topic).

Compound remote associates and remote associate tasks

Both compound remote associates (Bowden & Jung-Beeman, 2003b) and remote associate tasks (Mednick, 1962) are short verbal problems: Three words are presented to a participant, each combinable with a single fourth word. In the case of compound remote associates, the fourth word combines with the three to create three compound words (e.g., tooth, potato, and heart combine with sweet). In the case of remote associate tasks, the fourth word does not need to create compound words, but is simply related to the three problem words (e.g., lick, sprinkle, and mine with salt). These problems have gained prominence in the insight literature because they are relatively short, can be easily administered, and have many easily created variations. Bowden and Jung-Beeman (2003b) conducted a normative study of 144 compound remote associates, providing response times and solution rates. Concurrent research (Bowden & Jung-Beeman, 2003a) provided evidence validating the ability of compound remote associates to elicit insight affect and processes; however, it did not report the probability of experiencing insight.
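To make the task structure concrete, here is a minimal R sketch of scoring a compound-remote-associate response. The mini-lexicon and the helper function are our own illustrative constructions; a real implementation would check candidates against a full dictionary of English compounds.

```r
# Hypothetical mini-lexicon of compounds; a real implementation would use
# a full dictionary of English compound words.
lexicon <- c("sweettooth", "sweetpotato", "sweetheart")

# TRUE if `candidate` forms a compound with every cue word,
# attaching at either end of the cue.
solves_cra <- function(cues, candidate, lexicon) {
  all(vapply(cues, function(cue) {
    paste0(candidate, cue) %in% lexicon || paste0(cue, candidate) %in% lexicon
  }, logical(1)))
}

solves_cra(c("tooth", "potato", "heart"), "sweet", lexicon)  # TRUE
```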

Anagrams

Anagrams are words that have been scrambled and presented to a participant for solution (e.g., tpoil = pilot). Metcalfe (1986b) used these in her research investigating insight-based and analytic-based (i.e., not involving insight) solutions. However, despite subjects indicating that these problems were predominantly solved with feeling-of-warmth patterns similar to those experienced in insight problems (i.e., feeling-of-warmth ratings suddenly leap from far to near in insight problems, whereas they increase incrementally in noninsight problems), researchers have presented arguments against the classification of anagrams as insight problems. For instance, Weisberg (1995b) argued that anagrams are not insight problems because they do not require restructuring but rather are a simple vocabulary search task.

Nevertheless, a number of studies have used anagrams for their ability to elicit insight (e.g., Aziz-Zadeh et al., 2009; Bowden, 1997; Jacobsen, 2016; Kounios et al., 2008; Novick & Sherman, 2003). Although different studies have provided conflicting information regarding the solvability of anagrams (e.g., Novick & Sherman, 2003), no normative data have been collected for the degree of insight processes or affect elicited by different anagrams.

Raven’s Advanced Progressive Matrices

The logic pattern-completion puzzles that Raven (1985) developed to assess fluid reasoning and problem-solving abilities (Little, Lewandowsky, & Craig, 2014) have been increasingly used as noninsight problems (e.g., Gilhooly, Fioratou, & Henretty, 2010; Paulewicz, Chuderski, & Nęcka, 2007). Each task comprises a 3 × 3 figure matrix organised according to latent rules; the task is to deduce the latent rules and select the one of eight response options that completes the pattern. Investigations by Gilhooly and Murphy (2005) indicate that performance on Raven’s Advanced Progressive Matrices (Raven’s) clusters with classic noninsight problems, yet the literature consistently demonstrates a positive relationship between Raven’s and both classic insight problems (Lin, Hsu, Chen, & Wang, 2012; Nęcka, Żak, & Gruszka, 2016; Paulewicz et al., 2007) and remote associate tasks (Chermahini, Hickendorff, & Hommel, 2012; Paulewicz et al., 2007). As yet, there have been no investigations into the ability of these tasks to elicit insight; we therefore investigate this in the present study.

Rebus puzzles

MacGregor and Cunningham (2008) proposed rebus puzzles as insight problems, obtaining a measure of self-reported insight affect, and comparing performance on rebus puzzles to the remote associate tasks. A rebus puzzle combines words and visual cues to represent a familiar phrase (e.g., SOMething = “the start of something big”). Participants’ base ratings of insight were higher in response to rebus puzzles and remote associate tasks as compared to an analogies task (e.g., “sheep is to lamb as cow is to . . .” = calf). These results were interpreted as evidence that rebus puzzles could be considered insight problems. However, MacGregor and Cunningham did not obtain individual insight rating data for their problem sets. Salvi et al. (2016) also used a set of Italian rebus puzzles and found that solutions solved with insight were judged to be correct more often than solutions solved analytically. Salvi et al. replicated these findings for anagrams and compound remote associates, but did not provide data regarding ratings of insight.

Matchstick arithmetic

Matchstick arithmetic problems were proposed as insight problems by Knoblich et al. (1999) to investigate the role of chunked information and restructuring. In a matchstick arithmetic task, an incorrect equation is presented to a participant with matchsticks creating both numbers (Roman numerals) and mathematical symbols. The task is to make the equation correct by moving one matchstick (e.g., IV = III – I; answer, IV – III = I). In their experiment, Knoblich et al. tested the degree of restructuring required by each type of matchstick arithmetic; however, they did not investigate the phenomenology of insight. Recent investigations of the ability of these tasks to elicit insight affect have provided mixed results (Danek et al., 2016; Derbentseva, 2007).

Magic tricks

A novel method used by Danek et al. (2014b) was to investigate insight using magic tricks. In conjunction with a magician, the researchers developed and recorded 40 short tricks, with only one effect and one method, which were scored according to the degree of insight-related affect (i.e., surprise, aha, impasse, confidence, and pleasure) experienced when watching the trick. Although the magic tricks may or may not conform to standard definitions of insight problem (i.e., restructuring), they evidently elicited insight. Since we chose to investigate the most frequently occurring tasks in the literature, we did not investigate magic tricks or rebus puzzles.

Aim of the present work

A number of the studies discussed above contain normative data for the solution rate and response time of a variety of different problem types; however, there are currently no normative data on the strength and frequency of insight affect elicited by these tasks. The ability of any of the above problems to elicit insight is not in dispute; evidence indicates that many problems can elicit insight for many persons, depending on an individual’s focus and reason for problem solving (Klein & Jarosz, 2011; Ovington et al., 2015). It is the strength of insight that is elicited across a range of problems that we aim to investigate in this article, as well as the reliability of a subset of problems to elicit insight.

General method

Across four studies, a total of 544 University of Melbourne students (452 female, 92 male; age range = 16–58, mean = 20.34) completed insight and noninsight problem-solving tasks coupled with various additional measures. The primary study was conducted with 101 University of Melbourne students (72 female, 29 male; age range = 17–58, mean = 23.38), who completed the study for payment of $40. Before beginning the study, participants were provided with consent forms detailing the proposed study. We advertised for participants with English as a first language, as a number of problems required high English proficiency and we have previously shown this to be important (Webb, Little, Cropper, & Roze, 2017).

Materials

Classic insight and noninsight problems

To generate a dataset of classic problems, we conducted a systematic search of the literature, and noted which problems were most frequently used (see the supplementary materials for search terms and selection criteria, as well as the table detailing which problems were used most frequently).

Problems were categorized as insight or noninsight problems on the basis of published categorizations and taxonomies. There were some contradictions in the usage of particular problems (e.g., trace problems have been used as both insight and noninsight problems). In these instances, we classified each problem according to the cluster analysis performed by Gilhooly and Murphy (2005).

We selected the 25 most frequently used insight and noninsight problems. Accuracy and response time (RT) were recorded. We provide normative data for the solution of these problems in the Appendix.

Raven’s Advanced Progressive Matrices

Participants completed the truncated Raven’s Advanced Progressive Matrices (adapted according to the method of Arthur & Day, 1994), which contains 12 test problems. These 12 problems were randomly interleaved with classic insight and noninsight problems. Accuracy and reaction time were recorded, with normative data for the solution of these problems in the Appendix.

Compound remote associates

We presented participants with 34 problems, pseudo-randomly drawn from each quantile in Bowden and Jung-Beeman’s (2003b) dataset, ensuring that the solutions would vary in difficulty and time necessary for solution. Participants had 30 s to generate the fourth word.
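As an illustration of this stratified draw, here is a minimal R sketch. The data frame `cra_norms`, its column names, and the assumption of four quantiles are all our own; the Bowden and Jung-Beeman (2003b) norms would be substituted for the stand-in data.

```r
library(dplyr)

set.seed(1)
# Illustrative stand-in for the Bowden and Jung-Beeman (2003b) norms:
# one row per problem, with its normative solution rate.
cra_norms <- data.frame(problem = paste0("cra_", 1:144),
                        solution_rate = runif(144))

selected <- cra_norms %>%
  mutate(stratum = ntile(solution_rate, 4)) %>%  # assume four difficulty strata
  group_by(stratum) %>%
  slice_sample(n = 9) %>%                        # 9 per stratum = 36 problems
  ungroup() %>%
  slice_sample(n = 34)                           # trim to the 34 presented
```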

Anagrams

We drew 34 five-letter anagrams from Novick and Sherman (2003). Each anagram was solvable within one, two, or three letter moves, with two-letter moves being the most common.

Procedure

Each participant was individually tested in four sessions. Problems were presented and reaction times recorded online using Qualtrics (Qualtrics, 2016; for more detail on the resolution of reaction time measures in Qualtrics, see Barnhoorn, Haasnoot, Bocanegra, & van Steenbergen, 2014). The problem-solving sets were counterbalanced across participants. No solutions were given.

Problem-solving sets

There were two problem-solving sets: classic and contemporary. The classic “insight” and “incremental” (noninsight) problems were randomly interleaved within a set. Participants were given no information about whether the problem to be solved was classified as “insight” or “noninsight,” but were given 210 s to work through each problem. In the contemporary problem set, the compound remote associate and anagram components were counterbalanced. Five practice trials preceded each set. Participants were given 30 s to solve each contemporary problem.

Participants were given information about aha experiences to inform their ratings of each problem. A vignette describing aha experiences (drawn from Danek, Fraps, von Müller, Grothe, & Öllinger, 2014a, b; see the supplementary materials for the vignette) was presented at the beginning of the experiment. After each problem-solving task, participants were presented with the scales drawn from Danek et al. (2014a). We chose these scales because they individuate the components of insight from one another and, as visual analogue scales, require minimal processing. Participants were asked to rate: (1) their confidence that the given response was correct (very unsure to very sure), (2) the strength of the aha experience (very weak to very strong), (3) the pleasantness of the insight experience (very unpleasant to very pleasant), (4) how surprising the insight experience was (not surprising at all to very surprising), and (5) the feeling of impasse before the insight experience (no impasse at all to very stuck). Participants responded by moving a slider (preset at 50) along a scale of 0–100.

Data analysis

Analyses were conducted using JASP (Love et al., 2015) and R. Differences in the aha ratings across problem types were investigated using a series of one-way ANOVAs, whereas the correlation plots were created using the R package corrplot (Wei & Simko, 2016).
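As an illustration, a minimal R sketch of the correlational part of this pipeline follows. The object `aha_wide` (per-participant mean aha ratings, one column per problem type) and its random stand-in values are our own; corrplot is the package the article itself cites for Fig. 1.

```r
library(corrplot)

set.seed(1)
# Illustrative stand-in: per-participant mean aha ratings by problem type.
aha_wide <- as.data.frame(matrix(rnorm(101 * 5, 50, 15), ncol = 5,
  dimnames = list(NULL, c("insight", "noninsight", "cra", "anagram", "ravens"))))

# Pairwise correlation with a significance test, e.g., anagrams vs. CRAs:
cor.test(aha_wide$anagram, aha_wide$cra)

# Correlation matrix and plot in the style of Fig. 1.
r_mat <- cor(aha_wide, use = "pairwise.complete.obs")
corrplot(r_mat, method = "circle")
```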

Results

Problems were scored as either correct or incorrect and averaged across category (insight, noninsight, compound remote associates, anagrams), as were the ratings of insight-related affect (see the supplementary materials). Descriptive statistics for performance accuracy and the ratings of insight-related affect are displayed in Table 2.

Table 2 Descriptive statistics of accuracy and insight related affect across problem types

We calculated the percentage of participants solving each problem, as well as the mean time to solution, in seconds. We also calculated the mean ratings of insight for each problem, and then further investigated the mean ratings of aha experience by response accuracy. These data are presented in the Appendix in descending order according to mean strength of insight elicited in correct responses.
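A sketch of how these per-problem summaries can be computed (the trial-level data frame `trials` and its columns are illustrative names, with random stand-in values; the real data would be substituted):

```r
library(dplyr)

set.seed(1)
# Illustrative trial-level data: one row per participant x problem.
trials <- data.frame(problem = rep(c("socks", "lilies", "trace", "cards"), each = 50),
                     correct = rbinom(200, 1, .5),
                     rt_sec  = runif(200, 5, 210),
                     aha     = runif(200, 0, 100))

problem_norms <- trials %>%
  group_by(problem) %>%
  summarise(pct_solved  = 100 * mean(correct),        # % of participants solving
            mean_rt_sec = mean(rt_sec[correct == 1]), # mean time to solution (s)
            aha_correct = mean(aha[correct == 1]),    # mean aha, correct responses
            .groups = "drop") %>%
  arrange(desc(aha_correct))  # Appendix ordering: by aha strength when correct
```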

Relationships between problem types

We examined the relationships between problems used as insight problems (classic insight problems, anagrams, and compound remote associates), and problems used as noninsight problems (classic noninsight problems and Raven’s Advanced Progressive Matrices) in terms of both accuracy and the strength of the aha experience.

The correlations between problem types on ratings of aha experience indicated moderate to strong positive relationships across problem types, as can be seen above the diagonal in Fig. 1 (note that all relationships are above a Pearson r value of .4 and significant at p < .001; below the diagonal are the correlations for accuracy). This indicates that individual differences may underlie the tendency to report a problem to be solved with insight across both insight and noninsight problem types, as has been noted through the use of compound remote associates and anagrams in the cognitive neuroscience literature (Bowden et al., 2005; Kounios & Beeman, 2014).

Fig. 1

Correlation plots between accuracy and aha across problem types. The size of each circle and its saturation of color show the strength of the correlation; the color shows the direction of the relationship, with positive being blue. The upper half of the correlation plot details aha results, and the lower half details accuracy. Nonsignificant correlations have been removed (see the supplementary materials for the correlation statistics). The correlation plot was created using the R package corrplot (Wei & Simko, 2016)

Performance accuracy

The pattern of relationships across problem types in terms of accuracy indicates significant positive relationships between classic insight problems and all other problem types (see the lower half of Fig. 1; see also the supplementary materials for the correlation statistics), as well as significant moderate positive relationships between solution accuracy on anagrams and compound remote associates [r(99) = .51, p < .001] and between noninsight problems and compound remote associates [r(99) = .25, p = .01]. However, accuracy on noninsight problems was not correlated with anagrams [r(99) = .18, p = .07]. Furthermore, despite significant positive relationships between Raven’s and both insight [r(99) = .39, p < .001] and noninsight [r(99) = .56, p < .001] problems, there were no significant relationships between Raven’s and either anagrams [r(99) = −.09, p = .39] or compound remote associates [r(99) = .06, p = .51]. This may reflect the necessity of an extensive vocabulary for the solution of both compound remote associates and anagrams, whereas Raven’s requires a nonlexical solution. It also reflects some of the complications of using these problem types interchangeably, as was noted by Ball and Stevens (2009).

Differences between problem types for accuracy and insight

We were also interested in whether particular problem types (e.g., classic insight problems) would elicit higher ratings of insight experience, particularly ratings of the aha experience. If all problems considered to be insight problems can be used interchangeably, we would expect a significant difference in aha ratings between problems considered to be insight problems (i.e., classic insight problems, compound remote associates, anagrams) and problems considered to be noninsight problems (i.e., classic noninsight problems, Raven’s), and no difference between problem types within the insight or noninsight categories. A repeated measures analysis of variance on ratings of aha experience across problem types (see Fig. 2) indicated a significant difference between problem types in aha ratings, F(4, 400) = 65.85, p < .001, η² = .40. Post-hoc comparisons showed no significant difference between insight problems and compound remote associates in aha ratings. This implies that classic insight problems and compound remote associates elicit, on average, ratings of insight that are not significantly different from each other, which is reassuring for a literature that is moving from the use of classic insight problems to compound remote associates.
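A sketch of this repeated measures analysis in R (base aov with a within-subjects error term; the long-format data frame `ratings` and its random values are our own construction):

```r
set.seed(1)
# Illustrative long-format data: every subject rated each problem type.
ratings <- expand.grid(subject = factor(1:101),
                       problem_type = factor(c("insight", "noninsight", "cra",
                                               "anagram", "ravens")))
ratings$aha <- runif(nrow(ratings), 0, 100)

# Repeated measures ANOVA: problem type as a within-subjects factor.
fit_rm <- aov(aha ~ problem_type + Error(subject / problem_type), data = ratings)
summary(fit_rm)

# Bonferroni-corrected pairwise comparisons; expand.grid orders subjects
# identically within each problem type, so paired differences line up.
pairwise.t.test(ratings$aha, ratings$problem_type,
                paired = TRUE, p.adjust.method = "bonferroni")
```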

Fig. 2

Mean (a) aha ratings and (b) performance accuracy across problem types. (Error bars show standard deviations)

Similarly, no significant differences emerged between noninsight problems and Raven’s in aha ratings, which may indicate that Raven’s is a valid measure of noninsight problem solving; however, noninsight problems resulted in significantly higher ratings of aha experience than both insight problems (p < .001, mean difference = 6.97, Cohen’s d = 0.62) and compound remote associates (p < .001, mean difference = 10.02, Cohen’s d = 0.55). (Similarly, Raven’s resulted in significantly higher ratings of aha than either insight problems—p < .001, mean difference = 11.17—or compound remote associates—p < .001, mean difference = 14.22.) These results extend the findings of Danek et al. (2016), who noted that classic insight problems could be solved without insight, with the finding that classic noninsight problems can be solved with strong feelings of insight.

Finally, anagrams elicited significantly higher ratings of aha experience than did all other problem types (anagrams to classic insight: p < .001, mean difference = 21.55, Cohen’s d = 1.11; anagrams to compound remote associates: p < .001, mean difference = 24.59, Cohen’s d = 1.57; anagrams to noninsight: p < .001, mean difference = 14.58, Cohen’s d = 0.77; anagrams to Raven’s: p < .001, mean difference = 10.37, Cohen’s d = 0.48).

Accuracy

Given the process-oriented approach of interpreting the correct solution of an insight problem as indicative of insight, we performed the same repeated measures ANOVA across problem types for solution accuracy (see Fig. 2b). We found a significant difference in accuracy across problem types, F(4, 400) = 222.40, p < .001, η² = .68, with participants being significantly more accurate at solving anagrams than at solving all other problem types (anagrams to classic insight: p < .001, mean difference = .47, Cohen’s d = 2.30; anagrams to compound remote associates: p < .001, mean difference = .44, Cohen’s d = 2.66; anagrams to noninsight: p < .001, mean difference = .17, Cohen’s d = 0.74; anagrams to Raven’s: p = .004, mean difference = .07, Cohen’s d = 0.27). Participants solved significantly more Raven’s problems than noninsight (p < .001, mean difference = .09, Cohen’s d = 0.99), insight (p < .001, mean difference = .40, Cohen’s d = 1.92), or compound remote associate (p < .001, mean difference = .37, Cohen’s d = 1.44) problems. They also solved more noninsight problems than either insight problems (p < .001, mean difference = .31, Cohen’s d = 1.81) or compound remote associates (p < .001, mean difference = .27, Cohen’s d = 1.27). We observed no significant difference in accuracy between insight problems and compound remote associates (p = .61, mean difference = .03). The accuracy results mirror those for ratings of insight, suggesting a relationship between accuracy and aha. The correlations between accuracy and aha ratings (see the supplementary materials, Fig. 1, for correlation plots) indicate a significant relationship between ratings of aha and solution accuracy for presumed insight problems [classic insight: r(99) = .27, p = .006; compound remote associates: r(99) = .26, p = .008; anagrams: r(99) = .23, p = .02], but no relationship for noninsight problems [r(99) = −.10, p = .31].

Ratings of aha experience conditional on performance accuracy

Given the similarity in the patterns across problems of both aha ratings and accuracy, we performed a series of analyses on aha ratings conditional on whether the problem was correctly solved (see Fig. 3). Examining the aha ratings across problems when the solution was correct revealed a significant effect of problem type, F(3, 69) = 29.56, p < .001, η² = .56. Bonferroni post-hoc tests indicated that anagrams elicited the highest ratings of insight relative to the other problem types, with significantly higher ratings than classic insight problems (p < .001, mean difference = 16.08, Cohen’s d = 1.11) or classic noninsight problems (p < .001, mean difference = 19.69, Cohen’s d = 2.35); anagrams were not significantly different from compound remote associates when analyzing aha ratings conditional on correct solutions (p = 1, mean difference = 3.30). Compound remote associates returned significantly higher self-reports of aha experience than did either insight problems (p < .001, mean difference = 12.78, Cohen’s d = 0.79) or noninsight problems (p < .001, mean difference = 16.39, Cohen’s d = 2.06).
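The conditional analysis follows the same pattern as the sketch above, simply subsetting to correct (or incorrect) trials before averaging within subject and problem type. A minimal sketch, again with illustrative names and random stand-in data:

```r
library(dplyr)

set.seed(1)
# Illustrative trial-level data, as in the earlier sketches.
trials_long <- data.frame(subject = factor(rep(1:20, each = 20)),
                          problem_type = factor(rep(c("insight", "noninsight",
                                                      "cra", "anagram"), 100)),
                          correct = rbinom(400, 1, .7),
                          aha     = runif(400, 0, 100))

# Average aha within subject and problem type, separately by accuracy.
aha_by_acc <- trials_long %>%
  group_by(subject, problem_type, correct) %>%
  summarise(aha = mean(aha), .groups = "drop")

# The same within-subjects ANOVA, restricted to correctly solved problems.
correct_only <- filter(aha_by_acc, correct == 1)
fit_correct <- aov(aha ~ problem_type + Error(subject / problem_type),
                   data = correct_only)
summary(fit_correct)
```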

Fig. 3

Mean ratings of aha experience across problem types as a function of accuracy. (Error bars show standard deviations)

There was no difference between insight and noninsight problems in ratings of aha for correctly solved problems (p = .91, mean difference = 3.61). This suggests that the original finding of significantly higher ratings of aha experience for noninsight problems may have resulted from consistently high ratings of aha for both correctly and incorrectly solved noninsight problems, whereas for insight problems, ratings of insight were high only for correctly solved problems. (We found no significant difference between Raven’s and noninsight problems: p = .19, mean difference = 4.94.)

Across all problem types, a significant difference in aha ratings was apparent for incorrectly solved problems, F(3, 69) = 11.68, p < .001, η² = .33. Post-hoc comparisons indicated that this effect was driven largely by high ratings of aha experience for incorrectly solved noninsight problems and low ratings of aha for incorrectly solved compound remote associates (p < .001, mean difference = 20.35, Cohen’s d = 1.66). There was, for instance, no significant difference between the aha ratings for incorrectly solved insight and noninsight problems (p = .15, mean difference = 8.02, Cohen’s d = 0.45), nor between noninsight and Raven’s problems (p = .14, mean difference = 6.78, Cohen’s d = 0.62). Ratings of aha were also significantly higher for incorrectly solved noninsight problems than for incorrect anagrams (p = .007, mean difference = 11.84, Cohen’s d = 0.66), and ratings of aha for incorrectly solved compound remote associates were significantly lower than for classic insight problems (p = .005, mean difference = 12.33, Cohen’s d = 0.86).

Summary

We investigated aha ratings across a number of problem types, examining the relationship between aha and accuracy through correlational analyses and analyses of variance. When the average patterns of aha ratings and accuracy were examined individually, they were similar; however, conditioning aha ratings on accuracy yielded a different pattern of results.

Overall, anagrams were solved with the highest accuracy and the highest ratings of aha. Although compound remote associates also elicited high ratings of aha, their low solution rates mean that their use depends on measuring both solution accuracy and ratings of aha experience.

Interestingly, classic insight and noninsight problems did not differ significantly in aha ratings when these were analyzed conditionally on response accuracy, and aha ratings were significantly higher for noninsight problems when not conditionally analyzed. This stands in strong contrast to the use of noninsight problems as control problems (though we recognize that noninsight problems may still be effective as problems that, more often than not, do not require restructuring). However, we found a significant relationship between accuracy and aha experience in presumed insight problems (compound remote associates, anagrams, and classic insight problems) but no such relationship for classic noninsight problems.

We used the truncated Raven’s Advanced Progressive Matrices (Arthur & Day, 1994) as noninsight problems. This enabled us to investigate reports of insight-related affect in the solution of Raven’s matrices. There was a significant positive relationship between Raven’s and all problem types regarding aha experiences, and no significant differences between ratings of aha in classic noninsight problems and Raven’s, despite a significant difference in accuracy.

Reliability of classic insight problems to elicit insight

Reliability of insight: Method

In three additional experiments (Webb et al., 2016b; Webb, Little, Cropper, & Roze, 2017), each with a large sample (N > 100), we used a subset of the problems tested here, with near-identical procedures. This allowed us to investigate the reliability of aha ratings conditional on accuracy across all four experiments. The problem-set procedure was identical to the method already outlined in this article, with exceptions noted below. The primary focus of these three experiments was to investigate individual differences in the tendency to report insight, so participants also completed questionnaires alongside the problem-solving task, in counterbalanced order.

Study 1

Students from the University of Melbourne (193: 118 female, 75 male; age range = 17–52, mean = 19.639) completed the study for course credit. Nine participants were removed for errors on more than 20% of the tasks.

Materials

“Classic” insight and noninsight problems

The following problems were used in all studies:

  • Insight problems: triangle problem, socks problem, lilies problem, antique coin problem, and egg timer problem

  • Noninsight problems: cards, water jug, trace, police, and dinner

“Contemporary” insight problems: Compound remote associates

We used 20 compound remote associates (CRAs) drawn from Bowden and Jung-Beeman (2003b).

Questionnaires

A series of individual-differences measures were presented in random order. These included the Oxford–Liverpool Inventory of Feelings and Experiences (O-LIFE; Mason & Claridge, 2006), Raven’s (1985) Advanced Progressive Matrices, a verbal fluency measure adapted from Lezak (2004), and an adaptation of the alternative-uses task (AUT; Guilford, Christensen, Merrifield, & Wilson, 1978). These measures are reported elsewhere in a follow-up study of the same sample (Webb, Little, Cropper, & Roze, 2017).

Study 2

These data were collected individually online. A further aim of this study was to investigate the effect of feedback on reported insight; we therefore analyze only the responses given before the solution was revealed for each problem. The comparison of aha ratings before and after feedback is reported elsewhere in a follow-up study of the same sample (Webb, Little, Cropper, & Webb, 2017).

We found no significant difference in accuracy or aha ratings between the study completed in the lab (Study 1) and the study completed online (Study 2).

Participants

A total of 129 undergraduates (88 female, 41 male; age range = 17–45, mean = 19.059) completed the tasks for course credit. Twelve participants were removed for errors in more than 20% of the tasks.

Materials, procedure, and design

The materials and procedure were identical to those of Study 1, save that participants were given the solution to each problem after their initial attempt.

Study 3

We expanded the individual-differences measures in Study 3 to include measures of the Big Five personality traits and magical ideation; the problems and procedure otherwise remained the same as in Study 1. The tasks were presented individually online.

Participants

Undergraduates from the University of Melbourne (130: 106 female, 24 male; age range = 16–47, mean = 19.60) completed the tasks for course credit. Four participants were removed for errors in more than 20% of the tasks.

Reliability of insight: Results

Interexperiment reliabilities were calculated using Cronbach’s alpha for each problem type. Ratings of aha experience for correctly solved insight problems were highly reliable, with a reliability coefficient of α = .95. Aha ratings for correctly solved noninsight problems were moderately reliable, with α = .79. For incorrectly solved problems, the insight α = .66 and the noninsight α = .95. The drop in reliability for aha ratings in incorrectly solved insight problems is congruent with an accuracy-related pattern of aha experience in insight problem solution; that is, feelings of insight were more reliably elicited in participants who correctly solved insight problems. In contrast, even incorrectly solved noninsight problems showed high reliability in aha ratings (i.e., reliably low aha ratings). As presented in Table 3, we investigated the average aha ratings conditional on accuracy, as well as which problems could be dropped to increase the reliability of the aha ratings. Among the insight problems, the triangle problem was the least reliable for aha ratings in both correctly and incorrectly solved problems; among the noninsight problems, the police problem was the least reliable.
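A sketch of this reliability analysis in R, assuming the psych package (the stand-in data frame, with experiments as rows and problems as columns, is our own construction, so the printed values are not meaningful):

```r
library(psych)

set.seed(1)
# Stand-in data: rows = the four experiments, columns = the five classic
# insight problems; cells = mean aha rating for correctly solved problems.
aha_correct <- as.data.frame(matrix(runif(20, 40, 80), nrow = 4,
  dimnames = list(NULL, c("triangle", "socks", "lilies", "coin", "egg_timer"))))

rel <- psych::alpha(aha_correct)
rel$total$raw_alpha  # overall Cronbach's alpha
rel$alpha.drop       # reliability if each problem is dropped in turn
```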

Table 3 Averaged aha conditional on correct solution of the problem across four experiments

General discussion

We conducted an extensive investigation into the ratings of insight elicited by problems frequently used as tests of both creativity and insight (classic insight and noninsight problems, compound remote associates, and anagrams). We recorded measures of solution time, accuracy, and ratings of aha experience, as well as other insight-related affect (e.g., surprise and confidence in the solution). The ratings of self-reported insight experience emphasize both the importance of judging insight versus noninsight processing by the feeling accompanying the solution rather than by the task (Bowden & Jung-Beeman, 2003a) and the value of the continuous, strength-based self-report method used in the present study. The results support the use of compound remote associates, anagrams, and classic insight problems as problems that elicit insight; however, they urge caution in the use of classic noninsight problems and intelligence tests (e.g., Raven’s Advanced Progressive Matrices) as controls for insight problems.

Ratings of aha

The present results offer preliminary normative data for the tendency of classic and contemporary insight problems to elicit insight processes and affect (see Footnote 2). This is particularly useful given the increasing use of compound remote associates to study insight, as existing normative datasets (e.g., Bowden & Jung-Beeman, 2003b) have so far provided only solution rates and reaction times, not the tendency of a particular problem to elicit insight affect.

Interestingly, ratings of aha experience for anagrams were the highest across all problem types, both when averaged over all reported aha experiences and when restricted to problems with correct solutions. This challenges the perception and use of anagrams as noninsight problems (e.g., Gilhooly & Murphy, 2005; Öllinger, Jones, & Knoblich, 2008; Weisberg, 1995b). For instance, Weisberg (1995b) was concerned that anagrams are a simple memory search task, rather than requiring productive thinking, and so are not true insight tasks. The same critique applies to compound remote associates (Cranford & Moss, 2012), which demonstrably elicit both insight-related affect (Jung-Beeman et al., 2004; Salvi et al., 2016) and distinct neurological processes when solved with versus without insight affect (Jung-Beeman et al., 2004; Subramaniam et al., 2009). Interestingly, when the data were analyzed conditionally on accuracy, we found no difference in aha ratings between correctly solved anagrams and compound remote associates, which is congruent with the work of Salvi et al. The high ratings of insight affect for anagrams and compound remote associates may be a consequence of the short solution times required and the single-word, unambiguous solutions, which may have increased the certainty of correct solutions and so heightened the sense of aha (Bowden et al., 2005). Consistent with this, the vignette of Danek et al. (2014a, b) describes insight as being sudden and carrying a sense of certainty about the correctness of the solution. In contrast, classic insight and noninsight problems have more ambiguous problem components and solutions, which require holding more information in mind simultaneously.

Despite noninsight problems being used as controls for insight problems (Ash & Wiley, 2006; DeCaro, Van Stockum, & Wieth, 2016; Fleck, 2008; Murray & Byrne, 2005; Wen, Butler, & Koutstaal, 2013; Wieth & Zacks, 2011), no significant differences between classic insight and noninsight problems emerged in ratings of aha experience for correctly solved problems. Furthermore, aha ratings were actually higher for noninsight than for insight problems when averaged over correct and incorrect responses. This may simply reflect the consistently higher aha ratings for both correctly and incorrectly solved noninsight problems, whereas insight problems elicited insight predominantly for correctly solved problems. These findings are consistent with the thesis that insight problems might be solved incrementally and noninsight problems might be solved insightfully (Bowden, 1997; Danek et al., 2016; Weisberg, 2014). These results call for the use of self-report in all studies investigating insight affect and insight processes (Bowden & Jung-Beeman, 2003a) until the components underlying the phenomenology are better understood.

We investigated the truncated Raven’s Advanced Progressive Matrices (Arthur & Day, 1994) as noninsight problems, examining the tendency for the solution of Raven’s matrices to elicit insight affect. Previous studies have found significant positive relationships between Raven’s matrices and both classic insight problems (Lin et al., 2012; Nęcka et al., 2016; Paulewicz et al., 2007) and the precursor to compound remote associates, the remote associate task (Chermahini et al., 2012; Paulewicz et al., 2007). The relationship between Raven’s and insight problem solving has been argued to reflect the necessity of fluid reasoning for insight problem-solving accuracy (Paulewicz et al., 2007), and we can extend this to note that accuracy is important for high ratings of insight. Moreover, we have provided data indicating that the solution of Raven’s problems can elicit ratings of insight that are not significantly different from those of classic insight problems, which supports a dual-process view of insight problem solving in which insight can be considered a normal process with special add-ons.

On accuracy and insight problem solving

We found positive relationships between accuracy and ratings of aha in presumed insight problems (classic insight problems, anagrams, and compound remote associates), with substantially higher aha ratings for problems with correct solutions. This finding is consistent with the multilevel modeling conducted by Webb, Little, and Cropper (2016b), which showed that insight-related affect (i.e., ratings of aha, confidence, and pleasure) was predictive of solution accuracy. From a processing perspective, this finding supports the idea that the solutions of presumed insight problems appear obvious once the problem space has been restructured. Although this supports the idea that restructuring results in an aha experience (Salvi et al., 2016), it is also commensurate with the idea that aha reflects sudden confidence in an answer that is easily verifiable.

One valuable question raised by the present results (and by previous results; see Danek et al., 2014a, b; Webb et al., 2016a, b) is whether there is a clear distinction between confidence and the aha experience. The overlap between these constructs arises from the language used to talk about insight: descriptions of the aha experience in the literature typically emphasize the “suddenness and obviousness” of the solution (e.g., Bowden & Jung-Beeman, 2003a; Danek et al., 2014a; Kizilirmak, Gomes da Silva, et al., 2016). The retrospective obviousness of the solution is arguably linked to a subjective increase in confidence. However, a high degree of confidence can also arise from slower, analytic problem solving; consequently, the aha experience is distinguished from confidence by its suddenness. This dissociation could be tested using ratings of confidence and aha experience conditional on accuracy across trials: if surprise distinguishes confidence from the aha experience, then as solution accuracy becomes more reliable across trials, feelings of confidence should increase (e.g., Peirce & Jastrow, 1884; Yeung & Summerfield, 2012) while the aha experience decreases. Our present methodology unfortunately does not enable us to make this distinction, since we did not have sufficient control over the probable accuracy of responses.

Performance was positively correlated among classic insight problems, compound remote associates, and anagrams, but not between classic noninsight problems and the contemporary problems (Cinan, Özen, & Hampshire, 2013; Fleck, 2008; Gilhooly & Fioratou, 2009; Gilhooly & Murphy, 2005; Wen et al., 2013; Wieth & Burns, 2000). This pattern could reflect differences in the underlying processes of solving insight problems (i.e., restructuring). However, performance on classic insight and noninsight problems was also positively related (see also Gilhooly & Fioratou, 2009), which could reflect the similarity in the phrasing and presentation of these problems. Finally, performance on anagrams and compound remote associates was related, again likely due to similarities in their structure: both are short verbal problems requiring high crystallized intelligence and verbal fluency. The absence of a relationship with accuracy on Raven’s Advanced Progressive Matrices is consistent with this supposition.

Methodological implications

The present work raises several issues regarding the way insight problem solving is studied. A well-recognized yet pervasive issue in the literature is the use of small numbers of tasks in an experiment (Bowden et al., 2005). For instance, 27 articles in the last decade have used a single insight problem to investigate individual differences in insight problem solving. The rationale for using small numbers of problems is clear: classic insight problems are highly diverse and have low solution rates when less than 10 min is allowed per problem (Bowden et al., 2005). However, the present research highlights the potential problems inherent in using a single classic problem as a test of insightful problem solving: there are large differences in accuracy and in reported insight affect among problem tasks and types. One way to ameliorate these issues is to use contemporary problems, such as compound remote associates and anagrams, which allow a larger number of problems to be tested in a given time period.

It is clear that classic insight problems, anagrams, and compound remote associates alike are able to elicit insight, and arguably all of these problem types require restructuring. However, it is important to note that compound remote associates and anagrams are distinctly different tasks from classic insight problems in their cognitive requirements. For example, verbal overshadowing hampers classic insight tasks (Schooler et al., 1993) but facilitates compound remote associates (Ball & Stevens, 2009). The present findings regarding the ability of compound remote associates and anagrams to elicit strong ratings of insight, particularly upon the correct solution of a problem, reflect the fragmentation of methodology and findings arising from the different approaches to insight research, and indicate a need to consider once again what insight might mean: whether it is reflected by a feeling, a task, or a process.

Although normative data have been provided for many of these problems (e.g., Bowden & Jung-Beeman, 2003b), the data are predominantly reaction times and solution rates. These are necessary statistics, but given the rising interest in insight in problem solving and the unreliability of some problems in eliciting insight (e.g., Danek et al., 2016; Webb et al., 2016b), we offer this study both as an indicator of the reliability of problems in the literature and as a source of problems that reliably elicit strong insight phenomenology.