The origin of memory limitations has been a major topic of debate in working memory research. The classical model of working memory suggests that the visual and auditory modalities have separate, independent stores of limited capacity for short-term maintenance (Baddeley, 2010; Baddeley & Hitch, 1974). The two subsystems, as well as other aspects of working memory, have been studied using a concurrent-task procedure in which participants perform two tasks simultaneously during memory encoding—for example, a digit span task and a reasoning task (Baddeley & Hitch, 1974). In such settings, the recall of verbal materials is disrupted more by a phonological task than by a nonphonological visual task, and the recall of visual materials is disrupted more by a spatial task than by a verbal task (Brooks, 1968). Furthermore, when the participant is instructed to continuously shadow (i.e., repeat aloud) spoken letters during memory encoding, this verbal task disrupts delayed recall of simultaneously heard verbal materials more than delayed recall of simultaneously seen visual material (Kroll, Parks, Parkinson, Bieber, & Johnson, 1970). In addition, a spatial memory task involving letters disrupts concurrent spatial tracking but not performance in a verbal task (Baddeley, 1986), and visual memory is less disturbed than auditory memory by a simultaneous backward-counting task (Scarborough, 1972). These results suggest separate working memory subsystems for visuospatial and verbal/phonological materials. However, the subsystems are not totally independent, since both are controlled by central executive functions, which may explain subtle interference effects. An alternative to separate auditory and visual memory stores is that the limitation arises from some general mechanism or process with a limited capacity, regardless of the sensory modality—for example, from a limited focus of attention (Cowan, 1997, 2011). According to this view, working memory is the part of long-term memory that is activated by the focus of attention (Cowan, 1997, 2011).

A common view suggests that working memory is object-based and that memory or attention capacity is limited to three or four objects (Cowan, 2001). Early studies on visual working memory suggested a purely object-based memory store (Luck & Vogel, 1997, 1998), but later studies have shown that resources compete within a feature dimension while remaining separate across feature dimensions (Wheeler & Treisman, 2002), and that increasing the number of features to be remembered in each object decreases both memory performance and precision (Fougnie, Asplund, & Marois, 2010; Oberauer & Eichenberger, 2013; Olson & Jiang, 2002). Working memory seems not to be purely feature-based, either, since the locations and types of features within objects also affect memory: Features are remembered best when they are in the same spatial region of an object (the same-object advantage), worse when they are in different parts of an object, and worst when they belong to different objects (Fougnie, Cormiea, & Alvarez, 2013; Huang, 2010b; Olson & Jiang, 2002; Xu, 2002). Furthermore, increasing the number of objects to be remembered decreases both memory performance and memory precision, whereas increasing the number of features within each object decreases only precision (Fougnie et al., 2010; Fougnie et al., 2013). Thus, according to these studies, working memory capacity is limited not solely by the number of objects or features, but by the number of objects together with the number, location, and type of features within the memorized objects.

Cumulative evidence from both visual (Anderson & Awh, 2012; Anderson, Vogel, & Awh, 2011; Bays, Catalao, & Husain, 2009; Bays & Husain, 2008; Huang, 2010a; Murray, Nobre, Astle, & Stokes, 2012; Salmela, Lähde, & Saarinen, 2012; Salmela, Mäkelä, & Saarinen, 2010; Salmela & Saarinen, 2013; Wilken & Ma, 2004; Zhang & Luck, 2008) and auditory (Kumar et al., 2013) working memory studies suggests that the memory limitations with a small number of items are due to the precision of stored representations; that is, only a few items can be remembered with high precision, but several items can be remembered with lower precision. The trade-off between memory capacity and the precision of representations has been found for several visual features—for example, spatial location (Bays & Husain, 2008), orientation (Anderson et al., 2011; Bays & Husain, 2008; Salmela & Saarinen, 2013), contour shape (Salmela et al., 2012; Salmela et al., 2010; Zhang & Luck, 2008), and color (Anderson & Awh, 2012; Bays et al., 2009; Huang, 2010a). The precision of auditory working memory has been much less studied, but a similar type of trade-off has been found for tone pitch (Kumar et al., 2013). In vision, it seems that the trade-off between the number of stored items and the precision of representations is driven by stimulus information at the encoding stage, and cannot be volitionally varied according to task demands (Murray et al., 2012; Zhang & Luck, 2011), except when only a few items are memorized (Machizawa, Goh, & Driver, 2012). The precision of encoding can, however, vary trial by trial (Fougnie, Suchow, & Alvarez, 2012; van den Berg, Shin, Chou, George, & Ma, 2012). These studies suggest that working memory resources are shared across either the visual or the auditory representations held in memory. We asked whether the resources can also be shared across representations in different sensory modalities, and whether this sharing affects memory precision.

Resources can be flexibly shared across the visual and auditory modalities during simultaneous memory tasks (Morey, Cowan, Morey, & Rouder, 2011). Some studies have revealed clear capacity costs due to cross-modal sharing of resources (Morey & Cowan, 2004, 2005; Morey et al., 2011; Saults & Cowan, 2007; Vergauwe, Barrouillet, & Camos, 2010), whereas others have not (Cocchini, Logie, Della Sala, MacPherson, & Baddeley, 2002; Fougnie & Marois, 2011). The former studies support the general-resource model, and the latter studies support domain-specific resources. To our knowledge, the costs of a cross-modal task on the precision of memory representations have not been addressed previously, and it is not yet known whether resource allocation across modalities affects the precision of memory representations.

To test resource sharing across modalities, we measured memory precision for two visual and two auditory features. In Experiment 1, both visual features were varied in the same sine-wave gratings, both auditory features were varied in the same sine-wave tones, and the gratings and tones were presented simultaneously (Fig. 1). In Experiment 2, two tones and two gratings were presented sequentially, and only one feature was varied in each object. This manipulation was used to control for the same-object advantage when comparing the intra- and cross-modal results from Experiment 1. In every condition, the stimulus—and hence the perceptual load—was identical, and only the participants’ memory task was varied. The participants had to memorize one to four features.

Fig. 1

Setup of Experiment 1. A two-interval forced choice setup was used to measure delayed discrimination thresholds for the spatial frequency and orientation of gratings and for the pitch and duration of simultaneous tones. The thresholds for the four features were measured separately, simultaneously, and in all possible two- and three-feature combinations. In every condition, the stimuli varied in all four dimensions. A written cue specifying the one to four target features to be remembered was presented only at the beginning of the experiment and between blocks of 36 trials, except in the precue condition with four features, in which one random cue was presented on every trial

Increasing the number of to-be-remembered features can have different effects on memory precision. If working memory contains separate object-specific representations in the visual and auditory modalities, we should find a larger decrease in precision in Experiment 2 than in Experiment 1, due to the increased number of objects. However, if working memory is based on feature representations and domain-general resources, we should find clear decreases in memory precision in both experiments, immediately as the number of features to be remembered increases.

Previous cross-modal memory studies have tested whether sharing resources between modalities decreases memory performance. A decrease in memory task performance has been interpreted as support for domain-general resources, and the absence of a decrease as support for domain-specific resources. If memory resources are indeed domain-specific, then more resources would be available in cross-modal tasks, which in turn should lead to better memory performance in cross-modal than in intramodal tasks. If, on the contrary, memory resources are shared across modalities, then the intramodal and cross-modal conditions should not differ. Hence, in order to test the specificity of memory resources, the question should be turned around: Is there a cross-modal benefit in working memory tasks? We tested this by comparing memory precision in intramodal and cross-modal conditions.

Method

Participants

Five participants with normal or corrected-to-normal vision and without any known hearing deficits participated in the experiments. The first author was one of the participants; the other participants were naive as to the purpose of the study. The participants gave written informed consent, and the experiment was conducted according to the ethical standards of the Declaration of Helsinki and approved by the Ethics Committee of Helsinki and Uusimaa Hospital District.

Equipment and stimuli

The stimuli were sine-wave gratings and tones, as well as visual and auditory white noise. The visual stimuli were presented on a linearized LCD monitor (resolution 1,080 × 1,920 pixels). The size of the display was 28.9 × 48.8 deg at the viewing distance of 56 cm. Head position was held constant with a chinrest. The auditory stimuli were presented via closed headphones (Beyerdynamic 770).

A pair of sine-wave gratings in Gaussian envelopes was presented on the left and right sides of a fixation cross (a black cross on a white background) at 5.4 deg eccentricity. The space constant of the Gaussian was 1.22 deg, and the size of the stimulus was 4.9 deg. The Michelson contrast of the grating was 0.9, and its phase was random. The orientation and spatial frequency of the grating varied between –45 and 45 deg and between 0.5 and 2.0 c/deg, respectively.

Sine-wave tones were presented to both ears. The intensity of the tones was 62 dB SPL, and the tones had 10-ms linear fade-in and fade-out times. The frequency and the duration of the tones varied between 1000 and 1200 Hz and between 200 and 500 ms, respectively.

All of the stimuli were presented, and the sine-wave gratings and tones were generated, with Presentation software (Version 14.0, www.neurobs.com). The white noise was generated with MATLAB (MathWorks, Inc.). The size of the dynamic visual white noise was 14.5 × 24.4 deg, and its root-mean-square contrast was 0.2 (the standard deviation of the luminance divided by the mean luminance). The intensity of the white-noise sound was 33.5 dB SPL.
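The parameters above fully specify the stimuli. For readers who wish to re-create them, the following NumPy sketch is an illustration only, not the original Presentation/MATLAB code; the pixel scale, the audio sampling rate, and the reading of the space constant as the Gaussian standard deviation are our assumptions.

```python
import numpy as np

def gabor(size_px=128, px_per_deg=26, sf_cpd=1.25, ori_deg=0.0,
          contrast=0.9, space_const_deg=1.22, phase=None):
    """Sine-wave grating in a Gaussian envelope; returns values in
    [-contrast, contrast], to be scaled around the mean luminance."""
    if phase is None:                        # random phase, as in the experiment
        phase = np.random.uniform(0, 2 * np.pi)
    half = size_px / 2
    y, x = np.mgrid[-half:half, -half:half] / px_per_deg  # coordinates in deg
    theta = np.deg2rad(ori_deg)
    carrier = np.sin(2 * np.pi * sf_cpd
                     * (x * np.cos(theta) + y * np.sin(theta)) + phase)
    envelope = np.exp(-(x**2 + y**2) / (2 * space_const_deg**2))
    return contrast * carrier * envelope

def tone(freq_hz=1000, dur_ms=350, fs=44100, fade_ms=10):
    """Sine-wave tone with 10-ms linear fade-in and fade-out ramps."""
    t = np.arange(int(fs * dur_ms / 1000)) / fs
    wave = np.sin(2 * np.pi * freq_hz * t)
    n_ramp = int(fs * fade_ms / 1000)
    ramp = np.linspace(0, 1, n_ramp)
    wave[:n_ramp] *= ramp
    wave[-n_ramp:] *= ramp[::-1]
    return wave

def noise_frame(shape=(380, 640), rms_contrast=0.2, mean_lum=0.5):
    """Visual white-noise frame; RMS contrast = SD(luminance) / mean
    luminance (approximately, after clipping to the display range)."""
    frame = mean_lum * (1 + rms_contrast * np.random.randn(*shape))
    return np.clip(frame, 0.0, 1.0)
```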

Procedure—Experiment 1

The precision of working memory representations was quantified as the discrimination threshold. Discrimination thresholds were measured using a two-interval forced choice task with an adaptive 2-down/1-up staircase method, which converges on the 70.7 % correct point. In Experiment 1, each trial began with a fixation cross for 500 ms (see Fig. 1). Then sine-wave gratings and tones were presented with simultaneous onsets. The duration of the grating was always 500 ms, but the durations of the tones varied. During the 3,000-ms memory period, dynamic white noise (30 × 100-ms frames) was presented on the screen. Throughout the measurement, white noise was also delivered to the headphones. After the memory period, a second stimulus interval was presented (Fig. 1). The spatial frequency and orientation of the visual stimulus and the pitch and duration of the auditory stimulus in the second interval were always either increased or decreased, relative to the first interval. The amount of change was either random or determined by the participant’s responses. After the second stimulus interval, the participants were asked to compare one feature of the second stimulus with that of the first stimulus. The participant’s task was to report whether (1) the grating in the first interval or the grating in the second interval had a higher spatial frequency; (2) the tone in the first interval or the tone in the second interval had a higher pitch; (3) the grating was rotated to the left or to the right between the intervals (i.e., whether the orientation was more clockwise in the first or in the second interval); or (4) the tone had a longer duration in the first or in the second interval. The difference between the intervals in the tested feature was then adjusted according to the participant’s response: After an incorrect answer the difference was increased, and after two consecutive correct answers it was decreased. The step size was 0.037 c/deg, 4 Hz, 2.5 deg, or 10 ms, for spatial frequency, pitch, orientation, or tone duration, respectively. Until the second reversal, the step sizes were 2.5 times as large. The final threshold estimate was calculated as the average of the last four reversal points. Each threshold was measured with 36 trials, and the average number of reversal points was ten (range: four to 16).
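For concreteness, here is a minimal sketch of the adaptive rule described above, assuming a generic `respond(delta)` callback that returns True when the participant answers correctly (the callback and variable names are ours; the original experiment ran in Presentation):

```python
def staircase(start_delta, step, respond, n_trials=36):
    """2-down/1-up staircase converging on ~70.7 % correct.

    Two consecutive correct answers decrease the feature difference
    (delta); one incorrect answer increases it. Until the second
    reversal the step is 2.5 times larger, and the threshold is the
    mean of the last four reversal points, as in the experiment.
    """
    delta, streak, last_dir = start_delta, 0, 0
    reversals = []
    for _ in range(n_trials):
        if respond(delta):
            streak += 1
            if streak < 2:
                continue                      # wait for a second correct answer
            streak, direction = 0, -1         # two correct -> make task harder
        else:
            streak, direction = 0, +1         # one error -> make task easier
        if last_dir and direction != last_dir:
            reversals.append(delta)           # direction change = reversal
        last_dir = direction
        eff_step = step * 2.5 if len(reversals) < 2 else step
        delta = max(delta + direction * eff_step, step)
    tail = reversals[-4:]                     # threshold = mean of last reversals
    return sum(tail) / len(tail)

# e.g., pitch: th = staircase(start_delta=40.0, step=4.0, respond=observer)
```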

The thresholds were measured for each feature separately, for all four features simultaneously, and for each combination of two and three features. The participants were instructed to memorize one to four features, but only one feature, selected at random, was tested on each trial. The conditions with two, three, and four features to be memorized were conducted in blocks of 36 trials, and the participants were allowed a break between blocks. Hence, the total length of a condition varied from 36 trials (one feature) to 144 trials (four features). The one- and four-feature conditions were repeated three times. At the beginning of each condition and of each block, a written cue was shown to remind the participant of the feature or features to be memorized. The feature tested on a given trial was indicated by a written cue at the response stage, after the second stimulus interval (Fig. 1).

For a baseline discrimination threshold with one feature to be remembered, each participant’s performance was measured for each feature with a short, 200-ms memory period. In the additional precue condition, memory for four features was measured, but a cue was presented at the beginning of each trial. This condition was similar to the one-feature condition, except that the feature to be remembered switched on every trial. In total, the performance of each participant was measured in 60 memory blocks (2,160 trials) and 12 baseline blocks (432 trials). The measurements were carried out in several 1- to 2-h sessions on different days. The whole experiment took 6–8 h per participant.

Procedure—Experiment 2

Experiment 2 was identical to Experiment 1, except that in Experiment 2 four separate objects—each containing only one varied feature—were presented sequentially. The visual objects were (1) a grating with a fixed orientation (vertical) and variable spatial frequency, presented on the left side of the fixation, and (2) a grating with a fixed spatial frequency (1.25 c/deg) and variable orientation, presented on the right side of the fixation. The auditory objects were (1) a 500-ms sine-wave tone with variable pitch, presented to the left ear, and (2) an 800-Hz sine-wave tone with variable duration, presented to the right ear. The spatial locations of the gratings and tones (left/right) were randomized across participants. In the first interval, the stimuli were presented sequentially with a 500-ms interstimulus interval. The order of the two tones and the two gratings was random, but they always alternated—that is, tone–grating–tone–grating. A partial-report procedure was used: The second interval contained only the one object whose feature was to be tested on that trial. The memory duration was always 3,000 ms, and hence the duration of the noise mask varied from 1,500 to 3,000 ms, depending on the serial position of the feature in the first interval. Conditions with one, two, and four features to be memorized were measured. In total, each participant completed 28 memory blocks (1,008 trials) in one 2-h session. All participants completed Experiment 1 before Experiment 2.

Results

Memory precision in Experiment 1

On average, across all features and participants, the discrimination thresholds in Experiment 1 increased rapidly as the number of features to be remembered increased from one to two (Fig. 2a). The thresholds for two features were twice as high as those for a single feature. When the number of features was increased further, performance decreased further, and the threshold for storing four features was 2.5 times the one-feature threshold (Fig. 2a). The results for each feature separately were virtually identical to the average results. Memory precision, measured separately for spatial frequency, pitch, orientation, and tone duration, first decreased rapidly and then reached an asymptote (Fig. 2b). Four separate repeated measures analyses of variance (ANOVAs) showed that the threshold increases associated with increasing the number of features to be remembered were statistically significant for spatial frequency [F(3, 12) = 13.886, p < .001], pitch [F(3, 12) = 10.244, p = .001], orientation [F(3, 12) = 5.95, p = .010], and duration [F(3, 12) = 7.012, p = .006].

Fig. 2

Effect of memory load on precision in Experiment 1. (a) Average increase of thresholds as a function of memory load. The thresholds for four different features are normalized to one-feature conditions. The solid line depicts an asymptotic function fit to the data, and error bars depict standard errors of the means. (b) Results for the four features. The solid lines depict an asymptotic function fit to the data, and error bars depict standard errors of the means. The dotted lines depict baseline results without memory load, and the dashed lines depict a precue/attention-switching condition. SF = spatial frequency

To quantify the effect of memory load on discrimination thresholds, an asymptotic function was fitted to the data:

$$ Th = \frac{1}{n}\,Th_{n=1} + \left(1 - \frac{1}{n}\right) As, $$
(1)

where n is the number of stored features, Th_{n=1} is the threshold with one item, and As is a constant corresponding to the asymptote of the function. The function predicts, on the basis of only the single-feature thresholds (Th_{n=1}) and resource sharing (1/n), the thresholds for two to four features. The asymptote (As) was the only free parameter, and was a scaling factor without any effect on the shape of the function. The function fit the data very well. The best fit for the average data (Fig. 2a) was obtained with an asymptote of 3.02 (R² = .98). For the separate features (Fig. 2b), the best fits were obtained for asymptotes of 0.62 c/deg, 50.19 Hz, 21.78 deg, and 98.70 ms, for spatial frequency, pitch, orientation, and tone duration, respectively.
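Since Th_{n=1} is measured directly and the 1/n sharing term is fixed, fitting Eq. 1 reduces to one-parameter least squares. A minimal SciPy sketch (the threshold values below are made-up placeholders, not the reported data):

```python
import numpy as np
from scipy.optimize import curve_fit

def resource_model(n, asym, th1):
    """Eq. 1: Th = (1/n) * Th_{n=1} + (1 - 1/n) * As."""
    return th1 / n + (1.0 - 1.0 / n) * asym

n = np.array([1.0, 2.0, 3.0, 4.0])           # number of memorized features
th = np.array([12.0, 24.0, 28.0, 30.0])      # hypothetical thresholds (e.g., Hz)
th1 = th[0]                                  # measured single-feature threshold

# As is the only free parameter; th1 is held fixed at the measured value
popt, _ = curve_fit(lambda n, asym: resource_model(n, asym, th1), n, th)
pred = resource_model(n, popt[0], th1)
r2 = 1.0 - np.sum((th - pred) ** 2) / np.sum((th - th.mean()) ** 2)
print(f"As = {popt[0]:.2f}, R^2 = {r2:.2f}")
```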

The precision with a single feature was close to the perceptual precision (Fig. 2b, dotted lines) measured with the short (200-ms) retention interval. The single-feature memory threshold for orientation was slightly higher than the baseline threshold [t(4) = 3.052, p = .038], but for the other features the thresholds did not differ [ts(4) = 0.739–0.778, ps > .48 in all cases]. The baseline thresholds were also lower than the two-feature memory thresholds for spatial frequency, orientation, and pitch [ts(4) = 4.007–5.503, ps < .016 in all cases], but not for duration [t(4) = 2.346, p = .079].

Switching attention between features on every trial (Fig. 2b, dashed lines) had a relatively small effect on memory precision, since the thresholds in the precue condition were between the thresholds for one and two items. The thresholds in the precue condition were not higher than those in the single-feature memory condition [ts(4) = –0.290 to 1.995, ps > .117 in all cases]. The precue thresholds were also not lower than the two-feature memory thresholds [ts(4) = 1.118–1.353, ps > .248 in all cases], except for pitch [t(4) = 3.189, p = .033]. In comparison to the baseline thresholds, the precue thresholds were higher than baseline for spatial frequency [t(4) = 3.008, p = .040] and orientation [t(4) = 3.509, p = .025], but not for pitch [t(4) = 2.443, p = .071] and duration [t(4) = 0.722, p = .51].

Effect of modality in Experiment 1

Increasing the number of features to be remembered increased the discrimination thresholds—that is, it decreased memory precision (Fig. 2). This suggests that memory resources are shared across the items currently stored in memory. To test whether this resource sharing depends on the modalities of the features, the data were reanalyzed according to modality. If memory resources are domain-specific, there should be a cross-modal benefit in memory precision.

On average, across all features and participants, the discrimination thresholds were slightly higher in the cross-modal than in the intramodal conditions (Fig. 3a). Orientation discrimination thresholds were higher in the cross-modal condition than in the intramodal condition (Fig. 3b) when two features were to be remembered [t(4) = 4.263, p = .013], but not when three features were to be remembered [t(4) = 2.059, p = .109]. For spatial frequency, tone pitch, and tone duration, the thresholds did not differ significantly between the intramodal and cross-modal conditions [ts(4) = –0.826 to 0.595, ps > .445 in all cases] (Fig. 3b). Importantly, none of the conditions showed a cross-modal benefit, which would have appeared as lower thresholds in the cross-modal than in the intramodal conditions.

Fig. 3

Effect of sensory modality in Experiment 1. (a) Average change of threshold due to modality. The thresholds for four different features are normalized to conditions with one feature to be remembered. Error bars depict standard errors of the means. (b) Results for spatial frequency (SF), pitch, grating orientation (OR), and tone duration (DUR). In intramodal conditions, all or most attention was directed within a single modality, and in cross-modal conditions, attention was divided across modalities. One-feature conditions refer to trials in which a single feature was to be memorized. Intramodal two-feature conditions refer to trials in which the other feature was of the same modality (e.g., SF threshold measured together with OR threshold). Cross-modal two-feature conditions refer to trials in which the other feature was of a different modality (e.g., SF threshold measured together with DUR threshold). Intramodal three-feature conditions refer to trials in which two features shared a modality, and the third feature was from a different modality (e.g., SF and OR thresholds measured together with pitch threshold). Cross-modal three-feature conditions refer to trials in which the given feature had to be remembered with two features from a different modality (e.g., SF threshold measured with pitch and DUR thresholds), and four-feature conditions refer to trials in which all four features were to be memorized. The one- and four-feature data are actually intramodal and cross-modal points, but they are plotted as reference lines for visual clarity

Memory precision in Experiment 2

In Experiment 2, four objects, each containing only one varied feature, were presented sequentially. The results were almost identical to those from the first experiment, in which stimuli were presented simultaneously. Again, the discrimination thresholds increased rapidly as the number of features to be remembered increased from one to two, and the thresholds reached an asymptote when four features were to be remembered (Fig. 4). The increases of the thresholds were statistically significant for spatial frequency [F(2, 8) = 9.994, p = .007], orientation [F(2, 8) = 40.0552, p < .001], and duration [F(2, 8) = 16.848, p = .001], but not for pitch [F(2, 8) = 2.671, p = .129]. Pitch discrimination thresholds showed only a shallow increase as a function of memory load. The best fit of the asymptotic function (Eq. 1) for the average data (Fig. 4a) was obtained with an asymptote of 3.24 (R² = .99). For each feature separately (Fig. 4b), the best fits were obtained for asymptotes of 0.54 c/deg, 33.67 Hz, 23.24 deg, and 103.11 ms, for spatial frequency, pitch, orientation, and tone duration, respectively.

Fig. 4

Effect of memory load on precision in Experiment 2. (a) Average increase of thresholds as a function of memory load. The thresholds for four different features are normalized to one-feature conditions. The solid line depicts an asymptotic function fit to the data, and error bars depict standard errors of the means. (b) Results for the four features. The solid lines depict an asymptotic function fit to the data, and error bars depict standard errors of the means. The dotted lines depict the baseline results without memory load measured in Experiment 1. SF = spatial frequency

Effect of modality in Experiment 2

On average, the discrimination thresholds were identical in the cross-modal and intramodal conditions of Experiment 2 (Fig. 5a). When analyzed separately for each feature, the discrimination thresholds for tone pitch were higher in the cross-modal than in the intramodal condition [t(4) = 4.604, p = .010; Fig. 5b]. The tone pitch threshold while memorizing two auditory features was identical to the threshold for tone pitch memorized alone, and the pitch thresholds were also identical when two or four cross-modal features were to be memorized (Fig. 5b). The orientation, spatial frequency, and tone duration discrimination thresholds did not differ significantly [ts(4) = –1.796 to 0.243, ps > .147 in all cases] in intramodal and cross-modal conditions (Fig. 5b). Again, none of the conditions showed any cross-modal benefit.

Fig. 5

Effect of sensory modality in Experiment 2. (a) Average change of threshold due to modality. The thresholds for four different features are normalized to conditions with one feature to be remembered. Error bars depict standard errors of the means. (b) Results for spatial frequency (SF), pitch, grating orientation (OR), and tone duration (DUR). See the Fig. 3 caption for more details

Comparison of Experiments 1 and 2

To statistically test the difference between Experiments 1 and 2 (with simultaneous and sequential stimulus presentations, respectively), and the effects of different numbers and types of features to be remembered, a three-way repeated measures ANOVA, with the factors Experiment, Number of Features, and Feature Type, was conducted on the normalized threshold values. The main effects of experiment [F(1, 4) = 0.209, p = .671] and feature type [F(3, 12) = 2.569, p = .103] were not statistically significant, but the effect of number of features was [F(2, 8) = 15.980, p = .002], as we had already observed in the previous analyses. The interactions were not significant [Experiment × Feature Type, F(3, 12) = 1.213, p = .347; Experiment × Number of Features, F(2, 8) = 0.519, p = .614; Experiment × Feature Type × Number of Features, F(6, 24) = 1.273, p = .307], except for the Feature Type × Number of Features interaction [F(6, 24) = 3.166, p = .020]. This interaction was due to the tone pitch thresholds, which increased significantly as a function of memory load in the first experiment, but not in the second.
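As an illustration of this analysis, a repeated measures ANOVA with this factor structure can be run with statsmodels’ AnovaRM; the data frame below is a synthetic stand-in with the same design (participant × experiment × load × feature type), not the actual data:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Synthetic stand-in: one normalized threshold per participant x
# experiment x load x feature cell (values are made up).
rng = np.random.default_rng(1)
rows = [dict(participant=p, experiment=e, load=n, feature=f,
             thresh=1 + 1.5 * (1 - 1 / n) + rng.normal(0, 0.2))
        for p in range(5)
        for e in ("simultaneous", "sequential")
        for n in (1, 2, 4)
        for f in ("SF", "pitch", "OR", "DUR")]
df = pd.DataFrame(rows)

res = AnovaRM(df, depvar="thresh", subject="participant",
              within=["experiment", "load", "feature"]).fit()
print(res)  # F and p values for main effects and interactions
```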

Two-way ANOVAs were conducted separately for each feature to test the effects of modality (two intramodal features vs. two cross-modal features) and experiment (simultaneous vs. sequential presentation). Only the conditions in which two features were to be memorized were included in the analysis, because only in those conditions were the intramodal and cross-modal situations clearly defined. When the orientation of the grating was to be remembered, the Modality × Experiment interaction was statistically significant, but no other effects of modality or experiment were (Table 1). The interaction in orientation discrimination was due to an effect of modality that was found in Experiment 1 but not in Experiment 2.

Table 1 Effects of cross-modality and experimental conditions

Discussion

We tested whether working memory resources are shared across representations in the visual and auditory modalities. Our results showed that memory precision declined immediately as the number of features to be remembered increased beyond one. This decline did not depend on the modality of the features (visual/auditory), on the feature type (spatial frequency, orientation, pitch, or duration), or on whether the features belonged to simultaneously or sequentially presented objects. Importantly, we observed no benefit of cross-modality: Precision was not better in cross-modal than in intramodal conditions. Hence, the present results support a general-resource model of working memory rather than models containing modality-specific and object-based memory representations. The results suggest that working memory capacity is limited to only a few precise memory representations, and that memory resources can be shared across sensory modalities with minimal costs in precision.

Working memory performance and capacity have mostly been studied by varying the number of items in the stimulus. In the present study, however, the stimulus was always the same—it always contained a sine-wave grating and a sine-wave tone—and only the number of features to be memorized was varied. In agreement with several previous studies (Anderson & Awh, 2012; Anderson et al., 2011; Bays et al., 2009; Bays & Husain, 2008; Kumar et al., 2013; Salmela et al., 2012; Salmela et al., 2010; Salmela & Saarinen, 2013; Wilken & Ma, 2004; Zhang & Luck, 2008), the present results show that memory performance declines immediately as memory requirements increase. To our knowledge, the present study is the first to show that memory precision also declines in cross-modal conditions. For example, memory precision for visual orientation declined when the pitch of a tone had to be memorized simultaneously.

The present results are not compatible with working memory models that include object-based memory representations. According to object-based models, our memory task should have been trivial, since only a few objects were required to be held in memory. If working memory indeed has a capacity to store several objects, the task for two simultaneous objects should have been very easy, and there should not have been strong effects of increasing the number of items on memory precision. Instead, we found a clear and asymptotic increase in discrimination thresholds as a function of memory load. Previous studies have modeled the decrease of memory precision with different types of functions. The slots + averaging model of working memory suggests that when the number of items to be remembered is less than the number of slots, multiple samples of each item can be stored, and the memory precision is proportional to the square root of the number of samples (Palmer, 1990; Zhang & Luck, 2008). Likewise, resource-based models suggest a power-law relation between precision and the number of items to be remembered (Bays & Husain, 2008). Our asymptotic function is compatible with both of these accounts, but is slightly steeper in shape.
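The three accounts differ in how the normalized threshold should scale with load, and a short sketch makes the comparison concrete (the power-law exponent and the asymptote below are illustrative assumptions, not fitted values):

```python
import numpy as np

n = np.arange(1.0, 5.0)                      # memory load (1-4 features)

# Eq. 1: threshold rises from 1 toward the asymptote As
eq1 = 1.0 / n + (1.0 - 1.0 / n) * 3.0        # As = 3 (normalized units)

# slots + averaging: precision grows with sqrt(samples per item), so the
# normalized threshold grows as sqrt(n) while n <= the number of slots
slots = np.sqrt(n)

# resource power law: threshold proportional to n**k (illustrative k)
power = n ** 0.75

for row in zip(n, eq1, slots, power):
    print("n=%.0f  Eq.1=%.2f  slots=%.2f  power=%.2f" % row)
```

At small loads, Eq. 1 doubles the threshold already at n = 2, rising more steeply than either alternative before flattening toward its asymptote, which matches the rapid initial threshold increase reported above.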

In the present study, we did not find a same-object advantage (Fougnie et al., 2013; Huang, 2010b; Olson & Jiang, 2002; Xu, 2002), and all of the features were remembered equally well when they were either within two object pairs presented simultaneously or within four objects presented sequentially. For tone pitch, in fact, memory precision was better in the sequential condition, showing a same-object disadvantage. One difference between our experiments and previous studies using multiple features within visual objects was that we used narrow-band features. Our gratings were in a Gaussian envelope, and thus contained information only in narrow spatial-frequency and orientation bands. In contrast, previous studies have used sharp-edged stimuli (e.g., a square box with diagonal stripes) containing information across all spatial frequencies and multiple orientations. Perhaps the same-object advantage is only present when using broadband stimuli. Another possibility is that the same-object advantage requires a larger number of features within each object. In our setup, we had only one or two features in each object to be memorized.

The present results contrast with the classical working memory model (Baddeley & Hitch, 1974) and with studies proposing independent, modality-specific memory subsystems. Baddeley’s model contains visuospatial and phonological stores. However, instead of phonological stimuli, we used auditory tones, and it is therefore possible that the tones were not stored in the same store as phonological material. Our experimental design was also quite different from those of previous concurrent-memory-task studies that have found modality-specific effects (Baddeley, 1986; Brooks, 1968; Cocchini et al., 2002; Kroll et al., 1970; Scarborough, 1972). In those studies, the stimuli were typically word lists or numbers (and therefore involved semantic content), the tasks were continuous, and the responses were collected with free recall. In our study, we used a very specific setup with low-level visual and auditory features, and directly tested whether resources could be shared across modalities. Another critical difference was that we used an adaptive method in our measurements. The adaptive method aims to keep task difficulty at the same level in all conditions and avoids possible ceiling and floor effects. The decrease in memory precision as a function of memory load was the only general cost across all features that we found. This cost did not depend on whether the features were in the same or in different modalities, or on whether they were presented simultaneously or sequentially. In other words, the cost of adding a cross-modal feature to the memory task was identical to the cost of adding an intramodal feature. Thus, we observed costs due to multitasking (i.e., remembering multiple features simultaneously), but these costs did not depend on the modality.

In the present study, we wanted to test whether working memory resources are modality-specific or domain-general by using similar intramodal and cross-modal tasks. The key prediction of modality-specific resources is that in cross-modal conditions, performance should be improved, due to an increase in available resources. Contrary to this prediction, we did not find a clear cross-modal benefit in any condition, and thus our results suggest that the factor that limits the precision of working memory representations is domain-general. Our results are thus partly in agreement with the “focus of attention” model of working memory (Cowan, 2011). However, our results suggest that this domain-general resource is not object-based, but depends instead on the precision of representations. This suggests that the “size” of the attention focus is inversely related to the precision of representations. Previous studies using various combinations of multiple-object tracking, visuospatial, verbal, and semantic working memory tasks have found domain-general and modality-specific, as well as attention-related, impairments in working memory performance (Fougnie & Marois, 2006; Morey & Cowan, 2004, 2005; Saults & Cowan, 2007). Thus, it seems that with complex memory tasks, different types of interference can be found. With our simple adaptive-memory task, we found only domain-general effects. It might be argued that our results reflect the allocation of resources to modality-specific memory stores by executive operations. This is not likely to be the case, however, since our task was very simple—maintaining one bimodal object (or one to two auditory and one to two visual objects) in memory—and therefore demanded very little executive control. Furthermore, we measured the precision of memory representations, which presumably taps into the contents of memory stores.

In recent years, visual working memory has been studied with recall tasks in which the participant selects, from a continuous set of values (e.g., a color wheel), the feature value that most resembles the memorized item (e.g., a green color) (Wilken & Ma, 2004). From the distribution of the participant’s choices, both the probability that the item was stored (i.e., that the response was not random) and the precision of memory (i.e., the amount of error) can be quantified (Zhang & Luck, 2008). The latter value may describe memory precision more accurately, since random responses—for example, due to memory failures—have been removed. The division of responses into two distributions (remembered items and random responses) is not always necessary, since random responses in a recall task may be absent for up to at least six items (Salmela et al., 2012; Salmela & Saarinen, 2013). Nevertheless, separating these two values is most critical when the number of to-be-remembered items is large. In the present study, we used only a small number of features (one to four), and hence our precision estimates are reliable.
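For comparison with the delayed-discrimination approach used here, the standard analysis of such recall data fits a mixture of a von Mises distribution (responses centered on the target) and a uniform distribution (guesses), in the spirit of Zhang and Luck (2008). A minimal sketch for a circular feature; the simulated data and starting values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import vonmises

def neg_log_lik(params, errors):
    """Mixture: with probability p_mem the response error is von Mises
    around zero (concentration kappa); otherwise it is uniform."""
    p_mem, kappa = params
    like = p_mem * vonmises.pdf(errors, kappa) + (1 - p_mem) / (2 * np.pi)
    return -np.sum(np.log(like))

def fit_mixture(errors):
    """errors: response minus target, in radians on (-pi, pi]."""
    res = minimize(neg_log_lik, x0=[0.8, 5.0], args=(errors,),
                   bounds=[(0.01, 1.0), (0.01, 200.0)])
    return res.x                              # (p_mem, kappa); kappa ~ precision

# Simulated observer that stores the item on 90 % of trials:
rng = np.random.default_rng(0)
stored = rng.random(500) < 0.9
err = np.where(stored, rng.vonmises(0.0, 8.0, 500),
               rng.uniform(-np.pi, np.pi, 500))
print(fit_mixture(err))                       # approximately [0.9, 8.0]
```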

It has been suggested that capacity limitations are fundamentally different for audiovisual stimuli, such that audiovisual integration is limited to only one object (Van der Burg, Awh, & Olivers, 2013). Although integration was not required in our task, a limited capacity for audiovisual processing might also have had an effect in our setup. It is possible that auditory and visual objects are automatically integrated when the objects are presented simultaneously (Degerman et al., 2007). However, our participants were able to perform the task well above chance level (since the adaptive method kept performance at 70.7 % correct), even when four features were to be remembered. It has previously been shown that in a difficult visual memory task, only one item can be remembered precisely (Salmela et al., 2010). Instead of a fixed capacity for one object, these results suggest that when the task requires high precision, all resources may be allocated to maintaining a single representation.

There is no clear functional explanation for why working memory resources should be allocated flexibly across domains. One possibility is that the resources are dynamically distributed across all representations in order to optimize the usage of limited capacity. An ideal-observer analysis based on optimal information transfer does indeed explain many properties of working memory (Sims, Jacobs, & Knill, 2012). Recently, it has been suggested that the limited capacities of attention and memory arise from cognitive “maps”—neural mechanisms analogous to retinotopic maps in the visual cortex (Franconeri, Alvarez, & Cavanagh, 2013). According to this model, the total capacity limit would be the size of the cognitive map, and flexible allocation of resources and interference between memory items would be inherent properties of these maps. Our results are compatible with this model: A common cognitive map containing representations from both the visual and auditory modalities would explain the similarity of the present results in the intramodal and cross-modal conditions.

If working memory limitations are indeed due to cognitive maps, it remains unclear how resources are shared across memory representations. The present and previous (Bays & Husain, 2008; Salmela et al., 2010; Salmela & Saarinen, 2013) results suggest that memory resources are allocated evenly to each item. This type of optimal resource sharing could also be understood in terms of normalization, a common neural mechanism in perceptual systems (Carandini & Heeger, 2012); for example, photoreceptor responses in the retina and neuronal responses in primary visual cortex are normalized by dividing them by the responses of the surrounding receptors or neurons (corresponding to the prevailing mean luminance or mean contrast, respectively). Normalization optimizes the dynamic operating ranges of receptors and neurons for a variety of environmental conditions. The normalization model has been suggested to explain several visual-attention effects: For example, due to normalization, sensitivity to low-contrast target stimuli is increased and the effect of distractors is reduced (Reynolds & Heeger, 2009). Recently, it has been shown that noise in a neural model containing divisive normalization and population coding can account for errors in memory precision (Bays, 2014). For cross-modal representations, working memory could operate by normalizing resources across the representations within the cognitive map: Each working memory representation would always be allocated 1/n of the total resources, and thus would be stored at the expense of the other representations on the map.
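A toy sketch of this idea: divisive normalization assigns each representation its share of the pooled drive across the map, and with equal drive that share is exactly 1/n, which reproduces Eq. 1 (the semisaturation constant sigma and the asymptote value are illustrative assumptions):

```python
import numpy as np

def normalized_shares(drive, sigma=0.0):
    """Divisive normalization: each item's resource share is its drive
    divided by the pooled drive across the cognitive map."""
    drive = np.asarray(drive, dtype=float)
    return drive / (sigma + drive.sum())

def threshold_from_share(share, th1=1.0, asym=3.0):
    """Eq. 1 written per item: with equal drive, share = 1/n."""
    return share * th1 + (1.0 - share) * asym

for n in range(1, 5):
    share = normalized_shares(np.ones(n))[0]      # equal drive -> 1/n each
    print(n, round(threshold_from_share(share), 2))  # 1.0, 2.0, 2.33, 2.5
```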

In conclusion, our results suggest that working memory resources are shared across sensory modalities and that the precision of memory representations determines the amount of information that can be retained at a time. For every tested visual and auditory feature, the resource-sharing function fit the data very well. Resource sharing could be understood as normalization of resources across memory representations, in order to optimize the usage of limited capacity.