1 Introduction

Hundreds of thousands of studies, case reports, or patient records, capture observations in human neuroscience, basic or clinical. Statistical analysis of this large amount of data could provide new insights. Unfortunately, most of the spatial information that these data contain is difficult to extract automatically, because it is hidden in unstructured text, in sentences such as: “[...] in the anterolateral temporal cortex, especially the temporal pole and inferior and middle temporal gyri” [1].

This data cannot be processed easily by a machine, as a machine does not know where the temporal cortex is. As we will show, simply looking up such terms in atlases does not suffice. Indeed, even atlases disagree [2]. Furthermore, joint processing of many reports faces varying terminologies, with regions represented in different atlases that differ and overlap. Finally, not all terms in a report carry the same importance, and practitioners use terms that are not the exact labels of any atlas. Coordinate-based meta-analyses capture the spatial distribution of a term from the literature [3, 4], but they also lack a model to combine terms.

Here, we propose to map case reports automatically to the brain locations that they discuss: we learn mappings of anatomical terms to brain regions from medical publications. We propose a new learning framework for translating anatomical terms to brain images – a process that we call “encoding”. We learn such a mapping, quantify its performance, and compare possible choices of representation of spatial data. We then show in a proof of concept that our model can predict the brain area for textual case reports.

2 Methods: Formalizing Text-to-Brain-Map Translation

2.1 Problem Setting: From Text to Spatial Distributions

We want to predict the likelihood of the location of relevant brain structures described in a document. For this purpose, we perform supervised learning on a corpus of brain-imaging studies, each containing: (i) a text, and (ii) the locations – i.e. the stereotactic coordinates – of its observations. Indeed, Functional Magnetic Resonance Imaging (fMRI) studies report the coordinates of activation peaks (e.g., [5, Table 1]), and Voxel Based Morphometry (VBM) analyses report the location of differences in gray matter density (e.g., [1, Table 2]). Following neuroimaging meta-analyses [3], we frame the problem in terms of spatial distributions of observations in the brain. In a document, observed locations \(\mathcal {L} = \{l_a \in \mathbb {R}^3, a = 1 \dots c\}\) are sampled from a probability density function (pdf) p over the brain. Our goal is to predict this pdf p from the text \(\mathcal {T}\). We denote q our predicted pdf. A predicted pdf q should be close to p, or take high values at the coordinates actually reported in the study: \(\prod _{l \in \mathcal {L}} q(l)\) must be large. In a supervised learning setting, we start from a collection of studies \(\mathcal {S} = (\mathcal {T}, \mathcal {L})\), with \(\mathcal {T}\) the text and \(\mathcal {L}\) the locations. Building the prediction engine then entails the choice of a model relating the predicted pdf p to the text \(\mathcal {T}\), the choice of a loss, or data-fit term, and some regularization on the model parameters. We now detail how we make each of these choices to construct a prediction.

Model. We start by modelling the dependency of our spatial pdf q on the study text \(\mathcal {T}\). This entails both choosing a representation for q and writing it as a function of the text. While q is defined on a subvolume of \(\mathbb {R}^3\), the brain volume, we build it using a partition to work on a finite probably space: this can be either a regular grid of voxels or a set of anatomical regions (i.e. an atlas) \(\mathcal {R} = \{\mathcal {R}_k, k=0\dots m\}\). As such a partitioning imposes on each region to be homogeneous, q is then formally written on \(\mathbb {R}^3\) in terms of the indicator functions of the partsFootnote 1: \(\{r_k = \frac{\mathbb {I}_k}{\Vert \mathbb {I}_k\Vert _1}, ~ k = 1 \dots m\}\). Importantly, the volume of each part \(\Vert \mathbb {I}_k\Vert _1\) appears as a normalization constant.

To link q to the text \(\mathcal {T}\) of the study, we start by building a term-frequency vector representation of \(\mathcal {T}\), which we denote \(\varvec{x} \in \mathbb {R}^d\). d is the size of our vocabulary of English words \(\mathcal {W} = \{w_t\}\), and \(\varvec{x}_t\) is the frequency of word \(w_t\) in the text. We assign to each atlas region a weight that depends linearly on \(\varvec{x}\):

$$\begin{aligned} q(z) = \sum _{t=1}^d \sum _{k=1}^m \varvec{x}_t \varvec{\beta }_{t,k}r_k(z) \quad \forall z \in \mathbb {R}^3 \end{aligned}$$
(1)

where \(\varvec{\beta } \in \mathbb {R}^{d \times m}\) are model parameters, which we will learn.

Using an atlas is a form of regularization: constraining the prediction to be in the span of \(\{r_k\}\) reduces the size of the search space. Fine partitions, e.g. atlases with many regions or voxel grids, yield models with more expressive power, but more likely to overfit. Choosing an atlas thus amounts to a bias-variance tradeoff.

Label-Constrained Encoder. A simple heuristic to turn a text into a brain map is to use atlas labels and ignore interactions between terms. The probability of a region is taken to be proportional to the frequency of its label in the text. The vocabulary is then the set of labels: \(d = m\). As the word \(w_k\) is the label of \(\mathcal {R}_k\), \(\varvec{\beta }\) is diagonal. For example, for a region \(\mathcal {R}_k\) in the atlas labelled “parietal lobe”, the probability on \(\mathcal {R}_k\) depends only on the frequency of the phrase “parietal lobe” in the text. We call this model label-constrained encoder.

2.2 Loss Function: Measuring Errors on Spatial Distributions

Strategy. We will fit the coefficients \(\varvec{\beta }\) of our model, see Eq. (1), by minimizing a risk \(\mathcal {E}(p, q)\): the expectation of a distance between p and q.

A Plugin Estimator of p . We do not have access to the true pdf, p; we need a plugin estimator, which we denote \(\hat{p}\). By construction of our prediction q, the best approximation of p we can hope for belongs to the span of our regions \(\{r_k\}\). Hence, we build our estimator \(\hat{p}\) in this space, setting the probability of a region to be proportional to the number of coordinates that fell inside it:

$$\begin{aligned} \hat{p} = \sum _{k=1}^m \frac{|\{a, \mathbb {I}_k(l_a) = 1\}|}{c} r_k ~ = ~ \sum _{k=1}^m \frac{1}{c} \sum _{a=1}^c\mathbb {I}_k(l_a) r_k ~ \triangleq ~ \sum _{k=1}^m \hat{\varvec{y}}_k r_k. \end{aligned}$$
(2)

When regions are voxels, there are too many regions and too few coordinates. Hence we use Gaussian Kernel Density Estimation (KDE) to smooth the estimated pdfFootnote 2. Our supplementary material details a fast KDE implementation.

Choice of \(\mathcal{E}\). We use two common distance functions for our loss. The first is Total Variation (TV), a common distance for distributions. Note that p defines a probability measure on the finite sample space \(\mathcal {R}\), \(\mathcal {P}(\mathcal {R}_k)= \int _{\mathcal {R}_k} p(z) dz\), where \(\mathcal {R} = \{\mathcal {R}_k, k=1 \ldots m\}\) and \(\mathcal {R}_k = \mathrm{supp}(r_k)\). q defines Q in the same way. Then,

$$\begin{aligned} \mathrm{TV}(\mathcal {P}, \mathcal {Q}) = \sup _{\mathcal {A} \subset \mathcal {R}}|\mathcal {P}(\mathcal {A}) - \mathcal {Q}(\mathcal {A})|. \end{aligned}$$
(3)

Since \(\mathcal {R}\) is finite, a classical result (see [6]) shows that this supremum is attained by taking \(\mathcal {A} = \{\mathcal {R}_k | \mathcal {P}(\mathcal {R}_k) > \mathcal {Q}(\mathcal {R}_k)\} \) (or its complementary) and:

$$\begin{aligned} \mathrm{TV}(\mathcal {P}, \mathcal {Q}) = \frac{1}{2}\sum _{k=1}^m|\mathcal {P}(\mathcal {R}_k) - \mathcal {Q}(\mathcal {R}_k)| = \frac{1}{2}\int _{\mathbb {R}^3}|p(z) - q(z)| dz. \end{aligned}$$
(4)

The TV is half of the \(\ell _1\) distance between the pdfs. \(\Vert \hat{p} - q\Vert _1\) is therefore a natural choice for our loss. The second choice is \(\Vert \hat{p} - q\Vert _2^2\), which is a popular distance and has the appeal of being differentiable everywhere.

Factorizing the Loss. Let us call \(v_k\) the volume of \(r_k\), i.e. the size of its support: \(v_k \triangleq \Vert \mathbb {I}_k\Vert _1, ~ k = 1 \dots m\). Remember that \(r_k = \frac{1}{v_k}\mathbb {I}_k\). Our loss can now be factorized (see supplementary material for details):

$$\begin{aligned} \int _{\mathbb {R}^3} \delta (\hat{p}(z) - q(z))dz= & {} \sum _{k=1}^m v_k \delta \left( \frac{\hat{\varvec{y}}_k}{v_k} - \frac{\sum _{t=1}^d\varvec{x}_{t}\varvec{\beta }_{t,k}}{v_k} \right) \end{aligned}$$
(5)

Here, \(\delta \) is either the absolute value of the difference or the squared difference.

2.3 Training the Model: Efficient Minimization Approaches

To set the model parameters \(\varvec{\beta }\), we used n example studies \(\{\mathcal {S}_i = (\mathcal {T}_i, \mathcal {L}_i), ~ i = 1 \dots n \}\). We learn \(\varvec{\beta }\) by minimizing the empirical risk on \(\{S_i\}\) and an \(\ell _2\) penalty on \(\varvec{\beta }\). We add to the previous notations the index i of each example: \(p_i\), \(q_i\), \(\hat{\varvec{y}}_i\), \(\varvec{x}_i\). \(\hat{\varvec{Y}} \in \mathbb {R}^{n \times m}\) is the matrix such that \(\hat{\varvec{Y}}_{i} = \hat{\varvec{y}}_i\), and \(\varvec{X} \in \mathbb {R}^{n \times d}\) such that \(\varvec{X}_i = \varvec{x}_i\).

Case \(\delta = \ell _2^2\). The empirical risk is

$$\begin{aligned} \sum _{i=1}^n \sum _{k=1}^m \left( \frac{\hat{\varvec{Y}}_{i,k}}{\sqrt{v_k}} - \sum _{t=1}^d\frac{1}{\sqrt{v_k}}\varvec{X}_{i,t}\varvec{\beta }_{t,k} \right) ^2. \end{aligned}$$
(6)

Defining \(\varvec{Y}'_{:,k} = \frac{\hat{\varvec{Y}}_{:,k}}{\sqrt{(}v_k)}\) and \(\varvec{\beta }'_{:,k} = \frac{\varvec{\beta }_{:,k}}{\sqrt{(}v_k)}\), with an \(\ell _2\) penalty, the problem is:

$$\begin{aligned} {\mathop {\hbox {argmin}}\limits _{\varvec{\beta }'}}\left( \Vert \varvec{Y}' - \varvec{\beta }'\varvec{X}\Vert _2^2 + \lambda \Vert \varvec{\beta }'\Vert _2^2 \right) \end{aligned}$$
(7)

where \(\lambda \in \mathbb {R}_+\). This is the least-squares ridge regression predicting \(\hat{p}\) expressed in the orthonormal basis of our search space \(\{\frac{r_k}{\Vert r_k\Vert _2}\}\).

Case \(\delta = \ell _1\). The empirical risk becomes

$$\begin{aligned} \sum _{i=1}^n \sum _{k=1}^m|\hat{\varvec{Y}}_{i,k} - \sum _{t=1}^d\varvec{X}_{i,t}\varvec{\beta }_{t,k}| \end{aligned}$$
(8)

This problem is also known as a least-deviations regression, a particular case of quantile regression [7, 8]. Unlike \(\ell _2\) regression, which provides an estimate of the conditional mean of the target variable, \(\ell _1\) provides an estimate of the median. Quantile regression has been studied (e.g. by economists), as it is more robust to outliers and better-suited than least-squares when the noise is heteroscedastic [7]. Adding an \(\ell _2\) penalty, we have the minimization problem:

$$\begin{aligned} \hat{\varvec{\beta }} = {\mathop {\hbox {argmin}}\limits _{\varvec{\beta }}}\left( \Vert \hat{\varvec{Y}} - \varvec{X}\varvec{\beta }\Vert _1 + \lambda \Vert \varvec{\beta }\Vert _2^2 \right) \end{aligned}$$
(9)

Unpenalized quantile regression is often written as a linear program and solved with the simplex algorithm [9], iteratively reweighted least squares, or interior point methods [10]. [11] uses a coordinate-descent to solve a differentiable approximation of the quantile loss (the Huber loss) with elastic-net penalty. Here, we minimize Eq. (9) via its dual formulation (c.f. supplementary material):

$$\begin{aligned} \hat{\varvec{\nu }} = {\mathop {\hbox {argmax}}\limits _{\varvec{\nu }}}\left( Tr(\varvec{\nu }^T\hat{\varvec{Y}} - \frac{1}{4\lambda }\varvec{\nu }^T\varvec{X}\varvec{X}^T\varvec{\nu }) \right) \quad \text { s.t. } \Vert \varvec{\nu }\Vert _\infty \le 1, \end{aligned}$$
(10)

where \(\varvec{\nu } \in \mathbb {R}^{n \times m}\). The primal solution is given by \(\hat{\varvec{\beta }} = \frac{\varvec{X}^T\hat{\varvec{\nu }}}{2\lambda }\). As the dual loss g is differentiable and the constraints are bound constraints, we can use an efficient quasi-Newton method (L-BFGS, [12]). g and its gradient are fast to compute as \(\varvec{X}\) is sparse. \(\lambda \) is set by cross-validation on the training set. We use warm-start on the regularization path (decreasing values for \(\lambda \)) to initialize each problem.

Training the Label-Constrained Encoder. The columns of \(\varvec{\beta }\) can be fitted independently from each other. If we want \(\varvec{\beta }\) to be diagonal, we only include one feature in each regression: we fit m univariate regressions \(\varvec{\hat{y}}_{:,k} \simeq \varvec{X}_{:,k}\varvec{\beta }_{k,k}\).

2.4 Evaluation: A Natural Model-Comparison Metric

Our metric is the mean log-likelihood of an article’s coordinates in the predicted distribution, which diverges wherever \(q=0\). we add a uniform background to the prediction, to ensure that it is non-zero everywhere:

$$\begin{aligned} \text {the predicted pdf is written}&q' = \frac{1}{2} (\sum _{k=1}^m\frac{\mathbb {I}_k}{v_k} + q)&\end{aligned}$$
(11)
$$\begin{aligned} \text {the score for a study }\mathcal {S}_i=(\mathcal {T}_i, \mathcal {L}_i), \mathcal {L}_i = \{l_{i,a}\} \text { is}&\frac{1}{c_i}\sum _{a=1}^{c_i}\log (q'_i(l_{i, a})) \end{aligned}$$
(12)

3 Empirical Study

3.1 Data: Mining Neuroimaging Publications

We downloaded roughly 140K neuroimaging articles from online sources including Pubmed Central and commercial publishers. About 14K of these contain coordinates, which we extracted, as in [4]. We built a vocabulary of around 1000 anatomical region names by grouping the labels of several atlases and the Wikipedia page “List of regions in the human brain”Footnote 3. So in practice, \(n \approx 14\cdot 10^3\) and \(d \approx 1000\). m depends on the atlas (or voxel grid) and ranges from 20 to 30K.

Fig. 1.
figure 1

Log-Likelihood of coordinates reported by left-out articles in the predicted distribution (Eq. (12)). The vertical line represents the test log-likelihood given a uniform distribution over the brain. Voxel-wise encoding is better than relying on any atlas. In this setting, \(\ell _1\) regression significantly outperforms least squares.

3.2 Text-to-Brain Encoding Performance

Comparison of Atlases and Models. We perform 100 folds of shuffle-split cross-validation (10% in test set). As choices of \(\{\mathcal {R}_k\}\), we compare several atlases and a grid of cubic 4-mm voxels. We also compare \(\ell _1\) and \(\ell _2\) regression, and label-constrained \(\ell _2\). The label-constrained encoder is not used for the voxel grid, as it does not have labels. As a baseline, we include a prediction based on the average of the brain maps seen during training (i.e. independent of the text).

Figure 1 gives the results: for all models, voxel-wise encoding performs better than any atlas. Large atlas regions regularize too much. Despite its higher dimensionality, voxel-wise encoding learns better representations of anatomical terms. The label-constrained model performs poorly, sometimes below chance, as the labels of a single atlas do not cover enough words and interactions between terms are important. For voxel-wise encoding, \(\ell _1\) regression outperforms \(\ell _2\). The best encoder is therefore learned using a \(\ell _1\) loss and a voxel partition.

Prediction Examples. Figure 2 shows the true pdf (estimated with KDE) and the prediction for the articles which obtained respectively the best and the first-quartile scores. The median is shown in the supplementary material.

Fig. 2.
figure 2

True map (left) and prediction (right) for best prediction and \(1^{st}\) quartile

Examples of Coefficients Learned by the Linear Regression. The coefficients of the linear regression (rows of \(\varvec{\beta }\)) are the brain maps that the model associates with each anatomical term. For frequent terms, they are close to what experts would expect (see for example Figs. 3 and 4).

Fig. 3.
figure 3

Regression coefficient for “anterior cingulate”

Fig. 4.
figure 4

Regression coefficients for “left amygdala”, “amygdala”, and “right amygdala”

Fig. 5.
figure 5

Predicted density for Huntington’s and Parkinson’s. In agreement with Huntington’s physiopathology [13], our method highlights the putamen, and the caudate nucleus. Also, in the case of Parkinson’s [14], the brain stem, the thalamus, and the motor cortex are highlighted.

Fig. 6.
figure 6

Predicted density for aphasia, centered on Broca’s and Wernicke’s areas, in agreement with the literature [15].

3.3 Leveraging Text Without Coordinates: Neurological Examples

Our framework can leverage unstructured spatial information contained in a large corpus of unannotated text. To showcase this, assume that we want to know which parts of the brain are associated with Huntington’s disease. Our labelled corpus by itself is insufficient: only 21 documents mention the term “huntington”. But we use it to learn associations between anatomical terms and locations in the brain (Sect. 2). This gives us access to the spatial information contained in the unlabelled corpus, which was out of reach before (Sect. 3.2). We contrast the mean encoding of articles which mention “huntington” against the mean distribution (taking the difference of their log). Since the large corpus contains more information about Huntington’s disease (over 400 articles mention it), this is sufficient to see the striatum highlighted in the resulting map (Fig. 5, left). Figure 5 (right) shows the experiment for Parkinson, and Fig. 6 for Aphasia.

4 Conclusion

We have introduced a theoretical framework to translate textual description of studies into spatial distributions over the brain. Such a translation enables pooling together many studies which only provide text (no images or coordinates), for statistical analysis of their results in brain space. The statistical model gives a natural metric to validate. This metric enables comparing representations, showing that voxel-wise encoding is a better approach than relying on atlases. Building prediction models tailored to our task leads to a linear regression with an \(\ell _1\) loss (least absolute deviation), the total-variation distance between the true and the predicted spatial distributions. Such a model can be trained efficiently on dozens of thousands of data points and outperforms simpler approaches.

Applied to descriptions of pathologies that lack spatial information, our model synthesizes accurate brain maps that reflect the domain knowledge. Predicting spatial distributions of medical observations from text opens new alleys for clinical research from patient health records and case reports.