Abstract
Despite the digital nature of magnetic resonance imaging, the resulting observations are most frequently reported and stored in text documents. There is a trove of untapped information in medical health records, case reports, and medical publications. In this paper, we propose to mine brain medical publications to learn the spatial distribution associated with anatomical terms. The problem is formulated as the minimization of a risk on distributions, which leads to a least-deviation cost function. An efficient algorithm in the dual then learns the mapping from documents to brain structures. Empirical results using coordinates extracted from the brain-imaging literature show that (i) models must adapt to semantic variation in the terms used to describe a given anatomical structure, (ii) voxel-wise parameterization leads to higher likelihood of locations reported in unseen documents, (iii) the least-deviation cost outperforms least squares. As a proof of concept for our method, we use our model of spatial distributions to predict the distribution of specific neurological conditions from text-only reports.
1 Introduction
Hundreds of thousands of studies, case reports, and patient records capture observations in human neuroscience, basic and clinical. Statistical analysis of this large amount of data could provide new insights. Unfortunately, most of the spatial information that these data contain is difficult to extract automatically, because it is hidden in unstructured text, in sentences such as: “[...] in the anterolateral temporal cortex, especially the temporal pole and inferior and middle temporal gyri” [1].
Such data cannot be processed easily by a machine, as a machine does not know where the temporal cortex is. As we will show, simply looking up such terms in atlases does not suffice; indeed, atlases themselves disagree [2]. Furthermore, joint processing of many reports must reconcile varying terminologies, as regions are represented in atlases that differ and overlap. Finally, not all terms in a report carry the same importance, and practitioners use terms that are not the exact labels of any atlas. Coordinate-based meta-analyses capture the spatial distribution of a term from the literature [3, 4], but they too lack a model to combine terms.
Here, we propose to map case reports automatically to the brain locations that they discuss: we learn mappings of anatomical terms to brain regions from medical publications. We propose a new learning framework for translating anatomical terms to brain images – a process that we call “encoding”. We learn such a mapping, quantify its performance, and compare possible choices of representation of spatial data. We then show in a proof of concept that our model can predict the brain area for textual case reports.
2 Methods: Formalizing Text-to-Brain-Map Translation
2.1 Problem Setting: From Text to Spatial Distributions
We want to predict the likelihood of the location of relevant brain structures described in a document. For this purpose, we perform supervised learning on a corpus of brain-imaging studies, each containing: (i) a text, and (ii) the locations – i.e. the stereotactic coordinates – of its observations. Indeed, functional Magnetic Resonance Imaging (fMRI) studies report the coordinates of activation peaks (e.g., [5, Table 1]), and Voxel-Based Morphometry (VBM) analyses report the location of differences in gray matter density (e.g., [1, Table 2]). Following neuroimaging meta-analyses [3], we frame the problem in terms of spatial distributions of observations in the brain. In a document, observed locations \(\mathcal {L} = \{l_a \in \mathbb {R}^3, a = 1 \dots c\}\) are sampled from a probability density function (pdf) p over the brain. Our goal is to predict this pdf p from the text \(\mathcal {T}\); we denote our prediction by q. A predicted pdf q should be close to p, i.e. take high values at the coordinates actually reported in the study: \(\prod _{l \in \mathcal {L}} q(l)\) must be large. In a supervised learning setting, we start from a collection of studies \(\mathcal {S} = (\mathcal {T}, \mathcal {L})\), with \(\mathcal {T}\) the text and \(\mathcal {L}\) the locations. Building the prediction engine then entails the choice of a model relating the predicted pdf q to the text \(\mathcal {T}\), the choice of a loss, or data-fit term, and some regularization on the model parameters. We now detail how we make each of these choices to construct a prediction.
Model. We start by modelling the dependency of our spatial pdf q on the study text \(\mathcal {T}\). This entails both choosing a representation for q and writing it as a function of the text. While q is defined on a subvolume of \(\mathbb {R}^3\), the brain volume, we build it using a partition to work on a finite probability space: this can be either a regular grid of voxels or a set of anatomical regions (i.e. an atlas) \(\mathcal {R} = \{\mathcal {R}_k, k=0\dots m\}\). Since such a partition imposes homogeneity within each region, q can be written on \(\mathbb {R}^3\) in terms of the normalized indicator functions of the parts (Footnote 1): \(\{r_k = \frac{\mathbb {I}_k}{\Vert \mathbb {I}_k\Vert _1}, ~ k = 1 \dots m\}\). Importantly, the volume of each part \(\Vert \mathbb {I}_k\Vert _1\) appears as a normalization constant.
To link q to the text \(\mathcal {T}\) of the study, we start by building a term-frequency vector representation of \(\mathcal {T}\), which we denote \(\varvec{x} \in \mathbb {R}^d\). d is the size of our vocabulary of English words \(\mathcal {W} = \{w_t\}\), and \(\varvec{x}_t\) is the frequency of word \(w_t\) in the text. We assign to each atlas region a weight that depends linearly on \(\varvec{x}\):
\[ q = \sum _{k=1}^{m} (\varvec{x}^T\varvec{\beta })_k \, r_k \qquad (1) \]
where \(\varvec{\beta } \in \mathbb {R}^{d \times m}\) are model parameters, which we will learn.
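To make the model concrete, here is a minimal sketch of Eq. (1) in Python; the vocabulary, the coefficients, and all variable names are illustrative toy choices, not the paper's actual implementation:

```python
# Toy illustration of the encoding model of Eq. (1): a term-frequency
# vector x is mapped linearly to region weights by learned coefficients beta.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

VOCABULARY = ["temporal pole", "parietal lobe", "striatum"]   # d = 3 (toy)
vectorizer = CountVectorizer(vocabulary=VOCABULARY, ngram_range=(1, 2))

text = ["Atrophy was found in the temporal pole and the striatum."]
counts = vectorizer.fit_transform(text).toarray().astype(float)
x = counts / max(counts.sum(), 1.0)       # term-frequency vector, shape (1, d)

m = 4                                     # number of atlas regions (toy)
rng = np.random.default_rng(0)
beta = rng.random((len(VOCABULARY), m))   # stands in for learned coefficients

weights = x @ beta                        # one weight per region, as in Eq. (1)
q_weights = weights / weights.sum()       # one convention to obtain a pdf
```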
Using an atlas is a form of regularization: constraining the prediction to be in the span of \(\{r_k\}\) reduces the size of the search space. Fine partitions, e.g. atlases with many regions or voxel grids, yield models with more expressive power, but more likely to overfit. Choosing an atlas thus amounts to a bias-variance tradeoff.
Label-Constrained Encoder. A simple heuristic to turn a text into a brain map is to use atlas labels and ignore interactions between terms. The probability of a region is taken to be proportional to the frequency of its label in the text. The vocabulary is then the set of labels: \(d = m\). As the word \(w_k\) is the label of \(\mathcal {R}_k\), \(\varvec{\beta }\) is diagonal. For example, for a region \(\mathcal {R}_k\) in the atlas labelled “parietal lobe”, the probability on \(\mathcal {R}_k\) depends only on the frequency of the phrase “parietal lobe” in the text. We call this model the label-constrained encoder.
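In code, this heuristic reduces to counting label occurrences; a minimal sketch (the labels and text are toy examples):

```python
# Label-constrained encoder: beta is diagonal, so each region's probability
# is proportional to the frequency of its own label in the text.
import numpy as np

labels = ["parietal lobe", "temporal pole", "striatum"]   # one label per region
text = "lesions in the parietal lobe, extending to the temporal pole"

freq = np.array([text.count(lab) for lab in labels], dtype=float)
if freq.sum() > 0:
    q_weights = freq / freq.sum()     # here: [0.5, 0.5, 0.0]
else:
    q_weights = np.full(len(labels), 1.0 / len(labels))
```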
2.2 Loss Function: Measuring Errors on Spatial Distributions
Strategy. We will fit the coefficients \(\varvec{\beta }\) of our model, see Eq. (1), by minimizing a risk \(\mathcal {E}(p, q)\): the expectation of a distance between p and q.
A Plugin Estimator of p. We do not have access to the true pdf, p; we need a plugin estimator, which we denote \(\hat{p}\). By construction of our prediction q, the best approximation of p we can hope for belongs to the span of our regions \(\{r_k\}\). Hence, we build our estimator \(\hat{p}\) in this space, setting the probability of a region to be proportional to the number of coordinates that fell inside it:
\[ \hat{p} = \sum _{k=1}^{m} \hat{\varvec{y}}_k \, r_k, \quad \text {with} \quad \hat{\varvec{y}}_k = \frac{1}{c}\, |\{a : l_a \in \mathcal {R}_k\}| \qquad (2) \]
When regions are voxels, there are too many regions and too few coordinates. Hence we use Gaussian Kernel Density Estimation (KDE) to smooth the estimated pdf (Footnote 2). Our supplementary material details a fast KDE implementation.
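As an illustration, a voxel-wise plugin estimate could be computed with an off-the-shelf KDE as below; the grid extent, bandwidth, and simulated coordinates are arbitrary placeholders, and the paper's own fast implementation (described in its supplementary material) differs:

```python
# Toy voxel-wise plugin estimator: smooth reported peak coordinates with a
# Gaussian KDE and evaluate the density at the centers of a 4-mm voxel grid.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
coords = rng.normal(loc=[40.0, -22.0, 8.0], scale=6.0, size=(50, 3))  # fake peaks

# Voxel centers of a small 4-mm grid (a real brain mask would replace the box).
axes = [np.arange(-60.0, 61.0, 4.0) for _ in range(3)]
grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)

kde = gaussian_kde(coords.T)      # bandwidth: scipy's default (Scott's rule)
p_hat = kde(grid.T)
p_hat /= p_hat.sum()              # discrete pdf over voxels
```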
Choice of \(\mathcal{E}\). We use two common distance functions for our loss. The first is Total Variation (TV), a common distance for distributions. Note that p defines a probability measure on the finite sample space \(\mathcal {R}\), \(\mathcal {P}(\mathcal {R}_k)= \int _{\mathcal {R}_k} p(z) dz\), where \(\mathcal {R} = \{\mathcal {R}_k, k=1 \ldots m\}\) and \(\mathcal {R}_k = \mathrm{supp}(r_k)\). q defines \(\mathcal {Q}\) in the same way. Then,
\[ \mathrm {TV}(\mathcal {P}, \mathcal {Q}) \triangleq \sup _{\mathcal {A} \subseteq \mathcal {R}} |\mathcal {P}(\mathcal {A}) - \mathcal {Q}(\mathcal {A})| \qquad (3) \]
Since \(\mathcal {R}\) is finite, a classical result (see [6]) shows that this supremum is attained by taking \(\mathcal {A} = \{\mathcal {R}_k | \mathcal {P}(\mathcal {R}_k) > \mathcal {Q}(\mathcal {R}_k)\} \) (or its complement) and:
\[ \mathrm {TV}(\mathcal {P}, \mathcal {Q}) = \frac{1}{2} \sum _{k=1}^{m} |\mathcal {P}(\mathcal {R}_k) - \mathcal {Q}(\mathcal {R}_k)| \qquad (4) \]
The TV is half of the \(\ell _1\) distance between the pdfs. \(\Vert \hat{p} - q\Vert _1\) is therefore a natural choice for our loss. The second choice is \(\Vert \hat{p} - q\Vert _2^2\), which is a popular distance and has the appeal of being differentiable everywhere.
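This identity is easy to verify numerically for small m by enumerating all events; a quick sanity check with toy distributions:

```python
# Brute-force verification that the supremum in Eq. (3) equals half the
# l1 distance of Eq. (4), for two random distributions over m = 4 regions.
import itertools
import numpy as np

rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(4))
Q = rng.dirichlet(np.ones(4))

sup = max(abs(P[list(A)].sum() - Q[list(A)].sum())
          for r in range(5) for A in itertools.combinations(range(4), r))
assert np.isclose(sup, 0.5 * np.abs(P - Q).sum())
```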
Factorizing the Loss. Let us call \(v_k\) the volume of \(r_k\), i.e. the size of its support: \(v_k \triangleq \Vert \mathbb {I}_k\Vert _1, ~ k = 1 \dots m\). Remember that \(r_k = \frac{1}{v_k}\mathbb {I}_k\). Our loss can now be factorized (see supplementary material for details):
\[ \mathcal {E}(\hat{p}, q) = \sum _{k=1}^{m} v_k \, \delta \Big (\frac{\hat{\varvec{y}}_k - (\varvec{x}^T\varvec{\beta })_k}{v_k}\Big ) \qquad (5) \]
Here, \(\delta \) is either the absolute value of the difference or the squared difference.
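A small numerical check of this factorization, assuming unit-volume voxels so that integrals over the brain become sums (weights and volumes are toy values):

```python
# Verify Eq. (5): the integral form of the l1 / squared-l2 losses between two
# piecewise-constant pdfs equals the factorized per-region expression.
import numpy as np

rng = np.random.default_rng(1)
v = np.array([3, 5, 2], dtype=float)    # region volumes, in unit voxels
y_hat = rng.dirichlet(np.ones(3))       # plugin weights (y_k in Eq. (2))
w = rng.dirichlet(np.ones(3))           # predicted weights ((x^T beta)_k)

# The pdfs take the constant value weight/volume on each region.
p_hat = np.concatenate([np.full(int(vk), yk / vk) for vk, yk in zip(v, y_hat)])
q = np.concatenate([np.full(int(vk), wk / vk) for vk, wk in zip(v, w)])

assert np.isclose(np.abs(p_hat - q).sum(),            # integral of |p_hat - q|
                  (v * np.abs((y_hat - w) / v)).sum())
assert np.isclose(((p_hat - q) ** 2).sum(),           # integral of (p_hat - q)^2
                  (v * ((y_hat - w) / v) ** 2).sum())
```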
2.3 Training the Model: Efficient Minimization Approaches
To set the model parameters \(\varvec{\beta }\), we use n example studies \(\{\mathcal {S}_i = (\mathcal {T}_i, \mathcal {L}_i), ~ i = 1 \dots n \}\). We learn \(\varvec{\beta }\) by minimizing the empirical risk on \(\{S_i\}\) plus an \(\ell _2\) penalty on \(\varvec{\beta }\). We add to the previous notations the index i of each example: \(p_i\), \(q_i\), \(\hat{\varvec{y}}_i\), \(\varvec{x}_i\). \(\hat{\varvec{Y}} \in \mathbb {R}^{n \times m}\) is the matrix such that \(\hat{\varvec{Y}}_{i} = \hat{\varvec{y}}_i\), and \(\varvec{X} \in \mathbb {R}^{n \times d}\) such that \(\varvec{X}_i = \varvec{x}_i\).
Case \(\delta = \ell _2^2\). The empirical risk is
\[ \frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{m} \frac{\big (\hat{\varvec{Y}}_{ik} - (\varvec{X}\varvec{\beta })_{ik}\big )^2}{v_k} \qquad (6) \]
Defining \(\varvec{Y}'_{:,k} = \hat{\varvec{Y}}_{:,k}/\sqrt{v_k}\) and \(\varvec{\beta }'_{:,k} = \varvec{\beta }_{:,k}/\sqrt{v_k}\), with an \(\ell _2\) penalty the problem is:
\[ \min _{\varvec{\beta }'} \; \Vert \varvec{Y}' - \varvec{X}\varvec{\beta }'\Vert _F^2 + \lambda \Vert \varvec{\beta }'\Vert _F^2 \qquad (7) \]
where \(\lambda \in \mathbb {R}_+\). This is the least-squares ridge regression predicting \(\hat{p}\) expressed in the orthonormal basis of our search space \(\{\frac{r_k}{\Vert r_k\Vert _2}\}\).
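Under this rescaling the fit is plain ridge regression; a toy sketch (random data; shapes and the penalty value are arbitrary illustrative choices):

```python
# The l2^2 case as ridge regression on rescaled targets (Eq. (7)), followed
# by undoing the rescaling to recover beta.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d, m = 200, 50, 10
X = rng.random((n, d))                  # term-frequency features
Y_hat = rng.dirichlet(np.ones(m), n)    # plugin pdf weights per study
v = rng.uniform(1.0, 5.0, m)            # region volumes

Y_prime = Y_hat / np.sqrt(v)            # Y'_{:,k} = Y_hat_{:,k} / sqrt(v_k)
ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, Y_prime)
beta = ridge.coef_.T * np.sqrt(v)       # beta_{:,k} = sqrt(v_k) * beta'_{:,k}
```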
Case \(\delta = \ell _1\). The empirical risk becomes (the volumes \(v_k\) cancel out):
\[ \frac{1}{n} \sum _{i=1}^{n} \sum _{k=1}^{m} \big |\hat{\varvec{Y}}_{ik} - (\varvec{X}\varvec{\beta })_{ik}\big | \qquad (8) \]
This problem is also known as a least-deviations regression, a particular case of quantile regression [7, 8]. Unlike \(\ell _2\) regression, which provides an estimate of the conditional mean of the target variable, \(\ell _1\) provides an estimate of the median. Quantile regression has been studied (e.g. by economists), as it is more robust to outliers and better suited than least squares when the noise is heteroscedastic [7]. Adding an \(\ell _2\) penalty, we have the minimization problem:
\[ \min _{\varvec{\beta }} \; \sum _{i=1}^{n} \sum _{k=1}^{m} \big |\hat{\varvec{Y}}_{ik} - (\varvec{X}\varvec{\beta })_{ik}\big | + \lambda \Vert \varvec{\beta }\Vert _F^2 \qquad (9) \]
Unpenalized quantile regression is often written as a linear program and solved with the simplex algorithm [9], iteratively reweighted least squares, or interior point methods [10]. [11] uses coordinate descent to solve a differentiable approximation of the quantile loss (the Huber loss) with an elastic-net penalty. Here, we minimize Eq. (9) via its dual formulation (cf. supplementary material):
\[ \max _{\Vert \varvec{\nu }\Vert _\infty \le 1} \; g(\varvec{\nu }) = \mathrm {tr}(\varvec{\nu }^T \hat{\varvec{Y}}) - \frac{\Vert \varvec{X}^T \varvec{\nu }\Vert _F^2}{4\lambda } \qquad (10) \]
where \(\varvec{\nu } \in \mathbb {R}^{n \times m}\). The primal solution is given by \(\hat{\varvec{\beta }} = \frac{\varvec{X}^T\hat{\varvec{\nu }}}{2\lambda }\). As the dual loss g is differentiable and the constraints are bound constraints, we can use an efficient quasi-Newton method (L-BFGS, [12]). g and its gradient are fast to compute as \(\varvec{X}\) is sparse. \(\lambda \) is set by cross-validation on the training set. We use warm-start on the regularization path (decreasing values for \(\lambda \)) to initialize each problem.
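A self-contained sketch of this dual approach on toy dense data; the paper's implementation exploits the sparsity of X and cross-validates \(\lambda \), whereas names and sizes here are illustrative:

```python
# Solve the dual (Eq. (10)) with L-BFGS-B under the bound constraints
# |nu_ik| <= 1, then recover the primal coefficients beta = X^T nu / (2 lambda).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, m, lam = 100, 30, 8, 0.1
X = rng.random((n, d))
Y_hat = rng.dirichlet(np.ones(m), n)

def neg_g(nu_flat):
    nu = nu_flat.reshape(n, m)
    Xt_nu = X.T @ nu
    g = np.sum(nu * Y_hat) - np.sum(Xt_nu ** 2) / (4 * lam)
    grad = Y_hat - X @ Xt_nu / (2 * lam)        # gradient of g w.r.t. nu
    return -g, -grad.ravel()                    # minimize -g, i.e. maximize g

res = minimize(neg_g, np.zeros(n * m), jac=True, method="L-BFGS-B",
               bounds=[(-1.0, 1.0)] * (n * m))
nu_hat = res.x.reshape(n, m)
beta = X.T @ nu_hat / (2 * lam)                 # primal solution
```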
Training the Label-Constrained Encoder. The columns of \(\varvec{\beta }\) can be fitted independently from each other. If we want \(\varvec{\beta }\) to be diagonal, we only include one feature in each regression: we fit m univariate regressions \(\varvec{\hat{y}}_{:,k} \simeq \varvec{X}_{:,k}\varvec{\beta }_{k,k}\).
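Reusing the toy X, Y_hat, and m from the sketch above, and pretending that column k of X holds the frequency of region k's label (so that \(d = m\)), the diagonal fit is m independent univariate regressions:

```python
# Fit the label-constrained encoder: one univariate ridge regression per
# region, using only that region's label frequency as the single feature.
from sklearn.linear_model import Ridge

beta_diag = [
    Ridge(alpha=1.0, fit_intercept=False).fit(X[:, [k]], Y_hat[:, k]).coef_[0]
    for k in range(m)
]
```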
2.4 Evaluation: A Natural Model-Comparison Metric
Our metric is the mean log-likelihood of an article’s coordinates in the predicted distribution. As the log-likelihood diverges wherever \(q = 0\), we add a uniform background to the prediction to ensure that it is non-zero everywhere:
\[ \frac{1}{|\mathcal {L}|} \sum _{l \in \mathcal {L}} \log \big ( (1 - \eta ) \, q(l) + \eta \, u(l) \big ) \qquad (11) \]
where \(u\) is the uniform density over the brain and \(\eta \) a small mixing weight.
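In code, with a discretized q over voxels (the mixing weight eta below is an arbitrary illustrative value, not one taken from the paper):

```python
# Mean log-likelihood of reported coordinates under the predicted pdf,
# smoothed with a uniform background as in Eq. (11).
import numpy as np

def mean_log_likelihood(q_voxels, coord_voxel_idx, eta=0.01):
    """q_voxels: predicted discrete pdf over voxels (sums to 1).
    coord_voxel_idx: voxel index of each reported coordinate."""
    uniform = 1.0 / q_voxels.size
    q_smoothed = (1.0 - eta) * q_voxels + eta * uniform
    return np.log(q_smoothed[coord_voxel_idx]).mean()
```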
3 Empirical Study
3.1 Data: Mining Neuroimaging Publications
We downloaded roughly 140K neuroimaging articles from online sources including PubMed Central and commercial publishers. About 14K of these contain coordinates, which we extracted, as in [4]. We built a vocabulary of around 1000 anatomical region names by grouping the labels of several atlases and the Wikipedia page “List of regions in the human brain” (Footnote 3). In practice, therefore, \(n \approx 14\cdot 10^3\) and \(d \approx 1000\); m depends on the atlas (or voxel grid) and ranges from 20 to 30K.
3.2 Text-to-Brain Encoding Performance
Comparison of Atlases and Models. We perform 100 folds of shuffle-split cross-validation (10% in test set). As choices of \(\{\mathcal {R}_k\}\), we compare several atlases and a grid of cubic 4-mm voxels. We also compare \(\ell _1\) and \(\ell _2\) regression, and label-constrained \(\ell _2\). The label-constrained encoder is not used for the voxel grid, as it does not have labels. As a baseline, we include a prediction based on the average of the brain maps seen during training (i.e. independent of the text).
Figure 1 gives the results: for all models, voxel-wise encoding performs better than any atlas. Large atlas regions regularize too much. Despite its higher dimensionality, voxel-wise encoding learns better representations of anatomical terms. The label-constrained model performs poorly, sometimes below chance, as the labels of a single atlas do not cover enough words and interactions between terms are important. For voxel-wise encoding, \(\ell _1\) regression outperforms \(\ell _2\). The best encoder is therefore learned with an \(\ell _1\) loss and a voxel partition.
Prediction Examples. Figure 2 shows the true pdf (estimated with KDE) and the prediction for the articles which obtained respectively the best and the first-quartile scores. The median is shown in the supplementary material.
Examples of Coefficients Learned by the Linear Regression. The coefficients of the linear regression (rows of \(\varvec{\beta }\)) are the brain maps that the model associates with each anatomical term. For frequent terms, they are close to what experts would expect (see for example Figs. 3 and 4).
3.3 Leveraging Text Without Coordinates: Neurological Examples
Our framework can leverage unstructured spatial information contained in a large corpus of unannotated text. To showcase this, assume that we want to know which parts of the brain are associated with Huntington’s disease. Our labelled corpus by itself is insufficient: only 21 documents mention the term “huntington”. But we use it to learn associations between anatomical terms and locations in the brain (Sect. 2). This gives us access to the spatial information contained in the unlabelled corpus, which was out of reach before (Sect. 3.2). We contrast the mean encoding of articles which mention “huntington” against the mean distribution (taking the difference of their logs). Since the large corpus contains more information about Huntington’s disease (over 400 articles mention it), this is sufficient to see the striatum highlighted in the resulting map (Fig. 5, left). Figure 5 (right) shows the same experiment for Parkinson’s disease, and Fig. 6 for aphasia.
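A sketch of this contrast, assuming a hypothetical encode(text) function that returns the voxel-wise pdf of Sect. 2 and a list corpus of article texts (both names are ours, not from the paper's code):

```python
# Contrast the mean encoding of articles mentioning a term against the
# corpus-wide mean encoding, in log space.
import numpy as np

def contrast_map(corpus, term, encode, eps=1e-12):
    maps = np.stack([encode(text) for text in corpus])   # (n_docs, n_voxels)
    mask = np.array([term in text.lower() for text in corpus])
    mean_term = maps[mask].mean(axis=0)
    mean_all = maps.mean(axis=0)
    return np.log(mean_term + eps) - np.log(mean_all + eps)

# e.g. huntington_map = contrast_map(corpus, "huntington", encode)
```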
4 Conclusion
We have introduced a theoretical framework to translate textual descriptions of studies into spatial distributions over the brain. Such a translation enables pooling together many studies that provide only text (no images or coordinates), for statistical analysis of their results in brain space. The statistical model gives a natural metric to validate predictions. This metric enables comparing representations, showing that voxel-wise encoding is a better approach than relying on atlases. Building prediction models tailored to our task leads to a linear regression with an \(\ell _1\) loss (least absolute deviation), which corresponds to the total-variation distance between the true and the predicted spatial distributions. Such a model can be trained efficiently on tens of thousands of data points and outperforms simpler approaches.
Applied to descriptions of pathologies that lack spatial information, our model synthesizes accurate brain maps that reflect domain knowledge. Predicting spatial distributions of medical observations from text opens new avenues for clinical research from patient health records and case reports.
Notes
- 1.
\(\mathcal {R}_0\) denotes the volume outside of the brain, or background, on which q is 0.
- 2.
Using an atlas is also a form of KDE, with kernel \((z, z') \mapsto 1 / \Vert \mathbb {I}_k\Vert _1\) if z and \(z'\) belong to the same region \(\mathcal {R}_k, k \in \{1,\dots m\}\), 0 otherwise.
- 3.
https://en.wikipedia.org/wiki/List_of_regions_in_the_human_brain
References
Mummery, C.J., Patterson, K., Price, C.J., et al.: A voxel-based morphometry study of semantic dementia: relationship between temporal lobe atrophy and semantic memory. Ann. Neurol. 47(1), 36–45 (2000)
Bohland, J., Bokil, H., Allen, C., Mitra, P.: The brain atlas concordance problem: quantitative comparison of anatomical parcellations. PLoS ONE 4(9), e7200 (2009)
Laird, A.R., Fox, P.M., Price, C.J., et al.: ALE meta-analysis: controlling the false discovery rate and performing statistical contrasts. Hum. Brain Mapp. 25, 155 (2005)
Yarkoni, T., Poldrack, R.A., Nichols, T.E., et al.: Large-scale automated synthesis of human functional neuroimaging data. Nat. Methods 8, 665 (2011)
Van der Zwaag, W., Gentile, G., et al.: Where sound position influences sound object representations: a 7-T fMRI study. Neuroimage 54(3), 1803–1811 (2011)
Gibbs, A.L., Su, F.E.: On choosing and bounding probability metrics. Int. Stat. Rev. 70(3), 419–435 (2002)
Koenker, R., Bassett Jr., G.: Regression quantiles. Econom.: J. Econom. Soc. 46, 33–50 (1978)
Chen, C., Wei, Y.: Computational issues for quantile regression. Sankhyā: Indian J. Stat. 67, 399–417 (2005)
Koenker, R., d’Orey, V.: Remark AS R92: a remark on algorithm AS 229: computing dual regression quantiles and regression rank scores. J. R. Stat. Soc. Ser. C 43(2), 410–414 (1994)
Portnoy, S., Koenker, R., et al.: The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Stat. Sci. 12(4), 279–300 (1997)
Yi, C., Huang, J.: Semismooth Newton coordinate descent algorithm for elastic-net penalized Huber loss regression and quantile regression. J. Comput. Graph. Stat. 26, 547 (2017)
Byrd, R.H., Lu, P., Nocedal, J., Zhu, C.: A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 16(5), 1190–1208 (1995)
Walker, F.O.: Huntington’s disease. Lancet 369(9557), 218–228 (2007)
Davie, C.A.: A review of Parkinson’s disease. Br. Med. Bull. 86, 109 (2008)
Damasio, A.R.: Aphasia. N. Engl. J. Med. 326(8), 531–539 (1992)
Surguladze, S., et al.: Interaction of catechol O-methyltransferase and serotonin transporter genes modulates effective connectivity in a facial emotion-processing circuitry. Transl. Psychiatry 2(1), e70 (2012)
Van Dam, W.O., Rueschemeyer, S.A., Bekkering, H.: How specifically are action verbs represented in the neural motor system: an fMRI study. Neuroimage 53(4), 1318–1325 (2010)
Silverman, B.W.: Density Estimation for Statistics and Data Analysis, vol. 26. CRC Press, Boca Raton (1986)
Simonoff, J.S.: Smoothing Methods in Statistics. Springer Science & Business Media, New York (2012). https://doi.org/10.1007/978-1-4612-4026-6
Wand, M.: Fast computation of multivariate kernel estimators. J. Comput. Graph. Stat. 3(4), 433–445 (1994)
Gramacki, A., Gramacki, J.: FFT-based fast computation of multivariate kernel density estimators with unconstrained bandwidth matrices. J. Comput. Graph. Stat. 26, 459–462 (2016)
Acknowledgements
This project received funding from: the European Union’s H2020 Research Programme under Grant Agreement No. 785907 (HBP SGA2), the Metacog Digiteo project, the MetaMRI associate team, and ERC NeuroLang.