Abstract
The integration of multi-dimensional datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of data types, providing distinct – but often complementary – information. However, the sheer volume of data adds to the burden of any inference task. Flexible Bayesian methods may reduce the need for strong modelling assumptions, but can also increase the computational burden. We present an improved implementation of a Bayesian correlated clustering algorithm that permits integrated clustering to be performed routinely across multiple datasets, each with tens of thousands of items. By exploiting GPU-based computation, we improve the runtime performance of the algorithm by almost four orders of magnitude. This permits analysis of genomic-scale datasets, greatly expanding the range of applications beyond those originally possible. MDI is available at http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/.
Supplemental Material:
The online version of this article (DOI: 10.1515/sagmb-2015-0055) offers supplementary material, available to authorized users.
©2016 by De Gruyter