From access and integration to mining of secure genomic data sets across the Grid

https://doi.org/10.1016/j.future.2006.07.007

Abstract

The UK Department of Trade and Industry (DTI) funded BRIDGES project (Biomedical Research Informatics Delivered by Grid Enabled Services) has developed a Grid infrastructure to support cardiovascular research. This includes the provision of a compute Grid and a data Grid infrastructure with security at its heart. In this paper we focus on the BRIDGES data Grid. A primary aim of the BRIDGES data Grid is to help control the complexity of accessing and integrating a myriad of genomic data sets through simple Grid-based tools. We outline these tools and how they are delivered to end-user scientists. We also describe how these tools are to be extended in the BBSRC-funded Grid Enabled Microarray Expression Profile Search (GEMEPS) project to support a richer vocabulary of search capabilities for mining microarray data sets. As with BRIDGES, fine-grained Grid security underpins GEMEPS.

Introduction

The completion of the sequencing of the human genome, several other eukaryotic genomes and more than a hundred microbial ones marks the beginning of the post-genomic era, in which emphasis has shifted from searching for genes to understanding their function. The resulting new discipline, functional genomics, owes much of its success to modern post-genomic technologies that enable comprehensive studies of the mRNA, protein and metabolite complements of biological samples. Because these technologies are high-throughput, functional genomics generates large amounts of data. To analyze these data sensibly, well-designed data standards are required. The Human Proteome Organisation's (HUPO) Proteomics Standards Initiative (PSI) [1] has adopted the PEDRO (Proteomics Experiment Data Repository) standard [2] for proteomic data. Recently, the ArMet (Architecture for Metabolomics) model was proposed for metabolomic data [3]. The most advanced work, however, has been done by the microarray community through the development of the MIAME (Minimum Information About a Microarray Experiment) standard for transcriptomic data [4], [19], [20]. Leading journals now require microarray data associated with publications to be MIAME compliant, and the standard has been adopted by several public data repositories.

Data stored in these repositories can easily be searched using terms from carefully crafted controlled vocabularies. However, none of the existing repositories provides a means of searching the deposited data by the results of a particular microarray experiment. In other words, a researcher currently cannot assess whether a similar experiment has been undertaken previously, whether other experiments have produced similar results, or more generally how their experiment compares with previously undertaken experiments. From a biological perspective we can introduce the concept of a Simple FunctIonal geNomics eXperiment (SFINX), where a SFINX represents a comparison of two biological conditions represented by two groups of biologically replicated samples (healthy vs. diseased tissue, wild type vs. mutant animals, drug-treated vs. control cells, etc.). Each sample contains a population of elements (e.g. mRNAs, proteins, metabolites). By identifying statistically significant differences between these populations we aim to learn about a biological process that could explain the difference between the two conditions. Modern post-genomic technologies enable quantitative measurement of changes in populations of elements. In the case of gene expression arrays, which measure differences in transcriptome content, the elements are genes (or rather mRNAs). In the case of quantitative proteomics technologies such as Differential Gel Electrophoresis [46] or Isotope Coded Affinity Tags [47], which measure differences in proteome composition, the elements are proteins. In the case of quantitative metabolomic technologies such as Fourier Transform Ion Cyclotron Mass Spectrometry [48], which measure differences in metabolite concentrations, the elements are metabolites. As a result of a SFINX, each element is given a measure of its change, and the complete list of these measures constitutes a profile that fully characterises the experiment. It is perfectly reasonable that, after calculating a SFINX profile, a researcher would like to know whether somebody somewhere has performed another experiment with a similar profile. Such information would suggest that, to a first approximation, similar biological processes took place in both experiments, potentially saving time, effort and resources in identifying those processes. Unfortunately, no mechanism for such a search is currently available in the public repositories of functional genomics data.
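
To make the notion of a profile concrete, the sketch below computes a SFINX-style profile for two replicated sample groups: a per-element measure of change (log2 fold change) together with a per-element significance value. The function name, the use of Welch's t-test and the 0.01 cut-off are illustrative assumptions, not the specific analysis used in the SHWFGF pipeline.

```python
# Illustrative sketch only: a SFINX-style profile for two replicated
# sample groups. Welch's t-test and the 0.01 cut-off are assumptions.
import numpy as np
from scipy import stats

def sfinx_profile(condition_a, condition_b):
    """Both inputs have shape (n_elements, n_replicates). Returns a
    per-element log2 fold change and a per-element p-value."""
    log_fc = np.log2(condition_b.mean(axis=1) / condition_a.mean(axis=1))
    # Welch's t-test per element (row); equal_var=False allows unequal variances
    _, p_values = stats.ttest_ind(condition_a, condition_b,
                                  axis=1, equal_var=False)
    return log_fc, p_values

# Example: 1000 elements (e.g. mRNAs), 3 biological replicates per condition
rng = np.random.default_rng(0)
healthy = rng.lognormal(mean=5.0, sigma=1.0, size=(1000, 3))
diseased = rng.lognormal(mean=5.0, sigma=1.0, size=(1000, 3))
fold_change, p = sfinx_profile(healthy, diseased)
profile = dict(enumerate(fold_change))   # the profile characterising the SFINX
significant = np.flatnonzero(p < 0.01)   # elements that differ between conditions
```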

For such searches to be meaningful, several conditions must be fulfilled: first, the SFINX profile has to be reliable; second, it has to have a set of sub-profiles corresponding to different levels of confidence; third, the library of profiles has to be constructed using a standardized method; and last, a similarity measure between profiles has to be established. The Sir Henry Wellcome Functional Genomics Facility at the University of Glasgow (SHWFGF) has developed a number of new techniques for analyzing large genomic data sets such as microarray results [23]. These techniques combine statistical and biological reasoning in an automated framework [24]. They include: Rank Products (RP), a powerful test statistic for detecting differentially regulated genes in replicated experiments [49]; iterative Group Analysis (iGA), an algorithm that computes physiological interpretations of experiments based on existing functional annotations [50]; and Graph-based iterative Group Analysis (GiGA), a graph-based extension of iGA that uses expression data to highlight relevant areas in an "evidence network" to facilitate data interpretation and visualization [51]. With local data sets, these methods support a novel, fully automated pipeline for the analysis of Affymetrix GeneChip arrays. This pipeline has been running successfully for several months in the SHWFGF; so far, nearly 500 SFINXs comprising nearly 2000 chips have been analyzed.
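
As an illustration of the first of these techniques, the following is a minimal sketch of the Rank Products idea from [49]: a gene that is consistently near the top of every replicate's fold-change ranking receives a small rank product. The permutation-based significance estimation of the published method is omitted; this is a sketch, not the reference implementation.

```python
# Minimal sketch of the Rank Products idea from [49]; significance
# estimation by permutation is omitted for brevity.
import numpy as np

def rank_products(fold_changes):
    """fold_changes: shape (n_genes, n_replicates), one fold change per
    gene per replicated comparison. Returns the rank product per gene."""
    n_genes, n_reps = fold_changes.shape
    order = np.argsort(-fold_changes, axis=0)   # rank 1 = most up-regulated
    ranks = np.empty_like(order)
    for j in range(n_reps):
        ranks[order[:, j], j] = np.arange(1, n_genes + 1)
    # Geometric mean of the per-replicate ranks
    return np.exp(np.log(ranks).mean(axis=1))
```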

Extending this approach to deal with distributed data sets requires several challenges to be overcome. First, data must be found and accessed, which often requires local security issues to be addressed. Second, these data must be integrated with other data sets, where the remote data sets are continually evolving. Third, and ideally, data should be mined to bring more understanding and to support richer mechanisms for comparing and evaluating biological experiments and results. The BRIDGES project [14] has developed a Grid infrastructure that addresses the first two of these concerns: data access and integration. A follow-on BBSRC-funded project, Grid Enabled Microarray Expression Profile Search (GEMEPS) [52], will enhance this infrastructure to move towards data mining capabilities.
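
To indicate the kind of search capability GEMEPS is working towards, the sketch below rank-correlates a query SFINX profile against a library of stored profiles over their shared elements. The dictionary-based profile format, the use of Spearman correlation and the minimum-overlap cut-off are assumptions made for illustration, not the project's actual design.

```python
# Sketch of a GEMEPS-style profile search under assumed conventions: a
# profile is a mapping {element_id: change measure}, and similarity is
# Spearman rank correlation over the elements two profiles share.
from scipy.stats import spearmanr

def most_similar(query, library, top_n=5, min_overlap=10):
    """Rank the experiments in `library` by similarity to `query`."""
    scores = []
    for exp_id, profile in library.items():
        shared = sorted(set(query) & set(profile))  # compare shared elements only
        if len(shared) < min_overlap:
            continue                                # too little overlap to compare
        rho, _ = spearmanr([query[e] for e in shared],
                           [profile[e] for e in shared])
        scores.append((rho, exp_id))
    return sorted(scores, reverse=True)[:top_n]
```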

Grid technologies directly address many of the difficulties present in large-scale heterogeneous distributed systems where collections of remote data and compute resources are to be seamlessly accessed and used. One of the primary challenges facing Grid technologies is managing access to, and usage of, a broad array of remote, evolving and largely autonomous data sets (autonomous in the sense of being managed by separate organizations). It is possible to store and maintain data such as microarray results centrally in curation centres, for example the Medical Research Council/Imperial College funded Clinical Sciences Centre microarray data warehouse [5]. However, large centralized centres are costly to set up and manage, and they have a significant drawback: they require scientists to hand over their data sets to a third party to manage and to trust that appropriate security mechanisms are in place. Scientists are generally unwilling to make their microarray data sets (or research data sets more generally) available before their experiments are formally published in journals or at conferences [6]. As such, these data curation repositories are always likely to be populated with older data sets, so scientists wishing to perform experiments cannot determine whether recent experiments have already been performed, and hence cannot perform any useful comparison until papers have been published, which, depending upon the journal, can be especially time consuming.

A better model is to allow scientists to keep and maintain their own local data sets and to provide secure access to them in a tightly controlled setting, e.g. to specific colleagues or centres wishing to compare mutually beneficial experiments. To achieve this, and bearing in mind the competitive nature of research and the costs incurred in running experiments, security of data is an important factor. Individual sites will of course have their own procedures and policies for dealing with data security; however, the Grid community has developed generic security solutions that can be applied to augment existing security infrastructures. Through these additional mechanisms, local security policies can be enforced that restrict and control access to research data sets that might not otherwise be made available, i.e. data that have not yet been published. This is achievable through recent Grid security standardization activities [7], recent technological developments [8], [9], [10], [11], [12], [13] and the direct experience of the National e-Science Centre (NeSC) at the University of Glasgow in projects such as the JISC funded DyVOSE project [15], the MRC funded VOTES project [16] and the CSO funded Genetics and Healthcare project [17].
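
By way of illustration, the following hypothetical sketch shows the kind of local policy such mechanisms might enforce: an unpublished data set is visible only to its owner and named collaborators, while a published one is open. All names and fields are invented for the example; no specific BRIDGES policy is implied.

```python
# Hypothetical local access policy of the kind described above. All
# names and fields are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class DataSet:
    owner: str
    published: bool
    collaborators: set = field(default_factory=set)

def may_access(user, ds):
    """Owners and collaborators always; everyone else only post-publication."""
    return user == ds.owner or user in ds.collaborators or ds.published

arrays = DataSet(owner="glasgow_lab", published=False,
                 collaborators={"partner_centre"})
assert may_access("partner_centre", arrays)   # named collaborator: allowed
assert not may_access("anonymous", arrays)    # unpublished, unknown user: denied
```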

Section snippets

BRIDGES project overview

Arguably the primary objective in applying Grid technology is to establish virtual organizations (VOs). VOs allow shared use of computational and data resources by collaborating institutions and scientists [27], [42], [45]. Establishing a VO requires efficient security mechanisms controlling access to the shared resources by known individuals. One example of a VO is the Wellcome Trust funded (£4.34M) Cardiovascular Functional Genomics (CFG) project [18], which is investigating possible

Grid-based security

Given the open and collaborative nature of the Grid, ensuring that local security constraints are met and not weakened by Grid security solutions is paramount. Public Key Infrastructures (PKIs) represent the most common way in which security is addressed. Through PKIs, it is possible to validate the identity of a given user requesting access to a given resource. For example, in the Globus toolkit [28], gatekeepers are used to ensure that signed requests are valid, i.e. from known
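
The sketch below illustrates the basic PKI step a gatekeeper performs: accepting a request only if it was signed by the private key corresponding to a known user's public key. It uses the Python 'cryptography' package purely for illustration; the actual Globus GSI implementation is considerably more involved (X.509 certificates, proxy credentials and delegation).

```python
# Illustration of the basic PKI check a gatekeeper performs. The real
# Globus GSI adds X.509 certificates, proxy credentials and delegation.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Stand-in for a registered user's credentials (normally certified by a
# trusted certificate authority rather than generated in place).
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
known_public_key = private_key.public_key()

request = b"GET /genomic-data/experiment-42"
signature = private_key.sign(request, padding.PKCS1v15(), hashes.SHA256())

def request_is_valid(message, sig):
    """Gatekeeper-style check: is this request signed by the known user?"""
    try:
        known_public_key.verify(sig, message, padding.PKCS1v15(), hashes.SHA256())
        return True
    except InvalidSignature:
        return False

assert request_is_valid(request, signature)
assert not request_is_valid(b"GET /genomic-data/experiment-43", signature)
```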

Conclusions and future work

One of the major challenges facing in-silico life science research is managing the data deluge. Understanding biological processes necessitates access to and understanding of collections of potentially distributed, separately owned and managed biological data sets. Federated data models represent the most realistic approach due to the expense of centrally based data curation and the general reluctance of biologists to hand over their data to another party. Given that the CFG project are

Acknowledgements

This work was supported by a grant from the Department of Trade and Industry. The authors would also like to thank members of the BRIDGES and CFG team including Professor David Gilbert, Professor Malcolm Atkinson, Dr Dave Berry, Dr Ela Hunt and Dr Neil Hanlon. Magnus Ferrier is acknowledged for his contribution to the MagnaVista software, Dr Jos Koetsier for his work on the GeneVista application, Micha Bayer for feedback on earlier versions of this paper and work on portals and Grid-based

References (53)

  • Human Proteome Organisation (HUPO), Proteomics Standards Initiative (PSI),...
  • Proteomics Experiment Data Repository...
  • Architecture for Metabolomics (ArMet),...
  • Minimum Information About a Microarray Experiment...
  • Clinical Sciences Centre/Imperial...
  • Joint Data Standards Survey...
  • Global Grid Forum, Frameworks and Mechanisms...
  • D.W. Chadwick et al., The PERMIS X.509 role based privilege management infrastructure, Future Generation Computer Systems (2002)
  • L. Pearlman et al., A community authorisation service for group collaboration, in: Proceedings of the IEEE 3rd...
  • R. Lepro, Cardea: Dynamic access control in distributed systems, NASA Technical Report NAS-03-020, November...
  • Globus Grid Security Infrastructure (GSI),...
  • W. Johnston, S. Mudumbai, M. Thompson, Authorization and attribute certificates for widely distributed access control,...
  • S. Newhouse, Virtual Organisation Management, The London E-Science centre,...
  • BioMedical Research Informatics Delivered by Grid Enabled Services project (BRIDGES),...
  • Dynamic Virtual Organisations in e-Science Education project (DyVOSE),...
  • Virtual Organisations for Trials and Epidemiological Studies (VOTES),...
  • Genomics and Healthcare Initiative...
  • Cardiovascular Functional Genomics...
  • MIAMexpress,...
  • MaxDLoad,...
  • Computational Biology Service Unit, Cornell University, Ithaca, New York,...
  • RIKEN Genomic Sciences Centre Bioinformatics Group, Yokohama Institute, Yokohama, Japan,...
  • Functional genomics of nutrient transport in Arabidopsis: Bioinformatics approach, BBSRC grant, April...
  • T.R. Golub et al., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring, Science (1999)
  • JISC Authentication, Authorisation and Accounting (AAA) Programme Technologies for Information Environment Security...
  • A. Whitten, J.D. Tygar, Why Johnny can’t encrypt: A usability evaluation of PGP 5.0, in: 9th USENIX Security Symposium,...
Professor Richard Sinnott is the Technical Director of the National e-Science Centre (NeSC) at the University of Glasgow. In addition to this, he is Deputy Director (Technical) of the Bioinformatics Research Centre.

Dr. Sinnott is responsible for establishing an environment for e-Science at Glasgow University. This includes provision of the necessary computational infrastructure as well as the training and education of future e-Scientists. He lectures advanced level students on Grid Computing and Modelling Reactive Systems.

Dr. Sinnott is also involved in several ongoing e-Science projects. He is principal investigator of the National e-Science Centre at Glasgow University; PI on the Dynamic Virtual Organizations for e-Science Education (DyVOSE) project; and PI on the Grid Enabled Microarray Expression Profile Search (GEMEPS) project. He is also involved in a range of e-Health projects including the MRC funded Virtual Organisations for Trials and Epidemiological Studies (VOTES) project; the Scottish Bioinformatics Research Network (SBRN); the Biomedical Research Informatics Delivered by Grid Enabled Services (BRIDGES) project; and the Generation Scotland Scottish Family Health Study. He is also acting consultant on the Joint Data Standards Survey (JDSS), the ESP-Grid project and the Grid Enabled Occupational Data Environment (GEODE) project.

Before coming to Glasgow, Dr. Sinnott ran his own consultancy company based in Germany specializing in formal technologies and their application to real-time systems development, especially in the telecommunications domain.

He holds a Ph.D. from the University of Stirling, where his research was based on the modelling and architectural design of open distributed processing systems; he edited several international standards in this domain. He also holds an M.Sc. in Software Engineering from the University of Stirling (dissertation on the Formal Specification of Electronic Components in LOTOS) and a B.Sc. in Theoretical Physics from the University of East Anglia (UEA) in Norwich (dissertation on Computer Simulation of a Linear Polymer).

Dr. Sinnott's current research is focused on Grid computing using technologies such as the Globus Toolkit and its application to a broad spectrum of scientific areas, bioinformatics being one exemplar. He also maintains an interest in formal methods and their application to real-time, distributed systems development.
