ABSTRACT
Genome-wide association studies (GWAS) analyze genetic variation (SNPs) across the entire human genome, searching for SNPs that are associated with certain phenotypes, most often diseases, such as breast cancer. In GWAS, we seek a ranking of SNPs in terms of their relevance to the given phenotype. However, because certain SNPs are known to be highly correlated with one another across individuals, it can be beneficial to take these correlations into account when ranking. If a SNP appears associated with the phenotype, and we question whether this association is real, the extent to which its neighbors (correlated SNPs) also appear associated can be informative. Therefore, we propose CollectRank, a ranking approach that allows SNPs to reinforce one another via the correlation structure. CollectRank is loosely analogous to the well-known PageRank algorithm. We first evaluate CollectRank on synthetic data generated from a variety of genetic models under different settings. The numerical results suggest that CollectRank can significantly outperform common GWAS methods at the cost of a small amount of extra computation. We further evaluate CollectRank on two real-world GWAS, one on breast cancer and one on atrial fibrillation/flutter, and it performs well in both studies. Finally, we provide a theoretical analysis that also suggests CollectRank's advantages.
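The abstract does not give CollectRank's update rule, so the following is only a rough sketch of the general idea it describes: per-SNP association scores that are reinforced by the scores of correlated SNPs, in the spirit of PageRank. The function names (`collective_scores`, `rank`), the `damping` parameter, the row-normalized averaging, and the toy data are all assumptions made for illustration, not the authors' method.

```python
def collective_scores(base, corr, damping=0.5, iters=100):
    """PageRank-style smoothing of per-SNP scores (illustrative assumption).

    base:    non-negative single-SNP association scores.
    corr:    square matrix of non-negative pairwise correlations (e.g. r^2).
    damping: weight given to neighbors; 0 reproduces the base ranking.
    """
    n = len(base)
    # Row-normalize correlations, ignoring self-correlation, so each SNP
    # takes a weighted average over its correlated neighbors.
    weights = []
    for i in range(n):
        row = [corr[i][j] if j != i else 0.0 for j in range(n)]
        total = sum(row)
        weights.append([w / total if total > 0 else 0.0 for w in row])

    scores = list(base)
    for _ in range(iters):
        # Mix each SNP's own evidence with its neighbors' current scores.
        scores = [
            (1 - damping) * base[i]
            + damping * sum(weights[i][j] * scores[j] for j in range(n))
            for i in range(n)
        ]
    return scores


def rank(scores):
    """Return SNP indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy data: SNP 0 has the top raw score but no correlated
# neighbors; SNPs 1 and 2 are strongly correlated and both look associated.
base = [5.0, 4.0, 4.0, 0.5]
corr = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.9, 0.0],
    [0.0, 0.9, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
scores = collective_scores(base, corr)
print(rank(base))    # ranking by raw score
print(rank(scores))  # ranking after neighbor reinforcement
```

In this toy setup the mutually supporting SNPs 1 and 2 move ahead of the isolated SNP 0. That an isolated SNP is discounted this strongly is a consequence of the sketch's simple mixing rule, not a claim about how CollectRank treats such SNPs.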