Abstract
In high-dimensional regression problems, often only a relatively small subset of the features is relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that underlie the data. Our proposal is motivated by the UK Biobank population-based cohort study, where we are faced with large-scale, ultrahigh-dimensional features, and have access to a large number of outcomes (phenotypes): lifestyle measures, biomarkers, and disease outcomes. We are hence led to fit sparse reduced-rank regression models, using computational strategies that allow us to scale to problems of this size. We use a scheme that alternates between solving the sparse regression problem and solving the reduced rank decomposition. For the sparse regression component we propose a scalable iterative algorithm based on adaptive screening that leverages the sparsity assumption and enables us to focus on solving much smaller subproblems. The full solution is then reconstructed and checked against an optimality condition to make sure it is a valid solution to the original problem. We further extend the method to cope with practical issues, such as the inclusion of confounding variables and imputation of missing values among the phenotypes. Experiments on both synthetic data and the UK Biobank data demonstrate the effectiveness of the method and the algorithm. We present a package, multiSnpnet, available at http://github.com/junyangq/multiSnpnet, that works on top of files, and which we anticipate will be a valuable tool for generating polygenic risk scores from human genetic studies.
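The alternating scheme described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation and does not include the adaptive screening step, confounders, or missing-value imputation. It assumes a common formulation of sparse reduced-rank regression, in which the coefficient matrix is factored as U Vᵀ with VᵀV = I and a row-wise group-lasso penalty on U; the function names `row_soft_threshold` and `srrr_alternating`, and the use of proximal gradient for the sparse step, are choices made here for illustration only.

```python
import numpy as np

def row_soft_threshold(A, t):
    """Shrink each row of A toward zero by t in Euclidean norm (group-lasso prox)."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return scale * A

def srrr_alternating(X, Y, rank, lam, n_outer=50, n_inner=100):
    """Illustrative alternating minimization of
    0.5 * ||Y - X U V^T||_F^2 + lam * sum_j ||U[j, :]||_2  subject to V^T V = I.
    """
    p = X.shape[1]
    # Initialize V with the leading right singular vectors of Y.
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    V = Vt[:rank].T                      # shape (q, rank)
    U = np.zeros((p, rank))
    # Step size 1/L, where L is the largest eigenvalue of X^T X.
    step = 1.0 / max(np.linalg.norm(X, 2) ** 2, 1e-12)
    for _ in range(n_outer):
        # Sparse regression step: with V fixed, regress Y V on X with a
        # row-sparse penalty, solved here by proximal gradient descent.
        target = Y @ V
        for _ in range(n_inner):
            grad = X.T @ (X @ U - target)
            U = row_soft_threshold(U - step * grad, step * lam)
        # Reduced-rank step: with U fixed, the optimal orthonormal V is an
        # orthogonal Procrustes solution from the SVD of Y^T X U.
        M = Y.T @ X @ U
        if np.allclose(M, 0):
            break                        # U is entirely zero; nothing to update
        P, _, Qt = np.linalg.svd(M, full_matrices=False)
        V = P @ Qt
    return U, V
```

Rows of `U` that are exactly zero correspond to features dropped from the model, and the fitted coefficient matrix is `U @ V.T`. At the scale the abstract describes, the inner sparse step would be run only on a screened subset of features rather than on all columns of `X`.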
Funding Statement
This research has been conducted using the UK Biobank Resource under Application Number 24983, “Generating effective therapeutic hypotheses from genomic and hospital linkage data” (http://www.ukbiobank.ac.uk/wp-content/uploads/2017/06/24983-Dr-Manuel-Rivas.pdf). Based on the information provided in Protocol 44532, the Stanford IRB has determined that the research does not involve human subjects, as defined in 45 CFR 46.102(f) or 21 CFR 50.3(g). All participants of UK Biobank provided written informed consent (more information is available at https://www.ukbiobank.ac.uk/2018/02/gdpr/).
Research reported in this publication was supported by the National Human Genome Research Institute of the NIH under Award Number R01HG010140 (M.A.R.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Y.T. is supported by a Funai Overseas Scholarship from the Funai Foundation for Information Technology and the Stanford University School of Medicine.
M.A.R. is supported by Stanford University, a National Institutes of Health (NIH) Center for Multi- and Trans-ethnic Mapping of Mendelian and Complex Diseases grant (5U01 HG009080) and grant 1R01MH121455-01. M.A.R. also wants to thank the Sanofi IDEA award.
R.T. was partially supported by NIH grant 5R01 EB001988-16 and NSF grant DMS1208164.
T.H. was partially supported by grant DMS-1407548 from the National Science Foundation and grant 5R01 EB 001988-21 from the National Institutes of Health.
Acknowledgments
We want to thank the Editor, an Associate Editor, and two anonymous reviewers from the journal for their constructive comments that helped us improve the manuscript significantly. We also want to thank all the participants in the UK Biobank.
Citation
Junyang Qian, Yosuke Tanigawa, Ruilin Li, Robert Tibshirani, Manuel A. Rivas, Trevor Hastie. "Large-scale multivariate sparse regression with applications to UK Biobank." Ann. Appl. Stat. 16 (3) 1891-1918, September 2022. https://doi.org/10.1214/21-AOAS1575