Large-scale multivariate sparse regression with applications to UK Biobank
Junyang Qian, Yosuke Tanigawa, Ruilin Li, Robert Tibshirani, Manuel A. Rivas, Trevor Hastie
Ann. Appl. Stat. 16(3): 1891-1918 (September 2022). DOI: 10.1214/21-AOAS1575

Abstract

In high-dimensional regression problems, often a relatively small subset of the features is relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that underlie the data. Our proposal is motivated by the UK Biobank population-based cohort study, where we are faced with large-scale, ultrahigh-dimensional features, and have access to a large number of outcomes (phenotypes): lifestyle measures, biomarkers, and disease outcomes. We are hence led to fit sparse reduced-rank regression models, using computational strategies that allow us to scale to problems of this size. We use a scheme that alternates between solving the sparse regression problem and solving the reduced rank decomposition. For the sparse regression component we propose a scalable iterative algorithm based on adaptive screening that leverages the sparsity assumption and enables us to focus on solving much smaller subproblems. The full solution is reconstructed and tested via an optimality condition to make sure it is a valid solution for the original problem. We further extend the method to cope with practical issues, such as the inclusion of confounding variables and imputation of missing values among the phenotypes. Experiments on both synthetic data and the UK Biobank data demonstrate the effectiveness of the method and the algorithm. We present the multiSnpnet package, available at http://github.com/junyangq/multiSnpnet, which works on top of PLINK2 files and which we anticipate will be a valuable tool for generating polygenic risk scores from human genetic studies.
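
The alternating scheme described in the abstract can be illustrated with a small NumPy sketch. It assumes one common formulation of sparse reduced-rank regression, minimizing 0.5 * ||Y - X U V^T||_F^2 + lam * sum_j ||U_j||_2 subject to V^T V = I, which the abstract itself does not spell out. The function names are hypothetical and are not the multiSnpnet API. The sparse regression step here uses plain proximal gradient descent instead of the adaptive screening procedure the authors propose, and there is no handling of confounders, missing phenotypes, or PLINK2 input.

```python
import numpy as np

def row_soft_threshold(A, t):
    """Shrink each row of A toward zero by t in Euclidean norm (group-lasso prox)."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return scale * A

def objective(X, Y, U, V, lam):
    """Squared-error loss plus a row-wise group-lasso penalty on U."""
    resid = Y - X @ U @ V.T
    return 0.5 * np.sum(resid ** 2) + lam * np.sum(np.linalg.norm(U, axis=1))

def sparse_reduced_rank(X, Y, rank, lam, n_outer=20, n_inner=50):
    """Alternate a row-sparse regression step (U) with a decomposition step (V).

    Hypothetical illustration only: assumes rank <= number of outcomes.
    """
    p = X.shape[1]
    # Initialize V from the leading right singular vectors of Y.
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    V = Vt[:rank].T                          # q x rank, orthonormal columns
    U = np.zeros((p, rank))
    # Step size 1/L, where L is the Lipschitz constant of the smooth part.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    history = []
    for _ in range(n_outer):
        # U-step: with V fixed, regress the projected outcomes Y V on X
        # under a row-wise penalty, via proximal gradient iterations.
        Z = Y @ V
        for _ in range(n_inner):
            grad = X.T @ (X @ U - Z)
            U = row_soft_threshold(U - step * grad, step * lam)
        # V-step: with U fixed, solve the orthogonal Procrustes problem;
        # V = P Q^T from the thin SVD of Y^T X U.
        P, _, Qt = np.linalg.svd(Y.T @ X @ U, full_matrices=False)
        V = P @ Qt
        history.append(objective(X, Y, U, V, lam))
    return U, V, history
```

Under these assumptions neither step can increase the objective, so `history` is nonincreasing up to rounding. A row of `U` that is exactly zero means the corresponding feature is dropped for every outcome at once.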

Funding Statement

This research has been conducted using the UK Biobank Resource under Application Number 24983, “Generating effective therapeutic hypotheses from genomic and hospital linkage data” (http://www.ukbiobank.ac.uk/wp-content/uploads/2017/06/24983-Dr-Manuel-Rivas.pdf). Based on the information provided in Protocol 44532, the Stanford IRB has determined that the research does not involve human subjects, as defined in 45 CFR 46.102(f) or 21 CFR 50.3(g). All participants of UK Biobank provided written informed consent (more information is available at https://www.ukbiobank.ac.uk/2018/02/gdpr/).
Research reported in this publication was supported by the National Human Genome Research Institute of the NIH under Award Number R01HG010140 (M.A.R.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Y.T. is supported by a Funai Overseas Scholarship from the Funai Foundation for Information Technology and the Stanford University School of Medicine.
M.A.R. is supported by Stanford University, a National Institutes of Health (NIH) Center for Multi- and Trans-ethnic Mapping of Mendelian and Complex Diseases grant (5U01 HG009080) and grant 1R01MH121455-01. M.A.R. also wants to thank the Sanofi IDEA award.
R.T. was partially supported by NIH grant 5R01 EB001988-16 and NSF grant 19 DMS1208164.
T.H. was partially supported by grant DMS-1407548 from the National Science Foundation and grant 5R01 EB 001988-21 from the National Institutes of Health.

Acknowledgments

We want to thank the Editor, an Associate Editor, and two anonymous reviewers from the journal for their constructive comments that helped us improve the manuscript significantly. We also want to thank all the participants in the UK Biobank.

Citation

Junyang Qian, Yosuke Tanigawa, Ruilin Li, Robert Tibshirani, Manuel A. Rivas, Trevor Hastie. "Large-scale multivariate sparse regression with applications to UK Biobank." Ann. Appl. Stat. 16 (3) 1891-1918, September 2022. https://doi.org/10.1214/21-AOAS1575

Information

Received: 1 September 2020; Revised: 1 October 2021; Published: September 2022
First available in Project Euclid: 19 July 2022

MathSciNet: MR4455904
zbMATH: 1498.62135
Digital Object Identifier: 10.1214/21-AOAS1575

Keywords: large-scale algorithm, polygenic risk score, reduced-rank regression, sparse regression, UK Biobank, ultrahigh-dimensional problem

Rights: Copyright © 2022 Institute of Mathematical Statistics

JOURNAL ARTICLE
28 PAGES
