Large-scale multivariate sparse regression with applications to UK Biobank

Junyang Qian; Yosuke Tanigawa; Ruilin Li; Robert Tibshirani; Manuel A. Rivas; Trevor Hastie

doi:10.1214/21-AOAS1575


Junyang Qian,¹ Yosuke Tanigawa,² Ruilin Li,³ Robert Tibshirani,¹ Manuel A. Rivas,² Trevor Hastie¹
¹Department of Statistics, Stanford University
²Department of Biomedical Data Science, Stanford University
³Institute for Computational and Mathematical Engineering, Stanford University

Ann. Appl. Stat. 16(3): 1891-1918 (September 2022). DOI: 10.1214/21-AOAS1575

Abstract

In high-dimensional regression problems, often only a relatively small subset of the features is relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced-rank regression is an effective way to borrow strength and capture latent structures that underlie the data. Our proposal is motivated by the UK Biobank population-based cohort study, where we are faced with large-scale, ultrahigh-dimensional features and have access to a large number of outcomes (phenotypes): lifestyle measures, biomarkers, and disease outcomes. We are hence led to fit sparse reduced-rank regression models, using computational strategies that allow us to scale to problems of this size. We use a scheme that alternates between solving the sparse regression problem and solving the reduced-rank decomposition. For the sparse regression component, we propose a scalable iterative algorithm based on adaptive screening that leverages the sparsity assumption and enables us to focus on solving much smaller subproblems. The full solution is reconstructed and tested via an optimality condition to make sure it is a valid solution for the original problem. We further extend the method to cope with practical issues, such as the inclusion of confounding variables and imputation of missing values among the phenotypes. Experiments on both synthetic data and the UK Biobank data demonstrate the effectiveness of the method and the algorithm. We present the $\mathtt{multiSnpnet}$ package, available at http://github.com/junyangq/multiSnpnet, which works on top of $\mathtt{PLINK2}$ files and which we anticipate will be a valuable tool for generating polygenic risk scores from human genetic studies.
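The alternating scheme described in the abstract can be illustrated with a small NumPy sketch. It is not the $\mathtt{multiSnpnet}$ implementation, and it rests on assumptions the abstract does not state: the coefficient matrix is factored as $UV^\top$ with $V^\top V = I$, sparsity is imposed through a row-wise group-lasso penalty on $U$, and the sparse regression step is solved by plain proximal gradient rather than by adaptive screening. The function names (`group_soft_threshold`, `fit_srrr`, `kkt_violations`) are hypothetical. The last function only illustrates the kind of optimality check that a screening strategy would apply to the features left out of a subproblem.

```python
import numpy as np

def group_soft_threshold(U, t):
    """Shrink the L2 norm of every row of U by t; rows with norm <= t become zero."""
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return U * scale

def fit_srrr(X, Y, rank, lam, n_outer=50, n_inner=100):
    """Alternating minimization sketch (assumed objective, not the authors' code):
        0.5 * ||Y - X U V^T||_F^2 + lam * sum_j ||U[j, :]||_2,  with V^T V = I.
    """
    p = X.shape[1]
    # Initialize V with the top right singular vectors of Y (q x rank).
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    V = Vt[:rank].T
    U = np.zeros((p, rank))
    # Step size 1/L, where L is the squared spectral norm of X.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(n_outer):
        # Sparse regression step: with V fixed, regress Y V on X (group lasso).
        Z = Y @ V
        for _ in range(n_inner):
            grad = X.T @ (X @ U - Z)
            U = group_soft_threshold(U - step * grad, step * lam)
        # Reduced-rank step: with U fixed, V solves an orthogonal Procrustes problem.
        M = Y.T @ (X @ U)
        if not M.any():
            break  # every feature was thresholded out; V is left unchanged
        P, _, Qt = np.linalg.svd(M, full_matrices=False)
        V = P @ Qt
    return U, V

def kkt_violations(X, Y, U, V, lam, tol=1e-6):
    """Indices of zero rows of U whose gradient norm exceeds lam.

    For a zero row j, optimality of the U-step requires
    ||X[:, j]^T (Y V - X U)||_2 <= lam; violators would need to be added
    back to the working set in a screening-based solver.
    """
    residual = Y @ V - X @ U
    grad_norms = np.linalg.norm(X.T @ residual, axis=1)
    inactive = np.linalg.norm(U, axis=1) == 0
    return np.flatnonzero(inactive & (grad_norms > lam + tol))
```

A call such as `U, V = fit_srrr(X, Y, rank=2, lam=lam)` returns a row-sparse `U` and an orthonormal `V`, and `kkt_violations(X, Y, U, V, lam)` lists any excluded features that fail the check. At UK Biobank scale the full matrix products used here would be infeasible, which is the motivation the abstract gives for solving smaller screened subproblems instead.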

Funding Statement

This research has been conducted using the UK Biobank Resource under Application Number 24983, “Generating effective therapeutic hypotheses from genomic and hospital linkage data” (http://www.ukbiobank.ac.uk/wp-content/uploads/2017/06/24983-Dr-Manuel-Rivas.pdf). Based on the information provided in Protocol 44532, the Stanford IRB has determined that the research does not involve human subjects, as defined in 45 CFR 46.102(f) or 21 CFR 50.3(g). All participants of UK Biobank provided written informed consent (more information is available at https://www.ukbiobank.ac.uk/2018/02/gdpr/).
Research reported in this publication was supported by the National Human Genome Research Institute of the NIH under Award Number R01HG010140 (M.A.R.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Y.T. is supported by a Funai Overseas Scholarship from the Funai Foundation for Information Technology and the Stanford University School of Medicine.
M.A.R. is supported by Stanford University, a National Institutes of Health (NIH) Center for Multi- and Trans-ethnic Mapping of Mendelian and Complex Diseases grant (5U01 HG009080) and grant 1R01MH121455-01. M.A.R. also wants to thank the Sanofi IDEA award.
R.T. was partially supported by NIH grant 5R01 EB001988-16 and NSF grant DMS1208164.
T.H. was partially supported by grant DMS-1407548 from the National Science Foundation and grant 5R01 EB 001988-21 from the National Institutes of Health.

Acknowledgments

We want to thank the Editor, an Associate Editor, and two anonymous reviewers from the journal for their constructive comments that helped us improve the manuscript significantly. We also want to thank all the participants in the UK Biobank.

Citation

Junyang Qian, Yosuke Tanigawa, Ruilin Li, Robert Tibshirani, Manuel A. Rivas, Trevor Hastie. "Large-scale multivariate sparse regression with applications to UK Biobank." Ann. Appl. Stat. 16 (3) 1891-1918, September 2022. https://doi.org/10.1214/21-AOAS1575

Information

Received: 1 September 2020; Revised: 1 October 2021; Published: September 2022

First available in Project Euclid: 19 July 2022

MathSciNet: MR4455904

zbMATH: 1498.62135

Digital Object Identifier: 10.1214/21-AOAS1575

Keywords: large-scale algorithm, polygenic risk score, reduced-rank regression, sparse regression, UK Biobank, ultrahigh-dimensional problem

Ann. Appl. Stat., Vol. 16, No. 3, September 2022
Institute of Mathematical Statistics
