Abstract
Answer ALS is a biological and clinical resource of patient-derived, induced pluripotent stem (iPS) cell lines, multi-omic data derived from iPS neurons and longitudinal clinical and smartphone data from over 1,000 patients with ALS. This resource provides population-level biological and clinical data that may be employed to identify clinical–molecular–biochemical subtypes of amyotrophic lateral sclerosis (ALS). A unique smartphone-based system was employed to collect deep clinical data, including fine motor activity, speech, breathing and linguistics/cognition. The iPS spinal neurons were blood derived from each patient and these cells underwent multi-omic analytics including whole-genome sequencing, RNA transcriptomics, ATAC-sequencing and proteomics. The intent of these data is for the generation of integrated clinical and biological signatures using bioinformatics, statistics and computational biology to establish patterns that may lead to a better understanding of the underlying mechanisms of disease, including subgroup identification. A web portal for open-source sharing of all data was developed for widespread community-based data analytics.
Similar content being viewed by others
Main
Over the last several decades, tremendous progress in the optimization of therapies for various medical conditions, such as cancer, has been realized. Many factors underlie this therapeutic success, including optimization of clinical trial design, new pathway-specific pharmaceuticals and the coordination of participant recruitment efforts across clinics. Perhaps one of the most powerful and fundamental reasons for the success of some cancer therapies is the ability to sample diseased tissues and thereby distinguish the biological and molecular events responsible for individual diseases or disease subgroups within a disease cluster1. Thus, skin, breast or prostate biopsies have been important starting points for the investigation of various types of melanomas and breast or prostate cancers. Neurodegenerative diseases such as ALS, Alzheimer’s disease and Huntington’s disease have, however, not seen such advances. Clinical trials in humans, often based on findings from nonhuman model systems, have repeatedly proven disappointing2,3. Although there are probably many reasons for such failures (for example, poor pharmacokinetics, wrong biological pathway, lack of target engagement), a critical reason is the inability to identify disease pathways in patient tissues and to segment patients for clinical trials according to these pathways. As a result of the high risk of disability, brain and spinal cord biopsies for tissue analysis are not feasible in neurodegenerative diseases and therefore, unlike the biopsy of other organs and tissues, obtaining neural tissue during the disease course is a significant hurdle to effective therapeutic development.
An alternative is to use stem cell technology and infer disease pathways from cell lines derived from the patients’ own blood. Evidence for this approach is beginning to emerge. Early work employing iPS spinal neurons from patients with C9orf72 ALS/frontotemporal dementia led the way to the development of the first antisense-based gene therapy for this common familial form of ALS (fALS), with an international clinical trial already under way (clinicaltrials.gov: NCT03626012)4,5. But for most patients with ALS, who have sporadic disease (sALS), these discoveries have yet to translate into meaningful therapies. A major barrier has been the lack of a predictive preclinical human model for sALS. However, with advances in iPS cell technology and the unprecedented data and specimen collection efforts of Answer ALS, we can now take an iPS cell-based approach to unraveling mechanisms that may cause or contribute to the heterogeneous clinical spectra of sALS, such as pattern and speed of spread and certain nonmotor manifestations. Notably, multiple gene mutations are already known to cause fALS and represent quite diverse pathways: RNA metabolism, nuclear transport, protein aggregation, axonal trafficking, glial dysfunction, etc.6. Curiously, the variability in clinical features is nearly as great when comparing patients with any single mutated gene as it is when comparing across genes or with sALS. Little is known about the derangements in specific biological pathway(s) driving sALS or whether there are ALS subgroups defined by specific biological derangements. Knowledge of these biological subgroups may be critically important and the success of disease-modifying therapies may depend on treating the right ‘subgroup’ with the proper pathway-targeting drug.
The Answer ALS (AALS) program was conceived as a program to generate iPS cell lines from a large number of patients with ALS and apply well-established molecular, biochemical and imaging techniques to understand the heterogeneity of sALS in these patient-derived spinal neurons, to serve as a ‘biopsy-like’ equivalent. After ensuring that results were reproducible, we assembled comprehensive biological datasets from individual subject iPS cell lines and combined them with the longitudinal clinical data. In contrast to smaller previous iPS cell experiments, studies of iPS cells from a large population, like AALS, provide the first opportunity to explore biologically relevant subgroups of sALS. This resource program was designed with the core goals of providing large clinical and biological datasets in an open source-like application that affords researchers the proper tools to identify biological subgroups and an extensive collection of IPS cell lines with which to test ALS therapies and hypotheses about ALS pathogenesis.
Results
Clinical demographics and clinical data generation
Population demographics
The enrolled participant population for the AALS program (Fig. 1a, Extended Data Fig. 1, Supplementary Information and Supplementary Tables 1–5) had clinical characteristics comparable to past large sALS population demographics, with a slightly higher number of male than female participants, site of disease onset predominantly a limb rather than bulbar and a mean age of disease onset of approximately 57 years. The mean delay in clinical diagnosis for ALS patients included in the study was 14.8 months. A higher percentage of patients with rapid progression had bulbar-onset disease. There was a wide range of disease progression rates over the time period of observation (Fig. 1b,c), with an average follow-up duration of 12.5 months and an average rate of decline of 0.77 points per month (Fig. 1b,c). The smaller population of patients with fALS in the resource had typical representations of the common gene mutations including C9orf72 and SOD1 (Table 1), with a small subset of patients with C9orf72 and non-C9orf72 ALSs developing cognitive decline during the study (https://dataportal.AnswerALS.org). A small number of individuals were ALS mutation carriers (asymptomatic ALS) without overt neurological disease (Table 1). Non-ALS motor neuron disease (MND) included patients with predominantly upper MND, not formally categorized as ALS (for example, primary lateral sclerosis), and their demographic information is included in Supplementary Table 4. The healthy control subject population consisted of age-matched participants without ALS or a family history of ALS.
App-based voice recordings—motor and speech analyses
A core tool to gather more comprehensive longitudinal clinical data, ultimately to integrate with the biological datasets, was the development of a new smartphone app, designed to inform elements of motor activity, speech, breathing, voice and cognition (Supplementary Information) while patients were at home. Given the nature of this progressively disabling disorder, the reliability of utilization is an important variable. Compliance for using the smartphone app was analyzed over 18 months from the beginning of the app rollout to a subset of 80 study subjects. Surprisingly, only a modest decrease in compliance was observed with increased duration of use (Fig. 2a).
App data accurately predicted clinical progression
From speech recordings, we extracted linguistic features to evaluate word diversity and complexity of thought such as semantic similarity, dispersion and frequency, as recently detailed7. Features derived from the voice tasks (single-breath count, read-aloud passage and free speech; Extended Data Fig. 2) each correlated highly with the bulbar subdomain of the ALS Functional Rating Scale-Revised (ALSFRS-R; Pearson’s R = 0.8, slope = 1.14; Pearson’s R = 0.89, slope = 0.98; and Pearson’s R = 0.71, slope = 1.12, respectively). Features from the finger tracing showed modest individual correlations with the ALSFRS-R total score (Fig. 2b and Extended Data Fig. 2). Importantly, the combination of features from all of these tasks correlated very highly with the ALSFRS-R total score (Pearson’s R = 0.89, slope = 1.16; Fig. 2c).
Features obtained from the single-breath counting task correlated well with vital capacity (R = 0.63) and strongly suggest that voice analysis could be a proxy for vital capacity measurements in a clinic. Similar results by others employing sustained phonation are in agreement with our new observations8.
Importantly, semantic analysis of the picture description task was highly correlated with the ALS-Cognitive Behavioral Screen (CBS) (R = 0.72) and less correlated with the central nervous system (CNS) lability scale (R = 0.45). These studies then also suggest that at-home app analytics can be useful for longitudinal cognition analytics.
This task also predicted well the ALSFRS-R speech subscore (Fig. 2b); however, models using features from the reading task outperformed the counting and picture description tasks. A more detailed account of these results is reported elsewhere7.
These results demonstrate that the modules implemented to assess hand function and speech may be useful to quantify ALS function when patients are not in clinic and can substantially aid in the acquisition of progressively declining clinical indices. Furthermore, the picture description task may be useful to evaluate cognitive function in ALS. The potential to record voice and store it encrypted in the cloud could provide a powerful clinical tool to assess change over time that could be used clinically and in ALS trials.
Production of the iPS cell line
A core design and strength of the program are the set of iPS cell lines from a large population of >1,000 patients with ALS and control subjects, all deeply phenotyped, provided to the research community. To date, more than 850 of the iPS cell lines have been generated and are available through the web portal. Out of the ~850 unique samples, only 18 lines (~2%) failed reprogramming. As there are multiple different protocols to generate iPS cells and differentiate them into motor neurons, it was essential that the uniformity of the generated cultures be evaluated, thereby establishing the reliability of this new and renewable biological resource. To address this central issue, we evaluated the iPS cell-derived spinal neurons from a large cohort of 217 control and ALS iPS cell lines. Specifically, we examined expression of five different cell-identifying markers for neurons and glia, including cell markers NKX6.1, SMI32, ISL1, TUJ1 and S100beta. This differentiation protocol (Extended Data Fig. 3) generates a mixed population of neurons consisting of ~75% (±8%) βIII-tubulin- (TuJ1-) and ~70% (±10%) NF-H-positive cells, ~19% (±6%) Islet-1- and ~34% (±9%) Nkx6.1-positive spinal motor neurons, and ~18% (+/13%) S100B-positive progenitors 32 d after the onset of differentiation (Fig. 3 and Supplementary Table 6). As shown in Fig. 3, there was great uniformity in the cellular composition of the cultures for this large selection of human lines. This was important, because past work or methods can lead to variable cultures, making the interpretation of downstream analysis complicated. Notably the cellular composition was not substantially different between the ALS and control iPS cell-derived neurons. As expected, these cultures presented a mixture of motor neurons, neurons and, to a lesser extent, glia. This was important, because ALS is not simply a motor neuron disease, but is a disorder of multiple different nervous system cell types, as reflected in these uniformly generated cultures.
Generation of multi-omics data
Genomics
As an appreciation of the overall diversity of the program’s ALS and control population, especially valuable for future global analytics, we evaluated the AALS cohort using New York Genome Center’s (NYGC’s) ancestry pipeline9. Most participants were white and of European descent (91.45%); the remainder had ancestry consistent with the Americas (1.69%), Africa (4.94%) and east (1.33%) and south Asia (0.6%) (Fig. 4). On average, each sample harbored a total of ~4.1 million variants and ~9,800 protein-altering variants, including SNPs, frameshift and nonframeshift deletions and insertions, and protein-truncating variants (Table 2 and Fig. 4a–d), similar to previous reports10. Notably, the samples with African descent had a higher number of variants than other ethnic populations, as expected (Fig. 4b)11.
We used PCA12,13 to visualize the ancestry background of the AALS cohort and a set of 2,504 samples from the 1000 Genomes Project with well-defined ancestry. We find that most of the samples clustered with the NYGC’s European samples, although some were closer to the African group and a few clustered with the Asian group (Fig. 4e), corroborating the NYGC ancestry results and probably consistent with the local recruiting clinics geographic locations (Extended Data Fig. 1).
Variants in ALS genes
As most of the ALS lines were derived from patients with sALS, an analysis of the genomic variants is important, especially as future opportunities for researchers to correlate the observed variants along with the deep clinical and multi-omics data, as well as the future use of the living cell lines. Within the 830 samples, we observed 440 exonic variants in the 33-ALS genes (Supplementary Information) that were <1% frequent (Fig. 4c,d, Table 2 and Supplementary Table 7). Both controls and ALS cases averaged 1.5 rare ALS variants per individual within the 33-ALS genes. Of these, 79% were SNPs, 13% uncharacterized, ~1% splicing, ~1% nonframeshift deletion, 1% frameshift deletion, 1% frameshift insertion, 2% frameshift insertion, 2% nonframeshift insertion and 1% stop-gain (Supplementary Table 7).
As future biological pathways in ALS subgroups could reflect the expression of genetic variants of established ALS genes, we first evaluated how many pathogenic or probably pathogenic variants existed as reported in ClinVar (CP) in the 33-ALS genes. We found that 12% of ALS cases harbored a CP variant within one of the 33-ALS genes (Supplementary Tables 7 and 8). All of these CP variants were rare (<1% frequency within the population) except two found within the OPTN gene. For example, we observed five SOD1 CP variants (within eight patients with ALS), two TDP43 CP variants (within two patients with ALS) and one CP FUS variant in a patient with ALS (Supplementary Tables 7 and 8). CP variants were also detected in individuals who did not show signs of ALS at the time of the clinic visit, and there were eleven CP variants within control samples (within ALS2, SETX, OPTN and PFN1), four CP variants in the pre-fALS cohort (within FIG4, OPTN and CHCHD10), three CP variants within individuals with other MNDs (within SQSTM1, OPTN and PFN1) and three CP variants in uncharacterized individuals (within SQSTM1 and SETX; Supplementary Table 8). In summary, rare CP variants were observed in 3.11% (22 total) of ALS cases and 1% of controls (1 out of 92 samples). We also investigated the number of P/LP variants called by Intervar (IP), in in silico prediction (ISD variants) and a new combination of ACMG gene criteria as well as the in silico prediction and family-based segregation data, a list of high-confidence causal variants in 12 genes—ALS2, CCNF, CHCHD10, FUS, OPTN, PFN1, SOD1, TARDBP, TBK1, UBQLN2, VAPB and VCP—which have been curated and designated as the HP (Harms P/LP, Supplementary Table 7) variants. These are reported in Supplementary Tables 7–11. We investigated CP, IP and ISD variants found across all genes in 830 samples and these are listed in Supplementary Tables 12, 13 and 14.
Expansions in C9orf72 and ATXN2
Genomic expansions of both C9orf72 and ataxin 2 are associated with both fALS and sALS. The availability of large numbers of iPS cell lines and the matched multi-omics data from this phenotypically variable genetic subgroup provide a unique future opportunity to investigate these genes that alternatively lead to ALS and/or FTD. Using Expansion Hunter to identify repeat expansions within whole-genome sequencing (WGS) data, we found 601 expanded regions in the 830 samples14. In total, 41 patients with ALS and 4 pre-fALS subjects in the AALS study population harbored hexanucleotide expansions in C9orf72 that were >26 repeats (Fig. 4f and Supplementary Table 15). We also observed 35 patients with ALS, 4 controls and 1 uncharacterized individual harboring CAG triplet repeat expansions in ATXN2 >26 repeats (Fig. 4g and Supplementary Table 16). All patients with ALS with >26 ATXN2 repeats had clinical phenotype characteristics of MNDs and no other reported neurological abnormalities. Notably, in this population of patients and cell lines, for carriers of expansions in both ATXN2 and C9orf72 simultaneously, we found no correlation between age of ALS onset and expansion size (Fig. 4h,i and Supplementary Tables 15). However, future multi-omic studies of the patient iPS spinal neurons may reveal different biological pathways/properties when both mutations are co-expressed in humans.
ACMG genes
Pathogenic or probable pathogenic variants in 59 genes are currently considered to be medically actionable by the American College of Medical Genetics and Genomics (ACMG), due to the potential for medical intervention to modify morbidity and mortality in carriers of such variants15. Within the 830 samples, we identified 73 C-PLP variants within 32 ACMG genes (Supplementary Table 17). Of the individuals, 50.4% did not harbor a C-PLP variant in an ACMG gene, 41.2% harbored 1, 7.6% harbored 2 and 0.84% harbored 3 C-PLP variants. Of these variants found within 110 individuals, 66 were rare (<1%; Supplementary Table 17). We also found 42 I-PLP variants within ACMG genes within 51 individuals, all of which were rare (Supplementary Table 18). Participants were offered to receive the results of these medically actionable genes through the return of genetic results substudy (Extended methods).
Transcriptomics
For each of the omics assays, vials from an identical pool of differentiated motor neurons were processed to ensure comparability, including batch differentiation controls (BDCs) and batch technical controls (BTCs) from the control 2AE8 line, as detailed in Extended methods. Overall the analytics revealed minimal to no technical confounders and low batch effects between differentiation and no clear batch-related abnormalities with regard to disease status (Extended Data Figs. 4a,d and 5a).
Annotation of transcripts detected in the samples revealed various RNA species that were captured in the deep sequencing, with protein-coding RNAs accounting for most (~82%) of all RNAs, followed by long intergenic noncoding (linc)RNA (~13%) (Fig. 5a). A low proportion of reads mapped to small RNAs and a very minimal portion to ribosomal RNAs, which were depleted during library preparation and act as a technical quality assessment. The use of total RNA-sequencing (RNA-seq) and deeper sequencing allows for differential alternative splicing analyses, as well as circular RNA and cryptic exon analyses (Fig. 5e,f). As an example of RNA-seq analyses, we assessed the ability of our cell model and RNA-seq methods to capture common, alternative splicing types and found significant enrichment in skipped exon (SE, 52%) and retention of introns (RIs, 35%) when comparing male C9 samples with male controls (Fig. 5e). RNA-binding protein (RBP) motif enrichment analysis of the significant RI events (cryptic exons) predicts that the binding of HNRNPA2B1 (Fig. 5f) is upregulated in ALS samples. These findings are consistent with previous reports in human postmortem brain tissue16.
To assess pathway activities, we used gene set variation analysis (GSVA) to score samples against canonical Kyoto Encyclopedia of Genes and Genomes (KEGG) and Biocarta pathways from the MsigDB database, and identified pathways that are differentially regulated between subjects with bulbar and limb onset (Fig. 5g). Using these pathway activity scores, we also identified pathways that are positively or negatively correlated with the patient ALSFRS progression slope (Fig. 5h).
These data indicate that both gene expression differences and RNA-splicing differences could be captured by our differentiated iPS cell model. Notably, these data can be explored for additional new alterations in ALS and potential associations with ALS subtype and clinical data, and with other omics data that are being captured from these samples.
Epigenomics
Overall the quality of transposase-accessible chromatin using sequencing (ATAC-seq) data was high, with very good reproducibility of BDCs and BTCs, as assessed by the simple error rate estimate (SERE) (Fig. 5b, Extended Data Figs. 4b,e and 5b, and Supplementary Information). Hypersensitive sites were distributed across the genome in the expected regions (Extended Data Fig. 6a,b), especially in previously annotated regulatory regions, with very few reads in ENCODE blacklist regions. Although, overall, samples did not cluster by genotype or disease status, many loci did show strong differences between patients and controls (Extended Data Fig. 6c). As an example of a potential application of the epigenomic data, we identified potential transcriptional regulators through analysis of sequence motifs in the open chromatin (Extended Data Fig. 6d). Consistent with the expected cell composition, we observed an overrepresentation of transcription factors implicated in neuronal differentiation, such as Pdx1, Cux2 and the Lhx family (Extended Data Fig. 6d).
Proteomics
In total, >25,000 peptides corresponding to >3,600 proteins per sample were quantified. As detailed in the Supplementary Information, for proteomic analytics, there was minimal drift between the batches (Fig. 5c and Extended Data Figs. 4c,f and 6c). Although patient and control iPS neuron clusters are interspersed, indicating their overall similarity, these iPS neuron models have significant individual protein-level differences and we selected representative proteins ECH1 and PCKGM (Fig. 5d) that show significant (P ≤ 0.05) differences, based on what is seen in the differential analysis-based evidence (Fig. 5d).
Longitudinal single-cell imaging and analysis
Validation of the identification of pathological phenotypes was achieved with longitudinal single-cell robotic imaging of mutant SOD1 patient-derived iPS spinal neurons as described previously (Fig. 6a)17. As shown in Fig. 6b, mutant SOD1 neurons exhibited an enhanced cell death profile, similar to that reported previously with spinal motor neurons18. Future data will be available on similar analytics of cohorts of the sporadic iPS cell-derived neurons from the AALS dataset.
Data dissemination: data portal
The AALS data portal (http://data.answerals.org; Supplementary Table 3) was designed to provide information about the various types of biological and clinical data generated by the AALS partners and to allow easy visualization/access to the metadata and data, along with links to obtain biofluids and iPS cell lines. Additional details regarding the portal can be found in Extended methods. In the future, the portal will also host online data analytics and visualization tools.
Discussion
The pathogenesis of sALS remains a mystery and few comprehensive data collections, on a population scale, exist to truly inform researchers about the biological underpinnings of the disease or the possibility of disparate biological subgroups. To date, clinical studies alone have not yielded reliable data to suggest a common pathway or, more importantly, a means to target relevant biological subgroups. The identification of biological subgroups has been impactful in various cancers, where the ability to actually sample disease tissues from skin, liver, prostate or pancreas biopsies, coupled with clinical characteristics of tumor type, has led to marked improvements in therapeutic approaches, drug treatments and decisions about disease management19,20.
The core goal of AALS is to provide a comprehensive set of tools including deeply phenotyped longitudinal clinical data and biological tools such as iPS cell lines, and a multi-omics platform consisting of whole-genome, iPS-derived, spinal neuron-enriched proteomes, transcriptomes and epigenomes, to uncover underlying biological subgroups. Previous studies have demonstrated the ability to generate small populations of fALS or sALS iPS cell-derived motor neurons and glia, as well as relatively limited multi-omics data. However, none approximates true population-based tools, with reproducible quality assurance protocols, necessary to accurately assess disease pathways or identify population subgroups combining longitudinal clinical, genomic and living multi-omics data4,21,22.
The AALS reagent collection includes individual iPS cell lines from approximately 850 sALS and control participants (soon to reach >1,200), the iPS cell-derived spinal neurons from each participant, their longitudinal clinical data (collected over 1 year), sequentially amassed fluid biospecimens (blood and cerebrospinal fluid (CSF)) and the early multi-omics data generated from each participant’s blood (whole genome) as well as from their ‘spinal cord biopsy’-equivalent, iPS-derived neuronal cell lines. The collection also includes autopsy samples and pathology data from a subset of participants. The autopsy pathology data and CNS specimens will eventually be available through the AALS web portal and coupled with the iPS cell lines from these participants.
A reasonable question is the utility of patient-derived iPS cells to predict the disease-causing pathways in an adult-onset disease. Can reprogrammed human spinal neurons reflect adult-onset disease pathogenic cascades? Already multiple studies have documented that human iPS cell lines, in either two-dimensional cultures or three-dimensional organoids, can reproduce the pathology seen in human brain23,24,25. One advantage of the iPS platform is the ability to dynamically detect early pathogenic events and even serially occurring events. In fact, early use of the AALS iPS cell lines has already provided evidence that the iPS collection can provide insights into new pathways (nuclear pore complex and nuclear transport defects) in ALS pathophysiology, generate new therapies and validate gene therapy based on the approaches4,25,26,27.
This population and its dataset were never envisioned to enable the identification of new ALS genes. A cohort of ~1,000 ALS participants does not amount to a large enough database for new gene identifications. However, sharing the whole-genome sequences from this dataset has aided in the identification of a new ALS gene, Kif5A28. In fact, the estimated 6+ billion data points generated from each participant, combining the longitudinal clinical demographic and observational data, the longitudinal smartphone app data (motor activity, speech, breathing, cognition) and the aggregate multi-omics data (whole genome, epigenome, proteome, transcriptome) represent an exceptionally large set of data per participant. Furthermore, the core multi-omics dataset reflects the human cells affected in individual ALS participants and spinal neurons, and acts as an organ- or tissue-specific biopsy. When these combined longitudinal and multidimensional clinical and biological data are analyzed by integrative methods, such as artificial intelligence, clinical and biological subgroups might emerge, potentially assigning a unique risk or modifier gene or a unique molecular pathway to a specific patient subgroup, which could one day enable patient-specific interventions, or serve as drug target engagement marker or subgroup biomarker.
How many individual sporadic patient lines would be required to detect one of more pathophysiologically relevant subgroups is simply not known. Prior work in fALS suggests that at least 10–15 C9orf72 iPS cell lines is sufficient to robustly detect defects in nuclear pore biology. However, sALS may have multiple risk pathways associated with gene variants (for example, ataxin 2 expansion, TMEM 106b)29,30 or environmental stressors and, as such, may require more patient cell lines and multi-omics data to allow detection of robust pathway readouts. A recent study, targeting imaging-based strategies to detect and evaluate an ESCRT-III-based pathway and therapy in >40 different sporadic and C9orf72 ALS and control iPS cell lines, approached the size of a small clinical trial25. However, it remains unclear how many iPS cell lines are needed to robustly and reproducibly detect pathophysiological alterations from human omics analyses.
The other research advantage to such a dataset and living tools is the immediate ability to test for potentially ALS-relevant pathogenic pathways using the participant’s own iPS cells/iPS cell-derived spinal neurons to test drugs for candidate pathogenic pathways and, importantly, to develop CNS biomarkers from the iPS cells and validate drug target engagement. Libraries of iPS cell lines derived from participants with neurological diseases, including Alzheimer’s disease and FTD, have been growing over the last several years and represent a valuable tool to truly examine specific disease pathways31,32. Most of these iPS cell libraries are relatively small, including our original library of 22 fALS iPS cell lines21, with a few selected lines for each disease mutation and, when appropriate, isogenic controls. None represents the far more common sporadic forms of the disease. Furthermore, none provides deep longitudinal clinical and extensive multi-omics data.
Aside from the biological data generated from the program, the results from the AALS smartphone app demonstrate that the modules implemented to assess limb function, speech and cognition may be useful to identify early bulbar and cognitive symptoms in ALS and track disease progression over time. Specifically, limb-function tests reveal that it can be useful to infer ALSFRS-R scores. Importantly, we observed that, by combining the features from multiple domains, motor tests and all the voice tests highly correlated with the ALSFRS, now commonly used as a primary or secondary outcome measure in ALS clinical trials, thereby providing a reliable tool for at-home longitudinal monitoring of patient progression. Furthermore, the single-breath testing also correlated well with in-clinic forced vital capacity (FVC), often a prominent secondary outcome measure in clinical trials. This test typically requires in-clinic testing, which limits enrollment or follow-up data collection in clinical trials. The application of this app test alone could greatly enhance patient participation in nationwide clinical trials—especially in those areas where travel to a testing center is challenging. Overall, we observe that quantitative motor speech analysis holds tremendous promise in both identifying changes limited not only to ALS rating scales but also to others such as cognitive assessment. The potential to record voice, and store it encrypted in the cloud, could provide a powerful clinical tool to assess change over time for use clinically and in ALS trials. Overall, the app data, coupled with in-clinic data, provide deep and longitudinal clinical datasets available for multi-domain biological and clinical correlations for future users.
The overall clinical demographics and population genomics in the AALS program accurately reflect the ALS subject population described in previous studies. This observation validates the AALS iPS cell lines and multi-omics platform as a database that others can employ to generate and test biological hypotheses.
Importantly, all the clinical data, multi-omic data and iPS cell lines were generated to be freely accessible to all researchers, academic and commercial, free of restrictions other than standard Health Insurance Portability and Accountability Act (HIPAA) compliance rules. A web portal for downloading filtered datasets, for example, proteome, whole genome, etc., has been set up with minimal but appropriate requirements for data access (Supplementary Table 3). The ALS and control iPS cell lines, matched to datasets, are also fully available for research studies, for a minimal fee (to cover the replacement of the depleted stock of cells). Biospecimens (for example, CSF and plasma) longitudinally collected from patients are also available (Supplementary Table 3). Future web-based links will include access to autopsied CNS tissues from patients matched to the iPS cell lines and iPS cell-based multi-omics.
Methods
Program process
Overall design (Extended Data Fig. 1)
The overall AALS program, from clinical enrollment to smartphone app data collection, iPS cell-line generation, biological data generation and data storage is outlined in Extended Data Fig. 1 (ClinicalTrials.gov: NCT02574390). Methods for each element of the program are provided below and in Supplementary Methods.
Enrollment, clinical characterization and sample collection
The clinical portions of AALS were coordinated through Johns Hopkins University and Massachusetts General Hospital. The eight enrolling neuromuscular clinics were distributed across the USA and included Johns Hopkins University, Massachusetts General Hospital, Ohio State, Emory University, Washington University, Northwestern University, Cedars-Sinai and Texas Neurology (Supplementary Table 1 and Extended Data Fig. 1). The study was approved by local institutional review boards, and all participants provided written informed consent. Consent was uniform across all sites and included agreement to share data broadly for medical research (also see Data access in Supplementary Information). Subjects with sALS, fALS and related MNDs (referred to as non-ALS MNDs), including those with primary lateral sclerosis, progressive bulbar palsy and progressive muscular atrophy, along with asymptomatic ALS gene mutation carriers, were enrolled in AALS. Age-matched control participants without ALS or a family history of ALS were also enrolled. Additional enrollment details are provided in Supplementary Information.
Participants were monitored every 3 months for a year and, when possible, the ALSFRS-R was conducted by telephone every 3 months for another year thereafter. Baseline descriptors included the following: demographics and vital signs, genetic and family history of MND, general medical history, CNS lability and a brief focused history of environmental exposures. Concomitant medications and past medical history were collected at enrollment and updated throughout study participation. Measures of ALS progression included: deep tendon reflexes, Ashworth Spasticity Scale, Hand Held Dynamometry, ALSFRS-R and pulmonary slow vital capacity (Supplementary Tables 2 and 3 and Supplementary Information). To enhance depth of longitudinal clinical data collection, a secure and HIPAA-compliant smartphone app, with a specific focus on motor activity, voice and cognition, was created for home data collection (Fig. 2 and Extended Data Fig. 2). At each in-clinic visit, blood was collected and processed according to the methods outlined in Supplementary Information. At the first visit, whole blood was collected for generation of primary peripheral blood mononuclear cell (PBMC)-derived iPS cell lines.
Biofluid collection and processing
At each in-clinic visit along with follow-up visits, approximately 50–100 ml of blood was collected from each participant. Plasma and serum were processed for storage and PBMC isolation. Whole blood was sent to the NYGC for DNA extraction and WGS. CSF was optionally collected and flash frozen at −80 °C. Serum, plasma and CSF samples were shipped on dry ice to a centralized biofluid repository to be stored at −80 °C (Supplementary Table 3). Additional details are provided in Supplementary Methods
Return of AALS results
To provide medical and ethically appropriate feedback, study participants with ALS were offered the opportunity to receive the results of their WGS for 5 ALS genes (C9orf72, SOD1, FUS, TARDBP and TBK1), as well as 59 genes designated as medically actionable by the ACMG15, as part of a substudy, Return of Answer ALS Results (ROAR). ROAR participants completed a separate online consent after enrollment in the parent study. Additional details are provided in Supplementary Information.
AALS smartphone app
The app has seven modules designed to gather information about upper limb motor function, respiration, bulbar function and cognition. Six modules measured arm function: finger tapping, finger tracing and phone tilt tracing; each was performed using the right and left hand separately (Fig. 2a). The speech module (Fig. 2c), consisted of three tasks, rotated weekly to reduce learning effect: (1) single-breath count, in which participants were instructed to draw in a deep breath and count at a measured pace (a surrogate for FVC)34; (2) read-aloud passage, in which participants read aloud one of four standardized passages from their screen; and (3) picture description, in which participants described one of three line-art illustrations over 30–120 s. Details regarding this digital clinical module are included in Supplementary Information.
The iPS cell-line methods
PBMC processing
Fresh blood was collected, and samples were centrifuged at 18–25 °C in a horizontal rotor centrifuge for 20 min at 1,800 r.c.f. within 2 h of collection. The plasma/buffy coat mixture was collected and centrifuged for 15 min at 300 r.c.f. Isolated PBMCs were counted and cryopreserved. The average cell count was ~25 million PBMCs per sample with an average cell viability of 91%. Additional details are provided in Supplementary Methods
Generation, reprogramming and QC of iPS cells
The iPS cells were generated by reprogramming the cryopreserved and nonexpanded PBMCs, using a method based on a nonintegrating episome. Clones were isolated, expanded and maintained according to standard feeder-free protocols and characterized extensively as described in Supplementary Table 6. The iPS cell lines were generated from ~25 patients per month and stored frozen until they were differentiated (Extended Data Fig. 3a). Each cell line was thawed and cultured for 2–3 weeks before passaging for differentiation. Cell lines were differentiated in batches of up to 11 lines. PBMCs were used instead of fibroblasts to limit the potential for genetic defects and facilitate sampling from the large number of patients enrolled in our study. Overall, blood draws are less invasive and carry lower risk for patients than skin biopsies, which improved the overall risk–benefit ratio for the study. Rigorous quality control (QC) (Supplementary Table 6) was performed on each AALS iPS cell line, similar to previously publications35. G-band karyotype was performed at multiple passages for each AALS iPS cell line, which provides confidence about the genetic integrity of the AALS iPS cell repository, given that each iPS cell line is karyotyped at multiple passages. Cell-line authentication is repeated at multiple stages. The cell line authentication (STR) is performed on the original donor blood/PBMC sample, then performed on the reprogrammed iPS cell line and the differentiated neurons (Supplementary Tables 6 and 19). Additional details are provided in Supplementary Information.
Generation of iPS cell spinal neurons
The iPS cells were differentiated into motor neurons according to the direct iPS cell diMN protocol, which comprises three main stages (Extended Data Fig. 3 and Supplementary Table 6), as described previously25. Additional details are provided in Supplementary Information. On day 32 of differentiation, cell lines were collected and pelleted as illustrated in Fig. 4. Thus far, ~800 iPS cell lines have been successfully reprogrammed and one clone line banked and characterized per donor. Out of the ~800 unique samples, only 18 lines (~3%) failed reprogramming. Additional details are provided in Supplementary Information.
QC of diMNs
As referenced in Extended Data Fig. 3, on day 32 one 6-well plate from each cell line for immunostaining was reserved for QCs, which included the following markers of neuronal differentiation: SMI32 (NF-H), TUBB3 (TUJ), ISL1, NKX6.1, S100β and Nestin. This protocol generates a mixed population of neurons consisting of ~75% (±8%) βIII-tubulin- (TuJ1-) and ~70% (±10%) NF-H-positive cells, ~19% (±6%) Islet-1- and ~34% (±9%) Nkx6.1-positive spinal motor neuron, and ~18% (+/13%) S100B-positive progenitors 32 d after the onset of differentiation (Fig. 3). Additional details are provided in Supplementary Information.
Multi-omics data generation for each iPS cell-derived motor neuron line
At the end of the 32-d differentiation protocol, the spinal neurons were harvested for RNA-seq, proteomics or epigenome profiling as detailed in Supplementary Methods. WGS was performed on PBMCs. Day 32, chosen for independent experiments with selected C9orf72 ALS/FTD iPS cell-derived spinal neurons, demonstrated phenotypic and molecular changes in nuclear pore complex and biology, matching that seen in patient autopsies, by this time point26.
Program QCs: cell generation batch controls
To detect and compensate for cell culture-associated confounders, all differentiations were conducted in a single facility and included two key control groups of biological samples: BDCs were differentiated with each batch from the same original line to assess interbatch variability of iPS cell differentiation to diMNs and BTCs, consisting of a single differentiation of the same line were frozen, aliquoted and distributed with each batch to assess technical variability of the omics assay batch runs, were performed as detailed in Supplementary Information. Complete details for the design and implementation of these critical operational controls (Extended Data Figs. 4 and 5) can be found in Supplementary Information.
Data quality and batch effect assessments
RNA-seq
For the RNA-seq data samples were processed and passed all QC metrics including RNA integrity (Extended Data Fig. 4a), library and sequencing QC metrics. To assess data quality and technical batch effects, sample-to-sample SERE scores (0 = identical samples) were generated using gene expression for three groups: the BDCs, BTCs and all other samples (Extended Data Figs. 4 and 5).
A heatmap of SERE scores between all samples with hierarchical clustering (Extended Data Fig. 5) shows that, although BTCs form their own cluster, the rest of the samples fall into multiple small clusters with no clear relationship to their disease status.
Proteomics
Each block of samples comprised case, control, BDC samples and HEK293 cell control samples. The numbers of proteins and peptides quantified for all 66 samples were very consistent (Extended Data Fig. 4c). The percentage coefficient of variation for the proteins quantified were calculated for the BTC and BDC samples (Extended Data Fig. 4f). Individual samples are normalized to the total MS2 spectra intensity across the chromatographic profile of eluting peptides to smooth any inconsistencies in sample loading on to the mass spectrometry (MS) instrument, thereby eliminating systemic variation in signal intensities (Extended Data Fig. 4c). We found that BTCs and BDCs (both originating from the 2AE8 CTR cell line) cluster tightly (Extended Data Fig. 6c), indicating minimal drift between the MS batches.
Epigenetics
ATAC-seq data quality was determined according to ENCODE36. The distribution of fragment sizes across all samples revealed a clear nucleosome-free region and regular peaks corresponding to nucleosomal fractions (Extended Data Fig. 6). As expected, replicates from our batch control line were highly correlated with each other, with BTCs having an even smaller variation in correlation values compared with BDCs (Extended Data Fig. 4e). We also generated a consensus set of peaks present in >10% of samples using DiffBind (Extended Data Fig. 6) and characterized transcription factor motif enrichment within these peaks using HOMER37. There was an overrepresentation of transcription factors implicated in neuronal differentiation, such as Pdx1, Cux2 and the Lhx family (Extended Data Fig. 6d). We then obtained a counts matrix of reads mapped to each peak in the consensus peakset across all samples and performed hierarchical clustering using the same approach as the RNA-seq data (Extended Data Figs. 4, 5 and 6). Subjects did not cluster by disease status, presence of C9 mutation, sex or processing batch. Additional data on quality control can be found in Supplementary Methods.
Whole-genome methods: WGS and analysis
PBMCs were sent by each clinic to the NYGC (https://www.nygenome.org) for DNA extraction and sample QC and WGS libraries. We evaluated pathogenic or probable pathogenic variants reported in ClinVar (C-PLP) for all genes. We also examined pathogenic variants called by Intervar Li38 (I-PLP) and predicted damaging variants as called by in silico prediction tools (IS-D), which are reported in Table 2 and Supplementary Table 8. The variant calls from NYGC were assessed by examining the actual reads for alignment issues and spot checking the BAM files for specific variants in Integrative Genomic Viewer determined to be of good quality. The variant call formats (VCFs) were converted into genomic VCFs (GVCFs), and joint genotyping calling was run using Sentieon v.201911 (https://www.sentieon.com); applied variant quality score recalibration (VQSR) was done using GATK v.3.8 (truth sensitivity level = 99.0), and the files were annotated using Annovar v.2018Apr16 (ref. 39). For each variant, we also incorporated functional in silico predictions from nine programs, including databases such as SIFT40, PolyPhen2 (ref. 41) and Mutation Taster42, and those described in Li et al.43. Additional databases were included that assess the variant tolerance of each gene using the Residual Variation Intolerance Score (RVIS)44 and the gene damage index (GDI)45 and LoFTool46. For variants in genes that are highly expressed in the brain, we incorporated data from the Human Protein Atlas47 (http://www.proteinatlas.org) and expression data from GTEx portal48,49 (https://gtexportal.org/home) for the cortex and spinal cord. Frequency information from three databases on all known variants was obtained from ExAC50, the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (ESP)51 and the 1000 Genomes Project10.
PCA was carried out (Fig. 4d) to reveal how the AALS samples cluster among various ancestry groups of the 1000 Genomes Project dataset. PCA was used12,13 to visualize the ancestry background of the AALS cohort and a set of 2,504 samples from the 1000 Genomes Project with well-defined ancestry. We used a set of 10,000 randomly chosen autosomal SNPs (singletons and multiallelic SNPs were removed) that were present in both datasets and removed correlated SNPs by linkage disequilibrium pruning. We implemented randomized PCA52 using the Python library scikit-allel package53.
The annotation pipeline incorporated elements from ANNOVAR39 and generated reports, including genotypes for all samples. These reports are available on request. The following annotation was used: for genes and exonic variants that have clinical significance, the Clinical Genomic Database54, the Online Mendelian Inheritance in Man55 and ClinVar56, and genes listed in the ACMG57 database were incorporated. We also incorporated Intervar, which is based on the ACMG and AMP standards and guidelines for interpretation of variants58,59,60,61. This tool uses 18 criteria to prescribe the clinical significance and classifies based on a 5-tiered system62. To flag ALS genes, ALS gene lists and variants were incorporated from ALSoD63 (http://alsod.iop.kcl.ac.uk), a list provided by M. Harms, a gene list from J. Landers and associations from DisGeNet64. Functional predictions were based on in silico prediction from nine databases: SIFT40, PolyPhen2 (refs. 65,66,67) (HDIV and HVAR), LRT_Prediction67, Mutation Taster42, Mutation assessor68, FATHMM prediction69,70,71 and dbNSFP (RadialSVM_pred and LR_pred)72,73,74. Databases that assess the variant tolerance of each gene using the RVIS44 and the GDI45 were also included, and LoFTool46 will be incorporated. To identify variants in genes that are highly expressed in the brain, data from the Human Protein Atlas47 (http://www.proteinatlas.org) and the GTEx portal75,76 (https://gtexportal.org/home) for the cortex and spinal cord were used. Frequency information was derived from ExAC50, the NHLBI ESP51 and the 1000 Genomes Project11.
A separate annotation pipeline was developed for variants in intergenic and regulatory regions. Variants are reported relative to the closest gene, whether intronic, upstream and downstream (up to 4 kb from the start and stop of a gene) or in 5′- and 3′-UTRs. The annotation was based on RegulomeDB, which annotates variants with known or predicted regulatory elements such as transcription factor-binding sites, expression quantitative trait loci, validated functional SNPs and DNase sensitivity77, with source data from ENCODE78,79 and the Gene Expression Omnibus80. Additional regulatory databases such as Target Scan, an algorithm that uses 14 features to predict and identify microRNA (miRNA) target sites within messenger RNAs81 and miRBase82,83,84, were also used. Extensive details on the methods for whole-genome analytics can be found in Supplementary Methods.
RNA methods
Total RNA was isolated from each sample using the QIAGEN RNeasy mini-kit. RNA QC was conducted using an Agilent Bioanalyzer and Nanodrop. Our primary QC metric for RNA quality is based on RNA integrity number (RIN) values ranging from 0 to 10, 10 being the highest quality RNA. In addition, we collected QC data on total RNA concentration and 260:280 and 260:230 ratios to evaluate any potential contamination. Only samples with RIN > 8 were used for library prep and sequencing. The rRNAs were removed and libraries generated using TruSeq Stranded Total RNA library prep kit with Ribo-Zero (QIAGEN). RNA-seq libraries were titrated by quantitative (q)PCR (Kapa), normalized according to size (Agilent Bioanalyzer 2100 High Sensitivity chip). Each complementary DNA library was then subjected to 100 Illumina (Novaseq 6000) PE sequencing cycles to obtain over 50 million PE reads per sample. After sequencing, raw reads were subject to QC measures and reads with quality scores >20 collected and analyzed. Reads were mapped to the GRCh38 reference genome using Hisat2, QCed and gene expression quantified with featureCounts85, and differential expression was quantified using DESeq2 (ref. 86). Normalized and transformed count data were also used for exploratory analysis and differentially expressed genes (false discovery rate (FDR) < 0.1) were analyzed with commercial and open-source pathway and network analysis tools, including Ingenuity Pathway Analysis, gene set enrichment analysis (GSEA), GOrilla, Cytoscape and other tools to identify transcriptional regulators, predict epigenomic changes and determine potential effects on downstream pathways and cellular functions.
ATAC-seq methods
We used the assay for ATAC-seq to assess chromatin accessibility and identify functional regulatory sites involved in driving transcriptional changes associated with ALS. ATAC-seq sample prep, sequencing and peak generation were carried out by Diagenode Inc. as further described87. Briefly, cells were lysed in ATAC-seq resuspension buffer (RSB; 10 mM Tris-HCl, pH 7.4, 10 mM NaCl, 3 mM MgCl2 and protease inhibitors) with a mixture of detergents (0.1% Tween-20, 0.1% NP-40 and 0.01% digitonin) on ice for 5 min. The lysis reaction was washed out with additional ATAC–RSB containing 0.1% Tween-20 and inverted to mix. Then 50,000 nuclei were collected and centrifuged at 450 r.c.f. for 5 min at 4 °C. The pellet was resuspended in 50 µl of transposition mixture (25 µl of 2× Illumina Tagment DNA buffer, 2.5 µl of Illumina Tagment DNA enzyme, 16.5 µl of phosphate-buffered saline, 0.5 µl of 1% digitonin, 0.5 µl of 10% Tween-20 and 5 µl of water). The transposition reaction was incubated at 37 °C for 30 min followed by DNA purification. An initial PCR amplification was performed on the tagmented DNA using Nextera indexing primers (Illumina). Real-time (RT)-qPCR was run with a fraction of the tagmented DNA to determine the number of additional PCR cycles needed, and a final PCR amplification was performed. Size selection was done using AMPure XP beads (Beckman Coulter) to remove small, unwanted fragments (<100 bp). The final libraries were sequenced using the Illumina NextSeq platform (PE, 75-nt kit). All samples passed QC checks that included morphological evaluation of nuclei, fluorescence-based electrophoresis of libraries to assess size distribution and RT-qPCR to assess the enrichment of open chromatin sites. The quality of the sequencing was assessed using FastQC and the reads were aligned to GRCh38 genome build using Bowtie2. We identified open chromatin regions separately for each sample using the peak-calling software MACS2 (ref. 88) and determined differentially open sites using DESeq2 (FDR < 0.1). Peaks were assigned to unique genes using the default HOMER37 parameters, and gene ontology analysis was performed using GOrilla89.
Proteome methods
Whole-proteome extracts from frozen diMNs were digested with trypsin and LysC and subjected to acquisition on the SCIEX 6600 as detailed below. Snap-frozen cell pellets were stored at −80 °C and transferred to the Cedars-Science Medical Center proteomics lab on dry ice, where it was stored at −80 °C until use. Samples were lyophilized and aliquoted into 600-µl polystyrene microcentrifuge tubes containing lysis buffer (6 M urea and 1 mM dithiothreitol in 1.5 M NH4HCO3). The sample was sonicated (QSonica Q800R1) by alternating 10 s on and 10 s off at 70% amplitude while rotating in a 4 °C water bath until the solution was homogenized (~20 min). Samples were centrifuged and the protein concentration determined on the supernatant according to manufacturer’s instructions (Pierce BCA Protein Assay Kit). Then 200 µg of each sample was transferred to a 96-well plate in aliquots and processed on the Biomek i7 Automated workstation (Beckman Coulter) as outlined previously. Briefly, samples underwent the following: reduction of disulfide bonds in 3 mM tris(2-carboxyethyl)phosphine hydrochloride solution, alkylated in 5 mM iodo-3-acetic acid. Addition of β-galactosidase at 2 µg and protein digestion in solution using equimolar trypsin and LysC enzyme mixture (Promega, catalog no. V5111) followed at 1:40 enzyme:protein ratio under optimized digestion conditions (4 h at 37 °C). Digested proteins were desalted on a 5-mg Oasis HLB 96-well plate (Waters, catalog no. 186000309) and eluted in 50% acetonitrile. Samples were dried to completion using a speed-vac system and stored at −80 °C until MS analysis. For MS analysis, digested peptides were resuspended in 0.1% formic acid (FA) and analyzed on a 6600 Triple TOF (Sciex) in data-independent acquisition (DIA) mode and on the 6600 Triple TOF (Sciex) for data-dependent acquisition (DDA) mode. Specifically, samples were acquired in DDA mode for ion library generation and in DIA mode over 100 variable windows, similar to previously described acquisition protocols90,91.
DDA data were used for the generation of a sample-specific peptide ion library. DDA files were run through a trans-proteome pipeline using a human canonical FASTA file (Uniprot). A consensus peptide library with decoys was generated and used to quantify ions identified in DIA data files. Previously described DDA library build principles92 were utilized to generate a cell-specific library, which allowed for greater accuracy in matching DIA data to the DDA library during OpenSWATH, as indicated by higher d scores in PyProphet. The differential protein expression between ALS and control samples analyzed was calculated using mapDIA93.
DIA data files were analyzed using OpenSWATH pipeline against the sample-specific peptide ion library generated. Protein-level quantification is calculated by summing transition level intensities for all the proteotypic peptides identified. Differential protein expression between ALS and control samples analyzed was calculated using mapDIA.
Imaging methods
Longitudinal single-cell imaging and analysis
Differentiated iMNs from a subset of the AALS iPS cell lines were plated on 96-well plates for longitudinal single-cell imaging using robotic microscopy as previously described94,95,96,97,98,99,100,101,102,103. At day 25, cells were transduced with expression marker plasmids such as synapsin::EGFP33 to visualize cell morphology and viability. After transduction cells were imaged in an automated fashion with robotic microscopy once per day for 10–14 d. Some image analysis was performed in a computational pipeline constructed within the open-source program Galaxy, to identify and track individual cells and perform survival analysis and other morphological measurements. Additional method details can be found in Supplementary Methods.
Statistics
Statistical methods for the various programs are detailed in the Supplementary Information for the various programs.
Data portal
Data storage and data integration/analytics
AALS was designed to be an ‘open source’ program. All of the clinical datasets, the various omics results, including whole-genome, proteome, transcriptome and epigenome, along with the data integration have been posted to a portal for data sharing and crowd sourcing (https://data.answerals.org; Supplementary Table 3). Data are available for download to all academic and commercial researchers.
Web-based analytics
We have included online analytics for the many ALS researchers who will neither need nor want to download the full dataset. The current set of tools available at http://data.answerals.org/analyze allows users to select genes/pathways of interest and visualize them using braid maps, heatmaps, volcano plots, bar charts or networks (Fig. 4).
The data portal provides users with information about the AALS program, the data, relevant terminology and data release notes. Users can download a metadata package associated with each versioned release. This versioned package contains comprehensive clinical, iPS cell and inventory metadata. In addition, processes for enrolling patients, producing iPS cell lines and performing WGS are explained with links provided to the relevant facilities/institutions. Explanations for sample collection and analysis of epigenomic, proteomic and transcriptomic data are available. Finally, precise definitions are provided for our data levels, which are ways to stratify all the various omics data coming from our analyses (Supplementary Table 20).
Data dissemination
The AALS data portal (http://data.answerals.org; Supplementary Table 3) provides all raw and processed data including longitudinal clinical data and biological data generated by the AALS program, along with visualization/access to the metadata, data and biosamples released. The portal provides an overview of the data release notes, assays, data-level descriptions and links to sites for viewing cell lines/biosamples associated with the program. The website allows browsing of all available metadata (using filter and text search functions), the option to download all data and metadata or a filtered subset and links to obtain individual iPS cell lines from the Cedars-Sinai Biomanufacturing Center. Users interested in downloading datasets are required to submit an online form, acknowledge data use parameters and return a signed Data Use Agreement in compliance with the HIPAA.
Data organization and naming
Data products were organized and named in a unified and systematic manner to allow a smooth end-user experience. Data levels (Supplementary Table 20) were employed as a categorization schema to group similar types of omics data products together. Supplementary Table 21 describes examples of these data levels in action with each experimental assay that our program collects. All data products were prefixed in a systematic manner. The prefix consists of the following components: whether the sample is from a diseased patient or healthy control patient, the de-identified patient GUID, the sample vial ID and the assay type abbreviation. An example of this is the raw transcriptomics FASTQ file CASE-NEUAA599TMX-5310-T_P10_1.fastq.gz. The first underscore separates the prefix from any supplementary file information, allowing for easy tokenization. This nomenclature is applied consistently to all metadata and data files, making it easy to establish relationships with a single study participant.
Reporting Summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
All data supporting the findings of the present study are available within the paper, its Supplementary information files and the AALS web portals listed in Supplementary Table 3 (or via data.answerals.org).
References
Hovestadt, V. et al. Medulloblastomics revisited: biological and clinical insights from thousands of patients. Nat. Rev. Cancer 20, 42–56 (2020).
Katyal, N. & Govindarajan, R. Shortcomings in the current amyotrophic lateral sclerosis trials and potential solutions for improvement. Front. Neurol. 8, 521 (2017).
Philips, T. & Rothstein, J. D. Rodent models of amyotrophic lateral sclerosis. Curr. Protoc. Pharm. 69, 5 67 61–21 (2015).
Donnelly, C. J. et al. RNA toxicity from the ALS/FTD C9ORF72 expansion is mitigated by antisense intervention. Neuron 80, 415–428 (2013).
Sareen, D. et al. Targeting RNA foci in iPSC-derived motor neurons from ALS patients with a C9ORF72 repeat expansion. Sci. Transl. Med. 5, 208ra149 (2013).
Taylor, J. P., Brown, R. H. Jr. & Cleveland, D. W. Decoding ALS: from genes to mechanism. Nature 539, 197–206 (2016).
Agurto, C. et al. Analyzing progression of motor and speech impairment in ALS. Annu. Int. Conf. IEEE Eng. Med Biol. Soc. 2019, 6097–6102 (2019).
Stegmann, G. M. et al. Estimation of forced vital capacity using speech acoustics in patients with ALS. Amyotroph. Lateral Scler. Frontotemporal Degeneration 22, 14–21 (2021).
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Genomes Project, C. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).
McVean, G. A genealogical interpretation of principal components analysis. PLoS Genet. 5, e1000686 (2009).
Dolzhenko, E. et al. Detection of long repeat expansions from PCR-free whole-genome sequence data. Genome Res. 27, 1895–1903 (2017).
Kalia, S. S. et al. Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics. Genet. Med. 19, 249–255 (2017).
Prudencio, M. et al. Distinct brain transcriptome profiles in C9orf72-associated and sporadic ALS. Nat. Neurosci. 18, 1175–1182 (2015).
Linsley, J. W. et al. Automated four-dimensional long term imaging enables single cell tracking within organotypic brain slices to study neurodevelopment and degeneration. Commun. Biol. 2, 155 (2019).
Kiskinis, E. et al. Pathways disrupted in human ALS motor neurons identified through genetic correction of mutant SOD1. Cell Stem Cell 14, 781–795 (2014).
Zhang, H. et al. Subgroup analysis reveals molecular heterogeneity and provides potential precise treatment for pancreatic cancers. Onco. Targets Ther. 11, 5811–5819 (2018).
Ashley, E. A. Towards precision medicine. Nat. Rev. Genet. 17, 507–522 (2016).
Li, Y. et al. A comprehensive library of familial human amyotrophic lateral sclerosis induced pluripotent stem cells. PloS ONE 10, e0118266 (2015).
Neuro, L. C. et al. An integrated multi-omic analysis of iPSC-derived motor neurons from C9ORF72 ALS patients. iScience 24, 103221 (2021).
Choi, S. H. et al. A three-dimensional human neural cell culture model of Alzheimer’s disease. Nature 515, 274–278 (2014).
Lim, R. G. et al. Huntington’s disease iPSC-derived brain microvascular endothelial cells reveal WNT-mediated angiogenic and blood-brain barrier deficits. Cell Rep. 19, 1365–1377 (2017).
Coyne, A. N. et al. Nuclear accumulation of CHMP7 initiates nuclear pore complex injury and subsequent TDP-43 dysfunction in sporadic and familial ALS. Sci. Transl. Med. https://doi.org/10.1126/scitranslmed.abe1923 (2021).
Coyne, A. N. et al. G4C2 repeat RNA initiates a POM121-mediated reduction in specific nucleoporins in C9orf72 ALS/FTD. Neuron 107, 1124–1140.e1111 (2020).
Zhang, K. et al. Stress granule assembly disrupts nucleocytoplasmic transport. Cell 173, 958–971.e917 (2018).
Nicolas, A. et al. Genome-wide analyses identify KIF5A as a novel ALS gene. Neuron 97, 1268–1283.e1266 (2018).
Vass, R. et al. Risk genotypes at TMEM106B are associated with cognitive impairment in amyotrophic lateral sclerosis. Acta Neuropathol. 121, 373–380 (2011).
Elden, A. C. et al. Ataxin-2 intermediate-length polyglutamine expansions are associated with increased risk for ALS. Nature 466, 1069–1075 (2010).
Kwart, D. et al. A large panel of isogenic APP and PSEN1 mutant human iPSC neurons reveals shared endosomal abnormalities mediated by APP beta-CTFs, not abeta. Neuron 104, 256–270.e255 (2019).
Karch, C. M. et al. A comprehensive resource for induced pluripotent stem cells from patients with primary tauopathies. Stem Cell Rep. 13, 939–955 (2019).
Marchetto, M. C. et al. A model for neural development and treatment of Rett syndrome using human induced pluripotent stem cells. Cell 143, 527–539 (2010).
Elsheikh, B. et al. Correlation of single-breath count test and neck flexor muscle strength with spirometry in myasthenia gravis. Muscle Nerve 53, 134–136 (2016).
Toombs, J. et al. Generation of twenty four induced pluripotent stem cell lines from twenty four members of the Lothian Birth Cohort 1936. Stem cell Res. 46, 101851 (2020).
Consortium, E. P. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
Li, M. X., Gui, H. S., Kwan, J. S., Bao, S. Y. & Sham, P. C. A comprehensive framework for prioritizing variants in exome sequencing studies of Mendelian diseases. Nucleic Acids Res. 40, e53 (2012).
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
Sim, N. L. et al. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res. 40, W452–W457 (2012).
Chun, S. & Fay, J. C. Identification of deleterious mutations within three human genomes. Genome Res. 19, 1553–1561 (2009).
Schwarz, J. M., Rodelsperger, C., Schuelke, M. & Seelow, D. MutationTaster evaluates disease-causing potential of sequence alterations. Nat. Methods 7, 575–576 (2010).
Li, M. X. et al. Predicting mendelian disease-causing non-synonymous single nucleotide variants in exome sequencing studies. PLoS Genet. 9, e1003143 (2013).
Petrovski, S., Wang, Q., Heinzen, E. L., Allen, A. S. & Goldstein, D. B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).
Itan, Y. et al. The human gene damage index as a gene-level approach to prioritizing exome variants. Proc. Natl Acad. Sci. USA 112, 13615–13620 (2015).
Fadista, J., Oskolkov, N., Hansson, O. & Groop, L. LoFtool: a gene intolerance score based on loss-of-function variants in 60 706 individuals. Bioinformatics 33, 471–474 (2017).
Uhlen, M. et al. Proteomics. Tissue-based map of the human proteome. Science 347, 1260419 (2015).
Consortium, G. T. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015).
Consortium, G. T. The genotype-tissue expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Tennessen, J. A. et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69 (2012).
Galinsky, K. J. et al. Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia. Am. J. Hum. Genet. 98, 456–472 (2016).
Fabian Pedregosa, G. V. et al. Scikit-learn: machine learning in Python. J. Machine Learn. Res. hal-00650905v2 (2012).
Solomon, B. D., Nguyen, A. D., Bear, K. A. & Wolfsberg, T. G. Clinical genomic database. Proc. Natl Acad. Sci. USA 110, 9851–9855 (2013).
Amberger, J. S., Bocchini, C. A., Schiettecatte, F., Scott, A. F. & Hamosh, A. OMIM.org: Online Mendelian Inheritance in Man (OMIM(R)), an online catalog of human genes and genetic disorders. Nucleic Acids Res. 43, D789–D798 (2015).
Landrum, M. J. et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44, D862–D868 (2016).
Green, R. C. et al. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet. Med. 15, 565–574 (2013).
Richards, C. S. et al. ACMG recommendations for standards for interpretation and reporting of sequence variations: revisions 2007. Genet. Med. 10, 294–300 (2008).
Kazazian, J., Boehm, C. D. & Seltzer, W. K. ACMG recommendations for standards for interpretation of sequence variations. Genet. Med. 2, 302–303 (2000).
Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424 (2015).
Li, Q. & Wang, K. InterVar: clinical interpretation of genetic variants by the 2015 ACMG-AMP guidelines. Am. J. Hum. Genet. 100, 267–280 (2017).
Farrer, L. A. et al. Effects of age, sex, and ethnicity on the association between apolipoprotein E genotype and Alzheimer disease. A meta-analysis. APOE and Alzheimer Disease Meta Analysis Consortium. JAMA 278, 1349–1356 (1997).
Abel, O. et al. Development of a smartphone app for a genetics website: the amyotrophic lateral sclerosis online genetics database (ALSoD). JMIR Mhealth Uhealth 1, e18 (2013).
Pinero, J. et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 45, D833–D839 (2017).
Adzhubei, I., Jordan, D. M. & Sunyaev, S. R. Predicting functional effect of human missense mutations using PolyPhen-2. Curr. Protoc. Hum. Genet. Chapter 7, Unit7 20 (2013).
Ramensky, V., Bork, P. & Sunyaev, S. Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 30, 3894–3900 (2002).
Sunyaev, S. R. et al. PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. Protein Eng. 12, 387–394 (1999).
Reva, B., Antipin, Y. & Sander, C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res. 39, e118 (2011).
Shihab, H. A. et al. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum. Mutat. 34, 57–65 (2013).
Shihab, H. A., Gough, J., Cooper, D. N., Day, I. N. & Gaunt, T. R. Predicting the functional consequences of cancer-associated amino acid substitutions. Bioinformatics 29, 1504–1510 (2013).
Shihab, H. A. et al. Ranking non-synonymous single nucleotide polymorphisms based on disease concepts. Hum. Genom. 8, 11 (2014).
Liu, X., Jian, X. & Boerwinkle, E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum. Mutat. 32, 894–899 (2011).
Liu, X., Jian, X. & Boerwinkle, E. dbNSFP v2.0: a database of human non-synonymous SNVs and their functional predictions and annotations. Hum. Mutat. 34, E2393–E2402 (2013).
Liu, X., Wu, C., Li, C. & Boerwinkle, E. dbNSFP v3.0: a one-stop database of functional predictions and snnotations for human nonsynonymous and splice-site SNVs. Hum. Mutat. 37, 235–241 (2016).
The GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
The GTEx Consortium. Human genomics. The genotype-tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015).
Boyle, A. P. et al. Annotation of functional variation in personal genomes using RegulomeDB. Genome Res. 22, 1790–1797 (2012).
Encode Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) project. Science 306, 636–640 (2004).
Encode Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Barrett, T. et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 37, D885–D890 (2009).
Agarwal, V., Bell, G. W., Nam, J. W. & Bartel, D. P. Predicting effective microRNA target sites in mammalian mRNAs. eLife 4, e05005 (2015).
Griffiths-Jones, S. The microRNA registry. Nucleic Acids Res. 32, D109–D111 (2004).
Griffiths-Jones, S., Grocock, R. J., van Dongen, S., Bateman, A. & Enright, A. J. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 34, D140–D144 (2006).
Griffiths-Jones, S., Saini, H. K., van Dongen, S. & Enright, A. J. miRBase: tools for microRNA genomics. Nucleic Acids Res. 36, D154–D158 (2008).
Anders, S., Pyl, P. T. & Huber, W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Milani, P. et al. Cell freezing protocol suitable for ATAC-Seq on motor neurons derived from human induced pluripotent stem cells. Sci. Rep. 6, 25474 (2016).
Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9, R137 (2008).
Eden, E., Navon, R., Steinfeld, I., Lipson, D. & Yakhini, Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinform. 10, 48 (2009).
Holewinski, R. J., Parker, S. J., Matlock, A. D., Venkatraman, V. & Van Eyk, J. E. Methods for SWATH: data independent acquisition on TripleTOF Mass Spectrometers. Methods Mol. Biol. 1410, 265–279 (2016).
Kirk, J. A. et al. Pacemaker-induced transient asynchrony suppresses heart failure progression. Sci. Transl. Med. 7, 319ra207 (2015).
Parker, S. J., Venkatraman, V. & Van Eyk, J. E. Effect of peptide assay library size and composition in targeted data-independent acquisition–MS analyses. Proteomics 16, 2221–2237 (2016).
Teo, G. et al. mapDIA: preprocessing and statistical analysis of quantitative proteomics data from data independent acquisition mass spectrometry. J. Proteom. 129, 108–120 (2015).
Arrasate, M. & Finkbeiner, S. Automated microscope system for determining factors that predict neuronal fate. Proc. Natl Acad. Sci. USA 102, 3840–3845 (2005).
Arrasate, M. & Finkbeiner, S. Protein aggregates in Huntington’s disease. Exp. Neurol. 238, 1–11 (2012).
Arrasate, M., Mitra, S., Schweitzer, E. S., Segal, M. R. & Finkbeiner, S. Inclusion body formation reduces levels of mutant huntingtin and the risk of neuronal death. Nature 431, 805–810 (2004).
Miller, J. et al. Identifying polyglutamine protein species in situ that best predict neurodegeneration. Nat. Chem. Biol. 7, 925–934 (2011).
Mitra, S., Tsvetkov, A. S. & Finkbeiner, S. Single neuron ubiquitin-proteasome dynamics accompanying inclusion body formation in Huntington disease. J. Biol. Chem. 284, 4398–4403 (2009b).
Tsvetkov, A. S. et al. Proteostasis of polyglutamine varies among neurons and predicts neurodegeneration. Nat. Chem. Biol. 9, 586–592 (2013).
HD iPSC Consortium et al. Induced pluripotent stem cells from patients with Huntington’s disease show CAG-repeat-expansion-associated phenotypes. Cell Stem Cell 11, 264–278 (2012).
Barmada, S. J. et al. Cytoplasmic mislocalization of TDP-43 is toxic to neurons and enhanced by a mutation associated with familial amyotrophic lateral sclerosis. J. Neurosci. 30, 639–649 (2010).
Bilican, B. et al. Mutant induced pluripotent stem cell lines recapitulate aspects of TDP-43 proteinopathies and reveal cell-specific vulnerability. Proc. Natl Acad. Sci. USA 109, 5803–5808 (2012).
Serio, A. et al. Astrocyte pathology and the absence of non-cell autonomy in an induced pluripotent stem cell model of TDP-43 proteinopathy. Proc. Natl Acad. Sci. USA 110, 4697–4702 (2013).
Acknowledgements
Program support was provided by the following: Robert Packard Center for ALS Research at Johns Hopkins, Travelers Insurance, ALS Finding a Cure Foundation, Stay Strong Vs. ALS, Answer ALS Foundation, Microsoft, Caterpillar Foundation, American Airlines, Team Gleason, National Institutes of Health, Fishman Family Foundation, Aviators Against ALS, AbbVie Foundation, Chan Zuckerberg Initiative, ALS Association, National Football League, F. Prime, M. Armstrong, Bruce Edwards Foundation, The Judith and Jean Pape Adams Charitable Foundation, Muscular Dystrophy Association, Les Turner ALS Foundation, PGA Tour and Bari Lipp Foundation. We thank the following for overall AALS program guidance: L. Bruijn, J. Fishman, E. Rapp, P. Warlick, C. Durrett, P. Foss, L.P. Rizzuto, D. Rizzuto, S. Gleason, P. Varisco, R. Fishman, B. Goulet and M. Sutherland.
Author information
Authors and Affiliations
Contributions
J.D.R. and C.N.S. conceived the program. J.D.R., L.M.T., E.F., S.F., M.C., J.B., N.M., J.E.V.E., C.N.S. and D.S. designed the overall program and oversaw all resource development. J.D.R., E.G.B., L.M.T., E.F., S.F., M.C., J.B., N.M., J.E.V.E., C.N.V., D.S., J.A.K., J.R., R.N., J.C.B., N.M., S.K. and D.R. wrote the manuscript with input and edits from all the authors. E.G.B., T.G.T., S.F., E.F., D.S., J.B., N.M., J.E.V.E., L.M.T., M.E.C., C.N.S. and J.D.R. provided project leadership. E.G.B., T.G.T., B.L., L.O., E.M., S.T. and S.M.F. managed the project. L.O., L.P., E.G., D.P., I.M. and D.S. produced iPS cells. A.F., S.L., B.M., H.T., L.P., M.G.B., D.S. and C.N.S. performed iPS cell differentiation and distribution. A.F., S.L., C.L., R.M., V.G., M.W., R.H., D.S. and C.N.S. performed iPS cell differentiation analysis. J.A.K., E.V., N.A., L.L., S.W., J.R., M.B.H. and S.F. performed whole-genome analysis and genetics. R.G.L., J.W., J.S., R.M., K.W. and L.M.T. performed transcriptomics. A.M., V.V., R.H., N.S., R.P., D.M.M., V.V. and J.E.V.E. performed proteomics. A.D., N.H., M.A., B.T.W. and E.F. performed epigenomics. K.R., E.V., N.A., R.T., J.A.K. and S.F. performed cell imaging and phenotyping. A.N.C., L.H. and J.D.R. performed cell-based studies. J.L., D.R., R.L., J.W., J.A.K., K.S., A.L., L.P.M., S.F., L.M.T., E.F., N.L.P.M. and S.F. performed integrative analysis and computational modeling. E.M., S.T., A.C., S.L., A.F., L.P., C.P., A.J., S.H., T. Morgan, J.J.B., E.K., J.M., M.M., B.J., D.A., S.K., S.A.D., R.B., D.H., T. Miller, J.D.G., J.B., N.M. and J.D.R. were the clinical study team. D.R., H.Y., E.S., P.V. and A.S. managed the clinical data. O.A., P.R., J.C.B., E.G.B., J.D.R. and J.K. developed the smartphone app. R.N., C.A. and G.C. performed the smartphone app data analytics. T.G.T., B.L., A.L., M.B., Y.R., K.S., E.G.B., T.E., E.B. and E.F. developed the web portal.
Corresponding author
Ethics declarations
Competing interests
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. R.N., C.A. and G.A.C. disclose that their employer, IBM Research, is the research branch of IBM Corporation. R.N, C.A. and G.A.C. own stock in IBM Corporation.
Ethics statement
The AALS trial and smartphone app were approved by the Johns Hopkins institutional review board (nos. 00082277 and 00240000). The AALS program is registered at clinicaltrials.gov (NCT02574390).
Peer review
Peer review information
Nature Neuroscience thanks John Landers and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Answer ALS Operations.
Top. Answer ALS Research Program. Graphic illustration of overall program flow. Bottom. Clinical Sites. Participating clinics were districted nationally at 8 academic or private neurology clinics specializing in ALS clinical care and research.
Extended Data Fig. 2 Smartphone App.
a. Smartphone App. Illustrations from app of various activities. a’. Main Menu, b’. Upper limb motor tests, c’. Bulbar activities, including single breath counting, speech and cognition, d’. Example of cartoon used for speech/cognition analytics. b. Examples of speech and fine motor tasks performed by the smartphone app study participants. Data are collected with an app called “Help us Answer ALS”. Each week, the app asks the participant to perform different tasks. The tasks involve motor control in the upper body, speech and cognition. Each task is performed once per week. The speech tasks include describing a picture (a,b,c), reading a passage (d,e,f), and counting until the subject runs out of breath (not represented). Describing a picture also serves as a cognition task. The motor task involves tracing 3 different contours in sequential order (h,i,j), alternating hand each day of the week.
Extended Data Fig. 3 Production of ALS and control iPS cell spinal motor neurons.
a. Example of IPS Generation Schedule. b. Method of generating iPS cell-derived motor neuron cell lines using the diMNs protocol. c. Brightfield images show the morphology of the cells during differentiation from iPS cell stage to the generation of motor neurons over a period of 32 days. d. Production flow and harvesting schematic of diMNs for multi-omics analyses. e. Quality control of the diMNs produced from iPS cells is performed by imaging of representative wells for immunohistochemical staining with neuronal, motor neuron and glial markers after 32 days of differentiation. Scale bar=400μm. Images representative of over 600 patient cell lines.
Extended Data Fig. 4 Omics Quality Control metrics.
a. Histogram of RNA integrity numbers for current AALS samples. Density plot and histogram of RIN values for all current AALS samples with RNAseq data. Plot shows all processed samples have RIN > 8. b. fragment size distribution Size distribution of ATAC seq data, with peaks representing different n-nucleosomal fragments and clear nucleosome-free regions separated by ~147 bp, the size of a nucleosome. c. Number of Proteins and peptide identification consistency in the data generation batches of AALS samples. d. Violin plot of SERE values for RNAseq data for current AALS samples. Violin plot showing variance of SERE values in BTC (green) and BDC (red) control samples relative to all other (blue) current AALS samples. BTC shows lowest score with the least amount of variance indicating that samples are true technical replicates, while BDC and other samples show increase variance. e. Violin plot of SERE values for ATACseq data for current AALS samples. Similar to RNA data the BTC (green) show lowest variability indicating low technical confounds. f. Coefficient of Variation (CV) for Batch Technical Control (BTC) and Batch differentiation control (BDC) replicates showing 80% proteins to be under a CV of 25%.
Extended Data Fig. 5 Heatmap and hierarchical clustering of current AALS samples.
a&b. Heatmap and hierarchical clustering of SERE values using RNA/ATACseq data. Heatmap and clustering of current AALS samples using SERE values from the (a) RNAseq and (b) ATACseq data. Samples are annotated with gender, genotype, and C9orf72 mutation. No distinct clustering separates samples by these categories, but BTC sample cluster together. c. Spearman correlation matrix plot for the AALS proteomics data.
Extended Data Fig. 6 ATACSeq data.
a and b. CDFs. The number of all peaks (a) and promoter peaks (b) that are common to different numbers of samples. (c) PLEKHG4B locus. (Left) ATAC-seq read density upstream of the PLEKHG4B gene for ALS (middle) and CTR (bottom) samples. Average coverage for each group is shown at the top. (Right) Zoomed in region around the starred peak. d. Motifs. The most overrepresented genomic motifs corresponding to known transcription factors as determined by the HOMER discovery algorithm for ATAC-seq. Motifs for transcription factors implicated in neuronal identity, such as Pdx1, Cux2, and the Lhx family, are significantly enriched.
Supplementary information
Supplementary Information
(A) Expanded methods and (B) Data collection forms.
Supplementary Tables
Supplementary Table 1 Clinic locations for AALS. Supplementary Table 2 AALS clinical events. Supplementary Table 3 Data and biospecimen sources. The authors declare that all data supporting the findings of the present study are available within the paper and its supplementary information files and web portals listed in this table. Supplementary Table 4 Overall clinical demographics. Supplementary Table 5 Fast versus slow progression demographics. Supplementary Table 6 Characterization and validation of IPS cell lines. SupplementaryTable 19 Cell line authentication. Supplementary Table 20 Data-level definitions. Supplementary Table 21 Examples of data levels for each assay. Supplementary Table 22 Stage 1 cell culture media. Supplementary Table 23 Stage 2 platedown media. Supplementary Table 24 Stage 2 media. Supplementary Table 25 Stage 3 media. Supplementary Table 26 Antibody reagents for iPS evaluation.
Supplementary Tables 7–18
Large Excel file-containing Tables 7–18: Supp Table 7_33-ALS-Summary; Supp Table 8_33-ALS-C-PLP Supp Table 9-33-ALS-I-PLP; Supp Table 10-33-ALS-H-PLP; Supp Table 11-33-ALS-ISD; Supp Table 12_C-PLP_All; Supp Table 13_I-PLP_All; Supp Table 14_IS-D_All; Supp Table 15-C9orf72_EH; Supp Table 16-ATXN2_EH; Supp Table 17_ACMG_Gene_ClinVar; Supp Table 18_ACMG_Gene_Intervr.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Baxi, E.G., Thompson, T., Li, J. et al. Answer ALS, a large-scale resource for sporadic and familial ALS combining clinical and multi-omics data from induced pluripotent cell lines. Nat Neurosci 25, 226–237 (2022). https://doi.org/10.1038/s41593-021-01006-0
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41593-021-01006-0
This article is cited by
-
High-dimensional phenotyping to define the genetic basis of cellular morphology
Nature Communications (2024)
-
Induced pluripotent stem cells (iPSCs): molecular mechanisms of induction and applications
Signal Transduction and Targeted Therapy (2024)
-
G2C4 targeting antisense oligonucleotides potently mitigate TDP-43 dysfunction in human C9orf72 ALS/FTD induced pluripotent stem cell derived neurons
Acta Neuropathologica (2024)
-
PolyGR and polyPR knock-in mice reveal a conserved neuroprotective extracellular matrix signature in C9orf72 ALS/FTD neurons
Nature Neuroscience (2024)
-
Advances in sequencing technologies for amyotrophic lateral sclerosis research
Molecular Neurodegeneration (2023)