Main

Natural products (NP) from marine and terrestrial environments, including their inhabiting microorganisms, plants, animals, and humans, are routinely analyzed using mass spectrometry (MS). However, a single MS experiment can collect thousands of MS/MS spectra in minutes1, and individual projects can acquire millions of spectra. These data sets are too large for manual analysis. Furthermore, comprehensive software and proper computational infrastructure are not readily available and only low-throughput sharing of either raw or annotated spectra is feasible, even among members of the same laboratory. The potentially useful information in MS/MS data sets can thus remain buried in papers, laboratory notebooks, and private databases, hindering retrieval, mining, and sharing of data and knowledge. Although several NP databases (Dictionary of Natural Products2, AntiBase3, and MarinLit4) assist in dereplication (identification of known compounds), these resources are not freely available and do not process MS data. Conversely, MS databases, including MassBank5, Metlin6, mzCloud7, and ReSpect8, host MS/MS spectra but limit data analyses to several individual spectra or a limited number of liquid chromatography (LC)–MS files. Other free online computational resources that leverage the MS/MS spectra of Metlin, such as those provided by mzCloud and XCMS Online, are available, but neither allows free download of its reference library.

Global genomics and proteomics research has been facilitated by the development of integral resources, such as the US National Center for Biotechnology Information (NCBI; Bethesda, MD, USA) and UniProt KnowledgeBase (UniProtKB), which provide robust platforms for data sharing and knowledge dissemination9,10. Recognizing the need for an analogous community platform to analyze NP MS data, we present GNPS. GNPS is a data-driven platform for the storage, analysis, and knowledge dissemination of MS/MS spectra that enables community sharing of raw spectra, continuous annotation of deposited data, and collaborative curation of reference spectra (referred to as spectral libraries) and experimental data (organized as data sets).

GNPS provides the ability to analyze a data set and to compare it to all publicly available data. By building on the computational infrastructure of the University of California San Diego (UCSD) Center for Computational Mass Spectrometry (CCMS; http://proteomics.ucsd.edu/), GNPS provides public data set deposition and/or retrieval through the Mass Spectrometry Interactive Virtual Environment (MassIVE) data repository. The GNPS analysis infrastructure further enables online dereplication6,11,12,13, automated molecular networking analysis14,15,16,17,18,19,20,21, and crowdsourced MS/MS spectrum curation. Each data set added to the GNPS repository is automatically reanalyzed in the next monthly cycle of continuous identification (see 'Living data by continuous analysis' below). Each of the tens of millions of spectra in GNPS data sets is searched against reference spectral libraries to annotate molecules and to discover putative analogs (Fig. 1a). From January 2014 to November 2015, GNPS grew to serve 9,267 users from 100 countries (Fig. 1b), with 42,486 analysis sessions that have processed >93 million spectra as molecular networks from a quarter-million LC–MS runs. Searches against a combined catalog of over 221,000 MS/MS reference library spectra from 18,163 compounds (Supplementary Table 1) are possible, and GNPS has searched almost one hundred million MS/MS spectra in all public and private search jobs using an estimated 84,000 compute hours.

Figure 1: Overview of GNPS.

(a) Representation of interactions among the NP community, GNPS spectral libraries, and GNPS data sets. At present 221,083 MS/MS spectra from 18,163 unique compounds are used for searches in GNPS. These include both third-party libraries, such as MassBank, ReSpect, and NIST, as well as spectral libraries created for GNPS (GNPS-Collections) and spectra from the NP community (GNPS-Community). GNPS spectral libraries grow through user contributions of new identifications of MS/MS spectra. To date, 55 community members have contributed 8,853 MS/MS spectra from 5,568 unique compounds (30.5% of the unique compounds available). In addition, ongoing curation efforts have already yielded 563 annotation updates for library spectra. The utility of these libraries is to dereplicate compounds (recognition of previously characterized and studied known compounds), in both public and private data. This dereplication process is performed on all public data sets and results are automatically reported, thus enabling users to query all data sets, organisms, and conditions. Automatic reanalysis of all public data creates a virtuous cycle in which contributions to libraries can be matched to all public data. Combined with molecular networking (Fig. 3), this automatic reanalysis empowers community members to identify analogs that can then be added to GNPS spectral libraries. (b) The GNPS platform has grown to serve a global user base of >9,200 users from 100 countries.

GNPS spectral libraries

GNPS spectral libraries enable dereplication, variable dereplication (approximate matches to spectra of related molecules), and identification of spectra in molecular networks. GNPS has collected available MS/MS spectral libraries relevant to NP (which also include other metabolites and molecules), including MassBank5, ReSpect8, and NIST22 (Table 1, Fig. 2a and Supplementary Table 1). Altogether, these third-party libraries total 212,230 MS/MS spectra representing 12,694 unique compounds (Fig. 2b). Although this combined collection of reference spectra provides a starting point for dereplication, only 1.01% of all spectra in public GNPS data sets have been matched to this collection, indicating insufficient chemical space coverage. Although the NP community is working to populate this 'missing' chemical space, there is no way to report discoveries of chemistries in an easily verifiable and reusable format.

Table 1: Metabolomics and NP MS/MS computational resources overview

Figure 2: GNPS spectral libraries.

(a) The computational resources of the metabolomics and the NP community fall into two main categories: first, reference collections (red dots) of MS/MS spectral libraries; and second, data repositories (blue dots) designed to publicly share raw MS data associated with research projects. Reference collection resources are contributors and aggregators of reference MS/MS spectra, some of which also include data analysis tools, for example, online multi-spectrum MS/MS search (magnifying glass icon). Several resources have aggregated MS/MS spectra from various reference collections so that the analysis tools at a respective resource can leverage more of the community efforts to annotate data (red and blue arrows). GNPS has imported all freely available reference collections (>221,000 MS/MS spectra) and makes them available for online analyses. GNPS and several other resources provide both reference MS/MS spectra and data in an open and free manner to the public (pink caps). (b) Comparison of spectral library sizes of available libraries (MassBank, ReSpect, and NIST) and GNPS libraries; GNPS-Collections includes newly acquired spectra from synthetic or purified compounds and GNPS-Community includes all community-contributed spectra. (c) Searching all public GNPS data sets revealed that MassBank, ReSpect, and NIST libraries matched to 1,217 unique compounds, with GNPS libraries increasing unique compound matches by 41% (corresponding to 29% of total unique matches) with an accompanying 4% increase in spectral library size. Overall, GNPS libraries increase the total number of spectra matched in public data sets by 144% (59% of total public MS/MS matches), and spectra matches across all GNPS public and private data by 767% (88% of all MS/MS matches). (d) The distribution of precursor masses in all GNPS public data sets is shown in gray and compared to the precursor mass distributions of MassBank, ReSpect, NIST, and GNPS libraries (color key as in b). Though GNPS libraries have a combined size that is smaller than MassBank, ReSpect, and NIST, GNPS libraries have a higher proportion of molecules in the higher m/z range and therefore complement the proportionately lower precursor mass molecules in other libraries. (e) The quality of spectrum matches obtained by searching against the available spectral libraries is assessed by user ratings (1 to 4 stars; Supplementary Table 6) of continuous identification results. User ratings of >2.5 stars for >98% of GNPS library matches compare favorably with the 90% mark for NIST matches, whose high marks demonstrate how important these third-party libraries still are to the GNPS platform. We note that the lower mark for NIST matches does not suggest lower-quality spectra. It is more likely explained by its higher emphasis on lower precursor mass molecules with spectra that have fewer peaks and are generally harder to match.

To begin to address this pressing need, GNPS houses both newly acquired reference spectra (GNPS-Collections) as well as a crowdsourced library of community-contributed reference spectra (GNPS-Community). The GNPS-Collections data set includes NP and pharmacologically active compounds, totaling 6,629 MS/MS spectra of 4,243 compounds (Fig. 2b, Supplementary Table 1, Supplementary Notes 1 and 2, and Supplementary Table 2). The GNPS-Community library has grown to include 2,224 MS/MS spectra of 1,325 compounds from 55 worldwide contributors. Although the total number of MS/MS spectra in GNPS libraries is only 4% of the MS/MS spectra collected in third-party libraries, GNPS libraries contribute matches of MS/MS spectra at a scale disproportionate to their size (Fig. 2c). The GNPS libraries account for 29% of unique compound matches and 59% of the MS/MS matches in public (88% of public and private) data. This indicates that the GNPS libraries contain compounds that are complementary to the chemical space represented in other libraries (Fig. 2c,d). Moreover, in contrast to third-party libraries, spectra submitted to GNPS-Community libraries are immediately searchable by the whole community, such that submissions seamlessly transfer knowledge between laboratories (Fig. 1a) in a process that is akin to the addition of genome annotations to GenBank9.
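The relationship between the relative increases and the shares quoted for the GNPS libraries can be checked with a little arithmetic: if GNPS libraries raise a count by a fraction r over the third-party baseline, they account for r / (1 + r) of the new total. The short Python sketch below applies this to the figures reported in this section and in the Fig. 2c caption. The function and variable names are ours and purely illustrative.

```python
def share_of_total(relative_increase):
    """Fraction of the combined total contributed by an increase over a baseline.

    A baseline of 1 plus an increase of r gives a total of 1 + r,
    so the added part makes up r / (1 + r) of that total.
    """
    return relative_increase / (1.0 + relative_increase)


# Relative increases reported in the Fig. 2c caption.
unique_compound_share = share_of_total(0.41)  # 41% more unique compound matches
public_match_share = share_of_total(1.44)     # 144% more spectra matched in public data
all_match_share = share_of_total(7.67)        # 767% more across public and private data

# Library sizes reported in the text.
gnps_spectra = 6_629 + 2_224                  # GNPS-Collections + GNPS-Community
third_party_spectra = 212_230                 # MassBank, ReSpect, and NIST combined
size_ratio = gnps_spectra / third_party_spectra

print(f"{unique_compound_share:.0%} {public_match_share:.0%} "
      f"{all_match_share:.0%} {size_ratio:.0%}")
# prints: 29% 59% 88% 4%
```

The same sums reproduce the combined library size quoted elsewhere in the text: 212,230 third-party spectra plus 8,853 GNPS spectra give 221,083.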

To create a robust library, we have to ensure that submissions are peer-reviewed and, if necessary, annotations corrected or updated as appropriate. Reference spectra submitted to the GNPS-Community library are categorized by the estimated reliability of the proposed submissions. Gold reference spectra must be derived from structurally characterized synthetic or purified compounds and can be submitted only by approved users. Approval is given to contributors who have undergone training. Training is initiated by contacting the corresponding authors or CCMS administrators. Silver reference spectra need to be supported by an associated publication, and bronze reference spectra comprise all remaining putative annotations (Supplementary Table 3). This type of division of spectra is reminiscent of RefSeq/TPA/GenBank9,23 (genomics) and Swiss-Prot/TrEMBL/UniProt24,25 (proteomics), allowing varying tradeoffs between comprehensiveness and reliability of annotations defined as gold, silver, or bronze (Fig. 2e).

To enable refinements or corrections of annotations, GNPS allows community-driven, iterative re-annotation of reference MS/MS spectra in a wiki-like fashion, to progressively improve the library and converge toward consensus annotation of all MS/MS spectra of interest. This is a process similar to the iterative annotation of the human genome9. To date, 563 annotation revisions have been made in GNPS (Supplementary Table 4), most of which added metadata to library spectra or refined compound names. The history of each annotation is retained so that users can discuss the proper annotation and address disagreements through comment threads.

Dereplication using GNPS

High-throughput dereplication of NP MS/MS data is implemented in GNPS by querying newly acquired MS/MS spectra against all the accumulated reference spectra in GNPS spectral libraries (Fig. 3a). To date, >93 million MS/MS spectra from various instruments (including Orbitrap, Ion Trap, qTOF, and FT-ICR) have been searched at GNPS, yielding putative dereplication matches of 7.7 million spectra to 15,477 compounds. In the second stage of dereplication, GNPS goes beyond re-identification by using variable dereplication, which is a modification-tolerant spectral library search that is mediated by a spectral alignment algorithm. Variable dereplication enables the detection of significant matches to either putative analogs of known compounds (e.g., differing by one modification or substitution of a chemical group) or compounds belonging to the same general class of molecules (Fig. 3b). Variable dereplication is not available through any other computational platform. For example, GNPS variable dereplication has detected compounds with different levels of glycosylation on various substrates. As MS/MS fragmentation preferentially results in peaks from glycan fragments, it is possible to detect sets of compounds with related glycans even when the substrates to which the glycans are attached are themselves unrelated26. To date, 3,891 putative analogs have been identified in public data using GNPS variable dereplication (Supplementary Table 5). These 3,891 putative analogs include several unique molecules that could be user-curated and added to GNPS reference libraries (see 'Molecular Explorer' below on accessing and annotating putative analogs).
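As a rough illustration of the first stage of dereplication, the Python sketch below scores a query spectrum against every reference spectrum whose precursor m/z lies within a tolerance, using a plain binned cosine score. It is a minimal sketch under stated assumptions and not the GNPS implementation: the function names, the 0.5 m/z bin width, the 2.0 precursor tolerance, and the toy spectra are all hypothetical, and the 0.7 score cutoff is borrowed from the Fig. 4a caption.

```python
import math


def binned(peaks, bin_width=0.5):
    """Sum peak intensities into fixed-width m/z bins (illustrative choice)."""
    bins = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)
        bins[key] = bins.get(key, 0.0) + intensity
    return bins


def cosine(peaks_a, peaks_b, bin_width=0.5):
    """Plain cosine similarity between two binned spectra, in [0, 1]."""
    a, b = binned(peaks_a, bin_width), binned(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_library(query, library, precursor_tol=2.0, min_score=0.7):
    """Return (name, score) for references passing both filters, best first."""
    hits = []
    for ref in library:
        # Only compare spectra with similar precursor m/z (assumed tolerance).
        if abs(ref["precursor_mz"] - query["precursor_mz"]) > precursor_tol:
            continue
        score = cosine(query["peaks"], ref["peaks"])
        if score >= min_score:
            hits.append((ref["name"], round(score, 3)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data: (m/z, intensity) pairs invented for this example.
library = [
    {"name": "reference A", "precursor_mz": 450.2,
     "peaks": [(100.1, 50.0), (200.2, 100.0), (300.3, 25.0)]},
    {"name": "reference B", "precursor_mz": 812.4,
     "peaks": [(150.1, 80.0), (410.2, 100.0)]},
]
query = {"precursor_mz": 450.3,
         "peaks": [(100.1, 45.0), (200.2, 100.0), (300.3, 30.0)]}

hits = search_library(query, library)
print(hits)
```

Here only "reference A" is returned, because "reference B" is excluded by the precursor filter before any peaks are compared. Variable dereplication relaxes exactly this filter and instead tolerates a mass shift between the two spectra.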

Figure 3: Molecular network creation and visualization.

(a) Molecular networks are constructed from the alignment of MS/MS spectra to one another. Edges connecting nodes (MS/MS spectra) are defined by a modified cosine scoring scheme that determines the similarity of two MS/MS spectra with scores ranging from 0 (totally dissimilar) to 1 (completely identical). MS/MS spectra are also searched against GNPS spectral libraries, seeding putative node matches in the molecular networks. Networks are visualized online in-browser or exported for third-party visualization software such as Cytoscape31. (b) An example alignment between three MS/MS spectra of compounds with structural modifications that are captured by modification-tolerant spectral matching used in variable dereplication and molecular networking. (c) In-browser molecular network visualization enables users to interactively explore molecular networks without requiring any external software. To date, >11,000 molecular networks have been analyzed using this feature. Within this interface, (i) users are able to define cohorts of input data and correspondingly, nodes within the network are represented as pie charts to visualize spectral count differences for each molecule across cohorts. (ii) Node labels indicate matches made to GNPS spectral libraries, with additional information displayed with mouseovers. These matches provide users a starting point to annotate unidentified MS/MS spectra within the network. (iii) To facilitate identification of unknowns, users can display MS/MS spectra in the right panels by clicking on the nodes in the network, giving direct interactive access to the underlying MS/MS peak data. Furthermore, alignments between spectra are visualized between spectra in the top right and bottom right panels to gain insight as to what underlying characteristics of the molecule could elicit fragmentation perturbations.

To assess the reliability of the MS/MS matches found by GNPS dereplication, GNPS users can rate the quality of matches returned by automated GNPS reanalysis (see below). These ratings are four star (correct), three star (likely correct; e.g., could also be isomers with similar fragmentation patterns), two star (unable to confirm the annotation due to limited information), and one star (incorrect) (Supplementary Table 6). So far, of the 3,608 matches that have been rated, 139 (3.9%) were given one or two stars by users (insufficient information, 2.9%; incorrect, 1%). These percentages are consistent with the false-discovery rates estimated using spectral library searches of benchmark LC–MS data sets with compound standards (Supplementary Note 3, Supplementary Figs. 1 and 2, and Supplementary Table 7). Furthermore, these 3,608 match ratings were associated with 2,041 library spectra; the average rating of a library spectrum can therefore offer insight into the reliability of its reference annotation, not unlike Yelp ratings for restaurants. Incorrect matches can arise through either spurious high-scoring matches to library spectra or incorrect annotations for library spectra. Of the 2,041 library spectra with match ratings, 72 (3.5%) had average ratings below 2.5 stars. These percentages were further broken down by spectral library (Fig. 2e). We found that for GNPS-Collection and GNPS-Community libraries, only 29 of the 1,746 rated library spectra (1.7%) had average ratings below 2.5 stars. These ratings demonstrate that the perceived reliability of GNPS spectral libraries compares favorably with established community resources such as NIST and MassBank, in which 10.5% and 20.1% of the ratings were below 2.5 stars, respectively, and provide confidence that the community curation process is robust and that third-party libraries integrate well with GNPS. The main advantages of searching using GNPS are the option to run simple or variable dereplication against all publicly accessible reference spectra, and that community-rated matches can be used to improve the quality of the reference libraries and matching algorithms. These dereplication capabilities are not possible with existing published resources.
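The rating statistics above boil down to two steps: averaging the star ratings of all matches that point to the same library spectrum, then counting the library spectra whose average falls below 2.5 stars. The Python sketch below shows one way this could be computed. The function names and the toy ratings are hypothetical, and only the counts in the final lines are taken from the text.

```python
from collections import defaultdict


def average_ratings(match_ratings):
    """Map each library spectrum ID to the mean of its star ratings (1 to 4)."""
    by_spectrum = defaultdict(list)
    for spectrum_id, stars in match_ratings:
        by_spectrum[spectrum_id].append(stars)
    return {sid: sum(stars) / len(stars) for sid, stars in by_spectrum.items()}


def fraction_below(averages, cutoff=2.5):
    """Fraction of library spectra whose average rating is below the cutoff."""
    if not averages:
        return 0.0
    low = sum(1 for avg in averages.values() if avg < cutoff)
    return low / len(averages)


# Toy ratings: (library spectrum ID, stars) pairs invented for this example.
ratings = [("lib1", 4), ("lib1", 3), ("lib2", 1), ("lib2", 2),
           ("lib3", 4), ("lib4", 3)]
averages = average_ratings(ratings)
low_fraction = fraction_below(averages)
print(averages, low_fraction)

# Percentages recomputed from the counts reported in the text.
low_rated_matches = round(100 * 139 / 3608, 1)   # matches rated one or two stars
low_rated_spectra = round(100 * 72 / 2041, 1)    # library spectra averaging < 2.5
low_rated_gnps = round(100 * 29 / 1746, 1)       # GNPS library spectra averaging < 2.5
print(low_rated_matches, low_rated_spectra, low_rated_gnps)
```

In the toy data, one of four rated library spectra ("lib2") averages below 2.5 stars. The recomputed percentages agree with the 3.9%, 3.5%, and 1.7% quoted above.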

Molecular networking

Molecular networks are visual displays of the chemical space present in MS experiments. GNPS can be used for molecular networking14,15,16,17,18,19,20,21,27,28, a spectral correlation and visualization approach that can detect sets of spectra from related molecules (so-called spectral networks29), even when the spectra themselves are not matched to any known compounds (Fig. 3a). Spectral alignment15,27 detects similar spectra from structurally related molecules, assuming these molecules fragment in similar ways reflected in their MS/MS patterns (Fig. 3b), analogous to the detection of related protein or nucleotide sequences by sequence alignment.
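To make the idea of modification-tolerant spectral alignment concrete, the Python sketch below computes a simplified modified cosine score. Peaks are paired either at the same m/z or at an m/z offset equal to the precursor mass difference, so fragments that carry the modification still align. This is a greedy approximation written for illustration, not the scoring code used by GNPS. The tolerance, the function names, and the toy spectra are assumptions, and refinements common in practice (intensity transforms, a minimum number of matched peaks) are omitted.

```python
import math


def normalize(peaks):
    """Scale intensities so the spectrum has unit Euclidean norm."""
    norm = math.sqrt(sum(intensity * intensity for _, intensity in peaks))
    return [(mz, intensity / norm) for mz, intensity in peaks] if norm else []


def modified_cosine(spec_a, spec_b, tol=0.02):
    """Greedy modified cosine: peaks may match directly or shifted by the
    precursor m/z difference; each peak is used in at most one pair."""
    shift = spec_a["precursor_mz"] - spec_b["precursor_mz"]
    a, b = normalize(spec_a["peaks"]), normalize(spec_b["peaks"])

    candidates = []
    for i, (mz_a, int_a) in enumerate(a):
        for j, (mz_b, int_b) in enumerate(b):
            delta = mz_a - mz_b
            if abs(delta) <= tol or abs(delta - shift) <= tol:
                candidates.append((int_a * int_b, i, j))

    # Take the highest-scoring pairs first, never reusing a peak.
    candidates.sort(reverse=True)
    used_a, used_b, score = set(), set(), 0.0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            score += product
    return score


# Toy spectra: the analog carries a +14.0 shift on two of its three fragments.
parent = {"precursor_mz": 500.0,
          "peaks": [(100.0, 30.0), (250.0, 100.0), (400.0, 60.0)]}
analog = {"precursor_mz": 514.0,
          "peaks": [(100.0, 30.0), (264.0, 100.0), (414.0, 60.0)]}

score_shifted = modified_cosine(analog, parent)

# Pretending both precursors are equal disables the shifted matches.
same_mass = dict(analog, precursor_mz=500.0)
score_unshifted = modified_cosine(same_mass, parent)

print(round(score_shifted, 3), round(score_unshifted, 3))
```

With the shift allowed, all three peak pairs align and the score is 1 (up to floating-point rounding). Without it, only the unmodified fragment at m/z 100 matches and the score drops below 0.1, which is why a plain cosine search would miss such an analog.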

GNPS is currently the only public infrastructure that enables molecular networking. The visualization of molecular networks in GNPS represents each spectrum as a node, and spectrum-to-spectrum alignments as edges (connections) between nodes. Nodes can be supplemented with metadata, including dereplication matches or information that is provided by the user, such as abundance, origin of product, biochemical activity or hydrophobicity, which can be reflected in a node's size or color. The map of related molecules can be visualized as a molecular network21,30,31,32,33 (Supplementary Fig. 3) online at GNPS (Fig. 3c) or exported for analysis in Cytoscape31. Molecular networking analyses of 272 public data sets (Fig. 4a) from a diverse range of samples reveal that on average 35.2% of all unidentified nodes are matched to other spectra of related molecules at a cosine score threshold of 0.8 (55.3% of all nodes in more exploratory networks with a cosine score threshold of 0.65; Supplementary Table 8). This suggests that a large fraction of all unidentified spectra would be identifiable if reference spectra for them or for their neighboring nodes were available in the reference spectral libraries.
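The node-and-edge description above can be sketched in a few lines of Python: pairs of spectra whose alignment score reaches a cosine threshold become edges, and the connected components of the resulting graph are the candidate molecular families. This is an illustrative sketch only. The function names, the spectrum identifiers, and the pairwise scores are invented, and the two thresholds (0.8 and 0.65) are the ones quoted in the text.

```python
from collections import defaultdict


def build_network(pair_scores, min_cosine=0.8):
    """Keep an undirected edge for every spectrum pair scoring >= min_cosine."""
    edges = defaultdict(set)
    for (a, b), score in pair_scores.items():
        if score >= min_cosine:
            edges[a].add(b)
            edges[b].add(a)
    return edges


def connected_components(nodes, edges):
    """Group nodes into connected components with an iterative depth-first search."""
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(edges.get(node, set()) - component)
        seen |= component
        components.append(component)
    return components


def networked_fraction(components, n_nodes):
    """Fraction of nodes that have at least one neighbor in the network."""
    return sum(len(c) for c in components if len(c) > 1) / n_nodes


# Toy pairwise alignment scores, invented for this example.
nodes = ["s1", "s2", "s3", "s4", "s5"]
pair_scores = {("s1", "s2"): 0.92, ("s2", "s3"): 0.81,
               ("s3", "s4"): 0.70, ("s4", "s5"): 0.66}

strict = connected_components(nodes, build_network(pair_scores, 0.8))
relaxed = connected_components(nodes, build_network(pair_scores, 0.65))

print(networked_fraction(strict, len(nodes)),
      networked_fraction(relaxed, len(nodes)))
```

In this toy case three of five spectra are networked at a threshold of 0.8 and all five at 0.65, mirroring the way relaxed, more exploratory parameters connect a larger share of nodes.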

Figure 4: 'Living data' in GNPS by crowdsourcing molecular annotations.

(a) A global snapshot of the state of MS/MS matching of public NP data sets available in GNPS using molecular networking and library search tools. Identified molecules (1.9% of the data) are MS/MS spectrum matches to library spectra with a cosine >0.7. Putative analog molecules (another 1.9% of the data) are MS/MS spectra that are not identified by library search but rather are immediate neighbors of identified MS/MS spectra in molecular networks. Identified Networks (9.9% of the data) are connected components within a molecular network that have at least one spectrum match to library spectra. Unidentified networks (25.2% of the data) are molecular networks where none of the spectra match to library spectra; these networks potentially represent compound classes that have not yet been characterized. Exploratory networks (an additional 20.1% of the data) are unidentified connected components in molecular networks with more relaxed parameters (Supplementary Table 8). Thus, 55.3% of the MS/MS spectra at least have one related MS/MS spectrum in spectral networks, with 44.7% having none. In this 44.7% of the data, each MS/MS spectrum has been observed in two separate instances and should not constitute noise. Altogether, this analysis indicates that most of the chemical space captured by MS remains unexplored. (b) In the past year, there has been substantial growth in the GNPS spectral libraries, driving an increase in the match rates of all public data. The number of unique compounds matched in the public data has increased tenfold; the number of total spectra matched has increased 22-fold; and the average match rate has increased threefold. It is expected that identification rates will continue to grow with further contributions from the community to the GNPS-Community spectral library.

Living data by continuous analysis

Funding agencies and publishers have called for raw scientific data, including MS data, and analysis methods to be made publicly available where possible. Consistent with this aim, GNPS data sets usually comprise the full set of MS files produced during a NP research project or the full set of spectra analyzed for a peer-reviewed publication (Supplementary Note 4). Although it is potentially advantageous to the community for all data to be made public, GNPS user data can remain private until users explicitly choose to make them public (private data are also analyzable and privately sharable, with >93 million spectra in >250,000 private LC–MS runs already searched using GNPS). GNPS has the largest collection of publicly accessible natural product and metabolomics MS/MS data sets and is the only infrastructure where public data sets can be reanalyzed together and compared with each other (Table 1). To date, GNPS has made 272 public GNPS data sets openly available, which comprise >30,000 MS runs with 84 million MS/MS spectra. In common with other public repositories34,35, GNPS data sets can be downloaded. However, data availability on its own does not suffice to enable data reuse. GNPS is unique among MS repositories by enabling continuous identification: the periodic and automated reanalysis of all public data sets (Supplementary Notes 5 and 6, and Supplementary Tables 9 and 10). This continuous reanalysis, which incorporates molecular networking and dereplication tools, implements a 'virtuous cycle' (Fig. 1a). Because GNPS spectral libraries are constantly growing, owing to community contributions and continued generation of reference spectra, the number of matches made by successive reanalyses of public data sets has already grown and is expected to continue to grow over time (Fig. 4b). GNPS users are periodically updated with alerts of new search results.

For example, a Streptomyces roseosporus project (MSV000078577) was deposited April 8, 2014. At first, only seven MS/MS spectra were matched. However, as of July 14, 2015, 36 spectral matches were made to GNPS libraries. Overall, the total number of compounds matched to GNPS data sets increased more than tenfold, whereas the number of matched MS/MS spectra in GNPS data sets increased >20-fold in 2015 (Fig. 4b). GNPS users can also subscribe to specific data sets of interest, rather like 'following' people on Twitter. When new matches are made, changed, or revoked, all subscribers are notified of new information by an e-mail summarizing changes in identification. From April 2014 to July 2015, 45 updates were initiated by CCMS and automatically sent to subscribers (Supplementary Fig. 4). Update e-mails have led to substantially more views per data set, compared with non-GNPS data sets (192 proteomics data sets deposited in MassIVE). Continuous identification not only keeps a single data set 'alive', it can also create connections between data sets and users over time. Similarities between data sets could form the basis of a data-mediated social network of users with potentially related research interests despite seemingly disparate research fields, rather like the 'People You May Know' feature on LinkedIn. On average, each GNPS user already has five suggested collaborators (Supplementary Fig. 5).
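The notification step of continuous identification can be thought of as a comparison between the identifications from two successive reanalysis cycles of the same data set. The Python sketch below shows one way to classify matches as new, changed, or revoked and to summarize them for subscribers. It is a hypothetical illustration: the function names, the data layout (spectrum identifier mapped to compound name), the data set label, and the toy identifications are assumptions and do not describe the GNPS code.

```python
def diff_identifications(previous, current):
    """Compare two reanalysis cycles, each a dict of spectrum ID -> compound name."""
    new = {s: c for s, c in current.items() if s not in previous}
    revoked = {s: c for s, c in previous.items() if s not in current}
    changed = {s: (previous[s], c) for s, c in current.items()
               if s in previous and previous[s] != c}
    return {"new": new, "changed": changed, "revoked": revoked}


def summarize(dataset_id, diff):
    """One-line summary of the kind an update e-mail could carry."""
    return (f"{dataset_id}: {len(diff['new'])} new, "
            f"{len(diff['changed'])} changed, {len(diff['revoked'])} revoked")


# Toy identifications from two cycles, invented for this example.
previous = {"scan_12": "compound A", "scan_40": "compound B",
            "scan_77": "compound C"}
current = {"scan_12": "compound A", "scan_40": "compound D",
           "scan_91": "compound E"}

diff = diff_identifications(previous, current)
message = summarize("EXAMPLE-DATASET", diff)
print(message)
# prints: EXAMPLE-DATASET: 1 new, 1 changed, 1 revoked
```

Keeping the previous cycle's results alongside the current ones is what allows a changed or revoked match to be reported at all; a repository that stores only the latest search output could report new totals but not what differs.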

Molecular Explorer

Molecular Explorer is a feature that can only be implemented on 'living data' repositories and thus exists only in GNPS. Molecular Explorer allows users to find all data sets and putative analogs that have ever been observed for a given molecule of interest. We anticipate that this feature could guide the discovery of previously unknown analogs of existing antibiotics. Public NP data contain >100 unidentified putative analogs of antibiotics, such as valinomycin, actinomycin, etamycin, hormaomycin, stendomycin, daptomycin, erythromycin, napsamycin, clindamycin, arylomycin, and rifamycin, highlighting a clear potential to generate leads to discover structurally related antibiotics through the application of GNPS (Supplementary Fig. 6, Supplementary Table 5 and Supplementary Note 7). Box 1 illustrates how this approach was applied to stenothricin (Fig. 5).

Figure 5: GNPS enabled discovery of stenothricin.

(a) The stenothricin molecular family was identified during analysis of a molecular network between chemical extracts of S. roseosporus NRRL 15998 (green) and Streptomyces sp. DSM5940 (blue). This analysis indicates that Streptomyces sp. DSM5940 produces a compound structurally similar to stenothricin with a −41 Da m/z difference. An enlarged version of the network can be found in Supplementary Figure 8. (b) Based on preliminary structural analysis, stenothricin-GNPS (−41 Da) may contain a Lys to Ser substitution. (c) Comparison of the MS/MS of stenothricin D with stenothricin-GNPS 2. (d) Although structurally related, stenothricin and stenothricin-GNPS have different effects on E. coli as visualized using fluorescence microscopy. Red is the membrane stain FM4-64, blue is the membrane-permeable DNA stain DAPI (4′,6-diamidino-2-phenylindole), green is the membrane-impermeable DNA stain SYTOX green. SYTOX green stains DNA only when the cell membrane is damaged. Scale bar, 2 μm.

Several applications of molecular networking and MS/MS-based dereplication using GNPS were published while the infrastructure was under development. Specifically, GNPS has enabled the discovery of NP including colibactin41,42,43,44,45, characterization of biosynthetic pathways46,47, understanding of the chemistry of ecological interactions28,48,49,50,51,52, and development of metabolomics bioinformatics methods53. The application of GNPS workflows to such diverse research areas demonstrates its utility.

Conclusions

GNPS provides a community-led knowledge space in which NP data can be shared, analyzed, and annotated by researchers worldwide. It enables a cycle of annotation in which users curate data, continuous dereplication enables product identification, and a knowledge base of reference spectral libraries and public data sets is created. Selected views from community members were sought by Nature Biotechnology and are presented, together with author responses, in Supplementary Note 8.

The transformation of deposited spectra into living data, which is enabled by the GNPS platform, could mediate connections between researchers and has the potential to transform data networks into social networks. Of 1,272 compound identifications obtained by continuous identification with the GNPS-Community library, 1,063 (83.6%) were made using reference spectra that were not uploaded by the submitter. In other words, the vast majority of identifications were enabled by other community members. This reuse of knowledge and data is analogous to other community-wide curation efforts including Wikipedia and crowdsourced dictionaries. Since their initial deposition, 59% of data sets have gained identifications, with the average data set more than doubling its number of identifications since submission (Supplementary Fig. 19). GNPS enables facile sharing of individual analyses (Supplementary Fig. 20) and uses molecular networks to reveal connections among data sets from different laboratories and biological sources that would otherwise remain disconnected. To date, 3,145 analysis jobs have included files shared among GNPS users, encompassing 548 unique pairs of collaborating individuals. GNPS recasts public data sets as 'conversation starters' in a data-mediated social network.

Although we have described only one simple application of GNPS in this Perspective (the identification of a stenothricin analog in Box 1), the community has already begun to use GNPS to expedite NP analysis28,41,43,45,46,50,52. Furthermore, we expect the user base of GNPS to expand to include other communities that use MS/MS data, including those studying metabolomes, microbiomes, exposomes (measurements of life-course environmental exposures), and the chemistry of the human habitat, as well as researchers in areas as diverse as drug discovery, biomarker stratification of patients, absorption, distribution, metabolism, excretion and toxicology studies, food science, agricultural sciences, and ocean science, all resulting in different GNPS workflows42,44,47,51,53.

Genomics9 and protein structure analysis54 have already shown that models of global collaboration and social cooperation can empower scientific communities to collectively translate big data into shared, reusable knowledge. We believe that GNPS will transform NP research in a similar manner, profoundly influencing the way we explore molecules using MS.

Additional details about the methods used in this work can be found in the Supplementary Methods. Source code and license are available at the CCMS software tools webpage as well as at GitHub (https://github.com/CCMS-UCSD). Source code is also available with this manuscript as Supplementary Source Code.