Data Note

RVDB-prot, a reference viral protein database and its HMM profiles

[version 1; peer review: 2 approved with reservations]
PUBLISHED 23 Apr 2019

Abstract

We present RVDB-prot, a database corresponding to the protein equivalent of the nucleic acid reference virus database RVDB. Protein databases can help perform more sensitive protein sequence comparisons. Like its nucleic acid counterpart, RVDB-prot aims to provide reliable and accurately annotated unique entries, while also including a database of Hidden Markov Model (HMM) protein profiles for distant protein searching.

Keywords

virus, genomes, proteins, hmm, clusters, annotations, database

Introduction

Sequence assignment often uses similarity criteria to infer homology, and hence taxonomy and/or protein type. Searching for this similarity requires reliable, accurate and comprehensive databases. In the specific field of viruses, several solutions are available, yet their ability to provide valid results is highly dependent on the goal of the study and on the available computer resources. Using a database with a high number of sequences, such as NCBI nr/nt, may seem appropriate, but it implies an increased computation time, and annotation quality is not always optimal. RefSeq, on the other hand, is generally better curated, but it contains only full-length genomes and rarely includes the latest discoveries. Other specialized databases provide only specific groups of taxa for specific purposes, for instance virus families responsible for infectious diseases such as HIV or influenza.

Thus, the need for better, well-annotated and comprehensive public viral databases that can be used for the identification of viruses by high-throughput sequencing led Goodacre et al. to propose their Reference Viral DataBase (RVDB) [1]. This database consists of a collection of all currently known viral genomes and virus-related nucleic sequences retrieved from NCBI nr or RefSeq. It undergoes a specific reviewing process, both manual and computational, and its contents are updated four times per year. These features make RVDB quite attractive for the virology research community; version 15.1 was released in February 2019.

Since viral genomes consist mainly of coding sequences, an equivalent reference database providing the protein version of these sequences could prove quite advantageous.

Indeed, protein sequences are useful when searching for distant homologs, because their substitution rates are much lower than those of nucleic sequences. Additionally, proteins can be efficiently clustered according to their similarity, and the resulting clusters can then be used to build Hidden Markov Model (HMM) profiles in order to identify more evolutionarily distant proteins. Programs like HMMER [2] allow the building of an HMM profile from a multiple protein sequence alignment. Such a profile can recognize proteins based on complex position-specific models of sequence conservation and evolution, and it does so more accurately than a classic sequence alignment.

Thus, we propose a protein sequence version of RVDB whose updates will be synchronized with the releases of the original nucleotide RVDB. Here we describe the conversion from the nucleotide version of RVDB to the protein version, RVDB-prot, as well as the clustering process leading to the HMM profiles.

Methods

Conversion from RVDB nucleic database to RVDB-prot

The current version of RVDB, v15.1 [3], consists of a collection of 2 719 839 nucleic sequences [1]. The accession numbers were extracted in order to gather the corresponding database entries in GenBank format. From these entries, the coding sequences (CDS) and their descriptions were located and copied into the protein collection. For traceability, the resulting protein file retains the reference of the nucleic sequence. The sequence names are formatted in the following way:

>acc|<p_bank>|<p_acc>|<n_bank>|<n_acc>|<descr[sp]>, where:

p_bank is the bank in which the protein can be found

p_acc is the accession number corresponding to the protein sequence

n_bank is the bank in which the original nucleotide sequence was found

n_acc is the original information found in the nucleic database

descr is the description of the protein sequence as found in the database entry

sp is the species name.

This process produces a file of 3 899 699 protein sequences.
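The sequence name format described above can be split back into its fields with a few lines of Python. This is only a minimal sketch: it assumes that the first five fields never contain a `|` character and that the species name is the last bracketed group of the final field. The accession numbers in the example are hypothetical placeholders, not entries taken from the database.

```python
import re
from typing import NamedTuple, Optional


class RvdbProtHeader(NamedTuple):
    p_bank: str         # bank in which the protein can be found
    p_acc: str          # protein accession number
    n_bank: str         # bank of the original nucleotide sequence
    n_acc: str          # reference found in the nucleic database
    descr: str          # description of the protein sequence
    sp: Optional[str]   # species name, or None if no [species] suffix


# Assumption: the species is the last "[...]" group at the end of the field.
_SPECIES_RE = re.compile(r"^(?P<descr>.*?)\s*\[(?P<sp>[^\[\]]*)\]\s*$")


def parse_header(line: str) -> RvdbProtHeader:
    """Parse one '>acc|p_bank|p_acc|n_bank|n_acc|descr[sp]' header line."""
    fields = line.rstrip("\n").lstrip(">").split("|", 5)
    if len(fields) != 6 or fields[0] != "acc":
        raise ValueError(f"not an RVDB-prot style header: {line!r}")
    _, p_bank, p_acc, n_bank, n_acc, rest = fields
    match = _SPECIES_RE.match(rest)
    if match:
        descr, sp = match.group("descr"), match.group("sp")
    else:
        descr, sp = rest.strip(), None
    return RvdbProtHeader(p_bank, p_acc, n_bank, n_acc, descr, sp)


if __name__ == "__main__":
    # Hypothetical header, for illustration only.
    example = ">acc|GENBANK|AAA00001.1|GENBANK|X00001.1|capsid protein VP2 [Canine parvovirus]"
    print(parse_header(example))
```

Returning a named tuple keeps the field names identical to those used in the format description, which makes downstream filtering (for instance by species) easier to read.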

Generation of HMM profiles

The rationale for HMM generation was inspired by VFam (a database of profile HMMs built from all the viral proteins present in RefSeq, discontinued since 2014) [4], but was entirely re-coded as a Snakemake pipeline [5], using different tools for some key steps (clustering, alignment). The protein sequences were first clustered with a 100% identity criterion to remove duplicates, using CDHit 4.7.0 [6]. Then, an all-against-all comparison of the remaining sequences was performed using Blast 2.2.26 [7]. These comparisons allow Silix 1.2.6 [8] to define clusters of sequences according to their similarity. This step produces a text file in which each sequence is associated with one cluster. Each cluster containing at least four sequences was written to a fasta file containing all of its sequences. We then performed a multiple alignment of each cluster using Mafft 7.023 [9] in auto mode. The multiple sequence alignments were processed by HMMER 3.2.1 [2] in order to obtain the HMM profiles, which were then concatenated into a single file.
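The cluster selection step (keeping only clusters with at least four sequences) can be sketched as follows. This is not the code of the pipeline itself; it assumes that the clustering output is a text file with one line per sequence, holding a cluster identifier followed by a sequence identifier, separated by whitespace. The function name and the identifiers in the example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def clusters_from_assignments(lines: Iterable[str],
                              min_size: int = 4) -> Dict[str, List[str]]:
    """Group sequence IDs by cluster and keep clusters of at least min_size.

    Assumed input: one '<cluster_id> <sequence_id>' pair per line.
    """
    clusters: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cluster_id, seq_id = line.split()[:2]
        clusters[cluster_id].append(seq_id)
    # Only sufficiently large clusters are aligned and turned into profiles.
    return {cid: seqs for cid, seqs in clusters.items() if len(seqs) >= min_size}


if __name__ == "__main__":
    # Hypothetical assignments, for illustration only.
    demo = ["FAM1 seqA", "FAM1 seqB", "FAM1 seqC", "FAM1 seqD", "FAM2 seqE"]
    for cid, seqs in clusters_from_assignments(demo).items():
        print(cid, len(seqs), "sequences")
```

Each retained cluster would then be written to its own fasta file before alignment, as described above.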

Annotation of HMM profiles

In our pipeline, a cluster consists of a set of sequences, where each sequence belongs to a species and is associated with a description. In order to characterize the clusters, these pieces of information and other indicators (such as cluster length and sequence number) are combined into an annotation database, in SQLite format. The schema of this database is shown in Figure 1.

Figure 1. Schema of the annotation database.

The first type of data associated with a cluster is a set of keywords. These keywords correspond to the union of the words found in the names of all sequences belonging to the cluster, weighted according to their frequencies and excluding trivial words. For instance, for cluster number 1, which contains 588 sequences, the keywords and their frequencies are: parvovirus(441), protein(423), Canine(359), capsid(345), VP2(233), virus(89), VP1(83), disease(48), Aleutian(48), mink(48). These keywords describe a cluster composed of Canine parvovirus capsid protein sequences. The database also stores the taxa of the sequences in each cluster, using NCBI TaxIDs. For each cluster, the taxonomic information is summarized by a Last Common Ancestor (LCA), the taxon in the tree of life to which all the sequence taxa belong. Finally, the database also provides the length of each cluster (number of amino acids in the multiple sequence alignment) and its number of sequences.
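The two cluster summaries described above, keyword frequencies and the Last Common Ancestor, can be illustrated with a short sketch. Several points are assumptions rather than the pipeline's actual behaviour: a word is counted once per sequence name, the list of trivial words is a hypothetical placeholder, and each taxon is given as a root-to-leaf list of TaxIDs (the numeric IDs used in the test of this sketch are made up).

```python
import re
from collections import Counter
from typing import FrozenSet, Iterable, List, Optional, Sequence, Tuple

# Hypothetical stop list; the words actually excluded are not given in the text.
TRIVIAL_WORDS: FrozenSet[str] = frozenset({"a", "an", "and", "for", "in", "of", "the"})


def keyword_frequencies(names: Iterable[str],
                        trivial: FrozenSet[str] = TRIVIAL_WORDS) -> List[Tuple[str, int]]:
    """Count in how many sequence names each non-trivial word appears."""
    counts: Counter = Counter()
    for name in names:
        words = set(re.findall(r"[A-Za-z0-9]+", name))  # one count per name
        counts.update(w for w in words if w.lower() not in trivial)
    return counts.most_common()


def last_common_ancestor(lineages: Sequence[Sequence[int]]) -> Optional[int]:
    """Return the deepest TaxID shared by all root-to-leaf lineages."""
    lca: Optional[int] = None
    for ranks in zip(*lineages):      # walk down all lineages in parallel
        if len(set(ranks)) != 1:      # lineages diverge at this depth
            break
        lca = ranks[0]
    return lca


if __name__ == "__main__":
    names = ["capsid protein VP2 [Canine parvovirus]",
             "capsid protein VP1 [Canine parvovirus]"]
    print(keyword_frequencies(names))
```

Counting each word once per name keeps a frequency bounded by the number of sequences in the cluster, which is consistent with the example above (441 occurrences of "parvovirus" for 588 sequences), although the original weighting scheme may differ.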

This database is available in SQLite format. To provide more direct access, flat text files are also provided: one text file per cluster, identified by its cluster number, contains all the information related to that cluster.
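Because the schema figure is not reproduced here, a reasonable first step when opening the SQLite file is to list its tables and columns. The sketch below relies only on Python's standard `sqlite3` module and on SQLite's own catalogue (`sqlite_master` and `PRAGMA table_info`); it makes no assumption about the table names of the annotation database. The function name `describe_sqlite` is hypothetical.

```python
import sqlite3
from typing import Dict, List


def describe_sqlite(path: str) -> Dict[str, List[str]]:
    """Return a {table name: [column names]} mapping for a SQLite file."""
    con = sqlite3.connect(path)
    try:
        tables = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        schema: Dict[str, List[str]] = {}
        for table in tables:
            # Each PRAGMA table_info row is (cid, name, type, notnull, default, pk).
            columns = con.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [col[1] for col in columns]
        return schema
    finally:
        con.close()


if __name__ == "__main__":
    import sys
    # Usage: python describe.py U-RVDBv15.1-prot.fasta-prot-hmm.sqlite
    for table, columns in describe_sqlite(sys.argv[1]).items():
        print(table, ":", ", ".join(columns))
```

Once the table and column names are known, ordinary SQL queries can retrieve the keywords, taxa and LCA of a given cluster.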

Software availability

The different steps explained above are performed using a Snakemake pipeline [5], available at Institut Pasteur’s GitLab.

Several tools are needed to run the pipeline, including: Python, Mafft, Golden, Hmmer, Snakemake, Silix, Blast+. The versions of these tools compatible with the pipeline are listed in the README file.

Data availability

Underlying data

Database files are available at https://rvdb-prot.pasteur.fr/. Release 15.1 described in this manuscript is also available from Figshare.

Figshare: U-RVDBv15.1 https://dx.doi.org/10.6084/m9.figshare.7745969 [3].

This project contains the following underlying data:

  • U-RVDBv15.1-prot.fasta (fasta file containing protein features of the original database: -prot.fasta)

  • U-RVDBv15.1-prot.fasta-prot.hmm (the HMM profiles, generated with and for hmmer 3.2.1 (from 2019, 3.1b2 before))

  • U-RVDBv15.1-prot.fasta-prot-hmm.sqlite (SQLite db containing annotations (please find a documentation below))

  • U-RVDBv15.1-prot.fasta-annot.txt (a directory of annotations with plain text files (one per protein family))

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Table 1 shows some summary metrics for the entries of this release and the different resources.

Table 1. Metrics for release 15.1.

Nucleic sequences    RVDB             2 719 839
Proteins             RVDB-prot        3 899 699
Unique proteins      RVDB-prot          489 207
Clusters             RVDB-prot HMM       86 482

Updates are manually curated each time a new release of the main database (nucleic RVDB) is announced, i.e., four times a year. The following older versions are also available online: 14.0 (2018-09), 13.0 (2018-06), 12.2 (2018-03), 11.5 (2017-10), 10.2 (2017-04).

Usage

HMMER can be used to search a fasta sequence file (sequences.fasta) against all profiles: hmmsearch U-RVDBv15.1-prot.fasta-prot.hmm sequences.fasta > result.out. Additional options are described in the HMMER User’s Guide.
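For downstream processing, the tabular output of hmmsearch is easier to handle than the default report. Assuming the search is run with the `--tblout` option of HMMER 3 (for example `hmmsearch --tblout hits.tbl U-RVDBv15.1-prot.fasta-prot.hmm sequences.fasta`), the following sketch reads the per-sequence table and keeps hits below an E-value threshold. The threshold of 1e-5 and the file name `hits.tbl` are illustrative assumptions, not values recommended by the authors.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    target: str    # name of the matched sequence
    profile: str   # name of the HMM profile (the query in hmmsearch)
    evalue: float  # full-sequence E-value
    score: float   # full-sequence bit score


def parse_tblout(lines: Iterable[str], max_evalue: float = 1e-5) -> List[Hit]:
    """Parse hmmsearch --tblout lines and keep hits with E-value <= max_evalue."""
    hits: List[Hit] = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip comment and blank lines
        cols = line.split()
        if len(cols) < 6:
            continue  # not a data line
        # Columns: target name, accession, query name, accession, E-value, score, ...
        evalue = float(cols[4])
        if evalue <= max_evalue:
            hits.append(Hit(cols[0], cols[2], evalue, float(cols[5])))
    return hits


if __name__ == "__main__":
    with open("hits.tbl") as handle:  # hypothetical output file name
        for hit in parse_tblout(handle):
            print(hit.target, hit.profile, hit.evalue)
```

The profile names reported in the table can then be matched with the cluster annotations (keywords, taxa and LCA) to interpret each hit.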

Bigot T, Temmam S, Pérot P and Eloit M. RVDB-prot, a reference viral protein database and its HMM profiles [version 1; peer review: 2 approved with reservations] F1000Research 2019, 8:530 (https://doi.org/10.12688/f1000research.18776.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 20 May 2019
Guy Perriere, Laboratoire de Biométrie et Biologie Evolutive, CNRS, UMR5558, Universite Claude Bernard - Lyon 1, Villeurbanne, 69622, France 
Approved with Reservations
This is an interesting resource that can be of use for people dealing with comparative genomics in viruses. There are some points that need to be clarified before this paper can be indexed though.
  1. In order
HOW TO CITE THIS REPORT
Perriere G. Reviewer Report For: RVDB-prot, a reference viral protein database and its HMM profiles [version 1; peer review: 2 approved with reservations]. F1000Research 2019, 8:530 (https://doi.org/10.5256/f1000research.20570.r47713)
  • Author Response 07 Sep 2020
    Thomas Bigot, Institut Pasteur, Paris, France
    07 Sep 2020
    Author Response
    We would like to thank the Reviewer. Please find below our line-by-line responses.

    In order to ease reproducibility, the parameters used for the different programs (e.g. HMMER, SiLiX) of ... Continue reading
Reviewer Report 16 May 2019
Philippe le Mercier, Swiss-Prot Group, CMU, Swiss Institute of Bioinformatics, Geneva, Switzerland 
Approved with Reservations
In this article, the authors present a RVDB-prot, a reference viral protein database and its HMM profiles. The purpose of this approach is providing a complete reference database of viral proteins to identify new sequences. Their database is based on ... Continue reading
HOW TO CITE THIS REPORT
le Mercier P. Reviewer Report For: RVDB-prot, a reference viral protein database and its HMM profiles [version 1; peer review: 2 approved with reservations]. F1000Research 2019, 8:530 (https://doi.org/10.5256/f1000research.20570.r47715)
  • Author Response 07 Sep 2020
    Thomas Bigot, Institut Pasteur, Paris, France
    07 Sep 2020
    Author Response
    We would like to thank the Reviewer. Please find below our line-by-line responses.

    The database is described as facilitating sequence assignation. This seems a bit vague, a sentence describing ... Continue reading