Phat—a gene finding program for Plasmodium falciparum

https://doi.org/10.1016/S0166-6851(01)00363-2

Abstract

We describe and assess the performance of the gene-finding program pretty handy annotation tool (Phat) on sequence from the malaria parasite Plasmodium falciparum. Phat is based on a generalized hidden Markov model (GHMM) similar to the models used in GENSCAN, Genie and HMMgene. In a test set of 44 confirmed gene structures, Phat achieves nucleotide-level sensitivity and specificity of greater than 95%, performing as well as the other P. falciparum gene-finding programs Hexamer and GlimmerM. Because it is distributed with code for retraining it on new organisms, Phat is particularly useful for P. falciparum and other eukaryotes for which few gene-finding programs are available. Moreover, the full source code is freely available under the GNU General Public License, allowing users to develop and customize it further.

Introduction

Sequencing of the Plasmodium falciparum genome is proceeding apace. Two completely sequenced chromosomes have been published [1], [2] as well as the mitochondrion, and substantial amounts of the sequence of other chromosomes are already available [3], [4], [5], [6]. The two published chromosomes have been annotated extensively, in each case making use of a gene-finding program. GlimmerM [7], [8], a eukaryotic gene-finding program based on Glimmer [9], was used in the analysis of chromosome 2, while chromosome 3 was annotated with the help of Hexamer [10] and Genefinder [11]. Furthermore, chromosome 3 was revisited later with GlimmerM [12].

Before either of these chromosome sequences was published, there was no publicly available gene-finding program trained on P. falciparum sequence, whose base composition is known to differ enough from that of other organisms to preclude simply using an existing program. Since some of our colleagues wanted to search the sequence then available for genes, one of us wrote a gene-finding program [13]. This paper describes a descendant of that original program, which we call pretty handy annotation tool (Phat).

Broadly speaking, there are now four publicly available Plasmodium gene-finding programs: Genefinder, GlimmerM, Hexamer and Phat. They each differ somewhat in the way in which they seek to exploit sequence features to find genes, in their availability, and in the extent to which they can be re-trained on new data and used by people other than their authors. As well as introducing Phat, we compare and contrast it with the other programs.

The model

Phat models genomic DNA with a generalized hidden Markov model (GHMM), similar to existing GHMM gene models such as GENSCAN [14], Genie [15], [16] and HMMgene [17]. There is an underlying state space consisting of three main types of states: exons, introns and intergenic regions (Fig. 1). Introns are classified as phase 0, 1 or 2 according to the number of bases of the final codon generated in the previous exon (where previous means the last exon in the 5′ direction, on the coding strand). Exons
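The intron phase classification described above can be illustrated with a minimal Python sketch. It is not taken from the Phat source code: the function name `intron_phases` and the input convention (a list of coding exon lengths in 5′ to 3′ order, with the first exon starting at the start codon) are assumptions made for illustration only.

```python
def intron_phases(exon_lengths):
    """Return the phase (0, 1 or 2) of each intron in a gene.

    exon_lengths: coding exon lengths in 5'->3' order on the coding strand,
    with the first exon assumed to begin at the start codon (hypothetical
    input convention, not Phat's own data format).
    """
    phases = []
    coding_bases = 0
    # No intron follows the last exon, so it is excluded from the loop.
    for length in exon_lengths[:-1]:
        coding_bases += length
        # Bases of the final, incomplete codon produced so far.
        phases.append(coding_bases % 3)
    return phases


print(intron_phases([100, 50, 30]))  # prints [1, 0]
```

Here the first exon ends one base into a codon, so the following intron has phase 1. The second exon completes a codon exactly, giving a phase 0 intron.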

Results

We have conducted a study to compare the performance of Phat [21], [13] with that of other gene-finding programs on P. falciparum sequence. Currently the other main programs are Hexamer [10], Genefinder [11] and GlimmerM [7], [8]. Hexamer operates quite differently from the others, using only hexamer frequencies to predict individual coding regions. It does not attempt to detect exon boundaries, nor does it assemble its predicted coding regions into whole genes.

The remaining three programs all

Discussion

Both gene finders displayed relatively high sensitivity and specificity on both the training and test sets of genes. It is a little surprising that both gene finders performed better on the examples on which they had not been trained; perhaps the genes in the training set are in some sense more difficult to predict accurately. A reviewer with extensive experience in the field has found that GlimmerM tends consistently to under-annotate while Phat tends consistently to over-annotate. He also
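Nucleotide-level sensitivity and specificity can be computed as in the rough Python sketch below. The text shown here does not define the two measures, so the sketch assumes the convention common in the gene-finding literature: sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP), counted over coding nucleotides. The function names and the half-open `(start, end)` exon intervals are hypothetical, and the coordinates are made-up illustrative values, not data from the study.

```python
def coding_positions(exons):
    """Expand half-open (start, end) exon intervals into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(annotated_exons, predicted_exons):
    """Return (sensitivity, specificity) at the nucleotide level.

    Assumed definitions: sensitivity = TP / (TP + FN) and
    specificity = TP / (TP + FP), over coding nucleotides.
    """
    annotated = coding_positions(annotated_exons)
    predicted = coding_positions(predicted_exons)
    tp = len(annotated & predicted)  # coding bases predicted as coding
    fn = len(annotated - predicted)  # coding bases that were missed
    fp = len(predicted - annotated)  # non-coding bases predicted as coding
    sensitivity = tp / (tp + fn) if annotated else float("nan")
    specificity = tp / (tp + fp) if predicted else float("nan")
    return sensitivity, specificity


# Made-up coordinates for illustration only.
annotated = [(0, 100), (200, 300)]
predicted = [(10, 100), (200, 310)]
sn, sp = nucleotide_accuracy(annotated, predicted)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Under these assumed definitions, a program that over-annotates would be expected to lose specificity, while one that under-annotates would be expected to lose sensitivity.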

Acknowledgments

This work was made possible with advice and data from many others. The authors would like to thank Mauro De Lorenzi, Alan Cowman and Tony Triglia at WEHI, Mihaela Pertea and Steven Salzberg at TIGR, Allan Saul and Robert Heustis at QIMR, Sharen Bowman and Neil Hall at the Sanger Centre, Winston Hide and Ralhston Muller at SANBI and Jane Carlton at NCBI. S.C. was supported by DOE grant DE-FGO3-97ER62387.

Cited by (39)

  • Structure and Content of the Entamoeba histolytica Genome

    2007, Advances in Parasitology
    Citation Excerpt :

    Therefore, it was decided to undertake and publish an analysis of the genome draft following assembly of the shotgun reads. Annotation of the protein coding regions of the genome was initially carried out using two genefinders [GlimmerHMM (Majoros et al., 2004) and Phat (Cawley et al., 2001)] previously used successfully on another low G + C genome, that of P. falciparum. The software was re‐trained specifically for analysis of the E. histolytica genome.

  • Protozoan genomes: Gene identification and annotation

    2005, International Journal for Parasitology

  • Genomic organization and expression of 23 new genes from MATα locus of Cryptococcus neoformans var. gattii

    2004, Biochemical and Biophysical Research Communications
    Citation Excerpt :

    The DNA-to-protein alignments were performed using the DPS/NAP set of programs from the AAT tool package [20]. Additionally, gene sequences were predicted de novo using four gene-finding programs: GlimmerM ([21], http://www.tigr.org/software/glimmerm/), Phat ([22]), TWINSCAN ([23]; http://genes.cs.wustl.edu/), and Unveil ([21]; http://www.tigr.org/software/Unveil/index.shtml). All of the evidence collected above (alignments to cDNAs, ESTs, and proteins, and automated gene predictions) was then combined computationally using the Combiner program ([24]; http://www.tigr.org/software/combiner/).

  • Genome wide application of DNA melting analysis

    2009, Journal of Physics Condensed Matter