Assessing the Quality of the DNA Sequence from The Human Genome Project

Adam Felsenfeld, Jane Peterson, Jeffery Schloss, and Mark Guyer

National Human Genome Research Institute (NHGRI), National Institutes of Health, Bethesda, Maryland 20892-6050 USA

It is sometimes hard to remember that the first DNA sequence of the entire genome of a free-living organism, Hemophilus influenzae, was reported <4 years ago (Fleischmann et al. 1995). Since then, the genomes of >17 other prokaryotes (http://linkage.rockefeller.edu/wli/seq/), a unicellular eukaryote, Saccharomyces cerevisiae (Nature 1996), and a multicellular organism, Caenorhabditis elegans (The C. elegans Sequencing Consortium 1998), have been completely sequenced. Progress toward determination of the human DNA sequence has also become more rapid; at the time of this writing, the public databases contain 227.2 Mb of nonredundant, finished sequence available in contigs of >30 kb (and another 152.7 Mb of unfinished sequence) (http://www.ncbi.nlm.nih.gov/genome/seq/weekly_report.html). In comparison, there was 84.4 Mb of finished data (http://www.ebi.ac.uk/∼sterk/genome-MOT/) in February 1998. It is increasingly likely that the human sequence will be complete by 2003, and a working draft will be in hand even sooner (Collins et al. 1998; Venter et al. 1998). One consequence of our increased sequencing capacity is that within the next couple of years, we expect the rate of deposition of sequence data to increase from the current ∼3 Mb per week, to an average of well over 10 Mb per week worldwide. Very few scientific fields can measure progress as easily as can be done for large-scale genomic sequencing, quantifiable as it is into base pairs per unit time.

However, mere numbers can be deceptive—the essential “production” nature of large-scale genomic sequencing leaves it susceptible to errors in ways other scientific endeavors are not. Because of the rapid accumulation of human genomic sequence data, there is little opportunity for, or even possibility of, direct peer review of data prior to publication. The major venue for primary publication of genomic data is not the peer-reviewed literature at all, but public databases. This is appropriate: Current peer-reviewed biological …
