RePS: A Sequence Assembler That Masks Exact Repeats Identified from the Shotgun Data

  1. Jun Wang1,2,3,5,
  2. Gane Ka-Shu Wong1,2,4,5,
  3. Peixiang Ni2,
  4. Yujun Han2,
  5. Xiangang Huang2,
  6. Jianguo Zhang2,
  7. Chen Ye2,
  8. Yong Zhang2,3,
  9. Jianfei Hu2,3,
  10. Kunlin Zhang2,3,
  11. Xin Xu1,
  12. Lijuan Cong1,
  13. Hong Lu1,
  14. Xide Ren1,
  15. Xiaoyu Ren1,
  16. Jun He1,
  17. Lin Tao1,2,
  18. Douglas A. Passey4,
  19. Jian Wang1,2,
  20. Huanming Yang1,2,
  21. Jun Yu1,2,4, and
  22. Songgang Li2,3
  1 Hangzhou Genomics Institute, Institute of Bioinformatics of Zhejiang University, Key Laboratory of Bioinformatics of Zhejiang Province, Hangzhou 310007, China
  2 Beijing Genomic Institute, Center of Genomics and Bioinformatics, Chinese Academy of Sciences, Beijing, 101300, China
  3 College of Life Sciences, Peking University, Beijing, 100871, China
  4 University of Washington Genome Center, Department of Medicine, Fluke Hall, M/C 352145, Seattle, Washington 98195, USA

Abstract

We describe a sequence assembler, RePS (repeat-masked Phrap with scaffolding), that explicitly identifies exact 20mer repeats from the shotgun data and removes them prior to the assembly. The established software Phrap is used to compute meaningful error probabilities for each base. Clone-end-pairing information is used to construct scaffolds that order and orient the contigs. We show with real data for human and rice that reasonable assemblies are possible even at coverages of only 4× to 6×, despite having up to 42.2% in exact repeats.
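The repeat-masking step can be illustrated with a minimal sketch. This is not the RePS implementation. It only shows the general idea of counting exact k-mers across all shotgun reads and masking those that occur too often. The function names, the `max_count` threshold, and the mask character `X` are assumptions made for illustration, and the sketch ignores reverse-complement strands and base quality values.

```python
from collections import Counter

K = 20  # k-mer length named in the abstract


def count_kmers(reads, k=K):
    """Count every exact k-mer across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def mask_repeats(read, counts, max_count, k=K, mask_char="X"):
    """Replace bases covered by any k-mer seen more than max_count times.

    max_count is a hypothetical cutoff; in practice it would be chosen
    relative to the expected shotgun coverage.
    """
    masked = list(read)
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] > max_count:
            masked[i:i + k] = mask_char * k
    return "".join(masked)
```

In a pipeline of this shape, the masked reads would then be passed to the assembler, so that overlaps are seeded only from sequence that is not flagged as repetitive.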

[The following individuals kindly provided reagents, samples, or unpublished information as indicated in the paper: P. Green and A.F. Smit.]

Footnotes

  • 5 Corresponding authors.

  • E-MAIL wangj{at}genomics.org.cn; FAX 0086-10-80498676.

  • E-MAIL gksw{at}u.washington.edu; FAX (206) 685-7344.

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.165102.

    • Received February 6, 2001.
    • Accepted March 19, 2002.