Abstract
Viral subtyping is the process of classifying a virus genome into a subtype within its family, and it plays a major role in the appropriate diagnosis and treatment of illness. Researchers typically perform viral subtyping with alignment-based methods. However, these methods are slow and require disclosing the queried sample genome, which compromises its privacy. For that reason, methods have emerged that use machine learning models to take the viral sample genome and predict its subtype. The performance of these models depends on the feature vector computed, and the most notable methods use k-mer frequencies as features. In this study, we compared the two most relevant methods based on k-mer frequency, Kameris and Castor-KRFE, on a dataset of Polyomavirus and HIV-1 genomes. Both methods give the same results when their dimensionality reduction and feature elimination steps are skipped; when these steps are applied, Kameris slightly outperforms Castor-KRFE. In addition, Castor-KRFE can obtain a small feature vector for k-mer lengths \(k>5\).
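To make the notion of k-mer frequency features concrete, the following is a minimal sketch of how such a feature vector could be computed from a DNA sequence. It is not the Kameris or Castor-KRFE implementation; the function name `kmer_frequencies`, the choice to skip windows containing symbols outside ACGT, and the example sequence are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, one per k-mer over the alphabet ACGT,
    in lexicographic order. Windows containing any other symbol
    (for example N) are skipped; this is an illustrative assumption.
    """
    alphabet = "ACGT"
    allowed = set(alphabet)
    sequence = sequence.upper()

    # Count every overlapping window of length k that uses only A, C, G, T.
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= allowed
    )
    total = sum(counts.values())

    # Fixed lexicographic ordering so that vectors are comparable across genomes.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Hypothetical toy sequence; a real genome would be read from a FASTA file.
features = kmer_frequencies("ACGTACGTNA", k=2)
```

The vector length grows as \(4^k\) (1024 entries for \(k=5\)), which is why dimensionality reduction or feature elimination becomes relevant for larger \(k\).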
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Arceda, V.E.M. (2021). An Analysis of k-Mer Frequency Features with Machine Learning Models for Viral Subtyping of Polyomavirus and HIV-1 Genomes. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1. FTC 2020. Advances in Intelligent Systems and Computing, vol 1288. Springer, Cham. https://doi.org/10.1007/978-3-030-63128-4_21
DOI: https://doi.org/10.1007/978-3-030-63128-4_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63127-7
Online ISBN: 978-3-030-63128-4
eBook Packages: Intelligent Technologies and Robotics (R0)