Abstract
Intrinsically unstructured or disordered proteins are proteins that lack a fixed 3-D structure globally or contain long disordered regions. The prediction of disordered regions has attracted significant research attention in recent years. In developing a decision tree based disordered region predictor, we observed that many previous predictors that use the compositions of all 20 amino acids as training parameters tend to overfit the data. In this paper we propose to alleviate overfitting in the prediction of intrinsically unstructured proteins by reducing the number of input parameters. We also compare this approach with the random forest model. Our experiments suggest that reducing the 20 amino acid compositions to 4 groups according to amino acid properties can reduce overfitting in the decision tree model. Alternatively, ensemble-learning techniques such as random forest are inherently more tolerant to this kind of overfitting and can be promising candidates for disordered region prediction.
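The input reduction described above can be illustrated with a minimal Python sketch that collapses the 20 amino acid compositions of a sequence window into 4 group compositions. The abstract does not state which four property groups were used, so the `GROUPS` mapping below (hydrophobic, polar, positive, negative) is an assumption chosen only for illustration, and the function names are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: an illustrative grouping by physicochemical property.
# The abstract does not list the four groups actually used in the paper.
GROUPS = {
    "hydrophobic": "AVLIMFWPG",
    "polar": "STCYNQ",
    "positive": "KRH",
    "negative": "DE",
}


def composition(seq):
    """Return the fraction of each of the 20 amino acids in seq."""
    counts = Counter(seq.upper())
    # Ignore any non-standard symbols when normalising.
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return {a: 0.0 for a in AMINO_ACIDS}
    return {a: counts[a] / total for a in AMINO_ACIDS}


def grouped_composition(seq):
    """Collapse the 20 composition features into one feature per group."""
    comp = composition(seq)
    return {
        name: sum(comp[a] for a in members)
        for name, members in GROUPS.items()
    }


if __name__ == "__main__":
    window = "MKDEESLLKRSA"  # hypothetical sequence window
    print(grouped_composition(window))
```

A classifier trained on the 4 grouped features has far fewer inputs to fit than one trained on all 20 compositions, which is the mechanism by which the reduction is expected to limit overfitting.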
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Han, P., Zhang, X., Norton, R.S., Feng, Z. (2007). Reducing Overfitting in Predicting Intrinsically Unstructured Proteins. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_53
DOI: https://doi.org/10.1007/978-3-540-71701-0_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71700-3
Online ISBN: 978-3-540-71701-0
eBook Packages: Computer Science (R0)