ABSTRACT
Biological systems are complex, and biological data are often available in different resolutions. Computational algorithms are often designed to work with data at one specific resolution. Hence, upsampling or downsampling is necessary before the data can be fed to the algorithm. Moreover, high-resolution data contain a significant amount of noise, which produces an explosion of redundant patterns such as maximal frequent itemsets, closed frequent itemsets and non-derivable itemsets. Downsampling the data can alleviate this problem, provided the information loss during sampling is insignificant. Furthermore, comparing the results of an algorithm on data at different resolutions can yield interesting findings and helps in determining a suitable resolution for both the data and the computational methods. In this paper, three methods of downsampling are proposed and implemented. Experiments are performed at different resolutions, the suitability of the proposed methods is validated, and the results are compared. Mixture models are trained on the data, and analysis of the results shows that the proposed methods produce plausible results, indicating that the significant patterns in the data are retained at lower resolution. The proposed methods can be used extensively in the integration of databases.
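To make the idea of downsampling concrete, the following is a minimal sketch, not the methods proposed in the paper, which the abstract does not specify. It assumes the data are 0-1 vectors over fine-resolution regions and that each fine region maps to exactly one coarse region. The function name `downsample`, the `parent` mapping, and the two aggregation rules ("or" and "majority") are illustrative assumptions.

```python
from typing import List, Sequence


def downsample(fine: Sequence[int], parent: Sequence[int],
               n_coarse: int, rule: str = "or") -> List[int]:
    """Aggregate a 0/1 vector at fine resolution into n_coarse regions.

    parent[i] is the index of the coarse region containing fine region i
    (a hypothetical mapping; real data would take it from a nomenclature).
    """
    # Collect the fine-resolution values that fall under each coarse region.
    groups: List[List[int]] = [[] for _ in range(n_coarse)]
    for value, p in zip(fine, parent):
        groups[p].append(value)

    coarse = []
    for g in groups:
        if not g:
            coarse.append(0)                      # no fine regions: treat as 0
        elif rule == "or":
            coarse.append(int(any(g)))            # 1 if any child is 1
        elif rule == "majority":
            coarse.append(int(2 * sum(g) > len(g)))  # 1 if more than half are 1
        else:
            raise ValueError(f"unknown rule: {rule}")
    return coarse


fine = [0, 1, 0, 0, 0, 1, 1, 1]      # eight fine-resolution regions
parent = [0, 0, 0, 1, 1, 2, 2, 2]    # mapped onto three coarse regions
print(downsample(fine, parent, 3, "or"))        # [1, 0, 1]
print(downsample(fine, parent, 3, "majority"))  # [0, 0, 1]
```

The two rules differ in how much isolated fine-resolution signal they keep: "or" preserves every 1, while "majority" suppresses a lone 1 as possible noise. Which behaviour loses less information would have to be checked on the data at hand.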