Unsupervised pattern recognition of mixed data structures with numerical and categorical features using a mixture regression modelling framework
Introduction
Multivariate mixture models, especially normal mixtures, have been widely used in statistical pattern recognition as an unsupervised model-based tool to cluster data in a wide variety of scientific fields, including bioinformatics, biostatistics, health, medical imaging, and medicine, among many others [1], [2], [3], [4], [5]. Here, unsupervised pattern recognition refers to situations in which there is no prior knowledge of the group structure of the underlying population. In the present era of Big Data, massive data collection in these fields has become a fact of 21st-century science. For example, big data in population health research is now commonplace, due to increasing information-rich government-based and sophisticated data management and data linking technologies [6]. There has been an increasing number of large-scale national surveys conducted worldwide to study important population health problems. Big data collected in these studies often exhibit a mix of numerical and categorical feature variables (referred to as mixed or hybrid-attribute data in the pattern recognition literature [7], [8]), posing serious challenges to the direct application of conventional unsupervised mixture model-based clustering methods. When the data are of mixed mode, the common multivariate distributions adopted in mixture modelling become invalid. To this end, a location modelling approach has been considered in both supervised and unsupervised classification of mixed feature variables [9], [10], [11], where the features are split into two groups of numerical and categorical variables, which enter the mixture model via a conditional density structure. However, this approach becomes intractable when the categorical feature variables have a large number of distinct patterns (“locations”).
This problem is particularly apparent in applications where it is implausible to split the mixed features into numerical and categorical variables; for example, when the outcome feature variables are in categorical form and only the risk features are of mixed mode. An example of such a problem is the analysis of national morbidity data in an attempt to quantify heterogeneous comorbidity patterns of health conditions among individuals and to identify the characteristic features of individuals who are at risk of poorer health outcomes.
Comorbidity is recognised as a serious burden and challenge on the healthcare system in many countries, closely associated with increases in hospitalisation, psychological distress, use of health services, mortality and associated cost, but a decrease in productivity [12], [13], [14], [15]. The synergistic nature of comorbidity has a greater impact than is suggested by single disease prevalence on morbidity burden for individuals and societies. Advanced knowledge to address increasingly complex health needs related to comorbidity will be valuable to inform interventions that will be most effective and viable and improve the health and wellbeing of individuals in the community.
In recent years, pattern recognition techniques have been adopted to reveal heterogeneous comorbidity patterns among individuals, including the use of hierarchical clustering methods [16], [17], latent class analysis [18], [19], mixture models [20], principal component methods [21], and self-organising maps [22]. However, these methods focus on identifying clusters of individuals with different comorbidity patterns. Exploration of an individual’s characteristics (such as demographics, socioeconomic status, and lifestyle factors) that differentiate various comorbidity patterns is often performed via regression models in a post-hoc approach. Ignoring the misclassification errors in clustering of individuals may induce serious biases in a subsequent regression analysis. In this paper, we extend the work of Ng [20], which used a post-hoc approach described above (see also [18], [19]), by proposing an unsupervised mixture regression model of multivariate generalised Bernoulli distributions to simultaneously cluster individuals into groups of different comorbidity patterns and identify relevant risk features that explain the heterogeneity in the comorbidity patterns. An ultimate goal is to investigate the impact of comorbidity patterns on individual health outcomes and service utilisation. The findings about the nature and patterns of comorbidity as well as their synergistic effects on health outcomes will contribute to the evidence base for improved prevention, treatment and care management for individuals with multiple conditions.
The paper is organised as follows. Section 2 introduces how categorical outcome feature variables are formed on the basis of binary morbidity data for comorbidity analysis. In Section 3, we present the theory of a mixture regression modelling framework for unsupervised clustering of individuals to quantify heterogeneous comorbidity patterns, and describe the EM algorithm for the iterative computation of the maximum likelihood estimates of the model parameters. In particular, we show that the mixture regression modelling framework has desirable statistical properties with regard to its flexibility in handling mixed feature variables. Section 4 demonstrates the application of the proposed method to two real data sets concerning comorbidity patterns among 20,788 Australians who participated in the 2007–2008 Australian National Health Survey (NHS) and among 470 patients who were recruited to undergo in-patient detoxification from alcohol, heroin or cocaine. In Section 5, we present simulation studies to assess the performance of the unsupervised mixture regression model under finite samples. Section 6 ends the paper with concluding remarks and discussion.
Formation of categorical outcome variables via cluster analysis
Let n be the number of individuals and m the number of health conditions. We let the vector containing the features that are considered as outcome (or response) variables, where the superscript T denotes vector transpose. And there are q risk feature variables that are associated with the jth individual taking either numerical or categorical forms. In comorbidity analysis, are one or zero, indicating the presence or absence of the ith
Theory of a mixture regression modelling framework for revealing heterogeneous comorbidity patterns
Here we cluster individuals on the basis of the outcome feature vectors and the risk feature vectors for . With the mixture model-based approach, the observed p-dimensional outcome feature vectors are assumed to have come from a mixture of a finite number, say g, of components, where each feature vector is taken to be a realisation of the mixture probability density function defined byfor where the mixing
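As a rough illustration of the EM iterations behind a mixture of multivariate generalised Bernoulli (categorical) distributions, the following is a minimal sketch only. It assumes that the p outcome features are independent within each component and it omits the regression on the risk features that the paper's model includes, so it is not the authors' method. The function name `em_categorical_mixture`, its arguments, and the demo data are all hypothetical.

```python
import numpy as np

def em_categorical_mixture(Y, n_levels, g, n_iter=200, seed=0, eps=1e-10):
    """EM for a g-component mixture of independent categorical distributions.

    Y        : (n, p) integer array with Y[j, i] in {0, ..., n_levels[i] - 1}
    n_levels : number of distinct labels of each of the p outcome features
    g        : number of mixture components
    Assumption: features are independent within a component; no covariates.
    """
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    pi = np.full(g, 1.0 / g)  # mixing proportions
    # theta[i][h, v] = P(feature i takes label v | component h), random start
    theta = [rng.dirichlet(np.ones(k), size=g) for k in n_levels]
    for _ in range(n_iter):
        # E-step: posterior probabilities of component membership (log scale)
        log_w = np.tile(np.log(pi + eps), (n, 1))           # shape (n, g)
        for i in range(p):
            log_w += np.log(theta[i][:, Y[:, i]].T + eps)   # shape (n, g)
        log_w -= log_w.max(axis=1, keepdims=True)           # numerical stability
        tau = np.exp(log_w)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: weighted relative frequencies
        pi = tau.mean(axis=0)
        for i, k in enumerate(n_levels):
            onehot = np.eye(k)[Y[:, i]]                     # shape (n, k)
            counts = tau.T @ onehot + eps                   # shape (g, k)
            theta[i] = counts / counts.sum(axis=1, keepdims=True)
    # tau holds the posterior probabilities from the final E-step
    return pi, theta, tau

# Hypothetical demo data: six categorical features with (3, 3, 2, 2, 2, 2) labels
demo_levels = (3, 3, 2, 2, 2, 2)
demo_rng = np.random.default_rng(1)
Y_demo = np.column_stack([demo_rng.integers(0, k, size=200) for k in demo_levels])
pi_hat, theta_hat, tau_hat = em_categorical_mixture(Y_demo, demo_levels, g=2)
```

Each individual can then be assigned to the component with the largest posterior probability, for example with `tau_hat.argmax(axis=1)`. Extending the sketch so that the component probabilities depend on risk features would require a regression step inside the M-step, which is not shown here.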
Examples: Australian NHS data and Boston HELP RCT data
The 2007–2008 Australian NHS was conducted by the Australian Bureau of Statistics (ABS) from July 2007 to June 2008 [33], collecting information about the prevalence of current long-term conditions (which were defined as medical conditions that were current at the time of the survey and that had lasted or were expected to last for at least six months) from 20,788 Australians. The NHS data in Confidentialised Unit Record Files (CURFs) are available on the ABS website [34] at //www.abs.gov.au/ausstats/abs@.nsf/mf/4324.0
Simulation study
We assessed the performance of the mixture regression model of multivariate generalised Bernoulli distributions (3) via statistical pattern recognition of simulated morbidity data. We assumed a setting of condition groups, G1 to G6, from individuals, with components corresponding to low and high levels of comorbidity, respectively. The number of distinct labels for the comorbidity groups was assumed to be (3,3,2,2,2,2), respectively. It means that for the first two comorbidity
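A simulation setting of this kind can be sketched as follows. The label counts (3, 3, 2, 2, 2, 2) for G1 to G6 and the two components (low and high comorbidity) follow the description above. The sample size, mixing proportions, and label probabilities are hypothetical placeholders, not the values used in the paper, and the function name `simulate_morbidity` is an assumption for illustration.

```python
import numpy as np

def simulate_morbidity(n, pi, probs, seed=0):
    """Draw categorical outcome features from a finite mixture.

    n     : number of individuals (hypothetical value chosen by the caller)
    pi    : mixing proportions of the components
    probs : list with one (g, k_i) array per condition group; row h holds the
            label probabilities of that group in component h
    """
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)          # latent component labels
    Y = np.empty((n, len(probs)), dtype=int)
    for i, P in enumerate(probs):
        for h in range(len(pi)):
            idx = np.flatnonzero(z == h)           # individuals in component h
            Y[idx, i] = rng.choice(P.shape[1], size=idx.size, p=P[h])
    return Y, z

# Numbers of distinct labels for condition groups G1 to G6
n_levels = (3, 3, 2, 2, 2, 2)
# Hypothetical label probabilities: component 0 = low, component 1 = high comorbidity
low = [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
high = [[0.2, 0.3, 0.5], [0.2, 0.3, 0.5], [0.4, 0.6], [0.4, 0.6], [0.4, 0.6], [0.4, 0.6]]
probs = [np.array([p_low, p_high]) for p_low, p_high in zip(low, high)]

Y_sim, z_sim = simulate_morbidity(n=500, pi=[0.6, 0.4], probs=probs)
```

Keeping the true labels `z_sim` alongside the simulated data allows the clustering recovered by a fitted model to be compared with the truth, for instance through a misclassification rate computed after matching component labels.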
Discussion
We have developed an unsupervised mixture regression model of multivariate generalised Bernoulli distributions to handle the problems in statistical pattern recognition, where clustering of individuals is performed on the basis of categorical outcome features and mixed risk features. In contrast to post-hoc approaches, this new method simultaneously clusters individuals into groups according to individual patterns in the categorical outcome features and identifies significant risk features that
Acknowledgments
The authors wish to thank the Editor, an Associate Editor, and three reviewers for helpful comments on the paper. This work was supported by the Australian Research Council (Grant number DP170100907). The authors have no competing interests to declare.
References (42)
- et al., Speeding up the EM algorithm for mixture model-based segmentation of magnetic resonance images, Pattern Recognit. (2004)
- et al., Extension of mixture-of-experts networks for binary classification of hierarchical data, Artif. Intell. Med. (2007)
- et al., Determining the number of clusters using information entropy for mixed data, Pattern Recognit. (2012)
- et al., Hybrid attribute reduction based on a novel fuzzy-rough model and information granulation, Pattern Recognit. (2007)
- et al., Clusters of multiple complex chronic conditions: a latent class analysis of children at end of life, J. Pain Symptom Manag. (2016)
- A two-way clustering framework to identify disparities in multimorbidity patterns of mental and physical health conditions among Australians, Stat. Med. (2015)
- et al., Defining comorbidity: implications for understanding health and health services, Ann. Fam. Med. (2009)
- et al., Mixture models for clustering multilevel growth trajectories, Comput. Stat. Data Anal. (2014)
- et al., The EM Algorithm and Extensions (2008)
- et al., Finite Mixture Models (2000)
- Modelling the distribution of ischaemic stroke-specific survival time using an EM-based mixture approach with random effects adjustment, Stat. Med.
- Inference on differences between classes using cluster-specific contrasts of mixed effects, Biostatistics
- Finding group structures in “big data” in healthcare research using mixture models
- Mixture separation for mixed-mode data, Stat. Comput.
- Mixture model clustering: a brief introduction to the MULTIMIX program, Aust. NZ. J. Stat.
- Expert networks with mixed continuous and categorical feature variables: a location modeling approach
- Multimorbidity research challenges: where to go from here?, J. Comorbidity
- Patterns of multimorbidity in working Australians, Popul. Health Metr.
- Identifying comorbidity patterns of health conditions via cluster analysis of pairwise concordance statistics, Stat. Med.
- Patterns of comorbidity and the use of health services in the Dutch population, Eur. J. Public Health
- Deconstructing complex multimorbidity in the very old: findings from the Newcastle 85+ study, BioMed Res. Int.
Shu-Kay Ng received the B.Sc. degree (Hons.) in civil engineering from the University of Hong Kong, Hong Kong, in 1986, and the Ph.D. degree in statistics from the University of Queensland, Brisbane, Australia, in 1999. He was awarded an Australian Research Council (ARC) Australian Postdoctoral Fellowship in 2003. He joined the School of Medicine, Griffith University, in 2007. Professor Ng has engaged in many multidisciplinary research projects, Government and consultancy research contracts. His research interests lie in the fields of cluster analysis, pattern recognition, image segmentation, neural networks, random-effects modelling, survival analysis, longitudinal analysis, and biostatistics, with particular focus being given to the theory and applications of mixture models, as well as computational statistics concerning the development of the EM algorithm for estimation of mixture models. Professor Ng has authored over 100 research publications. He is an Associate Editor of Journal of Statistical Computation and Simulation and has served in the Program Technical Panel for IEEE International Conference on Bioinformatics and Biomedicine BIBM 2017 and 2018.
Richard Tawiah received the B.Sc. degree (Hons.) in Mathematical Science from the University for Development Studies and M.Phil. in Applied Mathematics from Kwame Nkrumah University of Science and Technology (KNUST), both in Ghana. Currently he is pursuing a Ph.D. degree in statistics at Griffith University in Australia. His research interests are frailty modelling of recurrent event data, cure models, comorbidity, survival analysis, and longitudinal analysis. From 2014 to 2016, he worked as part of a research team working in collaboration with DANIDA on the KNUST chapter of the Building Stronger Universities (BSU) project.
Geoffrey John McLachlan received the B.Sc. (Hons.) and Ph.D. degrees from the University of Queensland in 1969 and 1973, respectively. Since 1975 he has been a faculty member of the Department of Mathematics of the University of Queensland. In 1994, he was awarded a D.Sc. degree by the University of Queensland on the basis of his publications in the scientific literature. Since 2002, he has had a joint appointment with the Institute for Molecular Bioscience and he is a chief investigator of the Australian Research Council Centre of Excellence in Biomathematics. In 2007, he was awarded an Australian Professorial Fellowship. Professor McLachlan is a fellow of the American Statistical Association, the Royal Statistical Society, and the Australian Mathematical Society. His research interests have been concentrated in the related fields of classification, cluster and discriminant analyses, image analysis, machine learning, neural networks, pattern recognition, and data mining, and in the field of statistical inference. More recently, he has become actively involved in the field of bioinformatics with the focus on the statistical analysis of microarray gene-expression data. In these fields, he has published over 190 research articles, including six monographs. The last five monographs, which are volumes in the Wiley Series in Probability and Statistics, are on the topics of discriminant analysis, the EM algorithm (including a second edition), finite mixture models, and the analysis of microarray data. Professor McLachlan is on the editorial board of several international journals and has served on the program committee for many international conferences. He is a member of the College of Experts of the Australian Research Council and is President-elect of the International Federation of Classification Societies.