Elsevier

Pattern Recognition

Volume 88, April 2019, Pages 261-271

Unsupervised pattern recognition of mixed data structures with numerical and categorical features using a mixture regression modelling framework

https://doi.org/10.1016/j.patcog.2018.11.022

Highlights

  • Cluster analysis of mixed-feature data imposes challenges in mixture modelling.

  • Comorbid-condition groups inform potential shared biologic processes among diseases.

  • Individuals with heterogeneous comorbidity patterns show different risk features.

  • Regression models improve clustering results by adjustment of relevant risk factors.

  • This method is applicable for more general mixed data, via consensus clustering.

Abstract

In the present era of “Big Data”, data collection involving a massive number of features with a mix of variable types is commonplace. Mixture model-based techniques for statistical cluster analysis of mixed numerical and categorical feature data have their limitations, owing to the difficulty of specifying appropriate component densities when common multivariate distributions become invalid. This problem is particularly apparent in applications where the outcome feature variables are categorical. An example of such an application is the analysis of binary morbidity data from a national health survey, where the aims are to quantify heterogeneous comorbidity patterns of health conditions and to identify the risk features of individuals that explain the heterogeneity. In this paper, we propose an unsupervised mixture regression model of multivariate generalised Bernoulli distributions for cluster analysis on the basis of categorical outcome features and mixed risk features. The proposed method is illustrated using simulated data and two real data sets concerning comorbidity patterns among 20,788 Australians who participated in the 2007–2008 National Health Survey (NHS) and among 470 patients who were recruited into a randomised controlled trial of a health intervention concerning in-patient detoxification from alcohol, heroin or cocaine in Boston. The method is also readily applicable to clustering more general mixed-feature data via the framework of consensus clustering.

Introduction

Multivariate mixture models, especially normal mixtures, have been widely used in statistical pattern recognition as an unsupervised model-based tool to cluster data in a wide variety of scientific fields, including bioinformatics, biostatistics, health, medical imaging, and medicine, among many others [1], [2], [3], [4], [5]. Here, unsupervised pattern recognition refers to situations in which there is no prior knowledge of the group structure of the underlying population. In the present era of Big Data, massive data collection in these fields has become a fact of 21st-century science. For example, big data in population health research is now commonplace, owing to increasingly information-rich government-based data and sophisticated data management and data-linking technologies [6]. An increasing number of large-scale national surveys have been conducted worldwide to study important population health problems. Big data collected in these studies often exhibit a mix of numerical and categorical feature variables (referred to as mixed or hybrid-attribute data in the pattern recognition literature [7], [8]), which poses serious challenges to the direct application of conventional unsupervised mixture model-based clustering methods. When the data are mixed mode, common multivariate distributions adopted in mixture modelling become invalid. To address this, a location modelling approach has been considered in both supervised and unsupervised classification of mixed feature variables [9], [10], [11], where the features are split into two groups of numerical and categorical variables that enter the mixture model via a conditional density structure. However, this approach becomes intractable when the categorical feature variables have a large number of distinct patterns (“locations”). This problem is particularly apparent in applications where it is implausible to split the mixed features into numerical and categorical variables; for example, when the outcome feature variables are categorical and only the risk features are in mixed mode. An example of such a problem is the analysis of national morbidity data in an attempt to quantify heterogeneous comorbidity patterns of health conditions among individuals and to identify the characteristic features of individuals who are at risk of poorer health outcomes.

Comorbidity is recognised as a serious burden on, and challenge for, the healthcare system in many countries. It is closely associated with increases in hospitalisation, psychological distress, use of health services, mortality and associated cost, and with a decrease in productivity [12], [13], [14], [15]. Because of its synergistic nature, comorbidity has a greater impact on morbidity burden for individuals and societies than single-disease prevalence suggests. Advanced knowledge that addresses the increasingly complex health needs related to comorbidity will help to identify the most effective and viable interventions and to improve the health and wellbeing of individuals in the community.

In recent years, pattern recognition techniques have been adopted to reveal heterogeneous comorbidity patterns among individuals, including hierarchical clustering methods [16], [17], latent class analysis [18], [19], mixture models [20], principal component methods [21], and self-organising maps [22]. However, these methods focus only on identifying clusters of individuals with different comorbidity patterns. Exploration of the individual characteristics (such as demographics, socioeconomic status, and lifestyle factors) that differentiate the various comorbidity patterns is often performed afterwards via regression models, in a post-hoc approach. Ignoring the misclassification errors in the clustering of individuals may induce serious biases in the subsequent regression analysis. In this paper, we extend the work of Ng [20], which used the post-hoc approach described above (see also [18], [19]), by proposing an unsupervised mixture regression model of multivariate generalised Bernoulli distributions that simultaneously clusters individuals into groups with different comorbidity patterns and identifies the relevant risk features that explain the heterogeneity in those patterns. The ultimate goal is to investigate the impact of comorbidity patterns on individual health outcomes and service utilisation. Findings about the nature and patterns of comorbidity, as well as their synergistic effects on health outcomes, will contribute to the evidence base for improved prevention, treatment and care management for individuals with multiple conditions.

The paper is organised as follows. Section 2 introduces how categorical outcome feature variables are formed on the basis of binary morbidity data for comorbidity analysis. In Section 3, we present the theory of a mixture regression modelling framework for unsupervised clustering of individuals to quantify heterogeneous comorbidity patterns and describe the EM algorithm for the iterative computation of the maximum likelihood estimates of the model parameters. In particular, we show that the mixture regression modelling framework has desirable statistical properties with regard to its flexibility in handling mixed feature variables. Section 4 demonstrates the application of the proposed method to two real data sets concerning comorbidity patterns among 20,788 Australians who participated in the 2007–2008 NHS and among 470 patients who were recruited to undergo in-patient detoxification from alcohol, heroin or cocaine. In Section 5, we present simulation studies to assess the performance of the unsupervised mixture regression model under finite samples. Section 6 ends the paper with concluding remarks and discussion.

Section snippets

Formation of categorical outcome variables via cluster analysis

Let $n$ be the number of individuals and $m$ the number of health conditions. We let $y_j = (y_{1j}, \ldots, y_{mj})^T$ $(j = 1, \ldots, n)$ be the vector containing the features that are considered as outcome (or response) variables, where the superscript $T$ denotes vector transpose. There are also $q$ risk feature variables $x_{1j}, \ldots, x_{qj}$ associated with the $j$th individual $(j = 1, \ldots, n)$, taking either numerical or categorical forms. In comorbidity analysis, $y_{ij}$ $(i = 1, \ldots, m)$ are one or zero, indicating the presence or absence of the $i$th

Theory of a mixture regression modelling framework for revealing heterogeneous comorbidity patterns

Here we cluster individuals on the basis of the outcome feature vectors $(y_{1j}, \ldots, y_{pj})^T$ and the risk feature vectors $(x_{1j}, \ldots, x_{qj})$ for $j = 1, \ldots, n$. With the mixture model-based approach, the observed $p$-dimensional outcome feature vectors $y_j$ $(j = 1, \ldots, n)$ are assumed to have come from a mixture of a finite number, say $g$, of components, where each feature vector $y_j$ is taken to be a realisation of the mixture probability density function defined by
$$f(y_j, x_j; \Psi) = \sum_{h=1}^{g} \pi_h(x_j; \alpha) \, f_h(y_j; \theta_h)$$
for $j = 1, \ldots, n$, where the mixing
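As a rough illustration of how the mixture density above might be evaluated, the following Python sketch computes the mixing proportions, the component densities, and the resulting posterior probabilities of component membership. It rests on assumptions that are not stated in this excerpt: the mixing proportions are given a multinomial-logit form in the risk features, and the p categorical outcomes are treated as independent within each component. The function names and all numerical values are hypothetical.

```python
import numpy as np


def mixing_proportions(x, alpha):
    """pi_h(x_j; alpha) under an ASSUMED multinomial-logit form.

    x: (n, q) risk features; alpha: (g, q + 1) coefficients, intercept first.
    """
    design = np.column_stack([np.ones(len(x)), x])   # (n, q + 1)
    eta = design @ alpha.T                           # (n, g) linear predictors
    eta -= eta.max(axis=1, keepdims=True)            # stabilise the softmax
    w = np.exp(eta)
    return w / w.sum(axis=1, keepdims=True)


def component_density(y, theta_h):
    """f_h(y_j; theta_h), ASSUMING independent categorical outcomes.

    y: (n, p) integer labels; theta_h: list of p probability vectors.
    """
    dens = np.ones(len(y))
    for i, probs in enumerate(theta_h):
        dens *= np.asarray(probs)[y[:, i]]
    return dens


def posterior_probabilities(y, x, alpha, theta):
    """Return (tau, mixture): posterior memberships and f(y_j, x_j; Psi)."""
    pi = mixing_proportions(x, alpha)
    f = np.column_stack([component_density(y, th) for th in theta])
    joint = pi * f                                   # pi_h * f_h, shape (n, g)
    mixture = joint.sum(axis=1)                      # sum over h = 1, ..., g
    return joint / mixture[:, None], mixture


# Hypothetical toy example: g = 2 components, p = 2 outcomes, q = 1 risk feature.
y = np.array([[0, 0], [1, 1]])
x = np.array([[0.5], [-1.0]])
alpha = np.zeros((2, 2))                             # equal mixing proportions
theta = [
    [[0.8, 0.2], [0.7, 0.3]],                        # component 1
    [[0.3, 0.7], [0.4, 0.6]],                        # component 2
]
tau, mixture = posterior_probabilities(y, x, alpha, theta)
```

Within an EM algorithm, the posterior probabilities `tau` would play the role of the E-step quantities; the M-step updates are not sketched here.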

Examples: Australian NHS data and Boston HELP RCT data

The 2007–2008 Australian NHS was conducted by the Australian Bureau of Statistics (ABS) from July 2007 to June 2008 [33], collecting information about the prevalence of current long-term conditions (defined as medical conditions that were current at the time of the survey and that had lasted, or were expected to last, for at least six months) from 20,788 Australians. The NHS data in Confidentialised Unit Record Files (CURFs) are available on the ABS website [34] at //www.abs.gov.au/ausstats/[email protected]/mf/4324.0

Simulation study

We assessed the performance of the mixture regression model of multivariate generalised Bernoulli distributions (3) via statistical pattern recognition of simulated morbidity data. We assumed a setting of p=6 condition groups, G1 to G6, from n=500 individuals, with g=2 components corresponding to low and high levels of comorbidity, respectively. The number of distinct labels for the p=6 comorbidity groups was assumed to be (3,3,2,2,2,2), respectively. It means that for the first two comorbidity
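A data set with the dimensions described above could be generated along the following lines. This is only a sketch: the excerpt gives n, p, g and the numbers of labels, but not the parameter values, so the single numerical risk feature, the logistic mixing coefficients, and the label probabilities below are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Settings taken from the text.
n, p, g = 500, 6, 2
n_labels = (3, 3, 2, 2, 2, 2)        # distinct labels for groups G1 to G6

# HYPOTHETICAL: one numerical risk feature and logistic mixing proportions.
x = rng.normal(size=n)
alpha0, alpha1 = -0.5, 1.0
pi_high = 1.0 / (1.0 + np.exp(-(alpha0 + alpha1 * x)))
z = rng.binomial(1, pi_high)         # 0 = low comorbidity, 1 = high comorbidity

# HYPOTHETICAL label probabilities for each component and condition group.
theta = [
    [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1],
     [0.8, 0.2], [0.8, 0.2], [0.8, 0.2], [0.8, 0.2]],   # low component
    [[0.2, 0.3, 0.5], [0.2, 0.3, 0.5],
     [0.3, 0.7], [0.3, 0.7], [0.3, 0.7], [0.3, 0.7]],   # high component
]

# Draw each categorical outcome from its component-specific distribution.
y = np.empty((n, p), dtype=int)
for h in range(g):
    members = np.flatnonzero(z == h)
    for i in range(p):
        y[members, i] = rng.choice(n_labels[i], size=members.size, p=theta[h][i])
```

The resulting `y` (outcome labels) and `x` (risk feature) could then be passed to a fitting routine, with `z` retained only to check how well the true component membership is recovered.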

Discussion

We have developed an unsupervised mixture regression model of multivariate generalised Bernoulli distributions to handle the problems in statistical pattern recognition, where clustering of individuals is performed on the basis of categorical outcome features and mixed risk features. In contrast to post-hoc approaches, this new method simultaneously clusters individuals into groups according to individual patterns in the categorical outcome features and identifies significant risk features that

Acknowledgments

The authors wish to thank the Editor, an Associate Editor, and three reviewers for helpful comments on the paper. This work was supported by the Australian Research Council (Grant number DP170100907). The authors have no competing interests to declare.

References (42)

  • S.K. Ng et al., Modelling the distribution of ischaemic stroke-specific survival time using an EM-based mixture approach with random effects adjustment, Stat. Med. (2004)
  • S.K. Ng et al., Inference on differences between classes using cluster-specific contrasts of mixed effects, Biostatistics (2015)
  • S.K. Ng et al., Finding group structures in “big data” in healthcare research using mixture models
  • C.J. Lawrence et al., Mixture separation for mixed-mode data, Stat. Comput. (1996)
  • L.A. Hunt et al., Mixture model clustering: a brief introduction to the MULTIMIX program, Aust. NZ. J. Stat. (1999)
  • S.K. Ng et al., Expert networks with mixed continuous and categorical feature variables: a location modeling approach
  • G. Caughey et al., Multimorbidity research challenges: where to go from here?, J. Comorbidity (2011)
  • L. Holden et al., Patterns of multimorbidity in working Australians, Popul. Health Metr. (2011)
  • S.K. Ng et al., Identifying comorbidity patterns of health conditions via cluster analysis of pairwise concordance statistics, Stat. Med. (2012)
  • G.P. Westert et al., Patterns of comorbidity and the use of health services in the Dutch population, Eur. J. Public Health (2001)
  • J. Collerton et al., Deconstructing complex multimorbidity in the very old: findings from the Newcastle 85+ study, BioMed Res. Int. (2016)

    Shu-Kay Ng received the B.Sc. degree (Hons.) in civil engineering from the University of Hong Kong, Hong Kong, in 1986, and the Ph.D. degree in statistics from the University of Queensland, Brisbane, Australia, in 1999. He was awarded an Australian Research Council (ARC) Australian Postdoctoral Fellowship in 2003. He joined the School of Medicine, Griffith University, in 2007. Professor Ng has engaged in many multidisciplinary research projects, Government and consultancy research contracts. His research interests lie in the fields of cluster analysis, pattern recognition, image segmentation, neural networks, random-effects modelling, survival analysis, longitudinal analysis, and biostatistics, with particular focus being given to the theory and applications of mixture models, as well as computational statistics concerning the development of the EM algorithm for estimation of mixture models. Professor Ng has authored over 100 research publications. He is an Associate Editor of Journal of Statistical Computation and Simulation and has served in the Program Technical Panel for IEEE International Conference on Bioinformatics and Biomedicine BIBM 2017 and 2018.

    Richard Tawiah received the B.Sc. degree (Hons.) in Mathematical Science from the University for Development Studies and M.Phil. in Applied Mathematics from Kwame Nkrumah University of Science and Technology (KNUST), both in Ghana. Currently he is pursuing a Ph.D. degree in statistics at Griffith University in Australia. His research interests are frailty modelling of recurrent event data, cure models, comorbidity, survival analysis, and longitudinal analysis. From 2014 to 2016, he worked as part of a research team working in collaboration with DANIDA on the KNUST chapter of the Building Stronger Universities (BSU) project.

    Geoffrey John McLachlan received the B.Sc. (Hons.) and Ph.D. degrees from the University of Queensland in 1969 and 1973, respectively. Since 1975 he has been a faculty member of the Department of Mathematics of the University of Queensland. In 1994, he was awarded a D.Sc. degree by the University of Queensland on the basis of his publications in the scientific literature. Since 2002, he has had a joint appointment with the Institute for Molecular Bioscience and he is a chief investigator of the Australian Research Council Centre of Excellence in Biomathematics. In 2007, he was awarded an Australian Professorial Fellowship. Professor McLachlan is a fellow of the American Statistical Association, the Royal Statistical Society, and the Australian Mathematical Society. His research interests have been concentrated in the related fields of classification, cluster and discriminant analyses, image analysis, machine learning, neural networks, pattern recognition, and data mining, and in the field of statistical inference. More recently, he has become actively involved in the field of bioinformatics with the focus on the statistical analysis of microarray gene-expression data. In these fields, he has published over 190 research articles, including six monographs. The last five monographs, which are volumes in the Wiley Series in Probability and Statistics, are on the topics of discriminant analysis, the EM algorithm (including a second edition), finite mixture models, and the analysis of microarray data. Professor McLachlan is on the editorial board of several international journals and has served on the program committee for many international conferences. He is a member of the College of Experts of the Australian Research Council and is President-elect of the International Federation of Classification Societies.
