Mining and visualising ordinal data with non-parametric continuous BBNs

https://doi.org/10.1016/j.csda.2008.09.032

Abstract

Data mining is the process of extracting and analysing information from large databases. Graphical models are a suitable framework for probabilistic modelling. A Bayesian Belief Net (BBN) is a probabilistic graphical model that represents joint distributions in an intuitive and efficient way. It encodes the probability density (or mass) function of a set of variables by specifying a number of conditional independence statements in the form of a directed acyclic graph. Specifying the structure of the model is one of the most important design choices in graphical modelling. Notwithstanding their potential, there are only a limited number of applications of graphical models to very complex and large databases. A method for mining ordinal multivariate data using non-parametric BBNs is presented. The main advantage of this method is that it can handle a large number of continuous variables, without making any assumptions about their marginal distributions, very quickly. Once the BBN is learned from data, it can be used for prediction. This approach allows for rapid conditionalisation, which is a very important feature of a BBN from a user’s standpoint.

Introduction

An ordinal multivariate data set is one in which the numerical ordering of values for each variable is meaningful. A database of street addresses is not ordinal, but a database of fine particulate concentrations at various measuring stations is ordinal; higher concentrations are harmful to human health. We describe a method for mining ordinal multivariate data using non-parametric Bayesian Belief Nets (BBNs), and illustrate this with ordinal data of pollutant emissions and fine particulate concentrations. The data are gathered from electricity generating stations and from collection sites in the United States over the course of seven years (1999–2005). The database contains monthly emissions of SO2 and NOx at different locations, and monthly means of the readings of PM2.5 concentrations at various monitoring sites. SO2 is the formula for the chemical compound sulfur dioxide. This gas is the main product of the combustion of sulfur compounds and is of significant environmental concern. NOx is a generic term for the mono-nitrogen oxides (NO and NO2). These oxides are produced during combustion, especially combustion at high temperatures. The notation PM2.5 describes particles of 2.5 μm or less in diameter.

There are 786 emission stations and 801 collection sites. For most emission stations there is information on emissions of both SO2 and NOx, but for some we only have information about one or the other. This data set allows us to relate the emissions with the air quality and interpret this relationship.

Let us assume that we are interested in the air quality in Washington DC and how it is influenced by selected power plant emissions (see Fig. 1). Additional variables that influence the PM2.5 concentration in Washington DC are the meteorological conditions; we incorporate in our analysis the monthly average temperature, the average wind speed and the wind direction.

Definitions and concepts are introduced in Section 2, but suffice it to say for now that BBNs are directed acyclic graphs in which an arrow connecting a parent node to a child node indicates that influence flows from parent to child. A BBN for Washington DC ambient PM2.5 is shown in Fig. 2. This model is similar to the one described and analysed in Morgenstern et al. (2008). It involves the same 14 variables as nodes, but the arcs between them are different. There are 5 emission stations, in the following locations: Richmond, Masontown, Dumfries, Girard and Philadelphia. For each such station there are 2 nodes in the BBN: one corresponds to the emission of SO2, the other to the emission of NOx. The variable of interest is the PM2.5 concentration in Washington DC (DC_monthly_concPM25). There are 3 nodes that correspond to the meteorological conditions, namely the wind speed, the wind direction and the temperature in DC. Conditional independence relations are given by the separation properties of the graph (see Section 5); thus nox_Philadelphia and DC_WindDir are independent conditional on DC_Temp and DC_WindSpeed. The methodology is designed specifically to handle large numbers of variables, on the order of several hundred (see Morales-Napoles et al. (2007)), but a smaller number of variables is more suitable for explaining the method.

Most BBNs are discrete, and arrows represent mathematical relationships in the form of conditional probability tables. If the number of possible values per node is even modestly large (on the order of 10), such models quickly become intractable. Thus a variable like DC_monthly_concPM25 (see Fig. 2), with all variables discretised to 10 possible values, would require a conditional probability table with 10^9 entries. So-called discrete–continuous BBNs (Cowell et al., 1999) allow continuous nodes with either continuous or discrete parents, but they assume that the continuous nodes are joint normal. Influence between continuous nodes is represented by partial regression coefficients (Pearl, 1988, Shachter and Kenley, 1989). The restriction to joint normality is rather severe. Fig. 3 shows the same BBN as Fig. 2, but with the nodes replaced by histograms showing the marginal distributions at each node. They are far from normal. Our approach discards the assumption of joint normality and builds a joint density for ordinal data using the joint normal copula. This means that we model the data as if it were transformed from a joint normal distribution. Influences are represented as (conditional) Spearman’s rank correlations according to a protocol explained in Section 2. Other copulas could be used, but (to our knowledge) only the joint normal copula affords the advantage of rapid conditionalisation while preserving (conditional) independence for zero (conditional) correlation. Conditionalisation is performed on the transformed variables, which are assumed to follow a joint normal distribution; hence any conditional distribution will also be normal, with known mean and variance. Finding the conditional distribution of a corresponding original variable is then just a matter of transforming back, using the inverse distribution function of this variable and the standard normal distribution function (Hanea et al., 2006).
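The transform–conditionalise–back-transform cycle can be sketched in a few lines of Python. The data below are synthetic stand-ins with deliberately non-normal marginals; the dimensions, correlation values and observed quantiles are illustrative, not taken from the pollution database:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in data: three dependent variables with distinctly
# non-normal marginals (illustrative only).
n = 500
z = rng.multivariate_normal(np.zeros(3),
                            [[1.0, 0.6, 0.4],
                             [0.6, 1.0, 0.5],
                             [0.4, 0.5, 1.0]], size=n)
data = np.column_stack([np.exp(z[:, 0]),          # skewed
                        stats.norm.cdf(z[:, 1]),  # bounded on (0, 1)
                        z[:, 2] ** 3])            # heavy-tailed

# 1. Map each margin to standard normal scores via its ranks.
scores = stats.norm.ppf(stats.rankdata(data, axis=0) / (n + 1))

# 2. The normal copula is parameterised by the correlation of the scores.
R = np.corrcoef(scores, rowvar=False)

# 3. Conditionalise: observe variables 0 and 1 at their 90th percentiles.
#    The conditional law of variable 2 in the Gaussian domain is normal
#    with known mean and variance (a single linear solve).
obs = stats.norm.ppf([0.9, 0.9])
R22 = R[:2, :2]
r12 = R[2, :2]
cond_mean = r12 @ np.linalg.solve(R22, obs)
cond_var = 1.0 - r12 @ np.linalg.solve(R22, r12)

# 4. Map the conditional normal back to the original scale through the
#    empirical quantile function of variable 2.
samples = rng.normal(cond_mean, np.sqrt(cond_var), size=10_000)
cond_orig = np.quantile(data[:, 2], stats.norm.cdf(samples))
```

Because conditioning happens entirely in the Gaussian domain, updating on new evidence costs only a linear solve in the copula correlation matrix; this is what makes conditionalisation rapid even for large models.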

Rapid conditionalisation is a very important feature of a BBN from a user’s standpoint. To illustrate, Fig. 4, Fig. 5 show the result of conditionalising the joint distribution on cold weather (275 K) in Washington and on low (Fig. 4) and high (Fig. 5) concentrations of PM2.5 in Washington. The differences between the emitters’ conditional distributions (black) and the original ones (gray), caused by changing the concentration, are striking, in spite of the relatively weak correlations with Washington’s concentrations.

Of course, rapid computations are of little value if the model itself cannot be validated. Validation involves two steps:

  1. Validating that the joint normal copula adequately represents the multivariate data, and

  2. Validating that the BBN, with its conditional independence relations, is an adequate model of the saturated graph.

Validation requires an overall measure of multivariate dependence on which statistical tests can be based. The discussion in Section 3.2 leads to the choice of the determinant of the correlation matrix as an overall dependence measure. This determinant attains its maximal value of 1 if all variables are uncorrelated, and its minimum value of 0 if there is a linear dependence between the variables. We briefly sketch the two validation steps for the present example. Since we are dealing with copula models, it is more natural to work with the determinant of the rank correlation matrix.

If we convert the original fine particulate data to ranks and compute the determinant of the empirical rank correlation matrix (DER), we find the value 0.1518E−04. To represent the data with a joint normal copula, we must transform the marginals to standard normals, compute their correlation matrix, and compute the determinant of the normal rank correlation matrix (DNR) using Pearson’s transformation (see Section 2). This relation between correlation and rank correlation is specific to the normal distribution and reflects the normal copula. DNR is not in general equal to DER; in this case DNR=0.4506E−04. Use of the normal copula typically introduces some smoothing into the empirical joint distribution, and this is reflected in a somewhat higher value of the determinant of the rank correlation matrix.
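Both determinants can be computed directly. The sketch below uses a small synthetic three-variable data set (the paper’s values come from the full 14-variable pollution data) and Pearson’s transformation, rho = (6/pi)·arcsin(r/2), which converts the product-moment correlation r of the normal scores into a rank correlation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Small synthetic data set standing in for the 14-variable pollution data.
n = 400
z = rng.multivariate_normal(np.zeros(3),
                            [[1.0, 0.5, 0.3],
                             [0.5, 1.0, 0.4],
                             [0.3, 0.4, 1.0]], size=n)
data = np.column_stack([np.exp(z[:, 0]),
                        z[:, 1] ** 3,
                        stats.norm.cdf(z[:, 2])])

# DER: determinant of the empirical (Spearman) rank correlation matrix.
rank_corr, _ = stats.spearmanr(data)
DER = np.linalg.det(rank_corr)

# DNR: transform the margins to standard normal scores, take their
# Pearson correlation r, and map it to a rank correlation matrix with
# Pearson's transformation rho = (6 / pi) * arcsin(r / 2).
scores = stats.norm.ppf(stats.rankdata(data, axis=0) / (n + 1))
r = np.corrcoef(scores, rowvar=False)
DNR = np.linalg.det((6 / np.pi) * np.arcsin(r / 2))
```

As in the text, DER and DNR estimate related but distinct quantities, so they generally differ; the normal copula smooths the empirical dependence.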

We can test the hypothesis that this empirical rank distribution came from a joint normal copula in a straightforward way: we determine the sampling distribution of DNR by simulation. Based on 1000 simulations, we find that the 90% central confidence interval for DNR is [0.0601E−04, 0.4792E−04]. The hypothesis that the data were generated from the joint normal copula would not be rejected at the 5% level.
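The simulation behind this test can be sketched as follows. The correlation matrix and sample size below are illustrative placeholders for the fitted 14-variable matrix and the roughly 84 monthly observations in the pollution data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def dnr(sample):
    """DNR of a sample: Pearson correlation of its normal scores,
    converted to rank correlations by rho = (6/pi) * arcsin(r/2)."""
    n = sample.shape[0]
    scores = stats.norm.ppf(stats.rankdata(sample, axis=0) / (n + 1))
    r = np.corrcoef(scores, rowvar=False)
    return np.linalg.det((6 / np.pi) * np.arcsin(r / 2))

# Illustrative copula correlation matrix and sample size (placeholders
# for the fitted 14-variable matrix and the monthly pollution data).
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
n_obs, n_sim = 84, 1000

# Sampling distribution of DNR under the joint-normal-copula hypothesis:
# repeatedly draw n_obs observations from the copula and recompute DNR.
sims = [dnr(rng.multivariate_normal(np.zeros(3), R, size=n_obs))
        for _ in range(n_sim)]
lo, hi = np.percentile(sims, [5, 95])
# Reject the joint normal copula at the 5% level only if the observed
# DNR falls outside the 90% central interval [lo, hi].
```

The same recipe, with samples drawn from the BBN-implied distribution instead of the saturated copula, yields the confidence intervals for DBBN used below.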

DNR corresponds to the determinant of the saturated BBN, in which each variable is connected with every other variable. With 14 variables, there are 91 arcs in the saturated graph. Many of these influences are very small and reflect sample jitter. To build a perspicuous model we should eliminate noisy influences.

The BBN of Fig. 2 has 26 arcs. To determine whether these 26 arcs are sufficient to represent the saturated graph, we compute the determinant of the rank correlation matrix based on the BBN (DBBN). This differs from DNR, as we have changed many correlations to zero and introduced conditional independencies. In this case, DBBN=1.5092E−04. We determine the sampling distribution of DBBN by simulation. Based on 1000 simulations, we find that the 90% central confidence interval for DBBN is [0.2070E−04, 1.5905E−04]; DNR lies within this band. A simpler BBN involving only 22 arcs is shown in Fig. 6. It has a DBBN of 4.8522E−04, with a 90% central confidence interval of [0.7021E−04, 5.0123E−04]. This interval does not contain DNR, so the simpler model would be rejected.

In general, changing correlations disturbs the positive definiteness of the rank correlation matrix. Moreover, the arcs of a BBN specify only a portion of the correlations. We can nevertheless apply simple heuristics to search for a suitable BBN model, without becoming embroiled in matrix completion and positive definiteness preservation, because of the way we represent joint distributions in a BBN: the conditional rank correlations in a BBN are algebraically independent and, together with the graphical structure and the marginal distributions, uniquely determine the joint distribution. These facts were established in Hanea et al. (2006) and are reviewed in Section 2. The key notion is to link a BBN with a nested sequence of regular vines.

In Section 3.1 we present a short overview of the existing methods for learning the structure of a BBN from data. In order to introduce our approach we need to select a measure of multivariate dependence. Section 3.2 contains a discussion of various such measures. In Section 3.3 we introduce our learning algorithm, and in Section 4 we present this approach using the database of pollutants emissions and fine particulate concentrations. In the last part of this paper we discuss alternative ways to calculate the correlation matrix of a BBN and illustrate how these may speed up the updating algorithm.

Section snippets

Definitions & preliminaries

In this section we present, in a more formal fashion, concepts that are used in learning the structure of a BBN. We discuss non-parametric continuous BBNs and their relationship with the graphical model vines.

A BBN encodes the probability density (or mass) function of a set of variables by specifying a number of conditional independence statements in a form of a directed acyclic graph and a set of conditional distribution functions of each variable given its parents in the graph. In Fig. 2 we

Overview of existing methods

Data mining is the process of extracting and analysing information from large databases. For discrete data BBNs are often used as they describe joint distributions in an intuitive way and allow rapid conditionalisation (Cowell et al., 1999).

In the process of learning a BBN from data, two aspects are of interest: learning the parameters of the BBN, given the structure, and learning the structure itself. We focus on structure learning. A vast literature is available on this subject. Neither the

Ordinal PM2.5 data mining with UniNet

We illustrate our method for learning a BBN from data using the ordinal multivariate data set briefly introduced in Section 1. The data are gathered from electricity generating stations and from collection sites in the United States over the course of seven years (1999–2005). The database contains monthly emissions of SO2 and NOx at different locations and monthly means of the readings of PM2.5 concentrations at various monitoring sites. Since we have monthly data over the course of

Alternative ways to calculate the correlation matrix of a BBN

In both learning the structure of the BBN and the conditioning step, which was briefly presented in Section 1, an important operation is calculating the correlation matrix from the partial correlations specified. To do so, we are repeatedly using Eq. (2.1). When working with very large structures, this operation can be time consuming. In order to avoid this problem we will further present a number of results that will reduce the use of Eq. (2.1). It is known that a BBN induces a (non-unique)
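Eq. (2.1) itself is not reproduced in this excerpt, but the operation it refers to is the standard one-variable recursion between correlations and partial correlations, applied repeatedly along the vine. A minimal sketch (the function names are ours, and the sketch is stated for partial correlations of three variables):

```python
import math

def to_partial(p12, p13, p23):
    """Partial correlation rho_{12;3} from the pairwise correlations
    rho_12, rho_13, rho_23 (one forward step of the recursion)."""
    return (p12 - p13 * p23) / math.sqrt((1 - p13**2) * (1 - p23**2))

def from_partial(p12_3, p13, p23):
    """Invert the recursion: recover rho_12 from rho_{12;3}, rho_13
    and rho_23. Repeatedly applying this step is how a correlation
    matrix is assembled from the (conditional) correlations specified
    on a BBN's arcs."""
    return p12_3 * math.sqrt((1 - p13**2) * (1 - p23**2)) + p13 * p23
```

The two functions are exact inverses: `from_partial(to_partial(0.6, 0.5, 0.4), 0.5, 0.4)` recovers 0.6. Because each inversion step involves a square root and products, repeating it over a very large structure becomes costly, which motivates the shortcuts discussed in this section.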

Conclusions and future research

In this paper, we have described a method for mining ordinal multivariate data using non-parametric BBNs. The main advantage of this method is that it can handle a large number of continuous variables, without making any assumptions about their marginal distributions, in a very fast and efficient way. Inferring the structure of a BBN from data requires a suitable measure of multivariate dependence. The discussion in this paper led to the choice of the determinant of the correlation matrix as an

References (36)

  • H. Joe, Multivariate concordance, Journal of Multivariate Analysis (1990)
  • T. Bedford et al., Vines—A new graphical model for dependent random variables, Annals of Statistics (2002)
  • J. Cheng et al., An algorithm for Bayesian network construction from data, Artificial Intelligence and Statistics (1997)
  • Chickering, D., Geiger, D., Heckerman, D., 1994. Learning Bayesian networks is NP-hard. Technical Report MSR-TR-94-17, ...
  • R. Cooke, Markov and entropy properties of tree and vine-dependent variables
  • G. Cooper et al., A Bayesian method for the induction of probabilistic networks from data, Machine Learning (1992)
  • R. Cowell et al., Probabilistic Networks and Expert Systems, Statistics for Engineering and Information Sciences (1999)
  • Friedman, N., Goldszmidt, M., 1996. Discretizing continuous attributes while learning Bayesian networks. In: Proc. ...
  • Hanea, A.M., Kurowicka, D., 2008. Mixed non-parametric continuous and discrete Bayesian belief nets. Advances in ...
  • A. Hanea et al., Hybrid method for quantifying and analyzing Bayesian belief nets, Quality and Reliability Engineering International (2006)
  • Heckerman, D., 1995. A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft ...
  • Heckerman, D., Geiger, D., 1995. Learning Bayesian networks: A unification for discrete and Gaussian domains. UAI ...
  • Heckerman, D., Geiger, D., Chickering, D., 1997. Learning Bayesian networks: The combination of knowledge and ...
  • H. Joe, Relative entropy measures of multivariate dependence, Journal of the American Statistical Association (1989)
  • John, G., Langley, P., 1995. Estimating continuous distributions in Bayesian classifiers. UAI ...
  • M. Kendall et al., The Advanced Theory of Statistics (1961)
  • Kurowicka, D., Techniques in Representing High Dimensional Distributions. Ph.D. Dissertation. Delft Institute of ...
  • D. Kurowicka et al., Uncertainty Analysis with High Dimensional Dependence Modelling (2006)