Measuring relevance between discrete and continuous features based on neighborhood mutual information
Highlights
► We study measures of relevance between numerical and nominal attributes.
► Shannon’s entropy is extended to neighborhood entropy, and neighborhood mutual information is introduced to calculate the relevance.
► Neighborhood mutual information is combined with a feature selection strategy called minimal redundancy and maximal relevance.
Introduction
Evaluating relevance between features (attributes, variables) is an important task in pattern recognition and machine learning. In decision tree construction, indexes such as Gini, twoing, deviance and mutual information were introduced to compute the relevance between inputs and output, thus guiding the algorithms to select an informative feature to split samples (Breiman, 1993, Quinlan, 1986, Quinlan, 1993). In filter based feature selection techniques, a number of relevance indexes were introduced to compute the goodness of features for predicting decisions (Guyon and Elisseeff, 2003, Hall, 2000, Liu and Yu, 2005). In discretization, a relevance index can be used to evaluate the effectiveness of a set of cuts by computing the relevance between the discretized features and the decision (Fayyad and Irani, 1992, Fayyad and Irani, 1993, Liu et al., 2002). Relevance is also widely used in dependency analysis, feature weighting and distance learning (Düntsch and Gediga, 1997, Wettschereck et al., 1997).
In recent decades, a great number of indexes have been introduced or developed for computing relevance between features. Pearson’s correlation coefficient, which reflects the degree of linear correlation between two random numerical variables, was introduced in Hall (2000). Obviously, there are some limitations in using this coefficient. First, the correlation coefficient can only reflect linear dependency between variables, while relations between variables are usually nonlinear in practice. Second, the correlation coefficient cannot measure the relevance between a set of variables and another variable. In feature selection, we are usually confronted with the task of computing the relation between a candidate feature and a subset of selected features. Furthermore, this coefficient may not be effective in computing the dependency between discrete variables. In order to address these problems, a number of new measures were introduced, such as mutual information (Battiti, 1994), dependency (Hu and Cercone, 1995, Pawlak and Rauszer, 1985) and fuzzy dependency in rough set theory (Hu, Xie, & Yu, 2007), consistency in feature subset selection (Dash & Liu, 2003), Chi2 for feature selection and discretization (Liu & Setiono, 1997), and Relief and ReliefF for estimating attributes (Sikonja & Kononenko, 2003). Dependency is the ratio of consistent samples, i.e., samples that receive the same decision whenever their input values are the same, over the whole set of training data. Fuzzy dependency generalizes this definition to the fuzzy case. Consistency, proposed by Dash and Liu (2003), can be viewed as the ratio of samples which can be correctly classified according to the majority decision.
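To make the dependency and consistency measures concrete, a minimal Python sketch is given below; the list-based sample representation and function names are illustrative, not taken from the cited works.

```python
from collections import Counter, defaultdict

def dependency(X, y):
    """Ratio of consistent samples: a sample is consistent if every
    sample sharing its input values has the same decision."""
    decisions = defaultdict(set)          # input tuple -> decisions seen
    for xi, yi in zip(X, y):
        decisions[tuple(xi)].add(yi)
    consistent = sum(1 for xi in X if len(decisions[tuple(xi)]) == 1)
    return consistent / len(X)

def consistency(X, y):
    """Ratio of samples correctly classified by the majority decision
    of their group of identical inputs (Dash & Liu, 2003)."""
    groups = defaultdict(list)            # input tuple -> decision list
    for xi, yi in zip(X, y):
        groups[tuple(xi)].append(yi)
    correct = sum(max(Counter(ds).values()) for ds in groups.values())
    return correct / len(y)
```

For example, with X = [(0, 1), (0, 1), (1, 0)] and y = [1, 0, 1], the first two samples share inputs but disagree on the decision, so dependency returns 1/3 while consistency returns 2/3.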
Among these measures, mutual information (MI) is the most widely used one for computing relevance. In ID3 and C4.5, MI is used to find good features for splitting samples (Quinlan, 1986, Quinlan, 1993). In feature selection, MI is employed to measure the quality of candidate features (Battiti, 1994, Fleuret, 2004, Hall, 1999, Hu et al., 2006, Hu et al., 2006, Huang et al., 2008, Kwak and Choi, 2002, Kwak and Choi, 2002, Liu et al., 2005, Peng et al., 2005, Qu et al., 2005, Wang et al., 1999, Yu and Liu, 2004). Given two random variables A and B, the MI is defined as

MI(A; B) = \sum_{i} \sum_{j} p(a_i, b_j) \log \frac{p(a_i, b_j)}{p(a_i)\, p(b_j)}.

Thus, MI can be considered as a statistic that reflects the degree of linear or nonlinear dependency between A and B. Generally speaking, one may desire that the selected features are highly dependent on the decision variable, but independent of each other. This condition makes the selected features maximally relevant and minimally redundant.
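When probabilities are estimated as relative frequencies (as discussed in the next paragraph), MI can be computed directly from value counts. A minimal sketch, assuming hashable feature values:

```python
import math
from collections import Counter

def mutual_information(a, b):
    """MI(A;B) = sum_{i,j} p(a_i,b_j) log( p(a_i,b_j) / (p(a_i) p(b_j)) ),
    with all probabilities estimated as relative frequencies.
    Unobserved value pairs contribute zero and are simply skipped."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())
```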
In order to compute mutual information, we should know the probability distributions of the variables and their joint distribution. However, these distributions are not known in practice. Given a set of samples, we have to estimate the probability distributions and joint distributions of features. If features are discrete, histograms can be used to estimate the probabilities: each probability is computed as the relative frequency of samples with the corresponding feature values. If there are continuous variables, two techniques have been developed. One is to estimate probabilities with the Parzen window technique (Kwak and Choi, 2002, Wang et al., 1999). The other is to partition the domains of the variables into several subsets with a discretization algorithm. From the theoretical perspective, the first solution is feasible. In practice, however, it is usually difficult to obtain accurate estimates of multivariate densities, as samples in high-dimensional spaces are sparsely distributed, and the computational cost is also very high (Liu et al., 2005, Peng et al., 2005). Considering the limitations of Parzen windows, discretization techniques are usually integrated with mutual information in feature selection and decision tree construction (C4.5 implicitly discretizes numerical variables into multiple intervals) (Hall, 1999, Liu et al., 2002, Qu et al., 2005, Yu and Liu, 2004). Discretization, as an enabling technique for inductive learning, is useful for rule extraction and concept learning (Liu et al., 2002). However, it is superfluous for C4.5, neural networks and SVMs. Moreover, discretization is not applicable to regression analysis, where relevance between continuous variables is desirable. In these cases, an information measure for computing relevance between continuous features becomes useful.
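As an illustration of the second technique, a simple equal-width scheme is sketched below; this is only one of many possible discretization algorithms, chosen purely for brevity. Its output can be fed to the frequency-based MI estimate above.

```python
def equal_width_bins(values, k=5):
    """Map each continuous value to one of k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0        # guard against constant features
    return [min(int((v - lo) / width), k - 1) for v in values]

# Relevance of a continuous feature to a discrete class, via discretization:
# mutual_information(equal_width_bins(feature), labels)
```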
In Hu, Yu, Liu, and Wu (2008), the authors observed that in human reasoning the assumptions of classification consistency differ between discrete and continuous feature spaces. In discrete spaces, objects with the same feature values should be assigned to the same decision class; otherwise, we consider the decision inconsistent. Meanwhile, since the probability of two samples having exactly the same feature values is very small in continuous spaces, we consider that objects with the most similar feature values should belong to the same decision class; otherwise, the decision is not consistent. The assumption of similarity in continuous spaces thus extends the assumption of equivalence in discrete spaces. Based on this assumption, Hu and his coworkers extended the equivalence relation based dependency function to a neighborhood relation based one, where a neighborhood, computed with a distance, is regarded as the subset of samples whose feature values are similar to those of the centroid. Then, by checking the purity of the neighborhood, we can determine whether the centroid sample is consistent or not. However, neighborhood dependency only reflects whether a sample is consistent; it cannot record the degree of consistency of the sample, which makes the measure less stable and robust than mutual information. In this paper, we integrate the concept of neighborhood into Shannon’s information theory and propose a new information measure, called neighborhood entropy. Then, we derive the concepts of joint neighborhood entropy, neighborhood conditional entropy and neighborhood mutual information for computing the relevance between continuous variables and discrete decision features. Given this generalization, mutual information can be directly used to evaluate and select continuous features.
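A sketch of the neighborhood consistency test described above, assuming a Euclidean metric and a neighborhood radius δ; the representation and function names are illustrative, not the authors' implementation.

```python
def neighborhood(X, i, delta):
    """Indices of samples whose Euclidean distance to sample i is at most
    delta; the neighborhood always contains sample i itself."""
    xi = X[i]
    return [j for j, xj in enumerate(X)
            if sum((p - q) ** 2 for p, q in zip(xi, xj)) ** 0.5 <= delta]

def is_consistent(X, y, i, delta):
    """Sample i is consistent if its whole neighborhood is pure, i.e.,
    every neighbor shares its decision class."""
    return all(y[j] == y[i] for j in neighborhood(X, i, delta))
```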
Our study focuses on three problems. First, we introduce new definitions of neighborhood entropy and neighborhood mutual information and discuss the properties of these measures. We show that neighborhood entropy is a natural generalization of Shannon’s entropy: neighborhood entropy reduces to Shannon’s entropy if a discrete distance is used.
Second, we discuss how to use the proposed measures in feature selection. We give an axiomatic approach to feature subset selection and discuss the difference between the proposed approach and two other approaches. In addition, we consider the ideas of maximal dependency, maximal relevance and minimal redundancy in the context of neighborhood entropy, and discuss their computational complexities. Finally, three strategies are proposed for selecting features based on neighborhood mutual information: maximal dependency (MD), minimal redundancy and maximal relevance (mRMR), and minimal redundancy and maximal dependency (mRMD).
Finally, with comprehensive experiments, we exhibit the properties of neighborhood entropy and compare MD, mRMR and mRMD with some existing algorithms, such as CFS, consistency based feature selection, FCBF and the neighborhood rough set based algorithm. The experimental results show that the proposed measures are effective when integrated with mRMR and mRMD.
The rest of the paper is organized as follows. Section 2 presents the preliminaries on Shannon’s entropy and neighborhood rough sets. Section 3 introduces the definitions of neighborhood entropy and neighborhood mutual information and discusses their properties and interpretation. Section 4 integrates neighborhood mutual information with feature selection, where the relationships between MD, mRMR and mRMD are studied. Experimental analysis is described in Section 5. Finally, conclusion and future work are given in Section 6.
Section snippets
Entropy and mutual information
Shannon’s entropy, first introduced in 1948 (Shannon, 1948), is a measure of the uncertainty of random variables. Let A = {a1, a2, … , an} be a random variable. If p(ai) is the probability of ai, the entropy of A is defined as

H(A) = -\sum_{i=1}^{n} p(a_i) \log p(a_i).

If A and B = {b1, b2, … , bm} are two random variables, the joint probability is p(ai, bj), where i = 1, … , n, j = 1, … , m. The joint entropy of A and B is

H(A, B) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(a_i, b_j) \log p(a_i, b_j).

Assuming that the variable B is known, the remaining uncertainty of A, named the conditional entropy, is defined as

H(A \mid B) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(a_i, b_j) \log p(a_i \mid b_j) = H(A, B) - H(B).
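Under frequency estimates of the probabilities, these three quantities can be computed directly from value counts; the following Python sketch mirrors the definitions above (illustrative code, not from the paper).

```python
import math
from collections import Counter

def entropy(a):
    """H(A) = -sum_i p(a_i) log p(a_i), probabilities from frequencies."""
    n = len(a)
    return -sum((c / n) * math.log(c / n) for c in Counter(a).values())

def joint_entropy(a, b):
    """H(A,B): entropy of the paired variable (A,B)."""
    return entropy(list(zip(a, b)))

def conditional_entropy(a, b):
    """H(A|B) = H(A,B) - H(B)."""
    return joint_entropy(a, b) - entropy(b)
```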
Neighborhood mutual information in metric spaces
Shannon’s entropy and mutual information cannot be used to compute relevance between numerical features due to the difficulty of estimating probability densities. In this section, we introduce the concept of neighborhood into information theory and generalize Shannon’s entropy to numerical information. Definition 1 Given a set of samples U = {x1, x2, … , xn} described by numerical or discrete features F, S ⊆ F is a subset of attributes. The neighborhood of sample xi in S is denoted by δS(xi). Then the neighborhood uncertainty of sample xi is defined as

NH_\delta^{x_i}(S) = -\log \frac{\|\delta_S(x_i)\|}{n},

where ‖δS(xi)‖ is the cardinality of the neighborhood, and the neighborhood entropy of the set of samples is defined as

NH_\delta(S) = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{\|\delta_S(x_i)\|}{n}.
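A minimal sketch of this quantity, assuming a Euclidean metric over the attributes in S and a list-of-tuples sample representation (both illustrative choices):

```python
import math

def neighborhood_entropy(X, delta):
    """NH_delta(S) = -(1/n) * sum_i log(||delta(x_i)|| / n), where
    ||delta(x_i)|| counts the samples within distance delta of x_i."""
    n = len(X)

    def dist(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v)) ** 0.5

    total = 0.0
    for xi in X:
        k = sum(1 for xj in X if dist(xi, xj) <= delta)  # always >= 1
        total += math.log(k / n)
    return -total / n
```

Each neighborhood contains the sample itself, so the logarithm is always defined. With δ = 0, a neighborhood collapses to the samples identical to xi, and the value coincides with the frequency estimate of Shannon’s entropy, consistent with the reduction noted in the Introduction.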
Axiomatization of feature selection
Neighborhood mutual information measures the relevance between numerical or nominal variables. It is also shown that neighborhood entropy degenerates to Shannon’s entropy if the features are nominal; thus, neighborhood mutual information reduces to classical mutual information. Mutual information is widely used in selecting nominal features. We extend these algorithms to select numerical and nominal features by computing relevance with neighborhood mutual information.
As
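The mRMR strategy mentioned above can be driven by neighborhood mutual information with a greedy forward search. The sketch below assumes a callable nmi(a, b) implementing neighborhood mutual information and a fixed number k of features to select; both assumptions are illustrative, and the paper's own algorithms also cover the MD and mRMD strategies.

```python
def mrmr_select(features, decision, nmi, k):
    """Greedy forward selection: at each step pick the candidate feature
    maximizing relevance to the decision minus the mean redundancy with
    the already selected features (the mRMR criterion)."""
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:
        def score(f):
            relevance = nmi(features[f], decision)
            if not selected:
                return relevance
            redundancy = sum(nmi(features[f], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```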
Experimental analysis
In this section, we first show the properties of neighborhood mutual information. Then we compare the neighborhood mutual information based MD feature selection algorithm with the neighborhood rough sets based algorithm (NRS) (Hu et al., 2008), correlation based feature selection (CFS) (Hall, 2000), the consistency based algorithm (Dash & Liu, 2003) and FCBF (Yu & Liu, 2004). Finally, the effectiveness of mRMR, MD and mRMD is discussed.
Conclusion and future work
Measures for computing the relevance between features play an important role in discretization, feature selection and decision tree construction. A number of measures have been developed; given its effectiveness, mutual information is among the most widely used and discussed. However, it is difficult to compute relevance between numerical features based on mutual information. In this work, we generalize Shannon’s information entropy to neighborhood information entropy and propose the concept of neighborhood mutual information for computing the relevance between numerical and nominal features.
Acknowledgement
This work is supported by the National Natural Science Foundation of China under Grants 60703013 and 10978011, and The Hong Kong Polytechnic University (G-YX3B).
References (42)
- Dash, M., & Liu, H. Consistency-based search in feature selection. Artificial Intelligence (2003).
- Düntsch, I., & Gediga, G. Statistical evaluation of rough set dependency analysis. International Journal of Human Computer Studies (1997).
- Huang, J., et al. A parameterless feature ranking algorithm based on MI. Neurocomputing (2008).
- Hu, Q., Xie, Z., & Yu, D. Hybrid attribute reduction based on a novel fuzzy-rough model and information granulation. Pattern Recognition (2007).
- Hu, Q., Yu, D., Liu, J., & Wu, C. Neighborhood rough set based heterogeneous feature subset selection. Information Sciences (2008).
- Hu, Q., et al. Information-preserving hybrid data reduction based on fuzzy-rough techniques. Pattern Recognition Letters (2006).
- Hu, Q., et al. EROS: Ensemble rough subspaces. Pattern Recognition (2007).
- Qian, Y., et al. Knowledge structure, knowledge granulation and knowledge distance in a knowledge base. International Journal of Approximate Reasoning (2009).
- Hu, Q., et al. Uncertainty measures for fuzzy relations and their applications. Applied Soft Computing (2007).
- Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks (1994).
- Bell, D., & Wang, H. A formalism for relevance and its application in feature subset selection. Machine Learning.
- Breiman, L., et al. Classification and regression trees.
- Fayyad, U., & Irani, K. On the handling of continuous-valued attributes in decision tree generation. Machine Learning.
- Fleuret, F. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research.
- Guyon, I., & Elisseeff, A. An introduction to variable and feature selection. Journal of Machine Learning Research.
- Hu, X., & Cercone, N. Learning in relational databases: A rough set approach. Computational Intelligence.
- Hu, Q., et al. Fuzzy probabilistic approximation spaces and their information measures. IEEE Transactions on Fuzzy Systems.