Measuring relevance between discrete and continuous features based on neighborhood mutual information
Highlights
► We study measures of relevance between numerical and nominal attributes.
► Shannon’s entropy is extended to neighborhood entropy, and neighborhood mutual information is introduced to calculate the relevance.
► Neighborhood mutual information is combined with a feature selection strategy called minimal redundancy and maximal relevance.
Introduction
Evaluating relevance between features (attributes, variables) is an important task in pattern recognition and machine learning. In decision tree construction, indexes such as Gini, twoing, deviance and mutual information were introduced to compute the relevance between inputs and output, thus guiding the algorithms to select an informative feature to split samples (Breiman, 1993, Quinlan, 1986, Quinlan, 1993). In filter based feature selection techniques, a number of relevance indexes were introduced to compute the goodness of features for predicting decisions (Guyon and Elisseeff, 2003, Hall, 2000, Liu and Yu, 2005). In discretization, a relevance index can be used to evaluate the effectiveness of a set of cuts by computing the relevance between the discretized features and the decision (Fayyad and Irani, 1992, Fayyad and Irani, 1993, Liu et al., 2002). Relevance is also widely used in dependency analysis, feature weighting and distance learning (Düntsch and Gediga, 1997, Wettschereck et al., 1997).
In recent decades, a great number of indexes have been introduced or developed for computing relevance between features. Pearson’s correlation coefficient, which reflects the degree of linear correlation between two random numerical variables, was introduced in Hall (2000). Obviously, there are some limitations in using this coefficient. First, the correlation coefficient can only reflect linear dependency between variables, while relations between variables are usually nonlinear in practice. Second, the correlation coefficient cannot measure the relevance between a set of variables and another variable. In feature selection, we are usually confronted with the task of computing the relation between a candidate feature and a subset of selected features. Furthermore, this coefficient may not be effective in computing the dependency between discrete variables. In order to address these problems, a number of new measures were introduced, such as mutual information (Battiti, 1994), dependency (Hu and Cercone, 1995, Pawlak and Rauszer, 1985) and fuzzy dependency in rough set theory (Hu, Xie, & Yu, 2007), consistency in feature subset selection (Dash & Liu, 2003), Chi2 for feature selection and discretization (Liu & Setiono, 1997), and Relief and ReliefF for estimating attributes (Sikonja & Kononenko, 2003). Dependency is the ratio of consistent samples, i.e., samples that receive the same decision whenever their input values are the same, over the whole set of training data. Fuzzy dependency generalizes this definition to the fuzzy case. Consistency, proposed by Dash and Liu (2003), can be viewed as the ratio of samples which can be correctly classified according to the majority decision.
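To make the dependency and consistency measures concrete, a minimal Python sketch is given below; the list-based sample representation and function names are illustrative, not taken from the cited works.

```python
from collections import Counter, defaultdict

def dependency(X, y):
    """Ratio of consistent samples: a sample is consistent if every
    sample sharing its input values has the same decision."""
    decisions = defaultdict(set)          # input tuple -> decisions seen
    for xi, yi in zip(X, y):
        decisions[tuple(xi)].add(yi)
    consistent = sum(1 for xi in X if len(decisions[tuple(xi)]) == 1)
    return consistent / len(X)

def consistency(X, y):
    """Ratio of samples correctly classified by the majority decision
    of their group of identical inputs (Dash & Liu, 2003)."""
    groups = defaultdict(list)            # input tuple -> decision list
    for xi, yi in zip(X, y):
        groups[tuple(xi)].append(yi)
    correct = sum(max(Counter(ds).values()) for ds in groups.values())
    return correct / len(y)
```

For example, with X = [(0, 1), (0, 1), (1, 0)] and y = [1, 0, 1], the first two samples share inputs but disagree on the decision, so dependency returns 1/3 while consistency returns 2/3.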
Among these measures, mutual information (MI) is the most widely used one for computing relevance. In ID3 and C4.5, MI is used to find good features for splitting samples (Quinlan, 1986, Quinlan, 1993). In feature selection, MI is employed to measure the quality of candidate features (Battiti, 1994, Fleuret, 2004, Hall, 1999, Hu et al., 2006, Hu et al., 2006, Huang et al., 2008, Kwak and Choi, 2002, Kwak and Choi, 2002, Liu et al., 2005, Peng et al., 2005, Qu et al., 2005, Wang et al., 1999, Yu and Liu, 2004). Given two random variables A and B, the MI is defined as

MI(A; B) = \sum_{i} \sum_{j} p(a_i, b_j) \log \frac{p(a_i, b_j)}{p(a_i)\, p(b_j)}.

Thus, MI can be considered as a statistic that reflects the degree of linear or nonlinear dependency between A and B. Generally speaking, one may desire that the selected features are highly dependent on the decision variable, but independent of each other. This condition makes the selected features maximally relevant and minimally redundant.
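When probabilities are estimated as relative frequencies (as discussed in the next paragraph), MI can be computed directly from value counts. A minimal sketch, assuming hashable feature values:

```python
import math
from collections import Counter

def mutual_information(a, b):
    """MI(A;B) = sum_{i,j} p(a_i,b_j) log( p(a_i,b_j) / (p(a_i) p(b_j)) ),
    with all probabilities estimated as relative frequencies.
    Unobserved value pairs contribute zero and are simply skipped."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())
```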
In order to compute mutual information, we should know the probability distributions of the variables and their joint distribution. However, these distributions are not known in practice. Given a set of samples, we have to estimate the probability distributions and joint distributions of features. If features are discrete, histograms can be used to estimate the probabilities: each probability is computed as the relative frequency of samples with the corresponding feature values. If there are continuous variables, two techniques have been developed. One is to estimate probabilities with the Parzen window technique (Kwak and Choi, 2002, Wang et al., 1999). The other is to partition the domains of the variables into several subsets with a discretization algorithm. From the theoretical perspective, the first solution is feasible. In practice, however, it is usually difficult to obtain accurate estimates of multivariate densities, as samples in high-dimensional spaces are sparsely distributed, and the computational cost is also very high (Liu et al., 2005, Peng et al., 2005). Considering the limitations of Parzen windows, discretization techniques are usually integrated with mutual information in feature selection and decision tree construction (C4.5 implicitly discretizes numerical variables into multiple intervals) (Hall, 1999, Liu et al., 2002, Qu et al., 2005, Yu and Liu, 2004). Discretization, as an enabling technique for inductive learning, is useful for rule extraction and concept learning (Liu et al., 2002). However, it is superfluous for C4.5, neural networks and SVMs. Moreover, discretization is not applicable to regression analysis, where relevance between continuous variables is desirable. In these cases, an information measure for computing relevance between continuous features becomes useful.
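As an illustration of the second technique, a simple equal-width scheme is sketched below; this is only one of many possible discretization algorithms, chosen purely for brevity. Its output can be fed to the frequency-based MI estimate above.

```python
def equal_width_bins(values, k=5):
    """Map each continuous value to one of k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0        # guard against constant features
    return [min(int((v - lo) / width), k - 1) for v in values]

# Relevance of a continuous feature to a discrete class, via discretization:
# mutual_information(equal_width_bins(feature), labels)
```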
In Hu, Yu, Liu, and Wu (2008), the authors observed that in human reasoning the assumptions of classification consistency differ between discrete and continuous feature spaces. In discrete spaces, objects with the same feature values should be assigned to the same decision class; otherwise, we consider the decision inconsistent. Meanwhile, since the probability of two samples having exactly the same feature values is very small in continuous spaces, we consider that objects with the most similar feature values should belong to the same decision class; otherwise, the decision is not consistent. The assumption of similarity in continuous spaces thus extends the assumption of equivalence in discrete spaces. Based on this assumption, Hu and his coworkers extended the equivalence relation based dependency function to a neighborhood relation based one, where a neighborhood, computed with a distance, is regarded as the subset of samples whose feature values are similar to those of the centroid. Then, by checking the purity of the neighborhood, we can determine whether the centroid sample is consistent or not. However, neighborhood dependency only reflects whether a sample is consistent; it cannot record the degree of consistency of the sample, which makes the measure less stable and robust than mutual information. In this paper, we integrate the concept of neighborhood into Shannon’s information theory and propose a new information measure, called neighborhood entropy. Then, we derive the concepts of joint neighborhood entropy, neighborhood conditional entropy and neighborhood mutual information for computing the relevance between continuous variables and discrete decision features. Given this generalization, mutual information can be directly used to evaluate and select continuous features.
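A sketch of the neighborhood consistency test described above, assuming a Euclidean metric and a neighborhood radius δ; the representation and function names are illustrative, not the authors' implementation.

```python
def neighborhood(X, i, delta):
    """Indices of samples whose Euclidean distance to sample i is at most
    delta; the neighborhood always contains sample i itself."""
    xi = X[i]
    return [j for j, xj in enumerate(X)
            if sum((p - q) ** 2 for p, q in zip(xi, xj)) ** 0.5 <= delta]

def is_consistent(X, y, i, delta):
    """Sample i is consistent if its whole neighborhood is pure, i.e.,
    every neighbor shares its decision class."""
    return all(y[j] == y[i] for j in neighborhood(X, i, delta))
```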
Our study focuses on three problems. First, we introduce new definitions of neighborhood entropy and neighborhood mutual information and discuss the properties of these measures. We show that neighborhood entropy is a natural generalization of Shannon’s entropy: neighborhood entropy reduces to Shannon’s entropy if a discrete distance is used.
Second, we discuss how to use the proposed measures in feature selection. We give an axiomatic approach to feature subset selection and discuss the difference between the proposed approach and two other approaches. In addition, we consider the ideas of maximal dependency, maximal relevance and minimal redundancy in the context of neighborhood entropy, and discuss their computational complexities. Finally, three strategies are proposed for selecting features based on neighborhood mutual information: maximal dependency (MD), minimal redundancy and maximal relevance (mRMR), and minimal redundancy and maximal dependency (mRMD).
Finally, with comprehensive experiments, we exhibit the properties of neighborhood entropy and compare MD, mRMR and mRMD with some existing algorithms, such as CFS, consistency based feature selection, FCBF and the neighborhood rough set based algorithm. The experimental results show that the proposed measures are effective when integrated with mRMR and mRMD.
The rest of the paper is organized as follows. Section 2 presents the preliminaries on Shannon’s entropy and neighborhood rough sets. Section 3 introduces the definitions of neighborhood entropy and neighborhood mutual information and discusses their properties and interpretation. Section 4 integrates neighborhood mutual information with feature selection, where the relationships between MD, mRMR and mRMD are studied. Experimental analysis is described in Section 5. Finally, conclusion and future work are given in Section 6.
Section snippets
Entropy and mutual information
Shannon’s entropy, first introduced in 1948 (Shannon, 1948), is a measure of the uncertainty of random variables. Let A = {a1, a2, … , an} be a random variable. If p(ai) is the probability of ai, the entropy of A is defined as

H(A) = -\sum_{i=1}^{n} p(a_i) \log p(a_i).

If A and B = {b1, b2, … , bm} are two random variables, the joint probability is p(ai, bj), where i = 1, … , n, j = 1, … , m. The joint entropy of A and B is

H(A, B) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(a_i, b_j) \log p(a_i, b_j).

Assuming that the variable B is known, the remaining uncertainty of A, named the conditional entropy, is defined as

H(A \mid B) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(a_i, b_j) \log p(a_i \mid b_j) = H(A, B) - H(B).
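Under frequency estimates of the probabilities, these three quantities can be computed directly from value counts; the following Python sketch mirrors the definitions above (illustrative code, not from the paper).

```python
import math
from collections import Counter

def entropy(a):
    """H(A) = -sum_i p(a_i) log p(a_i), probabilities from frequencies."""
    n = len(a)
    return -sum((c / n) * math.log(c / n) for c in Counter(a).values())

def joint_entropy(a, b):
    """H(A,B): entropy of the paired variable (A,B)."""
    return entropy(list(zip(a, b)))

def conditional_entropy(a, b):
    """H(A|B) = H(A,B) - H(B)."""
    return joint_entropy(a, b) - entropy(b)
```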
Neighborhood mutual information in metric spaces
Shannon’s entropy and mutual information cannot be used to compute relevance between numerical features due to the difficulty of estimating probability densities. In this section, we introduce the concept of neighborhood into information theory and generalize Shannon’s entropy to numerical information. Definition 1 Given a set of samples U = {x1, x2, … , xn} described by numerical or discrete features F, S ⊆ F is a subset of attributes. The neighborhood of sample xi in S is denoted by δS(xi). Then the neighborhood uncertainty of sample xi is defined as

NH_\delta^{x_i}(S) = -\log \frac{\|\delta_S(x_i)\|}{n},

where ‖δS(xi)‖ is the cardinality of the neighborhood, and the neighborhood entropy of the set of samples is defined as

NH_\delta(S) = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{\|\delta_S(x_i)\|}{n}.
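A minimal sketch of this quantity, assuming a Euclidean metric over the attributes in S and a list-of-tuples sample representation (both illustrative choices):

```python
import math

def neighborhood_entropy(X, delta):
    """NH_delta(S) = -(1/n) * sum_i log(||delta(x_i)|| / n), where
    ||delta(x_i)|| counts the samples within distance delta of x_i."""
    n = len(X)

    def dist(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v)) ** 0.5

    total = 0.0
    for xi in X:
        k = sum(1 for xj in X if dist(xi, xj) <= delta)  # always >= 1
        total += math.log(k / n)
    return -total / n
```

Each neighborhood contains the sample itself, so the logarithm is always defined. With δ = 0, a neighborhood collapses to the samples identical to xi, and the value coincides with the frequency estimate of Shannon’s entropy, consistent with the reduction noted in the Introduction.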
Axiomatization of feature selection
Neighborhood mutual information measures the relevance between numerical or nominal variables. It is also shown that neighborhood entropy degenerates to Shannon’s entropy if the features are nominal; thus, neighborhood mutual information reduces to classical mutual information. Mutual information is widely used in selecting nominal features. We extend these algorithms to select numerical and nominal features by computing relevance with neighborhood mutual information.
As
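The mRMR strategy mentioned above can be driven by neighborhood mutual information with a greedy forward search. The sketch below assumes a callable nmi(a, b) implementing neighborhood mutual information and a fixed number k of features to select; both assumptions are illustrative, and the paper's own algorithms also cover the MD and mRMD strategies.

```python
def mrmr_select(features, decision, nmi, k):
    """Greedy forward selection: at each step pick the candidate feature
    maximizing relevance to the decision minus the mean redundancy with
    the already selected features (the mRMR criterion)."""
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:
        def score(f):
            relevance = nmi(features[f], decision)
            if not selected:
                return relevance
            redundancy = sum(nmi(features[f], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```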
Experimental analysis
In this section, we first show the properties of neighborhood mutual information. Then we compare the neighborhood mutual information based MD feature selection algorithm with the neighborhood rough sets based algorithm (NRS) (Hu et al., 2008), correlation based feature selection (CFS) (Hall, 2000), the consistency based algorithm (Dash & Liu, 2003) and FCBF (Yu & Liu, 2004). Finally, the effectiveness of mRMR, MD and mRMD is discussed.
Conclusion and future work
Measures for computing the relevance between features play an important role in discretization, feature selection and decision tree construction. A number of measures have been developed; given its effectiveness, mutual information is among the most widely used and discussed. However, it is difficult to compute relevance between numerical features based on mutual information. In this work, we generalize Shannon’s information entropy to neighborhood information entropy and propose the concept of neighborhood mutual information for computing the relevance between numerical and nominal features.
Acknowledgement
This work is supported by the National Natural Science Foundation of China under Grants 60703013 and 10978011, and The Hong Kong Polytechnic University (G-YX3B).
References (42)
- Dash, M., & Liu, H. Consistency-based search in feature selection. Artificial Intelligence (2003).
- Düntsch, I., & Gediga, G. Statistical evaluation of rough set dependency analysis. International Journal of Human Computer Studies (1997).
- Huang, J., et al. A parameterless feature ranking algorithm based on MI. Neurocomputing (2008).
- Hu, Q., Xie, Z., & Yu, D. Hybrid attribute reduction based on a novel fuzzy-rough model and information granulation. Pattern Recognition (2007).
- Hu, Q., Yu, D., Liu, J., & Wu, C. Neighborhood rough set based heterogeneous feature subset selection. Information Sciences (2008).
- Hu, Q., et al. Information-preserving hybrid data reduction based on fuzzy-rough techniques. Pattern Recognition Letters (2006).
- Hu, Q., et al. EROS: Ensemble rough subspaces. Pattern Recognition (2007).
- Qian, Y., et al. Knowledge structure, knowledge granulation and knowledge distance in a knowledge base. International Journal of Approximate Reasoning (2009).
- Hu, Q., et al. Uncertainty measures for fuzzy relations and their applications. Applied Soft Computing (2007).
- Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks (1994).
- Bell, D., & Wang, H. A formalism for relevance and its application in feature subset selection. Machine Learning.
- Breiman, L., et al. Classification and regression trees.
- Fayyad, U., & Irani, K. On the handling of continuous-valued attributes in decision tree generation. Machine Learning.
- Fleuret, F. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research.
- Guyon, I., & Elisseeff, A. An introduction to variable and feature selection. Journal of Machine Learning Research.
- Hu, X., & Cercone, N. Learning in relational databases: A rough set approach. Computational Intelligence.
- Hu, Q., et al. Fuzzy probabilistic approximation spaces and their information measures. IEEE Transactions on Fuzzy Systems.