1 Introduction

Feature selection has proven to be an effective method for preparing high dimensional data for machine learning tasks such as classification. Its benefits include increased prediction accuracy, reduced computational costs and more comprehensible data and models. Among the three main feature selection approaches, filter methods are preferred to wrapper and embedded methods in applications that require computational efficiency, classifier independence, simplicity, ease of use and stability of the results. Consequently, filter feature selection remains an active topic in many recent research areas, such as biomarker identification for cancer prediction and drug discovery, text classification and software defect prediction [3,4,5, 10, 11, 16, 18], and is of growing interest in big data applications [19]; according to Google Scholar, \({\sim }\)1,800 research papers related to filter methods were published in 2018, of which \({\sim }\)170 are in the gene selection area.

Most existing filter methods perform feature selection based on the instance-feature data alone [7]. However, in real world datasets there are external sources of correlation within feature groups which can improve the usefulness of feature selection. For example, the genes in genomic data can be grouped based on the Gene Ontology terms they are annotated with [2] to improve biomarker identification for tasks such as disease prediction and drug discovery. The words in documents can be grouped according to their semantics to select more significant words for document analysis [14]. Nearby pixels in images can be grouped based on their spatial locality to improve the selection of pixels for image classification. In software data, code metrics can be grouped according to their granularity in the code to improve the prediction of defective software [11, 18]. In Sect. 4, using a text dataset as a concrete example, we demonstrate the importance of feature group information for filter feature selection to achieve good classification accuracy.

Although feature group information has been used to improve feature selection in wrapper and embedded approaches [8, 12], group information is only rarely used to improve the accuracy of filter methods. Yu et al. [19] propose a group based filter method, GroupSAOLA (GSAOLA), but, being an online method, it achieves poor accuracy, as we show experimentally. The common way for embedded methods to exploit feature group information is to minimise the \(L_1\) and \(L_2\) norms of the feature weight matrix while minimising the classification error. Depending on whether features are encouraged from the same group [8] or from different groups [12], the \(L_1\) norm is used to induce inter group or intra group sparsity. Selecting features from different groups has been shown to be more effective than selecting features from the same group [12].

Motivated by these approaches, we show that squared \(L_{0,2}\) norm minimisation of the feature weight matrix can be used to encourage features from different feature groups in filter feature selection. We propose a generic framework which combines existing filter feature ranking methods with feature weight matrix norm minimisation, and we use this framework to incorporate feature group information into the mRMR objective [7], because the mRMR algorithm achieves both high accuracy and efficiency compared to other filter methods [3, 4]. However, the proposed framework can be used to improve any other filter method, such as information gain based methods. As \(L_{0}\) norm minimisation is an NP-hard problem, we propose a greedy feature selection algorithm, GroupMRMR, to achieve the feature selection objective; it has the same computational complexity as the mRMR algorithm. We experimentally show that for datasets with feature group structures, GroupMRMR obtains significantly higher classification accuracy than existing filter methods. Our main contributions are as follows.

  • We propose a framework which enables filter feature selection methods to utilise feature group information and improve their classification accuracy.

  • Using the proposed framework, we integrate feature group information into the mRMR algorithm and propose a novel feature selection algorithm.

  • Through extensive experiments we show that our algorithm obtains significantly higher classification accuracy than mRMR and existing filter feature selection algorithms, at no additional computational cost.

Table 1. Frequently used definitions

2 Related Work

Utilisation of feature group information to improve prediction accuracy has been popular in embedded feature selection [8, 12, 17]. Among these methods, algorithms such as GroupLasso [8] encourage features from the same group, while algorithms such as Uncorrelated GroupLasso [12] encourage features from different groups. We adopt the second approach as it has proven more effective on real data [12]. Filter feature selection is preferred over wrapper and embedded methods due to its classifier independence, computational efficiency and simplicity, yet it has comparatively low prediction accuracy. Moreover, most filter methods select features based on the instance-feature data alone, as coded in the data matrix, using information theoretic measures [7, 13, 15]. Some methods [20] use the feature group concept, yet the groups are also formed using instance-feature data, with the aim of reducing feature redundancy. None of these methods take advantage of external sources of knowledge about feature group structures. GSAOLA [19] is an online filter method which exploits feature groups; however, we experimentally show that our method significantly outperforms it in terms of accuracy.

3 Preliminaries

In this section and Table 1, we introduce the terms used later in the paper. Let C be the class variable of a dataset, D, and \(f_{i}\), \(f_{j}\) any two feature variables.

Definition 1

Given that X and Y are two feature variables in D, with feature values x and y respectively, the mutual information between X and Y is given by \(I(X;Y) = \sum _{x \in X} \sum _{y \in Y} p(x,y)\log \frac{p(x,y)}{p(x)p(y)}\).

Definition 2

The relevancy of \(f_i\) = \(Rel (f_i) = I(f_i;C)\).

Definition 3

The redundancy between \(f_{i}\) and \(f_{j}\) = \(Red(f_{i},f_{j}) = I(f_{i};f_{j})\).
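For concreteness, the following is a minimal sketch of Definitions 1–3 for discrete feature and class variables; the helper names are ours, not from the paper, and base-2 logarithms are assumed.

```python
# A minimal sketch of Definitions 1-3 for discrete feature and class variables.
# Helper names are ours (not from the paper); base-2 logarithms are assumed.
from collections import Counter
from math import log2

def mutual_information(x, y):
    """I(X;Y) for two equal-length sequences of discrete values (Definition 1)."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    # p(x,y)/(p(x)p(y)) simplifies to c*n/(count_x*count_y) with counts over n samples
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def relevancy(f, c):
    """Rel(f) = I(f; C) (Definition 2)."""
    return mutual_information(f, c)

def redundancy(fi, fj):
    """Red(f_i, f_j) = I(f_i; f_j) (Definition 3)."""
    return mutual_information(fi, fj)
```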

Given that \(W \in \mathbb {R}^{M \times N}\), \(W_i\) is the \(i^{th}\) row of W and \(W_{ij}\) is the \(j^{th}\) element of \(W_i\), the squared \(L_{0,2}\) norm of W is defined as \(\Vert W \Vert _{0,2}^2 = \sum _{i=1}^M (\Vert W_i \Vert _0)^2 = \sum _{i=1}^{M}N_i^{2}\), where \(N_i = \Vert W_i \Vert _0 = \#\{j \mid W_{ij} \ne 0\}\). For scenarios in which the rows of W have different importance levels, we define the weighted version \(\Vert W \Vert _{0,2}^2 = \sum _{i=1}^M \epsilon _i(\Vert W_i \Vert _0)^2 = \sum _{i=1}^{M}N_i^{2}\epsilon _i\), where \(\epsilon _i\) is the weight of \(W_i\). Finally, k denotes the required number of features.
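The following is a small sketch of this (weighted) squared \(L_{0,2}\) norm; the variable names are ours.

```python
# A small sketch of the (weighted) squared L_{0,2} norm defined above;
# `eps` plays the role of the row weights epsilon_i (variable names are ours).
import numpy as np

def squared_l02_norm(W, eps=None):
    """||W||_{0,2}^2 = sum_i eps_i * (||W_i||_0)^2 for a 2-D array W."""
    N = np.count_nonzero(W, axis=1)                 # N_i: non-zeros in row i
    eps = np.ones(W.shape[0]) if eps is None else np.asarray(eps, dtype=float)
    return float(np.sum(eps * N.astype(float) ** 2))
```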

4 Motivation and Background

Ignoring external sources of correlation within feature groups may result in poor classification accuracy for datasets whose features show group behaviour. We demonstrate this using the mRMR algorithm as a concrete example, a filter method which otherwise achieves good accuracy.

mRMR Algorithm: The mRMR objective for selecting a feature subset S \(\subseteq \) F of size k is as follows.

$$\begin{aligned} \mathop {\text {max}}\limits _{S} \sum _{f \in S} Rel (f) - \frac{1}{|S|} \sum _{f_{i}, f_{j} \in S} Red(f_{i},f_{j}) \text { subject to } |S| = k, k \in \mathbb {Z}^{+} \end{aligned}$$
(1)

To achieve this objective, mRMR selects one feature at a time, maximising the relevancy of the new feature x to the class variable while minimising its redundancy with the already selected feature set, as shown in Eq. (2).

$$\begin{aligned} \mathop {\text {max}}\limits _x Rel (x) - \frac{1}{|S|} \sum _{f \in S} Red(x,f) \end{aligned}$$
(2)
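As an illustration, a minimal sketch of this greedy criterion, assuming rel and red are precomputed lookups such as the helpers in Sect. 3 applied to the data columns (the function name is ours):

```python
# Score of a candidate feature x against the already selected set S (Eq. (2)).
def mrmr_score(x, S, rel, red):
    penalty = sum(red(x, f) for f in S) / len(S) if S else 0.0
    return rel(x) - penalty

# The greedy step picks the argmax over the unselected features, e.g.:
# best = max(unselected, key=lambda x: mrmr_score(x, S, rel, red))
```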

Example 1: Consider selecting two features from the dataset in Fig. 1. In this dataset, each document is classified into one of four types: Botany, Zoology, Physics or Agriculture. The rows represent the features: the words that occur in the documents. A value of 1 means the word occurs in the document (or occurs with high frequency) and 0 means otherwise.

The relevancies of the features Apple, Rice, Cow and Sheep are 0.549, 0.443, 0.311 and 0.311, respectively. mRMR first selects Apple, which has the highest relevancy. The redundancies of Rice, Cow and Sheep with respect to Apple are 0.07, 0.017 and 0.016, respectively. Therefore, mRMR next selects Rice, the feature with the highest relevancy-redundancy difference, 0.373 (0.443 - 0.07). Global mRMR optimisation approaches [15] also select {Apple, Rice}.

Fig. 1. Example text document dataset. Column (\(d_i\)): a document/instance, Row: a word/feature, Class: document type, 1/0: occurrence of a word, B: Botany, Z: Zoology, P: Physics, A: Agriculture

Exploiting Feature Group Semantics: Figure 2 shows the value pattern distribution of the {Apple, Sheep} and {Apple, Rice} pairs within each class. For {Apple, Sheep}, the highest probability value pattern in each class differs from that of every other class. Therefore, each value pattern is associated with a different class, which helps distinguish all the document types from one another. For {Apple, Rice}, there is no such distinctive relationship between value patterns and classes. Using the value pattern distribution, the classification algorithm cannot distinguish between the Zoology and Physics documents, or between the Agriculture and Botany documents. This shows that features from different groups achieve better class discrimination.

The reason behind the suboptimal result of the mRMR algorithm is its ignorance of the high level feature group structure. The words Apple and Rice form a group because they are plant names; Cow and Sheep form another group because they are animal names. The documents are classified according to whether they contain plant names and/or animal names, regardless of the exact plant or animal name they contain. Botany documents (\(d_{1}\)–\(d_{4}\)) contain plant names (Apple or Rice) and no animal names. Zoology documents (\(d_{5}\)–\(d_{8}\)) contain animal names (Cow or Sheep) and no plant names. This high level insight is not captured by the instance-feature data alone. Using feature group information as an external source of knowledge and encouraging features from different feature groups helps solve this problem.

Fig. 2. Value pattern probabilities created by different feature subsets in each class. A: Agriculture, B: Botany, P: Physics, Z: Zoology, Class: the class assigned to the value pattern, %: \(\frac{\mathrm{\#(x,y)\ value\ patterns\ in\ class\ } c}{\mathrm{\#instances\ in\ class\ } c}\times 100\); x, y \(\in \) {0,1}, a: Apple, r: Rice, s: Sheep

5 Proposed Method: GroupMRMR

We propose a framework which enables filter feature selection methods to exploit feature group information to achieve better classification accuracy. Using this framework, we extend the mRMR algorithm into the GroupMRMR algorithm, which encourages features from different groups so as to bring in different semantics and select a more balanced set of features. We select the mRMR algorithm for extension because it has proven good classification accuracy at low computational cost compared to other filter feature selection methods. The feature groups are assigned weights (\(\alpha _i\)) to represent their importance levels, and GroupMRMR selects more features from groups with higher importance. Group weights may be decided according to factors such as group size and group quality. In this paper, we assume that the feature groups do not overlap, but we plan to investigate overlapping groups in the future.

5.1 Feature Selection Objective

Our feature selection objective combines the filter feature selection objective with the encouragement of features from different feature groups. To encourage features from different groups, we minimise \(\Vert W \Vert _{0,2}^2\) of the feature weight matrix W. Using the \(L_0\) norm at the intra group level enforces intra group sparsity, discouraging the selection of multiple features from the same group. Using the \(L_2\) norm at the inter group level encourages features from different feature groups [12].

Let \(W \in \mathbb {R}^{|G| \times |F|}\) be a feature weight matrix such that \(W_{ij}\) = 1 if \(f_{j} \in S\) and \(f_{j} \in G_i\), and \(W_{ij}\) = 0 otherwise. Given that g(W) is any maximisation quantity used in an existing filter feature selection objective which can be expressed as a function of W, and \(\lambda \) is a user defined parameter, our objective is to select S \(\subseteq \) F to maximise the following subject to |S| = k, k \(\in \) \(\mathbb {Z}^{+}\):

$$\begin{aligned} \mathop {\text {max}}\limits _{S} h(S)= g(W) - \lambda \Vert W \Vert _{0,2}^2 \end{aligned}$$
(3)

Given that \(R1 \in \mathbb {R}^{|F|\times |F|}\) is a diagonal matrix in which \(R1_{jj} = Rel(f_j)\), and \(R2 \in \mathbb {R}^{|F|\times |F|}\) is such that \(R2_{ij} = Red(f_i, f_j)\) for i \(\ne \) j and \(R2_{ij}\) = 0 for i = j, it can be shown that \(\Vert WR1W^{T} \Vert _{1,1}\) - \(\frac{1}{2|S|} \Vert WR2W^{T} \Vert _{1,1}\) = \(\sum _{f \in S} Rel(f)\) - \(\frac{1}{|S|}\) \(\sum _{f_{i}, f_{j} \in S}\) \(Red(f_{i},f_{j})\), where \(W^T\) is the transpose of W. That is, the maximisation quantity in the mRMR objective in Eq. (1) is a function of W. Consequently, g(W) in Eq. (3) can be replaced with the mRMR objective, as shown in Eq. (4).
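A quick numeric sanity check of this identity, under our reading that the pair sum runs over unordered pairs \(f_i \ne f_j\) with Red symmetric and Red(f, f) excluded; all variable names and toy values below are ours.

```python
# Numeric sanity check of the matrix identity above on a toy example.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
F, groups = 6, [[0, 1], [2, 3], [4, 5]]            # toy feature set and disjoint groups
S = [0, 2, 3, 5]                                    # a toy selected subset

rel = rng.random(F)                                 # stand-ins for Rel(f_j) >= 0
red = rng.random((F, F))
red = (red + red.T) / 2                             # symmetric stand-in for Red
np.fill_diagonal(red, 0.0)                          # Red(f, f) not counted

W = np.zeros((len(groups), F))                      # indicator matrix from Sect. 5.1
for i, G in enumerate(groups):
    for j in G:
        if j in S:
            W[i, j] = 1.0

R1, R2 = np.diag(rel), red
lhs = np.abs(W @ R1 @ W.T).sum() - np.abs(W @ R2 @ W.T).sum() / (2 * len(S))
rhs = sum(rel[f] for f in S) - sum(red[i, j] for i, j in combinations(S, 2)) / len(S)
assert np.isclose(lhs, rhs)
```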

$$\begin{aligned} \mathop {\text {max}}\limits _{S} h(S)= \sum _{f \in S} Rel(f) - \frac{1}{|S|} \sum _{f_{i}, f_{j} \in S} Red(f_{i},f_{j}) - \lambda \Vert W \Vert _{0,2}^2 \end{aligned}$$
(4)

Definition 4

Given that S and \(G_{i}\) are as defined in Table 1, \(n_{i}\) = \(|S \cap G_{i}|\) is the number of features in S that belong to \(G_i\).

Given that \(n_i\) is as defined in Definition 4, according to Sect. 3, \(\Vert W \Vert _{0,2}^2\) = \(\sum _{i=1}^{|G|}n_i^{2}\). When the feature groups have different weights, the rows of W also have different importance levels. In such scenarios, \(\Vert W \Vert _{0,2}^2\) = \(\sum _{i=1}^{|G|}n_i^{2}\epsilon _i\), where \(\epsilon _i\) = \(\frac{1}{\alpha _i}\) and \(\alpha _i\) > 0. Consequently, we can rewrite the objective in Eq. (4) as Eq. (5), subject to |S| = k, k \(\in \) \(\mathbb {Z}^{+}\). As the feature groups do not overlap, \(\sum _{i=1}^{|G|} n_{i}\) = |S|. Using Eq. (5), we present Theorem 1, which shows that minimising \(\Vert W \Vert _{0,2}^2\) is equivalent to encouraging features from different groups into S.

$$\begin{aligned} \mathop {\text {max}}\limits _{S} h(S) = \sum _{f \in S} Rel (f) - \frac{1}{|S|} \sum _{f_{i}, f_{j} \in S} Red(f_{i},f_{j}) - \lambda \sum _{i=1}^{|G|}\frac{n_i^2}{\alpha _i} \end{aligned}$$
(5)

Theorem 1

Given \(\sum _{i=1}^{|G|}\) \(n_i\) = |S| = k, the minimum of \(\sum _{i=1}^{|G|}\) \(\frac{n_i^2}{\alpha _i}\) is obtained when \(\frac{n_i}{\alpha _i}\) = \(\frac{n_j}{\alpha _j}\), \(\forall \) i, j \(\in \) I, where k \(\in \) \(\mathbb {Z}^{+}\) is a constant.

Proof

Using the method of Lagrange multipliers, we show that the minimum of \(\sum _{i=1}^{|G|}\frac{n_i^2}{\alpha _i}\) is achieved when \(\frac{n_1}{\alpha _1}\) = \(\frac{n_2}{\alpha _2}\) = \(\cdots \) = \(\frac{n_{|G|}}{\alpha _{|G|}}\). Please refer to this link (see Footnote 1) for the detailed proof.
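In brief (a sketch of the continuous relaxation; the linked proof handles the full argument), the Lagrangian of the constrained problem and its stationarity condition are

$$\begin{aligned} \mathcal {L}(n,\mu ) = \sum _{i=1}^{|G|}\frac{n_i^2}{\alpha _i} - \mu \Big (\sum _{i=1}^{|G|} n_i - k\Big ), \qquad \frac{\partial \mathcal {L}}{\partial n_i} = \frac{2n_i}{\alpha _i} - \mu = 0 \;\Rightarrow \; \frac{n_i}{\alpha _i} = \frac{\mu }{2}, \; \forall i \end{aligned}$$

so all ratios \(\frac{n_i}{\alpha _i}\) coincide at the stationary point, which is a minimum since the objective is convex.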

5.2 Iterative Feature Selection

As \(L_{0,2}^{2}\) minimisation is NP-hard, we propose a heuristic algorithm to achieve the objective in Eq. (4). The algorithm selects a feature, \(f_t\), at each iteration t to maximise the difference between \(h(S_t)\) and \(h(S_{t-1})\), where \(S_t\) and \(S_{t-1}\) are the feature subsets selected after Iterations t and \(t-1\) respectively and h(.) is as defined in Eq. (5). As there are datasets with millions of features, we seek an algorithm that selects \(f_t\) with linear complexity. Theorem 2 shows that \(h(S_t)\) - \(h(S_{t-1})\) can be maximised by subtracting the term \(\lambda \frac{2n_p +1}{\alpha _p}\) from the mRMR selection criterion in Eq. (2), where p is the feature group of the evaluated feature (\(f_x\)), \(n_{p}\) is the number of features already selected from p before Iteration t and \({\alpha _p}\) is the weight of p.

Algorithm 1. GroupMRMR (pseudocode)

Theorem 2

Given that \(S_t\), \(S_{t-1}\), \(h(S_t)\), \(h(S_{t-1})\), p, \(n_{p}\) and \({\alpha _p}\) are as defined above and \(S'_{t-1}\) is the set of unselected features after Iteration \(t-1\), \({{\,\mathrm{argmax}\,}}_{f_x \in S'_{t-1}} \big (h(S_{t}) - h(S_{t-1})\big )\) = \({{\,\mathrm{argmax}\,}}_{f_x \in S'_{t-1}} \Big (Rel(f_{x}) - \frac{1}{|S_{t-1}|} \sum _{f_i \in S_{t-1}} Red(f_{x},f_i) - \lambda \frac{2n_{p}+1}{\alpha _p}\Big )\).

Proof

To prove this, we use the fact that \(|S_t|\) and \(|S_{t-1}|\) are constants within a given iteration. Please refer to this link (see Footnote 1) for the detailed proof.
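In brief, if \(f_x\) belongs to group p, only \(n_p\) changes (increasing by one) between Iterations \(t-1\) and t, so the change in the group penalty term of Eq. (5) is

$$\begin{aligned} \lambda \sum _{i=1}^{|G|}\frac{n_i^2}{\alpha _i}\bigg |_{t} - \lambda \sum _{i=1}^{|G|}\frac{n_i^2}{\alpha _i}\bigg |_{t-1} = \lambda \frac{(n_p+1)^2 - n_p^2}{\alpha _p} = \lambda \frac{2n_p+1}{\alpha _p} \end{aligned}$$

while the relevancy and redundancy terms change as in the standard mRMR update, with the \(\frac{1}{|S|}\) factors treated as constants within the iteration.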

Based on Theorem 2, we propose the GroupMRMR algorithm (Algorithm 1). At each iteration, the feature score of each feature in U is computed as shown in Line 5 of Algorithm 1. The feature with the highest score is removed from U and added to S (Lines 7–10 of Algorithm 1). The algorithm can also be modified to encourage features from the same group by setting \(\lambda \) < 0.
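The following is a minimal sketch of this greedy selection loop (our own rendering, not the paper's exact Algorithm 1 listing); rel and red are assumed precomputed lookups, group maps each feature to its group index, and alpha holds the group weights. Setting lam = 0 recovers plain mRMR.

```python
# Minimal sketch of the GroupMRMR greedy selection loop (Theorem 2 score).
def group_mrmr(features, k, rel, red, group, alpha, lam=1.0):
    U = set(features)                        # unselected features
    S = []                                   # selected features, in selection order
    n = {g: 0 for g in set(group.values())}  # n_p: features selected per group so far

    while len(S) < k and U:
        def score(x):
            # Rel(x) - (1/|S|) * sum Red(x, f) - lam * (2*n_p + 1) / alpha_p
            redundancy = sum(red(x, f) for f in S) / len(S) if S else 0.0
            penalty = lam * (2 * n[group[x]] + 1) / alpha[group[x]]
            return rel(x) - redundancy - penalty

        best = max(U, key=score)
        U.remove(best)
        S.append(best)
        n[group[best]] += 1
    return S
```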

Example 1 Revisited: Next, we apply GroupMRMR to Example 1, assuming \(\lambda \) = 1 and \(\alpha _i\) = \(\alpha _j\) = 1, \(\forall \) i, j \(\in \) I. GroupMRMR first selects Apple, the feature with the highest relevancy (0.549). In Iteration 2, the \(n_p\) values for Rice, Cow and Sheep are 1, 0 and 0, respectively, so \(\frac{2n_p+1}{\alpha _p}\) is 3, 1 and 1, respectively. The redundancies of each feature with Apple are the same as computed in Sect. 4. The feature scores for Rice, Cow and Sheep are therefore −2.627 (0.443 - 0.07 - 3), −0.706 (0.311 - 0.017 - 1) and −0.705 (0.311 - 0.016 - 1), respectively, and GroupMRMR selects Sheep, the feature with the highest score. Hence GroupMRMR selects {Apple, Sheep}, the optimal feature subset, as discussed in Sect. 4.

Computational Complexity: The computational complexity of GroupMRMR is the same as that of mRMR, which is O(|S||F|), where |S| and |F| are the cardinalities of the selected feature subset and the complete feature set, respectively. As |S| \(\ll \) |F|, GroupMRMR is effectively linear in |F|.

6 Experiments

This section discusses the experimental results of GroupMRMR on real datasets.

Datasets: We evaluate GroupMRMR on real datasets that are commonly used as benchmarks for group based feature selection; Table 2 summarises them. Images in Yale have a 32 \(\times \) 32 pixel map. GRV is a JIRA software defect dataset whose features are code quality metrics.

Table 2. Dataset description. m: # features, n: # instances, c: # classes

Grouping Features: The pixel map of each image is partitioned into m \(\times \) m non-overlapping squares such that each square is a feature group. This introduces spatial locality information that is not available from the instance-feature data itself. The genes in genomic data are clustered based on Gene Ontology term annotations, as described in [2]. The number of groups is set to 0.04 of the original feature set size, based on previous findings for the MT dataset [2]. Words in the BBC dataset are clustered using the k-means algorithm, based on the semantics available from Word2Vec [14]. We use only 2,411 features: the words available in the Brown corpus. The number of word groups is 50, selected using cross validation on the training data. The code metrics in the software defect data are grouped into five groups based on their granularity in the code [18].
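As an example of the image grouping, a small sketch of partitioning the 32 \(\times \) 32 Yale pixel map into m \(\times \) m blocks (the group indexing is our own):

```python
# Partition a height x width pixel map into non-overlapping m x m squares;
# each square becomes one feature group (group ids are our own indexing).
import numpy as np

def pixel_groups(height=32, width=32, m=4):
    """Map each flattened pixel index to the id of its m x m block."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    block_ids = (rows // m) * (width // m) + (cols // m)
    return block_ids.ravel()      # group[j] = block id of pixel/feature j

groups = pixel_groups()           # 1024 pixel features in (32/4)*(32/4) = 64 groups
```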

Table 3. Comparison of accuracies achieved by different algorithms. Row 1: the maximum accuracy (in AVGF) gained by each algorithm on each dataset; the highest maximum AVGF for each dataset is in bold. Row 2 (x): the number of features at which the highest AVGF is achieved. Row 3 (%): the average accuracy gain of GroupMRMR over the baseline. +: GroupMRMR wins, −: GroupMRMR loses

Baselines: We compare GroupMRMR with existing filter methods of proven high accuracy. The mRMR algorithm, of which GroupMRMR is an extension, is a greedy approach to the mRMR objective, while SPECCMI [15] is a global optimisation algorithm for the same objective. Conditional Mutual Information Maximisation (CMIM) [15] is a mutual information based filter method that does not belong to the mRMR family. ReliefF [13] is a distance based filter method. GSAOLA [19] is an online filter method which utilises feature group information.

Evaluation Method: The classifier's prediction accuracy on the test dataset with the selected features is taken as the prediction accuracy of the feature selection algorithm. It is measured as Macro-F1, the average of the per-class F1-scores (AVGF). Average accuracy is the average of the AVGF values over all selected feature numbers, up to the point at which the algorithms' accuracies converge. The logarithm of the average run time (measured in seconds) is reported.

Fig. 3. Classification accuracy variation with the number of selected features

Fig. 4. Accuracy and run time variations for the Yale and BBC datasets. (a) Accuracy variation with the group size (Yale). (b) Accuracy variation with \(\lambda \) (Yale). (c) Average run time variation (in log scale) of the algorithms (BBC); 95% confidence interval error bars are too small to be visible due to the high precision (standard deviations \({\sim }\)2 s)

Experimental Setup: We split each dataset into a training set (60% of instances) and a test set (40%) using stratified random sampling. Feature selection is performed on the training set, and the classifier is trained on the training set with the selected features. The classifier is then used to predict the labels of the test set. Due to the small sample sizes of the datasets, we do not use a separate validation set for tuning \(\lambda \); instead, we select the \(\lambda \in \) [0, 2] which gives the highest classification accuracy on the training set. The classifier used is the Support Vector Machine. For image data, the default is m = 4. For genomic data, \(\alpha _i\) = 1, \(\forall \) i. For the other datasets, \(\alpha _i\) = \(\frac{|G_i|}{|F|}\) (\(G_i\) and F are defined in Table 1).
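The following is a sketch of this evaluation protocol using scikit-learn; the function and its arguments are ours, and the selection step stands in for any of the compared algorithms.

```python
# Sketch of the evaluation protocol: stratified 60/40 split, feature selection
# on the training part only, SVM classification, Macro-F1 (AVGF) on the test part.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def evaluate(X, y, select_features, k):
    # Stratified 60/40 split of a (n_samples, n_features) array X and labels y.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.6, stratify=y, random_state=0)
    selected = select_features(X_tr, y_tr, k)   # e.g. the GroupMRMR sketch above
    clf = SVC().fit(X_tr[:, selected], y_tr)
    return f1_score(y_te, clf.predict(X_te[:, selected]), average="macro")
```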

Experiment 1: Measures the classification accuracy obtained for the datasets with the selected features. Experiment 2: Performs feature selection for the image datasets with different feature group sizes, m \(\times \) m (m = 2, 4, 8), to test the effect of the group size on the classification accuracy. Experiment 3: Runs GroupMRMR for different \(\lambda \in \) [−1, 1], to test the effect of \(\lambda \) on the classification accuracy. Experiment 4: Executes each feature selection algorithm 20 times and computes the average run time, to evaluate algorithm efficiency.

Experimental Results: Table 3 shows that GroupMRMR achieves the highest AVGF over the baselines in all datasets. In the LK dataset, 100% accuracy is achieved with fewer features than the baselines require. GroupMRMR achieves higher or equal average accuracy compared to the baselines in 32 out of 35 cases. Figure 3 shows that, despite a slightly lower average accuracy than ReliefF, GroupMRMR maintains higher accuracy than the baselines on Multi-A for most selected feature numbers. The other datasets show similar results, but we present only three graphs due to space limitations; please refer to this link (see Footnote 1) for all result graphs. The maximum accuracy gain of GroupMRMR over the accuracy obtained with the complete feature set is 2%, 10%, 2%, 2%, 1% and 6% for the MT, CNS, Multi-A, Yale, BBC and GRV datasets, respectively. The maximum accuracy gain of GroupMRMR over SPECCMI is 50%, in the Yale dataset at 50 selected features, and its highest gain over mRMR is 35%, in the CNS dataset at 70 selected features. Figure 4a shows that the classification accuracy of GroupMRMR for 8 \(\times \) 8 image partitions is lower than for 4 \(\times \) 4 and 2 \(\times \) 2 partitions. Figure 4b shows that the classification accuracy is not very sensitive to \(\lambda \) in the [\(10^{-3}\), 1] range, yet degrades considerably when \(\lambda \) < 0. Figure 4c shows that the run time of GroupMRMR is almost the same as that of the mRMR algorithm and lower than most of the other baselines (\({\sim }\)10 times lower than SPECCMI and CMIM on the BBC dataset).

Evaluation Insights: GroupMRMR consistently shows good classification accuracy compared to the baselines on all datasets (highest average accuracy and highest maximum accuracy on almost all datasets). The equal run times of GroupMRMR and mRMR show that the accuracy gain is obtained at no additional cost and support the time complexity analysis in Sect. 5. Better prediction accuracy is obtained for small groups because large feature groups resemble the original feature set with no grouping; this shows the importance of feature group information for achieving high feature selection accuracy. The accuracy is lower when features are encouraged from the same group (\(\lambda \) < 0) rather than from different groups (\(\lambda \) > 0), which supports our hypothesis. The classification accuracy is not sensitive to \(\lambda \ge 10^{-3}\), so little parameter tuning is required.

7 Conclusion

We propose a framework which enables filter feature selection methods to exploit feature group information as an external source of knowledge. Using this framework, we incorporate feature group information into the mRMR algorithm, resulting in the GroupMRMR algorithm. We show that, compared to the baselines, GroupMRMR achieves high classification accuracy for datasets with feature group structures. The run time of GroupMRMR is the same as that of mRMR, which is lower than that of many existing feature selection algorithms. Our future work includes applying the proposed framework to other filter methods and detecting whether a dataset contains feature group structures.