
1 Introduction

Feature selection is an important task in preparing high-dimensional data for machine learning. It improves the prediction accuracy and simplicity of the learning models and reduces computational costs. Unlike deep learning methods, feature selection identifies important features that humans can interpret when explaining AI decisions (e.g., genes related to certain diseases [12]). Feature selection methods are of two types, supervised and unsupervised, depending on the availability of class labels in the data. Among them, unsupervised feature selection has wide applicability because data in most real-world scenarios are unlabelled. For example, there is a vast amount of text and image data on the web, yet label information, such as the subject of a tweet or the topic of an image, is only rarely available. Due to the unavailability of labels, the unsupervised approach is more challenging than the supervised approach, and achieving good accuracy remains an open problem.

Many unsupervised feature selection methods evaluate features using instance-feature data alone, which is available in the form of the data matrix [9, 14]. In contrast, recent work shows that features can be grouped according to various criteria and that this group information can improve the usefulness of feature selection [17]. For example, nearby pixels in images can be grouped together based on spatial locality to improve the selection of pixels for image analysis. The words in document datasets can be grouped according to their semantics [13] to improve the selection of words for document analysis. Genes in genomic data can be grouped using Gene Ontology information [3] to improve bio-marker identification for disease prediction and drug discovery. We show that considering this group structure enables the selection of a better feature subset in real-world applications. In Sect. 4, we illustrate this using a concrete text data example.

In contrast to supervised feature selection [11], little work exists in unsupervised feature selection that exploits feature group information. The existing methods are restricted to genomic data, in which feature selection is limited to simple strategies such as selecting the centroids of feature groups [3]. They do not use group information in combination with instance-feature data, which is also useful for feature selection. Hierarchical Unsupervised Feature Selection (HUFS) [17] uses feature group information together with instance-feature data to improve feature selection accuracy and is applicable to different data types. Like many state-of-the-art feature selection methods, HUFS is an embedded approach, yet embedded methods do not have a significant advantage in unsupervised feature selection due to the unavailability of class labels. Compared to embedded methods, filter methods are fast and produce more generic solutions [15]. Consequently, they remain popular in applications such as bio-marker identification [12] and attract growing interest in big data applications [7, 16, 20].

We propose a framework which helps incorporate feature group information into unsupervised filter feature selection methods. To demonstrate the usefulness of our approach, we incorporate feature group information into the Laplace Score (LS) algorithm [9], a well-established feature selection method which achieves good accuracy with very low computational costs. We show mathematically that the proposed feature selection objective can be represented as a standard quadratic optimisation problem, so that standard optimisation algorithms can be used to solve it. However, quadratic programming algorithms are slow and cannot scale to the larger problems typically encountered, hence we also propose a greedy optimisation method, Group Laplace Score (GLS), which is faster than quadratic optimisation algorithms yet shows comparable performance. Through extensive experiments we show that GLS achieves high clustering performance with low computational costs compared to existing feature selection methods. Our main contributions are as follows.

  • We propose a framework which enables unsupervised filter feature selection methods to exploit knowledge of feature groups and achieve higher clustering performance.

  • We use the proposed framework to incorporate feature group information into the LS algorithm and propose a new feature selection algorithm, GLS.

  • We experimentally show that GLS obtains significantly higher clustering performance than the existing feature selection algorithms.

2 Related Work

Many unsupervised feature selection methods, both similarity-preserving (filter) [9, 19] and embedded [6, 8, 10, 14], are based on the input data alone and rarely take advantage of external sources of knowledge about feature group structures. The feature groups used by some feature selection methods are also formed from the input data [15, 18]. Some domain-specific unsupervised methods [3] have been proposed for selecting genes from different gene groups, yet they do not combine group-based feature selection with instance-feature data, which is also useful for feature selection. In contrast, HUFS uses feature group information to improve instance-feature data based feature selection and is applicable to different data types. However, HUFS encourages features from the same group, which is not effective in most real-world applications [11]. In contrast, our method encourages features from different groups, and we experimentally show that it outperforms HUFS in terms of accuracy and efficiency. Compared to HUFS, our method also requires less parameter tuning.

3 Preliminaries

This section introduces definitions and terms used frequently in the paper. X \(\in \) \(\mathbb {R}^{n \times m}\) is the input data matrix, where n is the number of instances and m is the number of features in X. F is the set of all features in X, S \(\subseteq \) F is the selected feature subset, \(f_i\) \(\in \) F is the \(i^{th}\) feature in X and k is the number of features to be selected. \(G_i\) is the set of features in the \(i^{th}\) feature group and r is the number of groups. Given a matrix A \(\in \) \(\mathbb {R}^{n \times m}\), \(a_{i,j}\) is its element in the \(i^{th}\) row and \(j^{th}\) column. The \(L_{1,1}\) norm of A is \(\left\Vert A\right\Vert _{1,1} = \sum _{i=1}^{n}\sum _{j=1}^{m}|a_{i,j}|\).

Definition 1

The feature indicator matrix, U \(\in \) \(\mathbb {R}^{m \times m}\), is a diagonal matrix whose \(i^{th}\) diagonal entry \(u_{i,i}\) = \(u_i\) equals 1 if the \(i^{th}\) feature in X is selected into S and 0 otherwise. \(u_{i,j}\) = 0 \((\forall \) i \(\ne \) j).

Definition 2

Given that S is the selected feature subset and \(G_{i}\) is the set of features in \(i^{th}\) feature group, \(w_{i}\) = \( \frac{\text {No. of features in } S \text { and } G_i}{\text {No. of features in } S}\) = \(\frac{|S \cap G_{i}|}{|S|}\).
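To make Definitions 1 and 2 concrete, here is a minimal Python sketch (the feature indices in the toy case are hypothetical and the helper names are ours, not the paper's):

```python
import numpy as np

def indicator_matrix(selected, m):
    """Definition 1: diagonal U with u_i = 1 for selected features, 0 otherwise."""
    U = np.zeros((m, m))
    U[selected, selected] = 1.0
    return U

def group_fraction(selected, group):
    """Definition 2: w_i = |S ∩ G_i| / |S| (taken as 0 when S is empty)."""
    S, G = set(selected), set(group)
    return len(S & G) / len(S) if S else 0.0

# Hypothetical toy case: m = 4 features, features 0 and 2 selected, group G_1 = {1, 2}.
U = indicator_matrix([0, 2], m=4)       # diag(U) = [1, 0, 1, 0]
w1 = group_fraction([0, 2], [1, 2])     # 1/2 = 0.5
```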

4 Motivation and Background

In this section, we demonstrate the importance of external feature group information for feature selection accuracy, using the Reuters (RT) text dataset [1] as a concrete example. As the complete dataset is too large, we select only a few instances and feature values that are helpful for the discussion.

Fig. 1. Feature selection in the text dataset in Example 1

Example 1: Figure 1a shows a part of the RT dataset in which the words are the features and the documents (\(d_i\)) are the instances. Feature values represent the occurrence frequency of each word in each document. Each document belongs to one of three types, Business, Health or Technical, but in unsupervised feature selection this information is not provided to the algorithm. The feature selection problem is to select three features which achieve the best clustering performance.

Features which result in small distances between same-class instances and large distances between different-class instances help the same-class instances to get clustered together. For example, with respect to “Bank”, business documents have small distances between each other and large distances to the rest (Manhattan distance of 3 between \(d_1\) and \(d_2\) and 13 between \(d_1\) and \(d_5\)). Therefore, “Bank” discriminates business documents from the rest. Similarly, “Google” and “Patient” discriminate some technical (\(d_5\)) and health (\(d_3\)) documents. {Bank, Patient, Google} collectively discriminate different-class instances from one another. Figure 1b shows the k-means (k = 3) cluster assignments for this feature subset. Only \(d_4\) is assigned to a wrong cluster and the cluster purities are 1, 1 and 0.67. Clustering performance in terms of NMI [9] is 0.74.

In contrast, no feature in {Bank, Patient, Cell} discriminates between business and technical documents, and “Patient” and “Cell” cause large distances between the health documents, which are same-class instances, leading to poor clustering performance. Figure 1c shows that \(d_4\), \(d_5\) and \(d_6\) are assigned to wrong clusters, resulting in less pure clusters (cluster purities of 1, 1 and 0.5) than in the previous case. Clustering performance in terms of NMI is 0.65. Therefore, {Bank, Patient, Google} is a better subset than {Bank, Patient, Cell}. However, “Cell” and “Google” have very similar feature value distributions, and class labels are not available for feature selection. Therefore, “Cell” and “Google” cannot be differentiated from one another using instance-feature data alone. We show this using the LS algorithm, which selects the features that best preserve the locality structure of the instances, as a concrete example.

Fig. 2. Matrices for the dataset in Example 1. b: Bank, p: Patient, c: Cell, g: Google

LS Algorithm: Given that A is the adjacency matrix between the instances, D is the degree matrix and L is the Laplacian matrix such that L = \(D - A\), the Laplace score of a feature f is \(l_f\) = \(\frac{\tilde{f}^TL\tilde{f}}{\tilde{f}^TD\tilde{f}}\), where \(\tilde{f}\) = \(f - \mu _f\) and \(\mu _f\) is the mean of f. The LS objective for selecting k features is shown in Eq. (1). The LS algorithm achieves this by selecting the features with the k minimum Laplace scores. Figure 2a shows L for the RT dataset, assuming a 1-nearest-neighbour A. The Laplace scores for “Bank”, “Cell”, “Patient” and “Google” are 0.39, 1.06, 1.06 and 1.1, respectively. The selected feature subset is therefore {Bank, Cell, Patient}, which is not optimal.

$$\begin{aligned} \min _{S} \sum _{\tilde{f} \in S} \frac{\tilde{f}^TL\tilde{f}}{\tilde{f}^TD\tilde{f}} \text { subject to } |S| = k \end{aligned}$$
(1)
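For reference, the following is a minimal sketch of the Laplace score computation described above, assuming a binary, symmetrised 1-nearest-neighbour adjacency; it is an illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplace_scores(X, n_neighbors=1):
    """Score each feature f by f~^T L f~ / f~^T D f~ (smaller is better)."""
    # Binary, symmetrised k-NN adjacency between instances.
    A = kneighbors_graph(X, n_neighbors=n_neighbors, mode='connectivity').toarray()
    A = np.maximum(A, A.T)
    D = np.diag(A.sum(axis=1))
    L = D - A                                   # graph Laplacian
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        f = X[:, j] - X[:, j].mean()            # centred feature, f~ = f - mu_f
        denom = f @ D @ f
        scores[j] = (f @ L @ f) / denom if denom > 0 else np.inf
    return scores

# LS selects the features with the k smallest scores (Eq. 1).
```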

Using Feature Group Information: Consider using WordNet [13] as an external source of knowledge for Example 1. WordNet shows a high semantic similarity (0.7) between “Cell” and “Patient”, and low similarity between the other feature pairs (0.1 between “Google” and “Bank”). Three feature groups can be created based on semantic similarity: Group 1: {Bank}, Group 2: {Patient, Cell}, Group 3: {Google}. Encouraging features from different groups results in {Bank, Patient, Google}, which is optimal. This is because semantically similar words tend to occur in similar types of documents. Consequently, words from different groups discriminate different types of documents from one another and result in smaller distances between documents of the same type. For example, given “Patient”, selecting “Google” (from a different group) results in a smaller distance between \(d_3\) and \(d_4\) than selecting “Cell” (from the same group). Unlike “Cell”, “Google” also discriminates between business and technical documents.

5 Proposed Method

We propose a framework which enables unsupervised filter feature selection methods to encourage features from different groups, and we use this framework to incorporate feature group information into the LS algorithm. When the feature groups have different importance levels, based on factors such as group size and group quality, more features are encouraged from the groups with higher importance. The proposed feature selection objective can be solved using quadratic optimisation methods, but we also propose a greedy approach, GLS, which achieves the same performance faster. In this paper, we focus on non-overlapping groups, yet the proposed method can easily be extended to overlapping groups as well.

Modelling Feature Group Information: We define G \(\in \) \(\mathbb {R}^{m \times m}\), the feature group matrix. If \(f_i\), \(f_j\) \(\in \) F are in the same group, \(g_{i,j}\) = \(g_{j,i}\) = 1; otherwise \(g_{i,j}\) = \(g_{j,i}\) = 0, and \(g_{i,i}\) = 0 \(\forall i\) = 1, \(\dots \), m. G for Example 1 is shown in Fig. 2b. Pre- and post-multiplying G by U sets the rows and columns of G corresponding to the unselected features to zero. This results in \(G'\) = UGU \(\in \) \(\mathbb {R}^{m \times m}\), the feature group matrix of the features in S. The number of zeros in \(G'\) increases when the features in S come from different feature groups, and all elements of \(G'\) are \(\ge \) 0. Therefore, given that k features are to be selected, to encourage features from different feature groups, our objective is to select U to minimise \(\left\Vert UGU\right\Vert _{1,1}\) subject to \(\left\Vert U\right\Vert _{1,1} = k\).

Figures 2c and d show U and \(G'\) when S = {Bank, Patient, Cell}, for which \(\left\Vert G'\right\Vert _{1,1}\) = 2. When S = {Bank, Patient, Google}, U is a diagonal matrix with diag(U) = [1, 1, 0, 1], \(G'\) \(\in \) \(\mathbb {R}^{4 \times 4}\) is a matrix of all zeros and \(\left\Vert G'\right\Vert _{1,1}\) = 0. This shows that \(\left\Vert UGU\right\Vert _{1,1}\) is minimal when the features are selected from different groups. When the feature groups have different importance levels, to encourage more features from the groups with higher importance, we set \(g_{i,j}\) = \(g_{j,i}\) = \(\frac{1}{\alpha _i}\) (instead of 1), where \(\alpha _i\) is the weight of \(G_{i}\).
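The following sketch reproduces this check numerically (feature order b, p, c, g as in Fig. 2; the helper function is ours, not the paper's):

```python
import numpy as np

def group_matrix(groups, m, alpha=None):
    """G: g_ij = g_ji = 1/alpha_t if f_i and f_j share group t, 0 otherwise (g_ii = 0)."""
    G = np.zeros((m, m))
    for gid, members in enumerate(groups):
        weight = 1.0 if alpha is None else 1.0 / alpha[gid]
        for i in members:
            for j in members:
                if i != j:
                    G[i, j] = weight
    return G

# Feature order: 0 = Bank, 1 = Patient, 2 = Cell, 3 = Google (groups from Sect. 4).
G = group_matrix([[0], [1, 2], [3]], m=4)
U1 = np.diag([1.0, 1.0, 1.0, 0.0])       # S = {Bank, Patient, Cell}
U2 = np.diag([1.0, 1.0, 0.0, 1.0])       # S = {Bank, Patient, Google}
print(np.abs(U1 @ G @ U1).sum())         # 2.0
print(np.abs(U2 @ G @ U2).sum())         # 0.0
```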

Input Data Based Feature Selection: We next propose a common framework to combine group-based feature selection with any unsupervised filter feature ranking method. Let Q be a diagonal matrix with \(q_{i,i}\) = \(l_i\), where \(l_i\) is the feature score of \(f_i\) in terms of its capability to preserve the sample similarity. \(Q'\) = UQU is the feature score matrix for the selected features in S. \(Q'\) is a diagonal matrix in which \(q'_{i,i}\) = \(l_i\) if \(f_i\) \(\in \) S and \(q'_{i,i}\) = 0 otherwise. Given that \(l_i\) \(\ge \) 0 \(\forall \) i, the feature selection objective is to select U to minimise or maximise \(\left\Vert UQU\right\Vert _{1,1}\) subject to \(\left\Vert U\right\Vert _{1,1}\) = k. Minimisation or maximisation is chosen based on the algorithm used to compute \(l_i\).

Theorem 1 shows that the Laplace score is always non-negative and therefore eligible for Q. Consequently, Eq. (1) can be reformulated as minimising \(\left\Vert UQU\right\Vert _{1,1}\) subject to \(\left\Vert U\right\Vert _{1,1}\) = k, where \(l_i\) is the Laplace score of \(f_i\). For Example 1, diag(Q) = [0.39, 1.06, 1.06, 1.1]. When S = {Bank, Patient, Cell}, \(diag(Q')\) = [0.39, 1.06, 1.06, 0] and \(\left\Vert Q'\right\Vert _{1,1}\) = 2.51. When S = {Bank, Patient, Google}, \(diag(Q')\) = [0.39, 1.06, 0, 1.1] and \(\left\Vert Q'\right\Vert _{1,1}\) = 2.55. Therefore, the minimal \(\left\Vert UQU\right\Vert _{1,1}\) is achieved for {Bank, Patient, Cell}, the same feature subset selected by the LS algorithm. For the rest of the paper, we assume \(l_i\) is computed using the Laplace score and therefore minimise \(\left\Vert UQU\right\Vert _{1,1}\). Maximisation is equivalent to minimising \(-\left\Vert UQU\right\Vert _{1,1}\).

Theorem 1

Given that \(l_i\) is the Laplace score of \(f_i\) \(\in \) F, \(l_i\) \(\ge \) 0, \(\forall \) i = \(1, \cdots ,m\).

Proof

Because L is positive semi-definite and D is positive definite. Refer to this link (See footnote 1) for the proof.

Feature Selection Objective: The feature selection objective, which combines both group-based feature selection and input data based feature selection, is shown in Eq. (2). \(\lambda \) is a user-defined parameter. In this paper, we assign a fixed value to \(\lambda \). In future work, we plan to decide the \(\lambda \) value iteratively for each selected feature. Based on Theorem 2, we reformulate Eq. (2) into Eq. (3).

$$\begin{aligned} \min _{U} \left\Vert UQU\right\Vert _{1,1} + \lambda \left\Vert UGU\right\Vert _{1,1} \text { subject to } \left\Vert U\right\Vert _{1,1} = k \end{aligned}$$
(2)
$$\begin{aligned} \min _{U} \left\Vert U(Q+ \lambda G)U\right\Vert _{1,1} \text { subject to } \left\Vert U\right\Vert _{1,1} = k \end{aligned}$$
(3)

Theorem 2

Given \(\lambda \) \(\ge \) 0, \(\left\Vert UQU\right\Vert _{1,1}\) + \(\lambda \left\Vert UGU\right\Vert _{1,1}\) = \(\left\Vert U(Q+ \lambda G)U\right\Vert _{1,1}\)

Proof

Because \(u_{i,j}\), \(q_{i,j}\), \(g_{i,j}\) \(\ge \) 0 \(\forall \) i, j. Refer to this link (See footnote 1) for the proof.

Given u = [\(u_1\), \(\cdots \), \(u_m\)]\(^T\), where \(u_i\) is the \(i^{th}\) diagonal element of U, Theorem 3 shows that \(\left\Vert U(Q+ \lambda G)U\right\Vert _{1,1}\) can be reformulated as a quadratic function of u. Therefore, to solve Eq. (3), we use two approaches: (1) standard Quadratic Programming (QP) methods and (2) a greedy method (the GLS algorithm). As the QP method, we use the MATLAB built-in “fmincon” function with the “interior-point” method, but omit the details due to space limitations; please refer to this link (See footnote 1) for details. The greedy method shows accuracy comparable to the QP method, yet is faster. Therefore, in this paper, we focus on the greedy method.
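For reference, a continuous relaxation of this quadratic objective can be sketched in Python as follows; the scipy SLSQP solver and the rounding to the k largest entries are assumptions of this sketch, not the paper's MATLAB interior-point setup.

```python
import numpy as np
from scipy.optimize import minimize

def qp_select(Q, G, k, lam=1.0):
    """Relax u in {0,1}^m to [0,1]^m and minimise u^T (Q + lam*G) u s.t. sum(u) = k."""
    H = Q + lam * G
    m = H.shape[0]
    objective = lambda u: u @ H @ u
    constraints = [{'type': 'eq', 'fun': lambda u: u.sum() - k}]
    bounds = [(0.0, 1.0)] * m
    u0 = np.full(m, k / m)                      # feasible, fractional starting point
    res = minimize(objective, u0, method='SLSQP',
                   bounds=bounds, constraints=constraints)
    return np.argsort(res.x)[-k:]               # round: keep the k largest entries of u
```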

Theorem 3

Given that H = \(Q+ \lambda G\), and u as defined above, \(\left\Vert UHU\right\Vert _{1,1}\) = \(u^{T}Hu\) = h(u), that is \(\left\Vert UHU\right\Vert _{1,1}\) is a quadratic function of u.

Proof

Please refer to this link (See footnote 1) for the proof.
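A quick numerical check of Theorem 3 on Example 1, reusing the Q, G and \(\lambda \) values given above (a sketch, not part of the paper):

```python
import numpy as np

Q = np.diag([0.39, 1.06, 1.06, 1.1])             # Laplace scores: Bank, Patient, Cell, Google
G = np.zeros((4, 4)); G[1, 2] = G[2, 1] = 1.0    # Patient and Cell share a group
H = Q + 1.0 * G                                  # lambda = 1

for diag_u in ([1, 1, 1, 0], [1, 1, 0, 1]):      # {Bank, Patient, Cell}, {Bank, Patient, Google}
    u = np.array(diag_u, dtype=float)
    U = np.diag(u)
    assert np.isclose(np.abs(U @ H @ U).sum(), u @ H @ u)   # Theorem 3
    print(u @ H @ u)                             # 4.51, then 2.55: the second subset is better
```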

Greedy Method: As discussed, \(\left\Vert U(Q+ \lambda G)U\right\Vert _{1,1}\) = \(\left\Vert UHU\right\Vert _{1,1}\) = h(u). At each Iteration t, GLS selects a feature \(f_t\) such that \(f_t\) = \({{\,\mathrm{argmin}\,}}_{f_x \in S'_{t-1}}\) \(h(u_{t}) - h(u_{t-1})\), where \(u_{t-1}\) and \(u_t\) are the selected feature indicator vectors (u) after Iterations \((t-1)\) and t, respectively, and \(S'_{t-1}\) is the unselected feature subset after Iteration \(t-1\). According to Theorem 4, this is equivalent to selecting \(f_t\) = \({{\,\mathrm{argmin}\,}}_{f_x \in S'_{t-1}} l_x + \lambda \frac{w_i}{\alpha _i}\), where \(f_x\) is any feature in \(S'_{t-1}\), \(l_{x}\) is the Laplace score of \(f_x\), \(G_i\) is the feature group of \(f_x\), \(\alpha _i\) is the weight of \(G_i\), \(w_i\) = \(\frac{|S_{t-1} \cap G_i|}{|S_{t-1}|}\) and \(S_{t-1}\) is the selected feature subset after Iteration \(t-1\). Therefore, as shown in Algorithm 1, GLS selects \(f_x\) to minimise this quantity (Line 5), which avoids complex matrix multiplication operations.

Theorem 4

Given that \(S_{t-1}\), \(S'_{t-1}\), \(u_{t-1}\), \(u_t\), \(f_x\) \(\in \) \(S'_{t-1}\), \(l_x\), \(w_i\) and \(\alpha _i\) are as defined above, \({{\,\mathrm{argmin}\,}}_{f_x \in S'_{t-1}} h(u_{t}) - h(u_{t-1})\) = \({{\,\mathrm{argmin}\,}}_{f_x \in S'_{t-1}} l_x + \lambda \frac{w_i}{\alpha _i}\).

Proof

Refer to this link (See footnote 1) for the proof.

Algorithm 1 (GLS)
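As Algorithm 1 is not reproduced here, the following is a minimal sketch of the greedy selection described above; the function signature and variable names are ours, not the authors' implementation.

```python
import numpy as np

def gls(scores, group_of, k, lam=1.0, alpha=None):
    """Greedy GLS: repeatedly pick the unselected feature minimising l_x + lam * w_i / alpha_i."""
    m = len(scores)
    alpha = np.ones(max(group_of) + 1) if alpha is None else np.asarray(alpha, dtype=float)
    selected = []
    while len(selected) < k:
        best, best_cost = None, np.inf
        for x in range(m):
            if x in selected:
                continue
            g = group_of[x]
            # w_i: fraction of already-selected features falling in x's group (0 if S is empty).
            w = sum(group_of[s] == g for s in selected) / len(selected) if selected else 0.0
            cost = scores[x] + lam * w / alpha[g]
            if cost < best_cost:
                best, best_cost = x, cost
        selected.append(best)
    return selected
```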

Example 1 Revisited: We apply GLS to Example 1, given the feature groups created in Sect. 4, with \(\lambda \) = 1 and \({\alpha _i}\) = 1 \(\forall \) i. GLS first selects “Bank”, which has the minimum Laplace score (0.39). In Iteration 2, \(w_i\) = 0 for all remaining features. Therefore, GLS selects “Patient” or “Cell”, which have the next minimum Laplace score (1.06). Assume it selects “Patient”. In Iteration 3, for “Cell” and “Google”, \(w_i\) = 0.5 and 0, respectively, and \(l_{i}\) + \(\lambda \) \(\frac{w_i}{\alpha _i}\) = 1.56 and 1.1, respectively. GLS selects “Google”, which has the minimum feature score. Therefore, the selected feature subset is {Bank, Patient, Google}, which is optimal according to Sect. 4.
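Running the `gls` sketch above on the numbers from Example 1 (Laplace scores from Sect. 4 and groups {Bank}, {Patient, Cell}, {Google}) reproduces this trace:

```python
scores = [0.39, 1.06, 1.06, 1.1]      # Bank, Patient, Cell, Google
group_of = [0, 1, 1, 2]               # {Bank}, {Patient, Cell}, {Google}
print(gls(scores, group_of, k=3))     # [0, 1, 3] -> {Bank, Patient, Google}
```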

Computational Complexity Analysis: Given F and S as defined in Sect. 3, the time complexity of computing the Laplace scores is O(|F|). The complexity of the iterative group-based feature selection (Lines 2–11 in Algorithm 1) is O(|S||F|). As |S| \(\ll \) |F|, the time complexity of GLS is linear in |F|.

6 Experimental Evaluation

In this section, we discuss the experimental results obtained by the GLS algorithm.

Datasets: We evaluate GLS using real datasets, which are benchmark datasets for testing group-based feature selection. Table 1 summarises them. Yale, ORL and COIL20 have a 32 \(\times \) 32 pixel map and USPS a 16 \(\times \) 16 pixel map.

Feature Grouping: To introduce spatial locality information, which is not available from the input data matrix alone, we partition the pixel map of an image into \(p \times p\) non-overlapping squares. Each square is a feature group. The default p is 2 for USPS and 4 for the other image datasets. In text data, pairwise semantic similarities between the words are computed using WordNet [13] and the words are clustered based on the similarity values, using spectral clustering. We use only the 2,468 words available in WordNet. Genes in genomic data are clustered based on Gene Ontology information as discussed in [3]. The number of groups is set to 0.04 times the original number of features, based on previous findings for the MT dataset [3].
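For the image datasets, the pixel grouping can be sketched as follows (a sketch under the assumption that p is the side length of each square in pixels; the WordNet and Gene Ontology groupings follow [13] and [3] and are not shown):

```python
import numpy as np

def pixel_groups(height, width, p):
    """Assign each pixel a group id; pixels in the same p x p square share a group."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing='ij')
    group_ids = (rows // p) * (width // p) + (cols // p)
    return group_ids.ravel()          # one group id per feature (pixel), row-major order

# A 32 x 32 pixel map with p = 4 gives (32/4)^2 = 64 groups of 16 pixels each.
group_of = pixel_groups(32, 32, p=4)
```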

Table 1. Dataset description. m: # features, n: # instances, c: # classes

Baselines: As baselines, we use the LS algorithm and Spectral Feature Selection (SPEC) [19] as similarity-preserving methods, and Multi-Cluster Feature Selection (MCFS) [6], Robust Unsupervised Feature Selection (RUFS) [14] and HUFS as embedded methods. RUFS has shown high performance compared to many existing embedded methods, and HUFS uses feature group information similarly to our method. RUFS and MCFS use two different approaches to control feature redundancy (\(L_{2,1}\) norm vs. \(L_{1}\) norm). k-medoid (KM) [3] is specific to genomic datasets; therefore, we use it with genomic data only. For HUFS, we consider the complete pixel hierarchy as described in [17].

Evaluation Criteria: We use clustering performance as the measure of feature selection accuracy and evaluate it in terms of NMI [9]. k-means is the clustering method used. It is run 20 times and we report the average NMI. SD is the standard deviation of the NMI over the 20 runs. The average accuracy of an algorithm on a dataset is the average of the NMIs obtained for all the selected feature numbers in that dataset. We select features up to the point where the accuracies of all algorithms converge. Algorithm run times are measured in seconds.

Table 2. Comparison of the clustering performances of different algorithms. Row 1: maximum NMI of each algorithm for each dataset. The highest maximum NMI for each dataset is in bold letters. Row 2 (±): SD corresponding to maximum NMI. Row 3 (x): the number of features at which the maximum NMI is achieved. Row 4: Algorithm rankings in terms of average accuracy (1 corresponds to the highest average accuracy)

Experimental Setup: We split each dataset into a training set (60% of instances) and a test set (40%) using stratified random sampling, and remove the class labels from both. We perform feature selection on the training set and evaluate the clustering performance on the test set, using only the selected feature subset. By default, \(\alpha _i\) = 1 for all feature groups and \(\lambda \) = 1.
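A minimal scikit-learn sketch of this protocol (the helper name, random seeds and n_init are assumptions; feature selection itself is performed separately on the training split):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def evaluate(X, y, selected, n_clusters, n_runs=20):
    """Stratified 60/40 split; cluster the test set on the selected features; report NMI."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=0)
    # Feature selection is performed on X_tr only (labels are never shown to it);
    # `selected` is the resulting list of feature indices.
    nmis = []
    for run in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=run).fit_predict(X_te[:, selected])
        nmis.append(normalized_mutual_info_score(y_te, labels))
    return np.mean(nmis), np.std(nmis)           # average NMI and SD over the runs
```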

Experiment 1 evaluates the clustering performance of different algorithms for different numbers of selected features. Experiment 2 evaluates the clustering performance of GLS on text and genomic data for \(\alpha _i\) = \(\frac{|G_i|}{|F|}\) and \(\alpha _i\) = 1 \(\forall \) i; this tests the effect of group weights on clustering performance. Experiment 3 executes each feature selection algorithm 100 times and reports the log of the average run time to evaluate algorithm efficiency. Experiment 4 performs feature selection on the image datasets for p = 2, 4, 8, 16; this tests the effect of group size on clustering performance. Experiment 5 runs GLS for \(\lambda \) \(\in \) [-1, 3]; this tests the effect of \(\lambda \) on clustering performance.

Fig. 3. GLS execution time and accuracy variation for different settings for COIL20

Experimental Results: Table 2 shows that GLS achieves the highest NMI over the baselines in 7 out of 9 datasets. In ORL and COIL20, GLS achieves the highest NMI with a smaller number of features than the baselines. In all datasets, GLS has the highest average accuracy (rank 1), while the rankings of the baselines vary across datasets. GLS’s average NMI gain over SPEC on the Multi-B dataset is \(\sim \)30%, which is its maximum NMI gain over the baselines. The maximum NMI gain of GLS over the NMI obtained with the complete feature set is 3%, 1%, 1%, 2%, 10%, 11%, 4%, 12% and 24% for Yale, ORL, COIL20, USPS, RT, MT, CNS, DLBCL-B and Multi-B, respectively. GLS’s average accuracy gains for \(\alpha _i\) = \(\frac{|G_i|}{|F|}\) over \(\alpha _i\) = 1 are 0.3% and 3% on the RT and DLBCL-B datasets, respectively. Due to space limitations, we omit the result graphs for Experiments 1 and 2; please refer to this link (See footnote 1) for all the result graphs. GLS also has the lowest SD of clustering performance in 7 out of 9 datasets. Figure 3a shows that GLS has only a small increase in run time over LS, which is significantly lower than that of the embedded methods. For the COIL20 dataset, the run time of GLS is \(\sim \)50, \(\sim \)20 and \(\sim \)70 times lower than the run times of MCFS, RUFS and HUFS, respectively. Figure 3b shows that, compared to small and large feature groups (p = 2, 16), GLS performance for medium-sized groups (p = 4, 8) is higher. According to Fig. 3c, clustering performance is not very sensitive to \(\lambda \) for \(\lambda \) > 0, yet is significantly lower for \(\lambda \) \(\le \) 0.

Evaluation Insights: Compared to the baselines, GLS consistently shows high clustering performance across all datasets (the highest average accuracy in all datasets and the maximum accuracy in 7 out of 9 datasets) with low computational costs (\(\sim \)50 times lower run time than the embedded methods on average). In all datasets, GLS achieves higher accuracy than using the complete feature set, with a comparatively smaller number of features. The higher accuracy obtained with weighted feature groups shows that, in some cases, knowledge about the importance levels of different feature groups improves the accuracy of GLS. The low SD values for NMI show that GLS produces more stable clusters and more precise performance results than the baselines. Medium-sized groups achieve higher accuracy because very large and very small groups more closely resemble the case of no grouping; this demonstrates the contribution of feature group information to achieving high accuracy. The low accuracy for \(\lambda \) \(\le \) 0 supports our hypothesis that selecting features from the same group is less effective than selecting them from different groups. GLS requires little parameter tuning, as its accuracy is not very sensitive to \(\lambda \) (for \(\lambda \) > 0).

7 Conclusion

We propose a framework which enables unsupervised feature selection methods to exploit feature group information and use this framework to incorporate feature group information into the LS algorithm. We show that, compared to the baselines, the proposed method achieves high clustering performance with low computational costs on datasets with feature group structures and requires less parameter tuning. Our future work includes applying the proposed framework to unsupervised feature selection methods other than the LS algorithm.