1 Introduction

Feature selection has proven to be an effective method for preparing high dimensional data for machine learning tasks such as classification. Its benefits include increased prediction accuracy, reduced computational costs and more comprehensible data and models. Among the three main feature selection approaches, filter methods are preferred to wrapper and embedded methods in applications that require computational efficiency, classifier independence, simplicity, ease of use and stability of the results. Consequently, filter feature selection remains an active topic in many recent research areas, such as biomarker identification for cancer prediction and drug discovery, text classification and software defect prediction [3,4,5, 10, 11, 16, 18], and is of growing interest in big data applications [19]; according to Google Scholar, \({\sim }\)1,800 research papers related to filter methods were published in 2018, of which \({\sim }\)170 are in the gene selection area.

Most existing filter methods perform feature selection based on the instance-feature data alone [7]. However, in real world datasets there are external sources of correlation within feature groups which can improve the usefulness of feature selection. For example, the genes in genomic data can be grouped based on the Gene Ontology terms they are annotated with [2] to improve biomarker identification for tasks such as disease prediction and drug discovery. The words in documents can be grouped according to their semantics to select more significant words for document analysis [14]. Nearby pixels in images can be grouped based on their spatial locality to improve the selection of pixels for image classification. In software data, code metrics can be grouped according to their granularity in the code to improve the prediction of defective software [11, 18]. In Sect. 4, using a text dataset as a concrete example, we demonstrate the importance of feature group information for filter feature selection to achieve good classification accuracy.

Although feature group information has been used to improve feature selection in wrapper and embedded approaches [8, 12], group information is only rarely used to improve the accuracy of filter methods. Yu et al. [19] propose a group based filter method, GroupSAOLA (GSAOLA), but, being an online method, it achieves poor accuracy, as we show experimentally. The common way for embedded methods to exploit feature group information is to minimise the \(L_1\) and \(L_2\) norms of the feature weight matrix while minimising the classification error. Depending on whether features are encouraged from the same group [8] or from different groups [12], the \(L_1\) norm is used to induce inter group or intra group sparsity. Selecting features from different groups has been shown to be more effective than selecting features from the same group [12].

Motivated by these approaches, we show that squared \(L_{0,2}\) norm minimisation of the feature weight matrix can be used to encourage features from different feature groups in filter feature selection. We propose a generic framework which combines existing filter feature ranking methods with feature weight matrix norm minimisation, and we use this framework to incorporate feature group information into the mRMR objective [7], because the mRMR algorithm achieves both high accuracy and efficiency compared to other filter methods [3, 4]. However, the proposed framework can be used to improve any other filter method, such as information gain based methods. As \(L_{0}\) norm minimisation is an NP-hard problem, we propose a greedy feature selection algorithm, GroupMRMR, to achieve the feature selection objective; it has the same computational complexity as the mRMR algorithm. We experimentally show that for datasets with feature group structures, GroupMRMR obtains significantly higher classification accuracy than existing filter methods. Our main contributions are as follows.

  • We propose a framework which enables filter feature selection methods to utilise feature group information and improve their classification accuracy.

  • Using the proposed framework, we integrate feature group information into the mRMR algorithm and propose a novel feature selection algorithm.

  • Through extensive experiments we show that our algorithm obtains significantly higher classification accuracy than mRMR and existing filter feature selection algorithms, at no additional computational cost.

Table 1. Frequently used definitions

2 Related Work

Utilisation of feature group information to improve prediction accuracy has been popular in embedded feature selection [8, 12, 17]. Among these methods, algorithms such as GroupLasso [8] encourage features from the same group, while algorithms such as Uncorrelated GroupLasso [12] encourage features from different groups. We adopt the second approach as it has proven more effective on real data [12]. Filter feature selection is preferred over wrapper and embedded methods due to its classifier independence, computational efficiency and simplicity, yet it has comparatively low prediction accuracy. Moreover, most filter methods select features based on the instance-feature data alone, as coded in the data matrix, using information theoretic measures [7, 13, 15]. Some methods [20] use the feature group concept, yet the groups are also formed using instance-feature data, with the aim of reducing feature redundancy. None of these methods take advantage of external sources of knowledge about feature group structures. GSAOLA [19] is an online filter method which exploits feature groups; however, we experimentally show that our method significantly outperforms it in terms of accuracy.

3 Preliminaries

In this section and Table 1, we introduce the terms used later in the paper. Let C be the class variable of a dataset, D, and \(f_{i}\), \(f_{j}\) any two feature variables.

Definition 1

Given that X and Y are two feature variables in D, with feature values x and y respectively, the mutual information between X and Y is given by \(I(X;Y) = \sum _{x \in X} \sum _{y \in Y} p(x,y)\log \frac{p(x,y)}{p(x)p(y)}\).

Definition 2

The relevancy of \(f_i\) = \(Rel (f_i) = I(f_i;C)\).

Definition 3

The redundancy between \(f_{i}\) and \(f_{j}\) = \(Red(f_{i},f_{j}) = I(f_{i};f_{j})\).
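For concreteness, the following is a minimal sketch of Definitions 1–3 for discrete feature and class variables; the helper names are ours, not from the paper, and base-2 logarithms are assumed.

```python
# A minimal sketch of Definitions 1-3 for discrete feature and class variables.
# Helper names are ours (not from the paper); base-2 logarithms are assumed.
from collections import Counter
from math import log2

def mutual_information(x, y):
    """I(X;Y) for two equal-length sequences of discrete values (Definition 1)."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    # p(x,y)/(p(x)p(y)) simplifies to c*n/(count_x*count_y) with counts over n samples
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def relevancy(f, c):
    """Rel(f) = I(f; C) (Definition 2)."""
    return mutual_information(f, c)

def redundancy(fi, fj):
    """Red(f_i, f_j) = I(f_i; f_j) (Definition 3)."""
    return mutual_information(fi, fj)
```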

Given that \(W \in \mathbb {R}^{M \times N}\), \(W_i\) is the \(i^{th}\) row of W and \(W_{ij}\) is the \(j^{th}\) element of \(W_i\), the squared \(L_{0,2}\) norm of W is defined as \(\Vert W \Vert _{0,2}^2 = \sum _{i=1}^M (\Vert W_i \Vert _0)^2 = \sum _{i=1}^{M}N_i^{2}\), where \(N_i = \Vert W_i \Vert _0 = \#\{j \mid W_{ij} \ne 0\}\). For scenarios in which the rows of W have different importance levels, we define the weighted version \(\Vert W \Vert _{0,2}^2 = \sum _{i=1}^M \epsilon _i(\Vert W_i \Vert _0)^2 = \sum _{i=1}^{M}N_i^{2}\epsilon _i\), where \(\epsilon _i\) is the weight of \(W_i\). Finally, k denotes the required number of features.
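The following is a small sketch of this (weighted) squared \(L_{0,2}\) norm; the variable names are ours.

```python
# A small sketch of the (weighted) squared L_{0,2} norm defined above;
# `eps` plays the role of the row weights epsilon_i (variable names are ours).
import numpy as np

def squared_l02_norm(W, eps=None):
    """||W||_{0,2}^2 = sum_i eps_i * (||W_i||_0)^2 for a 2-D array W."""
    N = np.count_nonzero(W, axis=1)                 # N_i: non-zeros in row i
    eps = np.ones(W.shape[0]) if eps is None else np.asarray(eps, dtype=float)
    return float(np.sum(eps * N.astype(float) ** 2))
```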

4 Motivation and Background

Ignoring external sources of correlation within feature groups may result in poor classification accuracy for datasets whose features show group behaviour. We demonstrate this using the mRMR algorithm as a concrete example, a filter method which otherwise achieves good accuracy.

mRMR Algorithm: The mRMR objective for selecting a feature subset S \(\subseteq \) F of size k is as follows.

$$\begin{aligned} \mathop {\text {max}}\limits _{S} \sum _{f \in S} Rel (f) - \frac{1}{|S|} \sum _{f_{i}, f_{j} \in S} Red(f_{i},f_{j}) \text { subject to } |S| = k, k \in \mathbb {Z}^{+} \end{aligned}$$
(1)

To achieve this objective, mRMR selects one feature at a time, maximising the relevancy of the new feature x to the class variable while minimising its redundancy with the already selected feature set, as shown in Eq. (2).

$$\begin{aligned} \mathop {\text {max}}\limits _x Rel (x) - \frac{1}{|S|} \sum _{f \in S} Red(x,f) \end{aligned}$$
(2)
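As an illustration, a minimal sketch of this greedy criterion, assuming rel and red are precomputed lookups such as the helpers in Sect. 3 applied to the data columns (the function name is ours):

```python
# Score of a candidate feature x against the already selected set S (Eq. (2)).
def mrmr_score(x, S, rel, red):
    penalty = sum(red(x, f) for f in S) / len(S) if S else 0.0
    return rel(x) - penalty

# The greedy step picks the argmax over the unselected features, e.g.:
# best = max(unselected, key=lambda x: mrmr_score(x, S, rel, red))
```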

Example 1: Consider selecting two features from the dataset in Fig. 1. In this dataset, each document is classified into one of four types: Botany, Zoology, Physics or Agriculture. The rows represent the features: the words that occur in the documents. A value of 1 means the word occurs in the document (or occurs with high frequency) and 0 means otherwise.

The relevancies of the features Apple, Rice, Cow and Sheep are 0.549, 0.443, 0.311 and 0.311, respectively. mRMR first selects Apple, which has the highest relevancy. The redundancies of Rice, Cow and Sheep with respect to Apple are 0.07, 0.017 and 0.016, respectively. Therefore, mRMR next selects Rice, the feature with the highest relevancy-redundancy difference, 0.373 (0.443 - 0.07). Global mRMR optimisation approaches [15] also select {Apple, Rice}.

Fig. 1. Example text document dataset. Column (\(d_i\)): a document/instance, Row: a word/feature, Class: document type, 1/0: occurrence of a word, B: Botany, Z: Zoology, P: Physics, A: Agriculture

Exploiting Feature Group Semantics: Figure 2 shows the value pattern distribution of the {Apple, Sheep} and {Apple, Rice} pairs within each class. For {Apple, Sheep}, the highest probability value pattern in each class differs from that of every other class. Therefore, each value pattern is associated with a different class, which helps distinguish all the document types from one another. For {Apple, Rice}, there is no such distinctive relationship between value patterns and classes. Using the value pattern distribution, the classification algorithm cannot distinguish between the Zoology and Physics documents, or between the Agriculture and Botany documents. This shows that features from different groups achieve better class discrimination.

The reason behind the suboptimal result of the mRMR algorithm is its ignorance of the high level feature group structure. The words Apple and Rice form a group because they are plant names; Cow and Sheep form another group because they are animal names. The documents are classified according to whether they contain plant names and/or animal names, regardless of the exact plant or animal name they contain. Botany documents (\(d_{1}\)–\(d_{4}\)) contain plant names (Apple or Rice) and no animal names. Zoology documents (\(d_{5}\)–\(d_{8}\)) contain animal names (Cow or Sheep) and no plant names. This high level insight is not captured by the instance-feature data alone. Using feature group information as an external source of knowledge and encouraging features from different feature groups helps solve this problem.

Fig. 2. Value pattern probabilities created by different feature subsets in each class. A: Agriculture, B: Botany, P: Physics, Z: Zoology, Class: the class assigned to the value pattern, %: \(\frac{\mathrm{\#(x,y)\ value\ patterns\ in\ class\ } c}{\mathrm{\#instances\ in\ class\ } c}\times 100\); x, y \(\in \) {0,1}, a: Apple, r: Rice, s: Sheep

5 Proposed Method: GroupMRMR

We propose a framework which enables filter feature selection methods to exploit feature group information to achieve better classification accuracy. Using this framework, we extend the mRMR algorithm into the GroupMRMR algorithm, which encourages features from different groups so as to bring in different semantics and select a more balanced set of features. We select the mRMR algorithm for extension because it has proven good classification accuracy at low computational cost compared to other filter feature selection methods. The feature groups are assigned weights (\(\alpha _i\)) to represent their importance levels, and GroupMRMR selects more features from groups with higher importance. Group weights may be decided according to factors such as group size and group quality. In this paper, we assume that the feature groups do not overlap, but we plan to investigate overlapping groups in the future.

5.1 Feature Selection Objective

Our feature selection objective combines the filter feature selection objective with the encouragement of features from different feature groups. To encourage features from different groups, we minimise \(\Vert W \Vert _{0,2}^2\) of the feature weight matrix W. Using the \(L_0\) norm at the intra group level enforces intra group sparsity, discouraging the selection of multiple features from the same group. Using the \(L_2\) norm at the inter group level encourages features from different feature groups [12].

Let \(W \in \mathbb {R}^{|G| \times |F|}\) be a feature weight matrix such that \(W_{ij}\) = 1 if \(f_{j} \in S\) and \(f_{j} \in G_i\), and \(W_{ij}\) = 0 otherwise. Given that g(W) is any maximisation quantity used in an existing filter feature selection objective which can be expressed as a function of W, and \(\lambda \) is a user defined parameter, our objective is to select S \(\subseteq \) F to maximise the following subject to |S| = k, k \(\in \) \(\mathbb {Z}^{+}\):

$$\begin{aligned} \mathop {\text {max}}\limits _{S} h(S)= g(W) - \lambda \Vert W \Vert _{0,2}^2 \end{aligned}$$
(3)

Given that \(R1 \in \mathbb {R}^{|F|\times |F|}\) is a diagonal matrix in which \(R1_{jj} = Rel(f_j)\), and \(R2 \in \mathbb {R}^{|F|\times |F|}\) is such that \(R2_{ij} = Red(f_i, f_j)\) for i \(\ne \) j and \(R2_{ij}\) = 0 for i = j, it can be shown that \(\Vert WR1W^{T} \Vert _{1,1}\) - \(\frac{1}{2|S|} \Vert WR2W^{T} \Vert _{1,1}\) = \(\sum _{f \in S} Rel(f)\) - \(\frac{1}{|S|}\) \(\sum _{f_{i}, f_{j} \in S}\) \(Red(f_{i},f_{j})\), where \(W^T\) is the transpose of W. That is, the maximisation quantity in the mRMR objective in Eq. (1) is a function of W. Consequently, g(W) in Eq. (3) can be replaced with the mRMR objective, as shown in Eq. (4).
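A quick numeric sanity check of this identity, under our reading that the pair sum runs over unordered pairs \(f_i \ne f_j\) with Red symmetric and Red(f, f) excluded; all variable names and toy values below are ours.

```python
# Numeric sanity check of the matrix identity above on a toy example.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
F, groups = 6, [[0, 1], [2, 3], [4, 5]]            # toy feature set and disjoint groups
S = [0, 2, 3, 5]                                    # a toy selected subset

rel = rng.random(F)                                 # stand-ins for Rel(f_j) >= 0
red = rng.random((F, F))
red = (red + red.T) / 2                             # symmetric stand-in for Red
np.fill_diagonal(red, 0.0)                          # Red(f, f) not counted

W = np.zeros((len(groups), F))                      # indicator matrix from Sect. 5.1
for i, G in enumerate(groups):
    for j in G:
        if j in S:
            W[i, j] = 1.0

R1, R2 = np.diag(rel), red
lhs = np.abs(W @ R1 @ W.T).sum() - np.abs(W @ R2 @ W.T).sum() / (2 * len(S))
rhs = sum(rel[f] for f in S) - sum(red[i, j] for i, j in combinations(S, 2)) / len(S)
assert np.isclose(lhs, rhs)
```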

$$\begin{aligned} \mathop {\text {max}}\limits _{S} h(S)= \sum _{f \in S} Rel(f) - \frac{1}{|S|} \sum _{f_{i}, f_{j} \in S} Red(f_{i},f_{j}) - \lambda \Vert W \Vert _{0,2}^2 \end{aligned}$$
(4)

Definition 4

Given that S and \(G_{i}\) are as defined in Table 1, \(n_{i}\) = \(|S \cap G_{i}|\) is the number of features in S that belong to \(G_i\).

Given that \(n_i\) is as defined in Definition 4, according to Sect. 3, \(\Vert W \Vert _{0,2}^2\) = \(\sum _{i=1}^{|G|}n_i^{2}\). When the feature groups have different weights, the rows of W also have different importance levels. In such scenarios, \(\Vert W \Vert _{0,2}^2\) = \(\sum _{i=1}^{|G|}n_i^{2}\epsilon _i\), where \(\epsilon _i\) = \(\frac{1}{\alpha _i}\) and \(\alpha _i\) > 0. Consequently, we can rewrite the objective in Eq. (4) as Eq. (5), subject to |S| = k, k \(\in \) \(\mathbb {Z}^{+}\). As the feature groups do not overlap, \(\sum _{i=1}^{|G|} n_{i}\) = |S|. Using Eq. (5), we present Theorem 1, which shows that minimising \(\Vert W \Vert _{0,2}^2\) is equivalent to encouraging features from different groups into S.

$$\begin{aligned} \mathop {\text {max}}\limits _{S} h(S) = \sum _{f \in S} Rel (f) - \frac{1}{|S|} \sum _{f_{i}, f_{j} \in S} Red(f_{i},f_{j}) - \lambda \sum _{i=1}^{|G|}\frac{n_i^2}{\alpha _i} \end{aligned}$$
(5)

Theorem 1

Given \(\sum _{i=1}^{|G|}\) \(n_i\) = |S| = k, the minimum of \(\sum _{i=1}^{|G|}\) \(\frac{n_i^2}{\alpha _i}\) is obtained when \(\frac{n_i}{\alpha _i}\) = \(\frac{n_j}{\alpha _j}\), \(\forall \) i, j \(\in \) I, where k \(\in \) \(\mathbb {Z}^{+}\) is a constant.

Proof

Using the method of Lagrange multipliers, we show that the minimum of \(\sum _{i=1}^{|G|}\frac{n_i^2}{\alpha _i}\) is achieved when \(\frac{n_1}{\alpha _1}\) = \(\frac{n_2}{\alpha _2}\) = \(\cdots \) = \(\frac{n_{|G|}}{\alpha _{|G|}}\). Please refer to this link (see Footnote 1) for the detailed proof.
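In brief (a sketch of the continuous relaxation; the linked proof handles the full argument), the Lagrangian of the constrained problem and its stationarity condition are

$$\begin{aligned} \mathcal {L}(n,\mu ) = \sum _{i=1}^{|G|}\frac{n_i^2}{\alpha _i} - \mu \Big (\sum _{i=1}^{|G|} n_i - k\Big ), \qquad \frac{\partial \mathcal {L}}{\partial n_i} = \frac{2n_i}{\alpha _i} - \mu = 0 \;\Rightarrow \; \frac{n_i}{\alpha _i} = \frac{\mu }{2}, \; \forall i \end{aligned}$$

so all ratios \(\frac{n_i}{\alpha _i}\) coincide at the stationary point, which is a minimum since the objective is convex.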

5.2 Iterative Feature Selection

As \(L_{0,2}^{2}\) minimisation is NP-hard, we propose a heuristic algorithm to achieve the objective in Eq. (4). The algorithm selects a feature, \(f_t\), at each iteration t to maximise the difference between \(h(S_t)\) and \(h(S_{t-1})\), where \(S_t\) and \(S_{t-1}\) are the feature subsets selected after Iterations t and \(t-1\) respectively and h(.) is as defined in Eq. (5). As there are datasets with millions of features, we seek an algorithm that selects \(f_t\) with linear complexity. Theorem 2 shows that \(h(S_t)\) - \(h(S_{t-1})\) can be maximised by subtracting the term \(\lambda \frac{2n_p +1}{\alpha _p}\) from the mRMR selection criterion in Eq. (2), where p is the feature group of the evaluated feature (\(f_x\)), \(n_{p}\) is the number of features already selected from p before Iteration t and \({\alpha _p}\) is the weight of p.

Algorithm 1. GroupMRMR (pseudocode)

Theorem 2

Given that \(S_t\), \(S_{t-1}\), \(h(S_t)\), \(h(S_{t-1})\), p, \(n_{p}\) and \({\alpha _p}\) are as defined above and \(S'_{t-1}\) is the set of unselected features after Iteration \(t-1\), \({{\,\mathrm{argmax}\,}}_{f_x \in S'_{t-1}} \big (h(S_{t}) - h(S_{t-1})\big )\) = \({{\,\mathrm{argmax}\,}}_{f_x \in S'_{t-1}} \Big (Rel(f_{x}) - \frac{1}{|S_{t-1}|} \sum _{f_i \in S_{t-1}} Red(f_{x},f_i) - \lambda \frac{2n_{p}+1}{\alpha _p}\Big )\).

Proof

To prove this, we use the fact that \(|S_t|\) and \(|S_{t-1}|\) are constants within a given iteration. Please refer to this link (see Footnote 1) for the detailed proof.
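In brief, if \(f_x\) belongs to group p, only \(n_p\) changes (increasing by one) between Iterations \(t-1\) and t, so the change in the group penalty term of Eq. (5) is

$$\begin{aligned} \lambda \sum _{i=1}^{|G|}\frac{n_i^2}{\alpha _i}\bigg |_{t} - \lambda \sum _{i=1}^{|G|}\frac{n_i^2}{\alpha _i}\bigg |_{t-1} = \lambda \frac{(n_p+1)^2 - n_p^2}{\alpha _p} = \lambda \frac{2n_p+1}{\alpha _p} \end{aligned}$$

while the relevancy and redundancy terms change as in the standard mRMR update, with the \(\frac{1}{|S|}\) factors treated as constants within the iteration.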

Based on Theorem 2, we propose the GroupMRMR algorithm (Algorithm 1). At each iteration, the feature score of each feature in U is computed as shown in Line 5 of Algorithm 1. The feature with the highest score is removed from U and added to S (Lines 7–10 of Algorithm 1). The algorithm can also be modified to encourage features from the same group by setting \(\lambda \) < 0.
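The following is a minimal sketch of this greedy selection loop (our own rendering, not the paper's exact Algorithm 1 listing); rel and red are assumed precomputed lookups, group maps each feature to its group index, and alpha holds the group weights. Setting lam = 0 recovers plain mRMR.

```python
# Minimal sketch of the GroupMRMR greedy selection loop (Theorem 2 score).
def group_mrmr(features, k, rel, red, group, alpha, lam=1.0):
    U = set(features)                        # unselected features
    S = []                                   # selected features, in selection order
    n = {g: 0 for g in set(group.values())}  # n_p: features selected per group so far

    while len(S) < k and U:
        def score(x):
            # Rel(x) - (1/|S|) * sum Red(x, f) - lam * (2*n_p + 1) / alpha_p
            redundancy = sum(red(x, f) for f in S) / len(S) if S else 0.0
            penalty = lam * (2 * n[group[x]] + 1) / alpha[group[x]]
            return rel(x) - redundancy - penalty

        best = max(U, key=score)
        U.remove(best)
        S.append(best)
        n[group[best]] += 1
    return S
```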

Example 1 Revisited: Next, we apply GroupMRMR to Example 1, assuming \(\lambda \) = 1 and \(\alpha _i\) = \(\alpha _j\) = 1, \(\forall \) i, j \(\in \) I. GroupMRMR first selects Apple, the feature with the highest relevancy (0.549). In Iteration 2, the \(n_p\) values for Rice, Cow and Sheep are 1, 0 and 0, respectively, so \(\frac{2n_p+1}{\alpha _p}\) is 3, 1 and 1, respectively. The redundancies of each feature with Apple are the same as computed in Sect. 4. The feature scores for Rice, Cow and Sheep are therefore −2.627 (0.443 - 0.07 - 3), −0.706 (0.311 - 0.017 - 1) and −0.705 (0.311 - 0.016 - 1), respectively, and GroupMRMR selects Sheep, the feature with the highest score. Hence GroupMRMR selects {Apple, Sheep}, the optimal feature subset, as discussed in Sect. 4.

Computational Complexity: The computational complexity of GroupMRMR is the same as that of mRMR, which is O(|S||F|), where |S| and |F| are the cardinalities of the selected feature subset and the complete feature set, respectively. As |S| \(\ll \) |F|, GroupMRMR is effectively linear in |F|.

6 Experiments

This section discusses the experimental results of GroupMRMR on real datasets.

Datasets: We evaluate GroupMRMR on real datasets that are commonly used as benchmarks for group based feature selection; Table 2 summarises them. Images in Yale have a 32 \(\times \) 32 pixel map. GRV is a JIRA software defect dataset whose features are code quality metrics.

Table 2. Dataset description. m: # features, n: # instances, c: # classes

Grouping Features: The pixel map of each image is partitioned into m \(\times \) m non-overlapping squares such that each square is a feature group. This introduces spatial locality information that is not available from the instance-feature data itself. The genes in genomic data are clustered based on Gene Ontology term annotations, as described in [2]. The number of groups is set to 0.04 of the original feature set size, based on previous findings for the MT dataset [2]. Words in the BBC dataset are clustered using the k-means algorithm, based on the semantics available from Word2Vec [14]. We use only 2,411 features: the words available in the Brown corpus. The number of word groups is 50, selected using cross validation on the training data. The code metrics in the software defect data are grouped into five groups based on their granularity in the code [18].
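As an example of the image grouping, a small sketch of partitioning the 32 \(\times \) 32 Yale pixel map into m \(\times \) m blocks (the group indexing is our own):

```python
# Partition a height x width pixel map into non-overlapping m x m squares;
# each square becomes one feature group (group ids are our own indexing).
import numpy as np

def pixel_groups(height=32, width=32, m=4):
    """Map each flattened pixel index to the id of its m x m block."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    block_ids = (rows // m) * (width // m) + (cols // m)
    return block_ids.ravel()      # group[j] = block id of pixel/feature j

groups = pixel_groups()           # 1024 pixel features in (32/4)*(32/4) = 64 groups
```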

Table 3. Comparison of accuracies achieved by different algorithms. Row 1: the maximum accuracy (in AVGF) gained by each algorithm on each dataset; the highest maximum AVGF for each dataset is in bold. Row 2 (x): the number of features at which the highest AVGF is achieved. Row 3 (%): the average accuracy gain of GroupMRMR over the baseline. +: GroupMRMR wins, −: GroupMRMR loses

Baselines: We compare GroupMRMR with existing filter methods of proven high accuracy. The mRMR algorithm, of which GroupMRMR is an extension, is a greedy approach to the mRMR objective, while SPECCMI [15] is a global optimisation algorithm for the same objective. Conditional Mutual Information Maximisation (CMIM) [15] is a mutual information based filter method that does not belong to the mRMR family. ReliefF [13] is a distance based filter method. GSAOLA [19] is an online filter method which utilises feature group information.

Evaluation Method: The classifier's prediction accuracy on the test dataset with the selected features is taken as the prediction accuracy of the feature selection algorithm. It is measured as Macro-F1, the average of the per-class F1-scores (AVGF). Average accuracy is the average of the AVGF values over all selected feature numbers, up to the point at which the algorithms' accuracies converge. The logarithm of the average run time (measured in seconds) is reported.

Fig. 3. Classification accuracy variation with the number of selected features

Fig. 4. Accuracy and run time variations for the Yale and BBC datasets. (a) Accuracy variation with the group size (Yale). (b) Accuracy variation with \(\lambda \) (Yale). (c) Average run time variation (in log scale) of the algorithms (BBC); 95% confidence interval error bars are too small to be visible due to the high precision (standard deviations \({\sim }\)2 s)

Experimental Setup: We split each dataset into a training set (60% of instances) and a test set (40%) using stratified random sampling. Feature selection is performed on the training set, and the classifier is trained on the training set with the selected features. The classifier is then used to predict the labels of the test set. Due to the small sample sizes of the datasets, we do not use a separate validation set for tuning \(\lambda \); instead, we select the \(\lambda \in \) [0, 2] which gives the highest classification accuracy on the training set. The classifier used is the Support Vector Machine. For image data, the default is m = 4. For genomic data, \(\alpha _i\) = 1, \(\forall \) i. For the other datasets, \(\alpha _i\) = \(\frac{|G_i|}{|F|}\) (\(G_i\) and F are defined in Table 1).
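The following is a sketch of this evaluation protocol using scikit-learn; the function and its arguments are ours, and the selection step stands in for any of the compared algorithms.

```python
# Sketch of the evaluation protocol: stratified 60/40 split, feature selection
# on the training part only, SVM classification, Macro-F1 (AVGF) on the test part.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def evaluate(X, y, select_features, k):
    # Stratified 60/40 split of a (n_samples, n_features) array X and labels y.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.6, stratify=y, random_state=0)
    selected = select_features(X_tr, y_tr, k)   # e.g. the GroupMRMR sketch above
    clf = SVC().fit(X_tr[:, selected], y_tr)
    return f1_score(y_te, clf.predict(X_te[:, selected]), average="macro")
```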

Experiment 1: Measures the classification accuracy obtained for the datasets with the selected features. Experiment 2: Performs feature selection for the image datasets with different feature group sizes, m \(\times \) m (m = 2, 4, 8), to test the effect of the group size on the classification accuracy. Experiment 3: Runs GroupMRMR for different \(\lambda \in \) [−1, 1], to test the effect of \(\lambda \) on the classification accuracy. Experiment 4: Executes each feature selection algorithm 20 times and computes the average run time, to evaluate algorithm efficiency.

Experimental Results: Table 3 shows that GroupMRMR achieves the highest AVGF over the baselines in all datasets. In the LK dataset, 100% accuracy is achieved with fewer features than the baselines require. GroupMRMR achieves higher or equal average accuracy compared to the baselines in 32 out of 35 cases. Figure 3 shows that, despite a slightly lower average accuracy than ReliefF, GroupMRMR maintains higher accuracy than the baselines on Multi-A for most selected feature numbers. The other datasets show similar results, but we present only three graphs due to space limitations; please refer to this link (see Footnote 1) for all result graphs. The maximum accuracy gain of GroupMRMR over the accuracy obtained with the complete feature set is 2%, 10%, 2%, 2%, 1% and 6% for the MT, CNS, Multi-A, Yale, BBC and GRV datasets, respectively. The maximum accuracy gain of GroupMRMR over SPECCMI is 50%, in the Yale dataset at 50 selected features, and its highest gain over mRMR is 35%, in the CNS dataset at 70 selected features. Figure 4a shows that the classification accuracy of GroupMRMR for 8 \(\times \) 8 image partitions is lower than for 4 \(\times \) 4 and 2 \(\times \) 2 partitions. Figure 4b shows that the classification accuracy is not very sensitive to \(\lambda \) in the [\(10^{-3}\), 1] range, yet degrades considerably when \(\lambda \) < 0. Figure 4c shows that the run time of GroupMRMR is almost the same as that of the mRMR algorithm and lower than most of the other baselines (\({\sim }\)10 times lower than SPECCMI and CMIM on the BBC dataset).

Evaluation Insights: GroupMRMR consistently shows good classification accuracy compared to the baselines on all datasets (highest average accuracy and highest maximum accuracy on almost all datasets). The equal run times of GroupMRMR and mRMR show that the accuracy gain is obtained at no additional cost and support the time complexity analysis in Sect. 5. Better prediction accuracy is obtained for small groups because large feature groups resemble the original feature set with no grouping; this shows the importance of feature group information for achieving high feature selection accuracy. The accuracy is lower when features are encouraged from the same group (\(\lambda \) < 0) rather than from different groups (\(\lambda \) > 0), which supports our hypothesis. The classification accuracy is not sensitive to \(\lambda \ge 10^{-3}\), so little parameter tuning is required.

7 Conclusion

We propose a framework which enables filter feature selection methods to exploit feature group information as an external source of knowledge. Using this framework, we incorporate feature group information into the mRMR algorithm, resulting in the GroupMRMR algorithm. We show that, compared to the baselines, GroupMRMR achieves high classification accuracy for datasets with feature group structures. The run time of GroupMRMR is the same as that of mRMR, which is lower than that of many existing feature selection algorithms. Our future work includes applying the proposed framework to other filter methods and detecting whether a dataset contains feature group structures.