A Practical Approach to Feature Selection
Abstract
In real-world concept learning problems, the representation of data often uses many features, only a few of which may be related to the target concept. In this situation, feature selection is important both to speed up learning and to improve concept quality. A new feature selection algorithm, Relief, uses a statistical method and avoids heuristic search. Relief requires time linear in the number of given features and the number of training instances, regardless of the target concept to be learned. Although the algorithm does not necessarily find the smallest subset of features, the size tends to be small because only statistically relevant features are selected. This paper focuses on empirical test results in two artificial domains: the LED Display domain and the Parity domain, with and without noise. Comparison with other feature selection algorithms shows Relief's advantages in terms of learning time and the accuracy of the learned concept, suggesting Relief's practicality.
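The abstract does not spell out the algorithm, so the following is only a minimal sketch of Relief as it is commonly described: sample an instance, find its nearest neighbour of the same class (near hit) and of a different class (near miss), and credit features that differ on the miss while penalising those that differ on the hit. The function name `relief`, its parameters, the range-based scaling, and the toy parity data are all assumptions made for illustration, not details taken from the paper.

```python
import random


def relief(X, y, n_samples, seed=0):
    """Return one relevance weight per feature (higher means more relevant).

    Hypothetical sketch: assumes numeric features and at least two
    instances per class, so every sample has a near hit and a near miss.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    # Scale differences to [0, 1] by each feature's range (1.0 if constant).
    spans = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0
             for j in range(p)]

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(j, a, b) ** 2 for j in range(p))

    weights = [0.0] * p
    for _ in range(n_samples):
        i = rng.randrange(n)
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        near_hit = X[min(hits, key=lambda k: dist(X[i], X[k]))]
        near_miss = X[min(misses, key=lambda k: dist(X[i], X[k]))]
        for j in range(p):
            # Reward separation from the near miss, penalise it from the near hit.
            gain = diff(j, X[i], near_miss) ** 2 - diff(j, X[i], near_hit) ** 2
            weights[j] += gain / n_samples
    return weights


# Toy parity data (illustrative only): class = x0 XOR x1, x2 is irrelevant.
X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [a ^ b for a, b, c in X]
weights = relief(X, y, n_samples=8)
print(weights)
```

For a fixed number of sampled instances, each iteration scans every training instance and every feature once, which is consistent with the linear-time claim in the abstract. A selection step would then keep the features whose weight exceeds a chosen threshold; that threshold is left out of this sketch.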
Cited by (2453)
A new filter-based gene selection approach in the DNA microarray domain
2024, Expert Systems with Applications
The high dimensionality of data hinders the learning ability of machine learning algorithms. Feature selection techniques can be used to reduce dimensionality, which is an important step for processing high-dimensional data. Feature selection solves this problem by removing irrelevant and redundant information, which can improve learning models, reduce calculation time, and improve learning accuracy. In this paper, a novel filter in mixed-attribute datasets for feature selection is proposed. The independent attributes are mixed or heterogeneous in the sense that both numerical and categorical attribute types may appear together in the same dataset. Based on the preordonnances theory, we use a new concept to quantify the relevance and redundancy of features even if there are heterogeneous (mixed-type) data. The technique for order preference by similarity to the ideal solution is one of the well-known multicriteria decision-making methods; it is utilized as a weighting and informative feature selection filter. To assess the effectiveness of the proposed method, several experiments, both simulated and real, are performed, including a comparison to other well-known filter methods. The experimental results show that, in most cases, the method yielded competitive results in comparison to other methods.
Intelligent assessment of atrial fibrillation gradation based on sinus rhythm electrocardiogram and baseline information
2024, Computer Methods and Programs in Biomedicine
Atrial fibrillation (AF) is a progressive arrhythmia that significantly affects a patient's quality of life. The 4S-AF scheme is clinically recommended for AF management; however, the evaluation process is complex and time-consuming. This renders its promotion in primary medical institutions challenging. This retrospective study aimed to simplify the evaluation process and present an objective assessment model for AF gradation.
In total, 189 12-lead electrocardiogram (ECG) recordings from 64 patients were included in this study. The data were annotated into two groups (mild and severe) according to the 4S-AF scheme. Using a preprocessed ECG during the sinus rhythm (SR), we obtained a synthesized vectorcardiogram (VCG). Subsequently, various features were calculated from both signals, and age, sex, and medical history were included as baseline characteristics. Different machine learning models, including support vector machines, random forests (RF), and logistic regression, were finally tested with a combination of feature selection techniques.
The proposed method demonstrated excellent performance in the classification of AF gradation. With an optimized feature set of VCG and baseline features, the RF model achieved accuracy, sensitivity, and specificity of 83.02 %, 80.56 %, and 88.24 %, respectively, under the inter-patient paradigm.
Our results demonstrate the value of physiological signals in AF gradation evaluation, and VCG signals were effective in identifying mild and severe AF. Considering its low computational complexity and high assessment performance, the proposed model is expected to serve as a useful prognostic tool for clinical AF management.
Predictive machine learning in earth pressure balanced tunnelling for main drive torque estimation of tunnel boring machines
2024, Tunnelling and Underground Space Technology
Designing the main drive motor capacity of Earth Pressure Balanced Tunnel Boring Machines (EPB TBMs) is a crucial task for every EPB tunnelling project. The machine needs to be equipped with sufficient power to master the geotechnical conditions of the respective project. On the other hand, overpowering the machine should be avoided for economic and sustainability reasons. Main drive torque estimation for EPB TBMs is challenging due to a multitude of impact factors and reciprocal mechanisms between the geotechnical conditions and the tunnelling process. In EPB TBM tunnelling, active tunnel face support is achieved in soft and mixed ground or weak and unstable rock by generating a pressurized earth paste in the tool gap and excavation chamber of the machine. Complexity arises due to tribological and rheological effects of the active tunnel face support. Owing to these elements of uncertainty, the expected main drive torque is frequently overestimated to prevent the machine from jamming in the ground. Mean main drive torque values often lie below 50 % of the installed nominal main drive torque capacity. In this research, machine learning algorithms, such as regressions, decision trees, tree ensembles, support vector machines and Gaussian process regressions, have been used to predict the main drive torque. Models have been trained and tested on data collected from 9 different reference projects and validated on the data of 3 additional reference projects to test the transferability of the model. TBM diameters of the reference projects vary between 6.5 and 15.9 m, and the TBMs have been operating in a wide range of geotechnical boundary conditions. Different feature selection algorithms have been used, and prediction results have been compared to models trained on manually selected features. Models using tree ensembles and manually selected features showed the best prediction results and model performance.
The machine learning approach returned a smaller and more accurate torque estimation range than traditional estimation approaches, and prediction accuracy has been improved. Transparent and robust tree ensembles proved to be suitable tools for TBM torque estimation.
An intelligent fault detection and diagnosis model for refrigeration systems with a comprehensive feature selection method
2024, International Journal of Refrigeration
Feature selection and model establishment are two essential steps for fault detection and diagnosis (FDD) of refrigeration systems. A robust and powerful FDD model combined with a suitable feature selection method can exhibit excellent performance in FDD tasks for refrigeration systems. In this study, a novel FDD method that integrates a comprehensive feature selection method and a deep learning-based intelligent FDD model is proposed. The comprehensive feature selection method comprises three steps and combines filter methods and wrapper methods. It can optimize the features and the model jointly by using a multi-objective optimization algorithm to achieve better performance. In addition, a novel FDD model that combines a one-dimensional convolutional neural network (1D-CNN) and a self-attention (SA) mechanism is proposed. To evaluate the proposed method, experiments are performed on a miniature refrigeration system under 4 situations with multiple working conditions, forming a dataset for the FDD study. The proposed three-step feature selection method is utilized to obtain the best feature subset. The 1D-CNN and SA FDD model is constructed and jointly optimized with the features. Several comparisons are carried out to demonstrate the effectiveness and superiority of the proposed feature selection method and the FDD model. The results demonstrate that the presented integrated optimization achieved a test accuracy of around 99.66 %, surpassing other popular FDD models including MLP, CNN, and LSTM.
Automatic performance assessment in Virtual Reality medical simulators: A model based on procedure trajectories and machine learning
2024, Expert Systems with Applications
Virtual Reality (VR) simulators for medical training help students develop the sensorimotor skills required to perform specific procedures. However, as a training resource, a VR simulator usually lacks the capacity to assess the user’s performance automatically and provide feedback. The main objective of this study is to develop a model to automatically assess performance in VR medical simulators, based on the medical instrument trajectories and machine learning. We designed a model to use the data collected in medical VR simulators and defined two different labeling systems: by level of expertise or by a dental instructor's assessment. We calculated 98 features related to the participant’s performance and combined three different feature selection/fusion algorithms with five classifiers. The SVM algorithm achieved the best results overall, with an accuracy of 0.77, specificity of 0.60 and sensitivity of 0.94 using the labeling by the dental instructor’s assessment. Overall, the results for the performance assessment were promising, and the model can be trained for similar VR medical simulators that collect trajectory data through a haptic device.
Predicting the crack repair rate of self-healing concrete using soft-computing tools
2024, Materials Today Communications
To tackle the challenges of cracking and insufficient durability in traditional concrete, researchers are exploring self-healing concrete (SHC) as a potential alternative material. However, laboratory studies can incur significant expense and consume considerable time. Machine learning (ML) algorithms can therefore help improve predictions for self-healing concrete. This study aimed to create ML models, specifically adaptive boosting (AB), artificial neural network (ANN), and gradient boosting (GB), to forecast and evaluate the repair rate of the cracked region in SHC that incorporates fibers and bacteria in its composition. To further evaluate the importance of inputs, RReliefF analysis was performed. The findings indicated that the AB algorithm outperformed the GB and ANN algorithms, achieving a coefficient of determination (R2) value of 0.987, a mean absolute error (MAE) value of 0.001 mm, and a root mean square error (RMSE) value of 0.026 mm. The GB approach yielded an R2 value of 0.962, an MAE value of 0.035 mm, and an RMSE value of 0.044 mm. Similarly, the ANN approach yielded an R2 value of 0.943, an MAE value of 0.040 mm, and an RMSE value of 0.054 mm. These results show that the AB, GB, and ANN algorithms all performed well in terms of prediction accuracy and model fit. Hence, these models can be employed to construct and verify advanced SHC compositions that rely on polymeric fibers and bacteria, aiming to achieve superior performance. However, comparing the outcomes of the developed models revealed that the AB model is more accurate than the GB and ANN models for predicting the area repair rate of cracks.