MDPI - Publisher of Open Access Journals

27 pages, 582 KiB

Open Access Article

An Empirical Evaluation of Ensemble Models for Python Code Smell Detection

by Rajwant Singh Rao, Seema Dewangan and Alok Mishra

Appl. Sci. 2025, 15(13), 7472; https://doi.org/10.3390/app15137472 - 3 Jul 2025

Viewed by 320

Code smells, which represent poor design choices or suboptimal code implementations, reduce software quality and hinder code maintenance. Detecting code smells is, therefore, essential during software development. This study introduces a Python-based code smell dataset targeting two smell types: Large Class and Long Method. Five ensemble learning methods (Bagging, Gradient Boost, Max Voting, AdaBoost, and XGBoost) were employed to detect code smells within these datasets. The ten most significant features were selected using the Chi-square feature selection technique. To address class imbalance, the SMOTE algorithm was applied. Experimental results yielded a best accuracy score of 0.96 and an MCC of 0.85 for the Large Class dataset using the Max Voting model. For the Long Method dataset, a best accuracy score of 0.98 and an MCC of 0.94 were achieved using the Gradient Boost model in conjunction with Chi-square feature selection. These results highlight the effectiveness of the proposed methodology and its potential to significantly enhance code smell detection in Python.
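The chi-square selection step mentioned in this abstract can be illustrated with a minimal sketch. It scores each non-negative feature against the class labels using the same observed/expected scheme as scikit-learn's `chi2` and keeps the top k. The metric columns and values below are hypothetical toy data, not the study's dataset, and the function names are illustrative.

```python
from collections import defaultdict


def chi2_scores(X, y):
    """Chi-square score of each non-negative feature against the class labels.

    Observed = per-class sum of the feature; expected = class frequency
    times the feature's overall sum.
    """
    n_samples = len(X)
    n_features = len(X[0])
    class_count = defaultdict(int)
    for label in y:
        class_count[label] += 1

    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in X)
        observed = defaultdict(float)
        for row, label in zip(X, y):
            observed[label] += row[j]
        score = 0.0
        for label, count in class_count.items():
            expected = total * count / n_samples
            if expected > 0:
                score += (observed[label] - expected) ** 2 / expected
        scores.append(score)
    return scores


def top_k_features(X, y, k):
    """Indices of the k highest-scoring features, best first."""
    scores = chi2_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Hypothetical toy metrics: [lines_of_code, n_methods, constant_column]
X = [[900, 40, 1], [850, 35, 1], [120, 5, 1], [100, 6, 1]]
y = [1, 1, 0, 0]  # 1 = smelly, 0 = clean
print(top_k_features(X, y, k=2))  # the constant column scores 0 and is dropped
```

A column that is identical across classes has observed equal to expected, so its score is zero and it falls to the bottom of the ranking.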

(This article belongs to the Special Issue Intelligent Software Engineering: Innovations, Challenges, and Applications)

19 pages, 4132 KiB

Open Access Article

Comparative Analysis of Deep Learning-Based Feature Extraction and Traditional Classification Approaches for Tomato Disease Detection

by Hakan Terzioğlu, Adem Gölcük, Adnan Mohammad Anwer Shakarji and Mateen Yilmaz Al-Bayati

Agronomy 2025, 15(7), 1509; https://doi.org/10.3390/agronomy15071509 - 21 Jun 2025

Viewed by 414

Abstract

In recent years, advances in artificial intelligence, particularly deep learning, have increasingly been integrated into agricultural applications, including critical processes such as disease detection. Tomato, one of the most widely consumed agricultural products globally and highly susceptible to a variety of fungal, bacterial, and viral pathogens, remains a prominent focus of disease detection research. In this study, we propose a deep learning-based approach for the detection of tomato diseases. We constructed an original dataset comprising 6414 images captured under real production conditions, categorized into three image types: leaves, green tomatoes, and red tomatoes. The dataset includes five classes: healthy samples, late blight, early blight, gray mold, and bacterial canker. Twenty-one deep learning models were evaluated, and the top five performers (EfficientNet-b0, NasNet-Large, ResNet-50, DenseNet-201, and Places365-GoogLeNet) were selected for feature extraction. From each model, 1000 deep features were extracted, and feature selection was conducted using MRMR, Chi-Square (Chi²), and ReliefF methods. The top 100 features from each selection technique were then used for reclassification with traditional machine learning classifiers under five-fold cross-validation. The highest test accuracy of 92.0% was achieved with EfficientNet-b0 features, Chi² selection, and the Fine KNN classifier. EfficientNet-b0 consistently outperformed other models, while the combination of NasNet-Large and Wide Neural Network yielded the lowest performance. These results demonstrate the effectiveness of combining deep learning-based feature extraction with traditional classifiers and feature selection techniques for robust detection of tomato diseases in real-world agricultural environments.
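The evaluation protocol described here (selected features fed to a nearest-neighbour classifier under five-fold cross-validation) can be sketched as follows. Reading "Fine KNN" as a 1-nearest-neighbour classifier is an assumption about the authors' setup, the fold assignment is a simple interleaved split chosen for brevity, and the two-dimensional points are invented stand-ins for the 100 selected deep features.

```python
import math


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; fold f tests samples with i % n_folds == f."""
    for fold in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == fold]
        train = [i for i in range(n_samples) if i % n_folds != fold]
        yield train, test


def predict_1nn(train_X, train_y, x):
    """Label of the single nearest training sample (Euclidean distance)."""
    nearest = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[nearest]


def cross_val_accuracy(X, y, n_folds=5):
    """Mean accuracy of 1-NN over the folds."""
    fold_acc = []
    for train, test in k_fold_indices(len(X), n_folds):
        tx = [X[i] for i in train]
        ty = [y[i] for i in train]
        hits = sum(predict_1nn(tx, ty, X[i]) == y[i] for i in test)
        fold_acc.append(hits / len(test))
    return sum(fold_acc) / n_folds


# Invented, well-separated toy data: class 0 near (0, 0), class 1 near (10, 10).
X = [[(i % 2) * 10 + 0.1 * i, (i % 2) * 10 + 0.1 * i] for i in range(10)]
y = [i % 2 for i in range(10)]
print(cross_val_accuracy(X, y))
```

Each sample is used for testing exactly once, so the averaged score reflects performance on data the classifier did not see during that fold.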

(This article belongs to the Section Pest and Disease Management)

18 pages, 2629 KiB

Open Access Article

Ensemble Machine Learning Models Utilizing a Hybrid Recursive Feature Elimination (RFE) Technique for Detecting GPS Spoofing Attacks Against Unmanned Aerial Vehicles

by Raghad Al-Syouf, Omar Y. Aljarrah, Raed Bani-Hani and Abdallah Alma’aitah

Sensors 2025, 25(8), 2388; https://doi.org/10.3390/s25082388 - 9 Apr 2025

Viewed by 633

Abstract

The dependency of Unmanned Aerial Vehicles (UAVs), also known as drones, on off-board data, such as control and position data, makes them highly susceptible to serious safety and security threats, including data interception, Global Positioning System (GPS) jamming, and spoofing attacks. This necessitates an Intrusion Detection System (IDS) that detects potential security threats and intrusions promptly. Recently, machine-learning-based IDSs have gained popularity due to their high performance in detecting known as well as novel cyber-attacks. However, the time and computation efficiency of ML-based IDSs still presents a challenge in the UAV domain. Therefore, this paper proposes a hybrid Recursive Feature Elimination (RFE) technique based on feature importance ranking along with a Spearman Correlation Analysis (SCA). This technique is built on ensemble learning approaches, namely, bagging, boosting, stacking, and voting classifiers, to efficiently detect GPS spoofing attacks. Two benchmark datasets are employed: the GPS spoofing dataset and the UAV location GPS spoofing dataset. The results show that the proposed ensemble models achieved a notable balance between efficacy and efficiency, with the bagging classifier achieving the highest accuracy of 99.50%. At the same time, the Decision Tree (DT) and the bagging classifiers achieved the lowest processing times of 0.003 s and 0.029 s, respectively, on the GPS spoofing dataset. For the UAV location GPS spoofing dataset, the bagging classifier emerged as the top performer, achieving 99.16% accuracy and a 0.002 s processing time compared to other well-known ML models. In addition, the experimental results show that the proposed methodology (RFE) outperformed other well-known ML models built on conventional feature selection techniques for detecting GPS spoofing attacks, such as mutual information gain, correlation matrices, and the chi-square test.
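One way to combine an importance ranking with Spearman correlation, as the abstract outlines, is to walk the features from most to least important and drop any that is strongly rank-correlated with a feature already kept. The sketch below illustrates that idea only; it is not the authors' hybrid RFE procedure, and the feature names, values, and 0.9 threshold are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; assumes neither sequence is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))


def drop_redundant(columns, ranked_names, threshold=0.9):
    """Keep a feature only if |rho| with every already-kept feature is below threshold."""
    kept = []
    for name in ranked_names:
        if all(abs(spearman(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical signal features, listed from most to least important.
columns = {
    "snr": [1, 2, 3, 4, 5],
    "snr_sq": [1, 4, 9, 16, 25],   # monotone in snr, so rho = 1
    "doppler": [3, 1, 4, 2, 5],
}
print(drop_redundant(columns, ["snr", "snr_sq", "doppler"]))
```

Because Spearman's rho works on ranks, `snr_sq` is flagged as redundant even though its relationship to `snr` is not linear.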

(This article belongs to the Section Navigation and Positioning)

22 pages, 872 KiB

Open Access Article

Effective ML-Based Android Malware Detection and Categorization

by Areej Alhogail and Rawan Abdulaziz Alharbi

Electronics 2025, 14(8), 1486; https://doi.org/10.3390/electronics14081486 - 8 Apr 2025

Cited by 2 | Viewed by 1461

Abstract

The rapid proliferation of malware poses a significant challenge to digital security, necessitating the development of advanced techniques for malware detection and categorization. In this study, we investigate Android malware detection and categorization using a two-step machine learning (ML) framework combined with feature engineering. The proposed framework first performs binary categorization to detect malware and then applies multi-class categorization to assign detected malware to types such as adware, banking Trojans, SMS malware, and riskware. Feature selection techniques such as chi-squared testing and select-from-model (SFM) were employed to reduce dimensionality and enhance model performance. Various ML classifiers were evaluated, and the proposed model achieved an accuracy of 97.82% for malware detection and 96.09% for malware categorization. The proposed framework outperforms existing approaches, demonstrating the effectiveness of feature engineering and random forest (RF) models in addressing computational efficiency. This research contributes a robust and interpretable framework for Android malware detection that is resource-efficient and practical for use in real-world applications. It also offers a scalable approach through which practitioners can deploy efficient malware detection systems. Future work will focus on real-time implementation and adaptive methodologies to address evolving malware threats.
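The two-step structure described above, a binary detector followed by a multi-class categorizer that only sees detected samples, can be sketched as a small function. The two toy models and their thresholds are hypothetical stand-ins for trained classifiers; only the control flow reflects the abstract.

```python
from typing import Callable, Optional, Sequence

Features = Sequence[float]


def two_step_predict(
    features: Features,
    is_malware: Callable[[Features], bool],
    malware_family: Callable[[Features], str],
) -> Optional[str]:
    """Stage 1 decides benign vs malware; stage 2 runs only on detections."""
    if not is_malware(features):
        return None  # benign: the multi-class model is never consulted
    return malware_family(features)


# Hypothetical stand-ins for trained models.
def toy_detector(features: Features) -> bool:
    return features[0] > 0.5  # pretend feature 0 is a suspicious-permission score


def toy_categorizer(features: Features) -> str:
    return "sms_malware" if features[1] > 0.5 else "adware"


print(two_step_predict([0.9, 0.2], toy_detector, toy_categorizer))  # adware
print(two_step_predict([0.1, 0.9], toy_detector, toy_categorizer))  # None
```

Keeping the stages separate means the categorizer can be trained on malware samples only, and benign apps never pay the cost of the second model.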

(This article belongs to the Special Issue Artificial Intelligence in Cyberspace Security)

20 pages, 3271 KiB

Open Access Article

Fine-Tuned Machine Learning Classifiers for Diagnosing Parkinson’s Disease Using Vocal Characteristics: A Comparative Analysis

by Mehmet Meral, Ferdi Ozbilgin and Fatih Durmus

Diagnostics 2025, 15(5), 645; https://doi.org/10.3390/diagnostics15050645 - 6 Mar 2025

Viewed by 1300

Abstract

Background/Objectives: Early and precise diagnosis of Parkinson’s Disease (PD), which affects both motor and non-motor functions, is important for achieving better disease control and patient outcomes. This study seeks to assess the effectiveness of machine learning algorithms optimized to classify PD based on vocal characteristics to serve as a non-invasive and easily accessible diagnostic tool. Methods: This study used a publicly available dataset of vocal samples from 188 people with PD and 64 controls. Acoustic features such as baseline characteristics, time-frequency components, Mel Frequency Cepstral Coefficients (MFCCs), and wavelet transform-based metrics were extracted and analyzed. The Chi-Square test was used for feature selection to determine the most important attributes that enhanced the accuracy of the classification. Six different machine learning classifiers, namely SVM, k-NN, DT, NN, Ensemble and Stacking models, were developed and optimized via Bayesian Optimization (BO), Grid Search (GS) and Random Search (RS). Accuracy, precision, recall, F1-score and AUC-ROC were used for evaluation. Results: Stacking models, especially those fine-tuned via Grid Search, yielded the best performance with 92.07% accuracy and an F1-score of 0.95. In addition, the choice of relevant vocal features, in conjunction with the Chi-Square feature selection method, greatly enhanced the computational efficiency and classification performance. Conclusions: This study highlights the potential of combining advanced feature selection techniques with hyperparameter optimization strategies to enhance machine learning-based PD diagnosis using vocal characteristics. Ensemble models proved particularly effective in handling complex datasets, demonstrating robust diagnostic performance. Future research may focus on deep learning approaches and temporal feature integration to further improve diagnostic accuracy and scalability for clinical applications.
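Because the dataset is imbalanced (188 PD, 64 controls), it helps to know what a trivial classifier would score before reading the reported metrics. The sketch below computes accuracy, precision, recall, and F1 for a baseline that labels every speaker as PD, taking the class counts in the abstract at face value and applying them to the whole dataset rather than to whatever test split the authors used.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Baseline that labels all 252 speakers as PD:
# 188 true positives, 64 false positives, no predicted negatives.
acc, prec, rec, f1 = binary_metrics(tp=188, fp=64, fn=0, tn=0)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
# accuracy=0.746 precision=0.746 recall=1.000 f1=0.855
```

Against this baseline, the reported 92.07% accuracy and 0.95 F1-score represent a gain of roughly 17 and 10 points respectively, which is a more informative comparison than the raw figures alone.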

(This article belongs to the Special Issue Artificial Intelligence and Deep Learning in Clinical Classification and Prediction)

12 pages, 290 KiB

Open Access Article

Predictors of Complications in Radiofrequency Ablation for Hepatocellular Carcinoma: A Comprehensive Analysis of 1000 Cases

by Mohamed H. Farag, Mohamed H. Shaaban, Hamdy Abdelkader, Adel Al Fatease, Sara O. Elgendy and Hussein H. Okasha

Medicina 2025, 61(3), 458; https://doi.org/10.3390/medicina61030458 - 6 Mar 2025

Viewed by 758

Abstract

Background and Objectives: Primary liver cancer is a major cause of mortality, ranking third among the most fatal cancers. In Egypt, liver cancer constitutes 11.75% of gastrointestinal malignancies, with HCC representing 70.5% of cases. The landscape of HCC management was revolutionized by locoregional modalities, which offer a comparable alternative to conventional techniques, with low complications and minimal invasiveness. RFA is a technique that is suitable for early-stage lesions in the liver, with high overall survival and low complication rates. However, the associated complications can cause mortality and morbidity. Proper selection of patients may avoid such complications. This study presents a five-year experience of radiofrequency ablation (RFA) for hepatocellular carcinoma (HCC) in Egypt, analyzing the predictors of complications and the computed tomography (CT) features associated with complications post-ablation. Materials and Methods: The study included 1000 cases (84% males with a mean age of 60), with 90% having HCC. Exclusion criteria included prior chemoembolization and non-HCC primary hepatic tumors. Patients underwent RFA at Cairo University Hospital and two private centers from January 2014 to January 2019. The workup involved clinical assessments, lab tests, and CT scans. Complications were classified as major or minor. Statistical analysis was conducted via SPSS software Version 22.0, with associations evaluated using a chi-square test. A decision tree was employed to determine the predictive values for different variables associated with the complications. Results: Overall, the rate of complications was 4%, and mortality was low at 0.1%. Subcapsular lesions were associated with complications, as were the lesion size, site, Child–Pugh classification, and the number of RFA sessions. Decision tree analysis determined the size of a lesion to be the most predictive factor of major complications, whereas the site of the lesion predicted the occurrence of minor complications. Conclusions: RFA offers low complication rates; however, precise patient selection is critical. The approach and imaging modality choice influence the outcomes. Clinician experience enhances early complication detection, thereby allowing for effective treatments.
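The chi-square test of association used in this study can be illustrated on a 2x2 table. The sketch below computes the Pearson statistic and its p-value for one degree of freedom without continuity correction. It is a stand-in for the SPSS procedure, and the counts are invented for illustration: they are not taken from the paper, although the total of 40 complications in 1000 cases matches the 4% rate stated above.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom,
    no continuity correction) for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 d.o.f. the statistic is a squared standard normal,
    # so the upper tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Invented counts, NOT from the study.
# Rows: subcapsular / non-subcapsular lesion. Columns: complication / none.
stat, p = chi_square_2x2([[20, 180], [20, 780]])
print(f"chi2={stat:.4f}, p={p:.2e}")
```

In this made-up table, subcapsular lesions carry a 10% complication rate against 2.5% elsewhere, which yields a statistic of 23.4375 and a p-value far below 0.05.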

(This article belongs to the Special Issue Advances in the Diagnosis, Treatment and Prognosis of Hepatocellular Carcinoma)

22 pages, 5382 KiB

Open Access Article

Impact of Feature Selection Techniques on the Performance of Machine Learning Models for Depression Detection Using EEG Data

by Marwa Hassan and Naima Kaabouch

Appl. Sci. 2024, 14(22), 10532; https://doi.org/10.3390/app142210532 - 15 Nov 2024

Cited by 3 | Viewed by 2725

Abstract

Major depressive disorder (MDD) poses a significant challenge in mental healthcare due to difficulties in accurate diagnosis and timely identification. This study explores the potential of machine learning models trained on EEG-based features for depression detection. Six models and six feature selection techniques were compared, highlighting the crucial role of feature selection in enhancing classifier performance. The six feature selection methods are Elastic Net, Mutual Information (MI), Chi-Square, Forward Feature Selection with Stochastic Gradient Descent (FFS-SGD), Support Vector Machine-based Recursive Feature Elimination (SVM-RFE), and Minimal-Redundancy-Maximal-Relevance (mRMR). These methods were combined with six diverse classifiers: Logistic Regression, Support Vector Machine (SVM), Random Forest, Extreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), and Light Gradient Boosting Machine (LightGBM). The results demonstrate the substantial impact of feature selection on model performance. SVM-RFE with SVM achieved the highest accuracy (93.54%) and F1 score (95.29%), followed by SVM-RFE with Logistic Regression, with an accuracy of 92.86% and F1 score of 94.84%. Elastic Net also delivered strong results, with SVM and Logistic Regression both achieving 90.47% accuracy. Other feature selection methods yielded lower performance, emphasizing the importance of choosing appropriate feature selection and machine learning algorithms. These findings suggest that careful selection and application of feature selection techniques can significantly enhance the accuracy of EEG-based depression detection.
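Of the six selection methods listed, mutual information is the simplest to show in a few lines. The sketch below computes I(X; Y) in bits for a discrete feature and a class label. Real EEG features are continuous and would first need binning or a nearest-neighbour estimator, so the categorical toy values here are an assumption made for illustration.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """I(X; Y) in bits for two equal-length sequences of discrete values."""
    n = len(feature)
    count_x = Counter(feature)
    count_y = Counter(labels)
    count_xy = Counter(zip(feature, labels))
    mi = 0.0
    for (x, y), joint in count_xy.items():
        # joint / n is p(x, y); count_x[x] / n and count_y[y] / n are the marginals.
        mi += (joint / n) * math.log2(joint * n / (count_x[x] * count_y[y]))
    return mi


labels = [0, 0, 1, 1]                       # hypothetical: 0 = control, 1 = MDD
informative = ["low", "low", "high", "high"]  # determines the label
noise = ["a", "b", "a", "b"]                  # independent of the label
print(mutual_information(informative, labels))  # 1.0
print(mutual_information(noise, labels))        # 0.0
```

A feature that fully determines a balanced binary label carries one bit of information about it, while an independent feature carries none, which is why ranking by mutual information pushes uninformative features to the bottom.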

(This article belongs to the Section Computing and Artificial Intelligence)

23 pages, 2466 KiB

Open Access Editor’s Choice Article

Enhancing Regional Wind Power Forecasting through Advanced Machine-Learning and Feature-Selection Techniques

by Nabi Taheri and Mauro Tucci

Energies 2024, 17(21), 5431; https://doi.org/10.3390/en17215431 - 30 Oct 2024

Cited by 4 | Viewed by 1308

Abstract

This study presents an in-depth analysis of forecasting aggregated wind power production at the regional level, using advanced Machine-Learning (ML) techniques and feature-selection methods. The main problem consists of selecting the wind speed measuring points within a large region, as the wind plant locations are assumed to be unknown. For this purpose, the main cities (province capitals) are considered as possible features and four feature-selection methods are explored: Pearson correlation, Spearman correlation, mutual information, and Chi-squared test with Fisher score. The results demonstrate that proper feature selection significantly improves prediction performance, particularly when dealing with high-dimensional data and regional forecasting challenges. Additionally, the performance of five prominent machine-learning models is analyzed: Long Short-Term Memory (LSTM) networks, Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Convolutional Neural Networks (CNNs), and Extreme-Learning Machines (ELMs). Through rigorous testing, LSTM is identified as the most effective model for the case study in northern Italy. This study offers valuable insights into optimizing wind power forecasting models and underscores the importance of feature selection in achieving reliable and accurate predictions.
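The first of the four selection methods, Pearson correlation, can be sketched as a ranking of candidate measuring points by how strongly their wind speed tracks regional power output. The city names and series below are hypothetical placeholders, and ranking by absolute correlation is one plausible reading of the method rather than the authors' exact procedure.

```python
def pearson(a, b):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def rank_sites(wind_speed_by_city, regional_power, top_n):
    """Cities ordered by |Pearson r| between their wind speed and regional power."""
    scored = {
        city: abs(pearson(series, regional_power))
        for city, series in wind_speed_by_city.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_n]


# Hypothetical hourly readings for three candidate cities.
regional_power = [10, 20, 30, 40]
wind_speed_by_city = {
    "city_a": [1, 2, 3, 4],   # r = 1.0
    "city_b": [2, 1, 4, 3],   # r = 0.6
    "city_c": [4, 1, 3, 2],   # r = -0.4
}
print(rank_sites(wind_speed_by_city, regional_power, top_n=2))
```

Pearson correlation only captures linear association, which is presumably why the study also evaluates Spearman correlation and mutual information.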

(This article belongs to the Section A3: Wind, Wave and Tidal Energy)

18 pages, 4563 KiB

Open Access Article

Kashif: A Chrome Extension for Classifying Arabic Content on Web Pages Using Machine Learning

by Malak Aljabri, Hanan S. Altamimi, Shahd A. Albelali, Maimunah Al-Harbi, Haya T. Alhuraib, Najd K. Alotaibi, Amal A. Alahmadi, Fahd Alhaidari and Rami Mustafa A. Mohammad

Appl. Sci. 2024, 14(20), 9222; https://doi.org/10.3390/app14209222 - 11 Oct 2024

Cited by 1 | Viewed by 1618

Abstract

Search engines are significant tools for finding and retrieving information. Every day, many new web pages in various languages are added. The threats of cyberattacks are expanding rapidly with this massive volume of data. The majority of studies on the detection of malicious websites focus on English-language websites. This necessitates more studies on detecting malicious Arabic-content websites. In this research, we aimed to investigate the security of Arabic-content websites by developing a detection tool that analyzes Arabic content based on artificial intelligence (AI) techniques. We contributed to the field of cybersecurity and AI by building a new dataset of 4048 Arabic-content websites. We built four machine-learning (ML) models using feature extraction and selection techniques and compared their performance: extreme gradient boosting, support vector machines, decision trees, and random forests. The best-performing model, a random forest (RF) using the features selected via the chi-square technique, was then integrated into a Chrome plugin. The resulting plugin attained an accuracy of 92.96% for classifying Arabic-content websites as phishing, suspicious, or benign. To our knowledge, this is the first tool designed specifically for Arabic-content websites.

(This article belongs to the Special Issue Data Mining and Machine Learning in Cybersecurity)

19 pages, 2353 KiB

Open Access Article

Enhancing Non-Small Cell Lung Cancer Survival Prediction through Multi-Omics Integration Using Graph Attention Network

by Murtada K. Elbashir, Abdullah Almotilag, Mahmood A. Mahmood and Mohanad Mohammed

Diagnostics 2024, 14(19), 2178; https://doi.org/10.3390/diagnostics14192178 - 29 Sep 2024

Cited by 4 | Viewed by 2564

Abstract

Background: Cancer survival prediction is vital in improving patients’ prospects and recommending therapies. Understanding the molecular behavior of cancer can be enhanced through the integration of multi-omics data, including mRNA, miRNA, and DNA methylation data. Using these multi-omics data, we propose a graph attention network (GAT) model to predict the survival of patients with non-small cell lung cancer (NSCLC). Methods: The different omics data were obtained from The Cancer Genome Atlas (TCGA), preprocessed, and combined into a single dataset using the sample ID. We used the chi-square test to select the most significant features to be used in our model. We used the synthetic minority oversampling technique (SMOTE) to balance the dataset and the concordance index (C-index) to measure the performance of our model on different combinations of omics data. Results: Our model demonstrated superior performance, with the highest value of the C-index obtained when we used both mRNA and miRNA data. This demonstrates that the multi-omics approach could be effective in predicting survival. Further pathway analysis conducted with KEGG showed that our GAT model assigned high weights to the features that are associated with the viral entry pathways, such as the Epstein–Barr virus and Influenza A pathways, which are involved in lung cancer development. Our findings show that the proposed GAT model leads to a significantly improved prediction of survival by exploiting the strengths of multiple omics datasets and the findings from the enriched pathways. Our GAT model outperforms other state-of-the-art methods that are used for NSCLC prediction. Conclusions: In this study, we developed a new model for the survival prediction of NSCLC using the GAT based on multi-omics data. Our model showed outstanding predictive values, and the KEGG analysis of the selected significant features showed that they were implicated in pivotal biological processes underlying pathways such as Influenza A and the Epstein–Barr virus infection, which are linked to lung cancer progression.
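The concordance index used to evaluate this model measures how often a higher predicted risk goes with a shorter observed survival time. The sketch below is a minimal O(n²) version of Harrell's C-index that ignores tied survival times; the four patients and their risk scores are invented for illustration.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index.

    A pair (i, j) is comparable when the subject with the shorter time had an
    observed event. It is concordant when that subject also has the higher
    predicted risk; tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Invented example: survival time in months, event flag (1 = death observed,
# 0 = censored) and a model's predicted risk for four patients.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]
print(concordance_index(times, events, risk))  # 0.8
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering. Here five pairs are comparable and four are ordered correctly; the censored patient contributes only as the longer-surviving member of a pair.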

(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

20 pages, 3320 KiB

Open Access Article

Characterization of Maize, Common Bean, and Avocado Crops under Abiotic Stress Factors Using Spectral Signatures on the Visible to Near-Infrared Spectrum

by Manuel Goez, Maria C. Torres-Madronero, Tatiana Rondon, Manuel A. Guzman, Maria Casamitjana and Juan Manuel Gonzalez

Agronomy 2024, 14(10), 2228; https://doi.org/10.3390/agronomy14102228 - 27 Sep 2024

Viewed by 1018

Abstract

Abiotic stress factors can be detected using visible and near-infrared spectral signatures. Previous work demonstrated the potential of this technology in crop monitoring, although a large majority used vegetation indices, which did not consider the complete spectral information. This work explored the capabilities of spectral information for abiotic stress detection using supervised machine learning techniques such as support vector machine (SVM), random forest (RF), and neural network (NN). This study used avocados grown under various water treatments, maize subjected to nitrogen deficiency, and common beans under phosphorus restriction. The spectral characterization of the crops subjected to abiotic stress was studied on the visible to near-infrared (450 to 900 nm) spectrum, identifying discriminative bands and spectral ranges. Then, the advantages of using an integrated approach based on machine learning to detect abiotic stress in crops were demonstrated. Instead of relying on vegetation indices, the proposed approach used several spectral features obtained by analyzing the discriminative signature shape, applying a spectral subset band selection algorithm based on similarity, and using the minimum redundancy maximum relevance (MRMR), F-test and chi-square test ranks for feature selection. The results showed that supervised classifiers applied to the spectral features outperform the accuracies obtained from vegetation indices. The best common bean results were obtained using SVM with accuracies up to 91%; for maize and avocado, NN obtained 90% and 82%, respectively. It is noted that detection accuracy depends on various factors, such as crop type, genotype, and level of stress.
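The F-test ranking mentioned among the selection methods scores each spectral band by a one-way ANOVA F statistic: between-group variance divided by within-group variance. The sketch below shows that computation for a single band. The band values are made up on an arbitrary scale, and the function assumes at least some within-group spread (otherwise the denominator is zero).

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature across the label groups."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n = len(values)
    k = len(groups)
    grand_mean = sum(values) / n
    # Between-group sum of squares: group size times squared mean offset.
    between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Within-group sum of squares: spread of each value around its group mean.
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    return (between / (k - 1)) / (within / (n - k))


labels = ["control"] * 3 + ["stress"] * 3
band_a = [1, 2, 3, 7, 8, 9]   # group means differ strongly
band_b = [1, 5, 9, 2, 5, 8]   # identical group means
print(anova_f(band_a, labels), anova_f(band_b, labels))  # 54.0 0.0
```

Bands are then sorted by F, so a band like `band_a`, whose reflectance separates stressed from control plants, ranks above one like `band_b`.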

(This article belongs to the Section Agricultural Biosystem and Biological Engineering)

16 pages, 1311 KiB

Open Access Article

Hybrid Predictive Machine Learning Model for the Prediction of Immunodominant Peptides of Respiratory Syncytial Virus

by Syed Nisar Hussain Bukhari and Kingsley A. Ogudo

Bioengineering 2024, 11(8), 791; https://doi.org/10.3390/bioengineering11080791 - 5 Aug 2024

Viewed by 1966

Abstract

Respiratory syncytial virus (RSV) is a common respiratory pathogen that infects the human lungs and respiratory tract, often causing symptoms similar to the common cold. Vaccination is the most effective strategy for managing viral outbreaks. Currently, extensive efforts are focused on developing a vaccine for RSV. Traditional vaccine design typically involves using an attenuated form of the pathogen to elicit an immune response. In contrast, peptide-based vaccines (PBVs) aim to identify and chemically synthesize specific immunodominant peptides (IPs), known as T-cell epitopes (TCEs), to induce a targeted immune response. Despite their potential for enhancing vaccine safety and immunogenicity, PBVs have received comparatively less attention. Identifying IPs for PBV design through conventional wet-lab experiments is challenging, costly, and time-consuming. Machine learning (ML) techniques offer a promising alternative, accurately predicting TCEs and significantly reducing the time and cost of vaccine development. This study proposes the development and evaluation of eight hybrid ML predictive models created by combining two classification methods, two feature weighting techniques, and two feature selection algorithms, all aimed at predicting the TCEs of RSV. The models were trained using the experimentally determined TCEs and non-TCE sequences acquired from the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) repository. The hybrid model composed of the XGBoost (XGB) classifier, chi-squared (ChST) weighting technique, and backward search (BST) as the optimal feature selection algorithm (ChST-BST-XGB) was identified as the best model, achieving an accuracy, sensitivity, specificity, F1 score, AUC, precision, and MCC of 97.10%, 0.98, 0.97, 0.98, 0.99, 0.99, and 0.96, respectively. Additionally, K-fold cross-validation (KFCV) was performed to ensure the model’s reliability, and an average accuracy of 97.21% was recorded for the ChST-BST-XGB model. The results indicate that the hybrid XGBoost model consistently outperforms other hybrid approaches. The epitopes predicted by the proposed model may serve as promising vaccine candidates for RSV, subject to in vitro and in vivo scientific assessments. This model can assist the scientific community in expediting the screening of active TCE candidates for RSV, ultimately saving time and resources in vaccine development.
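Several of the metrics reported above derive from the same four confusion-matrix counts. The sketch below computes sensitivity, specificity, and the Matthews correlation coefficient (MCC) from them. The counts are hypothetical and chosen for easy arithmetic; they are not the study's confusion matrix.

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity and Matthews correlation coefficient."""
    sensitivity = tp / (tp + fn)   # share of true epitopes that were found
    specificity = tn / (tn + fp)   # share of non-epitopes correctly rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts for 100 peptides (50 epitopes, 50 non-epitopes).
sens, spec, mcc = confusion_metrics(tp=45, fp=5, fn=5, tn=45)
print(sens, spec, mcc)  # 0.9 0.9 0.8
```

MCC uses all four cells of the confusion matrix, so it stays informative when classes are imbalanced, which is one reason it is often reported alongside accuracy.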

(This article belongs to the Special Issue Machine Learning Technology in Predictive Healthcare)

23 pages, 886 KiB

Open Access Article

Combining Advanced Feature-Selection Methods to Uncover Atypical Energy-Consumption Patterns

by Lucas Henriques, Felipe Prata Lima and Cecilia Castro

Future Internet 2024, 16(7), 229; https://doi.org/10.3390/fi16070229 - 28 Jun 2024

Cited by 2 | Viewed by 3638

Abstract

Understanding household energy-consumption patterns is essential for developing effective energy-conservation strategies. This study aims to identify ‘out-profiled’ consumers (households that exhibit atypical energy-usage behaviors) by applying four distinct feature-selection methodologies. Specifically, we utilized the chi-square independence test to assess feature independence, recursive feature elimination with multinomial logistic regression (RFE-MLR) to identify optimal feature subsets, random forest (RF) to determine feature importance, and a combined fuzzy rough feature selection with fuzzy rough nearest neighbors (FRFS-FRNN) for handling uncertainty and imprecision in data. These methods were applied to a dataset based on a survey of 383 households in Brazil, capturing various factors such as household size, income levels, geographical location, and appliance usage. Our analysis revealed that key features such as the number of people in the household, heating and air conditioning usage, and income levels significantly influence energy consumption. The novelty of our work lies in the comprehensive application of these advanced feature-selection techniques to identify atypical consumption patterns in a specific regional context. The results showed that households without heating and air conditioning equipment in medium- or high-consumption profiles, and those with lower- or medium-income levels in medium- or high-consumption profiles, were considered out-profiled. These findings provide actionable insights for energy providers and policymakers, enabling the design of targeted energy-conservation strategies. This study demonstrates the importance of tailored approaches in promoting sustainable energy consumption and highlights notable deviations in energy-use patterns, offering a foundation for future research and policy development.

17 pages, 1112 KiB

Open Access Article

Gene Expression-Based Cancer Classification for Handling the Class Imbalance Problem and Curse of Dimensionality

by Sadam Al-Azani, Omer S. Alkhnbashi, Emad Ramadan and Motaz Alfarraj

Int. J. Mol. Sci. 2024, 25(4), 2102; https://doi.org/10.3390/ijms25042102 - 9 Feb 2024

Cited by 9 | Viewed by 2081

Abstract

Cancer is a leading cause of death globally. The majority of cancer cases are only diagnosed in the late stages of cancer due to the use of conventional methods. This reduces the chance of survival for cancer patients. Therefore, early detection followed by early diagnosis is an important task in cancer research. Gene expression microarray technology has been applied to detect and diagnose most types of cancers in their early stages and has produced encouraging results. In this paper, we address the problem of classifying cancer based on gene expression while handling the class imbalance problem and the curse of dimensionality. Oversampling is used to overcome class imbalance by adding synthetic samples. The curse of dimensionality, another common issue with gene expression datasets, is addressed by applying chi-square and information gain feature selection techniques. After applying these techniques individually, we proposed a method to select the most significant genes by combining those two techniques (CHiS and IG). We investigated the effect of these techniques individually and in combination. Four benchmarking biomedical datasets (Leukemia-subtypes, Leukemia-ALLAML, Colon, and CuMiDa) were used. The experimental results reveal that the oversampling techniques improve the results in most cases. Additionally, the proposed feature selection technique outperforms the individual techniques in nearly all cases. This study also provides an empirical evaluation of several oversampling techniques along with ensemble-based learning. The experimental results also reveal that SVM-SMOTE, along with the random forests classifier, achieved the highest results, with a reported accuracy of 100%. The obtained results surpass the findings in the existing literature as well.

(This article belongs to the Section Molecular Biophysics)

17 pages, 3022 KiB

Open Access Article

An Optimized Hybrid Approach for Feature Selection Based on Chi-Square and Particle Swarm Optimization Algorithms

by Amani Abdo, Rasha Mostafa and Laila Abdel-Hamid

Data 2024, 9(2), 20; https://doi.org/10.3390/data9020020 - 25 Jan 2024

Cited by 10 | Viewed by 3611

Abstract

Feature selection is a significant issue in the machine learning process. Most datasets include features that are not needed for the problem being studied. These irrelevant features reduce both the efficiency and accuracy of the algorithm. Feature selection can be framed as an optimization problem, and swarm intelligence algorithms are promising techniques for solving it. This research paper presents a hybrid approach for tackling the problem of feature selection. A filter method (chi-square) and two wrapper swarm intelligence algorithms (grey wolf optimization (GWO) and particle swarm optimization (PSO)) are used in two different techniques to improve feature selection accuracy and system execution time. The performance of the two phases of the proposed approach is assessed using two distinct datasets. The results show that PSOGWO yields a maximum accuracy boost of 95.3%, while chi2-PSOGWO yields a maximum accuracy improvement of 95.961% for feature selection. The experimental results show that the proposed approach performs better than the compared approaches.

(This article belongs to the Section Information Systems and Data Management)

Search Results (50)
