MDPI - Publisher of Open Access Journals

23 pages, 9844 KiB

Open Access Article

Mechanistic Exploration of Aristolochic Acid I-Induced Hepatocellular Carcinoma: Insights from Network Toxicology, Machine Learning, Molecular Docking, and Molecular Dynamics Simulation

by Tiantaixi Tu, Tongtong Zheng, Hangqi Lin, Peifeng Cheng, Ye Yang, Bolin Liu, Xinwang Ying and Qingfeng Xie

Toxins 2025, 17(8), 390; https://doi.org/10.3390/toxins17080390 - 5 Aug 2025

Abstract

This study explores how aristolochic acid I (AAI) drives hepatocellular carcinoma (HCC). We first employ network toxicology and machine learning to map the key molecular target genes. Next, we use molecular docking to evaluate how AAI binds to these targets, and finally confirm the stability and dynamics of the resulting complexes through molecular dynamics simulations. We identified 193 overlapping target genes between AAI and HCC through databases such as PubChem, OMIM, and ChEMBL. Machine learning algorithms (SVM-RFE, random forest, and LASSO regression) were employed to screen 11 core genes. LASSO serves as a rapid dimension-reduction tool, SVM-RFE recursively eliminates the features with the smallest weights, and random forest achieves ensemble learning through decision trees. Protein–protein interaction networks were constructed using Cytoscape 3.9.1, and key genes were validated through GO and KEGG enrichment analyses, an immune infiltration analysis, a drug sensitivity analysis, and a survival analysis. Molecular docking experiments showed that AAI binds to each of the core targets with a binding affinity stronger than −5 kcal mol⁻¹, and subsequent molecular dynamics simulations verified that these complexes remain stable over time. This study determined the potential molecular mechanisms underlying AAI-induced HCC and identified key genes (CYP1A2, ESR1, and AURKA) as potential therapeutic targets, providing valuable insights for developing targeted strategies to mitigate the health risks associated with AAI exposure.
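The elimination loop that the abstract attributes to SVM-RFE (refit, drop the feature with the smallest weight, repeat) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: a plain least-squares model fitted by gradient descent stands in for the SVM, and the toy data, the feature names `f0` to `f3`, and the helper names are all hypothetical.

```python
from typing import List, Sequence


def fit_linear_weights(X: Sequence[Sequence[float]], y: Sequence[float],
                       epochs: int = 500, lr: float = 0.05) -> List[float]:
    """Fit least-squares weights (no intercept) by batch gradient descent.

    Stand-in for the linear SVM used in SVM-RFE (assumption for illustration).
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for row, target in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, row)) - target
            for j in range(d):
                grad[j] += err * row[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def rfe(X: Sequence[Sequence[float]], y: Sequence[float],
        names: Sequence[str], n_keep: int) -> List[str]:
    """Recursive feature elimination: refit, drop the smallest |weight|, repeat."""
    active = list(range(len(names)))
    while len(active) > n_keep:
        sub = [[row[j] for j in active] for row in X]
        w = fit_linear_weights(sub, y)
        weakest = min(range(len(active)), key=lambda k: abs(w[k]))
        del active[weakest]
    return [names[j] for j in active]


# Hypothetical toy data: the target depends only on f0 and f2.
f0 = [1, 1, 1, 1, -1, -1, -1, -1]
f1 = [1, 1, -1, -1, 1, 1, -1, -1]
f2 = [1, -1, 1, -1, 1, -1, 1, -1]
f3 = [a * b for a, b in zip(f0, f1)]
X = [list(row) for row in zip(f0, f1, f2, f3)]
y = [3 * a + 2 * c for a, c in zip(f0, f2)]
names = ["f0", "f1", "f2", "f3"]

selected = rfe(X, y, names, n_keep=2)
```

Because the model is refitted after every removal, the ranking can change between rounds, which is what distinguishes RFE from a one-shot filter on the initial weights.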

(This article belongs to the Section Plant Toxins)

16 pages, 1493 KiB

Open Access Article

Baseline Radiomics as a Prognostic Tool for Clinical Benefit from Immune Checkpoint Inhibition in Inoperable NSCLC Without Activating Mutations

by Fedor Moiseenko, Marko Radulovic, Nadezhda Tsvetkova, Vera Chernobrivceva, Albina Gabina, Any Oganesian, Maria Makarkina, Ekaterina Elsakova, Maria Krasavina, Daria Barsova, Elizaveta Artemeva, Valeria Khenshtein, Natalia Levchenko, Viacheslav Chubenko, Vitaliy Egorenkov, Nikita Volkov, Alexei Bogdanov and Vladimir Moiseyenko

Cancers 2025, 17(11), 1790; https://doi.org/10.3390/cancers17111790 - 27 May 2025

Viewed by 497

Abstract

Background/Objectives: Immune checkpoint inhibitors (ICIs) are key therapies for NSCLC, but current selection criteria, such as excluding mutation carriers and assessing PD-L1, lack sensitivity. As a result, many patients receive costly treatments with limited benefit. Therefore, this study aimed to predict which NSCLC patients would achieve durable survival (≥24 months) with immunotherapy. Methods: A comprehensive ensemble radiomics approach was applied to pretreatment CT scans to prognosticate overall survival (OS) and predict progression-free survival (PFS) in a cohort of 220 consecutive patients with inoperable NSCLC treated with first-line ICIs (pembrolizumab, atezolizumab, nivolumab, or prolgolimab) as monotherapy or in combination. The radiomics pipeline evaluated four normalization methods (none, min-max, Z-score, mean), four feature selection techniques (ANOVA, RFE, Kruskal–Wallis, Relief), and ten classifiers (e.g., SVM, random forest). Using two to eight radiomics features, 1680 models were built in the Feature Explorer (FAE) Python package. Results: Three feature sets were evaluated: clinicopathological (CP) only, radiomics only, and a combined set, using 6- and 12-month PFS and 24-month OS endpoints. The top 15 models were ensembled by averaging their probability scores. The best performance was achieved at 24-month OS with the combined CP and radiomics ensemble (AUC = 0.863, accuracy = 85%), followed by radiomics-only (AUC = 0.796, accuracy = 82%) and CP-only (AUC = 0.671, accuracy = 76%). Predictive performance was lower for the 6-month (AUC = 0.719) and 12-month PFS (AUC = 0.739) endpoints. Conclusions: Our radiomics pipeline improved selection of NSCLC patients for immunotherapy and could spare non-responders unnecessary toxicity while enhancing cost-effectiveness.
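The ensembling step described above, averaging the probability scores of the top models, can be illustrated with a short sketch. The scores, the three-model ensemble (the study used 15), the 0.5 decision threshold, and the function names are hypothetical and are not taken from the FAE package.

```python
from statistics import fmean
from typing import List, Sequence


def average_ensemble(prob_by_model: Sequence[Sequence[float]]) -> List[float]:
    """Average per-patient probabilities across models (one row per model)."""
    n_patients = len(prob_by_model[0])
    if any(len(p) != n_patients for p in prob_by_model):
        raise ValueError("every model must score the same patients")
    return [fmean([model[i] for model in prob_by_model])
            for i in range(n_patients)]


def classify(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn averaged probabilities into 0/1 labels (threshold is an assumption)."""
    return [int(p >= threshold) for p in probs]


# Hypothetical probability scores: 3 models (rows) x 3 patients (columns).
scores = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.4],
    [0.8, 0.3, 0.8],
]
ensemble = average_ensemble(scores)
labels = classify(ensemble)
```

Averaging probabilities rather than hard votes lets a confident model outweigh a marginal one, which is one common reason to prefer this "soft" form of ensembling.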

(This article belongs to the Special Issue Enhancing Precision in Cancer Treatment: AI-Driven Innovations in Imaging)

22 pages, 1088 KiB

Open Access Article

Intelligent Feature Selection Ensemble Model for Price Prediction in Real Estate Markets

by Daniel Cristóbal Andrade-Girón, William Joel Marin-Rodriguez and Marcelo Gumercindo Zuñiga-Rojas

Informatics 2025, 12(2), 52; https://doi.org/10.3390/informatics12020052 - 20 May 2025

Viewed by 2030

Abstract

Real estate is crucial to the global economy, propelling economic and social development. This study examines the effects of dimensionality reduction through Recursive Feature Elimination (RFE), Random Forest (RF), and Boruta on real estate price prediction, assessing ensemble models like Bagging, Random Forest, Gradient Boosting, AdaBoost, Stacking, Voting, and Extra Trees. The results indicate that the Stacking model achieved the best performance with an MAE (mean absolute error) of 14,090, MSE (mean squared error) of 5.338 × 10⁸, RMSE (root mean square error) of 23,100, R² of 0.924, and a Concordance Correlation Coefficient (CCC) of 0.960, also demonstrating notable computational efficiency with a time of 67.23 s. Gradient Boosting closely followed, with an MAE of 14,540, R² of 0.920, and a CCC of 0.958, requiring 1.76 s for computation. Variable reduction through RFE increased the MAE of Gradient Boosting and Stacking by 16.9% and 14.6%, respectively, along with slight reductions in R² and CCC. The application of Boruta reduced the variables to 16, maintaining performance in Stacking, with an increase in MAE of 9.8% and an R² of 0.908. These dimensionality reduction techniques enhanced computational efficiency and proved effective for practical applications without significantly compromising accuracy. Future research should explore automatic hyperparameter optimization and hybrid approaches to improve the adaptability and robustness of models in complex contexts.
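For readers unfamiliar with the metrics quoted above, the following sketch computes MAE, MSE, RMSE, R², and Lin's concordance correlation coefficient from first principles. The price values are hypothetical, and the use of population (divide-by-n) variances in the CCC is an assumption about the convention rather than something the abstract states.

```python
import math
from statistics import fmean
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float],
                       y_pred: Sequence[float]) -> Dict[str, float]:
    """Return MAE, MSE, RMSE, R^2 and Lin's CCC for paired observations."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = fmean([abs(e) for e in errors])
    mse = fmean([e * e for e in errors])
    mean_t, mean_p = fmean(y_true), fmean(y_pred)
    var_t = fmean([(t - mean_t) ** 2 for t in y_true])  # population variance
    var_p = fmean([(p - mean_p) ** 2 for p in y_pred])
    cov = fmean([(t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)])
    return {
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1.0 - mse / var_t,  # 1 - SS_res / SS_tot
        "ccc": 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2),
    }


# Hypothetical prices, for illustration only.
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
```

As a consistency check on the reported figures, RMSE is the square root of MSE, and √(5.338 × 10⁸) ≈ 23,104, which agrees with the stated RMSE of 23,100 after rounding.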

(This article belongs to the Section Machine Learning)

21 pages, 1228 KiB

Open Access Article

Automatic Feature Selection for Imbalanced Echocardiogram Data Using Event-Based Self-Similarity

by Huang-Nan Huang, Hong-Min Chen, Wei-Wen Lin, Rita Wiryasaputra, Yung-Cheng Chen, Yu-Huei Wang and Chao-Tung Yang

Diagnostics 2025, 15(8), 976; https://doi.org/10.3390/diagnostics15080976 - 11 Apr 2025

Viewed by 652

Abstract

Background and Objective: Using echocardiogram data for cardiovascular disease (CVD) prognosis is difficult because imbalanced datasets lead to biased predictions. Machine learning models can enhance prognosis accuracy, but their effectiveness is influenced by optimal feature selection and robust classification techniques. This study introduces an event-based self-similarity approach to enhance automatic feature selection for imbalanced echocardiogram data. Critical features correlated with disease progression were identified by leveraging self-similarity patterns. This study used an echocardiogram dataset, visual presentations of high-frequency sound wave signals, and data of patients with heart disease who were treated using three treatment methods (catheter ablation, ventricular defibrillator, and drug control) over the course of three years. Methods: The dataset was classified into nine categories and Recursive Feature Elimination (RFE) was applied to identify the most relevant features, reducing model complexity while maintaining diagnostic accuracy. Machine learning classification models, including XGBoost and CATBoost, were trained and evaluated. Results: Both models achieved comparable accuracy values, 84.3% and 88.4%, respectively, under different normalization techniques. To further optimize performance, the models were combined into a voting ensemble, improving feature selection and predictive accuracy. Four essential features, namely age, aorta (AO), left ventricular (LV), and left atrium (LA), were identified as critical for prognosis and were found in the Random Forest (RF)-voting ensemble classifier. The results underscore the importance of feature selection techniques in handling imbalanced datasets, improving classification robustness, and reducing bias in automated prognosis systems.
Conclusions: Our findings highlight the potential of machine learning-driven echocardiogram analysis to enhance patient care by providing accurate, data-driven assessments.

(This article belongs to the Section Medical Imaging and Theranostics)

24 pages, 5151 KiB

Open Access Article

Evaluating the Impact of Recursive Feature Elimination on Machine Learning Models for Predicting Forest Fire-Prone Zones

by Ali Rezaei Barzani, Parham Pahlavani, Omid Ghorbanzadeh, Khalil Gholamnia and Pedram Ghamisi

Fire 2024, 7(12), 440; https://doi.org/10.3390/fire7120440 - 28 Nov 2024

Cited by 4 | Viewed by 1907

Abstract

This study aimed to enhance the accuracy of forest fire susceptibility mapping (FSM) by applying recursive feature elimination (RFE) with an ensemble of machine learning models, specifically Support Vector Machine (SVM) and Random Forest (RF), to identify key fire factors. The fire zones were derived from MODIS satellite imagery from 2012 to 2017. These data were further validated by field surveys and reviews of land records in rangelands and forests; a total of 326 fire points were determined in this study. Seventeen factors involving topography, geomorphology, meteorology, hydrology, and human factors were identified as effective primary factors in triggering and spreading fires in the selected mountainous case study area. As a first step, the RFE models RF, Extra Trees, Gradient Boosting, and AdaBoost were used to identify important fire factors among all selected primary factors. The SVM and RF models were applied first to all factors and then to the key factors derived from the RFE models for FSM. The data were split into ten folds for training and testing, and model performance was evaluated using cross-validation. Various metrics, including recall, precision, F1 score, accuracy, area under the curve (AUC), Matthews correlation coefficient (MCC), and Kappa, were employed to measure the performance of the models. The assessments demonstrate that leveraging RFE models enhances the FSM results by identifying key factors and excluding unnecessary ones. Notably, the SVM model exhibits significant improvement, achieving an increase of over 10.97% in accuracy and 8.61% in AUC. This improvement underscores the effectiveness of the RFE approach in enhancing the predictive performance of the SVM model.
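The tenfold split mentioned above can be sketched as a small index generator. This is an illustration under assumptions: the shuffling, the seed, and the function name are hypothetical, and the sample count of 326 is borrowed from the number of fire points only to make the fold sizes concrete (the study's full dataset may differ).

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k spreads any remainder so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(326, k=10))
```

Each sample appears in exactly one test fold, so every observation contributes once to the cross-validated estimate of each metric.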

(This article belongs to the Section Mathematical Modelling and Numerical Simulation of Combustion and Fire)

16 pages, 5465 KiB

Open Access Article

Estimation of Cotton SPAD Based on Multi-Source Feature Fusion and Voting Regression Ensemble Learning in Intercropping Pattern of Cotton and Soybean

by Xiaoli Wang, Jingqian Li, Junqiang Zhang, Lei Yang, Wenhao Cui, Xiaowei Han, Dulin Qin, Guotao Han, Qi Zhou, Zesheng Wang, Jing Zhao and Yubin Lan

Agronomy 2024, 14(10), 2245; https://doi.org/10.3390/agronomy14102245 - 29 Sep 2024

Cited by 1 | Viewed by 1435

Abstract

The accurate estimation of soil plant analytical development (SPAD) values in cotton under various intercropping patterns with soybean is crucial for monitoring cotton growth and determining a suitable intercropping pattern. In this study, we utilized an unmanned aerial vehicle (UAV) to capture visible (RGB) and multispectral (MS) data of cotton at the bud stage, early flowering stage, and full flowering stage in a cotton–soybean intercropping pattern in the Yellow River Delta region of China, and we used a SPAD502 Plus and a tapeline to collect SPAD and cotton plant height (CH) data of the cotton canopy, respectively. We analyzed the differences in cotton SPAD and CH under different intercropping ratio patterns. Pearson correlation analysis was conducted between the RGB features, MS features, and cotton SPAD, and then the recursive feature elimination (RFE) method was employed to select image features. Seven feature sets including MS features (five vegetation indices + five texture features), RGB features (five vegetation indices + cotton cover), and CH, as well as combinations of these three types of features with each other, were established. Voting regression (VR) ensemble learning was proposed for estimating cotton SPAD and compared with the performances of three models: random forest regression (RFR), gradient boosting regression (GBR), and support vector regression (SVR). The optimal model was then used to estimate and visualize cotton SPAD under different intercropping patterns. The results were as follows: (1) There was little difference in the mean value of SPAD or CH under different intercropping patterns; a significant positive correlation existed between CH and SPAD throughout the entire growth period. (2) The VR model was optimal for each of the seven feature sets used as input. When the feature set was MS + RGB, the determination coefficient (R²) of the validation set of the VR model was 0.902, the root mean square error (RMSE) was 1.599, and the relative prediction deviation (RPD) was 3.24. (3) When the feature set was CH + MS + RGB, the accuracy of the VR model was further improved: compared with the feature set MS + RGB, the R² and RPD increased by 1.55% and 8.95%, respectively, and the RMSE decreased by 7.38%. (4) In the intercropping of cotton and soybean, cotton grew better under the 4:6 planting pattern. The results can provide a reference for the selection of intercropping patterns and the estimation of cotton SPAD.
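A voting regressor in its simplest form averages the predictions of its base models. The sketch below shows that averaging together with RMSE and RPD. It rests on assumptions: equal weights for the base models, RPD taken as the sample standard deviation of the observed values divided by RMSE (a common definition that the abstract does not spell out), and hypothetical SPAD-like values and function names.

```python
import math
from statistics import fmean, stdev
from typing import List, Sequence


def voting_predict(base_predictions: Sequence[Sequence[float]]) -> List[float]:
    """Equal-weight voting: average the base regressors' predictions per sample."""
    return [fmean(column) for column in zip(*base_predictions)]


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between observed and predicted values."""
    return math.sqrt(fmean([(p - t) ** 2 for t, p in zip(y_true, y_pred)]))


def rpd(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Assumed definition: sample SD of the observations divided by RMSE."""
    return stdev(y_true) / rmse(y_true, y_pred)


# Hypothetical observed values and predictions from three base regressors
# (standing in for RFR, GBR, and SVR).
observed = [40, 44, 48, 52]
base = [
    [41, 43, 49, 53],
    [39, 45, 47, 51],
    [40, 47, 48, 52],
]
voted = voting_predict(base)
voted_rmse = rmse(observed, voted)
voted_rpd = rpd(observed, voted)
```

In this toy case the individual models' errors partly cancel, so the averaged prediction has a lower RMSE than any single base model, which is the effect a voting ensemble aims for.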

(This article belongs to the Special Issue AI, Sensors and Robotics for Smart Agriculture—2nd Edition)

43 pages, 431 KiB

Open Access Article

Setting Ranges in Potential Biomarkers for Type 2 Diabetes Mellitus Patients Early Detection By Sex—An Approach with Machine Learning Algorithms

by Jorge A. Morgan-Benita, José M. Celaya-Padilla, Huizilopoztli Luna-García, Carlos E. Galván-Tejada, Miguel Cruz, Jorge I. Galván-Tejada, Hamurabi Gamboa-Rosales, Ana G. Sánchez-Reyna, David Rondon and Klinge O. Villalba-Condori

Diagnostics 2024, 14(15), 1623; https://doi.org/10.3390/diagnostics14151623 - 27 Jul 2024

Cited by 1 | Viewed by 2695

Abstract

Type 2 diabetes mellitus (T2DM) is one of the most common metabolic diseases in the world and poses a significant public health challenge. Early detection and management of this metabolic disorder is crucial to prevent complications and improve outcomes. This paper aims to find core differences in male and female markers to detect T2DM from clinical and anthropometric features, seeking ranges in the identified potential biomarkers that can serve as a pre-diagnostic tool while excluding glucose-related biomarkers, using machine learning (ML) models. We used a dataset containing clinical and anthropometric variables from patients diagnosed with T2DM and patients without T2DM as controls. We applied feature selection with three different techniques to identify relevant biomarker models: an improved recursive feature elimination (RFE) evaluating each set from all the features to one feature with the Akaike information criterion (AIC) to find optimal outputs; Least Absolute Shrinkage and Selection Operator (LASSO) with glmnet; and Genetic Algorithms (GA) with GALGO and forward selection (FS) applied to GALGO output. We then compared these using the AIC to measure the performance of each technique and collect the optimal set of global features. Then, an implementation and comparison of five different ML models was carried out to identify the most accurate and interpretable one, considering the following models: logistic regression (LR), artificial neural network (ANN), support vector machine (SVM), k-nearest neighbors (KNN), and nearest centroid (Nearcent). The models were then combined in an ensemble to provide a more robust approximation. The results showed that potential biomarkers such as systolic blood pressure (SBP) and triglycerides are together significantly associated with T2DM. This approach also identified triglycerides, cholesterol, and diastolic blood pressure as biomarkers with differences between male and female subjects that have not been previously reported in the literature. The most accurate ML model used selection with RFE, with random forest (RF) as the estimator and improved with the AIC, and achieved an accuracy of 0.8820. In conclusion, this study demonstrates the potential of ML models in identifying potential biomarkers for early detection of T2DM, excluding glucose-related biomarkers, as well as differences between male and female anthropometric and clinical profiles. These findings may help to improve early detection and management of T2DM by accounting for differences between male and female subjects in terms of anthropometric and clinical profiles, potentially reducing healthcare costs and improving personalized patient attention. Further research is needed to validate these potential biomarker ranges in other populations and clinical settings.
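The idea of scoring each candidate feature set with the Akaike information criterion can be sketched as follows, using the standard definition AIC = 2k − 2 ln L. The log-likelihood values are hypothetical placeholders, not results from the study, and counting one extra parameter for an intercept is an assumption about the model form.

```python
from typing import List, Sequence, Tuple


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood


def best_subset_by_aic(
    candidates: Sequence[Tuple[List[str], float]]
) -> Tuple[List[str], float]:
    """Pick the (features, log_likelihood) pair with the lowest AIC.

    One parameter per feature plus an intercept is assumed.
    """
    return min(candidates, key=lambda c: aic(c[1], len(c[0]) + 1))


# Hypothetical log-likelihoods for nested feature sets (illustration only).
candidates = [
    (["SBP"], -120.0),
    (["SBP", "triglycerides"], -110.0),
    (["SBP", "triglycerides", "cholesterol"], -109.5),
]
best_features, best_loglik = best_subset_by_aic(candidates)
```

The third set fits slightly better than the second but pays a penalty of 2 for the extra parameter, so the two-feature set wins; this trade-off between fit and size is what an AIC-guided elimination relies on.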

(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

19 pages, 7197 KiB

Open Access Article

Evaluation of the Efficiency of Machine Learning Algorithms for Identification of Cattle Behavior Using Accelerometer and Gyroscope Data

by Tsvetelina Mladenova, Irena Valova, Boris Evstatiev, Nikolay Valov, Ivan Varlyakov, Tsvetan Markov, Svetoslava Stoycheva, Lora Mondeshka and Nikolay Markov

AgriEngineering 2024, 6(3), 2179-2197; https://doi.org/10.3390/agriengineering6030128 - 16 Jul 2024

Cited by 4 | Viewed by 1666

Abstract

Animal welfare is a daily concern for livestock farmers. It is known that the activity of cows characterizes their general physiological state, and deviations from the normal parameters could be an indicator of different kinds of diseases and conditions. This pilot study investigated the application of machine learning for identifying the behavioral activity of cows using a collar-mounted gyroscope sensor and compared the results with the classical accelerometer approach. The sensor data were classified into three categories describing the behavior of the animals: “standing and eating”, “standing and ruminating”, and “laying and ruminating”. Four classification algorithms were considered: random forest ensemble (RFE), decision trees (DT), support vector machines (SVM), and naïve Bayes (NB). The training relied on manually classified data with a total duration of 6 h, which were grouped into 1 s, 3 s, and 5 s piles. The obtained results showed that the RFE and DT algorithms performed the best. When using the accelerometer data, the overall accuracy reached 88%, and when using the gyroscope data, it reached 99%. To the best of our knowledge, no other authors have previously reported such results with a gyroscope sensor, which is the main novelty of this study.

(This article belongs to the Section Livestock Farming Technology)

19 pages, 4756 KiB

Open Access Article

Logistic Regression Ensemble Classifier for Intrusion Detection System in Internet of Things

by Silpa Chalichalamala, Niranjana Govindan and Ramani Kasarapu

Sensors 2023, 23(23), 9583; https://doi.org/10.3390/s23239583 - 3 Dec 2023

Cited by 16 | Viewed by 2937

Abstract

The Internet of Things (IoT) is a powerful technology that connects its users worldwide with everyday objects without any human interference. However, the utilization of IoT infrastructure in different fields such as smart homes, healthcare, and transportation also raises potential risks of attacks and anomalies caused through node security breaches. Therefore, an Intrusion Detection System (IDS) must be developed to substantially strengthen the security of IoT technologies. This paper proposes a Logistic Regression based Ensemble Classifier (LREC) for effective IDS implementation. The LREC combines AdaBoost and Random Forest (RF) to develop an effective classifier using the iterative ensemble approach. The issue of data imbalance is addressed by using the adaptive synthetic sampling (ADASYN) approach. Further, inappropriate features are eliminated using recursive feature elimination (RFE). Two datasets, BoT-IoT and TON-IoT, are used to analyze the proposed RFE-LREC method. The RFE-LREC is analyzed on the basis of accuracy, recall, precision, F1-score, false alarm rate (FAR), receiver operating characteristic (ROC) curve, true negative rate (TNR), and Matthews correlation coefficient (MCC). Existing approaches, namely the NetFlow-based feature set, TL-IDS, and LSTM, are used for comparison with the RFE-LREC. The classification accuracy of RFE-LREC for the BoT-IoT dataset is 99.99%, which is higher than those of TL-IDS and LSTM.
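Most of the evaluation measures listed above follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical, and the code assumes every denominator is non-zero.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary metrics from confusion-matrix counts.

    Assumes no denominator is zero (each class occurs and is predicted).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "tnr": tn / (tn + fp),   # true negative rate (specificity)
        "far": fp / (fp + tn),   # false alarm rate = 1 - TNR
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Hypothetical counts: 100 attack flows and 100 benign flows.
m = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

On heavily imbalanced traffic, accuracy alone can look excellent while MCC and FAR reveal weak minority-class detection, which is why intrusion detection studies usually report them together.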

(This article belongs to the Section Internet of Things)

16 pages, 11272 KiB

Open Access Article

Classifying Mountain Vegetation Types Using Object-Oriented Machine Learning Methods Based on Different Feature Combinations

by Xiaoli Fu, Wenzuo Zhou, Xinyao Zhou, Feng Li and Yichen Hu

Forests 2023, 14(8), 1624; https://doi.org/10.3390/f14081624 - 11 Aug 2023

Cited by 9 | Viewed by 1901

Abstract

Mountainous vegetation type classification plays a fundamental role in resource investigation in forested areas, making it necessary to accurately identify mountain vegetation types. However, mountainous vegetation growth is readily affected by terrain and climate, which often makes interpretation difficult. This study utilizes Sentinel-2A images and object-oriented machine learning methods to map vegetation types in the complex mountainous region of Jiuzhaigou County, China, incorporating multiple auxiliary features. The results showed that the inclusion of different features improved the accuracy of mountain vegetation type classification, with terrain features, vegetation indices, and spectral features providing significant benefits. After feature selection, the accuracy of mountain vegetation type classification was further improved. The random forest recursive feature elimination (RF_RFE) algorithm outperformed the ReliefF algorithm in recognizing mountain vegetation types. Extreme learning machine (ELM), random forest (RF), rotation forest (ROF), and ROF_ELM algorithms all achieved good classification performance, with an overall accuracy greater than 84.62%. Comparing the mountain vegetation type distribution maps obtained using different classifiers, we found that classification algorithms with the same base classifier ensemble exhibited similar performance. Overall, the ROF algorithm performed the best, achieving an overall accuracy of 89.68%, an average accuracy of 88.48%, and a Kappa coefficient of 0.879.
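Overall accuracy, average accuracy, and the Kappa coefficient can all be derived from a multi-class confusion matrix, as sketched below. The matrix is hypothetical, rows are taken to be reference classes and columns predicted classes, and "average accuracy" is assumed to mean the mean of the per-class accuracies, which the abstract does not define explicitly.

```python
from typing import Dict, Sequence


def multiclass_summary(cm: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Overall accuracy, mean per-class accuracy, and Cohen's kappa.

    cm[i][j] = number of samples of reference class i predicted as class j.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    overall = sum(cm[i][i] for i in range(k)) / n
    average = sum(cm[i][i] / row_tot[i] for i in range(k)) / k
    # Agreement expected by chance from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_tot, col_tot)) / (n * n)
    return {
        "overall_accuracy": overall,
        "average_accuracy": average,
        "kappa": (overall - expected) / (1 - expected),
    }


# Hypothetical 3-class confusion matrix (50 reference samples per class).
summary = multiclass_summary([
    [45, 5, 0],
    [4, 40, 6],
    [1, 5, 44],
])
```

Kappa discounts the agreement that would occur by chance, so it is always lower than overall accuracy unless the classification is perfect; the reported pair of 89.68% and 0.879 shows the same pattern.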

(This article belongs to the Special Issue Mapping Forest Vegetation via Remote Sensing Tools)

21 pages, 3941 KiB

Open Access Article

A Novel Machine Learning Approach for Solar Radiation Estimation

by Hasna Hissou, Said Benkirane, Azidine Guezzaz, Mourade Azrour and Abderrahim Beni-Hssane

Sustainability 2023, 15(13), 10609; https://doi.org/10.3390/su151310609 - 5 Jul 2023

Cited by 50 | Viewed by 5386

Abstract

Solar irradiation (Rs) is the electromagnetic radiation energy emitted by the Sun. It plays a crucial role in sustaining life on Earth by providing light, heat, and energy. Furthermore, it serves as a key driver of Earth’s climate and weather systems, influencing the distribution of heat across the planet, shaping global air and ocean currents, and determining weather patterns. Variations in Rs levels have significant implications for climate change and long-term climate trends. Moreover, Rs represents an abundant and renewable energy resource, offering a clean and sustainable alternative to fossil fuels. By harnessing solar energy, we can actively reduce greenhouse gas emissions. However, the utilization of Rs comes with its own challenges that must be addressed. One problem is its variability, which makes it difficult to predict and plan for consistent solar energy generation. Its intermittent nature also poses difficulties in meeting continuous energy demand unless appropriate energy storage or backup systems are in place. Integrating large-scale solar energy systems into existing power grids can present technical challenges. Rs levels are influenced by various factors; understanding these factors is crucial for various applications, such as renewable energy planning, climate modeling, and environmental studies. Overcoming the associated challenges requires advancements in technology and innovative solutions. Measuring and harnessing Rs for various applications can be achieved using various devices; however, the expense and scarcity of measuring equipment pose challenges in accurately assessing and monitoring Rs levels. In order to address this, alternative methods have been developed with which to estimate Rs, including artificial intelligence and machine learning (ML) models, like neural networks, kernel algorithms, tree-based models, and ensemble methods. 
To demonstrate the impact of feature selection methods on Rs predictions, we propose a Multivariate Time Series (MVTS) model using Recursive Feature Elimination (RFE) with a decision tree (DT), Pearson correlation (Pr), logistic regression (LR), Gradient Boosting Models (GBM), and a random forest (RF). Our article introduces a novel framework that integrates various models and incorporates overlooked factors. This framework offers a more comprehensive understanding of Recursive Feature Elimination and its integration with different models in multivariate solar radiation forecasting. Our research delves into unexplored aspects and challenges existing theories related to solar radiation forecasting. Our results show reliable predictions based on essential criteria. The feature ranking may vary depending on the model used, with the RF Regressor algorithm selecting features such as maximum temperature, minimum temperature, precipitation, wind speed, and relative humidity for specific months. The DT algorithm may yield a slightly different set of selected features. Despite the variations, all of the models exhibit impressive performance, with the LR model demonstrating outstanding performance with low RMSE (0.003) and the highest R² score (0.002). The other models also show promising results, with RMSE scores ranging from 0.006 to 0.007 and a consistent R² score of 0.999.

(This article belongs to the Special Issue Machine Learning, IoT and Artificial Intelligence for Sustainable Development)

84 pages, 26371 KiB

Open Access Article

A Study on ML-Based Software Defect Detection for Security Traceability in Smart Healthcare Applications

by Samuel Mcmurray and Ali Hassan Sodhro

Sensors 2023, 23(7), 3470; https://doi.org/10.3390/s23073470 - 26 Mar 2023

Cited by 23 | Viewed by 4226

Abstract

Software Defect Prediction (SDP) is an integral aspect of the Software Development Life-Cycle (SDLC). As software systems become more prevalent and more integrated into our daily lives, their complexity increases the risk of widespread defects. Although reliance on these systems is increasing, the ability to accurately identify a defective model using Machine Learning (ML) has been overlooked and less addressed. Thus, this article contributes an investigation of various ML techniques for SDP. An investigation, comparative analysis, and recommendation of appropriate Feature Extraction (FE) techniques, Principal Component Analysis (PCA), Partial Least Squares Regression (PLS), Feature Selection (FS) techniques, Fisher score, Recursive Feature Elimination (RFE), and Elastic Net are presented. Validation of the following techniques, both separately and in combination with ML algorithms, is performed: Support Vector Machine (SVM), Logistic Regression (LR), Naïve Bayes (NB), K-Nearest Neighbour (KNN), Multilayer Perceptron (MLP), Decision Tree (DT), and ensemble learning methods Bootstrap Aggregation (Bagging), Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), Random Forest (RF), and Generalized Stacking (Stacking). An extensive experimental setup was built, and the results of the experiments revealed that FE and FS can both positively and negatively affect performance over the base model or Baseline. PLS, both separately and in combination with FS techniques, provides impressive, and the most consistent, improvements, while PCA, in combination with Elastic-Net, shows acceptable improvement.

(This article belongs to the Special Issue AI-Enabled Smart Sensing Technologies for Human-Centered Healthcare Applications)

20 pages, 2229 KiB

Open AccessArticle

Optimal Feature Selection through Search-Based Optimizer in Cross Project

by Rizwan bin Faiz, Saman Shaheen, Mohamed Sharaf and Hafiz Tayyab Rauf

Electronics 2023, 12(3), 514; https://doi.org/10.3390/electronics12030514 - 19 Jan 2023

Cited by 5 | Viewed by 2615

Abstract

Cross project defect prediction (CPDP) is a key method for estimating defect-prone modules of software products. CPDP is an attractive approach because it provides information about predicted defects for projects in which data are insufficient. Recent studies specifically include instructions on how to pick training data from large datasets using a feature selection (FS) process, which contributes the most to the end results. The classifier then assigns the selected data to specified classes in order to predict the defective and non-defective classes. The aim of our research is to select the optimal set of features from multi-class data through a search-based optimizer for CPDP. We used an explanatory research type and a quantitative approach for our experimentation. The F1 measure is our dependent variable, while the KNN filter, ANN filter, random forest ensemble (RFE) model, genetic algorithm (GA), and classifiers are the manipulated independent variables. Our experiment follows a 1 factor 1 treatment (1F1T) design for RQ1, whereas RQ2, RQ3, and RQ4 follow a 1 factor 2 treatments (1F2T) design. We first carried out exploratory data analysis (EDA) to understand the nature of our dataset. Then we pre-processed our data by removing and resolving the issues identified. During data preprocessing, we found that we have multi-class data; therefore, we first rank features and select multiple feature sets using the info gain algorithm to get maximum variation in features for the multi-class dataset. To remove noise, we use the ANN filter and obtain significant improvements of 40% to 60% compared to the NN filter with the base paper (all, ckloc, IG). Then we applied a search-based optimizer, i.e., random forest ensemble (RFE), to get the best feature set for a software prediction model, and we obtain significant improvements of 30% to 50% compared with genetic instance selection (GIS). Then we used a classifier to predict defects for CPDP. We compare the results of the classifier with the base paper classifier using the F1 measure and obtain almost 35% more than the base paper. We validate the experiment using the Wilcoxon and Cohen’s d tests.
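
The two evaluation quantities named in this abstract, the F1 measure and Cohen's d, can be illustrated with a minimal sketch. It assumes a binary defective/non-defective labelling (the abstract describes multi-class data, so this is a simplification) and the pooled-standard-deviation form of Cohen's d; the function names `f1_score` and `cohens_d` are hypothetical and are not taken from the article.

```python
from math import sqrt
from statistics import mean, variance


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the `positive` label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohens_d(a, b):
    """Effect size between two samples, using the pooled sample variance.

    Assumes each sample has at least two values and the pooled variance is non-zero.
    """
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


if __name__ == "__main__":
    # Illustrative labels only, not data from the article.
    print(f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```

With the illustrative inputs above, precision and recall are both 2/3, so the F1 value is 2/3, and the two samples differ by half a pooled standard deviation, so Cohen's d is 0.5.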

(This article belongs to the Topic Software Engineering and Applications)

17 pages, 430 KiB

Open AccessArticle

Ensemble Learning Based on Hybrid Deep Learning Model for Heart Disease Early Prediction

by Ahmed Almulihi, Hager Saleh, Ali Mohamed Hussien, Sherif Mostafa, Shaker El-Sappagh, Khaled Alnowaiser, Abdelmgeid A. Ali and Moatamad Refaat Hassan

Diagnostics 2022, 12(12), 3215; https://doi.org/10.3390/diagnostics12123215 - 18 Dec 2022

Cited by 52 | Viewed by 7752

Abstract

Many epidemics have afflicted humanity throughout history, claiming many lives. Heart disease is one of the deadliest diseases that humanity has confronted in the contemporary period. The proliferation of poor habits such as smoking, overeating, and lack of physical activity has contributed to the rise in heart disease. The deadly feature of heart disease, which has earned it the moniker the “silent killer,” is that it frequently has no apparent signs in advance. As a result, research is required to develop a promising model for the early identification of heart disease using simple data and symptoms. The paper’s aim is to propose a deep stacking ensemble model to enhance the performance of heart disease prediction. The proposed ensemble model integrates two optimized and pre-trained hybrid deep learning models with the Support Vector Machine (SVM) as the meta-learner model. The first hybrid model is CNN-LSTM, which integrates a Convolutional Neural Network (CNN) with Long Short-Term Memory (LSTM). The second hybrid model is CNN-GRU, which integrates CNN with a Gated Recurrent Unit (GRU). Recursive Feature Elimination (RFE) is also used for the feature selection optimization process. The proposed model has been optimized and tested using two different heart disease datasets. The proposed ensemble is compared with five machine learning models, namely Logistic Regression (LR), Random Forest (RF), K-Nearest Neighbors (K-NN), Decision Tree (DT), and Naïve Bayes (NB), as well as with hybrid models. In addition, optimization techniques are used to optimize the ML, DL, and proposed models. The proposed model achieved the highest performance using the full feature set.

(This article belongs to the Special Issue Medical Data Processing and Analysis)

13 pages, 2160 KiB

Open AccessArticle

Diagnosing Coronary Artery Disease on the Basis of Hard Ensemble Voting Optimization

by Hayder Mohammedqasim, Roa’a Mohammedqasem, Oguz Ata and Eman Ibrahim Alyasin

Medicina 2022, 58(12), 1745; https://doi.org/10.3390/medicina58121745 - 28 Nov 2022

Cited by 12 | Viewed by 2317

Abstract

Background and Objectives: Recently, many studies have focused on the early diagnosis of coronary artery disease (CAD), which is one of the leading causes of cardiac-associated death worldwide. The effectiveness of the most important features influencing disease diagnosis determines the performance of machine learning systems that can allow for timely and accurate treatment. We developed a hybrid ML framework based on hard ensemble voting optimization (HEVO) to classify patients with CAD using the Z-Alizadeh Sani dataset. All categorical features were converted to numerical forms, the synthetic minority oversampling technique (SMOTE) was employed to overcome the imbalanced distribution between the two classes in the dataset, and then recursive feature elimination (RFE) with random forest (RF) was used to obtain the best subset of features. Materials and Methods: After resolving the biased distribution in the CAD dataset using the SMOTE method and finding the highly correlated features that affected the classification of CAD patients, the performance of the proposed model was evaluated using grid search optimization, and the best hyperparameters were identified for developing four applications, namely, RF, AdaBoost, gradient boosting, and extra trees, based on an HEV classifier. Results: Five-fold cross-validation experiments with the HEV classifier showed excellent prediction performance with the 10 best balanced features obtained using SMOTE and feature selection. All evaluation metrics reached > 98% with the HEV classifier, and the gradient-boosting model was the second-best classification model with accuracy = 97% and F1-score = 98%. Conclusions: When compared to modern methods, the proposed method performs well in diagnosing coronary artery disease, and therefore, the proposed method can be used by medical personnel for supplementary therapy for timely, accurate, and efficient identification of CAD cases in suspected patients.
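
The hard-voting step that this abstract builds on can be sketched as a plain majority vote over the labels predicted by several base models. This is a minimal illustration rather than the authors' implementation: the function name `hard_vote`, the example predictions, and the tie-breaking rule (the smallest tied label wins) are all assumptions introduced here.

```python
from collections import Counter


def hard_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    `predictions` holds one list of predicted labels per base model,
    all of the same length. Ties are broken by choosing the smallest
    tied label (an assumption made for this sketch).
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("every model must predict the same number of samples")

    voted = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


if __name__ == "__main__":
    # Hypothetical predictions from four base models (for example RF, AdaBoost,
    # gradient boosting, and extra trees) on four patients.
    rf = [1, 0, 1, 1]
    ada = [1, 0, 0, 1]
    gb = [0, 0, 1, 1]
    et = [1, 1, 0, 0]
    print(hard_vote([rf, ada, gb, et]))
```

For these hypothetical inputs the combined prediction is `[1, 0, 0, 1]`; the third patient is a 2 to 2 tie and falls to label 0 under the tie-breaking assumption. With an even number of base models, as in the four-model setup described here, some explicit tie rule is needed.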

(This article belongs to the Section Cardiology)

Search Results (19)
