MDPI - Publisher of Open Access Journals

27 pages, 10031 KiB

Open Access Article

Predicting Cycle-to-Cycle Variations in Liquid Methane Engines Using CTGAN-Augmented Machine Learning

by Enchang Zhang, Feng Zhou, Haoran Xi, Xiongbo Duan and Jingping Liu

J. Mar. Sci. Eng. 2025, 13(8), 1513; https://doi.org/10.3390/jmse13081513 - 6 Aug 2025

It is imperative to understand the cyclical variations inherent in liquid methane engines (LMEs) across both design and operational domains. The theoretical thermal efficiency of LMEs is higher at higher compression ratios, but combustion instability also increases. Obtaining relevant metrics from bench experiments is difficult and time-consuming; therefore, in this study, we use a Conditional Tabular GAN (CTGAN) to model the tabular data and generate additional virtual samples based on the experimental results for the key metrics (peak pressure, maximum pressure rise rate, and average effective pressure). On this basis, we propose a machine learning model that couples a random forest (RF) model with Bayesian optimization for predicting cyclic variation. The findings indicate that the Bayesian-optimized RF model predicts the metrics with greater accuracy and reliability than the gradient boosting (XGBoost) and support vector machine (SVM) models. The R² value of the Bayesian-optimized RF model is consistently greater than 0.75, and the root mean square error (RMSE) is typically lower than 0.3. This paper highlights the promising potential of the Bayesian-optimized RF model in predicting unknown cyclic parameters.
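The two reported metrics, R² and RMSE, can be illustrated with a minimal sketch. The function names and the sample values below are hypothetical and are not taken from the paper; the formulas are the standard definitions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted values (illustration only).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

print(round(rmse(observed, predicted), 4))
print(round(r_squared(observed, predicted), 4))
```

Under these definitions, a model meets the thresholds quoted in the abstract when `r_squared` exceeds 0.75 and `rmse` stays below 0.3 on held-out data.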

(This article belongs to the Section Ocean Engineering)

32 pages, 2182 KiB

Open Access Article

Detection of Biased Phrases in the Wiki Neutrality Corpus for Fairer Digital Content Management Using Artificial Intelligence

by Abdullah, Muhammad Ateeb Ather, Olga Kolesnikova and Grigori Sidorov

Big Data Cogn. Comput. 2025, 9(7), 190; https://doi.org/10.3390/bdcc9070190 - 21 Jul 2025

Viewed by 464

Abstract

Detecting biased language in large-scale corpora, such as the Wiki Neutrality Corpus, is essential for promoting neutrality in digital content. This study systematically evaluates a range of machine learning (ML) and deep learning (DL) models for the detection of biased and pre-conditioned phrases. Conventional classifiers, including Extreme Gradient Boosting (XGBoost), Light Gradient-Boosting Machine (LightGBM), and Categorical Boosting (CatBoost), are compared with advanced neural architectures such as Bidirectional Encoder Representations from Transformers (BERT), Long Short-Term Memory (LSTM) networks, and Generative Adversarial Networks (GANs). A novel hybrid architecture is proposed, integrating DistilBERT, LSTM, and GANs within a unified framework. Extensive experimentation with the intermediate variants, DistilBERT + LSTM (without GAN) and DistilBERT + GAN (without LSTM), demonstrates that the fully integrated model consistently outperforms all alternatives. The proposed hybrid model achieves a cross-validation accuracy of 99.00%, significantly surpassing traditional baselines such as XGBoost (96.73%) and LightGBM (96.83%). It also exhibits superior stability, statistically significant gains (paired t-tests), and favorable trade-offs between performance and computational efficiency. The results underscore the potential of hybrid deep learning models for capturing subtle linguistic bias and advancing more objective and reliable automated content moderation systems.

17 pages, 497 KiB

Open Access Article

Generative Data Modelling for Diverse Populations in Africa: Insights from South Africa

by Sally Sonia Simmons, John Elvis Hagan and Thomas Schack

Information 2025, 16(7), 612; https://doi.org/10.3390/info16070612 - 17 Jul 2025

Viewed by 282

Abstract

Studies on the demography and health of racially diverse African populations are scarce, particularly due to lingering data challenges. Generative data modelling has emerged as a valuable solution to this burden. The study, therefore, examined the efficacy of Conditional Tabular GAN (CTGAN), CopulaGAN, and Tabular Variational Autoencoder (TVAE) for generating synthetic but realistic demographic and health data. This study employed the World Health Organisation Study on Global Ageing and Adult Health (SAGE) Wave 1 South African survey data (n = 4227). Information missing from SAGE Wave 1, including demographic (e.g., race, age) and health (e.g., hypertension, blood pressure) indicators, was imputed using Generative Adversarial Imputation Nets (GAIN). CopulaGAN, CTGAN, and TVAE, sourced from the sdv 1.24.1 Python library, generated 104,227 synthetic records based on the SAGE data constituents. The outcomes were assessed with similarity and machine learning (XGBoost) augmentation metrics (sourced from the sdmetrics 0.21.0 Python library), including column shapes and overall and precision ratio scores. Generally, the GAIN imputations resulted in data with properties that were comparable to the original and with no missing information. CTGAN’s (89.20%) overall quality of performance was above that of TVAE (86.50%) and CopulaGAN (88.45%). These findings underscore the usefulness of generative data modelling in addressing data quality challenges in diverse populations to enhance actionable health research and policy implementation.

34 pages, 2216 KiB

Open Access Article

An Optimized Transformer–GAN–AE for Intrusion Detection in Edge and IIoT Systems: Experimental Insights from WUSTL-IIoT-2021, EdgeIIoTset, and TON_IoT Datasets

by Ahmad Salehiyan, Pardis Sadatian Moghaddam and Masoud Kaveh

Future Internet 2025, 17(7), 279; https://doi.org/10.3390/fi17070279 - 24 Jun 2025

Viewed by 507

Abstract

The rapid expansion of Edge and Industrial Internet of Things (IIoT) systems has intensified the risk and complexity of cyberattacks. Detecting advanced intrusions in these heterogeneous and high-dimensional environments remains challenging. As the IIoT becomes integral to critical infrastructure, ensuring security is crucial to prevent disruptions and data breaches. Traditional IDS approaches often fall short against evolving threats, highlighting the need for intelligent and adaptive solutions. While deep learning (DL) offers strong capabilities for pattern recognition, single-model architectures often lack robustness. Thus, hybrid and optimized DL models are increasingly necessary to improve detection performance and address data imbalance and noise. In this study, we propose an optimized hybrid DL framework that combines a transformer, generative adversarial network (GAN), and autoencoder (AE) components, referred to as Transformer–GAN–AE, for robust intrusion detection in Edge and IIoT environments. To enhance the training and convergence of the GAN component, we integrate an improved chimp optimization algorithm (IChOA) for hyperparameter tuning and feature refinement. The proposed method is evaluated using three recent and comprehensive benchmark datasets, WUSTL-IIoT-2021, EdgeIIoTset, and TON_IoT, widely recognized as standard testbeds for IIoT intrusion detection research. Extensive experiments are conducted to assess the model’s performance compared to several state-of-the-art techniques, including standard GAN, convolutional neural network (CNN), deep belief network (DBN), time-series transformer (TST), bidirectional encoder representations from transformers (BERT), and extreme gradient boosting (XGBoost). Evaluation metrics include accuracy, recall, AUC, and run time. Results demonstrate that the proposed Transformer–GAN–AE framework outperforms all baseline methods, achieving a best accuracy of 98.92%, along with superior recall and AUC values. 
The integration of IChOA enhances GAN stability and accelerates training by optimizing hyperparameters. Together with the transformer for temporal feature extraction and the AE for denoising, the hybrid architecture effectively addresses complex, imbalanced intrusion data. The proposed optimized Transformer–GAN–AE model demonstrates high accuracy and robustness, offering a scalable solution for real-world Edge and IIoT intrusion detection.

(This article belongs to the Special Issue Intrusion Detection and Resiliency in Cyber-Physical Systems and Networks)

26 pages, 3807 KiB

Open Access Article

Evaluation of IMERG Precipitation Product Downscaling Using Nine Machine Learning Algorithms in the Qinghai Lake Basin

by Ke Lei, Lele Zhang and Liming Gao

Water 2025, 17(12), 1776; https://doi.org/10.3390/w17121776 - 13 Jun 2025

Viewed by 573

Abstract

High-quality precipitation data are vital for hydrological research. In regions with sparse observation stations, reliable gridded data cannot be obtained through interpolation, while the coarse resolution of satellite products fails to meet the demands of small watershed studies. Downscaling satellite-based precipitation products offers an effective solution for generating high-resolution data in such areas. Among these techniques, machine learning plays a pivotal role, with performance varying according to surface conditions and algorithmic mechanisms. Using the Qinghai Lake Basin as a case study and rain gauge observations as reference data, this research conducted a systematic comparative evaluation of nine machine learning algorithms (ANN, CLSTM, GAN, KNN, MSRLapN, RF, SVM, Transformer, and XGBoost) for downscaling IMERG precipitation products from 0.1° to 0.01° resolution. The primary objective was to identify the optimal downscaling method for the Qinghai Lake Basin by assessing spatial accuracy, seasonal performance, and residual sensitivity. Seven metrics were employed for assessment: correlation coefficient (CC), root mean square error (RMSE), mean absolute error (MAE), coefficient of determination (R²), standard deviation ratio (Sigma Ratio), Kling-Gupta Efficiency (KGE), and bias. On the annual scale, KNN delivered the best overall results (KGE = 0.70, RMSE = 17.09 mm, Bias = −3.31 mm), followed by Transformer (KGE = 0.69, RMSE = 17.20 mm, Bias = −3.24 mm). During the cold season, KNN and ANN both performed well (KGE = 0.63; RMSE = 5.97 mm and 6.09 mm; Bias = −1.76 mm and −1.75 mm), with SVM ranking next (KGE = 0.63, RMSE = 6.11 mm, Bias = −1.63 mm). In the warm season, Transformer yielded the best results (KGE = 0.74, RMSE = 23.35 mm, Bias = −1.03 mm), followed closely by ANN and KNN (KGE = 0.74; RMSE = 23.38 mm and 23.57 mm; Bias = −1.08 mm and −1.03 mm, respectively). 
GAN consistently underperformed across all temporal scales, with annual, cold-season, and warm-season KGE values of 0.61, 0.43, and 0.68, respectively, worse than the original 0.1° IMERG product. Considering the ability to represent spatial precipitation gradients, KNN emerged as the most suitable method for IMERG downscaling in the Qinghai Lake Basin. Residual analysis revealed error concentrations along the lakeshore, and model performance declined when residuals exceeded specific thresholds, highlighting the need to account for model-specific sensitivity during correction. SHAP analysis based on ANN, KNN, SVM, and Transformer identified NDVI (0.218), longitude (0.214), and latitude (0.208) as the three most influential predictors. While longitude and latitude affect vapor transport by representing land–sea positioning, NDVI is heavily influenced by anthropogenic activities and sandy surfaces in lakeshore regions, thus limiting prediction accuracy in these areas. This work delivers a high-resolution (0.01°) precipitation dataset for the Qinghai Lake Basin and provides a practical basis for selecting suitable downscaling methods in similar environments.
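Since KGE is the headline metric in this abstract, a minimal sketch of how it is commonly computed may help. This assumes the original Gupta et al. (2009) formulation (correlation, standard deviation ratio, and mean ratio); the abstract does not state which KGE variant the authors used, and the gauge values below are hypothetical.

```python
import math
import statistics


def kge(observed, simulated):
    """Kling-Gupta Efficiency, assuming the 2009 formulation:
    KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2)
    r     : linear correlation between simulated and observed
    alpha : ratio of standard deviations (simulated / observed)
    beta  : ratio of means (simulated / observed)
    """
    mu_o = statistics.fmean(observed)
    mu_s = statistics.fmean(simulated)
    sd_o = statistics.pstdev(observed)
    sd_s = statistics.pstdev(simulated)
    cov = sum((o - mu_o) * (s - mu_s)
              for o, s in zip(observed, simulated)) / len(observed)
    r = cov / (sd_o * sd_s)
    alpha = sd_s / sd_o
    beta = mu_s / mu_o
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Hypothetical gauge observations and a downscaled estimate (mm).
gauge = [10.0, 20.0, 30.0, 40.0]
estimate = [12.0, 18.0, 33.0, 37.0]

print(round(kge(gauge, estimate), 3))
```

A perfect estimate gives KGE = 1; values fall as correlation, variability, or mean bias drift from their ideal value of 1.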

(This article belongs to the Section New Sensors, New Technologies and Machine Learning in Water Sciences)

29 pages, 23859 KiB

Open Access Article

Super-Resolution of Landsat-8 Land Surface Temperature Using Kolmogorov–Arnold Networks with PlanetScope Imagery and UAV Thermal Data

by Mahdiyeh Fathi, Hossein Arefi, Reza Shah-Hosseini and Armin Moghimi

Remote Sens. 2025, 17(8), 1410; https://doi.org/10.3390/rs17081410 - 16 Apr 2025

Viewed by 1375

Abstract

Super-Resolution Land Surface Temperature (LST_SR) maps are essential for urban heat island (UHI) analysis and temperature monitoring. While much of the literature focuses on improving the resolution of low-resolution LST (e.g., MODIS-derived LST) using high-resolution space-borne data (e.g., Landsat-derived LST), Unmanned Aerial Vehicles (UAVs)/drone thermal imagery are rarely used for this purpose. Additionally, many deep learning (DL)-based super-resolution approaches, such as Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs), require significant computational resources. To address these challenges, this study presents a novel approach to generate LST_SR maps by integrating Low-Resolution Landsat-8 LST (LST_LR) with High-Resolution PlanetScope images (I_HR) and UAV-derived thermal imagery (T_HR) using the Kolmogorov–Arnold Network (KAN) model. The KAN efficiently integrates the strengths of splines and Multi-Layer Perceptrons (MLPs), providing a more effective solution for generating LST_SR. The multi-step process involves acquiring and co-registering T_HR via the DJI Mavic 3 thermal (T) drone, I_HR from Planet (3 m resolution), and LST_LR from Landsat-8, with T_HR serving as reference data while I_HR and LST_LR are used as input features for the KAN model. The model was trained at two sites in Germany (Oberfischbach and Mittelfischbach) and tested at Königshain, achieving reasonable performance (RMSE: 4.06 °C, MAE: 3.09 °C, SSIM: 0.83, PSNR: 22.22, MAPE: 9.32%), and outperforming LightGBM, XGBoost, ResDensNet, and ResDensNet-Attention. These results demonstrate the KAN’s superior ability to extract fine-scale temperature patterns (e.g., edges and boundaries) from I_HR, significantly improving LST_LR. This advancement can enhance UHI analysis, local climate monitoring, and LST modeling, providing a scalable solution for urban heat mitigation and broader environmental applications. 
To improve scalability and generalizability, KAN models benefit from training on a more diverse set of UAV thermal imagery, covering different seasons, land use types, and regions. Despite this, the proposed approach is effective in areas with limited UAV data availability.

(This article belongs to the Section Environmental Remote Sensing)

19 pages, 4910 KiB

Open Access Article

A Novel SHAP-GAN Network for Interpretable Ovarian Cancer Diagnosis

by Jingxun Cai, Zne-Jung Lee, Zhihxian Lin and Ming-Ren Yang

Mathematics 2025, 13(5), 882; https://doi.org/10.3390/math13050882 - 6 Mar 2025

Viewed by 939

Abstract

Ovarian cancer stands out as one of the most formidable adversaries in women’s health, largely due to its typically subtle and nonspecific early symptoms, which pose significant challenges to early detection and diagnosis. Although existing diagnostic methods, such as biomarker testing and imaging, can help with early diagnosis to some extent, these methods still have limitations in sensitivity and accuracy, often leading to misdiagnosis or missed diagnosis. Ovarian cancer’s high heterogeneity and complexity increase diagnostic challenges, especially in disease progression prediction and patient classification. Machine learning (ML) has outperformed traditional methods in cancer detection by processing large datasets to identify patterns missed by conventional techniques. However, existing AI models still struggle with accuracy in handling imbalanced and high-dimensional data, and their “black-box” nature limits clinical interpretability. To address these issues, this study proposes SHAP-GAN, an innovative diagnostic model for ovarian cancer that integrates Shapley Additive exPlanations (SHAP) with Generative Adversarial Networks (GANs). The SHAP module quantifies each biomarker’s contribution to the diagnosis, while the GAN component optimizes medical data generation. This approach tackles three key challenges in medical diagnosis: data scarcity, model interpretability, and diagnostic accuracy. Results show that SHAP-GAN outperforms traditional methods in sensitivity, accuracy, and interpretability, particularly with high-dimensional and imbalanced ovarian cancer datasets. The top three influential features identified are PRR11, CIAO1, and SMPD3, which exhibit wide SHAP value distributions, highlighting their significant impact on model predictions. 
The SHAP-GAN network has demonstrated an impressive accuracy rate of 99.34% on the ovarian cancer dataset, significantly outperforming baseline algorithms, including Support Vector Machines (SVM), Logistic Regression (LR), and XGBoost. Specifically, SVM achieved an accuracy of 72.78%, LR achieved 86.09%, and XGBoost achieved 96.69%. These results highlight the superior performance of SHAP-GAN in handling high-dimensional and imbalanced datasets. Furthermore, SHAP-GAN significantly alleviates the challenges associated with intricate genetic data analysis, empowering medical professionals to tailor personalized treatment strategies for individual patients.

(This article belongs to the Special Issue Advances in Artificial Intelligence, Machine Learning and Optimization)

15 pages, 2204 KiB

Open Access Article

The Effectiveness of Generative Adversarial Network-Based Oversampling Methods for Imbalanced Multi-Class Credit Score Classification

by I Nyoman Mahayasa Adiputra, Pei-Chun Lin and Paweena Wanchai

Electronics 2025, 14(4), 697; https://doi.org/10.3390/electronics14040697 - 11 Feb 2025

Cited by 2 | Viewed by 2204

Abstract

Credit score models are essential tools for evaluating creditworthiness and mitigating financial risks. However, the imbalanced nature of multi-class credit score datasets poses significant challenges for traditional classification algorithms, leading to poor performance in minority classes. This study explores the effectiveness of Generative Adversarial Network (GAN)-based oversampling methods, including CTGAN, CopulaGAN, WGAN-GP, and DraGAN, in addressing this issue. By synthesizing realistic data for minority classes and integrating it with majority class data, the study benchmarks these GAN-based methods across classical (KNN, Decision Tree, Logistic Regression) and ensemble machine learning models (XGBoost, Random Forest, LightGBM). Evaluation metrics such as accuracy and F1-score reveal that WGAN-GP consistently achieves superior performance, especially when combined with Random Forest, outperforming other methods in balancing dataset representation and enhancing classification accuracy. The results showed that WGAN-GP + RF achieved 0.873 in accuracy, 0.936 F1-score in the “good” class, 0.806 F1-score in the “poor” class, and 0.816 F1-score in the “standard” class. The findings underscore the potential of GAN-based oversampling in improving multi-class credit score classification and highlight future directions, including hybrid sampling and cost-sensitive learning, to address remaining challenges.
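The per-class F1-scores reported above follow from a one-vs-rest count of true positives, false positives, and false negatives. The sketch below shows that calculation; the function name and the toy labels are hypothetical and do not come from the study's dataset.

```python
def per_class_f1(y_true, y_pred):
    """One-vs-rest F1 for each class: F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members gets 0.0 by convention here.
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical true and predicted credit score classes (illustration only).
y_true = ["good", "good", "poor", "poor", "standard", "standard"]
y_pred = ["good", "poor", "poor", "poor", "standard", "good"]

f1 = per_class_f1(y_true, y_pred)
print(f1)
```

Reporting F1 per class, as the abstract does, exposes weak minority-class performance that a single accuracy figure can hide.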

(This article belongs to the Special Issue Advanced System Architectures and AI-Driven Innovations for Next-Generation Computing)

21 pages, 3796 KiB

Open Access Article

The Urban Intersection Accident Detection Method Based on the GAN-XGBoost and Shapley Additive Explanations Hybrid Model

by Zhongji Shi, Yingping Wang, Dong Guo, Fangtong Jiao, Hu Zhang and Feng Sun

Sustainability 2025, 17(2), 453; https://doi.org/10.3390/su17020453 - 9 Jan 2025

Viewed by 1029

Abstract

Traffic accidents at urban intersections may lead to severe traffic congestion, necessitating effective detection and timely intervention. To achieve real-time traffic accident monitoring at intersections more effectively, this paper proposes an urban road intersection accident detection method based on Generative Adversarial Networks (GANs), Extreme Gradient Boosting (XGBoost), and the SHAP interpretability framework. Data extraction and processing methods are described, and a brief analysis of accident impact features is provided. To address the issue of data imbalance, GAN is used to generate synthetic accident samples. The XGBoost model is then trained on the balanced dataset, and its accident detection performance is validated. In addition, SHAP is employed to interpret the results and analyze the importance of individual features. The results indicate that the accident samples generated by GAN not only retain the characteristics of real data but also enhance sample diversity, improving the AUC value of the XGBoost model by 7.1% to reach 0.844. Compared with the benchmark models mentioned in the study, the AUC value shows an average improvement of 7%. Additionally, the SHAP analysis confirms that the time–vehicle ratio and average speed are key factors influencing the model’s detection results. These findings provide a reliable method for urban road intersection accident detection, and accurate accident location detection can assist urban planners in formulating comprehensive emergency management strategies for intersections, ensuring the sustainable operation of traffic flow.
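The AUC value quoted above can be read as the probability that a randomly chosen accident sample receives a higher score than a randomly chosen non-accident sample. A minimal sketch of that pairwise definition follows; the function name, labels, and scores are hypothetical, and this is not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair (Mann-Whitney U interpretation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = accident) and classifier scores.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]

print(round(roc_auc(labels, scores), 4))
```

The quadratic pair loop is fine for small samples; for large datasets a rank-based implementation is the usual choice.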

(This article belongs to the Section Sustainable Transportation)

23 pages, 3243 KiB

Open Access Article

A Modular AI-Driven Intrusion Detection System for Network Traffic Monitoring in Industry 4.0, Using Nvidia Morpheus and Generative Adversarial Networks

by Beatrice-Nicoleta Chiriac, Florin-Daniel Anton, Anca-Daniela Ioniță and Bogdan-Valentin Vasilică

Sensors 2025, 25(1), 130; https://doi.org/10.3390/s25010130 - 28 Dec 2024

Cited by 3 | Viewed by 3889

Abstract

Every day, a considerable number of new cybersecurity attacks are reported, and the traditional methods of defense struggle to keep up with them. In the current context of the digital era, where industrial environments handle large data volumes, new cybersecurity solutions are required, and intrusion detection systems (IDSs) based on artificial intelligence (AI) algorithms offer an answer to this critical issue. This paper presents an approach for implementing a generic model of a network-based intrusion detection system for Industry 4.0 by integrating the computational advantages of the Nvidia Morpheus open-source AI framework. The solution is modularly built with two pipelines for data analysis. The pipelines use a pre-trained XGBoost (eXtreme Gradient Boosting) model that achieved an accuracy score of up to 90%. The proposed IDS has a fast rate of analysis, managing more than 500,000 inputs in almost 10 s, due to the application of the federated learning methodology. The classification performance of the model was improved by integrating a generative adversarial network (GAN) that generates polymorphic network traffic packets.

(This article belongs to the Special Issue Data Protection and Privacy in Industry 4.0 Era)

18 pages, 4029 KiB

Open Access Article

An Integrated Algorithm with Feature Selection, Data Augmentation, and XGBoost for Ovarian Cancer

by Jingxun Cai, Zne-Jung Lee, Zhihxian Lin, Chih-Hung Hsu and Yun Lin

Mathematics 2024, 12(24), 4041; https://doi.org/10.3390/math12244041 - 23 Dec 2024

Cited by 1 | Viewed by 1109

Abstract

Ovarian cancer is one of the most aggressive gynecological cancers due to its high invasion and chemoresistance. It not only has a high incidence rate but also tops the list of mortality rates. Its subtle early symptoms make subsequent diagnosis difficult, significantly delaying timely treatment for patients. Once ovarian cancer reaches an advanced stage, the complexity and difficulty of treatment increase substantially, affecting patient survival rates. Therefore, it is crucial for both medical professionals and patients to remain highly vigilant about the early signs of ovarian cancer to ensure timely intervention. In recent years, ovarian cancer prediction research has advanced, allowing for the analysis of the likelihood and type of cancer based on patients’ genetic data. With the rapid development of machine learning, numerous efficient classification prediction models have emerged. These new technologies offer significant opportunities and potential for developing ovarian cancer diagnostic prediction methods. However, traditional approaches often struggle to achieve satisfactory classification accuracy in high-dimensional genetic datasets with small sample sizes. This research offers a prediction model utilizing genomic data to enhance the early diagnosis rate of ovarian cancer, incorporating feature selection, data augmentation through adversarial conditional generative adversarial networks (AC-GAN), and an extreme gradient boosting (XGBoost) classifier. First, we can simplify the original genetic dataset through feature selection methods, removing irrelevant variables and noise, thereby improving the model’s predictive accuracy. Following dimensionality reduction, AC-GAN enriches the data, producing more realistic genetic samples to enhance the model’s generalization capacity. Finally, the XGBoost classifier is applied to classify the augmented data, achieving efficient predictions for ovarian cancer. 
These research findings strongly demonstrate that the diagnostic method proposed in this paper has a significant advantage in the predictive diagnosis of ovarian cancer, with an accuracy of 99.01% that surpasses the current technologies in use. Additionally, the algorithm identifies twelve genes highly relevant to ovarian cancer, providing valuable insights for physicians during diagnosis.

19 pages, 1627 KiB

Open Access Article

Multi-Scale Price Forecasting Based on Data Augmentation

by Ting Yue and Yahui Liu

Appl. Sci. 2024, 14(19), 8737; https://doi.org/10.3390/app14198737 - 27 Sep 2024

Cited by 1 | Viewed by 1322

Abstract

When considering agricultural commodity transaction data, long sampling intervals or data sparsity may lead to small samples. Furthermore, training on small samples can lead to overfitting and makes it hard to capture the fine-grained fluctuations in the data. In this study, a multi-scale forecasting approach combined with a Generative Adversarial Network (GAN) and Temporal Convolutional Network (TCN) is proposed to address the problems related to small sample prediction. First, a Time-series Generative Adversarial Network (TimeGAN) is used to expand the multi-dimensional data and t-SNE is utilized to evaluate the similarity between the original and synthetic data. Second, a greedy algorithm is exploited to calculate the information gain, in order to obtain important features, based on XGBoost. Meanwhile, TCN residual blocks and dilated convolutions are used to tackle the issue of gradient disappearance. Finally, an attention mechanism is added to the TCN, which is beneficial in terms of improving the forecasting accuracy. Experiments are conducted on three products, garlic, ginger and chili. Taking garlic as an example, the RMSE of the proposed method was reduced by 1.7% and 1% when compared to the SVR and RF models, respectively. Its R² accuracy was also improved (by 4.3% and 3.4%, respectively). Furthermore, TCN-attention and TCN were found to require less time compared to GRU and LSTM. The accuracy of the proposed method increased by about 5% when compared to that without TimeGAN in the ablation study. Moreover, compared with TCN, the Gated Recurrent Unit (GRU), and the Long Short-term Memory (LSTM) model in the multi-scale price forecasting task, the proposed method can better utilize small samples and high-dimensional data, leading to improved performance. Additionally, the proposed model is compared to the Transformer and TimesNet models in terms of its accuracy, deployment cost, and other metrics.
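One reason dilated convolutions suit multi-scale series is that the receptive field grows quickly with depth. The sketch below computes it for a simple stack with one dilated causal convolution per layer; the kernel size and dilation schedule are assumptions for illustration, since the abstract does not give the TCN configuration.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked dilated causal convolutions,
    assuming one convolution per layer: 1 + (k - 1) * sum(dilations)."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Hypothetical configuration: kernel size 3, dilation doubling per layer.
dilations = [1, 2, 4, 8]
for depth in range(1, len(dilations) + 1):
    print(depth, receptive_field(3, dilations[:depth]))
```

With doubling dilations the receptive field grows roughly exponentially with depth, whereas undilated layers of the same kernel size would add only two steps each. Residual blocks that contain two convolutions per level would need the sum counted twice.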

14 pages, 1403 KiB

Open AccessArticle

PROTA: A Robust Tool for Protamine Prediction Using a Hybrid Approach of Machine Learning and Deep Learning

by Jorge G. Farias, Lisandra Herrera-Belén, Luis Jimenez and Jorge F. Beltrán

Int. J. Mol. Sci. 2024, 25(19), 10267; https://doi.org/10.3390/ijms251910267 - 24 Sep 2024

Cited by 1 | Viewed by 1371

Abstract

Protamines play a critical role in DNA compaction and stabilization in sperm cells, significantly influencing male fertility and various biotechnological applications. Traditionally, identifying these proteins has been a challenging and time-consuming process due to their species-specific variability and complexity. Leveraging advancements in computational biology, we present PROTA, a novel tool that combines machine learning (ML) and deep learning (DL) techniques to predict protamines with high accuracy. For the first time, we integrate Generative Adversarial Networks (GANs) with supervised learning methods to enhance the accuracy and generalizability of protamine prediction. Our methodology evaluated multiple ML models, including Light Gradient-Boosting Machine (LIGHTGBM), Multilayer Perceptron (MLP), Random Forest (RF), eXtreme Gradient Boosting (XGBOOST), k-Nearest Neighbors (KNN), Logistic Regression (LR), Naive Bayes (NB), and Radial Basis Function-Support Vector Machine (RBF-SVM). During ten-fold cross-validation on our training dataset, the MLP model with GAN-augmented data demonstrated superior performance metrics: 0.997 accuracy, 0.997 F1 score, 0.998 precision, 0.997 sensitivity, and 1.0 AUC. In the independent testing phase, this model achieved 0.999 accuracy, 0.999 F1 score, 1.0 precision, 0.999 sensitivity, and 1.0 AUC. These results establish PROTA, which is accessible via a user-friendly web application, as a robust tool for protamine prediction. We anticipate that PROTA will be a crucial resource for researchers, enabling the rapid and reliable prediction of protamines, thereby advancing our understanding of their roles in reproductive biology, biotechnology, and medicine.

(This article belongs to the Special Issue Machine Learning Applications in Bioinformatics and Biomedicine: 2nd Edition)

21 pages, 5246 KiB

Open AccessArticle

SFCWGAN-BiTCN with Sequential Features for Malware Detection

by Bona Xuan, Jin Li and Yafei Song

Appl. Sci. 2023, 13(4), 2079; https://doi.org/10.3390/app13042079 - 5 Feb 2023

Cited by 9 | Viewed by 2192

Abstract

In the field of adversarial attacks, the generative adversarial network (GAN) has shown strong performance. However, few studies have applied it to malware sample supplementation, owing to the complexity of handling discrete data. More importantly, unbalanced malware family samples interfere with the analytical power of malware detection models and mislead malware classification. To address the impact of malware family imbalance on accuracy, a selection feature conditional Wasserstein generative adversarial network (SFCWGAN) and a bidirectional temporal convolutional network (BiTCN) are proposed. First, we extract the features of malware Opcode and API sequences and use Word2Vec to represent features, emphasizing the semantic logic between API tuning and Opcode calling sequences. Second, the Spearman correlation coefficient and the whale optimization algorithm extreme gradient boosting (WOA-XGBoost) algorithm are combined to select features, filter out invalid features, and simplify the structure. Finally, we propose a GAN-based sequence feature generation algorithm. Samples were generated using the conditional Wasserstein generative adversarial network (CWGAN) on the imbalanced malware family dataset, added to the training set to supplement the samples, and used to train the BiTCN. In tests on the Kaggle and DataCon datasets, the model achieved detection accuracies of 99.56% and 96.93%, respectively, which were 0.18% and 2.98% higher than those of other methods.

15 pages, 11075 KiB

Open AccessArticle

A Synthetic Data Generation Technique for Enhancement of Prediction Accuracy of Electric Vehicles Demand

by Subhajit Chatterjee and Yung-Cheol Byun

Sensors 2023, 23(2), 594; https://doi.org/10.3390/s23020594 - 4 Jan 2023

Cited by 23 | Viewed by 4534

Abstract

Among electric vehicles (EVs), electric kickboards are crucial elements of smart transportation networks for short-distance travel that is risk-free, economical, and environmentally friendly. Forecasting the daily demand can improve the local service provider’s access to information and help them manage their short-term supply more effectively. This study developed a forecasting model using real-time data and weather information from Jeju Island, South Korea. Cluster analysis of electric kickboard rental patterns is a component of the forecasting process. We could not achieve noticeable results at first because of the small amount of training data, as a solid prediction requires a large amount of data. For the subsequent experiments, we therefore created synthetic time-series data using a generative adversarial network (GAN) approach and combined the synthetic data with the original data. The outcomes show that the GAN-based synthetic data generation approach has the potential to enhance prediction accuracy. We employ an ensemble model to improve prediction results beyond what can be achieved using a single regressor model. The ensemble is a weighted combination of several base regression models into one meta-regressor. To anticipate the daily demand in this study, we create an ensemble model by merging three separate base machine learning algorithms, namely CatBoost, Random Forest (RF), and Extreme Gradient Boosting (XGBoost). The effectiveness of the suggested strategies was assessed using several evaluation indicators. The forecasting outcomes demonstrate that mixing synthetic data with original data improves the robustness of daily demand forecasting and outperforms other models, yielding better values for the chosen assessment measures. The outcomes further show that applying ensemble techniques can reasonably increase the forecasting model’s accuracy for daily electric kickboard demand.

(This article belongs to the Special Issue Smart Sensors and Machine Learning Technique for Damage Detection and Visualization)
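The abstract above describes the ensemble as a weighted combination of base regressors. The Python sketch below shows one way such a combination could be formed. The abstract does not say how the weights are chosen, so the inverse-validation-RMSE weighting used here is an assumption, as are the function names and the toy numbers, which are not results from the paper.

```python
import math
from typing import Dict, List, Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def inverse_rmse_weights(y_val: Sequence[float],
                         val_preds: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Assumed weighting rule: weight each model by 1 / validation RMSE, normalized to sum to 1."""
    inv = {name: 1.0 / max(rmse(y_val, preds), 1e-12) for name, preds in val_preds.items()}
    total = sum(inv.values())
    return {name: value / total for name, value in inv.items()}


def weighted_ensemble(preds: Dict[str, Sequence[float]],
                      weights: Dict[str, float]) -> List[float]:
    """Combine per-model predictions into one prediction per sample."""
    n = len(next(iter(preds.values())))
    return [sum(weights[name] * preds[name][i] for name in preds) for i in range(n)]


# Toy validation data and predictions from two hypothetical base models.
y_val = [10.0, 20.0]
val_preds = {"model_a": [11.0, 21.0], "model_b": [12.0, 22.0]}
weights = inverse_rmse_weights(y_val, val_preds)

# Combine the two models' predictions for one new sample.
combined = weighted_ensemble({"model_a": [30.0], "model_b": [33.0]}, weights)
```

In practice the base predictions would come from fitted CatBoost, RF, and XGBoost regressors, and the weights could equally be learned by a meta-regressor rather than set by a fixed rule.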

Search Results (17)