MDPI - Publisher of Open Access Journals

27 pages, 524 KB

Open Access Article

Synthetic Data Augmentation for Imbalanced Tabular Protein Subcellular Localization: A Comparative Study of SMOTE, CTGAN, TVAE, and TabDDPM Methods

by Ali Fatih Gündüz and Canan Batur Şahin

Appl. Sci. 2026, 16(8), 3694; https://doi.org/10.3390/app16083694 - 9 Apr 2026

Viewed by 826

Abstract

Class imbalance is a persistent challenge in supervised machine learning, particularly in biological datasets where minority classes represent functionally critical categories. Synthetic data generation has emerged as a principal strategy for mitigating this problem, yet systematic comparisons of classical and modern deep generative approaches remain limited. This study presents a comprehensive benchmark evaluation of four synthetic data generation methods (SMOTE, CTGAN, TVAE, and TabDDPM) across two well-established biological datasets from the UCI Machine Learning Repository: the E. coli protein localization dataset (307 samples, 6 features, 4 classes) and the yeast protein localization dataset (1299 samples, 8 features, 4 classes). Synthetic data quality was rigorously assessed using a multi-dimensional evaluation framework encompassing distributional fidelity (Fréchet Distance, Wasserstein Distance), machine learning utility (Train-on-Synthetic-Test-on-Real and Train-on-Real-Test-on-Real protocols using XGBoost version 3.2.0, Logistic Regression, Support Vector Machines, and Random Forest), and distinguishability (Classifier Two-Sample Test). Both datasets are imbalanced. In the experiments, each dataset was augmented to three times its original size while preserving the imbalanced class-sample ratio. To evaluate the quality of synthetic data, the max(AUC, 1−AUC) score is proposed. Synthetic data quality is inversely related to this score: a lower value means that a classifier has more difficulty distinguishing synthetic data from real data. Per-class analysis reveals that minority classes remain the primary challenge across all generative methods. SMOTE and TabDDPM obtained the highest predictive utility F1-scores across both datasets. TVAE offers the strongest distributional fidelity among deep generative models, producing synthetic samples that are most difficult to distinguish from real data (lowest C2ST scores). CTGAN exhibits significant performance degradation on both small- and medium-scale datasets, with F1 utility ratios below 0.50.
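
As a rough illustration of the max(AUC, 1−AUC) score mentioned in this abstract, the Python sketch below computes it from the scores a real-versus-synthetic classifier assigns to held-out rows. The function names `rank_auc` and `c2st_score` and the rank-based AUC computation are assumptions made for this example; the abstract does not say how the authors compute the AUC.

```python
from typing import Sequence


def rank_auc(scores_real: Sequence[float], scores_synth: Sequence[float]) -> float:
    """AUC of a real-vs-synthetic classifier via the Mann-Whitney U statistic.

    Each argument holds the classifier's "is real" scores for held-out rows.
    A tie between a real and a synthetic score counts as half a win.
    """
    if not scores_real or not scores_synth:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for r in scores_real:
        for s in scores_synth:
            if r > s:
                wins += 1.0
            elif r == s:
                wins += 0.5
    return wins / (len(scores_real) * len(scores_synth))


def c2st_score(scores_real: Sequence[float], scores_synth: Sequence[float]) -> float:
    """Symmetric distinguishability score max(AUC, 1 - AUC), in [0.5, 1.0]."""
    auc = rank_auc(scores_real, scores_synth)
    return max(auc, 1.0 - auc)
```

Taking the maximum makes the score insensitive to a classifier whose labels are flipped: a value near 0.5 suggests the synthetic rows are hard to tell apart from real ones, while a value near 1.0 suggests they are easy to separate.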

(This article belongs to the Section Computing and Artificial Intelligence)

22 pages, 4100 KB

Open Access Article

Explainable Machine Learning-Based Urban Waterlogging Prediction Framework

by Yinghua Deng and Xin Lu

Urban Sci. 2026, 10(3), 156; https://doi.org/10.3390/urbansci10030156 - 13 Mar 2026

Cited by 2 | Viewed by 890

Abstract

Urban waterlogging has become a critical challenge to urban sustainability under the combined pressures of rapid urbanization and increasingly frequent extreme weather events. However, traditional predictive models struggle to deliver effective real-time, point-specific early warning, primarily because of interference from redundant high-dimensional data and an inability to handle severe data imbalance. This study proposes a lightweight and interpretable machine learning framework for real-time waterlogging hotspot prediction, based on a multi-dimensional feature space. Specifically, we implement a Lasso-based mechanism to distill 37 multi-source variables into five core determinants. This process effectively isolates dominant environmental drivers while filtering noise. To further overcome the recall bottleneck, we propose a Synthetic Minority Over-sampling Technique based on Weighted Distance and Cleaning (SMOTE-WDC) algorithm that incorporates weighted feature distances and density-based noise cleaning. Validating the framework on datasets from Shenzhen (2023–2024), we demonstrate that the Gradient Boosting Decision Tree (GBDT) model integrated with this strategy achieves optimal performance using only five features, yielding an F1-score of 0.808 and an Area Under the Precision-Recall Curve (AUC-PR) of 0.895. Notably, a Recall of 0.882 is attained, representing a 4.6% improvement over the baseline. This study contributes a cost-effective, high-sensitivity approach to disaster risk reduction, advancing predictive urban waterlogging management.

(This article belongs to the Special Issue Flooding Prevention Strategies for Flood-Prone Cities Under Climate Change)

24 pages, 4999 KB

Open Access Article

PhysGMM-MoE: A Physics-Aware GMM-Mixture-of-Experts Framework for Small-Sample Engine Fault Classification

by Qingang Xu, Hongwei Wang, Yunhang Wang and Xicong Chen

Appl. Sci. 2026, 16(5), 2417; https://doi.org/10.3390/app16052417 - 2 Mar 2026

Viewed by 516

Abstract

Accurate engine fault classification with limited labeled data is critical for the safety and reliability of rotating machinery. This task is challenging because operating regimes are time-varying and key variables must satisfy physical constraints, conditions under which traditional feature-classifier pipelines degrade and deep networks tend to overfit. We propose PhysGMM-MoE, a physics-aware Gaussian Mixture Model (GMM)-Mixture-of-Experts (MoE) framework for small-sample engine fault classification. At the data level, PhysGMM-MoE fits class-conditional, regime-aware GMMs and performs physically constrained, distance-based quality control to selectively augment minority classes while preserving engine operating semantics. At the model level, a heterogeneous pool of lightweight statistical experts and a lightweight Transformer-based deep expert (ECFT-Transformer) capture complementary neighborhood cues and high-order multi-sensor correlations, and an L2-regularized logistic regression meta-learner fuses expert outputs via stacking. We evaluate fault classification on the 3500-DEFault diesel-engine dataset using the adopted eight-class cylinder-fault labeling (H, F1–F7) built from in-cylinder pressure statistics and torsional-vibration harmonics; although severity levels exist in the dataset, this study focuses on classification rather than severity estimation. With 40 training samples per class, PhysGMM-MoE achieves a mean accuracy of 0.9875, exceeding SMOTE+XGBoost by 0.0086, and attains the best macro precision/recall/F1 of 0.9878/0.9826/0.9889, demonstrating strong performance under the adopted small-sample setting.

20 pages, 4551 KB

Open Access Article

Explainable Learning Framework for the Assessment and Prediction of Wind Shear-Induced Aviation Turbulence

by Afaq Khattak, Pak-wai Chan, Feng Chen, Adil A. M. Elhassan and Badr T. Alsulami

Atmosphere 2025, 16(12), 1318; https://doi.org/10.3390/atmos16121318 - 22 Nov 2025

Viewed by 946

Abstract

Wind shear-induced aviation turbulence (WSAT) remains a major safety concern during approach and takeoff phases at complex terrain airports. This study develops an interpretable Explainable Boosting Machine (EBM) framework to classify WSAT events at Hong Kong International Airport (HKIA). The framework integrates Differential Evolution with HyperBand (DEHB) for hyperparameter tuning and applies multiple data-balancing methods: SMOTE, Borderline SMOTE, Safe-Level SMOTE, and G-SMOTE. The dataset consists of Pilot Reports (PIREPs) collected between 1 January 2007 and 31 July 2023, covering 6838 wind shear events with variables describing wind shear magnitude, altitude, runway distance, rainfall condition, and causal factors. Among all configurations, the EBM tuned via DEHB and trained with SMOTE-treated data achieved the highest predictive performance, with BA = 0.710, MCC = 0.321, and G-Mean = 0.708, outperforming the untreated baseline and the other balancing variants. EBM-based interpretation showed that wind shear altitude and wind shear magnitude were key predictors, and their interaction reflected a nonlinear pattern in which WSAT probability rose under moderate-to-high shear conditions (wind shear altitude ≈ 0.5–2.5 and magnitude ≈ 30–35 knots). The DEHB-optimized EBM–SMOTE framework provides a transparent interpretive foundation for WSAT risk assessment and advances quantitative evaluation in aviation meteorology.
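
The three metrics reported in this abstract (BA, MCC, and G-Mean) can all be derived from a binary confusion matrix. The Python sketch below uses the standard textbook definitions; the function name `imbalance_metrics` and the dictionary keys are illustrative choices, and the code assumes both classes are present in the evaluation data.

```python
import math
from typing import Dict


def imbalance_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Balanced accuracy, G-mean, and MCC from binary confusion-matrix counts.

    Assumes at least one actual positive and one actual negative, so that
    sensitivity and specificity are both defined.
    """
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    balanced_accuracy = (sensitivity + specificity) / 2.0
    g_mean = math.sqrt(sensitivity * specificity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"balanced_accuracy": balanced_accuracy, "g_mean": g_mean, "mcc": mcc}
```

Unlike plain accuracy, all three values stay low when a model ignores the minority class, which is why they are commonly reported for imbalanced problems.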

(This article belongs to the Section Atmospheric Techniques, Instruments, and Modeling)

22 pages, 1314 KB

Open Access Article

Capturing Compensatory Reserve in Sarcopenia: A Bioengineering Framework for Multidimensional Temporal Analysis of Center-of-Pressure Signals

by Qinghe Zhao, Qing Xiao, Yu Chen, Muyu Yang, Lunzhi Dai, Yan Xiong and Jirong Yue

Bioengineering 2025, 12(11), 1143; https://doi.org/10.3390/bioengineering12111143 - 23 Oct 2025

Viewed by 1010

Abstract

Conventional balance assessments often miss subtle deficits in sarcopenia patients due to compensatory strategies. This study develops a computational framework using multidimensional temporal analysis of center-of-pressure (COP) signals to quantify variations in compensatory reserve (the capacity to mask balance impairments) within these patients. COP data were collected from 82 older adults (sarcopenia vs. controls) during static standing on a standard clinical force platform of the kind routinely used for geriatric balance testing. The framework integrates Dynamic Time Warping distances from a healthy template, fixed-weight LSTM embeddings, and statistical metrics, with feature selection and 5-fold cross-validation with SMOTE to mitigate overfitting. Semi-tandem stance was most discriminative, achieving 0.84 ± 0.04 accuracy and 0.86 ± 0.05 ROC-AUC and outperforming conventional kinematic features. SHAP analysis identified DTW-based features as primary drivers, correlating with clinical severity indicators, while intra-group variability in prediction probabilities indicated a compensatory reserve gradient. This study introduces a feasible bioengineering methodology based on clinical COP platform analysis, laying the groundwork for future validation and translation into routine clinical assessment tools.

(This article belongs to the Section Biosignal Processing)

29 pages, 5489 KB

Open Access Article

A Hybrid Deep Learning-Based Architecture for Network Traffic Anomaly Detection via EFMS-Enhanced KMeans Clustering and CNN-GRU Models

by Daniel Quirumbay Yagual, Diego Fernández Iglesias and Francisco J. Nóvoa

Appl. Sci. 2025, 15(20), 10889; https://doi.org/10.3390/app152010889 - 10 Oct 2025

Viewed by 2953

Abstract

Early detection of network traffic anomalies is critical for cybersecurity, as a single compromised host can cause data breaches, reputational damage, and operational disruptions. However, traditional systems based on signatures and static rules are often ineffective against sophisticated and evolving threats. This study proposes a hybrid deep learning architecture for proactive anomaly detection in local and metropolitan networks. The dataset underwent an extensive process of cleaning, transformation, and feature selection, including normalization of numerical fields, encoding of ordinal variables, and derivation of behavioral metrics. The EFMS-KMeans algorithm was applied to pre-label traffic as normal or anomalous by estimating dense centers and computing centroid distances, enabling the training of a sequential CNN-GRU network, where the CNN captures spatial patterns and the GRU models temporal dependencies. To address class imbalance, the SMOTE technique was integrated, and the loss function was adjusted to improve training stability. Experimental results show a substantial improvement in accuracy and generalization compared to conventional approaches, validating the effectiveness of the proposed method for detecting anomalous traffic in dynamic and complex network environments.

(This article belongs to the Special Issue Cybersecurity: Advances in Security and Privacy Enhancing Technology)

20 pages, 58155 KB

Open Access Article

Machine Learning-Based Land Cover Mapping of Nanfeng Village with Emphasis on Landslide Detection

by Kieu Anh Nguyen, Chiao-Shin Huang and Walter Chen

Sustainability 2025, 17(18), 8250; https://doi.org/10.3390/su17188250 - 14 Sep 2025

Cited by 3 | Viewed by 1165

Abstract

Landslides pose a significant threat to Taiwan’s mountainous regions, particularly after extreme weather events such as typhoons. This study introduces a machine learning framework for post-disaster land use-land cover (LULC) classification and landslide detection in Nanfeng Village, central Taiwan, following Typhoon Khanun in August 2023. Using high-resolution Pléiades imagery and 22 environmental and spectral factors, a Random Forest classifier was developed. To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was systematically evaluated across multiple variants. The Distance_SMOTE method yielded the best results, increasing overall accuracy from 74% to 85% and the Kappa coefficient from 0.69 to 0.82. F1-scores for landslides, roads, and grassland improved markedly, reaching 0.97, 0.85, and 0.78, respectively. The optimized model produced accurate pre- and post-typhoon LULC maps, revealing significant expansion of landslide zones after the event. This study demonstrates the practical value of combining SMOTE-based resampling with Random Forest for rapid, reliable post-disaster assessment, offering actionable insights for disaster response and land management in data-imbalanced conditions. By enabling timely mapping of hazard-affected areas and informing targeted recovery actions, the approach supports disaster risk reduction, sustainable land use planning, and ecosystem restoration. These outcomes contribute to the Sustainable Development Goals, particularly SDG 11 (Sustainable Cities and Communities), SDG 13 (Climate Action), and SDG 15 (Life on Land), by strengthening community resilience, promoting climate adaptation, and protecting terrestrial ecosystems in hazard-prone regions.

(This article belongs to the Special Issue Sustainable Assessment and Risk Analysis on Landslide Hazards)

20 pages, 1647 KB

Open Access Article

Research on the Enhancement of Provincial AC/DC Ultra-High Voltage Power Grid Security Based on WGAN-GP

by Zheng Shi, Yonghao Zhang, Zesheng Hu, Yao Wang, Yan Liang, Jiaojiao Deng, Jie Chen and Dingguo An

Electronics 2025, 14(14), 2897; https://doi.org/10.3390/electronics14142897 - 19 Jul 2025

Cited by 2 | Viewed by 880

Abstract

With the advancement of the “dual carbon” strategy and the integration of high proportions of renewable energy sources, AC/DC ultra-high-voltage power grids are facing new security challenges such as commutation failure and multi-infeed coupling effects. Fault diagnosis, as an important tool for assisting power grid dispatching, is essential for maintaining the grid’s long-term stable operation. Traditional fault diagnosis methods encounter challenges such as limited samples and data quality issues under complex operating conditions. To overcome these problems, this study proposes a fault sample data enhancement method based on the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP). First, a simulation model of the AC/DC hybrid system is constructed to obtain the original fault sample data. Then, by adopting the Wasserstein distance measure and the gradient penalty strategy, an improved WGAN-GP architecture suited to feature learning for the AC/DC hybrid system is designed. Finally, a comparison of fault diagnosis performance across different data models shows that the proposed method achieves up to 100% accuracy on certain fault types and improves the average accuracy by 6.3% compared to SMOTE and vanilla GAN, particularly under limited-sample conditions. These results confirm that the proposed approach can effectively extract fault characteristics from complex fault data.

(This article belongs to the Special Issue Applications of Computational Intelligence, 3rd Edition)

34 pages, 1710 KB

Open Access Article

Logistics Sprawl and Urban Congestion Dynamics Toward Sustainability: A Logistic Regression and Random-Forest-Based Model

by Manal El Yadari, Fouad Jawab, Imane Moufad and Jabir Arif

Sustainability 2025, 17(13), 5929; https://doi.org/10.3390/su17135929 - 27 Jun 2025

Cited by 5 | Viewed by 2746

Abstract

Increasing road congestion is the main constraint that may influence the economic development of cities and the efficiency of urban freight transport, because it generates additional delay-related costs, affects social life, increases environmental emissions, and decreases service quality. Congestion may result from several factors, including an increase in logistics activities in the urban core. Therefore, this paper aims to define the relationship between the logistics sprawl phenomenon and congestion level. To this end, we explored the literature to summarize the phenomenon of logistics sprawl in different cities and defined the dependent and independent variables. Congestion level was defined as the dependent variable, while the increasing distance resulting from logistics sprawl, along with city and operational flow characteristics, was treated as independent variables. We compared the performance of several models, including decision tree, support vector machine, gradient boosting, k-nearest neighbor, logistic regression, and random forest. Among all the models tested, the random forest algorithm delivered the best predictive performance. We combined logistic regression, for its interpretability, with random forest, for its predictive strength, to define, explain, and interpret the relationship between the studied variables. Subsequently, we collected data from the literature and various databases, including transit city sources. The resulting dataset, composed of secondary and open-source data, was then enhanced through standard augmentation techniques (SMOTE, mixup, Gaussian noise, and linear interpolation) to improve class balance and data quality and ensure the robustness of the analysis. We then implemented the analysis in Python and ran it in Colab. As a result, we derived an equation that describes the relationship between the congestion level and the defined independent variables.

(This article belongs to the Special Issue Sustainable Operations and Green Supply Chain)

21 pages, 3668 KB

Open Access Article

LD-SMOTE: A Novel Local Density Estimation-Based Oversampling Method for Imbalanced Datasets

by Jiacheng Lyu, Jie Yang, Zhixun Su and Zilu Zhu

Symmetry 2025, 17(2), 160; https://doi.org/10.3390/sym17020160 - 22 Jan 2025

Cited by 5 | Viewed by 2973

Abstract

Imbalanced data have become a major stumbling block in the field of machine learning. In this paper, a novel oversampling method based on local density estimation, namely LD-SMOTE, is presented to address limitations of the popular rebalancing technique SMOTE. LD-SMOTE begins with k-means clustering to quantitatively measure the classification contribution of each feature. Subsequently, a novel distance metric grounded in Jaccard similarity is defined, which accentuates the features that are more closely linked to the minority class. Using this metric, we estimate the local density with a Gaussian-like function to control the quantity of synthetic samples around every minority sample, thus simulating the distribution of the minority class. Additionally, in LD-SMOTE synthetic samples are generated within a triangular region formed by a minority sample and two of its chosen neighbors, instead of on the line connecting the minority sample and one of its neighbors. Experimental comparisons between LD-SMOTE and 16 existing resampling methods on 19 datasets show that LD-SMOTE achieves significant average gains of 6.4% in accuracy, 4.4% in the F-measure, 5.4% in the G-mean, and 4.0% in AUC. This result indicates that LD-SMOTE can be an alternative oversampling method for imbalanced datasets.
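
To make the geometric difference described in this abstract concrete, the Python sketch below contrasts the classic SMOTE step (a point on the segment between a minority sample and one neighbor) with drawing a point inside the triangle spanned by the sample and two neighbors. The function names are hypothetical, and the uniform barycentric sampling is an assumption: the abstract does not state how LD-SMOTE draws points within the triangle, nor does this sketch cover its density estimation or distance metric.

```python
import random
from typing import List, Sequence

Vector = List[float]


def smote_on_segment(x: Sequence[float], neighbor: Sequence[float],
                     rng: random.Random) -> Vector:
    """Classic SMOTE step: a random point on the segment from x to one neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


def sample_in_triangle(x: Sequence[float], n1: Sequence[float],
                       n2: Sequence[float], rng: random.Random) -> Vector:
    """A random point inside the triangle spanned by x and two neighbors.

    Uses uniform barycentric weights (an assumption made for illustration).
    """
    u, v = rng.random(), rng.random()
    if u + v > 1.0:
        # Reflect the pair so the weights stay inside the triangle.
        u, v = 1.0 - u, 1.0 - v
    w = 1.0 - u - v  # weight of the minority sample itself
    return [w * xi + u * ai + v * bi for xi, ai, bi in zip(x, n1, n2)]
```

Sampling from a two-dimensional region rather than a one-dimensional segment lets synthetic points cover more of the local neighborhood of each minority sample.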

(This article belongs to the Section Computer)

16 pages, 1034 KB

Open Access Article

Efficient Sleep Stage Identification Using Piecewise Linear EEG Signal Reduction: A Novel Algorithm for Sleep Disorder Diagnosis

by Yash Paul, Rajesh Singh, Surbhi Sharma, Saurabh Singh and In-Ho Ra

Sensors 2024, 24(16), 5265; https://doi.org/10.3390/s24165265 - 14 Aug 2024

Cited by 7 | Viewed by 3555

Abstract

Sleep is a vital physiological process for human health, and accurately detecting various sleep states is crucial for diagnosing sleep disorders. This study presents a novel algorithm for identifying sleep stages using EEG signals, which is more efficient and accurate than the state-of-the-art methods. The key innovation lies in employing a piecewise linear data reduction technique called the Halfwave method in the time domain. This method simplifies EEG signals into a piecewise linear form with reduced complexity while preserving sleep stage characteristics. Then, a feature vector with six statistical features is built using parameters obtained from the reduced piecewise linear function. We tested the proposed method on the MIT-BIH Polysomnographic Database, which includes more than 80 h of recordings of different biomedical signals with six main sleep classes. We evaluated different classifiers and found that the K-Nearest Neighbor classifier performed best with the proposed method. According to the experimental findings, the average sensitivity, specificity, and accuracy of the proposed algorithm on eight records of the Polysomnographic Database are estimated at 94.82%, 96.65%, and 95.73%, respectively. Furthermore, the algorithm shows promise in its computational efficiency, making it suitable for real-time applications such as sleep monitoring devices. Its robust performance across various sleep classes suggests its potential for widespread clinical adoption, making significant advances in the knowledge, detection, and management of sleep problems.

(This article belongs to the Section Biosensors)

15 pages, 479 KB

Open Access Article

Data-Centric Solutions for Addressing Big Data Veracity with Class Imbalance, High Dimensionality, and Class Overlapping

by Armando Bolívar, Vicente García, Roberto Alejo, Rogelio Florencia-Juárez and J. Salvador Sánchez

Appl. Sci. 2024, 14(13), 5845; https://doi.org/10.3390/app14135845 - 4 Jul 2024

Cited by 6 | Viewed by 2650

Abstract

An innovative strategy for organizations to obtain value from their large datasets, allowing them to guide future strategic actions and improve their initiatives, is the use of machine learning algorithms. This has led to a growing and rapid application of various machine learning algorithms with a predominant focus on building and improving the performance of these models. However, this model-centric approach ignores the fact that data quality is crucial for building robust and accurate models. Several dataset issues, such as class imbalance, high dimensionality, and class overlapping, affect data quality, introducing bias to machine learning models. Therefore, adopting a data-centric approach is essential to constructing better datasets and producing effective models. Besides data issues, Big Data imposes new challenges, such as the scalability of algorithms. This paper proposes a scalable hybrid approach to jointly addressing class imbalance, high dimensionality, and class overlapping in Big Data domains. The proposal is based on well-known data-level solutions whose main operation is calculating the nearest neighbor using the Euclidean distance as a similarity metric. However, these strategies may lose their effectiveness on datasets with high dimensionality. Hence, data quality is improved by combining a data transformation approach using fractional norms with SMOTE to obtain a balanced and reduced dataset. Experiments carried out on nine two-class imbalanced and high-dimensional large datasets showed that our scalable methodology implemented in Spark outperforms the traditional approach.
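
The fractional norms mentioned in this abstract are commonly understood as Minkowski-type distances with an exponent p between 0 and 1, used in place of the Euclidean case p = 2. The Python sketch below shows that general formula; the function name `minkowski_distance` is an illustrative choice, and reading "fractional norms" as 0 < p < 1 is an assumption, since the abstract does not give the exact definition or the value of p the authors use.

```python
from typing import Sequence


def minkowski_distance(a: Sequence[float], b: Sequence[float], p: float) -> float:
    """Minkowski-type distance (sum |a_i - b_i|^p)^(1/p).

    p = 2 gives the Euclidean distance and p = 1 the Manhattan distance.
    For 0 < p < 1 the result is a "fractional" dissimilarity; it no longer
    satisfies the triangle inequality, so it is not a true metric.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)
```

A nearest-neighbor search, such as the one inside SMOTE, could call this function with a small p instead of the Euclidean distance when the feature space is high-dimensional.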

(This article belongs to the Section Computing and Artificial Intelligence)

16 pages, 4372 KB

Open Access Editor’s Choice Article

Wind Shear and Aircraft Aborted Landings: A Deep Learning Perspective for Prediction and Analysis

by Afaq Khattak, Jianping Zhang, Pak-Wai Chan, Feng Chen, Arshad Hussain and Hamad Almujibah

Atmosphere 2024, 15(5), 545; https://doi.org/10.3390/atmos15050545 - 29 Apr 2024

Cited by 10 | Viewed by 3657

Abstract

In civil aviation, severe weather conditions such as strong wind shear, crosswinds, and thunderstorms near airport runways often compel pilots to abort landings to ensure flight safety. While aborted landings due to wind shear are not common, they occur under specific environmental and situational circumstances. This research aims to accurately predict aircraft aborted landings using three advanced deep learning techniques: the conventional deep neural network (DNN), the deep and cross network (DCN), and the wide and deep network (WDN). These models are supplemented by various data augmentation methods, including the Synthetic Minority Over-Sampling Technique (SMOTE), KMeans-SMOTE, and Borderline-SMOTE, to correct the imbalance in pilot report data. Bayesian optimization was utilized to fine-tune the models for optimal predictive accuracy. The effectiveness of these models was assessed through metrics including sensitivity, precision, F1-score, and the Matthews Correlation Coefficient. The Shapley Additive Explanations (SHAP) algorithm was then applied to the most effective models to interpret their results and identify key factors, revealing that the intensity of wind shear, specific runways such as 07R, and the vertical distance of wind shear from the runway (within 700 feet above runway level) were significant factors. The results of this research provide valuable insights to civil aviation experts, potentially revolutionizing safety protocols for managing aborted landings under adverse weather conditions, thereby improving overall airport efficiency and safety.

(This article belongs to the Section Atmospheric Techniques, Instruments, and Modeling)

14 pages, 628 KB

Open Access Article

Three-Stage Sampling Algorithm for Highly Imbalanced Multi-Classification Time Series Datasets

by Haoming Wang

Symmetry 2023, 15(10), 1849; https://doi.org/10.3390/sym15101849 - 1 Oct 2023

Cited by 2 | Viewed by 2704

Abstract

To alleviate the data imbalance problem caused by subjective and objective factors, scholars have developed a range of data-preprocessing algorithms, among which undersampling algorithms are widely used because they are fast and efficient. However, when some categories in a multi-classification dataset have too few samples to be processed via sampling, or a minority class has only one or two samples, traditional undersampling algorithms become less effective. In this study, we select nine multi-classification time series datasets with extremely few samples as research objects, take the characteristics of time series data into account, and propose a new three-stage sampling algorithm to alleviate the data imbalance problem. In stage one, random oversampling with disturbance terms is used to increase the number of sample points; in stage two, SMOTE (synthetic minority oversampling technique) oversampling is applied to the output of stage one; in stage three, the dynamic time-warping distance is used to calculate the distance between sample points, identify the sample points that form Tomek links at the class boundary, and clean up the boundary noise. On the nine datasets, the new sampling algorithm is compared with four classic undersampling algorithms, namely ENN (edited nearest neighbours), NCR (neighbourhood cleaning rule), OSS (one-sided selection), and RENN (repeated edited nearest neighbours), using macro accuracy, recall, and F1-score as evaluation indicators.
The results are as follows: on FiftyWords, the dataset with the most categories and the fewest minority class samples among the nine, the new sampling algorithm reached an accuracy of 0.7156, far higher than that of ENN, RENN, OSS, and NCR; its recall of 0.7261 was also better than that of the four undersampling algorithms; and its F1-score was 200.71%, 188.74%, 155.29%, and 85.61% higher than that of ENN, RENN, OSS, and NCR, respectively. On the other eight datasets, the new sampling algorithm also scored well on these indicators. The proposed algorithm can effectively alleviate the data imbalance problem of multi-classification time series datasets with many categories and few minority class samples while also cleaning up the boundary noise between classes.
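The three stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dtw_distance`, `jitter_oversample`, `smote_like`, `tomek_links`) are hypothetical, the disturbance is assumed to be Gaussian noise, SMOTE neighbours are assumed to be chosen by Euclidean distance on equal-length series, and the choice of which member of a Tomek link to discard is left open because the abstract does not specify it.

```python
import random


def dtw_distance(a, b):
    """Dynamic time-warping distance with an absolute-difference step cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)          # row 0 of the cost matrix
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            # cheapest of: advance a only, advance both, advance b only
            curr[j] = abs(x - y) + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[-1]


def jitter_oversample(series_list, n_new, scale, rng):
    """Stage 1 (assumed form): copy random series and add Gaussian noise."""
    out = []
    for _ in range(n_new):
        base = rng.choice(series_list)
        out.append([v + rng.gauss(0.0, scale) for v in base])
    return out


def smote_like(series_list, n_new, k, rng):
    """Stage 2: interpolate between a series and one of its k nearest
    same-class neighbours (Euclidean distance is an assumption here).
    Needs at least two equal-length series in series_list."""
    out = []
    for _ in range(n_new):
        base = rng.choice(series_list)
        others = [s for s in series_list if s is not base]
        others.sort(key=lambda s: sum((p - q) ** 2 for p, q in zip(base, s)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()                 # interpolation weight in [0, 1)
        out.append([p + lam * (q - p) for p, q in zip(base, neighbour)])
    return out


def tomek_links(X, y):
    """Stage 3: index pairs that are mutual nearest neighbours under DTW
    but carry different class labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dtw_distance(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]
```

In this sketch, stages one and two would be run per minority class, and the pairs returned by `tomek_links` would then be candidates for removal as boundary noise.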

(This article belongs to the Topic Advances in Computational Materials Sciences)

23 pages, 1105 KB

Open AccessArticle

Automated Battery Making Fault Classification Using Over-Sampled Image Data CNN Features

by Nasir Ud Din, Li Zhang and Yatao Yang

Sensors 2023, 23(4), 1927; https://doi.org/10.3390/s23041927 - 8 Feb 2023

Cited by 30 | Viewed by 4385

Abstract

Because batteries are expected to be reliable and secure products, fault detection has become a critical part of the manufacturing process. Testing each battery by hand for manufacturing faults, including burning, welding that is too high, missing welds, shifting, welding holes, and so forth, takes considerable labor and effort. Manual battery fault detection is also slow and extremely expensive. We address this issue by using image processing and machine learning techniques to automatically detect faults in the battery manufacturing process. Our approach will reduce the need for human intervention, save time, and be easy to implement. A CMOS camera was used to collect a large number of images belonging to eight common battery manufacturing faults. The welding area of the batteries’ positive and negative terminals was captured from different distances, between 40 and 50 cm. Before deploying the learning models, we first used a CNN to extract features from the image data. Because the dataset was highly imbalanced, which led the learning model to over-fit, we over-sampled it using the Synthetic Minority Over-sampling Technique (SMOTE). Several machine learning and deep learning models were deployed on the CNN-extracted features and over-sampled data. Random forest achieved an accuracy of 84% with our proposed approach. Additionally, we applied K-fold cross-validation to validate the proposed approach, and logistic regression achieved a mean accuracy of 81.897% with a standard deviation of ±0.0255.
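The K-fold summary reported in the abstract (a mean accuracy with a standard deviation across folds) can be illustrated with a small sketch. Everything here is an assumption for illustration only: `kfold_indices`, `cross_validate`, and the `fit_and_score` callback are hypothetical names, the folds are contiguous and unshuffled, and the population standard deviation is used because the abstract does not say which estimator the authors applied.

```python
import statistics


def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds whose sizes differ
    by at most one."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n_samples, k, fit_and_score):
    """Call fit_and_score(train_idx, test_idx) once per fold and return
    (mean score, population standard deviation, per-fold scores)."""
    scores = []
    for test_idx in kfold_indices(n_samples, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        scores.append(fit_and_score(train_idx, test_idx))
    return statistics.mean(scores), statistics.pstdev(scores), scores
```

A caller would supply a `fit_and_score` function that trains a classifier on the training indices and returns its accuracy on the held-out indices.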

(This article belongs to the Special Issue Sensor Applications in Fault Diagnosis and Monitoring of Electrical Machines II)

Search Results (23)
