Search Results (29)

Search Parameters:
Keywords = K-means SMOTE

32 pages, 4091 KiB  
Article
Improving Early Detection of Dementia: Extra Trees-Based Classification Model Using Inter-Relation-Based Features and K-Means Synthetic Minority Oversampling Technique
by Yanawut Chaiyo, Worasak Rueangsirarak, Georgi Hristov and Punnarumol Temdee
Big Data Cogn. Comput. 2025, 9(6), 148; https://doi.org/10.3390/bdcc9060148 - 30 May 2025
Viewed by 665
Abstract
The early detection of dementia, a condition affecting both individuals and society, is essential for its effective management. However, reliance on advanced laboratory tests and specialized expertise limits accessibility, hindering timely diagnosis. To address this challenge, this study proposes a novel approach in which readily available biochemical and physiological features from electronic health records are employed to develop a machine learning-based binary classification model, improving accessibility and early detection. A dataset of 14,763 records from Phachanukroh Hospital, Chiang Rai, Thailand, was used for model construction. The use of a hybrid data enrichment framework involving feature augmentation and data balancing was proposed in order to increase the dimensionality of the data. Medical domain knowledge was used to generate inter-relation-based features (IRFs), which improve data diversity and promote explainability by making the features more informative. For data balancing, the K-Means Synthetic Minority Oversampling Technique (K-Means SMOTE) was applied to generate synthetic samples in under-represented regions of the feature space, addressing class imbalance. Extra Trees (ET) was used for model construction due to its noise resilience and ability to manage multicollinearity. The performance of the proposed method was compared with that of Support Vector Machine, K-Nearest Neighbors, Artificial Neural Networks, Random Forest, and Gradient Boosting. The results reveal that the ET model significantly outperformed other models on the combined dataset with four IRFs and K-Means SMOTE across key metrics, including accuracy (96.47%), precision (94.79%), recall (97.86%), F1 score (96.30%), and area under the receiver operating characteristic curve (99.51%). Full article
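The balancing-plus-classification combination described in this abstract (K-Means SMOTE followed by Extra Trees) maps directly onto off-the-shelf components in imbalanced-learn and scikit-learn. The sketch below is purely illustrative: the synthetic dataset, feature counts, and parameter values are assumptions, not the study's EHR data or tuned settings.

```python
# Illustrative sketch: K-Means SMOTE balancing followed by an Extra Trees classifier.
# The dataset and parameters are placeholders, not the study's actual data or settings.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.over_sampling import KMeansSMOTE

# Synthetic imbalanced data standing in for the biochemical/physiological features.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.85, 0.15],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Oversample only the training split; the threshold may need tuning per dataset.
sampler = KMeansSMOTE(cluster_balance_threshold=0.1, random_state=42)
X_res, y_res = sampler.fit_resample(X_train, y_train)

clf = ExtraTreesClassifier(n_estimators=300, random_state=42)
clf.fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test)))
```

Oversampling is applied only to the training split so that the held-out evaluation still reflects the original class distribution.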

21 pages, 3668 KiB  
Article
LD-SMOTE: A Novel Local Density Estimation-Based Oversampling Method for Imbalanced Datasets
by Jiacheng Lyu, Jie Yang, Zhixun Su and Zilu Zhu
Symmetry 2025, 17(2), 160; https://doi.org/10.3390/sym17020160 - 22 Jan 2025
Viewed by 1113
Abstract
Imbalanced data have become a major stumbling block in the field of machine learning. In this paper, a novel oversampling method based on local density estimation, namely LD-SMOTE, is presented to address constraints of the popular rebalancing technique SMOTE. LD-SMOTE initiates with k-means clustering to quantitatively measure the classification contribution of each feature. Subsequently, a novel distance metric grounded in Jaccard similarity is defined, which accentuates the features that are more intricately linked to the minority class. Utilizing this metric, we estimate the local density with a Gaussian-like function to control the quantity of synthetic samples around every minority sample, thus simulating the distribution of the minority class. Additionally, in LD-SMOTE the generation of synthetic samples occurs within a triangular region constructed from a minority sample and its two chosen neighbors, instead of on the line connecting the minority sample and one of its neighbors. Experimental comparisons between LD-SMOTE and 16 existing resampling methods on 19 datasets reveal significant average improvements for LD-SMOTE of 6.4% in accuracy, 4.4% in the F-measure, 5.4% in the G-mean, and 4.0% in AUC. This result indicates that LD-SMOTE can serve as an alternative oversampling method for imbalanced datasets. Full article
(This article belongs to the Section Computer)
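For readers unfamiliar with the triangle-based generation step mentioned above, the geometric idea can be sketched in a few lines of NumPy. This is not the authors' LD-SMOTE implementation (the Jaccard-based metric and Gaussian-like density weighting are omitted); it only shows uniform sampling inside the triangle spanned by a minority sample and two of its neighbors.

```python
# Sketch of generating a synthetic sample inside the triangle spanned by a minority
# sample and two of its neighbors (barycentric sampling). Only the geometric step;
# LD-SMOTE's density estimation and feature weighting are omitted.
import numpy as np

def sample_in_triangle(x, n1, n2, rng):
    """Uniformly sample a point in the triangle (x, n1, n2)."""
    r1, r2 = rng.random(), rng.random()
    if r1 + r2 > 1.0:          # reflect to stay inside the triangle
        r1, r2 = 1.0 - r1, 1.0 - r2
    return x + r1 * (n1 - x) + r2 * (n2 - x)

rng = np.random.default_rng(0)
x  = np.array([0.2, 0.5])      # minority sample
n1 = np.array([0.4, 0.9])      # neighbor 1
n2 = np.array([0.7, 0.3])      # neighbor 2
print(sample_in_triangle(x, n1, n2, rng))
```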

15 pages, 4963 KiB  
Article
Anti-Rollover Trajectory Planning Method for Heavy Vehicles in Human–Machine Cooperative Driving
by Haixiao Wu, Zhongming Wu, Junfeng Lu and Li Sun
World Electr. Veh. J. 2024, 15(8), 328; https://doi.org/10.3390/wevj15080328 - 24 Jul 2024
Viewed by 1016
Abstract
Existing trajectory planning research mainly considers the safety of the obstacle avoidance process rather than the anti-rollover requirements of heavy vehicles. When driving risks such as rollover and collision are both present, coordinating the game relationship between the two is the key technical problem in anti-rollover trajectory planning once a driving risk is triggered. To address this problem, this paper studies a non-cooperative game model of the obstacle avoidance process that integrates vehicle driving risk in a complex traffic environment, and then obtains the obstacle avoidance area that satisfies both the collision and rollover payoff requirements based on the Nash equilibrium. A K-means-SMOTE risk clustering fusion method is proposed, in which additional sampling points are supplemented by SMOTE oversampling and the ideal obstacle avoidance area is then obtained through clustering fusion, determining the optimal feasible area for obstacle avoidance trajectory planning. On this basis, to address the convergence problems of the existing multi-objective particle swarm optimization algorithm and to analyze the influence of weight parameters and the diversity of the optimization process, this paper proposes an anti-rollover trajectory planning method based on an improved cosine variable-weight-factor MOPSO algorithm. The simulation results show that the trajectory obtained with the proposed method effectively improves the anti-rollover performance of the controlled vehicle while avoiding obstacles. Full article
(This article belongs to the Special Issue Dynamics, Control and Simulation of Electrified Vehicles)

17 pages, 1688 KiB  
Article
Imbalanced Data Classification Based on Improved Random-SMOTE and Feature Standard Deviation
by Ying Zhang, Li Deng and Bo Wei
Mathematics 2024, 12(11), 1709; https://doi.org/10.3390/math12111709 - 30 May 2024
Cited by 11 | Viewed by 3923
Abstract
Oversampling techniques are widely used to rebalance imbalanced datasets. However, most oversampling methods may introduce noise and fuzzy boundaries for dataset classification, leading to overfitting. To solve this problem, we propose a new method (FSDR-SMOTE) based on Random-SMOTE and the feature standard deviation for rebalancing imbalanced datasets. The method first removes noisy samples based on the Tukey criterion and then calculates the feature standard deviation, which reflects the degree of data dispersion, to determine each sample's location and classify samples into boundary samples and safe samples. Secondly, the K-means clustering algorithm is employed to partition the minority class samples into several sub-clusters. Within each sub-cluster, new samples are generated based on random samples, boundary samples, and the corresponding sub-cluster center. The experimental results show that the average evaluation value obtained by FSDR-SMOTE is 93.31% (93.16% and 86.53%) in terms of the F-measure (G-mean and MCC, respectively) on the 20 benchmark datasets selected from the UCI machine learning repository. Full article
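The first two steps of this pipeline, Tukey-criterion noise removal and K-means sub-clustering of the minority class, can be sketched with NumPy and scikit-learn. The fragment below is an assumption-laden outline, not the authors' FSDR-SMOTE code, and it omits the boundary/safe classification and the Random-SMOTE generation step.

```python
# Sketch: Tukey-criterion outlier filtering followed by K-means sub-clustering of
# the minority class, as preliminary steps of an FSDR-SMOTE-style pipeline.
import numpy as np
from sklearn.cluster import KMeans

def tukey_mask(X, k=1.5):
    """Keep rows whose every feature lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return np.all((X >= lo) & (X <= hi), axis=1)

rng = np.random.default_rng(1)
X_min = rng.normal(size=(200, 5))            # stand-in minority-class samples
X_clean = X_min[tukey_mask(X_min)]           # drop Tukey outliers

kmeans = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X_clean)
print(kmeans.cluster_centers_.shape, np.bincount(kmeans.labels_))
```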

16 pages, 4372 KiB  
Article
Wind Shear and Aircraft Aborted Landings: A Deep Learning Perspective for Prediction and Analysis
by Afaq Khattak, Jianping Zhang, Pak-Wai Chan, Feng Chen, Arshad Hussain and Hamad Almujibah
Atmosphere 2024, 15(5), 545; https://doi.org/10.3390/atmos15050545 - 29 Apr 2024
Cited by 4 | Viewed by 2428
Abstract
In civil aviation, severe weather conditions such as strong wind shear, crosswinds, and thunderstorms near airport runways often compel pilots to abort landings to ensure flight safety. While aborted landings due to wind shear are not common, they occur under specific environmental and situational circumstances. This research aims to accurately predict aircraft aborted landings using three advanced deep learning techniques: the conventional deep neural network (DNN), the deep and cross network (DCN), and the wide and deep network (WDN). These models are supplemented by various data augmentation methods, including the Synthetic Minority Over-Sampling Technique (SMOTE), KMeans-SMOTE, and Borderline-SMOTE, to correct the imbalance in pilot report data. Bayesian optimization was utilized to fine-tune the models for optimal predictive accuracy. The effectiveness of these models was assessed through metrics including sensitivity, precision, F1-score, and the Matthews Correlation Coefficient. The Shapley Additive Explanations (SHAP) algorithm was then applied to the most effective models to interpret their results and identify key factors, revealing that the intensity of wind shear, specific runways like 07R, and the vertical distance of wind shear from the runway (within 700 feet above runway level) were significant factors. The results of this research provide valuable insights to civil aviation experts, potentially revolutionizing safety protocols for managing aborted landings under adverse weather conditions, thereby improving overall airport efficiency and safety. Full article
(This article belongs to the Section Atmospheric Techniques, Instruments, and Modeling)
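The three oversamplers named above are all available in imbalanced-learn, so a minimal comparison loop is straightforward to set up. In the sketch below a scikit-learn MLP stands in for the DNN/DCN/WDN models and the Bayesian tuning used in the paper; the data and settings are illustrative assumptions.

```python
# Sketch: comparing SMOTE, KMeans-SMOTE and Borderline-SMOTE as balancing steps.
# A scikit-learn MLP stands in for the deep models and Bayesian tuning in the paper.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE, KMeansSMOTE, BorderlineSMOTE

X, y = make_classification(n_samples=4000, n_features=15, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

samplers = {"SMOTE": SMOTE(random_state=0),
            "KMeans-SMOTE": KMeansSMOTE(random_state=0, cluster_balance_threshold=0.1),
            "Borderline-SMOTE": BorderlineSMOTE(random_state=0)}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_tr, y_tr)   # resample the training split only
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300,
                        random_state=0).fit(X_res, y_res)
    print(f"{name}: F1 = {f1_score(y_te, clf.predict(X_te)):.3f}")
```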

16 pages, 750 KiB  
Article
Evaluating Outcome Prediction via Baseline, End-of-Treatment, and Delta Radiomics on PET-CT Images of Primary Mediastinal Large B-Cell Lymphoma
by Fereshteh Yousefirizi, Claire Gowdy, Ivan S. Klyuzhin, Maziar Sabouri, Petter Tonseth, Anna R. Hayden, Donald Wilson, Laurie H. Sehn, David W. Scott, Christian Steidl, Kerry J. Savage, Carlos F. Uribe and Arman Rahmim
Cancers 2024, 16(6), 1090; https://doi.org/10.3390/cancers16061090 - 8 Mar 2024
Cited by 12 | Viewed by 2960
Abstract
Objectives: Accurate outcome prediction is important for making informed clinical decisions in cancer treatment. In this study, we assessed the feasibility of using changes in radiomic features over time (Delta radiomics: absolute and relative) following chemotherapy, to predict relapse/progression and time to progression (TTP) of primary mediastinal large B-cell lymphoma (PMBCL) patients. Materials and Methods: Given the lack of standard staging PET scans until 2011, only 31 out of 103 PMBCL patients in our retrospective study had both pre-treatment and end-of-treatment (EoT) scans. Consequently, our radiomics analysis focused on these 31 patients who underwent [18F]FDG PET-CT scans before and after R-CHOP chemotherapy. Expert manual lesion segmentation was conducted on their scans for delta radiomics analysis, along with an additional 19 EoT scans, totaling 50 segmented scans for single time point analysis. Radiomics features (on PET and CT), along with maximum and mean standardized uptake values (SUVmax and SUVmean), total metabolic tumor volume (TMTV), tumor dissemination (Dmax), total lesion glycolysis (TLG), and the area under the curve of cumulative standardized uptake value-volume histogram (AUC-CSH) were calculated. We additionally applied longitudinal analysis using radial mean intensity (RIM) changes. For prediction of relapse/progression, we utilized the individual coefficient approximation for risk estimation (ICARE) and machine learning (ML) techniques (K-Nearest Neighbor (KNN), Linear Discriminant Analysis (LDA), and Random Forest (RF)) including sequential feature selection (SFS) following correlation analysis for feature selection. For TTP, ICARE and CoxNet approaches were utilized. In all models, we used nested cross-validation (CV) (with 10 outer folds and 5 repetitions, along with 5 inner folds and 20 repetitions) after balancing the dataset using the Synthetic Minority Oversampling Technique (SMOTE). Results: To predict relapse/progression using Delta radiomics between the baseline (staging) and EoT scans, the best performances in terms of accuracy and F1 score (F1 score is the harmonic mean of precision and recall, where precision is the ratio of true positives to the sum of true positives and false positives, and recall is the ratio of true positives to the sum of true positives and false negatives) were achieved with ICARE (accuracy = 0.81 ± 0.15, F1 = 0.77 ± 0.18), RF (accuracy = 0.89 ± 0.04, F1 = 0.87 ± 0.04), and LDA (accuracy = 0.89 ± 0.03, F1 = 0.89 ± 0.03), that are higher compared to the predictive power achieved by using only EoT radiomics features. For the second category of our analysis, TTP prediction, the best performer was CoxNet (LASSO feature selection) with c-index = 0.67 ± 0.06 when using baseline + Delta features (inclusion of both baseline and Delta features). The TTP results via Delta radiomics were comparable to the use of radiomics features extracted from EoT scans for TTP analysis (c-index = 0.68 ± 0.09) using CoxNet (with SFS). The performance of the Deauville Score (DS) for TTP was c-index = 0.66 ± 0.09 for n = 50 and 0.67 ± 0.03 for n = 31 cases when using EoT scans, with no significant differences compared to the radiomics signature from either EoT scans or baseline + Delta features (p-value > 0.05). Conclusion: This work demonstrates the potential of Delta radiomics and the importance of using EoT scans to predict progression and TTP from PMBCL [18F]FDG PET-CT scans. Full article
(This article belongs to the Special Issue PET/CT in Cancers Outcomes Prediction)
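One practical point in this setup is that SMOTE should be fitted inside each cross-validation training fold rather than on the full dataset, so that validation folds contain no synthetic samples. A minimal sketch with imbalanced-learn's pipeline follows; the random forest, fold counts, and data are illustrative stand-ins, not the study's nested-CV configuration.

```python
# Sketch: SMOTE applied inside each CV training fold via an imblearn Pipeline,
# so validation folds never contain synthetic samples. Illustrative settings only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=30, weights=[0.7, 0.3],
                           random_state=7)

pipe = Pipeline([("smote", SMOTE(random_state=7)),
                 ("rf", RandomForestClassifier(n_estimators=200, random_state=7))])

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=7)
scores = cross_val_score(pipe, X, y, scoring="f1", cv=cv)
print(f"F1 = {scores.mean():.2f} +/- {scores.std():.2f}")
```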

22 pages, 6678 KiB  
Article
Ad-RuLer: A Novel Rule-Driven Data Synthesis Technique for Imbalanced Classification
by Xiao Zhang, Iván Paz, Àngela Nebot, Francisco Mugica and Enrique Romero
Appl. Sci. 2023, 13(23), 12636; https://doi.org/10.3390/app132312636 - 23 Nov 2023
Cited by 1 | Viewed by 1605
Abstract
When classifiers face imbalanced class distributions, they often misclassify minority class samples, consequently diminishing the predictive performance of machine learning models. Existing oversampling techniques predominantly rely on the selection of neighboring data via interpolation, with less emphasis on uncovering the intrinsic patterns and relationships within the data. In this research, we demonstrate the usefulness of an algorithm named RuLer for the problem of classification with imbalanced data. RuLer is a learning algorithm initially designed to recognize new sound patterns within the context of the performative artistic practice known as live coding. This paper demonstrates that this algorithm, once adapted (Ad-RuLer), has great potential to address the problem of oversampling imbalanced data. An extensive comparison with other mainstream oversampling algorithms (SMOTE, ADASYN, Tomek-links, Borderline-SMOTE, and KmeansSMOTE), using different classifiers (logistic regression, random forest, and XGBoost), is performed on several real-world datasets with different degrees of data imbalance. The experimental results indicate that Ad-RuLer serves as an effective oversampling technique with extensive applicability. Full article
(This article belongs to the Section Computing and Artificial Intelligence)

19 pages, 12636 KiB  
Article
3D Mineral Prospectivity Mapping from 3D Geological Models Using Return–Risk Analysis and Machine Learning on Imbalance Data
by Qingming Peng, Zhongzheng Wang, Gongwen Wang, Wengao Zhang, Zhengle Chen and Xiaoning Liu
Minerals 2023, 13(11), 1384; https://doi.org/10.3390/min13111384 - 29 Oct 2023
Cited by 6 | Viewed by 4031
Abstract
Three-dimensional Mineral Prospectivity Mapping (3DMPM) is an innovative approach to mineral exploration that combines multiple geological data sources to create a three-dimensional (3D) model of a mineral deposit. It provides an accurate representation of the subsurface that can be used to identify areas with mineral potential. These 3D geological models, constructed from geological data sets from multiple sources, are the typical data source for 3D prospectivity modeling. Since in practice there is a significant imbalance in the ratio of mineralized to non-mineralized classes, the classification results will be biased in favor of the more frequently observed class. Borderline-SMOTE (BLSMOTE) is an oversampling technique used to address unbalanced datasets; it works by generating synthetic data points along the boundary between the minority and majority classes, which helps to create a more balanced dataset without introducing too much noise. Non-mineralized samples can be generated by randomly selecting non-mineralized locations, which introduces uncertainty. In this paper, we take the Lannigou gold deposit in Guizhou, a shallow-formed, low-temperature hydrothermal deposit, as an example to extract the ore-controlling elements and establish a 3D geological model. A total of 50 training samples are generated using the sampling method described above, and 50 mineralization prospect maps are generated using Random Forests. A return–risk analysis was used to explore the uncertainties associated with synthetic positive samples and randomly selected negative samples, and to determine the final mineral potential values. Based on the evaluation metrics G-mean and F-value, the model using BLSMOTE outperforms the model without the synthetic algorithm and the models using SMOTE and KMeansSMOTE. The optimal model, BLSMOTE18, has an AUC of 0.9288. The methodology also performs well on datasets with different levels of class imbalance. Excluding predictions that highly overlap with known deposits, five target zones were delineated using a P-A plot, all of which have obvious metallogenic geological features. Among them, Target1 and Target2 have good potential for mineralization, which is of great significance for future mineral exploration work. Full article
(This article belongs to the Special Issue 3D Modeling of Mineral Deposits)

19 pages, 1599 KiB  
Article
A Data Enhancement Algorithm for DDoS Attacks Using IoT
by Haibin Lv, Yanhui Du, Xing Zhou, Wenkai Ni and Xingbang Ma
Sensors 2023, 23(17), 7496; https://doi.org/10.3390/s23177496 - 29 Aug 2023
Cited by 5 | Viewed by 1698
Abstract
With the rapid development of the Internet of Things (IoT), attackers increasingly use botnets to control IoT devices in order to perform distributed denial-of-service (DDoS) and other cyber attacks on the internet. In the actual attack process, the small percentage of attack packets in IoT traffic leads to low intrusion detection accuracy. To address this problem, the paper proposes an oversampling algorithm, KG-SMOTE, based on the Gaussian distribution and K-means clustering. It inserts synthetic samples drawn from a Gaussian probability distribution, extends the cluster nodes among the minority class samples in equal proportion, and thereby increases the density and quantity of minority class samples, providing data support for IoT-based DDoS attack detection. Experiments show that the balanced dataset generated by this method effectively improves intrusion detection accuracy in each category and effectively solves the data imbalance problem. Full article
(This article belongs to the Special Issue Anomaly Detection and Monitoring for Networks and IoT Systems)
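As a loose analogue of the KG-SMOTE idea (not the authors' algorithm), the sketch below clusters the minority class with K-means and draws Gaussian-perturbed synthetic points around cluster members; the spread factor and cluster count are arbitrary placeholders.

```python
# Sketch: per-cluster Gaussian oversampling of the minority class. A loose analogue
# of the KG-SMOTE idea; scale and cluster counts are arbitrary placeholders.
import numpy as np
from sklearn.cluster import KMeans

def gaussian_cluster_oversample(X_min, n_new, n_clusters=3, scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_min)
    synthetic = []
    for _ in range(n_new):
        c = rng.integers(n_clusters)                   # pick a cluster
        members = X_min[km.labels_ == c]
        base = members[rng.integers(len(members))]     # pick a member as the mean
        std = members.std(axis=0) * scale + 1e-12      # cluster-local spread
        synthetic.append(rng.normal(base, std))        # Gaussian-perturbed sample
    return np.vstack(synthetic)

X_min = np.random.default_rng(3).normal(size=(120, 8))  # stand-in minority samples
print(gaussian_cluster_oversample(X_min, n_new=200).shape)
```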

17 pages, 2631 KiB  
Article
CSK-CNN: Network Intrusion Detection Model Based on Two-Layer Convolution Neural Network for Handling Imbalanced Dataset
by Jiaming Song, Xiaojuan Wang, Mingshu He and Lei Jin
Information 2023, 14(2), 130; https://doi.org/10.3390/info14020130 - 16 Feb 2023
Cited by 13 | Viewed by 3186
Abstract
In computer networks, the Network Intrusion Detection System (NIDS) plays a very important role in identifying intrusion behaviors. A NIDS can identify abnormal behaviors by analyzing network traffic. However, classifiers perform poorly in identifying abnormal traffic from minority classes. In order to improve the detection rate on class-imbalanced datasets, we propose a network intrusion detection model based on a two-layer CNN and the Cluster-SMOTE + K-means algorithm (CSK-CNN) to process imbalanced datasets. CSK combines the cluster-based Synthetic Minority Oversampling Technique (Cluster-SMOTE) with a K-means-based undersampling algorithm. Through the two-layer network, abnormal traffic can not only be identified but also classified into specific attack types. The approach was verified on the UNSW-NB15 and CICIDS2017 datasets, and the performance of the proposed model was evaluated using indicators such as accuracy, recall, precision, F1-score, ROC curve, AUC value, training time, and testing time. The experiments show that the proposed CSK-CNN clearly outperforms the comparison algorithms in network intrusion detection performance and is suitable for deployment in real network environments. Full article
(This article belongs to the Special Issue Advances in Computing, Communication & Security)
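The CSK resampling step, cluster-aware oversampling of minority classes combined with K-means-based undersampling of the majority class, can be approximated with stock imbalanced-learn components. The sketch below uses KMeansSMOTE and ClusterCentroids as stand-ins and replaces the two-layer CNN with logistic regression for brevity; the thresholds and ratios are assumptions that may need tuning.

```python
# Sketch: cluster-aware oversampling plus K-means-based undersampling, chained in an
# imblearn Pipeline. Stand-ins for the paper's Cluster-SMOTE + K-means CSK step;
# the two-layer CNN classifier is replaced by logistic regression for brevity.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import KMeansSMOTE
from imblearn.under_sampling import ClusterCentroids
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=6000, n_features=20, weights=[0.95, 0.05],
                           random_state=5)

pipe = Pipeline([
    ("over", KMeansSMOTE(sampling_strategy=0.5, cluster_balance_threshold=0.05,
                         random_state=5)),   # oversample minority to a 1:2 ratio
    ("under", ClusterCentroids(sampling_strategy=1.0, random_state=5)),  # reduce majority
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, scoring="f1", cv=5).mean())
```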

17 pages, 1788 KiB  
Article
A Novel Intelligent Method for Fault Diagnosis of Steam Turbines Based on T-SNE and XGBoost
by Zhiguo Liang, Lijun Zhang and Xizhe Wang
Algorithms 2023, 16(2), 98; https://doi.org/10.3390/a16020098 - 9 Feb 2023
Cited by 22 | Viewed by 4174
Abstract
Since failures of steam turbines occur frequently and can cause huge losses for thermal power plants, it is important to identify faults in advance. A novel clustering fault diagnosis method for steam turbines based on t-distributed stochastic neighbor embedding (t-SNE) and extreme gradient boosting (XGBoost) is proposed in this paper. First, the t-SNE algorithm was used to map the high-dimensional data to a low-dimensional space, and K-means clustering was performed in the low-dimensional space to distinguish the fault data from the normal data. Then, the imbalance in the data was addressed with the synthetic minority over-sampling technique (SMOTE) to obtain a steam turbine characteristic dataset with fault labels. Finally, the XGBoost algorithm was used to solve this multi-classification problem. The dataset used in this paper was derived from the time series data of a steam turbine at a thermal power plant. In the analysis, the method achieved the best performance, with an overall accuracy of 97% and an early warning at least two hours in advance. The experimental results show that this method can effectively evaluate equipment condition and provide fault warnings for power plant equipment. Full article
(This article belongs to the Special Issue Artificial Intelligence for Fault Detection and Diagnosis)
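A structural outline of this workflow (t-SNE embedding, K-means labeling, SMOTE balancing, XGBoost classification) is sketched below on placeholder data. It is not the paper's implementation, and note that t-SNE provides no transform() for unseen data, so the embedding/labeling step acts as an offline preprocessing stage here.

```python
# Sketch of the described workflow: t-SNE embedding, K-means labeling in the embedded
# space, SMOTE balancing, then XGBoost. Data and settings are placeholders; requires
# the xgboost package in addition to scikit-learn and imbalanced-learn.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (900, 12)),      # stand-in "normal" operation data
               rng.normal(3, 1, (100, 12))])     # stand-in "fault" data

X_emb = TSNE(n_components=2, random_state=0).fit_transform(X)       # low-dim map
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_emb)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, labels)         # rebalance
model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_bal, y_bal)
print(model.score(X_bal, y_bal))
```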

23 pages, 1105 KiB  
Article
Automated Battery Making Fault Classification Using Over-Sampled Image Data CNN Features
by Nasir Ud Din, Li Zhang and Yatao Yang
Sensors 2023, 23(4), 1927; https://doi.org/10.3390/s23041927 - 8 Feb 2023
Cited by 18 | Viewed by 3428
Abstract
Due to the tremendous expectations placed on batteries to produce a reliable and secure product, fault detection has become a critical part of the manufacturing process. Testing each battery individually for manufacturing faults, including burning, excessive welding, missing welds, shifting, welding holes, and so forth, takes considerable manual labor and effort. Additionally, manual battery fault detection takes too much time and is extremely expensive. We addressed this issue by using image processing and machine learning techniques to automatically detect faults in the battery manufacturing process. Our approach reduces the need for human intervention, saves time, and is easy to implement. A CMOS camera was used to collect a large number of images belonging to eight common battery manufacturing faults. The welding area of the batteries' positive and negative terminals was captured from different distances, between 40 and 50 cm. Before deploying the learning models, we first used a CNN for feature extraction from the image data. Because the dataset was highly imbalanced, which led to over-fitting of the learning model, we over-sampled it using the Synthetic Minority Over-sampling Technique (SMOTE). Several machine learning and deep learning models were deployed on the CNN-extracted features and over-sampled data. Random forest achieved a significant 84% accuracy with our proposed approach. Additionally, we applied K-fold cross-validation with the proposed approach to validate its significance, and logistic regression achieved a mean accuracy of 81.897% with a standard deviation of ±0.0255. Full article

30 pages, 2787 KiB  
Article
Detection of Malicious Websites Using Symbolic Classifier
by Nikola Anđelić, Sandi Baressi Šegota, Ivan Lorencin and Matko Glučina
Future Internet 2022, 14(12), 358; https://doi.org/10.3390/fi14120358 - 29 Nov 2022
Cited by 6 | Viewed by 3363
Abstract
Malicious websites are web locations that attempt to install malware, which is the general term for anything that will cause problems in computer operation, gather confidential information, or gain total control over the computer. In this paper, a novel approach is proposed which consists of the implementation of the genetic programming symbolic classifier (GPSC) algorithm on a publicly available dataset to obtain a simple symbolic expression (mathematical equation) which could detect malicious websites with high classification accuracy. Due to a large imbalance of classes in the initial dataset, several data sampling methods (random undersampling/oversampling, ADASYN, SMOTE, BorderlineSMOTE, and KMeansSMOTE) were used to balance the dataset classes. For this investigation, a hyperparameter search method was developed to find the combination of GPSC hyperparameters with which high classification accuracy could be achieved. The first investigation was conducted using GPSC with a random hyperparameter search method, and each dataset variation was divided into train and test datasets in a ratio of 70:30. To evaluate each symbolic expression, its performance was measured on the train and test datasets, and the mean and standard deviation values of accuracy (ACC), AUC, precision, recall, and F1-score were obtained. The second investigation was also conducted using GPSC with the random hyperparameter search method; however, the 70% train dataset was used to perform 5-fold cross-validation. If the mean accuracy, AUC, precision, recall, and F1-score values were above 0.97, then final training and testing (train/test 70:30) were performed with GPSC using the same randomly chosen hyperparameters from the 5-fold cross-validation process, and the final mean and standard deviation values of the aforementioned evaluation metrics were obtained. In both investigations, the best symbolic expression was obtained when the dataset balanced with the KMeansSMOTE method was used for training and testing. The best symbolic expression obtained using GPSC with the random hyperparameter search method and the classic train–test procedure (70:30) on the dataset balanced with KMeansSMOTE achieved mean ACC, AUC, precision, recall, and F1-score values (with standard deviations) of 0.9992 ± 2.249 × 10⁻⁵, 0.9995 ± 9.945 × 10⁻⁶, 0.9995 ± 1.09 × 10⁻⁵, 0.999 ± 5.17 × 10⁻⁵, and 0.9992 ± 5.17 × 10⁻⁶, respectively. The best symbolic expression obtained using GPSC with the random hyperparameter search method and 5-fold cross-validation on the dataset balanced with KMeansSMOTE achieved mean ACC, AUC, precision, recall, and F1-score values (with standard deviations) of 0.9994 ± 1.13 × 10⁻⁵, 0.9994 ± 1.2 × 10⁻⁵, 1.0 ± 0, 0.9988 ± 2.4 × 10⁻⁵, and 0.9994 ± 1.2 × 10⁻⁵, respectively. Full article
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)
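To reproduce the general shape of this pipeline, gplearn's genetic-programming SymbolicClassifier can be trained on a KMeansSMOTE-balanced split; the sketch below is a structural outline with arbitrary hyperparameters and synthetic data, not the paper's GPSC implementation or its random hyperparameter search.

```python
# Sketch: balance with KMeansSMOTE, then fit a genetic-programming symbolic classifier.
# gplearn's SymbolicClassifier stands in for the paper's GPSC; hyperparameters are
# arbitrary placeholders rather than the values found by the random search.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import KMeansSMOTE
from gplearn.genetic import SymbolicClassifier

X, y = make_classification(n_samples=3000, n_features=12, weights=[0.9, 0.1],
                           random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3,
                                          random_state=2)

X_bal, y_bal = KMeansSMOTE(random_state=2,
                           cluster_balance_threshold=0.1).fit_resample(X_tr, y_tr)

gp = SymbolicClassifier(population_size=500, generations=20,
                        parsimony_coefficient=0.001, random_state=2)
gp.fit(X_bal, y_bal)
print("AUC:", roc_auc_score(y_te, gp.predict_proba(X_te)[:, 1]))
print(gp._program)   # the evolved symbolic expression
```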

18 pages, 2094 KiB  
Article
Fault Detection for Wind Turbine Blade Bolts Based on GSG Combined with CS-LightGBM
by Mingzhu Tang, Caihua Meng, Huawei Wu, Hongqiu Zhu, Jiabiao Yi, Jun Tang and Yifan Wang
Sensors 2022, 22(18), 6763; https://doi.org/10.3390/s22186763 - 7 Sep 2022
Cited by 17 | Viewed by 3142
Abstract
Aiming at the problem of class imbalance in the wind turbine blade bolt operation-monitoring dataset, a fault detection method for wind turbine blade bolts based on the Gaussian Mixture Model–Synthetic Minority Oversampling Technique–Gaussian Mixture Model (GSG) combined with Cost-Sensitive LightGBM (CS-LightGBM) was proposed. Since it is difficult to obtain fault samples of blade bolts, the GSG oversampling method was constructed to increase the number of fault samples in the blade bolt dataset. The method obtains the optimal number of clusters through the BIC criterion and uses a GMM based on this optimal number of clusters to cluster the fault samples in the blade bolt dataset. According to the inter-cluster density distribution of the fault samples, new fault samples were synthesized within each cluster using SMOTE, which retains the distribution characteristics of the original fault-class samples. Then, a GMM with the same initial cluster centers was used to cluster the fault-class samples augmented with the new samples, and synthetic fault-class samples that were not assigned to their corresponding clusters were removed. Finally, the synthetic training set was used to train the CS-LightGBM fault detection model, and the hyperparameters of CS-LightGBM were optimized by a Bayesian optimization algorithm to obtain the optimal CS-LightGBM fault detection model. The experimental results show that, compared with six models including SMOTE-LightGBM, CS-LightGBM, and K-means-SMOTE-LightGBM, the proposed fault detection model is superior to the other comparison methods in terms of the false alarm rate, missing alarm rate, and F1-score. The method can effectively realize fault detection for large wind turbine blade bolts. Full article
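The BIC-driven choice of the number of mixture components in GSG corresponds to a standard scikit-learn pattern; the sketch below covers only that selection step, on placeholder data rather than the blade bolt monitoring dataset.

```python
# Sketch: selecting the number of Gaussian mixture components by minimizing BIC,
# as in the cluster-count selection step described for GSG. Placeholder data only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# Stand-in "fault" samples drawn from three separated Gaussian blobs.
X_fault = np.vstack([rng.normal(m, 0.5, (60, 6)) for m in (0.0, 2.0, 4.0)])

bics = {}
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, random_state=4).fit(X_fault)
    bics[k] = gmm.bic(X_fault)           # lower BIC = better trade-off of fit vs. size

best_k = min(bics, key=bics.get)
print("BIC per k:", {k: round(v, 1) for k, v in bics.items()})
print("optimal number of clusters:", best_k)
```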

15 pages, 2179 KiB  
Article
Virtual Screening of Drug Proteins Based on the Prediction Classification Model of Imbalanced Data Mining
by Lili Yin, Xiaokang Du, Chao Ma and Hengwen Gu
Processes 2022, 10(7), 1420; https://doi.org/10.3390/pr10071420 - 21 Jul 2022
Cited by 6 | Viewed by 2116
Abstract
We propose a virtual screening method based on imbalanced data mining in this paper, which combines virtual screening techniques with imbalanced data classification methods to improve the traditional virtual screening process. First, in the actual virtual screening process, we apply the K-means and SMOTE heuristic oversampling method to deal with imbalanced data. Meanwhile, to enhance the accuracy of the virtual screening process, a particle swarm optimization algorithm is introduced to optimize the parameters of the support vector machine classifier, and the concept of ensemble learning is brought in. The classification technique based on particle swarm optimization, support vector machine, and adaptive boosting is used to screen molecular docking conformations and improve the accuracy of the prediction. Finally, in the experimental construction and analysis section, the proposed method was validated using relevant data from the Protein Data Bank and PubChem databases. The experimental results indicate that the proposed method can effectively improve the accuracy of virtual screening and provides practical guidance for new drug development. This research treats virtual screening as an imbalanced data classification problem, which has clear guiding significance and also provides a reference for the challenges faced by virtual screening technology. Full article
