Search Results (36)

Search Parameters:
Keywords = mislabeled training data

19 pages, 3365 KB  
Article
Robust Federated Learning Against Data Poisoning Attacks: Prevention and Detection of Attacked Nodes
by Pretom Roy Ovi and Aryya Gangopadhyay
Electronics 2025, 14(15), 2970; https://doi.org/10.3390/electronics14152970 - 25 Jul 2025
Viewed by 697
Abstract
Federated learning (FL) enables collaborative model building among a large number of participants without sharing sensitive data with the central server. Because of its distributed nature, FL has limited control over local data and the corresponding training process. Therefore, it is susceptible to data poisoning attacks, in which malicious workers use malicious training data to train the model. Attackers on the worker side can easily manipulate local data by swapping the labels of training instances, adding noise to training instances, or adding out-of-distribution training instances to the local data. Local workers under such attacks carry incorrect information to the server, poison the global model, and cause misclassifications. The prevention and detection of such data poisoning attacks are therefore crucial to building a robust federated training framework. To address this, we propose a prevention strategy in federated learning, namely confident federated learning, to protect workers from such data poisoning attacks. Our proposed prevention strategy first validates the label quality of local training samples by characterizing and identifying label errors in the local training data, and then excludes the detected mislabeled samples from local training. We evaluate our proposed approach in both the image and audio domains, and our experimental results validate the robustness of confident federated learning in preventing data poisoning attacks. Our proposed method can detect mislabeled training samples with above 85% accuracy and exclude the detected samples from the training set to prevent data poisoning attacks on the local workers. However, our prevention strategy can prevent the attack locally only up to a certain percentage of poisonous samples. Beyond that percentage, the prevention strategy may not be effective in preventing attacks.
In such cases, detection of the attacked workers is needed. So, in addition to the prevention strategy, we propose a novel detection strategy in the federated learning framework to detect the malicious workers under attack. We propose to create a class-wise cluster representation for every participating worker by utilizing the neuron activation maps of local models and analyze the resulting clusters to filter out the workers under attack before model aggregation. We experimentally demonstrated the efficacy of our proposed detection strategy in detecting workers affected by data poisoning attacks, along with the attack types, e.g., label-flipping or dirty labeling. In addition, our experimental results suggest that the global model could not converge even after a large number of training rounds in the presence of malicious workers, whereas after detecting the malicious workers with our proposed detection method and discarding them from model aggregation, we ensured that the global model achieved convergence within very few training rounds. Furthermore, our proposed approach stays robust under different data distributions and model sizes and does not require prior knowledge about the number of attackers in the system. Full article
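The abstract does not spell out how label errors are identified, so the following is only a minimal sketch of the prevention step, assuming a confident-learning-style rule: each class gets a threshold equal to the mean predicted probability of that class over the samples labeled with it, and a sample is flagged when the model is confident about some class but not about the given label. The function names `find_label_issues` and `filter_local_data` and the thresholding rule are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def find_label_issues(probs: Sequence[Sequence[float]],
                      labels: Sequence[int]) -> List[int]:
    """Return indices of samples whose given label looks wrong (assumed rule)."""
    n_classes = len(probs[0])
    # Per-class threshold: mean predicted probability of class k over the
    # samples labeled k. A class with no samples gets an unreachable threshold.
    thresholds = []
    for k in range(n_classes):
        members = [p[k] for p, y in zip(probs, labels) if y == k]
        thresholds.append(sum(members) / len(members) if members else 1.1)
    flagged = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        # Classes whose predicted probability reaches that class's threshold.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        # Flag only when the model is confident about a class other than the label.
        if confident and y not in confident:
            flagged.append(i)
    return flagged


def filter_local_data(samples: Sequence, probs: Sequence[Sequence[float]],
                      labels: Sequence[int]) -> List[Tuple[object, int]]:
    """Drop flagged samples before local training on a worker."""
    bad = set(find_label_issues(probs, labels))
    return [(s, y) for i, (s, y) in enumerate(zip(samples, labels)) if i not in bad]


# Toy data: the last sample is labeled 0 although the model strongly predicts 1.
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.05, 0.95]]
labels = [0, 0, 1, 1, 0]
kept = filter_local_data(["a", "b", "c", "d", "e"], probs, labels)
```

In a federated setting, each worker would run such a filter on its own data with out-of-sample predicted probabilities before contributing an update; how those probabilities are obtained is left open here.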

19 pages, 3116 KB  
Article
Deep Learning for Visual Leading of Ships: AI for Human Factor Accident Prevention
by Manuel Vázquez Neira, Genaro Cao Feijóo, Blanca Sánchez Fernández and José A. Orosa
Appl. Sci. 2025, 15(15), 8261; https://doi.org/10.3390/app15158261 - 24 Jul 2025
Viewed by 491
Abstract
Traditional navigation relies on visual alignment with leading lights, a task typically monitored by bridge officers over extended periods. This process can lead to fatigue-related human factor errors, increasing the risk of maritime accidents and environmental damage. To address this issue, this study explores the use of convolutional neural networks (CNNs), evaluating different training strategies and hyperparameter configurations to assist officers in identifying deviations from proper visual leading. Using video data captured from a navigation simulator, we trained a lightweight CNN capable of advising bridge personnel with an accuracy of 86% during night-time operations. Notably, the model demonstrated robustness against visual interference from other light sources, such as lighthouses or coastal lights. The primary source of classification error was linked to images with low bow deviation, largely influenced by human mislabeling during dataset preparation. Future work will focus on refining the classification scheme to enhance model performance. We (1) propose a lightweight CNN based on SqueezeNet for night-time ship navigation, (2) expand the traditional binary risk classification into six operational categories, and (3) demonstrate improved performance over human judgment in visually ambiguous conditions. Full article

16 pages, 3808 KB  
Article
Impact of Data Quality on CNN-Based Sewer Defect Detection
by Seokwoo Jang and Dooil Kim
Water 2025, 17(13), 2028; https://doi.org/10.3390/w17132028 - 6 Jul 2025
Viewed by 635
Abstract
Sewer pipelines are essential urban infrastructure that play a key role in sanitation and disaster prevention. Regular condition assessments are necessary to detect defects early and determine optimal maintenance timing. However, traditional visual inspection using closed-circuit television (CCTV) footage is time-consuming, labor-intensive, and dependent on subjective human judgment. To address these limitations, this study develops a convolutional neural network (CNN)-based sewer defect classification model and analyzes how data quality—such as mislabeled or redundant images—affects model accuracy. A large-scale public dataset of approximately 470,000 sewer images was used for training. The model was designed to classify non-defect and three major defect categories. Based on the ResNet50 architecture, the model incorporated dropout and L2 regularization to prevent overfitting. Experimental results showed the highest accuracy of 92.75% at a dropout rate of 0.2 and a regularization coefficient of 0.01. Further analysis revealed that mislabeled, redundant, or obscured images within the dataset negatively impacted model performance. Additional experiments quantified the impact of data quality on accuracy, emphasizing the importance of proper dataset curation. This study provides practical insights into optimizing data-driven approaches for automated sewer defect detection and high-performance model development. Full article
(This article belongs to the Special Issue Urban Sewer Systems: Monitoring, Modeling and Management)

23 pages, 7950 KB  
Article
Tripartite: Tackling Realistic Noisy Labels with More Precise Partitions
by Lida Yu, Xuefeng Liang, Chang Cao, Longshan Yao and Xingyu Liu
Sensors 2025, 25(11), 3369; https://doi.org/10.3390/s25113369 - 27 May 2025
Viewed by 446
Abstract
Samples in large-scale datasets may be mislabeled for various reasons, and deep models tend to over-fit noisy samples under conventional training procedures. The key is to alleviate the harm of these noisy labels. Many existing methods divide training data into clean and noisy subsets according to loss values. We observe that one factor limiting the performance of deep models is uncertain samples, which have relatively small losses and often appear in real-world datasets. Because of their small losses, many uncertain noisy samples are assigned to the clean subset and then degrade the models’ performance. Instead, we propose Tripartite, which partitions training data into three subsets (uncertain, clean, and noisy) according to the inconsistency of the predictions of two networks and the given labels. Tripartite considerably improves the quality of the clean subset. Moreover, to maximize the value of clean samples in the uncertain subset and minimize the harm of noisy labels, we apply low-weight learning and semi-supervised learning, respectively. Extensive experiments demonstrate that Tripartite filters out noisy samples more precisely and outperforms most state-of-the-art methods on four benchmark datasets, especially real-world datasets. Full article
(This article belongs to the Special Issue AI-Based Computer Vision Sensors & Systems)
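A three-way partition driven by two networks' predictions and the given labels can be sketched as follows. The abstract does not give the exact criteria, so the rule used here is an assumption: a sample is clean when both networks agree with its label, noisy when the networks agree with each other but not with the label, and uncertain otherwise. The function name `tripartite_partition` is hypothetical.

```python
from typing import Dict, List, Sequence


def tripartite_partition(pred_a: Sequence[int], pred_b: Sequence[int],
                         labels: Sequence[int]) -> Dict[str, List[int]]:
    """Split sample indices into clean / noisy / uncertain (assumed rule)."""
    parts: Dict[str, List[int]] = {"clean": [], "noisy": [], "uncertain": []}
    for i, (a, b, y) in enumerate(zip(pred_a, pred_b, labels)):
        if a == y and b == y:
            # Both networks confirm the given label.
            parts["clean"].append(i)
        elif a == b:
            # The networks agree with each other but contradict the label.
            parts["noisy"].append(i)
        else:
            # The networks disagree with each other: treat as uncertain.
            parts["uncertain"].append(i)
    return parts


# Toy predictions from two networks and the (possibly noisy) given labels.
parts = tripartite_partition(pred_a=[0, 1, 2, 1],
                             pred_b=[0, 1, 0, 2],
                             labels=[0, 2, 2, 2])
```

Under this reading, the uncertain subset would be trained with a reduced loss weight and the noisy subset would have its labels discarded for semi-supervised learning, as the abstract describes.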

22 pages, 1882 KB  
Article
Optimizing CNN-Based Diagnosis of Knee Osteoarthritis: Enhancing Model Accuracy with CleanLab Relabeling
by Thomures Momenpour and Arafat Abu Mallouh
Diagnostics 2025, 15(11), 1332; https://doi.org/10.3390/diagnostics15111332 - 26 May 2025
Viewed by 1430
Abstract
Background: Knee Osteoarthritis (KOA) is a prevalent and debilitating joint disorder that significantly impacts quality of life, particularly in aging populations. Accurate and consistent classification of KOA severity, typically using the Kellgren-Lawrence (KL) grading system, is crucial for effective diagnosis, treatment planning, and monitoring disease progression. However, traditional KL grading is known for its inherent subjectivity and inter-rater variability, which underscores the pressing need for objective, automated, and reliable classification methods. Methods: This study investigates the performance of an EfficientNetB5 deep learning model, enhanced with transfer learning from the ImageNet dataset, for the task of classifying KOA severity into five distinct KL grades (0–4). We utilized a publicly available Kaggle dataset comprising 9786 knee X-ray images. A key aspect of our methodology was a comprehensive data-centric preprocessing pipeline, which involved an initial phase of outlier removal to reduce noise, followed by systematic label correction using the Cleanlab framework to identify and rectify potential inconsistencies within the original dataset labels. Results: The final EfficientNetB5 model, trained on the preprocessed and Cleanlab-remediated data, achieved an overall accuracy of 82.07% on the test set. This performance represents a significant improvement over previously reported benchmarks for five-class KOA classification on this dataset, such as ResNet-101 which achieved 69% accuracy. The substantial enhancement in model performance is primarily attributed to Cleanlab’s robust ability to detect and correct mislabeled instances, thereby improving the overall quality and reliability of the training data and enabling the model to better learn and capture complex radiographic patterns associated with KOA. Class-wise performance analysis indicated strong differentiation between healthy (KL Grade 0) and severe (KL Grade 4) cases. 
However, the “Doubtful” (KL Grade 1) class presented ongoing challenges, exhibiting lower recall and precision compared to other grades. When evaluated against other architectures like MobileNetV3 and Xception for multi-class tasks, our EfficientNetB5 demonstrated highly competitive results. Conclusions: The integration of an EfficientNetB5 model with a rigorous data-centric preprocessing approach, particularly Cleanlab-based label correction and outlier removal, provides a robust and significantly more accurate method for five-class KOA severity classification. While limitations in handling inherently ambiguous cases (such as KL Grade 1) and the small sample size for severe KOA warrant further investigation, this study demonstrates a promising pathway to enhance diagnostic precision. The developed pipeline shows considerable potential for future clinical applications, aiding in more objective and reliable KOA assessment. Full article
(This article belongs to the Special Issue 3rd Edition: AI/ML-Based Medical Image Processing and Analysis)

22 pages, 1049 KB  
Article
Introducing a Quality-Driven Approach for Federated Learning
by Muhammad Usman, Mario Luca Bernardi and Marta Cimitile
Sensors 2025, 25(10), 3083; https://doi.org/10.3390/s25103083 - 13 May 2025
Viewed by 1129
Abstract
The advancement of pervasive systems has made distributed real-world data across multiple devices increasingly valuable for training machine learning models. Traditional centralized learning approaches face limitations such as data security concerns and computational constraints. Federated learning (FL) provides privacy benefits but is hindered by challenges like data heterogeneity (Non-IID distributions) and noise heterogeneity (mislabeling and inconsistencies in local datasets), which degrade model performance. This paper proposes a model-agnostic, quality-driven approach, called DQFed, for training machine learning models across distributed and diverse client datasets while preserving data privacy. The DQFed framework demonstrates improvements in accuracy and reliability over existing FL frameworks. By effectively addressing class imbalance and noise heterogeneity, DQFed offers a robust and versatile solution for federated learning applications in diverse fields. Full article
(This article belongs to the Special Issue Operationalize Edge AI for Next-Generation IoT Applications)

17 pages, 1102 KB  
Article
Identifying and Mitigating Label Noise in Deep Learning for Image Classification
by César González-Santoyo, Diego Renza and Ernesto Moya-Albor
Technologies 2025, 13(4), 132; https://doi.org/10.3390/technologies13040132 - 1 Apr 2025
Cited by 2 | Viewed by 2331
Abstract
Labeling errors in datasets are a persistent challenge in machine learning because they introduce noise and bias and reduce the model’s generalization. This study proposes a novel methodology for detecting and correcting mislabeled samples in image datasets by using the Cumulative Spectral Gradient (CSG) metric to assess the intrinsic complexity of the data. This methodology is applied to the noisy CIFAR-10/100 and CIFAR-10n/100n datasets, where mislabeled samples in CIFAR-10n/100n are identified and relabeled using CIFAR-10/100 as a reference. The DenseNet and Xception models pre-trained on ImageNet are fine-tuned to evaluate the impact of label correction on the model performance. Evaluation metrics based on the confusion matrix are used to compare the model performance on the original and noisy datasets and on the label-corrected datasets. The results show that correcting the mislabeled samples significantly improves the accuracy and robustness of the model, highlighting the importance of dataset quality in machine learning. Full article

18 pages, 2548 KB  
Article
Honey Differentiation Using Infrared and Raman Spectroscopy Analysis and the Employment of Machine-Learning-Based Authentication Models
by Maria David, Camelia Berghian-Grosan and Dana Alina Magdas
Foods 2025, 14(6), 1032; https://doi.org/10.3390/foods14061032 - 18 Mar 2025
Cited by 1 | Viewed by 867
Abstract
Due to rising concerns regarding the adulteration and mislabeling of honey, new directives at the European level encourage researchers to develop reliable honey authentication models based on rapid and cost-effective analytical techniques, such as vibrational spectroscopies. The present study discusses the identification of the main vibrational bands of the FT-Raman and ATR-IR spectra of the most consumed honey varieties in Transylvania: acacia, honeydew, and rapeseed, exposing the ways the spectral fingerprint differs based on the honey’s varietal-dependent composition. Additionally, a pilot study on honey authentication describes a new methodology of processing the combined vibrational data with the most efficient machine learning algorithms. By employing the proposed methodology, the developed model was capable of distinguishing honey produced in a narrow geographical region (Transylvania) with accuracies of 85.2% and 93.8% on the training and testing datasets, respectively, when the Trilayered Neural Network algorithm was applied to the combined IR and Raman data. Moreover, acacia honey was differentiated from fifteen other sources with an accuracy of 87% on the training and testing datasets. The proposed methodology proved efficient and can be further employed for label control and food safety enhancement. Full article
(This article belongs to the Special Issue Research Progress on Honey Adulteration and Classification)

17 pages, 4871 KB  
Article
MF-Match: A Semi-Supervised Model for Human Action Recognition
by Tianhe Yun and Zhangang Wang
Sensors 2024, 24(15), 4940; https://doi.org/10.3390/s24154940 - 30 Jul 2024
Cited by 2 | Viewed by 1418
Abstract
Human action recognition (HAR) technology based on radar signals has garnered significant attention from both industry and academia due to its exceptional privacy-preserving capabilities, noncontact sensing characteristics, and insensitivity to lighting conditions. However, the scarcity of accurately labeled human radar data poses a significant challenge in meeting the demand for large-scale training datasets required by deep model-based HAR technology, thus substantially impeding technological advancements in this field. To address this issue, a semi-supervised learning algorithm, MF-Match, is proposed in this paper. This algorithm computes pseudo-labels for larger-scale unsupervised radar data, enabling the model to extract embedded human behavioral information and enhance the accuracy of HAR algorithms. Furthermore, the method incorporates contrastive learning principles to improve the quality of model-generated pseudo-labels and mitigate the impact of mislabeled pseudo-labels on recognition performance. Experimental results demonstrate that this method achieves action recognition accuracies of 86.69% and 91.48% on two widely used radar spectrum datasets, respectively, utilizing only 10% labeled data, thereby validating the effectiveness of the proposed approach. Full article
(This article belongs to the Section Sensing and Imaging)

29 pages, 2335 KB  
Article
Robust Support Vector Data Description with Truncated Loss Function for Outliers Depression
by Huakun Chen, Yongxi Lyu, Jingping Shi and Weiguo Zhang
Entropy 2024, 26(8), 628; https://doi.org/10.3390/e26080628 - 25 Jul 2024
Viewed by 1442
Abstract
Support vector data description (SVDD) is widely regarded as an effective technique for addressing anomaly detection problems. However, its performance can significantly deteriorate when the training data are affected by outliers or mislabeled observations. This study introduces a universal truncated loss function framework into the SVDD model to enhance its robustness and employs the fast alternating direction method of multipliers (ADMM) algorithm to solve various truncated loss functions. Moreover, the convergence of the fast ADMM algorithm is analyzed theoretically. Within this framework, we developed the truncated generalized ramp, truncated binary cross entropy, and truncated linear exponential loss functions for SVDD. We conducted extensive experiments on synthetic and real-world datasets to validate the effectiveness of these three SVDD models in handling data with different noise levels, demonstrating their superior robustness and generalization capabilities compared to other SVDD models. Full article
(This article belongs to the Special Issue Applications of Information Theory to Machine Learning)
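The general idea behind a truncated loss is that capping a sample's penalty bounds the influence any single outlier or mislabeled observation can have. The sketch below illustrates this with a hinge-style SVDD slack capped at a constant. It is an assumption-laden illustration: the abstract does not give the paper's truncated generalized ramp, binary cross entropy, or linear exponential losses, and the names `svdd_slack`, `truncated`, and `total_loss` are hypothetical.

```python
from typing import Sequence


def svdd_slack(point: Sequence[float], center: Sequence[float],
               radius: float) -> float:
    """Hinge-style penalty: how far the squared distance exceeds radius**2."""
    dist_sq = sum((x - c) ** 2 for x, c in zip(point, center))
    return max(0.0, dist_sq - radius ** 2)


def truncated(loss_value: float, cap: float) -> float:
    """Cap the loss so that no single sample contributes more than `cap`."""
    return min(loss_value, cap)


def total_loss(points: Sequence[Sequence[float]], center: Sequence[float],
               radius: float, cap: float) -> float:
    """Sum of truncated slacks over all training points."""
    return sum(truncated(svdd_slack(p, center, radius), cap) for p in points)


center, radius, cap = (0.0, 0.0), 1.0, 2.0
inside = svdd_slack((0.5, 0.0), center, radius)        # inside the sphere
near_outlier = svdd_slack((2.0, 0.0), center, radius)  # untruncated penalty
far_outlier = svdd_slack((10.0, 0.0), center, radius)  # would dominate untruncated
loss = total_loss([(0.5, 0.0), (2.0, 0.0), (10.0, 0.0)], center, radius, cap)
```

With the cap in place, the far outlier contributes the same bounded amount as the near one, which is the robustness property the abstract attributes to truncation. The capped objective is non-convex, which is consistent with the abstract's use of a dedicated solver (fast ADMM).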

18 pages, 1106 KB  
Article
MKDAT: Multi-Level Knowledge Distillation with Adaptive Temperature for Distantly Supervised Relation Extraction
by Jun Long, Zhuoying Yin, Yan Han and Wenti Huang
Information 2024, 15(7), 382; https://doi.org/10.3390/info15070382 - 30 Jun 2024
Cited by 2 | Viewed by 2136
Abstract
Distantly supervised relation extraction (DSRE), first used to address the limitations of manually annotated data via automatically annotating the data with triplet facts, is prone to issues such as mislabeled annotations due to the interference of noisy annotations. To address the interference of noisy annotations, we leveraged a novel knowledge distillation (KD) method which was different from the conventional models on DSRE. More specifically, we proposed a model-agnostic KD method, Multi-Level Knowledge Distillation with Adaptive Temperature (MKDAT), which mainly involves two modules: Adaptive Temperature Regulation (ATR) and Multi-Level Knowledge Distilling (MKD). ATR allocates adaptive entropy-based distillation temperatures to different training instances for providing a moderate softening supervision to the student, in which label hardening is possible for instances with great entropy. MKD combines the bag-level and instance-level knowledge of the teacher as supervisions of the student, and trains the teacher and student at the bag and instance levels, respectively, which aims at mitigating the effects of noisy annotation and improving the sentence-level prediction performance. In addition, we implemented three MKDAT models based on the CNN, PCNN, and ATT-BiLSTM neural networks, respectively, and the experimental results show that our distillation models outperform the baseline models on bag-level and instance-level evaluations. Full article
(This article belongs to the Section Artificial Intelligence)
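The entropy-based temperature idea can be illustrated with a small sketch. The abstract does not give the ATR formula, so the mapping below is purely an assumption: the temperature falls linearly from `t_high` to `t_low` as the normalized entropy of the teacher's prediction rises, and choosing `t_low` below 1 lets high-entropy instances be hardened rather than softened. All function names and the default values are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max logit for stability."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy divided by log(K), so the value lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def adaptive_temperature(teacher_logits: Sequence[float],
                         t_low: float = 0.5, t_high: float = 2.0) -> float:
    """Assumed linear map: high entropy gives a low temperature (hardening)."""
    h = normalized_entropy(softmax(teacher_logits))
    return t_high - (t_high - t_low) * h


def soft_targets(teacher_logits: Sequence[float]) -> List[float]:
    """Teacher distribution softened or hardened with a per-instance temperature."""
    return softmax(teacher_logits, adaptive_temperature(teacher_logits))


uncertain_logits = [1.0, 0.9, 0.8]   # near-uniform teacher prediction
confident_logits = [20.0, 0.0, 0.0]  # sharply peaked teacher prediction
t_uncertain = adaptive_temperature(uncertain_logits)
t_confident = adaptive_temperature(confident_logits)
```

The student would then be trained against `soft_targets(...)` for each instance. Other monotone mappings from entropy to temperature would serve the same illustrative purpose.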

27 pages, 1184 KB  
Article
Methodology for the Detection of Contaminated Training Datasets for Machine Learning-Based Network Intrusion-Detection Systems
by Joaquín Gaspar Medina-Arco, Roberto Magán-Carrión, Rafael Alejandro Rodríguez-Gómez and Pedro García-Teodoro
Sensors 2024, 24(2), 479; https://doi.org/10.3390/s24020479 - 12 Jan 2024
Cited by 5 | Viewed by 2965
Abstract
With the significant increase in cyber-attacks and attempts to gain unauthorised access to systems and information, Network Intrusion-Detection Systems (NIDSs) have become essential detection tools. Anomaly-based systems use machine learning techniques to distinguish between normal and anomalous traffic. They do this by using training datasets that have been previously gathered and labelled, allowing them to learn to detect anomalies in future data. However, such datasets can be accidentally or deliberately contaminated, compromising the performance of NIDS. This has been the case of the UGR’16 dataset, in which, during the labelling process, botnet-type attacks were not identified in the subset intended for training. This paper addresses the mislabelling problem of real network traffic datasets by introducing a novel methodology that (i) allows analysing the quality of a network traffic dataset by identifying possible hidden or unidentified anomalies and (ii) selects the ideal subset of data to optimise the performance of the anomaly detection model even in the presence of hidden attacks erroneously labelled as normal network traffic. To this end, a two-step process that makes incremental use of the training dataset is proposed. Experiments conducted on the contaminated UGR’16 dataset in conjunction with the state-of-the-art NIDS, Kitsune, conclude with the feasibility of the approach to reveal observations of hidden botnet-based attacks on this dataset. Full article
(This article belongs to the Section Sensor Networks)

26 pages, 2261 KB  
Article
AgriSen-COG, a Multicountry, Multitemporal Large-Scale Sentinel-2 Benchmark Dataset for Crop Mapping Using Deep Learning
by Teodora Selea
Remote Sens. 2023, 15(12), 2980; https://doi.org/10.3390/rs15122980 - 7 Jun 2023
Cited by 8 | Viewed by 4659
Abstract
With the increasing volume of collected Earth observation (EO) data, artificial intelligence (AI) methods have become state-of-the-art in processing and analyzing them. However, there is still a lack of high-quality, large-scale EO datasets for training robust networks. This paper presents AgriSen-COG, a large-scale benchmark dataset for crop type mapping based on Sentinel-2 data. AgriSen-COG deals with the challenges of remote sensing (RS) datasets. First, it includes data from five different European countries (Austria, Belgium, Spain, Denmark, and the Netherlands), targeting the problem of domain adaptation. Second, it is multitemporal and multiyear (2019–2020), therefore enabling analysis based on the growth of crops in time and yearly variability. Third, AgriSen-COG includes an anomaly detection preprocessing step, which reduces the amount of mislabeled information. AgriSen-COG comprises 6,972,485 parcels, making it the most extensive available dataset for crop type mapping. It includes two types of data: pixel-level data and parcel aggregated information. In doing so, we target two computer vision (CV) problems: semantic segmentation and classification. To establish the validity of the proposed dataset, we conducted several experiments using state-of-the-art deep-learning models for temporal semantic segmentation with pixel-level data (U-Net and ConvStar networks) and time-series classification with parcel aggregated information (LSTM, Transformer, TempCNN networks). The most popular models (U-Net and LSTM) achieve the best performance in the Belgium region, with a weighted F1 score of 0.956 (U-Net) and 0.918 (LSTM). The proposed data are distributed as a cloud-optimized GeoTIFF (COG), together with a SpatioTemporal Asset Catalog (STAC), which makes AgriSen-COG a findable, accessible, interoperable, and reusable (FAIR) dataset. Full article

19 pages, 3114 KB  
Article
Development of PCA-MLP Model Based on Visible and Shortwave Near Infrared Spectroscopy for Authenticating Arabica Coffee Origins
by Agus Dharmawan, Rudiati Evi Masithoh and Hanim Zuhrotul Amanah
Foods 2023, 12(11), 2112; https://doi.org/10.3390/foods12112112 - 24 May 2023
Cited by 24 | Viewed by 3272
Abstract
Arabica coffee, one of Indonesia’s economically important coffee commodities, is commonly subject to fraud due to mislabeling and adulteration. In many studies, spectroscopic techniques combined with chemometric methods, such as principal component analysis (PCA) and discriminant analyses, have been employed for classification far more widely than machine learning models. In this study, spectroscopy combined with PCA and a machine learning algorithm (artificial neural network, ANN) was developed to verify the authenticity of Arabica coffee collected from four geographical origins in Indonesia: Temanggung, Toraja, Gayo, and Kintamani. Spectra from pure green coffee were collected with Vis–NIR and SWNIR spectrometers. Several preprocessing techniques were also applied to attain precise information from the spectroscopic data. First, PCA compressed the spectroscopic information and generated new variables called PC scores, which became the inputs for the ANN model. The discrimination of Arabica coffee from different origins was conducted with a multilayer perceptron (MLP)-based ANN model. The accuracy attained ranged from 90% to 100% in the internal cross-validation, training, and testing sets. The error in the classification process did not exceed 10%. The generalization ability of the MLP combined with PCA was superior, suitable, and successful for verifying the origin of Arabica coffee. Full article

15 pages, 2596 KB  
Article
Tri-Training Algorithm for Adaptive Nearest Neighbor Density Editing and Cross Entropy Evaluation
by Jia Zhao, Yuhang Luo, Renbin Xiao, Runxiu Wu and Tanghuai Fan
Entropy 2023, 25(3), 480; https://doi.org/10.3390/e25030480 - 9 Mar 2023
Cited by 3 | Viewed by 1774
Abstract
Tri-training expands the training set by adding pseudo-labels to unlabeled data, which effectively improves the generalization ability of the classifier. However, unlabeled data are easily mislabeled, which introduces training noise and damages the learning efficiency of the classifier, and the explicit decision mechanism tends to let this training noise degrade the accuracy of the classification model in the prediction stage. This study proposes the Tri-training algorithm for adaptive nearest neighbor density editing and cross-entropy evaluation (TTADEC), which is used to reduce the training noise formed during classifier iteration and to solve the problem of inaccurate prediction by the explicit decision mechanism. First, the TTADEC algorithm uses nearest neighbor editing to label high-confidence samples. Then, it combines this with the relative nearest neighbor to define the local density of samples, screens the pre-training samples, and dynamically expands the training set using an adaptive technique. Finally, the decision process uses cross-entropy to evaluate the trained base classifiers and assigns appropriate weights to them to construct a decision function. The effectiveness of the TTADEC algorithm is verified on the UCI dataset, and the experimental results show that, compared with the standard Tri-training algorithm and its improved variants, the TTADEC algorithm has better classification performance and can effectively deal with semi-supervised classification problems where the training set is insufficient. Full article