Article

FedENLC: An End-to-End Noisy Label Correction Framework in Federated Learning

1 Department of Convergence Engineering for Artificial Intelligence, Sejong University, Seoul 05006, Republic of Korea
2 Department of Artificial Intelligence and Data Science, Sejong University, Seoul 05006, Republic of Korea
3 Deep Learning Architecture Research Center, Sejong University, Seoul 05006, Republic of Korea
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(2), 290; https://doi.org/10.3390/math14020290
Submission received: 24 November 2025 / Revised: 9 January 2026 / Accepted: 12 January 2026 / Published: 13 January 2026

Abstract

In this paper, we propose FedENLC, an end-to-end noisy label correction model that performs model training and label correction simultaneously to fundamentally mitigate the label noise problem of federated learning (FL). FedENLC consists of two stages. In the first stage, the proposed model employs Symmetric Cross Entropy (SCE), a robust loss function for noisy labels, and label smoothing to prevent the model from being biased by incorrect information in noisy environments. Subsequently, a Bayesian Gaussian Mixture Model (BGMM) is utilized to detect noisy clients. BGMM mitigates extreme parameter bias through its prior distribution, enabling stable and reliable detection in FL environments where data heterogeneity and noisy labels coexist. In the second stage, only the top noisy clients with high noise ratios are selectively included in the label correction process. The selection of top noisy clients is determined dynamically by considering the number of classes, posterior probabilities, and the degree of data heterogeneity. Through this approach, the proposed model prevents performance degradation caused by incorrect detection, while improving both computational efficiency and training stability. Experimental results show that FedENLC achieves significantly improved performance over existing models on the CIFAR-10 and CIFAR-100 datasets under data heterogeneity settings along with four noise settings.

1. Introduction

The widespread adoption of modern edge devices, such as smartphones, has led to the generation of large-scale distributed datasets. Existing centralized learning requires transmitting local data to a central server to utilize distributed datasets, which can lead to serious privacy issues. To address these issues, federated learning (FL) has been actively studied, which shifts the training process from a central server to individual edge devices. FL [1,2,3] enables the use of distributed data across clients without transmitting local data to a central server, and has been successfully applied in various applications that require privacy preservation [4,5,6,7].
FL still suffers from data heterogeneity and label noise. Data heterogeneity refers to differences in the class distribution of local data across clients, and label noise refers to incorrect labels present in local datasets. Data are often not Independent and Identically Distributed (non-IID) across all clients, and many methods [8,9,10,11] have been proposed to address data heterogeneity in recent years. However, research focusing on the issues related to noisy labels within local datasets remains limited. FL assigns labels for local datasets using alternative labeling techniques, such as automatic labeling, due to privacy issues, but these methods inevitably generate noisy labels. The noisy labels induce bias in local models and hinder the convergence of the global model, thereby further disrupting the overall FL training process. Consequently, the data with label noise reduce both the convergence speed and the final performance of FL.
To address these noisy label issues, several studies have been proposed to improve robustness to label noise in federated learning environments. For example, Robust FL [12] proposes a method to reduce the impact of noisy labels by using class-wise feature representation alignment, confidence-based sample selection, and pseudo-labeling based on global model predictions to mitigate learning inconsistencies among clients in federated learning settings. FedNoRo [4] considers both class imbalance and label noise heterogeneity across clients and improves the stability of global model learning by identifying highly noisy clients and suppressing their influence during the model aggregation process.
Recently, an end-to-end label correction model, FedELC [13], which directly corrects noisy labels during the local training process, has been proposed. FedELC integrates the label correction procedure into the model training process, allowing labels to be progressively revised as training proceeds. To achieve this, FedELC designs a differentiable variable and integrates it with the local model update process. The training process of FedELC consists of two stages. In the first stage, the class-wise average loss for each client is collected while training the global model using the CE loss function. A two-component Gaussian Mixture Model (GMM) is used to classify the clients into clean and noisy groups. In the second stage, only the clients in the noisy group perform end-to-end label optimization to directly correct noisy labels during training, while the clients in the clean group apply the same local update procedure as in the first stage. For noisy label correction and local updates of noisy clients, a triplet loss is constructed. The triplet loss consists of a classification loss based on RCE, an entropy regularization loss, and a compatibility regularization loss. After the local update process, FedELC updates the global model through Distance-Aware (DA) aggregation. This framework effectively mitigates the negative impact of the noisy labels on model training and has demonstrated state-of-the-art performance compared with existing FL methods.
In this paper, we propose FedENLC, which improves the robustness and effectiveness of noisy label correction in federated learning settings with noisy labels and non-IID data through a careful integration of robust loss, Bayesian client-level uncertainty estimation, and selective correction. FedENLC is also composed of two stages, following the same structure as FedELC. In the first stage, unlike the existing FedELC that uses CE, FedENLC trains the global model using the SCE [14] loss function and collects class-wise average loss values for each client during this process. Label smoothing [15] is applied to the noisy labels at this stage. After the stage-1 global model training is completed, unlike the existing model that uses a two-component GMM, a three-component BGMM [16] is used to divide clients into a noisy group, an ambiguous group, and a clean group. In the second stage, instead of applying label correction to all noisy clients as in the existing model, only the top clients with high noise ratios within the noisy group are selected to perform label correction. For the clean clients, the same local update process as in the first stage is applied. For noisy label correction and local updates of the selected high-noise clients, the proposed model also uses the triplet loss. In this case, unlike the existing model that uses RCE as the classification loss, the proposed model uses SCE. After the local update process, the proposed model updates the global model using the same DA aggregation method as the existing model.
Our key contributions can be summarized in three points:
  • SCE is used as the loss function in both stage-1 and stage-2, and label smoothing is applied in stage-1. SCE combines CE and RCE and is a loss function robust to noisy labels, while label smoothing distributes part of the probability mass of the ground-truth label to neighboring classes, mitigating overconfidence and preventing extreme probability shifts caused by noisy labels.
  • A three-component BGMM is used to distinguish noisy clients. By leveraging Bayesian priors, BGMM assigns prior distributions to parameters to alleviate extreme bias, making it more suitable than EM-based GMM for federated learning environments that are already biased due to label noise and data heterogeneity. In particular, a three-component BGMM allows more fine-grained client separation and can detect noisy clients more accurately and stably than a two-component GMM.
  • Label correction is applied only to clients with high noise ratios among the noisy clients. This approach selectively applies the computationally expensive label correction procedure only to high-noise clients, thereby reducing performance degradation caused by misclassification, improving the stability of federated learning, and increasing the overall training efficiency.
In our experiments, we evaluate the proposed model on the CIFAR-10 and CIFAR-100 datasets under two non-IID settings along with four noise settings. Experimental results show that the proposed FedENLC achieves significantly higher class-wise average precision and recall than the existing FedELC. In addition, to compare the computational efficiency of the proposed model and the existing model, we measure the training and test times and confirm that the proposed model requires less training time than the existing model. We then perform an ablation study to verify the effectiveness of the proposed methods and also perform an analysis of the sensitivity of the newly introduced hyperparameters.

2. Related Works

2.1. Federated Learning

FL is a distributed machine learning method that enables the use of decentralized client data without exposing local data to a central server, thereby preserving privacy while leveraging distributed datasets. FedAvg [2], a representative FL method, aggregates diverse information from client datasets by averaging the updated model parameters. FedProx [17] introduces a proximal term during the local update phase to ensure that the local model parameters do not deviate excessively from the received global model parameters in each communication round. FedExP [18] accelerates FedAvg by applying parameter extrapolation on the server side. These methods primarily focus on addressing the non-IID data commonly observed in distributed client datasets. However, FL also encounters the label noise problem, which arises from the presence of incorrect labels in distributed data. Noisy labels are inherently introduced during the annotation of local datasets through methods such as crowdsourcing and automatic label generation, since the central server cannot verify local data due to privacy constraints in FL.

2.2. Federated Noisy Label Learning

A number of approaches have been developed in centralized learning to address the label noise problem. These methods can be combined with FedAvg [2], the most widely used model aggregation algorithm, and therefore be easily integrated into the FL pipeline. First, Co-teaching [19] and its derivative method Co-teaching+ [20] maintain two peer networks with identical architectures but different initializations. For each batch, each network selects samples for its peer. For example, Co-teaching assumes that samples with lower loss are more reliable and selects them to provide more reliable supervision information to the peer network. Another line of research focuses on designing robust losses [14] or robust training strategies [21,22,23]. SCE [14] is a robust loss function for label noise. This method includes model predictions in the loss term, since the model can provide accurate predictions even though the provided labels may be noisy. Among robust training strategies, the joint optimization framework (Joint Optim) [21] alternately updates network parameters and labels, enabling label correction during training. In this process, labels are gradually updated by averaging the model predictions obtained in previous epochs. SELFIE [22] robustly selects potentially inaccurate samples and gradually incorporates them into the training process. DivideMix [23] integrates multiple techniques, including Co-teaching, MixUp [24] for data augmentation, and MixMatch [25], a semi-supervised learning [26] framework.
Subsequently, research in the FL domain that aims to address label noise issues has focused on designing more robust aggregation methods. Median [27] is a commonly used aggregation technique: instead of employing a weighted average as in FedAvg, it computes the median of the client model parameters, thereby reducing the influence of severely corrupted weights on the global model. TrimmedMean [28] removes the largest and smallest parameter values for each selected local model and computes the mean of the remaining values to form the global model. Krum [29] first identifies the closest neighbors for each local model, then calculates the sum of distances between a client and its closest local models, and finally selects the model with the smallest distance sum as the global model. These methods primarily focus on reducing the impact of incorrect model weights introduced by noisy labels through robust aggregation, but they do not directly address the noisy labels themselves.
In recent work, several methods have been proposed that aim to directly handle the noisy labels. Robust FL [12] is the first approach to directly process noisy data without relying on a perfectly annotated, high quality auxiliary dataset [30,31]. This approach collects the local class-wise centroids to form global mean class-wise centroids, which are then used as additional global supervision to regularize the local training process. However, the transmitted class-wise centroids contain sensitive information about client data, which may risk privacy leakage. FedLSR [30] further strengthens data privacy by proposing a local regularization method based on self-distillation [32]. FedRN [33] maintains a server-side pool of client models to utilize a reliable neighbor model for each client. It employs an ensemble Gaussian mixture model trained to fit the loss values of local data evaluated by the reliable neighbor model, thereby detecting clean samples. FedNoRo [4] introduces a two-stage framework that identifies noisy clients and designs distinct local optimization objectives for clean and noisy clients. Noisy clients update the federated model using a knowledge-distillation-based loss function to achieve robustness against label noise. Finally, FedELC [13] is a two-stage end-to-end model that incorporates differentiable variables, enabling simultaneous model training and noisy label correction. Similar to FedNoRo, FedELC consists of two stages: the first stage detects the noisy clients using a GMM, and the second stage jointly performs the label noise optimization and model updates.

3. Preliminaries

3.1. Problem Definition

In this paper, we assume that the FL system consists of a server and N clients. S denotes the set of all N clients. Each client, indexed by k, maintains a local dataset with n_k samples, expressed as $D_k = \{(x_i, \hat{y}_i)\}_{i=1}^{n_k}$, and the total number of samples is $n = \sum_{k \in S} n_k$. The one-hot label $\hat{y}$ may contain noise, and the unknown ground-truth label is denoted as $y^*$. The overall objective of FL is for the N clients to solve the optimization problem on their own local datasets, which can be formulated as follows:
\min_{w} f(w) := \sum_{k \in S} \frac{n_k}{n} F_k(w),
where F_k(w) is defined as $\mathbb{E}_{(x,\hat{y}) \sim D_k}\left[ l_k(\hat{y}, f_k(x; w_k)) \right]$, and the local objective of client k is to minimize the loss over D_k. Here, l_k denotes the loss function for the local dataset of client k, and $f_k(x; w_k)$ represents the local prediction for each sample x using the local model parameterized by w_k.
Subsequently, we assume that client k has its own non-IID dataset D_k and cannot directly access the data of other clients. Following the standard FL framework [2], FL requires T global rounds to achieve final convergence. In each global round t, the server first selects a group of clients S_t to perform local updates for E local epochs, and then aggregates the updated local models from S_t to form the updated global model w^{t+1}. The global model w^{t+1} can be expressed as follows:
w^{t+1} = \sum_{k \in S_t} \frac{n_k^t}{n^t} w_k^t,
where n^t denotes the total number of data samples from all selected clients S_t in global round t, and n_k^t represents the number of data samples of client k in global round t.
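To make the aggregation rule concrete, the following is a minimal sketch (ours, not the authors' released code) of the sample-count-weighted averaging in the equation above, written in PyTorch-style Python; the helper name fedavg_aggregate is hypothetical.

```python
# Minimal sketch of the FedAvg aggregation above (hypothetical helper, not the paper's code):
# a sample-count-weighted average of local model state dicts.
import torch

def fedavg_aggregate(local_states, local_sizes):
    """local_states: list of state_dicts w_k^t; local_sizes: list of sample counts n_k^t."""
    total = float(sum(local_sizes))
    global_state = {}
    for key in local_states[0]:
        # w^{t+1}[key] = sum_k (n_k^t / n^t) * w_k^t[key]
        global_state[key] = sum(
            (n_k / total) * state[key].float()
            for state, n_k in zip(local_states, local_sizes)
        )
    return global_state
```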

3.2. Label Noise

To construct the noisy label environments in FL, synthetic label noise is added to each local dataset using a label transition matrix M. Here, M represents the flipping of the ground-truth label y * from the clean class to the noisy class y ^ . The matrix M typically follows one of two commonly used structures: symmetric flipping or asymmetric (or pairwise) flipping. Symmetric flipping means that the original class label is flipped to any other incorrect class label with equal probability, whereas asymmetric flipping means that the original class label is flipped only to specific incorrect categories. Given a maximum noise rate ϵ in FL, each client k is assigned a noise level that increases linearly from 0 to ϵ as the client index k increases.
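As an illustration of this noise model, the sketch below (ours, under the assumptions stated above) injects symmetric or asymmetric label noise into an integer label array and assigns linearly increasing noise levels across clients; the pairwise mapping of class c to c + 1 is one common choice, not necessarily the exact mapping used in the experiments.

```python
# Illustrative sketch (ours) of synthetic label noise injection: symmetric flipping moves a
# label to any other class uniformly; asymmetric flipping maps each class to one fixed
# incorrect class. Client k receives a noise level growing linearly up to the maximum rate.
import numpy as np

def add_label_noise(labels, noise_rate, num_classes, mode="symmetric", rng=None):
    """labels: 1-D numpy array of integer class labels."""
    rng = rng or np.random.default_rng()
    noisy = labels.copy()
    flip_mask = rng.random(len(labels)) < noise_rate
    for i in np.where(flip_mask)[0]:
        if mode == "symmetric":
            candidates = [c for c in range(num_classes) if c != labels[i]]
            noisy[i] = rng.choice(candidates)
        else:  # asymmetric / pairwise: class c -> (c + 1) mod M as one common choice
            noisy[i] = (labels[i] + 1) % num_classes
    return noisy

def client_noise_levels(num_clients, eps):
    # Noise level increases linearly from 0 to eps with the client index.
    return [eps * k / max(num_clients - 1, 1) for k in range(num_clients)]
```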

4. Methodology

4.1. Proposed Model Framework

The overall framework of the proposed model, FedENLC, consists of two main stages and is illustrated in Figure 1. In the first stage, a warm-up global model is trained based on FedAvg. During the warm-up training process, the server first sends the initial global model w_global to all clients. Each client then trains the received model using its local data, sends the resulting local model w_local back to the server, and the server aggregates the uploaded local models to update the global model. Here, the proposed model applies the SCE loss function, label smoothing, and logit adjustment to update the local models. During this process, the server collects class-wise loss values from each client to construct a loss matrix. After the warm-up training process in stage-1, BGMM classifies clients into noisy clients, clean clients, and ambiguous clients with partial noise based on the loss matrix. In the second stage, only the top noisy clients, i.e., those with the highest estimated noise ratios, perform the label correction procedure and the model training simultaneously, while the remaining clients follow the same local model update process as in stage-1. To support the local update and the label correction procedure of these top noisy clients, we introduce a learnable parameter as in the existing model and construct three loss functions. Among these three loss functions, the classification loss is defined using SCE. Finally, after the local training process is completed, the global model is aggregated using the DA method.

4.1.1. Stage-1: Noisy Client Detection

The first stage is conducted for a total of T global rounds. In each round of stage-1, every client updates its local model using SCE, and the global model is updated using the FedAvg aggregation. The local update process in stage-1 of the proposed FedENLC is illustrated in Figure 2. First, during the local update process in stage-1, each client applies logit adjustment [11] that considers the local class distribution, which can be defined as follows:
\tilde{p} = p + \log(\pi),
where p denotes the output of the local model f, and π represents the prior distribution of the local dataset.
Subsequently, unlike the existing model that uses Cross Entropy (CE) as the loss function for the local update in stage-1, the proposed model employs SCE and applies label smoothing. Label smoothing distributes a portion of the probability mass to neighboring classes instead of using a one-hot vector, preventing the model from becoming overly confident in any particular class. This probabilistic relaxation reduces the sensitivity of the model to incorrect labels and effectively prevents overfitting caused by noisy labels. The label smoothing process can be defined as follows:
\hat{y}_{ls} = (1 - \varepsilon)\,\hat{y} + \varepsilon / M,
where $\hat{y}$ denotes the one-hot hard label that may contain noise, M represents the number of classes in the dataset, and ε indicates the smoothing factor, which is set to 0.05 in this paper.
In general, the CE loss function is widely used for model optimization in most deep learning models. However, CE is vulnerable to label noise. Specifically, it tends to overfit easy classes while failing to sufficiently learn hard classes [14]. Accordingly, unlike the existing model that employs CE during the local update process in stage-1, we adopt the SCE loss to ensure stable learning even when noisy labels are present. SCE combines the standard CE term with an additional Reverse Cross Entropy (RCE) term, utilizing the complementary properties of the two terms. In detail, RCE is defined as the reverse form of CE and is robust to label noise, whereas CE, despite not being robust to label noise, provides stable convergence. Therefore, SCE achieves stable learning performance in FL environments with label noise, and the classification loss of stage-1 for the local updates, L_{cls}^1, is defined as follows:
L_{cls}^{1} = \alpha_1 \, CE(\tilde{p}, \hat{y}_{ls}) + \beta_1 \, RCE(\hat{y}_{ls}, \tilde{p}),
where $\hat{y}_{ls}$ represents the soft label obtained by applying label smoothing to the one-hot hard label $\hat{y}$ that may contain noise, and $\tilde{p}$ denotes the output of the model after applying the logit adjustment. In addition, α_1 and β_1 are hyperparameters that balance CE and RCE; in this paper, we set them to 0.5 and 1 for CIFAR-10 and to 1 and 2 for CIFAR-100 based on empirical results. By simultaneously employing label smoothing and the SCE loss function, the proposed model enhances robustness to noisy labels and mitigates the overconfidence they cause. Thus, these methods play an important role in improving both the stability and the overall performance of the learning process.
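For clarity, the following is a minimal PyTorch-style sketch (ours, not the released implementation) of the stage-1 loss described above, combining logit adjustment with the local class prior, label smoothing, and the SCE combination of CE and RCE; the clamping constants are illustrative assumptions.

```python
# Minimal sketch (ours) of the stage-1 loss: logit adjustment with the local class prior,
# label smoothing of the one-hot label, and the SCE combination of CE and RCE above.
import torch
import torch.nn.functional as F

def stage1_sce_loss(logits, targets, class_prior, num_classes,
                    alpha1=0.5, beta1=1.0, eps=0.05):
    # Logit adjustment: p_tilde = p + log(pi), with pi the local class prior.
    adjusted = logits + torch.log(class_prior + 1e-12)
    # Label smoothing: y_ls = (1 - eps) * one_hot + eps / M.
    one_hot = F.one_hot(targets, num_classes).float()
    y_ls = (1.0 - eps) * one_hot + eps / num_classes
    probs = F.softmax(adjusted, dim=1)
    # CE term: -sum_m y_ls * log(p); RCE term: -sum_m p * log(y_ls), with clamping.
    ce = -(y_ls * torch.log(probs.clamp(min=1e-7))).sum(dim=1).mean()
    rce = -(probs * torch.log(y_ls.clamp(min=1e-4))).sum(dim=1).mean()
    return alpha1 * ce + beta1 * rce
```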
After the stage-1 global training rounds are completed, the proposed model applies a three-component BGMM instead of the existing two-component GMM to more effectively distinguish client noise levels. During the global rounds, each local client performs the local training process and then transmits a vector of class-wise average losses, $l_k = (l_{k,1}, l_{k,2}, \ldots, l_{k,M})$, to the central server. The server constructs an N × M loss matrix $L = [l_1, l_2, \ldots, l_N]$ from the loss vectors collected from all clients k = 1, …, N and trains a three-component BGMM on this loss matrix to classify all clients into three groups. BGMM assigns prior distributions to each parameter, such as the mixture weights, means, and covariances, which effectively mitigates extreme parameter estimates even under severe non-IID or high label noise conditions, enabling more stable inference compared with the GMM. After training, the BGMM yields the posterior probability that each client belongs to a specific group. Based on this probability, clients with a high probability of belonging to the noisy group are classified as noisy clients. Through this process, the proposed model classifies clients into clean, ambiguous, and noisy groups, allowing more flexible handling of partially noisy clients (ambiguous clients) compared with the existing model.
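A hedged sketch of this client grouping step is given below, using scikit-learn's BayesianGaussianMixture fitted on the N × M loss matrix; treating the component with the highest mean loss as the noisy group is our own assumption about how the three components are ordered.

```python
# Sketch (ours) of stage-1 client grouping: fit a three-component Bayesian GMM on the
# N x M matrix of class-wise average losses and take the highest-mean-loss component as
# the noisy group. BayesianGaussianMixture places priors on weights, means, and covariances.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def detect_noisy_clients(loss_matrix):
    """loss_matrix: array of shape (N, M) with class-wise average losses per client."""
    bgmm = BayesianGaussianMixture(n_components=3, covariance_type="full",
                                   max_iter=500, random_state=0)
    bgmm.fit(loss_matrix)
    posteriors = bgmm.predict_proba(loss_matrix)      # (N, 3) membership probabilities
    mean_loss_per_comp = bgmm.means_.mean(axis=1)      # average loss of each component
    clean_c, ambiguous_c, noisy_c = np.argsort(mean_loss_per_comp)  # low -> high loss
    labels = posteriors.argmax(axis=1)
    groups = {"clean": np.where(labels == clean_c)[0],
              "ambiguous": np.where(labels == ambiguous_c)[0],
              "noisy": np.where(labels == noisy_c)[0]}
    return groups, posteriors[:, noisy_c]              # noisy-group posterior per client
```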

4.1.2. Stage-2: End-to-End Label Correction

In the second stage, the label correction procedure is applied only to the top noisy clients with high estimated noise ratios, while the remaining clients update their local models using the stage-1 loss function L_{cls}^1. The local update process in stage-2 of the proposed model is shown in Figure 3. The label correction procedure refers to jointly performing the model parameter learning and the noisy label correction, for which the learnable soft-label variable $y^d$ is introduced. The variable $y^d$ is initialized based on the one-hot hard label $\hat{y}$ and is updated through backpropagation together with the parameters of the model. Through this process, the proposed model gradually corrects the noisy labels and guides them to converge toward soft labels $y^d$ that approximate the ground-truth label $y^*$. Specifically, the soft label $y^d$ is defined as follows:
\tilde{y} = K \hat{y},
y^d = \mathrm{softmax}(\tilde{y}),
where $\tilde{y}$ is a learnable variable and K is a large constant; following the existing model, we set K to 10 in this paper. In stage-2, the local optimization objective for the detected top noisy clients consists of three loss terms:
  • First, unlike the existing model that uses CE as the classification loss, the proposed model employs SCE. Instead of using the one-hot hard label $\hat{y}$, which may contain noise, the classification loss is computed between the model prediction p and the learnable distribution $\tilde{y}$, and is defined as follows:
    L_{cls}^{2} = \alpha_2 \, CE(p, \tilde{y}) + \beta_2 \, RCE(\tilde{y}, p),
    where α_2 and β_2 are hyperparameters that balance CE and RCE. Through empirical tuning, we set them to 1 and 0.1 for CIFAR-10, and 1 and 2 for CIFAR-100.
  • Second, the compatibility regularization loss is used to ensure that the soft label $y^d$ does not deviate significantly from the hard label $\hat{y}$, and is defined as follows:
    L_{comp} = \mathrm{Compatibility}(\hat{y}, y^d) = -\sum_{m=1}^{M} \hat{y}_m \log(y_m^d),
    where M denotes the number of classes in the dataset.
  • Third, the entropy regularization loss encourages the model to produce more confident and reliable predictions, and is defined as:
    L_e = \mathrm{Entropy}(p) = -\sum_{m=1}^{M} p_m \log(p_m),
    where p_m is the softmax probability of the model output for class m.
By combining the above three terms, the triplet loss for the top noisy clients is formulated as:
L = L_{cls}^{2} + \alpha L_{comp} + \beta L_e,
where α and β are hyperparameters that balance the three loss components, set to 0.2 and 0.5, respectively, following the existing model. The optimization of the learnable variable $\tilde{y}$ through L_{cls}^{2} and L_{comp} is defined as follows:
\tilde{y} = \tilde{y} - \eta \nabla_{\tilde{y}} L,
where η denotes the learning rate for $\tilde{y}$ and is set to 1000 for CIFAR-10 and 5000 for CIFAR-100, consistent with the existing model. Note that η is not the learning rate of the model but a separate learning rate used to update the corrected labels $\tilde{y}$; its role is to control the update speed of $\tilde{y}$. An important point is that the actual amount of change in $\tilde{y}$ is determined not by η alone but by $\eta \nabla_{\tilde{y}} L$. In our method, $\nabla_{\tilde{y}} L$ is computed from a softmax-based loss and its magnitude is naturally bounded, so even when η takes a relatively large value, the updates of $\tilde{y}$ remain numerically stable. Consequently, the corrected label that approximates the ground-truth label $y^*$ is obtained by combining the model prediction p after the local update with the updated $\tilde{y}$.
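The following PyTorch-style sketch (ours, simplified) illustrates one stage-2 correction step for a batch of a top noisy client: the learnable variable ỹ is initialized from the hard labels with the constant K, the triplet loss is formed from the SCE classification term, the compatibility term, and the entropy term, and ỹ is updated with its own learning rate η. Using y^d = softmax(ỹ) inside the classification loss is our interpretation of the learnable distribution.

```python
# Illustrative sketch (ours) of the stage-2 end-to-end label correction step for one batch:
# y_tilde is initialized from the hard labels, y_d = softmax(y_tilde), and y_tilde is
# updated with its own learning rate eta while the triplet loss also trains the model.
import torch
import torch.nn.functional as F

K = 10.0  # large constant used to initialize y_tilde from the one-hot hard labels

def init_soft_labels(hard_labels, num_classes):
    return (K * F.one_hot(hard_labels, num_classes).float()).requires_grad_(True)

def label_correction_step(logits, y_tilde, hard_labels, num_classes,
                          alpha2=1.0, beta2=0.1, alpha=0.2, beta=0.5, eta=1000.0):
    probs = F.softmax(logits, dim=1)
    y_d = F.softmax(y_tilde, dim=1)
    one_hot = F.one_hot(hard_labels, num_classes).float()
    # SCE classification loss between the prediction and the learnable distribution.
    ce = -(y_d * torch.log(probs.clamp(min=1e-7))).sum(dim=1).mean()
    rce = -(probs * torch.log(y_d.clamp(min=1e-7))).sum(dim=1).mean()
    l_cls = alpha2 * ce + beta2 * rce
    # Compatibility loss keeps y_d close to the original hard label.
    l_comp = -(one_hot * torch.log(y_d.clamp(min=1e-7))).sum(dim=1).mean()
    # Entropy loss encourages confident predictions.
    l_e = -(probs * torch.log(probs.clamp(min=1e-7))).sum(dim=1).mean()
    loss = l_cls + alpha * l_comp + beta * l_e
    # Gradient step on y_tilde with its dedicated learning rate eta.
    grad = torch.autograd.grad(loss, y_tilde, retain_graph=True)[0]
    with torch.no_grad():
        y_tilde -= eta * grad
    return loss, y_tilde
```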
Subsequently, the proposed model aggregates the global model using the Distance-Aware (DA) method, which adjusts the contribution of each client based on the distance of its local model. First, the distance between the local model weights $w_i^t$ of the i-th client and the closest local model from another client is measured as follows:
d(i) = \min_{j \in S_t,\, j \neq i} \| w_i^t - w_j^t \|_2,
where d(i) of the clean clients is equal to 0, and the absolute magnitude of this value may vary significantly depending on the model scale. If this value is directly used in the global model aggregation, certain clients may exert excessively large influence, while others may be insufficiently reflected. To prevent this, a normalization step based on the maximum distance among all clients is required, defined as follows:
D(i) = \frac{d(i)}{\max_{j} d(j)},
where D(i) lies within the interval [0, 1] and serves as a scaling factor to stably compute the contribution of each client. The global model is then updated by aggregating the local models as follows:
w^{t+1} = \sum_{i=1}^{|S_t|} \frac{n_i \, e^{-D(i)}}{\sum_{j=1}^{|S_t|} n_j \, e^{-D(j)}} \, w_i^t.
In this aggregation method, the clean clients receive a constant aggregation weight because D(i) = 0, whereas the noisy clients have their weights multiplied by a scaling factor in the range (0, 1). Therefore, the weight e^{-D(i)} decreases as the distance between models increases, allowing the contribution of the noisy clients to be adaptively adjusted during global model updates.
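A minimal sketch (ours) of the DA aggregation above is shown below: each selected client's weight is scaled by e^{-D(i)}, where D(i) is the normalized distance from its model to the nearest other local model.

```python
# Minimal sketch (ours) of Distance-Aware aggregation: each client's aggregation weight is
# scaled by exp(-D(i)), with D(i) the normalized nearest-neighbor distance of its model.
import torch

def distance_aware_aggregate(local_states, local_sizes):
    flat = [torch.cat([p.flatten().float() for p in s.values()]) for s in local_states]
    n = len(flat)
    # d(i) = min_{j != i} ||w_i - w_j||_2 over the other selected clients.
    d = torch.tensor([min(torch.norm(flat[i] - flat[j]).item()
                          for j in range(n) if j != i) for i in range(n)])
    D = d / d.max().clamp(min=1e-12)                   # normalize to [0, 1]
    coeff = torch.tensor(local_sizes, dtype=torch.float) * torch.exp(-D)
    coeff = coeff / coeff.sum()
    global_state = {}
    for key in local_states[0]:
        global_state[key] = sum(c.item() * s[key].float()
                                for c, s in zip(coeff, local_states))
    return global_state
```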

4.2. Bayesian Gaussian Mixture Model

In this paper, we adopt BGMM instead of the existing GMM to detect the noisy clients more reliably. GMM estimates the mixture proportion and distribution of each group using Maximum Likelihood Estimation (MLE). However, when the data are imbalanced or the number of samples is limited, GMM often suffers from overfitting, causing the estimates to be overly biased toward specific groups [34]. In contrast, BGMM is based on Bayesian inference and assigns prior distributions to parameters such as mixture proportions, means, and covariances. This prevents extreme parameter estimation and allows BGMM to produce more robust and stable results even in highly noisy environments [35].
Moreover, BGMM provides the posterior probability that each client belongs to a specific group, enabling a quantitative assessment of the reliability and uncertainty of the detected groups. Such posterior probability based information can be directly utilized to identify the top noisy clients with the highest noise ratios. In other words, BGMM not only determines whether a client is noisy but also quantifies the degree of noise in a probabilistic manner, thereby improving the accuracy and stability of the subsequent label correction procedure that focuses on the top noisy clients. For these reasons, we employ BGMM to detect the noisy clients and use the posterior probabilities to select those with high noise ratios for the label correction procedure.

4.3. Label Correction for Top Noisy Clients

Both GMM- and BGMM-based probabilistic client detection methods may incorrectly classify clean clients as noisy. In this situation, if the label correction procedure is performed on all noisy clients, the falsely detected clients may gradually deviate from the correct label distribution due to the label correction, even though there is actually little noise. Such bias in the local update can negatively affect the convergence stability of the global model. In addition, the label correction procedure requires significantly higher computational cost than the standard model training process [13]. Thus, applying the label correction to all noisy clients would substantially increase computation and communication overhead. To mitigate these issues, FedENLC does not perform label correction on all detected noisy clients, as done in FedELC; it applies the label correction procedure only to a subset of detected clients with high noise ratios. This method reduces the risks associated with incorrect detection and enhances both stability and efficiency in FL. Accordingly, the selection ratio of the top noisy clients, top_r, is determined dynamically based on the number of classes M, the degree of non-IID γ, and the Bayesian posterior probability (detection confidence) P_post, and is expressed as follows:
top_r = \mathrm{clip}\left(0.1 + 0.001\,M + 0.05\,(1 - \gamma) + P_{post},\ 0.1,\ 0.3\right),
where clip is a clipping function that constrains the minimum and maximum selection ratios of noisy clients, which are set to 10% and 30% in this paper, respectively. While Equation (16) is a heuristic formulation, it provides a simple and stable way to adapt the correction ratio based on data heterogeneity and detection confidence, and this formulation can be adjusted depending on different datasets and federated learning settings. Based on the computed value of top_r, we subsequently identify the top noisy clients and apply the label correction procedure.
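The sketch below (ours) illustrates how top_r could be computed and used to pick the clients that undergo label correction; taking P_post as the mean noisy-group posterior and applying the ratio to the detected noisy group are our assumptions about details not fully specified above.

```python
# Hedged sketch (ours) of dynamic top noisy client selection: top_r depends on the number
# of classes M, the Dirichlet parameter gamma, and the noisy-group posterior P_post,
# clipped to [0.1, 0.3]; the highest-posterior clients are kept for label correction.
import numpy as np

def select_top_noisy_clients(noisy_clients, noisy_posteriors, num_classes, gamma):
    """noisy_clients: indices of detected noisy clients;
       noisy_posteriors: their noisy-group posterior probabilities (same length)."""
    p_post = float(np.mean(noisy_posteriors))          # assumption: mean posterior as P_post
    top_r = np.clip(0.1 + 0.001 * num_classes + 0.05 * (1.0 - gamma) + p_post, 0.1, 0.3)
    k = max(1, int(round(top_r * len(noisy_clients))))
    order = np.argsort(noisy_posteriors)[::-1][:k]     # highest noise posterior first
    return [noisy_clients[i] for i in order], top_r
```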

4.4. Experimental Datasets

In this paper, we use the CIFAR-10 and CIFAR-100 datasets, whose numbers of classes M are 10 and 100, respectively. We normalize the images using the total mean and standard deviation of each dataset, and we employ a Dirichlet distribution to construct data heterogeneity in which each client has a different distribution of the classes. Each client is assigned a portion of the training data sampled from the Dirichlet distribution with a concentration parameter γ , where a smaller value of γ indicates a higher degree of non-IID data. All experiments in this paper are conducted under the settings γ = 0.5 and γ = 1.0 .
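For reference, a common way to realize this Dirichlet-based partitioning is sketched below (ours, under the stated assumptions): for each class, client proportions are drawn from Dir(γ) and the class samples are split accordingly.

```python
# Illustrative sketch (ours) of Dirichlet-based non-IID partitioning: for each class, the
# proportion of its samples assigned to each client is drawn from Dir(gamma); a smaller
# gamma yields more heterogeneous client class distributions.
import numpy as np

def dirichlet_partition(labels, num_clients, gamma, rng=None):
    rng = rng or np.random.default_rng(0)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        proportions = rng.dirichlet(gamma * np.ones(num_clients))
        splits = (np.cumsum(proportions) * len(idx)).astype(int)[:-1]
        for client_id, part in enumerate(np.split(idx, splits)):
            client_indices[client_id].extend(part.tolist())
    return client_indices
```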
We also consider three types of label noise environments in our experiments. The three types of label noise include symmetric noise, asymmetric noise, and mixed noise. For the mixed noise setting, clients are divided into two groups, where one group receives symmetric noise and the other receives asymmetric noise. The noise ratio is linearly increased from 0 to either 0.4 or 0.8 depending on the experiment.

4.5. Evaluation Metrics

To evaluate the performance of the proposed model, we use the class-wise average precision Pre_e and recall Rec_e as evaluation metrics. These metrics allow us to assess performance while accounting for class imbalance. The class-wise average precision and recall are defined as follows:
Pre_e = \frac{1}{M} \sum_{m=1}^{M} \frac{TP_m}{TP_m + FP_m},
Rec_e = \frac{1}{M} \sum_{m=1}^{M} \frac{TP_m}{TP_m + FN_m},
where M denotes the number of classes in the dataset (m = 1, 2, …, M), and TP_m, FP_m, and FN_m represent the numbers of True Positives, False Positives, and False Negatives for class m, respectively.
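These macro-averaged metrics can be computed directly from the per-class confusion counts, as in the short sketch below (ours).

```python
# Simple sketch (ours) of the class-wise average (macro) precision and recall defined above.
import numpy as np

def macro_precision_recall(y_true, y_pred, num_classes):
    precisions, recalls = [], []
    for m in range(num_classes):
        tp = np.sum((y_pred == m) & (y_true == m))
        fp = np.sum((y_pred == m) & (y_true != m))
        fn = np.sum((y_pred != m) & (y_true == m))
        precisions.append(tp / (tp + fp) if (tp + fp) > 0 else 0.0)
        recalls.append(tp / (tp + fn) if (tp + fn) > 0 else 0.0)
    return float(np.mean(precisions)), float(np.mean(recalls))
```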

4.6. Implementation Details

For a fair comparison, we adopt the same experimental settings as the existing model. For training, CIFAR-10 and CIFAR-100 use ResNet-18 and ResNet-34 as their respective backbones, and mixed precision training [25] is applied to accelerate model training. A total of 100 clients are generated, and in each global round, 10 clients are randomly selected to participate in FL. The total number of global rounds is 120, of which 20 rounds correspond to the stage-1 warm-up phase T and the remaining 100 rounds correspond to the stage-2 global rounds. We set the number of local epochs to E = 5 for each client, with a local batch size of 64. For model optimization, we use the SGD optimizer with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.0005.

5. Results

Table 1 and Table 2 present the evaluation results of the proposed FedENLC and various existing FL methods on the CIFAR-10 and CIFAR-100 datasets. For CIFAR-10, the results shown in Table 1 indicate that FedENLC achieves the highest Pre_e and Rec_e in all environments except Rec_e under asymmetric label noise for γ = 0.5. In particular, under symmetric label noise levels across 0.0–0.4 with γ = 1.0, FedENLC achieves a Pre_e of 82.80% and a Rec_e of 81.62%, corresponding to improvements of approximately 7.80% and 6.7% over FedELC, respectively. In addition, the proposed model shows high performance improvement even for symmetric label noise across 0.0–0.8 with high noise ratios. Under the γ = 1.0 setting, Pre_e and Rec_e achieve performance improvements of 11.1% and 11.8%, respectively, while under the γ = 0.5 setting, they achieve improvements of 16.0% and 16.6%. For CIFAR-100, the results in Table 2 demonstrate that FedENLC consistently outperforms the existing methods, achieving substantial performance gains. Under symmetric label noise levels across 0.0–0.4 with γ = 1.0, FedENLC achieves a Pre_e of 51.56% and a Rec_e of 49.89%, which correspond to improvements of approximately 15.0% and 14.3% compared to FedELC. Furthermore, under symmetric label noise levels across 0.0–0.8 with γ = 0.5, FedENLC achieves substantial performance gains of approximately 23.8% and 19.8% over FedELC.
These results demonstrate that FedENLC provides stable performance even in FL environments with various types of label noise and non-IID settings. In stage-1, the proposed model utilizes SCE and label smoothing to construct a stable class-wise average loss matrix for each client that is not excessively distorted by noisy labels. The loss matrix generated through the warm-up process in stage-1 mitigates bias toward noisy labels due to SCE. Using this stabilized loss matrix as input, BGMM can probabilistically estimate the noise level of each client, enabling more reliable client classification than MLE-based GMM in non-IID and noisy label environments. As a result, the label correction performed in stage-2 can be accurately applied to noisy clients with high noise ratios, which simultaneously improves the robustness and convergence stability of the overall learning. In stage-2, the label correction procedure is applied only to the top noisy clients with high noise ratios within the detected noisy group, preventing performance degradation caused by unnecessary label modification and improving the overall training efficiency. Through this approach, the proposed model achieves significantly better performance than existing methods.
To evaluate the efficiency of the proposed FedENLC, we compare its training and test times with those of the existing FedELC, and the experimental results are shown in Table 3. The training time is computed by dividing the total time required for the entire stage-2 global training process by the number of stage-2 global epochs, and the test time is obtained by averaging the results over 10 repeated measurements. In federated learning, training time does not directly reflect communication overhead but rather indicates the computational efficiency of the proposed framework. The experimental setting is based on mixed label noise across 0.0–0.4 with Dirichlet γ = 1.0 on the CIFAR-10 and CIFAR-100 datasets. As a result, the proposed model achieves a reduction in training time of approximately 4 s on CIFAR-10 and more than 5 s on CIFAR-100 compared to the baseline. For the test time, the two models require similar amounts of time, with the proposed model being slightly faster. These results indicate that the proposed top noisy client selection method reduces unnecessary computation and communication costs by selectively applying the label correction procedure only to a subset of clients with high noise ratios.
The proposed model includes the hyperparameters α_1 and β_1 for the stage-1 SCE loss, α_2 and β_2 for the stage-2 SCE loss, α and β for the triplet loss, and the learning rate η for the soft-label variable. Among these, α, β, and η are hyperparameters adopted from previous work, and their sensitivity analysis has already been conducted in the prior study [13]. Therefore, in this paper, we perform sensitivity analysis only on the newly introduced hyperparameters α_1, β_1, α_2, and β_2, and the results are shown in Figure 4. The experimental setting follows mixed label noise across 0.0–0.4 with Dirichlet γ = 1.0 on CIFAR-10. From these results, we can observe the following: First, the proposed method shows robust performance across most choices of α_1, β_1, α_2, and β_2. However, slight performance degradation is observed for certain values of α_1 and β_1 in stage-1. Therefore, it is recommended to set α_1 to at least 0.3 and β_1 to at least 0.4 in stage-1. In contrast, α_2 and β_2 in stage-2 exhibit robust performance and are less sensitive to their values. These results indicate that the stage-1 loss has a greater impact on the proposed model than the stage-2 loss, and they allow us to identify appropriate ranges of hyperparameters for achieving optimal performance.
Finally, we conducted an ablation study to analyze how each component of FedENLC contributes to performance improvement, and the results are presented in Table 4. The experimental setting follows mixed label noise across 0.0–0.4 with Dirichlet γ = 1.0 on CIFAR-10. In Table 4, Ablation 1 replaces the three-component BGMM of the proposed model with the two-component GMM of the existing model, Ablation 2 replaces the SCE used as the stage-1 loss function with CE, and Ablation 3 removes the top noisy client selection process. The experimental results show that Ablation 1 yields similar performance under the Dirichlet γ = 1.0 setting, while its performance degrades under the Dirichlet γ = 0.5 setting. This indicates that, compared to MLE-based GMM, the Bayesian posterior of BGMM can estimate client noise levels more stably as non-IID conditions become more severe, enabling more accurate separation of noisy and clean clients. Ablation 2 shows the most significant degradation in both Pre_e and Rec_e compared to the proposed model. This is because using only CE in stage-1 causes the loss distribution to be severely distorted by noisy labels, making the loss-based features used for BGMM training unstable, which reduces the reliability of noisy client detection and label correction in stage-2 and leads to a significant overall performance degradation. Therefore, preventing distortion of the stage-1 loss matrix through SCE is the most critical component of the proposed method. Ablation 3 also shows performance degradation across all settings, with more pronounced degradation under the highly heterogeneous Dirichlet γ = 0.5 setting. This indicates that even among noisy clients, there is substantial variation in noise levels, and uniformly applying label correction to clients with relatively low noise ratios can disrupt training. Thus, selectively performing correction only for highly noisy clients based on the BGMM posterior through top noisy client selection is crucial for improving both the stability and performance of the model.
To summarize the above ablation study results, it is clear that each module of FedENLC contributes substantially to performance improvement. Removing BGMM-based probabilistic client noise estimation, SCE-based warm-up, or top noisy client selection consistently decreases both Pre_e and Rec_e, with the performance degradation being more pronounced when data heterogeneity and label noise are simultaneously severe. In addition, we have confirmed that selectively applying label correction only to the top noisy clients, rather than uniformly correcting all noisy clients, is advantageous in terms of both model complexity and performance, as it reduces unnecessary operations while preventing performance degradation. These results indicate that FedENLC is a more effective framework for real-world federated learning environments where noisy labels and non-IID data coexist.

6. Discussion

In this paper, we propose FedENLC, an end-to-end label correction framework that improves the robustness of federated learning in realistic settings with noisy labels and non-IID data through a careful integration of robust loss, Bayesian client-level uncertainty estimation, and selective correction. From the performance comparisons in Table 1 and Table 2, FedENLC consistently achieves higher Pre_e and Rec_e than the baseline FedELC and various robust FL methods, demonstrating that the proposed model performs label correction more effectively in noisy-label FL settings. In addition, the time complexity comparison in Table 3 shows that the selective label correction method of FedENLC reduces unnecessary computation, enabling faster training compared to FedELC. The sensitivity analysis in Figure 4 further confirms the robustness of the key hyperparameters used in the stage-1 and stage-2 loss functions. Furthermore, the ablation study results in Table 4 demonstrate that BGMM-based probabilistic noise estimation, SCE-based warm-up, and top noisy client selection each contribute to performance improvement, and when combined, they achieve the most robust performance in noisy-label and non-IID environments. In particular, the effectiveness of each component is more pronounced under the more heterogeneous Dirichlet γ = 0.5 setting.
However, this paper does not address the integration of the proposed approach with important federated learning components related to privacy preservation, including differential privacy and secure aggregation. In addition, our experiments are based on relatively small-scale image benchmarks such as CIFAR-10 and CIFAR-100, which limits their ability to fully reflect real-world federated learning environments with extreme data heterogeneity or label sparsity. Therefore, future work will explore extending the proposed framework to privacy-preserving federated learning settings by applying privacy-preserving mechanisms to the client level loss statistics used in FedENLC, as well as conducting large-scale experiments on real-world datasets such as Clothing-1M and evaluating generalization to non-visual domains such as text, medical, and sensor data.

7. Conclusions

In this paper, we proposed FedENLC, an enhanced end-to-end noisy label correction model designed to effectively mitigate complex label noise issues in FL. The proposed model extends the existing FedELC architecture through a careful integration of robust loss, Bayesian client-level uncertainty estimation, and selective correction. In the first stage, SCE and label smoothing were applied to enable robust learning against label noise, and noisy client detection was then performed based on the BGMM. In the second stage, label correction and the local model update were performed simultaneously only for the top noisy clients with high noise ratios. The SCE and label smoothing applied in stage-1 enabled the local models to learn a global model that was robust to label noise without being dominated by incorrect labels, thereby further improving the accuracy and stability of the subsequent label correction procedure. The BGMM further mitigated extreme parameter bias through its prior distributions and enabled quantitative evaluation of the noise level of each client using posterior probabilities. Through this approach, the proposed model achieved more stable and reliable noisy client detection than the existing model, and by correcting labels only for the top noisy clients, it reduced the risks caused by unnecessary label modification.
Experimental results showed that FedENLC achieved significantly better performance than FedELC on the CIFAR-10 dataset under the two non-IID settings along with the four label noise settings, except for Rec_e under asymmetric label noise with γ = 1.0. On the CIFAR-100 dataset, FedENLC demonstrated significantly improved performance over FedELC and other baselines in all environments. Additionally, through time complexity comparison experiments, we confirmed that FedENLC achieved higher performance and faster training than FedELC by reducing unnecessary updates in stage-2. We also performed a sensitivity analysis and ablation studies. The sensitivity analysis demonstrated the appropriate ranges of the SCE hyperparameters used in stage-1 and stage-2, as well as the robustness of the proposed model. The ablation study further showed that BGMM-based noisy client estimation, SCE-based warm-up, and top noisy client selection each contribute to performance improvement, with their effects being more pronounced under more heterogeneous data settings. These results demonstrated that FedENLC can provide high stability and robustness even in practical FL scenarios involving label noise. Furthermore, the BGMM-based client detection method, the combination of SCE and label smoothing, and the selective correction of top noisy clients are expected to be broadly applicable and extendable to future FL methods.

Author Contributions

Conceptualization, Y.C. and J.K.; methodology, Y.C.; software, Y.C.; validation, Y.C.; formal analysis, Y.C.; investigation, Y.C.; writing—original draft, Y.C.; writing—review and editing, J.K.; visualization, Y.C.; supervision, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) under the metaverse support program to nurture the best talents (IITP-2025-RS-2023-00254529) grant funded by the Korea government (MSIT). This work was also supported by the “Regional Innovation System & Education (RISE)” through the Seoul RISE Center, funded by the Ministry of Education (MOE) and the Seoul Metropolitan Government (2025-RISE-01-019-04).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in [CIFAR dataset] at https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 22 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Process. Mag. 2020, 37, 50–60. [Google Scholar] [CrossRef]
  2. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Agüera y Arcas, B. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
  3. Wang, Y.; Li, R.; Tan, H.; Jiang, X.; Sun, S.; Liu, M.; Gao, B.; Wu, Z. Federated skewed label learning with logits fusion. arXiv 2023, arXiv:2311.08202. [Google Scholar] [CrossRef]
  4. Wu, N.; Yu, L.; Jiang, X.; Cheng, K.-T.; Yan, Z. FedNoRo: Towards Noise-Robust Federated Learning by Addressing Class Imbalance and Label Noise Heterogeneity. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023; pp. 4424–4432. [Google Scholar]
  5. Yan, B.; Cao, D.; Jiang, X.; Chen, Y.; Dai, W.; Dong, F.; Huang, W.; Zhang, T.; Gao, C.; Chen, Q.; et al. FedEYE: A scalable and flexible end-to-end federated learning platform for ophthalmology. Patterns 2024, 5, 100928. [Google Scholar] [CrossRef] [PubMed]
  6. Tan, B.; Liu, B.; Zheng, V.; Yang, Q. A federated recommender system for online services. In Proceedings of the ACM Conference on Recommender Systems, Online, 22–26 September 2020; pp. 579–581. [Google Scholar]
  7. Yang, L.; Tan, B.; Zheng, V.W.; Chen, K.; Yang, Q. Federated Recommendation Systems. In Federated Learning; Springer: Cham, Switzerland, 2020; Volume 12500, pp. 225–239. [Google Scholar]
  8. Kang, B.; Xie, S.; Rohrbach, M.; Yan, Z.; Gordo, A.; Feng, J.; Kalantidis, Y. Decoupling representation and classifier for long-tailed recognition. In Proceedings of the International Conference on Learning Representations, Online, 26 April–1 May 2020; pp. 1–16. [Google Scholar]
  9. Zhou, B.; Cui, Q.; Wei, X.-S.; Chen, Z.-M. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9719–9728. [Google Scholar]
  10. Cao, K.; Wei, C.; Gaidon, A.; Arechiga, N.; Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 1567–1578. [Google Scholar]
  11. Menon, A.K.; Jayasumana, S.; Rawat, A.S.; Jain, H.; Veit, A.; Kumar, S. Long-tail learning via logit adjustment. In Proceedings of the International Conference on Learning Representations, Online, 3–7 May 2021; pp. 1–24. [Google Scholar]
  12. Yang, S.; Park, H.; Byun, J.; Kim, C. Robust Federated Learning with Noisy Labels. IEEE Intell. Syst. 2022, 37, 35–43. [Google Scholar] [CrossRef]
  13. Jiang, X.; Sun, S.; Li, J.; Xue, J.; Li, R.; Wu, Z.; Xu, G.; Wang, Y.; Liu, M. Tackling noisy clients in federated learning with end-to-end label correction. In Proceedings of the ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 1015–1026. [Google Scholar]
  14. Wang, Y.; Ma, X.; Chen, Z.; Luo, Y.; Yi, J.; Bailey, J. Symmetric cross entropy for robust learning with noisy labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 322–330. [Google Scholar]
  15. Pereyra, G.; Tucker, G.; Chorowski, J.; Kaiser, Ł; Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv 2017, arXiv:1701.06548. [Google Scholar] [CrossRef]
  16. Lu, J. A survey on Bayesian inference for Gaussian mixture model. arXiv 2021, arXiv:2108.11753. [Google Scholar] [CrossRef]
  17. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. In Proceedings of the Advances in Machine Learning and Systems, Austin, TX, USA, 2–4 March 2020; pp. 429–450. [Google Scholar]
  18. Jhunjhunwala, D.; Wang, S.; Joshi, G. FedExP: Speeding Up Federated Averaging via Extrapolation. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; pp. 5693–5700. [Google Scholar]
  19. Han, B.; Yao, Q.; Yu, X.; Niu, G.; Xu, M.; Hu, W.; Tsang, I.; Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 2–8 December 2018; pp. 8536–8546. [Google Scholar]
  20. Yu, X.; Han, B.; Yao, J.; Niu, G.; Tsang, I.; Sugiyama, M. How Does Disagreement Help Generalization against Label Corruption? In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 7164–7173. [Google Scholar]
  21. Tanaka, D.; Ikami, D.; Yamasaki, T.; Aizawa, K. Joint Optimization Framework for Learning with Noisy Labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 5552–5560. [Google Scholar]
  22. Song, H.; Kim, M.; Lee, J.-G. SELFIE: Refurbishing Unclean Samples for Robust Deep Learning. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 5907–5915. [Google Scholar]
  23. Li, J.; Socher, R.; Hoi, S. DivideMix: Learning with Noisy Labels as Semi-supervised Learning. In Proceedings of the International Conference on Learning Representations, Online, 26 April–1 May 2020; pp. 1–14. [Google Scholar]
  24. Zhang, H.; Cissé, M.; Dauphin, Y.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–13. [Google Scholar]
  25. Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; Raffel, C. MixMatch: A Holistic Approach to Semi-Supervised Learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 5050–5060. [Google Scholar]
  26. Zhang, C.; Wu, F.; Yi, J.; Xu, D.; Yu, Y.; Wang, J.; Wang, Y.; Xu, T.; Xie, X.; Chen, E. Non-IID Always Bad? Semi-Supervised Heterogeneous Federated Learning with Local Knowledge Enhancement. In Proceedings of the ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 3257–3267. [Google Scholar]
  27. Li, T.; Hu, S.; Beirami, A.; Smith, V. Ditto: Fair and Robust Federated Learning Through Personalization. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 6357–6368. [Google Scholar]
  28. Yin, D.; Chen, Y.; Ramchandran, K.; Bartlett, P. Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 5636–5645. [Google Scholar]
  29. Blanchard, P.; El Mhamdi, E.M.; Guerraoui, R.; Stainer, J. Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 119–129. [Google Scholar]
  30. Jiang, X.; Sun, S.; Wang, Y.; Liu, M. Towards Federated Learning against Noisy Labels via Local Self-Regularization. In Proceedings of the ACM International Conference on Information and Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 862–873. [Google Scholar]
  31. Xu, J.; Chen, Z.; Quek, T.; Chong, K.F.E. FedCorr: Multi-Stage Federated Learning for Label Noise Correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 10174–10183. [Google Scholar]
  32. Allen-Zhu, Z.; Li, Y. Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning. In Proceedings of the International Conference on Learning Representations, Online, 25–29 April 2022; pp. 1–12. [Google Scholar]
  33. Kim, S.; Shin, W.; Jang, S.; Song, H.; Yun, S.-Y. FedRN: Exploiting k-Reliable Neighbors Towards Robust Federated Learning. In Proceedings of the ACM International Conference on Information and Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 972–981. [Google Scholar]
  34. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006; Volume 4, pp. 78–110. [Google Scholar]
  35. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational Inference: A Review for Statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877. [Google Scholar] [CrossRef]
Figure 1. The proposed FedENLC framework.
Figure 2. Local update process in stage-1 of the proposed FedENLC.
Figure 3. Local update process in stage-2 of the proposed FedENLC.
Figure 4. Experimental results for sensitivity study: (a) Sensitivity of hyperparameter α1. (b) Sensitivity of hyperparameter β1. (c) Sensitivity of hyperparameter α2. (d) Sensitivity of hyperparameter β2.
Table 1. Precision (Pre_e) and recall rates (Rec_e) on the synthetic noisy dataset CIFAR-10 with manually injected noisy labels. Sym./Asym. refer to symmetric/asymmetric label noise (an illustrative sketch of these noise models follows the table). Each cell reports Pre_e / Rec_e; the first four noise settings are under Dirichlet (γ = 1.0) and the last four under Dirichlet (γ = 0.5).

Method | Sym. (0.0–0.4) | Sym. (0.0–0.8) | Asym. (0.0–0.4) | Mixed (0.0–0.4) | Sym. (0.0–0.4) | Sym. (0.0–0.8) | Asym. (0.0–0.4) | Mixed (0.0–0.4)
FedAvg [2] | 75.85 / 73.58 | 57.41 / 54.74 | 77.60 / 76.08 | 77.54 / 75.72 | 72.11 / 62.98 | 49.44 / 45.34 | 74.29 / 64.66 | 73.76 / 67.55
FedProx [17] | 75.83 / 73.60 | 51.59 / 53.81 | 77.41 / 77.40 | 77.39 / 75.41 | 69.50 / 63.91 | 49.01 / 45.56 | 74.49 / 64.63 | 73.52 / 66.94
FedExP [18] | 71.34 / 70.15 | 50.43 / 51.74 | 75.90 / 75.98 | 77.39 / 74.21 | 71.04 / 62.91 | 50.21 / 45.56 | 74.94 / 64.44 | 73.88 / 67.01
TrimmedMean [28] | 71.21 / 64.29 | 47.93 / 44.41 | 74.06 / 66.06 | 72.83 / 64.71 | 69.47 / 59.82 | 49.47 / 42.40 | 69.44 / 57.17 | 71.12 / 58.38
Krum [29] | 70.99 / 65.35 | 50.09 / 47.53 | 75.84 / 70.63 | 76.07 / 68.46 | 68.54 / 58.49 | 49.97 / 42.43 | 70.22 / 58.43 | 72.28 / 59.60
Median [27] | 72.52 / 70.51 | 58.45 / 56.56 | 75.36 / 73.06 | 73.26 / 71.53 | 65.83 / 56.34 | 48.23 / 43.28 | 72.06 / 64.85 | 72.05 / 63.73
Co-teaching [19] | 73.59 / 71.70 | 64.44 / 61.45 | 75.60 / 73.64 | 76.59 / 74.82 | 70.58 / 63.82 | 49.84 / 43.82 | 72.60 / 64.64 | 74.11 / 65.91
Co-teaching+ [20] | 69.70 / 65.47 | 47.11 / 49.07 | 59.07 / 60.67 | 74.30 / 65.77 | 68.90 / 64.91 | 49.03 / 44.62 | 74.79 / 61.13 | 72.52 / 55.72
Joint Optim [21] | 64.74 / 64.69 | 59.78 / 59.58 | 75.04 / 64.77 | 64.43 / 64.77 | 57.76 / 57.37 | 52.18 / 52.18 | 75.02 / 64.54 | 58.74 / 52.87
SELFIE [22] | 73.74 / 73.58 | 62.86 / 60.51 | 76.79 / 76.14 | 76.70 / 76.14 | 72.06 / 61.48 | 62.39 / 60.14 | 70.93 / 64.03 | 71.74 / 58.39
Symmetric CE [14] | 76.32 / 73.38 | 70.96 / 66.25 | 76.94 / 76.34 | 76.73 / 72.83 | 72.06 / 61.46 | 62.39 / 60.14 | 70.93 / 64.03 | 71.74 / 58.39
DivideMix [23] | 73.68 / 61.94 | 61.51 / 59.51 | 76.70 / 76.26 | 76.73 / 76.23 | 69.57 / 58.65 | 62.39 / 61.40 | 70.93 / 63.04 | 71.74 / 58.39
Robust FL [12] | 63.17 / 61.61 | 51.34 / 51.34 | 65.22 / 65.24 | 66.25 / 66.23 | 62.65 / 54.45 | 55.47 / 45.06 | 54.08 / 42.86 | 59.43 / 43.41
FedLSR [30] | 71.65 / 66.92 | 70.64 / 66.23 | 75.13 / 74.61 | 73.83 / 68.52 | 68.42 / 57.54 | 51.47 / 55.97 | 70.64 / 61.54 | 70.64 / 57.04
FedRN [33] | 72.66 / 67.74 | 62.60 / 60.71 | 72.93 / 70.64 | 70.93 / 68.52 | 65.84 / 54.42 | 50.44 / 51.30 | 70.44 / 60.29 | 70.64 / 57.32
FedNoRo [4] | 73.76 / 73.52 | 66.50 / 66.18 | 77.55 / 77.36 | 77.55 / 75.52 | 71.09 / 57.11 | 59.11 / 57.41 | 73.77 / 72.77 | 73.55 / 72.32
FedELC [13] | 76.81 / 76.72 | 68.22 / 66.61 | 77.97 / 76.98 | 77.98 / 77.24 | 73.13 / 71.31 | 60.31 / 59.67 | 75.78 / 74.65 | 74.55 / 73.80
FedENLC | 82.80 / 81.62 | 75.79 / 74.48 | 82.49 / 81.06 | 82.79 / 81.24 | 79.84 / 75.28 | 69.93 / 64.56 | 79.32 / 73.34 | 80.03 / 75.91
The highest score for each metric is indicated in bold.
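To make the noise settings in Tables 1 and 2 concrete, the sketch below illustrates how symmetric and asymmetric label noise are commonly injected into a clean label set. It is only an illustration under common conventions, not the authors' data pipeline: the function name inject_label_noise is hypothetical, and the next-class mapping used for the asymmetric case is one typical stand-in for class-dependent noise.

```python
import numpy as np

def inject_label_noise(labels, num_classes, noise_ratio, mode="sym", rng=None):
    """Flip a fraction `noise_ratio` of labels.

    mode="sym":  a corrupted label is replaced by a uniformly random *other* class.
    mode="asym": a corrupted label is mapped to a fixed "similar" class
                 (here simply c -> (c + 1) % num_classes for illustration).
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels).copy()
    n_noisy = int(round(noise_ratio * len(labels)))
    idx = rng.choice(len(labels), size=n_noisy, replace=False)
    if mode == "sym":
        # Offset in [1, num_classes - 1] guarantees the new label differs from the old one.
        offsets = rng.integers(1, num_classes, size=n_noisy)
        labels[idx] = (labels[idx] + offsets) % num_classes
    else:  # "asym"
        labels[idx] = (labels[idx] + 1) % num_classes
    return labels

# Example: corrupt 40% of CIFAR-10-style labels symmetrically.
clean = np.random.randint(0, 10, size=1000)
noisy = inject_label_noise(clean, num_classes=10, noise_ratio=0.4, mode="sym")
print((clean != noisy).mean())  # 0.4: every corrupted label differs from the original
```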
Table 2. Precision (Pre_e) and recall rates (Rec_e) on the synthetic noisy dataset CIFAR-100 with manually injected noisy labels. Sym./Asym. refer to symmetric/asymmetric label noise. Each cell reports Pre_e / Rec_e; the first four noise settings are under Dirichlet (γ = 1.0) and the last four under Dirichlet (γ = 0.5).

Method | Sym. (0.0–0.4) | Sym. (0.0–0.8) | Asym. (0.0–0.4) | Mixed (0.0–0.4) | Sym. (0.0–0.4) | Sym. (0.0–0.8) | Asym. (0.0–0.4) | Mixed (0.0–0.4)
FedAvg [2] | 43.51 / 41.01 | 31.38 / 28.38 | 43.57 / 41.52 | 42.06 / 40.37 | 42.11 / 40.49 | 30.27 / 27.27 | 44.42 / 43.36 | 41.84 / 40.74
FedProx [17] | 40.82 / 38.27 | 29.16 / 27.33 | 43.33 / 41.76 | 42.98 / 39.97 | 40.48 / 37.44 | 27.49 / 25.17 | 44.30 / 42.56 | 42.66 / 40.56
FedExP [18] | 43.88 / 40.36 | 28.23 / 28.23 | 44.46 / 42.62 | 41.48 / 39.57 | 44.30 / 40.64 | 29.26 / 26.93 | 44.91 / 43.57 | 42.65 / 40.79
TrimmedMean [28] | 34.48 / 34.48 | 28.33 / 25.44 | 41.33 / 39.38 | 41.13 / 38.47 | 35.14 / 34.40 | 27.60 / 25.14 | 40.58 / 39.11 | 39.14 / 37.91
Krum [29] | 22.99 / 17.34 | 17.36 / 14.38 | 29.43 / 26.38 | 32.44 / 18.61 | 14.40 / 11.97 | 7.95 / 7.55 | 13.12 / 11.35 | 13.91 / 11.94
Median [27] | 43.90 / 34.47 | 31.78 / 27.36 | 43.44 / 42.12 | 43.44 / 41.00 | 43.17 / 34.10 | 31.21 / 28.84 | 42.02 / 41.30 | 42.11 / 40.27
Co-teaching [19] | 44.41 / 42.53 | 32.74 / 29.36 | 44.78 / 42.83 | 43.21 / 41.62 | 47.83 / 43.01 | 30.44 / 29.30 | 45.52 / 43.10 | 45.37 / 42.73
Co-teaching+ [20] | 36.67 / 34.42 | 27.17 / 27.17 | 43.71 / 41.33 | 38.27 / 38.67 | 36.99 / 34.93 | 28.20 / 26.80 | 42.06 / 40.94 | 36.03 / 35.70
Joint Optim [21] | 26.87 / 22.67 | 23.59 / 23.59 | 27.88 / 27.44 | 31.83 / 27.44 | 29.37 / 27.01 | 21.95 / 22.80 | 26.00 / 27.37 | 31.03 / 27.50
SELFIE [22] | 44.62 / 42.92 | 32.66 / 30.54 | 44.90 / 42.74 | 43.42 / 41.95 | 45.00 / 42.53 | 33.20 / 32.13 | 40.60 / 42.54 | 42.04 / 41.97
Symmetric CE [14] | 43.74 / 41.36 | 33.67 / 31.97 | 44.04 / 42.47 | 43.72 / 42.18 | 42.16 / 40.93 | 31.90 / 31.17 | 42.44 / 42.20 | 42.20 / 42.18
DivideMix [23] | 37.42 / 37.68 | 30.74 / 36.78 | 37.18 / 38.39 | 38.59 / 39.01 | 37.23 / 37.39 | 31.79 / 31.87 | 36.98 / 37.46 | 36.86 / 37.01
Robust FL [12] | 17.41 / 16.09 | 15.08 / 5.02 | 19.67 / 17.59 | 17.59 / 9.45 | 17.57 / 9.45 | 7.50 / 5.41 | 10.50 / 7.44 | 13.54 / 8.19
FedLSR [30] | 36.10 / 28.13 | 25.30 / 18.20 | 43.74 / 41.42 | 43.92 / 40.23 | 35.07 / 27.97 | 24.23 / 20.73 | 42.11 / 40.29 | 35.45 / 34.14
FedRN [33] | 20.28 / 19.73 | 19.53 / 17.32 | 43.32 / 42.51 | 35.74 / 21.55 | 19.80 / 18.94 | 19.07 / 17.88 | 42.71 / 40.59 | 34.59 / 34.27
FedNoRo [4] | 44.70 / 43.41 | 32.98 / 31.41 | 43.51 / 43.56 | 43.52 / 41.51 | 45.09 / 43.79 | 32.39 / 30.33 | 43.97 / 43.24 | 43.54 / 43.27
FedELC [13] | 44.83 / 43.65 | 33.07 / 32.95 | 44.01 / 43.83 | 43.87 / 42.73 | 45.22 / 43.12 | 32.98 / 32.67 | 44.97 / 43.64 | 45.39 / 42.81
FedENLC | 51.56 / 49.89 | 39.29 / 37.76 | 51.16 / 49.13 | 50.29 / 48.58 | 49.84 / 47.68 | 40.83 / 39.14 | 50.01 / 47.40 | 51.95 / 49.02
The highest score for each metric is indicated in bold.
Table 3. Training and test time comparison between FedELC and FedENLC.

Method | CIFAR-10 Training Time | CIFAR-10 Test Time | CIFAR-100 Training Time | CIFAR-100 Test Time
FedELC [13] | 37.19 s | 0.974 s | 58.789 s | 1.498 s
FedENLC | 33.49 s | 0.947 s | 53.581 s | 1.482 s
Table 4. Experimental results for ablation analysis of FedENLC. Each cell reports Pre_e / Rec_e under the Mixed (0.0–0.4) noise setting.

Method | Dirichlet (γ = 1.0) | Dirichlet (γ = 0.5)
FedENLC | 82.79 / 81.24 | 80.03 / 75.91
Ablation1 | 82.38 / 81.23 | 79.45 / 74.87
Ablation2 | 75.99 / 75.37 | 73.70 / 72.23
Ablation3 | 81.80 / 80.06 | 77.63 / 72.62
The highest score for each metric is indicated in bold.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
