1. Introduction
Today, almost every digital solution is rapidly adopting artificial intelligence (AI) for products and services. These AI solutions are data-intensive, which results in exponential growth in data generation and sharing. The use of data for AI and machine learning (ML) raises privacy concerns, and recently, many privacy-preserving solutions have been developed [1,2,3]. Federated learning (FL) is a potential solution that has attracted researchers for privacy-preserving ML and deep learning (DL) [4,5]. Accordingly, the Internet of Things (IoT) is becoming a key adopter of federated learning due to privacy preservation and restrictions on data movement from the source devices [6,7,8,9]. For example, Rahman et al. [8] discussed the application of FL for IoT intrusion detection using the NSL-KDD dataset [10]. Similarly, there are other use cases for FL applications in IoT. A plethora of academic and industrial research has been carried out recently to produce software solutions and frameworks that assist the real-world application of FL [11,12,13]. FL limits data sharing, i.e., raw data do not leave the client system (they remain restricted to the generation point/source) to be used for model training. FL is derived from parallel and distributed computing and enables a model's training in a federated manner by utilizing data from multiple clients. Currently, most FL algorithms are centered around supervised learning, where labeled data are assumed to be present on the client side. This requirement is achievable in a cross-silo setup (organizations act as participants and may have a data labeling team). However, it is difficult to achieve in a cross-device setup for several reasons, such as the cost, skills, and complexity associated with manual labeling by clients, as well as the large volume of data and the need for active participation.
In a practical cross-device FL setting, data at the client device are primarily unlabeled. For example, suppose pictures are stored in the client system for an object detection model. These pictures may be generated using the device camera, or the user may download them. Clearly, these pictures cannot always be assumed to be labeled and will not be ready for training. The data labeling issue for models that use medical imaging for training and testing was presented in [14].
The proposed work is inspired by weak supervision and semi-supervised learning and applies a clustering-based approach to label data at the client device. We assume that the parameter server has a set of labeled data for the specific task (a valid assumption, because the server needs labeled data for validating and tuning the global model). The client can seed the labeling process using a small fraction of these labeled data (shared once by the server with each client after joining the federation). In the proposed Expand and Shrink labeling approach, this tiny fraction of data (the truth set) is mixed with the client data during the expand phase, and then clustering is applied. During the shrink phase, the set of clusters is shrunk to the classes by comparing and tracing the clusters using the truth set. Labeling is performed once at the client in the current experiment, assuming that the client participates with the available data; however, the same approach can be repeated for new data (by mixing it with previously labeled data or applying it separately to the new data).
Any FL averaging and communication strategy can be applied after labeling at the client side; therefore, the proposed work fits the current FL architecture and federated averaging. The proposed work is suitable, and indeed needed, for the many devices available under the IoT, because these devices generate large amounts of data that often lack class labels. The proposed method uses a clustering-based approach that makes it easy to assign new query data, i.e., a sample, to a particular cluster at minimal cost. The proposed work is also compatible with existing model compression and quantization techniques, so a compressed model can be used instead of a large model for training on IoT devices.
We performed different experiments on the client by varying the labeled data ratio, the number of clusters, and the client participation ratio, and we obtained high accuracy rates while using only small fractions of the truth labels. In general, with the proposed work, we made the following contributions:
We proposed a data labeling method at the client device for supervised federated learning. The proposed labeling adopts pre-initialized centroid clustering methods to infer the class label of the unlabeled sample at the client device.
We proposed an aggregation-independent labeling method that complements the existing supervised federated learning architecture, so no further changes are required in the existing communication and aggregation methods.
We proposed a labeling approach that is low-cost in terms of time and extensible, i.e., a new sample can be labeled at reduced cost by leveraging previously labeled samples.
We performed extensive experiments to validate the proposed labeling method. In the federated learning setup, the proposed method provides performance equivalent to a human-labeled dataset in terms of accuracy. It achieves a similar level of global accuracy to existing works while requiring far fewer truth labels.
The remainder of the paper is organized as follows. Section 2 presents existing work that deals with unlabeled data. Section 3 presents the proposed Expand and Shrink algorithm, which employs a clustering-based approach to enable federated learning with unlabeled data. Section 4 compares the performance of the proposed algorithm with varying degrees of truth label availability; our results show that the Expand and Shrink algorithm provides minimal labeling cost in terms of time and is extensible, thus allowing the labeling of a new sample at a lower cost. We conclude the paper in Section 5.
2. Related Work
In the supervised learning approach, unlabeled data are samples without any class label. Although the same data can carry different class tags depending on the classification problem, an unlabeled data sample needs a label for the underlying classification task, for example, a class label between 0 and 9 for digit classification, as in the MNIST dataset [15]. In general, the training set for supervised learning is denoted by $(x_i, y_i)$, where $x_i$ is a feature vector and $y_i$ is the class label of the $i$-th sample. The proposed work aims to assign a label $\tilde{y}_i$ to each unlabeled sample $x_i$, where $\tilde{y}_i$ is a noisy label that can be used for supervised training in the absence of the actual label during FL. There are many approaches to labeling datasets for supervised learning. In the semi-supervised method, some labeled data are used to annotate unlabeled data, and supervised training is then carried out on the complete dataset (pre- and post-labeled data samples). Semi-supervised learning relies on the smoothness assumption: if two samples $x_1$ and $x_2$ are close in the input space, then their labels $y_1$ and $y_2$ should be the same, i.e., $x_1 \approx x_2 \Rightarrow y_1 = y_2$. Building a supervised classifier using labeled and unlabeled data is well studied, and much literature is available on centralized machine learning.
Virginia R. de Sa [16] used structure between the pattern distributions of different sensory modalities to propose building a neural network (NN) model from unlabeled data. Caron et al. [17] used k-means to cluster the features and then used the cluster labels to update the NN weights during training. Recently, Jin et al. [18] adopted semi-supervised learning for federated learning to address the labeling task at the client; in semi-supervised learning, unlabeled data may degrade the model's performance. Using federated learning, Albaseer et al. [19] applied semi-supervised learning for labeling and building a traffic sign detection model. Jeong et al. [20] used semi-supervised learning in two distinct scenarios: (a) labels-at-client (both labeled and unlabeled data are available at the client) and (b) labels-at-server (labeled data are only with the server). Long et al. [21] also considered the labels-at-server scenario and proposed FedCon, a contrastive learning-based federated learning framework. Rafa et al. [22] applied federated semi-supervised learning (FSSL) to Android malware detection; similarly, Pei et al. [23] applied transfer and semi-supervised learning with FL to IoT malware. Itahara et al. [24] used distillation-based semi-supervised FL to improve communication for non-independent and identically distributed (non-IID) data. Lu et al. [25] proposed FedUL, which assumes the availability of the clients' class-conditional distributions and uses them to recover the required model from the global model that each client trains with the help of surrogate labels for unlabeled data. Wang et al. [26] explored various setups to improve semi-supervised federated learning (SSFL) performance and suggested that reducing gradient diversity can result in a fast and improved model. Zhu et al. [27] proposed generating pseudo labels for unlabeled data using the unlabeled data and global models; in each round, a temporary global model was trained and then tuned using the initial global model to obtain the final global model.
A recent development in semi-supervised learning is self-supervised learning (SSL), which removes the requirement of a human-annotated initial dataset to initiate semi-supervised learning. He et al. [28] used self-supervised learning for label deficiency in federated learning and also provided personalization for the client in FL. Yan et al. [29] used self-supervised federated learning to address data heterogeneity and label deficiency in the medical domain (datasets of retinal images, dermatology images, and chest X-rays). Wang et al. [30] also used contrastive visual representation learning and SSL for various tasks and studied the impact of non-IID and unlabeled data in FL. With the model-assisted labeling process, a small portion of the data is labeled to build an initial model that is then used only for labeling, i.e., predicting labels for the remaining unlabeled data. An active learning-based FL approach was discussed in [31], involving an oracle initially labeling a few unlabeled data samples at the client device.
Many modern approaches to centralized supervised learning from unlabeled data, such as transfer learning and few-shot learning, are being adopted for federated learning. Li and Wang [32] applied transfer learning and model distillation to federated learning. Guha et al. [33] proposed one-shot federated learning for supervised and semi-supervised setups, learning global models in one round of communication.
The existing literature shows that enabling FL to learn from unlabeled data is necessary. One major limitation of current approaches is the high computation required at the client, for example, retraining a model under the transfer learning approach or inferring labels through model inference, both of which are computationally intensive. Many of these approaches also depend heavily on approximation, which can propagate errors to the global model. The proposed work uses simple clustering-based labeling, which requires lower computation. The expand phase exploits a drawback of centralized clustering: to obtain good clustering performance, the model may produce a higher number of clusters (by splitting similar items into different clusters to improve cluster density or other metrics). We aim to group similar items independently of the total number of clusters, because during the shrink phase, the clusters are mapped to the required classes.
3. Expand and Shrink: Federated Learning with Unlabeled Data Using Clustering
3.1. Problem Definition
Data labeling is necessary for supervised learning and takes significant time, effort, and resources (computational and financial). Data labeling methods can be divided into two main groups: manual and automatic. Human annotators perform fully manual labeling, and programs perform fully automatic labeling. However, manual and automatic labeling can assist each other; for example, if human annotators assist in automatic labeling, it is called human-in-the-loop (HITL). Besides labeling, humans also play the role of verifier or reviewer for data labeled by other annotators. Most of the time, a human annotator also supervises automatic labeling and acts as a verifier. Human labeling is costly and time-consuming; however, it provides better-quality labeled data. Many applications consider only human labeling; for example, only labels from medically trained professionals are acceptable for medical models, such as cancer cell or tumor classification.
The FL system can label data at the client device with or without client participation in labeling and label verification. Client participation in labeling and verification can be implicit, as in behavior-based auto-labeling, i.e., using a client's click on an advertisement as a class label or the client's acceptance of a text suggestion as a class label. Such an implicit approach can also be termed labeling automation without assistance. Explicit participation requires the client to actively engage in the labeling process, which is often unwanted and infeasible. Because of this client participation, it is called labeling automation with assistance (the item and label are auto-generated, and the user has to verify).
The proposed work, Expand and Shrink, does not require user participation for labeling or verification and is fully automated. It adopts the popular cluster-then-label approach. In the expand phase, we apply a clustering algorithm to all unlabeled data. During the shrink phase, we use the truth dataset to map the resulting clusters to specific class labels (many clusters may be mapped to a single class label). With the expand step, we move away from the critical assumption of cluster-to-class mapping approaches that one cluster corresponds to exactly one class, because in the proposed work, the expand phase results in more clusters than classes.
Figure 1 presents the proposed Expand and Shrink approach. The expand phase is based on the observation that a higher number of clusters decreases the inertia, and lower inertia is better. So, with a threshold I, we keep increasing the number of clusters and stop when the inertia value falls below I.
Figure 1A shows the ideal case of clustering, where the data points have a proper and uniform shape, resulting in two suitable clusters. However, such a perfect case is rare; in the real world, data points often come with variance and may result in different clusters, as shown in Figure 1B. The expand phase considers this real-world case and tries to obtain the maximum number of clusters by grouping data points with variance into different clusters.
Figure 1C shows the shrink step, where the clusters are mapped to class labels (the number of classes is decided by the selected supervised learning task, for example, 0–9 for MNIST [15]) by using the truth label set, based on the distance and the seed samples (labeled samples mixed with the unlabeled data before clustering).
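The inertia-threshold loop of the expand phase can be sketched as follows. This is a minimal sketch: the tiny Lloyd's K-means implementation and the `k_start`/`k_max` bounds are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def kmeans(data, k, n_iter=50, seed=0):
    """Minimal Lloyd's K-means; returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # Inertia: sum of squared distances to the assigned centroids
    inertia = float(((data - centroids[labels]) ** 2).sum())
    return centroids, labels, inertia

def expand(data, inertia_threshold, k_start=2, k_max=20):
    """Expand phase: grow K until the inertia drops below the threshold I."""
    for k in range(k_start, k_max + 1):
        centroids, labels, inertia = kmeans(data, k)
        if inertia < inertia_threshold:
            break
    return centroids, labels, inertia
```

In practice, the threshold I trades labeling granularity for computation: a lower I yields more, tighter clusters for the shrink phase to merge.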
3.2. System Model
Providing class labels to the data at the client device must be automated and performed on the device. Semi-supervised approaches are adopted in federated learning for labeling the client’s data. We propose a clustering-based approach to label the data sample at the client machine in a federated learning architecture.
Figure 2 shows the integration of the federated learning (FL) approach with the data labeling process on the client device.
Table 1 lists the symbols and notations used in this paper.
The sets of N total and S selected participating nodes for each round are defined, respectively, as $P = \{p_1, p_2, \dots, p_N\}$ and $P_S \subseteq P$, where $|P| = N$, $|P_S| = S$, and $S \leq N$.
$U_i$ is a set of local unlabeled data of client $i$, where client $i$ initially has $n_i$ unlabeled training samples, denoted as $U_i = \{u_1, u_2, \dots, u_{n_i}\}$. After the client applies the Expand and Shrink method, each data sample of client $i$ is mapped to a class label in the set of class labels Y as follows:

$$f: U_i \rightarrow Y$$

where each assigned label $\tilde{y}_j$ equals some $y_g$, which is an element of the set of class labels denoted as $Y = \{y_1, y_2, \dots, y_G\}$. The local labeling thus maps $U_i$ to a set of labeled elements $D_i = \{(x_j, \tilde{y}_j) \mid j = 1, \dots, n_i\}$.
D represents the labeled dataset from all selected clients, denoted as $D = \bigcup_{i=1}^{S} D_i$. Each element in the dataset $D_i$ is a pair $(x_j, \tilde{y}_j)$, where $x_j$ represents a data point and $\tilde{y}_j$ is the corresponding class label. We introduce a policy set, which is the policy vector of the federation that is sent to every node by the parameter server. It contains the following values:
Model (M): Supervised learning model selected by the parameter server to train using federated learning, for example, a CNN, an LSTM, etc.
Validation dataset (V): A small proportion of the labeled dataset is provided to each client for the shrink phase of labeling.
Clustering method (C): The client can choose from a set of clustering methods. We currently use only K-means clustering.
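The policy vector could be represented as a small structure like the following sketch; the field names and example values are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class PolicyVector:
    """Policy vector the parameter server shares with each joining client."""
    model: Any                      # M: supervised model chosen by the server
    validation_set: List[Tuple]     # V: small labeled set for the shrink phase
    clustering: str = "kmeans"      # C: clustering method the client may use

# Example: a server-side policy with one labeled (features, class) pair
policy = PolicyVector(model="CNN", validation_set=[((0.1, 0.2), 3)])
```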
3.3. Data Labeling with Expand and Shrink
For federated learning, each node/device has to join the federation. The joining process starts with the initial setup, i.e., the parameter server shares the policy vector with the client, and each client performs data labeling by executing the Expand and Shrink method. The client is then selected for a training round based on its "labeling status" along with other existing selection criteria, such as power and computation availability.
We use modified K-means clustering to perform data labeling for each client. The data labeling process is independent of the federated training process; the client joins the federation and obtains the policy vector for labeling its unlabeled local data $U_i$. The federated learning process is otherwise similar to the existing approach.
- Step 1—Join:
A device joins the federation and obtains the policy vector from the parameter server.
- Step 2—Expand:
Each client applies clustering to its unlabeled data and uses the inertia versus the number of clusters to find the value of K, where K is the total number of clusters that gives the best inertia value. The inertia represents the sum of squared distances between each data point and its assigned centroid. Any clustering algorithm can be used; the proposed work was evaluated using K-means. Thus, the objective function of the K-means clustering is defined as a minimization problem, presented in Equation (1):

$$\min \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2 \tag{1}$$

where $x$ represents a sample of client $i$, $C_k$ is the set of samples belonging to cluster $k$, and $\mu_k$ is the centroid of cluster $k$.
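As a quick check of the K-means objective, the inner double sum can be computed directly with NumPy; the sample values below are illustrative.

```python
import numpy as np

def kmeans_objective(samples, labels, centroids):
    """Sum over clusters of squared distances from each sample to its centroid."""
    return float(((samples - centroids[labels]) ** 2).sum())

# Three 2-D samples: two in cluster 0, one in cluster 1
x = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
mu = np.array([[0.0, 1.0], [10.0, 0.0]])
print(kmeans_objective(x, labels, mu))  # 1.0 + 1.0 + 0.0 = 2.0
```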
- Step 3—Shrink:
The client with K clusters starts to shrink them by using the distance between the clusters and the samples in the validation dataset V shared by the parameter server. The distance calculation creates the score value matrix S using Equation (3), where K is the total number of clusters from the expand step and G is the total number of classes for the supervised learning task. In S, each row holds the distance scores of a cluster against all the classes, i.e., $s_{k1}, s_{k2}, \dots, s_{kG}$. The distance matrix S is constructed as

$$S = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1G} \\ s_{21} & s_{22} & \cdots & s_{2G} \\ \vdots & \vdots & \ddots & \vdots \\ s_{K1} & s_{K2} & \cdots & s_{KG} \end{bmatrix} \tag{2}$$

Each distance score can be calculated using the Euclidean distance, computed as

$$s(c, v) = \sqrt{\textstyle\sum_{i} (c_i - v_i)^2} \tag{3}$$

where c and v are the vectors of the centroid and of the validation sample for the respective cluster and class label, and $c_i$ and $v_i$ represent the corresponding elements of c and v at the same index i.

For merging clusters, we assign one or zero based on distance. The cell with the minimum value in each row is marked as 1, indicating that the cluster is close enough to a particular class. In summary, each row of S is converted to a row of the binary matrix B as

$$b_{kg} = \begin{cases} 1 & \text{if } s_{kg} = \min_{g'} s_{kg'} \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where $s_{kg}$ represents the element of the distance matrix at row (cluster) index k and column (class) index g.

Now, for merging cluster(s), each column vector of B is scanned for 1, and the respective clusters are merged into one larger cluster, whose members are all labeled with the respective column class. Merging clusters and labeling them with the respective class labels creates the labeled dataset $D_i$ of each client i.
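The distance-matrix and row-minimum steps of the shrink phase can be sketched as follows. Using one mean representative per class from the validation set is a simplifying assumption for this sketch (the paper compares clusters against the validation samples and mixed-in seed samples); names are illustrative.

```python
import numpy as np

def shrink(centroids, val_x, val_y, num_classes):
    """Map each of the K clusters to one of G classes via the K x G
    distance matrix S and a per-row minimum (the binary matrix B)."""
    # One representative per class: mean of that class's validation samples
    reps = np.stack([val_x[val_y == g].mean(axis=0) for g in range(num_classes)])
    # S[k, g] = Euclidean distance between centroid k and class-g representative
    S = np.linalg.norm(centroids[:, None, :] - reps[None, :, :], axis=2)
    # Row-wise minimum corresponds to the 1-entry of B: cluster k -> class g
    return S.argmin(axis=1)

centroids = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]])
val_x = np.array([[0.0, 0.1], [5.1, 5.0]])
val_y = np.array([0, 1])
print(shrink(centroids, val_x, val_y, 2))  # [0 0 1]: clusters 0 and 1 merge into class 0
```

Note how two nearby clusters collapse onto the same class, which is exactly the many-clusters-to-one-class merge the shrink phase allows.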
- Step 4—Ready State:
The client can set its status to ready after completing the data labeling so the server can use this information while selecting the client for training.
Figure 3 shows the result of labeling in terms of the accuracy and homogeneity score (Figure 3b), i.e., the outcome of the proposed algorithm on the unlabeled dataset of an individual client without training (Figure 3a), and the global test accuracy after labeling with a varying number of clusters and training rounds (Figure 3c). The data labeling accuracy shown in Figure 3a is 85–90%, equivalent to human-level accuracy, considering the labeling errors in various datasets reported by Northcutt et al. [34]. The labeling performance in terms of accuracy and homogeneity score shows less variation as the number of clients increases, and both metrics also improve with a higher number of clusters in the expand phase. Larger client participation means each client has fewer samples, which highlights another benefit: the proposed method works with smaller datasets, which is often the case in FL. A similar trend is observed in global test accuracy with different numbers of clusters and training rounds. The global accuracy becomes stable with a higher number of clusters, while the number of training rounds has a smaller impact, so training can be terminated with an early stop. We evaluated the labeling result using the truth labels available for the experimental dataset; however, measuring the accuracy of labeling will not be possible in a real-world scenario because there will be no true labels. Once the client completes the data labeling process using the proposed method (Algorithm 1), any FL approach can be applied to the labeled data without modifying the existing approach. However, the labeling process must be integrated with the overall training steps, as shown in Figure 2. Each step of FL with unlabeled data is explained in detail below, and Algorithm 2 presents the pseudocode of the overall training.
Algorithm 1 Data labeling at each client using Expand and Shrink.
Require: U: set of unlabeled data, V: validation dataset
Ensure: D: labeled dataset
{Expand: create K clusters using a clustering algorithm}
1: K clusters ← Expand(U) using Equation (1)
2: S ← distance matrix over clusters and classes using Equations (2) and (3)
{Find the minimum score in each row, mark it with 1 and the others with 0}
3: B ← row-wise binarization of S
{Shrink: merge clusters}
4: Merge the clusters marked in each column of B
5: Label the members of each merged cluster with the column class
6: return D
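End to end, Algorithm 1 might look like the following sketch, assuming K-means in the expand phase and a majority-vote trace through the mixed-in seed (truth) samples in the shrink phase. The fallback to the nearest seed for clusters that captured no seed sample is our assumption for this sketch, not the paper's specification.

```python
import numpy as np

def expand_and_shrink(unlabeled, seed_x, seed_y, k):
    """Label `unlabeled` by clustering it together with labeled seeds."""
    data = np.vstack([unlabeled, seed_x])  # mix truth set into the pool
    # --- Expand: minimal Lloyd's K-means over the mixed pool ---
    rng = np.random.default_rng(0)
    cent = data[rng.choice(len(data), k, replace=False)].astype(float)
    for _ in range(30):
        lab = np.linalg.norm(data[:, None] - cent[None], axis=2).argmin(1)
        for j in range(k):
            if (lab == j).any():
                cent[j] = data[lab == j].mean(axis=0)
    # --- Shrink: trace each cluster back to a class via its seed members ---
    seed_lab = lab[len(unlabeled):]  # cluster assignments of the seed samples
    cluster_class = {}
    for j in range(k):
        members = seed_y[seed_lab == j]
        if len(members):
            cluster_class[j] = int(np.bincount(members).argmax())
        else:  # no seed landed here: fall back to the nearest seed's class
            nearest = int(np.linalg.norm(seed_x - cent[j], axis=1).argmin())
            cluster_class[j] = int(seed_y[nearest])
    return np.array([cluster_class[j] for j in lab[:len(unlabeled)]])
```

Because labeling happens once and locally, the client only ships gradients afterwards; the server never sees the raw samples or the inferred labels.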
- Step 0—Data Labeling:
A new client joins the federation, obtains the policy vector from the parameter server, and labels its unlabeled data independently of the training rounds. After labeling, the client changes its status to ready.
- Step 1—Initialization:
The parameter server selects S clients from the federation for federated training, initializes the global model M, and shares it along with the validation dataset V with each selected node. V is a set of labeled pairs (x, y).
- Step 2—Local Training:
Each client applies Algorithm 1 to its local unlabeled samples U (explained in the previous section). Each node then trains the model M on its self-labeled dataset and calculates the gradient update using Stochastic Gradient Descent (SGD).
- Step 3—Client Update Sharing:
Each node shares the calculated gradient update with the parameter server. The gradient is calculated by applying local training on the self-labeled dataset $D_i$ using the model M.
- Step 4—Global Aggregation:
For each global training round, the server collects and aggregates the updates from each participating client and updates the previous global model.
- Step 5—Updated Global Model Sharing:
The final updated global model is shared with previously participating clients; if the updated model is shared with new participants, this step is similar to Step 1. This step is therefore optional and depends upon the training policy.
Steps 1–5 constitute one training round of federated learning, and a single model is trained over multiple rounds. The termination criteria for training can combine different requirements, such as the desired accuracy, the maximum allowed training time, the available data, etc. The following section presents the experimental setup and results of the experiments.
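Steps 3 and 4 above can be sketched as a standard FedAvg-style aggregation. In this minimal sketch, model weights are plain lists of NumPy arrays, and updates are weighted by each client's self-labeled dataset size; both choices are illustrative assumptions, since the paper states that any existing averaging strategy applies.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of per-client weight lists (one array per layer)."""
    total = float(sum(client_sizes))
    aggregated = []
    for layers in zip(*client_weights):  # iterate layer-by-layer across clients
        aggregated.append(sum(w * (n / total) for w, n in zip(layers, client_sizes)))
    return aggregated

# Two clients, one "layer" each; client 2 holds 3x as much labeled data
w1 = [np.array([1.0, 3.0])]
w2 = [np.array([3.0, 5.0])]
g = fedavg([w1, w2], [1, 3])
print(g[0])  # 0.25*[1, 3] + 0.75*[3, 5] = [2.5, 4.5]
```

Because the Expand and Shrink labeling finishes before any round starts, this aggregation is unchanged from ordinary supervised FL.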
Algorithm 2 Federated learning using Expand and Shrink on unlabeled data.
Require: U: unlabeled data, N: number of clients, E: number of communication rounds
1: Initialize the global model M, the validation dataset V, and the clustering method C
2: for each round r from 1 to E do
3: Randomly select a subset of clients // Local model update
4: for each client i in the selected subset do
5: Receive the global model from the server
6: Run Expand and Shrink on the client's unlabeled data using Algorithm 1
7: Split the resulting labeled data into batches
8: for each local epoch do
9: Compute the local update
10: end for
11: Send the local update to the server
12: end for // Server aggregation and global model update
13: Aggregate the local updates by weighted averaging
14: Update the global model
15: end for
16: return the updated global model