Wireless Network Optimization for Federated Learning with Model Compression in Hybrid VLC/RF Systems

In this paper, the optimization of network performance to support the deployment of federated learning (FL) is investigated. In particular, in the considered model, each user trains a machine learning (ML) model on its own dataset and then transmits the ML parameters to a base station (BS), which aggregates them into a global ML model and transmits it back to each user. Due to limited radio frequency (RF) resources, the number of users that can participate in FL is restricted. Meanwhile, the communication cost of uploading and downloading the FL parameters further reduces the number of participating users. To this end, we propose to introduce visible light communication (VLC) as a supplement to RF and to use compression methods to reduce the resources needed to transmit FL parameters over wireless links, thereby improving communication efficiency while simultaneously optimizing the wireless network through user selection and resource allocation. This user selection and bandwidth allocation problem is formulated as an optimization problem whose goal is to minimize the FL training loss. We first use a model compression method to reduce the size of the FL model parameters transmitted over wireless links. Then, the optimization problem is separated into two subproblems. The first subproblem is a user selection problem under a given bandwidth allocation, which is solved by a traversal algorithm. The second subproblem is a bandwidth allocation problem under a given user selection, which is solved by a numerical method. The ultimate user selection and bandwidth allocation are obtained by iteratively compressing the model and solving these two subproblems. Simulation results show that, compared to a conventional FL algorithm using only RF, the proposed FL algorithm improves the accuracy of object recognition by up to 16.7% and the number of selected users by up to 68.7%.


Introduction
Federated learning (FL), which allows edge devices to cooperatively train a shared machine learning model without transmitting private data, is an emerging distributed machine learning technique [1,2]. The FL training process needs to iteratively transmit machine learning parameters over wireless links. However, due to dynamic wireless channels and imperfect wireless transmission, the performance of FL will be significantly affected by wireless communication. In addition, due to limited communication resources, the number of users that can participate in FL is limited.

•
We propose a USBA-MC algorithm over a hybrid VLC/RF system. In the USBA-MC algorithm, each user obtains a local FL model by training on its own dataset and transmits the model parameters to a base station (BS). The BS aggregates the received local models to generate a global FL model and transmits it back to each user. The performance of the considered FL model is significantly affected by wireless factors such as the available bandwidth and the users' channel state information. We therefore formulate a joint user selection and bandwidth allocation problem whose goal is to minimize the FL training loss.

•
To solve this problem, we first introduce a model compression method to reduce the size of the FL model parameters transmitted over wireless links. To this end, we sort the model parameters and design a threshold selection mechanism according to the sparsity rate. We then prune the redundant model parameters below the threshold and thus compress the FL model of each user.

•
Following the model compression, we separate the joint user selection and bandwidth allocation problem into two subproblems. The first subproblem is a user selection problem under a given bandwidth allocation, which is solved by a traversal algorithm. The second subproblem is a bandwidth allocation problem under a given user selection, which is solved by a numerical method. The ultimate user selection and bandwidth allocation are obtained by iteratively compressing the model and solving these two subproblems.
Simulation results show that the proposed FL algorithm can improve the accuracy of object recognition by up to 16.7% and improve the number of selected users by up to 68.7%, compared to a conventional FL algorithm using only RF.
The remainder of this paper is organized as follows. In Section 2, we introduce the hybrid VLC/RF system model. Section 3 introduces a model compression method. The joint user selection, bandwidth allocation, and model compression algorithm is described in Section 4. Simulation results are presented and discussed in Section 5. Finally, Section 6 draws some important conclusions.

System Model and Problem Formulation
In this section, we first introduce a hybrid VLC/RF system for FL. Then, we introduce the computational model and the communication models of RF and VLC systems. Finally, based on the established model, we introduce a user selection and bandwidth allocation problem.

FL Model
In this model, each user n stores a local dataset D_n, with D_n denoting the number of training data samples. Therefore, the total number of training data samples of all users is D = ∑_{n=1}^{N} D_n. We assume that the training data samples of user n can be expressed as {x_n, y_n} with x_n = [x_{n1}, . . . , x_{nD_n}] and y_n = [y_{n1}, . . . , y_{nD_n}], where each x_{ni} is an input vector of the FL algorithm and y_{ni} is the corresponding output.
For each user, the goal of FL training is to find the model parameter ω that minimizes the loss function, where f_i(ω) is a loss function that captures the performance of the FL algorithm; for example, for a linear regression FL, the loss function is given in [14]. All users aim to minimize a common global loss function. To solve (2), the BS first transmits the global FL model parameters to its users, and the users use the received global FL model parameters to train their local FL models. Then, the users transmit their local FL model parameters to the BS to update the global FL model. For a strongly convex objective J(ω), the maximum number of global iterations that an FL algorithm needs to converge is given in [17], where ε is the accuracy of the global model and θ is the accuracy of the local model. We consider a fixed global accuracy ε.
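To make the objective concrete, the following sketch implements our reading of the setup above: each user n holds D_n samples, and the global loss J(ω) is the data-size-weighted combination of the per-user losses, using the linear-regression loss as the example f_i(ω). All names here are illustrative, not the paper's notation.

```python
import numpy as np

def local_loss(w, X, y):
    """Mean squared-error loss of one user's local data under model w."""
    residual = X @ w - y
    return 0.5 * np.mean(residual ** 2)

def global_loss(w, datasets):
    """Global FL loss: local losses weighted by D_n / D."""
    D = sum(len(y) for _, y in datasets)
    return sum(len(y) / D * local_loss(w, X, y) for X, y in datasets)
```

Minimizing `global_loss` over `w` is the objective that the BS and the users pursue cooperatively, one round of local training and one aggregation at a time.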

FL Based on Hybrid VLC/RF System
Due to the limited wireless bandwidth, only a subset of users can be selected for FL training, which can seriously degrade the training accuracy. To enable more users to join the FL training process, we design a hybrid VLC/RF system, whose structure is shown in Figure 1. The considered system consists of one BS, home gateways, and users cooperatively performing an FL algorithm for data analysis and inference. Denote the total users by a set N of N users, the indoor users by a set N_1 of N_1 users, and the outdoor users by a set N_2 of N_2 users. In this model, the BS sends the global FL model parameters to outdoor users over RF. Meanwhile, the BS transmits the global model parameters to the home gateways, which are connected to the indoor VLC access points (APs). The VLC APs then transmit the global FL model parameters to indoor users through visible light signals. We assume that the BS and the home gateways are connected by optical fiber, over which bit errors are negligible.
In indoor scenarios, each VLC AP consists of an LED lamp, and each user is served by the AP that provides the strongest signal. In addition, we assume that all indoor users are covered by visible light. We also assume that there is a central unit (CU) that controls both the VLC and RF systems. Note that there is no interference between the RF and VLC systems, which is a key benefit of introducing VLC for the deployment of FL over wireless networks.

Computational Model
Let c_n be the number of CPU cycles for user n to process one sample of data. As the data size of each training data sample is equal, the number of CPU cycles required for user n to execute one local iteration is c_n D_n. Denote the CPU-cycle frequency of user n by f_n. Then, the energy consumption of user n updating its local FL model in one global iteration can be expressed as follows, where n = 1, 2, ..., N, α_n/2 is the effective capacitance coefficient of the computing chipset of user n, and ν is a positive constant that depends on the data size of a training data sample and the number of conditions in the local problem [14].
Furthermore, the computational time per local iteration of user n is c_n D_n / f_n, n = 1, 2, ..., N. The total computational time, however, also depends on the number of local iterations, which is upper bounded by O(log(1/θ)). Therefore, the required computational time of user n for data processing is
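The cost model above can be sketched numerically as follows. We assume, as is standard in this line of work (e.g., [14]), that one global iteration runs ν log₂(1/θ) local iterations; the exact constant and log base are our assumption, not verbatim from the text.

```python
import math

def local_iterations(theta, v=1.0):
    # Upper bound on local iterations needed to reach local accuracy theta
    # (assumed form: v * log2(1/theta)).
    return v * math.log2(1.0 / theta)

def compute_time(c_n, D_n, f_n, theta, v=1.0):
    # Per-local-iteration time is c_n * D_n / f_n cycles over frequency f_n.
    return local_iterations(theta, v) * c_n * D_n / f_n

def compute_energy(c_n, D_n, f_n, alpha_n, theta, v=1.0):
    # Dynamic CPU energy: (alpha_n / 2) * cycles * f_n^2 per local iteration.
    return local_iterations(theta, v) * (alpha_n / 2) * c_n * D_n * f_n ** 2
```

For example, with c_n = 10^4 cycles/sample, D_n = 100 samples, f_n = 1 GHz, and θ = 0.5, one global iteration costs 1 ms of computation per user under this sketch.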

RF Transmission Model
We use the orthogonal frequency division multiple access (OFDMA) technique for both uplink and downlink RF transmissions. The uplink rate of user n is given by the expression below, where r_n^U = [r_{n,1}^U, ..., r_{n,R_U}^U] is a resource block (RB) allocation vector and R_U is the total number of RBs that the BS can allocate to the users; r_{n,i}^U ∈ {0, 1}, and r_{n,i}^U = 1 indicates that RB i is allocated to user n, otherwise r_{n,i}^U = 0; U_n represents the set of users that are located in other service areas and transmit data over RB i; B_U is the bandwidth of each RB and P_n is the transmit power of user n; h_n is the channel gain between user n and the BS; N_0^RF is the noise power spectral density; and ∑_{i∈U_n} P_i h_i is the interference caused by the users that are located in other service areas and use the same RB.
On the other hand, the downlink data rate of the BS transmitting the global FL model parameters to each user n is given by the expression below, where B_D is the bandwidth of each RB that the BS uses to transmit the global FL model to each user n; r_n^D = [r_{n,1}^D, ..., r_{n,R_D}^D] is an RB allocation vector with R_D being the total number of RBs that the BS can use for FL parameter transmission; r_{n,i}^D ∈ {0, 1}, and r_{n,i}^D = 1 indicates that RB i is allocated to user n, otherwise r_{n,i}^D = 0; P_B is the transmit power of the BS; B is the set of other BSs that cause interference to the BS that performs the FL algorithm; and h_nj is the channel gain between user n and BS j. Let B_R be the total RF bandwidth; we then have R_U × B_U + R_D × B_D ≤ B_R. For simplicity, we assume B_U = B_D, meaning that the bandwidth of an uplink RB is equal to that of a downlink RB.
Denote the data size in bits of the FL model that each user needs to upload by s_L. To upload the local FL model within the transmission delay requirement t_n^U, we require t_n^U r_n^U ≥ s_L. Meanwhile, the energy that user n requires to transmit the FL parameters is E_n^M = t_n^U P_n. Similarly, we assume that the data size in bits of the global parameters transmitted to the users is s_G. To download the global FL model within the transmission delay t_n^D, we require t_n^D r_n^D ≥ s_G.
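The uplink rate and upload-delay relations above can be sketched as follows. The per-RB SINR expression is our reading of the model (signal P_n h_n over co-RB interference plus noise B_U N_0), and all parameter values in the comments are illustrative, not from the paper.

```python
import math

def uplink_rate(rb_alloc, B_U, P_n, h_n, N0, interference=0.0):
    """Sum OFDMA rate over the RBs allocated to user n.

    rb_alloc  -- 0/1 allocation vector r_n^U
    B_U       -- bandwidth of one RB (Hz)
    P_n, h_n  -- transmit power and channel gain of user n
    N0        -- noise power spectral density
    interference -- co-RB interference power from other service areas
    """
    sinr = P_n * h_n / (interference + B_U * N0)
    return sum(rb_alloc) * B_U * math.log2(1.0 + sinr)

def upload_delay(s_L, rate):
    """Delay to upload s_L bits of local model parameters at the given rate."""
    return s_L / rate
```

The delay constraint t_n^U r_n^U ≥ s_L then corresponds to checking `upload_delay(s_L, rate) <= t_U` for each candidate user.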

VLC Transmission Model
The optical channel gain of a line-of-sight (LoS) channel can be expressed as in [18], where m = −1/log_2(cos(θ_{1/2})) is the Lambertian index, which is a function of the half-intensity radiation angle θ_{1/2}; A_p is the physical area of the receiver's photo-diode; d is the distance from the VLC AP to the optical receiver; ϕ is the angle of irradiation and θ is the angle of incidence; Θ_F is the half angle of the receiver's field of view (FoV); T_s(θ) is the gain of the optical filter; and the concentrator gain can be written as g(θ) = n_0^2 / sin^2(Θ_F) for 0 ≤ θ ≤ Θ_F, where n_0 is the refractive index. For a given user n connected to a VLC AP k, the signal-to-interference-plus-noise ratio (SINR) can be written as below, where γ is the optical-to-electric conversion efficiency, P_v is the transmitted optical power of a VLC AP, N_0^VLC is the noise power spectral density, u_nk is the channel gain between user n and VLC AP k, u_nl is the channel gain between user n and the interfering VLC AP l, and B is the bandwidth of each VLC RB. Each user is served by the single VLC AP that provides it the largest SINR. In the VLC system, optical OFDMA is employed. It is known that the input signal of the LEDs is amplitude constrained; therefore, the classical Shannon capacity formula for complex, average-power-constrained signals is not applicable in VLC. Instead, a lower bound on the achievable data rate is used, which can be expressed as in [19], where s_n is the largest SINR, evaluated as s_n = max{s_{n1}, ..., s_{nK}} with K being the total number of VLC APs; r_n^V = [r_{n,1}^V, ..., r_{n,R_V}^V] is an RB allocation vector with R_V being the total number of VLC RBs, and r_{n,i}^V ∈ {0, 1}. As the data size of the global parameters is s_G, the downlink transmission delay of indoor user n in each global iteration is t_{dn} = s_G / r_n.
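The Lambertian LoS gain above can be sketched numerically. The overall form (m+1)A_p/(2πd²)·cosᵐ(ϕ)·T_s·g(θ)·cos(θ) is the standard Lambertian LoS model we assume the elided equation takes; angles are in radians and the default n₀ is illustrative.

```python
import math

def lambertian_index(theta_half):
    # m = -1 / log2(cos(theta_1/2))
    return -1.0 / math.log2(math.cos(theta_half))

def concentrator_gain(theta, Theta_F, n0):
    # g(theta) = n0^2 / sin^2(Theta_F) within the FoV, 0 outside it
    if theta > Theta_F:
        return 0.0
    return n0 ** 2 / math.sin(Theta_F) ** 2

def los_gain(theta_half, A_p, d, phi, theta, Theta_F, T_s=1.0, n0=1.5):
    """Lambertian LoS optical channel gain (assumed standard form)."""
    if theta > Theta_F:
        return 0.0  # receiver outside the field of view
    m = lambertian_index(theta_half)
    return ((m + 1) * A_p / (2 * math.pi * d ** 2)
            * math.cos(phi) ** m * T_s
            * concentrator_gain(theta, Theta_F, n0) * math.cos(theta))
```

Note that the half-intensity radiation angle of 60° used later in the simulations gives m = 1, the classical first-order Lambertian emitter.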

Problem Formulation
Next, we introduce the optimization problem. Our goal is to minimize the global loss function under time, energy, and bandwidth allocation constraints. The minimization problem is given by (12), where S denotes the set of selected users participating in FL, S_1 the set of selected indoor users, S_2 the set of selected outdoor users, and |·| the cardinality of a set. In addition, T_round is the time threshold for each round, t_d denotes the delay between the BS and the home gateway, and γ_nE is the energy constraint of user n. Constraints (12a) and (12b) are the bandwidth constraints of the RF and VLC links, respectively. Constraint (12c) is the per-round delay constraint for all selected indoor users, while (12d) is the per-round delay constraint for all selected outdoor users. Constraint (12e) defines the set of selected users, and (12f) is the energy consumption requirement for performing the FL algorithm.

Model Compression
In this section, we first analyze the optimization problem (12) to identify how the communication factors affect the FL performance. Then, we introduce a model compression method that reduces the size of the FL model parameters transmitted over wireless links, so as to increase the number of users that can participate in FL.

Problem Analysis
To simplify problem (12), we first provide the following lemma.

Lemma 1. Given the transmit power of each user, the optimization problem (12) can be transformed into an optimization problem that maximizes the total size of the data samples of the selected users, denoted as problem (13).

Proof. Minimizing the global loss function is equivalent to minimizing the gap between the global loss function J(ω_t) at time t and the optimal global loss function J(ω*). According to Theorem 1 in [13], the gap is caused by the packet error rate (PER) and the number of selected users. Here, we do not consider packet errors and hence have q_i = 0. Using the same simplification method as in [13], the optimization problem can be transformed into problem (13). This ends the proof.

Model Weights Compression
From problem (13), we observe that as the number of users that implement FL increases, the gap decreases and the performance improves. This coincides with the experimental conclusions in [20]. To maximize the number of users in FL, we introduce a model compression method that reduces the transmission delay, energy, and bandwidth, thereby increasing the number of users that can participate in FL. In particular, the FL model has data redundancy during training, and thus we prune the connections with small weight updates to reduce the size of the transmitted model parameters. Although model compression loses part of the model information, the experiments in [9,10,21] have shown that appropriate compression methods do not significantly affect the convergence speed and accuracy under proper sparsity rates. In this section, we first introduce a compression method with non-fixed thresholds. Then, we analyze the impact of the model compression on the optimization problem (13).
An FL model needs to be carefully compressed without affecting the global model training. The change in the weights of a model can be used to evaluate their importance [22]; therefore, an appropriate pruning threshold is the key to FL model compression. To ensure that the gradients of an FL model are of the same order, we first normalize the gradients in each layer [23]. In particular, the gradients of an FL model can be given by G_n^τ = Train(W_n^τ, D_n) − W_n^τ, where G_n^τ ∈ R^{d_1×d_2} denotes the gradients of user n at iteration τ, W_n^τ ∈ R^{d_1×d_2} denotes the local model weights, and τ ∈ {1, . . . , T} is a global iteration; d_1 and d_2 represent the output and input dimensions, respectively; and Train(W_n^τ, D_n) refers to the trained model weights of user n. For a given sparsity rate, we obtain a threshold according to the sorted gradients. In particular, the mask entries corresponding to gradient magnitudes below the threshold are set to 0, while those above the threshold are set to 1. This process can be expressed by a sparsifying filter mask M_n ∈ R^{d_1×d_2} for user n. Therefore, the compressed local model weights can be written as M_n ⊗ W_n, where W_n ∈ R^{d_1×d_2} denotes the local model weights of user n and ⊗ is the Hadamard product.
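The threshold-selection step above can be sketched as follows. We read the sparsity rate p as the fraction of entries kept (an assumption; the paper does not pin down the convention), so the threshold is the p-quantile of the sorted gradient magnitudes and the mask M_n keeps the largest-magnitude entries.

```python
import numpy as np

def sparsify_mask(G, p):
    """Return the 0/1 filter mask keeping the fraction p of
    largest-magnitude entries of the gradient matrix G."""
    flat = np.abs(G).ravel()
    k = max(1, int(round(p * flat.size)))
    threshold = np.sort(flat)[-k]          # k-th largest magnitude
    return (np.abs(G) >= threshold).astype(G.dtype)

def compress_weights(W, G, p):
    """Compressed local weights: M_n Hadamard-multiplied with W_n."""
    return sparsify_mask(G, p) * W
```

Only the positions and values of the surviving entries need to be transmitted, which is what yields the s_L · p_n upload size used later.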
Similarly, the compressed global model weights can be expressed as M ⊗ W, where W ∈ R^{d_1×d_2} denotes the global model weights and M ∈ R^{d_1×d_2} is the sparsifying filter mask for the BS. From (16), we observe that each user receives the same sparse global model. The gradients that are not transmitted to the BS or the users are called residuals [24], and they are used for the local model training and the global FL model generation. Therefore, the residuals can be used to mitigate the errors caused by sparsification and to accelerate the FL convergence speed [21]. In particular, the residuals of user n are defined by (17), where R_n^T ∈ R^{d_1×d_2} is the accumulation of the residuals at iteration T and G_{n,C}^τ ∈ R^{d_1×d_2} denotes the compressed G_n^τ. Similarly, the residuals of the BS are defined by (18), where G^τ ∈ R^{d_1×d_2} denotes the model gradients at the BS and G_C^τ denotes the compressed G^τ. During transmissions, users only need to transmit the positions of the non-zero parameters and their values. Upon receiving this information, the BS can recover the model and obtain the sparsifying filter masks. We assume that the initial model weights are W_I ∈ R^{d_1×d_2}, the final output is W_F ∈ R^{d_1×d_2}, and 1 denotes the all-ones matrix. The overall process of model compression is shown in Algorithm 1.
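The residual mechanism above can be sketched as follows: the pruned (untransmitted) gradient entries are carried locally and folded into the next round before sparsification, so nothing is discarded, only deferred. The helper `_mask` and the variable names are illustrative, assuming p is the fraction of entries kept.

```python
import numpy as np

def _mask(G, p):
    # keep the fraction p of largest-magnitude entries (illustrative)
    flat = np.abs(G).ravel()
    k = max(1, int(round(p * flat.size)))
    threshold = np.sort(flat)[-k]
    return (np.abs(G) >= threshold).astype(G.dtype)

def compressed_update(G, residual, p):
    """Sparsify (G + carried residual); keep the pruned remainder locally."""
    corrected = G + residual
    G_c = _mask(corrected, p) * corrected   # transmitted part
    return G_c, corrected - G_c             # new residual stays on device
```

Because G_c + residual reconstructs the corrected gradient exactly, the sparsification error does not accumulate, which is why residuals accelerate convergence [21].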
Let p_n and p be the sparsity rates corresponding to M_n and M, respectively. Then, the sizes of the compressed local FL model and the compressed global FL model can be expressed as s_{L,C} = s_L · p_n and s_{G,C} = s_G · p, respectively. Using the proposed compression scheme, the optimization problem (13) can be rewritten accordingly, where E_{n,C}^M = t_{n,C}^U · P_n. We denote the compression algorithm that reduces the communication costs in each iteration by MC(s_L, s_G).

The Proposed Algorithm
To solve (19), in this section, we propose a joint user selection, bandwidth allocation, and model compression algorithm, USBA-MC, which divides problem (19) into two subproblems and solves them iteratively. In particular, we first fix the bandwidth allocation and optimize the user selection. Then, the bandwidth allocation problem is formulated and solved given the obtained subset of selected users. The model compression and these two subproblems are solved iteratively until a convergent solution of (19) is obtained.

Optimal User Selection
Given the bandwidth of each RB, (19) can be further simplified as (20). We can observe from (20) that if the bandwidth of each RB is fixed, the subset of selected users is determined by the users' computing power and channel conditions. We denote the algorithm that optimizes user selection under a fixed bandwidth allocation by Algorithm 2:

1: for n ∈ N_1 do
2:   if t_{dn,C} + t_{n,C}^U + t_n^P + t_{d,C} ≤ T_round and E_{n,C}^M + E_n^P ≤ γ_{nE} then
3:     S_1 ← S_1 ∪ {n}
4:   end if
5: end for
6: for n ∈ N_2 do
7:   if t_{n,C}^D + t_{n,C}^U + t_n^P ≤ T_round and E_{n,C}^M + E_n^P ≤ γ_{nE} then
8:     S_2 ← S_2 ∪ {n}
9:   end if
10: end for
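The traversal above admits a direct rendering in code: every user whose end-to-end delay and energy fit the per-round budgets is selected. The tuple layout is illustrative, and we assume the delay and energy totals have been precomputed from the models of Sections 2 and 3.

```python
def select_users(indoor, outdoor, T_round):
    """Traversal user selection (Algorithm 2 sketch).

    indoor / outdoor: lists of (user_id, total_delay, total_energy,
    energy_budget) tuples, one per candidate user.
    """
    S1 = [n for n, delay, energy, budget in indoor
          if delay <= T_round and energy <= budget]
    S2 = [n for n, delay, energy, budget in outdoor
          if delay <= T_round and energy <= budget]
    return S1, S2
```

The cost is one pass over the N users, matching the O(N) complexity stated in the complexity analysis below.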

Optimal RB Bandwidth
With an obtained subset of users, we need to find the optimal B, B_U, and B_D that further improve the capability of the hybrid VLC/RF system. Note that the larger the bandwidth of an RB, the smaller the delay, implying that more users can potentially be selected. Based on this observation, the optimal RB bandwidth allocation problems are given by (21) and (22).

Lemma 2. The maximum bandwidth of an RB is obtained when constraints (21a) and (22a) hold with equality.

Proof. We prove Lemma 2 by contradiction. First, assume that maxima B_0^U, B_0^D, and B_0 exist when (21a) and (22a) are not tight, which yields the corresponding strict inequalities. However, when (21a) and (22a) hold with equality, B_1^U, B_1^D, and B_1 satisfy the corresponding equations. Obviously, B_1^U = B_1^D > B_0^U = B_0^D and B_1 > B_0, which contradicts the assumption. This ends the proof. The optimal RB bandwidths then follow accordingly.
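A numerical illustration of Lemma 2, under our reading of it: with B_U = B_D and the total-bandwidth constraint R_U·B_U + R_D·B_D ≤ B_R tight at the optimum, the maximum per-RB bandwidth follows in closed form. The function name and example values are illustrative.

```python
def max_rb_bandwidth(B_R, R_U, R_D):
    """Largest common RB bandwidth B with R_U*B + R_D*B <= B_R tight."""
    return B_R / (R_U + R_D)
```

For instance, with B_R = 20 MHz and R_U = R_D = 50 RBs, each RB gets 200 kHz, and making either constraint slack would only shrink this value, which is the intuition behind the contradiction argument above.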

Iterative Solution
In each iteration, we first use the proposed model compression method to reduce the transmission delay and energy. Then, we update the selected users based on the constraints, using GetS(B_U, B_D, B). Finally, the bandwidth allocation is obtained from the selected users, denoted by GetB(S). The iteration ends when both the user selection and the bandwidth allocation remain fixed. The algorithm always reaches convergence after a finite number of iterations. We summarize the proposed USBA-MC algorithm in Algorithm 3.
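The iteration just described can be sketched at a high level as follows. MC, GetS, and GetB stand for the model-compression, user-selection, and bandwidth-allocation routines from the text; they are passed in as callables, and the fixed-point stopping rule mirrors the convergence condition above.

```python
def usba_mc(MC, GetS, GetB, B_init, max_iters=100):
    """Iterate compression, user selection, and bandwidth allocation
    until both the selection S and the bandwidth B stop changing."""
    B = B_init
    S = GetS(B)
    for _ in range(max_iters):
        MC()                  # compress models, shrinking s_L and s_G
        S_new = GetS(B)       # user selection under current bandwidths
        B_new = GetB(S_new)   # bandwidth allocation for the selected users
        if S_new == S and B_new == B:
            break             # converged: both solutions are fixed
        S, B = S_new, B_new
    return S, B
```

Because user selection is monotone in the per-RB bandwidth and compression only loosens the delay constraints, the alternation settles to a fixed point in practice, which is the convergence behavior claimed above.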
From Lemma 3, we can also observe that there is a convergence gap; as the number of participating users increases, the gap decreases. Meanwhile, as the number of users increases, the value of F also decreases, which improves the convergence speed of the FL algorithm.

(2) Implementation Analysis: We now analyze the implementation of the proposed algorithm. To find the optimal user selection set S, the BS must first calculate the total delay and the energy consumption E_{n,C}^M + E_n^P of each user. In our system, the total delay includes the RF link delay t_{n,C}^D + t_{n,C}^U + t_n^P and the VLC link delay t_{dn,C} + t_{n,C}^U + t_n^P + t_{d,C}. To calculate the total delay, the BS must know the model size required by the FL algorithm and the computational time. The size of the FL model depends on the learning task and the sparsity rate. Before implementing an FL algorithm, the BS first transmits the task and model information to each user and sets the model sparsity rate; therefore, the BS knows the FL model size and the sparsity rate before training. To calculate the energy consumption and the computational time, the BS also needs the users' device information, such as transmit power and CPU frequency, which it can learn when users initially connect to the BS. Given the total delays t_{n,C}^D + t_{n,C}^U + t_n^P and t_{dn,C} + t_{n,C}^U + t_n^P + t_{d,C} and the energy consumption E_{n,C}^M + E_n^P, the BS can compute S_1 and S_2 using GetS(B_U, B_D, B). Given S_1 and S_2, the BS can compute the bandwidth of each RB using GetB(S). As the objective in (20) is linear, the USBA-MC algorithm can determine a user selection set S that improves the FL training loss.
(3) Complexity Analysis: With regard to the complexity of the USBA-MC algorithm, we first analyze the complexity of the model compression algorithm, which depends on the number of model parameters. Let W_O be the number of model parameters; the complexity of the model compression algorithm is O(W_O log W_O) [26]. Next, we analyze the complexity of the traversal algorithm. Since the total number of users is N, the complexity of the traversal algorithm is O(N) [27]. In addition, the complexity of the numerical method is O(1), since we only need to allocate RBs according to the user set. Assume that the number of global iterations is T; the total complexity of the USBA-MC algorithm can then be expressed as O(T W_O log W_O).

Simulation Results and Analysis
Consider a circular network area with a radius r = 50 m and one BS at its center. There are N = 50 uniformly distributed users; 80% of the users are indoors and 20% are outdoors. The system specifications are summarized in Table 1. The following two baselines are considered: (a) the USBA algorithm in a hybrid VLC/RF system [28] and (b) the FL algorithm in an RF-only system. To comprehensively evaluate the performance of the proposed USBA-MC algorithm in federated learning systems, we conduct experiments on three learning tasks: (a) predicting housing prices, (b) identifying handwritten digits from 0 to 9, and (c) classifying 10 categories of color images.

Table 1. System parameters.

Parameter: Value
Transmitted optical power per VLC AP, P_v: 9 W
Modulation bandwidth for LED lamp, B: 40 MHz
Physical area of a PD, A_p: 1 cm²
Half-intensity radiation angle, θ_{1/2}: 60 deg.
Gain of optical filter, T_s(θ): 1

In the housing price prediction task, our goal is to compare the performance of the proposed USBA-MC algorithm under different sparsity rates and against baselines (a) and (b). The dataset used to train the FL algorithm is the Boston house price dataset (http://lib.stat.cmu.edu/datasets/boston (accessed on 27 March 2021)), which is randomly and equally allocated to the users. In this task, each user trains an FNN with one hidden layer composed of 10 neurons.
In the handwritten digit identification task, we train FNNs on the MNIST dataset [29]. The sizes of the neuron weight matrices are 784 × 200, 200 × 200, and 200 × 10. Sixty thousand handwritten digits are used to train the network and 10,000 are used to test it.
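The layer sizes above pin down the upload cost per round. A quick sketch, assuming 32-bit floats and ignoring the (small) overhead of transmitting non-zero positions:

```python
def n_params(shapes):
    """Total weight count over a list of (rows, cols) matrix shapes."""
    return sum(rows * cols for rows, cols in shapes)

def upload_bits(shapes, p, bits_per_param=32):
    """Bits uploaded per round at sparsity rate p (fraction kept)."""
    return n_params(shapes) * bits_per_param * p

mnist_fnn = [(784, 200), (200, 200), (200, 10)]
assert n_params(mnist_fnn) == 198_800  # 156,800 + 40,000 + 2,000
```

So the MNIST FNN carries 198,800 weights; at a sparsity rate of 0.2, each user uploads roughly a fifth of the full-model payload, which is what frees bandwidth for additional users.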
Finally, we train CNNs on CIFAR-10 [30] to investigate the performance of the USBA-MC algorithm with different sparsity rates on non-IID data. The sizes of the neuron weight matrices are 5 × 5 × 3 × 64, 5 × 5 × 64 × 64, 2304 × 384, 384 × 192, and 192 × 10. Fifty thousand images are used to train the network and 10,000 images are used to test it.

Figure 2 shows the performance of the proposed USBA-MC algorithm in two learning tasks under different sparsity rates. We use the coefficient of determination (R²) to measure the quality of the model in the housing price prediction task, and the classification accuracy to measure the performance in the handwritten digit identification task. Moreover, we average over 10 experiments to ensure the reliability of the results. The R² values first increase and then decrease with the sparsity rate, because model information is lost at low sparsity rates. In particular, the best sparsity rate is 0.4 for predicting housing prices and 0.2 for identifying handwritten digits.

Figure 3 compares the predictive performance of USBA-MC and baselines (a) and (b). The green line shows the true values of the data samples, and the sparsity rate of USBA-MC is set to 0.4. Before training, we randomly select 18 samples to form a test set. In Figure 3, we can observe that the proposed USBA-MC algorithm achieves better performance than baselines (a) and (b); in particular, it improves the R² by up to 11% and 15% compared to baselines (a) and (b), respectively.

Figure 4 compares the identification performance in the task of identifying handwritten digits. USBA-MC outperforms baselines (a) and (b) in most global communication rounds, and the final accuracies of the three algorithms are 96.52%, 96.45%, and 96.39%, respectively.
This is because USBA-MC introduces visible light communication and reduces the size of the transmitted model, which increases the number of selected users and further improves the FL performance.

Number of Selected Users
This subsection evaluates the performance of USBA-MC in user selection. We first compare the number of users selected under different bandwidth conditions and resource constraints. Figure 5 shows how the number of selected users changes as the total number of users varies in the different systems. It can be observed that, as the total number of users increases, the number of selected users increases for all three algorithms. However, compared with baselines (a) and (b), the USBA-MC algorithm enables more users to participate in the training process, and this advantage grows with the total number of users. For instance, when the total number of users is 150, USBA-MC improves the number of selected users by up to 37.8% and 68.7% compared to baselines (a) and (b), respectively. Table 2 shows the ratio of selected users; USBA-MC always achieves the highest ratio in all cases.

Figure 6 compares the number of selected users under different transmission data sizes. We can observe that the number of selected users decreases as the data size increases. Table 3 shows the selection ratio of the different algorithms under different data sizes when the total number of users is 150. The results show that USBA-MC achieves better system performance than the other algorithms. This advantage becomes important as the model size grows due to complex neural networks.

Non-IID Data
In this subsection, we explore the accuracy of the USBA-MC algorithm with non-IID data [31]. To obtain a non-IID dataset, we use the same method as in [31].
As shown in Figure 7, the model is trained on a dataset of non-IID nature. We can clearly observe the advantages of USBA-MC over the other algorithms: in terms of both stability and accuracy, the USBA-MC algorithm achieves the best performance. In USBA-MC, a low sparsity rate increases the stability of the system and improves the final accuracy. However, at a sparsity rate of 0.2, USBA-MC has lower accuracy and higher stability than at 0.4, because decreasing the sparsity rate increases the loss of model information. Moreover, the model will not converge when the sparsity rate is too low. Therefore, there is a trade-off between the sparsity rate and the model performance.

Figure 8 shows the use of the proposed USBA-MC algorithm for image identification. In this simulation, the BS uses broadcast techniques to transmit the global model, and the local models are trained on CIFAR-10. As shown in Figure 8, the proposed USBA-MC algorithm still achieves the best performance among the three algorithms in terms of both accuracy and stability; in particular, it improves the accuracy by up to 3.27% and 6.35% compared to baselines (a) and (b), respectively.

We next analyze the gain of USBA-MC when the dataset is non-IID. According to Section 3 in [31], weight divergence reduces the accuracy on non-IID datasets. The weight divergence is caused by the distance between the data distribution on each user and the population distribution, which can be evaluated with the earth mover's distance (EMD). According to the central limit theorem (CLT) [32], as the number of local models increases, the mean of the EMD approaches a normal distribution. Therefore, with more users, the weight divergence of the trained global model becomes smaller and the model performance improves. Compared with the accuracy of the transmitted model, the system is more sensitive to the number of selected users.
Therefore, increasing the number of users improves the robustness and accuracy of the global model. Figure 9 uses the model accuracy of the last 10 global communication rounds to obtain the average and variance of the accuracy. It can be observed that the accuracy first increases and then decreases with the sparsity rate. USBA-MC improves the accuracy by up to 7% and 16.7% compared to baselines (a) and (b), respectively, when the sparsity rate is 0.4. We can also observe that as the number of users increases, the model becomes more stable, until it cannot converge.

Conclusions
This paper has proposed introducing VLC into conventional RF systems to support FL. We have formulated a joint user selection and bandwidth allocation problem for FL in a hybrid VLC/RF system. To solve this problem, we first used a model compression method to reduce the size of the FL model parameters transmitted over wireless links, and then separated the optimization problem into two subproblems. The first subproblem is a user selection problem under a given bandwidth allocation, which is solved by a traversal algorithm. The second subproblem is a bandwidth allocation problem under a given user selection, which is solved by a numerical method. The convergent solution is obtained by iteratively compressing the model and solving these two subproblems. Simulation results have demonstrated that the USBA-MC algorithm outperforms both USBA and FL in RF-only systems.