Reduction in Data Imbalance for Client-Side Training in Federated Learning for the Prediction of Stock Market Prices

Abstract: The approach of federated learning (FL) addresses significant challenges, including access rights, privacy, security, and the availability of diverse data. However, edge devices produce and collect data in a non-independent and identically distributed (non-IID) manner, so the number of data samples may vary among the edge devices. This study elucidates an approach for implementing FL that balances training accuracy against imbalanced data. The approach applies data augmentation to the data distribution, using class estimation and balancing on the client side during local training. Secondly, simple linear regression is utilized for model training at the client side to keep the computation cost low. To validate the proposed approach, the technique was applied to a stock market dataset comprising four stocks (AAL, ADBE, ADSK, and BSX) to predict day-to-day stock values. The proposed approach demonstrated favorable results, exhibiting a strong fit of 0.95 and above with a low error rate. The R-squared values, predominantly ranging from 0.97 to 0.98, indicate the model's effectiveness in capturing variations in stock prices. Strong fits are observed within 75 to 80 iterations for stocks displaying consistently high R-squared values, signifying accuracy. On the 100th iteration, the declining MSE, MAE, and RMSE values (AAL at 122.03, 4.89, and 11.04; ADBE at 457.35, 17.79, and 21.38; ADSK at 182.78, 5.81, and 13.51; and BSX at 34.50, 4.87, and 5.87, respectively) corroborated the positive results of the proposed approach with minimal data loss.


Introduction
The proliferation of the internet of things (IoT) and edge devices has led to a substantial surge in data generation. The number of such devices has been forecast to surpass 30 billion, resulting in a potential global data volume of approximately 163 zettabytes (trillion gigabytes) [1,2]. The cost of computation, processing, and communication associated with storing data in data centers is significant. However, this is not the only concern, as the sensitivity of edge data with regard to privacy has resulted in mass data-sharing hesitancy among the general public. In the context of industries, safeguarding data from privacy breaches and cyberattacks is an emerging challenge. To preserve the confidentiality of data residing on edge nodes, federated learning (FL) is a widely adopted approach [3,4]. FL has emerged as a promising approach to tackle critical concerns such as access rights, privacy, security, and access to heterogeneous data [5,6]. FL facilitates the creation of a collective learning model via multiple nodes without the need to exchange their data samples [7], thus saving the cost of communication and storage of data to central servers while preserving the privacy of edge data [8,9]. This technique has successfully found applications in diverse domains, including mobile traffic prediction and monitoring [10,11], healthcare [12,13], the internet of things [14-16], transportation and autonomous vehicles [17], digital twin [16], blockchain [18], disaster management [19,20], natural language processing [21], knowledge extraction [22], agriculture [23], pharmaceutics, and medical sciences [24,25].
The approach of FL differs from that of distributed machine learning (DML), where the data are initially centralized on a server and subsequently partitioned into subsets for learning tasks. In that scenario, the sample size follows a uniform distribution and is both independent and identically distributed (IID) [26]. In contrast, FL distributes the algorithm for processing across edge devices rather than concentrating the data on a central server [27,28], as presented in Figure 1. As a result, FL possesses a greater number of training subsets in comparison to DML. This may lead to a non-identical distribution of data (non-IID), as stated in [29]. Most classification tasks exhibit imbalanced class distributions, which can lead to biased machine learning algorithms [30]. The problem of imbalanced distribution poses a significant challenge. In supervised learning, models require labeled training data for updating their parameters, and the imbalance of the training data lies in the variation of the number of samples across different classes/labels. Its major solutions are ensemble learning and sampling techniques [31]. The under-sampling method is straightforward to implement, as it involves sampling the data to achieve a balanced proportion [32,33]: examples from the training dataset that belong to the majority class are removed to better balance the class distribution. However, this technique presupposes a large dataset, whereas the local data of the edge devices in an FL network are generally limited in size.
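The majority-class under-sampling described above can be sketched as follows (a minimal illustration with made-up class labels, not the paper's implementation):

```python
import random
from collections import Counter

def undersample(samples, labels, seed=0):
    """Randomly drop majority-class examples until every class
    has as many samples as the smallest (minority) class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())          # minority-class size
    balanced = []
    for cls in counts:
        idx = [i for i, y in enumerate(labels) if y == cls]
        keep = rng.sample(idx, target)     # keep only `target` examples per class
        balanced.extend((samples[i], labels[i]) for i in keep)
    return balanced

# 8 'up' days vs 2 'down' days -> balanced to 2 of each
data = list(range(10))
labels = ["up"] * 8 + ["down"] * 2
balanced = undersample(data, labels)
print(sorted(y for _, y in balanced))  # ['down', 'down', 'up', 'up']
```

Note how the balanced set has only 4 of the original 10 examples: this is exactly the data loss that makes under-sampling ill-suited to the small local datasets on edge devices.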
J. Sens. Actuator Netw. 2024, 13, x FOR PEER REVIEW

The edge nodes use the available data to train a collaborative model in an FL setting. The distribution of data originating from edge devices is contingent upon their usage patterns. For instance, when comparing cameras situated in natural habitats to those placed in parks, the latter are inclined to capture a greater number of images featuring individuals. To facilitate a deeper comprehension, these imbalances can be classified into three distinct types: (1) Size imbalance, which refers to a scenario in which the size of the data sample on each edge node is irregular. (2) Local imbalance, also known as non-IID, which refers to a scenario where not all nodes in a system follow the same data distribution [34]. (3) Global imbalance, which refers to the situation where the distribution of data across all nodes in a system is characterized by a significant class imbalance, as noted in reference [35]. FL aims to create a global model that generalizes well to unseen data. However, if certain classes are underrepresented during training due to data imbalance, the model may not generalize effectively to those classes during inference.
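The three imbalance types can be made concrete with a toy sketch (the client names and label lists below are hypothetical, echoing the camera example): size imbalance compares per-node sample counts, local imbalance compares each node's class distribution, and global imbalance looks at the pooled class counts.

```python
from collections import Counter

clients = {                      # hypothetical edge nodes and their labels
    "camera_park":    ["person"] * 90 + ["animal"] * 10,
    "camera_habitat": ["animal"] * 30 + ["person"] * 5,
}

# (1) size imbalance: unequal sample counts per node
sizes = {c: len(ys) for c, ys in clients.items()}

# (2) local imbalance (non-IID): per-node class proportions differ
local_dists = {c: {k: v / len(ys) for k, v in Counter(ys).items()}
               for c, ys in clients.items()}

# (3) global imbalance: the pooled data are themselves skewed
global_counts = Counter(y for ys in clients.values() for y in ys)

print(sizes)                      # {'camera_park': 100, 'camera_habitat': 35}
print(global_counts["person"], global_counts["animal"])  # 95 40
```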
To summarize, FL is a proficient distributed machine learning approach that offers the added benefit of protecting privacy; however, it encounters difficulties in managing datasets that are unbalanced or skewed [35]. In FL, models are trained on local datasets from different clients. If some clients have a disproportionately large or small number of samples from a particular class, the model may become biased toward those classes. Similarly, clients with more data may have a larger influence on the global model during aggregation, potentially overshadowing contributions from clients with smaller datasets. This can lead to a lack of fairness in the learning process. This study tackles local data imbalance in a federated setting by balancing the classes of data, thereby enhancing the precision of the results while maintaining privacy and minimizing communication overhead.

Contributions:
The uneven distribution of data can result in bias during the training phase of a model, leading to reduced accuracy in FL applications [34,35]. The primary contribution of this paper is the resolution of imbalanced data issues through a class estimation and balancing technique that corrects local imbalance via data augmentation on the client side:
• A novel approach, balanced federated learning (Bal-fed), is proposed for the FL setting to achieve class balance, increasing training accuracy at a lower computation cost.
• The proposed approach harnesses class estimation and balancing to train a linear regression (LR) machine learning algorithm in an FL setting. To assess its applicability, the approach is applied to stock market data (textual and numerical data).
• The approach is shown to enhance model accuracy in FL settings, achieving an accuracy rate exceeding 95% even after mitigating the issue of data imbalance.
The present article is organized into five primary sections. Section 2 elucidates the studies conducted and the outcomes of experiments addressing problems and challenges like the one under consideration. Section 3 provides a detailed account of the methodology and materials employed in conducting the experiments. Section 4 encompasses the setup and execution of the experiment and the outcomes of the applied methodology. Section 5 concludes the experiment, summarizes the corresponding results, and highlights future directions in this area.

Related Work
In the realm of distributed data, FL is a developing methodology that aims to address privacy concerns [36]. The advancement of novel frameworks aimed at enhancing healthcare technologies has been the focus of numerous research endeavors [37-41]. Similarly, the industrial applications of FL are emerging as a potential area of research. The application of deep CNN-based methods in industrial systems holds immense potential for revolutionizing various aspects of operations. In industrial settings, deep CNNs excel in tasks such as image recognition, defect detection, and predictive maintenance. For instance, in quality control processes, a well-trained deep CNN can accurately identify defects in manufactured products by analyzing visual data from production lines. Moreover, predictive maintenance becomes more effective as deep CNNs analyze sensor data to detect subtle anomalies indicative of machinery wear or potential failures. This line of work underscores the transformative impact of deep CNNs in enhancing efficiency, reducing downtime, and ensuring the overall reliability of industrial systems. A plethora of approaches and frameworks exist for FL; however, only a limited number of studies have been conducted to assess the efficacy of data balancing in FL approaches and frameworks [42]. This section provides a comprehensive overview of the experiments conducted, with a particular focus on those relevant to our study.
The production and assembly of data using nodes in a network often occur in a non-IID manner, as noted in previous studies [34,43-45]. For instance, individuals who use cellular devices differ in the language they frequently utilize, which affects tasks such as predicting the subsequent word. Additionally, the quantity of data distributed among nodes may vary significantly. The convergence of the FL algorithm can be enhanced through the quantification of the statistical heterogeneity of the data. Recent studies have introduced and applied various techniques and instruments for computing statistical heterogeneity through metrics [46]. However, it is not possible to measure these metrics before the training process.
To improve the performance of machine learning models, Verma et al. proposed novel methodologies, particularly in the context of highly imbalanced datasets [47]. The researchers analyzed the participants' performances in various settings. The model in question was an artificial intelligence (AI) model developed using an FL approach, which involved the integration of data from multiple sources.
The algorithms utilized for the autonomous computation of updates by clients, based on their local data, in the present model were established by Konecny et al. (2016) [48]. The clients transmit the updated data to a central server, which computes a new global model by amalgamating the changes made by the clients. The primary users of this system are mobile devices, and the efficiency of their communication is of utmost importance. This study proposes two methods, namely structured updates [49] and sketched updates [50], to mitigate the expenses associated with up-link transmission.
Nilsson et al. [51] conducted a benchmark of three FL algorithms, comparing their efficacy against storing the data on the server. The algorithms were federated averaging (FedAvg), a distributed machine learning algorithm that enables the training of models on decentralized data; federated stochastic variance reduced gradient; and CO-OP. The algorithms underwent testing with non-independent and identically distributed data, the concept of independent and identically distributed (IID) variables being a fundamental principle in probability theory and statistics. The study investigates various data partitioning techniques applied to the MNIST dataset. The investigation revealed that the FedAvg algorithm demonstrated the highest level of accuracy.
The integration of FL and deep reinforcement learning (DRL) as a means of enhancing edge systems is suggested in [52]. The implementation of this concept has resulted in improvements in caching, networking, and mobile edge computing (MEC). To leverage edge nodes and facilitate device collaboration, the authors developed the "In-Edge AI" framework. Empirical evidence has demonstrated that this framework exhibits exceptional performance with minimal cognitive load. Ultimately, the authors deliberated on a range of challenges and opportunities to depict the promising future of "In-Edge AI" [53]. Xu et al. [54] conducted a survey to investigate the growth of FL in healthcare informatics. The authors provided a comprehensive overview of the vulnerabilities, statistical challenges, and privacy concerns associated with the issue at hand, and proposed solutions to address these issues. The authors anticipate that their findings will serve as valuable resources for researchers in the field of computational research on machine learning algorithms. Specifically, they claimed that their work will aid in the management of large amounts of distributed data while also considering privacy and health informatics [54].
Sattler et al. developed the clustered federated learning (CFL) approach [55] to tackle the issue of reduced accuracy in FL scenarios when the data distribution of local clients deviates. Federated multi-task learning (FMTL) is a technique that is facilitated using CFL. The geometric properties of the FL loss surface are utilized in the FMTL approach to facilitate the categorization of client populations based on the distribution of trainable data. It is recommended that the FL communication mechanism in CFL remain unchanged. The clustering quality is supported by robust mathematical guarantees, which are further enhanced by the incorporation of deep neural networks (DNNs). CFL maintains a variety of client demographics over an extended period while ensuring privacy protection measures are in place. This approach allows for flexibility in data management. When comparing FL and CFL, the latter is claimed to be a post-processing technique that attains objectives either equivalent to or greater than those of the former [56].
Frameworks for secure FL were proposed in [57]. The authors presented a comprehensive FL platform that encompasses federated transfer learning (FTL), as well as vertical and horizontal FL. The authors presented concepts related to FL, discussed the necessary infrastructure, and explored the potential implications of FL implementation, providing a comprehensive analysis of the progress made in this area. Furthermore, utilizing federated processes, experts suggested the establishment of data networks among enterprises to facilitate data sharing while ensuring the protection of end-users' privacy [58].
Recently, a proposed framework for agnostic FL optimized a centralized model [59]. The optimization of client distributions enables it to be suitable for any resulting target distribution. The authors expressed the opinion that the framework produces a perception of fairness and proposed a rapid stochastic optimization method to tackle the optimization issues at hand. The authors also presented convergence bounds for the approach, assuming a given hypothesis set and a convex loss function, and utilized multiple datasets to demonstrate the advantages of their proposed techniques. The paradigm proposed by the authors has potential applicability in various learning contexts, including but not limited to domain adaptability, cloud computing, and drifting, as suggested in previous research [45]. In the field of mobile devices, Bonawitz and colleagues proposed a scalable production approach in [60]. The system was built upon the foundation of TensorFlow (TF). The authors introduced advanced theoretical ideas, discussed diverse challenges and their corresponding resolutions, and illustrated the problems and possible solutions [61].
It is apparent from the literature that numerous frameworks and techniques have been developed to address challenges such as communication cost, statistical heterogeneity, convergence, and resource allocation. However, the challenge posed by class imbalance and data imbalance is often overlooked in FL. In the context of FL and imbalanced data, researchers have explored various methods to mitigate the impact of class imbalance on model performance. Some of the approaches include oversampling [62], under-sampling [33,63], class weights [64], and localized balancing [65]. The effectiveness of these techniques may vary depending on the specific characteristics of the dataset and the FL setup. This study refines existing methods and proposes a new approach that handles the imbalanced class distribution of the data by combining class estimation with a balancing algorithm.
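Of the mitigations listed above, class weighting is the simplest to state precisely. The following is a minimal sketch (not any cited paper's implementation) of inverse-frequency weights, which scale each class's contribution to the loss inversely to its sample count:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class inversely to its frequency so that rare
    classes contribute as much to the loss as common ones."""
    counts = Counter(labels)
    total = sum(counts.values())
    n_classes = len(counts)
    # weight = total / (n_classes * class_count); balanced data gives 1.0
    return {cls: total / (n_classes * n) for cls, n in counts.items()}

labels = ["up"] * 80 + ["down"] * 20
weights = inverse_frequency_weights(labels)
print(weights)  # {'up': 0.625, 'down': 2.5}
```

The majority class receives a weight below 1 and the minority class a weight above 1, so a weighted loss treats both classes evenly without discarding any samples.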

Materials and Methods
An FL approach, referred to as Bal-fed (balanced federated learning), is proposed in this paper and illustrated in Figure 2. This approach aims to rebalance training through techniques such as client selection at the edge layer and estimation of the classes held by the executing clients. To address potential data bias, a technique known as data augmentation [66] is employed on a global scale. The linear regression algorithm is employed to train the model over the edge nodes, as illustrated in Figure 2. The updated model is subsequently transmitted to the central server for model aggregation using FedAvg. The problem of data imbalance is effectively addressed in centralized machine learning; in FL, however, it is imperative to maintain the confidentiality of personal information. The generation of synthetic data has been proposed as a viable approach to preserving privacy while still allowing for data analysis [67,68]. This method involves creating artificial data that mimic the statistical properties of the original data without revealing any sensitive information, in accordance with the post-processing guarantees of differential privacy (DP) [67,69]. Augenstein et al. investigated and presented the federated approach for generating synthetic data [70]. In the context of a federated environment, the technique of data synthesis can be employed. Furthermore, it is imperative to incorporate client estimation in the self-balancing approach.


Model Training on the Client Side
In FL settings, the raw data of clients cannot be obtained due to privacy concerns. However, the class estimation and balancing scheme [71] can estimate the class distribution of clients on the edge side according to their updated gradients. This class estimation can then be combined with data augmentation [66] to balance the classes and data evenly.
While training a model in FL, the expectation of the gradient square for different classes has the following approximate relation [72]:

E[‖∇L_i‖^2] / E[‖∇L_j‖^2] ≈ n_i^2 / n_j^2,

where L denotes the cost function of the training algorithm, and n_i and n_j are the number of data samples for class i and class j, respectively, with i ≠ j and i, j ∈ C. Due to this correlation between class distribution and gradient, the class estimation [71] of class C_i recovers its class ratio

R_i = n_i / Σ_{j∈C} n_j,

where a hyperparameter β is tuned for the normalization between classes. Thus, the composition vector R = [R_1, ..., R_C], which indicates the distribution of the raw data, can be obtained. Additionally, the Kullback-Leibler (KL) divergence can be employed to assess the class imbalance of each edge node against U, the uniform vector over the C classes:

D_KL(R ‖ U) = Σ_{i=1}^{C} R_i log(R_i / U_i).

During FL training, after updating the model, the server has the capability to retrieve the local model from each client device. By employing the class estimation approach, we can unveil the composition vector R_k for the chosen client k, from which a reward r_k is defined for that client; the class distribution can be unveiled based on the composition vector. Let R_k(t) denote the composition vector of client k at time slot t. The class ratio can then be approximated through the sample mean of the composition vectors,

R̄_k = (1/t) Σ_{τ=1}^{t} R_k(τ).

With the estimated composition vector R and reward r of each client, we can design the client selection scheme with minimal class imbalance according to Algorithm 1.
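The core idea of the KL-based selection step can be sketched as follows. This is a simplified illustration of Algorithm 1's selection criterion, not the paper's exact algorithm: the client names, composition vectors, and the helper `select_clients` are all hypothetical, and the reward term is omitted.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists summing to 1."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def select_clients(compositions, m):
    """Pick the m clients whose estimated composition vector R_k is
    closest to the uniform distribution U (smallest KL divergence),
    i.e. the clients with the least local class imbalance."""
    C = len(next(iter(compositions.values())))   # number of classes
    U = [1.0 / C] * C                            # uniform reference vector
    ranked = sorted(compositions,
                    key=lambda k: kl_divergence(compositions[k], U))
    return ranked[:m]

# hypothetical estimated composition vectors for 3 clients, 2 classes
R = {"c1": [0.5, 0.5], "c2": [0.9, 0.1], "c3": [0.6, 0.4]}
print(select_clients(R, 2))  # ['c1', 'c3'] -- the two most balanced clients
```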
Data augmentation is a technique used to increase the amount of data available for analysis [66]. It is a set of algorithms that construct synthetic data from an available dataset by creating modified copies of existing data or by generating new synthetic data from the current data. Synthetic data commonly involve introducing minor variations in the data, to which the model's predictions should remain invariant [73]. Additionally, synthetic data can capture combinations of distant examples that might be challenging to infer otherwise. Data augmentation serves as a valuable tool to enhance training by supplying machine learning models with a more diverse and representative dataset, thereby improving accuracy and robustness [66,70]. This technique has been extensively studied and applied in various fields [66]. Regularization is a technique commonly used during the training of machine learning models to prevent overfitting. It helps to reduce the variance of the model by adding a penalty term to the loss function, which discourages the model from fitting the noise in the training data.
In FL, the computation is performed at the client side, so it is better for edge nodes to bear a lower computation cost. Linear regression algorithms have low computational requirements [74] compared to more complex models such as SVM, Random Forest, and DL [75-78], making them suitable for large datasets or scenarios where computational resources are limited. In an FL setting, LR takes only 7.6 s per training round, while Random Forest (RF) takes 515 s and SVM takes 4989 s [74]. Thus, we used this algorithm for local training on the clients' data. Moreover, this algorithm produces outcomes comparable to convolutional neural networks (CNNs) while minimizing computational cost. LR is well-suited for scenarios where both the dependent and independent variables are continuous, such as in the analysis of stock market datasets.
The value of a variable can be predicted through linear regression analysis [79,80], which is based on the value of another variable. The predictability of the dependent variable is a crucial aspect to consider. By manipulating the independent variable, a hypothesis can be formulated regarding the anticipated value of the dependent variable [75]. This type of analysis uses one or more independent variables to predict the value of the dependent variable, and the coefficients of the linear equation are calculated from this prediction. It involves fitting a line or surface to minimize the discrepancies between the anticipated and observed output values. A collection of paired data can be utilized to create a simple linear regression model that employs the "least squares" approach to determine the best-fit line:

y = β0 + β1 x + ε

The variable y is the predicted value of the dependent variable for a given value of the independent variable x [81]. β0 represents the intercept, the predicted value of y when x equals 0, while β1 is the regression coefficient that indicates the expected change in y as x increases. The variable x is the independent variable, as it is expected to influence y. The term ε represents the error of the estimate, which quantifies the amount of variation present in the estimate of the regression coefficient.
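The least-squares fit of β0 and β1 has a well-known closed form, sketched below as a minimal self-contained illustration (not the paper's implementation, which runs inside the FL training loop):

```python
def fit_least_squares(xs, ys):
    """Closed-form simple linear regression: choose beta0, beta1
    minimizing the sum of squared residuals (y - beta0 - beta1*x)^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    beta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    beta0 = mean_y - beta1 * mean_x              # line passes through the means
    return beta0, beta1

# noiseless example: y = 2 + 3x is recovered exactly
xs = [0, 1, 2, 3]
ys = [2, 5, 8, 11]
b0, b1 = fit_least_squares(xs, ys)
print(round(b0, 6), round(b1, 6))  # 2.0 3.0
```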
LR aims to determine the line of best fit for a given set of data. This is achieved by identifying the regression coefficient (β1) that minimizes the total error (e) of the model. Linear regression commonly employs the mean squared error (MSE) as a metric to evaluate the accuracy of the model. The MSE is computed through:
• a calculation of the deviation of the observed y-values from the predicted y-values for each corresponding x-value;
• a calculation of the square of each of these distances;
• a calculation of the mean of the squared distances.
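The three steps above, together with the MAE and RMSE metrics reported in the abstract, can be written directly (a minimal sketch with made-up values, not the paper's evaluation code):

```python
def mse(y_true, y_pred):
    """Mean squared error: deviations, then squares, then their mean."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute deviations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return mse(y_true, y_pred) ** 0.5

y_true = [10.0, 12.0, 14.0]
y_pred = [11.0, 12.0, 12.0]
print(mse(y_true, y_pred), mae(y_true, y_pred))  # 1.6666666666666667 1.0
```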

Federated Averaging (FedAvg) for Model Aggregation
The benchmark FedAvg algorithm was utilized for global model aggregation. A subset of the federation's members, consisting of clients/devices, is randomly selected to receive the initial global model synchronously, as outlined in Algorithm 2 [82]. In the current round of training, the local model of each selected client is updated by utilizing local data. The process of updating the server with client information is described in references [82,83]. To enhance the overall model, the server computes the average of all updates received from the clients. The process is repeated with additional rounds of training until the model parameters reach convergence, as determined by appropriate criteria.
Algorithm 2: FedAvg (federated averaging). There are n clients; B is the local minibatch size, E is the number of local epochs per communication round, η is the learning rate, and ƒi is the local loss function.

Server executes:
    initialize ω_0
    for each round t = 0, 1, ... do
        for each client i = 0, ..., n − 1 in parallel do
            ω_{t+1}^i ← ClientUpdate(i, ω_t)
        ω_{t+1} ← Σ_i (n_i / n) ω_{t+1}^i

ClientUpdate(i, ω):
    for each local epoch from 1 to E do
        for each local minibatch b of size B do
            ω ← ω − η ∇ƒi(ω; b)
    return ω

Gradient descent occurs on the client side, while the aggregation of the averaged clients' updates takes place on the server; i indexes the clients, and η is the learning rate. The level of client computation is regulated by three crucial parameters: the fraction of clients that engage in computation during each round, the local minibatch size (B), and the number of local epochs (E).
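The server-side aggregation step of FedAvg is a weighted average of the client parameter vectors, weighted by each client's sample count. The sketch below illustrates only that step, with hypothetical two-parameter models (e.g., [β0, β1] from local linear regression):

```python
def fedavg(client_weights, client_sizes):
    """Aggregate client model parameters as a weighted average,
    with weights proportional to each client's sample count n_i / n."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
            for j in range(dim)]

# two clients holding 30 and 10 samples, each returning [beta0, beta1]
updates = [[1.0, 2.0], [3.0, 6.0]]
sizes = [30, 10]
print(fedavg(updates, sizes))  # [1.5, 3.0]
```

The larger client contributes proportionally more, which is exactly the aggregation-bias concern raised in the Introduction for clients with unequal dataset sizes.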


Implementation
The proposed framework must be executed to achieve a balanced training process, as illustrated in Figure 3. To mitigate overfitting, data augmentation techniques are used to increase the amount of data, either by creating new synthetic data from the existing data or by adding modified copies of the current data. Future enhancements to this study may involve data synthesis techniques, whereby a novel dataset is generated from the existing one. The input is provided as .CSV data, which are used to generate a synthetic dataset using differential privacy (DP) techniques. Unique settings must be established for the model training and aggregation processes. The sequential workflow of the proposed scheme is elucidated in Figure 3. The collected data are transformed into federated data through an automated process and distributed randomly among all clients. Subsequently, the Bal-fed technique is applied.
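The conversion of a .CSV file into randomly distributed client shards can be sketched as follows. This is an illustrative stand-in for the paper's automated process, using only the Python standard library; the function name and shard layout are our assumptions:

```python
import csv
import io
import random

def to_federated(csv_text, n_clients, seed=0):
    """Read (Date, Close) rows from CSV text and randomly distribute them
    among n_clients shards, mimicking the automated conversion of collected
    data into federated data described above."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rng = random.Random(seed)   # fixed seed keeps the partition reproducible
    rng.shuffle(rows)
    shards = [[] for _ in range(n_clients)]
    for i, row in enumerate(rows):
        shards[i % n_clients].append(row)  # round-robin over shuffled rows
    return shards

data = "Date,Close\n2013-01-01,10.5\n2013-01-02,10.8\n2013-01-03,11.1\n2013-01-04,10.9\n"
shards = to_federated(data, 2)
print([len(s) for s in shards])  # [2, 2]
```

In a real deployment each shard would reside on a distinct edge node; here the shards simply model that distribution for simulation.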
A stock market dataset is used to evaluate the model's efficacy in predicting stock prices. In this study, American Airlines Group (AAL), Adobe (ADBE), Autodesk Inc. (ADSK), and Boston Scientific Corporation (BSX) were the primary subjects of investigation. The dataset was converted into randomly distributed datasets to make it suitable for the FL framework. Bal-fed handles both numerical and textual stock market data for predictive modeling tasks, which can enhance the fitness of the model in the FL setting, particularly for a diverse range of problems.
In this study, we employed the Flower framework [83] for FL [84] and other computational tasks. Developers can simulate integrated FL algorithms on their own data using Flower, and novel algorithms can also be tested. The framework provides comprehensive examples that help researchers identify suitable starting points for various types of research. Machine learning models are developed via Python code, which is easily human-readable.

Results and Discussion
This section comprises the findings of the present study. We present numerical results to showcase the effectiveness of the proposed algorithms. Our technique was evaluated on stock market data.

Evaluation Measure
The evaluation of an algorithm's performance is typically represented by a confusion matrix, which provides insight into the occurrence of errors. The matrix shows how many predicted outcomes from the test data correspond to the correct class and how many are incorrectly assigned to other classes. The entries of the matrix are used to compute the evaluation metrics of the algorithms. The quality of the classifier is evaluated using the commonly used accuracy metric [85], defined as:

Accuracy = (TP + TN) / N

where N is the sum of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). TN refers to the number of correctly identified negative cases, FN is the number of positive cases incorrectly identified as negative, TP is the number of correctly identified positive cases, and FP is the number of negative cases incorrectly identified as positive [76].
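The accuracy computation from the confusion-matrix entries is straightforward; a minimal sketch (function name ours):

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / N, where N = TP + TN + FP + FN
    is the total number of evaluated cases."""
    n = tp + tn + fp + fn
    return (tp + tn) / n

# 40 true positives and 45 true negatives out of 100 cases.
print(accuracy(tp=40, tn=45, fp=10, fn=5))  # 0.85
```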
The mean squared error (MSE), mean absolute error (MAE), and R-squared metrics are primarily used to assess the prediction error rates and the performance of the model in regression analysis [86–88]. The relative MSE is also calculated to provide greater insight into the model's generalization. MAE is the average absolute difference between the original and predicted values over the dataset. MSE is the average squared difference between the original and predicted values over the dataset. R-squared (the coefficient of determination) represents how well the predicted values fit the original values; its value from 0 to 1 is interpreted as a percentage. The relative MSE considers the percentage error rather than the absolute error, thus providing insight into how well a model generalizes across datasets. A lower relative MSE indicates that the model is robust and performs well across a range of data samples.
These metrics are defined as:

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|,  MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2,  R^2 = 1 − Σ_i (y_i − ŷ_i)^2 / Σ_i (y_i − ȳ)^2

where y represents the true price, ŷ represents the predicted price, and n denotes the number of samples in the test dataset. Similarly, the relative MSE can be defined as:

Relative MSE = Σ_i (y_i − ŷ_i)^2 / Σ_i (y_i − ȳ)^2

where ȳ is the mean, computed by ȳ = (1/n) Σ_{i=1}^{n} y_i. The training time is quantified to examine the latency in machine learning model training. This duration is influenced by factors such as the model's complexity, the dataset size, and the efficiency of the processing framework. Reducing training delays provides real-time advantages for prediction and classification.
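The evaluation metrics can be computed directly from the definitions above. The sketch below is illustrative (the relative-MSE form, normalized by the variation around the mean, is our reading of the definition, since the exact equation is not reproduced here):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, R-squared, and relative MSE for a test set."""
    n = len(y_true)
    mae = sum(abs(a - b) for a, b in zip(y_true, y_pred)) / n
    mse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n
    rmse = math.sqrt(mse)
    y_mean = sum(y_true) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - y_mean) ** 2 for a in y_true)
    r2 = 1 - ss_res / ss_tot
    # Relative MSE normalized by variation around the mean (assumed form);
    # note that under this reading, R^2 = 1 - relative MSE.
    rel_mse = ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "RelMSE": rel_mse}

m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(round(m["MAE"], 3), round(m["R2"], 3))  # 0.15 0.98
```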

Analysis of Results
Real-time data were collected from the Y-Finance API for this experimental study. The present study analyzes the financial data of prominent stock market companies with significant market capitalizations, namely American Airlines Group (AAL), Adobe (ADBE), Autodesk Inc. (ADSK), and Boston Scientific Corporation (BSX). The retrieved data were organized chronologically and continuously from 1 January 2013 to 1 February 2023. The data for each stock were stored in separate CSV files, each containing 2517 records. Each entry comprised three variables: the date of the recorded observation, the closing price of the asset, and the corresponding prediction. The data were transformed into a federated dataset through an automated technique. The model presented in this study was developed using linear regression [88] within the Flower framework. The evaluation functions used were R-squared, MSE, MAE, and RMSE, and the detailed results are presented in Table 1. The minimum number of edge nodes was set to 20. Each node trains a linear regression model with class estimation and balancing (Algorithm 1) on its own data and subsequently transmits the gradient of the model's loss to the server. The FL server employs the FedAvg algorithm (Algorithm 2) to update the model parameters.
To update each local model, the updated parameters are distributed to the edge nodes. This iterative process is executed, without data sharing, until the training termination requirement (100 rounds of local data training) is met. The model is trained and refined through multiple iterations to attain an optimal state where further training does not significantly improve performance. Each client processes its own local data independently in a decentralized, distributed training approach. We partitioned the data 90%/10% for training and testing, respectively. The resulting data frame consisted of columns labeled Date, Open, and Close. The outcomes of the proposed methodology for predicting stock data are illustrated via a line graph in Figure 4. The predicted values align with the actual values, demonstrating consistency between the model's projections and the observed outcomes. A linear model was then fitted to the data and displayed, followed by the formulation of observations. The scatter graph in Figure 5 displays the fitted model. The experiment comprised 20 clients and 100 communication rounds, with each communication round consisting of 5 epochs. The number of communication rounds could be reduced with this dataset because a high accuracy rate of 95% was achieved within only 75 rounds, thereby reducing the learning time and communication cost. MAE represents the average absolute difference between predicted and actual values, offering insight into the average magnitude of errors; a lower MAE is preferable. MSE is the average of squared differences between predicted and actual values, assigning more weight to larger errors; lower MSE values indicate better performance. RMSE, the square root of MSE, measures the average magnitude of errors in the same units as the target variable, providing another assessment of predictive performance. R-squared gauges how well the model's predictions explain the variance in the actual data, ranging from 0 to 1, with 1 signifying a perfect fit. In Table 1, the R-squared values are relatively high (e.g., 0.97, 0.98) in most cases, suggesting that the model captures the variation in stock prices well.
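As a concrete illustration of the client-side step, the sketch below (our own simplified code, not the paper's implementation) fits a simple linear regression, close ≈ w·t + b, by minibatch gradient descent, mirroring a ClientUpdate pass over E local epochs with minibatches of size B:

```python
import random

def client_update(data, w, b, lr=0.01, epochs=5, batch_size=2, seed=0):
    """One client's local training: fit y ~ w*x + b by minibatch gradient
    descent on the squared-error loss, then return the updated parameters
    (which would be sent back to the FL server)."""
    rng = random.Random(seed)
    for _ in range(epochs):
        batch = list(data)
        rng.shuffle(batch)
        for i in range(0, len(batch), batch_size):
            mb = batch[i:i + batch_size]
            gw = sum(2 * (w * x + b - y) * x for x, y in mb) / len(mb)
            gb = sum(2 * (w * x + b - y) for x, y in mb) / len(mb)
            w -= lr * gw
            b -= lr * gb
    return w, b

# Noise-free line y = 2x: running many local epochs (standing in for
# repeated communication rounds) drives w toward 2 and b toward 0.
data = [(x / 10, 2 * x / 10) for x in range(10)]
w, b = client_update(data, w=0.0, b=0.0, lr=0.1, epochs=300)
```

The hyperparameter values here (lr, epochs, batch_size) are illustrative; the paper specifies only 5 epochs per communication round and does not report the learning rate.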
For stock AAL during the 80th training iteration, R² is 0.69, MAE is 5.70, MSE is 116.347, and RMSE is 10.786. This suggests that, in the 80th training round for stock AAL, the model had a moderate R-squared value, indicating a moderate fit, as also presented in Figure 4a, while the prediction errors had relatively low magnitudes (as indicated by the low MAE, MSE, and RMSE values). The model was trained for 100 iterations, although higher accuracy and lower error measures began to be reported at 80 iterations, which made the accuracy of Bal-fed much higher. By the 100th iteration, R-squared values were reported near 0.98, indicating a stronger fit (as seen in Figure 6a). In the beginning, the R² values for AAL and BSX fluctuated until the 61st and 40th iterations, respectively, but after that, the values stabilized at higher levels. The MSE, MAE, and RMSE continued to decrease as the number of iterations increased, as depicted in Figure 6. The final MSE, MAE, and RMSE values are 122.03, 4.89, and 11.04 for AAL; 457.35, 17.79, and 21.38 for ADBE; 182.78, 5.81, and 13.51 for ADSK; and 34.50, 4.87, and 5.87 for BSX, respectively. These values demonstrate the positive results of the proposed approach with minimal data loss, while the R-squared values for AAL, ADBE, ADSK, and BSX were 0.95, 0.98, 0.98, and 0.98, respectively.

Comparative Analysis
To compare the efficacy of the Bal-fed technique with an FL approach lacking class estimation and balancing, we use a simple federated learning (FL) setup with data randomly distributed across 20 clients. The data are distributed without balancing classes, and linear regression is used for client-side training without class estimation. For global model aggregation, the FedAvg technique (Algorithm 2) is employed. The resulting R-squared values are 0.66, 0.79, 0.78, and 0.77 for AAL, ADBE, ADSK, and BSX, respectively.
Comparing this performance with that of the proposed Bal-fed technique, which employs class estimation and balancing (Algorithm 1), the results shown in Table 2 are observed.
The results suggest that the Bal-fed technique with LR and balanced classes achieves higher accuracy (95.01%) than the FL technique without class estimation (79.6%). Additionally, Bal-fed results in lower data loss and a slightly longer training time, indicating a trade-off between accuracy and processing time. The lower R-squared values of the plain FL technique may be attributed to data imbalance in the edge data. The predictive values achieved a 95% accuracy rate with minimal data loss, as demonstrated in Table 2. The statistical metrics of MSE, relative MSE, MAE, RMSE, and R-squared show that the Bal-fed model predicts stock price data with sufficient accuracy. In fact, its performance surpasses that reported in the literature, where a maximum accuracy of 85% was achieved [30].

Conclusions and Future Work
In the context of centralized machine learning, the consolidation of all local data onto a single server poses significant privacy concerns. FL is a machine learning approach in which users' data are used to train a model locally, and only the outcomes or trained models, rather than the confidential data themselves, are uploaded to the server. This approach exhibits greater efficiency in terms of generalization, addressing privacy concerns, and ensuring the accuracy of the system. However, a significant issue arising from FL is data imbalance. To address this issue, we employed a class estimation and data balancing technique. We developed a novel approach for managing class distribution in which augmented data are generated automatically and distributed separately to each client for local model training. This method updates gradients without requiring access to customers' data. Furthermore, the proposed approach implements a class estimation and data balancing mechanism to mitigate the adverse effects of data imbalance. The Bal-fed technique was applied to AAL, ADBE, ADSK, and BSX stock price data collected over the last 10 years. Bal-fed is trained iteratively in a decentralized manner, without sharing data between iterations, until a termination condition (100 rounds of local data training) is met. The data are partitioned into training and testing sets to evaluate the model's performance. This technique has yielded favorable outcomes with limited data loss. R-squared values are relatively high (e.g., 0.97, 0.98) in most cases, suggesting that the model captures the variation in stock prices well. The model exhibits strong fits within 75 to 80 iterations for the stocks (ADBE, ADSK, BSX) with consistently high R-squared values, indicating accuracy. While AAL shows a moderate fit in earlier rounds, its R-squared value improves to 0.95 by the 100th iteration.

Figure 1 .
Figure 1. Federated learning approach in which edge devices train the model with their local data and send the trained model to the FL server. The FL server then aggregates the models into a global model and sends the updated model back to the edge nodes.

Figure 2 .
Figure 2. Proposed scheme for FL in the scenario of edge networks to reduce the data skew problem.

Algorithm 1 :
Class Balancing Algorithm
Initialize: S_t ← ∅, R_total ← ∅
k_0 ← arg max_k r̂_k; S_t ← S_t ∪ {k_0}
while |S_t| < K do
  k_min ← arg min_{k ∈ K\S_t} D_KL((R_total + R̄_k) || U)
  S_t ← S_t ∪ {k_min}; R_total ← R_total + R̄_{k_min}
end while
Output: S_t
ClientUpdate(i, ω): // run on client i
for each local epoch e = 0, …, E − 1 do
  for each minibatch b of size B do
    ω_{e+1} ← ω_e − η∇f_i(ω_e; b)
return ω_E to server
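One possible reading of the selection loop in Algorithm 1 is sketched below in plain Python. This is illustrative only: we interpret r̂_k as a client's local data size and R̄_k as its per-class sample counts, which are our assumptions about the notation; U is the uniform distribution over the classes:

```python
import math

def kl_to_uniform(counts):
    """D_KL(p || U): divergence between the normalized class-count vector
    and the uniform distribution over the same classes."""
    total = sum(counts)
    k = len(counts)
    return sum((c / total) * math.log((c / total) * k) for c in counts if c > 0)

def select_clients(client_class_counts, K):
    """Greedy client selection in the spirit of Algorithm 1: start from the
    client with the most data, then repeatedly add the client whose class
    counts bring the running total R_total closest (in KL) to uniform."""
    # k0 = arg max over clients of local data size (our reading of r-hat_k)
    k0 = max(client_class_counts, key=lambda c: sum(client_class_counts[c]))
    selected = [k0]
    r_total = list(client_class_counts[k0])
    while len(selected) < K:
        candidates = [c for c in client_class_counts if c not in selected]
        k_min = min(candidates, key=lambda c: kl_to_uniform(
            [t + r for t, r in zip(r_total, client_class_counts[c])]))
        selected.append(k_min)
        r_total = [t + r for t, r in zip(r_total, client_class_counts[k_min])]
    return selected

# Four clients with different class balances: c2 best complements c1,
# since their combined class counts [100, 90] are nearly uniform.
clients = {"c1": [90, 10], "c2": [10, 80], "c3": [40, 40], "c4": [60, 0]}
print(select_clients(clients, 2))  # ['c1', 'c2']
```

The greedy KL criterion steers the aggregate class distribution of the selected cohort toward uniform, which is the balancing effect the algorithm is designed to achieve.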

Figure 3 .
Figure 3. Schematic workflow sequence to evaluate the prediction performance of Bal-fed.

Figure 4 .
Figure 4. Prediction of stock prices using the Bal-fed balancing scheme: (a) predicted price of AAL; (b) predicted price of ADBE; (c) predicted price of ADSK; (d) predicted price of BSX.

Figure 5 .
Figure 5. The fitted results of stock price prediction with respect to the actual price in scatter graphs: (a) predicted vs. actual prices of AAL; (b) predicted vs. actual prices of ADBE; (c) predicted vs. actual prices of ADSK; (d) predicted vs. actual prices of BSX.

Table 1 provides a detailed snapshot of how the predictive model performs for each stock and iteration, providing insight into the model's accuracy and precision across different training scenarios. ADBE shows consistently high R-squared values (around 0.98), indicating a strong fit of the model to the actual data; its MAE values range from 15 to 17.96, representing the average magnitude of errors in predicting Adobe's stock prices, and the MSE and RMSE values provide further insight into the magnitude of those errors. ADSK also demonstrates high R-squared values (above 0.977), indicating a good fit of the model to the data; the model improves in terms of MAE over the training rounds, decreasing from 9.78 to 6.78, and the MSE and RMSE values show the overall accuracy and precision of the model in predicting Autodesk's stock prices. BSX exhibits high and consistent R-squared values (above 0.984), indicating a strong correlation between predicted and actual stock prices; its MAE values vary, with the lowest at 5.03, suggesting relatively small prediction errors, and the MSE and RMSE values provide additional insight into the accuracy and precision of the model for Boston Scientific. To summarize, for all the stocks except AAL (i.e., ADBE, ADSK, BSX), the model demonstrates high R-squared values, indicating a strong fit. The variations in MAE, MSE, and RMSE across training rounds provide additional detail about the model's performance in predicting stock prices for these specific companies.

Figure 6 .
Figure 6. Result of analysis measures with respect to each iteration count over local data using the Bal-fed model: (a) value of MAE with respect to each count; (b) value of MSE with respect to each count; (c) value of R-squared with respect to each count; (d) value of RMSE with respect to each count.

Table 1 .
Subset of the variations of R-squared, RMSE, MAE, and MSE with respect to training iteration in stock data.

Table 2 .
Results of stock data applied to evaluate the prediction performance of Bal-fed.