Anomaly Detection in Multi-Host Environment Based on Federated Hypersphere Classifier

Abstract: Detecting anomalous inputs is essential in many mission-critical systems across various domains, particularly cybersecurity. In particular, anomaly detection methods based on deep neural networks have been successful with the recent advancements in deep learning technology. Nevertheless, the existing methods have considered somewhat idealized problems where it is enough to learn a single detector based on a single dataset. In this paper, we consider a more practical problem where multiple hosts in an organization collect their own input data, while data sharing among the hosts is prohibited for security reasons, and only a few of them have experienced abnormal inputs. Furthermore, the data distribution of the hosts can be skewed; for example, a particular type of input may be observed by only a limited subset of hosts. We propose the federated hypersphere classifier (FHC), a new anomaly detection method based on an improved hypersphere classifier suited for the federated learning framework, to perform anomaly detection in such an environment. Our experiments with image and network intrusion detection datasets show that our method outperforms state-of-the-art anomaly detection methods trained in a host-wise fashion, by learning a consensus model as if we had access to the input data from all hosts but without communicating such data.


Introduction
Anomaly detection is the task of identifying abnormal instances that are dissimilar to normal instances and therefore do not conform to the patterns expected in normal situations [1]. It plays a crucial role in various mission-critical applications such as network intrusion detection [2][3][4], system behavior monitoring [5][6][7], disease detection in medical domains [8][9][10][11], and defect detection in manufactured products [12][13][14], to name a few. Anomaly detection is attracting particular attention in cybersecurity, where the necessity and importance of related research keep growing as cyber-attacks evolve rapidly in volume, velocity, and variety and tend to become harder to detect [15,16].
Before the advent of machine learning, statistical methods such as profile-based [17] and rule-based methods [18] were widely used for anomaly detection. Machine learning methods are now replacing these classical approaches. In particular, the one-class SVM (OC-SVM) [19] was one of the first attempts to adapt binary classifiers for anomaly detection, finding the optimal hyperplane that characterizes normal data. The support vector data description (SVDD) [20] changes the shape of the decision function from a hyperplane to a hypersphere and finds the optimal hypersphere that circumscribes the normal data. Other popular approaches include the isolation forest, based on the random forest [21], and PCA-based anomaly detection [22].
More recently, anomaly detection methods based on deep neural networks (DNNs) [23] have been successful, outperforming their predecessors in various domains. In particular, Ruff et al. [24] suggested substituting the kernel function in SVDD with a DNN to provide the input transformation necessary for optimal decision making. This work initiated a new wave of research designing anomaly detection methods that benefit from advances in deep learning. Deep learning-based anomaly detection has been studied particularly actively in the cybersecurity domain, for example, intrusion detection for cyber-physical systems [25][26][27], attack detection in smart grids [28,29], and vehicle network intrusion detection [30][31][32], to name a few.
Most existing works on anomaly detection assume an environment where all training data reside in one place, such as a single host or a storage unit of a data center. However, recent technologies such as smart grids [16,33,34] and IoT devices [35,36] have increased the need for training AI models in a distributed fashion. In particular, data centralization can cause considerable network traffic and monetary cost for data storage [37]. Another issue of data centralization is that it can impair confidentiality or data privacy [38].
Therefore, this paper proposes a novel anomaly detection method suitable for distributed environments where training data exist on multiple hosts and sharing training data is not feasible. To be more specific, each host in such an environment tends to have far less data than one could gather by collecting the data from all hosts into one location [39,40]. In addition, we assume that abnormal inputs are scarce; in particular, many hosts may never have experienced an intrusion and therefore have no abnormal instances in their datasets. Moreover, a host may not have experienced all types of inputs, having a skewed distribution of training instances [41]. Our method is based on federated learning [42] and an improved version of the hypersphere classifier (HSC) [43] that we have tailored for the federated learning environment, so that we can learn a better anomaly detector in the multi-host environment discussed above. Our contributions can be summarized as follows:
• We propose the federated hypersphere classifier (FHC), a novel federated learning-based anomaly detection method for a multi-host environment where data sharing is limited, the data distributions of hosts are skewed, and only a few hosts contain anomaly data.
• We introduce a new version of the hypersphere classifier suited for federated learning. By modifying the objective function to include the radius variable, it is possible to find an optimal consensus radius, which is necessary for decision making in anomaly detection.
• We demonstrate our proposed method in a multi-host environment where the data distributions of hosts are skewed and only a few hosts contain anomaly data. The results show that our method detects anomalies far more accurately than the state-of-the-art single-host alternatives.

Related Works
This section briefly introduces related works in anomaly detection and the federated learning framework.

Anomaly Detection
We discuss anomaly detection methods, grouping them into two categories depending on whether they use deep learning technology.

Classical Anomaly Detection
In the early days, statistical profiling [17] and rule-based methods [18] were frequently used for anomaly detection. Statistical profile-based anomaly detection profiles normal behaviors and flags anomalies, or uncharted behaviors, when they deviate from the normal profile with statistical significance. Rule-based anomaly detection captures suspicious circumstances using data or patterns that differ from normality according to predetermined rules, and it has been widely applied to intrusion detection systems.
Before deep learning came into the spotlight, there were studies on machine learning-based anomaly detection methods. Schölkopf et al. [19] proposed the one-class SVM (OC-SVM), based on the support vector machine (SVM) algorithm [44] with a kernel function, to position the normal data away from the origin and find the hyperplane that distinguishes the normal data from anomalies. Tax and Duin [20] proposed the support vector data description (SVDD), which uses a kernel function to map normal data into a hypersphere with minimum radius. SVDD framed anomaly detection as the minimum enclosing ball (MEB) problem, minimizing the hypersphere circumscribing the normal data in the latent space. In another direction, the isolation forest [21], based on the random forest, has been proposed. The isolation forest takes advantage of the fact that abnormal data are isolated with fewer splits, and thus shorter paths in each tree, than normal data. Furthermore, there have been studies on anomaly detection methods based on principal component analysis (PCA) [45]. Given high-dimensional input, PCA finds a subspace that retains the highest variance, and one can detect anomalies when the reconstruction of a data point from the subspace is far from the original. Lakhina et al. [22] and Ringberg et al. [46] used PCA for modeling normal traffic and detecting outliers.
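To make the reconstruction-error idea concrete, the following is a minimal NumPy sketch of PCA-based anomaly scoring (not from the cited works; the function name and parameters are illustrative): fit the top-k principal components on normal data and score test points by how far they are from their reconstruction.

```python
import numpy as np

def pca_anomaly_scores(X_train, X_test, k=2):
    # Fit the top-k principal directions of the (normal) training data,
    # then score each test point by its reconstruction error.
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    P = Vt[:k]                      # top-k principal directions (k x d)
    Z = (X_test - mu) @ P.T         # project onto the subspace
    X_hat = Z @ P + mu              # reconstruct in the original space
    return np.linalg.norm(X_test - X_hat, axis=1)

# Normal data lie in the xy-plane; a point far along z gets a high score.
X_train = np.array([[1., 0., 0.], [0., 1., 0.], [-1., 0., 0.],
                    [0., -1., 0.], [1., 1., 0.], [-1., -1., 0.]])
X_test = np.array([[0., 0., 5.], [1., 1., 0.]])
print(pca_anomaly_scores(X_train, X_test, k=2))  # first score ~5, second ~0
```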

Deep Learning-Based Anomaly Detection
Recently, as deep neural networks have achieved remarkable success in various domains, numerous anomaly detection methods based on deep neural networks have been studied. One direction is to use deep auto-encoders (AEs) [47]. AE-based anomaly detection methods train the AE only on normal data, so that the reconstruction error will be high when a given input does not fit the characteristics of normal inputs. This type of method has been further developed, for example, to use variational autoencoders [48], apply a Gaussian mixture model to a neural network [49], and learn normality regularized by a memory network in the AE [50]. On the other hand, Ruff et al. [24] proposed DeepSVDD, which uses a neural network in place of the feature mapping of SVDD so that a suitable representation of normal data can be learned directly, without relying on kernel tricks [51] to represent feature mappings implicitly. Building on DeepSVDD, Hojjati and Armanfard [52] recently proposed DASVDD, an unsupervised anomaly detector that optimizes the AE and the hypersphere simultaneously.
Furthermore, deep generative models have been adopted for robust anomaly detection. For example, Schlegl et al. [53] proposed AnoGAN, using generative adversarial nets (GANs) [54] to learn a generator that can create new examples similar to the normal data. Despite its benefits, AnoGAN needs to find, for each detection, the optimal latent vector that generates the sample most similar to the given input; this procedure is inherently iterative and computationally expensive. Schlegl et al. [55] addressed this computational issue by using an auxiliary AE to perform the inversion from an input to a latent vector. Audibert et al. [7] addressed the issue that reconstruction error-based detection in AE-based methods is not sensitive enough to find anomalies similar to normal data. Goyal et al. [56] proposed DROCC, which augments anomaly data using adversarial perturbation [57]; however, the method requires tuning various hyperparameters so that the generated samples work as abnormal data.
More recently, several methods tried to bring the idea of semi-supervised learning [58][59][60] into anomaly detection. Hendrycks et al. [61] proposed the outlier exposure (OE) that uses an auxiliary dataset of outliers to train the model to detect unseen anomalies. The trend has been followed by the DeepSAD [62] and the hypersphere classifier (HSC) [43] methods. DeepSAD uses a modified objective function of DeepSVDD, adding a loss term on the anomaly data and adjusting the balance between the loss terms for the anomaly and normal data. HSC reformulates the objective for finding the minimum enclosing ball of normal data into another form similar to the binary cross-entropy loss [63] by adding a loss term for abnormal data and applying the radial basis function [64]. However, both DeepSAD and HSC heavily depend on hyperparameters for decision making (such as the center and the radius of the hypersphere) that require anomaly data for tuning. Therefore, when hosts without anomaly data exist, as in the multi-host environment we consider in this paper, one needs to make a good guess about these hyperparameters to apply these methods to those hosts. This paper suggests transforming the objective of HSC (we chose HSC since it performed better than DeepSAD in the multi-host setting) to include its hyperparameters as training variables so that they can be learned by federated learning.

Federated Learning
Federated learning [42,65] is a machine learning framework that allows multiple hosts to train a global model under the orchestration of a central server while keeping the training data of each host locally. Each host participating in federated learning shares the same learning objective and optimizes local model weights with their local data, and the central server aggregates local weights to compute global model weights.
FedAvg [42] is one of the most popular federated learning algorithms, aggregating global model weights by averaging the updated model weights from participating hosts. Let D_h = {X_h, Y_h} denote the local dataset composed of input data X_h ∈ ℝ^{N_h × d} and the corresponding true labels Y_h ∈ ℝ^{N_h}, let φ(·; θ_h) denote the neural network to train, and let θ_h denote the local weights of the h-th host. Each host solves the following learning problem:

    min_{θ_h} L_h(θ_h) := (1/N_h) Σ_{i=1}^{N_h} L(φ(x_i; θ_h), y_i),

where L(·) is a loss function. Assuming there are H different hosts, the objective of federated learning becomes:

    min_θ Σ_{h=1}^{H} (N_h/N) L_h(θ),

where N_h denotes the number of instances in the h-th host and N is the total number of data instances contained in the H hosts.
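The aggregation step of FedAvg can be sketched in a few lines of Python (a hypothetical helper for illustration, not a reference implementation): each host's updated weights are averaged with coefficients proportional to its data share N_h/N.

```python
def fedavg_aggregate(local_weights, local_sizes):
    # local_weights: one dict per host, mapping parameter name -> list of floats
    # local_sizes:   N_h for each host; the averaging weight is N_h / N
    total = sum(local_sizes)
    agg = {name: [0.0] * len(vals) for name, vals in local_weights[0].items()}
    for weights, n_h in zip(local_weights, local_sizes):
        for name, vals in weights.items():
            for j, v in enumerate(vals):
                agg[name][j] += (n_h / total) * v
    return agg

# A host with three times as much data contributes three times the weight:
hosts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
print(fedavg_aggregate(hosts, [1, 3]))  # {'w': [2.5, 3.5]}
```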
Since the concept of federated learning emerged, research on more advanced federated learning techniques has been conducted. Specifically, FedOpt [66] extended the FedAvg algorithm to federated versions of adaptive optimizers such as Adagrad [67] and Adam [68], and FedNova [69] added a normalized averaging method in the aggregation stage. These works commonly focus on overcoming the heterogeneity of the clients' data in federated learning. Furthermore, there have been attempts to overcome issues arising from non-IID data distributions by optimizing federated learning for individual client data through model personalization. For example, FedProx [70] introduced a proximal term into federated learning to minimize the distance between the global and local model weights. In addition, there have been attempts to modify the model structure into a form suitable for personalization, such as personalized models [71] or hypernetworks [72]. Nevertheless, the basic framework of these works does not deviate much from the concept of FedAvg.
Some studies have considered federated learning for anomaly detection. DIoT [73] applied federated learning to detect compromised IoT devices from their behavior profiles. MT-DNN-FL [74] is a multi-task network anomaly detection method based on federated learning that deals with the scarcity of available learning data. P2PK-SMOTE [75] suggested anomaly detection for the IoT environment, where host data may not follow an identical distribution or sampling may not be independent; it focused on balancing the number of normal and abnormal instances using the SMOTE method [76]. However, none of these studies has considered transforming the underlying anomaly detection method to fit for finding an optimal global detector by federated learning in the multi-host environment, where many hosts cannot optimize some variables critical for decision making due to the lack of anomaly data. To the best of our knowledge, no study exists addressing the issues of the existing single-host anomaly detectors when applied to a multi-host environment where only a few hosts include anomalous data. Table 1 compares the existing federated learning frameworks and anomaly detection methods with our method from several perspectives.

Methods
This section formally describes our proposed method FHC, a novel hypersphere classifier for anomaly detection in the multi-host environment based on federated learning.

Multi-Host Environment
We consider the multi-host environment with the following characteristics:
• Multiple hosts store a certain type of data available to train anomaly detectors to detect normal and abnormal inputs or activities.
• The hosts are connected in a network, where exchanging training data is prohibited for privacy or security reasons.
• All hosts contain normal data, whereas only a few hosts have abnormal samples due to the rarity of such events. Furthermore, the distribution can be skewed; for example, normal samples from a host may not cover all types of normal data.

Notation
Let D_h = {(x_i, y_i)} denote the local data of the h-th host with x_i ∈ ℝ^d and y_i ∈ {0, 1}, where y_i = 0 and y_i = 1 indicate normal and abnormal instances, whose numbers are n_h and m_h, respectively. Our method tries to learn a global model consisting of an embedding neural network f(x; θ_emb) and a hypersphere mapping neural network φ(f(x; θ_emb); θ_map), where θ_emb and θ_map are learning parameters. For the simplicity of our discussion in the sequel, we use the short-hand notation φ(x; θ) in place of φ(f(x; θ_emb); θ_map), considering θ := (θ_emb, θ_map), and also drop the subscript h in n_h and m_h.

Hypersphere Classifier
The hypersphere classifier (HSC) [43] is one of the latest anomaly detection methods that uses the minimum enclosing ball (MEB) of normal data for decision making, as in SVDD [20], while adopting the idea of outlier exposure [61] to use anomaly samples during training to improve detection performance. In particular, HSC uses a loss function similar to the regular binary cross-entropy loss, replacing the sigmoid function with the radial basis function ℓ(φ(x_i; θ)) = exp(−‖φ(x_i; θ)‖²) in order to obtain a spherical decision boundary. The HSC objective function is as follows:

    min_θ (1/(n+m)) Σ_i [ (1 − y_i) ‖φ(x_i; θ)‖² − y_i log(1 − exp(−‖φ(x_i; θ)‖²)) ]    (1)

If there are no abnormal data (with y_i = 1) in a host, expression (1) reduces to the objective function of DeepSVDD [24] with center c = 0. When abnormal data exist in the training data, the second term of (1) maximizes the distance between the center c = 0 and the abnormal data so that they are mapped outside the hypersphere. Still, a radius hyperparameter R, which determines the boundary between normal and abnormal data, needs to be tuned; this usually requires some abnormal data in the validation set to obtain good detection performance. However, as discussed in Section 2.1.2, if some hosts do not have abnormal data, it becomes difficult to tune the radius hyperparameter.
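A minimal numerical sketch of the HSC loss described above (assuming the binary cross-entropy form with the radial basis function; `hsc_loss` and its inputs are illustrative, not the authors' code):

```python
import math

def hsc_loss(dists_sq, labels):
    # dists_sq[i] = ||phi(x_i; theta)||^2, the squared distance to the center c = 0
    # labels[i]  = 0 for normal, 1 for abnormal
    total = 0.0
    for d, y in zip(dists_sq, labels):
        if y == 0:
            total += d  # pull normal points toward the center
        else:
            # push abnormal points away: -log(1 - exp(-d)) grows as d -> 0
            total += -math.log(1.0 - math.exp(-d))
    return total / len(dists_sq)

# An anomaly close to the center is penalized much more than one far away:
print(hsc_loss([0.5], [1]) > hsc_loss([5.0], [1]))  # True
```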

Proposed Method: Federated Hypersphere Classifier
In the multi-host environment discussed in Section 3.1, a straightforward approach would be to learn an anomaly detector for each host separately. However, this approach may not be ideal. First, each host cannot benefit from data stored in other hosts: the detection performance of machine learning methods tends to increase with the size of the training set, and a host may encounter unseen normal and abnormal types of data in the future, so it is beneficial to train with as many types of possible normal and abnormal data as available. Second, if a host has not experienced an anomaly, it becomes hard to tune critical learning parameters for that host, such as the radius of the HSC classifier. One may use a detector trained by another host, but that detector still suffers from the first issue discussed above.
As we have seen in Section 2.2, federated learning can serve as a good solution for the multi-host environment, so that each host can benefit from data stored in other hosts without communicating the data directly. We modified the objective of HSC for the federated learning environment as follows:

    min_{θ,c,R} (1/n) Σ_{i: y_i=0} max(0, d(x_i, c) − R) + (1/m) Σ_{i: y_i=1} max(0, R − d(x_i, c)),    (2)

where n and m denote the number of normal and abnormal instances of a host, respectively, and d(x_i, c) := ‖φ(x_i; θ) − c‖² is the distance between the latent vector of x_i and the center c. Our idea is to include both the center c and the radius R of the minimum enclosing ball in the training objective, so that these parameters can be learned by federated learning, benefiting from the data in all hosts. The first term of the objective forces d(x_i, c) ≤ R for normal data, and the second term enforces d(x_i, c) ≥ R for abnormal data, so that detection can be done accurately using the distance from the center c in the latent space with the threshold R. Algorithm 1 shows the federated learning algorithm based on FedAvg [42] and the modified HSC loss (2) to learn our proposed federated hypersphere classifier (FHC). Our algorithm trains a global model iteratively, where each iteration consists of two parts. In the first part, each host updates the local versions of the learning parameters using its training data, minimizing the objective function (2). When a host includes both normal and abnormal instances, we update the learning parameters in an alternating fashion, repeating the following steps:
• Step 1. Fix the radius R and update the model parameter θ and the center c so that d(x_i, c) is minimized for normal instances and maximized for abnormal instances.
• Step 2. Fix the model parameter θ and the center c and update the radius R.
For a host without abnormal instances, we instead use the fact that the optimal c and R can be computed in closed form as c = (1/n) Σ_{i=1}^{n} φ(x_i; θ) and R = (1/n) Σ_{i=1}^{n} d(x_i, c), reducing the computational cost.
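The closed-form computation of the center and radius can be sketched as follows (an illustrative helper operating on plain Python lists of latent vectors; the name and representation are assumptions):

```python
def closed_form_center_radius(latents):
    # c = mean of the latent vectors; R = mean squared distance to c
    n, dim = len(latents), len(latents[0])
    c = [sum(v[j] for v in latents) / n for j in range(dim)]
    sq_dist = lambda v: sum((v[j] - c[j]) ** 2 for j in range(dim))
    R = sum(sq_dist(v) for v in latents) / n
    return c, R

c, R = closed_form_center_radius([[0.0, 0.0], [2.0, 0.0]])
print(c, R)  # [1.0, 0.0] 1.0
```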
[Algorithm 1: FHC training — each host performs the local update described above; hosts without abnormal data compute the center and radius in closed form; the server then aggregates the local parameters, hypersphere centers, and radii into global versions.]

The second part of Algorithm 1 is the global model aggregation at the federation server. After the local training in each host, the server receives the local optima θ_h, c_h, and R_h, and it creates the consensus versions θ_g, c_g, and R_g by aggregating their host-wise counterparts. In particular, the consensus versions are computed as weighted averages in the following way:

    θ_g = Σ_{h=1}^{H} (N_h/N) θ_h,   c_g = Σ_{h=1}^{H} (N_h/N) c_h,   R_g = Σ_{h=1}^{H} (N_h/N) R_h,

where N denotes the total number of training instances in all hosts and N_h denotes the number of the h-th host's training instances. Figure 1 shows the overall training process of our proposed method.

Computational Cost Analysis
Algorithm 1 is based on the FedAvg [42] algorithm. The original FedAvg uses H hosts out of the total of H_total hosts in each aggregation, repeated for a total of C rounds. Each host runs minibatch SGD [77] with B local steps per round. The convergence result of FedAvg [65] requires assuming convexity and K-smoothness of the training objective function. When the variance of the stochastic gradients is bounded above by σ², the convergence rate of FedAvg is O(K/C² + σ/√(CBH)). In federated learning, we also need to consider computation and communication costs. The computation cost of each host differs depending on whether the host contains only normal data or both normal and abnormal data. When a host contains abnormal instances, we run SGD twice to update θ, c, and R. On the other hand, if no abnormal instance is available, a host can skip one SGD run and explicitly compute c and R. As a result, we can reduce the total computational cost from O(C·B·H_total) of the original FedAvg to O(C·B·H_ano), where H_total denotes the total number of hosts and H_ano is the number of hosts with abnormal instances.
To discuss the communication cost, let n_θ be the dimension of all model parameters combined and n_c be the dimension of c, both of which need to be communicated. FedAvg communicates model parameters twice per round: the central server receives the updated model parameters from each host and sends the aggregated parameters back to all hosts. The resulting communication cost is O(C(H + H_total)(n_θ + n_c)). In our case, if a host does not include abnormal instances, it only transmits the updated c and R. Therefore, the communication cost of FHC is O(C(H_total + H_ano)(n_θ + n_c)).

Experiments
In this section, we demonstrate the benefits of our proposed method FHC by experiments in the multi-host environment discussed in Section 3.1.

Data Preparation
For our experiments, we use four popular benchmark datasets, consisting of two image datasets and two network intrusion detection datasets. Here, we provide brief explanations of each dataset:
• The MNIST [78] dataset consists of 10 classes of handwritten digits, and each data instance is a gray-scale image with 28 × 28 pixels. The training set contains 60,000 instances, and the evaluation set contains 10,000 instances.
We use the original datasets to create anomaly detection problems reflecting the multi-host environment discussed in Section 3.1, constructing normal and abnormal data from the original datasets in the following fashion:
• MNIST and CIFAR-10: We select the class with the largest mean KL divergence from the other classes in the original datasets as the abnormal data; the rest of the classes are considered normal data. As a result, the digit 1 of MNIST and the 'Automobile' class of CIFAR-10 were chosen as the abnormal data.
• CICIDS-2017 and TON-IoT: The datasets are composed of multiple attack classes and one benign class. Therefore, we combine all attack classes and use them as abnormal data. Furthermore, we apply the k-means clustering algorithm [82] to the benign class to divide it into multiple groups of normal data, which we need in order to distribute normal classes among the hosts with minimal overlap.
Finally, we distribute the normal data among all hosts, choosing two (normal) classes for each host. We also assign abnormal data to only 20% of the hosts, setting their proportion in a selected host to 10% of the normal data, in order to simulate the rarity of abnormal instances. The test set for each host is created to contain normal and abnormal samples from the same distribution for every host, so as to test the detection performance against all possible types of future instances. Figure 2 shows an overview of our data preparation process. In addition, Table 2 summarizes the average number of normal instances per host, along with the average number of abnormal instances for each host selected to contain abnormal data.
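The host-wise data assignment described above can be simulated as follows (a hedged sketch of the splitting scheme; `assign_hosts` and its parameters are illustrative and not part of our released code):

```python
import random

def assign_hosts(normal_classes, num_hosts, anomaly_host_frac=0.2, seed=0):
    # Each host receives two normal classes; only a fraction of hosts
    # (20% by default) are marked to also receive abnormal data.
    rng = random.Random(seed)
    num_anomalous = max(1, round(num_hosts * anomaly_host_frac))
    anomalous_ids = set(rng.sample(range(num_hosts), num_anomalous))
    return [
        {"classes": rng.sample(normal_classes, 2), "has_anomalies": h in anomalous_ids}
        for h in range(num_hosts)
    ]

hosts = assign_hosts(list(range(9)), num_hosts=10)
print(sum(h["has_anomalies"] for h in hosts))  # 2 of 10 hosts hold abnormal data
```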

Comparisons and Other Settings
We compare our proposed method FHC with five state-of-the-art deep anomaly detection methods: DeepSVDD [24], DASVDD [52], DROCC [56], DeepSAD [62], and HSC [43], where DeepSVDD and DASVDD are unsupervised, DROCC synthesizes its own abnormal data, and DeepSAD and HSC are semi-supervised methods that take advantage of labeled abnormal data. These competing methods are trained host-wise; that is, a separate model is trained for each host using the host's own training dataset. Since anomaly detection methods are often hard to train with no anomaly data, for the hosts without abnormal data we select the best model among the host-wise models based on the validation F1-score [83] and use it for evaluation. For evaluation metrics, we use the area under the receiver operating characteristic curve (AUC) [84] and the F1-score.
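For reference, AUC can be computed directly as the probability that a randomly chosen abnormal instance receives a higher anomaly score than a randomly chosen normal one (a simple O(nm) sketch for illustration, not the evaluation code used in our experiments):

```python
def auc(scores, labels):
    # scores: anomaly scores (higher = more anomalous); labels: 1 = abnormal
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0 (perfect separation)
```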
For embedding data points, we use a convolutional neural network (CNN) model in the image datasets [24] and use the CNN-LSTM model [85] for the cybersecurity datasets. In addition, we use several fully connected (FC) layers to map embedded data to the hypersphere of our anomaly detector. Details of these networks and learning hyperparameters are described in Table 3.

Experimental Results
To demonstrate the performance of our proposed method FHC, we report the mean AUC and F1-score over all hosts' test sets in the multi-host environment discussed in Section 4.1. Table 4 shows the performance of FHC and the five competing anomaly detection methods on the MNIST, CIFAR-10, CICIDS-2017, and TON-IoT datasets. Overall, the performance of FHC is higher than that of the competing methods on all datasets; in particular, the AUC and F1-score of FHC are, on average, 9.3% and 72.3% higher than those of the second-best models, respectively. This shows that FHC is more effective than the other methods in the multi-host environment. Since FHC learns the consensus versions of c and R by benefiting from the data in all hosts, it has a good chance of achieving higher performance than anomaly detection methods trained host-wise, even when the host data distributions are irregular. In addition, Table 4 reports the standard deviations of the AUC and F1-score. Since all hosts see test data from the same distribution, it is desirable that a detection model show small variability in testing performance; the results show that FHC achieves the smallest deviation in almost all cases.
Furthermore, we observe that the AUC values of some competing methods are close to 0.5, which happens when the predictions are almost uniformly random. For instance, on the CICIDS-2017 dataset, the AUC values of the competing methods are close to 0.5, meaning that the models trained host-wise do not discriminate abnormal from normal data well in the multi-host environment. On the other hand, FHC records a higher AUC than the other methods: specifically, the AUC values of FHC are 8.2%, 2.5%, 24.8%, and 4.0% higher than those of the second-best models on MNIST, CIFAR-10, CICIDS-2017, and TON-IoT, respectively, showing the advantage of our approach. We observe a similar trend in F1-score, where FHC achieves 150.2%, 68.3%, 21.7%, and 49.1% higher scores than the second-best models on MNIST, CIFAR-10, CICIDS-2017, and TON-IoT, respectively.
Nevertheless, insufficient training examples or a small proportion of hosts with abnormal data can prevent any method from achieving high prediction performance. Therefore, we further investigate the effect of these two factors, focusing on the CICIDS-2017 dataset.

First, we investigate how the detection performance of each method on the test set changes with the number of training examples, fixing the proportion of hosts containing abnormal data at 20.0%. Figure 3 shows the AUC values of the detection methods as we change the total number of training instances in the range of 12k to 49k (where 1k = 1000). FHC performs 22.3% better than the second-best competing method on average across all cases. In addition, our method improves more noticeably than the others as the training data grow: the average AUC of FHC increases by 18.5% as the number of training examples increases from 12k to 49k, whereas the average increase of the other methods is 12.1%. This result shows that (i) FHC performs better than the other methods even with few training examples, and (ii) FHC makes better use of additional training instances. However, the performance gain of FHC becomes marginal beyond 24k training examples; it turns out that the relatively small number of abnormal instances is one of the reasons, as we discuss next.

Second, Figure 4 shows the trend of AUC values as we change the proportion of hosts containing abnormal data. The performance of FHC increases almost linearly as more hosts contain abnormal data, a trend not observed in the competing methods. In particular, the AUC of FHC improves from 0.605 to 0.971 as the proportion of hosts with abnormal data increases from 13.3% to 33.3%, whereas the other methods show almost no improvement.
Overall, the AUC of FHC is 32.6% higher than that of the second-best methods on average, and 62.4% higher than the second-best when 33.3% of the hosts have abnormal data. In particular, the AUC of FHC is 45.2% higher than that of the second-best method when the proportion of abnormal data is 0.4. This result indicates that FHC utilizes abnormal data better than the single-host anomaly detectors, since FHC benefits from the data in all hosts through global model aggregation and, consequently, shows the highest performance.

Conclusions
This paper proposed FHC, a novel anomaly detection method based on a re-designed HSC and the federated learning framework, designed for the multi-host environment where the exchange of training data is limited, the data distributions of the hosts can be irregular, and only a few hosts have abnormal instances.
One of the issues in the existing anomaly detectors that we address in this paper is their reliance on hyperparameter tuning to determine critical parameters (for example, the radius R in the original HSC), which cannot be done effectively when no anomaly data are available. In addition, some methods, such as DROCC, are very sensitive to the choice of hyperparameter values. Therefore, we suggested a re-designed version of HSC that includes R as a learning parameter, which can be optimized using the training data from all participating hosts via federated learning. Furthermore, our results in Sections 4.3.1 and 4.3.2 indicate that more training examples will help improve the performance of FHC, in particular when the proportion of abnormal instances increases and the class imbalance between normal and abnormal data is alleviated. Our experiments increased the number of hosts with abnormal examples, which may not be feasible in a real environment; instead, one could consider data augmentation techniques such as GANs [54] and VAEs [48] to increase the proportion of abnormal instances.
Nevertheless, the experiments in this paper were performed in simplified settings; multiple hosts in a real environment would have more irregular data distributions. Therefore, it would be desirable to consider more realistic configurations of multi-host environments, where more advanced FL techniques such as FedProx [70] or FedOpt [66] would be beneficial.
Finally, the experimental setting in this paper reflects one case of irregular data distribution among the hosts, whereas many configurations of host data distributions are possible. Still, we hope our research can be a stepping stone for research on anomaly detection methods in multi-host environments. Our method is implemented in PyTorch and is available as open source at https://github.com/sanglee/FHC (accessed on 5 April 2022).

Data Availability Statement: All datasets used in this paper are publicly available: MNIST at http://yann.lecun.com/exdb/mnist/ (accessed on 28 February 2022), CIFAR-10 at https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 28 February 2022), TON-IoT at https://research.unsw.edu.au/projects/toniot-datasets (accessed on 3 March 2022), and CICIDS-2017 at https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 3 March 2022).

Conflicts of Interest:
The authors declare no conflict of interest.