Federated Auto-Meta-Ensemble Learning Framework for AI-Enabled Military Operations



Introduction
With an increasing pace, artificial intelligence (AI) is becoming a significant and integral part of modern warfare, because it offers new opportunities for the complete automation of large-scale infrastructure and the optimization of numerous defence or cyber-defence systems [1]. One of the promises of AI in the military domain [2] that seems to guarantee its adoption is its broad applicability. In a military context, the potential for AI is present in all operational domains (i.e., land, sea, air, space, and cyberspace) and all levels of warfare (i.e., political, strategic, operational, and tactical) [3]. At the same time, however, complexity is growing exponentially as the number of interconnected systems increases. Applications supporting the missions above that monitor events in real time receive a constant, unbounded stream of observations from interlinked sources. These data exhibit high variability because their features vary substantially and unexpectedly over time, altering their typical, expected behaviour. In the typical case, the latest data are the most important, as ageing is based on their timing.
Military AI-enabled intelligent systems that utilize data can transform military commanders' and operators' knowledge and experience into optimal, valid, and timely decisions [3,4]. However, the lack of detailed knowledge and expertise associated with using complex machine learning architectures can affect the performance of the intelligent model, prevent the periodic adjustment of some critical hyperparameters, and ultimately reduce the algorithm's reliability and the generalization that should characterize these systems. These disadvantages prevent defence stakeholders, at all echelons of the chain of command, from trusting and making effective, systematic use of machine learning systems. In this context, and given the inability of traditional decision-making systems to adapt to a changing environment, the adoption of intelligent solutions is imperative.
Furthermore, a general difficulty that reinforces distrust of machine learning systems in defence is the prospect of adopting a single data warehouse for the overall training of intelligent models [1], which could create severe technical challenges and serious issues of privacy [5], logical, and physical security, since it would establish a potential single point of failure and a strategic/primary target for adversaries [6]. Accordingly, the exchange of data that could produce more complete, better-generalizing intelligent classifiers also poses risks to the security and privacy of sensitive data, which military commanders and operators do not want to take [7].
To overcome this double challenge, this work proposes FAMEL. It is a holistic system that automates the selection and use of the most appropriate algorithm and hyperparameters to optimally solve the problem under consideration, treating machine learning as a search for an unknown mapping between input and output data. The proposed framework uses meta-learning to identify similar knowledge accumulated in the past in order to speed up the process [8]. This knowledge is combined using heuristic techniques, implementing a single, constantly updated intelligent framework. Data remain in the local environment of the operators, and only the parameters of the models are exchanged through secure processes, thus making it harder for potential adversaries to interfere with the system [9,10].

Proposed Framework
In the proposed FAMEL framework, each user runs an automatic meta-learning system in a horizontal federated learning approach (horizontal federated learning uses datasets with the same feature space across all devices, whereas vertical federated learning uses datasets with different feature spaces to jointly train a global model). The most appropriate algorithm with the optimal hyperparameters is selected in a fully automated way, which can optimally solve the given problem. The implementation is based on the entity's available data, which are not required to be deposited in a remote repository or shared with a third party [11].
The whole process is described in Figure 1.

Specifically:

Step 1: Fine-tune the best local model. The fine-tuning process improves the accuracy of each machine learning model by integrating data from an existing dataset and using it as an initialization point, making the training process time- and resource-efficient.
Step 2: Upload the local model to the federated server.
Step 3: Ensemble the models at the federated server. This ensemble method uses multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
Step 4: Dispatch the ensemble model to the local devices.
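The four-step round above can be sketched in a few lines (a toy illustration; all names are hypothetical, the "fine-tuning" is a stand-in that nudges weights toward the local data mean, and the server's "ensemble" is a plain average rather than FAMEL's heuristic mechanism):

```python
# Toy sketch of one FAMEL round (Steps 1-4); names and operations are
# illustrative placeholders, not the actual FAMEL implementation.

def fine_tune(model, local_data):
    """Step 1: fine-tune a copy of the global model on a domain's local data."""
    mean = sum(local_data) / len(local_data)
    return [w + 0.1 * (mean - w) for w in model]

def federated_round(global_model, domains):
    local_models = [fine_tune(list(global_model), d) for d in domains]  # Step 1
    uploaded = local_models                          # Step 2: only models move
    # Step 3: the federated server combines the uploaded local models
    ensemble = [sum(ws) / len(ws) for ws in zip(*uploaded)]
    return ensemble                                  # Step 4: dispatched back

domains = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # raw data never leaves a domain
new_global = federated_round([0.0, 0.0], domains)
```

Note that only model parameters cross the domain boundary in Step 2, mirroring the privacy argument above.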
The best models (winner algorithms) that result from the process are channelled to a federated server, where an ensemble learning model is created through a heuristic mechanism. This ensemble model essentially incorporates the knowledge represented by the local best models, which, as mentioned, came from the local data held by the users [12]. Hence, collectively, the ensemble model offers high generalization, better predictability, and stability. Its overall behaviour smooths noise while lowering the risk of a false decision due to modelling bias when handling local data scenarios [13,14].

Federated Learning
Assume that N data owners F_1, F_2, ..., F_N want to train a machine learning model using their data D = {D_i, i = 1, 2, ..., N}. A traditional way would be to collect all data into a single set D_1 ∪ D_2 ∪ ... ∪ D_N and train a conventional model M_sum on it. The proposed federated learning system instead creates a single universal model [15]:

w_{t+1} = w_t + Σ_{k=1}^{K} (n_k / n) Δw_k^{(t)}

where K is the total number of nodes used in the process, n_k the number of data points held by node k, n = Σ_k n_k, and t the federated learning round [16]. The model is built from the local models Δw_1, Δw_2, ..., Δw_K, which are trained on the D_i of each federation member separately. The data D_1 of user F_1 are never exposed to the other federation members. In addition, the accuracies V_sum and V_fed of the models M_sum and M_fed must be close or equal. Specifically, if δ is a non-negative number such that

|V_fed − V_sum| < δ

then the federated learning algorithm is said to have a δ-accuracy loss [11,13]. The Auto-Machine Learning technique is used to develop an accurate and robust federated system that remains stable on new information without losing the ability to generalize or suffering a considerable δ-accuracy loss [17,18].
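The aggregation of local updates can be sketched as a data-size-weighted sum (the n_k/n weighting is the standard FedAvg choice and is an assumption here, since the text does not spell the weights out):

```python
def fedavg_step(w_t, deltas, sizes):
    """One federated round: w_{t+1} = w_t + sum_k (n_k / n) * dw_k,
    where n_k is the number of data points at node k and n = sum_k n_k."""
    n = sum(sizes)
    w_next = list(w_t)
    for dw, n_k in zip(deltas, sizes):
        for j, d in enumerate(dw):
            w_next[j] += (n_k / n) * d
    return w_next

# Two nodes: the node holding 75% of the data dominates the aggregate update.
print(fedavg_step([0.0], deltas=[[1.0], [-1.0]], sizes=[75, 25]))  # -> [0.5]
```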

Auto-Machine Learning
Initially, each federation member has a data set D containing attribute vectors and class labels for a supervised problem linked to a task. The data set D is divided into two parts, a training set S and a prediction set B for testing, so that D = S ∪ B. Furthermore, the data set D contains vector-label pairs such that D = {(x_i, y_i)}. Each label represents a known class and belongs to a known set of labels L [19].
Considering P(D) the distribution of the aggregate data held by the federation members, we can sample the distribution of an individual data set such that P_d = P_d(x, y). Our problem lies in creating a trained classifier M_λ : x → y, fully and optimally configured with λ ∈ Λ, that can automatically generate predictions for samples from the distribution P_d while minimizing the expected generalization error [20]:

GE(M_λ) = E_{(x,y)∼P_d}[ L(M_λ(x), y) ]

where L is the loss function. The first phase is the best-model selection procedure, which resembles a standard learning procedure in which a data set is regarded as a sample of data. Furthermore, given that each data set of each independent body can only be observed through a set of n independent observations,

D_d = {(x_i, y_i)}_{i=1}^{n} ∼ P_d

we can only approximate the generalization error empirically on data samples [20,21]:

ĜE(M_λ, D_d) = (1/n) Σ_{i=1}^{n} L(M_λ(x_i), y_i)

In practice, we therefore have access only to disjoint, finite samples D_train and D_test (D_d,train, D_d,test ∼ P_d). The search for the best machine learning algorithm uses only D_train; the performance is calculated once, at the end, on D_test.
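The empirical approximation of the generalization error is simply the average loss over the n observed samples; a minimal sketch (the threshold classifier and the sample data are invented for illustration):

```python
def empirical_ge(model, sample, loss):
    """GE-hat: mean loss of a model over n i.i.d. observations (x_i, y_i)."""
    return sum(loss(model(x), y) for x, y in sample) / len(sample)

zero_one = lambda pred, y: 0.0 if pred == y else 1.0  # 0/1 classification loss
model = lambda x: 1 if x > 0 else 0                   # hypothetical classifier
sample = [(-2, 0), (-1, 0), (1, 1), (3, 0)]           # last point misclassified
print(empirical_ge(model, sample, zero_one))          # -> 0.25
```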
Assume a classifier f_λ, where the parameter λ yields the likelihood P_λ(y|x) that a data point specified by the attribute vector x belongs to the class y. The best model should maximize the likelihood of correctly detecting labels over several training batches B ⊂ D, so that [18,22]:

λ* = argmax_{λ∈Λ} Σ_{(x,y)∈B} log P_λ(y | x)

Given that there is only a limited collection of fast-learning support that can act as fine-tuning, the objective, as with any other machine learning task, is to minimize the prediction error made on data samples with unknown labels. Obtaining the best model can be challenging. An artificial data set is created with only a tiny fraction of the labels, to avoid releasing all labels into the model, and the optimization technique is modified to make knowledge acquisition faster. Under this interpretation, each sample pair can be regarded as a data point. As a direct consequence, the model is trained to the point where it can generalize to fresh, unseen data sets. To summarize, computing the best model through the meta-learning approach is represented by the following function [20]:

M_λ* = argmin_{λ∈Λ} ĜE(M_λ, D_train)

Therefore, the proposed framework performs an automatic search in the solution space to identify the optimal M_λ*. For the calculation of GE with k-fold cross-validation, the following relation is used [17,20,23]:

ĜE_CV(M_λ) = (1/k) Σ_{j=1}^{k} ĜE(M_λ^{D_(train,j)}, D_(val,j))

where M_λ^{D_(train,j)} denotes that M_λ was trained on the k-fold subset D_(train,j) ⊂ D_train and then evaluated on the corresponding validation fold D_(val,j). Accordingly, the problem of optimizing the hyperparameters λ ∈ Λ of the best learning algorithm A is essentially similar to selecting the best model. Some significant characteristics are that hyperparameters are frequently continuous, hyperparameter spaces are often vast, and we can benefit from the correlation between different hyperparameter settings λ_1, λ_2, ..., λ_n ∈ Λ.
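The k-fold selection of the best model can be sketched as follows (a toy illustration: the hypothetical "fit" ignores its training split and treats λ directly as a decision threshold, and the error estimate is the 0/1 validation error):

```python
def cv_select(candidates, fit, data, k=10):
    """Pick the candidate minimizing the k-fold cross-validated 0/1 error."""
    def cv_error(lam):
        folds = [list(range(i, len(data), k)) for i in range(k)]  # k disjoint folds
        total = 0.0
        for fold in folds:
            train = [d for i, d in enumerate(data) if i not in fold]
            model = fit(lam, train)                 # train on the j-th training split
            wrong = sum(1 for i in fold if model(data[i][0]) != data[i][1])
            total += wrong / len(fold)              # validation error on fold j
        return total / k
    return min(candidates, key=cv_error)

# Toy data whose true decision boundary is 0; cross-validation picks it out.
data = [(-3, 0), (-2, 0), (-1, 0), (1, 1), (2, 1), (3, 1)]
fit = lambda lam, train: (lambda x: 1 if x > lam else 0)
print(cv_select([-5, 0, 5], fit, data, k=3))  # -> 0
```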
Specifically, a hyperparameter λ_i is subject to the constraints of another hyperparameter λ_j: λ_i is active only if the hyperparameter λ_j takes values from a given set V_i(j) ⊆ Λ_j. Based on this logic, the hyperparameters in the proposed framework create a structured solution space determined by a pair of variables B = ⟨G, Θ⟩, where G is a graph. Graph G conveys the assumption that each variable λ_i is independent of its non-descendants in G given its parents π_i. Θ determines the parameters of the network and, in particular, the quantity θ_{λ_i|π_i} = P_B(λ_i | π_i) for each λ_i ∈ Λ_i, conditioned on π_i, the set of parents in G. Therefore, B defines a unique probability distribution such that [24]:

P_B(λ_1, λ_2, ..., λ_n) = Π_{i=1}^{n} P_B(λ_i | π_i) = Π_{i=1}^{n} θ_{λ_i|π_i}

Finding the optimal graph path is bounded through the Markov inequality [25,26]:

P(Λ_n ≥ a) ≤ E[Λ_n] / a,  a > 0

Hence, the expectation of Λ_n is calculated as:

E[Λ_n] = Σ_k k · P(Λ_n = k)

It follows that the r-th factorial moment of Λ_n is:

E[(Λ_n)_r] = E[Λ_n (Λ_n − 1) ⋯ (Λ_n − r + 1)]

Finally, given the above-structured solution space, the hyperparameter optimization problem is stated as [27,28]:

λ* = argmin_{λ∈Λ} ĜE(A_λ, D_train)

The Meta-Ensemble Learning technique is then used to lead the proposed framework to stable prediction models while offering generalization, minimizing bias, reducing variance, and eliminating overfitting.
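A conditional hyperparameter space of the kind described above can be sketched as follows (all names in this space are invented for illustration and are not taken from the FAMEL implementation; each child hyperparameter is active only when its parent takes a value in the constraint set V_i(j)):

```python
import random

# Hypothetical structured space: 'kernel' is the parent node in the graph G;
# 'degree' and 'gamma' are children with constraint sets V_i(j).
SPACE = {
    "kernel": ["linear", "poly", "rbf"],
    "degree": {"active_if": ("kernel", {"poly"}), "values": [2, 3, 4]},
    "gamma":  {"active_if": ("kernel", {"poly", "rbf"}), "values": [0.1, 1.0]},
}

def sample_config(space, rng):
    """Sample a configuration that respects the conditional structure of G."""
    cfg = {"kernel": rng.choice(space["kernel"])}
    for name in ("degree", "gamma"):
        parent, allowed = space[name]["active_if"]
        if cfg[parent] in allowed:      # child hyperparameter is active
            cfg[name] = rng.choice(space[name]["values"])
    return cfg

rng = random.Random(0)
configs = [sample_config(SPACE, rng) for _ in range(5)]
```

A search procedure would draw such configurations and score each with the cross-validated error estimate.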

Meta-Ensemble Learning
Once the above procedure has identified the most appropriate algorithm with the optimal hyperparameters, the proposed framework creates a boosting ensemble model of all the optimal models that emerged from auto-machine learning, in order to produce a single model with improved generalization.
The proposed technique is based on the logic of the boosting process, where information transfer is applied through the creation of successive tree structures to solve a distributed problem [29]. Specifically, we set f(x) = 0 and ε_i = y_i for each observation in the training data set of each body.
The winning algorithm f_k from the auto-machine learning process is trained in each round k, with d nodes, using as response variable the classification errors ε_i resulting from the previous classification round. For the most efficient, effective, and computationally feasible implementation of the proposed framework, a pruned version of the new tree is added to the current model and the residuals are updated, so that [12,30]:

f(x) ← f(x) + λ f_k(x),  ε_i ← ε_i − λ f_k(x_i)

Repeating the procedure K times (with K specified by the user), the final form of the model is obtained:

f(x) = Σ_{k=1}^{K} λ f_k(x)

For the proposed technique to be effective, the user must specify the number and depth of the trees to be created. Excessive tree depth can easily cause overfitting and prevent generalization; accordingly, the number of trees controls the complexity of the process [31]. The λ parameter defines the learning rate of the model.
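The residual-fitting loop above can be sketched with a deliberately simple weak learner (a hypothetical "stump" that just predicts the mean residual; the real framework uses the pruned winner trees from auto-machine learning):

```python
def fit_stump(data):
    """Hypothetical weak learner f_k: predicts the mean of the current residuals."""
    mean = sum(r for _, r in data) / len(data)
    return lambda x: mean

def boost(xs, ys, K, lam):
    """Start with f(x) = 0 and e_i = y_i; for K rounds fit f_k to the residuals
    and update f(x) <- f(x) + lam * f_k(x), e_i <- y_i - f(x_i)."""
    preds = [0.0] * len(xs)
    residuals = list(ys)
    learners = []
    for _ in range(K):
        f_k = fit_stump(list(zip(xs, residuals)))
        learners.append(f_k)
        for i, x in enumerate(xs):
            preds[i] += lam * f_k(x)
            residuals[i] = ys[i] - preds[i]   # errors feed the next round
    return lambda x: sum(lam * f(x) for f in learners)

model = boost(xs=[0, 1, 2], ys=[3.0, 3.0, 3.0], K=25, lam=0.3)
# With learning rate 0.3, predictions approach the target geometrically.
```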
To find the global minimum of the loss function, the proposed technique first calculates its derivative and then moves against it. The derivative measures how much the value of J(θ) will change if the variable θ changes slightly (the slope of the function). High derivative values indicate a significant slope and, therefore, a substantial change in J(θ) for small changes of θ. The algorithm is iterative: it initializes θ with a random value, calculates the derivative of the function at that point, and updates θ so that [28,32]:

θ ← θ − ρ ∂J(θ)/∂θ

Taking as loss function the sum of the squares of the classification errors ε_i, divided by two,

J = (1/2) Σ_i (y_i − f(x_i))²

the parameter ρ determines how fast the algorithm moves in the negative direction of the derivative. The process is repeated until the algorithm converges, which amounts to training trees on the negative derivative of the loss function:

−∂J/∂f(x_i) = y_i − f(x_i) = ε_i

Calculating this derivative shows that the negative derivative of the loss function equals the classification errors ε_i. Hence, the procedure essentially trains a tree on the classification errors ε_i, and a pruned version of the new tree is added to the model. In this manner, the approach adds successive trees along the negative derivative of the loss function at each step t, such that [33,34]:

ŷ_i^{(t)} = ŷ_i^{(t−1)} + f_t(x_i)

where F = {f(x) = w_{q(x)}}, q : R^m → T, w ∈ R^T. Here q represents the structure of each tree, T the number of leaves, and each f_t corresponds to an independent tree structure q with the leaf weights denoted by w. The loss function minimized at step t has the form:

L^{(t)} = Σ_i l(y_i, ŷ_i^{(t−1)} + f_t(x_i)) + Ω(f_t)

Two terms are important: the model's capacity for learning from the training data (low values of the first term imply good learning) and the complexity of each tree, which penalizes the number of leaves T and shrinks the leaf weights:

Ω(f) = γT + (1/2) λ Σ_{j=1}^{T} w_j²

The parameter γ indicates the penalty for growing the tree, so large values of γ lead to small trees and, respectively, small values of γ lead to large trees. The parameter λ regulates the shrinkage of the leaf weights: as its value increases, the leaf weights shrink.
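The tree-complexity penalty discussed above can be computed directly (a minimal sketch; the form Ω(f) = γT + (λ/2)·Σ w_j² is the standard XGBoost-style penalty assumed here):

```python
def omega(leaf_weights, gamma, lam):
    """Tree complexity penalty Omega(f) = gamma * T + (lam / 2) * sum_j w_j^2,
    where T is the number of leaves and w_j are the leaf weights."""
    T = len(leaf_weights)
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)

# A larger gamma penalizes every extra leaf; a larger lam shrinks leaf weights.
small_tree = omega([0.5, -0.3], gamma=1.0, lam=1.0)            # about 2.17
big_tree = omega([0.5, -0.3, 0.2, 0.1], gamma=1.0, lam=1.0)    # pays for 4 leaves
```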
The problem is now to decide which f_t(x_i) minimizes the loss function at step t. A second-order Taylor expansion gives [33,35]:

L^{(t)} ≈ Σ_i [ l(y_i, ŷ_i^{(t−1)}) + g_i f_t(x_i) + (1/2) h_i f_t²(x_i) ] + Ω(f_t)

where g_i and h_i are the first- and second-order derivatives of the loss with respect to ŷ^{(t−1)}. Subtracting the constant terms, the loss function becomes:

L̃^{(t)} = Σ_i [ g_i f_t(x_i) + (1/2) h_i f_t²(x_i) ] + Ω(f_t)

Recording the set of observations on a leaf j as I_j = {i | q(x_i) = j}, the above relation is rewritten as:

L̃^{(t)} = Σ_{j=1}^{T} [ (Σ_{i∈I_j} g_i) w_j + (1/2)(Σ_{i∈I_j} h_i + λ) w_j² ] + γT

where G_j = Σ_{i∈I_j} g_i and H_j = Σ_{i∈I_j} h_i, so that the following relation emerges:

L̃^{(t)} = Σ_{j=1}^{T} [ G_j w_j + (1/2)(H_j + λ) w_j² ] + γT

If the structure of the tree q(x) is given, the optimal weight on each leaf is obtained by minimizing the above relation with respect to w_j, so that [22,36,37]:

w_j* = − G_j / (H_j + λ)

Substituting w_j*, the following equation is obtained, which measures the quality of the new tree:

L̃^{(t)}(q) = − (1/2) Σ_{j=1}^{T} G_j² / (H_j + λ) + γT

Finally, the algorithm evaluates candidate splits using the formula:

Gain = (1/2) [ G_L²/(H_L + λ) + G_R²/(H_R + λ) − (G_L + G_R)²/(H_L + H_R + λ) ] − γ

where the first fraction is the score of the left part of the split, the second is the score of the right part, the third is the score if the split is not made, and γ measures the cost of the added complexity of the split.
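The optimal leaf weight and the split-gain formula translate directly into code (a minimal sketch; the gradient sums G and H would come from the loss derivatives over the observations on each leaf):

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w_j* = -G_j / (H_j + lam) for gradient sums G_j, H_j."""
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    """Split score: 0.5 * (left score + right score - no-split score) - gamma."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Gradients that cancel when pooled but not when separated: splitting pays off.
gain = split_gain(GL=-4.0, HL=4.0, GR=4.0, HR=4.0, lam=1.0, gamma=0.5)  # ~2.7
```

A negative gain means the γ penalty outweighs the improvement, so the split (or the subtree, during pruning) is discarded.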

Experiments and Results
For the experimental implementation of the proposed FAMEL and the execution of the scenario, a collaborative network of three federated partners (domain_alpha, domain_bravo, and domain_charlie) was simulated (Figure 2). We consider the optimal model created by the Auto-Machine Learning process to be an internal affair of each domain, implemented on a local server based on the respective architecture of each domain. In the Demilitarized Zone (DMZone) sits the Federated Learning Server (FLS), which creates the ensemble model by applying the algorithmic process of assembling the optimal models with the technique discussed above. The proposed intelligent system was evaluated using one of the most extensive datasets for web traffic analysis, CICDoS2019, developed under the supervision of the Canadian Institute for Cybersecurity. The evaluation's primary objective was to identify well-organized attacks in which the intruder's identity remained a legal component of a third party [31]. Each domain includes 70 independent variables (characteristics or statistics of network analysis) and six classes (Benign, Infiltration, SSH-Bruteforce, FTP-BruteForce, DoS Attack-Hulk, and DDOS attack-HOIC). The individual sets comprise 70,553 (Alpha_dataset), 69,551 (Bravo_dataset), and 70,128 (Charlie_dataset) instances [38].
The initial results of the Auto-Machine Learning process, based on the data available in each domain, are presented in Tables 1-9 below, together with the parameters of each optimal model that emerged for each collaborative domain. We used the Area Under the ROC Curve (AUC) metric, which represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes. Specifically, AUC (also known as AUROC) is the area beneath the entire ROC curve. AUC provides a convenient, single performance metric for our classifiers, independent of any specific classification threshold, which enables us to compare models without even looking at their ROC curves.

AUC is measured on a scale of 0 to 1, with higher values indicating better performance. Scores in the (0.5, 1] range indicate good performance, while anything less than 0.5 indicates very poor performance. An AUC of 1 indicates a perfect classifier, while an AUC of 0.5 indicates a classifier no better than random. A model that consistently ranks negative samples above positive ones will have an AUC of 0, indicating a severe modelling failure.
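The AUC values discussed above can be computed without an external library by using the rank interpretation of AUC: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (a minimal sketch; production code would use a library routine such as scikit-learn's `roc_auc_score`):

```python
def auc(scores, labels):
    """AUC as the probability that a random positive is ranked above a random
    negative (ties count half) -- the normalized Mann-Whitney U statistic."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfect ranking  -> 1.0
print(auc([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0]))  # inverted ranking -> 0.0
```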
It should be noted that all tests were performed with 10-fold cross-validation. Each of the ten subsets was used for the algorithm's training and exactly once for its evaluation, so the algorithmic result could not be misleading.
The three best models from each domain (LGBMClassifier, GradientBoostingClassifier, and k-NeighborsClassifier) are sent through the Federated Learning process to the FLS, where Meta-Ensemble Learning creates an ensemble model that includes the best classifiers. The ensemble model then returns through the Federated Learning process to each domain and is retested on each local dataset (Alpha_dataset, Bravo_dataset, and Charlie_dataset). Again, it should be emphasized that all tests were performed with 10-fold cross-validation, so the algorithmic result could not be misleading. The results of the process are presented in the following tables.
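For intuition, combining per-domain winners can be sketched as soft voting over their class-probability vectors (a hedged illustration: the three lambdas stand in for the trained LGBM, gradient boosting, and k-NN classifiers, and the actual FAMEL ensembling is a heuristic boosting mechanism, not plain voting):

```python
def soft_vote(prob_fns, x):
    """Average the class-probability vectors and return the argmax class index."""
    probs = [fn(x) for fn in prob_fns]
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

m1 = lambda x: [0.6, 0.4]   # leans towards class 0
m2 = lambda x: [0.2, 0.8]   # leans towards class 1
m3 = lambda x: [0.3, 0.7]   # leans towards class 1
print(soft_vote([m1, m2, m3], x=None))  # -> 1
```

Even with one dissenting model, the combined probability mass settles on the majority view, which is the smoothing effect described below.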
The ensemble model ensures improved categorization accuracy and a smoothing of the system's behaviour. This dramatically simplifies trend detection and visualization by eliminating or reducing statistical noise in the data. The experimental results suggest that the ensemble model becomes more accurate with each instance, providing critical pointers to the failure modes that an individual algorithm's bias could generate [39]. This allows for a precise diagnosis before a new condition or occurrence associated with adversarial attacks or zero-day exploits is encountered. It is also one of the most effective strategies for predicting a trend's strength and the likelihood of a shift in direction [40].
The convergence achieved by employing multiple models provides greater reliability than any of them could provide separately. This finding, directly related to the experimental outcomes, significantly accelerates arriving at the optimal decision in ambiguous situations [41]. It must also be emphasized that this process is dynamic. This dynamic quality ensures the system's adaptability by providing impartiality and generalization, resulting in a system that can respond to highly complicated events [42,43].

Conclusions
Applying machine learning to real-world problems is still particularly challenging [44], because highly trained engineers and military specialists with a wealth of experience and information are required to coordinate the numerous parameters of the respective algorithms, correlate them with the specific problems, and use the data sets currently available. This is a lengthy, laborious, and expensive undertaking. However, since machine learning can be thought of as a search problem that attempts to approximate an unknown underlying mapping function between input and output data, the hyperparametric features of algorithms and the design choices for ideal parameters can be viewed as optimization problems.
Utilizing this view, the present work presented FAMEL, extending the idea of formulating a general framework of automatic machine learning with effective universal optimization that operates at the federation level. It uses automated machine learning to find the optimal local model on the data held by each federation member and then, making extensive use of meta-learning, creates an ensemble model which, as shown experimentally, can generalize, providing highly reliable results. In this way, the federation members have a dedicated, highly generalized model whose training does not require exposing the data in their possession to the federation. In this regard, FAMEL can be applied to several military applications where continuous learning and environmental adaptation are critical for the supported operations and where the exchange of information might be difficult or impossible for security reasons. An example is the real-time optimization of information sharing concerning tasks and situations. The application of FAMEL would be of special interest in congested environments where IoT sensor grids are deployed and many security constraints must be met. Similarly, it can be applied in cyberspace operations to find and identify potential hostile activities in cluttered information environments and complex physical scenarios in real time, including countering negative digital influence [45,46]. It must be noted that the proposed technique can be extended to cover a wider scientific area without diluting the main points described here: it is a universal technique that develops and produces an open-frame holistic federated learning approach.
Although the methodologies of federated learning, ensemble models, and, more recently, meta-learning have occupied the research community intensely, and relevant work has been proposed that has advanced the field, this is, to our knowledge, the first time such a comprehensive framework has been presented in the international literature. The methodology offered herein is an advanced form of learning: the computational process is not limited to solving a problem, but proceeds through a productive method of searching the solution space and selecting the optimal solution in a meta-heuristic way [47,48].
On the other hand, the federated learning model must apply average-aggregation methods to the set of cooperative training data. This raises serious concerns about the effectiveness of this universal approach and, therefore, about the validity of federated architectures in general: averaging flattens the unique needs of individual users without considering the local events to be managed. How to create personalized intelligent models that overcome this limitation is currently a prominent research problem. For example, the study [49] is based on the needs and events that each user must address in a federated format. Explanations are the set of characteristics of the interpretable system that, for a given instance, contributed to a conclusion, describing the function of the model at both local and global levels. Retraining is suggested only for those features whose degree of change is considered important enough for the evolution of its functionality.
Essential topics that could expand the research area of the proposed framework concern the Meta-Ensemble Learning process: specifically, how to automate and simplify decisions about the number of trees and their depth, how to identify an automated process for pruning each tree with optimal splits so as to avoid negative gain, and how to explore procedures for adding an optimally trimmed version of each tree to the model, in order to maximize the framework's efficiency, accuracy, and speed.

Figure 1.
Figure 1. The proposed block diagram of the FAMEL framework.


Table 1.
The best model for the Domain Alpha.

Table 2.
Best parameters of the winner model of the Domain Alpha.

Table 3.
The best model for the Domain Bravo.

Table 4.
Best parameters of the winner model of the Domain Bravo.

Table 5.
The best model for the Domain Charlie.

Table 6.
Best parameters of the winner model of the Domain Charlie.

Table 7.
Ensemble model for the Domain Alpha.

Table 8.
Ensemble model for the Domain Bravo.

Table 9.
Ensemble model for the Domain Charlie.