Anomaly Detection in 6G Networks Using Machine Learning Methods

Abstract: While the cloudification of networks with a micro-services-oriented design is a well-known feature of 5G, the 6G era of networks is closely tied to intelligent network orchestration and management. Consequently, artificial intelligence (AI), machine learning (ML), and deep learning (DL) have a large part to play in the envisioned 6G paradigm. Future end-to-end automation of networks requires proactive threat detection, the use of clever mitigation strategies, and confirmation that 6G networks will be self-sustaining. To strengthen and consolidate the role of AI in safeguarding 6G networks, this article explores how AI may be employed in 6G security. To this end, a novel anomaly detection system for 6G networks (AD6GNs) based on ensemble learning (EL) for communication networks was redeveloped in this study. The first stage in the EL-ADCN process is pre-processing. The second stage is feature selection, which applies a reimplemented hybrid approach combining correlation-based feature selection with the random forest algorithm (CFS-RF). Each of the four datasets, NSL_KDD, UNSW_NB2015, CIC_IDS2017, and CICDDOS2019, is given a reduced dimensionality, and the best feature subset is determined for each separately. Hybrid EL techniques are used in the third stage to find intrusions. The average voting methodology is employed as the aggregation method, and two classifiers, support vector machines (SVM) and random forests (RF), are modified to be used as the EL algorithms for bagging and adaboosting, respectively. The last stage tests the approach using both binary and multi-class classification. The best experimental results were obtained by applying 30, 35, 40, and 40 features of the reimplemented system to the four datasets NSL_KDD, UNSW_NB2015, CIC_IDS2017, and CICDDOS2019.
For the NSL_KDD dataset, the accuracy was 99.5% with a false alarm rate of 0.0038; for the UNSW_NB2015 dataset, the accuracy was 99.9% with a false alarm rate of 0.0076; for the CIC_IDS2017 dataset, the accuracy was 99.8% with a false alarm rate of 0.0009; and for the CICDDOS2019 dataset, the accuracy was 99.95426% with a false alarm rate of 0.00113.


Introduction
After 2030, users will engage with internet virtual worlds through 6G communications [1]. The 5G network's intended digital transformation has already begun. Table 1 summarizes security issues with 6G applications and the fundamental security requirements.

Table 1 (excerpt):

• References [22,23]: Challenges in security: resources are scarce; energy is at a premium. Requirement for security: design of a complete security system.
• References [24,25]: 6G application: brain-computer connections that are wireless.
The world of technology is shifting toward hardware and simple algorithms, IoE, and complex networking. Although there are ongoing studies, intrusion detection systems are unable to optimize their detection rate (DR), false alarm rate (FAR), false negative rate (FNR), false positive rate (FPR), or execution time due to the large dimensionality of standard datasets and the incidence of zero-day attacks. Even though it directly affects resources, time complexity has not been acknowledged as a significant component. In order to extract the best subset of the original features, dimensionality reduction with an FS method is reimplemented in this study. The accuracy and stability of the intrusion detection systems are then increased, while the necessary calculation time is minimized, by forwarding the reimplemented hybrid EL to these subsets. The reimplemented process develops hybrid EL and FS algorithms to obtain precise and effective IDSs.
In the reimplemented approach, an anomaly message is detected and classified as spam or ham using novel AD6GNs based on EL for wireless communication networks. ML approaches substantially contribute to detecting spam tweets, images/videos, and SMS messages on mobile devices. An IDS guards computer networks against malicious invasions and is used to check for network weaknesses. For network analysis, the three main classifications of an intrusion detection system are signature-based, anomaly-based, and hybrid-based. ML methods significantly aid in detecting various intrusions on host and network systems. The main contributions of this research are as follows:

1. Within the FS framework, a unique CFS-RF method is utilized to evaluate the correlation of the chosen features. This substantially improves the efficiency of the testing and training processes.
2. The effectiveness of the multi-class and binary forms is enhanced and demonstrated on three unbalanced datasets. The process builds hybrid ensemble algorithms by modifying two different classifiers to function as adaboosting, followed by the average voting technique (bagging method) to combine the judgments of an ensemble of classifiers, such as SVM and RF.
3. This article investigates how AI may be used to strengthen and reinforce the safeguarding of 6G networks. To this end, this study reimplements a novel AD6GN using EL.
The remainder of this paper is organized as follows: many similar works are presented in Section 2. In Section 3, the basics of threats and attacks are discussed, while AI/ML technology in the 6G network is discussed in Section 4. Section 5 offers a thorough explanation of the methodology and the reimplemented system. Section 6 discusses the various ML techniques, shows how the reimplemented system was put into practice using the chosen datasets, and presents the discussion and analysis of the results of this and other studies. Section 7 ends with a summary of the findings and recommendations for additional investigation.

One related study combined ensemble learning with feature selection (i.e., correlation feature selection-forest attribute). Its experimental findings exclusively utilized the CIC_IDS2017 dataset; when employing 30 selected features, the testing accuracy was 87%. Using just two datasets, NSL-KDD and UNSW_NB2015, the authors of [57] introduced correlation feature selection methods that allow users to choose the optimal features. Furthermore, they selected only 30 features from those datasets.
The authors of [58] suggested a unique anomaly detection in communication networks (ADCNs) approach based on an ensemble learning (EL) algorithm. It minimizes dimensionality and finds the best feature subset in each of the three datasets (NSL_KDD, UNSW_NB2015, and CIC_IDS2017) separately. Their results were obtained by applying 30, 35, and 40 features of the system to the three datasets, yielding accuracy values of 99.6% for NSL_KDD, 99.1% for UNSW_NB2015, and 99.4% for CIC_IDS2017. Since the work in [58] was the primary study, we applied the same research methods and reimplemented its algorithms for the purpose of benchmarking with a variety of dataset pre-processing and feature combinations. In addition, we utilized the more recent CICDDOS2019 dataset and concentrated our attention on data anomalies in 6G cellular networks.
For 5G networks, the authors of [59] offered a special feature extraction and detection system where a deep belief network (DBN) and bidirectional long short-term memory (Bi-LSTM) were utilized as part of a hybrid classifier (HC). The weights of both the Bi-LSTM and the DBN are optimized using a novel deer hunting updated sun flower optimization (DHSFO) model, which combines the concepts of the sun flower optimization (SFO) and deer hunting optimization (DHO) methods, to perform the detection stage properly and precisely.
A unique architecture was put forth by the developers of [60] with the goal of proactively detecting 5G network traffic anomalies or forecasting them before they happen. Clustering and decision-tree-based learning are applied to automatically identify anomalies in 5G network traffic. For the proactive identification of impending network traffic behaviors, a time series model, more specifically a bidirectional long short-term memory (BiLSTM) autoencoder, is used. According to the evaluation results, the proposed framework is effective and viable and has the potential to be more successful than a state-of-the-art solution, with a prediction accuracy of up to 90.02%.
Using a deep-learning methodology, the authors of [61] proposed a new technique for anomaly identification that is based on the autoencoder principle and long short-term memory (LSTM) recurrent neural networks. Real sensor data was used in a number of simulations, and the accuracy of the discovered abnormalities was assessed. The findings indicate that the RNNs have an accuracy of 87% in detecting anomalies over both short and extended periods of time. SDWSNs and RNNs can therefore be utilized as a first step in effectively detecting different kinds of anomalies.
Many researchers have examined distributed machine learning techniques [62,63] and found that they significantly reduce the complexity of high-dimensional data. These researchers have demonstrated the advantages of utilizing ML algorithms to handle enormous amounts of data in the IDS pre-processing step. Moreover, DL algorithms could access hidden features in the anomaly-classification step across many targets to find unidentified attacks. They achieve the highest detection rates with the lowest false alarm and false negative rates, greatly outperforming state-of-the-art technologies while consuming the least time when applied to many datasets.
To address computing cost and privacy preservation issues, the authors of [64] introduced a synonym-based multi-keyword ranked search over encrypted cloud data. To allocate an index vector for each document, the suggested mechanism builds an m-Way search tree. It then uses a depth-first search to compute the top score ranking, improving search efficiency. The examination's findings demonstrate that the presumed index vectors might be separated into sub-vectors before being stored in the index tree to improve computing efficiency and security.

Threats and Attacks in 6G Networks
Although 6G networks are still being developed, it is crucial to take into account the potential dangers and attacks that might occur in these networks. Here are a few potential 6G network dangers and attacks [65,66]:
• Man-in-the-middle (MITM) attacks: These attacks allow an attacker to intercept and change network traffic in order to steal sensitive data or launch attacks on other network-connected devices.
• Supply chain attacks: By focusing on the vendors and suppliers who provide the network's components, supply chain attacks can be used to compromise network infrastructure.
• Quantum attacks: These attacks could put 6G networks at risk by using quantum computers to break encryption and other security measures currently thought to be secure.
Network operators and device makers must install strong security mechanisms, such as encryption, authentication, and access control, to handle these dangers and attacks in 6G networks. In order to identify and stop attacks, network operators may also need to invest in extra equipment, such as intrusion detection systems and firewalls. Additionally, it is critical to ensure that the 6G ecosystem's various parts, including devices, networks, and applications, are built with security in mind and routinely patched to address emerging threats and vulnerabilities.
To ensure that they are resilient to a variety of risks and attacks, 6G networks will ultimately need to be designed with a "security-first" approach. In addition to the continued research and development of new security technologies and procedures, this will call for tight cooperation between network operators, device manufacturers, and other 6G ecosystem partners [67,68].
Combining two detection methods results in hybrid-based detection. Denial of service (DoS), remote to local (R2L), user to root (U2R), and probing are four different types of cyberattacks. U2R attacks occur when users attempt to obtain root or admin user access rights. R2L attacks occur when a remote user attempts to log in as a local user. DoS refers to the phenomenon where a genuine user is prevented from accessing the system by a hostile actor who is overloading the network resources. In probing attacks, fraudsters scan the network to identify vulnerable points for upcoming assaults.
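The four attack families above can be made concrete with a small lookup. The sketch below (not from the paper; the label names follow the public NSL-KDD files, and the grouping shown is only a representative subset) maps individual NSL-KDD attack labels to the coarse categories just described:

```python
# Hypothetical helper: group a few representative NSL-KDD attack labels into
# the four categories described in the text (DoS, R2L, U2R, probing).
ATTACK_CATEGORIES = {
    "dos":   {"neptune", "smurf", "back", "teardrop", "pod", "land"},
    "r2l":   {"guess_passwd", "ftp_write", "imap", "phf", "warezmaster"},
    "u2r":   {"buffer_overflow", "rootkit", "loadmodule", "perl"},
    "probe": {"portsweep", "ipsweep", "nmap", "satan"},
}

def categorize(label: str) -> str:
    """Return the coarse attack category for an NSL-KDD label ('normal' stays 'normal')."""
    label = label.lower()
    if label == "normal":
        return "normal"
    for category, members in ATTACK_CATEGORIES.items():
        if label in members:
            return category
    return "unknown"

print(categorize("neptune"), categorize("rootkit"))  # dos u2r
```

A mapping of this kind is what turns the raw multi-class labels into the binary and four-category targets used in later experiments.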

AI/ML Technology in 6G Network
According to recent studies, the network architecture for all 6G network technologies must include AI and ML. The 6G networking industry has paid close attention to artificial intelligence. A lot of training data and strong processing capacity are needed to adopt AI/ML in 5G networks. However, AI/ML have become essential to 6G networks. Different frames of 6G's security defense and protection are secured using AI and ML. Security systems have improved in autonomy, accuracy, and predictiveness due to the use of AI and ML.
This subsection discusses a few of the issues with AI/ML in the 6G system [69,70]:

1. Reliability: AI manages network security and the accuracy of machine learning models and components.
2. Visibility: Real-time AI- and ML-based security functions are monitored to assure credibility and control.
3. Ethical and legal considerations: AI-based optimization approaches can potentially exclude specific clients or applications. Whether or not AI-powered security solutions protect all users, AI controls who oversees security services that fail.
4. Versatility and adaptability: Protected data transfers are essential to protecting federated learners' privacy. AI/ML struggle with the scalability of the necessary computation, storage resources, and communication.
5. Managed security duties: There could be a lot of overhead when AI/ML security solutions are linked to huge data activities. The learning and inference phases should be safe and secure for the model's flexibility.

The anticipated intelligent 6G system would leverage advanced AI methods and methodologies to satisfy the demands of new use cases, high service needs, and necessary capabilities. Figure 1 shows the AI/ML-based 6G secured architecture; the following sections characterize it.

Security Improvements for 5G
In comparison to earlier generations of mobile networks, 5G networks feature significant security enhancements, but they also present new security difficulties. The following are a few of the major security upgrades in 5G networks [71][72][73][74]:

1. Better encryption: 5G networks employ better encryption methods to secure data in transit. The advanced encryption standard (AES) with 256-bit keys, which is regarded as being very secure, is required by the 5G standard.
2. Network slicing: 5G networks divide the network into virtual networks, each with its own set of security guidelines and restrictions. This makes it possible to separate and secure various types of traffic using various levels of protection.
3. User authentication: To defend against identity theft and other sorts of threats, 5G networks use better user authentication protocols, including 5G authentication and key agreement (AKA).
4. Secure boot: Secure boot certifies that the operating system and software of 5G devices are valid and have not been tampered with.
5. SIM-based security: SIM-based security is used by 5G networks to thwart SIM swapping attacks, in which criminals attempt to seize control of a user's SIM card to obtain access to their account.
6. Software-defined networking: Software-defined networking (SDN) enables dynamic network configuration and security policies to be updated in real time to address new threats and attacks; 5G networks use SDN to enable this functionality.
7. Network function virtualization: Network function virtualization (NFV) is used in 5G networks to enable the implementation of network functions in software instead of hardware. This can lessen the danger of hardware-based attacks and make updating and securing network operations easier.
8. Improved IoT security: To guard against attacks on IoT devices, 5G networks include enhanced security mechanisms for IoT devices, including mutual authentication and secure communication protocols.
9. Improved privacy: 5G networks include improved privacy features, such as the capacity to send data in a manner that conceals the user's identity, making it harder for attackers to monitor or identify users.
10. Better network management: Thanks to network slicing and SDN solutions, 5G networks now offer more granular control over network resources and security regulations.
While 5G networks have significant security advantages over earlier mobile network generations, they also present new security difficulties, such as the increased use of edge computing and the requirement for security measures to be implemented on a variety of devices. To solve these issues and guarantee the security of 5G networks, network operators, device makers, and other players in the 5G ecosystem must continue to develop and adopt new security measures. This will require continued innovation and cooperation, as well as a "security-first" mindset in the development of 5G networks and devices [72].

ML Technology in 6G Network
As ML technology may be used to enhance network performance, optimize resource allocation, and enable new services and applications, it is anticipated to play a significant role in 6G networks. Listed below are a few potential uses for ML technology in 6G networks [73][74][75][76][77]:

1. Network optimization: To enhance network performance and lower latency, ML algorithms can be used to evaluate network traffic and optimize network resources, such as bandwidth and power. To prevent network slowdowns, ML can also be used to forecast network congestion and modify network resources in real time.
2. Intelligent resource allocation: ML algorithms can be used to intelligently allocate network resources, such as frequency bands and spectra, to increase network capacity and performance. This can be crucial in 6G networks, which are anticipated to accommodate a variety of devices and applications with various bandwidth and latency requirements.
3. Anomaly detection: ML algorithms can be used to identify unexpected data transmission patterns in network traffic, which may indicate security breaches or other problems. By detecting anomalies in real time, ML can enable quicker responses to security threats and other network concerns.
4. Predictive maintenance: ML algorithms can be used to forecast the need for maintenance on network infrastructure such as base stations and antennas. By analyzing sensor and other data, ML algorithms can find patterns that could point to potential problems or maintenance requirements. This can enable preventative upkeep and repair, minimizing downtime and enhancing network efficiency.
5. Intelligent edge computing: ML algorithms can be installed on edge devices such as gateways or routers to enable real-time data processing and analysis at the network edge. As a result, less data needs to be sent across long distances, enhancing network efficiency and lowering latency. Additionally, by optimizing edge computing resources such as processing speed and memory, ML can make data processing more effective and efficient.
6. Intelligent network management: ML algorithms can be used to examine network performance data to pinpoint areas that can be improved and to streamline network management procedures. For instance, ML can be used to forecast network issues and suggest proactive maintenance, or to optimize network routing to ease congestion and boost performance.
7. Enhanced security: By quickly identifying and addressing security issues, ML algorithms can improve network security. For instance, ML can be used to recognize strange devices or traffic patterns on a network and automatically activate security measures in response.
In general, ML technology can boost network security in 6G networks, enable new services and applications, and enhance network performance [78]. However, implementing machine learning algorithms in 6G networks will necessitate a substantial investment in infrastructure for artificial intelligence and machine learning and careful consideration of data privacy and security issues [79][80][81][82][83]. Additionally, in order to ensure interoperability and compatibility amongst various components of the 6G ecosystem, the deployment of ML in 6G networks would necessitate close cooperation between network operators, device makers, and application developers [84][85][86].
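As one concrete instance of the anomaly-detection role described above, the following sketch (not from the paper; all traffic data here is synthetic and the feature columns are hypothetical) flags unusual records with scikit-learn's unsupervised Isolation Forest:

```python
# Illustrative sketch: flag unusual traffic records with an Isolation Forest.
# The "traffic" matrix is synthetic; in practice it would hold per-flow features.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 3))  # routine traffic features
spikes = rng.normal(loc=8.0, scale=1.0, size=(5, 3))    # injected anomalous flows
traffic = np.vstack([normal, spikes])

detector = IsolationForest(contamination=0.01, random_state=0).fit(traffic)
flags = detector.predict(traffic)  # +1 = inlier, -1 = flagged anomaly

print(int((flags == -1).sum()), "records flagged")
```

Because the detector is unsupervised, it needs no attack labels, which is exactly why it suits the "recognize strange devices or traffic patterns" scenario sketched in item 7.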


Research Methodology
The reimplemented architecture has several processes for detecting anomalies. The first component of the defensive system is an intrusion detection system with databases hidden behind the firewall (network data is preprocessed). Following preprocessing, it is necessary to identify any replacement of the null values with alternate values after the system checks for any missing values. Default consideration is given to average values, after which duplicate values are purged from the dataset. Dimensional reduction is applied to the encoded data to facilitate data management. Anomaly detection is aided by feature optimization, which is performed to extract the best characteristics from the data.
The filtered data is then moved on to the next stage, where the CFS-RF method is used to choose only the features that affect the outcomes. In order to distinguish between legitimate activity and potential attacks, the system employs the reimplemented hybrid adaboosting bagging algorithms (HABBAs) as classifiers. The system's detailed construction is shown in Figure 2. It comprises several phases, each of which completes a particular task in a series of sequential stages; the previous step serves as input for the following stage. These phases and steps are thoroughly discussed below. After reading the collected NSL_KDD, UNSW_NB2015, CIC_IDS2017, and CICDDOS2019 datasets, the three primary processes of the preprocessing stage (filtration, transformation, and normalization) are carried out.
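The missing-value and duplicate handling described above can be sketched with pandas. This is a minimal illustration on a toy frame (the column names are hypothetical, not the datasets' real schema): missing numeric values are replaced with the column average, then duplicate records are purged, as the text describes.

```python
# Minimal sketch of the described pre-processing on a toy DataFrame.
import pandas as pd

raw = pd.DataFrame({
    "duration": [1.0, None, 3.0, 3.0],          # one missing value
    "protocol": ["tcp", "udp", "tcp", "tcp"],
    "label":    ["normal", "dos", "normal", "normal"],  # last row duplicates row 3
})

clean = raw.copy()
# Replace missing numeric values with the column average (default consideration).
num_cols = clean.select_dtypes("number").columns
clean[num_cols] = clean[num_cols].fillna(clean[num_cols].mean())
# Purge duplicate values from the dataset.
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean.shape)  # (3, 3): one duplicate row removed, no NaNs remain
```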

Simulation Steps
Here are the steps we used to simulate a new anomaly detection approach based on group learning to detect attacks in 6G communications using Python with multiple datasets:

• Define the scope and goals of the simulation: Before running the simulation, we defined its scope and the goals we wanted to achieve. This included defining the types of attacks we want to detect, the performance metrics we want to improve, and the datasets we want to use.
• Collect and process the datasets: For our simulation experiment, we collected and preprocessed four datasets: UNSW_NB2015, CIC_IDS2017, CICDDOS2019, and NSL_KDD. This involves cleaning and pre-processing the data to remove any noise or outliers and converting the data into a format suitable for machine learning and deep learning. The data also needs to be divided into training, validation, and test sets.
• Implement a group learning approach: The next step is to implement a group learning (ensemble) approach to detect anomalies in the connectivity networks, which involves selecting suitable basic models, such as decision trees, neural networks, or support vector machines, and combining them using an appropriate ensemble method, such as bagging, boosting, or stacking. Python libraries such as scikit-learn and TensorFlow were used to implement and train these models.
• Train and validate the model: Once the group learning approach has been implemented, the model can be trained and validated using the datasets. This may include splitting the datasets into training, validation, and test sets and using cross-validation to evaluate model performance. Measures such as accuracy, recall, and F1 score were used to evaluate the model's performance.
• Improve the model: After validating the model, its performance was improved by adjusting the hyperparameters, choosing different basic models, using different ensemble methods, and integrating the equations.
• Visualize and interpret the results: After refining the model, the results were visualized and interpreted to gain insights into the attack detection problem and the model's performance. This involved creating visualizations of the data and the model and interpreting the results in the context of the specific attack detection problem.
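The train/validate step above can be sketched as follows. This is an illustrative skeleton on synthetic data (the paper's real datasets are not loaded here, and the model choice is only a stand-in), showing a train/test split, cross-validation, and the accuracy/recall/F1 measures named in the text:

```python
# Sketch of the train-and-validate step on synthetic binary-classification data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

# 5-fold cross-validation on the training portion estimates generalization.
cv_acc = cross_val_score(model, X_tr, y_tr, cv=5).mean()
print(accuracy_score(y_te, pred), recall_score(y_te, pred), f1_score(y_te, pred))
```

Swapping in the preprocessed NSL_KDD, UNSW_NB2015, CIC_IDS2017, or CICDDOS2019 features for `X`, `y` reproduces the evaluation loop the steps describe.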

Description of Datasets
This system implements experiments using the UNSW_NB2015, NSL_KDD, CIC_IDS2017, and CICDDOS2019 datasets. The initial dataset is NSL-KDD. A strong preprocessing stage (the first step of the reimplemented system) was developed to decrease the forecasting challenges. Different baseline classifiers were used to categorize records at five different levels of complexity, and the number of accurate predictions was noted next to each occurrence [86][87][88].
The proportion of records picked for each difficulty-degree category in the original KDDCup'99 dataset is inversely related to the number of records chosen. In our sample, we used 125,973 occurrences of the KDD training set, which included 58,631 attacks and 67,342 instances of routine traffic. Many current low-key attacks were included in the second dataset (UNSW-NB15), which aimed to replicate current network configurations. It contained 45 columns (id = 1, features = 44) and 2,540,045 records spread over four big-data CSV files, with 82,334 records for testing and 175,342 for training, respectively [89].
The CIC_IDS2017 dataset contained recent large attacks [87] and benign data, in addition to the results of network traffic analysis using the CICFlowMeter tool. Flows were used to time-stamp protocols, source and destination IP addresses, ports, and all attacks. This dataset also included updates for threats such as port scanning, DDoS, brute force, XSS, SQL injection, penetration, and botnets. A total of 2,830,744 records were spread among 8 files, each containing 78 uniquely named features [90].

Stages in Pre-Processing
The preprocessing stage converts the unprocessed data into an appropriate format for analysis in three steps before applying it. Algorithm 1 explains these processes (i.e., filtration, transformation, and normalization).

The Step of Filtration
Filtration removes unnecessary or useless material from the data, making the datasets simpler to use and comprehend. It distributes the remaining data and reorganizes it into classified groupings.

The Step of Transformation
Transformation converts a property's category value into a number using a one-hot encoding algorithm. For instance, it uses this function to transform several protocols, such as the transmission control protocol (TCP) and the user datagram protocol (UDP), into numerical data.

The Step of Normalization
The min-max function is used to scale each numeric value x into the range [0, 1] via x' = (x - min)/(max - min).

Algorithm 1. Pre-processing:
Step 1: Filtration.
1. For each instance, remove occurrences that are pointless or unnecessary.
2. Set up the distribution classification.
Step 2: Transformation.
3. For each feature, if the input is not numeric, apply the one-hot encoding function to the categorical characteristic.
4. End For.
Step 3: Normalization using min-max.
5. Max = the highest value; Min = the lowest value.
6. Rescale each numeric value x to (x - Min)/(Max - Min).
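The normalization step (Step 3) can be sketched with scikit-learn's MinMaxScaler; the feature values below are toy numbers, not the datasets' columns:

```python
# Min-max scaling to [0, 1]: x' = (x - min) / (max - min),
# applied column-wise, as in Step 3 of the pre-processing algorithm.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.0, 100.0],
              [5.0, 300.0],
              [10.0, 500.0]])
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)  # each column now spans [0, 1]
```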

CFS-RF Hybrid Method
CFS and bagging EL were combined into a hybrid technique for effective FS and accurate classification. As shown in Algorithm 2, the system exploits the reimplemented hybrid CFS-RF for FS, which works as follows. It first takes the Xi values produced by the preprocessing stage and scores each candidate feature subset with the merit equation:

Merit_S = (k · r_kf) / sqrt(k + k(k - 1) · r_ff)   (1)

where k is the number of features in subset S, r_kf is the mean correlation between features and classes, and r_ff is the mean correlation between features. Equation (1) expresses the correlation computed by CFS. The method then creates RF subsets using the ensemble

{h(x, θ_c), c = 1, 2, 3, ...}   (2)

where h is a base tree, x is the input feature vector, and θ_c is the random parameter vector of the c-th tree.

Algorithm 2 (sketch):
1. For each subset, discriminate between the most relevant and irrelevant features.
2. End For.
3. Reduce dimensionality and select the most relevant features.
4. If the datasets require discretization, perform evaluation and estimate accuracy.
5. Return the feature classes.

Redundant characteristics are verified by computing the weight range w_Rλ, where λ is the lambda parameter. The most pertinent characteristic is then identified as the one whose weight ω_i has the smallest standard deviation σ_i.
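The merit heuristic in Equation (1) can be sketched directly; the correlation values below are illustrative placeholders, not values computed from the datasets:

```python
# CFS merit heuristic: merit = (k * r_cf) / sqrt(k + k*(k-1) * r_ff),
# where r_cf is the mean feature-class correlation and r_ff the mean
# feature-feature correlation of a k-feature subset.
import math

def cfs_merit(r_cf: float, r_ff: float, k: int) -> float:
    """Score a subset of k features by the CFS merit equation."""
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

# A subset whose features correlate with the class but not with each
# other scores higher than a redundant subset.
print(cfs_merit(r_cf=0.6, r_ff=0.1, k=30) > cfs_merit(r_cf=0.6, r_ff=0.9, k=30))
```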

Algorithms for Training, Testing, and Recognition Attack Using Hybrid Adaboosting and Bagging
In this phase, the hybrid EL algorithms are developed. To obtain usable performance, the two classifiers (RF and SVM) are tuned to run sequentially as adaboosting, using their updated weights.

Differentiated RF Classifier
The modified RF used for adaboosting is explained in Figure 3. The parameters and weights are adjusted to improve the effectiveness of identifying unknown attacks. Initialization first equalizes all XiBest values with Wi and then produces RF subsets using Equation (5), in which p stands for population and B is a constant. As a stopping condition in Figure 3, the value of each XiBest must be calculated, along with the weight and standard deviation for each training set. To obtain the best results from these updated classifiers, the reimplemented model aggregates them to work simultaneously as bagging and uses the weighted-average voting strategy. The central idea of the recommended HABBA is stated in Figure 3: weight updates make the RF operate sequentially as adaboosting, and the subsequently updated classifiers are aggregated by weighted-average voting to obtain the best trade-off between variance and bias. This modified version of the algorithm produces better results with lower error rates.


Classifier for Modified SVM
The dataset's weights for each XiBest feature are updated. The improved SVM classifier builds the support vector and categorizes the datasets by learning the hyperplane parameters, the weight vector (w) and the bias (b). The adaboosting method is used in the first stage of Algorithm 3 with respect to every altered classifier (Figure 3). The weight of each classifier is determined from err(Xj), the error of each classifier, and wj, its weight; the weight is then recalculated and the error rate validated. As a result, performance is improved and variance is reduced. The second step uses the basic idea of the bagging algorithm to make these classifiers function as bootstraps. A shallower tree (during each phase of splitting), a sample of the variables, and a fresh dataset are used to minimize overfitting, creating a composite model with reduced bias.
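The aggregation stage, in which the two tuned classifiers are combined by weighted-average voting, can be approximated as follows; the voting weights and hyperparameters are illustrative choices, not the paper's:

```python
# Sketch: soft (weighted-average) voting over an RF and an SVM,
# approximating the bagging/voting aggregation stage of HABBA.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=1)
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=1)),
        ("svm", SVC(probability=True, random_state=1)),
    ],
    voting="soft",   # average the predicted class probabilities
    weights=[2, 1],  # illustrative: trust the RF twice as much
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```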
The average voting method is calculated using Equation (9).

Confusion Matrix in Binary
The datasets are used to implement the HABBAs. Each class, which contains both normal and anomalous traffic, is applied to the confusion matrix. The given CFS-RF and HABBAs are evaluated on the FSs (i.e., 13, 30, 35, and 40 features) to detect intrusions. The confusion matrix is a matrix often used in ML to evaluate the performance of algorithms: it summarizes the total correct and incorrect values predicted by the machine-learning algorithms, as shown in Table 2. The concept uses the binary-class form of a confusion matrix, applied to the NSL_KDD classes. Tables 3 and 4 show the NSL_KDD implementation using 30 FS as a binary class. The four states are designated as true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), and false negative rate (FNR); these four statuses affect the system's performance metrics. The tables demonstrate how the confusion matrices account for attack frequency and class distribution, showing that the combination of 30 attributes yields the best outcomes.
The NSL KDD confusion matrix for 30 features is shown in Table 3. The best FS is the 30-feature subset, as shown by Table 3's explanation of the NSL KDD binary type. For 35 features from the UNSW NB2015 dataset, see Table 4's confusion matrix. Forty features from the CIC IDS2017 dataset are listed in Table 5 as confusion matrices. The confusion matrix for 40 features from the CICDDOS2019 dataset is shown in Table 6.
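The four rates can be derived from a binary confusion matrix as follows; the labels are toy values, not the results in Tables 3-6:

```python
# Deriving TPR, FPR, TNR, and FNR from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]  # 1 = attack, 0 = normal
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # detection rate
fpr = fp / (fp + tn)  # false alarm rate
tnr = tn / (tn + fp)
fnr = fn / (fn + tp)
print(tpr, fpr, tnr, fnr)
```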

Algorithm 3 (HABBA training, testing, and attack recognition):
1. For each classifier Ki, compute the error rate Mi using Equation (7) to modify the classifier.
2. If Mi > 0.5 (i.e., the error rate exceeds 0.5), calculate Wi for each procedure using Equation (8), compute each adjusted classifier's prediction as Ci = Mi, and add Wi to Ci for each of the two classifiers (i.e., RF and SVM). End If.
3. Repeat until both classifiers are finished.
4. For each modified classifier Ci (i.e., SVM and RF), apply the ensemble principle using a bootstrap model, and apply the voting concept using Equation (9).
5. After voting, compute the accuracy of each prediction before (Xi-Before) and after (Xi-After) voting.
6. If Xi-Before is greater than Xi-After, average the votes with the greatest likelihood; otherwise, compute FAR, FNR, DR, and accuracy. End If. Repeat until the terminating requirement has been met.
7. Return the observations and the combined model.

Different datasets are utilized to find new attacks, strengthening the system's defenses against outside threats (zero-day threats). The FNR and accuracy of these features are thoroughly explained in Table 7. Using 30 features in the reimplemented system yields the best accuracy and FNR outcomes. FNR is the number of false negative detections divided by the sum of true positive and false negative detections. Counting the errors for each attack that has been classified as normal is crucial for assessing the effectiveness and quality of the system. The FNR and accuracy measures are insufficient for comparison when 13 and 41 features are used. Table 7 also displays the binary class for UNSW_NB2015, which uses 35 features, and shows accuracy measurements for all the features used in these tables, including FN, TN, FP, and TP. It explains the binary class of UNSW_NB2015, demonstrating that the 35-feature FS is the ideal one for accurately distinguishing attacks from normal traffic: applying 35 features gives an FNR of 0.01 and a greater accuracy of 99%, as shown in Table 7.

The reimplemented system was finally evaluated across all FSs on CICDDOS2019, the last of the test sets, using the binary class of the CICDDOS2019 confusion matrix with 40 features. The TP, TN, FP, and FN accuracy measures are shown in Table 8 for all FSs and datasets. When the reimplemented system uses 40 features, an FNR of 0.00779 accompanies the best accuracy of 99.95%. Table 8 uses the F-score measure to show the precision of each attack in the dataset. The best results across all classes with the reimplemented system are shown in Table 8, with Brute Force, XSS, and Botnet achieving 100%. This suggests that this number of features is the right one for recognizing all types of attacks.

System Configuration and Implementation
Four separate datasets (NSL_KDD, UNSW_NB2015, CIC_IDS2017, and CICDDOS2019) were utilized in implementing the reimplemented approach. The reimplemented system was tested using the remaining 30% of each dataset after 70% had been used for training. To gauge its performance, the reimplemented work was run on pairs of selected feature subsets, all obtained with CFS-RF: 13 and 30 features for NSL_KDD, 13 and 35 for UNSW_NB2015, and 13 and 40 for CIC_IDS2017 and CICDDOS2019. When employing CFS and collecting these subsets, RF was carried out with penalizing characteristics, randomly selecting 10 estimators (10 subsets), and an ensemble.
Evaluation on the learning datasets computes correlations between features. Wi is calculated for each set by using the highest weight and disregarding inferior weights. In the end, only the most influential subset of features, those that affect intrusion-detection performance, was chosen. The hybrid CFS-RF approach reduced each dataset's dimensionality and removed unnecessary attributes. In this way, the analysis and processing of the datasets by the planned CFS-RF yielded 30 features for NSL_KDD, 35 for UNSW_NB2015, 40 for CIC_IDS2017, and 40 for CICDDOS2019.
After that, HABBAs and the two classification forms (binary and multi-class) of a confusion matrix were used to identify potential intrusions. Precision, recall, DR, FNR, and FAR are the distinct metrics for assessing system performance. They were implemented using the following hardware and software: the Sklearn library with Python 3.8 and Colab, an Intel Core i9-11900K processor, 64 GB RAM, a 16 GB NVIDIA GeForce RTX 3090 GPU, 1 TB of fast SSD storage, and Windows 11 64-bit.
Because we intended to examine time complexity, we ran Google Colab locally to have more control over our model environment and to estimate our performance parameters, so we preferred to work offline. Colab was set up locally by installing the colab package with the pip manager on top of Jupyter Notebook.
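The 70/30 evaluation protocol described above can be sketched as follows; the synthetic data stands in for the intrusion datasets:

```python
# Sketch of the evaluation protocol: a stratified 70/30 train/test
# split, as used for the reimplemented system.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
print(len(X_train), len(X_test))  # 700 300
```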

The Runtime and Complexity
Accuracy, complexity, overfitting, and underfitting are some of the issues the models can encounter. Several dimensionality-reduction techniques can be used, such as StandardScaler, RobustScaler, MinMaxScaler, normalization, principal component analysis (PCA), and non-negative matrix factorization (NMF). One technique for reducing overfitting and complexity is k-fold cross-validation (CV), which guarantees that the F1-score of the ML model does not depend on how the samples of the training and test sets were picked. In k-fold CV, the dataset is divided into k subsets [90]. However, k-fold CV may fail to reduce overfitting and complexity when the training samples are very small or too sparse. In such cases, more than one method must be combined, for example, PCA with k-fold, or normalization with k-fold.
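A minimal k-fold CV sketch (k = 5 is an illustrative choice; the classifier and data are stand-ins for the paper's setup):

```python
# k-fold cross-validation of the F1-score, so the score does not
# depend on one particular train/test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=2)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=20, random_state=2),
    X, y, cv=5, scoring="f1",  # one F1 score per fold
)
print(len(scores))
```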
Using Big-O notation, the time complexity of the algorithm is O(N²): its performance is proportional to the square of the size of the input, since the algorithm uses nested loops and other operations of quadratic time complexity. Complexity and overfitting are assessed from the training- and test-set accuracies: if the training set has high accuracy and the test set has low accuracy, the model is complex and suffers from overfitting. As reference [58] was the main paper, we used the same research methodology and reimplemented its algorithms for benchmarking with different dataset preprocessing and features. Moreover, we used the more recent CICDDOS2019 dataset, and we also focused on data anomalies in 6G cellular networks. Figures 4-6 show complexity time, which can be seen as a measure of how fast or slow an algorithm performs for a given input size and is related to the computational complexity of ML models; the figures also show the highest and lowest values. NSL_KDD, UNSW_NB2015, CIC_IDS2017, and CICDDOS2019 were used to describe the running time. The DoS class had a maximum time of 9.2 s, while the R2L class had a minimum of 1.1 s, as shown in Figure 4, which also shows the maximum complexity time of the NSL_KDD dataset.
The UNSW_NB2015 dataset's complexity time is shown in Figure 5: the brute-force class had the lowest response time at 0.7 s, and the DDoS class had the highest at 11.3 s. The complexity time for the CIC_IDS2017 dataset, shown in Figure 6, was 6.3 s for the DDoS class and 0.8 s for the Shellcode class. According to [58], the verifications show almost identical quantities for Figures 4-6; for example, in Figure 4, both results were 1.3, 1.8, 3.2, and 9.6 s for R2L, DoS, Probe, and U2R, respectively. To verify and validate our work, we examined the CICDDOS2019 dataset in Figure 7 and implemented other algorithms, i.e., SVM, DT, RF, DBN, NB, and ANN, in Figures 8-10. For the CICDDOS2019 dataset, as shown in Figure 7, the FTP and SSH classes had the lowest response time at 0.3 s, and the DDoS class had the highest at 6 s. As a result, the running time was proportional to the number of inputs and increased as the inputs increased.
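The running-time observation can be reproduced in miniature by timing training as the input size grows; the model and sizes below are stand-ins, not the paper's setup:

```python
# Measuring wall-clock training time against input size, in the
# spirit of the reported complexity-time figures.
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

times = []
for n in (200, 400, 800):
    X, y = make_classification(n_samples=n, n_features=20, random_state=0)
    start = time.perf_counter()
    SVC().fit(X, y)  # train once per input size
    times.append(time.perf_counter() - start)
    print(f"n={n}: {times[-1]:.3f}s")
```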


Restrictions
The major goal of the study was to differentiate between regular and aberrant actions so as to make the system more resistant to new threats. However, the study has the following drawbacks: the HABBAs system performs admirably on dataset attacks but ignores additional attacks initiated by external networks (when available), and, owing to its trained components, it is difficult to adapt the HABBAs system to identify new threats once the training phase is over.

Evaluation of ML Models' Performance
We selected the ML methods DBN, ANN, NB, RF, DT, and SVM, since these are frequently used machine-learning models for cybersecurity. Figure 8 compares the effectiveness of the six machine-learning methods on datasets often used for intrusion detection. The values were chosen from the provided tables as the highest levels of recall, precision, accuracy, and F1-score reported for each dataset. SVM demonstrated its greatest accuracy, 83.2%, on the NSL-KDD dataset; on the KDD dataset its performance was approximately 98.1%, with an F1-score of 90.99%. On practically all datasets, DBN worked excellently, demonstrating intrusion-detection accuracy above 95% and an F1-score of 92.17%.
NB and ANN outperformed other models regarding accuracy on the DARPA dataset. However, ANN provided lower precision values. DBN outperformed other models regarding accuracy, precision, and recall on the NSL-KDD dataset. On the KDD-Cup 99 dataset, SVM and DBN outperformed all other models in terms of precision. Decision trees and random forests had the best precision rates among all the models tested on the KDD dataset. In that order, the best recall rates were demonstrated by the KDD dataset with random forest, DARPA with NB, NSL with DBN, and Cup99 with KDD.
The examination of six machine learning methods' performance on widely used malware detection datasets is shown in Figure 9. There are not many benchmark datasets for malware detection. For the most part, the researchers gathered their unique datasets and used machine learning techniques to assess the models. On a tailored dataset, ML approaches frequently display exceptional F1-score, accuracy, precision, and recall values, as has been noted. The offered strategies do not exhibit comparable performance when used on different datasets.
Traditional machine learning methods, such as decision trees, fared better on several datasets. On nearly all datasets, DBN had a fantastic recall value. On the VirusShare dataset, DT and RF displayed improved precision rates. On the Enron dataset, RF demonstrated good values for recall, accuracy, and precision. ANN displayed the weakest accuracy, recall, and precision on the Enron dataset. The NB and DT compared well to other models in terms of accuracy on the VirusShare dataset.
The comparison of accuracy, precision, and recall values for detecting intrusion, spam, and malware is shown in Figure 10. Regardless of the dataset, the highest value that the six machine-learning models could produce was taken. SVM, DT, and RF delivered the highest accuracy and precision values for intrusion detection, while DBN and ANN had the highest recall values. If intrusion-detection accuracy is the highest priority, SVM, DT, NB, and RF are the reimplemented candidates.
Comparatively speaking, DBN and ANN were less effective than the other models at detecting malware, although ANN demonstrated exceptional recall. While RF and NB demonstrated greater accuracy in classifying spam, DBN is still advised in situations where precision and recall are crucial. Considering the metrics gathered from the reviewed papers, for greater accuracy it is advised to use RF and NB for classifying spam, SVM and DT for detecting malware, and DT for both tasks.

Analysis, Discussions and Evaluations
The reimplemented method's accuracy, DR, FAR, number of FS, FS, and classification technique were all analyzed and contrasted with earlier studies. The proposal's detection accuracy was 99% during training and 90% during testing. It produced a higher DR with a lower FAR than the single-stage technique, and it consistently achieved the highest levels of accuracy, DR, and FAR compared with past research. This trade-off is further explained in Table 9.

Recommendations
The dataset must go through a preprocessing stage to prepare for the CFS-RF feature-selection step. After the CFS-RF step, each class in the dataset is analyzed to confirm and choose only the most potent influencing characteristics that affect the outcomes [86]. Finally, CFS-RF chooses the dataset's most useful feature subset: 40 features for CIC_IDS2017, 35 for UNSW_NB2015, and 30 for NSL_KDD [87]. The classifier stage then begins: the SVM and RF classifiers operate sequentially as an adaboosting algorithm [88] and aggregate their outputs with the average voting method to function concurrently as a bagging algorithm [89].

Conclusions and Future Work
The current IDSs are still ineffective, largely due to the susceptibility of the anticipated wireless paradigms, despite having previously adopted various ML tactics to increase their performance. In this research, the CICDDOS2019 dataset is analyzed using ML models; it includes a wide range of attacks, such as DDoS, DoS, brute force, XSS, SQL injection, botnets, web attacks, and infiltration. These wireless paradigms are also susceptible to attacks caused by predictable context (e.g., replay attacks), because they rely on the limited entropy of the wireless physical context to protect a shared key.
This study developed a unique IDS technique to handle imbalanced FS and EL algorithms' preferred hybrid approaches, which are based on high-dimensional traffic with less DR. Using samples from NSL KDD, UNSW NB 2015, CIC_IDS2017, and CICDDOS2019, a hybrid EL approach was used to obtain the best 30 features, 35 features, a subset of function correlation, and 40 features, respectively.
The FAR values for the NSL_KDD, UNSW_NB2015, CIC_IDS2017, and CICDDOS2019 datasets were 0.0039, 0.0076, 0.0009, and 0.00113, respectively, with accuracies of 99.5%, 99.9%, 99.8%, and 99.95%. The results-comparison table includes the other parametric values. The number of records selected was inversely associated with the percentage of original KDDCup'99 dataset records picked for each difficulty-degree category.
In our sample, there were 125,973 occurrences of the KDD_Train set, including 58,630 attacks and 67,343 instances of routine traffic. The majority of current low-footprint attacks were included in the second dataset (UNSW-NB15), which aimed to replicate current network configurations. It contained 2,540,044 records spread across 4 big-data CSV files, 45 columns (id = 1, features = 44), 175,341 training records, and 82,332 testing records. The system approach performed better than the current classification algorithms, giving it a large competitive advantage in the IDS market over other tactics. Although CFS-RF with ensemble HABBA algorithms has advantages, further work is still needed to strengthen the system's ability to handle potential threats from infrequent future traffic.
The IDS has applied a connection record to each connection separately; by deploying the reimplemented NIDS on private security firms' servers, the authors believe that examining connections in a data stream can help identify otherwise undetectable attacks. The provided method is an excellent and trustworthy way to identify network breaches quickly and precisely.
Regarding future work, robust machine-learning models are required to handle adversarial inputs; to build models resilient against hostile inputs, the model should be trained in hostile scenarios. We have examined several machine-learning models that use different datasets to identify threats to 6G security, and we recommend that newcomers to this field explore this study's entire reference list. Our upcoming work will examine and use more ML and DL methods to combat many other cybersecurity concerns, evaluating ML models in various cybersecurity domains, such as IoT, smart cities, techniques based on API calls, cellular networks, and smart grids.
In the future, we want to dig deeper into the various 6G network attacks. Future research will be required to address the crucial problem of protecting 6G, such as fuzzy logic rules. Fuzzy logic can help improve the accuracy and effectiveness of 6G security systems by considering uncertainty and imprecision in data.