Real-Time DDoS Attack Detection System Using Big Data Approach

: Currently, the Distributed Denial of Service (DDoS) attack has become rampant, and shows up in various shapes and patterns, therefore it is not easy to detect and solve with previous solutions. Classiﬁcation algorithms have been used in many studies and have aimed to detect and solve the DDoS attack. DDoS attacks are performed easily by using the weaknesses of networks and by generating requests for services for software. Real-time detection of DDoS attacks is difﬁcult to detect and mitigate, but this solution holds signiﬁcant value as these attacks can cause big issues. This paper addresses the prediction of application layer DDoS attacks in real-time with different machine learning models. We applied the two machine learning approaches Random Forest (RF) and Multi-Layer Perceptron (MLP) through the Scikit ML library and big data framework Spark ML library for the detection of Denial of Service (DoS) attacks. In addition to the detection of DoS attacks, we optimized the performance of the models by minimizing the prediction time as compared with other existing approaches using big data framework (Spark ML). We achieved a mean accuracy of 99.5% of the models both with and without big data approaches. However, in training and testing time, the big data approach outperforms the non-big data approach due to that the Spark computations in memory are in a distributed manner. The minimum average training and testing time in minutes was 14.08 and 0.04, respectively. Using a big data tool (Apache Spark), the maximum intermediate training and testing time in minutes was 34.11 and 0.46, respectively, using a non-big data approach. We also achieved these results using the big data approach. We can detect an attack in real-time in few milliseconds.


Introduction
The data stored on the internet is growing day by day particularly for threats that target sensitive or crucial data, and this has raised many security issues, such as malicious intrusions [1]. Targeted attacks and threats such as malware and botnets cause great damage to the community in different factors, such as financial loss or health loss. Significant research has been conducted in which researchers have proposed different Intrusion Detection Systems (IDS) to mitigate the risk of malicious intrusion attacks [2]. Now,

•
Intrusion detection (ID) is divided into two processes: monitoring the intrusions and analyzing a network's events to seek any malicious packets or a source in the computer network or a computer system. • Intrusion Detection System (IDS) detects an event as an intrusion when a noticeably different event occurs from a legitimate or authorized event [9].
Machine learning (ML) is the study that is continuously being enhanced via training and the exploitation of information. It is considered a component of artificial intelligence. There are different kinds of learnings based on the information available, such as supervised, semi-supervised and unsupervised learning. ML has a good results through the use of bimodal approaches through big data and deep learning models. ML techniques are utilized in a broad range of applications, such as in healthcare, for predicting COVID19, Osteoporosis and Schistosomiasis [10][11][12][13][14][15].
The recognition and detection performance in terms of security evaluation remain an issue, such as with DDoS attacks and blockchain. Classification algorithms have been used in many studies and aimed to detect and solve the DDoS attack. DDoS attacks can be performed easily by using the weaknesses of networks and by generating requests for services for software [16,17]. The elapsed time during any kind of real-time detection of DDoS attacks is also a challenge, and as attacks are not easy to detect and mitigate, this solution holds great value, as these attacks can cause big issues [18]. However, the existing approaches have many problems in detecting the DDoS attacks, such as computation costs while detecting, and the inability to deal with large requests passing through the network towards the server. Classification algorithms classify the packets for the identification of DDoS from the normal packets [19].
In recent years, many studies have applied the big data framework Spark ML to achieve better results, but they have not calculated the running time after applying Spark, as in our approach [20][21][22][23][24][25]. Moreover, machine learning algorithms, together with the big data approach, can solve many complex problems and find many hidden patterns in the context [26]. Big data promotes environment sustainability to solve very complex problems in real-time on social media using big data approaches, applications such fake profile detection or supply chain management with blockchain technology [27,28]. Apache Spark has evolved into an illegitimate platform for big data analysis. It is a computational framework for a high-performance cluster featuring Java, R language-integrated APIs, and Python. As a free software product that is quickly evolving, with a rising number of participants from university and business, the whole production and studies underlying Apache Spark, particularly by those that are novices in this domain, has proven difficult for academics. This article presents a practical assessment of extensive data analysis using Apache Spark. The major elements, concepts, and capabilities of Apache Spark are addressed in this analysis. In particular, this article illustrates the architecture and implementation of Apache Spark large-scale database techniques and pathways for master learning, diagram research, and data aggregation [29,30].
Two models were selected for real-time detection: Random Forest (RF) and Multi-Layer Perceptron (MLP). Both classifiers were evaluated using the Scikit ML libraries as a non-big data approach and Spark ML libraries as big data approach. In terms of accuracy, we achieved similar mean accuracy in both of the models. Still, in terms of training time and testing time, the big data approach outperforms the non-big data approach because Spark computations in memory take place in a distributed manner. The minimum average training and testing time in minutes was 14.08 and 0.04, respectively. Using a big data tool (Apache Spark), the maximum average training and testing time in minutes was 34.11 and 0.46, respectively, using a non-big data approach. We also achieved these results using the big data approach. We can detect an attack in real-time in few milliseconds.
The following are the major contributions of our study: • As far as we know, there is no study such as this, which has compared accuracies as well as execution time with machine learning and Apache Spark ML on Distributed Denial of Service (DDoS).

•
We detected DoS attacks in an efficient manner with and without the use of big data machine learning approaches in Random Forest and Multi-Layer Perceptron models.

•
In addition to the detection of DDoS attack, we have optimized the performance of the models by minimizing the execution time as compared with other existing approaches using the big data framework.
The document is organized as follows: Section 2 describes the related work that has been done and reviews the current studies related to machine learning approaches in detecting network attacks; Section 3 presents a methodology and dataset we used for the experiments; Section 4 shows the experiments we performed using both techniques (big data and non-big data); Section 5 discusses the results obtained from both methods of detecting DDoS attacks in real-time; finally, in Section 6, the conclusions and future work are presented.

Related Work
Zhang et al. [31] proposed a framework to detect intrusion using a Random Forest algorithm using big data Apache Spark. Random Forest algorithm was implemented in Apache Spark to detect intrusion in the high-speed network data, and their efficiency and accuracy were then evaluated and compared to existing systems. An issue with this approach is that they evaluated few classification algorithms in their research. They implemented an RF algorithm in Apache Spark's design and evaluated their results compared with other fast speed data network models. Their proposed model achieved higher accuracy and can detect intrusion in a shorter amount time. The results showed that their model could detect intrusion in 0.01 s compared with other existing models. A simple RF model can detect an attack in 1.10 s.
Wang et al. [32] proposed a model that improves data learning with fewer resources and taking less time. They proposed a model "SP-PCA-SVM", which combines Support Vector Machine (SVM) with Parallel Principal Component Analysis (PCA) using Apache Spark tool to reduce training time and learning efficiency. The results showed that the accuracy rate is somehow reduced, reduces classification time (training time), and increases model learning efficiency. They proposed a novel model (SP-PCA-SVM), which significantly reduced training time and increased model learning efficiency. PCA was used to reduce the data set, and the SVM algorithm is used in parallel. Their results showed that the proposed model (SP-PCA-SVM) time was reduced from 46.4 s to 35.8 s, while Parallel SVM and accuracy were reduced from 93.56% to 93.40%; however; the training time for the classifier can be further optimized.
Zekri et al. [33] focused on how Distributed Denial of Service affects cloud performance by utilizing network resources. The attack techniques implemented and evaluated different ML algorithms in a cloud computing environment. The authors presented a DDoS protection design, and the algorithms they implemented and evaluated in cloud environments were Naive Bayes, Decision Tree (C4.5), and K-Means (KM). The results showed that their accuracy and detection time of algorithms Naive Bayes, Decision Tree (C4.5), K-Means were 91.4% in 1.25 s, 98.8% in 0.58 s, and 95.9% in 1.12 s, respectively. This approach can work on real-time anomaly detection and mitigation techniques and other security challenges. The drawback of this approach is that they evaluated only a few models to detect DoS attacks.
Halimaa and Sundarakantham [34] implemented different machine learning techniques and evaluated their accuracies. Authors implemented Support Vector Machine (SVM) and Naive Bayes (NB) and evaluated their performance to detect intrusion using the KDD Cup data set. The authors in this paper performed a detailed comparative analysis for detecting intrusions in the network. The results showed that SVM achieved an accuracy of 93.95%, whereas NB achieved an accuracy of 56.54%.
A Raman et al. [35] study proposed a novel support vector machine (SVM) with Hyper-graph based Genetic Algorithm (HG-GA) to increase the efficiency of intrusion detection and reduce the false alarm rate. The authors evaluated this approach only on a few models. The results showed that the proposed approach increased the accuracy with the feature selection technique and decreased the false alarm rate. The proposed model achieved an accuracy of 96.2% when using with feature selection technique and 95.8% when using it without feature selection technique. Additionally, the false alarm rate scored 0.83 compared to the highest 1.74.
Wang et al. [36] proposed a new framework, "LMDRT-SVM", which is based on SVM (Support Vector Machine) that increases model accuracy and detection rate with augmented features. The NSL-KDD dataset has been used for evaluating model performance by applying the SVM model. The results showed that overall, it improved model performance and achieved accuracy, detection rate, and false alarm rate of 99.31%, 99.20%, and 0.60 on the NSL-KDD data set, respectively, and 99.3%, 99.94%, and 0.10 on the KDD data set, respectively.
Teng et al. [37] proposed a novel approach for securing the network from intrusions with the help of machine learning models. They performed experiments on the KDD Cup data set. The authors proposed an adaptive intrusion detection technique, Collaborative and Adaptive Intrusion Detection Model (CAIDM), which is based on various machine learning algorithms such as Decision Tree (DT) and Support Vector Machine (SVM). In addition, they used a tool, Environments classes, agents, roles, groups, and objects (E-CARGO), for modeling purposes. The results showed that the accuracy, error rate, and training time were significantly improved. Compared with the single type SVM accuracy of the proposed model (CAIDM), accuracy of the model was increased to 89.02% from 81.716%, and the error rate was reduced from 19.39% to 12.19% in CAIDM. However, the training time of the proposed model was also reduced from 29.730 min to 7.247 min.
Ahmad et al. [38] evaluated various machine learning models for detecting intrusions in the network system. Authors evaluated the SVM, RF, and Extreme Learning Machine (ELM) models for accuracy, recall, and precision. The train test split was used for the training and testing purpose at 80% for training and 20% for testing. The results showed that ELM outperforms other approaches with higher accuracy and precision of 95.5% on full sample data, and recall is slightly lower compared with SVM and RF.
Yin et al. [39] proposed a deep learning approach for IDS using recurrent neural networks. The authors used various machine learning models for performing their experiments on the benchmark data sets. Their experiment results showed that the Recurrent Neural Network Intrusion Detection System (RNN-IDS) outperforms other classification models and achieved high accuracy than other machine learning models. The proposed model achieved high accuracy in multiclass and binary classification methods. The results showed that the performance of the "RNN-IDS" model for binary and five category classifications achieved a detection rate of 83.28%.
Li and Yan [40] proposed IDS based on Apache Spark using "MSMOTR" and Adaboost models. The experiment was performed on the KDD99 dataset. The results showed that the proposed approach could reduce the system's error rate and processing time and improves the accuracy rate. The results showed that the traditional model achieved accuracy, error rate, and processing time of 84.28, 17.2, and 18.26 s. The model achieved accuracy, error rate, and processing time of 86.32, 13.6, and 15.24 s, respectively.
Aftab et al. [41] applied deep learning models using the big data approach for the detection of the disease Leukemia. The library they used was Spark BigDL and their results showed that the proposed model was able to identify four types of Leukemia with an accuracy of 0.9733% on training data and 0.9478% on test data.
Al-Qatf et al. [42] implemented an anomaly-based network intrusion detection system. The proposed model can work on big data and achieves scores in minimum time. The proposed model used the big data tools Apache Spark and Hadoop. The authors encapsulated the approaches using data mining methods and machine learning for cyber analytics for IDS. The results showed that the system managed the large-scale network packet analysis quickly using Apache Hadoop and Spark.
Kato and Klyuev [43] studied the major problems of network intrusion detection using machine learning algorithms. The authors focused on designing a practically intelligent intrusion detection system having defined accuracy and a low false-positive rate. The results showed that the proposed approach could develop the intelligent IDS and achieved an accuracy of 86.2%, a false-positive rate of 13%, and a true negative rate of 87%.
Marir et al. [44] proposed a novel distributed approach (distributed deep method) to detect abnormal behavior in large-scale networks. The proposed model combined a deep neural network with multi-layer machine learning ensemble models for detecting abnormalities. The authors used the hidden layer model (SVM) with ensemble technique to enhance the prediction performance by reducing the number of features. Apache Spark was used to train and test the model and deep belief network to reduce the number of features. The results showed that the proposed model outperforms existing approaches.
Kim et al. [45] emphasized the IDS model by using an enthusiastic learning approach. The proposed model was based on "Long Short-Term Memory architecture." The experiment was performed on the KDD cup data set. The proposed model used a soft computing technique to create utilities in the intrusion detection field for learning and flexible capabilities. The proposed model was trained on ten dummy data sets and evaluated its performance. The results showed that the proposed approach achieved an average accuracy of 98.7% and an error rate of 11.3%.
Jha et al. [46] used the approach of data analysis to get such helpful information. Many big data research architecture structures for clustering algorithms have been created to expand big data research. They showed a maintainable cluster with progressive improvement in this paper. The Scalable Random Sampling with Iterative Optimization Fuzzy c-Means algorithm (SRSIO-FCM) approach was used to handle huge data collecting challenges in an Apache Spark cluster fuzzy c-means. SRSIO-performance FCM's was evaluated regarding the suggested expandable implementation of the Apache Spark cluster's Literal Fuzzy c-Means (LFCM) and Random Sampling plus Extension Fuzzy c-Means. Based on the duration and storage difficulty, run-time and clustering suggest that SRSIO-FCM can be run in significantly less time without clustering performance compromise.
Saravanan [47] presented a classification algorithm that works on network security data for intrusion detection. Multiple classification algorithms were implemented and evaluated using the big data tool Apache Spark and training time and testing time were measured. However, the authors evaluated few classification algorithms. As compared to the existing systems, they found better results with a good false-positive ratio. Better results were found for assessing the classification algorithm in Apache Spark on network security data than in existing systems. The accuracy of algorithms Decision Tree (DT), Logistic Regression (LR), Support Vector Machine (SVM) and SVM with Stochastic Gradient Descent (SGD) were 96.8%, 93.9%, 92.8%, and 91.1%.
Syed et al. [48] applied an ML-based application layer DoS attack detection framework for the detection of DoS attack on Message Queuing Telemetry Transport (MQTT) data communication protocol. The ML models applied for the detection of attack were Decision Trees (C4.5), Multi-Layer Perceptron (MLP) and Average One-Dependence Estimator (AODE). The results showed that the C45 outperforms other with an accuracy of 99.09% than the AODE and MLP with accuracy of 99.00% and 95.93%, respectively, using limited features. The recall of C45 achieves a high score of 99.1% with respect to AODE and MLP, with 99% and 95.9%, respectively.
Priya et al. [49] proposed an ML-based model for the detection of DDoS attacks. The authors applied three different machine learning models: K-Nearest Neighbors (KNN), Random Forest (RF) and Naive Bayes (NB) classifier. The proposed approach can detect any type of DDoS attack in the network. The results of the proposed approach showed that the model can detect attacks with an average accuracy of 98.5%.
Ujjan et al. [50] proposed entropy-based DoS detection to identify features of traffic DoS by combining two entropies through Stacked auto encoder (SAE) and CNN. The CPU utilization was much higher and time-consuming. The accuracies of the models were 94% and 93%, respectively.
Gadze et al. [51] proposed deep learning models to detect and mitigate the risk of DDoS attacks that target the centralized controller in Software Defined Network (SDN) through Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN). The accuracies of the models were lower. LSTM and CNN were 89.63% and 66%, respectively, when data spitted was in 70/30 ratio. However, in the case of LSTM model to detect TCP, UDP and ICMP, DDoS was the most time-consuming among the 10 attempts.
Ahuja et al. [52] proposed a hybrid ML model Support Vector Classifier with Random Forest Dehkordi et al. [54] proposed a method to detect DDoS attack in Software Defined Network (SDN) using different machine learning models. The proposed method contained three main sections: (1) collect; (2) entropy-based; and (3) classification. The proposed method was applied on three different datasets. The results showed that by applying the ML models Logistic algorithms, J48 algorithm, BayesNet algorithm, Random Tree algorithm and REPTree algorithm, the average accuracy achieved was 99.62% 99.87%, 99.33%, 99.8% and 99.88%, respectively by application on an ISCX-SlowDDos2016 dataset.

Dataset
We used the application layer DDoS dataset available on Kaggle. The dataset belongs to a large dataset category and consists of around 0.9 million records with 77 features columns and a target column. Mainly it consists of three labels: (1) DDoS slow loris; (2) DDoS Hulk; and (3) BENIGN [55]. Figure 1 shows the system block diagram that applies classification algorithms after pre-processing to detect the DDoS attack. There are four main components involved: (1) data acquisition, (2) data pre-processing, (3) classification machine model, and (4) evaluation to produce the output of whether the attack is DDoS. The first component (data acquisition) of the proposed model highlights the dataset that is used for the pre-processing phase of the model. In the second component of the model (pre-processing), we pre-processed the data before applying any machine learning model. First, noise filtering is performed on the dataset. Noise filtering is a collection of procedures used to reduce noise from data collected on development. In the next step, missing values are handled utilizing various policies, such as ignoring data with missing entries, replacing data with a universal, consistent method and explicitly filling in missing attributes depending on your area of expertise. In the third step, we transformed the 3-class problem into 2-class problems. We considered "DDoS slow loris" and "DDoS Hulk" as a single class and "BENIGN" as another class. In the last step, we reduced the number of features from 77 to 25 features. There were 77 features in the dataset, so Principal Component Analysis (PCA) technique was used to reduce the number of features and left only those most suitable features for predicting the model. Principal Component Analysis (PCA) tool is used for dimensionality reduction in both approaches with and without the big data approach. Scikit PCA is used for the non-big data approach, and Spark PCA is used for the big data approach. In the final step, all string type data was transformed into float type. Input data for Apache Spark ML models should be in vector form, so the string values of the data were transformed into numerical to create the dense vectors. The string indexer library was used to index the target column for Spark ML and Standard Scaler library for Scikit ML models.

Our Approach and Data Pre-Processing
In the third component of our proposed model, we applied two machine learning models, RF and MLP, for the training and testing of the data. Both models were applied with and without big data approach. We used Scikit libraries for the modeling purpose and considered it as non-big data approach, and for the big data approach we applied Apache Spark libraries Spark ML for the training and testing of the dataset. Finally, in the last steps we measured the evaluation matrices of all the approaches and compared the results.

Classification Machine Learning Models
We applied two machine learning models for our approach in the case of with and without big data machine learning.

Random Forest
Stochastic classification systems are part of the wide range of outfit approaches to learning. They are easy to construct and quick to operate, and in several fields, they have been found quite beneficial. In the learning phase, multiple "basic" decision trees and the maximum vote (modal) in them at the categorization phase are the fundamental premise behind the Random Forest technique. This voting approach has, amongst other advantages, a correction to the excess dataset for the undesired characteristic of decision bodies. Random forests utilize the broad process-based packaging for individual trees in the composition during the learning phase. Periodically the packing chooses and matches a unique item to update the exercise set. Unlike any cutting, every tree is cultivated. Research has used RF for the construction of spatial distribution of the population density in a region [56].
The Equation (1) of Gini Importance, assuming only two child nodes is as follows: where in Equation (1) the mx y is importance of node and I is Impurity and wt is weight of left and right with number of samples Figure 2 shows the Random Forest classifier diagram of how it works. Several tresses are used for the predictions, and the voting approach is used for the target output. The random character of tree construction is an intrinsic outcome of that. The number of trees in the whole set is a flexible variable that the out-of-bag mistake quickly discovers. As with Naive Bayes-and neighboring methods close to it-Random Forest is very common due to its clarity and strong results. In contrast to the two previous techniques, however, Random Forest is not predictable in terms of the architecture of the resulting framework produced.

Multi-Layer Perceptron
One of the main frequent algorithms in the domain of neural learning is a Multi-Layer Perceptron (MLP). A brain network called 'vanilla' is sometimes referred to as MLP, and the intricate patterns of the present era are easier. However, it has also paved the way for increasingly sophisticated convolutional neural networks [15,57].
The MLP is a neural feedback system, meaning the input layer is used to send knowledge towards the output layer. Weights are allocated to the links across the layers. The gravity indicates the significance of a link. The MLP is used for various activities such as inventory evaluation. A Multi-Layer Perception is comprised of interlinked cells that transmit knowledge to each other, like the human brain.
The MLP updated the weight when it occurred error during misclassification. The Equation (2) shows the weight update through old weight plus learning rate (lr) multiplied with expected value (y) and predicted value (x) during forward and backward processes.
(2) Figure 3 shows the Multi-Layer Perceptron diagram including inputs, hidden layer and output layer. A value is allocated to every cell. The system can be split into three top layers: (1) input Layer, the earliest communication layer that receives an intake to create an outcome; (2) hidden layer(s), where at least one hidden layer is a critical system, and they input data, calculations and processes to create everything useful; and (3) output layer, where there is a significant output in the neurons in this layer.

Experiment Setup
Random Forest and Multi-Layer Perceptron were used to detect an attack in real-time and evaluate the performance with and without the big data approach. We used Apache Spark, a big data tool distributed framework, to speed up the computational and timeconsuming tasks. Spark ML libraries were used to evaluate the performance with the big data approach and Scikit ML on Google Colab [58] libraries for the non-big data approach. The system specifications of the server we used for both approaches are the same. We used Databricks Community Editions for experiments with available memory of 15.3 GB and two cores with a single DB [59]. Spark v3.1.1 was used for Spark ML Libraries, and no worker nodes were used. The availability zone of the server was US-West-2C.

Experimental Parameters
We used the Cross-Validation technique, and started with the 100 trees at the beginning up to 500 trees with a step of 100 and with other default settings. There were two minimum samples used to split the internal node of a tree and one minimum number of samples were used to be at a leaf node. For measuring the quality of the split criteria, "Gini" was selected. Moreover, the parameter was set to auto for the RF to consider the number of features for the split. About 70% of the data was used for training, and 30% was used for testing.
The variables used for performing the experiment with Multi-Layer Perceptron classifier were as follows. We used Cross-Validation technique, and started with the 100 iterations at the beginning up to 500 iterations with a step of 100 and with default other settings. The Adam optimizer was utilized for the weight optimization of the model and Rectified Linear Unit (ReLu) was used as activation function. We had also shuffle samples for each iteration and the value of exponential decay rate was set to 0.9. Here also 70% of the data was used for training and the remaining 30% was used for testing.

Evaluation Results
We measured results through accuracy, precision, recall, F1 score and confusion matrix of our both models. Figure 4 shows the accuracy comparison of both models, Random Forest classifier (RF) and Multi-Layer Perceptron (MLP) for both approaches, big data and non-big Data. The RF classifier was used, and this decision was based on the majority of the voting. When the model was trained with 100 trees, we achieved the minimum accuracy of about 99.868%, and the maximum accuracy achieved was 99.989% when the model was trained with the 400 trees for non-big data approach. On the other hand, for big data approach the model was trained with 200 trees, and we achieved the minimum accuracy of about 99.045%, and maximum accuracy achieved was 99.895% when the model was trained with the 400 trees. The MLP classifier was used with a minimum of 100 iterations up to a maximum of 500 iterations. When the model was trained with 200 iterations, we achieved the minimum accuracy of about 97.5%, and maximum accuracy achieved was 99.9% when the model was trained with the 400 iterations for non-big data approach. On the other hand, for the big data approach, the model was trained with 100 iterations, and we achieved the minimum accuracy of about 96.9%. Maximum accuracy achieved was 99.5% when the model was trained with the 500 iterations. Table 1 shows the evaluation matrix (accuracy, precision, recall, and F1 score) of both models, Random Forest classifier (RF) and Multi-Layer Perceptron (MLP), for both big data and non-big data approaches. When the RF model was evaluated the precision of the model was 99.97%, whereas recall was 99.98%. The false-positive rate was 0.004, and the false-negative rate was 0.002 for non-big data approach. When the big data approach was used the precision of the model was 99.94%, whereas recall was 99.97%. The false-positive rate was 0.007 and false-negative rate was 0.003. When the MLP model was evaluated, the precision of the model was 99.96%, whereas recall was 99.97%. The false-positive rate was 0.005 and false-negative rate was 0.003 for the non-big data approach. When the big data approach was used, the precision of the model was 99.94%, whereas recall was 99.97%. The false-positive rate was 0.007 and false-negative rate was 0.003.

Execuation Time
Our main aim was to calculate the execution time of our all four approaches. Figure 5 shows the running time comparison of both Random Forest classifier (RF) and Multi-Layer Perceptron (MLP) for both approaches, big data and non-big data. When the RF model was trained with 100 trees, we achieved the minimum testing time of about 0.2 min, and the maximum testing time took around 0.8 min when the model was trained with the 400 trees for non-big data approach. On the other hand, for the big data approach the model was trained with 100 trees, and we achieved the minimum testing time of about 0.064 min. For the maximum testing time it took around 0.148 min when the model was trained with the 300 trees.
When the MLP model was trained with 300 trees, we achieved the minimum testing time of about 0.065 min, and the maximum testing time took around 0.168 min when the model was trained with the 500 iterations for non-big data approach. On the other hand, for the big data approach, when the model was trained with 500 iterations, we achieved the minimum testing time of about 0.023 min, and the maximum testing time took around 0.073 min when the model was trained with the 100 iterations.

Discussion
In terms of accuracy, we achieved a similar mean accuracy of the approaches, but in terms of training time and testing time the big data approach outperforms the non-big data approach due to the fact that Spark does all computations in memory in a distributed manner. In addition, the Random Forest classifier outperforms MLP, having a higher mean accuracy. In Figure 6, a comparison of average accuracies is presented for both the approaches. It is clearly seen that the average accuracy of using big data approach or non-big data approach was somehow similar, and Figure 7 shows the average training time of the models using the Scikit Libraries and Apache Spark ML Libraries. The RF models took more time on training as it uses the ensemble technique which uses various decision tress to be trained.    Table 2 shows the recent studies related to DDOS through various machine learning, deep learning and big data approaches with accuracies and execution time. It can clearly be seen above in Table 2 that our approach using Random Forest and Multi-Layer Perceptron and big data framework Spark achieved better accuracies and processing time overall. The minimum accuracy in the above studies achieved was 56.64% in [34] using Naive Bayes classifier with the non-big data approach. By comparing with our approach, our minimum accuracy achieved was 99.05% with MLP classifier using the non-big data approach. The maximum accuracy achieved was 99.88% in [54] with REPTree model but without the big data approach, whereas compared with our approach, the maximum accuracy achieved was 99.94% with RF classifier using big data approach. The minimum processing time achieved was 0.48 s with Logistic Regression classifier in other studies. Using our proposed model, the minimum processing time achieved was 0.04 s using MLP classifier with big-data approach. Our proposed models achieved high accuracy and low processing time as compared with other existing models.
However, there are some limitations of our study, First, we used only two machine learning models. Second, the dataset has only two classes.

Conclusions
With the growing presence of data on the internet, new opportunities for threats to target sensitive data have been raised many security issues, such as malicious intrusions.
One major type of attack is the DDoS attack. Traditional intrusion detection techniques can only work best on slow-speed data or small data. Still, they are inefficient on big data and are incapable of handling high-speed data, so new methods adapted to work on large data to detect any signs of intrusion are needed. In this paper, we predicted DDoS attacks in real-time with different machine learning models using a big data approach. We used a distributed system, Apache Spark, and a classification algorithm to enhance the algorithms' execution. Additionally, we compared the results of the big data approach and how it outperforms the non-big data approach. Apache Spark is a big data tool to detect an attack in real-time with Spark ML libraries. We applied the two machine learning approaches, Random Forest (RF) and Multi-Layer Perceptron (MLP), through the Scikit ML library and big data framework Spark-ML library for the detection of DoS attack. In addition to the detection of DoS attacks we have optimized the performance of the models by minimizing the prediction time as compared with other existing approaches using big data framework. We achieved a similar mean accuracy in the models used, but in terms of training time and testing time big data approach outperforms the non-big data approach due to the fact that Spark performs computations in memory in a distributed manner. The minimum average training and testing time in minutes was 14.08 and 0.04, respectively, by using the big data tool (Apache Spark), and maximum average training and testing time in minutes was 34.11 and 0.46, respectively, by using the non-big data approach. Using the big data approach, we were able to detect an attack in real-time in a few milliseconds.
In the future, we will evaluate Apache Spark with other big data tools in terms of accuracy, training time and testing time of the machine learning models. We could also train different models and combine them with deep learning approaches using neural networks for predicting real-time results from convolutional neural network architectures [60][61][62]. Moreover, we could also apply datasets by recent work on big deep Learning (Big DL) framework [63]. Data Availability Statement: It is stated that dataset was publicly available on the Kaggle website https://www.kaggle.com/wardac/applicationlayer-ddos-dataset (accessed date 7 November 2019).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript.