Near-Real-Time IDS for the U.S. FAA’s NextGen ADS-B

Abstract: Modern-day aircraft are flying computer networks, vulnerable to ground station flooding, ghost aircraft injection or flooding, aircraft disappearance, virtual trajectory modifications or false alarm attacks, and aircraft spoofing. This work lays out a data mining process, in the context of big data, to determine flight patterns, including patterns for possible attacks, in the U.S. National Air Space (NAS). Flights outside the flight patterns are possible attacks. For this study, OpenSky was used as the data source of Automatic Dependent Surveillance-Broadcast (ADS-B) messages, NiFi was used for data management, Elasticsearch was used as the log analyzer, Kibana was used to visualize the data for feature selection, and Support Vector Machine (SVM) was used for classification. This research provides a solution for attack mitigation by packaging a machine learning algorithm, SVM, into an intrusion detection system and calculating the feasibility of processing US ADS-B messages in near real time. Results of this work show that ADS-B network attacks can be detected using network attack signatures, and volume and velocity calculations show that ADS-B messages are processable at the scale of the U.S. Next Generation (NextGen) Air Traffic Systems using commodity hardware, facilitating real time attack detection. Precision and recall close to 80% were obtained using SVM.


Introduction
At peak operational times, there are 5000 concurrent flights in the U.S. national airspace [1]. The U.S. Federal Aviation Administration (FAA) indicates that the U.S. Gross Domestic Product (GDP) will increase from $16.3 trillion in 2015 to $26.2 trillion in 2036 [2]. The U.S. Government Accountability Office (GAO) states that the aviation industry is in the process of implementing the U.S. Next Generation (NextGen) Air Traffic System [3][4][5][6][7]. The NextGen component programs are at various stages of development and include Automatic Dependent Surveillance-Broadcast (ADS-B), Collaborative Air Traffic Management Technologies (CATMT), Data Communication, National Airspace System Voice System, NextGen Air Transportation System Weather, and System Wide Information Management (SWIM). The GAO states that a major element of the system is the ADS-B capability, which is seen as the future of air traffic control through advancements in aircraft tracking and flow management.
The ADS-B system augments traditional radar and transponder surveillance with ADS-B messages and embedded positioning via Global Positioning Systems (GPS). ADS-B is an unencrypted system, meaning that the national airspace system (NAS) is susceptible to a variety of cyber-physical attacks [8][9][10][11][12][13][14][15]. Though the focus of this research is on U.S.

Big Data
Big Data is defined by five characteristics, also known as the five Vs: volume, variety, velocity, veracity, and value. Data volume measures the scale of the data within the system. Data variety refers to the different structures and sources of data. Data velocity is the analysis of the data as the data are generated. Data veracity describes the uncertainty of the data, and data value is the evaluation of the impact the data have on research [16].

Data Mining and Machine Learning
Data Mining (DM) searches for patterns or correlations that provide understanding or predictive power [17,18]. Machine Learning (ML) is a class of computer algorithms that allow computers, without being explicitly programmed, to learn and classify information and recognize patterns. This work uses the SVM machine learner to classify information and recognize patterns. SVM, which is more computationally intensive than many other less sophisticated classification algorithms, has the advantage of working well on datasets that are not linearly separable. SVM finds the best hyperplane that separates observations of one class from those of the other class. The best hyperplane is the one with the largest margin between two classes of observations [19].
The novelty of this research is in providing a third alternative solution for attack mitigation by packaging a machine learner, SVM, into an intrusion detection system and further calculating the feasibility of processing U.S. ADS-B messages in near real time.
The remainder of this paper is organized as follows. Section 2 presents the background, that is, the mechanics of ADS-B. Section 3 presents related works on ADS-B, focusing on works that look at attack mitigation strategies and techniques and works done in the context of Big Data and Machine Learning. Section 4 presents the methodology, including the system architecture and the data mining process used. Section 5 presents the results and discussion. Results are presented in terms of visualization and machine learning results as well as data volume and velocity. Section 6 presents the conclusion and Section 7 is a future works section.

Background: Automatic Dependent Surveillance-Broadcast
This section presents the mechanics of ADS-B. The message types of U.S. NextGen Air Transportation are Mode A, Mode C, Mode S, and ADS-B In and Out [26][27][28]. Mode S has three message types: (i) Data Block Surveillance Interrogation and Reply Message Format; (ii) Data Block Surveillance and Communication Interrogation and Reply-Communication- The ADS-B message field can contain information on traffic, weather, and flights. ADS-B vulnerabilities pertain to confidentiality, integrity, and availability. Anyone with an ADS-B radio can transmit and receive messages, which precludes confidentiality. Data integrity is affected by attacks such as Ghost Aircraft Injection, Aircraft Disappearance, Virtual Trajectory Modification, and Aircraft Spoofing. Ghost Aircraft Injection occurs when an ADS-B radio transmits fake messages, leading other aircraft to believe that a nonexistent aircraft is present. Aircraft Disappearance happens when skillfully timed malformed ADS-B messages are sent with a real aircraft's identification, causing ADS-B messages from that aircraft to be disregarded. In other words, the remaining aircraft do not believe that this particular aircraft exists. Virtual Trajectory Modification is the act of jamming an aircraft or ground station to create false alarms. Aircraft Spoofing is the use of another aircraft's identification to send ADS-B messages with false information. Finally, loss of availability is associated with Ground Station Flooding and Ghost Aircraft Flooding. Ground Station Flooding occurs when ground-based radios are jammed. Ghost Aircraft Flooding happens when a large number of fake ADS-B messages are sent, so that real and fake aircraft become too numerous to distinguish [8][9][10][11][12][13][14][15]. This research builds on the already well-documented vulnerabilities of ADS-B by reproducing these vulnerabilities at the data layer and mixing the resulting messages with OpenSky ADS-B messages.

Related Works
Ref. [9] discusses ADS-B as a new technology for air traffic monitoring. ADS-B holds the promise of achieving high precision and is envisioned to replace conventional radar systems. Several works have looked at the security issues related to ADS-B. Ref. [10] looked at both the theoretical and practical efforts that have been used for ADS-B security protocols and discussed the inherent lack of security measures in ADS-B protocols. Refs. [11,13,14,15] shed further light on the practicality of different threats to ADS-B. Ref. [26] examined encryption schemes and discussed the challenges associated with implementing them in the ADS-B environment. Ref. [27] described a low-cost solution for ADS-B-based real-time air traffic monitoring systems implemented on a software-defined radio platform. This provided an integrated hardware and software solution for rapidly prototyping high-performance wireless communications systems. Ref. [28] constructed a schema for batch verification of ADS-B systems. Ref. [20] developed a passive fingerprinting technique that accurately and efficiently identifies wireless implementations by exploiting variations in transmission behavior. Ref. [8] addressed mitigation by building an authentication framework around a new online/offline identity-based signature scheme. The scheme introduced by Ref. [8] resolved the public-key infrastructure issue by using the identities of aircraft as public keys.

Works Related to ADS-B in the Big Data Environment
Research shows that a Hadoop-based solution analyzes billions of ADS-B radio messages in approximately 35 min [30]. The results of this research were visualized using density maps. In the context of cybersecurity or digital forensics, the ability to filter messages according to keywords or phrases would reduce noise and improve the solution [31]. A reduction in computation time would also help in moving toward real-time processing.
Ref. [29] provides insight into unaddressed Big Data issues in NextGen, such as issues with the velocity, variety, and veracity of NextGen data. SWIM is the data-sharing digital backbone of NAS for NextGen; it does not address the veracity of the data received via the ADS-B protocol. Through SWIM, NAS has a single point of information in near real time. SWIM includes a surface visualization tool, which allows the Air Traffic Center (ATC) to manage surface traffic.
Ref. [30] presents Big Data platforms that are able to process ADS-B messages. This work shows that conceptually, using a statistical methodology, and in the context of big data, ADS-B messages are processable at the scale of NextGen.

Works Related to ADS-B Using Machine Learning
Ref. [32] presents Spoofing Detector for ADS-B (SODA), a two-stage Deep Neural Network (DNN)-based machine learner that detects ADS-B spoofing messages and hence is also an aircraft classifier [32]. Experimental results show that SODA detects ground-based spoofing attacks with a precision of 99.34%, with a false positive rate of 0.43% [32]. SODA outperforms other machine learning techniques, such as XGBoost, Logistic Regression, and SVM. It also identifies individual aircraft with an average F-score of 96.68% and an accuracy of 96.66% [32]. Though SODA DNN is a very promising machine learner, it does not address the feasibility of processing U.S. ADS-B messages in near real time, which this research evaluates.

Addressing the Gap in the Research
DM/ML is used in cybersecurity and is beginning to be leveraged at the intersection of cybersecurity and ADS-B. However, studies have not focused on feasibility in the context of aviation and Big Data or on the required computing resources, which this paper assesses. This study provides a solution for attack mitigation by packaging a machine learner, SVM, into an intrusion detection system, and it addresses the feasibility of processing ADS-B messages in near real time, which none of the previous works examine.

Methodology
This section presents the system architecture, the data capture and data engineering, and finally the data mining process.

System Architecture
The system architecture of this environment consists of users, hardware, and software interfaces.
As shown in Figure 1a, the user interface is a HyperText Markup Language version 5 (HTML5) presentation of NiFi, Kibana, and Jupyter Notebooks via an SSH tunnel to the ports exported by the Docker containers. The hardware interface includes a Dell 910. The software interface includes CentOS and Docker. Docker orchestrated the creation of the four servers required to conduct the experiment: Elasticsearch, Kibana, Apache NiFi, and Jupyter Notebook. Configuration changes to NiFi were required because of the large size of the JSON files. Because the default Java Virtual Machine (JVM) memory setting of 512 MiB was not sufficient to deserialize the OpenSky ADS-B JSON messages, memory was increased to 30 GiB. These configuration changes allowed the NiFi processors to process the JSON files. Figure 1b presents the specifications of the architecture.

Data Capture and Data Engineering
One of the biggest gaps in this kind of research is the a Data were downloaded from OpenSky ADS-B archives for 2 flow created and fetched each hour-long archive with a GetF by an unpackcontent NiFi processor to untar the archive (Figu of several files, for which routeonattribute NiFi processor was compressed files that contained the actual ADS-B messages. T takes the JSON list of ADS-B JSON objects and flattens them i in its NiFi flow file. The replacedtext NiFi processor properly a JSON format accepted by JSON Search. PutElasticsearchHttp the NiFi flow files into Elasticsearch using the Elastic Search R into Elasticsearch, Kibana visualizes the data using line charts maximums, and peaks (Figure 3). Recording the minimum an field, the ADS-B attack generator is more effective.

The Data Mining Process
This section is divided into data preprocessing, extraction of patterns using data mining, and post processing of data, that is, what was done to present the findings.

Data Preprocessing
The first step in the DM process is data preprocessing. Data preprocessing includes data cleaning, data integration, and data transformation [17,18]. The first step of data cleaning is the removal of noise and data inconsistencies. In this work, this was accomplished by removing repeated events for the same aircraft, that is, by dropping additional rows with the same unique aircraft identifier, icao24. As part of data integration, the Pandas join (Figure 2) was used to combine OpenSky ADS-B traffic with the NiFi-generated attack ADS-B network traffic. From this dataset, all features were analyzed for feature selection by graphing each feature using the Kibana visualization tool. Distinct patterns were found in velocity, baroaltitude, geoaltitude, vertrate, and geo.
The data were cleaned using the ELK stack [33]. ELK is an end-to-end technology stack providing a complete analytical solution. Since neither baroaltitude nor geoaltitude showed any statistical advantage over the other, baroaltitude was selected, along with velocity and vertrate. Geo, while statistically relevant, would require significant preprocessing using time-series analysis techniques and therefore was not selected as a feature. ADS-B events with missing values in baroaltitude, velocity, or vertrate were filled with NaN values, and rows with NaN values were dropped. The index field is a built-in Elasticsearch field that contains the name of the index. The index schema used in this work assigned each hour of the day its own index. Because the values within index are strings and ML requires numeric values, one-hot encoding was used to assign the numeric value of zero to benign (normal) ADS-B network traffic and the numeric value of one to ADS-B attack network traffic. Finally, the data set was split 50/50 for training and testing.
After inspection of the OpenSky data, the work required a traffic generator, built in NiFi, to create the ADS-B spoofing attack at the data layer. Because there is no ADS-B traffic generator in the NiFi library, a custom NiFi processor was required and was built using Java. To create the build environment, a Maven Project Object Model (POM) build file was necessary. Using Java allowed the fields within the ADS-B message to be configurable in NiFi. The fields within the ADS-B message are time, latitude, longitude, velocity, vertrate, barometric altitude, and geoaltitude. The minimum and maximum values were specified within the NiFi processor, which allowed more effective attacks. The GenerateFlowFile NiFi processor used a predetermined number of ADS-B messages. The MyProcessor custom NiFi processor set all the values in each message. The PutElasticsearchHttp NiFi processor sent the messages to Elasticsearch via the REST API. With enough values randomly generated over time, the values spread evenly between the minimum and maximum values for all features (Figure 5).
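The generator logic can be illustrated with a short Python sketch. This is not the Java NiFi processor built for this work. The function names, the `FIELD_RANGES` table, the `lat`/`lon` field names and bounds, the geoaltitude bounds, and the `icao24` value are assumptions made for illustration. Only the velocity, baroaltitude, and vertrate bounds come from the minimum and maximum values reported for the OpenSky data in this paper.

```python
import random
import time

# Per-field (minimum, maximum) bounds. Velocity, baroaltitude and vertrate use
# the extremes reported for the OpenSky data; every other bound is an assumed
# placeholder for this sketch.
FIELD_RANGES = {
    "lat": (24.0, 50.0),                  # assumed bounding box
    "lon": (-125.0, -66.0),               # assumed bounding box
    "velocity": (0.0, 324.844),
    "vertrate": (-41.615, 28.611),
    "baroaltitude": (-335.28, 36941.762),
    "geoaltitude": (-335.28, 36941.762),  # assumed equal to baroaltitude bounds
}


def generate_attack_message(rng, icao24="abc123", now=None):
    """Build one spoofed ADS-B record with uniformly random field values."""
    message = {
        "icao24": icao24,  # hypothetical spoofed aircraft identifier
        "time": int(time.time() if now is None else now),
    }
    for field, (low, high) in FIELD_RANGES.items():
        message[field] = rng.uniform(low, high)
    return message


def generate_attack_batch(count, seed=0):
    """Generate `count` spoofed records from a seeded random generator."""
    rng = random.Random(seed)
    return [generate_attack_message(rng) for _ in range(count)]
```

Because each field is drawn uniformly between its bounds, a large batch spreads evenly across the range of every feature, which is the behavior described for the generated attack traffic.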

Data selection is the retrieval of relevant data from the data sources. For data selection, a custom NiFi processor was used to ingest the OpenSky ADS-B network traffic data (Table 4) via JSON REST API [34,35] or JSON flat file [36]. Table 4 presents the OpenSky ADS-B JSON Object Definitions (data structure and definitions). Apache NiFi comes with approximately 260 processors, providing a range of processes such as get, convert, and put. However, Apache NiFi does not come with an ADS-B traffic generator able to produce known attacks on ADS-B networks such as spoofing or injection. Because Logstash, provided alongside Elasticsearch, only ingests data and is therefore not as versatile for complete data preprocessing, a custom NiFi processor was the best choice for this work. The third step of data preprocessing, data transformation, is the consolidation of data into appropriate forms for DM by the aggregation of the data [17]. The data were transformed for DM by a custom enrichment NiFi processor.

Extracting Patterns Using Data Mining
The second step of the DM process is the employment of intelligent methods to extract patterns from the data. Classification techniques have the ability to group data with similarities, such as attacks, hence this was considered the best option for processing this data. In this work, classification techniques using a custom NiFi processor were used. A customized NiFi processor was used for creating, sending, receiving, transforming, routing, splitting, merging, and processing FlowFiles. FlowFiles, pieces of data that the user brings into NiFi for processing and distribution, consist of two parts: Attributes and Content. Attributes are key-value pairs that are associated with User Data. Content is user data. The custom NiFi processor is an algorithm written in Python using DM/ML libraries to process FlowFiles.
Jupyter Notebook fetched 10,000 real ADS-B messages from the Elasticsearch REST API and 10,000 generated attack ADS-B messages (Figure 3). The results were combined into one pandas data frame for training and testing (Figure 4). Preprocessing dropped all additional rows with the same icao24. The drop method removed all fields except velocity, baroaltitude, vertrate, and index. The index field determines whether the row is an ADS-B message collected by OpenSky or an attack generated by the custom NiFi processor. The replace method filled empty fields with the null value np.nan. The dropna method dropped all rows containing np.nan. One-hot encoding replaced the labels within index with a numerical representation. Velocity, baroaltitude, and vertrate were numerical values and hence were not one-hot encoded.
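The preprocessing steps described above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the notebook code used in this work. The function names and the `ATTACK_INDEX_PREFIX` naming convention for the attack index are hypothetical. `pd.concat` is used here to stack the two sets of rows, as a stand-in for the join step shown in Figure 2, and the seeded shuffle in `split_half` is only one possible way to produce a 50/50 split.

```python
import numpy as np
import pandas as pd

FEATURES = ["velocity", "baroaltitude", "vertrate"]
ATTACK_INDEX_PREFIX = "attack"  # hypothetical name of the attack index


def preprocess(benign, attack):
    """Combine benign and attack records and return (features, labels)."""
    frame = pd.concat([benign, attack], ignore_index=True)
    # Keep only the first row for each aircraft identifier.
    frame = frame.drop_duplicates(subset="icao24")
    # Drop every column except the selected features and the index label.
    keep = FEATURES + ["index"]
    frame = frame.drop(columns=[c for c in frame.columns if c not in keep])
    # Mark empty fields as missing, then drop rows with missing values.
    frame = frame.replace("", np.nan).dropna()
    # Encode the label: 0 for benign traffic, 1 for generated attacks.
    labels = frame["index"].str.startswith(ATTACK_INDEX_PREFIX).astype(int)
    return frame[FEATURES].astype(float), labels


def split_half(features, labels, seed=0):
    """Shuffle the rows and split them 50/50 into training and test sets."""
    order = np.random.default_rng(seed).permutation(len(features))
    half = len(order) // 2
    train, test = order[:half], order[half:]
    return (features.iloc[train], labels.iloc[train],
            features.iloc[test], labels.iloc[test])
```

The numeric features pass through unchanged; only the string-valued index column is converted to a 0/1 label.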

Post-Processing
The third step of the DM process is post-processing. Post-processing includes pattern evaluation and knowledge presentation. Pattern evaluation, expressed by the comparison of test data and labelled data, identifies relevant patterns leading to knowledge based on interestingness measures. The second step of post-processing, knowledge presentation, is the visualization of the knowledge for presentation to users. In this work, the knowledge is presented using Kibana from ELK [37].

Results and Discussion
The results are presented in terms of visualizations, machine learning results, and volume and velocity calculations.

Data Exploration
For data exploration, Kibana visualization was used to categorize the characteristics of OpenSky ADS-B traffic (Figure 5). The features selected were velocity, baroaltitude, and vertrate. Velocity has a minimum of zero and a maximum of 324.844 m/s. Baroaltitude has a minimum of −335.28 m and a maximum of 36,941.762 m. Vertrate has a minimum of −41.615 m/s and a maximum of 28.611 m/s. The distinct spikes indicate a definite pattern in the data, indicative of flight patterns in U.S. National Air Space.

Machine Learning: SVM
SVM was used as a kernel-based method, in which feature vectors are implicitly mapped into a higher-dimensional space where it is easier to find an optimal hyperplane for classifying observations. A linear kernel was used. Given training vectors $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, in two classes and a vector $y \in \{1, -1\}^n$, the goal is to find $w \in \mathbb{R}^p$ and $b \in \mathbb{R}$ such that the prediction given by $\operatorname{sign}(w^T \phi(x) + b)$ is correct for most samples. SVC solves the following primal problem, with slack variables $\zeta_i$ [38][39][40]:

$$\min_{w, b, \zeta} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \zeta_i, \quad \zeta_i \geq 0, \quad i = 1, \ldots, n$$

Its dual is

$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad y^T \alpha = 0, \quad 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, n$$

where $e$ is the vector of all ones, $C > 0$ is the upper bound, $Q$ is an $n$ by $n$ positive semidefinite matrix, $Q_{ij} \equiv y_i y_j K(x_i, x_j)$, where $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel. Training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$. The decision function is

$$\operatorname{sgn}\left( \sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + b \right)$$

A confusion matrix was used to present a summary of the results predicted for the classifications. There are four counts in a confusion matrix: true positives, false negatives, true negatives, and false positives. True positives are actual true and predicted true. False negatives are actual true and predicted false. True negatives are actual false and predicted false. False positives are actual false and predicted true.
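To make the linear-kernel case concrete, the sketch below fits scikit-learn's `SVC` with `kernel="linear"` and checks that its decision function equals $w^T x + b$. It is an illustration only: the data are synthetic and hypothetical (two well-separated Gaussian clusters standing in for benign and attack traffic over three scaled features), and the use of scikit-learn here is an assumption rather than a statement about the implementation in this work.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic, hypothetical data: three scaled features per message.
benign = rng.normal(loc=-2.0, scale=0.5, size=(100, 3))
attack = rng.normal(loc=2.0, scale=0.5, size=(100, 3))
X = np.vstack([benign, attack])
y = np.array([-1] * 100 + [1] * 100)  # -1 = benign, 1 = attack

# Linear kernel: K(x_i, x_j) = x_i^T x_j, so phi is the identity map.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# With a linear kernel the decision function collapses to w^T x + b.
w, b = clf.coef_[0], clf.intercept_[0]
manual_scores = X @ w + b
manual_pred = np.where(manual_scores > 0, 1, -1)
```

Because the clusters are far apart, the sign of `manual_scores` reproduces both `clf.predict(X)` and the true labels. Real ADS-B features overlap far more than this toy data.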
For predicted ADS-B attacks, there were 3922 true positive messages, 963 false positive messages, 1088 false negative messages, and 3149 true negative messages, as shown in Table 5. The classification report presents the precision, recall, and F1 score of the machine learner. Precision indicates the fraction of predicted ADS-B attacks that were actual attacks (Equation (3)):

$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$

Recall indicates the fraction of actual ADS-B attacks that were predicted as attacks (Equation (4)):

$$\text{Recall} = \frac{TP}{TP + FN} \tag{4}$$

The objective is to achieve high precision as well as high recall. In this case, the precision was 80.29%, and recall, which is also the attack detection rate, was 78.28%.
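The F1 score represents the harmonic mean of the precision and recall (Equation (5)):

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$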
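As a quick arithmetic check, the reported precision and recall can be reproduced from the four confusion matrix counts given above. The sketch uses only the standard definitions; the variable names are illustrative.

```python
# Confusion matrix counts reported for the predicted ADS-B attacks.
tp, fp, fn, tn = 3922, 963, 1088, 3149

precision = tp / (tp + fp)  # share of predicted attacks that are real attacks
recall = tp / (tp + fn)     # share of real attacks that were detected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision = {precision:.2%}")
print(f"recall    = {recall:.2%}")
print(f"F1        = {f1:.2%}")
```

The precision and recall evaluate to 80.29% and 78.28%, matching the reported values, and the same counts give an F1 score of about 79.27%.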

Volume
Data volume measures the scale of the data within the system [16]. ADS-B messages persist in 26 Elasticsearch indices. Indices 0 through 23 hold the ADS-B messages from OpenSky. Index 24 holds the generated ADS-B network attack messages. Index 25 holds the machine learner predictions. The volume mean is calculated using the ratio of Elasticsearch index storage size to the count of Elasticsearch documents containing OpenSky ADS-B messages (Equation (6)).
Based on the ADS-B specification and FAA statistics on the U.S. NAS, the yearly ADS-B volume is 41 TiB (Equation (7)). Based on the volume mean, that is, the ratio of Elasticsearch index storage size to the count of Elasticsearch documents containing OpenSky ADS-B messages, together with the ADS-B specification and FAA statistics on the U.S. NAS, the yearly ADS-B volume is 91 TiB (Equation (8)).
The velocity was 12.81262557 messages per millisecond and the average volume was 247.7604 bytes per message. The bit rate for the U.S. NAS is 423.2614982 bits per second (Equation (10)). In comparison, residentially available Gigabit internet is 1,000,000,000 bits per second, and commercially available Optical Carrier 768 (OC-768) operates at 39,813,120,000 bits per second. In this research, the server took 0.3565873159095 s to preprocess 20,000 messages and 0.178294 s to preprocess 10,000 messages.
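The velocity and the 91 TiB yearly volume can be re-derived from two figures reported in this section: 404,058,960,000 U.S. NAS messages per year and a volume mean of 247.7604 bytes per message. The sketch below assumes a 365-day year and 1 TiB = 2^40 bytes; the constant names are illustrative.

```python
# Figures quoted in the text.
MESSAGES_PER_YEAR = 404_058_960_000   # U.S. NAS ADS-B messages per year
MEAN_BYTES_PER_MESSAGE = 247.7604     # volume mean from the Elasticsearch indices

# Assumptions: a 365-day year and 1 TiB = 2**40 bytes.
MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000
TIB = 2 ** 40

# Velocity: messages arriving per millisecond, averaged over the year.
velocity_msgs_per_ms = MESSAGES_PER_YEAR / MS_PER_YEAR

# Volume: storage needed for one year of messages at the measured mean size.
yearly_volume_tib = MESSAGES_PER_YEAR * MEAN_BYTES_PER_MESSAGE / TIB

print(f"velocity: {velocity_msgs_per_ms:.8f} messages/ms")
print(f"yearly volume: {yearly_volume_tib:.1f} TiB")
```

Under these assumptions the velocity comes out at about 12.81 messages per millisecond and the yearly volume at about 91 TiB, in line with the values above.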
The server preprocessed 56.08714 messages per millisecond (Equation (11)). The server took 11,802.38247703109 s to fit the machine learning model using 10,000 messages, a rate of 0.000847 messages per millisecond. It took the server 0.3594533782452345 s to apply the model to 10,000 messages for the predictions, a rate of 27.82002 messages per millisecond. The U.S. NAS generates 404,058,960,000 messages per year, or approximately 13 messages per millisecond [29].
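The per-stage rates above follow directly from the reported timings for 10,000 messages. The sketch below recomputes them and compares the prediction stage with the U.S. NAS message velocity; the helper `rate_per_ms` is a hypothetical name introduced for illustration, and a 365-day year is assumed.

```python
def rate_per_ms(messages: int, seconds: float) -> float:
    """Return the throughput in messages per millisecond."""
    return messages / (seconds * 1000.0)

# Timings reported in the text, each for 10,000 messages.
preprocess_rate = rate_per_ms(10_000, 0.178294)
fit_rate = rate_per_ms(10_000, 11_802.38247703109)
apply_rate = rate_per_ms(10_000, 0.3594533782452345)

# U.S. NAS velocity, assuming a 365-day year.
nas_velocity = 404_058_960_000 / (365 * 24 * 60 * 60 * 1000)

for name, rate in [("preprocess", preprocess_rate),
                   ("fit", fit_rate),
                   ("apply", apply_rate),
                   ("U.S. NAS", nas_velocity)]:
    print(f"{name:>10}: {rate:.6f} messages/ms")
```

Fitting is several orders of magnitude slower than the arrival rate, but it is done once offline. The preprocessing and prediction stages, which run on every message, each exceed the roughly 13 messages per millisecond generated by the U.S. NAS.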
Using a commercial off-the-shelf server, this system was able to predict 18.69573609 messages per millisecond (Equation (12)). Based on the FAA statistics, the velocity of detectable messages was 16 messages per millisecond. With the Elasticsearch overhead, the model was capable of processing 18.69573609 messages per millisecond. It takes 91 TiB of data volume to store a year's worth of ADS-B messages in Elasticsearch.

Conclusions
In this work, flight patterns were characterized, including flight patterns for possible attacks. Flights outside the patterns are possible attacks, and ADS-B network attacks can be detected using network attack signatures. Precision and recall close to 80% were achieved using SVM classification.
It took the server 0.36 s to preprocess the messages, 11,802.38 s to fit the model, and 0.36 s to apply the model for a prediction. While fitting took substantial time, the combination of preprocessing and applying the already fitted model took less than a second, that is, 0.72 s or 720 milliseconds for 20,000 messages, or approximately 27 ADS-B messages every millisecond. The U.S. National Air Space generates 404,058,960,000 messages per year, approximately 13 messages per millisecond. A commercial off-the-shelf server can process 16 messages per millisecond with the Elasticsearch overhead. This research, which can be applied to GAO-identified problems and issues with the FAA instantiation of ADS-B for public safety, used a commercial off-the-shelf server to keep up with the U.S. NAS velocity of ADS-B messages. These findings will help in taking appropriate action on attacks detected in real time, hence improving flight safety.

Future Works
Combating attackers who use adversarial artificial intelligence to mimic the identified flight patterns would be the next step in this research. Some advanced threats, otherwise known as advanced persistent threats, use artificial intelligence to glean patterns from networks such as the FAA's ADS-B network. These threats carefully construct attacks that mimic legitimate network traffic patterns. Future research is important because the current machine learner would potentially not detect such attacks, as they mimic legitimate ADS-B network traffic. Exploration of other machine learners and artificial intelligence algorithms would add to the research. Possible machine learners include the Random Forest classifier, Bayesian classifier, and Neural Networks. Additionally, artificial intelligence algorithms addressed through Neural Networks strive for 95 to 99% precision. Increasing precision allows for more accuracy in detecting attacks. Air travel increases every year; therefore, the growing amount of data needs to be processed with more velocity. To address this, consideration of big data platforms other than Elasticsearch could increase the system's velocity capability. Other platforms, such as Splunk, Spark, and even serverless architectures such as Lambda, offer opportunities for exploration.