Cyber Security Tool Kit (CyberSecTK): A Python Library for Machine Learning and Cyber Security

Abstract: The cyber security toolkit, CyberSecTK, is a simple Python library for preprocessing and feature extraction of cyber-security-related data. As the digital universe expands, more and more data need to be processed using automated approaches. In recent years, cyber security professionals have seen opportunities to use machine learning approaches to help process and analyze their data. The challenge is that cyber security experts often do not have the necessary training to apply machine learning to their problems. The goal of this library is to help bridge this gap. In particular, we propose the development of a toolkit in Python that can process the most common types of cyber security data, helping cyber experts to implement a basic machine learning pipeline from beginning to end. This work is our first attempt to achieve this goal. The proposed toolkit is a suite of program modules, data sets, and tutorials supporting research and teaching in cyber security and defense. An example use case is presented and discussed, along with survey results from students who used some of the modules in the library.


Introduction
The cyber security toolkit (CyberSecTK) is a simple Python library for preprocessing and feature extraction of cyber-security-related data. As the digital universe expands, more and more data need to be processed using automated approaches. For instance, statistics [1] show rapid growth in internet of things (IoT) devices: there are currently over 12 billion devices that can connect to the internet, and the number was predicted to reach 14.2 billion in 2019. According to the publication announcement for the Worldwide Semiannual Internet of Things Spending Guide forecast [2], a 13.6% compound annual growth rate (CAGR) is expected over the 2017-2022 period, with an estimated investment of 1.2 trillion US dollars by 2022. The proliferation of the IoT will lead to new challenges in the near future. According to [1], the growth in IoT device usage will make it difficult for smart environment operators to ascertain which IoT devices are within their networks and to monitor their operations. The Mirai botnet [3] is an example of the risks to IoT devices: this denial-of-service attack was a wake-up call to the IoT industry, demonstrating what is possible if IoT devices are not secured. Security of IoT devices, in this example, is key. GHOST [4], an EU Horizon 2020 Research and Innovation funded project, develops an architecture to safeguard home IoT environments with personalized real-time risk control measures to mitigate cyber security events. The system uses a network and data flow analysis (NDFA) module to feed a profile builder (PB) based on network activities, which in turn notifies a risk engine (RE) to act in order to secure the smart-home IoT ecosystem. Abnormal behavior detection is one of the crucial components in our daily interaction with IoT environments and devices.
In recent years, cyber security professionals have seen opportunities to use machine learning (ML) approaches to help process and analyze their data. The challenge is that cyber security experts often do not have the necessary training to apply ML to their problems.

Dynamic approaches require malware to run in a sandbox environment so that its dynamic behavior can be collected. This dynamic behavior usually consists of system calls, registry edits, network connections, access to dynamic link libraries (DLLs), etc. A basic pipeline for this is to take the dynamic malware, run it through an emulator, obtain log files of the behavior, and extract features from the logs for ML purposes [14]. According to Hamed et al. [19], a components-based approach enables an easier comparison of methods, where an intrusion detection system (IDS) can be analyzed in three major phases: preprocessing, detection, and empirical evaluation. For example, information from a network's physical, data-link, and network packet headers can be used to extract features to analyze network behavior, which can be useful for detecting anomalies within a network. A malware image analysis technique in [20] was used to classify, detect, and analyze particular malware based on image texture analysis. The visualization technique converts the sample malware's executable file into a binary string of 0s and 1s as a vector, which is then converted into a two-dimensional image. The image is further compared in terms of factors such as creation, techniques, execution environment, propagation media, and negative impacts. A behavior-based malware-detection ML technique was also discussed in [21]. A combined data set of malware and benign instances from Windows portable executable (PE) binaries is processed online to generate dynamic analysis reports, which are then used for feature selection and analyzed with the available Weka classifiers.
One of the proposed modules in our library was designed to process and extract features from log files (e.g., features from dynamic analysis). In its simplest form, the dynamic malware analysis module can treat each log file per malware (or goodware) sample as a text file, similar to how a natural language document would be treated. From there, a simple bag of words approach can be explored; this is the most basic analysis that can be implemented. Once the features are extracted, to complete the pipeline, the data must be preprocessed and normalized, ML models must be built, and the results must be analyzed to evaluate performance. Our library interacts well with standard sklearn and TensorFlow functions to complete the ML pipeline. The point of this library is to bring together modules for feature extraction in the domain of cyber security. Whenever possible, we use existing libraries/dependencies to manipulate raw data; for example, we use Python Scapy functions to process network packets. Libraries like sklearn do not have specific modules to extract features from cyber data. Therefore, our proposed library can be used to extract features from cyber data, and those extracted features can then be used for ML applications using other tools such as TensorFlow and sklearn. This library is a work in progress, and more features can be added in the future by incorporating new modules.
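Once features have been extracted, the remaining pipeline steps (preprocessing, normalization, model building, evaluation) can be completed with standard sklearn tools, as described above. The following minimal sketch illustrates that downstream half of the pipeline; the feature matrix here is synthetic and purely illustrative, standing in for features extracted by the toolkit.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Illustrative stand-in for features extracted by the toolkit:
# 200 samples, 5 numeric features, binary labels (malware vs. goodware).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic, separable labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Normalize the features, then fit a simple classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
```

Any sklearn or TensorFlow estimator can be substituted at the classification step; the point is only that the toolkit's CSV output drops directly into this kind of pipeline.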
The structure of this paper is as follows: first, the library and the need for it are introduced; this is followed by the methods and results; finally, the results and future work are discussed.

Materials and Methods
The library consists of simple Python scripts for feature extraction and some tutorial materials for research and teaching of ML for cyber security. The code for the library is available from GitHub via the link in [22]. The library focuses mainly on two types of data: network data and malware-related data. For the network data, we provide examples of how to extract features from Wi-Fi data and from regular network data. All inputs are assumed to be PCAP files. The library uses the Python Scapy [23] library to extract the network features. The outputs are comma-separated values (CSV) files that contain features and their values. Currently, we are focusing on TCP packets and the basic wireless local area network (LAN) link-layer headers without looking into the payloads. As we extend the library, we will explore modules for payload extraction. Payload data are more challenging because they are more diverse. One common way of looking at payload data is to think of them as documents in a language; therefore, NLP types of approaches can be considered. To preserve the confidentiality of transferred data, packets are usually encrypted. Analyzing encrypted packets for ML can be very challenging and is currently not addressed directly by the library. However, according to [24], an encrypted wireless network packet can still be analyzed using the link-layer header information. In this paper, we used the same approach to extract features from wireless network packets. Figure 1 shows an example of a wireless LAN frame.
The wireless network packets used to test our modules were collected within a simple testbed lab environment, where multiple IoT devices were connected to an access point (AP). Figure 2 illustrates the lab setup. This research focused on collecting wireless packets as an initial approach to feature extraction, and the sample data set is available with the library. A commercial off-the-shelf (COTS) radio receiver was used in monitor mode to passively eavesdrop on the wireless network traffic within a controlled lab environment. We preferred monitor mode over promiscuous mode because of its ability to identify hidden APs, as it can passively listen to the wireless network traffic without associating to the network. With this approach, features can be extracted directly from the non-encrypted link-layer header information. For example, a sample IoT wireless LAN data set contains 21 features with 94 instances as a proof of concept to test the cyber security toolkit. We used Python Scapy [23], an open-source Python library, to collect the wireless network traffic and then extract features using the proposed library, building on Scapy's built-in support.
Scapy is handy to use in terms of its built-in functions and support community. The user can simply dump network packets and parse the layered information within each packet for further analysis. Similarly, Aircrack-ng [25], an open-source Wi-Fi network security toolset, was used to set the wireless network adaptor to monitor mode before we started capturing the wireless network traffic within our testbed lab environment to create a data set.
For our Wi-Fi IoT data example, the library's module extracts features from PCAP files of wireless network traffic. The module parses the wireless network layer information using Scapy functions and extracts features, based on a predefined list, into a final CSV file. In its simplest form, each packet becomes an instance, or sample, of the output CSV file. The procedure for this example's feature extraction approach is summarized in Algorithm 1 (Figure 3), where IOTwireless.csv is the labeled data set file. Sniff is a Scapy built-in function for capturing network packets, and it allows users to pass a function via the argument "prn" to perform custom actions. The variable Labels represents the list of feature labels to be extracted from the input PCAP file. Dot11 and Dot11Elt are wireless frame layers, which hold information about the connected wireless network; their layer fields provide the values associated, within each packet, with the labels defined in Labels.
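A minimal sketch of this extraction step is shown below. The Scapy calls (rdpcap, the Dot11 layer) are standard, but the feature names in LABELS are hypothetical placeholders standing in for the library's actual Labels list, and the exact fields extracted here are illustrative only.

```python
import csv

# Hypothetical feature list standing in for the paper's "Labels" variable.
LABELS = ["addr1", "addr2", "type", "subtype", "frame_len"]

def packet_to_row(fields):
    """Order a dict of extracted link-layer fields into a CSV row (missing -> '')."""
    return [fields.get(name, "") for name in LABELS]

def extract_wifi_features(pcap_path, csv_path):
    """Parse 802.11 headers from a PCAP file and write one CSV row per packet."""
    # Scapy is only needed for the parsing step; sniff(offline=...) with a
    # prn callback, as in Algorithm 1, would work equally well here.
    from scapy.all import rdpcap, Dot11  # requires scapy to be installed
    rows = []
    for pkt in rdpcap(pcap_path):
        if pkt.haslayer(Dot11):
            d = pkt[Dot11]
            rows.append(packet_to_row({
                "addr1": d.addr1, "addr2": d.addr2,
                "type": d.type, "subtype": d.subtype,
                "frame_len": len(pkt),
            }))
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(LABELS)
        writer.writerows(rows)
```

The resulting CSV, one packet per row, is the format consumed by the rest of the ML pipeline.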
The malware module focuses on data collected via dynamic analysis (e.g., running a virus in a sandbox through an emulator). The emulator generates logs of the behavior of the malware, which are then processed for feature extraction. Each log is treated as a document, and its content is treated as words. With this view of the data, features can be extracted using a bag of words approach.

Evaluation
Evaluating a module statistically can be challenging, given that we need to assess, in a way, how the code was written and whether it is useful to the user. We settled on evaluating the use of some of the modules as part of a class. In particular, we used the modules for malware feature extraction and network data analysis. As part of the class, we also recorded videos and created a virtual machine (VM), data sets, lectures, and labs. We assessed the whole course together and did not focus on a specific module of the library. The survey was conducted to assess the content arrangement and design of the course and to obtain direct feedback on the learning experience. The data came from 19 students who attended classes during the summer of 2019. The open-source learning module was designed as a week-long ML course for cyber security professionals with a basic knowledge of network security and coding. Most of the students in the course possessed very advanced programming skills. The learning module consists of 7 data sets distributed across 15 lectures and 10 labs. All the code modules are implemented in Python and tested in an Ubuntu virtual machine. The VM has many libraries preinstalled and can be downloaded from [7].
In terms of survey methods and survey data, we followed standard approaches such as those in [26]. The survey contained 17 questions, covering satisfaction with the materials prepared for the class, satisfaction with the teaching content, and an analysis of the teaching results. In addition, participants could write down any other feelings or concerns about the course.
All of the questions in the survey, except for the final comment, used a Likert scale with a range of 1 to 5. Table 1 shows the meaning of each scale value (1 = strongly disagree, 2 = disagree, 3 = neither agree nor disagree, 4 = agree, 5 = strongly agree), and Table 2 lists the survey questions, which include the following:

4. Were the lectures helpful to better understand the topics?
5. Did you find the lecture and lab documents helpful to better understand the topics?
6. Was the use of a flash drive with a VM helpful for working on the lab problems?
7. Did you find the challenge data sets useful?
8. After this 1-week course, do you feel you have a better understanding of machine learning and how it can be applied to cyber security problems?
9. Did you find the use of AWS as a large data set useful to your learning?
10. Did you find the distribution of time between lecture and labs appropriate?
11. Were the video recordings useful to your learning?
12. Overall, do you feel that you can convert raw data to the vector space model format for machine learning purposes?
13. Overall, do you feel you can now better analyze data sets with Weka?
14. Overall, do you feel you can now better analyze data sets with sklearn?
15. Overall, do you feel you can now better analyze data sets with TensorFlow?
16. Do you have a better understanding of deep neural networks and their advantages?
17. Comments (provide any additional comments)

Given that this is a library/coding module, it can be difficult to present statistical results of its usefulness and usability. Since these modules were used in a class run in the summer of 2019, we used the results of the course surveys as an indirect assessment of these materials. In particular, we emphasized the questions most pertinent to the library itself; the full list of questions is presented for completeness. For our purposes, we focused on question 12 (Q12), which we felt was the most relevant for assessing our module/library.
Through statistical analysis of the survey data with the R language, we identified some issues of note. This paper attempts to find the relationship between course materials, course content, and course satisfaction, so as to help optimize the materials and meet the needs of the target groups.

Results
Given that this is a library, it can be difficult to present statistical results. However, we ran some of these modules as labs in a course during the summer of 2019, and at the end of the class we collected course surveys. We used some questions in the survey as an indirect way of assessing the library for usability and usefulness. We also used the survey to assess the tutorial and videos, to help gauge whether students understood the code written for the library and whether there were correlations. Finally, we present a simple use case to illustrate the use of the modules. This use case was taken directly from the lab materials in the course and is related to malware analysis.

Use Case
The malware module focuses on data collected via dynamic analysis (e.g., running a virus in a sandbox through an emulator). The emulator generates logs of the behavior of the malware, which are then processed for feature extraction. The contents of a log are shown in Figure 4.
Each log is treated as a document, and its content is treated as words. With this view of the data, features can be extracted using a bag of words approach. The log labels, or classes, can be included in the file name of each goodware or malware sample, as can be seen in Figure 5. As shown in Figure 6, the module uses CountVectorizer, a function from the sklearn library, to implement the bag of words approach on each log file; each log file therefore becomes a sample, or row, in the output data set. Finally, as shown in Figure 7, the module uses pandas to create a data frame containing the output data for every input log. The output is a CSV file containing all samples.
As can be seen in Figure 9, the terms can relate to anything that can happen on a Windows system. In general, binaries call DLLs, open files, make internet connections, and perform registry edits. The goal of an ML algorithm, given enough data, would be to learn to associate, for example, DLL 37 with malware files and DLL 34 with benign binaries, according to their co-occurrence.

Survey Analysis
The analysis in Table 3 showed that there was no correlation between Q12 and Q1, Q2, or Q3. Knowing how to extract features was not correlated with understanding Weka (Q1: p-value = 0.427), sklearn (Q2: p-value = 0.809), or TensorFlow (Q3: p-value = 0.303). A correlation was considered significant at p-value ≤ 0.05. This seems reasonable, since feature extraction and the use of ML algorithms can be learned separately.
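The analysis reported here was run in R; an equivalent rank-correlation check in Python can be done with scipy. The answer vectors below are illustrative 1-5 Likert responses, not the actual survey data.

```python
from scipy.stats import spearmanr

# Illustrative Likert responses (NOT the actual survey data).
q12 = [5, 4, 4, 3, 5, 4, 2, 5, 4, 3]
q1 = [3, 5, 2, 4, 3, 4, 5, 2, 3, 4]

# Spearman rank correlation is a common choice for ordinal Likert data.
rho, p_value = spearmanr(q12, q1)

# Matching the paper's criterion: significant at p-value <= 0.05.
significant = p_value <= 0.05
```

With real survey data, a non-significant p-value here would reproduce the paper's finding that the two questions are uncorrelated.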

In addition, an analysis was conducted to determine whether the lecture and lab files helped students understand ML; the results are shown in Table 4. The preparation of the lectures and labs (Q4 and Q5) was clearly helpful for understanding the course logic and understanding ML (Q8: p-value = 0.029; Q10: p-value = 0.021). The survey results (Figure 10) indicated that, after this one-week course, 77.36% of students felt they had a better understanding of ML and how it can be applied to cyber security problems, 15.79% did not know whether they understood how ML works, and 10.53% thought they could not understand ML from this course.
In Figure 11, we focused again on Q12. The pie chart in Figure 11 shows that 88.88% of students agreed that they "can pre-process data into vector space model formats for use in machine learning applications", while 5.56% disagreed.
Figure 11. Survey results on students' answers to Q12.
From Figure 12, it can be seen that 88.89% have a better understanding of deep neural networks and the rest of them think they have the opposite opinions. This showed that there is strong interest in deep learning by students, and we will extend the library to reflect this aspect.  Table 5 shows the summary statistics of each question. Only the mean of Q11 was under 4, which meant videos were not useful for helping students to get improvement from this course or they did not have time to watch them. It is important to note that the videos were not made available to most of students prior to taking the course. From Figure 12, it can be seen that 88.89% have a better understanding of deep neural networks and the rest of them think they have the opposite opinions. This showed that there is strong interest in deep learning by students, and we will extend the library to reflect this aspect.
Information 2020, 11, 100 11 of 14 In Figure 11, we focused again on Q12. The pie chart shown in Figure 11 shows that 88.88% students agree that they "can pre-process data into vector space model formats for use in machine learning applications", while 5.56% disagree with that. Figure 11. Survey results on students' answers to Q12.
From Figure 12, it can be seen that 88.89% have a better understanding of deep neural networks and the rest of them think they have the opposite opinions. This showed that there is strong interest in deep learning by students, and we will extend the library to reflect this aspect.  Table 5 shows the summary statistics of each question. Only the mean of Q11 was under 4, which meant videos were not useful for helping students to get improvement from this course or they did not have time to watch them. It is important to note that the videos were not made available to most of students prior to taking the course.  Table 5 shows the summary statistics of each question. Only the mean of Q11 was under 4, which meant videos were not useful for helping students to get improvement from this course or they did not have time to watch them. It is important to note that the videos were not made available to most of students prior to taking the course.
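As a minimal sketch of how per-question summary statistics such as the means in Table 5 can be computed, consider the snippet below. The Likert answers are invented for illustration (5 = strongly agree, 1 = strongly disagree) and are not the survey's actual data; the question labels follow the paper's Q-numbering.

```python
# Hypothetical Likert-scale responses (5 = strongly agree ... 1 = strongly disagree).
# These values are illustrative only, not the actual survey data.
import statistics

responses = {
    "Q10": [5, 4, 5, 4, 4, 5, 4, 4],  # e.g., "lectures/labs helped me understand ML"
    "Q11": [3, 4, 3, 2, 4, 3, 3, 4],  # e.g., "the videos were useful" (rated lower)
}

# Compute the mean and standard deviation per question, as in a Table-5-style summary.
for question, answers in responses.items():
    mean = statistics.mean(answers)
    stdev = statistics.stdev(answers)
    print(f"{question}: mean = {mean:.2f}, stdev = {stdev:.2f}")
```

With these made-up answers, only Q11 has a mean below 4, mirroring the pattern reported in Table 5.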

Discussion
In this paper, we presented and discussed a new library for ML and cyber security. To assess the usefulness of the materials (e.g., the modules for feature extraction), we considered the following question: • "Is there any relation between the learning materials and mastery of the knowledge?" Spearman correlation is commonly used to assess relationships between ordinal variables, which suits the data studied in this paper, because the survey responses were recorded on a Likert scale ranging from strong agreement to strong disagreement. According to Table 4, Q5 and Q8 are related, because the p-value obtained from the correlation test is less than 0.05. The result can be interpreted as follows: with a better understanding of the lectures and labs, students feel they can solve cyber security problems using machine learning, which is a good example of how the course materials help students learn. Finally, Figure 11 shows that 88.89% of students agree that they "can pre-process data into vector space model formats for use in machine learning applications".
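The Spearman-based significance test described above can be sketched as follows. This is illustrative code, not the authors' analysis script; the Likert responses for Q5 and Q8 are invented for demonstration, and `scipy.stats.spearmanr` is used as one standard implementation of Spearman's rank correlation.

```python
# Illustrative sketch: testing whether two Likert-scale survey questions are
# monotonically related using Spearman's rank correlation, with the same
# significance threshold as the paper (p-value <= 0.05).
from scipy.stats import spearmanr

# Hypothetical 5-point Likert answers (5 = strongly agree ... 1 = strongly disagree).
q5 = [5, 4, 5, 3, 4, 5, 2, 4, 5, 4]  # e.g., "the labs were well prepared"
q8 = [5, 4, 4, 3, 4, 5, 3, 4, 5, 3]  # e.g., "I can solve cyber problems with ML"

rho, p_value = spearmanr(q5, q8)
print(f"rho = {rho:.3f}, p = {p_value:.3f}")
if p_value <= 0.05:
    print("Correlation considered significant at the 0.05 level")
```

A p-value at or below 0.05, as for the Q5/Q8 pair in Table 4, indicates a statistically significant monotonic association between the two questions.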

Future Work
Future work will involve improving the library and adding more modules. Possible future data sets and modules include web security data, phishing data, steganography data, etc. Additionally, we will provide more examples of how to use the library in various other applications.

Conflicts of Interest:
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.