Mitigating Insider Threats Using Bio-Inspired Models

Abstract: Insider threats have become a considerable information security issue that governments and organizations must face. The implementation of security policies and procedures may not be enough to protect organizational assets. Even with the evolution of information and network security technology, the threat from insiders is increasing. Many researchers are approaching this issue with various methods in order to develop a model that will help organizations to reduce their exposure to the threat and prevent damage to their assets. In this paper, we approach the insider threat problem and attempt to mitigate it by developing a machine learning model based on bio-inspired computing. The model was developed by using an existing unsupervised learning algorithm for anomaly detection, and we fitted the model to a synthetic dataset to detect outliers. We explore swarm intelligence algorithms and their performance on feature selection optimization for improving the performance of the machine learning model. The results show that swarm intelligence algorithms perform well on feature selection optimization and that the generated, near-optimal subset of features has a similar performance to the original one.


Introduction
The recent Data Breach Investigations Report (DBIR) by Verizon reports that 34% of the reported data breaches were a result of internal actors' involvement and 2% of the data breaches were the result of a partner's involvement [1]. The report was based on an analysis of 41,686 security incidents, of which 2,013 were confirmed data breaches. The previous yearly data breach reports, DBIR 2018 and DBIR 2017, show a data breach percentage involving internal actors of 28% and 25%, respectively [2,3]. The insider threat has been on the rise, and the latest DBIR reports by Verizon confirm the rapid growth of the problem. The term "Data Breach" indicates a confirmed data disclosure after the event of a security incident [1]. An internal actor, or insider in an organization, is a current or former employee, partner, contractor, consultant, temporary personnel, personnel from partners, subsidiaries or contractors, and anyone else who has been granted access privileges to the organization's network or data [4,5]. The Computer Emergency Readiness Team (CERT) National Insider Threat Center defines a malicious insider as "a current or former employee, contractor, or business partner" who has authorized access to the organizational system and network resources and has intentionally exceeded or used that access in a manner that compromises the confidentiality, integrity and availability of the organization's data and information systems. An unintentional insider threat is an internal actor who has authorized access to organizational system and network resources and "causes harm or substantially increases the probability of future serious harm of the confidentiality, integrity and availability of the organization's data and information systems" [6]. In summary, the insider threat is a security threat that describes the intentional or unintentional misuse of privileges by an internal actor that causes damage to an organization's assets.
A survey conducted by the CERT National Insider Threat Center and CSO Magazine revealed that 30% of the survey respondents considered the damage caused by insider attacks more severe than the damage caused by outsider attacks [6]. Insider attacks include information system sabotage, theft of intellectual property, disclosure of confidential information, theft of trade secrets and espionage, all of which lead organizations to financial losses and negatively impact their reputation and brand [6].
There is plenty of literature available on the mitigation of the insider threat problem that focuses on methods for detecting the insider threat. In this paper, we focus on the detection of insider threat using bio-inspired computing and utilizing machine learning.
Bio-inspired computing is an emerging approach, inspired by biological evolution, to develop new models that provide solutions for complex optimization problems in a timely manner. The explosion of data in the digital era has created challenges that are difficult to approach with traditional and conventional optimization algorithms, and has led the scientific community to develop bio-inspired algorithms that can be applied as a solution. Swarm Intelligence is a family of bio-inspired algorithms. These algorithms have been proposed by researchers to solve optimization problems by obtaining near-optimal solutions [7].
The purpose of this paper is to approach the insider threat problem with a new model that utilizes algorithms inspired by nature and contributes to the insider threat domain research by exploring metaheuristic algorithms to solve feature selection optimization problems and improve the performance of machine-learning-based insider threat detection models. The performance improvement of these models will help organizations and governments to detect malicious insiders in time, and prevent severe damage.
The rest of the paper is organized as follows. In Section 2, we review research related to the mitigation of insider threats. In Section 3, we present our methodology for the proposed approach. In Section 4, we present our findings from the evaluation of the algorithms and the improvement in the machine learning model after feature selection optimization. In Section 5, we discuss the results and findings. Finally, in Section 6, we conclude our findings and discuss future work.

Literature Review
The CERT division, part of Carnegie Mellon University's Software Engineering Institute, provides insider threat mitigation recommendations with the release of the "Common Sense Guide to Mitigating Insider Threats", based on research and analysis of previous insider threat cases. The guide describes the practices that organizations should implement in order to reduce their exposure to the insider threat problem. Although this is the sixth edition of the guide, the insider threat problem continues to rise, which is another indication that further research is needed on the detection aspect of the problem [6]. Schultz [4] presents a framework based on insider behavior, to define insider-attack-related indicators and predict an attack. Using multiple and varied indicators gives a better chance of detecting or predicting the insider threat than using a single one [4]. While some indicators, such as "Preparatory behavior", will indeed detect an insider attacker in the reconnaissance phase trying to gather information about the target, others, such as "Meaningful Errors", depend on the attacker's skills, and it will be hard to detect a skilful attacker.
Salem et al. [8] conducted research regarding the approaches and techniques for insider threat detection and acknowledge the challenge of building an effective and accurate system for detecting insider attacks. Brown et al. [9] propose a system to monitor electronic communication in an organization, to identify and predict an insider threat early. The system is based on personality factors and word correlations. It detects common words in the communication data and calculates a score based on the predefined words' frequency of use. These scores are then combined into a composite personality factor score for neuroticism, agreeableness and conscientiousness, which are the three factors that are associated with high insider threat risk [9]. The authors state that their method mitigates possible legal or privacy concerns, but this was before the enforcement of GDPR. Monitoring electronic communication to profile a user is regulated by GDPR and raises privacy and legal issues. Axelrad et al. [10] propose a Bayesian network model, developed based on a list of variables associated with insider threats, to predict the potential malicious insider. The Bayesian network models generate a score for a person based on the person's characteristics. The list of variables was prepared after research through various papers addressing the insider threat problem. Correlations between variables were considered in the design of the model. Categories of variables include "personal life stressor and job stressors", personality and capability, attitude, workplace behavior and degree of interest [10]. As the authors acknowledge, there are some concerns regarding the collection of data for specific variables, such as job satisfaction, in which data may not be accurate. Nurse et al. [5] propose a framework to understand better and fully characterize the insider threat problem, developed after analysis of several real-world threat cases and the relevant literature. 
The authors' proposed unifying framework consists of several classes of components, which are presented in four main areas and broken down into more sections, beginning with the analysis of behavioral and psychological aspects of the actor to understand one's tendency to attack. As the authors acknowledge, it is quite difficult to collect accurate psychological and historical behavioral information regarding insiders, to understand one's mind-set and, in many cases, this applies even after an attack. Behavioral analysis is continued in the next section as well by observing the physical and cyber behavior of the subject. Observing the physical and cyber behavior will be challenging to implement, since regulations vary among countries; for example, in the European Union (EU), the General Data Protection Regulation (GDPR) regulates behavioral observation. Regulations aside, there are many challenges in monitoring the behavior of all insiders, for example contractors and partners. In the third section, the actor's type, enterprise role and state of relationship with the enterprise are defined, for example, whether the actor is an employee, contractor or partner, a current or former one, and in what role they act, such as scientist or engineer. The last two sections analyze the attack and the assets under attack with their vulnerabilities. The proposed framework is indeed simple enough to follow, as the authors mention, and will help enterprises to analyze past attacks and identify weak points in their network, based on the insider attacker's steps [5]. Greitzer et al. [11] propose mitigation strategies and countermeasures for the unintentional insider threat, after their research of relevant cases and papers. The authors review possible causes and contributing factors and propose measures with an emphasis on employees' continuous training, to recognize threats such as phishing and enhance awareness of the insider threat problem.
Mitigation strategies also include the enforcement of security policies and the implementation of security best practices, such as two-factor authentication. Although the proposed measures will enhance the awareness of the problem, they highly depend on the human factor and do not consider a change in an employee's behavior who might become an actual threat [11]. In order to build mechanisms for the detection and prevention of insider threats, real data need to be gathered and this "raises a variety of legal, ethical and business issues" [12].
Appl. Sci. 2020, 10, 5046
Eldardiry et al. [13] propose a global model approach based on feature extraction from a large amount of work practice data on user activities. This data comprises various domain areas, such as log files of logon and logoff events, HTTP browsing history, external device usage and file access. The authors evaluate their multi-domain system, utilizing the ADAMS synthetic dataset to calculate the accuracy of anomaly and outlier detection, and acknowledge that the file access and external device usage domains can be used for easier threat prediction, compared to the logon and HTTP history domains [13]. Rashid et al. [14] utilize Hidden Markov Models (HMM) with CERT's synthetic dataset to "learn" normal user behavior and then use the HMM to detect significant changes in the "already learned" behavior. The authors report that their approach can be used to learn a user's normal behavior, then detect any significant deviations from it and so detect potential malicious insiders with high accuracy. As the authors acknowledge, their model will not detect malicious insiders with no previously logged normal behavior, such as internal actors who attack an organization's systems shortly after they log in. Lo et al. [15] apply a Hidden Markov Model to CERT's synthetic dataset and analyze a number of distance measurement techniques (Damerau-Levenshtein Distance, Cosine Distance and Jaccard Distance) and their performance for detecting changes in user behavior. The authors report that, although HMM outscores each individual distance measurement technique, it needs more than a day to process all data. Le and Zincir-Heywood [16] propose a user-centered machine learning model that detects malicious insiders with high accuracy. The authors present a machine learning model focused on supervised learning by employing popular algorithms, such as Logistic Regression (LR), Random Forest (RF) and Artificial Neural Network (ANN).
Even though their proposed system detects malicious insiders with limited training, the authors propose the use of more sophisticated data pre-processing techniques and feature analysis to improve system performance. Liu et al. [17] acknowledge that, despite previous research on mitigating insider threats, organizations continue to report severe damage caused by malicious insiders. In their survey, they review several proposed systems addressing insider threats based on data analytics and identify relevant challenges. Log data analysis requires the collection of huge amounts of data from a wide array of systems, and a dedicated system to store the data for further processing. Since log data are generated on a variety of systems, there is no standard format for collecting them, and data pre-processing must be performed in order to clean the data and extract relevant features. As the authors acknowledge, this process requires extensive scripting and coding skills and a deep understanding of the various systems involved. Another challenging problem is extracting the essential and relevant features and managing them effectively by selecting the optimal subset to capture the attacker's footprint in time. Detecting the attacker's tiny footprint is like trying to "find a needle in a haystack", and the challenge comes in deciding which method to utilize. The authors report that incorporating prior domain knowledge to a certain degree during the feature extraction process may offer better results than entirely relying on prior domain knowledge.
In the literature reviewed for this paper, we found only limited work that uses bio-inspired computing to mitigate insider threats. Much research focuses on machine-learning-based insider threat detection to identify user behavior that deviates from normal behavior. Machine learning models rely on domain knowledge in the feature extraction and selection process, resulting in more time being consumed during data pre-processing and limited effectiveness in detecting threats in cases where domain knowledge is not available. Several researchers have utilized bio-inspired models to find near-optimal solutions to complex problems. In this paper, we utilize bio-inspired computing to enhance machine learning models by automating the feature selection process, and we utilize unsupervised algorithms for outlier detection.

Methodology
Our proposed approach uses Swarm Intelligence algorithms to automate feature selection optimization and eliminate unnecessary features before fitting the log data to a machine learning algorithm. Feature selection, or variable selection, is the process of selecting the most relevant features for a specific problem and omitting unneeded, irrelevant or redundant features, to improve a model's prediction performance and reduce resource requirements and processing and utilization times [18]. We start by consolidating the log data from a synthetic dataset into a single data frame, parsing specific data such as the date into more meaningful and useful data chunks, so that our algorithms can be more efficient. An automatic feature selection optimization, which is described in Section 3.3, is then applied to the generated data frame to produce the optimum subset of features. In Section 3.1, we present an overview of the proposed system, along with the data flow between the system components.

System Overview
In Figure 1, we illustrate an overview of our proposed system for malicious behavior detection. In the first step, we collect data from various sources as well as the user information. Data pre-processing takes place in the next step, along with all relevant log parsing. In the third step, feature selection optimization using bio-inspired models is performed to generate the optimal feature subset. In the fourth step, we fit the anomaly detection algorithm to the generated subset to detect outliers. In the final step, we analyze the results and measure the algorithm's performance.

Data Collection and Pre-Processing
Data are an essential element in threat detection models and play a crucial role in the detection of security incidents [19]. In order to avoid legal and privacy issues we are using a publicly available synthetic dataset that has no privacy constraints. The synthetic datasets contain logs generated specifically for insider threat research, and each dataset contains a small number of insider threat incidents regarding the dataset's accompanied scenario [12]. We need to parse records from the log data and extract data element values in a way that our model can use and make accurate predictions from the given data.

Dataset and Data Collection
The synthetic dataset that we evaluated in our system is a release from the Insider Threat Test Dataset collection, generated by the CERT division of Carnegie Mellon University. This dataset is "free of privacy and restriction limitations" [12] to allow researchers to experiment with it and evaluate algorithms. There are various releases of datasets to choose from, with most of them having one instance of each scenario, depending on the creation time. We chose the release r4.2, the "dense needle" dataset, since it includes many instances of each scenario, with multiple users involved in each scenario. The r4.2 dataset is split up into seven (7) different parts, as shown in Table 1. We do not employ the data of psychometric.csv in our system, since it contains personality data and we want to avoid any privacy or legal issues regarding this in a real-life scenario. Psychometric.csv provides personality scores for each user, based on the big five personality traits. Since the dataset used is a synthetic one and there are no privacy or legal constraints in using its personality scores, we could add these features as well to experiment with. However, this work is intended to be used by organizations that might not have this kind of data. In addition, psychometric data are usually recorded by Human Resources (HR), and it will be difficult or even impossible to include scores for all internal actors, such as consultants, contractors, partners or even personnel from subsidiaries. Furthermore, we followed the approach of Rashid et al. [14] and chose features that can be used to model user behavior across several different domains.
The Http.csv file contains data that can be used to trace employees' visits to employment websites and report indications of unsatisfied or even disgruntled employees who are planning to leave the company. While these data can be useful to predict an employee's intention to leave the company, privacy concerns arise. Furthermore, an additional overhead is added to the model, which ultimately may not be effective in a real-world scenario where employees can use their smartphones to access this kind of website.

Data Pre-Processing
Date values were collected as they were, but we had to split and encode the date feature into two features, day and time. Machine learning algorithms operate on numeric input, so we had to convert the date and time into numbers, as well as the other features, such as the activity. The activity feature's initial values correspond to user actions, such as logon to a system, logoff, connect thumb drive, disconnect, send an e-mail, process a file or access the internet. Since there are seven (7) different actions for the activity feature, we can replace each action with an integer: logon is 1, logoff is 2, etc. Our system's initial selected features are presented in Table 2, along with their value space.
(1) The values for the activity feature correspond to the user's activities, such as logon, logoff, connect, disconnect, HTTP, file and e-mail.
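As an illustration of this encoding step, the following is a minimal plain-Python sketch rather than the code used in our system. The function name encode_record, the timestamp format, the choice of weekday and hour as the numeric day and time values, and all activity codes other than logon = 1 and logoff = 2 are assumptions made for the example.

```python
from datetime import datetime

# Integer codes for the seven actions. Only logon=1 and logoff=2 are stated
# in the text; the remaining codes are assumed to follow the listed order.
ACTIVITY_CODES = {
    "logon": 1, "logoff": 2, "connect": 3, "disconnect": 4,
    "http": 5, "file": 6, "email": 7,
}


def encode_record(date_str, activity, date_format="%m/%d/%Y %H:%M:%S"):
    """Turn one raw log record into numeric day, time and activity values."""
    ts = datetime.strptime(date_str, date_format)
    return {
        "day": ts.weekday(),  # 0 = Monday ... 6 = Sunday (assumed encoding)
        "time": ts.hour,      # hour of day, 0-23 (assumed encoding)
        "activity": ACTIVITY_CODES[activity.lower()],
    }


record = encode_record("01/04/2010 07:12:31", "Logon")
print(record)
```

Keeping the day and the time as separate numeric features lets a distance-based detector treat "unusual weekday" and "unusual hour" as independent signals.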
An example dataset comprised of the aforementioned features is shown in Table 3.
(1) We encoded categorical features using the One-Hot Encoding scheme at a later stage, since machine learning algorithms are more effective in prediction when working with datasets encoded with this scheme.
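One likely reason for the benefit of One-Hot Encoding here is that plain integer codes impose an artificial order on the categories (for example, "file" = 6 would appear closer to "email" = 7 than to "logon" = 1), which distorts distance computations. The sketch below shows the scheme in plain Python; the helper name one_hot is hypothetical, and in practice library routines such as pandas.get_dummies or scikit-learn's OneHotEncoder serve the same purpose.

```python
def one_hot(values, categories=None):
    """Return (categories, rows), where each row has a single 1 per value."""
    cats = sorted(set(values)) if categories is None else list(categories)
    index = {cat: i for i, cat in enumerate(cats)}
    rows = []
    for value in values:
        row = [0] * len(cats)
        row[index[value]] = 1  # mark the column that matches this value
        rows.append(row)
    return cats, rows


# Integer activity codes observed in a few (made-up) log records.
cats, rows = one_hot([1, 2, 1, 5])
print(cats)
print(rows)
```

Passing an explicit categories list keeps the column layout stable when a sample happens not to contain every possible value.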

Feature Selection Optimization using Bio-Inspired Algorithms
For the purpose of this paper, we decided to use the EvoloPy-FS framework [20–25] and measure the performance of several popular Swarm Intelligence algorithms on the feature selection optimization problem. EvoloPy-FS is an easy-to-use Python framework, developed by its authors to help researchers solve optimization problems using Swarm Intelligence algorithms. The main component of the framework is the Optimizer, where we set up our experiment along with the initial configurations.
In the optimizer, we define the dataset to use, the optimizers, the number of runs and number of iterations. For each implementation of the included optimizers, there is a separate Python script, since the framework is Open Source and all included components are transparent.
The Optimizers are used to generate a near-optimal subset of features from the original dataset, free from unnecessary features, to improve the model's anomaly prediction performance.
In our approach, for feature selection optimization we selected Binary Particle Swarm Optimization (BPSO), Binary Gray Wolf Optimizer (BGWO), Binary Bat Algorithm (BBAT), Binary Multi-Verse Optimizer (BMVO), Binary Moth-Flame Optimizer (BMFO), Binary Whale Optimization Algorithm (BWOA) and Binary Firefly Algorithm (BFFA) as optimizers, as they are available in EvoloPy-FS framework. We employed these bio-inspired models in the feature selection optimization process, to generate the optimal subset and fit the Machine Learning algorithm on it, to get better results compared with the original dataset.
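To make the idea of binary swarm-based feature selection concrete, the following is a minimal, self-contained sketch of a binary particle swarm optimizer. It is not the EvoloPy-FS implementation: the function names, the leave-one-out 1-nearest-neighbour fitness, the weight alpha and the swarm parameters (apart from the general BPSO scheme with a sigmoid transfer function) are all assumptions chosen for illustration, and the data are a toy set in which only the first feature is informative.

```python
import math
import random


def knn_error(X, y, mask):
    """Leave-one-out 1-nearest-neighbour error using only the masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 1.0  # an empty subset cannot classify anything
    errors = 0
    for i, row in enumerate(X):
        best_dist, best_label = float("inf"), None
        for k, other in enumerate(X):
            if k == i:
                continue
            dist = sum((row[j] - other[j]) ** 2 for j in cols)
            if dist < best_dist:
                best_dist, best_label = dist, y[k]
        errors += best_label != y[i]
    return errors / len(X)


def fitness(X, y, mask, alpha=0.99):
    """Lower is better: weighted error plus a small penalty per kept feature."""
    return alpha * knn_error(X, y, mask) + (1 - alpha) * sum(mask) / len(mask)


def binary_pso(X, y, n_particles=10, n_iter=15, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for a 0/1 feature mask that minimizes the fitness function."""
    rng = random.Random(seed)
    dim = len(X[0])
    pos = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(X, y, p) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Sigmoid transfer: the velocity sets the probability of a 1 bit.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(X, y, pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Toy data: feature 0 tracks the label, features 1-3 are pure noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[label + data_rng.gauss(0, 0.1), data_rng.random(),
      data_rng.random(), data_rng.random()] for label in y]
mask, score = binary_pso(X, y)
print(mask, round(score, 4))
```

The other binary optimizers listed above differ mainly in how candidate masks are moved through the search space; the wrapper idea of scoring each candidate subset with a learner plus a size penalty stays the same in this sketch.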

Utilizing Machine Learning for Outlier Detection
Machine Learning (ML) is applied in a wide range of applications to discover patterns in given data and make predictions, for example in anomaly detection. Several approaches reported in the literature utilize Machine Learning as an effective method for detecting anomalies. For the purpose of this paper, we developed a Machine Learning system, focused on user-centered analysis, to distinguish malicious activities from legitimate ones. The system utilizes the Local Outlier Factor (LOF) to detect outliers, or rare instances, in the given data. LOF is fitted to the subset data frame, generated after feature selection optimization, to detect the outliers. For every detected outlier, the system marks the corresponding insider as malicious. The system marks an outlier based on the selected features of the entire subset data frame, and not based on the scenarios that accompany the CERT dataset.
Local Outlier Factor (LOF) is an unsupervised anomaly detection algorithm, which uses a score to determine whether a certain point is an anomaly. Each datapoint is assigned this score, which results from computing the local density deviation of the given datapoint with respect to its neighboring data points. If a given datapoint has a substantially lower density than its neighbors, then it is considered an outlier [26]. The LOF algorithm is considered an efficient method to detect outliers in high-dimensional datasets. In our system, we employed sklearn.neighbors.LocalOutlierFactor from the scikit-learn module, which utilizes K-nearest neighbors to compute the local density of a datapoint. The LOF score is determined from the ratio of the average local density of the observation's neighbors to its own local density [27]. In LOF, we need to define the values of n_neighbors, contamination and n_jobs [27]:
• n_neighbors: the number of neighbors to take into consideration to detect the outliers. If the value is larger than the number of provided samples, then all samples will be used;
• contamination: the proportion of outliers in the dataset. Contamination is used to define the threshold on the scores of the samples we fit LOF to;
• n_jobs: the number of parallel jobs to run for the neighbors search. A value of −1 uses all available processors.
We fit LOF to the "candidate" optimal subset data frame, generated after performing feature selection optimization using bio-inspired models, to detect outliers.
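A minimal sketch of this step with scikit-learn is shown below. The data are randomly generated stand-ins for the feature subset (200 ordinary rows and 5 rows placed far away), not the CERT data, and the variable names are chosen for the example only.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Toy stand-in for the feature subset: 200 "normal" rows plus 5 distant ones.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
rare = rng.normal(loc=8.0, scale=0.5, size=(5, 3))
X = np.vstack([normal, rare])

lof = LocalOutlierFactor(n_neighbors=20, contamination="auto", n_jobs=-1)
labels = lof.fit_predict(X)              # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_   # larger value = more anomalous

outlier_rows = np.flatnonzero(labels == -1)
print(len(outlier_rows), "rows flagged as outliers")
```

In a user-centred setting, the flagged row indices would then be mapped back to the user identifiers of the corresponding log records.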

Performance Metrics
To measure the performance of the subset dataset, generated after feature selection optimization, we followed the performance metrics of Ferreira et al. [28], who used the insider detection rate and insider detection precision. As mentioned in Section 3.4, the subset dataset is fitted to LOF to detect anomalies, and we measure the performance of the results for each subset dataset tested, to find the optimal subset. The subset dataset that produces the best precision will be selected as the optimal one. Insider detection rate (DR) and precision can be determined by using Equations (1) and (2):
\[ \mathrm{DR} = \frac{TP}{TP + FN} \tag{1} \]
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{2} \]
In Equations (1) and (2), True Positive (TP) represents the number of true malicious users detected, False Positive (FP) represents the number of normal users that were detected as malicious, and False Negative (FN) represents the number of malicious users that were not detected as malicious, but falsely considered as normal users.
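The two metrics can be computed per user as in the following sketch. The function name detection_metrics and the user identifiers are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def detection_metrics(flagged_users, malicious_users):
    """Return (detection rate, precision) for sets of user identifiers."""
    flagged, malicious = set(flagged_users), set(malicious_users)
    tp = len(flagged & malicious)   # malicious users that were flagged
    fp = len(flagged - malicious)   # normal users that were flagged
    fn = len(malicious - flagged)   # malicious users that were missed
    dr = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return dr, precision


# Four users flagged by the detector, three users truly malicious.
dr, precision = detection_metrics({"u1", "u2", "u3", "u4"}, {"u1", "u2", "u5"})
print(f"DR = {dr:.3f}, precision = {precision:.3f}")
```

Here TP = 2, FP = 2 and FN = 1, so the detection rate is 2/3 and the precision is 1/2.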

Results
In this section, we present the results from our experiments, along with the various scenarios we executed to test the model and measure the performance of the utilized swarm intelligence algorithms for feature selection optimization.

Experimental Setup
All data processing tasks for this paper were performed on a PC with an Intel Core™ i5-4200M CPU @ 2.5 GHz and 16.0 GB of dual-channel DDR RAM @ 798 MHz. All algorithms were tested using Anaconda's Python distribution version 2019.07. The global settings are the same for all Swarm Intelligence algorithms in order to have fair comparisons. Population size is set to 50 search agents, and the number of iterations is set to 20. For the purpose of this paper, we used the Python programming language along with several open-source libraries, such as Scikit-learn.

Testing the Model
The objective of the test is to improve the performance of the Machine learning model by obtaining the optimal subset, using Bio-inspired models, before fitting Local Outlier Factor (LOF) algorithm to it and determine the outliers.
We created samples of the first 50,000 rows from the synthetic dataset, to work with a smaller portion of it, for performance reasons. Sample creation code is presented in Figure 2. The data collection and data pre-processing code is then executed to prepare the data frame with all selected features, to which the feature selection optimization is fitted.
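For readers without access to the figure, taking the first rows of a log file can be sketched as follows. This is an illustrative standard-library version, not the code of Figure 2; the helper name head_rows, the in-memory demo file and its column names are assumptions.

```python
import csv
import io
from itertools import islice


def head_rows(handle, n_rows):
    """Read the header plus the first n_rows records from an open CSV file."""
    reader = csv.DictReader(handle)
    return list(islice(reader, n_rows))


# In-memory stand-in for a log file (column names are assumed).
demo = io.StringIO(
    "id,date,user,pc,activity\n"
    "a1,01/04/2010 07:12:31,U001,PC-1,Logon\n"
    "a2,01/04/2010 17:40:02,U001,PC-1,Logoff\n"
    "a3,01/05/2010 08:01:10,U002,PC-2,Logon\n"
)
sample = head_rows(demo, n_rows=2)
print(len(sample), "rows sampled")
```

With a real file, the same call would be made on an opened file handle with n_rows=50000; because islice stops reading after the requested number of records, the rest of the file is never loaded into memory.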
We fitted the feature selection optimization framework on the first thousand (1000) rows of the generated data, with the results shown in Table 4. The seven (7) columns represent the results for each of the selected optimizers: BPSO, BMVO, BGWO, BMFO, BWOA, BFFA and BBAT. Time taken is in seconds and is about the same for all seven optimizers. Training and testing accuracy are computed on the sliced part for each value, as shown in Figure 3, and against the complete dataset. (1) Results of a single independent run executed for each algorithm.
To test whether the feature selection output is an optimal subset, we need to fit LOF to the subset generated after the feature selection. The performance can be measured by comparing the time taken for insider threat detection along with the detection rate and precision. Precision is the ratio of True Positives to the total number of positive results, Precision = TP / (TP + FP), with 1 being the best value and 0 the worst.
The first measurement uses all features (Figure 4), as generated by the data pre-processing phase, to obtain results without feature selection optimization. Following this, we run further experiments based on the results of feature selection optimization. The results of the experiments are presented in Table 5.
The results in Table 5 show a high detection rate but very low precision, since there is a high number of FP. We can experiment with the parameters of the LOF algorithm and change the n_neighbors and contamination values to see whether precision improves. To obtain the results in Table 5, n_neighbors was set to 20 and contamination was set to auto.
The performance of LOF is highly dependent on the values of contamination and n_neighbors [29]. We set the value of n_neighbors to 20, which is the default value of the utilized Machine Learning algorithm [27] and defines the number of neighbors that need to be taken into consideration to detect the outliers. As mentioned in Section 3.4, contamination represents the proportion of outliers in the dataset and its value defines the number of objects to be predicted as anomalies. Contamination can be set as a float number with a value between 0 and 0.5 [27]. The higher the value, the more objects will be predicted as anomalies. Since we had a high number of FP in the first run of the model (Table 5), we continued our experiments by reducing the value of contamination to reduce the number of FP.
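The effect of contamination on the number of predicted anomalies can be sketched with scikit-learn's `LocalOutlierFactor`, assuming this is the implementation referred to in [27]. The two-dimensional data below are randomly generated for illustration only and are unrelated to the dataset used in this paper; `count_flagged` is a hypothetical helper.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Illustrative data (an assumption): 490 "normal" points around the origin
# and 10 points placed far away from them.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(490, 2)),
    rng.uniform(6.0, 8.0, size=(10, 2)),
])

def count_flagged(X, contamination, n_neighbors=20):
    """Fit LOF and count the objects predicted as anomalies (label -1)."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    labels = lof.fit_predict(X)  # 1 for inliers, -1 for outliers
    return int((labels == -1).sum())

flagged = {c: count_flagged(X, c) for c in ("auto", 0.1, 0.01)}
for c, n in flagged.items():
    print(f"contamination={c}: {n} objects flagged")
```

With a float contamination value, the decision threshold is placed so that roughly that proportion of objects is flagged, which is why lowering the value reduces the number of predicted anomalies.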
We changed the contamination value to 0.1 and obtained results similar to those obtained with contamination set to auto (results are shown in Table 6). The optimal subset remains the same, but the precision value is still very low. By changing the contamination value to 0.01, we got slightly different results (shown in Table 7) compared with the previous two cases and, in this case, the better precision value is produced by experiment #2, with a subset dataset of five (5) features. This subset dataset was produced by all seven (7) algorithms during most of the algorithms' iterations. We continued our experiments by tuning the contamination value in LOF and fitting the LOF algorithm on the optimized subset of four (4) features. In Table 8, the reported results indicate that, for the specific synthetic dataset, we can get better results with lower contamination values. The performance of an LOF algorithm depends on the values of its parameters, contamination and neighborhood size [29]. When experimenting with synthetic data in which the anomaly portion is known, the contamination value and neighborhood size can be tuned based on this known portion to obtain better results. In a real-world scenario, these parameters can be tuned based on historic evidence of malicious insiders, or by using the methodology of Xu et al. [29] for automatically tuning the hyperparameters of Local Outlier Factor.

Discussion
The findings of Section 4.2 show that, after feature selection optimization, one of the resulting subsets is a near-optimal subset, since it performs better than the original one when measured with precision. This near-optimal subset was generated after feature selection optimization by using the BPSO, BMVO, BWOA, BFFA and BBAT optimizers.
These results indicate that swarm intelligence algorithms perform well on feature selection optimization problems and can be used to enhance machine learning models. The use of bio-inspired models in our proposed system resulted in a better precision value when utilized in the feature selection optimization process, before fitting the machine learning algorithm to the dataset.
Rashid et al. [14] used HMM and reported an 85% identification rate with a false positive rate of 20%. Lo et al. [15] acknowledge that the training phase of HMM can be quite slow as the number of features increases. Lo et al. [15] used HMM and distance measurement techniques and reported detection rates of 69% and 80% (aggregate score), respectively. The authors reported that HMM took more than 24 h to process all data, in contrast to the distance measurements, which took minutes to process. While the authors report that the combination of the three distance measurement techniques has the potential of raising a high number of FP, they do not mention the number of FP in their results. While our approach reports a high number of FP, it also reports a better TP identification rate compared with other approaches, as reported in Table 9.

Conclusions
In this paper, we introduce the use of bio-inspired computing in machine learning models for mitigating insider threats and we improve the model by automating the feature selection optimization process. We evaluate several swarm intelligence algorithms and our results show that swarm intelligence algorithms should be employed to improve accuracy and speed in detecting malicious behavior in large data sets.
An optimal subset with reduced features has performance similar to or better than that of the original one and can improve the performance of a machine learning model.
The employment of labeled data and the addition of extra features, such as an indication of visits to employment websites, social networking sites and cloud storage services, where an internal actor can share confidential information with others, will improve the performance of our system in the detection of malicious insiders and reduce the FP rate. These additional indicators must be transparent to all internal actors, who must be clearly informed about all their log activity, according to relevant privacy regulation. The collection of an employee's private data in an organization might raise several issues and concerns and eventually reduce productivity.
The usage of unsupervised ML techniques to detect anomalies using unlabeled data not only protects the privacy of legitimate internal actors, but also detects anomalies in the behavior of malicious insiders with no previous logged history, since no prior training is needed for unsupervised learning.

Limitations
The process of detecting malicious insiders by detecting significant changes, or anomalies, in a user's normal behavior might be inconsistent in several circumstances, such as a team of employees working overtime on a project with a strict deadline, or another team resolving a system failure after working hours. Regarding after-hours activity, certain employees might work on a rotating shift schedule, so detecting outliers based on normal and after-hours activity may not work for them.

Future Research
For future work, we can fit our proposed system to other existing or future releases of CERT's datasets and experiment with other bio-inspired models, such as cuckoo search (CS), comparing their performance with that of the bio-inspired models we experimented with in this paper.
In future work, similar systems should be developed to evaluate hybrid algorithms or to explore the possibility of detecting anomalies with the use of bio-inspired computing.

Funding: This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 833673. This work reflects the authors' view and the agency is not responsible for any use that may be made of the information it contains.