A Review of Insider Threat Detection: Classification, Machine Learning Techniques, Datasets, Open Challenges, and Recommendations

Abstract: Insider threat has become a widely accepted issue and one of the major challenges in cybersecurity. Such threats require special detection systems, methods, and tools that enable accurate and fast detection of a malicious insider. Several studies on insider threat detection and related areas have been proposed, and various studies have aimed to deepen the conceptual understanding of insider threats. However, many limitations remain, such as a lack of real cases, biases in drawing conclusions, and the lack of a study that surveys insider threats from many different perspectives and focuses on their theoretical, technical, and statistical aspects. This survey presents a taxonomy of contemporary insider types, access, level, motivation, insider profiling, affected security property, and methods used by attackers to conduct attacks, together with a review of notable recent works on insider threat detection that covers the analyzed behaviors, machine-learning techniques, datasets, detection methodology, and evaluation metrics. Several real cases of insider threats have been analyzed to provide statistical information about insiders. In addition, this survey highlights the challenges faced by other researchers and provides recommendations to minimize obstacles.


Introduction
Computer networks and telecommunications play a significant role in information exchange. An increase in valuable information, along with the expansion of enabling technologies, has led to an increase in threats. These threats originate not only from outside but also from within the organization. Such threats pose a large security risk and are difficult to detect [1][2][3]. Insider threats can inflict critical damage on the reputation, financial assets, and intellectual property of enterprises. A 2018 report on the insider threat has stated that slightly more than half of threats (53%) in the past 12 months came from inside of organizations. Moreover, 27% of surveyed organizations have stated that attacks originated from within the organizations [4]. In the previous decade, many incidents of insider threats have gradually reached the media; these include well-known cases of data leakage conducted by Edward Snowden and Daniel Ellsberg [1]. Thus, most organizations implement cybersecurity techniques, such as firewalls, intrusion detection, and electronic access systems, to protect data not only from outsider threats but also from potential malicious insiders.
To fend off malicious insiders, organizations should have an insider threat detection system that can detect and mitigate malicious insiders before they perpetrate threats. Unfortunately, the insider threat field is not sufficiently understood. Moreover, the detection mechanisms or approaches that can be used and the limitations of existing solutions remain unexplored. Thus, a review of available insider threat detection systems is urgently needed [5]. The current situation is primarily due to the lack of knowledge on insider threats and their potential damage to organizations.
This study aims to discuss the insider threat problem in a descriptive and analytic manner by studying the insider or attacker to understand its nature, such as insider access, types of insider, insider motivation, insider profiling, affected security property, level of insider, and the methods and actions used by the insider, and by highlighting the researchers' achievements. Furthermore, our objectives in this study are, first, to summarize previous studies conducted on insider threat detection systems and, second, to categorize the field of insider threats based on insider and detection systems; we classified them mainly by the type of study, such as development scenarios based on the behaviors analyzed. The detection mechanisms or approaches can differ in the type of analyzed behavior [5], techniques, datasets used to evaluate the proposed solution, detection methodology, and evaluation metrics. In this work, the major contributions are summarized as follows: (1) classifying the field of insider threats based on the insider as an actor and on insider threat detection systems; (2) analyzing real insider threat cases based on the subsections of the proposed classification; and (3) discussing challenges and recommendations in the field of insider threats. Figure 1 presents the remaining sections of the article in detail.

Related Studies
This section discusses the surveys and review articles on insider threats. Only a few papers were found, i.e., [6][7][8][9][10][11][12][13][14][15][16]. Walker-Roberts et al. [6] conducted a systematic review on insider threat detection; however, the scope of the review focused only on insider threats in healthcare critical infrastructures. Ullah et al. [7] and Alneyadi et al. [14] covered data exfiltration and data leakage. However, only Ullah et al. [7] provided in-depth details and insights into data exfiltration caused by the malicious activities of a remote attacker. In addition, the studies [10][11][12][13] and [15] described the understanding and future directions for the behavior of insider information security threats. Among them, Ho et al. [11] provided details on sociotechnical works with technical and behavioral evidence. Nazir et al. [9] provided a comprehensive study on modeling, simulation, and related techniques that have been used to assess the vulnerabilities of the supervisory control and data acquisition (SCADA) system to cyberattacks. In addition, M. Kim et al. [16] surveyed trends and forecasts of intrusion-detection techniques by categorizing them into basic software and machine-learning techniques. Finally, L. Liu et al. [8] conducted a survey that organizes and discusses insider threats in depth from several perspectives and mainly focuses on three common types of insiders: traitor, masquerader, and unintentional perpetrator.
The main contributions of this review, and what makes it different from the previous studies, can be summarized as follows. Firstly, to the best of our knowledge, this is the first work that surveys the insider threat literature from many different perspectives and focuses on theoretical, technical, and statistical aspects of insider threats. Secondly, we study and analyze many insider threat incidents and provide statistical information about insiders. Finally, we discuss several challenges and highlight some recommendations for future research in insider threats.

Descriptive and Literature Classification
Classifying the various aspects of insider threats and organizing them into a taxonomy is useful. The present study summarizes insider threats in such a taxonomy, as described in Figure 2. The research areas are divided into two main categories. Following the cybersecurity principle stated in [17], security is the culmination of the interactions of people, process, and technology. The first category describes a subject who performs an action. In this case, a potential attacker originates an attack from an internal system or authorized area. This category covers an ecosystem of insiders known as the root of access, types of insider, motivation, insider profiling, and affected security property, as well as the level, method, and actions undertaken.
The second category describes the technology and state of the art in detecting insider threats. It comprises five main elements: namely, analyzed behaviors, techniques and methods, dataset, detection methodology, and evaluation metrics. This section discusses the subclassifications of each element in detail.

Access
By nature, insiders have authorized network access, physical access, or both, which enables them to pose threats. Physical access describes malicious insiders who misuse data systems using physical access to infiltrate organizational data or steal devices. By having such a level of access, they are able to copy documents to a USB drive or any other removable media to steal intellectual property, sabotage data systems, or use identifiable information stored in the organizational system to commit fraud [18]. Network access involves malicious insiders who misuse their access to the organization's data/system. One data source for detecting insiders who use network access is network traffic, which may carry unauthorized content of interest or involve protocols or source and destination endpoint addresses that might be unauthorized. Abusing network access can cause significant damage to the organization, and the majority of insider threats are caused by misusing network access, such as sabotaging and altering information, abusing authorized network access, or installing malicious software [19]. Figure 3 shows that the majority of insider threats are due to network access, which constitutes approximately 66% of the studied cases. This finding reflects the fact that insiders can easily access data and systems using the company network. In the case of data exfiltration, insiders can send data through email or upload it to a cloud service and abuse the data outside of the company. Although most insider cases fall under the network access scenario, physical access cannot be ignored, because it can cause the same damage to the organization when insiders intentionally or unintentionally exploit physical security vulnerabilities.

Insider Types and Methods
Insiders are categorized into three types: namely, traitor, masquerader, and unintentional. Traitors constitute the main category, and, as shown in Figure 4, most of the threats are posed by this type of insider. Traitors are also called misfeasors; the users are from within the organization and do not need to masquerade but use their access privileges to misuse the organization's systems [20].
A masquerader is an external attacker who steals the legitimate identification of an insider and uses the stolen identity to impersonate the insider for malicious intent [8].
The unintentional type pertains to a current employee who unintentionally causes harm or increases the possibility of future harm to the organization [21]. To link the type of insider with the threat or method used to conduct the attack, Liu et al. [8] introduced a taxonomy that matches the types of insider with the methods used or threats they pose. The authors employed the advanced persistent threat (APT) intrusion kill chain to model all types of threats, from early to late stages. Figure 5 shows the details.

Insider Motivation
Defining the motivation of the insider is highly important for facilitating detection, installing appropriate mitigation strategies, and enabling forensics [21]. Cole and Ring [22] divided the motivation into three main factors. The first is financial motivation, which is extremely powerful and leads some people to act in a manner that no one thought possible. Hence, it is no surprise that financial motivation is one of the main factors that drive the malicious insider. The second is political motivation: some people feel very strongly about their political views, and if a company works against their interests, this might be a strong reason for the employee to harm the company or cooperate with outsiders to harm the organization. The third is personal motivation, which can take several forms but mostly appears as blackmail. The attacker targets somebody with personal secrets that the target does not want anyone to know about and then threatens to disclose those secrets if the target does not collaborate. This is a very dangerous method and can cause a lot of trouble for the people involved and their organization.

Insider Profiling
In this review, based on the works [23,24], insider profiling is classified into four categories: namely, sabotage, theft (of intellectual property), fraud, and espionage. In sabotage, the malicious insider uses information technology to direct particular harm at an organization or individual. Such malicious insiders are mainly disgruntled employees with technical knowledge and authorized access. An example of this type of profiling is the installation of a logic bomb, which can be activated after the termination of the employee [1]. Theft of intellectual property is the case where the malicious insider steals intellectual property that they access during daily work and takes the data outside of the organization (e.g., using intellectual property for personal business by sending it to a new employer or transferring it to a competitor organization). This act is frequently carried out by technical (e.g., developers or engineers) or nontechnical (e.g., salesmen or clerks) employees [1]. Fraud refers to the use of authorized access to misuse the organization's financial resources. In other words, fraud is a means of stealing money from the organization [25]. Lastly, espionage refers to the systematic and targeted extraction of corporate information by a malicious insider, which gives the malicious insider strategic economic, military, or public relations benefits [26,27]. Figure 6 indicates that the majority of analyzed cases fall under sabotage and fraud, where disgruntled employees harm the organization due to vengeful motivations after being terminated or intend to gain financial benefits using their authorized access.

Affected Security Property
According to the type of access used by an insider, stated in Section 3.1.1, each insider has legitimate access to the organization's assets. This legitimate access can be physical access, network access, or both (e.g., people who work on an information system at the office). These different types of access can give rise to threats that are carried out either intentionally by a traitor or unintentionally by a careless employee. Making this distinction is essential, because not all insider threats are posed with the intention of causing harm to the company. Both intentional and unintentional insider threats can be carried out by the misuse of authorized actions on data or by unauthorized activity. The threats can result in the disclosure (threat to information confidentiality), modification (threat to information integrity), or destruction and interruption (threats to information availability) of information [27].

Level of Insider
Based on the access level privileges, malicious insiders have been divided in [22,28,29] into four categories: namely, pure insider, insider affiliate, insider associate, and outside affiliate. The first category pertains to users with authorized access and badges or keys to the organization's data centers. The user has access to all information about the logical or physical structures of high-sensitivity data and access rights to such data. Compared with pure insiders, insider affiliates lack a reason or permission to access the company's resources. Insider affiliates may be friends, relatives, or clients of the company. In certain cases, employees' relatives or friends may visit the workplaces and access resources using the employees' credentials. An insider associate is not employed at the company but may have physical access to the company instead of network access. The insider associate may be a business partner, cleaner, contractor, or security guard. Outside affiliates are not a part of the organization and do not have legitimate access to organizational resources. However, they may attempt to access the resources through unprotected networks. An outside affiliate may illegally access the network to obtain access credentials from the organization [22,28,29].
Based on the studied cases, Figure 7 shows that the majority of insiders fall under the category of outside affiliates, who are mostly ex-employees who have been terminated and pose a unique risk because of their knowledge of the organization and their vengeful motivations.
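The actor-side dimensions described in this section (access, insider type, motivation, profiling, affected security property, and level) can be sketched as a small record type, which is one way to tabulate analyzed cases into shares such as those reported in the figures. This is only an illustrative sketch: the class and function names are hypothetical, and the three example cases are invented for demonstration and are not taken from the cases analyzed in this review.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Access(Enum):
    PHYSICAL = "physical"
    NETWORK = "network"
    BOTH = "both"

class InsiderType(Enum):
    TRAITOR = "traitor"
    MASQUERADER = "masquerader"
    UNINTENTIONAL = "unintentional"

class Motivation(Enum):
    FINANCIAL = "financial"
    POLITICAL = "political"
    PERSONAL = "personal"

class Profile(Enum):
    SABOTAGE = "sabotage"
    THEFT_OF_IP = "theft of intellectual property"
    FRAUD = "fraud"
    ESPIONAGE = "espionage"

class SecurityProperty(Enum):
    CONFIDENTIALITY = "confidentiality"  # disclosure
    INTEGRITY = "integrity"              # modification
    AVAILABILITY = "availability"        # destruction or interruption

class Level(Enum):
    PURE_INSIDER = "pure insider"
    INSIDER_AFFILIATE = "insider affiliate"
    INSIDER_ASSOCIATE = "insider associate"
    OUTSIDE_AFFILIATE = "outside affiliate"

@dataclass(frozen=True)
class InsiderCase:
    access: Access
    insider_type: InsiderType
    motivation: Optional[Motivation]  # None for unintentional insiders
    profile: Profile
    affected_property: SecurityProperty
    level: Level

def share_by(cases, attribute):
    """Return the fraction of cases for each value of one taxonomy dimension."""
    counts = Counter(getattr(case, attribute) for case in cases)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# Hypothetical cases, for illustration only.
cases = [
    InsiderCase(Access.NETWORK, InsiderType.TRAITOR, Motivation.FINANCIAL,
                Profile.FRAUD, SecurityProperty.INTEGRITY, Level.PURE_INSIDER),
    InsiderCase(Access.NETWORK, InsiderType.TRAITOR, Motivation.PERSONAL,
                Profile.SABOTAGE, SecurityProperty.AVAILABILITY, Level.OUTSIDE_AFFILIATE),
    InsiderCase(Access.PHYSICAL, InsiderType.TRAITOR, Motivation.FINANCIAL,
                Profile.THEFT_OF_IP, SecurityProperty.CONFIDENTIALITY, Level.INSIDER_ASSOCIATE),
]
shares = share_by(cases, "access")
```

Calling `share_by(cases, "level")` or `share_by(cases, "profile")` gives the corresponding breakdown for the other dimensions.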

Insider Threat Detection
This section reviews and analyzes the relevant studies on insider threat detection. Many aspects were investigated to establish a broad understanding of various studies examined through a survey of the existing literature. Such studies have been previously discussed in detail, and the types of analyzed behaviors and techniques used to model such behaviors have been categorized. The subsequent sections discuss the datasets that have been used in the literature and provide a summary of techniques, detection methodology, and evaluation metrics.
and [41] used an ensemble-based stream mining algorithm based on supervised learning and graph-based anomaly detection, while Parveen et al. [43] used the same method but with unsupervised learning, and Pitropakis et al. [42] proposed a solution that used the computational power of the GPU card to effectively monitor the health of virtual machines (VMs), detecting both the presence of malicious insiders and attacks against the infrastructure. Song et al. [44] and E. Yuan et al. [45] proposed a model by studying a user's system-level behavior using the Fisher feature selection and Gaussian mixture model (GMM) and a data mining model, respectively. Fifth, Nasr et al. [46,47] highlighted the use of system and application-level features in the SCADA system, which is modeled by statistical and machine-learning techniques. The studies [48][49][50] utilized software and resource architecture: Lamba et al. [48] used the model-based sequence, clustering algorithm, and the Bayesian information criterion; S. Young and Dahnert [49] used the Bayesian belief network to propose a DevEyes framework that can identify potential user actions; and Clark et al. [50] focused on the identification, characterization, and modeling of unintended USB channels.
White and Panda [51] proposed a criticality score based on the content sensitivity of the data item using SVM. The naive Bayes network is used to model the user's behavior information by the client, including the called process and its corresponding threads when the user is working normally [52], and a decision model named RevMatch is proposed in [53]. RevMatch makes decisions based on the history of labeled malware detections. Lastly, to identify malicious patterns in the system, Nkosi et al. [54] used a sequential pattern-mining technique.

Cyber Activity Behavior
Many articles focused on the use of cyber activity behaviors, login events, or the combination of login events with other features [55][56][57][58][59][60] using many different techniques. Nikolai and Wang [55] proposed a solution for data theft in Infrastructure-as-a-Service (IaaS) clouds, in which the k-nearest neighbor (KNN) algorithm is used to detect data theft. Their monitoring system analyzed the patterns of network messages used to transfer data. Punithavathani et al. [57] used a similar KNN classifier technique together with the Dempster-Shafer theory. Roberts et al. [56] provided a detection mechanism to counter insider threats in critical networks. W. Liu et al. [58] used Bayesian networks (BNs) and a novel modeling approach for insider threat detection. Their research methodology provides an approach for the construction, assessment, and optimization of a model of the insider user's normal behavior. The framework was derived from the dynamic Bayesian perspective. Goldberg et al. [59] introduced an anomaly detection system, PROactive Detection of Insider threats with Graph Analysis and Learning (PRODIGAL), to support human analysts by combining multiple machine-learning-based anomaly detection techniques. Rajamanickam et al. [60] discussed the password disclosure case and how cryptography, especially elliptic curve cryptography (ECC), works; because of its smaller keys, ECC plays a dominant role in providing secure communication. ECC is appropriate for encrypting users' passwords and sharing them when users need to communicate with Internet applications.
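As a rough illustration of the KNN-based detection mentioned above, the following minimal sketch classifies a transfer pattern by majority vote among its nearest labeled neighbors. It is not the implementation of [55] or [57]: the two features (a normalized transfer volume and the fraction of traffic sent to external hosts), the labels, and all data values are assumptions made up for this example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, already-normalized features:
# (transfer volume, fraction of traffic sent to external hosts)
train = [
    ((0.10, 0.05), "normal"),
    ((0.15, 0.10), "normal"),
    ((0.20, 0.08), "normal"),
    ((0.12, 0.12), "normal"),
    ((0.85, 0.90), "theft"),
    ((0.90, 0.80), "theft"),
    ((0.78, 0.95), "theft"),
]

suspicious = knn_predict(train, (0.80, 0.85))
routine = knn_predict(train, (0.14, 0.07))
```

Because KNN relies on raw distances, features measured on very different scales would need to be normalized first, which is why the example assumes values already scaled to the range 0 to 1.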
Another widely analyzed feature in cyber activities is the network packet or network traffic using different methods, such as machine-learning techniques, to model the behavior of insiders [61][62][63][64][65][66][67]. Encryption and pattern-matching techniques [68][69][70][71][72] and network-attached systems, such as HoneyBot and deception (i.e., a decoy network interface controller) [73][74][75][76] have also been used. Traffic monitoring tools [77,78], a behavioral analysis based on a zero-knowledge anomaly called XABA (semi-comprehensive solution) [79], and novel algorithms are beneficial for detecting insider attacks on wireless sensor networks [80]. The industrial sector used ISA100.11a, a smart response mechanism based on wireless sensor networks [81]. Gossip protocols have been used to introduce the overlay architecture of networking designed to facilitate information sharing on mobile devices [82]. Dynamic access control [83] is a mechanism that is heavily dependent on malicious node assumption to design advanced collision attacks called random poisoning [84]. Other studies used a mathematical model [85], unknown input observers [86], Dynamic Host Configuration Protocol (DHCP) starvation attack and Transmission Control Protocol (TCP) [87][88][89], simple statistical measures [90], and a Bayesian network model to predict insider threats [91].
The features of the user database and file access patterns have been studied by Hu et al. [110] using statistical methods, a community anomaly detection system applying a relational framework [111], One-Class Support Vector Machine (OCSVM) to analyze features [112], rule mining [113], and cryptographic techniques and watermarking [114]. A probabilistic mechanism was used to re-encrypt files [115], the scoring function [116], the naive Bayes algorithm, and vector space model (VSM) [117]. Consensus clustering was employed to create multi-view anomaly detection methods [118], a random topic access model [119], community evolution discovery [120], orthogonal defense mechanism [121], incremental algorithms [122], and an unsupervised approach [123]. Algorithms and techniques on structured query language (SQL) queries [124][125][126][127] were used, and a malicious information flow was detected through bridge data items [128]. This approach enables the identification of deviations by reconciling the process and data perspectives [129]. A user profiling approach was used to detect suspicious transactions using a two-stage database intrusion detection system [130]. Moreover, methods for bypassing data loss prevention systems were used over trusted applications [131], and a solution for fine-grained histogram-based profiles of database usage was created [132]. Business process mining [133], neural dependency and inference graph (NDIG) [134], knowledge graph and dependency graph components [135], the triangle authentication process [136], and a provenance graph using privacy, lineage, uncertainty, and security (PLUS) [137] were also utilized.
Legg et al. [138] analyzed and explored device usage features using PCA, and Aditham et al. [139] used a semi-supervised approach to investigate the multiple features of memory access. For insider threat detection, Crawford and Peterson [140], Meng et al. [141], and Chiu et al. [142] used a methodology that is dependent on scanning the memory of running virtual machines, a Bayesian inference-based trust mechanism, and a frequent pattern outlier factor, respectively. The works [143][144][145][146][147] highlighted correlation coefficient methods and kernel density estimation (KDE) to determine CPU usage, a medium access layer (MAC)-based solution, design science research to detect USB usage, a fuzzy multi-criteria aggregation method, and the hidden Markov model (HMM) and Baum-Welch algorithm to model resource misuse, respectively. Jaenisch and Handley [148] analyzed email and text features using the random forest algorithm, which identifies the various behaviors of suspicious users or their abnormal derivatives. Canbay et al. [149] applied the term frequency-inverse document frequency (TF-IDF) numerical statistic to sensitive documents to extract sensitive words, whereas Garfinkel et al. [150] used latent semantic indexing to construct a model of document topics based on Google's rapid response framework in monitoring disk forensic content (or other media), such as email addresses and credit card numbers.
Feng et al. [151] analyzed upload, download, and web-browsing features using a novel two-stage machine-learning system, whereas Zhang et al. [152] proposed two generic reputation-establishment algorithms. Another study by Myers et al. [153] suggested a cooperating server distributed system that correlates log information and triggers rule-based responses. In addition, Sohal et al. [154] proposed a cybersecurity framework using the Markov model, virtual honeypots, and intrusion detection systems to detect malicious edge devices in fog computing environments. Nathezhtha and Yaidehi [155] proposed an improvised long short-term memory (LSTM) model to learn users' behaviors. The model automatically trains itself and stores behavioral data to classify user behaviors as normal or abnormal. Sharghi and Sartipi [156] explored file-sharing and access policy features using a new behavior pattern language, that is, a constraint-based pattern-matching engine, whereas Agrafiotis et al. [157] used tripwire grammar. Bao et al. [158] proposed a behavior rule-based methodology to monitor devices in a smart grid to detect insider threats. Using organizational structures, Kammüller and Probst [159] built vectors for insider attacks to identify the sequences of actions that lead to violations of security policies, whereas Dasgupta et al. [160] used a multi-token permission strategy.
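To make the TF-IDF statistic mentioned above concrete, the following minimal sketch scores the terms of each document and ranks them. It is not the pipeline of [149]: the whitespace tokenizer, the unsmoothed formula (term frequency as count divided by document length, inverse document frequency as the logarithm of the number of documents over the number of documents containing the term), and the three example documents are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: score} dict per document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Count each term once per document to get document frequencies.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

def top_terms(score, n=2):
    """Return the n highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]

# Hypothetical documents, for illustration only.
docs = [
    "the quarterly report lists salary and salary bands",
    "the meeting report covers the office move",
    "the lunch menu and the office party",
]
scores = tfidf(docs)
most_distinctive = top_terms(scores[0], n=1)
```

In this example the word "the" occurs in every document and therefore receives a score of zero, while "salary" is both frequent in the first document and absent from the others, so it ranks highest there.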

Psychosocial Behaviors
Studies on psychosocial behavior analyses are few. Brdiczka et al. [161] combined psychological profiling and structural anomaly detection to build an architecture for the detection of insider threats using social networks, messages, and Internet visits. Alternatively, Suh and Yim [162] discussed the use of the power spectrum analysis of electroencephalogram (EEG) data to identify insiders using brain wave features. Similarly, Almehmadi and El-Khatib [163] and Almehmadi [164] proposed the use of intent-based access control that uses brain signals as intention access control. Lee et al. [165] proposed a real-time internal information leakage detection system based on the recognition of emotions, such as tension, agitation, and anxiety. A similar work by Taylor et al. [166] examined how self-focus, negativity, and cognitive processing can be assessed based on a linguistic inquiry and word count (LIWC) analysis.
In addition, Maasberg et al. [167] discussed a theoretical model of insider threats based on the following components: motive, opportunity, and capability. Safa et al. [168] modeled planned behavior and the dark triad personality trait theory and used motivation and opportunity modeled as presented by the social bond theory and situational crime prevention theory. Finally, studies [169][170][171][172] discussed social media and online behavior through Twitter and Facebook user comments and status updates, whereas Berk et al. [173] used opportunity and action theories.

Physical Behaviors
Marrone et al. [174] used door access and traffic server features and combined two unified modeling language models. The first addresses the physical protection of a system, whereas the second focuses on cyber protection. Another study by Zou et al. [175] used the door and sensor data features to explore the use of the failure mode and effect analysis method. Mavroeidis et al. [176] presented an ontological framework to improve physical security and insider threat detection using door access. Lastly, W. Meng et al. [177] used Euclidean distance to judge a node's reputation and combed multisource logs, such as emails, websites, and camera usage.

Other Behaviors
This section discusses the use of other behavior features or combined behaviors from previously discussed features. Durán [178] applied the reactor risk method, which includes a human reliability analysis and object-based event sequence trees developed using a probabilistic analysis approach. It integrates Material Control with Accounting (MC&A) protection and operational activities in a vulnerability assessment (VA) analysis. Kim et al. [179] discussed the modeling of game theoretics and an analysis of physical protection by incorporating insider threat implications to address the issues of interactions and intentionality. Dietzel et al. [180] suggested an approach that uses the resilient aggregation technique to leverage current communication redundancy. The data consistency technique is then applied to identify false aggregate information and filtration. Combining biometric and cyber behaviors, Fridman et al. [181] introduced a decision-level fusion technique that fused four modalities based on stylometry (text analysis), usage patterns of application, and behaviors in web browsing. Additionally, Santos et al. [182] analyzed a combination of features, such as nonverbal behavior, biometric information, and daily activities, using machine-learning techniques. Tabash and Happa [183] combined computer emergency response team (CERT) logs with the knowledge of security experts in their system by having the analysts classify detected anomalies. Soh et al. [184] used email, personality traits, and implicit motives to profile employees using a gated recurrent unit and Skip-gram. Nithiyanandam et al. [185] proposed a layered defense based on data monitoring, activity monitoring, user authentication, resource monitoring, and an overarching defense manager. They have analyzed multiple features, such as access to use particular data, keystrokes of a particular user, printer, scanner, USB, and transfer data.

Techniques and Methods Used
Many techniques have been used to model insider threat behaviors. Table 1 discusses several techniques used in detecting insider threats and highlights their general strengths and weaknesses. Most of the reviewed articles in this field treat insider threat detection as a classification problem. Thus, scholars aim to improve insider threat detection performance by developing detection systems based on existing machine learning, statistical classification, and clustering techniques. However, despite valuable efforts in proposing detection methods, the performance of such methods remains a challenge, and the feature engineering required by a number of these techniques tends to be costly and time-consuming. New research requires sophisticated methods for an in-depth understanding of insider activities, because the nature of the insider is entirely different from that of the outsider.

Bayesian Algorithms
Strengths: Good for calculating the probability of mutually exclusive events with respect to any other event within a given sample set [92]. A Bayesian network (BN) can abstract from specific details that satisfy desirable characteristics in modeling insider threat detection systems and predict their performance in enterprises in terms of simplicity, privacy, and portability [56].
Weaknesses: Most of the detection models built on the basis of mathematical methods, such as Bayesian networks and principal component analysis (PCA), require extensive experience and in-depth knowledge for the models' development, training, and refinement. This expertise is neither cost-effective nor available for acquisition [186]. Experts may disagree on the probability of a specific event or on the direction of causality between two events. For example, certain experts might deem that A's behavior is normal, whereas the opposite is true for others [58].

Support Vector Machine (SVM)
Strengths: One of the main aspects of SVMs that makes them attractive for cybersecurity is that the latencies of classification are very low, that is, in the range of microseconds on modern computers. The performance of the classification is made faster by classifier training [61]. Another attractive property of SVMs is that they are based on a convex optimization formulation with a single minimum. In addition, SVMs provide a clear geometric interpretation of the classification boundaries and support vectors [112]. SVMs can provide a better classification result with less training data [52,63].
Weaknesses: Although k-means clustering and SVM classifiers offer the best balance between quality and efficiency, they are not user-friendly and are difficult for a human operator to understand [61]. In certain cases, parameterization can be tricky. Training can be more time-consuming for SVMs compared with other methods. SVM results are difficult to communicate.

One-Class Support Vector Machine (OCSVM)
Strengths: OCSVM can address the issue of the rare class by building a model that considers nonthreat or normal data only [41,42]. It focuses on each action's semantic content, whereas the KNN method focuses on each action type. Therefore, OCSVM is selected because data are unbalanced, and which action is normal or malicious remains unclear [13].
Weaknesses: The OCSVM approach is applicable to static data streams with bounded lengths only. By contrast, the data that relate to insider threats are typically continuous, and the pattern of the threats evolves over time. In other words, the data involve unbounded-length streams [41,42].

Decision Tree (DT)
Strengths: A DT is easy and intuitive for human operators to interpret [61]. It is easy to communicate and maintain. Only a few simple and relatively intuitive parameters are required. It can perform fast predictions.
Weaknesses: A DT consumes a large amount of memory (a deep and large DT is required with additional features). A DT, by nature, tends to overfit (i.e., it generates high-variance models, and the branches should be pruned to avoid such models). A DT is incapable of incremental improvement.

TF-IDF
Strengths: TF-IDF identifies sensitive or important words in documents [43,150]. TF-IDF analyzes the importance of intercepted system calls (SCs) collected in the log file of the user [37].
Weaknesses: It might be slow for large vocabularies, because TF-IDF directly computes document similarity in a word-count space. It assumes that different word counts provide independent evidence of similarity. TF-IDF ignores the semantic similarities between words.

Markov Model and Hidden Markov Model (HMM)
Strengths: The Markov model effectively describes the consequent changes of the state [93]. HMMs have been widely used in many areas, such as bioinformatics and computational linguistics, because of their capability to recognize temporal patterns. HMMs are well-suited to capturing sequential behavior and have been successful in biological sequence analysis and pattern recognition in languages. They provide algorithms to learn the parameters from an observed sequence set, as well as to predict the probability of observing a given sequence [96].
Weaknesses: The computational cost of the HMM increases with the number of states.

K-Nearest Neighbors (KNN)
Strengths: Compared with other classifiers, such as neural networks, the KNN classifier can achieve a faster speed with a lower computational burden in the training and classification phases. This quality makes it desirable when deployed in limited-resource platforms, such as the intrusion detection system node [64].
Weaknesses: The KNN method is ineffective in certain aspects of detecting insider threats, because information can be hidden in normal actions via manipulation [182]. Using the KNN method requires advance knowledge of the number of clusters in the data, and many trials may be needed to find the best value of K. Clustering results may differ across runs due to random algorithm initialization.

Principal Component Analysis (PCA)
Strengths: PCA is used for dimensionality reduction and reduces similar cluster behaviors. It is a widely used mechanism for addressing high-dimensional data. PCA is an effective technique for identifying outliers [100,187]. Using PCA, a large feature set can be reduced into multiple anomaly assessment scores [138].
Weaknesses: One of the drawbacks of PCA is that it is often regarded as a black-box approach, where comprehending the link between the resulting PCA space and the original feature space becomes difficult [188]. Similar to the other detection models that are built on the basis of mathematical methods, such as Bayesian networks, PCA requires extensive experience and in-depth knowledge for the model development, training, and refinement. This expertise is neither cost-effective nor available for acquisition [186].

Gaussian Mixture Model (GMM)
Strengths: By implementing the GMM approach, the model can explain why given observations are classified as anomalous [183]. Moreover, the models' parameters and the results of predictions provide analysts with a deep understanding of the decision-making process of the method. GMM is capable of modeling a dataset with a complex probability distribution.
Weaknesses: Computation time is long, and training can fall into a local maximum. One of the serious limitations of GMMs is their statistical inefficiency in modeling data located on or near a nonlinear manifold in the data space.

Long Short-Term Memory (LSTM)
Strengths: LSTM is well-suited to classifying a time series, because it employs an LSTM cell to learn a historical experience [99]. LSTM is suitable for capturing the long-term temporal dependencies in the user's sequence of actions, because the hidden units of the LSTM can potentially record temporal behavior patterns [188].
Weaknesses: It is inefficient when directly used to classify the insider's sequence of actions, because its output only contains a single bit of information for each action sequence [188]. Compared with other techniques, such as gated recurrent units (GRU), LSTM requires a longer computation time due to its structure [189].
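As an illustration of the Markov model entry above, the following minimal sketch fits a first-order Markov chain to a user's past action sequences and scores a new sequence by its average log-likelihood per transition. The action names, the add-one smoothing, the probability floor for unseen actions, and the function names are all assumptions chosen for illustration; they are not taken from any of the reviewed works.

```python
import math
from collections import defaultdict

def fit_transitions(sequences, smoothing=1.0):
    """Estimate first-order transition probabilities with additive smoothing."""
    states = sorted({a for seq in sequences for a in seq})
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1.0
    probs = {}
    for s in states:
        total = sum(counts[s].values()) + smoothing * len(states)
        probs[s] = {t: (counts[s][t] + smoothing) / total for t in states}
    return probs

def avg_log_likelihood(seq, probs, floor=1e-6):
    """Average log-probability per transition; unseen actions get a small floor."""
    steps = list(zip(seq, seq[1:]))
    if not steps:
        return 0.0
    total = sum(math.log(probs.get(p, {}).get(n, floor)) for p, n in steps)
    return total / len(steps)

# Hypothetical training data: typical daily action sequences of one user.
normal = [
    ["logon", "email", "web", "file", "logoff"],
    ["logon", "web", "email", "file", "logoff"],
    ["logon", "email", "file", "web", "logoff"],
]
model = fit_transitions(normal)

typical = ["logon", "email", "web", "file", "logoff"]
unusual = ["logon", "file", "usb", "file", "usb", "logoff"]

# The sequence containing unseen actions receives a much lower score.
print(avg_log_likelihood(typical, model) > avg_log_likelihood(unusual, model))
```

A sequence whose score falls well below the user's usual range could be passed to an analyst; how to choose that cut-off is left open here.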

Datasets
Considering the various datasets, Table 2 shows that the majority of the recent studies have used the Computer Emergency Readiness Team (CERT) dataset, because it contains many scenarios for the traitor and masquerader. We introduce the types of dataset involved as follows:

[190]
Schonlau dataset. Time, user, process, registry, and file access [182]
APEX [44]
RUU (Are You You?) dataset. Number of search and nonsearch actions, user-induced actions, window touches, new processes, running processes, and documents that edit running applications on the system. [191]
The Wolf of SUTD (TWOS).
KEYSTROKE: timestamp, key press/release events, key value (anonymized as a subpart of the keyboard), and username
MOUSE: timestamp, mouse move/click/release events, coordinates of mouse pointer, and username
HOST MONITOR: timestamp, program name, PID, parent program name, parent PID, and SC operation
NETWORK TRAFFIC: HTTP request/response, method (e.g., GET and POST), status code, content length, and content type
EMAIL: timestamp, header, sender, receiver, and LIWC features extracted from email body
LOGON: timestamp, login attempt, login success, logout event, and username

CERT. CERT is a collection of synthetic insider threat datasets generated by the CERT and other partners. The dataset is generated using different scenarios that contain traitor instances and masquerade activities. The collected dataset contains logs of login data, HTTP or browsing history, emails, file access logs, device usage, LDAP data, and psychometric information.
Remote terminal unit (RTU). The RTU dataset is a collection of labeled RTU telemetry streams from a gas pipeline system at the Critical Infrastructure Protection Center at Mississippi State University. The dataset includes benign RTU transactions, data injection attacks, and command injection attacks, which were generated specifically for research on critical infrastructure protection.
NSL-KDD/KDD-99. This intrusion detection evaluation dataset was collected at the Massachusetts Institute of Technology Lincoln Laboratory. The main purpose of the dataset was to improve and evaluate intrusion detection systems. However, this dataset (in particular, the "user-to-root" group of attacks) has also been used in research on insider threat detection to mitigate masquerade attacks using SC logs.
Schonlau dataset. This dataset contains UNIX commands, approximately 15,000 commands per user, generated by 50 users with different roles in the organization. The first 5000 commands of each user contain no masquerader activity and are used for training. The next 10,000 commands are treated as a hundred blocks of a hundred commands each, which are seeded with masquerader data, i.e., with the data of another user outside of the 50 users. Thus, the data used in masquerade sessions does not contain any malicious content.
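The split described for the Schonlau dataset can be sketched as follows. The command stream here is a synthetic placeholder, and the function name and parameters are assumptions for illustration; only the sizes (5000 training commands, then blocks of 100) come from the description above.

```python
def split_schonlau(commands, train_size=5000, block_size=100):
    """Return (training commands, list of fixed-size test blocks)."""
    train = commands[:train_size]
    rest = commands[train_size:]
    blocks = [rest[i:i + block_size] for i in range(0, len(rest), block_size)]
    return train, blocks

# Placeholder stream standing in for one user's 15,000 recorded commands.
commands = [f"cmd{i % 7}" for i in range(15000)]
train, blocks = split_schonlau(commands)
print(len(train), len(blocks), len(blocks[0]))  # 5000 100 100
```

Each test block would then be scored against a model trained on the first 5000 commands of the same user.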
APEX '07. These datasets were introduced by the National Institute of Standards and Technology to simulate the analysts' tasks in the intelligence community. The APEX dataset contains the activities of 13 analysts: 8 benign analysts and 5 malicious insider analysts.
RUU. This dataset for masqueraders was collected by Malek Ben Salem [192,193]. The data were collected from the PCs of 34 regular users and consist of host-based activities, such as Windows registry access, file system access, processes, system GUI activity, and dynamic library loadings.
TWOS. The TWOS dataset has been collected from real user interactions with a host machine that contains legitimate user data and malicious insider instances (i.e., masqueraders and traitors).
The dataset was collected during the competition organized by the Singapore University of Technology and Design in March 2017 and comprises data collected from six data sources (i.e., keystrokes, mouse, host monitor, network traffic, SMTP logs, and logon), with additional data from a psychological personality questionnaire [191].

Evaluation Metrics for Insider Threat Detection
Many classification metrics are used to evaluate the insider threat detection systems, and a few are known by multiple names. Table 3 discusses the most common metrics and the works where these metrics are used.

Time complexity
It pertains to the time required to complete the classification task of such a classifier.

TP: true positive, FN: false negative, TN: true negative, and FP: false positive. These terms have meanings according to the type of processes defined in Table 4.

TP
Number of malicious insiders correctly classified.

FN
Number of malicious insiders incorrectly classified.

FP
Number of normal insiders incorrectly classified.

TN
Number of normal insiders correctly classified.
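Given the four counts defined above, the commonly reported metrics can be computed as in the following minimal sketch. The formulas are the standard definitions of accuracy, precision, recall, false positive rate, and F1; the function name and the example counts are hypothetical and do not come from any reviewed study.

```python
def detection_metrics(tp, fn, fp, tn):
    """Standard classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also called TPR or detection rate
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        "f1": f1,
    }

# Hypothetical counts for an imbalanced test set of 1000 users.
m = detection_metrics(tp=8, fn=2, fp=40, tn=950)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, accuracy is high while precision is low, which is why accuracy alone can be misleading on imbalanced insider threat data.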

Detection Methodology
Based on the reviewed studies, two main detection methodologies are used. The first is anomaly detection, where the system builds a baseline profile of normal system, network, or program activities. Any deviation from the learned baseline is labeled as a malicious insider. The second is signature-based detection, which identifies a previously known malicious insider when the observed activities match a stored signature or a rule that models the behaviors on the system. Figure 8 shows that the majority of existing solutions are based on anomaly-based detection or address insider threats as a classification issue. The preference for anomaly-based methodologies is due to the class imbalance in insider threat datasets, where the majority of the data are composed of the daily activities of regular users, as well as to concerns about zero-day malicious attacks.
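The anomaly detection methodology can be illustrated with a deliberately simple sketch: a per-user baseline is summarized by the mean and standard deviation of one activity feature, and a new observation is flagged when it deviates by more than a chosen number of standard deviations. The feature (files copied to removable media per day), the threshold of 3, and the function names are assumptions for illustration; the reviewed systems use far richer models.

```python
from statistics import mean, stdev

def build_baseline(history):
    """Summarize normal activity as (mean, sample standard deviation)."""
    return mean(history), stdev(history)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical history: files copied to removable media per day by one user.
history = [3, 5, 4, 6, 5, 4, 5, 3, 4, 6]
baseline = build_baseline(history)
print(is_anomalous(5, baseline), is_anomalous(40, baseline))  # False True
```

Nothing in this sketch requires examples of malicious activity, which is the property that makes the methodology attractive under class imbalance.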

Discussion
This survey article aims to study insider threats and discuss the current detection methods and techniques in the field of insider threat. The main purpose of the study is to focus on the literature using an analysis review. The included papers were reviewed and discussed based on the analyzed behavior, techniques, and methods used, as well as datasets, evaluation metrics, challenges, and recommendations.
The previous sections indicate that the majority of previous studies have highlighted the monitoring and analysis of user activities to detect insider threats using cyber behaviors. However, other behaviors, such as physical behaviors, are attracting less attention. Despite the need for "feature engineering," which is difficult and time-consuming, machine-learning techniques are widely used for the development of insider detection methods. In terms of datasets, the majority of studies used their own experiments and simulations due to the lack of real-world data for insider threats. Recently, however, synthetic datasets, such as CERT, have been used to evaluate insider threat detection systems. Appendix A sheds light on studies carried out concerning detection methodologies, methods, and datasets used; behavior features; and their results.

[Figure 8: anomaly-based or classification, 78%; signature-based or rule-based, 20%; other, 2%.]

Challenges
This part discusses the current challenges in the detection of insider threats. These challenges are grouped into eleven categories, which are shown in Figure 9 and discussed as follows.

Performance
Because the attacker is a legitimate user of the system, drawing a clear line between what is legitimate and what is malicious is difficult [112]. Most of the existing approaches to insider threat detection apply anomaly detection, using supervised or unsupervised methods that treat small deviations from normal activity patterns as abnormalities and, thus, classify them as malicious. However, most of these abnormalities are nonmalicious activities, and these methods tend to raise unnecessary false alarms in handling such cases [36]. Thus, traditional approaches suffer from the well-known issue of false positives, which makes them difficult to apply in enterprise environments [37]. In other words, reducing false positive and false negative alarms in insider threat detection without affecting the detection accuracy remains a major challenge.
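The false alarm problem can be made concrete with a short calculation. The sketch below assumes a hypothetical organization and hypothetical detector rates (none of these numbers come from the reviewed studies) and shows how a seemingly small false positive rate still produces far more false alerts than true ones when malicious insiders are rare.

```python
def alert_breakdown(n_users, n_malicious, tpr, fpr):
    """Expected true alerts, false alerts, and the fraction of alerts that are real."""
    true_alerts = tpr * n_malicious
    false_alerts = fpr * (n_users - n_malicious)
    precision = true_alerts / (true_alerts + false_alerts)
    return true_alerts, false_alerts, precision

# Hypothetical setting: 10,000 users, 10 malicious insiders,
# a detector with 90% recall and a 5% false positive rate.
true_alerts, false_alerts, precision = alert_breakdown(10_000, 10, 0.90, 0.05)
print(f"true alerts: {true_alerts:.1f}")
print(f"false alerts: {false_alerts:.1f}")
print(f"share of alerts that are real: {precision:.1%}")
```

Under these assumptions, analysts would have to review hundreds of benign users to find a handful of real cases, which matches the enterprise adoption difficulty described above.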

Lack of Real Data
In spite of advanced research on insider threats, challenges in validating and refining the detection models remain due to the absence of real-world data from organizations [13,139,195]. The lack of actual insider threat data is also a major challenge in assessing and developing insider threat detection systems. Moreover, the present review observes that synthetically created datasets used in the surveyed articles were not created specifically for insider threats. Furthermore, a few of these datasets did not contain malicious data, whereas others were outdated [1].

Ethical and Privacy Issues
Despite the increase in the number of insider threat incidents, not all organizations report such incidents nor allow access to their data, typically due to ethical and privacy concerns. The issue of real data access is crucial for insider detection, which continues to be a significant obstacle for validating and refining effective and scalable detection systems. As a result, most existing detection systems are tested and evaluated on synthetic and simulated datasets, with the biases that such data imply [57,100,191].

Analysis Issues on Encrypted Flows or Encrypted Data Packets
To avoid detection by tools, such as intrusion detection systems, attackers may use cryptography to mask their attacks. Such a scenario renders detection systems unable to analyze encrypted flows or encrypted data packets, which is another main limitation of the current intrusion detection systems [70].

Sheer Volume
The ability to capture activity logs is an advantage that may provide insight into employee actions. Despite this advantage, the analysis of activity logs continues to be difficult for analysts because of the sheer volume of activities that employees produce every day [95,139,162]. A large number of organizational staff means that many staff behavior properties must be monitored, which results in a massive amount of data to be processed [52]. The growth of this data outpaces the ability of human auditors and administrators to digest such data quantities using manual analyses [95].

Unbounded Length and Threat Patterns
Features can be found in the time or frequency domain, according to the temporal phenomenon of insider threat detection. Any sudden changes in behavior should be monitored to identify specific problems, and any action can be a sign of a malicious attack by the insider. The complexity and unpredictability of malicious actions make a careful analysis of the system, network, and user parameters correlated with insider threats difficult. Isolating suspicious users therefore becomes a heterogeneous, high-dimensional data analysis problem [41,106,107,195]. Unsupervised learning is one of the approaches applied to solve this problem, but such approaches are limited to static, finite-length data, which limits their application against insider threats, because insider activity data tend to have unbounded length and threat patterns that evolve over time [43].
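One way to see what handling an unbounded stream requires is a baseline that is updated one event at a time in constant memory. The sketch below uses Welford's online algorithm for the running mean and variance; it is an illustration of the streaming setting only, not a method from the reviewed works, and it does not by itself address evolving threat patterns (which would need, for example, some form of forgetting). The warm-up length, threshold, and data are hypothetical.

```python
class RunningBaseline:
    """Welford's online algorithm: mean and variance updated per observation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0

    def score(self, x):
        s = self.std()
        return abs(x - self.mean) / s if s > 0 else 0.0

# Hypothetical stream of a daily activity count; the last value is unusual.
stream = [3, 5, 4, 6, 5, 4, 5, 3, 4, 6, 40]
baseline = RunningBaseline()
flagged = []
for x in stream:
    # Score against the baseline seen so far, after a short warm-up period.
    if baseline.n >= 5 and baseline.score(x) > 3.0:
        flagged.append(x)
    baseline.update(x)
print(flagged)  # [40]
```

The stream is never stored, so the same loop works regardless of how long the stream runs.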

Physical and Cyber Behavior
Another limitation of the current insider threat detection approaches is that they concentrate on either cyber or physical security behaviors, but not both [87,175,177]. Most of the previous works did not use the behaviors of both the cyber and physical systems in analyzing insider threat detections. The majority of scholars aim to detect insiders by observing behaviors from either the cybersecurity or the physical security aspect [86]. In terms of detecting physical threats, most of the existing studies applied physical access control mechanisms that may control the physical access of unauthorized users up to a certain point. However, such mechanisms are ineffective against insider attacks.

Analysis interval
Several insider threat detection systems were unable to provide real-time responses, which raises the need for additional research efforts [7]. Offline detection tools have the drawback that they cannot analyze logs and respond in a timely manner. Most of the current systems thus continue to lack real-time tools, which prevents further action to curb the problem [92]. Large amounts of audit data are collected from organizational environments in the form of server logs, which can potentially play a role in access decisions. However, audit data are often used only for offline forensics, which leads to "later is too late" circumstances [61].

Costly and Time-Consuming
One of the detection approaches for insider threats is supervised learning, which uses labeled training data to build a classification model. Most of the introduced detection methods are based on supervised learning. Thus, the need remains for contextual data entries about users and for a training process that is specific to insider threat detection. Despite its capabilities, this approach tends to be costly and time-consuming [39,42,80,103].

The policy
In general, insiders have knowledge of policies and put that knowledge into practice. Typically, policies are related to the access rights granted to insiders, and malicious insiders essentially aim to circumvent these regulations [159]. The development of access control policies is centered around trust regarding the access rights of legitimate users, such as reading, writing, and execution, based on the task and position hierarchy of the legitimate users in the organization. In this case, an insider with malicious intentions can have the power to destroy or steal information [43]. Such access rights are increasingly misused by oblivious, hostile, rogue, and pseudomalicious insiders [160]. Therefore, the lack of access control systems in insider threat detection systems leaves enterprises frequently vulnerable to such threats [163].

Limitations of Static Access Control Policy
The existing access control techniques are designed on the basis of static policies that tie cryptocredentials to attributes used by the rules of access control. Dynamic events, such as behavioral changes of the actor (e.g., a user performs illegitimate activity within their privilege rights), subversion of credentials (e.g., theft of common access cards), and changes in the structures of the document (e.g., editing Wiki pages), are not detectable, which leaves systems vulnerable for a long period [62,122].

Limitations of Access Control Point Location
Access control rules used as countermeasures against insider threats are more complex than those used against malicious outsiders. Furthermore, where access control should be installed in a network remains an open challenge; the question of a suitable location for an access control point to control insider threats is as yet unanswered [83].

Complexity of insider threat detection
Detecting insider threats is becoming a highly complex and difficult task for the following reasons. First, insiders with trusted access can perform unauthorized activities. Thus, external network security tools, such as firewalls, IDS, and antivirus software, cannot detect a malicious insider [93,147,184,189]. Second, insider attacks manifest in many forms. For example, a malicious insider may plant a logic bomb to disrupt the systems or steal intellectual property. The diversity of insider attacks increases the complexity of detecting insider threats. Finally, insider attacks are frequently conducted by malicious insiders during daily working hours, which drowns the anomalous behavior of malicious insiders in the mass of normal employee behaviors [95,139,184,189].

Collusion attack detection
Most of the existing solutions for insider threats are focused on detecting individuals. Nevertheless, collaborative attacks can be launched by two or more insiders, and these are difficult to detect. The difficulty with these types of attacks is that the activity of each insider may look benign, but, when combined with the activities of the others, it may result in a malicious action. Therefore, further effort is required to handle collusion attacks [128].

Lack of attention and Strategy
Most organizations focus more on outsider threats than on malicious insiders [95]. Researchers in the field of cybersecurity have dealt with and identified many different security threats. Scholars emphasize that threats from malicious insiders are more dangerous than external threats; however, this statement has failed to receive sufficient attention. Another challenge is the lack of understanding of the intent and strategy of the malicious insider [45]. Most researchers in security focus on the lower layers of a software system, that is, on mining data at the network, host machine, or source code level. As a result, these solutions mainly focus on certain signatures or threat categories that are naturally tactical. The current study deems that a piece is still missing from the overall picture of the current understanding of the intent and strategy of malicious insiders.

Insider Threats in SCADA
Supervisory control and data acquisition (SCADA) systems constitute sensitive parts of critical infrastructures. Each successful malicious incident could cause huge material, economic, and human damage [196]. Operators play an important role in SCADA systems, and their commands can have high impacts on the reliability of critical infrastructures. Therefore, insider attacks and approaches to deal with the malicious insider are receiving more attention in SCADA security [46]. All SCADA environments are open to insider attacks, even though an insider cyberattack mostly needs more technical skills and knowledge about the targeted system. A direct attack on remote terminal units requires physical access to the communication channels, but, once this access is obtained, all protections at that point of access are mostly bypassed. Programmable logic controller (PLC) equipment is also more vulnerable to remote attack because of the devices' inherent design and their origins on the factory floor [197].

Recommendation
The research recommendations in insider threat detection are grouped into eight categories, as shown in Figure 10.

Performance
For insider threat detection to be effective, improving the recall rate without sacrificing precision is extremely important [93]. Feature selection of security event data is a potential approach to improve performance. Therefore, added efforts are required for the development of feature selection technologies and tool defenses against insider threats [111]. One example is the use of complex models, such as the LSTM recurrent neural network [96] and the gated recurrent unit (GRU), so that a rich representation of user behavior can be learned.

Dataset
Previous studies recommended that new insider datasets should be created with much larger data and stream sizes [102]. A long collection period is also essential to confirm whether the models continue to work with daily updates, because normal user behaviors change over time [58]. The new dataset should include more malicious data, because the current datasets contain only a few malicious instances, some of which are becoming outdated. Collusion attacks should also be included in the new insider dataset to enable further improvement and challenging, realistic testing of detection methods [1]. Thus, maintaining updates on insider datasets with normal and malicious activity patterns is necessary so that the proposed solutions for detecting recent insider attacks can be verified and evaluated [111].

Hybrid Solution
In summary, the present study observed that most of the existing solutions are based on anomaly detection and unsupervised approaches due to the class imbalances in datasets and other issues, such as zero-day malicious attacks. However, a good and robust insider threat detection method should use a combination of several independent approaches [1]. As the first line of defense, misuse-based insider detection should be considered to cover the scenarios of known insider threats. As the second line, anomaly-based detection and other best practices, such as prevention and mitigation techniques, should be employed.
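The two lines of defense described above can be sketched as a small pipeline in which an event is first checked against signatures of known misuse and, only if none match, against the user's learned baseline. The signatures, event fields, threshold, and data below are hypothetical and chosen only to illustrate the ordering of the two stages.

```python
from statistics import mean, stdev

# Hypothetical signatures of known misuse: (name, predicate over an event dict).
SIGNATURES = [
    ("after_hours_usb",
     lambda e: e["device"] == "usb" and not 8 <= e["hour"] < 18),
    ("mass_download", lambda e: e["files"] >= 500),
]

def classify(event, history, threshold=3.0):
    """First line: signature match. Second line: deviation from the baseline."""
    for name, matches in SIGNATURES:
        if matches(event):
            return "misuse:" + name
    mu, sigma = mean(history), stdev(history)
    if sigma > 0 and abs(event["files"] - mu) / sigma > threshold:
        return "anomaly"
    return "normal"

# Hypothetical per-day file access counts for one user, and three new events.
history = [3, 5, 4, 6, 5, 4, 5, 3, 4, 6]
events = [
    {"device": "usb", "hour": 23, "files": 4},
    {"device": "none", "hour": 10, "files": 60},
    {"device": "none", "hour": 11, "files": 5},
]
results = [classify(e, history) for e in events]
print(results)  # ['misuse:after_hours_usb', 'anomaly', 'normal']
```

The first event would pass the anomaly check on file count alone, which illustrates why the two approaches are complementary.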

Logs
One of the better techniques for mitigating insider threats is log management, which includes log analysis and event correlation. Log analysis can pinpoint the root cause of an insider attack and protect the network from security violations at the same time [92]. Other data sources that can be combined for the better detection of insider threats include Windows logs, active directory logs, printer logs, and physical security logs [151]. Together with statistical learning algorithms, ensemble learning classification, and network access control (NAC), modern "big data" platforms are able to capture and manage the flow of logs, make the logs accessible to batch processing and stream processing in analytic tools, and offer query interfaces for ad hoc investigation [198]. Mayhew et al. [61] discussed Splunk, a well-known big data processing tool, which eases the cyber defender's task of creating correlations between different pieces of log information using a specialized query language. Therefore, combining and analyzing these types of logs and applying big data analytics tools can be a direction for future studies on insider threats.
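As a rough illustration of event correlation across log sources, the sketch below merges events from several hypothetical sources, groups them per user into sessions separated by gaps of more than 30 minutes, and reports sessions that span more than one source. The event schema, the gap length, and all data are assumptions for illustration and do not reflect the format of any particular log management tool.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def correlate(events, window=timedelta(minutes=30)):
    """Group each user's events from all sources into time-adjacent sessions."""
    by_user = defaultdict(list)
    for e in events:
        by_user[e["user"]].append(e)
    sessions = []
    for user, evs in by_user.items():
        evs.sort(key=lambda e: e["time"])
        current = [evs[0]]
        for e in evs[1:]:
            if e["time"] - current[-1]["time"] <= window:
                current.append(e)
            else:
                sessions.append((user, current))
                current = [e]
        sessions.append((user, current))
    return sessions

# Hypothetical events from three log sources.
events = [
    {"user": "alice", "source": "logon", "time": datetime(2020, 1, 6, 22, 0)},
    {"user": "bob", "source": "logon", "time": datetime(2020, 1, 6, 9, 0)},
    {"user": "alice", "source": "file", "time": datetime(2020, 1, 6, 22, 10)},
    {"user": "alice", "source": "printer", "time": datetime(2020, 1, 6, 22, 20)},
    {"user": "bob", "source": "printer", "time": datetime(2020, 1, 6, 14, 0)},
]
sessions = correlate(events)

# Sessions that combine evidence from at least two different sources.
multi = [(user, sorted({e["source"] for e in evs}))
         for user, evs in sessions
         if len({e["source"] for e in evs}) >= 2]
print(multi)  # [('alice', ['file', 'logon', 'printer'])]
```

A correlated session of this kind gives an analyst the combined context that no single log would show on its own.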

Evaluation and Validation
Currently, no framework or standard exists that addresses the evaluation of insider threat detection systems [7]. Thus, selecting the best detection method is still a challenging decision-making task because of the multiple evaluation criteria, such as accuracy, recall, FPR, and time complexity. Therefore, this study strongly recommends that a generic framework should be developed for the evaluation of insider threat detection systems to assist and guide researchers and practitioners in evaluating their proposed systems. With regard to improving the evaluation standard, collaboration between researchers in academia and practitioners in industry is highly recommended. Such collaborations will assist researchers in seeking feedback from the industry on developed systems to be evaluated, which can lead to improved adoption and adaptation of the respective systems to real-world environments [7,188].

Human Aspect
Most of the existing works focused on the technical aspect of insiders, on the machine, system, and network. However, the nontechnical aspects of the insider problem are critical elements of any insider threat detection system. Therefore, many effective techniques can be used to address the human aspect of the insider problem, such as human communication (i.e., tone of voice, body language, and attitude toward others) [187].

Physical Features
Although cybersecurity researchers appear to be aware of the physical effects of cyber threats, most studies are conducted either on "cyber" or "logical" security. Even so, many cybersecurity threats (particularly on the Internet) originated from physical intrusions, and we still need approaches that model cyber and physical security aspects [174].

Theoretical
Future studies should provide information for all responsible management departments, as well as security professionals, with a deep understanding of insider characteristics, threats posed, potential risks of insider threats, and possible countermeasures. The analysis of the problem in general, including insider taxonomy development, attacks, and countermeasures, points to a particular information security threat with forecasting model developments [12].

Conclusions
Insider threats are among the most challenging security threats and a main concern of organizations of all sizes. Numerous studies have been conducted in this field, and efforts continue to grow, although the boundaries and descriptions of insider threats remain ambiguous. Thus, understanding and gaining insights into insider threat detection is an important research direction. This review aimed to provide an extensive view and deep understanding of the field of insider threats by surveying and categorizing the existing literature. Along with the deep investigation of the existing literature and the analysis of real cases, the two distinct classes (namely, insider and insider threat detection) have been elucidated. Significant information was obtained through the intensive review and analysis of the final set of the reviewed articles, such as the challenges and issues that researchers face in the field of insider threats. In addition, important recommendations related to insider threat detection, as well as datasets and techniques that have been used, were proposed. The various recommendations can provide future researchers with a clear picture of the research direction on insider threat detection. The present review also summarizes the concepts for insider detection as presented in the previous literature, which provides a useful reference for researchers.

Conflicts of Interest:
The authors declare no conflicts of interest.

Appendix A
To understand and categorize the previous studies in the insider threat detection domain, Appendix A provides a comparative analysis based on diverse aspects. Detection methodologies are divided into two groups: the anomaly detection and classification method, where the system builds a baseline profile of the normal network, system, or program activities, and any deviation from this baseline is treated as a malicious insider; and the signature-based detection and rule-based method, which detects a previously known malicious insider threat when the intrusion matches a stored signature. Methods are divided based on the approaches used in each of the studied articles to model the insider threat. Datasets refers to the data used to evaluate the proposed solution, as well as the environment used to build the insider threat detection, and is divided into synthetic dataset (SY), real data (RD), case study (CS), simulation (SE), and own environment (OE). For Features, the reviewed articles are divided based on the analyzed behavioral features into four groups: biometric behaviors, cyber features, psychosocial behaviors, and other behaviors. Finally, Results reports what was achieved by each studied article.

Method
Dataset