Research on Behavior-Based Data Leakage Incidents for the Sustainable Growth of an Organization

The number of data leakage security incidents caused by organization insiders continues to increase, yet current security activities cannot predict a data leakage. Because such security incidents are extremely harmful and difficult to detect, predicting them would be the most effective preventative method. However, current insider security controls and systems, which detect and identify unusual behaviors to prevent security incidents, produce many false positives. To solve these problems, the present study collects the signs of data leakage by insiders in advance, analyzes the signs that can predict security incidents, and evaluates risk based on behavior. To this end, data leakage behaviors by insiders are identified through an analysis of previous studies and in-depth interviews. The identified behaviors are statistically verified to determine their validity and to derive the level of leakage risk for each behavior. In addition, by applying the N-gram analysis method to derive data leakage scenarios, the levels of risk are clarified in order to reduce false positives and over-detection (i.e., the limitations of existing data leakage prevention systems) and to make preemptive security activities possible.


Introduction
Security attacks that threaten the wellbeing of organizations are changing in various ways, including cyber-attacks. While cyber-attacks originating from outside an organization occur mainly to acquire important data, security attacks can also come from inside the organization. According to HfS Research statistics from a survey on data leaks administered to executives in corporate security departments, 69% had experienced insider leaks, and 57% had experienced outsider leaks [1]. For example, in the case of Waymo, Google's autonomous vehicle project, a key employee who had worked on the project left the company, founded a startup, and sold the trade secrets of his previous company to other companies. This exemplifies how mainstream security attacks have shifted from outsiders to insiders, while the countermeasures implemented by organizations have not evolved beyond existing cyber-attack frameworks to adjust to this change. Most organizations rely on installing security systems, such as intrusion detection systems, and on controlling removable storage media, so their security activities concentrate on the organizational boundary. However, there is a limit to detecting and preventing data leakage by insiders because the criteria for determining whether an act in the system is legitimate are ambiguous. Furthermore, data leakage by insiders is characterized by a high magnitude of damage because insiders know which data are important to the organization and can access these data freely [2]. To minimize damage to the organization, it is most effective to prevent security incidents caused by insiders by predicting and managing security risks in advance.
In this study, to secure the sustainable growth of the organization, we collect in advance the signs of data leakage by insiders of the organization, analyze the signs of data leakage so that the accidents can be predicted, and attempt to derive the levels of risk according to those signs.
This study consists of six sections. First, the introduction (above) explains the importance and background of the research. In Section 2, we consider the characteristics of data leakage incidents and the limitations of existing data leakage prevention solutions. In Section 3, we explain the research methodology, which is intended to overcome those limitations, and derive the data leakage behaviors of insider attacks through in-depth interviews and an analysis of earlier studies. In Section 4, we statistically verify the derived data leakage behaviors through a survey of security experts and apply the N-gram method (a text mining analysis method) to derive the data leakage scenarios. Section 5 summarizes the research results, and Section 6 discusses the research contents, contributions, and future work.

Characteristics of Data Leakage Accidents
Most of the documents produced by organizations today are in the form of electronic files; thus, Information and Communication Technology (ICT) is embedded in the organization's infrastructure. In addition, various industries have launched digital transformation strategies to digitize data that previously existed only in writing. According to a Smart Insight report, 76% of organizations have implemented digital transformation [3]. Digital transformation makes it more convenient to use data for purposes such as effective operation management and productivity improvement. Because organizations have focused on using the data generated through digital transformation, their data protection activities have been limited. Moreover, when data in an electronic file format is leaked, it is difficult to identify the leaker due to the anonymity that is possible in cyberspace. Although organizations establish and operate regulations for preventing data leakage, current security policies have limitations in implementing effective security activities. Previous studies [4,5] that analyzed the limitations of existing security policies found a strong emphasis on protecting the boundaries of organizations but an inability to prevent security attacks that occur in converged (fused) form. Further, given the rise of security incidents caused by humans, shortcomings in personnel management illustrate the ineffectiveness of existing security policies.
Leaking data in the form of electronic files does not mean the original data loses its form or content. If an electronic file is captured and leaked, the organization will have difficulty determining if information leakage has occurred because the file remains present despite the primary data in the file being leaked. In particular, there is difficulty in determining whether the actions performed by insiders are for business execution or data leakage. As shown in Figure 1, the EKRAN statistics show that more than 40% of organizations take several years to detect data leakages involving insider threats [6].
Sustainability 2020, 12, 6217
In addition, security incidents have the characteristics of hidden crimes [7] and have a significant impact on corporate image, for example by decreasing the organization's credibility when the public is notified of the incident. Hence, most corporations do not count the number of internal data leakage incidents, so the available statistics are not representative of reality. Due to this nature of internal data leakage incidents, the number of security incidents recorded in statistics is gradually decreasing, while the scale of damage caused by such incidents is steadily increasing. When business-critical data are leaked, the damage is great, and it is even more severe if the leaked information cannot be recovered or repaired. Hence, prevention before the occurrence of an incident is more important than detection at the moment of occurrence.

Security Technologies for Data Leakage Protection
To minimize the occurrence of and damage from information leakage incidents, organizations are developing and introducing various security technologies. These technologies are of many types and are difficult to classify because many provide overlapping key functions. This study nevertheless classifies them into two types according to their security purpose: (1) security technologies aimed at prevention and (2) security technologies aimed at detecting incidents through continuous monitoring. We reconstructed the data leakage prevention framework proposed by Alzhrani [8] according to these criteria, as shown in Figure 2. Security technology for prevention can be subdivided into leakage control technology that grants responsibility to users and access control technology that limits user access. Security incident detection through continuous monitoring includes technology that detects abnormal behaviors and technology that continuously monitors and filters the content generated and transmitted in the business. These security technologies are also capable of managing and detecting the behaviors of the user and each terminal, as well as behaviors through unreliable channels.
As shown in Figure 2, all security technologies classified and identified by their security objectives focus on detecting and identifying the point at which a data leakage occurs, while the ability to predict security incidents before a data leakage occurs is lacking [9]. For this reason, some security technologies have been developed to predict data leakage security incidents. However, these technologies predict security incidents by detecting abnormal behaviors rather than by identifying and detecting leakages. Since it is impossible to determine whether a behavior performed by an organization member is for work or for data leakage purposes, behaviors are classified as normal or abnormal, and the latter are judged to be data leakage. For example, a behavior is considered abnormal if an employee goes to work early because there is more work than usual. Treating all abnormal actions as information leakage actions makes the rate of over-detection too high. To solve this problem, the present study aims to improve the prediction of data leakage security incidents by identifying and detecting the behaviors that actually cause information leakage, rather than all abnormal behaviors inside the organization, and ultimately to reduce the number of information leakage security incidents.

Research Methodology
In this research, the methodology presented in Figure 3 was used to collect data leakage behaviors and analyze their risk. The data leakage behaviors were first investigated through an analysis of previous research related to data leakage incidents. We then conducted in-depth interviews with security managers who had experienced incidents in their organizations and collected further data leakage behaviors. To statistically verify the collected behaviors, a validity assessment by security experts and a risk assessment of each behavior were conducted. We then applied the N-gram analysis method to the statistically verified data leakage behaviors to design hypothetical scenarios with data leakage potential. The selection of survey targets and the process for collecting information leakage behaviors, and the application of statistical methods to derive the risk of information leakage, are detailed in Sections 4 and 5, respectively.

Collection of Data Leakage Behaviors
To collect the data leakage behaviors, we analyzed prior research, focusing on previous studies that used scenarios of security incidents involving insider threats [10][11][12][13][14][15][16]. As noted in those studies, analyzing leakage behaviors through past research alone has limitations because only a small proportion of security incidents and data leakage behaviors are disclosed, owing to the hidden nature of the crime. Therefore, in this study, the in-depth interview method was also used to collect the data leakage behaviors that occur at industrial sites. In an in-depth interview, a small number of interviewees related to the study are questioned on the basis of previously prepared questions.
Since the in-depth interviews were conducted among a small number of experts related to the research, it was important to select the interviewees before starting the research. In this study, using the purposive sampling method, security experts with experience in data leakage security accidents were selected from among relevant professionals who had performed security tasks for more than 10 years. Five security expert groups were selected for the survey, each a security team belonging to a company that had experienced security incidents (32 security team members in total), and they were interviewed over three months, from May to July 2019. Prior to conducting the in-depth interviews, it was established that the affiliations of the interviewees could not be disclosed and that the interviewees themselves could not be investigated, because each organization's security tasks were confidential. In addition, the main security incidents experienced by the expert groups were determined, and the major data leakage incidents were partially determined. The main contents of the in-depth interviews are shown in Table 1. Table 1 summarizes the main contents of the interviews by respondent code for each in-depth interviewee. The contents listed in this table were summarized based on the main leaking behaviors and the paths through which the data leakage security incidents occurred. Investigations conducted after the data leakage security incidents were also used to confirm the facts. As a result of the in-depth interviews, a total of 26 data leakage security incidents were investigated. After the detailed leakage routes of the security incidents were revealed, all investigation methods and measures were explored. After collecting the data leakage behaviors, those obtained from the analysis of previous research and those investigated through our in-depth interviews were synthesized, as shown in Table 2.
Table 2 organizes the consolidated data leakage behaviors into four major categories and 10 mid-level classifications according to behavior type. Each subdivision includes the data leakage behaviors that were identified through the in-depth interview method and the analysis of previous studies. A list of the previous studies is shown in Table A1 of Appendix A; it mainly comprises prior research on detecting information leakage by insiders. This previous research on insider threats largely confirmed the importance of insider threat detection technology. Moreover, we referenced studies related to insider leakage detection that applied mining and analysis techniques using machine learning, the recurrent neural network (RNN) algorithm, and big data analysis.

Data leakage behavior | Source *
Copy key documents using universal serial bus memory | A1-A3, A6, A7, A9
Duplicate key documents to a hard disk | A1, A2, A6, A7, A9
Duplicate key documents to a CD | A1-A3, A6, A7, A9
Copy key documents to a smart phone | A1-A3, A6, A7, A9, B
Classification: Arbitrary transfer of data processing unit
Alteration or detachment of portable storage (hard disk drive, solid state drive) | A6, A8, B
Arbitrary transfer of laptops (including tablets) | B
Exceeding the transfer period of the authorized laptop (including a tablet) | B
False reporting of the loss of the data processing unit | B
* Source A is a data leakage behavior obtained through the analysis of previous studies (Table A1), and B is a data leakage behavior determined through an in-depth interview.

Validity and Risk Evaluation
Statistical verification was carried out to conduct a validity analysis and risk assessment of the information leakage behaviors derived from the previous research findings and in-depth interviews. A questionnaire was used to verify the data statistically, and the survey was conducted among 76 security experts with more than 10 years of experience in performing security tasks and with experience of data leakage incidents involving the leakage behaviors. The survey used a 5-point Likert scale. The questionnaire was based on the subdivision of the data leakage behaviors and on the respondents' experience, and it assessed the degree of risk of each proposed data leakage behavior. From the survey, only data leakage behaviors with an average score of 3.5 or higher were extracted, while those with an average of less than 3.5 were rejected. The resulting average represents the degree of risk of the data leakage behavior. In total, 15 data leakage behaviors with a risk of 3.5 or higher were derived, as shown in Table 3. We also calculated the standard deviation. The most dangerous data leakage behavior is "Receive key documents by accessing them through a Virtual Private Network (through encrypted tunneling)," which means downloading important documents over a network path that bypasses the boundary between the inside and outside of the organization. The data leakage behavior with the lowest risk was "Divide key documents into fragmented files," that is, leaking data in fragments. In addition to deriving the degree of risk, Cronbach's alpha coefficient was assessed to confirm the reliability of the survey. Cronbach's alpha is a reliability coefficient that expresses the internal consistency of a survey; based on the average correlation between the variables within the survey, it indicates whether the test items are composed of homogeneous factors.
Reliability rests on the assumption that measuring the same concept with different independent measurement methods yields similar results; in a questionnaire survey, for example, a respondent should give the same answer when the same question is asked repeatedly in different ways. A Cronbach's alpha coefficient of 0.8 or higher is regarded as reliable. This study obtained a high reliability of 0.893.
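The two statistics used in this section can be sketched in a few lines of Python. The block below is a minimal illustration, not the authors' analysis script: the behavior names and the five-respondent scores are hypothetical, and only the 3.5 cut-off and the use of Cronbach's alpha come from the text. It assumes the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of respondent totals), where k is the number of items.

```python
from statistics import mean, variance

RISK_THRESHOLD = 3.5  # cut-off stated in the text


def risky_behaviors(scores_by_behavior, threshold=RISK_THRESHOLD):
    """Return {behavior: mean score} for behaviors whose mean is >= threshold."""
    means = {name: mean(scores) for name, scores in scores_by_behavior.items()}
    return {name: m for name, m in means.items() if m >= threshold}


def cronbach_alpha(scores_by_item):
    """Cronbach's alpha; each value lists one item's scores in respondent order."""
    items = list(scores_by_item.values())
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    if len({len(column) for column in items}) != 1:
        raise ValueError("every item needs one score per respondent")
    item_variance_sum = sum(variance(column) for column in items)
    # Total score of each respondent across all items.
    totals = [sum(row) for row in zip(*items)]
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical 5-point Likert answers from five respondents (not the survey data).
responses = {
    "vpn_download": [5, 4, 5, 4, 5],
    "usb_copy": [4, 4, 5, 3, 4],
    "print_documents": [3, 2, 4, 2, 3],
}

print(risky_behaviors(responses))
print(round(cronbach_alpha(responses), 3))
```

With real data, `responses` would hold one list of 76 scores per behavior. Note that the function fails with a division by zero if every respondent has the same total score, which a production script would need to handle.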

Design of the Data Leakage Scenario through the N-Gram Analysis Method
One of the key components of text mining is document representation. To represent documents effectively, the Bag-of-Words (BoW) model is commonly used [17]. This model denotes the presence or absence of a particular word in the document and provides a simple check of the frequency of words in the document [18]. However, the BoW model does not consider the sequence of words in a sentence, whereas N-gram, one of the text mining methods based on word sequence, analyzes the sentence according to that sequence. Other text classification methods include Naïve Bayes and Support Vector Machines. When used alone, their performance varies greatly depending on the model variant, the features used, and the task or dataset. However, when they are used with N-gram, the text classification results consistently improve because a bigram can capture sequence [19]. Existing studies on scenario design commonly cluster items with similar properties and then identify similarities among the scenarios in each cluster [20]. Prior to selecting an analysis method, this research conducted a pilot test using the Naïve Bayes analysis method and the clustering method. Table 4 shows the results of both methods. The Naïve Bayes analysis method produced scenarios that combined unrelated actions, such as printing important files after obtaining unauthorized access through a Virtual Private Network, or taking a device (laptop or storage device) after obtaining unauthorized access through a Virtual Private Network. The clustering method produced leakage scenarios whose actions lacked correlation, such as taking a portable device out after printing an important document, and then attaching a file and connecting via a smartphone after sending an e-mail from another seat.
Accordingly, it was confirmed that both of these analysis methods were inappropriate, and we therefore proceeded with the N-gram analysis method, which derives a scenario by considering the sequence of behaviors. N-gram is a method of predicting the word that is most likely to follow another word by representing the relationship between the words recognized so far. When analyzing a large amount of text, the N-gram technique considers only some of the words rather than the whole text. The "N" in N-gram determines how many words are taken as the unit of analysis. An "N-gram" is defined as a consecutive sequence of n items: the whole text is broken down into units of n words, and each unit is regarded as a single token. For instance, breaking down the sentence "An adorable little girl is spreading smiles." [21] for each value of n illustrates how the N-gram analysis method splits a given sentence according to the size of n. When n is 1, the unit is called a unigram; when n is 2, a bigram; when n is 3, a trigram; when n is 4, a 4-gram; and so on. When n is 1, each unit is a single word, and the sentence is broken down word by word at the spaces. When n is 2, each unit is a sequence of two consecutive words, so the result is a set of two-word pairs. Thus, the unit of consecutive words grows as n increases.
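The splitting described above, and the difference from the Bag-of-Words model, can be illustrated with a short sketch. The `ngrams` helper and the two reordered action sequences are illustrative assumptions, not part of the study; only the example sentence is taken from the text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every window of n consecutive items in tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Example sentence from the text, split at spaces.
tokens = "An adorable little girl is spreading smiles".split()

for n in (1, 2, 3, 4):
    print(n, ngrams(tokens, n))

# Bag-of-Words ignores order: two sequences with the same items in a
# different order have identical word counts but different bigrams.
first = "copy file then send mail".split()
second = "send mail then copy file".split()
same_bag = Counter(first) == Counter(second)
same_bigrams = Counter(ngrams(first, 2)) == Counter(ngrams(second, 2))
print(same_bag, same_bigrams)
```

Because only the bigram counts distinguish the two orderings, an order-aware representation is the natural choice when the goal is to describe a sequence of behaviors rather than a set of them.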
Predicting a following word in a sequence of words in the Language model using the N-gram method depends only on n item-1. For example, when predicting a following word after the sentence "An adorable little girl is spreading," using a language model of 4-gram, in which the n item is 4, the three preceding words-little girl is-which refers to n item-1, is considered. Assuming that the phrase "girl is spreading" appears 1,000 times in vast amount of text analyzing N-gram language models, given that "girl is spreading insults" appears 500 times and "girl is spreading smiles" appears 200 times, the logic is that the probability of "insults" to follow "girl is spreading" is 50%, whereas it is 20% for "smiles." In this way, according to probabilistic choice, the N-gram analysis method considers "insults" as a more appropriate word to be followed by "girl is spreading." P(insult girl is spreading) = 0.500 P(smiles girl is spreading) = 0.200 (1) In this research, we proceed to draw an information leakage scenario by applying the N-gram analysis method to the information leakage. A total of 15 acts of information leakage, which verified a fitness (risk) evaluation through survey, were designated as the target of the N-gram analysis. Information leakage scenarios (a list of information leakage) were applicated by conducting N-gram analysis between information leakage acts along with the order of the analysis method shown in Figure 4. For the first stage, all 15 data leakage behaviors that were evaluated for suitability and validity (=degree of risk) were parameterized by code, as were the scenarios of actual data leakage behaviors, as confirmed through the previous in-depth interview method. Thereafter, when n is 2, a scenario analysis is performed between two data leakage behaviors, and when n is 3, a scenario analysis is performed between three data leakage behaviors. 
In the last stage, we design data leakage scenarios through N-gram analysis and identify the scenario with the greatest risk according to the risk of each information leakage behavior.
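The staged procedure, in which behaviors are coded and then analyzed as bi-grams and tri-grams, might be sketched as follows. The behavior codes (`B03`, `B07`, and so on) and the three coded scenarios are hypothetical placeholders, not the codes or interview data used in this study.

```python
from collections import Counter

# Hypothetical coded scenarios: each list is one ordered sequence of behaviors.
scenarios = [
    ["B03", "B07", "B12"],
    ["B03", "B07", "B09"],
    ["B05", "B03", "B07", "B12"],
]


def count_behavior_ngrams(sequences, n):
    """Count every run of n consecutive behavior codes across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts


bigrams = count_behavior_ngrams(scenarios, 2)   # n = 2: pairs of behaviors
trigrams = count_behavior_ngrams(scenarios, 3)  # n = 3: triples of behaviors
print(bigrams.most_common(2))
print(trigrams.most_common(1))
```

Because the order of the codes is preserved inside each tuple, the counts distinguish a behavior that precedes another from the same pair in the reverse order.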
The N-gram analysis was conducted using the statistical analysis tool RapidMiner Studio version 7.2, which specializes in predictive data analysis. For the analysis, the TF-IDF (Term Frequency-Inverse Document Frequency) feature selection method was applied. In the TF-IDF feature selection method, the frequency of a specific behavior is divided by the frequency of appearance of all behaviors in the n behavior sets, as shown in Figure 5.
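For readers unfamiliar with TF-IDF, the following is a minimal sketch of one common textbook formulation, in which the term frequency is the share of a term within a behavior set and the inverse document frequency is the logarithm of the number of sets divided by the number of sets containing the term. This is an assumption for illustration: the exact weighting and normalization applied by RapidMiner Studio may differ, and the bi-gram terms below are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical behavior sets, each expressed as a list of bi-gram terms.
behavior_sets = [
    ["B03_B07", "B07_B12"],
    ["B03_B07", "B07_B09"],
    ["B05_B03", "B03_B07", "B07_B12"],
]
weights = tf_idf(behavior_sets)
print(weights[1])
```

Under this formulation, a bi-gram that occurs in every behavior set receives a weight of zero, whereas a bi-gram confined to a single set is weighted most heavily.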

Result of the Research
In this study, data leakage scenarios were designed through bi-gram and tri-gram analyses. In general, it is recommended that n not exceed 5 [21,22]. In this study, however, an n value of 4 or more yields a sparse matrix, from which it is difficult to derive meaningful results. Using the perplexity value [21] as a measure of the accuracy of the N-gram analysis result, we obtained a perplexity of 223 for the bi-gram and 158 for the tri-gram. The smaller the perplexity value, the more accurate the analysis result [21]. When n is 4, the perplexity value is 110, which is lower than both. However, this value cannot be considered reliable, because the data leakage scenario dataset contains too few sequences of four data leakage behaviors [21]. Thus, we set n to 2 and 3.

The bi-gram scenarios, ranked by the sum of the risks of the two data leakage behaviors and limited to those with more than eight points, are shown in Table 5. Five data leakage scenarios were derived. The notable feature of the bi-gram analysis results is the presence of the behaviors "Use of unauthorized external commercial email (not using the authorized in-house email server)" and "Receiving key documents by accessing them through a Virtual Private Network (through encrypted tunneling)", which correspond to scenarios 3 and 5. This means that if the unauthorized use of an external commercial email service, or the receipt of key documents over a Virtual Private Network that bypasses the network, occurs together with other data leakage behaviors, a data leakage security incident is likely to occur; these behaviors therefore need to be carefully assessed.
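The perplexity comparison can be made concrete with a small sketch. The study does not state how perplexity was computed, so the add-one (Laplace) smoothed bi-gram model below is an assumption for illustration, and the behavior codes `A` and `B` are hypothetical.

```python
import math
from collections import Counter


def bigram_perplexity(train, test):
    """Perplexity of an add-one smoothed bi-gram model on the test sequences."""
    vocab = {code for seq in train + test for code in seq}
    context_counts = Counter()
    bigram_counts = Counter()
    for seq in train:
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    log_prob, n_transitions = 0.0, 0
    for seq in test:
        for prev, cur in zip(seq, seq[1:]):
            # Add-one smoothing keeps unseen transitions from having zero probability.
            p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))
            log_prob += math.log(p)
            n_transitions += 1
    if n_transitions == 0:
        raise ValueError("test data contains no transitions")
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / n_transitions)


train = [["A", "B"], ["A", "B"]]
seen = bigram_perplexity(train, [["A", "B"]])    # transition observed in training
unseen = bigram_perplexity(train, [["B", "A"]])  # transition never observed
print(seen, unseen)
```

A transition that the model has observed gives a lower perplexity than one it has not. With very little data, however, a low value can simply reflect a handful of memorized sequences, which is consistent with the caution about the n = 4 result.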
Another key finding is that "Receiving key documents by accessing them through a Virtual Private Network (through encrypted tunneling)" following "SW installation to disable the security environment (bypassing the security SW)" was identified as the scenario with the highest risk (9.21).
The tri-gram scenarios with more than 13 points, ranked by the sum of the risks of the three data leakage behaviors, are shown in Table 6. As in the bi-gram analysis results, the notable feature of the tri-gram analysis results is that "Receiving key documents by accessing them through a Virtual Private Network (through encrypted tunneling)" and "Use of unauthorized external commercial email (not using the authorized in-house email server)" are again major data leakage behaviors. These two behaviors also showed a high risk in the bi-gram analysis. Thus, the probability of a data leakage security incident is very high when these two behaviors occur in succession.
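The scenario selection rule, which sums the risk of each behavior in an N-gram and keeps the scenarios above a threshold, can be sketched as follows. The thresholds of 8 and 13 points come from the text; the behavior codes and per-behavior risk scores are hypothetical values chosen for illustration and are not the survey results.

```python
# Hypothetical per-behavior risk scores (not the values from the survey).
risk = {"B03": 4.5, "B05": 2.9, "B07": 4.2, "B09": 3.1, "B12": 4.6}


def risky_scenarios(candidates, risk, threshold):
    """Score each behavior sequence by summed risk; keep those above threshold."""
    scored = [(seq, round(sum(risk[code] for code in seq), 2)) for seq in candidates]
    kept = [item for item in scored if item[1] > threshold]
    # Highest-risk scenario first.
    return sorted(kept, key=lambda item: item[1], reverse=True)


bigram_candidates = [("B03", "B07"), ("B07", "B12"), ("B07", "B09"), ("B05", "B03")]
trigram_candidates = [("B03", "B07", "B12"), ("B03", "B07", "B09")]

top_bigrams = risky_scenarios(bigram_candidates, risk, threshold=8)
top_trigrams = risky_scenarios(trigram_candidates, risk, threshold=13)
print(top_bigrams)
print(top_trigrams)
```

Sorting by the summed score places the highest-risk scenario first, which mirrors how the top-ranked bi-gram scenario is singled out in the results.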

Conclusions and Future Work
In this study, we investigated the prevention of behavior-based data leakage to facilitate the sustainable growth of organizations. Specifically, data leakage behaviors and their risks were derived in order to predict data leakage security incidents, by collecting and analyzing data leakage behaviors so that threats from organization insiders can be detected. To derive the risks, we identified the data leakage behaviors of insiders by analyzing previous research and conducting in-depth interviews, and then statistically verified these behaviors through a survey. The statistically verified data leakage behaviors were then analyzed using the N-gram methodology. The data leakage behaviors considered in this study were determined through in-depth interviews with security experts who have experience with actual security incidents. These behaviors were thus classified and identified as actual data leakage behaviors rather than merely abnormal behaviors. Because the behaviors and scenarios were statistically verified through an additional survey of security experts, and the scenarios were derived using the N-gram methodology, they differ from those derived via previous detection methods based on abnormal behaviors. The results of this study can therefore help overcome the limitations of security solutions that rely on previous detection methods and thereby help predict security incidents. Furthermore, the research results are expected to reduce the organization's risk of data leakage and contribute to the organization's sustainability. The main limitation of this study is that the derived data leakage behaviors and scenarios were not applied in an actual industry setting; thus, the reliability of the results could not be confirmed. Therefore, future work should develop an automated data leakage behavior analysis tool and determine its applicability by applying it in actual industry settings.