Natural Language Processing Model for Automatic Analysis of Cybersecurity-Related Documents

This paper describes the development and implementation of a natural language processing model based on machine learning which performs cognitive analysis of cybersecurity-related documents. A domain ontology was developed using a two-step approach: (1) the symmetry stage and (2) the machine adjustment stage. The first stage is based on the symmetry between the way humans represent a domain and the way machine learning solutions do. Therefore, the cybersecurity field was initially modeled based on the expertise of cybersecurity professionals. A dictionary of relevant entities was created; the entities were classified into 29 categories and later implemented as classes in a natural language processing model based on machine learning. After running successive performance tests, the ontology was remodeled from 29 to 18 classes. Using the ontology, a natural language processing model based on supervised learning was defined. We trained the model using sets of approximately 300,000 words. Remarkably, our model obtained an F1 score of 0.81 for named entity recognition and 0.58 for relation extraction, showing superior results compared to other similar models identified in the literature. Furthermore, in order to be easily used and tested, a web application that integrates our model as the core component was developed.


Introduction
In the last decade, great progress has been made in natural language processing (NLP) based on machine learning (ML). In this context, interest in developing solutions that automatically understand text using ML algorithms has increased. This paper describes an NLP model based on ML, specialized in the cybersecurity field. The framework automatically recognizes the main entities and extracts the relations between them. Furthermore, it provides a solid baseline for future, more complex solutions, such as (1) automatically extracting information from hackers' discussions or (2) semantically indexing relevant documents.
In a recent paper, we presented [1] a semantic indexing system designed to automatically monitor the latest information relevant to cybersecurity. The system filters the data and presents it in an organized and structured framework according to the users' needs. The system automatically collects text data, analyzes it through NLP algorithms, stores only the relevant documents, which are semantically indexed, and makes the documents available on a platform where users can conduct semantic searches. The model described in this paper can be adapted to be the NLP component needed to implement the solution presented in [1].

Our Contribution
This article describes an original prototype that we developed for cognitive text analysis in the cybersecurity field. We present the architecture of the prototype and detail each component.

Related Work
Developing domain ontologies for cybersecurity is a topical subject in the specialized literature. Various articles, such as [3,4], described the development process and the main components of such ontologies. Articles [5,6] proposed software solutions containing an ontology specialized for the cybersecurity field. Recent papers proposed domain ontologies for Internet of Things (IoT) security, such as [7] or [8]. In order to develop an ontology, paper [9] used ML to extract entities from text and build a cybersecurity knowledge base. Paper [10] presented the state of the art of the web observatory, provided insights and discussed the main challenges associated with this concept, including security and privacy.
Text mining in cybersecurity is another topical subject. Various papers, such as [11], used text mining and information retrieval techniques in the cybersecurity field. Most of the projects identified preferred supervised learning. Paper [12] implemented the perceptron learning algorithm to automatically annotate data available in the US national vulnerability database (NVD) [13]. They managed to automate the training process by creating a set of heuristics for text annotation. The results consisted of a training corpus of circa 750,000 tokens. However, their corpus was very homogeneous, which affected the results. Due to the difficulty of manually annotating large collections of data, some researchers preferred semi-supervised learning algorithms. In [14], a bootstrapping algorithm was used to heuristically recognize cyber-entities and identify new entities through an iterative process of analyzing a large unannotated corpus. The results showed high precision, but very low recall. Paper [15] improved the approach and developed cyber-entity tagging with much better results. The recent evolution of deep learning attracted researchers to implement deep neural network models for NLP. One of the main advantages is that they reduce the effort of human annotators. The results of using the deep learning approach seem to be satisfactory. Article [16] performed a comparative analysis of the main deep learning methods for NER and entity extraction. Survey [17] described the state of the art of deep learning algorithms and applications for cybersecurity.
Various projects implemented NLP-based models for the cybersecurity field, such as [8] or [18]. In [8], the author developed a cybersecurity model for IoT, which was connected to a gateway in order to identify the potential vulnerabilities of an IoT environment. Similar methods and technologies were used in [19], which described a model that extracts relevant information about emails from the shipping industry. Papers [20,21] proposed NLP models based on ML for the medical field, extracting relevant data that can be used to make valuable inferences. A comparison of our model and other projects is conducted in Section 4. Our model stands out through better performance indicators. Moreover, our model is more complex than the other projects discussed, having the ability to recognize more types of entities and relations.
The architecture, components, technologies used and implementation techniques of the solution are presented in detail. The prototype is referred to as Cybersecurity Analyzer and is available at: www.cybersecurityanalyzer.com. An older version of this prototype was presented at the national innovation contest PatriotFest 2018. The project was evaluated by the competition jury, winning the main prize at Gala PatriotFest 2018 [22]. The solution has been available online since November 15, 2018. It has been uploaded to an open environment, where it can be tested by users interested in contributing to this project by providing feedback. For the development of Cybersecurity Analyzer, open-source and trial license applications were used. The only costs were generated by purchasing the domain name.
Section 2 illustrates the architecture of Cybersecurity Analyzer and the prototype components are discussed in detail. Section 3 describes the process of developing the NLP model based on ML. A domain ontology specialized for cybersecurity was developed and later used to define and train the model. Custom-made scraping solutions were used to automatically download data available online, which was used in the training process. Section 4 analyzes the main performance indicators of the model for both NER and relation extraction. Our model is compared to other projects. Section 5 illustrates the interface of Cybersecurity Analyzer, as well as a use-case example. Figure 1 illustrates the prototype's architecture. It is structured on four levels: (1) document upload, (2) cognitive analysis, (3) data store and (4) presentation. The links between levels are made through representational state transfer application programming interfaces (REST APIs).

The Architecture of Cybersecurity Analyzer
In order to facilitate the user's access to the prototype, a web interface was developed. Figure 2 illustrates the main page of the web application. Within it, there is an upload form where text documents can be inserted in various text formats (e.g., .doc, .docx, .pdf, .txt).
Once uploaded, the document is sent via a REST API to the NLP model based on ML developed in the cloud, using the Watson Knowledge Studio service [23]. In order to access the model, the credentials required for API transmission are stored on the server side. The inclusion of a server-side component was mandatory to ensure the security of the API credentials. Once the document is sent to the Watson Knowledge Studio service, the data flow reaches level 2. At the core of the Cybersecurity Analyzer solution is the model adapted to the cybersecurity domain. The components of the model were implemented using tools available in IBM Cloud [24]. The uploaded documents are annotated and stored in IBM Cloud through the Watson Discovery [25] service. The Cybersecurity Analyzer prototype does not store processed documents for cost reasons. After a document is enriched with metadata and sent via REST API to the presentation level (4), the document is deleted. This option was preferred because the license used for the Watson Discovery service was limited to storing up to 2000 files. Level 4 manages the presentation of the annotated documents. The interface is implemented as a web application. Within it, users can observe the recognized entities as well as the relations between the entities. In the following sections, each component is described in detail.
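The server-side forwarding step can be sketched as follows. This is a minimal illustration, not the prototype's actual code: the endpoint URL, version date and model ID are placeholders, and the request shape assumes the custom model is exposed through a Watson Natural Language Understanding-style analyze API (an assumption, as the paper does not detail the API calls).

```python
import json

# Hypothetical placeholders: the real endpoint, API version and custom model ID
# are account-specific and not disclosed in the paper.
NLU_URL = "https://api.us-south.natural-language-understanding.watson.cloud.ibm.com"
MODEL_ID = "cybersecurity-analyzer-model-id"

def build_analyze_request(text: str, model_id: str) -> dict:
    """Build the JSON body for an analyze call that runs a custom
    entities/relations model deployed from Knowledge Studio."""
    return {
        "text": text,
        "features": {
            "entities": {"model": model_id},
            "relations": {"model": model_id},
        },
    }

payload = build_analyze_request(
    "A SQL injection was used against the web server.", MODEL_ID
)
print(json.dumps(payload, indent=2))
```

On the server side, such a payload would be POSTed together with the stored credentials; keeping the API key out of the browser is exactly why the server-side component was mandatory.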

Developing a Domain Ontology
The ontology developed in our work was designed to be implemented in the NLP model based on ML. After performing a literature review, we recognized the ontology developed by Iannacone et al. [3] as the closest to the one needed for the prototype. Initially, we considered using their ontology for the implementation of our prototype, but after performing several tests we decided to develop a new ontology, custom-built to be easily integrated with the ML-based NLP component.
In order to develop the ontology, we used the two-step approach described in the introductory section. Besides reviewing the literature, we conducted interviews with 14 cybersecurity experts. The main purposes of the interviews were: (a) identifying the materials they use for documentation, (b) finding the most common cybersecurity terms used by them and (c) understanding, from their perspective, the utility of a cognitive analysis solution in the cybersecurity field.
Based on the interviews and on the most frequently used cybersecurity documents, a dictionary containing approximately 5000 words was created. Subsequently, we created 29 categories and each word in the dictionary was assigned to one or more categories. The categories were implemented as classes in Watson Knowledge Studio, as follows: Account, Action/Course of Action, Address, Antivirus, Assets, Attack, Attacker, Detection, Device, Event, Firewall, Hardware, Impact, Incident, Loss, Malware, Networking, Procedure, Protocol, Risk, Service, Software, Target, Threat, Tools, User, Victim, Vulnerability and Weakness.
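A fragment of such a dictionary can be sketched as below. The terms and their category assignments are illustrative assumptions (the 5000-word dictionary itself is not published with the paper), except for denial of service, which the paper itself labels with two classes.

```python
# Illustrative fragment of the term-to-category dictionary; the actual
# 5000-word dictionary is not published, so these assignments are assumptions.
term_categories = {
    "ransomware": ["Malware"],
    "phishing": ["Attack"],
    "denial of service": ["Attack", "Impact"],  # a term may get several categories
    "buffer overflow": ["Vulnerability"],
    "iptables": ["Firewall"],
}

def categories_of(term: str) -> list:
    """Return the categories assigned to a dictionary term (case-insensitive)."""
    return term_categories.get(term.lower(), [])

print(categories_of("Ransomware"))
```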
In the second stage, we trained the NLP supervised learning model and performed periodic tests. We used the F1 score, precision and recall indicators; the evaluation methodology is described in detail in Section 4. Besides the indicators, we used confusion matrices in order to identify the classes where entities were not correctly labeled.
The results of the tests showed that the initial ontology was too complex to achieve satisfactory performance. Therefore, less relevant classes or classes with high redundancy were eliminated and several classes were merged. For example, we removed the classes Antivirus, Firewall and Procedure and included their instances in the Defensive Means class. Although the variety of knowledge representation decreased, we consider the current form of the ontology suitable for its use in the designed model. Supplement 1 illustrates the initial 29 classes of the ontology, as well as the evolution of the classes after the tests were performed. An explanation is provided for each class that was changed and had its entities redistributed.
In order to develop the ontology, the tools Protégé 5.2 (desktop application) and WebProtégé (cloud-based application) were used. Protégé 5.2 is an open-source solution that offers a suite of tools to build domain models and knowledge-based applications using ontologies. Protégé toolkits are currently the most widely used solutions for developing ontologies. Protégé is developed by the Center for Biomedical Informatics Research at Stanford University [26]. Figure 3 illustrates the classes of the developed ontology, as well as the relationships between them. The ontology consists of 18 classes: Account, Address, Attack, Attacker, Defensive Means, Event, Exploit, Host, Impact, Loss, Malware, Networking, Offensive Means, Risk, Software, Threat, User and Vulnerability. Table 1 illustrates the relations between the classes. As can be observed, there are 12 types of relations and, in total, 33 relations between the 18 classes. Once the model was implemented in the NLP model based on ML, we identified that some relations were not optimally managed, leading to aberrant results. As with the classes, the relations that increased the complexity without significantly improving the system performance were eliminated.
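The class set of the final ontology, together with the shape of its relation table, can be sketched as typed triples. The 18 class names are taken from the paper; the three sample triples are hypothetical illustrations, since Table 1 itself is not reproduced in this text.

```python
# The 18 classes come from the paper's final ontology; the relation triples
# below are hypothetical examples (Table 1 is not reproduced in this text).
CLASSES = {
    "Account", "Address", "Attack", "Attacker", "Defensive Means", "Event",
    "Exploit", "Host", "Impact", "Loss", "Malware", "Networking",
    "Offensive Means", "Risk", "Software", "Threat", "User", "Vulnerability",
}

RELATIONS = [
    ("Attacker", "uses", "Offensive Means"),
    ("Exploit", "targets", "Vulnerability"),
    ("Attack", "causes", "Impact"),
]

# Sanity check: every relation must connect two classes of the ontology.
for subject, _, obj in RELATIONS:
    assert subject in CLASSES and obj in CLASSES
print(f"{len(CLASSES)} classes, {len(RELATIONS)} sample relations")
```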

Implementing the Ontology Using IBM Watson
In order to implement the ontology for the cybersecurity domain, the Knowledge Studio service of IBM Watson was used. This service allows the development of both rule-based models and ML models. The Knowledge Studio service is available in the cloud and can be accessed in two ways: from the web interface and by connecting to the REST APIs. In our work, both methods were used, the first for the ease of working directly with the interface, and the second when process automation was suitable. In order to interact with the APIs, we used Postman [27] as well as tailor-made scripts.
Knowledge Studio offers the possibility to implement the ontology components, such as classes, relations between classes, entities, various lexical forms for the same entity, rules, properties, etc. The 18 classes were implemented according to the previously described ontology. Knowledge Studio was initially designed as a service that used only dictionaries, not ontologies. Therefore, it does not provide facilities for importing ontologies written in standard formats such as RDF, OWL or XML. As the solution was developed and improved, more functionalities were introduced, creating the possibility of using ontology-specific components.

The Development Process of the NER Model
The cognitive analysis model developed has two main functionalities: NER and relation extraction. In order to implement the NER functionality, we first defined the classes and then the types of relations by using the domain ontology presented above.
The source code of the Watson system is a trade secret; therefore, its techniques and algorithms are not known in detail. According to [28], in order to develop the system, over 100 different techniques have been implemented for "analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses" [28]. Paper [29] states that, for classification, logistic regression algorithms were chosen due to their simplicity, being preferred over other algorithms such as support-vector machines. Currently, Watson's impressive performance is mainly based on deep learning algorithms. The purpose of these algorithms is to understand the content, domain and context. This approach involves the use of multi-level neural networks to extract knowledge from the data.


The Training Process
The first stage of the training process consisted of the selection of training data relevant to the cybersecurity field. The selection of documents was made considering the literature review as well as the interviews conducted with the cybersecurity professionals. The training sets consisted of a total of about 300,000 words, grouped into documents of approximately 1000 words each. A custom-made script was developed to split each text file into 1000-word chunks. Therefore, 300 documents were generated as follows: 210 Common Vulnerabilities and Exposures (CVE) files [30], 30 documents consisting of research articles, 30 documents consisting of books, 15 files with news and 15 other relevant documents available online, as can be observed in Figure 4. The CVE database is used by cybersecurity professionals to stay up to date regarding the newest types of vulnerabilities discovered. All the cybersecurity experts interviewed considered the CVE database the main source of information. Besides that, we noticed that other sources of documentation they commonly used contained numerous references to CVEs. About 100,000 CVEs were downloaded and grouped by the year of occurrence. Out of these, 10,000 CVEs from 1999 to 2019 were used to train the model.
In order to automatically collect CVEs, a scraping solution was developed. Within the solution, special programs called spiders were created for each website from which data was downloaded.
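The chunk-splitting step mentioned above can be sketched as follows; the function name and the whitespace-based splitting policy are assumptions, as the paper does not describe the script's internals.

```python
def split_into_chunks(text: str, words_per_chunk: int = 1000) -> list:
    """Split a document into consecutive chunks of at most `words_per_chunk`
    whitespace-separated words, as used to build the ~1000-word training files."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

# Example: a 2500-word text yields chunks of 1000, 1000 and 500 words.
sample = "token " * 2500
chunks = split_into_chunks(sample)
print([len(c.split()) for c in chunks])  # -> [1000, 1000, 500]
```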
Once the relevant documents are available, the training process begins. The training stages we performed are:

1. Pre-annotation of documents using the rule-based model (based on the ontology): The rule-based model identifies predefined elements, such as instances or relations. Since the human annotation process is time-consuming, we used the rule-based model to automate and speed up the training process of the ML model. The advantage of this approach is the reduced time required for annotation. On the other hand, the annotators need to be extra careful: if the machine has annotation flaws and the mistakes are not corrected by the human annotators, the flaws become even more difficult to correct in the subsequent process;
2. Correction of the annotations made by the rule-based model: Very often, the entities automatically annotated by the rule-based model are incomplete and sometimes even wrong; therefore, human intervention is required;
3. Quality examination of the annotation process: For this purpose, periodic tests of the ML-based model's performance are implemented, comparing the evolution of the indicators over time;
4. Integration of the training sets into the model: Once the training sets are considered to be appropriately annotated, they are approved and integrated within the model, being marked as ground truth.
The machine learning model reached an acceptable level of performance after approximately 40% of the training process, amounting to 120,000 words in the training corpus. From that point, the pre-annotation of the documents was done using the already trained ML model. This change helped the author annotate the rest of the training documents faster. The training process took place over a period of five months. After about 80% of the total training process (labeled documents containing circa 240,000 words), the model reached a high level of maturity, its performance improving ever more slowly. The evolution of the model is described in the next section. Figure 5 illustrates a screenshot taken during the NER training process. The author of this paper was the only annotator. Each relevant token was identified and labeled according to the class it belonged to. Based on these labels, the model builds the ground truth, which is subsequently used in the automatic annotation process.
On the right side of the figure, the classes of the model can be observed, and on the left side, there is a fragment of a training document. The words in the text are labeled as entities of specific classes, by using tags with the same color as the classes to which they belong. For example, the token cleartext belongs to the class Vulnerability and the tokens database and engine are labeled to the class Software. An interesting aspect is that the token denial of service was labeled to both the Attack and Impact classes.
Besides that, the relations between entities are identified and annotated. Figure 6 illustrates a screenshot taken during the relation extraction training process and Table 2 shows the labeled relations.
Once the model is trained, it can be used to annotate new documents for NER and relation extraction functionalities. After a document is annotated, it is stored in the Watson Discovery database, from where it can be used. The documents are saved in JSON files, together with the enrichments. Figure 7 illustrates the representation of a relation. As it can be observed, the token Web is an entity that is part of the class Networking, and the token Application is part of the class Software. Between these words, the model identifies the relation uses.
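The stored relation from Figure 7 can be approximated as in the simplified sketch below; the field names are assumptions, as the exact schema of the Watson enrichment JSON is not reproduced in the paper.

```python
import json

# Simplified, hypothetical sketch of the relation illustrated in Figure 7;
# the exact Watson enrichment schema is not reproduced in the paper.
relation = {
    "type": "uses",
    "arguments": [
        {"text": "Web", "entity_type": "Networking"},
        {"text": "Application", "entity_type": "Software"},
    ],
}

print(json.dumps(relation, indent=2))
```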


Model Performances
This section analyzes the performance of the NLP model based on ML. The model's performance indicators were evaluated persistently during the development, in order to extract regular feedback regarding its progress. We present in detail the performance indicators for the current version of the model and briefly discuss the model's evolution in time.

Methodology and Metrics Used
The evaluation methodology used is based on the IBM Watson documentation [26]. Since our ML model is supervised, its evaluation is based on comparing the annotations automatically made by the model with the human annotations. Human labels are used as ground truth; thus, the more similar the model's annotations are to those performed by humans, the better the model's results.
In order to evaluate the model's performances, the relevant documents were grouped into three categories:

•	Training sets: represent documents labeled by humans. Starting from these annotations, the model learns to properly recognize entities, relations and classes;
•	Test sets: represent documents used to test the model after it has been trained. The performances are evaluated based on the differences between the annotations made by the model and those made by humans;
•	Blind sets: represent documents that have not been previously viewed by the humans involved in the annotation process [31].
In order to validate the results, the main performance indicators specific to the ML domain were considered: F1 score, precision and recall [32]. The indicators were calculated for the two main functionalities of the solution: NER and relation extraction.

F1 score
The F1 score can be interpreted as the harmonic mean of the precision and recall indicators, falling within the range [0,1]. Its formula is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Precision
Precision is an indicator that measures the ratio of the number of correct annotations to the total number of annotations made by the ML model. Precision can be interpreted as a deviation of the results from the real values. A maximum precision score for an entity assumes that each time it was annotated by the machine, the annotation was correct (consistent with the ground truth).

Recall
Recall is the ratio between the true positive annotations and the total number of annotations that the machine should have identified. The maximum value of this indicator for an entity is reached when the ML model correctly annotates each occurrence of that entity. A low recall score helps to identify contexts where the ML model fails to label objects that should have been annotated. The formula of the recall indicator is:

Recall = TP / (TP + FN)

where TP is the number of true positive annotations and FN is the number of annotations the model missed. According to the recommendations of [31], the documents were split into 70% training sets, 23% test sets and 7% blind sets. Several aspects were taken into account in order to ensure that the performance evaluation was properly conducted and that the methodology's stipulations were applied rigorously. First of all, we made sure that the dataset was large enough: although we obtained similar performances with less training data, we continued to annotate until we reached a corpus of 300,000 words. We also ensured that the datasets were not too uniform, which would have made the test sets predictable. Finally, we conducted many tests in order to show that the high performances are not isolated cases but the natural evolution of the model over time.

According to the recommendations of [31], the documents were split into 70% training sets, 23% test sets and 7% blind sets. Several aspects were taken into account to ensure that the performance evaluation was properly conducted and that the methodology was followed rigorously. First, we made sure that the dataset was large enough: although we obtained similar performances with less training data, we continued annotating until we reached a corpus of 300,000 words. Second, we ensured that the datasets were not overly uniform, which would have made the test sets predictable. Finally, we conducted many tests in order to show that the high performances are not isolated cases but the natural evolution of the model over time.
As a sampling technique, we preferred hold-out over cross-validation. We considered the hold-out method suitable given the large volume of the datasets, and it made it easier to ensure that the training and test sets were properly separated, especially since our corpus contains 70% CVE documents. Because we conducted many tests at various time intervals, the hold-out technique was appropriate. In the future, we plan to apply cross-validation as well and compare the results.
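The 70/23/7 hold-out split described above can be sketched as follows; this is an illustrative reconstruction (function name, seed and document labels are hypothetical), not the project's actual preprocessing code:

```python
import random

def holdout_split(documents, seed=0):
    """Split a corpus into 70% training, 23% test and 7% blind sets,
    mirroring the proportions recommended in [31]."""
    docs = documents[:]                      # avoid mutating the caller's list
    random.Random(seed).shuffle(docs)        # deterministic shuffle for repeatability
    n = len(docs)
    n_train = int(n * 0.70)
    n_test = int(n * 0.23)
    return (docs[:n_train],
            docs[n_train:n_train + n_test],
            docs[n_train + n_test:])         # remainder (~7%) becomes the blind set

train, test, blind = holdout_split([f"doc_{i}" for i in range(100)])
print(len(train), len(test), len(blind))  # 70 23 7
```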

The Values of Performance Indicators Obtained by Our Model
The F1 score, precision and recall indicators are calculated both at the aggregate level and for each individual entity and relation. During the development of the model, performance tests were run regularly and the evolution of the indicators was tracked. In order to validate the model, an aggregate F1 score of at least 0.5 must be reached. Figure 8 is a screenshot taken in Watson Knowledge Studio that illustrates the evolution of the aggregate F1 scores for both NER and relation extraction. As can be observed, for the NER functionality the model achieved a relatively high level of performance from the first tested version, where the F1 score was about 0.7. This may be due to the fact that the first training, test and blind sets were very homogeneous, all consisting of CVE documents. Subsequently, the documents used were diversified, and the training sets contained different structures and approaches. The volume of the training sets was increased significantly, and the performance of the model also increased, but at a lower rate.
Based on the evaluations conducted on the 15 versions of the model, the best performances were obtained by version 1.13, with an F1 score of 0.81 for NER and 0.58 for relation extraction. Training performed after version 1.13 did not improve the performance indicators; on the contrary, it decreased them. We consider that this decrease was caused by inconsistencies between the annotations. At the time of writing, Cybersecurity Analyzer uses version 1.13, whose performances are presented below.

Analysis of F1 Score, Precision and Recall Indicators for NER
The precision and recall values obtained for NER are 0.88 and 0.74, respectively. Taking into account the large number of classes, we consider the precision of 0.88 very good, indicating high accuracy. The lower recall value usually indicates that the model can be improved by increasing the volume of training data.
The analysis of these indicators for NER is particularly useful during the development of the model: as classes with low performance indicators were identified, training sets were introduced to improve the model's results for those classes in particular. Figure 9 is a screenshot taken in Watson Knowledge Studio. It illustrates, for each class, the F1 score, precision and recall values, the frequency of occurrence of the class (both as a share of all annotations and as a share of all words), as well as the percentage of documents that contain entities of that class.
Out of all the classes, only three have an unsatisfactory F1 score. The low F1 scores for the classes Loss, Risk and Threat may be due to the very low values of their recall indicators, likely caused by the small number of entities of those types occurring in the documents. During the training process, for several entities it was challenging to assess whether they belonged to the Vulnerability class or the Threat class; therefore, during the annotation process, we decided to include these entities in the Vulnerability class. We believe this is the main reason for the low performance recorded for the Threat class.
A low percentage of documents containing a given class is often an indication that the training documents do not fully represent the field. In this case, the ontology structure and the training documents must be investigated to ensure that the training sets contain relevant entity types. It is recommended that, during the training process, the training data contain a minimum of 50 occurrences of each entity type [31].
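The 50-occurrence check can be automated over an annotated corpus. The following is a hedged sketch (the function name and the annotation format — (entity_text, entity_type) pairs — are our assumptions, not the actual Watson Knowledge Studio export format):

```python
from collections import Counter

MIN_OCCURRENCES = 50  # per-entity-type minimum recommended in [31]

def underrepresented_types(annotations):
    """Given (entity_text, entity_type) annotations from the training sets,
    return the entity types with fewer than MIN_OCCURRENCES occurrences."""
    counts = Counter(etype for _, etype in annotations)
    return {t: c for t, c in counts.items() if c < MIN_OCCURRENCES}

# Hypothetical annotation counts: Vulnerability is well represented, Loss is not.
anns = [("SQL injection", "Vulnerability")] * 120 + [("data theft", "Loss")] * 12
print(underrepresented_types(anns))  # {'Loss': 12}
```

Classes flagged this way (such as Loss in the hypothetical example) are candidates for additional training documents, consistent with the low recall observed for Loss, Risk and Threat.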

Analysis of F1 Score, Precision and Recall for Relation Extraction
The relation extraction functionality requires the model to identify three elements: the relation type, the parent entity and the child entity. This process is much more complex than NER; therefore, the values of the performance indicators for relation extraction are lower. Figure 10 is a screenshot taken in Watson Knowledge Studio. The aggregate F1 score of 0.58 confirms the validity of the relation extraction functionality. However, four of the relations, namely data_flow, has, includes and runs_on, have unsatisfactory F1 scores, mainly due to low recall values. In the future, efforts will be made to improve the F1 score, precision and recall for relation extraction.
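Because a relation involves three elements, a plausible scoring scheme counts a predicted relation as correct only when the type, parent and child all match the ground truth, which helps explain the lower indicator values. The sketch below illustrates such strict triple matching; it is our assumption about the evaluation, not Watson Knowledge Studio's documented scoring code, and the example triples are hypothetical:

```python
def relation_scores(gold, predicted):
    """Score relation extraction, counting a predicted relation as correct
    only when the relation type, parent entity and child entity all match.
    Relations are (relation_type, parent_entity, child_entity) triples."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

gold = [("exploits", "remote attackers", "vulnerability"),
        ("runs_on", "application", "server")]
pred = [("exploits", "remote attackers", "vulnerability"),
        ("has", "application", "server")]  # wrong relation type -> not counted
print(relation_scores(gold, pred))  # (0.5, 0.5, 0.5)
```

Note how a single wrong element (here, the relation type) invalidates the whole triple, which is why relation extraction scores trail NER scores.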


The Comparison of Our Results with Other Similar Models
We identified five papers that use similar techniques and technologies for cognitive text analysis. Table 3 compares the alternative frameworks with the solution we propose, the Cybersecurity Analyzer. The main difference between the approaches is that our model contains more types of classes and relations than any of the other models, which significantly increases its complexity. The purpose of the developed model is to perform cognitive analysis of any relevant text in the field of cybersecurity, which implies the use of large and diversified datasets. Joshi et al. [18] developed a similar model, adapted for cybersecurity, using the Stanford NER solution [33]. The F1 score they obtained for NER is close to ours. However, it is important to emphasize that our model understands the domain in more detail, being able to identify 18 different classes compared to the 10 of the model presented in [18]. Although Joshi et al. state that their model includes relation extraction components, they did not provide examples or performance indicators in this regard.
Projects [19,21] are designed for domains different from ours and have lower F1 scores for NER. The small number of classes in these projects facilitates obtaining satisfactory performance indicators. We consider that their usefulness lies in finding particular aspects of interest in large quantities of data rather than in performing a cognitive analysis of a whole domain (as Cybersecurity Analyzer does). For relation extraction, the project [19] has an F1 score close to that of Cybersecurity Analyzer. However, the level of complexity of the model developed by Fritzner is much lower; the small number of classes (two) led to good performance indicators.
The NLP model based on ML described in [8] was developed by the author of this paper, hence the approach is very similar. Compared to Cybersecurity Analyzer, the model presented in [8] is specialized in Internet of Things (IoT) security; its training data volume was much lower and its training process shorter, therefore the performances of Cybersecurity Analyzer are clearly superior. Overall, the performance indicators demonstrate the validity of the developed ML model, both for NER and for relation extraction. The comparison with similar projects identified in the literature shows that Cybersecurity Analyzer achieves the highest F1 score for both NER and relation extraction.

Using Cybersecurity Analyzer
In order to present the documents in a clear and user-friendly manner, a web interface that receives the data from the Watson Discovery service through a REST API was developed. The main functionalities of the graphical interface of the Cybersecurity Analyzer application are:
• Presentation of the entities and classes relevant to the cybersecurity field for documents uploaded by users;
• Drawing a chart that gives an overview of the most important entities and classes relevant to the cybersecurity field identified in the uploaded documents (Figures 11 and 12);
• Presentation of the relations between the identified entities (Figure 13).

Text documents can be uploaded within the graphical interface, and predefined documents can be loaded to quickly explore the functionalities of the application. Figure 11 illustrates a chart obtained by loading the predefined Web Application Security document, in which 15 types of classes were identified, most entities belonging to the class Software (71). By selecting (left-clicking) a class represented in the chart, all the entities corresponding to that class can be viewed; Figure 12 illustrates the 16 entities identified for the class Attack.
The relations between the entities can be visualized individually for each sentence. Figure 13 shows the relations in the sentence: "This vulnerability enables remote attackers to target other users of the application, potentially gaining access to their data, performing unauthorized actions on their behalf, or carrying out other attacks against them." Numerous relations are identified, and each is represented as an arrow from the parent entity to the child entity. The entities are colored according to the class they belong to, consistent with the chart presented in Figure 11. An example of a relation illustrated in Figure 13 is: remote attackers exploit vulnerability.
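The per-sentence arrow rendering can be sketched in a few lines; this is a simplified text-mode illustration of the interface's drawing logic (the function, the color palette and the example relation names are hypothetical, not the application's actual front-end code):

```python
# Hypothetical class-to-color palette, mirroring the chart in Figure 11.
CLASS_COLORS = {"Attack": "red", "Vulnerability": "orange", "Software": "blue"}

def render_relation(relation_type, parent, child):
    """Render one extracted relation as a text arrow from the parent entity
    to the child entity, the way the web interface draws them per sentence."""
    return f"{parent} --{relation_type}--> {child}"

print(render_relation("exploits", "remote attackers", "vulnerability"))
# remote attackers --exploits--> vulnerability
```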

Conclusions and Future Work
This article described a prototype developed for the cognitive text analysis of cybersecurity-related documents. The general architecture was presented and each component was discussed. The main contribution of this article is the NLP model based on ML specialized for the cybersecurity field. The process of developing the model was extensively discussed, and the performance indicators were presented in order to demonstrate the model's validity. The comparison of our model's performances with those of other projects identified in the literature review showed that our model achieves better results. A web application that integrates our model has been developed in order to facilitate access to it.
In the future, we will consider other technologies for developing NLP models based on ML. Although Watson Knowledge Studio is a leading NLP tool, relying on a commercial service can be considered a limitation; therefore, we will concentrate on open-source solutions.
There is a growing interest in the development of semantic indexing systems and cognitive analysis solutions for cybersecurity documents written in different languages. The solutions described in this paper can be adapted to other languages by adjusting the ontologies and training the ML model with language-specific sets of documents. We are considering developing models such as Cybersecurity Analyzer for other languages as well.