Automatic Actionable Information Processing and Trust Management towards Safer Internet of Things

The security of the Internet of Things (IoT) is a very important aspect of everyday life for people and industries, as well as hospitals, the military, households and cities. Unfortunately, this topic is still under-researched and underdeveloped, which exposes users of the Internet of Things to possible threats. One of the areas which should be addressed is the creation of a database of information about vulnerabilities and exploits in the Internet of Things; therefore, the goal of our activities under the VARIoT (Vulnerability and Attack Repository for IoT) project is to develop such a database and make it publicly available. The article presents the results of our research aimed at building this database, i.e., how the information about vulnerabilities is obtained, standardized, aggregated and correlated, as well as the way of enhancing and selecting IoT related data. We have shown that existing databases provide various scopes of information and that, because of this, no single, most comprehensive source of information exists. In addition, various sources publish information about a vulnerability at different times—some of them are faster than others, and the differences in publication dates are significant. The results of our research show that aggregation of information from various sources can be very beneficial and has the potential to enhance the actionable value of the information. We have also shown that introducing more sophisticated concepts, such as trust management and metainformation extraction based on artificial intelligence, could ensure a higher level of completeness of information as well as evaluate the usefulness and reliability of data.


Introduction
According to the Cambridge Online Dictionary [1], "Internet of Things" refers to the "objects with computing devices in them that are able to connect to each other and exchange data using the Internet". Therefore, virtually any system that consists of interconnected computing devices that have unique identifiers and can transfer data over a network without human or computer interaction is an example of the "Internet of Things". However, the world of IoT is evolving, and the above definition is not the only accepted definition of an IoT device. For this reason, for the purposes of our research, we adopted the following definition: "IoT device: an item (except a phone, PC, tablet and data center hardware) equipped with network connectivity and the ability to collect and exchange data". Not only is the scale of use of these devices growing but so is their importance. They are used not only at home but also in hospitals, the military and industry; therefore, ensuring the security of Internet of Things devices is increasingly urgent and necessary.
According to IoT Analytics [2], in 2020 there were for the first time more IoT devices (e.g., connected cars, smart home devices, connected industrial devices) than non-IoT devices (smartphones, laptops and computers), and it is estimated that by 2025 the number of IoT devices will grow even further, as will the number of systems built on the basis of these devices. Another less obvious application is the analysis and monitoring of the quality (in the context of cybersecurity) of vendors or their products, which may allow predicting the existence of new, as yet unknown vulnerabilities. Such a concept, although not dedicated to the IoT world, can be found in the article [6]. From the perspective of IoT devices, such prospects can be even more important and more promising.
The process of creating the intended repository of vulnerabilities and exploits is shown in Figure 1 and can be briefly described as follows. The first step is the identification and selection of valuable sources of information related to vulnerabilities and exploits. Many types of sources are of interest, such as national vulnerability databases, CSIRT and vendor bulletins and other structured sources. It is worth mentioning that unstructured sources, such as blogs, reports or individual websites, can also be included in the repository. The next step consists of harvesting information from the sources and saving it in the so-called raw databases. In the next step, the information is standardized: for example, the names of the corresponding fields are unified and some supplementary information is added. As a result, the so-called low databases are created. These first three steps are described in Section 3. The aim of the fourth step is to correlate and then aggregate information from various sources about a vulnerability or an exploit. Details about that process can be found in Section 4. At the output of that process, the medium database is created. The medium database contains all the information from all low databases, and every entry in that database corresponds to one vulnerability or exploit. Every field within an entry contains information derived from the corresponding fields of the low databases. The next step is to enhance and select the most reliable information about every vulnerability and exploit. More details about this process can be found in Section 5. This process uses two separate mechanisms, metainformation extraction and trust management, described in Sections 6 and 7, respectively. The creation of the high database is the output of this process. The high database can then be shared and used for various purposes. To facilitate this, the last step, presentation, should be performed.
Filtering at different levels is also done to select information related to IoT. Various means are used to complete this task, such as: an internal IoT devices catalogue, a self-created taxonomy of IoT devices, a filtering mechanism based on keywords and, to some extent, information about devices from the sources of information about vulnerabilities and exploits (in the minority, as not many of them provide useful information in this context). The filtering process is not trivial; however, it will not be described in detail in this article.

Figure 2 presents the architecture of the repository from the perspective of the types of databases which are used to store information at various steps of processing. As mentioned earlier, many types of sources can be useful for the purpose of this work. The information contained in these sources can be provided in various formats. What is even more interesting, various mechanisms should be used to harvest information from the sources to create a raw database for each corresponding source. In a raw database there is no interference with the structure or the content of the information, but the information is saved in a common format, as a JSON file. Each raw database consists of information harvested from one source. Of course, every raw database can have many entries and contain information about many vulnerabilities or exploits. After the standardization process, the low databases are created.
The number of low databases is equal to that of raw databases; however, the number of entries and the structure of data differ due to the standardization process. The medium database is created by combining information from all low databases. All information about a vulnerability or an exploit is combined into a single entry. Various mechanisms (such as identifier matching or other means of correlation) are used to identify which entries in the low databases correspond to the same vulnerability or exploit. It is worth mentioning that any vulnerability or exploit can be described in many sources, so at the level of raw and low databases, as well as the medium database, the information can be duplicated. The last instance of the repository, the high database, is created to present comprehensive information about each vulnerability or exploit by selecting the most reliable piece of information for each parameter or by enhancement of existing information.

The architecture of the repository enables easily incorporating new information sources into the final information. Adding a new source of information requires three actions:
1. Implementing a mechanism for harvesting information from the source (creating a raw database for that source);
2. Transforming information to the common format (creating a low database for that source);
3. Setting the trust value of the new source (or implementing a more sophisticated trust calculation algorithm for that source; currently this is not, and probably never will be, needed, but our architecture supports the inclusion of such a mechanism).
The creation of the medium and high databases with the use of the newly added source will be done automatically.
On the technical layer, various technologies are used to harvest and process information, such as: Python scripts with various libraries (at all steps), Selenium (to harvest information from selected sources), Elasticsearch and Kibana (as a database and to process and analyze information).

Information Harvesting and Standardization
We are obtaining data from multiple sources with varying data formats and different languages. The broad description of vulnerability information sources was published in [7]. In addition to the vulnerabilities, we are also obtaining exploit data from Packet Storm [8] and Exploit-DB [9]. All structured data sources currently in use are listed in Table 1. Only publicly available free sources are considered, which disqualifies paid services such as vulnerability and exploit aggregator Vulners [21]. Besides the structured sources listed in Table 1, we are also obtaining write-ups, such as blogposts, about IoT vulnerabilities and exploits. In their case relevant metadata can be extracted from raw text, as discussed in Section 6.
From the data acquisition perspective, the sources can be divided into three categories:
1. Sources with API access;
2. Sources offering data feeds;
3. Sources offering only a website.
Sources with API access are a minority among publicly available free sources. Out of the sources listed in Table 1, an API is available only in JVNDB and, since March 2021, NVD [22]. There are more sources, usually national vulnerability databases, that offer data feeds in various formats: as JSON files (NVD), XML files (JVNDB, CNNVD and CNVD) or GitHub dumps (CERT/CC). While these may provide all the required vulnerability information, they are often lacking: the CNVD feed is incomplete, as it does not contain changes made to old entries and is updated with new ones only weekly; the CNNVD feed is available only to registered users, without open registration to the service; and the CERT/CC feed is updated only once per year. Therefore, these three sources, as well as the sources classified in the third category, have to be harvested using web scraping.
Web scraping is performed with custom Python scripts, using the Beautiful Soup [23] library to parse HTML files. A JavaScript engine is needed to retrieve data from the CNVD; therefore, we use a web browser through the Selenium Framework [24] to download it. In other cases Python's built-in network libraries are sufficient and are used instead. For each source the HTML data is parsed and relevant information is retrieved and stored in a JSON format. Entries in languages other than English are translated using Google Translate API [25]. Since raw databases are meant as raw representations of data from the actual remote sources, no other processing is done at this stage. Therefore, the structures of raw database entries are vastly different for each source and are incompatible with each other. Additional binary files available from some sources, usually exploit-related, are also downloaded and archived.
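A minimal sketch of this scrape-and-store step is shown below; the advisory HTML, its CSS classes and the record fields are invented for illustration, as each real source requires its own tailored parser:

```python
# Hypothetical sketch of harvesting one advisory page into a raw-database
# style JSON record. The HTML structure and field names are made up; each
# real VARIoT source has its own source-specific parser.
import json
from bs4 import BeautifulSoup

ADVISORY_HTML = """
<html><body>
  <h1 class="title">Buffer overflow in ExampleCam firmware</h1>
  <span class="date">2021-05-07</span>
  <div class="description">A stack-based buffer overflow allows remote code execution.</div>
</body></html>
"""

def scrape_advisory(html: str) -> dict:
    """Parse one advisory page and keep the content untouched, as in a raw DB."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.find("h1", class_="title").get_text(strip=True),
        "date": soup.find("span", class_="date").get_text(strip=True),
        "description": soup.find("div", class_="description").get_text(strip=True),
    }

entry = scrape_advisory(ADVISORY_HTML)
print(json.dumps(entry, indent=2))
```

The record deliberately mirrors the page content one-to-one: as described above, no normalization happens at the raw database stage.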
Entries stored in raw databases are used to create low databases, which follow one of the two unified formats, different for vulnerabilities and exploits. Standardizing entry formats allows for easier data aggregation and correlation in subsequent stages. Data processing involved in this step includes:

1. Parsing dates and saving them in an ISO 8601 compliant format;
2. Parsing affected products lists to separate vendor, product and version fields;
3. Parsing references to create a list of external IDs, which is used in later stages to correlate entries from different sources;
4. Dividing entries that contain multiple vulnerabilities to create one entry per vulnerability;
5. Adding IoT classification based on categories, tags, etc. from the source.
The process of transforming a raw database entry into a low database entry is supposed to keep all the data available in the former one; however, some minor losses are possible at this stage. For example, external IDs that do not comply with formats used by their corresponding sources are discarded. The unified low database entry structure for vulnerabilities is presented in Listing A1, found in Appendix A. Each low database is synchronized with its raw database immediately after any changes to the latter are done, ensuring that data available for the next processing steps is as current as possible.
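Steps 1–3 of this standardization can be sketched as follows; the helper names, accepted date formats and sample strings are illustrative, not the actual VARIoT code:

```python
# Illustrative sketch of the raw -> low standardization step: date
# normalization (step 1), affected-product splitting (step 2) and
# external-ID extraction (step 3). Field names are hypothetical.
import re
from datetime import datetime

CVE_RE = re.compile(r"CVE-\d{4}-\d{4,7}")

def parse_date(raw: str) -> str:
    """Normalize a source-specific date string to ISO 8601."""
    for fmt in ("%Y-%m-%d", "%d.%m.%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")

def parse_affected(raw: str) -> dict:
    """Split a 'vendor product version' string into separate fields."""
    vendor, product, version = raw.split(" ", 2)
    return {"vendor": vendor, "product": product, "version": version}

def extract_external_ids(references: list[str]) -> list[str]:
    """Collect CVE identifiers mentioned in reference URLs or text."""
    found = []
    for ref in references:
        found.extend(CVE_RE.findall(ref))
    return sorted(set(found))

low_entry = {
    "date": parse_date("May 7, 2021"),
    "affected_products": [parse_affected("D-Link DIR-850L 1.14")],
    "external_ids": extract_external_ids(
        ["https://nvd.nist.gov/vuln/detail/CVE-2021-0001"]
    ),
}
print(low_entry)
```

The external IDs collected here are exactly what the correlation stage described in the next section matches on.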

Information Aggregation and Correlation
One of the unique features of the VARIoT vulnerabilities database is the correlation and aggregation of information about vulnerabilities from different sources. As presented in the previous section, there are a lot of publicly available databases with information about vulnerabilities in different types of software and hardware. Only a few of these databases are solely dedicated to the IoT or at least somehow indicate such vulnerabilities, but none of them explicitly aggregate information from other sources (besides placing external links, e.g., Vulners.com, accessed on 7 May 2021).
The aggregation of information about vulnerabilities depends on two key features of data in the low databases: the common data format and the lists of external identifiers linking vulnerability descriptions in different sources. The common data format in the low databases helps in integrating data from matching entries into an entry in the medium database. External identifiers help to match entries from different databases. To aggregate entries from the low databases, at least one external identifier must match and all matching entries (from the low databases) must point to the same CVE or not have one. This means that low database entries with matching external identifiers but with different CVEs will not be aggregated. This constraint is mainly intended to limit aggregation. For example, some sources link similar vulnerabilities which are related, for example, by a common software or hardware stack or by the means of exploitation. However, the relation between linked vulnerabilities is very hard to assess automatically (i.e., the scope and severity of linked vulnerabilities may be completely different). Therefore, we decided to limit aggregation to a maximum of one CVE per entry. This process is presented in Figure 3. We have two CNNVD entries, two NVD entries and one SecurityFocus (BID) entry. These five entries point to two CVEs (CVE-2005-290 and CVE-2005-291). However, only SecurityFocus's entry points to both. Therefore, it is split into two separate entries in the low database (the NVD and CNNVD entries are just standardized to the common data format). In the medium database, the aforementioned entries from the low databases are merged into two entries with different CVE identifiers. In the next step, medium database entries are processed into the high database. Every medium database entry is a union of data from matching low database entries, with explicit information about the source of information in every field.
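The matching rule described above (at least one shared external identifier, and no conflicting CVEs) can be sketched as follows; the entry structure, source names and identifiers are simplified placeholders:

```python
# Sketch of the correlation rule: two low-database entries may be merged
# only if they share an external identifier and do not point to different
# CVEs. Entries and identifiers below are illustrative.

def same_cve(a: dict, b: dict) -> bool:
    """Entries are CVE-compatible if the CVEs are equal or one is absent."""
    return a["cve"] is None or b["cve"] is None or a["cve"] == b["cve"]

def can_merge(a: dict, b: dict) -> bool:
    shared = set(a["external_ids"]) & set(b["external_ids"])
    return bool(shared) and same_cve(a, b)

def aggregate(entries: list[dict]) -> list[list[dict]]:
    """Greedily group low-database entries into medium-database entries."""
    groups: list[list[dict]] = []
    for entry in entries:
        for group in groups:
            if all(can_merge(entry, member) for member in group):
                group.append(entry)
                break
        else:
            groups.append([entry])
    return groups

low_entries = [
    {"source": "NVD",   "cve": "CVE-2005-0290", "external_ids": ["BID-12345"]},
    {"source": "CNNVD", "cve": "CVE-2005-0290", "external_ids": ["BID-12345"]},
    {"source": "BID",   "cve": None,            "external_ids": ["BID-12345"]},
    {"source": "NVD",   "cve": "CVE-2005-0291", "external_ids": ["BID-99999"]},
]
print(len(aggregate(low_entries)))  # 4 low entries collapse into 2 medium entries
```

Note that, unlike in the Figure 3 example, a real source entry pointing to two CVEs would first be split into two low database entries before this grouping is applied.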
Examples of the aggregation of data are presented in Listings 1-3. Listing 1 presents the information from the affected_products fields aggregated for one vulnerability out of three low databases: NVD, CNNVD and JVNDB. The affected_products field contains information about the vulnerable software or hardware, including: vendor name, model name, affected versions and the scope of affected versions (i.e., equal, newer or older than). Listing 2 presents information from the description field from the three aforementioned databases. This is free text describing the nature of the vulnerability in more or less detail. In Listing 3 one can find titles aggregated from the raw databases. The title field contains a short description of the vulnerability. As one can see, the information is correlated but mostly duplicated, so it would be beneficial to keep only the unique pieces. The next step after aggregating data is to automatically handle similar data from different sources. Knowledge of the source of the information is needed to estimate how much every part of the data can be trusted. This is helpful in selecting the most reliable and insightful information on vulnerabilities from across all the matching sources.
The process of computing trust and selection of data is part of building the high database and is described in the next sections.

Information Enhancement and Selection
Information in the high database should be actionable for people as well as for machines. To achieve this, we need to process different fields of every entry in the medium database differently while building the high database. For example, the title field is more useful for people. This field can be less structured but should contain insightful yet concise information about the vulnerability. Therefore, for this field, we select only the best and most reliable information. For the description field, we use a more sophisticated method. To preserve as much information as possible from all the sources, we use the Word Mover's Distance (WMD) algorithm [26] from the Gensim library [27] with the fastText English model [28] to compare sentences of descriptions and get rid of duplicates (sentences from less trusted sources are removed). On the other hand, fields like affected_products, cpe (affected products' identifiers in the CPE format [29]) or cvss (Common Vulnerability Scoring System attack vectors and individual metrics [30,31]) should be easy to use with IT asset management or risk assessment tools. This information must be precise and well structured, but it can be broader, as it will be processed automatically. For the aforementioned fields (affected_products, cpe, cvss), we merge the data from all the sources, deduplicate it, sort it by trust level and present it as a list of values.
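The sentence-level deduplication can be sketched as below. The project uses WMD over fastText embeddings; to keep this example self-contained, a simple Jaccard token-overlap distance stands in for WMD, and the threshold and sample sentences are invented:

```python
# Sketch of cross-source description deduplication. A Jaccard token-overlap
# distance is used here as a self-contained stand-in for Word Mover's
# Distance; the threshold and example sentences are illustrative.

def distance(s1: str, s2: str) -> float:
    """Jaccard distance between token sets (0 = identical, 1 = disjoint)."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    return 1.0 - len(t1 & t2) / len(t1 | t2)

def merge_descriptions(descs_by_trust: list[list[str]], threshold: float = 0.3) -> list[str]:
    """Keep sentences from more trusted sources first; drop near-duplicate
    sentences contributed by less trusted sources."""
    kept: list[str] = []
    for sentences in descs_by_trust:  # ordered from most to least trusted
        for sentence in sentences:
            if all(distance(sentence, k) > threshold for k in kept):
                kept.append(sentence)
    return kept

nvd = ["A buffer overflow in ExampleCam allows remote code execution."]
cnnvd = ["A buffer overflow in ExampleCam allows remote code execution .",
         "The vendor has released a firmware update."]
merged = merge_descriptions([nvd, cnnvd])
print(merged)
```

The near-duplicate first sentence from the less trusted source is dropped, while its unique second sentence survives, which is exactly the behavior the WMD-based method aims for.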
Listings 4-6 present the results of our algorithms' work on the data from Listings 1-3 when moving an entry from the medium to the high database. The data in the affected_products field has been deduplicated and sorted by trust level. The data in the description field has been deduplicated and concatenated on the basis of the WMD algorithm's results and the trust in the sources. The information in the title field has been selected from the most trusted source. All information is presented with source identifiers and trust levels, so if necessary, the end users of the database can use their own filtering mechanisms to select data from particular sources or with a particular level of trust.

Metainformation Extraction
The data that we collected in the database was used to create dictionaries with information about vendors, models, device types or vulnerability types and to create a training dataset used to prepare an NLP/AI based solution. The main sources of information for which the mechanism will be used are unstructured data sources such as articles or blog entries about vulnerabilities in IoT devices. The information that can be extracted includes text keywords, a text summary, vendor names, device names and types, vulnerability types and external database identifiers. To extract text keywords from the input text, we use the Gensim library [32]. We want to get summaries that contain about 100 words; if the length of the input text is sufficient to create a summary, we use another method from the Gensim library [33]. The results of these two extraction methods give some general information about the text. Information about vendors, device names and device types is based on searching for words or phrases in prepared dictionaries. External database identifiers are extracted with a set of regular expressions collected during other VARIoT works. The metainformation extraction mechanism can also be used to estimate the criticality (CVSS) of the vulnerability presented in the description. We search for information about vulnerability types in two ways. The first way is to search for words or phrases from the prepared dictionary. The second is a custom Named Entity Recognition (NER) model from the Spacy library [34], based on training data prepared using the VARIoT project's data. The default NER model implemented in the Spacy library identifies basic information types like organizations, people and dates. We wanted to adapt this model to identify information about vulnerabilities, so it had to be significantly developed and adjusted. The current custom NER model enables identifying vulnerability types. It can also be used for extracting information about IoT vendors and model names.
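To illustrate the idea behind keyword and summary extraction, the following self-contained sketch ranks words and sentences by frequency; it is only a simple stand-in for Gensim's TextRank-based keyword and summarization helpers, and the sample text is invented:

```python
# Frequency-based keyword and extractive-summary sketch. This is a stand-in
# illustrating the idea only; the actual mechanism uses Gensim's
# TextRank-based keywords() and summarize() methods.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "to", "is", "and", "for", "was"}

def tokenize(text: str) -> list[str]:
    return [w for w in re.findall(r"[a-z0-9-]+", text.lower()) if w not in STOPWORDS]

def keywords(text: str, n: int = 5) -> list[str]:
    """Return the n most frequent non-stopword tokens."""
    return [w for w, _ in Counter(tokenize(text)).most_common(n)]

def summarize(text: str, max_words: int = 100) -> str:
    """Pick the highest-scoring sentences until the word budget is spent."""
    freq = Counter(tokenize(text))
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    ranked = sorted(sentences, key=lambda s: -sum(freq[w] for w in tokenize(s)))
    summary, words = [], 0
    for sentence in ranked:
        if words + len(sentence.split()) > max_words:
            break
        summary.append(sentence)
        words += len(sentence.split())
    return " ".join(summary)

text = ("A buffer overflow was found in the router firmware. "
        "The buffer overflow allows remote attackers to execute code. "
        "Users should update the firmware.")
print(keywords(text))
print(summarize(text))
```

Both helpers only give a rough overview of the text, which matches their role above: providing general information rather than precise metadata.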
By using only a dictionary-based solution, we would risk not finding phrases with a different word order or a different way of describing the information. With the help of the custom NER model, we can identify information that is not in the dictionary or that is present there in a different form. We used the rule-based matching method [35] to prepare the custom NER model training dataset. This Spacy library method allows matching phrases in the input text against previously prepared patterns based on special rules. Patterns take into account the following features of the input text:
• The occurrence of specific words;
• Part of speech;
• Types of special entity labels generated with the Spacy library;
• Punctuation;
• Case sensitivity.
The process of creating the NER model is as follows. The first step is creating the training dataset. For each sentence containing a phrase about a vulnerability type identified with rule-based matching, additional information was added: first, an entity label (the type of vulnerability); second, the string index range of the phrase related to the vulnerability type within the sentence. A dataset of elements prepared in this way was used to train the custom Named Entity Recognition model.
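One training example in the spaCy-style format (sentence, entity label and character index range) could be built as follows; the sentence and the label name are hypothetical:

```python
# Illustrative construction of one NER training example: the sentence, the
# entity label and the character index range of the vulnerability phrase,
# as produced by the rule-based matching step. Sentence and label name are
# made up for this sketch.

def make_training_example(sentence: str, phrase: str, label: str):
    """Return a (text, annotations) pair in the spaCy training format."""
    start = sentence.find(phrase)
    if start == -1:
        raise ValueError("phrase not found in sentence")
    end = start + len(phrase)
    return (sentence, {"entities": [(start, end, label)]})

sentence = "The device is prone to a cross-site scripting vulnerability."
example = make_training_example(sentence, "cross-site scripting", "VULN_TYPE")
print(example)
```

In the real pipeline, the phrase and its position come from the rule-based matcher rather than a plain substring search, but the resulting annotation structure is the same.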
The process of learning the NER model requires training data. The training dataset consists of sentences from the Japanese Vulnerability Database (JVNDB). JVNDB was selected to create the training dataset for two reasons. First, it contains approximately 130,000 entries, which makes it possible to create a comprehensive training dataset. Second, it presents vulnerability descriptions with a regular structure. The schematic structure of JVNDB descriptions allowed automatic extraction of information about vulnerability types using rule-based matching.
Below is an example of a description extracted from an article (Listing 7) about a vulnerability in Cisco devices, together with the results obtained with the metainformation extraction mechanism (Listing 8). In this example, the summary and keywords were generated with the Gensim library. Information about the vendor, product name and device type was found in the prepared dictionaries. The most interesting information extracted from the description concerns the vulnerability type; in both cases, with the dictionary and with the custom Named Entity Recognition model, vulnerability types were extracted with similar results. The last field in this data contains identifiers of external sources mentioned in the description.

Trust Management
Trust and reputation management (TRM) is used in various applications such as Wireless Sensor Networks, peer-to-peer networks, e-commerce platforms and recommendation systems [36,37]. It can be used to evaluate the trust of vendors of IT products [6] or IoT devices, but also for information processing methods. In our work, trust estimation is used to select the most reliable and informative piece of information as well as to evaluate its reliability.
For each source, its trust value (source trust, T_{S_k} ∈ [0, 1]) was assigned on the basis of experts' knowledge, taking into account properties such as: reliability, accuracy and comprehensiveness of the information presented in the source; recognition of the source in the community; documentation or available information related to the source; stability and topicality of the source; uniqueness of the information provided; self-consistency of information within the source; and consistency with information from other sources (by providing links to other sources or identifiers of entries in other sources). Source trust values for each source are presented in Table 2. For most of the fields, identical values of a field are aggregated and the sum of the trust values of the sources presenting the same information is calculated. This action is performed for each existing value of that field (for each piece of information). Let us assume that sources k till m have the same information (information i related to a field f). Then, the trust for that piece of information can be calculated as follows:

T_{i,f} = \sum_{j=k}^{m} T_{S_j}

After calculating the trust value for all pieces of information, the piece of information with the highest trust value is selected as the most reliable.
As we have indicated in the previous sections, we use another method to create a description that contains and aggregates information from all descriptions related to a particular vulnerability. Alternatively, however, we can also use trust to select one description from the set of existing descriptions. To calculate trust in a piece of information related to the description field, a modified method of trust calculation should be used. Every description is rather unique (if not, it most likely means that one source simply copied the description from another, because the probability of independently producing the same long text is very small). Nevertheless, even if a source has copied a description from another source (so these two sources are not independent), we expect that some sources also perform a simple verification of that description; therefore, if the same description is present in more than one source, we can trust it more than a description provided by only one source. On the other hand, we also want to take into account the length of the description. Because of that, we calculate the trust in a specific description by taking into account the source trust of the sources which provided that description and the length of the description (we assume that the longer the description, the better, as it can be more informative). So, to calculate the trust in a description d (provided by sources k till m), we use the following formula:

T_d = n_d \cdot \sum_{j=k}^{m} T_{S_j}

where n_d is the length of description d.
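Both calculations can be sketched in a few lines: per-field values accumulate the trust of every source presenting them, and description trust additionally weights the summed source trust by the description length. The source trust values below are illustrative, not the ones from Table 2:

```python
# Minimal sketch of trust-based selection: for each field value, sum the
# trust of all sources presenting that value and pick the highest; for
# descriptions, weight the summed source trust by description length.
# Source names and trust values are illustrative placeholders.

SOURCE_TRUST = {"NVD": 0.9, "CNNVD": 0.8, "ZSL": 0.6}

def select_field_value(values_by_source: dict[str, str]) -> str:
    """Pick the value with the highest summed source trust."""
    trust: dict[str, float] = {}
    for source, value in values_by_source.items():
        trust[value] = trust.get(value, 0.0) + SOURCE_TRUST[source]
    return max(trust, key=trust.get)

def description_trust(description: str, sources: list[str]) -> float:
    """Summed source trust weighted by the description length."""
    return len(description) * sum(SOURCE_TRUST[s] for s in sources)

title = select_field_value({
    "NVD": "ExampleCam buffer overflow",
    "CNNVD": "ExampleCam buffer overflow",
    "ZSL": "Overflow in ExampleCam",
})
print(title)  # the value backed by two sources (trust 1.7) beats the single-source one (0.6)
```

The example shows the key property of the approach: a value confirmed by two moderately trusted sources outranks a value from a single source, which plain source prioritization would not capture.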
To some extent, a source can present more reliable information for one field but less reliable information for other fields. Because of that, a more advanced approach could be implemented, in which the source trust value would be set not per source but per field in every source. This would, however, lead to a significant increase in the complexity of the trust mechanism and a multiplication of the source trust values that would have to be taken into account. The need to evaluate several times more trust values is the main drawback of such an advanced mechanism, and it would be hard to accomplish with the expert-based method, which is the most reliable one in this case. Moreover, such an advanced trust mechanism would probably not significantly change the trust results for the vast majority of evaluated pieces of information.
It is worth emphasizing that trust management is used to select the most reliable information when many (dissimilar) pieces of information from various sources exist, but also to estimate how much we can rely on the selected information. In theory, these two aims could (at least to some extent) be accomplished by other means, such as natural language processing or machine learning. NLP could be used to try to match information obtained from various fields (for example, between description and cvss); we use it in this way to extract such information from the description in case it is not provided separately, as explained in the previous section. It is less useful, however, when inconsistent information from various sources exists. On the other hand, we could try to use other machine learning methods, but the accuracy of the selection process would then be hard to evaluate, and the selection process itself would be hard to digest, analyze, audit and improve. Because of this, we decided to use the natural and easy to understand concept of trust. The concept of trust also has more benefits than a simple prioritization of sources because it takes into account the very typical situation in which many sources present consistent information within a field.
As indicated earlier, in case the same information exists in more than one source, we assume that this information is more reliable (we sum up the trust values of all sources that provide such information). That way of operation is valid when we are certain that all sources are independent of each other. However, this is not always true for sources of information about vulnerabilities and exploits: we have observed that a few sources duplicate information harvested from other sources without further verification. We have taken such situations into account by adjusting (namely, lowering) the source trust value of sources which copy information from other sources.

Results
There are 650,494 entries in the low databases, which gives 203,475 entries in the medium and high databases, of which 19,572 are IoT related (as of 21 May 2021). About 80% of all entries in the high database contain data from more than one source (mostly from two to four). Detailed results can be found in the plot in Figure 4.
Additionally, we compared the release dates of information in the sources composing the high database entries to see how many days can pass between the first and the last source publication. Results are presented in Table 3 and in Figure 5. These results take into account only high database entries with more than one source and more than one correct release date (146,484 entries). As some of the sources provide invalid release dates (for example, the beginning of the Unix epoch, i.e., 1 January 1970), we arbitrarily took into account only the dates after 1 January 2000. It is not a complete solution (for example, CNNVD has a history of faking release dates [38]) but it removes obvious errors. Seventy-five percent of entries have a difference of less than 166 days. Differences over 2000 days (more than five years) are mostly an effect of wrong release dates presented in the sources (as described above). As most of the differences are less than a year, we also analyzed delays in this time span. The results, presented in Figure 6, show that most of the differences between the first and the last mention in the sources are less than 61 days. Histograms of delays (in days) for every source are presented in Figure 7 and statistics in Table 4. We have also checked which databases most frequently report a vulnerability as the first one and which as the last one. The three databases that most frequently report as the first one are: SecurityFocus (BID), CNNVD and NVD. The three that most frequently report as the last one are: JVNDB, NVD and CNNVD. NVD and CNNVD appear in both top threes because they are also among the biggest databases; therefore, their raw (not normalized) counts outnumber other, smaller databases. Results for all the sources are presented in Figure 8. Merging information from many sources enhances the completeness of vulnerability data.
To illustrate the benefits of this process, we present an example of an entry from the medium database in Listing A2 in Appendix B. Even though this vulnerability has no CVE identifier, the CNVD database provides the CVSS vector, so one can assess the risk related to this vulnerability (the cvss field). On the other hand, Zero Science Lab (ZSL) provides a lot of external links for further reading, including an exploit (the references field). Finally, ZSL and CNVD present slightly different details of the affected products (the affected_products field). However, this form of data is not optimal, as it contains duplicated or missing information. To make it more useful, the data is deduplicated, aggregated and selected as described in Section 5. The results of these operations are presented in Listing A3 in Appendix B, which contains an example of the high database entry corresponding to the same vulnerability. To prove the validity of the method, it is worth analyzing the level of completeness of information achieved in the high database. To do this, we want to show statistics regarding the lack of information in particular fields. It is worth emphasizing that these statistics were computed before our extraction mechanism was applied. Table 5 summarizes the achieved statistics on information completeness. Of course, at that stage we do not try to analyze the accuracy of the information, just the completeness of the data. We will try to evaluate the accuracy of information by taking into account trust management results at the end of this section. As can be seen from Table 5, for entries about IoT vulnerabilities, information regarding the vulnerability description, affected products or vulnerability risk assessment is quite complete (higher than 99%). For all entries (including the IoT ones), the completeness of information is lower.
To further improve data completeness, and to handle unstructured data sources (e.g., blog posts or news) in the future, we have developed an NLP/AI tool that extracts metainformation from text, as described in Section 6. We focus on the vulnerability description because it is the most common type of information both in structured sources (as can be seen in Table 5 for the sources harvested by us) and in unstructured sources (where the description is the only type of information that can be harvested directly).
Below we present the evaluation of the metainformation extraction mechanism. To analyze the results generated by the mechanism, a test dataset was prepared from high database entries tagged as IoT information that have descriptions and defined vulnerability types. Applying these criteria, we obtained 14,841 entries from the high database. The next step was running the mechanism on the test dataset and collecting quantitative results of the metainformation extraction. General statistics can be found in Table 6. Text keywords ("text keyword gensim") were generated for all entries, and summaries ("text summary gensim") for 99% of entries. Information about the vulnerability type was found with the custom NER model ("vulnerability type ai") in 56% of entries, but with the vulnerability type dictionary ("vulnerability type dict") it was found in 95% of entries. The metainformation extraction process was able to obtain information about vendors ("vendor name") from 82% of entries and about devices ("device name") from 61% of entries.
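A dictionary-based extraction such as "vulnerability type dict" can be sketched as follows. The phrase dictionary here is a tiny illustrative assumption; the project's real dictionary of CWE identifiers and trigger phrases is far larger:

```python
import re

# Illustrative mapping from CWE identifiers to trigger phrases.
VULN_TYPE_DICT = {
    "CWE-79": ["cross-site scripting", "xss"],
    "CWE-89": ["sql injection"],
    "CWE-119": ["buffer overflow"],
}

def extract_vuln_types(description):
    """Return the CWE identifiers whose trigger phrases occur
    (as whole words) in the vulnerability description."""
    text = description.lower()
    return sorted(
        cwe for cwe, phrases in VULN_TYPE_DICT.items()
        if any(re.search(r"\b" + re.escape(p) + r"\b", text)
               for p in phrases)
    )

print(extract_vuln_types(
    "A stack-based buffer overflow in the web UI allows remote attackers ..."
))  # ['CWE-119']
```

A real implementation would also have to handle morphological variants and ambiguous phrases, which is where the complementary custom NER model comes in.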
For a deeper analysis, we verified the results of the metainformation extraction process. We chose the vulnerability type, as this type of information has the lowest level of completeness for IoT vulnerabilities, as can be seen in Table 5. The verification compared the vulnerability types generated by the mechanism with the information contained in the test dataset (the CWE dictionary identifier or a phrase defining the vulnerability type).
The results (presented in Figure 9) compare the extracted vulnerability types (obtained with the custom NER model or the vulnerability type dictionary) with the vulnerability type information from the test dataset entries. Data labeled "all entries" denote the whole test dataset. The mechanism was able to extract information about the vulnerability type using the custom NER model or the dictionary from 96% of entries ("entries with identified vulnerability type"). For 65% of entries, the extracted vulnerability types matched the vulnerability type information assigned to the specific entry ("successfully identified vulnerability type"). The actual performance of the metainformation extraction may be higher than this verification suggests, as the results may contain false negatives caused by how the descriptions in the CWE dictionary or the phrases defining the vulnerability type are structured.
In addition to the metainformation extraction performed on the test dataset, we ran the extraction process on other data. In Figure 10 we present the results of metainformation extraction on high database entries without information about the vulnerability type. There are 4731 entries that are tagged as information about IoT devices and have no vulnerability type defined. We extracted vulnerability types (with the custom NER model or with the dictionary) for 89% of these entries ("entries with identified vulnerability type" in Figure 10). The results are similar to those presented in Figure 9 ("entries with identified vulnerability type"). In this case, we cannot verify the results because of the lack of ground-truth information about the vulnerability type, but the quantitative results are similar to those obtained on the test dataset.
At the level of the high database we have faced two types of problems. The first one, the lack of information in a specific field, we tried to solve with our extraction mechanism; the results are presented above. The second problem is the existence of various, possibly mutually exclusive, pieces of information for a single field. For example, a vulnerability can have a CVSS score of 5.0 in one database and 10.0 in another. To solve this problem we use our trust mechanism, as described earlier. In the next paragraphs we present some statistics related to the effectiveness of this mechanism in selecting the most reliable information.

First, we calculate the average trust value for every field (taking into account only the highest value of trust related to that field within a vulnerability, i.e., we analyze just the trust value of the most trusted information). We also show the minimum and the maximum of the highest trust value. It is worth noting that a trust level above 1.0 implies that the information under assessment was presented by more than one source, whereas a trust level equal to or lower than 1.0 does not imply that the information was presented by only one source. The results also show which fields are most often repeated among many sources. The results are shown in Table 7. They show that, on average, information in fields related to vulnerability risk assessment, vulnerability type or affected products is aggregated from more than one source (indicated by an average trust value significantly higher than 1.0). Such information is often repeated (or set independently) by various sources, and because of that these fields benefit the most from the aggregation process. Of course, fields such as the title are almost always unique, so they have the lowest average trust value.
On the other hand, the description field has the highest average trust value, even though it is unique; this is due to a different trust calculation formula being used for it.
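The selection of the most trusted value for a field can be sketched with a simple additive model. The per-source weights below are made-up assumptions for illustration; the real system derives trust from its own evaluation, not from these numbers:

```python
# Illustrative per-source weights (assumed, not the project's values).
SOURCE_WEIGHT = {"NVD": 0.9, "CNNVD": 0.7, "JVNDB": 0.6}

def most_trusted(candidates):
    """candidates: list of (source, value) pairs for one field.
    Sum the weights of the sources backing each distinct value and
    return the value with the highest accumulated trust."""
    trust = {}
    for source, value in candidates:
        trust[value] = trust.get(value, 0.0) + SOURCE_WEIGHT.get(source, 0.5)
    best = max(trust, key=trust.get)
    return best, trust[best]

value, trust = most_trusted([("NVD", "7.5"), ("CNNVD", "7.5"), ("JVNDB", "5.0")])
print(value, round(trust, 2))  # 7.5 1.6
```

Under this model a trust value above 1.0 can only arise when several sources back the same value, matching the interpretation of trust levels described above.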
The only field with a numerical value that can be put under more in-depth analysis is the CVSS score (vulnerability risk assessment). For all IoT-related vulnerabilities we analyzed the dispersion of the CVSSv2 value; the results can be found in Table 8. We take into account the highest difference within an entry, both for the CVSSv2 score and for the trust value related to that score. The results presented in Table 8 show that the dispersion of the risk assessment for an entry can be very high (max = 10.0, which means that one source assigns a vulnerability the lowest possible CVSS score of 0.0 while another assigns it the highest possible score of 10.0), but the average CVSS difference is rather small. This shows that sources largely agree on risk assessment. As can be seen, the difference between trust values related to that information is also rather small on average.
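The dispersion statistics can be reproduced with a short sketch; the entry identifiers and score lists below are hypothetical sample data:

```python
# Hypothetical CVSSv2 scores reported by the sources of each entry.
entries = {
    "ENTRY-1": [5.0, 7.5, 10.0],  # sources disagree strongly
    "ENTRY-2": [4.3, 4.3],        # sources agree
}

def cvss_dispersion(scores):
    """Highest CVSSv2 difference within one aggregated entry."""
    return max(scores) - min(scores)

dispersions = [cvss_dispersion(s) for s in entries.values()]
print(max(dispersions), sum(dispersions) / len(dispersions))  # 5.0 2.5
```

The maximum over all entries corresponds to the "max" column of Table 8, and the mean to the reported average CVSS difference.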
The results regarding trust evaluation show that the trust mechanism works as intended and can be used to evaluate the accuracy of the provided information. All results show that the created database of vulnerabilities and exploits can be beneficial and useful to the community of IoT cybersecurity analysts, as it is as comprehensive as possible given the public sources of such information.

Summary and Future Works
This article has shown that gathering and automatically processing actionable information on IoT vulnerabilities in a way that yields the best results is a nontrivial task, but one that is currently necessary. Information on vulnerabilities makes it much easier to manage the security of IoT devices. Due to the very rapid growth in the use of IoT devices, they are more and more often used in attacks, and the lack of security measures may cause attacks to become more and more frequent. To avoid this, it is advisable to make users and network owners aware of the vulnerabilities in these devices. Until now, information about them could be found in various places; it was fragmented, incomplete and often unstructured. Creating a publicly available structured database of information about known technical vulnerabilities and exploits is of great benefit to all interested parties: users, producers and network owners.
Our research has shown that by collecting data from various sources, we can obtain a more comprehensive entry than from any single source. Because our database collects, correlates and aggregates data from various sources, each entry is rich in actionable information; this also reduces the risk of missing data or delays in obtaining information on vulnerabilities.
As future work, we will enhance our metainformation extraction mechanisms to support other types of information, and we want to evaluate this mechanism further.
We will also harvest information from other (including unstructured) sources, which will significantly increase both the usefulness of, and the need for, our metainformation extraction mechanism. We also plan to create a search engine optimized to find information on the Internet related to IoT vulnerabilities and exploits.
The most important implication of our research is that there is still much work to do to improve vulnerability management regarding IoT. To move towards that goal, we have focused on providing more comprehensive and reliable actionable information. The mechanisms implemented and the information provided by our work can be the ground for building various services. The database could be used by vulnerability scanners, not as the engine of the scanning process but as a repository of information about vulnerabilities, because the information collected by us is, in the context of IoT, broader and more ample than the information used by common vulnerability scanners. Another natural, easy to build, but still very practical service could provide a list of possible vulnerabilities on the basis of the product name (vendor and model). Of course, an IoT asset inventory must be made beforehand, but such an approach can give much better results than scanning (for example, because scanners can improperly recognize a device and still not verify the real existence of a vulnerability).
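Such a product-name lookup service could be sketched as follows. The index structure, entry identifiers and product names are hypothetical, chosen only to illustrate the idea:

```python
from collections import defaultdict

# Hypothetical in-memory index built from high database entries:
# (vendor, model) -> list of vulnerability entry identifiers.
index = defaultdict(list)

def add_entry(entry_id, affected_products):
    """Register an entry under each (vendor, model) pair it affects."""
    for vendor, model in affected_products:
        index[(vendor.lower(), model.lower())].append(entry_id)

def lookup(vendor, model):
    """List possible vulnerabilities for an inventoried device."""
    return index.get((vendor.lower(), model.lower()), [])

add_entry("ENTRY-1", [("AcmeCam", "IPC-1000")])
print(lookup("acmecam", "IPC-1000"))  # ['ENTRY-1']
```

The lookup is case-insensitive on vendor and model, since an asset inventory rarely matches the database's spelling exactly; a production service would additionally need fuzzy product-name matching and version-range handling.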
This article is based on the results obtained during the work in the Vulnerability and Attack Repository for IoT project [39]. This project involves not only creating and sharing information about vulnerabilities and exploits but also scanning the Internet in order to obtain a security picture of IoT devices. Laboratories to test legitimate and malicious IoT traffic, IoT artefacts and IoT anomaly models were also built. Moreover, various types of statistics related to devices in a given country will be created. All these tasks are performed in cooperation with our partners, i.e., Stichting The Shadowserver Foundation Europe, Security Made In Letzebuerg G.I.E., Institut Mines-Télécom and Mondragon Goi Eskola Politeknikoa Jose Maria Arizmendiarrieta S COOP.
Data prepared by us will be available on the European Data Portal (through national data portals, including Poland's Open Data Portal [40]), as well as through many other channels such as the MISP platform (Malware Information Sharing Platform), which is commonly used by the community of cybersecurity analysts.

Acknowledgments: We would like to show our gratitude to Adam Kozakiewicz from the Research and Academic Computer Network (NASK) for comments and insights that greatly improved the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: