This section outlines the experiments executed to evaluate the performance of the COSMOS framework as a malware detector. As previously described in Section 2, the malware samples are sequentially analyzed by three defensive layers (i.e., Internal Analyzer, ML Module, and External Analyzer), following the defense-in-depth approach. In particular, we report separately the analysis of Android malware and that of other platforms (e.g., Windows, Linux), since they can be considered as two different use cases implying different execution flows.
It is worth noting that the current implementation of the ML layer is effective only for Android malware detection, since the learning model is trained to distinguish such samples. Nonetheless, the operation flow is easily extensible to the detection of other file types (e.g., .exe, .sh, .deb) by training another learning model on such files. Indeed, we are currently working on this extension, which represents an interesting and challenging line of future work.
4.2.1. Android Malware Detection
In this section, we describe the experiments conducted on the COSMOS architecture regarding Android malware detection. Such experiments were carried out over a set of 5000 Android malware samples belonging to different families [34]. The samples were directly injected into the catching directory of the IoT sentinel so that the main detection module was triggered against the incoming threats. To better assess COSMOS performance, we varied the sizes of the samples according to the size classes defined in [34]. Specifically, the selected size classes were: (i) less than 100 KB, (ii) from 100 KB to 500 KB, (iii) from 500 KB to 1 MB, (iv) from 1 MB to 5 MB, and (v) more than 5 MB.
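To illustrate the injection mechanism, the following is a minimal sketch of how the sentinel's catching directory could be monitored in Python using the watchdog library. The directory path and the analyze_sample entry point are hypothetical placeholders, not the actual COSMOS implementation.

```python
# Minimal sketch (assumed implementation): watch the sentinel's catching
# directory and fire the detection pipeline on each newly dropped sample.
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

CATCHING_DIR = "/opt/cosmos/catching"  # hypothetical path

class SampleHandler(FileSystemEventHandler):
    def on_created(self, event):
        # Ignore sub-directories; only analyze regular files.
        if not event.is_directory:
            analyze_sample(event.src_path)  # hypothetical pipeline entry point

def analyze_sample(path):
    print(f"New sample to analyze: {path}")

observer = Observer()
observer.schedule(SampleHandler(), CATCHING_DIR, recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```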
Regarding the machine learning model, the module was trained using a small subset of malware (i.e., 100 samples) and tested using 20-fold cross-validation. For the static detection layer, a set of 72 Yara rules was used; these rules were originally taken from the official Yara-rules repository concerning mobile malware detection. To assess the sentinel’s behavior, the following parameters were measured, as reported in Table 2: (i) the overall resource consumption of the framework’s components (i.e., CPU and RAM consumption), (ii) the response time, that is, the time a ring requires to give a response about a given sample, and (iii) the detection rate, meaning the percentage of malware samples detected in each ring for a given experiment. We believe that the above-mentioned parameters are particularly critical in the context of IoT threat detection.
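As a concrete illustration of the static layer, the snippet below sketches how a set of Yara rules could be compiled once and matched against each incoming sample using the yara-python binding; the rule file path is a hypothetical placeholder.

```python
# Minimal sketch (assumed setup): compile the mobile-detection Yara rules
# once, then match every incoming sample against them.
import yara

# Hypothetical path to the rule set (72 mobile-detection rules in our case).
rules = yara.compile(filepath="/opt/cosmos/rules/mobile_malware.yar")

def static_scan(sample_path):
    """Return the list of rule names that matched the sample."""
    matches = rules.match(sample_path)
    return [m.rule for m in matches]

hits = static_scan("/opt/cosmos/catching/sample.apk")
print("Matched rules:", hits)
```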
Relevant results are shown in Figure 5, where the response time of the Sentinel for different malware sizes is depicted in Figure 5a, the detection rate of the three malware detection rings is shown in Figure 5b, and finally the resource consumption in terms of CPU and RAM is illustrated in Figure 5c,d, respectively.
Regarding the response time, Figure 5a illustrates the time required by the three defense rings of COSMOS to analyze the malware samples. It has to be stated that, during this experiment, the response time was calculated separately for each ring to obtain a more accurate per-ring measure. Additionally, we decided to plot the median of the collected time values, since this statistical measure is more robust to the possible presence of outliers in the results. As expected, the first ring (i.e., Yara) is the fastest within the architecture. This characteristic is mainly due to the fact that the Yara rules used to detect the peculiarities of Android malware are relatively few in number (i.e., 72 rules). Moreover, since the enabled rules are signature-based, the time required by the detection engine to evaluate them is negligible. Concerning the other rings, for malware sizes below 500 KB, the machine learning module performs faster than VirusTotal; this trend is inverted for malware sizes greater than 500 KB. This outcome is justifiable by looking at the different operations performed internally by the second and third rings. In fact, while the machine learning module checks the permissions required by the Android application, VirusTotal only computes the hash of the file and sends it to external databases through the Internet connection. Thus, while the response time of VirusTotal remains fairly constant (except for very large malware samples, for which the time needed for the hash calculation becomes considerable), the time required by the machine learning tool increases linearly.
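The hash-based lookup performed by the third ring can be sketched as follows, here using the public VirusTotal v2 file/report endpoint; the API key is a placeholder, and the exact endpoint used by COSMOS is an assumption on our part.

```python
# Minimal sketch (assumed API usage): hash the sample locally, then query
# the VirusTotal v2 file/report endpoint with that hash.
import hashlib
import requests

VT_API_KEY = "YOUR_API_KEY"  # placeholder
VT_REPORT_URL = "https://www.virustotal.com/vtapi/v2/file/report"

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def vt_lookup(path):
    digest = sha256_of(path)
    resp = requests.get(VT_REPORT_URL,
                        params={"apikey": VT_API_KEY, "resource": digest})
    report = resp.json()
    # response_code == 1 means the hash is known to VirusTotal.
    if report.get("response_code") == 1:
        return report.get("positives", 0) > 0
    return None  # unknown hash: the sample must be uploaded (sketched further below)
```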
Another interesting result obtained during the time analysis experiments is the trend of the response time of the different rings. That is, Figure 5a shows that the malware analysis time of the rings increases as the malware size grows. Specifically, the first ring (i.e., Yara) performs a static analysis of the malware code in order to detect possible matches with the enabled rules; thus, the time required for this static analysis increases for bigger malware. Similarly, the second ring (i.e., the machine learning module) checks the permissions requested by the Android application in order to discriminate between goodware and malware. Therefore, the time required for the permission analysis depends on the malware size, mainly because of the time required for the decompilation and the permission extraction. Likewise, as previously stated, the third ring (i.e., VirusTotal) needs to compute the hash of the input file and then send it online to external databases. Consequently, a bigger input file implies a longer hash generation time before the hash can actually be sent to the VirusTotal API to obtain a malware report. Furthermore, when a sample hash is not found in the VirusTotal database, the sample is directly uploaded to VirusTotal so that it can be incorporated and analyzed, which extends the ring execution time depending on the sample size.
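For completeness, the permission extraction step of the second ring can be sketched with the androguard library as follows; COSMOS may use a different extractor, so both the tooling and the feature subset below are assumed for illustration only.

```python
# Minimal sketch (assumed tooling): extract the requested permissions from
# an APK with androguard and turn them into a binary feature vector.
from androguard.core.bytecodes.apk import APK

# Hypothetical subset of permissions the learning model was trained on.
FEATURE_PERMISSIONS = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
]

def permission_features(apk_path):
    apk = APK(apk_path)
    requested = set(apk.get_permissions())
    # 1 if the permission is requested by the sample, 0 otherwise.
    return [1 if p in requested else 0 for p in FEATURE_PERMISSIONS]

features = permission_features("/opt/cosmos/catching/sample.apk")
print(features)  # e.g., [1, 0, 1]
```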
All in all, the conducted experiments on the response time of the sentinel w.r.t. Android malware detection show that the time required for the malware analysis is lower than 3.1 s even in the worst case, that is, a very large file analyzed by the slowest ring (i.e., the second one). On average, the sentinel is able to analyze a malware sample within a second. This allows us to conclude that the proposed solution is suitable in the context of near real-time detection.
Regarding the detection rate, Figure 5b depicts the detection accuracy of the three rings. In this context, we refer to the detection rate as the percentage of detected malware in a specific run. It has to be stated that, contrary to the experiments executed to measure the response time, the detection rate was calculated by enabling all the rings sequentially. In other words, the malware samples were initially analyzed by the first ring; then, if Yara was not able to detect the malicious file, the second ring came into play, and so on until the third ring. Hence, Figure 5b shows a cumulative detection rate for the malware samples. A remarkable result is the low detection rate exhibited by the first ring: the static malware analysis performed with the Yara rules downloaded from the official repository [47] remained at 8%. However, the presence of Yara in the proposed IoT detection framework relies on the fact that Yara rules can be read and compared against sample features quickly. Moreover, new Yara rules are being developed and contributed constantly, backed by an ever-growing supporting community. Additionally, if a malware sample is able to elude the detection engine, an ad hoc Yara rule can be automatically generated which will detect the malicious trace in the future. This beneficial capability fulfills the previously described security goal of adaptability for the sentinel, as mentioned in Section 2, which we believe is of crucial importance in IoT contexts. Furthermore, adding more Yara rules directly implies a slower static analysis which, in the case of Android malware detection, does not concretely increase the detection rate.
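The sequential (cumulative) ring evaluation described above can be summarized by the following sketch, in which each ring is modeled as a hypothetical callable returning whether it flags the sample; the detector stand-ins are placeholders rather than the actual COSMOS interfaces.

```python
# Minimal sketch (assumed interfaces): evaluate the three rings in order and
# short-circuit as soon as one ring detects the sample.
def detect(sample_path, rings):
    """rings: ordered list of (name, detector) pairs, where each detector
    returns True if it flags the sample as malicious."""
    for name, detector in rings:
        if detector(sample_path):
            return name  # detected at this ring; later rings are skipped
    return None  # undetected by all the rings

# Hypothetical stand-ins for the three defensive layers.
rings = [
    ("yara", lambda p: False),        # first ring: static Yara scan
    ("ml", lambda p: True),           # second ring: permission-based ML model
    ("virustotal", lambda p: True),   # third ring: VirusTotal lookup
]
print(detect("/opt/cosmos/catching/sample.apk", rings))  # -> "ml"
```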
Another interesting feature shown in Figure 5b is the absence of undetected malware. Specifically, none of the selected 5000 samples remains undiscovered after passing through the three rings of the sentinel. For the most part, the second ring (i.e., the machine learning module) is able to spot the malicious files, reaching a notable 75% detection rate. Lastly, the third ring (i.e., VirusTotal) is able to block all the remaining malware, bringing the cumulative detection rate of COSMOS to 100%. This result can be explained by looking at the operation mode of VirusTotal: when a malware hash is not present in its databases, the tool requires the upload of the entire file in order to process it internally. Since the Drebin dataset used during the experiments was made available in 2014, the involved malware instances had already been uploaded and analyzed by VirusTotal. Thus, for the Drebin malware, VirusTotal manages to correctly label all samples efficiently and in a timely fashion.
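The upload fallback mentioned above could look as follows with the VirusTotal v2 file/scan endpoint, complementing the earlier vt_lookup sketch; again, the endpoint choice and the key are assumptions rather than the confirmed COSMOS configuration.

```python
# Minimal sketch (assumed API usage): when the hash lookup misses, upload
# the sample to VirusTotal so it can be incorporated and analyzed.
import requests

VT_API_KEY = "YOUR_API_KEY"  # placeholder
VT_SCAN_URL = "https://www.virustotal.com/vtapi/v2/file/scan"

def vt_upload(path):
    with open(path, "rb") as f:
        resp = requests.post(VT_SCAN_URL,
                             params={"apikey": VT_API_KEY},
                             files={"file": (path, f)})
    # The scan is asynchronous: the returned scan_id can be polled later
    # through the file/report endpoint.
    return resp.json().get("scan_id")
```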
Furthermore, Figure 5c,d depict the impact of the COSMOS architecture on the Raspberry Pi’s resources. In particular, we recorded the CPU and RAM usage for each experiment, plotting also in this case the median values. Considering the CPU consumption, Figure 5c shows that the overall CPU usage for very small Android malware (i.e., less than 100 KB) is lower than 21.5%, while for larger malware sizes it increases up to 25% of utilization. This outcome may be explained by looking at the operations performed by the three detection modules; that is, each of them is directly influenced by the malware size, thus implying a higher computational demand. Similarly, for the RAM consumption, Figure 5d reveals that the malware size straightforwardly impacts the RAM usage, since a bigger malware sample requires more memory to be analyzed. However, the overall RAM consumption never exceeds 350 MB in the worst case (very large malware samples), which can be considered acceptable in this context.
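Resource measurements of this kind can be collected with the psutil library, as in the minimal sketch below; the sampling interval and the choice of system-wide (rather than per-process) counters are assumptions for illustration.

```python
# Minimal sketch (assumed methodology): sample system-wide CPU and RAM
# usage during an analysis run and report the medians, as plotted in the
# figures.
import statistics
import psutil

def sample_resources(n_samples=30, interval=1.0):
    cpu, ram = [], []
    for _ in range(n_samples):
        # cpu_percent blocks for `interval` seconds and returns the
        # utilization over that window.
        cpu.append(psutil.cpu_percent(interval=interval))
        ram.append(psutil.virtual_memory().used / (1024 ** 2))  # MB
    return statistics.median(cpu), statistics.median(ram)

cpu_med, ram_med = sample_resources()
print(f"Median CPU: {cpu_med:.1f}%  Median RAM: {ram_med:.0f} MB")
```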
In conclusion, the experimental results on Android malware detection prove the ability of COSMOS to analyze and detect the incoming malware samples. In particular, our proposal is able to analyze most of the malware in less than two seconds, making it suitable for online threat detection. Moreover, by combining the powerful capabilities of static rules, anomaly detection and external knowledge, the detection rate reaches 100% for the used dataset. Additionally, the impact on the resources of the Raspberry Pi is not critical, reaching 25% of CPU usage and 350 MB of RAM usage in the worst case (i.e., very large malware samples). Thus, we can safely conclude that the proposed framework is suitable to protect IoT devices from potential Android malware threats.
4.2.2. Non-Android Malware Detection
This section describes the experiments performed to evaluate the performance of COSMOS w.r.t. the detection of generic malware. As previously mentioned, in this case the ML module is not operational, thus the malware samples are passed to the static detection layer (i.e., the Internal Analyzer using Yara rules) and subsequently to the External Analyzer (i.e., VirusTotal).
The experiments were performed over a set of 5000 malware samples belonging to different families. The malware samples, excluding the Android ones, were originally collected from open-source databases of generic malware [48], such as Offensive Computing [49] and Virus Sign [50], among others [51]. As previously explained, also for this experiment the files were directly injected into the catching directory of the Sentinel, thus starting the overall analysis. In this experiment, we selected the following size classes: (i) less than 100 KB, (ii) from 100 KB to 500 KB, (iii) from 500 KB to 1 MB, and (iv) from 1 MB to 5 MB.
Regarding the static detection layer, we decided to increase the number of loaded Yara rules in order to assess possible differences in terms of performance. Specifically, we used 474 rules taken from the official Yara-rules repository concerning the detection of different malware families. Similarly to the Android detection testing, to discuss the sentinel’s behavior, the following parameters were measured, as reported in Table 2: (i) the global resource consumption of the framework’s components (i.e., CPU and RAM consumption), (ii) the response time, that is, the time a particular defensive layer requires to give a response about a given sample, and (iii) the detection rate, meaning the percentage of malware samples detected in each layer for a given experiment. As before, these parameters are particularly critical in the context of IoT threat detection.
Significant results are shown in Figure 6, where the response time of the Sentinel for different malware sizes is depicted in Figure 6a, the detection rate of the rings is shown in Figure 6b, and finally the resource consumption in terms of CPU and RAM is illustrated in Figure 6c,d, respectively.
Starting with the response time, Figure 6a depicts the time required by the defensive layers to analyze a malware sample. It is worth mentioning that the response time was calculated separately also in this experiment, so as to accurately assess the performance of each ring. Moreover, the median of the obtained time values has been plotted, since it is more robust against possible outliers. Contrary to the results shown in Figure 5a, the first ring (i.e., Yara) is the slowest one in this experiment. This outcome is mainly due to the number of rules used to detect the malware samples: the bigger the rule set, the slower the overall execution, since the code of the malware has to be checked against more rules in order to find potential matches. Thus, the time required to complete the static analysis increases linearly with the malware size, reaching 3.5 s for large samples. Nonetheless, the time performance of the static layer may be further improved, for instance, by enabling only a specific subset of rules, or by studying the number of correct hits generated by each rule. Such improvements would decrease the overall execution time of the static ring. On the other hand, as already discussed in the previous sections, the response time of VirusTotal remains fairly constant, except for large malware samples, for which the time needed to compute the hash becomes considerable.
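One way to enable only a subset of rules, as suggested above, is to compile the rule files into separate namespaces and load only the families of interest; the per-family file layout below is hypothetical.

```python
# Minimal sketch (assumed layout): compile only a chosen subset of the rule
# files, grouped by namespace, to trade detection breadth for scan speed.
import yara

# Hypothetical per-family rule files; enabling fewer namespaces shortens
# the per-sample scan time.
ALL_RULE_FILES = {
    "ransomware": "/opt/cosmos/rules/ransomware.yar",
    "botnets": "/opt/cosmos/rules/botnets.yar",
    "packers": "/opt/cosmos/rules/packers.yar",
}

def compile_subset(enabled):
    paths = {ns: p for ns, p in ALL_RULE_FILES.items() if ns in enabled}
    return yara.compile(filepaths=paths)

rules = compile_subset({"ransomware", "botnets"})
for m in rules.match("/opt/cosmos/catching/sample.bin"):
    print(m.namespace, m.rule)
```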
Regarding the detection rate, Figure 6b illustrates the detection accuracy of the enabled rings. Also in this case, we refer to the detection rate as the percentage of detected malware samples in a particular experimental run. Additionally, unlike the response time experiments, the presented results were calculated by enabling all the rings sequentially. Specifically, the malware samples were initially analyzed by the static layer; then, if Yara was not able to detect the malicious file, the third ring came into play. Hence, Figure 6b shows a cumulative detection rate for the malware samples. An interesting result is the high detection rate achieved by Yara, which reached 91% during the experiment. This outcome is mainly explained by the larger number of rules loaded into the Sentinel during this experimental session. In particular, compared with the Android detection experiments presented in Section 4.2.1, we increased the number of rules from 72 to 474, which is the entire ruleset available in the official Yara-rules repository. The main disadvantage of this change, as previously mentioned, is the worse time performance, while the detection is clearly improved. Thus, we can safely conclude that the Yara-rules repository performs well against generic malware, while its Android detection rules are still an ongoing effort.
Another interesting result shown in Figure 6b is the presence of a small portion of undetected malware, together with the low accuracy exhibited by VirusTotal. This outcome is justifiable considering, on the one hand, the higher complexity of generic malware compared with the Android samples. On the other hand, the malware traces analyzed during the experiment also include files with specific extensions (i.e., .bytes, .asm) which VirusTotal is not able to analyze, whereas the rule-based static analyzer performs accurately on such extensions.
Furthermore, Figure 6c,d show the impact of COSMOS on the Raspberry Pi’s resources w.r.t. generic malware detection. Also in this experiment, we recorded the CPU and RAM utilization of the detection components, plotting the median values for the different malware sizes. Regarding the CPU usage, Figure 6c shows that the overall usage is around 25%, with a low peak for very small malware (i.e., less than 100 KB). Comparing this outcome with the Android detection experiments, we can conclude that the different malware types do not impact the CPU performance, which remains at an acceptable level. Regarding the RAM consumption, the trend is similar to that of the Android experiment presented in Figure 5d: a bigger malware size directly implies a higher RAM usage. The main difference between the two tests is that the RAM consumption for generic malware detection is slightly higher, since more rules are loaded into the static analyzer. Nevertheless, the overall RAM usage never exceeded 380 MB in the worst case, thus representing an acceptable result considering the Raspberry Pi’s capabilities.
All in all, the experimental results w.r.t. generic malware detection demonstrate the capabilities of COSMOS in dealing with incoming malware samples. Specifically, our solution reaches 92% accuracy and analyzes the samples in less than 4.5 s (in the case of very large malware). Such results may be further improved by further studying the Yara-rules repository, but also by developing and enabling the remaining protection layer (i.e., the ML model), which proved its capabilities on the Android samples. Moreover, the impact on the Raspberry Pi’s resources is not critical, reaching 25% of CPU usage and 380 MB of RAM usage in the worst case. Thus, we can conclude that the proposed architecture, also in the case of generic malware detection, is suitable to protect the surrounding IoT devices from potential threats.