The Presence, Trends, and Causes of Security Vulnerabilities in Operating Systems of IoT’s Low-End Devices

Internet of Things Operating Systems (IoT OSs) run, manage and control IoT devices. Therefore, it is important to secure the source code of IoT OSs, especially if they are deployed on devices used for human care and safety. In this paper, we report the results of our investigation of the security status and the presence of security vulnerabilities in the source code of the most popular open-source IoT OSs. In this research, three Static Analysis Tools (Cppcheck, Flawfinder and RATS) were used to examine the code of sixteen releases of four C/C++ IoT OSs, for a total of 48 examinations, for the presence of vulnerabilities from the Common Weakness Enumeration (CWE). The examination reveals that IoT OS code still suffers from errors that lead to security vulnerabilities and increase the opportunity for security breaches. The total number of errors in IoT OSs is increasing from one version to the next, while error density, i.e., errors per 1K physical Source Lines of Code (SLOC), is decreasing chronologically for all IoT OSs, with few exceptions. The most prevalent vulnerabilities in IoT OS source code were CWE-561, CWE-398 and CWE-563 according to Cppcheck; (CWE-119!/CWE-120), CWE-120 and CWE-126 according to Flawfinder; and CWE-119, CWE-120 and CWE-134 according to RATS. Additionally, the CodeScene tool was used to investigate the evolutionary properties of IoT OSs and their relationship to the presence of IoT OS vulnerabilities. CodeScene reveals a strong positive correlation between the total number of security errors within IoT OSs and SLOC, as well as a strong negative correlation between the total number of security errors and Code Health. CodeScene also indicates a strong positive correlation between security error density (errors per 1K SLOC) and the presence of hotspots (frequency of code changes and code complexity), as well as a strong negative correlation between security error density and the Qualitative Team Experience, which is a measure of the experience of the IoT OS developers.


Introduction
The Internet of Things (IoT) is a dynamic global network of sensors, actuators, controllers and smart devices that act together to capture, filter and exchange data about their environment, taking advantage of Internet connection and integration capabilities. IoT is a new technology that is growing rapidly and extensively, with an estimated 50 billion devices by the end of 2020 [1]. The combined IoT market will rise to about 520 billion USD in 2021, more than double the 235 billion USD invested in 2017 [2]. Over the last few years, big IoT vendors such as Microsoft (Azure IoT) [3], Amazon (AWS IoT) [4], Cisco (Jasper) [5], Google (Brillo) [6], Apple (Homekit) [7], IBM (Watson) [8], and Qualcomm (AllJoyn) [9] have rapidly grown in the IoT market. Furthermore, over 300 IoT platforms are available today, with more on the way [10]. IoT is a heterogeneous, complex environment that suffers from a lack of interoperability. IoT devices are ubiquitous and are sources of big data, both in the volume of data they produce and in the volume they transmit over the Internet.

Background
This section provides an explanation of the relevant background knowledge directly connected to this research.

Operating Systems of Low-End IoT Devices
IoT devices can be high-end devices that are operated by traditional operating systems, such as Linux, or low-end devices with limited resources, e.g., very limited memory, computational power, and power supply [17]. The scope of this study is low-end IoT OSs, which play a vital role in operating and running low-end devices, taking into account the resource limitations of these devices. We chose four popular open-source IoT OSs for this study: RIOT [18], Contiki [19], FreeRTOS [20] and Amazon FreeRTOS [21]. They are well documented, developed with C/C++, and are among the most-used open-source IoT OSs according to the last four surveys of the IoT Eclipse foundation [22][23][24][25]. According to the 2019 IoT Eclipse foundation survey [22], FreeRTOS, Contiki and RIOT accounted for 19%, 5% and 5% of non-Linux operating system use, respectively. Table 1 gives an overview of our case study of IoT OSs. We briefly introduce each of the chosen IoT OSs below.
RIOT is an open-source real-time operating system for low-end IoT devices built on a Microkernel architecture [26]. RIOT was developed by a grassroots community of companies, academics, and hobbyists distributed around the world. It was mainly written from scratch in C/C++, with minor use of other languages such as Python, and it seeks to implement all related open standards supporting IoT. RIOT was developed to use minimal resources in terms of power consumption, ROM (~5 kB), and RAM (~1.5 kB) [27]. RIOT offers a generic API to access sensor and actuator devices, named the Sensor Actuator Uber Layer (SAUL) API. This API enables vendor-agnostic access to sensors and actuators and allows applications to be written against heterogeneous IoT devices using the same function calls. RIOT runs on a variety of platforms, such as embedded devices and personal computers.
Contiki has a Monolithic architecture [28] and was designed specifically to run low-end IoT devices. It makes it possible to build applications that use hardware effectively while ensuring adequate low-power wireless communication on a variety of hardware platforms, enabling microcontroller chips to connect to the Internet. It is mainly implemented in C, with minor use of other languages such as Python and Java. Contiki prioritizes low power consumption and memory management, with standard setups requiring as little as 2 KB of RAM and 60 KB of ROM running at 1 MHz [29]. Popular SSL/TLS libraries such as wolfSSL support the Contiki operating system, which was developed with portability in mind [30].
FreeRTOS has a Microkernel architecture [28] and was mainly written in the C programming language, developed over 15 years in collaboration with the world's leading chip companies. Its focus is reliability and ease of use. It is distributed free of charge under the Massachusetts Institute of Technology (MIT) open-source license. FreeRTOS consumes as little as 4 to 9 KB of ROM and provides a collection of libraries for handling File Allocation Table (FAT) file systems and storage media. Therefore, it is efficient for running low-end IoT devices.
Amazon FreeRTOS is an extension of FreeRTOS provided by Amazon, and it also has a Microkernel architecture. Amazon FreeRTOS was mainly written in the C programming language, with minor use of other languages such as Python, Perl and Ruby. Amazon FreeRTOS includes libraries for IoT support and is specifically intended for Amazon Web Services (AWS). Since version 10.0.0 (2017), Amazon has been in charge of the FreeRTOS source code, including any changes to the original kernel.

Common Weakness Enumeration Vulnerabilities
Software vulnerabilities are weaknesses in the source code. Vulnerabilities stem from insecurities in the language used, combined with programmers ignoring secure coding practices, the pressure of deadlines, and/or a lack of management focus on the topic [31]. Since IoT integrates multiple devices, sensors, and actuators and interacts directly with humans in many of its applications, the presence of vulnerabilities in IoT systems can have severe consequences. Imagine, for example, hacking a pacemaker device or a self-driving car [14]. The situation is further complicated by the fact that most IoT OSs are written in C/C++ because of the powerful low-level programming support these languages offer. At the same time, they are among the least secure programming languages. Some studies claim that 50% of vulnerabilities in open-source projects discovered between 2009 and 2019 were in C programs [32].
The Common Weakness Enumeration (CWE) [33] is a community-developed, evolving formal list of software vulnerability types, called weaknesses. The CWE list and its related classification taxonomy act as a common vocabulary for defining and explaining these vulnerabilities. The main objective of CWE, which targets both the development and security communities, is to avoid vulnerabilities in source code by educating software and hardware programmers, designers, and architects about how to remove the most common errors before software and hardware are delivered.

Code Analysis Tools
To analyze the chosen IoT OSs for the presence of vulnerabilities, Static Analysis Tools (SATs) were used to identify vulnerabilities in the source code without executing it. Cppcheck version 2.1 [34], Flawfinder version 2.0.11 [35], and Rough Auditing Tool for Security (RATS) [36] were our chosen SATs for examining the IoT OS source code. The three SATs are well documented, free, well known among the research community [37][38][39], and CWE compatible. Besides the three SATs, we needed a tool to examine the evolution of the IoT OS code base and the factors that affect its well-being over time, in order to study the factors that may influence the existence of vulnerabilities. For this purpose, we used CodeScene [40] to investigate the evolutionary properties of IoT OSs. We introduce each of these tools as follows.
Cppcheck is a C/C++ static analysis tool. It provides comprehensive code analysis to find errors in source code, focusing on the identification of undefined behavior and unsafe code, such as division by zero, dead pointers, null pointer dereferences, and integer overflows. Cppcheck is designed to analyze the source code and classify the severity of the errors found. The tool locates errors and potential errors, issuing messages that identify errors, warnings with recommendations to prevent errors, performance recommendations for faster code, etc. Our analysis results focus on vulnerabilities related to CWE, since this is the benchmark of our work.
Flawfinder is an open-source tool used to look for potential security errors within C/C++ source code, and it is officially CWE-Compatible. Flawfinder investigates source code, categorizing the findings from level 0 (very little risk) to level 5 (high risk) and ignoring text inside comments and strings. Flawfinder is highly sensitive in error detection, and its author states that "Not every hit is necessarily a security vulnerability" at the end of Flawfinder result reports.
RATS is an open-source tool that can scan source code in various programming languages, such as C, C++, Perl, PHP and Python. RATS flags common programming errors related to security, such as buffer overflows and TOCTOU (Time Of Check, Time Of Use) race conditions.
CodeScene detects trends at the level of the entire system and at the file level by analyzing the evolution of the code [41]. CodeScene can track code Hotspots, which are the complicated pieces of code that developers often have to work with. Hotspots are determined by combining the change frequency of each file as an interest rate proxy and the code lines as a simple measure of code complexity. Consequently, Hotspot analysis finds those files where much of the development time is spent. As shown in Figure 1, the darker the red color is, the more commits (changes) have been made to the code. The wider the circle is, the more code the file contains. CodeScene can also track Code Health, which refers to the ease of code maintenance and evaluation. The Code Health metric is calculated on the basis of a combination of both the properties of the code and the organizational factors, with a total of 25-30 biomarkers of code depending on the programming language. Some of the biomarkers used in calculating Code Health include (1) Brain Methods "single functions/methods that center too much behavior", (2) Nested complexity "such as if statements inside other if statements and/or loops", (3) Do not Repeat Yourself (DRY) violations, (4) Bumpy Road "a function that fails to encapsulate its responsibilities", and (5) Developer Congestion, "code becomes a coordination bottleneck when multiple developers need to work on it in parallel" [40].
Additionally, in terms of experience, CodeScene can calculate the monthly composition of the team. CodeScene classifies the experience of team members into three categories: onboard (0-3 months), seasoned (6-12 months), and veterans (+12 months). Figure 2 shows the RIOT team composition from 2019 to 2021: the black line is the total accumulated experience in months, and the blue line is the "Qualitative" Team Experience, a weighted value that takes into account the experience of each developer currently in the team.
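The Hotspot idea described above, combining change frequency with lines of code, can be illustrated with a minimal sketch. The scoring formula (a product of normalized values), the FileStats type, and the file statistics are assumptions made for illustration; they are not CodeScene's actual algorithm or data from this study.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    commits: int  # change frequency: number of commits touching the file
    sloc: int     # lines of code, used as a simple complexity proxy


def hotspot_scores(files):
    """Rank files by normalized change frequency times normalized size.

    The product is an illustrative assumption, not CodeScene's formula.
    """
    if not files:
        return []
    max_commits = max(f.commits for f in files) or 1
    max_sloc = max(f.sloc for f in files) or 1
    scored = [
        (f.path, (f.commits / max_commits) * (f.sloc / max_sloc))
        for f in files
    ]
    # Highest score first: frequently changed, large files are likely hotspots.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical file statistics, invented for illustration only.
example_files = [
    FileStats("core/sched.c", commits=120, sloc=800),
    FileStats("drivers/radio.c", commits=40, sloc=2000),
    FileStats("sys/util.c", commits=5, sloc=150),
]

ranking = hotspot_scores(example_files)
```

In this sketch a small but frequently changed file can outrank a large but stable one, which matches the intuition that Hotspots are where development time is spent.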

Related Work
Table 2 shows a brief overview of the related work, and the following paragraphs explain the work in detail. Alnaeli et al. [42] conducted an empirical study using static analysis methods on three C/C++ open-source IoT software packages to identify known vulnerable statements. They created a tool called UnsafeFunsDetector to find unsafe functions that are known to the research community or banned by some compiler producers. The study found that vulnerable functions were very common among the three systems, where memcpy() was the most prevalent unsafe function, followed by strlen(). The number of unsafe functions increased over the five-year period from 2012 to 2016 for the studied systems, reaching 1859 unsafe functions for Contiki OS, 772 unsafe functions for TinyOS, and 220 unsafe functions for OpenWSN. Alnaeli et al. [43] extended their previous work [42] to empirically examine the vulnerabilities of eighteen open-source IoT software systems, all of which were specifically written in C/C++ for IoT architectures. They found that usage of unsafe functions was still common among the selected systems, and developers were not working to fix the problems that were still present in the selected systems. memcpy() was again the most prevalent unsafe function in the majority of the systems, followed by strlen(), free(), and strcmp(). In both studies, Alnaeli et al. focused on known vulnerable statements, ignoring many types of IoT OS vulnerabilities that could lead to serious security issues.
McBride et al. [44] conducted a study using static program analysis tools and techniques to scan the Contiki OS source code in order to identify errors, bug density, and unsafe functions. The study found a clear increase in unsafe functions over the past 10 years of Contiki OS releases. Unsafe functions increased from a total of 334 in Contiki Version 2.0 to 1311 in Contiki Version 3.x, where memcpy() was the most prevalent unsafe statement, with 743 occurrences, followed by strlen(), with 375 occurrences.
Alnaeli et al. [42,43] and McBride et al. [44] presented different results for the total number of unsafe functions, since different static analysis tools were used. Both groups used the term unsafe function to describe vulnerable commands and statements in functions written in the C/C++ programming languages. Nevertheless, Contiki contains some files written in the Python programming language, and these files suffer from security errors; both Alnaeli et al. [42,43] and McBride et al. [44] ignored these errors.
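The kind of unsafe-function counting reported in these studies can be illustrated with a minimal sketch. This is not UnsafeFunsDetector or any tool used in the cited work; it is a simple regular-expression counter, assumed here for illustration, over the four function names mentioned above. Unlike a real static analysis tool, it does not skip comments or string literals.

```python
import re
from collections import Counter

# Function names taken from the studies discussed above.
UNSAFE_FUNCTIONS = ("memcpy", "strlen", "free", "strcmp")

# Match a listed name used as a call, e.g. "memcpy(" or "free (".
CALL_PATTERN = re.compile(r"\b(" + "|".join(UNSAFE_FUNCTIONS) + r")\s*\(")


def count_unsafe_calls(c_source):
    """Count calls to the listed functions in a string of C source code."""
    return Counter(CALL_PATTERN.findall(c_source))


# Hypothetical C snippet, invented for illustration only.
sample = """
void copy_name(char *dst, const char *src) {
    memcpy(dst, src, strlen(src));
    if (strcmp(dst, src) != 0) { free(dst); }
}
"""

counts = count_unsafe_calls(sample)
```

Because different tools apply different matching rules, totals obtained this way would not be expected to agree across tools, which is consistent with the differing counts reported in the two studies.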
Mullen and Meany [45] conducted a comprehensive assessment of Buffer Overflow (BOF) attacks, one of the most prevalent vulnerabilities in IoT devices running an IoT OS. The assessment was conducted for IoT devices using FreeRTOS version 9.0.0, focusing on two such attacks, namely return-to-libc and code injection. The assessment addressed the mechanics, implementation, and testing of BOF attacks and how to prevent them. It also exposed the limitations of FreeRTOS with respect to BOF prevention methods.
Mahmood and Mahmoud [46] conducted an evaluation on SATs for finding vulnerabilities in Java and C/C++ source code. They explained that none of the studied tools was sufficient to comprehensively uncover all present vulnerabilities. They recommended the adoption of secure coding techniques and the use of several vulnerability detection methods to reduce source code security risks.
Our work adopts this approach by employing multiple SATs. We complement the previous studies by further investigating the presence of vulnerabilities in IoT OSs up to their 2020 versions. We differ in taking a broader approach, examining multiple systems over multiple versions, up to the most recent ones, using multiple SATs. Furthermore, we take CWEs as the benchmark for identifying vulnerabilities, and we use CodeScene to study IoT OSs' Code Health and the factors that affect the presence of vulnerabilities.

Methodology
Figure 3 illustrates the methodology of our research investigating the presence of security vulnerabilities in IoT OSs. The study targeted sixteen releases of the four previously mentioned IoT OSs from 2010 to 2020. Source code was obtained from the GitHub repository of each IoT OS. Table 3 shows the targeted IoT OS releases and their corresponding year of release. Cppcheck version 2.1 [34], Flawfinder version 2.0.11 [35], and RATS version 2.4 [36] were used to examine and identify errors in the IoT OSs' source code that could lead to security vulnerabilities. The output reports of the three SATs label a finding that leads to a security vulnerability as an "Error"; thus, our study refers to security vulnerabilities as errors when reporting the total number of errors and errors per 1K SLOC of IoT OSs. The source code of each IoT OS release was examined separately by each of the three SATs, for a total of 48 examinations across all IoT OS releases. The methodology targeted C/C++ files and errors related to the CWE list, identifying the CWE errors of each IoT OS release and creating a report of SAT results. This step aimed to find the growth of the total number of errors, errors per 1K SLOC, and the most prevalent CWE vulnerabilities of IoT OSs.
Subsequently, CodeScene was used to investigate the relationship between the growth of the total number of errors, errors per 1K SLOC, and the trend in the evolutionary properties of IoT OSs. For this step, and based on the results of the three SATs, two IoT OSs were nominated to be examined by CodeScene. The first nominated IoT OS was the one with the lowest errors per 1K SLOC, and the second was the one with the highest errors per 1K SLOC. Finally, answers to the four research questions are provided.
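The error-density metric used throughout the study, errors per 1K SLOC, can be written as a short sketch. The release names and numbers below are hypothetical and are not taken from Tables 4-7; only the counts of releases and SATs come from the text.

```python
def errors_per_ksloc(total_errors, sloc):
    """Return the number of errors per 1,000 physical source lines of code."""
    if sloc <= 0:
        raise ValueError("sloc must be positive")
    return total_errors * 1000 / sloc


# Sixteen releases, each examined by three SATs, as stated in the text.
N_RELEASES = 16
N_SATS = 3
examinations = N_RELEASES * N_SATS

# Hypothetical releases: (name, total errors reported by one SAT, SLOC).
releases = [
    ("release-A", 300, 100_000),
    ("release-B", 450, 200_000),
]

densities = {
    name: errors_per_ksloc(errors, sloc) for name, errors, sloc in releases
}
```

In this invented example the total number of errors grows from release-A to release-B while the density falls, which is the pattern the study reports for most of the examined IoT OSs: the code base grows faster than the error count.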

Results
We start by discussing the results of running the SATs on the target systems. The three SATs produced different results at the level of CWE error ID, the total number of errors, and the number of errors per 1K SLOC, because each SAT is designed to detect certain errors and applies its own rules for security error detection. The following subsections present the examination results of each IoT OS.

RIOT Examination Results
Table 4 illustrates that the total number of errors found by the three SATs increased over time, despite a slight decrease reported by RATS for RIOT R. 2017.07 and RIOT R. 2020.04. Table 4 and Figure 4 show that errors per 1K SLOC decreased chronologically according to the three SATs. While these results suggest that a significant number of errors was still present in RIOT's latest version at the time of the study, they also show a significant improvement in the error trend relative to 1K SLOC.

Contiki Examination Results
All three SATs showed that the total number of errors in Contiki increased from one version to the next, and a significant number of errors was still present in Contiki's latest version at the time of the study, as shown in Table 5. Nevertheless, the number of errors per 1K SLOC decreased slightly over time and improved significantly in the latest version, as shown in Table 5 and Figure 5.

FreeRTOS Examination Results
Table 6 and Figure 6 show that the total number of errors in FreeRTOS increased over time according to the three SATs, except for FreeRTOS version 10.3.1 scanned by Flawfinder. Unlike RIOT, which saw significant improvements in code security over time, the number of errors per 1K SLOC in FreeRTOS stayed more or less the same.

Amazon FreeRTOS Examination Results
As can be seen in Table 7 and Figure 7, the three SATs reveal that the total number of errors increased until Amazon FreeRTOS v. 201908, after which it decreased significantly in Amazon FreeRTOS v. 202007. The decrease in the total number of errors was due to a decrease in SLOC from Amazon FreeRTOS v. 201908 to Amazon FreeRTOS v. 202007. In addition, this IoT OS had the lowest number of vulnerabilities per 1K SLOC, except according to Flawfinder, which showed similar behavior to FreeRTOS.

Tables 8-10 show the total number of errors and the error frequency for each IoT OS release with respect to the CWEs found by the three SATs. From these tables, we can see that the most prevalent vulnerabilities in the IoT OSs were CWE-561, CWE-398 and CWE-563 according to Cppcheck 2.1; (CWE-119!/CWE-120), CWE-120 and CWE-126 according to Flawfinder 2.0.11; and CWE-119, CWE-120 and CWE-134 according to RATS 2.4. The description of the CWEs is set out in Appendix A.
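Finding the most prevalent CWEs amounts to tallying CWE identifiers across SAT findings. The following is a minimal sketch of such a tally. The report lines are hypothetical, since each SAT has its own output format, and the sketch counts every CWE ID separately, whereas the tables treat Flawfinder's combined (CWE-119!/CWE-120) label as its own category.

```python
import re
from collections import Counter

CWE_PATTERN = re.compile(r"CWE-(\d+)")


def tally_cwes(report_lines):
    """Count how often each CWE ID is mentioned across report lines."""
    counts = Counter()
    for line in report_lines:
        for cwe_id in CWE_PATTERN.findall(line):
            counts[f"CWE-{cwe_id}"] += 1
    return counts


# Hypothetical report lines; real SAT output formats differ per tool.
sample_report = [
    "net.c:42: buffer copy without size check (CWE-120)",
    "net.c:77: statically sized array (CWE-119!/CWE-120)",
    "log.c:10: format string not constant (CWE-134)",
]

cwe_counts = tally_cwes(sample_report)
top = cwe_counts.most_common(1)
```

Dividing each count by the total number of findings would give a per-CWE frequency of the kind reported alongside the totals.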

Investigating Evolutionary Properties of the IoT OSs using CodeScene
From Tables 4-7, it is clear that RIOT has the lowest error rate and Contiki has the highest, while FreeRTOS and Amazon FreeRTOS are in between. Additionally, the numbers of errors per 1K SLOC for RIOT and Contiki were clearly high. For these reasons, RIOT and Contiki served as our case study for CodeScene. To investigate the causes of these findings, CodeScene was used to examine the effect of the evolutionary properties of IoT OSs, such as Hotspots, Code Health, and Qualitative Team Experience, on the total number of security errors and the number of security errors per 1K SLOC.

Investigation of the Evolutionary Properties of RIOT
As shown in Table 11, CodeScene shows a decline of Hotspots across RIOT releases, with high development effort and bug fixes within the RIOT Hotspots area. CodeScene also shows that the Qualitative Team Experience increased across RIOT releases. These results explain the decline in errors per 1K SLOC according to the three SATs' examinations of RIOT. In addition, CodeScene shows that the source code of RIOT releases was healthy. Despite the decline of Code Health across RIOT releases, the Code Health value remains higher than 6 out of 10, and thus the total number of errors in RIOT releases was not as high. Nevertheless, it should be alarming to project developers that Code Health is declining from one version to the next.

Investigation of the Evolutionary Properties of Contiki
As shown in Table 12, CodeScene shows that Contiki releases contain a higher percentage of code Hotspots than RIOT, together with high development effort and many bug fixes within those Hotspots. Consequently, Contiki releases suffer from code errors and vulnerabilities. CodeScene also shows a decline in Hotspots and an increase in Qualitative Team Experience across Contiki releases. This explains the decline in error rate, i.e., errors per 1K SLOC, found by the three SATs' examinations of Contiki, as in Table 5. Table 12 also shows that Contiki releases were not healthy: the value of Code Health is always less than 5 out of 10, and it declines from one version to the next. Thus, the total number of errors in Contiki releases was high.
For more clarification, we expressed the relationship between the evolutionary properties and the presence of IoT OS vulnerabilities by calculating the linear correlation coefficient between security errors and code evolutionary properties, as shown in Table 13. The values of X were obtained from Tables 4 and 5, while the values of Y were obtained from Tables 3, 11 and 12. A correlation coefficient close to 1 implies a strong positive relationship, a value close to −1 indicates a strong negative relationship, and a value of zero means that no linear relationship exists.
From Table 13, there is a strong positive correlation between the total number of security errors within IoT OSs and SLOC, as well as a strong negative correlation between the total number of security errors and Code Health. Table 13 also indicates a strong positive correlation between the number of security errors per 1K SLOC and the presence of Hotspots (frequency of code change and complexity of code), as well as a strong negative correlation between the number of security errors per 1K SLOC and the Qualitative Team Experience.
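To make the interpretation of the correlation coefficient concrete, the following minimal sketch computes it for two pairs of series. It assumes that the linear correlation coefficient used here is the standard Pearson coefficient. The function name pearson_r and all numeric values are hypothetical illustrations and are not taken from Tables 3, 4, 5, 11 or 12.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the linear (Pearson) correlation coefficient of two sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of X and Y around their means, and the spread of each series.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical values for four successive releases (NOT the paper's data).
total_errors = [120, 150, 210, 260]   # X: total security errors
sloc_k = [40, 55, 80, 100]            # Y: size in thousands of SLOC
code_health = [6.0, 5.5, 4.8, 4.1]    # Y: Code Health score out of 10

r_size = pearson_r(total_errors, sloc_k)
r_health = pearson_r(total_errors, code_health)
print(f"errors vs SLOC:        {r_size:+.3f}")
print(f"errors vs Code Health: {r_health:+.3f}")
```

With these made-up series, the first coefficient comes out close to 1 and the second close to −1, which is the pattern described for Table 13. With only a handful of releases per IoT OS, such coefficients should be read as indicative rather than conclusive.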

Discussion
The three SAT analyses produced different results in terms of the CWE IDs reported, the total number of errors, and the number of errors per 1K SLOC. In analyzing these SATs, we found that each SAT applies its own rules for error detection and specializes in detecting certain CWEs. Among the three tools, only RATS was able to detect errors within IoT OS files written in the Python, Perl, and Ruby scripting languages. However, these errors were excluded, because the study focused on the C/C++ files of the IoT OSs in order to have a fair comparison between the four IoT OSs. The excluded errors comprised 513 errors in versions of Amazon FreeRTOS, 46 errors in RIOT releases, 46 errors in Contiki releases, and zero errors in versions of FreeRTOS. In the following, we provide the answers to our research questions.
Research Question No. 1: Do IoT OSs' security errors increase or decrease as they evolve over time? The answer is: Except for the latest version of Amazon FreeRTOS, IoT OSs show steady growth in the total number of security errors. This is primarily due to the growth in size. These results suggest that there are ample chances for attackers to exploit such systems, and that the problem is increasing over time.
Research Question No. 2: Does IoT OSs' security error density, i.e., errors per 1K SLOC, increase or decrease as they evolve over time? The answer is: Generally, and with few exceptions, there was a gradual decline in the number of errors per 1K SLOC, suggesting an improvement in code security and better use of secure coding practices. However, if we consider Flawfinder, as it is CWE compatible, and we take the latest version examined from each IoT OS in this study, we find that the security error densities in RIOT, Contiki, FreeRTOS, and Amazon FreeRTOS were 1.58, 9.26, 5.04, and 5.71 vulnerabilities per 1K SLOC, respectively. First, we see big variations, with the highest (Contiki) being roughly 6 times the lowest (RIOT). This suggests that the use of secure coding practices among IoT OS developers varies significantly. Second, if we average these figures, we get about 5.4 errors per 1K SLOC (a similar result is obtained if we take the median value), which is still a significant rate. Developers of IoT OSs need to be much more aware of this topic and to employ secure coding practices and a secure development life cycle. RIOT's superior results may be explained by the adoption of security policies in the project, according to the project's owners [48].
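The arithmetic behind these figures can be checked with a short sketch. The four densities are the Flawfinder values quoted above. The helper errors_per_ksloc and the numbers passed to it are hypothetical and only illustrate how an error density is derived from an error count and a SLOC count.

```python
from statistics import mean, median

# Flawfinder error densities (errors per 1K SLOC) for the latest examined
# release of each IoT OS, as quoted in the answer to Research Question No. 2.
densities = {
    "RIOT": 1.58,
    "Contiki": 9.26,
    "FreeRTOS": 5.04,
    "Amazon FreeRTOS": 5.71,
}


def errors_per_ksloc(total_errors, sloc):
    """Error density: number of errors per 1,000 physical source lines."""
    return total_errors / (sloc / 1000)


highest = max(densities, key=densities.get)
lowest = min(densities, key=densities.get)
ratio = densities[highest] / densities[lowest]
avg = mean(densities.values())
med = median(densities.values())

print(f"highest: {highest}, lowest: {lowest}, ratio: {ratio:.2f}")
print(f"mean: {avg:.4f}, median: {med:.4f}")

# Hypothetical example of the density definition (not a measured value):
# 500 errors in 100,000 SLOC.
print(errors_per_ksloc(500, 100_000))
```

The ratio between the Contiki and RIOT densities is a little under 6, and both the mean and the median are close to 5.4, in line with the statements above.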
CWE-561 means that the software contains dead code that can never be executed. During code evolution and maintenance, dead code can lead to confusion and result in vulnerabilities. CWE-398 is one of the classifications in the Seven Pernicious Kingdoms vulnerability classification [49]. It does not introduce a weakness or vulnerability directly, but indicates that the software has not been carefully developed or maintained, and this increases the possibility of buried vulnerabilities within the code. CWE-563 refers to a variable that is assigned a value but never used, which can point to a bug.
CWE-119 refers to memory corruption, where the software fails to constrain operations within the boundaries of a memory buffer. Consequently, attackers may execute arbitrary code, sensitive data may be read, or the software may crash. CWE-120 is a classic buffer overflow, where the program copies an input buffer to an output buffer without checking its length at all, neglecting the most fundamental security protections. CWE-126 is a buffer over-read, which means that the program reads from a buffer using buffer access mechanisms, such as indexes or pointers, that reference memory locations after the targeted buffer. This leads to sensitive information being exposed, or possibly a program crash. CWE-134 is an uncontrolled format string; this could lead to buffer overflows or problems with data representation.

Research Question No. 4: What is the relationship between the presence of vulnerabilities and the evolutionary properties of IoT OSs? The answer is: CodeScene shows that the low Code Health of IoT OSs leads to a high number of total security errors. In addition, the three SATs show that the total number of security errors of IoT OSs generally increases or decreases with an increase or decrease in SLOC. CodeScene also shows that the decline of errors per 1K SLOC depends on the decline of code Hotspots and the increase in Qualitative Team Experience. The take-away from this analysis is that vulnerabilities in IoT OSs increase with the increase of Hotspots, which refers to the increase of code change frequency and code complexity. Moreover, vulnerabilities decrease with the increase of Qualitative Team Experience and Code Health (ease of code maintenance and evolution). Additionally, the violation of healthy code metrics such as the Brain Method, Nested Complexity, DRY, Bumpy Road, and Developer Congestion leads to high numbers of vulnerabilities. Furthermore, activating security policies in IoT OS development repositories could help decrease the vulnerabilities.
The limitations of our study are as follows. (1) The IoT OSs of our study were written using various programming languages such as C, C++, Python, Perl, Ruby and Java, whereas the SATs used in the study, with the exception of RATS, can perform static analysis only on C/C++ files. (2) The study depends on non-commercial SATs and code analysis tools, and the results produced are limited by the limitations of these tools. While the SATs used are able to find a wide range of CWEs, they are not perfect and may not catch all present vulnerabilities. For example, Cppcheck can detect 83.5% of vulnerabilities, and has 7.2% of false alarms [37]. The results of Flawfinder are close to the RATS results, and Flawfinder works by matching simple text patterns, which results in many false positives [37,46]. CodeScene measures the complexity of a code Hotspot only by a simple complexity metric (LOC), ignoring other important complexity metrics such as Cyclomatic complexity. Finally, (3) we considered only three factors when investigating the causes of vulnerabilities in IoT OSs, namely Hotspots (code complexity and change frequency), Code Health (maintainability) and Qualitative Team Experience, which are the main metrics supported by CodeScene. While these are very reasonable factors to study, other factors may influence the presence and trends of vulnerabilities, e.g., the process followed, the security policies set, team size and hierarchy, etc.
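The false-positive issue of purely textual matching can be illustrated with a deliberately crude sketch. This is not the implementation of Flawfinder or RATS, and real tools are more sophisticated; the function naive_scan, the short list RISKY_CALLS, and the C fragment in SAMPLE are all hypothetical and serve only to show why a pattern match alone cannot tell a genuine risk from a harmless mention.

```python
import re

# Hypothetical shortlist of risky C calls; real tools ship far larger rule sets.
RISKY_CALLS = ("strcpy", "sprintf", "memcpy", "strlen")
PATTERN = re.compile(r"\b(" + "|".join(RISKY_CALLS) + r")\s*\(")


def naive_scan(c_source):
    """Return (line_number, function_name) for every textual match."""
    hits = []
    for number, line in enumerate(c_source.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            hits.append((number, match.group(1)))
    return hits


# Hypothetical C fragment: only line 3 contains an actual risky call.
SAMPLE = '''\
void copy_name(char *dst, const char *src) {
    /* strcpy(dst, src) was replaced by a bounded copy */
    strcpy(dst, src);
    puts("never call sprintf(buf, fmt) with user input");
}
'''

for line_number, name in naive_scan(SAMPLE):
    print(f"line {line_number}: possible unsafe use of {name}()")
```

The scan reports three hits, although two of them (the mention inside a comment and the one inside a string literal) are not calls at all. Reports of this kind inflate error counts, which is one reason why the totals produced by pattern-based SATs should be interpreted with care.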

Conclusions and Future Work
IoT OSs still suffer from errors that could lead to security vulnerabilities, and the total number of errors increases chronologically across IoT OSs' releases. The good news is that errors per 1K SLOC decreased chronologically for all IoT OS releases, with few exceptions. The exceptions were that the error rate increased among FreeRTOS's versions when examined by Cppcheck and Flawfinder. It also increased among Amazon FreeRTOS's versions upon examination by Flawfinder.
The three SATs produced different results at the level of CWE IDs, total number of errors, and number of errors per 1K SLOC, because each SAT applies its own rules for error detection and specializes in detecting certain CWEs.
The three SATs show that the total security errors of IoT OSs are generally dependent on the growth of SLOC. Investigating the evolutionary properties of the IoT OSs with CodeScene shows that low Code Health of IoT OSs leads to a high number of total security errors, and that the decline of errors per 1K SLOC depends on the decline of code Hotspots and the increase in Qualitative Team Experience.
Finally, we can conclude that a single standalone SAT cannot cover all vulnerabilities, and it is recommended to use various SATs to cover a wide range of vulnerability detections. In addition, SATs produce different results at the level of CWE ID and total discovered errors. Hotspots, Code Health, and Qualitative Team Experience are important evolution factors that developers should take care of during the development phase. Hence, IoT OS sponsors should make clear use of security policies during the Software Development Life Cycle (SDLC) by using various SATs, by preventing, or raising awareness of, the use of known unsafe functions such as memcpy(), strlen(), free(), and strcmp(), by training team developers, by encouraging code documentation, and by encouraging security testing prior to release.
Our immediate future work will extend the use of SATs to identify security errors within IoT OS files written not only in C/C++ but also in other languages such as the Python, Perl, and Ruby scripting languages. Furthermore, our case study will be extended to include IoT OSs such as TinyOS, OpenWSN and Femto OS.
As we contribute new research results to the study of security of IoT OSs for low-end devices, there is still a need for multiple further studies. First, similar studies are needed for other IoT OSs not included in this study. Second, similar studies are needed for popular commercial and open-source IoT applications and systems, other than OSs. Third, while we touched on the underlying factors that can contribute to the presence of vulnerabilities in IoT OSs through our analysis using CodeScene, further and deeper investigations of the root causes are still needed. In particular, the practices and project characteristics that lead to vulnerabilities need to be identified. Fourth, we focused on IoT OSs written in C/C++, which are the dominant languages in this domain, but other languages are also used and can be sources of vulnerabilities. Fifth, and most important, while advanced and sophisticated SATs help expose vulnerabilities, similarly advanced tools need to be developed for vulnerability remediation and automatic program repair.

Data Availability Statement:
The data presented in this study is contained within the article.

Acknowledgments:
The authors acknowledge the contributions of Adam Tornhill in granting a free license for CodeScene and giving advice on using it. The authors also acknowledge the contribution and guidance of the late professor Amr Badr, Faculty of Computers and Artificial Intelligence, Cairo University, during his involvement in this research before his sad departure.