From Mean Time to Failure to Mean Time to Attack/Compromise: Incorporating Reliability into Cybersecurity

Around the world, numerous companies strive to successfully facilitate digital transformation [...]

Around the world, numerous companies strive to successfully facilitate digital transformation. One of the key goals of many countries around the globe is the development of its digital infrastructure [1]. Whether in the public or private sectors, achieving data and infrastructure security is crucial to the success of any digital transformation. To guarantee a smooth and secure digital transformation, IT system security in all enterprises requires specific attention. People are becoming more reliant on ICT technology to do daily tasks at home or at work as a result of the digital revolution [2]. Systems that support essential elements of this intelligent way of living are classified as critical, and their security level is higher than that of other systems. The idea of digitally managed services, which include security monitoring, managed network services, or the outsourcing of business processes that are essential to the operation, dependability, and availability of critical national infrastructures, has been introduced by novel cybersecurity regulations. Following the COVID-19 epidemic, which compelled practically all daily operations to go digital, this shift is being emphasized more than before [3]. This unexpected transformation has an impact on almost every industry, including critical infrastructures, education and others. Furthermore, because of the lockdown brought on by the Coronavirus epidemic, online and remote working are now crucial. However, because many companies were not prepared to employ e-working technologies, several difficulties arose while attempting to digitize the business processes, including the lack of cybersecurity preparedness [4].
There are already a variety of frameworks on the market that businesses may use to increase the effectiveness of their cybersecurity [5]. These frameworks encourage both individual and corporate action. As emphasized already [6,7], there is a high significance of providing employees with training and information security awareness in order for any security improvement program chosen by an organization to be successful and secure [8]. Teaching front-end users will act as the first line of defense against attackers, thus this should be incorporated into the risk/security assessment plan followed by all levels of administration [9].
A system's or infrastructure's cybersecurity posture can be evaluated using a number of techniques, such as vulnerability assessment [10], risk management, maturity assessment, or posture assessment. An enterprise can create a successful information management system, standardize cybersecurity rules, and increase the security of an organization by using well-known cybersecurity frameworks like NIST or ISO. These frameworks can be incredibly effective in aiding organizations in comprehending the dangers they face, analyzing their vulnerabilities, and organizing their security countermeasures and mitigation plans. To better represent the system processes and operations, those models still need to be modified for use in specialized fields like banking, healthcare, maritime [11,12], education [13] or vital components like critical infrastructures [14] and industrial systems [15]. Many of the proposed frameworks define a set of metrics for gauging organizational maturity or competency in terms of a collection of widely accepted best practices, competencies, or standards. To conduct a gap analysis against several security requirements, MAFs may combine the General Data Protection Regulation (GDPR), Payment Card Industry Data Security Standard (PCI DSS), and Data Security and Protection Toolkit (DSPT), among others, depending on the business processes of the business sector. The metrics used can be classified into groups and expressed in terms of performance. The measurable transitions between levels are usually based on empirical data that have been validated in practice, and each level in the model is more mature than the previous level. Although the existing maturity models have managed to incorporate several directives, regulations and frameworks, and have been tailored to specific sectors, they still fail to incorporate the reliability aspect of a system [16,17].
On the other hand, reliability is a metric used to gauge a system's capacity to execute in accordance with its requirements under given temporal and operational conditions. The possibility that the system will function properly for a set amount of time is known as reliability [18]. The system's ability to meet the needs of the application is dependent on those requirements. We must have a thorough grasp of the system's constituent parts and how they function in order to determine the system's reliability. Calculating the dependability of each subsystem or entity, as well as their connections and inter-dependencies, is necessary to determine a system's reliability. The Mean Time to Failure (MTTF) and Mean Time to Repair (MTTR), among other measures, are used to measure the system's reliability. For many years, extremely critical systems have been developed with a focus on reliability, efficiency, and optimization. The robustness of a system can be supported by the use of reliability theory in analyzing the behavior of complicated systems and creating ones that are incredibly stable. By embracing the fundamental ideas of reliability, securability can be used as a metric to show how well a system can function in accordance with the demands of the services it is being supplied [19][20][21][22].
Since faults and failures are elements that affect the system's proper operation, they can and ought to be considered in the analysis of securability [23]. The triptych of analysis, prediction, and optimization is where the concept of security is found. The operation of the system under analysis could be modeled using terms like Mean Time to Attack (MTTA), Mean Time to Compromise [24] and Mean Time to Recovery (MTTR), which are based on current incident response and mitigation plans. Using patterns that combine security and dependability and attack prediction using Markov models, some initial steps in this approach have already been taken [25]. New methodologies that could define the system requirements by incorporating security (and privacy) with reliability (and safety), however, are still lacking and are anticipated to be introduced in the upcoming years. These methodologies would also introduce a new research area under the umbrella term of securability.
Another interesting concept would be to manage to integrate the reliability analysis as a part of a maturity assessment framework, like the risk analysis element that already is incorporated in many MAFs. The idea of including a probabilistic model of the behavior of a part (or the whole system) in terms of tentative failures or errors could provide a better picture of the system in analysis and a prediction of future states [21]. As an example lets imagine that we are trying to analyze a system's behavior (from a high level perspective) when the also have a disaster recovery facility in place. Data and computer processing must be replicated at an off-premises location unaffected by the incident for disaster recovery to work. A business must restore lost data from a backup location when servers go down due to a natural disaster, equipment malfunction, or cyberattack. In order to maintain operations, a company should be able to move its computer processing to that distant site as well, thus managing to continue to offer the services to the clients [26].
The main system is represented as MS and the Disaster Recovery Site as DR. Working in an abstract level, we can represent the states of the system using a Markov chain (See Figure 1)  The transition from State B or State C to State A is done with rates µ MS and µ DR , respectively, representing the repair/restore rate of the system/DR. In order for the organization to be able to offer the service 24/7, the λ rates must be smaller compared to µ rates. Using the Markov model of Figure 1, we can calculate the MTTF (Mean Time to Failure) or MTTA (or Mean Time to Attack) and the MTTR (Mean Time to Restore, Respond or Repair depending on the model we are using). The correct values of the these rates demand a thorough analysis of the components of the systems, their inter-dependencies [27], along with an up to date threat assessment. This analysis is demanding and must be done in several steps following a top-down approach. The main system can be divided into sub-systems. A state transition diagram must be created for each sub-system along with a general model that will represent the dependencies among subsystems in the general form r out of n (r-out of n:G). In this model, the at least r subsystems (or elements) must be in a good state in order for the system to be operational [28]. When incorporating cybersecurity into this reliability analysis, the calculation of the failure probability of each component must include failures as well as possible attacks.