1. Introduction
Over the past 50 years, maintenance strategies have undergone a radical change from a necessary evil into a total engagement system to achieve the organization’s main targets. All chemical processing facilities consist of complex systems due to multiple different pieces of equipment that run under harsh operational conditions. The risk of a processing plant can be mitigated via several approaches. The reliability of a system is one of the most important approaches used in determining the probability of risk. An increase in the reliability of equipment can be achieved through continuous inspection and scheduling maintenance activities in order to ensure oil and gas productivity, which faces high risks due to the handling of hazardous materials under harsh operating conditions. To achieve safety and effectiveness in any processing plant, it is necessary to reduce the possibility of unscheduled failures. This can be achieved through effective maintenance activities involving risk assessment and reliability aspects [
1].
Recent years have witnessed vital progress in the development of maintenance strategies, from traditional breakdown maintenance to more sophisticated strategies such as maintenance based on condition monitoring and maintenance based on the reliability of the system.
The major challenge for a maintenance engineer is to implement a maintenance strategy that increases the availability and efficiency of the equipment, avoids equipment deterioration, ensures safe and environmentally friendly operation, and reduces the total cost of the operation. Thomas and Weiss [
2] report that, according to an annual survey of manufacturers in 2016, United States manufacturers spent USD 50 billion on maintenance and repair for both machinery and building maintenance, representing a significant part of the total operation costs.
In the twentieth century, the American Society of Mechanical Engineers introduced performance criteria to improve tool safety and minimize the frequency of unexpected failures. These criteria also cover the repair of static equipment (vessels, safety valves, and piping pieces). Risk estimation is associated with the probability of failure (PoF) and the consequence of failure (CoF), and it relies on tools that measure and analyze system safety in order to determine the estimated risk. Risk analysis is performed across three stages, namely design, normal operation (material defect, inspection, and maintenance), and the wear-out stage, as shown in Figure 1.
The risk is typically estimated in terms of economic and human losses resulting from the toxic effects of materials on humans and the environment and the effects of fire on equipment and structures per cycle. The estimation of the factor probability value is based on many influencing variables, including the availability, maintainability, and reliability of the system and human factors [
3].
The American Society of Mechanical Engineers (ASME) standard considers the protection of equipment that runs continuously under harsh conditions. It presents a risk-based criterion reflecting a safety zone that prevents practitioners from incurring personal injuries related to toxic materials’ effects and the blast effects on equipment and the surrounding environment due to equipment failures [
4]. Protection processes usually focus on two factors: the PoF and the CoF. The PoF and CoF can be used for any equipment based on the failure rate, seeking to protect practitioners when using complex equipment, especially when located in chemical zones or in the presence of suppression or other hazards. In addition, the ASME provides risk criteria to protect against hazards associated with personnel. However, API 691, related to risk-based machinery management, also provides criteria to estimate risks in terms of economic and environmental consequences due to facility failures [
5]. Therefore, the occurrence probability of failure should be estimated either qualitatively or quantitatively. Similarly, the severity of the CoF should be considered in terms of each category of consequences. According to the standards of the American Petroleum Institute (API) and the ASME, the estimation of the probable risk can be broken down into six stages considering equipment/components that can present a threat to the assets and environment of the plant:
- (1)
The identification of risk based on the pressure parameter;
- (2)
The estimation of the risk level;
- (3)
The identification of risk criteria for each component based on the severity index;
- (4)
The application of risk assessment for N&O indicators without/with a safety system;
- (5)
The selection of a safety system if the risk indicator is tolerable; and
- (6)
Repeating the above stages.
Table 1 shows a 4 × 4 risk matrix that defines the risk level using a scale of one to four for the PoF (1, 2, 3, 4) and the severity of the CoF (N, O, C, D). A risk level ≥ L indicates a tolerable risk.
Most plants that run under harsh operating conditions are focused on risk-based inspection (RBI) techniques due to the complex processes associated with static equipment, since this type of equipment requires higher safety and reliability [
2]. Therefore, the approach of RBI was developed to define risk criteria related to tolerable and estimated risks.
In the RBI approach, high-risk equipment is inspected with greater attention than low-risk equipment. RBI is one of the most widely used risk assessment approaches for static equipment; however, it can also take rotating equipment into account, and it includes economic and environmental factors that may represent a threat to any unit or equipment. The application of the risk-based maintenance (RBM) approach to rotating equipment needs to be considered in terms of critical rotating equipment whose components act as a series system, in which the failure of any component or piece of equipment can lead to the total breakdown of the processing plant [
6].
Most previous studies have elaborated on the importance of applying RBI to solve problems associated with higher-risk static equipment. Others have focused on RBM with maintenance schedules based on the recommendations of the original equipment manufacturer (OEM), without taking critical equipment into consideration. At the PG unit of the gas plant of the Sirte Oil Company (SOC), maintenance management faces significant challenges due to the aging of the components of the unit. Reliance on manufacturer recommendations has proven ineffective in many cases, as these do not account for actual operational conditions or component wear caused by harsh environments or extended operating hours. Two questions often faced by maintenance management are (i) which critical components present a high risk, and (ii) what is the estimated mean time to failure (MTTF)?
Consequently, the novelty of this study consists of the integration of the RBM approach with fault tree analysis (FTA) in order to identify and rank critical equipment within the turnaround maintenance list and to schedule the maintenance intervals of their activities, as opposed to applying the intervals of the original equipment manufacturer.
The proposed RBM approach aims at improving maintenance functions and maintaining a high level of reliability among rotating machines. Based on the probable failure modes, it can estimate the risk level and schedule maintenance activities for each failure component in order to reduce threats to the operational condition and stability of the unit and avoid any failure that may represent a high risk when the unit operates under normal conditions. The study aims to determine the interval of maintenance based on the critical components of the PG unit. This is achieved via three steps:
- (i)
The construction and description of the RBM approach;
- (ii)
A case study according to the critical components of the gas plant at the Sirte Oil Company;
- (iii)
The validation of the study based on the records of maintenance activities for critical components at the SOC and other facilities that run under harsh operational conditions.
2. The Risk-Based Maintenance (RBM) Approach
The RBM approach aims at mitigating the total risk that may pose catastrophic effects due to the unexpected failure of operating units [
7]. Assessing the risk indicator of the unit resulting from the failure of each component enables one to prioritize critical maintenance activities for all components of the system. RBM can be defined as a strategy that aims to prioritize maintenance activities to mitigate probable risks to critical components and high-risk systems before they cause a failure, which may impact the environment and the safety of industrial operations. This means that high risk levels will be given greater emphasis than low risk levels based on the RBM approach. Moreover, RBM can be considered a tool to estimate the interval between two consecutive maintenance activities for any equipment or unit, seeking to mitigate the overall risk caused by the occurrence of a failure. RBM can also present a set of recommendations for a preventive maintenance strategy, as shown in the second stage in
Figure 2. The quantitative assessment of risk is the basis for prioritizing maintenance activities. Based on the risk criteria, the estimated risk is compared against the acceptable risk for each failure scenario associated with the equipment or components. If the estimated risk exceeds the acceptable risk, the maintenance interval for this failure scenario must be optimized so that the excess risk is reduced to the acceptable zone. This comparison is repeated for each piece of equipment. The results obtained for the PG unit should then be combined to optimize maintenance events for the overall system. All details for each stage of the RBM approach are described below. MATLAB (R2022b) (The MathWorks, Inc., Natick, MA, USA) was used to carry out the calculations. The RBM approach can be broken down into three stages, as shown in
Figure 2.
- ▪
Stage I: Estimated Risk (ER)
This stage can be divided into four categories, as shown in
Figure 2. Each category is described below, where the risk of the system is estimated according to Equation (1):
- -
Category 1.1: Development of Failure Scenarios (FSs)
An FS is a series of events that cause a failure. These series may include a single event or a combination of sequential events. It is obvious from previous case studies that a failure appears as a result of an interactive sequence of events. Therefore, a scenario does not signify that a failure will occur, but it may indicate that there is a strong probability of occurrence. A scenario is also not represented by a specific status or event but is a description of a typical status that can be expressed by a set of potential events or statuses.
Risk assessment assists practitioners in indicating sources of failure in order to select the most critical approaches and appropriate means for preventing and mitigating the possibility of its occurrence. Failure scenarios are estimated based on the operational condition of the system, the characteristics of the process under which the operation takes place, the configuration of the process, and safety plans. The RBM approach focuses on the collection of failure data, which include the consequences and likelihoods of failures in maintenance activities. Thus, many probabilistic risk assessment techniques, such as fault tree analysis, event tree analysis, and reliability-centered maintenance, rely heavily on historical data, failure data, and probabilities [
8]. RBM, along with reliability-centered maintenance approaches, is among the most well-known maintenance strategies, especially in critical sectors that operate under harsh conditions, such as oil and gas and aviation [
9,
10]. In addition, Esa and Muhammad [
11] suggested an RBM framework for naval vessels based on prescriptive analytics. Abbassi et al. [
12] also discussed how RBM and predictive maintenance can be integrated. Makua [
13] used condition-based and predictive maintenance as new strategies in an RBM approach for wind turbine systems. Zhang et al. [
14] presented a guideline to optimize reliability and maintenance for pipeline systems using reinforcement learning algorithms and the Markov modeling framework in order to demonstrate the utility of reinforcement learning in reliability and maintenance applications.
In another case study, the applicability of an RBM methodology was demonstrated on a semi-automated cutting and crimping machine. Masud et al. [
15] studied the RBM approach based on the development of FTA. Their study covered critical subsystems in many sewerage pumping stations in Australia, seeking to mitigate the risk of failure and enhance the implementation of RBM. Mohammad and Pirouzmand [
16] described the effectiveness of maintenance activities using RBM, considering aging effects using a Markov model of maintenance and then coupling this model to FTA in order to upgrade from the component level to the system level, focusing on two safety systems in the VVER-1000/V446 nuclear power plants. The results showed that the RBM tool was accurate in determining the technical specifications of real maintenance for a nuclear power plant from a risk point of view. Khan and Haddara [
7] proposed a systematic model to determine the maximum credible unexpected failure scenario to assess FSs at any processing plant that runs under harsh conditions. According to the recommendations of the American Petroleum Institute and the American Society of Mechanical Engineers, the maximum credible scenarios should be considered, rather than the worst-case scenarios of risk. The development of FSs can be based on assessment indicators, serving to minimize undesired effects without impacting the reliability of the system. It may be advisable to take one or two of the most suitable FSs into consideration for each level to achieve the most credible risk scenario.
Table 2 presents a systematic review of some previous studies related to RBM approaches, highlighting gaps in order to provide a basis for the maintenance plan for the PG unit.
The majority of previous studies associated with applying RBM focus on the failure of equipment (system), without considering the critical components outside the shaded area, as shown in
Figure 3. Therefore, this study focuses on critical components outside the shaded area, which may represent a threat to the system.
- -
Category 1.2: Consequence of Risk (CoR)
This category aims at prioritizing equipment/components based on their contributions, including risk factors resulting from pressure containment or a leak in any processing line, which may lead to partial production losses. Moreover, any failure in a relief valve may cause the total shutdown of the plant due to the need to replace the valves with a standby. CoR analysis involves the assessment of the probable consequences of the materialization of an FS. Firstly, the CoR measures the radius of the surrounding zone of the risk in which injuries and unit deterioration may occur. Secondly, it determines the likely damage to assets in terms of the effects on the buildings, structures, and units of the plant. Finally, the CoR also contributes to determining the toxic effects of materials on humans and is used to predict the human response to different levels of exposure to toxic chemicals. The damage area is calculated to estimate the risk effects on employees, profit margins, environmental deterioration, and fires and blasts on other assets. The calculation of the CoR involves a set of mathematical tools and analyses, such as WHAZAN and RISKIT [
19], in order to predict hazardous materials and their evaporation rates, as well as the effects of explosions and fires on buildings and structures. The CoR can be classified into four major types, as given below.
- -
System Damage (SD)
SD is associated with the failure of the system or compound. This can be estimated using the following equation:
- -
Financial Loss (FL)
FL reflects the damage for each scenario associated with asset losses. This can be expressed with the following relation:
- -
Human Loss (HL)
HL reflects the losses in terms of human health for each risk scenario using the following equation:
Note: The radius = 500 m. If the employees as a population are uniformly distributed in the risk zone, the factor is assigned a value of [0.2 of 1]; when the employees are situated away from the point that is affected by the risk, a different value is assigned. The value of human health differs from one site to another, according to the extent of the risk and the nature of the work.
- -
Environmental Damage (ED)
ED reflects ecosystem damage and is calculated using the following equation:
Figure 3 shows that the importance factor value can be considered unity when the diagonal length of the damage zone is greater than the distance between the risk point and the ecosystem location. This means that any component outside the shaded area is considered a critical component in the red zone and should be included in the study.
- -
Category 1.3: Probabilistic Failure Analysis (PFA)
PFA is conducted using fault tree analysis (FTA). FTA is a failure analysis tool that uses deductive reasoning to identify undesired events at any component of a system based on failure and success logic. In this category of RBM, FTA combines several basic events to construct the top event scenario. FTA is performed using analytical simulation in order to achieve a PFA [
19]. The main procedures are as follows.
- (a)
Development of Fault Tree Analysis (FTA):
The top event is determined according to information related to the process sequences, control arrangement, and behavior of components for any equipment or unit, seeking to develop the logical dependency among the failure causes that can lead to the top event.
- (b)
Creation of Boolean Matrix (BM):
To develop the FTA, the fault tree is first transformed into a BM. If the BM is too large, the structural stage technique should be applied.
- (c)
Identification and Optimization of Minimum Cut Sets (MCSs):
The MCSs are identified from the Boolean matrix [
17]. In the first step, each stage is solved individually. In the second step, the results are combined. Then, the MCSs are optimized using the appropriate model in order to eliminate unimportant paths.
- (d)
Probability Analysis (PA):
Many tools can be used to estimate the probabilities of the minimum cut sets. The Monte Carlo simulation method is one of the tools used to estimate probabilities [
18]. The fuzzy probability set is considered as one of the theories used in the analytical simulation [
19].
- (e)
Estimation of Improvement Index (MI):
The MI is a measure of the effect on the final failure event for each root cause. The improvement indicators can be estimated according to the simulation results. To determine the impact of the root cause, it is necessary to conduct the simulation twice, with and without the cause, to identify an appropriate scale that can change the probability of occurrence of the final event.
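Procedures (c) to (e) can be illustrated with a minimal Python sketch. It is not the MATLAB implementation used in the study: the fault tree, the basic-event names, and their probabilities are hypothetical, and the improvement index is taken here, as an assumption, to be the drop in top-event probability when one root cause is removed.

```python
import random


def top_event_probability(cut_sets, p_basic, trials=100_000, seed=0):
    """Monte Carlo estimate of the top-event probability.

    cut_sets: list of sets of basic-event names (minimal cut sets).
    p_basic:  dict mapping each basic event to its failure probability.
    """
    rng = random.Random(seed)
    events = sorted(p_basic)
    hits = 0
    for _ in range(trials):
        # Sample which basic events fail in this trial.
        failed = {e for e in events if rng.random() < p_basic[e]}
        # The top event occurs if all events of at least one cut set failed.
        if any(cs <= failed for cs in cut_sets):
            hits += 1
    return hits / trials


def improvement_index(cut_sets, p_basic, cause, **kwargs):
    """Run the simulation with and without `cause` and return the difference."""
    with_cause = top_event_probability(cut_sets, p_basic, **kwargs)
    without_cause = top_event_probability(
        cut_sets, {**p_basic, cause: 0.0}, **kwargs
    )
    return with_cause - without_cause


# Hypothetical fault tree: the top event occurs if the seal fails,
# or if both the bearing and the lube pump fail.
cut_sets = [{"seal"}, {"bearing", "lube_pump"}]
p_basic = {"seal": 0.10, "bearing": 0.20, "lube_pump": 0.30}

p_top = top_event_probability(cut_sets, p_basic)
mi_seal = improvement_index(cut_sets, p_basic, "seal")
```

Because both runs reuse the same random seed, the difference between them reflects the removed cause rather than sampling noise. For this small tree, the estimate can be checked against the exact value 1 − (1 − 0.10)(1 − 0.20 × 0.30) = 0.154.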
- -
Category 1.4: Risk Estimation (RE)
Based on the results of the PoF and CoF analyses, the risk is determined for each unit. The CoF analysis covers the factors associated with fatalities, economic losses, environmental damage, and system performance unreliability. Therefore, the estimated risk can be evaluated against the acceptable risk in the next stage.
- ▪
Stage II: Risk Evaluation (RV)
This RBM stage aims at evaluating the predetermined risk to determine whether the RE of each FS is accepted or not, as shown in
Figure 2. This stage can be broken down into two categories, as detailed below.
- -
Category 1. Acceptable Risk (RA)
This category identifies the acceptable risk (RA) criterion to be applied, which depends on the nature and type of the work.
- -
Category 2. Risk Comparison to Acceptable Risk (RA)
This category compares the RE of each component of the system against the RA. When the RE of a component exceeds the RA, the component must be taken into account in order to reduce its risk and develop a maintenance plan.
- ▪
Stage III: Maintenance Planning (MP)
The level of RE for components that exceed the RA is taken into consideration to schedule maintenance plans and mitigate the risk level. The details are given below.
- -
Category 1. Estimation of Mean Time to Failure (MTTF)
This category is aimed at identifying failure causes in order to determine which ones may adversely affect the PoF. Then, a reverse fault analysis can be conducted in order to estimate the PoF for any root event and complete the maintenance plan (
Figure 4).
- -
Category 2. Verification of the System Level (VSL)
This category aims at the verification of the maintenance plan, ensuring that it poses no threat to operational stability, based on the RA level of the system.
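The reverse analysis in Stage III can be sketched as follows. The sketch assumes that the risk is the product of the PoF and the CoF, so the largest tolerable PoF is RA/CoF, and that the failure behavior follows an exponential or Weibull model. The acceptable risk and the failure rate below are hypothetical values, not data from the study.

```python
import math


def target_pof(acceptable_risk, consequence):
    """Largest PoF that keeps PoF x CoF within the acceptable risk."""
    return min(acceptable_risk / consequence, 1.0)


def interval_exponential(p_target, failure_rate):
    """Operating time at which 1 - exp(-rate * t) reaches p_target."""
    if not 0.0 < p_target < 1.0:
        raise ValueError("p_target must lie strictly between 0 and 1")
    return -math.log(1.0 - p_target) / failure_rate


def interval_weibull(p_target, shape, scale):
    """Operating time at which 1 - exp(-(t / scale) ** shape) reaches p_target."""
    if not 0.0 < p_target < 1.0:
        raise ValueError("p_target must lie strictly between 0 and 1")
    return scale * (-math.log(1.0 - p_target)) ** (1.0 / shape)


# Hypothetical inputs: RA in USD, CoF in USD, failure rate per hour.
p_max = target_pof(acceptable_risk=50_000.0, consequence=172_000.0)
t_max_hours = interval_exponential(p_max, failure_rate=1.0e-4)
```

Under these assumptions, `t_max_hours` is the longest operating period before the estimated risk exceeds the acceptable level, which is one way to read the maintenance interval.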
5. Results and Discussion
The failure rate (λ) associated with the exponential model, and the shape (ϕ) and scale (δ) parameters associated with the Weibull model of the PG unit, as shown in Equation (6), are estimated according to the failure behavior of the component (n). These parameters are identified according to failure data listed in maintenance records (PG unit—gas plant of SOC). Thus, these parameters are considered reasonable assumptions to identify the appropriate model in order to determine the PoF for each component of the PG unit, as shown in
Table 6.
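For reference, the two failure models can be written as a short Python sketch. Equation (6) is not reproduced in this section, so the standard exponential and two-parameter Weibull forms are assumed, with λ as the failure rate, ϕ as the shape, and δ as the scale. The function names are illustrative only.

```python
import math


def pof_exponential(t, failure_rate):
    """PoF at time t for the exponential model: 1 - exp(-lambda * t)."""
    return 1.0 - math.exp(-failure_rate * t)


def pof_weibull(t, shape, scale):
    """PoF at time t for the Weibull model: 1 - exp(-(t / delta) ** phi)."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def mttf_exponential(failure_rate):
    """MTTF of the exponential model: 1 / lambda."""
    return 1.0 / failure_rate


def mttf_weibull(shape, scale):
    """MTTF of the Weibull model: delta * Gamma(1 + 1 / phi)."""
    return scale * math.gamma(1.0 + 1.0 / shape)
```

With a shape of 1, the Weibull model reduces to the exponential model with λ = 1/δ, which gives a quick consistency check on fitted parameters.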
The RE criterion was determined according to the PoF and CoF for the PG unit (e.g., at 10 MCS, we found values of 0.3985 and USD 172,000 for the PoF and CoF, respectively) and compared against the acceptable risk criterion (RA) identified by the company. Note that all processing companies have their own RA criteria, which depend on several aspects, such as the operating conditions, geographical environment, and economic aspects.
The risk index (RI), expressed as the ratio RE/RA, should be used to decide when to shut down a unit in order to execute maintenance activities on the critical components that present extreme risks. This ratio must therefore be kept at or below 1.0 to avoid any threat.
Therefore, the maintenance plan of the generator must be executed at 288 days to avoid any threat associated with the oil seal system.
Similarly, maintenance for the HP water should be scheduled every 348 days. For this equipment, any operating period of more than 348 days may result in a threat to the boiler feed pumps, water pumps, or both, so the operating time of the HP water should not exceed 348 days.
Table 7 shows the PoF and CoF values used to identify the risk estimate (RE) and risk index (RI) for each component of the PG unit. Based on the RI, four components were found to violate the risk criterion: MCS numbers 8, 5, 4, and 6.
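The screening against the risk criterion can be sketched in Python as follows. The sketch assumes that RE is the product of the PoF and the CoF and that RI is the ratio RE/RA. The PoF and CoF for MCS 10 are the values quoted above; the acceptable risk and the second component are hypothetical.

```python
def risk_estimate(pof, cof):
    """Estimated risk (RE), assuming RE = PoF x CoF."""
    return pof * cof


def risk_index(re_value, ra_value):
    """Risk index (RI), assuming RI = RE / RA."""
    return re_value / ra_value


ACCEPTABLE_RISK = 100_000.0  # hypothetical RA in USD, set by the company

components = {
    "MCS 10": (0.3985, 172_000.0),  # PoF and CoF quoted in the text
    "MCS X": (0.60, 250_000.0),     # hypothetical component
}

results = {}
for name, (pof, cof) in components.items():
    re_value = risk_estimate(pof, cof)
    results[name] = (re_value, risk_index(re_value, ACCEPTABLE_RISK))

# Components whose RI exceeds 1.0 violate the risk criterion.
critical = [name for name, (_, ri) in results.items() if ri > 1.0]
```

With this hypothetical RA, MCS 10 stays below the threshold and only the hypothetical component is flagged. The actual outcome depends on the RA criterion of the company.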
The RI values identify components 4, 5, 6, and 8 as the main contributors to the risk of the PG unit.
Failure data were obtained from the organization’s maintenance logs and the original equipment manufacturer (OEM) Guide. The data were cleaned to remove inconsistencies and normalized for analysis. The MTTF was calculated using exponential and Weibull distribution fitting for each MCS.
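The fitting method is not detailed here, so the following Python sketch shows only one common possibility: a maximum likelihood estimate for the exponential failure rate and a median rank regression for the Weibull parameters, both for complete (uncensored) times to failure. The function names are hypothetical and any input data would have to come from the maintenance records.

```python
import math


def fit_exponential(times):
    """Maximum likelihood failure rate for complete times to failure."""
    return len(times) / sum(times)


def fit_weibull_mrr(times):
    """Weibull shape and scale by median rank regression.

    Uses Bernard's approximation (i - 0.3) / (n + 0.4) for the median rank
    and a least-squares line through (ln t, ln(-ln(1 - F))).
    """
    ts = sorted(times)
    n = len(ts)
    xs = [math.log(t) for t in ts]
    ys = [
        math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4)))
        for i in range(1, n + 1)
    ]
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    shape = sxy / sxx                        # slope of the fitted line
    scale = math.exp(x_bar - y_bar / shape)  # from intercept = -shape * ln(scale)
    return shape, scale
```

The MTTF then follows from the fitted parameters: 1/λ for the exponential model and δ·Γ(1 + 1/ϕ) for the Weibull model.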
Table 8 shows the results, listed as maintenance intervals and associated with the uptime of the unit. This can be expressed as the mean time to failure (MTTF). Any RI exceeding 1.0 would lead to serious consequences that may affect the assets of the company. Based on the RI, any component that presents a high risk should be included in the T/A program to avoid entering the red zone, as shown in
Table 1. Therefore, MCS numbers 8, 5, 4, and 6 should undergo turnaround maintenance (T/A) every 8353, 6921, 1750, and 8050 h, respectively. Other components can undergo noncritical maintenance activities.
In addition, the annual availability level of MCS 8 must be 95.3% in order to enhance the reliability of the component to 82.50% and reduce downtime, which requires the maintenance to be completed within 17 days, as compared to the current maintenance duration of the HP water system located in the gas plant at the SOC. This illustrates that any decrease in the duration of maintenance (MTTR) would reduce the cost of maintenance based on RBM. In addition, according to the real-world validation of the results for the PG unit of the SOC, RBM reduces the risk by 4.65%, 21%, 80%, and 8.1% annually for the HP water, generator, steam turbine, and turbine, respectively.
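The availability figure can be checked with a short Python sketch. It assumes the steady-state relation availability = MTTF / (MTTF + MTTR) and reads the 17 days quoted for MCS 8 as the maintenance duration (MTTR). Both are assumptions of this sketch, not statements from the study.

```python
HOURS_PER_DAY = 24


def availability(mttf, mttr):
    """Steady-state availability, with MTTF and MTTR in the same unit."""
    return mttf / (mttf + mttr)


# T/A intervals quoted in the text (hours) for MCS 8, 5, 4, and 6.
intervals_hours = {8: 8353, 5: 6921, 4: 1750, 6: 8050}
intervals_days = {mcs: h / HOURS_PER_DAY for mcs, h in intervals_hours.items()}

# MCS 8: interval of about 348 days, assumed maintenance duration of 17 days.
a_mcs8 = availability(mttf=348, mttr=17)
```

Here 348 / (348 + 17) ≈ 0.953, which agrees with the 95.3% availability stated for MCS 8, and 8353 h corresponds to about 348 days.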
6. Conclusions
This paper presents an approach to developing a maintenance plan. The plan reflects the broader trend toward RBM, where PG unit components are no longer maintained solely based on the recommendations of the manufacturer but according to the critical faults of components. Maintenance is prioritized based on critical components that are identified using FTA. The approach also considers the MTTF, which accounts for the effects of failures and their consequences, seeking to reduce them. In order to determine the MTTF, data are collected for components that would be maintained at the same time, and the RI must not exceed 1.0 to avoid any serious consequences that may occur due to a delay in maintenance activities for the whole plant/unit. This means that some pieces of equipment, such as MCSs 8, 5, 4, and 6, are critical components and should be maintained at a certain time to reduce downtime and to justify the planned maintenance policy. In this study, the HP water was one of the critical components that had an unacceptable risk (the RI exceeded 1.0) due to failures that stemmed from seal damage and worn bearings. These failures cause clogged drains, reduced flow, and the overheating of the system. Thus, it should be subjected to T/A maintenance activities every 348 days to avoid entering the high-risk zone, which may threaten the assets of the company. The use of RBM resulted in a risk decrease of 4.65% for the PG unit of the SOC based on the HP water.
With this approach, the T/A maintenance activities of the system can be executed according to the critical component that presents the highest RI. Moreover, the PG unit should undergo maintenance every 290 days to avoid the unanticipated consequences of failures between operation periods, especially when there is a high risk. The results of the development of the maintenance plan were validated against records of maintenance activities for the HP water in the gas plant at the SOC, which scheduled maintenance once every year based on the predetermined costs. Thus, the use of a suitable RBM approach can yield important results for decision making and ensure the safe and effective operation of the plant and enhance the reliability of the system. The presented risk-based maintenance approach demonstrated effectiveness in identifying critical components and optimizing the interval and duration of maintenance for the PG unit at the SOC, which operates in a high-risk industrial environment and under harsh conditions. However, several directions for future research and development are suggested.
- 1.
Validation across a wider range of industrial environments
The methodology should be tested across a wider range of industrial environments, particularly those operating under harsh or variable conditions, such as offshore platforms, to validate its generalizability and robustness.
- 2.
Integration with real-time monitoring systems
Incorporating real-time sensor data and condition monitoring (e.g., vibration analysis, thermal imaging, or oil analysis) could enhance the accuracy of PoF and CoF estimates, enabling dynamic risk assessment and predictive maintenance scheduling.
- 3.
Integration with multi-criteria optimization
It is necessary to incorporate cost–benefit analysis and multi-objective optimization (e.g., minimizing downtime, risk, and cost simultaneously) to better support strategic maintenance decision making under resource constraints.
By pursuing these directions, the RBM methodology can be strengthened and adapted for more comprehensive, cost-effective, and risk-aware maintenance planning in a variety of industrial settings.