Article

Comparative Analysis of Standard Operating Procedures Across Safety-Critical Domains: Lessons for Human Performance and Safety Engineering

1
Department of Industrial and Systems Engineering, College of Engineering, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
2
Center for Air Transportation Systems Research, George Mason University, Fairfax, VA 22030, USA
*
Author to whom correspondence should be addressed.
Systems 2025, 13(8), 717; https://doi.org/10.3390/systems13080717
Submission received: 23 June 2025 / Revised: 4 August 2025 / Accepted: 15 August 2025 / Published: 20 August 2025
(This article belongs to the Section Systems Engineering)

Abstract

Standard Operating Procedures (SOPs) serve a critical role in complex systems operations, guiding operator response during normal and emergency scenarios. This study compares 29 SOPs (517 steps) across three domains with varying operator selection rigor: airline operations, Habitable Airlock (HAL) operations, and semi-autonomous vehicles. Using the extended Procedure Representation Language (e-PRL) framework, each step was decomposed into perceptual, cognitive, and motor components, enabling quantitative analysis of step types, memory demands, and training requirements. Monte Carlo simulations compared Time on Procedure against the Allowable Operational Time Window to predict failure rates. The analysis revealed three universal vulnerabilities: missing verification steps after waiting requirements (70% of waiting steps in airline operations, 58% in HAL operations, and 25% in autonomous vehicle procedures), ambiguous perceptual cues (15–48% of steps), and excessive memory demands (highest in HAL procedures at 71% average recall score). Procedure failure probabilities varied significantly (5.72% to 63.47% across domains), with autonomous vehicle procedures showing the greatest variability despite minimal operator selection. Counterintuitively, Habitable Airlock procedures requiring the most selective operators had the highest memory demands, suggesting that rigorous operator selection may compensate for procedure design deficiencies. These findings establish that procedure design approaches vary by domain based on assumptions about operator capabilities rather than universal human factors principles.

1. Introduction

Standard Operating Procedures (SOPs) serve as critical bridges between human operators and complex systems, especially during abnormal and emergency situations [1,2]. While research has examined SOPs within individual domains, few studies have conducted systematic cross-domain comparisons to identify universal principles for effective SOP design that transcend specific technical applications.
The significance of cross-domain analysis is emphasized by consistent findings in accident investigations across multiple industries. Accidents frequently occur when the Time on Procedure (ToP) exceeds the Allowable Operational Time Window (AOTW) [3,4], often due to ambiguous cues, excessive recall demands, or inadequate verification steps. Despite these commonalities, research communities in aviation, space operations, and autonomous systems remain largely isolated, preventing autonomous vehicle designers from learning aviation’s approaches to unambiguous cue design or space operations’ verification protocols.
This paper applies an analytical framework to compare SOPs across three domains with distinct characteristics: commercial aviation (highly selective operators with extensive training), Habitable Airlock (HAL) operations (extremely selective operators with specialized training), and semi-autonomous vehicles (minimally selective operators with limited training). By employing the extended Procedure Representation Language (e-PRL) framework [5], SOP performance characteristics that have been previously assessed only qualitatively are quantified.
The research questions addressed in this paper are as follows:
  • How do SOP characteristics differ across domains with varying levels of operator selection and training?
  • How do operator selection criteria influence SOP design vulnerabilities across different technical domains?
  • What cross-domain lessons can be applied to enhance safety in emerging technical systems?

1.1. Key Contributions

This study reveals three counterintuitive phenomena: (1) domains with the most selective operators impose the highest memory demands (71% vs. 44–46% in other domains), suggesting rigorous selection enables rather than prevents design deficiencies; (2) universal vulnerabilities exist across all domains regardless of operator training (25–70% of procedures lack verification steps); and (3) procedure design reflects institutional culture more than human factors principles. These findings establish the first quantitative framework for cross-domain SOP analysis and provide evidence-based design principles for emerging safety-critical systems.

1.2. Method Foundation and Limitations

The analytical framework presented in this study builds upon established human factors principles, including the trigger–decide–act model of human information processing [6] and documented relationships between cue characteristics and response times [1,5]. The e-PRL framework extends established procedure representation languages used in space operations [7], while the Monte Carlo simulation approach follows established practices in human reliability analysis.
However, there are important limitations in this approach. The time distributions assigned to procedure components are based on general human performance data rather than domain-specific validation studies. AOTW estimates rely on published safety margins and calculated constraints rather than empirical measurement. The simulation results should, therefore, be interpreted as analytical predictions that identify relative vulnerabilities between domains rather than absolute failure probabilities.
Future validation against operational performance data will be essential to establish the predictive accuracy of this framework across different domains and operational contexts.

1.3. Rationale for Cross-Domain Analysis

Research communities in aviation, space operations, and autonomous systems remain largely isolated, preventing systematic knowledge transfer despite common human factors challenges [3,4]. Single-domain studies cannot distinguish between universal human limitations and domain-specific cultural practices. This distinction is critical for emerging domains like autonomous vehicles, where operators receive minimal training yet must perform safety-critical interventions [8,9]. Without cross-domain insights, these domains risk adopting inappropriate design approaches based on assumptions about operator capabilities that may not match their operational constraints. Cross-domain comparison enables identification of universal vulnerabilities versus domain-specific issues, providing evidence-based guidance for procedure design across the spectrum of human–machine systems.

2. Materials and Methods

The methodology follows a systematic data processing framework, as shown in Figure 1, that transforms raw SOP data from three safety-critical domains into quantifiable metrics for cross-domain comparison.
The framework processes 517 individual SOP steps (171 aviation, 321 HAL, 25 autonomous vehicle) through systematic e-PRL decomposition and metric calculation. The resulting step type distributions, recall scores, and training estimates provide the foundation for vulnerability analysis and performance simulation detailed in subsequent sections.
This study analyzed a total of twenty-nine (29) SOPs from three (3) domains, selected to represent varying levels of operator selection and training. Table 1 shows the three domains along with the operator selection and training associated with each domain.

2.1. SOP Selection and Preprocessing

Commercial aviation SOPs (15 procedures, 171 steps) were obtained from the A318/A319/A320/A321 Flight Crew Operating Manual (FCOM) [10] and Boeing 747-441 Operations Manual [11]. These included both normal procedures (13) and emergency procedures (2). Procedures covered takeoff, landing, approach stabilization, and systems management during both normal and abnormal conditions.
A set of hypothetical procedures for operating a Habitable Airlock (HAL), a cabin designed to support astronauts during short-term scientific studies when detached from larger vehicles, was used (8 procedures, 321 steps). The hypothetical HAL procedures were developed based on established space operations protocols from NASA’s International Space Station (ISS) procedures. While not operational procedures, they represent realistic operational scenarios for habitat-based space exploration missions currently under development by NASA and commercial space companies. The hypothetical nature limits absolute generalizability but provides sufficient structural complexity to demonstrate cross-domain analytical principles. Future studies should validate findings using operational space procedures when available.
Semi-autonomous vehicle SOPs (6 procedures, 25 steps) were extracted from the Tesla Model 3 Owner’s Manual [12] using a Large Language Model-assisted approach [5]. Procedures focused exclusively on human takeover requirements for safety-critical scenarios, including Forward Collision Warning, Lane Departure Avoidance, Blind Spot Collision, Autosteer Disable, Intersection Turn Takeover, and Unexpected Autopilot Behavior Intervention.
In total, these procedures contained 517 individual SOP steps (171 aviation, 321 HAL, and 25 autonomous vehicle) that were classified and analyzed. The uneven sample sizes reflect the natural complexity and documentation practices within each domain. To address potential bias from sample size differences, the analysis employs proportional metrics (percentages) rather than absolute counts for cross-domain comparisons.

2.2. Extended Procedure Representation Language (e-PRL) Framework

The e-PRL framework [5] provides a canonical structure for decomposing each SOP step into its basic perceptual, cognitive, and motor components. This structure is based on human information processing models [6] and reflects the trigger–decide–act cycles that govern human–machine interactions [13].
The e-PRL model categorizes SOP step elements into 16 distinct components:
  • Actor: The operator responsible for executing the step;
  • Trigger (What): The condition for initiating the step;
  • Trigger (How): The data required for recognizing the triggering condition;
  • Trigger (Where): The source of the triggering data (e.g., display, environment);
  • Decide (What): The decision to be made;
  • Decide (How): The data required to make the decision;
  • Decide (Where): The source of the decision data;
  • Action (What): The physical action to be performed;
  • Action (How): The specific manipulation required;
  • Action (Where): The input device used;
  • Waiting (What): The condition being waited for;
  • Waiting (How): The data indicating the waiting condition has been met;
  • Waiting (Where): The source of the waiting data;
  • Verification (What): The verification action to be performed;
  • Verification (How): The data required for verification;
  • Verification (Where): The source of the verification data.
To classify SOP steps into their e-PRL components, a systematic process following the method outlined in [14] was employed. For aviation and HAL SOPs, this classification was performed manually by two trained analysts with domain expertise. For autonomous vehicle SOPs, the initial classification was supported by an LLM-based method [5], with manual verification by the same analysts.

2.3. SOP Analysis Metrics

Three primary metrics were applied to evaluate SOPs across domains: (1) SOP Step Type, (2) SOP Step Recall Score, and (3) SOP Training Requirements.

2.3.1. SOP Step Type

Each SOP step was categorized into one of four types based on its structural components:
  • Action-Only: Steps containing only trigger and action components (minimum of 4 e-PRL elements);
  • Decision–Action: Steps containing trigger, decision, and action components (minimum of 6 e-PRL elements);
  • Action with Waiting and Verification: Steps containing trigger, action, waiting, and verification components (minimum of 8 e-PRL elements);
  • Decision–Action with Waiting and Verification: Steps containing all components (minimum of 10 e-PRL elements).
This categorization was performed using the algorithm described in [14], which evaluates the presence of specific e-PRL components to determine the step type.
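The categorization logic can be sketched as follows. This is a simplified illustration, not the algorithm from [14]; it assumes each step is represented by the set of e-PRL component groups it contains (the group names are hypothetical labels):

```python
def classify_step(components: set) -> str:
    """Map the e-PRL component groups present in a step to one of the
    four step types used in this study. `components` holds group labels
    such as "trigger", "decide", "action", "waiting", "verification"."""
    has_decision = "decide" in components
    has_wait_verify = "waiting" in components and "verification" in components
    if has_decision and has_wait_verify:
        return "Decision-Action with Waiting and Verification"
    if has_wait_verify:
        return "Action with Waiting and Verification"
    if has_decision:
        return "Decision-Action"
    return "Action-Only"
```

For example, a step containing only trigger and action components maps to "Action-Only", while one containing all five groups maps to "Decision-Action with Waiting and Verification".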

2.3.2. SOP Step Recall Score

The SOP Step Recall Score (RSS) quantifies the implicit knowledge required to complete a step, calculated as
RSS = 100 × (ISDE / TEA)
where RSS is the SOP Step Recall Score (%), ISDE is the number of implied system-description e-PRL elements, and TEA is the total number of e-PRL elements available in the step.
The number of implied system-description e-PRL elements must be recalled from memory rather than explicitly stated in the procedure. The total number of e-PRL elements available in a step varies by step type: 4 for Action-Only, 6 for Decision–Action, 8 for Action with Waiting and Verification, and 10 for Decision–Action with Waiting and Verification.
An e-PRL element was classified as “explicit” if the information was directly stated in the SOP text and “implied” if it was absent but necessary for procedure execution. To ensure consistency, implied elements were further categorized as follows:
  • Implied-Sequential: Information absent due to the sequential nature of steps;
  • Implied-Instantaneous: Information absent due to its instantaneous nature;
  • Implied System-Description: Information absent that requires operator system knowledge.
Only the third category contributes to the recall score, as it represents memory demands placed on the operator.
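Equation (1) can be illustrated with a short sketch. The element counts per step type follow the text above; the function and variable names are hypothetical:

```python
# Total e-PRL elements available (TEA) per step type, as defined above.
TEA_BY_STEP_TYPE = {
    "Action-Only": 4,
    "Decision-Action": 6,
    "Action with Waiting and Verification": 8,
    "Decision-Action with Waiting and Verification": 10,
}

def recall_score(step_type: str, implied_system_description: int) -> float:
    """Return the SOP Step Recall Score (%) for a single step.
    Only Implied System-Description elements (ISDE) count toward
    the numerator; Implied-Sequential and Implied-Instantaneous
    elements do not contribute."""
    tea = TEA_BY_STEP_TYPE[step_type]
    return 100.0 * implied_system_description / tea
```

For instance, an Action-Only step (4 available elements) with 2 implied system-description elements yields a recall score of 50%.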

2.3.3. Training Requirements

Training requirements were estimated using an adaptation of Matessa and Polson’s model [15], which relates list learning to procedure learning. The original model was developed for word list memorization tasks and required modification for procedural learning contexts. The base model assumes uniform item difficulty and sequential learning, which inadequately represents SOP learning: steps vary significantly in cognitive complexity, memory demands differ depending on how much information is explicit versus implicit, and procedural knowledge involves spatial and temporal relationships beyond simple recall. The adaptation incorporates two key modifications. First, instead of treating all items equally, each step is weighted by its recall score to account for varying cognitive demands. Second, whereas the original coefficients were derived from word list learning, new coefficients were fitted by regression against empirical procedural training data, as detailed in [5].
The model estimates the following:
Total Number of Repetitions (TNR): The number of practice repetitions required to reach proficiency, calculated as
TNR_SOP = 2.35 × Σ_{i=1}^{n} (1 + RSS_i) + 3.24
where n is the number of steps in the SOP, and RSS_i is the recall score of step i from Equation (1), expressed as a fraction of 1 (i.e., RSS/100).
Number of Days (ND): The training days required to reach proficiency, calculated as
ND_SOP = 2.16 × ln(TNR_SOP) + 0.68
This model accounts for both the number of steps and their memory demands, with higher recall scores requiring additional repetitions. This adaptation affects training predictions by differentiating procedures based on cognitive complexity rather than step count alone. For example, under the original model with uniform weighting, a procedure with 103 steps counts as 103 units, while under the adapted model the same procedure has 135 adjusted units (a 30% increase due to high memory demands). This weighting better reflects the intuitive relationship between memory demands and learning difficulty. However, these represent relative estimates for cross-domain comparison rather than validated absolute training predictions.
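The two training estimates can be sketched together. This is a minimal illustration of the adapted model, assuming per-step recall scores expressed as fractions; the function name is hypothetical:

```python
import math

def training_estimates(recall_scores):
    """Estimate Total Number of Repetitions (TNR) and Number of Days (ND)
    for one SOP. `recall_scores` holds per-step RSS values as fractions
    (e.g. 0.71 for 71%). Coefficients are those reported in the text;
    3.24 is taken here as an additive regression term."""
    tnr = 2.35 * sum(1.0 + rss for rss in recall_scores) + 3.24
    nd = 2.16 * math.log(tnr) + 0.68
    return tnr, nd
```

For a 10-step procedure with uniform 50% recall scores, this yields roughly 38 repetitions over about 8.6 training days, and procedures with higher recall demands yield proportionally larger estimates.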

2.3.4. Performance Simulation

To estimate the probability of procedure failure under varying conditions, Monte Carlo simulations were conducted using the Time on Procedure (ToP) vs. the Allowable Operational Time Window (AOTW) model [5].
For each SOP, each step was decomposed into trigger and action components. Then, time distributions were assigned to each component based on cue characteristics. This was followed by generating time samples by drawing randomly from these distributions. The total ToP for each simulated execution was calculated and compared against the AOTW distribution. Finally, the Probability of Failure to Complete (PFtC) was computed.
Time distributions were assigned based on (1) Cue Type categorized as (a) Not in Field of View; (b) In Field of View but Not Salient; (c) In Field of View, Salient but Ambiguous; (d) In Field of View, Salient and Unambiguous; or (e) No Cue (Long-Term Memory) and (2) Frequency of Use, categorized as (a) Always/Frequent, (b) Infrequent, or (c) Rare.
These distributions were derived from empirical human performance data collected in prior studies [5] with parameters fitted to triangular distributions.
The AOTW for each procedure was determined either from published data (aviation) or calculated from operational constraints (HAL). For autonomous vehicles, the AOTW was calculated using a collision avoidance model based on vehicle physics, with parameters for vehicle velocity, deceleration rates, following distance, and human reaction time derived from [5]. Each simulation was run for 10,000 iterations to ensure stable convergence of the PFtC estimate.
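The simulation loop described above can be sketched as follows. This is a minimal illustration that lumps each step's trigger and action components into one sampled time; the triangular parameters are placeholders, not the fitted values from [5]:

```python
import random

def simulate_pftc(step_params, aotw_params, iterations=10_000, seed=42):
    """Monte Carlo estimate of the Probability of Failure to Complete
    (PFtC): the fraction of runs in which Time on Procedure (ToP)
    exceeds the Allowable Operational Time Window (AOTW).

    step_params: list of (min, mode, max) triangular parameters,
    one tuple per procedure step; aotw_params: (min, mode, max)
    triangular parameters for the AOTW distribution."""
    rng = random.Random(seed)
    a_lo, a_mode, a_hi = aotw_params
    failures = 0
    for _ in range(iterations):
        # ToP is the sum of the sampled step times for this run.
        top = sum(rng.triangular(lo, hi, mode) for lo, mode, hi in step_params)
        aotw = rng.triangular(a_lo, a_hi, a_mode)
        if top > aotw:
            failures += 1
    return failures / iterations
```

Tightening the AOTW distribution relative to the ToP distribution drives the estimated PFtC toward 1, mirroring the overlap patterns reported in the Results.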
The time distribution parameters used in this study were derived from the established human performance literature [5] rather than domain-specific validation studies. While this approach enables cross-domain comparison using consistent assumptions, it represents a limitation in absolute predictive accuracy. The triangular distributions were selected based on their documented use in human reliability analysis and their ability to capture the asymmetric nature of human response times (faster minimum times, extended tail for delayed responses). AOTW values were validated, where possible, against published operational constraints: aviation procedures used certified minimum equipment times, HAL procedures used calculated life support margins, and autonomous vehicle procedures used physics-based collision models. Future validation should compare these analytical predictions against empirical performance data collected in operational or experimental settings to establish domain-specific correction factors and confidence intervals.
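A physics-based AOTW of the kind used for the autonomous vehicle procedures can be illustrated with simple kinematics. The braking model and parameter names below are simplified assumptions for illustration, not the collision avoidance model from [5]:

```python
def collision_aotw(speed, max_decel, gap, reaction_time):
    """Illustrative AOTW (s) for a takeover scenario: the slack between
    now and the moment braking must begin, assuming a constant closing
    speed, a fixed following gap, and maximum-deceleration braking.

    speed: closing speed (m/s); max_decel: deceleration (m/s^2);
    gap: following distance (m); reaction_time: baseline reaction (s)."""
    stopping_distance = speed ** 2 / (2.0 * max_decel)  # v^2 / (2a)
    time_to_brake_point = (gap - stopping_distance) / speed
    return time_to_brake_point - reaction_time
```

For example, `collision_aotw(20.0, 8.0, 100.0, 1.0)` returns 2.75 s, showing how even generous following distances leave narrow intervention windows at highway speeds.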

3. Results

3.1. e-PRL Statistical Breakdown

3.1.1. SOP Step Type Distribution

The analysis revealed substantial differences in the distribution of SOP step types across domains, as shown in Table 2 and Figure 2.
While Action-Only steps predominated across all domains (70–80%), their prevalence increased as operator selectivity decreased. Aviation SOPs contained the highest proportion of steps with explicit decision points (18% total), including 4% with complex Decision–Action with Waiting and Verification structures. In contrast, HAL procedures contained no explicit decision steps, despite the highly selective operator population. Autonomous vehicle SOPs had minimal decision steps (4%) despite addressing safety-critical scenarios.
The complete absence of decision steps in HAL procedures was particularly notable given the complex environment and highly trained operators. Further analysis revealed that decisions were typically embedded within action descriptions rather than explicitly delineated, suggesting a domain-specific procedural style rather than an absence of decision-making.

3.1.2. SOP Content Analysis

Table 3 presents the analysis of explicit vs. implicit content in SOP steps across domains.
This analysis revealed several domain-specific patterns: (1) actor specifications, (2) triggering information, (3) input device specifications, (4) verification requirements, and (5) decision information.
Aviation decision steps frequently specified actors (79%), while HAL procedures never identified actors explicitly. This difference reflects aviation’s crew coordination needs versus HAL’s single-operator environment. Additionally, aviation SOPs consistently specified triggering conditions (55–83%), while HAL SOPs rarely did so (1–3%). Autonomous vehicle SOPs showed high trigger specification (70–100%), crucial for time-critical interventions.
Aviation SOPs frequently specified input devices (81–100%), while HAL SOPs rarely did so (1–9%). This suggests greater reliance on operator system knowledge in HAL operations. Additionally, among steps with waiting requirements, aviation SOPs more consistently included verification (71–76%) compared to HAL SOPs (24–76%). Autonomous vehicle SOPs showed high verification specification (75%) but applied to a limited number of steps. While aviation procedures contained explicit decision steps (18% of the total), they rarely specified where to find decision-critical information (0%) or how to interpret it (0%). The single decision step in autonomous vehicle procedures, by contrast, included explicit information on how to make the decision (100%).
These findings demonstrate significant cross-domain variation in the explicitness of procedural content, suggesting different assumptions about operator knowledge and training.

3.1.3. SOP Step Recall Score

The SOP Step Recall Score quantifies memory demands by measuring the percentage of implicit system-description knowledge required. Figure 3 shows the distribution of recall scores by domain and step type.
Aviation procedures demonstrated moderate recall demands (average 46%, σ = 17%), with significant variation across procedures (range: 25–81%). HAL procedures showed consistently high recall demands (average 71%, σ = 15%), despite being operated by the most selective personnel. Autonomous vehicle procedures presented moderate but variable recall demands (average 44%, σ = 21%).
Table 4 presents the average recall score by SOP and domain.
This analysis revealed that the highest recall scores were in the HAL domain, with HAL 3.1 series procedures imposing extreme memory demands (88–96%). Additionally, there were significant variations within domains, with some procedures requiring nearly four times the recall demands of others (25% vs. 96%). Counterintuitively, some of the most critical autonomous vehicle procedures (Forward Collision Warning and Blind Spot Collision) had relatively low recall demands (25–27%), while non-emergency procedures like Intersection Turn Takeover had much higher demands (80%).
Across all domains, high recall scores were associated with implicit information about where to find relevant data (Trigger Where and Decide Where) and how to manipulate controls (Action How). The pattern suggests that procedures often assume operators know the physical layout of displays and controls without explicit instruction.

3.1.4. Training Requirements

Using the adapted Matessa and Polson model [15], the training requirements were estimated for each procedure. Table 5 presents these estimates.
Autonomous vehicle procedures contained the fewest steps (4–5), while HAL procedures contained substantially more (10–103). However, when adjusted for recall demands, the step complexity increased across all domains. HAL procedures required the longest training periods (8.5–13.1 days), followed by aviation (9.4–10.4 days), with autonomous vehicle procedures requiring the shortest (5.3–6.9 days). The total repetitions required ranged from 9 (autonomous vehicle) to 316 (HAL), representing a 35-fold difference in practice requirements.
The relationship between procedure length, recall demands, and training requirements was not strictly linear. For example, the HAL 5.111 procedure with 103 steps required 316 repetitions (3.1 per step), while the HAL 3.1 procedure with only 17 steps required 76 repetitions (4.5 per step), indicating that step complexity can be more significant than step quantity.

3.2. Identified Vulnerabilities from Statistical Analysis

The e-PRL statistical breakdown revealed four primary vulnerability patterns across domains:

3.2.1. Structural Completeness Gaps

The largest quantitative domain difference was in SOP structure formality, reflected in overall completeness scores: autonomous vehicle procedures showed the highest average explicit content specification (60.1%), followed by aviation (55.5%), and HAL procedures showing the lowest completeness (32.3%). However, domains exhibited distinct structural emphases—aviation SOPs consistently identified actors in 79% of decision steps and included decision points in 18% of all steps, HAL SOPs rarely specified actors (0%) but emphasized verification elements (42–76% specification rates), and autonomous vehicle SOPs showed variable context specification.

3.2.2. Verification Requirement Gaps

Analysis of steps with waiting requirements revealed systematic omission of verification steps: 70% of aviation waiting steps, 58% of HAL waiting steps, and 25% of autonomous vehicle waiting steps lacked explicit verification requirements. This pattern creates vulnerability to timing-related errors regardless of operator training level.

3.2.3. Memory Burden Distribution

Memory demands showed counterintuitive patterns across domains. HAL procedures imposed the highest memory burdens (71% average recall score) despite having the most selective operators, while domains with less selective operators showed more moderate demands (aviation: 46%; autonomous vehicles: 44%). Individual procedures ranged from 25% to 96% recall requirements.

3.2.4. Cue Specification Inconsistencies

Critical triggering information showed substantial variation in specification rates. Aviation procedures specified triggering conditions in 55–83% of steps and HAL procedures in only 1–3% of steps, while autonomous vehicle procedures specified triggers in 70–100% of steps. Similarly, trigger source specification (where to look for cues) varied dramatically across domains.

3.3. Simulation Results Demonstrating Vulnerability Impact

Monte Carlo simulations comparing Time on Procedure (ToP) against the Allowable Operational Time Window (AOTW) demonstrated how the identified statistical vulnerabilities translate into operational failure rates. Figure 4 shows representative ToP vs. AOTW distributions for each domain. Aviation procedures (Figure 4a) demonstrated minimal overlap with wide safety margins, while HAL procedures (Figure 4b) and autonomous vehicle procedures (Figure 4c) exhibited substantial overlap, indicating high failure risk.

3.3.1. Probability of Failure to Complete Results

Table 6 presents the simulation results across domains, showing the direct relationship between the identified vulnerabilities and procedure performance.

3.3.2. Domain-Level Performance Patterns

Aviation procedures showed the lowest failure rates (5.72%), whereas HAL procedures exhibited consistently high PFtC rates (31–45%), and autonomous vehicle procedures showed extreme variability (2.69–63.47%). Procedures relying on unambiguous cues (Lane Departure; Autosteer Disable) had significantly lower failure rates than those using ambiguous cues (Blind Spot Collision; Forward Collision).
For aviation procedures, the AOTW distribution was consistently wider than the ToP distribution, providing a margin for variation in execution time. For HAL and autonomous vehicle procedures, the distributions showed substantial overlap, indicating minimal safety margin.

3.3.3. Cue Type Analysis and Failure Correlation

Further analysis of individual steps within the problematic procedures identified specific cue types associated with high failure probabilities. Table 7 shows the distribution of cue types across domains.
This analysis highlights that autonomous vehicle procedures relied heavily on ambiguous cues (48%), while aviation procedures predominantly used unambiguous cues (67%). HAL procedures showed a concerning reliance on memory-based execution (22% No Cue), consistent with their high recall scores.

3.3.4. Vulnerability–Performance Correlation

The simulation results demonstrate clear correlations between identified statistical vulnerabilities and operational performance. The 48% reliance on ambiguous cues in autonomous vehicle procedures directly correlates with their extreme performance variability, ranging from 2.69% to 63.47% PFtC across different procedures. This wide range demonstrates that cue quality fundamentally determines procedure reliability, with well-designed cues enabling near-perfect performance, while ambiguous cues result in frequent failures.
HAL procedures’ 22% reliance on memory-based execution aligns with their consistently high failure rates of 31–45% across all simulated procedures. Despite having the most selective operators with extensive training, the heavy memory demands create systematic vulnerabilities that manifest as elevated failure probabilities. This finding challenges assumptions about the relationship between operator capability and procedure design requirements.
Aviation’s 67% use of unambiguous cues and wider AOTW distributions result in the most reliable performance, with only 5.72% PFtC. The combination of clear perceptual triggers and adequate time margins creates robust operational conditions that accommodate normal variations in human performance. This design approach demonstrates how proper cue specification can compensate for moderate memory demands and provide reliable procedure execution across varying operational conditions.

4. Discussion

4.1. Domain-Specific Patterns

The most striking domain difference was in the structure formality, reflected in overall completeness scores: autonomous vehicle procedures showed the highest average explicit content specification (60.1%), followed by aviation (55.5%), and HAL procedures showed the lowest completeness (32.3%). However, domains exhibited distinct structural emphases—aviation SOPs consistently identified actors in 79% of decision steps and included decision points in 18% of all steps, HAL SOPs rarely specified actors (0%) but emphasized verification elements (42–76% specification rates), while autonomous vehicle SOPs showed variable context specification with 100% actor identification in decision steps but inconsistent trigger specification (70–100% across step types).
These differences reflect the institutional cultures and historical development of SOPs in each domain. Aviation’s crew coordination requirements have driven explicit role assignments [2], while space operations emphasize verification due to the catastrophic potential of verification failures [7].
Interestingly, the design approaches across domains do not consistently align with operator selection and training levels. HAL procedures, despite having the most selective operators, imposed the highest memory demands and had among the highest PFtC rates in simulation. This suggests that procedure design practices may be driven more by institutional culture than by objective human factors considerations.

4.2. Cross-Domain Vulnerability Patterns

The analysis revealed three vulnerability patterns that appear across domains but manifest differently based on operator selection and training approaches: (1) verification gap, (2) cue ambiguity, and (3) memory burden.

4.2.1. Comparison of Types of Steps Across Domains

Across all domains, steps with waiting requirements frequently lacked explicit verification steps (70% of aviation waiting steps, 58% of HAL waiting steps, and 25% of autonomous vehicle waiting steps). This pattern creates vulnerability to timing-related errors regardless of operator training, consistent with findings from accident investigations [16,17]. This represents a fundamental cognitive vulnerability in procedure design—when operators must wait for system responses, explicit verification ensures the expected state was achieved before proceeding.
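Once steps have been decomposed, this verification-gap pattern can be screened for mechanically. A minimal sketch, assuming a simplified step representation with hypothetical `has_waiting`/`has_verification` flags rather than the full e-PRL decomposition:

```python
from dataclasses import dataclass

@dataclass
class Step:
    """Minimal illustrative stand-in for an e-PRL-decomposed step."""
    text: str
    has_waiting: bool
    has_verification: bool

def verification_gaps(steps):
    """Return indices of waiting steps that lack explicit verification."""
    gaps = []
    for i, step in enumerate(steps):
        if step.has_waiting and not step.has_verification:
            # A waiting requirement with no check that the expected
            # system state was actually reached before proceeding.
            gaps.append(i)
    return gaps

# Hypothetical three-step fragment for illustration only.
procedure = [
    Step("Set flaps to 20", False, False),
    Step("Wait for APU start", True, False),                 # gap: no verification
    Step("Wait for cabin pressure, verify gauge", True, True),
]
print(verification_gaps(procedure))  # → [1]
```

A screen of this kind applied across a procedure library would surface the 25–70% of waiting steps reported above as candidates for an added verification step.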

4.2.2. Types of Causes of Vulnerabilities

All domains showed cases where critical cues were semantically ambiguous, though the prevalence varied significantly. This was particularly pronounced in autonomous vehicles, where 48% of steps relied on ambiguous cues compared to 15% in aviation and 24% in HAL procedures. Previous research has identified cue ambiguity as a significant factor in procedure failures [3,4]. The prevalence of ambiguous cues in autonomous vehicles is particularly concerning, given the limited training of operators and the safety-critical nature of the interventions.
These differences become clear through concrete examples. Aviation procedures typically provide unambiguous cues such as “When engine fire warning light illuminates red on overhead panel,” specifying both the trigger condition and its precise location. In contrast, Tesla manual procedures often use ambiguous cues like “When you hear a chime,” requiring operators to recall from memory which of several possible chime meanings applies to the current situation. HAL procedures frequently rely on memory-based cues such as “Monitor airlock pressure” without specifying display locations, acceptable pressure ranges, or verification methods.

4.2.3. Types of Causes of Vulnerabilities—Recall Scores

High recall scores were associated with longer training requirements across all domains. Notably, HAL procedures imposed the highest memory burdens (71% average recall score) despite being operated by the most selective personnel, a finding that aligns with cognitive load theory [18,19]. The extreme recall demands in some HAL procedures (88–96%) suggest a reliance on extensive training rather than procedure design to ensure correct execution.
These vulnerability patterns appear in different proportions across domains but represent common failure modes in procedure design. The analysis suggests that they arise from different design philosophies and assumptions about operator capabilities rather than domain-specific technical constraints.

4.2.4. Value of Cross-Domain Insights

The cross-domain analysis reveals findings that are impossible to discover through single-domain studies. The counterintuitive result that HAL procedures impose the highest memory demands despite having the most selective operators suggests that rigorous selection enables poor design practices rather than driving better design. The universal verification gap (25–70% of waiting steps lack verification across all domains) indicates a fundamental design vulnerability rather than a domain-specific issue.
These insights enable concrete knowledge transfer: autonomous vehicle designers can adopt aviation’s unambiguous cue design (leading to 5.72% vs. 63.47% failure rates), while all domains can implement systematic verification requirements to address the universal vulnerabilities identified.

4.3. Implications for Emerging Technical Domains

These findings have significant implications for emerging domains like autonomous vehicles, where operators receive minimal domain-specific training yet must perform critical intervention tasks with high reliability [8,9].
The autonomous vehicle SOPs analyzed showed concerning characteristics, with a high reliance on ambiguous cues (colored indicator lights; generic chimes), an extensive use of multi-function, moded devices, and limited explicit verification requirements. These characteristics, combined with the minimal selection and training of operators, suggest potential safety vulnerabilities that could be addressed by applying lessons from more established domains [20].
The extreme variation in PFtC rates among autonomous vehicle procedures (2.69–63.47%) demonstrates that effective procedure design is possible within the domain’s constraints. The Autosteer Disable procedure, with its 2.69% failure rate, provides a template for designing more reliable procedures by using unambiguous cues and explicit verification steps.
The findings reveal fundamental disconnects between operator capabilities and design assumptions. The HAL domain’s 71% memory demands, despite having the most selective operators, suggest that rigorous selection may enable rather than prevent poor design practices. The 35-fold difference in training requirements (9 to 316 repetitions) indicates domains make implicit trade-offs between design investment and training costs. For emerging domains, the 63.47% vs. 2.69% failure rate difference between ambiguous and unambiguous autonomous vehicle procedures provides clear design guidance toward explicit specification approaches.
Several organizations demonstrate successful cross-domain knowledge transfer in practice. Aviation’s Crew Resource Management (CRM) principles have been successfully adapted in medical operating rooms, reducing surgical errors through explicit role assignments and verification protocols [21,22]. Similarly, nuclear power operations have adopted aviation-style checklist methodologies, resulting in improved safety performance [23,24]. These cases demonstrate the practical viability of applying universal SOP design principles across safety-critical domains.

4.4. Method Limitations and Future Validation Needs

This study presents an analytical framework for comparing SOP vulnerabilities across domains. While the approach is grounded in established human factors principles, it is acknowledged that its predictions require empirical validation against actual operational performance data. This study has several key limitations.

4.4.1. Time Distribution Assumptions

The time distributions are derived from the general human performance literature rather than domain-specific studies, which likely affects the absolute failure probability predictions more than relative cross-domain comparisons. The finding that autonomous vehicle procedures show failure rates roughly eleven-fold higher than aviation (63.47% vs. 5.72%) may overestimate absolute values but reliably indicates the relative vulnerability ranking. Domain-specific validation could adjust absolute predictions by ±30–50% while preserving the fundamental cross-domain patterns that were identified.

4.4.2. AOTW Estimation

Allowable Operational Time Windows are estimated from published data and calculated constraints rather than measured operational margins. This primarily impacts the performance simulations rather than the core structural findings. The universal verification gaps (25–70% across domains) and memory demand patterns (HAL: 71%, Aviation: 46%, and Autonomous: 44%) emerge from direct SOP analysis independent of AOTW assumptions. Thus, the primary contributions regarding institutional culture effects and universal vulnerabilities remain robust despite simulation limitations.

4.4.3. Context Independence

The model treats procedure steps as independent, potentially missing important interactions between steps or cumulative effects. This assumption may underestimate cumulative cognitive effects in lengthy procedures (e.g., HAL’s 103-step procedures), potentially explaining why the simulations show 31–45% failure rates that seem high even for complex operations. However, this limitation actually strengthens the argument for explicit procedural guidance: if context-independent analysis already reveals concerning failure rates, real-world cascading effects would likely be worse.
Given these limitations, the results should be interpreted as relative comparisons between domains and procedures rather than absolute failure predictions: the structural vulnerabilities identified warrant further investigation, and the analytical predictions require empirical validation.
It is recommended that future research validate this framework through human-in-the-loop experiments measuring actual execution times, comparison with documented incident data where available, domain-specific studies to refine time distribution parameters, and longitudinal studies tracking procedure performance in operational settings.

5. Conclusions

This cross-domain analysis of SOPs demonstrates that despite substantial differences in operator selection, training, and domain-specific technologies, common patterns of vulnerability exist in SOP design. The consistent analytical framework applied in this study enables the identification of universal principles for effective SOP design that go beyond domain boundaries.
Key recommendations emerging from this analysis include adding explicit verification requirements following any waiting step; reducing memory demands through explicit specification of critical information; eliminating ambiguous cues, particularly for critical interventions; and standardizing methods for evaluating SOP performance across technical domains.
This research demonstrates the value of going beyond domain-specific SOP evaluation processes to enhance safety across all complex human–machine systems. By applying consistent analytical frameworks and metrics, safety engineering can develop universal principles applicable across the full spectrum of human–machine systems.

Recommendations

This cross-domain analysis establishes foundational principles for effective SOP design applicable across human–machine systems of varying complexity and operator selection rigor. Organizations should implement explicit verification requirements following any waiting step to prevent timing-related errors, regardless of operator expertise level. Standardized methods for evaluating SOP performance using the metrics described in this study should likewise be adopted across technical domains, particularly in emerging fields with limited operator selection and training. These principles provide particular value when human operators of diverse training backgrounds must perform critical intervention tasks with high reliability.
Memory demands should be systematically reduced through explicit specification of critical information, particularly regarding display locations and control manipulation, even for highly trained operators. Practical implementation strategies include standardized display location nomenclature (e.g., “upper left engine display” rather than “engine display”), explicit control manipulation descriptions (e.g., “rotate switch clockwise to detent position” rather than “adjust switch”), and verification checkpoints that specify both the expected indication and its location. SOP designers should conduct systematic reviews identifying any step requiring operator recall of system layouts or control procedures.
Designers must eliminate ambiguous cues, particularly for critical interventions requiring time-sensitive responses, as demonstrated by the dramatic PFtC differences between procedures using ambiguous versus unambiguous cues (63.47% vs. 2.69% in autonomous vehicles). A systematic audit approach should evaluate each procedural cue against criteria, including sensory channel specification (visual, auditory, and tactile), temporal characteristics (continuous vs. momentary), and uniqueness within the operational context (no overlapping meanings with other system indications).
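As a sketch of how such an audit could be mechanized, the following checks each cue against the three criteria above. The dictionary keys and example cues are hypothetical illustrations, not drawn from the analyzed SOPs:

```python
def audit_cue(cue):
    """Return a list of audit findings for one procedural cue.

    `cue` is a dict with hypothetical keys: 'channel', 'temporal',
    and 'unique_meaning' (whether the cue's meaning is unique within
    the operational context).
    """
    issues = []
    if cue.get("channel") not in {"visual", "auditory", "tactile"}:
        issues.append("sensory channel unspecified")
    if cue.get("temporal") not in {"continuous", "momentary"}:
        issues.append("temporal characteristics unspecified")
    if not cue.get("unique_meaning", False):
        issues.append("meaning overlaps with other indications")
    return issues

# An ambiguous generic chime vs. an explicit fire warning light.
chime = {"channel": "auditory", "temporal": "momentary", "unique_meaning": False}
fire_light = {"channel": "visual", "temporal": "continuous", "unique_meaning": True}
print(audit_cue(chime))       # → ['meaning overlaps with other indications']
print(audit_cue(fire_light))  # → []
```

An audit pass of this form would flag the generic chime (one of several overlapping chime meanings) while passing a dedicated, uniquely located warning light.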

Author Contributions

Conceptualization, J.A.B. and L.S.; methodology, J.A.B.; formal analysis, J.A.B.; data curation, J.A.B.; writing—original draft preparation, J.A.B.; writing—review and editing, J.A.B. and L.S.; visualization, J.A.B.; funding acquisition, J.A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project Number (PNURSP2025R908), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

Princess Nourah bint Abdulrahman University Researchers Supporting Project Number (PNURSP2025R908), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Barshi, I.; Mauro, R.; Degani, A.; Loukopoulou, L. Designing Flightdeck Procedures; Report No. NASA/TM-2016-219421; NASA: Washington, DC, USA, 2016; p. 81.
2. Degani, A.; Wiener, E.L. Procedures in complex systems: The airline cockpit. IEEE Trans. Syst. Man Cybern.-Part A Syst. Hum. 1997, 27, 302–312.
3. Bashatah, J.; Sherry, L. Lessons Learned from Human Operator Intervention for AI Navigation and Flight Management Systems. In Proceedings of the 2022 Integrated Communication, Navigation and Surveillance Conference (ICNS), Dulles, VA, USA, 5–7 April 2022; pp. 1–15.
4. Bashatah, J.; Sherry, L. Model-Based Analysis of Standard Operating Procedures’ Role in Abnormal and Emergency Events. INCOSE Int. Symp. 2022, 32, 1220–1246.
5. Bashatah, J. A Method for Formal Analysis and Simulation of Standard Operating Procedures (SOPs) to Meet Safety Standards. 2024. Available online: https://drepo.sdl.edu.sa/items/341b4362-0502-4cad-b6fa-7911339e0e40 (accessed on 7 May 2025).
6. Card, S.K.; Moran, T.P.; Newell, A. The Psychology of Human-Computer Interaction; L. Erlbaum Associates: Hillsdale, NJ, USA, 1983.
7. Kortenkamp, D.; Bonasso, R.P.; Schreckenghost, D.; Dalal, K.M.; Verma, V.; Wang, L. A Procedure Representation Language for Human Spaceflight Operations. In Proceedings of the 9th International Symposium on Artificial Intelligence, Robotics and Automation in Space (i-SAIRAS-08), Los Angeles, CA, USA, 26–29 February 2008; p. 8.
8. Louw, T.; Markkula, G.; Boer, E.; Madigan, R.; Carsten, O.; Merat, N. Coming back into the loop: Drivers’ perceptual-motor performance in critical events after automated driving. Accid. Anal. Prev. 2017, 108, 9–18.
9. Dixit, V.V.; Chand, S.; Nair, D.J. Autonomous Vehicles: Disengagements, Accidents and Reaction Times. PLoS ONE 2016, 11, e0168054.
10. Airbus. A318/A319/A320/A321 Flight Crew Operating Manual (FCOM); Airbus S.A.S.: Toulouse, France, 2017.
11. Boeing Commercial Airplanes. Boeing 747-441 Operations Manual; The Boeing Company: Seattle, WA, USA, 2000.
12. Tesla. Model 3 Owner’s Manual. Available online: https://www.tesla.com/ownersmanual/model3/en_us/GUID-32E9F9FD-0014-4EB4-8D96-A8BE99DBE1A2.html (accessed on 5 July 2023).
13. Moon, I.-C.; Carley, K.M.; Kim, T.G. Modeling and Simulating Command and Control. In Springer Briefs in Computer Science; Springer: London, UK, 2013.
14. Bashatah, J.; Sherry, L. Method for Formal Analysis of the Type and Content of Airline Standard Operating Procedures. In Proceedings of the 2023 Integrated Communication, Navigation and Surveillance Conference (ICNS), Herndon, VA, USA, 18–20 April 2023; pp. 1–8.
15. Matessa, M.; Polson, P. List Models of Procedure Learning; Report No. A-0514345; NASA: Washington, DC, USA, 2006.
16. BEA. Accident on 27 November 2008 off the Coast of Canet-Plage (66) to the Airbus A320-232 Registered D-AXLA Operated by XL Airways Germany; BEA: Paris, France, 2010.
17. National Transportation Safety Board. Runway Overrun During Landing, American Airlines Flight 1420, McDonnell Douglas MD-82, N215AA, Little Rock, Arkansas, June 1, 1999; Aviation Investigation Report AAR0102; National Transportation Safety Board: Washington, DC, USA, 2001.
18. Yuan, X.; Song, R.; Sun, L. Research on the Effect of Working Memory Training on the Prevention of Situation Awareness Failure in Shearer Monitoring Operations. Appl. Sci. 2024, 14, 11876.
19. Chakraborty, S.; Kiefer, P.; Raubal, M. The influence of uncertainty visualization on cognitive load in a safety- and time-critical decision-making task. Int. J. Geogr. Inf. Sci. 2024, 38, 1583–1610.
20. Wolfe, B.; Seppelt, B.; Mehler, B.; Reimer, B.; Rosenholtz, R. Rapid holistic perception and evasion of road hazards. J. Exp. Psychol. Gen. 2020, 149, 490–500.
21. Thomas, E.; Sexton, J.; Helmreich, R. Discrepant Attitudes About Teamwork Among Critical Care Nurses and Physicians. Available online: https://pubmed.ncbi.nlm.nih.gov/12627011/ (accessed on 12 July 2025).
22. Leonard, M.; Graham, S.; Bonacum, D. The Human Factor: The Critical Importance of Effective Teamwork and Communication in Providing Safe Care. Available online: https://pubmed.ncbi.nlm.nih.gov/15465961/ (accessed on 12 July 2025).
23. Flin, R.; O’Connor, P. Safety at the Sharp End: A Guide to Non-Technical Skills; CRC Press: London, UK, 2017.
24. O’Hara, J.M.; Fleger, S. Human-System Interface Design Review Guidelines (NUREG-0700, Revision 3); United States Nuclear Regulatory Commission: Rockville, MD, USA, 2019.
Figure 1. SOP data processing and metric calculation framework. The flowchart shows the systematic methodology for processing SOPs from three domains through e-PRL decomposition and quantitative metric calculation. Parallelograms represent data inputs/outputs, rectangles represent processing steps, and ovals indicate start/end points.
Figure 2. Distribution of SOP types across domains.
Figure 3. SOP Step Recall Score by domain and step type. Counterintuitive finding: HAL procedures consistently show the highest memory requirements despite having the most selective operators, suggesting rigorous selection enables poor design practices.
Figure 4. (a) Time on Procedure (ToP) vs. Allowable Operational Time Window (AOTW) for aviation takeoff roll procedure. Minimum overlap demonstrated wide safety margins (5.72% Probability of Failure to Complete, PFtC). (b) Time on Procedure (ToP) vs. Allowable Operational Time Window (AOTW) for HAL Procedure 3.1A. Substantial distribution overlap indicates high failure probability (37% PFtC). Blue represents AOTW; red represents ToP. (c) Time on Procedure (ToP) vs. Allowable Operational Time Window (AOTW) for autonomous vehicle blind spot collision procedure. Large overlap results in 63.47% failure probability.
Table 1. Domains and operator training.
| Case Study | Operator Training | Operator Certification | Domain Selectiveness of Operators |
|---|---|---|---|
| Aviation Operational SOPs | Operators trained on airlines’ operational SOPs. Recurrent training every 9–12 months. | Certified and licensed by FAA. | Selective |
| Habitable Airlock (HAL) | Operators trained for approximately 300 h in simulators. | Certified by regulatory body. | Extremely Selective |
| Semi-Autonomous Vehicles | Operators are trained in driver’s education courses operating non-autonomous vehicles. | Licensed by DMV. | Not Selective |
Table 2. Distribution of SOP step types across domains.
| SOP Step Type | Aviation | HAL | Autonomous Vehicle |
|---|---|---|---|
| Action-Only | 70% | 73% | 80% |
| Decision–Action | 14% | 0% | 4% |
| Action with Waiting and Verification | 12% | 27% | 16% |
| Decision–Action with Waiting and Verification | 4% | 0% | 0% |
Table 3. Explicit vs. implicit content.
| SOP Element | Aviation: Action-Only | Aviation: Decision–Action | Aviation: Action w/W&V | HAL: Action-Only | HAL: Action w/W&V | AV: Action-Only | AV: Decision–Action | AV: Action w/W&V |
|---|---|---|---|---|---|---|---|---|
| Actor | 10% | 79% | 33% | 0% | 0% | 25% | 100% | 25% |
| Trigger (What) | 55% | 83% | 57% | 3% | 1% | 70% | 100% | 75% |
| Trigger (Where) | 3% | 67% | 5% | 58% | 74% | 15% | 0% | 25% |
| Trigger (How) | 21% | 83% | 48% | 48% | 77% | 35% | 0% | 50% |
| Decide (What) | - | 100% | - | - | - | - | 100% | - |
| Decide (Where) | - | 0% | - | - | - | - | 0% | - |
| Decide (How) | - | 0% | - | - | - | - | 100% | - |
| Action (What) | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Action (Where) | 81% | 100% | 90% | 9% | 1% | 65% | 100% | 75% |
| Action (How) | 43% | 92% | 62% | 2% | 1% | 45% | 100% | 50% |
| Waiting (What) | - | - | 29% | - | 6% | - | - | 50% |
| Waiting (Where) | - | - | 14% | - | 10% | - | - | 50% |
| Waiting (How) | - | - | 14% | - | 18% | - | - | 50% |
| Verification (What) | - | - | 76% | - | 42% | - | - | 75% |
| Verification (Where) | - | - | 71% | - | 24% | - | - | 75% |
| Verification (How) | - | - | 71% | - | 76% | - | - | 75% |
Table 4. Average SOP Step Recall Score by procedure and domain.
| Aviation SOPs | Score | HAL SOPs | Score | Autonomous Vehicle SOPs | Score |
|---|---|---|---|---|---|
| Takeoff (B747) | 43% | HAL 5.101 | 71% | Forward Collision Warning | 27% |
| Landing Approach | 58% | HAL 5.111 | 56% | Lane Departure Avoidance | 31% |
| System A | 81% | HAL 5.121 | 61% | Blind Spot Collision | 25% |
| System B | 71% | HAL 5.122 | 54% | Autosteer Disable | 50% |
| Approach Stabilization | 38% | HAL 5.999 | 58% | Intersection Turn Takeover | 80% |
| Additional Aviation Procedures | 25–50% | HAL 3.1 series | 88–96% | Unexpected Autopilot Behavior | 50% |
Table 5. Estimated training requirements by SOP. Key finding: Training requirements vary 35-fold (9 to 316 repetitions), with HAL procedures requiring the longest training periods despite having the most experienced operators.
| Domain | Procedure | Steps | Adjusted Steps * | Total Repetitions | Training Days |
|---|---|---|---|---|---|
| Aviation | Takeoff (B747) | 28 | 40 | 91 | 10.4 |
| Aviation | Landing Approach | 16 | 25 | 57 | 9.4 |
| Aviation | System A | 16 | 29 | 66 | 9.7 |
| Aviation | System B | 15 | 26 | 57 | 9.4 |
| HAL | HAL 5.111 | 103 | 135 | 316 | 13.1 |
| HAL | HAL 5.101 | 10 | 17 | 38 | 8.5 |
| HAL | HAL 3.1 | 17 | 32 | 76 | 10.0 |
| HAL | HAL 5.121 | 54 | 86 | 202 | 12.1 |
| Autonomous | Forward Collision | 4 | 4.1 | 9 | 5.3 |
| Autonomous | Lane Departure | 4 | 4.2 | 10 | 5.4 |
| Autonomous | Blind Spot Collision | 4 | 4.0 | 9 | 5.3 |
| Autonomous | Intersection Turn | 5 | 7.0 | 18 | 6.9 |
* Adjusted Steps accounts for recall demands using Equation (3) in Ref. [5].
Table 6. Probability of Failure to Complete (PFtC) by domain and procedure.
| Domain | Procedure | PFtC | Contributing Factors |
|---|---|---|---|
| Aviation | Takeoff Roll | 5.72% | Ambiguous semantic cues |
| HAL | HAL 5.111 | 34% | High memory demands, limited verification |
| HAL | HAL 5.101 | 45% | Missing cues, ambiguous semantics |
| HAL | HAL 3.1 | 37% | Extreme recall demands |
| HAL | HAL 5.121 | 31% | Interface complexity, missing verification |
| HAL | HAL 5.122 | 34% | Limited verification requirements |
| Autonomous | Forward Collision | 41.51% | Ambiguous cues, non-salient triggers |
| Autonomous | Lane Departure | 10.55% | Unambiguous cues |
| Autonomous | Blind Spot Collision | 63.47% | Highly ambiguous cues |
| Autonomous | Autosteer Disable | 2.69% | Unambiguous cues |
| Autonomous | Intersection Turn | 41.60% | Recall demands, ambiguous decision points |
Table 7. Distribution of cue types by domain.
| Cue Type | Airline Operations | HAL | Autonomous Vehicle |
|---|---|---|---|
| Not in FoV | 3% | 0% | 4% |
| In FoV, Not Salient | 7% | 12% | 8% |
| In FoV, Salient, Ambiguous | 15% | 24% | 48% |
| In FoV, Salient, Unambiguous | 67% | 42% | 40% |
| No Cue (LTM) | 8% | 22% | 0% |
