
Improving Software Reliability in Nuclear Power Plants via Diversity in the Requirements Phase: An Experimental Study

Institute of Nuclear and New Energy Technology (INET), Tsinghua University, Beijing 100084, China
*
Authors to whom correspondence should be addressed.
Energies 2025, 18(18), 4794; https://doi.org/10.3390/en18184794
Submission received: 17 July 2025 / Revised: 14 August 2025 / Accepted: 5 September 2025 / Published: 9 September 2025
(This article belongs to the Section B4: Nuclear Energy)

Abstract

High software reliability is essential for safety-critical systems in nuclear power plants. To improve the quality of software following the requirements phase, requirements inspections are conducted to detect defects. Traditional approaches enhance inspection outcomes by employing more effective techniques or by increasing team redundancy. This study investigates an alternative approach: introducing diversity within the inspection team. Inspection technique diversity and inspector background diversity are considered in this paper. We hypothesize that an inspection team in which the inspectors use diverse inspection techniques or have diverse backgrounds will have a better performance in defect detection compared to an inspection team with no diversity. This is because diversity can reduce the number of dependent failures in an inspection team. In this study, a controlled experiment is designed and conducted to examine our hypothesis. In the experiment, research subjects with different backgrounds inspect a software requirements specification using different inspection techniques. The results are collected and analyzed statistically. The experiment shows that using diverse techniques in an inspection team can improve the performance of the inspection team; however, using inspectors with diverse backgrounds will not affect the performance of an inspection team significantly.

1. Introduction

Reliable software is fundamental to the safe and effective operation of nuclear power plants. The software development life cycle starts with the requirements phase, where a software requirements specification (SRS) is typically produced and reviewed by a verification and validation team. Defects missed during this review persist into later stages. According to internal NASA data, approximately 40% of the defects found in software products can be traced back to these overlooked requirements defects [1]. Therefore, the quality of the SRS has a significant impact on software reliability in nuclear applications [2].
Several techniques are available for inspecting requirements, including ad hoc reading, checklist-based reading (CBR), and perspective-based reading (PBR), among others. However, these traditional inspection techniques often suffer from low defect detection rates. An experiment conducted by A. Porter [3] demonstrated that the defect detection rates for ad hoc, CBR, and PBR were 32.5%, 36.5%, and 51.5%, respectively. In another experiment by L. He [4], the PBR technique showed a detection rate of 37%. As a result, a considerable number of defects remain in the SRS as development progresses. To address this, two primary strategies are commonly adopted, as described below [5].
The first strategy is to enhance the effectiveness of the inspection. A. Alshazly [2] proposed a combined reading technique that divides the SRS into sections, by purpose, with each inspector using a tailored checklist. A case study showed that this significantly improves inspector performance. Ali [6] proposed standardizing the requirements generation process to produce SRSs that can be third-party inspected. A Total Quality Score was used to quantify and improve the quality of the SRS generated. B. Li [7] proposed the RIMSM method, which involves constructing an SRS model and generating inspection scenarios through model mutation. The system’s behavior under each scenario is analyzed to identify defects. An experiment demonstrated that RIMSM significantly improved defect detection effectiveness, increasing detection rates over CBR by 18.9%, 60.8%, and 75.8% for small, medium, and large SRSs, respectively.
Another strategy is to increase the number of inspectors. Techniques such as N-fold inspection replicate inspection activities across multiple teams to capture a broader set of defects [8]. E. Kantorowitz [9] demonstrated that, while one team of three inspectors detected 35% of defects, nine independent teams detected up to 78%.
In summary, the first strategy aims to increase the defect detection rate of an individual inspector, while the second strategy adds redundancy to the requirements inspection process. The limitation of the first strategy is that no existing technique can guarantee an individual detection rate high enough that the remaining defects can be ignored.
The second approach is limited by the occurrence of correlated failures among redundant inspectors. This issue was highlighted in E. Kantorowitz’s study on N-fold inspection. Theoretically, if nine teams worked independently and each had a 35% chance of identifying a defect, the overall detection rate should reach 98%, calculated as $1 - (1 - 35\%)^9 \approx 98\%$. In practice, the experiment yielded only a 78% detection rate. The shortfall arose because the participants, i.e., senior undergraduates from the same computer science program, possessed nearly identical training and employed the same inspection method. As a result, a lack of diversity led to dependent failures across teams, greatly reducing defect detection efficiency.
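As a quick check of the arithmetic, the following Python snippet reproduces the theoretical detection rate implied by the independence assumption, using the figures reported in [9].

```python
# Theoretical detection rate of 9 teams, assuming each team independently
# detects a given defect with probability 0.35 (figures from [9]).
p_team, n_teams = 0.35, 9
print(f"{1 - (1 - p_team) ** n_teams:.1%}")  # ~97.9%, versus the 78% observed in practice
```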
In redundant systems, a common solution to mitigate dependent failures is the introduction of diversity [10]. However, in requirements inspection, the effectiveness of introducing diversity has not been systematically studied. In this paper, the effectiveness of introducing diversity in requirements inspection is studied with an experiment.
Based on findings from [11], dependent failures in software development are correlated with the knowledge possessed by the subjects in a task and the rules/procedures they follow in that task. We therefore assume that dependencies in requirements inspection originate mainly from common inspection techniques (i.e., the procedure followed) and shared background (i.e., the knowledge possessed). We hypothesize that teams with diverse techniques or backgrounds perform better in defect detection than homogeneous teams. To verify our hypothesis, an experiment was designed and conducted with undergraduate students from different majors using two techniques: CBR and RIMSM. We compared teams with technique or background diversity against teams in which all members used the same technique or pursued the same major.
This paper is organized as follows: Section 2 introduces the background knowledge required in this paper; in Section 3, an experiment is designed to study the effectiveness of introducing diversity in an inspection team; Section 4 analyzes the results of the experiment; Section 5 concludes the findings of this paper.

2. Background

This section introduces the background knowledge required in this paper.
In requirements inspection, stakeholder requirements are used as the “oracle” to examine the correctness of an SRS. A defect in the SRS is defined as the inconsistency between the SRS and the stakeholder requirements. Defects are introduced in the development of an SRS and are detected in the inspection process.
In this study, a failure denotes the inability of an inspector or an inspection team to identify a defect in the SRS. Such failures in requirements inspection can be divided into two categories, as defined in [12] from the standpoint of human error: independent failures and dependent failures. The failure rate of an inspector represents the likelihood that a defect remains undetected by that inspector. Because this rate is typically high, our analysis adopts the zone-based model (Z-model), which was designed to address dependent failures in contexts characterized by high failure probabilities [12].
This section first outlines the definitions of independent and dependent failures in requirements inspection, then introduces the inspection techniques applied in the experiment, and, finally, it gives a brief overview of the Z-model used to analyze the experimental results.

2.1. Dependent and Independent Failures in Requirements Inspection

Ref. [12] characterizes requirements inspection as a cognitive process in which inspectors assess SRS elements for potential defects. Failures in requirements inspection are caused by human errors, which can be classified into errors due to insufficient knowledge and errors due to random events in the cognitive process.
The perception domain of an inspector refers to the body of knowledge that can be accurately understood and applied to identify defects in an SRS. A defect located within this domain can be detected, whereas one outside it will be overlooked. The perception domain is shaped by the inspector’s system knowledge, expertise in requirements engineering, and access to supporting external information. When a defect falls outside this domain, the resulting human error—caused by insufficient knowledge—leads to detection failures.
Using Bayesian inference, Ref. [12] distinguished dependent failures from independent ones using this concept. Dependent failures arise when knowledge limitations prevent inspectors from recognizing a defect outside their perception domain. Independent failures, by contrast, result from random variations in the cognitive process, even when the defect is within the perception domain. Thus, the classification of failures depends on whether the defect lies inside or outside the inspector’s perception domain.

2.2. Inspection Methods

This section introduces the two inspection methods used in our experiment and discusses the reasons why the two methods were selected.
The checklist-based reading (CBR) technique supplies inspectors with a checklist, typically formulated as questions or statements, to guide the search for specific types of defects [13]. During inspection, the document is reviewed while the inspector answers a series of yes/no questions aimed at identifying potential issues [14]. However, CBR is regarded as a nonsystematic approach [4], since it offers no explicit guidance on how the inspection process itself should be conducted.
The RIMSM method was proposed by B. Li et al. [7] to improve performance in detecting defects related to system functions. To apply the RIMSM method, an inspector first constructs a high-level extended finite state machine (HLEFSM) model of the SRS using the RITSM tool. The tool automatically identifies a set of scenarios under which the SRS model shall be executed, based on an extended mutation testing technique for the requirements phase. It then executes the SRS model for the identified scenarios automatically. The execution results are stored in an output file that records the execution of function definitions, variable definitions, and function logic. By examining this file, defects in the SRS model can be identified. Once a defect is detected in the output file, the inspector locates it in the SRS document. Locating the defect is straightforward, since the SRS model is constructed directly from the SRS.
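To illustrate the general idea of executing a state-machine model of an SRS under a scenario, the following Python sketch shows a toy extended finite state machine and the trace an inspector could review. It is a conceptual illustration only; the SRS function, states, and variables are hypothetical, and the sketch is not the RITSM tool or the HLEFSM notation used in [7].

```python
# Conceptual sketch only (not the RITSM tool): a tiny extended-finite-state-machine
# model of a hypothetical SRS trip function, executed under one scenario.
from dataclasses import dataclass, field

@dataclass
class TripFunctionModel:
    state: str = "Standby"
    variables: dict = field(default_factory=lambda: {"temperature": 0})
    trace: list = field(default_factory=list)

    def step(self, event, value=None):
        # Transitions transcribed from the (hypothetical) SRS function logic.
        if event == "temp_reading":
            self.variables["temperature"] = value
            if self.state == "Standby" and value > 350:
                self.state = "TripPending"
        elif event == "confirm" and self.state == "TripPending":
            self.state = "Tripped"
        self.trace.append((event, value, self.state, dict(self.variables)))

model = TripFunctionModel()
for event, value in [("temp_reading", 360), ("confirm", None)]:  # one inspection scenario
    model.step(event, value)
for record in model.trace:  # the inspector reviews this execution record against the requirements
    print(record)
```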
Table 1 displays the differences between the two techniques in terms of instrumentation, coverage of the defects, inspection item, and defect detection activity. Since the focus of this paper is to discover the effectiveness of introducing diversity in requirements inspection, the inspection techniques used in our experiment should have as little in common as possible. The inspection methods CBR and RIMSM are selected to minimize the dependencies between inspectors using different inspection methods.

2.3. Z-Model

To evaluate an inspection team’s performance, a quantification model is necessary to estimate the probability of dependent and independent failures within that team. Traditional models, such as the basic parameter model [15], the beta factor model [16], the alpha factor model [17], and the multiple Greek letter model [18], assume redundant components are identical. A single component failure within an event is treated as independent, whereas multiple component failures are treated as dependent. These models primarily address high-reliability systems, where the probability of component failure is extremely low. Consequently, the likelihood of simultaneous multiple independent failures, or the coexistence of independent and dependent failures within the same event, is considered negligible.
In contrast, requirements inspection involves inspectors with comparatively high and heterogeneous failure probabilities, rendering traditional models inapplicable. In [12], the Z-model is proposed to analyze dependent failures in a high failure probability context, i.e., requirements inspection.
The Z-model addresses this limitation by explicitly considering the following: (1) the potential for high individual failure probabilities, (2) variability in failure probabilities among inspectors, (3) the occurrence of both dependent and independent failures within an inspection team. The Z-model is introduced briefly below.
For an inspection team consisting of $m$ inspectors, i.e., $I_1, I_2, \ldots, I_m$, the following concepts are introduced:
  • Detection activity: A detection activity refers to the actions undertaken by the inspection team in an attempt to identify a defect.
  • Multiplicity ($k$): The multiplicity of a detection activity is the number of failures in the detection activity, denoted as $k \in [0, m]$.
  • Dependency ($d$): The dependency of a detection activity is the number of dependent failures it contains, denoted by $d \in [0, m]$.
  • Perception zone: The perception domain of an inspector encompasses the knowledge that the inspector can accurately access or apply to identify defects in an SRS. The universal perception domain $U$, representing all knowledge required to detect every defect in the SRS, is partitioned into different zones according to the perception domains of the inspectors in a team. An example is illustrated in Figure 1. For two inspectors, i.e., $I_1$ and $I_2$, the universal perception domain $U$ is divided into four perception zones: $Z_0^1$, $Z_1^1$, $Z_2^1$, and $Z_1^2$. Assuming that detecting a defect requires the knowledge in a given perception zone, the dependency of a zone is defined as the number of inspectors who lack the necessary knowledge for that zone. For instance, the dependency of $Z_2^1$ is 2, since neither inspector possesses the knowledge needed to detect defects in that zone.
  • Dependency of a perception zone: The dependency of a perception zone is the number of inspectors lacking the knowledge within that zone. Perception zones can be categorized according to their dependencies. Let $Z_d^i$ denote the $i$th perception zone with dependency $d$. In this paper, a defect requiring the knowledge contained in $Z_d^i$ for detection is referred to as a defect in that zone. A minimal sketch of this zone partition is given after this list.
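To make the zone partition concrete, the following Python sketch derives the perception zones and their dependencies for a two-inspector team from the inspectors’ perception domains. The knowledge items and domains are purely illustrative assumptions, not data from the experiment.

```python
# Partition the universal perception domain U into perception zones for a
# two-inspector team and compute each zone's dependency (illustrative items only).
U = {"timing", "interlocks", "setpoints", "alarms"}   # knowledge needed for all defects in the SRS
domains = {"I1": {"timing", "interlocks"},            # perception domain of inspector I1
           "I2": {"interlocks", "setpoints"}}         # perception domain of inspector I2

zones = {}
for item in U:
    lacking = frozenset(a for a, dom in domains.items() if item not in dom)
    zones.setdefault(lacking, set()).add(item)

for lacking, items in zones.items():
    print(f"zone {sorted(items)}: dependency d = {len(lacking)} "
          f"(lacked by {sorted(lacking) if lacking else 'no inspector'})")
```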
In an inspection team, $Q_t$ represents the average probability of failure for an individual inspector. The Z-model decomposes $Q_t$ as shown in (1), where $Q_t^I$ is the probability of independent failures and $Q_t^D$ is the probability of dependent failures.
$Q_t = Q_t^D + Q_t^I$  (1)
The Z-model characterizes the average probability of failures by an inspector using a set of parameters, namely $z_d^i$ and $p_a^I$. Here, $z_d^i$ represents the probability that a defect lies in perception zone $Z_d^i$, and $p_a^I$ denotes the probability that inspector $I_a$ independently fails to detect a defect in her/his perception domain. $Q_t^D$ can be calculated using (2), where $n_d$ is the number of zones with dependency $d$.
$Q_t^D = \frac{1}{m} \sum_{d=0}^{m} \sum_{i=1}^{n_d} z_d^i \cdot d$  (2)
$Q_t^I$ can be calculated using (3), where $p_{a_j^i}^I$ is the probability that inspector $I_{a_j^i}$, whose perception domain contains $Z_d^i$, fails independently.
$Q_t^I = \frac{1}{m} \sum_{d=0}^{m} \sum_{i=1}^{n_d} \sum_{j=1}^{m-d} z_d^i \cdot p_{a_j^i}^I$  (3)
The objective of the experiment is to estimate whether introducing diversity to requirements inspection can increase software reliability in the requirements phase. The following variables are derived and used in this paper to analyze the dependent failures in detail.
Based on the decomposition of $Q_t$, $Q_{d:m}$ is defined in (4) to represent the probability that $d$ inspectors fail dependently in a detection activity within an inspection team with $m$ members.
$Q_{d:m} = \frac{1}{n_d} \cdot \sum_{i=1}^{n_d} z_d^i$  (4)
Given a dependent failure by an inspector, the conditional probability that the failure is involved in a detection activity with dependency $d$ is denoted as $\beta_d$ and given in (5).
$\beta_d = \frac{d \cdot \sum_{i=1}^{n_d} z_d^i}{\sum_{d=0}^{m} \sum_{i=1}^{n_d} d \cdot z_d^i}$  (5)
To evaluate the probability of failure by an inspection team, $Q_m$ is defined as the probability that all inspectors in a detection activity fail to detect a defect. $Q_m$ can be obtained using (6).
$Q_m = \sum_{d=0}^{m} \sum_{i=1}^{n_d} z_d^i \cdot \prod_{j=1}^{m-d} p_{a_j^i}^I$  (6)
The data collected in requirements inspection for estimating the parameters of the Z-model are summarized in Table 2.
The key parameters of the Z-model, $z_d^i$ and $p_a^I$, can be estimated using (7) and (8) through the maximum likelihood estimator (MLE).
$z_d^i = \frac{n_d^i}{N_D}$  (7)
$p_a^I = \frac{n_a^I}{N_D - n_a^D}$  (8)
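As a concrete illustration of Equations (1)–(6), the Python sketch below evaluates the Z-model quantities for a hypothetical two-inspector team. The zone probabilities and independent-failure probabilities are made-up numbers and the bookkeeping is simplified; it is a minimal sketch under the definitions above, not the authors’ implementation.

```python
from math import prod
from collections import defaultdict

# Each perception zone Z_d^i is described by its dependency d, the probability z
# that a defect lies in it (Eq. (7) would estimate this from counts), and the
# independent-failure probabilities p_a^I of the m - d inspectors covering it.
m = 2  # team size (illustrative)
zones = [
    {"d": 0, "z": 0.55, "p_cov": [0.10, 0.15]},  # covered by both inspectors
    {"d": 1, "z": 0.20, "p_cov": [0.10]},        # covered only by inspector 1
    {"d": 1, "z": 0.15, "p_cov": [0.15]},        # covered only by inspector 2
    {"d": 2, "z": 0.10, "p_cov": []},            # covered by neither inspector
]

Q_t_D = sum(z["z"] * z["d"] for z in zones) / m                       # Eq. (2)
Q_t_I = sum(z["z"] * p for z in zones for p in z["p_cov"]) / m        # Eq. (3)
Q_t = Q_t_D + Q_t_I                                                   # Eq. (1)
Q_m = sum(z["z"] * prod(z["p_cov"]) for z in zones)                   # Eq. (6)

z_by_d = defaultdict(list)
for z in zones:
    z_by_d[z["d"]].append(z["z"])
denom = sum(z["d"] * z["z"] for z in zones)
Q_dm = {d: sum(zs) / len(zs) for d, zs in z_by_d.items()}             # Eq. (4)
beta = {d: d * sum(zs) / denom for d, zs in z_by_d.items() if d > 0}  # Eq. (5)

print(f"Q_t={Q_t:.3f}  Q_t^D={Q_t_D:.3f}  Q_t^I={Q_t_I:.3f}  Q_m={Q_m:.3f}")
print("Q_d:m =", {d: round(v, 3) for d, v in Q_dm.items()})
print("beta_d =", {d: round(v, 3) for d, v in beta.items()})
```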

3. Experiment Design

The goal of this paper is to study the effectiveness of increasing diversity in requirements inspection. In this paper, two types of diversity are considered for an inspection team: the diverse backgrounds of the inspectors and diverse techniques used for defect detection.
In this section, the following definitions are used:
  • A background-diverse team is an inspection team in which the members are pursuing different majors.
  • A background-uniform team is an inspection team in which all members share the same major.
  • A technique-diverse team is an inspection team in which different inspection techniques are used.
  • A technique-uniform team is an inspection team in which all members use the same inspection technique.
This experiment was designed to compare the background-diverse teams and technique-diverse teams to the background-uniform teams and technique-uniform teams, respectively.

3.1. Research Questions

The following aims are the basis for this experiment:
  • To determine whether the performance of a background-diverse team is better than that of a background-uniform team.
  • To determine whether the performance of a technique-diverse team is better than that of a technique-uniform team.

3.2. Hypotheses

The null hypotheses and alternative hypotheses of the experiment are given below:
  • $H_{0,\mathrm{background}}$: there is no difference between the performance of the background-diverse teams and the background-uniform teams.
  • $H_{A,\mathrm{background}}$: the performance of the background-diverse teams is better than that of the background-uniform teams.
  • $H_{0,\mathrm{technique}}$: there is no difference between the performance of the technique-diverse teams and the technique-uniform teams.
  • $H_{A,\mathrm{technique}}$: the performance of the technique-diverse teams is better than that of the technique-uniform teams.
The t-test is used to test the null hypotheses in this experiment based on the central limit theorem.

3.3. Research Variables

This experiment controlled the following independent variables:
  • The inspection techniques used by the subjects in an inspection team (RIMSM or CBR).
  • The backgrounds (i.e., major) of the subjects in an inspection team.
  • The number of subjects in an inspection team.
  • The SRS documents under inspection.
In this study, the treatment variables consist of the inspection methods and the inspectors’ backgrounds, while the remaining variables help to mitigate potential threats to the experiment’s internal validity.
The dependent variables in the experiment are described as follows. An inspection team fails to detect a defect only when every member misses it. Therefore, $Q_m$ is employed to quantify team performance, where $Q_m$ is the probability of a detection activity with multiplicity $k = m$. The other parameters of the Z-model are used to analyze the failure rate of an inspector in the inspection team, including $Q_t$, $Q_t^D$, $\beta_d$, $Q_t^I$, and $Q_{d:m}$.

3.4. Experiment Instrumentations

3.4.1. SRS Documents

An SRS comprises both functional and non-functional requirements. This experiment focuses on functional requirements to ensure that the software’s intended functionality is accurately represented. Twelve SRS documents covering different applications and sizes, all taken from the study by B. Li [7], were reused in this experiment. These documents adhere to the IEEE 29148:2018 standard [19] and are structured into three sections: Introduction, Overview, and Specific Functions.
Table 3 summarizes each SRS, including its topic, number of pages, number of functions, and number of defects. The defects considered are indigenous, meaning they occur naturally rather than by being intentionally seeded. The total number of defects in each SRS was estimated using the capture–recapture method described in B. Li’s study [7].
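One common form of the capture–recapture method is the Lincoln–Petersen estimator sketched below; the counts are hypothetical, and the exact variant used in B. Li’s study [7] may differ.

```python
# Lincoln-Petersen capture-recapture estimate of the total number of defects in an SRS.
# n1, n2: defects found by two independent inspections; m12: defects found by both.
n1, n2, m12 = 14, 11, 6   # hypothetical counts
N_hat = n1 * n2 / m12
print(f"Estimated total defects: {N_hat:.1f}")  # about 26 defects
```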
SRSs S1–S8 are classified as “Small size,” S9 and S10 as “Medium size,” and S11 and S12 as “Large size,” allowing the experiment to evaluate the scalability of the results. Additionally, a stakeholder requirements specification (StRS) was developed for each SRS in B. Li’s study. Each StRS outlines the system capabilities required by stakeholders within a defined context and serves as an oracle to guide inspectors in defect detection.

3.4.2. Checklist

A checklist is used in the CBR method. Traditional checklists typically address both functional and non-functional requirements. Since the SRS documents in this experiment include only functional requirements, all checkpoints related to non-functional aspects were removed. The checkpoints related to functional requirements are detailed using a defect taxonomy in the requirements phase, as discussed in B. Li and X. Li’s work [20,21,22]. Part of the checklist used is summarized in Figure 2.

3.4.3. RITSM Tool

B. Li developed the RITSM tool to facilitate the application of the RIMSM method [7]. With RITSM, inspectors can construct the HLEFSM model of an SRS, while the tool automatically generates the model’s mutants and the scenarios needed to detect them. The tool then executes the model under these scenarios and outputs the results to a file, which inspectors review to identify defects. Additional details about RITSM are provided in [7].

3.4.4. Defect Recording Sheet

Each inspector records detected defects using a defect recording sheet. Figure 3 gives an example of the defect recording sheet.

3.5. Research Subject Identification

During the requirements inspection phase, an SRS is reviewed by inspectors from an independent verification and validation (IV&V) team. Inspectors typically hold at least a bachelor’s degree in engineering or a related field and are expected to have (1) knowledge of requirements engineering, and (2) proficiency in programming with procedural languages.
For this experiment, subjects were selected from junior and senior undergraduate students across various majors. Each participant was required to be familiar with at least one programming language, while concepts related to requirements engineering were covered during training sessions. In total, 23 students from Ohio State University were recruited for the study. The majors of the subjects and the number of subjects in each major are given in Table 4.

3.6. Experimental Procedure

The design of the experiment procedure is introduced in this section.
For a given inspection team, the inspection process includes two steps: detection and collection [23]. In the first step, every inspector detects defects in the SRS individually. In the second step, the defects detected by each inspector are collected in a collection meeting. A. Porter [3] and Votta [24] studied the effect of collection meetings, analyzing the “meeting gain” and “meeting loss” through experiments. A meeting gain occurs when a defect is detected for the first time at the collection meeting. A meeting loss occurs when a defect is first detected by an inspector but is not recorded during the collection meeting. They found that both meeting gain and meeting loss were negligible; the defects detected by an inspection team are therefore essentially the union of the defects detected by its individual members. Based on this fact, the subjects in this experiment were only asked to detect defects individually. To study the defects detected by an inspection team containing m specific inspectors, we only need to combine the defects detected individually by those m inspectors. This design provided us with the flexibility to “virtually” combine any inspectors into a team for analysis.
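Because meeting gains and losses are negligible, a virtual team’s result is simply the union of its members’ individual results. The sketch below illustrates this with hypothetical defect identifiers.

```python
# "Virtually" creating an inspection team: the team's detected defects are the union
# of the defects detected individually by its members (hypothetical defect IDs).
individual_results = {
    "inspector_A": {"D1", "D3", "D7"},
    "inspector_B": {"D2", "D3", "D8"},
}
team_defects = set().union(*individual_results.values())
print(sorted(team_defects))  # ['D1', 'D2', 'D3', 'D7', 'D8']
```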
This experiment had four stages: (1) collection of data for individual inspectors, (2) creation of data for inspection teams, (3) determination of dependent failures and independent failures, and (4) analysis of results. The details in each stage are introduced below.

3.6.1. Collection of Data for Individual Inspectors

In this stage, all subjects were trained and tested individually, since the first step of an inspection team’s requirements inspection is to let all inspectors detect defects individually.
The first stage of the experiment spanned 3 days, with all sessions conducted remotely via video communication. This stage comprised five training sessions, six practice sessions, and six testing sessions, as detailed in Table 5. Four investigators participated in the experiment, with their respective roles described below:
  • Investigator 1 designed the experiment and prepared all instruments and training materials. Investigator 1 also conducted the analysis of the experimental results.
  • Investigator 2 was unaware of the experiment’s hypotheses. To minimize research bias, Investigator 2 delivered the training session presentations and hosted all practice and testing sessions. At the end of each session, Investigator 2 collected the results recording sheets, removed identifiers and methods used by the subjects, and provided the anonymized data for analysis. The results were subsequently analyzed by Investigators 1 and 3.
  • Investigator 3 did not participate in the 3-day experiment. Investigators 1 and 3 independently analyzed the results to evaluate inter-rater reliability.
  • Investigator 4 supervised the entire experiment.
Each training or practice session lasted approximately 45 min, followed by a 15 min break. Testing sessions had no time limit. The training sessions introduced subjects to the fundamental concepts of software requirements engineering and provided instruction on using the two inspection techniques.
Practice sessions allowed subjects to become familiar with these methods. In each practice session, all subjects were assigned a small-size SRS, along with the corresponding StRS, and inspected the SRS using the specified technique, as indicated in Table 5. By the end of the practice sessions, subjects’ performance was expected to have stabilized.
In the testing sessions, the subjects were divided into six groups based on their majors. As shown in Table 6, Group 1 contained six subjects from CSE, and Group 4 consisted of the other five subjects from CSE. The 10 subjects from EE were evenly assigned to Group 2 and Group 5. Group 3 and Group 6 each had one subject majoring in ME. The inspection technique used by each group in each testing session is listed in Table 6. For example, Groups 1, 2, and 3 used RIMSM to inspect S7, a small-size SRS, in testing session 1. The subjects did not know that they were grouped.
The design presented in Table 6 enables the combination of results across different groups and sessions. For instance, the results from Group 1 in testing session 1 can be combined with those from Group 4 in testing session 2 to represent the inspection of a small-size SRS using the RIMSM method for CSE-major subjects. This approach aggregates data from all CSE subjects and both small-size SRSs (S7 and S8), increasing the number of data points for hypothesis testing and reducing the biases associated with using a single SRS or a single group.

3.6.2. Creation of Data for Inspection Teams

As discussed previously, the union of the defects detected by m inspectors individually can be regarded as the defects detected by an inspection team that consists of the m inspectors. Therefore, in this stage, the performance of an inspection team of size m was estimated by selecting m inspectors and combining the defects they detected individually. The process of estimation is referred to as “creating” an inspection team in this paper. Inspection teams that are created virtually were also used in previous studies such as [25].
In this paper, to test hypotheses $H_{0,\mathrm{background}}$ and $H_{0,\mathrm{technique}}$, we created the following inspection teams: (1) background-diverse teams in which half of the members major in CSE and the other members major in EE; (2) background-uniform teams in which all members have the same major (either CSE or EE); (3) technique-diverse teams in which half of the members use the CBR technique and the other members use the RIMSM technique; (4) technique-uniform teams in which all members use the same inspection technique (either CBR or RIMSM). We also considered different sizes of the inspection teams. In this experiment, inspection teams of size $m = 2$ and $m = 4$ are used to test the hypotheses. Details on the creation of the inspection teams of each type and each size are discussed in Section 4.3.1 and Section 4.4.1.

3.6.3. Identification of Dependent and Independent Failures

In the first stage of the experiment, we collected the defects detected by each inspector in the testing sessions. The failures of each inspector in each testing session can be determined accordingly.
However, we found that it was not feasible to discriminate dependent failures and independent failures by an inspector by simply asking the subjects to answer a questionnaire at the end of a testing session. This is because a subject cannot remember or realize the causes of all his/her failures after completing the inspection of an SRS. On the other hand, if we ask a subject to record the causes of his/her failures in the middle of the testing sessions, we need to provide the subject with a list of defects that are in the SRS. Uncontrolled biases will be introduced into the experiment. Therefore, an indirect method was used to determine the dependent failures and independent failures in this stage.
As defined in Section 2.1, failures can be categorized as dependent or independent. A failure is considered independent if an inspector misses a defect that lies within their perception domain, and dependent if the defect falls outside it. The key challenge is identifying whether a defect is within an inspector’s perception domain.
Following [12], the criteria below are used to determine whether a defect lies within an inspector’s perception domain, where $c$ is an integer:
  • Criterion 1: the inspector must detect at least c defects in the SRS.
  • Criterion 2: the inspector must detect at least c defects within the function containing the defect or within a closely related function.
  • Criterion 3: the inspector must detect at least c defects of the same type in previous testing or practice sessions.
The first criterion indicates that the inspector possesses basic knowledge of the application described in the SRS. The second criterion ensures that the inspector understands the function containing the defect, while the third criterion verifies the inspector’s ability to detect defects of the same type. If an inspector’s failure meets all three criteria, the defect is considered within the inspector’s perception domain, and the failure is classified as independent; otherwise, it is treated as a dependent failure. In this experiment, we set $c = 1$. Appendix A provides a sensitivity analysis of the impact of selecting different values of $c$.
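The three criteria translate directly into a simple rule; the Python sketch below classifies a missed defect as an independent or dependent failure for a given threshold c. The function name and count arguments are illustrative assumptions.

```python
# Classify a missed defect as an independent or dependent failure using the three
# criteria above (counts are hypothetical; c = 1 as in the experiment).
def classify_failure(n_in_srs, n_in_function, n_same_type_before, c=1):
    """Return 'independent' if the defect lies within the inspector's perception
    domain (all three criteria satisfied); otherwise return 'dependent'."""
    in_domain = (n_in_srs >= c                  # Criterion 1: defects detected in this SRS
                 and n_in_function >= c         # Criterion 2: defects in the same or a related function
                 and n_same_type_before >= c)   # Criterion 3: same defect type detected previously
    return "independent" if in_domain else "dependent"

print(classify_failure(n_in_srs=3, n_in_function=1, n_same_type_before=2))  # independent
print(classify_failure(n_in_srs=3, n_in_function=0, n_same_type_before=2))  # dependent
```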

3.6.4. Analysis of Results

In the last stage, the data collected and created were analyzed, and the two hypotheses, i.e., $H_{0,\mathrm{background}}$ and $H_{0,\mathrm{technique}}$, were tested. More details on this stage are given in the next section.

4. Experiment Results

This section presents the results of the data analysis. The two hypotheses, i.e., $H_{0,\mathrm{background}}$ and $H_{0,\mathrm{technique}}$, are tested by comparing the background-diverse teams and technique-diverse teams to the background-uniform teams and technique-uniform teams, respectively.
The data analyzed include the stability of the performance of the subjects in the testing sessions, the inter-rater reliability in data analysis, the comparison of background-diverse teams and background-uniform teams, the comparison of technique-diverse teams and technique-uniform teams, and the threats to validity of the experiment.

4.1. Stability of the Subjects’ Performance

In this section, we examine whether each inspector’s performance with the two inspection techniques remains stable across the testing sessions. Stable performance helps minimize the experimental biases associated with subject maturation.
During the six practice sessions and the first two testing sessions, each subject inspected eight small-size SRSs, with each inspection technique applied in four rounds. The total probability of failures by an inspector, i.e., $Q_t$, is used to evaluate the performance of a subject. The average performance of the subjects using the CBR and RIMSM techniques is displayed in Figure 4.
Figure 4 shows that the subjects’ performance improved over the first three rounds and stabilized during the last two rounds. To formally assess the stability of the subjects’ performance, we test the following null and alternative hypotheses:
  • $H_{0,\mathrm{stability}}$: there is no significant difference between the performance of the inspectors using the two inspection techniques in the third and fourth rounds.
  • $H_{A,\mathrm{stability}}$: there is a significant difference between the performance of the inspectors using the two inspection techniques in the third and fourth rounds.
A t-test was conducted to evaluate the null hypothesis at a 5% significance level. The results are summarized in Table 7. For both inspection techniques, the p-values were much greater than the significance level, indicating that the null hypothesis could not be rejected. Consequently, the null hypothesis is assumed to hold, suggesting that the performance of the subjects remained stable when using both techniques after the third round. Therefore, it can be concluded that the performance of the subjects had stabilized by the time they entered the testing sessions.

4.2. Inter-Rater Reliability

Given the results recording sheets collected in the testing sessions, the investigators of the experiment need to determine whether the defects identified by each inspector are valid. The entries in the results recording sheets were analyzed by Investigators 1 and 3 independently. The inter-rater reliability (IRR) describes the level of agreement between the investigators and the degree to which the data collected are reliable. Cohen’s kappa coefficient [26] is used to quantify the IRR in the experiment. The overall Cohen’s kappa coefficient obtained is 90.4%.
Based on [26], a Cohen’s kappa coefficient above 90% indicates an “almost perfect” level of agreement between raters, corresponding to 82–100% of the data being reliable. Accordingly, the IRR results in this experiment demonstrate that the collected data are highly reliable.
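Cohen’s kappa can be computed directly from the two raters’ verdicts on each recorded defect; the following sketch uses scikit-learn with illustrative verdicts (1 = valid defect, 0 = not a defect), not the actual experimental data.

```python
# Inter-rater reliability between two investigators on defect validity (illustrative data).
from sklearn.metrics import cohen_kappa_score

rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_3 = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
print(f"Cohen's kappa = {cohen_kappa_score(rater_1, rater_3):.2f}")
```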

4.3. Comparison of Background-Diverse Teams and Background-Uniform Teams

In this section, the creation of background-diverse teams and background-uniform teams is discussed first. Then, the created background-diverse teams and background-uniform teams are compared to test hypothesis $H_{0,\mathrm{background}}$.

4.3.1. Creation of Background-Diverse Teams and Background-Uniform Teams

Let us first consider the creation of background-uniform teams. Given a small-size SRS, e.g., S7, we can create a background-uniform team by selecting m inspectors from Group 1 (see Table 6) and combining their results in testing session 1. The created inspection team has the following features: all members major in CSE and all members use the RIMSM technique. In addition, the background-uniform teams that have the same features can also be created by selecting m members from Group 4 and combining their results in testing session 2. The first row of Table 8 summarizes the creation of the background-uniform teams in which all members major in CSE and use the RIMSM method. The second and third columns of the first row specify the features of the created background-uniform teams. In row 1, the background-uniform teams with the specified features are labeled as BU_R1. The methods to create the BU_R1 teams are given in the fourth column.
Given the small-size SRSs, four different types of background-uniform teams can be created, labeled BU_R1, BU_R2, BU_C1, and BU_C2, as displayed in Table 8. Since we can select any $m$ inspectors from a group to create an inspection team, the number of different teams that can be created from a group is given by the combination formula below, where $g$ is the number of inspectors in the group. The last column of Table 8 gives the total number of different inspection teams that can be created for each type of background-uniform team when $m = 2$.
$N_{\mathrm{teams}} = \binom{g}{m} = \frac{g!}{(g - m)!\, m!}$
It should be noted that Groups 3 and 6 are excluded when creating the background-uniform teams, since each contains only one subject. In addition, inspection teams of size $m = 4$ are not considered in this section: Groups 2, 4, and 5 each contain five subjects, so only five different teams of size $m = 4$ can be created from each of those groups, which does not provide enough data points for hypothesis testing.
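For example, the combination formula above gives the following team counts for the group sizes used in this experiment.

```python
from math import comb

# Number of distinct inspection teams of size m drawn from a group of g subjects.
print(comb(6, 2))  # 15 two-member teams from the six-subject Group 1
print(comb(5, 4))  # only 5 four-member teams from a five-subject group
```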
For the SRSs of a small size, a background-diverse team that contains m members can be created by selecting m/2 inspectors from Group 1 and m/2 inspectors from Group 2, and combining their results in testing session 1. The created inspection team has the following two features: (1) half of the members major in CSE and the other half major in EE, (2) all members use the RIMSM method. The background-diverse teams that have the same features can also be created by selecting m/2 inspectors from Group 4 and m/2 inspectors from Group 5, and combining their results in testing session 2. For the small-size SRSs, two types of background-diverse teams can be created, as displayed in Table 9. The background-diverse teams in which all members use the RIMSM technique are labeled as BD_R. The background-diverse teams, in which all members use the CBR technique, are labeled as BD_C.
The discussion above focuses on the small-size SRSs. Given the medium-size or large-size SRSs, the background-diverse teams and background-uniform teams can be created in the same manner using the data collected in testing sessions 3, 4, 5, and 6.
To test hypothesis $H_{0,\mathrm{background}}$, we can compare the inspection teams BD_R to BU_R1 and BU_R2, respectively, for SRSs of each size. We can also compare the inspection teams BD_C to BU_C1 and BU_C2, respectively, for SRSs of each size. It should be noted that the number of inspection teams differs across types. Therefore, Welch’s t-test, which allows unequal sample sizes and unequal variances, is used for hypothesis testing. The results are discussed in the following section.
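Welch’s t-test is available directly in SciPy; the sketch below compares hypothetical $Q_m$ samples for two team types and is not based on the actual experimental data.

```python
# Welch's t-test (unequal sample sizes and variances) on the team failure probability Q_m.
from scipy import stats

q_m_diverse = [0.18, 0.22, 0.15, 0.20, 0.25, 0.19]        # hypothetical Q_m of diverse teams
q_m_uniform = [0.21, 0.17, 0.24, 0.20, 0.23, 0.18, 0.22]  # hypothetical Q_m of uniform teams
t_stat, p_value = stats.ttest_ind(q_m_diverse, q_m_uniform, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```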

4.3.2. Results of Comparing Background-Diverse Teams and Background-Uniform Teams

In this section, the created inspection teams BU_C1, BU_C2, and BD_C are compared.
Table 10 gives a summary of the performance-related statistics of the BU_C1, BU_C2, and BD_C teams of size $m = 2$ in the inspection of the small-size SRSs (i.e., S7 and S8). The first row of Table 10 displays the results for $Q_m$ (i.e., the probability of a team failure). In the first six columns of row 1, the mean and standard deviation (std) of $Q_m$ for the BU_C1, BU_C2, and BD_C teams are given, respectively. Hypothesis $H_{0,\mathrm{background}}$ is tested in two situations: (1) comparing the average $Q_m$ of the BU_C1 teams to that of the BD_C teams, and (2) comparing the average $Q_m$ of the BU_C2 teams to that of the BD_C teams. The testing results in the first situation are provided in columns 7, 8, and 9, which show the p-value, the power, and the effect size of the hypothesis test. The last three columns display the hypothesis testing results in the second situation.
At a significance level of 0.05, the p-values of the hypothesis tests in both situations are much greater than the significance level, so we fail to reject hypothesis $H_{0,\mathrm{background}}$ in both situations. The results indicate that there is no difference between the performance of the background-uniform teams (BU_C1 and BU_C2) and the BD_C teams in terms of the probability of team failures. In columns 8 and 11 of row 1 in Table 10, the power of the test in both situations shows that the probability of a type-II error is very low.
Cohen’s d is used to quantify the effect size when comparing different inspection teams. The effect size measures the magnitude of the difference between two samples [27]. Specifically, an effect size below 0.2 indicates that the difference is negligible, 0.2–0.5 suggests a small but noticeable difference, 0.5–0.8 reflects a medium difference, and values above 0.8 represent a large and substantial difference [28]. In columns 9 and 12 of row 1 in Table 10, the effect sizes show that the differences between inspection teams of different types are negligible or small.
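For reference, Cohen’s d for two samples can be computed with the pooled standard deviation, as in the minimal sketch below (hypothetical $Q_m$ samples; not necessarily the exact formulation used in [27,28]).

```python
import numpy as np

def cohens_d(sample_a, sample_b):
    """Cohen's d effect size using the pooled standard deviation."""
    a, b = np.asarray(sample_a, dtype=float), np.asarray(sample_b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical Q_m samples for two team types; |d| < 0.2 would indicate a negligible difference.
print(f"d = {cohens_d([0.20, 0.22, 0.19, 0.21, 0.23], [0.21, 0.20, 0.23, 0.22, 0.19]):.2f}")
```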
Rows 2–8 of Table 10 display the other parameters of the Z-model that are used to analyze $Q_m$, including $Q_t$, $Q_t^D$, $Q_t^I$, $\beta_1$, $\beta_2$, $Q_{1:m}$, and $Q_{2:m}$. For example, for the BU_C1 teams in column 1, the average probability of a failure by an inspector is $Q_t = 0.39$, which consists of the probability of a dependent failure $Q_t^D = 0.34$ and the probability of an independent failure $Q_t^I = 0.05$. The probability of an independent failure is much less than that of a dependent failure.
In rows 5 and 6, $\beta_1$ and $\beta_2$ describe the proportion of the probability of a dependent failure that is involved in a detection activity with dependency $d = 1$ and $d = 2$, respectively. In an inspection team, the sum of $\beta_d$ ($d \in [1, m]$) equals 1 if dependent failures exist; however, if an inspection team has no dependent failures, $\beta_d$ equals 0 for all $d \in [1, m]$. Therefore, the sum of the mean values of $\beta_1$ and $\beta_2$ in each column is less than 1. The probability of a detection activity that contains $d$ dependent failures is described by $Q_{d:m}$. In the comparison between the BU_C1, BU_C2, and BD_C teams, the p-values are greater than the significance level of 0.05 for all parameters considered. The powers and effect sizes are either small or very small. No difference is observed between the BU_C1, BU_C2, and BD_C teams in terms of any of the parameters. The standard deviation of each variable is very large compared to its mean; in other words, all variables have a high level of dispersion around the mean.
Given Table 10, we can conclude that, for the inspection teams of size m = 2, in which all members use the CBR method, the performance of the background-uniform teams is the same as the background-diverse teams in the inspection of small-size SRSs.
Table 11 and Table 12 display the summary of the statistics related to the performance of the BU_C1, BU_C2, and BD_C teams in the inspection of the SRSs of medium size and large size. The results in Table 11 and Table 12 suggest a similar conclusion, as discussed for Table 10. Therefore, for the inspection teams of size m = 2, in which all members use the CBR method, the performance of the background-uniform teams is the same as the background-diverse teams in the inspection of SRSs of all sizes.
Figure 5 plots the mean $Q_m$ of the BU_C1, BU_C2, and BD_C inspection teams in the inspection of SRSs of different sizes. We can observe that the performance of an inspection team decreases as the size of the SRS increases. For SRSs of each size, the performance of the BU_C1, BU_C2, and BD_C inspection teams is very close.
Hypothesis $H_{0,\mathrm{background}}$ is also tested by comparing the BU_R1 and BU_R2 teams to the BD_R teams of size $m = 2$. In this case, the summaries of the statistics related to the performance of the BU_R1, BU_R2, and BD_R teams in inspecting SRSs of small, medium, and large size are given in Appendix B. The results are similar to those of the comparison of the BU_C1, BU_C2, and BD_C teams, and indicate that, for inspection teams of size $m = 2$ in which all members use the RIMSM method, the performance of the background-uniform teams is the same as that of the background-diverse teams in the inspection of SRSs of all sizes.
Figure 6 plots the mean $Q_m$ of the BU_R1, BU_R2, and BD_R inspection teams in the inspection of SRSs of different sizes. For SRSs of each size, the performance of the BU_R1, BU_R2, and BD_R inspection teams is very close. We also notice that, for these teams, performance on the medium-size SRSs is very close to that on the large-size SRSs. This is because the effectiveness of the RIMSM method is stable for medium-size and large-size SRSs, as discussed in Li’s work [7].
Table 13 summarizes the results of testing hypothesis $H_{0,\mathrm{background}}$ under different conditions. We are unable to reject this hypothesis under any of the conditions. Overall, we can conclude that no difference is observed between the performance of the background-uniform teams and the background-diverse teams. The reason is that the subjects shared similar knowledge of requirements engineering, as introduced in the training sessions, and used the same inspection technique. They also shared the same information about the application under inspection by using the same SRS and StRS documents. Although the majors of the inspectors differ, their perception domains are dominated by this shared knowledge, information, and technique. Therefore, a background-diverse team does not reduce the dependencies between the inspectors significantly, and its overall performance is not improved compared to a background-uniform team.
Although adding diversity to the backgrounds of inspectors in an inspection team cannot reduce the number of dependent failures in the inspection team, other strategies may be effective. To avoid dependent failures, we can diversify the information about the application that is available to the inspectors. For example, some inspectors can use the StRS document as the oracle and the others can contact the stakeholders directly. A more reasonable strategy is to use different inspection techniques, a solution which will be discussed in the next section.

4.4. Comparison of Technique-Diverse Teams and Technique-Uniform Teams

In this section, the creation of technique-diverse teams and technique-uniform teams is introduced first. Then, the technique-diverse teams and technique-uniform teams are compared to test hypothesis $H_{0,\mathrm{technique}}$. Welch’s t-test, which allows unequal sample sizes and unequal variances, is used for hypothesis testing.

4.4.1. Creation of Technique-Diverse Teams and Technique-Uniform Teams

The previous section shows that the backgrounds of the inspectors in an inspection team do not affect the performance of the inspection team. Therefore, in this section, Groups 1, 2, and 3 in Table 6 are considered together and denoted as Group 1–3 since the subjects in those groups use the same inspection technique in all testing sessions. Similarly, Groups 4, 5, and 6 are combined and denoted as Group 4–6. In this section, the inspection teams of both size m = 2 and m = 4 are considered.
A technique-uniform team can be created by selecting $m$ inspectors from Group 1–3 or Group 4–6 and combining their results in a testing session. Based on the inspection technique used in a technique-uniform team, two types of technique-uniform teams can be created, labeled TU_R teams and TU_C teams. All inspectors in a TU_R team use the RIMSM method, and all inspectors in a TU_C team use the CBR method. Table 14 shows the creation of the technique-uniform teams for the small-size SRSs. A TU_R team can be created by selecting $m$ inspectors from Group 1–3 and combining their results in testing session 1, or by selecting $m$ inspectors from Group 4–6 and combining their results in testing session 2. The creation of the TU_C teams is displayed in the last row of Table 14. The last two columns give the total number of different TU_R and TU_C teams that can be created when $m = 2$ and $m = 4$.
A technique-diverse team can be created by selecting m/2 inspectors from Group 1–3 and m/2 inspectors from Group 4–6. The technique-diverse teams are denoted as TD teams. The creation of TD teams for the small-size SRSs is shown in Table 15. The TU_R, TU_C, and TD teams for the medium- or large-size SRSs can be created in the same manner using the data collected in testing sessions 3, 4, 5, and 6.
To test hypothesis $H_{0,\mathrm{technique}}$, we can compare the TD teams to the TU_R and TU_C teams, respectively, for inspection teams of each size. The results are discussed in the next section.

4.4.2. Results of Comparing Technique-Diverse Teams and Technique-Uniform Teams

In this section, the TU_R, TU_C, and TD teams are compared to test hypothesis $H_{0,\mathrm{technique}}$.
For inspection teams of size $m = 2$, Table 16 gives a summary of the statistics related to the performance of the TU_R, TU_C, and TD teams in the inspection of the small-size SRSs. The first row shows that the average $Q_m$ of the TD teams is less than that of the TU_R teams and the TU_C teams. Hypothesis $H_{0,\mathrm{technique}}$ is tested in two situations: (1) comparing the average $Q_m$ of the TU_C teams to that of the TD teams, and (2) comparing the average $Q_m$ of the TU_R teams to that of the TD teams. The hypothesis is rejected in both situations, since the p-values are much less than the significance level (0.05). In the first situation, the power of the test is almost 100%, and the effect size suggests that the difference between the TU_C teams and the TD teams is very large. In the second situation, the power of the test is also high, and the effect size shows that the difference between the TU_R teams and the TD teams is small. The reason for the small effect size is that the RIMSM method is a more effective inspection technique than CBR. Although the effectiveness of the RIMSM technique is much higher than that of the CBR technique, introducing technique diversity can still improve the performance of the inspection teams.
Row 2 of Table 16 shows that the probability of failures of an inspector in the TU_C teams, using the CBR technique (i.e., 0.33), is much higher than that of an inspector in the TU_R teams using the RIMSM technique (i.e., 0.17). The probability of failures by an inspector in the TD teams is between the TU_R teams and TU_C teams. This is because, by using diverse techniques in an inspection team, we are not enhancing the average performance of an individual inspector. Since the RIMSM technique is more effective, we can expect that the average performance of an individual inspector is lower in a TD team compared to a TU_R team. However, the performance of the TD teams is higher than the TU_R teams and the TU_C teams. The reasons are discussed below using the parameters of the Z-model.
Compared to the TU_C teams, both the probability of a dependent failure and the probability of an independent failure by an inspector in the TD teams are lower. In addition, $\beta_2$ in the TD teams (i.e., 0.03) is much lower than in the TU_C teams (i.e., 0.24). Therefore, an inspector in a TD team has a much lower probability of being involved in a detection activity in which all inspectors fail dependently. As a result, the performance of a TD team is better than that of a TU_C team.
The comparison of the TD teams and the TU_R teams is somewhat different. Since RIMSM is a more effective technique, we observe that the probability of dependent failures and the probability of independent failures by an inspector in the TD teams are higher than in the TU_R teams (see rows 3 and 4 in Table 16). Intuitively, a TU_R team should therefore perform better. However, in rows 5 and 6, $\beta_1$ of the TD teams (i.e., 0.84) is greater than that of the TU_R teams (i.e., 0.74), and $\beta_2$ of the TD teams (i.e., 0.03) is much less than that of the TU_R teams (i.e., 0.19). This leads to the observation that, in a TD team, the probability of a detection activity that contains two dependent failures (i.e., $Q_{2:2}$) is small. Although an inspector in the TD teams has a higher probability of a dependent failure, a large portion of that probability is associated with detection activities of dependency $d = 1$. Therefore, using diverse techniques may not increase the average performance of an individual inspector, but it can reduce the probability of detection activities of high dependency. As a result, the performance of the inspection teams is improved.
Table 17 displays a summary of the statistics for comparing the TU_R, TU_C, and TD teams of size $m = 4$ in the inspection of the small-size SRSs. As shown in the first row, the p-value is close to 0 in the comparison between the TU_C teams and the TD teams, and between the TU_R teams and the TD teams. Hypothesis $H_{0,\mathrm{technique}}$ is rejected in both cases. The reasons are similar to those discussed for the TU_R, TU_C, and TD teams of size $m = 2$.
In the comparison between the TU_C teams and the TD teams, the probability of a dependent failure and the probability of an independent failure by an inspector in the TD teams are lower, and $\beta_2$ in the TD teams is also lower. Therefore, the probability of a detection activity with dependency $d = 2$ is lower in a TD team. As a result, the performance of a TD team is better than that of a TU_C team.
In the comparison between the TU_R teams and the TD teams, the probability of a dependent failure and the probability of an independent failure by an inspector in the TD teams are higher than in the TU_R teams. However, the beta factors behave as follows: in a TD team, $\beta_1$ and $\beta_2$ are higher, while $\beta_3$ and $\beta_4$ are lower, compared to a TU_R team. The proportion of detection activities with high dependencies is reduced in a TD team, while the proportion of detection activities with low dependencies is increased. In other words, the diversity in an inspection team converts high-order dependencies into low-order dependencies. As a result, in a TD team, the probabilities of a detection activity that contains more than two dependent failures (i.e., $Q_{3:4}$ and $Q_{4:4}$) are small. Therefore, the performance of the TD teams is higher.
In the inspection of the medium-size and large-size SRSs, we observed results similar to those discussed above. The summaries of the statistics related to the performance of the TU_R, TU_C, and TD teams of size $m = 2$ and $m = 4$ in the inspection of the medium-size and large-size SRSs are provided in Appendix C. Figure 7a,b plots the probability of team failures (i.e., $Q_m$) of the TU_R, TU_C, and TD teams of size $m = 2$ and $m = 4$ in inspecting SRSs of different sizes. We can see that the probability of failures by a TD team is smaller than that of a TU_R team or a TU_C team, regardless of the size of the SRSs and the size of the team. Table 18 summarizes the results of testing hypothesis $H_{0,\mathrm{technique}}$ under different conditions. We are able to reject the hypothesis under all conditions.
Table 19 displays the ratio of $Q_m$ (i.e., the probability of team failures) of a TD team to the $Q_m$ of a TU_R team and of a TU_C team in different cases. We can see that, as the size of the SRSs increases, the ratio decreases and becomes stable. In addition, the ratio for inspection teams of size $m = 4$ is higher than that for inspection teams of size $m = 2$. In other words, as the size of the inspection team increases, the improvement in team performance gained by using diverse techniques also increases.
In conclusion, the performance of the technique-diverse teams is better than that of the technique-uniform teams. This is because different inspection techniques detect different types of defects. Based on the taxonomy of defects in the requirements phase, we classified the defects encountered in our experiment into three main types: missing defects, extra defects, and incorrect defects. Figure 8 displays the average percentage of defects of each type detected by an inspector using the RIMSM and CBR techniques. The RIMSM technique performed better in detecting missing and extra defects, while the CBR technique performed better in detecting incorrect defects. Therefore, using different inspection techniques in an inspection team reduces the detection activities with high dependencies and increases those with low dependencies. As a result, the performance of the inspection team is improved.
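The per-type detection percentages plotted in Figure 8 amount to averaging, over the inspectors who used a given technique, the fraction of seeded defects of each type that they reported. A minimal sketch is given below; the data structures are hypothetical.

```python
from collections import defaultdict

def detection_rate_by_type(defect_type, reported_by_inspector):
    """Average per-inspector detection percentage for each defect type.

    defect_type: dict mapping defect id -> 'missing' | 'extra' | 'incorrect'
        for every defect seeded in the SRS.
    reported_by_inspector: dict mapping inspector id -> set of defect ids
        reported by that inspector (all inspectors used the same technique).
    """
    totals = defaultdict(int)
    for t in defect_type.values():
        totals[t] += 1

    per_type_rates = defaultdict(list)
    for reported in reported_by_inspector.values():
        hits = defaultdict(int)
        for defect in reported:
            hits[defect_type[defect]] += 1
        for t, total in totals.items():
            per_type_rates[t].append(100.0 * hits[t] / total)

    return {t: sum(rates) / len(rates) for t, rates in per_type_rates.items()}
```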

4.5. Threats to Validity

This section analyzes the typical threats to the validity of the experiment and the measures taken to minimize them.

4.5.1. Internal Validity

  • Selection Bias
The research subjects were selected from junior and senior undergraduate engineering students. We believe the difference between junior and senior students does not affect our results, since the inspection techniques were new to all of the undergraduate students. Learning these techniques requires an average level of programming and mathematics knowledge, which is shared by all engineering students.
  • Rivalry
All subjects were trained identically and did not know a priori that they would be grouped; the inspection teams were created virtually. Thus, rivalry between inspectors or between inspection teams was not possible.
  • History
The entire experiment lasted only 3 days, during which the subjects had no other classes. Therefore, the likelihood of significant external events affecting their attributes can be disregarded.
  • Maturation
As discussed in Section 4.1, the performance of the subjects is expected to remain stable across testing sessions. Hence, maturation bias can be ignored.
  • Repeated testing
This threat was minimized by using SRSs of different sizes and topics in each testing session.
  • Hawthorne effect
During the testing sessions, subjects inspected the SRSs independently without interference from the investigators. Any Hawthorne effect would be consistent across all sessions.
  • Experimenter bias
Experimenter bias was minimized because all sessions were conducted by Investigator 2, who was unaware of the experimental hypotheses.
  • Observer-expectancy effect
This threat was reduced since the subjects were not informed of the experiment’s true purpose. The study was introduced under the title “Study of Software Requirements Inspection Process” to avoid expectation-related bias.
  • Mortality
One subject in Group 1 dropped out of testing sessions 5 and 6 due to a home emergency. The results for the small-size and medium-size SRSs were not affected, and the results for the large-size SRSs were generated using the remaining 22 subjects. The absence of this subject does not affect the validity of our results, since the SRSs of different sizes were analyzed separately.

4.5.2. External Validity

First, our research subjects were junior and senior engineering undergraduates. Generalizing our findings to other inspectors should be valid, since typical requirements inspectors usually come from engineering fields, hold at least a bachelor's degree, and therefore have a background and learning potential similar to those of the subjects in our experiment.
Second, this study involved 23 subjects from CSE, EE, and ME. Although the limited number of participants and academic disciplines may constrain the generalizability of the results, the findings remain valid for two reasons: the number of virtual teams was sufficient to support statistical hypothesis testing, and CSE students primarily focus on software whereas EE and ME students concentrate more on hardware, so the substantial differences between these disciplines provide a representative level of diversity across academic backgrounds.
Third, the inspection teams were created virtually in the experiment. This design is intentional, because the defects detected by an inspection team are exactly the union of the defects detected by each inspector in the team. The advantages of such a design include the following: (1) it provides more data points by enumerating all combinations of the subjects in an inspection team, and (2) it isolates the dependencies between the subjects to the backgrounds and the inspection techniques only, eliminating type-II dependent failures. The results should therefore remain valid for a real inspection team. A minimal sketch of this construction is given below.
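The virtual-team construction can be expressed directly in terms of set unions over all combinations of inspectors. The sketch below is illustrative; the variable names are hypothetical.

```python
from itertools import combinations
from math import comb

def virtual_teams(detected_by_inspector, m):
    """Enumerate all virtual teams of size m for one testing session.

    detected_by_inspector: dict mapping inspector id -> set of defect ids
        detected by that inspector in the session. A virtual team is credited
        with the union of its members' defects, so team-level results can be
        derived without any additional data collection.
    """
    return {
        members: set().union(*(detected_by_inspector[a] for a in members))
        for members in combinations(sorted(detected_by_inspector), m)
    }

# Scale of the enumeration with the group sizes of Table 6 (12 subjects in
# Groups 1-3 and 11 subjects in Groups 4-6): comb(12, 2) + comb(11, 2) = 121
# technique-uniform pairs per technique and 12 * 11 * 2 = 264 technique-
# diverse pairs, which matches the team counts in Tables 14 and 15.
```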
In addition, this experiment studied only the RIMSM and CBR techniques, each with its own instrument (the checklist and the RITSM tool, respectively). Using other inspection techniques or instruments may affect the results. However, this paper aims to study the effectiveness of introducing diversity into an inspection team, and the results demonstrate that the performance of an inspection team can be increased by using diverse techniques. To apply our findings, a technique-diverse team should use techniques that are as different as possible.
Another threat to the generalization of our findings is the size of the inspection team. To minimize this threat, inspection teams of sizes m = 2 and m = 4 were studied.
Last, the SRSs used in the testing sessions were designed to cover different types of systems, and different sizes were considered as well. Therefore, possible biases related to the SRSs used in the testing sessions should be minimized and should not affect external validity.

5. Conclusions

In this paper, an experiment was designed and conducted to study the effectiveness of introducing diversity in an inspection team. Two types of diversity were considered: the diverse backgrounds of the inspectors and the diverse techniques used for defect detection. The Z-model was used to analyze the collected data. The results showed that (1) the performance of the background-diverse teams is the same as that of the background-uniform teams; (2) the performance of the technique-diverse teams is better than that of the technique-uniform teams; and (3) as the size of an inspection team increases, the improvement in performance obtained by using diverse techniques also increases. This experiment demonstrated that using diverse techniques in an inspection team can improve the performance of the inspection team.
The findings in this paper provide experimental support for the software development life cycle, specifically for enhancing software reliability during the requirements phase by introducing diversity. These results are significant for safety-critical systems, where software failures can have severe consequences. By introducing diversity in the requirements phase, potential defects can be identified early, improving both the safety and the economic effectiveness of the system.
Future research includes the following aspects: (1) studying the effectiveness of introducing diversity in other phases of software development, and (2) repeating the experiment with more subjects from industry.

Author Contributions

Conceptualization, B.L.; Methodology, B.L.; Software, B.L.; Validation, B.L.; Investigation, J.L.; Writing—original draft, B.L. and J.L.; Writing—review & editing, B.L., J.L. and X.H.; Visualization, B.L.; Project administration, J.L. and X.H.; Funding acquisition, X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 62120106003).

Data Availability Statement

Raw data used in this paper are not publicly available to preserve individuals’ privacy under the University Human Research Protection Program.

Acknowledgments

We would like to express our special thanks to Carol Smidts, Xiaoxu Diao, and Yunfei Zhao for their help with this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Sensitivity Analysis for the Criteria of the Dependent Failures and Independent Failures

In this experiment, three criteria were used to discriminate between dependent failures and independent failures. A sensitivity analysis of the value of the variable c in these criteria is presented in this section.
In this experiment, we only considered c = 1 and c = 2, because most functions in the SRSs used in the experiment contain 0–3 defects. A value of c greater than or equal to 3 would make the criteria for labelling a failure as independent too stringent; if such a c were selected, most failures would be labelled as dependent failures.
Table A1 gives Q_m for inspection teams of different types and sizes when inspecting the small-size SRSs under the conditions c = 1 and c = 2. The last column of Table A1 gives the relative difference in Q_m between c = 1 and c = 2 for each type of inspection team. Table A2 and Table A3 display the corresponding Q_m values for inspecting the medium-size and large-size SRSs.
Table A1. Q_m of inspection teams for inspecting the small-size SRSs.
| Size | Team | c = 2 | c = 1 | Change% |
| m = 2 | BU_C1 | 0.13 | 0.16 | 18.7% |
| m = 2 | BU_C2 | 0.13 | 0.13 | 0.0% |
| m = 2 | BD_C | 0.14 | 0.15 | 7.0% |
| m = 2 | BU_R1 | 0.04 | 0.04 | 12.7% |
| m = 2 | BU_R2 | 0.03 | 0.03 | 0.0% |
| m = 2 | BD_R | 0.06 | 0.06 | 4.5% |
| m = 2 | TU_C | 0.14 | 0.15 | 7.2% |
| m = 2 | TU_R | 0.05 | 0.05 | 4.9% |
| m = 2 | TD | 0.02 | 0.03 | 26.7% |
| m = 4 | TU_C | 0.04 | 0.05 | 9.9% |
| m = 4 | TU_R | 0.0020 | 0.0022 | 9.1% |
| m = 4 | TD | 0.0003 | 0.0008 | 56.1% |
Table A2. Q_m of inspection teams for inspecting the medium-size SRSs.
| Size | Team | c = 2 | c = 1 | Change% |
| m = 2 | BU_C1 | 0.44 | 0.46 | 4.3% |
| m = 2 | BU_C2 | 0.43 | 0.44 | 3.0% |
| m = 2 | BD_C | 0.47 | 0.48 | 2.7% |
| m = 2 | BU_R1 | 0.24 | 0.25 | 2.0% |
| m = 2 | BU_R2 | 0.26 | 0.26 | 1.7% |
| m = 2 | BD_R | 0.26 | 0.27 | 1.6% |
| m = 2 | TU_C | 0.49 | 0.50 | 2.7% |
| m = 2 | TU_R | 0.26 | 0.26 | 1.6% |
| m = 2 | TD | 0.21 | 0.22 | 5.3% |
| m = 4 | TU_C | 0.31 | 0.32 | 3.8% |
| m = 4 | TU_R | 0.15 | 0.15 | 1.5% |
| m = 4 | TD | 0.04 | 0.04 | 12.0% |
Table A3. Q_m of inspection teams for inspecting the large-size SRSs.
| Size | Team | c = 2 | c = 1 | Change% |
| m = 2 | BU_C1 | 0.55 | 0.55 | 0.7% |
| m = 2 | BU_C2 | 0.56 | 0.56 | 0.3% |
| m = 2 | BD_C | 0.57 | 0.57 | 0.4% |
| m = 2 | BU_R1 | 0.21 | 0.22 | 2.4% |
| m = 2 | BU_R2 | 0.24 | 0.24 | 1.7% |
| m = 2 | BD_R | 0.25 | 0.25 | 1.7% |
| m = 2 | TU_C | 0.53 | 0.53 | 0.4% |
| m = 2 | TU_R | 0.29 | 0.29 | 1.6% |
| m = 2 | TD | 0.25 | 0.25 | 1.7% |
| m = 4 | TU_C | 0.37 | 0.37 | 0.4% |
| m = 4 | TU_R | 0.14 | 0.14 | 2.8% |
| m = 4 | TD | 0.06 | 0.07 | 3.9% |
The average values of the relative difference in Q_m between c = 1 and c = 2 in Table A1, Table A2, and Table A3 are 13%, 4%, and 1%, respectively. The relative difference in Q_m is therefore negligible for the medium-size and large-size SRSs. This is because, with a larger c, more failures are classified as independent, so the fraction of independent failures in our results increases. However, the failure rate of an inspector in the inspection of the medium-size and large-size SRSs is high and the fraction of independent failures is small; therefore, the change in the fraction of independent failures does not affect the results significantly.
The relative difference in Q_m is larger for the small-size SRSs than for the medium-size and large-size SRSs. This is because the failure rate of an inspector is lower for the small-size SRSs, so the fraction of independent failures is higher. As a result, the change in the fraction of independent failures has a larger effect on the relative difference in Q_m for the small-size SRSs.
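The Change% column in Tables A1–A3 can be reproduced as a simple relative difference. The sketch below assumes the difference is taken relative to the c = 1 value, which approximately reproduces the reported figures; exact values depend on the unrounded Q_m.

```python
def relative_change(qm_c2, qm_c1):
    """Relative difference in Q_m between the c = 2 and c = 1 settings (%)."""
    return abs(qm_c1 - qm_c2) / qm_c1 * 100.0

# BU_C1 row of Table A1: relative_change(0.13, 0.16) gives 18.75%,
# reported as 18.7% (the table values themselves are rounded).
```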
The results of testing hypotheses H_{0,background} and H_{0,technique} for c = 2 are displayed in Table A4 and Table A5. Compared with Table 13 and Table 18, the results of the hypothesis testing are the same under c = 1 and c = 2. Therefore, the results of the experiment are not sensitive to the selected value of c. A more detailed sensitivity analysis will be presented in future work, following [29].
Table A4. Testing of H_{0,background}.
| SRS Size | BU_C1 vs. BD_C | BU_C2 vs. BD_C | BU_R1 vs. BD_R | BU_R2 vs. BD_R |
| Small | Not rejected | Not rejected | Not rejected | Not rejected |
| Medium | Not rejected | Not rejected | Not rejected | Not rejected |
| Large | Not rejected | Not rejected | Not rejected | Not rejected |
Table A5. Testing of H_{0,technique}.
| SRS Size | TD vs. TU_C, m = 2 | TD vs. TU_C, m = 4 | TD vs. TU_R, m = 2 | TD vs. TU_R, m = 4 |
| Small | Reject | Reject | Reject | Reject |
| Medium | Reject | Reject | Reject | Reject |
| Large | Reject | Reject | Reject | Reject |

Appendix B. Summary of the Performance Statistics for the BU_R1, BU_R2, and BD_R Teams

This appendix displays a summary of the performance statistics for the BU_R1, BU_R2, and BD_R teams of size m = 2 in the inspection of SRSs of a small size, medium size, and large size. The results are displayed in Table A6, Table A7 and Table A8.
Table A6. Summary of the performance statistics for the BU_R1, BU_R2, and BD_R teams of size m = 2 in inspecting the small-size SRSs.
| Metric | BU_R1 (mean / std) | BU_R2 (mean / std) | BD_R (mean / std) | BU_R1 vs. BD_R (p / power / eff. size) | BU_R2 vs. BD_R (p / power / eff. size) |
| Q_m | 0.04 / 0.09 | 0.03 / 0.07 | 0.06 / 0.09 | 0.22 / 0.12 / 0.19 | 0.12 / 0.18 / 0.27 |
| Q_t | 0.18 / 0.07 | 0.18 / 0.09 | 0.18 / 0.09 | 0.49 / 0.05 / 0.00 | 0.44 / 0.05 / 0.04 |
| Q_t^D | 0.16 / 0.08 | 0.16 / 0.08 | 0.16 / 0.08 | 0.49 / 0.05 / 0.00 | 0.43 / 0.05 / 0.04 |
| Q_t^I | 0.02 / 0.03 | 0.02 / 0.03 | 0.02 / 0.04 | 0.50 / 0.05 / 0.00 | 0.50 / 0.05 / 0.00 |
| β_1 | 0.80 / 0.41 | 0.84 / 0.34 | 0.70 / 0.44 | 0.17 / 0.15 / 0.23 | 0.08 / 0.24 / 0.34 |
| β_2 | 0.16 / 0.37 | 0.11 / 0.28 | 0.22 / 0.40 | 0.24 / 0.10 / 0.17 | 0.08 / 0.22 / 0.32 |
| Q_{1:2} | 0.13 / 0.09 | 0.13 / 0.07 | 0.11 / 0.08 | 0.21 / 0.13 / 0.20 | 0.14 / 0.17 / 0.27 |
| Q_{2:2} | 0.04 / 0.09 | 0.03 / 0.07 | 0.05 / 0.09 | 0.23 / 0.11 / 0.18 | 0.10 / 0.20 / 0.29 |
Table A7. Summary of the performance statistics for the BU_R1, BU_R2, and BD_R teams of size m = 2 in inspecting the medium-size SRSs.
| Metric | BU_R1 (mean / std) | BU_R2 (mean / std) | BD_R (mean / std) | BU_R1 vs. BD_R (p / power / eff. size) | BU_R2 vs. BD_R (p / power / eff. size) |
| Q_m | 0.25 / 0.15 | 0.26 / 0.16 | 0.27 / 0.17 | 0.32 / 0.07 / 0.11 | 0.45 / 0.05 / 0.03 |
| Q_t | 0.45 / 0.15 | 0.47 / 0.15 | 0.46 / 0.16 | 0.35 / 0.07 / 0.09 | 0.45 / 0.05 / 0.03 |
| Q_t^D | 0.42 / 0.15 | 0.44 / 0.15 | 0.43 / 0.16 | 0.37 / 0.06 / 0.08 | 0.43 / 0.05 / 0.05 |
| Q_t^I | 0.03 / 0.05 | 0.03 / 0.03 | 0.03 / 0.04 | 0.45 / 0.05 / 0.03 | 0.40 / 0.06 / 0.06 |
| β_1 | 0.51 / 0.28 | 0.47 / 0.26 | 0.48 / 0.32 | 0.35 / 0.06 / 0.09 | 0.46 / 0.05 / 0.03 |
| β_2 | 0.49 / 0.28 | 0.53 / 0.26 | 0.52 / 0.32 | 0.35 / 0.06 / 0.09 | 0.46 / 0.05 / 0.03 |
| Q_{1:2} | 0.19 / 0.10 | 0.20 / 0.11 | 0.18 / 0.11 | 0.35 / 0.06 / 0.09 | 0.31 / 0.08 / 0.13 |
| Q_{2:2} | 0.22 / 0.16 | 0.24 / 0.15 | 0.24 / 0.17 | 0.28 / 0.08 / 0.14 | 0.43 / 0.05 / 0.04 |
Table A8. Summary of the performance statistics for the BU_R1, BU_R2, and BD_R teams of size m = 2 in inspecting the large-size SRSs.
| Metric | BU_R1 (mean / std) | BU_R2 (mean / std) | BD_R (mean / std) | BU_R1 vs. BD_R (p / power / eff. size) | BU_R2 vs. BD_R (p / power / eff. size) |
| Q_m | 0.22 / 0.17 | 0.24 / 0.14 | 0.25 / 0.18 | 0.25 / 0.10 / 0.17 | 0.43 / 0.05 / 0.04 |
| Q_t | 0.45 / 0.20 | 0.42 / 0.17 | 0.44 / 0.20 | 0.38 / 0.06 / 0.08 | 0.37 / 0.06 / 0.08 |
| Q_t^D | 0.43 / 0.21 | 0.40 / 0.17 | 0.41 / 0.20 | 0.38 / 0.06 / 0.08 | 0.36 / 0.06 / 0.09 |
| Q_t^I | 0.02 / 0.02 | 0.03 / 0.03 | 0.03 / 0.02 | 0.40 / 0.06 / 0.06 | 0.42 / 0.05 / 0.06 |
| β_1 | 0.59 / 0.27 | 0.45 / 0.17 | 0.50 / 0.27 | 0.10 / 0.26 / 0.36 | 0.18 / 0.11 / 0.20 |
| β_2 | 0.41 / 0.27 | 0.55 / 0.17 | 0.50 / 0.27 | 0.10 / 0.26 / 0.36 | 0.18 / 0.11 / 0.20 |
| Q_{1:2} | 0.23 / 0.13 | 0.17 / 0.09 | 0.18 / 0.11 | 0.07 / 0.35 / 0.43 | 0.36 / 0.06 / 0.09 |
| Q_{2:2} | 0.20 / 0.17 | 0.22 / 0.13 | 0.23 / 0.18 | 0.25 / 0.10 / 0.18 | 0.42 / 0.05 / 0.05 |

Appendix C. Summary of the Performance Statistics for the TU_R, TU_C, and TD Teams

This section provides a summary of the performance statistics for the TU_R, TU_C, and TD teams of size m = 2 and m = 4 in the inspection of medium-size and large-size SRSs, as shown in Table A9, Table A10, Table A11 and Table A12.
Table A9. Summary of the performance statistics for the TU_R, TU_C, and TD teams of size m = 2 in the inspection of the medium-size SRSs.
| Metric | TU_C (mean / std) | TU_R (mean / std) | TD (mean / std) | TU_C vs. TD (p / power / eff. size) | TU_R vs. TD (p / power / eff. size) |
| Q_m | 0.50 / 0.23 | 0.26 / 0.16 | 0.22 / 0.20 | 0.00 / 1.00 / 1.31 | 0.02 / 0.50 / 0.22 |
| Q_t | 0.66 / 0.16 | 0.44 / 0.15 | 0.55 / 0.16 | 0.00 / 1.00 / 0.69 | 0.00 / 1.00 / 0.66 |
| Q_t^D | 0.63 / 0.17 | 0.41 / 0.16 | 0.52 / 0.17 | 0.00 / 1.00 / 0.70 | 0.00 / 1.00 / 0.67 |
| Q_t^I | 0.02 / 0.04 | 0.03 / 0.04 | 0.03 / 0.04 | 0.06 / 0.31 / 0.16 | 0.14 / 0.19 / 0.12 |
| β_1 | 0.32 / 0.32 | 0.47 / 0.31 | 0.72 / 0.28 | 0.00 / 1.00 / 1.38 | 0.00 / 1.00 / 0.89 |
| β_2 | 0.68 / 0.32 | 0.53 / 0.31 | 0.28 / 0.28 | 0.00 / 1.00 / 1.38 | 0.00 / 1.00 / 0.89 |
| Q_{1:2} | 0.17 / 0.13 | 0.17 / 0.10 | 0.34 / 0.11 | 0.00 / 1.00 / 1.47 | 0.00 / 1.00 / 1.54 |
| Q_{2:2} | 0.47 / 0.25 | 0.24 / 0.16 | 0.18 / 0.20 | 0.00 / 1.00 / 1.33 | 0.00 / 0.79 / 0.31 |
Table A10. Summary of the performance statistics for the TU_R, TU_C, and TD teams of size m = 4 in the inspection of the medium-size SRSs.
| Metric | TU_C (mean / std) | TU_R (mean / std) | TD (mean / std) | TU_C vs. TD (p / power / eff. size) | TU_R vs. TD (p / power / eff. size) |
| Q_m | 0.32 / 0.23 | 0.15 / 0.11 | 0.04 / 0.08 | 0.00 / 1.00 / 2.59 | 0.00 / 1.00 / 1.20 |
| Q_t | 0.66 / 0.10 | 0.45 / 0.10 | 0.55 / 0.11 | 0.00 / 1.00 / 1.07 | 0.00 / 1.00 / 0.96 |
| Q_t^D | 0.64 / 0.11 | 0.41 / 0.11 | 0.52 / 0.11 | 0.00 / 1.00 / 1.07 | 0.00 / 1.00 / 0.94 |
| Q_t^I | 0.02 / 0.03 | 0.03 / 0.03 | 0.03 / 0.03 | 0.00 / 1.00 / 0.17 | 0.06 / 0.40 / 0.06 |
| β_1 | 0.07 / 0.06 | 0.20 / 0.13 | 0.12 / 0.16 | 0.00 / 1.00 / 0.35 | 0.00 / 1.00 / 0.51 |
| β_2 | 0.12 / 0.15 | 0.19 / 0.18 | 0.51 / 0.24 | 0.00 / 1.00 / 1.69 | 0.00 / 1.00 / 1.35 |
| β_3 | 0.40 / 0.29 | 0.30 / 0.23 | 0.33 / 0.25 | 0.00 / 1.00 / 0.27 | 0.00 / 0.72 / 0.09 |
| β_4 | 0.42 / 0.33 | 0.30 / 0.25 | 0.04 / 0.11 | 0.00 / 1.00 / 2.53 | 0.00 / 1.00 / 1.98 |
| Q_{1:4} | 0.04 / 0.03 | 0.08 / 0.05 | 0.05 / 0.05 | 0.00 / 0.97 / 0.14 | 0.00 / 1.00 / 0.59 |
| Q_{2:4} | 0.02 / 0.03 | 0.03 / 0.03 | 0.08 / 0.04 | 0.00 / 1.00 / 1.69 | 0.00 / 1.00 / 1.58 |
| Q_{3:4} | 0.08 / 0.06 | 0.04 / 0.03 | 0.06 / 0.05 | 0.00 / 1.00 / 0.29 | 0.00 / 1.00 / 0.47 |
| Q_{4:4} | 0.29 / 0.25 | 0.14 / 0.12 | 0.03 / 0.08 | 0.00 / 1.00 / 2.48 | 0.00 / 1.00 / 1.37 |
Table A11. Summary of the performance statistics for the TU_R, TU_C, and TD teams of size m = 2 in the inspection of the large-size SRSs.
| Metric | TU_C (mean / std) | TU_R (mean / std) | TD (mean / std) | TU_C vs. TD (p / power / eff. size) | TU_R vs. TD (p / power / eff. size) |
| Q_m | 0.53 / 0.24 | 0.29 / 0.19 | 0.25 / 0.21 | 0.00 / 1.00 / 1.29 | 0.03 / 0.43 / 0.21 |
| Q_t | 0.69 / 0.16 | 0.49 / 0.19 | 0.57 / 0.15 | 0.00 / 1.00 / 0.82 | 0.00 / 1.00 / 0.53 |
| Q_t^D | 0.68 / 0.16 | 0.46 / 0.19 | 0.55 / 0.16 | 0.00 / 1.00 / 0.83 | 0.00 / 1.00 / 0.53 |
| Q_t^I | 0.02 / 0.01 | 0.02 / 0.02 | 0.02 / 0.02 | 0.03 / 0.41 / 0.20 | 0.08 / 0.32 / 0.17 |
| β_1 | 0.28 / 0.23 | 0.43 / 0.19 | 0.66 / 0.27 | 0.00 / 1.00 / 1.45 | 0.00 / 1.00 / 0.92 |
| β_2 | 0.72 / 0.23 | 0.57 / 0.19 | 0.34 / 0.27 | 0.00 / 1.00 / 1.45 | 0.00 / 1.00 / 0.92 |
| Q_{1:2} | 0.16 / 0.10 | 0.18 / 0.10 | 0.33 / 0.11 | 0.00 / 1.00 / 1.57 | 0.00 / 1.00 / 1.37 |
| Q_{2:2} | 0.52 / 0.24 | 0.27 / 0.19 | 0.22 / 0.21 | 0.00 / 1.00 / 1.33 | 0.01 / 0.59 / 0.25 |
Table A12. Summary of the performance statistics for the TU_R, TU_C, and TD teams of size m = 4 in the inspection of the large-size SRSs.
| Metric | TU_C (mean / std) | TU_R (mean / std) | TD (mean / std) | TU_C vs. TD (p / power / eff. size) | TU_R vs. TD (p / power / eff. size) |
| Q_m | 0.37 / 0.23 | 0.14 / 0.10 | 0.07 / 0.11 | 0.00 / 1.00 / 2.38 | 0.00 / 1.00 / 0.70 |
| Q_t | 0.70 / 0.11 | 0.50 / 0.15 | 0.57 / 0.10 | 0.00 / 1.00 / 1.37 | 0.00 / 1.00 / 0.54 |
| Q_t^D | 0.70 / 0.11 | 0.49 / 0.15 | 0.55 / 0.11 | 0.00 / 1.00 / 1.38 | 0.00 / 1.00 / 0.56 |
| Q_t^I | 0.02 / 0.01 | 0.02 / 0.01 | 0.02 / 0.01 | 0.00 / 1.00 / 0.29 | 0.00 / 1.00 / 0.26 |
| β_1 | 0.05 / 0.04 | 0.13 / 0.11 | 0.09 / 0.11 | 0.00 / 1.00 / 0.41 | 0.00 / 1.00 / 0.40 |
| β_2 | 0.11 / 0.11 | 0.25 / 0.16 | 0.48 / 0.22 | 0.00 / 1.00 / 1.73 | 0.00 / 1.00 / 1.07 |
| β_3 | 0.36 / 0.22 | 0.35 / 0.17 | 0.35 / 0.20 | 0.08 / 0.33 / 0.06 | 0.41 / 0.05 / 0.01 |
| β_4 | 0.48 / 0.28 | 0.27 / 0.15 | 0.08 / 0.15 | 0.00 / 1.00 / 2.42 | 0.00 / 1.00 / 1.27 |
| Q_{1:4} | 0.03 / 0.02 | 0.05 / 0.03 | 0.04 / 0.04 | 0.00 / 1.00 / 0.26 | 0.00 / 1.00 / 0.29 |
| Q_{2:4} | 0.02 / 0.02 | 0.04 / 0.03 | 0.08 / 0.03 | 0.00 / 1.00 / 1.87 | 0.00 / 1.00 / 1.27 |
| Q_{3:4} | 0.08 / 0.05 | 0.06 / 0.04 | 0.07 / 0.05 | 0.00 / 1.00 / 0.25 | 0.00 / 0.99 / 0.17 |
| Q_{4:4} | 0.36 / 0.24 | 0.13 / 0.10 | 0.05 / 0.11 | 0.00 / 1.00 / 2.38 | 0.00 / 1.00 / 0.71 |

References

  1. Arndt, S.A.; Alvarado, R.; Dittman, B.; Mott, K.; Wood, R. NRC Technical Basis for Evaluation of Its Position on Protection Against Common Cause Failure in Digital Systems Used in Nuclear Power Plants. In Proceedings of the 2017 NPIC-HMIT, San Francisco, CA, USA, 11–15 June 2017.
  2. Alshazly, A.A.; Elfatatry, A.M.; Abougabal, M.S. Detecting defects in software requirements specification. Alex. Eng. J. 2014, 53, 513–527.
  3. Porter, A.A.; Votta, L.G.; Basili, V.R. Comparing detection methods for software requirements inspections: A replicated experiment. IEEE Trans. Softw. Eng. 1995, 21, 563–575.
  4. He, L.; Carver, J. PBR vs. Checklist: A Replication in the N-Fold Inspection Context. In Proceedings of the 2006 ACM/IEEE International Symposium on Empirical Software Engineering, Rio de Janeiro, Brazil, 21–22 September 2006; pp. 95–104.
  5. Signoret, J.P.; Leroy, A. Dependent and Common Cause Failures; Springer Series in Reliability Engineering; Springer: Berlin/Heidelberg, Germany, 2021; pp. 103–120.
  6. Ali, S.W.; Ahmed, Q.A.; Shafi, I. Process to enhance the quality of software requirement specification document. In Proceedings of the International Conference on Engineering and Emerging Technologies, Lahore, Pakistan, 22–23 February 2018.
  7. Li, B.; Diao, X.; Gao, W.; Smidts, C. A Requirements Inspection Method Based on Scenarios Generated by Model Mutation and the Experimental Validation. Empir. Softw. Eng. 2021, 26, 108.
  8. Martin, J.; Tsai, W.T. N-Fold Inspection: A Requirements Analysis Technique. Commun. ACM 1990, 33, 225–232.
  9. Kantorowitz, E.; Guttman, A.; Arzi, L. The performance of the N-fold requirement inspection method. Requir. Eng. 1997, 2, 152–164.
  10. Vulpe, A.; Carausu, A. Dependent failure and CCF analysis of NPP systems with diversity defense factors. In Proceedings of the Transactions of the 14th International Conference on Structural Mechanics in Reactor Technology, Lyon, France, 17–22 August 1997.
  11. Huang, F.; Liu, B.; Song, Y.; Keyal, S. The links between human error diversity and software diversity: Implications for fault diversity seeking. Sci. Comput. Program. 2014, 89, 350–373.
  12. Li, B.; Smidts, C. A Zone-Based Model for Analysis of Dependent Failures in Requirements Inspection. IEEE Trans. Softw. Eng. 2023, 49, 3581–3598.
  13. Staron, M.; Kuzniarz, L.; Thurn, C. An empirical assessment of using stereotypes to improve reading techniques in software inspections. In Proceedings of the International Conference on Software Engineering, St. Louis, MO, USA, 15–21 May 2005; pp. 63–69.
  14. Lanubile, F.; Visaggio, G. Evaluating Defect Detection Techniques for Software Requirements Inspections; International Software Engineering Research Network: Bari, Italy, 2000.
  15. Fleming, K.; Mosleh, A. Classification and Analysis of Reactor Operating Experience Involving Dependent Events; Electric Power Research Institute: Palo Alto, CA, USA, 1985; pp. 1–24.
  16. Fleming, K. A Reliability Model for Common Cause Failures in Redundant Safety Systems; Technical Report No. GA-A-13284; General Atomics: San Diego, CA, USA, 1974.
  17. Mosleh, A.; Siu, N. A reliability model for common mode failure in redundant safety systems. In Proceedings of the Ninth International Conference on Structural Mechanics in Reactor Technology, Lausanne, Switzerland, 17–21 August 1987.
  18. Atwood, C. Common Cause Fault Rates for Pumps; NUREG/CR-2098; US Nuclear Regulatory Commission: Washington, DC, USA, 1983.
  19. ISO/IEC/IEEE 29148:2018; Systems and Software Engineering-Life Cycle Processes: Requirements Engineering. IEEE: New York, NY, USA, 2018.
  20. Li, X.; Mutha, C.; Smidts, C.S. An automated software reliability prediction system for safety critical software. Empir. Softw. Eng. 2016, 21, 2413–2455.
  21. Li, X.; Gupta, J. ARPS: An Automated Reliability Prediction System Tool for Safety Critical Software; PSA: Quezon City, Philippines, 2013; pp. 22–27.
  22. Li, B.; Smidts, C.S. Extension of Mutation Testing for the Requirements and Design Faults. In Proceedings of the 2017 NPIC-HMIT, Pittsburgh, PA, USA, 24–28 September 2017.
  23. Lanubile, F.; Visaggio, G. Assessing defect detection methods for software requirements inspections through external replication; Technical Report ISERN9601; International Software Engineering Research Network: Bari, Italy, 1996; p. 17.
  24. Votta, L.G. Does every inspection need a meeting? In Proceedings of the Symposium on the Foundations of Software Engineering, Los Angeles, CA, USA, 7–10 December 1993; pp. 107–114.
  25. Goswami, A.; Walia, G. An empirical study of the effect of learning styles on the faults found during the software requirements inspection. In Proceedings of the 24th International Symposium on Software Reliability Engineering, Pasadena, CA, USA, 4–7 November 2013; pp. 330–339.
  26. McHugh, M.L. Interrater reliability: The kappa statistic. Biochem. Med. 2012, 22, 276–282.
  27. Sullivan, G.M.; Feinn, R. Using Effect Size—Or Why the P Value Is Not Enough. J. Grad. Med. Educ. 2012, 4, 279–282.
  28. Sawilowsky, S.S. New Effect Size Rules of Thumb. J. Mod. Appl. Stat. Methods 2009, 8, 597–599.
  29. Zubair, M.; Ishag, A. Sensitivity analysis of APR-1400's reactor protection system by using RiskSpectrum PSA. Nucl. Eng. Des. 2018, 339, 225–234.
Figure 1. Perception Zones.
Figure 2. Checklist used in CBR.
Figure 3. Defect recording sheet.
Figure 4. Average performance of the inspectors.
Figure 5. Mean of Q_m of the BU_C1, BU_C2, and BD_C inspection teams.
Figure 6. Mean of Q_m of the BU_R1, BU_R2, and BD_R inspection teams.
Figure 7. Probability of team failures for SRSs of different sizes.
Figure 8. Percentage of defects of each type detected by an inspector.
Table 1. Comparison of CBR and RIMSM.
| Method | CBR | RIMSM |
| Instrumentation | Checklist | Tool (RITSM) |
| Defect coverage | Defects defined in the checklist | Defects considered in model mutation |
| Inspection item | SRS document | Results of execution of the SRS model |
| Defect detection activity | Answer questions in the checklist | Examine the system behaviors and outputs in different scenarios |
Table 2. Data for parameter estimation in the Z-model.
| Data | Definition |
| m | Number of Inspectors in the Team |
| N_D | Total Number of Detection Activities |
| n_d^i | Number of Defects Detected in the Perception Zone Z_d^i |
| n_a^I | Number of Independent Failures by Inspector I_a in N_D Detection Activities |
| n_a^D | Number of Dependent Failures by Inspector I_a in N_D Detection Activities |
Table 3. SRS documents.
| SRS Label | SRS Topic | Functions | Pages | Defects |
| S1 | A water level control system | 4 | 3 | 5 |
| S2 | A reaction chamber control system in a chemical plant | 5 | 3 | 4 |
| S3 | An automated car assembly system | 5 | 3 | 4 |
| S4 | A fly safety system | 5 | 3 | 3 |
| S5 | A valve control system | 5 | 3 | 5 |
| S6 | A vehicle speed monitor system | 5 | 3 | 4 |
| S7 | A post-collision event control system | 5 | 3 | 6 |
| S8 | An automobile cruise control and monitoring system | 5 | 3 | 5 |
| S9 | A digital-based small reactor protection system | 10 | 6 | 9 |
| S10 | An elevator control system | 11 | 6 | 11 |
| S11 | An integrated vehicle-based safety system | 21 | 11 | 55 |
| S12 | An embedded control software for smart sensor | 22 | 12 | 22 |
Table 4. Major of the Subjects.
| Major | Number of Subjects |
| Computer Science & Engineering (CSE) | 11 |
| Electrical Engineering (EE) | 10 |
| Mechanical Engineering (ME) | 2 |
Table 5. Experiment schedule.
| Day | Session | Content |
| Day 1 | training 1 | Introduced what requirements engineering is |
| Day 1 | training 2 | Introduced the CBR method |
| Day 1 | practice 1 | Practiced the CBR method using a small-size SRS (S1) |
| Day 1 | training 3 | Introduced the RIMSM method |
| Day 1 | practice 2 | Practiced the RIMSM method using a small-size SRS (S2) |
| Day 2 | training 4 | Reviewed the CBR method |
| Day 2 | practice 3 | Practiced the CBR method using a small-size SRS (S3) |
| Day 2 | training 5 | Reviewed the RIMSM method |
| Day 2 | practice 4 | Practiced the RIMSM method using a small-size SRS (S4) |
| Day 2 | practice 5 | Practiced the CBR method using a small-size SRS (S5) |
| Day 2 | practice 6 | Practiced the RIMSM method using a small-size SRS (S6) |
| Day 3 | testing 1 | Inspected a small-size SRS (S7) |
| Day 3 | testing 2 | Inspected a small-size SRS (S8) |
| Day 3 | testing 3 | Inspected a medium-size SRS (S9) |
| Day 3 | testing 4 | Inspected a medium-size SRS (S10) |
| Day 3 | testing 5 | Inspected a large-size SRS (S11) |
| Day 3 | testing 6 | Inspected a large-size SRS (S12) |
Table 6. Design of testing sessions (cells give the inspection method used by each group).
| Session | SRS | Group 1: CSE (6) | Group 2: EE (5) | Group 3: ME (1) | Group 4: CSE (5) | Group 5: EE (5) | Group 6: ME (1) |
| testing session 1 | S7 (small size) | RIMSM | RIMSM | RIMSM | CBR | CBR | CBR |
| testing session 2 | S8 (small size) | CBR | CBR | CBR | RIMSM | RIMSM | RIMSM |
| testing session 3 | S9 (medium size) | RIMSM | RIMSM | RIMSM | CBR | CBR | CBR |
| testing session 4 | S10 (medium size) | CBR | CBR | CBR | RIMSM | RIMSM | RIMSM |
| testing session 5 | S11 (large size) | RIMSM | RIMSM | RIMSM | CBR | CBR | CBR |
| testing session 6 | S12 (large size) | CBR | CBR | CBR | RIMSM | RIMSM | RIMSM |
Table 7. Stability statistics.
| | RIMSM | CBR |
| Number of data points | 23 | 23 |
| Significance level | 0.05 | 0.05 |
| p-value (two-tailed) | 0.98 | 0.78 |
| Statistical power | 5.9% | 5.0% |
Table 8. Creation of background-uniform teams for small-size SRSs.
| Label | Method | Major | Team Creation | Teams of Size m = 2 |
| BU_R1 | RIMSM | CSE | Select m inspectors from Group 1 and combine their results in testing session 1; select m inspectors from Group 4 and combine their results in testing session 2 | 25 |
| BU_R2 | RIMSM | EE | Select m inspectors from Group 2 and combine their results in testing session 1; select m inspectors from Group 5 and combine their results in testing session 2 | 20 |
| BU_C1 | CBR | CSE | Select m inspectors from Group 1 and combine their results in testing session 2; select m inspectors from Group 4 and combine their results in testing session 1 | 25 |
| BU_C2 | CBR | EE | Select m inspectors from Group 2 and combine their results in testing session 2; select m inspectors from Group 5 and combine their results in testing session 1 | 20 |
Table 9. Creation of background-diverse teams for small-size SRSs.
| Label | Method | Major | Team Creation | Teams of Size m = 2 |
| BD_R | RIMSM | Both (CSE, EE) | Select m/2 inspectors from Group 1 and m/2 inspectors from Group 2, and combine their results in testing session 1; select m/2 inspectors from Group 4 and m/2 inspectors from Group 5, and combine their results in testing session 2 | 55 |
| BD_C | CBR | Both (CSE, EE) | Select m/2 inspectors from Group 1 and m/2 inspectors from Group 2, and combine their results in testing session 2; select m/2 inspectors from Group 4 and m/2 inspectors from Group 5, and combine their results in testing session 1 | 55 |
Table 10. Summary of the performance statistics for the BU_C1, BU_C2, and BD_C teams of size m = 2 in inspecting the small-size SRSs.
| Metric | BU_C1 (mean / std) | BU_C2 (mean / std) | BD_C (mean / std) | BU_C1 vs. BD_C (p / power / eff. size) | BU_C2 vs. BD_C (p / power / eff. size) |
| Q_m | 0.16 / 0.17 | 0.13 / 0.15 | 0.15 / 0.17 | 0.44 / 0.05 / 0.04 | 0.26 / 0.09 / 0.16 |
| Q_t | 0.39 / 0.22 | 0.32 / 0.16 | 0.35 / 0.21 | 0.23 / 0.12 / 0.19 | 0.23 / 0.10 / 0.18 |
| Q_t^D | 0.34 / 0.21 | 0.28 / 0.18 | 0.31 / 0.21 | 0.28 / 0.09 / 0.14 | 0.29 / 0.08 / 0.14 |
| Q_t^I | 0.05 / 0.06 | 0.04 / 0.05 | 0.04 / 0.06 | 0.25 / 0.10 / 0.16 | 0.28 / 0.08 / 0.14 |
| β_1 | 0.66 / 0.36 | 0.62 / 0.43 | 0.58 / 0.40 | 0.17 / 0.15 / 0.23 | 0.33 / 0.07 / 0.12 |
| β_2 | 0.22 / 0.27 | 0.23 / 0.35 | 0.26 / 0.33 | 0.26 / 0.09 / 0.14 | 0.35 / 0.07 / 0.10 |
| Q_{1:2} | 0.23 / 0.13 | 0.18 / 0.12 | 0.18 / 0.13 | 0.07 / 0.31 / 0.36 | 0.42 / 0.05 / 0.05 |
| Q_{2:2} | 0.11 / 0.15 | 0.11 / 0.16 | 0.13 / 0.17 | 0.33 / 0.07 / 0.10 | 0.32 / 0.07 / 0.12 |
Table 11. Summary of the performance statistics for the BU_C1, BU_C2, and BD_C teams of size m = 2 in inspecting the medium-size SRSs.
| Metric | BU_C1 (mean / std) | BU_C2 (mean / std) | BD_C (mean / std) | BU_C1 vs. BD_C (p / power / eff. size) | BU_C2 vs. BD_C (p / power / eff. size) |
| Q_m | 0.46 / 0.19 | 0.44 / 0.28 | 0.48 / 0.24 | 0.37 / 0.06 / 0.08 | 0.29 / 0.09 / 0.16 |
| Q_t | 0.64 / 0.12 | 0.63 / 0.17 | 0.63 / 0.17 | 0.43 / 0.05 / 0.04 | 0.46 / 0.05 / 0.03 |
| Q_t^D | 0.61 / 0.11 | 0.61 / 0.19 | 0.61 / 0.18 | 0.49 / 0.05 / 0.00 | 0.49 / 0.05 / 0.01 |
| Q_t^I | 0.03 / 0.05 | 0.02 / 0.03 | 0.02 / 0.04 | 0.29 / 0.09 / 0.15 | 0.35 / 0.06 / 0.09 |
| β_1 | 0.35 / 0.28 | 0.43 / 0.40 | 0.34 / 0.33 | 0.46 / 0.05 / 0.03 | 0.19 / 0.16 / 0.25 |
| β_2 | 0.65 / 0.28 | 0.57 / 0.40 | 0.66 / 0.33 | 0.46 / 0.05 / 0.03 | 0.19 / 0.16 / 0.25 |
| Q_{1:2} | 0.19 / 0.15 | 0.20 / 0.14 | 0.16 / 0.12 | 0.16 / 0.19 / 0.27 | 0.17 / 0.18 / 0.27 |
| Q_{2:2} | 0.42 / 0.21 | 0.41 / 0.31 | 0.45 / 0.25 | 0.26 / 0.09 / 0.15 | 0.32 / 0.08 / 0.14 |
Table 12. Summary of the performance statistics for the BU_C1, BU_C2, and BD_C teams of size m = 2 in inspecting the large-size SRSs.
| Metric | BU_C1 (mean / std) | BU_C2 (mean / std) | BD_C (mean / std) | BU_C1 vs. BD_C (p / power / eff. size) | BU_C2 vs. BD_C (p / power / eff. size) |
| Q_m | 0.55 / 0.22 | 0.56 / 0.26 | 0.57 / 0.24 | 0.35 / 0.07 / 0.10 | 0.44 / 0.05 / 0.04 |
| Q_t | 0.70 / 0.15 | 0.71 / 0.18 | 0.71 / 0.18 | 0.44 / 0.05 / 0.04 | 0.44 / 0.05 / 0.04 |
| Q_t^D | 0.68 / 0.16 | 0.70 / 0.18 | 0.69 / 0.18 | 0.41 / 0.06 / 0.06 | 0.42 / 0.06 / 0.06 |
| Q_t^I | 0.02 / 0.01 | 0.01 / 0.01 | 0.01 / 0.01 | 0.20 / 0.13 / 0.23 | 0.15 / 0.15 / 0.25 |
| β_1 | 0.25 / 0.20 | 0.27 / 0.27 | 0.24 / 0.23 | 0.37 / 0.06 / 0.09 | 0.30 / 0.09 / 0.16 |
| β_2 | 0.75 / 0.20 | 0.73 / 0.27 | 0.76 / 0.23 | 0.37 / 0.06 / 0.09 | 0.30 / 0.09 / 0.16 |
| Q_{1:2} | 0.15 / 0.08 | 0.15 / 0.10 | 0.13 / 0.09 | 0.22 / 0.12 / 0.20 | 0.22 / 0.12 / 0.22 |
| Q_{2:2} | 0.53 / 0.22 | 0.55 / 0.26 | 0.56 / 0.24 | 0.32 / 0.07 / 0.12 | 0.44 / 0.05 / 0.04 |
Table 13. Testing of H_{0,background}.
| SRS Size | BU_C1 vs. BD_C | BU_C2 vs. BD_C | BU_R1 vs. BD_R | BU_R2 vs. BD_R |
| Small | Not rejected | Not rejected | Not rejected | Not rejected |
| Medium | Not rejected | Not rejected | Not rejected | Not rejected |
| Large | Not rejected | Not rejected | Not rejected | Not rejected |
Table 14. Creation of technique-uniform teams for small-size SRSs.
| Label | Method | Team Creation | Number of Teams (m = 2) | Number of Teams (m = 4) |
| TU_R | RIMSM | Select m inspectors from Groups 1–3 and combine their results in testing session 1; select m inspectors from Groups 4–6 and combine their results in testing session 2 | 121 | 825 |
| TU_C | CBR | Select m inspectors from Groups 1–3 and combine their results in testing session 2; select m inspectors from Groups 4–6 and combine their results in testing session 1 | 121 | 825 |
Table 15. Creation of technique-diverse teams for small-size SRSs.
| Label | Method | Team Creation | Number of Teams (m = 2) | Number of Teams (m = 4) |
| TD | RIMSM, CBR | Select m/2 inspectors from Groups 1–3 and m/2 inspectors from Groups 4–6, and combine their results in testing sessions 1 and 2 | 264 | 7260 |
Table 16. Summary of the performance statistics for the TU_R, TU_C, and TD teams of size m = 2 in the inspection of the small-size SRSs.
| Metric | TU_C (mean / std) | TU_R (mean / std) | TD (mean / std) | TU_C vs. TD (p / power / eff. size) | TU_R vs. TD (p / power / eff. size) |
| Q_m | 0.15 / 0.18 | 0.047 / 0.09 | 0.027 / 0.05 | 0.00 / 1.00 / 1.10 | 0.01 / 0.83 / 0.32 |
| Q_t | 0.33 / 0.20 | 0.17 / 0.08 | 0.25 / 0.14 | 0.00 / 1.00 / 0.54 | 0.00 / 1.00 / 0.60 |
| Q_t^D | 0.28 / 0.20 | 0.16 / 0.08 | 0.22 / 0.15 | 0.00 / 0.97 / 0.42 | 0.00 / 0.98 / 0.45 |
| Q_t^I | 0.05 / 0.05 | 0.02 / 0.03 | 0.03 / 0.05 | 0.00 / 0.89 / 0.35 | 0.00 / 0.91 / 0.37 |
| β_1 | 0.57 / 0.42 | 0.74 / 0.43 | 0.84 / 0.34 | 0.00 / 1.00 / 0.72 | 0.01 / 0.70 / 0.27 |
| β_2 | 0.24 / 0.34 | 0.19 / 0.38 | 0.03 / 0.11 | 0.00 / 1.00 / 1.00 | 0.00 / 1.00 / 0.68 |
| Q_{1:2} | 0.16 / 0.12 | 0.11 / 0.08 | 0.20 / 0.13 | 0.00 / 0.80 / 0.31 | 0.00 / 1.00 / 0.74 |
| Q_{2:2} | 0.12 / 0.18 | 0.04 / 0.09 | 0.01 / 0.05 | 0.00 / 1.00 / 1.00 | 0.00 / 0.99 / 0.49 |
Table 17. Summary of the performance statistics for the TU_R, TU_C, and TD teams of size m = 4 in the inspection of the small-size SRSs.
| Metric | TU_C (mean / std) | TU_R (mean / std) | TD (mean / std) | TU_C vs. TD (p / power / eff. size) | TU_R vs. TD (p / power / eff. size) |
| Q_m | 0.05 / 0.10 | 0.002 / 0.018 | 0.001 / 0.002 | 0.00 / 1.00 / 1.44 | 0.01 / 1.00 / 0.24 |
| Q_t | 0.34 / 0.15 | 0.17 / 0.06 | 0.25 / 0.10 | 0.00 / 1.00 / 0.94 | 0.00 / 1.00 / 0.75 |
| Q_t^D | 0.30 / 0.15 | 0.16 / 0.05 | 0.22 / 0.10 | 0.00 / 1.00 / 0.74 | 0.00 / 1.00 / 0.57 |
| Q_t^I | 0.05 / 0.03 | 0.02 / 0.02 | 0.03 / 0.03 | 0.00 / 1.00 / 0.55 | 0.00 / 1.00 / 0.46 |
| β_1 | 0.35 / 0.36 | 0.43 / 0.36 | 0.64 / 0.34 | 0.00 / 1.00 / 0.83 | 0.00 / 1.00 / 0.60 |
| β_2 | 0.30 / 0.31 | 0.44 / 0.36 | 0.33 / 0.33 | 0.01 / 0.59 / 0.08 | 0.00 / 1.00 / 0.34 |
| β_3 | 0.24 / 0.30 | 0.12 / 0.29 | 0.02 / 0.08 | 0.00 / 1.00 / 1.81 | 0.00 / 1.00 / 0.87 |
| β_4 | 0.07 / 0.18 | 0.01 / 0.08 | 0.00 / 0.00 | 0.00 / 1.00 / 1.16 | 0.01 / 1.00 / 0.28 |
| Q_{1:4} | 0.07 / 0.05 | 0.06 / 0.04 | 0.12 / 0.06 | 0.00 / 1.00 / 0.76 | 0.00 / 1.00 / 0.97 |
| Q_{2:4} | 0.03 / 0.03 | 0.02 / 0.02 | 0.03 / 0.04 | 0.38 / 0.06 / 0.01 | 0.00 / 1.00 / 0.19 |
| Q_{3:4} | 0.03 / 0.04 | 0.01 / 0.02 | 0.00 / 0.01 | 0.00 / 1.00 / 1.95 | 0.00 / 1.00 / 0.60 |
| Q_{4:4} | 0.04 / 0.10 | 0.00 / 0.02 | 0.00 / 0.00 | 0.00 / 1.00 / 1.14 | 0.00 / 1.00 / 0.29 |
Table 18. Summary of testing hypothesis H_{0,technique}.
| SRS Size | TD vs. TU_C, m = 2 | TD vs. TU_C, m = 4 | TD vs. TU_R, m = 2 | TD vs. TU_R, m = 4 |
| Small | Reject | Reject | Reject | Reject |
| Medium | Reject | Reject | Reject | Reject |
| Large | Reject | Reject | Reject | Reject |
Table 19. Ratio of Q_m.
| SRS Size | TD vs. TU_C, m = 2 | TD vs. TU_C, m = 4 | TD vs. TU_R, m = 2 | TD vs. TU_R, m = 4 |
| Small | 560% | 5899% | 178% | 284% |
| Medium | 228% | 733% | 119% | 336% |
| Large | 213% | 552% | 117% | 212% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
