
Improving Software Reliability in Nuclear Power Plants via Diversity in the Requirements Phase: An Experimental Study

Institute of Nuclear and New Energy Technology (INET), Tsinghua University, Beijing 100084, China
*
Authors to whom correspondence should be addressed.
Energies 2025, 18(18), 4794; https://doi.org/10.3390/en18184794
Submission received: 17 July 2025 / Revised: 14 August 2025 / Accepted: 5 September 2025 / Published: 9 September 2025
(This article belongs to the Section B4: Nuclear Energy)

Abstract

High software reliability is essential for safety-critical systems in nuclear power plants. To improve the quality of software following the requirements phase, requirements inspections are conducted to detect defects. Traditional approaches enhance inspection outcomes by employing more effective techniques or by increasing team redundancy. This study investigates an alternative approach: introducing diversity within the inspection team. Inspection technique diversity and inspector background diversity are considered in this paper. We hypothesize that an inspection team in which the inspectors use diverse inspection techniques or have diverse backgrounds will have a better performance in defect detection compared to an inspection team with no diversity. This is because diversity can reduce the number of dependent failures in an inspection team. In this study, a controlled experiment is designed and conducted to examine our hypothesis. In the experiment, research subjects with different backgrounds inspect a software requirements specification using different inspection techniques. The results are collected and analyzed statistically. The experiment shows that using diverse techniques in an inspection team can improve the performance of the inspection team; however, using inspectors with diverse backgrounds will not affect the performance of an inspection team significantly.

1. Introduction

Reliable software is fundamental to the safe and effective operation of nuclear power plants. The software development life cycle starts with the requirements phase, where a software requirements specification (SRS) is typically produced and reviewed by a verification and validation team. Defects missed during this review persist into later stages. According to internal NASA data, approximately 40% of the defects found in software products can be traced back to these overlooked requirements defects [1]. Therefore, the quality of the SRS has a significant impact on software reliability in nuclear applications [2].
Several techniques are available for inspecting requirements, including ad hoc reading, checklist-based reading (CBR), and perspective-based reading (PBR), among others. However, these traditional inspection techniques often suffer from low defect detection rates. An experiment conducted by A. Porter [3] demonstrated that the defect detection rates for ad hoc, CBR, and PBR were 32.5%, 36.5%, and 51.5%, respectively. In another experiment by L. He [4], the PBR technique showed a detection rate of 37%. As a result, a considerable number of defects remain in the SRS as development progresses. To address this, two primary strategies are commonly adopted, as described below [5].
The first strategy is to enhance the effectiveness of the inspection. A. Alshazly [2] proposed a combined reading technique that divides the SRS into sections, by purpose, with each inspector using a tailored checklist. A case study showed that this significantly improves inspector performance. Ali [6] proposed standardizing the requirements generation process to produce SRSs that can be third-party inspected. A Total Quality Score was used to quantify and improve the quality of the SRS generated. B. Li [7] proposed the RIMSM method, which involves constructing an SRS model and generating inspection scenarios through model mutation. The system’s behavior under each scenario is analyzed to identify defects. An experiment demonstrated that RIMSM significantly improved defect detection effectiveness, increasing detection rates over CBR by 18.9%, 60.8%, and 75.8% for small, medium, and large SRSs, respectively.
Another strategy is to increase the number of inspectors. Techniques such as N-fold inspection replicate inspection activities across multiple teams to capture a broader set of defects [8]. E. Kantorowitz [9] demonstrated that, while one team of three inspectors detected 35% of defects, nine independent teams detected up to 78%.
In summary, the first strategy aims to increase the defect detection rate of an individual inspector, while the second strategy adds redundancy to the requirements inspection process. The limitation of the first strategy is that no existing technique can guarantee an individual detection rate high enough that the remaining defects can be ignored.
The second approach is limited by the occurrence of correlated failures among redundant inspectors. This issue was highlighted in E. Kantorowitz’s study on N-fold inspection. Theoretically, if nine teams worked independently and each had a 35% chance of identifying a defect, the overall detection rate should reach 98%, calculated as $1 - (1 - 35\%)^9 \approx 98\%$. In practice, the experiment yielded only a 78% detection rate. The shortfall arose because the participants, i.e., senior undergraduates from the same computer science program, possessed nearly identical training and employed the same inspection method. As a result, a lack of diversity led to dependent failures across teams, greatly reducing defect detection efficiency.
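As a quick check of the arithmetic, the following Python snippet reproduces the theoretical detection rate implied by the independence assumption, using the figures reported in [9].

```python
# Theoretical detection rate of 9 teams, assuming each team independently
# detects a given defect with probability 0.35 (figures from [9]).
p_team, n_teams = 0.35, 9
print(f"{1 - (1 - p_team) ** n_teams:.1%}")  # ~97.9%, versus the 78% observed in practice
```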
In redundant systems, a common solution to mitigate dependent failures is the introduction of diversity [10]. However, in requirements inspection, the effectiveness of introducing diversity has not been systematically studied. In this paper, the effectiveness of introducing diversity in requirements inspection is studied with an experiment.
Based on findings from [11], dependent failures in software development are correlated with the knowledge possessed by the subjects in a task and the rules/procedures they follow in that task. We therefore assume that dependencies in requirements inspection originate mainly from common inspection techniques (i.e., the procedure followed) and shared background (i.e., the knowledge possessed). We hypothesize that teams with diverse techniques or backgrounds perform better in defect detection than homogeneous teams. To verify our hypothesis, an experiment was designed and conducted with undergraduate students from different majors using two techniques: CBR and RIMSM. We compared teams with technique or background diversity against teams in which all members used the same technique or pursued the same major.
This paper is organized as follows: Section 2 introduces the background knowledge required in this paper; in Section 3, an experiment is designed to study the effectiveness of introducing diversity in an inspection team; Section 4 analyzes the results of the experiment; Section 5 concludes the findings of this paper.

2. Background

This section introduces the background knowledge required in this paper.
In requirements inspection, stakeholder requirements are used as the “oracle” to examine the correctness of an SRS. A defect in the SRS is defined as the inconsistency between the SRS and the stakeholder requirements. Defects are introduced in the development of an SRS and are detected in the inspection process.
In this study, a failure denotes the inability of an inspector or an inspection team to identify a defect in the SRS. Such failures in requirements inspection can be divided into two categories, as defined in [12] from the standpoint of human error: independent failures and dependent failures. The failure rate of an inspector represents the likelihood that a defect remains undetected by that inspector. Because this rate is typically high, our analysis adopts the zone-based model (Z-model), which was designed to address dependent failures in contexts characterized by high failure probabilities [12].
This section first outlines the definitions of independent and dependent failures in requirements inspection, then introduces the inspection techniques applied in the experiment, and, finally, it gives a brief overview of the Z-model used to analyze the experimental results.

2.1. Dependent and Independent Failures in Requirements Inspection

Ref. [12] characterizes requirements inspection as a cognitive process in which inspectors assess SRS elements for potential defects. Failures in requirements inspection are caused by human errors, which can be classified into errors due to insufficient knowledge and errors due to random events in the cognitive process.
The perception domain of an inspector refers to the body of knowledge that can be accurately understood and applied to identify defects in an SRS. A defect located within this domain can be detected, whereas one outside it will be overlooked. The perception domain is shaped by the inspector’s system knowledge, expertise in requirements engineering, and access to supporting external information. When a defect falls outside this domain, the resulting human error—caused by insufficient knowledge—leads to detection failures.
Using Bayesian inference, Ref. [12] distinguished dependent failures from independent ones using this concept. Dependent failures arise when knowledge limitations prevent inspectors from recognizing a defect outside their perception domain. Independent failures, by contrast, result from random variations in the cognitive process, even when the defect is within the perception domain. Thus, the classification of failures depends on whether the defect lies inside or outside the inspector’s perception domain.

2.2. Inspection Methods

This section introduces the two inspection methods used in our experiment and discusses the reasons why the two methods were selected.
The checklist-based reading (CBR) technique supplies inspectors with a checklist, typically formulated as questions or statements, to guide the search for specific types of defects [13]. During inspection, the document is reviewed while the inspector answers a series of yes/no questions aimed at identifying potential issues [14]. However, CBR is regarded as a nonsystematic approach [4], since it offers no explicit guidance on how the inspection process itself should be conducted.
The RIMSM method was proposed by B. Li et al. [7] to improve performance in detecting defects related to system functions. To apply the RIMSM method, an inspector first constructs a high-level extended finite state machine (HLEFSM) model of the SRS using the RITSM tool. The tool automatically identifies a set of scenarios under which the SRS model shall be executed, based on an extended mutation testing technique for the requirements phase. It then executes the SRS model for the identified scenarios automatically. The execution results are stored in an output file that records the execution of function definitions, variable definitions, and function logic. By examining this file, defects in the SRS model can be identified. Once a defect is detected in the output file, the inspector locates it in the SRS document. Locating the defect is straightforward, since the SRS model is constructed directly from the SRS.
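To illustrate the general idea of executing a state-machine model of an SRS under a scenario, the following Python sketch shows a toy extended finite state machine and the trace an inspector could review. It is a conceptual illustration only; the SRS function, states, and variables are hypothetical, and the sketch is not the RITSM tool or the HLEFSM notation used in [7].

```python
# Conceptual sketch only (not the RITSM tool): a tiny extended-finite-state-machine
# model of a hypothetical SRS trip function, executed under one scenario.
from dataclasses import dataclass, field

@dataclass
class TripFunctionModel:
    state: str = "Standby"
    variables: dict = field(default_factory=lambda: {"temperature": 0})
    trace: list = field(default_factory=list)

    def step(self, event, value=None):
        # Transitions transcribed from the (hypothetical) SRS function logic.
        if event == "temp_reading":
            self.variables["temperature"] = value
            if self.state == "Standby" and value > 350:
                self.state = "TripPending"
        elif event == "confirm" and self.state == "TripPending":
            self.state = "Tripped"
        self.trace.append((event, value, self.state, dict(self.variables)))

model = TripFunctionModel()
for event, value in [("temp_reading", 360), ("confirm", None)]:  # one inspection scenario
    model.step(event, value)
for record in model.trace:  # the inspector reviews this execution record against the requirements
    print(record)
```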
Table 1 displays the differences between the two techniques in terms of instrumentation, coverage of the defects, inspection item, and defect detection activity. Since the focus of this paper is to discover the effectiveness of introducing diversity in requirements inspection, the inspection techniques used in our experiment should have as little in common as possible. The inspection methods CBR and RIMSM are selected to minimize the dependencies between inspectors using different inspection methods.

2.3. Z-Model

To evaluate an inspection team’s performance, a quantification model is necessary to estimate the probability of dependent and independent failures within that team. Traditional models, such as the basic parameter model [15], the beta factor model [16], the alpha factor model [17], and the multiple Greek letter model [18], assume redundant components are identical. A single component failure within an event is treated as independent, whereas multiple component failures are treated as dependent. These models primarily address high-reliability systems, where the probability of component failure is extremely low. Consequently, the likelihood of simultaneous multiple independent failures, or the coexistence of independent and dependent failures within the same event, is considered negligible.
In contrast, requirements inspection involves inspectors with comparatively high and heterogeneous failure probabilities, rendering traditional models inapplicable. In [12], the Z-model is proposed to analyze dependent failures in a high failure probability context, i.e., requirements inspection.
The Z-model addresses this limitation by explicitly considering the following: (1) the potential for high individual failure probabilities, (2) variability in failure probabilities among inspectors, (3) the occurrence of both dependent and independent failures within an inspection team. The Z-model is introduced briefly below.
For an inspection team consisting of $m$ inspectors, i.e., $I_1, I_2, \ldots, I_m$, the following concepts are introduced:
  • Detection activity: A detection activity refers to the actions undertaken by the inspection team in an attempt to identify a defect.
  • Multiplicity ($k$): The multiplicity of a detection activity is the number of failures in the detection activity, denoted as $k \in [0, m]$.
  • Dependency ($d$): The dependency of a detection activity is the number of dependent failures it contains, denoted by $d \in [0, m]$.
  • Perception zone: The perception domain of an inspector encompasses the knowledge that the inspector can accurately access or apply to identify defects in an SRS. The universal perception domain $U$, representing all knowledge required to detect every defect in the SRS, is partitioned into different zones according to the perception domains of the inspectors in a team. An example is illustrated in Figure 1. For two inspectors, i.e., $I_1$ and $I_2$, the universal perception domain $U$ is divided into four perception zones: $Z_0^1$, $Z_1^1$, $Z_2^1$, and $Z_1^2$. Assuming that detecting a defect requires the knowledge in a given perception zone, the dependency of a zone is defined as the number of inspectors who lack the necessary knowledge for that zone. For instance, the dependency of $Z_2^1$ is 2, since neither inspector possesses the knowledge needed to detect defects in that zone.
  • Dependency of a perception zone: The dependency of a perception zone is the number of inspectors lacking the knowledge within that zone. Perception zones can be categorized according to their dependencies. Let $Z_d^i$ denote the $i$th perception zone with dependency $d$. In this paper, a defect requiring the knowledge contained in $Z_d^i$ for detection is referred to as a defect in that zone. A minimal sketch of this zone partition is given after this list.
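To make the zone partition concrete, the following Python sketch derives the perception zones and their dependencies for a two-inspector team from the inspectors’ perception domains. The knowledge items and domains are purely illustrative assumptions, not data from the experiment.

```python
# Partition the universal perception domain U into perception zones for a
# two-inspector team and compute each zone's dependency (illustrative items only).
U = {"timing", "interlocks", "setpoints", "alarms"}   # knowledge needed for all defects in the SRS
domains = {"I1": {"timing", "interlocks"},            # perception domain of inspector I1
           "I2": {"interlocks", "setpoints"}}         # perception domain of inspector I2

zones = {}
for item in U:
    lacking = frozenset(a for a, dom in domains.items() if item not in dom)
    zones.setdefault(lacking, set()).add(item)

for lacking, items in zones.items():
    print(f"zone {sorted(items)}: dependency d = {len(lacking)} "
          f"(lacked by {sorted(lacking) if lacking else 'no inspector'})")
```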
In an inspection team, $Q_t$ represents the average probability of failure for an individual inspector. The Z-model decomposes $Q_t$ as shown in (1), where $Q_t^I$ is the probability of independent failures and $Q_t^D$ is the probability of dependent failures.
$Q_t = Q_t^D + Q_t^I$  (1)
The Z-model characterizes the average probability of failures by an inspector using a set of parameters, namely $z_d^i$ and $p_a^I$. Here, $z_d^i$ represents the probability that a defect lies in perception zone $Z_d^i$, and $p_a^I$ denotes the probability that inspector $I_a$ independently fails to detect a defect in her/his perception domain. $Q_t^D$ can be calculated using (2), where $n_d$ is the number of zones with dependency $d$.
$Q_t^D = \frac{1}{m} \sum_{d=0}^{m} \sum_{i=1}^{n_d} z_d^i \cdot d$  (2)
$Q_t^I$ can be calculated using (3), where $p_{a_j^i}^I$ is the probability that inspector $I_{a_j^i}$, whose perception domain contains $Z_d^i$, fails independently.
$Q_t^I = \frac{1}{m} \sum_{d=0}^{m} \sum_{i=1}^{n_d} \sum_{j=1}^{m-d} z_d^i \cdot p_{a_j^i}^I$  (3)
The objective of the experiment is to estimate whether introducing diversity to requirements inspection can increase software reliability in the requirements phase. The following variables are derived and used in this paper to analyze the dependent failures in detail.
Based on the decomposition of $Q_t$, $Q_{d:m}$ is defined in (4) to represent the probability that $d$ inspectors fail dependently in a detection activity within an inspection team with $m$ members.
$Q_{d:m} = \frac{1}{n_d} \cdot \sum_{i=1}^{n_d} z_d^i$  (4)
Given a dependent failure by an inspector, the conditional probability that the failure is involved in a detection activity with dependency $d$ is denoted as $\beta_d$ and given in (5).
$\beta_d = \frac{d \cdot \sum_{i=1}^{n_d} z_d^i}{\sum_{d=0}^{m} \sum_{i=1}^{n_d} d \cdot z_d^i}$  (5)
To evaluate the probability of failure by an inspection team, $Q_m$ is defined as the probability that all inspectors in a detection activity fail to detect a defect. $Q_m$ can be obtained using (6).
$Q_m = \sum_{d=0}^{m} \sum_{i=1}^{n_d} z_d^i \cdot \prod_{j=1}^{m-d} p_{a_j^i}^I$  (6)
The data collected in requirements inspection for estimating the parameters of the Z-model are summarized in Table 2.
The key parameters of the Z-model, $z_d^i$ and $p_a^I$, can be estimated using (7) and (8) through the maximum likelihood estimator (MLE).
$z_d^i = \frac{n_d^i}{N_D}$  (7)
$p_a^I = \frac{n_a^I}{N_D - n_a^D}$  (8)
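As a concrete illustration of Equations (1)–(6), the Python sketch below evaluates the Z-model quantities for a hypothetical two-inspector team. The zone probabilities and independent-failure probabilities are made-up numbers and the bookkeeping is simplified; it is a minimal sketch under the definitions above, not the authors’ implementation.

```python
from math import prod
from collections import defaultdict

# Each perception zone Z_d^i is described by its dependency d, the probability z
# that a defect lies in it (Eq. (7) would estimate this from counts), and the
# independent-failure probabilities p_a^I of the m - d inspectors covering it.
m = 2  # team size (illustrative)
zones = [
    {"d": 0, "z": 0.55, "p_cov": [0.10, 0.15]},  # covered by both inspectors
    {"d": 1, "z": 0.20, "p_cov": [0.10]},        # covered only by inspector 1
    {"d": 1, "z": 0.15, "p_cov": [0.15]},        # covered only by inspector 2
    {"d": 2, "z": 0.10, "p_cov": []},            # covered by neither inspector
]

Q_t_D = sum(z["z"] * z["d"] for z in zones) / m                       # Eq. (2)
Q_t_I = sum(z["z"] * p for z in zones for p in z["p_cov"]) / m        # Eq. (3)
Q_t = Q_t_D + Q_t_I                                                   # Eq. (1)
Q_m = sum(z["z"] * prod(z["p_cov"]) for z in zones)                   # Eq. (6)

z_by_d = defaultdict(list)
for z in zones:
    z_by_d[z["d"]].append(z["z"])
denom = sum(z["d"] * z["z"] for z in zones)
Q_dm = {d: sum(zs) / len(zs) for d, zs in z_by_d.items()}             # Eq. (4)
beta = {d: d * sum(zs) / denom for d, zs in z_by_d.items() if d > 0}  # Eq. (5)

print(f"Q_t={Q_t:.3f}  Q_t^D={Q_t_D:.3f}  Q_t^I={Q_t_I:.3f}  Q_m={Q_m:.3f}")
print("Q_d:m =", {d: round(v, 3) for d, v in Q_dm.items()})
print("beta_d =", {d: round(v, 3) for d, v in beta.items()})
```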

3. Experiment Design

The goal of this paper is to study the effectiveness of increasing diversity in requirements inspection. In this paper, two types of diversity are considered for an inspection team: the diverse backgrounds of the inspectors and diverse techniques used for defect detection.
In this section, the following definitions are used:
  • A background-diverse team is an inspection team in which the members are pursuing different majors.
  • A background-uniform team is an inspection team in which all members share the same major.
  • A technique-diverse team is an inspection team in which different inspection techniques are used.
  • A technique-uniform team is an inspection team in which all members use the same inspection technique.
This experiment was designed to compare the background-diverse teams and technique-diverse teams to the background-uniform teams and technique-uniform teams, respectively.

3.1. Research Questions

The following aims are the basis for this experiment:
  • To determine whether the performance of a background-diverse team is better than that of a background-uniform team.
  • To determine whether the performance of a technique-diverse team is better than that of a technique-uniform team.

3.2. Hypotheses

The null hypotheses and alternative hypotheses of the experiment are given below:
  • $H_{0,\mathrm{background}}$: there is no difference between the performance of the background-diverse teams and the background-uniform teams.
  • $H_{A,\mathrm{background}}$: the performance of the background-diverse teams is better than that of the background-uniform teams.
  • $H_{0,\mathrm{technique}}$: there is no difference between the performance of the technique-diverse teams and the technique-uniform teams.
  • $H_{A,\mathrm{technique}}$: the performance of the technique-diverse teams is better than that of the technique-uniform teams.
The t-test is used to test the null hypotheses in this experiment based on the central limit theorem.

3.3. Research Variables

This experiment controlled the following independent variables:
  • The inspection techniques used by the subjects in an inspection team (RIMSM or CBR).
  • The backgrounds (i.e., major) of the subjects in an inspection team.
  • The number of subjects in an inspection team.
  • The SRS documents under inspection.
In this study, the treatment variables consist of the inspection methods and the inspectors’ backgrounds, while the remaining variables help to mitigate potential threats to the experiment’s internal validity.
The dependent variables in the experiment are described as follows. An inspection team fails to detect a defect only when every member misses it. Therefore, $Q_m$ is employed to quantify team performance, where $Q_m$ is the probability of a detection activity with multiplicity $k = m$. The other parameters of the Z-model are used to analyze the failure rate of an inspector in the inspection team, including $Q_t$, $Q_t^D$, $\beta_d$, $Q_t^I$, and $Q_{d:m}$.

3.4. Experiment Instrumentations

3.4.1. SRS Documents

An SRS comprises both functional and non-functional requirements. This experiment focuses on functional requirements to ensure that the software’s intended functionality is accurately represented. Twelve SRS documents covering different applications and sizes, all taken from the study by B. Li [7], were reused in this experiment. These documents adhere to the IEEE 29148:2018 standard [19] and are structured into three sections: Introduction, Overview, and Specific Functions.
Table 3 summarizes each SRS, including its topic, number of pages, number of functions, and number of defects. The defects considered are indigenous, meaning they occur naturally rather than by being intentionally seeded. The total number of defects in each SRS was estimated using the capture–recapture method described in B. Li’s study [7].
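One common form of the capture–recapture method is the Lincoln–Petersen estimator sketched below; the counts are hypothetical, and the exact variant used in B. Li’s study [7] may differ.

```python
# Lincoln-Petersen capture-recapture estimate of the total number of defects in an SRS.
# n1, n2: defects found by two independent inspections; m12: defects found by both.
n1, n2, m12 = 14, 11, 6   # hypothetical counts
N_hat = n1 * n2 / m12
print(f"Estimated total defects: {N_hat:.1f}")  # about 26 defects
```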
SRSs S1–S8 are classified as “Small size,” S9 and S10 as “Medium size,” and S11 and S12 as “Large size,” allowing the experiment to evaluate the scalability of the results. Additionally, a stakeholder requirements specification (StRS) was developed for each SRS in B. Li’s study. Each StRS outlines the system capabilities required by stakeholders within a defined context and serves as an oracle to guide inspectors in defect detection.

3.4.2. Checklist

A checklist is used in the CBR method. Traditional checklists typically address both functional and non-functional requirements. Since the SRS documents in this experiment include only functional requirements, all checkpoints related to non-functional aspects were removed. The checkpoints related to functional requirements are detailed using a defect taxonomy in the requirements phase, as discussed in B. Li and X. Li’s work [20,21,22]. Part of the checklist used is summarized in Figure 2.

3.4.3. RITSM Tool

B. Li developed the RITSM tool to facilitate the application of the RIMSM method [7]. With RITSM, inspectors can construct the HLEFSM model of an SRS, while the tool automatically generates the model’s mutants and the scenarios needed to detect them. The tool then executes the model under these scenarios and outputs the results to a file, which inspectors review to identify defects. Additional details about RITSM are provided in [7].

3.4.4. Defect Recording Sheet

Each inspector records detected defects using a defect recording sheet. Figure 3 gives an example of the defect recording sheet.

3.5. Research Subject Identification

During the requirements inspection phase, an SRS is reviewed by inspectors from an independent verification and validation (IV&V) team. Inspectors typically hold at least a bachelor’s degree in engineering or a related field and are expected to have (1) knowledge of requirements engineering, and (2) proficiency in programming with procedural languages.
For this experiment, subjects were selected from junior and senior undergraduate students across various majors. Each participant was required to be familiar with at least one programming language, while concepts related to requirements engineering were covered during training sessions. In total, 23 students from Ohio State University were recruited for the study. The majors of the subjects and the number of subjects in each major are given in Table 4.

3.6. Experimental Procedure

The design of the experiment procedure is introduced in this section.
For a given inspection team, the inspection process includes two steps: detection and collection [23]. In the first step, every inspector detects defects in the SRS individually. In the second step, the defects detected by each inspector are collected in a collection meeting. A. Porter [3] and Votta [24] studied the effect of collection meetings, analyzing the “meeting gain” and “meeting loss” through experiments. A meeting gain occurs when a defect is detected for the first time at the collection meeting. A meeting loss occurs when a defect is first detected by an inspector but is not recorded during the collection meeting. They found that both meeting gain and meeting loss were negligible; the defects detected by an inspection team are therefore essentially the union of the defects detected by its individual members. Based on this fact, the subjects in this experiment were only asked to detect defects individually. To study the defects detected by an inspection team containing m specific inspectors, we only need to combine the defects detected individually by those m inspectors. This design provided us with the flexibility to “virtually” combine any inspectors into a team for analysis.
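Because meeting gains and losses are negligible, a virtual team’s result is simply the union of its members’ individual results. The sketch below illustrates this with hypothetical defect identifiers.

```python
# "Virtually" creating an inspection team: the team's detected defects are the union
# of the defects detected individually by its members (hypothetical defect IDs).
individual_results = {
    "inspector_A": {"D1", "D3", "D7"},
    "inspector_B": {"D2", "D3", "D8"},
}
team_defects = set().union(*individual_results.values())
print(sorted(team_defects))  # ['D1', 'D2', 'D3', 'D7', 'D8']
```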
This experiment had four stages: (1) collection of data for individual inspectors, (2) creation of data for inspection teams, (3) determination of dependent failures and independent failures, and (4) analysis of results. The details in each stage are introduced below.

3.6.1. Collection of Data for Individual Inspectors

In this stage, all subjects were trained and tested individually, since the first step of an inspection team’s requirements inspection is to let all inspectors detect defects individually.
The first stage of the experiment spanned 3 days, with all sessions conducted remotely via video communication. This stage comprised five training sessions, six practice sessions, and six testing sessions, as detailed in Table 5. Four investigators participated in the experiment, with their respective roles described below:
  • Investigator 1 designed the experiment and prepared all instruments and training materials. Investigator 1 also conducted the analysis of the experimental results.
  • Investigator 2 was unaware of the experiment’s hypotheses. To minimize research bias, Investigator 2 delivered the training session presentations and hosted all practice and testing sessions. At the end of each session, Investigator 2 collected the results recording sheets, removed identifiers and methods used by the subjects, and provided the anonymized data for analysis. The results were subsequently analyzed by Investigators 1 and 3.
  • Investigator 3 did not participate in the 3-day experiment. Investigators 1 and 3 independently analyzed the results to evaluate inter-rater reliability.
  • Investigator 4 supervised the entire experiment.
Each training or practice session lasted approximately 45 min, followed by a 15 min break. Testing sessions had no time limit. The training sessions introduced subjects to the fundamental concepts of software requirements engineering and provided instruction on using the two inspection techniques.
Practice sessions allowed subjects to become familiar with these methods. In each practice session, all subjects were assigned a small-size SRS, along with the corresponding StRS, and inspected the SRS using the specified technique, as indicated in Table 5. By the end of the practice sessions, subjects’ performance was expected to have stabilized.
In the testing sessions, the subjects were divided into six groups based on their majors. As shown in Table 6, Group 1 contained six subjects from CSE, and Group 4 consisted of the other five subjects from CSE. The 10 subjects from EE were evenly assigned to Group 2 and Group 5. Group 3 and Group 6 each had one subject majoring in ME. The inspection technique used by each group in each testing session is listed in Table 6. For example, Groups 1, 2, and 3 used RIMSM to inspect S7, a small-size SRS, in testing session 1. The subjects did not know that they were grouped.
The design presented in Table 6 enables the combination of results across different groups and sessions. For instance, the results from Group 1 in testing session 1 can be combined with those from Group 4 in testing session 2 to represent the inspection of a small-size SRS using the RIMSM method for CSE-major subjects. This approach aggregates data from all CSE subjects and both small-size SRSs (S7 and S8), increasing the number of data points for hypothesis testing and reducing the biases associated with using a single SRS or a single group.

3.6.2. Creation of Data for Inspection Teams

As discussed previously, the union of the defects detected by m inspectors individually can be regarded as the defects detected by an inspection team that consists of the m inspectors. Therefore, in this stage, the performance of an inspection team of size m was estimated by selecting m inspectors and combining the defects they detected individually. The process of estimation is referred to as “creating” an inspection team in this paper. Inspection teams that are created virtually were also used in previous studies such as [25].
In this paper, to test hypotheses $H_{0,\mathrm{background}}$ and $H_{0,\mathrm{technique}}$, we created the following inspection teams: (1) background-diverse teams in which half of the members major in CSE and the other members major in EE; (2) background-uniform teams in which all members have the same major (either CSE or EE); (3) technique-diverse teams in which half of the members use the CBR technique and the other members use the RIMSM technique; (4) technique-uniform teams in which all members use the same inspection technique (either CBR or RIMSM). We also considered different sizes of the inspection teams. In this experiment, inspection teams of size $m = 2$ and $m = 4$ are used to test the hypotheses. Details on the creation of the inspection teams of each type and each size are discussed in Section 4.3.1 and Section 4.4.1.

3.6.3. Identification of Dependent and Independent Failures

In the first stage of the experiment, we collected the defects detected by each inspector in the testing sessions. The failures of each inspector in each testing session can be determined accordingly.
However, we found that it was not feasible to discriminate dependent failures and independent failures by an inspector by simply asking the subjects to answer a questionnaire at the end of a testing session. This is because a subject cannot remember or realize the causes of all his/her failures after completing the inspection of an SRS. On the other hand, if we ask a subject to record the causes of his/her failures in the middle of the testing sessions, we need to provide the subject with a list of defects that are in the SRS. Uncontrolled biases will be introduced into the experiment. Therefore, an indirect method was used to determine the dependent failures and independent failures in this stage.
As defined in Section 2.1, failures can be categorized as dependent or independent. A failure is considered independent if an inspector misses a defect that lies within their perception domain, and dependent if the defect falls outside it. The key challenge is identifying whether a defect is within an inspector’s perception domain.
Following [12], the criteria below are used to determine whether a defect lies within an inspector’s perception domain, where $c$ is an integer:
  • Criterion 1: the inspector must detect at least c defects in the SRS.
  • Criterion 2: the inspector must detect at least c defects within the function containing the defect or within a closely related function.
  • Criterion 3: the inspector must detect at least c defects of the same type in previous testing or practice sessions.
The first criterion indicates that the inspector possesses basic knowledge of the application described in the SRS. The second criterion ensures that the inspector understands the function containing the defect, while the third criterion verifies the inspector’s ability to detect defects of the same type. If an inspector’s failure meets all three criteria, the defect is considered within the inspector’s perception domain, and the failure is classified as independent; otherwise, it is treated as a dependent failure. In this experiment, we set $c = 1$. Appendix A provides a sensitivity analysis of the impact of selecting different values of $c$.
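The three criteria translate directly into a simple rule; the Python sketch below classifies a missed defect as an independent or dependent failure for a given threshold c. The function name and count arguments are illustrative assumptions.

```python
# Classify a missed defect as an independent or dependent failure using the three
# criteria above (counts are hypothetical; c = 1 as in the experiment).
def classify_failure(n_in_srs, n_in_function, n_same_type_before, c=1):
    """Return 'independent' if the defect lies within the inspector's perception
    domain (all three criteria satisfied); otherwise return 'dependent'."""
    in_domain = (n_in_srs >= c                  # Criterion 1: defects detected in this SRS
                 and n_in_function >= c         # Criterion 2: defects in the same or a related function
                 and n_same_type_before >= c)   # Criterion 3: same defect type detected previously
    return "independent" if in_domain else "dependent"

print(classify_failure(n_in_srs=3, n_in_function=1, n_same_type_before=2))  # independent
print(classify_failure(n_in_srs=3, n_in_function=0, n_same_type_before=2))  # dependent
```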

3.6.4. Analysis of Results

In the last stage, the data collected and created were analyzed, and the two hypotheses, i.e., $H_{0,\mathrm{background}}$ and $H_{0,\mathrm{technique}}$, were tested. More details on this stage are given in the next section.

4. Experiment Results

This section presents the results of the data analysis. The two hypotheses, i.e., $H_{0,\mathrm{background}}$ and $H_{0,\mathrm{technique}}$, are tested by comparing the background-diverse teams and technique-diverse teams to the background-uniform teams and technique-uniform teams, respectively.
The data analyzed include the stability of the performance of the subjects in the testing sessions, the inter-rater reliability in data analysis, the comparison of background-diverse teams and background-uniform teams, the comparison of technique-diverse teams and technique-uniform teams, and the threats to validity of the experiment.

4.1. Stability of the Subjects’ Performance

In this section, we examine whether each inspector’s performance with the two inspection techniques remains stable across the testing sessions. Stable performance helps minimize the experimental biases associated with subject maturation.
During the six practice sessions and the first two testing sessions, each subject inspected eight small-size SRSs, with each inspection technique applied in four rounds. The total probability of failures by an inspector, i.e., $Q_t$, is used to evaluate the performance of a subject. The average performance of the subjects using the CBR and RIMSM techniques is displayed in Figure 4.
Figure 4 shows that the subjects’ performance improved over the first three rounds and stabilized during the last two rounds. To formally assess the stability of the subjects’ performance, we test the following null and alternative hypotheses:
  • $H_{0,\mathrm{stability}}$: there is no significant difference between the performance of the inspectors using the two inspection techniques in the third and fourth rounds.
  • $H_{A,\mathrm{stability}}$: there is a significant difference between the performance of the inspectors using the two inspection techniques in the third and fourth rounds.
A t-test was conducted to evaluate the null hypothesis at a 5% significance level. The results are summarized in Table 7. For both inspection techniques, the p-values were much greater than the significance level, indicating that the null hypothesis could not be rejected. Consequently, the null hypothesis is assumed to hold, suggesting that the performance of the subjects remained stable when using both techniques after the third round. Therefore, it can be concluded that the performance of the subjects had stabilized by the time they entered the testing sessions.

4.2. Inter-Rater Reliability

Given the results recording sheets collected in the testing sessions, the investigators of the experiment need to determine whether the defects identified by each inspector are valid. The entries in the results recording sheets were analyzed by Investigators 1 and 3 independently. The inter-rater reliability (IRR) describes the level of agreement between the investigators and the degree to which the data collected are reliable. Cohen’s kappa coefficient [26] is used to quantify the IRR in the experiment. The overall Cohen’s kappa coefficient obtained is 90.4%.
Based on [26], a Cohen’s kappa coefficient above 90% indicates an “almost perfect” level of agreement between raters, corresponding to 82–100% of the data being reliable. Accordingly, the IRR results in this experiment demonstrate that the collected data are highly reliable.
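Cohen’s kappa can be computed directly from the two raters’ verdicts on each recorded defect; the following sketch uses scikit-learn with illustrative verdicts (1 = valid defect, 0 = not a defect), not the actual experimental data.

```python
# Inter-rater reliability between two investigators on defect validity (illustrative data).
from sklearn.metrics import cohen_kappa_score

rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_3 = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
print(f"Cohen's kappa = {cohen_kappa_score(rater_1, rater_3):.2f}")
```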

4.3. Comparison of Background-Diverse Teams and Background-Uniform Teams

In this section, the creation of background-diverse teams and background-uniform teams is discussed first. Then, the created background-diverse teams and background-uniform teams are compared to test hypothesis $H_{0,\mathrm{background}}$.

4.3.1. Creation of Background-Diverse Teams and Background-Uniform Teams

Let us first consider the creation of background-uniform teams. Given a small-size SRS, e.g., S7, we can create a background-uniform team by selecting m inspectors from Group 1 (see Table 6) and combining their results in testing session 1. The created inspection team has the following features: all members major in CSE and all members use the RIMSM technique. In addition, the background-uniform teams that have the same features can also be created by selecting m members from Group 4 and combining their results in testing session 2. The first row of Table 8 summarizes the creation of the background-uniform teams in which all members major in CSE and use the RIMSM method. The second and third columns of the first row specify the features of the created background-uniform teams. In row 1, the background-uniform teams with the specified features are labeled as BU_R1. The methods to create the BU_R1 teams are given in the fourth column.
Given the small-size SRSs, four different types of background-uniform teams can be created, labeled BU_R1, BU_R2, BU_C1, and BU_C2, as displayed in Table 8. Since we can select any $m$ inspectors from a group to create an inspection team, the number of different teams that can be created from a group is given by the combination formula below, where $g$ is the number of inspectors in the group. The last column of Table 8 gives the total number of different inspection teams that can be created for each type of background-uniform team when $m = 2$.
$N_{\mathrm{teams}} = \binom{g}{m} = \frac{g!}{(g - m)!\, m!}$
It should be noted that Groups 3 and 6 are excluded when creating the background-uniform teams, since each contains only one subject. In addition, inspection teams of size $m = 4$ are not considered in this section: Groups 2, 4, and 5 each contain five subjects, so only five different teams of size $m = 4$ can be created from each of those groups, which does not provide enough data points for hypothesis testing.
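For example, the combination formula above gives the following team counts for the group sizes used in this experiment.

```python
from math import comb

# Number of distinct inspection teams of size m drawn from a group of g subjects.
print(comb(6, 2))  # 15 two-member teams from the six-subject Group 1
print(comb(5, 4))  # only 5 four-member teams from a five-subject group
```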
For the SRSs of a small size, a background-diverse team that contains m members can be created by selecting m/2 inspectors from Group 1 and m/2 inspectors from Group 2, and combining their results in testing session 1. The created inspection team has the following two features: (1) half of the members major in CSE and the other half major in EE, (2) all members use the RIMSM method. The background-diverse teams that have the same features can also be created by selecting m/2 inspectors from Group 4 and m/2 inspectors from Group 5, and combining their results in testing session 2. For the small-size SRSs, two types of background-diverse teams can be created, as displayed in Table 9. The background-diverse teams in which all members use the RIMSM technique are labeled as BD_R. The background-diverse teams, in which all members use the CBR technique, are labeled as BD_C.
The discussion above focuses on the small-size SRSs. Given the medium-size or large-size SRSs, the background-diverse teams and background-uniform teams can be created in the same manner using the data collected in testing sessions 3, 4, 5, and 6.
To test hypothesis $H_{0,\mathrm{background}}$, we can compare the inspection teams BD_R to BU_R1 and BU_R2, respectively, for SRSs of each size. We can also compare the inspection teams BD_C to BU_C1 and BU_C2, respectively, for SRSs of each size. It should be noted that the number of inspection teams differs across types. Therefore, Welch’s t-test, which allows unequal sample sizes and unequal variances, is used for hypothesis testing. The results are discussed in the following section.
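Welch’s t-test is available directly in SciPy; the sketch below compares hypothetical $Q_m$ samples for two team types and is not based on the actual experimental data.

```python
# Welch's t-test (unequal sample sizes and variances) on the team failure probability Q_m.
from scipy import stats

q_m_diverse = [0.18, 0.22, 0.15, 0.20, 0.25, 0.19]        # hypothetical Q_m of diverse teams
q_m_uniform = [0.21, 0.17, 0.24, 0.20, 0.23, 0.18, 0.22]  # hypothetical Q_m of uniform teams
t_stat, p_value = stats.ttest_ind(q_m_diverse, q_m_uniform, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```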

4.3.2. Results of Comparing Background-Diverse Teams and Background-Uniform Teams

In this section, the created inspection teams BU_C1, BU_C2, and BD_C are compared.
Table 10 gives a summary of the performance-related statistics of the BU_C1, BU_C2, and BD_C teams of size $m = 2$ in the inspection of the small-size SRSs (i.e., S7 and S8). The first row of Table 10 displays the results for $Q_m$ (i.e., the probability of a team failure). In the first six columns of row 1, the mean and standard deviation (std) of $Q_m$ for the BU_C1, BU_C2, and BD_C teams are given, respectively. Hypothesis $H_{0,\mathrm{background}}$ is tested in two situations: (1) comparing the average $Q_m$ of the BU_C1 teams to that of the BD_C teams, and (2) comparing the average $Q_m$ of the BU_C2 teams to that of the BD_C teams. The testing results in the first situation are provided in columns 7, 8, and 9, which show the p-value, the power, and the effect size of the hypothesis test. The last three columns display the hypothesis testing results in the second situation.
At a significance level of 0.05, the p-values of the hypothesis tests in both situations are much greater than the significance level, so we fail to reject hypothesis $H_{0,\mathrm{background}}$ in both situations. The results indicate that there is no difference between the performance of the background-uniform teams (BU_C1 and BU_C2) and the BD_C teams in terms of the probability of team failures. In columns 8 and 11 of row 1 in Table 10, the power of the test in both situations shows that the probability of a type-II error is very low.
Cohen’s d is used to quantify the effect size when comparing different inspection teams. The effect size measures the magnitude of the difference between two samples [27]. Specifically, an effect size below 0.2 indicates that the difference is negligible, 0.2–0.5 suggests a small but noticeable difference, 0.5–0.8 reflects a medium difference, and values above 0.8 represent a large and substantial difference [28]. In columns 9 and 12 of row 1 in Table 10, the effect sizes show that the differences between inspection teams of different types are negligible or small.
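For reference, Cohen’s d for two samples can be computed with the pooled standard deviation, as in the minimal sketch below (hypothetical $Q_m$ samples; not necessarily the exact formulation used in [27,28]).

```python
import numpy as np

def cohens_d(sample_a, sample_b):
    """Cohen's d effect size using the pooled standard deviation."""
    a, b = np.asarray(sample_a, dtype=float), np.asarray(sample_b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical Q_m samples for two team types; |d| < 0.2 would indicate a negligible difference.
print(f"d = {cohens_d([0.20, 0.22, 0.19, 0.21, 0.23], [0.21, 0.20, 0.23, 0.22, 0.19]):.2f}")
```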
Rows 2–8 of Table 10 display the other parameters of the Z-model that are used to analyze $Q_m$, including $Q_t$, $Q_t^D$, $Q_t^I$, $\beta_1$, $\beta_2$, $Q_{1:m}$, and $Q_{2:m}$. For example, for the BU_C1 teams in column 1, the average probability of a failure by an inspector is $Q_t = 0.39$, which consists of the probability of a dependent failure $Q_t^D = 0.34$ and the probability of an independent failure $Q_t^I = 0.05$. The probability of an independent failure is much less than that of a dependent failure.
In rows 5 and 6, $\beta_1$ and $\beta_2$ describe the proportion of the probability of a dependent failure that is involved in a detection activity with dependency $d = 1$ and $d = 2$, respectively. In an inspection team, the sum of $\beta_d$ ($d \in [1, m]$) equals 1 if dependent failures exist; however, if an inspection team has no dependent failures, $\beta_d$ equals 0 for all $d \in [1, m]$. Therefore, the sum of the mean values of $\beta_1$ and $\beta_2$ in each column is less than 1. The probability of a detection activity that contains $d$ dependent failures is described by $Q_{d:m}$. In the comparison between the BU_C1, BU_C2, and BD_C teams, the p-values are greater than the significance level of 0.05 for all parameters considered. The powers and effect sizes are either small or very small. No difference is observed between the BU_C1, BU_C2, and BD_C teams in terms of any of the parameters. The standard deviation of each variable is very large compared to its mean; in other words, all variables have a high level of dispersion around the mean.
Given Table 10, we can conclude that, for the inspection teams of size m = 2, in which all members use the CBR method, the performance of the background-uniform teams is the same as the background-diverse teams in the inspection of small-size SRSs.
Table 11 and Table 12 display the summary of the statistics related to the performance of the BU_C1, BU_C2, and BD_C teams in the inspection of the SRSs of medium size and large size. The results in Table 11 and Table 12 suggest a similar conclusion, as discussed for Table 10. Therefore, for the inspection teams of size m = 2, in which all members use the CBR method, the performance of the background-uniform teams is the same as the background-diverse teams in the inspection of SRSs of all sizes.
Figure 5 plots the mean $Q_m$ of the BU_C1, BU_C2, and BD_C inspection teams in the inspection of SRSs of different sizes. We can observe that the performance of an inspection team decreases as the size of the SRS increases. For SRSs of each size, the performance of the BU_C1, BU_C2, and BD_C inspection teams is very close.
Hypothesis $H_{0,\mathrm{background}}$ is also tested by comparing the BU_R1 and BU_R2 teams to the BD_R teams of size $m = 2$. In this case, the summaries of the statistics related to the performance of the BU_R1, BU_R2, and BD_R teams in inspecting SRSs of small, medium, and large size are given in Appendix B. The results are similar to those of the comparison of the BU_C1, BU_C2, and BD_C teams, and indicate that, for inspection teams of size $m = 2$ in which all members use the RIMSM method, the performance of the background-uniform teams is the same as that of the background-diverse teams in the inspection of SRSs of all sizes.
Figure 6 plots the mean $Q_m$ of the BU_R1, BU_R2, and BD_R inspection teams in the inspection of SRSs of different sizes. For SRSs of each size, the performance of the BU_R1, BU_R2, and BD_R inspection teams is very close. We also notice that, for these teams, performance on the medium-size SRSs is very close to that on the large-size SRSs. This is because the effectiveness of the RIMSM method is stable for medium-size and large-size SRSs, as discussed in Li’s work [7].
Table 13 summarizes the results of testing hypothesis $H_{0,\mathrm{background}}$ under different conditions. We are unable to reject this hypothesis under any of the conditions. Overall, we can conclude that no difference is observed between the performance of the background-uniform teams and the background-diverse teams. The reason is that the subjects shared similar knowledge of requirements engineering, as introduced in the training sessions, and used the same inspection technique. They also shared the same information about the application under inspection by using the same SRS and StRS documents. Although the majors of the inspectors differ, their perception domains are dominated by this shared knowledge, information, and technique. Therefore, a background-diverse team does not reduce the dependencies between the inspectors significantly, and its overall performance is not improved compared to a background-uniform team.
Although adding diversity to the backgrounds of inspectors in an inspection team cannot reduce the number of dependent failures in the inspection team, other strategies may be effective. To avoid dependent failures, we can diversify the information about the application that is available to the inspectors. For example, some inspectors can use the StRS document as the oracle and the others can contact the stakeholders directly. A more reasonable strategy is to use different inspection techniques, a solution which will be discussed in the next section.

4.4. Comparison of Technique-Diverse Teams and Technique-Uniform Teams

In this section, the creation of technique-diverse teams and technique-uniform teams is introduced first. Then, the technique-diverse teams and technique-uniform teams are compared to test hypothesis $H_{0,\mathrm{technique}}$. Welch’s t-test, which allows unequal sample sizes and unequal variances, is used for hypothesis testing.

4.4.1. Creation of Technique-Diverse Teams and Technique-Uniform Teams

The previous section shows that the backgrounds of the inspectors in an inspection team do not affect the performance of the inspection team. Therefore, in this section, Groups 1, 2, and 3 in Table 6 are considered together and denoted as Group 1–3 since the subjects in those groups use the same inspection technique in all testing sessions. Similarly, Groups 4, 5, and 6 are combined and denoted as Group 4–6. In this section, the inspection teams of both size m = 2 and m = 4 are considered.
A technique-uniform team can be created by selecting $m$ inspectors from Group 1–3 or Group 4–6 and combining their results in a testing session. Based on the inspection technique used in a technique-uniform team, two types of technique-uniform teams can be created, labeled TU_R teams and TU_C teams. All inspectors in a TU_R team use the RIMSM method, and all inspectors in a TU_C team use the CBR method. Table 14 shows the creation of the technique-uniform teams for the small-size SRSs. A TU_R team can be created by selecting $m$ inspectors from Group 1–3 and combining their results in testing session 1, or by selecting $m$ inspectors from Group 4–6 and combining their results in testing session 2. The creation of the TU_C teams is displayed in the last row of Table 14. The last two columns give the total number of different TU_R and TU_C teams that can be created when $m = 2$ and $m = 4$.
A technique-diverse team can be created by selecting m/2 inspectors from Group 1–3 and m/2 inspectors from Group 4–6. The technique-diverse teams are denoted as TD teams. The creation of TD teams for the small-size SRSs is shown in Table 15. The TU_R, TU_C, and TD teams for the medium- or large-size SRSs can be created in the same manner using the data collected in testing sessions 3, 4, 5, and 6.
To test hypothesis $H_{0,\mathrm{technique}}$, we can compare the TD teams to the TU_R and TU_C teams, respectively, for inspection teams of each size. The results are discussed in the next section.

4.4.2. Results of Comparing Technique-Diverse Teams and Technique-Uniform Teams

In this section, the TU_R, TU_C, and TD teams are compared to test hypothesis $H_{0,\mathrm{technique}}$.
For inspection teams of size $m = 2$, Table 16 gives a summary of the statistics related to the performance of the TU_R, TU_C, and TD teams in the inspection of the small-size SRSs. The first row shows that the average $Q_m$ of the TD teams is less than that of the TU_R teams and the TU_C teams. Hypothesis $H_{0,\mathrm{technique}}$ is tested in two situations: (1) comparing the average $Q_m$ of the TU_C teams to that of the TD teams, and (2) comparing the average $Q_m$ of the TU_R teams to that of the TD teams. The hypothesis is rejected in both situations, since the p-values are much less than the significance level (0.05). In the first situation, the power of the test is almost 100%, and the effect size suggests that the difference between the TU_C teams and the TD teams is very large. In the second situation, the power of the test is also high, and the effect size shows that the difference between the TU_R teams and the TD teams is small. The reason for the small effect size is that the RIMSM method is a more effective inspection technique than CBR. Although the effectiveness of the RIMSM technique is much higher than that of the CBR technique, introducing technique diversity can still improve the performance of the inspection teams.
Row 2 of Table 16 shows that the probability of failures of an inspector in the TU_C teams, using the CBR technique (i.e., 0.33), is much higher than that of an inspector in the TU_R teams using the RIMSM technique (i.e., 0.17). The probability of failures by an inspector in the TD teams is between the TU_R teams and TU_C teams. This is because, by using diverse techniques in an inspection team, we are not enhancing the average performance of an individual inspector. Since the RIMSM technique is more effective, we can expect that the average performance of an individual inspector is lower in a TD team compared to a TU_R team. However, the performance of the TD teams is higher than the TU_R teams and the TU_C teams. The reasons are discussed below using the parameters of the Z-model.
Compared to the TU_C teams, both the probability of a dependent failure and the probability of an independent failure by an inspector in the TD teams are lower. In addition, $\beta_2$ in the TD teams (i.e., 0.03) is much lower than in the TU_C teams (i.e., 0.24). Therefore, an inspector in a TD team has a much lower probability of being involved in a detection activity in which all inspectors fail dependently. As a result, the performance of a TD team is better than that of a TU_C team.
The comparison of the TD teams and the TU_R teams is somewhat different. Since RIMSM is a more effective technique, we observe that the probability of dependent failures and the probability of independent failures by an inspector in the TD teams are higher than in the TU_R teams (see rows 3 and 4 in Table 16). Intuitively, a TU_R team should therefore perform better. However, in rows 5 and 6, $\beta_1$ of the TD teams (i.e., 0.84) is greater than that of the TU_R teams (i.e., 0.74), and $\beta_2$ of the TD teams (i.e., 0.03) is much less than that of the TU_R teams (i.e., 0.19). This leads to the observation that, in a TD team, the probability of a detection activity that contains two dependent failures (i.e., $Q_{2:2}$) is small. Although an inspector in the TD teams has a higher probability of a dependent failure, a large portion of that probability is associated with detection activities of dependency $d = 1$. Therefore, using diverse techniques may not increase the average performance of an individual inspector, but it can reduce the probability of detection activities of high dependency. As a result, the performance of the inspection teams is improved.
Table 17 displays a summary of the statistics for comparing the TU_R, TU_C, and TD teams of size $m = 4$ in the inspection of the small-size SRSs. As shown in the first row, the p-value is close to 0 in the comparison between the TU_C teams and the TD teams, and between the TU_R teams and the TD teams. Hypothesis $H_{0,\mathrm{technique}}$ is rejected in both cases. The reasons are similar to those discussed for the TU_R, TU_C, and TD teams of size $m = 2$.
In the comparison between the TU_C teams and the TD teams, the probability of a dependent failure and the probability of an independent failure by an inspector in the TD teams are lower, and $\beta_2$ in the TD teams is also lower. Therefore, the probability of a detection activity with dependency $d = 2$ is lower in a TD team. As a result, the performance of a TD team is better than that of a TU_C team.
In the comparison between the TU_R teams and the TD teams, the probability of a dependent failure and the probability of an independent failure by an inspector in the TD teams are higher than in the TU_R teams. However, the beta factors behave as follows: in a TD team, $\beta_1$ and $\beta_2$ are higher, while $\beta_3$ and $\beta_4$ are lower, compared to a TU_R team. The proportion of detection activities with high dependencies is reduced in a TD team, while the proportion of detection activities with low dependencies is increased. In other words, the diversity in an inspection team converts high-order dependencies into low-order dependencies. As a result, in a TD team, the probabilities of a detection activity that contains more than two dependent failures (i.e., $Q_{3:4}$ and $Q_{4:4}$) are small. Therefore, the performance of the TD teams is higher.
In the inspection of the medium-size and large-size SRSs, we observed results similar to those discussed above. The summaries of the statistics related to the performance of the TU_R, TU_C, and TD teams of size $m = 2$ and $m = 4$ in the inspection of the medium-size and large-size SRSs are provided in Appendix C. Figure 7a,b plots the probability of team failures (i.e., $Q_m$) of the TU_R, TU_C, and TD teams of size $m = 2$ and $m = 4$ in inspecting SRSs of different sizes. We can see that the probability of failures by a TD team is smaller than that of a TU_R team or a TU_C team, regardless of the size of the SRSs and the size of the team. Table 18 summarizes the results of testing hypothesis $H_{0,\mathrm{technique}}$ under different conditions. We are able to reject the hypothesis under all conditions.
Table 19 displays the ratio of $Q_m$ (i.e., the probability of team failures) of a TD team to the $Q_m$ of a TU_R team and of a TU_C team in different cases. We can see that, as the size of the SRSs increases, the ratio decreases and becomes stable. In addition, the ratio for inspection teams of size $m = 4$ is higher than that for inspection teams of size $m = 2$. In other words, as the size of the inspection team increases, the improvement in team performance gained by using diverse techniques also increases.
In conclusion, the performance of the technique-diverse teams is better than that of the technique-uniform teams. This is because different inspection techniques detect different types of defects. Based on the taxonomy of defects in the requirements phase, we classified the defects encountered in our experiment into three main types: missing defects, extra defects, and incorrect defects. Figure 8 displays the average percentage of defects of each type detected by an inspector using the RIMSM and CBR techniques. The RIMSM technique performed better in detecting missing and extra defects, while the CBR technique performed better in detecting incorrect defects. Therefore, using different inspection techniques in an inspection team reduces the detection activities with high dependencies and increases those with low dependencies. As a result, the performance of the inspection team is improved.
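The per-type detection percentages plotted in Figure 8 amount to averaging, over the inspectors who used a given technique, the fraction of seeded defects of each type that they reported. A minimal sketch is given below; the data structures are hypothetical.

```python
from collections import defaultdict

def detection_rate_by_type(defect_type, reported_by_inspector):
    """Average per-inspector detection percentage for each defect type.

    defect_type: dict mapping defect id -> 'missing' | 'extra' | 'incorrect'
        for every defect seeded in the SRS.
    reported_by_inspector: dict mapping inspector id -> set of defect ids
        reported by that inspector (all inspectors used the same technique).
    """
    totals = defaultdict(int)
    for t in defect_type.values():
        totals[t] += 1

    per_type_rates = defaultdict(list)
    for reported in reported_by_inspector.values():
        hits = defaultdict(int)
        for defect in reported:
            hits[defect_type[defect]] += 1
        for t, total in totals.items():
            per_type_rates[t].append(100.0 * hits[t] / total)

    return {t: sum(rates) / len(rates) for t, rates in per_type_rates.items()}
```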

4.5. Threats to Validity

This section analyzes the typical threats to the validity of the experiment and the measures taken to minimize them.

4.5.1. Internal Validity

  • Selection Bias
The research subjects were selected from junior and senior undergraduate engineering students. We believe the difference between junior and senior students does not affect our results, since the inspection techniques were new to all of the undergraduate students. Learning these techniques requires an average level of programming and mathematics knowledge, which is shared by all engineering students.
  • Rivalry
All subjects were trained identically and did not know a priori that they would be grouped; the inspection teams were created virtually. Thus, rivalry between inspectors or between inspection teams was not possible.
  • History
The entire experiment lasted only 3 days, during which the subjects had no other classes. Therefore, the likelihood of significant external events affecting their attributes can be disregarded.
  • Maturation
As discussed in Section 4.1, the performance of the subjects is expected to remain stable across testing sessions. Hence, maturation bias can be ignored.
  • Repeated testing
This threat was minimized by using SRSs of different sizes and topics in each testing session.
  • Hawthorne effect
During the testing sessions, subjects inspected the SRSs independently without interference from the investigators. Any Hawthorne effect would be consistent across all sessions.
  • Experimenter bias
Experimenter bias was minimized because all sessions were conducted by Investigator 2, who was unaware of the experimental hypotheses.
  • Observer-expectancy effect
This threat was reduced since the subjects were not informed of the experiment’s true purpose. The study was introduced under the title “Study of Software Requirements Inspection Process” to avoid expectation-related bias.
  • Mortality
One subject in Group 1 dropped out of testing sessions 5 and 6 due to a home emergency. The results for the small-size and medium-size SRSs were not affected, and the results for the large-size SRSs were generated using the remaining 22 subjects. The absence of this subject does not affect the validity of our results, since the SRSs of different sizes were analyzed separately.

4.5.2. External Validity

First, our research subjects were junior and senior engineering undergraduates. Generalizing our findings to other inspectors should be valid, since typical requirements inspectors usually come from engineering fields, hold at least a bachelor's degree, and therefore have a background and learning potential similar to those of the subjects in our experiment.
Second, this study involved 23 subjects from CSE, EE, and ME. Although the limited number of participants and academic disciplines may constrain the generalizability of the results, the findings remain valid for two reasons: the number of virtual teams was sufficient to support statistical hypothesis testing, and CSE students primarily focus on software whereas EE and ME students concentrate more on hardware, so the substantial differences between these disciplines provide a representative level of diversity across academic backgrounds.
Third, the inspection teams were created virtually in the experiment. This design is intentional, because the defects detected by an inspection team are exactly the union of the defects detected by each inspector in the team. The advantages of such a design include the following: (1) it provides more data points by enumerating all combinations of the subjects in an inspection team, and (2) it isolates the dependencies between the subjects to the backgrounds and the inspection techniques only, eliminating type-II dependent failures. The results should therefore remain valid for a real inspection team. A minimal sketch of this construction is given below.
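The virtual-team construction can be expressed directly in terms of set unions over all combinations of inspectors. The sketch below is illustrative; the variable names are hypothetical.

```python
from itertools import combinations
from math import comb

def virtual_teams(detected_by_inspector, m):
    """Enumerate all virtual teams of size m for one testing session.

    detected_by_inspector: dict mapping inspector id -> set of defect ids
        detected by that inspector in the session. A virtual team is credited
        with the union of its members' defects, so team-level results can be
        derived without any additional data collection.
    """
    return {
        members: set().union(*(detected_by_inspector[a] for a in members))
        for members in combinations(sorted(detected_by_inspector), m)
    }

# Scale of the enumeration with the group sizes of Table 6 (12 subjects in
# Groups 1-3 and 11 subjects in Groups 4-6): comb(12, 2) + comb(11, 2) = 121
# technique-uniform pairs per technique and 12 * 11 * 2 = 264 technique-
# diverse pairs, which matches the team counts in Tables 14 and 15.
```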
In addition, this experiment studied only the RIMSM and CBR techniques, each with its own instrument (the checklist and the RITSM tool, respectively). Using other inspection techniques or instruments may affect the results. However, this paper aims to study the effectiveness of introducing diversity into an inspection team, and the results demonstrate that the performance of an inspection team can be increased by using diverse techniques. To apply our findings, a technique-diverse team should use techniques that are as different as possible.
Another threat to the generalization of our findings is the size of the inspection team. To minimize this threat, inspection teams of sizes m = 2 and m = 4 were studied.
Last, the SRSs used in the testing sessions were designed to cover different types of systems, and different sizes were considered as well. Therefore, possible biases related to the SRSs used in the testing sessions should be minimized and should not affect external validity.

5. Conclusions

In this paper, an experiment was designed and conducted to study the effectiveness of introducing diversity in an inspection team. Two types of diversity were considered: the diverse backgrounds of the inspectors and the diverse techniques used for defect detection. The Z-model was used to analyze the collected data. The results showed that (1) the performance of the background-diverse teams is the same as that of the background-uniform teams; (2) the performance of the technique-diverse teams is better than that of the technique-uniform teams; and (3) as the size of an inspection team increases, the improvement in performance obtained by using diverse techniques also increases. This experiment demonstrated that using diverse techniques in an inspection team can improve the performance of the inspection team.
The findings in this paper provide experimental support for the software development life cycle, specifically for enhancing software reliability during the requirements phase by introducing diversity. These results are significant for safety-critical systems, where software failures can have severe consequences. By introducing diversity in the requirements phase, potential defects can be identified early, improving both the safety and the economic effectiveness of the system.
Future research includes the following aspects: (1) studying the effectiveness of introducing diversity in other phases of software development, and (2) repeating the experiment with more subjects from industry.

Author Contributions

Conceptualization, B.L.; Methodology, B.L.; Software, B.L.; Validation, B.L.; Investigation, J.L.; Writing—original draft, B.L. and J.L.; Writing—review & editing, B.L., J.L. and X.H.; Visualization, B.L.; Project administration, J.L. and X.H.; Funding acquisition, X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 62120106003).

Data Availability Statement

Raw data used in this paper are not publicly available to preserve individuals’ privacy under the University Human Research Protection Program.

Acknowledgments

We would like to express our special thanks to Carol Smidts, Xiaoxu Diao, and Yunfei Zhao for their help with this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Sensitivity Analysis for the Criteria of the Dependent Failures and Independent Failures

In this experiment, three criteria were used to discriminate between dependent failures and independent failures. A sensitivity analysis of the value of the variable c in these criteria is presented in this section.
In this experiment, we only considered c = 1 and c = 2, because most functions in the SRSs used in the experiment contain 0–3 defects. A value of c greater than or equal to 3 would make the criteria for labelling a failure as independent too stringent; if such a c were selected, most failures would be labelled as dependent failures.
Table A1 gives Q_m for inspection teams of different types and sizes when inspecting the small-size SRSs under the conditions c = 1 and c = 2. The last column of Table A1 gives the relative difference in Q_m between c = 1 and c = 2 for each type of inspection team. Table A2 and Table A3 display the corresponding Q_m values for inspecting the medium-size and large-size SRSs.
Table A1. Q_m of inspection teams for inspecting the small-size SRSs.
| Size | Team | c = 2 | c = 1 | Change% |
| m = 2 | BU_C1 | 0.13 | 0.16 | 18.7% |
| m = 2 | BU_C2 | 0.13 | 0.13 | 0.0% |
| m = 2 | BD_C | 0.14 | 0.15 | 7.0% |
| m = 2 | BU_R1 | 0.04 | 0.04 | 12.7% |
| m = 2 | BU_R2 | 0.03 | 0.03 | 0.0% |
| m = 2 | BD_R | 0.06 | 0.06 | 4.5% |
| m = 2 | TU_C | 0.14 | 0.15 | 7.2% |
| m = 2 | TU_R | 0.05 | 0.05 | 4.9% |
| m = 2 | TD | 0.02 | 0.03 | 26.7% |
| m = 4 | TU_C | 0.04 | 0.05 | 9.9% |
| m = 4 | TU_R | 0.0020 | 0.0022 | 9.1% |
| m = 4 | TD | 0.0003 | 0.0008 | 56.1% |
Table A2. Q_m of inspection teams for inspecting the medium-size SRSs.
| Size | Team | c = 2 | c = 1 | Change% |
| m = 2 | BU_C1 | 0.44 | 0.46 | 4.3% |
| m = 2 | BU_C2 | 0.43 | 0.44 | 3.0% |
| m = 2 | BD_C | 0.47 | 0.48 | 2.7% |
| m = 2 | BU_R1 | 0.24 | 0.25 | 2.0% |
| m = 2 | BU_R2 | 0.26 | 0.26 | 1.7% |
| m = 2 | BD_R | 0.26 | 0.27 | 1.6% |
| m = 2 | TU_C | 0.49 | 0.50 | 2.7% |
| m = 2 | TU_R | 0.26 | 0.26 | 1.6% |
| m = 2 | TD | 0.21 | 0.22 | 5.3% |
| m = 4 | TU_C | 0.31 | 0.32 | 3.8% |
| m = 4 | TU_R | 0.15 | 0.15 | 1.5% |
| m = 4 | TD | 0.04 | 0.04 | 12.0% |
Table A3. Q_m of inspection teams for inspecting the large-size SRSs.
| Size | Team | c = 2 | c = 1 | Change% |
| m = 2 | BU_C1 | 0.55 | 0.55 | 0.7% |
| m = 2 | BU_C2 | 0.56 | 0.56 | 0.3% |
| m = 2 | BD_C | 0.57 | 0.57 | 0.4% |
| m = 2 | BU_R1 | 0.21 | 0.22 | 2.4% |
| m = 2 | BU_R2 | 0.24 | 0.24 | 1.7% |
| m = 2 | BD_R | 0.25 | 0.25 | 1.7% |
| m = 2 | TU_C | 0.53 | 0.53 | 0.4% |
| m = 2 | TU_R | 0.29 | 0.29 | 1.6% |
| m = 2 | TD | 0.25 | 0.25 | 1.7% |
| m = 4 | TU_C | 0.37 | 0.37 | 0.4% |
| m = 4 | TU_R | 0.14 | 0.14 | 2.8% |
| m = 4 | TD | 0.06 | 0.07 | 3.9% |
The average values of the relative difference in Q_m between c = 1 and c = 2 in Table A1, Table A2, and Table A3 are 13%, 4%, and 1%, respectively. The relative difference in Q_m is therefore negligible for the medium-size and large-size SRSs. This is because, with a larger c, more failures are classified as independent, so the fraction of independent failures in our results increases. However, the failure rate of an inspector in the inspection of the medium-size and large-size SRSs is high and the fraction of independent failures is small; therefore, the change in the fraction of independent failures does not affect the results significantly.
The relative difference in Q_m is larger for the small-size SRSs than for the medium-size and large-size SRSs. This is because the failure rate of an inspector is lower for the small-size SRSs, so the fraction of independent failures is higher. As a result, the change in the fraction of independent failures has a larger effect on the relative difference in Q_m for the small-size SRSs.
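The Change% column in Tables A1–A3 can be reproduced as a simple relative difference. The sketch below assumes the difference is taken relative to the c = 1 value, which approximately reproduces the reported figures; exact values depend on the unrounded Q_m.

```python
def relative_change(qm_c2, qm_c1):
    """Relative difference in Q_m between the c = 2 and c = 1 settings (%)."""
    return abs(qm_c1 - qm_c2) / qm_c1 * 100.0

# BU_C1 row of Table A1: relative_change(0.13, 0.16) gives 18.75%,
# reported as 18.7% (the table values themselves are rounded).
```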
The results of testing hypotheses H_{0,background} and H_{0,technique} for c = 2 are displayed in Table A4 and Table A5. Compared with Table 13 and Table 18, the results of the hypothesis testing are the same under c = 1 and c = 2. Therefore, the results of the experiment are not sensitive to the selected value of c. A more detailed sensitivity analysis will be presented in future work, following [29].
Table A4. Testing of H_{0,background}.
| SRS Size | BU_C1 vs. BD_C | BU_C2 vs. BD_C | BU_R1 vs. BD_R | BU_R2 vs. BD_R |
| Small | Not rejected | Not rejected | Not rejected | Not rejected |
| Medium | Not rejected | Not rejected | Not rejected | Not rejected |
| Large | Not rejected | Not rejected | Not rejected | Not rejected |
Table A5. Testing of H_{0,technique}.
| SRS Size | TD vs. TU_C, m = 2 | TD vs. TU_C, m = 4 | TD vs. TU_R, m = 2 | TD vs. TU_R, m = 4 |
| Small | Reject | Reject | Reject | Reject |
| Medium | Reject | Reject | Reject | Reject |
| Large | Reject | Reject | Reject | Reject |

Appendix B. Summary of the Performance Statistics for the BU_R1, BU_R2, and BD_R Teams

This appendix displays a summary of the performance statistics for the BU_R1, BU_R2, and BD_R teams of size m = 2 in the inspection of SRSs of a small size, medium size, and large size. The results are displayed in Table A6, Table A7 and Table A8.
Table A6. Summary of the performance statistics for the BU_R1, BU_R2, and BD_R teams of size m = 2 in inspecting the small-size SRSs.
| Metric | BU_R1 (mean / std) | BU_R2 (mean / std) | BD_R (mean / std) | BU_R1 vs. BD_R (p / power / eff. size) | BU_R2 vs. BD_R (p / power / eff. size) |
| Q_m | 0.04 / 0.09 | 0.03 / 0.07 | 0.06 / 0.09 | 0.22 / 0.12 / 0.19 | 0.12 / 0.18 / 0.27 |
| Q_t | 0.18 / 0.07 | 0.18 / 0.09 | 0.18 / 0.09 | 0.49 / 0.05 / 0.00 | 0.44 / 0.05 / 0.04 |
| Q_t^D | 0.16 / 0.08 | 0.16 / 0.08 | 0.16 / 0.08 | 0.49 / 0.05 / 0.00 | 0.43 / 0.05 / 0.04 |
| Q_t^I | 0.02 / 0.03 | 0.02 / 0.03 | 0.02 / 0.04 | 0.50 / 0.05 / 0.00 | 0.50 / 0.05 / 0.00 |
| β_1 | 0.80 / 0.41 | 0.84 / 0.34 | 0.70 / 0.44 | 0.17 / 0.15 / 0.23 | 0.08 / 0.24 / 0.34 |
| β_2 | 0.16 / 0.37 | 0.11 / 0.28 | 0.22 / 0.40 | 0.24 / 0.10 / 0.17 | 0.08 / 0.22 / 0.32 |
| Q_{1:2} | 0.13 / 0.09 | 0.13 / 0.07 | 0.11 / 0.08 | 0.21 / 0.13 / 0.20 | 0.14 / 0.17 / 0.27 |
| Q_{2:2} | 0.04 / 0.09 | 0.03 / 0.07 | 0.05 / 0.09 | 0.23 / 0.11 / 0.18 | 0.10 / 0.20 / 0.29 |
Table A7. Summary of the performance statistics for the BU_R1, BU_R2, and BD_R teams of size m = 2 in inspecting the medium-size SRSs.
| Metric | BU_R1 (mean / std) | BU_R2 (mean / std) | BD_R (mean / std) | BU_R1 vs. BD_R (p / power / eff. size) | BU_R2 vs. BD_R (p / power / eff. size) |
| Q_m | 0.25 / 0.15 | 0.26 / 0.16 | 0.27 / 0.17 | 0.32 / 0.07 / 0.11 | 0.45 / 0.05 / 0.03 |
| Q_t | 0.45 / 0.15 | 0.47 / 0.15 | 0.46 / 0.16 | 0.35 / 0.07 / 0.09 | 0.45 / 0.05 / 0.03 |
| Q_t^D | 0.42 / 0.15 | 0.44 / 0.15 | 0.43 / 0.16 | 0.37 / 0.06 / 0.08 | 0.43 / 0.05 / 0.05 |
| Q_t^I | 0.03 / 0.05 | 0.03 / 0.03 | 0.03 / 0.04 | 0.45 / 0.05 / 0.03 | 0.40 / 0.06 / 0.06 |
| β_1 | 0.51 / 0.28 | 0.47 / 0.26 | 0.48 / 0.32 | 0.35 / 0.06 / 0.09 | 0.46 / 0.05 / 0.03 |
| β_2 | 0.49 / 0.28 | 0.53 / 0.26 | 0.52 / 0.32 | 0.35 / 0.06 / 0.09 | 0.46 / 0.05 / 0.03 |
| Q_{1:2} | 0.19 / 0.10 | 0.20 / 0.11 | 0.18 / 0.11 | 0.35 / 0.06 / 0.09 | 0.31 / 0.08 / 0.13 |
| Q_{2:2} | 0.22 / 0.16 | 0.24 / 0.15 | 0.24 / 0.17 | 0.28 / 0.08 / 0.14 | 0.43 / 0.05 / 0.04 |
Table A8. Summary of the performance statistics for the BU_R1, BU_R2, and BD_R teams of size m = 2 in inspecting the large-size SRSs.
| Metric | BU_R1 (mean / std) | BU_R2 (mean / std) | BD_R (mean / std) | BU_R1 vs. BD_R (p / power / eff. size) | BU_R2 vs. BD_R (p / power / eff. size) |
| Q_m | 0.22 / 0.17 | 0.24 / 0.14 | 0.25 / 0.18 | 0.25 / 0.10 / 0.17 | 0.43 / 0.05 / 0.04 |
| Q_t | 0.45 / 0.20 | 0.42 / 0.17 | 0.44 / 0.20 | 0.38 / 0.06 / 0.08 | 0.37 / 0.06 / 0.08 |
| Q_t^D | 0.43 / 0.21 | 0.40 / 0.17 | 0.41 / 0.20 | 0.38 / 0.06 / 0.08 | 0.36 / 0.06 / 0.09 |
| Q_t^I | 0.02 / 0.02 | 0.03 / 0.03 | 0.03 / 0.02 | 0.40 / 0.06 / 0.06 | 0.42 / 0.05 / 0.06 |
| β_1 | 0.59 / 0.27 | 0.45 / 0.17 | 0.50 / 0.27 | 0.10 / 0.26 / 0.36 | 0.18 / 0.11 / 0.20 |
| β_2 | 0.41 / 0.27 | 0.55 / 0.17 | 0.50 / 0.27 | 0.10 / 0.26 / 0.36 | 0.18 / 0.11 / 0.20 |
| Q_{1:2} | 0.23 / 0.13 | 0.17 / 0.09 | 0.18 / 0.11 | 0.07 / 0.35 / 0.43 | 0.36 / 0.06 / 0.09 |
| Q_{2:2} | 0.20 / 0.17 | 0.22 / 0.13 | 0.23 / 0.18 | 0.25 / 0.10 / 0.18 | 0.42 / 0.05 / 0.05 |

Appendix C. Summary of the Performance Statistics for the TU_R, TU_C, and TD Teams

This section provides a summary of the performance statistics for the TU_R, TU_C, and TD teams of size m = 2 and m = 4 in the inspection of medium-size and large-size SRSs, as shown in Table A9, Table A10, Table A11 and Table A12.
Table A9. Summary of the performance statistics for the TU_R, TU_C, and TD teams of size m = 2 in the inspection of the medium-size SRSs.
| Metric | TU_C (mean / std) | TU_R (mean / std) | TD (mean / std) | TU_C vs. TD (p / power / eff. size) | TU_R vs. TD (p / power / eff. size) |
| Q_m | 0.50 / 0.23 | 0.26 / 0.16 | 0.22 / 0.20 | 0.00 / 1.00 / 1.31 | 0.02 / 0.50 / 0.22 |
| Q_t | 0.66 / 0.16 | 0.44 / 0.15 | 0.55 / 0.16 | 0.00 / 1.00 / 0.69 | 0.00 / 1.00 / 0.66 |
| Q_t^D | 0.63 / 0.17 | 0.41 / 0.16 | 0.52 / 0.17 | 0.00 / 1.00 / 0.70 | 0.00 / 1.00 / 0.67 |
| Q_t^I | 0.02 / 0.04 | 0.03 / 0.04 | 0.03 / 0.04 | 0.06 / 0.31 / 0.16 | 0.14 / 0.19 / 0.12 |
| β_1 | 0.32 / 0.32 | 0.47 / 0.31 | 0.72 / 0.28 | 0.00 / 1.00 / 1.38 | 0.00 / 1.00 / 0.89 |
| β_2 | 0.68 / 0.32 | 0.53 / 0.31 | 0.28 / 0.28 | 0.00 / 1.00 / 1.38 | 0.00 / 1.00 / 0.89 |
| Q_{1:2} | 0.17 / 0.13 | 0.17 / 0.10 | 0.34 / 0.11 | 0.00 / 1.00 / 1.47 | 0.00 / 1.00 / 1.54 |
| Q_{2:2} | 0.47 / 0.25 | 0.24 / 0.16 | 0.18 / 0.20 | 0.00 / 1.00 / 1.33 | 0.00 / 0.79 / 0.31 |
Table A10. Summary of the performance statistics for the TU_R, TU_C, and TD teams of size m = 4 in the inspection of the medium-size SRSs.
| Metric | TU_C (mean / std) | TU_R (mean / std) | TD (mean / std) | TU_C vs. TD (p / power / eff. size) | TU_R vs. TD (p / power / eff. size) |
| Q_m | 0.32 / 0.23 | 0.15 / 0.11 | 0.04 / 0.08 | 0.00 / 1.00 / 2.59 | 0.00 / 1.00 / 1.20 |
| Q_t | 0.66 / 0.10 | 0.45 / 0.10 | 0.55 / 0.11 | 0.00 / 1.00 / 1.07 | 0.00 / 1.00 / 0.96 |
| Q_t^D | 0.64 / 0.11 | 0.41 / 0.11 | 0.52 / 0.11 | 0.00 / 1.00 / 1.07 | 0.00 / 1.00 / 0.94 |
| Q_t^I | 0.02 / 0.03 | 0.03 / 0.03 | 0.03 / 0.03 | 0.00 / 1.00 / 0.17 | 0.06 / 0.40 / 0.06 |
| β_1 | 0.07 / 0.06 | 0.20 / 0.13 | 0.12 / 0.16 | 0.00 / 1.00 / 0.35 | 0.00 / 1.00 / 0.51 |
| β_2 | 0.12 / 0.15 | 0.19 / 0.18 | 0.51 / 0.24 | 0.00 / 1.00 / 1.69 | 0.00 / 1.00 / 1.35 |
| β_3 | 0.40 / 0.29 | 0.30 / 0.23 | 0.33 / 0.25 | 0.00 / 1.00 / 0.27 | 0.00 / 0.72 / 0.09 |
| β_4 | 0.42 / 0.33 | 0.30 / 0.25 | 0.04 / 0.11 | 0.00 / 1.00 / 2.53 | 0.00 / 1.00 / 1.98 |
| Q_{1:4} | 0.04 / 0.03 | 0.08 / 0.05 | 0.05 / 0.05 | 0.00 / 0.97 / 0.14 | 0.00 / 1.00 / 0.59 |
| Q_{2:4} | 0.02 / 0.03 | 0.03 / 0.03 | 0.08 / 0.04 | 0.00 / 1.00 / 1.69 | 0.00 / 1.00 / 1.58 |
| Q_{3:4} | 0.08 / 0.06 | 0.04 / 0.03 | 0.06 / 0.05 | 0.00 / 1.00 / 0.29 | 0.00 / 1.00 / 0.47 |
| Q_{4:4} | 0.29 / 0.25 | 0.14 / 0.12 | 0.03 / 0.08 | 0.00 / 1.00 / 2.48 | 0.00 / 1.00 / 1.37 |
Table A11. Summary of the performance statistics for the TU_R, TU_C, and TD teams of size m = 2 in the inspection of the large-size SRSs.
| Metric | TU_C (mean / std) | TU_R (mean / std) | TD (mean / std) | TU_C vs. TD (p / power / eff. size) | TU_R vs. TD (p / power / eff. size) |
| Q_m | 0.53 / 0.24 | 0.29 / 0.19 | 0.25 / 0.21 | 0.00 / 1.00 / 1.29 | 0.03 / 0.43 / 0.21 |
| Q_t | 0.69 / 0.16 | 0.49 / 0.19 | 0.57 / 0.15 | 0.00 / 1.00 / 0.82 | 0.00 / 1.00 / 0.53 |
| Q_t^D | 0.68 / 0.16 | 0.46 / 0.19 | 0.55 / 0.16 | 0.00 / 1.00 / 0.83 | 0.00 / 1.00 / 0.53 |
| Q_t^I | 0.02 / 0.01 | 0.02 / 0.02 | 0.02 / 0.02 | 0.03 / 0.41 / 0.20 | 0.08 / 0.32 / 0.17 |
| β_1 | 0.28 / 0.23 | 0.43 / 0.19 | 0.66 / 0.27 | 0.00 / 1.00 / 1.45 | 0.00 / 1.00 / 0.92 |
| β_2 | 0.72 / 0.23 | 0.57 / 0.19 | 0.34 / 0.27 | 0.00 / 1.00 / 1.45 | 0.00 / 1.00 / 0.92 |
| Q_{1:2} | 0.16 / 0.10 | 0.18 / 0.10 | 0.33 / 0.11 | 0.00 / 1.00 / 1.57 | 0.00 / 1.00 / 1.37 |
| Q_{2:2} | 0.52 / 0.24 | 0.27 / 0.19 | 0.22 / 0.21 | 0.00 / 1.00 / 1.33 | 0.01 / 0.59 / 0.25 |
Table A12. Summary of the performance statistics for the TU_R, TU_C, and TD teams of size m = 4 in the inspection of the large-size SRSs.
| Metric | TU_C (mean / std) | TU_R (mean / std) | TD (mean / std) | TU_C vs. TD (p / power / eff. size) | TU_R vs. TD (p / power / eff. size) |
| Q_m | 0.37 / 0.23 | 0.14 / 0.10 | 0.07 / 0.11 | 0.00 / 1.00 / 2.38 | 0.00 / 1.00 / 0.70 |
| Q_t | 0.70 / 0.11 | 0.50 / 0.15 | 0.57 / 0.10 | 0.00 / 1.00 / 1.37 | 0.00 / 1.00 / 0.54 |
| Q_t^D | 0.70 / 0.11 | 0.49 / 0.15 | 0.55 / 0.11 | 0.00 / 1.00 / 1.38 | 0.00 / 1.00 / 0.56 |
| Q_t^I | 0.02 / 0.01 | 0.02 / 0.01 | 0.02 / 0.01 | 0.00 / 1.00 / 0.29 | 0.00 / 1.00 / 0.26 |
| β_1 | 0.05 / 0.04 | 0.13 / 0.11 | 0.09 / 0.11 | 0.00 / 1.00 / 0.41 | 0.00 / 1.00 / 0.40 |
| β_2 | 0.11 / 0.11 | 0.25 / 0.16 | 0.48 / 0.22 | 0.00 / 1.00 / 1.73 | 0.00 / 1.00 / 1.07 |
| β_3 | 0.36 / 0.22 | 0.35 / 0.17 | 0.35 / 0.20 | 0.08 / 0.33 / 0.06 | 0.41 / 0.05 / 0.01 |
| β_4 | 0.48 / 0.28 | 0.27 / 0.15 | 0.08 / 0.15 | 0.00 / 1.00 / 2.42 | 0.00 / 1.00 / 1.27 |
| Q_{1:4} | 0.03 / 0.02 | 0.05 / 0.03 | 0.04 / 0.04 | 0.00 / 1.00 / 0.26 | 0.00 / 1.00 / 0.29 |
| Q_{2:4} | 0.02 / 0.02 | 0.04 / 0.03 | 0.08 / 0.03 | 0.00 / 1.00 / 1.87 | 0.00 / 1.00 / 1.27 |
| Q_{3:4} | 0.08 / 0.05 | 0.06 / 0.04 | 0.07 / 0.05 | 0.00 / 1.00 / 0.25 | 0.00 / 0.99 / 0.17 |
| Q_{4:4} | 0.36 / 0.24 | 0.13 / 0.10 | 0.05 / 0.11 | 0.00 / 1.00 / 2.38 | 0.00 / 1.00 / 0.71 |

References

  1. Arndt, S.A.; Alvarado, R.; Dittman, B.; Mott, K.; Wood, R. NRC Technical Basis for Evaluation of Its Position on Protection Against Common Cause Failure in Digital Systems Used in Nuclear Power Plants. In Proceedings of the 2017 NPIC-HMIT, San Francisco, CA, USA, 11–15 June 2017.
  2. Alshazly, A.A.; Elfatatry, A.M.; Abougabal, M.S. Detecting defects in software requirements specification. Alex. Eng. J. 2014, 53, 513–527.
  3. Porter, A.A.; Votta, L.G.; Basili, V.R. Comparing detection methods for software requirements inspections: A replicated experiment. IEEE Trans. Softw. Eng. 1995, 21, 563–575.
  4. He, L.; Carver, J. PBR vs. Checklist: A Replication in the N-Fold Inspection Context. In Proceedings of the 2006 ACM/IEEE International Symposium on Empirical Software Engineering, Rio de Janeiro, Brazil, 21–22 September 2006; pp. 95–104.
  5. Signoret, J.P.; Leroy, A. Dependent and Common Cause Failures; Springer Series in Reliability Engineering; Springer: Berlin/Heidelberg, Germany, 2021; pp. 103–120.
  6. Ali, S.W.; Ahmed, Q.A.; Shafi, I. Process to enhance the quality of software requirement specification document. In Proceedings of the International Conference on Engineering and Emerging Technologies, Lahore, Pakistan, 22–23 February 2018.
  7. Li, B.; Diao, X.; Gao, W.; Smidts, C. A Requirements Inspection Method Based on Scenarios Generated by Model Mutation and the Experimental Validation. Empir. Softw. Eng. 2021, 26, 108.
  8. Martin, J.; Tsai, W.T. N-Fold Inspection: A Requirements Analysis Technique. Commun. ACM 1990, 33, 225–232.
  9. Kantorowitz, E.; Guttman, A.; Arzi, L. The performance of the N-fold requirement inspection method. Requir. Eng. 1997, 2, 152–164.
  10. Vulpe, A.; Carausu, A. Dependent failure and CCF analysis of NPP systems with diversity defense factors. In Proceedings of the Transactions of the 14th International Conference on Structural Mechanics in Reactor Technology, Lyon, France, 17–22 August 1997.
  11. Huang, F.; Liu, B.; Song, Y.; Keyal, S. The links between human error diversity and software diversity: Implications for fault diversity seeking. Sci. Comput. Program. 2014, 89, 350–373.
  12. Li, B.; Smidts, C. A Zone-Based Model for Analysis of Dependent Failures in Requirements Inspection. IEEE Trans. Softw. Eng. 2023, 49, 3581–3598.
  13. Staron, M.; Kuzniarz, L.; Thurn, C. An empirical assessment of using stereotypes to improve reading techniques in software inspections. In Proceedings of the International Conference on Software Engineering, St. Louis, MO, USA, 15–21 May 2005; pp. 63–69.
  14. Lanubile, F.; Visaggio, G. Evaluating Defect Detection Techniques for Software Requirements Inspections; International Software Engineering Research Network: Bari, Italy, 2000.
  15. Fleming, K.; Mosleh, A. Classification and Analysis of Reactor Operating Experience Involving Dependent Events; Electric Power Research Institute: Palo Alto, CA, USA, 1985; pp. 1–24.
  16. Fleming, K. A Reliability Model for Common Cause Failures in Redundant Safety Systems; Technical Report No. GA-A-13284; General Atomics: San Diego, CA, USA, 1974.
  17. Mosleh, A.; Siu, N. A reliability model for common mode failure in redundant safety systems. In Proceedings of the Ninth International Conference on Structural Mechanics in Reactor Technology, Lausanne, Switzerland, 17–21 August 1987.
  18. Atwood, C. Common Cause Fault Rates for Pumps; NUREG/CR-2098; US Nuclear Regulatory Commission: Washington, DC, USA, 1983.
  19. ISO/IEC/IEEE 29148:2018; Systems and Software Engineering-Life Cycle Processes: Requirements Engineering. IEEE: New York, NY, USA, 2018.
  20. Li, X.; Mutha, C.; Smidts, C.S. An automated software reliability prediction system for safety critical software. Empir. Softw. Eng. 2016, 21, 2413–2455.
  21. Li, X.; Gupta, J. ARPS: An Automated Reliability Prediction System Tool for Safety Critical Software; PSA: Quezon City, Philippines, 2013; pp. 22–27.
  22. Li, B.; Smidts, C.S. Extension of Mutation Testing for the Requirements and Design Faults. In Proceedings of the 2017 NPIC-HMIT, Pittsburgh, PA, USA, 24–28 September 2017.
  23. Lanubile, F.; Visaggio, G. Assessing defect detection methods for software requirements inspections through external replication; Technical Report ISERN9601; International Software Engineering Research Network: Bari, Italy, 1996; p. 17.
  24. Votta, L.G. Does every inspection need a meeting? In Proceedings of the Symposium on the Foundations of Software Engineering, Los Angeles, CA, USA, 7–10 December 1993; pp. 107–114.
  25. Goswami, A.; Walia, G. An empirical study of the effect of learning styles on the faults found during the software requirements inspection. In Proceedings of the 24th International Symposium on Software Reliability Engineering, Pasadena, CA, USA, 4–7 November 2013; pp. 330–339.
  26. McHugh, M.L. Interrater reliability: The kappa statistic. Biochem. Med. 2012, 22, 276–282.
  27. Sullivan, G.M.; Feinn, R. Using Effect Size—Or Why the P Value Is Not Enough. J. Grad. Med. Educ. 2012, 4, 279–282.
  28. Sawilowsky, S.S. New Effect Size Rules of Thumb. J. Mod. Appl. Stat. Methods 2009, 8, 597–599.
  29. Zubair, M.; Ishag, A. Sensitivity analysis of APR-1400's reactor protection system by using RiskSpectrum PSA. Nucl. Eng. Des. 2018, 339, 225–234.
Figure 1. Perception Zones.
Figure 2. Checklist used in CBR.
Figure 3. Defect recording sheet.
Figure 4. Average performance of the inspectors.
Figure 5. Mean of Q_m of the BU_C1, BU_C2, and BD_C inspection teams.
Figure 6. Mean of Q_m of the BU_R1, BU_R2, and BD_R inspection teams.
Figure 7. Probability of team failures for SRSs of different sizes.
Figure 8. Percentage of defects of each type detected by an inspector.
Table 1. Comparison of CBR and RIMSM.
| Method | CBR | RIMSM |
| Instrumentation | Checklist | Tool (RITSM) |
| Defect coverage | Defects defined in the checklist | Defects considered in model mutation |
| Inspection item | SRS document | Results of execution of the SRS model |
| Defect detection activity | Answer questions in the checklist | Examine the system behaviors and outputs in different scenarios |
Table 2. Data for parameter estimation in the Z-model.
| Data | Definition |
| m | Number of Inspectors in the Team |
| N_D | Total Number of Detection Activities |
| n_d^i | Number of Defects Detected in the Perception Zone Z_d^i |
| n_a^I | Number of Independent Failures by Inspector I_a in N_D Detection Activities |
| n_a^D | Number of Dependent Failures by Inspector I_a in N_D Detection Activities |
Table 3. SRS documents.
| SRS Label | SRS Topic | Functions | Pages | Defects |
| S1 | A water level control system | 4 | 3 | 5 |
| S2 | A reaction chamber control system in a chemical plant | 5 | 3 | 4 |
| S3 | An automated car assembly system | 5 | 3 | 4 |
| S4 | A fly safety system | 5 | 3 | 3 |
| S5 | A valve control system | 5 | 3 | 5 |
| S6 | A vehicle speed monitor system | 5 | 3 | 4 |
| S7 | A post-collision event control system | 5 | 3 | 6 |
| S8 | An automobile cruise control and monitoring system | 5 | 3 | 5 |
| S9 | A digital-based small reactor protection system | 10 | 6 | 9 |
| S10 | An elevator control system | 11 | 6 | 11 |
| S11 | An integrated vehicle-based safety system | 21 | 11 | 55 |
| S12 | An embedded control software for smart sensor | 22 | 12 | 22 |
Table 4. Major of the Subjects.
| Major | Number of Subjects |
| Computer Science & Engineering (CSE) | 11 |
| Electrical Engineering (EE) | 10 |
| Mechanical Engineering (ME) | 2 |
Table 5. Experiment schedule.
| Day | Session | Content |
| Day 1 | training 1 | Introduced what requirements engineering is |
| Day 1 | training 2 | Introduced the CBR method |
| Day 1 | practice 1 | Practiced the CBR method using a small-size SRS (S1) |
| Day 1 | training 3 | Introduced the RIMSM method |
| Day 1 | practice 2 | Practiced the RIMSM method using a small-size SRS (S2) |
| Day 2 | training 4 | Reviewed the CBR method |
| Day 2 | practice 3 | Practiced the CBR method using a small-size SRS (S3) |
| Day 2 | training 5 | Reviewed the RIMSM method |
| Day 2 | practice 4 | Practiced the RIMSM method using a small-size SRS (S4) |
| Day 2 | practice 5 | Practiced the CBR method using a small-size SRS (S5) |
| Day 2 | practice 6 | Practiced the RIMSM method using a small-size SRS (S6) |
| Day 3 | testing 1 | Inspected a small-size SRS (S7) |
| Day 3 | testing 2 | Inspected a small-size SRS (S8) |
| Day 3 | testing 3 | Inspected a medium-size SRS (S9) |
| Day 3 | testing 4 | Inspected a medium-size SRS (S10) |
| Day 3 | testing 5 | Inspected a large-size SRS (S11) |
| Day 3 | testing 6 | Inspected a large-size SRS (S12) |
Table 6. Design of testing sessions (cells give the inspection method used by each group).
| Session | SRS | Group 1: CSE (6) | Group 2: EE (5) | Group 3: ME (1) | Group 4: CSE (5) | Group 5: EE (5) | Group 6: ME (1) |
| testing session 1 | S7 (small size) | RIMSM | RIMSM | RIMSM | CBR | CBR | CBR |
| testing session 2 | S8 (small size) | CBR | CBR | CBR | RIMSM | RIMSM | RIMSM |
| testing session 3 | S9 (medium size) | RIMSM | RIMSM | RIMSM | CBR | CBR | CBR |
| testing session 4 | S10 (medium size) | CBR | CBR | CBR | RIMSM | RIMSM | RIMSM |
| testing session 5 | S11 (large size) | RIMSM | RIMSM | RIMSM | CBR | CBR | CBR |
| testing session 6 | S12 (large size) | CBR | CBR | CBR | RIMSM | RIMSM | RIMSM |
Table 7. Stability statistics.
| | RIMSM | CBR |
| Number of data points | 23 | 23 |
| Significance level | 0.05 | 0.05 |
| p-value (two-tailed) | 0.98 | 0.78 |
| Statistical power | 5.9% | 5.0% |
Table 8. Creation of background-uniform teams for small-size SRSs.
| Label | Method | Major | Team Creation | Teams of Size m = 2 |
| BU_R1 | RIMSM | CSE | Select m inspectors from Group 1 and combine their results in testing session 1; select m inspectors from Group 4 and combine their results in testing session 2 | 25 |
| BU_R2 | RIMSM | EE | Select m inspectors from Group 2 and combine their results in testing session 1; select m inspectors from Group 5 and combine their results in testing session 2 | 20 |
| BU_C1 | CBR | CSE | Select m inspectors from Group 1 and combine their results in testing session 2; select m inspectors from Group 4 and combine their results in testing session 1 | 25 |
| BU_C2 | CBR | EE | Select m inspectors from Group 2 and combine their results in testing session 2; select m inspectors from Group 5 and combine their results in testing session 1 | 20 |
Table 9. Creation of background-diverse teams for small-size SRSs.
| Label | Method | Major | Team Creation | Teams of Size m = 2 |
| BD_R | RIMSM | Both (CSE, EE) | Select m/2 inspectors from Group 1 and m/2 inspectors from Group 2, and combine their results in testing session 1; select m/2 inspectors from Group 4 and m/2 inspectors from Group 5, and combine their results in testing session 2 | 55 |
| BD_C | CBR | Both (CSE, EE) | Select m/2 inspectors from Group 1 and m/2 inspectors from Group 2, and combine their results in testing session 2; select m/2 inspectors from Group 4 and m/2 inspectors from Group 5, and combine their results in testing session 1 | 55 |
Table 10. Summary of the performance statistics for the BU_C1, BU_C2, and BD_C teams of size m = 2 in inspecting the small-size SRSs.
| Metric | BU_C1 (mean / std) | BU_C2 (mean / std) | BD_C (mean / std) | BU_C1 vs. BD_C (p / power / eff. size) | BU_C2 vs. BD_C (p / power / eff. size) |
| Q_m | 0.16 / 0.17 | 0.13 / 0.15 | 0.15 / 0.17 | 0.44 / 0.05 / 0.04 | 0.26 / 0.09 / 0.16 |
| Q_t | 0.39 / 0.22 | 0.32 / 0.16 | 0.35 / 0.21 | 0.23 / 0.12 / 0.19 | 0.23 / 0.10 / 0.18 |
| Q_t^D | 0.34 / 0.21 | 0.28 / 0.18 | 0.31 / 0.21 | 0.28 / 0.09 / 0.14 | 0.29 / 0.08 / 0.14 |
| Q_t^I | 0.05 / 0.06 | 0.04 / 0.05 | 0.04 / 0.06 | 0.25 / 0.10 / 0.16 | 0.28 / 0.08 / 0.14 |
| β_1 | 0.66 / 0.36 | 0.62 / 0.43 | 0.58 / 0.40 | 0.17 / 0.15 / 0.23 | 0.33 / 0.07 / 0.12 |
| β_2 | 0.22 / 0.27 | 0.23 / 0.35 | 0.26 / 0.33 | 0.26 / 0.09 / 0.14 | 0.35 / 0.07 / 0.10 |
| Q_{1:2} | 0.23 / 0.13 | 0.18 / 0.12 | 0.18 / 0.13 | 0.07 / 0.31 / 0.36 | 0.42 / 0.05 / 0.05 |
| Q_{2:2} | 0.11 / 0.15 | 0.11 / 0.16 | 0.13 / 0.17 | 0.33 / 0.07 / 0.10 | 0.32 / 0.07 / 0.12 |
Table 11. Summary of the performance statistics for the BU_C1, BU_C2, and BD_C teams of size m = 2 in inspecting the medium-size SRSs.
| Metric | BU_C1 (mean / std) | BU_C2 (mean / std) | BD_C (mean / std) | BU_C1 vs. BD_C (p / power / eff. size) | BU_C2 vs. BD_C (p / power / eff. size) |
| Q_m | 0.46 / 0.19 | 0.44 / 0.28 | 0.48 / 0.24 | 0.37 / 0.06 / 0.08 | 0.29 / 0.09 / 0.16 |
| Q_t | 0.64 / 0.12 | 0.63 / 0.17 | 0.63 / 0.17 | 0.43 / 0.05 / 0.04 | 0.46 / 0.05 / 0.03 |
| Q_t^D | 0.61 / 0.11 | 0.61 / 0.19 | 0.61 / 0.18 | 0.49 / 0.05 / 0.00 | 0.49 / 0.05 / 0.01 |
| Q_t^I | 0.03 / 0.05 | 0.02 / 0.03 | 0.02 / 0.04 | 0.29 / 0.09 / 0.15 | 0.35 / 0.06 / 0.09 |
| β_1 | 0.35 / 0.28 | 0.43 / 0.40 | 0.34 / 0.33 | 0.46 / 0.05 / 0.03 | 0.19 / 0.16 / 0.25 |
| β_2 | 0.65 / 0.28 | 0.57 / 0.40 | 0.66 / 0.33 | 0.46 / 0.05 / 0.03 | 0.19 / 0.16 / 0.25 |
| Q_{1:2} | 0.19 / 0.15 | 0.20 / 0.14 | 0.16 / 0.12 | 0.16 / 0.19 / 0.27 | 0.17 / 0.18 / 0.27 |
| Q_{2:2} | 0.42 / 0.21 | 0.41 / 0.31 | 0.45 / 0.25 | 0.26 / 0.09 / 0.15 | 0.32 / 0.08 / 0.14 |
Table 12. Summary of the performance statistics for the BU_C1, BU_C2, and BD_C teams of size m = 2 in inspecting the large-size SRSs.
| Metric | BU_C1 (mean / std) | BU_C2 (mean / std) | BD_C (mean / std) | BU_C1 vs. BD_C (p / power / eff. size) | BU_C2 vs. BD_C (p / power / eff. size) |
| Q_m | 0.55 / 0.22 | 0.56 / 0.26 | 0.57 / 0.24 | 0.35 / 0.07 / 0.10 | 0.44 / 0.05 / 0.04 |
| Q_t | 0.70 / 0.15 | 0.71 / 0.18 | 0.71 / 0.18 | 0.44 / 0.05 / 0.04 | 0.44 / 0.05 / 0.04 |
| Q_t^D | 0.68 / 0.16 | 0.70 / 0.18 | 0.69 / 0.18 | 0.41 / 0.06 / 0.06 | 0.42 / 0.06 / 0.06 |
| Q_t^I | 0.02 / 0.01 | 0.01 / 0.01 | 0.01 / 0.01 | 0.20 / 0.13 / 0.23 | 0.15 / 0.15 / 0.25 |
| β_1 | 0.25 / 0.20 | 0.27 / 0.27 | 0.24 / 0.23 | 0.37 / 0.06 / 0.09 | 0.30 / 0.09 / 0.16 |
| β_2 | 0.75 / 0.20 | 0.73 / 0.27 | 0.76 / 0.23 | 0.37 / 0.06 / 0.09 | 0.30 / 0.09 / 0.16 |
| Q_{1:2} | 0.15 / 0.08 | 0.15 / 0.10 | 0.13 / 0.09 | 0.22 / 0.12 / 0.20 | 0.22 / 0.12 / 0.22 |
| Q_{2:2} | 0.53 / 0.22 | 0.55 / 0.26 | 0.56 / 0.24 | 0.32 / 0.07 / 0.12 | 0.44 / 0.05 / 0.04 |
Table 13. Testing of H_{0,background}.
| SRS Size | BU_C1 vs. BD_C | BU_C2 vs. BD_C | BU_R1 vs. BD_R | BU_R2 vs. BD_R |
| Small | Not rejected | Not rejected | Not rejected | Not rejected |
| Medium | Not rejected | Not rejected | Not rejected | Not rejected |
| Large | Not rejected | Not rejected | Not rejected | Not rejected |
Table 14. Creation of technique-uniform teams for small-size SRSs.
| Label | Method | Team Creation | Number of Teams (m = 2) | Number of Teams (m = 4) |
| TU_R | RIMSM | Select m inspectors from Groups 1–3 and combine their results in testing session 1; select m inspectors from Groups 4–6 and combine their results in testing session 2 | 121 | 825 |
| TU_C | CBR | Select m inspectors from Groups 1–3 and combine their results in testing session 2; select m inspectors from Groups 4–6 and combine their results in testing session 1 | 121 | 825 |
Table 15. Creation of technique-diverse teams for small-size SRSs.
| Label | Method | Team Creation | Number of Teams (m = 2) | Number of Teams (m = 4) |
| TD | RIMSM, CBR | Select m/2 inspectors from Groups 1–3 and m/2 inspectors from Groups 4–6, and combine their results in testing sessions 1 and 2 | 264 | 7260 |
Table 16. Summary of the performance statistics for the TU_R, TU_C, and TD teams of size m = 2 in the inspection of the small-size SRSs.
| Metric | TU_C (mean / std) | TU_R (mean / std) | TD (mean / std) | TU_C vs. TD (p / power / eff. size) | TU_R vs. TD (p / power / eff. size) |
| Q_m | 0.15 / 0.18 | 0.047 / 0.09 | 0.027 / 0.05 | 0.00 / 1.00 / 1.10 | 0.01 / 0.83 / 0.32 |
| Q_t | 0.33 / 0.20 | 0.17 / 0.08 | 0.25 / 0.14 | 0.00 / 1.00 / 0.54 | 0.00 / 1.00 / 0.60 |
| Q_t^D | 0.28 / 0.20 | 0.16 / 0.08 | 0.22 / 0.15 | 0.00 / 0.97 / 0.42 | 0.00 / 0.98 / 0.45 |
| Q_t^I | 0.05 / 0.05 | 0.02 / 0.03 | 0.03 / 0.05 | 0.00 / 0.89 / 0.35 | 0.00 / 0.91 / 0.37 |
| β_1 | 0.57 / 0.42 | 0.74 / 0.43 | 0.84 / 0.34 | 0.00 / 1.00 / 0.72 | 0.01 / 0.70 / 0.27 |
| β_2 | 0.24 / 0.34 | 0.19 / 0.38 | 0.03 / 0.11 | 0.00 / 1.00 / 1.00 | 0.00 / 1.00 / 0.68 |
| Q_{1:2} | 0.16 / 0.12 | 0.11 / 0.08 | 0.20 / 0.13 | 0.00 / 0.80 / 0.31 | 0.00 / 1.00 / 0.74 |
| Q_{2:2} | 0.12 / 0.18 | 0.04 / 0.09 | 0.01 / 0.05 | 0.00 / 1.00 / 1.00 | 0.00 / 0.99 / 0.49 |
Table 17. Summary of the performance statistics for the TU_R, TU_C, and TD teams of size m = 4 in the inspection of the small-size SRSs.
| Metric | TU_C (mean / std) | TU_R (mean / std) | TD (mean / std) | TU_C vs. TD (p / power / eff. size) | TU_R vs. TD (p / power / eff. size) |
| Q_m | 0.05 / 0.10 | 0.002 / 0.018 | 0.001 / 0.002 | 0.00 / 1.00 / 1.44 | 0.01 / 1.00 / 0.24 |
| Q_t | 0.34 / 0.15 | 0.17 / 0.06 | 0.25 / 0.10 | 0.00 / 1.00 / 0.94 | 0.00 / 1.00 / 0.75 |
| Q_t^D | 0.30 / 0.15 | 0.16 / 0.05 | 0.22 / 0.10 | 0.00 / 1.00 / 0.74 | 0.00 / 1.00 / 0.57 |
| Q_t^I | 0.05 / 0.03 | 0.02 / 0.02 | 0.03 / 0.03 | 0.00 / 1.00 / 0.55 | 0.00 / 1.00 / 0.46 |
| β_1 | 0.35 / 0.36 | 0.43 / 0.36 | 0.64 / 0.34 | 0.00 / 1.00 / 0.83 | 0.00 / 1.00 / 0.60 |
| β_2 | 0.30 / 0.31 | 0.44 / 0.36 | 0.33 / 0.33 | 0.01 / 0.59 / 0.08 | 0.00 / 1.00 / 0.34 |
| β_3 | 0.24 / 0.30 | 0.12 / 0.29 | 0.02 / 0.08 | 0.00 / 1.00 / 1.81 | 0.00 / 1.00 / 0.87 |
| β_4 | 0.07 / 0.18 | 0.01 / 0.08 | 0.00 / 0.00 | 0.00 / 1.00 / 1.16 | 0.01 / 1.00 / 0.28 |
| Q_{1:4} | 0.07 / 0.05 | 0.06 / 0.04 | 0.12 / 0.06 | 0.00 / 1.00 / 0.76 | 0.00 / 1.00 / 0.97 |
| Q_{2:4} | 0.03 / 0.03 | 0.02 / 0.02 | 0.03 / 0.04 | 0.38 / 0.06 / 0.01 | 0.00 / 1.00 / 0.19 |
| Q_{3:4} | 0.03 / 0.04 | 0.01 / 0.02 | 0.00 / 0.01 | 0.00 / 1.00 / 1.95 | 0.00 / 1.00 / 0.60 |
| Q_{4:4} | 0.04 / 0.10 | 0.00 / 0.02 | 0.00 / 0.00 | 0.00 / 1.00 / 1.14 | 0.00 / 1.00 / 0.29 |
Table 18. Summary of testing hypothesis H_{0,technique}.
| SRS Size | TD vs. TU_C, m = 2 | TD vs. TU_C, m = 4 | TD vs. TU_R, m = 2 | TD vs. TU_R, m = 4 |
| Small | Reject | Reject | Reject | Reject |
| Medium | Reject | Reject | Reject | Reject |
| Large | Reject | Reject | Reject | Reject |
Table 19. Ratio of Q_m.
| SRS Size | TD vs. TU_C, m = 2 | TD vs. TU_C, m = 4 | TD vs. TU_R, m = 2 | TD vs. TU_R, m = 4 |
| Small | 560% | 5899% | 178% | 284% |
| Medium | 228% | 733% | 119% | 336% |
| Large | 213% | 552% | 117% | 212% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
