1. Introduction
Software-intensive and cyber–physical systems rarely function within stable or unchanging environments. In requirements engineering, this dependency is recognized by defining the system specification in terms of expected domain conditions, i.e., the behavior of the system is prescribed given certain environmental assumptions, and even a correctly implemented design may fail when those assumptions no longer hold [1,2]. Goal-oriented requirements engineering extends this view by explicitly modeling these domain properties, obstacles, and trade-offs among goals, providing mechanisms to surface and analyze assumptions and uncertainties early in development [3,4].
Despite these mechanisms, assumptions are often implicit or weakly managed in practice. According to Steingruebl et al. [5], developers often respond to incomplete or unclear requirements by introducing implicit assumptions to address gaps in understanding. Because these assumptions typically remain undocumented and unchecked, they later contribute to inconsistencies and rework [6,7].
Research on assumption management has emphasized the need for capturing, validating, and tracking assumptions throughout development [8], while subsequent studies have explored how assumptions can be mined, monitored, or refined as knowledge evolves [9,10,11]. Thus, making assumptions explicit and keeping them synchronized with reality is essential for dependable systems [12].
Model-based systems engineering (MBSE) promises stronger discipline around such concerns. Languages and tools such as SysML allow engineers to relate requirements, behavior, and structure within a single modeling environment, enabling consistency checks and systematic analysis [13]. MBSE has also been combined with formal analysis, for example, linking SysML to model checking, to reason about safety and context-dependent behavior [14,15,16]. In safety-critical settings, specialized profiles and guidelines capture hazards, safety cases, and interface assumptions directly in SysML artifacts [17,18,19,20]. Nevertheless, empirical evidence shows that while MBSE notations can represent assumptions and their dependencies, they provide limited support for how assumptions evolve over time in the face of uncertainty, new information, and changing operational contexts [21,22]. This gap is particularly pressing in cyber–physical domains where environmental dynamics and uncertainty are inherent [23,24,25,26,27].
In parallel, the software engineering community has used the lens of technical debt to capture the long-term costs of short-term trade-offs in software projects [28,29,30,31]. These trade-offs, while often necessary to meet time or resource constraints, incur “interest” in the form of increased maintenance, reduced quality, and loss of architectural integrity. Historically, research has focused on implementation- and architecture-level debt, but recent studies have emphasized that similar debt can originate much earlier in the lifecycle, during requirements analysis [28,32,33,34,35]. In this context, requirements technical debt (RTD) refers to the accumulation of deficiencies such as ambiguity, incompleteness, or deferred decisions that later require rework or lead to inconsistencies [36]. Ernst conceptualized RTD as the gap between an ideal requirements solution and the one actually implemented under project constraints [32]. Subsequent studies have shown that this “distance” grows when assumptions are not explicitly validated or tracked, especially in dynamic environments [7].
Although the role of assumptions is acknowledged, we still lack quantitative evidence that captures how assumption behavior, especially its volatility, relates to the accumulation of requirements-level debt during early modeling.
This paper addresses that need by introducing and evaluating assumptions volatility, i.e., the extent to which environmental assumptions change or become invalid during system modeling. Building on prior work that conceptualizes the quantification of RTD [28], we adopt established RTD indicators as empirical proxies for rework and inconsistency in early modeling. Perera et al.’s Requirements Technical Debt Quantification Model (RTDQM) formalizes measurable components of RTD. Inspired by this model, our study examines how assumptions volatility relates to these forms of RTD accumulation. Specifically, we use rework ratio, inconsistency density, and correction count as RTD indicators and analyze their statistical association with volatility measures.
To this end, we analyzed 89 environmental assumptions derived from our prior controlled modeling study of a vehicle cruise-control system [37]. We defined three volatility measures, i.e., Assumption Change (ACR), Invalidation Ratio (IR), and Dependency Density (DD), to capture how assumptions evolve, become invalid, or interrelate during modeling, and evaluated their relationships with RTD indicators using correlation and regression analyses.
Our study makes two contributions. The first is a set of metrics for quantifying how environmental assumptions shift during modeling. The second is an empirical analysis linking these shifts to different forms of requirements-related rework.
Although prior work has examined environmental assumptions and explored their relationship to RTD, these studies have primarily taken a qualitative perspective. In contrast, the present work introduces explicit volatility metrics (ACR, IR, and DD) and empirically evaluates their association with quantified RTD indicators. To the best of our knowledge, no prior study has provided a statistical examination of how environmental assumption volatility corresponds to rework or correction effort during early system modeling.
The rest of the article is structured as follows: Section 2 provides background information and related work. Section 3 presents our methodology. Results and analysis are reported in Section 4. Discussion can be found in Section 5, and finally, Section 6 concludes the paper.
3. Methodology
This section outlines the dataset and the steps taken to measure assumption volatility and its relationship to RTD. Our goal was to work with realistic early-stage modeling data and apply straightforward, reproducible metrics rather than rely on tooling not typically present in early MBSE practice.
3.1. Research Design
The work follows a quantitative correlational design. Figure 1 gives an overview of the workflow. In brief, the process involved extracting environmental assumption data, computing three volatility measures, calculating RTD indicators, and applying statistical tests to explore their relationships. A simplified vehicle cruise-control system model was used as a reference case to illustrate how environmental assumptions influence system behavior. It was chosen because it presents clear dependencies between environmental factors, e.g., road surface, weather, traction, and sensors, and system parameters, e.g., braking, acceleration, and perception.
The environmental assumptions used in this study were not created specifically for this paper; they originate from a prior controlled empirical study published in IEEE Access [37]. The model used in that study represents an automotive cruise-control subsystem and served as the basis for the original assumption-elicitation task. In that earlier work, 95 trained modelers were each asked to propose five environmental assumptions relevant to safety requirements for a cruise-control system model. This produced 473 raw assumptions, which were anonymized and deduplicated to yield 190 unique statements, and then reclassified to separate true environmental assumptions from requirements, leaving 123 final assumptions. For the present analysis, we reused these 123 validated assumptions and applied an additional filtering step to focus on those with complete model information, resulting in the final set of 89 assumptions used in our correlation and regression analyses. Because our prior work provides the full task description, subject instructions, sample assumptions, and dataset publication details, the provenance and reproducibility of the dataset are well documented. We build directly on that curated dataset and augment it with volatility and RTD coding as described below.
Each environmental assumption was linked to its related model elements, i.e., functions, constraints, behaviors, or requirements. During the design process, notes were kept on assumptions that were revised or discarded. This produced a realistic picture of how assumptions shift during conceptual modeling. To keep the analysis clear and reproducible, we organized the assumptions into a simple tabular structure. Rather than attempting to reconstruct detailed version histories, each assumption was treated as one unit and recorded with a small set of attributes that capture its behavior:
a unique identifier,
whether it underwent at least one substantive revision (ACR, 0/1),
whether it was later invalidated or replaced (IR),
the number of model elements depending on it (DD),
the proportion of rework tied to it (RR),
any inconsistencies linked to it (ID), and
the number of corrective actions associated with it (CC).
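To make this tabular structure concrete, the sketch below shows one plausible way to encode each assumption record in Python. The field names and the example values are illustrative only and do not correspond to any entry in the actual dataset.

```python
from dataclasses import dataclass

@dataclass
class AssumptionRecord:
    """One row of the (hypothetical) assumption table described above."""
    assumption_id: str   # unique identifier, e.g., "E01"
    acr: int             # 1 if at least one substantive revision occurred, else 0
    ir: int              # 1 if the assumption was later invalidated or replaced, else 0
    dd: int              # number of model elements depending on the assumption
    rr: float            # proportion of linked requirements that required rework (0-1)
    id_count: int        # number of inconsistencies linked to the assumption
    cc: int              # number of corrective actions attributed to the assumption

# Illustrative entry only; the values are invented for demonstration.
example = AssumptionRecord("E01", acr=1, ir=0, dd=3, rr=0.10, id_count=1, cc=1)
```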
This approach reflects how environmental assumptions are usually managed during early modeling activities. Engineers often revise or abandon assumptions as their understanding of the system evolves, but detailed version histories are seldom kept at that stage. Recording these values in a consistent way allows the dataset to be shared and reused without requiring access to proprietary models or tooling environments.
3.2. Volatility Measures
We considered three indicators of assumption movement: Assumption Change (ACR), Invalidation Ratio (IR), and Dependency Density (DD).
Assumption Change (ACR). An assumption was coded as ACR = 1 if it underwent at least one substantive revision; otherwise, ACR = 0. This binary operationalization captures meaningful change without requiring full revision histories.
Invalidation Ratio (IR). IR was operationalized as a binary indicator (IR = 1 if the assumption was later judged incorrect or obsolete, and IR = 0 otherwise).
Dependency Density (DD). DD is a structural measure rather than a temporal one: it reflects how widely an assumption is embedded in the requirements model. DD is therefore intended to complement, rather than replace, the event-based indicators ACR and IR by capturing how far the effects of a change are likely to spread when volatility does occur.
In principle, volatility can also be described in terms of the frequency, magnitude, and timing of changes. However, the available modeling data did not preserve full revision histories in a way that would support such fine-grained temporal analysis. For this reason, ACR and IR are operationalized here as binary indicators that capture whether a substantive change or invalidation occurred at least once, rather than attempting to model its detailed temporal profile. These event-based measures should be viewed as a first approximation, with richer temporal characterizations left as an avenue for future work.
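As a minimal illustration of how these measures could be derived from coded modeling notes and traceability links, the following sketch assumes a hypothetical per-assumption log of revision and invalidation events and a list of assumption-to-element dependency links; it is not the tooling used in the study.

```python
from collections import defaultdict

def code_volatility(revision_events, invalidation_events, dependency_links):
    """Derive ACR and IR (binary) and DD (count) for each assumption.

    revision_events:     dict {assumption_id: [substantive revision notes]}
    invalidation_events: dict {assumption_id: [invalidation notes]}
    dependency_links:    iterable of (assumption_id, model_element_id) pairs
    """
    dd = defaultdict(set)
    for a_id, element_id in dependency_links:
        dd[a_id].add(element_id)

    assumption_ids = set(revision_events) | set(invalidation_events) | set(dd)
    coded = {}
    for a_id in sorted(assumption_ids):
        coded[a_id] = {
            "ACR": 1 if revision_events.get(a_id) else 0,     # at least one substantive revision
            "IR": 1 if invalidation_events.get(a_id) else 0,  # judged incorrect or obsolete
            "DD": len(dd.get(a_id, set())),                   # distinct dependent model elements
        }
    return coded
```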
To minimize bias in defining volatility constructs, the coding of ACR and IR was conducted independently by two reviewers who were not involved in the study design or authorship. Inter-rater agreement, assessed with Cohen’s κ, indicated substantial reliability. Disagreements were resolved through discussion between the reviewers without author intervention. The finalized coded dataset was then used for subsequent correlation and regression analyses. This independent-coding arrangement was intended to ensure methodological neutrality and reproducibility of the volatility measures.
To reduce ambiguity in conceptually gray cases, we distinguished between refinements, substantive shifts, and invalidations. Refinements refer to wording changes that leave the original meaning or scope intact and were therefore coded as unchanged. Substantive shifts occur when the scope or interpretation of an assumption changes and were coded as ACR = 1. Invalidations arise when an assumption can no longer hold under the updated model context and were coded as IR = 1.
To illustrate the coding rules, consider two typical cases from the dataset. An assumption related to road conditions was initially stated as “the road surface is paved and well-maintained.” When the modeling team later expanded the operational context to include uneven or partially degraded surfaces, the assumption was revised accordingly, and was therefore coded as ACR = 1. In contrast, an assumption concerning ideal weather conditions became incompatible with an updated scenario that introduced rain or reduced visibility; because the original assumption could no longer be satisfied under the expanded environment, it was coded as IR = 1. By comparison, assumptions that underwent only minor wording clarifications, without altering their scope or polarity, were coded as unchanged. These examples reflect how revisions and invalidations were directly grounded in the modeling notes.
3.3. RTD Indicators
RTD was operationalized using three measures adapted from the Requirements Technical Debt Quantification Model (RTDQM) [36]:
Rework Ratio (RR): the proportion of requirements updated due to a given assumption,
Inconsistency Density (ID): the number of clarification issues or mismatches observed,
Correction Count (CC): the number of corrective actions attributed to the assumption.
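Assuming that RR is normalized by the number of requirements linked to each assumption (consistent with the worked example reported with Table 1), one plausible formalization of the three indicators for an assumption a is:

\[
\mathrm{RR}(a) = \frac{\lvert \{\, r \in R(a) : r \text{ reworked due to } a \,\} \rvert}{\lvert R(a) \rvert},
\qquad
\mathrm{ID}(a) = \lvert \mathcal{I}(a) \rvert,
\qquad
\mathrm{CC}(a) = \lvert \mathcal{C}(a) \rvert,
\]

where \(R(a)\) is the set of requirements linked to assumption \(a\), and \(\mathcal{I}(a)\) and \(\mathcal{C}(a)\) are the inconsistencies and corrective actions explicitly attributed to \(a\).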
For each assumption, RR, ID, and CC were derived from the modeling notes and logs using a simple attribution rule. An RTD event, i.e., rework, inconsistency, or correction, was assigned to a given assumption only when the notes or other project artifacts explicitly connected that event to that assumption or to a requirement directly derived from it. When a note clearly described a change or issue as involving several assumptions, the event was recorded for each of those assumptions. Entries that could not be reliably tied to specific assumptions were left unassigned.
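A simple sketch of this attribution rule is shown below. The event structure, an explicit list of assumption identifiers attached to each logged event, is a simplifying assumption about how the modeling notes were encoded rather than a description of the actual artifacts.

```python
def attribute_rtd_events(events, linked_requirements):
    """Tally RR, ID, and CC per assumption from logged modeling events.

    events: list of dicts such as
        {"kind": "rework" | "inconsistency" | "correction",
         "assumptions": ["E09", ...],   # explicit links; empty list -> left unassigned
         "requirement": "REQ-12"}       # present for rework events
    linked_requirements: dict {assumption_id: set of linked requirement ids}
    """
    reworked = {a: set() for a in linked_requirements}
    id_count = {a: 0 for a in linked_requirements}
    cc_count = {a: 0 for a in linked_requirements}

    for event in events:
        for a in event.get("assumptions", []):   # multi-assumption events count once per assumption
            if a not in linked_requirements:
                continue                          # cannot be reliably tied to a known assumption
            if event["kind"] == "rework":
                reworked[a].add(event["requirement"])
            elif event["kind"] == "inconsistency":
                id_count[a] += 1
            elif event["kind"] == "correction":
                cc_count[a] += 1

    rr = {a: (len(reworked[a]) / len(reqs) if reqs else 0.0)
          for a, reqs in linked_requirements.items()}
    return rr, id_count, cc_count
```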
It is important to note that RR, ID, and CC represent observable proxies for effort and clarification activity in early-stage modeling, rather than a full operationalization of the principal and interest components described in Perera’s RTDQM. They capture the portion of RTD that is visible in the available artifacts and logs, and should therefore be interpreted as partial indicators of RTD, not as coextensive with the broader theoretical construct.
While formal construct validation was not conducted, the definitions of RR, ID, and CC were derived directly from the conceptual dimensions of RTD [36]. Future work will triangulate these indicators through expert review or independent RTD scoring to confirm their empirical alignment with rework, inconsistency, and correction effort.
To strengthen the construct validity of these indicators, we incorporated an expert review step. Two reviewers, neither involved in the modeling or coding activities, independently reviewed the mapping between the raw modeling notes and the operational definitions of RR, ID, and CC. They evaluated whether each indicator aligned with the conceptual dimensions described in the RTDQM model and whether the extracted evidence (e.g., clarification notes, corrective actions, rework annotations) appropriately reflected rework, inconsistency, and correction count. Inter-rater agreement, measured with Cohen’s κ, was high, and minor discrepancies were resolved through discussion. This review step helped ensure that the indicators used in this study are meaningfully tied to established constructs of requirements technical debt and not merely artifacts of the coding process.
3.4. Analysis Procedure
For each environmental assumption, i.e., N = 89, we computed volatility predictors, i.e., ACR (binary), IR (binary indicator), and DD (z-standardized), and recorded RTD indicators, i.e., RR (0–1), ID (count), and CC (count). We first assessed pairwise associations using Pearson’s r (point-biserial for binary variables) and tie-corrected Spearman’s ρ, reporting 95% bootstrap confidence intervals (2000 resamples) with Benjamini–Hochberg false-discovery-rate adjustment. We then fitted beta regression (logit link) for RR and negative binomial (NB2) models for ID and CC (including exposure offsets where appropriate). Results are reported as odds ratios (ORs) or incidence-rate ratios (IRRs) with 95% CIs, alongside standardized marginal effects (change in outcome SD per 1-SD change in predictor) and partial R² for variable importance. Model adequacy was evaluated via dispersion tests, residual and influence diagnostics, and sensitivity analyses (fractional logit for RR; zero-inflated NB where indicated).
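The sketch below outlines how such an analysis could be reproduced in Python with pandas, SciPy, and statsmodels, assuming the coded dataset is available as a data frame with columns ACR, IR, DD, RR, ID, and CC. For brevity it uses the fractional-logit variant (a binomial GLM with logit link) for RR, which our sensitivity analysis treats as an alternative to beta regression, and a negative binomial family with fixed dispersion for the counts; exposure offsets, marginal effects, and partial R² are omitted.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr
from statsmodels.stats.multitest import multipletests
import statsmodels.api as sm
import statsmodels.formula.api as smf

def bootstrap_ci(x, y, stat=lambda a, b: pearsonr(a, b)[0], n_boot=2000, seed=0):
    """Percentile bootstrap CI for a bivariate statistic such as Pearson's r."""
    rng = np.random.default_rng(seed)
    n = len(x)
    reps = [stat(x[idx], y[idx]) for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return np.percentile(reps, [2.5, 97.5])

def analyze(df: pd.DataFrame):
    df = df.copy()
    df["DD_z"] = (df["DD"] - df["DD"].mean()) / df["DD"].std(ddof=0)  # z-standardize DD

    # Pairwise correlations with Benjamini-Hochberg FDR adjustment.
    pvals, rows = [], []
    for pred in ["ACR", "IR", "DD_z"]:
        for out in ["RR", "ID", "CC"]:
            r, p = pearsonr(df[pred], df[out])       # point-biserial when pred is binary
            rho, _ = spearmanr(df[pred], df[out])
            lo, hi = bootstrap_ci(df[pred].to_numpy(), df[out].to_numpy())
            rows.append((pred, out, r, rho, lo, hi, p))
            pvals.append(p)
    qvals = multipletests(pvals, method="fdr_bh")[1]

    # Outcome-appropriate regressions: fractional logit for RR, NB for counts.
    rr_model = smf.glm("RR ~ ACR + IR + DD_z", data=df,
                       family=sm.families.Binomial()).fit()
    id_model = smf.glm("ID ~ ACR + IR + DD_z", data=df,
                       family=sm.families.NegativeBinomial()).fit()
    cc_model = smf.glm("CC ~ ACR + IR + DD_z", data=df,
                       family=sm.families.NegativeBinomial()).fit()
    return rows, qvals, rr_model, id_model, cc_model
```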
Table 1 presents a small illustrative excerpt of the dataset to clarify structure and coding format.
As an example, consider Assumption E09, i.e., “The vehicle is traveling on a paved, well-maintained road.” As the project scope expanded to include unpaved surfaces, this assumption was revised and ultimately invalidated (ACR = 1, IR = 1). It had 18 associated requirements (artifacts). Following the change, RR = 0.28, i.e., 28% of its 18 linked requirements required rework, along with associated inconsistencies and correction activities. In short, E09 shows how a shifting environmental condition can propagate across multiple requirements and generate requirements-level technical debt.
Given the modest sample size, 89 environmental assumptions, and the possibility of overlap among volatility predictors, we also evaluated model parsimony and the potential risk of overfitting. To do so, we compared full models with reduced alternatives and applied penalized regression as an additional check. These procedures are reported in Section 4.5 and serve to confirm that the observed relationships do not hinge on model complexity or collinearity among predictors.
4. Results
This section presents the descriptive statistics and analytical results for the environmental assumptions included in our study. We first describe general volatility patterns observed in the dataset, then report the associations between volatility indicators and RTD measures, followed by regression results.
4.1. Descriptive Summary
Out of the 89 environmental assumptions, 48 (54%) were revised at least once during refinement, indicating that a substantial portion of environmental knowledge evolved as the system understanding matured. A smaller subset, 17 assumptions (19%), were eventually judged invalid or no longer relevant, either due to updated stakeholder insight or discoveries during modeling. While many assumptions remained stable across iterations, these figures reflect a natural level of uncertainty and learning typical in early MBSE work.
Dependency density varied among assumptions. Most connected to only a few model elements (median = 2), though one extended to seven. This pattern suggests that while many assumptions touch isolated areas, a small number serve as key contextual links across the model.
RTD indicators followed similar patterns. For assumptions associated with downstream effort, rework ratios ranged from 0.03 to 0.28, meaning that in the most extreme cases, more than a quarter of related elements required modification. Inconsistency counts ranged from 0 to 5, with most assumptions producing none, but a handful requiring multiple clarification cycles. Correction actions were less frequent overall, though they were concentrated among assumptions with broader model impact, which is an expected profile in iterative system design, where core assumptions tend to receive more attention and cause wider impacts when they change.
4.2. Correlation Analysis
We examined associations between volatility measures (ACR, IR, DD) and RTD indicators (RR, ID, CC) using both Pearson’s r (point-biserial for binary variables) and tie-corrected Spearman’s ρ on the N = 89 analysis set, reporting 95% bootstrap CIs and Benjamini–Hochberg FDR-adjusted q-values. Table 3 reports Pearson’s r, and Table 4 reports Spearman’s ρ.
All three volatility measures showed positive, statistically significant relationships with the RTD indicators. Assumptions that were revised or invalidated tended to be associated with more rework and clarification activity, consistent with the idea that evolving environmental knowledge introduces friction in early MBSE. Dependency density exhibited the strongest correlations across RR, ID, and CC, reinforcing the intuition that assumptions embedded in more parts of the model create broader effects when they change.
Table 3. Pearson’s r with 95% bootstrap CIs for volatility–RTD pairs (N = 89). ACR/IR: point-biserial r.
| Pair | r | 95% CI | p | q |
|---|---|---|---|---|
| ACR–RR | 0.42 | [0.20, 0.60] | 0.001 | 0.004 |
| ACR–ID | 0.37 | [0.14, 0.56] | 0.004 | 0.010 |
| ACR–CC | 0.35 | [0.11, 0.54] | 0.007 | 0.014 |
| IR–RR | 0.38 | [0.16, 0.57] | 0.003 | 0.009 |
| IR–ID | 0.34 | [0.10, 0.53] | 0.008 | 0.015 |
| IR–CC | 0.31 | [0.07, 0.51] | 0.015 | 0.022 |
| DD–RR | 0.50 | [0.32, 0.64] | <0.001 | 0.002 |
| DD–ID | 0.46 | [0.27, 0.61] | <0.001 | 0.003 |
| DD–CC | 0.43 | [0.23, 0.59] | 0.001 | 0.004 |
Table 4. Spearman’s ρ (tie-corrected) with 95% CIs for volatility–RTD pairs (N = 89).
| Pair | ρ | 95% CI | p | q |
|---|---|---|---|---|
| ACR–RR | 0.40 | [0.18, 0.58] | 0.001 | 0.004 |
| ACR–ID | 0.35 | [0.12, 0.54] | 0.006 | 0.012 |
| ACR–CC | 0.33 | [0.09, 0.53] | 0.011 | 0.018 |
| IR–RR | 0.37 | [0.15, 0.56] | 0.004 | 0.010 |
| IR–ID | 0.33 | [0.09, 0.53] | 0.012 | 0.018 |
| IR–CC | 0.30 | [0.05, 0.50] | 0.023 | 0.030 |
| DD–RR | 0.50 | [0.31, 0.65] | <0.001 | 0.002 |
| DD–ID | 0.47 | [0.28, 0.62] | <0.001 | 0.003 |
| DD–CC | 0.44 | [0.24, 0.60] | 0.001 | 0.004 |
4.3. Regression Models
To assess predictive value with outcome-appropriate estimands, we fitted beta regression (logit link) for RR and negative binomial (NB2) models for ID and CC (N = 89). Predictors were ACR (change event, 0/1), IR (binary invalidation indicator), and DD (z-standardized). Odds ratios (ORs) and incidence-rate ratios (IRRs) are reported with 95% confidence intervals. The models were further examined using standardized marginal effects and partial R² measures. False discovery rate was controlled across the three primary models using Benjamini–Hochberg.
Across outcomes, DD exhibited the largest and most consistent effects (significant after FDR correction), indicating that assumptions linked to more artifacts tend to be associated with greater rework and inconsistency effort when they change. ACR contributed in two of the three models, consistent with the idea that any change event, minor or major, carries practical consequences. IR showed a positive but weaker influence, which is plausible given that invalidations were comparatively infrequent.
Model diagnostics supported the specifications: RR residuals under the beta model showed no link-function misspecification; ID/CC displayed overdispersion justifying NB over Poisson; influence diagnostics did not identify points that altered inference; and multicollinearity was low (VIF < 3). Sensitivity analyses yielded consistent conclusions (fractional logit for RR; zero-inflated NB where indicated; re-estimation after excluding high-influence points).
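As an illustration of the collinearity check, the snippet below computes variance inflation factors for the three predictors with statsmodels; it assumes the same data frame layout (including the z-standardized DD_z column) as in the analysis sketch in Section 3.4 and is not the exact script used in the study.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def predictor_vifs(df: pd.DataFrame) -> pd.Series:
    """Variance inflation factors for ACR, IR, and z-standardized DD."""
    X = df[["ACR", "IR", "DD_z"]].astype(float)
    X = sm.add_constant(X)  # include an intercept so VIFs reflect the full design
    vifs = {col: variance_inflation_factor(X.values, i)
            for i, col in enumerate(X.columns) if col != "const"}
    return pd.Series(vifs)  # values below roughly 3 indicate low multicollinearity
```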
As summarized in Figure 2, DD had the strongest and most stable association across RR, ID, and CC. In practical terms, assumptions that are more interconnected tend to drive more rework and inconsistencies when updated. Change events (ACR = 1) also showed noticeable effects in two models, while IR contributed positively but less strongly. Overall, both the occurrence of changes (ACR events) and the structural importance of assumptions (DD) help explain how requirements technical debt accumulates over time.
The effect sizes also give a sense of their practical meaning. For example, higher DD values often corresponded to noticeably more rework. The effects for ACR and IR are smaller but still meaningful: a single change event or invalidation corresponds to a modest increase in the RTD indicators, consistent with the idea that even isolated revisions can trigger downstream adjustments. In other words, the magnitude of these effects suggests that structural interconnectedness (DD) plays a more prominent role than the occurrence of change alone, though both dimensions contribute to the accumulation of technical debt during modeling.
To sum up, our results indicate that environmental assumptions volatility plays a measurable role in shaping RTD during early system modeling. Assumptions that tie into several parts of the model, e.g., those with higher DD, tend to trigger more rework and corrections when they change. From a modeling practice perspective, identifying high-DD assumptions early can help modelers anticipate where changes may have wider impact and prioritize early validation or impact analysis accordingly. While our data reflect early-stage development rather than large-scale industrial deployments, the consistency of effects across all three RTD indicators provides empirical support for environmental assumptions as a material source of technical debt in MBSE contexts.
4.4. Sensitivity Analysis
To assess whether the binary volatility indicators (ACR and IR) were too coarse, we repeated the analyses using two extended measures: the revision count for assumptions that changed, and the timing of invalidation (early vs. late). These finer-grained variables produced results that were consistent with the main findings. Revision count showed slightly stronger effects but did not alter the significance or direction of associations, and the invalidation-timing variable behaved similarly to the binary IR indicator. Dependency density remained the most influential predictor in all models. These checks indicated that using binary codes did not materially alter the results.
4.5. Model Parsimony and Overfitting Checks
To examine whether the regression models were overly complex relative to the dataset size, we conducted a set of parsimony and stability checks. First, we compared the full models (ACR, IR, and DD as predictors) with reduced specifications using AIC and BIC. Across all outcomes, the models containing DD consistently outperformed those excluding it, while adding ACR and IR provided only modest incremental improvement. Second, we fitted LASSO-penalized regressions to assess predictor stability. DD was selected in more than 90% of cross-validation runs, whereas ACR and IR appeared less consistently. Finally, variance inflation factors remained below 3, suggesting that multicollinearity among predictors was not a major concern.
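A simplified version of the predictor-stability check could look like the following, using an L1-penalized linear model on bootstrap resamples as a stand-in for the penalized beta and negative binomial fits; the resample count and the nonzero-coefficient threshold are illustrative choices, not the exact settings used in the study.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

def selection_frequencies(df: pd.DataFrame, outcome="RR", n_boot=200, seed=0):
    """How often each predictor receives a nonzero LASSO coefficient across resamples."""
    rng = np.random.default_rng(seed)
    predictors = ["ACR", "IR", "DD_z"]
    X = df[predictors].to_numpy(dtype=float)
    y = df[outcome].to_numpy(dtype=float)
    counts = np.zeros(len(predictors))

    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))       # bootstrap resample of the 89 assumptions
        model = LassoCV(cv=5).fit(X[idx], y[idx])   # penalty strength chosen by cross-validation
        counts += (np.abs(model.coef_) > 1e-8)

    return pd.Series(counts / n_boot, index=predictors)
```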
To sum up, these results indicate that the main findings are not artifacts of model complexity or predictor overlap. DD contributes most of the explainable variation, with ACR and IR adding smaller but directionally consistent effects, and the conclusions remain stable across alternative model specifications.
5. Discussion
In this article, our objective was to examine whether changes in environmental assumptions during early system modeling are associated with the accumulation of RTD. Using a dataset drawn from our earlier empirical work [37], we found consistent evidence that environmental assumptions volatility corresponds to measurable rework effort. Although the modeling effort was exploratory rather than industrial, the results appear to align with long-standing observations in requirements engineering. That is, when assumptions about the environment shift, design work tends to adapt accordingly [44].
To better understand how these dynamics manifested in practice, we examined the patterns of change and their effects across the analyzed assumptions. First, while many assumptions remained stable, just over half of them, i.e., 54%, underwent at least one revision. This reflects the natural evolution of understanding during early modeling. Specifically, many assumptions are initially grounded in stakeholder knowledge or domain conventions, while others are tentative and revised as more information becomes available. Second, invalidation events, where assumptions proved false or obsolete, were less common but tended to have noticeable effects on later development activities. These invalidations often occurred when initial expectations about operational constraints or actor behavior were refined, which emphasizes the importance of early validation and stakeholder confirmation. Third, the clearest pattern appeared in how environmental assumptions were connected. When an assumption is linked to many parts of the model, a single change often sets off a cascade of rework. In other words, not all assumptions carry the same weight, i.e., those at the structural core of a model can quietly accumulate debt if they are left unchecked.
The volatility measures we used capture only part of what can happen when assumptions change during modeling. In reality, assumptions can shift in many ways; sometimes they change early or late in the process, sometimes the alteration is small or quite disruptive, and in other cases, a single change affects several parts of the model at once. A fuller account of volatility would need to consider these different patterns more explicitly.
Prior studies have highlighted the importance of documenting and validating assumptions in early design stages [53,59,60]. The data we analyzed support these points, and they originate from modeling activities that resemble what engineers actually do in practice rather than an idealized or overly polished scenario. Rather than relying on full version histories or automated traceability, we used simple and manually feasible indicators, the kind an analyst or research team could realistically collect in agile MBSE contexts.
It is also worth noting that the three RTD indicators used here, i.e., RR, ID, and CC, remain practical proxies rather than exhaustive representations of requirements technical debt. In real industrial projects, RTD can manifest in forms that are not fully captured by modeling notes or localized corrective actions, such as architectural drift or delayed decision impacts. The indicators in our study reflect the forms of effort that could be reliably extracted from the available modeling artifacts and should be interpreted in that light. Future empirical work in industrial contexts will be important for understanding how these proxies align with, or differ from, the broader spectrum of RTD signatures encountered in practice.
A natural question concerns how far the findings extend beyond the cruise-control scenario we used as our working case. The CCS model is intentionally compact and well structured, which may not reflect the traceability practices or modeling maturity seen in large industrial MBSE settings. Some organizations maintain rigorous versioned traces across requirements, behaviors, and physical constraints, while others rely on more informal modeling notes. Because assumption evolution can depend on how teams document and negotiate design decisions, certain aspects of our dataset may be tied to the characteristics of the CCS model itself. At the same time, several elements of the volatility constructs, such as change events, invalidation, and dependency spread, appear to generalize to a wide range of cyber–physical systems where environmental conditions shape behavior.
Although some aspects of these results may be relevant to other cyber–physical domains, any broader claims about generalizability should be made with caution. The study is exploratory in scope and based on a single, academically controlled modeling design with 89 environmental assumptions, which limits the extent to which the findings can be generalized to industrial MBSE practice. Industrial settings differ substantially in scale, tool support, and modeling culture, which can influence both how assumptions evolve and how rework is documented. As such, the patterns reported here should be viewed as initial evidence rather than definitive statements about assumption behavior across domains. A more complete assessment will require replication in richer and more heterogeneous settings, including diverse CPS domains and industrial-grade modeling environments.
A practical way to bring these metrics into MBSE tools would involve few steps. First, tools would need to record when assumptions are added, revised, or removed so that ACR and IR can be derived automatically from the model’s change history. Second, DD could be computed directly from existing links between assumptions and requirements, using simple graph queries to flag assumptions that touch many parts of the model. Finally, presenting these values in a small dashboard or warning panel would help modelers notice when a change to a key assumption is likely to have a broader impact. We believe these kinds of features would make volatility monitoring a natural part of everyday MBSE work. These capabilities also align with ongoing MBSE 2.0 efforts toward more context-aware and intelligent modeling environments [
56]. The proposed metrics (ACR, IR, and DD) make it easier for tools to reveal how assumptions change and how widely those changes may propagate, giving modelers a lightweight way to anticipate and manage the effects of evolving environmental conditions.
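To illustrate the second step, the sketch below uses networkx to compute DD from assumption-to-element trace links and to flag widely connected assumptions; the node naming and the warning threshold are hypothetical and are not taken from any particular MBSE tool API.

```python
import networkx as nx

def flag_high_dd_assumptions(trace_links, threshold=5):
    """Compute DD from trace links and flag widely connected assumptions.

    trace_links: iterable of (assumption_id, model_element_id) pairs,
                 e.g., exported from a SysML model's trace or dependency relations.
    """
    links = list(trace_links)
    g = nx.Graph()
    for assumption, element in links:
        g.add_edge(("A", assumption), ("E", element))  # bipartite: assumptions vs. model elements

    dd = {a: g.degree(("A", a)) for a in {a for a, _ in links}}        # distinct dependents
    flagged = {a: d for a, d in dd.items() if d >= threshold}          # candidates for early review
    return dd, flagged
```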
Threats to Validity
As with any empirical work, several factors may shape how our results should be interpreted. Some issues are unavoidable when working with early-stage modeling data, while others reflect choices we made to keep the study lightweight and realistic.
For construct validity, one source of uncertainty lies in how we defined and coded the volatility constructs. Although we introduced clearer decision rules and inter-rater checks, there is always a risk that what one analyst sees as a “substantive revision” another might regard as a routine clarification. For instance, changing an assumption from “driver is attentive” to “driver maintains situational awareness under moderate workload” sits in a gray zone: is this a refinement, or an actual shift in meaning? We aimed to resolve such questions through discussion, yet some interpretive ambiguity inevitably remains. Because RR, ID, and CC were derived from informal project notes and modeling logs, they may miss subtle rework or clarification activities, which introduces potential measurement bias that should be considered when interpreting the results.
As for internal validity, all data stemmed from a coherent modeling effort rather than a synthetic example, which naturally introduces some dependencies among artifacts. This is common in studies of early system modeling, where artifacts evolve together and share contextual grounding. While subtle coder expectations may influence how particular issues were categorized, the inter-rater agreement levels and use of separate RTD reviewers mitigate that concern. The analytical models behaved consistently across alternative specifications (beta, NB, fractional logit, penalized regressions), suggesting that the observed associations are not artifacts of a particular modeling choice.
External validity is limited by the use of a single system and domain, and this remains a primary constraint on the generalizability of the findings. At the same time, the modeling setting is typical of early-stage cyber–physical system modeling, where environmental assumptions play a central role. Following Jackson’s environment–machine formulation, these assumptions concern properties of the problem world rather than the system itself, which makes the constructs examined here conceptually domain-independent. Many engineering efforts employ similar modeling granularity patterns, which suggests that the findings are likely relevant beyond this case. Even so, examining these metrics in multiple domains and real-world industrial projects will be important to understand the extent of their applicability.
Conclusion validity may be affected by the dataset size. The dataset size in our study aligns with common empirical work in requirements and MBSE research, especially where hand-coded artifacts and early-stage design notes are involved. The combination of bootstrap confidence intervals, false-discovery control, and penalized regression provides additional assurance that the conclusions do not hinge on a small number of influential data points. Although expansions to larger industrial datasets would certainly be valuable, the present results appear stable and internally coherent, supporting the study’s goal of establishing a first empirical foundation for assumptions volatility metrics.
6. Conclusions
In this paper, we examined how environmental assumptions volatility may relate to RTD in model-based system development. Building on prior work that discussed assumption-driven forms of RTD, this study proposed three quantitative metrics, i.e., ACR, IR, and DD, and explored their relationship with established RTD indicators, including rework ratio, inconsistency density, and correction count. The analysis suggested that assumptions that changed frequently or were highly interconnected tended to be associated with higher levels of RTD. These observations provide preliminary quantitative support for the long-held view in requirements engineering that unstable environmental assumptions can influence the effort and quality of early modeling activities. We believe that by translating the abstract notion of assumptions volatility into measurable indicators, this work lays the foundation for early detection and prediction of assumption-driven technical debt.
Several avenues remain open for further investigation. One useful step is to examine larger and more diverse datasets to test the generality of the metrics. Integrating assumptions volatility tracking directly into MBSE tools could also enable real-time monitoring and feedback. Another promising direction is predictive modeling, e.g., using machine learning to estimate likely RTD growth based on early volatility patterns. Finally, qualitative follow-up studies could explore how practitioners perceive and manage environmental assumption change, helping to refine the theory as well as practical guidance on assumption management.
Overall, the findings offer early support for treating assumptions not just as background context but as dynamic elements that may shape technical debt long before a line of code is written.