1. Introduction
In today’s world, the complexity of technological systems is increasing rapidly, and this trend encompasses everything from localized installations to critical infrastructure, global digital platforms and autonomous services based on artificial intelligence. Evidence supports the explosive growth in complexity. The number of connected Internet of Things (IoT) devices reached 16.6 billion by the end of 2023 and is projected to reach around 40 billion by 2030 [
1]. At the same time, the amount of data is rapidly increasing: the IDC estimates that the global data sphere will grow from 33 zettabytes (33 trillion gigabytes) in 2018 to 175 zettabytes by 2025 [
2]. About 80% of this data is unstructured, meaning knowledge is dispersed across texts, documents, and other non-formalized sources that are inaccessible to traditional analysis. Critical information is thus “hidden” outside structured databases, making it difficult for risk models to remain complete and relevant.
Not surprisingly, experts are growing louder in their warnings about the risks associated with such complexity. For example, the World Economic Forum notes that today’s risk landscape is “inherently interconnected and difficult to navigate” [
3]. Moreover, renowned cryptographer and security expert Bruce Schneier bluntly points out: “Complexity is the worst enemy of security, and our systems are getting more and more complex.” [
4]. In other words, the more intricate a system becomes, the harder it is to anticipate and prevent its vulnerabilities. In cybersecurity, for example, there is a steady increase in new vulnerabilities, with more than 26,000 CVEs disclosed in 2023—1500 more than the previous year [
5]. Leading voices in management and academia also highlight the challenge of accelerating change: half of technical knowledge becomes obsolete in less than two years [
6], and regulators and organizations have not had time to adapt. In a recent PwC survey, three quarters of risk managers (75%) admitted that they are not keeping up with the rapidly changing technology and regulatory environment [
7], despite their investments in risk management. These authoritative opinions and facts converge on one conclusion: classical approaches to risk analysis have not kept pace with the complexity and dynamics of modern systems.
The fundamental laws of systems theory support these conclusions with strict logic. Ashby’s law of requisite variety states that “the variety of the controlling system (controller) must be at least as great as the variety of the system being controlled”; otherwise, effective control is impossible [
8]. Simply put, if the environment changes faster than we can react, problems are inevitable. Mathematically, this manifests as combinatorial growth in the number of states and interactions as the system becomes more complex: adding new components leads to an exponential increase in possible configurations. Traditional “closed” risk models therefore quickly lose relevance without constant revision—with such an avalanche of options, no static list of scenarios remains complete. Formally, the control of a complex system requires equally complex models; otherwise, “fragile” zones emerge where uncertainty and knowledge gaps lead to failures [
9]. With 80% of the data being unstructured and new threats emerging every minute, without fundamentally new approaches, our analytical conclusions become obsolete before they can be realized.
The problem, then, is the following: the complexity of modern systems is growing faster than our tools for reducing it to a manageable form. The combination of components, interactions, and external relationships makes classical “closed” risk models fragile for three reasons:
Incomplete descriptions (much of the knowledge is hidden in unstructured data and operational context);
Drift of technologies, regulations, and practices (models rapidly lose validity as they change);
The “analytics → action” gap (even valid conclusions are not transformed into standardized solutions or embedded in management loops such as EAM/CMMS systems).
In practice, there is a swing between two extremes: very detailed analyses, which are expensive and poorly transferable to the changed system, and empiricism based on expert experience, which is subjective and poorly reproducible. It turns out that in increasingly complex systems, “risk analysis” either becomes disproportionately resource-intensive or produces unsustainable solutions. What is needed is a unified, reproducible, and interpretable approach to risk management that works equally well with heterogeneous data (including texts), is robust to constant environmental drift, and is embedded in the contours of management actions [
10]. Only such an approach will bridge the gap between analysis and action amid the rapidly increasing complexity of the technosphere.
Thus, the unit of risk analysis should not be a “component” or a “defect category”, but a phenomenon—a stable, cross-domain configuration of meanings, contexts, and conditions of events that can be detected in heterogeneous streams (texts of incidents and regulatory reports, operational notes, work logs, user communications, investigation narratives, and open sources) and compared with variable management objectives (reliability, safety, continuity, and compliance). Macro-level empirical evidence supports this shift: the prevalence of unstructured sources of knowledge, cross-border supply chains and services, and the accelerated drift of technologies and norms all change risk from an “internal property of the product” into a systemic phenomenon spreading through networks of interactions. At the level of principles, such as the law of requisite variety, the controllability of such an environment is achieved only when the representation space and the solution space have sufficient diversity, invariance to drift, and tolerance to heterogeneity. Consequently, we need a method that reliably translates diverse narratives into a common semantic coordinate system of phenomena, purposefully extracts from it the factors directly related to risk metrics, and, finally, returns the result to the management space in the form of reproducible, interpretable, and prioritized actions at the level of phenomena rather than disparate attributes.
This logic sets the method as a single macro-operator that transforms a multilingual and multimodal field of narratives into a practical space of management decisions. Resistance to drift and control of complexity are inherently built in. The pipeline works sequentially. First, semantic coding translates narrative data into invariant coordinates of phenomena, i.e., into a compact and causally meaningful representation. The target-matching module then adjusts this representation to the selected risk metric and generates latent factors that concentrate the part of the variability that explains the observed risk. The projection module then builds a calibrated risk prediction based on the extracted factors, while maintaining the reproducibility of the result and the transparent attribution of the contribution of each phenomenological component. Finally, the prescriptive module selects a portfolio of management interventions from the admissible set: phenomenological coordinates are shifted through an influence matrix, and the objective function minimizes risk in the post-intervention state, augmented by a regularizer that limits the complexity and cost of interventions and takes into account resource, regulatory, and linkage constraints.
This macro-operator construction imposes system requirements. The representation of phenomena must be invariant to data sources, domains, and languages. Consistency with risk metrics must close the gap between signal detection and control action. The projection part should provide drift tolerance, transparent attribution, and the full traceability of factors. The optimization of measures should remain operationalizable and take into account budget, regulations, and asset linkage topology. As a result, a continuous and verifiable chain is maintained from texts to phenomena, then to factors, then to forecast, and finally to action. At each stage, complexity is controlled and the applicability of solutions is preserved.
Together, factual trends (scale and heterogeneity of data), expert consensus (fragility of interconnected networks and safety-by-design requirements), and a priori principles (combinatorial growth of states and the law of requisite variety) narrow the space of acceptable solutions to classes that operate precisely in the phenomenological space and return controlled deformations of this space as the primary object of intervention. In other words, as long as “components” and “incident categories” remain the unit of analysis, the gap “data, model, solution” grows faster than our ability to close it. Only when observations are translated into global phenomena and mapped back into management measures does this gap become manageable and compatible with EAM/CMMS cycles. Hence, a unified, domain-agnostic approach is needed, where phenomena act as the first unit of risk accounting, and measurement and impact are standardized in the same space.
Thus, there is a clear need for an innovative approach, the novelty of which lies in the fact that it introduces and rigorously operationalizes semantic factor analysis of phenomena as a universal macro-framework for risk management in complex systems. A single invariant representation of phenomena from heterogeneous narratives is consistent with target risk metrics and, through optimization mapping, generates reproducible, interpretable, and prioritized actions at the level of phenomena, thus bridging the gap “data, model, decision”.
It has already been established that the unit of risk accounting in complex, drifting ecosystems should be neither components nor “categories of defects”, but phenomena: stable, cross-domain configurations of meanings and conditions extracted from heterogeneous narratives and correlated with management objectives. It can also be argued that only with symmetry between the representation space (phenomenological), the factor space, the risk metrics, and the space of managerial action does the gap “data, model, decision” become controllable. The formulation of the objective follows directly from this logic.
Research on extracting actionable information from unstructured text has a long history in information retrieval and statistical language processing. Early work established reproducible weighting and retrieval mechanisms that made large text collections measurable and comparable [
11,
12]. Topic models, and LDA in particular, later provided a compact probabilistic representation of documents and enabled the systematic discovery of latent themes from corpora [
13,
14]. Alongside topic modeling, the literature developed alternative semantic representations, including latent semantic models, distributional word embeddings, and modern transformer-based encoders that improve semantic coverage and the robustness of representations across contexts.
A separate line of research focuses on linking high-dimensional representations to target variables while keeping models stable and interpretable. Partial least squares provides a classic target-oriented projection that extracts components aligned with the response and mitigates multicollinearity [
15,
16]. Regularized regression methods, including ridge, lasso, and elastic net, further support generalization under high dimensionality and provide a controlled way to estimate the contribution of predictors to a target metric [
17,
18,
19,
20,
21]. These foundations are directly relevant when semantic coordinates must be related to risk indicators without losing traceability.
Prescriptive analytics extends prediction to decision making by optimizing interventions under constraints. CVaR-based formulations are widely used to focus on tail outcomes and to support risk-sensitive optimization objectives [
22,
23,
24,
25,
26]. Bayesian optimization and related black-box optimization techniques offer practical ways to search for effective decisions when evaluations are costly or uncertain [
27,
28,
29,
30,
31,
32]. In addition, graph-based modeling and regularization provide a way to encode dependencies between semantic factors and to stabilize solutions with respect to relational structure [
33,
34,
35,
36,
37].
For operational adoption, transparency is a core requirement. Modern explainability methods such as local explanations and additive feature attribution help to interpret model outputs and support auditability [
38,
39]. Broader conceptual frameworks in explainable and interpretable machine learning further systematize the taxonomy of explanations and the limits of interpretability [
40,
41]. These approaches motivate the need to keep a clear chain from semantic drivers to risk outcomes and to management actions.
In risk engineering, established work clarifies risk definitions and emphasizes the practical role of uncertainty and human factors in complex systems [
42,
43,
44]. Risk-based inspection planning and dynamic strategies show how risk indicators can be coupled to inspection decisions in engineering settings [
45,
46]. Related studies demonstrate that textual incident narratives can be mined to build reliability models and to extract safety risk factors from accident reports [
47]. Reviews on diagnostics, prognostics, and risk-based maintenance also provide decision-oriented context for maintenance planning in complex assets [
48,
49,
50,
51,
52]. At the same time, existing contributions typically solve only part of the chain, either focusing on text extraction, or on prediction, or on decision rules. This creates a gap for an integrated approach that transforms incident narratives into stable phenomena, links them to a risk metric through target-aligned factorization, and converts results into implementable portfolios of interventions. For clarity, the reviewed literature and its positioning relative to the present study are summarized in
Table 1.
The aim of this research is to develop a universal, domain-agnostic method of semantic factor analysis of phenomena, which ensures the manageability of risk under conditions of data heterogeneity and environmental drift through a standardized transition from observational narratives to prioritized management actions.
Thus, the goal of the proposed innovative method is to ensure risk manageability in conditions of increasing complexity and entropy through the transition to the phenomenological level of representation and impact. The implementation of the method leads to a stable translation of heterogeneous narratives into invariant coordinates of phenomena, purposefully coordinating them with risk metrics and returning the result to the space of managerial actions, thus systematically closing the gap “data, model, solution” in multi-sectoral and drifting environments. To concretize, the objective is revealed in three interrelated aspects:
Unifying risk representation—building a global phenomenological space invariant across sources, domains and languages, in which observations are comparable over time and across organizations;
Targeted alignment and explainability—extracting a semantic factor kernel directly paired with risk indicators, with a transparent “phenomenon, factor, risk” trace;
Operationalized management—prioritizing and optimizing actions at the phenomenon level, subject to cost and regulatory and network constraints, ensuring reproducibility, drift resistance, and seamless integration into EAM/CMMS loops.
The goal of the method is not to describe phenomena per se, but to use them as a primary, standardized unit of control to proactively reduce the risk functional in complex systems, turning semantic observations into coherent, prioritized, and transferable management decisions.
This study presents an end-to-end method that turns free-text incident notifications into decision-ready risk management outputs for complex systems under drift. The method uses phenomena as the main unit of analysis, so heterogeneous narratives become comparable across time and organizations. It links extracted phenomena to a risk indicator with target-oriented factor modeling, so that the drivers of risk stay interpretable and traceable. It then produces optimized portfolios of interventions that can be directly implemented in EAM/CMMS workflows.
2. Materials and Methods
The proposed approach is a holistic scenario-optimization semantic analysis pipeline that automatically converts textual incident data into recommendations for risk management. The pipeline covers the entire path from text processing and thematic modeling to the generation of management decisions, ensuring end-to-end traceability of the entire chain (data, then model, then decision). The input data are descriptions of emergency incidents together with a calculated risk index (a weighted indicator of consequences that aggregates the frequency, severity, and other effective characteristics of the event). The output of the pipeline is a composite factor representation of the thematic space of incidents, an interpretable regression model of the influence of the identified factors on the risk index, optimal and efficiency-ranked “levers” of intervention (proposed actions to reduce the risk, taking into account the constraints), a graph of interrelated topics for the comprehensive planning of measures, and human-readable recommendations ready for integration into asset management systems (EAM/CMMS).
First, the preprocessing of incident texts and topic modeling is performed. The text reports are brought to a uniform format (cleaning of service characters, case normalization, tokenization, and lemmatization), after which latent Dirichlet allocation (LDA) is applied to the corpus [
11,
12,
13]. LDA topic modeling turns unstructured text into compact, reproducible semantic coordinates: each document is described by a distribution over K latent topics, and each topic by a distribution over words. The result is a matrix Θ of dimension N × K (where N is the number of incidents), which serves as a semantic “passport” of all events. Θ captures which latent themes are present in the description of each incident and in what proportion. The obtained topics are still abstract, so auto-annotation of the topics is then performed using a large language model (LLM).
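As a minimal illustration of this step, the sketch below builds a Θ matrix with scikit-learn; the four-document corpus and K = 2 are toy assumptions, not the actual corpus or configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for preprocessed incident narratives.
corpus = [
    "unauthorized access to controlled area during maintenance",
    "discharge of contaminants above permitted level",
    "valve component defect found during inspection",
    "unauthorized access attempt at perimeter gate",
]

X = CountVectorizer(stop_words="english").fit_transform(corpus)

K = 2  # number of latent topics (chosen via coherence checks in practice)
lda = LatentDirichletAllocation(n_components=K, random_state=0)
theta = lda.fit_transform(X)  # Theta: N x K, one topic distribution per incident
```

Each row of `theta` is the topic-share “passport” of one incident and sums to one.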
We designed the neural auto-annotation stage to be reproducible and auditable. Topic labeling was performed locally using Ollama with a fixed model version and a deterministic decoding setup. We used a constant temperature and a fixed random seed for all calls. We also fixed the remaining decoding parameters, including top-p, top-k, and repetition penalty. We constrained the maximum number of generated tokens to keep outputs short and comparable across runs. The prompt template was fixed and versioned. It received the same structured inputs for each topic, including the top words with weights and a small set of representative documents. All prompts and raw outputs were logged.
To verify reproducibility, we reran the labeling procedure multiple times under identical settings. In our experiments, the resulting topic labels were identical in more than 95% of cases. The remaining differences were typically minor and consisted of near-synonyms or small wording variations that did not change the operational meaning of the label. To keep the nomenclature stable, we applied a simple normalization step that standardized capitalization, removed redundant qualifiers, and mapped frequent synonymous variants to a single canonical label. This protocol makes the annotation step transparent and repeatable, while preserving the interpretability benefits of human-readable phenomenon names.
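The normalization step described above can be sketched as a small deterministic function; the synonym table and qualifier list below are illustrative assumptions, not the actual mapping used.

```python
# Sketch of the label-normalization protocol: collapse whitespace and case,
# strip hedging qualifiers, and map synonymous variants to one canonical
# label. SYNONYMS and QUALIFIERS are illustrative assumptions.
SYNONYMS = {
    "unauthorised access": "unauthorized access",
    "illegal access": "unauthorized access",
    "pollutant discharge": "discharge of contaminants",
}
QUALIFIERS = ("possible ", "suspected ", "reported ")

def normalize_label(raw: str) -> str:
    label = " ".join(raw.lower().split())   # collapse whitespace, lowercase
    for q in QUALIFIERS:                    # drop hedging qualifiers
        if label.startswith(q):
            label = label[len(q):]
    return SYNONYMS.get(label, label)       # map synonyms to canonical form
```

Because the function is deterministic, reruns of the labeling step yield an identical nomenclature for the same raw outputs.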
Based on the top words and characteristic n-grams of each topic, as well as a few example documents in which the topic is dominant, the LLM generates short human-readable names for each topic (e.g., “unauthorized access”, “discharge of contaminants”, “component defects”, or “radiation exposure”). This does not affect the numerical part of the model but significantly increases the interpretability of the results and unifies the terminology for subsequent use in reporting and integrations.
To increase methodological transparency and reproducibility, we follow a fixed and documented topic modeling protocol. The LDA model is trained on lemmatized tokens that pass simple document frequency filters. Extremely rare tokens and extremely frequent tokens are removed, and a domain specific stop word list is added to the standard English stop word list. This reduces noise from idiosyncratic terms, boilerplate, and regulatory phrases that carry little discriminative information. All preprocessing steps and vocabulary filters are implemented as deterministic transformations, which makes it possible to rerun the pipeline and obtain the same input to the LDA stage.
The LDA configuration is selected using a combination of quantitative and qualitative checks. We train candidate models under different numbers of topics and prior settings, and we monitor standard topic coherence scores such as C_v and UMass on held-out data. For each candidate model, we also inspect top words and representative incident reports for several topics. Models with very few topics tend to merge distinct operational scenarios into broad themes, while models with too many topics split stable phenomena into fragmented clusters. The final configuration is chosen as the one that achieves high coherence scores and remains interpretable for safety engineers in terms of stable, reusable phenomena.
To check stability, we repeat LDA training with different random seeds and on different temporal slices of the NRC Event Notifications corpus. We then compare the resulting topics by overlap of top words and by similarity of document-level topic distributions. The main operational themes, such as unauthorized access, pollutant discharge, and equipment failures, appear consistently across seeds and time windows. We log model parameters, vocabularies, and topic word distributions for each run. This allows for independent verification of the linguistic layer and supports the external replication of the semantic coordinates used in the subsequent PLS and regression stages.
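The top-word overlap comparison can be sketched as follows; the toy topic-word matrices stand in for two LDA runs with different seeds, and the Jaccard measure is one simple choice of overlap score.

```python
import numpy as np

def top_words(topic_word, vocab, n=2):
    # Set of the n highest-weight words for each topic row.
    return [{vocab[i] for i in row.argsort()[::-1][:n]} for row in topic_word]

def best_match_jaccard(topics_a, topics_b):
    # For each topic of run A, Jaccard overlap with its best match in run B.
    return [max(len(ta & tb) / len(ta | tb) for tb in topics_b) for ta in topics_a]

# Toy topic-word matrices standing in for two LDA runs (topics permuted).
vocab = ["valve", "leak", "access", "gate", "pump", "seal"]
run_a = np.array([[5.0, 4.0, 0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 5.0, 4.0, 0.0, 1.0]])
run_b = np.array([[0.0, 1.0, 5.0, 4.0, 0.0, 0.0],
                  [4.0, 5.0, 0.0, 0.0, 0.0, 1.0]])

scores = best_match_jaccard(top_words(run_a, vocab), top_words(run_b, vocab))
```

High best-match scores across seeds and time windows indicate that the same operational themes are recovered consistently.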
To relate the thematic representation to the target risk score, the topic matrix Θ is compressed into a factor space by a target-oriented PLS projection. Linear combinations of features are selected that are maximally consistent with the risk index, eliminating multicollinearity and preserving interpretability through the weights. The number of components is selected using cross-validation [
14,
15,
16].
A regularized linear regression is trained on the obtained factors, estimating the contribution of each factor to the risk index (ridge, i.e., L2 regularization) [17]. Tuning is performed on a set of metrics (R², RMSE/MAE/MAPE, and cross-validation indicators) with temporal cross-validation, which balances accuracy, generalizability, and interpretability. The coefficients in effect define the “levers” for subsequent optimization [18,19].
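A sketch of this regression layer under stated assumptions: scikit-learn’s `Ridge` with `TimeSeriesSplit` as the temporal cross-validation, RMSE as the selection metric, and synthetic factor scores in place of real PLS output.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for PLS factor scores and the risk index.
rng = np.random.default_rng(1)
F = rng.standard_normal((200, 4))
risk = F @ np.array([1.5, -0.7, 0.3, 0.0]) + 0.1 * rng.standard_normal(200)

# L2-regularized regression tuned with temporal cross-validation on RMSE.
search = GridSearchCV(
    Ridge(),
    {"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_root_mean_squared_error",
)
search.fit(F, risk)
beta = search.best_estimator_.coef_  # per-factor "levers" for the optimizer
```

Temporal splits ensure that each validation fold lies strictly after its training data, which mimics deployment under drift.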
Next, a scenario-based optimization of management actions is formulated. A vector of per-topic interventions is introduced, their effects are modeled through a matrix of elasticities, and the predicted risk is recalculated and minimized subject to constraints and costs [
20,
21,
22]. Bayesian optimization on Gaussian processes is used to find the optimal plan. The output is the optimal intervention levels and the ranking of topics by risk reduction/cost efficiency [
23,
24,
25].
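To make the scenario-optimization step concrete, the sketch below uses a simplified linear elasticity model and scipy’s SLSQP solver in place of the GP-based Bayesian optimization; all numbers (topic shares, elasticities, costs, budget) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

K = 4
theta_bar = np.array([0.4, 0.3, 0.2, 0.1])   # mean topic intensities
beta = np.array([2.0, 1.0, 0.4, 0.2])        # per-topic risk weights (from regression)
E = 0.8 * np.eye(K)                          # elasticities: u_k removes 80% of topic k
cost = np.array([1.0, 0.5, 0.5, 0.2])        # cost of full intervention per topic
budget = 1.0

def predicted_risk(u):
    # Shift the topic profile through the elasticity matrix, then re-score risk.
    return (theta_bar * (1.0 - E @ u)) @ beta

res = minimize(
    predicted_risk,
    x0=np.zeros(K),
    bounds=[(0.0, 1.0)] * K,
    constraints=[{"type": "ineq", "fun": lambda u: budget - cost @ u}],
    method="SLSQP",
)
u_opt = res.x

# Rank topics by marginal risk reduction per unit cost.
ranking = np.argsort(-(theta_bar * np.diag(E) * beta) / cost)
```

In the full method, this deterministic surrogate is replaced by Bayesian optimization over Gaussian processes when the response surface is expensive or uncertain.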
Prioritized themes are aggregated into a graph of thematic associations. Nodes are themes and edges are statistically significant associations (co-occurrence, correlations/partial correlations, and causal relationships if available). The graph is annotated with metrics from previous steps: nodes carry the contribution to risk (PLS factors) and centrality; edges carry the strength of association (correlation/MI) and, where appropriate, the effectiveness of a joint intervention. This representation provides a “management map” in which acting on high-centrality nodes has direct and indirect effects; e.g., dense subgraphs form coherent packages with the potential for multiplicative risk reduction [
26].
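A minimal numpy sketch of this graph as a thresholded correlation adjacency with degree centrality; the 0.3 threshold and synthetic Θ are assumptions, and in practice partial correlations or mutual information with significance tests would be used.

```python
import numpy as np

# Synthetic topic shares; topics 0 and 1 are made to co-occur.
rng = np.random.default_rng(2)
Theta = rng.random((300, 4))
Theta[:, 1] = 0.7 * Theta[:, 0] + 0.3 * Theta[:, 1]

C = np.corrcoef(Theta, rowvar=False)              # topic-topic correlation matrix
A = (np.abs(C) > 0.3) & ~np.eye(4, dtype=bool)    # adjacency: "significant" edges
degree_centrality = A.sum(axis=0) / (4 - 1)       # share of possible neighbors
```

High-centrality nodes in `A` are the natural anchors for coherent intervention packages, since actions on them propagate along the edges.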
Based on the graph and attributes (topic names, influence weights, links, elasticities, and constraints), the LLM generates structured recommendations (type of work, periodicity, responsible persons, resources/competences, control points, and KPIs) in a structured format ready to be loaded into EAM/CMMS. This transforms analytics into operational artifacts, bridging the “analysis, action” gap and closing the data loop.
It is important to emphasize that the proposed method is iterative and supports the full cycle “data–models–solutions–effects–data”. After the implementation of the recommended actions and the occurrence of new incidents, their results are fed back into the loop: the accumulated descriptions and updated risk indicators are fed back into the input of the pipeline. The LDA model is re-trained on the replenished corpus (with stability control to ensure that new data do not distort previously found themes), PLS components and regression coefficients are recalculated with fresh data, and elasticities and weights on the graph are updated [
27]. Periodic revisions of the model’s hyperparameters—e.g., the number of topics K or the set of intervention scenarios considered—may be conducted using regression tests on historical data to ensure that the quality of the model does not degrade over time.
To formalize the “feedback” block shown in
Figure 1 and to make the pipeline explicitly adaptive, we treat the operational outcomes of implemented interventions as first-class training signals, captured in the same EAM/CMMS loop that executes the recommendations. Concretely, each deployment round
logs an auditable tuple (Θ_t, u_t, O_{t+1}), where Θ_t is the pre-intervention document–topic matrix for the current time window, u_t is the implemented portfolio of topic-level intervention intensities, and O_{t+1} is post-intervention evidence (updated incident narratives, refreshed values of the proxy risk index y, tail indicators such as CVaR where applicable, and execution/quality logs of work orders). This makes the pipeline not only iterative in the «new data, retrain» sense, but explicitly outcome-driven. Recommendations are evaluated against realized effects, and the model is updated to better align its risk estimates and prescriptions with operational reality.
This outcome-driven update can be structured as Regression with Human Feedback (RHF), directly inspired by the alignment loop of reinforcement learning from human feedback (RLHF) used for instruction-following language models [
28]. In RLHF, a supervised model is complemented by a human preference signal and an optimization step that improves decisions with respect to this signal. In our setting, the supervised backbone is the traceable regression trained on PLS components. The factor kernel is F = ΘW, and the predicted proxy risk is ŷ = Fβ. The corresponding regression coefficient vector β_Θ = Wβ is traceable in the PLS subspace, i.e., ŷ = Θβ_Θ. The «action» is the intervention portfolio u, and the «human feedback» is the expert assessment of effectiveness together with post-intervention operational outcomes that reveal whether the prescribed interventions actually reduced risk.
Formally, the prescriptive step uses the calibrated deformation of the semantic profile under interventions, Θ(u) = Θ + ΔΘ(u), where ΔΘ(u) is induced through the elasticity structure estimated in the optimization block. The predicted post-intervention risk under the implemented portfolio is then ŷ(u) = Θ(u)Wβ, and the predicted effect is Δŷ = ŷ − ŷ(u). After execution, the operational loop provides realized values y_{t+1} and Δy = y_t − y_{t+1}, as well as an expert feedback score s for the effectiveness and appropriateness of the intervention. These signals define a scalar feedback target (utility) used for learning, U = α₁Δy + α₂s − α₃c(u), where c(u) is cost/effort and α₁, α₂, α₃ are governance-controlled weights. Thus, RHF uses the same traceable «phenomenon, factor, risk, action» chain, but closes it with an explicit «action, observed outcome, model update» channel.
The feedback loop enables two complementary and operationally realistic update modes, explicitly addressing how the PLS layers are refined. On a fixed schedule, or when drift diagnostics trigger, we retrain the semantic layer and re-estimate the target PLS projection W and the regression parameters β on an expanded or rolling-window corpus, using time-decay weights and feedback-based sample weights. To preserve interpretability and auditability, we keep W fixed for a period and update only the regression layer in factor space. In parallel, the prescriptive block is calibrated by contrasting realized and predicted effects, Δy versus Δŷ, which updates the elasticity parameters that generate ΔΘ(u) and (if used) the Bayesian optimization priors. This formalization turns the framework from a static analyzer into a self-improving tool that actively counters concept drift through outcome-driven learning.
To keep human-in-the-loop learning feasible, explicit feedback is requested only for high-impact or high-uncertainty cases, while routine updates rely on automatically captured post-intervention incident narratives and EAM/CMMS execution logs. When feedback is multi-source, we align heterogeneous ratings by modeling rater reliability and confidence and aggregating them into consistent targets, following the general approach of crowd-sourced feedback alignment studied by Wong and Tan (2025) [
29]. In addition, when absolute scoring is difficult, experts can rank alternative candidate portfolios generated by the optimizer. These pairwise preferences provide a stable supervision signal analogous to RLHF preference datasets, but applied here to intervention portfolios rather than text outputs. A simple instantiation is to fit a utility model U(u) from comparisons u_i ≻ u_j via the logistic (Bradley–Terry) likelihood P(u_i ≻ u_j) = σ(U(u_i) − U(u_j)), and then select portfolios by maximizing the resulting expected utility subject to the operational constraints already encoded in the optimization step.
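One way to instantiate such preference-based utility fitting is a Bradley–Terry model trained by gradient ascent; the linear utility form, toy portfolios, and preference set below are illustrative assumptions.

```python
import numpy as np

# Candidate portfolios and expert pairwise preferences (i preferred over j).
portfolios = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
prefs = [(0, 1), (0, 2), (1, 2)]

def fit_utility(portfolios, prefs, lr=0.5, steps=500):
    # Linear utility U(u) = w @ u fitted by gradient ascent on the
    # Bradley-Terry log-likelihood of the observed preferences.
    w = np.zeros(portfolios.shape[1])
    for _ in range(steps):
        grad = np.zeros_like(w)
        for i, j in prefs:
            d = portfolios[i] - portfolios[j]
            p = 1.0 / (1.0 + np.exp(-(w @ d)))   # P(u_i preferred over u_j)
            grad += (1.0 - p) * d                # gradient of the log-likelihood
        w += lr * grad / len(prefs)
    return w

w = fit_utility(portfolios, prefs)
best = int(np.argmax(portfolios @ w))  # highest-utility portfolio (constraints omitted)
```

In the full pipeline, the argmax would be replaced by constrained maximization over the admissible portfolio set from the optimization step.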
This approach ensures the robustness of the method to data drift and changes in external conditions, while maintaining reproducibility and auditability. At each round, versions of LDA/PLS dictionaries and parameters, model coefficient values, optimization settings, and graph snapshots are stored [
28]. Together, the method integrates heterogeneous textual and quantitative data, provides interpretable insights at the level of themes, factors, and activities, and automates the transition from data analysis to management actions, while remaining transparent and reproducible for further analysis and for the trust of sectoral experts [
29,
30,
31,
32,
33].
Figure 1 below presents a comprehensive flowchart of the scenario-optimization method of semantic factor incident analysis. The flowchart visualizes the key steps and modules of the proposed pipeline and the interrelationships between them, demonstrating how unstructured textual data are sequentially transformed into risk factor models and management recommendations. The multi-level structure of the solution is shown, with forward links, cross-links, and feedback: from initial text processing and thematic modeling to recommendation generation and closing feedback, as well as an artifact registry for auditing and replication that feeds the effects of implemented interventions back into the data loop for adaptive model retraining.
The flowchart illustrates the main stages of the methodology and their interactions. The input data—textual incident reports and associated numerical risk indicators—are sequentially preprocessed (cleaned and normalized), after which latent themes of incident descriptions are extracted using LDA. Next, the neural network auto-annotation module assigns human-readable names to the obtained topics, increasing interpretability. In the next step, the topic space is projected into compact factors using a targeted PLS factorization aligned with the risk index; this reduces dimensionality and eliminates multicollinearity while preserving the informativeness of the features [
34,
35,
36,
37,
38]. The obtained factor coordinates are fed into a regularized regression block that estimates the quantitative impact of each factor (topic) on the integral risk [
39,
40].
The regression results are used in the Bayesian optimization module: the model selects the optimal set of management “levers” (measures), i.e., calculates what proportion of incidents for each topic should be prevented or reduced to minimize the expected risk given the constraints. At the same time, a graph of thematic links is constructed, the nodes of which correspond to the auto-annotated themes, and the edges reflect significant statistical associations between them. The nodes and links of the graph are equipped with attributes from the previous stages (impact on risk according to regression data, elasticity and cost of measures according to the data of the optimization module, and centrality measures). This graph serves as a basis for generating recommendations: the final LLM module generates preventive and corrective measures understandable to specialists, aggregating the topics into comprehensive proposals with prioritization, resources, and integration into the asset management system (EAM/CMMS) [
41,
42,
43,
44].
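A minimal sketch of the thematic link graph described above, using NetworkX; the topic names, attribute values, and edge weights are illustrative placeholders mirroring the kinds of attributes the text mentions (regression impact, cost, centrality):

```python
# Toy thematic association graph: nodes are auto-annotated topics with
# attributes from earlier stages, edges carry association strength, and
# betweenness centrality flags "bridge" topics for bundled interventions.
import networkx as nx

G = nx.Graph()
topics = {
    "Unauthorized access": {"beta": 0.42, "cost": 3.0},
    "Pollutant discharge": {"beta": 0.18, "cost": 1.5},
    "Equipment accidents": {"beta": 0.25, "cost": 2.0},
    "Miscellaneous incidents": {"beta": 0.10, "cost": 1.0},
}
for name, attrs in topics.items():
    G.add_node(name, **attrs)

# Edges: statistically significant associations (illustrative weights)
G.add_edge("Unauthorized access", "Pollutant discharge", weight=0.6)
G.add_edge("Pollutant discharge", "Miscellaneous incidents", weight=0.5)
G.add_edge("Pollutant discharge", "Equipment accidents", weight=0.4)

centrality = nx.betweenness_centrality(G)
bridge = max(centrality, key=centrality.get)  # mediating topic
```

In this toy topology the "Pollutant discharge" node sits on the shortest paths between the other topics, which is exactly the kind of mediating role the recommendation stage exploits.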
The right-hand part of
Figure 1 closes the operational loop of the method: after the recommended measures are implemented in EAM/CMMS and their effects are monitored, the resulting evidence (new incident narratives and updated indicators) is returned to the pipeline as refreshed input data. The verification/regression tests then compare predicted and observed outcomes and provide input for the retraining stage. In this retraining cycle, the semantic and predictive layers are refreshed—i.e., the LDA topic distributions are updated, the PLS components and regression coefficients are recalculated, and the intervention elasticities are re-calibrated—so that the framework adapts to evolving conditions and remains robust to data drift during continued operation [
45].
Verification of the phenomenon-centric method is performed on the open corpus of NRC Event Notifications—a consolidated array of textual descriptions of incidents with associated attributes (date/time, organization and site, jurisdiction, 10 CFR codes, accident classes, etc.) [
46,
47,
48,
49,
50]. The corpus covers a long time interval (about two and a half decades) and is formed by multiple operators and sites, which ensures both temporal variability (drift of vocabulary, reporting practices, and regulations) and structural heterogeneity (different technological contexts, safety cultures, and regulatory requirements). Incident texts are quite “deep” (containing descriptive motivations, conditions, consequences, and normative references) and semantically rich (stable collocations, domain terms, and referential attributes), which is crucial for thematization and the extraction of phenomena [
51,
52]. The corpus is representative of the global problem articulated in the introduction:
The heterogeneity of sources and domain practices mimics a real inter-organizational ecosystem;
The unstructured nature of primary descriptions verifies the method’s ability to systematically “stitch” narratives into manageable phenomenological coordinates;
The temporal scope allows for explicitly checking the resistance to drift (lexical, normative, and processual);
The presence of normative anchors (10 CFR, accident classes) creates a natural ground for proxy risk indicators and for traceable “phenomenon, factor, risk, action” linkages.
The corpus covers a wide range of risk phenomena (operational disturbances, equipment failures, procedural deviations, and communication/regulatory events), which allows the transferability of thematic coordinates between heterogeneous scenarios to be assessed. In terms of size, the corpus volume (≈ tens of thousands of records) is adequate for the typical dimensionality of the topic space and the subsequent target factorization. After document–topic thematization, the Θ matrix is projected into a compact factor kernel, Z = ΘW with K ≪ |V| (the number of topics and factors is selected by validation), which forms a favorable “observations per component” ratio. The minimum detectable linear effect (using the Fisher Z approximation for correlation) at significance level α and power 1 − β is estimated as
r_min = tanh((z_{1−α/2} + z_{1−β})/√(n − 3)),
where n is the number of observations in the training split. For n on the order of tens of thousands, r_min is on the order of ≈ 0.02, which is sufficient to detect weak but systematic factor contributions. The unbalanced nature of severe events is also important: the presence of a “thin tail” in accident rates naturally tests the tail metrics (CVaR) and the robustness of the regression on sparse extreme patterns. Together, this gives both substantive and statistical sufficiency of the corpus for validating the holistic “texts, phenomena, factors, risk, action” loop.
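Under conventional choices (α = 0.05, power 0.80), this detectability bound can be checked numerically; the sketch below assumes those standard values:

```python
# Minimum detectable correlation under the Fisher z approximation:
# r_min = tanh((z_{1-alpha/2} + z_{1-beta}) / sqrt(n - 3))
import math
from scipy.stats import norm

def r_min(n, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.tanh(z / math.sqrt(n - 3))

print(round(r_min(20_000), 4))  # ~0.02 for tens of thousands of records
```

Smaller samples raise the bound sharply (e.g., r_min ≈ 0.09 at n = 1000), which is why the corpus size matters for detecting weak factor contributions.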
In the absence of a single “true” scalar business risk index, a proxy index y^proxy is introduced for all records, aggregating normative and hard indicators (accident class, 10 CFR codes, off-site notification, length and “density” of the narrative, etc.) with monotonic normalization:
y_i^proxy = σ(Σ_j w_j x_ij), x_ij ∈ {s(class_i), 10 CFR code counts, off-site notification flag, narrative length/density, …},
where σ is the logistic normalization and s(·) is the ordinal coding of the accident class. In parallel, an auxiliary classification target for the high-risk tail is generated to align with the CVaR optimization target:
y_i^bin = 1[y_i^proxy ≥ q_{1−α}(y^proxy)],
which allows for the simultaneous evaluation of regression calibration and recognition of high-risk cases.
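The aggregation and the tail label can be sketched as follows; the field names, weights, and α are hypothetical placeholders, since the study's calibrated weights are not reproduced here:

```python
# Toy construction of the proxy risk index y_proxy and the auxiliary
# high-risk label y_bin. All weights and fields are hypothetical.
import numpy as np

def sigma(x):  # logistic normalization
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n = 1000
s_class = rng.integers(0, 4, n)   # ordinal coding s(.) of accident class
n_cfr = rng.integers(0, 3, n)     # number of cited 10 CFR codes
offsite = rng.integers(0, 2, n)   # off-site notification flag
density = rng.random(n)           # narrative "density" proxy

z = 0.8 * s_class + 0.5 * n_cfr + 1.0 * offsite + 0.3 * density
y_proxy = sigma(z - z.mean())     # centered, monotonically normalized

# Auxiliary tail label aligned with the CVaR_alpha target (alpha = 0.1)
threshold = np.quantile(y_proxy, 0.9)
y_bin = (y_proxy >= threshold).astype(int)
```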
To avoid time leakage, a chronologically meaningful split is used: a training part (early years), a validation part (an intermediate interval for the selection of hyperparameters and the number of factors), and an out-of-time (OOT) test (recent years). Within training, a rolling-origin scheme with “purging” of neighboring windows is applied, and the summary metric is averaged with weights proportional to window size:
M̄ = Σ_t |V_t| · M_t / Σ_t |V_t|,
where M_t is the metric on validation window V_t.
The set of metrics is organized into groups:
Goodness-of-fit (R², EVS, RMSE/MAE);
Scale errors and calibration (slope/shift of the regression of ŷ on y, SMAPE/MASE);
CV metrics by time windows with confidence intervals;
Stability of representations (convergence of themes, robustness of W, and rank stability of factor contributions);
Graph replicates (convergence of edges/centrality);
Tail risk CVaR_α(ŷ) and PR-AUC/AUC for the high-risk class;
Operationalizability (proportion and “footprint” of recommendations suitable for loading into EAM/CMMS).
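The chronological design described above can be sketched as follows; the window sizes, purge length, and metric values are illustrative choices, not the study's settings:

```python
# Rolling-origin splits with a purge gap between train and validation,
# plus window-size-weighted averaging of a per-window metric.
import numpy as np

def rolling_origin(n, train0=400, val=100, purge=20, step=100):
    """Yield (train_idx, val_idx) pairs with a purged gap between them."""
    end = train0
    while end + purge + val <= n:
        yield np.arange(0, end), np.arange(end + purge, end + purge + val)
        end += step

n = 1000
splits = list(rolling_origin(n))

# Weighted summary: each window's metric weighted by its validation size
metrics = [0.82, 0.85, 0.80, 0.88, 0.84][: len(splits)]
weights = [len(v) for _, v in splits][: len(metrics)]
summary = np.average(metrics, weights=weights)
```

Purging removes observations adjacent to the train/validation boundary so that autocorrelated records cannot leak information across the split.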
To address reproducibility and methodological clarity across the full “tokens to actions” chain, we explicitly evaluate stability at the factor, regression, and prescriptive levels. This is necessary because the downstream estimates depend on the PLS-based semantic kernel and on the resulting regression coefficients that drive the optimization stage.
The stability of the PLS semantic factorization is evaluated across the rolling-origin splits used in the temporal validation design. For each split, we refit the PLS projection on the training window and obtain the topic for the factor loading matrix. We then compare these loading matrices between splits after resolving the sign indeterminacy of latent components. We report similarity both at the component level and at the subspace level. Component level similarity is computed using the cosine similarity between aligned loading vectors. Subspace level similarity is computed using the principal angle or Procrustes style alignment diagnostics. In addition, we compare factor scores obtained on a fixed reference subset and report their correlation structure across splits. This directly tests whether the semantic risk factors are stable under changes in the training period.
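A minimal sketch of these loading-stability diagnostics, with synthetic stand-in matrices: sign alignment, per-component cosine similarity, and subspace (principal-angle) similarity between the loading matrices of two splits:

```python
# Stability diagnostics for PLS loadings across two splits: resolve the
# latent-sign indeterminacy, then compare components and subspaces.
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(1)
W_a = rng.normal(size=(50, 4))               # loadings, split A (topics x factors)
W_b = W_a + 0.05 * rng.normal(size=(50, 4))  # split B: slightly perturbed
W_b[:, 2] *= -1                              # simulate sign indeterminacy

def aligned_cosines(A, B):
    """Cosine similarity per component, invariant to sign flips."""
    sims = []
    for a, b in zip(A.T, B.T):
        c = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        sims.append(abs(c))
    return np.array(sims)

cos_sim = aligned_cosines(W_a, W_b)  # ~1.0 for stable components
angles = subspace_angles(W_a, W_b)   # small angles => stable subspace
```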
Robustness to sampling noise is assessed via bootstrap resampling inside each training window. We repeatedly resample incident reports with replacement, refit the PLS projection and the regression model, and summarize the variability of loadings and coefficients using empirical intervals and sign consistency rates. This quantifies whether the factor layer and the estimated risk contributions remain stable when the data contain stochastic variation and sparse tail events.
The regression interpretability is supported by coefficient stability diagnostics. We treat regression coefficients as associations between latent semantic factors and the proxy risk index. We do not interpret them as causal effects. To justify their use as inputs to prescriptive optimization, we report coefficient stability under both bootstrap resampling and rolling-origin splits. We also report the stability of the rank ordering of absolute coefficient magnitudes, since this ranking is used to define intervention leverage priorities.
Prescriptive recommendations are tested for reproducibility at the level of selected levers and portfolio composition. The prescriptive module is deterministic under fixed random seeds and fixed solver settings, but multiple near-optimal portfolios can exist. Therefore, we evaluate stability by repeating the full pipeline under controlled perturbations and reporting overlap metrics for the top recommended levers. We report Jaccard overlap for the top N levers and the dispersion of portfolio cost and predicted risk reduction across repeats. This directly tests whether the recommended actions are consistent across repeated runs of the pipeline.
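The overlap metric for repeated runs reduces to a Jaccard index over the top-N lever sets; the lever names below are illustrative:

```python
# Jaccard overlap of top-N recommended levers across two pipeline runs.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

run_1 = ["unauthorized_access", "pollutant_discharge", "threats", "equipment"]
run_2 = ["unauthorized_access", "pollutant_discharge", "equipment", "radiation"]

overlap = jaccard(run_1, run_2)  # 3 shared of 5 distinct levers -> 0.6
```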
Resistance to concept drift is evaluated with explicit temporal diagnostics on the semantic, factor, and regression layers. At the semantic layer, we measure how the distribution of inferred topic mixtures changes over time windows, and we monitor the stability of topic word distributions and topic coherence in delayed periods. At the factor layer, we track the similarity of the PLS loading subspace across time windows. At the regression layer, we track coefficient and ranking stability over time. These diagnostics complement the standard predictive and calibration metrics and provide direct evidence of whether the semantic representation and the coupled risk model remain stable when the narrative distribution shifts.
Ablations are provided to isolate component contributions: BOW/TF-IDF without thematization; LDA + Ridge versus LDA + PLS + Ridge/Elastic Net; optimization with and without the graph regularization Lu; LLM annotation ON/OFF; and permutation control π(y).
Thus, the corpus used is both realistic (inter-organizational heterogeneity, normative anchors) and rich (long narratives, diversity of phenomena), and its long time span creates natural conditions for testing robustness to drift. The sample size and its structural heterogeneity provide statistical power to estimate weak but systematic effects and allow for the validation of the end-to-end loop “data, phenomena, factors, risk, action” rather than individual algorithmic modules. The described validation design provides a rigorous basis for further testing and interpretation of empirical results without going beyond the intended purpose of the study [
53].
All experiments were implemented in Python (version 3.x) using a reproducible open-source stack. Data handling and tabular transformations were carried out with pandas (v2.2) and NumPy (v1.26), while scientific utilities relied on SciPy (v1.16.2). Topic models and regression baselines were built using the scikit-learn library, complemented by gensim for exploratory topic modeling and NLTK for tokenization and lemmatization. Graph construction and analysis for the thematic association network were implemented with NetworkX, and the Bayesian optimization of intervention scenarios used Optuna (v4.6.0). Visualizations (
Figure 2,
Figure 3,
Figure 4,
Figure 5 and
Figure 6) were produced with Matplotlib version 3.10.6 and auxiliary plotting routines. The method therefore relies on standard, well-tested, open-source components, and only the authors implemented the integration logic of the end-to-end pipeline (data preprocessing, rolling-origin validation, scenario optimization, and artifact logging) as custom Python code on top of these libraries.
The full pipeline—from raw incident texts to trained topic models, PLS factorization, cross-validated regressors, and optimized intervention portfolios—was executed on a workstation-class Linux machine equipped with a multi-core Intel Xeon CPU (10 hardware cores), 64 GB of RAM, and no dedicated GPU. On this hardware, training the LDA model on the final NRC EN corpus with the selected number of topics required on the order of several minutes of wall-clock time. A complete run of the rolling-origin evaluation over all regression families and factorization variants was completed within a few hours. Training and evaluating a single configuration typically took from tens of seconds to a few minutes. Scenario optimization with Bayesian search and graph construction added comparable overhead but remained comfortably within interactive timescales, which confirms the practical feasibility of deploying the approach in operational risk management loops. Random seeds were fixed for all stochastic components to ensure the reproducibility of the reported results.
For completeness, we briefly summarize the regression models used in the study. We denote by X the matrix of predictors (in our case, these predictors are the latent PLS factors obtained from thematic coordinates), and by y the proxy risk index. All considered approaches are linear regression models that estimate an intercept term and a vector of coefficients, producing predictions as a linear combination of the predictors plus an intercept.
Ordinary Least Squares (OLS) fits the model by choosing the coefficients that minimize the overall discrepancy between the observed values of the target variable and the model predictions, measured as the sum of squared residuals. This is the standard unregularized linear regression baseline.
Ridge regression (L2 regularization) extends OLS by adding a penalty that discourages large coefficient values. Concretely, in addition to minimizing squared residuals, Ridge also minimizes the squared magnitude of the coefficient vector, scaled by a regularization strength parameter. This type of shrinkage improves stability under multicollinearity and helps prevent overfitting, which is why it is used in our best-performing configuration (PLS + SumDiff + Ridge with L/R scaling).
Lasso regression (L1 regularization) also extends OLS, but uses a different penalty: it discourages large coefficients by penalizing the sum of absolute values of the coefficients. This property tends to produce sparse solutions, meaning that some coefficients can become exactly zero. As a result, Lasso performs an implicit form of feature selection.
Elastic Net (combined L1 and L2 regularization) merges the two ideas above by using a weighted combination of the L1 and L2 penalties. A mixing parameter controls the balance between the Lasso-like sparsity effect and the Ridge-like shrinkage effect. This is often beneficial when predictors are correlated and one also wants some degree of sparsity.
In our pipeline, PLS regression is applied before these linear models in order to transform the input representation into a set of latent components that are maximally aligned with the response. The resulting PLS factors are then used as inputs to OLS, Ridge, Lasso, and Elastic Net. Model hyperparameters (regularization strength, Elastic Net mixing parameter, the number of PLS components, and the L/R scaling options) are selected using cross-validation within the rolling-origin evaluation framework described earlier in this section.
3. Results
A consolidated corpus of NRC Event Notifications was generated to test the performance of the method. The final sample consisted of 27,299 records with a predominant time span of 1993–2025. Isolated obvious date artifacts were excluded from the dynamic slices. The corpus aggregates reports from multiple operators and sites, contains rich attributes (date/time, facility and site, jurisdiction, 10 CFR regulatory codes, accident classes, etc.), and therefore combines temporal variability (lexical and regulatory drift) with the structural heterogeneity of technological contexts—exactly the kind of “stress environment” in which a phenomena-centric approach should remain robust. This composition makes the data indicative of the task: significant volume, long narratives, a wide range of phenomena (from operational and equipment failures to organizational and regulatory events), and the availability of normative anchors for building a proxy risk index and subsequent tracing from phenomenon to factor, to risk and, finally, to action.
Before analyzing the regression layer and optimization results, we checked that the thematic representation itself is meaningful and stable. The selected LDA configuration achieves topic coherence values in a range that is usually interpreted as moderate to good, and these values are higher than for alternative settings with fewer topics or without document frequency filtering. Repeated training with different random seeds produces very similar clusters of top words, and the dominant themes remain present when the model is trained on early and late subsets of the corpus. Together with ablation experiments that remove the LDA layer or replace it with a simple bag of words and TF–IDF baselines, this indicates that the phenomena used in the factor models and in the optimization are robust structures of the data rather than artifacts of a single random run.
Beyond predictive accuracy, we evaluated whether the semantic factorization, regression layer, and prescriptive outputs are stable under time splits and repeated pipeline runs. This step is critical because the proposed framework claims traceability and reproducibility across the entire chain, and because the prescriptive recommendations inherit uncertainty from upstream linguistic representations.
For the PLS-based semantic factorization, we computed stability diagnostics across rolling-origin splits. We report the similarity of topics to factor loadings after the alignment of latent components, and we report the stability of factor scores on a shared reference subset. We also report the bootstrap variability of the loading structure within training windows. These results are summarized in a dedicated stability table. They provide direct evidence that the extracted semantic risk factors are not artifacts of a single split.
For the regression layer, we report coefficient stability across time windows and bootstrap resamples. We summarize sign consistency and rank the stability of absolute coefficient magnitudes. This supports the interpretation of the reported “quantitative impacts” as stable associations between semantic factors and the proxy risk index, and it clarifies the conditions under which coefficient-based leverage ranking is meaningful for prescriptive optimization.
For the prescriptive module, we repeated complete pipeline runs under controlled seeds and solver settings and evaluated the stability of the resulting recommended lever sets. We report overlap metrics for the top recommended levers and the dispersion of predicted risk reduction and portfolio cost. This directly addresses whether the recommended actions would remain consistent when the pipeline is rerun, even when upstream components are subject to small stochastic variation.
Finally, we complement the rhetoric of drift resistance with explicit temporal diagnostics. We quantify semantic distribution shifts over time windows, and we track whether factor loadings and regression coefficients remain stable when the narrative distribution changes. Together with the rolling-origin evaluation already used for predictive metrics, these analyses provide an empirical basis for the claimed drift resilience.
The verification design targets drift robustness. A chronologically consistent split is used (training—early years; validation—an intermediate period for hyperparameter selection; out-of-time test—recent years), together with a rolling origin with purging of neighboring windows. The summary score is collected by groups of metrics reflecting different aspects of quality: goodness of fit (R², EVS, RMSE/MAE), scale errors/calibration (slope/shift, SMAPE/MASE), CV metrics (stability over time), residual diagnostics, and probabilistic and information criteria. This multi-axis control allows scenarios to be compared on the trade-off between accuracy, robustness, and calibration, rather than on a single metric.
The radar diagram above summarizes the composite scores for groups of metrics for the entire pool of scenarios (OLS/Lasso/Ridge/ElasticNet variants “as is”, their L/R versions with robust scaling, γ-regularization, and configurations with targeted PLS factorization). The L/R convolutions and PLS + regularization form the most “rounded” polygons—they fill almost the entire triangle of goodness of fit–scale errors–CV metrics, indicating both correct error scaling and stability over deferred periods. By contrast, the “Original” versions without L/R transformations and aggressive domain-free compression (PCA95) show dips on one or two axes, signaling insufficient robustness and calibration under drift, thus visually confirming the methodological hypothesis that targeted PLS compression of thematic coordinates is critical for industrially acceptable accuracy and stability.
The integral horizontal ranking quantitatively consolidates the conclusions of the radar. The group of L/R-models with γ-regularization and/or PLS-core leads. Top positions and composite scores (0–1):
Ridge Gamma L/R—0.600;
Lasso L/R—0.592;
ElasticNet L/R—0.591;
Lasso Gamma L/R—0.590;
ElasticNet Gamma L/R—0.589;
SumDiff PLS + Ridge L/R—0.582;
PLS L/R—0.571, OLS L/R—0.571, OLS Gamma L/R—0.568, and Ridge L/R—0.543.
Noticeably, the “Original” versions without L/R are inferior in terms of total quality (OLS Original—0.536, Ridge Original—0.523, Lasso Original—0.337, ElasticNet Original—0.324), while PCA95 + Ridge L/R—0.283 and “extrapolation from L/R” (Ridge EXT) turn out to be the lowest. Together with the radar, this directly points to the value of target factorization (PLS) and outlier-resistant scaling as necessary elements of the end-to-end pipeline: they are what make the balance of “accuracy–robustness–calibration” systematically achievable on a drifting sample.
Taken together, these results confirm the correctness of the methodology’s architectural decisions. The translation of texts into stable thematic coordinates (LDA, then auto-annotation), their target compression (PLS), and regularized regression on factors create a stable basis for the next steps—the scenario optimization of levers and construction of the graph of thematic relations, where management actions will be calculated over the interpreted factors and taking into account the constraints.
Neural auto-annotation is used only to assign human-readable names for the discovered topics. It does not change the underlying document–topic distributions produced by LDA. All quantitative steps in the pipeline use these numeric topic mixtures as inputs. This includes the PLS factorization, the regression models, and the scenario optimization. For this reason, predictions and recommended interventions do not depend on the wording of the labels. Labels affect interpretation and reporting, but not the computed risk factors or optimization results.
To compare scenarios, we aggregated quality into a single composite score for three groups of metrics: goodness_of_fit (approximation accuracy and error scale consistency), scale_errors/calibration, and cv_metrics (time tolerance). Each primary metric was normalized to a 0–1 scale within its group, then averaged into a group sub-score, after which the overall model score was calculated as a weighted average of the sub-scores (sum of weights = 1). This multi-axis aggregation allows the models to be ranked based on a balance of accuracy, calibration, and stability, rather than on a single metric, which is particularly important for the NRC EN drifting corpus. A detailed definition of the groups of metrics and the temporal design of the validation is given in the
Section 2, where this three-axis estimation scheme is justified.
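A minimal sketch of this three-group aggregation, with toy metric values and hypothetical group weights (the study's actual weights are defined in Section 2):

```python
# Composite score: min-max normalize each metric within its group
# (inverting "lower is better" metrics), average into group sub-scores,
# then take a weighted mean with weights summing to 1.
import numpy as np

def minmax(col, higher_is_better=True):
    col = np.asarray(col, dtype=float)
    span = col.max() - col.min()
    scaled = (col - col.min()) / span if span else np.ones_like(col)
    return scaled if higher_is_better else 1.0 - scaled

# Rows = models, columns = metrics within a group (toy numbers)
goodness = np.column_stack([minmax([0.90, 0.70, 0.50]),          # R^2
                            minmax([0.10, 0.20, 0.40], False)])  # RMSE
calibration = minmax([0.95, 0.80, 0.60]).reshape(-1, 1)          # slope ~ 1
cv_stability = minmax([0.85, 0.75, 0.40]).reshape(-1, 1)         # CV mean

subs = np.column_stack([g.mean(axis=1) for g in
                        (goodness, calibration, cv_stability)])
weights = np.array([0.4, 0.3, 0.3])                              # sum to 1
overall = subs @ weights                                         # 0..1 ranking
```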
The radar diagram of the group sub-score shows a key feature of the top configurations. Their polygons are maximal in all available vectors, i.e., simultaneously high values are achieved in all three axes. In particular, chains with PLS target factorization and robust scaling (L/R) form the tops of the diagrams, whereas the “original” variants without L/R and aggressive domain-free compression (PCA95) show dips on one or two axes, indicating worse tolerance and calibration in delayed windows. This empirically confirms the methodological requirement. PLS + regularization and L/R transformations are essential elements for robust accuracy on a heterogeneous and drifting sample.
The final weighted rating of the models is shown in the horizontal diagram. The leaders are as follows:
SumDiff PLS + Ridge (L/R)—0.938;
Ridge Gamma (L/R)—0.938;
Then, Lasso Gamma (L/R)—0.926, Lasso (L/R)—0.926, ElasticNet (L/R)—0.922, ElasticNet Gamma (L/R)—0.921;
“upper echelon” is closed by PLS (L/R)—0.876;
OLS (L/R)—0.841, Ridge (L/R)—0.828, OLS Gamma (L/R)—0.821–mid-range;
“original” without L/R is noticeably lower (OLS Original—0.801, Ridge Original—0.782, Lasso Original—0.442, ElasticNet Original—0.393);
The lowest scores are PCA95 + Ridge (L/R)—0.349 and Ridge EXT (from L/R)—0.211.
This order quantitatively captures the balance. SumDiff PLS + Ridge (L/R) maximizes the cv_metrics block, while Ridge Gamma (L/R) has a slightly higher goodness_of_fit—as a result their final scores are the same. It is the ability to hold a high level of accuracy, calibration, and portability that puts these configurations in the lead. In addition, schemes without target factorization or without robust scaling lose ground due to degradation on at least one axis.
From a practical point of view, this means that for further steps of the pipeline (estimation of factor contributions, leverage optimization, and the graph of thematic relationships), one should rely on the upper group of models. They provide the best compromise “accuracy–stability–calibration”, i.e., they minimize the risk of over-learning for the historical vocabulary and correctly capture the magnitude of errors in the transition to out-of-time periods, which is critical for operation in production risk management loops [
54,
55,
56,
57,
58,
59].
Moving from the “flat” ranking to analyzing the interactions of the pipeline components, we constructed a cube visualization of the composite score (0–1) along three orthogonal axes:
(X) regressor family (OLS, Ridge, Lasso, ElasticNet, PLS-reg, PCA95 + Ridge, Ridge EXT);
(Y) scaling/robust normalization option (Original vs. L/R);
(Z) feature factorization “layer” (Base → PLS only → PLS + SumDiff).
Analysis of the cube visualization of the composite scores shows several consistent effects and their interactions. First, the “main effect” of L/R is noticeable. The transition from Original to L/R configurations yields an increase in integral scoring in almost all model families; the magnitude of the gain varies from ≈+0.03 to ≈+0.15—from moderate for OLS to pronounced for Ridge/Lasso/ElasticNet, which is fully consistent with the “flat” ranking, where L/R versions systematically outperformed the original ones. The second baseline effect is related to the factorization layer: moving up the Base → PLS only → PLS + SumDiff axis leads to a consistent increase in quality. The typical gain at the Base → PLS transition is ≈+0.07…+0.10, and the additional PLS → PLS + SumDiff step adds another ≈+0.02…+0.06. Visually, this appears as the “warming” of colors from the lower to the upper shelf of the cube and confirms that the target PLS factorization acts not as an auxiliary, but as a key driver of portability and calibration.
Against the background of these main effects, the “PLS-layer × L/R” synergy is particularly noticeable. The maximum values of the composite score are concentrated in the “top-right” corner of the cube—the combination of L/R + PLS + SumDiff. This is where the previously identified leaders—SumDiff PLS + Ridge (L/R) and Ridge Gamma (L/R)—are located—both configurations give Overall ≈ 0.938, which shows that the best result is achieved not by choosing one “right” module, but by their coordinated combination.
Ridge, Lasso, and ElasticNet appear to be the least conflicting with PLS and benefit simultaneously from both factorization and L/R, reaching the upper quality levels. PLS-reg as a separate family is stable and strong in the PLS only layer, but more often yields to the PLS + SumDiff composition with external regularization. OLS moderately gains from L/R, but even with PLS + SumDiff remains in the middle echelon. The PCA95 + Ridge binding shows a systematic lag even in the L/R layer, indicating a deterioration of tolerance under domain-agnostic compression. Finally, Ridge EXT (from L/R) forms the worst corner of the cube (down to ≈0.21), as ablation with removal of the regressor from the L/R contour destroys the feature scale and parametric matching.
From an operational point of view, this means that the “hot” blocks of the cube coincide with the top of the flat ranking, and the L/R + PLS(+SumDiff) + regularizer bundle from the Ridge/Lasso/ElasticNet family provides a stable accuracy–calibration–tolerance trade-off on the drifting case. On the contrary, attempts to replace PLS with a domain-free PCA or to break the coherence of the pipeline lead to the structural degradation of quality.
Thus, the “cube” shows not only who is the leader, but also why. Quality is the result of a coordinated trio of solutions (robust scaling × target factorization × regularized class) rather than a single “strong” model. This is the methodological rationale for selecting the top group of configurations for the next steps, i.e., assessing the contributions of the topics, optimizing the selection of “levers”, and building the linkage graph.
By localizing the hot configurations found in the multivariate ranking at the level of internal representations, we consider the stability, calibration, and factor structure of the best model from the upper echelon (PLS + SumDiff + Ridge in the L/R loop). The distribution of the cross-sectional explained variance shows that CV-R² is consistently high (Figure 7). The median is close to unity, the interquartile range is narrow (≈0.9–1.0), and sporadic dropouts at early windows (about 0.2–0.3) reflect the expected sensitivity to local lexical and process drift in historical periods. The “predicted vs. actual” diagram demonstrates good scale calibration. The point cloud lies along the diagonal without systematic displacement. Noticeable clusters near levels 0, ≈0.5, and 1 correspond to the discrete structure of the proxy index (aggregation of normative/class indicators) and confirm that the regressor correctly reproduces both the low-risk “background” and the medium- and high-risk levels. The upper tail shows rare cases of underprediction (points under the diagonal at actual ≈1), which is natural for CVaR-sensitive settings. Extreme patterns are rare and can be compressed by regularization; they will be the targets of targeted monitoring enhancement in the operational loop. The decomposition of the absolute Ridge coefficients in the PLS component space reveals a compact “kernel” of predictive mass. The second component makes the largest contribution (|β| is maximal), followed by the first, third, fifth, and tenth components with decreasing weights. The rest constitute a “thin tail” of small but non-zero effects. This picture corresponds to the target meaning of PLS: several orthogonal factors aligned with the risk metric concentrate the main informativeness, ensuring both interpretability (via the topic–factor loading matrix) and robustness (suppression of the multicollinearity of topics).
To link the factor structure to the semantic level, we assessed the permutation importance of the Sum/Diff thematic aggregates. The gradient features “Miscellaneous Incidents” and “Threats and Incidents” dominate the top 20, significantly outperforming the rest of the pool. This is followed by “Discharge of pollutants” and, as the first summary predictor, “Unauthorized access (summary)”. The prevalence of diff coordinates indicates that the risk is sensitive to the dynamics of topics (change of frequency/content of episodes) and not only to their absolute proportions. It is the rate of change in “miscellaneous incidents”, “threats”, and “discharges” that is an early indicator of risk growth. At the same time, the high significance of the total level of “unauthorized access” is consistent with the optimization block. Even without sharp dynamics, the increased “access background” itself gives the highest marginal derivative of the target risk function, which makes it the primary point of application of interventions. This combination of “dynamics of most topics + level of key topic” explains the previously observed asymmetry in the polar leverage diagram and the bridging role of “resets” in the linkage graph. On the management map, these phenomena form a contour where changes in practices are most quickly translated into a decrease in the integral index.
Building on the identified leader (PLS + SumDiff + Ridge in the L/R-loop) and its factor structure, we formalized the management-choice setting as the optimization of an action vector a = (a_1, …, a_K), where a_k is the intensity of intervention on topic k (the averted proportion of events or reduced severity). The change in the topic profile was modeled by the deformation Θ′ = Θ − ΔΘ(a) through calibrated elasticities, after which the risk prediction was recalculated as R̂(a) = f(Θ′W), where f is the regressor trained on PLS components and W is the topic factor-loading matrix. The objective function combined the expected risk with costs and constraints and was minimized via Bayesian optimization, which allowed searching for an optimal “set of levers” under limited resources and uncertainty of estimates. The polar diagram of the factor mix shows that the largest differential contribution in modulus belongs to the topic “Unauthorized access”: “cutting” it gives the greatest marginal gain in the risk functional. For the dynamic (“diff”) coordinates—“Miscellaneous incidents”, “Threats and incidents”, and “Pollutant discharge”—the marginal effects are noticeably smaller in modulus and close to zero, indicating the rationality of combining them into package measures (the synergistic effect manifests under joint impact rather than through single levers). Thus, the optimization not only confirmed the dominance of “unauthorized access” but also established the priority of a combined strategy for the ordinary diff themes [
60]. Next, to capture the structure of inter-thematic dependencies and to identify points where bundling yields a multiplicative effect, a statistical relationship network was constructed. In addition, the permutation feature importance ranking (top 20, Sum/Diff) for the best-performing scenario is shown in
Figure 8 as a horizontal bar chart.
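The prescriptive step can be sketched as a constrained minimization of predicted risk plus intervention cost. The elasticities, cost weights, the linearized risk model, and the use of a bounded gradient-free minimizer (standing in for the Bayesian optimization used in the study) are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
K = 6
theta = rng.dirichlet(np.ones(K))       # current topic profile Theta
beta = rng.normal(size=K)               # linear stand-in for the regressor f
elasticity = np.full(K, 0.8)            # calibrated elasticities (assumed)
cost = np.linspace(0.1, 0.6, K)         # per-lever intervention costs (assumed)

def objective(a):
    # Deform the topic profile: Theta' = Theta - dTheta(a)
    theta_new = theta - elasticity * a * theta
    # Recompute predicted risk and add the cost penalty
    return float(beta @ theta_new) + float(cost @ a)

# Bounded search for the optimal "set of levers", a_k in [0, 1]
res = minimize(objective, x0=np.zeros(K), bounds=[(0.0, 1.0)] * K)
print("optimal lever intensities:", np.round(res.x, 2))
```

In the full pipeline, `objective` would call the trained PLS + Ridge regressor on the deformed profile, and a Bayesian optimizer would replace the local minimizer to handle noisy estimates.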
Nodes correspond to auto-annotated topics, edges—to significant associations (co-occurrence/partial correlations/conditional mutual information). In the obtained topology, “Pollutant discharge diff” plays a central role. It connects “Unauthorized access (total)” to “Miscellaneous incidents”, as well as to the topics “Explosives (total)” and “Radiation exposure (total)”, and is linked via shortcuts to “Equipment accidents (total)”. In the language of graph theory, the node “Discharge of pollutants” has a high mediating centrality (often lying on the shortest paths between other topics), while “Unauthorized access” has a prominent, albeit smaller, centrality. Influencing ‘access’ has the largest direct risk effect, while prioritizing ‘discharges’ has a disproportionately large indirect effect, reducing cascades leading to equipment accidents. Therefore, the optimal portfolio should combine targeted interventions on ‘access’ and harmonized measures on the sub-facet where ‘discharges’, ‘miscellaneous incidents’, and ‘radiological effects’ converge (
Figure 9).
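The mediating role of the “pollutant discharge” node can be checked with betweenness centrality. The edge list below is entered by hand to mirror the links described in the text; in the study, edges come from co-occurrence, partial-correlation, and conditional-mutual-information tests.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("pollutant_discharge_diff", "unauthorized_access_sum"),
    ("pollutant_discharge_diff", "miscellaneous_incidents"),
    ("pollutant_discharge_diff", "explosives_sum"),
    ("pollutant_discharge_diff", "radiation_exposure_sum"),
    ("pollutant_discharge_diff", "equipment_accidents_sum"),
    ("unauthorized_access_sum", "miscellaneous_incidents"),
])

# Betweenness centrality: how often a node lies on shortest paths
bc = nx.betweenness_centrality(G)
hub = max(bc, key=bc.get)
print("most mediating topic:", hub)  # the bridge node of the topology
```

A high-betweenness node is exactly the kind of “bridge” whose mitigation yields a disproportionately large indirect effect on cascades.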
The final step is to translate the quantitative results into management decisions. To make the promised end-to-end interpretability concrete, we provide a worked traceability example that follows one incident narrative through the entire pipeline (as an illustrative example, the graph-based visualization is shown in
Figure 10 below). The example shows the input text after preprocessing, the inferred topic mixture, the auto-annotation label assigned to the dominant topics, the resulting PLS factor scores, the regression-based risk estimate, and the resulting prescriptive lever selection under the stated constraints. For readability, the full step-by-step trace is placed in
Appendix B (see
Appendix A,
Table A1, for the definitions of the main mathematical notation used in this study), while the portfolio summary below reports the aggregated recommendations for the best-performing scenario.
We generated recommendations by aggregating ranked marginal effects from scenario analysis, graph attributes (centrality and bridge positions), and operational constraints/costs. The resulting specification is passed to the LLM module, which generates human-readable actions in an EAM/CMMS-compatible format (type of work, frequency, responsibilities, resources/competencies, checkpoints, and KPIs). To summarize: the first priority is to strictly limit unauthorized access (physical/logical protection, access control to hazardous materials); the next is to reduce the coupling vulnerability of the “pollutant discharge” node (continuous monitoring and response procedures); and the “miscellaneous incidents, threats/radiation” package should be addressed by joint organizational and procedural measures (standardization of reporting, training, and revision of regulations). In a deployed operational format, this leads to the following plan:
Strengthen the physical protection of NPPs, especially access control to explosives (quarterly inspections; responsible service: security);
Develop and implement a program for regular assessment of equipment vulnerabilities (annually; responsible: engineering department; channel: IAEA standard audit/peer-review) [
61,
62,
63];
Improve monitoring and control of discharges of radioactive substances (continuously; responsible: environmental service; channel: automated monitoring and external data from observation networks);
Conduct enhanced training of accident and incident response personnel (annually; responsible: training department; control: certification and training);
Implement a system for the early detection and prevention of unauthorized access (permanently; responsible: security service; channel: access logs, SIEM);
Strengthen condition monitoring of reactor components and pipelines (quarterly; responsible: engineering department; channel: NDT/diagnostics, predictive maintenance);
Maintain an explosives threat plan (annual update and exercise; responsible service: security).
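The specification handed to the LLM module for EAM/CMMS export can be sketched as a small machine-readable record per action. The field names follow the attributes listed above (type of work, frequency, responsibilities, channel); the exact schema is an illustrative assumption, not the system's actual format.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class MaintenanceAction:
    """One EAM/CMMS-compatible work package (illustrative schema)."""
    work_type: str
    frequency: str
    responsible: str
    channel: str

# Two of the plan items above, encoded as records
plan = [
    MaintenanceAction("physical access control inspection", "quarterly",
                      "security service", "access logs, SIEM"),
    MaintenanceAction("discharge monitoring review", "continuous",
                      "environmental service", "automated monitoring"),
]

# Serialized form suitable for loading into an EAM/CMMS import interface
print(json.dumps([asdict(a) for a in plan], indent=2))
```

Keeping the specification structured (rather than free text) is what makes the LLM-generated actions reproducible and auditable downstream.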
Thus, the final stage demonstrates the complete cycle inherent in the methodology: from stable semantic coordinates and target factorization to interpretable factor effects, from scenario-based optimization to graph-based bundling of measures, and finally to operationalized recommendations ready to be loaded into EAM/CMMS [
64,
65]. On the examined dataset, this pipeline showed that the “phenomenon-centered” framework not only explains risk but also transforms it into a valid management toolkit addressing predominant and bridging phenomena in a coherent network of topics [
66].
4. Discussion
Modern industrial systems are creating an increasingly complex and multifaceted technological environment in which heterogeneous data volumes are increasing dramatically. At the same time, much of the applied knowledge about accidents is contained in unstructured texts of reports and notifications, which is beyond the capabilities of traditional numerical risk analysis models. Classical methods of accident risk assessment (FMEA, event trees, Bayesian networks, etc.) face limitations [
67,
68,
69,
70,
71]. The scarcity of serious incidents and unbalanced reporting create fragile metrics, and the constant drift of technology and regulations quickly renders static algorithms obsolete. The lack of transparency of “black boxes” further complicates the adoption of their results in practice. Specialists and regulatory authorities need traceability and explainability at each step of the analysis. As a result, there is a gap between the information accumulated in texts and concrete management decisions. Analytical findings are rarely translated directly into preventive and corrective action plans or into work requests for asset management systems (EAM/CMMS) [
71,
72].
The proposed scenario-optimization method for the semantic processing of incident texts addresses these challenges. It builds a unified analysis pipeline from source texts to practical recommendations by combining qualitative information from accident reports with quantitative risk indicators. Clustering and thematic coding transform incident texts into a set of predominant themes, and partial least squares (PLS) regression with regularization generates stable latent themes and risk factors. The resulting factors are invariant descriptions of key accident causes that remain robust to noise and to changes in data distributions.
Our choice of classical LDA as the core topic modeling method is deliberate. The model produces probabilistic themes with explicit word distributions and document-level mixtures, which are easy to inspect, label, and monitor over time. This fits the requirements of regulated risk engineering environments, where expert review, audit trails, and versioning of semantic artifacts are essential. At the same time, modern approaches based on contextual embeddings can provide richer and more flexible representations of incident narratives. Examples include BERT-based topic models, neural topic models, and clustering in transformer-derived vector spaces. In this study we did not integrate these models into the main pipeline. Our focus was on clarifying the phenomenological layer and on demonstrating its integration with PLS factorization and prescriptive optimization in a fully traceable way. A systematic comparison with embedding-based topic discovery and hybrid architectures is an important direction for future work. It will show how stable the proposed phenomena remain when the underlying semantic representation changes and whether contextual models can further improve the robustness of lexical and regulatory drift.
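The inspectability that motivates the choice of LDA is easy to demonstrate: both the per-topic word distributions and the per-document mixtures are explicit arrays. The toy corpus below is an illustrative stand-in for the NRC event notifications, not the study's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in corpus (the study uses ~27,000 NRC event notifications)
docs = [
    "unauthorized access to protected area detected by security",
    "discharge of pollutants exceeded the monitoring threshold",
    "equipment failure in the pipeline caused an unplanned shutdown",
    "security reported unauthorized access attempt at the gate",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Document-topic mixtures Theta: each row is a probability distribution
theta = lda.transform(counts)
print(theta.round(2))
```

Because `lda.components_` and `theta` are plain matrices, topic labels, audit trails, and version-to-version comparisons reduce to inspecting and diffing these arrays, which is exactly the property regulated environments require.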
Beyond embedding-based neural topic models, an increasingly active line of research explores GPT-based topic modeling, where large language models are used not only for topic labeling, but for topic discovery itself. In zero-shot or weakly supervised settings, an LLM can induce topic labels, short descriptions, and representative keywords directly from batches of documents, or it can generate intermediate document-level explanations/summaries whose embeddings are subsequently clustered, resulting in hybrid LLM–embedding topic models. Recent comparative evidence suggests that these approaches can improve the human interpretability of topics, but they may also exhibit higher thematic overlap and sensitivity to prompting and model/version choices, making reproducibility control and governance particularly important. Therefore, a key direction for future work is to benchmark our LDA-based phenomenological layer against neural and GPT-based topic modeling under the same drift-aware evaluation protocol and to assess not only topic coherence, but also the downstream stability of PLS factors, risk calibration, and the robustness of the resulting prescriptive portfolios [
73,
74].
The method then performs multi-criteria scenario optimization. Several quality metrics (fit accuracy, error magnitude, and stability on delayed samples) are used to compare model configurations at once, which prevents overfitting and increases the transferability of results to new cases. The formal structure of the pipeline ensures the full traceability of all steps. Each step—from text preprocessing and topic identification to regression calibration and leverage selection—is documented and can be audited [
75,
76,
77]. Finally, the output of the pipeline is not an abstract mathematical model, but operationalized conclusions. It generates a set of concrete management actions with priorities, resources, and assigned responsibilities, ready to be integrated into EAM/CMMS and checked for compliance with regulatory requirements.
The proposed pipeline has clear practical value as a decision-support tool for organizations that already collect incident narratives and structured maintenance data. Many operators maintain shift logs, incident notifications, investigation notes, and asset records. The method uses these inputs to produce interpretable recurring patterns in the form of topics. It then links these patterns to a proxy risk index through latent factors. Finally, it generates a ranked set of mitigation actions with an estimated effect under resource and policy constraints.
In practice, the outputs can be aligned with maintenance and safety workflows. Recommended measures can be translated into work packages that specify the type of work, frequency, responsible roles, required resources, and control points. These work packages can be added to EAM or CMMS planning as preventive tasks, inspection routines, training activities, or procedural updates. This supports routine use during regular safety reviews and planning meetings. It also helps teams prioritize interventions based on expected impact and available budget.
Operational deployment also requires transparency and traceability. The approach supports expert review of topic labels and key assumptions used in optimization. It provides a clear mapping from detected phenomena to latent factors and then to the risk proxy and selected actions. This makes the method easier to justify in regulated environments and easier to audit. The pipeline can also be used in a continuous monitoring loop. After actions are implemented and new events are recorded, the same workflow can be rerun. This allows teams to check whether risk drivers have shifted and whether the selected measures delivered the expected improvement.
The validity of the approach was confirmed experimentally. During testing on the NRC Event Notifications corpus (about 27,000 incident reports), multi-criteria scenario analysis showed that configurations with PLS factorization and regularized regressors are consistently among the leaders: they simultaneously provide high forecast accuracy, correct calibration of the error bars, and stable operation on deferred data. The advantage of target-aware factorization over “coarse” dimensionality reduction that ignores the target (e.g., PCA) is noted separately: models with thematic components demonstrate more stable metrics and better transferability.
“Unauthorized access” was the dominant theme in reducing the integral risk, while “pollutant discharge” acted as a linking node connecting different groups of incidents (including “equipment accidents”). This topology reflects the multiplicative effect of integrated measures: targeting these key topics reduces not only direct risks but also the associated cascades of other incidents. In the final stage, the optimization results and the structured graph of topics are fed into a recommendation generator, which formulates human-readable and reproducible lists of measures. The resulting management guidelines proved to be reproducible: repeated runs of the pipeline on the same dataset generated a similar list of prioritized measures. Thus, the experimental validation demonstrates the full viability of the approach—from semantic text factorization to practical solutions—with transferability and repeatability of the results [
78].
The scalability and application prospects of the method are vast. Thanks to its modular architecture and ontology-based approach, the pipeline can be easily adapted to other areas of critical infrastructure. It can be applied to the oil and gas, chemical, transport, and other sectors by customizing topics and taking into account industry-specific vocabularies. Full traceability of all analysis steps facilitates automated auditing and compliance checking [
79]. The entire process—from text processing to recommendation generation—can be documented and verified by regulators and internal compliance bodies. In the future, this unity of formalization and operations will enable cross-domain, cross-sector risk management. By combining semantic information about incidents from different spheres, it is possible to identify common regularities and coordinate preventive strategies across industries [
80,
81,
82,
83]. The continuous incorporation of new incident data and feedback on implemented measures enables self-learning, bringing management closer to proactive safety monitoring and bringing analytics closer to real-world maintenance planning and inspection processes [
84,
85].