Review Reports
- Dmitry Rodionov1,
- Prohor Polyakov2,* and
- Evgeniy Konnikov1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The article investigates a phenomenological semantic-factor method for risk management of complex drifting systems. After the introduction, the paper presents the materials and methods and then describes the research results. Section 4 contains the discussion, while the last section is the conclusion. The paper presents good results, but I have some recommendations:
- The main contribution of the research description is missing.
- The literature review section is missing. The paper should also summarize the reviewed literature in a tabular format, for example with the following columns: references, authors, publication year, problems, solving methods. The table should also indicate how the current research differs from and aligns with the studies presented in the literature.
- The table of notations and their meanings is missing.
- Some figures have labels that are not clearly visible.
- There is no information in the paper about the software used. Did the authors implement the software themselves, or is it based on existing software or libraries? Running times and a description of the test environment are also missing.
- The applied algorithms and methods are not defined in the paper, for example Ridge regression (L2), Lasso regression (L1), ElasticNet (L1 + L2), OLS (ordinary least squares), etc.
- Please explain in more detail the potential practical applicability of the research.
- Conclusion: future research direction is missing.
Author Response
Dear Reviewer,
Thank you for the detailed and constructive feedback.
- A description of the main contribution of the study has been added as a paragraph at the end of the introduction.
- The literature review has been expanded and is now followed by a comprehensive table aggregating the results of the analysed literature, including an indication of how the current study differs from and coincides with the presented works.
- A table of symbols and their meanings has been added in Appendix A.
- In figures where the labels were hard to read, the labels have been enlarged for better readability.
- Information about the software used, the running times, and a description of the test environment has been added at the end of Section 2.
- The same section now defines the algorithms and methods used, such as Ridge regression (L2), Lasso regression (L1), ElasticNet (L1 + L2), and OLS (ordinary least squares) (see the illustrative sketch below).
- In Section 4, after the paragraph ending with "...checked for compliance with regulatory requirements," we explain the potential practical applicability of the study in more detail.
- In the Conclusion, before the last paragraph, we added a direction for future research.
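For completeness, a minimal scikit-learn sketch of these baseline estimators is shown below; the synthetic data and hyperparameters are illustrative only and do not reproduce the study's configuration.

```python
# A minimal, illustrative sketch of the regression baselines named above
# (OLS, Ridge/L2, Lasso/L1, ElasticNet/L1+L2) using scikit-learn; the toy
# data and hyperparameters are placeholders, not the study's configuration.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                                   # e.g., semantic factor scores
y = X[:, 0] * 2.0 - X[:, 3] + rng.normal(scale=0.5, size=200)    # synthetic risk proxy

models = {
    "OLS": LinearRegression(),
    "Ridge (L2)": Ridge(alpha=1.0),
    "Lasso (L1)": Lasso(alpha=0.1),
    "ElasticNet (L1 + L2)": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model.coef_[:4], 2))   # compare how regularization shrinks coefficients
```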
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript attempts an ambitious project to develop a phenomenological, end-to-end framework that converts unstructured incident narratives into measurable risk factors and actionable management recommendations. Conceptually, the authors present an appealing idea: a transparent semantic pipeline starting from textual descriptions and culminating in optimized interventions within asset-management systems. This goal is timely and relevant, especially in fields where decision-makers need not only accurate predictions but also clear explanations based on the language used in incident reports.
However, despite the strong conceptual foundation, the paper currently lacks sufficient methodological transparency and reproducibility to support its claims, particularly concerning the NLP components that underpin all subsequent modeling. The entire approach assumes that stable, interpretable, and domain-independent semantic structures (“phenomena”) can be extracted from diverse narrative data. Yet, the manuscript provides limited empirical evidence that the linguistic processes producing these structures are reproducible or adequately documented for independent verification.
This issue is evident in the initial stage—LDA topic modeling. The authors depend on LDA outputs as the semantic basis for deriving phenomena and PLS factors but offer almost no rationale for their modeling decisions or any evaluation of topic quality or stability. Since LDA is known to be highly sensitive to preprocessing choices, vocabulary thresholds, random initialization, and the number of topics, the absence of coherence scores, stability tests, or hyperparameter details makes it impossible to determine whether the topics represent consistent semantic patterns or are merely artifacts of a specific run. Because every later step relies on this initial decomposition, the lack of rigorous evaluation here undermines the reliability of the entire framework. Moreover, the field has largely shifted toward contextual embedding–based topic discovery (e.g., BERTopic, Top2Vec, CTM), which generally provide more coherent and domain-stable clusters for operational text. Without comparing LDA to more modern options, it is difficult to assess whether the proposed phenomena truly capture invariant semantic structure or are simply artifacts of LDA’s limitations.
The neural auto-annotation module poses an even greater challenge for reproducibility. The manuscript describes this step as a way to convert latent topics into interpretable labels but does not explain how the module functions, how it was trained or prompted, what data it uses, or how label quality is assessed. In modern NLP, topic labeling is among the least stable pipeline stages, and introducing an opaque annotation system—potentially based on large language models—without guidelines, examples, or agreement metrics means the paper’s key interpretability claims cannot be independently verified. While the resulting “phenomena” may seem plausible, without documenting the annotation process, they cannot be regarded as reproducible scientific constructs.
A similar lack of clarity applies to the PLS-based semantic factorization. Although aligning semantic dimensions with quantitative risk indicators is an attractive idea, the manuscript does not demonstrate whether these factors are stable across cross-validation splits, robust to noise, or sensitive to the specific LDA representation. Without bootstrap analyses or comparisons to current supervised semantic projection methods, it remains unclear whether the reported factors genuinely reflect linguistic-risk relationships or depend on undocumented modeling choices.
The regression and prescriptive optimization components inherit these uncertainties. The paper mentions “quantitative impacts” and “elasticities” but provides no diagnostic evidence that regression coefficients are stable, no justification for interpreting them as meaningful contributions, and no proof that prescriptive outputs would be consistent across repeated pipeline runs. Because the linguistically derived features upstream may be unstable, the downstream recommendations might also lack reproducibility, even if the optimization itself is well-defined.
The manuscript emphasizes resistance to concept drift, but the evidence supporting this claim is mostly rhetorical. Drift-aware NLP pipelines require explicit temporal evaluations, measurements of semantic shifts, and demonstrations that topic, factor, and regression layers remain stable when the distribution of narrative reports changes. Without such analyses, the claim of drift resilience is unsubstantiated.
Overall, the main challenge is not the conceptual design but the insufficient documentation of methodological steps. For a system claiming full traceability “from tokens to actions,” the paper provides surprisingly few concrete examples illustrating this process. Readers cannot follow even a single incident report through LDA, the annotation module, factorization, and into a recommended action. Without such detailed examples, the promised interpretability remains theoretical rather than proven.
In its current state, the manuscript presents an original vision but does not yet meet the reproducibility and methodological clarity standards expected for publication. The NLP foundation needs significant strengthening: the topic model must be evaluated and documented; the annotation system fully specified and justified; factor stability demonstrated; and claims of drift resistance supported by empirical evidence. With a more rigorous and transparent treatment of the linguistic components, the paper could make a valuable contribution. As it stands, substantial revision is necessary.
Author Response
Dear Reviewer,
Thank you for the detailed and constructive feedback. We agreed that the main risk for our framework was insufficient transparency and reproducibility in the linguistic layers that feed all downstream modeling. In the revised manuscript we therefore expanded the methodological specification and added explicit stability and traceability evidence across the full chain, from text to actions.
We clarified and documented the LDA stage. We now describe the deterministic preprocessing and vocabulary filtering, explain how the LDA configuration was selected, and report topic quality checks via standard coherence measures together with qualitative inspection. We also added explicit stability checks across different random seeds and temporal slices, and we state what artefacts are logged for independent verification. In addition, we explain why LDA was chosen for an audited and regulated setting, and we explicitly identify embedding-based topic discovery as a key direction for future benchmarking and potential improvement.
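For illustration, the following minimal sketch shows the kind of coherence and seed-stability check we describe, using gensim's LDA and coherence tooling; the toy tokenized texts, topic count, and pass settings are placeholders rather than our actual corpus or configuration.

```python
# Minimal sketch of a coherence/seed-stability check for LDA with gensim;
# the tokenized texts and settings below are toy placeholders.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [
    ["pump", "seal", "leak", "pressure"],
    ["valve", "corrosion", "inspection", "pipe"],
    ["pump", "vibration", "bearing", "failure"],
    ["pipe", "leak", "corrosion", "repair"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

coherences = []
for seed in (1, 2, 3):  # repeat over random seeds to probe topic stability
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                   random_state=seed, passes=20)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
    coherences.append(cm.get_coherence())

print(coherences)  # similar coherence scores across seeds indicate a more stable configuration
```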
We fully specified the neural auto-annotation module. Topic labeling is now performed locally using Ollama with a fixed model version and a deterministic decoding setup. We fix temperature, seed, and the remaining decoding parameters. We also fix and version the prompt template and log all inputs and raw outputs. We report an explicit reproducibility check showing identical labels in more than 95% of repeated runs under identical settings. We also apply a simple normalization rule to map rare near-synonym variants to a canonical label inventory.
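As an example of the deterministic labeling setup, the sketch below calls a local model through the Ollama Python client with zero temperature and a fixed seed; the model name, prompt template, and normalization map are hypothetical stand-ins for the versioned assets described in the manuscript.

```python
# Illustrative sketch of deterministic topic labeling via a local Ollama model.
# Model name, prompt wording, and the label-normalization map are hypothetical.
import ollama

PROMPT = "Give a short (2-4 word) label for a topic with these top words: {words}"
CANONICAL = {"pipe leakage": "pipeline leak", "leaking pipes": "pipeline leak"}  # near-synonym normalization

def label_topic(top_words, model="llama3.1", seed=42):
    resp = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(words=", ".join(top_words))}],
        options={"temperature": 0, "seed": seed},  # fixed decoding parameters for reproducibility
    )
    raw = resp["message"]["content"].strip().lower()
    return CANONICAL.get(raw, raw)              # map rare variants to a canonical label inventory

print(label_topic(["pipe", "leak", "pressure", "valve"]))
```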
To address the key logical concern about whether labels could change the results, we added an explicit statement that LLM-produced labels affect interpretability only. All quantitative steps use only the numeric document–topic distributions from LDA, including PLS, regression, and optimization. Therefore, predictions and recommended interventions are invariant to label wording.
We strengthened reproducibility diagnostics for the PLS, regression, and prescriptive layers. We now report factor stability across rolling-origin splits using loading similarity after sign alignment and subspace similarity diagnostics, and we quantify robustness via bootstrap resampling. For regression, we treat coefficients as associations rather than causal effects and justify their operational use via coefficient stability and rank stability across splits and bootstrap repeats. For the prescriptive output, we evaluate repeatability at the portfolio level using overlap metrics for top levers and dispersion of cost and risk reduction across repeated runs. We also replaced rhetorical drift claims with explicit temporal diagnostics that track semantic distribution shift, topic coherence on delayed periods, and stability of factor loadings and regression coefficients over time windows.
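The following simplified sketch illustrates the bootstrap loading-stability check with sign alignment, using scikit-learn's PLSRegression on synthetic data; it is a stand-in for the rolling-origin evaluation reported in the paper, not the actual experiment.

```python
# Illustrative bootstrap loading-stability check with sign alignment;
# X (document-topic distributions) and y (a risk proxy) are synthetic placeholders.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(12), size=300)           # toy document-topic matrix
y = X @ rng.normal(size=12) + rng.normal(scale=0.1, size=300)

ref = PLSRegression(n_components=3).fit(X, y)
ref_load = ref.x_loadings_                         # reference loadings (topics x components)

sims = []
for _ in range(50):                                # bootstrap resamples
    idx = rng.integers(0, len(y), size=len(y))
    load = PLSRegression(n_components=3).fit(X[idx], y[idx]).x_loadings_
    signs = np.sign(np.sum(load * ref_load, axis=0))   # flip components anti-aligned with reference
    load = load * signs
    num = np.sum(load * ref_load, axis=0)               # cosine similarity per component
    den = np.linalg.norm(load, axis=0) * np.linalg.norm(ref_load, axis=0)
    sims.append(num / den)

print(np.mean(sims, axis=0))                       # values near 1 indicate stable loadings
```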
Finally, to make “tokens to actions” concrete rather than theoretical, we added Appendix B as a worked traceability example. It follows one incident narrative through preprocessing, topic inference, deterministic labeling, PLS factor scores, risk prediction, lever selection, and the final EAM/CMMS-style action packages.
We also added the software and execution environment description, including the Python stack and key libraries, what parts are custom code, and typical running times on the stated hardware.
Reviewer 3 Report
Comments and Suggestions for Authors
I think this paper presents an interesting and highly practical framework for integrating unstructured text into quantitative risk management, addressing a significant gap in the analysis of complex, drifting systems given the amount of big data available. One key strength is that the end-to-end pipeline, from semantic encoding to scenario optimization, is well conceived and demonstrates a clear understanding of the operational needs in critical infrastructure domains. The use of phenomenological factorization coupled with Partial Least Squares (PLS) provides a novel method for creating interpretable, traceable models from incident narratives, which I find interesting. Another key strength is that the results demonstrating improved out-of-sample estimation and robust, prescriptive recommendations provide strong validation of the approach's utility, and the focus on auditability is a crucial feature for real-world adoption. Overall, a good paper.
However, to fully leverage the potential of this architecture and align it with contemporary AI advancements, the paper should elaborate on the regression model's feedback mechanism. While the overall process is described as a pipeline, the text mentions "feedback" without specifying its operational form. A detailed discussion is needed on how the outcomes of the implemented recommendations—the success or failure of the prescribed interventions—are captured and used to refine the model.
This presents a prime opportunity to connect the framework to the paradigm of Regression with Human Feedback, a concept instrumental to the success of models like OpenAI's GPT. In this context, the "human feedback" would be the expert assessment of an intervention's effectiveness or updated incident reports following its implementation. See regression with human feedback in Long Ouyang et al., "Training language models to follow instructions with human feedback," 2022, https://arxiv.org/abs/2203.02155, and M. F. Wong et al., "Aligning crowd-sourced human feedback for reinforcement learning on code generation by large language models," IEEE Transactions on Big Data, 2025, https://arxiv.org/abs/2503.15129. The paper should discuss in detail how this feedback loop would be formally structured: could the PLS model be periodically retrained or its coefficients adjusted based on this new performance data? Formalizing this would transform the system from a sophisticated static analyzer into a dynamic, self-improving tool that continuously aligns its risk assessments and recommendations with real-world outcomes, thereby directly combating concept drift through active learning.
By explicitly defining this feedback mechanism and drawing inspiration from successful interactive AI paradigms, the authors can elevate their valuable framework from a powerful analytical tool to a living, adaptive risk management system. This would not only enhance its long-term utility but also position it at the forefront of interactive, human-in-the-loop decision support systems for critical operations.
Author Response
Dear Reviewer,
Thank you for your careful reading of our manuscript and for your constructive suggestions. We particularly appreciate your comment that the paper should explain the feedback mechanism more clearly and connect it to modern interactive learning approaches.
We revised the manuscript to make the feedback loop explicit and operational. We added a new explanatory block right after the paragraph that describes the iterative cycle “data – models – solutions – effects – data” and refers to feedback in Figure 2. In this added text, we explain what information is captured after an intervention is implemented, including updated incident narratives, refreshed risk proxies, and records of execution quality from operational systems. We also describe how expert judgments about whether an intervention worked are collected and stored in a structured way.
We then explain how these outcome signals are used to refine the regression component over time. The manuscript now discusses regular retraining based on accumulated post-implementation data, as well as lighter incremental updates that allow faster adaptation between retraining cycles. We also expanded the discussion of the prescriptive module to clarify how observed outcomes are compared with predicted effects and how these comparisons are used to update the estimated effectiveness of interventions and improve future recommendations.
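A minimal sketch of how observed and predicted intervention effects can be compared to update effectiveness estimates is shown below; the record fields, intervention names, and update weight are illustrative assumptions, not values from the manuscript.

```python
# Illustrative outcome-versus-prediction comparison: each intervention's estimated
# effectiveness multiplier is nudged toward the observed realization ratio.
# Field names, intervention labels, and the learning rate are assumptions.
effectiveness = {"seal_replacement": 1.0, "inspection_increase": 1.0}  # prior multipliers
ALPHA = 0.2  # weight given to new post-implementation evidence

feedback = [
    {"intervention": "seal_replacement", "predicted_risk_reduction": 0.30, "observed_risk_reduction": 0.24},
    {"intervention": "inspection_increase", "predicted_risk_reduction": 0.10, "observed_risk_reduction": 0.12},
]

for rec in feedback:
    ratio = rec["observed_risk_reduction"] / rec["predicted_risk_reduction"]
    key = rec["intervention"]
    effectiveness[key] = (1 - ALPHA) * effectiveness[key] + ALPHA * ratio

print(effectiveness)  # updated multipliers feed the next prescriptive run
```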
We believe these changes address your concern by turning the feedback element from a general statement into a clearly defined mechanism that supports continuous improvement and helps mitigate concept drift in real operating conditions.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors of the revised manuscript have thoroughly addressed all my concerns and comments, which has significantly improved the manuscript and strengthened its conclusions. This is particularly important because the work represents a significant step toward integrating text analytics into practice, recognizing text data and its analysis as powerful tools for decision-making.
The pipeline presented in this manuscript offers clear practical value as a decision-support tool for organizations that gather incident narratives and structured maintenance data. The authors' method utilizes these inputs to identify interpretable recurring patterns in the form of topics and ultimately generates a ranked list of mitigation actions, estimating their effects within resource and policy constraints.
In the revised version, the authors made every effort to ensure their approach is as reproducible as possible. They also provided a thorough justification for using the LDA method as their primary tool and clearly stated that other topic modeling techniques could be explored further. Comparing their results with those obtained from other topic modeling algorithms, such as BERT-based and GPT-based approaches (see DOI: 10.18413/2313-8912-2025-11-3-0-5; 10.1007/s10462-023-10661-7 and related references), represents an important and promising direction for future research. I suggest including a brief paragraph about these types of approaches in future work, particularly GPT-based topic modeling, which is currently a hot topic.
In this paper, the authors chose to focus on clarifying the phenomenological layer and demonstrating its integration with PLS factorization and prescriptive optimization in a traceable manner, which is a scientifically sound decision.
Comments on the Quality of English Language
The quality of English could be improved.
Author Response
Dear Reviewer,
Thank you for your careful reading of our revised manuscript and for your supportive evaluation. We also greatly appreciate your suggestion to briefly acknowledge and reference modern neural and GPT-based topic modeling approaches as a promising direction for future work.
Following your recommendation, we expanded the manuscript’s forward-looking discussion to position our LDA-based phenomenological layer within the broader topic-modeling landscape. In the Discussion section, we added a concise paragraph in the part where we justify the use of LDA and discuss methodological choices. This new text notes that embedding-based neural topic models and GPT-based topic modeling can be explored as alternatives, and it highlights the main practical considerations for our setting, including interpretability, sensitivity to modeling and prompting choices, and reproducibility requirements.
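To indicate what such a benchmark could look like in practice, the sketch below runs BERTopic on placeholder narratives; it is purely illustrative and no results from it appear in the paper.

```python
# A minimal, illustrative BERTopic run of the kind we plan to benchmark against
# LDA in future work; the documents are placeholders and the settings are not ours.
from bertopic import BERTopic

docs = [
    "pump seal leak detected during routine inspection",
    "corrosion found on pipeline section after pressure drop",
    "valve actuator failure caused unplanned shutdown",
] * 50  # a real benchmark would use the incident-narrative corpus, not repeated toy sentences

topic_model = BERTopic(min_topic_size=10)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())  # discovered topics with sizes and representative terms
```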
We also strengthened the Future Work statements in the Conclusions section by explicitly stating that we plan to benchmark our LDA component against neural and GPT-based topic discovery while keeping the downstream PLS coupling and prescriptive optimization unchanged. We clarify that the evaluation will consider not only topic quality but also downstream stability and the resulting ranked mitigation actions under operational constraints. Finally, we added the two references you pointed out to the bibliography.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors did not fully address the previous review comments. In particular, the authors did not address how to connect the framework to the paradigm of Regression with Human Feedback, a concept instrumental to the success of models like OpenAI's GPT. In this context, the "human feedback" would be the expert assessment of an intervention's effectiveness or updated incident reports following its implementation. The authors should discuss regression with human feedback in Long Ouyang et al., "Training language models to follow instructions with human feedback," 2022, https://arxiv.org/abs/2203.02155, and M. F. Wong et al., "Aligning crowd-sourced human feedback for reinforcement learning on code generation by large language models," IEEE Transactions on Big Data, 2025, https://arxiv.org/abs/2503.15129. The paper should also discuss in detail how this feedback loop would be formally structured: could the PLS model be periodically retrained or its coefficients adjusted based on this new performance data? Formalizing this would transform the system from a sophisticated static analyzer into a dynamic, self-improving tool that continuously aligns its risk assessments and recommendations with real-world outcomes, thereby directly combating concept drift through active learning.
Author Response
Dear Reviewer,
Thank you for your careful reading of our manuscript and for the constructive suggestion to explicitly connect the proposed architecture to the paradigm of Regression with Human Feedback (RHF), by analogy with RLHF-style interactive learning used in instruction-following language models. We appreciate the opportunity to clarify and formalize the operational form of “feedback” in our framework.
In the revised manuscript, we expanded the methodology section where the feedback block in the pipeline is described. We now explicitly cast the adaptive update mechanism as RHF and connect it to the RLHF alignment loop, citing Ouyang et al. (2022) as the canonical reference for human-feedback-driven alignment. We also added Wong and Tan (2025) to address how heterogeneous feedback from multiple evaluators can be aligned and aggregated in a reliability-aware manner.
Most importantly, we formalized how the feedback loop is structured in operational terms. We describe how each deployment round records an auditable context–action–evidence tuple (pre-intervention semantic state, the implemented intervention portfolio, and post-intervention evidence captured through EAM/CMMS workflows, including updated incident narratives and refreshed risk proxy values). We then define how expert assessments and observed post-intervention outcomes are combined into a feedback/utility signal that can be used to refine the model.
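The sketch below shows one possible structured form for such a context-action-evidence record; the field names are hypothetical placeholders and would be mapped onto the actual EAM/CMMS export in a deployment.

```python
# Illustrative structure for the context-action-evidence feedback record described
# above; field names are hypothetical, not the schema of a specific EAM/CMMS system.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class FeedbackRecord:
    asset_id: str
    period: date
    pre_factor_scores: list[float]        # semantic (PLS) factor state before intervention
    interventions: list[str]              # implemented intervention portfolio
    predicted_risk: float                 # model prediction at decision time
    observed_risk: float                  # refreshed risk proxy after implementation
    expert_rating: int                    # e.g., 1-5 assessment of intervention effectiveness
    post_narratives: list[str] = field(default_factory=list)  # new incident texts

record = FeedbackRecord(
    asset_id="PUMP-017",
    period=date(2024, 6, 30),
    pre_factor_scores=[0.8, -0.2, 1.1],
    interventions=["seal_replacement"],
    predicted_risk=0.31,
    observed_risk=0.22,
    expert_rating=4,
)
```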
To directly address the question of model updates, we added an explicit description of two update modes. The first mode is periodic re-estimation, in which the semantic layer is refreshed and the PLS projection and regression parameters are recalculated on an expanded or rolling window corpus with drift-aware weighting. The second mode is incremental adaptation between major retraining cycles, where the PLS projection can be held fixed for interpretability while regression coefficients are updated in factor space using newly observed outcomes. We also explain how discrepancies between predicted and realized intervention effects provide a calibration signal to update prescriptive components, such as elasticity parameters and optimizer priors, turning the system into a dynamic, self-improving decision-support tool that actively combats concept drift.
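A simplified scikit-learn sketch of these two update modes is given below; the data are synthetic, and holding the PLS projection fixed while updating regression coefficients via partial_fit is one possible realization of the incremental mode, not a prescription of our exact implementation.

```python
# Simplified sketch of the two update modes: (a) periodic re-estimation refits the
# PLS projection and regression together; (b) incremental adaptation keeps the PLS
# projection fixed and updates only regression coefficients in factor space.
# Data and hyperparameters are synthetic placeholders.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(1)
X_hist = rng.dirichlet(np.ones(12), size=400)      # historical document-topic matrix
y_hist = X_hist @ rng.normal(size=12) + rng.normal(scale=0.1, size=400)

# (a) periodic re-estimation on the full (rolling-window) corpus
pls = PLSRegression(n_components=3).fit(X_hist, y_hist)
F_hist = pls.transform(X_hist)                     # factor scores
reg = SGDRegressor(random_state=0).fit(F_hist, y_hist)

# (b) incremental adaptation: project new post-intervention data with the frozen
# PLS projection and update only the regression coefficients
X_new = rng.dirichlet(np.ones(12), size=40)
y_new = X_new @ rng.normal(size=12) + rng.normal(scale=0.1, size=40)
reg.partial_fit(pls.transform(X_new), y_new)

print(np.round(reg.coef_, 3))                      # updated coefficients in factor space
```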
We hope these revisions address your concern by clearly defining how “human feedback” is captured, operationalized, and used to refine both the predictive and prescriptive parts of the pipeline over time, while preserving traceability and auditability.
Round 3
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have addressed all my previous review comments, and the paper has improved significantly. I recommend acceptance.
Author Response
Dear Reviewer,
Thank you for your helpful comments.