
Review Reports

Stats 2026, 9(1), 6; https://doi.org/10.3390/stats9010006
by
  • Dillon G. Hurd 1,
  • Yuderka T. González 2 and
  • Jacob Oyler 1
  • et al.

Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Please see the attached file.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

The authors are requested to check the quality of English. 

Author Response

Reviewer 1

Note: All changes to our submission are highlighted in red below and in the revised document.

Major Comments

  1. The JMP toolbox approach is described as "approximately find the smallest SSE by fitting many cases" (line 225). More detail is needed on: hyperparameter selection, network architecture, number of hidden layers, activation functions, training procedures, and stopping criteria.

 

Response: To address this comment, the following has been added.

 

More specifically, JMP was used to create and optimize a three-layer ANN structure. The ANN model began as a fully-connected single-layer perceptron, functioning as a decision-making node. This architecture consisted of three layers: the input layer, a hidden transfer layer, and the output layer. Each node in the transfer layer received weighted inputs from the input layer, and the final predictions were made based on the output layer’s activations. The transfer functions used within the model were a combination of linear and Gaussian transformations, and boosting techniques were applied to enhance the model’s performance. To ensure the inputs were appropriately scaled and transformed, continuous covariates were preprocessed by fitting them to a Johnson Su distribution. Using maximum likelihood estimation, this preprocessing step helped transform the data closer to normality, thereby mitigating the effects of skewed distributions and outliers. The general fitting approach aimed to minimize the negative log-likelihood of the observed data, augmented by a penalty function to regulate the model complexity. Specifically, a sum-of-squares penalty, applied to a scaled and centered subset of the parameters, was used to address the overfitting problem that often occurs with ANN models. This penalty was based on the magnitude of the squared residuals, helping to stabilize the parameter estimates and improve the model's optimization. Cross-validation was performed using the holdout method to assess the model's ability to generalize to new data. The training set consisted of the initial data range, while the validation set represented future observations, ensuring the model's predictive capacity for unseen data was tested.
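The Johnson Su preprocessing described above can be illustrated with a short, self-contained sketch (our illustration, not the JMP internals): fit the distribution by maximum likelihood with scipy.stats.johnsonsu, then apply its normalizing transform. The toy covariate is an assumption.

```python
# Illustration of the Johnson SU preprocessing idea (not the JMP internals):
# fit by maximum likelihood, then map the covariate toward normality.
import numpy as np
from scipy import stats

x = stats.skewnorm.rvs(6, size=500, random_state=0)  # toy skewed covariate
a, b, loc, scale = stats.johnsonsu.fit(x)            # MLE fit of Johnson SU
z = a + b * np.arcsinh((x - loc) / scale)            # normalizing transform
print(f"skew before: {stats.skew(x):.2f}, after: {stats.skew(z):.2f}")
```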

 

  2. The “confidential” Python ANN methodology (line 226) is problematic for a scientific publication. How can readers reproduce or validate results if the core methodology is confidential? Consider either: (a) making the method available, (b) providing sufficient detail for reproduction, or (c) clarifying what aspects are confidential and why.

 

Response: Our revision has the following changes and additions:

 

The second approach is our newly developed, novel ANN structure that is coded in the Python language, using the scipy.optimize library, with a tanh activation function and the dual annealing optimization solver. As Section 3 will show, both (JMP and Python) nonlinear-regression, static stage-two ANN approaches significantly improved fit in comparison to the linear regression, static stage-two, first-order approach used by [2].

Our W-PINN JMP approach uses a classical ANN p-input variable layer (see Fig. 1), where xi enters node i, i = 1, …, p, as shown in Fig. 1c. In contrast, our proposed W-PINN Python approach uses a q-input factor layer (see Fig. 2). For example, it uses terms like quadratic factors (e.g., xi^2) and interaction factors (e.g., xi·xj) as inputs to the input layer, where xi represents input factor i. Moreover, our proposed W-PINN Python methodology uses q input factors, and not p input variables, as shown in Fig. 2, below.

Figure 2. W-PINN with q input factors. The “^” means estimate.
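To make the described Python setup concrete, the following self-contained sketch (ours, not the authors' confidential code) builds the factor layer (linear, quadratic, and interaction terms), defines a one-hidden-layer tanh network, and minimizes SSE plus a parameter penalty with scipy.optimize.dual_annealing. The width, bounds, penalty weight, and toy data are illustrative assumptions.

```python
# Sketch of a one-hidden-layer tanh ANN over engineered input factors,
# fit with dual annealing; all sizes and data here are illustrative.
import numpy as np
from scipy.optimize import dual_annealing

def build_factors(X):
    """Expand p input variables into q input factors: x_i, x_i^2, x_i*x_j."""
    n, p = X.shape
    cols = [X, X**2]                              # linear and quadratic terms
    for i in range(p):                            # pairwise interactions
        for j in range(i + 1, p):
            cols.append((X[:, i] * X[:, j])[:, None])
    return np.hstack(cols)

def predict(theta, F, n_hidden):
    q = F.shape[1]
    W1 = theta[:q * n_hidden].reshape(q, n_hidden)
    b1 = theta[q * n_hidden:q * n_hidden + n_hidden]
    w2 = theta[-(n_hidden + 1):-1]
    b2 = theta[-1]
    return np.tanh(F @ W1 + b1) @ w2 + b2

def loss(theta, F, y, n_hidden, lam=1e-3):
    resid = y - predict(theta, F, n_hidden)
    return np.sum(resid**2) + lam * np.sum(theta**2)   # SSE + parameter penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                     # stand-in for the v_ij inputs
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=200)

F = build_factors(X)
n_hidden = 4
n_par = F.shape[1] * n_hidden + n_hidden + n_hidden + 1
bounds = [(-5.0, 5.0)] * n_par
result = dual_annealing(loss, bounds, args=(F, y, n_hidden), maxiter=100, seed=0)
r_fit = np.corrcoef(y, predict(result.x, F, n_hidden))[0, 1]
print(f"r_fit (training) = {r_fit:.3f}")
```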

 

  3. There is a typing problem with all equations starting from Eq. (3): a double line appears in those equations, which makes them unreadable. Please check compiling the file to PDF again. Also, Table 1 overlaps the next paragraph, so it is hidden.

 

Response: We see this problem in the PDF of the submitted manuscript but not in the revised manuscript. The problem appears to be corrected by an update of Word and/or the PDF converter. The Table 1 problem in the PDF is caused by this problem and is corrected in the revision also. We will make and review PDFs of the revised manuscript to confirm this problem no longer exists.

 

  4. With only 11 subjects and an almost one-week training/one-week validation split, the generalizability is questionable. Cross-validation or more robust validation strategies should be discussed.

 

Response: The study comprised 11 cases, each with about two weeks of free-living data. Every case was split approximately in half for training and validation. Random-shuffle k-fold cross-validation is inappropriate for dynamic physiological time-series modeling because it violates temporal causality and yields over-optimistic performance (Roberts et al., IEEE Trans Knowledge Data Eng 2017). The single chronological hold-out used here is the standard and most rigorous validation strategy for real-data sensor glucose forecasting applications.
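A minimal illustration of this chronological hold-out (names and shapes are ours, for illustration):

```python
# Chronological hold-out: the first half (~week 1) trains, the second
# half (~week 2, future observations) validates. Illustrative only.
import numpy as np

def chronological_split(y, train_frac=0.5):
    """Split one subject's time series into past (train) / future (validation)."""
    split = int(len(y) * train_frac)
    return y[:split], y[split:]

y = np.arange(4032)              # ~2 weeks at 5-minute sampling (2*7*24*12)
y_train, y_val = chronological_split(y)
print(len(y_train), len(y_val))  # 2016 2016
```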

 

The first paragraph in the manuscript has been changed as follows to avoid misunderstanding.

 

Our broad objective is the extension of the Wiener-Physically-Informed-Neural-Network (W-PINN) (see Fig. 1c) approach developed by [1] to improve modeling effectiveness when dynamic processes (i.e., systems) have highly nonlinear static behavior. This work is an extension and advancement of the modeling approach in [2] that was applied to three types of freely-existing data sets. The first data set consisted of four (4) nutrient inputs (xi, i = 1, . . ., 4) and modeled the change of weight (y) over time using first-order dynamic structures (vi), for each input i, and a quadratic, multiple linear regression, static output structure f(v), where v is a vector of the vi’s. The second one consisted of nine (9) xi’s and modeled the top tray temperature (y) of a pilot distillation column using second-order dynamic structures for the vi’s and a first-order multiple linear regression, static structure for f(v). Our goal, that the fitted correlation coefficient (rfit) of y and ŷ (the fitted y) satisfies rfit ≥ 0.90 for test sets, or validation sets when a test set is not possible, was met for these two cases in [2].
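For completeness, the fitted correlation coefficient used throughout is the standard Pearson correlation between the measured output y and the fitted output ŷ:

```latex
% Pearson correlation between measured y and fitted \hat{y} (standard definition)
r_{\mathrm{fit}}
  = \frac{\sum_{t}\,(y_{t}-\bar{y})(\hat{y}_{t}-\bar{\hat{y}})}
         {\sqrt{\sum_{t}(y_{t}-\bar{y})^{2}}\;\sqrt{\sum_{t}(\hat{y}_{t}-\bar{\hat{y}})^{2}}}
```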

 

  5. No statistical significance testing between methods. Are the improvements from 0.68 → 0.82 → 0.87 statistically significant?

Response: There are no legitimate formal statistical significance testing methods for dynamic modeling because dynamic response data cannot be randomized, and deviations of the model estimates from measured values will be systematic and not random, as shown in Figs. 2-4 of the submitted manuscript. Thus, statistical analysis for dynamic modeling can only be informal using numerical summary statistics like the ones given in this work and visual assessments like the plots given in this work.

Our conclusion is that the mean validation rfits of 0.68→0.82→0.87 appear to be statistically significant, in part because of the large improvements for all eleven subjects. However, you may have a different conclusion, and this is okay. Time will reveal the truth. Even formal statistical inference has its flaws.

  6. A discussion of confidence intervals or uncertainty quantification for the fitted parameters is required.

Response: Confidence intervals are formal statistical inferential methods and not applicable in this work as mentioned above. Also, uncertainty quantification in dynamic modeling can only be informal, i.e., based on summary statistics and visual analyses.

  7. There is limited comparison with existing glucose forecasting methods (LSTM models are mentioned but not compared empirically).

Response: LSTM, like NARMAX, is a lag-based empirical approach in the ANN modeling class given in Fig. 1a. NARMAX was evaluated against the proposed approach in [2] and performed significantly worse than W-PINN on a highly dynamic and first-order static process. Nonetheless, we plan to evaluate LSTM against a future one-stage version of the proposed W-PINN method, which is not in the scope of this work.

  8. Stage 1 uses Excel for optimization; this seems outdated and potentially limited. Why not use more robust optimization tools?

Response: The tool that one chooses to use is a personal preference. However, for this data set and all large data sets, we strongly advise modelers to decompose the problem into the steps that we describe in the manuscript. We added the following text:

The tool that one chooses to use for this step, as well as all the steps, is a matter of preference. However, we encourage modelers to break the modeling process down for this, and all large/complex data sets, as we have for this case. Note that for the other two modeling cases in [2], this decomposition procedure was not used.

 

  9. The MLR comparison is only mentioned for one subject (Subject 11). What is the reason for this choice?

Response: This subject had the highest full model validation rfit (0.79) in the original modeling work ([3]). The following text was added:

We also note that the MLR model was applied to Subject 11, the highest rfit,val (0.79) in [3], to compare the performance with the ANN models.

  10. The handling of missing armband data by "averaging the two values on both sides" (lines 203-204) is overly simplistic, especially for gaps "several hours long." This could introduce significant bias. Please justify this procedure.

Response: We agree that it is not very accurate, but it was the only unbiased idea we had then, and even now, for filling in the gaps. If you have a better suggestion for doing this, please share it with us; it is something that we would consider for future work. The armband variables have little to no impact on improving modeling, which is expected and not surprising.
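For concreteness, a minimal sketch of the averaging rule as described (our illustration, not the authors' code):

```python
# Replace each run of missing samples with the mean of the nearest valid
# values on both sides (one side at the array edges). Illustrative only.
import numpy as np

def fill_gaps(x):
    """Fill NaN runs with the average of the nearest valid neighbors."""
    x = np.asarray(x, dtype=float).copy()
    isnan = np.isnan(x)
    for start in np.flatnonzero(isnan & ~np.roll(isnan, 1)):
        end = start
        while end + 1 < len(x) and np.isnan(x[end + 1]):
            end += 1
        left = x[start - 1] if start > 0 else np.nan
        right = x[end + 1] if end + 1 < len(x) else np.nan
        x[start:end + 1] = np.nanmean([left, right])  # average of both sides
    return x

print(fill_gaps([1.0, np.nan, np.nan, 5.0]))  # [1. 3. 3. 5.]
```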

  11. A discussion of how missing output (SGC) measurements affect model training and validation is needed.

Response: We are assuming that the reviewer means “affected” this modeling study and we will address this question based on this assumption. We added the following to this revision at the end of the Discussion Section (4).

The amount of data missing for all the cases in this work is considerable and can be clearly seen in Model 1-2 plots such as Fig. 5. Missing SGC occurred for three reasons: the sensor was offline, the sensor was online but not saving the data, or SGC measurements exceeded the upper range of 400 mg/dL. In this application, missing output (SGC) data is unavoidable because the sensors must be changed and replaced periodically, and new sensors must be recalibrated to the attributes of the subject before saving data. While lost sensor data is undesirable, it is the quantity and the activity periods of the lost data that are most critical. Moreover, when missing data is possible, modeling protocols should include ways to minimize its impact.

Minor Comments

  1. The paper jumps between Excel, JMP, and Python tools without clear justification for each choice.

Response: Excel was the method used in Rollins et al. (2025) to produce the vij's. JMP and Python use the vij's from Rollins et al. For clarity, we added the following to Section 1, the Introduction.

Two types of TS methodologies are proposed – one that uses JMP and one that uses Python coding to develop a novel input factor W-PINN approach.

 

  2. Lines 20-21: "vij, where i = the subject number and j = the sample number": consider using more standard notation (e.g., subscript for subject, superscript for time).

Response: We appreciate the suggestion by the reviewer. However, I am not sure what the reviewer is basing "standard" on. I quickly looked at a few articles in the Stats journal; the notation was the same as mine. Since this is a minor recommendation, and due to my findings and the effort it would take to change it, we will leave it as is unless the editor feels otherwise.

  3. Equation (11): The bias-correction approach using past residuals at θ_MV distance seems ad hoc. Is there theoretical justification?

Response: Yes; the reference is given just above Eq. 12. This text is written below, and we added the content in red. Dynamic modeling is inherently biased modeling, as fitted values will be persistently below or above measured values, as clearly seen in all the plots in this work. In the "bias correction" methodology, this behavior is modeled and exploited to improve the predicted fit of the output model (see [12]).

The second one we call the "input-output model" or "Model 2." It combines the input-only structure of Model 1 (i.e., Eq. 10) with a model of weighted residuals (i.e., bias correction, see [9]), a minimum of θMV distance in the past (note that this is model building and not model forecasting), as shown in Eq. 11 below (see [9] for the derivation).
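Schematically, such a correction adds a weighted residual from at least θMV in the past to the Model 1 prediction; the form below is an illustration only (the exact Eq. 11 is derived in [9]):

```latex
% Illustration only -- the exact Eq. 11 is derived in [9]; w is a residual weight.
\hat{y}^{(2)}_{t} = \hat{y}^{(1)}_{t}
  + w\,\bigl(y_{t-\theta_{MV}} - \hat{y}^{(1)}_{t-\theta_{MV}}\bigr)
```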

  4. The assumption of a 60-minute deadtime being "very consistent" across subjects needs more support (confidence intervals, variability measures).

Response: This result was not determined in this work and was reviewed and accepted in [2]. I encourage the reviewer to read this work for more information. Again, this is dynamic modeling and confidence intervals are not applicable because of the nature of systematic bias. The variability measure that this work uses is AAD and formal inference is not applicable. Biased statistics and estimates can be useful and informative when bias is acceptably small. Formal statistical inference is just not applicable when measurement bias exists. 

  5. Table 1 is informative but lacks discussion of subject-to-subject variability.

Response: Here is just one example of the discussions that are now given.

As shown in Table 1, Stage 1, Model 1, rfit,val results varied from 0.59 to 0.77, with a mean of 0.68. Moreover, Stage 2, Model 1, rfit,val results improved significantly over the Stage 1 results for both ANN approaches. As shown, JMP Stage 2, Model 1, rfit,val results varied from 0.60 to 0.85, with a mean of 0.74. However, Python Stage 2, Model 1, rfit,val results are significantly better than JMP, varying from 0.72 to 0.93, with a mean of 0.82. As a result, Model 2 training and validation results, and Models 1-2 validation results are given in Table 1 for Python only. From Model 1 to Model 2, the Python mean rfit,val increased from 0.82 to 0.87, the minimum from 0.72 to 0.80, and the maximum of 0.93 did not change. In summary, Python Stage 2 results improved considerably over Stage 1 results and are significantly better than JMP Stage 2 results.

 

All the words highlighted in red are about subject variability.

 

  6. Why does Subject 10 (514) show a large improvement (0.72 → 0.86) while Subject 4 (505) shows modest gains (0.61 → 0.80)?

Response: 0.86 – 0.72 = 0.14 and 0.80 – 0.61 = 0.19. Thus, Subject 4 actually improved more than Subject 10. However, the difference does not seem all that significant. Our point is that both increased quite a bit.

  7. The maximum r_fit of 0.93 is impressive, but what about worst-case performance for practical deployment?

Response: The worst case does not prohibit "practical deployment." These are individuals, and different people with diabetes care for their diabetes differently. Subjects that benefit greatly will not be denied because some subjects do not benefit. The modeling here is "subject specific."

  8. No discussion of computational cost: training time, prediction time, scalability.

Reply: They are not relevant to our objective.

  9. Literature and context: missing recent diabetes forecasting literature (post-2015).

Reply: While the application is type 1 diabetes, the methodology is focused more broadly, specifically as an "… AI Methodology for Highly Dynamic and Highly Complex Static Processes," as stated in the title. Thus, this work is not specifically focused on diabetes but on highly dynamic and highly complex static systems or processes. To make this clear, we added the following to, and after, the first paragraph in this manuscript.

This work is an extension and advancement of the modeling approach in [2] that was applied to three types of freely-existing data sets. The first data set consisted of four (4) nutrient inputs (xi, i = 1, . . ., 4) and modeled the change of weight (y) over time using first-order dynamic structures (vi), for each input i, and a quadratic, multiple linear regression, static output structure f(v), where v is a vector of the vi’s. The second one consisted of nine (9) xi’s and modeled the top tray temperature (y) of a pilot distillation column using second-order dynamic structures for the vi’s and a first-order multiple linear regression, static structure for f(v). Our goal, that the fitted correlation coefficient (rfit) of y and ŷ (the fitted y) satisfies rfit ≥ 0.90 for test sets, or validation sets when a test set is not possible, was met for these two cases in [2].

The third case in [2] consisted of individually modeling eleven (11), two-week, type 1 diabetes data sets, with twelve inputs (xi’s), originally modeled in [3]. For each data set, its first week is used as training data and its second week as validation data. The sensor glucose concentration (SGC) sampling rate of five (5) minutes resulted in two very large data sets for each of the eleven modeling cases. A critical complexity of this case is that it requires forecast modeling for closed-loop forecast control, a future objective. Thus, all the inputs must have a model deadtime greater than or equal to the effective deadtime (θMV) of the manipulated variable (MV) unless an input has a scheduled (known) change (e.g., a meal) and a deadtime less than θMV, called an “announcement input.” These requirements were not followed in [3], and [3] is therefore not applicable to the modeling objectives of this work. The modeling strategy in [2] estimates θMV first, then any announcement inputs, and then all other inputs, using a one-input simple linear regression structure, to obtain initial estimates of the dynamic parameters for each input separately. After completion of this step for all the inputs, a full second-order dynamic structure and first-order static structure strategy was used to obtain final parameter estimates. This approach resulted in an average rfit,val of 0.68 and a maximum of 0.77, considerably short of the individual goal of rfit ≥ 0.90.

 

Note that we have no citation from the diabetes literature for the reasons stated above. This work is about W-PINN and not about advancing sensor glucose modeling.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors,
Your manuscript introduces a hybrid two-stage Wiener-Physically-Informed Neural Network (W-PINN) framework for modelling nonlinear dynamic and static processes, demonstrated through diabetes-related glucose data forecasting. The topic is timely and relevant at the intersection of AI, system dynamics, and biomedical signal modelling. The attempt to physically ground neural architectures is commendable. However, the manuscript requires substantial revision to meet the expectations of rigor, clarity, and reproducibility suitable for Stats or any high-impact computational journal. Below are detailed, actionable observations.

Title/Abstract

Concerns

  • The title is overly long and includes redundant qualifiers (“highly dynamic and highly complex static processes”).
    Consider shortening to: “A Two-Stage Wiener–Physically-Informed Neural Network for Nonlinear Dynamic and Static System Modeling.”
  • Abstract lacks a clear problem statement and does not articulate the novelty over existing PINN or hybrid Wiener approaches.
  • The dataset context (type 1 diabetes) appears suddenly and is not explained. Clarify why this dataset is appropriate for validating W-PINN.
  • Statistical indicators (e.g., average R² = 0.82) are presented without specifying validation folds, sample size, or baseline comparison (e.g., vs LSTM or standard PINN).

Introduction

  • The introduction reads as a methods summary rather than a background section. It lacks a concise research gap statement and fails to contrast existing PINN/Wiener literature.
    → Include a paragraph differentiating W-PINN from standard PINNs (Raissi et al., 2019) and hybrid system-identification frameworks.
  • Citations are dated or self-referential; add recent literature on hybrid physical–neural models (2022–2025).
  • Objectives are implied but not explicitly stated. End the introduction with a one-sentence aim such as:
    “This work develops and validates a two-stage Wiener-PINN approach to enhance interpretability and prediction accuracy for nonlinear dynamic systems.”

Methods

    • Equations defining the Wiener model, neural architecture (layers, activation, loss functions), and training parameters are missing. Provide mathematical formalisms rather than descriptive text.
    • The diabetes dataset source, sampling rate, preprocessing, and inclusion criteria are not documented. Without these, replication is impossible.
    • It is unclear whether cross-validation, hold-out, or rolling forecast evaluation was used. Specify how overfitting was controlled.
    • The comparison baseline (single-stage ANN or classic PINN) is only briefly mentioned. Include numerical comparison or a concise table of performance metrics.
    • State versions of JMP® and Python®, and libraries (TensorFlow, PyTorch etc.) if applicable.

Results

  • Figures or tables showing training/validation curves, residual distributions, or comparative outputs are missing or insufficiently described. Add at least one figure visualizing predicted vs observed SGC for representative cases.
  • No statistical test or error bar accompanies performance gains; thus, it is unclear if differences are significant.
  • Results focus on numeric fit without discussing computational efficiency or stability—key aspects for real-world deployment.
  • The bias-correction mechanism is described but lacks mathematical or algorithmic detail.

 Discussion

  • The discussion restates results rather than interpreting them mechanistically.
     Explain why the two-stage method improved performance—was it due to reduced parameter coupling, better gradient stability, or physical regularization?
  • No reflection on limitations (e.g., dataset size, generalizability beyond glucose systems, absence of stochastic modeling).
  • Claims of superiority over prior models are unsupported by comparative references or metrics.
  • Ethical considerations of applying AI to clinical data are not mentioned; briefly acknowledge data privacy and generalizability boundaries.

Conclusions

  • The conclusion reads as a continuation of the discussion and introduces new results (“maximum 0.93 fit”). Keep this section concise—focus on implications and next steps.
  • Suggest a clear future-work paragraph: integration of explainability tools or expansion to other dynamic processes.

Figures, Tables, and Presentation

  • Several figures appear low-resolution; ensure vector quality for readability.
  • Captions lack methodological context (e.g., specify axes, variable units).
  • The term “vij” is undefined in-text—define all symbols upon first use.
  • English is mostly clear but technical phrasing can be tightened:
    • Replace “highly effective” with “computationally efficient and statistically robust.”
    • Avoid trademark symbols (®) within scientific prose unless required.

Author Response

Reviewer 2

Note: All changes to our submission are highlighted in red below and in the revised document.

General comment about this review: We appreciate the time and effort that the reviewer has given to provide helpful feedback for clarifying content and for improving our manuscript. However, the reviewer has made many suggestions and comments that are well outside the scope and purpose of our manuscript (MS). Thus, we focus on clarifying this below, first, and will indicate the comments and suggestions that are outside our scope in our specific replies to the review.

First, it seems that the reviewer is treating this work as a first publication of the methodology. W-PINN was introduced in [1] where it modeled an industrial corn drying bin and in [2] under the name “Theoretically Dynamic Regression (TDR)” where it modeled three real processes and was compared with other methodologies, including lagged-based empirical ones.

The introduction of our W-PINN methodology and its PINN connection and relationship is given in [1] as stated in the first sentence of our MS. As stated in the title and clearly in the abstract now, to us, this work is completely focused on only W-PINN modeling of the 11 diabetes cases in [2] to significantly improve fit because the other cases met our fit criteria as stated in the manuscript.

TDR is a special class of W-PINN that uses linear regression static functions. The TDR methodology in [2] modeled processes and produced results that are more naturally suited for W-PINN than PINN. The first reason is that the input variables have highly different dynamic behavior, and PINN applies dynamic modeling to the single input from the ANN that is a composite of all the input variables. Secondly, the PINN ANN outputs have not been obtained for the diabetes data sets. However, if someone wants to do this modeling and compare it to the results of this work, the data sets are available. Thus, it is not in our scope or interest to do this modeling. We are not advocates of PINN in applications where the dynamic behavior differs greatly among input variables, as it does here. Thirdly, a W-PINN approach uses direct results from [2], the vij's.

Title/Abstract

Concerns

  1. The title is overly long and includes redundant qualifiers (“highly dynamic and highly complex static processes”).

Consider shortening to: "A Two-Stage Wiener–Physically-Informed Neural Network for Nonlinear Dynamic and Static System Modeling."

 

Response: This work is a follow-up to the work in [2], where three types of processes were modeled. We define rfit ≥ 0.90 as our "benchmark" of successful modeling. All processes in [2] were multiple input. The first is first-order linear regression (simple static) and first-order (simple) dynamic. The second is quadratic regression (simple static) and second-order (complex) dynamic. Our criterion of success was met for these two cases, and modeling of these data sets is completed. The third process we modeled as a first-order linear regression (simple static) and second-order forecast (highly complex) dynamic. Our criterion for rfit success was not met for any of the eleven cases modeled. Our hypothesis here is that we can make substantial progress in reaching our goal by using a high-order complex static structure that also results in a high-order dynamic structure. To emphasize this goal, we are leaving the title as it is and have added the following paragraphs.

 

Our hypothesis is that [2] used an effective dynamic structure and estimated the dynamic parameters sufficiently accurately, but that the first-order linear static structure cannot accurately capture the complex static forecast nature of SGC. Thus, the overall goal of this work is the development of a two-stage modeling methodology for highly complex static and highly dynamic behavior that achieves the modeling goal of rfit ≥ 0.90 on one or more of the eleven SGC data sets. While our process example is SGC, it is selected because of its highly dynamic and highly complex static nature. Thus, it is not the objective of this work to focus on the issues related to advancing diabetes modeling but to present a general W-PINN approach for modeling highly dynamic and highly complex static processes.

More specifically, the approach of this work is to use the vij results in [2], for each model i, as the first stage in a two-stage W-PINN approach, where the second stage is a nonlinear static ANN structure. Note that empirical dynamic ANN modeling and PINN modeling methods are outside this scope since they do not use the vij's (i.e., ANN) or one for each input (i.e., PINN). W-PINN is the only methodology that can directly use the vij results obtained in [2].

 

  2. Abstract lacks a clear problem statement and does not articulate the novelty over existing PINN or hybrid Wiener approaches.

 

Response: As stated in our General Comments above, PINN and/or hybrid Wiener approaches are not in the scope of our work.

 

Here is the problem statement in the submitted Abstract that we think is the reviewer’s concern: The objective of this work is the development of a highly effective Wiener-Physically-Informed-Neural-Network (W-PINN) modeling methodology for systems and processes with highly complex dynamic and highly complex static behavior.

 

We have completely rewritten the Abstract based on the comments and concerns of this reviewer.

 

  3. The dataset context (type 1 diabetes) appears suddenly and is not explained. Clarify why this dataset is appropriate for validating W-PINN.

 

Response: The abstract has been completely rewritten and Section 1 has been extensively revised based on this comment and other reviewer comments.

 

  4. Statistical indicators (e.g., average R² = 0.82) are presented without specifying validation folds, sample size, or baseline comparison (e.g., vs LSTM or standard PINN).

 

Response: First, it seems the reviewer is confusing R² with rfit. We don't apply R² in dynamic models that are nonlinear in the parameters. I Googled the following: "Is R-squared a valid statistic for nonlinear models?" This is the reply: "No, R-squared is generally not a valid or appropriate statistic for nonlinear models because the underlying assumptions of linear regression do not hold for them. In nonlinear models, the decomposition SSR + SSE = SST is often violated, and R-squared values can be misleading, appearing high for both very good and very poor fits."

 

r is valid for nonlinear modeling; it is a measure of fit. The sample size that was determined is given. We have already addressed why LSTM and PINN are outside the scope of this work.
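A small numeric illustration of why the distinction matters: for a fit that is perfectly correlated with the data but systematically biased, as dynamic model fits typically are, the quantity 1 - SSE/SST can even be negative while r remains informative (toy numbers):

```python
# For a perfectly correlated but biased fit, "R-squared" computed as
# 1 - SSE/SST is misleading (here negative), while r stays informative.
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = 0.9 * y + 2.0                      # perfectly correlated, biased fit
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
print("1 - SSE/SST =", 1 - sse / sst)      # about -0.46 (penalized by bias)
print("r =", np.corrcoef(y, y_hat)[0, 1])  # exactly 1.0
```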

 

Introduction

 

  1. The introduction reads as a methods summary rather than a background section. It lacks a concise research gap statement and fails to contrast existing PINN/Wiener literature.

Include a paragraph differentiating W-PINN from standard PINNs (Raissi et al., 2019) and hybrid system-identification frameworks.

 

Response: The research focus of this work is not W-PINN vs. PINN or any other methodology. Also, as pointed out above, this is not the introduction of the W-PINN methodology; that was done in [1]. The objective of this work is to extend TDR, which uses linear regression static structures, to W-PINN, which uses nonlinear regression static structures, to improve the modeling of the case in [2] that did not achieve the modeling goal of rfit > 0.9 using the Stage 1 results. PINN cannot do this and neither can LSTM. Only W-PINN can, because only W-PINN can directly use the results of Stage 1. It is our hope that the changes that we have made in the revision adequately explain this goal and purpose.

 

  2. Citations are dated or self-referential; add recent literature on hybrid physical–neural models (2022–2025).

 

Response: The "self-referential" citations are only there to give the evolution of the development of the proposed two-stage W-PINN methodology. In our opinion, "hybrid physical–neural models" do not fit our theme, for the reasons cited in "1." above.

 

  3. Objectives are implied but not explicitly stated. End the introduction with a one-sentence aim such as: "This work develops and validates a two-stage Wiener-PINN approach to enhance interpretability and prediction accuracy for nonlinear dynamic systems."

 

Response: We appreciate the reviewer's insight here and the example. However, the example omits critical attributes/key words and phrases such as "when dynamic processes (i.e., systems) have highly nonlinear static behavior" and "significantly increase rfit using a novel, proposed, two-stage (TS), W-PINN modeling approach." In Section 1, broad and specific objectives are given throughout as follows:

 

Our broad objective is the extension of the Wiener-Physically-Informed-Neural-Network (W-PINN) (see Fig. 1c) approach developed by [1] to improve modeling effectiveness when dynamic processes (i.e., systems) have highly nonlinear static behavior.

 

More specifically, the objective of this work is the development of an effective W-PINN modeling approach for dynamic systems with highly complex static behavior. Moreover, using the Stage 1 dynamic modeling results obtained in [2], the objective of this work is to significantly increase rfit using a novel, proposed, two-stage (TS), W-PINN modeling approach.

 

[Please note: Except for minor changes, this objective was stated in the original submission]

Methods

  1. Equations defining the Wiener model, neural architecture (layers, activation, loss functions), and training parameters are missing. Provide mathematical formalisms rather than descriptive text.

Response: All the Wiener model equations used in this work are in Section 2. We used a 1-layer ANN with a Tanh activation function, and the loss function comprises the SSE and the ANN parameter penalty. These equations, shown below, were added to the manuscript.

The loss function used by the Python program, shown in Eq. 19, is the SSE plus an ANN parameter penalty σ, defined in Eq. 20.

[Eq. 19 omitted]

where

[Eq. 20 omitted]
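For readers, a generic form consistent with this description is sketched below; the exact Eq. 20 penalty in the manuscript may differ, so treat this as an assumption:

```latex
% Assumed generic SSE-plus-penalty loss; \lambda is an illustrative penalty weight.
\mathcal{L}(\boldsymbol{\theta})
  = \sum_{t}\bigl(y_{t}-\hat{y}_{t}\bigr)^{2} + \sigma,
\qquad
\sigma = \lambda \sum_{k} \theta_{k}^{2}
```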

 

  2. The diabetes dataset source, sampling rate, preprocessing, and inclusion criteria are not documented. Without these, replication is impossible.

Response: There are two types of data sets on the website of the corresponding author, as mentioned in the document. To replicate the results, all that one needs is the data sets and the ability to apply the methodology, which is given in [2] and this work. The eleven input/output data sets needed to replicate the results in [2] (using the inputs xi,t to compute the vij estimates and then the fitted-y estimates) are posted, and the vij values are posted as well. Thus, these posted results can be used to check the computed values, or used to compute the fitted y and then compared to the ones obtained in this work.

  3. It is unclear whether cross-validation, hold-out, or rolling forecast evaluation was used. Specify how overfitting was controlled.

Response: A block-split (chronological) hold-out is used for all 11 subjects, with approximately a one-week training / one-week validation split. It seems that the reviewer thinks there is one two-week data set with 11 subjects producing the data. Cross-validation with random shuffling (Roberts et al., 2017) is not applicable to dynamic modeling because the temporal order of the data is a critical feature in modeling dynamic behavior and cannot be intelligently changed.

  4. The comparison baseline (single-stage ANN or classic PINN) is only briefly mentioned. Include a numerical comparison or a concise table of performance metrics.

Response: As mentioned throughout the revision and this document, it is not the scope of this work to compare the proposed method with single-stage ANN or classic PINN.

  5. State versions of JMP® and Python®, and libraries (TensorFlow, PyTorch, etc.) if applicable.

Response: The section below was added to the manuscript to give this information related to Python.


"The second approach is our newly developed, novel ANN structure that is coded in the Python language, using the scipy.optimize library, with a tanh activation function and the dual annealing optimization solver."

 

This was added with respect to JMP.
All analyses were conducted using JMP® Pro Version 16 for neural network construction, preprocessing, and model optimization. More specifically, JMP® was used to create and optimize a three-layer ANN structure. The ANN model began as a fully-connected single-layer perceptron, functioning as a decision-making node. This architecture consisted of three layers: the input layer, a hidden transfer layer, and the output layer. Each node in the transfer layer received weighted inputs from the input layer, and the final predictions were made based on the output layer’s activations. The transfer functions used within the model were a combination of linear and Gaussian transformations, and boosting techniques were applied to enhance the model’s performance. To ensure the inputs were appropriately scaled and transformed, continuous covariates were preprocessed by fitting them to a Johnson Su distribution. Using maximum likelihood estimation, this preprocessing step helped transform the data closer to normality, thereby mitigating the effects of skewed distributions and outliers. The general fitting approach aimed to minimize the negative log-likelihood of the observed data, augmented by a penalty function to regulate the model complexity. Specifically, a sum-of-squares penalty, applied to a scaled and centered subset of the parameters, was used to address the overfitting problem that often occurs with ANN models. This penalty was based on the magnitude of the squared residuals, helping to stabilize the parameter estimates and improve the model's optimization. Cross-validation was performed using the holdout method to assess the model's ability to generalize to new data. The training set consisted of the initial data range, while the validation set represented future observations, ensuring the model's predictive capacity for unseen data was tested.

 

Results

  1. Figures or tables showing training/validation curves, residual distributions, or comparative outputs are missing or insufficiently described. Add at least one figure visualizing predicted vs observed SGC for representative cases.

Response: Many of the "Results" requested by the reviewer, here and below, are not applicable to physically-informed dynamic modeling, in the same way they would not be applicable to theoretical or semi-theoretical modeling, and some requests are even illegitimate for static modeling. We will address each one in the order given.

A predicted-vs-observed SGC plot is a plot for assessing fit. Fit is quantitatively assessed by rfit, a statistic that we provide for all the subjects, in Table 1, for TDR Model 1, JMP® Model 1, and Python® Models 1-3. For static modeling, an unbiased assumption is important for formal inference, but it is not applicable to theoretically based methods, which have inherent temporal bias. Nonetheless, it is important to keep temporal bias in check, and we give estimates of model bias relative to y for all the fits in Table 3. Finally, Table 3 also gives, for all the fits, a measure of spread between the fit and the observed SGC, the average absolute difference (AAD). The plots, although for one subject only, Subject 2, illustrate another informative way of assessing fit. (The subject number was missing and we added it.)

  2. No statistical test or error bar accompanies performance gains; thus, it is unclear if differences are significant.

Response: Formal statistical tests, e.g., hypothesis testing, are not applicable to dynamic modeling as described in “1” above, in this group of your reviewer comments. Error bars are not a legitimate/sound statistical assessment methodology. 

As a statistics professor (last author), I can say with absolute confidence that error-bar inference is a construction that is not supported by the academic statistics community. Specifically, the sample mean +/- sample variance and the sample mean +/- sample variance/sample size. I don't know of a physical science/engineering textbook written by a colleague (specifically, a professor who is a member of a statistics department) that even mentions error bars. I wrote an article (Rollins, D. K. "The Importance of Statistical Modeling in Data Analysis and Inference," Chemical Engineering Education, Vol. 51, No. 3 (2017)) explaining why error bars are not a statistically sound inferential methodology and presenting sound ones supported by the academic statistics community, i.e., in their textbooks. Also note that the Error Bar article on Wikipedia is quite weak and has no contribution, i.e., no reference, from a statistics journal. Confidence intervals such as the sample mean +/- critical value * sample variance/sample size are supported and are sound approaches for static data. The data in this work are of a dynamic nature and cannot be generated randomly but occur sequentially over time. Thus, formal static statistical methodology such as confidence intervals is not applicable to this work. As is common in dynamic modeling, systematic positive and negative deviations of the model fit from measured values are strongly evident in Figs. 3-5.

  3. Results focus on numeric fit without discussing computational efficiency or stability—key aspects for real-world deployment.

Response: These additional details were added to the manuscript at the end of the results section.

The Python Stage 2, Model 1 contains fewer than 65 trainable parameters, performs 60-minute-ahead inference in 0.9–1.4 ms on an Intel Core i7-11850H. This can be extrapolated to an estimated <20 ms on an ARM Cortex-M7 microcontroller, making it highly suitable for real-time embedded deployment. The linear dynamic stage is analytically stable with all poles inside the unit circle, the shallow tan(h) network is Lipschitz-continuous, and the optimization consistently converges across all 11 subjects (Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004; p. 89). Compared with an equivalent single-stage LSTM (~65 k parameters), the proposed method could train 4–6× faster, requires orders-of-magnitude less memory, and achieves superior validation performance, confirming excellent computational efficiency and numerical stability for practical diabetes monitoring and control applications.
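As a rough, reproducible illustration of the inference-time claim (our micro-benchmark, not the authors' harness; the factor count, width, and random weights are assumptions):

```python
# Standalone micro-benchmark: time one forward pass of a ~57-parameter
# one-hidden-layer tanh network, comparable in size to the model above.
import numpy as np
import timeit

q, h = 12, 4
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(q, h)), rng.normal(size=h)
w2, b2 = rng.normal(size=h), 0.0
f = rng.normal(size=(1, q))                       # one sample of input factors
forward = lambda: np.tanh(f @ W1 + b1) @ w2 + b2  # single inference
t = timeit.timeit(forward, number=10000) / 10000
print(f"{t * 1e3:.3f} ms per inference")
```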

  4. The bias-correction mechanism is described but lacks mathematical or algorithmic detail.

Response: We give a reference (see Eq. 11 and the text above it) that gives all the mathematics and derivation. We feel this is sufficient for the reader and did not want to take up space in this work when it can be found in the literature. We encourage the reviewer to look it up and bring to our attention any flaws or mistakes that need to be corrected.

 Discussion

  1. The discussion restates results rather than interpreting them mechanistically.
    Explain why the two-stage method improved performance—was it due to reduced parameter coupling, better gradient stability, or physical regularization?

Response: It is none of these three reasons, because our proposed two-stage method does not address them. The reason is simply that the highly significant nonlinear dynamic and static behaviors are not modeled in [2] but are modeled by our proposed method in this work. This is stated throughout the revised MS more often, more directly, and more clearly.

  2. No reflection on limitations (e.g., dataset size, generalizability beyond glucose systems, absence of stochastic modeling).

Response: Specifically, in terms of SGC the first submission of the manuscript states “The models that this work develops are for an unobservant, 60-minute, forecast monitoring scenario.”  It also states that, “this work is best understood as an unobservant monitoring application and not applicable to closed-loop control.” The methodology is clearly revealed to be based on derivatives that are discrete approximations, and thus, implying that success is dependent on the accuracy of this approximation. In the revision we add “While our process example is SGC, it is selected because of its highly dynamic and highly complex static nature. Thus, it is not the objective of this work to focus on the issues related to advancing SGC modeling but to present a general approach for modeling highly dynamic and highly complex static processes.”

  3. Claims of superiority over prior models are unsupported by comparative references or metrics.

Response: “This work makes claims of superiority over prior models” based on the SGC modeling results in [2] and in this manuscript. [2]’s significant improvement over the work in [3] is given in [2]. The significant improvement of this work over [2] is given in Table 1.

  4. Ethical considerations of applying AI to clinical data are not mentioned; briefly acknowledge data privacy and generalizability boundaries.

Response: IRB compliance for this study, the results, confidentiality, etc. are stated in [3], the original publication using these data sets. The modeling of these data sets is fundamentally just linear and nonlinear regression, two approaches that were around long before the term "AI" was used in the modeling literature. There are pattern-matching approaches that use non-science-based algorithms and information to map data to outcomes, make smart decisions, and take intelligent actions. ANNs are useful and effective tools in accomplishing these types of actions. This work is "modeling," i.e., science-driven input and output relationships.

Conclusions

  1. The conclusion reads as a continuation of the discussion and introduces new results (“maximum 0.93 fit”). Keep this section concise—focus on implications and next steps.

Response: We don’t see “maximum 0.93 fit” and the “introduction of new results.” Our conclusion section does focus on “implications and next steps” in our opinion. We have read it carefully and feel that the content meets the journal and our criteria for this section. We appreciate the reviewer’s perspective. 

  2. Suggest a clear future-work paragraph: integration of explainability tools or expansion to other dynamic processes.

Response: The near-term future work that the Rollins lab is focused on is what is given in this section; the two directions mentioned by the reviewer are not included in this research work.

Figures, Tables, and Presentation

  1. Several figures appear low-resolution; ensure vector quality for readability.

Response: These are the same type of plots that we have recently published in Stats. They look okay to us. Thanks for bringing this to our attention.

  2. Captions lack methodological context (e.g., specify axes, variable units).

Response: We will follow the requirements of the journal. Thanks.

  3. The term “vij” is undefined in-text—define all symbols upon first use.

Response: Yes, thanks for catching this. The symbols vij and xij should be the subscripted forms v_ij and x_ij. The corrections have been made.

  4. English is mostly clear but technical phrasing can be tightened: Replace “highly effective” with “computationally efficient and statistically robust.”

Response: In a search for “highly effective” in the revision, it was not found.

  5. Avoid trademark symbols (®) within scientific prose unless required.

Response: Google indicated that "the ® symbol is not required by law." Thus, it has been removed completely.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The work contains interesting ideas and promising empirical results, but in its current form I recommend a major revision, as I do not think it is ready for publication. 
The combination of a physically based dynamic model with a static ANN is a sensible and relevant approach, and the reported performance gains over a linear static baseline suggest genuine potential. However, several key methodological and presentation issues prevent a proper scientific assessment.
The most serious concern is the treatment of the Python-based ANN as a “confidential” method. Since this model produces the main results in the paper, its confidentiality is incompatible with basic standards of reproducibility. The authors should fully describe the ANN architecture, loss function (including any penalty terms), optimisation procedure and implementation details, so that a competent reader could replicate the method. As long as core aspects of the Python approach remain undisclosed, it cannot credibly serve as the central benchmark of the study.
A second major issue is that the comparison between the JMP ANN and the Python ANN is confounded by differences in feature engineering. The manuscript indicates that the Python model uses an expanded set of inputs, including quadratic and interaction terms, while the JMP model is restricted to the original variables. Under these conditions, superior performance of the Python model is expected even without any algorithmic novelty. To support claims about the superiority of the Python methodology, the authors should either give both models access to the same engineered features or restrict the Python ANN to the raw inputs. Otherwise, the paper should be framed explicitly as an evaluation of a feature-engineering strategy rather than a comparison of algorithms.
The Stage-1 dynamic modeling and the residual-based bias correction are reasonable in spirit but require clearer explanation. The choice of SOPDT-type dynamics and backward-difference discretisation should be briefly justified, and the handling of announcement inputs, dead time and missing data in the different model variants (Model 1, Model 2 and Model 1–2) should be described precisely enough to assess stability and robustness. On the results side, the authors place strong emphasis on correlation and AAD; to make the findings more meaningful for the diabetes community, at least one standard metric such as RMSE and, where appropriate, an error grid analysis should be added.
Finally, the manuscript needs careful editing for notation and clarity. A thorough revision to standardise symbols, fix typographical errors (“Wiener/Weiner”) and smooth the exposition is necessary.
In summary, the work has clear potential but requires substantial changes in transparency, fairness of comparison, modeling description, metrics and presentation before it can be considered for publication.

Author Response

see attached

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors replied to all major and minor comments, except comment 3:

"There is a typing problem with all equations starting from Eq. (3): a double line appears in those equations, which is unreadable. Please check compiling the file to PDF again. Also, Table 1 overlaps the next paragraph, so it is hidden."

Please try to solve this problem before submitting the final version. 

Author Response

Reviewer: "There is a typing problem with all equations starting from Eq. (3). A double line appearing in those equations is unreadable. Please check compiling the file to PDF again. Also, Table 1 overlaps the next paragraph so it's hidden."

Reply: When we converted the original submission to a PDF, we saw the problem identified by the reviewer. We corrected and confirmed this problem for Revision 1 by updating the equation editor in Word and by converting the file to a PDF and carefully inspecting the PDF to confirm that the problem was corrected. It is not clear to us why the reviewer is having the problem now. However, for Revision 2, we are submitting the Word and PDF documents. In addition, we repositioned Table 1 to correct its overlap with the next paragraph.

Reviewer 2 Report

Comments and Suggestions for Authors
  • I thank the authors for their detailed and thoughtful response and for the substantially revised manuscript. Scope and objectives are now much clearer.
    The revised abstract and introduction now clearly position this paper as an extension and application of the previously published TDR methodology, specifically targeting the “hard” SGC case that did not meet the r_fit ≥ 0.90 goal in [2]. The description of the three TDR cases and their performance makes the remaining gap very concrete. The role of W-PINN vs TDR vs PINN/ANN is better defined. The new introduction and Figure 1 (ANN vs PINN vs W-PINN) nicely clarify the conceptual differences, and the paper no longer reads as if it is introducing W-PINN for the first time. Instead, it focuses on the two-stage W-PINN extension built on the existing Stage-1 dynamic results in [2], which I agree is a reasonable and well-defined scope. Mathematical description and methods are much more complete. The inclusion of: The second-order SOPDTPL dynamic structure and its discrete-time BDD form (Eqs. 1–5), The explicit statement of using backward difference derivatives, The description of the ANN loss with SSE + parameter penalty, and The use of JMP Pro and Python (with scipy.optimize, tanh activation, etc.) substantially improves the reproducibility and scientific transparency of the work.  Data availability and evaluation setup are clearer.Indicating that both x_{i,t} and v_{i,t}ˆ are available on the corresponding author’s website, along with the two-week, 5-minute sampling rate and 1-week training / 1-week validation split, addresses my earlier concern that the study could not be replicated. I appreciate the revised framing: the manuscript now focuses on the demonstrated improvement relative to the previous TDR SGC models, rather than implying broad superiority over all ANN/PINN or LSTM approaches. The text explicitly acknowledges that comparisons with empirical ANN or PINN methods are outside the scope of this article, which is now acceptable given the clarified aims.  Clarification of application context and limitations (monitoring vs control). The statement that these models are for an “unobservant, 60-minute forecast monitoring scenario” and not directly for closed-loop control is helpful and appropriately cautious. Given these changes, many of my original concerns (e.g., lack of explicit equations, unclear objective, ambiguous use of R² vs r_fit) have been satisfactorily addressed. I no longer view those as obstacles to publication.
  • Remaining issues to address before acceptance (minor revisions)
  • Clarify evaluation and uncertainty language around r_fit. You have convincingly explained why classical static R² and standard “error bars” are not directly applicable to your dynamic, non-linear setting. I agree that: using r_fit as a descriptive measure of dynamic fit is appropriate here, and random re-shuffling cross-validation would break the temporal structure. To avoid confusion for readers:
  • Please make explicit in the text that r_fit is used descriptively to quantify agreement between observed and predicted time series, and that formal inferential statements (e.g., hypothesis testing) are not being made.
  • The strong statement in the response about “error bars not being statistically sound” may come across as more polemical than intended. If similar language has been introduced into the main text, I would recommend softening it (e.g., “standard error-bar heuristics commonly used in some application fields are not appropriate for this type of dynamic modelling, where observations are temporally dependent and not i.i.d.”) and/or adding a reference such as your 2017 Chemical Engineering Education article to support the point.
  • Small improvements to clarity in the Introduction
  • Notation and typographical clean-up. A few small things you might want to check in proofs:
    • Ensure consistent notation for the dynamic outputs: v_{i,t}, v_i(t), v'_{i,t} etc. are all used; a brief notation table or a sentence clarifying prime vs non-prime forms would help.
    • Double-check subscripts such as r_fit,val vs r_fit and ensure they are used consistently in text, tables, and captions.
    • Confirm that all symbols in equations (e.g., θ, τ, ζ, δ, ω) are defined right after their first appearance; I see most of this is already in place, but it is worth one more pass.
    • Minor spacing/typo issues (e.g., extra spaces around commas or periods) can be left to copy-editing, but a quick scan would improve polish.
    • Figures and tables: the conceptual content of Figure 1 and the SGC plots is appropriate. I only encourage ensuring vector or high-resolution versions are submitted and expanding captions slightly to specify axes (SGC units, time in minutes) and whether each plot is training vs validation. This will help readers interpret the visualizations without hunting in the main text.
Comments on the Quality of English Language

fine

Author Response

see attached.

Author Response File: Author Response.pdf