Uncertainty-Aware Self-Attention Model for Time Series Prediction with Missing Values
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The aim of the paper under review was to introduce two novel techniques for the Uncertainty-Aware Self-Attention (UASA) model: a self-attention mechanism with a partially observed diagonal, and uncertainty quantification in data imputation to better inform downstream tasks, thereby removing the model's dependence on all input observations, including the missing values.
The article is well written and well structured. It includes an Introduction covering the importance of the topic and summarizing the authors' contributions; a Related Work section surveying approaches to data imputation for time series prediction, with more than 50 references to relevant and topical papers in this field, most of them published within the last 5 years in respectable, well-known journals and conference proceedings; a Methods section introducing the general formulation of future state estimation for time series and the specific formulation of model selection for prediction with missing values, as well as quite detailed formulations of the upstream and downstream UASA models, their objectives, etc.; an Experiment Results section with a biomedical example of time series prediction using the proposed approach, also quite well structured and described; and a Conclusions and Discussions section providing insight into future research on model pruning and optimization for resource-constrained devices and on a finance application once the model's accuracy has been enhanced as planned.
However, I have a few suggestions for further improvement:
1) The Conclusions suggest that the model is resource-intensive, but I did not find any details on the hardware the authors used to train the model and run inference, so it may be unclear to general readers whether they can reproduce the approach in their own setting.
2) If the model requires pruning to run on off-the-shelf devices, why did the authors not perform those pruning steps at this stage and discuss how basic pruning methods (e.g., changing float32 to float16/int8, dropout layers, etc.) would affect the accuracy and the cost of inference? Which pruning steps exactly are they planning to try next?
3) There are also no code examples that would allow readers to reproduce the findings and estimate the results for their own tasks.
Overall, the article is recommended for publication after at least clarifying issues 1) and 2) in the text. Issue 3) remains at the authors' discretion, though I recommend sharing the code to attract more interest in the approach from business practitioners and to make the study easier for beginners to follow.
Author Response
Thank you for your valuable feedback on our manuscript. In the following, we provide our responses:
Question 1: The Conclusions suggest that the model is resource-intensive, but I did not find any details on the hardware the authors used to train the model and run inference, so it may be unclear to general readers whether they can reproduce the approach in their own setting.
Response: Our inference procedure requires running the UASA model $K=10$ forward passes—as detailed in Equation 8—to obtain uncertainty estimates. In our experiments, we used a single NVIDIA RTX 3090 GPU, which is a widely accessible piece of hardware in many research settings. Based on our experience, this resource is sufficient for non-real-time applications. For real-time tasks, we address latency by replicating the UASA model K times and running the inferences in parallel—a technique that has proven effective in our subsequent work. We hope this explanation reassures readers that our approach is both reproducible and practical across various settings.
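For illustration, below is a minimal PyTorch sketch of this K-pass procedure with a stand-in network; Equation 8 in the paper defines the exact estimator, so this snippet is only an approximation of the idea.

```python
import torch
import torch.nn as nn

def predict_with_uncertainty(model, x, k=10):
    """Run K stochastic forward passes (dropout kept active) and return the
    mean prediction together with the per-output standard deviation."""
    model.train()                                           # keep dropout stochastic
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(k)])   # (K, batch, out)
    return preds.mean(dim=0), preds.std(dim=0)

# Toy usage with a placeholder network (not the UASA architecture).
net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(0.1), nn.Linear(32, 1))
mean, std = predict_with_uncertainty(net, torch.randn(4, 16), k=10)
```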
Question 2: If the model requires pruning to run on off-the-shelf devices, why did the authors not perform those pruning steps at this stage and discuss how basic pruning methods (e.g., changing float32 to float16/int8, dropout layers, etc.) would affect the accuracy and the cost of inference? Which pruning steps exactly are they planning to try next?
Response: We appreciate the reviewer’s suggestion to explore further steps for running our model on off-the-shelf devices. In our current experiments, the computational demands remain manageable for non-real-time applications (as noted in our response to Question 1); therefore, we have not yet implemented pruning or quantization measures. However, we acknowledge the importance of model optimization for broader deployment scenarios and plan to investigate the following techniques in future work:
1. Quantization (e.g., float16): Converting weights and activations to lower-precision formats can significantly reduce memory footprint and computational overhead while maintaining model accuracy at acceptable levels. We will systematically evaluate the trade-offs between accuracy and latency, especially in edge-computing or low-power scenarios.
2. Layer Skipping and Model Distillation: In settings where stricter latency constraints exist, we will investigate network pruning at the architectural level (e.g., skipping certain layers for less complex inputs). We also plan to explore knowledge distillation to transfer the performance and uncertainty-estimation capabilities of the original model into a smaller, more efficient student network.
We anticipate that these methods will enable us to strike a balance between computational cost and predictive performance, particularly for real-time or resource-constrained applications. We have updated the conclusion section to reflect our discussion on future research directions.
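As a concrete illustration of the quantization direction above, the sketch below applies post-training dynamic int8 quantization to a placeholder network; it is not drawn from the UASA codebase and only shows the kind of accuracy-versus-precision check we plan to run.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for a trained model (not the UASA architecture).
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 4))
model.eval()

# Dynamic int8 quantization of the linear layers (post-training, no retraining).
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(8, 64)
with torch.no_grad():
    y_fp32, y_int8 = model(x), quantized(x)
print((y_fp32 - y_int8).abs().max())  # rough view of the precision loss
```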
Question 3: There are also no code examples that would allow readers to reproduce the findings and estimate the results for their own tasks.
Response: We agree that providing code examples will significantly enhance reproducibility and facilitate further research. Currently, we have made the dataset available on GitHub (https://github.com/LIbbbao/AUST_gait) to enable interested readers to begin exploring our work. We plan to release the full codebase once the manuscript is accepted.
We would like to thank you again for raising important questions about the UASA model. Have we sufficiently addressed the main concerns? Please feel free to let us know if there are additional concerns or questions.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
In this article, the authors propose a neural network-based method that can be applied to various tasks related to processing time series with missing values. The main novelty of the work is the proposed neural network architecture UASA and the AUST-Gait dataset. Overall, the work has been done well, but the paper requires several improvements.
1. The described method involves a novel application of a quite common technique for estimating uncertainty (described in section 3.2.2). I think it would be appropriate to add a couple of works about uncertainty estimation in the "Related work" section.
2. Lines 259-261, the descriptions of MIT and ORT loss functions are mixed up.
3. Figure 8. There is no description of the differences between the 3 graphs.
4. The fall detection task should be described in more detail. How was this task formulated? Was it just an additional class alongside the other 4 classes? Was it a binary classification of the entire time series? Or was it an event detection task that requires detecting the particular timesteps containing falls within the time series?
5. The algorithm description requires some clarification.
5.1 In Section 3.2.2, formula 5, the brackets should be fixed. Does (X,M)W1 mean that X and M are both separately multiplied by W1 before concatenation? Or should it be written as Concat(X,M)W1?
5.2 If W1 has a size (d_m; d_m) and is applied to the concatenated X and M, then is d_m always equal to 2D? It is better not to add additional variables without necessity and to use 2D.
5.3 Lines 223-230: the use of h1, h2 and H is a little bit confusing; h1 looks like a matrix in formula 5, not a vector, and H is not mentioned in any formula at all.
5.4 The matrix h1 has the size (T; d_m + positional embedding size) (where d_m is probably just 2D) and then goes into MultiHeadAttentionPMD, which, according to Section 3.2.1, receives 2 matrices X and M with sizes (T; D) and transforms them into Q (dk; T+D), K and V. Where did the positional embedding go? Figure 2 shows only the X matrix as an input, and M is used for PMD and is not concatenated to X. Some implementation details have apparently been omitted; clarification is required.
6. I would like to see additional ablation studies and clarification of existing ones.
6.1. There is an ablation study removing the PMD mechanism. Was the PMD Attention layer completely removed, or was it replaced with regular attention in order to keep the number of parameters the same? The complete removal of these layers could lead to a decrease in performance simply due to the lack of trainable parameters.
6.2. As far as I understand, the missing value mask is used for both the concatenation function after the input layer and the PMD attention layer. I would like to see a performance estimate for an experiment where this mask is not used in the input layers. That way we can evaluate the performance of the PMD algorithm itself without doubling the number of trainable parameters in the first linear layers that provide the Q, K and V transformation.
7. Lines 399-401. It looks like the authors have divided the 10 time-step fragments into training and test sets without taking records or people into account. Does this mean that several fragments from the same record of the same person can appear both in the training and in the test set? This is not the right way to divide a time series dataset; it should be divided by people.
8. The authors provide a deviation for the scores in Figures 5 and 6, but do not explain how it was obtained. Was it obtained from multiple runs of the same model or from different splits of the dataset? The standard deviation should also be provided for the other experiments, including the ablation studies.
9. Statements in the abstract and conclusion should be supported by scores from the experiment section. Add scores to the abstract and conclusion where performance of the model is highlighted.
10. These statements do not seem to be supported by the results.
- "UASA model consistently surpasses state-of-the-art benchmarks in classification". Figure 6 shows that the difference for F1 scores between 4 models is within the standard deviation (which is around 3%). Table 2 doesn't have a standard deviation but if it's around 3% too, then it's also not showing much improvements and it's hardly can be considered as "UASA model outperforms all baseline models across all metrics". The abstract statement looks more correct:"Empirical evaluations on the benchmarks demonstrate that UASA achieves state-of-the-art performance in classification, ..."
- "This underscores the pivotal role of uncertainty maps in bolstering model robustness and reliability". According to ablation studies it gives 2.3% which is within the standard deviation demonstrated on the figure 6. It is not clear if it is actually statistically significant difference without providing standard deviation for all of the experiments.
- "ablation studies emphasize the significance of the multi-head architecture" - same as above. Experiment with reduced attention heads shows only 1.7% difference. If this experiment has same standard deviation as in figure 6 then it is not a statisticaly significant difference.
Author Response
We sincerely appreciate the time and effort you dedicated to reviewing our paper. We have addressed each of these points individually and revised the paper in blue accordingly. We hope that our responses will effectively resolve your concerns.
Question 1: The described method involves a novel application of a quite common technique for estimating uncertainty (described in section 3.2.2). I think it would be appropriate to add a couple of works about uncertainty estimation in the "Related work" section.
Response: We have revised the "Related Work" section to include several significant studies on uncertainty estimation. These additions cover classical methods such as MC Dropout, Deep Ensembles, and Bayesian Neural Networks, as well as recent advancements like Concrete Dropout and Deep Evidential Regression.
Question 2: Lines 259-261, the descriptions of MIT and ORT loss functions are mixed up.
Response: Thank you for pointing this out. We have corrected the descriptions in the revised manuscript.
Question 3: Figure 8. There is no description of the differences between the 3 graphs.
Response: In the revised manuscript, we have added a thorough description of the three subplots, clarifying the experimental setups under different missing rates ($20\%$, $40\%$, $60\%$) and their impact on the results. Please check Section 4.4 in the revised manuscript.
Question 4: The fall detection task should be described in more detail. How was this task formulated? Was it just an additional class alongside the other 4 classes? Was it a binary classification of the entire time series? Or was it an event detection task that requires detecting the particular timesteps containing falls within the time series?
Response: We treat fall detection as a binary classification problem rather than simply adding a separate “fall” category to the four existing gait classes (FG, SA, SD, IT). Specifically, we use windowed segments of the time series and label each segment as either “fall” or “non-fall.” The model then predicts whether a fall event occurs within each window. We have added this explanation to the revised manuscript.
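For clarity, a small sketch of this windowed binary labeling is given below; the window length, stride, and function names are illustrative placeholders rather than the exact preprocessing used in our pipeline.

```python
import numpy as np

def make_fall_windows(series, frame_labels, window=100, stride=50):
    """Slice a multivariate series into fixed-length windows and label each
    window 1 ('fall') if any frame inside it is annotated as a fall, else 0."""
    segments, labels = [], []
    for start in range(0, len(series) - window + 1, stride):
        segments.append(series[start:start + window])
        labels.append(int(frame_labels[start:start + window].any()))
    return np.stack(segments), np.array(labels)
```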
Question 5: The algorithm description requires some clarification.
Question 5.1: In Section 3.2.2, formula 5, the brackets should be fixed. Does (X,M)W1 mean that X and M are both separately multiplied by W1 before concatenation? Or should it be written as Concat(X,M)W1?
Response: The equation has been revised; it is now written as $h_1 = \text{Concat}(X_{t, d}, M_{t, d}) W_1 + b_1 + \text{Embed}(t)$. $X$ and $M$ are concatenated first and then multiplied by $W_1$.
Question 5.2: If W1 has a size (d_m; d_m) and is applied to the concatenated X and M, then is d_m always equal to 2D? It is better not to add additional variables without necessity and to use 2D.
Response: This question is connected to Q5.1. In the revised manuscript, we have corrected this so that $W_1 \in \mathbb{R}^{2 \times d_m}$ and $b_1 \in \mathbb{R}^{1 \times d_m}$. Here $\text{Concat}(X_{t, d}, M_{t, d}) \in \mathbb{R}^{1 \times 2}$, thus we have $\text{Concat}(X_{t, d}, M_{t, d}) W_1 \in \mathbb{R}^{1 \times d_m}$ and $h_1 \in \mathbb{R}^{1 \times d_m}$. It is simply a linear transformation applied before feeding the representation into the attention layers.
Question 5.3: Lines 223-230: the use of h1, h2 and H is a little bit confusing; h1 looks like a matrix in formula 5, not a vector, and H is not mentioned in any formula at all.
Response: We are sorry about the typos. $h_1$ is the vector from Equation 5 and $h_2$ is the vector from Equation 6. We have now removed $H$.
Question 5.4: The matrix h1 has the size (T; d_m + positional embedding size) (where d_m is probably just 2D) and then goes into MultiHeadAttentionPMD, which, according to Section 3.2.1, receives 2 matrices X and M with sizes (T; D) and transforms them into Q (dk; T+D), K and V. Where did the positional embedding go? Figure 2 shows only the X matrix as an input, and M is used for PMD and is not concatenated to X. Some implementation details have apparently been omitted; clarification is required.
Response: We have revised the paper following your suggestions. Here is a further explanation of our PMD module:
In the input concatenation and linear transformation step, for each time step \(t \in \{1,\dots,T\}\) and dimension \(d \in \{1,\dots,D\}\), we begin with the scalars \(X_{t,d}\) and \(M_{t,d}\). These two scalars are concatenated into
\[
\text{Concat}(X_{t,d},M_{t,d}) \in \mathbb{R}^{1 \times 2}.
\]
We then apply a linear transformation:
\[
h_1 = \text{Concat}(X_{t,d}, M_{t,d}) W_1 + b_1,
\]
where
\[
W_1 \in \mathbb{R}^{2 \times d_m},
\quad
b_1 \in \mathbb{R}^{1 \times d_m}.
\]
Thus, $h_1 \in \mathbb{R}^{1 \times d_m}$ after this step.
We then add a positional embedding $\text{Embed}(t) \in \mathbb{R}^{d_m}$ (broadcasted to shape $1 \times d_m$) to $h_1$:
\[
h_1 = \text{Concat}(X_{t,d}, M_{t,d}) W_1 + b_1 + \text{Embed}(t).
\]
The result is still a $(1 \times d_m)$ row vector.
Finally, we repeat the above process for every pair $(t,d)$ in the time series. Stacking these vectors row-wise yields
\[
H \in \mathbb{R}^{(T \cdot D) \times d_m},
\]
where each row of $H$ corresponds to one specific $(t,d)$ pair, already incorporating the missing-mask bit and the positional encoding.
The matrix $H$ then serves as the input tokens to the MultiHeadAttentionPMD layer. Although Figure 2 shows a simplified schematic (where $X$ appears to be the only direct input and $M$ is shown for the diagonal masking), internally each attention token is $\bigl[X_{t,d}, M_{t,d}, \text{Embed}(t)\bigr]$ after projection to $d_m$.
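A compact sketch of this token construction is given below; the module name, the learned positional-embedding table, and the maximum sequence length are illustrative choices and may differ from the released code.

```python
import torch
import torch.nn as nn

class PMDInputEmbedding(nn.Module):
    """Build one token per (t, d) pair: Concat(X_td, M_td) W_1 + b_1 + Embed(t)."""
    def __init__(self, d_m, max_t=1000):
        super().__init__()
        self.linear = nn.Linear(2, d_m)      # W_1 in R^{2 x d_m}, b_1 in R^{1 x d_m}
        self.pos = nn.Embedding(max_t, d_m)  # Embed(t)

    def forward(self, x, m):
        # x, m: (T, D) observed values and missing-mask bits
        T, D = x.shape
        pairs = torch.stack([x, m], dim=-1).reshape(T * D, 2)  # Concat(X_td, M_td)
        h1 = self.linear(pairs)                                 # (T*D, d_m)
        t_idx = torch.arange(T).repeat_interleave(D)            # time index of each token
        return h1 + self.pos(t_idx)                             # H in R^{(T*D) x d_m}

# Toy usage: T = 20 time steps, D = 3 dimensions, d_m = 32.
emb = PMDInputEmbedding(d_m=32)
H = emb(torch.randn(20, 3), torch.randint(0, 2, (20, 3)).float())  # shape (60, 32)
```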
Please feel free to let us know if there are additional questions about the model.
Question 6: I would like to see additional ablation studies and clarification of existing ones.
Question 6.1: There is an ablation study removing the PMD mechanism. Was the PMD Attention layer completely removed, or was it replaced with regular attention in order to keep the number of parameters the same? The complete removal of these layers could lead to a decrease in performance simply due to the lack of trainable parameters.
Response: We recognize that isolating the effect of the PMD mechanism is critical for validating our contribution. In Section 4.6.1, the ablation study (Table 3) demonstrates a clear drop in performance once the PMD self-attention layer is removed. To rule out any confounding effects from fewer trainable parameters, we replaced the PMD layer with additional linear layers (of equivalent parameter count) rather than simply removing it. Consequently, we can confirm that the performance decrease primarily stems from the absence of the PMD mechanism, underscoring the PMD model’s direct contribution to our results.
Question 6.2: As far as I understand, the missing value mask is used for both the concatenation function after the input layer and the PMD attention layer. I would like to see a performance estimate for an experiment where this mask is not used in the input layers. That way we can evaluate the performance of the PMD algorithm itself without doubling the number of trainable parameters in the first linear layers that provide the Q, K and V transformation.
Response: The mask is only used once in the input layer, where it is concatenated with the raw input values before being passed to the rest of the network (Equation 5). After this initial step, we do not feed the mask directly into subsequent layers — the PMD attention mechanism itself handles missing data through partial diagonal masking, rather than repeatedly using the mask. Consequently, we do not double the trainable parameters in the Q, K, and V transformations.
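To illustrate the idea only, the sketch below suppresses the diagonal attention entries at the positions flagged by the missing mask. The exact PMD rule is the one defined in Section 3.2.1 of the manuscript, so this snippet should be read as an approximation rather than our implementation.

```python
import torch
import torch.nn.functional as F

def pmd_attention_scores(q, k, missing_mask, neg=-1e9):
    """Scaled dot-product attention weights with the diagonal suppressed at
    tokens flagged as missing, so a missing entry cannot attend to itself."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5     # (n, n)
    diag_penalty = torch.diag(missing_mask.float()) * neg     # large negative on masked diagonal
    return F.softmax(scores + diag_penalty, dim=-1)

# Toy usage: 5 tokens, the 2nd and 4th flagged as missing.
q = k = torch.randn(5, 8)
attn = pmd_attention_scores(q, k, torch.tensor([0, 1, 0, 1, 0]))
```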
Question 7: Lines 399-401. It looks like the authors have divided the 10 time-step fragments into training and test sets without taking records or people into account. Does this mean that several fragments from the same record of the same person can appear both in the training and in the test set? This is not the right way to divide a time series dataset; it should be divided by people.
Response: In this study, we strictly performed person-wise split when dividing the dataset. Specifically, the data in the training and testing sets came exclusively from different individuals to ensure that no data from the same individual could simultaneously appear in both the training and testing sets. Consequently, the scenario you mentioned — where multiple fragments from the same record of a single individual appear in both the training and testing modes — does not occur in our study. We have revised the manuscript to address the concerns raised.
Question 8: The authors provide a deviation for the scores in Figures 5 and 6, but do not explain how it was obtained. Was it obtained from multiple runs of the same model or from different splits of the dataset? The standard deviation should also be provided for the other experiments, including the ablation studies.
Response: The standard deviations are obtained by running the same model with different random seeds. We do this to remove the randomness introduced by model initialization. We have revised the manuscript to explain how the standard deviations were obtained and to provide them for the other experiments as well.
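The aggregation itself is straightforward; a minimal sketch is shown below, with `train_and_evaluate` standing in for the full training-and-evaluation pipeline rather than a function from our codebase.

```python
import statistics

def score_over_seeds(train_and_evaluate, seeds=(0, 1, 2, 3, 4)):
    """Repeat training with different random seeds and report mean and std."""
    scores = [train_and_evaluate(seed=s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)
```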
Question 9: Statements in the abstract and conclusion should be supported by scores from the experiment section. Add scores to the abstract and conclusion where performance of the model is highlighted.
Response: We have revised the abstract and the conclusion to highlight our model's achievement.
Question 10: These statements do not seem to be supported by the results.
"UASA model consistently surpasses state-of-the-art benchmarks in classification". Figure 6 shows that the difference for F1 scores between 4 models is within the standard deviation (which is around 3\%). Table 2 doesn't have a standard deviation but if it's around 3\% too, then it's also not showing much improvements and it's hardly can be considered as "UASA model outperforms all baseline models across all metrics". The abstract statement looks more correct:"Empirical evaluations on the benchmarks demonstrate that UASA achieves state-of-the-art performance in classification, ..."
- "This underscores the pivotal role of uncertainty maps in bolstering model robustness and reliability". According to ablation studies it gives 2.3\% which is within the standard deviation demonstrated on the figure 6. It is not clear if it is actually statistically significant difference without providing standard deviation for all of the experiments.
- "ablation studies emphasize the significance of the multi-head architecture" - same as above. Experiment with reduced attention heads shows only 1.7\% difference. If this experiment has same standard deviation as in figure 6 then it is not a statisticaly significant difference.
Response: Thank you for your feedback regarding the statistical significance of our results. In our experiments, we performed five independent runs with different random seeds to reduce the impact of random model initialization — a widely accepted practice in related work. While the mean performance differences do favor our proposed model across these runs, we understand that the standard deviation and additional statistical measures (e.g., confidence intervals, hypothesis testing) are crucial for confirming whether those differences are indeed significant. In response, we have revised any overly strong claims.
We believe that the revised manuscript has become much stronger after incorporating your valuable feedback. We would like to thank you again for raising important questions about the UASA model. Have we sufficiently addressed the main concerns? Please feel free to let us know if there are additional concerns or questions.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Authors,
The article "Uncertainty-Aware Self-Attention Model for Time Series Prediction with Missing Values" focuses on developing a new UASA (Uncertainty-Aware Self-Attention) model for time series forecasting with missing data. The main innovations include a Partial Masking Diagonal (PMD) mechanism and quantitative uncertainty estimation in data imputation. Overall, the article presents a significant contribution to the field of time series analysis with missing data but has room for expansion in research. In its current version, I have identified several opportunities for improvement that could make the paper more rigorous and valuable for both academic and practical applications. I have listed them below:
- It would be beneficial to add an evaluation of the computational complexity of the proposed method, which would help readers better understand its efficiency compared to other approaches.
- The article would benefit from a more detailed comparison of the proposed model with state-of-the-art deep learning methods for time series, particularly Temporal Fusion Transformers. This could strengthen the argument regarding UASA's advantages.
- It would be useful to expand the analysis of how key hyperparameters affect model performance. This would make the research more comprehensive and contribute to a better understanding of the model's working mechanisms.
- By discussing potential limitations when working with very long sequences, the authors could highlight areas where the model is most effective and potential directions for future research.
- It would be advisable to include specific quantitative metrics in the conclusions to demonstrate the advantage of the proposed model over existing methods, helping readers quickly assess its effectiveness. Additionally, the conclusions would benefit from a brief discussion of potential model limitations, such as its applicability to different types of time series or sensitivity to input data size. This would make the conclusions more balanced and demonstrate awareness of potential challenges.
Best regards,
Reviewer
Author Response
We sincerely appreciate the time and effort you dedicated to reviewing our paper. We have addressed each of these points individually and revised the paper in blue accordingly. We hope that our responses will effectively resolve your concerns.
Question 1: It would be beneficial to add an evaluation of the computational complexity of the proposed method, which would help readers better understand its efficiency compared to other approaches.
Response: We have analyzed the computational complexity of our UASA model and revised Section 3.2 accordingly.
Question 2: The article would benefit from a more detailed comparison of the proposed model with state-of-the-art deep learning methods for time series, particularly Temporal Fusion Transformers. This could strengthen the argument regarding UASA's advantages.
Response: We acknowledge that the TFT paper is closely related to our work. Since TFT is designed for forecasting tasks, we have revised the relevant experiments and updated Figure 7. Our results show that while TFT performs well, it remains slightly inferior to our proposed UASA model.
Question 3: It would be useful to expand the analysis of how key hyperparameters affect model performance. This would make the research more comprehensive and contribute to a better understanding of the model's working mechanisms.
Response: We have added a new subsection to the experiments to examine UASA's sensitivity to hyperparameters. In the experiments, we select two groups of hyperparameters: (1) $K$, the number of inferences used for uncertainty quantification; and (2) $\alpha_1$ and $\alpha_2$, which control the relative importance of the main objective, the MIT objective, and the ORT objective. The results are shown in Tables 5-7 and the new Section 4.7. We found that while hyperparameter choices can impact performance, our algorithm remains relatively robust across different settings.
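For reference, the weighting can be read in the following illustrative form (the precise formulation is the one given in the revised manuscript):
\[
\mathcal{L} = \mathcal{L}_{\text{main}} + \alpha_1 \, \mathcal{L}_{\text{MIT}} + \alpha_2 \, \mathcal{L}_{\text{ORT}},
\]
so larger $\alpha_1$ or $\alpha_2$ shifts emphasis from the main task toward the imputation-related objectives.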
Question 4: By discussing potential limitations when working with very long sequences, the authors could highlight areas where the model is most effective and potential directions for future research.
Response: We have incorporated a discussion of the computational and memory challenges associated with very long sequences in the revised conclusions. Specifically, we address how the self-attention mechanism, which underpins our partial diagonal zeroing strategy, can scale quadratically with sequence length. Recognizing this limitation allows us to pinpoint scenarios where UASA is most effective—for instance, when dealing with moderately long or noisy time series—and identify opportunities for future work.
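To make the scaling concern concrete: with one attention token per $(t, d)$ pair, as in our PMD input construction, a dense self-attention pass requires roughly
\[
\mathcal{O}\bigl((T \cdot D)^2 \, d_m\bigr)
\]
time and $\mathcal{O}\bigl((T \cdot D)^2\bigr)$ memory, which is why very long sequences become challenging.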
Question 5: It would be advisable to include specific quantitative metrics in the conclusions to demonstrate the advantage of the proposed model over existing methods, helping readers quickly assess its effectiveness. Additionally, the conclusions would benefit from a brief discussion of potential model limitations, such as its applicability to different types of time series or sensitivity to input data size. This would make the conclusions more balanced and demonstrate awareness of potential challenges.
Response: In the revised conclusions, we have included specific quantitative metrics—such as ROC-AUC, PR-AUC, F1-score, and MSE—drawn from our experiments on the AUST-gait dataset and other test scenarios. These metrics provide a clearer demonstration of the UASA model’s advantages over existing methods and allow readers to quickly gauge its effectiveness.
Additionally, we have expanded our discussion of potential model limitations, particularly regarding (1) the computational challenges posed by very long sequences due to the self-attention mechanism’s inherent complexity, and (2) the sensitivity of the model to varying sequence lengths and data types. By highlighting these limitations and identifying possible mitigation strategies—such as alternative uncertainty quantification techniques and structured pruning—we hope to convey a more balanced perspective on the model’s capabilities and guide the future research.
We believe that the revised manuscript has become much stronger after incorporating your valuable feedback. We would like to thank you again for raising important questions about the UASA model. Have we sufficiently addressed the main concerns? Please feel free to let us know if there are additional concerns or questions.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have taken into account all the reviewers' comments and made the appropriate corrections. The manuscript is ready for publication.
Author Response
We would like to sincerely thank you for your time and feedback during the review process. Your suggestions have greatly contributed to improving the quality of our work.
We wish you all the best in both research and life!
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Authors,
The revised version of your manuscript shows substantial improvements. You have successfully addressed the previous comments and implemented significant changes that have enhanced the paper's scientific value and clarity. These revisions have strengthened the theoretical framework of the manuscript and improved its overall readability.
I believe the revised version has proper scientific merit and recommend it for publication.
Thank you for your diligent work on these revisions.
Best regards,
Reviewer
Author Response
We would like to sincerely thank you for your time and feedback during the review process. Your suggestions have greatly contributed to improving the quality of our work.
We wish you all the best in both research and life!