Review Reports
Lili Zhang, Shenxi Dai* and Lihuang She et al.
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- Formula (1) for dilated convolution, $g(i)=\sum_{l=1}^{L}f(i+r\cdot L)\cdot h(l)$, appears to be incorrect. In standard dilated convolution, the dilation rate $r$ should multiply the filter index $l$, not the filter length $L$, i.e. $g(i)=\sum_{l=1}^{L}f(i+r\cdot l)\cdot h(l)$.
- The paper claims the MSA layer "has fewer parameters than traditional CNN structures". This is a very broad and likely false statement, as self-attention mechanisms can be extremely parameter-heavy in high-dimensional spaces.
- The abstract claims TensorRT "greatly improved (13.7 times)" the speed, while the conclusion claims it was "significantly accelerated (20-30 times)". These numbers are in severe disagreement. The paper also fails to provide the baseline (pre-acceleration) or final (post-acceleration) absolute inference time (e.g., in ms/frame) for the TSHDC model; a "13.7x" or "30x" speedup is meaningless without the base numbers. The abstract mentions "a small loss of accuracy (5%)" but never defines what this "5%" refers to (e.g., a 5% relative increase in MPJPE?), and no data comparing the MPJPE before and after TensorRT optimization is provided.
- The conclusion claims that "ablation experiments of the multi-head self-attention layer (MSA) and the TSHDC model were designed". However, Section 3 (Experimental Results) presents no data from any ablation studies. We have no information on how much the MSA layer actually contributed to (or hurt) performance. We also have no justification for the choice of a 27-frame temporal receptive field, as no comparison to other frame counts is provided.
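For reference, the corrected indexing described in the first comment (dilation rate $r$ multiplying the filter index $l$) can be sketched as follows. This is a minimal, illustrative implementation; the function `dilated_conv1d` is hypothetical and not from the reviewed paper:

```python
def dilated_conv1d(f, h, r):
    """Naive 1-D dilated convolution: g(i) = sum_l f(i + r*l) * h(l).

    The dilation rate r multiplies the filter index l (not the
    filter length L). Output positions are those where every
    tap f(i + r*l) stays in bounds.
    """
    L = len(h)
    n_out = len(f) - r * (L - 1)
    return [sum(f[i + r * l] * h[l] for l in range(L)) for i in range(n_out)]

# With r = 1 this reduces to ordinary cross-correlation:
print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1], 1))  # [3, 5, 7, 9]
# With r = 2 each tap skips one sample, widening the receptive field:
print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1], 2))  # [4, 6, 8]
```

Note that had the formula used $r\cdot L$ as written in the paper, every tap would read the same fixed offset, which is not a convolution at all.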
I recommend major revisions to enhance the quality of this manuscript. Additional details and explanations would greatly improve the manuscript.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Dear Authors,
This manuscript addresses a timely topic. However, after reviewing it, I have a few critical comments:
- The paper's structure is uneven. The theoretical part is largely a compilation – it summarises well-known concepts such as self-attention and dilated convolution without further consideration of their application in the context of HPE.
- It also lacks a clear description of the training process, the number of epochs, the size of the training set, and hyperparameters.
- No convergence graphs or ablation studies are presented, despite the authors stating in the summary that these were performed.
- Lack of empirical validation – there is no error analysis or comparison with competing methods under identical test conditions.
- Claimed advantages are unsupported by results – the authors claim improved efficiency and convergence, but provide no numerical data.
- Terminological and editorial ambiguity – many abbreviations are undefined, and the architecture description is inconsistent.
- Insufficient documentation of the embedded implementation – lack of information on TensorRT optimisations, processing time, and actual FPS.
- Lack of discussion of limitations – the authors do not mention under what conditions the method loses accuracy (e.g., fast movements, partial occlusions).
- Insufficient innovative value – the presented model is a compilation of known solutions (MSA + TCN + HDC), without a new architectural component or new loss function.
- Lack of references to the extensive international literature.
Sincerely.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
- In the TensorRT experiment, the accuracy degrades from 49.4 mm to 51.9 mm (a 5.1% relative loss). This is a relatively high accuracy drop for TensorRT deployment. The authors do not specify the precision used (FP32, FP16, or INT8). If this is FP16, such a drop is unusual; if INT8, calibration details are missing.
- The ablation study in Section 4.2.2 only compares "THDC" vs. "TSHDC" (the effect of MSA). Critical ablations are missing, such as: the impact of the hybrid dilated strategy vs. standard dilated convolution, and the effect of the "sawtooth" dilation pattern mentioned in the background.
- The authors state that a 27-frame receptive field is the "optimal configuration" and that increasing frames yields diminishing returns. However, no data, tables, or graphs are provided to substantiate this claim. The paper needs an ablation study showing the curve of Accuracy vs. Latency for frame counts (e.g., 9, 27, 81, 243) to prove 27 is truly optimal.
- Some works about detection should be cited in this paper to make this submission more comprehensive, such as 10.1109/TPAMI.2024.3511621.
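For clarity on the first comment's figure, the quoted 5.1% is consistent with the relative MPJPE increase computed from the two values the paper reports (a quick check, not data from the paper):

```python
# Relative MPJPE increase after TensorRT deployment,
# using the 49.4 mm (before) and 51.9 mm (after) values quoted above.
before_mm, after_mm = 49.4, 51.9
relative_loss = (after_mm - before_mm) / before_mm
print(f"{relative_loss:.1%}")  # 5.1%
```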
I recommend minor revisions to enhance the quality of this manuscript. Additional details and explanations would greatly improve the manuscript.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Dear Authors,
Thank you for your responses and the revised manuscript. I have no further comments.
Sincerely.
Author Response
Thank you