Article
Peer-Review Record

TransMODAL: A Dual-Stream Transformer with Adaptive Co-Attention for Efficient Human Action Recognition

Electronics 2025, 14(16), 3326; https://doi.org/10.3390/electronics14163326
by Majid Joudaki 1,*, Mehdi Imani 2 and Hamid R. Arabnia 3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 28 July 2025 / Revised: 17 August 2025 / Accepted: 19 August 2025 / Published: 21 August 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Questions for authors:

Please justify the selection of RT-DETR as the person detection model used in the research. Why not another model from the YOLO family, or an R-CNN variant? Is RT-DETR easily replaceable by another model in the proposed architecture? The same question applies to the ViTPose++ component: could it be YOLO-Pose, PoseNet, or OpenPose?

In the architecture description section, the authors list the parameters and the values defined for each of them. Please justify why these concrete values were selected: Batch=4, T: Time (clip_length=16), H: Height=224, W: Width=224, N: Tokens=196, K: Keypoints=(17, 2), D: Dimension=768.

"... we split the dataset by subject, using subjects 1-16 for training, 17 for validation, and 18-25 for testing ...". If in the future, another researchers will want to replicate and improve the achieved accuracy, how they will separate which videos should be assigned to train/val/test splits? Do authors keep the same seed each time training the model, so that mp4 records are exactly in the certain split repeating training process?

It is not clear, but probable, that the training was performed once. Here is the most important remark: in Tables 2, 3, and 4, the authors compare the accuracy of the proposed model to existing results reported for the reference models. How can the authors be sure about replicability? Did all of them use the same data for training and testing, or may the recordings have been mixed as well?

 

Proposed improvements to the manuscript:

The main limitation of the paper lies in its reliance on a two-stage pipeline, where the pose estimation via ViTPose++ is performed independently and remains frozen during training. This makes the model vulnerable to errors in pose estimation, especially under motion blur or occlusion. To address this, the authors should integrate the pose estimator into an end-to-end trainable framework, allowing joint optimization with the action recognition head to improve robustness.

The PoseEncoder is relatively simplistic, using flattened 2D keypoints and a single Transformer encoder layer, which limits its ability to model complex spatial relationships between joints. This can be improved by adopting a graph-based or spatiotemporal pose model, such as ST-GCN, which captures joint dependencies more effectively.

The evaluation is also limited to smaller or moderately sized datasets (KTH, UCF101, HMDB51), which do not fully test generalization. To improve this, the authors should include experiments on larger-scale benchmarks like Kinetics400 or Something-Something V2.

The model claims computational efficiency, yet there is no detailed comparison of inference speed, FLOPs, or model size versus state-of-the-art flow-based or dual-stream models. Adding such comparisons would better position TransMODAL in terms of practical deployment.

While qualitative results are strong, more in-depth attention visualizations and a broader ablation study (e.g., alternative fusion mechanisms or deeper pose encoders) would provide greater insight into the architecture's design choices and performance.

Author Response

 

Response to Reviewer 1 Comments

 

1. Summary

 

 

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files.

In response to your comments, we have incorporated color for highlighting in the re-submitted file.

2. Questions for General Evaluation

Does the introduction provide sufficient background and include all relevant references? – Can be improved

Are all the cited references relevant to the research? – Can be improved

Is the research design appropriate? – Must be improved

Are the methods adequately described? – Must be improved

Are the results clearly presented? – Can be improved

Are the conclusions supported by the results? – Can be improved

3. Point-by-point response to Comments and Suggestions for Authors

 

Comments 1: Please justify the selection of RT-DETR as the person detection model used in the research. Why not another model from the YOLO family, or an R-CNN variant? Is RT-DETR easily replaceable by another model in the proposed architecture? The same question applies to the ViTPose++ component: could it be YOLO-Pose, PoseNet, or OpenPose?

Response 1: We thank the reviewer for this important question. We selected RT-DETR and ViTPose++ based on their state-of-the-art performance and efficiency at the time of this research. To clarify our rationale and the modularity of our approach, we have added a new subsection, "3.1.1. Upstream Model Selection and Hyperparameter Justification," which details these choices. We explicitly state that the pipeline is modular and other high-performing detectors or pose estimators could be substituted. In this new section, we clarify that:

  1. RT-DETR was chosen for its excellent balance of high accuracy and real-time performance, representing the state-of-the-art in DETR-based object detectors.
  2. ViTPose++ was selected as a top-performing, transformer-based model for pose estimation, aligning well with the overall architecture.
  3. We explicitly confirm that our pipeline is modular. Both the detector and the pose estimator function as pre-processing steps and can be readily substituted with other high-performing models (such as those from the YOLO family or OpenPose), provided they output the required bounding boxes and 17-channel COCO keypoints; the sketch below illustrates this interface. This change can be found on page 5, paragraph 4, lines 204-214.
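
To make this interface contract concrete, here is a minimal, hypothetical sketch of the pre-processing stage; the function names and stub bodies are placeholders (not code from the manuscript), and any detector or pose estimator that produces outputs of these shapes could be dropped in.

```python
import torch


def detect_persons(frame: torch.Tensor) -> torch.Tensor:
    """Placeholder for any person detector (RT-DETR, a YOLO variant, Faster R-CNN, ...).
    Expected to return an (M, 4) tensor of [x1, y1, x2, y2] person boxes."""
    raise NotImplementedError("plug in the detector of choice")


def estimate_pose(frame: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
    """Placeholder for any pose estimator (ViTPose++, YOLO-Pose, OpenPose, ...).
    Expected to return an (M, 17, 2) tensor of COCO keypoint (x, y) coordinates."""
    raise NotImplementedError("plug in the pose estimator of choice")


def preprocess_clip(frames):
    """Runs detection and pose estimation per frame; the downstream action model
    only consumes the resulting boxes and keypoints, so either stage is swappable."""
    boxes = [detect_persons(f) for f in frames]
    poses = [estimate_pose(f, b) for f, b in zip(frames, boxes)]
    return boxes, poses
```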

 

Comments 2: In the architecture description section, the authors list the parameters and the values defined for each of them. Please justify why these concrete values were selected: Batch=4, T: Time (clip_length=16), H: Height=224, W: Width=224, N: Tokens=196, K: Keypoints=(17, 2), D: Dimension=768.

Response 2: These values were chosen to align with established standards and best practices from the foundational models we employed (e.g., VideoMAE) and to balance performance with the memory constraints of our GPU hardware. We have added a paragraph in the new "3.1.1. Upstream Model Selection and Hyperparameter Justification" subsection to provide a clear justification for each of these values. – page 6, paragraph 2, and lines 215-222.
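
For illustration, the following sketch shows how these values relate to one another for a ViT-style backbone with 16×16 patches, which is the standard VideoMAE configuration; the patch size is assumed here and is not part of the quoted parameter list.

```python
# Illustrative only: how the reported constants fit together for a ViT-Base
# backbone with 16x16 patches (standard for VideoMAE); patch size is assumed.
B, T = 4, 16                       # batch size, frames per clip
H = W = 224                        # input resolution
PATCH = 16                         # ViT patch size (assumption)
N = (H // PATCH) * (W // PATCH)    # 14 * 14 = 196 spatial tokens per frame
K = (17, 2)                        # 17 COCO keypoints with (x, y) coordinates
D = 768                            # ViT-Base embedding dimension

assert N == 196
print(f"RGB clip shape:  ({B}, {T}, 3, {H}, {W})")
print(f"Pose clip shape: ({B}, {T}, {K[0]}, {K[1]})")
```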

 

Comments 3: "... we split the dataset by subject, using subjects 1-16 for training, 17 for validation, and 18-25 for testing ...". If in the future, another researchers will want to replicate and improve the achieved accuracy, how they will separate which videos should be assigned to train/val/test splits? Do authors keep the same seed each time training the model, so that mp4 records are exactly in the certain split repeating training process?

Response 3: The KTH dataset has a standard evaluation protocol where the split is performed by subject ID, not by random video selection. This ensures that any researcher using the same subject IDs for the splits will have the exact same training, validation, and test sets, thus ensuring replicability without needing a random seed for the split itself. We have revised the text in Section 4.1 to make this explicit. – page 9, paragraph 1, and lines 310-313.
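
For clarity, the following sketch shows how such a deterministic, subject-based split can be constructed; it assumes the standard KTH filename convention (e.g., "person01_boxing_d1_uncomp.avi") and is an illustration rather than our exact data-loading code.

```python
# Deterministic, seed-free KTH split by subject ID, assuming the standard KTH
# filename convention; illustrative sketch only.
import re
from pathlib import Path

TRAIN_SUBJECTS = set(range(1, 17))    # subjects 1-16
VAL_SUBJECTS = {17}                   # subject 17
TEST_SUBJECTS = set(range(18, 26))    # subjects 18-25


def split_of(video: Path) -> str:
    subject = int(re.match(r"person(\d+)_", video.name).group(1))
    if subject in TRAIN_SUBJECTS:
        return "train"
    if subject in VAL_SUBJECTS:
        return "val"
    return "test"


splits = {"train": [], "val": [], "test": []}
for video in sorted(Path("KTH").glob("*.avi")):   # hypothetical dataset root
    splits[split_of(video)].append(video)
```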

 

Comments 4: It is not clear, but probable, that the training was performed once. Here is the most important remark: in Tables 2, 3, and 4, the authors compare the accuracy of the proposed model to existing results reported for the reference models. How can the authors be sure about replicability? Did all of them use the same data for training and testing, or may the recordings have been mixed as well?

Response 4: Our comparisons are made against the results reported in the original, peer-reviewed publications for each respective model. This is a standard practice in the field, as these papers use the same standard dataset splits (e.g., the KTH subject-specific split, and official split 1 for UCF101 and HMDB51) that we do, ensuring a fair and replicable comparison. To make this explicit for the reader, we have added a clarifying sentence in Section 4.3. – page 9, paragraph 3, and lines 335-337.

 

Proposed improvement 1: The main limitation of the paper lies in its reliance on a two-stage pipeline, where the pose estimation via ViTPose++ is performed independently and remains frozen during training. This makes the model vulnerable to errors in pose estimation, especially under motion blur or occlusion. To address this, the authors should integrate the pose estimator into an end-to-end trainable framework, allowing joint optimization with the action recognition head to improve robustness.

Response to proposed improvement 1: We agree, and we thank the reviewer for this excellent suggestion. Implementing an end-to-end trainable framework would require a significant architectural redesign and a new set of experiments beyond the scope of the current work. We therefore acknowledge that the two-stage pipeline is a limitation and that end-to-end training is a key direction for future research, as it could mitigate cascaded errors from the pose estimator, particularly in challenging scenarios with motion blur or occlusion. We have revised the manuscript to acknowledge this limitation more strongly and have expanded the discussion in the "Conclusion and Future Work" section to explicitly highlight end-to-end training as a primary avenue for future research. – page 15, paragraph 1, and lines 503-507.

 

Proposed improvement 2: The PoseEncoder is relatively simplistic, using flattened 2D keypoints and a single Transformer encoder layer, which limits its ability to model complex spatial relationships between joints. This can be improved by adopting a graph-based or spatiotemporal pose model, such as ST-GCN, which captures joint dependencies more effectively.

Response to proposed improvement 2: We agree that more sophisticated pose encoders, such as Graph Convolutional Networks (GCNs), could capture inter-joint dependencies more effectively. Our current design prioritizes simplicity and efficiency, demonstrating that even a lightweight encoder can yield strong results when combined with our novel fusion mechanism (a sketch of such a lightweight encoder is shown below). We believe that exploring graph-based models is an excellent direction for future research. Accordingly, we have acknowledged this as a valuable avenue for improvement in the "Conclusion and Future Work" section and have added a citation to the foundational ST-GCN paper. – page 15, paragraph 1, and lines 497-500.
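
For reference, a lightweight pose encoder of the kind described (flattened 2D keypoints followed by a single Transformer encoder layer) might look like the following sketch; the layer sizes are assumptions and this is not our exact implementation.

```python
import torch
import torch.nn as nn


class LightweightPoseEncoder(nn.Module):
    """Sketch: flatten (17, 2) keypoints per frame, project to D dimensions,
    and apply one Transformer encoder layer over time. Hyperparameters assumed."""

    def __init__(self, num_joints: int = 17, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(num_joints * 2, d_model)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)

    def forward(self, pose: torch.Tensor) -> torch.Tensor:
        # pose: (B, T, 17, 2) -> flatten joints -> (B, T, 34) -> (B, T, D)
        x = self.proj(pose.flatten(start_dim=2))
        return self.encoder(x)   # one pose token per time step


# Example: LightweightPoseEncoder()(torch.randn(4, 16, 17, 2)) -> (4, 16, 768)
```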

 

Proposed improvement 3: The evaluation is also limited to smaller or moderately sized datasets (KTH, UCF101, HMDB51), which do not fully test generalization. To improve this, the authors should include experiments on larger-scale benchmarks like Kinetics400 or Something-Something V2.

Response to proposed improvement 3: We thank the reviewer for this important suggestion. We agree that evaluating TransMODAL on larger-scale benchmarks like Kinetics-400 and Something-Something V2 is a critical next step to fully assess its scalability and generalization capabilities. While conducting these extensive new experiments is beyond the scope of the current manuscript, we have explicitly acknowledged this limitation and highlighted it as a primary direction for future work. This is stated in the "Conclusion and Future Work" section. – page 15, paragraph 1, and lines 495-497.

 

Proposed improvement 4: The model claims computational efficiency, yet there is no detailed comparison of inference speed, FLOPs, or model size versus state-of-the-art flow-based or dual-stream models. Adding such comparisons would better position TransMODAL in terms of practical deployment.

Response to proposed improvement 4: We thank the reviewer for this valuable suggestion. To better contextualize our model's efficiency, we have added a new Table 6 and a corresponding paragraph in the "Analysis and Discussion" section (Section 5). This table provides a direct comparison of TransMODAL's trainable parameters, FLOPs, and accuracy against prominent two-stream models that use computationally expensive optical flow (I3D and R(2+1)D), highlighting its competitive efficiency. – page 14, paragraph 2, and lines 458-468.
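
For readers who wish to reproduce such efficiency figures, trainable parameters can be counted directly in PyTorch, and FLOPs can be estimated with a profiling tool; the snippet below is a generic sketch rather than the exact procedure used for Table 6.

```python
import torch


def trainable_params_m(model: torch.nn.Module) -> float:
    """Trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6


# FLOPs can be estimated with a profiler, e.g. fvcore (if installed):
#   from fvcore.nn import FlopCountAnalysis
#   flops_g = FlopCountAnalysis(model, dummy_clip).total() / 1e9
```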

 

Proposed improvement 5: While qualitative results are strong, more in-depth attention visualizations and a broader ablation study (e.g., alternative fusion mechanisms or deeper pose encoders) would provide greater insight into the architecture's design choices and performance.

Response to proposed improvement 5: We thank the reviewer for these excellent suggestions for extending the analysis. We agree that a broader ablation study on different fusion mechanisms or deeper pose encoders, as well as more in-depth visualizations of the cross-attention weights, would provide valuable insights into the model's behavior. Given that these experiments would require significant implementation and retraining, we believe they constitute a substantial body of work that is best addressed in a future study. We have, however, incorporated these valuable suggestions into our "Conclusion and Future Work" section to explicitly highlight them as important and promising avenues for further research. – page 15, paragraph 1, and lines 497-500.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This paper has several serious issues as follows.
1. The author should refer to how other papers on deep neural networks describe their methods. The general practice is to provide an architecture figure of the entire network in the third part. Please note that this should be an architecture figure, not a flowchart: a flowchart is used to describe an algorithm, whereas an architecture figure specifies the input and output feature map sizes, the fusion method, the convolutions used, and other information for each module, so that readers can reproduce the author's work. Figure 1 uses a flowchart to describe the proposed network, which appears very unprofessional and lacks necessary information.

2. The persuasiveness of the experimental part is poor. The methods used by the author for comparison (such as references 4, 5, 8, 9, 10) were published between 2017 and 2019. These methods are outdated and are not state-of-the-art.

3. In the experimental section, experiments were conducted on different datasets and compared with different methods. Is the proposed method not as effective as some of the comparative methods, and were those comparisons intentionally avoided? For example, the experiment on the HMDB51 dataset was compared with reference 6, while the experiment on UCF101 was not compared with reference 6.

4. The author claims to have used a Transformer but needs to explain why the Transformer is suitable for the current task. This applicability should be explained from a technical and theoretical perspective, rather than simply stating that using a Transformer recognition network yields good results. As is well known, the main contributions of the Transformer are self-attention and multi-head attention. The author did not use these two attention mechanisms, so why is the proposed network still called a Transformer?

5. The author claims to have proposed two modules, a CoAttentionFusion module and an efficient AdaptiveSelector. However, detailed network structure diagrams were not provided for these two modules. Readers are unable to determine the sizes of the input and output feature maps, and it is impossible to follow the forward propagation flow.

6. Section 1.1 is included in the Introduction (the first part), but there is no Section 1.2. Please prepare your manuscript carefully.

7. Remove the black dots before lines 77 and 87, as this is not a PowerPoint presentation.

8. In the second part, on the related literature, the author only lists the literature without summarizing it and does not use that summary to establish the problem this article needs to solve.

Author Response

 

Response to Reviewer 2 Comments

 

 

1. Summary

 

 

 

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files.

In response to your comments, we have incorporated color for highlighting in the re-submitted file.

 

2. Questions for General Evaluation

Does the introduction provide sufficient background and include all relevant references? – Must be improved

Are all the cited references relevant to the research? – Must be improved

Is the research design appropriate? – Must be improved

Are the methods adequately described? – Must be improved

Are the results clearly presented? – Must be improved

Are the conclusions supported by the results? – Must be improved

3. Point-by-point response to Comments and Suggestions for Authors

 

Comments 1: The author should refer to how other papers on deep neural networks describe their methods. The general practice is to provide an architecture figure of the entire network in the third part. Please note that this should be an architecture figure, not a flowchart: a flowchart is used to describe an algorithm, whereas an architecture figure specifies the input and output feature map sizes, the fusion method, the convolutions used, and other information for each module, so that readers can reproduce the author's work. Figure 1 uses a flowchart to describe the proposed network, which appears very unprofessional and lacks necessary information.

Response 1: We thank the reviewer for this valuable and detailed feedback. We agree. A professional architectural diagram is the standard for deep learning papers and provides significantly more clarity for reproducibility than the previous flowchart. We have, accordingly, completely replaced the former Figure 1 with a new, detailed architectural diagram. This new figure illustrates the internal components of our proposed modules, including the convolutional layers and residual connection in the PoseEncoder, the dual cross-attention mechanism in CoAttentionFusion, the learnable scoring layers in AdaptiveSelector, and the final MLP in the ActionClassifier. The transformation of tensor dimensions is clearly marked at each stage to provide all necessary information for readers to understand and reproduce our work. – page 5, paragraph 4, and lines 199-203.

 

 

Comments 2: The persuasiveness of the experimental part is poor. The methods used by the author for comparison (such as references 4, 5, 8, 9, 10) were published between 2017 and 2019. These methods are outdated and are not state-of-the-art.

 

Response 2: We thank the reviewer for this important feedback on the persuasiveness of our experimental comparisons. We agree that comparing our work against recent, state-of-the-art (SOTA) methods is crucial for demonstrating its significance. Our intention in including foundational models like I3D [8] and R(2+1)D [9] was to provide context and demonstrate the performance gains over classic CNN-based architectures. However, we acknowledge the reviewer's point that the comparison tables should be more focused on contemporary, Transformer-based models. We wish to gently note that our original tables did include comparisons to the highly relevant and recent VideoMAE (2022) [10] and VideoMAE V2-g (2023) [38], which are considered SOTA. To further strengthen our experimental section and directly address the reviewer's concern, we have revised our comparison tables (now Tables 3 and 4) to include additional, more recent SOTA Transformer-based models: MViTv2 (2022) and UniFormerV2 (2022). These additions provide a more robust and current context for evaluating our model's performance. – page 10, paragraphs 2-3, and lines 350-354.

 

 

 

Comments 3: In the experimental section, experiments were conducted on different datasets and compared with different methods. Is the proposed method not as effective as some of the comparative methods, and were those comparisons intentionally avoided? For example, the experiment on the HMDB51 dataset was compared with reference 6, while the experiment on UCF101 was not compared with reference 6.

Response 3: We thank the reviewer for their careful observation and for raising this important point about the consistency of our comparisons. The reviewer is correct that reference [6] was included in our HMDB51 comparison (Table 4) but was absent from the UCF101 table (Table 3). This was not an intentional omission to avoid a comparison. The reason is that the original publication for this method [6] reported results on the KTH, UCF Sports, and HMDB51 datasets but did not evaluate or report performance on the UCF101 benchmark. Therefore, we were unable to include a valid data point for it in Table 3.

 

Comments 4: The author claims to have used a Transformer but needs to explain why the Transformer is suitable for the current task. This applicability should be explained from a technical and theoretical perspective, rather than simply stating that using a Transformer recognition network yields good results. As is well known, the main contributions of the Transformer are self-attention and multi-head attention. The author did not use these two attention mechanisms, so why is the proposed network still called a Transformer?

Response 4: We agree with the reviewer that the theoretical justification for using a Transformer architecture should be more prominent. The Transformer's self-attention mechanism is uniquely suited for video understanding, as it can model long-range dependencies between pixels across both space and time, overcoming the limited receptive fields of traditional CNNs. This mechanism also provides a powerful and flexible framework for multi-modal fusion, where the query-key-value (QKV) formulation allows one modality (e.g., pose) to directly "query" another (e.g., RGB) to identify the most relevant features for a given motion. Specifically, our novel CoAttentionFusion module (described in Section 3.3.1) is built directly upon PyTorch's nn.MultiheadAttention to perform the critical cross-modal feature exchange (dual cross-attention); a minimal sketch of this pattern is given below. Furthermore, our entire RGB stream is powered by a pre-trained VideoMAE model, which is a Vision Transformer (ViT) backbone that relies on self-attention to model spatiotemporal dependencies. To ensure this is clear to all readers, we have added a new introductory paragraph to Section 3 that explains, from a theoretical and technical perspective, why the Transformer architecture is particularly well-suited for our task. – page 4, paragraph 3, and lines 160-170.
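
The following is a minimal sketch of this dual cross-attention pattern built on nn.MultiheadAttention, in which pose tokens query RGB tokens and vice versa; it illustrates the idea only and is not our exact CoAttentionFusion implementation (layer sizes are assumed).

```python
import torch
import torch.nn as nn


class DualCrossAttentionSketch(nn.Module):
    """Pose tokens query RGB tokens and vice versa, each path with a residual
    connection, layer norm, and a feed-forward block. Illustrative only."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.pose_from_rgb = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.rgb_from_pose = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_pose, self.norm_rgb = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ffn_pose = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.ffn_rgb = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, rgb: torch.Tensor, pose: torch.Tensor):
        # rgb: (B, N_rgb, D) visual tokens; pose: (B, N_pose, D) pose tokens
        pose_upd, _ = self.pose_from_rgb(query=pose, key=rgb, value=rgb)
        rgb_upd, _ = self.rgb_from_pose(query=rgb, key=pose, value=pose)
        pose = self.norm_pose(pose + pose_upd)
        rgb = self.norm_rgb(rgb + rgb_upd)
        return rgb + self.ffn_rgb(rgb), pose + self.ffn_pose(pose)
```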

 

 

Comments 5: The author claims to have proposed two modules, a CoAttentionFusion module and an efficient AdaptiveSelector. However, detailed network structure diagrams were not provided for these two modules. Readers are unable to determine the sizes of the input and output feature maps, and it is impossible to follow the forward propagation flow.

Response 5: We thank the reviewer for this crucial feedback. We agree with this valuable point completely. A detailed illustration of our two novel modules, CoAttentionFusion and AdaptiveSelector, is essential for clarity and reproducibility. The original Figure 1 was intended to provide a high-level overview of the entire system, but it did not adequately detail the internal forward propagation logic and feature map transformations within our core contributions. To address this, in addition to revising Sections 3.3.1 and 3.3.2, we have added a new Figure 2 to the manuscript, which is dedicated to illustrating the detailed architecture of these two modules. This new figure visually depicts the parallel cross-attention paths, feed-forward networks, and residual connections within CoAttentionFusion, as well as the two-stage learnable scoring and token pruning process in AdaptiveSelector. All input and output tensor shapes are clearly labeled to show the forward propagation flow. Consequently, all subsequent figures have been renumbered (the former Figure 2 is now Figure 3, etc.), and all in-text references have been updated accordingly. – page 7, paragraphs 2-3, lines 254-284, and page 8, paragraph 1, lines 285-299.
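
As a rough illustration of the learnable scoring and token-pruning idea, the sketch below ranks tokens with a linear scorer and keeps only the top-k; it is not our exact AdaptiveSelector (the keep ratio and scorer are assumptions), but it conveys the forward propagation flow.

```python
import torch
import torch.nn as nn


class TokenPruningSketch(nn.Module):
    """Learnable token scoring followed by top-k pruning; illustrative only."""

    def __init__(self, d_model: int = 768, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)   # one importance score per token
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) -> (B, k, D) with k = keep_ratio * N
        scores = self.scorer(tokens).squeeze(-1)              # (B, N)
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        keep = scores.topk(k, dim=1).indices                  # (B, k)
        keep = keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, keep)                         # kept tokens only
```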

 

Comments 6: Section 1.1 is included in the Introduction (the first part), but there is no Section 1.2. Please prepare your manuscript carefully.

Response 6: We thank the reviewer for their careful attention to the manuscript's formatting. We agree that the single subheading "1.1. Problem Statement" was unnecessary and created an inconsistency in the structure. We have, accordingly, removed this subheading and integrated the text directly into the main body of the Introduction to improve the flow and correct the formatting. – page 1, paragraph 5, and lines 34-42.

 

Comments 7: Remove the black dots before lines 77 and 87, as this is not a PowerPoint presentation.

Response 7: We thank the reviewer for pointing out this formatting issue. We agree that the bullet points are not appropriate for a formal manuscript. We have removed the bulleted list and have rewritten the two contributions as a single, cohesive paragraph to improve the flow and professionalism of the text. – page 2, paragraph 4, and lines 73-94.

 

 

Comments 8: In the second part, on the related literature, the author only lists the literature without summarizing it and does not use that summary to establish the problem this article needs to solve.

Response 8: We thank the reviewer for this excellent feedback. We agree that the "Related Work" section should more clearly synthesize the existing literature to build a compelling argument for our research. The previous version described relevant works but did not sufficiently connect their limitations to the problems that TransMODAL is designed to solve. To address this, we have revised the end of each subsection within the "Related Work" section (Sections 2.1, 2.2, and 2.3) to include a concluding summary. These summaries now explicitly state the key takeaway from each line of research and highlight the specific gaps (such as the limited receptive fields of CNNs, the modality limitations of RGB-only Transformers, and the open questions in fusion strategies) that motivate our work. This creates a clearer logical progression that culminates in the problem statement our paper addresses. – page 3, paragraph 3, lines 116-118; page 3, paragraph 4, lines 134-137; and page 4, paragraph 1, lines 145-147.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have addressed my questions in the response and in the manuscript. After revision, the paper has improved.

Reviewer 2 Report

Comments and Suggestions for Authors

I