Deep Learning Algorithms for Human Activity Recognition in Manual Material Handling Tasks
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This manuscript utilizes the commercial Xsens MVN suit sensor set and a self-developed surface electromyographic (sEMG) device to collect data. It focuses on exploring the recognition performance of BiLSTM, Sp-DAE, Recurrent Sp-DAE, and RCNN algorithms for seven activities: N-pose (N), Lifting from the Floor (LF), Keeping lifted (K), Placing on the Table (PT), Lifting from the Table (LT), Carrying (W), and Placing on the Floor (PF). The research is systematic and thorough, but from my perspective, the following issues should be addressed:
(1) There are multiple grammatical and typographical errors in the manuscript, such as the misspelled word "ACcumulate" in the abstract.
(2) The background section is too lengthy for a research paper; the authors should better distinguish between a manuscript and a thesis.
(3) While the use of "materials" in the term "Manual Material Handling" is appropriate, its appearance in the section titled "3. Methods and Materials" is ambiguous.
(4) For the seven MMH activities, it is worth discussing whether certain activity types are better suited to specific algorithms and how to evaluate the claimed algorithmic advantages in practical applications.
(5) The inclusion of the following references would improve the comprehensiveness of the manuscript: 10.3390/s25082520
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript addresses a relevant and timely problem in Human Activity Recognition (HAR) for Manual Material Handling (MMH) using wearable sensors. The topic is well aligned with the scope of Sensors and has clear industrial and ergonomic implications. The study presents several deep learning architectures (BiLSTM, Recurrent Sp-DAE, RCNN) compared with a benchmark (DeepConvLSTM).
However, important methodological, validation, and reporting issues must be addressed before the manuscript can be accepted.
- The optimal hyperparameter combinations are selected using the full dataset (a 70/30 split with samples mixed across all subjects) and subsequently evaluated under the LOSO protocol. This approach increases the risk of overfitting to the overall data distribution and introduces information leakage between subjects, potentially leading to biased hyperparameter selection and inflated performance estimates.
- The description of data characteristics and preprocessing is insufficient. Key details are missing, including the sampling frequency, the filtering and normalization procedures (subject-wise or global?), the synchronization between IMU and sEMG signals, the RMS window configuration for sEMG, the stride and overlap of the 1-second windows, and the standardization strategy used in the LOSO setup (e.g., whether normalization statistics are computed only on the training set).
- The applied downsampling of the N-pose class and the addition of white noise to minority classes may distort temporal correlations and fail to capture inter-subject variability. Class balancing and augmentation should consider time-series-specific transformations such as controlled-SNR jittering, scaling, time-warping, random cropping or permutation, magnitude-warping, and mixup for sequential data. For sEMG, EMG-specific filtering and normalization techniques should be employed. The impact of augmentation on model performance should be evaluated through ablation studies.
- Benchmarking exclusively against DeepConvLSTM is insufficient for the 2023–2025 state of the art. Comparisons should include modern architectures such as Temporal Convolutional Networks (TCN), InceptionTime, GRU-FCN, or Transformer-based sequence models (e.g., Transformer-TS).
- No statistical analysis or uncertainty quantification is provided. The manuscript reports only single F1-scores without confidence intervals, standard deviations, or statistical tests (e.g., Wilcoxon signed-rank tests) to assess the significance of observed differences.
- Complexity metrics are ambiguous. For MAC/MA operations, it is unclear whether they are computed per window, per second, or for batch size = 1, and whether preprocessing is included. The paper also lacks the total number of parameters (weights), the memory footprint, and the inference latency on a representative embedded platform (e.g., an MCU or SoC).
- The RCNN architecture lacks justification. The rationale for using 2D convolutions, a dilation factor of 2×2, and a stride of 1×4 is not explained. No ablation studies are provided (e.g., comparison with a 1D-CNN, with/without pooling, GRU vs. LSTM, or varying the number of convolutional blocks).
- Experimental setup and participant information are incomplete. The manuscript does not describe laboratory conditions, participant demographics, sensor placement variability, hand dominance, sEMG artifacts (e.g., sweating, motion noise), load magnitudes, or task execution speeds. The potential domain shift between controlled laboratory data and real-world conditions (lab → field) is not discussed.
- Although the study is motivated by the prevention of Work-Related Musculoskeletal Disorders (WMSDs), it does not demonstrate how the detected activity classes translate into ergonomic risk indices (e.g., RULA, REBA, or the NIOSH LI). Suggestion for the authors: include a prototype analysis pipeline linking activity detection to biomechanical parameters and ergonomic risk indicators (activity → kinematic/muscular parameters → risk score), even if only as a simple correlation or agreement analysis with expert ratings.
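As an illustration of the prototype pipeline suggested above (activity → kinematic parameters → risk score), a detected lifting event (e.g., the LF or LT classes) could feed the revised NIOSH lifting equation. The sketch below is purely illustrative: the multiplier formulas are the standard metric-unit versions, while the frequency (FM) and coupling (CM) multipliers, which come from table lookups in the NIOSH manual, are assumed to be supplied by the caller.

```python
def rwl_metric(H, V, D, A, FM, CM, LC=23.0):
    """Recommended Weight Limit (kg), revised NIOSH lifting equation (metric).

    H: horizontal hand distance (cm), V: vertical hand height (cm),
    D: vertical travel distance (cm), A: asymmetry angle (degrees).
    FM and CM are table-derived multipliers supplied by the caller.
    """
    HM = 25.0 / max(H, 25.0)           # horizontal multiplier (HM = 1 for H <= 25 cm)
    VM = 1.0 - 0.003 * abs(V - 75.0)   # vertical multiplier, optimum at 75 cm
    DM = 0.82 + 4.5 / max(D, 25.0)     # distance multiplier (DM = 1 for D <= 25 cm)
    AM = 1.0 - 0.0032 * A              # asymmetry multiplier
    return LC * HM * VM * DM * AM * FM * CM

def lifting_index(load_kg, **lift_params):
    """Lifting Index = actual load / RWL; LI > 1 indicates elevated risk."""
    return load_kg / rwl_metric(**lift_params)
```

In a full pipeline, H, V, D, and A would be estimated from the Xsens kinematics over the time window that the classifier labels as a lift, and the LI logged per detected event.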
Specific questions the authors should address:
- What stride and overlap were applied for the 1-second segmentation windows?
- How exactly were MAC/MA operations computed — per window, per batch size = 1, and for which components (only GRU/LSTM layers or the entire network)?
- Was normalization performed exclusively on the training data in the LOSO evaluation?
- What filters and RMS parameters were used for the sEMG signals (frequency band, window length)?
- Were alternative architectures (e.g., TCN, InceptionTime, Transformer-TS) or classical ML models with engineered features tested?
- Will the source code and trained models be made publicly available?
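Regarding the question on normalization in the LOSO evaluation, one leakage-free scheme is to compute the normalization statistics from the training subjects only, inside each fold. The sketch below is a minimal illustration of that idea, assuming per-subject lists of scalar feature values; names and data layout are hypothetical.

```python
from statistics import mean, stdev

def loso_folds(data_by_subject):
    """Yield (held_out_subject, normalized_train, normalized_test) folds,
    with z-score statistics computed from the training subjects only."""
    subjects = sorted(data_by_subject)
    for held_out in subjects:
        # pool samples from all subjects except the held-out one
        train = [x for s in subjects if s != held_out
                 for x in data_by_subject[s]]
        mu, sigma = mean(train), stdev(train)

        def z(xs):
            return [(x - mu) / sigma for x in xs]

        yield held_out, z(train), z(data_by_subject[held_out])
```

Nesting a second LOSO (or k-fold) loop over the training subjects of each fold for hyperparameter selection would additionally remove the leakage noted in the first comment.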
Editorial Comments
The manuscript requires numerous editorial and linguistic corrections to improve clarity, consistency, and overall readability. Specific issues are listed below:
- Language and typography: Correct typographical errors and inconsistent terms, e.g., “latters” → “latter”, “icrease” → “increase”, “intepreted” → “interpreted”, “netoworks” → “networks”, and use “std. dev.” instead of “dev std.” Ensure consistent capitalization (e.g., “ACcumulate” → “Accumulate”). Remove redundant headers such as “Version September 25, 2025…”. Maintain consistent terminology: epochs (avoid “Eps”) and hidden units (avoid the “HUs” abbreviation in figures; use full terms in captions).
- Abstract: Clarify that MAC/MA values are computed per 1-second window; replace vague wording such as “more than 100 greater” with exact ratios (e.g., ×k) and numerical values. Include the dataset name and reference (DOI or data availability statement).
- Figures 2–5: Add color scales, axis ranges, consistent limits across subplots, clear legends indicating hyperparameters, and improve font readability for figure labels.
- Table 5: Include the total number of model parameters and latency metrics; specify that the MAC/MA values are computed per 1-second window.
- Section 3.1: Explicitly report the stride and overlap used for the 1-second segmentation (e.g., 50% overlap), clarify whether normalization is performed per-subject or globally, and describe any resampling or artifact removal steps for EMG data. If EMG band-pass filtering was applied, specify the frequency range (e.g., 20–450 Hz).
- Modality fusion: The paper employs Early Fusion “for simplicity”; it would be valuable to include at least a brief comparison or discussion of Early vs. Mid- or Late-Level fusion, or to reference relevant studies illustrating the impact of fusion strategy on HAR performance.
- PT class performance: Provide an error analysis for the “Placing on the Table (PT)” class (e.g., using saliency, attention, or SHAP visualizations for temporal sequences), and include example time windows where PT is misclassified as N-pose.
- Metric consistency: Clarify whether the reported F1-score is macro, micro, or weighted. Provide a precise definition and justify the choice (macro-F1 is generally preferred under class imbalance).
- Terminology: Use “state of the art (SoA)” at first mention instead of the abbreviation alone.
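To make the macro/micro/weighted distinction raised above concrete, the sketch below computes per-class, macro-, and support-weighted F1 from label sequences in pure Python; function and variable names are illustrative, not taken from the manuscript.

```python
from collections import Counter

def f1_report(y_true, y_pred):
    """Per-class F1, macro-F1 (unweighted class mean), and
    support-weighted F1 from two label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
    macro = sum(per_class.values()) / len(per_class)
    support = Counter(y_true)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return per_class, macro, weighted
```

Under class imbalance (e.g., a dominant N-pose class), the weighted average is pulled toward the majority class's score, which is why macro-F1 is usually the more informative choice here.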
Recommendation: Major Revision before Acceptance.
The paper is promising and demonstrates clear application value; however, it requires substantial improvements in methodological rigor and presentation. Specifically, the validation strategy should be strengthened (e.g., nested LOSO), the set of baseline models should be expanded, the data processing pipeline should be more precisely described, and model complexity should be evaluated using metrics relevant to edge deployment. After these revisions—particularly the inclusion of ablation studies and statistical analyses with confidence intervals—the conclusions will be considerably more robust.
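For the requested ablation studies, the controlled-SNR jittering and magnitude-scaling augmentations suggested earlier can be realized in a few lines; the sketch below is a generic illustration (function names, default scaling range, and the per-window scalar signal are assumptions, not the authors' pipeline).

```python
import math
import random

def jitter(signal, snr_db, rng=None):
    """Additive Gaussian jitter at a controlled signal-to-noise ratio (dB)."""
    rng = rng or random.Random()
    p_sig = sum(x * x for x in signal) / len(signal)        # signal power
    sigma = math.sqrt(p_sig / 10 ** (snr_db / 10.0))        # noise std for target SNR
    return [x + rng.gauss(0.0, sigma) for x in signal]

def magnitude_scale(signal, low=0.9, high=1.1, rng=None):
    """Random per-window amplitude scaling (one factor per window)."""
    rng = rng or random.Random()
    s = rng.uniform(low, high)
    return [s * x for x in signal]
```

Unlike plain white noise, the jitter amplitude here tracks each window's own power, so minority-class windows are perturbed proportionally rather than uniformly.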
In addition, the manuscript requires extensive editorial refinement to meet publication standards, particularly in terms of formatting consistency, figure quality, linguistic accuracy, and the uniform use of technical terminology.
Comments for author File:
Comments.pdf
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The authors should explain how Figures 2 and 3 were obtained.
Reviewer report
The authors propose four Deep Learning algorithms for HAR in MMH: Bidirectional Long Short-Term Memory (BiLSTM), Sparse Denoising Autoencoder (Sp-DAE), Recurrent Sp-DAE, and Recurrent Convolutional Neural Network (RCNN). They explored different hyperparameter combinations to maximize the classification performance (F1-score) using wearable sensors’ data gathered from 14 subjects.
My comments are the following:
- The article proposes a system for recognizing MMH tasks based on deep neural networks, with excellent results in terms of accuracy and computational lightness. However, considering that one of the stated objectives is the use for biomechanical risk assessment, it would be useful to include a section that more explicitly describes how classified data (e.g. recognized activities, sEMGs, kinematics) can be used for quantitative biomechanical assessment. For example, integration with musculoskeletal models, biomechanical load indices, or standardized methods such as NIOSH or OCRA could be discussed.
- The authors state that sEMG signals may be useful for robotic control in shared tasks. It would be useful to better explain how these signals can be integrated into a robot control system: for example, for the estimation of motor intention, the adaptation of assistance based on muscular effort, or shared control. A brief discussion of possible control architectures or concrete application scenarios (e.g., cobots) would enrich the contribution.
- The authors propose and compare several deep learning architectures for MMH task recognition, including BiLSTM, RCNN, and recurrent autoencoders. However, the Transformer architecture is not considered, which in recent years has shown excellent performance on time-series tasks, including HAR. It would be useful to include a brief discussion of why Transformers were not considered, or a theoretical comparison between the proposed architectures and self-attention-based models, in terms of the ability to model temporal dependencies, noise robustness, and computational scalability.
- The authors state that the proposed architectures are suitable for embedded systems due to their computational lightness. However, the manuscript does not provide a description of the target embedded architecture or an assessment of performance on real-world devices. It would be useful to include a section describing a possible embedded implementation scenario (e.g., microcontroller, edge device, wearable), specifying memory requirements, inference time, and power consumption.
- The authors mention the use of the Xsens MVN system for the acquisition of motion data, but do not describe its technical characteristics. It would be useful to include a brief description of the system (e.g., number of IMUs, sampling frequency, accuracy, angular precision) to allow a better understanding of the quality and resolution of the data used, as well as to promote the replicability of the study.
- The study clearly fits into the Industry 4.0 paradigm, but the practical implications are not fully developed in the text. It is suggested to include a section or paragraph explaining how the proposed architectures can be integrated into real-world industrial scenarios, for example in real-time ergonomic monitoring systems, adaptive control of cobots, or edge platforms for occupational safety. Such a discussion would strengthen the applied relevance of the work.
- Although the F1-score is a robust metric in the presence of class imbalances, it is suggested to also include precision and accuracy values in the results, at least for the best models (BiLSTM and RCNN). This would allow a more complete evaluation of performance and facilitate comparison with other studies.
Comments for author File:
Comments.pdf
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have adequately addressed the reviewers' comments.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have revised the manuscript according to my previous comments and suggestions. I am satisfied with the current version and recommend it for acceptance.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have addressed the reviewer's concerns.
The manuscript can be accepted for publication.