Enhanced Distributed Multimodal Federated Learning Framework for Privacy-Preserving IoMT Applications: E-DMFL
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript addresses a timely and relevant problem: enabling privacy-preserving federated learning in Internet of Medical Things (IoMT) applications. It tries to handle multimodality, communication efficiency, and device heterogeneity. The proposed E-DMFL framework combines gated attention fusion, Shapley-based modality selection, differential privacy, and communication optimization, and is evaluated on the EarSAVAS dataset. The idea of integrating these techniques into a single framework is promising and of clear potential interest.
That said, the manuscript suffers from several significant issues that need to be resolved before it can be considered for publication. The structure is imbalanced, the related work is underdeveloped, the evaluation scope is limited, and the tone is at times self-promotional rather than scientific. My detailed concerns are outlined below.
Related works, introduction and literature analysis
The manuscript lacks a dedicated related works section. Instead, Table 1 is presented in the introduction, which compares selected prior approaches without first introducing or analyzing them. This table is positioned too early, and in its current form resembles a decision matrix rather than a synthesis of the state of the art. Readers are not guided through the evolution of multimodal FL methods, their contributions, and their shortcomings. A more suitable approach would be to develop a full related works section that reviews the major multimodal FL approaches, highlights their limitations in IoMT deployment, and then concludes with a comparative table. This would position E-DMFL more effectively. The introduction itself also presents detailed results, which does not deliver the function of the introduction and should be moved to the results or discussion section.
The related literature is treated superficially. While methods such as FedMM, DP-FLHealth, CAR-MFL, and EMFed are mentioned, they are not examined in detail. Readers expect a critical analysis of how prior work addresses multimodality, non-IID data, privacy, and communication efficiency, and where they fall short for IoMT. At present the coverage reads like a catalog of references, not an analytical review.
Conclusion and contribution framing
The conclusion is dense and descriptive, reiterating claimed achievements rather than synthesizing the main findings. It does not sufficiently reflect the multi-component nature of the framework. A stronger conclusion would highlight the balance between privacy and utility, the demonstrated communication savings, and the robustness to missing modalities. It should also draw out broader implications for IoMT practice among others. At present, the conclusion compresses too much into a few lines and does not provide closure for the reader.
Limitations and Future Work
The paper does include a limitations section, but the discussion is shallow. For instance, the narrow evaluation on a single dataset and the short training horizon are mentioned but not critically analyzed. The future work section, meanwhile, reads like a generative AI list of ideas, without providing justification for why these directions are the most relevant, nor on what realistic horizon they might be achieved. A more credible future work section should identify two or three well-grounded directions and provide a rationale for each. Also, Figure 4 seems to be AI-generated.
Writing style
The manuscript includes frequent self-complimentary phrases such as “exceptional” and “paradigm-shifting,” which are inappropriate for scientific writing. The tone should be neutral and precise. The novelty and strength of the work should be conveyed through evidence and comparative analysis, not promotional language. In addition, some technical sections, notably Algorithm 1 and the architectural description, are excessively dense and could be reorganized for clarity.
Evaluation window
The reported results are based on only 6 communication rounds. While this does illustrate convergence speed, it does not reflect long-term stability or continual learning scenarios that are typical in healthcare deployments. For clinical IoMT applications, it is important to understand how the framework behaves under longer training horizons and continual updates. Without this, the claims of deployment readiness are overstated.
Privacy model assumptions
The privacy guarantees rest on differential privacy with Gaussian noise and secure aggregation. This assumes bounded gradients and calibrated noise. Such assumptions may not hold in practice where updates are skewed, unbounded, or adversarial. Moreover, the manuscript does not empirically evaluate against adaptive adversaries or collusion scenarios, which weakens the strength of the privacy claims. Addressing these gaps would substantially improve the paper.
Reproducibility and source code
A further concern is reproducibility. The framework integrates multiple complex components: fusion mechanisms, Shapley-based modality selection, privacy techniques, and communication strategies. Without public release of the source code and experimental setup, future researchers will struggle to replicate or extend this work. This significantly reduces its long-term impact. I strongly recommend that the authors make the source code, configuration scripts, and documentation publicly available. This would enable validation of results, allow benchmarking against future work, and strengthen the credibility of the paper.
Author Response
Comments 1: Related works, introduction and literature analysis
The manuscript lacks a dedicated related works section. Instead, Table 1 is presented in the introduction, which compares selected prior approaches without first introducing or analyzing them. This table is positioned too early, and in its current form resembles a decision matrix rather than a synthesis of the state of the art. Readers are not guided through the evolution of multimodal FL methods, their contributions, and their shortcomings. A more suitable approach would be to develop a full related works section that reviews the major multimodal FL approaches, highlights their limitations in IoMT deployment, and then concludes with a comparative table. This would position E-DMFL more effectively. The introduction itself also presents detailed results, which does not deliver the function of the introduction and should be moved to the results or discussion section.
The related literature is treated superficially. While methods such as FedMM, DP-FLHealth, CAR-MFL, and EMFed are mentioned, they are not examined in detail. Readers expect a critical analysis of how prior work addresses multimodality, non-IID data, privacy, and communication efficiency, and where they fall short for IoMT. At present the coverage reads like a catalog of references, not an analytical review.
Response 1: We have extensively revised the manuscript structure to address organizational concerns:
Added dedicated Related Works section (Section 2, pages 2-5, lines 54-163): We developed a comprehensive 4-subsection literature review covering: (2.1) Multimodal Federated Learning, (2.2) Privacy-Preserving Federated Learning, (2.3) Communication-Efficient Federated Learning, and (2.4) Federated Learning for IoMT. Each subsection provides critical technical analysis including quantified limitations (e.g., CAR-MFL's 2-3× communication overhead, DP-FLHealth's 40-60% cross-modal correlation degradation), hardware constraints (ARM Cortex-A, 2-4 GB RAM), and explains why methods fail for IoMT deployment.
Repositioned comparative table (Section 2.5, page 5, lines 141-163): Table 1 now concludes the Related Works section with analytical framing explaining the six evaluation dimensions and post-table synthesis identifying three critical gaps in prior work.
Revised introduction (Section 1, pages 1-2, lines 21-53): Removed all detailed experimental results (previously: 92.0% accuracy, 4.2× speedup, etc.) and shortened to problem motivation, high-level contributions, and forward reference to Section 5 for results.
Comments 2: Conclusion and contribution framing
The conclusion is dense and descriptive, reiterating claimed achievements rather than synthesizing the main findings. It does not sufficiently reflect the multi-component nature of the framework. A stronger conclusion would highlight the balance between privacy and utility, the demonstrated communication savings, and the robustness to missing modalities. It should also draw out broader implications for IoMT practice among others. At present, the conclusion compresses too much into a few lines and does not provide closure for the reader.
Response 2: We have substantially revised the conclusion (Section 6, pages 23-24, lines 690-726) to address structural concerns:
Synthesized findings: Rather than listing metrics, we now synthesize three key findings with their interrelationships: graceful degradation under sensor failures (92.0% to 85.7%), minimal privacy-utility tradeoff (0.9% accuracy cost for formal guarantees reducing attacks from 87% to 12%), and communication efficiency through intelligent modality selection (78% reduction, 4.2× convergence speedup).
Multi-component integration: Added explicit discussion of how architectural components work together: attention-based fusion maintains cross-modal correlations under DP noise, Shapley selection identifies critical modality combinations, trimmed mean aggregation provides Byzantine robustness, and quantization enables edge deployment. We clarify these operate as an integrated system rather than independent modules.
Broader implications: Expanded discussion of practical implications for IoMT deployments, demonstrating that formal privacy is achievable without centralized trust, models maintain utility with intermittent sensors, and federated learning can converge efficiently through modality selection. This provides concrete takeaways for practitioners deploying privacy-preserving health monitoring systems.
The revised conclusion provides synthesis and closure while maintaining technical precision.
Comments 3: Limitations and Future Work
The paper does include a limitations section, but the discussion is shallow. For instance, the narrow evaluation on a single dataset and the short training horizon are mentioned but not critically analyzed. The future work section, meanwhile, reads like a generative AI list of ideas, without providing justification for why these directions are the most relevant, nor on what realistic horizon they might be achieved. A more credible future work section should identify two or three well-grounded directions and provide a rationale for each. Also, Figure 4 seems to be AI-generated.
Response 3: We have significantly expanded and restructured the limitations and future work discussion (Section 6.1, pages 23-24, lines 711-726):
Critical analysis of limitations: Rather than merely listing constraints, we now provide analytical depth for three primary limitations: (1) single-domain evaluation explaining that EarSAVAS's 2-second windows and mild heterogeneity (KL=0.09±0.05) may not reflect tasks with longer temporal dependencies or extreme non-IID conditions (KL>0.5); (2) limited training horizon analyzing concerns about DP noise accumulation over extended training (>50 rounds) and temporal distribution shifts; (3) privacy model assumptions examining bounded gradient requirements and honest-but-curious adversary assumptions with implications for adaptive attacks and client collusion.
Well-grounded future directions: Replaced generic future work statements with three concrete, justified directions: (1) cross-domain validation on specific public datasets (MIT-BIH, FARSEEING, OhioT1DM) to establish generalization across temporal scales and modality correlations; (2) continual learning with privacy accounting for production deployments with temporal shifts; (3) adaptive adversary evaluation to validate privacy claims under realistic threat models. Each direction explicitly addresses an identified limitation with clear evaluation criteria.
Explicit grounding: Each future direction references specific datasets, evaluation methodologies, and connects directly to limitations identified in the analysis, avoiding vague speculation.
Figure 4 clarification: We confirm that Figure 4 was generated using Matplotlib based on experimental data from our federated learning runs. The visualization style follows standard scientific plotting conventions with clear axis labels, legends, and data points derived directly from Table 3 results. If the reviewer has specific concerns about the figure's appearance, we are happy to regenerate it with alternative styling or provide the raw plotting code for verification.
The revised section provides the critical depth requested while maintaining focus on actionable research directions.
Comments 4: Writing style
The manuscript includes frequent self-complimentary phrases such as “exceptional” and “paradigm-shifting,” which are inappropriate for scientific writing. The tone should be neutral and precise. The novelty and strength of the work should be conveyed through evidence and comparative analysis, not promotional language. In addition, some technical sections, notably Algorithm 1 and the architectural description, are excessively dense and could be reorganized for clarity.
Response 4: We have systematically revised the manuscript to eliminate promotional language and improve technical clarity:
Removed promotional terms (manuscript-wide): Conducted a complete review eliminating self-complimentary phrases. Specific changes include: "exceptional" → deleted, "superior performance" → "higher accuracy with specific metrics," "dramatically reduces" → "reduces by 78%," "significant advancement" → deleted, "novel" → deleted unless citing originality, "robust" → "resilient" with specific tolerance metrics. All comparative claims now use quantified performance differences rather than qualitative assessments.
Reorganized Algorithm 1 (Section 3.4, page 10, lines 267-277): Restructured with clear phase separators (Initialization, Server Distribution, Local Training, Server Aggregation), added visual spacing through blank lines between phases, and condensed verbose description from 5 paragraphs to 2 concise paragraphs referencing Section 3.3 for privacy details.
Reorganized Architecture (Section 3.2, pages 6-7, lines 176-201): Split into three clear subsections with headers: (1) Modality-Specific Encoders, (2) Attention-Based Fusion, (3) Shapley Value Modality Selection. Removed redundant "Architecture Specifications" paragraph. Each subsection now focuses on one technical aspect with improved readability.
All revisions maintain technical precision while ensuring neutral scientific tone throughout.
Comments 5: Evaluation window
The reported results are based on only 6 communication rounds. While this does illustrate convergence speed, it does not reflect long-term stability or continual learning scenarios that are typical in healthcare deployments. For clinical IoMT applications, it is important to understand how the framework behaves under longer training horizons and continual updates. Without this, the claims of deployment readiness are overstated.
Response 5: We acknowledge that the 6-round evaluation demonstrates initial convergence but not long-term production stability. We have made four revisions to address this concern:
Added training horizon discussion (Section 5.3, pages 22-23, lines 670-689): New subsection "Convergence Analysis and Training Horizon" explicitly distinguishes between demonstrated rapid convergence (6 rounds) and production requirements. We discuss two unaddressed dimensions: (1) long-term stability under temporal distribution shifts (seasonal patterns, sensor degradation), requiring evaluation over 50+ rounds on longitudinal datasets; (2) cumulative privacy noise effects over extended training (100+ rounds), where repeated noise injection impact remains empirically unvalidated despite theoretical composition bounds.
Toned down deployment claims (manuscript-wide): Replaced "deployment-ready" with "demonstrates deployment feasibility" throughout manuscript. Changed "ready for clinical deployment" to "suitable for pilot deployment." All claims now accurately reflect that 6-round experiments establish convergence efficiency and initial utility-privacy tradeoff, not production-scale stability.
Expanded limitations (Section 6.1, page 23, lines 717-720): Enhanced "Limited training horizon" limitation with critical analysis explaining that 6 rounds establishes initial convergence but omits temporal distribution shift handling and cumulative DP noise effects over extended training.
Future work specification (Section 6.1, page 24, lines 717-720): Added "Continual learning with privacy accounting" as concrete research direction addressing long-term deployment requirements, including privacy accounting for 100+ rounds and adaptation mechanisms for temporal shifts.
We now clearly distinguish between demonstrated convergence capability (6 rounds sufficient for proof-of-concept) and requirements for sustained production deployment (50-100+ rounds with continual learning), addressing the reviewer's concern about overstated claims.
Comments 6: Privacy model assumptions
The privacy guarantees rest on differential privacy with Gaussian noise and secure aggregation. This assumes bounded gradients and calibrated noise. Such assumptions may not hold in practice where updates are skewed, unbounded, or adversarial. Moreover, the manuscript does not empirically evaluate against adaptive adversaries or collusion scenarios, which weakens the strength of the privacy claims. Addressing these gaps would substantially improve the paper.
Response 6: We have conducted collusion attack simulations to address the concern about coordinated adversaries:
Collusion attack evaluation (Section 5.1.11, page 21, Figure 5, Table 8, lines 616-632): We simulated scenarios where k={1,2,5,10,20} malicious clients pool DP-noised gradients, reducing effective noise from σ to σ/√k. Results show:
- Baseline (k=1): 51.0% attack success (near random 50%), confirming DP protection
- Moderate collusion (k=5): 71.5% success (+20.6% degradation), indicating graceful privacy loss
- Heavy collusion (k=20, 48% of clients): 79.4% success (+28.4% degradation), still below no-DP baseline (87%)
Key findings: E-DMFL maintains reasonable privacy under moderate collusion (k≤5) with <21% degradation. Even with 48% of clients colluding (k=20), privacy does not degrade to no-DP levels, suggesting residual protection from multimodal information distribution. Trimmed mean aggregation (β=0.1) provides complementary Byzantine robustness, filtering coordinated outliers as validated in stress tests (Section 4.6.1, lines 422-433) maintaining >85% accuracy with 10% malicious clients.
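For illustration, the σ→σ/√k noise-pooling effect underlying this simulation can be reproduced in a few lines. The noise scale and dimensions below are illustrative placeholders, not our exact experimental parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # per-client Gaussian DP noise scale (illustrative)

def pooled_noise_std(k, dim=100_000):
    """Empirical std of the average of k independent N(0, sigma^2) noise vectors,
    i.e., the effective noise left when k colluders pool their DP-noised gradients."""
    noise = rng.normal(0.0, sigma, size=(k, dim))
    return float(noise.mean(axis=0).std())

for k in (1, 5, 20):
    print(f"k={k:2d}: empirical {pooled_noise_std(k):.3f} vs theory {sigma / k**0.5:.3f}")
```

Averaging k noised copies leaves residual noise of roughly σ/√k, which is why attack success grows with collusion size even though the per-client DP guarantee is unchanged.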
Limitations acknowledged (Section 6.1, page 23, lines 721-723): Our collusion simulation uses simplified coordination model (gradient pooling only). Sophisticated collusion strategies (timing attacks, adaptive coordination patterns) may achieve higher success rates. This is explicitly discussed in limitations with future work identifying adaptive adversary evaluation as a priority research direction.
We agree this evaluation substantially strengthens the privacy analysis by demonstrating privacy-collusion tradeoffs with quantified degradation rates.
Comments 7: Reproducibility and source code
A further concern is reproducibility. The framework integrates multiple complex components: fusion mechanisms, Shapley-based modality selection, privacy techniques, and communication strategies. Without public release of the source code and experimental setup, future researchers will struggle to replicate or extend this work. This significantly reduces its long-term impact. I strongly recommend that the authors make the source code, configuration scripts, and documentation publicly available. This would enable validation of results, allow benchmarking against future work, and strengthen the credibility of the paper.
Response 7: We acknowledge the critical importance of reproducibility for complex multi-component systems. We commit to the following:
Code and configuration release: We will release the complete implementation via GitHub repository upon paper acceptance. The release will include:
- Core implementation: PyTorch implementation of all architectural components (gated attention fusion, Shapley value modality selection, differential privacy mechanisms including gradient clipping and noise calibration, trimmed mean aggregation, quantization-aware training)
- Baseline implementations: Fair comparison implementations of FedAvg, DP-FedAvg, FedProx, FedNova, and other baselines using identical experimental configurations
- Data preprocessing pipelines: Complete preprocessing scripts for EarSAVAS dataset including audio spectrogram extraction (16 kHz, 128 mel bins, 80-8000 Hz), motion IMU processing (100 Hz, Butterworth filtering), normalization procedures, and user-level stratified partitioning
- Experimental configurations: Hyperparameter specifications (JSON/YAML configs), random seeds (42, 123, 456, 789, 2024 for 5-fold CV), privacy parameters (ε=1.0, δ=10⁻⁵, C=1.0), network simulation parameters (latency 10-500ms, bandwidth 5-100 Mbps)
- Evaluation scripts: Complete scripts reproducing all figures (convergence curves, loss progression, collusion attack analysis), tables (comparative performance, privacy evaluation, communication efficiency), and statistical analyses (confidence intervals, significance tests)
- Documentation: Setup instructions (dependencies, environment configuration), architecture diagrams, API reference for key components, tutorial notebooks demonstrating single-node training, federated aggregation, and privacy evaluation
- Hardware specifications: Detailed testbed configuration (ARM Cortex-A75 client specs, Intel Xeon server specs, network topology), Docker containers for consistent environment reproduction
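As a concrete illustration of the gradient clipping and noise calibration component listed above, the per-update privatization step can be sketched as follows. The clipping norm C=1.0 matches our configuration; the noise multiplier shown is an illustrative placeholder, since the calibrated value follows from the (ε=1.0, δ=10⁻⁵) accounting:

```python
import numpy as np

rng = np.random.default_rng(42)
C = 1.0                 # L2 clipping norm, as in our configuration
noise_multiplier = 1.1  # illustrative placeholder; calibrated from (eps, delta) in practice

def clip(update):
    """Scale the update so its L2 norm is at most C."""
    norm = np.linalg.norm(update)
    return update * min(1.0, C / max(norm, 1e-12))

def privatize(update):
    """Clip, then add Gaussian noise with std = noise_multiplier * C (Gaussian mechanism)."""
    return clip(update) + rng.normal(0.0, noise_multiplier * C, size=update.shape)

raw = rng.normal(0.0, 5.0, size=1000)  # a raw, unbounded client update
noised = privatize(raw)                # bounded-sensitivity, noised update sent to server
```

Clipping bounds the sensitivity of each client's contribution, which is the precondition for the Gaussian mechanism's formal guarantee.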
Enhanced implementation details (highlighted, Section 4.6.4, page 15, lines 469-496): We have expanded the experimental setup section with complete technical specifications:
- Tensor layouts: Audio inputs (batch, 128, 1) representing mel-frequency coefficients; motion inputs (batch, 6, 100) representing 6-axis IMU over 100 timesteps
- Model architectures: Audio branch uses fully-connected layers (128→256→128→10) with ReLU and 0.3 dropout; motion branch uses identical architecture (600→256→128→10) after flattening
- Normalization: Z-score normalization (StandardScaler) with zero mean and unit variance, applied per-modality at each edge client
- Loss function: Standard cross-entropy for multi-class classification
- Privacy attack protocols: Complete specifications for model inversion (4-layer MLP decoder, Adam optimizer lr=0.001, 200 iterations, query budget 1000), membership inference (shadow model training, 80/20 split, 5-fold CV), and property inference (3-layer classifier on 256-dim weight statistics)
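To make the branch specification above concrete, a minimal NumPy forward pass with randomly initialized placeholder weights (not our trained model) shows the 128→256→128→10 shape flow with ReLU and inverted dropout (p=0.3):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    return x @ w + b

def relu(x):
    return np.maximum(x, 0.0)

def dropout(x, p=0.3, train=True):
    if not train:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)  # inverted dropout: rescale at train time

# Placeholder weights for the 128 -> 256 -> 128 -> 10 audio branch
sizes = [128, 256, 128, 10]
params = [(rng.normal(0, 0.05, (m, n)), np.zeros(n)) for m, n in zip(sizes, sizes[1:])]

def audio_branch(x, train=False):
    h = x
    for i, (w, b) in enumerate(params):
        h = dense(h, w, b)
        if i < len(params) - 1:      # ReLU + dropout on hidden layers only
            h = dropout(relu(h), train=train)
    return h

batch = rng.normal(size=(4, 128))    # (batch, 128) flattened mel features
logits = audio_branch(batch)         # (batch, 10) class logits
```

The motion branch follows the same pattern after flattening the 6×100 IMU window to a 600-dimensional input.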
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript addresses a timely and practically important challenge in IoMT by unifying non-IID data, missing modalities, privacy, and communication/latency constraints within a single end-to-end federated learning framework. The design is coherent and reasonable—combining gated attention–based multimodal fusion, Shapley-guided modality selection, top-k sparsification, robust aggregation, and lightweight personalization—to balance accuracy, robustness, and efficiency. Empirical results demonstrate competitive accuracy in very few rounds alongside substantial communication savings with minimal privacy-induced utility loss, and the ablations help clarify each component's contribution. The writing and the organization are also clear. However, the following points should be considered in the revision.
- Disclose full threat models and protocols for model inversion, membership, and property inference (attacker access, query budgets, model architectures, seeds). Current deltas (e.g., 87%→12%) are impressive but under-specified.
- Ensure baselines share matched training/communication budgets (rounds×epochs×bytes) and report tuned hyperparameters for FedProx/FedNova/DP-FedAvg/SMPC. Add round-normalized comparisons and class-level error analyses versus the centralized 95% upper bound.
- The results are limited to EarSAVAS vocal activity in eight classes. Enhance the client-level non-IID analysis by including at least one more IoMT task/domain or cross-dataset test (e.g., performance vs. client size/heterogeneity).
- Specify tensor layouts, sequence lengths, normalization, and focal-loss settings.
- For Trimmed Mean (β=0.1), state assumed Byzantine rates and add stress tests.
Author Response
Comments 1: Disclose full threat models and protocols for model inversion, membership, and property inference (attacker access, query budgets, model architectures, seeds). Current deltas (e.g., 87%→12%) are impressive but under-specified.
Response 1: Thank you for this critical observation. We completely agree that privacy evaluation protocols must be fully specified for reproducibility. We have substantially revised the manuscript to include comprehensive threat model specifications and statistical validation.
Revisions made:
1. Added Section 4.6.3 "Privacy Attack Evaluation Protocol" (**highlighted**, page 14-15, lines 453-468) containing complete attack specifications: (1) Model Inversion - 4-layer MLP decoder [512→256→128→20,096], 1000 query budget, SSIM>0.5 metric; (2) Membership Inference - shadow model with black-box access, gradient norm/loss/confidence features; (3) Property Inference - weight-based classifier, 256-dim statistics. All attacks use seeds (42, 123, 456, 789, 2024) for 5 independent runs.
2. Updated Table 5 and Section 5.1.5 (page 17, lines 544-553) to report mean ± std: Model Inversion 87±3.2% → 12±2.1%, Membership Inference 78±2.8% → 52±1.9%, Property Inference 73±4.1% → 18±3.3%. Added attack configurations in the table header and statistical significance testing (p<0.001, two-tailed t-tests).
Comments 2: Ensure baselines share matched training/communication budgets (rounds×epochs×bytes) and report tuned hyperparameters for FedProx/FedNova/DP-FedAvg/SMPC. Add round-normalized comparisons and class-level error analyses versus the centralized 95% upper bound.
Response 2: We have comprehensively addressed all requested comparisons:
- Training Budget Normalization (Table 3, page 16, lines 512-523): Added comprehensive table showing rounds, epochs per round, total sample-epochs processed, communication volume, and final accuracy. E-DMFL processes 30,240 sample-epochs across 6 rounds compared to FedAvg's 126,000 sample-epochs over 25 rounds, demonstrating true efficiency gains independent of early stopping.
- Hyperparameter Tuning Details (**highlighted**, Section 4.6.1, page 14, lines 438-441): Specified tuning protocols for all baselines - FedProx μ via grid search {0.001, 0.01, 0.1, 1.0} selecting μ=0.01 by validation accuracy; FedNova with normalized τ_eff; DP-FedAvg matching E-DMFL privacy budget (ε=1.0, δ=10⁻⁵, C=1.0); SMPC-FL using Shamir secret sharing (t=21, n=42).
- Round-Normalized Analysis (Section 5.1.2, page 16, lines 514-519): Added explicit per-round efficiency comparison showing E-DMFL achieves 15.3% accuracy gain per round (92.0%/6) vs. FedAvg's 3.6% per round (91.0%/25), quantifying 4.2× convergence speedup independent of stopping criteria.
- Class-Level Error Analysis (Figure 2, page 18, Section 5.1.7, lines 564-575): Added detailed per-class performance breakdown across all 10 vocal activity categories compared to centralized upper bound (95%). Analysis reveals E-DMFL achieves perfect classification on 3 classes (ambient noise, speech, sigh) while maintaining >80% accuracy on challenging confused classes (continuous cough: 83.3%, sniffing: 83.3%).
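The round-normalized figures quoted above reduce to simple arithmetic over the reported accuracies and round counts:

```python
# Reported final accuracies (%) and communication rounds (Table 3 / Section 5.1.2)
edmfl_acc, edmfl_rounds = 92.0, 6
fedavg_acc, fedavg_rounds = 91.0, 25

edmfl_per_round = edmfl_acc / edmfl_rounds    # accuracy points gained per round
fedavg_per_round = fedavg_acc / fedavg_rounds
speedup = edmfl_per_round / fedavg_per_round  # convergence speedup factor
```

This normalization makes the comparison independent of each method's stopping criterion.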
Comments 3: The results are limited to EarSAVAS vocal activity in eight classes. Enhance the client-level non-IID analysis by including at least one more IoMT task/domain or cross-dataset test (e.g., performance vs. client size/heterogeneity).
Response 3: We acknowledge the single-domain evaluation limitation and have addressed this through enhanced non-IID analysis and transparent discussion of generalization scope:
- Client-level heterogeneity analysis added (**highlighted**, Section 5.1.3, page 16, lines 524-533): We conducted comprehensive non-IID characterization across all 42 EarSAVAS clients measuring:
- Class entropy: 2.80-2.98 bits (mean: 2.89±0.05), indicating balanced class distributions across clients
- KL divergence from global distribution: 0.016-0.216 (mean: 0.09±0.05), demonstrating mild to moderate non-IID conditions typical of real-world edge deployments
- All clients maintained 50 samples each with consistent collection protocols
This analysis characterizes the non-IID severity (moderate heterogeneity, KL=0.09±0.05) under which E-DMFL achieves 92.0% accuracy, providing quantified context for the experimental results.
- Limitations and future work (Section 6.1, pages 23-24, lines 711-726): We explicitly acknowledge the single-domain evaluation as a limitation, noting that EarSAVAS's 2-second windows and moderate heterogeneity may not reflect tasks with longer temporal dependencies or extreme non-IID conditions (KL>0.5). We identify cross-domain validation on specific public IoMT datasets (MIT-BIH cardiac monitoring, FARSEEING fall detection, OhioT1DM glucose prediction) as a concrete future direction to establish generalization across temporal scales and modality correlations.
- Rationale for single-domain focus: Multi-domain evaluation would require: (1) obtaining and preprocessing 3+ additional multimodal IoMT datasets with different sensor modalities and sampling rates; (2) adapting the architecture for different modality combinations; (3) substantially expanding experimental scope beyond a single paper's capacity. We chose depth over breadth: comprehensive evaluation on one well-characterized dataset with full privacy, communication, and robustness analysis.
The enhanced non-IID analysis provides quantified heterogeneity characterization while transparently acknowledging single-domain scope as a limitation with concrete future validation paths.
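The two heterogeneity measures used in this characterization can be computed per client as follows; the client distribution below is a synthetic illustration, not EarSAVAS data:

```python
import numpy as np

def class_entropy_bits(p):
    """Shannon entropy (bits) of a client's class distribution."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def kl_from_global(p, q):
    """KL divergence KL(p || q) in nats of a client distribution p from the global q."""
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

n_classes = 8                                 # illustrative class count
global_dist = np.full(n_classes, 1.0 / n_classes)
client = np.array([0.2] + [0.8 / 7] * 7)      # mildly skewed synthetic client

H = class_entropy_bits(client)                # near log2(8) = 3 bits when balanced
kl = kl_from_global(client, global_dist)      # small for mild non-IID skew
```

Low KL with near-maximal entropy corresponds to the "mild to moderate non-IID" regime reported above.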
Comments 4: Specify tensor layouts, sequence lengths, normalization, and focal-loss settings.
Response 4: We have added comprehensive implementation specifications (highlighted, Section 4.6.4, page 15, lines 469-496) addressing all requested details:
- Tensor layouts: Audio inputs (batch, 128, 1) representing 128 mel-frequency coefficients; Motion inputs (batch, 6, 100) representing 6-axis IMU data over 100 timesteps
- Sequence lengths: 100 timesteps at 50Hz for 2-second duration segments; audio features extracted per-frame as 128-dimensional vectors
- Normalization: Z-score normalization (StandardScaler) with zero mean and unit variance, applied independently per modality at each edge client
- Loss function: Cross-entropy loss for multi-class classification; focal loss was evaluated but not adopted as cross-entropy provided sufficient performance with simpler implementation
- Model architectures: Audio modality uses fully-connected layers (128→256→128→10) with ReLU and 0.3 dropout; Motion modality uses identical architecture (600→256→128→10) after flattening the 6×100 input
These specifications ensure complete reproducibility of all experimental results.
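The per-modality normalization step is a plain z-score, equivalent to scikit-learn's StandardScaler fitted independently on each modality at a client; the data below is synthetic:

```python
import numpy as np

def zscore(x, axis=0, eps=1e-8):
    """Z-score normalize: zero mean, unit variance per feature column."""
    mu = x.mean(axis=axis, keepdims=True)
    sd = x.std(axis=axis, keepdims=True)
    return (x - mu) / (sd + eps)

rng = np.random.default_rng(1)
audio = rng.normal(3.0, 2.0, size=(50, 128))    # (samples, mel features) at one client
motion = rng.normal(-1.0, 0.5, size=(50, 600))  # flattened 6x100 IMU windows

audio_n, motion_n = zscore(audio), zscore(motion)  # normalized independently per modality
```

Fitting the scaler per client and per modality keeps raw statistics local, consistent with the federated privacy model.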
Comments 5: For Trimmed Mean (β=0.1), state assumed Byzantine rates and add stress tests.
Response 5: We have clarified the Byzantine fault assumptions and added stress test results (highlighted, Section 4.6.1, page 14, lines 422-433). The trimmed mean parameter β = 0.1 is designed to tolerate up to 10% Byzantine clients (4 out of 42 in our deployment). This conservative threshold balances Byzantine tolerance with aggregation quality, as higher β values risk discarding legitimate heterogeneous updates.
We added stress tests evaluating performance under 0%, 5%, 10%, 15%, and 20% Byzantine attack rates with simulated malicious updates (Gaussian noise injection with μ=0, σ=5 into model weights). Results demonstrate:
(1) stable performance (>85% accuracy) within the 10% design threshold,
(2) graceful degradation at 15% (78% accuracy), and
(3) predictable breakdown at 20% (62% accuracy, beyond design threshold).
This confirms β = 0.1 provides effective protection at the design threshold while maintaining performance under realistic attack scenarios.
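For reference, the coordinate-wise trimmed mean used here can be sketched in a few lines; the client updates below are synthetic placeholders mirroring the stress-test setup (42 clients, 4 Byzantine with σ=5 noise):

```python
import numpy as np

def trimmed_mean(updates, beta=0.1):
    """Coordinate-wise trimmed mean: sort each coordinate across clients, drop the
    lowest and highest floor(beta * n) values, then average the remainder."""
    updates = np.asarray(updates)        # shape: (n_clients, n_params)
    n = updates.shape[0]
    k = int(np.floor(beta * n))          # clients trimmed from each end
    s = np.sort(updates, axis=0)
    return s[k:n - k].mean(axis=0)

rng = np.random.default_rng(7)
honest = rng.normal(0.0, 0.1, size=(38, 5))    # 38 honest client updates
malicious = rng.normal(0.0, 5.0, size=(4, 5))  # 4 Byzantine updates (sigma=5 noise)
all_updates = np.vstack([honest, malicious])

robust = trimmed_mean(all_updates, beta=0.1)   # trims 4 values from each end
naive = all_updates.mean(axis=0)               # plain FedAvg-style mean, for contrast
```

With β=0.1 and 42 clients, ⌊0.1×42⌋=4 values are trimmed from each end of every coordinate, matching the designed 10% Byzantine tolerance.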