1. Introduction
The rapid development of the Artificial Intelligence of Things (AIoT) has led to substantial advancements in human action recognition, particularly in resource-constrained environments such as smart homes, healthcare systems, and intelligent surveillance [1]. The integration of cloud–edge–terminal (CET) devices has emerged as a powerful architecture to offload computational tasks and improve data processing efficiency [2,3]. However, applying deep learning models directly to terminal devices is limited by their restricted computational power, energy resources, and storage capacities [4].
To address these constraints, the cloud–edge–end collaborative framework has been proposed, enabling each system layer to contribute to the overall learning process [5]. On the terminal side, lightweight models are deployed to allow for rapid inference and decision-making. The edge side aggregates and optimizes intermediate models, acting as a bridge between the cloud and terminal devices, while the cloud provides a centralized platform for model retraining, fine-tuning, and global optimization [6].
Distributed sensing systems for human action recognition are often affected by class imbalance, where certain action categories dominate the training data while others are underrepresented [7,8]. This imbalance can degrade the performance of traditional learning algorithms by skewing the model toward the majority classes. Various techniques, such as the synthetic minority over-sampling technique (SMOTE) [9], adaptive boosting (AdaBoost), and loss function modification methods like focal loss [10] and gradient harmonized mechanisms (GHMs), have been explored to address class imbalance. However, these approaches primarily focus on improving model robustness within a single layer of the CET framework and often overlook the collaborative nature of cloud, edge, and terminal devices. Moreover, deploying large models on terminal devices remains a challenge due to their resource limitations [11].
To overcome these challenges, we propose a novel learning framework that integrates model compression techniques—such as distillation, pruning, and quantization—to facilitate efficient knowledge transfer across cloud, edge, and terminal devices [12,13]. Our approach incorporates a monitoring mechanism to dynamically adjust the loss function based on the detected class imbalance, thereby enhancing the performance of the collaborative learning framework.
We propose a novel framework that combines few-shot learning with multimodal data fusion for robust beam selection under data-limited conditions. By leveraging complementary features from RGB images, radar signals, and LiDAR data, the model effectively captures spatiotemporal features, enabling efficient and adaptive communication.
The multimodal fusion approach increases the model’s robustness against environmental noise and variability. By combining few-shot learning with multimodal inputs, the proposed method can quickly adapt to new scenarios, maintaining high accuracy in dynamic communication environments.
By reducing dependence on large-scale labeled datasets, the proposed framework enhances computational efficiency, enabling real-time beam selection to meet the low-latency requirements of mmWave communication systems.
Notation 1. Throughout this paper, we denote the cloud and edge model parameters as $\theta_c$ and $\theta_e$, respectively. The loss function includes the classification cross-entropy loss $\mathcal{L}_{CE}$, the knowledge distillation loss $\mathcal{L}_{KD}$, and a regularization term $\mathcal{R}(\theta)$ used for model compression via pruning and quantization. The weights $\lambda$ and $\gamma$ control the importance of distillation and regularization in the joint objective. For each edge node $i$, we define resource constraints as the storage budget $S_i$, delay budget $D_i$, and communication cost limit $C_i$. Model compression parameters include the pruning ratio $\rho$ and the quantization bit-width $b$. Finally, $E$ denotes the number of participating edge devices.
2. Related Work
Cloud–edge collaborative systems have garnered increasing attention due to their ability to improve computational efficiency and reduce latency [14]. These systems rely on a hierarchical architecture in which computational tasks are distributed across cloud servers, edge devices, and end devices, enabling real-time processing and efficient resource utilization. The cloud layer typically provides extensive computational resources, making it suitable for large-scale model training, global optimization, and knowledge aggregation. In contrast, the edge layer facilitates localized data processing, model aggregation, and communication between the cloud and end devices. Finally, end devices, constrained by limited computational power, benefit from lightweight models optimized for fast inference and low power consumption.
Federated learning (FL) is a prominent paradigm for enabling collaborative learning across distributed devices while preserving data privacy by keeping raw data localized and transmitting only model updates [15]. In the FL framework, each participant trains a local model using its private data and uploads the model's gradients or parameters to a central server for aggregation. This iterative process continues until the global model converges. FL has demonstrated success in various applications such as image classification, speech recognition, and recommendation systems [16]. However, traditional FL methods often assume uniformly distributed data, which is rarely the case in real-world applications, where data heterogeneity and class imbalance are common [17]. While effective for privacy, standard FL struggles with significant data imbalance across clients and lacks inherent mechanisms for the model compression needed on resource-constrained edge and end devices.
Ficili et al. explored integration approaches for IoT, cloud/edge computing, and AI, emphasizing challenges and strategies for seamless integration of these technologies to achieve pervasive ambient intelligence [18]. In cloud–edge–end collaborative systems, class imbalance is exacerbated by the hierarchical nature of the architecture, where data collected by edge and terminal devices may be skewed toward certain categories. For instance, surveillance systems in different locations may capture human actions with varying frequencies, leading to biased training datasets. Additionally, the limited computational capacity of terminal devices prevents the direct application of complex methods [1].
To mitigate class imbalance, several methods have been proposed, including data-level approaches like over-sampling and under-sampling, and algorithm-level approaches that modify the learning process to emphasize minority classes [19]. Data-level methods balance the dataset by generating synthetic samples for underrepresented classes or reducing the number of samples from overrepresented classes. The most popular techniques in this category are SMOTE and its variants, such as Borderline-SMOTE and SMOTEBoost, which generate synthetic samples by interpolating between existing minority class samples [20]. Although effective, these methods can be computationally expensive and may introduce noise when applied to complex data distributions.
Algorithm-level methods modify the training process to address class imbalance. Ensemble-based techniques like EasyEnsemble and BalanceCascade aim to improve classification performance by combining multiple classifiers trained on rebalanced datasets [21]. However, the computational cost associated with training multiple classifiers makes these methods unsuitable for resource-constrained devices. Moreover, ensemble methods may not generalize well when deployed in real-world cloud–edge–end architectures due to the mismatch between local and global models.
Loss function modifications, such as focal loss and gradient harmonized mechanisms (GHMs), enhance the learning process by reducing the impact of easily classified samples and emphasizing hard-to-classify ones. While these approaches help with class imbalance within a single model, they do not consider the collaborative nature of cloud–edge–end systems.
Existing collaborative learning approaches for such environments also face significant limitations when deploying complex models to resource-constrained edge and terminal devices. Moreover, traditional solutions fail to address the challenges of model compression and efficient knowledge transfer, which are critical for deploying lightweight models on resource-constrained terminal devices. Techniques like knowledge distillation, pruning, and quantization have been shown to reduce model size and improve inference speed without significantly compromising performance [12,13]. However, integrating these techniques into a cloud–edge collaborative framework presents a joint optimization challenge: inference accuracy must be balanced against limited computational resources and communication bandwidth. Recent studies have further demonstrated the integration of cloud–edge architectures in intelligent systems [5,6,18]. For example, Ficili et al. [18] proposed a multi-layer AIoT architecture, while Krishnamurthy et al. [6] applied reinforcement learning for fog computing resource optimization.
Our proposed framework tackles these challenges by incorporating model compression techniques within the cloud–edge system. By addressing the integration of these techniques, the framework aims to overcome the identified gaps in existing solutions. Using knowledge distillation, pruning, and quantization, our approach facilitates efficient knowledge transfer from the cloud to edge and terminal devices, improving recognition accuracy while minimizing computational costs. Additionally, we introduce a monitoring mechanism that dynamically detects class imbalance and adjusts the loss function accordingly. The integration of these techniques provides a robust solution for real-time human action recognition in distributed environments.
3. Motivation
Human action recognition has attracted significant attention in recent years due to its wide range of applications, including surveillance, smart healthcare, human–computer interaction, and sports analysis. With the advent of Artificial Intelligence of Things (AIoT), collaborative learning frameworks involving cloud, edge, and terminal devices have been developed to enhance computational efficiency and improve recognition accuracy. However, existing methods face several critical challenges:
Resource constraints: Terminal devices often have limited computational power, memory, and battery life, making it difficult to deploy large-scale deep learning models directly on these devices.
Data distribution imbalance: Human action data collected from distributed devices is often imbalanced, with some actions being significantly under-represented. This issue severely impacts the training process, causing models to be biased toward majority classes.
Communication overhead: Collaborative learning requires frequent communication between cloud, edge, and terminal devices. High communication costs can reduce the efficiency and scalability of the framework.
To address these challenges in the context of efficient and robust human action recognition, we propose a cloud–edge collaborative learning framework that integrates model compression techniques, including knowledge distillation, pruning, and quantization. The framework aims to optimize the trade-off between accuracy, computational efficiency, and communication costs. Specifically, we formulate the optimization problem as follows:

$$\min_{\theta_e} \; \mathcal{L}(\theta_e) = \mathcal{L}_{CE}(\theta_e) + \lambda \, \mathcal{L}_{KD}(\theta_e, \theta_c) + \gamma \, \mathcal{R}(\theta_e), \quad (1)$$

where

$\mathcal{L}_{CE}$ is the cross-entropy loss, representing classification accuracy;

$\mathcal{L}_{KD}$ indicates the knowledge distillation loss, which measures the divergence between the student model (edge) and the teacher model (cloud);

$\mathcal{R}$ represents the regularization term, representing the model compression achieved by pruning and quantization;

$\lambda$ denotes the hyperparameter controlling the impact of knowledge distillation;

$\gamma$ is the hyperparameter controlling the strength of the regularization term;

$\theta_e$ and $\theta_c$ indicate the parameters of the edge model and cloud model, respectively.
The optimization goal is to minimize the overall loss by balancing classification accuracy, knowledge distillation, and model compression. The proposed approach effectively reduces computational resource consumption and communication overhead while preserving high recognition accuracy.
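To make the joint objective concrete, the following minimal PyTorch sketch combines the three terms of Equation (1). The function name and the L1 penalty standing in for the compression regularizer $\mathcal{R}(\theta)$ are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def joint_loss(student_logits, teacher_logits, labels, student_params,
               lam=0.5, gamma=1e-4, T=2.0):
    """Sketch of Equation (1): cross-entropy + distillation + regularizer.

    The L1 penalty is one plausible stand-in for the pruning/quantization
    regularizer R(theta); the real choice is framework-dependent.
    """
    # Classification term: standard cross-entropy on hard labels.
    ce = F.cross_entropy(student_logits, labels)
    # Distillation term: KL divergence between temperature-softened
    # distributions, scaled by T^2 as is conventional.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Compression term: L1 norm encourages sparse (prunable) weights.
    reg = sum(p.abs().sum() for p in student_params)
    return ce + lam * kd + gamma * reg
```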
4. System Model
4.1. Wireless Signal Model
Conventional device-free gesture recognition methods using WiFi signals typically extract broad statistical or signal-based features from channel state information (CSI). However, variations in user posture, movement direction, and complex indoor propagation can distort the extracted features, making gesture classification less robust and precise.
As shown in Figure 1, the overall process of wireless-based action detection begins with capturing incoming signals and processing the CSI to analyze the propagation environment. The refined CSI data are then used to detect and interpret user actions. Initially, a wireless receiver captures the signal and computes the CSI, reflecting how the environment influences signal propagation. Subsequent preprocessing extracts features like amplitude, phase, and time delay, which are sensitive to human movement. These features serve as inputs for classification models that associate signal patterns with specific activities, enabling applications such as smart environment interaction and human behavior monitoring. Recognition performance hinges on CSI fidelity, feature representation, and model effectiveness.
The CSI at time instance $t$ and frequency bin $f$ describes the cumulative effect of multipath propagation in indoor settings:

$$H(f, t) = e^{-j\phi(f,t)} \sum_{k=1}^{K} \alpha_k(f, t)\, e^{-j 2\pi f \tau_k(t)}, \quad (2)$$

where $K$ is the total number of propagation paths, $\alpha_k(f,t)$ and $\tau_k(t)$ denote the complex attenuation and delay of the $k$-th path, respectively, and $e^{-j\phi(f,t)}$ accounts for residual phase distortions from system imperfections like frequency offsets. Equation (2) describes the CSI as a frequency-domain representation of multipath propagation, capturing the cumulative impact of $K$ signal paths.
By representing the phase in terms of the Doppler effect, the CSI can be rewritten as:

$$H(f, t) = e^{-j\phi(f,t)} \left( H_s(f) + \sum_{k \in P_d} \alpha_k(f, t)\, e^{j 2\pi \int_{-\infty}^{t} f_{D}^{(k)}(u)\, du} \right), \quad (3)$$

where $H_s(f)$ refers to the aggregated static signal components, and $P_d$ denotes the set of dynamic paths affected by motion. Equation (3) reformulates the CSI to separate static and dynamic Doppler-influenced components.
To remove irrelevant noise and random phase drift, differential CSI between antennas on the same NIC is computed, and further denoising removes static clutter. A short-time Fourier transform (STFT) is then applied to derive energy patterns over time and Doppler frequency. Each resulting frame forms a Doppler spectrum—a matrix of size $Q \times R$, with $Q$ time samples and $R$ transmitter–receiver links.
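For illustration, the sketch below shows one plausible way to compute a differential-CSI Doppler spectrogram with NumPy and SciPy. The array layout, sampling rate, and window length are assumptions made for the example and are not specified by the paper.

```python
import numpy as np
from scipy.signal import stft

def doppler_spectrum(csi, fs=1000, nperseg=256):
    """Sketch: differential CSI between two antennas on the same NIC,
    followed by an STFT to obtain a time-Doppler energy pattern.

    csi: complex array of shape (n_antennas, n_subcarriers, n_time).
    fs:  CSI sampling rate in Hz (assumed value for illustration).
    """
    # Conjugate multiplication between antennas on the same NIC cancels
    # the random phase offsets they share.
    diff = csi[0] * np.conj(csi[1])           # (n_subcarriers, n_time)
    # Remove static clutter by subtracting the per-subcarrier mean.
    dynamic = diff - diff.mean(axis=-1, keepdims=True)
    # Two-sided STFT over time for each subcarrier, then average energy.
    f, t, Z = stft(dynamic, fs=fs, nperseg=nperseg, return_onesided=False)
    power = (np.abs(Z) ** 2).mean(axis=0)     # (n_freq_bins, n_frames)
    # Shift so negative Doppler frequencies appear below zero.
    return np.fft.fftshift(f), t, np.fft.fftshift(power, axes=0)
```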
When a person moves, the variation in position of different limbs induces distinct Doppler shifts. These shifts combine into a composite Doppler signature, reflecting not only gestures but also broader motion. However, for gesture isolation, only hand-related velocities—described by the body-centric velocity profile (BVP)—are considered. BVP can be estimated using multi-link Doppler information.
Let the position of the sender and receiver for link $j$ be denoted by $\vec{p}_t^{(j)}$ and $\vec{p}_r^{(j)}$, respectively. The velocity vector $\vec{v}$ of a moving body point at position $\vec{p}$ contributes a Doppler component $f_D^{(j)}$ on link $j$, modeled as:

$$f_D^{(j)}(\vec{v}) = \vec{a}^{(j)} \cdot \vec{v}, \quad (4)$$

where the coefficients $\vec{a}^{(j)}$ depend on the geometrical alignment of the transmitter and receiver:

$$\vec{a}^{(j)} = \kappa \left( \frac{\vec{p} - \vec{p}_t^{(j)}}{\|\vec{p} - \vec{p}_t^{(j)}\|} + \frac{\vec{p} - \vec{p}_r^{(j)}}{\|\vec{p} - \vec{p}_r^{(j)}\|} \right), \quad (5)$$

where $\kappa$ is the Doppler coefficient and $\|\cdot\|$ denotes the Euclidean norm. Equation (5) defines the Doppler shift on each link as a function of the moving body point's velocity and transceiver geometry.
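A small helper illustrating Equations (4) and (5) under assumed 2D coordinates; the positions, the Doppler coefficient $\kappa$, and the example values are hypothetical.

```python
import numpy as np

def doppler_shift(v, p, p_tx, p_rx, kappa):
    """Sketch of Equations (4)-(5): Doppler component induced on one link
    by a body point at position p moving with velocity v (2D arrays).

    kappa is the Doppler coefficient (e.g., carrier frequency over the
    speed of light; treated here as a given constant).
    """
    # Unit vectors from the body point toward transmitter and receiver.
    u_tx = (p - p_tx) / np.linalg.norm(p - p_tx)
    u_rx = (p - p_rx) / np.linalg.norm(p - p_rx)
    a = kappa * (u_tx + u_rx)   # geometry-dependent coefficients, Eq. (5)
    return float(a @ v)         # projected Doppler shift, Eq. (4)

# Example: a point midway between Tx and Rx moving at 0.5 m/s along y.
f_d = doppler_shift(np.array([0.0, 0.5]), np.array([1.0, 1.0]),
                    np.array([0.0, 0.0]), np.array([2.0, 0.0]),
                    kappa=5.8e9 / 3e8)
```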
To map the Doppler spectrum of link $j$ to motion, a projection matrix $A^{(j)}$ is introduced:

$$A^{(j)} \in \{0, 1\}^{G \times M^2}, \quad (6)$$

where $G$ is the count of Doppler bins and $M$ is a resolution parameter of the $M \times M$ velocity grid. The Doppler signature $D^{(j)}$ for link $j$ relates to the BVP $V$ via:

$$D^{(j)} = c^{(j)} A^{(j)} V, \quad (7)$$

where $c^{(j)}$ is an attenuation coefficient due to signal loss. Finally, the velocity profile $V$ is inferred by solving the following optimization:

$$\hat{V} = \arg\min_{V} \; \sum_{j=1}^{R} \ell\!\left(A^{(j)} V, D^{(j)}\right) + \eta \, \|V\|_1, \quad (8)$$

where $\ell(\cdot,\cdot)$ is a divergence metric, and the $\ell_1$-norm term encourages sparsity, controlled by parameter $\eta$.
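As a rough illustration of Equation (8), the following sketch solves the sparse recovery problem with ISTA (iterative soft-thresholding), substituting a squared-error loss for the divergence metric $\ell$ used above; the projection matrices and Doppler signatures are assumed to be given.

```python
import numpy as np

def estimate_bvp(A_list, D_list, eta=0.1, n_iters=200):
    """ISTA sketch for Eq. (8): min_V sum_j ||A_j V - D_j||^2 + eta*||V||_1.

    A_list: list of (G, M*M) binary projection matrices, one per link.
    D_list: list of (G,) Doppler signatures.
    Note: squared error stands in for the divergence metric in the paper.
    """
    M2 = A_list[0].shape[1]
    V = np.zeros(M2)
    # Step size from a bound on the Lipschitz constant of the gradient.
    L = 2 * sum(np.linalg.norm(A, 2) ** 2 for A in A_list)
    for _ in range(n_iters):
        grad = sum(2 * A.T @ (A @ V - D) for A, D in zip(A_list, D_list))
        Z = V - grad / L
        # Soft-thresholding enforces the l1 sparsity penalty.
        V = np.sign(Z) * np.maximum(np.abs(Z) - eta / L, 0.0)
    M = int(np.sqrt(M2))
    return V.reshape(M, M)
```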
4.2. Scenario Modeling
To support scalable, low-latency, and accurate human action recognition, we propose a two-tier hierarchical collaborative framework consisting of cloud servers and distributed edge devices. In this architecture, the edge nodes represent on-site devices, while the cloud provides centralized computation and long-term knowledge storage. This architecture enables efficient division of labor, where the cloud handles global model optimization and the edge performs real-time inference and localized updates.
Collaborative Learning and Optimization Framework
Let $\theta_c$ denote the global model at the cloud, and let $\theta_e^i$ be the local model running on the $i$-th edge device.
On the cloud side, the cloud aggregates information from multiple edge nodes to perform global training and refinement. It serves as the knowledge source in a distillation framework and generates model updates for deployment. The loss function at the cloud includes both prediction accuracy and regularization for global consistency:

$$\mathcal{L}_{cloud} = \sum_{i=1}^{E} w_i \, \mathcal{L}_{KD}\!\left(\theta_c, \theta_e^i\right) + \gamma \, \mathcal{R}(\theta_c), \quad (9)$$

where $\mathcal{L}_{KD}$ is the knowledge distillation loss, $\mathcal{R}$ is a regularization term, $E$ is the number of edge nodes, and $w_i$ reflects the trust level or sample volume of edge node $i$. Equation (9) defines the cloud-side loss function, which combines knowledge distillation and regularization.
On the edge side, each edge node performs localized training and inference using data collected on-site. To reduce computation and communication costs, edge models are optimized with lightweight constraints via pruning and quantization. The local training objective is:

$$\mathcal{L}_{edge}^{i} = \mathcal{L}_{CE}\!\left(\theta_e^i\right) + \lambda \, \mathcal{L}_{KD}\!\left(\theta_e^i, \theta_c\right) + \gamma \, \mathcal{R}\!\left(\theta_e^i\right), \quad (10)$$

where $\mathcal{L}_{CE}$ is the cross-entropy loss, and $\lambda$ and $\gamma$ are weighting factors for knowledge distillation and compression regularization, respectively.
Edge models are further compressed to reduce latency and memory requirements:

$$\theta_e^i = Q_b\!\left(P_{\rho_i}\!\left(\theta_c\right)\right), \quad (11)$$

where $P$ and $Q$ are the pruning and quantization operators, $\rho_i$ denotes the pruning ratio for node $i$, and $b$ is the quantization bit-width.
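A minimal sketch of Equation (11), instantiating $P_{\rho}$ as global magnitude pruning and $Q_b$ as uniform symmetric (fake) quantization; these operator choices are plausible assumptions rather than the exact definitions used in the framework.

```python
import torch

def compress(state_dict, prune_ratio=0.5, bits=8):
    """Sketch of Eq. (11): theta_e = Q_b(P_rho(theta_c)).

    Magnitude pruning zeroes the smallest prune_ratio fraction of weights;
    uniform symmetric quantization snaps survivors to a bits-wide grid.
    """
    compressed = {}
    for name, w in state_dict.items():
        w = w.float()
        # P_rho: zero out the smallest weights by absolute value.
        k = int(w.numel() * prune_ratio)
        if k > 0:
            threshold = w.abs().flatten().kthvalue(k).values
            w = torch.where(w.abs() > threshold, w, torch.zeros_like(w))
        # Q_b: scale to the signed integer grid and back (fake quantization).
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-12) / qmax
        compressed[name] = torch.round(w / scale) * scale
    return compressed
```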
The total optimization problem of the cloud–edge system is defined as:

$$\min_{\theta_c, \{\theta_e^i\}} \; \mathcal{L}_{cloud} + \sum_{i=1}^{E} \mathcal{L}_{edge}^{i} \quad \text{s.t.} \quad \mathrm{Storage}(\theta_e^i) \le S_i, \;\; \mathrm{Delay}(\theta_e^i) \le D_i, \;\; \mathrm{Comm}(\theta_e^i) \le C_i, \quad (12)$$

where $S_i$, $D_i$, and $C_i$ represent the storage, real-time delay, and communication budget constraints for edge device $i$. In our experiments, we consider a range of $E$ values, representing small- to moderately sized deployment scenarios, to evaluate the scalability and effectiveness of the proposed optimization framework.
In the proposed cloud–edge collaborative framework, the model lifecycle begins with the cloud training a global model $\theta_c$ using a large-scale labeled dataset or through self-supervised pretraining. This global model serves as a knowledge-rich teacher model, which is subsequently compressed via pruning and quantization to generate lightweight student models for deployment on edge devices. After deployment, each edge model performs localized training using on-device data to adapt to its specific environment while preserving generalization capabilities through knowledge distillation from the cloud model. Periodically, edge devices transmit local model updates—either in the form of gradients or entire model parameters—back to the cloud, where they are aggregated and used to refine the global model $\theta_c$. The updated global model is then re-distilled and redistributed to edge devices, enabling continual learning and adaptation across the system. This cyclical update mechanism ensures that the global model maintains robustness and adaptability, while edge models remain lightweight and responsive to local variations in human activity patterns.
This scenario modeling approach ensures that the edge devices remain lightweight and adaptive, while the cloud continues to learn a more generalized model. The use of joint loss optimization, compression-aware training, and knowledge distillation forms the core of robust cloud–edge collaboration for human action recognition.
To better accommodate real-world deployment scenarios, future versions of the optimization framework can be extended in two important directions. First, uncertainty in system parameters—such as fluctuating network latency, device availability, or energy status—can be modeled using robust or fuzzy optimization techniques. This would improve the resilience of the learning process to dynamic environments. Second, a multi-objective formulation can be introduced to simultaneously optimize recognition accuracy, communication efficiency, and energy consumption. Pareto-based approaches or scalarization methods could be used to explore trade-offs among these objectives and guide the system toward balanced performance under competing resource constraints.
4.3. Proposed Architecture
The proposed cloud–edge collaborative learning architecture is designed to achieve accurate, efficient, and low-latency human action recognition by integrating model compression, knowledge distillation, and federated optimization into a unified hierarchical framework. At the core of the system lies a global model $\theta_c$ trained on the cloud using large-scale datasets. Once trained, this model undergoes structured model compression techniques—including network pruning and parameter quantization—to generate lightweight variants that are suitable for deployment on edge devices. Mathematically, for a given compression ratio $\rho$ and quantization level $b$, the compressed edge model $\theta_e^i$ is obtained as:

$$\theta_e^i = Q_b\!\left(P_{\rho_i}\!\left(\theta_c\right)\right), \quad i = 1, \ldots, E, \quad (13)$$

where $E$ is the total number of edge devices and $\rho_i$ is the pruning ratio tailored to edge node $i$'s resource constraints.
To preserve the representational power of the original cloud model within these compressed variants, knowledge distillation is employed during deployment. The cloud model $\theta_c$ serves as a teacher, and each edge model $\theta_e^i$ acts as a student. For a given input $x$ and soft label distribution $\sigma(z_c(x)/T)$, the student is trained to minimize the divergence between its output $\sigma(z_e^i(x)/T)$ and the teacher's output using the distillation loss:

$$\mathcal{L}_{KD} = T^2 \, \mathrm{KL}\!\left( \sigma\!\left(\frac{z_c(x)}{T}\right) \Big\| \; \sigma\!\left(\frac{z_e^i(x)}{T}\right) \right), \quad (14)$$

where $\sigma(\cdot)$ is the softmax function, $z_c$ and $z_e^i$ are the teacher and student logits, and $T$ is the temperature parameter controlling the softness of predicted distributions. This allows the student model to learn finer-grained decision boundaries and internal representations, even under tight parameter budgets.
The architecture also integrates a federated optimization strategy to support decentralized learning and privacy preservation. During system operation, each edge device updates its local model $\theta_e^i$ using real-time, device-specific data. These updates are periodically transmitted to the cloud for aggregation into the global model $\theta_c$. The global model is then refined using weighted averaging:

$$\theta_c^{t+1} = \sum_{i=1}^{E} \frac{n_i}{\sum_{j=1}^{E} n_j} \, \theta_e^{i, t}, \quad (15)$$

where $n_i$ is the size of the local dataset on edge device $i$, and $\theta_e^{i,t}$ is the local model parameter at training round $t$. After aggregation, the updated global model is re-distilled and re-compressed, and the process repeats, forming a closed-loop iterative training cycle that adapts to environmental dynamics and usage patterns.
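A minimal sketch of the weighted aggregation in Equation (15), assuming the edge models share a common architecture and arrive as PyTorch state dicts.

```python
import torch

def fedavg(edge_states, sample_counts):
    """Sketch of Eq. (15): dataset-size-weighted parameter averaging.

    edge_states:   list of state_dicts, one per edge device.
    sample_counts: list of local dataset sizes n_i.
    """
    total = float(sum(sample_counts))
    global_state = {}
    for name in edge_states[0]:
        # Weighted sum of each parameter tensor across edge devices.
        global_state[name] = sum(
            (n / total) * state[name].float()
            for state, n in zip(edge_states, sample_counts)
        )
    return global_state
```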
The resulting architecture not only ensures model compactness and inference efficiency at the edge but also maintains global knowledge consistency and robustness through periodic cloud-level aggregation and distillation. This enables the scalable deployment of intelligent sensing systems across heterogeneous environments with diverse computational and communication resources.
4.4. Algorithm Training Procedure
The training procedure of the proposed cloud–edge collaborative framework follows a cyclic optimization process that balances centralized intelligence with distributed adaptability. The process begins with the cloud server initializing the global model $\theta_c$ through pretraining on a large labeled dataset or synthetic data generation. This global model is then compressed via pruning and quantization to generate edge-deployable variants $\theta_e^i$, which are dispatched to distributed edge nodes. Each edge device performs local training using real-time or user-specific data, adjusting its model parameters via standard gradient descent or local SGD. This local training incorporates both supervised learning and knowledge distillation from the cloud model, ensuring that the edge models preserve the global feature abstraction capacity despite local task heterogeneity.
After a defined number of local update steps, each edge device uploads either the full model or its parameter gradients back to the cloud. The cloud server aggregates these edge models through a federated averaging approach, optionally applying regularization and global constraints. The updated global model is then distilled and compressed again for the next iteration of deployment. This cyclic pipeline ensures continual learning, personalized adaptation at the edge, and model robustness across heterogeneous environments.
The training process is structured into four key stages: cloud-side compression, edge deployment and local training, parameter aggregation, and global model refinement. These steps are detailed below in Algorithm 1.
Algorithm 1: Cloud–Edge Collaborative Training Procedure.
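Since the pseudocode figure is not reproduced here, the following Python sketch outlines the four stages of Algorithm 1; the data loaders and the hypothetical `compress`, `fedavg`, and `joint_loss` helpers from the earlier sketches are assumed.

```python
import copy
import torch

def collaborative_training(cloud_model, edge_loaders, rounds=10,
                           local_epochs=50, lam=0.5, gamma=1e-4):
    """Sketch of Algorithm 1: cloud-side compression, edge deployment and
    local training, parameter aggregation, and global model refinement,
    repeated over a number of communication rounds.
    """
    cloud_model.eval()
    for _ in range(rounds):
        # Stage 1: cloud-side compression of the current global model.
        edge_states = [compress(cloud_model.state_dict(), prune_ratio=0.5)
                       for _ in edge_loaders]
        # Stage 2: edge deployment and local training with distillation.
        updated, counts = [], []
        for state, loader in zip(edge_states, edge_loaders):
            student = copy.deepcopy(cloud_model)
            student.load_state_dict(state)
            opt = torch.optim.SGD(student.parameters(), lr=1e-3)
            for _ in range(local_epochs):
                for x, y in loader:
                    with torch.no_grad():
                        teacher_logits = cloud_model(x)  # frozen teacher
                    loss = joint_loss(student(x), teacher_logits, y,
                                      student.parameters(), lam, gamma)
                    opt.zero_grad()
                    loss.backward()
                    opt.step()
            updated.append(student.state_dict())
            counts.append(len(loader.dataset))
        # Stages 3-4: parameter aggregation and global model refinement.
        cloud_model.load_state_dict(fedavg(updated, counts))
    return cloud_model
```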
5. Experiment Results
5.1. Experimental Setup
To assess the performance of the proposed training framework under cloud–edge conditions, we conducted a series of experiments using the Widar3.0 dataset [1], which includes six types of hand or body gestures: Push&Pull, Sweep, Clap, Slide, Draw-Zigzag, and Draw-N. The Widar3.0 dataset is a publicly available benchmark commonly used for WiFi-based gesture and action recognition tasks. It has been validated in several peer-reviewed studies such as [1], making it a reliable foundation for our experiments. The full dataset contains 7498 labeled instances, which were randomly divided into training and testing partitions with a ratio of 8:2. This results in 6000 samples allocated for model training and the remaining 1498 samples allocated for testing.
Because the collection duration and signal length may vary across samples, the raw BVP data extracted from WiFi signals exhibit inconsistent temporal dimensions. To address this, each BVP instance is zero-padded and reshaped into a standardized 3D tensor of fixed dimensions, thereby enabling uniform processing across the entire dataset.
Prior to training, all BVP tensors $v$ are normalized on a per-sample basis using min-max scaling, ensuring that the input values fall within the interval $[0, 1]$. For each communication round, every edge device performs 50 epochs of local training using its available data, with the batch size fixed at 256. All experiments are implemented in PyTorch 2.6 and executed on a computing platform equipped with an NVIDIA GeForce RTX 4090 GPU. Details regarding the dataset composition, preprocessing steps, and hyperparameter settings, including the initial learning rate, are summarized in Table 1.
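For illustration, a sketch of the per-sample preprocessing described above; the standardized temporal length `T_max` is a hypothetical placeholder for the dimension listed in Table 1.

```python
import numpy as np

def preprocess_bvp(bvp, T_max=40):
    """Sketch: zero-pad a variable-length BVP sequence to a fixed temporal
    length, then apply per-sample min-max scaling into [0, 1].

    bvp: array of shape (M, M, T) with T varying across samples.
    T_max: assumed standardized temporal length (see Table 1).
    """
    M, _, T = bvp.shape
    # Zero-pad (or truncate) the temporal axis to the standard length.
    out = np.zeros((M, M, T_max), dtype=np.float32)
    out[:, :, :min(T, T_max)] = bvp[:, :, :T_max]
    # Per-sample min-max normalization into the interval [0, 1].
    lo, hi = out.min(), out.max()
    if hi > lo:
        out = (out - lo) / (hi - lo)
    return out
```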
5.2. Experimental Results
The parameter $\lambda$ controls the influence of the teacher model during distillation. As shown in Figure 2, increasing $\lambda$ from 0.1 to 0.5 improves accuracy by encouraging the edge models to better mimic the cloud (teacher) model. Beyond 0.5, the accuracy plateaus and slightly decreases, indicating that an over-reliance on the teacher model can reduce the edge model's ability to adapt to local data nuances. Therefore, $\lambda = 0.5$ is an optimal trade-off point balancing generalization and local adaptation.
To thoroughly evaluate the effectiveness of the proposed cloud–edge collaborative learning framework, a series of experiments were conducted. The evaluation focused on multiple aspects, including recognition accuracy, model robustness, compression efficiency, knowledge distillation performance, and communication overhead. The proposed method was benchmarked against several competitive baselines, including SMOTEBoost, EasyEnsemble, BalanceCascade, Focal Loss, and the GHM. Classification accuracy served as the primary performance metric.
The classification performance across different learning approaches is evaluated. As illustrated in Figure 3, the proposed framework achieved an accuracy of 89.7%, outperforming all baselines. Traditional ensemble methods such as EasyEnsemble and BalanceCascade demonstrated moderate effectiveness due to their ability to handle class imbalance, while SMOTEBoost and Focal Loss partially mitigated data skew. However, their performance was constrained by the absence of global coordination and compression-aware optimization. The superior accuracy of the proposed approach is attributed to its hierarchical optimization scheme, integrated model compression, and knowledge distillation components, which enhance generalization under imbalanced data distributions.
We also explored how the number of participating edge devices influences recognition performance. As shown in Figure 4, increasing the number of edge nodes resulted in consistent performance improvements for both the proposed method and the baseline. However, the proposed framework exhibited a steeper accuracy gain, reaching 89.0% with 10 participating devices, compared with 85.6% for the baseline. This suggests that the proposed method better exploits distributed data diversity, largely due to its federated aggregation mechanism and consistent knowledge transfer from the cloud model.
Figure 5 illustrates the effect of varying compression ratios—defined as the proportion of model parameters retained after pruning and quantization—on classification performance. While aggressive compression led to significant performance degradation in conventional models, the proposed framework maintained a high accuracy range of 81.5–89.0% across all compression levels. This robustness is attributed to regularization-aware training and cloud-guided distillation, which preserve critical model features even under severe parameter reductions. The results affirm the feasibility of deploying lightweight models on edge devices without compromising accuracy.
Additionally, the impact of knowledge distillation on model performance was assessed. As depicted in Figure 6, edge models trained without distillation achieved an accuracy of only 84.2%, whereas incorporating distillation from the cloud model improved performance to 88.6%. This 4.4% gain underscores the critical role of the teacher–student paradigm in enabling compact edge models to approximate the decision boundaries of more complex cloud models, thereby enhancing inference quality under resource constraints.
The final experiment compared the communication cost of the proposed method with that of standard federated learning frameworks. As shown in Figure 7, the proposed approach significantly reduced the communication overhead per round from 100 MB to 57 MB—a 43% reduction. This improvement results from selective update transmission, intermediate feature compression, and parameter sparsification. These results demonstrate the practicality of the proposed method in bandwidth-constrained or latency-sensitive environments such as IoT systems and mobile edge networks.
Overall, the experimental results convincingly validate the effectiveness of the proposed cloud–edge collaborative learning framework. It consistently outperforms existing methods in terms of accuracy and robustness while also offering efficient compression, effective knowledge distillation, and reduced communication cost. The joint design of compression-aware training and cloud-assisted optimization provides a promising solution for real-time human activity recognition in edge-intelligent applications.
6. Conclusions
This paper proposes a cloud–edge collaborative learning framework for human action recognition that effectively addresses challenges related to computational efficiency, model accuracy, and communication overhead. By integrating model compression techniques—namely knowledge distillation, pruning, and quantization—the framework achieves substantial reductions in resource consumption while maintaining high recognition accuracy. Experimental results demonstrate that the proposed method attains an accuracy of 89.7%, surpassing baseline approaches by up to 4.4%. Furthermore, communication costs per round are reduced by 43%, decreasing from 100 MB to 57 MB. The framework exhibits robust performance in scenarios involving up to 10 edge devices, with consistent accuracy improvements observed as the number of devices increases. These findings underscore the system's effectiveness in enabling real-time human action recognition with minimal computational demands, rendering it well suited for deployment in bandwidth-constrained and resource-limited settings.
Nonetheless, these promising results motivate further investigation into large-scale and dynamic deployment scenarios. Although the framework demonstrates robustness with up to 10 edge nodes, scaling to larger or highly heterogeneous systems may introduce new challenges: increasing the number of nodes can heighten communication synchronization overhead and edge-to-cloud latency, potentially degrading performance, and heterogeneity in hardware capabilities and data distributions across edge devices may complicate maintaining model consistency and fairness.
Despite its advantages, the proposed framework has several limitations. First, it assumes stable cloud–edge connectivity, which may not be guaranteed in mobile or low-connectivity environments. Second, while model compression preserves most of the accuracy, aggressive pruning or quantization can still cause noticeable performance degradation. Finally, the current evaluation is confined to WiFi-based action recognition; thus, future work should validate the framework across diverse data modalities and broader task domains to assess its generalizability. Expanding evaluations to larger-scale, more dynamic deployments will be essential to address challenges arising from varying data types and network conditions.