This section describes the experimental setup and environment for evaluating the MuST-Net-based HAR system. We present classification accuracy for the network alongside real-time detection results for the integrated system.
4.1. Experimental Setup
The experimental evaluation consists of two main parts. First, the classification performance of MuST-Net is assessed on RDM input segments. Second, the accuracy of activity detection counts derived from the MuST-Net frame-wise prediction stream is evaluated.
For the first experiment, the data collection environment and the preparation of the dataset for training and testing are explained in Section 2. The MuST-Net model is optimized with a stochastic gradient descent optimizer and a weighted cross-entropy loss, and is trained on an NVIDIA GeForce RTX 2080Ti GPU workstation. All experiments were executed using the PyTorch 2.5.1 framework.
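For illustration, the optimization setup described above can be sketched in PyTorch as follows. The model stub, class-weight values, learning rate, and momentum are placeholders for exposition, not the actual MuST-Net architecture or the paper's training hyperparameters:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for MuST-Net; the real architecture is defined elsewhere.
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 7))

# Illustrative per-class weights: weighted cross-entropy counteracts
# class imbalance by scaling each class's loss contribution.
class_weights = torch.tensor([1.0, 1.0, 1.2, 1.2, 0.8, 0.8, 1.5])
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

def train_step(rdm_batch, labels):
    """One SGD update on a batch of RDM segments."""
    optimizer.zero_grad()
    logits = model(rdm_batch)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: a batch of 8 dummy 64x64 RDM frames with 7 activity classes.
loss = train_step(torch.randn(8, 64, 64), torch.randint(0, 7, (8,)))
```

The weight vector is the only change relative to a plain cross-entropy setup; all other components are standard SGD training.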
For the second experiment, the proposed activity detection algorithm is applied to the MuST-Net per-frame prediction stream given an online RDM input. Nine participants performed defined indoor activities, along with occasional undefined motions, in trials of approximately 5 min each; the number of trials per subject varied by availability, yielding a total duration of 7972 s across three environmental sites to ensure generalization. Sites A and B are laboratory spaces, while Site C is a lecture room, providing a heterogeneous set of indoor environments. Across the three sites, the total floor area ranges from approximately 19.7 m² to 27.4 m², with internal obstacles (e.g., desks, chairs, shelves, and equipment) occupying approximately 5.3 m² to 6.6 m² per site; the placement of these obstacles differs across sites. The MuST-Net model used in all evaluations was trained solely on Site A data, making Sites B and C fully unseen test environments.
Table 3 summarizes the activity sequences performed by the participants.
All human subjects participated in the experiments voluntarily and provided informed consent prior to data collection.
4.3. Online Activity Detection Results
This section comprehensively analyzes the recognition performance of the proposed online HAR system. In particular, to secure intra-class variability in static postures such as sit and stand, experiments were conducted with subjects at various orientations and positions. Performance evaluation was conducted on nine subjects (S1–S9) across three distinct indoor environments (Sites A, B, and C); detailed subject and dataset information is summarized in Table 3. Precision, recall, and F1 score were used as evaluation metrics.
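The three metrics above are computed per class in the usual one-vs-rest manner; a minimal, self-contained sketch (with a toy label sequence, not the paper's data) is:

```python
def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 for one activity class (one-vs-rest)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: one missed "sit" frame lowers recall but not precision.
y_true = ["sit", "sit", "stand", "stand", "sit"]
y_pred = ["sit", "stand", "stand", "stand", "sit"]
p, r, f1 = per_class_prf(y_true, y_pred, "sit")  # p = 1.0, r = 2/3, f1 = 0.8
```

Site-level and subject-level scores reported below are averages of these per-class values over the corresponding subsets of frames.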
To evaluate the environmental generalizability of the proposed HAR system, experiments were conducted at Sites A, B, and C, all satisfying the indoor environmental conditions set in Section 2.1. Importantly, MuST-Net was trained exclusively on Site A data, making Sites B and C fully unseen test environments for assessing cross-environment generalization. The overall performance of the proposed system is summarized in Table 6. The system achieved a high accuracy of at least 0.97 across all classes, and for fall detection the overall recall reached a perfect value of 1.00, demonstrating the reliability of the system for safety-critical events. Most activities achieved an F1 score of 0.95 or higher, while the sit and stand classes recorded relatively lower F1 scores.
The average F1 score by site is shown in Figure 11. The proposed model maintained high F1 scores across all three sites, although each site exhibited a distinct performance profile. Although Site A recorded a marginally higher overall F1 score than Sites B and C, the performance at Sites B and C, which were entirely unseen during training, remained at a comparable level, confirming that the system generalizes reliably across different indoor environments.
Looking at the activity-wise analysis in Figure 11, the lowest per-site F1 score varies with the environment. At Sites A and B, sit exhibited the lowest F1 score, and stand showed the lowest precision, indicating a relatively high false-positive rate at these two sites. This is attributed to the stand action having a shorter signal duration than the other dynamic actions, so that undefined microscopic movements between actions are mistaken for stand. In contrast, fewer false detections were observed for sit at these two sites, owing to its relatively longer duration, consistent with previous biomechanical findings that sit-to-stand and stand-to-sit transitions involve an extended deceleration phase for postural stability [25]. At the unseen Site C, the error pattern shifted: the precision of sit and stand rose to 1.00, and their F1 scores reached 0.96 and 0.97, respectively, while squat instead became the weakest class due to a recall drop to 0.79. This site-specific behavior suggests that the distinct layout and subject motion patterns at Site C altered the rhythm of composite squat motions, leading to occasionally missed detections. Nevertheless, all activities at Site C retained an accuracy of 0.92 or higher, preserving the overall reliability of the system even in this previously unseen environment.
Analyzing the effect of individual subject characteristics on performance (Figure 12), class-specific differences were identified, while a practical level of recognition performance was maintained for all nine subjects. The walkA and fall classes recorded the highest subject-averaged F1 scores, followed by walkT and squat, while stand and sit showed slightly lower averages. Notably, subjects S8 and S9, who were evaluated at Site C, an environment unseen during training, achieved F1 scores comparable to the overall average across most activity classes, providing evidence of the system’s cross-environment and cross-subject generalization capability.
Despite the overall good average performance, noticeable decreases in F1 score were observed for specific subject–activity combinations. The most notable drop was S6’s sit F1 score of 0.80, which largely contributed to the lower overall sit F1 score observed at Site B. Additional notable drops include S4’s fall, S9’s squat, S3’s walkT, S7’s sit, S5’s stand, and S8’s fall. For S4’s fall case, the F1 reduction reflects subject-specific false positives (other actions occasionally classified as fall) rather than missed detections, since the overall fall recall remained 1.00, preserving the safety-critical performance of the system. Likewise, S9’s lower squat F1 score corresponds to the recall drop observed at Site C, and S8’s lower fall F1 score reflects the slight precision drop for fall at the same unseen site. These performance fluctuations are due to inter-subject variability: each subject’s walking speed, body tilt when seated, and behavioral habits introduce variations in the reflected radar signals. Nevertheless, the consistently high F1 scores maintained across all classes and subjects, including those at the unseen Sites B and C, demonstrate that the activity-buffer mechanism robustly handles both inter-subject motion variability and cross-environment distribution shifts.
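The activity-buffer mechanism can be understood as majority voting over a sliding window of per-frame predictions, with an activity event confirmed only once enough frames agree. The following is a minimal sketch of that idea; the window length, vote threshold, and one-event-per-transition counting rule are illustrative assumptions, not the system's actual parameters:

```python
from collections import Counter, deque

class ActivityBuffer:
    """Smooth per-frame class predictions and count confirmed activity events.

    window: number of recent frames voted over (illustrative value).
    min_votes: agreeing frames required before an activity is confirmed.
    """
    def __init__(self, window=10, min_votes=7):
        self.buf = deque(maxlen=window)
        self.min_votes = min_votes
        self.current = None          # last confirmed activity
        self.counts = Counter()      # confirmed events per class

    def push(self, frame_label):
        """Ingest one frame-level prediction; return the confirmed activity."""
        self.buf.append(frame_label)
        label, votes = Counter(self.buf).most_common(1)[0]
        if votes >= self.min_votes and label != self.current:
            self.current = label
            self.counts[label] += 1  # one event per confirmed transition
        return self.current

# Stand -> squat -> stand, with each posture held for several frames.
buf = ActivityBuffer(window=5, min_votes=4)
for frame in ["stand"] * 6 + ["squat"] * 6 + ["stand"] * 6:
    buf.push(frame)
# buf.counts: stand confirmed twice, squat once
```

Smoothing of this kind suppresses the single-frame misclassifications that inter-subject signal variability produces, at the cost of a short confirmation latency proportional to the window length.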
4.4. Real-Time Indoor Behavior-Logging Application
To validate the practical utility of the proposed online activity detection algorithm, we developed a smartphone-based indoor behavior-logging application and conducted real-world experiments. The application employs the best-performing MuST-Net model deployed in a real-time system, allowing for live detection and logging of user activities in diverse indoor environments.
As described in Figure 13, the deployed system combines a radar sensing module with a FastAPI-based server and a mobile web interface. The application continuously receives activity predictions from the detection algorithm, updates the recognized behavior class and its confidence level for each frame, and displays behavior statistics such as the number of squats and stand-ups detected over time. The interface refreshes every 100 ms, enabling users or operators to monitor current activities and receive immediate alerts, for example in the event of a fall.
For rigorous evaluation, experiments were performed with new subjects and in settings differing from the training environment in terms of radar installation, obstacle arrangement, and action positioning. This generalization setup was designed to test the robustness of the system beyond the specific conditions seen during model training. The results confirmed that the application provides reliable and intuitive feedback on real-time human activities and facilitates practical deployment for continuous indoor behavior monitoring on a mobile platform.