1. Introduction
Sensor and vision-based methods are widely used approaches for human localization and activity recognition [1,2,3]. In spite of the popularity of these methods, they are unsuitable for many scenarios. Sensor-based methods require users to continuously wear one or more sensors, which is inconvenient and uncomfortable, especially for elderly and disabled users. Similarly, vision-based methods have limitations, as they require installing cameras in the area of interest, which may be intrusive and affect users’ privacy.
The advent of CSI-based sensing has opened new avenues in the realms of localization and activity recognition [4,5,6], enabling a deeper understanding of human behaviors and interactions in various environments. Despite the promising capabilities of CSI to provide fine-grained information about the environment, leveraging these data for multifaceted human-centric applications presents a unique set of challenges and considerations.
CSI data, inherently rich in information due to their sensitivity to environmental changes, have been widely recognized for their potential in detecting human presence, movements, and even intricate activities [5,7,8]. However, the application of CSI for precise localization and activity recognition, especially in scenarios involving multiple individuals, compounds the complexity of the task. The primary challenges stem from the CSI data’s high dimensionality, the temporal variability of human activities, and the subtle nuances that distinguish different movements or locations. Furthermore, the multipath propagation of wireless signals, which is a fundamental characteristic of CSI, introduces additional layers of complexity due to the resulting signal fading and interference.
Localization and activity recognition, even when addressed as separate tasks, each pose their own set of difficulties. Localization requires high precision in interpreting the signal’s attenuation and phase changes to accurately estimate a person’s position within a given environment [9,10,11]. Activity recognition, on the other hand, demands the ability to discern and classify various human actions based on the subtle modulations they cause in the surrounding wireless signals [4,6,12]. When combined, these tasks necessitate sophisticated signal processing and machine learning techniques capable of extracting meaningful patterns from the high-dimensional, time-series CSI data.
The challenges are further magnified in multi-person scenarios, where the simultaneous presence and movement of multiple individuals introduce overlapping signal patterns that can obscure the distinct features associated with specific locations or activities. This overlap complicates the differentiation of individual contributions to the CSI data, thus requiring advanced modeling techniques to disentangle and attribute the observed signal variations to the correct sources.
This paper addresses key challenges in sustainable human sensing by proposing a novel approach that leverages the power of Transformer networks for the dual tasks of multi-person localization and activity recognition from Channel State Information (CSI) data. The proposed model utilizes the Transformer’s ability to capture long-range temporal dependencies within CSI data, enabling a comprehensive understanding of the spatial-temporal dynamics underpinning human activities and their locations. This advancement contributes to non-intrusive, device-free monitoring solutions that support smart homes, inclusive urban spaces, and healthcare systems, aligning with SDG 3 and SDG 11. Through extensive experiments and evaluations, our study demonstrates the effectiveness of this approach, setting a new benchmark in CSI-based sensing for human-centric applications. By enabling more accurate, scalable, and resource-efficient multi-user sensing, this work supports the development of technologies that foster innovation and enhance sustainability and well-being in modern communities. The main contributions of this work can be summarized as follows:
A novel approach is developed to simultaneously localize and recognize human activities in settings with multiple users. A multi-label, multi-view Transformer-based architecture is used for this purpose. The proposed system shows superior performance in achieving the designed objectives in real time across a range of environments and frequency bands.
The proposed approach employs advanced preprocessing techniques and capitalizes on the Transformer’s self-attention mechanism to mitigate the constraints of traditional sequential data processing methods. It is able to learn high-dimensional representations of human activities and locations from CSI data effectively and reliably.
The proposed system is able to accurately separate and assign mixed signals to the corresponding users, which presents an innovative solution for intelligent environment monitoring and user interaction analysis.
The rest of this paper is organized as follows: Section 2 summarizes the problem statement and motivation of the presented work. Some related works on the localization and activity recognition problems are introduced in Section 3. An overview of the proposed system is presented briefly in Section 4. Section 5 presents the details of the proposed system. The presented system is assessed and evaluated in Section 6. Finally, some conclusions are drawn and future work is suggested in Section 7.
2. Problem Statement and Motivation
The task of activity recognition in complex, cluttered environments with multiple users presents unique challenges that significantly complicate accurate classification and localization. Human activity recognition using CSI sensing has emerged as a promising alternative to traditional methods, offering fine-grained environmental information through the analysis of wireless signals. CSI’s sensitivity to environmental changes allows for the detection of human presence, movements, and activities. However, leveraging CSI data for precise localization and activity recognition in multi-user scenarios introduces significant challenges.
Figure 1 illustrates the t-SNE projection of CSI data for a single-user case with minimal overlap between activity classes. In this scenario, the classes are relatively well separated, allowing for a straightforward classification of activities such as walking, jumping, waving, and sitting down. Despite some boundary overlap, the overall distinction between activities remains clear, enabling higher accuracy in recognition tasks. However, in a multi-user scenario, as shown in Figure 2, the CSI data have substantial overlap between activity classes. The presence and movement of multiple individuals generate complex, intertwined signal patterns, making it challenging to distinguish between different activities. This overlap obscures the unique features associated with specific actions, leading to significant difficulties in accurately classifying activities. The high dimensionality and temporal variability of CSI data, coupled with the multipath propagation of wireless signals, further exacerbate the complexity of multi-user activity recognition.
To address the problem of multi-user activity recognition and localization, the proposed system segments the monitored space into smaller spatial units or “cells”, each occupied by a single person. This segmentation allows for the independent assessment of each cell’s signal characteristics and associated human activities, reducing the complexity of analyzing intertwined signals from multiple users. By isolating user activity within specific cells, the system can capture multiple activity patterns influenced by surrounding users, motivating the training of a powerful model on these variations to enhance accuracy. This is achieved by our proposed multiview Transformer architecture, which processes CSI data through separate views for the amplitude and phase components, simplifying the localization and activity recognition tasks in multi-user environments. The details of the proposed architecture are presented below.
3. Related Work
Human sensing with WiFi CSI is emerging as a promising alternative to traditional sensing technologies due to its non-intrusive, environmentally robust, and device-free characteristics. In the realm of WiFi-based human sensing, significant efforts have been concentrated on two primary tasks: human localization and Human Activity Recognition (HAR). While it is important to recognize human activities, it is also crucial to have location information in order to promptly intervene in critical situations. Thus, achieving both functions in the same system is an essential and challenging task. This section gives a brief overview of some important research efforts for both tasks, along with the challenges that arise when working with multiple users.
3.1. WiFi-Based Human Localization
Human indoor localization is one of the most widely tackled problems, with applications ranging from healthcare to gaming and virtual reality. Human localization aims to estimate user positions to facilitate human–computer interaction systems [13]. The importance of localization comes from its concern not only with user location, but also with the supply of information about user preferences and conditions. Many technologies have been used in developing localization systems, including, but not restricted to, ultrasound [14], infrared [15], and magnetic fields [16]. These systems are restricted to specific environments/users. The significant developments in mobile devices and WiFi communication have motivated academia to develop many ubiquitous and accurate WiFi-based localization systems. The initial attempts employed Naive Bayes [11], a basic statistical model, demonstrating acceptable localization performance. Progressing from this foundation, the Sparse Auto-encoder (SAE) [12] was introduced, utilizing CSI images to localize users, although its fully connected architecture resulted in a model burdened with excessive parameters. Further explorations led to the adoption of Long Short-Term Memory (LSTM) [10] and Convolutional Neural Networks (CNNs) [13] for enhancing localization accuracy. Among these, CNN-1D [13] has been particularly noted for its efficacy in WiFi-based human localization, capitalizing on its ability to process sequential and spatial signal attributes.
3.2. WiFi-Based Human Activity Recognition (HAR)
The domain of WiFi-based HAR is rapidly gaining attention for its capability to analyze user behaviors with finer granularity compared to mere identification and localization tasks. Early methods in HAR relied on handcrafted feature extraction, which often proved insufficient in extracting salient features from CSI [17]. This limitation led to the adoption of LSTM [17] and CNNs [7,8,18] for their enhanced ability to learn temporal and spatial features effectively. The advent of CNN-LSTM hybrids [4,5,19] marked a significant advancement, leveraging the spatial feature extraction capabilities of CNNs and the temporal learning prowess of LSTMs. Furthermore, to address the dynamic nature of environments, GANs [6] have been integrated with CNNs to introduce adversarial learning, enhancing the robustness of HAR systems. Attention-based models [20], particularly ABLSTM [21], have equipped bidirectional LSTMs with attention mechanisms, significantly improving HAR performance. The introduction of a convolution-augmented Transformer (THAT) [7] has taken this a step further, combining attention layers with multi-scale CNNs to achieve state-of-the-art results in WiFi-based HAR.
3.3. Multi-User Sensing Challenges
Existing methods predominantly focus on single-user sensing. However, practical applications often involve multiple users, making multi-user sensing based on WiFi CSI a challenging yet vital endeavor. The presence of multiple users can lead to reflections, multipath, and mutual interference, complicating the task of accurately sensing individual users’ activities. Recent studies [22,23,24,25,26,27] have made strides in addressing these challenges, aiming to effectively sense multiple users simultaneously. Unlike these systems, the proposed MultiSenseX is designed to tackle the complexities of multi-user environments by leveraging advanced signal processing and a Transformer architecture, integrating multiview data analysis to accurately localize and recognize the activities of individual users in a shared space. This approach ensures a more effective and reliable sensing solution, making MultiSenseX a pivotal development in WiFi-based human sensing technology.
4. MultiSenseX System Overview
Figure 3 illustrates the architecture of MultiSenseX, which operates through two main stages: an offline training stage and an online recognition stage. The offline stage begins with the Data Collection module gathering CSI data while users perform activities at different reference locations. This module integrates both hardware components (transmitters and receivers) and software to effectively capture the data. Next, the Preprocessing module processes the continuous stream of measurements to reduce noise and format the data into fixed-length sequences suitable for analysis. Concurrently, the Spatial Discretization module segments the monitored area into a virtual grid, assigning each user to a cell during their activities and tagging the CSI data with corresponding cell labels for location. These structured data are crucial for the next phase. The Model Constructor module then takes center stage, developing and training a deep learning model specifically designed for precise localization and activity recognition. Once training is complete, the model is saved for subsequent use in the online phase.
During the online phase, the system deploys the pre-trained model to identify and classify user activities and locations in real-time, based on incoming CSI data. This enables immediate and accurate response to dynamic changes in the environment, demonstrating the system’s capability to adapt and apply learned patterns to real-world scenarios.
The proposed amplitude-specific Transformer network extracts features relevant to environmental and locational characteristics, improving the system’s ability to detect and differentiate user activities. The phase-specific Transformer network analyzes temporal sequences and signal interactions, aiding in identifying location-specific activities. This dual-view approach captures both spatial and temporal aspects of the CSI data, enhancing the system’s robustness in multi-user cluttered environments. Integrating features from both amplitude and phase views results in a comprehensive feature representation, which is crucial for accurately modeling complex dynamics in multi-user scenarios. This detailed representation ensures the system’s effectiveness in real-world settings, providing robust and precise activity recognition and localization.
5. System Details
The proposed system leverages CSI to accomplish the dual objectives of localization and activity recognition in a multi-person environment. It operates in two phases: the offline phase, where the data are collected and the prediction model is trained, and the online phase, where real-time data are processed to predict the activities and locations of users. Below, we delve into the specifics of each module within these phases.
5.1. MultiSenseX Data Preprocessing
The preprocessing module is essential for refining raw CSI data, facilitating its transition into a format amenable to subsequent analysis. The preprocessing steps include noise reduction, normalization, and segmentation, each crucial for enhancing the signal quality and relevance for activity recognition and localization tasks. Noise inherent in CSI data can significantly obfuscate the underlying patterns related to human activities and environmental characteristics. We initiate preprocessing by applying noise reduction techniques to the raw CSI signals. This involves the use of a moving average filter to smooth out the data and eliminate transient noise components. Additionally, phase unwrapping and offset correction are performed to mitigate phase discontinuities and ensure coherent phase information across the signal spectrum. After cleansing, the CSI data, inherently complex-valued, are decomposed into amplitude and phase components. This separation facilitates targeted analysis of the distinct features embedded in each component, correlating to the physical phenomena affecting signal propagation and interaction. These components are then normalized to a 0–1 range using Min-Max normalization, highlighting key signal patterns crucial for model learning and speeding up its convergence. The normalized data are further segmented into fixed-length sequences, creating uniform data chunks that facilitate the identification of activity patterns and spatial localizations within specific time windows. This segmentation aligns with the temporal nature of human activities and is tailored to capture the dynamics of the interactions within the environment. Each segment is structured to reflect a coherent snapshot of activity or environmental state, serving as a distinct instance for training the Transformer-based model. 
Finally, to bolster the model’s ability to generalize to different environmental conditions and ensure it is not overly fitted to the noise-free training data, white Gaussian noise is introduced as a form of data augmentation. This step is designed to simulate realistic signal variations that the system might encounter in actual deployment. Thus, the model is effectively trained to distinguish between true signal patterns and random noise.
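The preprocessing chain described above (smoothing, phase unwrapping, amplitude/phase separation, Min-Max normalization, segmentation, and Gaussian-noise augmentation) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the window length, segment length, and noise level below are placeholder values, not the paper’s exact parameters.

```python
import numpy as np

def preprocess_csi(csi, win=5, seg_len=3000, noise_std=0.01, rng=None):
    """Sketch of the preprocessing pipeline.

    csi: complex array of shape (time_steps, subcarriers).
    Returns a list of (amplitude, phase) segments of length `seg_len`.
    """
    rng = rng or np.random.default_rng(0)

    # Separate the complex CSI into amplitude and phase components.
    amp = np.abs(csi)
    # Unwrap the phase along time to remove 2*pi discontinuities.
    phase = np.unwrap(np.angle(csi), axis=0)

    # Moving-average filter (per subcarrier) to suppress transient noise.
    kernel = np.ones(win) / win
    amp = np.apply_along_axis(
        lambda x: np.convolve(x, kernel, mode="same"), 0, amp)

    # Min-Max normalize each component to the [0, 1] range.
    def minmax(x):
        lo, hi = x.min(), x.max()
        return (x - lo) / (hi - lo + 1e-12)
    amp, phase = minmax(amp), minmax(phase)

    # Segment into fixed-length sequences and augment each segment
    # with white Gaussian noise for robustness.
    segments = []
    for start in range(0, len(csi) - seg_len + 1, seg_len):
        a = amp[start:start + seg_len]
        p = phase[start:start + seg_len]
        segments.append((a + rng.normal(0, noise_std, a.shape),
                         p + rng.normal(0, noise_std, p.shape)))
    return segments
```

Each returned (amplitude, phase) pair corresponds to one training instance for the two-view model described below.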
5.2. The Spatial Discretization
To effectively analyze human activities and localization using CSI data, we have adopted a spatial discretization strategy. This method segments the monitored space into a virtual grid, each cell measuring 1 m × 1 m. We assume that each individual occupies a single cell during any activity. This structured approach enables precise analysis of user interactions with the wireless signals, as each cell’s signal characteristics and associated human activities can be independently assessed. The 1 m square cells provide an optimal balance between spatial resolution and computational efficiency, ensuring that the CSI data’s spatial distribution is captured comprehensively. This setup offers a practical yet detailed framework for examining the impact of human presence and movements on the wireless signals, enhancing our understanding and ability to accurately model these dynamics.
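The grid labeling can be sketched as follows. The 1 m × 1 m cell size follows the paper; the row-major enumeration of cells and the 5-cell grid width are assumptions introduced for illustration.

```python
def location_to_cell(x, y, cell_size=1.0, grid_width=5):
    """Map a ground-truth (x, y) position in meters to a cell index,
    assuming row-major enumeration over the monitored area."""
    col = int(x // cell_size)
    row = int(y // cell_size)
    return row * grid_width + col

def occupancy_vector(positions, num_cells=25, cell_size=1.0, grid_width=5):
    """Multi-label occupancy vector: one binary entry per cell,
    set to 1 for every cell occupied by a user."""
    v = [0] * num_cells
    for (x, y) in positions:
        v[location_to_cell(x, y, cell_size, grid_width)] = 1
    return v
```

For example, two users at (0.5, 0.5) and (2.3, 1.7) occupy cells 0 and 7 of a 5 × 5 grid, and the resulting binary vector is the multi-label location target described in Section 5.3.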
5.3. The Model Training
The proposed system employs a multiview Transformer architecture (shown in Figure 4) to process CSI data, facilitating simultaneous localization and activity recognition of multiple users. This architecture utilizes separate views for the amplitude and phase components of the CSI data, capitalizing on their distinct informational characteristics to enhance the understanding of environmental dynamics and human activities.
The Amplitude View processes the amplitude component of the CSI signals, highlighting variations in signal strength caused by human movements. The amplitude data are transformed by the amplitude-specific Transformer network, designed to extract features that are pertinent to environmental and locational characteristics.
The Phase View processes the phase component of the CSI signals to glean insights into the temporal dynamics and phase shifts. The phase-specific Transformer network is employed to analyze the temporal sequences and interactions of signals within the environment, aiding in the identification of location-specific activities.
The features extracted from both amplitude and phase views are then integrated into a comprehensive feature representation, which encapsulates the spatial and temporal aspects of the CSI data. These features are fed into a hierarchical classification process to perform simultaneous localization and activity recognition of multiple users. The framework is bifurcated into a two-tier classification system: a multilabel classification for identifying user locations and a subsequent conditional classification for recognizing user activities within these locations.
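The per-view encoding and fusion can be illustrated with a single-head self-attention sketch in NumPy. The projection weights here are random placeholders, and mean-pooled concatenation is one plausible fusion choice for illustration, not necessarily the authors’ exact design.

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention, the core operation of each
    view's Transformer encoder. X: (seq_len, d) features from one view.
    The projection weights are random placeholders."""
    d = X.shape[1]
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Scaled dot-product attention with a numerically stable softmax.
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

def fuse_views(amp_feats, phase_feats):
    """Pool each view over time and concatenate into the joint
    representation consumed by the downstream classifiers."""
    return np.concatenate([amp_feats.mean(axis=0), phase_feats.mean(axis=0)])
```

Because the attention weights span the full sequence, each output step can aggregate information from any time step, which is the long-range dependency property the paper relies on.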
The first tier of our model employs a multilabel classification network tasked with predicting the presence of users across multiple predefined locations. Each location within the set of $M$ predefined locations is associated with a binary label, collectively forming the prediction vector representing the likelihood of user presence in these locations. The model optimizes the binary cross-entropy loss, calculated as:

$$\mathcal{L}_{loc} = -\frac{1}{M} \sum_{m=1}^{M} \left[ y_m \log \hat{y}_m + (1 - y_m) \log (1 - \hat{y}_m) \right]$$

where $y_m$ and $\hat{y}_m$ are the ground truth and predicted probabilities for each location label $m$, respectively.
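As a concrete reference, the first-tier loss can be computed with a minimal NumPy sketch of the standard binary cross-entropy averaged over the location labels:

```python
import numpy as np

def location_bce(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy over the M location labels: the mean of the
    per-label BCE terms, matching the first-tier multilabel loss."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 to keep the logs finite.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))
```

For a single label with ground truth 1 and prediction 0.5, the loss is log 2 ≈ 0.693, the familiar coin-flip baseline.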
Following location identification, the second tier comprises location-specific activity recognition models, each tailored to the unique environmental context of the identified locations. The activity classification for each location $i$ generates a probability distribution over potential activities, employing categorical cross-entropy as the loss function:

$$\mathcal{L}_{act}^{(i)} = -\frac{1}{N_i} \sum_{n=1}^{N_i} \sum_{a=1}^{A_i} y_{n,a} \log \hat{y}_{n,a}$$

where $N_i$ and $A_i$ represent the number of samples and activity classes at location $i$, respectively, with $y_{n,a}$ as the ground truth and $\hat{y}_{n,a}$ as the predicted probability for each activity class $a$.
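Similarly, the second-tier loss for one location can be sketched as the standard categorical cross-entropy over one-hot activity labels (a minimal NumPy version for illustration):

```python
import numpy as np

def activity_cce(y_true, y_prob, eps=1e-12):
    """Categorical cross-entropy for one location's activity classifier.
    y_true: one-hot ground-truth rows (N_i samples x A_i classes).
    y_prob: predicted probability distributions of the same shape."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 so the log stays finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
    # Sum over classes, average over samples.
    return float(-np.sum(y_true * np.log(y_prob)) / y_true.shape[0])
```

Only the probability assigned to the true class contributes to each sample’s term, so confident correct predictions drive the loss toward zero.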
6. Evaluation
6.1. Experimental Setup and Data Collection
The system is evaluated with data collected in a realistic environment that is publicly available in [28]. The experiments are designed to capture CSI data of different activities conducted by varying numbers of participants. These participants engage in identical or distinct activities concurrently. The dataset contains a total of 11,286 WiFi CSI and corresponding video samples, cumulatively spanning over 9.4 h of data.
For the acquisition of the WiFi CSI data used in this paper, two standard computing systems (HP EliteDesk 800 G2 TWR) were leveraged as the transmitter and receiver. Each system was equipped with an Intel 5300 Network Interface Card and configured with the Linux 802.11n CSI tool. To exploit the full spectrum of WiFi bands, configurations were set to channel 12 in the 2.4 GHz band and channel 64 in the 5 GHz band for dual-band CSI collection.
The process of collecting CSI data was executed in three sequential steps. Initially, the receiver was configured to log the CSI of all incoming packets by tuning to a specific WiFi channel. Subsequently, the transmitter began dispatching packets to this channel, during which participants engaged in predetermined activities to generate the necessary interaction data. Finally, upon completion of these activities, the receiver terminated its logging and listening operations, concluding the data collection phase. In line with established protocols [1,29], activities were performed over a duration of 3 s, with the transmitter emitting 3000 packets at a rate of 1000 packets per second. Each 3 s CSI sample thus consisted of 3000 time steps, irrespective of packet loss. The devices, each equipped with three antennas and utilizing 30 subcarriers, recorded a CSI dimension of 3 × 3 × 30 = 270 at every time step, culminating in a comprehensive sample dimension of 3000 × 270. Alongside the CSI data, synchronized video recordings were captured to support the ground-truth profiling.
The dataset incorporates nine daily life activities, namely: Nothing, Walking, Rotation, Jumping, Waving, Lying Down, Picking Up, Sitting Down, and Standing Up (an example is shown in Figure 5). Data collection was conducted across three distinct environments: a classroom shown in Figure 6, a meeting room shown in Figure 7, and an empty room shown in Figure 8, each with five predefined locations. Six volunteers (3 males and 3 females) participated in the data collection process.
6.2. Comparative Evaluation
In this comparative analysis, our proposed system is benchmarked against eight other baseline models that utilize WiFi data. These models include a Random Forest classifier employing a Short-time Fourier Transform (ST-RF) [17], Multilayer Perceptron (MLP) [13], Long Short-Term Memory (LSTM) [17], one-dimensional Convolutional Neural Network (CNN-1D) [13], two-dimensional Convolutional Neural Network (CNN-2D) [7], Convolutional LSTM (CLSTM) [30], Attention-Based LSTM (ABLSTM) [21], and a Temporal Hierarchical Attention Network (THAT) [31].
For our evaluation metrics, we use accuracy as the primary measure of performance for multi-user sensing activities, consistent with prior studies [17,21,31]. The dataset is divided into a training set, comprising 80% of the data, and a test set, making up the remaining 20%.
6.3. Localization Performance Analysis
Figure 9, Figure 10 and Figure 11 show the results of the different systems in the classroom, the meeting room, and the empty room, respectively. The MultiSenseX system exhibited superior localization accuracy across all tested environments, significantly outperforming the baseline models. The proposed system achieved 91.9% accuracy in the classroom, 92.3% in the meeting room, and 90.5% in the empty room, surpassing the highest-scoring baseline, THAT, which registered 80.42%, 83.25%, and 79.97%, respectively. The enhanced localization performance of MultiSenseX is attributed to its sophisticated multiview Transformer architecture, which efficiently captures and integrates the nuanced spatial characteristics encoded in the amplitude and phase components of the CSI data. Unlike traditional machine learning models such as ST-RF and MLP, which rely on static feature extraction methods, or neural networks like LSTM and CNN-1D/2D that process signals in a unidimensional manner, MultiSenseX leverages a comprehensive understanding of the signal’s multidimensional nature, facilitating a more accurate and dynamic representation of the environment.
6.4. Activity Recognition Performance Analysis
For activity recognition, Figure 9, Figure 10 and Figure 11 illustrate the results of the different systems in the classroom, the meeting room, and the empty room, respectively. The figures confirm that MultiSenseX outperforms the baseline models, achieving remarkable accuracy levels. It reaches 82.3% in the classroom, 81.9% in the meeting room, and 82% in the empty room, substantially exceeding the performance of the leading baseline models, which hovered around the 60% mark. The success of MultiSenseX in activity recognition can be credited to its ability to exploit the temporal and spatial dynamics of CSI data through its advanced Transformer-based processing. This is further enhanced by the system’s capacity to conduct hierarchical classification, allowing for distinct yet interrelated analyses of location and activity, a feature not commonly present in conventional approaches like CNN-1D/2D or CLSTM.
6.5. The Performance in Different Environments
Figure 9, Figure 10 and Figure 11 also confirm the proposed system’s enhanced performance in the classroom setting compared to the meeting and empty rooms. The classroom setting is characterized by a richer, more complex signal environment created by numerous objects. These objects, while potentially causing more reflections and interference, actually provide a detailed dataset that the system’s sophisticated multiview Transformer architecture can utilize effectively. This architecture excels in extracting meaningful information from complex noise, enabling the system to learn from a broader range of spatial and temporal signal variations. Moreover, the system’s hierarchical classification capabilities allow it to distinguish between different signal interactions more accurately. Thus, despite the seemingly adverse conditions, the complexity of a classroom environment ultimately enriches the CSI data, leading to better model training and improved localization and activity recognition performance.

The MultiSenseX architecture, with its multiview Transformer approach, effectively synthesizes the information gleaned from the CSI signal’s amplitude and phase components. This synthesis enables a more granular and accurate interpretation of environmental and motion-induced signal variations, leading to improved performance metrics. The architecture’s ability to handle the inherent complexity and non-linearity of CSI data, especially in distinguishing between multiple activities and locations within the same environmental setup, provides a technical advantage over the evaluated baseline models. MultiSenseX not only surpasses traditional and contemporary models in performance but also sets a new benchmark in leveraging the intricate properties of CSI data across environmental settings. This comparative analysis solidifies the standing of the proposed MultiSenseX system as a significant advancement in the field of WiFi-based CSI analysis for localization and activity recognition.
6.6. The Performance in Different Frequency Bands
This section evaluates the performance of our system in both the 2.4 GHz and 5 GHz frequency bands to determine how frequency affects performance metrics. MultiSenseX consistently delivers high performance in both frequency settings across various environments, as shown in Figure 12 and Figure 13, underlining its robustness and adaptability to frequency variations. Notably, MultiSenseX performs better at 5 GHz, a result of the higher bandwidth and lower interference associated with this frequency band. The 5 GHz band provides a less congested signal environment than the heavily used 2.4 GHz band, reducing signal noise from overlapping devices. Moreover, the shorter wavelengths of 5 GHz enable more precise detection of minute movements and subtle environmental changes, enhancing the system’s localization and activity recognition capabilities. Consequently, the refined data capture at 5 GHz significantly improves MultiSenseX performance in scenarios requiring high resolution and detailed sensitivity.
6.7. The Effect of Changing the Cell Size
In this section, we investigate the effect of varying the cell size created by the spatial discretization module on system performance. As depicted in
Figure 14, when the cell size is small, such as 0.5 m, the system performance degrades. This degradation occurs due to the increased likelihood of a person spanning two cells simultaneously, leading to inaccuracies in activity recognition. Specifically, the discretization process can cause ambiguity in the person’s exact position, resulting in errors in both the localization and activity recognition tasks. The system is designed to define a localization branch for each cell and an activity recognition network branch for each detected location, which, in the case of smaller cells, leads to a substantial increase in computational cost. This is because the number of branches grows exponentially with the reduction in cell size, requiring more time to respond with an estimate. Conversely, a larger cell size reduces the computational cost significantly by decreasing the number of localization and activity recognition branches. However, this comes at the expense of system performance. When the cell size is too large, such as 2 m and above, multiple persons can be localized within the same cell, which introduces errors in distinguishing between individuals. This overlap reduces the precision of the system, as it struggles to accurately track and recognize activities for each individual separately. The loss of granularity in spatial resolution directly impacts the accuracy of both localization and activity recognition. Therefore, we determined that an optimal cell size of 1 m balances computational efficiency and system performance. At this cell size, the system minimizes the likelihood of a person spanning multiple cells while maintaining a manageable number of localization and activity recognition branches. This balance ensures that the system can operate efficiently without compromising the accuracy of localization and activity recognition. 
The 1 m cell size provides sufficient spatial resolution to accurately distinguish between individuals while keeping the computational demands within practical limits. This optimal cell size ensures that the system performs reliably in real-time applications, providing accurate and efficient monitoring and analysis.
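As a rough illustration of this scaling, the sketch below (room dimensions and function names are illustrative assumptions, not part of MultiSenseX) maps a continuous position to a grid cell and counts how many cells, and hence localization branches, a fixed area requires at different cell sizes:

```python
import math

def discretize(x, y, cell_size):
    """Map a continuous (x, y) position in metres to a grid-cell index."""
    return (int(x // cell_size), int(y // cell_size))

def num_cells(width, height, cell_size):
    """Number of cells (hence localization branches) covering a width x height area."""
    return math.ceil(width / cell_size) * math.ceil(height / cell_size)

# Hypothetical 6 m x 6 m room: halving the cell size quadruples the branch count.
for size in (2.0, 1.0, 0.5):
    print(f"cell size {size} m -> {num_cells(6.0, 6.0, size)} branches")
# cell size 2.0 m -> 9 branches
# cell size 1.0 m -> 36 branches
# cell size 0.5 m -> 144 branches
```

The quadrupling from 1 m to 0.5 m cells makes the computational penalty of fine discretization concrete, while the 2 m grid’s nine coarse cells illustrate why multiple persons end up sharing a cell.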
6.8. The Effect of Changing the Number of Persons in the Environment
In this section, we investigate the effect of increasing the number of persons in the environment on system performance.
Figure 15 illustrates that the system maintains activity recognition accuracy above 90% when the number of persons is small, such as 1, 2, or 3. As the number of persons increases, performance decreases, particularly at 5 users and beyond. Despite this decline, the system continues to perform well, maintaining an accuracy of approximately 82%, which highlights its resilience.
This resilience is attributed to the proposed multiview Transformer approach. The multiview Transformer is designed to handle multiple perspectives and integrate information from various viewpoints effectively, allowing the system to disambiguate the activities and locations of different individuals, even in crowded environments. By leveraging the Transformer architecture’s strength in capturing long-range dependencies and contextual information, the system can better distinguish between overlapping activities and reduce the interference caused by the presence of multiple persons.
The multiview Transformer processes input from its different views of the CSI data, creating a comprehensive representation of the environment. This representation enhances the system’s ability to track and recognize activities accurately, even as the number of persons increases, and its attention mechanism selectively focuses on the relevant parts of the input, enabling the system to prioritize critical information and maintain high performance.
Therefore, while an increase in the number of persons introduces additional complexity and potential for interference, the multiview Transformer approach keeps the system robust and effective. Sustaining an accuracy of around 82% in crowded environments demonstrates the system’s capability to manage real-world scenarios where multiple individuals are present, thereby validating the efficacy of the proposed approach.
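The two-stage inference described in the paper, multilabel localization followed by conditional activity branches, can be sketched as follows. This is a minimal toy in plain Python; the threshold value, stub classifiers, and function names are illustrative assumptions, not the actual MultiSenseX implementation:

```python
def multilabel_localize(cell_scores, threshold=0.5):
    """Multilabel localization: every cell whose score exceeds the
    threshold is flagged as occupied (several may fire at once)."""
    return [i for i, s in enumerate(cell_scores) if s > threshold]

def conditional_recognize(occupied_cells, activity_branches, features):
    """Conditional branching: run an activity recognition branch only
    for the cells flagged as occupied, not for every cell."""
    return {c: activity_branches[c](features) for c in occupied_cells}

# Toy example: 4 cells, two occupied; each branch is a stub classifier.
scores = [0.1, 0.9, 0.2, 0.7]
branches = {i: (lambda feats, i=i: f"activity-for-cell-{i}") for i in range(4)}
occupied = multilabel_localize(scores)
print(conditional_recognize(occupied, branches, features=None))
# {1: 'activity-for-cell-1', 3: 'activity-for-cell-3'}
```

Because only the occupied cells trigger activity branches, the per-frame cost scales with the number of detected persons rather than with the total number of grid cells.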
Finally, we can summarize the advantages of MultiSenseX in terms of accuracy, user comfort, and privacy as follows:
- 1.
Traditional sensor-based methods (e.g., wearable devices) typically deliver high accuracy in activity recognition but require physical sensors to be worn, which can be intrusive and uncomfortable for users, especially for elderly or disabled individuals.
- 2.
Vision-based methods (e.g., camera-based systems) also achieve high accuracy but struggle with occlusions, lighting conditions, and environmental variations, and they depend heavily on clear line-of-sight conditions. Additionally, these methods inherently compromise user privacy due to video capture; moreover, they can make users feel monitored, impacting their natural behavior.
- 3.
MultiSenseX (a device-free and non-intrusive system) surpasses these methods in multi-user environments by leveraging CSI data, which are robust to occlusions and lighting conditions. It achieves comparable performance, with over 91% localization and 82% activity recognition accuracy in challenging environments, while offering a higher level of privacy: it analyzes only wireless signal modulations, without capturing images or requiring physical tracking devices.
- 4.
The architecture of MultiSenseX is inherently flexible due to its multiview Transformer-based design. By fine-tuning the model with additional labeled data, it can be extended to detect rare but critical activities such as falls or sudden movements indicative of emergencies. MultiSenseX can leverage transfer learning to integrate new activity classes without retraining the entire model. For example, by collecting CSI data specifically for falls, we can fine-tune the existing classification layers to detect these events. Additionally, generative adversarial networks (GANs) could be used to augment training data for such rare occurrences, reducing the burden of large-scale data collection.
An additional strategy can be employed to enhance MultiSenseX performance in even more challenging scenarios: multimodal data fusion. Introducing additional modalities, such as audio signals, LiDAR point clouds, or thermal imaging, could complement CSI data and help the system resolve ambiguities in multi-user environments, effectively mitigating issues like signal noise, multipath interference, and occlusions. The framework in [32], designed to integrate diverse data sources through a shared latent space, could be beneficial for this purpose. Fusing data streams with multimodal attention layers could enhance the accuracy of both localization and activity recognition.
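To make the fusion idea concrete, the following toy sketch combines per-modality feature vectors with scalar attention weights. This is a deliberately simplified stand-in for the multimodal attention layers discussed above; the relevance score (mean activation) and all names are illustrative assumptions:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(modalities):
    """Fuse same-length feature vectors from different modalities using
    scalar attention weights derived from each modality's mean activation
    (a toy relevance score standing in for a learned attention layer)."""
    scores = [sum(v) / len(v) for v in modalities]
    weights = softmax(scores)
    dim = len(modalities[0])
    return [sum(w * v[i] for w, v in zip(weights, modalities)) for i in range(dim)]

# Toy example: a "CSI" vector and a "LiDAR" vector of matching dimension.
fused = attention_fuse([[1.0, 1.0], [3.0, 3.0]])
```

In a learned system the weights would come from trained attention parameters rather than mean activations, but the structure, score each modality, normalize, and take a weighted sum in a shared space, is the same.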
7. Conclusions
In this study, we introduced MultiSenseX, a novel system for WiFi-based human sensing that leverages a multiview Transformer architecture to enable simultaneous localization and activity recognition for multiple users. By processing the amplitude and phase components of CSI data through distinct Transformer views, MultiSenseX significantly enhances the ability to analyze complex dynamics in multi-user environments. The system employs multilabel classification for accurate localization and conditional branching for multiclass activity recognition, facilitating precise, real-time monitoring in cluttered and dynamic settings.
The proposed system supports smart and inclusive environments (SDG 11) by enabling intelligent monitoring of urban spaces and shared facilities, improving safety and operational efficiency. Additionally, its capability for seamless multi-user sensing lays the groundwork for advancements in healthcare monitoring (SDG 3), such as non-intrusive patient activity tracking and fall detection. By addressing the challenges of complex multi-user environments, MultiSenseX represents a transformative step toward sustainable and human-centric technological solutions for modern living. Some limitations of MultiSenseX can be summarized as follows:
- 1.
Scalability: The current system is designed for a fixed spatial grid (1 m × 1 m cells) to simplify spatial discretization. While this setup works well in moderately crowded environments, scalability becomes challenging in scenarios with a high density of users, where multiple individuals may occupy the same grid cell. This could lead to signal overlap and reduced performance. To address this, we propose integrating more sophisticated spatial modeling techniques, such as adaptive grids or dynamic clustering, to better separate overlapping signals.
- 2.
Dynamic Environments: While MultiSenseX has been tested across three distinct environments, the system assumes a static setting during both training and testing. In real-world applications, dynamic factors, such as new obstacles, moving furniture, or changing wireless signal properties (e.g., signal attenuation due to rain or crowding), may degrade system accuracy. Incorporating mechanisms for online recalibration and adaptive learning will mitigate these limitations.
Despite these challenges, MultiSenseX consistently outperforms baseline models in diverse environments, underscoring its robustness in handling real-world conditions compared to existing solutions.
Author Contributions
Conceptualization, H.R., A.E. and H.Y.; methodology, H.R., A.E. and H.Y.; software, H.R.; validation, H.R., A.E., M.R. and H.Y.; formal analysis, H.R.; investigation, H.R., A.E. and H.Y.; resources, H.R. and H.Y.; data curation, H.R.; writing—original draft preparation, H.R. and A.E.; writing—review and editing, H.R., M.R. and A.E.; visualization, H.R., A.E. and H.Y.; supervision, H.Y.; project administration, H.Y.; funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.
Funding
This project is sponsored by Prince Sattam Bin Abdulaziz University (PSAU) as part of funding for its SDG Roadmap Research Funding Programme project number PSAU-2023-SDG-57.
Data Availability Statement
Dataset available on request from the authors.
Acknowledgments
This project is sponsored by Prince Sattam Bin Abdulaziz University (PSAU) as part of funding for its SDG Roadmap Research Funding Programme project number PSAU-2023-SDG-57.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Bocus, M.J.; Li, W.; Vishwakarma, S.; Kou, R.; Tang, C.; Woodbridge, K.; Craddock, I.; McConville, R.; Santos-Rodriguez, R.; Chetty, K.; et al. OPERAnet, a multimodal activity recognition dataset acquired from radio frequency and vision-based sensors. Sci. Data 2022, 9, 474. [Google Scholar] [CrossRef] [PubMed]
- Cao, R.; Yang, X.; Yang, Z.; Zhou, M.; Xie, L. Research on Human Activity Recognition Technology under the Condition of Through-the-wall. In Proceedings of the 2020 IEEE/CIC International Conference on Communications in China (ICCC), Chongqing, China, 9–11 August 2020; pp. 501–506. [Google Scholar]
- Wang, Y.; Liu, H.; Cui, K.; Zhou, A.; Li, W.; Ma, H. m-Activity: Accurate and Real-Time Human Activity Recognition Via Millimeter Wave Radar. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 8298–8302. [Google Scholar] [CrossRef]
- Guo, L.; Zhang, H.; Wang, C.; Guo, W.; Diao, G.; Lu, B.; Lin, C.; Wang, L. Towards CSI-based diversity activity recognition via LSTM-CNN encoder-decoder neural network. Neurocomputing 2021, 444, 260–273. [Google Scholar] [CrossRef]
- Shalaby, E.; ElShennawy, N.; Sarhan, A. Utilizing deep learning models in CSI-based human activity recognition. Neural Comput. Appl. 2022, 34, 5993–6010. [Google Scholar] [CrossRef] [PubMed]
- Wang, D.; Yang, J.; Cui, W.; Xie, L.; Sun, S. Multimodal CSI-based human activity recognition using GANs. IEEE Internet Things J. 2021, 8, 17345–17355. [Google Scholar] [CrossRef]
- Moshiri, P.F.; Nabati, M.; Shahbazian, R.; Ghorashi, S.A. Csi-based human activity recognition using convolutional neural networks. In Proceedings of the 2021 11th International Conference on Computer Engineering and Knowledge (ICCKE), Mashhad, Iran, 28–29 October 2021; pp. 7–12. [Google Scholar]
- Zhang, Y.; Yin, Y.; Wang, Y.; Ai, J.; Wu, D. CSI-based location-independent human activity recognition with parallel convolutional networks. Comput. Commun. 2023, 197, 87–95. [Google Scholar] [CrossRef]
- Aly, H.; Agrawala, A. Hapi: A robust pseudo-3D calibration-free WiFi-based indoor localization system. In Proceedings of the 15th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, New York, NY, USA, 5–7 November 2018; pp. 166–175. [Google Scholar]
- Ding, J.; Wang, Y.; Si, H.; Gao, S.; Xing, J. Three-dimensional indoor localization and tracking for mobile target based on wifi sensing. IEEE Internet Things J. 2022, 9, 21687–21701. [Google Scholar] [CrossRef]
- Wu, Z.; Xu, Q.; Li, J.; Fu, C.; Xuan, Q.; Xiang, Y. Passive indoor localization based on csi and naive bayes classification. IEEE Trans. Syst. Man Cybern. Syst. 2017, 48, 1566–1577. [Google Scholar] [CrossRef]
- Gao, Q.; Wang, J.; Ma, X.; Feng, X.; Wang, H. CSI-based device-free wireless localization and activity recognition using radio image features. IEEE Trans. Veh. Technol. 2017, 66, 10346–10356. [Google Scholar] [CrossRef]
- Wang, F.; Feng, J.; Zhao, Y.; Zhang, X.; Zhang, S.; Han, J. Joint activity recognition and indoor localization with WiFi fingerprints. IEEE Access 2019, 7, 80058–80068. [Google Scholar] [CrossRef]
- Sainjeon, F.; Gaboury, S.; Bouchard, B. Real-Time Indoor Localization in Smart Homes Using Ultrasound Technology. In Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive Environments, Corfu Island, Greece, 29 June–1 July 2016. [Google Scholar] [CrossRef]
- Geng, X.; Peng, R.; Li, M.; Liu, W.; Jiang, G.; Jiang, H.; Luo, J. A Lightweight Approach for Passive Human Localization Using an Infrared Thermal Camera. IEEE Internet Things J. 2022, 9, 24800–24811. [Google Scholar] [CrossRef]
- Ouyang, G.; Abed-Meraim, K. A Survey of Magnetic-Field-Based Indoor Localization. Electronics 2022, 11, 864. [Google Scholar] [CrossRef]
- Yousefi, S.; Narui, H.; Dayal, S.; Ermon, S.; Valaee, S. A survey on behavior recognition using WiFi channel state information. IEEE Commun. Mag. 2017, 55, 98–104. [Google Scholar] [CrossRef]
- Zhang, R.; Jiang, C.; Wu, S.; Zhou, Q.; Jing, X.; Mu, J. Wi-Fi sensing for joint gesture recognition and human identification from few samples in human-computer interaction. IEEE J. Sel. Areas Commun. 2022, 40, 2193–2205. [Google Scholar] [CrossRef]
- Zou, H.; Zhou, Y.; Yang, J.; Jiang, H.; Xie, L.; Spanos, C.J. Deepsense: Device-free human activity recognition via autoencoder long-term recurrent convolutional network. In Proceedings of the 2018 IEEE International Conference on Communications (ICC), Kansas City, MO, USA, 20–24 May 2018; pp. 1–6. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Chen, Z.; Zhang, L.; Jiang, C.; Cao, Z.; Cui, W. WiFi CSI based passive human activity recognition using attention based BLSTM. IEEE Trans. Mob. Comput. 2018, 18, 2714–2724. [Google Scholar] [CrossRef]
- Duan, P.; Li, C.; Li, J.; Chen, X.; Wang, C.; Wang, E. WISDOM: Wi-Fi-Based Contactless Multiuser Activity Recognition. IEEE Internet Things J. 2022, 10, 1876–1886. [Google Scholar] [CrossRef]
- He, J.; Yang, W. Imar: Multi-user continuous action recognition with wifi signals. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2022, 6, 1–27. [Google Scholar] [CrossRef]
- Ma, Y.; Zhou, G.; Wang, S.; Zhao, H.; Jung, W. SignFi: Sign language recognition using WiFi. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2018, 2, 1–21. [Google Scholar] [CrossRef]
- Tan, S.; Zhang, L.; Wang, Z.; Yang, J. MultiTrack: Multi-user tracking and activity recognition using commodity WiFi. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, UK, 4–9 May 2019; pp. 1–12. [Google Scholar]
- Venkatnarayan, R.H.; Page, G.; Shahzad, M. Multi-user gesture recognition using WiFi. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, Munich, Germany, 10–15 June 2018; pp. 401–413. [Google Scholar]
- Yang, Z.; Zhang, Y.; Zhang, Q. Rethinking fall detection with Wi-Fi. IEEE Trans. Mob. Comput. 2022, 22, 6126–6143. [Google Scholar] [CrossRef]
- Huang, S.; Li, K.; You, D.; Chen, Y.; Lin, A.; Liu, S.; Li, X.; McCann, J.A. WiMANS: A Benchmark Dataset for WiFi-based Multi-user Activity Sensing. arXiv 2024, arXiv:2402.09430. [Google Scholar]
- Zheng, Y.; Zhang, Y.; Qian, K.; Zhang, G.; Liu, Y.; Wu, C.; Yang, Z. Zero-effort cross-domain gesture recognition with Wi-Fi. In Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services, Seoul, Republic of Korea, 17–21 June 2019; pp. 313–325. [Google Scholar]
- Mo, H.; Kim, S. A deep learning-based human identification system with wi-fi csi data augmentation. IEEE Access 2021, 9, 91913–91920. [Google Scholar] [CrossRef]
- Li, B.; Cui, W.; Wang, W.; Zhang, L.; Chen, Z.; Wu, M. Two-stream convolution augmented transformer for human activity recognition. AAAI Conf. Artif. Intell. 2021, 35, 286–293. [Google Scholar] [CrossRef]
- Cheng, S.; Zhuang, Y.; Kahouadji, L.; Liu, C.; Chen, J.; Matar, O.K.; Arcucci, R. Multi-domain encoder–decoder neural networks for latent data assimilation in dynamical systems. Comput. Methods Appl. Mech. Eng. 2024, 430, 117201. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).