SitPAA: Sitting Posture and Action Recognition Using Acoustic Sensing

Abstract: The technologies associated with recognizing human sitting posture and actions primarily involve computer vision, sensors, and radio frequency (RF) methods. These approaches often involve handling substantial amounts of data, pose privacy concerns, and necessitate additional hardware deployment. With the recent emergence of acoustic sensing, acoustic schemes have demonstrated applicability in diverse scenarios, including action recognition, object recognition, and target tracking. In this paper, we introduce SitPAA, a sitting posture and action recognition method based on acoustic waves. Notably, our method utilizes only a single speaker and microphone on a smart device for signal transmission and reception. We apply multiple rounds of denoising to the received signal and introduce a new feature extraction technique. The extracted features are fed into static- and dynamic-oriented networks to achieve precise classification of five distinct postures and four different actions. Additionally, we employ cross-domain recognition to enhance the universality of the classification results. Through extensive experimental validation, our method achieves an average accuracy of 92.08% for posture recognition and 95.1% for action recognition. This underscores the effectiveness of our approach in providing robust and accurate results in the challenging domains of posture and action recognition.


Introduction
In recent years, there has been a growing emphasis on personal health, leading to the emergence of numerous products designed to leverage external factors for health maintenance [1,2]. Given that sitting constitutes a significant portion of daily activities, its impact on health is undeniable [3,4]. Prolonged periods of sitting, particularly prevalent among students and professionals, often result in poor posture [5]. Persistent poor posture can contribute to the development of various musculoskeletal conditions, including obesity, cervical spondylosis, hunchback, and lumbago, among others [6][7][8][9]. Beyond causing physical discomfort, these issues escalate medical expenses and pose a burden on individuals' daily activities, ultimately diminishing work efficiency [10]. An effective sitting posture and action detection system can reduce the incidence of health problems, enhance efficiency, and contribute to overall safety [11]. As such, investing in technologies that facilitate sitting posture detection aligns with the broader goal of fostering healthier lifestyles and improving the quality of life.
Nowadays, propelled by advancements in computer vision and diverse sensing technologies, substantial strides have been made in the realm of sitting posture detection methods. Currently, prevalent methods in the market encompass vision-based, RF-based, and sensor-based approaches. Vision-based methods have garnered widespread popularity owing to their high accuracy and minimal error rates [12]. Typically, a solitary camera proves sufficient to capture ample pertinent information, rendering it a cost-effective and easily deployable solution. However, vision-based solutions depend heavily on lighting and environmental conditions; intense light, shadows, or dust can cause performance fluctuations. Furthermore, privacy concerns arise, as vision-based methods may inadvertently capture user-specific information, potentially leading to privacy breaches. Additionally, the recorded visual data often include a substantial amount of redundant information, consuming significant storage space. While sensor-based solutions excel in extracting desired features accurately, they necessitate wearable devices and supplementary hardware deployment, incurring both cost and inconvenience [13,14]. On the other hand, RF-based solutions exhibit resilience against environmental interference and demonstrate proficiency in recognizing nuanced actions. Nonetheless, the high associated costs and the requirement for wearable devices pose practical challenges [15]. In light of these considerations, striking a balance between accuracy, cost-effectiveness, and privacy remains pivotal in the ongoing evolution of sitting posture detection technologies.
In contrast, acoustic-based solutions emerge as a superior option. Firstly, the relatively slow propagation speed of sound signals facilitates ease of processing, enabling the attainment of high accuracy [16][17][18]. Secondly, acoustic performance is not contingent on light conditions and can be executed in any level of brightness. Notably, recording data within an acoustic scheme avoids compromising user privacy, and audio files, as opposed to video files, occupy less memory. Moreover, the implementation of acoustic solutions requires only a pair of built-in speakers and microphones on the designated device, obviating the necessity for wearable devices or additional hardware deployment. This stands in stark contrast to sensor-based solutions, which often involve cumbersome wearable devices and additional hardware expenses. Additionally, when compared with RF-based solutions, acoustic methods offer the advantage of allowing us to customize the transmission signal format and modulate signals according to specific requirements. This imparts a high degree of flexibility and lowers overall costs, enhancing the feasibility and efficiency of acoustic-based sitting posture detection systems.
In this paper, we present SitPAA, an acoustic-based method for sitting posture recognition and action recognition. SitPAA involves transmitting a modulated Frequency Modulated Continuous Wave (FMCW) signal ranging from 18,000 Hz to 22,000 Hz through the speaker. The FMCW signal reflects upon encountering the human body, and the reflected signal is captured by the smartphone's microphone, as illustrated in Figure 1. After initial denoising and other preprocessing steps, we obtain a relatively pure signal for subsequent classification. The signal segmentation algorithm employed is based on short-term energy accumulation, allowing for classification based on energy intensity. This leads to the identification of two distinct signal types: static (characterized by a low short-term energy level) and dynamic (marked by a high short-term energy level). Leveraging the variance in noise types and human information contained within static and dynamic classes, we have devised two distinct patterns for further denoising, feature extraction, and the design of deep learning networks for classifying static and dynamic signals, as depicted in Figure 2. To enhance the universality of our classification results, we introduced a domain adaptation method, achieving an average accuracy of 92.08% for posture recognition and 95.1% for action recognition in cross-domain classification scenarios. This comprehensive methodology underscores the efficacy of our acoustic-based approach in achieving accurate and versatile sitting posture and action recognition.
Our methodology encounters two main challenges: first, interference from non-target reflections and the direct signal; second, the lack of feature extraction methods tailored to acoustic pose and action recognition. To address the first challenge, we employ the difference method to eliminate static variables in dynamic classes. Additionally, we utilize Ensemble Empirical Mode Decomposition (EEMD) and delineate the range bin to counteract the impact of non-target reflections. For static classes, we proactively record the environmental information and dynamically multiply it by a custom constant k to subtract it from the received signal. This process effectively eliminates the influence of direct signals and partial multipath interference. In response to the second challenge, we introduce a novel model and feature extraction method. This innovative approach involves extracting features related to various postures and actions through coarse-grained localization of approximate human body part positions and fine-grained interception of the relative distances corresponding to percentage energy. By doing so, we bridge the gap in feature extraction methods tailored for sound wave analysis, offering a more suitable and effective solution for pose and action recognition. This paper makes significant contributions in the following areas:

• Innovative Sitting Posture Detection Method: We introduce SitPAA for sitting posture detection by leveraging acoustic waves. This novel approach demonstrates exceptional accuracy in detecting various sitting postures.

• Advanced Feature Extraction Technique: We present a novel feature extraction method designed to efficiently capture distinctive features associated with different postures and movements. Importantly, this method exhibits effectiveness in achieving cross-domain classification, contributing to the versatility of the proposed approach.

• Comprehensive Experimental Validation: To affirm the efficacy of our method, we conducted extensive experiments involving fifteen volunteers. Our evaluation encompasses variations in clothing materials, angles, distances, and decibel noise conditions, providing a comprehensive understanding of the method's performance under different circumstances.

Related Work
This section provides a comprehensive introduction to related work across different methods, including visual, sensor, RF, and acoustic approaches.

Visual-Based Solution
The vision-based approach initially captures human body postures using a camera, extracting essential skeletal or contour features. Subsequently, sitting posture or action recognition is accomplished through computer vision methods, with convolutional neural networks (CNN) being a prevalent choice, as noted by Kulikajevas et al. [12]. Specifically, in the work of Chen et al. [19], OpenPose is employed for extracting human pose features, which are then trained using a CNN. Despite these efforts, the accuracy remains limited due to constrained training data and inherent CNN limitations, reaching only slightly above 90% in their paper. To address these challenges, SPRNet [20] introduces External Attention and a Convolutional Stem, partially optimizing previous models and enhancing accuracy to 99.1%, although it remains sensitive to lighting conditions. Wang et al.'s SEEN [21] achieves cross-domain action recognition with few shots, showcasing advancements in visual-based solutions; it improves accuracy by 2.9% to 8.4% over previous few-shot methods. However, it is important to note that visual-based solutions, including those mentioned above, may compromise user privacy to varying extents. Moreover, the abundance of irrelevant data in images contributes to increased storage requirements.

Sensor-Based Solution
The sensor-based solution involves collecting data from different parts of the human body using either existing or custom-designed sensors. This may include an array of sensors such as pressure sensors, induction sensors, and electromyography sensors, among others. Machine learning methods are then applied to analyze the collected data and determine the pose or action type. Noteworthy examples in the literature showcase the versatility of sensor-based solutions: Ramakrishna Kini et al. [22] utilized pressure sensor data embedded in wheelchairs to automatically identify imbalanced sitting postures, achieving a 99.41% F1-score through the ICA-KD method. Khoshnam et al. [23] enhanced sensor inductance value and resolution by integrating copper wire into elastic fabric as an induction sensor, allowing for precise tracking of the user's torso state when bending forward. Laidi et al. [3] employed low-cost electromyography sensors and low-power Bluetooth to achieve affordable sitting posture detection. Batool et al. [24] significantly enhanced action recognition performance to 95.56% through their research on fused sensors. Kim et al. [25] placed a sensor on the wrist to detect falling movements. Despite various methods aimed at cost reduction and accuracy improvement, the deployment of additional hardware still incurs extra costs. Moreover, users are required to wear devices to establish contact with sensors, introducing inconvenience into their daily lives.

RF-Based Solution
The RF-based method captures phase changes induced by various postures during the transmission and reception of RF signals. Assuming a distance d between the antenna and the tag, and propagation of the RF signal along the direct path, the phase ϕ observed at the receiver can be determined by:

ϕ = (4πd/λ) mod 2π,

where λ is the wavelength. By computing the phase difference, we obtain the relative distance feature, which is subsequently input into the machine learning network for classification. An exemplary system utilizing this RF-based approach is SitR [26], the pioneer in sitting posture recognition systems relying solely on RF signals. SitR successfully identifies seven distinct postures using three tags, with an accuracy of approximately 98.83%.
Another system, Sitsen [27], employs only a lightweight and cost-effective RFID tag placed near the user. Following denoising, effective features are extracted from both time-domain and frequency-domain dimensions to identify sitting postures. While the advantage of not requiring wearable devices is evident, it is important to note that additional hardware deployment remains a necessity.

Acoustic-Based Solution
In recent years, acoustic sensing has gained popularity for its higher resolution and cost-effectiveness. Users leverage the speakers on smart devices to transmit modulated sound waves, and the resulting echo reflected by the human body is captured by the microphone. Through the analysis of characteristics such as phase difference, Doppler frequency shift, and frequency difference in the received signal, crucial information about distance and velocity is extracted. This innovative approach provides an effective means of acquiring precise spatial and movement-related data using readily available components on smart devices. The relationship between the phase difference ∆ϕ and the relative distance can be obtained by the following formula:

∆ϕ = 4πd/λ,

where λ is the wavelength and d is the distance between the device and the target. The Doppler frequency shift and velocity can be obtained by the following formula:

f_D = ((c + v)/(c − v)) f_0,

where f_D is the frequency after the Doppler shift, f_0 is the frequency before the Doppler shift, v is the speed of the target, and c is the speed of sound. The cross-correlation-based TOF yields the distance with the following formula:

d = cτ/2,

where τ can be measured by the main peak after cross-correlation. Acoustic sensing has found applications in diverse fields, including gesture recognition, heartbeat detection, fall detection, and more [28][29][30]. Wang et al. [31] introduced RoubCIR, a non-contact gesture recognition system capable of recognizing 15 distinct gestures with a 98.4% accuracy rate. In another instance, Lian et al. [32] employed a continuous wave with a frequency of 20,000 Hz to capture Doppler frequency shifts induced by falls. This approach exclusively utilizes home audio devices to identify and detect fall events, showcasing the versatility of acoustic sensing in various applications.
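As a minimal illustration of the cross-correlation-based TOF estimate above, the following Python sketch recovers a simulated round-trip delay from the main correlation peak. The chirp parameters match the paper's setup (18-22 kHz, 10 ms sweep, 48 kHz sampling), but the delay value and the simulation itself are invented for illustration; this is not the authors' implementation.

```python
import numpy as np

fs = 48_000          # sampling rate (Hz), as in the paper's setup
c = 340.0            # speed of sound (m/s)

# One 10 ms FMCW sweep from 18 kHz to 22 kHz (instantaneous
# frequency f0 + B*t/Ts, so the phase term uses B/(2*Ts)).
Ts, f0, B = 0.01, 18_000.0, 4_000.0
t = np.arange(0, Ts, 1 / fs)
tx = np.cos(2 * np.pi * (f0 + B * t / (2 * Ts)) * t)

# Simulate an echo delayed by a hypothetical 100 samples.
delay = 100
rx = np.concatenate([np.zeros(delay), tx])[: len(tx)]

# TOF tau from the main peak of the full cross-correlation:
# lag = argmax - (len(tx) - 1) under numpy's 'full' convention.
corr = np.correlate(rx, tx, mode="full")
tau = (np.argmax(corr) - (len(tx) - 1)) / fs
d = c * tau / 2      # round trip -> one-way distance
```

With a 100-sample delay at 48 kHz, τ ≈ 2.08 ms and d ≈ 0.35 m, consistent with the typical seat-to-device distances reported later in the paper.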

Signal Processing
Considering the diverse types of noise and the requirements for subsequent signal processing in this paper, we adopt FMCW to obtain useful sitting posture and action information.

Signal Design and Transmission
The FMCW signal is modeled as follows:

S_t(t) = a cos(2π(f_0 + Bt/(2T_s))t),

where a is the attenuation (a < 1) and f_0 is the initial frequency of the FMCW sweep, as shown in Figure 3.
In this paper, we set f_0 = 18,000 Hz, because frequencies above 18,000 Hz are not audible to the human ear. The sampling rate of a regular mobile phone is 48,000 Hz. According to the Nyquist-Shannon sampling theorem, the maximum frequency we can transmit cannot exceed half of the sampling rate, which is 24,000 Hz. B is the bandwidth; we set B = 4000 Hz. Therefore, the frequency range of the FMCW signal is from 18,000 Hz to 22,000 Hz. T_s is the sweep period; we chose T_s = 10 ms. Since the sampling rate is 48,000 Hz, there are 480 sampling points in a cycle. According to d = cT_s/2, with c = 340 m/s, the maximum detectable distance is 1.7 m, which is sufficient for our sitting posture recognition and action detection. The transmitted signal is received by the microphone after being reflected by the target. Since the microphone and speaker are on the same device and transmit and receive simultaneously, frequency offset does not need to be considered. The received signal is a delayed version of the transmitted signal, which can be represented as:

S_r(t) = a′ S_t(t − τ) + W(t),

where τ is the time delay, a′ is the attenuation along the reflected path, and W(t) is Gaussian white noise. For S_r(t), we first multiply it by S_t(t) and by the quadrature version of S_t(t), which is a sin(2π(f_0 + Bt/(2T_s))t). After applying a low-pass filter, the I (In-Phase) and Q (Quadrature) parts of the mixed signal can be obtained separately. The In-Phase part and Quadrature part are as follows:

I(t) = (aa′/2) cos(2π(Bτ/T_s)t + ϕ_0),
Q(t) = (aa′/2) sin(2π(Bτ/T_s)t + ϕ_0),

where ϕ_0 is a constant phase term determined by τ. The final mixed signal obtained by combining I and Q is as follows:

S_m(t) = I(t) + jQ(t) = (aa′/2) e^{j(2π(Bτ/T_s)t + ϕ_0)}.

This mixed signal can be used to obtain distance information from the target, as shown in Figure 3. Since the received signal is a delayed version of the transmitted signal, the frequency shift ∆f is proportional to the TOF τ. By measuring the frequency shift ∆f at the receiver, we obtain the TOF τ = ∆f T_s/B. Thus, the distance between the target and the receiver can be calculated as d = c∆f T_s/(2B).
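The dechirp-and-FFT step described above can be sketched as follows. This is an idealized simulation (complex-exponential chirps, a hypothetical target at 0.42 m, no noise, and sweep-boundary effects ignored), intended only to show that the beat frequency ∆f maps back to distance via d = c∆f T_s/(2B); it is not the paper's implementation.

```python
import numpy as np

fs, f0, B, Ts, c = 48_000, 18_000.0, 4_000.0, 0.01, 340.0
t = np.arange(0, Ts, 1 / fs)                    # 480 samples per sweep
tx = np.exp(1j * 2 * np.pi * (f0 + B * t / (2 * Ts)) * t)

d_true = 0.42                                   # hypothetical target distance
tau = 2 * d_true / c                            # round-trip delay
rx = np.exp(1j * 2 * np.pi * (f0 + B * (t - tau) / (2 * Ts)) * (t - tau))

# Dechirp: multiply by the conjugate of the transmitted chirp; the
# result is a tone at the beat frequency |df| = B * tau / Ts.
beat = rx * np.conj(tx)
N = 1 << 18                                     # zero-pad for fine resolution
spectrum = np.abs(np.fft.fft(beat, N))
freqs = np.fft.fftfreq(N, 1 / fs)
df = abs(freqs[np.argmax(spectrum)])

d_est = c * df * Ts / (2 * B)                   # distance from beat frequency
```

For d_true = 0.42 m the beat frequency is about 988 Hz, and d_est agrees with d_true to within the FFT bin resolution.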

Preliminary Denoising
Indeed, our received signal encompasses various types of noise, as illustrated in Figure 4. The direct signal (depicted by the red line) is transmitted directly from the speaker to the microphone. The signal reflected from surrounding objects (represented by the green line) can interfere with the reflected signal (shown in blue) originating from the human body. Additionally, the signal includes ambient noise, such as audible sounds (illustrated by the black line) present in the environment. Managing and isolating these different components is crucial for accurate signal processing and reliable results in our acoustic-based method.

To enhance the accuracy of subsequent dynamic and static classification, our focus at this stage primarily revolves around removing low-frequency noise and environmental variables. We employ the difference method to eliminate environmental variables, calculating the difference between adjacent frames to accentuate the dynamic aspects of the signal. This is illustrated in Figure 5, where both dynamic and static energy intensities are initially high and similar before differentiation. The differentiation process significantly amplifies the variation between dynamic and static classes, facilitating a clearer distinction. It is worth noting that we solely use differentiation to distinguish between dynamic and static signals. If the final classification result indicates a static posture, we restore the signal to its state before the differentiation process. This is because differentiation may remove certain components of our static posture data.
Furthermore, low-frequency noise, comprising sounds audible to the human ear, can affect the short-term energy of signal segments. Excessive low-frequency noise in a static posture may lead to misclassification. To address this, we subject the signal to high-pass filtering, effectively removing low-frequency noise and optimizing the signal for accurate classification.

Signal Classification
After obtaining a relatively pure signal through preliminary denoising, we need to classify the signal. In this article, we use the method of short-term energy accumulation for signal classification. The short-term energy is defined as follows:

E_i = ∑_{n=0}^{L−1} [x_i(n)]², i = 0, 1, …, f_n − 1,

where:

x_i(n) = ω(n) · x(i · len + n),

where ω(n) is the Hamming window function, L is the frame length, len is the frame shift length, and f_n is the total number of frames after framing, as shown in Figure 5b. After differentiation, there is a significant difference between the dynamic and static states, so we set an empirical intensity threshold η = 2 × 10⁻⁵ to distinguish between the dynamic and static states.
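This framing-and-threshold step can be sketched compactly as below. The threshold η = 2 × 10⁻⁵ comes from the text, but the frame sizes and the synthetic "static" and "dynamic" signal amplitudes are invented for illustration; real signals would need matching scaling.

```python
import numpy as np

def short_term_energy(x, L=480, shift=240):
    """Hamming-windowed frames; energy = sum of squared windowed samples."""
    w = np.hamming(L)
    fn = 1 + (len(x) - L) // shift            # total number of frames f_n
    return np.array([np.sum((x[i * shift : i * shift + L] * w) ** 2)
                     for i in range(fn)])

eta = 2e-5                                    # empirical threshold from the paper

def is_dynamic(x):
    return bool(np.max(short_term_energy(x)) > eta)

# Synthetic low-energy "static" and high-energy "dynamic" signals.
rng = np.random.default_rng(0)
static = 1e-4 * rng.standard_normal(4800)
dynamic = 1e-2 * rng.standard_normal(4800)
```

Under these invented amplitudes, `is_dynamic(static)` is False and `is_dynamic(dynamic)` is True, mirroring the paper's two-way split before further per-class processing.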

Feature Extraction
In this section, we conduct feature extraction separately for the static postures and dynamic actions.

Static Postures
For static postures, we first need to remove the noise contained in the signal, mainly including environmental noise and the direct signal. As the differentiation process in Section 3.2 was employed solely for distinguishing between dynamic and static signals, the input signal S at this stage should be the signal before differentiation, as illustrated in Figure 2. Similarly, we commence by applying high-pass filtering to the input signal to eliminate the impact of low-frequency environmental noise. Subsequently, the primary noise components remaining in the signal are irrelevant static objects and direct signals in the environment. The next step involves the removal of these static objects and direct signals.
We first position the phone in the environment and record the direct transmission for reference. Subsequently, we employ subtraction, subtracting the pre-recorded signal from the actual signal, with the objective of eliminating interference from the environment and the direct transmission path. However, in practice, the Automatic Gain Control (AGC) [33] in the microphone maintains the received sound at a stable level: if the received sound is too loud, AGC automatically reduces the gain, and vice versa. Due to variations in AGC behavior across different environments, simple subtraction may not achieve optimal cancellation. Thus, we set a dynamic scaling coefficient k to accurately compensate for the difference between the received signal and the pre-recorded signal in each chirp. Specifically, we choose the k that minimizes |S − kS_d| [33], where S is the processed received signal and S_d is the pre-recorded signal. At this minimum, k accurately compensates for the difference between the received signal and the pre-recorded signal, and we compute S − kS_d to remove the direct signal and other static objects.
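One simple way to realize this compensation is a closed-form least-squares fit of k, since minimizing ‖S − kS_d‖ gives k = ⟨S, S_d⟩/⟨S_d, S_d⟩. The sketch below runs on synthetic signals; the 0.7 AGC gain and the noise levels are invented for illustration.

```python
import numpy as np

def agc_compensate(S, S_d):
    """Least-squares k minimizing ||S - k*S_d||, then subtract kS_d."""
    k = np.dot(S, S_d) / np.dot(S_d, S_d)
    return S - k * S_d, k

rng = np.random.default_rng(1)
S_d = rng.standard_normal(480)            # pre-recorded direct/static component
echo = 0.05 * rng.standard_normal(480)    # stand-in for the body reflection
S = 0.7 * S_d + echo                      # AGC scaled the direct path by 0.7

cleaned, k = agc_compensate(S, S_d)       # k recovers ~0.7; cleaned ~ echo
```

The recovered k tracks the unknown AGC gain, so the residual `cleaned` retains mainly the body reflection.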
Following the aforementioned processing steps, we obtain a time-range image of the static pose. Given the static nature of the posture, any frame can be selected for analysis. Before generating the time-range graph, we limit our consideration to the impact after 86 points per frame. This choice is based on the necessity to maintain an appropriate distance between the seat and the equipment, adhering to standard sitting posture specifications. Moreover, it helps eliminate the influence of direct signal sidelobes and the multipath from the preceding frame. In Figure 6, we present a frame image depicting three postures: standard, anteversion, and hypsokinesis. Compared to the standard posture (blue line; first peak at approximately 0.42 m), anteversion has a smaller distance from the device (red line; first peak at approximately 0.34 m), and due to the forward leaning of the body, the energy intensity at close range (0.34 m) is significantly increased. The opposite is true for hypsokinesis, which is relatively far from the device (yellow line; first peak at about 0.46 m), and the overall energy intensity is relatively low. However, because an originally close position has moved further away, the energy at a long distance (about 0.6 m) is significantly increased. According to the standard sitting posture, the portion at the same height as the desktop corresponds to the human chest; moving from near to far, the signal passes the chest, shoulders, neck, and head, corresponding to the peak positions in the figure. When our posture changes, the absolute distance (d_i) between different parts and the signal source, as well as the absolute distance difference (d_i − d_0) from the nearest reflection point, will also change, as shown in Figure 7. Given that d_i is subject to changes based on the environment, our focus here is on the variation in d_i − d_0. We intend to leverage the relationship between the energy proportion of different segments in various postures to deduce the changes in d_i − d_0. The position of d_0 corresponds to the closest reflection point, which can be determined from the location of the first peak after the received signal and the transmitted signal are cross-correlated. In specific terms, we initiate the process by calculating the total energy for each frame as depicted in Figure 6. This total energy signifies the cumulative energy reflected by the human body at the given moment. Drawing on anthropometric principles, we ascertain the proportion of reflective areas, corresponding to the body surface area, for specific regions such as the chest, abdomen, arms, shoulders, neck, and head [34]. An essential premise for applying this method is the positive correlation between the area of the reflective surface and the intensity of the reflected signal. By extracting distances corresponding to percentages from near to far, we obtain the absolute distance difference (d_i − d_0) between the center positions of each body part and the nearest reflection point. When our posture changes, such as leaning forward, each body part above the abdomen will approach the signal source. However, the degree of proximity, indicated by the absolute distance difference (d_i − d_0) from the nearest reflection point, varies across body parts. Notably, the change in head distance is the most significant, causing the head to move forward from the distance bin and resulting in an increase in energy in that distance bin, as illustrated by the red line in Figure 6. Considering a total of six body parts, we extract six-dimensional absolute distance differences as features.
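The percentage-energy-to-distance mapping described above might be sketched as follows. The six body-part proportions and the toy energy-vs-range profile are invented placeholders (not the anthropometric values of [34]), and the range-bin width is assumed; the sketch only shows the mechanics of reading d_i − d_0 off the cumulative energy curve.

```python
import numpy as np

# Hypothetical surface-area proportions for chest, abdomen, arms,
# shoulders, neck, head (sum to 1; illustrative values only).
proportions = np.array([0.30, 0.25, 0.18, 0.12, 0.08, 0.07])

def distance_features(profile, bin_m=0.01):
    """Map cumulative-energy percentiles (near -> far) to d_i - d_0."""
    d0_idx = np.argmax(profile > 0)           # nearest reflection point d_0
    cum = np.cumsum(profile) / np.sum(profile)
    # centre of each part: cumulative percentage at the middle of its band
    centres = np.cumsum(proportions) - proportions / 2
    idx = np.searchsorted(cum, centres)       # range bin of each part centre
    return (idx - d0_idx) * bin_m             # six-dimensional feature vector

# Toy energy-vs-range profile (energy per range bin).
profile = np.zeros(100)
profile[40:70] = np.linspace(1.0, 0.2, 30)
features = distance_features(profile)
```

The output is a six-dimensional, non-decreasing vector of relative distances, matching the feature dimensionality stated in the text.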

Dynamic Actions
For dynamic signals, additional denoising is required beyond the preliminary denoising in Section 3.2. At this stage, our primary focus is on addressing the impact of minor movements, such as body shaking, breathing, heartbeat, etc. These subtle actions pose a challenge for removal using the method in Section 3.2 due to their low energy intensity. The Empirical Mode Decomposition (EMD) algorithm has demonstrated potential in handling this issue [35]. However, EMD is prone to problems like mode mixing and endpoint effects [36]. Therefore, we employ the Ensemble Empirical Mode Decomposition (EEMD) algorithm to effectively eliminate micro-motions induced by human activity [37].
We first set the number of ensemble trials m, which is equal to 6 in this article. Then, we add Gaussian white noise G_i(t) to the original signal S(t):

S_i(t) = S(t) + G_i(t), i = 1, …, m.

Because the added noise has a mean of 0, the noise of the signal itself is masked by the multiple artificially added noise realizations, thereby yielding more accurate upper and lower envelopes; averaging the decomposition results then counteracts the added white noise, effectively suppressing mode mixing. Each noisy copy is decomposed by EMD:

S_i(t) = ∑_j a_{i,j}(t) + r_i(t),

where a_{i,j}(t) is the j-th mode of the i-th EMD decomposition and r_i(t) is the residual function. Finally, we perform ensemble averaging over a_{i,j}(t) to obtain the final modes after EEMD decomposition:

a_j(t) = (1/m) ∑_{i=1}^{m} a_{i,j}(t).
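The ensemble-averaging structure of EEMD can be sketched as below. A full EMD sifting routine is out of scope here, so the `emd` argument is a stand-in: any function returning a fixed number of modes; the `toy_emd` used to exercise it is a trivial trend/detail split, not a real EMD. The noise level and seed are likewise illustrative.

```python
import numpy as np

def eemd(signal, emd, m=6, noise_std=0.2, seed=0):
    """EEMD skeleton: average the modes of m noise-perturbed EMD runs."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(m):
        noisy = signal + noise_std * rng.standard_normal(len(signal))
        runs.append(emd(noisy))               # modes a_{i,j}(t) for run i
    return np.mean(runs, axis=0)              # ensemble average over i

def toy_emd(x, win=25):
    """Placeholder 'EMD': moving-average trend plus the residual detail."""
    trend = np.convolve(x, np.ones(win) / win, mode="same")
    return np.stack([x - trend, trend])

t = np.linspace(0, 1, 500)
clean = np.sin(2 * np.pi * 5 * t)
modes = eemd(clean, toy_emd)                  # shape (2, 500)
```

Because each run's modes sum back to its noisy input, the averaged modes sum to approximately the clean signal, illustrating how the zero-mean perturbations cancel.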
After applying EEMD, we generate a time-range diagram illustrating the four dynamic actions, as shown in the figure. Our process involves initially extracting the position of the point with the strongest energy in each frame, representing the current body position. Subsequently, we extract the positions of the nearest and farthest boundaries, indicating the nearest and farthest ends of the action. Finally, since actions can be perceived as the superposition of many static postures in the time dimension, we continue to utilize the feature extraction method for static postures, incorporating an additional time dimension. At this stage, we have obtained a total of nine-dimensional features for dynamic actions.

Network Design
In this section, we classify the extracted static and dynamic features separately. Since the two sets of features have different dimensions, we need to design separate static and dynamic networks.

Static Networks
To better capture features in static postures, we designed a residual network, as shown in Figure 8. The convolutional layers use one-dimensional convolution with a FilterSize of 3, and the number of filters is set to 16. We employ batchNormalizationLayer for data normalization to prevent overfitting and gradient vanishing. The ReLU layer introduces non-linear relationships between layers, enhancing network sparsity. Before the residual blocks, we use maxPooling1dLayer for pooling with a pooling region width of 3, aiding in dimensionality reduction and feature extraction to improve model generalization. To prevent model degradation, we introduce four ResNet blocks. Each ResNet block has two paths: one through multiple convolutional layers and the other through a single convolutional layer or an identity path. This design guides the network to discern differences between them and enhances operational speed. In the first two ResNet blocks, the convolutional layer parameters are consistent with the preceding layers, while in the third and fourth ResNet blocks, the number of filters is set to 32. After the ResNet blocks, a dropout layer with a dropout rate of 0.5 is introduced, followed by a globalAveragePooling1dLayer. The fully connected layer has an output size of 5, and the final classification is achieved through a Softmax layer with five categories.

Dynamic Networks
As shown in Figure 9, the dynamic and static networks are largely similar, with the main difference lying in the structure of the input data. The static input data have a size of 6 × 1, while the dynamic data features are nine-dimensional, with a length related to the signal length. After the ResNet blocks, to capture the temporal features of dynamic actions, we incorporate two bidirectional LSTM (BiLSTM) layers with 128 units each, ultimately achieving four-class classification.

Domain Adaptation
In our experiments, we observed that individuals exhibit variations in posture and action habits, such as differences in the amplitude and speed of actions. In theory, with a sufficiently large dataset, it is possible to identify different actions and postures, but this variation may impact classification accuracy. To address this issue, we employed domain adaptation techniques.
For data augmentation, we initially applied compression and stretching to the signals to simulate scenarios with different speeds. Subsequently, we added a certain value to the signals to simulate actions with different amplitudes. Lastly, we introduced random noise to simulate various environmental and hardware noise scenarios. The purpose of data augmentation is to increase the size of the training set, providing the model with better robustness when facing different situations.
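These three augmentations might be sketched as below. The speed change is implemented as linear-interpolation resampling, and the amplitude change is implemented here as a gain (one plausible reading of the text; the paper's exact offset or scale is not specified). All parameter values are invented placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

def augment(sig, speed=1.2, gain=1.1, noise_std=0.01):
    """Speed change (resample), amplitude change, additive random noise."""
    # Compress/stretch: resample onto a shorter/longer time axis.
    n = int(len(sig) / speed)
    stretched = np.interp(np.linspace(0, len(sig) - 1, n),
                          np.arange(len(sig)), sig)
    # Amplitude perturbation plus simulated environment/hardware noise.
    return gain * stretched + noise_std * rng.standard_normal(n)

x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 480))
aug = augment(x)          # 1.2x faster, 1.1x amplitude, noisy copy of x
```

Applying `augment` with randomized `speed`, `gain`, and `noise_std` per sample would expand the training set in the way the paragraph describes.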
Generally, neural networks are trained by minimizing cross-entropy. In the context of cross-domain recognition, we define the cross-entropy loss function as:

E(A) = −(1/N_s) ∑_{i=1}^{N_s} ∑_c y_{i,c} log(ŷ_{i,c}),

where N_s is the number of samples, y_{i,c} indicates whether sample i belongs to class c, and ŷ_{i,c} is the predicted probability. Assuming the parameter set to be optimized is A, a gradient descent algorithm is used, where µ represents the learning rate:

A ← A − µ∇_A E(A).

Then, we used the method in Wi-learner [38] to construct a meta-objective, min ∑ E(A′), to find parameters that minimize the total loss over all tasks, and then used the stochastic gradient descent (SGD) algorithm for optimization and parameter updates.

Implementation and Evaluation

Experimental Setup

Environment

The SitPAA system sends FMCW signals in the range of 18,000 Hz to 22,000 Hz through the speakers of a Lenovo V14 laptop and receives them using the single microphone on the same device. The experimental chair and table heights follow the national standard (chair 420 mm, table 720 mm). The laptop is placed in the middle of the table, facing the volunteers, at a distance between 35 cm and 50 cm.

Data Collection
We collected a total of 6200 samples (5200 static and 1000 dynamic) from volunteers in an open and quiet environment to train both the static and dynamic networks. The volunteers providing data have heights ranging from 165 cm to 190 cm and ages ranging from 13 to 55. The training set, test set, and validation set are divided in a ratio of 8:1:1. In the in-domain experiment, the test data for the static and dynamic classes come from the same volunteer. While controlling other variables, we collected data under 2 noise settings, 4 types of fabric clothes, 8 environmental conditions, 7 angles, and 13 distances. In the cross-domain experiment, we controlled variables such as noise, fabric type, and environment, collecting data from 15 different volunteers.

Evaluating Indicator
We use accuracy, precision, recall, and F1-score to evaluate SitPAA:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN) = TP / P
F1-score = 2 × Precision × Recall / (Precision + Recall)

where TP (True Positive) means the sample is actually positive and the model predicts it as positive, TN (True Negative) means the sample is actually negative and the model predicts it as negative, FP (False Positive) means the sample is actually negative but the model predicts it as positive, FN (False Negative) means the sample is actually positive but the model predicts it as negative, and P is the number of positive samples.
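The four metrics follow directly from the confusion counts defined above; the counts in the example call are illustrative, not results from the paper:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1-score from the
    confusion-matrix counts defined above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only:
acc, prec, rec, f1 = classification_metrics(tp=90, tn=85, fp=10, fn=15)
```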

Static Class Evaluation
The overall performance of the static network is shown in Figure 10a. We distinguish a total of five static postures (S1 to S5): standard, anteversion, hypsokinesis, lying on the table, and holding one's head. The average accuracy on our model's validation set is 98.05%. To evaluate performance more thoroughly, we considered the effects of clothing material, experimental environment, noise, distance, and angle. Each time we collected experimental data, we ensured that all other variables remained unchanged, so as to isolate the impact of a single factor. Considering differences in user clothing, we evaluated the impact of clothing material on model accuracy. Figure 11 shows the impact of four clothing materials, namely cotton, skin, fiber, and silk, reporting the precision, recall, and F1-score for each. Although the F1-score, as the harmonic mean of precision and recall, reflects the robustness of the model, in this experiment we focus more on precision, because our goal is to classify correctly. The average accuracy for the four materials is 89.59%, 90.93%, 93.60%, and 90.13%, respectively. From Figure 11, we can see that the accuracy for hypsokinesis is generally low: this posture is farthest from our equipment, so the reflected signal has lower energy and a higher probability of misclassification. Different clothing materials lead to different acoustic reflectivity, and thus to differences in received energy. However, in the feature extraction process we extract relative-distance features rather than energy features. Since material does not affect relative distance, clothing material has little impact on our accuracy.
Figure 12 shows the effects of eight different experimental environments: scene1: office, scene2: classroom, scene3: restaurant, scene4: dormitory, scene5: home, scene6: an empty home, scene7: library, and scene8: car. The main differences between environments are the number of static objects and the degree of clutter. The average accuracy across the eight environments is 90.27%, 90.80%, 91.73%, 89.47%, 91.2%, 94.8%, 92.27%, and 91.6%, respectively. In Sections 3 and 4, we removed irrelevant static objects by subtracting pre-recorded environmental signals through differentiation, and before obtaining the time-range graph we only considered the nodes after 86 in each frame. Therefore, there is almost no difference in model performance between environments, which meets practical needs. However, due to hardware limitations, low-frequency noise filtering is not thorough enough, which degrades model performance under heavy noise (≥80 dB), as shown in Figure 13a. Next, we evaluated the impact of the distance between users and the smart device on accuracy, as shown in Figure 14a. We collected volunteer data every 10 cm from 40 cm to 160 cm. Since we only consider the nodes after 86, we started testing at 40 cm and continued to the theoretically detectable maximum distance; in fact, the reflected signal strength beyond 1.2 m is already very low. The graph shows that accuracy decreases as distance increases: within 80 cm, the model achieves an average accuracy of about 90%, while beyond 120 cm accuracy drops to about 20%. This is due to signal strength attenuation, which leaves too little information for posture recognition at long range. These results are consistent with the generally lower long-range accuracy discussed above and show that the model achieves high performance within 80 cm.
Figure 14b shows the impact of angle on model performance. We collected volunteer data every 15 degrees. The angle here is measured in the horizontal plane between the device's line of sight to the user and the perpendicular line from the device to the human body, with 0 degrees indicating that the device directly faces the user. The feature extraction method of Section 4 implies that as the angle increases, the relative distances between the body and the device change, so the extracted features no longer represent the correct category. Accordingly, model performance gradually decreases as the angle increases, reaching an accuracy of only about 20% at 90 degrees. Experiments show that high performance can be achieved within 30 degrees.
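A simplified geometric sketch illustrates why relative-distance features degrade with angle: two body parts separated along the device-user axis at 0 degrees become nearly equidistant from the device as the angle approaches 90 degrees. The positions and ranges below are assumptions chosen for illustration, not the paper's feature extractor:

```python
import numpy as np

def radial_separation(d, R, angle_deg):
    """Range difference the device measures between two body parts.

    Illustrative geometry (assumed, not from the paper): part 1 sits at
    the origin, part 2 is offset by `d` metres along the 0-degree
    device-user axis, and the device sits at range `R` from part 1 at
    `angle_deg` off that axis.
    """
    theta = np.radians(angle_deg)
    device = R * np.array([np.cos(theta), np.sin(theta)])
    p1 = np.array([0.0, 0.0])   # reference body part
    p2 = np.array([d, 0.0])     # second part, offset along the 0-deg axis
    return np.linalg.norm(device - p1) - np.linalg.norm(device - p2)

# A 20 cm body-part offset seen from 50 cm away: the measured range
# separation shrinks toward zero as the angle grows.
seps = [radial_separation(0.20, 0.50, ang) for ang in (0, 30, 60, 90)]
```

At 0 degrees the full 20 cm offset appears in the range profile; near 90 degrees both parts are almost equidistant from the device, so the relative-distance feature collapses, matching the observed accuracy drop.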

Dynamic Class Evaluation
The overall performance of the dynamic network is shown in Figure 10b. We distinguish four daily actions (D1 to D4): leaning forward, leaning back, hanging the head, and raising the head. The average accuracy on our model's validation set is 98.25%. We again considered the effects of clothing material, experimental environment, noise, distance, and angle, controlling all other variables during each round of data collection.
Figures 15-17 show, in turn, the effects of clothing material, experimental environment, noise, distance, and angle on action recognition. Compared to static postures, dynamic actions achieve higher overall accuracy because their additional temporal dimension provides more features. Specifically, we again selected four materials, i.e., cotton, skin, fiber, and silk, for validation, with average accuracy rates of 97.33%, 98.05%, 95.9%, and 95.9%, respectively. As in the static experiments, clothing material did not have a significant impact on accuracy. We further verified the accuracy of dynamic actions in the same eight environments as the static postures (scene1: office, scene2: classroom, scene3: restaurant, scene4: dormitory, scene5: home, scene6: an empty home, scene7: library, scene8: car), obtaining average accuracies of 96.58%, 96.68%, 96.95%, 96.9%, 95.9%, 98.48%, 95.53%, and 93.63%, respectively. Noise also affects system performance: taking 80 dB as the boundary, the average accuracy below 80 dB was 96.33%, which is 2.28% higher than the average accuracy of 94.05% above 80 dB. The energy contained in dynamic actions is greater than in static postures, so the model performs well within 130 cm (with an accuracy of approximately 90%). Larger angles distort the overall features, so model performance gradually decreases as the angle increases; the accuracy is only 20.12% at 90 degrees, while more than 90% accuracy is achieved within 30 degrees.

Cross-Domain Evaluation
In this section, we assess the model's cross-domain classification accuracy. In general, factors such as clothing, environment, location, and individual users cannot be exhaustively enumerated, which necessitates cross-domain transfer learning. However, in this study we successfully filtered out the impact of clothing, environment, and similar factors, so they do not affect cross-domain recognition. Our focus is therefore solely on cross-domain recognition across different individuals.
We collected and analyzed data from 15 distinct volunteers in a spacious location, keeping other unrelated variables consistent. Figure 18 presents the cross-domain recognition accuracy for four of the volunteers. Overall, accuracy consistently improves as the number of learning sample points increases. Owing to the more intricate characteristics of dynamic movements, the average accuracy for dynamic actions surpasses that of static postures. Generally, only two sample points are needed to reach a cross-domain learning accuracy of 90%, and a stable average accuracy exceeding 95% is achieved with just five sample points. After cross-domain learning, the overall average accuracy of the static and dynamic systems is 92.08% and 95.1%, respectively.

Discussion on Limitations
In this section, we discuss the limitations of the system. The microphone and speaker used in our experiments are both located on the keyboard side of the laptop. To examine other placements, we switched to a smartphone (vivo iQOO 5) and tried four different combinations of its speakers and microphones, where the bottom speaker and microphone are positioned on the side facing the human body.
When using the bottom speaker to send signals and the bottom microphone to receive them, the initial network's accuracy is 97.79% for the static class and 98.4% for the dynamic class, comparable to the laptop's 98.05% and 98.25%, respectively. With the bottom speaker and the top microphone, accuracy is 92.3% for the static class and 94.33% for the dynamic class. With the top speaker and the bottom microphone, accuracy is 87.12% and 90.6%, respectively. With the top speaker and the top microphone, accuracy drops to 71.14% and 72.85%.
There are two reasons for these results. First, the top microphone is farther from the speaker, which significantly increases the impact of direct signal transmission. Second, when the microphone and speaker are not oriented toward the human body, more noise is generated (such as additional desktop reflections). It is therefore necessary to choose a microphone and speaker combination that faces the human body and keeps the two components relatively close together.

Conclusions
In this paper, we propose a sitting posture detection system based on FMCW, capable of dual recognition of static postures and dynamic actions. The system employs a single speaker and microphone on a smart device for signal transmission and reception. We conducted multiple rounds of denoising on the reflected signals. By introducing a novel feature extraction method that focuses solely on relative positions, we eliminated the impact of energy on recognition accuracy. Addressing the challenge of incomplete input samples, we achieved cross-domain recognition. Performance evaluation under various real-world conditions shows that SitPAA achieves an average accuracy of 92.08% for posture recognition and 95.1% for action recognition.

Figure 2. The overall process of SitPAA.

Figure 4. Various noises encountered in the actual environment.

Figure 5. The influence of difference on static postures and dynamic actions.

Figure 6. Energy and distance map for the standard, anteversion, and hypsokinesis postures; the dashed line marks the distance at which the maximum peak value is located.

Figure 7. The map of the distance between different parts (d_i) and the closest position (d_0).

Figure 8. Our network for classifying static sitting postures.

Figure 9. Our network for classifying dynamic sitting actions.

Figure 10. The classification accuracy of the original static and dynamic networks; blue represents the percentage of successfully identified samples, while light and white represent the percentage of incorrectly identified samples, with white specifically indicating that the number of samples is 0.

Figure 11. The impact of different kinds of clothes on static postures.

Figure 12. The impact of different kinds of scenes on static postures.

Figure 13. The impact of noise on classification accuracy.

Figure 14. The impact of different distances and angles from the device on static classification accuracy.

Figure 15. The impact of different clothing materials on different actions.

Figure 16. The impact of different scenes on different actions.

Figure 17. The impact of different distances and angles from the device on dynamic classification accuracy.

Figure 18. The impact of different numbers of learning points on classification accuracy after cross-domain learning.