A Resource Constrained Neural Network for the Design of Embedded Human Posture Recognition Systems

This paper presents a custom HW design of a Fully Convolutional Neural Network (FCN) that implements an embeddable Human Posture Recognition (HPR) system capable of very high accuracy for both lying and sitting posture recognition. The FCN exploits a new base-2 quantization scheme for the weights, together with binarized activations, to reach the best trade-off between low power dissipation, a very small set of instantiated physical resources and state-of-the-art accuracy in classifying human postures. By using only a limited number of pressure sensors, the optimized HW implementation keeps the computation close to the data sources, according to the edge computing paradigm, and enables the design of embedded HPR systems. The FCN can be easily reconfigured for either lying or sitting posture recognition. Tested on a public dataset for in-bed posture classification, the proposed FCN obtains a mean accuracy of 96.77% in recognizing 17 different postures, while a small custom dataset has been used for training and testing sitting posture recognition, where the FCN achieves 98.88% accuracy in recognizing eight positions. The FCN has been prototyped on a Xilinx Artix 7 FPGA, where it exhibits a dynamic power dissipation lower than 11 mW and 7 mW for lying and sitting posture recognition, respectively, and a maximum operation frequency of 26.6 MHz and 47.64 MHz, corresponding to a sensor Output Data Rate (ODR) of 9.13 kHz and 16.50 kHz, respectively. Furthermore, synthesis results with a 130 nm CMOS technology are reported, to estimate the feasibility of an in-sensor circuit implementation.


Introduction
The monitoring and interpretation of the static and dynamic behavior of the human body are very attractive for a number of applications, ranging from biomedical to industrial and automotive [1,2]. Although the capability to classify human postures can be considered a specific subset of Human Activity Recognition (HAR) [3][4][5], it requires dedicated technological solutions, very different from those of HAR, and is very important in particular application fields. One example is the improvement of the quality of life, which can be significantly compromised by prolonged poor sitting and lying postures [6][7][8], causing serious health problems such as pressure ulcers, cervical and back diseases, and complex muscle and skeletal deformations. Professional vehicle drivers, like taxi, truck and farm tractor drivers, often suffer from Muscular-Skeletal Diseases (MSD) due to long periods of sitting [8,9]. Furthermore, automotive applications have been pushed by the updated safety protocols for autonomous vehicles, which require that the driver's posture be monitored not only to verify their readiness to take over control in warning situations, but also to assess perceived (dis)comfort [10]. Posture is one of the main parameters affecting the safety and health of a sitting person [11]; it has been demonstrated that posture changes and macro/micro movements are the first indicators of discomfort or pain increasing over time [12]. Thus, human posture monitoring can help in understanding what is happening to the driver and passengers and in applying countermeasures to reduce stress, discomfort and consequent errors while driving. Pressure at the interface between the seat and the human body is widely used as a good indicator to evaluate the perceived (dis)comfort and to identify movements and postures on the seat [13].
Human Posture Recognition (HPR) has recently been implemented by using Machine Learning (ML) approaches [14,15] in conjunction with mechanical or image sensors [16][17][18]. In particular, recent papers deal with the use of machine learning and AI for posture tracking and recognition in order to improve car drivers' safety and car occupants' comfort. In Lin et al. [19], computer vision technology was used to recognize the head and neck posture in order to detect drivers' drowsiness or sleepiness. As early as 2010, Wachs et al. [20] used Neural Networks for parts-based detection of human body parts, both inside and outside the car. That research focused on techniques for identifying drivers' or pedestrians' postures for safety reasons. The need to acquire and recognize a posture with a contactless or embedded system has become one of the topics studied for future vehicle development. In 2017, Loeb et al. [21] used Kinect™ to track the body segments of car occupants (especially very young occupants) in order to recognize occupant posture and improve safety in case of an accident. In 2020, Zhao et al. [22] proposed a head pose estimation method based on deep learning applied to images, demonstrating that the head pose could be used as the basis for distraction detection.
Although image-based HPR systems have been favored by the advancements in image processing methods of recent years, they introduce important issues related to the privacy of the people captured by the camera, the poor performance when the subject is partially occluded, and the cost of the systems. On the other hand, pressure sensors not only allow the design of much more compact systems, which can be embedded in specific supports like chairs and beds, but are also able to capture small body deformations more accurately than cameras [23]. Indeed, to study the posture of the human body on a mattress, a pressure pad is often placed between the body and the mattress, and its data are analyzed in different ways [17,24] to prevent the formation of bedsores in long-term patients. The same approach [25] can be used for the sitting posture, since contact pressure is the only way to support body weight and it influences posture and comfort [26]. However, recent works have demonstrated that the accuracy of such HPR systems strongly depends on the careful distribution of the sensors at specific key-points of the chair, bed or any other kind of support equipped with the system, in order to acquire data from as many body parts as possible while avoiding an excessive number of sensors [27]. Therefore, the reliability of pressure-based HPR systems appears much too dependent on the shapes of the specific supports. Moreover, despite the reduced number of sensors, the overall extension of the system is not negligible, considering the processing unit and the connections between it and the sensors. On the other hand, other solutions exhibit a computational load scarcely compatible with embedded systems with autonomous power supplies [3][4][5].
In this paper, a custom HW design of a tiny Fully Convolutional Network (FCN) is presented to implement an HPR system that combines high recognition accuracy with low-energy and low-area requirements better than the existing literature, in order to extend the application range. Indeed, the FCN achieves state-of-the-art recognition accuracy for both lying and sitting postures by exploiting only pressure sensors grouped in a small area close to the FCN, according to the edge-computing paradigm [28,29], without any particular distribution strategy. The FCN implements an end-to-end classification by exploiting a base-2 quantization scheme for the weights and binarized activations [30,31] to reach the best trade-off between high recognition accuracy, the number of mapped physical resources and low power consumption [32,33]. The FCN achieves an average accuracy of 96.77% and 98.88% in classifying lying and sitting postures, respectively. The main advantages of the proposed system over the existing literature can be summarized in the following points:
• The FCN achieves high recognition accuracy by monitoring only the footprint of the human body in a limited region covered by a reduced number of pressure sensors.
• No sensor placement strategy is needed; that is, the system reliability does not depend on the specific support.
• The FCN can be easily reconfigured for different applications. Case studies are presented on lying and sitting posture recognition.
• The FCN provides end-to-end classification by using a quantization scheme that outperforms its binarized and ternary counterparts in terms of accuracy and reaches the best trade-off between accuracy and the physical resources employed for the HW implementation.
Implemented on a small Xilinx Artix 7 FPGA, the FCN dissipates 10.40 mW of dynamic power and achieves a maximum operation frequency of 26.6 MHz, corresponding to sensors with an Output Data Rate (ODR) of 9.13 kHz, when used for lying posture recognition. When used for sitting posture recognition, the FCN is reconfigured to use fewer physical resources and achieves 6.88 mW of dynamic power dissipation and a maximum operation frequency of 47.64 MHz, compatible with a sensor ODR of 16.50 kHz, which is very important for critical applications requiring continuous monitoring and real-time action in an emergency. In order to explore the possibility of embedding the proposed accelerator in in-sensor circuitry, an analysis has been carried out with a conservative TSMC LP-HVT CMOS 130 nm technology, which is compatible with that of the glue logic in modern MEMS. Synthesis results obtained with the Cadence toolchain return a power dissipation of 425 µW/MHz and 1.7 mW, respectively, at the maximum operating frequency of 40 MHz, and an area occupation of 1.78 mm² when the FCN is configured for lying posture recognition, which supports an ODR up to 8.7 kHz in real time.
The remainder of the paper is organized as follows: Section 2 describes the proposed models; design choice and architecture of the HW accelerator are discussed in Section 3; implementation results are presented in Section 4; comparisons with the state-of-the-art are discussed in Section 5; Section 6 concludes the paper.

The Proposed System and the Underlying Model
The HPR system has been designed according to the scheme in Figure 1. The FCN processes data from commercial pressure sensors and classifies them in a number of classes depending on the specific application. As a case study, a Medilogic ® Seat Pressure carpet has been used for sitting posture classification of 8 classes, while data from a public dataset [34], obtained with a quite similar acquisition system, have been used for lying posture classification of 17 classes. The FCN can be reconfigured for the two applications by easily adapting the input and output layers to the different number of input sensors and output classes, respectively.

The FCN Model
The FCN is schematized in Figure 2. It is composed of 3 convolutional (CONV) layers, a Global Average Pooling (GAP) layer and a dense fully connected layer. In order to reduce the number of physical resources for the HW implementation of the network, all the weights of the CONV layers have been quantized. As will be shown in the next section, the conventional binary and ternary quantization schemes give an unacceptably low accuracy for lying posture recognition. Therefore, a quantization scheme has been introduced which exploits weights from the set {−2, −1, 0, +1, +2}, in place of {−1, 0, +1} of ternary and {−1, +1} of binarized neural networks, selected according to the criteria of Equation (1). Additionally, all the activations have been binarized, and the activation functions have been reduced to the form of Equation (2) [4]. The advantages of this choice with respect to a full precision implementation can be roughly estimated as follows: memory requirements are reduced by about 1 order of magnitude (a factor of 32/3 for quantized weights and 32 for binarized activations), since weights are coded with 3 bits and activations with 1 bit; Multiply-Accumulate (MAC) operations, typically required to implement the convolutions, are simplified into Shift-Accumulate (SAC) operations thanks to the absence of floating point (FP) multiplications, with a consequent reduction of about 2 orders of magnitude in the number of FPGA LUTs. Each CONV layer in Figure 2 is followed by a Batch Normalization (BN) layer where, as schematized in more detail in Figure 3, each sample is shifted by the mean value µ and scaled by a factor σ, both defined during training and stored in a dedicated memory. No padding has been used. The fourth stage is a Global Average Pooling (GAP), which is very robust to translations of the inputs and enables the Class Activation Map (CAM), which, together with the SoftMax, provides the final classification with fewer resources than the typical dense layers of CNNs [35].
The GAP also reduces the dimensions of the network with respect to a conventional MaxPool, since it transforms N − 30 inputs into 1 and reduces the complexity of the following stage. The last stage is composed of a Fully Connected (FC) layer and a SoftMax classifier. The output of this last stage represents the probability of belonging to each output class; therefore, the number of units of the fully connected layer corresponds to the number of considered classes. As shown in Table 1, the computational complexity of each layer, in terms of required math operations and memory requirements, changes depending on the specific application. When the FCN is used for lying posture recognition, it receives 108 input samples coded with 12 bits, representing a snapshot of the posture. The first and second CONV stages are composed of 24 one-dimensional filters of length 11, and the third of 32 filters. Depending on the number of classes to be considered, the output layer produces 17 and 8 values for lying and sitting postures, respectively.
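The layer dimensions described above can be traced with a short sketch. It assumes stride-1 convolutions and a filter length of 11 in all three CONV layers; neither is stated explicitly in the text, but both are consistent with the N − 30 inputs reaching the GAP when no padding is used. The function names are illustrative only.

```python
def valid_conv_length(n_in: int, kernel: int = 11) -> int:
    """Output length of a stride-1, unpadded 1-D convolution (assumed here)."""
    return n_in - kernel + 1


def fcn_shapes(n_inputs: int, n_classes: int, filters=(24, 24, 32), kernel: int = 11):
    """Return (stage name, length, channels) for every stage of the FCN."""
    shapes = [("input", n_inputs, 1)]
    length = n_inputs
    for index, n_filters in enumerate(filters, start=1):
        length = valid_conv_length(length, kernel)  # each CONV removes kernel - 1 samples
        shapes.append((f"CONV{index}", length, n_filters))
    shapes.append(("GAP", 1, filters[-1]))   # one value per channel
    shapes.append(("FC", 1, n_classes))      # one score per class
    return shapes


if __name__ == "__main__":
    for name, n_inputs, n_classes in (("lying", 108, 17), ("sitting", 56, 8)):
        print(name, fcn_shapes(n_inputs, n_classes))
```

Under these assumptions the lying configuration goes from 108 samples to 98, 88 and 78 (that is, N − 30) before the GAP, and the sitting configuration from 56 to 46, 36 and 26.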

FCN Training and Accuracy Results
Keras and Larq tools have been used to describe the FCN model. In order to prove the performance of the system in two of the most interesting contexts for HPR in biomedical and industrial application fields, two datasets have been employed for lying and sitting posture recognition, respectively. The public PmatData dataset in Table 2 has been specifically designed for in-bed posture classification [35]. The pressure data have been collected by using a Vista Medical FSA SoftFlex 2048, equipped with 2048 pressure sensors of 1 inch² each, placed on a 32 × 64 grid. The sensors provide output values coded with 12 bits and normalized in the range [0, 1]. The dataset is composed of the 17 postures listed in Table 2, sampled at 1 Hz and taken from 13 participants whose physical characteristics range in the intervals: [19,34]. The sparse categorical cross-entropy loss function has been used for training, with 100 epochs, a batch size of 20 and a learning rate of 5 × 10⁻⁴. Initial tests on the dataset showed that the 2048 input samples provided for each acquisition are an unnecessary oversampling of the body footprint, which only increases the complexity of the input layer without any evident advantage in terms of accuracy. The number of inputs has therefore been reduced to 108 by a downsampling of about 1:19, empirically determined as the best trade-off between the HW complexity of the input layer and the overall classification accuracy. The effects of the binary (BNN) and ternary (TNN) quantization schemes for the weights on the mean classification accuracy of the FCN are shown in Figure 4 and compared with the base-2 scheme defined by Equation (1). Table 3 reports the main test results obtained with a 10-fold cross validation. The positions "Supine 1-4" are supine postures with different body attitudes: legs and arms more or less spread, cozy position, straddling left and right leg.
In order to prove the effectiveness of the FCN in classifying sitting postures, we built a custom dataset to train and test the FCN by using the Medilogic ® Seat Pressure Measurement System in Figure 5, considering that, to the best of our knowledge, no public dataset is available for the purpose. The measurement system is composed of a carpet of 480 piezoresistive sensors distributed on a matrix of 24 × 20 elements. The commercial measurement system has been chosen to make reliable acquisitions, but its number of sensors is meant to suit different applications and, also in this case, is excessive for the FCN operations. Therefore, a subsampling has been applied to reduce the number of sensing elements to 56, which coincides with the number of inputs to the first layer. The sampling scheme is also shown in Figure 5. The results in Figure 4 and Table 4 prove that binarization could be a sufficient quantization scheme in the case of sitting posture recognition.
In this paper, we chose to maintain the base-2 quantization because of its better results in other applications. However, for a fair comparison between HW implementations, a binarized version of the FCN will also be considered in the following. Since there are no studies on standard postures and all papers propose different approaches, usually depending on the chair examined and the type of analysis carried out, the postures in Table 4 have been chosen so that their combination yields plausible postures for many activities and many types of chairs. However, the interaction with other objects, such as a desk, armrests or a steering wheel, was not considered. The posture of a seated person involves the inclination of the trunk (supported, erect, inclined forward or sideways (left and right)). Since the weight of the trunk and head can be partially discharged by placing the arms on a desk or armrests, the "thinker" position, with the elbows on the knees, has been considered; the legs can be rested or raised, as in the case of subjects of small stature. In total, 8400 samples have been acquired. Training has been performed with a 5-fold cross validation on 6720 samples, and the FCN has been tested on a subset of 1680 samples.

System Design
The HW architecture of the FCN follows the scheme in Figure 1. It is composed of 5 sequential layers and a simple control logic, which initiates and terminates the processing and initializes the memories embedded in the layers. The layer operations are the same as those of the model in Figure 2: the first three are CONV layers, followed by the GAP and a dense FC layer. The last layer differs from the model because the SoftMax classifier has been replaced by a simple prediction stage, which only selects the maximum output value from the FC layer. This has been possible since the combination of the GAP and FC layers calculates the scores of the output classes as in Equation (3), where C is the class, W_{i,C} are the weights for each class, b_k are the binarized inputs to the GAP, and the other quantities can be taken from Table 1. Given the linearity of Equation (3), and since the only quantity of interest is the class with the highest score, which can be obtained by a trivial comparison between the scores of all the classes, not only the SoftMax but also the division in Equation (3) can be avoided for a more compact HW implementation of the FCN, without loss of accuracy. The FCN is fully synchronous; namely, the layers exchange data after a fixed number of cycles. All the layers share the architecture in Figure 6, composed of the control unit (CU), the memories storing the kernel coefficients and the BN outputs, and the Operational (OP) block, whose dimensions can be obtained from Table 1 once N and the number of classes have been selected. The OP blocks of the CONV and FC layers are schematized in Figure 7. Thanks to the base-2 quantization, the OP block is free of multipliers and has been essentially reduced to SAC operators, which are in turn composed of multiplexers and shifters. Muxes are used to select 0, the input value or the left-shifted input value, to emulate the product between the inputs and the quantized weights, according to Equation (1).
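The multiplier-free product described above can be illustrated with a minimal software sketch. It is not the RTL of the OP block: the function names are hypothetical, and handling the weight sign by negating the selected term is an assumption, since the text only states that the muxes select 0, the input or the left-shifted input.

```python
def shift_select(x: int, w: int) -> int:
    """Emulate x * w for w in {-2, -1, 0, +1, +2} with a mux and a shifter."""
    if w not in (-2, -1, 0, 1, 2):
        raise ValueError("weight outside the base-2 quantization set")
    if w == 0:
        magnitude = 0          # mux selects the constant 0
    elif abs(w) == 2:
        magnitude = x << 1     # mux selects the left-shifted input
    else:
        magnitude = x          # mux selects the input itself
    # Sign handling by negation is an assumption of this sketch.
    return -magnitude if w < 0 else magnitude


def sac_dot(inputs, weights) -> int:
    """Shift-Accumulate (SAC) replacement for a MAC-based dot product."""
    acc = 0
    for x, w in zip(inputs, weights):
        acc += shift_select(x, w)
    return acc


if __name__ == "__main__":
    samples = [5, 3, 7, 2, 9]      # toy integer inputs
    kernel = [2, -1, 0, 1, -2]     # toy base-2 quantized weights
    print(sac_dot(samples, kernel))
    print(sum(x * w for x, w in zip(samples, kernel)))  # reference MAC result
```

Both print statements give the same value, showing that the shift-and-select path reproduces the multiplication for every weight in the set.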
The CONV1 layer differs from the other two CONV layers in the number of operators, given the different number of filters in the first three layers, as well as in the width of the input data, which depends on the acquisition system and in our case is 12 bits. The remaining parts are the same as in Figure 7. BNs and activations are simply implemented by an XNOR gate according to Equation (4), where µ and σ are calculated during training and stored in a dedicated memory. Considering that the activations are binarized, the GAP scheme is a simplified version of the one in Figure 7a. It is composed of a popcount and a memory buffer for the results, as reported in Figure 7b, where signed additions are calculated as in Figure 8. Thanks to quantization, the logical and arithmetic operators are very compact and require few physical resources to be implemented. Therefore, the resources largely depend on the memories for biases, activations and weights. Although the memory requirements depend on the specific configuration (sitting or lying posture recognition in our cases), the amount of HW resources to be instantiated for the FCN is set by the larger lying posture configuration, as results from the data in Table 1. In this configuration, the memory requirements of the CONV layers are 0.11 kB, 2.35 kB and 3.14 kB for the first, second and third layer, respectively. The entire architecture requires 5.80 kB.
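The popcount-based GAP and the SoftMax-free prediction stage can be sketched together as follows. The sketch assumes that each binarized activation is stored as one bit (1 for +1, 0 for −1) and that the class score is a weighted sum of the per-channel averages, which is one reading of the description of Equation (3); the toy weights and all names are hypothetical.

```python
import math


def gap_sum(bits) -> int:
    """Sum of +/-1 activations stored as bits (1 -> +1, 0 -> -1), via popcount."""
    ones = sum(bits)                 # popcount of the channel
    return 2 * ones - len(bits)      # (+1) * ones + (-1) * (len - ones)


def class_scores(channel_bits, weights):
    """Unnormalized scores: weights[i][c] multiplies the GAP sum of channel i."""
    sums = [gap_sum(bits) for bits in channel_bits]
    n_classes = len(weights[0])
    return [sum(s * weights[i][c] for i, s in enumerate(sums))
            for c in range(n_classes)]


def predict(scores) -> int:
    """Prediction stage: index of the maximum score, no SoftMax needed."""
    return max(range(len(scores)), key=scores.__getitem__)


def softmax(values):
    """Reference SoftMax, used only to check that the prediction is unchanged."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    channel_bits = [[1, 1, 0, 1], [0, 0, 0, 1], [1, 0, 1, 0]]   # 3 toy channels
    weights = [[1, -2, 0], [2, 0, -1], [-1, 1, 2]]              # 3 channels x 3 classes
    raw = class_scores(channel_bits, weights)
    averaged = [s / len(channel_bits[0]) for s in raw]          # division kept
    print(predict(raw), predict(softmax(averaged)))
```

Because dividing by a positive constant and applying the SoftMax both preserve the ordering of the scores, the two printed class indices coincide.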

Synthesis and Implementation Results
The proposed design has been implemented on a small Digilent CMOD A35T, equipped with a Xilinx Artix-7 (xc7a35tfgg484-1) FPGA by using the Vivado IDE suite. The dimensions of the resulting systems are defined by the sensor carpet considering that the FPGA board is about 2 × 8 cm, therefore the overall dimensions of the systems are very compact.
The most interesting results of the FPGA implementation are reported in Table 5 for both the sitting and lying posture recognition configurations. For sitting posture recognition, our design requires 10,983 LUTs and 8424 FFs, whereas 15,802 LUTs and 12,287 FFs are required for lying posture recognition. A resource reduction of about 14% in the number of both LUTs and FFs is obtained by enabling the use of BRAMs and DSPs in the synthesis tool. Our choice to exclude these hard macros from the synthesis of our design is intended to provide implementation results that are as independent as possible of the specific FPGA topology, which could differ significantly in the number and capabilities of the embedded macros. It is worth underlining that a considerable reduction of the mapped resources, roughly 66%, could be obtained by exploiting the similarities between the architectures of the layers and implementing an iterative topology around a superset of a single CONV layer. However, since the unrolled architecture fits well in the small FPGA used for the tests, it has been preferred because it returns a much higher speed performance. In particular, the proposed design achieves a maximum operation frequency of 26.6 MHz and 47.64 MHz for the lying and sitting posture configurations, respectively. Considering that the unrolled configuration completes the processing of an input set in 2920 and 2880 clock cycles, respectively, for the two configurations, sensors with 9.13 kHz and 16.50 kHz Output Data Rate (ODR) are supported (processing times of 109 µs and 60 µs). Although the above ODRs could be unnecessarily high for some applications, like sleep monitoring, other critical applications, like driver monitoring and situation awareness, take real advantage of our design choices.
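The relation between clock frequency, cycles per classification and supported ODR can be checked with a few lines. This is only a back-of-the-envelope sketch: it assumes one classification per input set with no pipelining across input sets, and the function names are illustrative.

```python
def max_odr_hz(f_clk_hz: float, cycles_per_inference: int) -> float:
    """Highest sensor output data rate the accelerator can follow in real time."""
    return f_clk_hz / cycles_per_inference


def processing_time_us(f_clk_hz: float, cycles_per_inference: int) -> float:
    """Time needed to classify one input set, in microseconds."""
    return cycles_per_inference / f_clk_hz * 1e6


if __name__ == "__main__":
    configurations = {
        "lying": (26.6e6, 2920),     # max frequency in Hz, clock cycles
        "sitting": (47.64e6, 2880),
    }
    for name, (f_clk, cycles) in configurations.items():
        print(name,
              round(max_odr_hz(f_clk, cycles) / 1e3, 2), "kHz",
              round(processing_time_us(f_clk, cycles), 1), "us")
```

With the figures quoted above, this simple ratio gives roughly 9.1 kHz and 16.5 kHz, with processing times of about 110 µs and 60 µs, close to the reported values.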
The proposed system also meets state-of-the-art performance in terms of power dissipation, which is a very relevant parameter considering the large number of possible embedded applications of HPR systems. At the maximum operation frequency, the FCN dissipates 10.4 mW and 6.88 mW for the lying and sitting posture recognition configurations, respectively, namely 391 µW/MHz and 144.4 µW/MHz. Considering that conventional human activity recognition systems operate at 50 Hz [4,5], a dynamic power dissipation of less than 1 mW can be expected, which is not resolved by the Xilinx tool since it is much lower than the 70 mW quiescent power dissipation of the employed FPGA.
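The per-MHz figures and the sub-milliwatt estimate can be reproduced with a short sketch. It assumes, as a first-order approximation, that dynamic power scales linearly with the clock frequency and that the clock is lowered to the minimum value able to sustain the target ODR; the function names are illustrative.

```python
def power_per_mhz_uw(power_mw: float, f_clk_mhz: float) -> float:
    """Dynamic power normalized to the clock frequency, in microwatts per MHz."""
    return power_mw * 1000.0 / f_clk_mhz


def scaled_dynamic_power_uw(per_mhz_uw: float, odr_hz: float, cycles: int) -> float:
    """Estimated dynamic power when the clock just sustains the given ODR."""
    f_needed_mhz = odr_hz * cycles / 1e6   # minimum clock frequency in MHz
    return per_mhz_uw * f_needed_mhz


if __name__ == "__main__":
    lying = power_per_mhz_uw(10.4, 26.6)
    sitting = power_per_mhz_uw(6.88, 47.64)
    print(round(lying), round(sitting, 1))                       # uW/MHz
    print(round(scaled_dynamic_power_uw(lying, 50, 2920), 1))    # uW at 50 Hz ODR
```

Under these assumptions, a 50 Hz ODR in the lying configuration needs a clock of about 0.146 MHz, which corresponds to a few tens of microwatts of dynamic power, consistent with the bound of less than 1 mW stated above.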

Comparison with the Literature
Comparison with the current literature is not trivial, because there are no other systems that work well for both lying and sitting posture recognition. Moreover, custom HW implementations tailored to a specific application or a specific support naturally require fewer physical resources than the proposed design. This happens, for example, with the sitting posture recognition system in [27], specifically designed to use six flexible sensors applied to the armrests, backrest and seat of a chair, in conjunction with a very compact two-layer Artificial NN (ANN) that classifies seven sitting positions and represents the state-of-the-art solution for this specific problem. However, although the ANN in [27] requires fewer physical resources than the FCN (755 slice registers, 1822 FFs and 649 LUTs), it consumes more power (7.33 mW) and has a much higher processing time of 267.5 µs, while classifying one posture fewer than the proposed design, with an average accuracy of 97.43%.
With reference to lying posture recognition, the proposed FCN obtains an average accuracy of 96.77% in classifying 17 lying postures in real time by exploiting 108 sensors, with a throughput of 9.13 kHz. The state of the art in this case is represented by the recent work in [36], dealing with in-bed posture recognition, which exploits a microcontroller unit to implement a very complex ResNet composed of 17 CONV layers, two MaxPool layers and three FC layers, in order to obtain an average classification accuracy of 95.08% by using 1024 force sensitive resistor sensors.
In order to explore the possibility of embedding the proposed accelerator in in-sensor circuitry, an analysis has been carried out with a conservative TSMC LP-HVT CMOS 130 nm technology, which is compatible with that of the glue logic in modern MEMS. Synthesis results for the larger lying posture recognition configuration, obtained with the Cadence tool-chain, return a power dissipation of 425 µW/MHz and 1.7 mW, respectively, at the maximum operating frequency of 40 MHz, and an area occupation of 1.78 mm², and support an ODR up to 8.7 kHz in real time. All the above results improve on the state of the art for this kind of system.

Conclusions
In this work, a new FCN has been designed to implement HPR operations. The design exploits a base-2 quantization scheme to limit the amount of mapped physical resources and achieves state-of-the-art performance in terms of power consumption and area occupation. The FCN has been tested with datasets for sitting and lying posture recognition. In both applications, it demonstrated state-of-the-art performance and an adaptation capability that largely exceeds that of the existing solutions. The compactness of the design also suggests a prospective ASIC implementation, encouraged by the synthesis results with a 130 nm CMOS technology. Future improvements will be aimed at reducing the overall area required by the sensor array, which could limit the application of the proposed system on small supports.