Recurrent Neural Network for Human Activity Recognition in Embedded Systems Using PPG and Accelerometer Data

Abstract: Photoplethysmography (PPG) is a common and practical technique to detect human activity and other physiological parameters and is commonly implemented in wearable devices. However, the PPG signal is often severely corrupted by motion artifacts. The aim of this paper is to address the human activity recognition (HAR) task directly on the device, implementing a recurrent neural network (RNN) in a low cost, low power microcontroller, ensuring the required performance in terms of accuracy and low complexity. To reach this goal, (i) we first develop an RNN, which integrates PPG and tri-axial accelerometer data, where these data can be used to compensate motion artifacts in PPG in order to accurately detect human activity; (ii) then, we port the RNN to an embedded device, Cloud-JAM L4, based on an STM32 microcontroller, optimizing it to maintain an accuracy of over 95% while requiring modest computational power and memory resources. The experimental results show that such a system can be effectively implemented on a constrained-resource system, allowing the design of a fully autonomous wearable embedded system for human activity recognition and logging.


Introduction
Human activity recognition (HAR) using wearable sensors, i.e., devices directly positioned on the human body, is one of the most popular research areas, which focuses on automatically detecting what a particular human user is doing based on sensor data.
To this end, photoplethysmography (PPG) is an optical technique commonly employed in wearables and other medical devices to measure the change in the volume of blood in the microvascular tissue. Light is emitted from a dedicated device and then reflected and absorbed at different rates during the cardiac cycle. The reflected light is read by a photo-sensor to detect those changes. The output from this sensor can then be processed to obtain a valid heart rate (HR) estimation. Because PPG is a noninvasive method for HR estimation that, compared with electrocardiography (ECG) and surface electromyography, requires simpler body contact at peripheral sites on the body, such sensors are increasingly used in wearable devices, such as smart watches, as the preferred modality for HR monitoring in everyday activities.
However, accurate estimation of the PPG signal recorded from the subject's wrist when the subject is performing various physical exercises is often a challenging problem, as the raw PPG signal is severely corrupted by motion artifacts (MAs). These are principally due to the relative movement between the PPG light source/detector and the wrist skin of the subject during motion. In order to reduce the MAs, a number of signal processing techniques based on data derived from different sensor types, especially accelerometer data, have proven to be very useful [12][13][14].
Among smartphones and smart watches, built-in triaxial accelerometers are probably the most widespread sensors that can be used for activity monitoring. Because smartphones and smart watches have become very popular, the data-fusion techniques of PPG and acceleration data can be used for providing accurate and reliable information on human activity directly on such devices [15,16]. PPG sensors alone are not usually applied in HAR classification since they are not designed to capture motion signals as opposed to inertial measurement units (IMU), typically comprising accelerometers and gyroscopes. However, using a PPG sensor for HAR presents several advantages [17]: (i) wearable devices are becoming ubiquitous and almost always embed a PPG sensor, so it makes sense to exploit the information that it can provide, as it comes at no additional cost to the user of one of these PPG-enabled smartwatches or wristbands; (ii) the PPG sensor can either be used alone when other HAR sensors are unavailable, or combined with them to augment recognition performance; and (iii) this sensor can be used to monitor different physiologic parameters (heart rate, blood volume, etc.) in one solution. For these reasons, we chose to also employ the PPG signal to predict human activities.
HAR can be treated as a pattern recognition problem, and in this context, machine learning techniques have proven particularly successful. Due to recent advancements of deep learning techniques, these methods can be categorized in two main approaches: (i) conventional machine learning techniques, and (ii) deep learning-based techniques. In the first category, various machine learning methods, such as k-Nearest Neighbors (k-NN), Support Vector Machines (SVM), Gaussian Mixture Models (GMM), Hidden Markov Models (HMM) [18], Random Forests (RF) [19], and Molecular Complex Detection methods (MCODE) [20], have been adopted.
Furthermore, recent advancements in machine learning algorithms and portable device hardware could pave the way for the simplification of wearables, allowing the implementation of deep learning algorithms directly on embedded devices based on microcontrollers (MCUs) with limited computational power and very low energy consumption, without the need for transferring data to a more powerful computer for processing [36,37].
In recent years, edge computing has emerged to reduce communication latency, network traffic, communication cost, and privacy concerns. Edge devices are resource-constrained devices and cannot support high computation loads. As previously mentioned, in the literature, various machine learning methods and DNN models have been developed for HAR. Particularly, deep learning algorithms have shown high performance in HAR, but these algorithms require high computation, making them inefficient to be deployed on edge devices. To our knowledge, there are still few works that have addressed this problem specifically for the HAR classification task [36,37], as most have tested the DNN architectures on high performance processor units [17,28,38,39,40].
Thus, the main goal of this paper is to prove that the proposed RNN can be implemented in a low cost, low power core, while preserving good performance in terms of accuracy. To reach this goal, we proceed as follows:
• We design an RNN using PPG and triaxial accelerometer data in order to detect human activity, using a publicly available data set for its design and testing. The design and hyper-parameter optimization are performed on a computer architecture.
• After the RNN has been designed, we investigate the porting and performance of the network on an embedded device, namely the STM32 microcontroller architecture from ST, using their "STM32Cube.AI" software solution [41]. This framework allows the porting of a pre-built DNN, converting it to optimized code to be run on the constrained hardware platform.
• When porting the RNN to the embedded system, we show how the network can be simplified to better fit the microcontroller's limited resources. In particular, it is demonstrated that the input data can be downsampled to a significant degree, while maintaining good accuracy and requiring fewer hardware resources in order to be implemented.
The rest of the paper is organized as follows. In Section 2, we summarize the related work and we state the motivations for our work. Section 3 summarizes the basic concepts of RNNs. Section 4 describes the data set adopted in the experiments and the data pre-processing applied to improve the RNN performance. Section 5 reports the details of the proposed RNN architecture, with a description of the main features, hardware and software used to implement this network on the low-power, low-cost Cloud-JAM L4 board (STM32L476RG microcontroller). Finally, the experimental results are presented in Section 6.

Related Work
Nowadays, deep learning techniques have brought great improvements in signal recognition/classification and object detection.
In References [42,43], automatic target detection and recognition in infrared images based on a CNN is studied.
In Reference [44], a robust multi-camera multi-player tracking framework is presented. In this system, the player identity, which is commonly ignored in existing methods, is specifically considered, using a deep player identification model for players' identification, 2D localization and segmentation based on a Cascade Mask R-CNN model.
In this section, we provide a summary of the most recent deep learning techniques adopted for classification of PPG signal.
Heart Rate Variability (HRV) is the continuous fluctuation of period length between cardiac cycles, which can be used for the diagnosis of cardiovascular diseases, such as myocardial infarction and cardiac arrhythmia. In Reference [45], an RNN based on bidirectional long short-term memory (biLSTM) is introduced for accurate PPG cardiac period segmentation to derive three important indexes for HRV estimation.
biLSTM is an improved version of long short-term memory (LSTM), which receives forward and backward feature inputs in order to gain information behind and ahead of a specific sample point.
In the study [46], a new hybrid prediction model is proposed by combining ECG and PPG signals with an RNN to estimate blood pressure continually within the RNN structure; a biLSTM is used as the input hidden layer to look for contextual features both forward and backward, while a rectified linear unit (ReLU) layer is selected as the last hidden layer.
In Reference [47], different CNN architectures for PPG-based heart rate estimation are investigated. To train the network, an end-to-end learning approach that takes the time-frequency spectra of synchronized PPG and accelerometer signals as the input and provides the estimated heart rate as the output is adopted. A deep learning model for heart rate estimation using a single-channel wrist PPG signal is proposed in [48]. The model contains three components: a CNN, an LSTM, and a fully connected network (FCN).
The input data segmented into eight windows of 1 second duration is passed to the CNN-LSTM feature extractor by performing five parallel convolutions, thereby providing diverse feature representations from the input signal at various receptive fields.
In Reference [17], a novel method is adopted to extract meaningful features from the PPG to predict human activities that combines convolutional and recurrent layers. The convolutional layers are set as feature extractors and provide abstract representations of the three CS, RS and MS data in feature maps, while the recurrent layers model the temporal dynamics of the activation of the feature maps.

Brief Overview of RNNs
While traditional neural networks are characterized by the complete connection between adjacent layers, recurrent neural networks (RNNs) can map target vectors from the entire history of the previous map. The structure of an RNN is shown in Figure 1. In this architecture, each node produces a current hidden state $h_t$ and output $o_t$ by using current input $x_t$ and previous hidden state $h_{t-1}$ as follows:
$$h_t = f(W_h h_{t-1} + V_h x_t + b_h)$$
$$o_t = f(W_o h_t + b_o)$$
where $W$ and $V$ are the weights for the hidden layers in recurrent connections, $b$ denotes the bias for hidden and output states (subscripts $h$ and $o$, respectively), and $f$ is an activation function.
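Although an RNN is very effective in modeling the dynamics of a continuous data sequence, it may encounter the problem of vanishing and exploding gradients [49] when modeling long sequences. In order to overcome this issue, Hochreiter et al. [50] propose a variant type of RNN, based on the LSTM, which combines learning with model training without additional domain knowledge. The structure of the LSTM unit is shown in Figure 2. The following equations show the long-term and the short-term states and the output of each layer at each time step:
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$g_t = \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$y_t = h_t = o_t \odot \tanh(c_t)$$
where $i_t$, $f_t$ and $o_t$ are the input, forget and output gates, $g_t$ is the candidate state, $c_t$ and $h_t$ are the long-term and short-term states, $y_t$ is the output, $W_{xi}, W_{xf}, W_{xo}, W_{xg}$ and $W_{hi}, W_{hf}, W_{ho}, W_{hg}$ are the weights applied to the input $x_t$ and to the previous short-term state $h_{t-1}$, respectively, $b_i, b_f, b_o, b_g$ are the bias terms, $\sigma$ is the logistic function and $\odot$ denotes element-wise multiplication.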
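As a rough illustration of how an LSTM cell updates its long-term and short-term states at each time step, the following sketch implements a single-unit cell with scalar input and state in plain Python. It follows the standard LSTM formulation; the function name `lstm_step`, the parameter layout and all weight values are illustrative assumptions, not taken from the paper's implementation.

```python
import math


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM cell with scalar input and state.

    p maps each gate name ('i', 'f', 'o', 'g') to a tuple
    (input weight, recurrent weight, bias); all values are illustrative.
    """
    def pre(gate):
        w_x, w_h, b = p[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))    # input gate
    f = sigmoid(pre("f"))    # forget gate
    o = sigmoid(pre("o"))    # output gate
    g = math.tanh(pre("g"))  # candidate values
    c = f * c_prev + i * g   # long-term (cell) state
    h = o * math.tanh(c)     # short-term state, also the cell output
    return h, c


# Hypothetical weights, identical for every gate, just to run the recurrence.
params = {gate: (0.5, 0.1, 0.0) for gate in "ifog"}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.3]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

In a real layer the scalars become vectors and matrices, but the gating structure is the same.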

Data Set
We used a recent data set that is publicly available [51] and includes PPG and tri-axis accelerometer data from seven different subjects performing five series of three different activities (resting, squat, and stepper). The four signals are simultaneously acquired with a sampling frequency of 400 Hz and include a total of 17,201 s of recording data. The seven adult subjects include three males and four females aged between 20 years and 52 years.
The PPG and accelerometer signals were recorded from the wrist during some voluntary activity, using the Maxim Integrated MAXREFDES100 health sensor platform. This platform integrates one biopotential analog front-end solution (MAX30003/MAX30004), one pulse oximeter and heart-rate sensor (MAX30101), two human body temperature sensors (MAX30205), one three-axis accelerometer (LIS2DH), one 3D accelerometer and 3D gyroscope (LSM6DS3), and one absolute barometric pressure sensor (BMP280). Particularly, the PPG signals were acquired at the ADC output of the photodetector with a pulse width of 118 µs, a resolution of 16 bits and a full-scale range of 8192 nA, lighted with the green LED. The three-axis accelerometer signal values correspond to the MEMS output with a 10-bit resolution, left-justified, ±2 g scale and axes oriented as shown in Figure 3, with z pointing toward the experimenter's wrist. For the data acquisition, the following measurement set-up was followed as shown in Figure 4: (1) positioning of the sensor directly on the wrist; (2) insertion of the sensor inside a specific weight lifting bracelet, adjustable by a hook-and-loop closure, with optimal elastic characteristics that make it particularly suitable to guarantee perfect adherence of the sensor device to the skin surface; (3) verification of the correct wiring, as the loss of adhesion to the skin-device interface would cause the addition of high frequency noise in the acquired signals, making them unusable; (4) use of the sensor with the cable in "tethered" mode, where the cable comes out from the rear end of the band thus still guaranteeing freedom of movement. Of the data set, the first five subjects were used for the training phase, while the last two subjects were left for the final testing.

Data Pre-Processing
The PPG and accelerometer data from every single recording session are combined to obtain a series of four-dimension input data.
A preliminary cleaning of the data is performed for the presence of occasional spikes, including NaN points, probably due to glitches in the communication channel during acquisition. Those are always single points, so they can easily be fixed in software by interpolating the two adjacent points. This cleaning is performed on the five training subjects only, to improve the training process. Data from the two test subjects are left unaltered, to account for transmission errors in real-life applications and to not add overhead to a possible embedded implementation (tests on the computer have shown this to make no difference on the results).
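The single-point repair described above can be sketched as follows. The paper does not state how spikes are detected, so the threshold `max_jump` and the rule "the point is far from the mean of its neighbors while the neighbors agree with each other" are assumptions made only for this example; the function name is hypothetical as well.

```python
import math


def fix_single_point_glitches(samples, max_jump):
    """Replace isolated NaN points or spikes with the mean of their two neighbors.

    A point counts as a spike when it differs from the mean of its neighbors
    by more than max_jump while the two neighbors differ by at most max_jump
    (a hypothetical detection rule, not taken from the paper).
    """
    cleaned = list(samples)
    for k in range(1, len(cleaned) - 1):
        left, right = cleaned[k - 1], cleaned[k + 1]
        if math.isnan(left) or math.isnan(right):
            continue  # only isolated glitches are handled
        neighbor_mean = (left + right) / 2.0
        value = cleaned[k]
        is_spike = (not math.isnan(value)
                    and abs(value - neighbor_mean) > max_jump
                    and abs(left - right) <= max_jump)
        if math.isnan(value) or is_spike:
            cleaned[k] = neighbor_mean
    return cleaned


# Made-up PPG samples with one NaN and one spike.
ppg = [100.0, 101.0, float("nan"), 103.0, 9000.0, 105.0, 106.0]
print(fix_single_point_glitches(ppg, max_jump=50.0))
# [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0]
```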
The data are then split in partially overlapping windows of the same size. The choice of window size and overlapping is explained in detail in Section 4.3.
Before feeding the neural network with the resulting inputs, preliminary tests have shown that some basic normalization of data is needed for PPG to achieve acceptable results. It has been already mentioned that it is extremely sensitive to movement. As an example, Figure 5 shows PPG data from a single subject performing five series of the same exercise. It can be seen that the signal varies greatly not only between series, but also in the short term during the same recording.
To better isolate the PPG signal trend from the motion artifacts, we apply statistical standardization to the data, that is, we scale the data so that the resulting mean and standard deviation are 0 and 1, respectively, according to the following formula:
$$x' = \frac{x - \mu}{\sigma}$$
with µ and σ being the original mean and standard deviation, respectively. In order to ensure that the data can be processed in real time when porting the RNN to the embedded system, µ and σ are computed independently for each window of the incoming data, and so is the standardized signal. Standardization thus transforms each input signal window into another vector of the same length but with predefined mean and variance. Moreover, per-window standardization has the added benefit of also compensating somewhat rapid signal variations between windows. The results of this per-window standardization are also shown in Figure 5, where the right panel shows standardized data for 1200 sample windows with no overlapping. The non-overlapping output windows are simply juxtaposed on the graph for ease of representation.
On the other hand, accelerometer signals are more regular than PPG, suffering only from low-magnitude noise, which is intrinsic in accelerometers. Figure 6 shows, as an example, the accelerometer data from the recordings of a single subject in one activity. The only issue that must be addressed is that the data generally have a fixed offset, approximately constant, due to projection of gravity acceleration across the three spatial axes. Since this offset is practically random for the purpose of data analysis, we remove it by subtracting the mean value from the data:
$$x' = x - \mu$$
with µ being the original mean. Moreover, as can be seen in the same figure, the offset can change abruptly during the same exercise, due to the subject unconsciously changing position. So again, we choose to subtract the mean value in single data windows, individually.
The resulting processed signals for the same data are also shown in the right panel of Figure 6. While this procedure may not be optimal for the few data windows crossing an offset change, it is computationally lightweight enough to be implemented in real time in an embedded system and, as the figure shows, it results in a good filtering of the signal.
Preliminary tests have shown that normalization of acceleration values according to their standard deviation has a negative effect on final accuracy, with the normalization layer of the RNN itself leading to better results.
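The two per-window operations of this section can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the population standard deviation (the paper does not say which estimator is used) and a fallback for perfectly flat windows; the function names and sample values are hypothetical.

```python
import statistics


def standardize_window(window):
    """Scale one PPG window to zero mean and unit standard deviation."""
    mu = statistics.fmean(window)
    sigma = statistics.pstdev(window, mu)  # population std: an assumption
    if sigma == 0.0:
        return [0.0 for _ in window]  # flat window: avoid dividing by zero
    return [(x - mu) / sigma for x in window]


def remove_mean(window):
    """Subtract the window mean from one accelerometer axis (offset removal)."""
    mu = statistics.fmean(window)
    return [x - mu for x in window]


# Made-up values standing in for one short window of each signal.
ppg_window = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
acc_window = [1.0, 2.0, 3.0]
print(standardize_window(ppg_window))
print(remove_mean(acc_window))  # [-1.0, 0.0, 1.0]
```

Both functions need only one pass for the mean and one for the scaling, which is consistent with the goal of keeping the pre-processing cheap on the microcontroller.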

Data Downsampling
The original sample rate of the data (400 Hz) can impose a significant load on the processor and memory of an embedded device. Moreover, previous works show that classification of human activity does not require high sample rates [52].
For this reason, a crucial part of the work is examining varying degrees of downsampling of the original signals to find an optimal combination of accuracy and performance on constrained hardware platforms.
To efficiently downsample data, we chose not to use resampling algorithms that require digital filters, which would add significant computational cost when implemented in the final embedded system. We instead used a simple decimation procedure in which 1 out of every M samples is retained, discarding the rest. This leads to sample rates corresponding to integer decimation factors only. Mathematically, this is equivalent to transforming the original signal x[n] to a new signal y[n], as follows:
$$y[n] = x[nM]$$
with M being the decimation factor. A new RNN must be built and trained for every sample rate because the size of the network layers depends on the size of input data windows.
In the rest of the paper, when talking about the number of samples in data windows, we will always refer to the samples before downsampling in order to avoid confusion.
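The decimation step amounts to a strided copy. The sketch below, with a hypothetical function name, shows it on a dummy 1200-sample window using decimation factor 40, one of the factors mentioned in the paper.

```python
def decimate(samples, m):
    """Keep one sample out of every m, i.e. y[n] = x[n * m]; no filtering."""
    if m < 1:
        raise ValueError("decimation factor must be a positive integer")
    return samples[::m]


FS_HZ = 400                  # original sample rate of the data set
window = list(range(1200))   # dummy 3 s window; values are just sample indices
reduced = decimate(window, 40)
print(len(reduced), FS_HZ / 40)  # 30 samples per window at 10.0 Hz
```

Since no anti-aliasing filter is applied, any signal content above the new Nyquist frequency folds back into the decimated signal; the paper accepts this in exchange for negligible computational cost.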

Data Windowing
The window length and overlapping are important hyper-parameters in neural networks, as well as other machine learning algorithms [53,54]. With w being the number of samples in a window and o the number of overlapping samples between adjacent windows, the n-th data window corresponds to samples in the following range (counting samples from 0):
$$[\,n(w - o),\; n(w - o) + w - 1\,]$$
with $n \geq 0$. To find the best combination for our particular network, we conducted a series of tests with various values of the two parameters. It is common practice, when training a neural network, to further split the training data in two sets: data actually used to fit the network weights, and validation data to monitor the performance of the network during the various training epochs.
Since the number of different subjects in the data set is small and different subjects inevitably have substantial differences in their data, the statistical distribution of the data might not be uniform enough, and so choosing a single partition of training and validation data might not lead to representative results. So, we decided to adopt a cross-validation strategy, that is, for every window length and overlapping combination we trained five networks, isolating each time a different subject for the validation (the network architecture is explained in detail in Section 5). The resulting accuracy of every combination was then computed as the average of the maximum accuracy obtained for the validation data in every test during the training epochs.
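The cross-validation procedure just described can be sketched as follows. The helper names are hypothetical and the accuracy values are placeholders invented for the example, not results from the paper; the training step itself is omitted.

```python
def leave_one_subject_out(subjects):
    """Yield (training subjects, held-out subject) pairs, one per subject."""
    for held_out in subjects:
        yield [s for s in subjects if s != held_out], held_out


def cross_validated_accuracy(per_fold_epoch_accuracies):
    """Average, over folds, of the best validation accuracy across epochs."""
    best = [max(epochs) for epochs in per_fold_epoch_accuracies]
    return sum(best) / len(best)


training_subjects = [1, 2, 3, 4, 5]
folds = list(leave_one_subject_out(training_subjects))
for train, val in folds:
    print("train on", train, "validate on", val)

# Placeholder per-epoch validation accuracies for two folds (not real results).
print(cross_validated_accuracy([[0.5, 0.9], [0.8, 0.7]]))
```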
Since this process is quite time-consuming, we examined a limited combination of parameters in the neighborhood of what was already tested in [53], and with a downsampling factor of 10. As can be seen in Table 1, the best accuracy was reached with a window of 1200 samples (before downsampling), corresponding to 3 seconds, with 50% overlapping. The final network used in the rest of the article was trained with the mentioned windowing parameters, and using all the 5 training subjects (no validation data).
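To make the selected windowing concrete, the sketch below lists the sample ranges of consecutive windows for a window length of 1200 samples and an overlap of 600 samples (50%). The function name and the 3000-sample recording length are assumptions chosen for the example; trailing samples that do not fill a whole window are simply dropped here, which the paper does not specify.

```python
def window_bounds(n_samples, w, o):
    """Return (start, end) index pairs, end inclusive, of all full windows.

    w is the window length and o the overlap, both in samples.
    """
    step = w - o
    if step <= 0:
        raise ValueError("overlap must be smaller than the window length")
    return [(start, start + w - 1)
            for start in range(0, n_samples - w + 1, step)]


print(window_bounds(3000, w=1200, o=600))
# [(0, 1199), (600, 1799), (1200, 2399), (1800, 2999)]
```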

Data Augmentation
Since the three activities are not equally represented among the inputs, the network might end up being biased towards a specific class. A simple technique to address this problem is oversampling [55], a form of data augmentation where the data from classes with fewer occurrences are duplicated as needed, so that the data used for training are more uniformly distributed among the different classes. ("Oversampling" in this context must not be confused with data resampling in the time domain, performed independently.) Table 2 shows the number of input windows for the 3 classes, limited to the 5 subjects used for training, before and after data augmentation.
To summarize, Table 3 shows the number of inputs of the 7 subjects, before and after the oversampling applied to the first 5 ones. Oversampled data were used to train the final network.
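A minimal sketch of oversampling by duplication is given below. It assumes that every class is filled up to the size of the largest one by randomly repeating its windows; the paper only states that under-represented data are duplicated as needed, so the exact balancing rule, the function name and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def oversample(windows, labels, seed=0):
    """Duplicate windows of the smaller classes until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    by_class = defaultdict(list)
    for win, lab in zip(windows, labels):
        by_class[lab].append(win)
    target = max(len(items) for items in by_class.values())
    out_windows, out_labels = [], []
    for lab, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for win in items + extra:
            out_windows.append(win)
            out_labels.append(lab)
    return out_windows, out_labels


# Toy data: window contents are replaced by integer identifiers.
labels = ["resting"] * 6 + ["squat"] * 2 + ["stepper"] * 3
windows = list(range(len(labels)))
new_windows, new_labels = oversample(windows, labels)
print({lab: new_labels.count(lab) for lab in set(new_labels)})
```

Only the training subjects should be passed through such a function; the test subjects keep their original class distribution.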

RNN Architecture
The RNN used in this paper is depicted in Figure 7. It is based on an architecture commonly used with time-based sensor data [54][55][56] and consisting of a combination of fully connected layers and LSTM cells. Input data are assembled from PPG and the three acceleration axes, resulting in four-dimension time-series. Data are then fed to the network in windows of size w × 4, with the parameter w being the size in time points of a single data window as described in Section 4.3.
The first layer is a fully-connected one (dense), with the purpose of identifying relevant features in the input data. In this layer, the generic n-th neuron produces an output value $y_n$, according to the $x_1, \ldots, x_m$ inputs to the layer and the $w_{nj}$ neuron weights associated to every input. Specifically,
$$y_n = \varphi\left(\sum_{j=1}^{m} w_{nj} x_j + b_n\right)$$
where $\varphi$ is an activation function and $b_n$ is a bias value. Next, there is a batch normalization layer, which normalizes the mean and standard deviation of the data globally, operating on single batches of data as the training progresses. For every input data batch x, its output is the following:
$$y = \gamma \frac{x - \bar{x}}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
where $\bar{x}$ and $\sigma^2$ are the mean value and variance of the data batch, respectively, and $\gamma$, $\epsilon$, $\beta$ are internal parameters of the layer ($\gamma$ and $\beta$ are trainable, while $\epsilon$ is a small constant that avoids division by zero). The core of the recurrent neural network, then, is represented by three cascaded LSTM layers, whose internal architecture was briefly explained in Section 3. Each one is followed by a dropout layer that randomly discards a part of the input to reduce overfitting.
Finally, there is a fully-connected layer of size 3 that, together with the Sparse Categorical Cross-entropy loss function assigned to the network, performs the classification in one of the three classes. The loss function, or cost function in more general terms of optimization problems, represents the error that must be minimized by the training process. The specific representation of the error depends on the particular function assigned to the network. For the Categorical Cross-entropy function, the error is as follows:
$$J(w) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{3} y_{i,c} \log \hat{y}_{i,c}$$
where w is the set of model parameters, e.g., the weights of the RNN, N is the number of input test features, and $y_{i,c}$ and $\hat{y}_{i,c}$ are the true and predicted classes respectively, expressed numerically: $y_{i,c}$ is 1 if input i belongs to class c and 0 otherwise, and $\hat{y}_{i,c}$ is the predicted probability of class c for input i. The intermediate layers have size 32; this hyper-parameter was determined experimentally, starting with a larger value and decreasing it until the accuracy varied significantly. Table 4 shows the details of the individual layers. The RNN, as built in this configuration, has 25,283 trainable parameters.
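The stated number of trainable parameters can be cross-checked from the layer sizes given in the text: a dense layer of size 32 applied to the 4 input features, batch normalization, three LSTM layers of size 32 and a dense output layer of size 3. The sketch below uses the usual parameter-count formulas (one bias vector per LSTM gate, as in the default Keras LSTM layer); the function names are hypothetical, and dropout layers are omitted because they have no parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


def batch_norm_trainable_params(n_features):
    """Only gamma and beta are trainable; moving statistics are not."""
    return 2 * n_features


def lstm_params(n_in, n_units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (n_units * (n_in + n_units) + n_units)


total = (dense_params(4, 32)                 # first dense layer
         + batch_norm_trainable_params(32)   # batch normalization
         + 3 * lstm_params(32, 32)           # three cascaded LSTM layers
         + dense_params(32, 3))              # output layer
print(total)  # 25283
```

The result matches the 25,283 trainable parameters reported above. Note that the count does not depend on the window length w, which is consistent with the flash memory requirements being independent of the sample rate.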

Hardware and Software
For the first part of the design and hyper-parameter optimization, the RNN was developed with TensorFlow 2.4.1 and Keras 2.4.0. The network and the related algorithms were initially developed on the Google Colaboratory platform; then, the final computations were performed on a computer with an Intel Core i7-6800K CPU, 32 GiB of RAM and an NVIDIA GeForce GTX 1080 GPU.
For the embedded part, we tested the RNN on a Cloud-JAM L4 board (https://www.rushup.tech/jaml4, accessed on 1 June 2021), which, for its small form factor and integrated Wi-Fi, can represent a valid prototyping base for a wearable system. While it features a set of inertial and environmental sensors, it is not a complete system with PPG sensor and other needed features. Nevertheless, it allows testing the RNN on real hardware and evaluating its performance in terms of memory and execution time, should a full-featured monitoring system be designed. The classification of test data is done in real time by providing input data to the board from the test set via a serial interface. This also ensures reproducibility of the results with respect to the other tests.
The porting of the neural network to the STM32 architecture is made possible by a software framework from ST, named "STM32Cube.AI" [41] (current version 6.0.0), integrated in the STM32Cube IDE. The software is a complete solution to import a Keras (or other) model, analyze it for compatibility and memory requirements, and convert it to an optimized C implementation for the target architecture. The generated network can then be evaluated with test input data, both on the computer and the actual device to obtain various metrics, such as execution time, number of specific hardware operations and accuracy.
All the software developed for this article is publicly available at https://github.com/MAlessandrini-Univpm/rnn-ppg-har (accessed on 1 June 2021, published in July 2021).

Experimental Results
The final RNN was tested on both the computer and the MCU, with several decimation factors. For every factor, the network was trained and tested with the following parameters:
• Windows of 1200 samples (before decimation) and 50% overlapping.
• Data augmentation applied.
• Five subjects used for training, with no further split for validation.
• Test performed on the last two subjects, not involved in training.
• A total of 100 training epochs.
In addition to the other hyper-parameters already discussed, the number of epochs was chosen experimentally by examining the training accuracy and loss value during the training stage. Figure 8 shows the progress of accuracy (estimated on the training material itself) and loss with respect to the training epochs for the network with no downsampling (original data at 400 Hz). It can be seen that at about 100 epochs, the values reach convergence.  Table 5 shows the accuracies and resource usage obtained by the training and the final test for the various RNNs, for both computer and MCU.
On the computer side, reported times are the total time for the training and test stages, respectively. On the embedded system, every RNN requires a given amount of flash and RAM memory, reported by the framework during the initial analysis. Flash memory requirements do not depend on the sample rate, but only on the network architecture, namely, the quantity of weights and other parameters that are read-only values after the training is done. As shown, the amount of flash memory required is well below the available quantity.
RAM, on the other hand, is more limited (96 KiB in this case) and its usage is strongly dependent on the size of input data (and so on the sample rate). Moreover, part of the RAM is needed by the program besides data structures belonging to the RNN. It can be seen that not all the configurations can fit in RAM; combinations that would require more than 100% of RAM could not be executed on the MCU. (An alternative practice to fit a DNN model to a constrained architecture is converting it to TensorFlow Lite format. Unfortunately, the current STM32Cube.AI version (6.0.0) does not support some specific operations generated by the TensorFlow Lite converter for our model.) Timing results are computed by running the RNN on the actual device (see Section 5.1). A dedicated firmware application is provided by the framework; the IDE tool can communicate with such an application on the board, send it the test data to make it run the neural network inference on the hardware and, finally, obtain the statistics on performance. Every time a different model is used, a series of operations is needed: generating the code, compiling it, programming it on the MCU flash memory and, finally, running the test.
Presumably due to a limitation of the firmware validation application, the program stops working if the input and reference data provided are too big, so it was not possible to use the full test data (consisting of more than 2000 rows). A subset of the data (100 rows) had to be used. Since the application reports the average time needed for every inference, the timing results are still meaningful. Indeed, the reported test time for the MCU is the average time of a single data input.
As for the accuracy, to have a meaningful comparison with results on the computer using the full test data, we referred to the validation performed by the toolkit on the computer; this uses the same C code generated for the MCU and so it is expected to provide equivalent numerical results.
The CPU percentage usage was computed as the ratio between the average inference time reported by the validation application and the duration of a data window (3 s), multiplied by a factor of two to account for 50% overlapping of the data windows. This parameter can give an estimation of the capability of the embedded system to handle the data classification in real time and the CPU time remaining for other concurrent activities. The table also reports the number of MACC operations, in rounded thousand units, required for a single inference.
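The CPU usage figure defined above is a simple ratio, sketched below. With 50% overlap a new 3 s window becomes available every 1.5 s, which is where the factor of two comes from. The function name is hypothetical and the 0.15 s inference time is a made-up value for illustration, not a measurement from the paper.

```python
def cpu_usage_percent(inference_time_s, window_s=3.0, overlap=0.5):
    """Share of CPU time spent on inference when windows arrive in real time."""
    # A new window is ready every `period` seconds: 1.5 s for 3 s windows
    # with 50% overlap, i.e. twice as often as without overlap.
    period = window_s * (1.0 - overlap)
    return 100.0 * inference_time_s / period


usage = cpu_usage_percent(0.15)  # hypothetical average inference time
print(round(usage, 1), "% of CPU time")
```

Any value below 100% means that, on average, each inference completes before the next window is ready.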
It can be seen from the results that the accuracy does not decrease while downsampling the data down to 10 Hz (in fact, it actually increases), corresponding to a CPU usage of 10%, leaving plenty of execution time for other concurrent activities, or alternatively, allowing the reduction of the CPU clock frequency to achieve a lower power consumption. Note that the CPU usage does not include data pre-processing, that is, normalization of the mean value and/or standard deviation (see Section 4.1) that would be needed if data are acquired in real time. Those operations are much simpler than RNN inference, and so should not add a significant overhead.
It can also be seen that the accuracies achieved by the MCU implementation are identical to the ones obtained on the computer. This is presumably due to the differences between the two models being relatively small: apart from the limited precision of the microcontroller FPU (32 bits), the model does not require further compression or quantization to fit on the embedded system. Figure 9 shows the confusion matrix from the classification of test data in the same setup. It can be seen that the squat and stepper activities are the ones suffering from the larger mistake rates, while the resting activity is recognized correctly in 98% of the cases. This may be due in part to the amount of original input data being substantially less for squat and stepper activities with respect to resting.
In the current setup, the accuracy of the testing stage reaches a maximum of 95.54% for a decimation factor of 40. While splitting the data set into five training subjects and two testing subjects is a natural choice, the limited size of the data set can lead to a bias in the results, depending on the chosen partition. Moreover, it can be seen from Table 5 that the difference between the training and testing accuracies grows as the decimation factor increases.
To test the effect of such a bias, we repeated the previous tests with a leave-one-subject-out cross-validation strategy. This means training and testing seven models for every experiment, each using six subjects for training and one (different each time) for testing. Table 6 shows the test accuracy for this setup, averaged over all the models. Since reducing the amount of test material with respect to the training material can increase the overfitting effect, we repeated the tests with 50 epochs in addition to 100.
It can be seen that in this configuration, the accuracies are significantly lower. Again, this can be explained by the limited size of the data set: a single subject may not be representative enough to be used for testing. Indeed, if one examines a single case more closely, for example, the configuration with decimation factor 40 and 100 epochs, which is the best one in Table 5, it can be seen that a few subjects negatively influence the average results, while most of them have accuracies similar to the better ones reported earlier. This is shown in Table 7. This, again, confirms that the limited size of the data set can limit the generality of the results, producing a strong bias that depends on the subject partition. A wider data set could solve this kind of problem and provide more general results; this can be the subject of future work in this field.
Table 8 reports a list of the state-of-the-art works related to HAR in terms of the employed algorithm, type of signal, data set used for experimentation, number of classes for each data set, hardware used for testing and performance.
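The leave-one-subject-out protocol used for Tables 6 and 7 can be sketched as follows. This is a minimal illustration under stated assumptions: `train_and_eval` is a hypothetical callable standing in for the training and testing of one model, and the accuracies in the example are made up, not results from this work.

```python
def leave_one_subject_out(subject_ids):
    """Yield (training_subjects, held_out_subject) pairs, one per subject."""
    subjects = sorted(set(subject_ids))
    for held_out in subjects:
        training = [s for s in subjects if s != held_out]
        yield training, held_out


def loso_accuracy(subject_ids, train_and_eval):
    """Run one model per held-out subject; return (mean accuracy, per-subject accuracies).

    train_and_eval(training_subjects, held_out_subject) is a hypothetical
    callable that trains a model and returns its test accuracy in [0, 1].
    """
    per_subject = {
        held_out: train_and_eval(training, held_out)
        for training, held_out in leave_one_subject_out(subject_ids)
    }
    mean = sum(per_subject.values()) / len(per_subject)
    return mean, per_subject


# Made-up accuracies: a single poorly recognized subject lowers the average.
def fake_train_and_eval(training, held_out):
    return 0.5 if held_out == 3 else 0.9


mean_acc, per_subject = loso_accuracy(range(1, 8), fake_train_and_eval)
print(f"mean accuracy: {mean_acc:.3f}")
print(per_subject)
```

Keeping the per-subject accuracies alongside the mean makes it possible to see whether a low average comes from a few subjects, as discussed for Table 7.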
The metrics commonly used to evaluate the validity of HAR algorithms are accuracy and F1 score: accuracy is the ratio of the sum of true positives (TP) and true negatives (TN) to the total number of records; the F1 score is the harmonic mean of precision and recall, where precision is defined as TP/(TP + FP), with FP = false positives, and recall as TP/(TP + FN), with FN = false negatives. Comparing the proposed work with the methods listed in Table 8 allows its contribution to be evaluated. Regarding the data, accelerometer and gyroscope signals are the most commonly used sources in the state of the art, since these signals are simple to acquire. For this reason, many works focus on the popular and publicly available UCI HAR data set, which contains six activities (walking, walking upstairs, walking downstairs, sitting, standing, laying down). Data sets containing PPG signals are less common and more limited in the number of activities they include. PPG is nevertheless an interesting signal for HAR, because a PPG sensor is already embedded in smartwatches and wristbands and can either be used alone when other HAR sensors are unavailable, or combined with them to improve recognition performance; moreover, the same sensor can be used to monitor different physiological parameters in one device. Finally, as can be seen, the results obtained with the proposed method are in line with those of the state of the art, especially considering that only a few works have tested an implementation on microcontrollers.
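The definitions of accuracy, precision, recall and F1 score given above can be illustrated with a short sketch. The function names and the toy labels are hypothetical and chosen only for illustration; in the multi-class case, the overall accuracy reduces to the fraction of correctly classified records, while precision, recall and F1 are computed per class.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 score for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels for six windows, for illustration only.
y_true = ["rest", "rest", "rest", "squat", "squat", "stepper"]
y_pred = ["rest", "rest", "squat", "squat", "stepper", "stepper"]

print(accuracy(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred, positive="squat"))
```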

Conclusions
In this paper, an RNN was built for human activity recognition, using PPG and accelerometer data from a publicly available data set. The RNN was then ported to an embedded system based on an STM32 microcontroller, using a dedicated toolkit for porting the network model to this architecture. The results show that an accuracy of more than 95% is achieved in the classification of test data, and that the sample rate of the acquired data can be reduced to 10 Hz while maintaining the same accuracy. This, in turn, allows the network to run on the embedded device with modest hardware resources, paving the way to a fully autonomous activity classifier implemented as a wearable embedded device based on cheap, commonly available microcontrollers.