2.3. Dataset Preparation
An important property that an ideal estimation method must have is universality, i.e., the ability to generalize to users it has not seen. To verify this property, the models proposed in this work were trained and tested using cross-validation. Because the dataset consists of 15 complete sessions, one per participant, we used a variant of cross-validation in which the training, validation, and test sets are formed from whole sessions of individual participants. This variant is called leave-one-out cross-validation (LOOCV). For the available data, each LOOCV fold consisted of 12 participants forming the training set, 2 participants forming the validation set, and 1 participant forming the test set, on which the performance of the presented solutions was evaluated. Smaller experiments related to building the network architecture were evaluated on 5 folds, while the final solution was evaluated on all 15 folds.
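As a minimal sketch of the fold structure described above, the following Python snippet builds 15 folds with a 12/2/1 split by participant. The text does not state how the two validation participants were chosen for each fold; taking the next two participants in cyclic order is purely an assumption made for illustration, as is the function name.

```python
def leave_one_subject_out_folds(subjects, n_val=2):
    """Return one (train, val, test) split per held-out test subject."""
    folds = []
    n = len(subjects)
    for i, test_subject in enumerate(subjects):
        # Assumption: validation subjects are the next n_val subjects in
        # cyclic order; the source does not say how they were selected.
        val = [subjects[(i + j) % n] for j in range(1, n_val + 1)]
        train = [s for s in subjects if s != test_subject and s not in val]
        folds.append((train, val, [test_subject]))
    return folds


subjects = [f"S{i}" for i in range(1, 16)]  # S1 ... S15
folds = leave_one_subject_out_folds(subjects)

for train, val, test in folds[:3]:
    print(f"test={test} val={val} train_size={len(train)}")
```

Splitting by participant rather than by window keeps all overlapping windows of one session in the same set, so no data from the test participant leaks into training.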
The most promising solutions process data in the form of a time-frequency spectrum. In the prestudy, we compared the performance of the Deep PPG model using both frequency and time domain data. We found that for some participants in the test set the time domain was preferable, while for others the frequency domain was better. Because neither representation was consistently better, our solution uses data in the time domain, for which preprocessing is reduced to the bare minimum. In this work, the final solution used PPG and 3-axis accelerometer signals. Signals measured by the Empatica E4 device were recorded at different frequencies: 32 Hz for the 3 accelerometer signals and 64 Hz for the PPG signal. Our final model combined these 4 features. After the dataset was divided into folds, further processing was performed in 3 simple steps:
1. In the first step, the resolution of the PPG signal was reduced using decimation to 32 Hz.
2. In the second step, standardization was performed using a z-score by removing the mean and scaling the features to unit variance. Standardization according to the generally accepted principle was performed using statistics obtained only for the training set. Then, mean and standard deviation from the training set were used to standardize the validation and test sets.
3. In the third step, for all three sets, the features were processed using the sliding window method. The window size and shift were adapted from most previous works on the HR problem.
Therefore, a single window was 8 s long, which corresponds to 256 data points per feature (8 s of the 32 Hz signal). Each subsequent window was created by a shift of 2 s, giving 6 s of overlap between consecutive windows. Finally, the windowed training data corresponding to the individual subjects were fed into the network. The size of the training set varied slightly between folds, because the duration of the entire data collection protocol differed between participants.
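The three preprocessing steps can be sketched in Python with NumPy as follows. This is an illustration on synthetic data, not the authors' code: the function names are hypothetical, and the decimation is replaced by a crude two-sample average followed by keeping one value per pair, which stands in for a proper anti-aliasing decimator.

```python
import numpy as np


def downsample_by_two(x):
    """Crude 64 Hz -> 32 Hz reduction: average each pair of adjacent samples."""
    x = np.asarray(x, dtype=float)
    x = x[: len(x) // 2 * 2]              # drop a trailing odd sample, if any
    return x.reshape(-1, 2).mean(axis=1)


def fit_standardizer(train):
    """Per-feature mean and standard deviation, from training data only."""
    return train.mean(axis=0), train.std(axis=0)


def standardize(x, mean, std):
    """Z-score using statistics that were fitted on the training set."""
    return (x - mean) / std


def sliding_windows(x, fs=32, length_s=8, shift_s=2):
    """Cut a (samples, features) array into overlapping windows."""
    size, step = fs * length_s, fs * shift_s        # 256 and 64 samples
    starts = range(0, x.shape[0] - size + 1, step)
    return np.stack([x[s:s + size] for s in starts])


rng = np.random.default_rng(0)
seconds = 60                                        # synthetic 1-minute recording
ppg_64hz = rng.normal(size=64 * seconds)
acc_32hz = rng.normal(size=(32 * seconds, 3))

ppg_32hz = downsample_by_two(ppg_64hz)
features = np.column_stack([ppg_32hz, acc_32hz])    # PPG + 3 accelerometer axes

# Here the synthetic recording plays the role of the training set; validation
# and test data would reuse the same mean and std.
mean, std = fit_standardizer(features)
windows = sliding_windows(standardize(features, mean, std))
print(windows.shape)                                # (27, 256, 4)
```

With a 2 s shift, each window shares its last 192 samples (6 s) with the first 192 samples of the next one.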
2.4. Design of Network Architecture
The basic assumption of the neural network architecture being built for this work was to eliminate, as much as possible, the influence of motion artifacts on the obtained results. Therefore, the entire network architecture design process was only concerned with data from the wrist-worn accelerometer. The idea was simple: if the network could successfully estimate HR from the accelerometer alone, it would be more capable of solving the HR problem using data from the accelerometer and additional features.
In the prestudy, hyperparameters related to the training of networks were selected using a validation set. The selected parameters performed best on early versions of the network architecture, which was the reason for their use. All experiments used the same set of hyperparameters as shown in Table 1.
The 5-fold LOOCV ensured that the results of each experiment performed while building the network generalized across participants. The participants constituting the test sets of these folds were selected so that the tests provided as much information as possible. Therefore, in addition to the results for the participants whose HR did not deviate significantly from the mean (S3, S9, S15), the experiments also provided information on how the proposed solutions performed for the participants with the highest (S5) and the lowest (S12) mean HR. Each network considered in this work solves a regression problem by estimating a single HR value for an 8 s window. In determining the final architecture, we tested the impact of various network elements on the performance metric score.
The first experiment tested the number of filters in the first two convolutional layers of the basic network architecture. It was assumed that the first convolutional layer would have n filters, while the second layer would have 2n filters. Increasing the number of filters in the subsequent layers of a convolutional neural network (CNN, Conv) is a common practice that allows the layers to capture more patterns in the data. The experiment tested four different values of n (24, 48, 96, 128) with the same kernel size (k = 3). To reduce the dimensionality of the resulting feature maps, each convolution operation was followed by a max-pooling operation. The resulting feature maps were then flattened by a flattening layer and processed by a fully connected layer. The rectified linear unit was used as the default activation function across the network, and a dropout layer was used to avoid overfitting. The network configuration used in this experiment was as follows:
- 1D convolutional layer (number of filters = n, kernel size = k, strides = 1), 
- max-pooling layer (pool size = 3, strides = 3), 
- 1D convolutional layer (number of filters = 2n, kernel size = k, strides = 1), 
- max-pooling layer (pool size = 3, strides = 3), 
- flattening layer, 
- fully connected layer (number of neurons = 128), 
- dropout layer (size = 0.5), 
- fully connected layer (number of neurons = 1). 
In total, the network was trained 20 times in this experiment, each time changing the training fold or the number of filters in the first two convolutional layers. The resulting mean MAE scores for all test participants are shown in Table 2.
The results varied slightly depending on the number of filters used. The best results were achieved for 64 filters in the first layer and 128 filters in the second layer, with an average MAE of 20.36 bpm. The parameters obtained from this test could be used as the basis for further development of the network architecture. However, we assumed that a kernel size adopted from the vast majority of image recognition works may not be sufficient for time-dependent accelerometer signals; this assumption is the basis for the next experiment.
A small kernel size combined with the max-pooling operation resulted in more localized feature extraction from the available signals. In the case of signal representation in the form of a time-frequency spectrum, such a solution may be appropriate, because the spectrum can be treated as a two-dimensional image. However, in the case of a one-dimensional signal, a simpler solution was to use a larger kernel size for the convolution operation, which in turn allows the features to be considered in a wider context. In the second experiment on determining the appropriate network architecture, a test was conducted to determine the effect of a larger kernel size (k = 12). The same network architecture was used for the experiment, except that the convolutional layers used a kernel size of 12. Again, the network was trained for each combination of number of filters and training fold. The results of the experiment are presented in Table 3.
Based on these results, the best number of filters was 96 in the first layer and 192 in the second layer, as this configuration obtained the best results, with an average MAE of 19.32 bpm. For the larger kernel, a performance improvement is evident for each number of filters tested and, consequently, the best configuration of the first two layers was found, which was used for further experiments.
Increasing the depth of a neural network is a popular procedure that is often used in state-of-the-art solutions in many fields [9,17]. Deeper neural networks allow the network to learn more complex nonlinear relationships. For the network architecture under development, another test was performed to check the impact of adding more convolutional layers to the network. The tests involved adding 1 to 3 additional convolutional layers to the 2 layers of the existing architecture, giving a maximum of 5 convolutional layers in total. Adding more convolutional layers was impossible because the temporal dimension of the feature maps shrinks to 1 with 3 additional layers. The configuration of the additional layers was as follows:
- One additional convolutional layer:
  - 1D convolutional layer (number of filters = 128, kernel size = 3, strides = 1),
  - max-pooling layer (pool size = 2, strides = 2),
  - dropout layer (size = 0.1).
- Two additional convolutional layers:
  - 1D convolutional layer (number of filters = 192, kernel size = 3, strides = 1),
  - max-pooling layer (pool size = 2, strides = 2),
  - dropout layer (size = 0.1),
  - 1D convolutional layer (number of filters = 128, kernel size = 3, strides = 1),
  - max-pooling layer (pool size = 2, strides = 2),
  - dropout layer (size = 0.1).
- Three additional convolutional layers:
  - 1D convolutional layer (number of filters = 192, kernel size = 3, strides = 1),
  - max-pooling layer (pool size = 2, strides = 2),
  - dropout layer (size = 0.1),
  - 1D convolutional layer (number of filters = 128, kernel size = 3, strides = 1),
  - max-pooling layer (pool size = 2, strides = 2),
  - dropout layer (size = 0.1),
  - 1D convolutional layer (number of filters = 64, kernel size = 3, strides = 1),
  - max-pooling layer (pool size = 2, strides = 2),
  - dropout layer (size = 0.1).
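The limit of three additional layers can be checked with a short length calculation. The sketch below assumes unpadded ("valid") convolutions and pooling, which the text does not state explicitly but which is consistent with the temporal dimension shrinking to 1; the helper names are hypothetical.

```python
def conv_len(length, kernel, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel) // stride + 1


def pool_len(length, pool, stride):
    """Output length of a max-pooling layer without padding."""
    return (length - pool) // stride + 1


def temporal_length(extra_layers, window=256):
    """Length of the feature maps after the base and additional layers."""
    length = window
    for _ in range(2):               # two base layers: kernel 12, pool 3
        length = pool_len(conv_len(length, 12), 3, 3)
    for _ in range(extra_layers):    # additional layers: kernel 3, pool 2
        length = pool_len(conv_len(length, 3), 2, 2)
    return length


for extra in range(4):
    print(extra, temporal_length(extra))
# 0 23
# 1 10
# 2 4
# 3 1
```

With a temporal length of 1 after the third additional layer, a further convolution with kernel size 3 has nothing left to slide over, which matches the limit stated above.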
 
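The additional convolutional layers used a kernel size of 3 because their task was to extract more local signal dependencies. For each network configuration, the rest of the network was analogous to the first two experiments: a flattening layer, a fully connected layer with 128 neurons, a dropout layer (size = 0.5), and a fully connected output layer with a single neuron.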
The first two experiments conducted showed that convolutional layers with larger kernels achieve better results. However, for some test participants, a small kernel size was found to be better. To apply simultaneous feature extraction with convolutions using different kernel sizes, a multi-headed architecture was used. The proposed architecture is a combination of the best number of filters obtained for a smaller and a larger kernel. One part of the model extracts features using two convolutional layers (96 and 192 filters) using a kernel size equal to 12, while the other part of the model performs exactly the same operation using another two convolutional layers (64 and 128 filters) but using a kernel size equal to 3. The resulting network from this concept consists of the following layers:
- head 1
  - 1D convolutional layer (number of filters = 64, kernel size = 3, strides = 1),
  - max-pooling layer (pool size = 3, strides = 3),
  - 1D convolutional layer (number of filters = 128, kernel size = 3, strides = 1),
  - max-pooling layer (pool size = 3, strides = 3),
  - flattening layer,
- head 2
  - 1D convolutional layer (number of filters = 96, kernel size = 12, strides = 1),
  - max-pooling layer (pool size = 3, strides = 3),
  - 1D convolutional layer (number of filters = 192, kernel size = 12, strides = 1),
  - max-pooling layer (pool size = 3, strides = 3),
  - flattening layer,
- concatenate layer (merging two model heads),
- fully connected layer (number of neurons = 128),
- dropout layer (size = 0.5),
- fully connected layer (number of neurons = 1).
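To make the data flow through the two heads concrete, the following NumPy sketch runs a single forward pass of the listed layers with random, untrained weights. It is an illustration only: unpadded convolutions, the omission of biases, a linear output neuron, and all function names are assumptions, and dropout is left out because it is inactive at inference time.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)


def conv1d_relu(x, filters, kernel):
    """Unpadded 1D convolution (random weights) + ReLU; x is (time, channels)."""
    w = rng.normal(scale=0.05, size=(kernel, x.shape[1], filters))
    windows = sliding_window_view(x, kernel, axis=0)  # (time-kernel+1, channels, kernel)
    return np.maximum(np.einsum("tck,kcf->tf", windows, w), 0.0)


def max_pool(x, pool=3):
    """Non-overlapping max-pooling (pool size == stride) along time."""
    steps = x.shape[0] // pool
    return x[: steps * pool].reshape(steps, pool, -1).max(axis=1)


def head(x, filters, kernel):
    """Two conv + max-pooling stages followed by flattening."""
    for f in filters:
        x = max_pool(conv1d_relu(x, f, kernel))
    return x.reshape(-1)


def dense(x, units, relu=True):
    """Fully connected layer with random weights and no bias."""
    y = x @ rng.normal(scale=0.01, size=(x.size, units))
    return np.maximum(y, 0.0) if relu else y


window = rng.normal(size=(256, 3))                  # one 8 s accelerometer window
head_1 = head(window, (64, 128), kernel=3)          # small-kernel head
head_2 = head(window, (96, 192), kernel=12)         # large-kernel head
merged = np.concatenate([head_1, head_2])           # concatenate layer
hr = dense(dense(merged, 128), 1, relu=False)       # single HR estimate

print(head_1.size, head_2.size, merged.size, hr.shape)
```

Under these assumptions, the small-kernel head contributes 27 x 128 features and the large-kernel head 23 x 192 features to the concatenated vector.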
Each parallel part of the network creates its own map of features, which are then combined in the concatenate layer. The resulting features are further processed by a fully connected layer with 128 neurons, half of which are randomly switched off during training by the dropout layer. The CNN architectures created in this way (with one to three additional convolutional layers, and the multi-headed model) were trained on the available data. The results are shown in Table 4, where L corresponds to the number of additional convolutional layers used.
The results obtained in this experiment indicate that deepening the network architecture used did not have the intended effect. For the additional convolutional layers, the differences in the results obtained were minimal, indicating that the subsequent feature maps did not add any new information to the network. Surprisingly, the multi-headed model gave the best results for most participants.
Further improvement was sought by increasing the complexity of the network through the use of recurrent layers. The use of long short-term memory (LSTM) cells can enable the capture of temporal dependencies in the changing signals from accelerometers [18,19]. To this end, the most promising multi-headed CNN model was enriched with LSTM cells (MH Conv-LSTM DeepPPG). The LSTM cells were added directly after the convolutional layers, which allowed them to process the feature maps obtained using the two convolutional layers of each model head. This experiment investigated 3 upgrades to the existing architecture using LSTM cells. The first two architectures add a single LSTM layer or stacked LSTM layers per head, while the third architecture uses a time-distributed LSTM layer.
In the architecture using single LSTM layers, each head used a layer with 128 cells. The outputs of the two parallel heads were combined in the concatenation layer and then passed to a fully connected layer of 256 neurons, which corresponded to the size of the combined features. In the architecture using stacked LSTM layers, the subsequent LSTM layers had 256 units each. This choice was intended to provide a broader map of features for combining the two heads. The resulting features were then processed by a fully connected layer of size 512, with the number of neurons in this layer corresponding to the output size of the concatenation layer. Both tested architectures used a dropout layer with a dropout size of 0.5 to regularize the network.
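The single-LSTM variant can be illustrated with a small NumPy sketch in which each head's feature maps are read by an LSTM layer step by step and only the final hidden state is kept, so that the concatenated vector has 2 x 128 = 256 entries. The weights are random and untrained, the gate ordering is arbitrary, and the feature map sizes (27 x 128 and 23 x 192) assume unpadded convolutions; none of this is taken from the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_last_state(seq, units):
    """Run a randomly initialised LSTM over seq (time, features) and
    return the hidden state after the last time step."""
    features = seq.shape[1]
    w = rng.normal(scale=0.1, size=(features + units, 4 * units))
    b = np.zeros(4 * units)
    h = np.zeros(units)          # hidden state
    c = np.zeros(units)          # cell state
    for x_t in seq:
        z = np.concatenate([x_t, h]) @ w + b
        i, f, g, o = np.split(z, 4)          # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


# Stand-ins for the feature maps produced by the two convolutional heads.
maps_head_1 = rng.normal(size=(27, 128))
maps_head_2 = rng.normal(size=(23, 192))

merged = np.concatenate([
    lstm_last_state(maps_head_1, 128),
    lstm_last_state(maps_head_2, 128),
])
print(merged.shape)   # (256,)
```

Because each LSTM summarizes its whole sequence in one 128-dimensional state, the concatenated size no longer depends on the temporal length of the feature maps.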
The use of a time-distributed LSTM layer in the last architecture in this test allowed individual convolutional layers to be applied to each temporal slice of the input data. To make this possible, each time window of 256 data points was further divided into 8 steps of 32 data points for each accelerometer channel. Thus, the convolutional layers did not extract features from the 256 data points of a channel as a whole, but separately for each of the 8 steps of the time window. Training the network for each fold and architecture considered gave the results shown in Table 5.
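The slicing used for the time-distributed variant amounts to a reshape of each window, as in the following NumPy sketch. The array here is a synthetic placeholder for one 8 s, 3-channel accelerometer window.

```python
import numpy as np

# One 8 s window at 32 Hz with 3 accelerometer channels (placeholder values).
window = np.arange(256 * 3).reshape(256, 3)

# Split into 8 consecutive one-second steps of 32 samples each.
steps = window.reshape(8, 32, 3)

print(steps.shape)                                   # (8, 32, 3)
print(np.array_equal(steps[1], window[32:64]))       # True
```

Each of the 8 slices covers exactly one second of signal, so the convolutional layers applied per slice never see more than 32 consecutive samples.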
The biggest disappointment of the experiment was that the model using the time-distributed LSTM layer performed even worse than the architectures without LSTM cells. The reason for this may be the lack of significant temporal dependencies within single seconds of the window, which would explain the use of 8 s windows in other existing solutions to the HR estimation problem. Nevertheless, the experiment was successful, and the subsequent experiments use the multi-headed model with the best-performing single LSTM layers.
The last experiment in the network architecture design process concerned the size of the last fully connected layer. Four different numbers of neurons S (128, 256, 512, 1024) were considered. The task of the fully connected layer was to process the final high-level time-dependent features obtained from the LSTM cells. Selecting the right number of neurons in this layer enabled the network to efficiently learn nonlinear combinations of these features. The network architecture used in this experiment was the same as the best MH Conv-LSTM DeepPPG architecture, only the number of neurons in the fully connected layer was changed between tests. Evaluation results of this experiment are shown in Table 6.
For 512 neurons in the fully connected layer, the results for all participants in the experiment were noticeably better, which prompted the use of this configuration. The size of the fully connected layer was the subject of the last experiment conducted to determine the final network architecture. The network architecture resulting from these experiments was used for the final solution for HR estimation:
- head 1
  - 1D convolutional layer (number of filters = 64, kernel size = 3, strides = 1),
  - max-pooling layer (pool size = 3, strides = 3),
  - 1D convolutional layer (number of filters = 128, kernel size = 3, strides = 1),
  - max-pooling layer (pool size = 3, strides = 3),
  - LSTM layer (number of cells = 128),
  - flattening layer,
- head 2
  - 1D convolutional layer (number of filters = 96, kernel size = 12, strides = 1),
  - max-pooling layer (pool size = 3, strides = 3),
  - 1D convolutional layer (number of filters = 192, kernel size = 12, strides = 1),
  - max-pooling layer (pool size = 3, strides = 3),
  - LSTM layer (number of cells = 128),
  - flattening layer,
- concatenate layer (merging two model heads),
- fully connected layer (number of neurons = 512),
- dropout layer (size = 0.5),
- fully connected layer (number of neurons = 1).
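As a rough indication of model size, the trainable parameters of the listed layers can be counted as in the sketch below. It assumes 4 input channels (PPG plus 3 accelerometer axes), bias terms in every layer, a standard LSTM parameterization with four gates, and LSTM layers that output only their final hidden state, so that the flattening layers leave the 128 values per head unchanged. These are assumptions for illustration; the text does not report a parameter count.

```python
def conv1d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * in_channels * filters + filters


def lstm_params(in_features, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (in_features + units) + units)


def dense_params(in_features, units):
    """Weights plus one bias per neuron."""
    return in_features * units + units


def head_params(in_channels, filters, kernel, lstm_units=128):
    """Two convolutional layers followed by one LSTM layer."""
    f1, f2 = filters
    return (conv1d_params(kernel, in_channels, f1)
            + conv1d_params(kernel, f1, f2)
            + lstm_params(f2, lstm_units))


in_channels = 4                                   # assumption: PPG + 3 accelerometer axes
head_1 = head_params(in_channels, (64, 128), kernel=3)
head_2 = head_params(in_channels, (96, 192), kernel=12)
fully_connected = dense_params(2 * 128, 512)      # concatenated LSTM outputs -> 512
output = dense_params(512, 1)

total = head_1 + head_2 + fully_connected + output
print(head_1, head_2, fully_connected, output, total)
```

Max-pooling, dropout, flattening, and concatenation add no trainable parameters, so they do not appear in the count.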