A Lightweight Convolutional Neural Network Method for Two-Dimensional PhotoPlethysmoGraphy Signals

Abstract: Data security on wearable devices has become a significant concern among users, making it urgent to explore authentication methods suited to such devices. PhotoPlethysmoGraphy (PPG) signals have been shown to be effective for biometric identity authentication. This paper proposes a convolutional neural network authentication method based on two-dimensional (2D) PPG signals for wearable devices. The method uses the Markov Transition Field technique to convert one-dimensional PPG signal data into two-dimensional image data, which retains the characteristics of the signal while enriching its spatial information. Because wearable devices usually have limited resources, the method also includes a lightweight convolutional neural network model that reduces resource consumption and computational complexity while maintaining high performance. Experiments show that the method achieves 98.62% and 96.17% accuracy on the training set and test set, respectively, outperforming the traditional one-dimensional deep learning methods and the classical two-dimensional deep learning methods used for comparison.


Introduction
Wearable devices have become widely used in recent years. Their portability, real-time operation, and intelligence make them indispensable to people's lives. As these devices have spread, the user information collected by their built-in sensors has raised concerns about data security. The information collected by wearable devices covers many aspects of a user's life, including sensitive data such as daily behaviors, location trajectories, environmental information, and physiological health. Once this information is leaked or illegally obtained, it may be used for illegal activities that seriously threaten the user's privacy and personal safety.
To better protect users' data security, researchers have turned their attention to biometric authentication methods. Identity authentication based on photoplethysmography (PPG) signals is one of them. The PPG signal is obtained non-invasively by photoplethysmography. As the heart contracts and relaxes periodically, blood flow in the body increases and decreases. When the light emitted by the light source penetrates the skin tissue, the hemoglobin in the blood absorbs part of it, so the intensity of the reflected or transmitted light weakens and strengthens accordingly. The light receiver converts this change in light intensity into an electrical signal, the PPG signal [1]. Existing research shows that by analyzing the original PPG signal and its derivatives, and the key points extracted from them such as the systolic peak and the diastolic peak, a series of fiducial features can be obtained that serve as unique biometric information of the user [2][3][4]. Moreover, the PPG signal is collected by the sensor in the wearable device without additional user actions, which makes continuous user authentication more natural. Because the PPG signal monitors blood circulation continuously, it supports continuous identity verification and can improve the accuracy and reliability of identity authentication. PPG signals meet the essential characteristics of biometrics: universality, persistence, uniqueness, and ease of collection. Therefore, using PPG signals as biometrics for identity authentication is feasible and can bring new possibilities to biometric identity authentication.
Consequently, this study proposes a deep learning-based identity authentication method for PPG signals. The primary contributions are as follows. Firstly, the filtered one-dimensional temporal PPG data is partitioned into individual cycles and transformed into two-dimensional images using the Markov Transition Field (MTF) technique [5]. Unlike most existing identity authentication methods, which directly use one-dimensional PPG signals as inputs, this paper first converts the one-dimensional temporal data into two-dimensional image data before feeding it into a convolutional neural network (CNN). Two-dimensional image data is richer than one-dimensional time series data because of its spatial information. In color images in particular, the three independent channels of red, green, and blue each carry different image information. Therefore, two-dimensional image data offers clear advantages for feature extraction by convolutional neural networks. Secondly, a lightweight CNN model is devised that integrates the concepts of Depthwise Separable Convolution [6] and Residual Structures [7]. This model maintains high recognition accuracy while reducing memory consumption and improving runtime efficiency. It is well suited to resource-constrained devices such as wearable devices, which have small form factors, low power budgets, and limited computational resources. In this work, identity authentication is treated as a classification problem. Experimental results show that the proposed lightweight CNN model achieves an accuracy of 98.62% on the training set and 96.17% on the test set, surpassing the traditional deep learning methods used for comparison. These results indicate that the approach is feasible.
The use of PPG signals for biometric identification has garnered significant attention and application in the current literature. Researchers have actively explored methods that use PPG signals to distinguish between individuals by extracting unique and stable features from these signals. Existing methods fall into two primary categories.
In the first category, the waveform characteristics of the signal are analyzed and features are extracted from both the time domain and the frequency domain. Kavsaoğlu et al. [8] extracted 40 time-domain features from the PPG signal and its first and second derivatives and proposed a feature ranking algorithm. They further verified the effectiveness of the algorithm with a k-NN classifier. Jaafar et al. [9] used the second derivative of PPG signals, known as APG signals, obtained from ten individual users. The morphology of the APG signals was analyzed to extract features, which were then classified using a Bayesian network. Meanwhile, Salanke et al. [10] segmented PPG signals based on P-wave intervals and used principal component analysis to extract features, which were then used for signal classification.
In the second category, deep neural network models are employed for automatic feature learning and classification of PPG signals. Deep learning methods can learn useful features directly from raw data without requiring manual feature design and selection. Li et al. [11] designed and trained a multi-scale feature fusion deep learning (MFFD) model. This model is mainly based on a convolutional neural network architecture and is used to extract the features of PPG signals and to distinguish individuals based on each person's unique PPG pattern. Wei et al. [12] first proposed a PPG enhancement technique to generate multi-scale PPG signals and then proposed a deep end-to-end model to extract and classify multi-scale features. Seok et al. [13] proposed a one-dimensional Siamese neural network biometric model based on PPG, which reduced noise and retained individual characteristics through a multi-period averaging method. Abbani et al. [14] used a bidirectional long short-term memory deep learning algorithm to design an identity authentication model based on PPG signals. Dwaipayan et al. [15] designed a deep learning model, CorNET, which combines two convolutional neural network layers and two long short-term memory layers for identity authentication tasks. Jordi et al. [16] proposed an end-to-end recognition architecture that operates on the original PPG signal and is built mainly from a convolutional neural network. Other methods convert PPG signals into two-dimensional image data and then apply deep learning methods for classification. Cherry et al. [17] transformed one-dimensional PPG signals into two-dimensional spectrograms and employed convolutional neural networks for automatic feature extraction and classification. Using the scalogram technique, Mostafa et al. [18] converted one-dimensional PPG signals into two-dimensional images. They developed a CVT-ConvMixer classifier with attention mechanisms to achieve individual identity recognition.
The main problem of the first category of methods is that features extracted by analyzing the waveform or the time-domain and frequency-domain characteristics of the signal are often not comprehensive enough, and manual feature processing is prone to errors. In the second category, the deep learning models used have relatively complex network structures and many parameters, which results in high computational complexity and heavy resource consumption. The method in this article uses a neural network model to extract features from, and classify, PPG signals that are automatically converted into two-dimensional images. Compared with the first category, the method does not need manual feature processing but realizes end-to-end learning from the original input to the final output, which simplifies the whole process of feature extraction and classification and improves the efficiency and practicality of the model. Deep learning methods can learn multi-level and multi-scale feature representations of data through multi-layer neural network structures. This enables the model to perform effective feature extraction and recognition on new, unseen data and gives it strong generalization capabilities. Compared with the second category, the deep learning model proposed in this article reduces the model's size by optimizing the network structure and reducing the number of parameters. This makes the model more efficient to store and transmit, especially in resource-constrained environments such as wearable devices, without sacrificing accuracy.

Methods
This section describes the identity authentication method based on PPG signals. The method comprises three phases. The first step is signal preprocessing to eliminate noise interference and enhance signal quality. Subsequently, the preprocessed one-dimensional signal is converted into a two-dimensional image to facilitate subsequent feature extraction and recognition. Lastly, the transformed two-dimensional signal is classified using a lightweight convolutional neural network, thus achieving identity authentication. The workflow is depicted in Figure 1.

PPG Signal Preprocessing
Various types of noise often accompany PPG signals, acting as interference during the collection process [19]. These noises mainly include baseline drift, which arises from respiratory fluctuations and the instability of amplification circuits; power line interference, originating from AC power sources; electromyographic noise, resulting from limb tremors and muscle contractions; and motion artifacts, caused by changes in the optical measurement due to bodily movements.
To enhance the quality of PPG signals, we used a 3rd-order bandpass Butterworth filter with a passband of 0.5 Hz to 8 Hz. A 3rd-order Butterworth design balances frequency selectivity against phase response, which limits distortion and signal delay. The design of this filter takes into account the characteristics of the PPG signal. The upper cutoff frequency of 8 Hz filters out high-frequency interference caused by electromyographic noise and power line interference. The lower cutoff frequency of 0.5 Hz filters out low-frequency interference caused by baseline drift. The filtered PPG signals exhibited a significant quality improvement, forming the basis for subsequent identity authentication tasks. Furthermore, the filtered PPG signals underwent amplitude normalization to address amplitude variations between signals. This step scaled all signals to a common dynamic range of 1, further supporting the accuracy and stability of the identity authentication process.
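The preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `preprocess_ppg`, the zero-phase (forward-backward) filtering, and the min-max scaling to [0, 1] are assumptions, since the paper only specifies a 3rd-order Butterworth bandpass with 0.5 Hz and 8 Hz cutoffs and a normalized dynamic range of 1. The 125 Hz sampling rate is taken from the Dataset section.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS_HZ = 125.0  # sampling rate reported for the dataset in this paper


def preprocess_ppg(raw, fs=FS_HZ, low_hz=0.5, high_hz=8.0, order=3):
    """Band-pass filter a PPG trace and rescale it to the range [0, 1]."""
    raw = np.asarray(raw, dtype=float)
    # Second-order sections are numerically safer than (b, a) coefficients.
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    # Assumption: forward-backward filtering (zero phase); the paper does not say.
    filtered = sosfiltfilt(sos, raw)
    span = filtered.max() - filtered.min()
    if span == 0:
        return np.zeros_like(filtered)
    # Assumption: min-max scaling, which gives a dynamic range of exactly 1.
    return (filtered - filtered.min()) / span
```

For example, `preprocess_ppg(trace)` on a 10 s trace returns an array of the same length with baseline drift and 50 Hz interference suppressed and values between 0 and 1.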

Two-Dimensional Signal Transformation Methods
Although various deep learning algorithms, such as 1D-CNN and LSTM, have been developed to handle one-dimensional time series data like PPG signals, two-dimensional images possess a wealth of spatial information and structural characteristics. Deep learning methods applied to images can capture edges, corners, and other informative details, which improves learning efficiency. Hence, this paper converts the one-dimensional PPG signals into two-dimensional images before classification.
This paper adopts the Markov Transition Field (MTF) method to transform the one-dimensional PPG signals into two-dimensional images. MTF is an image encoding technique that uses the Markov transition matrix to encode time series data. It treats the temporal evolution of the time series as a Markov process, where the future state depends solely on the present state, independent of the past states. By constructing the Markov transition matrix to reflect this concept, we encode time series data as images by extending the matrix into a Markov Transition Field. The main steps involved in this process are as follows:
1. Divide the time series data into Q quantile bins and label these bins from 1 to Q in sequence.
2. Replace each data point in the time series with its bin number.
3. Calculate the transition frequency between each pair of quantile bins along the time axis, treating the time series as a 1st-order Markov chain, and construct a Q × Q transition matrix W accordingly, as shown in Formula (1). In this matrix, the element ωij represents the transition frequency from quantile bin i to quantile bin j. The transition matrix provides quantitative information about the bin transition patterns in the time series.
4. Extend the transition matrix W to the Markov Transition Field M by arranging the transition probabilities along the time positions, as shown in Formula (2). Here, x1 to xN are the elements of the time series, qi and qj are the quantile bins containing the data points at time steps i and j, respectively, and Mij is the transition probability from qi to qj.
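For reference, the standard definition of the Markov Transition Field from the time series imaging literature, which Formulas (1) and (2) are assumed to follow, can be written as below. The row normalization in Formula (1) is part of that standard definition and is an assumption here.

```latex
% Formula (1): Q x Q Markov transition matrix, rows normalized to sum to 1
W =
\begin{bmatrix}
\omega_{11} & \cdots & \omega_{1Q} \\
\vdots      & \ddots & \vdots      \\
\omega_{Q1} & \cdots & \omega_{QQ}
\end{bmatrix},
\qquad \sum_{j=1}^{Q} \omega_{ij} = 1
\tag{1}

% Formula (2): N x N Markov Transition Field
M =
\begin{bmatrix}
\omega_{ij \mid x_1 \in q_i,\, x_1 \in q_j} & \cdots & \omega_{ij \mid x_1 \in q_i,\, x_N \in q_j} \\
\vdots & \ddots & \vdots \\
\omega_{ij \mid x_N \in q_i,\, x_1 \in q_j} & \cdots & \omega_{ij \mid x_N \in q_i,\, x_N \in q_j}
\end{bmatrix}
\tag{2}
```

In words, the entry of M at time positions (k, l) is the transition probability in W from the bin containing x_k to the bin containing x_l, so M spreads the Q × Q matrix over all N × N pairs of time steps.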
This paper uses a peak detection method to split the long PPG recording into independent single-cycle signals. The single-cycle signals are then used as input for the neural network model. During segmentation, all abnormal single-cycle signals are eliminated to ensure the accuracy and reliability of subsequent analysis. Subsequently, these single-cycle signals are converted into two-dimensional images with a size of 28 × 28 using the Markov Transition Field method, as shown in Figure 2.
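The encoding steps of this subsection can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `markov_transition_field` is hypothetical, bins are numbered from 0 rather than 1, each row of W is normalized to sum to 1, and the resizing of the N × N field to the 28 × 28 images used in the paper is not shown.

```python
from bisect import bisect_right


def markov_transition_field(series, n_bins=8):
    """Return (bins, W, M) for a 1-D sequence of numbers."""
    n = len(series)
    sorted_vals = sorted(series)
    # Step 1: interior quantile edges, so each bin holds about n / n_bins points.
    edges = [sorted_vals[(k * n) // n_bins] for k in range(1, n_bins)]
    # Step 2: replace each data point by its bin index (0 .. n_bins - 1).
    bins = [bisect_right(edges, x) for x in series]
    # Step 3: count first-order transitions between consecutive time steps.
    counts = [[0] * n_bins for _ in range(n_bins)]
    for a, b in zip(bins, bins[1:]):
        counts[a][b] += 1
    # Assumption: normalize each row so that it sums to 1 (rows with no
    # outgoing transitions stay all zero).
    W = []
    for row in counts:
        total = sum(row)
        W.append([c / total if total else 0.0 for c in row])
    # Step 4: M[i][j] is the transition probability from the bin of x_i
    # to the bin of x_j, for every pair of time positions.
    M = [[W[bins[i]][bins[j]] for j in range(n)] for i in range(n)]
    return bins, W, M
```

Calling `markov_transition_field(cycle, n_bins=8)` on one single-cycle signal of length N returns an N × N field; the number of bins is a free parameter that the paper does not report.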

Lightweight Convolutional Neural Network, LW-CNN
Since this paper studies identity authentication on wearable devices, which have limited resources, we propose a lightweight convolutional neural network, LW-CNN, that combines depthwise separable convolution with residual connections. Compared with traditional convolutional neural networks (CNN), LW-CNN reduces the complexity of the model by using fewer network layers, smaller convolution kernels, and a more straightforward network structure. Because of this simpler structure and smaller parameter count, both the forward and the backward pass require less computation, which improves efficiency. Moreover, the residual structure makes the network more stable to train, which improves its performance and robustness.
Depthwise separable convolution can be regarded as a special convolution operation. In a traditional convolution, each convolution kernel operates simultaneously on all channels of the input feature map to generate one channel of the output feature map. If the input feature map has N channels, each convolution kernel therefore requires N sub-kernels, one per input channel. Depthwise separable convolution divides this process into two stages. The first stage is the depthwise convolution layer, in which each input channel has its own 3 × 3 convolution kernel for an independent convolution. This means there are N convolution kernels for N input channels, and each kernel convolves only one input channel. In this way, the depthwise convolution layer extracts the features of each channel independently while keeping the number of parameters small. The second stage is the pointwise convolution layer, which applies a 1 × 1 convolution kernel to the output of the depthwise convolution. This 1 × 1 convolution is a cross-channel linear transformation. Its function is to fuse and combine the features of all channels to generate the final output feature map. In this way, depthwise separable convolution can significantly reduce the model's parameters while maintaining the spatial filtering capability of the convolution. Specifically, for a traditional convolutional layer with N input channels and M output channels, the number of parameters is N × M × K × K (where K is the size of the convolution kernel). The number of parameters of the corresponding depthwise separable convolution layer is reduced to N × K × K (depthwise convolution layer) + M × N (pointwise convolution layer). Usually, M and N are both large, so this reduction is very significant. Figure 3 is a schematic diagram of depthwise separable convolution.
The residual structure was proposed to address exploding and vanishing gradients in neural network training. The core idea is to build a neural network from residual blocks. These residual blocks create a shortcut connection between the input and output, allowing information to pass more smoothly through the network. Specifically, each residual block contains multiple convolutional layers used to extract and transform features. At the same time, the shortcut connection allows the input signal to skip one or more layers and be added directly to the output of subsequent layers. This structure ensures that gradients can flow back to previous layers more effectively during backpropagation, thereby mitigating the problem of vanishing gradients. Mathematically, the shortcut connection can be expressed as y = F(x) + x, where y represents the output of the block, F(x) represents the feature map obtained through operations such as convolution, and x is the output of the previous layer or layers, that is, the input of the shortcut connection. This addition not only preserves the information of the original input but also allows the network to focus on learning the residual, that is, the difference between the output and the input. Residual learning helps the network extract features more efficiently and makes the training of deep networks more stable. Because the derivative of y with respect to x contains an identity term, the gradient can be propagated effectively even in deep networks, alleviating the vanishing gradient problem. In addition, the convolutional layers in the residual block are usually accompanied by batch normalization and ReLU activation functions, which further improves the stability and convergence speed of the network. Figure 4 shows a schematic diagram of a three-layer residual structure.
The lightweight convolutional neural network model, LW-CNN, employed in this study primarily comprises an initial convolutional layer, three residual blocks, a global average pooling layer, a Dropout layer, and a fully connected layer. In the initial convolutional layer, a 3 × 3 kernel performs convolution on the input image with a stride of 2. Padding is applied to the image edges to keep the spatial dimensions consistent, and the original three-channel image is transformed into a feature map with 32 channels. The three residual blocks expand the number of channels in the feature map to 64 in the first block and 128 in the second block, and keep it at 128 in the third block. Each residual block contains two convolutional layers: the first uses a 3 × 3 kernel, and the second uses a 1 × 1 kernel. The stride of the 3 × 3 convolutional layer is 1 in the first and third residual blocks and 2 in the second residual block. A shortcut connection in each residual block adds the block's input to the output of its convolutional path, which addresses vanishing gradients and ensures a smooth flow of information. A batch normalization layer and a ReLU activation function follow the initial convolutional layer and the convolutional layers in the residual blocks. The former stabilizes the training process and speeds up model convergence, while the latter enhances the model's representational capacity and improves classification accuracy. Subsequently, a global average pooling layer reduces the spatial dimensions of the feature map to 1 × 1, yielding a feature vector. A Dropout layer is introduced to further enhance the model's generalization capability. During training, the Dropout layer randomly discards some neuron outputs, preventing the model from over-relying on specific neurons and thereby enhancing its robustness. Finally, the feature vector is fed into a fully connected layer for classification to generate the final prediction. Experimental results show that setting the dropout rate to 0.5 achieves the best prediction performance. The schematic diagram of this lightweight convolutional neural network is depicted in Figure 5.
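The architecture described in this section can be sketched in PyTorch as follows. This is one possible reading of the description, not the authors' code. The class names `ResidualBlock` and `LWCNNSketch` are hypothetical. The sketch assumes that the 3 × 3 layer in each block is depthwise and the 1 × 1 layer is pointwise, that a 1 × 1 projection is used on the shortcut whenever the channel count or stride changes, and that the output has 16 classes (one per individual in the dataset).

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """3x3 depthwise conv + 1x1 pointwise conv with a shortcut connection."""

    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.body = nn.Sequential(
            # Assumption: the 3x3 layer is depthwise (one kernel per channel).
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                      groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            # The 1x1 pointwise layer mixes channels and sets the output width.
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        if stride != 1 or in_ch != out_ch:
            # Assumption: a 1x1 projection matches shapes for the addition.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))  # y = F(x) + x


class LWCNNSketch(nn.Module):
    """Initial conv, three residual blocks, global pooling, dropout, FC."""

    def __init__(self, num_classes=16, dropout=0.5):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(
            ResidualBlock(32, 64, stride=1),
            ResidualBlock(64, 128, stride=2),
            ResidualBlock(128, 128, stride=1),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling to 1x1
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.blocks(self.stem(x))
        x = torch.flatten(self.pool(x), 1)
        return self.fc(self.dropout(x))
```

With a 3 × 28 × 28 input, this sketch produces a 128 × 7 × 7 feature map before pooling and a 16-dimensional output; `sum(p.numel() for p in model.parameters())` gives its parameter count, which need not match the figure reported in Table 2.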

Experiment
The experiments in this study were performed on a computer equipped with an Intel(R) Core(TM) i5-7300HQ CPU operating at 2.50 GHz with 8.00 GB of memory. The development environment was PyCharm 2022, with Python as the programming language. All neural networks were built, trained, and evaluated using the PyTorch deep learning framework.

Dataset
The PPG signals used in this study were sourced from the BIDMC PPG and Respiration Dataset [20] available on Physionet. The PPG signals of 16 individuals, sampled at 125 Hz, were selected as research samples. These individuals included eight men and eight women. The oldest was 88 years old, the youngest was 19 years old, and the average age was 41 years old, which supports the representativeness of the data. After preprocessing the dataset and eliminating aberrant values, 4439 two-dimensional images were generated, with each individual contributing between 200 and 350 images. For model training, 70% of each individual's images were allocated to the training set. The remaining 30% formed the test set used to evaluate the model's performance and accuracy.
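The per-individual 70/30 split can be sketched as follows. This is a minimal illustration: the function name `split_per_subject`, the random shuffling, and the fixed seed are assumptions, since the paper does not state how the images within each individual were assigned to the two sets.

```python
import random


def split_per_subject(samples_by_subject, train_fraction=0.7, seed=0):
    """Split each subject's samples into train and test (sample, label) pairs."""
    rng = random.Random(seed)
    train, test = [], []
    for subject_id, samples in samples_by_subject.items():
        shuffled = list(samples)
        # Assumption: samples are shuffled before splitting.
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        train += [(s, subject_id) for s in shuffled[:cut]]
        test += [(s, subject_id) for s in shuffled[cut:]]
    return train, test
```

Splitting within each individual, rather than over the pooled dataset, keeps every individual represented in both sets in the same proportion.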

Experimental Results and Analysis
For training and testing the LW-CNN model, this experiment used the cross-entropy loss function and the Adam optimizer. The cross-entropy loss measures the difference between the model's predictions and the actual labels and provides guidance for model optimization. The closer the loss value is to 0, the closer the model's predictions are to the actual labels. The Adam optimizer adapts each parameter's learning rate based on the gradient information of the loss function to achieve a more efficient optimization process. After repeated trials and adjustments, we set the learning rate of the Adam optimizer to 0.001. This learning rate balances convergence speed and model stability, ensuring efficient training. In addition, we set the number of epochs to 80 and the batch size to 128. After training, the accuracy of the LW-CNN model reached 98.62% on the training set and 96.17% on the test set, showing good performance. Figure 6a shows the accuracy curves of the training set and test set over the 80 epochs. As the number of epochs increases, the accuracy gradually increases and stabilizes. Figure 6b shows the loss curves of the training set and the test set. As the number of epochs increases, the loss value gradually decreases and finally converges to a low level. These results support the effectiveness and stability of the LW-CNN model.
To evaluate the performance of the LW-CNN model more comprehensively and in depth, in addition to observing the accuracy and loss curves, we drew a confusion matrix to analyze the classification results in detail. The confusion matrix shows, for each category, the number of correctly and incorrectly classified samples, so the model's performance in each category can be observed directly. Precision, accuracy, recall, and F1-score can also be calculated from the confusion matrix. Precision measures the proportion of samples predicted as a given class that actually belong to that class, while accuracy is the percentage of correct predictions out of all predictions made. Recall is the proportion of samples that actually belong to a given class that are correctly predicted as that class, and the F1-score balances precision and recall by taking their harmonic mean. The confusion matrix for the LW-CNN model is depicted in Figure 7, with corresponding metrics of accuracy (96.17%), precision (96.24%), recall (96.28%), and F1-score (96.11%).
This study conducted two control experiments to compare LW-CNN with other models. The hyperparameters, including the number of epochs, batch size, learning rate, and dropout rate, were the same as those of the LW-CNN model described earlier.
In the first set of experiments, the LW-CNN model used two-dimensional image data transformed from one-dimensional temporal data as inputs. Two other neural network models that directly process one-dimensional temporal data were selected as references. The first model was a one-dimensional convolutional neural network (1D-CNN) [21], which extracts features directly from the raw one-dimensional temporal data and performs classification through fully connected layers. The second model was a long short-term memory network (LSTM) [22], which excels at processing sequence data with long-term dependencies. LSTM captures the dynamic variations in temporal data through its internal gating mechanisms and memory units. Table 1 lists the performance of these three networks in terms of accuracy, precision, recall, and F1-score. As seen from the table, compared with the traditional neural network models that directly process one-dimensional time series data, converting the one-dimensional time series data into two-dimensional image data and then training LW-CNN on it is a more effective way to improve classification accuracy. In addition, Table 2 gives the total number of parameters, the total parameter size, and the time required for each iteration of the three networks. LW-CNN has fewer parameters and a faster training speed than the other two neural networks that directly process one-dimensional time series signals.
In the second set of experiments, LW-CNN, as a lightweight convolutional neural network, was compared with two classic convolutional neural network models, AlexNet and GoogLeNet. Both models perform well in image classification tasks, and each has its own characteristics. AlexNet extracts and classifies image features by stacking multiple convolutional, pooling, and fully connected layers. GoogLeNet introduces the Inception module, which improves the model's feature extraction capabilities and performance by using convolution kernels of different sizes and pooling operations in parallel. Table 3 shows the three networks' accuracy, precision, recall, and F1-score performance.
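The four metrics reported in Figure 7 and Tables 1 and 3 can be derived from a multi-class confusion matrix as sketched below. The function name `macro_metrics` is hypothetical, and macro averaging (the unweighted mean of the per-class values) is an assumption, since the paper does not state which averaging was used.

```python
def macro_metrics(cm):
    """Compute accuracy and macro precision, recall, F1 from a confusion matrix.

    cm[t][p] is the number of samples with true class t predicted as class p.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    # Assumption: macro averaging over classes.
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Under macro averaging, the F1-score is the mean of the per-class F1 values, so it can fall slightly below both the averaged precision and the averaged recall, as it does in the figures reported for LW-CNN.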

Figure 1.Overall methodology flowchart for identity authentication using PPG signals.

Figure 2. Conversion of one-dimensional data to two-dimensional.
Figure 3. Schematic diagram of depthwise separable convolution.

Table 1. Comparison of the performance of LW-CNN, 1D-CNN, and LSTM in terms of accuracy, precision, recall, and F1-score.

Table 2. Comparison of the performance of LW-CNN, 1D-CNN, and LSTM.

Table 3. Comparison of the performance of LW-CNN, AlexNet, and GoogLeNet in terms of accuracy, precision, recall, and F1-score.