Deep Learning and Vision-Based Early Drowning Detection

Abstract: Drowning is one of the top five causes of death for children aged 1–14 worldwide. According to data from the World Health Organization (WHO), drowning is the third most common reason for unintentional fatalities. Designing a drowning detection system is becoming increasingly necessary in order to ensure the safety of swimmers, particularly children. This paper presents a computer vision and deep learning-based early drowning detection approach. We utilized five convolutional neural network models and trained them on our data. These models are SqueezeNet, GoogleNet, AlexNet, ShuffleNet, and ResNet50. ResNet50 showed the best performance, as it achieved 100% prediction accuracy with a reasonable training time. When compared to other approaches, the proposed approach performed exceptionally well in terms of prediction accuracy and computational cost.


Introduction
According to the World Health Organization (WHO) report [1], drowning is the third most common reason for unintentional fatalities worldwide for children and young people aged 1–14 years, with children under the age of 5 at highest risk. There are an estimated 236,000 annual drowning deaths around the world [2]. In 48 of the 85 countries, drowning is one of the top five causes of death for children between the ages of 1 and 14 [3]. According to [4], as the population increases and the development of hotels and villas with swimming pools becomes more popular, the death rate due to drowning will increase. Governments and organizations have carried out several investigations to find appropriate ways to save people. These include informing parents about the dangers of drowning through child surveillance programs, encouraging the fencing or draining of garden ponds and domestic swimming pools, and increasing the supervision of swimming in lakes, rivers, and beaches in order to reduce the number of accidents. Unfortunately, these solutions are rudimentary and not sufficient. An effective reduction in drowning and the assurance of pool safety can be achieved through the implementation of a smart automated monitoring system.
There are several approaches for automatic drowning detection, which can be categorized into two classes. The approaches of the first category are based on wearable sensing devices that are attached to the swimmer through a wristband or goggles. These sensors monitor the swimmer's behavior by providing measurements such as heart rate, blood oxygen level, motion, hydraulic pressure, and depth. The second category involves vision-based approaches where overhead or underwater cameras are used to monitor the swimmers, and machine learning (ML) algorithms are employed to detect drowning instances from the output of these cameras.
The main contribution of this study is the proposal of a technique that automatically detects drowning victims using deep learning convolutional neural networks (CNNs). We investigated five pretrained CNN models for identifying drowning cases within a swimming pool. These models were AlexNet [5], GoogleNet [6], SqueezeNet [7], ShuffleNet [8], and ResNet50 [9]. After fine-tuning these models and training them on our dataset, all models distinguished drowning instances from normal swimming events with a very high prediction accuracy and confidence level.

Related Work
Automatic drowning detection approaches can be classified as sensor-based and ML-based approaches. There are very few ML-based drowning detection approaches available in the literature. Alotaibi [10] proposed a swimming pool monitoring system based on IoT and transfer learning. A motion sensor detects any objects around the swimming pool and sends a signal to an overhead camera, which captures a single image. The image is sent via wireless communication to the server station for processing and classification. The ResNet50 model classifies the detected object as human, animal, or object. Li et al. [11] proposed a technique for identifying drowning victims at sea. They created a dataset of 6079 images using a team of actors. They employed the YOLOv3 algorithm with some modifications. Briefly, a residual module with a channel attention mechanism was used in the feature extraction network, a bottom-up structure was added to the feature fusion network (FPN) structure, CIoU was used as the loss function, and a linear transformation method was used to deal with the anchor boxes generated by a clustering algorithm. The model classifies human targets into four categories: sea person, uncertain sea person, land person, and uncertain land person. The model achieved an accuracy of 72.17%.
Chan et al. [12] presented an AlexNet CNN model running on an NVIDIA Jetson Nano. The model was trained using 1168 drowning images and 2333 non-drowning images. The testing dataset contained 389 drowning images and 777 non-drowning images. The dataset was created by 30 volunteers who made different poses in the pool. The model achieved 85% classification accuracy. Handalage et al. [13] proposed a drowning rescue system with three main functions: detecting drowning victims, sending drones to victims, and detecting dangerous activities. The drowning detection component detects drowning victims through a CNN model. The second component is the rescue drone, which is sent to the victim's location coordinates. The third component detects dangerous activities such as running around the swimming pool and drinking. The drowning detection system was trained on 5000 images representing four categories: drowning stage 1, drowning stage 2, drowning stage 3, and not drowning. The primary source of data was video collected in real time with actors. The secondary source of data was the Internet. Swimmers in the pool were detected using an overhead camera. YOLO [14] was used to detect objects by locating one or more objects in the image and classifying each object. The CNN model was implemented on the NVIDIA Jetson Nano board in order to run multiple neural models in parallel.

Materials and Methods
The main contribution of this study is the development of an automated and intelligent system for monitoring swimming pools for early drowning detection. We utilize deep machine learning to efficiently process the swimmers' images and enable early detection of any drowning case.

Dataset
Our dataset contains 200 images that were collected through the Google search engine. The data consist of two classes (drowning and swimming), where each class includes 100 images representing swimmers of both genders and different ages. We used 100 of these images (50 drowning and 50 swimming) for model training and validation, while the other 100 images (50 drowning and 50 swimming) were used for model testing. The training and validation dataset was further split into 70% for training and 30% for validation. Samples of the drowning and swimming images are presented in Figure 1.
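The 70%/30% split of the 100 training and validation images can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB code used in the study; the file names are hypothetical, and the sketch assumes the split is made separately within each class, which the text does not state.

```python
import random


def stratified_split(items_by_class, train_fraction, seed=0):
    """Split each class separately so both subsets keep the class balance.

    Assumption: the per-class (stratified) split is illustrative only;
    the paper reports just the overall 70%/30% proportions.
    """
    rng = random.Random(seed)
    train, val = [], []
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train += [(name, label) for name in shuffled[:cut]]
        val += [(name, label) for name in shuffled[cut:]]
    return train, val


# Hypothetical file names standing in for the 50 + 50 training/validation images.
pool = {
    "drowning": [f"drowning_{i:03d}.jpg" for i in range(50)],
    "swimming": [f"swimming_{i:03d}.jpg" for i in range(50)],
}
train_set, val_set = stratified_split(pool, train_fraction=0.7)
print(len(train_set), len(val_set))
```

Under these assumptions, each class contributes 35 training and 15 validation images.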
Machine learning algorithms usually require large amounts of training data to perform well. In our case, the available data were limited, which may cause overfitting problems. To address this issue, we utilized data augmentation and transfer learning. Data augmentation is a mechanism for increasing the quantity of data by introducing slightly modified copies of existing data or synthetic data newly created from existing data [18]. It acts as a regularizer and helps to reduce overfitting when training a machine learning model. Data augmentation in deep learning takes the form of geometric modifications, flipping, color alteration, cropping, rotation, noise injection, and random erasure applied to the images [19].
After loading our data into the network, we used two forms of data augmentation: rotation and scaling. The rotational angle range was chosen separately for each CNN. For GoogleNet, ShuffleNet, AlexNet, and ResNet50, we used a rotational angle of −45° to 45°. After extensive trial and error, this range achieved the highest accuracy for these four CNNs. However, the rotational angle range for SqueezeNet that achieved the highest validation accuracy was −60° to 60°. On each image, random scaling factors in the range of 1 to 2 were applied for all five CNNs.
The experiments in this work were implemented using the Deep Learning Toolbox in MATLAB, where rotation angles and scale factors are picked randomly from continuous uniform distributions within the specified intervals. Each epoch produces slightly different transformed versions of each image in the training dataset while maintaining an equal number of training images across epochs. These transformed images are not stored in memory [20].
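The sampling behavior described above can be illustrated with a short sketch. It is written in Python for illustration only and is not the MATLAB implementation used in the study; the function names and the fixed seed are assumptions. It draws one rotation angle and one scale factor per image per epoch from continuous uniform distributions, so the number of training images stays constant while the transformations differ between epochs.

```python
import random


def sample_augmentation(rng, max_angle_deg=45.0, scale_range=(1.0, 2.0)):
    """Draw one rotation angle (degrees) and one scale factor for an image."""
    angle = rng.uniform(-max_angle_deg, max_angle_deg)
    scale = rng.uniform(*scale_range)
    return angle, scale


def augmentation_schedule(num_images, num_epochs, max_angle_deg=45.0, seed=0):
    """Return, for every epoch, a fresh (angle, scale) pair for every image.

    Only the parameters are generated; applying them to pixel data and
    discarding the transformed images afterwards is left to the framework.
    """
    rng = random.Random(seed)
    return [
        [sample_augmentation(rng, max_angle_deg) for _ in range(num_images)]
        for _ in range(num_epochs)
    ]


# 70 training images; use max_angle_deg=60.0 for the SqueezeNet setting.
schedule = augmentation_schedule(num_images=70, num_epochs=3)
print(schedule[0][0], schedule[1][0])  # same image, different epochs
```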

CNN Models
Deep learning algorithms, including CNNs, have led to significant advances in the field of computer vision in recent years. The major advantage of a CNN is that it can learn directly from input images, eliminating the need for preprocessing and feature extraction techniques [21,22].
Three of the most critical characteristics that are typically considered when selecting a convolutional neural network model are classification accuracy, computational time, and memory requirement [23]. Time constraints, computational limitations, and the unavailability of an adequate amount of training data often make it a significant challenge to build a CNN model from scratch, so using pretrained CNN models is considered a good option. There are several publicly available pretrained CNN models [19]. In this work, we examined five pretrained networks: AlexNet, GoogleNet, SqueezeNet, ShuffleNet, and ResNet50. These models were selected because they have been successfully implemented in many state-of-the-art research publications and have shown great performance in several applications.
AlexNet was one of the first deep convolutional networks that reached significant accuracy [5]. Overfitting is mitigated in AlexNet by using dropout layers, where a unit's output is dropped with a probability of 0.5 during training. A probability of 0.5 was chosen since it was the best fit for the network parameters and training options. Although this prevents the network from overfitting by allowing it to escape from undesirable local minima, it also roughly doubles the number of iterations required for convergence. The network has been used to classify millions of images into object categories, such as faces, fruit, cups, pencils, and animals. The network takes an image as input and outputs a label for the object in the image, together with the probabilities for each of the object categories. The input to the network is a 227 × 227 × 3 RGB image [24][25][26][27]. The AlexNet architecture is shown in Figure 2.

The inception module in the GoogleNet design solved most of the problems that huge networks faced [6]. GoogleNet has a top-5 error rate of 6.67%, which is very close to human performance. The design consists of 22 deep CNN layers while reducing the number of parameters to about four million (compared to 60 million for AlexNet). In addition to the 22 layers of GoogleNet, there are five pooling layers [28]. Nine inception modules are stacked linearly, and 1 × 1 convolution filters are also used. In part due to the parallel network implementation and layer reduction, the network has very good computational and memory efficiency. The model size is also smaller than that of other networks [24,27]. The GoogleNet architecture is presented in Figure 3.
SqueezeNet is a small CNN requiring less communication between servers during distributed training [7]. Smaller CNNs are also easier to implement on hardware with limited memory, such as a field-programmable gate array (FPGA). SqueezeNet is a convolutional neural network with 18 layers. The pretrained network classifies images into 1000 object categories, and it has learned rich feature representations for a wide variety of images. The goal of utilizing SqueezeNet is to obtain a smaller neural network with fewer parameters that can be readily fitted into computer memory and transmitted via a computer network [24,27,29]. The SqueezeNet architecture is shown in Figure 4.
ShuffleNet is a very resource-efficient CNN architecture that was created specifically for mobile devices with very little processing power [8]. The network architecture significantly lowers computation costs while retaining accuracy by using two new operations, pointwise group convolution and channel shuffle. The ShuffleNet architecture is illustrated in Figure 5.
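The channel shuffle operation used by ShuffleNet can be illustrated on a plain list of channel indices. The following Python sketch is for illustration only and is not part of the study's MATLAB implementation; it works on indices rather than feature maps. The channels are arranged into groups, the group and within-group axes are transposed, and the result is flattened, so that each group in the next layer receives channels from every group in the previous one.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (reshape, transpose, flatten)."""
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    # Reshape: split the flat channel list into `groups` consecutive blocks.
    grouped = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    # Transpose and flatten: take the i-th channel of every group in turn.
    return [grouped[g][i] for i in range(per_group) for g in range(groups)]


shuffled = channel_shuffle(list(range(6)), groups=2)
print(shuffled)  # [0, 3, 1, 4, 2, 5]
```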
ResNet was developed in 2016, introducing residual (skip) connections that significantly increase network accuracy and speed [9]. Not every neuron in the ResNet design needs to fire at once. After learning a feature once, the network does not try to learn it again; instead, it focuses on learning additional features. This strategy enhances the effectiveness of model training. The ResNet50 network consists of 50 layers, and its architecture is illustrated in Figure 6.
When training any pretrained network, we start by modifying the parameters of the basic design. In each of the CNN models, we modified the pretrained network's 2D convolution layer and classification output layer. We set the filter size to 1 × 1 and the number of filters to 2, as we had two classes of data, and we modified the classification output layer to suit our output classes and labels. Furthermore, we set the initial learning rate to 0.0001, the validation frequency to 5, and the maximum number of epochs to 60, to avoid training being halted prematurely on the basis of error rates. An epoch is a single learning cycle in which the learner is exposed to the whole training dataset. The mini-batch size was set to 11 to fit the memory (8.00 GB) of the CPU hardware, which operates at 1.8 GHz. Increasing the number of epochs can improve training and validation accuracy, but it may lead to overfitting. To avoid overfitting, the training dataset was randomly separated into two parts: 70% of the data for training and 30% of the data for validation.

Evaluation Measures
A machine learning model can be assessed and compared to other methods using a variety of performance metrics. The most commonly used evaluation metrics are accuracy, sensitivity, specificity, precision, F1, and MCC. The accuracy (Ac) metric measures the percentage of correctly predicted drowning and swimming instances in the testing dataset. Sensitivity or recall (R) is the percentage of drowning instances that were correctly predicted relative to all of the drowning cases in the dataset. Precision (Pr) measures the percentage of correctly predicted drowning cases relative to all predicted drowning cases. Specificity (Sp) is the percentage of correctly predicted non-drowning cases relative to all non-drowning cases in the dataset. These metrics can be mathematically represented as follows [32,33]:
$$\mathrm{Ac} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad R = \frac{TP}{TP + FN}, \qquad \mathrm{Pr} = \frac{TP}{TP + FP}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP},$$
where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative instances, respectively. The F-measure (F1) is an evaluation metric that integrates precision and recall into a single value:
$$F1 = \frac{2 \cdot \mathrm{Pr} \cdot R}{\mathrm{Pr} + R}.$$
The Matthews correlation coefficient (MCC) is a statistic that strikes a compromise between prediction sensitivity and specificity:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$
MCC ranges from −1, which denotes an inverse prediction, through 0, which stands for a random classifier, to +1, which denotes a flawless prediction [34][35][36].

Model Training and Validation
We examined five different pretrained CNN models. We started with GoogleNet, as shown in Figure 7, which required a total of 60 epochs with two iterations per epoch, for a total of 120 iterations, for the network to properly train and validate the data. It achieved a validation accuracy of 91.67% after 120 iterations. The network training took 3 min and 36 s to complete. Validation was performed every five iterations to verify that the network was well trained while avoiding overfitting of the data.
To train SqueezeNet, we used a total of 20 epochs with two iterations per epoch, as shown in Figure 8. The model achieved a validation accuracy of 100% after 40 iterations, which is ideal accuracy and indicates a well-trained network. The training procedure took only 26 s, which was significantly less time than the other networks. To avoid data overfitting, validation was also performed every five iterations.
AlexNet was trained for 15 epochs overall (Figure 9), with two iterations per epoch, allowing the network to train and validate the data successfully. It attained a validation accuracy of 91.67% after 30 iterations. The network training took 1 min and 5 s to complete, and validation was performed every five iterations.
As shown in Figure 10, we used a total of 18 epochs with two iterations per epoch to properly train and validate the data for ShuffleNet. After 36 iterations, we achieved a validation accuracy of 100%, which is optimal accuracy and suggests a well-trained network. The training process took 7 min and 18 s, far longer than the time required for the three previous networks. Validation was also performed every five iterations to prevent data overfitting.
As shown in Figure 11, we used a total of 20 epochs with five iterations for each epoch to properly train and validate the data for ResNet50. After 40 iterations, we achieved a validation accuracy of 100%, which is optimal accuracy and suggests a well-trained network. The training process took 2 min and 33 s, a moderate time compared to the other four networks. Validation was also performed every five iterations.
As demonstrated in Figures 7-11, we employed a single CPU system. We used the same initial learning rate of 0.0001 for all five networks, whereas the maximum number of iterations and the number of epochs varied for each model. SqueezeNet achieved 100% validation accuracy with the fastest training time, making it one of the best networks at the validation stage. While ResNet50 also achieved 100% validation accuracy, SqueezeNet was faster than ResNet50. AlexNet and GoogleNet both had the same validation accuracy of 91.67% with a longer training time than SqueezeNet, taking 3 min and 36 s for GoogleNet and 1 min and 5 s for AlexNet. Although ShuffleNet showed perfect validation accuracy, it took a long time to train. Table 1 details the training and validation performance of the five networks.
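The comparison above can be reproduced with a short sketch that orders the five networks by validation accuracy and then by training time. The accuracies and training times are those reported in the text; the `TrainingRun` structure and the tie-breaking rule are illustrative assumptions, not the selection procedure used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRun:
    name: str
    val_accuracy: float  # validation accuracy in percent
    seconds: int         # training time in seconds


# Values reported in the text (e.g., 3 min 36 s = 216 s).
runs = [
    TrainingRun("GoogleNet", 91.67, 216),
    TrainingRun("SqueezeNet", 100.0, 26),
    TrainingRun("AlexNet", 91.67, 65),
    TrainingRun("ShuffleNet", 100.0, 438),
    TrainingRun("ResNet50", 100.0, 153),
]

# Highest validation accuracy first; shorter training time breaks ties.
ranked = sorted(runs, key=lambda r: (-r.val_accuracy, r.seconds))
for run in ranked:
    print(f"{run.name:<10} {run.val_accuracy:6.2f}%  {run.seconds:4d} s")
```

With these numbers, the order is SqueezeNet, ResNet50, ShuffleNet, AlexNet, and GoogleNet.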

Model Testing and Evaluation
Our testing dataset consisted of 100 unseen images with 50 drowning instances and 50 normal swimming instances.This testing dataset was a completely separate set which was not used in the training and validation process.Figure 12 presents samples of the testing data.This dataset was used to test the five CNN models, and the testing results are summarized in the confusion matrices presented in Tables 2-6.Samples of the classification results of the five CNN models along with their confidence level are illustrated in Table 7.On the basis of their confusion matrices, the five CNN models were evaluated and compared through six performance measures: accuracy (Ac), recall (R), precision (Pr), specificity (Sp), F1, and MCC.The performance metrics of these models are summarized in Table 8.The five CNN models were able to correctly distinguish the swimmer's predicament with a very high confidence level.ResNet50 and AlexNet showed the best prediction performance in terms of all six evaluation metrics.The next best performer was SqueezeNet, while the ShuffleNet model showed the lowest classification performance among the five CNN models.
To further reduce the false positive and false negative rates, it would be interesting to combine more than one CNN model, i.e., three or five models, into one system, where the final decision of the system is based on a voting criterion applied to the individual model outputs.
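One possible form of such a voting criterion is a simple majority vote over an odd number of models. The sketch below is an assumption about how this could be done, not a method evaluated in this paper; the function name `majority_vote` and the label strings `"drowning"` and `"swimming"` are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most models.

    `predictions` holds one class label per model. An odd number of
    models is required so that a two-class vote can never tie.
    """
    if len(predictions) % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    label, _count = Counter(predictions).most_common(1)[0]
    return label


# Hypothetical outputs of three models for a single image.
votes = ["drowning", "swimming", "drowning"]
print(majority_vote(votes))
```

A weighted vote based on each model's confidence level would be another option; the unweighted version is shown because it needs no extra tuning.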
For further evaluation, we compared our proposed approach with other existing approaches. Table 9 and Figure 13 compare the accuracy, machine learning technique, and dataset size of these systems. Although it was trained on a smaller amount of data, our proposed approach outperformed the other approaches in terms of accuracy.

Conclusions
This paper presented a deep learning-based approach for early drowning detection. We examined five pretrained convolutional neural networks and trained them on our data. SqueezeNet, GoogleNet, AlexNet, ShuffleNet, and ResNet50 achieved prediction accuracies of 97%, 95%, 99%, 81%, and 100%, respectively. The best model among them was ResNet50, since it achieved the highest validation and testing accuracy. When compared to other techniques, the system performed exceptionally well in terms of prediction accuracy and training time. The experimental results showed that the proposed models could successfully detect drowning cases within swimming pool environments with very high confidence levels.
The suggested method can be implemented in a variety of pools and settings, including schools, gyms, hotels, and villas. It can be installed and combined with an alarm system, or it can be integrated with an automated drowning rescue system.
More pretrained CNN models and more drowning/swimming image data can be examined to expand on this research. It would also be worthwhile to test these models in various swimming conditions with varying lighting and settings. To further reduce the false positive and false negative rates, it would be interesting to implement more than one CNN model, i.e., three or five models, in one system, with the final decision of the system based on a voting criterion applied to the model outputs.

Figure 1. (a) Samples of drowning data; (b) samples of swimming data.

Figure 8. The training performance of SqueezeNet.

AlexNet (Figure 9) was trained for 15 epochs in total, with two iterations per epoch. It attained a validation accuracy of 91.67% after 30 iterations. Training took 1 min and 5 s to complete, and validation was performed every five iterations.

Figure 9. The training performance of AlexNet.

Figure 10. The training performance of ShuffleNet.

Figure 11. The training performance of ResNet50.

Table 1. Convolutional neural network training results.

Table 8. Testing results of the five CNN models.

Table 9. Comparison of drowning detection approaches.