Convolutional Models for the Detection of Firearms in Surveillance Videos

Featured Application: The system described in this article is intended as a firearms detection system for retail businesses, banks, train stations, bus stops, and similar settings. It monitors surveillance video automatically and raises a detection when a firearm is shown.
Abstract: Closed-circuit television monitoring systems used for surveillance do not provide an immediate response in situations of danger such as armed robbery. In addition, they have multiple limitations when human operators perform the monitoring. For these reasons, a firearms detection system was developed using a new large database that was created from images extracted from surveillance videos of situations in which there are people with firearms. The system is made up of two parts: the "Front End" and the "Back End". The Front End consists of the YOLO object detection and localization system, and the Back End is made up of the firearms detection model that is developed in this work. These two systems are used to focus the detection system only on areas of the image where there are people, disregarding all other irrelevant areas. The performance of the firearm detection system was analyzed using multiple convolutional neural network (CNN) architectures, finding values up to 86% in metrics like recall and precision in a network configuration based on VGG Net using grayscale images.


Introduction
Closed-circuit television (CCTV) systems are composed of one or more surveillance cameras connected to one or more video monitors [1]. This type of security system tries to prevent dangerous situations such as intrusions or armed robberies. Usually, human operators observe these events and activate action protocols when a dangerous situation occurs. However, these systems have the disadvantage that they depend on human detection to activate alarms or action protocols. In [2] it is shown that an operator's ability to accurately observe activity on a screen is reduced by up to 45% after 12 min of constant monitoring. The failure rate increases to 95% after 22 min. Normally, the high cost of surveillance systems with additional monitoring services is a deterrent to their wide-scale use. In many cases, businesses only implement surveillance cameras without additional monitoring services for their protection. One way to reduce or avoid this type of crime could be the real-time detection of firearms in dangerous situations such as armed robberies. This would allow a faster reaction from security forces, because the alert would be raised as soon as the gun first appears in the scene. Security forces could then be notified at the same time as alarms are activated in incidents involving a firearm, which could have a deterrent effect on the attackers. This system could also work as a support system, notifying those observing the monitors.
The problem of firearms detection in CCTV videos has been addressed in many different ways. The first approaches used classic machine learning algorithms like K-means for color-based segmentation, combined with algorithms like SURF (speeded up robust features), the Harris interest point detector, and FREAK (fast retina keypoint) to detect and localize the gun [2,3]. In [4] the authors use algorithms like SIFT (scale-invariant feature transform) to extract different features of the image, combining them with K-means clustering and support vector machines to decide whether an image has a gun or not. The authors in [5] use algorithms like background and Canny edge detection in combination with the sliding window approach and neural networks to detect and localize the gun. The disadvantage of these systems is that they use a database where the gun occupies most of the image, which does not represent authentic scenarios in which a firearm is involved. Therefore, these systems are not optimal for continuous monitoring, where the images extracted from CCTV videos have a high complexity due to the multiple factors involved, or where there are open areas with many objects around.
This problem has also been addressed with more complex algorithms like deep convolutional neural networks (CNNs). In [6] the authors used transfer learning with Faster R-CNN, trained on a database with only high-quality and low-complexity images. The authors show that the best system, when evaluated on well-known films, had a low recall caused by frames with very low contrast and luminosity. It also produced false positives triggered by objects in the background of the image, which could occur because multiple areas of the image are analyzed with the sliding window and region proposals approaches to detect and localize the gun. The authors in [7] address this problem using a symmetric dual camera system and CNNs, with a database made by the authors. However, the most common cameras in CCTV systems are not dual cameras [8], and therefore this system would not apply to most retail businesses.
We found two common problems in the existing systems. First, they use small databases that do not represent authentic robbery scenarios, where many factors are involved. Small and medium markets and businesses use low-cost cameras that capture low-quality video. Luminosity also affects firearm detection, since robberies can take place at any time of the day. Additionally, firearm position is an important factor, because the gun can be shown in multiple positions. Second, the sliding window and region proposals approaches that are used for detection and localization of the gun analyze multiple places in the image where a gun could never be found. This could contribute to a large number of false positives, because the system could easily confuse a number of harmless objects with a gun. Since a firearm is most likely to be found only next to people, the close monitoring of any image needs to be in and around the people in the image. To overcome the limitations of the systems described above, we propose a firearm detection system made up of two convolutional models, in order to focus the detection system only on the part of the image where people are located. It uses the YOLO object detection and localization system [9] and the convolutional model to detect firearms that is developed in this work. This model was developed with a new large database of images extracted from surveillance videos and other situations that for the most part simulate a real robbery, considering factors like luminosity in the image, image quality, firearm position, and camera position.
This paper is organized as follows. In Section 2, we describe the database and the system architecture. In Section 3, we present the experiments and the final results. Finally, in Section 4, the conclusions are presented.

Database
For the development of systems that use deep learning, it is necessary to have large databases to train these systems. Since no large image database of people with firearms was available on the web, we created a database with two classes: people with handguns and people without. This database was originally made up of 17,684 images. These images were extracted from surveillance videos of real robberies, videos of people practicing shooting with firearms, and other types of situations different from robberies or shooting practices. These last two types of images were chosen because they were very similar to real robbery scenarios. All of these data were obtained from YouTube, Instagram, and Google. The structure of each class is shown in Figures 1 and 2.
Appl. Sci. 2019, 9, x FOR PEER REVIEW
In these types of situations, the gun can be shown in different types of positions. In [10], it is concluded that in the majority of cases where the robbers show their weapons upon entering the scene, they tend to keep their weapons at waist height, or most commonly at shoulder height when posing the initial threat. Luminosity in the image is an important factor in these types of situations because a real robbery could happen at any time of the day. Moreover, image quality is an important factor, because surveillance videos are usually captured with low-cost cameras. This database was created taking all of these factors into consideration; these factors are shown in Figures 3 and 4.
We used CNNs for the development of this system because this type of network is an automatic feature extractor and because it allows us to detect the firearm in different types of positions and at different distances, which is very important in this application. Before using the database, it was first necessary to resize all the images to a fixed size, because the input of the CNN requires images of the same size. These images were resized to 224 × 224 pixels. It was also necessary to increase the number of images in the database. This was done by applying multiple techniques like flipping the image in the horizontal axis and rotating the images in multiple angles. With these techniques, we increased the original database from 17,684 images to 247,576 images. The applied techniques are shown in Figure 5. All the details involved in the creation of this database are shown in [11]. The URL where the database is located is shown in Supplementary Materials URL S1.
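The growth from 17,684 to 247,576 images corresponds to exactly 14 variants per original image. The sketch below shows one way such a factor could arise, by pairing a flip flag with a set of rotation angles. The seven angles, the interpretation of the flip as a left-to-right mirror, and the function names are assumptions made for illustration; the article does not list the angles that were used.

```python
from itertools import product

ORIGINAL_IMAGES = 17_684    # size of the original database (from the text)
AUGMENTED_IMAGES = 247_576  # size after augmentation (from the text)

# Hypothetical rotation angles in degrees; the article does not list them.
ANGLES = (0, 5, 10, 15, -5, -10, -15)


def flip_horizontal(image):
    """Mirror an image, stored as a list of pixel rows, left to right."""
    return [list(reversed(row)) for row in image]


def augmentation_plan(angles=ANGLES):
    """Enumerate (flipped, angle) pairs; each pair yields one output image."""
    return list(product((False, True), angles))


plan = augmentation_plan()
variants_per_image = AUGMENTED_IMAGES // ORIGINAL_IMAGES
```

With two flip states and seven angles the plan has 14 entries, which matches the ratio between the augmented and original database sizes.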
For the development of the detection system we initially used 70% of the database for the training phase, 15% for the evaluation phase, and 15% for the testing phase. The detection system was developed using Python and TensorFlow. Before using the database, we converted it into a TFRecord file, which is a simple binary format used by TensorFlow. Its use has a significant impact on the importation of the data. This is because, for datasets that are too large to be stored in memory, only the data that is required at the time (a batch) is loaded and processed. This makes the importing process more efficient. Additionally, the binary data can be read more efficiently from the disk.
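The two ideas in this paragraph, the 70/15/15 split and batch-wise loading, can be illustrated without TensorFlow. The following is a minimal sketch in plain Python, not the TFRecord API itself; the function names are hypothetical, and assigning the rounding remainder to the test set is an assumption.

```python
from itertools import islice


def split_counts(total, train_pct=70, eval_pct=15):
    """Return (train, eval, test) sizes; the test set takes the remainder."""
    n_train = total * train_pct // 100
    n_eval = total * eval_pct // 100
    return n_train, n_eval, total - n_train - n_eval


def iter_batches(records, batch_size):
    """Yield lists of at most batch_size records from any iterable.

    Only one batch is held in memory at a time, which mirrors the
    motivation given in the text for streaming data from disk.
    """
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Usage: stream five records in batches of two.
batches = list(iter_batches(range(5), 2))
```

Because `iter_batches` accepts any iterable, the records could come from a generator that reads a binary file sequentially, so the full dataset never has to fit in memory.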

System Architecture
Surveillance cameras are usually located in places where there are a lot of people and multiple objects. Therefore, surveillance videos usually have high complexity in terms of the number of elements in each frame. To simplify the complex environments captured by the camera, we designed a detection system that is divided into two parts. The first part is called the "Front End" and is made up of the YOLO object detection and localization system. YOLO is a real-time object detection and localization system that was trained using a large database called COCO. This database has various types of categories such as persons, cars, animals, etc. The second part of the system is called the "Back End", and is made up of the firearms detection model developed in this work. This is shown in Figure 6.
Firearms are almost exclusively found next to people. Therefore, these areas of an image are the most important regarding detection. YOLO is used to detect, locate, and identify the segments of an image where there are people. These segments are the images that will be the input for the developed firearm detection model. In this way, the firearm detection system analyzes only the segments of the image that are most important and discards the large area of the image that is not relevant for detection, which reduces the possibility of obtaining false positives in places where firearms will never be found. By including this step, we greatly reduce the number of false alarms that would otherwise occur due to the complex environments of surveillance videos. This is shown in Figures 7 and 8.
The firearm detection model was developed through a CNN. The structure of the CNN was chosen after we carried out different tests with multiple configurations. The initial concept was based on two types of network architectures that have obtained good results in image detection and classification tasks. The first one is VGG Net [12]. This network is characterized by its depth, implementing a large number of layers and small convolutional filters. The second one is ZF Net [13]. This network is characterized by the use of large convolutional filters in its first layer without implementing a large number of these filters. The CNN was programmed using TensorFlow through Custom Estimators, which is a TensorFlow API that simplifies the development of the model.
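The Front End and Back End arrangement described in this section can be summarized as a short pipeline: keep only the person detections, crop those regions, and classify each crop. The sketch below assumes a hypothetical detection format of (label, box) pairs with boxes given as (x1, y1, x2, y2), and a hypothetical `classify` callable that returns a firearm score between 0 and 1. It does not call YOLO or the trained model.

```python
def crop(image, box):
    """Cut the region (x1, y1, x2, y2) out of an image stored as pixel rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def detect_firearms(image, detections, classify, threshold=0.5):
    """Run the Back End classifier only on person regions from the Front End.

    Returns a list of (box, score) pairs for regions flagged as firearms.
    """
    alerts = []
    for label, box in detections:
        if label != "person":
            continue  # every non-person object is ignored
        score = classify(crop(image, box))
        if score >= threshold:
            alerts.append((box, score))
    return alerts


# Usage with a stub classifier standing in for the trained CNN.
frame = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
found = [("person", (0, 0, 2, 2)), ("car", (2, 2, 4, 4)), ("person", (2, 0, 4, 2))]
stub = lambda segment: 0.9 if segment[0][0] == 1 else 0.1
alerts = detect_firearms(frame, found, stub)
```

Only the two person boxes reach the classifier, so the region labeled as a car can never produce a false alarm, which is the design argument made in the text.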

Evaluation Metrics
The model was evaluated with several metrics that quantify the quality of the predictions of the detection system. We considered the following metrics:
• Accuracy: In terms of a classifier, accuracy is defined as the probability of correctly predicting a class [14].
• Equal error rate: The equal error rate (EER) is the value at which the proportion of false positives (false acceptance rate, FAR) is equal to the proportion of false negatives (false rejection rate, FRR). It identifies the detection sensitivity at which the two types of error are balanced [15,16].
• Precision: Precision measures the fraction of examples classified as positive that are truly positive [17].
• Recall: Recall measures the fraction of positive examples that are correctly labeled [17].
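These definitions can be written in terms of the four counts of a binary confusion matrix. The following is a minimal sketch; the function name and the example counts are illustrative and are not results from this work.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics above from true/false positive/negative counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # fraction of correct predictions
        "precision": tp / (tp + fp),     # predicted positives that are correct
        "recall": tp / (tp + fn),        # actual positives that are found
        "far": fp / (fp + tn),           # false positives among negatives
        "frr": fn / (fn + tp),           # false negatives among positives
    }


# Illustrative counts only.
metrics = classification_metrics(tp=8, fp=2, tn=8, fn=2)
```

Note that FRR equals one minus recall, so the EER is the operating point where FAR and (1 − recall) coincide.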

Training and Evaluation Phases
For the training phase of the CNN, multiple tests were made. The first tests used half of the database, which consists of 123,788 images. They were made in order to determine whether the new images that were created in the database augmentation phase by flipping and rotating the images provided new information or simply affected the training of the model. The second tests used the complete database, which consists of 247,576 images. These tests were made with the structure that provided the best results in the previous tests using half of the database. For the training phase, we used a batch size of 200 images and a learning rate of 0.001. We used the gradient descent algorithm to train the networks. We implemented the ReLU activation function in the convolutional stage as well as in the layers of the fully connected neural network, except in the last layer, where a SoftMax activation function was used. As mentioned previously, we put forward various network architectures based on VGG Net and ZF Net to find the architecture that provided the best results with the created database.
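The training ingredients named above (ReLU, SoftMax, a gradient descent update with learning rate 0.001, and batches of 200 images) can be sketched in a few lines of plain Python. This is an illustration of the individual operations under those stated hyperparameters, not the TensorFlow code used in this work; the function names are hypothetical.

```python
import math

BATCH_SIZE = 200       # from the text
LEARNING_RATE = 0.001  # from the text


def relu(values):
    """ReLU activation: negative inputs become zero."""
    return [max(0.0, v) for v in values]


def softmax(logits):
    """SoftMax activation; subtracting the maximum keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sgd_step(weights, gradients, learning_rate=LEARNING_RATE):
    """One gradient descent update: w <- w - learning_rate * gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]


def batches_per_epoch(num_images, batch_size=BATCH_SIZE):
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)
```

As a point of reference, `batches_per_epoch(247_576)` gives the number of batches needed for one pass over the complete database if all of it were used for training; the actual training subset is smaller because part of the data is held out.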

Tests Performed Using Half of the Database
Proposed Networks Based on VGG Net: Firstly, two configurations were proposed based on VGG Net. These configurations were the same in the convolutional layers, but they differed in the number of neurons used in the fully connected (FC) network. These networks were characterized by the use of a large number of small convolutional filters in their layers. These configurations implemented a 1-pixel stride in the convolutional filters and a 2-pixel stride in the max-pooling layers. These two configurations are shown in Table 1. The results obtained with these configurations are shown in Table 2.
Table 1. Proposed networks based on VGG Net (using half of the database). FC: fully connected.
Three architectures were proposed based on ZF Net. These networks were characterized by the use of large filters in their first convolutional layers, differing from the previous architectures based on VGG Net, which used 3 × 3 filters in all their layers. These configurations implemented a 2-pixel stride in the first two convolutional layers and a 1-pixel stride in the following layers. In the max-pooling layers a 2-pixel stride was used. These three configurations are shown in Table 3. The results obtained with these configurations are shown in Table 4.
Configurations C2 in test T4 and C3 in test T1 yielded the best results in the evaluation phase. Comparing these results shows that test T4 had an improvement in accuracy of 3.33% and in loss of 21.4% compared with the results obtained in test T1. Moreover, from these results it can be concluded that the new images that were created in the database augmentation phase provided new information that helped the training of the system by complementing the initial images.
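The effect of the filter sizes and strides mentioned above on the feature map size can be checked with the standard output-size formula for convolution and pooling layers. In the sketch below, the padding values, the 7 × 7 first-layer filter, and the 2 × 2 pooling window are assumptions chosen for illustration, since the table contents are not reproduced here.

```python
def output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


INPUT_SIZE = 224  # images are resized to 224 x 224 pixels (from the text)

# VGG-style layer: 3 x 3 filter, stride 1, assumed padding of 1 pixel.
vgg_conv = output_size(INPUT_SIZE, kernel=3, stride=1, padding=1)

# Max pooling with stride 2, assuming a 2 x 2 window.
vgg_pool = output_size(vgg_conv, kernel=2, stride=2)

# ZF-style first layer: assumed 7 x 7 filter, stride 2, no padding.
zf_conv = output_size(INPUT_SIZE, kernel=7, stride=2)
```

Under these assumptions a small filter with stride 1 preserves the resolution until pooling halves it, while a large first-layer filter with stride 2 roughly halves it immediately.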

Tests Performed Using the Complete Database
After having confirmed in the previous tests that the augmentation of the database provided new information, we proceeded to carry out tests on the complete database using architecture C2, which was the one that provided the best results in the previous tests conducted with half of the database. The results obtained in the training and evaluation phases are shown in Table 5. The best results were obtained in test T2 using grayscale images, with an improvement in the evaluation phase of 0.22% in accuracy and 34.3% in loss in comparison with the results obtained in test T1. Therefore, the model used in test T2 with grayscale images was used in the implementation of the system.
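Since the selected model takes grayscale input, each RGB segment has to be reduced to a single channel before classification. The article does not state which conversion was used; the sketch below assumes the common ITU-R BT.601 luma weights (0.299, 0.587, 0.114) and a hypothetical function name.

```python
def to_grayscale(image):
    """Convert rows of (r, g, b) pixels to rows of single luma values.

    The weights are the BT.601 luma coefficients, an assumption here.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


# Usage: one row with a white, a black, and a pure red pixel.
gray = to_grayscale([[(255, 255, 255), (0, 0, 0), (255, 0, 0)]])
```

A single-channel input also reduces the number of weights in the first convolutional layer to a third of the RGB case.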

Test Phase
We proceeded to test the system using metrics that were obtained on a test set of 2723 images. Firstly, the EER was obtained to find the detection sensitivity at which the system produced the fewest errors. This value was obtained from the FAR and FRR values. The crossing between FAR and FRR provides the EER value, which corresponded to 0.09, with a sensitivity of 0.52. This is shown in Figure 9. The confusion matrix and the recall and precision values were obtained for the test set; these are shown in Tables 6 and 7.
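The crossing between FAR and FRR can be located by sweeping the decision threshold and keeping the point where the two rates are closest. The following is a minimal sketch, assuming that the detection sensitivity corresponds to a score threshold between 0 and 1 and that a score at or above the threshold counts as a firearm detection; the function names and the example scores are illustrative.

```python
def far_frr(scores, labels, threshold):
    """Return (FAR, FRR) at a threshold; labels use 1 = firearm, 0 = none."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    positives = [s for s, y in zip(scores, labels) if y == 1]
    far = sum(s >= threshold for s in negatives) / len(negatives)
    frr = sum(s < threshold for s in positives) / len(positives)
    return far, frr


def equal_error_rate(scores, labels, steps=100):
    """Sweep thresholds in [0, 1] and return (threshold, EER estimate)."""
    best = None
    for i in range(steps + 1):
        threshold = i / steps
        far, frr = far_frr(scores, labels, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # The EER is estimated as the mean of the two rates at the crossing.
            best = (gap, threshold, (far + frr) / 2)
    _, threshold, eer = best
    return threshold, eer


# Illustrative scores only: two negatives and two positives.
threshold, eer = equal_error_rate([0.2, 0.7, 0.3, 0.8], [0, 0, 1, 1])
```

On a finite test set the two curves rarely cross exactly at a sampled threshold, which is why the sketch keeps the threshold with the smallest gap rather than testing for equality.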

Metric      Result
Recall      0.86
Precision   0.86

Interface
The interface that was created to implement the detection system is shown in Figure 10. The surveillance video is on the right side of the interface. When the system makes a detection, it locates the segment of the image where the firearm was detected, showing this segment on the left side of the interface in addition to an alert message that a firearm has been potentially detected.

Conclusions
This paper addressed the development of a firearm detection system. Two convolutional models were used in order to discard areas of the image that are irrelevant for the detection and to focus the firearm detection model only on the areas of the image where people are located. The results showed that with this configuration we were able to reduce the complexity of real robbery scenarios by taking only the segments of the image where there were people, since these segments are the most important areas of the image for making a detection. Using a convolutional network architecture based on VGG Net in the firearm detection model allowed us to obtain a relative improvement in this application of 21.4% in loss and 3.33% in accuracy, compared to a convolutional network architecture based on ZF Net. The use of grayscale images gave a better performance, with an improvement of 0.22% in accuracy and 34.3% in loss in the evaluation phase of the network, compared to the results obtained with RGB images. In the final performance of the detection system, we obtained 86% precision and 86% recall, which leaves room for improvement. As future research, it would be interesting to keep the same architecture with the two convolutional models, so that the detection system still focuses only on the important parts of the image, but to use a new model to detect firearms, either by training the model on a larger database or by adapting other models to this problem with transfer learning.

Table 2. Results obtained with the networks based on VGG Net (using half of the database).

Table 3. Proposed networks based on ZF Net (using half of the database).

Table 4. Results obtained with the networks based on ZF Net (using half of the database).

Table 5. Results obtained with the configuration C2 (using the complete database).

Table 7. Recall and precision values.