Semi-Supervised Faster RCNN-Based Person Detection and Load Classification for Far Field Video Surveillance

Abstract: This paper presents a semi-supervised faster region-based convolutional neural network (SF-RCNN) approach to detect persons and to classify the load they carry in video data captured from several miles away via high-power lens video cameras. For detection, a set of computationally efficient image processing steps is used to identify moving areas that may contain a person. These areas are then passed to a faster RCNN classifier whose convolutional layers are obtained from ResNet50 via transfer learning. Frame labels are obtained in a semi-supervised manner for the training of the faster RCNN classifier. For load classification, another convolutional neural network classifier, whose convolutional layers are obtained from GoogleNet via transfer learning, is used to distinguish a person carrying a bundle from a person carrying a long arm. Despite the challenges posed by the video dataset examined, namely the low resolution of persons, the presence of heat haze, and the shaking of the camera, it is shown that the developed approach outperforms the faster RCNN approach.


Introduction and Related Works
The video surveillance market is currently valued at more than $35 billion and is estimated to grow to more than $65 billion in five years [1]. There are many applications of video surveillance, for example, traffic monitoring, public safety, parking lot monitoring, theft detection in the retail industry, and crime prevention. Many image processing algorithms for detecting a specific object or event in video data have been developed in the literature. Recently, the use of deep learning in image processing has experienced tremendous growth, and deep learning approaches have been applied to video surveillance applications. This growth has occurred because, in many image processing applications, deep learning solutions have outperformed conventional solutions in which detection/recognition is normally performed based on handcrafted features designed for a specific application. When using deep learning approaches, the design of handcrafted features is not needed, and the raw image data can be fed directly into a deep learning network to achieve detection/recognition.
Person detection for the application of pedestrian monitoring has been well studied. Three noteworthy pedestrian detection algorithms are reported in the literature [2][3][4]. In the work of [2], histograms of oriented gradients (HOG) features together with a support vector machine (SVM) classifier were used. In the work of [3], integral channel features (ICF) together with an AdaBoost classifier were used. In the work of [4], aggregated channel features (ACF) with an AdaBoost classifier were used. Variations of these methods have appeared in the literature [5][6][7][8][9][10][11]. More recently, convolutional neural network (CNN)-based approaches have shown improvements over conventional approaches in pedestrian detection. These deep learning-based pedestrian detection approaches are either two-stage or single-stage. Examples of two-stage approaches include region-based convolutional neural network (RCNN) [12], fast RCNN [13], and faster RCNN [14]. These approaches perform both region scanning and detection. In general, they achieve higher accuracy at the cost of higher computational complexity. Examples of single-stage approaches include SSD (single shot detector) [15] and YOLO (you only look once) [16]. These approaches do not include a separate region scanning step. In general, they have higher computational efficiency but lower accuracy. Variations of the above deep learning-based approaches have also appeared in the literature [17][18][19][20][21].
For the far field video surveillance application, the use of these methods poses challenges owing to the lack of a large dataset and the low resolution of the images involved. Because of the lack of a large dataset, a semi-supervised faster RCNN (SF-RCNN) approach is developed in this paper to achieve person detection and load classification based on far field video data. Far field indicates the use of high-power lenses to enable monitoring at distances of three to five miles. The application of interest here is monitoring borders from a far distance for illegal crossings or activities, and more specifically detecting persons and identifying the load they carry. The loads of interest include drug bundles and long arms.
A two-stage approach is developed in this paper to address both person detection and load classification based on far field video data. During the person detection stage, a computationally efficient approach for detecting moving areas in an image is applied first, followed by a person detector that checks whether a person is present in the moving areas. This approach is compared with the state-of-the-art faster RCNN person detector. The developed person detector is first trained on the Caltech pedestrian detection dataset and then re-trained in a semi-supervised manner on the unlabeled far field video dataset. During the classification stage, a CNN with transfer learning is used to distinguish between a person carrying a bundle and a person carrying a long arm. In our previous work [22], the detection was done using the AdaBoost person detector, while in this paper, the detection is carried out using the developed SF-RCNN person detector.
The rest of the paper is organized as follows. A description of the far field video dataset is provided in Section 2. The architectures of the person detection and load classification deep learning networks used are covered in Section 3. The experimental results and their discussion are then reported in Section 4. Finally, the conclusion is stated in Section 5.

Far Field Video Dataset
A dataset of far field video clips was made available for this work by the company Elbit Systems of America. The dataset consists of 32 video clips at 30 frames per second. Seventeen of the video clips were labeled as 'Bundle' video clips and 15 as 'Long Arm' video clips, denoting the loads carried by the person in the video clip. Figure 1 provides two sample images from these video clips, which are (1080 × 1920) pixels in size. Figure 1a corresponds to a person carrying a long arm, and Figure 1b corresponds to a person carrying a bundle. A zoomed version of the area in which a person was detected is also shown on the right side of these figures. The video clips were captured from a three-mile distance. No frame-level labels are provided for these video clips, meaning that it is unknown when a person will appear in the scene.
It is worth stating that the video data for this far field video surveillance application differ in appearance from the video data often seen in the pedestrian detection application. Far field video data involve the following challenges that do not appear in pedestrian monitoring video data: (1) as the video is taken from a far distance, a person appears in the scene with low resolution and, in many cases, with only the upper portion of the body visible; (2) the presence of noise generated by the shaking of the camera due to wind or by the camera going out of focus; and (3) the presence of noise generated by heat haze as a result of the long distance.
Besides the far field video dataset, the Caltech pedestrian dataset [23] is also considered here for training the person detection model. The Caltech pedestrian dataset consists of approximately 10 h of (640 × 480) 30 Hz video taken from a vehicle driving through regular traffic in an urban environment, providing about 250,000 frames with a total of 350,000 bounding boxes and 2300 unique pedestrians annotated. The details of this dataset appear in the work of [23].

Developed Detection and Classification Approach
The steps involved in the developed person detection and load classification are illustrated in Figure 2. Initially, it is required to detect or locate the presence of a person in an entire image before performing any classification. Because of the large size of the image frames (1080 × 1920), it was found to be computationally inefficient to apply a person detector algorithm to the entire image. To make the detection process computationally efficient, a moving or changing areas detection step is applied first. The person detector is then applied only to the moving or changing areas and thus not to the entire image. Next, the detected area or sub-image in which a person is detected is passed to a load classifier to obtain the load carried by the detected person. In addition to a frame-level classification, a video-level label is generated based on the image or frame-level labels.

Moving Areas Detection
A simple moving areas detection is applied first to allow the person detection module to operate in a computationally efficient manner. Although there are many background subtraction methods that can be used to find moving areas (for example, see the literature [24][25][26][27][28][29][30][31][32][33]), a simple frame differencing is utilized here to provide the input to the deep learning-based person detector. The person detector then corrects remaining errors associated with moving areas. Note that because the camera is located miles away, camera shaking and heat haze make the background unstable, and background subtraction alone would not lead to a robust outcome.
To detect moving areas in an image, the steps illustrated in Figure 3 are followed. First, the image frames are down-sampled by a factor of five to (216 × 384), so that the computational efficiency of the subsequent processing steps or components is increased. Then, the captured RGB images are converted into one luminance or gray-scale image using the luminance equation Y = (R + G + B)/3. Next, the difference of consecutive frames is passed through a convolution operation with an averaging filter to obtain the most significant moving or changing area in the image in a computationally efficient manner. It is worth noting that camera shaking due to wind also leads to the detection of moving areas. No camera stabilization is applied as part of our processing pipeline, as this would add a considerable amount of computation time and would not allow our detection and classification solution to run in real-time on a regular computer. Sample outcomes of the color to luminance conversion and frame differencing steps are shown in Figure 4. Note that detected moving areas may occur because of heat haze noise or the presence of moving objects other than persons, such as animals or cars. Figure 4 also includes images corresponding to detected moving areas, one with and one without a person in it.
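The moving areas detection steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the MATLAB implementation used in this work. The function names, the use of simple decimation for down-sampling, and the 3 × 3 averaging window are assumptions made for the example; the paper does not specify the filter size or the down-sampling method.

```python
def downsample(img, factor=5):
    """Keep every `factor`-th pixel in both directions (simple decimation, an assumption)."""
    return [row[::factor] for row in img[::factor]]


def to_luminance(frame):
    """Convert rows of (R, G, B) tuples to gray scale with Y = (R + G + B) / 3."""
    return [[(r + g + b) / 3.0 for (r, g, b) in row] for row in frame]


def frame_difference(prev, curr):
    """Absolute pixel-wise difference between two gray-scale frames of equal size."""
    return [[abs(c - p) for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]


def box_filter_peak(diff, k=3):
    """Slide a k-by-k averaging window over `diff` and return
    (top_row, left_col, mean_value) of the window with the largest mean."""
    rows, cols = len(diff), len(diff[0])
    best = (0, 0, -1.0)
    for i in range(rows - k + 1):
        for j in range(cols - k + 1):
            total = sum(diff[i + di][j + dj]
                        for di in range(k) for dj in range(k))
            mean = total / (k * k)
            if mean > best[2]:
                best = (i, j, mean)
    return best


def detect_moving_area(prev_rgb, curr_rgb, factor=5, k=3):
    """Down-sample, convert to luminance, difference consecutive frames,
    and locate the most significant changing window."""
    prev = to_luminance(downsample(prev_rgb, factor))
    curr = to_luminance(downsample(curr_rgb, factor))
    return box_filter_peak(frame_difference(prev, curr), k)
```

The returned window position is in down-sampled coordinates, so it would need to be scaled by `factor` to crop the corresponding area from the full-resolution frame before passing it to the person detector.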
Mach. Learn. Knowl. Extr. 2019, 1

Person Detection
The next step of the approach consists of passing the detected moving areas to a person detector to generate boxes around the person in the scene. In our previous work [22], the person detection was done using the AdaBoost person detector. AdaBoost, short for adaptive boosting [34], involves forming a classifier as a linear combination of simple classifiers. In the work of [4], it was shown that the ACF features together with an AdaBoost classifier performed better than the HOG features together with an SVM classifier. Furthermore, in the work of [35], it was shown that the deep learning-based RCNN approach outperformed the AdaBoost approach for the pedestrian detection application.
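As a brief illustration of what "a linear combination of simple classifiers" means, the sketch below shows only the prediction step of a boosted classifier. The weights and the threshold-based simple classifiers are hypothetical values chosen for the example and are not taken from [22] or [34]; the training procedure that would produce the weights is omitted.

```python
def adaboost_predict(x, weak_learners):
    """Return +1 or -1 from the sign of a weighted sum of simple classifiers.

    weak_learners: list of (alpha, h) pairs, where alpha is the weight of the
    simple classifier h, and h(x) returns +1 or -1.
    """
    score = sum(alpha * h(x) for alpha, h in weak_learners)
    return 1 if score >= 0 else -1


# Hypothetical simple classifiers, each thresholding one feature of x.
example_learners = [
    (0.9, lambda x: 1 if x[0] > 0.5 else -1),
    (0.4, lambda x: 1 if x[1] > 0.5 else -1),
    (0.3, lambda x: 1 if x[2] > 0.5 else -1),
]
```

For instance, `adaboost_predict((0.8, 0.1, 0.1), example_learners)` evaluates the weighted sum 0.9 - 0.4 - 0.3, which is positive, so the first classifier outvotes the other two.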

Faster RCNN Detector
Faster RCNN extends the RCNN and fast RCNN networks, which are CNN variants used for object detection. The main difference between them is how regions get selected for processing. RCNN and fast RCNN use a region selection algorithm such as Edge Boxes [36] or Selective Search [37], which is independent of the CNN network. Faster RCNN does the region selection as part of the CNN training and detection.
To address the limited amount of training data, the transfer learning method is considered here. In transfer learning, pre-trained CNN models are used. These pre-trained models are trained using big datasets. The layers of the pre-trained models are used up to the last fully connected layer. The last fully connected layer is trained using the dataset associated with this application. More details of the transfer learning method appear in the work of [38]. Here, the transfer learning method based on the pre-trained ResNet50 [39] model is used.
ResNet50 is a convolutional neural network that is trained on more than a million images from the ImageNet database [40]. The ImageNet database consists of 1.2 million images classified into 1000 classes. A block diagram illustrating the ResNet50 transfer learning architecture is shown in Figure 5. As illustrated in this figure, the ResNet50 architecture consists of convolution layers with skip layer connections, average pooling layers, and fully connected layers of processing elements. The details of these layers are discussed in the work of [39].

Semi-Supervised Faster RCNN (SF-RCNN) Detector
Semi-supervised learning is a machine learning approach that makes use of both labeled and unlabeled data for training. It starts with a model trained using labeled data and then improves the performance using unlabeled data. As manual labeling is time consuming and labor intensive, the semi-supervised approach makes the training process more efficient. More details regarding semi-supervised learning are described in the work of [41].
Noting that the far field video dataset does not provide frame-level labels, the supervised training is first done using the Caltech pedestrian dataset. When the model is tested using the far field video dataset, one faces a mismatch between the training and testing datasets. To address this mismatch, a semi-supervised method is adopted: frame-level labels are first obtained automatically from the unlabeled far field video dataset by applying the person detector trained on the Caltech pedestrian dataset with a "high threshold" (e.g., 0.99). Then, the automatically labeled persons in the far field data are used to further train the faster RCNN network. This training process allows the deep learning model to learn the common features in both the Caltech pedestrian and far field datasets. During testing or operation, a "nominal threshold" (e.g., 0.6) is used to ensure that all persons get detected for the load classification stage. A block diagram illustrating the developed semi-supervised architecture is shown in Figure 6.
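The two-threshold logic of this semi-supervised scheme can be sketched as below. The detector itself is not modeled; the sketch assumes that some detector has already produced scored boxes, and the dictionary layout (`frame`, `box`, `score`) and the function names are hypothetical. Only the threshold values 0.99 and 0.6 come from the text.

```python
HIGH_THRESHOLD = 0.99     # used to harvest automatic labels from unlabeled frames
NOMINAL_THRESHOLD = 0.6   # used during testing or operation


def select_pseudo_labels(detections, threshold=HIGH_THRESHOLD):
    """Keep only very confident detections as automatic training labels.

    detections: list of dicts with keys 'frame', 'box', and 'score'
    (a hypothetical layout). Returns a list of (frame, box) pairs.
    """
    return [(d["frame"], d["box"]) for d in detections
            if d["score"] >= threshold]


def detections_for_classification(detections, threshold=NOMINAL_THRESHOLD):
    """Keep detections that pass the lower operating threshold."""
    return [d for d in detections if d["score"] >= threshold]
```

The pairs returned by `select_pseudo_labels` would then serve as the labeled far field examples for further training of the detector. The high threshold trades the number of harvested labels for their reliability, while the lower operating threshold favors not missing persons.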

Load Classification
After identifying areas or sub-images in which a person is present, these areas or sub-images are passed to a CNN classifier. Considering that misdetection could occur during the detection stage, areas that contain trees, grass, or other objects were manually extracted and placed into a third class labeled 'Others'. Also, background areas from two of the Bundle video clips were randomly selected and placed into the 'Others' class manually. These two video clips were thus not used in the experiments reported in Section 4. In other words, 30 video clips (15 Bundle and 15 Long Arm) were used for the training and testing of the CNN classifier, which outputs three classes: person with long arm, person with bundle, and others. The leave-one-out cross validation technique was carried out, that is, 29 video clips were used for training and the remaining video clip was used for testing. The training and testing were repeated 30 times, each time selecting a different video clip for testing and the remaining 29 video clips for training. The results were averaged over the 30 repetitions of training and testing.
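The leave-one-out protocol over video clips can be sketched as follows. The `train_and_evaluate` callback is a hypothetical placeholder for training the CNN classifier on the training clips and returning its accuracy on the held-out clip; it is not part of the original work.

```python
def leave_one_out_splits(clips):
    """Yield (training_clips, test_clip) pairs, holding out each clip once."""
    for i, test_clip in enumerate(clips):
        yield clips[:i] + clips[i + 1:], test_clip


def average_accuracy(clips, train_and_evaluate):
    """Average the accuracy returned by `train_and_evaluate(train, test)`
    over all leave-one-out repetitions."""
    scores = [train_and_evaluate(train, test)
              for train, test in leave_one_out_splits(clips)]
    return sum(scores) / len(scores)
```

With 30 clips, this produces 30 repetitions, each with 29 training clips and one test clip, matching the protocol described above. Splitting at the clip level rather than the frame level keeps frames of the same clip from appearing in both the training and test sets.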
Again, because of the lack of a large dataset, transfer learning is adopted during this stage as well. Transfer learning based on the pre-trained AlexNet [42], GoogleNet [43], and ResNet50 [39] models was considered. Four CNN approaches were thus examined. The first approach was a self-defined CNN, meaning that the training was done using the dataset described earlier. The self-defined CNN model included three convolution layers, two max-pooling layers, and three fully connected layers. The second, third, and fourth CNN approaches incorporated pre-trained models of AlexNet, GoogleNet, and ResNet50, respectively. These networks were trained using the ImageNet database [40].
The classification above was done on a per image basis. It is also possible to carry out the classification on a per video clip basis by majority voting. For video-based classification, there are only two classes. For image-based classification, the third class 'Others' was manually created, which is not applicable to the video-based classification.
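A minimal sketch of the video-level majority vote is given below. Discarding frames labeled 'Others' before voting is an assumption made here to reflect that this class does not apply at the video level; the function name and the handling of clips without any usable frame are also illustrative choices.

```python
from collections import Counter


def video_label(frame_labels, ignore=("Others",)):
    """Return the most frequent frame-level label of a video clip.

    Labels listed in `ignore` do not vote. Returns None when no frame
    carries a usable label. Ties are not given any special treatment.
    """
    votes = Counter(label for label in frame_labels if label not in ignore)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

For example, a clip whose frames are labeled 'Long Arm', 'Bundle', 'Long Arm', and 'Others' would be assigned the video-level label 'Long Arm'.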

Experimental Results and Discussion
In this section, the experimental results of the developed person detection and load classification approach are reported. First, the detection and classification were examined separately, and then the detection and classification were evaluated together. The results corresponding to the real-time aspect of the developed approach are also provided. All of the coding was done in MATLAB 2018b, and the timing results reported are for a personal computer equipped with an Intel i7-7700K CPU (central processing unit) and an NVIDIA Quadro P4000 GPU (graphics processing unit).

Person Detection Results
The developed SF-RCNN approach was compared to the faster RCNN, which is increasingly being used for person detection. The MATLAB faster RCNN training function [44] was used for the faster RCNN and SF-RCNN. This function includes a so-called region proposal network (RPN), an ROI (region of interest) max pooling layer, and classification and regression layers. The performance metrics used include FPR (false positive rate), TPR (true positive rate), and FNR (false negative rate), which are widely used and are defined as follows:
FPR = FP/(FP + TN), TPR = TP/(TP + FN), FNR = FN/(TP + FN),
where TP, FP, TN, and FN denote the number of true positives, the number of false positives, the number of true negatives, and the number of false negatives, respectively. In our entire dataset of 30 video clips, there were a total of 65,220 frames, with 30,060 positive frames, in which a person is present, and 35,160 negative frames, in which no person is present.
An RCNN generates a score for each detected box, and this score can be thresholded. Table 1 shows the results for the faster RCNN and SF-RCNN approaches at different thresholds. Lower thresholds led to more detected boxes as well as higher errors. An ROC (receiver operating characteristic) curve was plotted based on TPR and FPR (see Figure 7), indicating the performance at different thresholds. As can be seen from Table 1 and Figure 7, the SF-RCNN approach generated a higher ROC curve.
The processing time for both the faster RCNN and SF-RCNN person detection was 6.5 s per image frame when using the CPU, which dropped to 0.4 s per image frame when using the GPU. These times included the time for reading image frames.
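The sketch below illustrates how the three rates and the points of an ROC curve can be computed by sweeping a threshold over detection scores, using the standard definitions FPR = FP/(FP + TN), TPR = TP/(TP + FN), and FNR = FN/(TP + FN). Treating each frame as one sample with a single score (for instance, the highest box score in that frame) is an assumption made for the example, and the function names are illustrative.

```python
def rates(tp, fp, tn, fn):
    """Compute TPR, FNR, and FPR from the four confusion counts."""
    return {
        "TPR": tp / (tp + fn),
        "FNR": fn / (tp + fn),
        "FPR": fp / (fp + tn),
    }


def roc_points(scores, labels, thresholds):
    """Return a list of (threshold, FPR, TPR) tuples.

    scores: one detection score per frame (assumed layout).
    labels: True if a person is present in the frame, else False.
    Both classes must be present in `labels` to avoid division by zero.
    """
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and not y)
        r = rates(tp, fp, tn, fn)
        points.append((t, r["FPR"], r["TPR"]))
    return points
```

Because TPR and FNR share the same denominator, they always sum to one, so lowering the threshold raises TPR and lowers FNR while typically raising FPR.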

Load Classification Results
For the classification component, four CNN approaches were considered. The images used consisted of the areas with known, manually identified class labels. Figure 8 provides the average accuracy of the four different CNN approaches examined, as well as the processing times when using the CPU and when using the GPU. From Figure 8, it can be seen that the GoogleNet and ResNet50 transfer learning networks provided the highest accuracy of 88%, with GoogleNet having a lower processing time. The figure also shows the speed-up in the processing time per image frame when using the GPU instead of the CPU.

The examination of the misclassifications indicated that they were primarily the result of persons with a long arm getting labeled as persons with a bundle, because in most of the long arm video clips, the persons also carried a bundle. As a result, when an image reflected the back or side of a person with a long arm, it resembled a person carrying a bundle.

Combined Detection and Classification Results
The results of the combined detection and classification, which is how an actual system operates in the field, are reported here. Image areas from the developed SF-RCNN person detector were used for training the GoogleNet transfer learning network without manually selecting labels. The confusion matrices of the combined detection and classification appear in Table 2. As can be seen from this table, for the image-based approach, an overall accuracy of 90.9% was obtained. More specifically, with a threshold of 0.6, the developed person detector processed 15,553 images in the entire video dataset, consisting of 15,551 true positive images and 2 false positive images. Of the total of 1415 mislabeled images, only 2 (0.14%) were due to the person detector and 1413 (99.86%) were due to the load classifier. In essence, nearly all of the errors were caused by the classifier in situations where the back or side of a person faced the camera.
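The error attribution above can be checked with a few lines of arithmetic. The following Python sketch uses only the counts quoted in the text. Taking the 15,553 processed images as the denominator for the overall accuracy is an assumption; it happens to reproduce the reported 90.9%.

```python
# Counts quoted in the text for the image-based combined approach.
TOTAL_IMAGES = 15_553
MISLABELED = 1_415
DETECTOR_ERRORS = 2        # false positives from the person detector
CLASSIFIER_ERRORS = 1_413  # errors from the load classifier

# Share of the mislabeled images attributable to each stage, in percent.
detector_share = 100 * DETECTOR_ERRORS / MISLABELED
classifier_share = 100 * CLASSIFIER_ERRORS / MISLABELED

# Assumption: overall accuracy is measured over all processed images.
accuracy = 100 * (TOTAL_IMAGES - MISLABELED) / TOTAL_IMAGES

print(f"detector share of errors:   {detector_share:.2f}%")
print(f"classifier share of errors: {classifier_share:.2f}%")
print(f"overall image accuracy:     {accuracy:.1f}%")
```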

True Class    Identified Class
              Long Arm    Bundle
Long Arm      91.0%       9.0%
Bundle        9.3%        90.7%

A majority vote was taken over the image frames of a video clip to classify that video clip. For this video-based approach, the overall accuracy was found to be 93.3%. It was noticed that in two of the long arm video clips, the person's back was facing the camera. When these two video clips were excluded, the overall accuracy for the video-based approach reached 100%.
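The clip-level decision can be sketched as a simple majority vote over per-frame labels. The Python snippet below is a minimal illustration, not the implementation used in this work. The function name, the label strings, and the tie-breaking rule (the label seen first wins) are assumptions.

```python
from collections import Counter
from typing import Sequence


def classify_clip(frame_labels: Sequence[str]) -> str:
    """Return the most frequent per-frame label of a video clip.

    Ties are resolved in favor of the label encountered first, which is
    an arbitrary choice made for this sketch.
    """
    if not frame_labels:
        raise ValueError("clip has no classified frames")
    counts = Counter(frame_labels)
    return counts.most_common(1)[0][0]


# Hypothetical per-frame classifier outputs for one clip.
frames = ["long_arm", "bundle", "long_arm", "long_arm", "bundle"]
clip_label = classify_clip(frames)
print(clip_label)
```

Voting suppresses isolated frame errors, which is consistent with the video-based accuracy being higher than the image-based accuracy. The reported 93.3% is also what 28 correct clips out of 30 would give.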

Real-Time Processing
As noted earlier, the SF-RCNN person detection takes 0.4 s per image and the transfer learning GoogleNet takes 7 ms per image when using the GPU. This allows processing 2 frames per second when using a personal computer without any other image processing hardware board. Real-time processing was carried out by performing the detection and classification every half second, that is, once per 15 frames. The average accuracy of the video-based approach when selecting different frames for the majority voting was found to be 94.4%. The confusion matrix of the real-time video-based approach for the combined detection and classification appears in Table 3.
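The timing budget behind this schedule can be worked out as follows. The Python sketch below uses the per-image times quoted in the text. The 30 frames per second video rate is an assumption inferred from the statement that half a second corresponds to 15 frames, and the `offset` parameter is a hypothetical way to model the selection of different frames for the majority voting.

```python
from typing import List

DETECT_S = 0.4      # SF-RCNN detection time per image on the GPU (from the text)
CLASSIFY_S = 0.007  # GoogleNet classification time per image on the GPU (from the text)
VIDEO_FPS = 30      # assumption: implied by "half second" = 15 frames
INTERVAL_S = 0.5    # detection and classification are run every half second

per_frame_s = DETECT_S + CLASSIFY_S     # total time to process one frame
max_rate = 1.0 / per_frame_s            # frames per second the pipeline can sustain
stride = round(INTERVAL_S * VIDEO_FPS)  # video frames between processed frames


def sampled_frames(n_frames: int, offset: int = 0) -> List[int]:
    """Indices of the frames that are processed, starting at `offset`."""
    return list(range(offset, n_frames, stride))


print(f"time per processed frame: {per_frame_s:.3f} s")
print(f"sustainable rate: {max_rate:.2f} frames/s")
print(f"stride: every {stride} frames")
print(sampled_frames(60))
print(sampled_frames(60, offset=7))
```

Because the combined time per frame is below the half-second interval, the pipeline keeps up with the video at this sampling rate. Changing `offset` selects a different set of frames for the vote.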

Conclusions
A semi-supervised faster RCNN approach was developed in this paper for the purpose of detecting persons and classifying the load carried by them in far field video surveillance data captured at distances several miles away via a video camera fitted with a high-power lens. This approach was compared to the faster RCNN approach as the current state of the art, and the results obtained indicated that the developed approach provides an effective solution for detecting a person and for distinguishing a person carrying a bundle from a person carrying a long arm in the far field video dataset examined.
Possible future improvements include developing a dedicated hardware platform to run the processing pipeline at a higher frame rate and collecting more data for the training of the deep neural networks. A dedicated hardware platform would allow running computationally intensive and advanced preprocessing algorithms, such as image stabilization and background subtraction, in real time as part of the detection and classification processing pipeline.

Figure 1. Sample video clip images: (a) a person carrying a long arm and (b) a person carrying a bundle.

Figure 2. Steps involved in the developed detection and classification approach.

Figure 3. Steps in moving areas detection.

Figure 4. Sample images: (a) luminance, (b) frame differencing, (c) detected moving area containing a moving person, (d) detected moving area containing no moving person caused by heat haze.

Figure 6. Diagram illustrating the developed semi-supervised learning architecture.

Figure 7. Receiver operating characteristic (ROC) curves of the semi-supervised faster region-based convolutional neural network (SF-RCNN) versus faster RCNN approaches for person detection.

Figure 8. Different CNN approaches for classification.

Table 2. Confusion matrix of image-based combined detection and classification.

Table 3. Confusion matrix of real-time video-based combined detection and classification.