1. Introduction
Police and various security services use video analysis when investigating criminal activity. Long surveillance videos are increasingly searched by dedicated image analysis software to detect criminal events, to store them, and to initiate proper security actions. One prominent example is the P-REACT project [1] (Petty cRiminality diminution through sEarch and Analysis in multi-source video Capturing and archiving plaTform). Solutions for the automatic analysis of surveillance videos appear to be mature enough that the research community has recently become involved in significant benchmark initiatives [2,3]. The focus of computer vision research has now shifted to the analysis of video data coming from handheld, body-worn, and dashboard cameras, and to the integration of such analysis results with police and public databases.
In typical object detection scenarios, there is much data to learn from, and a major objective is to use it effectively. In a security-oriented environment, the user interaction should be kept as simple as possible. The optimal solution would be to mark only a single object in a selected image frame and initiate a search for occurrences of similar objects in other frames of the processed sequence or in different sequences. This imposes several constraints on the Machine Vision solution that need to be addressed.
First of all, the system should learn on-line or nearly on-line. It must also perform per-frame detection quickly and provide approximate results in a short time, while being tuned so that no occurrence of the object of interest is skipped. Last but not least, the system must be able to learn from small data sets.
This paper is an extension of our conference paper [4], which describes an effective and time-efficient algorithm for instance search and detection in images from handheld video cameras. The system described there uses a discriminative approach to differentiate the object from its background. To do so, a Haar cascade detector is combined with a Histogram of Oriented Gradients–Support Vector Machine (HOG-SVM) classifier. We argued that this provided a desirable trade-off between detection quality and training/detection times. Both positive and negative samples are extracted only from training images.
The extended version presented here includes new system elements and experiments. In particular, the additions are as follows:
introduction of a new foreground/background segmentation procedure,
incorporation and evaluation of a CNN classifier in the detector framework,
additional experiments covering new and previous features,
extended state-of-the-art analysis.
The main components of the system that are affected by the additions are marked in bold in Figure 1.
Comparable detector solutions based on CNNs provide excellent detection performance [5]. Such solutions, however, rely on off-line training, and the training/detection speed is still a bottleneck for such systems. This effect is, to some extent, ameliorated by GPU utilization. Recent developments aim at reducing detection times by cascading CNNs [6] or by first detecting salient regions using fuzzy logic [7]. However, a significant reduction of training time is still an open area of research.
This work can be categorized as a few-shot learning system. The most popular approaches in this field focus on distance metric learning (which, on its own, has a long history [8]) with further data clustering using the metric. In recent years, new solutions based on deep learning have emerged [9], with relational networks being a popular choice for few-shot learning [10]. A popular choice for the loss function in that case is the triplet loss [11]. The application of distance metric learning to classification is straightforward: the metric is produced from the training samples, and the query embeddings are compared with the class representatives using the metric and/or k-Nearest Neighbor. This approach has proved successful in multiple classification tasks [12,13,14,15].
Applying few-shot learning to detection is a harder problem, as the marked object in the training set usually occupies only a small portion of the image. Hence, the training data set is heavily unbalanced. One possibility is to transfer knowledge from existing detectors [16]. An alternative is to use semi-supervised learning, with a few labelled samples and a larger set of unlabelled data used iteratively in training [17]. A different approach is described in [18], where imitations are used in the training process for a robot-grasping task. Finally, the direct application of distance metric learning was proposed in [19], where the InceptionV3 network is used as a backbone for metric learning.
Using the taxonomy proposed in [20] (data-, model-, or algorithm-based), our solution is data-based, with handcrafted rules for transforming the training dataset.
One contribution of the paper is a procedure for collecting as much realistic training data as possible while requiring only limited user interaction. Another contribution is a complete computer-aided video surveillance procedure, together with a thorough evaluation of the applicability of particular computer vision methods to specific stages of processing, bearing in mind system requirements such as the ability to learn from short video sequences, quick learning, and fast detection. The evaluation comprised both classic computer vision methods and contemporary CNN approaches.
Ideally, the modeling stage should be able to start from a single selection of a Region of Interest (ROI), and all additional examples should be obtained automatically. Such least-user-effort approaches have already been discussed, e.g., for semi-automatic video annotation and detection systems such as [21,22]. In the cited methods, however, the user may be asked to annotate the video several times (to decide about samples lying on the decision boundary), which is not necessarily acceptable for all end users. An example of another successful detector that works on a single selection is given in [23]. The detector operates on a sparse image representation (a collection of Scale Invariant Feature Transform (SIFT) descriptors), so it is very time efficient. Our initial experiments have shown that descriptor-based approaches work best for highly textured and fairly complex objects that occupy a large part of the image, which is not always the case in surveillance scenarios.
The procedure of collecting training data given in this paper combines object tracking and background subtraction methods for the semi-supervised collection of training windows and their foreground masks. This step is supported by the GrabCut algorithm [24], with the possibility of smoothly mixing the results of both. The samples collected during tracking are further synthetically generalized (augmented) to enrich the training set. Scenarios where tracking results are utilized for the collection of the detector's training data have already been covered in the literature, especially regarding tracking, with prominent examples [25,26] and more recent CNN approaches [27,28]. In such approaches, the exact foreground–background separation (which is crucial for the effective synthesis of samples) is often neglected, since the algorithms typically have enough frames to collect rich training data.
The proposed methods were evaluated on a corpus of surveillance videos and proved efficient enough to support a user (police officer or security official) in everyday working tasks. The proposed two-level detector architecture was evaluated using alternative methods for feature calculation, namely Histogram of Oriented Gradients (HOG) features and VGGNet-based deep features. In the former case the SVM classifier was used; in the latter, the neural network classification layers were utilized. For the final evaluation, we compared the accuracy of our proposed system with the state-of-the-art Faster-RCNN [29] detector trained on the same data.
The paper is organized as follows: Section 2 presents the technical background and methodology used in our system, Section 3 provides experimental results, and Section 4 contains conclusions. For the reader's convenience, a short dictionary of abbreviations used in the paper is presented at the end.
2. Methods
2.1. Detector Overview
In the system described in this paper, we utilize a classic detection framework, where a sliding window of varying size is moved over each frame and, for each location, the selected image part is evaluated against information gathered from training samples. A crucial part of the detector is formed by a classifier (SVM or NN), which is responsible for the evaluation of each selected image part. A pure classifier, when applied to hundreds of thousands of candidate areas, would be too slow to learn and detect. In our scenario, a pre-classification step utilizing a Haar-like feature-based cascade classifier is applied to limit the number of candidate windows to several hundred. We claim that this simple structure combines a good detection rate with acceptable detection speed (about ten full-HD frames per second on a modest computer) as well as acceptable training speed in typical scenarios (less than a few minutes per pattern).
In essence, the two-stage detector architecture resembles some significant modern CNN approaches, where the detection is divided into the region proposal part and the region recognition part [29]. In our approach, region proposal is performed by the cascade classifier, and the (SVM/NN) classifier does the final classification. Both methods offer reasonable training and detection speeds required for this application.
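The two-stage structure can be illustrated with a minimal sketch. The function names, the square-window generator, and the `proposal_fn` / `score_fn` callables below are hypothetical stand-ins (for the cascade classifier and the SVM/NN scorer, respectively) and are not part of the described system; the sketch only shows how a cheap filter limits the work done by an expensive scorer.

```python
def sliding_windows(img_w, img_h, sizes, stride_frac=0.25):
    """Yield square (x, y, w, h) windows of the given sizes over an image."""
    for s in sizes:
        step = max(1, int(s * stride_frac))
        for y in range(0, img_h - s + 1, step):
            for x in range(0, img_w - s + 1, step):
                yield (x, y, s, s)


def two_stage_detect(windows, proposal_fn, score_fn):
    """Filter windows with a cheap test, then rank survivors by score.

    proposal_fn(window) -> bool   (hypothetical stand-in for the cascade)
    score_fn(window)    -> float  (hypothetical stand-in for SVM/NN; lower is better)
    """
    candidates = [w for w in windows if proposal_fn(w)]
    scored = [(score_fn(w), w) for w in candidates]
    return sorted(scored)
```

Only the windows accepted by the first stage are ever passed to `score_fn`, which is what keeps the per-frame cost manageable when the number of raw windows is very large.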
In our scenario, sources of data are naturally sparse. Depending on the user's decision, the detector can be trained either on a single image or on a short sequence of training images. Therefore, a critical part of our system is a set of tools that help the user collect training examples from short image sequences with little effort, together with methods for the artificial synthesis and generalization of training samples, so that the detector is provided with training data that is as rich as possible. These tools and methods are discussed in subsequent sections. The overall structure of the training procedure is given in Figure 1.
The first stage of processing is an interactive collection of training samples (Interactive ROI selection). The user marks an object in the image and initializes the tracking procedure to collect more training data for the selected object (Tracking of marked object). In the case of unstabilized images (e.g., from a hand-held camera), an additional stabilization step can be applied (Short sequence stabilization). The tracking is open to user intervention, e.g., correction of the ROI in a single frame or in several frames.
The next step of the procedure is foreground–background segmentation (Fg./Bg. segmentation), which helps to recover the precise outline of the object from the rectangular ROI based on motion and color information. Next comes Samples synthesis, where the collected training samples are jittered in different ways to enrich the training set. Both steps allow the operator to correct algorithm parameters with visual feedback. The last step of detector generation is the training of a 2-level classifier (2-level classifier training).
2.2. Collection of Positive Training Samples
Although for some patterns (which include e.g., flat patterns) good detection results can be obtained using only one selected sample that is further generalized and synthesized into a set with larger variability, in most cases detection results highly depend on size and diversity of input training set. In the scenario discussed in this paper, these properties of the training set can (at least partially) be achieved by collecting samples from a short sequence of input images. Our scenario is organized as follows: (1) a user selects an object of interest using a rectangular area, (2) the application tracks the object in subsequent frames of the sequence (with optional manual reinitialization), (3) object foreground masks are established using motion information and image region properties.
2.2.1. Object Tracking and Foreground–Background Separation Using Motion Information
To track the rectangular area, an optimized version of the Circulant Structure of Kernels (CSK) tracker [30] that utilizes color-names features [31] is used. As a result of the tracking procedure, we obtain a sequence of rectangular areas that encompass the object of interest in subsequent frames. In most cases, both the object foreground and the background will be present in the tracked rectangle. However, if the object is moving against a moderately static background, we can exploit motion information to effectively separate the object foreground from the background by background subtraction.
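Let the tracking results be described by a sequence of rectangular areas $R_t$, $t = 1, \dots, T$, and let us denote the coordinates of pixel $i$ as $p_i$, the color attributes of pixel $i$ at time $t$ as $x_i^t$, and the mean of the color attributes in the background as:

$$\bar{x}_i = \frac{1}{N_i} \sum_{t \,:\, p_i \notin R_t} x_i^t$$

where the averaging factor $N_i$ is the number of frames in which the tracking window does not contain pixel $i$ and can be computed as $N_i = \left|\{t : p_i \notin R_t\}\right|$.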
Now we can specify a background training sequence for each pixel
Following the rule above, only pixels that at given time-step do not belong to the tracked area contribute to the background model computed for the image. Each pixel that always belongs to the tracked area is conservatively treated as a foreground as we are unable to establish a background model for these areas.
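The rule above can be sketched as follows. This is a simplified illustration, assuming gray-level frames stored as nested lists and one `(x, y, w, h)` tracking rectangle per frame; the function names are hypothetical, and the actual system works on color attributes.

```python
def inside(rect, x, y):
    """True if pixel (x, y) lies within the (x, y, w, h) rectangle."""
    rx, ry, rw, rh = rect
    return rx <= x < rx + rw and ry <= y < ry + rh


def background_mean(frames, rois):
    """Per-pixel mean over the frames in which the pixel is outside the ROI.

    Returns (mean, count). mean[y][x] is None where the pixel was inside the
    tracked rectangle in every frame, i.e., where no background model can be
    established and the pixel is conservatively treated as foreground.
    """
    h, w = len(frames[0]), len(frames[0][0])
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]
    for frame, roi in zip(frames, rois):
        for y in range(h):
            for x in range(w):
                if not inside(roi, x, y):
                    total[y][x] += frame[y][x]
                    count[y][x] += 1
    mean = [
        [total[y][x] / count[y][x] if count[y][x] else None for x in range(w)]
        for y in range(h)
    ]
    return mean, count
```

The `count` grid corresponds to the per-pixel averaging factor, and its zero entries mark the area that is always covered by the tracked rectangle.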
The background model adopted here follows the algorithms from [32]. In this method, scene color is represented independently for each pixel. The color $x$ of each pixel (from both the background and the foreground, $BG+FG$), given the training sequence $X_T$, is modeled as:

$$\hat{p}(x \mid X_T, BG+FG) = \sum_{m=1}^{M} \hat{\pi}_m \, \mathcal{N}(x; \hat{\mu}_m, \hat{\sigma}_m^2 I)$$

where $\hat{\mu}_m$ and $\hat{\sigma}_m$ are the estimated means and standard deviations of the color mixture components, $\hat{\pi}_m$ are the mixing coefficients, $M$ is the total number of mixture components, and $\mathcal{N}(x; \hat{\mu}_m, \hat{\sigma}_m^2 I)$ denotes the Gaussian density function evaluated at $x$.

The algorithm makes the (generally correct) assumption that background pixels, as those appearing most often, will dominate the mixture. Therefore the background model ($BG$) is built from a selected number of the largest clusters in the color mixture:

$$p(x \mid X_T, BG) \sim \sum_{m=1}^{B} \hat{\pi}_m \, \mathcal{N}(x; \hat{\mu}_m, \hat{\sigma}_m^2 I)$$

where $B$ is the selected number of background components. The pixel is decided to belong to the background when

$$p(x \mid X_T, BG) > c_{thr}.$$

The threshold $c_{thr}$ can be interactively adjusted by the user. Exact algorithms for updating the mixture parameters are given in [32]. A sample result of the background subtraction procedure is given in Figure 2.
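The background decision for a single pixel can be sketched as below. The sketch assumes isotropic Gaussian components given as `(weight, mean, sigma)` tuples that have already been estimated; the function names and the example values are illustrative only, and the on-line update of the mixture parameters described in [32] is not covered.

```python
import math


def gaussian_density(x, mean, sigma):
    """Isotropic Gaussian density with covariance sigma^2 * I, evaluated at x."""
    dim = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    norm = (2.0 * math.pi * sigma ** 2) ** (dim / 2.0)
    return math.exp(-0.5 * sq_dist / sigma ** 2) / norm


def background_likelihood(x, components, n_background):
    """Sum of the n_background heaviest mixture components evaluated at x."""
    ranked = sorted(components, key=lambda c: c[0], reverse=True)
    return sum(w * gaussian_density(x, m, s) for w, m, s in ranked[:n_background])


def is_background(x, components, n_background, threshold):
    """A pixel is background when its background likelihood exceeds the threshold."""
    return background_likelihood(x, components, n_background) > threshold
```

Raising `threshold` makes the test stricter, so fewer pixels are accepted as background and the foreground mask grows; lowering it has the opposite effect.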
Some modern developments in foreground–background separation using the Robust Principal Component Analysis (RPCA) approach were proposed, e.g., in [33,34]. They are founded on extensive optimization in 3D spatio-temporal volumes and offer excellent accuracy at the expense of some processing speed. Since our system relies on the interaction between the human operator and the computer, the processing speed is very important, so purely local methods currently seem to be the best choice. However, this topic will be investigated in future versions of the system, since some ideas from [33,34] are likely to be complementary to our developments presented in Section 2.2.3.
2.2.2. GrabCut Algorithm
GrabCut [24] is a widely acclaimed method for (semi)automatic foreground/background segmentation. The method takes into account several properties of image regions: color distribution, coherence of regions, and contrast between regions. These factors are described in the form of an image-wide energy function to be optimized, which assumes the following form:

$$E(\underline{\alpha}, \underline{\theta}, z) = U(\underline{\alpha}, \underline{\theta}, z) + V(\underline{\alpha}, z)$$

where $\alpha_n$ describes the segmentation information for pixel $n$ (can be binary), $\theta$ is the set of parameters of the Gaussian Mixture Model representing the color distribution in the background and foreground, and $z$ are the observed image pixels. Estimated parameters are underlined. The energy term $U$ describes how well the current estimation of foreground and background pixels matches the assumed color distribution of foreground and background, while the term $V$ evaluates the spatial consistency of regions by penalizing discontinuities (except for areas of high contrast).

The best configuration of parameters is the one minimizing the term $E$. The components are designed in such a way that the energy term can be minimized using an effective graph-cut algorithm. The optimization algorithm is iterative and switches between (re)estimation of the region color distribution and (re)estimation of the segmentation.

The input of the algorithm is defined in [24] as a trimap $T = \{T_B, T_F, T_U\}$. $T_B$ stands for sure background, $T_F$ stands for sure foreground, and $T_U$ is an unknown area (to be estimated). $T_B$ and $T_F$ are fixed and cannot change during the algorithm. A typical initialization is to set $T_B$ to the area outside of the object ROI, set $T_F$ to ∅, and let $T_U$ be the remaining part of the image. In the first iteration, all pixels from $T_B$ are initialized as background and all pixels from the unknown area $T_U$ as foreground (which is subject to change). Implementations like [35], however, allow additional areas to be specified within $T_U$: the likely foreground $T_{PF}$ and the likely background $T_{PB}$, as a convenient starting point for optimization.
2.2.3. Object Tracking and Foreground–Background Separation Using Hybrid Motion Information and GrabCut
While pure object motion information is sufficient to perform foreground–background segmentation in most cases, it fails altogether for static objects. In addition, the precision of such an approach varies from case to case and strongly depends on the manual selection of the background subtraction threshold. Therefore, we propose a modified procedure that fine-tunes the results of background subtraction using GrabCut.
- 1.
The object is tracked and its foreground mask is obtained using the methods from Section 2.2.1.
- 2.
For each frame, the current foreground–background segmentation results are used to initialize a GrabCut trimap, specifically:
the foreground region is used to initialize the G-C sure foreground, with the exception of the area for which no background model could be reliably established (areas that belong to every collected tracking ROI),
the G-C sure background is initialized outside the tracked ROI border; to provide enough pixels for background estimation, the tracking area is scaled uniformly by 50%,
the remaining area of the ROI becomes the G-C unknown area.
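The trimap initialization can be sketched as follows. The label values follow the OpenCV GrabCut mask convention; everything else is an assumption made for illustration: the function names are hypothetical, "scaled uniformly by 50%" is read as enlarging the tracked rectangle around its centre by a factor of 1.5, and the unknown area is seeded as probable foreground.

```python
import math

# Label values follow OpenCV's GrabCut mask convention.
GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD = 0, 1, 2, 3


def enlarge_roi(roi, img_w, img_h, scale=1.5):
    """Grow an (x, y, w, h) rectangle around its centre, clipped to the image."""
    x, y, w, h = roi
    dx = math.ceil(w * (scale - 1.0) / 2.0)
    dy = math.ceil(h * (scale - 1.0) / 2.0)
    x0, y0 = max(0, x - dx), max(0, y - dy)
    x1, y1 = min(img_w, x + w + dx), min(img_h, y + h + dy)
    return x0, y0, x1 - x0, y1 - y0


def build_trimap(fg_mask, no_model_mask, roi, scale=1.5):
    """Build a GrabCut label grid for the enlarged ROI.

    fg_mask[y][x]       -- True where background subtraction reports foreground
    no_model_mask[y][x] -- True where no background model could be established
    """
    img_h, img_w = len(fg_mask), len(fg_mask[0])
    rx, ry, rw, rh = roi
    ex, ey, ew, eh = enlarge_roi(roi, img_w, img_h, scale)
    trimap = []
    for y in range(ey, ey + eh):
        row = []
        for x in range(ex, ex + ew):
            in_roi = rx <= x < rx + rw and ry <= y < ry + rh
            if not in_roi:
                row.append(GC_BGD)      # band around the ROI: sure background
            elif fg_mask[y][x] and not no_model_mask[y][x]:
                row.append(GC_FGD)      # reliable motion foreground: sure foreground
            else:
                row.append(GC_PR_FGD)   # rest of the ROI: unknown, starts as foreground
        trimap.append(row)
    return trimap
```

For a static object, `no_model_mask` is true over the whole ROI, so no sure-foreground seeds are produced and the whole ROI is left for GrabCut to decide.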
We found our solution somewhat similar to the one proposed quite recently in [36]. However, in the cited approach the trimap is initialized differently: the G-C unknown area is limited to the area of the morphological gradient (difference between dilation and erosion) of the foreground area established by background subtraction. In our solution we safely assume that the sure background is always outside the tracked ROI, so there is no risk of incorporating the foreground object into it. Additionally, in contrast to our approach, in [36] the background subtraction is not discussed within the tracking context.
The solution proposed here is parametrized by a single background subtraction threshold and provides a smooth user experience when transitioning between different threshold values. By specifying very low thresholds, the user selects a foreground mask covering the whole tracked ROI. For larger thresholds, we obtain the results of the G-C algorithm with the foreground constrained to be at least the mask generated by the background subtractor. For very large, extreme values of the threshold, the foreground seeds from the background subtractor become small, and the method converges to the output of the vanilla G-C algorithm for static images.
It is interesting to note that for static ROIs we use exactly the same procedure. For such ROIs, the area of uncertain background model (the area where no background model could be reliably established) is very large and covers the whole ROI. In such a situation, the entire ROI area is simply subject to the classic G-C algorithm. Note also that, using only the methods from Section 2.2.1, the whole ROI area would inevitably be labelled as foreground.
2.2.4. Image Stabilization in a Short Sequence
The foreground–background segmentation procedure works best when a stable camera position is available (or the image sequence is stabilized before segmentation). The system proposed here uses a stabilization procedure based on matching SURF features [37] and computing the homography transformation between pairs of images. The stabilization works on short subsequences of the original sequence. The first frame to stabilize is the one used for marking the initial region of interest. The procedure then aligns all subsequent frames to the first frame by estimating the homography relating the two images. To do so, matching methods from [38] and the Least Median of Squares principle [39] are utilized. To increase stabilization efficiency, GPU-accelerated procedures for keypoint/descriptor extraction and matching from the OpenCV library [35] are utilized.
2.3. Collection of Negative Training Samples
Negative samples used in detector training are extracted from the same sequence images that the positive samples originated from. For each training image, one fragment is used to extract a positive sample, while the remaining part of the image is divided into at most four sources of negative samples, as shown in Figure 3. Thus, an assumption is made that these remaining parts of the training sequence images do not contain positive samples. This assumption is not always valid, but may be strengthened by asking the user to mark all positive examples in the training sequence.
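One possible way of dividing the remainder of a training image into at most four negative-sample sources is sketched below. The particular layout (full-width strips above and below the ROI, plus side strips at the ROI height) is an assumption made for illustration and may differ from the layout used in Figure 3; the function name is hypothetical.

```python
def negative_regions(img_w, img_h, roi):
    """Split the image area outside an (x, y, w, h) ROI into up to four rectangles."""
    x, y, w, h = roi
    candidates = [
        (0, 0, img_w, y),                    # strip above the ROI
        (0, y + h, img_w, img_h - (y + h)),  # strip below the ROI
        (0, y, x, h),                        # strip left of the ROI
        (x + w, y, img_w - (x + w), h),      # strip right of the ROI
    ]
    # Regions collapse to zero size when the ROI touches an image border.
    return [r for r in candidates if r[2] > 0 and r[3] > 0]
```

The rectangles do not overlap each other or the ROI, so together they cover exactly the image area outside the marked object.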
2.4. Positive Samples Generalization and Synthesis
2.4.1. Geometric Generalization
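In this step, 3D rotations are applied to the collected pattern images and their masks. It is assumed that the patterns are planar, so this generalization method is useful only to some extent for non-planar objects. The rotation effect is obtained by applying a homography transformation that imitates the application of three rotation matrices $R_x(\alpha)$, $R_y(\beta)$, $R_z(\gamma)$ to a 3D object. The matrices correspond to rotations around the $x$, $y$, and $z$ axes, respectively. The 3D rotation matrices are defined classically:

$$R_x(\alpha) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix}, \quad R_y(\beta) = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}, \quad R_z(\gamma) = \begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

To compute the transformation, first a homography matrix is computed using the formula:

$$H_w = R - \frac{t\,n^T}{d}$$

where $n$ is a vector normal to the pattern plane (we set it to $[0, 0, 1]^T$), $d$ is the distance from the virtual camera to the pattern (we set it arbitrarily to $1$, since it only scales 'real-world' units of measurement), and $R$ is the 3D rotation matrix that is decomposed as:

$$R = R_x(\alpha)\,R_y(\beta)\,R_z(\gamma)$$

In order for the image center (having world coordinates $[0, 0, d]^T$) to remain intact during the transformation, we define a 'correcting' translation vector as:

$$t = R\,[0, 0, d]^T - [0, 0, d]^T$$

Then we can specify artificial camera matrices as

$$K_{in} = \begin{bmatrix} f & 0 & u_{in} \\ 0 & f & v_{in} \\ 0 & 0 & 1 \end{bmatrix} \quad \text{and} \quad K_{out} = \begin{bmatrix} f & 0 & u_{out} \\ 0 & f & v_{out} \\ 0 & 0 & 1 \end{bmatrix}$$

where $(u_{in}, v_{in})$ and $(u_{out}, v_{out})$ are the pixel coordinates of the centers of the input and output image, respectively, while $f$ is the artificial camera focal length given in pixels. In this application we set $f$ to be $\kappa$ times the larger input image dimension. The multiplier $\kappa$ determines the virtual distance of our virtual camera to the object. Smaller values introduce larger perspective distortions, while larger values introduce smaller distortions. We arbitrarily set $\kappa$ to 10, implying only slight perspective distortions.

The final homography transformation applied to the pixels of the input image is given by

$$H = K_{out}\,H_w\,K_{in}^{-1}$$

Rotation angles $\alpha$, $\beta$, and $\gamma$ are selected randomly from the uniform distribution (denoted here as $\mathcal{U}$). The amount of rotation around the $y$ axis is twice the amount of rotation around the remaining axes, to better reflect the dominant rotations in human movement:

$$\alpha \sim \mathcal{U}(-\theta_{max}, \theta_{max}), \quad \beta \sim \mathcal{U}(-2\theta_{max}, 2\theta_{max}), \quad \gamma \sim \mathcal{U}(-\theta_{max}, \theta_{max})$$

and $\theta_{max}$ is the parameter specifying the maximum extent of allowed rotation.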
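The geometric generalization can be sketched in a few lines. The sketch below fixes several conventions that are assumptions rather than statements about the original implementation: the rotation order (x, then y, then z matrices multiplied left to right), the sign convention of the plane-induced homography, identical input and output image sizes, a focal length equal to `kappa` times the larger image dimension, and symmetric sampling ranges. All function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]


def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]


def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]


def pattern_homography(alpha, beta, gamma, width, height, kappa=10.0, d=1.0):
    """Pixel-to-pixel homography imitating a 3D rotation of a planar pattern."""
    rot = matmul(matmul(rot_x(alpha), rot_y(beta)), rot_z(gamma))
    f = kappa * max(width, height)          # assumed focal length in pixels
    cx, cy = width / 2.0, height / 2.0
    # Plane centre c = (0, 0, d); correcting translation t = R c - c.
    t = [rot[i][2] * d - (d if i == 2 else 0.0) for i in range(3)]
    # With n = (0, 0, 1), R - t n^T / d changes only the third column of R.
    hw = [[rot[i][j] - (t[i] / d if j == 2 else 0.0) for j in range(3)] for i in range(3)]
    k = [[f, 0, cx], [0, f, cy], [0, 0, 1]]
    k_inv = [[1 / f, 0, -cx / f], [0, 1 / f, -cy / f], [0, 0, 1]]
    return matmul(matmul(k, hw), k_inv)


def apply_h(h, x, y):
    """Map pixel (x, y) through homography h."""
    u, v, w = (h[i][0] * x + h[i][1] * y + h[i][2] for i in range(3))
    return u / w, v / w


def random_angles(theta_max, rng=random):
    """Sample (alpha, beta, gamma); the y-axis range is twice as wide."""
    return (
        rng.uniform(-theta_max, theta_max),
        rng.uniform(-2 * theta_max, 2 * theta_max),
        rng.uniform(-theta_max, theta_max),
    )
```

Under these conventions the image center is a fixed point of the transformation for any choice of angles, which is the property the correcting translation is meant to guarantee.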
2.4.2. Intensity and Contrast Synthesis
In the proposed approach image intensity and contrast synthesis is applied in addition to geometric transformations. It is especially important for Haar-like features that lack intensity normalization.
First, intensity values of pixels are retrieved from RGB image by extracting V component from HSV representation of the image and setting . The intensity and contrast adjustment affects only V channel. After adjustment, the RGB image is reconstructed from HSV′ where .
For adjustment, a simple linear formula is used. For each pixel gray value
we have
where
where
is the average intensity of the sample. Contrast deviation
as well as intensity deviation
are sampled from the uniform distribution
and
.
is a parameter denoting the maximum allowed contrast change and
is a parameter denoting the maximum allowed intensity change. Changes in contrast preserve mean intensity of an image. After application of the formula its results are appropriately saturated.
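A linear adjustment with the stated properties (contrast changes preserve the mean intensity, results are saturated) can be sketched as follows. The exact form of the formula used in the system is not reproduced here; the multiplicative gain `1 + contrast_dev`, the 0..255 value range, and the function names are assumptions made for illustration.

```python
import random


def adjust_intensity(values, contrast_dev, intensity_dev):
    """Linearly adjust V-channel values (assumed range 0..255) of one sample.

    Contrast is scaled around the sample mean, so a pure contrast change
    leaves the mean intensity unchanged; intensity_dev shifts all values.
    """
    mean = sum(values) / len(values)
    out = []
    for g in values:
        g_new = (1.0 + contrast_dev) * (g - mean) + mean + intensity_dev
        out.append(min(255.0, max(0.0, g_new)))  # saturate to the valid range
    return out


def sample_deviations(c_max, i_max, rng=random):
    """Draw contrast and intensity deviations from symmetric uniform ranges."""
    return rng.uniform(-c_max, c_max), rng.uniform(-i_max, i_max)
```

In a full pipeline, `values` would be the flattened V channel of the HSV image, and the adjusted values would be written back before conversion to RGB.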
2.4.3. Application of Blur
Training and test samples may differ in terms of quality of image details due to different factors such as deficiencies of optics used, motion blur or distance. In our case, we apply a simple Gaussian filter to simulate natural blur effects
where
and
are image sample sizes and
controls the maximum size of the Gaussian kernel.
2.4.4. Merging with the Background
Generalized training images are superimposed on background samples extracted from negative examples of size ranging from about 0.25 to 4 times the positive sample size. Gray-level masks are used for the seamless incorporation of positive samples into background images.
2.5. Detector Training
The detector training procedure is divided into two steps. In the first step, the cascade classifier using HAAR-like features is trained. The classifier is trained on training samples resampled to a fixed size of 24 × 24 pixels. In our scenario, for each cascade stage, 300 positive samples and 100 negative samples are utilized. The minimum true positive rate for each cascade level is set to , and the maximum false positive rate is set to . The classifier is trained for a maximum of 15 stages or until reaching ≈0.00003 FPR. The expected TPR is at least . By using these settings, up to about 1000 detections are generated for each Full-HD test image.
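The relation between per-stage and overall cascade rates can be checked with a short calculation, assuming that stages act independently so that the overall rates are products of the per-stage rates. The per-stage values used in the usage note below are illustrative assumptions, not values taken from the text.

```python
def cascade_rates(stage_tpr, stage_fpr, stages):
    """Overall (TPR, FPR) of a cascade whose stages all meet the given rates."""
    return stage_tpr ** stages, stage_fpr ** stages
```

For example, a per-stage false positive rate of 0.5 over 15 stages gives 0.5^15 ≈ 0.00003, which is consistent with the overall FPR figure quoted above; a hypothetical per-stage true positive rate of 0.995 would give an overall TPR of roughly 0.93.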
2.6. Detector Training Using HOG+SVM
During the second stage of training, an SVM classifier is trained to handle samples that passed the first-stage cascade classification. For most experiments, the SVM classifier is trained on 300 positive and 300 negative samples, or on 600 and 600 samples, respectively. The SVM classifier uses the Gaussian RBF kernel.
The Gaussian kernel size and the SVM regularization parameter $C$ are adjusted using an automatic cross-validation procedure performed on the training data. For SVM classification, Histogram of Oriented Gradients features [40] are extracted. Two resolutions of training images were used: 24 × 24 and 32 × 32. For each sample, a 9-element histogram is created in 4 × 4 cells, with a 16 × 16 histogram normalization window overlapping by 8 pixels, thus giving 576 HOG features in total in the 24 × 24 case and 1296 in the 32 × 32 case.
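The length of the HOG descriptor follows directly from the window, block, and cell geometry. The sketch below assumes the usual dense HOG layout (square window, blocks placed on a regular grid with the given stride, every cell contributing one histogram per block it belongs to); the function name is hypothetical.

```python
def hog_feature_count(win, block=16, stride=8, cell=4, bins=9):
    """Number of HOG features for a square window of side `win` pixels."""
    blocks_per_dim = (win - block) // stride + 1   # block positions along one axis
    cells_per_block = (block // cell) ** 2         # 4 x 4 cells in a 16 x 16 block
    return blocks_per_dim ** 2 * cells_per_block * bins
```

With the settings described above, a 24 × 24 window holds 2 × 2 block positions and a 32 × 32 window holds 3 × 3, each block contributing 16 × 9 = 144 values.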
Negative samples are extracted from the Cascade Classifier decision boundary (i.e., negative samples that were nevertheless accepted by the CC) if possible. If not, image fragments used as background images for positive samples or (as a last resort) other randomly selected samples are used. In all experiments, the OpenCV 3.1 [35] Cascade Classifier and SVM implementations are utilized.
Given our test data, the number of resulting support vectors in the SVM classifier varies between 200 and 400 for 24 × 24 case but can be twice as large for 32 × 32 case. Let us review one specific configuration: ’hat’ pattern trained on 55 24 × 24 images with masks and pattern generalization settings , , . After SVM metaparameter optimization we obtain SVM regularization parameter , RBF kernel size , and the number of support vectors 233.
2.7. Detector Training Using Tuned VGG16 Network
The features processed by our detector are limited to HOG and Haar-like features. There exist quite a few other solutions that use a much richer set of features for object detection, e.g., [41], however, at the expense of increased computational complexity, which is an important practical concern for our system. As an alternative to evaluating complex sets of hand-crafted features, we decided to resort to state-of-the-art methods of automatic feature computation based on Convolutional Neural Networks.
Therefore, in addition to the classifier described in Section 2.6, we evaluate a VGG16 network [42] trained on the ImageNet dataset [43] and tuned to our problem using the transfer-learning principle [44]. Transfer learning is a common technique for overcoming the limited availability of training data. When access to training data for a given problem is limited, one can use an existing network trained on another dataset and subsequently fine-tune (retrain) this network to accommodate the specific problem samples. Since the original network was usually trained on millions of examples and hundreds of classes, it is capable of extracting robust and usually quite universal features. Thus, the new network can benefit from the pre-trained feature layers and train only the classification layers.
For our task, we utilize the convolutional part of the VGG16 network (which is obtained from the original network by removing the fully connected layers). The network is augmented with two fully connected layers (1024 neurons with ReLU activation and a single neuron with sigmoid activation, respectively) and one dropout layer (with dropout rate 0.5) to avoid overfitting. The VGG16 network was originally trained on images of 224 × 224 pixels; however, the convolutional part accepts any multiple of 32 for image width and height. In this paper, we verify classifier performance for input image sizes of 32 × 32, 64 × 64, 128 × 128, and finally 224 × 224 pixels. The total number of parameters in the fully connected layers ranges from 526 337 to 25 692 161 depending on the input image size. The overall network structure is given in Figure 4.
The selected loss function is a binary cross-entropy for binary classification. During training, all the convolutional layers are frozen and only the parameters of fully connected layers are adapted.
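The number of trainable parameters in the added classification layers can be verified with a short calculation. The sketch assumes that the convolutional part reduces each spatial dimension by a factor of 32 and outputs 512 channels, that its output is flattened before the first fully connected layer, and that each dense layer has a bias term; the function name is hypothetical.

```python
def head_parameter_count(input_size, hidden=1024, channels=512, downscale=32):
    """Parameters of the two dense layers placed on top of the frozen convolutions."""
    side = input_size // downscale           # spatial size of the last feature map
    features = side * side * channels        # length of the flattened feature vector
    dense1 = features * hidden + hidden      # weights + biases of the hidden layer
    dense2 = hidden + 1                      # weights + bias of the sigmoid neuron
    return dense1 + dense2
```

For a 224 × 224 input this gives 7 × 7 × 512 = 25 088 flattened features and 25 692 161 parameters in total, matching the upper figure quoted above. The dropout layer adds no parameters.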
2.8. Detection and Post-Processing
During the detection phase, each test image is first processed by the cascade classifier, typically returning several hundreds of candidate areas. After this, each candidate area is examined by the classifier and a score is assigned to each detection. In case of the SVM, the score is computed as the signed distance from the separating plane in support vector space with the lowest negative scores treated as best matches and high positive scores as worst matches. For the VGG16-based neural network, the output of the sigmoid function is negated and used as the score.
For each image, only the best-scoring area is considered for further processing. Frames from the test sequence are sampled and processed with increasing density (first, last, and middle frame to start with, and then intermediate frames) to quickly produce some results for the user to review (non-minima suppression is used to reduce clutter).
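The coarse-to-fine frame ordering can be sketched as a breadth-first bisection of the sequence. This is one possible reading of "increasing density", given here as an assumption; the function name is hypothetical.

```python
from collections import deque


def coarse_to_fine_order(n_frames):
    """Frame indices ordered as first, last, middle, then midpoints of the gaps."""
    if n_frames <= 0:
        return []
    if n_frames == 1:
        return [0]
    order = [0, n_frames - 1]
    queue = deque([(0, n_frames - 1)])
    while queue:
        lo, hi = queue.popleft()
        if hi - lo < 2:
            continue  # no unvisited frame strictly between lo and hi
        mid = (lo + hi) // 2
        order.append(mid)
        queue.append((lo, mid))
        queue.append((mid, hi))
    return order
```

Processing frames in this order means that, at any point in time, the frames analyzed so far are spread roughly evenly over the whole sequence, so the user can review early results without waiting for a full pass.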