1. Introduction
People detection is one of the main tasks in computer vision, with applications in many areas such as video surveillance and human-computer interaction. Detection is difficult due to the variability of people's appearance and pose, and its performance also depends strongly on the data used for training [1]. Classical people detection techniques can be divided into three stages [2]: firstly, a person model is designed, defining the characteristics that detected objects must fulfill to be considered people; secondly, an object extraction process finds the candidates to be classified; finally, classification compares the objects detected in the sequence with the model generated in the first step and decides whether each object is a person. Depending on the application, the decision can be binary or a probability of being a person.
The information provided by a single camera is limited, so monitoring a wide area, or obtaining more information from the different viewpoints of a region of interest, requires more than one camera. For this reason, using several cameras is a common way of developing applications [3,4]; it is also useful for solving occlusions in scenarios with a high density of people/objects and for 3D applications [5,6]. Using a multi-camera environment in scenarios with possible occlusions usually improves detection performance with respect to using the cameras independently. A method is proposed in [7] to detect and track people in multi-camera environments with occlusions. Building on the methodology of [8], it merges the information of each camera of the scenario into a common plane (the ground plane) obtained by homographies. The individual information combined in the common plane is previously obtained by background subtraction. Object detections are then performed in the common plane, and afterwards the correspondence between cameras and objects is established. In this way, using cameras at different locations, the occlusion problem is solved; the main limitation is that the individuals must initially appear isolated. In [9], an improvement of the previous method [8] was proposed to eliminate false positives: the algorithm compares the views of all the cameras for each detected object and avoids false detections by applying multiple-view perspective geometry of people's presence on the ground plane. It is also worth considering [10], where a method using a Kalman filter to obtain 3D information from 2D information is presented.
Unlike the previous approaches, in this work, we propose to transfer the detections from one camera to another instead of just projecting all the detections onto the common plane. In the state of the art, the information of the detections is usually projected onto the ground plane at the point level (one point per detection) or at the mask level (masks are projected, and their intersections indicate the position of the detected person). The work presented in this paper uses the common plane to obtain the information of the different camera views, and it allows transferring (and afterwards correlating) people detections from each camera to the others. A minimal example of the point-level projection that we build upon is sketched below.
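As an illustration of the homography-based, point-level projection commonly used in the state of the art, the following minimal sketch (our own, not taken from the cited works) maps a detection's base mid-point to the ground plane and then into a second camera's view; the homographies H_img_to_ground and H_ground_to_img2 are assumed to be available from calibration.

```python
# Hypothetical sketch: projecting a detection's foot point to the ground
# plane via a homography, and on to another camera's view. Assumes the
# 3x3 homographies are known from calibration; names and shapes are
# illustrative, not taken from the paper.
import numpy as np
import cv2

def transfer_foot_point(bbox, H_img_to_ground, H_ground_to_img2):
    """Map the base mid-point of a bounding box (x, y, w, h) in camera 1
    to camera 2's image plane through the common ground plane."""
    x, y, w, h = bbox
    foot = np.array([[[x + w / 2.0, y + h]]], dtype=np.float32)  # base mid-point
    ground = cv2.perspectiveTransform(foot, H_img_to_ground)     # to ground plane
    return cv2.perspectiveTransform(ground, H_ground_to_img2)[0, 0]
```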
By employing multiple cameras, the available viewpoints provide additional information that may overcome the limitations of detectors applied to single camera views. However, determining the confidence of the information generated for each viewpoint, and therefore parametrizing the detectors automatically, remains a challenging problem. Traditionally, optimal parameters are determined and fixed beforehand at training time [11,12,13,14]. A method is proposed in [15] to adapt people detectors automatically during runtime classification: the authors propose a mono-camera approach based on the correlation and combination of six detectors in order to choose each detector's threshold properly frame by frame.
In this paper, we also propose a method to adapt the configuration of people detectors automatically at runtime. Unlike generic approaches that fix confidence thresholds beforehand, or approaches restricted to a single camera, this method adapts the detector's threshold for each frame and camera. We consider generic threshold-based detectors trained on standard datasets, making this proposal applicable to most state-of-the-art people detectors.
This paper is organized as follows: Section 2 overviews the proposed approach and main contributions; Section 3 and Section 4 describe the detection transfer between cameras and the correlation framework; Section 5 presents the experiments; finally, Section 6 concludes this paper.
2. Framework Overview
In [15], a method is proposed to select the working threshold at runtime based on the correlation of multiple detectors. Note that the work in [15] is based on the correlation of pairs of detectors, not pairs of cameras, and it does not consider information transfer between cameras. The new proposed approach uses the correlation stage as just one part of the whole framework (see Figure 1). The proposed framework transfers detections to a different point of view, combines multiple cameras automatically, and selects the working threshold of each one automatically at runtime. First, each camera detects independently. Afterwards, a generic transfer algorithm establishes a common point of view and concentrates the information of every camera. Then, the detections of each pair of cameras are correlated in the common field of view of all cameras. Finally, the correlation stage automatically determines the best threshold for each camera simultaneously.
Figure 1 shows the complete framework. The different parts of the framework, which are described in more detail in their corresponding sections, are the following:
The frame-by-frame detections of all cameras are extracted, transferred, and homogenized (the position and volume of the transferred detections between cameras must be corrected) to the desired viewpoint. In this way, the object information is not reduced to a single coordinate, which allows transferring more information (volume, height, aspect ratio) and processing the information for each camera viewpoint.
The homogenized detections from the previous stage are correlated frame by frame, and an optimal decision threshold is selected for each camera and frame. The correlations are computed for each pair of transferred detection results ($D_i$ and $D_j$), which determine an optimal pair of thresholds for each pair of cameras ($\tau_i$ and $\tau_j$, respectively). Finally, the pair-wise selected thresholds are combined by weighted voting to obtain the best adapted threshold for each individual camera ($\tau_1, \ldots, \tau_N$). A sketch of this data flow is given below.
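The following is a minimal end-to-end sketch of the framework's data flow, under stated assumptions: detect(), transfer_detections(), and correlate_pair() are hypothetical placeholders for the stages described above, not the paper's actual API, and correlate_pair is assumed to return one threshold hypothesis per camera of the pair (see Section 4).

```python
# Hypothetical sketch of the overall pipeline; all called functions are
# placeholders for the stages described in Sections 3 and 4.
import itertools

def adapt_thresholds(frames, cameras, target_cam):
    # 1. Per-camera detections, transferred and homogenized to the
    #    viewpoint of target_cam (Section 3).
    transferred = {c: transfer_detections(detect(frames[c]), c, target_cam)
                   for c in cameras}
    # 2. Pairwise correlation: each pair (i, j) yields threshold
    #    hypotheses tau_ij for camera i and tau_ji for camera j.
    hypotheses = {c: [] for c in cameras}
    for i, j in itertools.combinations(cameras, 2):
        tau_ij, tau_ji = correlate_pair(transferred[i], transferred[j])
        hypotheses[i].append(tau_ij)
        hypotheses[j].append(tau_ji)
    # 3. Decision fusion: equal-weight voting over each camera's N-1
    #    pairwise hypotheses.
    return {c: sum(h) / len(h) for c, h in hypotheses.items()}
```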
3. Detection Transfer between Cameras
To transfer the position of the detection bounding boxes from one camera to another while maintaining the volume that a person occupies, we approximate the location and volume of a person with a cylinder, instead of using only the projected plane generated from the detected bounding box. Representing people as cylinders has been used before in the state of the art [16], but as a method for people counting (estimation) from a single camera perspective. The objective of the developed technique is to transfer the bounding boxes of the detections from one camera to the viewpoint of another camera. As the projections on the common plane of the detected bounding boxes do not correspond spatially with the position and volume of the detected object, the transfer between cameras must be corrected.
Figure 2a shows two bounding boxes that will be transferred. In Figure 2b, each projected bounding box base is represented with a continuous blue line, and the cylinder base is represented with a (green) circle. The continuous red line corresponds to the projection of the transferred bounding box base and belongs to the rotated (red) square. An example of the resulting cylinders is shown in Figure 2c. Here, we describe the method applied to each bounding box detected by the camera whose information is transferred. An illustrative sketch of the transfer is given below.
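The following is an illustrative sketch of such a cylinder-based transfer under our own assumptions, not the paper's exact formulation: the base circle radius is derived from the projected bounding box width, the circle is sampled on the ground plane, and the transferred base footprint is the box enclosing the circle's projection in the target view (the person's height would extend this box upward).

```python
# Hedged sketch of transferring a bounding box through a ground-plane
# cylinder. The radius heuristic (half the projected base width) is an
# assumption for illustration.
import numpy as np
import cv2

def transfer_bbox_as_cylinder(bbox, H1_to_ground, H_ground_to_2, n_pts=16):
    x, y, w, h = bbox
    # Center of the cylinder base: the bbox base mid-point on the ground.
    base = np.array([[[x + w / 2.0, y + h]]], np.float32)
    center = cv2.perspectiveTransform(base, H1_to_ground)[0, 0]
    # Ground-plane radius assumed from the projected bbox base corner.
    corner = cv2.perspectiveTransform(
        np.array([[[x, y + h]]], np.float32), H1_to_ground)[0, 0]
    r = np.linalg.norm(center - corner)
    # Sample the cylinder base circle on the ground plane.
    angles = np.linspace(0, 2 * np.pi, n_pts, endpoint=False)
    circle = center + r * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # Project the base circle into the target view; the transferred box
    # encloses the projected circle (height handled separately).
    pts = cv2.perspectiveTransform(circle[None].astype(np.float32),
                                   H_ground_to_2)[0]
    x2, y2 = pts.min(axis=0)
    x2b, y2b = pts.max(axis=0)
    return x2, y2, x2b - x2, y2b - y2  # base footprint bbox in camera 2
```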
4. Correlation Stage
We apply a method to improve detection performance at runtime by adapting the detector configuration (see Figure 1). This proposal is based on the maximization-of-mutual-information strategy, where classifiers are combined under the assumption that their errors are complementary [15]. In our case, the detection model, executed in the different cameras, has been trained using the same content set. The incorrect detections will differ from camera to camera, so the correlation will reinforce the correct detections common to all cameras and penalize the isolated errors of each camera.
We start from a set of N camera frames. Each detector obtains a confidence map in every camera, $C_i$ with $i = 1, \ldots, N$, representing the likelihood of people presence at each spatial location in the frame. Detection candidates are then obtained by thresholding this map. Each detection (i.e., bounding box) is described by its position $(x, y)$ and dimensions $(w, h)$. The sets of detections are transferred to the camera under analysis (i.e., the desired viewpoint), yielding the transferred detections $D_1, \ldots, D_N$. The transferred camera detections are compared to obtain a set of pairwise correlation scores. Firstly, the decision space of each camera output is explored by applying multiple thresholds. Then, these multiple outputs are correlated for each pair of camera detections ($D_i$ and $D_j$) to obtain a correlation map, which measures the output similarity. Finally, the configuration with the highest similarity allows selecting the best detection threshold for each camera output ($\tau_i$ and $\tau_j$, respectively). Up to this point, we have a hypothesis $\tau_{ij}$ obtained for each compared pair of detections ($D_i$ and $D_j$); these hypotheses are combined to obtain a final configuration for each camera threshold ($\tau_1, \ldots, \tau_N$). Such a hypothesis combination is performed as a traditional mixture of experts via weighted voting in the decision fusion stage as follows:
$$\tau_i = \sum_{j=1,\, j \neq i}^{N} w_{ij}\, \tau_{ij},$$
where $w_{ij}$ is the weight for the hypothesis $\tau_{ij}$ achieved by comparing $D_i$ and $D_j$, and $\sum_{j \neq i} w_{ij} = 1$. In this work, we assume no prior knowledge about the cameras' performance, so we consider equal weighting $w_{ij} = 1/(N-1)$.
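A hedged sketch of the pairwise threshold search follows; the similarity measure (mutual overlap of the surviving bounding boxes) and the threshold grid are our own stand-ins for the paper's correlation map, used only to illustrate the mechanism.

```python
# Hedged sketch: for each pair of transferred detection sets D_i, D_j
# (lists of (bbox, confidence) already in the common view), sweep a
# grid of thresholds and keep the pair (tau_i, tau_j) whose surviving
# detections agree most. The similarity measure is an assumption.
import numpy as np

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter + 1e-9)

def correlate_pair(D_i, D_j, grid=np.linspace(0.1, 0.9, 9)):
    best, best_sim = (grid[0], grid[0]), -1.0
    for ti in grid:
        kept_i = [b for b, s in D_i if s >= ti]
        for tj in grid:
            kept_j = [b for b, s in D_j if s >= tj]
            # Count detections of camera i matched in camera j.
            matches = sum(any(iou(a, b) > 0.5 for b in kept_j)
                          for a in kept_i)
            sim = matches / max(len(kept_i) + len(kept_j) - matches, 1)
            if sim > best_sim:
                best_sim, best = sim, (ti, tj)
    return best  # threshold hypotheses (tau_ij, tau_ji)
```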
Figure 3 shows one example of the correlation between two cameras: Camera 1 in Figure 3a and Camera 2 in Figure 3b. Figure 3c shows both camera detections ($D_1$ and $D_2$) in the camera under analysis and the final threshold configuration. Note how the correlation avoids one false positive detection from each camera and detects the person who is occluded in Camera 2 but not in Camera 1.
It is only coherent to carry out the correlation in the common field of view of all cameras, since otherwise disjoint sets would be correlated and the process would not be useful. To locate the common field of view, the ground plane of each camera is transferred to the desired point of view. Visual examples of this process are shown in Figure 4a, in which the plane of each camera is represented with a different color and the common field of view of all the cameras has been darkened to ease its localization. Since the common field of view is defined in the ground plane (the ground floor plane in our scenario), the correlation and evaluation process only takes into account those pedestrians whose projected bounding box base is included in the common field of view (see Section 3 and Figure 2 for more details). A sketch of this intersection test is given below.
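As an illustration, the common field of view can be located by intersecting the transferred ground-plane regions. The following sketch uses shapely polygons and assumes each camera's visible ground region is available as a list of corner points (an assumption for this example, not the paper's exact procedure).

```python
# Hedged sketch: intersect the per-camera ground-plane regions and test
# whether a pedestrian's projected bbox base falls inside the result.
from functools import reduce
from shapely.geometry import Point, Polygon

def common_field_of_view(ground_polys):
    """ground_polys: one list of (x, y) corner points per camera,
    assumed to come from projecting each camera's image border onto
    the ground plane."""
    return reduce(lambda a, b: a.intersection(b),
                  (Polygon(p) for p in ground_polys))

def in_common_fov(base_point, fov):
    """Keep a pedestrian only if its projected bbox base lies inside."""
    return fov.contains(Point(base_point))
```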
6. Conclusions and Future Work
We present a framework to choose the optimal people detector threshold automatically at runtime. The proposal accurately transfers detector results between camera viewpoints and then exploits the correlation among the multiple camera detections transferred to a common camera to determine the best threshold for each camera. The proposed approach works on standard state-of-the-art detector outputs (bounding boxes), so any kind of detector and object model can be considered. The cylinder model may need to be adapted for objects with a very unbalanced length-width aspect ratio (for example, a car or van). This framework allows automatic threshold parametrization without requiring any model (re-)training and is, therefore, completely online.
For future work, more object detectors can be considered. Beyond the detection threshold, other parameters could also be optimized, for example, the position of the bounding box, the scale of the detected objects, or the pose. Furthermore, following [15], multiple different detectors could be applied for each camera and combined simultaneously in order to further improve the results.