Normalized Metadata Generation for Human Retrieval Using Multiple Video Surveillance Cameras

Since surveillance personnel cannot continuously monitor the video streams of a multiple camera-based surveillance system, an efficient technique is needed to help recognize important situations by retrieving the metadata of an object-of-interest. In a multiple camera-based surveillance system, an object detected by one camera appears with a different shape in another, which is a critical issue for wide-range, real-time surveillance systems. In order to address this problem, this paper presents an object retrieval method that extracts the normalized metadata of an object-of-interest from multiple, heterogeneous cameras. The proposed metadata generation algorithm consists of three steps: (i) generation of a three-dimensional (3D) human model; (ii) human object-based automatic scene calibration; and (iii) metadata generation. More specifically, an appropriately-generated 3D human model provides the foot-to-head direction information that is used as the input of the automatic calibration of each camera. The normalized object information is then used, in the form of metadata, to retrieve an object-of-interest in a wide-range, multiple-camera surveillance system. Experimental results show that the 3D human model matches the ground truth and that automatic calibration-based normalization of metadata enables successful retrieval and tracking of a human object in the multiple-camera video surveillance system.


Introduction
Multiple camera-based video surveillance systems are producing a huge amount of data every day. In order to retrieve meaningful information from the large data set, normalized metadata should be extracted to identify and track an object-of-interest acquired by multiple, heterogeneous cameras.
Hampapur et al. proposed a real-time video search system using video parsing, metadata descriptors and the corresponding query mechanism [1]. Yuk et al. proposed an object-based video indexing and retrieval system based on the similarity of object features using motion segmentation [2]. Hu et al. proposed a video retrieval method for semantic-based surveillance by tracking clusters under a hierarchical framework [3]. Hu's retrieval method works with various queries, such as keyword-based, multiple-object and sketch-based queries. Le et al. combined recognized video contents with visual words for surveillance video indexing and retrieval [4]. Ma et al. presented a multiple-trajectory indexing and retrieval system using multilinear algebraic structures in a reduced-dimensional space [5]. Choe et al. proposed a robust retrieval and fast searching method based on a spatio-temporal graph, sub-graph indexing and a Hadoop implementation [6]. Thornton et al. extended an existing indexing algorithm to crowded scenes using face-level information [7]. Ge et al. detected and tracked multiple pedestrians using sociological models to generate trajectory data for video feature indexing [8]. Yun et al. presented a visual surveillance briefing system based on event features, such as object appearances and motion patterns [9]. Geronimo et al. proposed an unsupervised video retrieval system by detecting pedestrian features in various scenes based on human action and appearance [10]. Lai et al. retrieved a desired object using the trajectory and appearance in the input video [11]. The common challenge of existing video indexing and retrieval methods is to summarize infrequent events from a large dataset generated by multiple, heterogeneous cameras. Furthermore, the lack of normalized object information during the search prevents the system from accurately identifying the same object acquired from different views.
In order to solve the common problems of existing video retrieval methods, this paper presents a normalized metadata generation method from a very wide-range surveillance system to retrieve an object-of-interest. For automatic scene calibration, a three-dimensional (3D) human model is first generated using multiple ellipsoids. Foot-to-head information from the 3D model is used to estimate the internal and external parameters of the camera. Normalized metadata of the object are generated using the camera parameters of multiple cameras. As a result, the proposed method needs neither a special calibration pattern nor a priori depth measurement. The stored metadata can be retrieved using a query, such as size, color, aspect ratio, moving speed and direction.
This paper is organized as follows. Section 2 describes the 3D human model using multiple ellipsoids. A human model-based automatic calibration algorithm and the corresponding metadata retrieval method are respectively presented in Sections 3 and 4. Section 5 summarizes the experimental results, and Section 6 concludes the paper.

Modeling Human Body Using Three Ellipsoids
A multiple camera-based surveillance system must be able to retrieve the same object in different scenes using an appropriate query. However, non-normalized object information results in retrieval errors. In order to normalize the object information, we estimate the camera parameters using automatic scene calibration and then compute the projection matrix from the estimated parameters. After calibration, an object in the two-dimensional (2D) image is projected into the 3D world coordinate system using the projection matrix. Existing camera calibration methods commonly use a special calibration pattern [12]: feature points are extracted from a planar pattern board, and the camera parameters are estimated using a closed-form solution. However, pattern-based calibration is impractical for a multiple-camera system, since manually calibrating many cameras at the same time is laborious and inaccurate. In order to solve this problem, we present a multiple ellipsoid-based 3D human model that exploits the perspective properties of 2D images; the block diagram of the proposed method is shown in Figure 1. Let X_f = [X_f Y_f 1]^T be the foot position on the ground plane and x_f = [x_f y_f 1]^T the corresponding foot position in the image plane, both in homogeneous coordinates. Given x_f, X_f can be computed using the homography as X_f = H^-1 x_f (Equation (1)), where H = [p_1 p_2 p_3]^T is the 3 x 3 homography matrix, and p_i, for i = 1, 2, 3, are the first three columns of the 3 x 4 projection matrix P that is computed by estimating the camera parameters. We then generate a human model of height h at the foot position using three ellipsoids, representing the head Q_h, torso Q_t and leg Q_l, in the 3D world coordinate system. The 4 x 4 ellipsoid matrix Q_k, k ∈ {h, t, l}, is defined by its radii and center [13].
Here, R_X, R_Y and R_Z respectively represent the radii of an ellipsoid along the X, Y and Z axes, and [X_c Y_c Z_c]^T its center. To fit the model to real humans, we set the average heights of children, juveniles and adults to 100 cm, 140 cm and 180 cm, respectively. The head-to-torso-to-leg ratio is set to 2:4:4. Each ellipsoid is back-projected to match a real object in the 2D space. The back-projected 3 x 3 ellipse C_k under the projection matrix P is obtained from the dual-quadric projection C_k* = P Q_k* P^T, where the asterisk denotes the dual (adjoint) matrix and the resulting ellipse C satisfies u^T C u = 0. Figure 2 shows the result of the back-projected multiple ellipsoids at different positions. In each dotted box, the three ellipsoids have the same height. The multiple ellipsoid-based human model is generated according to the position and height of an object seen by multiple cameras. The first step of generating the human model is shape matching in the image. To match the shape, the proposed algorithm detects a moving object region by modeling the background with a Gaussian mixture model (GMM) [14] and then normalizes the detected shape. Since the apparent shape differs with the location and size of the object, the normalized shape is represented by a set of boundary points. More specifically, each boundary point is generated where a radial line from the center of gravity meets the outermost boundary of the object. If the angle between adjacent radial lines is θ, the number of boundary points is N = 360°/θ. The shapes of an object and of the corresponding human models are respectively denoted as B and M_i, where i ∈ {child, juvenile, adult} and N is the number of normalized boundary points. In this work, we experimentally used θ = 5°, which results in N = 72.
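The boundary-point normalization described above can be sketched as follows; `radial_boundary_points` is a hypothetical helper, not the authors' implementation, that samples one outermost silhouette point per θ = 5° sector around the center of gravity:

```python
import numpy as np

def radial_boundary_points(mask, theta_deg=5):
    """Sample the object silhouette with one boundary point per radial
    direction from the centroid, as in the shape-normalization step."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()              # center of gravity
    n = 360 // theta_deg                       # N = 360 / theta
    points = np.zeros((n, 2))
    # angle and distance of every foreground pixel relative to the centroid
    ang = (np.degrees(np.arctan2(ys - cy, xs - cx)) + 360) % 360
    dist = np.hypot(xs - cx, ys - cy)
    for i in range(n):
        sel = (ang >= i * theta_deg) & (ang < (i + 1) * theta_deg)
        if sel.any():
            j = np.argmax(dist[sel])           # outermost pixel in this sector
            points[i] = [xs[sel][j], ys[sel][j]]
        else:
            points[i] = [cx, cy]               # empty sector falls back to center
    return points
```

With θ = 5°, the function returns the N = 72 boundary points used for model matching.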
The matching error e_i between B and M_i is computed from the distances between corresponding boundary points. We select the ellipsoid-based human model with the minimum matching error e_i among the three human models: child, juvenile and adult. If the minimum matching error is greater than a threshold T_e, the object is classified as non-human. If T_e is too large, non-human objects are misclassified as human; on the other hand, a very small T_e makes human detection fail. For that reason, we chose T_e = 8, which gave the experimentally best human detection performance. The shape matching results of the ellipsoid-based human model appropriately fit real objects, as shown in Figure 3, where moving pedestrians are detected and fitted by the ellipsoid-based human model. The ellipsoid-based fitting fails when a moving object is erroneously detected; however, the remaining correct fitting results compensate for the occasional failure.
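The model-selection rule can be illustrated as below. The exact form of the matching error is not reproduced in the text, so a mean Euclidean distance between corresponding boundary points is assumed here; `classify` and the model names are illustrative:

```python
import numpy as np

def matching_error(B, M):
    """Assumed error form: mean point-to-point distance between the object
    shape B and a model shape M, both given as N corresponding points."""
    return np.mean(np.linalg.norm(B - M, axis=1))

def classify(B, models, T_e=8.0):
    """Pick the model (child/juvenile/adult) with the minimum error e_i;
    reject the object as non-human when the best error exceeds T_e."""
    errors = {name: matching_error(B, M) for name, M in models.items()}
    best = min(errors, key=errors.get)
    return best if errors[best] <= T_e else "nonhuman"
```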

Human Model-Based Automatic Scene Calibration
Cameras with different internal and external parameters produce different sizes and velocities in the 2D image plane for the same object in the 3D space. In order to identify the same object in a multiple camera-based surveillance system, detection and tracking should be performed in the 3D world coordinate that is not affected by camera parameters. Normalized physical information of an object can be extracted in two steps: (i) automatic scene calibration to estimate the projective matrix of a camera [15][16][17]; and (ii) projection of the object into the world coordinate using the projective matrix. The proposed automatic calibration algorithm assumes that the foot-to-head line of a human object is orthogonal to the xy plane and parallel to the z-axis in the world coordinate.
The proposed human model-based automatic scene calibration consists of three steps: (i) extraction of foot and head candidate data to compute foot-to-head homology; (ii) homology estimation using foot-to-head inlier data; and (iii) camera calibration by estimating vanishing points and lines using the foot-to-head homology.

Foot-To-Head Homology
In Euclidean geometry, two parallel lines never meet. In projective geometry, on the other hand, two parallel lines meet at a point called the vanishing point, and a line connecting two vanishing points is called the vanishing line, as shown in Figure 4. Existing single image-based methods to estimate vanishing points and lines often fail if there are no line components in the background image [18,19]. In order to overcome this limitation of background generation-based methods, foreground object-based vanishing point detection methods were recently proposed [15][16][17]. Since a general surveillance camera is installed above the ground and looks down at objects, the foot-to-head lines of a standing person at various positions on the ground, which is equivalent to the XY plane in the world coordinate system, converge to a single point below the ground plane, as shown in Figure 5. The vanishing line and point are used to estimate the camera projection matrix. More specifically, let X̃ = [X Y Z 1]^T be a point in the homogeneous world coordinate system; its projective transformation is x̃ = PX̃, where P is the projection matrix. Given x̃ = [x̄ ȳ z̄]^T, the corresponding point in the image plane is determined as x = x̄/z̄ and y = ȳ/z̄. Since we assume that the XY plane is the ground plane, a foot position in the world coordinate system is X_f = [X Y 0]^T, and its image is related to it by a 3 x 3 homography, x̃_f = H_f X̃_f. In the same manner, moving the XY plane up to the head plane, we have x̃_h = H_h X̃_h, where both H_f and H_h are 3 x 3 matrices. Since a head position lies directly above the corresponding foot position, X̃_f = X̃_h, so the head position in the image plane can be determined from the foot position as x̃_h = H_h H_f^-1 x̃_f. H = H_h H_f^-1 is defined as the foot-to-head homology, and it can be determined by computing the projection matrix P using the vanishing point, the vanishing line and the object height Z.

Automatic Scene Calibration
The automatic scene calibration process consists of three steps: (i) extraction of foot and head inlier data; (ii) estimation of the foot-to-head homology using the extracted inlier data; and (iii) detection of the vanishing line and points. In the first step, a human object is detected using the Gaussian mixture model. The detected object region goes through a morphological operation for noise-free labeling [20]. Foot and head inlier candidates of the labeled object are selected under two conditions: (i) the foot-to-head line should lie inside a finite region with respect to the y-axis; and (ii) the foot-to-head line should be the major axis of the ellipse that approximates the human object.
In order to obtain the angle and the major and minor axes of the labeled human object, ellipse fitting is performed. More specifically, the object shape is defined by its external boundary as S = {s_1, ..., s_N}, where s_i = [x_i y_i]^T, for i = 1, ..., N, represents the i-th boundary point and N the total number of boundary points. Using the second-order central moments [21], with μ_pq = (1/N) Σ_i (x_i - x̄)^p (y_i - ȳ)^q and [x̄ ȳ]^T the centroid of S, the angle of the shape S is computed as θ = (1/2) arctan(2μ_11 / (μ_20 - μ_02)). In order to compute the major and minor axes of the ellipse, we first define the minimum and maximum inertial moments respectively as I_min = (μ_20 + μ_02 - sqrt(4μ_11^2 + (μ_20 - μ_02)^2)) / 2 and I_max = (μ_20 + μ_02 + sqrt(4μ_11^2 + (μ_20 - μ_02)^2)) / 2. The major and minor axes, A_l and A_s, are determined using I_min and I_max. The aspect ratio of the object is defined as r = A_l/A_s, and a candidate foot-and-head vector is defined as c = [x_f y_f x_h y_h]^T, computed using θ together with y_max and y_min, the maximum and minimum of y_i, for i = 1, ..., N. The set of inlier candidates C = [c_1 c_2 ... c_L]^T is generated from the c_i's that satisfy four conditions: (i) r_1 < r < r_2; (ii) θ_1 < θ < θ_2; (iii) there exist s_i whose distance from (x_f, y_f) is smaller than d_1, and s_j whose distance from (x_h, y_h) is smaller than d_1; and (iv) there are no pairs of c_i's whose distance is smaller than d_2. In the first condition, r_1 = 2 and r_2 = 5 are used, and in the second condition, θ_1 = 80° and θ_2 = 100° gave the experimentally best result. In the third and fourth conditions, d_1 = 3 and d_2 = 10 are respectively used.
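The moment-based ellipse fit can be sketched as follows. This is a sketch using the classical principal-moment formulas, not the authors' code; the returned semi-axis lengths are the two-sigma extents along the principal directions:

```python
import numpy as np

def ellipse_from_moments(points):
    """Fit an ellipse to a boundary-point set via second-order central
    moments: orientation from the 2*mu11 / (mu20 - mu02) relation, axes
    from the principal inertial moments I_max and I_min."""
    x, y = points[:, 0], points[:, 1]
    mx, my = x.mean(), y.mean()                       # centroid
    mu20 = np.mean((x - mx) ** 2)
    mu02 = np.mean((y - my) ** 2)
    mu11 = np.mean((x - mx) * (y - my))
    theta = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)   # ellipse angle
    common = np.sqrt(4 * mu11 ** 2 + (mu20 - mu02) ** 2)
    i_max = (mu20 + mu02 + common) / 2                # principal moments
    i_min = (mu20 + mu02 - common) / 2
    a_l = 2 * np.sqrt(i_max)                          # major semi-axis (2-sigma)
    a_s = 2 * np.sqrt(i_min)                          # minor semi-axis (2-sigma)
    return theta, a_l, a_s
```

The ratio a_l/a_s directly gives the aspect ratio r used in the candidate conditions.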
Since the inlier candidate set C still contains outliers, a direct computation of the foot-to-head homology H results in a significant error. To solve this problem, we remove the outliers in C using the robust random sample consensus (RANSAC) algorithm [22]. H can be determined from four inlier data since its degree of freedom is eight. Let a = [h_11 h_12 h_13 h_21 h_22 h_23 h_31 h_32]^T be the vector of the first, row-ordered eight components of H; then, a can be determined by solving the linear system of Equation (14). Since Equation (14) generates two linear equations per candidate vector, four candidate vectors determine H. In order to check how many inlier data support the estimated H, the head position of each candidate vector is predicted from the corresponding foot position using H. The predicted head position is compared with the measured head position, and the candidate vector is considered to support H if the error is sufficiently small. This process is repeated a given number of times, and the candidate vectors that support the optimal H become the inliers. The inliers again generate Equation (14). Since many inliers generally produce more than eight equations, the vector a, which is equivalent to the matrix H, is finally determined using the pseudo-inverse. Although outliers can be generated by occlusion, grouping and non-human objects, correct inlier data can still be estimated as the process repeats and candidate data are accumulated.
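The RANSAC loop above can be sketched as follows; consistent with the eight-element vector a, the parameterization fixes h_33 = 1 so that each foot/head pair contributes two linear equations in eight unknowns. The iteration count and pixel tolerance below are illustrative, not the paper's values:

```python
import numpy as np

def fit_homology(candidates):
    """Least-squares homology from (x_f, y_f, x_h, y_h) rows: each pair
    contributes two equations; h_33 is fixed to 1, leaving 8 unknowns."""
    A, b = [], []
    for xf, yf, xh, yh in candidates:
        A.append([xf, yf, 1, 0, 0, 0, -xf * xh, -yf * xh]); b.append(xh)
        A.append([0, 0, 0, xf, yf, 1, -xf * yh, -yf * yh]); b.append(yh)
    a, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)
    return np.append(a, 1.0).reshape(3, 3)

def ransac_homology(candidates, iters=200, tol=3.0, rng=None):
    """RANSAC over foot-to-head candidates: sample 4 pairs, fit H, count
    the pairs whose predicted head lands within tol pixels, then refit
    on the best inlier set (the pseudo-inverse step)."""
    rng = rng or np.random.default_rng(0)
    c = np.asarray(candidates, float)
    best_H, best_inliers = None, []
    for _ in range(iters):
        H = fit_homology(c[rng.choice(len(c), 4, replace=False)])
        feet = np.c_[c[:, :2], np.ones(len(c))]
        heads = (H @ feet.T).T
        heads = heads[:, :2] / heads[:, 2:3]           # dehomogenize
        err = np.linalg.norm(heads - c[:, 2:4], axis=1)
        inliers = np.nonzero(err < tol)[0]
        if len(inliers) > len(best_inliers):
            best_H, best_inliers = fit_homology(c[inliers]), inliers
    return best_H, best_inliers
```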
Given the estimated foot-to-head homology H, two arbitrarily chosen foot positions generate the two corresponding head positions, and the two lines connecting the foot-and-head pairs meet at the vanishing point. More specifically, a line can be represented by a vector l = [a b c]^T satisfying the linear equation ax + by + c = 0, where the line coefficients {a, b, c} are determined from two points p = [p_x p_y]^T and q = [q_x q_y]^T as the cross product of the corresponding homogeneous points, l = [p_x p_y 1]^T x [q_x q_y 1]^T. If two lines l_1 and l_2 meet at the vanishing point V_0, the relationship l_1^T V_0 = l_2^T V_0 = 0 is satisfied; equivalently, V_0 = l_1 x l_2 in homogeneous coordinates. In order to determine the vanishing line, three candidate vectors {c_1, c_2, c_3} are needed. The line connecting the foot positions of c_1 and c_2 and the line connecting their head positions meet at a point, say r = [r_x r_y]^T. Likewise, another point s = [s_x s_y]^T is determined using c_2 and c_3. The line connecting the two points r and s is the vanishing line V_L. Given V_0 and V_L, the camera parameters can be estimated as shown in Figure 6.
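In homogeneous coordinates, both the line through two points and the intersection of two lines reduce to cross products, which keeps the vanishing point and vanishing line computations compact:

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line l = [a b c] with ax + by + c = 0 through points
    p and q, obtained as the cross product of the homogeneous points."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def intersection(l1, l2):
    """Intersection of two homogeneous lines (e.g. the vanishing point of
    two foot-to-head lines), again via the cross product."""
    v = np.cross(l1, l2)
    return v[:2] / v[2]          # back to inhomogeneous coordinates
```

The same two helpers yield V_0 (from two foot-to-head lines) and the points r and s on the vanishing line V_L (from foot-foot and head-head lines).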

Camera Parameter Estimation
Internal parameters include the focal length f, the principal point [c_x c_y]^T and the aspect ratio a. Assuming that the principal point is the image center, a = 1 and there is no skew, the simplified internal parameter matrix is K = [f 0 c_x; 0 f c_y; 0 0 1]. External parameters include the panning angle α, the tilting angle θ, the rolling angle ρ, the camera height along the z-axis and the translations in the x and y directions. Assuming that α = 0 and x = y = 0, the camera projection matrix is obtained as the product of the internal and external parameter matrices, where ω = K^-T K^-1 represents the image of the absolute conic (IAC). Substituting Equation (18) into Equation (20) yields the result of [23] that the horizontal vanishing line can be determined by the vertical vanishing point and the focal length, and that the rotation parameters can be computed from v_x, v_y and f [8], with a = 1.
The proposed algorithm computes f, ρ and θ by estimating the vanishing line and point using Equations (21) and (22). The camera height h_c can then be computed from the real height h_w of a reference object in the world coordinate system, the vanishing line v_L and the vanishing point v_0, using the distances between them and the foot and head positions p_f and p_h of the i-th object, where d(a, b) denotes the distance between points a and b. In the experiment, h_w = 180 cm is used as the reference height.

Indexing of Object Characteristics
After object-based multiple camera calibration, the metadata of an object should be extracted so that normalized object indexing can answer a given query. In this work, the queries on an object consist of a representative color in the HSV color space, the horizontal and vertical sizes in meters, the moving speed in meters per second, the aspect ratio and the moving trajectory.

Extraction of Representative Color
The color temperature of an object may change when a different camera is used. In order to minimize the color variation problem, the proposed work performs color constancy as a pre-processing step to compensate for the white balance of the extracted representative color.

Color Constancy
If we assume that an object is illuminated by a single light source, the color of the light source is estimated by integrating, over the visible wavelength spectrum w covering the red, green and blue bands, the product of e(λ), the spectral power of the light source, s(λ), the reflection ratio of the surface, and c(λ) = [R(λ) G(λ) B(λ)]^T, the camera sensitivity function. The proposed color compensation method is based on the shades-of-gray method [24,25]. The input image is down-sampled to reduce the computational complexity, and simple low-pass filtering is performed to reduce the noise effect. The modified Minkowski-norm-based estimate of the light source color, which takes local correlation into account, is (∫ (f_σ(x))^p dx / ∫ dx)^(1/p) = k e, where f(x) represents the image defined on x = [x y]^T, f_σ = f * G_σ the image filtered by the Gaussian filter G_σ and p the parameter of the Minkowski norm. A small p distributes the weights uniformly among the measured values, and vice versa. An appropriate choice of p prevents the light source estimate from being biased toward a specific color channel. In the experiment, p = 6 gave the experimentally best results for multiple-camera color compensation. As a result, the scaling parameters {w_R, w_G, w_B} are determined from the estimated color of the light source, and the corrected color is obtained by scaling each channel by its parameter. Figure 7 shows the results of color correction using three different cameras. Color correction also minimizes inter-frame color distortion, since it estimates a normalized light source.
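A minimal shades-of-gray sketch is given below; the Gaussian prefiltering and down-sampling steps are omitted for brevity, and the normalization target (an achromatic light source) follows the standard formulation of the method, not necessarily the authors' exact scaling:

```python
import numpy as np

def shades_of_gray(image, p=6):
    """Estimate the light-source color with a per-channel Minkowski p-norm
    (p = 6 as in the text) and rescale the channels so that the estimated
    illuminant becomes achromatic (gray).
    image: float RGB array in [0, 1] with shape (H, W, 3)."""
    illum = np.power(np.mean(np.power(image, p), axis=(0, 1)), 1.0 / p)
    illum /= np.linalg.norm(illum)            # direction of the light color
    scale = (1.0 / np.sqrt(3.0)) / illum      # scaling parameters w_R, w_G, w_B
    return np.clip(image * scale, 0.0, 1.0)
```

With p = 1 this reduces to the gray-world assumption; larger p weights bright measurements more heavily.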

Representative Color Extraction
The proposed color extraction method uses the K-means clustering algorithm. An input RGB image is first transformed into the HSV color space to minimize the inter-channel correlation. Let j_n = [H_n S_n V_n]^T be the HSV color vector of the n-th pixel, for n = 1, ..., N, where N is the total number of pixels in the image. Initial K pixels are arbitrarily chosen to form a set of mean vectors {g_1, ..., g_K}, where g_i, for i = 1, ..., K, represents a selected HSV color vector. Each color vector j_n is assigned the label J_i of the closest mean vector g_i. Each mean vector g_i is then updated as the mean of the j_n's in cluster J_i, and the entire process repeats until the g_i no longer change. Figure 8 shows the results of K-means clustering in the RGB and HSV color spaces with K = 3. The fundamental problem of the K-means clustering algorithm is its dependency on the initial set of clusters, as shown in Figure 9. Since a single run of K-means clustering cannot guarantee extraction of the representative colors, each frame generates candidate colors while the object is tracked, and only the top 25% of the sorted candidate colors are finally selected. As a result, the representative colors of the object are correctly extracted even in the presence of a few errors. Figure 10 shows objects with extracted representative colors.
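The clustering step can be sketched as plain K-means over HSV pixel vectors. A simple deterministic initialization is used here for reproducibility, whereas the text copes with initialization sensitivity by accumulating candidate colors over the tracked frames:

```python
import numpy as np

def kmeans_colors(pixels, k=3, iters=50):
    """Plain K-means: assign each pixel to its nearest mean vector, update
    each mean as the cluster average, and stop when the means converge.
    pixels: (N, 3) array of HSV vectors."""
    idx = np.linspace(0, len(pixels) - 1, k).astype(int)
    means = pixels[idx].astype(float)                 # deterministic init
    for _ in range(iters):
        d = np.linalg.norm(pixels[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)                     # nearest-mean assignment
        new = np.array([pixels[labels == i].mean(axis=0) if np.any(labels == i)
                        else means[i] for i in range(k)])
        if np.allclose(new, means):                   # converged
            break
        means = new
    return means, labels
```

In the proposed system, the resulting means from each frame are pooled as candidates, and the top 25% of the sorted candidates become the representative colors.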

Non-Color Metadata: Size, Speed, Aspect Ratio and Trajectory
When multiple cameras are used in a video surveillance system, object size and speed are differently measured by different cameras. In order to extract the normalized metadata of an object, physical object information should be extracted in the world coordinate using accurately-estimated camera parameters.

Normalized Object Size and Speed
We can compute the physical object height in meters given the projection matrix P and the foot and head coordinates in the image plane. In order to extract the physical information of an object in the world coordinate system, the foot position on the ground plane, X̃_f = H^-1 x̃_f, is first computed using Equation (1). The y coordinate of the head in the image plane is then given by Equation (29), y = (P_2,1 X + P_2,2 Y + P_2,3 H_o + P_2,4) / (P_3,1 X + P_3,2 Y + P_3,3 H_o + P_3,4), where P_i,j represents the (i, j)-th element of the projection matrix P and H_o the object height. Using Equation (29), H_o can be computed from y as H_o = ((P_2,1 - P_3,1 y)X + (P_2,2 - P_3,2 y)Y + (P_2,4 - P_3,4 y)) / (P_3,3 y - P_2,3). The width of an object, W_o, is computed from the foot position X_o in the world coordinate system, the world position X'_o corresponding to the one-pixel-shifted foot position in the image plane, and the object width W_i in the image plane. Figure 11 shows the results of normalized object size estimation. As shown in the figure, the estimated object height does not change while the object moves around.
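The height relation can be checked numerically: project a world point of known height to obtain the image-plane y coordinate, then solve the same ratio of projection-matrix rows back for the height. The 1-indexed rows in the text correspond to 0-indexed rows here, and the projection matrix in the test is a made-up example:

```python
import numpy as np

def object_height(P, X, Y, y):
    """Solve y = (row2 . [X Y H 1]) / (row3 . [X Y H 1]) for the object
    height H, given the world foot position (X, Y) on the ground plane and
    the head's image-plane y coordinate. Rows of the 3x4 matrix P are
    0-indexed (row index 1 = second row)."""
    num = ((P[1, 0] - P[2, 0] * y) * X
           + (P[1, 1] - P[2, 1] * y) * Y
           + (P[1, 3] - P[2, 3] * y))
    den = P[2, 2] * y - P[1, 2]
    return num / den
```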
Figure 11. Size estimation results of the same object that is (a) far from the camera; (b) close to the camera.
The object speed S_o can be computed as S_o = sqrt((X_o^t - X_o^(t-1))^2 + (Y_o^t - Y_o^(t-1))^2), where (X_o^t, Y_o^t) represents the object position in the world coordinate system at the t-th frame and (X_o^(t-1), Y_o^(t-1)) the object position one second earlier. However, the direct estimation of S_o from the object foot position is not robust because of object detection errors. To solve this problem, the Kalman filter is used to compensate for the speed estimation error. Figure 12 shows the result of the object speed estimation with and without the Kalman filter.
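The Kalman-filtered speed estimate can be sketched with a constant-velocity filter run independently per axis; the process and measurement noise values below are illustrative, not the paper's settings:

```python
import numpy as np

def smoothed_speed(positions, dt=1.0, q=0.01, r=1.0):
    """Speed estimates from noisy world-coordinate foot positions using a
    constant-velocity Kalman filter per axis (state = [position, velocity];
    q: process noise, r: measurement noise)."""
    F = np.array([[1.0, dt], [0.0, 1.0]])            # state transition
    H = np.array([[1.0, 0.0]])                       # we only observe position
    Q = q * np.eye(2)
    R = np.array([[r]])
    states = [np.array([positions[0][a], 0.0]) for a in range(2)]
    covs = [np.eye(2) for _ in range(2)]
    speeds = []
    for z in positions[1:]:
        vel = []
        for a in range(2):
            x, P = F @ states[a], F @ covs[a] @ F.T + Q      # predict
            K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)     # Kalman gain
            x = x + K @ (np.array([z[a]]) - H @ x)           # correct
            P = (np.eye(2) - K @ H) @ P
            states[a], covs[a] = x, P
            vel.append(x[1])
        speeds.append(float(np.hypot(vel[0], vel[1])))       # speed magnitude
    return speeds
```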

Aspect Ratio and Trajectory
The aspect ratio of an object is simply computed as the ratio of H_i to W_i, where H_i and W_i respectively represent the object height and width in the image plane. Instead of saving the entire trajectory of an object, the proposed system represents the object's motion using four positions sampled along the path, and these four positions define the object trajectory.

Unified Model of Metadata
Five types of metadata described in Sections 4.1 and 4.2 should be unified into a single data model to be saved in the database. Since object data are extracted at each frame, median values of size, aspect ratio and speed data are saved at the frame right before the object disappears. Three representative colors are also extracted using the K-means clustering algorithm with the previously-selected set of colors. The object metadata model, including object features, serial number and frame information, is shown in Table 1. As shown in the table, duration, moving distance and area size are used to sort various objects. For the future extension, minimum and maximum values of object features are also saved in the metadata.
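The unified record can be illustrated with a hypothetical data model; the field names below are assumptions, since the text only lists the feature categories of Table 1:

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Tuple

@dataclass
class ObjectMetadata:
    """Illustrative record mirroring Table 1: per-object medians plus
    bookkeeping, stored at the frame right before the object disappears."""
    serial: int
    first_frame: int
    last_frame: int
    height_cm: float                               # median over the track
    width_cm: float
    speed_mps: float
    aspect_ratio: float
    colors: List[Tuple[float, float, float]]       # 3 representative HSV colors
    trajectory: List[Tuple[float, float]]          # 4 sampled world positions

def finalize(serial, frames, heights, widths, speeds, ratios, colors, traj):
    """Collapse the per-frame measurements into one metadata row using
    median values, as described in the text."""
    return ObjectMetadata(serial, frames[0], frames[-1],
                          median(heights), median(widths),
                          median(speeds), median(ratios), colors, traj)
```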

Experimental Results
This section summarizes the experimental results of the proposed object-based automatic scene calibration and metadata generation algorithms. To evaluate the performance of the scene calibration algorithm, Table 2 summarizes the variation of object mean values captured in seven different scenes. The experiment extracts the normalized physical information of a human object with a height of 175 cm in various scenes. As shown in Table 2, the camera parameters were estimated and corrected at each scene. Object A appears 67 times, and the object height is estimated at every appearance.

Figure 13 shows that the average estimated object height is 182.7 cm with a standard deviation of 9.5 cm. Since the real height is 175 cm, an estimation error of about 7.5 cm remains, because the reference height h_w was set to 180 cm. This result reveals that the proposed calibration algorithm is better suited to estimating relative height than absolute height.

Figure 14 shows the experimental results of searching for an object using the color query, including red, green, blue, yellow, orange, purple, pink, brown, white, gray and black. Table 3 summarizes the classification performance using the object color. The rightmost column lists the total number of objects, with the correctly classified ones in parentheses. The experiment correctly classifies 96.7% of the objects on average.

Figure 15 shows eight test videos with estimated camera parameters, and Figure 16 shows the camera calibration results of the eight test videos on the virtual ground plane with ellipsoids of a height of 180 cm.

Table 3. Result of the classification based on the color.

Figure 15. Test video files with estimated camera parameters: (a,b) two images of the first scene captured with two different camera parameter sets; (c,d) two images of the second scene; (e,f) two images of the third scene; (g,h) two images of the fourth scene.
Figure 16. Result of camera calibration on the virtual three-dimensional grid for: (a,b) two images of the first scene captured with two different camera parameter sets; (c,d) two images of the second scene; (e,f) two images of the third scene; (g,h) two images of the fourth scene.

Figure 17 shows the experimental results of the object search using the size query, including children (small), juveniles (medium) and adults (large). Figure 17a shows that the proposed algorithm successfully found children smaller than 110 cm, and Figure 17b,c shows similar results for a juvenile and an adult, respectively. Table 4 summarizes the classification performance using the object size. The rightmost column lists the total number of objects, with the correctly classified ones in parentheses. The experiment correctly classifies 95.4% of the objects on average.

Figure 18 shows the experimental results of the object search using the aspect ratio. The horizontal query is used to find vehicles; the normal query, motorcycles and groups of people; and the vertical query, a single human object. Table 5 summarizes the classification performance using the aspect ratio. The rightmost column lists the total number of objects, with the correctly classified ones in parentheses. The experiment correctly classifies 96.9% of the objects on average.

Figure 19 shows the experimental results of the object search using the speed queries, including slow, normal and fast. Table 6 summarizes the search results using the object speed together with the classification performance. As shown in Table 6, more than 95% of the objects are correctly classified. Tables 3-6 demonstrate the accuracy and reliability of the proposed algorithm. More specifically, the color-based search shows relatively high accuracy over various search options.
For that reason, the object color can be the most important feature for object identification. Figure 20 shows the experimental results of the object search using user-defined boundaries to detect a moving direction. Figure 22 shows the processing time of the proposed algorithm. To measure the processing time, a personal computer with a 3.6-GHz quad-core CPU and 8 GB of memory was used. As shown in Figure 22, processing a frame takes 20-45 ms, and the average processing speed is 39 frames per second (FPS).

Conclusions
This paper presented a multiple camera-based wide-range surveillance system that can efficiently retrieve objects-of-interest by extracting the normalized metadata of objects acquired by multiple, heterogeneous cameras. In order to retrieve a desired video clip from a huge amount of recorded video data, the proposed system allows a user to query various features, including the size, color, aspect ratio, moving speed and direction. The first step of the algorithm is auto-calibration to extract normalized physical data. The proposed auto-calibration algorithm can estimate both the internal and external parameters of a camera without using a special pattern or depth information. Image data acquired by an appropriately-calibrated camera provide normalized object information. In the metadata generation step, a color constancy algorithm is first applied to the input image as preprocessing. After a set of representative colors is extracted using K-means clustering, the physical size and speed of an object-of-interest are estimated in the world coordinate system using the camera parameters. The metadata of the object are then generated using the aspect ratio and motion trajectories. As a result, an object-of-interest can be efficiently retrieved using a query that combines physical information from big video data recorded by multiple, heterogeneous cameras. Experimental results demonstrated that the proposed system successfully extracts the metadata of the object-of-interest using the three-dimensional (3D) human modeling and auto-calibration steps. The proposed method can be applied to a posteriori video analysis and retrieval systems, such as vision-based central control systems and surveillance systems.