1. Introduction
Detection and tracking of moving targets in cluttered urban environments is an important task for local law enforcement and security forces. Locating and tracking a single target in a large cluttered scene is difficult due to several factors related to the target and its surrounding environment. For example, it is sometimes difficult to differentiate between targets when using a thermal infrared (IR) sensor, as target signatures are similar and thermal contrast between the target and the background can be low. Radio frequency (RF) communications contain identification information about the transmitting source but lack the ability to spatially localize the target with low uncertainty [1]. The combination of data from two or more sensors, referred to as sensor fusion, exploits the advantages of multiple sensors while overcoming the disadvantages of each individual sensor [2,3].
RF transmissions have been instrumental for decades in the detection and tracking of targets using both active and passive systems. Passive RF systems exploit existing sources of opportunity such as cellular communications or television broadcasts. These systems have been demonstrated in a number of applications such as surveillance [4,5], geolocation [6,7], and motion estimation [8]. A cellular phone is a device that emits an RF signal and has been of interest for surveillance by government agencies, including local law enforcement and the Federal Bureau of Investigation [9]. Cellular phones can be tracked by devices known as “stingrays” that act as a cellular tower and intercept the cellular signal to localize and track a specific target [10]. As cellular phones contain unique identifications, they are excellent sources for identifying targets with high confidence.
Given their generally high angular resolution, and in the case of infrared sensing, night vision capability, electro-optical/infrared (EO/IR) sensors are commonly used to identify and track a variety of targets including pedestrians [11,12,13], vehicles [14], ships [15], and aircraft [16]. Under optimal viewing conditions EO/IR sensors can measure the location of a target with high accuracy and precision, making them an important asset for security systems. Algorithms to accomplish these tasks have been demonstrated using background estimation [17,18,19], edge detection [20], and feature recognition [21]. However, EO/IR sensors cannot see inside most vehicles. Hence, in a crowded traffic environment where a particular cell phone is being used, associating the cell phone emanation with a particular vehicle is an important problem, and it is the problem addressed here.
A number of sensor combinations have been developed to aid in target detection and tracking applications. Noulas et al. fuse audio segments with a video sequence to associate the audio with its corresponding video target [22]. Kilic et al. track speaking targets by fusing likelihoods built from audio and visual data [23]. D’Arca et al. fuse audio and visual sensors to estimate a target’s trajectory using separate Kalman filters that are fused into a single Kalman filter [24]. Chin et al. demonstrate a fusion technique using an optical tracker and Wi-Fi to track a target through obscurations [25]. We explore a fusion algorithm to localize and track a specific vehicle using two new sensor types: a constellation of RF sensors capturing cell phone emanations and a multi-spectral imaging system. To the best of our knowledge the fusion of these sensors is unique in the literature.
We present a novel combination of passive sensor data fusion: a constellation of RF sensors measuring a cellular emanation from a specific phone is combined with a multi-spectral imaging sensor detecting and tracking vehicles in a target-rich environment. In practice, neither signal contains enough information to allow a particular vehicle to be uniquely identified as the source of the cellular emanations. However, by fusing these two sources of data we demonstrate that a specific target can confidently be identified and tracked through a sequence of frames. From the cellular emanation we make use of the frequency difference of arrival (FDOA), also referred to as the Doppler differential (DD), which results from relative motion between the emitter and spatially separated receivers [26,27]. The multi-spectral sensor produces centroid estimates of multiple moving vehicles through a sequence of frames. Constellations of unmanned aerial vehicles (UAVs) have become readily available and have been demonstrated for various applications [28,29,30]; the notional sensor configuration studied here is a UAV scenario with a multi-spectral imaging sensor located at the scene origin at an altitude of 1000 m and RF sensors spaced on the border of the imaging sensor’s field of view, also at an altitude of 1000 m. The sensor-scene geometry is shown in Figure 1, and the sensor coordinates are shown in Figure 2. The geometry of the RF sensors on the border of the scene is chosen to provide diversity in the DD of the received signals; other geometries would work for this application.
The block diagram of the fusion algorithm described here is shown in Figure 3. The multi-spectral video tracker fuses images to detect moving foreground objects, which are then tracked through a video sequence [31], giving a two-dimensional time history of centroid locations for multiple moving targets. Radial velocity estimates are computed from the tracker outputs and used to calculate the theoretical DD that would have been observed at the cell phone frequency for each tracked target. The RF receivers in the constellation each measure incoming cellular emanations, isolate a signal of interest, and extract the Doppler shift. The RF process for an individual sensor is shown in Figure 4. Covering the Doppler shift estimation is outside the scope of this paper, but we note that isolating a single RF signal [32] and estimating its Doppler shift [33] are viable operations in wireless communications. DDs are calculated for all combinations of RF sensors, and the sensor pair corresponding to the maximum DD is used to associate the RF sensors with the multi-spectral tracker output. To associate the multi-spectral image tracker with the RF sensors, the absolute difference and root mean square difference (RMSD) between measured and theoretical DD are calculated, and the video target with the minimum metric is selected as the matching target. We compare the association rate of the sensors to the correct result to evaluate performance.
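The RMSD-based association step can be sketched as follows. This is a minimal illustration, not the authors' implementation; the target labels and DD series (in Hz) are hypothetical:

```python
import numpy as np

def associate(measured_dd, theoretical_dd_by_target):
    """Pick the video target whose theoretical DD time series is closest,
    in root-mean-square difference, to the measured DD series."""
    measured = np.asarray(measured_dd, dtype=float)
    rmsd = {t: float(np.sqrt(np.mean((measured - np.asarray(s, float)) ** 2)))
            for t, s in theoretical_dd_by_target.items()}
    best = min(rmsd, key=rmsd.get)  # minimum metric = matching target
    return best, rmsd

# Hypothetical measured DD and per-target theoretical DD series (Hz).
target, scores = associate(
    [8.0, 5.1, 2.0, 1.1],
    {1: [7.9, 5.0, 2.2, 1.0], 2: [0.1, 0.0, 0.2, 0.1], 3: [3.0, 3.1, 2.9, 3.0]},
)
```

The absolute difference metric used for comparison replaces the RMSD with the mean absolute difference of the same series.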
This algorithm was developed and evaluated using synthetically generated datasets. RF sensor measurements were simulated using the known ground truth radial velocity of the emanating target to generate Doppler shifts, and varying levels of random measurement uncertainty were added. The imaging sensor is multi-spectral, covering the visible, near-infrared (NIR), mid-wave infrared (MWIR), and long-wave infrared (LWIR) bands, which were simulated using the Digital Imaging and Remote Sensing Image Generation (DIRSIG) software [34].
We present results on associating the RF sensors with the corresponding target from the multi-spectral video to localize and track a specific moving target with a cell phone. Using the absolute DD, the algorithm has a high rate of correctly associating the RF emanation with the multi-spectral target at low measurement uncertainty for targets whose motion patterns are dissimilar to those of other targets. The RMSD improves the association rate, particularly for low uncertainty cases, and maintains good performance through significant uncertainty levels. The confidence of identifying the correct multi-spectral target from background targets is high for low measurement uncertainty and remains high for over half of the targets through all uncertainty levels.
The remainder of the paper is organized as follows. Section 2 discusses the extraction of DD from RF sensors measuring a specific cellular emanation. Section 3 details the multi-spectral video tracker and the calculation of the theoretical DD. Section 4 discusses the metrics for associating the cellular emanation measured using the RF sensors with the output of the multi-spectral video tracker to localize a specific moving target. Experimental results for matching the cellular emanation as measured from the RF sensors with the video targets are discussed in Section 5. Conclusions are presented in Section 6.
2. Cell Phone Emanations
In this section we present background on sensing a specific cellular emanation from multiple RF receivers.
We start by reviewing the Doppler shift that occurs in a cellular emanation due to radial motion of the transmitter. A transmitting target has a position $\mathbf{p}(k) = [x(k), y(k), z(k)]^T$ at time instance $k$. RF receivers are located at positions $\mathbf{s}_\ell = [x_\ell, y_\ell, z_\ell]^T$, where $\ell$ is the receiver label. The range $r_\ell(k)$ between the transmitting target and a receiver $\ell$ is given by
$$r_\ell(k) = \left\lVert \mathbf{p}(k) - \mathbf{s}_\ell \right\rVert = \sqrt{(x(k)-x_\ell)^2 + (y(k)-y_\ell)^2 + (z(k)-z_\ell)^2}. \quad (1)$$
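As a minimal sketch of Equation (1), the slant range is the Euclidean norm of the difference between the target and receiver positions; the coordinates below are hypothetical, with the receiver at the 1000 m altitude of the scenario:

```python
import numpy as np

def slant_range(p, s):
    """Range r_l(k) = ||p(k) - s_l|| between target position p and receiver s (metres)."""
    return float(np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(s, dtype=float)))

# Ground target versus a receiver at 1000 m altitude (hypothetical coordinates).
r = slant_range([30.0, 40.0, 0.0], [0.0, 0.0, 1000.0])
```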
Shown in Figure 5 are the ranges between the multi-spectral sensor and the ground target (GT) vehicles as a function of time for the interval covered by the DIRSIG simulation. For the geometry and traffic flow pattern of this scenario the targets enter at the edge of the scene, move towards the intersection at the center, and then proceed towards the edge of the scene.
The derivative of the range with respect to time produces the range-rate $v_\ell$, also known as the radial velocity, given by
$$v_\ell(k) = \frac{r_\ell(k) - r_\ell(k-1)}{T}, \quad (2)$$
where $T$ is the sampling period between range measurements.
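Equation (2) is a backward finite difference of consecutive range samples; a short sketch with hypothetical range values:

```python
import numpy as np

def range_rate(ranges, T):
    """Backward-difference radial velocity v(k) = (r(k) - r(k-1)) / T."""
    r = np.asarray(ranges, dtype=float)
    return np.diff(r) / T  # one fewer sample than the range series

# Hypothetical ranges (m) sampled every 0.5 s; negative rates: target approaching.
v = range_rate([1000.0, 999.5, 999.2], T=0.5)
```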
Figure 6 shows the range-rate for the GT in the scenario presented here. A negative range-rate indicates that the receiver and target are getting closer in range; conversely, a positive range-rate indicates that the pair are moving away from one another. With the sensor altitude $z_\ell$ being large compared to the $x$–$y$ displacement and the motion being confined to the $x$–$y$ plane, the large ranges and low ground velocities result in low range-rates. The similarity of the traffic patterns is due to the geometry of the scene and the traffic scenario. This introduces difficulty in distinguishing between targets, since the Doppler shifts will generally be small, and when vehicles stop, for example at traffic lights, the shift will disappear. By using spatially separated receivers we create diversity in the Doppler shifts and create the opportunity to use the highest Doppler shifts, which are least likely to be associated with non-moving targets.
We now examine the effect of the radial velocity on the frequency $f_c$ of a carrier signal for a cellular emanation. A transmitter with a radial velocity toward a stationary receiver produces an increase in the received frequency, whereas radial velocity away from the receiver produces a decrease. This is the well-known Doppler effect. The shift in frequency $f_{d,\ell}(k)$ is given by
$$f_{d,\ell}(k) = -\frac{v_\ell(k)}{c} f_c, \quad (3)$$
where $c$ is the speed of light. Shown in Figure 7 are the Doppler shifts corresponding to the radial velocities from Figure 6 for a carrier frequency of 1 GHz. For the 1 GHz carrier the Doppler shifts range between −4 and 4 Hz.
Similar motion profiles resulting from using a single sensor make it difficult to distinguish between targets. The use of a constellation of spatially separated receivers produces variation in the Doppler signature for the targets. For the purpose of this study eight RF receivers were placed on the edge of the imaging sensor’s field of view and one in the center; locations are shown in Figure 2. The spatial distribution of the sensors is such that some will be in the direction of travel with decreasing range, resulting in a negative range-rate, while the other sensors will be opposite the direction of travel with increasing range, resulting in a positive range-rate and a Doppler shift of opposite sign.
Figure 8 shows the range between an example transmitting target (GT #1) and the constellation sensor locations in
Figure 2. As indicated, some sensors are decreasing in range while others are increasing. This is better illustrated in
Figure 9 and
Figure 10 where the range-rate and Doppler shift for a 1 GHz carrier are shown, respectively. At the beginning of movement for this example, the Doppler shift for the 1 GHz carrier has a shift near −8 Hz for three sensors and a shift near 0 Hz for three sensors, giving a difference of 8 Hz. That difference decreases as the target approaches the intersection with decreasing speed and eventually comes to a stop around 6 s, resulting in nearly 0 Hz difference in Doppler shift.
The DD $\Delta f_{\ell,m}(k)$ is defined as the difference in Doppler shift between receivers $\ell$ and $m$ and is given by
$$\Delta f_{\ell,m}(k) = f_{d,\ell}(k) - f_{d,m}(k). \quad (4)$$
The DD varies between RF sensor pairs based on their geometry and the radial velocity of the moving target. For example, one target may be stationary and have no DD, whereas another target may be moving such that its radial velocity is nearly equal with respect to both receivers of a pair, which results in a near-zero DD for that pair (but may be non-zero for other sensor pairs). The maximum DD selects the sensors that are positioned orthogonal to the target’s motion. An example of the maximum DD is shown in Figure 11 for GT #1 with the corresponding ground speed. The target starts out with its highest DD when it first enters the scene, decelerates as it moves towards the intersection at the center of the scene, and reaches a DD of 1 Hz. After reaching the intersection the target increases in speed and the DD increases to 8.3 Hz.
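Selecting the maximum-DD sensor pair from Equation (4) amounts to an exhaustive search over receiver pairs; a minimal sketch with hypothetical per-sensor Doppler shifts:

```python
import itertools

def max_dd(doppler_by_sensor):
    """Given per-sensor Doppler shifts (Hz) at one instant, return the
    sensor pair with the largest absolute Doppler differential."""
    best_pair, best_dd = None, 0.0
    for a, b in itertools.combinations(sorted(doppler_by_sensor), 2):
        dd = doppler_by_sensor[a] - doppler_by_sensor[b]  # Equation (4)
        if abs(dd) >= abs(best_dd):
            best_pair, best_dd = (a, b), dd
    return best_pair, best_dd

# Hypothetical shifts for four receivers of the constellation.
pair, dd = max_dd({1: -3.9, 2: -0.1, 3: 4.1, 4: 0.8})
```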
3. Multi-Spectral Video Tracker
We present an overview of the algorithm to fuse multi-spectral video data to detect and track moving targets in a cluttered urban environment [31]. The algorithm was developed and tested using a synthetically generated dataset produced using the DIRSIG toolset in the visible, NIR, MWIR, and LWIR spectral bands [34]. Figure 12 shows an example frame of each spectral band with a frame size of 2000 × 2000 pixels and a ground sample distance of 0.0635 m, resulting in a field of view covering 137 × 137 m². By visual inspection of the frames in Figure 12, the appearance of target vehicles varies between the spectral bands, providing different intensity information. The vehicle motion in the video sequence was simulated as a common traffic pattern using the open source tool Simulation of Urban MObility (SUMO) to provide realistic traffic maneuvers [35].
The intensity of a pixel fluctuates due to noise, changes in illumination, and movement of both clutter and target objects. As a result, a single value cannot represent the time history of the intensity of a pixel for a given sequence of video frames. To compensate for these changes, background modeling techniques are used to describe the probability distribution of each pixel’s intensity by empirically deriving and updating the parameters from the video sequence. The Gaussian mixture model (GMM) has been successfully demonstrated to model the fluctuations in pixel intensities in outdoor scenes for detection of pedestrians [17,18] and vehicles [19,36].
To deal with fluctuating pixel intensities in our video sequence, in each spectral band we use a GMM that adapts to the evolving scene by modeling the time history of intensity at each pixel to determine the foreground pixels. The GMM extracts the foreground pixels by modeling the background distribution of intensity at each pixel with a number of Gaussian distributions; a pixel not fitting these distributions is classified as a foreground pixel. A fused foreground map is created by weighting and summing foreground pixels from all spectral bands. A threshold is applied to the fused foreground map to remove low-weighted pixels, and an image closing operation is performed to create pixel groups labeled as targets.
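The per-pixel background-modeling idea can be illustrated with a deliberately simplified stand-in: a single running Gaussian per pixel rather than the multi-band mixture used by the tracker. The class name, learning rate, and thresholds below are illustrative assumptions, not the authors' parameters:

```python
import numpy as np

class RunningGaussianBackground:
    """Simplified stand-in for the per-pixel GMM: one Gaussian per pixel,
    updated with exponential learning rate alpha. A pixel farther than
    k standard deviations from its mean is flagged as foreground."""

    def __init__(self, first_frame, alpha=0.05, k=2.5, init_var=30.0**2):
        self.mean = np.asarray(first_frame, dtype=float).copy()
        self.var = np.full_like(self.mean, init_var)
        self.alpha, self.k = alpha, k

    def apply(self, frame):
        frame = np.asarray(frame, dtype=float)
        d2 = (frame - self.mean) ** 2
        fg = d2 > (self.k ** 2) * self.var      # foreground mask
        self.mean += self.alpha * (frame - self.mean)  # update background model
        self.var += self.alpha * (d2 - self.var)
        return fg

bg = RunningGaussianBackground(np.full((4, 4), 100.0))
mask = bg.apply(np.full((4, 4), 101.0))         # small change: background
```

In the full algorithm each spectral band maintains its own mixture of Gaussians, and the per-band masks are weighted and summed into the fused foreground map.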
Targets are associated between frames by relating historical track information constructed from prior tracked frames with position estimates in the current frame. Relating track history to current positional data is trivial in scenarios where targets stay separated and no occlusions exist. However, in actual practice and in this data set, targets can merge and appear as a single target, or be occluded by trees, etc., making it difficult to distinguish between targets and maintain the correct association of tracker data and target. Features from the scale-invariant feature transform (SIFT) were selected for identification of targets due to their robustness with respect to changes in rotation and scale, and their invariance to changes in camera viewpoint and illumination [37,38]. SIFT features are composed of a keypoint that has a sub-pixel location estimate and the gradient orientation of the feature, along with a descriptor that is calculated from a histogram of the local pixel texture.
The results from the multi-spectral imaging system are a collection of tracked targets that have correlated centroid measurements in the imaging plane. The root-mean-square error (RMSE) between the true and estimated centroid locations for the video tracked objects is shown in Figure 13. Video targets 8 and 10 have the largest RMSE, where the $y$-error is over 0.3 m and the $x$-error is over 0.15 m. The other targets have a considerably lower RMSE in both $x$ and $y$.
Theoretical DD for the output targets of the multi-spectral tracker are calculated using Equations (1)–(4), where the target positions $x(k)$ and $y(k)$ are ground coordinates from the measured video frames and $z(k)$ is the ground height, which is 0 m for this scenario. The sensor coordinates $\mathbf{s}_\ell$ are the locations of the RF sensors.
5. Experiment
We present the performance of the fusion algorithm for associating a specific cellular emanation detected by a constellation of RF receivers with the corresponding target from the multi-spectral video tracker. An association is classified as correct when the cellular emanation from the RF sensors is matched with the corresponding video target. Performance of the algorithm was evaluated using Doppler measurements from the RF receivers with added measurement uncertainty and centroid locations from the multi-spectral video tracker. Doppler shifts for a cellular frequency were generated from the ground truth radial velocity, with varying RMS levels of white Gaussian noise added to the Doppler shifts to model measurement uncertainty. Centroid location information was extracted from a multi-spectral video set developed with the DIRSIG software tool [34] using the detection and tracking algorithm presented in [31]. The results are the correct association rates over the video sequence for 100 Monte Carlo simulations.
Figure 15 and Figure 16 show the association rates using the absolute difference metric. With no uncertainty in the Doppler measurement, targets 1, 9, and 11 have association rates above 0.96. Target 1 maintains a rate of 0.92 through 1 Hz of RMS uncertainty and decreases gradually to 0.52 at 10 Hz RMS uncertainty. The association rate for target 11 drops to 0.84 with 1 Hz RMS uncertainty and decreases slowly through 10 Hz, where the rate is 0.51. Target 2 has a steeper decrease in rate, dropping to 0.7 with 1 Hz RMS uncertainty and to 0.40 with 10 Hz. The remaining targets have an association rate greater than 0.60 with no uncertainty. Targets 4 and 6–8 have the poorest performance, dropping below 0.5 with 0.2 Hz RMS uncertainty. Overall the results with no Doppler uncertainty are promising, but added uncertainty produces a significant decrease in performance.
Figure 17 and Figure 18 show the association rates using the RMSD. With no added Doppler uncertainty all targets except target 4 start with an association rate above 0.82. Targets 1, 3, 5, 9, and 11 have perfect association rates, with 7 of the 11 targets at 0.92 or greater. Targets 1 and 11 decrease gradually and maintain a rate greater than 0.83 through 10 Hz of RMS uncertainty. Target 5 has an association rate greater than 0.9 through 4 Hz and gradually decreases to 0.59 at 10 Hz. Target 2 has a rate above 0.92 through 0.5 Hz of RMS uncertainty and gradually decreases to below 0.51 at 5 Hz. Targets 6 and 7 drop below 0.62 at 0.4 Hz of RMS uncertainty due to similar Doppler signatures. Target 10 has a rate above 0.77 through 0.7 Hz RMS uncertainty but then decreases significantly; this target has poor performance in the video tracker and a DD signature similar to that of target 2. Target 8 has the most immediate decrease in performance, dropping below 0.44 at 0.3 Hz due to a Doppler signature similar to that of target 4. There is an improvement in performance for some targets as uncertainty is increased, particularly target 4 when increasing from no uncertainty through 0.3 Hz. This is attributed to similar Doppler signatures for a particular sensor configuration. With added uncertainty a different sensor pair produces a higher DD, and that sensor pair proves to have better performance with the multi-spectral video tracker. For example, with no RMS uncertainty target 4 has multiple sensor pairs with nearly identical DD, with a slightly higher DD for Sensor 1 and Sensor 9. As RMS uncertainty is added, the optimal sensor pair begins to switch between different pairs (i.e., Sensor 1 & Sensor 9, Sensor 4 & Sensor 6, Sensor 6 & Sensor 7). These new sensor pairs produce DD for target 2 that are not similar to target 4, reducing the incorrect associations. Overall, these results are improved in comparison to the absolute difference and provide robustness to measurement uncertainty.
The confidence ratio $C$ is a performance measure of confidence for correctly identifying the true video target over the most probable incorrect target after all frame measurements have been made. It is defined as the normalized difference between the number of correct associations $N_c$ and the number of associations for the most frequently detected incorrect target $N_i$, and is given by
$$C = \frac{N_c - N_i}{N_c + N_i}.$$
A positive $C$ indicates that the correct target is detected more often than any incorrect target, where a maximum value of 1.0 indicates that the correct association was made for all frames. A negative $C$ indicates that an incorrect target is detected more often than the correct target, where a value of −1.0 indicates that an incorrect association was made for all frames. This metric quantifies how reliably we can differentiate the true target from the most probable incorrect target.
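The confidence ratio reduces to a one-line computation; the association counts below are hypothetical:

```python
def confidence_ratio(n_correct, n_top_incorrect):
    """Normalized difference between the correct-association count Nc and the
    count Ni of the most frequently matched incorrect target: C = (Nc - Ni) / (Nc + Ni)."""
    return (n_correct - n_top_incorrect) / (n_correct + n_top_incorrect)

c_all_correct = confidence_ratio(100, 0)   # correct on every frame -> 1.0
c_mixed = confidence_ratio(60, 20)         # correct 60 frames, best wrong target 20
```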
Figure 19 and Figure 20 show the confidence ratios. Targets 1, 3, 5, 9, and 11 have a $C$ of 1.0 with no Doppler uncertainty. Targets 1, 5, and 11 maintain a $C$ greater than 0.75 through 10 Hz RMS uncertainty, while target 3 maintains a ratio greater than 0.6. The $C$ for targets 6 and 7 becomes negative at 0.5 Hz of RMS uncertainty, but becomes positive again at 3 Hz. The increase in $C$ is attributed to the incorrect associations becoming distributed between multiple incorrect targets, resulting in a lower $N_i$ for any single target. Due to the similarity with another target at the beginning of the time interval, the $C$ for target 4 begins low, but after 0.1 Hz of RMS uncertainty it increases above 0.66 for all levels of uncertainty. Target 10 has the lowest $C$ at 1 Hz RMS uncertainty and remains low for all levels; target 10 has the highest RMSE in the video tracker and a DD signature similar to that of target 2. For targets 1–5 and 11 the confidence is positive for all uncertainty levels, indicating we can successfully associate the RF emanations with the corresponding target from the multi-spectral video tracker through all uncertainty levels.