Fusion of Heterogenous Sensor Data in Border Surveillance

Wide area surveillance has become of critical importance, particularly for border control between countries where vast forested land border areas are to be monitored. In this paper, we address the problem of the automatic detection of activity in forbidden areas, namely forested land border areas. In order to avoid false detections, often triggered in dense vegetation with single sensors such as radar, we present a multi sensor fusion and tracking system using passive infrared detectors in combination with automatic person detection from thermal and visual video camera images. The approach combines weighted maps with a rule engine that associates data from multiple weighted maps. The proposed approach is tested on real data collected by the EU FOLDOUT project in a location representative of a range of forested EU borders. The results show that the proposed approach can eliminate single sensor false detections and enhance accuracy by up to 50%.


Introduction
Wide area surveillance has become an increasingly important topic with respect to security concerns, not only for large industrial premises and critical infrastructures but also for 'green' land borders. CCTV installations, based on video and, more recently, thermal infrared cameras, are still the backbone of any surveillance solution because they allow fast identification of potential threats and provide good situational awareness for the operator. Because the permanent visual monitoring of a multitude of screens in parallel is nearly impossible for a human operator, automated detection of incursions is the only possible way to scale up to wide area green border surveillance.
Automated video detection can be used to reduce the workload of the human operator [1], but it is prone to false detections under low light conditions or adverse weather such as rain or fog. To date, automatic motion detectors are widely used in perimeter security to supplement the camera installations and automatically alert the video operators. Audio detection is sometimes used, especially for the detection and localization of specific strong acoustic events such as glass breaking [2]; the use of specialised microphone hardware has also been reported for source localization [3]. Owing to their robustness against low light and weather conditions, passive infrared (PIR) detectors [4,5] and radar detectors [6] are frequently applied. PIR devices utilize a pair of pyroelectric sensors to detect heat energy (infrared radiation) in the surrounding environment. A change in the signal differential between the two sensors sets off an alarm; PIR devices can thus detect any IR-emitting object within their field of view.
However, even an acceptable false positive rate for a single sensor in any automatic detection system will accumulate a significant number of false alarms as the number of

Person Detection Data from Video and PIR Sensors
The image frames of thermal and RGB video cameras feed into state-of-the-art detectors 'You Only Look Once' (YOLOv5 [18]), which have been trained on the MS COCO (Microsoft Common Objects in Context) dataset [19]. Upon each detection, the corresponding confidence value of the classifier is attributed to the respective grid cells of the density map, as described above. The grid locations have been derived from the camera's measured field of view (FoV), which represents a triangular shaped area in the weighted map.
The PIR sensor detections were attributed to the surroundings of each PIR sensor's location. Upon each detection, a defined confidence was attributed to the grid cells of the density map lying within a circular area of 7.5 m diameter around the sensor's location.
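As an illustration of how a PIR detection could be mapped to grid cells, the following minimal sketch selects every cell whose centre lies inside the 7.5 m diameter circle. The grid size, the 1 m cell size, the metric grid origin, and the function name `cells_in_circle` are assumptions made for this example only; they are not taken from the original implementation.

```python
import math


def cells_in_circle(n_rows, n_cols, cell_size_m, centre_xy_m, diameter_m):
    """Return the (j, k) cells whose centres lie inside the circle.

    The grid is assumed to be axis-aligned, with cell (0, 0) starting at
    the metric origin (a hypothetical layout chosen for this sketch).
    """
    radius = diameter_m / 2.0
    cx, cy = centre_xy_m
    selected = []
    for j in range(n_rows):
        for k in range(n_cols):
            # Centre of cell (j, k) in metres.
            x = (k + 0.5) * cell_size_m
            y = (j + 0.5) * cell_size_m
            if math.hypot(x - cx, y - cy) <= radius:
                selected.append((j, k))
    return selected


# Hypothetical 20 m x 20 m area with 1 m cells and a PIR sensor in the middle.
pir_cells = cells_in_circle(20, 20, 1.0, (10.0, 10.0), 7.5)
```

The returned cells are those that would receive the detection confidence; a triangular camera FoV could be handled in the same way by replacing the distance test with a point-in-triangle test.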

Geo-Registration
As each sensor provides data in its own local coordinate system, a transformation is required to map the detections into a common geodetic coordinate system before they can be fused. A registration procedure is then applied to convert the sensors' raw detections from their respective local coordinate systems to the World Geodetic System (WGS84), chosen as the reference coordinate system in this study. The registered detections are expressed as detection circles or polygons.
The performance of the data fusion step strongly relies on the accuracy of the registration. Therefore, an accurate calibration of each sensor along with a precisely known geometric configuration of the sensor network are essential to ensure an effective registration. In this study, one of the major challenges lies in the fact that the data to register are highly heterogeneous in terms of intrinsic properties and acquisition methods, leading to a different registration procedure for each of two sensor categories: (1) omnidirectional sensors and (2) directional sensors.

Omnidirectional Sensors
For omnidirectional sensors, such as PIR sensors, the procedure is straightforward. The registered detection is centred on the sensor location referenced in the common WGS84 frame and uses the detection range as the radius of the circle element defining the registered detection.
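A minimal sketch of such a circular registered detection is given below. It approximates the circle by a polygon in latitude and longitude using a local spherical-Earth approximation, which is an assumption of this example (adequate for ranges of a few tens of metres) and not necessarily the procedure used in the study; the function name, the vertex count, and the example coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation)


def registered_circle(lat_deg, lon_deg, range_m, n_vertices=16):
    """Build a circular registered detection centred on the sensor.

    Returns the centre, the radius in metres, and polygon vertices in
    (latitude, longitude) degrees, starting due north and going clockwise.
    """
    lat0 = math.radians(lat_deg)
    vertices = []
    for i in range(n_vertices):
        bearing = 2.0 * math.pi * i / n_vertices
        north_m = range_m * math.cos(bearing)
        east_m = range_m * math.sin(bearing)
        # Small metric offsets converted to angular offsets.
        dlat = math.degrees(north_m / EARTH_RADIUS_M)
        dlon = math.degrees(east_m / (EARTH_RADIUS_M * math.cos(lat0)))
        vertices.append((lat_deg + dlat, lon_deg + dlon))
    return {"centre": (lat_deg, lon_deg), "radius_m": range_m, "vertices": vertices}


# Hypothetical PIR sensor position with a 10 m detection range.
circle = registered_circle(48.0, 16.0, 10.0)
```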

Directional Sensors
For directional sensors, the procedure is more complex. Because the directional sensor's own local coordinate system is well defined with respect to the WGS84 coordinate system, the Helmert transformation can be calculated. This transformation is a combination of a translation, a rotation, and a scale operation and requires accurate sensor location, height, and position angles (heading, roll, pitch) to register the raw detection. The position of the raw detection in the local coordinate system is provided as a directional vector in polar coordinates (elevation, azimuth), with associated uncertainties used to generate the registered detection as a polygon element. In some cases, the transformation may lead to extreme or illegal registered positions, such as a detection located above the horizon. In those cases, an exception procedure is applied that sets the registered detection to the sensor detection FoV, derived from the sensor FoV and the maximum detection range. To be processed as an exception, a raw detection must fulfill one of the following conditions:
• The sum of the sensor pitch and detection elevation angles is >= 0 or < −90;
• The distance between the camera and the registered detection is larger than the sensor maximum detection range.
Appropriate sensor calibration and precise knowledge of the location of each sensor within the WGS84 coordinate system are crucial to limit the uncertainty linked to the relative position of the detection. For directional sensors, the accuracy of the registration also strongly relies on the accuracy of the sensor configuration measurements: position angles and sensor height. The rotation and translation matrices applied in the Helmert transform are computed based on those values. Measurement errors are therefore propagated to the registered detection, their impact increasing with sensor height and pitch. A succinct validation based on field GPS data showed satisfactory results for this experiment, with registration performance ranging from millimetres to a few tens of metres in the worst case. Those results take into account realistic uncertainties associated with sensor configuration measurements and should be benchmarked against the accuracy of the smartphone GPS used for validation, which is typically about 5 m under the open sky.
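The two exception conditions can be illustrated with a simplified sketch. Instead of the full Helmert transformation, the example below intersects the viewing ray with flat ground, which is an assumption made here purely for illustration; the function name and the numeric values are hypothetical. A return value of `None` stands for "apply the exception procedure".

```python
import math


def register_on_flat_ground(height_m, pitch_deg, elevation_deg, max_range_m):
    """Return the horizontal distance to the detection, or None for an exception.

    Simplified stand-in for the Helmert-based registration: the ray defined
    by sensor pitch plus detection elevation is intersected with flat ground.
    """
    angle_deg = pitch_deg + elevation_deg  # negative values look below the horizon
    # Condition 1: ray at or above the horizon, or beyond straight down.
    if angle_deg >= 0.0 or angle_deg < -90.0:
        return None
    depression = math.radians(-angle_deg)
    horizontal_m = height_m / math.tan(depression)
    # Condition 2: camera-to-detection distance exceeds the maximum range.
    if math.hypot(horizontal_m, height_m) > max_range_m:
        return None
    return horizontal_m


# Hypothetical camera mounted at 5 m with a 100 m maximum detection range.
valid = register_on_flat_ground(5.0, -10.0, -35.0, 100.0)
above_horizon = register_on_flat_ground(5.0, -1.0, 2.0, 100.0)
too_far = register_on_flat_ground(5.0, 0.0, -1.0, 100.0)
```

The sketch also makes the error propagation visible: for shallow viewing angles, a small error in pitch changes the tangent term, and hence the registered distance, considerably.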

Multi Sensor Fusion with Weighted Maps
Multi sensor data fusion is widely used for location based applications, including sensor networks. Such methodologies can be used in different sectors, such as border surveillance, surveillance of critical infrastructure, and the automotive sector. To estimate an object's location at a specific time, common approaches are feature based or location based methods [20]. For example, in the automotive sector, a relatively small sensor network is used to model the vicinity of a vehicle and derive appropriate actions using a feature based approach [21]. Nevertheless, the scalability of such methods is questionable in the context of surveillance tasks that include a large number of sensors (e.g., in the order of 1000 for a land border) and a very large geographical extent, as is the case in border surveillance. Location based maps, where each cell represents the probability of an event, are therefore a more suitable approach to reduce the computational cost and meet the real-time capability that is still essential in both sectors. Occupancy grid maps [22] and density maps [23] are common solutions for location based maps.
In this work, we adapt the approach presented in [17], in which the concept of weighted maps was introduced. The weighted maps concept is derived from probabilistic occupancy maps [24]. Essentially, the fusion process is modeled by a two step approach. First, the update process of weighted maps is introduced. This models the spatio-temporal behaviour of a weighted map inferred from events reported by a sensor or a set of sensors of the same sensor modality. The second step is the fusion of multiple weighted maps. A linear opinion pool (LOP) [25] was used for this step. One main advantage of a LOP is that sensors yielding highly reliable data can be prioritized by increasing the weights employed in the LOP. This also allows sensor data to be fused in a rule based fashion, which makes the fusion methodology easier to interpret and eases the parameterization. For completeness, a summary of the core methodology of the work presented in [17] is provided in this section.
According to [17], a weighted map $M_i$ is defined on a grid $G$ that represents the area of interest of a sensor $S_i$ that is able to report or give evidence about certain events in its vicinity, such as the PIR sensors as well as the visual camera and the thermal camera. Here, $m^i_{j,k}$ are the weights modeling the time dynamic behaviour of the weighted map for each cell $(j, k) \in G$. In this work, all of the sensors $S_i$ provide localized events reported at a specific time $t_E$. Additionally, each sensor is able to estimate its confidence regarding the reported event, for example, a classification score when detecting a person, or one or zero in the case of a binary classification. Note that in this work we use this confidence as weights $\{w_{E,j,k}\}_{(j,k) \in G}$ as described in [17]. At the time of occurrence $t_E$ of an event reported by a sensor $S_i$, we can calculate the state (weight) of the current cell $(j, k) \in G$ according to Equation (1), which describes the update process, modeling the spatio-temporal behaviour of a weighted map. To reduce the weights provided by older events, an exponential decay with a decay constant $\lambda_i$ was introduced. For a detailed explanation of how the update process is done, please refer to [17]. In our setup, a typical example for a thermal sensor would be $\lambda = 0.5$ s, $w_E = 1/30$, and $w_{\max} = 1$. Here, we assume that the thermal sensor yields approximately 30 events per second when an event is present and detected. Using this approach, the weighted maps $M_i$ yield high weight values if the corresponding sensors provide a large number of events at the same location in a short timespan.
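To make the behaviour of the update process more tangible, the following sketch simulates a single cell. The exact update rule is given in [17]; the form used here (exponential decay of the previous weight with rate λ, addition of the event weight, and clipping at the maximum weight) is an assumption of this example, as are the function and variable names.

```python
import math


def update_cell(weight, last_time, event_time, event_weight, decay, w_max):
    """Assumed update rule: decay the old weight, add the event weight, clip."""
    elapsed = event_time - last_time
    decayed = weight * math.exp(-decay * elapsed)
    return min(decayed + event_weight, w_max)


# Thermal sensor example values from the text: decay 0.5, w_E = 1/30, w_max = 1.
decay, w_event, w_max = 0.5, 1.0 / 30.0, 1.0

# Two seconds of detections at roughly 30 events per second on the same cell.
weight, last_time = 0.0, 0.0
for n in range(1, 61):
    event_time = n / 30.0
    weight = update_cell(weight, last_time, event_time, w_event, decay, w_max)
    last_time = event_time

# Without further events the weight fades; here it is evaluated 2 s later.
faded = update_cell(weight, last_time, last_time + 2.0, 0.0, decay, w_max)
```

Under these assumptions, a sustained stream of events saturates the cell at the maximum weight after little more than one second, whereas isolated events decay away, which is the qualitative behaviour described in the text.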
In the second step, the fusion of the weighted maps, the LOP is employed as described in [25]. In this work, we chose the normalization factor $\alpha = \omega_{\max}$. Thus, the fusion process is modeled by evaluating the LOP $F_{j,k}(t)$ of different weighted maps $M_i$ for each weight $m^i_{j,k}(t)$ at any point in time $t$. In Figure 1, the fusion process is depicted using a weighted map $D_1$ for PIR detections (left) and a weighted map $D_2$ for detections from person classification applied to thermal images (right). Finally, a decision can be made to trigger an alarm if a certain threshold $\tau \in [0, 1]$ is exceeded for an individual cell $(j, k) \in G$. The alarm resulting from the decision process is localized at those cells of the grid where the threshold $\tau$ is exceeded. This set of cells is denoted as $\{(j, k) \in G : F_{j,k}(t) > \tau\}$. An example of a triggered alarm and its location is shown in Figure 1 in purple (bottom).
Note that in this paper we do not include the restricted fusion map as described in [17]. In this way, we also allow the triggering of alarms in areas where no overlap of sensors of different types occurs, provided that sufficient confidence is available. This results in higher coverage of the area of interest. Specifically, in the task of detection through foliage, this approach turned out to be more suitable. An example (illustrated by Figure 1) using a weighted map $D_1$ for PIR detections and a second weighted map $D_2$ for detections of a person in thermal videos can be written as a linear opinion pool of the two maps. In this example, the weights are chosen as $\omega_1 = \omega_2 = 1/2$. In our work, the weights were chosen uniformly, i.e., $\omega_i = 1/3$ for the fusion of three sensor modalities and $\omega_i = 1/2$ for the fusion of two sensor modalities. The threshold for triggering an alarm was chosen to be $\tau = 0.8$. Generally, in this methodology, the parameter $\tau$ is used to parameterize the sensitivity of the fusion process. The higher $\tau$ is chosen, the more evidence the sensors (i.e., the weighted maps) need to provide over time before an alarm is triggered. This parameter is typically chosen empirically based on knowledge of the sensors' behaviour and the required sensitivity.
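The fusion and decision steps can be sketched as follows. The example implements a plain linear opinion pool, i.e., a per-cell weighted sum of the input maps, followed by thresholding. The normalization factor α mentioned above is deliberately not modelled because its exact role is defined in [17] and [25]; the 2 × 2 maps and all names are hypothetical.

```python
def fuse(maps, weights):
    """Plain linear opinion pool: per-cell weighted sum of the input maps."""
    n_rows, n_cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[j][k] for w, m in zip(weights, maps)) for k in range(n_cols)]
        for j in range(n_rows)
    ]


def alarm_cells(fused, tau):
    """Cells (j, k) whose fused value exceeds the threshold tau."""
    return {
        (j, k)
        for j, row in enumerate(fused)
        for k, value in enumerate(row)
        if value > tau
    }


# Hypothetical weighted maps for PIR and thermal person detections.
pir_map = [[1.0, 0.2], [0.0, 1.0]]
thermal_map = [[1.0, 1.0], [0.0, 0.5]]

fused_map = fuse([pir_map, thermal_map], [0.5, 0.5])
alarms = alarm_cells(fused_map, tau=0.8)
```

In this toy example only the cell in which both maps report a high weight exceeds the threshold of 0.8, which mirrors how the fusion suppresses detections supported by a single modality.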

Tracking of Fused Objects
Within a surveillance system, a natural application of the fused data would be to feed a tracking system that would allow, in the foreseen application, following the movement of a person illegally crossing a border. With this application as a target, we have developed a simple tracker to analyse the potential use of fused data for tracking.
The tracking system works by building a model of the object exclusively based on its position and timestamp. At the first object detection, the model is initialised with the position and timestamp of that detection. A track model is thus defined as a tuple of points $(x, y, t)$, where $x$, $y$, and $t$ correspond, respectively, to the latitude, longitude, and timestamp of the point. If several object detections occur at the same time, as many model templates are created as there are detections received simultaneously. Subsequent detections are added to a given track model depending on the cost involved in appending the detection to the track. The cost is defined as the distance between the incoming detection and the track candidate. Let $d_s(T_i, o)$ be the spatial distance between the most recent point in the track $T_i$ and the incoming detection $o$. The spatial distance is calculated as the Euclidean distance between the latitude and longitude of the two points.
Let $d_t(T_i, o)$ be the temporal distance between the most recent point in the track $T_i$ and the incoming detection $o$, given by the subtraction of the two point timestamps. The cost of appending object $o$ to track $T_i$ is then calculated from $d_s$ and $d_t$ using $\tau_s$ and $\tau_t$, which are, respectively, spatial and temporal similarity parameters tuned empirically for our current implementation. The object is appended to the track if the cost is less than or equal to a given threshold $\tau_c$; otherwise, that object initialises another track. In the case of multiple incoming detections and multiple track candidates, a Hungarian algorithm [26] was implemented so that the associations between detections and tracks incur the minimum cost.
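A compact sketch of the association step is given below. Several elements are assumptions of this example and not the authors' implementation: the cost is taken as the sum of the spatial and temporal distances, each divided by its similarity parameter; opening a new track is priced at the threshold; and the minimum-cost assignment is found by exhaustive search over permutations, which is only a stand-in for the Hungarian algorithm and is practical for a handful of detections only.

```python
import itertools
import math


def append_cost(track, detection, tau_s, tau_t):
    """Assumed cost: normalised spatial plus normalised temporal distance."""
    x, y, t = track[-1]  # most recent point of the track
    d_s = math.hypot(detection[0] - x, detection[1] - y)
    d_t = detection[2] - t
    return d_s / tau_s + d_t / tau_t


def associate(tracks, detections, tau_s, tau_t, tau_c):
    """Assign detections to tracks with minimum total cost (brute force)."""
    # Each detection takes a distinct existing track or None (a new track).
    slots = list(range(len(tracks))) + [None] * len(detections)
    best, best_total = None, math.inf
    for perm in itertools.permutations(slots, len(detections)):
        total = 0.0
        for detection, idx in zip(detections, perm):
            if idx is None:
                total += tau_c  # assumed price of opening a new track
            else:
                cost = append_cost(tracks[idx], detection, tau_s, tau_t)
                total += cost if cost <= tau_c else math.inf
        if total < best_total:
            best, best_total = perm, total
    for detection, idx in zip(detections, best):
        if idx is None:
            tracks.append([detection])
        else:
            tracks[idx].append(detection)
    return tracks


# Hypothetical points (x, y, t) in arbitrary units.
tracks = [[(0.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)]]
detections = [(9.0, 0.0, 1.0), (1.0, 0.0, 1.0), (50.0, 50.0, 1.0)]
tracks = associate(tracks, detections, tau_s=5.0, tau_t=5.0, tau_c=1.0)
```

In the example, the two nearby detections extend the two existing tracks and the distant detection initialises a third track.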

Data Description
Data were collected at a simulated border site where actors were asked to simulate typical border scenarios under realistic conditions, i.e., different times of the day and the weather conditions prevailing on the day. Prior to the collection of data, an ethics approval process was adopted. The actors were provided an information sheet describing the purpose of the study, the data that would be collected, and how it would be stored and used. Written informed consent to take part in this study was obtained for all participants.
The simulated border site was characterised by a road with a car park. A simulated border was established down one side of the road, as shown in Figure 2, and both sides of the road had areas of foliage. Two types of sensors were deployed to monitor this area: PIR sensors and video cameras (one thermal and one RGB camera). Whereas Figure 2 gives an overview of the sensor placement as well as the dimensions of the area, some sample images captured by the thermal and RGB cameras are given in Figure 3. These images are representative of the challenge addressed in this work, namely, the through-foliage detection of people in green areas. The problem is fragmented occlusion, which appears in the case of through-foliage detection in natural (green) environments. This detection is important and very much needed by border authorities worldwide for enhancing border security operations on green borders. Fragmented occlusion usually appears simultaneously as partial or full occlusions, which undoubtedly affect the performance of automatic people detectors. From Figure 3 it can be observed how challenging the person detection is, taking into account the foliated environment and natural conditions that can be particularly difficult, such as low light during the night. Fragmented occlusion has become a hot topic in automated surveillance at green borders where through-foliage detection is key.
The collected dataset has, in part, been made available to the scientific community to foster developments in this new area [27] and aims to fill the gap currently existing in datasets addressing through-foliage detection. Six PIR sensors were deployed along the simulated border at 7 m intervals. Each PIR sensor has a range of approximately 10 m and a FoV of 90° parallel to the simulated border. The thermal camera was deployed approximately 50 m from the PIR sensors on the simulated border, with the FoV parallel to the simulated border. The thermal camera used was the FLIR F-606E. As this camera detects heat, it is possible to detect a person even in poor weather and lighting conditions, making it an ideal complementary sensor for this application. This camera has a thermal spectral range of 7.5 µm to 13.5 µm, and a FoV of 6.2° × 5° [28]. The RGB camera used was the DH-SD6AL830V-HNI 4K Laser Pan-Tilt-Zoom (PTZ) Network Camera. It features a 12 Megapixel STARVIS™ CMOS sensor, powerful optical zoom (×30), and accurate pan/tilt/zoom performance; this camera provides an all-in-one solution for capturing long distance video surveillance for outdoor applications. The sensor layout can be seen in Figure 2. In addition to the PIR sensors and the RGB and thermal cameras, participants each carried a mobile phone with GPS capability. An app called 'GPS Logger for Android' [29] was installed on each of these devices and used to record each actor's location during the data collection to be used as ground-truth (GT) data. A summary of all characteristics of the dataset addressed in this paper is given in Table 1. The area that is within range of the PIR and thermal sensors is referred to as the zone of interest (ZoI) and is illustrated in Figure 4. Participants were asked to simulate typical border scenarios based on a range of predefined scripts.
These scripts include activities such as: a single actor, or group of actors, simulating a border crossing through the ZoI and negotiating the surrounding area; a single actor, or group of actors, simulating a border crossing through the ZoI and waiting to be picked up by a vehicle; or actors performing simulated illicit activities, such as a vehicle loading or unloading illicit material near the border but not necessarily in the ZoI. One of the objectives of the data collection was to perform it under different representative conditions to evaluate how effective the system would be in detecting the activities. As such, two different behaviour modalities were scripted. The first mode was naïve behaviour, where the actor simulates being unaware of the surveillance system and performs the activity without hiding from cameras or PIR sensors; in the second mode, the system aware behaviour mode, the actor is aware of at least the existence of a surveillance system and tries to move more quickly and silently and attempts to partially hide. From the data collected, fifteen sequences were generated representing simultaneous recording by RGB, thermal, and PIR sensors. The selection of sensors was considered appropriate for analysis to demonstrate the reduction in false detections using the fusion techniques described in this paper. A summary of these sequences is given in Table 2.

Evaluation Methodology
In order to test how the approach described in this paper can reduce the false alarm rate of a single sensor's performance, we compare the object detection performance of individual detectors and that of fused output of combined detectors taking as ground-truth the GPS data collected from the actors' phones. We evaluate the proposed multi sensor fusion approach on the area where the FoV of the thermal and RGB cameras overlap the detection range of the PIR sensors. The full processing schematic for single sensor evaluation is summarized in Figure 5; the fused output evaluation is given in Figure 6. Note that the evaluation process includes a tracking component. The influence of the tracking component alone in the proposed approach is evaluated by inputting into the system the ground-truth data themselves as detection data only (no knowledge of tracking ID) and allowing the tracker to associate the individual detections, form the tracks, and assign an ID. The resulting tracks are compared with the GT. These results are discussed in the next section, but it is expected that the tracker would have almost 100% accuracy when the input data are the ground-truth themselves; this would confirm the tracking component does not distort or influence the results from the fusion algorithm in our proposed approach.
We thus concentrated the evaluation on a zone of interest inside the FoV of the thermal and RGB cameras and also covering the PIR sensors (see Figures 2 and 4). Note that the evaluation focuses on establishing whether the activity in the ZoI is true or false when compared to GT data, which are given by the GPS sensors carried by actors performing activities (or not) in the ZoI. The full processing schematic for the multi fusion approach evaluation is summarized in Figure 6.
The test data comprise two types:
• The activity is outside the ZoI so that the full potential of the fusion approach for filtering false detections can be evaluated (see Figure 4A).
• The activity happens inside the ZoI so that the accuracy of detections can be evaluated (see Figure 4B).
To account for different sampling frequencies, all data were analysed in temporal windows of 1 s duration. Detection data and GT data were compared inside these temporal windows with typical receiver operating characteristic performance measures of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), defined as follows:
• True positive (TP): In a given temporal window, a system detection and a GT object exist inside the ZoI.
• False positive (FP): In a given temporal window, a system detection exists inside the ZoI but no GT object is found.
• True negative (TN): In a given temporal window, no system detection exists inside the ZoI and no GT object is found.
• False negative (FN): In a given temporal window, no system detection exists inside the ZoI; however, a GT object is found.
Figure 5. Schematic of data processing for single sensors (e.g., PIR and thermal camera) from detection to transformation between local coordinate system (LCS) and world coordinate system (WCS) and evaluation.
Typical performance measures of accuracy, precision, and recall can then be calculated:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{6}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{7}$$
Figure 6. Schematic of data processing for multi fusion approach from detection to fusion to transformation between local coordinate system (LCS) and world coordinate system (WCS) and evaluation.
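The window-based evaluation can be sketched as follows, assuming that the inputs are the timestamps (in seconds from the start of a sequence) of system detections and of GT positions that fall inside the ZoI. The function names and the example timestamps are hypothetical.

```python
def window_counts(detection_times, gt_times, n_windows, window_s=1.0):
    """Count TP, FP, TN, and FN over consecutive temporal windows."""
    det_windows = {int(t // window_s) for t in detection_times}
    gt_windows = {int(t // window_s) for t in gt_times}
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for w in range(n_windows):
        detected, present = w in det_windows, w in gt_windows
        if detected and present:
            counts["TP"] += 1
        elif detected:
            counts["FP"] += 1
        elif present:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


def performance(counts):
    """Accuracy, precision, and recall; 0.0 when a denominator is zero."""
    tp, fp, tn, fn = (counts[key] for key in ("TP", "FP", "TN", "FN"))
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical six-second sequence.
counts = window_counts([0.2, 0.7, 3.1, 4.5], [0.5, 1.5, 4.9], n_windows=6)
scores = performance(counts)
```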

Results and Discussion
Fifteen sequences (scenario scripts) have been evaluated to assess the proposed approach. Data were analysed in time intervals of 1 s for all sequences. First, as tracking is employed in the evaluation process of single sensors (see Figure 5) and multiple sensor fusion (see Figure 6), the influence of the tracking component alone in the proposed approach is evaluated by inputting into the system the ground-truth data itself as detection data only (no knowledge of tracking ID) and allowing the tracker to associate the individual detections, form the tracks, and assign an ID. The resulting tracks are compared with the GT. These results are shown in Table 3. It can be observed from the results that the matching with the GT is almost perfect except for a few cases. Accuracy, precision, and recall are at 100% except for three cases (scripts D, H, and M). In these scripts, the lowest accuracy value is 91%, and precision is always at 100%, except for one case at 98%. Only the recall has some lower values on the aforementioned three scripts, ranging from 71% to 91%, and 100% otherwise. This means that the tracker component makes some mistakes in generating the tracks, making the recall drop by 20% on average across these three scripts; these tracking mistakes are limited, taking into account that overall only three scripts out of fifteen yield results that do not match the GT perfectly. Examining these scripts in Table 2, they correspond to group activities where there are small to large groups (3 to 7 people) moving together in the area. It is well known in tracking that following different targets close to each other can be challenging and produce errors in tracking. It also must be taken into account that the GPS data are also sensor data and contain some irregular sampling; this poses even more challenges to the tracking component, which overall can be deemed to work considerably well.
Second, the evaluation was further performed, first taking one single sensor at a time, and then the different possible sensor combinations for data fusion. Each line in Table 4 shows the resulting evaluation for single sensors and for the different fusion combinations of sensors. A closer look at Table 4 shows that the usage of this approach when combining all sensors yields one of the lowest FP values. The thermal camera has the lowest FN value but also the highest FP value, meaning that the sensor is very sensitive but at the same time is producing a considerable number of false detections. The PIR sensor in itself has the lowest FP value, but, at the same time, the number of TP is also the lowest. The consequence of this is that the proposed approach shows a significant reduction in FPs (between 95% reduction when compared to RGB and 91% when compared to the thermal camera sensor).
It must be noted that PIR sensors show a compromise between not producing FPs due to movement of tree branches and leaves and being sensitive enough to detect a person passing by. The PIR sensor sensitivity was moderate, which produced, in consequence, the lowest value of FPs but at the same time the lowest value of TPs and the highest value of TNs. This translates to the PIR sensor having a good precision but the lowest recall among individual or fusion combined sensors. In contrast, the RGB and thermal camera sensors appear to operate in the opposite way; their sensitivity is high and therefore their recall values are the highest in Table 4; however, their precision is also the lowest, and this can be seen by the high number of FPs they are producing. Having a highly sensitive system is generally preferred in security systems, although the downside is generating a number of false detections that must be verified by a security guard; hence the importance of our fusion approach in filtering false detections. Table 4 shows that the proposed approach actually balances the two operation modes between PIR and camera (RGB, thermal) sensors. The best results are achieved when all sensor data are fused. The combination of PIR-RGB-thermal sensors leads to the best accuracy and precision values, 0.88 and 0.55, respectively. The recall is, however, not as large (0.34) given that the PIR sensor has minimal recall in itself, and this influences the overall fusion results. The tracking component may also have an influence as shown before, and recall values could potentially rise by about 20%, bringing recall values potentially to 41%. Table 5 details the performance of individual sensors and the different sensor fusion results according to the number of people appearing in the scene (single person, small group of three people, or large group of ten people) and also according to the different behaviours adopted (naïve person or system aware).
In general, it can be observed that the naïve behaviour allows the system to produce a significantly higher number of true positives. In the system aware mode, the actors perform realistic movements attempting to hide from the system sensors, and fewer TP detections are achieved. The values of precision and recall decrease, particularly in the case of one person on the scene or a large group of ten people. Interestingly, the performance seems not to be affected when monitoring small groups of three people. This can be due to the fact that, within the small group, the system continuously detects one of the members, whereas if it is a single person, it is much more difficult to continuously track that person, and if it is a large group of people, it is more difficult to detect all members. It is interesting to observe that the same patterns noted before for Table 4 can also be observed in Table 5; namely, RGB and thermal cameras are sensitively tuned and have a good recall but also produce a significant number of false positives. PIR sensors have a lower recall but a better precision. The fusion of sensors enhances the overall precision at the expense of the recall for some individual sensors. Different sensor combinations produce the best results depending on the number of people in the scene and the actors' behaviour; however, it can be seen that combining all sensor data always leads to the best or second-best results for detection.
The results obtained with the proposed approach are very encouraging given the fact that the actors crossed the simulated border quickly and then either hid or continued their path either on the road or across the foliage on the other side of the road. Sharp movements and foliage represent a significant detection challenge. Camera sensors are sensitive enough to detect people as far as possible, even through foliage, although with the downside of generating false positives. Notwithstanding this, our proposed approach manages to filter most FPs. Regarding true positives, these would certainly be improved by adding more sensors into the fusion system, giving a better coverage of the area targeted for surveillance; some possibilities include adding crossing cameras, more PIR sensors, or other types of sensors such as seismic or airborne sensors.
Nevertheless, there are drawbacks of the proposed methodology that have been uncovered during this work. First of all, we would like to state that the proposed fusion methodology is able to satisfactorily balance out the drawbacks and strengths of the employed sensors as described previously. However, the proposed fusion method heavily relies on the quality of the input data. Particularly, in the case of through-foliage detection, it is quite challenging to reliably detect the desired event for each individual sensor modality. Consequently, this transfers to the output of the fusion, which especially can be seen in the recall in Table 4. We would like to investigate this trade-off more in future work using supplementary sensor modalities.
Additionally, it was hard to find open source datasets against which to compare our work. We did not find any dataset that was directly comparable to the one collected in this work, mainly because of the lack of contributions in the field of through-foliage applications that match the scenarios defined in this work. We would like to contribute to this field in the future.
Finally, we acknowledge that parameterization of the fusion methodology can become challenging as the number of sensor modalities increases. In a sensor network, it is desirable to use as many sensor modalities as possible to increase the probability of obtaining a sensitive system with high accuracy and recall while reducing the number of false alarms generated. However, simply fusing all the sensors together in a uniform way will not produce the best results. For this reason, sensors with complementary attributes are fused most of the time (e.g., PIR: high accuracy and precision; cameras: good recall). As a result, a significant amount of domain know-how, as well as an understanding of the sensor models, is necessary to fully exploit the potential of the proposed methodology. In the future, we would like to focus our research on reducing the complexity of parameterization, with the overall goal of increasing the robustness and therefore the reliability of such a system. One option is to use neural networks or deep learning approaches to automatically learn the weights and threshold in the decision process [17] based on the employed sensor modalities and use-cases.
Regarding our tracking system, Figure 7 shows, as an application example, the track resulting from the fused data of Seq. E in Table 2. The tracking is coherent given the activity, and its application for monitoring the area is promising given that the fusion 'cleans' most false detections. Such false detections would certainly perturb the tracking by making the track bounce to false positions, causing tracks to change IDs, or breaking tracks and creating new ones, which provokes fragmentation. All of these are well-known issues when tracking is fed with noisy data.

Conclusions
In this paper, we presented a multi sensor fusion and tracking system using passive infrared detectors in combination with automatic person detection from thermal and visual video camera images. We adopted a weighted map based fusion approach combined with a rule engine. The geo-referenced detections from the PIR sensors and from the RGB and thermal cameras, the latter obtained with open source video classification software, were used as input for the weighted maps.
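As a rough illustration of the general idea of combining weighted maps with a rule, the sketch below accumulates geo-referenced detections on a grid and raises an alert only where enough weighted evidence from different sensor types coincides. It is a simplified sketch under stated assumptions, not the implementation evaluated in this paper: the grid cell size, the per-sensor weights, the threshold, the two-sensor rule, and all names (`Detection`, `fuse`, `WEIGHTS`) are hypothetical, and the temporal dimension is ignored.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set, Tuple

Cell = Tuple[int, int]


class Detection(NamedTuple):
    sensor: str  # sensor type: "pir", "rgb" or "thermal"
    x: float     # geo-referenced position in metres (hypothetical local frame)
    y: float


# Hypothetical parameters, chosen for illustration only.
WEIGHTS = {"pir": 0.5, "rgb": 0.3, "thermal": 0.3}
CELL_SIZE_M = 5.0
ALERT_THRESHOLD = 0.55
MIN_SENSOR_TYPES = 2


def cell_of(d: Detection) -> Cell:
    # Map a position to the index of the grid cell that contains it.
    return (int(d.x // CELL_SIZE_M), int(d.y // CELL_SIZE_M))


def fuse(detections: List[Detection]) -> List[Cell]:
    """Return the sorted grid cells in which the fusion rule fires."""
    sensors_by_cell: Dict[Cell, Set[str]] = defaultdict(set)
    for d in detections:
        sensors_by_cell[cell_of(d)].add(d.sensor)

    alerts = []
    for cell, sensors in sensors_by_cell.items():
        # Each sensor type contributes its weight once per cell.
        score = sum(WEIGHTS[s] for s in sensors)
        # Rule: enough weighted evidence AND at least two sensor types agree.
        if score >= ALERT_THRESHOLD and len(sensors) >= MIN_SENSOR_TYPES:
            alerts.append(cell)
    return sorted(alerts)


detections = [
    Detection("rgb", 12.0, 3.0),       # camera and PIR agree on one cell
    Detection("pir", 13.5, 4.0),
    Detection("thermal", 40.0, 22.0),  # isolated single-sensor detection
]
alerts = fuse(detections)
print(alerts)
```

In this toy example, the isolated thermal detection is suppressed because no second sensor type supports it, whereas the cell confirmed by both the RGB camera and the PIR sensor produces an alert.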
We evaluated the fusion using fifteen different sequences corresponding to different acting scripts and compared the results of single sensor detections with those of the weighted map based fusion approach. We draw the following conclusions:
• A significant reduction in FPs, which also translates into an increased number of TNs;
• An increase in the accuracy (28% increase compared to RGB and 47% compared to thermal);
• An increase in the precision (more than 220% increase compared to RGB and 71% compared to thermal);
• Larger groups of people, and people behaving in a naïve way, allow for collecting more detections, which in turn facilitates the delivery of alerts. The increased number of FPs from individual sensors is well managed in the fusion.
The fusion system proves to be effective for border surveillance tasks, but its effectiveness depends on the sensors that deliver the input to the fusion. With the presented fusion approach, we achieved a significant reduction in false alarms, which were mostly due to adverse weather conditions and foliage producing false detections in the deployed sensors. However, true detections will only be confirmed if there is sufficient evidence from different sensors to assert the event; thus, the sensors themselves must meet a minimum level of accuracy on their own. It is noteworthy that the proposed fusion approach can work with any number of added sensors. Indeed, in the future we would like to experiment with a larger set of sensors, including seismic sensors, Pan-Tilt-Zoom (PTZ) cameras, and airborne sensors. This paper represents an experimental exploration in which the focus is to cover areas of high interest (illegal crossings at borders) from a border guard perspective; employing sensors that offer broader area coverage is also part of our future studies. Data fused from heterogenous sensors can feed different components in a surveillance system. In this paper, we showed the applicability of a tracking system to fused data as a promising application. Overall, our current results represent an important step for the use-case of through-foliage detection with multi sensor fusion.